"If you can't measure it, you can't improve it." Since the release of CloudBees Flow 5.2, we have supported health monitoring of the CloudBees Flow server(s) using statsd and data-visualization tools such as Grafana. In this article, I will walk you through the steps to set up statsd and Grafana using containers.
The Tools:
Before we explore how to take advantage of the health monitoring feature, let's have a look at the tools we integrate with:
statsd and Graphite
statsd was created by Etsy, based on earlier work at Flickr. It is a very simple (a few hundred lines of code) Node.js daemon that listens on a UDP port, extracts metric data from incoming messages, and periodically flushes the aggregates to Graphite. UDP is a good fit because it is fast and does not error out if nobody is listening; after all, you don't want to slow down your application just to measure it.
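The messages statsd listens for are plain text, one line per metric. Here is a minimal sketch of the wire format (the metric names below are illustrative, not from CloudBees Flow):

```shell
# statsd line protocol: <metric.name>:<value>|<type>
# where type is c (counter), g (gauge), ms (timer), or s (set)
counter="flow.demo.requests:1|c"     # increment a counter by 1
gauge="flow.demo.queue_depth:42|g"   # set a gauge to an absolute value
timer="flow.demo.api_call:320|ms"    # record a 320 ms timing sample
printf '%s\n' "$counter" "$gauge" "$timer"
# To actually send one, pipe it to nc over UDP, e.g.:
# printf '%s' "$counter" | nc -w 1 -u 192.168.99.100 8125
```

Because the transport is UDP, a send costs almost nothing and cannot block the sender, which is exactly why it is safe to leave enabled in production.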
Graphite is both a numeric time-series data storage and a graphical frontend rendering of this data. It includes 3 components:
- Carbon: a daemon (based on Twisted) that listens for time-series data, in our case coming from statsd
- Whisper: a simple database to store these metrics
- Graphite webapp: Django-based frontend to display on-demand charts.
Graphite allows you to create new metrics on demand, by simply sending new data - so there is no need to ask IT to modify the configuration to incorporate new metrics.
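Because metrics exist as soon as data arrives, you can also pull raw datapoints straight from Graphite's render API with no configuration. A sketch, assuming the host IP and Graphite web port from the Docker setup later in this article:

```shell
# Graphite's render API returns datapoints for any metric path on demand;
# format=json yields [{"target": "...", "datapoints": [[value, timestamp], ...]}]
GRAPHITE="http://192.168.99.100:81"               # assumed Graphite webapp address
TARGET="stats.gauges.flow.ec601.api.active"       # example metric path
URL="${GRAPHITE}/render?target=${TARGET}&from=-1h&format=json"
echo "$URL"
# curl -s "$URL"   # fetch the last hour of datapoints
```

This is handy for quick scripting against the same data Grafana charts.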
To help with the setup of those two applications, I've created a Vagrant environment that you can check out here.
Grafana
Grafana is a rich graphical frontend that allows you to display data from multiple sources like Graphite, InfluxDB, Elasticsearch, and more.
I'm not sure why, but I've never been a big fan of Graphite; maybe I was never familiar enough with it to feel comfortable. A few months back, a customer showed me what they had accomplished with Grafana, and I thought it looked very good. Before we added statsd and REST API support, customers had to use our command-line tool, ectool, to extract data for reporting. Now, with the new releases of CloudBees Flow, I wanted to see if I could integrate CloudBees Flow 6.0, statsd, and Grafana to reduce the load on the server and the database.
This time I decided to go the Docker route. I know, I'm adding yet another technology, but as Docker and containers move into the mainstream, I figured it was time for me to get on the bandwagon. With a little bit of research I came across this Docker container and decided to test it out.
Docker
If you want to run Docker on Mac OS X or Windows, you will have to install VirtualBox first, as Docker uses Linux-specific kernel features. The installation creates a small Linux VM to serve as the host for your containers. If you run Linux, you do not need this step, as Docker runs the containers directly on your host.
After installing VirtualBox, I started the container with:
docker run -d -p 80:80 -p 81:81 -p 8125:8125/udp -p 8126:8126 --name grafana kamon/grafana_graphite
By default, the Linux VM (on Mac and Windows) is reachable at IP 192.168.99.100. If you run on Linux, use your host name or IP address instead. As this was my only container, those ports were available; you may need to remap them. Check the Docker documentation for details.
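Before wiring up CloudBees Flow, you can smoke-test the container by hand. The metric name below is a throwaway I made up; the IP is the Docker VM address from above:

```shell
# UDP sends never report failure, so the way to confirm delivery is statsd's
# TCP admin interface on port 8126, which can list every metric it has seen.
STATSD_HOST="192.168.99.100"        # use your own host name/IP on Linux
msg="flow.smoketest:1|c"
echo "$msg"
# echo "$msg" | nc -w 1 -u "$STATSD_HOST" 8125   # send a throwaway counter
# echo "counters" | nc -w 1 "$STATSD_HOST" 8126  # should now list flow.smoketest
```

If the counter shows up on the admin interface, the container's port mappings are working end to end.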
How to integrate CloudBees Flow 6.0, statsd and Grafana:
1. CloudBees Flow Setup
Modify your <DATA_DIR>/conf/wrapper.conf with the following information:
# These are for enabling Commander to send data to a statsd server.
# Only the hostname is required, the other options are included to show
# the default values.
wrapper.java.additional.800=-DCOMMANDER_STATSD_HOST=192.168.99.100
wrapper.java.additional.801=-DCOMMANDER_STATSD_PORT=8125
wrapper.java.additional.802=-DCOMMANDER_STATSD_PREFIX=flow
wrapper.java.additional.803=-DCOMMANDER_STATSD_INCLUDE_HOSTNAME=true
| Setting  | Description |
| -------- | ----------- |
| HOST     | The name or IP of your statsd server. In our case, this is the IP of the VM hosting the containers. |
| PORT     | The statsd UDP port that will receive the data. You may need to use the port on your Docker machine that is forwarded to 8125 in the statsd container. The nice thing about UDP is that if nobody listens you won't get any error, so you can turn it on even before your stats server is ready. |
| PREFIX   | Used to prefix all your data in statsd, so it is easier to locate if your statsd server is getting data from multiple services. |
| HOSTNAME | If you're running in a cluster, this parameter is useful to separate your data by server. It's also useful if you want to monitor your DEV and PROD CloudBees Flow servers separately. |
In order for any modification in this file to take effect, you need to restart your CloudBees Flow server daemon.
2. Grafana Configuration
I then pointed my web browser at the IP of my Docker virtual machine (http://192.168.99.100), logged in to Grafana (admin/admin), and added my data source (be sure to keep the proxy setting).
I created a new dashboard and a new license graph with 2 metrics.
One of the nice things with Grafana is that the system auto-discovers the different data points available, so you don't have to remember all of them. However, a data point must have been generated before you see it automatically.
In the picture above, you'll notice the reference to "flow" that reflects the setting in the wrapper.conf as well as ec601, which is my CloudBees Flow server name.
The only issue I've found with this Docker container is that Graphite fills the disk very quickly, so you may want to update the retention policy or attach bigger external storage.
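For reference, retention is controlled by Carbon's storage-schemas.conf. The path inside the container and the exact stanza below are assumptions; tune the retentions to your own disk budget:

```
# /opt/graphite/conf/storage-schemas.conf (assumed path inside the container)
# keep 10-second points for 6 hours, 1-minute for a week, 10-minute for a year
[stats]
pattern = ^stats\.
retentions = 10s:6h,1m:7d,10m:1y
```

Note that existing .wsp files keep the schema they were created with; new retention settings only apply to metrics created afterwards unless you resize the old files.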
3. Monitoring of Your CloudBees Flow Server
Now that we have all the pieces in place to monitor our server, here are a few recommendations on what to monitor. This is not an exhaustive list, but some key metrics to keep in mind, and data that we also monitor internally.
Note: In the paths below, "flow" is the prefix and ec601 is the name of the server.
Performance
- stats.gauges.flow.ec601.memory.G1_Old_Gen.usage.committed
This is an obvious one to ensure your server does not run out of memory.
- stats.gauges.flow.ec601.cpu.user
Monitors whether you have enough CPU power available to process all the requests.
- stats.gauges.flow.ec601.jobs.runnableSteps
Indicates the number of steps that could be run during the last step scheduler invocation.
- stats.gauges.flow.ec601.api.active
Gives you an idea of the general activity on your server.
Licenses and Users
- summarize(stats.counters.flow.ec601.login.count, '10m', 'sum', false)
Aggregates the number of user logins over a 10-minute period.
- stats.gauges.flow.ec601.licenses.hosts
Checks the number of host licenses you are using. You can also query for managed hosts or steps or applications. I also recommend adding some thresholds to make it clear when you've reached your limit.
- stats.counters.flow.ec601.login.count ; stats.counters.flow.ec601.logouts.count ; diffSeries(#A,#B)
This is to make sure your scripts don't pile up sessions. This one is trickier to do in Grafana: you need to define each query, hide them (by clicking the eye icon), and then create a third query that diffs the #A and #B series:
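If you prefer, the same computation can be written as a single Graphite target, which avoids the hidden-query dance:

```
diffSeries(stats.counters.flow.ec601.login.count, stats.counters.flow.ec601.logouts.count)
```

Either form plots logins minus logouts; a steadily growing difference suggests sessions are being opened and never closed.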
Database
- stats.timers.flow.ec601.timers.Tx.commit.mean
The mean transaction commit time. You're looking for a smooth chart on this one.
4. Dashboarding
This is what my final dashboard looks like:
As you can see, my server is pretty tame :)
Feel free to share your beautiful dashboard, and any additional metrics you monitor, and why.
Happy monitoring!
Bonus
Now that you have a statsd server running with a graphing frontend, why not take advantage of it to send some run-time data?
For example, let's imagine you have a testing procedure that runs a bunch of tests, and you collect 2 job properties, "nbTests" and "nbErrors". You can simply send those 2 data points to statsd with the following code:
# send number of run tests ($nbTests is assumed to hold the job property value)
echo "flow.ec601.Testing.nbTests:$nbTests|c" | nc -w 1 -u statsd 8125
# send number of failed tests
echo "flow.ec601.Testing.nbErrors:$nbErrors|c" | nc -w 1 -u statsd 8125
As you can see, I used the same "domain" for the server stats (but you don't have to). Note the -u and -w 1 flags: they use UDP with a one-second timeout, so a bad connection does not hang your process.
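One design note: statsd accumulates and resets a counter (|c) every flush interval, so sporadic per-run values show up as short spikes. For "number of tests in this run" a gauge (|g) is often a better fit, since Graphite then keeps the last absolute value. A sketch, with hard-coded values standing in for the job properties:

```shell
nbTests=42                                   # stand-ins for the job properties
nbErrors=3
tests_msg="flow.ec601.Testing.nbTests:${nbTests}|g"
errors_msg="flow.ec601.Testing.nbErrors:${nbErrors}|g"
printf '%s\n' "$tests_msg" "$errors_msg"
# printf '%s\n' "$tests_msg" "$errors_msg" | nc -w 1 -u statsd 8125
```

Which type to use depends on whether you want rates (counters) or levels (gauges) on the dashboard.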
The result is something like:
Additional Resources
Here are some additional resources you may find useful: