Using Anomaly Detection to Kill a CloudBees Rollout Experiment

Written by: Michael Neale

Editor's note: This post first appeared on the Rollout.io blog.

One of the cleanest and nicest features of the CloudBees Rollout app is that you get a big powerful kill switch that you can use when you are running an “experiment” or rolling out a new feature:

Should things go wrong, you can quickly restore the feature flags to their default state. If you are using CloudBees Rollout experiments to deploy changes (perhaps a new feature, a new look or even some new API or backend), this can really increase confidence and allow you to move much faster, since you know you can back out at any time.

In this post, I want to show you how you can connect this feature to log and error collection services so that once they spot a problem (an anomaly) your rollout can be halted automatically. Services like these are especially useful if you are running an experiment or rolling out a change. It is during this process that you really want to know if something has changed (for the worse) and then react to it (kill the experiment, deploy a new version, etc.).

Anomalies and detecting them

First, what are anomalies? Visually, they are very easy to spot. You can think of them as outliers - things that sit far outside of the norm:

We have a good visual intuition for spotting an outlier. Sometimes it is wrong, however. For example, if we were to graph errors in a mobile app over time, the first time we hit a weekend we may notice a spike in error rates. This may appear unusual or alarming, but if we notice the same thing happening again and again on weekends, it is probably okay. Data like this is sometimes known as seasonal - i.e. some change is normal, and you have to learn over time what is normal. There is quite an art to this, and often machine learning is involved; you can read a bit more of the back story here.
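To make the idea of seasonality concrete, here is a minimal sketch (not what any of the tools below actually do) that compares each sample only against other samples from the same "season". The function name, the season labels and the threshold of 3 standard deviations are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def seasonal_anomalies(samples, threshold=3.0):
    """Flag samples that sit far from the norm for their own season.

    samples: list of (season_key, value) pairs, e.g. ("weekend", 50).
    Returns a list of (index, season_key, value) for the flagged samples.
    """
    # Group values by season so weekends are only compared with weekends.
    by_season = defaultdict(list)
    for index, (key, value) in enumerate(samples):
        by_season[key].append((index, value))

    flagged = []
    for index, (key, value) in enumerate(samples):
        # Leave the sample itself out of its own baseline.
        others = [v for i, v in by_season[key] if i != index]
        if len(others) < 2:
            continue  # not enough history to know what is normal yet
        mu = mean(others)
        sigma = stdev(others)
        deviation = abs(value - mu)
        if sigma == 0:
            is_outlier = deviation > 0
        else:
            is_outlier = deviation / sigma > threshold
        if is_outlier:
            flagged.append((index, key, value))
    return flagged
```

With a baseline like this, an error count of 48 would be flagged on a weekday where the norm is around 10, but a similar count on a weekend where the norm is around 50 would pass. Real services learn these baselines over much longer windows and with far more robust models.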

Thankfully, there are many tools out there that build this functionality in. In fact, Amazon makes it available with no upfront cost (you pay per use) if you are using CloudWatch: in that case, it uses at least a 15-minute window to learn patterns in the metrics. The powerful Kinesis tool has some built-in machine learning functions (the RANDOM CUT FOREST algorithm has some great documentation that can help give an understanding of anomaly detection and unsupervised learning) that can look at multiple metrics at once and learn what is anomalous. You can then use this to fire alerts - in fact, Amazon uses this itself for network monitoring. The popular New Relic tool also has built-in support for outlier detection.

There are so many tools out there now, many of which you may be using already. Things that were once the realm of science are now readily available and can all be used to fire off alerts or take action. In this post, I want to show a few tools that work quite neatly with CloudBees Rollout experiments, and how they can integrate with the CloudBees Rollout API.

Webhooks

One thing that is common across pretty much all monitoring, alerting and analytics tools is webhooks. For context, webhooks are HTTPS endpoints that you can set up so a given service can fire an event when a condition is met (such as when an anomaly is detected). Webhooks work best across the internet and are especially nice between SaaS services that are hosted on the public internet (there are techniques to get these behind a firewall - I have written about that before if you are interested). There is even a neat logo:

BugSnag and Logz.io/Loggly

BugSnag is a popular service that handles error tracking: you wire it in wherever you deploy code to catch unhandled errors. It then analyses these errors, rolls them up and reports on them. This is especially useful for the “edge” of computing where you may not get access to logs (mobile apps, browsers and more), but it is widely deployed across most technologies.

Logz.io and Loggly (https://www.loggly.com/) offer logging as a service: anything that generates logs can be configured to ship them to services like these, which can analyze the logs, extract metrics, report, and alert.

One thing these tools have in common is that they collect a LOT of data. Applications that are in use will generate a lot of logs, and probably a lot of errors! Even in normal operation, unhandled scenarios typically happen, and that is OK. What is not normal is when the logging or errors suddenly change in some non-seasonal way. This is really useful, as that data accumulation allows these tools to learn seasonality and to spot real anomalies that are worth escalating (either to a human or an API).

Logz.io can spot anomalies by making use of the “formerly known as” ELK stack. It can also be set up with a webhook (a custom endpoint, as they call it) to call CloudBees Rollout when a problem occurs. Loggly can chart anomalies for you and, like Logz.io and BugSnag, can generate alerts from these anomalies (which can be sent to webhook endpoints via HTTPS).

BugSnag has an interesting feature where it can identify that there is a “spike” in errors (i.e. it really stands out amongst the background noise of familiar errors). This is made available via the webhook feature as a field in the JSON that is posted to the webhook:

In this case, we would be most interested in events where trigger.type is “projectSpiking”, as that would correlate nicely with a feature being rolled out (or an experiment being run). We aren’t so interested in exceptions (although firstException could be interesting, if the error is truly new after a feature is turned on for the first time).
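As a rough sketch of that filtering step, the check below looks only at the trigger.type field described above. The function name and the default set of trigger types are assumptions for illustration; the rest of the payload is ignored.

```python
def should_kill_experiment(payload, kill_triggers=("projectSpiking",)):
    """Return True if a parsed webhook payload should halt the rollout.

    payload: the JSON body of the webhook, already parsed into a dict.
    kill_triggers: the trigger.type values we treat as a reason to kill.
    """
    if not isinstance(payload, dict):
        return False
    trigger = payload.get("trigger")
    if not isinstance(trigger, dict):
        return False
    return trigger.get("type") in kill_triggers
```

To also react to brand-new errors, you could pass kill_triggers=("projectSpiking", "firstException").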

Now we need to get these webhooks to kill the experiment in CloudBees Rollout…

CloudBees Rollout API and connecting with Webhooks

Like all systems, pretty much anything you do in CloudBees Rollout can be automated with a rich API, including the Kill button illustrated above. To kill an experiment, it uses a fairly simple PATCH HTTP API call, something like this:

The API is richly documented here by the way. The experiment API is the one I am showing. If you have the web console open when using the API, you may note that it updates in real-time with the changes - very neat.
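For a feel of what such a call could look like from Python, here is a hedged sketch that only builds the request. Everything specific in it is an assumption rather than the documented API: the host is a placeholder, and the URL layout, the bearer-token header and the JSON body are illustrative guesses. Check the API documentation for the real values.

```python
import json
import urllib.request
from urllib.parse import quote

# ASSUMPTION: placeholder host, not the real CloudBees Rollout API host.
API_BASE = "https://api.example.invalid"


def build_kill_request(api_key, app_id, environment_name, flag_name,
                       api_base=API_BASE):
    """Build (but do not send) a PATCH request that disables an experiment."""
    # ASSUMPTION: hypothetical URL layout; take the real path from the docs.
    url = "{}/applications/{}/{}/experiments/{}".format(
        api_base,
        quote(app_id, safe=""),
        quote(environment_name, safe=""),
        quote(flag_name, safe=""),
    )
    # ASSUMPTION: illustrative patch body; the real schema may differ.
    body = json.dumps(
        [{"op": "replace", "path": "/enabled", "value": False}]
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            # ASSUMPTION: bearer-token auth; confirm the scheme in the docs.
            "Authorization": "Bearer {}".format(api_key),
            "Content-Type": "application/json",
        },
    )
```

Once the placeholders are replaced with real values, the request could be sent with urllib.request.urlopen(request).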

This is great: we can hook it in with various systems, and we just need the API key, the app ID and a few other things. The challenge is that with all these systems and tools out there, there may not be a ready-made or flexible integration that allows them to call this API directly.

Given tools like BugSnag and Logz.io mentioned above can already do webhooks, we need a bit of glue to connect them (and others) with the PATCH experiment API of CloudBees Rollout.

Thankfully, this is very easy - and even easier to host. To help with this, I made a small project you can deploy as a Google Cloud Function. Cloud Functions are similar to AWS Lambda functions: they are a serverless pattern for hosting small pieces of functionality that are only invoked as needed (i.e. they don’t use resources or cost anything when they aren’t running). You will need a Google Cloud account and the gcloud command-line tool installed:

https://github.com/michaelneale/rollout.io-anomaly-detection

To use this - you clone it and then run:

gcloud functions deploy rollout_webhook --trigger-http --runtime "python37"

This will give you an endpoint you can use for webhooks. Copy the URL it generates for you and add parameters so it is of the form:

https://x.cloudfunctions.net/rollout_webhook?secret=..&app_id=..&environment_name=..&flag_name=..

To get the secret and app ID, follow these instructions to get them from the App Settings/Integrations tab on your CloudBees Rollout console. The environment and flag names come from the experiment that you are running. Once you have this fully formed URL with your parameters, you can use it as the webhook or custom endpoint for applications such as Logz.io, BugSnag or CloudWatch.
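If any of your values contain spaces or other special characters, it is safer to let a library do the URL encoding. A small sketch follows; the helper name is made up for illustration, while the four query parameter names are the ones shown in the URL form above.

```python
from urllib.parse import urlencode


def build_webhook_url(function_url, secret, app_id, environment_name,
                      flag_name):
    """Append the four expected query parameters to the function URL."""
    query = urlencode({
        "secret": secret,
        "app_id": app_id,
        "environment_name": environment_name,
        "flag_name": flag_name,
    })
    return "{}?{}".format(function_url, query)


if __name__ == "__main__":
    # Example values only; substitute your own endpoint and settings.
    print(build_webhook_url(
        "https://x.cloudfunctions.net/rollout_webhook",
        "s3cret", "app1", "Production", "new checkout",
    ))
```

Keep in mind that the secret travels in the URL, so treat the full webhook URL itself as a credential.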

If you need to, you can also customize the logic to look at the JSON payload (almost all webhooks use JSON payloads now) using the kill_experiment function here (don’t forget to redeploy it to the Google Cloud Function). This could be useful for looking at the payload of the BugSnag webhook, for example.

Hopefully you find this useful. Keep an eye out for ready-made integrations with the growing set of tools and services that can monitor your systems - this is just the beginning!
