At the end of my Jenkins World 2016 talk - “So you want to build the world’s largest Jenkins cluster” - I gave a brief demonstration of a Jenkins cluster running 100,000 concurrent builds, to show people just how far Jenkins clusters can scale.
My talk did not have anywhere near the budget of Sacha’s keynote… where they were able to fire up a PSE cluster of over 2000 controllers with 8000+ concurrent builds. The idle reader might be wondering how exactly I was able to achieve 100,000 concurrent builds and exactly what tricks I was playing to get there.
OK, let’s get this over with… I did cheat a little, but only in the places where it was safe to do so.
If you want to have a Jenkins cluster with 100,000 concurrent builds, you need to ask yourself “what is it exactly that we want to show?”
I can think of two answers to that question:
We are really good at burning money;
The Jenkins controllers can handle that level of workload.
Given my constrained budget, I can only really try to demonstrate the second of those:
Can a Jenkins cluster handle the workload of 100,000 concurrent builds?
Most of the work that a Jenkins controller has to do when a build is running on the agent can be broken down as follows:
Streaming the console log from the agent over the remoting channel and writing that log to disk
Copying any archived artifacts over the remoting channel onto the controller’s disk when the build is completed
Fingerprinting files on the remote agent
Copying any test reports over the remoting channel onto the controller’s disk when the build is completed
A well-integrated Jenkins cluster might also include:
Copying artifacts from upstream jobs into the build agent’s workspace (potentially from the disk of a different controller in the cluster)
Triggering any downstream jobs (potentially on a different controller in the cluster)
The rest of the workload of the build is actually compiling and running tests, etc. These all take place on the build agent and do not have any effect on the controller.
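To make the first of those controller-side tasks a little more concrete, here is a rough sketch - my illustration, not Jenkins’ actual implementation - of what the controller ends up doing for every running build: pulling console output off the remoting channel and appending it to the build log on disk.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Illustrative sketch only: the controller-side cost of console log streaming.
 * Every line the agent prints travels over the remoting channel and is
 * appended to the build's log file on the controller's disk.
 */
public class ConsoleLogPump {
    static void pump(InputStream fromAgentChannel, Path buildLogFile) throws Exception {
        try (OutputStream log = Files.newOutputStream(buildLogFile,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = fromAgentChannel.read(buffer)) != -1) {
                log.write(buffer, 0, read);
                // Frequent flushes (one per line in the worst case) are what make
                // chatty console logs expensive for the controller's I/O subsystem.
                log.flush();
            }
        }
    }
}
```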
So as long as:
the agent streams back a console log (at more than 60 lines per minute - based on my survey of typical builds)… potentially with I/O flushes for every line output
there are new files (with random content to defeat remoting stream compression) in the agent workspace to be archived and fingerprinted
there are new test results, with different content for each build, written to the agent workspace
Then we don’t actually have to do a real build.
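In other words, all a “build” really has to do is produce the right outputs. Here is a minimal sketch in plain Java - my own illustration, not the plugin’s actual code - of a fake build that generates console output at a realistic rate, a random-content artifact, and a fresh test report; the file names and sizes are just placeholders.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

/**
 * Minimal sketch of what a fake build must produce to exercise the controller
 * without doing any real work (not the Mock Load Builder plugin's source).
 */
public class MockBuild {
    public static void main(String[] args) throws Exception {
        Path workspace = Path.of(args.length > 0 ? args[0] : "workspace");
        Files.createDirectories(workspace);
        Random random = new Random();

        // 1. Console output: well over 60 lines/minute, flushed line by line,
        //    so the controller has to stream and persist it continuously.
        for (int i = 0; i < 120; i++) {
            System.out.println("mock build step " + i + " ... ok");
            System.out.flush();
            Thread.sleep(500); // roughly 120 lines per minute
        }

        // 2. An artifact with random content, so remoting's stream compression
        //    cannot shrink the transfer to the controller.
        byte[] artifact = new byte[1024 * 1024];
        random.nextBytes(artifact);
        Files.write(workspace.resolve("artifact.bin"), artifact);

        // 3. A test report that differs on every build, so test-result parsing
        //    on the controller has real work to do.
        String junitXml = "<testsuite tests=\"1\"><testcase classname=\"mock\" name=\"t"
                + random.nextInt(1_000_000) + "\"/></testsuite>";
        Files.writeString(workspace.resolve("TEST-mock.xml"), junitXml);
    }
}
```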
So in April 2014 I created the Mock Load Builder plugin. This plugin allows you to define a build step that will appear to the Jenkins controller just like a regular build… but without placing anywhere near the same CPU load on the build agent.
However, when you are aiming for 100,000 concurrent builds, even the Mock Load Builder plugin is not enough, as each build will fork a JVM to perform the “mock” build. Now, OK, we don’t need lots of memory in that JVM, but it’s still at least 128MB… and that will add up to quite a lot of RAM when we have 100,000 of them running at the same time.
So I added another layer of mocking to the Mock Load plugin: fakeMockLoad. With this system property set, the mock load is generated directly in the agent JVM instead of in a JVM forked from the agent JVM.
We are still generating all the same console logs, build artifacts, test reports, etc. Only now we are not paying the cost of forking another JVM. Phew, that was roughly 13TB of RAM saved (100,000 builds × 128MB ≈ 12.8TB).
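To illustrate the idea - this is a sketch of the concept, not the plugin’s source - the same generator can either run in a forked JVM, as a real build tool would, or directly inside the agent JVM when the fakeMockLoad system property is set. It reuses the hypothetical MockBuild sketch from above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch of the fakeMockLoad idea (not the plugin's actual code).
 */
public class MockLoadRunner {
    static void run(List<String> generatorArgs) throws Exception {
        if (Boolean.getBoolean("fakeMockLoad")) {
            // In-process: no extra JVM, no extra ~128MB heap per running build.
            MockBuild.main(generatorArgs.toArray(new String[0]));
        } else {
            // Default: fork a separate JVM per build, like a real build tool would.
            List<String> command = new ArrayList<>(
                    Arrays.asList("java", "-Xmx128m", "MockBuild"));
            command.addAll(generatorArgs);
            new ProcessBuilder(command).inheritIO().start().waitFor();
        }
    }
}
```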
But hang on a second… each build agent is going to use at least 512MB of RAM… that’s over 50TB of RAM… or 25 x1.32xlarge AWS instances… almost $350/hr for On Demand instances just for the agents… plus these are not exactly doing real work… we won’t have much to show other than a headline number.
Well, as part of my load testing for the JNLP4 protocol I wrote a test client that can set up at least 4,000 JNLP connections from the same JVM. Maybe we could use a modified version of that to multi-tenant the JNLP build agents on the same JVM… The workload on the controller is a function of how many remoting channels there are and how much data is being sent over those channels…
It turns out that with a special multi-tenant remoting.jar I can run nearly 10,000 build agents using fakeMockLoad per c4.8xlarge. At $1.675/hr that is a much more reasonable $16/hr… plus, even better, we have fewer machines to set up.
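The multi-tenant trick boils down to holding many JNLP connections open from a single JVM. The sketch below shows the shape of it under that assumption; connectAgent() is a hypothetical placeholder for what the modified remoting.jar and the JNLP4 test client actually do.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Sketch of the multi-tenant idea: one JVM hosting thousands of agent
 * connections instead of one JVM per agent. connectAgent() is hypothetical.
 */
public class MultiTenantAgents {
    public static void main(String[] args) {
        int agentCount = 10_000; // roughly what fitted on one c4.8xlarge
        ExecutorService pool = Executors.newFixedThreadPool(agentCount);
        for (int i = 0; i < agentCount; i++) {
            String agentName = "mock-agent-" + i;
            // Each task opens one JNLP connection back to its controller and
            // keeps it alive; all of them share this JVM's heap and threads.
            pool.submit(() -> connectAgent(agentName));
        }
    }

    /** Hypothetical placeholder for "open a JNLP remoting channel and serve builds". */
    static void connectAgent(String agentName) {
        // ... establish the JNLP4 connection, run fakeMockLoad builds, etc.
    }
}
```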
Everything else in my cluster is real: 500 real controllers (running in Docker containers divided between an x1.32xlarge and a pair of c4.8xlarge) and a CloudBees Jenkins Operations Center (running naked on a dedicated c4.xlarge).
I was somewhat constrained by disk space from packing all those controllers into such a small space. If I had divided the controllers across a larger number of physical machines, rather than trying to cram 400 controllers onto the same x1.32xlarge, I could probably have had the cluster run for more than 90 minutes.
There is a video I remembered to capture while spinning up the cluster just before my talk. Two of the build agent machines were running out of disk space at the time, which is why the controllers I checked were running only about 160 concurrent builds each.
TL;DR: For all of 90 minutes, I had a Jenkins cluster of 500 controllers, each with 200 build agents, for a combined total of 100,000 concurrent builds. Yes, there were issues keeping that cluster running within the budget I had available. Yes, there are challenges maintaining a system with that number of concurrent builds. Yes, I did make some cheats to get there. But Jenkins controllers and Jenkins clusters can handle that workload, provided you have the hardware to actually support the workload in the first place!