Everyone wants a scalable system because it allows us to meet the demands the business puts in front of us. For many of us, the problem is not whether we want to scale (most of us do), but how we scale. In many cases, scaling is a question of adding more servers and configuring them to be part of our cluster. Even though that approach sounds reasonable, it exposes a couple of problems. More servers are often not needed, and, when they are needed, configuring them is a costly operation (every hour people spend working on something is a cost). The result? The business tends to think of the IT department as a liability: it requests more money for more servers, and more people to maintain them. Remember, all successful companies want to scale their business, and, in today’s age, IT is a crucial part of that effort. An increase in business means an increase in demand, and an increase in demand means an increase in infrastructure. That holds true unless we manage our infrastructure in a better and more efficient way.
At this point, you might be asking why I am talking about scaling on the CloudBees blog. The answer is simple. With a proper setup, open source Jenkins and, especially, the Enterprise Edition plugins have greatly improved the way we run Jenkins at scale. There are many aspects we should consider when running Jenkins at scale, but, today, I’ll focus on agents. To be more concrete, we’ll explore how to accomplish their high availability without going bankrupt in the process.
What do we want from Jenkins? Many things, but one of utmost importance is to act as a framework for continuous integration (CI) and continuous delivery or deployment (CD). I will assume that you are already using Jenkins and that you know what CI/CD is, so I’ll skip to a discussion about availability. It is not enough to set up deployment pipelines and run them on every commit to the code repository, only to find out that they are queued waiting for the next available executor. If they are queued, pipeline results are delayed, problems are detected later, fixing them is more expensive, and so on. With CI/CD, we want pipeline feedback as soon as possible. How do we reduce waiting time in queues? By adding more servers? In many cases, that’s the wrong answer. More servers mean an increase in costs, and a rise in expenses means less revenue for the business. Ask yourself: are your builds queued all the time? Are you running your deployment pipelines 24/7? If you are, you should, indeed, request more servers. However, it is more likely that you have high demand only at peak hours. Our business is not a factory with three shifts. We tend to work from 9h to 18h and, hopefully, spend the rest of our time with our families, doing some activities, watching movies, and so on. Even during working hours, the demand tends to vary. The first few hours tend to put lower demand on our CI/CD infrastructure. After all, we need a bit of time to write some code before committing it to the repository. In such a setting, our CI/CD infrastructure’s demand for agents might be as follows.
From 9h to 12h - medium demand
From 12h to 18h - high demand
The rest of the hours - low demand (probably some scheduled jobs)
If we want to run our business efficiently, our infrastructure should be able to handle any demand. That is as true for internal users as it is for external ones. How do we handle any demand? By designing scalable systems. What is scalability? It is a property of a system that indicates its ability to handle increased load gracefully, or its potential to be enlarged as demand increases. It is the capacity to accept increased volume or traffic. The truth is that the way we design our applications dictates the scaling options available to us. Applications will not scale well if they are not designed to scale. That is not to say that an application not designed for scaling cannot scale. Everything can scale, but not everything can scale well. That applies not only to our public-facing applications but also to those created for internal use, like Jenkins.
The solutions for scaling Jenkins agents tend to go to one of two extremes. One is an entirely on-premise, dedicated infrastructure. That is expensive since, in such a case, the infrastructure needs to be as big as the load at the highest peak, or we risk long queue times. The result is wasted resources during medium and low demand. Wasted resources equal wasted money. The other extreme is an infrastructure set entirely in a cloud. Again, too expensive. The cost per computing unit in a cloud is much higher than on-premises (unless we use a private cloud).
If both options (fully on-premise and fully cloud) are expensive, what is the solution? The answer lies in a combination of the two. We should combine on-premise dedicated agents with the elasticity that can be obtained from cloud providers (or built yourself in your own datacenter). Calculate the minimum demand and create agents for it on your servers. Aim for one hundred percent resource utilization. Bear in mind that I said aim, since hardware can never be fully used all the time. Once you have agents set up on your servers, and you’re confident that they are (almost) fully utilized and meet the minimum demand, move all the rest to the cloud. Make sure your cloud agents are configured to release resources when not in use. Set them up so that new VMs are created when needed and destroyed when they are no longer in use.
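To make that a bit more concrete, here is a minimal sketch of how the hybrid setup can look from a pipeline’s point of view, assuming that both the static on-premise agents and a cloud agent template (for example, one defined through a cloud plugin such as Amazon EC2) advertise the same label, and that the template’s idle timeout terminates VMs that are no longer in use. The label expression linux && docker and the Maven build step are only illustrative placeholders.

pipeline {
    // Any agent with this label may run the build. The static on-premise
    // agents carry the label permanently; the cloud template is configured
    // with the same label, so additional VMs are provisioned only when the
    // static agents are busy and builds start to queue.
    agent { label 'linux && docker' }

    stages {
        stage('Build and test') {
            steps {
                // Placeholder build step; replace with whatever your project needs.
                sh 'mvn -B clean verify'
            }
        }
    }
}

The important part is that the pipeline itself does not care where its agent comes from. The label, together with the cloud template’s instance cap and idle termination settings, decides whether a build lands on a dedicated server or on a short-lived cloud VM.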
If you manage to combine on-premise dedicated agents with elastic cloud nodes, the result will be high availability and very short queue times (just long enough to instantiate a new VM in the cloud). Your teams will be happy since deployment pipelines will run almost immediately after the code is committed. Your business will be happy because you’ll reduce the cost of the infrastructure. You’ll be using cheaper on-premise computing units to their fullest and paying your cloud provider only for the capacity used during peak hours. Everyone’s a winner.
We’ll continue this discussion in the next article. I’ll walk you through the setup we discussed. Stay tuned.