Over the past couple of weeks, we have been working to migrate our build servers to a fully AWS-hosted solution. Our build system has become pretty sophisticated over the last few years, so I thought I would talk about it a little bit, on the off chance that some of you might find it interesting, or get ideas for your own build chain out of it.
Our build system is based on a home-grown infrastructure we call “CI2”. CI2 is a set of tools that we have written over the years that coordinate product builds and – in short – handle Continuous Integration for our products. Essentially, that means whenever someone commits a change to a project to Git, the system goes out and creates a new build of the software and eventually tests it to make sure the new change integrates ok and does not break anything.
Until a couple of weeks ago, the bulk of the system had been running on a custom Xserve server – a rack-mountable Mac server machine that Apple stopped making a few years back – that we owned, and co-hosted with a local hosting company here in Berlin. We’ve long been wanting to migrate this to a fully cloud-based hosting solution on Amazon Web Services (which we use for everything else), and when one of the disks in the server started flaking out a couple of weeks ago, we decided that now was as good a time as any to tackle that.
With how CI2 is structured (I will talk more about that shortly), the migration was pretty smooth, in general. Most of the holdup was that we needed some more infrastructural changes. Because in the past all the builds happened on the same physical machine, we had quite a bit of logic in place that assumed the individual build VMs could access shared drives, and the like. Also, our local VMs had over the years gotten bloated, as we installed new tools, new versions of Delphi and Visual Studio required by the builds, and so on. We wanted to start with clean virtual machines in EC2, and that meant doing some well-deserved clean-up and simplifications to our build scripts – which actually makes the builds better, as well.
As of last Friday, we’re done, and all our products build cleanly and successfully on the new infrastructure. As an added benefit, builds have gotten a lot faster. It used to be that we waited a good 40 minutes for a full Elements build to finish – and that was without generating all the different set SKUs we build for releases we ship. On the new system, a complete build, with the large Shell-containing setup and everything, finishes in under 25 minutes.
Faster builds mean a more productive team, since with complex products like ours, one often needs to wait for a new build from CI before a change can be fully tested.
How CI2 Works
But let’s take a quick look at how CI2 works.
As I mentioned before, CI2 is not a monolithic system, but a set of tools that work together, locally and across the network, to perform all the various tasks. It’s a very flexible and expandable system that has grown and gotten very sophisticated over the past 10 or so years (the first commit in our CI2 Git repository is from 2009, and mind you, the 2 stands for “version 2” ;)).
It all starts with CIStarter.exe. CIStarter is a very minimal tool that gets the whole CI system up and running.
CIStarter does only a couple of things:
- Based on information read from either a local config file or (more likely, in production use) from EC2 instance data configured in AWS, it connects to an S3 bucket and downloads the actual CI2 system to the local server.
- Optionally, it runs a tool called CIPrerequisiteInstaller to, well, install prerequisites needed for building our products. More on that below.
- Finally, it passes things off to run one of the other CI tools or servers to do the actual work. These will usually be
CIStarter has very little logic of its own, and the idea is that (a) only the exe (and optionally a config file) need to be deployed to a server, and more importantly that (b) the CIStarter itself will never really change, even as the core CI2 system gets upgraded.
This means CIStarter can be “burned” into a base image for a virtual machine (be it on EC2 like we use now, or any other virtualization system), and new instances can be booted up as needed. The core CI2 system – which does get regular updates and fixes, of course – can simply be updated on S3, and new instances will always use the latest system.
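To make the bootstrap sequence a bit more concrete, here is a rough Python sketch of the three steps. It is an illustration only, not CIStarter's actual code: the setting names (`bucket`, `role`, `install_prereqs`), the rule that a local config file wins over instance data, and the `download` and `run` callables standing in for the S3 download and the process launch are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional


def resolve_settings(local_config: Optional[Dict[str, str]],
                     instance_data: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Return the settings to use; a local config file wins if one exists (assumed)."""
    if local_config is not None:
        return local_config
    if instance_data is not None:
        return instance_data
    raise ValueError("no local config file and no EC2 instance data available")


def bootstrap(settings: Dict[str, str],
              download: Callable[[str], None],
              run: Callable[[str], None]) -> List[str]:
    """Perform the three bootstrap steps and return a log of what was done."""
    log: List[str] = []
    # Step 1: fetch the current CI2 system from the configured bucket.
    download(settings["bucket"])
    log.append("downloaded CI2 from " + settings["bucket"])
    # Step 2 (optional): install or update prerequisites.
    if settings.get("install_prereqs", "no") == "yes":
        run("CIPrerequisiteInstaller")
        log.append("ran CIPrerequisiteInstaller")
    # Step 3: hand off to the tool that does the actual work.
    run(settings["role"])
    log.append("handed off to " + settings["role"])
    return log


# Usage with stand-in callables that only record what they were asked to do.
calls: List[str] = []
demo_settings = resolve_settings(None, {"bucket": "example-ci2-system",
                                        "role": "CIBuildServer",
                                        "install_prereqs": "yes"})
demo_log = bootstrap(demo_settings,
                     download=lambda bucket: calls.append("download:" + bucket),
                     run=lambda tool: calls.append("run:" + tool))
```

Because the starter only knows how to fetch and hand off, nothing in it needs to change when the downloaded system does, which is what makes it safe to bake into a machine image.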
It’s worth noting that CI2 can run on Windows, Linux and Mac OS X – and we have it, in fact, in use on all three platforms, as you’ll see below.
The first thing CIStarter does after downloading the system from S3 is run CIPrerequisiteInstaller.exe, a small tool that installs and manages all the various packages of tools and files that might be needed to build our products. Like the system itself, these are obtained from S3, so publishing a new prereq is as simple as uploading it to the right bucket, and any new CI instances will pick it up on their next run.
Of course CIPrerequisiteInstaller keeps track of which prereqs are installed, and only downloads and installs what’s missing. Prereqs are also incrementally versioned, so newer versions of a prereq can be uploaded, and get installed as needed.
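The "only install what's missing or newer" logic could look something like the following Python sketch. The data shapes are assumptions for illustration: each prereq is modeled as a name mapped to an integer version starting at 1, and the package names in the usage lines are made up.

```python
from typing import Dict, List, Tuple


def plan_installs(available: Dict[str, int],
                  installed: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return the (name, version) pairs that are missing or outdated locally."""
    plan: List[Tuple[str, int]] = []
    for name, version in sorted(available.items()):
        # A prereq that was never installed counts as version 0.
        if installed.get(name, 0) < version:
            plan.append((name, version))
    return plan


# Hypothetical state: what the bucket offers vs. what this instance already has.
available = {"delphi-7": 2, "elements": 5, "third-party-libs": 1}
installed = {"delphi-7": 2, "elements": 4}
todo = plan_installs(available, installed)
```

On a freshly booted clean instance `installed` would be empty and everything gets installed; on a pre-provisioned one the plan is usually short or empty.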
Prereqs we have set up for our system include things such as Elements (we need Elements to build Elements ;), different Delphi versions, packs of third party libraries, and other tools.
One of the cool things is how we handle Delphi. As you might be aware, Delphi installers are slow and bloated, so installing Delphi is not really a good option. We need all versions of Delphi from 7 through the latest XE8 on our system, and if we were to run the installer for each, getting a new instance all prerequisited would, probably quite literally, take half a day (not to mention hundreds of gigabytes of disk space). Instead, our Delphi team figured out a way to package Delphi so that it can be xcopy-deployed with just the files we need to build with it. When a new version of Delphi (or an updated version of an existing version) comes out, we can get all the build machines updated to have it in a jiffy.
Prereq packages themselves can be made up from an .exe installer that runs quietly, or can be a .zip with a Train script that does installation or deployment. CIPrerequisiteInstaller manages unique install locations that individual prerequisites can use, assuming they don’t have a well-defined place they get installed to (for example, our Elements prereq simply runs the Elements installer, while our Delphi packages drop their files where CIPrerequisiteInstaller tells them to).
The idea is that CIPrerequisiteInstaller can both take a cleanly set up and freshly booted build machine instance and get it set up with all the prereqs we need, and it can incrementally update an existing live instance with just the new prereqs that we added.
In our case, we have both a clean AMI that we can boot up to start from scratch (which is little more than a fresh Windows install with CIStarter), but also one with the current set of prereqs already installed. Of course, instances booted up from the latter are quicker to start and ready to run a build.
Once everything is set up, CIStarter passes execution on to one of (currently) three different server tools. Which one is determined by the config file or the EC2 Instance Data.
CIMainServer is the brains of the whole operation. It’s a very lightweight server that does not need its own dedicated machine (ours runs on a small Linux EC2 instance that also hosts our email, some of our websites, and other infrastructure, for example).
What CIMainServer does is keep in touch with all the actual build servers via a RemObjects SDK super channel connection, and monitor the known Git repositories (some of which we host ourselves, and others we have on GitHub) for changes. As new commits are detected, it will fire off build requests to a suitable build server (i.e. one that’s idle and has the right platform and setup to build the project in question).
Currently, we only distinguish between Windows and Mac build servers, and any server can take a build request for any product on its platform. In theory, one could set up dedicated build servers per product (for example, we could have a server running that builds only Elements and not Data Abstract, if we wanted).
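Picking a "suitable" build server boils down to a simple filter. The Python sketch below is a simplified illustration under assumed names (`BuildServer`, `pick_server`, and the server names are hypothetical); it only models the two criteria mentioned above, platform and idleness.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BuildServer:
    name: str
    platform: str      # "windows" or "mac"
    busy: bool = False


def pick_server(servers: List[BuildServer], platform: str) -> Optional[BuildServer]:
    """Return the first idle server for the platform, or None if all are busy."""
    for server in servers:
        if server.platform == platform and not server.busy:
            return server
    return None


# Hypothetical fleet matching the setup described here: two Windows, one Mac.
fleet = [
    BuildServer("win-1", "windows", busy=True),
    BuildServer("win-2", "windows"),
    BuildServer("mac-1", "mac", busy=True),
]
```

When `pick_server` returns `None`, the request would simply have to wait until a matching server reports back as idle.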
In the long run, CIMainServer will also be able to boot up and shut down EC2 instances as needed, depending on demand. For example, we could have a skeleton crew of one server running over the weekend (if that), but come Monday morning when things get busy and people commit changes and want their builds quickly, it could boot up a few more instances. Currently, we don’t do that, and just run two “t2.large” instances for Windows and one Mac build machine.
CIMainServer uses a small database to keep track of known repositories, active and past builds, as well as their success and testing results. That database could be hosted in RDS, but we find the extra cost isn’t justified since we don’t need that level of fail-over and recovery support, so we just run a local PostgreSQL DB on the same Linux machine that hosts CIMainServer itself.
CIMainServer is also CI2’s connection to the outside world.
Of course it starts builds automatically as it sees changes in Git, but it also lets us start builds manually (for example if we want a build with a non-default set of options passed to the build scripts, or when the system is locked from doing automatic builds so we can start builds in a controlled order, which we usually do for RTM and public beta builds).
For that, CIMainServer interacts with Slack to listen to commands in the chat channels where our teams hang out, and to also report back statuses. For example, I can simply say “start build elements lockdown” to shoot off a new build of “Elements” from its “lockdown” branch. When a build finishes, a message gets posted to the right channel, along with the URL where the final binaries can be downloaded, or where the log file can be found, if there was a problem.
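Parsing a chat command like the one quoted above is straightforward. This Python sketch assumes a fixed four-word grammar, `start build <product> <branch>`, taken from that single example; the real command set is certainly richer, and `BuildCommand` and `parse_command` are names invented for the illustration.

```python
from typing import NamedTuple, Optional


class BuildCommand(NamedTuple):
    product: str
    branch: str


def parse_command(message: str) -> Optional[BuildCommand]:
    """Parse 'start build <product> <branch>'; return None for anything else."""
    words = message.strip().split()
    if len(words) != 4:
        return None
    if [word.lower() for word in words[:2]] != ["start", "build"]:
        return None
    return BuildCommand(product=words[2], branch=words[3])


command = parse_command("start build elements lockdown")
```

Returning `None` for everything that is not a command matters here, since the bot sits in channels where people mostly just talk to each other.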
CIMainServer also has an RO/DA based API, and we have small apps for Mac and iOS that can be used to control builds, get a more detailed status overview, and receive push notifications for critical failures.
If CIMainServer is the brains of the operation, then CIBuildServer is the muscle.
On start, CIBuildServer just connects to the main server, and then waits for instructions. When it receives a build request, it goes out to Git to clone the repository and check out the right branch, and then hands off to the build script found in the repo to build the product.
Of course, it also does a lot of maintenance around that, including cleaning up after a build, managing the different folders where different branches are checked out for each repository (we keep the checkout around, because a simple pull is a lot faster for subsequent builds than cloning fresh every time would be), and so on.
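The clone-once, pull-afterwards idea can be sketched in a few lines of Python. This is an assumed simplification, not CIBuildServer's code: it only decides which `git` command line to run for a per-branch working folder, and the helper names and example URL are hypothetical.

```python
import os
from typing import List


def has_checkout(workdir: str) -> bool:
    """A folder counts as an existing checkout if it contains a .git directory."""
    return os.path.isdir(os.path.join(workdir, ".git"))


def checkout_command(repo_url: str, branch: str, workdir: str,
                     existing: bool) -> List[str]:
    """Return the git command that brings the branch's folder up to date."""
    if existing:
        # The folder is dedicated to this branch, so a plain pull is enough.
        return ["git", "-C", workdir, "pull"]
    # First build of this branch: clone straight onto the requested branch.
    return ["git", "clone", "--branch", branch, repo_url, workdir]


first = checkout_command("https://example.com/elements.git", "develop",
                         "/ci2/work/elements/develop", existing=False)
later = checkout_command("https://example.com/elements.git", "develop",
                         "/ci2/work/elements/develop", existing=True)
```

The returned list can be passed as-is to `subprocess.run`; keeping one folder per repository and branch is what lets the pull case skip any branch switching.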
CIBuildServer also takes care of publishing the finished binaries and the build logs (to S3, of course), as well as keeping CIMainServer up to date with the build status.
We currently run CIBuildServer for Windows in EC2 virtual machines (these are new since we switched from VMware machines that ran on our Xserve). Since Amazon doesn’t offer OS X-based virtual machines (unfortunately), we also currently run CIBuildServer on a local Mac mini. In the long run, we might switch to something like MacMiniColo. Only the “DA/Cocoa” build needs the Mac, everything else builds on the Windows machines.
The nice thing with how the whole build system is distributed now is that the location of the build servers is immaterial; the only difference we see from the Mac builds running locally is that the upload and download to and from S3 is a bit slower than it is from the EC2 instances, where S3 access is so blazingly fast we were blown away. In fact, a lot of the caching of S3 content that we built into Train and CIStarter feels unnecessary when running on EC2.
The Shared Bucket
The thing that required the most rethinking before we could move to EC2 was what we refer to as “shared” files.
While our products build independently, we do have a lot of inter-dependence between them. For example, Everwood builds as its own repository, but both Elements and RO/DA need it to build. The DA/Cocoa build needs binaries from the DA/Windows build (such as Relativity Server), and vice versa.
In the past this was handled with a shared folder that all build VMs (and the Mac) had access to, but for the move to EC2, this infrastructure had to be switched over to leverage S3.
The way we handle this is that every build that runs gets passed a dedicated S3 location where it can publish generated binaries that other projects might need access to. This location is unique for the project (the “repository”, really), and the particular branch that builds.
For example, the Everwood build zips up the binaries needed by the other projects, and publishes them to the Shared bucket. When an Elements build runs next, it can use a predetermined logic to find the version of Everwood in S3 that most closely matches the right branch, and grab the bits from there. I say most closely matches, because there will not always be a 1:1 match, of course. We might have a “develop-feature-X” branch in our Elements repo, but no such branch in Everwood – so it will fall back to “Everwood/develop” instead.
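One plausible way to implement such a fallback is to keep trimming the last hyphen-separated part off the branch name until a match is found. The Python sketch below shows that idea; the trimming rule itself is my assumption for illustration (the actual matching logic is not spelled out here), and the function names are made up.

```python
from typing import Iterable, Optional


def shared_location(repository: str, branch: str) -> str:
    """Build the per-repository, per-branch location, e.g. 'Everwood/develop'."""
    return repository + "/" + branch


def resolve_shared_branch(requested: str, published: Iterable[str]) -> Optional[str]:
    """Find the closest published branch by trimming '-suffix' parts one at a time."""
    known = set(published)
    candidate = requested
    while True:
        if candidate in known:
            return candidate
        if "-" not in candidate:
            return None  # nothing left to trim, no match at all
        candidate = candidate.rsplit("-", 1)[0]


everwood_branches = ["develop", "master", "lockdown"]
match = resolve_shared_branch("develop-feature-X", everwood_branches)
location = shared_location("Everwood", match) if match else None
```

With these inputs the lookup tries “develop-feature-X”, then “develop-feature”, then “develop”, and ends up at “Everwood/develop”, as in the example above.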
The third and last server (currently) is CITestServer. Like CIBuildServer, it runs on its own instance (or instances, although we only run a single one currently). What CITestServer does is wait for builds to finish, and then grab the “dedicated” installer off S3, run it, and apply our suite of tests to it. As with the builds themselves, statuses are reported back to CIMainServer, which can post to Slack, send push notifications, and keep track of test results and their evolution over time in the database.
By default, we run tests for the latest new build of every single-word branch (i.e. we’ll test “develop”, but not “develop-feature-x”) that’s new when the test server becomes idle. Tests can take a long time, in many cases longer than the actual product builds, so it’s not uncommon for – say – three consecutive builds to run, but only the first and the last to get tested. Testing times are also what led us to compromise and test sub-branches such as “develop-feature-x” on demand only.
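The selection rule can be sketched in Python as follows. This is a simplified model with assumed data shapes: builds are (branch, build number) pairs, "single-word" is taken to mean "contains no hyphen", and the build numbers in the usage lines are invented.

```python
from typing import Dict, List, Tuple


def is_tested_by_default(branch: str) -> bool:
    """Single-word branches such as 'develop' are tested automatically."""
    return "-" not in branch


def builds_to_test(finished: List[Tuple[str, int]],
                   last_tested: Dict[str, int]) -> List[Tuple[str, int]]:
    """Pick the newest untested build per eligible branch, skipping older ones."""
    latest: Dict[str, int] = {}
    for branch, number in finished:
        if is_tested_by_default(branch):
            latest[branch] = max(number, latest.get(branch, 0))
    return sorted((branch, number) for branch, number in latest.items()
                  if number > last_tested.get(branch, 0))


# Three consecutive "develop" builds finished; 101 was tested before the
# test server got busy, so 102 is skipped and only 103 gets picked up.
finished = [("develop", 101), ("develop", 102), ("develop-feature-x", 7),
            ("develop", 103), ("lockdown", 55)]
queue = builds_to_test(finished, last_tested={"develop": 101})
```

The function would be called each time the test server becomes idle, which is exactly why intermediate builds fall through the cracks when tests run longer than builds.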
Our tests once again leverage Train, and a testing infrastructure called “Afterhours” that lets us run test suites created using various different testing frameworks (such as NUnit/XUnit on .NET, DUnit on Delphi, JUnit on Java, as well as of course our own EUnit), as well as some domain-specific test runners we have for Elements, and collect all their results in a standardized manner.
Final binaries, as well as log files from builds, tests and even prerequisite installs are all published to S3. The beauty of S3 is that it’s fairly fail-over resistant, and very accessible.
For giving our staff access to the builds as they get done, we have a few pages on the Staff portal on our website that monitor the respective S3 buckets and offer the files in them for download. This works based on the same infrastructure we use for the public and licensed downloads that are available from our website for you, the only difference being that we skip CloudFront, and just chose an S3 bucket location that’s closest to the bulk of our team, to give the fastest possible download speeds.
Publishing files for our customers (for releases, beta builds, and sometimes one-on-one to your Personal Downloads folder) is as easy as moving the files from one bucket to another (which we currently do manually as needed, but could automate in the future, if we wanted to). We can also give select customers “Fire Hose” access to all the builds that CI2 emits.
S3 offers automatic cleanup/maintenance functionality that we use to move older builds off to Glacier, and eventually delete them (we of course archive the release builds, but we don’t really need each of the 57 Elements builds that ran yesterday for more than a couple of weeks).
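For the curious, a lifecycle rule of this kind is just a small piece of configuration. The Python sketch below builds one as a plain dictionary in the shape S3 lifecycle rules use (a transition to the `GLACIER` storage class, followed by an expiration). The prefix, rule ID, and day counts are made-up placeholders, not our actual settings, and the helper name is hypothetical.

```python
from typing import Any, Dict


def lifecycle_rule(prefix: str, glacier_after_days: int,
                   delete_after_days: int) -> Dict[str, Any]:
    """Build one rule: move objects to Glacier first, delete them later."""
    if delete_after_days <= glacier_after_days:
        raise ValueError("objects must be archived before they are deleted")
    return {
        "ID": "archive-then-expire-" + prefix.strip("/"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": delete_after_days},
    }


# Placeholder numbers: archive after two weeks, delete after roughly three months.
configuration = {"Rules": [lifecycle_rule("builds/", 14, 90)]}
```

A dictionary like `configuration` is what you would hand to the AWS SDK or paste (as JSON) into the bucket's lifecycle settings; release builds would live under a different prefix that the rule does not touch.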
So that’s our CI2 system, as it’s currently in use. We’re fairly proud of it, and very happy with how it works and what it provides for us, so it is a bit of a shame that right now it is purely internal and not many people outside of RemObjects get to see (or use) it – even if you do get to see and use the products it generates.
Who knows, maybe we will change that one day and make CI2 available for general use, be it as open source or as a product. The main thing stopping that is that while the system is very flexible and configurable, it does still have a lot of “RemObjects-isms” – little bits and pieces that are made to work exactly how we need them, but aren’t precisely how everyone else would want them. And of course, it being used internally only, it can be pretty rough to set up and configure in places (for example, we add new products/repositories so seldom that we have no tool for it, but simply add a new row to the database manually ;)).
But whether CI2 will ever see the light outside of our virtual halls or not, I hope you found this an interesting read, a good look behind the scenes of how we work here at RemObjects. And maybe it gave you a couple of ideas for your own build infrastructure.
Let me know what you think!