Verification of zero-downtime deployments using GitHub Actions
In your project, you might have a strict requirement of zero downtime when deploying to production. Yet in a complex system it is hard to "guess" which changes will actually cause downtime and which will not; the most reliable way is to attempt an actual deployment and watch whether the API breaks. This blog post describes how I automated that task.
Example cases of deployment downtime
In my recent blog post I wrote about the zero-downtime deployment of a serverless stack. We were able to achieve it - mostly. But we also found that some alterations still incur downtime. Here, I recap these cases.
First, zero-downtime deployment of a Lambda was observed for routine code changes that did not alter the external properties of the Lambda. But if the Lambda handler function was renamed, there was a break in service during deployment. It happened because the Lambda configuration was updated with the new handler name while the function itself was still being set up, which took some time. During that window, traffic was sent towards the new handler name, but the updated source code wasn't there yet, and the old source code threw an error about an unrecognized identifier. For a couple of minutes the endpoint was replying with HTTP 500.
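To illustrate, a handler rename like the one in this sketch (a hypothetical Terraform resource; names and files are illustrative, not taken from the original stack) is enough to open such a window:

resource "aws_lambda_function" "api" {
  function_name = "my-api"
  runtime       = "nodejs14.x"
  filename      = "build/lambda.zip"
  role          = aws_iam_role.lambda.arn # hypothetical IAM role
  # Renaming the exported handler, e.g. from "index.handler" to
  # "index.mainHandler", updates the Lambda configuration before the new
  # bundle is live, so traffic briefly hits a handler name that the old
  # bundle does not export.
  handler = "index.mainHandler" # was: "index.handler"
}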
The second time, we faced bizarre behavior in the serverless.tf AWS API Gateway module. The API Gateway has a "Default route throttling" parameter (with two numbers, "Burst limit" and "Rate limit"). If throttling has not been configured manually or via Terraform, the AWS Console shows the limits as "Not Configured".
The Terraform module allows you to specify the following default route throttling settings:
default_route_settings = {
  throttling_rate_limit  = 200
  throttling_burst_limit = 100
}
So when you run terraform apply, the aforementioned throttling parameters take the values specified. Now, consider the situation where we first added this default_route_settings section and then decided to remove it. Once it was removed from the Terraform code, one could expect to go back to the "Not Configured" case. However, Terraform instead sets both throttling parameters to zero, effectively denying all traffic (all endpoints always return HTTP 429). I have described the problem in a GitHub issue; our workaround was to never go back to undefined throttling values once they were defined for an AWS API Gateway.
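If you later need throttling to effectively "go away" again, one option, shown in the sketch below (the limit values are arbitrary illustrations, not recommendations), is to keep the block and raise the limits so they no longer constrain real traffic:

default_route_settings = {
  # Do not delete this block once it has been applied; raise the limits
  # instead, so throttling becomes a practical no-op.
  throttling_rate_limit  = 10000
  throttling_burst_limit = 5000
}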
It is necessary to know about such problems before the changes hit production and the downtime becomes apparent to actual consumers. Hence, I created a small alerting framework that notifies us if a recent deployment caused downtime. The check runs against testing environments, so developers become aware of a problem early and can address it.
A Remark: Load Testing in general
I detect the downtime using a load-testing library that calls all API endpoints several times per second and senses if there is any problem. Still, I would like to briefly note the role load testing can play in general. Load and performance testing is often overlooked and captures less attention than, say, unit and integration tests. Yet a system should be tested at 2-3 times the production load to ensure that expected and unexpected load is handled well. Replicating production behavior patterns in a test environment is very hard; still, having certain safeguard measures (such as basic load tests that confirm the performance) is desirable, especially if you are dealing with a custom-made application in a container and do not rely on cloud services with known performance characteristics. We should keep in mind that, unlike simple code bugs, performance issues are not always fixable by changing a few lines of code; they might require major architecture and database changes followed by epic-scale re-testing (and if your automated testing is scarce, then you have a problem). It is therefore only natural to ensure performance goals early and, of course, before going into production; migrating production to a heavily revised system can be an extremely stressful exercise on many levels.
Architecture
We are answering the question: "When we make a deployment, will it create downtime?" Consider the following architecture:
Here, GitHub Actions is the "ruler" that orchestrates two things:
- the actual deployment of new code and infrastructure to AWS
- starting, running, and destroying an instance that hosts a load-testing application able to detect downtime.
This way we can run load against the API Gateway endpoint while the infrastructure stack is being updated, which gives us the needed statistics on errors or timeouts for any of our endpoints.
CI/CD Workflow in GitHub Actions
Here is a gist of a GitHub Actions workflow file:
name: Deployment

on:
  push:
    branches:
      - testing
      - production

jobs:
  start-runner:
    name: Start self-hosted EC2 runner
    runs-on: ubuntu-latest
    # Expose the runner label and instance id so that later jobs
    # can reference them via needs.start-runner.outputs.*
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@v2.2.0
        with:
          mode: start
          github-token: ${{ secrets.GITHUB_TOKEN }}
          ec2-image-id: ${{ secrets.WORKER_AMI }}
          ec2-instance-type: t3.medium
          subnet-id: ${{ secrets.VPC_PUBLIC_SUBNET }}
          security-group-id: ${{ secrets.VPC_SECURITY_GROUP }}

  start-load:
    name: Start Load
    environment:
      name: ${{ github.ref }}
    needs: start-runner
    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the new runner
    continue-on-error: true
    env:
      API_URL: ${{ secrets.API_URL }}
      # more parameters needed for the load tester
    steps:
      - uses: actions/checkout@v2
      - name: Use Node.js 14.x
        uses: actions/setup-node@v1
        with:
          node-version: 14.x
      - name: npm install
        working-directory: ./loadtester
        run: npm install
      - name: npm start
        # Clearing RUNNER_TRACKING_ID keeps the GitHub Actions cleanup
        # step from killing the background process when the job finishes.
        run: RUNNER_TRACKING_ID="" && (nohup npm start&)
        working-directory: ./loadtester
      - run: curl --max-time 60 http://localhost:8080/start

  deploy:
    name: Deploy
    needs:
      - start-runner
      - start-load
    runs-on: ubuntu-latest
    steps:
      # steps to deploy your application

  statistics-load:
    name: Output Statistics
    needs:
      - start-runner
      - deploy
    runs-on: ${{ needs.start-runner.outputs.label }} # same runner as in start-load!
    continue-on-error: true
    steps:
      - run: curl --max-time 60 http://localhost:8080/stats

  stop-runner:
    name: Stop self-hosted EC2 runner
    needs:
      - start-runner
      - statistics-load
    runs-on: ubuntu-latest
    if: ${{ always() }} # stop the runner even if earlier jobs failed
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Stop EC2 runner
        uses: machulav/ec2-github-runner@v2.2.0
        with:
          mode: stop
          github-token: ${{ secrets.GITHUB_TOKEN }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}

  fail-on-error:
    # Without this job the workflow remains "green"
    # even if continue-on-error-marked jobs have failed.
    name: Test Status -> Workflow Status
    needs:
      - stop-runner
    runs-on: ubuntu-latest
    if: always()
    steps:
      - uses: technote-space/workflow-conclusion-action@v2
      - name: Fail if any job has failed
        if: env.WORKFLOW_CONCLUSION == 'failure'
        run: exit 1
- Job start-runner runs first. It uses ec2-github-runner, which starts a temporary EC2 instance in your account to run a GitHub job. We call this instance the "runner" below.
- Job start-load is executed on the runner. It runs, in the background, a small NodeJS application (the scenario implies it is found in the loadtester/ subfolder; see "The load-testing application" section below) that acts as a REST server. The call curl http://localhost:8080/start tells the application to start the load. I would like to point out how the NodeJS application is started using the expression run: RUNNER_TRACKING_ID="" && (nohup npm start&). If we just used the nohup npm start part, a GitHub Actions cleanup step would terminate the server after the job completed; with this trick, we let the load application continue running after the job has quit. Also, we make use of Environments, so depending on the branch name we can use different parameters for the load tester.
- Job deploy performs the actual deployment; a sketch of possible steps follows after this list.
- Job statistics-load makes the call curl http://localhost:8080/stats, which tells the load application to publish its "downtime report" to our notification service of choice.
- Job stop-runner destroys the runner EC2 instance.
- Job fail-on-error takes care of the workflow status. Without it, the entire workflow would not fail even if a job marked continue-on-error: true has failed. The workflow-conclusion-action helps us mark the entire workflow as "red" in such a case.
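For completeness, here is what the stubbed deploy job might look like for a Terraform-managed stack. This is only a sketch: the ./infra working directory is an assumption, and your stack may use different tooling entirely.

deploy:
  name: Deploy
  needs:
    - start-runner
    - start-load
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2
    # setup-terraform and the ./infra directory are assumptions of this sketch
    - uses: hashicorp/setup-terraform@v1
    - name: Terraform init and apply
      working-directory: ./infra
      run: |
        terraform init
        terraform apply -auto-approve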
The load-testing application
Since we need to reach the load-testing application from different jobs, exposing it as a REST server was an easy option. I chose NodeJS as the platform, as it has a rich ecosystem of libraries that lets me focus on the essence of the application.
I relied on a couple of libraries:
- restify as the API server. I chose it because it allows a very short declaration of APIs.
- loadtest as the load-testing library. It turns out its out-of-the-box statistics are perfect for our case of spotting timeouts, throttled requests, and other errors, and it offers flexibility in customizing the load traffic. It can execute load against several endpoints at the same time.
The code below is written in TypeScript and is part of a typical NodeJS application. Various application boilerplates can easily be found online, so for clarity I do not present the contents of the entire project here, focusing only on the business-logic parts.
main.ts is my main file, and it looks like this:
import restify from "restify";
import {LoadTester} from "./tester";

const server = restify.createServer();
const loadTester = new LoadTester();

server.get('/start', loadTester.startLoad);
server.get('/stats', loadTester.statistics);

server.listen(8080, function () {
    console.log('%s listening at %s', server.name, server.url);
});
I do not add here any "stop the load" logic because the entire instance that runs the program will be destroyed.
The load tester class in tester.ts looks like this:
import {Next, Request, Response} from "restify";
import loadtest, {LoadTestOptions, LoadTestResult} from "loadtest";

type TestResult = {
    key: string;                    // url used
    isDowntime: boolean;            // is downtime found
    result: string;                 // reason for downtime
    errorRate: number;              // found 429 rate
    loadTestResult: LoadTestResult; // raw result
};

export class LoadTester {
    static statistics: Map<string, LoadTestResult> = new Map();
    // API_URL is provided via the environment (see the workflow above)
    static host = process.env.API_URL ?? "";
    static loadTestOptions: Map<string, LoadTestOptions>;
    static timestampStart = 0;

    // This method returns the configurations for load testing. If necessary,
    // it can take parameters, such as data for authorization headers.
    private static getLoadTestOptions() {
        return new Map<string, LoadTestOptions>([
            ['/method1', {
                url: LoadTester.host + '/method1',
                statusCallback: LoadTester.statusCallback('/method1'),
                requestsPerSecond: 2,
                timeout: 2000
            }],
            ['/method2', {
                url: LoadTester.host + '/method2',
                statusCallback: LoadTester.statusCallback('/method2'),
                requestsPerSecond: 2,
                timeout: 5000
            }]
        ]);
    }

    // This factory returns a callback that is invoked every time a call has
    // happened; the callback receives statistics from the loadtest library
    // for all the past calls combined. We store the statistics in our map
    // under the path key captured in the closure.
    private static statusCallback(pathKey: string) {
        return (error: any, result: any, latency: any) => {
            LoadTester.statistics.set(pathKey, latency);
            // here we can print any status messages, e.g. if an error happened
        };
    }

    // This method analyzes the statistics for an individual API method
    // and decides if the method had a downtime. One can introduce
    // a more fine-grained adaptive logic, too.
    private static detectDowntimeForMethodResult(key: string, value: LoadTestResult): TestResult {
        let reply: TestResult = {
            key: key,
            isDowntime: false,
            result: "everything is OK",
            errorRate: 0,
            loadTestResult: value
        };
        if (value.totalErrors === 0) {
            reply.isDowntime = false;
            return reply;
        }
        reply.isDowntime = true;
        reply.result = `Downtime happened. Errors: ${JSON.stringify(value.errorCodes)}`;
        return reply;
    }

    // This method generates the text report for downtime.
    private static getResultText(downtimeDetected: boolean, downtimeResults: Map<string, TestResult>) {
        let response = `${downtimeDetected ?
            "Downtime happened" : "No downtime happened"}. Api: ${LoadTester.host}. Run time: ${(Date.now() - LoadTester.timestampStart) / 1000} seconds.\n`;
        downtimeResults.forEach((value, key) => {
            response = response + `*${key}*: ${value.loadTestResult.totalRequests} requests: ${value.result}\n`;
        });
        return response;
    }

    // This method starts the load.
    async startLoad(req: Request, res: Response, next: Next) {
        LoadTester.loadTestOptions = LoadTester.getLoadTestOptions();
        LoadTester.timestampStart = Date.now();
        res.send(`Started load at ${LoadTester.timestampStart} towards ${LoadTester.host}`);
        LoadTester.loadTestOptions.forEach((options: LoadTestOptions, key: string) => {
            console.log("Starting load for " + options.url + " key " + key);
            loadtest.loadTest(options, function (error: any) {
                if (error) {
                    return console.error('Got an error: %s', error);
                }
                console.log('Tests run successfully');
            });
        });
        return next();
    }

    // This method prints the statistics.
    async statistics(req: Request, res: Response, next: Next) {
        let downtimeDetected = false;
        let downtimeResults: Map<string, TestResult> = new Map();
        LoadTester.statistics.forEach((value: LoadTestResult, key) => {
            const methodResults = LoadTester.detectDowntimeForMethodResult(key, value);
            downtimeResults.set(key, methodResults);
            downtimeDetected = methodResults.isDowntime || downtimeDetected;
        });
        const resultText = LoadTester.getResultText(downtimeDetected, downtimeResults);
        // Print result
        console.log(resultText);
        // Return result in REST call
        res.send(Array.from(downtimeResults));
        // Here one can add calls to post the information to a reporting channel,
        // such as email or Slack.
        return next();
    }
}
Usage would look as follows:
> npm start
...
restify listening at http://[::]:8080
and in another terminal:
> curl http://localhost:8080/start
"Started load at 1627569136797 towards https://mytest.com."
> curl http://localhost:8080/stats
[["/method1",{"totalRequests":17,"totalErrors":0,"totalTimeSeconds":8.779819100000001,
"rps":2,"meanLatencyMs":401.7,"maxLatencyMs":450,"minLatencyMs":351,"percentiles":
{"50":400,"90":442,"95":450,"99":450},"errorCodes":{}}],
["/method2",{"totalRequests":16,"totalErrors":0,"totalTimeSeconds":8.475301799999999,
"rps":2,"meanLatencyMs":578.4,"maxLatencyMs":1510,"minLatencyMs":361,"percentiles":
{"50":385,"90":1397,"95":1510,"99":1510},"errorCodes":{}}]]
The code for the /stats endpoint can be enhanced to publish the information to other systems.
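For instance, assuming the report should go to Slack and a webhook URL is supplied via a SLACK_WEBHOOK_URL environment variable (both are assumptions, not part of the original setup), a minimal sketch could look like this:

import https from "https";

// Hypothetical helper: posts the downtime report text to a Slack
// incoming webhook read from the SLACK_WEBHOOK_URL environment variable.
function postReportToSlack(text: string): void {
    const webhookUrl = process.env.SLACK_WEBHOOK_URL;
    if (!webhookUrl) {
        console.warn("SLACK_WEBHOOK_URL is not set, skipping the Slack report");
        return;
    }
    const request = https.request(webhookUrl, {
        method: "POST",
        headers: {"Content-Type": "application/json"}
    }, (response) => {
        console.log(`Slack replied with HTTP ${response.statusCode}`);
    });
    request.on("error", (error) => console.error("Slack post failed:", error));
    // Slack incoming webhooks accept a JSON body with a "text" field.
    request.write(JSON.stringify({text}));
    request.end();
}

The statistics method could then call postReportToSlack(resultText) right after logging the result.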
Conclusion
I hope this small load-testing example helps make your systems more reliable. Should you have any questions or comments, please let me know at askar.ibragimov@futurice.com
- Askar Ibragimov, Cloud Architect and Senior Developer