Verification of zero-downtime deployments using GitHub Actions
In your project, you might have a strict requirement of zero downtime when deploying to production. Yet in a complex system it is hard to "guess" which changes will actually cause downtime and which will not; the most reliable way is to attempt an actual deployment and watch whether the API breaks. This blog post describes how I automated that task.
Example cases of deployment downtime
In my recent blog post I wrote about the zero-downtime deployment of a serverless stack. We were able to achieve it - mostly. But we also found that some alterations still incur downtime. Here, I recap these cases.
First, zero-downtime deployment of a Lambda was observed for routine code changes that did not alter the external properties of the Lambda. But if the Lambda handler function was renamed, there was a break in service during deployment. It happened because the Lambda configuration was updated with the new handler name while the function itself was still being set up, which took some time. During that window, traffic was sent towards the new handler name, but the updated source code wasn't there yet, and the old source code threw an error about an unrecognized identifier. For a couple of minutes the endpoint was replying with HTTP 500.
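To illustrate, a handler rename like the one in this sketch (a hypothetical Terraform resource; names and files are illustrative, not taken from the original stack) is enough to open such a window:

resource "aws_lambda_function" "api" {
  function_name = "my-api"
  runtime       = "nodejs14.x"
  filename      = "build/lambda.zip"
  role          = aws_iam_role.lambda.arn # hypothetical IAM role
  # Renaming the exported handler, e.g. from "index.handler" to
  # "index.mainHandler", updates the Lambda configuration before the new
  # bundle is live, so traffic briefly hits a handler name that the old
  # bundle does not export.
  handler = "index.mainHandler" # was: "index.handler"
}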
The second time, we faced bizarre behavior in the serverless.tf AWS API Gateway module. The API Gateway has a "Default route throttling" parameter (with two numbers, "Burst limit" and "Rate limit"). If throttling has not been configured manually or via Terraform, the AWS Console shows the limits as "Not Configured".
The Terraform module allows you to specify the following default route throttling settings:
default_route_settings = {
  throttling_rate_limit  = 200
  throttling_burst_limit = 100
}
So when you run terraform apply, the aforementioned throttling parameters take the values specified. Now, consider the situation where we first added this default_route_settings section and then decided to remove it. Once it was removed from the Terraform code, one could expect to go back to the "Not Configured" case. However, Terraform instead sets both throttling parameters to zero, effectively denying all traffic (all endpoints always return HTTP 429). I have described the problem in a GitHub issue; our workaround was to never go back to undefined throttling values once they were defined for an AWS API Gateway.
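If you later need throttling to effectively "go away" again, one option, shown in the sketch below (the limit values are arbitrary illustrations, not recommendations), is to keep the block and raise the limits so they no longer constrain real traffic:

default_route_settings = {
  # Do not delete this block once it has been applied; raise the limits
  # instead, so throttling becomes a practical no-op.
  throttling_rate_limit  = 10000
  throttling_burst_limit = 5000
}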
It is necessary to know about such problems before the changes hit production and the downtime becomes apparent to actual consumers. Hence, I created a small alerting framework that notifies us if a recent deployment caused downtime. The check runs against testing environments, so developers become aware of a problem early and can address it.
A Remark: Load Testing in general
I detect the downtime using a load-testing library that calls all API endpoints several times per second and senses if there is any problem. Still, I would like to briefly note the role load testing can play in general. Load and performance testing is often overlooked and captures less attention than, say, unit and integration tests. Yet a system should be tested at 2-3 times the production load to ensure that expected and unexpected load is handled well. Replicating production behavior patterns in a test environment is very hard; still, having certain safeguard measures (such as basic load tests that confirm the performance) is desirable, especially if you are dealing with a custom-made application in a container and do not rely on cloud services with known performance characteristics. We should keep in mind that, unlike simple code bugs, performance issues are not always fixable by changing a few lines of code; they might require major architecture and database changes followed by epic-scale re-testing (and if your automated testing is scarce, then you have a problem). It is therefore only natural to ensure performance goals early and, of course, before going into production; migrating production to a heavily revised system can be an extremely stressful exercise on many levels.
Architecture
We are answering the question: "When we make a deployment, will it create downtime?" Consider the following architecture:
Here, GitHub Actions is the "ruler" that orchestrates two things:
- the actual deployment of new code and infrastructure to AWS
- starting, running, and destroying an instance that hosts a load-testing application able to detect downtime.
This way we can run load against the API Gateway endpoint while the infrastructure stack is being updated, which gives us the needed statistics on errors or timeouts for any of our endpoints.
CI/CD Workflow in GitHub Actions
Here is a gist of a GitHub Actions workflow file:
name: Deployment

on:
  push:
    branches:
      - testing
      - production

jobs:
  start-runner:
    name: Start self-hosted EC2 runner
    runs-on: ubuntu-latest
    # Expose the runner label and instance id so that later jobs
    # can reference them via needs.start-runner.outputs.*
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@v2.2.0
        with:
          mode: start
          github-token: ${{ secrets.GITHUB_TOKEN }}
          ec2-image-id: ${{ secrets.WORKER_AMI }}
          ec2-instance-type: t3.medium
          subnet-id: ${{ secrets.VPC_PUBLIC_SUBNET }}
          security-group-id: ${{ secrets.VPC_SECURITY_GROUP }}

  start-load:
    name: Start Load
    environment:
      name: ${{ github.ref }}
    needs: start-runner
    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the new runner
    continue-on-error: true
    env:
      API_URL: ${{ secrets.API_URL }}
      # more parameters needed for the load tester
    steps:
      - uses: actions/checkout@v2
      - name: Use Node.js 14.x
        uses: actions/setup-node@v1
        with:
          node-version: 14.x
      - name: npm install
        working-directory: ./loadtester
        run: npm install
      - name: npm start
        # Clearing RUNNER_TRACKING_ID keeps the GitHub Actions cleanup
        # step from killing the background process when the job finishes.
        run: RUNNER_TRACKING_ID="" && (nohup npm start&)
        working-directory: ./loadtester
      - run: curl --max-time 60 http://localhost:8080/start

  deploy:
    name: Deploy
    needs:
      - start-runner
      - start-load
    runs-on: ubuntu-latest
    steps:
      # steps to deploy your application

  statistics-load:
    name: Output Statistics
    needs:
      - start-runner
      - deploy
    runs-on: ${{ needs.start-runner.outputs.label }} # same runner as in start-load!
    continue-on-error: true
    steps:
      - run: curl --max-time 60 http://localhost:8080/stats

  stop-runner:
    name: Stop self-hosted EC2 runner
    needs:
      - start-runner
      - statistics-load
    runs-on: ubuntu-latest
    if: ${{ always() }} # stop the runner even if earlier jobs failed
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Stop EC2 runner
        uses: machulav/ec2-github-runner@v2.2.0
        with:
          mode: stop
          github-token: ${{ secrets.GITHUB_TOKEN }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}

  fail-on-error:
    # Without this job the workflow remains "green"
    # even if continue-on-error-marked jobs have failed.
    name: Test Status -> Workflow Status
    needs:
      - stop-runner
    runs-on: ubuntu-latest
    if: always()
    steps:
      - uses: technote-space/workflow-conclusion-action@v2
      - name: Fail if any job has failed
        if: env.WORKFLOW_CONCLUSION == 'failure'
        run: exit 1
- Job start-runner runs first. It uses ec2-github-runner, which starts a temporary EC2 instance in your account to run a GitHub job. We call this instance the "runner" below.
- Job start-load is executed on the runner. It runs, in the background, a small NodeJS application (the scenario implies it is found in the loadtester/ subfolder; see "The load-testing application" section below) that acts as a REST server. The call curl http://localhost:8080/start tells the application to start the load. I would like to point out how the NodeJS application is started using the expression run: RUNNER_TRACKING_ID="" && (nohup npm start&). If we just used the nohup npm start part, a GitHub Actions cleanup step would terminate the server after the job completed; with this trick, we let the load application continue running after the job has quit. Also, we make use of Environments, so depending on the branch name we can use different parameters for the load tester.
- Job deploy performs the actual deployment; a sketch of possible steps follows after this list.
- Job statistics-load makes the call curl http://localhost:8080/stats, which tells the load application to publish its "downtime report" to our notification service of choice.
- Job stop-runner destroys the runner EC2 instance.
- Job fail-on-error takes care of the workflow status. Without it, the entire workflow would not fail even if a job marked continue-on-error: true has failed. The workflow-conclusion-action helps us mark the entire workflow as "red" in such a case.
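For completeness, here is what the stubbed deploy job might look like for a Terraform-managed stack. This is only a sketch: the ./infra working directory is an assumption, and your stack may use different tooling entirely.

deploy:
  name: Deploy
  needs:
    - start-runner
    - start-load
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2
    # setup-terraform and the ./infra directory are assumptions of this sketch
    - uses: hashicorp/setup-terraform@v1
    - name: Terraform init and apply
      working-directory: ./infra
      run: |
        terraform init
        terraform apply -auto-approve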
The load-testing application
Since we need to reach the load-testing application from different jobs, exposing it as a REST server was an easy option. I chose NodeJS as the platform, as it has a rich ecosystem of libraries that lets me focus on the essence of the application.
I relied on a couple of libraries:
- restify as the API server. I chose it because it allows a very short declaration of APIs.
- loadtest as the load-testing library. It turns out its out-of-the-box statistics are perfect for our case of spotting timeouts, throttled requests, and other errors, and it offers flexibility in customizing the load traffic. It can execute load against several endpoints at the same time.
The code below is written in TypeScript and is part of a typical NodeJS application. Various application boilerplates can easily be found online, so for clarity I do not present the contents of the entire project here, focusing only on the business-logic parts.
main.ts is my main file, and it looks like this:
import restify from "restify";
import {LoadTester} from "./tester";

const server = restify.createServer();
const loadTester = new LoadTester();

server.get('/start', loadTester.startLoad);
server.get('/stats', loadTester.statistics);

server.listen(8080, function () {
    console.log('%s listening at %s', server.name, server.url);
});
I do not add here any "stop the load" logic because the entire instance that runs the program will be destroyed.
The load tester class in tester.ts looks like this:
import {Next, Request, Response} from "restify";
import loadtest, {LoadTestOptions, LoadTestResult} from "loadtest";

type TestResult = {
    key: string;                    // url used
    isDowntime: boolean;            // is downtime found
    result: string;                 // reason for downtime
    errorRate: number;              // found 429 rate
    loadTestResult: LoadTestResult; // raw result
};

export class LoadTester {
    static statistics: Map<string, LoadTestResult> = new Map();
    // API_URL is provided via the environment (see the workflow above)
    static host = process.env.API_URL ?? "";
    static loadTestOptions: Map<string, LoadTestOptions>;
    static timestampStart = 0;

    // This method returns the configurations for load testing. If necessary,
    // it can take parameters, such as data for authorization headers.
    private static getLoadTestOptions() {
        return new Map<string, LoadTestOptions>([
            ['/method1', {
                url: LoadTester.host + '/method1',
                statusCallback: LoadTester.statusCallback('/method1'),
                requestsPerSecond: 2,
                timeout: 2000
            }],
            ['/method2', {
                url: LoadTester.host + '/method2',
                statusCallback: LoadTester.statusCallback('/method2'),
                requestsPerSecond: 2,
                timeout: 5000
            }]
        ]);
    }

    // This factory returns a callback that is invoked every time a call has
    // happened; the callback receives statistics from the loadtest library
    // for all the past calls combined. We store the statistics in our map
    // under the path key captured in the closure.
    private static statusCallback(pathKey: string) {
        return (error: any, result: any, latency: any) => {
            LoadTester.statistics.set(pathKey, latency);
            // here we can print any status messages, e.g. if an error happened
        };
    }

    // This method analyzes the statistics for an individual API method
    // and decides if the method had a downtime. One can introduce
    // a more fine-grained adaptive logic, too.
    private static detectDowntimeForMethodResult(key: string, value: LoadTestResult): TestResult {
        let reply: TestResult = {
            key: key,
            isDowntime: false,
            result: "everything is OK",
            errorRate: 0,
            loadTestResult: value
        };
        if (value.totalErrors === 0) {
            reply.isDowntime = false;
            return reply;
        }
        reply.isDowntime = true;
        reply.result = `Downtime happened. Errors: ${JSON.stringify(value.errorCodes)}`;
        return reply;
    }

    // This method generates the text report for downtime.
    private static getResultText(downtimeDetected: boolean, downtimeResults: Map<string, TestResult>) {
        let response = `${downtimeDetected ?
            "Downtime happened" : "No downtime happened"}. Api: ${LoadTester.host}. Run time: ${(Date.now() - LoadTester.timestampStart) / 1000} seconds.\n`;
        downtimeResults.forEach((value, key) => {
            response = response + `*${key}*: ${value.loadTestResult.totalRequests} requests: ${value.result}\n`;
        });
        return response;
    }

    // This method starts the load.
    async startLoad(req: Request, res: Response, next: Next) {
        LoadTester.loadTestOptions = LoadTester.getLoadTestOptions();
        LoadTester.timestampStart = Date.now();
        res.send(`Started load at ${LoadTester.timestampStart} towards ${LoadTester.host}`);
        LoadTester.loadTestOptions.forEach((options: LoadTestOptions, key: string) => {
            console.log("Starting load for " + options.url + " key " + key);
            loadtest.loadTest(options, function (error: any) {
                if (error) {
                    return console.error('Got an error: %s', error);
                }
                console.log('Tests run successfully');
            });
        });
        return next();
    }

    // This method prints the statistics.
    async statistics(req: Request, res: Response, next: Next) {
        let downtimeDetected = false;
        let downtimeResults: Map<string, TestResult> = new Map();
        LoadTester.statistics.forEach((value: LoadTestResult, key) => {
            const methodResults = LoadTester.detectDowntimeForMethodResult(key, value);
            downtimeResults.set(key, methodResults);
            downtimeDetected = methodResults.isDowntime || downtimeDetected;
        });
        const resultText = LoadTester.getResultText(downtimeDetected, downtimeResults);
        // Print result
        console.log(resultText);
        // Return result in REST call
        res.send(Array.from(downtimeResults));
        // Here one can add calls to post the information to a reporting channel,
        // such as email or Slack.
        return next();
    }
}
Usage would look as follows:
> npm start
...
restify listening at http://[::]:8080
and in another terminal:
> curl http://localhost:8080/start
"Started load at 1627569136797 towards https://mytest.com."
> curl http://localhost:8080/stats
[["/method1",{"totalRequests":17,"totalErrors":0,"totalTimeSeconds":8.779819100000001,
"rps":2,"meanLatencyMs":401.7,"maxLatencyMs":450,"minLatencyMs":351,"percentiles":
{"50":400,"90":442,"95":450,"99":450},"errorCodes":{}}],
["/method2",{"totalRequests":16,"totalErrors":0,"totalTimeSeconds":8.475301799999999,
"rps":2,"meanLatencyMs":578.4,"maxLatencyMs":1510,"minLatencyMs":361,"percentiles":
{"50":385,"90":1397,"95":1510,"99":1510},"errorCodes":{}}]]
The code for the /stats endpoint can be enhanced to publish the information to other systems.
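For instance, assuming the report should go to Slack and a webhook URL is supplied via a SLACK_WEBHOOK_URL environment variable (both are assumptions, not part of the original setup), a minimal sketch could look like this:

import https from "https";

// Hypothetical helper: posts the downtime report text to a Slack
// incoming webhook read from the SLACK_WEBHOOK_URL environment variable.
function postReportToSlack(text: string): void {
    const webhookUrl = process.env.SLACK_WEBHOOK_URL;
    if (!webhookUrl) {
        console.warn("SLACK_WEBHOOK_URL is not set, skipping the Slack report");
        return;
    }
    const request = https.request(webhookUrl, {
        method: "POST",
        headers: {"Content-Type": "application/json"}
    }, (response) => {
        console.log(`Slack replied with HTTP ${response.statusCode}`);
    });
    request.on("error", (error) => console.error("Slack post failed:", error));
    // Slack incoming webhooks accept a JSON body with a "text" field.
    request.write(JSON.stringify({text}));
    request.end();
}

The statistics method could then call postReportToSlack(resultText) right after logging the result.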
Conclusion
I hope this small load-testing example helps make your systems more reliable. Should you have any questions or comments, please let me know at askar.ibragimov@futurice.com
- Askar Ibragimov, Cloud Architect and Senior Developer