Building an AWS Serverless system: Integration testing and accessing Private VPC resources
In this third blog post in my series about an API-based serverless service in AWS, I want to explain how to test everything from everywhere. I want to test websites, check APIs and peek into VPC-private databases. I want to do the same thing from my local PC and from CI/CD in GitHub. I want to receive a message in Slack if something goes wrong. And all of it should be so easy that a developer only has to think about writing the essence of a test.
How do we test?
First things first: manual testing is not an option. Complex modern systems quickly become unmanageable and untestable manually. It is crucial to have a framework for Integration tests up and running early and place a firm requirement in your Definition of Done to have test coverage added at the development stage.
The idea below comes from one of my favorite texts, "Things I believe": testing must be a breeze, and testing must be able to cover everything that exists in reality.
We usually have a choice between so-called unit tests and integration tests — roughly, the former are run against the codebase by e.g. importing and testing a single function, while the latter are run against several components working together, for example, several deployed services in an environment.
I would use unit tests in cases where I have some isolated business logic that I can call with limited mocking. On the other hand, testing a service by mocking its dependencies does not make much sense. Consider a Lambda that orchestrates several other endpoints: mocking all of these would take a lot of time, and there is no guarantee that the mocks will behave like the real services. So instead of these unit tests that never get written, I create integration ones, because they
- test not only the business logic but also the entire "texture", including network communications and components;
- require much less mocking and maintenance;
- exercise real code and can go all the way to end-to-end system testing, peeking into real databases and third-party services.
I confess that I do not find test writing particularly exciting work. And boring work is often pushed back and procrastinated on, which does not help the stability of the system. Therefore, we should give developers a way to make test writing easy, so they can execute all testing operations with a simple setup and test exactly what needs to be tested, not something that just resembles it.
Being a good CI/CD "citizen", a properly written test should set up and tear down the vast majority of its data. It is also a good idea to have a scheduled cleanup job that purges the databases tests might write to, removing data left behind by tests that failed before a proper teardown.
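As a minimal sketch of that idea, a self-cleaning jest test could look like this; the createTestOrder, deleteTestOrder and fetchOrder helpers are hypothetical placeholders for your own API or database clients:
// A hypothetical self-cleaning integration test: the data it needs is created
// in beforeAll and removed in afterAll, even if the assertions fail.
describe("Order retrieval", () => {
  let order;

  beforeAll(async () => {
    // Arrange: create the data this test needs in the real test environment
    order = await createTestOrder({ customer: "integration-test-customer" });
  });

  afterAll(async () => {
    // Teardown: remove the data so repeated runs do not pollute the database
    await deleteTestOrder(order.id);
  });

  it("returns the order that was just created", async () => {
    const result = await fetchOrder(order.id); // hypothetical API client call
    expect(result.customer).toBe("integration-test-customer");
  });
});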
It is a best practice in cloud architecture that databases are hidden in a private VPC subnet, so they have no public endpoint and therefore cannot be reached directly from the public internet. Yet we still want to run all tests both from a CI/CD worker in GitHub Actions and from a local machine on the public internet. Below, I explain how to set this up using authorized AWS connections.
As a side note, AWS SAM offers a way to test a Lambda locally based on a dummy event, but this felt impractical for us. We do not do development and testing with this approach.
What do we test?
The most common set includes:
- my APIs
- my Database entries
- my website pages
Here is the blueprint of a system that I describe in more detail below:
In the image above, the "Identity provider" is an example of a typical third-party dependency: tests use it to obtain tokens that the system requires.
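Since the tests are written in NodeJS (see the stack choice below), a helper for that token step could be sketched roughly as follows; the endpoint, the environment variables and the response shape are assumptions, not the actual provider's API:
const axios = require("axios");

// A hypothetical helper that trades test-user credentials for an access token
// at the identity provider; adjust the URL and payload to the real provider.
async function getAccessToken(userType) {
  const response = await axios.post(process.env.IDP_TOKEN_URL, {
    username: process.env[`TEST_${userType}_USERNAME`],
    password: process.env[`TEST_${userType}_PASSWORD`],
  });
  return response.data.access_token;
}

module.exports = { getAccessToken };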
For the testing stack I chose:
- NodeJS, as it offers a great variety of npm packages, so it is easy to borrow functionality for my tests, should I need to call a URL, visit a database or decrypt a JWT token;
- jest as the test runner;
- jest-cucumber to be able to write Gherkin "features" in addition to native jest tests;
- playwright to be able to test website pages.
Also, the test report is generated in the junit format so it is viewable later in GitHub. In my opinion, this toolkit fulfills most of the usual needs both for website testing and for API layer testing.
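As an illustration of the website part, a page check could be sketched like this; it assumes the jest-playwright preset is configured in jest.config.js (which exposes the global page object), and the BASE_URL variable and the expected title are placeholders:
// A sketch of a website test, assuming the jest-playwright preset provides the
// global "page"; the URL and expected title are hypothetical.
describe("Landing page", () => {
  it("loads and shows the expected title", async () => {
    await page.goto(process.env.BASE_URL);
    const title = await page.title();
    expect(title).toMatch(/Really Useful Service/);
  });
});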
I like to write my tests as Cucumber definitions, so the first file I create is a feature that looks like this:
Feature: Really Useful Data Retrieval
  To check if user can retrieve data

  Scenario Outline: A <UserType> user can retrieve data
    Given The <UserType> user is authorized
    When User requests data
    Then User <UserType> data is received correctly

    Examples:
      | UserType   |
      | Usertype_1 |
      | Usertype_2 |
Using jest-cucumber, I convert the lines found in the feature into jest test code that creates test data, calls APIs or interacts with web pages using jest-playwright.
I stress that it is still possible to mix in native jest tests in case Gherkin features are not appropriate.
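As a sketch of what the step definitions for the feature above could look like (loadFeature and defineFeature come from jest-cucumber; authorizeUser and requestData are hypothetical helpers, and the feature path is made up):
const { loadFeature, defineFeature } = require("jest-cucumber");

const feature = loadFeature("./tests/features/data-retrieval.feature"); // hypothetical path

defineFeature(feature, (test) => {
  // jest-cucumber expands the Scenario Outline once per row of the Examples table
  test("A <UserType> user can retrieve data", ({ given, when, then }) => {
    let token;
    let response;

    given(/^The (.+) user is authorized$/, async (userType) => {
      token = await authorizeUser(userType); // hypothetical call to the identity provider
    });

    when("User requests data", async () => {
      response = await requestData(token); // hypothetical API client call
    });

    then(/^User (.+) data is received correctly$/, async (userType) => {
      expect(response.status).toBe(200);
      expect(response.body.userType).toBe(userType);
    });
  });
});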
I also check the contents of a database in these tests. Interestingly, the Gherkin authors claim: "While it might be tempting to implement Then steps to look in the database - resist that temptation! You should only verify an outcome that is observable for the user (or external system), and changes to a database are usually not" — which, to me, is a very controversial claim that I cannot agree with. Following the YAGNI principle, we create only entities that matter — databases included. I would say that I use Gherkin mostly as a nicer way to express the classical "Arrange — Act — Assert" test structure.
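For the database part, such a check could be sketched with the pg package roughly as follows; the table, column and expected value are made up, and POSTGRES_CONNSTRING is the same connection string variable used in the tunnel setup later in this post:
const { Client } = require("pg");

// A hypothetical "Then" step helper that looks into the real database directly.
async function expectOrderProcessed(orderId) {
  const client = new Client({ connectionString: process.env.POSTGRES_CONNSTRING });
  await client.connect();
  try {
    const result = await client.query("SELECT status FROM orders WHERE id = $1", [orderId]);
    expect(result.rows).toHaveLength(1);
    expect(result.rows[0].status).toBe("PROCESSED");
  } finally {
    await client.end();
  }
}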
JUnit test results
It is very handy to have a pretty GitHub page that shows your test results. It can be achieved in the following way.
We configure our tests to report in the so-called junit format in jest.config.js:
module.exports = {
  ...
  reporters: [
    "default",
    ["jest-junit", {
      "outputDirectory": ".",
      "outputName": "jest-junit.xml",
    }]
  ],
  ...
};
In our GitHub CI/CD workflow we attach a step that creates the desired report using the dorny/test-reporter Action:
- name: Test Report
  uses: dorny/test-reporter@v1
  if: always()           # run this step even if previous step failed
  with:
    name: Integration tests
    path: jest*.xml      # Path to test results
    reporter: jest-junit # Format of test results
If you run Integration tests as part of Pull Request checks, the respective report will be added to the Run page; in any case, you can recover the report by peeking into the action's log and looking for the Check run HTML line:
Creating test report Integration tests
Processing test results from jest-junit.xml
...
Check run HTML: https://github.com/<repo>/runs/12345678
Here's how it looks (from dorny/test-reporter):
Accessing Private Resources from tests in CI/CD
Despite some resources staying inside private subnets, we want our tests to be able to peek into everything, so that testing is robust. With GitHub Actions, we have the option to create a custom Runner: a virtual machine that we can place in our VPC. It has to have the correct Security Group settings so it can communicate with GitHub over HTTPS and with the desired databases over their respective ports. This approach is implemented by the machulav/ec2-github-runner Action: it creates a Runner which is started before the test job and terminated right after, so cloud resources are only used while a job is underway. The extra time this takes is just a couple of minutes, so it does not introduce a great delay.
As a parameter to the Action, we need an AMI image that will be spun up as a Runner. To make it, I used a default Amazon Linux as a base and additionally installed node (using this AWS recipe), docker (using the ec2-github-runner README info), and the dependencies required for Chrome from this document (the CentOS part). The latter are needed for playwright to work correctly. The AMI ID is stored in a GitHub Actions Secret and later passed to the Workflow. To avoid confusion: the "jump host" discussed in the "Accessing Private Resources from local PC" section and this GitHub runner host are separate entities.
Other parameters for the Action include the ID of a public subnet of the VPC in which to place the instance, and a Security Group with the correct permissions.
My GitHub job to run the tests follows the example from the ec2-github-runner README info:
integration-tests:
  name: Run Integration Tests
  needs: start-runner # required to start the main job when the runner is ready
  runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
  continue-on-error: true # Continue to destroy the runner if the job has failed
  env:
    MY_TEST_PARAMETER_ENVIRONMENT_VARIABLE: ${{ secrets.MY_SECRET_WITH_TEST_PARAMETER }}
  steps:
    - uses: actions/checkout@v2
    - name: Use Node.js 14.x
      uses: actions/setup-node@v1
      with:
        node-version: 14.x
    - run: npm install
    - run: npm test
    - name: Test Report
      uses: dorny/test-reporter@v1
      if: always() # run this step even if previous step failed
      with:
        name: Integration tests results
        path: jest*.xml
        reporter: jest-junit
By default, the remaining Jobs in a Workflow stop executing if one Job fails. Tests are bound to fail sometimes, therefore we use the option continue-on-error: true for the job that runs on the Runner - so if the tests fail, the runner instance is still destroyed and we do not keep paying for a resource that we no longer use.
A final remark: should you want to replace the AMI, remember to deregister old AMIs when creating new ones.
Failing the Workflow and reporting to Slack
Since we used continue-on-error: true for the integration-tests job, the workflow would remain "green" even if that job fails, and we do not want that; we want to be notified that tests have failed. Hence, after the Runner is destroyed, we need to re-check the status of the job and "fail" the Workflow. This is done using yet another Action, technote-space/workflow-conclusion-action, which is green only if all jobs are green. In the snippet below, it is taken care of by the Fail if Integration tests are red step.
A failed Workflow might not be noticed instantly by developers, yet it is very handy to let them know early that there is a problem with the tests. The step called Notify Slack shows how I report the result of the job to Slack as well:
name: Test Status -> Workflow Status
needs:
  # First we run tests and delete the Runner
  - start-runner
  - integration-tests
  - stop-runner
runs-on: ubuntu-latest
if: always()
steps:
  - name: Get current date
    id: date
    run: echo "::set-output name=date::$(date +'%Y-%m-%d %H:%M')"
  - uses: technote-space/workflow-conclusion-action@v2
  - name: Notify Slack
    # if: env.WORKFLOW_CONCLUSION == 'failure' can be added if we only want to
    # report failures in Slack, to reduce noise.
    uses: 8398a7/action-slack@v3
    with:
      status: custom
      fields: all
      custom_payload: |
        {
          attachments: [{
            color: process.env.WORKFLOW_CONCLUSION == 'failure'? 'danger': 'good',
            text: `${process.env.AS_REPO}: ${process.env.AS_WORKFLOW} ${process.env.WORKFLOW_CONCLUSION == 'failure'? '*failed*': 'succeeded'} at ${{ steps.date.outputs.date }}.`,
          }]
        }
    env:
      SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
  - name: Fail if Integration tests are red
    if: env.WORKFLOW_CONCLUSION == 'failure'
    run: exit 1
Accessing Private Resources from local PC
When resources are located in a private VPC subnet, it might sound counterintuitive to access them from the internet, but it is possible if you use authenticated channels. Namely, SSM offers a way to establish a session to any instance where an "SSM Agent" is running, even if it is in a private subnet. Therefore, we can do the following:
- We launch a small Amazon Linux instance in a private VPC subnet, so no one except us can reach it from outside; we hold on to the private key for the instance.
- The vanilla instance usually does not have the SSM Agent installed. While the setup of the Agent is described in Session manager Steps for local machine and the instance and this AWS blog post, I found these manuals hard to follow and struggled to produce a successful setup. However, AWS offers an easy way to configure the instance: in the AWS Console, go to Systems Manager > Quick Setup, click Create, choose Host management, then Choose how you want to target instances -> manual, specify the Instance ID, and let the configuration run and complete its steps. To verify that you can use the SSM connection to the instance, go to your list of EC2 instances, select the desired instance, click "Connect" and choose "Session manager" (the tab should NOT display any errors saying the connection is impossible), then again click "Connect" and see a command prompt appear.
- After this, using the aws ssm command line on a local machine we can establish a connection to the host itself. However, this connection does not support port forwarding to other VPC instances. To work around this, we need to issue an ssh tunneling command as discussed in this Reddit post and this manual: we connect to the instance using SSM, then over that connection we establish ssh tunnels to the databases. So, if we were doing things on the command line, the call we want to make would look like this (the parameters are quoted in square brackets):
#> aws configure
#> ssh -i [my-secret-key.pem] ec2-user@[i-instanceid] -L 1024:my-rds-postgres-url:5432 -o ProxyCommand="aws ssm start-session --document-name AWS-StartSSHSession --target %h --parameters portNumber=%p"
The -L option establishes the tunnel, and you can supply several such options to a single ssh command. The command above forwards localhost:1024 to my-rds-postgres-url:5432.
Enable global setup and teardown
Everything should be automated as much as possible, so with the same jest command I would like to establish the tunnel, rewrite the URLs of private resources to point to the tunnel, and run the tests as usual. In my case, the relevant code checks for tunnel-related environment variables: VPC_JUMPHOST_INSTANCEID (the instance ID used for the tunneling) and VPC_JUMPHOST_PEMKEY (the PEM key assigned to the instance when it was created).
The jest test framework offers a way to specify global setup and teardown files, which fit perfectly for establishing and dismantling tunnels. For the following piece of code to work, remember to run aws configure to authenticate with AWS - the credentials will be stored in a local file and used by the AWS CLI command that the snippet invokes. The aws ssm and ssh commands should be available on the command line.
To run the global setup and teardown, use the following lines in jest.config.js:
module.exports = {
  ...
  globalSetup: "./tests/globalSetup.js",
  globalTeardown: "./tests/globalTeardown.js"
};
Create tunnel before running tests and dismantle it after tests are finished
./tests/globalSetup.js contains the following code. It uses the tunnelForwardingData structure to define which environment variables the script needs to alter and which regexp to use to locate the hostname - currently, the hostname must be found in the first matching group of the regexp (inside the first parentheses within the regexp).
const {exit} = require("shelljs");
const {exec} = require("child_process");

const tunnelForwardingData = [
  {
    envVariable: "POSTGRES_CONNSTRING", // a URI-formatted connection string, such as "postgresql://postgres:password@mycluster-somenumber.eu-west-1.rds.amazonaws.com:5432"
    regexp: /.+@(.+)/ // regexp to extract the host to replace; the host and port must be the first matching group.
  }
]

module.exports = async () => {
  console.log("This is Global Setup. Checking environment and creating tunnel to VPC resources for local testing.");
  const VPC_PEMKEY = "VPC_JUMPHOST_PEMKEY";
  const pemKey = process.env[VPC_PEMKEY] || "";
  const VPC_INSTANCE_ID = "VPC_JUMPHOST_INSTANCEID";
  const instanceId = process.env[VPC_INSTANCE_ID] || "";
  const SSH_EXECUTABLE = "SSH_EXECUTABLE";
  const sshExecutable = process.env[SSH_EXECUTABLE] || "ssh"
  // if the variables are not set, continue without
  // tunnel (CI/CD case, where we run tests inside the VPC GitHub Runner)
  if (!pemKey || !instanceId) {
    console.log("Global Setup: " + VPC_PEMKEY + " and/or " + VPC_INSTANCE_ID + " are not set. Not starting tunnel to VPC resources.");
    return;
  }
  console.log(`Global Setup: Tunnel variables set. Will create SSM/SSH tunnel to VPC.\n${VPC_PEMKEY} = ${pemKey}\n${VPC_INSTANCE_ID} = ${instanceId}\n${SSH_EXECUTABLE} = ${sshExecutable}`);
  let sshTunnelCommand = "";
  let localPort = 1024;
  tunnelForwardingData.forEach(tunnelItem => {
    let connStringInUriFormat = process.env[tunnelItem.envVariable] || ""
    if (!connStringInUriFormat) return; // skip variables that are not set, so new URL() does not throw
    const localHost = "localhost:" + localPort;
    const dbUrl = new URL(connStringInUriFormat);
    const remoteHost = dbUrl.host; // host and port
    const remotePort = dbUrl.port;
    connStringInUriFormat = connStringInUriFormat.replace(remoteHost, localHost);
    sshTunnelCommand += ` -L ${localPort}:${dbUrl.hostname}:${remotePort}`;
    process.env[tunnelItem.envVariable] = connStringInUriFormat;
    localPort++; // for the next cycle
  })
  // Creating a child process with the tunnel
  const command = `${sshExecutable} -i ${pemKey} ec2-user@${instanceId} ${sshTunnelCommand} -o ProxyCommand="aws ssm start-session --document-name AWS-StartSSHSession --target %h --parameters portNumber=%p"`
  global.tunnelProcess = exec(command);
  // One can handle various events to get more information on screen.
  global.tunnelProcess.on('close', (code) => {
    console.log(`Tunnel: child process close all stdio with code ${code}`);
  });
  global.tunnelProcess.on('exit', (code) => {
    console.log(`Tunnel: child process exited with code ${code}.`);
    if (code > 0) {
      console.log("Tunnel command resulted in an error. Please check configuration variables.")
      exit(1)
    }
  });
};
The child process is created and stored in the global.tunnelProcess object, which is available at the teardown stage. So, when the tests have finished running, ./tests/globalTeardown.js is called:
module.exports = async () => {
  console.log("This is Global Teardown. Checking environment and deregistering tunnel to VPC resources for local testing.");
  if (global.tunnelProcess) {
    console.log("Global Teardown: Found an active tunnel process, dismantling.");
    global.tunnelProcess.kill("SIGKILL");
  }
};
Conclusion
I would like to thank everyone who devoted some time to reading my blog posts; feel free to email me with any comments and suggestions.
Read more about Futurice's AWS services here
- Askar Ibragimov, Cloud Architect and Senior Developer