Executing an API load test

Preparing for execution

You have requirements to measure your results against, a workload model based on historical data, and scripts that should simulate user journeys through your application as realistically as possible. You’re ready to execute!

Here are some things you might want to do before you officially kick off a test.

Go through the Execution Checklist. If you don’t already have one, now is a good time to create one. An Execution Checklist is just a list of steps that need to be followed anytime anyone kicks off a test and will include preparatory activities such as those listed below. This is handy even if you’re the only one that will be running tests, because it’s easy to forget something and skip a step, but it’s especially important when there is a team of testers that may need to execute load tests.

Send out a notification to all stakeholders. If there are teams that you’ll need to have on call, either to monitor the test or to quickly resolve any issues that might come up, make sure to check in with them before running a test to confirm that they’re willing to support you. Email is a great way to send pre-test communications, and each message should include at least the following information:

- What you’re testing: what components do you expect to hit? Where will the traffic likely come from?

- When you’re testing: what date and time are you starting the test and what is its expected duration?

- Why you’re testing: what is the purpose of this test?

- How much you’re testing: how many users or requests per second is your test scenario expected to hit?

Set up monitoring. If you’re monitoring components yourself, start the PerfMon counters, deploy monitoring agents, and bring up the dashboards you will need to monitor the test. If you’re relying on another team to monitor some components for you, reach out to them to confirm that everything is set up and ready to go.

Update the Execution Log. This is simply a record of all tests that have been carried out and should contain this information:

  • Run ID: Give each test a unique ID to be able to refer to it later

  • Date, time and duration of each test so that you can look back at metrics later

  • Purpose of the test

  • Load profile

  • Any other configuration tweaks from the default that were used for this particular test (such as running with a different set of users, or running against a specific database)

  • A quick summary of results (after the test)

Unlike the communications sent out to stakeholders, the Execution Log is for the testers; it will help them determine what has already been done and provide them an easy way to reference tests and remember which one is which.
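If you'd rather keep the Execution Log in a file than in a shared document, something like the following could work. This is only a sketch: the CSV format, the field names, and the `append_entry` helper are assumptions made for illustration, and they simply mirror the list of items above.

```python
import csv
from dataclasses import dataclass, asdict, fields
from pathlib import Path


@dataclass
class ExecutionLogEntry:
    """One row of the Execution Log. Field names are illustrative."""
    run_id: str
    start_time: str            # date and time the test started
    duration_minutes: int
    purpose: str
    load_profile: str
    config_changes: str = ""   # tweaks from the default configuration
    results_summary: str = ""  # filled in after the test


def append_entry(log_path: Path, entry: ExecutionLogEntry) -> None:
    """Append an entry to the CSV log, writing a header row for a new file."""
    is_new_file = not log_path.exists()
    column_names = [field.name for field in fields(ExecutionLogEntry)]
    with log_path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=column_names)
        if is_new_file:
            writer.writeheader()
        writer.writerow(asdict(entry))
```

A tester would call `append_entry` once before kicking off a run and update the summary afterwards; a spreadsheet does the same job if your team prefers one.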

Shake out the environment by running a load test with fewer users to make sure everything is working as expected.

Shake out the data. You may have run your script a few times with some of the data, but if you have a large data set, you may want to double-check that the data is in the state you expect. For instance, if your test scenario includes users with certain permissions doing a privileged action, ramping up without shaking out your test data may lead to a lot of data-related errors if a lot of the users turn out not to have the right privileges. To test this without ramping up and incurring the associated costs, start a smaller-scale test (on a smaller sample of nodes) and leave the script running overnight, just to loop through all of the data (at a very low load, or even with just one thread) and determine whether the data is in the expected state.
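The idea of looping through every data row once, on a single thread, can be sketched outside of your load testing tool as well. The example below is an illustration rather than a replacement for running your real script: `shake_out_data` and the `check` callable are hypothetical names, and in practice `check` would perform whatever request or lookup tells you whether a row is in the expected state.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def shake_out_data(rows: Iterable[Row], check: Callable[[Row], bool]) -> List[Row]:
    """Run `check` once per data row, one row at a time, and return the rows that failed."""
    failures = []
    for row in rows:
        if not check(row):
            failures.append(row)
    return failures


# Hypothetical test data: the scenario needs users who hold the "admin" role.
users = [
    {"username": "user1", "role": "admin"},
    {"username": "user2", "role": "viewer"},
]
bad_rows = shake_out_data(users, lambda row: row["role"] == "admin")
```

Any rows that come back can be fixed or removed before you ramp up, which is much cheaper than discovering them as errors at full load.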

When you’re ready to begin execution, check log settings. Debug mode is for shakeouts, not for execution. Logging takes up a significant amount of memory and disk space, which can severely impact the consistency and accuracy of your results. Excessive logging while ramping up to full load also produces large results files that may be unwieldy to transfer after the test.

Watch out for listeners in JMeter that are configured to save more results than are necessary. For instance, when using the View Results Tree listener, make sure that “Log/Display Only: Errors” is checked; otherwise you will also log successes. Another common mistake with this listener is leaving the “Save as XML” option checked in the configuration settings, which saves all responses as XML. A good configuration for this listener logs errors only and leaves “Save as XML” unchecked.

This ensures a minimal footprint for the logs while still logging enough errors to diagnose problems.

Custom logging can also be dangerous. Instead of writing code (for Gatling or via JMeter’s BeanShell or Groovy samplers) to capture responses, consider anticipating errors and checking for them. For example, when testing a login scenario, consider adding assertions that the response doesn’t contain “Username or password invalid”, “Account blocked after 5 unsuccessful attempts”, “Account suspended”, “Account already logged in”, and other errors that may come up. You’ll be dealing with a larger data set than normal, so some data-related errors are bound to creep in. Adding a way for your tool to identify these errors will save you time in determining the causes of errors.
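The logic behind these assertions is simple enough to sketch in a few lines. The Python below is not JMeter or Gatling syntax; it only illustrates the idea of checking a response body against a list of anticipated error messages, and the `find_known_error` name is made up for this example.

```python
from typing import Optional, Sequence

# The error messages anticipated for the login scenario described above.
KNOWN_LOGIN_ERRORS = (
    "Username or password invalid",
    "Account blocked after 5 unsuccessful attempts",
    "Account suspended",
    "Account already logged in",
)


def find_known_error(
    response_body: str, known_errors: Sequence[str] = KNOWN_LOGIN_ERRORS
) -> Optional[str]:
    """Return the first anticipated error message found in the body, or None."""
    for message in known_errors:
        if message in response_body:
            return message
    return None
```

In your tool of choice, each message would typically become its own assertion or check, so that a failure is reported under a name that already tells you the cause.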

Avoid high transaction cardinality. Recall that transactions are an action or group of actions that you want to measure in a load test. The names of these transactions are set in your script. Every user or thread that runs that script will output the response to the requests within your transaction and save it under the name of the transaction. Every user and iteration will have the same transactions, and in a load test, all of the results are collated and tallied. When users are iterating over the same script in a load test, it doesn’t add much information to know User 1’s response time for the Purchase transaction compared to User 2’s (unless the accounts used differ in a way that would change the load profile on the server, such as having different roles). What matters is the response time of the Purchase transaction across all users.

High transaction cardinality occurs when transaction names vary from one user or iteration to the next. Normally this is an effect of using dynamic variables in transaction names.

Here’s what it looks like in Gatling:

exec(
  // Anti-pattern: the dynamic suffix creates a different transaction name for every user
  http("Home_" + username)
    .get("/")
    .check(status.is(200))
)

and in JMeter, where the same thing happens if a sampler’s name contains a variable, such as Home_${username}.

In both cases, the variable username is often the account username that the script signs up with or a counter, which means that instead of having a single transaction called “Home” that you can compare across multiple users, you’ll end up with “Home_user1”, “Home_user2”, etc. This will make result compilation unnecessarily tedious. Don’t do this.
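One way to catch this during a shakeout is to compare the transaction names in your results against the names you expect. The sketch below assumes you can extract the list of transaction labels from your results file; the `unexpected_labels` function and the expected names are hypothetical.

```python
from typing import Iterable, List, Set


def unexpected_labels(labels: Iterable[str], expected: Set[str]) -> List[str]:
    """Return the distinct transaction names that are not in the expected set, sorted."""
    return sorted(set(labels) - expected)


# Labels as they might appear in a results file after a short shakeout.
observed = ["Home", "Home_user1", "Purchase", "Home_user2", "Home"]
surprises = unexpected_labels(observed, expected={"Home", "Purchase"})
```

A long list of surprises that all share a prefix is a strong hint that a variable has made its way into a transaction name.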

Run your test in CLI mode. JMeter’s GUI mode is great for writing and debugging test scripts, and potentially even for shakeouts, but all load tests should be executed in CLI or non-GUI mode. With JMeter, you can do this by executing this command in your terminal:

jmeter -n -t /filepath/test.jmx -l /filepath/log.jtl

If you get the error jmeter: command not found, this means that you’ll need to add JMeter to your PATH environment variable. To do this on macOS, execute this command, replacing the path with the location of the bin directory in your own JMeter installation:

export PATH=$PATH:/Users/nvanderhoeven/jmeter/apache-jmeter-5.0/bin

This will allow you to execute the previous command successfully and run your JMeter script.
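If you kick off tests from a script rather than by hand, you can check for the missing-PATH problem before the run starts. This is a minimal sketch that wraps the same `-n`, `-t`, and `-l` options shown above; the function names are made up for illustration.

```python
import shutil
import subprocess
from typing import List


def build_jmeter_command(test_plan: str, results_file: str) -> List[str]:
    """Build the non-GUI JMeter command: -n (non-GUI), -t (test plan), -l (results log)."""
    return ["jmeter", "-n", "-t", test_plan, "-l", results_file]


def run_jmeter(test_plan: str, results_file: str) -> int:
    """Run JMeter in non-GUI mode and return its exit code."""
    if shutil.which("jmeter") is None:
        # Same situation as the "command not found" error in a terminal.
        raise FileNotFoundError("jmeter is not on PATH; add its bin directory first")
    completed = subprocess.run(build_jmeter_command(test_plan, results_file), check=False)
    return completed.returncode
```

Calling `run_jmeter("/filepath/test.jmx", "/filepath/log.jtl")` would then behave like the terminal command, with a clearer message when JMeter can't be found.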

Baselining

Finding a baseline is one of the primary goals you’ll have as you start executing load tests. To assess how changes in the environment or code affect application performance, you’ll need a stable point of comparison. To compare two states of the application that differ in one variable (say, before and after a code change), keep as many of the other circumstances surrounding the test fixed as possible. The baseline is your unchanging test scenario that you can use as a point of comparison to assess future performance.

For example, consider two tests A and B. After the results of A were shown to the team, the developers made a change in how the server caches requests.

A: 1000 users, 1 hour, with an average response time of 3 seconds.

B: 100 users, 10 minutes, with an average response time of 1 second.

Can you draw any conclusion about whether or not the change affected performance? The answer, of course, is no: there are too many variables in the two tests to be able to say for sure. The change could have decreased the response time, but that response time decrease could have also been caused by the smaller number of users or the shorter duration.

Instead, what you want to do is run the same test:

A: 1000 users, 1 hour, with an average response time of 3 seconds.

B: 1000 users, 1 hour, with an average response time of 2 seconds.

C: 1000 users, 1 hour, with an average response time of 1 second.

In this second set of tests, it’s a lot easier to see the effect of the changes on performance. Because everything else stayed the same, the drop from 3 seconds in Test A to 2 seconds in Test B and 1 second in Test C can be attributed to the changes applied before each run.

For this reason, here are a number of things you’ll want to settle on for your baseline and then keep fixed:

- number of users

- duration of test

- think time, pacing, and all other waits and delays

- the script (including how requests are broken up into transactions)
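A quick way to confirm that two runs are comparable is to diff their settings before comparing their results. The sketch below is one possible approach: the `LoadTestConfig` fields follow the list above, while the think time and script version values are invented for the example.

```python
from dataclasses import dataclass, asdict
from typing import Dict, Tuple


@dataclass(frozen=True)
class LoadTestConfig:
    """The settings that should stay fixed between baseline runs."""
    users: int
    duration_minutes: int
    think_time_seconds: float
    script_version: str


def differing_settings(baseline: LoadTestConfig, candidate: LoadTestConfig) -> Dict[str, Tuple]:
    """Return each setting that differs as {name: (baseline value, candidate value)}."""
    base, cand = asdict(baseline), asdict(candidate)
    return {name: (base[name], cand[name]) for name in base if base[name] != cand[name]}


# Tests A and B from the first example: two settings changed at once.
test_a = LoadTestConfig(users=1000, duration_minutes=60, think_time_seconds=5.0, script_version="v1")
test_b = LoadTestConfig(users=100, duration_minutes=10, think_time_seconds=5.0, script_version="v1")
changed = differing_settings(test_a, test_b)
```

If `changed` is anything other than empty, the difference in response time can't be attributed to the code change alone.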

User Density

Another thing you’ll want to baseline is how many users you can run per node, or user density. As much as we may want to look for industry standards on this, prepare to spend a significant amount of time figuring out this number for yourself up front.

Figuring out user density is essential because all load generators, even virtual ones, have finite resources. This means that each load generator will also have a finite amount of load that it can generate, based on its CPU and memory utilisation, among other things. Trying to generate too many users on a single node may result in the node itself being the bottleneck for your test.

If you’re trying to test how much water a bucket can handle before it overflows, make sure your tap is fully open. In order to accurately assess your application’s performance, make sure that the load generators display healthy resource utilisation.

It can be helpful to have a number to start with. At Flood, we’ve found that we can reliably run 1000 users using JMeter or Gatling with one of our AWS m5.xlarge machines. For reference, an m5.xlarge machine has 4 vCPUs and 16 GB RAM.

If your machine is similarly specced, run a test on a single node with 1000 users. While the test runs, watch the CPU and memory utilisation. If the test finishes without either of those consistently hitting over 80%, you’ll know that the node can handle that number of users.

Let me reiterate, though, that you should use the 1000 users figure as a starting point only. You can then figure out your number through trial and error. Increase the number of users past 1000 and watch the resource utilisation again. If that still looks good, add some more users and rerun. When you get to a test where the resource utilisation hovers above 80%, stop and fall back to the previous number of users. You’ll now have your number.
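The trial-and-error process above can be written down as a simple loop. In this sketch, `measure` is a hypothetical function that runs a single-node test with the given number of users and returns the sustained CPU and memory utilisation as percentages; the step size and upper limit are arbitrary assumptions.

```python
from typing import Callable, Tuple


def find_user_density(
    measure: Callable[[int], Tuple[float, float]],
    start_users: int = 1000,
    step: int = 250,
    threshold: float = 80.0,
    max_users: int = 10000,
) -> int:
    """Return the highest tested user count whose CPU and memory stayed at or below
    the threshold, or 0 if even the starting point was too much for the node."""
    best = 0
    users = start_users
    while users <= max_users:
        cpu_percent, memory_percent = measure(users)
        if cpu_percent > threshold or memory_percent > threshold:
            break  # fall back to the previous number of users
        best = users
        users += step
    return best
```

In real life each call to `measure` is a full test run, so you'll probably perform these steps by hand; the loop just makes the stopping rule explicit.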

Another thing you might want to play with is the think time and pacing in your script. These waits tend to have a huge impact on resource utilisation, so you can expect to be able to run more users per node if you increase your delays.

If you’re using JMeter, always run your tests in non-GUI mode. GUI mode is great for debugging, but is unnecessarily resource intensive for real load tests. While you’re at it, disable any listeners you may have that you don’t need to capture results.

Figuring out the appropriate user density now will save you from getting inaccurate test results. Running as many users as you can without overloading the load generators is also cost-effective, as you’ll be making sure that you provision only as many nodes as you need.
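Once you know your user density, working out how many nodes to provision is a matter of rounding up. A small sketch, with a made-up function name:

```python
def nodes_required(total_users: int, users_per_node: int) -> int:
    """Return the number of load generators needed, rounding up to a whole node."""
    if users_per_node <= 0:
        raise ValueError("users_per_node must be positive")
    return (total_users + users_per_node - 1) // users_per_node
```

For example, a 5000-user test at a density of 1500 users per node needs four nodes, because three nodes would only cover 4500 users.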

Scaling your load test

It’s relatively simple to run a load test on one machine, but if you want to run on two or more machines, things quickly get unwieldy. The specifics for how to do this are out of the scope of this ebook, but I do want to run through your options for achieving this.

1. Upload your tool of choice and script to every machine and kick off each test separately.

Instead of running one large load test, think of this method as running several smaller ones at (about) the same time. This is relatively easy to set up, but the main disadvantage is that it’s tedious. You’ll want to ensure that each load generator is as similar as possible to the others in terms of operating system, tool version, and script version. Since each node will kick off a separate test, you’re not going to be able to see a real-time combined view of all the load tests, and after execution you’ll need to retrieve results files individually and combine them.

2. Use your tool’s distributed testing mode. Both JMeter and Gatling have a feature that will allow you to scale. Essentially this will involve setting up agents on each load generator and using scripts to coordinate execution and results collection. This requires a little bit more know-how and time to set up, but it’s a little more cohesive than the first method. Here are links on how to set this up for each tool:

Remote Testing with JMeter

Scaling Out with Gatling

3. Use a load testing platform. Distributed load testing platforms like Flood are popular for this because they take away all the setup considerations. This is the easiest option, especially for teams that are new to load testing or are perhaps less technical, because all the setup is done through a UI. Scaling out in this case just means uploading your script, choosing the number of nodes you want to run and in which region, and all the work is done for you in the background.

Metrics from your test are shown in real time and the results from each node are collated in one place for easy download if necessary. A disadvantage of this method is that you’ll need to pay for the service.
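If you go with the first option, combining the results files afterwards can be scripted. The sketch below assumes that each node produced a CSV results file with a header row and a `timeStamp` column holding epoch milliseconds, which is how JMeter writes CSV results by default; the `merge_results` name is made up, and other tools or settings will need a different approach.

```python
import csv
from pathlib import Path
from typing import Iterable


def merge_results(paths: Iterable[str], output_path: str) -> int:
    """Merge per-node CSV results into one file sorted by timestamp.

    Returns the number of samples written.
    """
    rows = []
    fieldnames = None
    for path in paths:
        with Path(path).open(newline="") as handle:
            reader = csv.DictReader(handle)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            elif reader.fieldnames != fieldnames:
                raise ValueError(f"{path} has different columns from the first file")
            rows.extend(reader)
    if fieldnames is None:
        raise ValueError("no results files were given")
    rows.sort(key=lambda row: int(row["timeStamp"]))
    with Path(output_path).open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The column check matters because nodes running different tool versions or settings may write different columns, which is one more reason to keep the load generators as similar as possible.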

Monitoring

While you’re running the test, monitoring is probably going to be your biggest concern. Here are a few things you need to watch out for.

Load Test Results

Ideally, you’ll have real-time stats of how your load test is going. It’s possible to run your load test and analyse the results later, but it saves a lot of time if you monitor your test in real time. Here’s what you need to be watching during the test:

- Error rate: This is probably the biggest one. If 100% of your transactions fail, there’s really no point continuing the test. Stop the test and debug the error before continuing.

- Response times of key transactions: Seeing higher than expected response times from transactions is not necessarily a reason to stop the test, but alerting the relevant team to the issue can save some time figuring out the issue later. Sometimes keeping a test running is necessary while logs are checked.

- Throughput: Check that the script is hitting the throughput that you expected, taking response times into account. This is an opportunity to see whether your script is falling short of the targeted load and fix it later. Sharp peaks or drops in throughput should be investigated (if they’re not part of your test scenario).
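Two of these checks come down to simple arithmetic. The sketch below computes an error rate and uses Little's Law to estimate the throughput you should expect from a given number of users. It assumes a closed workload in which each user sends one request, waits for the response, and then pauses for the think time before repeating; the function names are made up for illustration.

```python
def error_rate(total_requests: int, failed_requests: int) -> float:
    """Return the fraction of requests that failed (0.0 when nothing has run yet)."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def expected_throughput(users: int, avg_response_seconds: float, think_time_seconds: float) -> float:
    """Estimate requests per second using Little's Law: users / (response time + think time)."""
    return users / (avg_response_seconds + think_time_seconds)
```

With 1000 users, a 1-second average response time, and 9 seconds of think time, you'd expect about 100 requests per second. If response times climb to 11 seconds, the same users can only generate about 50, which is why throughput has to be read alongside response times.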

Server health

You’ll want to be monitoring the health of all servers, and that includes your load generators. In fact, load generator health is critical as over-utilising the load generator itself is a sure-fire way to render your results inaccurate or inconsistent and therefore unusable. The load generator is just like any other machine: attempt to do too much on it and it will become sluggish. Response times are measured from the time that the request is sent by the tool to the time that the tool gets a response back, so they are influenced by sluggishness in the load generator. You don’t want your load generator to be the cause of the performance issue, so make sure you monitor every test while you’re running it.

You’d also ideally do this for important components in the architecture of your application, if at all possible.

Test duration and quantity

You’re ready to kick off the test! But how long should you let each test run? And how many tests should you run for each test scenario identified in the planning phase?

It’s all about sample size.

Sample size is the number of observations you record before drawing a conclusion. A sample size that is too small can drastically affect the reliability of that conclusion, because you don’t have as much to go on. For example, if you want to know how many people know what load testing is, you may want to ask more than two people. Similarly, when running a load test, we want to make sure that our sample size is large enough that we can draw conclusions from the results of the load test and extrapolate the performance of our application from them.

In load testing, we can look at sample size from two aspects.

The duration of a test affects the sample size because the longer a test runs, the higher the number of transactions that can be executed. The type of test scenario will also affect your test’s duration. For instance, a soak test, which aims to measure how your application responds to a sustained load, will be insufficient if you run it for five minutes. The same five minutes, however, might be perfect for a spike test, where you want to test a high number of transactions within a very short timeframe. In general, though, I would be wary about drawing conclusions from a peak load test that lasts less than thirty minutes.
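To get a feel for how duration translates into sample size, you can multiply the throughput by the length of the test. A rough sketch, assuming a steady throughput for the whole run (which ignores ramp-up) and using a made-up function name:

```python
def expected_samples(throughput_per_second: float, duration_minutes: float) -> float:
    """Estimate how many samples a test collects at a steady throughput."""
    return throughput_per_second * duration_minutes * 60
```

At 10 requests per second, a five-minute test collects about 3000 samples, while a thirty-minute test collects about 18000.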

The number of tests that you run also affects sample size in that each test gives you new observations that you can use to form an educated hypothesis. It’s a good idea to run more than one test for each test scenario. Relying too heavily on the results of a single test, especially if it’s a short one, may be dangerous because something as simple as a scheduled batch job executing at a certain time every day might skew your results.
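When you do have several runs of the same scenario, one way to see how much they agree is to look at the spread of their average response times. The sketch below uses the coefficient of variation (standard deviation divided by mean); that choice of statistic and the function name are assumptions for this example, and there is no universal cut-off for what counts as too much variation.

```python
from statistics import mean, stdev
from typing import Sequence


def run_to_run_variation(average_response_times: Sequence[float]) -> float:
    """Return the coefficient of variation across runs of the same scenario."""
    if len(average_response_times) < 2:
        raise ValueError("at least two runs are needed to measure variation")
    return stdev(average_response_times) / mean(average_response_times)
```

A value close to zero means the runs tell the same story. A large value suggests that something outside your test, such as that scheduled batch job, affected at least one of them.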