Planning for API load testing
Every good test begins with requirements.
You may think, “But I’m a tester! I’m not a business analyst!”, and I hear you. But as testers, especially on smaller teams or shorter projects, it’s still part of our job to make sure we know what our clients want. Otherwise, how do we know whether we’ve succeeded or failed?
I often speak to project teams who ask for help interpreting reports. Most of the time, the problem is not that they don’t understand the metrics or what they measure; it’s that they never set pass or fail criteria in the first place, so they have nothing to judge the results against. This is always a big warning sign.
Requirements inform every step of the load testing process. Why are we doing load testing? What exactly do we want to test? How will we know when a test has passed or failed? How will we know if application performance is good enough to go into production? What does “good enough” mean?
Scalability is the application’s ability to cope with increasing demands by increasing the amount of server resources. This could mean scaling up (increasing the resources of the dedicated server) or scaling out (adding more nodes to shoulder the load). What happens when more users than expected sign up in response to a promotion on your site?
The most common performance metric is page response time, but there are other considerations here, such as throughput (requests per minute) and the number of concurrent sessions that need to be supported. Things like the total size of the resources on the page, whether or not a CDN is being used, and what to cache are also worth discussing.
Elasticity is a relatively new aspect of performance testing, brought about by advances in the cloud that allow application infrastructure to adapt to changes in load. Unlike scalability, elasticity emphasises scaling down as much as it does scaling up. Testing that virtual machines scale up when load increases is important, but testing that they also scale down when load decreases can help save on unnecessary costs.
To test for high availability, ask yourself what would happen when (not if) your application’s server fails. Is there another server that the load balancer will seamlessly send traffic towards? Does the throughput fluctuate wildly? If users are connected to one server that fails, is your application smart enough to make new connections to another server? Or will it simply serve up an error page that users won’t know what to do with? Disaster recovery is best tested when there’s no disaster imminent.
Reliability encompasses a lot of scenarios, but they all have to do with whether or not your application returns expected responses. Does your error rate increase when you increase the duration of your load test? Are you adding verification steps to your load testing scripts to check that an HTTP 200 response from the application isn’t actually an error page?
What should my application’s response time be?
Our clients frequently ask us what the industry standard is for response times, wanting to make sure their applications measure up. The answer, however, is more complicated than a single number.
Industry standards for response time are only useful when applications are very similar. Constantly changing technologies used in web development as well as innate differences in business processes, however, make it very difficult to extrapolate a single number that will apply to all, or even most, applications in a certain industry.
The home page of one e-commerce app, for instance, might be several seconds slower than that of their main competitor. However, that doesn’t take into account the fact that their app loads a video showcasing new products. Does that mean that the development team should remove the video in order to fall in line with their competition?
Well, maybe. But not necessarily. It’s a business decision that needs to be made after perhaps using focus groups to determine the impact of the video, forecasting changes in conversion rate due to it, and comparing its projected value to the effects of being slower than the competition. A/B experiments could be used to test these assumptions and gather quantifiable data to support the team decision.
These factors are often not considered in the search for one number to rule them all, which is why a fixation on that number can be detrimental. Instead, I encourage project teams to brainstorm and come up with their own numbers for all metrics that would be more appropriate for their application. Gathering comparative metrics from a competitor may be part of this process.
What is a good API load testing requirement?
A good requirement, just like any good goal, is SMART:
Specific. Vagueness in a requirement leads to vagueness in results interpretation.
Instead of: “The performance of the web application”
Try: “The average response time of the Login transaction”
Measurable. Make sure there is a quantifiable way to know whether requirements have been achieved.
Instead of: “Decrease user frustration”
Try: “The error rate must be below 3% at peak load”
Agreed Upon. Have the appropriate stakeholders been involved?
Instead of: “The system must be able to generate emails as soon as users register”
Try: “The system must be able to generate a maximum of 100 emails an hour, after which emails are queued”
A good example for this is a project I was involved in that aimed to increase the speed with which emails were generated. Unfortunately, the particular email being sent included a big change that many customers were expected to contact Support about, and nobody had looped in the Customer Support team. They quickly raised their concern that they would not be able to handle the expected volume of emails unless the emails were staggered. This could have been avoided if they had been brought into discussions from the very beginning, in the requirements gathering phase.
Realistic. Can we meet this requirement given the resources available?
Instead of: “All requests must be returned within 5 ms”
Try: “90% of the Catalog page requests should be returned within 3 seconds”
Timely. Especially for nonfunctional testing, consider adding a timeframe to requirements.
Instead of: “The digital code is sent by SMS upon successful client log in at peak load”
Try: “The digital code will be sent by SMS no later than 5 minutes after successful client log in at peak load”
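Measurable requirements like the ones above can also be encoded directly as automated pass/fail checks against load test results. Here’s a minimal Python sketch using the example thresholds from this section; the percentile method and the result format are my own assumptions, not a prescription:

```python
# Hypothetical pass/fail check for two of the requirements above:
# "90% of the Catalog page requests should be returned within 3 seconds"
# "The error rate must be below 3% at peak load"

def p90(values):
    """Return the 90th percentile of a list of numbers (nearest-rank)."""
    ordered = sorted(values)
    index = max(0, int(round(0.9 * len(ordered))) - 1)
    return ordered[index]

def evaluate(response_times_s, error_count, total_requests):
    """Judge one test run against the two example requirements."""
    results = {
        "p90_under_3s": p90(response_times_s) <= 3.0,
        "error_rate_under_3pct": (error_count / total_requests) < 0.03,
    }
    results["passed"] = all(results.values())
    return results

# Example run: 100 samples, mostly fast, a few slow, 1 error in 100 requests
times = [0.8] * 85 + [2.5] * 10 + [6.0] * 5
print(evaluate(times, error_count=1, total_requests=100))
```

A check like this can run automatically after every test, which keeps the pass/fail decision tied to the agreed requirements rather than to gut feel.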
Start with why
Requirements represent a great opportunity to think things through and make sure everyone on the project team is on the same page about the goals for your load testing. Too often projects skip this phase, only to realise much later that the tests that were executed didn’t address a key stakeholder’s concerns.
In load testing, as in many things: when in doubt about what to do, start with why.
The discussion of scope among stakeholders should also occur during the requirements gathering process. There’s a lot of pressure to fit as much as possible into a sprint, but it’s important to stop and think about how much work is realistic to include.
Business priorities need to be weighed against the resource limitations (number of people available to do the work and time available) in order for testing to deliver maximum value.
Some considerations for scope include:
- Specific features or key transactions to be tested
- Types of tests included (component test vs end-to-end test)
- Test scenarios (peak load test vs disaster recovery)
- Applications included in testing
It’s also a good idea to list the things that will not be tested.
As with most things in the planning phase, scope is something that can change during the test when unexpected circumstances arise or when priorities change.
Entry criteria are conditions that need to be fulfilled before the testing actually begins. It’s a good idea to communicate these conditions beforehand so that everyone is clear on what needs to be set up before you can do your job.
For nonfunctional testing, there are several general conditions you might want to include in your entry criteria.
Load testing cannot realistically be carried out until at least the core functionality has been tested and high-severity defects have been fixed. Depending on the kind of load test you want to execute, you may also want to specify that user-acceptance testing (UAT) has been executed, as there’s no point doing an end-to-end load test with 1000 users if it doesn’t work for one user.
Nonfunctional testing has stricter requirements for an environment than does functional testing, and you may have to champion this cause. For load testing, it is not enough to have an application staging environment that is a virtual machine that is a quarter of the size of the production environment. It’s important to get as close to a production-like environment as possible in terms of capacity (memory, CPU), codebase (the actual build that will be deployed), and integrations with other environments or servers (if within test scope).
Load testing is not linear: a response time of 5 seconds on a server with half the capacity of the production server does not necessarily equate to a response time of 2.5 seconds in production.
Once you have as close a copy of the production environment as possible, keep in mind that it’s still a clean copy, which may not be realistic. If there are databases in production, how much data do they contain? The application server may respond differently when your test database is empty compared to when it must contend with gigabytes of data in the production database.
This is also the time to think about your load injectors. Will they be on-premises, or in the cloud? A good entry criterion is the availability of the machines in the right network and with the right tools installed. If you’re using commercial tools, license provisioning should be a criterion. What sort of capacity will your load testing scripts require?
Monitoring tools that you use during execution will also fall under environment criteria, but we’ll discuss those in a little more detail later.
Application Teams to Be Involved
Load testing is a team activity. When a load test involves multiple application teams, it’s important to request the availability of key persons on those teams during the test. As load testers, we are often seen as working independently, but nothing could be further from the truth. Load testing is a team sport. We need support from:
- business analysts who will be able to tell us how things are expected to work and what the current priorities are
- developers whom we can consult when poorly performing code needs to be optimised
- functional testers who can show us how the application works
- DevOps engineers who can help us provision and monitor servers
and many more!
There should be a test schedule drafted with the input of all key resources.
Test data should be examined in conjunction with the determination of the key business processes to be tested. For example, if part of the requirements involves testing user logins, where will the user credentials come from?
In general, there are three ways to get test data:
1. Take it from production. One way to do this would be to take a copy of the production data and use it for your test environment. Note that you may have to scrub sensitive data (such as customer information or passwords) before you can do this. Email addresses in particular should be scrubbed and ideally replaced with addresses at a fake domain (such as example.com) in case the test itself generates notifications that would otherwise be emailed to real clients.
An advantage of this is that it’s more production-like and may expose issues with non-Latin characters, for example, that may not otherwise have been tested.
2. Inject records into the database. This can often be the easiest way to generate test data. Your friendly DBA can create a script to randomly or sequentially generate the data from the back-end.
3. Write a script to generate the data yourself. Sometimes neither of the first two options is feasible, and in those cases your only other recourse is to create the test data yourself. A disadvantage to this approach is that it will take more time to create the data before you can run your test, but an advantage is that it’s another functionality that you’ve scripted that can perhaps be included in the test suite.
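As a rough illustration of option 3, here’s a minimal Python sketch that generates login test data as a CSV file. The field names, the volume, and the example.com addresses are illustrative assumptions, not a prescription:

```python
import csv
import random
import string

def generate_users(count, path="test_users.csv"):
    """Write `count` fake user records to a CSV file and return its path."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["username", "email", "password"])
        for i in range(1, count + 1):
            username = f"loadtest_user_{i:05d}"
            # Fake domain so that any emails triggered by the test
            # can never reach a real customer
            email = f"{username}@example.com"
            password = "".join(
                random.choices(string.ascii_letters + string.digits, k=12)
            )
            writer.writerow([username, email, password])
    return path

print(generate_users(5))
```

Most load testing tools can read a CSV like this directly, feeding each virtual user a distinct set of credentials.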
When running a test that either requires a significant amount of data or consumes the data in the course of the test, it’s a good idea to take a backup of the relevant databases and create a restore point after the data has been created but before the test is executed. That way you can save some time by falling back to that restore point after the test so that the data is in a known state each time.
A workload model is a schema describing the load profile for a given test scenario, and it involves determining what (the key transactions), how much (the load distribution among the transactions) and when (timing of the load) to test.
Workload modelling can be the most difficult part of the testing process because it involves finding out how load test scripting can best mimic what is actually happening in production. It can also be the most critical.
Imagine a project that runs load tests of the guest checkout process on their e-commerce site. Despite having tested up to 1,000 concurrent users on their production-like environment and getting sub-two-second response times, they get response times of greater than a minute in production before their servers start to go belly up. What could have gone wrong?
Well, a lot of things, but one thing they may not have taken into account is the standard user path. Perhaps they assumed that most customers would check out as guests without logging into their accounts, but in reality, 90% of their customers log in before checking out. This means that there’s a big gap in their testing: user login. Perhaps it wasn’t the main application server that ran into issues at all; it could have been their authentication servers.
Workload modelling ensures that you’re testing what you need to test, but it can be more complicated than it sounds.
You can begin building a workload model by gathering data from production about what users are actually doing on your application. Real historical data on traffic over a wide enough time period (six months to a year) is ideal, so that you get a broad view. If you have something like Google Analytics or other tools such as New Relic, even better.
But what if your application, or the specific feature you’re launching, is new, and you don’t have any historical data to look through? In this case, the answer is to involve business analysts as well as developers and estimate the traffic patterns. If you have data for other similar transactions, you can analyse them and extrapolate best guesses from them about the new transactions.
Whether or not you have the data, here are a few things you will want to determine as a team.
The first is the key transactions. Together with the scope and business need, these will determine what you script and what you actually test. Load testing should not be exhaustive. It may sound good to be able to script absolutely everything, but not everything needs to be load tested. Here’s how to determine what does:
- Business-critical or high risk transactions. Include anything that is vital to the success of the release. An example for this would be the final purchasing page on an ecommerce site or the submission of a contact details form for a request for quote. If it leads directly to conversion, script it and test it.
- Known pain points. A good place to start is any pages that your Customer Support team have reported as a pain point for customers. Let customers tell you what they want.
- Transactions that are technically complex. These transactions sometimes go through several application servers and are processed several times before a response is returned to the user. This means that they’re also the most likely to present issues when one or all of those servers are under load.
- High-traffic transactions. These are the pages that customers use the most. This could be a landing page or the product catalog page that may not be the most complex but is frequently visited. There may be some surprises here: the landing page might have been obvious, but perhaps you’ll find that many customers browse to the Contact Us page as well.
The load profile will tell you how the load is distributed over time. Since applications are used in different ways, it’s not enough to get an average hits per second over the last year and test at that level. You’ll need to understand which transactions those hits are going to.
Look specifically for trends over time. These could be yearly, seasonal, monthly or even daily patterns. Does traffic pick up in the month before the tax filing deadline? Do students access their enrolment portal just before the enrolments are finalised? Does management run a series of reports that take their toll on the server at the beginning of every month? Do employees usually navigate to the intranet page at the beginning of the day, and then again after lunch?
One of the most marked examples of this is the load testing required for a sports event. One of our favourite clients at Flood, Hotstar, simulated video streaming traffic to their site in preparation for the Indian Premier League in 2018.
After analysing the data they had from previous years, they decided on a load test that had this load profile:
The spikes in load correspond to the start of cricket games. This spiky profile means that they had to test higher levels of load for a short period of time, instead of testing with a lower user count for a longer time. In this case, using the average number of users throughout the day would have drastically oversimplified things. Given that their load tests required 4 million concurrent users, that small change in the workload model could have negatively impacted their results and performance during the games in production.
You can only determine what the “Peak” in traffic is for your application after you’ve seen historical data. If you’re like Hotstar and you want to test the performance on game day, don’t pick traffic from some other month where there are no games— take a look at your highest season traffic-wise and test for that. You may even want to increase those figures by 10% or so in order to leave some room for growth.
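Pulling these ideas together, a workload model can be sketched as a simple calculation: take the busiest period from your historical data, split it across the key transactions, and add headroom for growth. All of the figures below are hypothetical:

```python
# Sketch: turning historical data into a workload model.
# The traffic volume and transaction mix are illustrative assumptions.
peak_hour_requests = 180_000          # busiest hour observed in the last year

transaction_mix = {                   # share of traffic per key transaction
    "login": 0.30,
    "browse_catalog": 0.45,
    "checkout": 0.15,
    "contact_us": 0.10,
}

def workload_model(total_requests_per_hour, mix, growth=0.10):
    """Target requests/sec per transaction, with ~10% headroom for growth."""
    total_rps = total_requests_per_hour * (1 + growth) / 3600
    return {name: round(total_rps * share, 2) for name, share in mix.items()}

print(workload_model(peak_hour_requests, transaction_mix))
```

Even a rough model like this forces the team to agree on which transactions carry the load and at what rate, before any scripting begins.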
So, now that you know what you’re testing and when you’re testing, the next question is where you’re testing from.
The geographical location of your load generators affects the response times you’ll see. If you generate load from on-premises machines on the same corporate network as your application servers, there is very little latency, so the response times will be a lot faster than those of a client using your application from across the world.
Depending on where your customers live, latency can change the reported response time. In general, you would want to have a look at your analytics and see where most of your customer base is. Ideally, you’d want to generate load in those regions— remember, we’re trying to make your load testing as realistic as possible.
This is one big reason to switch from on-premises load generators to load generators in the cloud. Service providers like Amazon, Azure and Google allow you to provision machines with a few clicks and even select their location. Load testing in the cloud can be significantly cheaper (you don’t have to provision or maintain physical machines, and you pay only for the time you use them) and more realistic, because it allows you to approximate the effect of distance on your application response times. It also allows you to test the effect of Content Distribution Networks (CDNs), if you’re using any.
With more and more people browsing the internet on their mobile devices, it’s worth considering how big an effect users’ networks have on their experience of your application. If your app is an internal company portal, this probably won’t matter, but if it’s a web app that is meant to be accessed both on desktops and mobiles, you can’t discount the effect of slow 3G networks.
While you can’t control users’ network speeds, what you can do is simulate them by throttling the bandwidth available so that you can see how quickly your application would respond for them. Most load test tools can do this, and you may be able to determine how many of your users access your app on slower connections by looking at your analytics. If you decide that this is in scope, you can then build it into your scripts so that your tests will also show you response times on different types of networks.
Network bandwidth throttling in JMeter
In JMeter, you can control this by adding the following lines to your user.properties or jmeter.properties file (the value shown is an example for a 512 kbps connection):

httpclient.socket.http.cps=65536
httpclient.socket.https.cps=65536

These properties set the “characters per second” for HTTP and HTTPS connections and, when set to anything greater than zero, will allow you to simulate different speeds. Here’s a way to calculate the value to set here:
cps = (target bandwidth in kbps * 1024) / 8
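If you prefer not to do the arithmetic by hand, the formula translates directly into a small helper (Python here, purely for illustration):

```python
def bandwidth_to_cps(kbps):
    """Convert a target bandwidth in kilobits per second into JMeter's
    'characters per second' value, using cps = (kbps * 1024) / 8."""
    return int(kbps * 1024 / 8)

# e.g. simulating a 512 kbps connection
print(bandwidth_to_cps(512))  # 65536
```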
Network bandwidth throttling in Gatling
With Gatling, you can simulate this indirectly by throttling the number of requests per second. It’ll be a little more difficult to correlate requests per second to network bandwidth, but historical data from real mobile users of your application should help here. You can then multiply that by the number of users in your simulation and set up the throttling with something like this:
setUp(scn.inject(constantUsersPerSec(100) during (30 minutes))).throttle(
  reachRps(100) in (10 seconds),
  holdFor(30 minutes)
)
Most load testing tools will display some standard load testing metrics such as response times, throughput, error rate, and others, but you’ll also need to set up monitoring on the application components that you’re testing.
Executing a load test without monitoring server health is like flying blind. You’ll know when you land safely and you’ll know when you crash, but even if you do crash, you won’t know why—or how you can avoid it next time. Monitoring server health is the black box that will tell you what went wrong.
What Metrics to Monitor
There are a lot of metrics that you can monitor. Here are just a few:
- Processor Time* - how much the processor is being utilised
- Processor Interrupt Time - how much time the processor is spending to handle interrupts
- Processor Privileged Time - the time the processor spends handling overhead activities
- Processor Queue Length - the number of threads that are waiting to be executed
- Memory (Available Bytes)* - unused memory available to process new requests
- Memory Cache Bytes - the size of the data stored in memory for quick retrieval
- Disk I/O - reads and writes to the disk during the test
- Disk Idle Time - time that disks are not doing work
- Avg. Disk sec/Transfer - average number of seconds that an I/O request takes to complete
- Avg. Disk sec/Write - average number of seconds that a write request takes to complete
- Network I/O - bytes sent and received
And that’s just a small sample! How do you determine which ones to use?
If you’re not sure where to start: At a minimum, you’ll need the CPU and memory utilisation (with asterisks in the list above) of every major component that’s involved in the processing of requests. These two metrics are vital and if either of these is consistently maxing out at (or close to) 100%, that’s a sign that the component is struggling with the number of requests. CPU and memory over-utilisation is a very common reason for less-than-ideal response times.
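One way to make “consistently maxing out” concrete is to check what fraction of your monitoring samples sit near the ceiling. A minimal Python sketch; the 90% threshold, the sampling interval, and the sample values are all illustrative assumptions:

```python
# Sketch: flagging a component whose CPU or memory utilisation is
# consistently near its ceiling during a test.
def is_saturated(samples_pct, threshold=90.0, min_fraction=0.8):
    """True if at least min_fraction of the samples are at or above threshold."""
    if not samples_pct:
        return False
    above = sum(1 for s in samples_pct if s >= threshold)
    return above / len(samples_pct) >= min_fraction

# Utilisation samples taken every 10 seconds (made-up figures)
app_server_cpu = [95, 97, 92, 99, 96, 94, 91, 98, 85, 93]
print(is_saturated(app_server_cpu))
```

Checks like this are easy to wire into whatever exports your monitoring data, so a struggling component is flagged during the run rather than discovered in the post-test analysis.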
How to Monitor
How you get these metrics depends on your budget and your operating system. I’ll start with the free or lower cost methods and work my way up to enterprise solutions.
If your servers are running Windows, PerfMon is a good option. It’s built into Windows, and its interface lets you choose the counters you want to measure and start recording.
Moving from budget options to more enterprise solutions, DynaTrace is a powerful tool that not only tracks server health but can also trace individual requests using a custom header.
AppDynamics is another fantastic tool that allows you to really drill down to specific SQL queries that take a long time to execute, for example, feeding you important information to give to your DBAs.
Other noteworthy tools are the Microsoft System Center Operations Manager, the Oracle Enterprise Manager, and BlueStripe FactFinder.
Service virtualisation involves creating simpler versions of existing systems and services that are sufficiently realistic as to be able to replace the original component for testing purposes.
A stub is a stand-in that replaces a complicated component that is not within scope. It’s a “dumber” version that responds to requests just enough to let you carry on with your load testing without actually requiring that component.
Let’s say your application involves customers entering their credit card information upon checkout. That data is then sent to a payment gateway that then sends a message to your app to confirm receipt of payment. Often this payment gateway is outsourced to a third party, but you still want to test that your application saves the order information in your database and shows a “Thank you for your order” message after receiving payment confirmation.
It’s often not feasible to include your third party payment provider in your load testing, unless you want to actually be using real credit card numbers. Instead, what you can do is create a stub that will perform this function.
You likely don’t need “real” data, so the stub could take the input of a credit card number and send back a “Payment accepted; order number 123456” as the output to your application, allowing your load tests to continue without the payment gateway.
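Here’s a minimal sketch of such a stub using Python’s built-in http.server. The endpoint, request payload, and JSON response shape are assumptions for illustration; a real gateway’s API contract would dictate the actual format:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class PaymentStub(BaseHTTPRequestHandler):
    """Accepts any POSTed card details and replies with a canned confirmation."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # read and discard the card details
        body = json.dumps({"status": "Payment accepted", "order": 123456})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())

    def log_message(self, *args):
        pass  # keep the console quiet during load tests

def start_stub(port=0):
    """Start the stub on a background thread; returns (server, bound port)."""
    server = HTTPServer(("127.0.0.1", port), PaymentStub)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]
```

Pointing the application’s payment-gateway URL at this stub lets the checkout flow complete under load without ever touching the real provider.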
There are much more complex and full-featured enterprise solutions for this on the market, such as Tricentis Tosca Orchestrated Service Virtualisation (OSV). However, one good open-source tool I’ve found that is easy to set up and not too well-known is Mountebank.
If you are willing to put the time into creating a stub, you can drastically reduce the amount of resources you need to set up an environment and isolate components. Reducing variables in your test allows you to more quickly determine where performance bottlenecks lie.
Choosing your test scenarios means deciding which situation is most likely to yield the data that you require. Employing several different types of scenarios will give you a greater understanding of your application’s capabilities. You should feel free to create your own scenarios that are uniquely tailored to your requirements, but here are some common scenarios to start out with. Take the number of users and durations mentioned as guidelines and not rules.
I often see customers testing out their new load testing script by running it with 10,000 users. Such an approach exposes them to unnecessary costs in the provisioning of the machines for a test that might fail, which it often does.
Test out your scripts with a single user on your local machine. If that works, try it with two users. Then try it with ten users, with each one doing multiple iterations of the script. If that works, then you can consider doing a shakeout test on the cloud.
A shakeout test is a low-load test intended as a quick check that the script, the environment, and the entire set-up are working as expected. Depending on the application, a starting shakeout test could consist of anywhere from 1 to 100 users for about 10 minutes. The goal of a shakeout test is not to expose performance bottlenecks; it is a chance to verify that:
- the scripts are hitting the functionalities that were agreed upon with a low error rate
- the environment is fully integrated and functional
- the server monitoring is operational
- the right people for the involved application teams are on board and watching the test
It’s a rehearsal before the real tests begin.
Depending on your comfort level and the responsiveness of the application, you can increase the number of users gradually until you’re ready to ramp it up.
After shakeout tests are successfully executed (and no major errors are discovered), it’s time to move on to other test scenarios.
Peak Load Test
The peak load test involves the simulation of the number of users that you expect to see on your application in production during your busiest times. Unlike shakeout tests, it is likely to give you valuable information about performance bottlenecks.
The number of users for a peak load test will vary, but in general its duration will be about 30 minutes to 1 hour.
Soak Test
Instead of simulating your busiest time in terms of traffic, like a peak load test, a soak test simulates the effect of a lower but sustained level of load: the number of users on your application over several hours.
A soak test will typically involve fewer users than a peak load test, but it will usually run from 3 hours to a few days. The goal of a soak test is to see if there is any degradation in application performance over longer periods of time. A common finding from soak tests is that the application has a memory leak, which causes response times to degrade as the servers begin to struggle and become more sluggish. This is not always apparent during the quicker peak load tests.
Stress Test
A stress test is usually done after at least a peak load test, and maybe even a soak test, has successfully passed according to the requirements. It involves subjecting the application to more load than you ever really expect it to need. The goal of a stress test is to determine the bounds of an application’s capacity, with the ideal result being that it can support far more load than it currently needs to.
One good way of doing a stress test is doing a stepped load, where users ramp up to the expected peak load level, stay there for maybe 30 minutes, ramp up to another 100 or so users, sustain that for another 30 minutes, and so on until the application fails. Application failure can be judged to occur at the point that it no longer meets the nonfunctional requirements. The number of users that it comfortably maintained while meeting the requirements describes the outer bounds of its capacity.
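The stepped approach above is easy to express as a simple profile generator. The 100-user step and 30-minute hold come from the description; the 500-user peak and the number of steps are hypothetical:

```python
# Sketch of a stepped stress-test profile: ramp to the expected peak,
# hold, then keep adding users until the application fails.
def stepped_profile(peak_users, step=100, hold_minutes=30, max_steps=5):
    """Return a list of (users, hold_minutes) steps for a stress test."""
    return [(peak_users + step * i, hold_minutes) for i in range(max_steps + 1)]

for users, hold in stepped_profile(peak_users=500):
    print(f"hold {users} users for {hold} minutes")
```

In practice the test is stopped at whichever step first breaches the nonfunctional requirements; the previous step marks the application’s capacity.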
Failover Test
This type of test scenario assumes the worst: that one or some of your application servers have failed and are unavailable. The point of this test is to determine how gracefully your application recovers and how resilient it is to unexpected failures.
A typical use case for this is two application servers that share the user load. Apply load as normal, at peak load level or lower, and take note of the number of connections on each server. Each should be shouldering around half of the load. Then, shut down one of the application servers on purpose. The number of connections should halve temporarily as the system struggles to recover, and the users connected to the failed server should ideally see a friendly error page asking them to try again (you can check for this in your scripts) rather than an unhelpful one. After a few minutes, those users should be redirected to the remaining server, they should be able to carry out their tasks again, and the number of connections on that server should match the number at the start of the test, before the shutdown.
Unlike other types of tests, the error rate is less important in the failover test: having a server shut down in the middle of a test is going to produce errors for even the hardiest application. The real test is how well and how quickly your system recovers.
For those wanting to go even further and test other catastrophic events, I really like Netflix’s Simian Army approach, which conjures up the image of monkeys being set loose in a server room (basically a DevOps engineer’s nightmare). The Chaos Monkey, for instance, randomly shuts down a node.
A note about concurrency
So far in this book, I’ve talked a lot about the number of users as if it were a measure of throughput (how much load is being generated), but that’s actually a bit of an oversimplification.
More users doesn’t necessarily mean more load. For example, 1000 users could send a combined total of 100 requests per minute, while 100 users could send 1000 requests in the same amount of time. A user that clicks every link on the page and triggers requests to multiple servers will have a different load profile to a user that just navigates to one page and refreshes it.
So there’s a missing variable here, and that’s the number of requests per second, which is the more accurate measure of test throughput. Most of the time, when people talk about user concurrency, what they’re actually looking for is a way to express how much load they want to apply.
If this is the case, the savvy load tester (that’s you) can reduce the number of users and increase the throughput of each user. This is because most load testing tools require more resources to increase the number of users than to increase the throughput. Reducing the number of users while increasing throughput will maintain the expected load on the server while reducing the number of machines that need to be provisioned (and paid for).
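The arithmetic behind the earlier 1000-user example makes the trade-off concrete:

```python
# Sketch: the same idea as the example above, in numbers.
# Throughput here is measured in requests per minute.
def total_rpm(users, requests_per_user_per_minute):
    """Overall load generated by a group of virtual users."""
    return users * requests_per_user_per_minute

# 1000 "slow" users, each sending one request every 10 minutes...
print(total_rpm(1000, 0.1))
# ...generate far less load than 100 "busy" users at 10 requests/minute each
print(total_rpm(100, 10))
```

To hold overall load constant while cutting the user count, you increase each user’s request rate (usually by shortening think time or pacing) in inverse proportion.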
However, there are some situations where the number of users actually does matter.
User Concurrency Test
This is one situation in which using fewer users with increased throughput would not be sufficient. This tests how the system handles a certain number of users that log in and periodically refresh a page in order to maintain a connection. This is a specific type of scenario that will require a different script to execute, since it’s less about generating load through transactions and more about just maintaining the user sessions on the server. This type of script can also be included as part of the testing suite for the other scenarios, but it can be useful to run this independently to measure just the effect of the concurrent user sessions on the server.
In the next section we’ll actually begin scripting your load tests.