Analyzing results and reporting

What are test results?

We tend to place a lot of importance on load test scripting or execution, and those things are important, but it’s all for nothing if you don’t know what the results mean. In order to run good load tests, you’ll need to get comfortable dealing with data. The quantitative data from a load test may include:

- Data generated by the load testing tool itself while executing the script, including response times and error rates but potentially also debugging information. You’ll get this from your tool’s results file as you specified in the Saving Results section or from Flood’s dashboard.

- The resource utilization metrics of every load generator, to rule out any execution issues.

- The resource utilization on your application servers to determine how your application actually performed under load. You’ll get these metrics from each server.

We’ll refer to all of these as “test results”. The first step will be to collate all of these results in one place. Then we can delve into actually making sense of those results.

Let’s run through the different terms you’re likely to come across in your testing. For each one, we’ll go over what it means and what it says about your application performance.

Load testing metrics

Below are some common metrics that are generated by your load testing tool. They pertain to the load test scenario you executed and will give you insight into the user experience (in terms of response time) of your application under load.

Concurrent users are the number of users running at the same time. In JMeter or Gatling speak, this is also referred to as the number of threads. Given the same script, a larger number of users will increase load on your application. Note that user concurrency doesn’t say anything about throughput. That is, having 1000 users doesn’t necessarily convert to 1000 requests per second. Nor does it mean that those 1000 users are all actively using the application. It only means that there are 1000 instances of your script that are currently ongoing, and some of those could be executing think time or other wait times that you’ve scripted.
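To see why concurrency and throughput are different things, here is a rough sketch that estimates requests per second from the number of users, the response time, and the think time. It assumes a closed workload in which every user repeats one request followed by one pause (an application of Little's Law); the function name and the figures are illustrative only.

```python
def expected_throughput(users: int, response_time_s: float, think_time_s: float) -> float:
    """Approximate requests per second for users who loop: request, then think."""
    cycle_time_s = response_time_s + think_time_s
    if cycle_time_s <= 0:
        raise ValueError("response time plus think time must be positive")
    return users / cycle_time_s


# 1000 users, each request taking 0.5 s followed by 9.5 s of think time
print(expected_throughput(1000, 0.5, 9.5))  # 100.0
```

In this made-up case, 1000 concurrent users produce only about 100 requests per second, because each user spends most of every cycle waiting.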

Response time is the time between when the transaction was sent by the load testing tool on the load generator and the time that a response was received. This is the time to last byte. Since it is measured by the load testing tool, it includes things like latency and is affected by bottlenecks on the generator (such as high resource utilization). Both JMeter and Gatling ignore think time when calculating this. Response time is a useful metric to look at when trying to get an idea of how long your application took to process requests. Once latency and generator bottlenecks have been ruled out, a high response time means a longer processing time.

The transaction rate measures the throughput of your load test. JMeter sometimes reports throughput in terms of samples per second, which is a similar concept but not the same. Generally speaking, a JMeter sample is a single request, and multiple requests can be grouped into transaction controllers. So samples and transactions are not interchangeable, but they do both describe how quickly your test is sending requests to your application. The transaction rate, more than the number of concurrent users, better describes the load your application is handling. You can expect a higher transaction rate to correspond to higher load.
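The difference between samples and transactions is easiest to see with numbers. The sketch below assumes a hypothetical test in which every transaction groups exactly four requests; real scripts will vary, and all of the figures are invented for illustration.

```python
def per_second(count: int, duration_s: float) -> float:
    """Average rate of events over the test duration."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return count / duration_s


duration_s = 600               # a 10-minute test (hypothetical)
transactions = 12_000          # completed transactions (hypothetical)
samples_per_transaction = 4    # assumed: each transaction groups four requests
samples = transactions * samples_per_transaction

print(per_second(transactions, duration_s))  # 20.0 transactions per second
print(per_second(samples, duration_s))       # 80.0 samples per second
```

Both numbers describe the same test, which is why it matters to state which one you are reporting.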

The failed rate/error rate is normally expressed as the number of failed transactions divided by the total number of transactions that were executed. It is often represented as a percentage: an error rate of 40% means that 40% of all transactions failed. Whether a transaction failed or not is determined by the script and can be caused by many issues. A transaction could fail because a verification did not find the expected content in an unexpected response (such as an error page), or because of a connection timeout after the load testing tool waited too long for a response. High error rates are an indication of either script errors or application errors and should never be ignored.

The passed rate is similar to the error rate but measures the other side of the coin: it expresses how many of the transactions during the test passed.
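As a quick illustration of the two definitions, the following sketch computes both rates from raw counts. The function names are our own, not something your load testing tool provides.

```python
def error_rate_pct(failed: int, total: int) -> float:
    """Failed transactions as a percentage of all executed transactions."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


def passed_rate_pct(failed: int, total: int) -> float:
    """The other side of the coin: everything that did not fail."""
    return 100.0 - error_rate_pct(failed, total)


print(error_rate_pct(400, 1000))   # 40.0
print(passed_rate_pct(400, 1000))  # 60.0
```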

Resource utilization metrics

Resource utilization metrics, unlike the load testing metrics in the previous section, are not generated by the load testing tool, although some tools do also report them. They are measured by the operating system of the machines involved: both the load generators, which produce the load, and your application servers, which receive it.

What you’re looking for in resource utilization metrics from the load generators is a way to determine whether or not the load generators were a performance bottleneck. Ideally, you will want all the metrics to show that the load generators were not overutilized, which means the results of your test will be valid. If they were overutilized, you will want to fix the issue and re-run the test.

The resource utilization metrics of your application servers, however, will show you how easily your application handled the load and will give you an idea of how much more it can handle. If you’ve identified a bottleneck, these metrics will also give you a clue as to where to start your investigation.

CPU utilization is how much of the machine’s processing power was being used during the test. This indicates whether a server was struggling with the tasks it was carrying out at the time. In a load generator, you’ll want to make sure you stay below 80% utilization for most of the test. In an application server, consistently high utilization may suggest that you need to allocate more CPU towards the server.

Memory utilization is how much of the machine’s memory (RAM) is being used up. This is sometimes measured as a percentage (80% memory utilization means 20% of the memory was not used) and sometimes expressed as available bytes (the amount of memory that was not used). Consistently high utilization (again, >80%), particularly if it keeps climbing as the test goes on, could point to a memory leak within the server. Memory leaks are often only spotted in longer tests, which is why it may be worthwhile to extend the duration of your tests. High memory utilization could also be a good reason to consider allocating more memory to the server.
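One way to turn these rules of thumb into a check is sketched below. It assumes you have exported utilization as a list of percentage samples taken at regular intervals. The 80% threshold comes from the text above; the helper names and sample values are hypothetical, and a rising trend is only a hint of a leak, not proof.

```python
from statistics import mean


def fraction_above(samples: list[float], threshold_pct: float = 80.0) -> float:
    """Share of samples that exceeded the threshold (0.0 to 1.0)."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if s > threshold_pct) / len(samples)


def trend_per_sample(samples: list[float]) -> float:
    """Least-squares slope: average change in utilization between samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


cpu = [70, 85, 90, 75]         # hypothetical CPU samples, in percent
memory = [50, 55, 60, 65, 70]  # hypothetical memory samples, in percent

print(fraction_above(cpu))       # 0.5
print(trend_per_sample(memory))  # 5.0
```

A slope that stays positive over a long test is a reason to look closer at memory on that server.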

Network throughput is similar to the transaction rate in the sense that it tries to measure how much load is being put through the system; however, it does this by measuring the amount of data in bytes that is being delivered by your application servers to the load generator. High network throughput is only a concern if it is at or close to the maximum bandwidth of the connection.

For instance, Flood uses AWS nodes of type m5.xlarge by default, which have an advertised bandwidth of up to 10 Gbps. A test with a network throughput of 10 Gbps is a concern because it means that the bandwidth of the load generator itself is starting to be a bottleneck. In this case, you should attempt to decrease network throughput by incorporating waits into the script so as not to hit this limit.
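To compare a test against that limit, you first need the throughput in the same unit. The sketch below converts a byte count over a period into gigabits per second and expresses it as a fraction of the advertised 10 Gbps. The byte count is invented, and the helper names are our own.

```python
def throughput_gbps(bytes_transferred: int, duration_s: float) -> float:
    """Convert bytes moved over a period into gigabits per second."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return bytes_transferred * 8 / duration_s / 1e9


def bandwidth_fraction(observed_gbps: float, limit_gbps: float = 10.0) -> float:
    """How much of the advertised bandwidth was in use (1.0 means saturated)."""
    return observed_gbps / limit_gbps


observed = throughput_gbps(45_000_000_000, 60)  # 45 GB received in one minute
print(observed)                      # 6.0
print(bandwidth_fraction(observed))  # 0.6
```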

Latency is the portion of the response time that accounts for the “travel time” of information between the load generator and the application server. It can be influenced by factors such as network congestion and geographical location and is notoriously difficult to measure. While it’s ideal to have low latency, having high latency does not necessarily render a test void; it’s still possible to determine the actual server processing time by subtracting latency from the overall response time.
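The subtraction itself is simple, as this small sketch shows. It assumes you already have a trustworthy latency figure, which, as noted, is the hard part; the values are hypothetical.

```python
def server_processing_ms(response_time_ms: float, latency_ms: float) -> float:
    """Estimate server processing time by removing network travel time."""
    if latency_ms > response_time_ms:
        raise ValueError("latency cannot exceed the total response time")
    return response_time_ms - latency_ms


print(server_processing_ms(850.0, 120.0))  # 730.0
```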

Disk I/O utilization metrics are useful because they measure how quickly data is transferred from memory (RAM) to the actual hard disk drive and back. This can be expressed as a rate (reads/sec or writes/sec), a percentage (busy time is the percentage of time that the disk was actively being used), or even a number of requests (queue length). Requests that cannot immediately be processed are assigned to a queue to be processed later, and a large number of requests in this queue can be taken as a disk utilization issue on the application server because it means the disk can’t keep up with the number of read/write requests. These metrics are more useful for the application server than for the load generators.

Have a look at the Server Monitoring section to revise your knowledge of these metrics.

Analyzing results

After collating the relevant metrics, you’ll want to start making sense of them. Metrics are useless if their context is not taken into account. Your job is to use those numbers to tell the story of what happened.

It’s impossible to thoroughly explain how to analyze results in this book, but here are some considerations to keep in mind.

First: was the load test valid?

Like any good scientist, your first duty after carrying out an experiment is to determine whether or not the conditions of the experiment accurately recreated the scenario you want to test. Here are some questions to ask yourself:

  • Was the load test executed for the expected duration?

  • Did the load generators display healthy resource utilization for the duration of the test? Were CPU, memory, and network metrics within tolerance?

  • Did your load test hit the throughput (requests per second) that you were aiming for? Is this similar to what you would expect in production? Consider drilling down further into separate transactions: are there business processes that are more common in production than in your load test?

  • Was the transaction error rate acceptable? How many of those errors were due to script errors and data in the wrong state?
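The questions above can be turned into a simple automated check. The sketch below is one possible shape for it, not a standard: the class, field names, and tolerances (a 10% throughput tolerance and a 1% error rate limit) are all assumptions you would replace with your own requirements. Only the 80% CPU limit comes from the earlier discussion of load generators.

```python
from dataclasses import dataclass


@dataclass
class LoadTestRun:
    planned_duration_s: float
    actual_duration_s: float
    peak_generator_cpu_pct: float
    target_rps: float
    achieved_rps: float
    error_rate_pct: float


def validity_problems(
    run: LoadTestRun,
    cpu_limit_pct: float = 80.0,        # from the load generator guidance
    rps_tolerance: float = 0.10,        # assumed: within 10% of target
    max_error_rate_pct: float = 1.0,    # assumed: acceptable error rate
) -> list[str]:
    """Return the reasons a run may not be valid; an empty list means none found."""
    problems = []
    if run.actual_duration_s < run.planned_duration_s:
        problems.append("test ended early")
    if run.peak_generator_cpu_pct > cpu_limit_pct:
        problems.append("load generator CPU over limit")
    if abs(run.achieved_rps - run.target_rps) > rps_tolerance * run.target_rps:
        problems.append("throughput off target")
    if run.error_rate_pct > max_error_rate_pct:
        problems.append("error rate too high")
    return problems


good = LoadTestRun(3600, 3600, 65.0, 100.0, 95.0, 0.4)
bad = LoadTestRun(3600, 1800, 92.0, 100.0, 60.0, 5.0)
print(validity_problems(good))  # []
print(validity_problems(bad))
```

A check like this does not replace judgment (it cannot tell you whether errors came from the script or from bad data), but it catches the obvious reasons to re-run a test before you spend time analyzing it.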

Next: How did the application handle the load?

Now that you’ve determined that your load test was a good replication of production load, it’s time to figure out how your application fared under that load. Your goal here is to identify whether any performance bottlenecks exist.

  • What was the average transaction response time? It’s important to drill down into separate transactions here because not all transactions are alike. Which transactions performed worst? Are all the transaction response times pretty close to each other, or are there some transactions that are far and away slower than the others? Also, look at more than just the average: what are minimum and maximum response times, and is there a large gap between those two? What is the 90th percentile response time?

  • How much of the transaction error rate was caused by legitimate application errors? Were there any HTTP 5xx responses that were returned by your server? Did these errors occur at the start of the test when the users were still ramping up, right after the full number of users was reached, or at the end of the test? Are there clumps of errors during the test, or were they spread out across the entire test? Do errors occur at regular intervals, and if so, were there any scheduled jobs going on at the same time on the application server?

  • When the application failed, did it fail gracefully? Did it display a nice error page, or did it simply offer up an unfriendly error? If your application has load balancing, did the load balancer correctly redirect traffic to less utilized nodes?

  • Was the resource utilization healthy on all your application servers? Were your nodes similarly utilized? Were there certain points in the test that display higher utilization than others, and what was going on at that time? Did memory utilization increase as the test went on? Did garbage collection occur?

  • Do the server logs display any unusual errors? What was the disk queue depth during the test? Did the server run out of hard disk space during the test, and do you have policies in place for backing up and deleting unnecessary data in production?

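For the response time questions in particular, it helps to summarize each transaction separately rather than the test as a whole. The sketch below assumes your results can be reduced to (transaction label, response time in milliseconds) pairs. It uses the nearest-rank definition of a percentile, which may differ slightly from what your tool reports; the labels and timings are invented.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(results: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group response times by transaction label and compute summary statistics."""
    by_label = defaultdict(list)
    for label, ms in results:
        by_label[label].append(ms)
    return {
        label: {
            "min": min(times),
            "avg": sum(times) / len(times),
            "p90": percentile(times, 90),
            "max": max(times),
        }
        for label, times in by_label.items()
    }


# Hypothetical results: ten "search" timings from 100 ms to 1000 ms
results = [("search", float(ms)) for ms in range(100, 1001, 100)]
results += [("login", 200.0), ("login", 220.0)]

for label, stats in summarize(results).items():
    print(label, stats)
```

Looking at the minimum, average, 90th percentile, and maximum side by side makes it easier to spot transactions where a reasonable average hides a long tail.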

Finally: If there were bottlenecks, why did they occur?

This is by far the most difficult part of results analysis, and you may have to liaise between different teams in order to determine why your application didn’t behave as expected.

The key here is to go beyond symptoms and actually try to get down to the root cause. If you find yourself saying the following things, it’s a good sign that you haven’t investigated the issue enough:

“The response time is high because CPU utilization on the server was high.”

“The error rate was high because the server returned HTTP 500s.”

“The application was slow because all 1000 users had ramped up.”

“The login server restarted unexpectedly.”

“Load was not even across all nodes, so response times on one server were higher than on the others.”

“Memory utilization was high when the identity verification process was triggered.”

These statements, as true as they may be, don’t really leave you with actionable insights. Instead, ask yourself why several times until you get to the root of the issue.

For instance, take the first statement: “The response time is high because CPU utilization on the server was high.” Why was the CPU utilization high? Well, because the server was busy processing a lot of information at the time. Why was the server processing a lot of information? Because each request for the home page retrieves information from many application components before a response is returned. Why does a user browsing to the home page trigger so many requests? Maybe it shouldn’t.

In that case, asking why several times got to the root of the issue: a simple GET of the home page was requesting far more resources and potentially causing higher CPU utilization on the server than was actually necessary.

In figuring out the real reason behind the symptoms you’re seeing, you’ll come up with tangible steps towards addressing it and can also inform management’s decision as to whether or not to proceed with a release.

Putting together a report

Communicating the load testing results is almost as important as running the test itself because a good report can sometimes determine whether your findings are actually addressed or whether your testing becomes just a “check-the-box” activity that gets forgotten.

You’ll likely need to come up with at least two reports based on the expertise of the stakeholders you’re giving reports to.

The Management Report

The management report is one that you intend to give to an audience that is not necessarily technical, so you’ll need to tailor the results you show accordingly.

The Executive Summary

This summary should explain in simple terms what you did, why you did it (your goal), how the application behaved, and what you can recommend that would improve performance. This is the highest level of report and, depending on the experience of the manager you’re giving this to, may be the only thing that some people read. So make it easy to read, as free of technical jargon as possible, and with very simple tables. If possible, put findings in bullet points.

If your report is a book, the executive summary should be the CliffsNotes version: enough to understand the gist of testing even if you don’t get into any of the specifics.

For the rest of the management report, you can then back up a bit and go into slightly more detail about each point:

- Nonfunctional requirements: what were they?

- Test scenario: what tool did you use for the test? What were the key transactions that you identified? How many times did you run the test, and for how long? Did you run different kinds of tests (load, peak, stress, etc.)? How many users did you run?

- Results: what were the response times of the key transactions? Include the top five or ten transactions with the highest response times.

- Fixes: What issues did you find on the application servers? Were any of them addressed as part of testing in order to improve performance? Show a before-and-after graph or chart of response times with the differences highlighted.

- Recommendations: What server configuration does your testing suggest is optimal? How did this release compare to other releases in terms of performance (if applicable)? If you had more time to test, what kind of tests would you run, and are there any tests that you would recommend adding to the standard suite of tests in the future?

If appropriate for your role, this is also where you make a GO/NO GO recommendation, along with reasons for that decision.

The Technical Report

While the management report prioritizes readability and comprehensibility above precision, the technical report will be geared towards the developers or other technical testers on your team, so you can go ahead and add more detail. This is where you’ll go through ALL of the results, but don’t forget that you’re still telling a story: don’t just copy and paste graphs; explain what they mean.

While in the management report you should tend to go for average response times, you can expand on this a bit in the technical report and include some more statistics, such as percentiles, standard deviation, and other ways to display the same data.

You can also include data here that is not necessarily significant. For instance, if the resource utilization on the load generators was healthy, you will not want to include those graphs in the management report, but you should include them in the technical report to anticipate questions about them.