Planning for a large-scale test

A checklist of things to do before starting a large-scale test.

Setting up your AWS account

For larger tests, we recommend you consider hosting your own infrastructure.

You'll need to have already set up an AWS account and the IAM role that Flood requires to launch infrastructure on your behalf. If you have a main account, we generally advise creating an AWS sub-account, as that limits the access we have through the credentials/role you provide. Here are some other things you can do to make sure everything goes smoothly on the AWS side:

  • Request an EC2 service limit increase for each region you plan to run nodes in, specifying the instance type (our base is m5.xlarge)

  • Notify AWS support of your planned test window, including the target region(s)

  • Increase your API gateway throttle limits to avoid HTTP 429s
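
As a rough aid for the limit increase request, the sketch below estimates how many vCPUs to ask for in each region. It assumes your account's On-Demand limits are counted in vCPUs and that an m5.xlarge has 4 vCPUs. The function name, the region names, and the 20% headroom for replacement instances are illustrative assumptions, not Flood or AWS recommendations.

```python
# vCPUs per instance type; 4 for m5.xlarge per AWS's published instance specs.
VCPUS_PER_INSTANCE = {"m5.xlarge": 4}


def vcpus_to_request(nodes_by_region, instance_type="m5.xlarge", headroom_pct=20):
    """Return {region: vCPUs to request}, padded by headroom_pct percent.

    nodes_by_region maps a region name to the number of nodes planned there.
    The headroom (an assumption) leaves room for replacement instances.
    """
    per_node = VCPUS_PER_INSTANCE[instance_type]
    request = {}
    for region, nodes in nodes_by_region.items():
        raw = nodes * per_node * (100 + headroom_pct)
        request[region] = (raw + 99) // 100  # integer ceiling of raw / 100
    return request


if __name__ == "__main__":
    plan = {"us-east-1": 50, "eu-west-1": 33}  # hypothetical node counts
    for region, vcpus in vcpus_to_request(plan).items():
        print(f"{region}: request at least {vcpus} vCPUs")
```

Check the result against your current limits in each region before filing the request, since existing workloads in the same account count towards the same limit.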

Baselining performance

Before you run your test, we advise you to baseline the performance of your application on a single node. A single node’s performance is representative of many nodes, because our architecture is shared-nothing and loosely coupled. A baseline gives you a solid starting point before you scale up to a full peak load test.

This is also a good time to run shakeout tests and confirm that your script works as expected.
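
One way to make the baseline concrete is to record a couple of summary figures from the single-node run and compare later runs against them. The sketch below assumes you have exported response times in milliseconds from your results. The function names, the choice of mean and 95th percentile, and the 20% tolerance are illustrative assumptions rather than anything Flood prescribes.

```python
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile; pct is a whole number from 1 to 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # ceiling of pct * n / 100
    return ordered[rank - 1]


def summarise_baseline(response_times_ms):
    """Summarise a single-node run as mean and 95th percentile."""
    return {
        "mean_ms": mean(response_times_ms),
        "p95_ms": percentile(response_times_ms, 95),
    }


def has_regressed(baseline, observed, tolerance_pct=20):
    """True if the observed p95 exceeds the baseline p95 by more than tolerance_pct."""
    return observed["p95_ms"] * 100 > baseline["p95_ms"] * (100 + tolerance_pct)


if __name__ == "__main__":
    baseline = summarise_baseline(list(range(1, 101)))  # stand-in sample data
    print(baseline)
    print(has_regressed(baseline, {"p95_ms": 120}))
```

If a scaled-up run regresses against the baseline, the difference is more likely to come from the application under load than from the load generators, given that each node runs independently.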

Preparation on Flood

  • Let us know in advance when you're planning to run tests involving over 1000 nodes. This will enable us to make sure that you get the support you need.

  • Minimise the size of your script and data files. These are transferred to every node, so very large libraries will lengthen the time your test takes to start up. Remove all unnecessary libraries.

  • Consider asking us to enable asynchronous execution of your test. By default, Flood only starts a test after all nodes on all grids are up. However, when you start many nodes, AWS sometimes has trouble fulfilling your request, which delays the whole test. To work around this, we can enable a switch on our end that allows your test to start on individual nodes while it waits for the others to start. If you're okay with this, we can turn it on for you. We've found that this tends to work better for larger tests.

Starting your grids

Consider using normal On Demand instances rather than spot instances. Spot instances can save money on provisioning costs, but if someone bids higher on spot instances than you did, those grids can be taken from you and allocated to someone else. This obviously isn't ideal in a load test, so if you have issues starting enough grids, we'd suggest increasing your spot bid or just using normal On Demand EC2 instances.

If you're using your own infrastructure, start all your grids before you start your flood.

Wait for all the nodes to go green. It's normal for this to take 20+ minutes for larger tests.

If certain regions don't come up during that time, it's likely that you've chosen a region with smaller capacity. Stop those grids and start an equal number of grids in another region until you have the number of grids you need for your test.
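
The reallocation described above is simple bookkeeping, sketched below so the total grid count stays the same after the move. The function name and region names are hypothetical, and the sketch only computes a revised plan. It does not call Flood or AWS.

```python
def reallocate_grids(planned, failed_regions, fallback_region):
    """Move grids from regions that failed to start into a fallback region.

    planned maps region name -> number of grids. Returns a new plan with the
    same total number of grids.
    """
    if fallback_region in failed_regions:
        raise ValueError("fallback region is one of the failed regions")
    # Keep the regions that came up, and count the grids that need a new home.
    new_plan = {r: n for r, n in planned.items() if r not in failed_regions}
    moved = sum(n for r, n in planned.items() if r in failed_regions)
    if moved:
        new_plan[fallback_region] = new_plan.get(fallback_region, 0) + moved
    return new_plan


if __name__ == "__main__":
    plan = {"us-east-1": 10, "sa-east-1": 5}  # hypothetical plan
    print(reallocate_grids(plan, {"sa-east-1"}, "us-west-2"))
```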

After all grids are started and green, start your test, selecting all those grids. It's normal for this to take several minutes. Even though the grids are started, your files will need to be transferred over before the test can begin.

Running the flood

  • Use a short TTL for the name resolution cache at the JVM level to avoid hitting stale IPs (the ELB you are targeting is likely to scale out or refresh its IPs automatically). You can set this in advanced parameters, e.g. -Dnetworkaddress.cache.ttl=10

  • Actively monitor your grid node health

  • Actively monitor your flood results for presence of HTTP 429s

  • Make sure you start multiple nodes/regions. As a planning figure, you should ideally generate no more than 1000 threads per node for JMeter/Gatling tests.
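
The 1000 threads per node planning figure translates into a node count as sketched below. The function names, the even split across regions, and the region names are assumptions for illustration. Your own baseline may justify a lower figure per node.

```python
def nodes_needed(target_threads, max_threads_per_node=1000):
    """Smallest number of nodes that keeps each node at or under the limit."""
    if target_threads <= 0:
        raise ValueError("target_threads must be positive")
    return -(-target_threads // max_threads_per_node)  # ceiling division


def spread_across_regions(total_nodes, regions):
    """Split nodes as evenly as possible; earlier regions take any remainder."""
    base, extra = divmod(total_nodes, len(regions))
    return {
        region: base + (1 if i < extra else 0)
        for i, region in enumerate(regions)
    }


if __name__ == "__main__":
    nodes = nodes_needed(250_500)  # hypothetical target of 250,500 threads
    print(nodes)
    print(spread_across_regions(nodes, ["us-east-1", "eu-west-1", "ap-southeast-2"]))
```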

Do not manually stop or start instances through the AWS console!

If you do, Flood will not pick up the new instances, which means you'll have fewer nodes running your test than you expected. Likewise, do not cancel spot instance requests. The Auto Scaling group will automatically start new instances to replace instances with issues.