
Benchmarks and regression analysis


In its current form, benchmarking should be thought of as a way to find major regressions in performance and memory usage, such as bugs introduced during feature work or refactoring. The benchmarks are unlikely to detect minor regressions, especially in performance, for two reasons:

  • Since we run the benchmark for every commit, and on Travis, we can't afford to run very extensive benchmarks. This means that the results are quite noisy.
  • Travis uses several different CPU platforms, so the performance varies a bit. Unfortunately, some tests behave very differently depending on the CPU platform, and can for example need far more iterations to converge on one platform than on another. This is caused by tiny rounding errors that grow over time.

To enable benchmarking for a given @testset, put a @bench macro in front of it. There is also an @onlybench macro that will run the test set only in benchmark mode, not during unit tests.
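For example (a sketch only: my_solver, nep and errmeasure are hypothetical placeholders, while @bench, @onlybench and @testset come from the package's test setup and the Test standard library):

@bench @testset "my solver" begin
    λ, v = my_solver(nep)
    @test errmeasure(λ, v) < 1e-10
end

# Run only in benchmark mode, skipped during ordinary unit test runs
@onlybench @testset "my solver, large problem" begin
    λ, v = my_solver(nep)
    @test errmeasure(λ, v) < 1e-10
end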

A suggested workflow is to benchmark most, if not all, @testsets, and use the benchmark results as an additional sanity check before merging a feature branch. If a suspected regression shows up in the Travis benchmark, it is recommended to rerun the benchmark for that specific test in a controlled environment (for example on eight or on your laptop). See the example workflow below.

CPU vs memory

It's important to distinguish between regressions in CPU time and in memory. While CPU time measurements tend to be very sensitive to noise, memory measurements are very accurate. If a test run in the same environment uses more memory in one version of the code than in another, we can in general assume that it's due to a change in the code (which doesn't have to be a problem; it could be something as simple as a changed start vector causing a method to converge differently). However, if a test is slower in one run than another, that can be caused by a number of factors unrelated to the code, such as differences in system load or caching. It's therefore recommended to run such benchmarks for a longer duration and/or several times to make sure that there really is a difference.
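As a standalone illustration of this point (not part of the benchmark scripts), timing a toy function in plain Julia shows the same behavior: the allocated bytes are reproducible from run to run, while the elapsed time fluctuates:

work() = sum(abs2, rand(10^6))

work()  # warm-up call so that JIT compilation is not measured

for i in 1:3
    t = @elapsed work()    # varies with system load, caching, CPU platform, ...
    b = @allocated work()  # essentially identical on every run
    println("run $i: $(round(t * 1000, digits = 2)) ms, $b bytes allocated")
end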

Scripts

There are a few different scripts used for unit tests and benchmarking, described below. All should be run from the top-level package directory.

run_tests.sh

Runs all unit tests, or a subset thereof (selected by specifying a test name / regex as a command line argument). Example to run the two "polygon" tests:

$ scripts/run_tests.sh polygon
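Running the script without an argument runs the full test suite:

$ scripts/run_tests.sh
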
run_benchmark.sh

Runs the benchmark suite for all tests, or a subset thereof (by specifying a test name / regex). Example that runs all benchmarks and stores the output in benchmark.json:

$ scripts/run_benchmark.sh benchmark.json

By default, all @testsets are benchmarked for a minimum of 1 second. That can be increased to do a more accurate benchmark. Here's an example that benchmarks all NLEIGS tests for 10 seconds each:

$ scripts/run_benchmark.sh nleigs.json 10.0 nleigs

print_benchmarks.sh

This script accepts one or two benchmark JSON files as input, and prints either the result of a single run or a comparison of two runs to stdout. Example comparing two runs:

$ scripts/print_benchmarks.sh nleigs1.json nleigs2.json
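Passing a single file instead prints the results of that run on its own:

$ scripts/print_benchmarks.sh nleigs1.json
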
benchmark_report.sh

Generates an HTML report from any number of benchmark JSON files, either specified as input or downloaded from the benchmarks that Travis uploads to GitHub. The benchmarks in the report are ordered by the time they were run. Example to create a report for all matching JSON files in the current directory:

$ scripts/benchmark_report.sh nleigs*.json

Example to create a report for the benchmarks branch on GitHub:

$ scripts/benchmark_report.sh -b benchmarks

Sample workflow

Let's say you've made changes to some core functionality in the package. The unit tests pass, but you want to make sure you didn't screw something up along the way, making it slower or less memory efficient. You can then run a general benchmark for the master branch, followed by one for your branch, and compare the two. Make sure to sync your branch with master first, so that you're only benchmarking your own changes. You can run this either on your own laptop or on some external server like eight (each benchmark run currently takes around 10-20 minutes). The workflow would be something like this:

git checkout master && scripts/run_benchmark.sh master.json
git checkout my-branch && scripts/run_benchmark.sh my-branch.json
scripts/print_benchmarks.sh master.json my-branch.json

Let's say this indicates that the "Infbilanczos" test has gotten slower. We can run a more accurate benchmark (30 seconds) of only that test to make sure there's indeed a problem and not just noisy data. If you are only observing differences in CPU time, make sure to run this in a controlled environment to minimize noise.

git checkout master && scripts/run_benchmark.sh master.json 30 infbil
git checkout my-branch && scripts/run_benchmark.sh my-branch.json 30 infbil
scripts/print_benchmarks.sh master.json my-branch.json

If there's still a discrepancy, you might want to investigate it further. Some next steps could be:

  • Look at the test and the diff to see if there's an obvious explanation for the regression.
  • Add @time statements to each NEP solver call in the test, and call each solver method twice in a row to make sure JIT compilation and caching are not part of the equation (see the sketch after this list).
  • If you have many commits in your branch, you can benchmark each individual commit (or use bisection) to narrow down which commit introduced the problem.
  • Use the profiler on the old and new version of code to see what has gotten slower.
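As a sketch of the second point (my_solver and nep are hypothetical placeholders; the actual solver calls in the test would be timed the same way):

# The first call includes JIT compilation and cache warm-up;
# only the second measurement is meaningful.
@time my_solver(nep)
@time my_solver(nep)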