“Manual Thread Dump Sampling” for Java uses simple thread dumps like a mini Java profiler. This little-known, low-overhead approach to capturing performance data is extremely effective, especially considering that full Java profilers are not welcome in production, user acceptance, and many other computing environments.
The instructions are surprisingly (perhaps shockingly) basic.
- Apply load.
- Using `jstack`, take about 4 thread dumps, a few seconds between each dump.
- If you see something that you could fix...
- ...and it shows up on two or more stack traces, ones that are actually under load...
- ...then it is worth fixing.
- As you peruse the thread dumps in your favorite text editor, it is often helpful to look only at the stack traces that include activity from your package space.
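Running `jstack <pid>` a few times is the standard route, but the same samples can also be captured in-process via the JMX `ThreadMXBean` API. Here is a minimal sketch; the `com.myapp` package mentioned in the comment is a placeholder for your own package space, and the pause is shortened for the demo:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DumpSampler {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Take 4 "dumps", pausing between each one.
        for (int dump = 1; dump <= 4; dump++) {
            System.out.println("=== dump " + dump + " ===");
            for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
                StackTraceElement[] stack = info.getStackTrace();
                // In a real session you would keep only stacks that touch your
                // package space, e.g. frame.getClassName().startsWith("com.myapp").
                if (stack.length > 0) {
                    System.out.println(info.getThreadName() + " @ " + stack[0]);
                }
            }
            Thread.sleep(100); // a few seconds apart in a real session
        }
    }
}
```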
Here is a little server-side Java application I've put together to demonstrate this technique.
I use this technique in my day job at least a few times a week -- so the demo is more than enough of an explanation for me. But if you are moved by a more detailed, math-based explanation of why it's possible to glean actionable performance data from such a small number of samples, then perhaps the following two pieces will help.
The first piece is by Mike Dunlavey, author of this book, which is a bit dated. He wrote the following math-based defense of this approach. Note that he uses the term "random pausing" instead of the Java-focused name I've landed on ("Manual Thread Dump Sampling"). Although I didn't learn this technique from Mike Dunlavey, he certainly seems to be the first one to document this approach -- not for Java, but for C and C++. The text for his section was taken from here.
The second piece is from Nick Seward, a great STEM teacher (with some interesting creds) from an extraordinary high school that my 17-year-old attends.
Distribution of speedup ratio y (for s hits out of n samples):

| s, n | 5%-ile | 95%-ile | mean |
|------|--------|---------|-------|
| 2, 2 | 1.58 | 59.30 | 32.36 |
| 2, 3 | 1.33 | 10.25 | 4.00 |
| 2, 4 | 1.23 | 5.28 | 2.50 |
| 2, 5 | 1.18 | 3.69 | 2.00 |
| 2, 10 | 1.09 | 1.89 | 1.37 |
| 2, 20 | 1.04 | 1.37 | 1.17 |
| 2, 40 | 1.02 | 1.17 | 1.08 |
| 3, 3 | 1.90 | 78.34 | 42.94 |
| 3, 4 | 1.52 | 13.10 | 5.00 |
| 3, 5 | 1.37 | 6.53 | 3.00 |
| 3, 10 | 1.16 | 2.29 | 1.57 |
| 3, 20 | 1.07 | 1.49 | 1.24 |
| 3, 40 | 1.04 | 1.22 | 1.11 |
| 4, 4 | 2.22 | 98.02 | 52.36 |
| 4, 5 | 1.72 | 15.95 | 6.00 |
| 4, 10 | 1.25 | 2.86 | 1.83 |
| 4, 20 | 1.11 | 1.62 | 1.31 |
| 4, 40 | 1.05 | 1.26 | 1.14 |
| 5, 5 | 2.54 | 117.27 | 64.29 |
| 5, 10 | 1.37 | 3.69 | 2.20 |
| 5, 20 | 1.15 | 1.78 | 1.40 |
| 5, 40 | 1.07 | 1.31 | 1.17 |
- It doesn't seem like this would help you if all of your functions used the same amount of time. The worst-case scenario is that all of your functions can be optimized equally.
- This does work well with the assumption that one of your functions will dominate (not a bad assumption). The question is how well it will capture that with 10 samples. Say that the ground truth is that it takes up 30% of the CPU time. Ideally we would want 3 of the random samples to reflect this. Using the binomial distribution, there is a 60% chance of seeing it 3 or more times, 85% for 2 or more, and 97% for 1 or more. That leaves you with a 3% chance of missing it altogether.
- I would guess the items up for optimization usually take more than 30% of the time so this approach is even better.
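The binomial figures above are easy to check directly. A small sketch that computes the probability of seeing a 30%-of-CPU problem in at least k out of n samples (it also reproduces the 0.5%-miss-with-15-samples figure mentioned below):

```java
public class BinomialCheck {
    // P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1-p)^(n-k)
    static double pmf(int n, int k, double p) {
        double coeff = 1.0;
        for (int i = 0; i < k; i++) {
            coeff = coeff * (n - i) / (i + 1);
        }
        return coeff * Math.pow(p, k) * Math.pow(1 - p, n - k);
    }

    // P(X >= k): one minus the probability of fewer than k hits.
    static double atLeast(int n, int k, double p) {
        double below = 0.0;
        for (int i = 0; i < k; i++) below += pmf(n, i, p);
        return 1.0 - below;
    }

    public static void main(String[] args) {
        int n = 10;
        double p = 0.30; // 10 samples, problem takes 30% of CPU time
        System.out.printf("P(>=3 hits) = %.2f%n", atLeast(n, 3, p)); // 0.62
        System.out.printf("P(>=2 hits) = %.2f%n", atLeast(n, 2, p)); // 0.85
        System.out.printf("P(>=1 hit)  = %.2f%n", atLeast(n, 1, p)); // 0.97
        // Chance of missing the problem entirely with 15 samples:
        System.out.printf("miss with 15 = %.4f%n", 1 - atLeast(15, 1, p)); // 0.0047
    }
}
```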
If something is to be flagged for a look, it will need to show up. So with my above example there is a 3% chance that it won't get flagged. If you bump that to 15 samples, then the chance of missing it drops to 0.5%. Think of it as flipping a coin: what is the chance of getting tails over and over? The more you flip, the less likely. If all you are looking for is identification, then this is great. If you actually want to know exactly how much time is spent in each function, then this isn't good enough. It seems like he proposes looking closely at all 10 samples in the hopes of finding something to optimize. I would personally take more samples (say 20) and only look at ones that show up more than once, to minimize false positives.
So, to put this more clearly: it isn't claiming that you can get accurate results with just 10 samples. It is claiming that a problem can be found with great certainty if it takes up a reasonable fraction of the execution time. You can fix that problem, which in turn exposes the next problem to the same 10-sample technique. If you are only worried about problems taking 30% of the time, and a 3% chance of not finding one is okay, then this is the way to go.
Here is the formula I would use: take the smallest number of samples n with (1 - p)^n <= e, i.e. n = ceil(ln(e) / ln(1 - p)), where
e = acceptable chance of a false negative (.005 would mean a 0.5% chance of not finding the problem)
p = problem size (.3 would mean finding problems that take over 30% of the execution time)
For example: if you want a 0.1% chance of a false negative and the problem you are hunting takes 5% of the runtime, then you would need 135 samples. For 30% problems, you only need 30 samples. You have to be stalking a big problem.
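A tiny sketch of that calculation, assuming the formula is the zero-hit binomial bound implied by the 135-sample example (smallest n with (1 - p)^n <= e):

```java
public class SampleCount {
    // Samples needed so the chance of zero hits on a problem of size p
    // stays at or below e.
    static int samplesNeeded(double e, double p) {
        return (int) Math.ceil(Math.log(e) / Math.log(1 - p));
    }

    public static void main(String[] args) {
        // 0.1% false-negative chance, 5% problem -> 135 samples
        System.out.println(samplesNeeded(0.001, 0.05)); // 135
        // 0.5% false-negative chance, 30% problem -> the 15 samples above
        System.out.println(samplesNeeded(0.005, 0.30)); // 15
    }
}
```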
Below is a graph to tell you how many samples you need (for 30% problems) when a false negative is defined as getting fewer than k hits, where k is a positive integer. That IMHO allows you to switch from 10 samples that all require your inspection to more samples where fewer of them meet the k-hit requirement, so the human gets to save time. I don't show it here, but the ratio of false positives decreases as k increases. The false positives are actually minimized when e=0.05 for the 30% problem.
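The graph's underlying calculation can be sketched as a search: for each k, find the smallest n such that the chance of seeing the problem fewer than k times stays within the false-negative budget e. The p and e values below are just the example numbers from the text:

```java
public class KHitSamples {
    // P(X = i) for X ~ Binomial(n, p)
    static double pmf(int n, int i, double p) {
        double coeff = 1.0;
        for (int j = 0; j < i; j++) coeff = coeff * (n - j) / (j + 1);
        return coeff * Math.pow(p, i) * Math.pow(1 - p, n - i);
    }

    // Smallest n such that P(fewer than k hits) <= e for a problem of size p.
    static int samplesForKHits(int k, double p, double e) {
        for (int n = k; ; n++) {
            double miss = 0.0;
            for (int i = 0; i < k; i++) miss += pmf(n, i, p);
            if (miss <= e) return n;
        }
    }

    public static void main(String[] args) {
        double p = 0.30, e = 0.005; // 30% problem, 0.5% false-negative budget
        for (int k = 1; k <= 4; k++) {
            // Larger k means more samples, but fewer false positives to inspect.
            System.out.println("k=" + k + " -> n=" + samplesForKHits(k, p, e));
        }
    }
}
```

For k=1 this reduces to the 15-sample figure above; each extra required hit pushes n higher.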
In closing: with a small number of samples you can nearly guarantee that you will find a large enough problem. However, you will have to sort through many false positives. If more samples are cheap, get as many as is feasible and use the largest k that gives you an acceptable e. Doing this will make your false positives and false negatives trend towards zero. That said...the beauty of 10 samples is that it is easy to examine 10 places in your code.