Scientific software is not that much different from other software when it comes to knowing what needs tuning.
The method I use is
random pausing. Here are some of the speedups it has found for me:
If a large fraction of time is spent in functions like log and exp, I can see what the arguments to those functions are, as a function of the points they are being called from. Often they are being called repeatedly with the same argument. If so, memoizing produces a massive speedup factor.
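Here is a minimal sketch of that kind of memoization in R; slow_log below is a made-up stand-in for whatever expensive call keeps seeing the same arguments:

    # Wrap an expensive one-argument function so repeated arguments hit a cache.
    memoize <- function(f) {
      cache <- new.env(hash = TRUE, parent = emptyenv())
      function(x) {
        key <- as.character(x)
        if (!exists(key, envir = cache, inherits = FALSE)) {
          assign(key, f(x), envir = cache)
        }
        get(key, envir = cache, inherits = FALSE)
      }
    }

    slow_log <- function(x) { Sys.sleep(0.01); log(x) }  # stand-in for a costly call
    fast_log <- memoize(slow_log)
    fast_log(2.5)   # computed the slow way
    fast_log(2.5)   # returned from the cache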
If I'm using BLAS or LAPACK functions, I may find that a large fraction of time is spent in routines to copy arrays, multiply matrices, take Cholesky transforms, etc.
The routine to copy arrays is not there for speed; it is there for convenience. You may find there is a less convenient, but faster, way to do it.
Routines to multiply or invert matrices, or take Cholesky transforms, tend to have character arguments specifying options, such as 'U' or 'L' for upper or lower triangle. Again, those are there for convenience. What I found was that, since my matrices were not very big, the routines were spending more than half their time calling the subroutine that compares characters, just to decipher the options. Writing special-purpose versions of the most costly math routines produced a massive speedup.
If I can just expand on the latter: the matrix-multiply routine DGEMM calls LSAME to decode its character arguments. Looking at inclusive percent time (the only statistic worth looking at), profilers regarded as "good" could show DGEMM using some percent of total time, like 80%, and LSAME using some percent of total time, like 50%. Looking at the former, you would be tempted to say "well, it must be heavily optimized, so not much I can do about that". Looking at the latter, you would be tempted to say "Huh? What's that all about? That's just a teeny little routine. This profiler must be wrong!"
It's not wrong, it's just not telling you what you need to know. What random pausing shows you is that DGEMM is on 80% of stack samples, and LSAME is on 50%. (You don't need a lot of samples to detect that. 10 is usually plenty.) What's more, on many of those samples, DGEMM is in the process of calling LSAME from a couple different lines of code.
So now you know why both routines are taking so much inclusive time. You also know where in your code they are being called from to spend all this time. That's why I use random pausing and take a jaundiced view of profilers, no matter how well-made they are. They're more interested in getting measurements than in telling you what's going on.
It's easy to assume the math library routines have been optimized to the nth degree, but in fact they have been optimized to be usable for a wide range of purposes. You need to see what's really going on, not what is easy to assume.
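To make that concrete, here is a toy sketch in R (not the actual BLAS/LAPACK code): a general routine that decodes its character options on every call, next to a special-purpose version with the options hard-wired:

    lsame_like <- function(ca, cb) toupper(ca) == toupper(cb)  # stand-in for LSAME

    general_trmul <- function(uplo, trans, a, b) {
      if (lsame_like(trans, "T")) a <- t(a)                    # options decoded on every call
      if (lsame_like(uplo, "L")) a[upper.tri(a)] <- 0 else a[lower.tri(a)] <- 0
      a %*% b                                                  # triangular part of a times b
    }

    special_trmul <- function(a, b) {                          # upper, no transpose, baked in
      a[lower.tri(a)] <- 0
      a %*% b
    }

    a <- matrix(rnorm(9), 3, 3)
    b <- matrix(rnorm(9), 3, 3)
    all.equal(general_trmul("U", "N", a, b), special_trmul(a, b))  # TRUE

For tiny matrices, the option decoding is a fixed cost paid on every call, which is exactly the overhead the stack samples exposed.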
ADDED: So to answer your last two questions:
What are the most important things to try first?
Take 10-20 stack samples, and don't just summarize them, understand what each one is telling you. Do this first, last, and in-between. (There is no "try", young Skywalker.)
How do I know how much performance I can get?
The stack samples will give you a very rough estimate of the fraction x of time that will be saved. (It follows a β(s+1, (n−s)+1) distribution, where s is the number of samples that displayed what you are going to fix, and n is the total number of samples. That doesn't count the cost of the code that you used to replace it, which will hopefully be small.) Then the speedup ratio is 1/(1−x), which can be large. Notice how this behaves mathematically. If n = 10 and s = 5, the mean and mode of x are 0.5, for a speedup ratio of 2. Here's the distribution:
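In R, that density (Beta(6, 6), from n = 10 samples with s = 5 hits) can be drawn with:

    curve(dbeta(x, 6, 6), from = 0, to = 1,
          xlab = "fraction of time saved, x", ylab = "density")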
If you are risk-averse then, yes, there is a small probability (0.03%) that x is less than 0.1, for a speedup of less than 11%. But balancing that is an equal probability that x is greater than 0.9, for a speedup ratio of greater than 10! If you're getting money in proportion to program speed, that's not bad odds.
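Those tail probabilities, and the 90% interval for the speedup ratio, can be checked directly in R under the same Beta(6, 6) assumption:

    pbeta(0.1, 6, 6)                      # P(x < 0.1): about 0.0003, i.e. 0.03%
    1 - pbeta(0.9, 6, 6)                  # P(x > 0.9): the same, by symmetry
    1 / (1 - qbeta(c(0.05, 0.95), 6, 6))  # 90% interval for the speedup 1/(1-x): about 1.4 to 3.7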
As I've pointed out to you before, you can repeat the whole procedure until you can't any more, and the compounded speedup ratio can be quite large.
ADDED: In response to Pedro's concern about false positives, let me try to construct an example where they might be expected to occur. We never act on a potential problem unless we see it two or more times, so we would expect false positives to occur when we see a problem the fewest possible times, especially when the total number of samples is large. Suppose we take 20 samples and see it twice. That estimates its cost is 10% of total execution time, the mode of its distribution. (The mean of the distribution is higher - it is (s+1)/(n+2)=3/22=13.6%.) The lower curve in the following graph is its distribution:
Consider if we took as many as 40 samples (more than I ever have at one time) and only saw a problem on two of them. The estimated cost (mode) of that problem is 5%, as shown on the taller curve.
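Both curves can be reproduced in R: seeing the problem 2 times in 20 samples gives Beta(3, 19), and 2 times in 40 samples gives Beta(3, 39):

    curve(dbeta(x, 3, 39), from = 0, to = 0.5,
          xlab = "fraction of time x", ylab = "density")  # 2 of 40: the taller curve, mode 0.05
    curve(dbeta(x, 3, 19), add = TRUE, lty = 2)           # 2 of 20: the lower curve, mode 0.10
    c(mode = 2/20, mean = 3/22)                           # 2 of 20: 10% and 13.6%
    c(mode = 2/40, mean = 3/42)                           # 2 of 40: 5% and 7.1%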
What is a "false positive"? It is that if you fix a problem you realize such a smaller gain than expected, that you regret having fixed it. The curves show (if the problem is "small") that, while the gain could be less than the fraction of samples showing it, on average it will be larger.
There is a far more serious risk - a "false negative". That is when there is a problem, but it is not found. (Contributing to this is "confirmation bias", where absence of evidence tends to be treated as evidence of absence.)
What you get with a profiler (a good one) is much more precise measurement (thus less chance of false positives), at the expense of much less precise information about what the problem actually is (thus less chance of finding it and getting any gain). That limits the overall speedup that can be achieved.
I would encourage users of profilers to report the speedup factors they actually get in practice.
There's another point to be made re. Pedro's question about false positives.
He mentioned there could be a difficulty when getting down to small problems in highly optimized code. (To me, a small problem is one that accounts for 5% or less of total time.)
Since it is entirely possible to construct a program that is totally optimal except for 5%, this point can only be addressed empirically, as in
this answer. To generalize from empirical experience, it goes like this:
A program, as written, typically contains several opportunities for optimization. (We can call them "problems" but they are often perfectly good code, simply capable of considerable improvement.) This diagram illustrates an artificial program taking some length of time (100s, say), and it contains problems A, B, C, ... that, when found and fixed, save 30%, 21%, etc. of the original 100s.
Notice that problem F costs 5% of the original time, so it is "small", and difficult to find without 40 or more samples.
However, the first 10 samples easily find problem A.** When that is fixed, the program takes only 70s, for a speedup of 100/70 = 1.43x. That not only makes the program faster, it magnifies, by that ratio, the percentages taken by the remaining problems. For example, problem B originally took 21s which was 21% of the total, but after removing A, B takes 21s out of 70s, or 30%, so it is easier to find when the entire process is repeated.
Once the process is repeated five times, now the execution time is 16.8s, out of which problem F is 30%, not 5%, so 10 samples find it easily.
So that's the point. Empirically, programs contain a series of problems having a distribution of sizes, and any problem found and fixed makes it easier to find the remaining ones. In order to accomplish this, none of the problems can be skipped because, if they are, they sit there taking time, limiting the total speedup, and failing to magnify the remaining problems. That's why it is very important to find the problems that are hiding.
If problems A through F are found and fixed, the speedup is 100/11.8 = 8.5x. If one of them is missed, for example D, then the speedup is only 100/(11.8+10.3) = 4.5x. That's the price paid for false negatives.
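Here's that arithmetic as a quick R sketch. The costs of problems C and E are not stated above, so the values used for them here are made-up fillers, chosen only so the six problems total 88.2s:

    # Seconds saved by each problem, out of an original 100 s (C and E are assumed).
    problems  <- c(A = 30, B = 21, C = 13.9, D = 10.3, E = 8, F = 5)
    remaining <- 100 - cumsum(c(0, problems))        # run time left as each fix is applied
    names(remaining) <- c(names(problems), "done")
    round(problems / remaining[names(problems)], 2)  # each problem's share when its turn comes
    100 / remaining["done"]                          # compounded speedup: 100/11.8, about 8.5x
    100 / (remaining["done"] + problems["D"])        # if D is missed: 100/22.1, about 4.5x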
So, when the profiler says "there doesn't seem to be any significant problem here" (i.e. good coder, this is practically optimal code), maybe it's right, and maybe it's not. (A false negative.) You don't know for sure if there are more problems to fix, for higher speedup, unless you try another profiling method and discover that there are. In my experience, the profiling method does not need a large number of samples, summarized, but a small number of samples, where each sample is understood thoroughly enough to recognize any opportunity for optimization.
** It takes a minimum of 2 hits on a problem to find it, unless one has prior knowledge that there is a (near) infinite loop. (The red tick marks represent 10 random samples.) The average number of samples needed to get 2 or more hits, when the problem is 30%, is 2/0.3 = 6.67 (negative binomial distribution). 10 samples find it with 85% probability, 20 samples with 99.2% (binomial distribution). To get the probability of finding the problem, in R, evaluate 1 - pbinom(1, numberOfSamples, sizeOfProblem), for example: 1 - pbinom(1, 20, 0.3) = 0.9923627.
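The other two numbers in that footnote come out the same way:

    1 - pbinom(1, 10, 0.3)   # 10 samples hit a 30% problem at least twice: about 0.85
    2 / 0.3                  # mean samples needed for 2 hits (negative binomial, r/p): 6.67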
ADDED: The fraction of time saved, x, follows a Beta distribution β(s+1, (n−s)+1), where n is the number of samples and s is the number that display the problem. However, the speedup ratio y equals 1/(1−x) (assuming all of x is saved), and it would be interesting to understand the distribution of y. It turns out y−1 follows a BetaPrime distribution. I simulated it with 2 million samples, arriving at this behavior:
distribution of speedup ratio y

 s,  n   5%-ile   95%-ile     mean
 2,  2     1.58     59.30    32.36
 2,  3     1.33     10.25     4.00
 2,  4     1.23      5.28     2.50
 2,  5     1.18      3.69     2.00
 2, 10     1.09      1.89     1.37
 2, 20     1.04      1.37     1.17
 2, 40     1.02      1.17     1.08
 3,  3     1.90     78.34    42.94
 3,  4     1.52     13.10     5.00
 3,  5     1.37      6.53     3.00
 3, 10     1.16      2.29     1.57
 3, 20     1.07      1.49     1.24
 3, 40     1.04      1.22     1.11
 4,  4     2.22     98.02    52.36
 4,  5     1.72     15.95     6.00
 4, 10     1.25      2.86     1.83
 4, 20     1.11      1.62     1.31
 4, 40     1.05      1.26     1.14
 5,  5     2.54    117.27    64.29
 5, 10     1.37      3.69     2.20
 5, 20     1.15      1.78     1.40
 5, 40     1.07      1.31     1.17
The 5%-ile and 95%-ile columns give a 90% confidence interval for the speedup ratio. The mean speedup ratio equals (n+1)/(n−s), except for the case where s = n. In that case it is undefined and, indeed, as I increase the number of simulated y values, the empirical mean increases.
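For what it's worth, here is a sketch of that simulation for one row of the table (s = 2 hits out of n = 10 samples), under the assumption x ~ Beta(s+1, (n−s)+1) and y = 1/(1−x):

    set.seed(1)
    s <- 2; n <- 10
    x <- rbeta(2e6, s + 1, n - s + 1)   # fraction of time saved
    y <- 1 / (1 - x)                    # speedup ratio
    quantile(y, c(0.05, 0.95))          # about 1.09 and 1.89, the "2, 10" row above
    mean(y)                             # about 1.37, i.e. (n + 1)/(n - s)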
This is a plot of the distribution of the speedup factors, and their means, for 2 hits out of 5, 4, 3, and 2 samples. For example, if 3 samples are taken, and 2 of them are hits on a problem, and that problem can be removed, the average speedup factor would be 4x. If the 2 hits are seen in only 2 samples, the average speedup is undefined - conceptually because programs with infinite loops exist with non-zero probability!