Performance

``There is no doubt that the "grail" of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.''

from ``Structured Programming with go to Statements'', by D. E. Knuth, 1974.

Performance

Premature optimization is the root of all evil.

Long ago, in a galaxy far, far away, people worried about making their programs go as fast as possible. Computers were expensive, more expensive than people's salaries, and people were slaves to the machine.

Knuth's quote is a historic indictment of that way of thinking, which led to terrible code being written in the (false) hope of efficiency.

Nowadays machines are cheaper than people's salaries. Is it still worth optimising?

Sometimes

Not until the program is correct! Correctness, robustness, and clarity are more important than efficiency. A fast program that gets the wrong answer doesn't save any time.

Even when a program is right, performance is one of the last things you should worry about. Good code tends to be efficient code, so if you take a sensible line you may find that your code works well enough without tuning.

Sometimes, however, the problem really needs to be solved faster. How do we know? How can we speed up a program? What can we expect to gain? These are our questions.

The answer to all three will come from measurement. In performance tuning, don't trust your intuition; ask the machine.

Efficiency

Efficiency can refer to time or space. Usually we worry about time, but space is often the problem too. We'll look at examples of both.

Algorithmic questions come in here: no finely tuned O(n^2) algorithm will beat an O(n log n) one once the input gets big enough.

Also, often a program isn't fast or small enough today, but it will be in a little while: machines get twice as fast and twice as big every year or two. If you're writing code now for a system to be shipped a year from now, you might be able to depend on that improvement.

In other words, a program that is inefficient today may seem speedy tomorrow. I can give examples....

Strategy

1. Design a program well, using good algorithms and data structures.

2. Build it well, and portably, using good design and methods.

3. Test it well.

4. Then, and only then, measure it to see if it's fast enough for the job.

5. If necessary, tune the code. Often minor changes will fix any performance problems in well-designed code. Badly-designed code will require major rewriting to be made efficient.

Don't overdo it: it almost always requires more complicated code, and more of it, to make a program run faster or use less space.

Testing for primality

The most important factor in making a program faster is usually the choice of algorithm.

Example: prime testing:

   #include <stdio.h>

   /* n is prime iff it has exactly two divisors, 1 and n */
   int main(int argc, char *argv[])
   {
      int i, n, ndiv;

      while (scanf("%d", &n) > 0) {
         ndiv = 0;
         for (i=1; i<=n; i++)
            if (n%i == 0)
               ndiv++;
         if (ndiv == 2)
            printf("%d is prime\n", n);
      }
      return 0;
   }

This algorithm is clearly O(n). It takes about 0.9 seconds to test 1257787 and 3014 [sic] seconds to test 2147483647, or 2^31-1, the largest prime that fits in an int.

Fixing the algorithm.

If n is composite, it must have a factor no bigger than sqrt(n), so there is no need to test beyond that point.

      ndiv = 0;
      for(i=1; i<=sqrt(n); i++)
         if(n%i == 0)
            ndiv++;
      if(ndiv == 1)
         printf("%d is prime\n", n);

This version takes about 0.06 seconds to test 1257787 and 0.15 seconds to test 2^31-1.

The asymptotic complexity of an algorithm makes a big difference once inputs get big enough.

Caveat: be aware of hidden `performance bugs':

   for (i = 0; i < strlen(s); i++)
      if (s[i] == c)
         ...
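strlen walks the whole string, and the loop condition is re-evaluated on every iteration, so this loop is quadratic in the length of s. A minimal sketch of the fix (n is a new variable introduced here) hoists the call out of the loop:

   size_t n = strlen(s);      /* length computed once */
   for (i = 0; i < n; i++)
      if (s[i] == c)
         ...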

How slow is too slow?

The algorithm is right but it's still too slow.

What does `too slow' mean?

How do we know that our improvement is worthwhile?

Need to balance the human cost against the computer time saved. An hour spent getting 10% out of a program that takes a minute to run, and is run only once, or even a hundred times, is wasted time.

No absolutes here, but here's a guide: the time spent making a program faster should not be more than the time the speedup will recover during the lifetime of the program.

Speeding up libraries is almost always worthwhile; speeding up test programs is almost never worthwhile.

A program that runs for a year: squeeze out everything you can! It will be worth restarting if you find 5% after the program has already been running a week!

The primality speedup was worthwhile: took a couple of minutes, saved many minutes.

Code tuning

The algorithms are right but the code is too slow. How do we make code faster? Easiest way: turn on optimisations in the compiler. If you need still more, then look at the details.

The primality tester checks every number as a potential divisor, but we can handle 2 specially and skip the other even numbers:

   while (scanf("%d", &n) > 0){
      if (n == 2){
         printf("%d is prime\n", n);
         continue;
      }
      if (n%2 == 0)
         continue;
      ndiv = 0;
      for (i=3; i<=sqrt(n); i+=2)
         if (n%i == 0)
            ndiv++;
      if (ndiv == 0)
         printf("%d is prime\n", n);
   }

This version takes only 0.1 seconds to test 2^31-1.

Also, since n doesn't change during the loop, calculating sqrt(n) once is enough. That change brings the time down to 0.07 seconds.
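For concreteness, a sketch of that change (limit is a name introduced here; sqrt needs <math.h>):

   double limit = sqrt((double)n);   /* computed once, before the loop */

   ndiv = 0;
   for (i = 3; i <= limit; i += 2)
      if (n%i == 0)
         ndiv++;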

Timings, etc.

Each step of the way, measure your progress. Use the time command on Unix, or find its equivalent elsewhere.

If you're doing something within a program, like a library, construct some scaffolding to do the timing for you, and make sure the procedure is reproducible.
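A minimal sketch of such scaffolding, using the standard C clock() function (time_it, do_the_work, and the iteration count are placeholders):

   #include <stdio.h>
   #include <time.h>

   extern void do_the_work(void);   /* placeholder for the code being timed */

   void time_it(void)
   {
      long i, niter = 1000000;      /* enough iterations to get a measurable total */
      clock_t before;

      before = clock();
      for (i = 0; i < niter; i++)
         do_the_work();
      printf("%.3f microseconds per iteration\n",
         (clock() - before) * 1e6 / CLOCKS_PER_SEC / niter);
   }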

Stop when your changes aren't making any difference.

Compiler optimisations come in here, too: sometimes they'll do things better than you will; sometimes they'll confound you.

Sometimes two changes that each improve a program will interact, negating their individual effects.

Whatever happens: measure and keep notes!

Profiling

Profiling is an effective tool for finding hot spots in a program. You usually need a compiler option (-pg) to enable it, and then a command (gprof) to produce the results:

   % gprof shaney
     %   Time   Calls   Name
   27.5   0.140    29451   chain_to_wnode
   17.6   0.090    16657   doprint
   15.7   0.080    12804   next_word
    5.9   0.030    16647   Bprint
    5.9   0.030        1   generate
    5.9   0.030    16647   string_from_wordx
   ...

This program shows little chance for improvement: the #1 runner is only 28% and the second one is the main formatted I/O primitive.

But sometimes you see the top runner at 80% or more. Then you can go to town.

Profiling success

Sometimes the top runner is so easy to fix it's worth doing so anyway.

Long ago, a profile of Awk indicated that this loop:

   for (i = 0; i < MAXFLD; i++)
      clearfield(i);

was taking nearly 30 percent of the run time. MAXFLD was 200, but usually only a few fields were used. Putting in a high water mark saved 25% of the run time.
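The real Awk code isn't shown here, but the idea can be sketched like this (hiwater, setfield, and clearfields are illustrative names, not Awk's own):

   extern void clearfield(int);     /* as in the loop above */

   enum { MAXFLD = 200 };
   static int hiwater = 0;          /* highest field actually set */

   void setfield(int i)
   {
      if (i > hiwater)
         hiwater = i;
      /* ... store the field ... */
   }

   void clearfields(void)
   {
      int i;

      for (i = 0; i <= hiwater; i++)   /* clear only what was used, not all MAXFLD */
         clearfield(i);
      hiwater = 0;
   }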

Language Performance

Rob Pike's performance tests on Shaney:

                       250MHz R10000   400MHz Pentium II   Lines of source code
   C                       0.36 sec        0.30 sec             150
   Java                    4.9             9.2                  105
   C++/STL/deque           2.6            11.2 *                 70
   C++/STL/list            1.7             1.5                   70
   Awk                     2.2             2.1                   20
   Perl                    1.8             1.0                   18

Notes: C is generally faster than most languages for computation. Note the speed vs code length trade-off.

Perl and Python are very good for string manipulation.

* Note the broken deque (double-ended queue) implementation!

Speed Tests and Profiling

   Program      Time (s)
   Compiled C   1.0
   Java VM      2.2
   Perl         13.5
   Python       16.4
   C on 486     30

The rule of thumb

Knuth, 1974: ``Experience has shown that most of the running time in non-IO-bound programs is concentrated in about 3% of the source text.''

Today, the rule you'll usually hear (not necessarily any more accurate) is ``90% of the runtime of a program is concentrated in 10% of the code.''

Profiling is the only way to identify which tiny part of the program is responsible. That piece is called the hot spot.

The amazing thing is, we often find that after fixing only one hot spot we get a `smooth' profile and there's little obvious left to do.

What to do about hot spots? These are the tricks of optimisation.

Tricks

Collect common subexpressions. If an expensive computation appears multiple times, do it in only one place and use the result elsewhere.

Example:
sqrt in our primality tester.

Reduction in strength.
Replacing an expensive operation with a simpler one. It used to be common to replace multiplies with adds. We could have replaced

   for (i=1; i<=sqrt(n); i+=2)

by

   for (i=1; i*i<=n; i+=2)

(Of course, we do even better by removing the square root computation from the loop entirely, or running the loop backwards.)

Another example is replacing array indexing with pointers:

   for (i=0; i < N; i++) a[i]

is slower than

   for (p=a; p < endofa; p++) *p, p->
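Spelled out as a small sketch (sum is a new variable; a, N, p, and endofa are as above):

   int i, sum = 0;
   int *p, *endofa = &a[N];

   for (i = 0; i < N; i++)          /* indexed form: computes a + i each time */
      sum += a[i];

   sum = 0;
   for (p = a; p < endofa; p++)     /* pointer form: just advances p */
      sum += *p;

On older compilers the pointer form saved an index calculation on every element; modern compilers usually generate much the same code for both.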

Tricks, cont'd.

If a function call is expensive (it isn't, nowadays), and you can handle the easy case outside the function, wrap the function up as a macro. The usual case is getc and friends:

   #define getc(p)   (--(p)->_cnt < 0 ? \
      __filbuf(p) : (int)*(p)->_ptr++)

Grep does a fast check for the first character before running the full match.

Loop unrolling
Loop overhead can be amortised:

   for(i=0; i < N; i++)
      a[i] = b[i] + c[i];

becomes

   for(i=0; i < N; i+=4){
      a[i+0] = b[i+0] + c[i+0];
      a[i+1] = b[i+1] + c[i+1];
      a[i+2] = b[i+2] + c[i+2];
      a[i+3] = b[i+3] + c[i+3];
   }

What if N isn't 0 mod 4?
And memset of bytes can do 4 at a time by storing words, if we fix up the ends properly.
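A sketch of that idea, taking unsigned long as the machine word (real memset implementations handle alignment and aliasing far more carefully, and are often written in assembler):

   #include <stddef.h>

   /* fill n bytes with c, storing a whole word at a time in the middle */
   void byteset(void *dst, int c, size_t n)
   {
      unsigned char *p = dst;
      unsigned long word = (unsigned char)c;
      size_t i;

      for (i = 1; i < sizeof word; i++)              /* replicate c into every byte */
         word = (word << 8) | (unsigned char)c;
      while (n > 0 && (size_t)p % sizeof word != 0) {   /* leading bytes, to align p */
         *p++ = (unsigned char)c;
         n--;
      }
      while (n >= sizeof word) {                     /* aligned whole words */
         *(unsigned long *)(void *)p = word;
         p += sizeof word;
         n -= sizeof word;
      }
      while (n > 0) {                                /* trailing bytes */
         *p++ = (unsigned char)c;
         n--;
      }
   }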

Tricks, cont'd.

Duff's device
A trick for combining loop unrolling and the fixup:

   i = 0;
   switch (n % 4) {
   case 0:
      while (i < n) {
         a[i] = b[i] + c[i]; i++;
   case 3:
         a[i] = b[i] + c[i]; i++;
   case 2:
         a[i] = b[i] + c[i]; i++;
   case 1:
         a[i] = b[i] + c[i]; i++;
      }
   }

Caching
Just remember that last value, or store the value with the data instead of computing it again each time.
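A sketch of the simplest form: remember the most recent argument and result of a costly function (expensive and cached_expensive are illustrative names):

   extern double expensive(int);    /* the costly computation */

   /* return expensive(arg), recomputing only when arg changes */
   double cached_expensive(int arg)
   {
      static int last_arg;
      static double last_value;
      static int have_last = 0;

      if (!have_last || arg != last_arg) {
         last_value = expensive(arg);
         last_arg = arg;
         have_last = 1;
      }
      return last_value;
   }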

Tricks, cont'd.

Special-purpose allocation
Save malloc overhead by allocating blocks of memory and building our own free list within them. This can often save 10% or more, although malloc has been getting better in recent years; it used to be very expensive.
Also: stack-based allocation: a whole series of allocations is done in a sequence, and then the entire set is freed at once. Good for compilers, which allocate temporaries to build an expression, and then throw them all away automatically.
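A sketch of a free list for one fixed-size type (Node, newnode, and freenode are illustrative names); malloc is called only occasionally, to replenish the list a block at a time:

   #include <stdlib.h>

   typedef struct Node Node;
   struct Node {
      Node *next;
      /* ... payload fields ... */
   };

   enum { NALLOC = 1000 };          /* how many nodes per call to malloc */
   static Node *freelist;

   Node *newnode(void)
   {
      Node *block, *p;
      int i;

      if (freelist == NULL) {       /* replenish from one big malloc */
         block = malloc(NALLOC * sizeof(Node));
         if (block == NULL)
            return NULL;
         for (i = 0; i < NALLOC-1; i++)   /* chain the block into a list */
            block[i].next = &block[i+1];
         block[NALLOC-1].next = NULL;
         freelist = block;
      }
      p = freelist;
      freelist = p->next;
      return p;
   }

   void freenode(Node *p)
   {
      p->next = freelist;           /* back on the free list; no call to free */
      freelist = p;
   }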

Buffering.
I/O is the classic example.

Separation of special cases.
Example: handling 2 separately in prime testing. In general, take the special case out of the loop.

Tricks, cont'd.

Precomputing results
Compute them once now, then use them lots later.
Special case of:

Data instead of code
Use tables instead of computing every time (or vice versa, if space is at a premium).
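A sketch of the table approach: count the one-bits in a word by looking up a precomputed 256-entry table, byte by byte, instead of looping over every bit (initbits and bitcount are illustrative names):

   static unsigned char bits[256];     /* bits[i] = number of 1-bits in i */

   void initbits(void)                 /* precompute once */
   {
      int i;
      for (i = 1; i < 256; i++)
         bits[i] = (i & 1) + bits[i/2];
   }

   int bitcount(unsigned int x)        /* then each call is a few lookups */
   {
      int n = 0;
      while (x != 0) {
         n += bits[x & 0xFF];
         x >>= 8;
      }
      return n;
   }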

On-the-fly compilation
When you see an instruction, save the actual machine instructions to do that into a block of memory, then jump to the address of that memory.

Going into assembler
As a last resort, can be important, but breaks portability. Usually most effective in a library, e.g. memset. Never use asm.

Space Efficiency

Mostly talked about time; now let's talk about space.

Memory used to be the most precious computing resource, always in short supply, and much bad programming was done to squeeze space. (Year 2000 problem?)

We're much better off now, but there are still programs that are too big to fit in main memory.

Same basic idea applies: sometimes the space we use is O(n^2) when it could be O(n log n).

Tricks for memory

Reducing the data type
Substitute signed char or short for int, or float for double, if the precision and size permit. (Beware alignment problems and printf, scanf calls.) Sometimes we can even squeeze the data down to a couple of bits. Don't use bitfields: write functions that set and get bit vectors. In C++, can overload array indexing.
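A sketch of such functions over an array of unsigned char (setbit and getbit are illustrative names; the caller supplies a vector big enough for the bits it uses):

   /* treat an array of unsigned char as a vector of single bits */
   void setbit(unsigned char *vec, int i, int value)
   {
      if (value)
         vec[i/8] |= 1 << (i%8);
      else
         vec[i/8] &= ~(1 << (i%8));
   }

   int getbit(unsigned char *vec, int i)
   {
      return (vec[i/8] >> (i%8)) & 1;
   }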

Remove the common data type
If most of memory holds a single value, don't store it all: use run-length encoding. Matrix story: changing m[i][j] to a function m(i,j); sparse arrays and matrices are better implemented with hash tables and other space-saving data structures, otherwise a full m[i][j] array can be too large.

Use a space-efficient data structure
Quadtrees for radio terrain data
Graphs for crystal problems.

External space

File storage is sometimes a critical resource. If you can, use text, but if you're tight, you can store special binary formats (e.g. 3-byte integers). Be sure to read and write the data with portable code! (See the sketch below.)

Plan 9 file system format.
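To illustrate the portability point: pack and unpack a 3-byte integer explicitly, byte by byte, instead of writing a struct's memory image to the file (put3 and get3 are illustrative names):

   /* store v (values up to 2^24-1) in 3 bytes, low byte first */
   void put3(unsigned char *p, unsigned long v)
   {
      p[0] = v & 0xFF;
      p[1] = (v >> 8) & 0xFF;
      p[2] = (v >> 16) & 0xFF;
   }

   unsigned long get3(unsigned char *p)
   {
      return p[0] | ((unsigned long)p[1] << 8) | ((unsigned long)p[2] << 16);
   }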

Estimation

Time in microseconds on a nominal 150MHz machine:

   Int Operations
     i1++                              0.01
     i1 = i2 + i3                      0.02
     i1 = i2 - i3                      0.02
     i1 = i2 * i3                      0.09
     i1 = i2 / i3                      0.58
     i1 = i2 % i3                      0.60
   Float Operations
     f1 = f2                           0.01
     f1 = f2 + f3                      0.02
     f1 = f2 - f3                      0.02
     f1 = f2 * f3                      0.03
     f1 = f2 / f3                      0.14
   Numeric Conversions
     i1 = f1                           0.03
     f1 = i1                           0.04

Surprise to old-timers: floating point is fast!

Estimation

   Integer Vector Operations
     v[i] = i                          0.09
     v[v[i]] = i                       0.13
     v[v[v[i]]] = i                    0.16
   Control Structures
     if (i == 5) i1++                  0.05
     if (i != 5) i1++                  0.07
     while (i < 0) i1++                0.05
     i1 = sum1(i2)                     0.08
     i1 = sum2(i2, i3)                 0.08
     i1 = sum3(i2, i3, i4)             0.09

Remember, though, these were on a particular machine under peculiar conditions. Use these as a guideline only.

I/O is slow

   Input/Output
     fputs(s, fp)                      2.45
     fgets(s, 9, fp)                   1.43
     fprintf(fp, sdn, i)               7.26
     fscanf(fp, sd, &i1)               8.69
   Malloc (n=1000000)
     free(malloc(8))                   1.22
     push(i)                           2.06
     i1 = pop()                        1.37
   String Functions
     strcpy(s, s0123456789)            0.49
     i1 = strcmp(s, s)                 0.56
     i1 = strcmp(s, sa123456789)       0.14
   String/Number Conversions
     i1 = atoi(s12345)                 1.00
     sscanf(s12345, sd, &i1)           7.18
     sprintf(s, sd, i)                 4.54
     f1 = atof(s123_45)                4.64
     sscanf(s123_45, sf, &f1)         11.64
     sprintf(s, sf62, 123.45)         10.58

Math functions are very fast

   Math Functions
     i1 = rand()                       0.35
     f1 = log(f2)                      1.08
     f1 = exp(f2)                      1.22
     f1 = sin(f2)                      1.27
     f1 = sqrt(f2)                     0.64

Lessons:

Know what the relative costs are.

Measure them: they change from machine to machine and over time.

Package performance analysis as part of testing. Build a test suite to verify that performance is acceptable and doesn't suddenly get worse after a modification.

A final note

Programs spend all their time in loops.

Writing clever code that gets executed once is pointless and bug-prone.