Optimization of Computer Programs in C

Michael E. Lee
Senior Programmer/Analyst
Ontek Corporation
22941 Mill Creek Road
Laguna Hills, CA 92953 USA

Abstract: A substantial portion of a knowledge worker's life may be spent
waiting for a computer program to produce output. Users and organizations
control their wait time by purchasing faster computers, adding memory, or
using faster network connections. Developers of application programs have a
responsibility to design their programs to make the best use of these
limited and expensive resources.

This document describes techniques for optimizing (improving the speed of)
computer programs written in C. It focuses on minimizing time spent by the
CPU and gives sample source code transformations that often yield
improvements. Memory and I/O speed improvements are also discussed.
---------------------------------------------------------------------------

CONTENTS

1. Contents
2. Introduction
3. Compute-bound
o Choose a Better Algorithm
o Write Clear, Simple Code
o Perspective
o Understand your Compiler Options
o Inlining
o Loop Unrolling
o Loop Jamming
o Loop Inversion
o Strength Reduction
o Loop Invariant Computations
o Code for Common Case
o Recursion Elimination
o Table Lookup
o Variables
o Function Calls
o Digestibility
o String Operations
o FP Parallelism
o Get a Better Compiler
o Stack Usage
o Code it in Assembly
o Shared Library Overhead
o Machine-Specific Optimization
4. Memory-bound
o Locality of Reference
o Column-major Accessing
o Don't Copy Large Things
o Split or Merge Arrays
o Reduce Padding
o Increase Padding
o March Forward
o Beware the Power of Two
o Memory Leaks
o Be Stringy
o Hinting
o Fix the Problem in Hardware
o Cache Profilers
5. IO-bound
o Sequential Access
o Random Access
o Terminals
o Sockets
o SFIO
6. Gotchas
7. Glossary
8. References
9. Acknowledgements

---------------------------------------------------------------------------

INTRODUCTION

The single most effective optimization technique is to use a profiler to
identify performance bottlenecks. It's often difficult to guess what part
of your program is consuming the most resources, and if you base your
optimization efforts on speculation instead of real data, you'll waste a
lot of time speeding up the parts of your program that were fast already.

Once you have identified a bottleneck, for example a loop that is executed
thousands of times, remember that the best thing to do is to redesign the
program so that it doesn't need to execute the loop thousands of times.
This is more effective than making the loop run 10% faster yet execute just
as often; a 10% speedup is precisely the sort of thing an optimizing
compiler can achieve by itself.

Optimization is simply a waste of programmer time if any of these
statements are true:

* parts of the program haven't been written yet
* the program is not fully tested and debugged
* it seems to run fast enough already

Also take into account how the program will be used. If it's a
report-generator program which only needs to be run once a day, the user
may start it before going off to lunch and if so there's really no point in
making it finish before they get back. If it's being invoked from within
another program that's even slower than yours, again the user will notice
no difference. But if it handles mouse-tracking events for a GUI the user
will complain about any noticeable delay.

Given that optimizing is reasonable, compile in full optimize mode and run
your program on "real-world" input data. If you don't have access to real
input data, choose your test input data with care: programmers tend to test
with a smaller set of input data and a different variety of cases than the
program will likely encounter once it's in the hands of users.

Use the time(1) command to see if your program is compute-bound, memory
bound, or I/O bound, or for that matter if it's bound at all: even if your
program "seems" to be taking a long time to run, it may only be a second of
CPU time. You may have more "time" commands on your system than you
realize: sometimes it's built into the shell (SunOS csh for example) and
has lots of nifty options. You can also get performance information from
getrusage() if you have it, and of course from profiling programs like
gprof, prof, and tcov.
---------------------------------------------------------------------------

COMPUTE BOUND

Recompile your program with profiling enabled and whatever optimization
options are compatible with that. Run your program, again on real world
data, and generate a profiling report. Figure out which function uses the
most CPU time, then look it over very carefully and see if any of these
approaches might be useful. Make one change at a time, then run the
profiler again, repeating the process until there is no obvious bottleneck
or the program runs sufficiently fast.

The first four are quite important; the rest are in no particular order.

CHOOSE A BETTER ALGORITHM

Think about what the code is really doing. Become familiar with the body of
literature that describes your specialty and learn and use the most
appropriate algorithms, even if you didn't come up with them yourself. You
should be familiar with O(n) notation, which is defined in many computer
science texts. Some of the obvious replacements:

    slow algorithm              replace with
    --------------              ------------
    sequential search           binary search or hash lookup
    insertion or bubble sort    quick sort, merge sort, radix sort

Also choose an appropriate data structure. If you'll be doing a lot of
insertions and deletions at random places then a linked list would be good.
If you'll be doing some binary searching, an array would be better.

WRITE CLEAR, SIMPLE CODE

Some of the very things that make code clear and readable to humans also
make it clear and readable to compilers. Complicated expressions are harder
to optimize and can cause the compiler to "fallback" to a less intense mode
of optimization. I've seen one compiler that has translation-unit limits
which when overrun will cause the entire module to be de-optimized by one
level. Part of the clarity is making hunks of code into functions when
appropriate. The cost of a function call is extremely small on modern
machines, so optimization is NOT a valid excuse for writing ten-page
functions.

If you write clean, portable code you can quickly port to the latest,
fastest machine and offer that as a solution to your customers who are
interested in speed.

PERSPECTIVE

Get a sense for how long certain operations take. Among the slowest are
opening a file, reading or writing significant amounts of data, starting a
new process, searching, sorting, operations on entire arrays, and copying
large amounts of data around. The fastest operations are basic elements of
the language like assigning to a variable, dereferencing a pointer, or
adding two integers. None of these operations take very long in and of
themselves (a few microseconds) but all computer languages allow sections
of code to be executed repeatedly. If you perform even the fastest
operation 10 million times, it will take a noticeable amount of time. In
real programs, various things happen various numbers of times and having
some common sense can help you interpret the output from your profiler.

A sure sign of misunderstanding is this fragment:

if (x != 0) x = 0;

The intent is to save time by not initializing x if it's already zero. In
reality, the test to see whether it's zero or not will take up about as
much time as setting it to zero itself would have.

x = 0;

has the same effect and will be somewhat faster.

For the intrepid hacker, there is no substitute for examining the
assembler-level output the compiler generates and counting the
instructions. If you do this, don't forget to include wait states in your
tally. Also keep in mind that some optimization and instruction scheduling
is put off until link time and won't show up in the assembler output for an
individual module.

I have a small, fairly portable program that prints out the approximate
unoptimized speed of basic C operators and library functions, speed.c.
It's not a benchmark (the measurements are not weighted or averaged in
any way) but it'll give a beginner a numerical sense of the cost of various
things in C.
Note: (a) I have no anonymous ftp server at my site, so you'll have to look
at it in your browser and save it to a file or copy and paste it into an
editor. (b) Be sure to compile it with optimization off. Most of the code
is trivial and would be eliminated by an optimizer; the timing would end up
being mostly zeroes.

UNDERSTAND YOUR COMPILER OPTIONS

Most compilers have several levels of optimization. Make sure you're using
the highest level. One popular compiler, gcc, has an incredible variety of
optimization levels and options. Some compilers have special #pragmas or
keywords (for example, inline) which also affect optimization. High levels of optimization can make poorly written (but compilable) code
break. Less often, optimizers mangle perfectly correct code. Problems with
side effects and evaluation order tend to screw things up, so you may have
to fix the code up a little before you can use the highest level. Signal
handlers activated at inopportune times may disturb an optimized program in
ways that wouldn't disturb an unoptimized one.

INLINING

gcc (with the -finline-functions option) and a few other compilers are
capable of inlining small functions at higher optimization levels
automatically. K&R C compilers that I've seen won't do it at all or will
only do it for library functions written in assembly (for example the math
library). C++ compilers almost universally support inline functions.

If necessary, C functions can be recoded as macros to obtain similar
speedup on compilers with no inlining capability. This should be done after
the code is completely debugged. All debuggers I've seen are incapable of
displaying or stepping through the expanded macro text.

An example of inlining via macros:

Old code:

int foo(a, b)
{
    a = a - b;
    b++;
    a = a * b;
    return a;
}

New code:

#define foo(a, b) (((a)-(b)) * ((b)+1))

The extra parentheses are necessary to preserve grouping in case foo is
used in an expression with higher precedence than * or in case a and/or b
contain subexpressions with lower precedence than + or -.

Comma expressions and do { ... } while(0) can be used for more complicated
functions, with some restrictions. The do-while macros let you create local
variables for use in the macro, but you can't return a value to the
expression the macro is used in. The opposite is true for macros using
comma expressions.

Some caveats:

1. Gratuitously making every function in sight into a macro leads to
massive code-bloat and can increase the amount of memory your program
needs dramatically. The larger a program is, the less likely it is to
fit entirely into cache or some other layer of physical memory,
undoing the hoped-for gains.
2. Macros in C "evaluate" their arguments each time the argument is
mentioned inside the macro. If the actual argument passed to the macro
is a complicated expression or function call, the net result very well
may be an increase in CPU time. Performing multiple side-effects when
the caller did not expect that will almost certainly make the program
buggy.
3. Because these macros can contain complicated statements, optimizers
have a hard time figuring them out and may give up. Also don't forget
that there's a limit on how many characters a macro can have.
4. Profilers don't see macros so it's hard to optimize any further once
you've done this.

LOOP UNROLLING

Many compilers (e.g. gcc -funroll-loops) will do this, but if you know that
yours doesn't you can change your source code a bit to get the same effect.

Old code:
for (i = 0; i < 100; i++)
{
    do_stuff(i);
}

New code:

for (i = 0; i < 100; )
{
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
}

This way the test for i < 100 and the branch back up to the top of the loop
only get executed 11 times rather than 101. Loop unrolling works best when
the loop is executed a fixed non-prime number of times and the iteration
variable is only modified in one place (aside from its initialization). If
do_stuff() didn't make use of i, all the little i++'s could be replaced by
a single i += 10. Re-arranging the for loop into a do-while loop can make
the 11 into 10. If the loop only went to, say, five rather than 100 you
could unroll the loop completely and eliminate the branching and testing
entirely.

For computations which converge on some result, compilers will often refuse
to unroll on the grounds that unrolling will change the result. If the
application is not sensitive to excess precision you can arrange to test
the convergence less often by duplicating the interior of the loop. This is
especially useful if the test for convergence is expensive in relation to
the calculation of the result.

An unrolled loop is larger than the "rolled" version and so may no longer
fit into the instruction cache (on machines which have them). This will
make the unrolled version slower. Also, in this example, the call to
do_stuff() overshadows the cost of the loop, so any savings from loop
unrolling are insignificant in comparison to what you'd achieve from
inlining in this case.

If you happen to be using a vectorizing compiler, loop unrolling can
interfere with the vector optimizations.

LOOP JAMMING

The idea is to combine adjacent loops which loop over the same range of the
same variable. Assuming nothing in the second loop indexes forward (for
example array[i+3]), you can do this:

Old code:
for (i = 0; i < MAX; i++)      /* initialize 2d array to 0's */
    for (j = 0; j < MAX; j++)
        a[i][j] = 0.0;
for (i = 0; i < MAX; i++)      /* put 1's along the diagonal */
    a[i][i] = 1.0;

New code:

for (i = 0; i < MAX; i++)      /* initialize 2d array to 0's */
{
    for (j = 0; j < MAX; j++)
        a[i][j] = 0.0;
    a[i][i] = 1.0;             /* put 1's along the diagonal */
}

And now the incrementing and testing of i is done only half as often. Under
some circumstances, locality of reference may be better, improving cache
behavior. (The example above was lifted from Aho & Ullman.)

LOOP INVERSION

Some machines have a special instruction for decrement and compare with 0.
Assuming the loop is insensitive to direction, try this replacement:

Old code:
for (i = 1; i <= MAX; i++)
{
    ...
}

New code:

i = MAX + 1;
while (--i)
{
...
}

Though watch out for problems with lookahead caches as noted below in MARCH
FORWARD. If you plan on doing this in combination with pointer arithmetic
note that while ANSI C has a special rule that allows you to set an index
to one element past the end of the array, there is no similar rule for one
element before the beginning of the array.

STRENGTH REDUCTION

Strength reduction is the replacement of an expression by a different
expression that yields the same value but is cheaper to compute. Many
compilers will do this for you automatically. The classic examples:

Old code:
x = w % 8;
y = pow(x, 2.0);
z = y * 33;
for (i = 0; i < MAX; i++)
{
    h = 14 * i;
    printf("%d", h);
}

New code:

x = w & 7;          /* bit-and cheaper than remainder */
y = x * x;          /* mult is cheaper than power-of */
z = (y << 5) + y;   /* shift & add cheaper than mult */
for (i = h = 0; i < MAX; i++)
{
    h += 14;        /* addition cheaper than mult */
    printf("%d", h);
}

Also note that array indexing in C is basically a multiply and an add. The
multiply part can be subjected to strength reduction under some
circumstances, notably when looping through an array.

LOOP INVARIANT COMPUTATIONS

Any part of a computation that does not depend on the loop variable and
which is not subject to side effects can be moved out of the loop entirely.
Most compilers are pretty good at doing this. Try to keep the computations
within the loop simple anyway, and be prepared to move invariant
computations out yourself: there may be some situations where you know the
value won't vary, but the compiler is playing it safe in case of
side-effects. "Computation" here doesn't mean just arithmetic; array
indexing, pointer dereferencing, and calls to pure functions are all
possible candidates for moving out of the loop.

In loops which call other functions, you might be able to get some speedup
by ripping the subroutines apart and figuring out which parts of them are
loop-invariant for that particular loop in their caller and calling those
parts ahead of time. This is not very easy and seldom leads to much
improvement unless you're calling subroutines which open and close files
repeatedly or malloc and free large amounts of memory on each call or
something else drastic.

A common but not-always-optimized-away case is the repeated use of an
expression in successive statements. This doesn't have to be related to a
loop at all. You can try this replacement and see if it helps:

Old code:
total =
    a->b->c[4]->aardvark +
    a->b->c[4]->baboon +
    a->b->c[4]->cheetah +
    a->b->c[4]->dog;

New code:

struct animals * temp = a->b->c[4];
total =
    temp->aardvark +
    temp->baboon +
    temp->cheetah +
    temp->dog;

Older C compilers were allowed to regroup arithmetic expressions to some
extent. ANSI codified that arithmetic expressions are evaluated as they are
grouped so as to avoid any unwanted surprises. What this means is that

float a, b, c, d, f, g;
...
a = b / c * d;
f = b * g / c;

will not be seen as having the common subexpression b / c. If you rewrite
the second expression:

float a, b, c, d, f, g;
...
a = b / c * d;
f = b / c * g;

an ANSI C compiler is now free to compute b / c only once. Note that the
new code may behave differently for overflow or provide slightly different
results with floating point.

CODE FOR COMMON CASE

In a section of code which deals with several alternative situations, place
at the beginning the tests and the code for the situations which occur most
often. Frequently, this takes the form of a long train of mutually
exclusive if-then-else's, of which only one will get executed. By placing
the most likely one first, fewer if's will need to be performed over the
long term. But if the conditions are simple things like x == 3, consider using a
switch statement. Some compilers are quite sophisticated in how they
translate switch statements.

RECURSION ELIMINATION

In contrast to many LISP-like languages, it's typical to write iterative
code in C to begin with, even when it would be possible to use recursion.
Since most C programmers use iteration by default unless the algorithm
demands recursion, many C compilers don't bother performing
optimizations related to recursion. On the rare occasions when a recursive
function is a bottleneck AND it's the recursion itself that's taking the
time rather than the statements performed at each recurse, you may want to
rewrite your function to be iterative.

TABLE LOOKUP

A time-honored hack and one which few compilers will do on their own.
Computations of simple functions which operate on a small domain of values
can be performed ahead of time with the values placed in a lookup table. If
the computation is more expensive than an array index then it'll be faster.

Old code:
long factorial(int i)
{
    if (i == 0)
        return 1;
    else
        return i * factorial(i - 1);
}

New code:

static long factorial_table[] =
    {1, 1, 2, 6, 24, 120, 720 /* etc */};

long factorial(int i)
{
    return factorial_table[i];
}

If the table is too large to type, you can have some initialization code
compute all the values on startup or have a second program generate the
table and printf it out in a form suitable for #include-ing into a source
file. At some point, the size of the table will have a noticeable effect on
paging space. You could have the table cover the first N cases then augment
this with a function to compute the rest. As long as a significant number
of the values requested are in the table there will be a net win.

VARIABLES

Avoid referring to global or static variables inside the tightest loops.
Don't use the volatile qualifier unless you really mean it. Most compilers
take it to mean roughly the opposite of register, and will deliberately not
optimize expressions involving the volatile variable. Avoid passing addresses of your variables to other functions. The optimizer
has to assume that the called function is capable of stashing a pointer to
this variable somewhere and so the variable could get modified as a side
effect of calling what seems like a totally unrelated function. At less
intense levels of optimization, the optimizer may even assume that a signal
handler could modify the variable at any time. These cases interfere with
placing variables in registers, which is very important to optimization.

Example:

a = b();
c(&d);

Because d has had its address passed to another function, the compiler can
no longer leave it in a register across function calls. It can however
leave the variable a in a register. The register keyword can be used to
track down problems like this; if d had been declared register the compiler
would have to warn that its address had been taken.

FUNCTION CALLS

While functions and modularity are a good thing, a function call inside an
oft-executed loop is a possible bottleneck. There are several facets to the
problem, some of which I've alluded to above. Aside from the expense of
executing the instructions in the other function:

* Function calls interrupt an optimizer's train of thought in a drastic
way. Any references through pointers or to global variables are now
"dirty" and need to be saved/restored across the function call. Local
variables which have had their address taken and passed outside the
function are also now dirty, as noted above.
* There is some overhead to the function call itself as the stack must
be manipulated and the program counter altered by whatever mechanism
the CPU uses.
* If the function being called happens to be paged out, there will be a
very long delay before it gets read back in. For functions called in a
loop it's unusual for the called function to be paged out until the
loop is finished, but if virtual memory is scarce, calls to other
functions in the same loop may demand the space and force the other
function out, leading to thrashing. Most linkers respect the order in
which you list object files, so you can try to get functions near each
other in hopes that they'll land on the same page.

DIGESTIBILITY

Straight-line code, even with an extra statement or two, will run faster
than code full of if's, &&'s, switch's, and goto's. Pipelining processors
are much happier with a steady diet of sequential instructions than a bunch
of branches, even if the branches skip some unnecessary sections.

STRING OPERATIONS

Most of the C library str* and mem* functions operate in time proportional
to the length(s) of the string(s) they are given. It's quite easy to loop
over calls to these and wind up with a significant bottleneck. Some things
that may ease the situation:

strlen

Avoid calling strlen() during a loop involving the string itself. Even if
you're modifying the string, it should be possible to rewrite it so that
you set x = strlen() before the loop and then x++ or x-- when you add or
remove a character.

strcat

When building up a large string in memory using strcat, it will scan the
full (current) length of the string on each call. If you've been keeping
track of the length anyway (see above) you can index directly to the end of
the string and strcpy or memcpy to there.

strcmp

You can save a little time by checking the first characters of the strings
in question before doing the call. Obviously, if the first characters
differ, there's no reason to call strcmp to check the rest. Because of the
non-uniform distribution of letters in natural languages, the payoff is not
26:1 but more like 15:1 for uppercase data.

#define QUICKIE_STRCMP(a, b)  (*(a) != *(b) ? \
(int) ((unsigned char) *(a) - \
(unsigned char) *(b)) : \
strcmp((a), (b)))

But watch out:

1. There's a double evaluation going on, so this can be
counter-productive if the arguments to strcmp aren't just simple
variables.
2. If you're working on adjacent strings in sorted input, you will almost
always have to check past the first character anyway.

An entirely different way to speed up strcmp is to place all your strings
into a single array, in order. Then you only have to compare the pointers,
not the strings. If the point of all the calls to strcmp is to search for a
value from a large, known set and you expect that you'll be doing many such
searches then you'll want to invest in a hash table.

strlen(s) == 0

Empty string tests are common, and even experienced programmers may use
strlen for this purpose. But if the strings you are working with are of
substantial size, strlen will dutifully check each character until it finds
the NUL terminator. Replace with *s == '\0' which saves the function call
overhead plus it only has to check the first character. There can be a
substantial improvement if your character strings are fairly long.

strcpy(s, "")

*s = '\0' works just fine and saves a function call.

strncpy(s, t, n)

Be aware that strncpy pads short strings with 0's. Except for the first
one, which terminates the string, the extra zeroes are typically never
used, so some time is wasted, though typically not very much.

memcpy/memmove

Generally, memcpy is faster than memmove because it can assume that its
arguments don't overlap. If you are tempted to replace these with your own
versions, be sure to verify that your version (a) works and (b) is faster.

FP PARALLELISM

In some really weird cases (early RISC chips with FPUs, which includes many
SPARC-based machines) you can get some speed by turning your integer
multiplies, divides, and modulos into float operations. This does two
things: the FPU is taking some of the load off the main CPU so it can get
some other stuff done; but mostly some RISC machines emulate integer
multiply and divide in software and the FPU can actually do it faster
anyway. This is not a sure thing and you should try it both ways on all the
machines you plan to run on before you do this. Also, converting
back-and-forth between ints and floats can take up considerable time. I
have heard this is an acute problem on Apollos.

On many machines, floats work faster than doubles. If the bottleneck
involves FP arithmetic and you don't need the extra precision, change the
pertinent variables to float and see what happens. But similar to above,
the cost of converting between floats and doubles can outweigh the benefits
if applied indiscriminately. Note that on K&R compilers and ANSI compilers
used w/o prototypes, there is an automatic conversion of floats to doubles
in function calls and in many expressions.

GET A BETTER COMPILER

gcc produces somewhat faster code than the compilers that come with some
OS's. Try compiling your code with all the compilers you have at your
disposal and use the fastest one for compiling the one or two functions
which are bottlenecks. For the rest just use the one that gives the most
informative error messages and produces the most reliable output. This is
more difficult with C++ since different compilers can use incompatible
name-mangling schemes.

Compilers are evolving even as we speak, so keep up with the latest
revision if possible.

STACK USAGE

A typical cause of stack-related problems is having large arrays as local
variables. In that case the solution is to rewrite the code so it can use a
static or global array, or perhaps allocate it from the heap. A similar
solution applies to functions which have large structs as locals or
parameters. (On machines with small stacks, a stack bound program may not
just run slow, it might stop entirely when it runs out of stack space.)

Recursive functions, even ones which have few and small local variables and
parameters, can still affect performance. On some ancient machines there is
a substantial amount of overhead associated with function calls, and
turning a recursive algorithm into an iterative one can save some time.

A related issue is last-call optimization. Many LISP and Scheme compilers
do this automatically, but few C compilers support it. Consider this
example:

int func1()
{
    int a, b, c, etc;

    do_stuff(a, b, c);
    if (some_condition)
        return func2();
    else
        return 1;
}

Since func1()'s last statement is to call func2() and func1() has no need
of its variables after that point, it can remove its stack frame. When
func2() is done it returns directly to func1()'s caller. This (a) reduces
the maximum depth of the stack and (b) saves the execution of some
return-from-subroutine code as it will get executed only once instead of
twice (or more, depending on how deeply the function results are passed
along.)

If func1() and func2() are the same function (recursion) the compiler can
do something else: the stack can be left in place and the call can be
replaced by a goto back to the top of the function. This is called
tail-recursion elimination.

CODE IT IN ASSEMBLY

Estimates vary widely, but a competent human writing assembly-level code
can produce code which runs about 10% faster than what a compiler with full
optimization on would produce from well-written high-level source. Of
course, this is not a portable solution. RISC assembler is especially
difficult to write by hand.

Experts vary on the approach one should take. Some suggest writing your own
assembler version of a function from scratch in hopes that you'll find a
novel way of calculating the result while others recommend taking the
compiler's version as a starting point and just fine-tuning it.

SHARED LIBRARY OVERHEAD

Dynamically linked shared libraries are a really nifty thing. In many cases
though, calling a dynamically linked function is slightly slower than it
would be to call it statically. The principal part of this extra cost is a
one-time thing: the fi