Optimization of Computer Programs in C

Michael E. Lee
Senior Programmer/Analyst
Ontek Corporation
22941 Mill Creek Road
Laguna Hills, CA 92953 USA

Abstract: A substantial portion of a knowledge worker's life may be spent
waiting for a computer program to produce output. Users and organizations
control their wait time by purchasing faster computers, adding memory, or
using faster network connections. Developers of application programs have a
responsibility to design their programs to make the best use of these
limited and expensive resources.

This document describes techniques for optimizing (improving the speed of)
computer programs written in C. It focuses on minimizing time spent by the
CPU and gives sample source code transformations that often yield
improvements. Memory and I/O speed improvements are also discussed.
---------------------------------------------------------------------------

CONTENTS

1. Contents
2. Introduction
3. Compute-bound
o Choose a Better Algorithm
o Write Clear, Simple Code
o Perspective
o Understand your Compiler Options
o Inlining
o Loop Unrolling
o Loop Jamming
o Loop Inversion
o Strength Reduction
o Loop Invariant Computations
o Code for Common Case
o Recursion Elimination
o Table Lookup
o Variables
o Function Calls
o Digestibility
o String Operations
o FP Parallelism
o Get a Better Compiler
o Stack Usage
o Code it in Assembly
o Shared Library Overhead
o Machine-Specific Optimization
4. Memory-bound
o Locality of Reference
o Column-major Accessing
o Don't Copy Large Things
o Split or Merge Arrays
o Reduce Padding
o Increase Padding
o March Forward
o Beware the Power of Two
o Memory Leaks
o Be Stringy
o Hinting
o Fix the Problem in Hardware
o Cache Profilers
5. IO-bound
o Sequential Access
o Random Access
o Terminals
o Sockets
o SFIO
6. Gotchas
7. Glossary
8. References
9. Acknowledgements

---------------------------------------------------------------------------

INTRODUCTION

The single most effective optimization technique is to use a profiler to
identify performance bottlenecks. It's often difficult to guess what part
of your program is consuming the most resources, and if you base your
optimization efforts on speculation instead of real data, you'll waste a
lot of time speeding up the parts of your program that were fast already.

Once you have identified a bottleneck, for example a loop that is executed
thousands of times, remember that the best thing to do is to redesign the
program so that it doesn't need to execute the loop thousands of times.
This is more effective than making the loop run 10% faster yet execute just
as often; a 10% speedup is precisely the sort of thing an optimizing
compiler can achieve by itself.

Optimization is simply a waste of programmer time if any of these
statements are true:

* parts of the program haven't been written yet
* the program is not fully tested and debugged
* it seems to run fast enough already

Also take into account how the program will be used. If it's a
report-generator program which only needs to be run once a day, the user
may start it before going off to lunch and if so there's really no point in
making it finish before they get back. If it's being invoked from within
another program that's even slower than yours, again the user will notice
no difference. But if it handles mouse-tracking events for a GUI the user
will complain about any noticeable delay.

Given that optimizing is reasonable, compile in full optimize mode and run
your program on "real-world" input data. If you don't have access to real
input data, choose your test input data with care: programmers tend to test
with a smaller set of input data and a different variety of cases than the
program will likely encounter once it's in the hands of users.

Use the time(1) command to see if your program is compute-bound, memory
bound, or I/O bound, or for that matter if it's bound at all: even if your
program "seems" to be taking a long time to run, it may only be a second of
CPU time. You may have more "time" commands on your system than you
realize: sometimes it's built into the shell (SunOS csh for example) and
has lots of nifty options. You can also get performance information from
getrusage() if you have it, and of course from profiling programs like
gprof, prof, and tcov.
---------------------------------------------------------------------------

COMPUTE BOUND

Recompile your program with profiling enabled and whatever optimization
options are compatible with that. Run your program, again on real world
data, and generate a profiling report. Figure out which function uses the
most CPU time, then look it over very carefully and see if any of these
approaches might be useful. Make one change at a time, then run the
profiler again, repeating the process until there is no obvious bottleneck
or the program runs sufficiently fast.

The first four are quite important; the rest are in no particular order.

CHOOSE A BETTER ALGORITHM

Think about what the code is really doing. Become familiar with the body of
literature that describes your specialty and learn and use the most
appropriate algorithms, even if you didn't come up with them yourself. You
should be familiar with O(n) notation, which is defined in many computer
science texts. Some of the obvious replacements:

    slow algorithm              replace with
    --------------              ------------
    sequential search           binary search or hash lookup
    insertion or bubble sort    quick sort, merge sort, radix sort

Also choose an appropriate data structure. If you'll be doing a lot of
insertions and deletions at random places then a linked list would be good.
If you'll be doing some binary searching, an array would be better.

WRITE CLEAR, SIMPLE CODE

Some of the very things that make code clear and readable to humans also
make it clear and readable to compilers. Complicated expressions are harder
to optimize and can cause the compiler to "fallback" to a less intense mode
of optimization. I've seen one compiler that has translation-unit limits
which when overrun will cause the entire module to be de-optimized by one
level. Part of the clarity is making hunks of code into functions when
appropriate. The cost of a function call is extremely small on modern
machines, so optimization is NOT a valid excuse for writing ten-page
functions.

If you write clean, portable code you can quickly port to the latest,
fastest machine and offer that as a solution to your customers who are
interested in speed.

PERSPECTIVE

Get a sense for how long certain operations take. Among the slowest are
opening a file, reading or writing significant amounts of data, starting a
new process, searching, sorting, operations on entire arrays, and copying
large amounts of data around. The fastest operations are basic elements of
the language like assigning to a variable, dereferencing a pointer, or
adding two integers. None of these operations take very long in and of
themselves (a few microseconds) but all computer languages allow sections
of code to be executed repeatedly. If you perform even the fastest
operation 10 million times, it will take a noticeable amount of time. In
real programs, various things happen various numbers of times and having
some common sense can help you interpret the output from your profiler.

A sure sign of misunderstanding is this fragment:

if (x != 0) x = 0;

The intent is to save time by not initializing x if it's already zero. In
reality, the test to see whether it's zero or not will take up about as
much time as setting it to zero itself would have.

x = 0;

has the same effect and will be somewhat faster.

For the intrepid hacker, there is no substitute for examining the
assembler-level output the compiler generates and counting the
instructions. If you do this, don't forget to include wait states in your
tally. Also keep in mind that some optimization and instruction scheduling
is put off until link time and won't show up in the assembler output for an
individual module.

I have a small, fairly portable program that prints out the approximate
unoptimized speed of basic C operators and library functions, speed.c.
It's not a benchmark (the measurements are not weighted or averaged in
any way) but it'll give a beginner a numerical sense of the cost of various
things in C.
Note: (a) I have no anonymous ftp server at my site, so you'll have to look
at it in your browser and save it to a file or copy and paste it into an
editor. (b) Be sure to compile it with optimization off. Most of the code
is trivial and would be eliminated by an optimizer; the timing would end up
being mostly zeroes.

UNDERSTAND YOUR COMPILER OPTIONS

Most compilers have several levels of optimization. Make sure you're using
the highest level. One popular compiler, gcc, has an incredible variety of
optimization levels and options. Some compilers have special #pragmas or
keywords (for example, inline) which also affect optimization. High levels of optimization can make poorly written (but compilable) code
break. Less often, optimizers mangle perfectly correct code. Problems with
side effects and evaluation order tend to screw things up, so you may have
to fix the code up a little before you can use the highest level. Signal
handlers activated at inopportune times may disturb an optimized program in
ways that wouldn't disturb an unoptimized one.

INLINING

gcc (with the -finline-functions option) and a few other compilers are
capable of inlining small functions at higher optimization levels
automatically. K&R C compilers that I've seen won't do it at all or will
only do it for library functions written in assembly (for example the math
library). C++ compilers almost universally support inline functions.

If necessary, C functions can be recoded as macros to obtain similar
speedup on compilers with no inlining capability. This should be done after
the code is completely debugged. All debuggers I've seen are incapable of
displaying or stepping through the expanded macro text.

An example of inlining via macros:

Old code:

int foo(a, b)
{
    a = a - b;
    b++;
    a = a * b;
    return a;
}

New code:

#define foo(a, b) (((a)-(b)) * ((b)+1))

The extra parentheses are necessary to preserve grouping in case foo is
used in an expression with higher precedence than * or in case a and/or b
contain subexpressions with lower precedence than + or -.

Comma expressions and do { ... } while(0) can be used for more complicated
functions, with some restrictions. The do-while macros let you create local
variables for use in the macro, but you can't return a value to the
expression the macro is used in. The opposite is true for macros using
comma expressions.

Some caveats:

1. Gratuitously making every function in sight into a macro leads to
massive code-bloat and can increase the amount of memory your program
needs dramatically. The larger a program is, the less likely it is to
fit entirely into cache or some other layer of physical memory,
undoing the hoped-for gains.
2. Macros in C "evaluate" their arguments each time the argument is
mentioned inside the macro. If the actual argument passed to the macro
is a complicated expression or function call, the net result very well
may be an increase in CPU time. Performing multiple side-effects when
the caller did not expect that will almost certainly make the program
buggy.
3. Because these macros can contain complicated statements, optimizers
have a hard time figuring them out and may give up. Also don't forget
that there's a limit on how many characters a macro can have.
4. Profilers don't see macros so it's hard to optimize any further once
you've done this.

LOOP UNROLLING

Many compilers (e.g. gcc -funroll-loops) will do this, but if you know that
yours doesn't you can change your source code a bit to get the same effect.

Old code:
for (i = 0; i < 100; i++)
{
    do_stuff(i);
}

New code:

for (i = 0; i < 100; )
{
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
}

This way the test for i < 100 and the branch back up to the top of the loop
only get executed 11 times rather than 101. Loop unrolling works best when
the loop is executed a fixed non-prime number of times and the iteration
variable is only modified in one place (aside from its initialization). If
do_stuff() didn't make use of i, all the little i++'s could be replaced by
a single i += 10. Re-arranging the for loop into a do-while loop can make
the 11 into 10. If the loop only went to, say, five rather than 100 you
could unroll the loop completely and eliminate the branching and testing
entirely.

For computations which converge on some result, compilers will often refuse
to unroll on the grounds that unrolling will change the result. If the
application is not sensitive to excess precision you can arrange to test
the convergence less often by duplicating the interior of the loop. This is
especially useful if the test for convergence is expensive in relation to
the calculation of the result.

An unrolled loop is larger than the "rolled" version and so may no longer
fit into the instruction cache (on machines which have them). This will
make the unrolled version slower. Also, in this example, the call to
do_stuff() overshadows the cost of the loop, so any savings from loop
unrolling are insignificant in comparison to what you'd achieve from
inlining in this case.

If you happen to be using a vectorizing compiler, loop unrolling can
interfere with the vector optimizations.

LOOP JAMMING

The idea is to combine adjacent loops which loop over the same range of the
same variable. Assuming nothing in the second loop indexes forward (for
example array[i+3]), you can do this:

Old code:
for (i = 0; i < MAX; i++)      /* initialize 2d array to 0's */
    for (j = 0; j < MAX; j++)
        a[i][j] = 0.0;
for (i = 0; i < MAX; i++)      /* put 1's along the diagonal */
    a[i][i] = 1.0;

New code:

for (i = 0; i < MAX; i++)      /* initialize 2d array to 0's */
{
    for (j = 0; j < MAX; j++)
        a[i][j] = 0.0;
    a[i][i] = 1.0;             /* put 1's along the diagonal */
}

And now the incrementing and testing of i is done only half as often. Under
some circumstances, locality of reference may be better, improving cache
behavior. (The example above was lifted from Aho & Ullman.)

LOOP INVERSION

Some machines have a special instruction for decrement and compare with 0.
Assuming the loop is insensitive to direction, try this replacement:

Old code:
for (i = 1; i <= MAX; i++)
{
    ...
}

New code:

i = MAX + 1;
while (--i)
{
...
}

Though watch out for problems with lookahead caches as noted below in MARCH
FORWARD. If you plan on doing this in combination with pointer arithmetic
note that while ANSI C has a special rule that allows you to set an index
to one element past the end of the array, there is no similar rule for one
element before the beginning of the array.

STRENGTH REDUCTION

Strength reduction is the replacement of an expression by a different
expression that yields the same value but is cheaper to compute. Many
compilers will do this for you automatically. The classic examples:

Old code:
x = w % 8;
y = pow(x, 2.0);
z = y * 33;
for (i = 0; i < MAX; i++)
{
    h = 14 * i;
    printf("%d", h);
}

New code:

x = w & 7;          /* bit-and cheaper than remainder */
y = x * x;          /* mult is cheaper than power-of */
z = (y << 5) + y;   /* shift & add cheaper than mult */
for (i = h = 0; i < MAX; i++)
{
    h += 14;        /* addition cheaper than mult */
    printf("%d", h);
}

Also note that array indexing in C is basically a multiply and an add. The
multiply part can be subjected to strength reduction under some
circumstances, notably when looping through an array.

LOOP INVARIANT COMPUTATIONS

Any part of a computation that does not depend on the loop variable and
which is not subject to side effects can be moved out of the loop entirely.
Most compilers are pretty good at doing this. Try to keep the computations
within the loop simple anyway, and be prepared to move invariant
computations out yourself: there may be some situations where you know the
value won't vary, but the compiler is playing it safe in case of
side-effects. "Computation" here doesn't mean just arithmetic; array
indexing, pointer dereferencing, and calls to pure functions are all
possible candidates for moving out of the loop.

In loops which call other functions, you might be able to get some speedup
by ripping the subroutines apart and figuring out which parts of them are
loop-invariant for that particular loop in their caller and calling those
parts ahead of time. This is not very easy and seldom leads to much
improvement unless you're calling subroutines which open and close files
repeatedly or malloc and free large amounts of memory on each call or
something else drastic.

A common but not-always-optimized-away case is the repeated use of an
expression in successive statements. This doesn't have to be related to a
loop at all. You can try this replacement and see if it helps:

Old code:
total =
    a->b->c[4]->aardvark +
    a->b->c[4]->baboon +
    a->b->c[4]->cheetah +
    a->b->c[4]->dog;

New code:

struct animals * temp = a->b->c[4];
total =
    temp->aardvark +
    temp->baboon +
    temp->cheetah +
    temp->dog;

Older C compilers were allowed to regroup arithmetic expressions to some
extent. ANSI codified that arithmetic expressions are evaluated as they are
grouped so as to avoid any unwanted surprises. What this means is that

float a, b, c, d, f, g;
...
a = b / c * d;
f = b * g / c;

will not be seen as having the common subexpression b / c. If you rewrite
the second expression:

float a, b, c, d, f, g;
...
a = b / c * d;
f = b / c * g;

an ANSI C compiler is now free to compute b / c only once. Note that the
new code may behave differently for overflow or provide slightly different
results with floating point.

CODE FOR COMMON CASE

In a section of code which deals with several alternative situations, place
at the beginning the tests and the code for the situations which occur most
often. Frequently, this takes the form of a long train of mutually
exclusive if-then-else's, of which only one will get executed. By placing
the most likely one first, fewer if's will need to be performed over the
long term. But if the conditions are simple things like x == 3, consider using a
switch statement. Some compilers are quite sophisticated in how they
translate switch statements.

RECURSION ELIMINATION

In contrast to many LISP-like languages, it's typical to write iterative
code in C to begin with, even when it would be possible to use recursion.
Since most C programmers use iteration by default unless the algorithm
demands recursion, many C compilers don't bother performing
optimizations related to recursion. On the rare occasions when a recursive
function is a bottleneck AND it's the recursion itself that's taking the
time rather than the statements performed at each recurse, you may want to
rewrite your function to be iterative.

TABLE LOOKUP

A time-honored hack and one which few compilers will do on their own.
Computations of simple functions which operate on a small domain of values
can be performed ahead of time with the values placed in a lookup table. If
the computation is more expensive than an array index then it'll be faster.

Old code:
long factorial(int i)
{
    if (i == 0)
        return 1;
    else
        return i * factorial(i - 1);
}

New code:

static long factorial_table[] =
    {1, 1, 2, 6, 24, 120, 720 /* etc */};

long factorial(int i)
{
    return factorial_table[i];
}

If the table is too large to type, you can have some initialization code
compute all the values on startup or have a second program generate the
table and printf it out in a form suitable for #include-ing into a source
file. At some point, the size of the table will have a noticeable effect on
paging space. You could have the table cover the first N cases then augment
this with a function to compute the rest. As long as a significant number
of the values requested are in the table there will be a net win.

VARIABLES

Avoid referring to global or static variables inside the tightest loops.
Don't use the volatile qualifier unless you really mean it. Most compilers
take it to mean roughly the opposite of register, and will deliberately not
optimize expressions involving the volatile variable. Avoid passing addresses of your variables to other functions. The optimizer
has to assume that the called function is capable of stashing a pointer to
this variable somewhere and so the variable could get modified as a side
effect of calling what seems like a totally unrelated function. At less
intense levels of optimization, the optimizer may even assume that a signal
handler could modify the variable at any time. These cases interfere with
placing variables in registers, which is very important to optimization.

Example:

a = b();
c(&d);

Because d has had its address passed to another function, the compiler can
no longer leave it in a register across function calls. It can however
leave the variable a in a register. The register keyword can be used to
track down problems like this; if d had been declared register the compiler
would have to warn that its address had been taken.

FUNCTION CALLS

While functions and modularity are a good thing, a function call inside an
oft-executed loop is a possible bottleneck. There are several facets to the
problem, some of which I've alluded to above. Aside from the expense of
executing the instructions in the other function:

* Function calls interrupt an optimizer's train of thought in a drastic
way. Any references through pointers or to global variables are now
"dirty" and need to be saved/restored across the function call. Local
variables which have had their address taken and passed outside the
function are also now dirty, as noted above.
* There is some overhead to the function call itself as the stack must
be manipulated and the program counter altered by whatever mechanism
the CPU uses.
* If the function being called happens to be paged out, there will be a
very long delay before it gets read back in. For functions called in a
loop it's unusual for the called function to be paged out until the
loop is finished, but if virtual memory is scarce, calls to other
functions in the same loop may demand the space and force the other
function out, leading to thrashing. Most linkers respect the order in
which you list object files, so you can try to get functions near each
other in hopes that they'll land on the same page.

DIGESTIBILITY

Straight-line code, even with an extra statement or two, will run faster
than code full of if's, &&'s, switch's, and goto's. Pipelining processors
are much happier with a steady diet of sequential instructions than a bunch
of branches, even if the branches skip some unnecessary sections.

STRING OPERATIONS

Most of the C library str* and mem* functions operate in time proportional
to the length(s) of the string(s) they are given. It's quite easy to loop
over calls to these and wind up with a significant bottleneck. Some things
that may ease the situation:

strlen

Avoid calling strlen() during a loop involving the string itself. Even if
you're modifying the string, it should be possible to rewrite it so that
you set x = strlen() before the loop and then x++ or x-- when you add or
remove a character.

strcat

When building up a large string in memory using strcat, it will scan the
full (current) length of the string on each call. If you've been keeping
track of the length anyway (see above) you can index directly to the end of
the string and strcpy or memcpy to there.

strcmp

You can save a little time by checking the first characters of the strings
in question before doing the call. Obviously, if the first characters
differ, there's no reason to call strcmp to check the rest. Because of the
non-uniform distribution of letters in natural languages, the payoff is not
26:1 but more like 15:1 for uppercase data.

#define QUICKIE_STRCMP(a, b)  (*(a) != *(b) ? \
(int) ((unsigned char) *(a) - \
(unsigned char) *(b)) : \
strcmp((a), (b)))

But watch out:

1. There's a double evaluation going on, so this can be
counter-productive if the arguments to strcmp aren't just simple
variables.
2. If you're working on adjacent strings in sorted input, you will almost
always have to check past the first character anyway.

An entirely different way to speed up strcmp is to place all your strings
into a single array, in order. Then you only have to compare the pointers,
not the strings. If the point of all the calls to strcmp is to search for a
value from a large, known set and you expect that you'll be doing many such
searches then you'll want to invest in a hash table.

strlen(s) == 0

Empty string tests are common, and even experienced programmers may use
strlen for this purpose. But if the strings you are working with are of
substantial size, strlen will dutifully check each character until it finds
the NUL terminator. Replace with *s == '\0' which saves the function call
overhead plus it only has to check the first character. There can be a
substantial improvement if your character strings are fairly long.

strcpy(s, "")

*s = '\0' works just fine and saves a function call.

strncpy(s, t, n)

Be aware that strncpy pads short strings with 0's. Except for the first
one, which terminates the string, the extra zeroes are typically never
used, so some time is wasted, though typically not very much.

memcpy/memmove

Generally, memcpy is faster than memmove because it can assume that its
arguments don't overlap. If you are tempted to replace these with your own
versions, be sure to verify that your version (a) works and (b) is faster.

FP PARALLELISM

In some really weird cases (early RISC chips with FPUs, which includes many
SPARC-based machines) you can get some speed by turning your integer
multiplies, divides, and modulos into float operations. This does two
things: the FPU is taking some of the load off the main CPU so it can get
some other stuff done; but mostly some RISC machines emulate integer
multiply and divide in software and the FPU can actually do it faster
anyway. This is not a sure thing and you should try it both ways on all the
machines you plan to run on before you do this. Also, converting
back-and-forth between ints and floats can take up considerable time. I
have heard this is an acute problem on Apollos.

On many machines, floats work faster than doubles. If the bottleneck
involves FP arithmetic and you don't need the extra precision, change the
pertinent variables to float and see what happens. But similar to above,
the cost of converting between floats and doubles can outweigh the benefits
if applied indiscriminately. Note that on K&R compilers and ANSI compilers
used w/o prototypes, there is an automatic conversion of floats to doubles
in function calls and in many expressions.

GET A BETTER COMPILER

gcc produces somewhat faster code than the compilers that come with some
OS's. Try compiling your code with all the compilers you have at your
disposal and use the fastest one for compiling the one or two functions
which are bottlenecks. For the rest just use the one that gives the most
informative error messages and produces the most reliable output. This is
more difficult with C++ since different compilers can use incompatible
name-mangling schemes.

Compilers are evolving even as we speak, so keep up with the latest
revision if possible.

STACK USAGE

A typical cause of stack-related problems is having large arrays as local
variables. In that case the solution is to rewrite the code so it can use a
static or global array, or perhaps allocate it from the heap. A similar
solution applies to functions which have large structs as locals or
parameters. (On machines with small stacks, a stack bound program may not
just run slow, it might stop entirely when it runs out of stack space.)

Recursive functions, even ones which have few and small local variables and
parameters, can still affect performance. On some ancient machines there is
a substantial amount of overhead associated with function calls, and
turning a recursive algorithm into an iterative one can save some time.

A related issue is last-call optimization. Many LISP and Scheme compilers
do this automatically, but few C compilers support it. Consider this
example:

int func1()
{
    int a, b, c, etc;

    do_stuff(a, b, c);
    if (some_condition)
        return func2();
    else
        return 1;
}

Since func1()'s last statement is to call func2() and func1() has no need
of its variables after that point, it can remove its stack frame. When
func2() is done it returns directly to func1()'s caller. This (a) reduces
the maximum depth of the stack and (b) saves the execution of some
return-from-subroutine code as it will get executed only once instead of
twice (or more, depending on how deeply the function results are passed
along.)

If func1() and func2() are the same function (recursion) the compiler can
do something else: the stack can be left in place and the call can be
replaced by a goto back to the top of the function. This is called
tail-recursion elimination.

CODE IT IN ASSEMBLY

Estimates vary widely, but a competent human writing assembly-level code
can produce code which runs about 10% faster than what a compiler with full
optimization on would produce from well-written high-level source. Of
course, this is not a portable solution. RISC assembler is especially
difficult to write by hand.

Experts vary on the approach one should take. Some suggest writing your own
assembler version of a function from scratch in hopes that you'll find a
novel way of calculating the result while others recommend taking the
compiler's version as a starting point and just fine-tuning it.

SHARED LIBRARY OVERHEAD

Dynamically linked shared libraries are a really nifty thing. In many cases
though, calling a dynamically linked function is slightly slower than it
would be to call it statically. The principal part of this extra cost is a
one-time thing: the fi