To use this tool, specify --tool=cachegrind on the Valgrind command line.

Cachegrind is a high-precision tracing profiler. It runs slowly, but collects precise and reproducible profiling data. It can merge and diff data from different runs.
For these reasons, Cachegrind is an excellent complement to time-based profilers. Cachegrind can annotate programs written in any language, so long as debug info is present to map machine code back to the original source code. Cachegrind has been used successfully on programs written in C, C++, Rust, and assembly.
Cachegrind can also simulate how your program interacts with a machine's cache
hierarchy and branch predictor. This simulation was the original motivation for
the tool, hence its name. However, the simulations are basic and unlikely to
reflect the behaviour of a modern machine. For this reason they are off by
default. If you really want cache and branch information, a profiler like
First, as for normal Valgrind use, you should compile with debugging info (the
Second, run Cachegrind itself to gather the profiling data.

Third, run cg_annotate to get a detailed presentation of that data.

cg_annotate can combine the results of multiple Cachegrind output files. It can also perform a diff between two Cachegrind output files.
To run Cachegrind on a program prog, run:

    valgrind --tool=cachegrind prog
The program will execute (slowly). Upon completion, summary statistics that look like this will be printed:

    ==17942== I refs:        8,195,070
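If you want to collect this summary figure from a script, the line can be picked apart with a regular expression. The following is a minimal sketch, assuming the summary line has the shape shown above (a `==PID==` prefix, a label, a colon, and a comma-grouped count). The function name `parse_summary_line` is hypothetical and is not part of Valgrind.

```python
import re

# Matches lines such as "==17942== I refs:        8,195,070".
SUMMARY_RE = re.compile(r"^==(\d+)==\s+(.+?):\s+([\d,]+)\s*$")


def parse_summary_line(line):
    """Return (pid, label, count) for a summary line, or None if it does not match."""
    m = SUMMARY_RE.match(line)
    if m is None:
        return None
    pid, label, count = m.groups()
    # Strip the thousands separators before converting the count.
    return int(pid), label.strip(), int(count.replace(",", ""))


if __name__ == "__main__":
    print(parse_summary_line("==17942== I refs:        8,195,070"))
```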
The
Cachegrind also writes more detailed profiling data to a file. By default this
Cachegrind output file is named
The default

Before using cg_annotate, it is worth widening your window to be at least 120 characters wide if possible, because the output lines can be quite long. Then run:

    cg_annotate <filename>

on a Cachegrind output file. The first part of the output looks like this:

    --------------------------------------------------------------------------------
    -- Metadata
    --------------------------------------------------------------------------------
    Invocation:       ../cg_annotate concord.cgout
    Command:          ./concord ../cg_main.c
    Events recorded:  Ir
    Events shown:     Ir
    Event sort order: Ir
    Threshold:        0.1%
    Annotation:       on

It summarizes how Cachegrind and the profiled program were run.
If cache simulation is enabled, details of the cache parameters will be shown above the "Invocation" line.

Next comes the summary for the whole program:

    --------------------------------------------------------------------------------
    -- Summary
    --------------------------------------------------------------------------------
    Ir________________

    8,195,070 (100.0%)  PROGRAM TOTALS
The

Then comes file:function counts. Here is the first part of that section:

    --------------------------------------------------------------------------------
    -- File:function summary
    --------------------------------------------------------------------------------
      Ir______________________  file:function

    < 3,078,746 (37.6%, 37.6%)  /home/njn/grind/ws1/cachegrind/concord.c:
      1,630,232 (19.9%)           get_word
        630,918  (7.7%)           hash
        461,095  (5.6%)           insert
        130,560  (1.6%)           add_existing
         91,014  (1.1%)           init_hash_table
         88,056  (1.1%)           create
         46,676  (0.6%)           new_word_node

    < 1,746,038 (21.3%, 58.9%)  ./malloc/./malloc/malloc.c:
      1,285,938 (15.7%)           _int_malloc
        458,225  (5.6%)           malloc

    < 1,107,550 (13.5%, 72.4%)  ./libio/./libio/getc.c:getc

    <   551,071  (6.7%, 79.1%)  ./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2

    <   521,228  (6.4%, 85.5%)  ./ctype/../include/ctype.h:
        260,616  (3.2%)           __ctype_tolower_loc
        260,612  (3.2%)           __ctype_b_loc

    <   468,163  (5.7%, 91.2%)  ???:
        468,151  (5.7%)           ???

    <   456,071  (5.6%, 96.8%)  /usr/include/ctype.h:get_word

Each entry covers one file, and one or more functions within that file. If there is only one significant function within a file, as in the third entry, the file and function are shown on the same line separated by a colon. If there are multiple significant functions within a file, as in the first entry, each function gets its own line.
This example involves a small C program, and shows a combination of code from
the program itself (including functions like
Each entry is preceded with a "<" marker.

The first percentage in each column indicates the proportion of the total event count that is covered by this line. The second percentage, which is shown only on the first line of each entry, is the cumulative percentage of all the entries up to and including this one. The entries shown here account for 96.8% of the instructions executed by the program.
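The two percentages can be reproduced by hand from the entry totals and the program total. The sketch below is illustrative only: the function name `percentages` is hypothetical, and the counts are the first three entry totals and the PROGRAM TOTALS figure from the example output.

```python
def percentages(entry_totals, program_total):
    """Return (count, percent, cumulative_percent) for each entry, in order.

    Percentages are formatted to one decimal place, as in the cg_annotate output.
    """
    rows = []
    running = 0
    for count in entry_totals:
        running += count
        percent = 100.0 * count / program_total
        cumulative = 100.0 * running / program_total
        rows.append((count, f"{percent:.1f}", f"{cumulative:.1f}"))
    return rows


if __name__ == "__main__":
    # Entry totals for concord.c, malloc.c and getc.c; program total is 8,195,070.
    for row in percentages([3_078_746, 1_746_038, 1_107_550], 8_195_070):
        print(row)
```

The first percentage depends only on the entry itself, while the second accumulates over every entry printed so far, which is why it appears only on the first line of each entry.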
The name

After that comes function:file counts. Here is the first part of that section:

    --------------------------------------------------------------------------------
    -- Function:file summary
    --------------------------------------------------------------------------------
      Ir______________________  function:file

    > 2,086,303 (25.5%, 25.5%)  get_word:
      1,630,232 (19.9%)           /home/njn/grind/ws1/cachegrind/concord.c
        456,071  (5.6%)           /usr/include/ctype.h

    > 1,285,938 (15.7%, 41.1%)  _int_malloc:./malloc/./malloc/malloc.c

    > 1,107,550 (13.5%, 54.7%)  getc:./libio/./libio/getc.c

    >   630,918  (7.7%, 62.4%)  hash:/home/njn/grind/ws1/cachegrind/concord.c

    >   551,071  (6.7%, 69.1%)  __strcmp_avx2:./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S

    >   480,248  (5.9%, 74.9%)  malloc:
        458,225  (5.6%)           ./malloc/./malloc/malloc.c
         22,023  (0.3%)           ./malloc/./malloc/arena.c

    >   468,151  (5.7%, 80.7%)  ???:???

    >   461,095  (5.6%, 86.3%)  insert:/home/njn/grind/ws1/cachegrind/concord.c
This is similar to the previous section, but is grouped by functions first and
files second. Also, the entry markers are ">" instead of "<".
You might wonder why this section is needed, and how it differs from the
previous section. The answer is inlining. In this example there are two entries
demonstrating a function whose code is effectively spread across more than one
file:

    > 30,469,230 (1.3%, 11.1%)  <rustc_middle::ty::context::CtxtInterners>::intern_ty:
      10,269,220 (0.5%)           /home/njn/.cargo/registry/src/github.com-1ecc6299db9ec823/hashbrown-0.12.3/src/raw/mod.rs
       7,696,827 (0.3%)           /home/njn/dev/rust0/compiler/rustc_middle/src/ty/context.rs
       3,858,099 (0.2%)           /home/njn/dev/rust0/library/core/src/cell.rs
In this case the compiled function
By default, a source file is annotated if it contains at least one function
that meets the significance threshold. This can be disabled with the
To continue the previous example, here is part of the annotation of the file
    --------------------------------------------------------------------------------
    -- Annotated source file: /home/njn/grind/ws1/cachegrind/docs/concord.c
    --------------------------------------------------------------------------------
    Ir____________

          .         /* Function builds the hash table from the given file. */
          .         void init_hash_table(char *file_name, Word_Node *table[])
          8 (0.0%)  {
          .             FILE *file_ptr;
          .             Word_Info *data;
          2 (0.0%)      int line = 1, i;
          .
          .             /* Structure used when reading in words and line numbers. */
          3 (0.0%)      data = (Word_Info *) create(sizeof(Word_Info));
          .
          .             /* Initialise entire table to NULL. */
      2,993 (0.0%)      for (i = 0; i < TABLE_SIZE; i++)
        997 (0.0%)          table[i] = NULL;
          .
          .             /* Open file, check it. */
          4 (0.0%)      file_ptr = fopen(file_name, "r");
          2 (0.0%)      if (!(file_ptr)) {
          .                 fprintf(stderr, "Couldn't open '%s'.\n", file_name);
          .                 exit(EXIT_FAILURE);
          .             }
          .
          .             /* 'Get' the words and lines one at a time from the file, and insert them
          .             ** into the table one at a time. */
     55,363 (0.7%)      while ((line = get_word(data, line, file_ptr)) != EOF)
     31,632 (0.4%)          insert(data->word, data->line, table);
          .
          2 (0.0%)      free(data);
          2 (0.0%)      fclose(file_ptr);
          6 (0.0%)  }

Each executed line is annotated with its event counts. Other lines are annotated with a dot. This may be because they contain no executable code, or they contain executable code but were never executed.
You can easily tell if a function is inlined from this output. If it is not
inlined, it will have event counts on the lines containing the opening and
closing braces. If it is inlined, it will not have event counts on those lines.
In the example above,
Note again that inlining can lead to surprising results. If a function
Sometimes only a small section of a source file is executed. To minimise uninteresting output, Cachegrind only shows annotated lines and lines within a small distance of annotated lines. Gaps are marked with line numbers, for example:

    (counts and code for line 704)
    -- line 375 ----------------------------------------
    -- line 514 ----------------------------------------
    (counts and code for line 878)
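The idea of showing only annotated lines plus nearby context can be sketched as follows. This is not cg_annotate's implementation; the function names `visible_lines` and `render_plan` and the `context` parameter are hypothetical, chosen only to illustrate how annotated lines, context lines and gaps relate.

```python
def visible_lines(annotated, total_lines, context=2):
    """Return sorted line numbers that are annotated or within `context` lines of one."""
    shown = set()
    for n in annotated:
        first = max(1, n - context)
        last = min(total_lines, n + context)
        shown.update(range(first, last + 1))
    return sorted(shown)


def render_plan(annotated, total_lines, context=2):
    """Return ('line', n) items interleaved with ('gap', first_hidden, last_hidden) items."""
    plan = []
    prev = 0
    for n in visible_lines(annotated, total_lines, context):
        if n > prev + 1:
            # Lines prev+1 .. n-1 are hidden, so record a gap marker.
            plan.append(("gap", prev + 1, n - 1))
        plan.append(("line", n))
        prev = n
    if prev < total_lines:
        plan.append(("gap", prev + 1, total_lines))
    return plan


if __name__ == "__main__":
    for item in render_plan({5, 20}, total_lines=25, context=1):
        print(item)
```

With a larger `context` value, neighbouring windows overlap and the gaps between them disappear, so more of the file is shown in one piece.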
The number of lines of context shown around annotated lines is controlled by
the

Any significant source files that could not be found are shown like this:

    --------------------------------------------------------------------------------
    -- Annotated source file: ./malloc/./malloc/malloc.c
    --------------------------------------------------------------------------------
    Unannotated because one or more of these original files are unreadable:
    - ./malloc/./malloc/malloc.c

This is common for library files, because libraries are usually compiled with debugging information but the source files are rarely present on a system.

Cachegrind relies heavily on accurate debug info. Sometimes compilers map a particular compiled instruction to line number 0, where the 0 represents "unknown" or "none". This is annoying but does happen in practice. cg_annotate prints these in the following way:

    --------------------------------------------------------------------------------
    -- Annotated source file: /home/njn/dev/rust0/compiler/rustc_borrowck/src/lib.rs
    --------------------------------------------------------------------------------
    Ir______________

    1,046,746 (0.0%)  <unknown (line 0)>

Finally, when annotation is performed, the output ends with a summary of how many counts were annotated and unannotated, and why. For example:

    --------------------------------------------------------------------------------
    -- Annotation summary
    --------------------------------------------------------------------------------
    Ir_______________

    3,534,817 (43.1%)    annotated: files known & above threshold & readable, line numbers known
            0            annotated: files known & above threshold & readable, line numbers unknown
            0          unannotated: files known & above threshold & two or more non-identical
    4,132,126 (50.4%)  unannotated: files known & above threshold & unreadable
       59,950  (0.7%)  unannotated: files known & below threshold
      468,163  (5.7%)  unannotated: files unknown

If your program forks, the child will inherit all the profiling data that has been gathered for the parent.
If the output file name (controlled by

There are two situations in which cg_annotate prints warnings.
cg_annotate can merge data from multiple Cachegrind output files in a single run. (There is also a program called cg_merge that can merge multiple Cachegrind output files into a single Cachegrind output file, but it is now deprecated because cg_annotate's merging does a better job.) Use it as follows:

    cg_annotate file1 file2 file3 ...
cg_annotate computes the sum of these files (effectively
The most common merging scenario is if you want to aggregate costs over multiple runs of the same program, possibly on different inputs.

cg_annotate can diff data from two Cachegrind output files in a single run. (There is also a program called cg_diff that can diff two Cachegrind output files into a single Cachegrind output file, but it is now deprecated because cg_annotate's differencing does a better job.) Use it as follows:

    cg_annotate --diff file1 file2
cg_annotate computes the difference between these two files (effectively
The simplest common scenario is comparing two Cachegrind output files that came from the same program, but on different inputs. cg_annotate will do a good job on this without assistance.
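Conceptually, merging is a per-location sum of event counts and diffing is a per-location subtraction. The sketch below illustrates that idea on plain dictionaries. It is not how cg_annotate is implemented; the names `merge` and `diff` are hypothetical, the keys are (file, function) pairs taken from the example output, and the diff direction (second minus first) is an assumption.

```python
from collections import Counter


def merge(*profiles):
    """Sum the counts of any number of {(file, function): count} profiles."""
    total = Counter()
    for profile in profiles:
        total.update(profile)  # Counter.update adds counts rather than replacing them
    return dict(total)


def diff(first, second):
    """Return second minus first for every location seen in either profile (assumed direction)."""
    keys = set(first) | set(second)
    return {k: second.get(k, 0) - first.get(k, 0) for k in keys}


if __name__ == "__main__":
    run1 = {("concord.c", "hash"): 630_918, ("concord.c", "insert"): 461_095}
    run2 = {("concord.c", "hash"): 600_000, ("getc.c", "getc"): 1_107_550}
    print(merge(run1, run2))
    print(diff(run1, run2))
```

A location that appears in only one profile is treated as having a count of zero in the other, so a diff can contain negative values.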
A more complex scenario is if you want to compare Cachegrind output files from
two slightly different versions of a program that you have sitting
side-by-side, running on the same input. For example, you might have
In this case, use the
Similarly, sometimes compilers auto-generate certain functions and give them
randomized names like
When

Cachegrind can simulate how your program interacts with a machine's cache hierarchy and/or branch predictor.

The cache simulation models a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). Some machines have more than two levels of cache. For these machines (in the cases where Cachegrind can auto-detect the cache configuration) Cachegrind simulates the first-level and last-level caches. Therefore, Cachegrind always refers to the I1, D1 and LL (last-level) caches.
When simulating the cache, with
Note that D1 total accesses is given by
When simulating the branch predictor, with
When cache and/or branch simulation is enabled, cg_annotate will print multiple counts per line of output. For example:

      Ir______________________ Bc____________________ Bcm__________________ Bi____________________ Bim______________  function:file

    >     8,547  (0.1%, 99.4%)    936  (0.1%, 99.1%)    177  (0.3%, 96.7%)     59  (0.0%, 99.9%)    38 (19.4%, 66.3%)  strcmp:
          8,503  (0.1%)           928  (0.1%)           175  (0.3%)            59  (0.0%)           38 (19.4%)           ./string/../sysdeps/x86_64/multiarch/../multiarch/strcmp-sse2.S

Cachegrind-specific options are:
Cachegrind provides the following client requests in
This section talks about details you don't need to know about in order to use Cachegrind, but may be of interest to some people.

The cache simulation approximates the hardware of an AMD Athlon CPU circa 2002. Its specific characteristics are as follows:
The cache configuration simulated (cache size,
associativity and line size) is determined automatically using
the x86 CPUID instruction. If you have a machine that (a)
doesn't support the CPUID instruction, or (b) supports it in an
early incarnation that doesn't give any cache information, then
Cachegrind will fall back to using a default configuration (that
of a model 3/4 Athlon). Cachegrind will tell you if this
happens. You can manually specify one, two or all three levels
(I1/D1/LL) of the cache from the command line using the
On PowerPC platforms
Cachegrind cannot automatically
determine the cache configuration, so you will
need to specify it with the
Other noteworthy behaviour:
If you are interested in simulating a cache with different properties, it is
not particularly hard to write your own cache simulator, or to modify the
existing ones in

Cachegrind simulates branch predictors intended to be typical of mainstream desktop/server processors of around 2004.

Conditional branches are predicted using an array of 16384 2-bit saturating counters. The array index used for a branch instruction is computed partly from the low-order bits of the branch instruction's address and partly using the taken/not-taken behaviour of the last few conditional branches. As a result the predictions for any specific branch depend both on its own history and the behaviour of previous branches. This is a standard technique for improving prediction accuracy.

For indirect branches (that is, jumps to unknown destinations) Cachegrind uses a simple branch target address predictor. Targets are predicted using an array of 512 entries indexed by the low order 9 bits of the branch instruction's address. Each branch is predicted to jump to the same address it did last time. Any other behaviour causes a mispredict.

More recent processors have better branch predictors, in particular better indirect branch predictors. Cachegrind's predictor design is deliberately conservative so as to be representative of the large installed base of processors which pre-date widespread deployment of more sophisticated indirect branch predictors. In particular, late model Pentium 4s (Prescott), Pentium M, Core and Core 2 have more sophisticated indirect branch predictors than modelled by Cachegrind.

Cachegrind does not simulate a return stack predictor. It assumes that processors perfectly predict function return addresses, an assumption which is probably close to being true.

See Hennessy and Patterson's classic text "Computer Architecture: A Quantitative Approach", 4th edition (2007), Section 2.3 (pages 80-89) for background on modern branch predictors.

Cachegrind's instruction counting has one shortcoming on x86/amd64:
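The conditional and indirect predictors described above can be illustrated with a small model. This is a sketch under stated assumptions, not Cachegrind's actual code: the class names are hypothetical, the way address bits and branch history are combined (an XOR here), the number of history bits, and the initial counter state are all assumptions. Only the table sizes (16384 counters, 512 targets) and the "same target as last time" rule come from the text.

```python
class ConditionalPredictor:
    """Table of 2-bit saturating counters indexed by address bits and recent history."""

    def __init__(self, entries=16384, history_bits=4):
        self.entries = entries
        self.counters = [1] * entries  # assumption: start in the "weakly not taken" state
        self.history = 0               # taken/not-taken bits of the last few branches
        self.history_mask = (1 << history_bits) - 1

    def _index(self, addr):
        # Assumption: mix low-order address bits with the history using XOR.
        return (addr ^ self.history) % self.entries

    def predict_and_update(self, addr, taken):
        """Return True if the branch was mispredicted, then train on the real outcome."""
        i = self._index(addr)
        predicted_taken = self.counters[i] >= 2
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
        return predicted_taken != taken


class IndirectPredictor:
    """Predicts that an indirect branch jumps to the same target as last time."""

    def __init__(self, entries=512):
        self.entries = entries
        self.targets = [None] * entries

    def predict_and_update(self, addr, target):
        """Return True if the target was mispredicted, then remember the new target."""
        i = addr & (self.entries - 1)  # low-order 9 bits when entries == 512
        mispredicted = self.targets[i] != target
        self.targets[i] = target
        return mispredicted


if __name__ == "__main__":
    cond = ConditionalPredictor()
    misses = sum(cond.predict_and_update(0x400, True) for _ in range(100))
    print("mispredicts for an always-taken branch:", misses)
```

Because the index depends on recent history as well as the address, a single branch can occupy several counters, and two branches whose addresses share low-order bits can interfere with each other in the indirect table.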
Cachegrind's cache profiling has a number of shortcomings:
Another thing worth noting is that results are very sensitive. Changing the size of the executable being profiled, or the sizes of any of the shared libraries it uses, or even the length of their file names, can perturb the results. Variations will be small, but don't expect perfectly repeatable results if your program changes at all.

Many Linux distributions perform address space layout randomisation (ASLR), in which identical runs of the same program have their shared libraries loaded at different locations, as a security measure. This also perturbs the results.

This section talks about details you don't need to know about in order to use Cachegrind, but may be of interest to some people.

The best reference for understanding how Cachegrind works is chapter 3 of "Dynamic Binary Analysis and Instrumentation", by Nicholas Nethercote. It is available on the Valgrind publications page.

The file format is fairly straightforward, basically giving the cost centre for every line, grouped by files and functions. It's also totally generic and self-describing, in the sense that it can be used for any events that can be counted on a line-by-line basis, not just cache and branch predictor events. For example, earlier versions of Cachegrind didn't have a branch predictor simulation. When this was added, the file format didn't need to change at all. So the format (and consequently, cg_annotate) could be used by other tools.

The file format:

    file         ::= desc_line* cmd_line events_line data_line+ summary_line
    desc_line    ::= "desc:" ws? non_nl_string
    cmd_line     ::= "cmd:" ws? cmd
    events_line  ::= "events:" ws? (event ws)+
    data_line    ::= file_line | fn_line | count_line
    file_line    ::= "fl=" filename
    fn_line      ::= "fn=" fn_name
    count_line   ::= line_num (ws+ count)* ws*
    summary_line ::= "summary:" ws? count (ws+ count)+ ws*
    count        ::= num

Where:
The contents of the "desc:" lines are printed out at the top of the summary. This is a generic way of providing simulation specific information, e.g. for giving the cache configuration for cache simulation.

More than one line of info can be present for each file/fn/line number. In such cases, the counts for the named events will be accumulated.

The number of counts in each
A

Similarly, each

The summary line is redundant, because it just holds the total counts for each event. But this serves as a useful sanity check of the data; if the totals for each event don't match the summary line, something has gone wrong.
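To make the grammar and the summary-line sanity check concrete, here is a minimal reader for the format. It is a sketch, not the parser used by cg_annotate: the function names are hypothetical, the sample text is invented for illustration, and treating missing trailing counts as zero is an assumption. Counts for a repeated file/fn/line are accumulated, as described above.

```python
def parse_cachegrind_out(text):
    """Parse text in the format above into a dict of desc, cmd, events, counts, summary."""
    desc, cmd, events, summary = [], None, [], None
    counts = {}  # (filename, fn_name, line_num) -> list of per-event counts
    fl = fn = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("desc:"):
            desc.append(line[len("desc:"):].strip())
        elif line.startswith("cmd:"):
            cmd = line[len("cmd:"):].strip()
        elif line.startswith("events:"):
            events = line[len("events:"):].split()
        elif line.startswith("fl="):
            fl = line[len("fl="):]
        elif line.startswith("fn="):
            fn = line[len("fn="):]
        elif line.startswith("summary:"):
            summary = [int(c) for c in line[len("summary:"):].split()]
        else:
            fields = line.split()
            line_num = int(fields[0])
            values = [int(c) for c in fields[1:]]
            if len(values) > len(events):
                raise ValueError(f"too many counts on line: {raw!r}")
            # Assumption: counts missing from the end of a line are zero.
            values += [0] * (len(events) - len(values))
            acc = counts.setdefault((fl, fn, line_num), [0] * len(events))
            for i, v in enumerate(values):
                acc[i] += v  # repeated file/fn/line entries accumulate
    return {"desc": desc, "cmd": cmd, "events": events,
            "counts": counts, "summary": summary}


def totals_match_summary(parsed):
    """Sanity check: per-event totals over all count lines equal the summary line."""
    totals = [0] * len(parsed["events"])
    for values in parsed["counts"].values():
        for i, v in enumerate(values):
            totals[i] += v
    return totals == parsed["summary"]


SAMPLE = """\
desc: invented example, not real Cachegrind output
cmd: ./concord ../cg_main.c
events: Ir Bc
fl=concord.c
fn=get_word
10 5 1
11 7
fn=hash
20 3 2
fn=get_word
10 2 0
summary: 17 3
"""

if __name__ == "__main__":
    parsed = parse_cachegrind_out(SAMPLE)
    print(parsed["counts"])
    print("summary matches:", totals_match_summary(parsed))
```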
Copyright © 2000-2024 Valgrind™ Developers |