<?xml version="1.0"?> <!-- -*- sgml -*- -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
[ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>

<chapter id="dh-manual" xreflabel="DHAT: a dynamic heap analysis tool">
<title>DHAT: a dynamic heap analysis tool</title>

<para>To use this tool, you must specify <option>--tool=exp-dhat</option> on the Valgrind command line.</para>

<sect1 id="dh-manual.overview" xreflabel="Overview">
<title>Overview</title>

<para>DHAT is a tool for examining how programs use their heap allocations.</para>

<para>It tracks the allocated blocks, and inspects every memory access to determine which block, if any, it falls in. The following data is collected and presented per allocation point (allocation stack):</para>

<itemizedlist>
<listitem><para>total allocation (number of bytes and blocks)</para></listitem>
<listitem><para>maximum live volume (number of bytes and blocks)</para></listitem>
<listitem><para>average block lifetime (number of instructions between allocation and freeing)</para></listitem>
<listitem><para>average number of reads and writes to each byte in the block ("access ratios")</para></listitem>
<listitem><para>for allocation points which always allocate blocks of a single size, where that size is 4096 bytes or less: counts showing how often each byte offset inside the block is accessed.</para></listitem>
</itemizedlist>

<para>Using these statistics it is possible to identify allocation points with the following characteristics:</para>

<itemizedlist>
<listitem><para>potential process-lifetime leaks: blocks allocated by the point just accumulate, and are freed only at the end of the run.</para></listitem>
<listitem><para>excessive turnover: points which chew through a lot of heap, even if it is not held onto for very long</para></listitem>
<listitem><para>excessively transient: points which allocate very short-lived
blocks</para></listitem>
<listitem><para>useless or underused allocations: blocks which are allocated but not completely filled in, or are filled in but not subsequently read.</para></listitem>
<listitem><para>blocks with inefficient layout -- areas never accessed, or with hot fields scattered throughout the block.</para></listitem>
</itemizedlist>

<para>As with the Massif heap profiler, DHAT measures program progress by counting instructions, and so presents all age- and time-related figures as instruction counts. This sounds a little odd at first, but it makes runs repeatable in a way which is not possible if CPU time is used.</para>

</sect1>

<sect1 id="dh-manual.understanding" xreflabel="Understanding DHAT's output">
<title>Understanding DHAT's output</title>

<para>DHAT provides a lot of useful information on dynamic heap usage. Most of the art of using it is in interpretation of the resulting numbers. That is best illustrated via a set of examples.</para>

<sect2>
<title>Interpreting the max-live, tot-alloc and deaths fields</title>

<sect3><title>A simple example</title></sect3>

<screen><![CDATA[
======== SUMMARY STATISTICS ========

guest_insns:  1,045,339,534
[...]
max-live:    63,490 in 984 blocks
tot-alloc:   1,904,700 in 29,520 blocks (avg size 64.52)
deaths:      29,520, at avg age 22,227,424
acc-ratios:  6.37 rd, 1.14 wr  (12,141,526 b-read, 2,174,460 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x40350E: tcc_malloc (tinycc.c:6712)
   by 0x404580: tok_alloc_new (tinycc.c:7151)
   by 0x40870A: next_nomacro1 (tinycc.c:9305)
]]></screen>

<para>Over the entire run of the program, this stack (allocation point) allocated a total of 29,520 blocks containing 1,904,700 bytes. The max-live data shows that not many blocks were simultaneously live, though: at the peak, there were 63,490 allocated bytes in 984 blocks.
This tells us that the program is steadily freeing such blocks as it runs, rather than hanging on to all of them until the end and freeing them all at once.</para>

<para>The deaths entry tells us that 29,520 blocks allocated by this stack died (were freed) during the run of the program. Since 29,520 is also the number of blocks allocated in total, that tells us that all allocated blocks were freed by the end of the program.</para>

<para>It also tells us that the average age at death was 22,227,424 instructions. From the summary statistics we see that the program ran for 1,045,339,534 instructions, and so the average age at death is about 2% of the program's total run time.</para>

<sect3><title>Example of a potential process-lifetime leak</title></sect3>

<para>This next example (from a different program than the above) shows a potential process-lifetime leak. A process-lifetime leak occurs when a program keeps allocating data, but only frees the data just before it exits. Hence the program's heap grows constantly in size, yet Memcheck reports no leak, because the program has freed up everything at exit. This is particularly a hazard for long-running programs.</para>

<screen><![CDATA[
======== SUMMARY STATISTICS ========

guest_insns:  418,901,537
[...]
max-live:    32,512 in 254 blocks
tot-alloc:   32,512 in 254 blocks (avg size 128.00)
deaths:      254, at avg age 300,467,389
acc-ratios:  0.26 rd, 0.20 wr  (8,756 b-read, 6,604 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x4C27632: realloc (vg_replace_malloc.c:525)
   by 0x56FF41D: QtFontStyle::pixelSize(unsigned short, bool) (qfontdatabase.cpp:269)
   by 0x5700D69: loadFontConfig() (qfontdatabase_x11.cpp:1146)
]]></screen>

<para>There are two tell-tale signs that this might be a process-lifetime leak. Firstly, the max-live and tot-alloc numbers are identical.
The only way that can happen is if all of these blocks are allocated before any of them is deallocated.</para>

<para>Secondly, the average age at death (300 million insns) is 71% of the total program lifetime (419 million insns), hence this is not a transient allocation-free spike -- rather, it is spread out over a large part of the entire run. One interpretation is, roughly, that all 254 blocks were allocated in the first half of the run, held onto for the second half, and then freed just before exit.</para>

</sect2>

<sect2>
<title>Interpreting the acc-ratios fields</title>

<sect3><title>A fairly harmless allocation point record</title></sect3>

<screen><![CDATA[
max-live:    49,398 in 808 blocks
tot-alloc:   1,481,940 in 24,240 blocks (avg size 61.13)
deaths:      24,240, at avg age 34,611,026
acc-ratios:  2.13 rd, 0.91 wr  (3,166,650 b-read, 1,358,820 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x40350E: tcc_malloc (tinycc.c:6712)
   by 0x404580: tok_alloc_new (tinycc.c:7151)
   by 0x4046C4: tok_alloc (tinycc.c:7190)
]]></screen>

<para>The acc-ratios field tells us that each byte in the blocks allocated here is read an average of 2.13 times before the block is deallocated. Given that the blocks have an average age at death of 34,611,026 instructions, each byte is read roughly once every 16 million instructions. So from that standpoint the blocks aren't "working" very hard.</para>

<para>More interesting is the write ratio: each byte is written an average of 0.91 times. This tells us that some parts of the allocated blocks are never written, at least 9% on average. To completely initialise the block would require writing each byte at least once, and that would give a write ratio of at least 1.0.
The fact that some block areas are evidently unused might point to data alignment holes or other layout inefficiencies.</para>

<para>Well, at least all the blocks are freed (24,240 allocations, 24,240 deaths).</para>

<para>If all the blocks had been the same size, DHAT would also show the access counts by block offset, so we could see exactly where these unused areas are. However, that isn't the case: the blocks have varying sizes, so DHAT can't perform such an analysis. We can see that they must have varying sizes since the average block size, 61.13, isn't a whole number.</para>

<sect3><title>A more suspicious looking example</title></sect3>

<screen><![CDATA[
max-live:    180,224 in 22 blocks
tot-alloc:   180,224 in 22 blocks (avg size 8192.00)
deaths:      none (none of these blocks were freed)
acc-ratios:  0.00 rd, 0.00 wr  (0 b-read, 0 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x40350E: tcc_malloc (tinycc.c:6712)
   by 0x40369C: __sym_malloc (tinycc.c:6787)
   by 0x403711: sym_malloc (tinycc.c:6805)
]]></screen>

<para>Here, both the read and write access ratios are zero. Hence this point is allocating blocks which are never used, neither read nor written. Indeed, they are also not freed ("deaths: none") and are simply leaked. So, here are 180,224 bytes of completely useless allocation that could be removed.</para>

<para>Re-running with Memcheck does indeed report the same leak. What DHAT can tell us, that Memcheck can't, is that not only are the blocks leaked, they are also never used.</para>

<sect3><title>Another suspicious example</title></sect3>

<para>Here's one where blocks are allocated, written to, but never read from. We see this immediately from the zero read access ratio.
They do get freed, though:</para>

<screen><![CDATA[
max-live:    54 in 3 blocks
tot-alloc:   1,620 in 90 blocks (avg size 18.00)
deaths:      90, at avg age 34,558,236
acc-ratios:  0.00 rd, 1.11 wr  (0 b-read, 1,800 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x40350E: tcc_malloc (tinycc.c:6712)
   by 0x4035BD: tcc_strdup (tinycc.c:6750)
   by 0x41FEBB: tcc_add_sysinclude_path (tinycc.c:20931)
]]></screen>

<para>In the previous two examples, it is easy to see blocks that are never written to, or never read from, or some combination of both. Unfortunately, in C++ code, the situation is less clear. That's because an object's constructor will write to the underlying block, and its destructor will read from it. So the block's read and write ratios will be non-zero even if the object, once constructed, is never used, but only eventually destructed.</para>

<para>Really, what we want is to measure only memory accesses in between the end of an object's construction and the start of its destruction. Unfortunately, I do not know of a reliable way to determine when those transitions are made.</para>

</sect2>

<sect2>
<title>Interpreting "Aggregated access counts by offset" data</title>

<para>For allocation points that always allocate blocks of the same size, where that size is 4096 bytes or less, DHAT counts accesses per offset, for example:</para>

<screen><![CDATA[
max-live:    317,408 in 5,668 blocks
tot-alloc:   317,408 in 5,668 blocks (avg size 56.00)
deaths:      5,668, at avg age 622,890,597
acc-ratios:  1.03 rd, 1.28 wr  (327,642 b-read, 408,172 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x5440C16: QDesignerPropertySheetPrivate::ensureInfo (qhash.h:515)
   by 0x544350B: QDesignerPropertySheet::setVisible (qdesigner_propertysh...)
   by 0x5446232: QDesignerPropertySheet::QDesignerPropertySheet (qdesigne...)
Aggregated access counts by offset:

[   0]  28782 28782 28782 28782 28782 28782 28782 28782
[   8]  20638 20638 20638 20638 0 0 0 0
[  16]  22738 22738 22738 22738 22738 22738 22738 22738
[  24]  6013 6013 6013 6013 6013 6013 6013 6013
[  32]  18883 18883 18883 37422 0 0 0 0
[  40]  5668 11915 5668 5668 11336 11336 11336 11336
[  48]  6166 6166 6166 6166 0 0 0 0
]]></screen>

<para>This is fairly typical for C++ code running on a 64-bit platform. Here, we have aggregated access statistics for 5668 blocks, all of size 56 bytes. Each byte has been accessed at least 5668 times, except for offsets 12--15, 36--39 and 52--55. These are likely to be alignment holes.</para>

<para>Careful interpretation of the numbers reveals useful information. Groups of N consecutive identical numbers that begin at an N-aligned offset, for N being 2, 4 or 8, are likely to indicate an N-byte object in the structure at that point. For example, the first 32 bytes of this object are likely to have the layout</para>

<screen><![CDATA[
[0 ] 64-bit type
[8 ] 32-bit type
[12] 32-bit alignment hole
[16] 64-bit type
[24] 64-bit type
]]></screen>

<para>As a counterexample, it's also clear that, whatever is at offset 32, it is not a 32-bit value. That's because the last number of the group (37422) is not the same as the first three (18883 18883 18883).</para>

<para>This example leads one to enquire (by reading the source code) whether the zeroes at 12--15 and 52--55 are alignment holes, and whether 48--51 is indeed a 32-bit type. If so, it might be possible to place what's at 48--51 at 12--15 instead, which would reduce the object size from 56 to 48 bytes.</para>

<para>Bear in mind that the above inferences are all only "maybes". That's because they are based on dynamic data, not static analysis of the object layout. For example, the zeroes might not be alignment holes, but rather just parts of the structure which were not used at all for this particular run.
Experience shows that's unlikely to be the case, but it could happen.</para>

</sect2>

</sect1>

<sect1 id="dh-manual.options" xreflabel="DHAT Command-line Options">
<title>DHAT Command-line Options</title>

<para>DHAT-specific command-line options are:</para>

<!-- start of xi:include in the manpage -->
<variablelist id="dh.opts.list">

<varlistentry id="opt.show-top-n" xreflabel="--show-top-n">
<term>
<option><![CDATA[--show-top-n=<number> [default: 10] ]]></option>
</term>
<listitem>
<para>At the end of the run, DHAT sorts the accumulated allocation points according to some metric, and shows the highest scoring entries. <varname>--show-top-n</varname> controls how many entries are shown. The default of 10 is quite small. For realistic applications you will probably need to set it much higher, at least several hundred.</para>
</listitem>
</varlistentry>

<varlistentry id="opt.sort-by" xreflabel="--sort-by=string">
<term>
<option><![CDATA[--sort-by=<string> [default: max-bytes-live] ]]></option>
</term>
<listitem>
<para>At the end of the run, DHAT sorts the accumulated allocation points according to some metric, and shows the highest scoring entries. <varname>--sort-by</varname> selects the metric used for sorting:</para>
<para><varname>max-bytes-live </varname> maximum live bytes [default]</para>
<para><varname>tot-bytes-allocd </varname> total allocation (turnover)</para>
<para><varname>max-blocks-live </varname> maximum live blocks</para>
<para>This controls the order in which allocation points are displayed. You can choose to look at allocation points with the highest maximum liveness, the highest total turnover, or the highest number of live blocks. These give usefully different pictures of program behaviour.
For example, sorting by maximum live blocks tends to show up allocation points creating large numbers of small objects.</para>
</listitem>
</varlistentry>

</variablelist>

<para>One important point to note is that each allocation stack counts as a separate allocation point. Because stacks by default have 12 frames, this tends to spread data out over multiple allocation points. You may want to use the flag <option>--num-callers=4</option>, or some such small number, to reduce the spreading.</para>
<!-- end of xi:include in the manpage -->

</sect1>

</chapter>