Added first draft of "How Bloaty Works" document.

adamj-codethink · Jun 8, 2019 · 1bb0b93 · 1bb0b93
1 parent 9f04dfb
commit 1bb0b93
Showing 1 changed file with 135 additions and 0 deletions.
diff --git a/how-bloaty-works.md b/how-bloaty-works.md
@@ -0,0 +1,135 @@
+
+# How Bloaty Works
+
+At a high level, Bloaty's goal is to create a map of the binary where every byte
+has a label attached to it.  Every byte starts out as unknown (unattributed).
+As we scan data structures in the binary we assign labels to various ranges of
+the binary.  For example, if the user selected the "symbols" data source we scan
+the symbol table.  For each symbol, we find the corresponding section of the
+binary and "paint" it with the label of that symbol.
+
+The ELF/Mach-O/etc. data structures we are traversing were not designed to
+enable size profiling.  They were designed to assist linkers, loaders,
+debuggers, stack unwinders, etc. to run and debug the binary.  This means
+that Bloaty is inherently an unconventional use of ELF/Mach-O structures,
+and has to be clever about how to use the available information to
+achieve its goal.  This can pose a challenge, but also makes Bloaty fun
+to work on.
+
+## Range Map
+
+RangeMap (as defined in [range_map.h](https://github.com/google/bloaty/blob/master/src/range_map.h))
+is the core data structure of Bloaty.  It is a sparse map of
+`[start, end) -> std::string` that associates regions of VM or file space to
+a label.
+
+Sometimes we know a region's start but not its end.  For example, Mach-O
+symbols have an address but not a size (whereas ELF symbols have both).
+To support this case, `RangeMap` supports adding an address with `kUnknownSize`.
+A range with unknown size will automatically extend to the beginning of the
+next region, even if the next region is added later.
+
+If we try to add a label to a range of the binary that has already been assigned
+a label, the first label assigned takes precedence.  This means that the order
+in which we scan data structures is significant.  So our general strategy is to
+scan our most granular and detailed information first.  We scan generic
+information as a last resort, to give at least some information for parts of the
+binary that we couldn't find any more specific information about.
+
+## VM Space and File Space
+
+Loadable binaries have two fundamental domains of space we are trying to map:
+*VM space* and *file space*.  File space is the bytes of the input file.  VM
+space is the bytes of memory when the executable is loaded at runtime.  Some
+regions of the binary exist only in file space (like debug info) and some
+regions exist only in VM space (like `.bss`, zero-initialized data).  Even
+entities that exist in both spaces can have different sizes in each.
+
+We create two separate `RangeMap` structures for these two domains.  For
+convenience, we put them together into a single structure called `DualMap`:
+
+```cpp
+struct DualMap {
+  RangeMap vm_map;
+  RangeMap file_map;
+};
+```
+
+We populate these two maps simultaneously as we scan the file.  We must do this
+even in the case where we only ultimately care about one of them, because almost
+all of the data structures we are scanning specify locations in either file
+space *or* VM space, but not both.  So we must be able to translate one domain
+to the other to get the benefit of all the metadata in the binary.
+
+## Base Map
+
+To translate between VM and file space, we always begin by creating a "base
+map."  The base map is just a `DualMap` like any other, but we give it special
+meaning:
+
+* It defines what ranges of file and VM space constitute "the entire binary"
+  (ie. the "TOTALS" row of the final report).
+* We use it to translate between VM space and File space.
+
+This means that the base map must be exhaustive, and must also provide
+translation for any entity that exists in both VM and file space.  For example,
+if we see in the symbol table that address `0x12345` corresponds to
+symbol `foo`, we will add that to VM map immediately, but we will
+also use the base map to translate address `0x12355` to a file
+offset so we can add that range to the file map.
+
+How does the base map store translation info?  I left one thing out about
+`RangeMap` above.  In addition to storing a label for every region, it can also
+(optionally) store a member called `other_start`.  This is the offset in the
+other space where this range maps to.
+
+We build the base map by scanning either the segments (program headers) or
+sections of the binary.  These give both VM address and file offset for regions
+of the binary that are loaded into memory.  To make sure we cover the entire
+file space, we use `[Unmapped]` as a last ditch fallback for any regions of the
+on-disk binary that didn't have any segment/section data associated with them.
+This ensures that Bloaty always accounts for the entire physical binary, even if
+we can't find any information about it.
+
+## Scanning Data Sources
+
+Once we have built the base map, we can get on to the meat of Bloaty's work.
+We can now scan the binary according to whatever data source(s) the user has
+selected.
+
+We must do more than just scan a single table though.  For example, if the user
+asked us to scan symbols we can't simply scan the symbol table, add all the
+symbol table entries to a `RangeMap`, and call it a day.  The first prerelease
+version of Bloaty did just that and I got very understandable complaints that
+Bloaty's coverage was low, and very large parts of the binary would show up as
+`[None]`.
+
+The problem is that the symbol table only refers to the actual machine code of
+the function.  But a function can emit far more artifacts into a binary than
+just the machine code.  It can also emit:
+
+* unwind info into `.eh_frame`
+* relocations into `.rela.dyn`, etc.
+* debugging information into various `.debug_*` sections.
+* the symbol table entries themselves (yes, these do take up space!)
+
+If we want to achieve high coverage of the binary, we have to scan all of these
+different sections and attribute them as best we can to individual
+symbols/compileunits/etc.
+
+All of this means that we have to delve deep into ELF, DWARF, Mach-O, etc. to
+get good coverage.  We have to venture into dark arts like `.eh_frame` which is
+underspecified and underdocumented, and that few programs ever have any reason
+to parse.  For Bloaty to work well, we must go deeper.
+
+## ELF
+
+TODO
+
+## DWARF
+
+TODO
+
+## Mach-O
+
+TODO