
General Advice
--------------

Using memory efficiently in a parallel application can be hard, especially
on NUMA machines. This trace visualization is designed to help you identify
your application's memory behavior and how to improve it.

Before showing any actual visualizations, here are a few reminders on efficient
memory access:

1.  **Split the memory**

    If the memory is accessed correctly, some groups of threads (from 1 to the
    maximum number of threads per NUMA node of the experimental machine) should
    appear. A group of threads is a set of threads accessing (mostly) the same
    set of pages. Moreover, the average number of accesses should be more or
    less the same for every thread and for every page.

    On NUMA machines, it is also important to think about how memory is
    distributed over the nodes: threads working on the same data should be on
    the same node to avoid remote accesses. Data should be mapped close to the
    threads using them, and the data (and the threads) should be spread over
    the nodes to avoid memory contention. If a part of the memory is used by
    every thread, it should be either duplicated (for a relatively small,
    read-only structure) or interleaved among the nodes, for the same reasons.

    If some of your structures are used by (almost) every thread, you should
    obtain better performance with one of the following solutions:

    1.  **Divide the structure**

        Modify your code to make threads work on small parts of the structure,
        then pin threads working on close data to the same NUMA node (or use an
        automated tool to do so).

    2.  **Interleave**

        You can distribute the pages of the structure among the NUMA nodes
        (interleave) to balance the memory bandwidth.

    3.  **Duplicate**

        If the structure is only read and is small enough, each thread can
        work on a local copy.
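
    As a minimal sketch of the first option, the hypothetical helper
    `chunk_bounds` below (its name and signature are assumptions made for
    illustration) splits `n` elements into contiguous, nearly equal blocks, one
    per thread. Threads with neighbouring identifiers receive neighbouring
    blocks, so they are natural candidates for being pinned to the same NUMA
    node.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical helper: compute the half-open range [*begin, *end) of the
     * elements owned by thread `tid` (0-based) when `n` elements are divided
     * among `nthreads` threads. The first (n % nthreads) threads get one extra
     * element, so block sizes differ by at most one. */
    void chunk_bounds(size_t n, size_t nthreads, size_t tid,
                      size_t *begin, size_t *end)
    {
        assert(nthreads > 0 && tid < nthreads);
        size_t base = n / nthreads;
        size_t extra = n % nthreads;

        *begin = tid * base + (tid < extra ? tid : extra);
        *end = *begin + base + (tid < extra ? 1 : 0);
    }
    ```

    With `n = 10` and 4 threads, the blocks are `[0, 3)`, `[3, 6)`, `[6, 8)`
    and `[8, 10)`.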

2.  **Mapping policy**

    By default, most recent operating systems map memory pages close to the
    first thread accessing them; this policy is called first touch. Therefore,
    either the first touch should be done by the thread actually using the
    data, or an explicit mapping should be made (manually or via an external /
    automated tool).
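
    Letting the thread that uses the data perform the first touch can be
    sketched as follows, assuming POSIX threads. The names `struct slice`,
    `worker` and `first_touch_sum` are hypothetical. Each worker writes to its
    own slice before reading it, so under a first-touch policy the pages of
    that slice should be mapped near the worker. The sketch does not pin
    threads; it assumes they stay on the node where they first ran, which in
    practice usually requires explicit pinning.

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical work description: the thread owns data[begin, end). */
    struct slice {
        double *data;
        size_t begin, end;
        double sum;                 /* result written by the worker */
    };

    void *worker(void *arg)
    {
        struct slice *s = arg;

        for (size_t i = s->begin; i < s->end; i++)
            s->data[i] = 1.0;       /* first touch by the owning thread */
        s->sum = 0.0;
        for (size_t i = s->begin; i < s->end; i++)
            s->sum += s->data[i];   /* later accesses by the same thread */
        return NULL;
    }

    /* Allocate n doubles without initializing them, then let nthreads workers
     * initialize and sum their own slices. Returns the total, or -1.0 if an
     * allocation or a thread creation fails. */
    double first_touch_sum(size_t n, size_t nthreads)
    {
        assert(n > 0 && nthreads > 0);
        double *data = malloc(n * sizeof *data);   /* not written here */
        pthread_t *tids = malloc(nthreads * sizeof *tids);
        struct slice *slices = malloc(nthreads * sizeof *slices);
        size_t started = 0;
        double total = 0.0;

        if (data && tids && slices) {
            for (size_t t = 0; t < nthreads; t++) {
                slices[t].data = data;
                slices[t].begin = n * t / nthreads;
                slices[t].end = n * (t + 1) / nthreads;
                slices[t].sum = 0.0;
                if (pthread_create(&tids[t], NULL, worker, &slices[t]) != 0)
                    break;
                started++;
            }
            for (size_t t = 0; t < started; t++) {
                pthread_join(tids[t], NULL);
                total += slices[t].sum;
            }
        }
        free(slices);
        free(tids);
        free(data);
        return (started == nthreads) ? total : -1.0;
    }
    ```

    Initializing the whole array from the main thread (for example with a
    single `memset`) would instead map every page near the main thread.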


3.  **Beware of the stack**

    The stack is designed for private data; shared data should be on the heap
    (dynamically allocated) or global. Remote stack accesses can be quite
    inefficient and are hard for automated tools to improve.
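
    A minimal sketch of this separation, assuming POSIX threads: everything
    that several threads access (`struct shared` and the per-thread `struct
    task` descriptors, both hypothetical names) is allocated on the heap, while
    each thread keeps only private temporaries on its own stack.

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stdlib.h>

    enum { NTHREADS = 4 };

    /* Hypothetical shared state: one result slot per thread. */
    struct shared {
        long results[NTHREADS];
    };

    struct task {
        struct shared *sh;   /* points to heap memory, not to another stack */
        int id;
    };

    void *fill_slot(void *arg)
    {
        struct task *t = arg;
        assert(t->id >= 0 && t->id < NTHREADS);
        long local = (long)t->id * t->id;   /* private temporary on the stack */
        t->sh->results[t->id] = local;      /* shared result on the heap */
        return NULL;
    }

    /* Returns the sum of all slots, or -1 if an allocation or a thread
     * creation fails. */
    long run_with_heap_shared(void)
    {
        struct shared *sh = calloc(1, sizeof *sh);
        struct task *tasks = calloc(NTHREADS, sizeof *tasks);
        pthread_t tids[NTHREADS];           /* used only by the calling thread */
        long total = -1;
        int started = 0;

        if (sh && tasks) {
            for (int i = 0; i < NTHREADS; i++) {
                tasks[i].sh = sh;
                tasks[i].id = i;
                if (pthread_create(&tids[i], NULL, fill_slot, &tasks[i]) != 0)
                    break;
                started++;
            }
            for (int i = 0; i < started; i++)
                pthread_join(tids[i], NULL);
            if (started == NTHREADS) {
                total = 0;
                for (int i = 0; i < NTHREADS; i++)
                    total += sh->results[i];
            }
        }
        free(tasks);
        free(sh);
        return total;
    }
    ```

    Declaring `struct shared` or the `tasks` array as local variables of the
    creating thread would also work functionally, but every worker would then
    be reading and writing that thread's stack.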