This file contains various notes and lessons learned concerning performance
of the Homa Linux kernel module. The notes are in reverse chronological
order.
58. (September 2024): Interference between Homa and TCP when both run
concurrently on the same nodes (no special kernel code to mitigate
interference)
Experiment on xl170 cluster:
cp_both -n 9 --skip 0 -w w4 -b 20 -s 30
HomaGbps: Gbps generated by Homa (20 - HomaGbps generated by TCP)
HAvg: Average slowdown for Homa
HP50: Median RTT for Homa short messages
HP99: P99 RTT for Homa short messages
TAvg: Average slowdown for TCP
TP50: Median RTT for TCP short messages
TP99: P99 RTT for TCP short messages
HomaGbps   HAvg   HP50   HP99   TAvg   TP50    TP99
       0                        63.4    797    6089
       2    8.1     66    335   80.5   1012   10131
       4    8.6     65    507   80.0   1021    9315
       6    9.9     66    765   80.8   1022    9328
       8   12.1     68   1065   79.8   1042    8309
      10   14.3     70   1324   76.7    993    6881
      12   15.1     72   1394   73.4    971    5866
      14   14.8     75   1305   73.1    927    6076
      16   12.9     75   1077   70.2    816    6564
      18   10.0     70    755   69.7    748    7387
      20    4.4     44    119
Overall observations:
* Short messages:
* Homa: 2x increase for P50, 10x increase for P99
* TCP: 25% increase for P50, 10% increase for P99
* The TCP degradation is caused by Homa using priorities. If the
experiment is run without priorities for Homa, TCP's short-message
latencies are significantly better than when TCP runs by itself: 571 us for P50,
3835 us for P99.
* Long messages:
* TCP P50 and P99 latency drop by up to 40% as Homa traffic share
increases (perhaps because Homa throttles itself to link speed?)
* Running Homa without priorities improves TCP even more (2x gain for TCP
P50 and P99 under even traffic split, relative to TCP alone)
* Homa latency not much affected
* Other workloads:
* W5 similar to W4
* W3 and W2 show less Homa degradation, more TCP degradation
* Estimated NIC queue lengths have gotten much longer (e.g. P99 queueing
delay of 235-750 us now, vs. < 10 us when Homa runs alone)
* Homa packets are experiencing even longer delays than this because
packets aren't distributed evenly across tx queues, while the NIC serves
queues evenly.
57. (August 2024): Best known parameters for c6525-100g cluster:
Homa:
hijack_tcp=1 unsched_bytes=20000 window=0 max_incoming=1000000
gro_policy=0xe2 throttle_min_bytes=1000
--client-ports 4 --port-receivers 6 --server-ports 4 --port-threads 6
TCP:
--tcp-client-ports 4 --tcp-server-ports 6
56. (August 2024): Performance challenges with c6525-100g cluster (AMD CPUs,
100 Gbps links):
* The highest achievable throughput for Homa with W4 is 72-75 Gbps.
* TCP can get 78-79 Gbps with W4.
* The bottleneck is NIC packet transmission: 1 MB or more of data can
accumulate in NIC queues, and data can be queued in the NIC for 1 ms
or more.
* Memory bandwidth appears to be the limiting factor (not, say,
per-packet overheads for mapping addresses). For example, W2 can
transmit more packets than W4 without any problem.
* NIC queue buildup is not even across output queues. The queue used by
the pacer has significantly more buildup than the other queues. This
suggests that the NIC services queues in round-robin order. The pacer
queue gets a large fraction of all outbound traffic but it receives
only a 1/Nth share of the NIC's output bandwidth, so when the NIC can't
keep up, packets accumulate primarily in this one queue.
* Priorities don't make a significant difference in latency! It appears
that the NIC queuing issue is the primary contributor to P99 latency
even for short messages (too short to use the pacer). This is evident
because not only do P99 packets take a long time to reach the receiver's
GRO, they also take a long time to get returned to the sender to be
freed; this suggests that they are waiting a long time to get
transmitted. Perhaps the P99 packets are using the same output queue
as the pacer?
* Even at relatively low throughputs (e.g. 40 Gbps), P99 latency still
seems to be caused by slow NIC transmission, not incast queueing.
* Increasing throttle_min_bytes improves latency significantly: it keeps
more packets out of the pacer, and packets transmitted by the pacer are
much more likely to experience high NIC delays.
55. (June/July 2024): Reworked retry mechanism to retry more aggressively.
Introduced ooo_window_usecs sysctl parameter with an initial value of
100 us; retry gaps once they reach this age. However, this increased the
number of resent packets by 20x and reduced throughput as well.
Hypothesis: many packets suffer quite long delays but eventually get
through; with fast retries, these get resent unnecessarily. Tried
increasing the value of ooo_window_usecs, and this helped a bit, but
performance is best if retries only happen when homa_timer hits its
resend_ticks value. So, backed out support for ooo_window_usecs.
54. (June 2024): New sk_buff allocation mechanism. Up until now, Homa
allocated an entire tx sk_buff with alloc_skb: both the packet header
and the packet data were allocated in the head. However, this resulted
in high overheads for sk_buff allocation. Introduced a new mechanism
(in homa_skb.c) for tx sk_buffs, where only the packet header is in the
head. The data for data packets is allocated using frags and high-order
pages (currently 64 KB). In addition, when sk_buffs are freed, Homa
saves the pages in pools (one per NUMA node) to eliminate the overhead
of page allocation. Here are before/after measurements taken with the
W4 workload on a 9-node c6525-100g cluster:
Before After
Avg. time to allocate sk_buff 7-9 us 0.85 us
Cores spent in sk_buff alloc 3.6-4.5 0.4-0.5
Cores spent in kfree_skb 1.1-1.3 0.3-0.4
Goodput/core 5.9-7.2 Gbps 8.4-10 Gbps
Time to allocate page 12 us
Cores spent allocating pages 0.04-0.08
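The allocation pattern can be sketched as follows. This is not the actual
homa_skb.c code; the function name tx_skb_alloc_sketch and the pool_page
argument are illustrative, and the per-NUMA pooling itself is not shown.
The point is that only the header lives in the skb head, while the data is
attached as a frag referencing a pooled high-order page:
    static struct sk_buff *tx_skb_alloc_sketch(int hdr_len,
                    struct page *pool_page, int page_off, int data_len)
    {
        struct sk_buff *skb;

        /* Small, cheap allocation: the head holds only the packet header. */
        skb = alloc_skb(hdr_len, GFP_KERNEL);
        if (!skb)
            return NULL;
        __skb_put(skb, hdr_len);

        /* Data is a frag referencing a pooled high-order page; get_page
         * keeps the page alive until the skb is freed (Homa's free path
         * then recycles such pages into per-NUMA-node pools).
         */
        get_page(pool_page);
        skb_fill_page_desc(skb, 0, pool_page, page_off, data_len);
        skb->len += data_len;
        skb->data_len += data_len;
        skb->truesize += data_len;
        return skb;
    }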
53. (May 2024; superseded by #56) Strange NIC behavior (observed with Mellanox
ConnectX5 NICs on the c6525-100g CloudLab cluster, using W4 with offered
load 80 Gbps and actual throughput more like 60 Gbps).
* The NIC is not returning tx packets to the host promptly after
transmission. In one set of traces (W4 at 80% offered load), 20% of
all packets weren't freed until at least 50 us after the packets had
been received by the target GRO; P99 delay was 400 us, and some packets
were delayed more than 1 ms. Note: other traces are not as bad, but
still show significant delays (15-20% of delays are at least 50 usec,
worst delays range from 250 us - 1100 us).
* Long delays in returning tx packets cause Linux to stop the tx queue
(it has a limit on outstanding bytes on a given channel), which slows
down transmission.
* The NIC doesn't seem to be able to transmit packets at 100 Gbps.
Many packets seem not to be transmitted for long periods of time (up to
1-2 ms) after they are added to a NIC queue: both the time until GRO
receipt and time until packet free are very long. Different tx queues
experience different delays: the delays for one queue can be short at
the same time that delays for another queue are very long. These problems
occur when Homa is passing packets to the NIC at < 100 Gbps.
* The NIC is not transmitting packets from different tx queues in a FIFO
order; it seems to be favoring some tx queues (perhaps it is
round-robining so queues with more traffic get treated badly?).
52. (February 2024) Impact of core allocation. Benchmark setup: 2 nodes,
c6525-100g cluster (100 Gbps network, 48 hyperthreads, 24 cores, 3 cores
per chiplet?):
cp_node server --pin N
cp_node client --workload 500000 --one-way --client-max 1
window=0 max_incoming=2500000 gro_policy=16 unsched_bytes=50000
Measured RPC throughput and copy_to_user throughput:
--pin Gbps Copy
0 17.7 33.4
3 18.9 32.2
6 19.0 34.3
8 18.8 34.1
9 22.2 54.2
10 25.7 53.2
11 26.3 55.1
12 17.9 31.7
13 18.2 31.6
15 17.9 31.5
18 18.2 32.3
21 18.1 32.4
32 18.6 34.0
33 24.8 54.0
34 25.9 54.5
35 26.3 54.5
36 17.7 31.5
51. (February 2024) RPC lock preemption. When SoftIRQ was processing a large
batch of packets for a single RPC, it held the RPC lock continuously.
This prevented homa_copy_to_user from acquiring the lock to extract the
next batch of packets to copy. Since homa_copy_to_user is the bottleneck
for large messages on 100 Gbps networks, this can potentially affect
throughput. Fixed by introducing APP_NEEDS_LOCK for RPCs, so that
SoftIRQ releases the lock temporarily if homa_copy_to_user needs it.
This may have improved throughput for W4 on c6525-100g cluster by 10%,
but it's very difficult to measure accurately.
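The handoff can be sketched like this; the bit number, the rpc_sketch struct,
and the field names are illustrative rather than Homa's exact definitions:
    #define APP_NEEDS_LOCK 0                /* illustrative bit number */

    struct rpc_sketch {
        spinlock_t lock;
        unsigned long flags;
    };

    /* Application side (homa_copy_to_user): announce interest, then acquire. */
    static void app_lock(struct rpc_sketch *rpc)
    {
        set_bit(APP_NEEDS_LOCK, &rpc->flags);
        spin_lock_bh(&rpc->lock);
        clear_bit(APP_NEEDS_LOCK, &rpc->flags);
    }

    /* SoftIRQ side: between packets of a batch, yield briefly if the
     * application is waiting for the lock.
     */
    static void softirq_maybe_yield(struct rpc_sketch *rpc)
    {
        if (test_bit(APP_NEEDS_LOCK, &rpc->flags)) {
            spin_unlock_bh(&rpc->lock);
            /* The application acquires the lock here. */
            spin_lock_bh(&rpc->lock);
        }
    }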
50. (February 2024) Don't queue IPIs. Discovered that when homa_gro_receive
invokes netif_receive_skb (intending to push a batch of packets through
to SoftIRQ ASAP), Linux doesn't immediately send an interprocessor
interrupt (IPI). It just queues the pending IPI until all NAPI processing
is finished, then issues all of the queued IPIs. This results in
significant delay for the first batch when NAPI has lots of additional
packets to process. Fixed this by writing homa_send_ipis and invoking it
in homa_gro_receive after calling netif_receive_skb. In 2-node tests
with "cp_node client --workload 500000 --client-max 1 --one-way"
(c6525-100g cluster), this improved latency from RPC start to beginning
copy to user space from 79 us to 46 us, resulting in 10-20% improvement
in throughput. W4 throughput appears to have improved about 10% (but a bit
hard to measure precisely).
49. (November 2023) Implemented "Gen3" load balancing scheme, renamed the
old scheme "Gen2". For details on load balancing, see balance.txt.
Gen3 seems to reduce tail latency for cross-core handoffs significantly;
here are a few samples (us):
--Gen2 P50- ---Gen2 P99--- --Gen3 P50- --Gen3 P99-
GRO -> SoftIRQ 2.7 2.8 3.0 71.1 43.6 71.3 2.8 2.6 2.7 8.7 5.4 8.3
SoftIRQ -> App 0.3 0.3 0.3 20.5 21.7 19.9 0.3 0.3 0.3 7.2 6.8 9.0
However, this doesn't seem to translate into better overall performance:
standard slowdown graphs look about the same with Gen2 and Gen3 (Gen2 has
better P99 latency for W2 and W3; Gen3 is better for W5). This needs more
analysis.
48. (August 2023) Unexpected packet loss on c6525-100g cluster (AMD processors,
100 Gbps links). Under some conditions (such as "cp_node client --one-way
--workload 1000000" with dynamic_windows=1 and unsched_bytes=50000)
messages suffer packet losses starting around offset 700000 and
continuing intermittently until the end of the message. I was unable
to identify a cause, but increasing the size of the Mellanox driver's
page cache (MLX5E_CACHE_SIZE, see item 46 below) seems to make the problem
go away. Slight configuration changes, such as unsched_bytes=200000, also
make the problem go away.
47. (July 2023) Intel vs. AMD processors. 100B roundtrips under best-case
conditions are about 8.7 us slower on AMD processors than Intel:
xl170: 14.5 us
c6525-100g: 23.2 us
Places where c6525-100g is slower (each way):
Packet prep (Homa): 1.2 us
IP stack and driver : 0.9 us
Network (interrupts?): 1.7 us
Thread wakeup: 0.6 us
TCP is also slower on AMD: 38.7 us vs. 23.3 us
Note: results on AMD are particularly sensitive to core placement of
various components.
46. (July 2023) MLX buffer issues on c6525-100g cluster. The Mellanox
driver is configured with 256 pages (1 MB) of receive buffer space
for each channel. With a 100 Gbps network, this is about 80 us of
time. However, a single thread can copy data from buffers to user space
at only about 40 Gbps, which means that with longer messages, the
copy gets behind and packet lifetimes increase: with 1 MB messages,
median lifetime is 77 us and P90 lifetime (i.e. the later packets in
messages) is 115 us. With multiple messages from one host to another,
the buffer cache runs dry. When this happens, the Mellanox driver
allocates (and eventually frees) additional buffers, which adds
significant overhead. Bottom line: it's essential to use multiple
channels to keep up with a 100 Gbps network (this provides a larger
total buffer pool, plus more threads to copy to user space).
45. (January 2023) Up until now, output messages had to be completely copied
into sk_buffs before transmission could begin. Modified Homa to pipeline
the copy from user space with packet transmission. This makes a significant
difference in performance. For cp_node client --one-way --workload 500000
with MTU 1500, goodput increased from 11 Gbps (see #43 below) to 17-19
Gbps. For comparison, TCP is about 18.5 Gbps.
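The pipelining loop looks roughly like this; build_data_skb,
xmit_data_packet, MAX_SEG, and struct tx_rpc are hypothetical placeholders,
not Homa's real interfaces. What matters is that each chunk is handed to
the NIC before the next chunk is copied from user space:
    struct tx_rpc;                          /* opaque, illustrative */

    static int send_message_pipelined(struct tx_rpc *rpc,
                                      const char __user *buf, size_t len)
    {
        size_t offset = 0;

        while (offset < len) {
            size_t chunk = min_t(size_t, len - offset, MAX_SEG);
            struct sk_buff *skb = build_data_skb(rpc, chunk);

            if (!skb)
                return -ENOMEM;
            if (copy_from_user(skb_put(skb, chunk), buf + offset, chunk)) {
                kfree_skb(skb);
                return -EFAULT;
            }

            /* Transmitting this chunk overlaps with copying the next. */
            xmit_data_packet(rpc, skb);
            offset += chunk;
        }
        return 0;
    }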
44. (January 2023) Until now Homa has held an RPC's lock while transmitting
packets for that RPC. This isn't a problem if ip_queue_xmit returns
quickly. However, in some configurations (such as Intel xl170 NICs) the
driver is very slow, and if the NIC can't do TSO for Homa then the packets
passed to the NIC aren't very large. In these situations, Homa will be
transmitting packets almost 100% of the time for large messages, which
means the RPC lock will be held continuously. This locks out other
activities on the RPC, such as processing grants, which causes additional
performance problems. To fix this, Homa releases the RPC lock while
transmitting data packets (ip_queue_xmit or ip6_xmit). This helps a lot
with bad NICs, and even seems to help a little with good NICs (5-10%
increase in throughput for single-flow benchmarks).
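In outline, the change is just to drop the lock around the slow call; the
lock field shown here is illustrative, and the real code also has to guard
against the RPC being reaped while unlocked (not shown):
    /* Caller holds the RPC lock. */
    spin_unlock_bh(&rpc->lock);
    err = ip_queue_xmit(sk, skb, &inet_sk(sk)->cork.fl);   /* may be slow */
    spin_lock_bh(&rpc->lock);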
43. (December 2022) 2-host throughput measurements (Gbps). Configuration:
* Single message: cp_node client --one-way --workload 500000
Server: one thread, pinned on a "good" core (avoid GRO/SoftIRQ conflicts)
* Multiple messages: client adds "--ports 2 --client-max 8"
Server doesn't pin, adds "--port-threads 2" (single port)
* All measurements used rtt_bytes=150000
1.01 2.0 Buf + Short Bypass
---------------------------------------------------------------------
Single message (MTU 1500) 9 11
Single message (MTU 3000) 10-11 13
Multiple messages (MTU 1500) 20-21 21-22
Multiple messages (MTU 3000) 22-23 22-23
Conclusions:
* The new buffering mechanism helps single-message throughput by about 20%,
but has little impact when there are many concurrent messages.
* Homa 1.01 seems to be able to hide most of the overhead of
page pool thrashing (#35 below).
42. (December 2022) New cluster measurements with "bench n10_mtu3000" (10
nodes, MTU 3000B) on the following configurations:
Jun 22: Previous measurements from June of 2022
1.01: Last commit before implementing new Homa-allocated buffers
2.0 Buf: Homa-allocated buffers
Grants: 2.0 Buf plus GRO_FAST_GRANTS (incoming grants processed
entirely during GRO)
Short Bypass: 2.0 Buf plus GRO_SHORT_BYPASS (all packets < 1400 bytes
processed entirely during GRO)
Short-message latencies in usecs (fastest short messages taken from
homa_w*.data files, W4NL data taken from unloaded_w4.data):
Jun 22 1.01 2.0 Buf Grants Short Bypass
P50 P99 P50 P99 P50 P99 P50 P99 P50 P99
---------- ---------- ---------- ---------- ----------
W2 38.2 100 37.1 84.7 38.3 87.1 38.9 89.7 27.1 70.5
W3 54.8 269 53.0 263 51.8 211 51.0 216 39.2 216
W4 55.8 189 56.0 207 53.0 113 54.0 128 44.6 106
W5 65.3 223 66.2 232 61.9 133 62.2 154 61.5 150
W4NL 16.6 32.4 15.2 30.1 16.2 30.6 16.2 31.5 13.7 27.1
Best of 5 runs from "bench basic_n10_mtu3000":
1.01 2.0 Buf Grants Short Bypass
------------------------------------------------------------------------
Short-message RTT (usec) 16.1 16.1 16.1 13.5
Single-message throughput (Gbps) 10.2 12.2 12.7 12.5
Client RPC throughput (Mops/s) 1.46 1.51 1.52 1.75
Server RPC throughput (Mops/s) 1.52 1.66 1.63 1.73
Client throughput (Gbps) 23.6 23.7 23.6 23.7
Server throughput (Gbps) 23.6 23.7 23.7 23.7
Conclusions:
* New buffering reduces tail latency >40% for W4 and W5 (perhaps by
eliminating all-at-once message copies that occupy cores for long
periods?). Latency improves by 20-30% (both at P50 and P99) for
all message lengths in W4.
* New buffering improves single-message throughput by 20% (25% when
combined with fast grants)
* Short bypass appears to be a win overall: a bit worse P99 for W5,
but better everywhere else and a significant improvement for short
messages at low load
41. (December 2022) More analysis of SMI interrupts. Wrote smi.cc to gather
data on events that cause all cores to stop simultaneously. Found 3 distinct
kinds of gaps on xl170 (Intel) CPUs:
* 2.5 usec gaps every 4 ms
* 17 usec gaps every 10 ms (however, these don't seem to be consistent:
they appear for a while at the start of each experiment, then stop)
* 170 usec gaps every 250 ms
I don't know for sure that these are all caused by SMI (e.g., could the
gaps every 4 ms be scheduler wakeups?)
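The measurement technique in smi.cc can be sketched as below (this is not
the actual smi.cc source; the 2.5 GHz TSC frequency and the 2 us threshold
are assumptions): spin reading the TSC and report any iteration where the
core appears to have stalled.
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    int main(void)
    {
        const double ghz = 2.5;                       /* assumed TSC rate */
        const uint64_t threshold = (uint64_t)(2.0 * 1000 * ghz);  /* ~2 us */
        uint64_t prev = __rdtsc();

        for (;;) {
            uint64_t now = __rdtsc();
            if (now - prev > threshold)
                printf("gap of %.1f us\n", (now - prev) / (ghz * 1000.0));
            prev = now;
        }
        return 0;
    }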
40. (December 2022) NAPI can't process incoming jumbo frames at line rate
for 100 Gbps network (AMD CPUs): it takes about 850 ns to process each
packet (median), but packets are arriving every 700 ns.
Most of the time is spent in __alloc_skb in two places:
kmalloc_reserve for data: 370 ns
prefetchw for last word of data: 140 ns
These times depend on core placements of threads; the above times
are for an "unfortunate" (but typical) placement; with an ideal placement,
the times drop to 100 ns for kmalloc_reserve and essentially 0 for the
prefetch.
Intel CPUs don't seem to have this problem: on the xl170 cluster, NAPI
processes 1500B packets in about 300 ns, and 9000B packets in about
450 ns.
39. (December 2022) One-way throughput for 1M messages varies from 18-27 Gbps
for Homa on the c6525-100g cluster, whereas TCP throughput is relatively
constant at 24 Gbps. Homa's variance comes from core placement: performance
is best if all of NAPI, GRO, and app are in the same group of 3 cores
(3N..3N+2) or their hypertwins. If they aren't, there are significant
cache miss costs as skbs get recycled from the app core back to the NAPI
core. TCP uses RFS to make sure that NAPI and GRO processing happen on
the same core as the application.
38. (December 2022) Restructured the receive buffer mechanism to mitigate
the page_pool_alloc_pages_slow problem (see August 2022 below); packets
can now be copied to user space and their buffers released without waiting
for the entire message to be received. This has a significant impact on
throughput. For "cp_node --one-way --client-max 4 --ports 1 --server-ports 1
--port-threads 8" on the c6525-100g cluster:
* Throughput increased from 21.5 Gbps to 42-45 Gbps
* Page allocations still happen with the new code, but they only consume
0.07 core now, vs. 0.6 core before
37. (November 2022) Software GSO is very slow (17 usec on AMD EPYC processors,
breaking 64K into 9K jumbo frames). The main problem appears to be sk_buff
allocation, which takes multiple usecs because the packet buffers are too
large to be cached in the slab allocator.
36. (November 2022) Intel vs. AMD CPUs. Compared
"cp_node client --workload 500000" performance on c6525-100g cluster
(24-core AMD 7402P processors @ 2.8 GHz, 100 Gbps networking) vs. xl170
cluster (10-core Intel E5-2640v4 @ 2.4 GHz, 25 Gbps networking), priorities
not enabled on either cluster:
Intel/25Gbps AMD/100Gbps
-----------------------------------------------------------------------
Packet size 1500B 9000B
Overall throughput (each direction) 3.4 Gbps 6.7-7.5 Gbps
Stats from ttrpcs.py:
Xmit/receive tput 11 Gbps 30-50 Gbps
Copy to/from user space 36-54 Gbps 30-110 Gbps
RTT for first grant 28-32 us 56-70 us
Stats from ttpktdelay.py:
SoftIRQ Wakeup (P50/P90) 6/30 us 14/23 us
Minimum network RTT 5.5 us 8 us
RTT with 100B messages 17 us 28 us
35. (August 2022) Found problem with Mellanox driver that explains the
page_pool_alloc_pages_slow delays in the item below.
* The driver keeps a cache of "free" pages, organized as a FIFO
queue with a size limit.
* The page for a packet buffer gets added to the queue when the
packet is received, but with a nonzero reference count.
* The reference count is decremented when the skbuff is released.
* If the page gets to the front of the queue with a nonzero reference
count, it can't be allocated. Instead, a new page is allocated,
which is slower. Furthermore, this will result in excess pages,
eventually causing the queue to overflow; at that point, the excess
pages will be freed back to Linux, which is slow.
* Homa likes to keep large numbers of buffers around for
significant time periods; as a result, it triggers the slow path
frequently, especially for large messages.
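The cache behavior can be sketched as follows (illustrative code, not the
actual mlx5 driver; CACHE_SIZE stands in for MLX5E_CACHE_SIZE):
    #define CACHE_SIZE 256

    struct page_fifo {
        struct page *ring[CACHE_SIZE];
        int head, count;
    };

    static struct page *fifo_get_page(struct page_fifo *f)
    {
        if (f->count > 0) {
            struct page *p = f->ring[f->head];

            /* Reusable only if nothing (e.g. a long-lived Homa skb)
             * still holds a reference to it.
             */
            if (page_ref_count(p) == 1) {
                f->head = (f->head + 1) % CACHE_SIZE;
                f->count--;
                return p;
            }
        }
        /* Slow path: allocate a fresh page.  The skipped page stays in
         * the FIFO; when the FIFO eventually overflows, excess pages are
         * released back to Linux, which is also slow.
         */
        return alloc_page(GFP_ATOMIC);
    }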
34. (August 2022) 2-node performance is problematic. Ran experiments with
the following client cp_node command:
cp_node client --ports 3 --server-ports 3 --client-max 10 --workload 500000
With max_window = rtt_bytes = 60000, throughput is only about 10 Gbps
on xl170 nodes. ttpktdelay output shows one-way times commonly 30us or
more, which means Homa can't keep enough grants outstanding for full
bandwidth. The overheads are spread across many places:
IP: IP stack, from calling ip_queue_xmit to NIC wakeup
Net: Additional time until homa_gro_receive gets packet
GRO Other: Time until end of GRO batch
GRO Gap: Delay after GRO packet processing until SoftIRQ handoff
Wakeup: Delay until homa_softirq starts
SoftIRQ: Time in homa_softirq until packet is processed
Total: End-to-end time from calling ip_queue_xmit to homa_softirq
handler for packet
Data packet lifetime (us), client -> server:
Pctile IP Net GRO Other GRO Gap Wakeup SoftIRQ Total
0 0.5 4.6 0.0 0.2 1.0 0.1 7.3
10 0.6 10.3 0.0 5.7 2.0 0.2 21.0
30 0.7 12.4 0.4 6.3 2.1 1.9 27.0
50 0.7 15.3 1.0 6.6 2.2 3.3 32.2
70 0.8 18.2 2.0 8.1 2.3 3.8 45.3
90 1.0 33.9 4.9 31.3 2.5 4.8 62.8
99 1.4 56.5 20.7 48.5 17.7 17.5 85.6
100 16.0 74.3 31.0 61.9 28.3 24.4 111.0
Grant lifetime (us), client -> server:
Pctile IP Net GRO Other GRO Gap Wakeup SoftIRQ Total
0 1.7 2.6 0.0 0.3 1.0 0.0 7.6
10 2.4 5.3 0.0 0.5 1.5 0.1 12.1
30 2.5 10.3 0.0 6.1 2.1 0.1 23.3
50 2.6 12.7 0.5 6.5 2.2 0.2 28.1
70 2.8 16.5 1.1 7.2 2.3 0.3 38.1
90 3.4 31.7 3.5 22.6 2.5 3.1 56.2
99 4.6 54.1 17.7 48.4 17.5 4.3 78.5
100 54.9 67.5 28.4 61.9 28.3 21.9 98.3
Additional client-side statistics:
Pre NAPI: usecs from interrupt entry to NAPI handler
GRO Total: usecs from NAPI handler entry to last homa_gro_receive
Batch: number of packets processed in one interrupt
Gap: usecs from last homa_gro_receive call to SoftIRQ handoff
Pctile Pre NAPI GRO Batch Gap
0 0.7 0.4 0 0.2
10 0.7 0.6 0 0.3
30 0.8 0.7 1 0.4
50 0.8 1.5 2 6.6
70 1.0 2.6 3 7.0
90 2.7 4.9 4 7.5
99 6.4 8.0 7 34.2
100 21.7 23.9 12 48.2
In looking over samples of long delays, there are two common issues that
affect multiple metrics:
* page_pool_alloc_pages_slow; affects:
P90/99 Net, P90/99 GRO Gap, P99 SoftIRQ wakeup
* unidentified 14-17 us gaps in homa_xmit_data, homa_gro_receive,
homa_data_pkt, and other places:
affects P99 GRO Other, P99 SoftIRQ, P99 GRO
In addition, I found the following smaller problems:
* unknown gaps before homa_gro_complete of 20-30 us, affects:
P90 SoftIRQ wakeup
Is this related to the "unidentified 14-17 us gaps" above?
* net_rx_action sometimes slow to start; affects:
Wakeup
* large batch size affects:
P90 SoftIRQ
33. (June 2022) Short-message timelines (xl170 clusters, "cp_node client
--workload 100 --port-receivers 0"). All times are ns (data excludes
client-side recv->send turnaround time). Most of the difference
seems to be in kernel call time and NIC->NIC time. Also, note that
the 5.4.80 times have improved considerably from January 2021; there
appears to be at least 1 us variation in RTT from machine to machine.
5.17.7 5.4.80
Server Client Server Client
----------------------------------------------------------
Send:
homa_send/reply 461 588 468 534
IP/Driver 514 548 508 522
Total 975 1136 1475 1056
Receive:
Interrupt->Homa GRO 923 1003 789 815
GRO 200 227 193 201
Wakeup SoftIRQ 601 480 355 347
IP SoftIRQ 361 441 400 361
Homa SoftIRQ 702 469 588 388
Wakeup App 94 106 87 53
homa_recv 447 562 441 588
Total 3328 3288 2853 2753
Recv -> send kcall 682 220
NIC->NIC (round-trip) 6361 5261
RTT Total 15770 13618
32. (January 2021) Best-case short-message timelines (xl170 cluster).
Linux 4.15.18 numbers were measured in September 2020. All times are ns.
5.4.80 Server   5.4.80 Client   4.15.18   Ratio
---------------------------------------------------------
Send:
System call 360 360 240 1.50
homa_send/reply 620 870 420 1.77
IP/Driver 495 480 420 1.16
Total 1475 1710 1080 1.47
Receive:
Interrupt->NAPI 560 500 530 1.00
NAPI 560 675 420 1.47
Wakeup SoftIRQ 480 470 360 1.32
IP SoftIRQ 305 335 320 1.00
Homa SoftIRQ 455 190 240 1.34
Wakeup App 80 100 270 0.33
homa_recv 420 450 300 1.45
System Call 360 360 240 1.50
Total 3220 3080 2680 1.18
NIC->NIC (1-way) 2805 2805 2540 1.10
RTT Total 15100 15100 12600 1.20
31. (January 2021) Small-message latencies (usec) for different workloads and
protocols (xl170 cluster, 40 nodes, high load, MTU 3000, Linux 5.4.80):
W2 W3 W4 W5
Homa P50 30.9 41.9 46.8 55.4
P99 57.7 98.5 109.3 139.0
DCTCP P50 106.7 (3.5x) 160.4 (3.8x) 159.1 (3.4x) 151.8 (2.7x)
P99 4812.1 (83x) 6361.7 (65x) 881.1 (8.1x) 991.2 (7.1x)
TCP P50 108.8 (3.5x) 192.7 (4.6x) 353.1 (7.5x) 385.7 (6.9x)
P99 4151.5 (72x) 5092.7 (52x) 2113.1 (19x) 4360.7 (31x)
30. (January 2021) Analyzed effects of various configuration parameters,
running on 40-node xl170 cluster with MTU 3000:
duty_cycle: Reducing to 40% improves small message latency 25% in W4,
40% in W5
fifo_fraction: No impact on small message P99 except W3 (10% degradation);
previous measurements showed 2x improvement in P99 for
largest messages with modified W4 workload.
gro_policy: NORMAL always better; others 10-25% worse for short P99
max_gro_skbs: Larger is better; reducing to 5 hurts short P99 10-15%.
However, anecdotal experience suggests that very large
values can cause long delays for things like sending
grants, so perhaps 10 is best?
max_gso_size: 10K looks best; not much difference above that, 10-20%
degradation of short P99 at 5K
nic_queue_ns: 5-10x degradation in short P99 when there is no limit;
no clear winner for short P99 in 1-10 us range; however,
shorter is better for P50 (1us slightly better than 2us)
poll_usecs: 0-50us all equal for W4 and W5; 50us better for W2 and W3
(10-20% better short P99 than 0us).
ports: Not much sensitivity: 3 server and 3 client looks good.
client threads: Need 3 ports: W2 can't keep up with 1-2 ports, W3 can't
keep up with 1 port. With 3 ports, 2 receivers has 1.5-2x
lower short P99 for W2 and W3 than 4 receivers, but for
W5 3 receivers is 10% better than 2. Best choice: 3p2r?
rtt_bytes: 60K is best, but not much sensitivity: 40K is < 10% worse
throttle_bytes: Almost no noticeable difference from 100-2000; 100 is
probably best because it includes more traffic in the
computation of NIC queue length, reducing the probability
of queue buildup
29. (October 2020) Polling performance impact. In isolation, polling saves
about 4 us RTT per RPC. In the workloads, it reduces short-message P50
up to 10 us, and P99 up to 25us (the impact is greater with light-tailed
workloads like W1 and W2). For W2, polling also improved throughput
by about 3%.
28. (October 2020) Polling problem: some workloads (like W5 with 30 MB
messages) need a lot of receiving threads for occasional bursts where
several threads are tied up receiving very large messages. However,
this same number of receivers results in poor performance in W3,
because these additional threads spend a lot of time polling, which
wastes enough CPU time to impact the threads that actually have
work to do. One possibility: limit the number of polling threads per
socket? Right now it appears hard to configure polling for all
workloads.
27. (October 2020) Experimented with new GRO policy HOMA_GRO_NO_TASKS,
which attempts to avoid cores with active threads when picking cores
for SoftIRQ processing. This made almost no visible difference in
performance, and also depends on modifying the Linux kernel to
export a previously unexported function, so I removed it. It's
still available in repo commits, though.
26. (October 2020) Receive queue order. Experimented with ordering the
hsk->ready_requests and hsk->ready_responses lists to return short
messages first. Not clear that this provided any major benefits, and
it reduced throughput in some cases because of overheads in inserting
ready messages into the queues.
25. (October 2020) NIC queue estimation. Experimented with how much to
underestimate network bandwidth. Answer: not much! The existing 5% margin
of safety leaves bandwidth on the table, which impacts tail latency for
large messages. Reduced it to 1%, which helps large messages a lot (up to
2x reduction in latency). Impact on small messages is mixed (more get worse
than better), but the impact isn't large in either case.
24. (July 2020) P10 under load. Although Homa can provide 13.5 us RTTs under
best-case conditions, this almost never occurs in practice. Even at low
loads, the "best case" (P10) is more like 25-30 us. I analyzed a bunch
of 25-30 us message traces and found the following sources of additional
delay:
* Network delays (from passing packet to NIC until interrupt received)
account for 5-10 us of the additional delay (most likely packet queuing
in the NIC). There could also be delays in running the interrupt handler.
* Every stage of software runs slower, typically taking about 2x as long
(7.1 us becomes 12-23 us in my samples, with median 14.6 us)
* Occasional other glitches, such as having to wake up a receiving
user thread, or interference due to NAPI/SoftIRQ processing of other
messages.
23. (July 2020) Adaptive polling. A longer polling interval (e.g. 500 usecs)
lowers tail latency for heavy-tailed workloads such as W4, but it hurts
other workloads (P999 tail latency gets much worse for W1 because polling
threads create contention for cores; P99 tail latency for large messages
suffers in W3). I attempted an adaptive approach to polling, where a thread
stops polling if it is no longer first in line, and gets woken up later to
resume polling if it becomes first in line again. The hope was that this
would allow a longer polling interval without negatively impacting other
workloads. It did help, but only a bit, and it added a lot of complexity,
so I removed it.
22. (July 2020) Best-case timetraces for short messages on xl170 CloudLab cluster.
Clients: Cum.
Event Median
--------------------------------------------------------------------------
[C?] homa_ioc_send starting, target ?:?, id ?, pid ? 0
[C?] mlx nic notified 939
[C?] Entering IRQ 9589
[C?] homa_gro_receive got packet from ? id ?, offset ?, priority ? 10491
[C?] enqueue_to_backlog complete, cpu ?, id ?, peer ? 10644
[C?] homa_softirq: first packet from ?:?, id ?, type ? 11300
[C?] incoming data packet, id ?, peer ?, offset ?/? 11416
[C?] homa_rpc_ready handed off id ? 11560
[C?] received message while polling, id ? 11811
[C?] Freeing rpc id ?, socket ?, dead_skbs ? 11864
[C?] homa_ioc_recv finished, id ?, peer ?, length ?, pid ? 11987
Servers: Cum.
Event Median
--------------------------------------------------------------------------
[C?] Entering IRQ 0
[C?] homa_gro_receive got packet from ? id ?, offset ?, priority ? 762
[C?] homa_softirq: first packet from ?:?, id ?, type ? 1566
[C?] incoming data packet, id ?, peer ?, offset ?/? 1767
[C?] homa_rpc_ready handed off id ? 2012
[C?] received message while polling, id ? 2071
[C?] homa_ioc_recv finished, id ?, peer ?, length ?, pid ? 2459
[C?] homa_ioc_reply starting, id ?, port ?, pid ? 2940
[C?] mlx nic notified 3685
21. (July 2020) SMI impact on tail latency. I observed gaps of 200-300 us where
a core appears to be doing nothing. These occur in a variety of places
in the code including in the middle of straight-line code or just
before an interrupt occurs. Furthermore, when these happen, *every* core
in the processor appears to stop at the same time (different cores are in
different places). The gaps do not appear to be related to interrupts (I
instrumented every __irq_entry in the Linux kernel sources), context
switches, or c-states (which I disabled). It appears that the gaps are
caused by System Management Interrupts (SMIs); they appear to account
for about half of the P99 traces I examined in W4.
20. (July 2020) RSS configuration. Noticed that tail latency most often occurs
because too much work is being done by either NAPI or SoftIRQ (or both) on
a single core, which prevents application threads on that core from running.
Tried several alternative approaches to RSS to see if better load balancing
is possible, such as:
* Concentrate NAPI and SoftIRQ packet handling on a small number of cores,
and use core affinity to keep application threads off of those cores.
* Identify an unloaded core for SoftIRQ processing and steer packet batches
to these carefully chosen cores (attempted several different policies).
* Bypass the entire Linux networking stack and call homa_softirq directly
from homa_gro_receive.
* Arrange for SoftIRQ to run on the same core as NAPI (this is more efficient
because it avoids inter-processor interrupts, but can increase contention
on that core).
Most of these attempts made things worse, and none produced dramatic
benefits. In the end, I settled on the following hybrid approach:
* For single-packet batches (meaning the NAPI core is underloaded), process
SoftIRQ on the same core as NAPI. This reduces small-message RTT by about
3 us in underloaded systems.
* When there are packet batches, examine several adjacent cores, and pick
the one for SoftIRQ that has had the least recent NAPI/SoftIRQ work.
Overall, this results in a 20-35% improvement in P99 latency for small
messages under heavy-tailed workloads, in comparison to the Linux default
RSS behavior.
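The hybrid policy can be sketched as follows; NUM_CANDIDATES and the
last_net_activity bookkeeping are illustrative, not the actual gro_policy
implementation:
    #define NUM_CANDIDATES 4        /* nearby cores to consider */

    /* Per-core timestamp of the most recent NAPI/SoftIRQ activity
     * (illustrative bookkeeping, updated elsewhere).
     */
    extern u64 last_net_activity[NR_CPUS];

    static int pick_softirq_core(int napi_core, int batch_size)
    {
        u64 oldest = ~0ULL;
        int best = napi_core;
        int i;

        /* Single-packet batch: the NAPI core is underloaded, so handle
         * SoftIRQ locally (saves ~3 us in underloaded systems).
         */
        if (batch_size == 1)
            return napi_core;

        /* Otherwise pick the nearby core that has done the least recent
         * NAPI/SoftIRQ work.
         */
        for (i = 1; i <= NUM_CANDIDATES; i++) {
            int core = (napi_core + i) % nr_cpu_ids;

            if (last_net_activity[core] < oldest) {
                oldest = last_net_activity[core];
                best = core;
            }
        }
        return best;
    }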
19. (July 2020) P999 latency for small messages. This is 5 ms or more in most
of the workloads, and it turns out to be caused by Linux SoftIRQ handling.
If __do_softirq thinks it is taking too much time, it stops processing
all softirqs in the high-priority NAPI thread, and instead defers them
to another thread, softirqd, which intentionally runs at a low priority
so as not to interfere with user threads. This sometimes means it has to
wait for a full time slice for other threads, which seems to be 5-7 ms.
I tried disabling this feature of __do_softirq, so that all requests get
processed in the high-priority thread, and the P9999 latency improved by
about 10x (< 1 ms worst case).
18. (July 2020) Small-message latency. The best-case RTT for small messages
is very difficult to achieve under any real-world conditions. As soon as
there is any load whatsoever, best-case latency jumps from 15 us to 25-40 us
(depending on overall load). The latency CDF for Homa is almost completely
unaffected by load (whereas it varies dramatically with TCP).
17. (July 2020) Small-request optimization: if NAPI and SoftIRQ for a packet
are both done on the same core, it reduces round-trip latency by about
2 us for short messages; however, this works against the optimization below
for spreading out the load. I tried implementing it only for packets that
don't get merged for GRO, but it didn't make a noticeable difference (see
note above about best-case latency for short messages).
16. (June-July 2020) Analyzing tail latency. P99 latency under W4 seems to
occur primarily because of core congestion: a core becomes completely
consumed with either NAPI or SoftIRQ processing (or both) for a long
message, which keeps it from processing a short message. For example,
the user thread that handles the message might be on the congested core,
and hence doesn't run for a long time while the core does NAPI/SoftIRQ
work. I modified Homa's GRO code to pick the SoftIRQ core for each batch
of packets intelligently (choose a core that doesn't appear to be busy
with either NAPI or SoftIRQ processing), and this helped a bit, but not
a lot (10-20% reduction in P99 for W4). Even with clever assignment of
SoftIRQ processing, the load from NAPI can be enough to monopolize a core.
15. (June 2020) Cost of interrupt handler for receiving packets:
mlx5e_mpwqe_fill_rx_skb: 200 ns
napi_gro_receive: 150 ns
14. (June 2020) Does instrumentation slow Homa down significantly? Modified
to run without timetraces and without any metrics except essential ones
for computing priorities:
Latency dropped from 15.3 us to 15.1 us
Small-RPC throughput increased from 1.8 Mops/sec to 1.9 Mops/sec
Large-message throughput didn't change: still about 2.7 MB/sec
Disabling timetraces while retaining metrics roughly splits the
difference. Conclusion: not worth the effort of disabling metrics,
probably not worth turning off timetracing.
13. (June 2020) Implemented busy-waiting, where Homa spins for 2 RTTs
before putting a receiving thread to sleep. This reduced 100B RTT
on the xl170 cluster from 17.8 us to 15.3 us.
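Schematically (msg_ready, wait_for_message, hsk, and rtt_cycles are
illustrative names, not Homa's actual identifiers):
    /* Poll for roughly two RTTs before blocking the receiving thread. */
    u64 deadline = get_cycles() + 2 * rtt_cycles;

    while (!msg_ready(hsk)) {
        if (get_cycles() > deadline) {
            wait_for_message(hsk);          /* sleep path */
            break;
        }
        cpu_relax();
    }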
12. (May 2020) Noticed that cores can disappear for 10-12ms, during which
softirq handlers do not get invoked. Homa timetraces show no activity
of any kind during that time (e.g., no interrupts either?). Found out
later that this is Homa's fault: there is no preemption when executing
in the kernel, and RPC reaping could potentially run on for a very long
time if it gets behind. Fixed this by adding calls to schedule() so that
SoftIRQ tasks can run.
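The fix amounts to yielding periodically inside the reap loop; roughly
(reap_some_rpcs and the dead_skbs check are illustrative):
    while (hsk->dead_skbs > 0) {
        reap_some_rpcs(hsk);        /* free a bounded batch of skbs */

        /* Kernel code is not preempted, so explicitly yield (with no
         * locks held) to let SoftIRQ and other tasks run on this core.
         */
        schedule();
    }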
11. (Mar. 2020) For the slowdown tests, the --port-max value needs to be
pretty high to get true Poisson behavior. It was originally 20, but
increasing it had significant impact on performance for TCP, particularly
for short-message workloads. For example, TCP P99 slowdown for W1 increased
from 15 to 170x when --port-max increased from 20 to 100. Performance
got even worse at --port-max=200, but I decided to stick with 100 for now.
10. (Mar. 2020) Having multiple threads receiving on a single port makes a
big difference in tail latency. cperf had been using just one receiver
thread for each port (on both clients and servers); changing to
multiple threads reduced P50/P99 slowdown for small messages in W5
from 7/65 to 2.5/7.5!
9.* Performance suffers from a variety of load balancing problems. Here
are some examples:
* (March 2020) Throughput varies by 20% from run to run when a single client
sends 500KB messages to a single server. In this configuration, all
packets arrive through a single NAPI core, which is fully utilized.
However, if Linux also happens to place other threads on that core (such
as the pacer), they take time away from NAPI, which reduces throughput.
* (March 2020) When analyzing tail latency for small messages in W5, I found
that user threads are occasionally delayed 100s of microseconds in waking
up to handle a message. The problem was that both the NAPI and SoftIRQ
threads happened (randomly) to get busy on that core at the same time,
and they completely monopolized the core.
* (March 2020) Linux switches threads between cores very frequently when
threads sleep (2/3 of the time in experiments today).
8. (Feb. 2020) The pacer can potentially be a severe performance bottleneck
(a single thread cannot keep the network utilized with packets that are
not huge). In a test with 2 clients bombarding a single server with
1000-byte packets, performance started off high but then suddenly dropped
by 10x. There were two contributing factors. First, once the pacer got
involved, all transmissions had to go through the pacer, and the pacer
became the bottleneck. Second, this resulted in growth of the throttle
queue (essentially all standing requests: > 300 entries in this experiment).
Since the queue is scanned from highest to lowest priority, every insertion
had to scan the entire queue, which took about 6 us. At this point the queue
lock becomes the bottleneck, resulting in 10x drop in performance.
I tried inserting RPCs from the other end of the throttle queue, but
this still left a 2x reduction in throughput because the pacer couldn't
keep up. In addition, it seems like there could potentially be situations
where inserting from the other end results in long searches. So, I backed
this out.
The solution was to allow threads other than the pacer to transmit packets
even if there are entries on the throttle queue, as long as the NIC queue
isn't long. This allows other threads besides the pacer to transmit
packets if the pacer can't keep up. In order to avoid pacer starvation,
the pacer uses a modified approach: if the NIC queue is too full for it to
transmit a packet immediately, it computes the time when it expects the
NIC queue to get below threshold, waits until that time arrives, and
then transmits; it doesn't check again to see if the NIC queue is
actually below threshold (which it may not be if other threads have
also been transmitting). This guarantees that the pacer will make progress.
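The pacer's modified rule can be sketched as follows;
nic_queue_bytes_estimate, threshold_bytes, link_mbps, and transmit_packet
are illustrative names:
    /* If the NIC queue is over threshold, wait just long enough for the
     * excess to drain at link speed, then transmit without rechecking.
     * Not rechecking is what guarantees the pacer makes progress even
     * when other threads are also transmitting.
     */
    u64 backlog = nic_queue_bytes_estimate();

    if (backlog > threshold_bytes) {
        u64 wait_ns = (backlog - threshold_bytes) * 8 * 1000 / link_mbps;

        ndelay(wait_ns);
    }
    transmit_packet(skb);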
7. The socket lock is a throughput bottleneck when a multi-threaded server
is receiving large numbers of small requests. One problem was that the
lock was being acquired twice while processing a single-packet incoming
request: once during RPC initialization to add the RPC to active_rpcs,
and again later to dispatch the RPC to a server thread. Restructured
the code to do both of these with a single lock acquisition. Also
cleaned up homa_wait_for_message to reduce the number of times it
acquires socket locks. This produced the following improvements, measured
with one server (--port_threads 8) and 3 clients (--workload 100 --alt_client
--client_threads 20):
* Throughput increased from 650 kops/sec to 760
* socket_lock_miss_cycles dropped from 318% to 193%
* server_lock_miss_cycles dropped from 1.4% to 0.7%
6. Impact of load balancing on latency (xl170, 100B RPCs, 11/2019):
1 server thread 18 threads TCP, 1 thread TCP, 18 threads
No RPS/RFS 16.0 us 16.3 us 20.0 us 25.5 us
RPS/RFS enabled 17.1 us 21.5 us 21.9 us 26.5 us
5. It's better to queue a thread waiting for incoming messages at the *front*
of the list in homa_wait_for_message, rather than the rear. If there is a
pool of server threads but not enough load to keep them all busy, it's
better to reuse a few threads rather than spreading work across all of
them (this produces better cache locality). This approach improves latency
by 200-500ns at low loads.
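Concretely this is just LIFO rather than FIFO insertion into the socket's
list of waiting threads (the field names below are illustrative):
    /* Old (FIFO): list_add_tail(&thread->links, &hsk->waiting_threads);
     * New (LIFO): recently-active threads are reused first and stay
     * cache-warm.
     */
    list_add(&thread->links, &hsk->waiting_threads);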
4. Problem: large messages have no pipelining. For example, copying bytes
from user space to output buffers is not overlapped with sending packets,
and copying bytes from buffers to user space doesn't start until the
entire message has been received.
* Tried overlapping packet transmission with packet creation (7/2019) but
this made performance worse, not better. Not sure why.
3. It is hard for the pacer to keep the uplink fully utilized, because it
gets descheduled for long periods of time.
* Tried disabling interrupts while the pacer is running, but this doesn't
work: if a packet gets sent with interrupts disabled, the interrupts get
reenabled someplace along the way, which can lead to deadlock. Also,
the VLAN driver uses "interrupts off" as a signal that it should enter
polling mode, which doesn't work.
* Tried calling homa_pacer_xmit from multiple places; this helps a bit
(5-10%).
* Tried making the pacer thread a high-priority real-time thread; this
actually made things a bit worse.
2. There can be a long lag in sending grants. One problem is that Linux
tries to collect large numbers of buffers before invoking the softirq
handler; this causes grants to be delayed. Implemented max_gro_skbs to
limit buffering. However, varying the parameter doesn't seem to affect
throughput (11/13/2019).
1. Without RPS enabled, Homa performance is limited by a single core handling
all softirq actions. In order for RPS to work well, Homa must implement
its own hash function for mapping packets to cores (the default IP hasher
doesn't know about Homa ports, so it considers only the peer IP address).
However, with RPS, packets can get spread out over too many cores, which
causes poor latency when there is a single client and the server is
underloaded.
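The hash just needs to mix Homa's port numbers into the flow hash that RPS
uses; a minimal sketch (homa_hash_sketch is not Homa's actual function):
    #include <linux/jhash.h>

    static inline u32 homa_hash_sketch(__be32 saddr, __be32 daddr,
                                       __u16 sport, __u16 dport)
    {
        /* Mix both addresses and both Homa ports so that packets from a
         * single peer can still spread across several cores.
         */
        return jhash_3words((__force u32)saddr, (__force u32)daddr,
                            ((u32)sport << 16) | dport, 0);
    }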