Final touches for publishing
gminn committed Aug 27, 2024
1 parent 467679a commit c28f069
Showing 2 changed files with 64 additions and 64 deletions.
2 changes: 1 addition & 1 deletion _data/authors.yml
@@ -272,6 +272,6 @@ mschulte:
blurb: is a firmware engineer trying to firmware
bertschiettecatte:
name: Bert Schiettecatte
image: /img/author/bertschiettecatte.jpg
image: /img/author/bertschiettecatte.png
linkedin: https://www.linkedin.com/in/bertschiettecatte/
blurb: is the founder of Noisetron LLC, a software & hardware engineering and expert witness consulting practice, and the founder of Percussa, a pro-audio consumer electronics company.
126 changes: 63 additions & 63 deletions _posts/2024-08-27-memory-debugging.md
@@ -1,5 +1,5 @@
---
title: How Memory Usage Patterns Can Derail Real-time Performance
title: Preventing Real-Time Disruptions — Debugging an Inefficient Memory Usage Pattern
description:
A story about memory usage causing real-time performance issues in an audio
system.
@@ -18,8 +18,6 @@ issues, ending with my solution and lessons learned.

<!-- excerpt end -->

Optional motivation to continue onwards

{% include newsletter.html %}

{% include toc.html %}
@@ -37,14 +35,14 @@ metal DSP platform and do not run an off-the-shelf operating system.
Subsequently, I developed the
[Percussa Super Signal Processor (SSP)](https://www.percussa.com/super-signal-processor/),
a multichannel audio DSP platform, which was funded via Kickstarter at 314% of its goal.
The SSP is used by professional studio musicians and sound designers world-wide
The SSP is used by professional studio musicians and sound designers worldwide
and is a eurorack module (eurorack is a modular synthesizer standard). It is
installed into end users’ synthesizer racks or in recording studios. It offers a
library of DSP building blocks which can be chained together to create various
signal processing chains, for synthesizing new sounds or processing existing
library of DSP building blocks that can be chained together to create various
signal-processing chains for synthesizing new sounds or processing existing
sounds.

The SSP is a Linux-based platform, based on a Rockchip SoC. In addition to
The SSP is a Linux-based platform based on a Rockchip SoC. In addition to
developing the hardware and bringing up the platform, I also developed the
real-time audio DSP software for the SSP. To deliver a great customer
experience, it was important to get consistent audio DSP performance from the
@@ -68,7 +66,7 @@ user interface code, menu systems, and code to parallelize DSP code across CPU
cores. My goal on my embedded Linux platform was to use the mainline kernel tree
to reduce maintenance costs.

## The problem
## Audio Issue Diagnosis

As I started testing and debugging my application, I kept running into issues
where the audio output was regularly interrupted, resulting in an audible click.
@@ -78,82 +76,83 @@ locks in my code, and whether the audio callback could be pre-empted by the
operating system, causing it to return too late and miss its real-time
“deadline.”

Eventually I decided to use a profiler to figure out what was going on and
Eventually, I decided to use a profiler to figure out what was going on and
turned to the excellent [Tracy profiler](https://github.com/wolfpld/tracy),
which is open source and free. Tracy is easy to integrate with your code,
allowing you to define different zones for tracing and profiling. The Tracy
profiling GUI application can connect over the network with your embedded
platform and receive trace information in real-time. This helps speed up
platform and receive trace information in real time. This helps speed up
development and debugging. Tracy also comes with a command-line capture tool you
can use to capture the trace directly to disk instead of sending it over the
network to your workstation.

I typically do not enable Tracy in release builds shipped to customers and only
use it when I’m trying to find a problem with the real-time behavior of my
software. I use compiler switches (`-D`) to enable Tracy, as described in the
Tracy manual, which also describes how to control the amount of data collected.
This is important, as you do not want to generate so much tracing data that it
starts to get in the way of determining the real problem at hand.

<p align="center">
<img width="100%" src="{% img_url memory-usage-in-audio-system/trace1.png %}" alt="Image of Tracy (trace 1)" />
</p>
[Tracy manual (PDF warning)](https://github.com/wolfpld/tracy/releases/latest/download/tracy.pdf),
which also describes how to control the amount of data collected. This is
important, as you do not want to generate so much tracing data that it starts to
get in the way of determining the real problem at hand.
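
As a rough illustration of what defining zones looks like, here is a minimal sketch. It assumes a C++17 compiler and that Tracy's header is reachable as `tracy/Tracy.hpp` (the path differs between Tracy versions). `audio_callback` and `apply_gain` are hypothetical stand-ins, not code from the SSP. `ZoneScopedN` is Tracy's named-zone macro, and `TRACY_ENABLE` is the define passed via `-D`.

```cpp
// Build with -DTRACY_ENABLE (and Tracy's client sources) to turn profiling on.
// Without the define, the zone macro below compiles to nothing.
#if defined(TRACY_ENABLE) && __has_include(<tracy/Tracy.hpp>)
#include <tracy/Tracy.hpp>
#else
#define ZoneScopedN(name)  // no-op stand-in when Tracy is not compiled in
#endif

#include <vector>

// Hypothetical DSP step: apply a gain to one block of samples.
void apply_gain(std::vector<float>& block, float gain) {
    ZoneScopedN("apply_gain");  // nested zone, shown beneath the caller's zone
    for (float& sample : block) sample *= gain;
}

// Hypothetical audio callback; the real one would be invoked by ALSA.
void audio_callback(std::vector<float>& block) {
    ZoneScopedN("audio_callback");  // outer zone covering the whole callback
    apply_gain(block, 0.5f);
}
```

Because the macros disappear when `TRACY_ENABLE` is not defined, the same source can be used for release builds without any profiling overhead.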

With the help of Tracy, I discovered that my audio callback was occasionally
taking too long to return. This happened often enough that it was a real
problem.

In the above screenshot, you can see an instance when the ALSA callback takes
much longer than normal. At the bottom of the screen, you can see the different
nested zones that I defined in my source code, with Tracy measuring the time
spent in each. The zones allow you to understand which areas of your code are
the most problematic and need review.
<p align="center">
<img width="100%" src="{% img_url memory-usage-in-audio-system/trace1.png %}" alt="Image of Tracy (trace 1)" />
</p>

What was interesting was that the issue always seemed to be connected to a
memory-related function call. Right above the zones at the bottom, you can see a
series of blue dots. When hovering over them, these dots showed the same
memory-related function repeatedly, indicating that the CPU was spending an
unusually long time in this function. Additionally the yellow graph at the
bottom (indicated by yellow arrows) shows memory usage constantly increasing or
decreasing, resulting in the zig-zag pattern being plotted.
At the top of the screen in Tracy, you can see which CPU cores the threads are
being scheduled on (note the four “lanes” in the above screenshot, since the
software is running on a quad-core CPU). At the bottom of the screen, you can
see the different nested zones that I defined in my source code, with Tracy
measuring the time spent in each. The zones allow you to understand which areas
of your code are the most problematic and need review.

You can see in the above screenshot an instance when the ALSA callback takes
much longer than normal. What was interesting was that the issue always seemed
to be connected to a memory-related function call. Right above the zones at the
bottom, you can see a series of blue dots. When hovering over them, these dots
showed the same memory-related function repeatedly, indicating that the CPU was
spending an unusually long time in this function. Additionally, the yellow graph
at the bottom shows memory usage constantly increasing or decreasing, resulting
in the zig-zag pattern being plotted.

However, I had paid close attention to the code in my audio callback and DSP
worker threads, and I was fairly confident I was not doing any kind of dynamic
worker threads and I was fairly confident I was not doing any kind of dynamic
memory allocation in my audio or DSP threads.

As I dug deeper, I noticed that my audio thread was being pre-empted. ALSA
support was being provided by the third-party framework. Eventually I started
examining the memory usage behavior of my application (and third-party
framework) and discovered that the application framework I was using had its own
As I dug deeper, I noticed that my audio thread was being pre-empted by kernel
threads.

<p align="center">
<img width="100%" src="{% img_url memory-usage-in-audio-system/trace2.png %}" alt="Image of Tracy (trace 2)" />
</p>

While I don't have an exact copy of the pre-emption from this instance, I have
an example screenshot above which is very similar. In this case, you can see
that the ALSA thread is being pre-empted with a memory-related function call
appearing at the bottom of the screen shortly before the scheduling of kernel
threads.

> Note: Tracy can sample Linux kernel call stacks in addition to your
> application call stacks to give an even deeper look at what the kernel is
> doing. Refer to the Tracy manual to understand the requirements for this.
Eventually, I started examining the memory usage behavior of my application and
discovered that the third-party application framework I was using had its own
memory management classes. Under the hood, `malloc()` was being called to
allocate blocks of very small sizes -- some as small as 4 bytes! These classes were
used all over the code base in the framework -- and thus in my application, which
was based on the framework. I had unwittingly invited dynamic memory allocation
in places due to the way the framework was architected.
allocate blocks of very small sizes -- some as small as 4 bytes! These classes
were used all over the code base in the framework -- and thus in my application,
which was based on the framework. I had unwittingly invited dynamic memory
allocation in places due to the way the framework was architected.
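
One way to catch hidden allocations like these early is to count heap allocations made while a thread is inside a section that is supposed to be allocation-free. The sketch below is an illustration, not what I used on the SSP: `g_in_realtime_section`, `g_realtime_allocs`, and `RealtimeScope` are hypothetical names. It only sees C++ `operator new`. Direct `malloc()` calls, like the ones inside the framework, would need a different hook, such as the GNU linker's `--wrap=malloc` option.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

// Set while the current thread is inside code that must not allocate.
thread_local bool g_in_realtime_section = false;
// Number of heap allocations observed inside such sections.
std::atomic<std::size_t> g_realtime_allocs{0};

// Replace the global operator new so C++ allocations pass through here.
void* operator new(std::size_t size) {
    if (g_in_realtime_section)
        g_realtime_allocs.fetch_add(1, std::memory_order_relaxed);
    if (void* p = std::malloc(size ? size : 1)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// RAII guard that marks the enclosing scope as real-time.
struct RealtimeScope {
    RealtimeScope() { g_in_realtime_section = true; }
    ~RealtimeScope() { g_in_realtime_section = false; }
};
```

Placing a `RealtimeScope` at the top of the audio callback in a debug build and checking the counter afterwards shows whether anything called from the callback allocated.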

These calls happened so frequently that they fragmented memory, which explained
why some memory-related function calls took so long to return. Fragmentation
can result in memory-related function calls taking much longer than normal
and/or the Linux kernel pre-empting your thread(s) while it performs its
cleanup.

Tracy supports sampling call stacks, and it can also sample Linux kernel call
stacks (refer to the Tracy manual to understand the requirements for this).

<p align="center">
<img width="100%" src="{% img_url memory-usage-in-audio-system/trace2.png %}" alt="Image of Tracy (trace 2)" />
</p>

At the top of the screen in Tracy, you can see which CPU cores the threads are
being scheduled on (note the four “lanes” in the above screenshot, since the
software is running on a quad-core CPU).

Tracy can also show when kernel threads are being scheduled, as shown in grey in
the screenshot above, which can help you understand if any of your threads are
being pre-empted. You can see that the ALSA thread is being pre-empted, with a
memory-related function call appearing at the bottom of the screen shortly
before this happens.

By analyzing the call stacks, combined with callbacks taking unusually long to
return, excessive time spent in memory-related functions, and the scheduling of
kernel threads, you can get an idea of whether your application’s memory usage
@@ -163,7 +162,7 @@ In my particular case, my application’s memory usage patterns caused my
callbacks and other code to take too long to return, resulting in missed
real-time deadlines and audio clicks.

## The solution
## Solution

To solve this problem, I implemented my own memory allocator and re-implemented
`malloc()` and friends to use my allocator. At a high level, my allocator
@@ -185,19 +184,20 @@ objects on startup, which are then re-used as much as possible.
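
The general idea of allocating up front and re-using can be sketched with a fixed-size block pool. This is an illustration under my own assumptions, not the allocator described in this post: `BlockPool` and its methods are hypothetical names, and the sketch is not thread-safe.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fixed-size block pool: all memory is obtained once, up front,
// and blocks are recycled through a free list, so acquire() and release()
// never touch the system heap.
class BlockPool {
public:
    BlockPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(block_size_ * block_count) {
        free_list_.reserve(block_count);
        for (std::size_t i = 0; i < block_count; ++i)
            free_list_.push_back(storage_.data() + i * block_size_);
    }

    // Returns a block, or nullptr when the pool is exhausted.
    void* acquire() {
        if (free_list_.empty()) return nullptr;
        void* p = free_list_.back();
        free_list_.pop_back();
        return p;
    }

    // Returns a block to the pool; capacity was reserved, so nothing reallocates.
    void release(void* p) {
        free_list_.push_back(static_cast<unsigned char*>(p));
    }

    std::size_t available() const { return free_list_.size(); }

private:
    // Round each block up to a multiple of the strictest fundamental alignment.
    static std::size_t round_up(std::size_t n) {
        constexpr std::size_t a = alignof(std::max_align_t);
        if (n == 0) n = 1;
        return (n + a - 1) / a * a;
    }

    std::size_t block_size_;
    std::vector<unsigned char> storage_;
    std::vector<unsigned char*> free_list_;
};
```

A pool like this trades some wasted memory for predictable timing: a 4-byte request still occupies a full block, but acquiring it is a constant-time pop from the free list.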

While continuing my efforts with profiling and tracing using Tracy, I also
learned the importance of being aware of the memory allocation behavior of the
tracing code (and third-party framework code!) that you integrate in your
tracing code (and third-party framework code!) that you integrate into your
application. Tracy has its own allocator, which is great, but one needs to be
mindful of the tracing features enabled and the volume of data being collected
to avoid affecting the real-time behavior of the application being traced.

## Conclusion

If you are working on a real-time application—whether it is an audio processing
application, a low-latency proprietary trading application, or anything else
that needs to have real-time or low-latency behavior—profile your application in
various ways and be aware of the memory usage patterns of your application. And
don't forget to examine the behavior under the hood of ALL third-party libraries
in your application (including tracing or profiling code or libraries). 🤓
If you are working on a real-time application -- whether it is an audio
processing application, a low-latency proprietary trading application, or
anything else that needs to have real-time or low-latency behavior -- profile your
application in various ways and be aware of the memory usage patterns of your
application. And don't forget to examine the behavior under the hood of ALL
third-party libraries in your application (including tracing or profiling code
or libraries). 🤓

<!-- Interrupt Keep START -->
