This benchmark measures ddisasm's disassembly accuracy (i.e., code inference). It defines a GitLab CI pipeline that generates Precision and Recall metrics for instruction recovery.
These setup instructions have been tested on Ubuntu 20.
To run the benchmark locally, a few Python packages are required. Install them with pip:
```
pip3 install .
```
The benchmark also requires some other dependencies:
- `binutils-arm-linux-gnueabihf`: utilities for ARM binaries (see the install command after this list)
- `ddisasm`: the benchmark expects to find `ddisasm` in `PATH`; whichever version is installed will be evaluated. See the installation instructions for `ddisasm` at https://grammatech.github.io/ddisasm/GENERAL/1-Installation.html
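For example, on Ubuntu the ARM binutils utilities can be installed with apt (assuming an apt-based system):

```
# Install binutils for ARM targets (objdump, readelf, etc.)
apt install binutils-arm-linux-gnueabihf
```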
The dataset is committed as a compressed tar file, so it must be extracted for local use. It is compressed with zstd, which can be installed on Ubuntu with `apt install zstd`.

On Ubuntu 20+, it can simply be extracted with:

```
tar -xf arm32-dataset.tar.zst
```

On Ubuntu 18, the necessary command is:

```
tar --use-compress-program=unzstd -xvf arm32-dataset.tar.zst
```
The compressed file `dataset-gt.zip` contains ground truth extracted from mapping symbols (using `--truth-source elf`) and extended to include ground-truth information from interworking ARM veneers (it was extended using `disasm_benchmark.adjust_gt.py`).

You can extract it with the following command:

```
unzip dataset-gt.zip
```
To run the entire ARM benchmark locally:
```
python3 -u -m disasm_benchmark.driver ./dataset/ | tee results.txt
```

```
Bin                           TP     FP   FN    Precision  Recall   Runtime
libstagefright_foundation.so  23211  73   139   0.99686    0.99405  6.673
libnl.so                      17788  135  138   0.99247    0.9923   5.033
libjni_jpegstream.so          93920  508  3421  0.99462    0.96486  35.41
...
```
- `tee` writes output to stdout and `results.txt`, ensuring the results are saved while also providing live status information.
- Passing the `-u` argument to the Python interpreter ensures output is flushed immediately, even when writing to a pipe.
The results for a single binary can also be analyzed. This outputs the address of each instruction for which an error occurred, followed by summary information:

```
python3 -m disasm_benchmark.driver ./dataset/android/daemon/bzip2
```

```
False positive addrs (Default):
0x6308
0x630c
0x6318
0x631c
...
False positive addrs (Thumb):
0x2e94
0x2f62
0x2f64
0x2fa6
0x2fa8
...
False negative addrs (Default):
0x13c0
0x4e84
0x4e88
...
False negative addrs (Thumb):
0x16a2
0x16bc
0x16be
...
True positive: 5380
False positive: 163
False negative: 122
Precision: 0.97059
Recall: 0.97783
```
By default, ground truth is collected from mapping symbols present in the ELF binary.
This can be changed by specifying `--truth-source` (a usage example follows the list):

- `elf`: collects ground truth from mapping symbols (only possible for ARM binaries).
- `yaml`: collects ground truth for a binary `[BINARY]` from a file `[BINARY].truth.yaml` located next to the binary (see creating baselines).
- `pdb`: collects ground truth for a binary `[BINARY]` from a PDB file `[BINARY].pdb` located next to the binary (only applicable to PE binaries). PDB files are analyzed with the `pdb-markers` application (see the pdb directory).
- `panginedb`: collects ground truth for a binary `[BINARY]` from a sqlite database `[BINARY].sqlite` located next to the binary. The format of the SQL database is the one defined at https://github.com/pangine/disasm-benchmark?tab=readme-ov-file#using-our-disassembly-ground-truths
- `sok`: collects ground truth by tracing the compilation process. Requires installing the optional dependencies with `pip install .[sok]`.
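For example, to collect ground truth from `.truth.yaml` files located next to each binary (the exact argument order is an assumption):

```
# Assumes [BINARY].truth.yaml files exist next to each binary in the dataset
python3 -u -m disasm_benchmark.driver ./dataset/ --truth-source yaml
```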
The command-line argument `--disasm` can be used to choose a disassembler: `ddisasm` (the default), `darm`, or various disassemblers supported by SOK (a usage example follows the list).

- `darm`: DARM; requires `pip3 install .[darm]`
- Disassemblers supported by SOK; require `pip3 install .[sok]`
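For instance, to evaluate DARM instead of ddisasm (assuming the `darm` extra is installed; the exact argument order is an assumption):

```
# Evaluate the DARM disassembler instead of the default ddisasm
python3 -u -m disasm_benchmark.driver ./dataset/ --disasm darm
```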
Detailed results in JSON format can be generated with the `--json` option.
Overall metrics can be generated with the `--metrics METRICS` option.
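For example, a run that also writes an overall metrics file might look like this; the metrics file name is arbitrary, and whether `--json` takes a file argument is an assumption:

```
# Speculative example: writes overall metrics to metrics.txt alongside detailed JSON results
python3 -u -m disasm_benchmark.driver ./dataset/ --metrics metrics.txt --json
```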
The `disasm_benchmark.driver` can optionally check against an expected set of metrics and fail if those metrics are not met, using `--expected-metrics` (the process will not fail if the actual metrics are better than expected). The format of the metrics file is as follows:

```
disasm_bench_precision 0.9
disasm_bench_recall 0.8
disasm_bench_tp 145461
disasm_bench_fp 0
disasm_bench_fn 23
disasm_bench_failures 0
```

The expected metrics file does not need to be complete. One can check against only some of the metrics; e.g., a file containing only:

```
disasm_bench_failures 0
```

will make the driver fail if there are benchmark failures.
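A CI-style invocation that enforces expected metrics could look like the following (the expected-metrics file name is arbitrary and the exact argument order is an assumption):

```
# Fails if the measured metrics are worse than those listed in expected.txt
python3 -u -m disasm_benchmark.driver ./dataset/ --expected-metrics expected.txt
```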
Ground truth files (`.truth.yaml`) can be created automatically for a dataset using `disasm_benchmark.baseline`:

```
python3 -m disasm_benchmark.baseline ./dataset/
```

This script also accepts a `--truth-source` option (see the example after this list):

- `elf`: create a yaml using ARM mapping symbols in the ELF file.
- `gtirb`: create a yaml using the current results of Ddisasm.
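For example, to generate baselines from Ddisasm's current results rather than from mapping symbols (the exact argument order is an assumption):

```
# Create [BINARY].truth.yaml files from Ddisasm's current output
python3 -m disasm_benchmark.baseline ./dataset/ --truth-source gtirb
```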
Below is an example of a ground truth file:
```
.plt:
- 0x400de0-0x400dec $a
- 0x400dec-0x400df0 $d
- 0x400df0-0x401040 $a
.plt.got:
- 0x401040-0x401046 $a
- 0x401046-0x401048 $d
.text:
- 0x401050-0x4010a7 $a
- 0x4010a7-0x4010b0 $d
- 0x4010b0-0x4010b2 $a
```
- Address ranges are grouped by section (corresponding to the binary sections).
- Within each section, address ranges are sorted.
- The marker at the end of each range specifies whether the range is:
  - `$a`: Code
  - `$t`: Thumb code
  - `$d`: Data
  - `$i`: Ignored
Triggering the benchmark from CI looks something like this:
```
trigger:
  stage: trigger
  variables:
    ARTIFACT_URL: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/jobs/${JOB_ID_DEBIAN_INSTALLER_UBUNTU20}/artifacts
  trigger:
    project: rewriting/disasm-benchmark
    branch: master
    strategy: depend

results:
  image: $DOCKER_REGISTRY/rewriting/disasm-benchmark
  stage: results
  needs:
    - trigger
  script:
    - curl --location --output artifacts.zip "${CI_API_V4_URL}/projects/rewriting%2Fdisasm-benchmark/jobs/artifacts/master/download?job=merge-metrics&job_token=$CI_JOB_TOKEN"
    - unzip artifacts.zip
  artifacts:
    reports:
      metrics: metrics.txt
```
The `trigger` job starts the pipeline in the disasm-benchmark repository and waits for it to complete, mirroring its success/failure status. After completion, the `results` job downloads the metrics artifact from that pipeline and re-uploads it as a metrics report in the source pipeline.

The `trigger` job passes the `PARENT_PIPELINE_ID` environment variable so that the benchmark can download the ddisasm package from the pipeline that triggered it.
The script `disasm_benchmark/annotate` is provided to annotate a GTIRB file with the results of evaluating against ground truth. Comments will be added for the different kinds of false positives, false negatives, and address ranges that were ignored with respect to ground truth. These comments can be viewed using gtirb-pprinter's `--listing=debug` mode.
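A hypothetical end-to-end flow might look like the following; the annotate script's actual arguments are not documented here, so the file name and invocation are placeholders:

```
# Hypothetical invocation; check the script's --help for the real arguments
python3 -m disasm_benchmark.annotate binary.gtirb
# View the annotations as comments in the debug listing
gtirb-pprinter --listing=debug binary.gtirb
```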
The ARM dataset is based on the paper "An Empirical Study on ARM Disassembly Tools"; however, no code is reused. The original paper and resources can be found at the links below: