ddisasm code inference benchmark

This benchmark measures ddisasm's accuracy of disassembly (i.e., code inference). It defines a GitLab CI pipeline that generates precision and recall metrics for instruction recovery.

Local Setup

These setup instructions have been tested on Ubuntu 20.

Python Dependencies

To run the benchmark locally, a few Python packages are required. Install them with pip:

pip3 install .

Other Dependencies

The benchmark also requires a few non-Python dependencies, described below.

ARM32 Dataset

The dataset is committed as a compressed tar file, so it must be extracted for local use. It is compressed with zstd, which can be installed on Ubuntu with apt install zstd.

On Ubuntu 20+, it can simply be extracted with:

tar -xf arm32-dataset.tar.zst

On Ubuntu 18, the necessary command is:

tar --use-compress-program=unzstd -xvf arm32-dataset.tar.zst

Ground truth

The compressed file dataset-gt.zip contains ground truth extracted from mapping symbols (using --truth-source elf) and extended with ground-truth information for interworking ARM veneers (the extension was done with disasm_benchmark.adjust_gt.py). You can extract it with the following command:

unzip dataset-gt.zip

Local Use

To run the entire ARM benchmark locally:

python3 -u -m disasm_benchmark.driver ./dataset/ | tee results.txt
                       Bin                             TP         FP         FN     Precision    Recall    Runtime
libstagefright_foundation.so                         23211        73        139      0.99686    0.99405     6.673
libnl.so                                             17788       135        138      0.99247     0.9923     5.033
libjni_jpegstream.so                                 93920       508        3421     0.99462    0.96486     35.41
...
  • tee writes output both to stdout and to results.txt, so the results are saved while live status information remains visible.
  • The -u argument to the Python interpreter ensures output is flushed immediately, even when writing to a pipe.

The results for a single binary can also be analyzed. This outputs the address of each instruction for which an error occurred, followed by summary information:

python3 -m disasm_benchmark.driver ./dataset/android/daemon/bzip2
False positive addrs (Default):
0x6308
0x630c
0x6318
0x631c
...
False positive addrs (Thumb):
0x2e94
0x2f62
0x2f64
0x2fa6
0x2fa8
...
False negative addrs (Default):
0x13c0
0x4e84
0x4e88
...
False negative addrs (Thumb):
0x16a2
0x16bc
0x16be
...
True positive:  5380
False positive: 163
False negative: 122
Precision: 0.97059
Recall:    0.97783
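
For reference, the reported precision and recall match the standard definitions applied to the counts above:

Precision = TP / (TP + FP) = 5380 / (5380 + 163) ≈ 0.97059
Recall    = TP / (TP + FN) = 5380 / (5380 + 122) ≈ 0.97783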

Ground truth sources

By default, ground truth is collected from mapping symbols present in the ELF binary. This can be changed by specifying --truth-source.

  • elf collects ground truth from mapping symbols (only possible for ARM binaries).
  • yaml collects ground truth for a binary [BINARY] from a file [BINARY].truth.yaml located next to the binary (see creating baselines).
  • pdb collects ground truth for a binary [BINARY] from a PDB file [BINARY].pdb located next to the binary (only applicable to PE binaries). PDB files are analyzed with the pdb-markers application (see pdb directory).
  • panginedb collects ground truth for a binary [BINARY] from a sqlite database [BINARY].sqlite located next to the binary. The format of the SQL database is the one defined in https://github.com/pangine/disasm-benchmark?tab=readme-ov-file#using-our-disassembly-ground-truth
  • sok collects ground truth by tracing the compilation process. This requires extra dependencies:
pip install .[sok]
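
For example, to run the driver against ground truth stored in .truth.yaml files rather than ELF mapping symbols, the invocation might look like this (a sketch; the exact option placement may differ):

python3 -m disasm_benchmark.driver ./dataset/ --truth-source yaml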

DISASM

The command-line argument --disasm can be used to choose a disassembler: ddisasm (the default), darm, or one of the disassemblers supported by SOK. The non-default choices require extra dependencies:

  • darm:
pip3 install .[darm]
  • disassemblers supported by SOK:
pip3 install .[sok]
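
As a sketch, selecting an alternative disassembler might look like this (option placement is an assumption):

python3 -m disasm_benchmark.driver ./dataset/ --disasm darm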

Reporting and checking expected results

Detailed results in JSON format can be generated with the --json option. Overall metrics can be generated with the --metrics METRICS option. The disasm_benchmark.driver can optionally check against an expected set of metrics with --expected-metrics and fail if those metrics are not met (the process will not fail if the actual metrics are better than expected). The format of the metrics file is as follows:

disasm_bench_precision 0.9
disasm_bench_recall 0.8
disasm_bench_tp 145461
disasm_bench_fp 0
disasm_bench_fn 23
disasm_bench_failures 0

The expected metrics file does not need to be complete. One can check against only some of the metrics, e.g.:

disasm_bench_failures 0

will make the driver fail if there are benchmark failures.
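
A sketch of combining these options in one run (file names are illustrative, and it is assumed here that --json and --metrics take output paths):

python3 -u -m disasm_benchmark.driver ./dataset/ --json results.json --metrics metrics.txt --expected-metrics expected-metrics.txt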

Creating baselines

Ground truth files (.truth.yaml) can be created automatically for a dataset using disasm_benchmark.baseline:

python3 -m disasm_benchmark.baseline ./dataset/

This script also accepts a --truth-source option:

  • elf creates a YAML file using the ARM mapping symbols in the ELF file.
  • gtirb creates a YAML file using the current results of Ddisasm.
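
For example, to regenerate baselines from Ddisasm's current results (a sketch; option placement is an assumption):

python3 -m disasm_benchmark.baseline ./dataset/ --truth-source gtirb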

Below is an example of a ground truth file:

.plt:
- 0x400de0-0x400dec $a
- 0x400dec-0x400df0 $d
- 0x400df0-0x401040 $a
.plt.got:
- 0x401040-0x401046 $a
- 0x401046-0x401048 $d
.text:
- 0x401050-0x4010a7 $a
- 0x4010a7-0x4010b0 $d
- 0x4010b0-0x4010b2 $a
  • Address ranges are grouped by section (corresponding to the binary's sections).
  • Within each section, address ranges are sorted.
  • The marker at the end of the range specifies whether the range is:
    • $a: Code
    • $t: Thumb code
    • $d: Data
    • $i: Ignored

CI

Triggering the benchmark from CI looks something like this:

trigger:
  stage: trigger
  variables:
    ARTIFACT_URL: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/jobs/${JOB_ID_DEBIAN_INSTALLER_UBUNTU20}/artifacts
  trigger:
    project: rewriting/disasm-benchmark
    branch: master
    strategy: depend

results:
  image: $DOCKER_REGISTRY/rewriting/disasm-benchmark
  stage: results
  needs:
    - trigger
  script:
    - curl --location --output artifacts.zip "${CI_API_V4_URL}/projects/rewriting%2Fdisasm-benchmark/jobs/artifacts/master/download?job=merge-metrics&job_token=$CI_JOB_TOKEN"
    - unzip artifacts.zip
  artifacts:
    reports:
      metrics: metrics.txt

The trigger job starts the pipeline in the disasm-benchmark repository and waits for it to complete, mirroring its success/failure status. After completion, the results job downloads the metrics artifact from the pipeline and re-uploads it as a metrics report in the source pipeline.

The trigger job passes the PARENT_PIPELINE_ID environment variable so that the benchmark can download the ddisasm package from the pipeline that triggered it.
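
A sketch of how the trigger job might forward that variable, assuming GitLab's predefined $CI_PIPELINE_ID is used as its value:

trigger:
  variables:
    PARENT_PIPELINE_ID: $CI_PIPELINE_ID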

Examining results

The script disasm_benchmark/annotate is provided to annotate a GTIRB file with the results of evaluating against ground truth.

Comments will be added for the different kinds of false positives, false negatives, and address ranges that were ignored with respect to ground truth. These comments can be seen using gtirb-pprinter's --listing=debug mode.
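
For example, assuming the annotated file is saved as annotated.gtirb (the file name is illustrative), the annotations can be viewed with:

gtirb-pprinter --listing=debug annotated.gtirb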

Acknowledgement

The ARM dataset is based on the paper "An Empirical Study on ARM Disassembly Tools"; however, no code from that work is reused.
