[Paper] [Checkpoint]
This repository hosts the Audio Difference Explanation datasets and the ADIFF checkpoints. ADIFF is an audio prefix tuning-based language model with a cross-projection module, trained in a three-step process. ADIFF takes two audio recordings and a text prompt as input and produces difference explanations at different tiers of detail. This involves identifying and describing audio events, acoustic scenes, signal characteristics, and their emotional impact on listeners.
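For intuition, below is a minimal PyTorch sketch of how an audio prefix with a cross-projection step might be wired. All module names, dimensions, and the attention wiring here are illustrative assumptions, not the released ADIFF implementation; refer to the code in this repository for the actual architecture.

```python
# Hypothetical sketch: two pooled audio embeddings are mapped to prefix
# tokens for the language model, and a cross-attention ("cross-projection")
# step lets each prefix attend to the other audio before both are prepended
# to the text prompt embeddings.
import torch
import torch.nn as nn

class CrossProjectionPrefix(nn.Module):
    def __init__(self, audio_dim=768, lm_dim=1024, prefix_len=8, n_heads=8):
        super().__init__()
        # Map a pooled audio feature to a sequence of prefix embeddings.
        self.to_prefix = nn.Linear(audio_dim, lm_dim * prefix_len)
        # Cross-attention between the two audio prefixes.
        self.cross_attn = nn.MultiheadAttention(lm_dim, n_heads, batch_first=True)
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim

    def forward(self, emb1, emb2):
        # emb1, emb2: (batch, audio_dim) pooled audio-encoder outputs.
        p1 = self.to_prefix(emb1).view(-1, self.prefix_len, self.lm_dim)
        p2 = self.to_prefix(emb2).view(-1, self.prefix_len, self.lm_dim)
        # Each prefix queries the other audio's prefix, so the language model
        # receives difference-aware tokens rather than two independent prefixes.
        c1, _ = self.cross_attn(p1, p2, p2)
        c2, _ = self.cross_attn(p2, p1, p1)
        # Concatenated prefix to prepend to the text prompt embeddings.
        return torch.cat([c1, c2], dim=1)
```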
- Install the required dependencies:

  ```shell
  pip install -r requirements.txt
  ```

  For conda, run the following:

  ```shell
  cd adiff && \
  conda create -n adiff python=3.8 && \
  conda activate adiff && \
  pip install -r requirements.txt
  ```
- Download the ADIFF weights: Pretrained Model [Zenodo]
- Move `adiff_base.pth` under the `config` folder
The wrapper class allows easy interaction with the model. To use the wrapper, the required inputs are:

- `config`: The only supported option is "base"
- `model_path`: Choose between `adiff_base.ckpt` and `adiff_base_wavcaps.ckpt`. The second checkpoint is trained on WavCaps difference data in addition to ACD and CLD, can detect similarities between two audios, and covers a wider range of concepts.
- `examples`: List of examples. Each example is a list containing three entries: `audiopath1`, `audiopath2`, `prompt`

Supported functions:

- `generate`: Produces a text response for the given audio inputs and text prompt
For example, to generate the three tiers of difference explanations:

```python
from wrapper import ADIFF

adiff = ADIFF(config="<choice of config>", model_path="<model weights>")

examples = [
    ["<path1>", "<path2>", "explain the difference between the two audio in detail"],
    ["<path1>", "<path2>", "explain the difference between the two audio in one extended sentence"],
    ["<path1>", "<path2>", "explain the difference between the two audio in few words"],
]

response = adiff.generate(examples=examples, max_len=300, temperature=1.0)
```
The same wrapper can also caption each audio individually or both together:

```python
from wrapper import ADIFF

adiff = ADIFF(config="<choice of config>", model_path="<model weights>")

examples = [
    ["<path1>", "<path2>", "caption the first audio"],
    ["<path1>", "<path2>", "caption the second audio"],
    ["<path1>", "<path2>", "caption both the audios"],
]

response = adiff.generate(examples=examples, max_len=300, temperature=1.0)
```
The ACD and CLD datasets source their audio files from the AudioCaps and Clotho datasets, respectively. The .csv files for the three tiers of difference annotations are located under the `data` folder:
```
├── ...
├── data
│   ├── ACD    # AudioCaps Difference Explanation
│   │   ├── acd_test_adiff_fewwords_answer.csv
│   │   ├── acd_test_adiff_sentence_answer.csv
│   │   ├── acd_test_adiff_detail_answer.csv
│   │   └── ...
│   └── CLD    # Clotho Difference Explanation
│       ├── cld_evaluation_adiff_fewwords_answer.csv
│       ├── cld_evaluation_adiff_sentence_answer.csv
│       ├── cld_evaluation_adiff_detail_answer.csv
│       └── ...
└── ...
```
The audio files can be downloaded from their respective hosting websites: Clotho and AudioCaps.
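A quick way to inspect one of the annotation files is with pandas. The file path below matches the tree above, but the column names are not documented here, so print the schema first rather than assuming it:

```python
# Load one tier of ACD annotations and inspect its structure.
import pandas as pd

df = pd.read_csv("data/ACD/acd_test_adiff_detail_answer.csv")
print(df.columns.tolist())  # check the actual schema before indexing columns
print(df.head())
```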
Please create a PR to add any model to the leaderboard.
| Model | Decoding | CLD-1 (SPICE) | CLD-2 (SPICE) | CLD-3 (SPICE) | ACD-1 (SPICE) | ACD-2 (SPICE) | ACD-3 (SPICE) |
|---|---|---|---|---|---|---|---|
| ADIFF | Greedy | 11.85 | 23.15 | 16.67 | 12.68 | 22.16 | 17.07 |
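To score a candidate model for the leaderboard, SPICE can be computed with the `pycocoevalcap` package (which requires Java). The helper function and toy data below are illustrative only; how you key your references and hypotheses is up to your evaluation script:

```python
# Sketch: corpus-level SPICE over generated explanations vs. references.
# pycocoevalcap's Spice.compute_score(gts, res) expects dicts mapping
# example ids to lists of strings, with exactly one hypothesis per id.
from pycocoevalcap.spice.spice import Spice

def spice_score(references, hypotheses):
    # references: {example_id: [ref_text, ...]}
    # hypotheses: {example_id: [hyp_text]}
    scorer = Spice()
    score, _ = scorer.compute_score(references, hypotheses)
    return score  # multiply by 100 to match the scale used in the table

refs = {"0": ["the first audio adds a dog barking over rain"]}
hyps = {"0": ["a dog barks in the first audio while rain continues"]}
print(spice_score(refs, hyps))
```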
```bibtex
@inproceedings{anonymous2025adiff,
    title={{ADIFF}: Explaining audio difference using natural language},
    author={Anonymous},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=l4fMj4Vnly}
}
```