[Paper] [Checkpoint]
This repository hosts the Audio Difference Explanation datasets and the ADIFF checkpoints. ADIFF is an audio prefix tuning-based language model with a cross-projection module, trained in a three-step process. ADIFF takes two audio recordings and a text prompt as input and produces difference explanations at different tiers of detail. This involves identifying and describing audio events, acoustic scenes, signal characteristics, and their emotional impact on listeners.
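For intuition, below is a minimal PyTorch sketch of how an audio prefix with a cross-projection step might be wired. All module names, dimensions, and the attention wiring here are illustrative assumptions, not the released ADIFF implementation; refer to the code in this repository for the actual architecture.

```python
# Hypothetical sketch: two pooled audio embeddings are mapped to prefix
# tokens for the language model, and a cross-attention ("cross-projection")
# step lets each prefix attend to the other audio before both are prepended
# to the text prompt embeddings.
import torch
import torch.nn as nn

class CrossProjectionPrefix(nn.Module):
    def __init__(self, audio_dim=768, lm_dim=1024, prefix_len=8, n_heads=8):
        super().__init__()
        # Map a pooled audio feature to a sequence of prefix embeddings.
        self.to_prefix = nn.Linear(audio_dim, lm_dim * prefix_len)
        # Cross-attention between the two audio prefixes.
        self.cross_attn = nn.MultiheadAttention(lm_dim, n_heads, batch_first=True)
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim

    def forward(self, emb1, emb2):
        # emb1, emb2: (batch, audio_dim) pooled audio-encoder outputs.
        p1 = self.to_prefix(emb1).view(-1, self.prefix_len, self.lm_dim)
        p2 = self.to_prefix(emb2).view(-1, self.prefix_len, self.lm_dim)
        # Each prefix queries the other audio's prefix, so the language model
        # receives difference-aware tokens rather than two independent prefixes.
        c1, _ = self.cross_attn(p1, p2, p2)
        c2, _ = self.cross_attn(p2, p1, p1)
        # Concatenated prefix to prepend to the text prompt embeddings.
        return torch.cat([c1, c2], dim=1)
```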
- Install the required dependencies:

  ```shell
  pip install -r requirements.txt
  ```

  For conda, run the following:

  ```shell
  cd adiff && \
  conda create -n adiff python=3.8 && \
  conda activate adiff && \
  pip install -r requirements.txt
  ```
- Download the ADIFF weights: Pretrained Model [Zenodo]
- Move `adiff_base.pth` under the `config` folder
The wrapper class allows easy interaction with the model. To use the wrapper, the required inputs are:

- `config`: The only supported option is "base"
- `model_path`: Choose between `adiff_base.ckpt` and `adiff_base_wavcaps.ckpt`. The second checkpoint is trained on WavCaps difference data in addition to ACD and CLD, can detect similarities between two audios, and covers a wider range of concepts.
- `examples`: List of examples. Each example is a list containing three entries: `audiopath1`, `audiopath2`, `prompt`

Supported functions:

- `generate`: Produces a text response for the given audio inputs and text prompt
For example, to generate the three tiers of difference explanations:

```python
from wrapper import ADIFF

adiff = ADIFF(config="<choice of config>", model_path="<model weights>")

examples = [
    ["<path1>", "<path2>", "explain the difference between the two audio in detail"],
    ["<path1>", "<path2>", "explain the difference between the two audio in one extended sentence"],
    ["<path1>", "<path2>", "explain the difference between the two audio in few words"],
]

response = adiff.generate(examples=examples, max_len=300, temperature=1.0)
```
The same wrapper can also caption each audio individually or both together:

```python
from wrapper import ADIFF

adiff = ADIFF(config="<choice of config>", model_path="<model weights>")

examples = [
    ["<path1>", "<path2>", "caption the first audio"],
    ["<path1>", "<path2>", "caption the second audio"],
    ["<path1>", "<path2>", "caption both the audios"],
]

response = adiff.generate(examples=examples, max_len=300, temperature=1.0)
```
The ACD and CLD datasets source their audio files from the AudioCaps and Clotho datasets, respectively. The .csv files for the three tiers of difference annotations are located under the `data` folder:
```
├── ...
├── data
│   ├── ACD    # AudioCaps Difference Explanation
│   │   ├── acd_test_adiff_fewwords_answer.csv
│   │   ├── acd_test_adiff_sentence_answer.csv
│   │   ├── acd_test_adiff_detail_answer.csv
│   │   └── ...
│   └── CLD    # Clotho Difference Explanation
│       ├── cld_evaluation_adiff_fewwords_answer.csv
│       ├── cld_evaluation_adiff_sentence_answer.csv
│       ├── cld_evaluation_adiff_detail_answer.csv
│       └── ...
└── ...
```
The audio files can be downloaded from their respective hosting websites: Clotho and AudioCaps.
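A quick way to inspect one of the annotation files is with pandas. The file path below matches the tree above, but the column names are not documented here, so print the schema first rather than assuming it:

```python
# Load one tier of ACD annotations and inspect its structure.
import pandas as pd

df = pd.read_csv("data/ACD/acd_test_adiff_detail_answer.csv")
print(df.columns.tolist())  # check the actual schema before indexing columns
print(df.head())
```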
Please create a PR to add any model to the leaderboard.
| Model | Decoding | CLD-1 (SPICE) | CLD-2 (SPICE) | CLD-3 (SPICE) | ACD-1 (SPICE) | ACD-2 (SPICE) | ACD-3 (SPICE) |
|---|---|---|---|---|---|---|---|
| ADIFF | Greedy | 11.85 | 23.15 | 16.67 | 12.68 | 22.16 | 17.07 |
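To score a candidate model for the leaderboard, SPICE can be computed with the `pycocoevalcap` package (which requires Java). The helper function and toy data below are illustrative only; how you key your references and hypotheses is up to your evaluation script:

```python
# Sketch: corpus-level SPICE over generated explanations vs. references.
# pycocoevalcap's Spice.compute_score(gts, res) expects dicts mapping
# example ids to lists of strings, with exactly one hypothesis per id.
from pycocoevalcap.spice.spice import Spice

def spice_score(references, hypotheses):
    # references: {example_id: [ref_text, ...]}
    # hypotheses: {example_id: [hyp_text]}
    scorer = Spice()
    score, _ = scorer.compute_score(references, hypotheses)
    return score  # multiply by 100 to match the scale used in the table

refs = {"0": ["the first audio adds a dog barking over rain"]}
hyps = {"0": ["a dog barks in the first audio while rain continues"]}
print(spice_score(refs, hyps))
```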
```bibtex
@inproceedings{anonymous2025adiff,
    title={{ADIFF}: Explaining audio difference using natural language},
    author={Anonymous},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=l4fMj4Vnly}
}
```