
Poor performance of AST on audio clips with different lengths #60

Open · Yuanbo2020 opened this issue Apr 12, 2022 · 3 comments
Labels: question (Further information is requested)

@Yuanbo2020

Hi there,

I want to use the pre-trained AST you provided for audio tagging on one-second audio clips. I followed your feature extraction method and padded each clip to 1024 frames as you describe:

feats = make_features(audio_path, mel_bins=128) # shape(1024, 128)
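For reference, this is roughly what that helper does; a sketch based on the repo's inference example, where the Kaldi fbank settings and the AudioSet normalization constants are assumptions to check against your copy:

import torch
import torchaudio

def make_features(wav_name, mel_bins, target_length=1024):
    waveform, sr = torchaudio.load(wav_name)
    # 128-bin log-mel filterbank with a 10 ms frame shift, as in the AST recipes
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=mel_bins, dither=0.0, frame_shift=10)
    p = target_length - fbank.shape[0]
    if p > 0:
        # zero-pad short clips at the end up to target_length frames
        fbank = torch.nn.ZeroPad2d((0, 0, 0, p))(fbank)
    elif p < 0:
        # truncate long clips
        fbank = fbank[0:target_length, :]
    # normalize with the AudioSet mean/std used for the pretrained model
    return (fbank - (-4.2677393)) / (4.5689974 * 2)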

Unfortunately, AST predicts poorly, with obvious misclassifications such as recognizing multiple birdsongs as music.

At the same time, I used the pretrained CNN-based PANNs (which I guess you are familiar with) to predict these short audio clips, and the PANNs results are much more accurate than those from AST.

Do you have any suggestions for using AST to predict audio events in one-second clips?

The audio clips I want to predict are here: https://urban-soundscapes.s3.eu-central-1.wasabisys.com/soundscapes/index.html
If you are interested, I am happy to share the results I got with AST and PANNs, and I would like to discuss them further.

Best,
Yuanbo

@YuanGongND (Owner)

Hi Yuanbo,

  1. Instead of padding the audio to 1024 frames, it might be worthwhile to instantiate the AST model with t_dim=100 for your 1-second audio.

  2. If you have some training data, you could try fine-tuning the AST model.

  3. When you say

Unfortunately, AST predicts poorly, with obvious misclassifications such as recognizing multiple birdsongs as music.

Do you mean you take the class with the largest logit as the prediction? The AST model is trained with BCE loss, so the output logits are not normalized across classes.
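To illustrate, a minimal sketch of reading BCE-trained outputs (model and feats here are placeholders for the loaded AST model and the features above, not names from this thread):

import torch

with torch.no_grad():
    logits = model(feats.unsqueeze(0))  # (1, 527), one raw logit per AudioSet class

# BCE training makes each class an independent binary decision,
# so apply a per-class sigmoid rather than a softmax over all classes
probs = torch.sigmoid(logits)[0]
top_probs, top_idx = torch.topk(probs, k=5)  # top-5 classes by probability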

-Yuan

@Yuanbo2020 (Author)

Hi Yuan,

Thank you so much for replying.

I am a little confused. Could you please tell me which trained AST model can accept t_dim=100?

When loading the parameters of the trained AST, pos_embed has shape torch.Size([1, 1214, 768]); if t_dim is set to 100, the corresponding parameter has shape torch.Size([1, 110, 768]), which is an obvious mismatch.
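(For the record, those shapes are consistent with AST's patching: a 128×1024 input gives 12 frequency × 101 time = 1212 patches plus the two prepended tokens = 1214, while t_dim=100 gives 12 × 9 + 2 = 110.)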

Thanks again!

Yuanbo

@YuanGongND (Owner)

The trick is that we trim or interpolate the positional embedding:

new_pos_embed = self.v.pos_embed[:, 2:, :].detach().reshape(1, 1212, 768).transpose(1, 2).reshape(1, 768, 12, 101)
# if the input sequence is shorter than the original AudioSet input (10s -> 101 time patches), cut the positional embedding
if t_dim < 101:
    new_pos_embed = new_pos_embed[:, :, :, 50 - int(t_dim/2): 50 - int(t_dim/2) + t_dim]
# otherwise, interpolate it to the new length
else:
    new_pos_embed = torch.nn.functional.interpolate(new_pos_embed, size=(12, t_dim), mode='bilinear')

To use the AudioSet pretrained model, you just need to specify the t_dim when you initialize the AST model. It is not recommended to do torch.load yourself; otherwise you will also need to handle the positional embedding trimming yourself.
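For example, a minimal sketch; it assumes the ASTModel class and the input_fdim/input_tdim/audioset_pretrain arguments from this repo's src/models, so double-check the signature in your checkout:

import torch
from src.models import ASTModel

# ~100 frames for 1 second of audio at a 10 ms frame shift, so input_tdim=100;
# audioset_pretrain=True loads the AudioSet checkpoint and adapts the
# positional embedding to the new time dimension internally
model = ASTModel(label_dim=527, input_fdim=128, input_tdim=100,
                 imagenet_pretrain=True, audioset_pretrain=True)
model.eval()

feats = torch.randn(1, 100, 128)  # (batch, time, mel bins) placeholder features
with torch.no_grad():
    logits = model(feats)         # (1, 527) AudioSet logits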

In our ESC-50 recipe, we show an example of fine-tuning an AST model pretrained on 10s audio with 5s audio.

-Yuan

YuanGongND added the question label on Apr 14, 2022