
Poor performance of AST on audio clips with different lengths #60

Open · Yuanbo2020 opened this issue Apr 12, 2022 · 3 comments
Labels: question (Further information is requested)

@Yuanbo2020

Hi there,

I want to use the pre-trained AST you provided for audio tagging on one-second audio clips. I followed your feature extraction method and padded each clip to 1024 frames as you describe:

feats = make_features(audio_path, mel_bins=128) # shape(1024, 128)
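For reference, this is roughly what that helper does; a sketch based on the repo's inference example, where the Kaldi fbank settings and the AudioSet normalization constants are assumptions to check against your copy:

import torch
import torchaudio

def make_features(wav_name, mel_bins, target_length=1024):
    waveform, sr = torchaudio.load(wav_name)
    # 128-bin log-mel filterbank with a 10 ms frame shift, as in the AST recipes
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
        window_type='hanning', num_mel_bins=mel_bins, dither=0.0, frame_shift=10)
    p = target_length - fbank.shape[0]
    if p > 0:
        # zero-pad short clips at the end up to target_length frames
        fbank = torch.nn.ZeroPad2d((0, 0, 0, p))(fbank)
    elif p < 0:
        # truncate long clips
        fbank = fbank[0:target_length, :]
    # normalize with the AudioSet mean/std used for the pretrained model
    return (fbank - (-4.2677393)) / (4.5689974 * 2)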

Unfortunately, AST predicts poorly, with obvious misclassifications such as recognizing multiple birdsongs as music.

At the same time, I used the pretrained CNN-based PANNs (which I guess you are familiar with) to predict these short audio clips, and the PANNs results are much more accurate than those from AST.

Do you have any suggestions for using AST to predict audio events in one-second clips?

The audio clips I want to predict are here: https://urban-soundscapes.s3.eu-central-1.wasabisys.com/soundscapes/index.html
If you are interested, I am happy to share the results I got with AST and PANNs, and I would like to discuss them further.

Best,
Yuanbo

@YuanGongND (Owner)

Hi Yuanbo,

  1. Instead of padding the audio to 1024 frames, it might be worthwhile to instantiate the AST model with t_dim=100 for your 1-second audio.

  2. If you have some training data, you could try fine-tuning the AST model.

  3. When you say

Unfortunately, AST predicts poorly, with obvious misclassifications such as recognizing multiple birdsongs as music.

Do you mean you take the class with the largest logit as the prediction? The AST model is trained with BCE loss, so the output logits are not normalized across classes.
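To illustrate, a minimal sketch of reading BCE-trained outputs (model and feats here are placeholders for the loaded AST model and the features above, not names from this thread):

import torch

with torch.no_grad():
    logits = model(feats.unsqueeze(0))  # (1, 527), one raw logit per AudioSet class

# BCE training makes each class an independent binary decision,
# so apply a per-class sigmoid rather than a softmax over all classes
probs = torch.sigmoid(logits)[0]
top_probs, top_idx = torch.topk(probs, k=5)  # top-5 classes by probability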

-Yuan

@Yuanbo2020 (Author)

Hi Yuan,

Thank you so much for replying.

I am a little confused. Could you please tell me which trained AST model can accept t_dim=100?

When loading the parameters of the trained AST, pos_embed has shape torch.Size([1, 1214, 768]); if t_dim is set to 100, the corresponding parameter has shape torch.Size([1, 110, 768]), which is an obvious mismatch.
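(For the record, those shapes are consistent with AST's patching: a 128×1024 input gives 12 frequency × 101 time = 1212 patches plus the two prepended tokens = 1214, while t_dim=100 gives 12 × 9 + 2 = 110.)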

Thanks again!

Yuanbo

@YuanGongND (Owner)

The trick is that we trim or interpolate the positional embedding:

new_pos_embed = self.v.pos_embed[:, 2:, :].detach().reshape(1, 1212, 768).transpose(1, 2).reshape(1, 768, 12, 101)
# if the input sequence is shorter than the original AudioSet input (10s -> 101 time patches), cut the positional embedding
if t_dim < 101:
    new_pos_embed = new_pos_embed[:, :, :, 50 - int(t_dim/2): 50 - int(t_dim/2) + t_dim]
# otherwise, interpolate it to the new length
else:
    new_pos_embed = torch.nn.functional.interpolate(new_pos_embed, size=(12, t_dim), mode='bilinear')

To use the AudioSet pretrained model, you just need to specify the t_dim when you initialize the AST model. It is not recommended to do torch.load yourself; otherwise you will also need to handle the positional embedding trimming yourself.
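For example, a minimal sketch; it assumes the ASTModel class and the input_fdim/input_tdim/audioset_pretrain arguments from this repo's src/models, so double-check the signature in your checkout:

import torch
from src.models import ASTModel

# ~100 frames for 1 second of audio at a 10 ms frame shift, so input_tdim=100;
# audioset_pretrain=True loads the AudioSet checkpoint and adapts the
# positional embedding to the new time dimension internally
model = ASTModel(label_dim=527, input_fdim=128, input_tdim=100,
                 imagenet_pretrain=True, audioset_pretrain=True)
model.eval()

feats = torch.randn(1, 100, 128)  # (batch, time, mel bins) placeholder features
with torch.no_grad():
    logits = model(feats)         # (1, 527) AudioSet logits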

In our ESC-50 recipe, we show an example of fine-tuning an AST model pretrained on 10s audio with 5s audio.

-Yuan

YuanGongND added the question label on Apr 14, 2022