Poor performance of AST on audio clips with different lengths #60
Hi Yuanbo,
Do you mean you take the class with the largest logit as the prediction? The AST model is trained with BCE loss, so the output logits are not normalized across classes. -Yuan
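To illustrate the point about BCE training, here is a minimal sketch of reading the outputs as independent per-class scores rather than taking a single argmax. The label names, logit values, and the 0.5 threshold are hypothetical and chosen only for illustration; they are not taken from the AST repository.

```python
import math


def sigmoid(x: float) -> float:
    """Map a single logit to an independent per-class probability."""
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_predictions(logits, labels, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score passes the
    threshold, sorted from most to least confident."""
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical labels and logits, for illustration only.
labels = ["Music", "Bird", "Speech"]
logits = [0.4, 0.2, -3.0]

probs = [sigmoid(z) for z in logits]
preds = multilabel_predictions(logits, labels)

# With BCE-style outputs the probabilities need not sum to 1, and several
# classes can be active at once.
print(preds)
print(sum(probs))
```

In this toy case an argmax would report only "Music", while the per-class view also keeps "Bird" as a plausible label.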
Hi Yuan, Thank you so much for replying. I am a little confused: could you please tell me which trained AST model can accept t_dim=100? When loading the parameters of the trained AST, the shape of pos_embed is torch.Size([1, 1214, 768]), but if t_dim is set to 100, the corresponding parameter has shape torch.Size([1, 110, 768]), which obviously does not match. Thanks again! Yuanbo
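The two shapes quoted above can be reproduced with a little arithmetic. The sketch below assumes 128 mel bins, 16x16 patches with a stride of 10 in both directions, and two extra tokens prepended to the patch sequence; these values are assumptions stated for illustration, and the helper names are hypothetical.

```python
def num_patches(size: int, patch: int = 16, stride: int = 10) -> int:
    """Number of patches along one axis for an unpadded strided split."""
    return (size - patch) // stride + 1


def pos_embed_len(f_bins: int, t_frames: int, extra_tokens: int = 2) -> int:
    """Sequence length of the positional embedding: patch grid plus extra tokens."""
    return num_patches(f_bins) * num_patches(t_frames) + extra_tokens


# Assumed settings: 128 mel bins; 1024 frames for the pretrained model,
# 100 frames for the short-input model.
print(pos_embed_len(128, 1024))  # 12 * 101 + 2
print(pos_embed_len(128, 100))   # 12 * 9 + 2
```

Under these assumptions the pretrained embedding covers a 12 x 101 patch grid, while the t_dim=100 model expects a 12 x 9 grid, which is why the checkpoint cannot be loaded directly.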
The trick is that we trim or interpolate the positional embedding (lines 141 to 147 in 7b2fe70).
To use the AudioSet pretrained model, you just need to specify the … In our ESC-50 recipe, we show an example of fine-tuning an AST model pretrained on 10s audio with 5s audio. -Yuan
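As a rough illustration of the trimming case, the sketch below computes which positions of a longer positional embedding would be kept when the time axis is cut down to its central part. It is not the repository's code: the function name is hypothetical, and it assumes a 12 x 101 patch grid flattened frequency-major (all time steps of one frequency row, then the next row) with two extra tokens at the front. When the target is longer than the source, interpolation would be needed instead, which this sketch does not cover.

```python
def center_trim_indices(f_dim: int, src_t: int, dst_t: int, extra_tokens: int = 2):
    """Indices of a flattened (extra tokens + f_dim * src_t) sequence that
    survive when each frequency row keeps only its central dst_t time steps."""
    if dst_t > src_t:
        raise ValueError("target is longer than source; interpolate instead")
    start = src_t // 2 - dst_t // 2
    keep = list(range(extra_tokens))  # the extra tokens are kept unchanged
    for f in range(f_dim):
        row_offset = extra_tokens + f * src_t
        keep.extend(range(row_offset + start, row_offset + start + dst_t))
    return keep


# Assumed grids: 12 x 101 for the pretrained model, 12 x 9 for the short model.
indices = center_trim_indices(f_dim=12, src_t=101, dst_t=9)
print(len(indices))
```

Selecting these rows from a [1, 1214, 768] embedding would give a [1, 110, 768] tensor under the stated assumptions.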
Hi there,
I want to use the pre-trained AST you provided for audio tagging on one-second audio clips. I followed your feature extraction method and padded the features to 1024 frames, following the method you provided:
ast/egs/audioset/inference.py
Line 76 in 70c675e
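For reference, a minimal sketch of this kind of padding, assuming the features are a list of 128-bin frames and that missing frames are filled with zeros at the end. The function name and the 98-frame clip length (roughly one second at an assumed 10 ms frame shift) are hypothetical, not taken from the repository.

```python
def pad_or_truncate(frames, target_len: int = 1024, n_bins: int = 128):
    """Zero-pad at the end, or cut off the tail, so that exactly
    target_len frames are returned."""
    if len(frames) >= target_len:
        return frames[:target_len]
    padding = [[0.0] * n_bins for _ in range(target_len - len(frames))]
    return frames + padding


# Hypothetical one-second clip: about 98 frames at an assumed 10 ms frame shift.
clip = [[1.0] * 128 for _ in range(98)]
padded = pad_or_truncate(clip)
padded_fraction = (len(padded) - len(clip)) / len(padded)
print(len(padded), round(padded_fraction, 3))
```

Under these assumptions, about 90% of the padded input consists of zero frames.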
Unfortunately, the AST predicts badly, and there are obvious misclassifications, such as recognizing multiple birdsongs as music, etc.
At the same time, I used the pretrained CNN-based PANNs, which I guess you are familiar with, to predict these short audio clips, and the PANNs results turned out to be much more accurate than those predicted by AST.
Do you have any suggestions for using AST to predict audio events in one-second clips?
The audio clips I want to predict are here: https://urban-soundscapes.s3.eu-central-1.wasabisys.com/soundscapes/index.html
If you are interested, I am happy to share with you the results I predicted with AST and PANN respectively, and I hope to discuss them further.
Best,
Yuanbo