Where did the features of the datasets come from? #39
Comments
I read the paper again and noticed that you use video features extracted from an I3D model. Could you please tell me which extractor you used: the two-stream I3D pretrained on Kinetics, or the one pretrained on miniKinetics?
You can check this issue; it may have the answer you want. #34
Hi~ thanks for your reply. I've skimmed your suggestion and noticed that one of the responders posted a repo for a feature extractor. Have you tried it before? If I use a video shot by myself, can I obtain a feature with shape (2048, n)? If not, that's okay, thanks a lot anyway~
Hi. To be honest, I've just recorded the data and am going to try this feature extractor. I'm also concerned about whether the feature extractor repo still works, because the last commit was about 5 years ago (cry). I hope we can keep in touch and compare notes once we have some results.
Fighting~
Wow, I'm also facing this problem now. If either of you has a breakthrough, please come back and share it! Thank you very much!
Hi! Glad to hear that you are trying to do the same thing as me. I just want to share my latest progress with you. I have tried a new repo that also has a function for extracting features. The only difference is that its output dim is [n, 768], so you need to transpose it to satisfy ASFormer's input requirement. If you want to use this method, see ttlmh/Bridge-Prompt#3 for more detail. Hope this helps!
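For anyone following along, here is a minimal sketch of that transpose step (the file name is a placeholder, not a file from the dataset), turning (n, 768) features into the (feature_dim, T) layout the segmentation models load:

```python
import numpy as np

# Placeholder file name: an (n_frames, 768) feature matrix from the extractor
feat = np.load("video_0001.npy")

# Transpose to (768, n_frames) so it matches the (feature_dim, T) .npy layout
feat_t = feat.T.astype(np.float32)
np.save("video_0001_transposed.npy", feat_t)
```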
By the way, how did you transform your output? For example by adding a fully connected layer? But doesn't that mean the model needs to be trained again?
For the first question: to be frank, I didn't try that repo any more. From your description, I think you can try to keep the downsample rate consistent during training so that the dims match; if the video is very long, it definitely needs downsampling to decrease the feature matrix size. For the second question: yes, the model needs to be trained from scratch. I simply modified the action segmentation model's input dim. Alternatively, you could add a projection layer at the front, load the pre-trained parameters for the rest of the model, and fine-tune. Hope to hear about your feedback and success!
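A rough sketch of that projection idea, assuming a PyTorch segmentation model whose first layer expects 2048-dim features (the class and argument names here are hypothetical, not from the released code):

```python
import torch
import torch.nn as nn

class ProjectedSegmenter(nn.Module):
    """Wrap a pre-trained segmenter with a learned 768 -> 2048 projection."""
    def __init__(self, segmenter: nn.Module, in_dim: int = 768, model_dim: int = 2048):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, model_dim, kernel_size=1)  # trained from scratch
        self.segmenter = segmenter  # rest of the model, loaded from a checkpoint

    def forward(self, x):
        # x: (batch, in_dim, T) frame-wise features
        return self.segmenter(self.proj(x))  # (batch, model_dim, T) goes into the segmenter
```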
Hi! I have the same problem.
I'll try to use https://github.com/yiskw713/video_feature_extractor.
Hi, thanks for the suggestion! To be honest, I didn't know about this feature extractor until I read your reply. I've been running experiments for some time with the extractor I mentioned in my previous answer, and I found that the features it extracts are not good enough: the accuracy isn't as high as reported in the paper. I will try your repo, and thank you again.
Thank you for sharing your status!
Hi! I am also working on the feature extraction process and looking forward to your results. If you get any results with this repo, I hope you can share them with us. Thank you very much!
Hi, @Youthfeng123 @habakan @shenjiyuan123. I have also been exploring I3D lately and found that the code from the repos you shared (code1 and code2) works for me. @Youthfeng123, regarding the shape (2048, n): my assumption is that when you input 21 video frames (as mentioned here) with a frame size of (224, 224), i.e., an input of size (-1, 3, 21, 224, 224), you get a shape of (-1, 1024, 2, 1, 1) as the output of the last AvgPool3d.
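My reading of that, as a sketch (this is an assumption about how 2048 arises, not a confirmed pipeline): each stream's (-1, 1024, 2, 1, 1) pooled output is averaged over the remaining temporal dim, and the RGB and flow vectors are concatenated per frame window:

```python
import torch

# Stand-ins for the AvgPool3d outputs of the RGB and flow I3D streams
rgb_out  = torch.randn(1, 1024, 2, 1, 1)
flow_out = torch.randn(1, 1024, 2, 1, 1)

rgb_feat  = rgb_out.mean(dim=2).flatten(1)    # (1, 1024)
flow_feat = flow_out.mean(dim=2).flatten(1)   # (1, 1024)

window_feat = torch.cat([rgb_feat, flow_feat], dim=1)  # (1, 2048) for one frame window
# Sliding this window over the whole video and stacking gives (T, 2048);
# transposing then yields the (2048, T) matrices found in the dataset.
```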
Thanks for your explanation of the tensor's shape, it helps a lot! May I ask which optical flow extractor you used? I used this repo to extract optical flow from videos; since I would like to try more methods, would you please share yours? Thanks a lot! @littlesi789 @shenjiyuan123 @habakan
Hi, @Youthfeng123. In the I3D paper, the authors used TV-L1 to extract optical flow; you can find code implementations from others. Please be cautious with my explanation above: I could not find whether the authors used RGB, optical flow, or both in the MS-TCN paper (please correct me if they did mention it). In the original I3D paper, the two streams are averaged at the final prediction stage, and the predictions are arrays of length 400. So the explanation above for 2048 is my assumption, and unfortunately I cannot find a way right now to reproduce their extracted features.
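In case it helps, a minimal TV-L1 sketch assuming opencv-contrib-python is installed (the factory function has moved between OpenCV versions, so check your build); the clip-to-[-20, 20]-and-rescale step follows the common I3D preprocessing and is not something confirmed in this repo:

```python
import cv2
import numpy as np

# Placeholder frame files; in practice iterate over consecutive video frames.
prev = cv2.imread("frame_0000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)

tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()   # needs opencv-contrib-python
flow = tvl1.calc(prev, curr, None)                # (H, W, 2): x and y displacement

# Common I3D-style preprocessing: truncate to [-20, 20] and rescale to [-1, 1]
flow = np.clip(flow, -20, 20) / 20.0
```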
@littlesi789 @Youthfeng123
May I ask whether any of you managed to extract video features with shape (2048 x T)? I cannot install OpenCV 2.4.13 for the Kinetics feature extraction repo.
@littlesi789 @shenjiyuan123 @bqdeng Do you mind answering me :'<
Hi guys, I am currently facing the same problem. I would like to know if anyone has made this repo work: https://github.com/ahsaniqbal/Kinetics-FeatureExtractor/tree/master. I would be really grateful if you could give some advice.
Hi @KarolyneFarfan, I used another repo to extract features; you can see my project at: https://github.com/XuanHien304/E2E-Action-Segmentation
Hi there, after reading the paper and code, I found that MS-TCN takes video features as input. I then loaded a feature file into a numpy variable and saw that the data is a matrix with shape (2048, n).
Here is my confusion: can the features in the datasets be transformed back into video, or are they features extracted from some other backbone? If so, which extractor was used?
Looking forward to your reply.
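For reference, a quick way to check what the released feature files look like (the path is a placeholder for one of the .npy files in the dataset's features folder):

```python
import numpy as np

# Placeholder path; point it at any .npy file from the features folder.
feat = np.load("features/video_01.npy")
print(feat.shape, feat.dtype)   # expected: (2048, n_frames)
```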