Questions about ptp #9
Hi @NQX1248:
I'd like to see if PTP helps in video-language tasks. Looking forward to further communication.
Thanks for your reply! I'm considering experiments on WebVid. Have you tried pretraining with PTP on WebVid and evaluating on downstream datasets, e.g., MSRVTT? I also noticed that you use object information in OA-Trans. Will you release the extracted object tags and bboxes for WebVid?
I have not explored PTP on video-text tasks, but it should work. Previously I saved the object features and tags together in numpy files, which took 10 TB of space. Since I have already offboarded and no longer have access to these data, you may need to follow https://github.com/FingerRec/OA-Transformer/blob/main/object_extraction.md for extraction.
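For anyone redoing the extraction, a minimal sketch of the storage scheme described above (object features and tags saved together per clip in a single numpy archive). The file name, array names, and feature dimension here are assumptions for illustration, not the original format:

```python
import numpy as np

# Hypothetical per-clip extraction output: 10 region features with
# matching object tags and xyxy boxes. Shapes/names are assumptions.
features = np.random.rand(10, 2048).astype(np.float32)  # region features
tags = np.array(["dog", "ball"] * 5)                    # one tag per region
boxes = np.random.rand(10, 4).astype(np.float32)        # xyxy boxes in [0, 1]

# Compressed .npz keeps features and tags in one file per clip,
# which is what drove the ~10 TB total mentioned above.
np.savez_compressed("clip0001_objects.npz",
                    features=features, tags=tags, boxes=boxes)

loaded = np.load("clip0001_objects.npz")
print(loaded["tags"].shape)  # (10,)
```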
Thanks for your response. I'm still confused about the zero-shot setting on COCO.
Thanks for your detailed explanation! Here are my zero-shot testing logs. The checkpoints are the pretrained checkpoint and the COCO zero-shot checkpoint, respectively.
Cool, I will upload the log you provided.
Hi @FingerRec, I downloaded the original BLIP checkpoint trained on 14M data, performed zero-shot testing on COCO, and got the following result.
Hi,
Congratulations on the great success of your wonderful work! I have several questions about PTP regarding the pretraining/finetuning settings described in the paper. The questions are as follows:
For example, in the zero-shot retrieval setting, captions during training look like "...The block x has a x", but the prompts are no longer used during inference. Why doesn't this train/test mismatch harm the performance?
Does the scale of the training dataset matter here? I'm also curious whether it would help to use the PTP text prompts in the finetuning stage instead of pre-training.
I tried to extend PTP to video retrieval and ran some experiments on video datasets, adding PTP in the finetuning stage when finetuning on MSRVTT, but the performance dropped slightly.
Looking forward to your reply!
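For context, the prompt construction being asked about can be sketched roughly as follows: each detected object is assigned to a cell of a coarse grid over the image, and a templated sentence is appended to the caption. The grid size, block indexing, and exact template wording here are assumptions for illustration, not the authors' exact implementation:

```python
# Hypothetical sketch of position-guided text prompts: map each object's
# box centre to a grid block and emit "The block i has a <tag>." sentences.
# The 3x3 grid and row-major indexing are assumptions.

def block_index(cx, cy, width, height, grid=3):
    """Map a box centre (pixels) to a grid cell index (row-major, 0-based)."""
    col = min(int(cx / width * grid), grid - 1)
    row = min(int(cy / height * grid), grid - 1)
    return row * grid + col

def build_prompts(objects, width, height, grid=3):
    """objects: list of (tag, (x1, y1, x2, y2)) detections in pixels."""
    prompts = []
    for tag, (x1, y1, x2, y2) in objects:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        idx = block_index(cx, cy, width, height, grid)
        prompts.append(f"The block {idx} has a {tag}.")
    return prompts

print(build_prompts([("dog", (10, 10, 50, 50))], 300, 300))
# → ['The block 0 has a dog.']
```

At inference time these appended sentences are simply omitted, which is the train/test mismatch the question raises.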