
Questions about ptp #9

Open
nqx12348 opened this issue May 26, 2023 · 8 comments

nqx12348 commented May 26, 2023

Hi,
Congratulations on the great success of your wonderful work! I have several questions about PTP regarding the pre-training/fine-tuning settings described in the paper:

  1. I noticed that you perform zero-shot retrieval experiments on MS COCO, but in Section 4.1 of the paper COCO is also listed as a pre-training dataset. Did you exclude COCO from the pre-training data before zero-shot testing on COCO?
  2. You mention in the paper that the text prompt is used only in the pre-training stage, which sounds fair because it does not change the inference setting. However, I would expect PTP to change the distribution of image captions and create a gap between the training corpus and the testing corpus, which might hurt retrieval. It seems to be the opposite: it helps downstream retrieval rather than harming it. Why?
     For example, in the zero-shot retrieval setting, training captions look like "...The block x has a x" (see the sketch after this list), but the prompts are not used during inference. Why doesn't this mismatch hurt performance?
     Does the scale of the training dataset matter here? I am also curious whether it would help to use PTP text prompts in the fine-tuning stage instead of pre-training.
     I tried to extend PTP to video retrieval and ran some experiments on video datasets, adding PTP in the fine-tuning stage when fine-tuning on MSRVTT, but the performance drops slightly.
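For concreteness, here is a minimal sketch of how I understand such a position prompt could be built from detector outputs; the function name, the (tag, bbox) input format, and the 3x3 grid are my own illustrative assumptions, not the repository's actual code.

```python
import random

def ptp_prompt(caption, objects, image_w, image_h, grid=3):
    """Append a PTP-style position prompt to a caption.

    `objects` is assumed to be a list of (tag, (x1, y1, x2, y2)) pairs
    from an object detector; all names here are illustrative.
    """
    if not objects:
        return caption
    # Sample one detected object for the prompt.
    tag, (x1, y1, x2, y2) = random.choice(objects)
    # Map the box centre to one of grid*grid blocks, numbered row-major.
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    col = min(int(cx / image_w * grid), grid - 1)
    row = min(int(cy / image_h * grid), grid - 1)
    block_id = row * grid + col
    # The "The block x has a x" pattern mentioned above.
    return f"{caption} The block {block_id} has a {tag}."

# e.g. ptp_prompt("A dog runs on grass.", [("dog", (10, 40, 120, 200))], 224, 224)
# -> "A dog runs on grass. The block 3 has a dog."
```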

Looking forward to your reply!

nqx12348 changed the title from "Questions about finetune setting" to "Questions about ptp" on May 26, 2023
FingerRec (Collaborator) commented

Hi @nqx12348,
Thanks for your good questions.

  1. The preparation of the pre-training corpus follows OSCAR (https://github.com/microsoft/Oscar/blob/master/VinVL_DOWNLOAD.md). All settings are kept consistent; it is a common practice.
  2. Yes, I have the same observation! PTP relies heavily on the quality of the object tags. I previously focused on video-language pre-training and found it hard to introduce object information in the fine-tuning stage (as in OA-Trans). Like you, I also tried to incorporate PTP into fine-tuning for common VL tasks and it did not help; it is best introduced in the pre-training stage. For a pure fine-tuning setting, you could still try two experiments: (a) apply PTP to only 50% of the samples (a sketch of what I mean is below); (b) incorporate it into pre-training datasets such as WebVid and CC3M.
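
A minimal sketch of the 50% option, assuming a `ptp_prompt` helper like the one sketched earlier in this thread and a sample dict with hypothetical `caption`/`objects`/`width`/`height` fields (not the repository's actual data format):

```python
import random

PTP_PROB = 0.5  # apply the position prompt to only half of the training samples

def build_caption(sample):
    """Return the caption for one training sample, with PTP applied
    at probability PTP_PROB; otherwise the original caption is used."""
    caption = sample["caption"]
    if sample.get("objects") and random.random() < PTP_PROB:
        # `ptp_prompt` is the prompt builder sketched earlier in this thread.
        caption = ptp_prompt(caption, sample["objects"],
                             sample["width"], sample["height"])
    return caption
```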

I'd like to see whether PTP helps on video-language tasks. Looking forward to further communication.

nqx12348 (Author) commented May 29, 2023

Thanks for your reply! I'm considering experiments on WebVid. Have you tried pre-training with PTP on WebVid and evaluating on downstream datasets, e.g., MSRVTT? I also notice that you use object information in OA-Trans. Will you release the extracted object tags and bounding boxes for WebVid?

FingerRec (Collaborator) commented

I did not explore PTP on video-text tasks, but it should work. Previously I saved the object features and tags together in NumPy files, which took about 10 TB of space. Since I have already left and no longer have access to these data, you may need to follow https://github.com/FingerRec/OA-Transformer/blob/main/object_extraction.md for extraction.

nqx12348 (Author) commented May 31, 2023

Thanks for your response. I'm still confused about the zero-shot setting on COCO.

  1. Comparing the released logs in this repo, I find that 4M_ptp_coco_zero_shot.txt is identical to 4M_ptp_coco_ft.txt. Why? Does the model need to be trained during zero-shot testing on COCO? I notice there is no training process in the zero-shot testing for Flickr30k.
  2. I also find two checkpoints (the pretrained checkpoint and the COCO zero-shot checkpoint), but as I understand it, zero-shot testing on COCO needs no extra training, so these two checkpoints should be the same. What is the difference between them? I notice there is no checkpoint for zero-shot Flickr30k.
  3. Given the above two questions, I'm a bit confused about the definition of the zero-shot retrieval task. In my understanding it means pre-training on a number of large datasets and then testing on a new dataset (one not used in pre-training) without any fine-tuning (a minimal sketch of what I mean is after this list). But in PTP and ViLT, COCO is used in the 4M training set as well as in "zero-shot" testing. Is this allowed in the "zero-shot" setting? I read the OSCAR and ViLT papers but still couldn't find the answer. Could you kindly explain it? Thanks!
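
To make my understanding of "zero-shot" concrete, here is a minimal sketch of the evaluation I have in mind: encode images and captions with the pretrained model only (no training loop) and compute Recall@K from the similarity matrix. It assumes one caption per image purely for illustration (COCO actually has five captions per image), and the names are mine, not the repository's.

```python
import torch

@torch.no_grad()
def zero_shot_recall(text_emb, image_emb, ks=(1, 5, 10)):
    """Text-to-image Recall@K from L2-normalized embeddings produced by the
    *pretrained* model; caption i is assumed to match image i."""
    sim = text_emb @ image_emb.t()                           # (N_text, N_image) cosine similarities
    gt = torch.arange(sim.size(0), device=sim.device)        # ground-truth image index per caption
    ranks = sim.argsort(dim=1, descending=True)              # images ranked per caption
    pos = (ranks == gt[:, None]).nonzero()[:, 1]             # rank of the ground-truth image
    return {f"R@{k}": (pos < k).float().mean().item() * 100 for k in ks}
```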

FingerRec (Collaborator) commented

  1. Thanks a lot, nqx. Yes, I forgot to upload coco_zero_shot.txt. You are correct: zero-shot means testing directly without tuning. As you noticed, that log is for fine-tuning rather than zero-shot, which is why the performance is much higher than zero-shot. I'm looking for the zero-shot file; alternatively, could you post your test result here?
  2. You are correct. In general, pre-training produces multiple checkpoints and their zero-shot results differ slightly; select the checkpoint that performs best.
  3. I agree with you. The main reasons are the lack of high-quality datasets and historical convention. Conventional datasets like CC and YFCC are quite noisy, so previous work such as OSCAR introduced the human-annotated COCO and VG to help pre-training. Subsequent works follow their setting, and there is indeed something misleading here: although the downstream tasks test on val/test splits, those splits still come from the same domain (dataset). In image classification or domain adaptation, the zero-shot setting would not include data from the same domain.

nqx12348 (Author) commented Jun 2, 2023

Thanks for your detailed explanation! Here are my zero-shot testing logs. The checkpoints are the pretrained checkpoint and the COCO zero-shot checkpoint, respectively.
pretrained_concated_pred_4m.log
coco_zero_shot.log

FingerRec (Collaborator) commented

Cool, I will upload the log you provided.

nqx12348 (Author) commented Jun 8, 2023

Hi @FingerRec, I downloaded the original BLIP checkpoint trained on 14M data, then performed zero-shot testing on COCO and got the following result.
[screenshot: zero-shot COCO retrieval results of the 14M BLIP checkpoint]
The result is very close to BLIP-PTP trained on 4M data, and much higher than BLIP trained on 4M data (according to the numbers reported in the paper). The performance gap between models trained on 4M and 14M pre-training data is quite surprising.
[screenshot: the corresponding numbers from the paper]
Could you kindly release the BLIP checkpoint trained on 4M data (without PTP) for comparison, so we can run more experiments to evaluate the effectiveness of PTP? Thanks!
