
Elastic Kernel Accuracy on CIFAR10 #3

Open
ziqi-zhang opened this issue Feb 17, 2023 · 7 comments
@ziqi-zhang

Hi,

Thanks for sharing the awesome code with us. I tried to run the code but got low accuracy, so I was wondering whether you have encountered a similar problem.

I successfully trained the teacher model and got a val top-1 accuracy of 91%. Then I ran python train_ofa_net.py --task kernel to train the elastic kernel, but I only got a top-1 accuracy of 52%, which is far from 91%. How can I improve the accuracy?

Best Regards

@pprp pprp self-assigned this Feb 17, 2023
@pprp
Owner

pprp commented Feb 17, 2023

Thanks for your attention. That's a good question.

I tried once-for-all in the CVPR 2022 NAS workshop and found that the progressive shrinking strategy used in once-for-all is actually a long pipeline, which includes elastic resolution, kernel size, depth, and width. Each stage has different hyperparameters. In this repo we changed the dataset from large-scale ImageNet to CIFAR-10, so the original hyperparameters might not work as before. As reported in your experiments, the drastic drop in performance may be attributable to an improper hyperparameter setting. There are some possible solutions:

  1. Check the checkpoint loading code to verify whether the pretrained model is loaded properly (see the sketch below).
  2. Try decaying the learning rate by a factor of 10 or 100.
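
Regarding point 1, here is a minimal PyTorch sketch for checking the load. The function name and the checkpoint layout handling are assumptions for illustration, not this repo's exact code:

```python
import torch
from torch import nn

def check_pretrained_load(net: nn.Module, ckpt_path: str) -> None:
    """Load a checkpoint non-strictly and report any keys that did not match."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Some checkpoints wrap the weights in a "state_dict" entry.
    state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
    missing, unexpected = net.load_state_dict(state_dict, strict=False)
    print("missing keys:", missing)        # expected by the net, absent in the checkpoint
    print("unexpected keys:", unexpected)  # present in the checkpoint, unused by the net
```

If most of the backbone weights show up under "missing keys", the elastic-kernel stage is effectively training from scratch, which would be consistent with the ~52% top-1 you observed.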

Besides, I prefer the sandwich rule proposed in BigNAS, which has fewer hyperparameters and converges faster than the progressive shrinking strategy.

Let me know if there is any new progress.

@ziqi-zhang
Author

Thanks very much for your quick and detailed answer! I guess I didn't correctly load the pre-trained model, and I will rerun the code to check the results. I will update this issue if I get any new results.

@ziqi-zhang
Author

BTW I saw your commit message saying "autoaugment 影响训练集非常大" (roughly, "autoaugment has a very large effect on the training set"). What does it mean? Does it mean that the autoaugmentation techniques can improve the final accuracy? Also, the original OFA repo doesn't seem to have these autoaugmentations?

@ziqi-zhang
Author

Hi, I found that after initializing the net with the weights of the pre-trained teacher (except for some mismatched weights), the top-1 accuracy increases to about 70%.
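
For context, one common way to do this kind of partial initialization is to copy only the tensors whose name and shape match before calling load_state_dict. A rough sketch (generic PyTorch; the function name is made up and the checkpoint layout is assumed):

```python
import torch
from torch import nn

def init_from_teacher(student: nn.Module, teacher_ckpt_path: str) -> None:
    """Copy teacher weights whose name and shape match the student; skip the rest."""
    teacher_sd = torch.load(teacher_ckpt_path, map_location="cpu")
    if isinstance(teacher_sd, dict) and "state_dict" in teacher_sd:
        teacher_sd = teacher_sd["state_dict"]
    student_sd = student.state_dict()
    matched = {
        k: v for k, v in teacher_sd.items()
        if k in student_sd and v.shape == student_sd[k].shape
    }
    student_sd.update(matched)
    student.load_state_dict(student_sd)
    print(f"initialized {len(matched)} tensors, "
          f"skipped {len(teacher_sd) - len(matched)} mismatched ones")
```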

@pprp
Owner

pprp commented Feb 18, 2023

@ziqi-zhang

As for the influence of autoaugmentation, I did test it and achieved 89% training accuracy and 81% validation accuracy. The capacity of the current OFA model is slightly larger than what the CIFAR-10 dataset can support, which means that adopting more diverse data would boost the performance.
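
For anyone who wants to try this, the effect can be tested by inserting an AutoAugment step into the CIFAR-10 training transforms. A minimal torchvision sketch (the surrounding pipeline and the normalization constants are the commonly used CIFAR-10 ones, not copied from this repo):

```python
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# Typical CIFAR-10 training pipeline with AutoAugment added before ToTensor.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    AutoAugment(policy=AutoAugmentPolicy.CIFAR10),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])
```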

And it's great to hear that loading the pretrained model recovers about 20 points of the accuracy drop 👍. You can try autoaugmentation or tune the hyperparameters as the next step.

@ziqi-zhang
Author

@pprp Thanks very much for your explanation! BTW I read about the sandwich rule in BigNAS, but I have a small question: is the sandwich rule progressive (like OFA) or one-stage?

As we know, OFA needs to train in four stages (resolution, kernel, depth, and width). But the sandwich rule seems not to have this requirement: it trains only once, and in each iteration it samples the largest, smallest, and some random intermediate child models.

If that is the case, the sandwich rule is much more convenient than OFA (one stage vs. four stages). But I guess the total training time of the sandwich rule should be comparable to the sum of the times of OFA's four stages?

@pprp
Owner

pprp commented Feb 18, 2023

@ziqi-zhang From my experiments, I think the sandwich rule should be quicker than OFA because of inplace distillation. Inplace distillation was quite useful during the CVPR NAS Workshop.
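
To make that concrete, here is a rough sketch of one sandwich-rule training step with inplace distillation. The supernet sampling helpers (sample_max_subnet, sample_min_subnet, sample_active_subnet) are hypothetical names; real supernet APIs differ between codebases:

```python
import torch.nn.functional as F

def sandwich_step(supernet, images, labels, optimizer, num_random=2):
    """One sandwich-rule step: largest net, smallest net, and a few random subnets."""
    optimizer.zero_grad()

    # 1. Largest subnet: trained with the ground-truth labels; its detached
    #    logits become the teacher for inplace distillation.
    supernet.sample_max_subnet()          # hypothetical helper
    max_logits = supernet(images)
    F.cross_entropy(max_logits, labels).backward()
    soft_targets = max_logits.detach()

    # 2. Smallest subnet plus a few random subnets: trained only to match
    #    the largest subnet's predictions (inplace distillation).
    samplers = [supernet.sample_min_subnet] + [supernet.sample_active_subnet] * num_random
    for sample in samplers:               # hypothetical helpers
        sample()
        logits = supernet(images)
        loss = F.kl_div(F.log_softmax(logits, dim=1),
                        F.softmax(soft_targets, dim=1),
                        reduction="batchmean")
        loss.backward()

    optimizer.step()
```

Because all subnets share weights and are updated in a single pass, there is no need for the separate resolution/kernel/depth/width stages of progressive shrinking.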
