This is the authors' implementation of the following paper:
"Pool of Experts: Realtime Querying Specialized Knowledge in Massive Neural Networks", SIGMOD, 2021
Hakbin Kim, Dong-Wan Choi
Comparison between Lsoft and Lscale |
We conduct an experiment to examine whether our Lscale helps to address the logit scale problem, as described in Section 4.2. To this end, we focus on a specific type of error caused by a wrong expert that takes the highest probability away from the correct expert, which we call a cross-expert error. This type of error is distinct from errors made locally within the classes of a single expert, and cross-expert errors are more likely to occur when the logit scales of the experts being merged differ substantially.
In order to see the effectiveness of Lscale, we build a separate group of experts intentionally trained with only Lsoft, and build task-specific models upon them using the same knowledge consolidation method. We then examine how the proportion of cross-expert errors among all errors changes once Lscale is added when extracting experts. In Table 6, we observe that the rate of cross-expert errors is reduced when Lscale is used along with Lsoft. Thus, the experts trained with both Lsoft and Lscale turn out to be more robust to cross-expert errors than those trained with only Lsoft. Since the problem of overconfident experts is already resolved by Lsoft, we can claim that minimizing Lscale is effective in mitigating the logit scale problem.
More concretely, to see the effect of Lscale described in Section 4.1, we compare the merge results of PoE for experts Ei trained with only Lsoft against those trained with the full LCKD (i.e., Lsoft plus Lscale). The numbers in Table 6 are averaged over all tested task combinations. In Table 6, out_mis/all_mis denotes the fraction of misclassifications caused by the largest logit coming from a wrong model M(Hj) rather than from the model M(Hi) of the primitive task Hi to which the image belongs. We can confirm that out_mis/all_mis is higher with only Lsoft than with LCKD on both CIFAR-100 and Tiny-ImageNet, even though the high-confidence issue is already resolved by Lsoft.
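For reference, below is a minimal sketch of how the cross-expert error rate (out_mis/all_mis in Table 6) can be computed, assuming the merged model outputs logits over all consolidated classes and that each class belongs to exactly one expert; the function and argument names are illustrative and not part of the repository's API.

```python
def cross_expert_error_rate(logits, labels, expert_class_sets):
    """out_mis / all_mis: the fraction of misclassified samples whose predicted
    class belongs to a different expert than the ground-truth class.

    logits: (N, C) torch tensor of merged-model outputs over all consolidated classes
    labels: (N,) torch tensor of ground-truth class indices
    expert_class_sets: list of sets, one set of class indices per expert
    """
    preds = logits.argmax(dim=1)
    wrong = preds != labels

    # Map every consolidated class index to the expert that owns it.
    class_to_expert = {}
    for e, classes in enumerate(expert_class_sets):
        for c in classes:
            class_to_expert[c] = e

    all_mis = int(wrong.sum())
    if all_mis == 0:
        return 0.0
    out_mis = sum(
        1
        for p, y in zip(preds[wrong].tolist(), labels[wrong].tolist())
        if class_to_expert[p] != class_to_expert[y]
    )
    return out_mis / all_mis
```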
We provide CIFAR-100 and TinyImageNet examples for Pool of Experts
python Run_preprocessing.py
We provide the Oracle in DB_pretrained, and the library and some experts in DB_PoE.
You can check the accuracy of the Oracle and of each primitive-task model by executing the above command.
python Run_Service.py --queriedTask <primitive tasks>
Example for CIFAR100: python Run_Service.py --queriedTask people vehicles_1 vehicles_2
Example for TinyImageNet: python Run_Service.py --queriedTask arachnid canine feline
CIFAR100 Available primitive tasks:
'large_omnivores_and_herbivores', 'medium-sized_mammals', 'people', 'small_mammals', 'vehicles_1', 'vehicles_2'
TinyImageNet Available primitive tasks:
'arachnid', 'canine', 'feline', 'bird', 'ungulate', 'vertebrate_etc'
You can check the accuracy of the model for the queried composite task by executing the above command.
All algorithms were implemented using PyTorch and evaluated on a machine with an NVIDIA Quadro RTX 6000 GPU and an Intel Xeon Gold 5122 CPU.
When training all the models, we use stochastic gradient descent (SGD) with momentum 0.9, and the weight decay (L2 regularization) is fixed to 5e-4.
The batch size of all networks was set to 512.
In all networks, we set the temperature T for distillation to 4 and the weight parameter alpha for the scale loss to 0.3.
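For reference, a minimal sketch of the soft-label distillation term with T = 4 is shown below, assuming the usual Hinton-style KD formulation of Lsoft; the paper's Lscale term, weighted by alpha = 0.3, is only indicated schematically and its exact definition is not reproduced here.

```python
import torch.nn.functional as F

T = 4.0      # distillation temperature
ALPHA = 0.3  # weight of the scale loss Lscale

def soft_distillation_loss(student_logits, teacher_logits, temperature=T):
    """Soft-target distillation term (usual KD formulation): KL divergence between
    the temperature-softened teacher and student distributions, scaled by T^2."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Full objective (schematically): loss = Lsoft + ALPHA * Lscale,
# where Lscale follows the definition given in the paper.
```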
In order to extract each expert by training primitive models, we train only the expert part (i.e., conv 4) for 100 epochs, where the initial learning rate is set to 0.1 and decayed by a factor of 0.1 at epochs 40 and 80. In our experiments, we randomly choose six of all the primitive tasks for both CIFAR-100 and Tiny-ImageNet. The transfer learning setting is the same.
When training a target architecture without using the library part, the learning process continues for 200 epochs, where the initial learning rate is 0.1 and decayed by a factor of 0.1 at epochs 80 and 160.
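The following is a minimal PyTorch sketch of the optimizer and step-decay schedule described above (SGD, momentum 0.9, weight decay 5e-4, decay by 0.1 at the listed epochs); the helper and variable names are illustrative only.

```python
import torch

def make_optimizer_and_scheduler(params, lr, milestones):
    """SGD with momentum 0.9 and weight decay 5e-4; the learning rate is
    multiplied by 0.1 at each epoch listed in `milestones`."""
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=milestones, gamma=0.1
    )
    return optimizer, scheduler

# Expert extraction (only the expert part, e.g. conv 4, is trained):
#   make_optimizer_and_scheduler(expert_params, lr=0.1, milestones=[40, 80])
# Target architecture trained without the library part:
#   make_optimizer_and_scheduler(model.parameters(), lr=0.1, milestones=[80, 160])
```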
In the WRN architecture for Tiny-ImageNet experiments, the stride of the first convolutional layer is set to 2 because the input size of Tiny-ImageNet is (64, 64, 3).
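A small sketch of the dataset-dependent stride for the first WRN convolution is given below; the function name and default channel width are illustrative assumptions, not the repository's exact code.

```python
import torch.nn as nn

def make_first_conv(dataset, in_channels=3, out_channels=16):
    # Tiny-ImageNet inputs are 64x64, so a stride of 2 halves the spatial size
    # to match the 32x32 resolution the rest of the WRN assumes.
    stride = 2 if dataset == "tiny-imagenet" else 1
    return nn.Conv2d(in_channels, out_channels, kernel_size=3,
                     stride=stride, padding=1, bias=False)
```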
Model Specialization and Consolidation |
We use the front part of MobileNetV2 as the library, and the back part of MobileNetV2, with a significantly reduced number of channels, as the experts; a minimal sketch of this split is given at the end of this section. (The detailed structure is given in ImageNet/PoE/network/pytorch_mobilenetV2.py.)
When training the experts, we use stochastic gradient descent (SGD) with momentum 0.9, and the weight decay (L2 regularization) is fixed to 5e-4.
We set the temperature T for distillation to 4 and the weight parameter alpha for the scale loss to 0.3.
In order to extract each expert by training primitive models, we train only the expert part for 20 epochs, where the initial learning rate is set to 0.01 and decayed by a factor of 0.1 at epochs 10 and 15. In our experiments, we randomly choose four of all the primitive tasks for ImageNet.
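Below is a minimal sketch of splitting a MobileNetV2-style backbone into a shared, frozen library (front part) and a trainable expert (back part). It uses torchvision's mobilenet_v2 purely for illustration; the repository's actual channel-reduced expert is defined in ImageNet/PoE/network/pytorch_mobilenetV2.py, and the split index and head size below are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

SPLIT = 7  # hypothetical split index inside mobilenet_v2().features

backbone = mobilenet_v2()  # illustrative stand-in for the repository's network
library = nn.Sequential(*backbone.features[:SPLIT])      # shared front part
expert_body = nn.Sequential(*backbone.features[SPLIT:])  # back part (the actual experts
                                                         # use far fewer channels)
expert_head = nn.Linear(1280, 5)  # 1280 = MobileNetV2 feature width; 5 = example task size

# Freeze the library so that only the expert part is trained.
library.eval()
for p in library.parameters():
    p.requires_grad = False

def expert_forward(x):
    with torch.no_grad():
        h = library(x)           # shared, frozen library features
    h = expert_body(h)
    h = h.mean(dim=(2, 3))       # global average pooling
    return expert_head(h)
```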