Here is the code I used to reach 11th place during the training phase and 12th place when the final scores were calculated. The final leaderboard can be checked here (username nmacian).
The official site for the challenge is here and the rules are here.
The training and test datasets can be downloaded from here.
The remaining assets needed to run my code can be found here.
The model I built was based on the one provided by Mercado Libre; the original Colab can be visited here. I cleaned the data and tuned hyperparameters to reach a score of 0.9048 (compared to the 0.875 of the original model).
For almost the entire competition I used Google Colab as my coding environment, where GPU usage is quite limited. It wasn't until the last two days of the challenge that I managed to run on Google Cloud, and by then I didn't have much time left to keep improving the model.
As I didn't have access to more powerful hardware until the end, I had to subsample the original data to roughly half its size. The cleaning function and subsampling strategies can be found in the "Sampling_Spanish" and "Sampling_Portuguese" notebooks.
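For illustration, here is a minimal sketch of what that cleaning and subsampling step might look like. The real logic lives in the two notebooks; the column names (`title`, `category`, `language`) and the 50% ratio below are assumptions:

```python
import pandas as pd

def clean_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Basic title cleaning: lowercase, drop punctuation, collapse spaces."""
    df = df.copy()
    df["title"] = (
        df["title"]
        .str.lower()
        .str.replace(r"[^\w\s]", " ", regex=True)  # strip punctuation
        .str.replace(r"\s+", " ", regex=True)      # collapse whitespace
        .str.strip()
    )
    return df

def subsample(df: pd.DataFrame, frac: float = 0.5, seed: int = 42) -> pd.DataFrame:
    """Stratified subsample: keep `frac` of each category so the
    class distribution of the full data is preserved."""
    return (
        df.groupby("category", group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )

# Hypothetical usage: split by language, clean, then halve each subset.
train = pd.read_csv("train.csv")
spanish = subsample(clean_titles(train[train["language"] == "spanish"]))
```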
The model is built with the fastai library. I trained two separate models, one for Spanish titles and one for Portuguese titles, and joined their results at the end.
To run the models you first need to download the vocabulary and language model provided by Mercado Libre. After cleaning and sampling the data you can run the two models and obtain the two submission files that need to be combined.
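As a rough guide, training one of the two classifiers follows the usual ULMFiT recipe. This is a sketch assuming fastai v1, not my exact notebook code; the file names, column names, and hyperparameters are placeholders:

```python
import pandas as pd
from fastai.text import TextClasDataBunch, text_classifier_learner, AWD_LSTM

train_df = pd.read_csv("spanish_train_clean.csv")  # hypothetical cleaned split
valid_df = pd.read_csv("spanish_valid_clean.csv")

# Classification DataBunch from the cleaned titles. In practice the vocab
# of the pretrained language model should be passed via `vocab=` so that
# token ids line up with the encoder weights.
data_clas = TextClasDataBunch.from_df(
    path=".", train_df=train_df, valid_df=valid_df,
    text_cols="title", label_cols="category", bs=64,
)

# Classifier on top of an AWD_LSTM encoder; the file name for the encoder
# weights provided by Mercado Libre is an assumption.
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder("meli_spanish_encoder")

# Usual ULMFiT recipe: train the head first, then unfreeze and fine-tune.
learn.fit_one_cycle(1, 2e-2)
learn.unfreeze()
learn.fit_one_cycle(3, slice(1e-3, 1e-2))

# Test-set predictions are then written to this language's submission file.
```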
The "JoinResults" notebook takes the two outputs, keeping the Spanish titles classified by the Spanish model and the Portuguese titles classified by the Portuguese model. It then sorts the results and writes them in the expected submission format.
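A minimal sketch of that join, assuming each per-language file keeps the original test row id next to the predicted category (file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical per-language outputs: one predicted category per test row,
# keyed by the row's original id in the test set.
es = pd.read_csv("submission_spanish.csv")
pt = pd.read_csv("submission_portuguese.csv")

# Each model only classified its own language's titles, so concatenating
# the two files covers the whole test set; sorting by id restores the
# original order the scorer expects.
merged = pd.concat([es, pt]).sort_values("id").reset_index(drop=True)
merged.to_csv("submission.csv", index=False)
```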
As I didn't have much time on Google Cloud, here is what I would have done with it:
- Train two language-model databunches after cleaning the data, one for each language (see the sketch below).
- Train the two models with all the available data.
My guess is that overall performance would improve, yielding a higher final score.
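For the first point, here is a sketch of what fine-tuning a per-language language model on the full cleaned corpus could look like, again assuming fastai v1 and placeholder names:

```python
import pandas as pd
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# Hypothetical: the full (unsampled) cleaned Spanish corpus.
full_df = pd.read_csv("spanish_full_clean.csv")
valid_df = full_df.sample(frac=0.1, random_state=42)
train_df = full_df.drop(valid_df.index)

# Language-model DataBunch: the titles themselves are the training targets.
data_lm = TextLMDataBunch.from_df(
    path=".", train_df=train_df, valid_df=valid_df, text_cols="title", bs=64,
)

# pretrained=False because fastai's stock weights are English; starting
# from the Mercado Libre pretrained model would be the natural alternative.
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn_lm.fit_one_cycle(5, 1e-2)

# Save the encoder so the classifier step can load it in place of the
# provided one.
learn_lm.save_encoder("spanish_encoder_full")
```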