
# Pover-T Tests: DrivenData competition

The World Bank is aiming to end extreme poverty by 2030. Crucial to this goal are techniques for determining which poverty reduction strategies work and which do not. But measuring poverty reduction requires measuring poverty in the first place, and it turns out that measuring poverty is pretty hard. The World Bank helps developing countries measure poverty by conducting in-depth household surveys with a subset of the country's population in order to get a clearer picture of a household's poverty status.

I achieved a top-5% finish in this competition against 2,310 other data scientists worldwide. The problem was to model three different datasets (countries A, B, and C). My solution pipeline was:

- pre-process and normalize the data;
- reduce the feature space with Recursive Feature Elimination (RFE) based on a random forest and 10-fold CV, which also yields the features ordered by importance;
- cluster and reduce features with the t-distributed stochastic neighbor embedding algorithm (t-SNE) and add the embedding as new features to the dataset;
- build basic feature interactions, since the dataset is anonymized.

For the final model I used a stacked ensemble of around 10 each of gradient boosting trees, random forests, neural networks, and elastic-net linear models, stacked with a 10-fold GBM metalearner. All models and parameter tuning were done with the H2O.ai framework in R. With this code you should be able to obtain a 0.1650 mean log loss.
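For orientation, here is a minimal sketch of the stacking step in H2O's R API. This is not the code from poverty_github.r; the file path, target column `poor`, and `id` column are placeholder assumptions, and hyperparameters are omitted for brevity.

```r
# Minimal stacking sketch with H2O (placeholder names, not the repo's exact code)
library(h2o)
h2o.init()

train <- h2o.importFile("train_A.csv")   # hypothetical path
y <- "poor"                              # hypothetical binary target
x <- setdiff(names(train), c(y, "id"))
train[[y]] <- as.factor(train[[y]])

# Base learners: CV predictions must be kept, with a consistent fold assignment,
# so the metalearner can be trained on them
nfolds <- 10
gbm <- h2o.gbm(x = x, y = y, training_frame = train, nfolds = nfolds,
               fold_assignment = "Modulo", keep_cross_validation_predictions = TRUE)
rf  <- h2o.randomForest(x = x, y = y, training_frame = train, nfolds = nfolds,
               fold_assignment = "Modulo", keep_cross_validation_predictions = TRUE)
dl  <- h2o.deeplearning(x = x, y = y, training_frame = train, nfolds = nfolds,
               fold_assignment = "Modulo", keep_cross_validation_predictions = TRUE)
glm <- h2o.glm(x = x, y = y, training_frame = train, family = "binomial",
               alpha = 0.5, nfolds = nfolds, fold_assignment = "Modulo",
               keep_cross_validation_predictions = TRUE)  # elastic net

# Stack the base models with a GBM metalearner, as described above
ens <- h2o.stackedEnsemble(x = x, y = y, training_frame = train,
                           base_models = list(gbm, rf, dl, glm),
                           metalearner_algorithm = "gbm")
```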

## Instructions

The main file is poverty_github.r; you should be able to run it without problems. Just change the `fread` directory lines (RFE, TSNE) to match your local machine, and download the intermediate processing files: RFE/a/b/b and tsne a/b. In case you have forgotten, the ratios of the three countries are 0.46 (A), 0.18 (B), and 0.36 (C), so you can recompute the mean log loss. I have commented only the first dataset in detail; refer to it for countries B and C.
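Recombining the three per-country scores is then just a weighted mean, for example:

```r
# Weighted mean log loss over the three countries, using the
# test-set ratios given above (0.46 A, 0.18 B, 0.36 C)
weighted_logloss <- function(ll_a, ll_b, ll_c) {
  0.46 * ll_a + 0.18 * ll_b + 0.36 * ll_c
}

weighted_logloss(0.15, 0.20, 0.17)  # hypothetical per-country scores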

- RFE_github.r shows how to perform the feature selection.
- tsne_github shows a viable t-SNE feature representation.
- Neural_net_github.r shows how to search for a neural network architecture and hyperparameters.

You should be able to adapt the optimization of the other algorithms from this sketch; if you really, really, really want, drop me a line!
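As an illustration of the kind of search Neural_net_github.r performs, here is a hedged sketch using H2O's random grid search. The data names are the same placeholders as above, and the hyperparameter values are illustrative, not the ones actually tuned in the repo.

```r
# Sketch of a neural network architecture/hyperparameter search with h2o.grid
# (illustrative values, not the repo's tuned settings)
library(h2o)
h2o.init()

train <- h2o.importFile("train_A.csv")   # hypothetical path
y <- "poor"
x <- setdiff(names(train), c(y, "id"))
train[[y]] <- as.factor(train[[y]])

hyper_params <- list(
  hidden = list(c(64, 64), c(128, 64), c(256, 128, 64)),  # candidate architectures
  activation = c("Rectifier", "RectifierWithDropout"),
  input_dropout_ratio = c(0, 0.1, 0.2),
  l1 = c(0, 1e-5, 1e-4)
)

grid <- h2o.grid(
  algorithm = "deeplearning",
  x = x, y = y, training_frame = train, nfolds = 10,
  hyper_params = hyper_params,
  search_criteria = list(strategy = "RandomDiscrete", max_models = 20, seed = 1)
)

# Rank the candidate models by cross-validated log loss
h2o.getGrid(grid@grid_id, sort_by = "logloss", decreasing = FALSE)
```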

## What didn't work globally so far (feel free to try it)

- PCA
- GLRM
- naive Bayes models
- k-means clustering

## What didn't work for balancing country B

- creation of synthetic data with the SMOTE algorithm
- upsampling the minority class
- upsampling the whole dataset
- upsampling the whole dataset and upsampling the minority class
- autoencoders to add decoded features as new features
- autoencoders to model the minority class
- autoencoders to model the outliers (MSE < 0.001)
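None of these helped, but for reference, the autoencoder outlier experiment from the last item can be set up in H2O roughly as follows. This is a sketch: the file path, column names, and layer sizes are assumptions; only the 0.001 reconstruction-MSE threshold comes from the list above.

```r
# Sketch of autoencoder-based outlier filtering with H2O
# (placeholder data and layer sizes; MSE threshold from the text above)
library(h2o)
h2o.init()

train_b <- h2o.importFile("train_B.csv")   # hypothetical path
features <- setdiff(names(train_b), c("poor", "id"))

# Unsupervised autoencoder: no target column is passed
ae <- h2o.deeplearning(x = features, training_frame = train_b,
                       autoencoder = TRUE,
                       hidden = c(32, 8, 32),
                       epochs = 50)

# Per-row reconstruction MSE; rows below the threshold are the
# well-modeled (non-outlier) points
mse <- h2o.anomaly(ae, train_b)
inliers <- train_b[mse$Reconstruction.MSE < 0.001, ]
```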

## Author

Payback80

## License

This project is licensed under the MIT License - see the LICENSE.md file for details
