Every day, students of pharmacy at the Federal University of Minas Gerais (UFMG) use chicken eggs to study the effects of drugs on blood vessels. Because of this, it is essential to automate, speed up, and improve the accuracy of blood vessel segmentation in chicken eggs; better segmentation guarantees reliable results with good reproducibility. This project uses machine learning algorithms to automate the process, while maintaining speed and improving accuracy.
Nubank, a Brazilian fintech, organized a competition aimed at scouting for new talents. One of the objectives of the competition was to create a model capable of predicting which customers would fail to meet their financial obligations, incurring what is known as default. To make these predictions, a dataset containing customer information was used. One of the major issues with this type of dataset is its class imbalance; there is a large number of customers who pay their debts and a minimal number who do not, making prediction difficult.
What you will see in this notebook:
- Data cleaning and creation of new features.
- Use of Pipeline to simplify data manipulation across various models and minimize the risk of data leakage.
- How to handle an imbalanced dataset.
- Metric selection for imbalanced datasets.
- Conversion of features to ordered categories and to One Hot Encoder (OHE).
- Reducing skewness of features using log or QuantileTransformer.
- Use of grid search to find the best parameters for the models.
- Models used: Decision Tree, Random Tree, Neural Network, Logistic Regression, Bagging, Random Patches, Adaboost, and Voting Classifier.
- Discussion of the results obtained.
Como foi visto nos resultados do notebook E-Commerce Seller, uma baixa avaliação dos vendedores está correlacionada com o atraso da entrega. Para melhorar as avaliações dos vendedores, e também deixar o consumir mais satisfeito, iremos criar uma algoritmo que indique a possibilidade de atraso na entrega.
Empresas que trabalham com marketplace agregam diversos tipos de vendedores em seus sistemas, porém nem todos eles conseguem deixar seus clientes satisfeitos. Sendo assim, é importante identificar característica de vendedores bem avaliados pelos usuários e incentivar que outros se comportem da mesma forma. Com a quantidade de dados disponível sobre os vendedores, é possível identificar de forma automatizada essas características e propor mudanças tanto por parte dos vendedores quanto por parte da empresa.
Recommendation systems play an important role in filtering information before any user can consume it, whether by recommending movies to be watched or items a consumer might want. In this project, we build two recommendation systems using Nearest Neighbor and Matrix Factorization. Moreover, using the t-SNE algorithm, it is shown that a naive implementation of those algorithms has biases. They learn to recognize blockbuster movies from other movies as they are rated more often and with a higher rating score. That can lead to an algorithm prone to recommend more blockbuster movies than any other.
GAN is a machine learning model that can learn how to replicate the properties of a dataset. For example, a GAN trained in a dataset of breast cancer images learns how to generate new images similar to those seen in the dataset. Therefore, a GAN can be used to generate anonymized medical images, as they do not belong to any human, but still describe the disease characteristics. In this project, because GANs require a lot of computational resources and time, we will use two simple models to generate images of numbers.
GitHub Template Photo by Hal Gatewood on Unsplash