Finds the best match (in a database of titles) for a misspelled title,
using a combination of Machine Learning and NLP techniques.
- Matching search terms against a database of millions of "true" titles (for example, company names) can be computationally expensive
- On one data set, the current implementation matched 100,000 titles against 500,000 true titles in around 10 minutes - i.e. around 10,000 matches per minute
- Human beings can be really creative in coming up with new ways to misspell words in a title
- For the "example" data set, see the error matrix below
- Pre-requisites:
  - Install Docker (tested on engine v3.7)
  - Install `make`
    - Windows: install Cygwin; on the screen that lets you select packages to install, find `make` and select it
    - Debian: `apt-get install build-essential`
    - RHEL: `yum install make`
    - macOS: Xcode (`xcode-select --install`), or using Homebrew: `brew install make`
- Check cli.py and the Makefile for the cli definitions:
  - `make --always-make build` - to build and prepare the Docker container for running the project
  - `make update-docker` - to update the project setup on the Docker container
  - `make stage-example-data-set` - to copy the "example" data set files to the Docker container
  - `make inspect` - to inspect the code for PEP-8 issues
  - `make test` - to run the unit tests
Main Classes (please follow the docstrings):
- `MatchMaker` - match_maker.py
- `FeatureEngineering` - feature_engineering.py
- `Prediction` - predict.py
Run the following cli commands in order:

Alias of `train_model` in cli.py:
- Prepares training data for a OneVsRest[*]Classifier
  - "rest" being the closest "n" (based on the Jaccard distance) titles
  - Each "positive" match is trained along with the closest n titles that do not match that title
- Generates `train` and `evaluation` data sets for the train-model cli
- Main features-generation method: `construct_features` (in feature_engineering.py)
- XGBoost training output: `train-auc:0.999979 evaluation-auc:0.999964 train-custom-error:225 evaluation-custom-error:102`
- Evaluation set error matrix for the "example" data:

| Outcome | Count |
| --- | ---: |
| True Positives | 7084 |
| True Negatives | 18673 |
| False Positives | 2 |
| False Negatives | 26 |
- See the definition of `custom_error` in train.py
- Also, see the custom objective function `weighted_log_loss` (a sketch of what such an objective can look like follows this list)
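For illustration, here is a minimal sketch of a false-positive-penalising logistic objective for XGBoost. Only the names `weighted_log_loss` and `FALSE_POSITIVE_PENALTY_FACTOR` come from this repository; the penalty value and the function body are assumptions about how such an objective is typically written, not the project's actual code.

```python
import numpy as np
import xgboost as xgb

FALSE_POSITIVE_PENALTY_FACTOR = 5.0  # assumed value; the real one lives in settings.py


def weighted_log_loss(preds, dtrain):
    """Sketch of a custom XGBoost objective: log loss with extra weight on
    negative examples, so predicting "match" for a non-match (a false
    positive) costs more than missing a true match."""
    labels = dtrain.get_label()
    probs = 1.0 / (1.0 + np.exp(-preds))       # raw margins -> probabilities
    weights = np.where(labels == 0.0, FALSE_POSITIVE_PENALTY_FACTOR, 1.0)
    grad = weights * (probs - labels)          # first-order gradient
    hess = weights * probs * (1.0 - probs)     # second-order gradient
    return grad, hess


# Plugged into training via the low-level API, e.g.:
# booster = xgb.train(params, dtrain, num_boost_round=100, obj=weighted_log_loss)
```

Up-weighting the negatives in this way is what produces a bias against false positives like the one visible in the error matrix above.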
Alias of `generate_predictions` in cli.py:
- The algorithm first looks for exact matches
- Then the nearest "n" matches per (remaining) title are found using the (modified) Jaccard distance
- Next, the nearest matches are "fuzzy" matched with each title
- Finally, the trained model is used to match the remaining titles (a sketch of the full cascade follows below)
- Test set prediction accuracy (run `make get-predictions-accuracy` to calculate the following):
| Outcome | Count |
| --- | ---: |
| Correctly matched titles | 5929 |
| Incorrectly matched titles | 114* |
| Correctly marked as not-found | 3894 |
| Incorrectly marked as not-found | 63 |
\* The model is already biased against "false positives". To get even fewer false positives, tweak the `FALSE_POSITIVE_PENALTY_FACTOR` or `PREDICTION_PROBABILITY_THRESHOLD` settings in settings.py
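To make the cascade concrete, below is a hedged sketch of the four stages. It uses a plain token-set Jaccard distance (the project uses a modified variant), difflib as a stand-in for the fuzzy matcher, and hypothetical names (`match_title`, `model.predict_proba_for`, the 0.95 fuzzy cut-off, the threshold value); none of these are taken from the repository's code.

```python
from difflib import SequenceMatcher

PREDICTION_PROBABILITY_THRESHOLD = 0.9  # assumed value; the real one lives in settings.py


def jaccard_distance(a: str, b: str) -> float:
    """Plain token-set Jaccard distance; the project uses a modified variant."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    if not union:
        return 1.0
    return 1.0 - len(sa & sb) / len(union)


def match_title(title, true_titles, model, n=10):
    # 1. An exact match short-circuits everything else.
    if title in true_titles:
        return title
    # 2. Keep only the nearest n candidates by Jaccard distance.
    candidates = sorted(true_titles, key=lambda t: jaccard_distance(title, t))[:n]
    # 3. Fuzzy-match the candidates; accept a near-certain hit outright.
    best = max(candidates, key=lambda t: SequenceMatcher(None, title, t).ratio())
    if SequenceMatcher(None, title, best).ratio() > 0.95:
        return best
    # 4. Otherwise let the trained model decide; below the probability
    #    threshold the title is marked as not-found.
    probs = model.predict_proba_for(title, candidates)  # hypothetical helper
    best_idx = max(range(len(candidates)), key=lambda i: probs[i])
    if probs[best_idx] < PREDICTION_PROBABILITY_THRESHOLD:
        return None  # marked as not-found
    return candidates[best_idx]
```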
Alias of `closest_search_single_title` in cli.py:
- Predicts the best match using the OneVsRestClassifier against the entire "truth" database (not just the nearest matches); a brief sketch follows
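A correspondingly hedged sketch of a full-database search, reusing the hypothetical `model.predict_proba_for` helper from the block above:

```python
def closest_search_single_title(title, true_titles, model):
    """Sketch: score the title against every entry in the "truth" database,
    rather than only the Jaccard-nearest candidates, and return the best."""
    probs = model.predict_proba_for(title, true_titles)  # hypothetical helper
    best_idx = max(range(len(true_titles)), key=lambda i: probs[i])
    return true_titles[best_idx], probs[best_idx]
```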
- The "example" data set is auto-generated, therefore, it is actually not too hard to get a high accuracy
- The solution produces similar accuracy on a data set with actual human errors as well
- All the computationally expensive tasks run in multi-processing mode
- Those tasks, can therefore, be easily refactored to run on distributed computing clusters
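A minimal sketch of the multi-processing pattern, assuming a module-level `match_title` like the one above (worker functions must be importable for `multiprocessing`); the chunking scheme and all names here are illustrative only:

```python
from multiprocessing import Pool, cpu_count

TRUE_TITLES = [...]   # assumed module-level data, loaded elsewhere
MODEL = ...           # assumed trained model, loaded elsewhere


def match_chunk(chunk):
    """Worker: match one chunk of titles independently of the others."""
    return [match_title(t, TRUE_TITLES, MODEL) for t in chunk]


def parallel_match(titles, workers=None):
    workers = workers or cpu_count()
    # One chunk per worker; each chunk is matched in its own process.
    chunks = [titles[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        results = pool.map(match_chunk, chunks)
    return [match for chunk in results for match in chunk]
```

Because each chunk is matched independently, the same fan-out maps naturally onto a distributed cluster (e.g. one task per chunk).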
To do:
- Extend the README to include more details/explanation of the solution
- Document all the classes/methods in the code
- Write more unit tests
- Refactor code to be more SOLID