
DoppelSpeller

Finds the best match (in a database of titles) for a misspelled title, using a combination of Machine Learning and NLP techniques.

Project description

Challenges:

  • Matching search terms against a database of millions of "true" titles (for example, company names) can be computationally expensive
    • For one data set, the current implementation matches 100,000 titles against 500,000 true titles in around 10 minutes - i.e. around 10,000 matches per minute
  • Human beings can be really creative and keep coming up with new ways to misspell words in a title

Setup

  • Pre-requisites:
    • Install Docker (tested on engine v3.7)
    • Install make:
      • Windows: Install Cygwin (on the package-selection screen, find make and select it)
      • Debian: apt-get install build-essential
      • RHEL: yum install make
      • macOS: Xcode (xcode-select --install) or Homebrew (brew install make)
  • Check cli.py and Makefile for the CLI definitions
  • make --always-make build - to build and prepare the Docker container for running the project
  • make update-docker - to update the project setup on the Docker container
  • make stage-example-data-set - to copy the "example" data set files to the Docker container
  • make inspect - inspect the code for PEP-8 issues
  • make test - run the unit tests
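Once the prerequisites are installed, a typical end-to-end session might look like this, chaining the commands above with the CLI commands explained in the next section:

    make --always-make build
    make stage-example-data-set
    make train-model
    make generate-predictions
    make get-predictions-accuracy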

Explanation

For the main classes, please follow the docstrings in the code.

Run the following CLI commands in order:

make train-model

Alias of train_model in cli.py

  • Prepares training data for a OneVsRest-style classifier - the "rest" being the closest n titles (based on the Jaccard distance)
  • Each "positive" match is trained along with the closest n titles that do not match that title
  • Generates the train and evaluation data sets for the train-model CLI
  • Main feature-generation method: construct_features (in feature_engineering.py)
  • XGBoost training output: train-auc:0.999979 evaluation-auc:0.999964 train-custom-error:225 evaluation-custom-error:102
  • Evaluation-set confusion matrix for the "example" data:
True Positives          7084
True Negatives          18673
False Positives         2
False Negatives         26
  • See the definition of custom_error in train.py
    • Also see the custom objective function weighted_log_loss (a sketch of the idea follows below)
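The actual weighted_log_loss is defined in train.py. As an illustration only, here is a minimal sketch of the general idea - a custom XGBoost binary objective whose gradient is up-weighted on negative examples, so that false positives cost more. The constant name FALSE_POSITIVE_PENALTY_FACTOR is borrowed from settings.py; everything else is an assumption, not the project's actual code:

    import numpy as np
    import xgboost as xgb

    FALSE_POSITIVE_PENALTY_FACTOR = 5.0  # illustrative value; see settings.py

    def weighted_log_loss(preds, dtrain):
        """Sketch of a custom binary objective: log loss whose gradient
        is up-weighted on negatives, so false positives cost more."""
        labels = dtrain.get_label()
        probabilities = 1.0 / (1.0 + np.exp(-preds))  # sigmoid of raw margins
        # Weight 1 for positives, FALSE_POSITIVE_PENALTY_FACTOR for negatives
        weights = np.where(labels == 1, 1.0, FALSE_POSITIVE_PENALTY_FACTOR)
        gradient = weights * (probabilities - labels)
        hessian = weights * probabilities * (1.0 - probabilities)
        return gradient, hessian

    # Usage: xgb.train(params, dtrain, num_boost_round=100, obj=weighted_log_loss)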

make generate-predictions

Alias of generate_predictions in cli.py

  • The algorithm first looks for exact matches
  • Then the nearest n matches per remaining title are found using the (modified) Jaccard distance (see the sketch after the accuracy figures below)
  • Next, the nearest matches are "fuzzy" matched with each title
  • Finally, the trained model is used to match the remaining titles
  • Test-set prediction accuracy (run make get-predictions-accuracy to calculate the following)
Accuracy for the "example test" data set:
Correctly matched titles            5929
Incorrectly matched titles          114*
Correctly marked as not-found       3894
Incorrectly marked as not-found     63

* The model is already biased against "false positives". To have even fewer false positives, tweak the FALSE_POSITIVE_PENALTY_FACTOR or PREDICTION_PROBABILITY_THRESHOLD settings in settings.py
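The "modified" Jaccard distance itself lives in the code; as a rough, unmodified illustration, plain Jaccard distance over character 3-grams looks like this (all names below are made up for the example):

    def ngrams(title, n=3):
        """Set of character n-grams for a (lower-cased) title."""
        title = title.lower()
        return {title[i:i + n] for i in range(max(len(title) - n + 1, 1))}

    def jaccard_distance(title_a, title_b, n=3):
        """1 - |intersection| / |union| of the two n-gram sets."""
        grams_a, grams_b = ngrams(title_a, n), ngrams(title_b, n)
        return 1.0 - len(grams_a & grams_b) / len(grams_a | grams_b)

    def nearest_candidates(misspelled, true_titles, top_n=10):
        """Return the top_n closest true titles for one misspelled title."""
        return sorted(true_titles, key=lambda t: jaccard_distance(misspelled, t))[:top_n]

Ranking by such a distance means the fuzzy matching and the trained model only ever score a handful of candidates per title, instead of all 500,000 true titles.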

make closest-search-single-title title='PRO teome plc SCIs'

Alias of closest_search_single_title in cli.py

  • Predicts the best match by scoring against the entire "truth" database (not just the nearest matches), using the trained OneVsRest-style classifier

NOTES

  • The "example" data set is auto-generated, therefore, it is actually not too hard to get a high accuracy
    • The solution produces similar accuracy on a data set with actual human errors as well
  • All the computationally expensive tasks run in multi-processing mode
    • Those tasks, can therefore, be easily refactored to run on distributed computing clusters
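As a minimal sketch of that multi-processing pattern - assuming a hypothetical match_chunk worker, not the project's actual code:

    from multiprocessing import Pool, cpu_count

    def match_chunk(chunk):
        # Hypothetical worker: stands in for one expensive task, e.g.
        # finding the nearest candidates for every title in the chunk
        return [title.lower() for title in chunk]

    def match_all(titles, processes=None):
        processes = processes or cpu_count()
        # One chunk per process; the chunks are processed in parallel
        chunk_size = max(len(titles) // processes, 1)
        chunks = [titles[i:i + chunk_size] for i in range(0, len(titles), chunk_size)]
        with Pool(processes) as pool:
            per_chunk = pool.map(match_chunk, chunks)
        return [result for chunk in per_chunk for result in chunk]

Because each chunk is independent, the same split can be handed to a distributed computing framework instead of a local Pool.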

TODO

  • Extend README to include more details/explanation of the solution
  • Document all the classes/methods in the code
  • Write more unit tests
  • Refactor code to be more SOLID