pawelrychlik/duplitector

duplitector

A duplicate data detector engine based on Elasticsearch. It has been successfully used as a proof of concept, piloting a full-blown enterprise solution.

Context

In certain systems we have to deal with lots of low-quality data: records with typos, malformed or missing fields, and erroneous bits of information, sometimes coming from different sources such as careless humans, faulty sensors, or multiple external data providers. Datasets of this kind often contain vast numbers of duplicate or similar entries. When that is the case, these systems may struggle with such unnatural, often unforeseen, conditions, which can in turn degrade the quality of service they deliver.

This project is meant to be a playground for developing a deduplication algorithm, and is currently aimed at the domain of organizations of various sorts (e.g. NPO databases). Still, it is small and generic enough to be easily adjusted to handle other data schemas or data sources.

The repository contains a set of crafted organizations and their duplicates (partially fetched from the IRS, partially intentionally modified, partially made up), which makes it convenient to test the individual pieces of the algorithm.

How do I run this thing?

Requires:

  • ruby 1.9+ (tested on ruby 1.9.3), RubyGems with bundler
  • elasticsearch server running on localhost:9200 (configurable)

$ git clone https://github.com/pawelrychlik/duplitector.git
$ cd duplitector
$ bundle install
$ ruby lib/duplitector.rb

Configuration via command-line arguments:

$ ruby lib/duplitector.rb --help
Options:
   --filename, -f <s>:   Path to test-data filename (default: data/FoundationCenter.txt)
      --count, -c <i>:   Number of test entries to process
        --url, -u <s>:   URL to elasticsearch server (default: http://localhost:9200)
  --threshold, -t <f>:   elasticsearch scoring threshold for differentiating between a duplicate and a unique item
                         (default: 1.0)
      --index, -i <s>:   Name of elasticsearch index to use (default: duplitector)
        --verbose, -v:   Prints more information
           --help, -h:   Show this message
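
To illustrate what the --threshold option controls, here is a minimal sketch of the duplicate-vs-unique decision. It is not taken from the project's source: the method name `duplicate?` is hypothetical, and treating a score equal to the threshold as a duplicate is an assumption. A search with no hits is modelled as a `nil` score.

```ruby
# Hypothetical helper, not the project's actual code.
# top_score: highest Elasticsearch score among the search hits,
#            or nil when the search returned no hits.
# threshold: same meaning as the --threshold option (default: 1.0).
def duplicate?(top_score, threshold = 1.0)
  # Assumption: a score equal to the threshold still counts as a duplicate.
  !top_score.nil? && top_score >= threshold
end

puts duplicate?(3.485476) # score above the default threshold
puts duplicate?(nil)      # no hits at all
puts duplicate?(0.4)      # a hit, but scored below the threshold
```

Raising the threshold makes the detector more conservative (fewer merges, more organizations created); lowering it does the opposite.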

Example output

(Cut for the sake of brevity).

Processing: {"id"=>"00-2237333", "name"=>"Lincoln Loan Fund", "type"=>"SOUNK", "city"=>"Fayetteville", "state"=>"AR", "country"=>"United States", "gov_id1"=>"EIN:002237333", "group_id"=>"18"}
No potential duplicates found. Highest score: 
Created new organization: {"id"=>["00-2237333"], "name"=>["Lincoln Loan Fund"], "type"=>["SOUNK"], "city"=>["Fayetteville"], "state"=>["AR"], "country"=>["United States"], "gov_id1"=>["EIN:002237333"], "group_id"=>["18"], "es_id"=>99}

Processing: {"id"=>"01-0140283", "name"=>"Pine Grove Cemetery Association", "type"=>"EO", "city"=>"Brunswick", "state"=>"ME", "gov_id1"=>"EIN:010140283", "group_id"=>"48"}
Found potential duplicates. Highest score: 3.485476
Merged duplicates into an existing organization: {"id"=>["01-0140283", "01-0140283"], "name"=>["Pine Grove Cemetery Association", "Pine Grove Cemetery Association"], "type"=>["EO", "EO"], "city"=>["Brunswick", "Brunswick"], "state"=>["ME", "ME"], "gov_id1"=>["EIN:010140283", "EIN:010140283"], "group_id"=>["48", "48"], "country"=>["United States"], "es_id"=>66}
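
The two records above suggest how an organization is stored: every field holds an array of all values seen so far, and a merged duplicate appends its values to those arrays. The sketch below reproduces that shape under stated assumptions. The method names `new_organization` and `merge_into` are hypothetical, and the `es_id` bookkeeping visible in the output is left out.

```ruby
# Hypothetical helpers, not the project's actual code.

# Build a new organization by wrapping every field value in an array.
def new_organization(record)
  Hash[record.map { |field, value| [field, [value]] }]
end

# Append each field value of a duplicate record to the matching array
# of an existing organization. Fields missing from the record (such as
# "country" in the example above) are left untouched.
def merge_into(organization, record)
  record.each do |field, value|
    (organization[field] ||= []) << value
  end
  organization
end

org = new_organization("id" => "01-0140283", "state" => "ME", "country" => "United States")
merge_into(org, "id" => "01-0140283", "state" => "ME")
p org
```

Keeping every observed value, instead of overwriting, means that spelling variants of a name or city stay available for matching later records.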

FAIL: "group_id"=>"98" assigned to  2 orgs.
OK: "group_id"=>"11" assigned to  1 orgs.
OK: "group_id"=>"10" assigned to  1 orgs.
Duplicate resolution: OK=99, FAIL=1, ERROR=0.

Done. Stats: Organizations created: 101, Organizations resolved as duplicates: 99
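
The OK/FAIL lines above read as a check that every group_id from the test data ends up in exactly one resulting organization. A minimal sketch of such a check follows, assuming organizations are hashes with a "group_id" array as in the output above. The method name `check_groups` is hypothetical, and the ERROR category from the summary line is not modelled because its meaning is not documented here.

```ruby
# Hypothetical helper, not the project's actual code.
# organizations: array of hashes, each with a "group_id" array.
# Returns a hash mapping each group_id to "OK" or "FAIL".
def check_groups(organizations)
  orgs_per_group = Hash.new(0)
  organizations.each do |org|
    # Count each group_id at most once per organization.
    org.fetch("group_id", []).uniq.each { |group_id| orgs_per_group[group_id] += 1 }
  end
  # A group is resolved correctly when it lives in exactly one organization.
  Hash[orgs_per_group.map { |group_id, count| [group_id, count == 1 ? "OK" : "FAIL"] }]
end

organizations = [
  { "group_id" => ["98"] },
  { "group_id" => ["98", "98"] },
  { "group_id" => ["11", "11"] }
]
check_groups(organizations).each do |group_id, status|
  puts "#{status}: group_id #{group_id}"
end
```

A FAIL here means the detector missed a duplicate and split one group across several organizations. Note that this check alone does not catch the opposite mistake, two different groups merged into one organization.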

Useful resources on the subject of deduplication:
