duplitector

A duplicate data detector engine based on Elasticsearch. It has been used successfully as a proof of concept, piloting a full-blown enterprise solution.

Context

Some systems have to deal with large amounts of low-quality data: typos, malformed or missing fields, and erroneous bits of information, sometimes coming from different sources such as careless humans, faulty sensors, or multiple external data providers. Datasets of this kind often contain vast numbers of duplicate or similar entries. When that happens, the system may struggle with these unnatural, often unforeseen conditions, which can in turn degrade the quality of service it delivers.

This project is meant to be a playground for developing a deduplication algorithm, and is currently aimed at the domain of various sorts of organizations (e.g. NPO databases). Still, it is small and generic enough to be easily adjusted to other data schemas or data sources.

The repository contains a set of crafted organizations and their duplicates (partially fetched from the IRS, partially intentionally modified, partially made up), so that it's convenient to test the algorithm's pieces.

How do I run this thing?

Requires:

ruby 1.9+ (tested on ruby 1.9.3), RubyGems with bundler
elasticsearch server running on localhost:9200 (configurable)

$ git clone https://github.com/pawelrychlik/duplitector.git
$ cd duplitector
$ bundle install
$ ruby lib/duplitector.rb

Configuration via command-line arguments:

$ ruby lib/duplitector.rb --help
Options:
   --filename, -f <s>:   Path to test-data filename (default: data/FoundationCenter.txt)
      --count, -c <i>:   Number of test entries to process
        --url, -u <s>:   URL to elasticsearch server (default: http://localhost:9200)
  --threshold, -t <f>:   elasticsearch scoring threshold for differentiating between a duplicate and a unique item
                         (default: 1.0)
      --index, -i <s>:   Name of elasticsearch index to use (default: duplitector)
        --verbose, -v:   Prints more information
           --help, -h:   Show this message
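
To illustrate how the --threshold option could separate duplicates from unique items, here is a minimal sketch. The `duplicate?` helper is hypothetical and not taken from the project's code, and the comparison (a score at or above the threshold counts as a duplicate) is an assumption.

```ruby
# Hypothetical helper: decide whether the best search hit counts as a duplicate.
# `scores` holds the relevance scores returned by an Elasticsearch query;
# `threshold` corresponds to the --threshold option (default: 1.0).
def duplicate?(scores, threshold = 1.0)
  highest = scores.max
  # No hits at all means there is nothing to merge into.
  return false if highest.nil?
  # Assumption: a score at or above the threshold marks a duplicate.
  highest >= threshold
end

puts duplicate?([3.485476, 0.42])  # best hit clears the default threshold
puts duplicate?([])                # no hits, so the item is treated as unique
puts duplicate?([0.8], 1.0)        # best hit falls below the threshold
```

Raising the threshold makes the detector stricter (fewer merges), while lowering it merges more aggressively.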

Example output

(Cut for the sake of brevity).

Processing: {"id"=>"00-2237333", "name"=>"Lincoln Loan Fund", "type"=>"SOUNK", "city"=>"Fayetteville", "state"=>"AR", "country"=>"United States", "gov_id1"=>"EIN:002237333", "group_id"=>"18"}
No potential duplicates found. Highest score: 
Created new organization: {"id"=>["00-2237333"], "name"=>["Lincoln Loan Fund"], "type"=>["SOUNK"], "city"=>["Fayetteville"], "state"=>["AR"], "country"=>["United States"], "gov_id1"=>["EIN:002237333"], "group_id"=>["18"], "es_id"=>99}

Processing: {"id"=>"01-0140283", "name"=>"Pine Grove Cemetery Association", "type"=>"EO", "city"=>"Brunswick", "state"=>"ME", "gov_id1"=>"EIN:010140283", "group_id"=>"48"}
Found potential duplicates. Highest score: 3.485476
Merged duplicates into an existing organization: {"id"=>["01-0140283", "01-0140283"], "name"=>["Pine Grove Cemetery Association", "Pine Grove Cemetery Association"], "type"=>["EO", "EO"], "city"=>["Brunswick", "Brunswick"], "state"=>["ME", "ME"], "gov_id1"=>["EIN:010140283", "EIN:010140283"], "group_id"=>["48", "48"], "country"=>["United States"], "es_id"=>66}

FAIL: "group_id"=>"98" assigned to  2 orgs.
OK: "group_id"=>"11" assigned to  1 orgs.
OK: "group_id"=>"10" assigned to  1 orgs.
Duplicate resolution: OK=99, FAIL=1, ERROR=0.

Done. Stats: Organizations created: 101, Organizations resolved as duplicates: 99
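
The record shapes in the output above suggest that each organization keeps an array of values per field, and that the final report checks how many organizations each group_id landed in. The sketch below reproduces that behaviour under those assumptions. The helper names (`new_organization`, `merge_duplicate`, `check_groups`) are hypothetical and are not the project's actual implementation.

```ruby
# Wrap every field value in an array, so later duplicates can be appended.
def new_organization(record, es_id)
  org = {}
  record.each { |key, value| org[key] = [value] }
  org.merge("es_id" => es_id)
end

# Append each field of `record` to the matching array in `org`.
# Fields absent from `record` are left untouched; "es_id" stays a scalar.
def merge_duplicate(org, record)
  merged = org.dup
  record.each do |key, value|
    merged[key] = (org[key] || []) + [value]
  end
  merged
end

# Count how many organizations each group_id ended up in.
# Assumption: a group is resolved correctly when it maps to exactly one organization.
def check_groups(orgs)
  counts = Hash.new(0)
  orgs.each do |org|
    (org["group_id"] || []).uniq.each { |group_id| counts[group_id] += 1 }
  end
  counts.map { |group_id, n| [group_id, n == 1 ? "OK" : "FAIL"] }
end

first  = { "id" => "01-0140283", "country" => "United States", "group_id" => "48" }
second = { "id" => "01-0140283", "group_id" => "48" }

org = merge_duplicate(new_organization(first, 66), second)
p org
p check_groups([org])
```

Keeping every observed value, rather than overwriting it, preserves the variants seen across duplicates, which is convenient when inspecting why two entries were merged.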

Useful resources on the subject of deduplication:

an article by Andrei Zmievski
