Switch List implementation to use Trie-based lookup #134

Open
wants to merge 9 commits into
base: main

Owner

@weppos weppos commented Feb 14, 2017

No description provided.

    ➜  publicsuffix-ruby git:(thesis-trie) ✗ ruby test/profilers/list_profsize.rb
    8061 rules:
     1,631,751   PublicSuffix::List size
       421,868   Size of @rules
     1,313,514   Size of @trie
1. Hash-based trie where children are referenced using a Hash
   and storing one node per word char
2. Hash-based trie (as 1) where each node contains a word char
   and values are stored as Symbol
3. Hash-based trie (as 1) where each node contains a word part
   and values are stored as String
4. Array-based trie where children are referenced using an Array
   and creating a mapping for each letter of the alphabet
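
A minimal sketch of variant 1 may help make the list concrete. It is not the code from this PR: `CharTrie`, `Node`, `insert` and `include?` are hypothetical names chosen for the illustration, and the sketch stores words in their natural (left-to-right) order.

```ruby
# Illustrative only: one node per character, children referenced via a Hash.
# CharTrie and its methods are hypothetical names, not the PR's classes.
class CharTrie
  Node = Struct.new(:children, :terminal)

  def initialize
    @root = Node.new({}, false)
  end

  # Walk the word char by char, creating missing nodes on the way.
  def insert(word)
    node = @root
    word.each_char do |char|
      node = (node.children[char] ||= Node.new({}, false))
    end
    node.terminal = true
    self
  end

  # True only if the exact word was inserted (not just a prefix of one).
  def include?(word)
    node = @root
    word.each_char do |char|
      node = node.children[char]
      return false if node.nil?
    end
    node.terminal
  end
end

trie = CharTrie.new
trie.insert("com").insert("co.uk")
trie.include?("com")  # => true
trie.include?("co")   # => false ("co" is only a prefix of "co.uk")
```

Variant 3 would use the same shape, but each node would hold a whole part such as "co" or "uk" instead of a single character.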

Some caveats:

- 4) doesn't play nicely with an alphabet that contains
  non-ASCII chars, as the mapping would be hard to achieve
- 2) doesn't play nicely with an alphabet that contains
  non-ASCII chars, as there is a risk of memory issues
  with versions of Ruby where Symbols are not garbage collected
- The current list is Unicode (and not Punycode for now), hence
  both 2) and 4) are not usable in practice
- 3) implicitly saves space as there is no need to store the ".":
  silly as it seems, the current list has 8750
  dots (and 8061 rules)
- The memory cost is the cost of the Trie structure AND the cost of the
  strings allocated to store the words (including ".").
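
To illustrate why splitting on parts needs fewer nodes than splitting on characters, here is a rough sketch that only counts nodes. `count_nodes` is a hypothetical helper written for this example, and the three rules are a toy sample, not the real list.

```ruby
# Illustrative only: counts the nodes a Hash-based trie needs for a given
# tokenisation. count_nodes is a hypothetical helper, not part of the gem.
def count_nodes(words)
  root = {}
  count = 0
  words.each do |word|
    node = root
    # The block passed to count_nodes decides how a word is split into tokens.
    yield(word).each do |token|
      unless node.key?(token)
        node[token] = {}
        count += 1
      end
      node = node[token]
    end
  end
  count
end

rules = %w[com co.uk ac.uk]

by_char = count_nodes(rules) { |rule| rule.chars }       # "." is a node too
by_part = count_nodes(rules) { |rule| rule.split(".") }  # "." is never stored

puts "per char: #{by_char} nodes"  # prints "per char: 11 nodes"
puts "per part: #{by_part} nodes"  # prints "per part: 5 nodes"
```

The node count is only a proxy: the actual memory cost also depends on the size of each Hash and of the strings used as keys.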

---

Memory comparison:

    ➜  publicsuffix-ruby git:(thesis-trie) ruby test/profilers/tries_prosize.rb
       943,325   @trie_hash
       598,730   @trie_symbol
       312,361   @trie_parts
     1,627,182   @trie_array

HashTrie:

    Total allocated: 23745976 bytes (333807 objects)
    Total retained:  16647216 bytes (172460 objects)

    allocated memory by gem
    -----------------------------------
      23745936  publicsuffix-ruby/lib
            40  other

    allocated memory by file
    -----------------------------------
      23745936  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
            40  test/profilers/tries_profiler.rb

    allocated memory by location
    -----------------------------------
      12042560  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
       6892920  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
       3516640  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:44
       1293696  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:83
           120  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
            40  test/profilers/tries_profiler.rb:16

    allocated memory by class
    -----------------------------------
      12042560  Hash
       6140656  String
       2297720  Array
       2297680  PublicSuffix::TrieHash::Node
        967320  Enumerator
            40  PublicSuffix::TrieHash

    allocated objects by gem
    -----------------------------------
        333806  publicsuffix-ruby/lib
             1  other

    allocated objects by file
    -----------------------------------
        333806  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
             1  test/profilers/tries_profiler.rb

    allocated objects by location
    -----------------------------------
        172323  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
         87916  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:44
         57442  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
         16122  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:83
             3  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
             1  test/profilers/tries_profiler.rb:16

    allocated objects by class
    -----------------------------------
        153418  String
         57443  Array
         57442  Hash
         57442  PublicSuffix::TrieHash::Node
          8061  Enumerator
             1  PublicSuffix::TrieHash

    retained memory by gem
    -----------------------------------
      16647176  publicsuffix-ruby/lib
            40  other

    retained memory by file
    -----------------------------------
      16647176  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
            40  test/profilers/tries_profiler.rb

    retained memory by location
    -----------------------------------
      12042560  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
       4595280  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
          9296  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:83
            40  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
            40  test/profilers/tries_profiler.rb:16

    retained memory by class
    -----------------------------------
      12042560  Hash
       2306936  String
       2297680  PublicSuffix::TrieHash::Node
            40  PublicSuffix::TrieHash

    retained objects by gem
    -----------------------------------
        172459  publicsuffix-ruby/lib
             1  other

    retained objects by file
    -----------------------------------
        172459  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
             1  test/profilers/tries_profiler.rb

    retained objects by location
    -----------------------------------
        114882  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
         57442  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
           134  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:83
             1  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
             1  test/profilers/tries_profiler.rb:16

    retained objects by class
    -----------------------------------
         57575  String
         57442  Hash
         57442  PublicSuffix::TrieHash::Node
             1  PublicSuffix::TrieHash

    Retained String Report
    -----------------------------------
          6728  "."
          5987  "a"
          4263  "o"
          3636  "i"
          3027  "e"
          3012  "n"
          2918  "u"
          2868  "m"
                ...

HashTrieSymbol:

    Total allocated: 21449376 bytes (276392 objects)
    Total retained:  14350616 bytes (115045 objects)

    allocated memory by gem
    -----------------------------------
      21449336  publicsuffix-ruby/lib
            40  other

    allocated memory by file
    -----------------------------------
      21448296  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
          1040  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_symbol.rb
            40  test/profilers/tries_profiler.rb

    allocated memory by location
    -----------------------------------
      12042560  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
       4595280  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
       3516640  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:44
       1293696  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:83
          1040  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_symbol.rb:9
           120  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
            40  test/profilers/tries_profiler.rb:18

    allocated memory by class
    -----------------------------------
      12042560  Hash
       3843536  String
       2297720  Array
       2297680  PublicSuffix::TrieHashSymbol::Node
        967320  Enumerator
           520  Symbol
            40  PublicSuffix::TrieHashSymbol

    allocated objects by gem
    -----------------------------------
        276391  publicsuffix-ruby/lib
             1  other

    allocated objects by file
    -----------------------------------
        276365  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
            26  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_symbol.rb
             1  test/profilers/tries_profiler.rb

    allocated objects by location
    -----------------------------------
        114882  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
         87916  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:44
         57442  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
         16122  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:83
            26  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_symbol.rb:9
             3  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
             1  test/profilers/tries_profiler.rb:18

    allocated objects by class
    -----------------------------------
         95990  String
         57443  Array
         57442  Hash
         57442  PublicSuffix::TrieHashSymbol::Node
          8061  Enumerator
            13  Symbol
             1  PublicSuffix::TrieHashSymbol

    retained memory by gem
    -----------------------------------
      14350576  publicsuffix-ruby/lib
            40  other

    retained memory by file
    -----------------------------------
      14349536  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
          1040  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_symbol.rb
            40  test/profilers/tries_profiler.rb

    retained memory by location
    -----------------------------------
      12042560  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
       2297640  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
          9296  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:83
          1040  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_symbol.rb:9
            40  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
            40  test/profilers/tries_profiler.rb:18

    retained memory by class
    -----------------------------------
      12042560  Hash
       2297680  PublicSuffix::TrieHashSymbol::Node
          9816  String
           520  Symbol
            40  PublicSuffix::TrieHashSymbol

    retained objects by gem
    -----------------------------------
        115044  publicsuffix-ruby/lib
             1  other

    retained objects by file
    -----------------------------------
        115018  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
            26  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_symbol.rb
             1  test/profilers/tries_profiler.rb

    retained objects by location
    -----------------------------------
         57442  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
         57441  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
           134  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:83
            26  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_symbol.rb:9
             1  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
             1  test/profilers/tries_profiler.rb:18

    retained objects by class
    -----------------------------------
         57442  Hash
         57442  PublicSuffix::TrieHashSymbol::Node
           147  String
            13  Symbol
             1  PublicSuffix::TrieHashSymbol

    Retained String Report
    -----------------------------------
             1  "*.compute-1.amazonaws.com"
             1  "*.compute.amazonaws.com.cn"
             1  "*.githubcloudusercontent.com"
             1  "0"
             1  "1"
                ...

HashTrieParts:

    Total allocated: 6263412 bytes (98963 objects)
    Total retained:  3392172 bytes (43476 objects)

    allocated memory by gem
    -----------------------------------
       6263372  publicsuffix-ruby/lib
            40  other

    allocated memory by file
    -----------------------------------
       3971787  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
       2291585  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_parts.rb
            40  test/profilers/tries_profiler.rb

    allocated memory by location
    -----------------------------------
       2291585  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_parts.rb:29
       2232560  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
       1739107  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
           120  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
            40  test/profilers/tries_profiler.rb:20

    allocated memory by class
    -----------------------------------
       2232560  Hash
       1574772  String
        967320  Enumerator
        909040  Array
        579680  PublicSuffix::TrieHashParts::Node
            40  PublicSuffix::TrieHashParts

    allocated objects by gem
    -----------------------------------
         98962  publicsuffix-ruby/lib
             1  other

    allocated objects by file
    -----------------------------------
         57967  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
         40995  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_parts.rb
             1  test/profilers/tries_profiler.rb

    allocated objects by location
    -----------------------------------
         43472  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
         40995  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_parts.rb:29
         14492  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
             3  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
             1  test/profilers/tries_profiler.rb:20

    allocated objects by class
    -----------------------------------
         39363  String
         22554  Array
         14492  Hash
         14492  PublicSuffix::TrieHashParts::Node
          8061  Enumerator
             1  PublicSuffix::TrieHashParts

    retained memory by gem
    -----------------------------------
       3392132  publicsuffix-ruby/lib
            40  other

    retained memory by file
    -----------------------------------
       3392067  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
            65  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_parts.rb
            40  test/profilers/tries_profiler.rb

    retained memory by location
    -----------------------------------
       2232560  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
       1159467  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
            65  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_parts.rb:29
            40  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
            40  test/profilers/tries_profiler.rb:20

    retained memory by class
    -----------------------------------
       2232560  Hash
        579892  String
        579680  PublicSuffix::TrieHashParts::Node
            40  PublicSuffix::TrieHashParts

    retained objects by gem
    -----------------------------------
         43475  publicsuffix-ruby/lib
             1  other

    retained objects by file
    -----------------------------------
         43474  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb
             1  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_parts.rb
             1  test/profilers/tries_profiler.rb

    retained objects by location
    -----------------------------------
         28981  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:16
         14492  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:8
             1  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash.rb:39
             1  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_hash_parts.rb:29
             1  test/profilers/tries_profiler.rb:20

    retained objects by class
    -----------------------------------
         14492  Hash
         14492  PublicSuffix::TrieHashParts::Node
         14491  String
             1  PublicSuffix::TrieHashParts

    Retained String Report
    -----------------------------------
          1792  "jp"
           756  "no"
           549  "museum"
           370  "it"
           332  "com"
                ...

HashTrieArray:

    Total allocated: 27171176 bytes (276366 objects)
    Total retained:  20072416 bytes (115019 objects)

    allocated memory by gem
    -----------------------------------
      27171136  publicsuffix-ruby/lib
            40  other

    allocated memory by file
    -----------------------------------
      27171136  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb
            40  test/profilers/tries_profiler.rb

    allocated memory by location
    -----------------------------------
      17765400  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:14
       4595280  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:22
       3516640  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:50
       1293696  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:89
           120  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:45
            40  test/profilers/tries_profiler.rb:22

    allocated memory by class
    -----------------------------------
      20063120  Array
       3843016  String
       2297680  PublicSuffix::TrieArray::Node
        967320  Enumerator
            40  PublicSuffix::TrieArray

    allocated objects by gem
    -----------------------------------
        276365  publicsuffix-ruby/lib
             1  other

    allocated objects by file
    -----------------------------------
        276365  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb
             1  test/profilers/tries_profiler.rb

    allocated objects by location
    -----------------------------------
        114882  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:22
         87916  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:50
         57442  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:14
         16122  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:89
             3  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:45
             1  test/profilers/tries_profiler.rb:22

    allocated objects by class
    -----------------------------------
        114885  Array
         95977  String
         57442  PublicSuffix::TrieArray::Node
          8061  Enumerator
             1  PublicSuffix::TrieArray

    retained memory by gem
    -----------------------------------
      20072376  publicsuffix-ruby/lib
            40  other

    retained memory by file
    -----------------------------------
      20072376  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb
            40  test/profilers/tries_profiler.rb

    retained memory by location
    -----------------------------------
      17765400  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:14
       2297640  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:22
          9296  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:89
            40  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:45
            40  test/profilers/tries_profiler.rb:22

    retained memory by class
    -----------------------------------
      17765400  Array
       2297680  PublicSuffix::TrieArray::Node
          9296  String
            40  PublicSuffix::TrieArray

    retained objects by gem
    -----------------------------------
        115018  publicsuffix-ruby/lib
             1  other

    retained objects by file
    -----------------------------------
        115018  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb
             1  test/profilers/tries_profiler.rb

    retained objects by location
    -----------------------------------
         57442  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:14
         57441  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:22
           134  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:89
             1  /Users/weppos/Code/publicsuffix-ruby/lib/public_suffix/trie_array.rb:45
             1  test/profilers/tries_profiler.rb:22

    retained objects by class
    -----------------------------------
         57442  Array
         57442  PublicSuffix::TrieArray::Node
           134  String
             1  PublicSuffix::TrieArray

    Retained String Report
    -----------------------------------
             1  "*.compute-1.amazonaws.com"
             1  "*.compute.amazonaws.com.cn"
             1  "*.githubcloudusercontent.com"
             1  "accident-investigation.aero"
             1  "accident-prevention.aero"
             1  "air-traffic-control.aero"
                ...
In the first iteration I completely missed the point that, since the
domain name system is hierarchical, storing the reversed string or parts
is a good way to increase compression.

In this way strings sharing common suffixes such as:

- io
- github.io
- gitlab.io

will better leverage Trie compression as the space for io will be
shared with the path for the other two suffixes.
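
The effect can be checked on the three suffixes above with a small sketch. `build_trie` is a hypothetical helper for this illustration (a nested Hash keyed by label), not the implementation in this PR.

```ruby
# Illustrative only: builds a nested-Hash trie from label sequences and
# returns it together with the number of nodes created.
def build_trie(label_lists)
  root = {}
  count = 0
  label_lists.each do |labels|
    node = root
    labels.each do |label|
      count += 1 unless node.key?(label)
      node = (node[label] ||= {})
    end
  end
  [root, count]
end

suffixes = %w[io github.io gitlab.io]

_, forward_count  = build_trie(suffixes.map { |s| s.split(".") })
_, reversed_count = build_trie(suffixes.map { |s| s.split(".").reverse })

forward_count   # => 5 (io, github -> io, gitlab -> io)
reversed_count  # => 3 (io -> github, io -> gitlab)
```

In the reversed layout the "io" node exists once and both longer suffixes hang off it, which is where the saving comes from.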

As a result of this change, the size of every trie decreased drastically:

Before:

    ➜  publicsuffix-ruby git:(thesis-trie) ruby test/profilers/tries_prosize.rb
           943,325   @trie_hash
           598,730   @trie_symbol
           312,361   @trie_parts
         1,627,182   @trie_array

After:

    ➜  publicsuffix-ruby git:(thesis-trie) ✗ ruby test/profilers/tries_prosize.rb
       624,813   @trie_hash
       399,660   @trie_symbol
       197,291   @trie_parts
       982,347   @trie_array

    ➜  publicsuffix-ruby git:(thesis-trie) ruby test/profilers/tries_profiler.rb hash

    Total allocated: 17,067,504 bytes (262,240 objects)
    Total retained:  10,433,288 bytes (112,605 objects)

    ➜  publicsuffix-ruby git:(thesis-trie) ruby test/profilers/tries_profiler.rb hash-symbol

    Total allocated: 15,567,184 bytes (224,732 objects)
    Total retained:  8,932,968 bytes (75,097 objects)

    ➜  publicsuffix-ruby git:(thesis-trie) ruby test/profilers/tries_profiler.rb hash-parts

    Total allocated: 7,388,993 bytes (130,792 objects)
    Total retained:  1,438,762 bytes (24,513 objects)

    ➜  publicsuffix-ruby git:(thesis-trie) ruby test/profilers/tries_profiler.rb array

    Total allocated: 18,700,776 bytes (224,706 objects)
    Total retained:  12,066,560 bytes (75,071 objects)
    ➜  publicsuffix-ruby git:(thesis-trie) ✗ ruby test/profilers/tries_prosize.rb
       263,451   @rules
       194,536   @trie
Change the Trie to store associative key/value pairs instead of a plain set of words. The key is the rule; the value is the rule's metadata.
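
As a rough sketch of what an associative trie could look like, assuming labels are stored right to left: `RuleTrie`, its `[]`/`[]=` interface and the `private:` metadata key are all hypothetical and only serve the example.

```ruby
# Illustrative only: a trie used as a key/value store. The key is the rule,
# the value is whatever metadata the caller attaches to it.
class RuleTrie
  Node = Struct.new(:children, :value)

  def initialize
    @root = Node.new({}, nil)
  end

  # Store the value under the rule, walking its labels from right to left.
  def []=(rule, value)
    node = @root
    rule.split(".").reverse_each do |label|
      node = (node.children[label] ||= Node.new({}, nil))
    end
    node.value = value
  end

  # Return the stored metadata, or nil when the rule is unknown.
  def [](rule)
    node = @root
    rule.split(".").reverse_each do |label|
      node = node.children[label]
      return nil if node.nil?
    end
    node.value
  end
end

trie = RuleTrie.new
trie["io"] = { private: false }
trie["github.io"] = { private: true }

trie["github.io"][:private]  # => true
trie["gitlab.io"]            # => nil
```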

    ➜  publicsuffix-ruby git:(thesis-trie) ✗ ruby test/profilers/list_profsize.rb
    8061 rules:
       482,019   PublicSuffix::List size
       263,451   Size of @rules
       307,630   Size of @trie

It looks like the Hash is still a little bit smaller than the Trie.
Merge the entry into the trie node. This also makes it possible to drop the
"length" attribute of the entry, which the trie does not need because the
length can already be determined from the node's level in the tree.
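
A sketch of the idea, under the assumption that an entry carries a type and a private flag: the attributes sit directly on the node, and the length is recovered by counting levels during the walk. The `Node` fields and the `insert`/`find` helpers are hypothetical names for this illustration.

```ruby
# Illustrative only: rule attributes live on the node itself. The rule
# length is not stored because it equals the depth reached by the lookup.
Node = Struct.new(:children, :type, :private)

def insert(root, rule, type:, is_private:)
  node = root
  rule.split(".").reverse_each do |label|
    node = (node.children[label] ||= Node.new({}, nil, nil))
  end
  node.type = type
  node.private = is_private
end

# Returns [node, length] for an exact rule, or nil when it is not present.
def find(root, rule)
  node = root
  depth = 0
  rule.split(".").reverse_each do |label|
    node = node.children[label]
    return nil if node.nil?
    depth += 1
  end
  # A node without a type is only an intermediate step, not a rule.
  node.type ? [node, depth] : nil
end

root = Node.new({}, nil, nil)
insert(root, "co.uk", type: :normal, is_private: false)

_node, length = find(root, "co.uk")
length            # => 2
find(root, "uk")  # => nil ("uk" is only an intermediate node here)
```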

Before:

    ➜  publicsuffix-ruby git:(thesis-trie) ✗ ruby test/profilers/list_profsize.rb
    8061 rules:
       482,019   PublicSuffix::List size
       263,451   Size of @rules
       307,630   Size of @trie

After:

    ➜  publicsuffix-ruby git:(thesis-trie) ✗ ruby test/profilers/list_profsize.rb
    8061 rules:
       490,391   PublicSuffix::List size
       263,451   Size of @rules
       226,985   Size of @trie

The trie is now beating the Hash by ~40kb.
This commit handles wildcards and exceptions, and passes all the tests.
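
For readers unfamiliar with the rule types, here is a simplified sketch of how a lookup over reversed labels could honour wildcard rules (such as "*.ck") and exception rules (such as "!www.ck"). It is an assumption-laden illustration, not the PR's algorithm: `add_rule` and `public_suffix` are hypothetical names, the implicit default "*" rule is not modelled, and wildcards are assumed to appear only as the leftmost label.

```ruby
# Illustrative only: wildcard and exception handling on a parts trie.
Node = Struct.new(:children, :type)

def add_rule(root, rule)
  exception = rule.start_with?("!")
  rule = rule[1..-1] if exception
  node = root
  rule.split(".").reverse_each do |label|
    node = (node.children[label] ||= Node.new({}, nil))
  end
  node.type = exception ? :exception : :normal
end

# Returns the public suffix of the domain, or nil when no rule matches.
def public_suffix(root, domain)
  labels = domain.split(".").reverse
  node = root
  best = nil
  labels.each_with_index do |label, index|
    # A "*" child matches any label at this level.
    best = index + 1 if node.children.key?("*")
    node = node.children[label]
    break if node.nil?
    # An exception rule wins and yields the rule minus its leftmost label.
    return labels.first(index).reverse.join(".") if node.type == :exception
    best = index + 1 if node.type == :normal
  end
  best && labels.first(best).reverse.join(".")
end

root = Node.new({}, nil)
%w[com *.ck !www.ck].each { |rule| add_rule(root, rule) }

public_suffix(root, "example.com")  # => "com"
public_suffix(root, "a.foo.ck")     # => "foo.ck"
public_suffix(root, "www.ck")       # => "ck"
```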

---

Benchmark Hash vs Trie:

    ➜  publicsuffix-ruby git:(thesis-trie) ✗ WHAT=hash ruby test/benchmarks/bm_find.rb
    Rehearsal -------------------------------------------------------------
    NAME_SHORT                  0.530000   0.010000   0.540000 (  0.540001)
    NAME_MEDIUM                 0.600000   0.000000   0.600000 (  0.608115)
    NAME_LONG                   0.780000   0.010000   0.790000 (  0.796897)
    NAME_WILD                   0.900000   0.020000   0.920000 (  0.961535)
    NAME_EXCP                   1.020000   0.020000   1.040000 (  1.094007)
    IAAA                        0.620000   0.010000   0.630000 (  0.649537)
    IZZZ                        0.590000   0.000000   0.590000 (  0.604190)
    PAAA                        1.030000   0.020000   1.050000 (  1.082507)
    PZZZ                        0.970000   0.020000   0.990000 (  1.009199)
    JP                          0.920000   0.010000   0.930000 (  0.939533)
    IT                          0.610000   0.010000   0.620000 (  0.618309)
    COM                         0.630000   0.000000   0.630000 (  0.642974)
    ---------------------------------------------------- total: 9.330000sec

                                    user     system      total        real
    NAME_SHORT                  0.580000   0.010000   0.590000 (  0.592958)
    NAME_MEDIUM                 0.680000   0.010000   0.690000 (  0.698372)
    NAME_LONG                   0.820000   0.010000   0.830000 (  0.830893)
    NAME_WILD                   0.810000   0.010000   0.820000 (  0.831984)
    NAME_EXCP                   0.960000   0.010000   0.970000 (  0.981469)
    IAAA                        0.600000   0.010000   0.610000 (  0.611947)
    IZZZ                        0.610000   0.000000   0.610000 (  0.626348)
    PAAA                        0.970000   0.020000   0.990000 (  0.982282)
    PZZZ                        0.990000   0.010000   1.000000 (  1.012680)
    JP                          0.940000   0.010000   0.950000 (  0.954031)
    IT                          0.610000   0.010000   0.620000 (  0.627587)
    COM                         0.620000   0.010000   0.630000 (  0.636131)

    ➜  publicsuffix-ruby git:(thesis-trie) ✗ WHAT=trie ruby test/benchmarks/bm_find.rb
    Rehearsal -------------------------------------------------------------
    NAME_SHORT                  0.700000   0.010000   0.710000 (  0.722887)
    NAME_MEDIUM                 0.750000   0.010000   0.760000 (  0.767034)
    NAME_LONG                   0.790000   0.010000   0.800000 (  0.802235)
    NAME_WILD                   0.770000   0.010000   0.780000 (  0.786366)
    NAME_EXCP                   0.810000   0.010000   0.820000 (  0.832109)
    IAAA                        0.680000   0.000000   0.680000 (  0.690577)
    IZZZ                        0.690000   0.010000   0.700000 (  0.694839)
    PAAA                        0.810000   0.010000   0.820000 (  0.826133)
    PZZZ                        0.790000   0.010000   0.800000 (  0.803508)
    JP                          0.830000   0.000000   0.830000 (  0.855188)
    IT                          0.710000   0.010000   0.720000 (  0.714962)
    COM                         0.670000   0.010000   0.680000 (  0.687400)
    ---------------------------------------------------- total: 9.100000sec

                                    user     system      total        real
    NAME_SHORT                  0.690000   0.010000   0.700000 (  0.706099)
    NAME_MEDIUM                 0.730000   0.010000   0.740000 (  0.749351)
    NAME_LONG                   0.750000   0.010000   0.760000 (  0.765484)
    NAME_WILD                   0.770000   0.010000   0.780000 (  0.781182)
    NAME_EXCP                   0.800000   0.000000   0.800000 (  0.815244)
    IAAA                        0.670000   0.010000   0.680000 (  0.682966)
    IZZZ                        0.670000   0.010000   0.680000 (  0.682771)
    PAAA                        0.830000   0.010000   0.840000 (  0.847581)
    PZZZ                        0.810000   0.010000   0.820000 (  0.829023)
    JP                          0.810000   0.000000   0.810000 (  0.831782)
    IT                          0.680000   0.010000   0.690000 (  0.691071)
    COM                         0.660000   0.010000   0.670000 (  0.669978)
Ruby allocates a noticeable amount of memory even for an empty Hash.
Do not allocate the children Hash until it is needed, so that
nodes with no children do not use unnecessary extra memory.
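
The pattern can be sketched as follows. `LazyNode` and its methods are hypothetical names used only for this illustration; the node class in the PR may be organised differently.

```ruby
# Illustrative only: the children Hash is created on first write, so leaf
# nodes never allocate one.
class LazyNode
  def initialize
    @children = nil
  end

  # Reading never allocates: a missing Hash simply means "no children".
  def child(label)
    @children && @children[label]
  end

  # The Hash is allocated only when the first child is added.
  def add_child(label)
    @children ||= {}
    @children[label] ||= LazyNode.new
  end

  def leaf?
    @children.nil?
  end
end

root = LazyNode.new
root.add_child("io").add_child("github")

root.leaf?                              # => false
root.child("io").child("github").leaf?  # => true (no Hash allocated for it)
root.child("com")                       # => nil
```

Since most nodes in a suffix trie are leaves, skipping the empty Hash for each of them is where the saving would come from.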

Pre-initialize children:

    226,985   Size of @trie

    ➜  publicsuffix-ruby git:(thesis-trie) ruby test/profilers/initialization_profiler.rb
    Total allocated: 8950176 bytes (117512 objects)
    Total retained:  2475538 bytes (40477 objects)

Lazy-initialize children:

    219,329   Size of @trie

    ➜  publicsuffix-ruby git:(thesis-trie) ✗ ruby test/profilers/initialization_profiler.rb
    Total allocated: 8643936 bytes (109856 objects)
    Total retained:  2169298 bytes (32821 objects)

pzb commented Jun 11, 2017

As a historical note, I was hacking on something similar a while ago, but never got around to integrating it with the gem. https://gist.github.com/pzb/5aba13a67bd9fa64b3769397c842889b is what I had. It is way faster than the existing gem but is missing support for dynamically enabling/disabling the private section.

Owner Author

weppos commented Jun 15, 2017

Thanks for the feedback @pzb

This PR, along with #133, was the result of research I did as part of my degree thesis. I must say that the results achieved with #133 are already stunning compared with the existing gem, and I am planning to release it as soon as I can.

Sadly, I merged it a while ago but there is a lot of extra work (mostly docs and deprecation info) I have to complete before releasing it as a major version.

You can already test it using master instead of the released gem. The library now works in constant time, whereas before it was still linear time (although optimized).

The tree-based version in this PR is a few milliseconds slower than the hash-based one, but it saves some extra bytes of allocation. That's why I was considering merging it as well.

The good news is that both this PR and #133 allow dynamic modification of the list. I worked on a DAWG/DAFSA version that was even more lightweight, but it did not allow dynamic modification of the list, hence I discarded it for now.

If you have the chance, take a look at #134 and try the version in master that already includes that PR. I believe you'll be very happy about the improvements. :)
