Practical Examples Using Python
This page describes practical examples that use a large-scale dataset with the two primary NGT graph types, ANNG and ONNG.
To search a large-scale dataset, first prepare a dataset that NGT can index. Download the fastText dataset as follows.
curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip
unzip wiki-news-300d-1M-subword.vec.zip
Convert the downloaded dataset to a format that the sample scripts can read by using the following script.
# dataset.py
with open('wiki-news-300d-1M-subword.vec', 'r') as fi,\
     open('objects.tsv', 'w') as fov, open('words.tsv', 'w') as fow:
    n, dim = map(int, fi.readline().split())
    fov.write('{0}\t{1}\n'.format(n, dim))
    for line in fi:
        tokens = line.rstrip().split(' ')
        fow.write(tokens[0] + '\n')
        fov.write('{0}\n'.format('\t'.join(tokens[1:])))
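To sanity-check the conversion before processing the full file, the minimal sketch below applies the same line-by-line logic to a small in-memory sample. The helper name `convert_vec_lines`, the dimension check, and the sample vectors are hypothetical and are not part of the sample scripts.

```python
def convert_vec_lines(lines):
    """Split fastText .vec lines into a header, a word list, and TSV rows.

    Mirrors the logic of dataset.py, but reads from an iterable of strings
    instead of files so that it can be tried on a tiny sample.
    """
    it = iter(lines)
    n, dim = map(int, next(it).split())  # header line: "<count> <dimension>"
    words, rows = [], []
    for line in it:
        tokens = line.rstrip().split(' ')
        if len(tokens) - 1 != dim:
            # Extra check (not in dataset.py): catch malformed lines early.
            raise ValueError('expected {} values, got {}'.format(dim, len(tokens) - 1))
        words.append(tokens[0])             # would go to words.tsv
        rows.append('\t'.join(tokens[1:]))  # would go to objects.tsv
    return (n, dim), words, rows


# Hypothetical three-dimensional sample in the .vec layout.
sample = ['2 3\n', 'cat 0.1 0.2 0.3\n', 'dog 0.4 0.5 0.6\n']
header, words, rows = convert_vec_lines(sample)
print(header, words, rows)
```

Each entry of `rows` corresponds to one line of objects.tsv after the header, and `words` holds the matching lines of words.tsv in the same order.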
Below is an example of how to construct an ANNG that uses cosine similarity as the distance measure.
# create-anng.py
import ngtpy

index_path = 'fasttext.anng'
with open('objects.tsv', 'r') as fin:
    n, dim = map(int, fin.readline().split())
    ngtpy.create(index_path, dim, distance_type='Cosine') # create an empty index
    index = ngtpy.Index(index_path) # open the index
    for line in fin:
        object = list(map(float, line.rstrip().split('\t')))
        index.insert(object) # insert objects
index.build_index() # build the index
index.save() # save the index
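For intuition about the distance values that an index created with distance_type='Cosine' reports, the sketch below computes a cosine distance in plain Python. It assumes the distance is defined as one minus the cosine similarity; this is an illustration rather than ngtpy's implementation, and `cosine_distance` is a hypothetical helper.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(a, b); 0.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError('cosine distance is undefined for zero vectors')
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Under this assumption the vector length does not affect the distance, so only the direction of each word vector matters.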
The ANNG can be searched with a query by using the following script.
# search.py
import ngtpy

with open('words.tsv', 'r') as fin:
    words = list(map(lambda x: x.rstrip('\n'), fin.readlines()))

index = ngtpy.Index('fasttext.anng') # open the index
query_id = 10000
query_object = index.get_object(query_id) # get the object

result = index.search(query_object, epsilon = 0.10) # approximate nearest neighbor search

print('Query={}'.format(words[query_id]))
for rank, object in enumerate(result):
    print('{}\t{}\t{:.6f}\t{}'.format(rank + 1, object[0], object[1], words[object[0]]))
Below are the search results.
Query=Doctors
1 10000 0.000000 Doctors
2 4631 0.244096 Doctor
3 79542 0.258944 Medics
4 2044 0.263412 doctors
5 339397 0.274972 Doctoring
6 20667 0.280508 Physicians
7 80646 0.292580 Dentists
8 24255 0.292752 Nurses
9 9480 0.322195 Scientists
10 623160 0.330500 Practioners
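Because the search is approximate, it can be useful to cross-check a result list against an exhaustive scan over a small subset of the vectors. The following is a minimal sketch of such a scan in plain Python, assuming the distance is one minus the cosine similarity; `exact_neighbors` and the toy vectors are hypothetical and do not use ngtpy.

```python
import heapq
import math


def _cosine_distance(a, b):
    """Assumed distance: 1 - cosine similarity (non-zero vectors only)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_neighbors(vectors, query, k=10):
    """Return the k closest (id, distance) pairs found by scanning every vector."""
    scored = ((i, _cosine_distance(v, query)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Toy two-dimensional vectors; ids are simply list positions.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = exact_neighbors(vectors, vectors[0], k=3)
for rank, (object_id, distance) in enumerate(neighbors):
    print('{}\t{}\t{:.6f}'.format(rank + 1, object_id, distance))
```

The scan visits every vector, so it is practical only for small subsets, which is the cost that a graph index is meant to avoid.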
When higher accuracy is needed, specify an epsilon value in search() that is larger than the default of 0.1, as shown below.
index.search(query_object, epsilon = 0.15)
When a shorter query time is needed at the expense of accuracy, specify a smaller epsilon value.
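One way to quantify the accuracy side of this trade-off is recall: the fraction of the true nearest neighbors that the approximate search returns. The sketch below is a hypothetical helper, `recall_at_k`, applied to toy id lists; it is not part of ngtpy.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbor ids that also appear in the approximate result."""
    if not exact_ids:
        raise ValueError('exact_ids must not be empty')
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy ids for illustration only.
exact = [1, 2, 3, 4]    # ids from an exhaustive scan
approx = [1, 2, 4, 9]   # ids from an approximate search
recall = recall_at_k(approx, exact)
print(recall)
```

In practice, the approximate ids would come from the first element of each tuple returned by index.search() at a given epsilon, and the recall could be compared with the measured query time for several epsilon values.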