```python
from seekr import Seekr

seekr = Seekr()  # instantiate (assuming a no-argument constructor)
seekr.load_from_db('companies', 'data/companies.sqlite', column=1)
seekr.create_index('annoy')  # 'annoy' or 'kmeans'
matches = seekr.query('Active Fund LLC', limit=3, index_type='annoy')  # 'linear', 'annoy', 'kmeans'...
```
- Features which occur <= `skip_k` times in the whole corpus aren't added as a dimension in the vector, resulting in faster comparisons at the tradeoff that queries can no longer match on those rare features of a document.
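A minimal sketch of how such frequency pruning could work (the `skip_k` name comes from the text above, but the helper name and corpus shape are illustrative assumptions, not Seekr's actual internals):

```python
from collections import Counter

def build_vocabulary(corpus_features, skip_k=1):
    """Map each feature to a dimension index, skipping features that
    occur <= skip_k times across the whole corpus."""
    counts = Counter(f for doc in corpus_features for f in doc)
    kept = [f for f, c in counts.items() if c > skip_k]
    return {f: i for i, f in enumerate(sorted(kept))}

# 'LLC' appears twice across the corpus; every other feature appears
# only once, so with skip_k=1 only 'LLC' becomes a dimension.
corpus = [['ACTIVE', 'FUND', 'LLC'], ['PASSIVE', 'LLC']]
vocab = build_vocabulary(corpus, skip_k=1)  # {'LLC': 0}
```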
- Using `ngrams` to define features instead of `whitespace`, which results in more dimensions in the vector but more accurate fuzzy searching.
  - ngrams of `"EMERGENCY"` -> `['EME', 'MER', 'ERG', 'RGE', 'GEN', 'ENC', 'NCY']`
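The n-gram example above can be reproduced with a simple sliding window (a sketch; Seekr's actual tokenizer may pad, lowercase, or filter differently):

```python
def char_ngrams(text, n=3):
    """Sliding-window character n-grams used as vector features."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("EMERGENCY"))
# ['EME', 'MER', 'ERG', 'RGE', 'GEN', 'ENC', 'NCY']
```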
- During conversion of the target string to a vector, ngrams/whitespace tokens which are not present in the corpus are not added as dimensions in the resulting vector, leading to an inexact Euclidean distance but an optimized comparison.
  - the vector of `"!J INC"` will have the same Euclidean distance to its similar vectors as the vector of `"!J INC )*&!)#!*)^!))!*&*"`
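The distance artifact above can be demonstrated with a toy vectorizer (the helper names and the `{dimension: count}` representation are illustrative assumptions; Seekr stores tf-idf weights, not raw counts):

```python
def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def vectorize_query(text, vocab, n=3):
    """Sparse query vector as {dimension: count}; n-grams that never
    occurred in the corpus have no dimension, so they are dropped."""
    vec = {}
    for g in char_ngrams(text, n):
        if g in vocab:  # unknown n-grams are silently skipped
            vec[vocab[g]] = vec.get(vocab[g], 0) + 1
    return vec

# Pretend the corpus only ever produced the n-grams of "!J INC".
vocab = {g: i for i, g in enumerate(char_ngrams("!J INC"))}
clean = vectorize_query("!J INC", vocab)
noisy = vectorize_query("!J INC )*&!)#!*)^!))!*&*", vocab)
assert clean == noisy  # the junk suffix contributes no known dimensions
```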
- `self.matrix` in `TfidfVectorizer` stores sparse vectors instead of dense vectors for memory optimization.
  - dense vector -> `[0, 0, 4.51, 0, 0, 9.23, 0, 0, 0, 0, 0]`
  - sparse vector -> `[(2, 4.51), (5, 9.23)]`
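The conversion between the two representations above is straightforward (a sketch; the function names are assumptions):

```python
def to_sparse(dense):
    """Keep only (dimension, weight) pairs for the non-zero entries."""
    return [(i, w) for i, w in enumerate(dense) if w != 0]

def to_dense(sparse, size):
    """Expand back to a full vector when needed."""
    dense = [0.0] * size
    for i, w in sparse:
        dense[i] = w
    return dense

dense = [0, 0, 4.51, 0, 0, 9.23, 0, 0, 0, 0, 0]
sparse = to_sparse(dense)  # [(2, 4.51), (5, 9.23)]
```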
- Indexing vectors with a BTree will not return all of the similar vectors that linear exhaustive search finds, but querying the index is very time efficient.
- Centers are found multiple times during indexing to achieve the most optimal vector clustering.
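This restart strategy can be sketched as follows, using plain 1-D k-means for brevity (Seekr clusters sparse tf-idf vectors, but the keep-the-best-run logic is the same; all names here are illustrative):

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Plain k-means on 1-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: (p - centers[c]) ** 2)].append(p)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    # inertia: total squared distance of each point to its nearest center
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, inertia

def best_kmeans(points, k, restarts=5):
    """Find centers several times with different seeds and keep the
    lowest-inertia run -- the 'most optimal' clustering seen."""
    return min((kmeans(points, k, seed=s) for s in range(restarts)),
               key=lambda result: result[1])

points = [0.0, 0.1, 0.2, 9.0, 9.1, 9.2]
centers, inertia = best_kmeans(points, k=2)
```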
- `sensitivity` of the ANN index describes the ratio of the distribution of vectors on either side of the hyperplane. Increasing the sensitivity can slow down index creation.
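One way to read this (an assumption about Seekr's internals, sketched in the style of annoy-like random-hyperplane trees): a candidate hyperplane is only accepted when the two sides are balanced to at least the requested ratio, so higher sensitivity means more rejected attempts and slower index creation.

```python
import random

def balanced_split(vectors, sensitivity=0.5, max_tries=100, seed=0):
    """Split vectors by a random hyperplane through their centroid,
    retrying until smaller_side / larger_side >= sensitivity."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    left, right = [], []
    for _ in range(max_tries):
        normal = [rng.gauss(0, 1) for _ in range(dim)]
        left, right = [], []
        for v in vectors:
            side = sum(n * (x - c) for n, x, c in zip(normal, v, centroid))
            (left if side < 0 else right).append(v)
        if left and right and min(len(left), len(right)) / max(len(left), len(right)) >= sensitivity:
            break  # balanced enough for the requested sensitivity
    return left, right

# Eight points on a circle: any hyperplane through the centroid splits them 4/4.
points = [[1, 0], [-1, 0], [0, 1], [0, -1],
          [0.7, 0.7], [-0.7, -0.7], [0.7, -0.7], [-0.7, 0.7]]
left, right = balanced_split(points, sensitivity=0.8)
```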
- Instead of actually using the average vector of a cluster for computation, the member vector closest to the average vector is chosen, because it has fewer non-zero dimensions (it stays sparse, while the average of sparse vectors becomes dense).
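A sketch of this medoid-style center selection on sparse `(dimension, weight)` vectors (the function names and toy cluster are assumptions):

```python
def sparse_mean(vectors, size):
    """Average of sparse vectors; note the result is typically much denser
    than any single member."""
    dense = [0.0] * size
    for vec in vectors:
        for i, w in vec:
            dense[i] += w / len(vectors)
    return dense

def medoid(vectors, size):
    """Pick the member vector closest to the average, so the cluster
    'center' stays as sparse as its members."""
    mean = sparse_mean(vectors, size)
    def dist2(vec):
        d = dict(vec)
        return sum((d.get(i, 0.0) - m) ** 2 for i, m in enumerate(mean))
    return min(vectors, key=dist2)

cluster = [[(0, 1.0)], [(0, 1.0), (1, 0.2)], [(0, 0.9), (2, 0.1)]]
center = medoid(cluster, size=3)  # [(0, 1.0)] is nearest to the mean
```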
This project was developed as a major project during my college studies.