Compression Factor
"Index compression factor — The index compression factor is defined as (n-s)/n where n is the number of words in the corpus and s is the number of stems. In other words, the index compression factor is the fractional reduction in index size achieved through stemming*. For example, a corpus with 50,000 words (n) and 40,000 stems (s), would have an index compression factor of 20%. Stronger stemmers will tend to have larger index compression factors." Frakes & Fox [Ref: W1].
ICF = Index Compression Factor
N = Number of unique words before Stemming
S = Number of unique stems after Stemming
ICF = (N - S)/N
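The formula above can be sketched in a few lines of Python. This is only an illustration: the function names `icf_from_counts` and `index_compression_factor` are hypothetical, and the `stem` argument stands in for whatever stemmer is being evaluated (here a simple three-character truncation, chosen purely for the example).

```python
def icf_from_counts(n, s):
    """ICF = (N - S) / N for N unique words and S unique stems."""
    if n <= 0:
        raise ValueError("N must be a positive number of unique words")
    return (n - s) / n


def index_compression_factor(words, stem):
    """Compute ICF for an iterable of words and a stemming function.

    `stem` is any callable mapping a word to its stem (an assumption
    of this sketch, not a specific stemmer from the source).
    """
    unique_words = set(words)
    unique_stems = {stem(w) for w in unique_words}
    return icf_from_counts(len(unique_words), len(unique_stems))


# The Frakes & Fox example: 50,000 words and 40,000 stems.
print(icf_from_counts(50_000, 40_000))  # 0.2, i.e. 20%

# A toy corpus with truncation to three characters as the "stemmer".
toy = ["connect", "connected", "connection", "run"]
print(index_compression_factor(toy, lambda w: w[:3]))  # 0.5
```

Duplicate words are collapsed with `set` first because N and S count unique words and unique stems.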
*This does not apply to all text retrieval systems. For example, dtSearch does not stem words prior to indexing, so stemming produces no reduction in index size. This enables the dtSearch Desktop product to display the words in the index to assist rapid searching. It also makes indexes language independent: the language of the stemming rules can be changed at search time, which helps when searching an index that contains text in multiple languages.
Verification
Use test file alpha2.txt for File A and alpha2_trunc1.txt for File B for the strongest possible results. alpha2.txt contains 52 words, two starting with each letter of the alphabet, so the compression factor using alpha2_trunc1.txt will be (52-26)/52 = 1/2 = 0.5 and the Mean Conflation Class Size will be 52/26 = 2.
Use test file alpha2.txt for File A and File B for the weakest possible results. The results will be Mean Conflation Class Size = 1; Compression Factor and Mean Characters Removed will both be 0; Inverse Mean MHD will be infinity.
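The two verification cases can be reproduced with a small sketch. The actual contents of alpha2.txt and alpha2_trunc1.txt are not shown here, so the code builds stand-ins: 52 synthetic words, two per letter, and a File B that is assumed to hold each word truncated to its first character. The helper name `stemming_metrics` is hypothetical, and only the Compression Factor and Mean Conflation Class Size are computed.

```python
from string import ascii_lowercase


def stemming_metrics(words, stems):
    """Return (compression factor, mean conflation class size).

    words[i] is assumed to be stemmed to stems[i], line by line.
    """
    word_to_stem = dict(zip(words, stems))  # collapses duplicate words
    n = len(word_to_stem)                   # N: unique words
    s = len(set(word_to_stem.values()))     # S: unique stems
    return (n - s) / n, n / s


# Stand-in for alpha2.txt: two words per letter, 52 words in total.
file_a = [c + suffix for c in ascii_lowercase for suffix in ("aa", "bb")]

# Stand-in for alpha2_trunc1.txt: each word cut to its first character.
trunc1 = [w[:1] for w in file_a]

strongest = stemming_metrics(file_a, trunc1)
weakest = stemming_metrics(file_a, file_a)  # File A used as File B

print(strongest)  # (0.5, 2.0)
print(weakest)    # (0.0, 1.0)
```

With identical files every word is its own stem, so N equals S, which gives the weakest results stated above.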