
Error Counting

Electronart edited this page Dec 12, 2018 · 2 revisions

List Analyser uses the method described by Chris Paice in "Method for Evaluation of Stemming Algorithms Based on Error Counting" (Ref: W8).

UI = Under-Stemming Index

This is given by UI = 1 - CI, where CI is the Conflation Index, the proportion of equivalent word pairs which were successfully grouped to the same stem.

OI = Over-Stemming Index

This is given by OI = 1 - DI, where DI is the proportion of non-equivalent word pairs which remained distinct after stemming.

SW = Stemmer Weight = OI/UI
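
The three definitions above can be sketched in a few lines of Python. This is only an illustration, not List Analyser's own code: the function name `error_indices` and its parameter names are hypothetical, and returning `None` for SW when UI is 0 is an assumed convention, since the page does not say how that case is handled.

```python
def error_indices(conflated_pairs, equivalent_pairs,
                  distinct_pairs, non_equivalent_pairs):
    """Return (UI, OI, SW) from raw pair counts.

    conflated_pairs      -- equivalent pairs that ended up with the same stem
    equivalent_pairs     -- all pairs of words from the same group
    distinct_pairs       -- non-equivalent pairs whose stems stayed different
    non_equivalent_pairs -- all pairs of words from different groups
    """
    ci = conflated_pairs / equivalent_pairs      # Conflation Index
    di = distinct_pairs / non_equivalent_pairs   # proportion kept distinct
    ui = 1 - ci                                  # Under-Stemming Index
    oi = 1 - di                                  # Over-Stemming Index
    # Assumption: SW is reported as None when UI is 0, to avoid dividing by zero.
    sw = oi / ui if ui != 0 else None            # Stemmer Weight
    return ui, oi, sw


# Made-up counts, purely to exercise the function.
ui, oi, sw = error_indices(1, 4, 3, 4)
print(ui, oi, sw)
```

A low SW suggests a light stemmer (few over-stemming errors relative to under-stemming errors), and a high SW a heavy one.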

Verification:

Open the sample file English2Grouped.txt as File A and English2Grouped_Trunc5.txt as File B. Set the Levenshtein Range at 0 to 32 and click on the Calculate button. The results shown in the Error Count group box should be UI = 0.545 and OI = 0.

These results are taken from the examples at: www.comp.lancs.ac.uk/computing/research/stemming/Links/error.htm

English2Grouped.txt

  • divide
  • dividing
  • divided
  • division
  • divisor
  • ====
  • divine
  • divination

English2Grouped_Trunc5.txt

  • divid
  • divid
  • divid
  • divis
  • divis
  • ====
  • divin
  • divin
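
As a rough sketch of how the two files line up, the snippet below splits each listing into groups at the `====` line and pairs every word with the stem on the same line of the second file. It assumes the files hold one entry per line with no bullet characters (the bullets above are taken to be page formatting), and the helper name `parse_groups` is hypothetical rather than anything from List Analyser.

```python
def parse_groups(text, separator="===="):
    """Split grouped-list text into a list of groups (lists of entries)."""
    groups, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                  # ignore blank lines
        if line == separator:
            if current:
                groups.append(current)
            current = []
        else:
            current.append(line)
    if current:
        groups.append(current)        # final group has no trailing separator
    return groups


words_text = """divide
dividing
divided
division
divisor
====
divine
divination"""

stems_text = """divid
divid
divid
divis
divis
====
divin
divin"""

word_groups = parse_groups(words_text)
stem_groups = parse_groups(stems_text)

# (group index, word, stem) for every line, matched by position.
tagged = [
    (g, word, stem)
    for g, (words, stems) in enumerate(zip(word_groups, stem_groups))
    for word, stem in zip(words, stems)
]
for row in tagged:
    print(row)
```

Matching by position relies on File B having exactly the same line and group layout as File A, which is the case for the two sample files.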

UI can be calculated by tabulating every pair of words from the same input group:

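|            | divide | dividing | divided | division | divisor | divine | divination |
|------------|--------|----------|---------|----------|---------|--------|------------|
| divide     |        | 1        | 1       | 0        | 0       |        |            |
| dividing   | 1      |          | 1       | 0        | 0       |        |            |
| divided    | 1      | 1        |         | 0        | 0       |        |            |
| division   | 0      | 0        | 0       |          | 1       |        |            |
| divisor    | 0      | 0        | 0       | 1        |         |        |            |
| divine     |        |          |         |          |         |        | 1          |
| divination |        |          |         |          |         | 1      |            |

1 = a pair of words from the same input group with identical stems. 0 = a pair of words from the same input group with different stems, i.e. an under-stemming error. Blank cells are not counted: they pair a word with itself or with a word from a different input group.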

The total possible matches (total of all 0 and 1 results) is 22.

The total with result 1 is 10.

Hence UI = 1 - (10/22) = 0.545455, rounded to 6 decimal places.
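
The count can be checked with a short script. This is a minimal sketch, not the tool's implementation; it counts ordered pairs (each pair appears twice, once per direction), which matches the table and leaves the ratio unchanged.

```python
from itertools import permutations

# (word, stem) pairs, one list per input group, taken from the sample files.
groups = [
    [("divide", "divid"), ("dividing", "divid"), ("divided", "divid"),
     ("division", "divis"), ("divisor", "divis")],
    [("divine", "divin"), ("divination", "divin")],
]

total = 0    # every 0 or 1 cell in the table
matched = 0  # cells with result 1
for group in groups:
    # Ordered pairs of two different words from the same group.
    for (_, stem_a), (_, stem_b) in permutations(group, 2):
        total += 1
        if stem_a == stem_b:
            matched += 1

ui = 1 - matched / total
print(total, matched, round(ui, 6))  # 22 10 0.545455
```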

OI can be calculated by tabulating every pair of words from different input groups:

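|            | divide | dividing | divided | division | divisor | divine | divination |
|------------|--------|----------|---------|----------|---------|--------|------------|
| divide     |        | x        | x       | x        | x       | 1      | 1          |
| dividing   | x      |          | x       | x        | x       | 1      | 1          |
| divided    | x      | x        |         | x        | x       | 1      | 1          |
| division   | x      | x        | x       |          | x       | 1      | 1          |
| divisor    | x      | x        | x       | x        |         | 1      | 1          |
| divine     | 1      | 1        | 1       | 1        | 1       |        | x          |
| divination | 1      | 1        | 1       | 1        | 1       | x      |            |

1 = a pair of words from different input groups with different stems, i.e. a successful stemming.

0 = a pair of words from different groups with the same stem, i.e. an over-stemming error. None occur in this example.

x = a pair of words from the same input group, which is not counted here. The blank diagonal cells pair a word with itself.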

The total possible matches (total of all 0 and 1 results) is 20.

The total with result 1 is 20.

Hence OI = 1 - (20/20) = 0.
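
The same check for OI, again as an illustrative sketch rather than the tool's own code. It walks every ordered pair of words drawn from two different input groups and counts those whose stems stayed distinct.

```python
from itertools import product

# (word, stem) pairs, one list per input group, taken from the sample files.
groups = [
    [("divide", "divid"), ("dividing", "divid"), ("divided", "divid"),
     ("division", "divis"), ("divisor", "divis")],
    [("divine", "divin"), ("divination", "divin")],
]

total = 0     # every 0 or 1 cell in the table
distinct = 0  # cells with result 1
for i, group_a in enumerate(groups):
    for j, group_b in enumerate(groups):
        if i == j:
            continue  # same-group pairs are the "x" cells, not counted
        for (_, stem_a), (_, stem_b) in product(group_a, group_b):
            total += 1
            if stem_a != stem_b:
                distinct += 1

oi = 1 - distinct / total
print(total, distinct, oi)  # 20 20 0.0
```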
