Add WIP gismu validator script. #44
|
Cool! You're right, it would be good to have some qualitative measurements. I'll try to do a code review soon. For other validators, we could make sure rafsi don't conflict, and gather some statistics on rafsi distribution and place structure sizes... Levenshtein distance could be used to see which gismu are likely to be confused, even if they don't technically conflict by CLL. Maybe some more fuzzy metrics like definition length, or even run the definitions through something like this to assess readability. What do you think? |
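A minimal sketch of the rafsi conflict check suggested above. It assumes the YAML data has already been loaded into a dict mapping each gismu to its list of rafsi. That shape and the name `find_rafsi_conflicts` are hypothetical, and the sample data is invented to include a deliberate clash.

```python
from collections import defaultdict


def find_rafsi_conflicts(rafsi_by_gismu):
    """Return {rafsi: [gismu, ...]} for every rafsi claimed by more than one gismu."""
    owners = defaultdict(set)
    for gismu, rafsi_list in rafsi_by_gismu.items():
        for rafsi in rafsi_list:
            owners[rafsi].add(gismu)
    # Keep only rafsi with two or more owners; sort owners for stable output.
    return {r: sorted(g) for r, g in owners.items() if len(g) > 1}


# Invented sample data: "kla" is assigned twice on purpose.
sample = {
    "klama": ["kla"],
    "klaku": ["kak", "kla"],
    "cmalu": ["cma"],
}
conflicts = find_rafsi_conflicts(sample)
```

The same `owners` mapping could also feed simple statistics, such as how many gismu have zero, one, or several rafsi.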
|
In terms of documentation on the YAML files, there isn't any right now except the script that generated them. But I'm open to writing some if it would be useful :). |
|
I think addValidationError should take a set (not list) of affected gismu as well as the error string, so that it can compare the sets and see if the error was already added, ignoring order. Then you wouldn't get campu/sampu and sampu/campu reported as separate conflicts. |
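A rough sketch of that idea, not the script's actual code: the class name `ValidationReport` and the method signature are assumptions made for illustration. Keying each error on a `frozenset` of the affected gismu makes the comparison ignore order.

```python
class ValidationReport:
    """Collects validation errors, ignoring duplicates that differ only in gismu order."""

    def __init__(self):
        self._seen = set()
        self.errors = []

    def add_validation_error(self, message, affected):
        # frozenset is hashable and unordered, so {a, b} and {b, a} give the same key.
        key = (message, frozenset(affected))
        if key in self._seen:
            return False  # already reported
        self._seen.add(key)
        self.errors.append((message, sorted(affected)))
        return True


report = ValidationReport()
report.add_validation_error("similar gismu conflict", {"campu", "sampu"})
report.add_validation_error("similar gismu conflict", {"sampu", "campu"})
```

With this, the second call is dropped and only one campu/sampu conflict is recorded.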
|
Yeah, the validation errors should have the gismu objects. Also, that check should be done as part of the pairwise loop below, which I have only started writing. |
|
Example output: I think I'm mostly done with refactoring. The only other thing is to have a common collection visitor for validators and metrics, so that it can nicely support things like pairwise metrics (e.g. useful for Levenshtein distances between pairs). What would be useful for Levenshtein distances? Closest 50 pairs? Any under a certain limit? What is a reasonable limit for pairs of gismu? Try it and see what seems reasonable? |
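One way to try the "any under a certain limit" option, as a sketch only: a plain dynamic-programming Levenshtein distance plus a pairwise scan. The function names and the three-word sample are assumptions, not code from this PR.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def close_pairs(words, limit):
    """Return sorted (distance, a, b) tuples for every pair within the limit."""
    result = []
    for a, b in combinations(sorted(words), 2):
        d = levenshtein(a, b)
        if d <= limit:
            result.append((d, a, b))
    result.sort()
    return result


pairs = close_pairs(["klama", "klaku", "cmalu"], limit=2)
```

Taking the first 50 entries of the sorted result with a generous limit would give the "closest 50 pairs" variant instead.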
|
In the gismu + experimental set, there are:
For distance 1 pairs, see https://gist.github.com/DerSaidin/b59ede5e7ef23828d23a |
|
Interesting. I wonder how that compares to, say, a typical English
dictionary?
|
|
I don't know. Btw, the reason computing Levenshtein distance is commented out in the changes is that it takes about 30 min for the 1423 gismu. Everything else takes 2 minutes, and almost all of that is spent looking for conflicting gismu pairs (under the CLL4.14 similarity rules). My /usr/dict/words has 25146 words. To estimate the runtime for running similar code on that: I'm not curious enough to write code for that and let it run for a week (or try optimizing it more). Given the largish number of small distances, I don't think Levenshtein distance is especially important, useful, or insightful for this application. If you were writing some Lojban auto-correct or something, sure. But not for looking for issues in the gimste: the CLL4.14 rules have a similar purpose, and are more specific to the task. |
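A back-of-the-envelope sketch of how such a runtime estimate could be made. It assumes the cost is dominated by the number of word pairs, which grows quadratically, and that the per-pair cost stays the same. English words are longer than five-letter gismu, so this likely underestimates. The function names are hypothetical.

```python
def pair_count(n):
    """Number of unordered pairs among n words."""
    return n * (n - 1) // 2


def estimate_minutes(known_words, known_minutes, target_words):
    """Scale a measured runtime by the ratio of pair counts."""
    return known_minutes * pair_count(target_words) / pair_count(known_words)


# Figures from the comment above: 30 min for 1423 gismu, 25146 dictionary words.
minutes = estimate_minutes(1423, 30, 25146)
days = minutes / (60 * 24)
```

Under those assumptions the result comes out at roughly six and a half days, in line with the "run for a week" remark.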
|
Also, current output of the script looks like this: Any further code review or general comments welcome. |
PLEASE DO NOT ACCEPT MERGE (YET)
This is a work in progress, like a code review. I'm after feedback.
What is it that we want to validate, in terms of the structure/contents of the file?
Is there a document explaining structure/contents of these .yaml files?
What would useful metrics be?
I'm guessing: no other documentation, figure it out as I go, just assume what is there should be there, and ask if I see anything weird.