Add WIP gismu validator script. #44

Open
DerSaidin wants to merge 6 commits into balningau:master from DerSaidin:master

Conversation

@DerSaidin

Contributor

PLEASE DO NOT ACCEPT MERGE (YET)
This is a work in progress, like a code review. I'm after feedback.

What is it that we want to validate, in terms of the structure/contents of the file?
Is there a document explaining structure/contents of these .yaml files?
What would useful metrics be?

I'm guessing: no other documentation, figure it out as I go, just assume what is there should be there, and ask if I see anything weird.

@durka

durka commented Nov 11, 2014

Contributor

Cool! You're right, it would be good to have some qualitative measurements. I'll try to do a code review soon. For other validators, we could make sure rafsi don't conflict, and gather some statistics on rafsi distribution and place structure sizes... Levenshtein distance could be used to see which gismu are likely to be confused, even if they don't technically conflict by CLL. Maybe some more fuzzy metrics like definition length, or even run the definitions through something like this to assess readability. What do you think?

@durka

durka commented Nov 11, 2014

Contributor

In terms of documentation on the YAML files, there isn't any right now except the script that generated them. But I'm open to writing some if it would be useful :).

@durka

durka commented Nov 23, 2014

Contributor

I think addValidationError should take a set (not list) of affected gismu as well as the error string, so that it can compare the sets and see if the error was already added, ignoring order. Then you wouldn't get campu/sampu and sampu/campu reported as separate conflicts.
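A minimal sketch of that suggestion, assuming a hypothetical `ValidationReport` collector (the class and method names here are illustrative and not taken from the PR). Keying each error on a `frozenset` of the affected gismu makes the duplicate check ignore order. It also assumes the error message itself does not depend on the order of the pair.

```python
class ValidationReport:
    """Hypothetical error collector; not the PR's actual code."""

    def __init__(self):
        self._seen = set()   # keys of errors already recorded
        self.errors = []     # (message, sorted gismu list) in insertion order

    def add_validation_error(self, message, gismu):
        # frozenset is hashable and order-insensitive, so
        # {"campu", "sampu"} and {"sampu", "campu"} produce the same key.
        key = (message, frozenset(gismu))
        if key in self._seen:
            return False
        self._seen.add(key)
        self.errors.append((message, sorted(gismu)))
        return True


report = ValidationReport()
report.add_validation_error("gismu too similar", {"campu", "sampu"})
report.add_validation_error("gismu too similar", {"sampu", "campu"})
print(report.errors)
```

The second call is dropped as a duplicate, so only one campu/sampu conflict is reported.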

@DerSaidin

Contributor Author

Yeah, the validation errors should have the gismu objects.

Also, that check should be done as part of the pairwise loop below, which I have only started writing.
That loop should never visit both (x,y) and (y,x).
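One way to get a loop with that property, sketched here with the standard library's `itertools.combinations` (the helper name `unordered_pairs` is hypothetical): each unordered pair is produced exactly once, so n items yield n(n-1)/2 pairs instead of n^2.

```python
from itertools import combinations


def unordered_pairs(words):
    """Yield each unordered pair once: (x, y) but never (y, x)."""
    return combinations(sorted(words), 2)


pairs = list(unordered_pairs(["sampu", "campu", "kanro"]))
print(pairs)

# Pair count for the full list of 1423 gismu, without materialising the pairs.
n = 1423
print(n * (n - 1) // 2)
```

For 1423 gismu this is 1,011,753 pairs, roughly half of the 2,024,929 ordered pairs.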

@DerSaidin

Contributor Author

Example output:

$ ./validator.py
...loaded 100
...loaded 200
...loaded 300
...loaded 400
...loaded 500
...loaded 600
...loaded 700
...loaded 800
...loaded 900
...loaded 1000
...loaded 1100
...loaded 1200
...loaded 1300
...loaded 1400
...loaded 1423 (0 failed to load)
Summary:
1423 gismu
5 defs in language he
1 defs in language ko
1354 defs in language ja
146 defs in language jbo
3 defs in language la
1337 defs in language hu
2 defs in language it
494 defs in language sv
15 defs in language vi
39 defs in language nl
1341 defs in language de
60 defs in language pl
10 defs in language art-guaspi
1 defs in language fi
1339 defs in language eo
1422 defs in language en
1347 defs in language es
1 defs in language et
103 defs in language no
1393 defs in language zh
1392 defs in language en-simple
88 defs in language fr
1374 defs in language ru
gismu too similar: sampu campu [experimental]  (differs by c to s)
gismu too similar: tatru datru [experimental]  (differs by d to t)
gismu too similar: kanro kamro [experimental]  (differs by m to n)
gismu too similar: sarni [experimental] salni [experimental]  (differs by l to r)

I think I'm mostly done with refactoring. The only other thing is to have a common collection visitor for validators and metrics, so that it can cleanly support things like pairwise metrics (e.g. useful for Levenshtein distances between pairs).
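A rough sketch of what such a common visitor could look like. Everything here is an assumption for illustration: the `Visitor` base class, the `run_visitors` driver, and the record shape (a dict with a `definitions` mapping keyed by language code) are hypothetical and not taken from the PR.

```python
import time
from collections import Counter
from itertools import combinations


class Visitor:
    """Hypothetical base class: override visit() and/or visit_pair()."""
    name = "unnamed"

    def visit(self, item):
        pass

    def visit_pair(self, a, b):
        pass

    def report(self):
        return []


class LanguageCounts(Visitor):
    name = "Language Counts Metrics"

    def __init__(self):
        self.counts = Counter()

    def visit(self, item):
        # Assumed record shape: {"word": ..., "definitions": {lang: text}}
        self.counts.update(item["definitions"].keys())

    def report(self):
        return [f"{n} defs in language {lang}"
                for lang, n in sorted(self.counts.items())]


def run_visitors(items, visitors):
    for v in visitors:
        start = time.perf_counter()
        for item in items:
            v.visit(item)
        # Only pay for the pairwise pass if the visitor overrides visit_pair.
        if type(v).visit_pair is not Visitor.visit_pair:
            for a, b in combinations(items, 2):
                v.visit_pair(a, b)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"Running visitor {v.name} ... {elapsed_ms:.0f} ms")


# Placeholder records, not real gimste data.
gismu = [
    {"word": "sampu", "definitions": {"en": "placeholder", "de": "placeholder"}},
    {"word": "campu", "definitions": {"en": "placeholder"}},
]
metrics = LanguageCounts()
run_visitors(gismu, [metrics])
print("\n".join(metrics.report()))
```

Validators and metrics then share one driver, and a pairwise metric only needs to override `visit_pair`.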

What would be useful for Levenshtein distances? Closest 50 pairs? Any under a certain limit? What is a reasonable limit for pairs of gismu? Try it and see what seems reasonable?
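To experiment with those thresholds, a plain dynamic-programming Levenshtein distance plus an "any pair under a limit" filter might look like the following. This is a generic textbook sketch, not the PR's implementation, and `close_pairs` is a hypothetical helper name.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute), keeping two rows of the table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def close_pairs(words, limit):
    """Return (distance, x, y) for every unordered pair within the limit."""
    result = []
    for x, y in combinations(sorted(words), 2):
        d = levenshtein(x, y)
        if d <= limit:
            result.append((d, x, y))
    return result


print(close_pairs(["campu", "sampu", "kanro"], limit=1))
```

For two words of equal length, a distance of 1 can only be a single substitution, so for five-letter gismu a position-by-position comparison would find the same distance-1 pairs more cheaply.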

@DerSaidin

Contributor Author

In the gismu + experimental set, there are:

  • 978 pairs with a Levenshtein distance of 1.
  • 16,630 pairs with a Levenshtein distance of 2.

For distance 1 pairs, see https://gist.github.com/DerSaidin/b59ede5e7ef23828d23a

@trans

trans commented Apr 26, 2015 via email


@DerSaidin

Contributor Author

I don't know. Btw, the reason computing Levenshtein distance is commented out in the changes is that it takes about 30 min for the 1423 gismu. Everything else takes 2 minutes, and almost all of that is looking for conflicting gismu pairs (under CLL4.14 similarity rules).

My /usr/dict/words has 25146 words. To estimate the runtime for running similar code on that:
1423^2 = 2024929
25146^2 = 632321316
30 min / 2024929 * 632321316 = 9368 min = 6.5 days runtime.
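The same estimate as a short script, using only the figures quoted above and assuming, as the estimate does, that runtime scales with the square of the word count (the variable names are illustrative):

```python
gismu_count = 1423        # words in the measured run
dict_count = 25146        # words in /usr/dict/words
measured_minutes = 30     # Levenshtein pass over the gismu list

gismu_pairs = gismu_count ** 2
dict_pairs = dict_count ** 2

# Scale the measured time by the ratio of pair counts.
estimated_minutes = measured_minutes / gismu_pairs * dict_pairs
estimated_days = estimated_minutes / (60 * 24)

print(gismu_pairs, dict_pairs)
print(f"{estimated_minutes:.0f} min = {estimated_days:.1f} days")
```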

I'm not curious enough to write code for that and let it run for a week (or try optimizing it more).

Given the largish number of small distances, I don't think Levenshtein distance is especially important, useful, or insightful for this application. If you were writing some lojban auto-correct or something, sure. But not for looking for issues in the gimste: the CLL4.14 rules have a similar purpose, and are more specific to the task.
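For comparison, a similarity check in the spirit of the validator output could be sketched as below. The `SIMILAR` set is an assumption built only from the four consonant pairs that appear in the output in this thread (c/s, d/t, m/n, l/r). The full table and the exact rule are defined in CLL 4.14 and are not reproduced here, and `too_similar` is a hypothetical helper name.

```python
# Assumed, incomplete table: only the pairs seen in the validator output.
SIMILAR = {frozenset("cs"), frozenset("dt"), frozenset("mn"), frozenset("lr")}


def too_similar(a, b):
    """Return the differing letter pair if a and b differ in exactly one
    position and those letters are listed as similar, else None."""
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    if len(diffs) != 1:
        return None
    x, y = diffs[0]
    if frozenset((x, y)) in SIMILAR:
        return tuple(sorted((x, y)))
    return None


for first, second in [("sampu", "campu"), ("kanro", "kamro"), ("sampu", "kanro")]:
    print(first, second, too_similar(first, second))
```

Sorting the letter pair keeps the "differs by c to s" wording the same regardless of which word comes first.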

@DerSaidin

Contributor Author

Also, current output of the script looks like this:

...loaded 100
...loaded 200
...loaded 300
...loaded 400
...loaded 500
...loaded 600
...loaded 700
...loaded 800
...loaded 900
...loaded 1000
...loaded 1100
...loaded 1200
...loaded 1300
...loaded 1400
...loaded 1423
0 failed to load
Running visitor Language Counts Metrics ... 93 ms
Running visitor Final Vowel Validator ... 8 ms
Running visitor Conflicting Gismu Validator ... 100606 ms
Running visitor Conflicting Rafsi Validator ... 6480 ms
==== Validation: ====
gismu too similar (differs by c to s): campu [experimental], sampu
gismu too similar (differs by d to t): datru [experimental], tatru
gismu too similar (differs by m to n): kamro [experimental], kanro
gismu too similar (differs by l to r): salni [experimental], sarni [experimental]
FAIL
==== Summary: ====
---- Language Counts Metrics: ----
1423 gismu
103 defs in language no
1354 defs in language ja
39 defs in language nl
1374 defs in language ru
5 defs in language he
1 defs in language ko
10 defs in language art-guaspi
2 defs in language it
3 defs in language la
1337 defs in language hu
60 defs in language pl
15 defs in language vi
1422 defs in language en
1339 defs in language eo
1341 defs in language de
146 defs in language jbo
1 defs in language fi
1392 defs in language en-simple
494 defs in language sv
1347 defs in language es
1393 defs in language zh
1 defs in language et
88 defs in language fr

Any further code review or general comments welcome.
I'm satisfied this code is decent enough to consider accepting the pull request.


3 participants