Add WIP gismu validator script. by DerSaidin · Pull Request #44 · balningau/gimste

DerSaidin · 2014-11-10T15:03:08Z

PLEASE DO NOT ACCEPT MERGE (YET)
This is a work in progress, like a code review. I'm after feedback.

What is it that we want to validate, in terms of the structure/contents of the file?
Is there a document explaining structure/contents of these .yaml files?
What would useful metrics be?

I'm guessing: no other documentation, figure it out as I go, just assume what is there should be there, and ask if I see anything weird.

requires pyyaml: http://pyyaml.org/

durka · 2014-11-11T16:30:55Z

Cool! You're right, it would be good to have some qualitative measurements. I'll try to do a code review soon. For other validators, we could make sure rafsi don't conflict, and gather some statistics on rafsi distribution and place structure sizes... Levenshtein distance could be used to see which gismu are likely to be confused, even if they don't technically conflict by CLL. Maybe some more fuzzy metrics like definition length, or even run the definitions through something like this to assess readability. What do you think?

durka · 2014-11-11T16:32:13Z

In terms of documentation on the YAML files, there isn't any right now except the script that generated them. But I'm open to writing some if it would be useful :).

durka · 2014-11-23T18:38:21Z

I think addValidationError should take a set (not list) of affected gismu as well as the error string, so that it can compare the sets and see if the error was already added, ignoring order. Then you wouldn't get campu/sampu and sampu/campu reported as separate conflicts.
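A minimal sketch of that idea, not the script's actual code: `ValidationReport` and the snake_case `add_validation_error` signature are hypothetical names used only for illustration. The affected gismu are stored as a `frozenset`, so the same conflict reported in either order maps to one key.

```python
class ValidationReport:
    """Collects validation errors, ignoring the order of the affected gismu."""

    def __init__(self):
        self._seen = set()   # keys of errors already recorded
        self.errors = []     # (message, sorted affected gismu) in insertion order

    def add_validation_error(self, message, affected):
        # frozenset is hashable and unordered, so {campu, sampu} and
        # {sampu, campu} produce the same key.
        key = (message, frozenset(affected))
        if key in self._seen:
            return False     # duplicate, nothing added
        self._seen.add(key)
        self.errors.append((message, sorted(affected)))
        return True


report = ValidationReport()
report.add_validation_error("gismu too similar", {"campu", "sampu"})
report.add_validation_error("gismu too similar", {"sampu", "campu"})
print(len(report.errors))
```

The second call is recognized as a duplicate, so only one error is kept.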

DerSaidin · 2014-11-23T22:34:52Z

Yeah, the validation errors should have the gismu objects.

Also, that check should be done as part of the pairwise loop below, which I have only started writing.
That loop then shouldn't ever visit both (x,y) and (y,x).
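One way to get a loop with that property is `itertools.combinations`, which yields each unordered pair exactly once. This is only a sketch; `unordered_pairs` is a hypothetical helper name and the word list is a small sample taken from this thread.

```python
from itertools import combinations
from math import comb


def unordered_pairs(words):
    """Yield each pair of distinct words once, never both (x, y) and (y, x)."""
    # Sorting first makes the pair order deterministic.
    return combinations(sorted(words), 2)


words = ["sampu", "campu", "tatru"]
pairs = list(unordered_pairs(words))
print(pairs)

# n items give n * (n - 1) / 2 unordered pairs.
print(comb(1423, 2))
```

Because each pair is visited once, a conflict found inside this loop is reported once without any extra de-duplication.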

DerSaidin · 2014-11-25T14:59:20Z

Example output:

$ ./validator.py
...loaded 100
...loaded 200
...loaded 300
...loaded 400
...loaded 500
...loaded 600
...loaded 700
...loaded 800
...loaded 900
...loaded 1000
...loaded 1100
...loaded 1200
...loaded 1300
...loaded 1400
...loaded 1423 (0 failed to load)
Summary:
1423 gismu
5 defs in language he
1 defs in language ko
1354 defs in language ja
146 defs in language jbo
3 defs in language la
1337 defs in language hu
2 defs in language it
494 defs in language sv
15 defs in language vi
39 defs in language nl
1341 defs in language de
60 defs in language pl
10 defs in language art-guaspi
1 defs in language fi
1339 defs in language eo
1422 defs in language en
1347 defs in language es
1 defs in language et
103 defs in language no
1393 defs in language zh
1392 defs in language en-simple
88 defs in language fr
1374 defs in language ru
gismu too similar: sampu campu [experimental]  (differs by c to s)
gismu too similar: tatru datru [experimental]  (differs by d to t)
gismu too similar: kanro kamro [experimental]  (differs by m to n)
gismu too similar: sarni [experimental] salni [experimental]  (differs by l to r)

I think I'm mostly done with refactoring. The only other thing is to have a common collection visitor for validators and metrics, so that it can cleanly support things like pairwise metrics (e.g. useful for Levenshtein distances between pairs).
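A rough sketch of what such a common visitor could look like, under the assumption that each validator or metric overrides a per-item hook and/or a per-pair hook. All class and function names here (`Visitor`, `CountMetric`, `SharedInitialMetric`, `run_visitors`) are hypothetical, and the two example metrics are toy stand-ins rather than anything from the script.

```python
import time
from itertools import combinations


class Visitor:
    """Base class: override whichever hooks a validator or metric needs."""
    name = "Unnamed"

    def visit(self, item):
        pass

    def visit_pair(self, a, b):
        pass


class CountMetric(Visitor):
    """Toy per-item metric: counts the items it sees."""
    name = "Count Metric"

    def __init__(self):
        self.count = 0

    def visit(self, item):
        self.count += 1


class SharedInitialMetric(Visitor):
    """Toy pairwise metric: records pairs that start with the same letter."""
    name = "Shared Initial Metric"

    def __init__(self):
        self.pairs = []

    def visit_pair(self, a, b):
        if a[0] == b[0]:
            self.pairs.append((a, b))


def run_visitors(items, visitors):
    """Run each visitor over every item and every unordered pair, timing it."""
    for visitor in visitors:
        start = time.perf_counter()
        for item in items:
            visitor.visit(item)
        for a, b in combinations(items, 2):
            visitor.visit_pair(a, b)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"Running visitor {visitor.name} ... {elapsed_ms:.0f} ms")


count, shared = CountMetric(), SharedInitialMetric()
run_visitors(["campu", "sampu", "salni", "sarni"], [count, shared])
```

Keeping the traversal in one place means a new pairwise metric only has to implement `visit_pair`.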

What would be useful for Levenshtein distances? Closest 50 pairs? Any under a certain limit? What is a reasonable limit for pairs of gismu? Try it and see what seems reasonable?

DerSaidin · 2015-04-26T05:11:42Z

In the gismu + experimental set, there are:

978 pairs with a Levenshtein distance of 1.
16,630 pairs with a Levenshtein distance of 2.

For distance 1 pairs, see https://gist.github.com/DerSaidin/b59ede5e7ef23828d23a
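For reference, a count like this could be produced along the following lines. This is a sketch using the textbook dynamic-programming Levenshtein distance, not necessarily the implementation in the pull request; `levenshtein` and `distance_histogram` are illustrative names and the word list is a small sample from this thread.

```python
from collections import Counter
from itertools import combinations


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def distance_histogram(words):
    """Map each distance to the number of unordered pairs at that distance."""
    return Counter(levenshtein(a, b) for a, b in combinations(words, 2))


sample = ["campu", "sampu", "kanro", "kamro"]
histogram = distance_histogram(sample)
print(histogram)
```

Reading off `histogram[1]` and `histogram[2]` gives the two figures of the kind quoted above.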

trans · 2015-04-26T11:21:27Z

Interesting. I wonder how that compares to, say, a typical English dictionary?

DerSaidin · 2015-04-26T14:09:59Z

I don't know. Btw, the reason computing Levenshtein distance is commented out in the changes is that it takes about 30 min for the 1423 gismu. Everything else takes 2 minutes, and almost all of that is looking for conflicting gismu pairs (under CLL4.14 similarity rules).

My /usr/dict/words has 25146 words. To estimate the runtime for running similar code on that:
1423^2 = 2024929
25146^2 = 632321316
30 min / 2024929 * 632321316 = 9368 min = 6.5 days runtime.
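The same estimate as a small script, assuming the runtime scales with the square of the word count and that each pair costs the same as a gismu pair. That second assumption is optimistic, since English words are often longer than five letters and the edit-distance cost grows with word length. `estimated_minutes` is just an illustrative helper name.

```python
GISMU_COUNT = 1423     # words in the measured run
GISMU_MINUTES = 30     # measured runtime for that run
WORD_COUNT = 25146     # words in /usr/dict/words


def estimated_minutes(n_words, n_baseline=GISMU_COUNT, baseline_minutes=GISMU_MINUTES):
    """Scale a measured runtime quadratically with the number of words."""
    return baseline_minutes * n_words ** 2 / n_baseline ** 2


minutes = estimated_minutes(WORD_COUNT)
print(f"{minutes:.0f} min = {minutes / (60 * 24):.1f} days")
```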

I'm not curious enough to write code for that and let it run for a week (or try optimizing it more).

Given the largish number of small distances, I don't think Levenshtein distance is especially important, useful, or insightful for this application. If you were writing some Lojban auto-correct or something, sure. But not for looking for issues in the gimste: the CLL4.14 rules have a similar purpose and are more specific to the task.
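To illustrate how a rule of that kind is narrower than a plain edit distance, here is a simplified sketch: two words conflict only if they differ in exactly one position and the two differing consonants form a listed pair. The table below is deliberately partial, containing only the four pairs that show up in the validator output in this thread. It is not the full CLL4.14 table, and `too_similar` is a hypothetical name, not the script's function.

```python
# Partial table (assumption): only the consonant pairs seen in this thread's output.
SIMILAR_CONSONANTS = {
    frozenset("cs"),
    frozenset("dt"),
    frozenset("mn"),
    frozenset("lr"),
}


def too_similar(a, b):
    """Return the differing letters as a sorted tuple if a and b differ in
    exactly one position by a listed consonant pair, otherwise None."""
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    if len(diffs) == 1 and frozenset(diffs[0]) in SIMILAR_CONSONANTS:
        return tuple(sorted(diffs[0]))
    return None


for a, b in [("campu", "sampu"), ("kanro", "kamro"), ("campu", "tatru")]:
    print(a, b, too_similar(a, b))
```

A pair at Levenshtein distance 1 is flagged only when the single differing letter pair is in the table, which is why this check reports far fewer pairs than the raw distance count.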

DerSaidin · 2015-04-26T14:13:49Z

Also, current output of the script looks like this:

...loaded 100
...loaded 200
...loaded 300
...loaded 400
...loaded 500
...loaded 600
...loaded 700
...loaded 800
...loaded 900
...loaded 1000
...loaded 1100
...loaded 1200
...loaded 1300
...loaded 1400
...loaded 1423
0 failed to load
Running visitor Language Counts Metrics ... 93 ms
Running visitor Final Vowel Validator ... 8 ms
Running visitor Conflicting Gismu Validator ... 100606 ms
Running visitor Conflicting Rafsi Validator ... 6480 ms
==== Validation: ====
gismu too similar (differs by c to s): campu [experimental], sampu
gismu too similar (differs by d to t): datru [experimental], tatru
gismu too similar (differs by m to n): kamro [experimental], kanro
gismu too similar (differs by l to r): salni [experimental], sarni [experimental]
FAIL
==== Summary: ====
---- Language Counts Metrics: ----
1423 gismu
103 defs in language no
1354 defs in language ja
39 defs in language nl
1374 defs in language ru
5 defs in language he
1 defs in language ko
10 defs in language art-guaspi
2 defs in language it
3 defs in language la
1337 defs in language hu
60 defs in language pl
15 defs in language vi
1422 defs in language en
1339 defs in language eo
1341 defs in language de
146 defs in language jbo
1 defs in language fi
1392 defs in language en-simple
494 defs in language sv
1347 defs in language es
1393 defs in language zh
1 defs in language et
88 defs in language fr

Any further code review or general comments welcome.
I'm satisfied this code is decent enough to consider accepting the pull request.

Add WIP gismu validator script.

efe8b38

DerSaidin added 2 commits November 20, 2014 01:11

Refactor validator.

dc34e3f

More validator refactoring.

963559f

Improve pairwise tests.

d2c5e71

DerSaidin added 2 commits November 26, 2014 01:22

A little more refactoring.

2d9e99e

Refactoring. Implement Levenshtein distance.

5d0577b

DerSaidin wants to merge 6 commits into balningau:master from DerSaidin:master
