Conversation
Thanks for contributing! Just to clarify, the Greek dictionary is not Ancient Greek. It is modern Greek, as used nowadays in Greece. Either way, I am OK with adding dead or fictional languages, just for fun. But, if possible for these languages, I would like to add big dictionaries that permit typing a lot of words in different contexts. For Latin, this should be easy - a lot of dictionaries exist today, and they can be easily incorporated in TT9. Unfortunately, I am quite busy now, so I can not spend time on this myself. On the other hand, you seem to have a lot of free time, and if you are willing to make some more effort, I will include Latin in the app. 🙂 I suggest that we use this wordlist. It contains more than 1 million words, so it should produce a very nice typing experience. It may require some cleaning, e.g. removing the single letters and any words with corrupted or non-Latin letters, but it should mostly be fine. As for the macrons, the word list from Winedt doesn't contain any. I would recommend actually installing and using the latin-macronizer tool you have found to make the dictionary nicer. After that, I can build an APK for you to do some real-world testing, and if it feels alright, I'll merge it and publish it. Go for it!
Co-authored-by: Dimo Karaivanov <doftor.livain@gmail.com>
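The cleaning steps mentioned above (dropping single letters and any words with corrupted or non-Latin characters) could be sketched roughly like this; the filtering rule and the sample words are illustrative assumptions, not the actual wordlist:

```python
import re

# Assumption: a valid entry is two or more basic Latin letters.
# Single letters, digits, and mis-encoded characters are dropped,
# and case-insensitive duplicates are removed.
VALID = re.compile(r"[A-Za-z]{2,}")

def clean_wordlist(lines):
    seen = set()
    for line in lines:
        word = line.strip()
        if VALID.fullmatch(word) and word.lower() not in seen:
            seen.add(word.lower())
            yield word

# Illustrative input, not the real Winedt list:
sample = ["a", "amo", "am0", "Roma", "amo"]
print(list(clean_wordlist(sample)))  # → ['amo', 'Roma']
```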
... wow, I should have caught 90% of that. My apologies for submitting that. I do indeed have quite a bit of free time; I'll get right to it 👍🏻
@sspanak The bigger question is that the dictionary is also huge: 29.517 MB and 1,243,950 words. For its size, it does have the Latin I study and a lot that I haven't... It doesn't appear to have duplicates. I haven't looked the thing over completely, but it does look like it's not unnecessarily duplicated (for instance: -que is a suffix basically meaning "and <that word>"; "que" and "*que" each appear once in the dictionary). Do you want me to include the whole 29 MB, or try to trim it down to what most classical Latin folks will actually use?
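For what it's worth, an exact-duplicate check over a list that size is cheap; a rough sketch (the sample entries are just for illustration, and "que" vs. "*que" are treated as distinct, as in the wordlist):

```python
from collections import Counter

def find_duplicates(words):
    """Return entries that appear more than once. Distinct spellings
    such as "que" and "*que" are counted separately."""
    counts = Counter(w.strip() for w in words)
    return sorted(w for w, n in counts.items() if n > 1)

sample = ["que", "*que", "aqua", "aqua", "amo"]
print(find_duplicates(sample))  # → ['aqua']
```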
From what I know, Romans were trying to find ways to mark long vowels. They felt the need to mark them, and I can see how adding macrons with modern-day fonts is useful, that's why I suggested using that tool to macronize the words. Also, from what I know, people who refuse to use macrons are hardcore fans who think they know everything. 😄
TT9 can handle even 2 million words, don't worry about it. See, in T9 keyboards, you must have as many words as possible; otherwise, you have to compose them letter by letter, which is painfully slow. And even if you don't know some word, maybe someone else knows it and may want to type it. Also, you may want to quote or copy some text, even if you don't know all the words. You need to have them for a nicer typing experience. So, bring it on! The more, the better.
Yes, these dictionaries are quite good. They rarely contain misspelled words, garbage words and whatnot.
I am not quite sure what you mean by that. There are quite a few words that end with "-que". Maybe I misunderstood you...
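The "more words, the better" point follows from how T9 maps letters to keys: unrelated words can collapse to the same digit sequence, so the dictionary needs every word a user might mean. A minimal sketch using the conventional keypad layout (the colliding pair is just an example):

```python
# Conventional T9 letter-to-digit layout for the basic Latin alphabet.
KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
        "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
T9 = {c: d for d, letters in KEYS.items() for c in letters}

def to_keys(word):
    """Map a word to the digit sequence a user would press."""
    return "".join(T9[c] for c in word.lower())

# "bos" (ox) and "cor" (heart) collide on the same sequence,
# so both must be in the dictionary to be typable at all:
print(to_keys("bos"))  # → 267
print(to_keys("cor"))  # → 267
```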
I think I've got the ducks in a row now. Unfortunately, I'll have to leave out macrons, at least for now. It turns out that the macronizers out there are tuned for making historical documents more readable, not dictionaries (even if I put in ~25,000 words at a time), and they do such a great job at it based on contextual clues. With a dictionary, the only context for the _n_th word is the n-1 words behind it... and accordingly, all the tools I tried (especially on edge cases
Whoops, misclick. Especially on edge cases like malo and mālō ("with evil" and "I prefer"), the tools either macronized incorrectly or didn't at all. So, for the sake of preserving accuracy (and not having to go through all 1.24M words by hand), I'll leave macrons out for now and keep my eyes out for a macronizer suited to dictionaries.
I see. I had a similar problem a couple of times already. The most notable case was Russian, where they have "е" and "ё", but most people use them inconsistently, instead of following the grammar rules. Naturally, this means most word lists found on the Internet contain misspelled words. So, I've found a tool that knew how to correct these two letters, and in the ambiguous cases, where two different words could be written with "е" or "ё" (just like "malo" and "mālō"), I kept both variants, because both are valid words. In my case, it was easy, because the tool had the option of producing one of the variants or both, and it didn't require context, it just processed according to the configuration. But, as I understand, this is not the case with the Latin macronizers. So, in summary, if you can't make the macronizer tool(s) produce all valid word variants, ignoring the context, then we can go without macrons, I guess. They would have been valuable though. Dropping them is a bit sad.
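The keep-both-variants approach could look something like this sketch; the table of ambiguous forms is hypothetical, since no real macronizer is assumed to produce it:

```python
# Hypothetical table mapping a plain spelling to all valid macronized
# forms; a real one would have to come from a macronizer run without
# regard to context.
AMBIGUOUS = {
    "malo": {"malō", "mālō"},  # "with evil" vs. "I prefer"
}

def expand_variants(words):
    """Emit every valid macronized variant of each word, like keeping
    both "е" and "ё" spellings in the Russian case."""
    out = []
    for w in words:
        out.extend(sorted(AMBIGUOUS.get(w, {w})))
    return out

print(expand_variants(["malo", "aqua"]))
```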
I saw TT9 has Greek in it, so since I'm a Latin nerd, I was like "hmm, let me see". I use TT9 enough for typing Latin assignments as well, so... :)
I did get Codex by OpenAI to write this, and its notes are in
laWordlistReadme.txt. Let me know if something needs to get changed.