The Chinese LPP conllulex file contains examples of sentence-level and token-level alignments to the English LPP file. The script scripts/generate_alignments_from_conllulex.py takes the Chinese conllulex file as input and generates alignments for the Chinese-English language pair.
The alignment format needs to be extended to support a) multiple language pairs and b) 1-many sentence alignments. Proposed changes:
- Using English as an example, add a new metadata field en_sent_id_2 (and en_sent_id_3 if needed) to support 1-many sentence alignments, and do the same for other languages (e.g. hi_sent_id, hi_sent_id_2, hi_sent_id_3 for hi-zh alignment). Add corresponding numbered fields for the sentence text. The prefix of these fields could match the language's slug in Xposition ('en', 'zh', 'he', 'hi', 'ko', 'de').
  Alternatively, if UD notation supports a list data structure, that could be used in the metadata field instead (more research needed).
- In the misc column of the conllulex file, replicate the existing token-level annotation format for each additional language, with a new delimiter separating the annotations of different languages. The order of the languages follows the order of the language tags in the metadata.
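
To make the two proposals concrete, here is a minimal Python sketch of how a reader could consume them. It is an illustration under assumptions, not the behavior of the existing script: the function names, the `||` delimiter, the sentence ids, and the placeholder annotation strings are all hypothetical. Each language's token-level annotation is treated as an opaque string, so the existing annotation format is left untouched.

```python
import re

# Matches the proposed fields: en_sent_id, en_sent_id_2, hi_sent_id_3, ...
# The sentence's own "sent_id" has no language prefix and is not matched.
SENT_ID_FIELD = re.compile(r"^(?P<lang>[a-z]{2})_sent_id(?:_(?P<n>\d+))?$")

# Hypothetical delimiter between per-language annotations in the misc column.
LANG_DELIM = "||"


def collect_aligned_sent_ids(metadata):
    """Group aligned sentence ids by language prefix, ordered by field suffix."""
    found = {}
    for key, value in metadata.items():
        match = SENT_ID_FIELD.match(key)
        if match:
            # An unnumbered field (e.g. en_sent_id) counts as the first one.
            n = int(match.group("n") or 1)
            found.setdefault(match.group("lang"), []).append((n, value))
    return {lang: [v for _, v in sorted(pairs)] for lang, pairs in found.items()}


def split_misc_by_language(misc, langs):
    """Pair each delimited misc segment with a language, in metadata order."""
    parts = misc.split(LANG_DELIM)
    if len(parts) != len(langs):
        raise ValueError(f"expected {len(langs)} segments, got {len(parts)}")
    return dict(zip(langs, parts))


# Made-up metadata for one Chinese sentence aligned to two English sentences
# and one Hindi sentence.
metadata = {
    "sent_id": "zh-1",
    "en_sent_id": "en-12",
    "en_sent_id_2": "en-13",
    "hi_sent_id": "hi-12",
}
aligned = collect_aligned_sent_ids(metadata)
# aligned == {"en": ["en-12", "en-13"], "hi": ["hi-12"]}

# Languages appear in the same order as their tags in the metadata.
by_lang = split_misc_by_language("EN_ANNOTATION||HI_ANNOTATION", list(aligned))
# by_lang == {"en": "EN_ANNOTATION", "hi": "HI_ANNOTATION"}
```

Whatever delimiter is chosen must not already occur inside the existing token-level annotations or clash with the separators used in the misc column.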
Also see #234