Cross-Linguistic alignments for multiple language pairs

The Chinese LPP conllulex file contains examples of sentence-level and token-level alignments to the English LPP file. The script `scripts/generate_alignments_from_conllulex.py` takes the Chinese conllulex file as input and generates alignments for the Chinese-English language pair.

The following changes are proposed to extend the alignments to support a) multiple language pairs and b) 1-many sentence alignments.

1) Taking English as an example, add a new metadata field en_sent_id_2 (and en_sent_id_3 if needed) to support 1-many sentence alignments. Do the same for other languages (e.g. hi_sent_id, hi_sent_id_2, hi_sent_id_3 for the hi-zh alignment). Add corresponding numbered fields for the sentence text. The prefix for these fields could match the language's slug value in Xposition ('en', 'zh', 'he', 'hi', 'ko', 'de').

   Alternatively, the UD notation may offer a list data structure that could be used in a single metadata field (more research needed).

2) In the misc column of the conllulex file, replicate the existing token-level annotations for each additional language, separated by a new delimiter that marks the boundary between languages. The order of languages follows the order of the language tags in the metadata.
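To make the two proposals concrete, here is a minimal Python sketch of how a reader might consume the new fields. It is not taken from `scripts/generate_alignments_from_conllulex.py`. The `# key = value` comment layout, the `;;` delimiter, the sentence IDs, and the function names `parse_metadata`, `aligned_sentence_ids` and `split_misc_by_language` are all assumptions made for illustration; the per-language annotation strings are treated as opaque.

```python
import re

# Language slugs listed in the proposal (Xposition slug values).
LANG_SLUGS = ("en", "zh", "he", "hi", "ko", "de")

# Matches proposed keys such as "en_sent_id", "en_sent_id_2", "hi_sent_id_3".
SENT_ID_KEY = re.compile(r"^(%s)_sent_id(?:_(\d+))?$" % "|".join(LANG_SLUGS))

# ASSUMPTION: hypothetical delimiter between per-language annotations in the
# misc column; the proposal does not name one.
LANG_DELIM = ";;"


def parse_metadata(comment_lines):
    """Parse '# key = value' comment lines into a dict, keeping file order."""
    meta = {}
    for line in comment_lines:
        if not line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.lstrip("#").partition("=")
        meta[key.strip()] = value.strip()
    return meta


def aligned_sentence_ids(meta):
    """Group aligned sentence IDs by language, in numbered-field order.

    Languages appear in the order their first field occurs in the metadata.
    """
    found = {}
    for key, value in meta.items():
        match = SENT_ID_KEY.match(key)
        if match:
            lang = match.group(1)
            index = int(match.group(2) or 1)  # unnumbered field counts as 1
            found.setdefault(lang, []).append((index, value))
    return {lang: [v for _, v in sorted(pairs)] for lang, pairs in found.items()}


def split_misc_by_language(misc, langs):
    """Split a misc value into one opaque annotation string per language."""
    parts = misc.split(LANG_DELIM)
    if len(parts) != len(langs):
        raise ValueError(
            "expected %d per-language annotations, got %d" % (len(langs), len(parts))
        )
    return dict(zip(langs, parts))


# Hypothetical metadata for one Chinese sentence aligned to two English
# sentences and one Hindi sentence (IDs are made up).
comments = [
    "# sent_id = zh-example-1",
    "# en_sent_id = en-example-1",
    "# en_sent_id_2 = en-example-2",
    "# hi_sent_id = hi-example-1",
]
meta = parse_metadata(comments)
ids = aligned_sentence_ids(meta)
per_lang = split_misc_by_language("EN_ANNOTATION;;HI_ANNOTATION", list(ids))
```

Deriving the language order from the metadata (`list(ids)`) keeps the misc column consistent with point 2 above. Whatever delimiter is finally chosen must not collide with `|`, which CoNLL-U already uses to separate entries inside the misc column.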

Also see #234 


Cross-Linguistic alignments for multiple language pairs #243
