
Cross-Linguistic alignments for multiple language pairs #243

@nitinvwaran


The Chinese LPP conllulex file contains sentence-level and token-level alignments to the English LPP file. The script scripts/generate_alignments_from_conllulex.py takes the Chinese conllulex file as input and generates alignments for the Chinese-English language pair.

The following changes are proposed to extend the alignments to support a) multiple language pairs and b) 1-many sentence alignments.

  1. Taking English as the example, add a new metadata field en_sent_id_2 (and en_sent_id_3 if needed) to support 1-many sentence alignments, and do the same for other languages (e.g. hi_sent_id, hi_sent_id_2, hi_sent_id_3 for hi-zh alignment). Add corresponding numbered fields for the sentence text. The prefix for these fields could match the language's slug value in Xposition ('en', 'zh', 'he', 'hi', 'ko', 'de').

    Alternatively, the UD format may offer a list-valued notation that could be used in a single metadata field (more research needed).

  2. In the MISC column of the conllulex file, replicate the existing token-level annotations for each additional language, separated by a new delimiter that marks the boundary between languages. The order of languages follows the order of the language tags in the metadata.
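
To make the two proposals concrete, here is a rough Python sketch of how a reader of the extended file might handle them. It is only an illustration under stated assumptions, not what scripts/generate_alignments_from_conllulex.py currently does: the helper names (parse_metadata, aligned_languages, aligned_sent_ids, split_misc) are hypothetical, the `;;` delimiter is a placeholder for whatever delimiter is eventually chosen, and the per-language MISC segments are treated as opaque strings because their internal format is not specified here.

```python
import re

# Matches CoNLL-U style sentence-level comment lines such as "# key = value".
_META_RE = re.compile(r"^#\s*([^=\s]+)\s*=\s*(.*)$")

# Matches the proposed alignment keys: en_sent_id, en_sent_id_2, hi_sent_id, ...
_ALIGN_KEY_RE = re.compile(r"^([a-z]{2,3})_sent_id(?:_\d+)?$")

# Hypothetical delimiter between per-language annotation groups in MISC.
LANG_DELIM = ";;"


def parse_metadata(lines):
    """Return the "# key = value" comment lines of one sentence as a dict."""
    meta = {}
    for line in lines:
        match = _META_RE.match(line.strip())
        if match:
            meta[match.group(1)] = match.group(2).strip()
    return meta


def aligned_languages(meta):
    """List language prefixes in the order they first appear in the metadata."""
    langs = []
    for key in meta:
        match = _ALIGN_KEY_RE.match(key)
        if match and match.group(1) not in langs:
            langs.append(match.group(1))
    return langs


def aligned_sent_ids(meta, lang):
    """Collect <lang>_sent_id, <lang>_sent_id_2, <lang>_sent_id_3, ... in order."""
    ids = []
    key = f"{lang}_sent_id"
    n = 1
    while key in meta:
        ids.append(meta[key])
        n += 1
        key = f"{lang}_sent_id_{n}"
    return ids


def split_misc(misc, langs):
    """Pair each language tag with its (opaque) segment of the MISC value."""
    segments = misc.split(LANG_DELIM)
    if len(segments) != len(langs):
        raise ValueError(
            f"expected {len(langs)} language segments, found {len(segments)}"
        )
    return dict(zip(langs, segments))


# Made-up sentence ids and MISC content, for illustration only.
example_meta = parse_metadata([
    "# sent_id = zh-0001",
    "# en_sent_id = en-0001",
    "# en_sent_id_2 = en-0002",
    "# hi_sent_id = hi-0001",
])
example_langs = aligned_languages(example_meta)
example_en_ids = aligned_sent_ids(example_meta, "en")
example_misc = split_misc("EN_ANNOTATIONS;;HI_ANNOTATIONS", example_langs)
```

With numbered fields, a 1-many alignment is just the list returned by aligned_sent_ids, and the language order recovered from the metadata is what tells split_misc which MISC segment belongs to which language. If the list-valued alternative from item 1 is adopted instead, only aligned_sent_ids would need to change.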

Also see #234
