Building an index from annotated data.
scripts/import/ud.py --input /path/to/data.conllu --output /path/to/indexSome metadata spans multiple sentences. This includes document and paragraph-level information. The ingestion script generates document and paragraph-level structures. Note that any other carryover metadata (e.g., meta:: fields in GUM) are applied as document-level structures.
By default, a trailing whitespace is added as a separate literal after each token which does not specify the SpaceAfter=No metadata field (under MISC). Alternatively, the --force-alignment flag can be set, in order to match tokens back to the original sentence text, inserting whitespaces where needed.
By default, a whitespace is added after each sentence to prevent exported sentences from following each other directly. This character can be changed using the --sentence-terminator flag (e.g., to "\n" or "").