I've run segmenter.py train successfully with just conllu files in the workspace, but when I include the raw text from the 2018 shared task as raw_train.txt and raw_dev.txt, I get:
Traceback (most recent call last):
  File "segmenter.py", line 155, in <module>
    reset=args.reset, tag_scheme=args.tags, ignore_mwt=args.ignore_mwt)
  File "/.../ud-parsing-2018/uusegmenter/toolbox.py", line 905, in raw2tags
    assert len(raw) == len(sents)
AssertionError
(Line numbers may be slightly off as I added some comments here and there.)
It seems that you assume the raw text has one sentence per line, but the shared task raw text does not use line breaks in this way. Did you not use the raw text during training?
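To make the suspected mismatch concrete, here is a minimal sketch of the check I have in mind. It assumes, without my having verified it in toolbox.py, that raw ends up with one entry per non-empty line of the raw file and sents with one entry per CoNLL-U sentence. The function names and sample data below are hypothetical and not taken from the segmenter.

```python
def count_conllu_sentences(conllu_text):
    """Count sentences as runs of non-blank lines separated by blank lines."""
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if line.strip():
            if not in_sentence:
                count += 1
                in_sentence = True
        else:
            in_sentence = False
    return count


def count_raw_lines(raw_text):
    """Count non-empty lines in the raw text file."""
    return sum(1 for line in raw_text.splitlines() if line.strip())


# Hypothetical sample: two sentences in CoNLL-U (columns abbreviated),
# but the raw text keeps both sentences on one paragraph line.
sample_conllu = (
    "# text = It works.\n"
    "1\tIt\tit\n"
    "2\tworks\twork\n"
    "3\t.\t.\n"
    "\n"
    "# text = Does it?\n"
    "1\tDoes\tdo\n"
    "2\tit\tit\n"
    "3\t?\t?\n"
    "\n"
)
sample_raw = "It works. Does it?\n"

n_sents = count_conllu_sentences(sample_conllu)
n_lines = count_raw_lines(sample_raw)
print(n_sents, n_lines, n_sents == n_lines)
```

If the two counts differ for the shared task files, an assertion of the form len(raw) == len(sents) would fail regardless of the rest of the setup.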
Do you use sentences as training instances? If so, the CRF would never see the context to the right of a sentence boundary. In English, for example, the capitalisation of the next letter is a strong cue. In the worst case, wouldn't the CRF simply learn to check whether it has reached the end of a sequence and assign T or U there?
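The concern can be illustrated with a toy sketch. The tag scheme here is a deliberate simplification of my own ("T" on the last character of a sentence, "I" elsewhere, "X" on the joining space) and is not the segmenter's actual scheme; all function names are hypothetical. It assumes non-empty sentences.

```python
def tag_sentence(sentence):
    """Toy scheme: 'T' on the sentence-final character, 'I' elsewhere."""
    return ["I"] * (len(sentence) - 1) + ["T"]


def per_sentence_instances(sentences):
    """One training instance per sentence."""
    return [(s, tag_sentence(s)) for s in sentences]


def chunked_instances(sentences, per_chunk=2):
    """Join several sentences into one instance, separated by a space."""
    instances = []
    for i in range(0, len(sentences), per_chunk):
        chars, tags = [], []
        for j, s in enumerate(sentences[i:i + per_chunk]):
            if j > 0:
                chars.append(" ")
                tags.append("X")  # toy tag for the joining space
            chars.extend(s)
            tags.extend(tag_sentence(s))
        instances.append(("".join(chars), tags))
    return instances


def boundary_only_at_end(instances):
    """True if the first 'T' in every instance sits at the final position."""
    return all(tags.index("T") == len(tags) - 1 for _, tags in instances)


sentences = ["It works.", "Does it?", "Yes.", "Good."]
print(boundary_only_at_end(per_sentence_instances(sentences)))
print(boundary_only_at_end(chunked_instances(sentences)))
```

With one sentence per instance, the boundary tag always coincides with the end of the sequence, so position alone predicts it. With chunked instances, boundaries also occur mid-sequence, followed by the next sentence's first characters, so a model has to rely on the surrounding text.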