AssertionError assert len(raw) == len(sents) with 2018 shared task raw text #4

I've run `segmenter.py train` successfully with only the `conllu` files in the workspace. However, when I also include the raw text from the 2018 shared task as `raw_train.txt` and `raw_dev.txt`, I get:
```
Traceback (most recent call last):
  File "segmenter.py", line 155, in <module>
    reset=args.reset, tag_scheme=args.tags, ignore_mwt=args.ignore_mwt)
  File "/.../ud-parsing-2018/uusegmenter/toolbox.py", line 905, in raw2tags
    assert len(raw) == len(sents)
AssertionError
```
(Line numbers may be slightly off as I added some comments here and there.)

It seems that you assume the raw text has one sentence per line, but the shared task raw text does not use line breaks in this way. Did you not use the raw text during training?
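For reference, here is a minimal sketch of the kind of check that exposes the mismatch. It assumes the assertion compares the number of lines in the raw file with the number of sentences in the `conllu` file, which is my reading of the traceback and not something I have confirmed in `toolbox.py`. The helper names and the toy data are hypothetical.

```python
def count_conllu_sentences(conllu_text):
    """Count sentences in CoNLL-U text (token blocks separated by blank lines)."""
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if line.strip() == "":
            # A blank line closes the current sentence block.
            in_sentence = False
        elif not line.startswith("#"):
            # First token line of a new block starts a new sentence.
            if not in_sentence:
                count += 1
            in_sentence = True
    return count


def count_nonempty_lines(raw_text):
    """Count non-empty lines in a raw text file."""
    return sum(1 for line in raw_text.splitlines() if line.strip())


# Toy data (hypothetical): two sentences in CoNLL-U, but one line of raw text,
# as happens when the raw text keeps several sentences in one paragraph.
toy_conllu = (
    "# text = Hello there.\n"
    "1\tHello\t_\n2\tthere\t_\n3\t.\t_\n"
    "\n"
    "# text = How are you?\n"
    "1\tHow\t_\n2\tare\t_\n3\tyou\t_\n4\t?\t_\n"
    "\n"
)
toy_raw = "Hello there. How are you?\n"

n_sents = count_conllu_sentences(toy_conllu)
n_lines = count_nonempty_lines(toy_raw)
print(n_sents, n_lines)
```

If the two counts differ for `raw_train.txt` and its `conllu` file, an assertion of this form would fail regardless of the model settings.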

Do you use sentences as training instances? If so, wouldn't the CRF never see the context to the right of a sentence boundary? In English, for example, the capitalisation of the next letter is a strong cue. And in the worst case, wouldn't the CRF simply learn to check whether it is at the end of a sequence in order to assign T or U?
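To illustrate the alternative I have in mind, here is a minimal sketch that concatenates sentences and cuts the result into overlapping windows, so that sentence boundaries fall inside a training instance. This is not a description of how the segmenter works. The function names are hypothetical, and the two-label scheme (`T` on the last character of a sentence, `I` elsewhere) is a simplification of my own, not the actual tag set.

```python
def concat_with_tags(sentences):
    """Join sentences with single spaces and tag each character.

    Simplified labels (an assumption for illustration only):
    "T" marks the last character of a sentence, "I" everything else.
    """
    chars, tags = [], []
    for sent in sentences:
        if not sent:
            continue  # skip empty sentences
        if chars:
            chars.append(" ")
            tags.append("I")
        chars.extend(sent)
        tags.extend(["I"] * (len(sent) - 1) + ["T"])
    return chars, tags


def windows(chars, tags, size, stride):
    """Yield (chars, tags) slices of at most `size` items, `stride` apart."""
    for start in range(0, len(chars), stride):
        yield chars[start:start + size], tags[start:start + size]


chars, tags = concat_with_tags(["Hi there.", "Go on."])
for win_chars, win_tags in windows(chars, tags, size=12, stride=6):
    print("".join(win_chars), "".join(win_tags))
```

In the first window the `T` label sits in the middle of the sequence and is followed by ` G`, so a model trained on such instances would see the capital letter after the boundary and could not rely on sequence position alone.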
