Incorrect handling of supplementary Unicode characters #9

@danieldk

Tokenizing the following sentence:

"Dabei handelt es sich um Sequenzen aus zwei Zeichen, die Länderkürzeln nach ISO 3166-1 ALPHA-2 entsprechen, beispielsweise 🇩🇪 (U+1F1E9 U+1F1EA) für Deutschland."

This results in incorrect XML.

This is probably related to:
https://issues.apache.org/jira/browse/XALANJ-2419
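
For illustration, here is a minimal Java sketch of one way supplementary characters can end up as invalid XML. It assumes the serializer escapes non-ASCII text one UTF-16 code unit at a time, which has not been confirmed as the cause of this issue. The class `SupplementaryEscape` and both helper methods are hypothetical and are not part of the tokenizer or of Xalan. The flag 🇩🇪 consists of two code points (U+1F1E9 U+1F1EA) but four `char` values in a Java `String`, so escaping per `char` emits references to lone surrogates, which XML does not allow.

```java
final class SupplementaryEscape {
    // Hypothetical helper: escapes each non-ASCII code point as one
    // numeric character reference. (Markup characters such as '<' and '&'
    // are deliberately left alone to keep the sketch short.)
    static String escapeByCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
        });
        return out.toString();
    }

    // Hypothetical broken variant: iterates over UTF-16 code units, so each
    // half of a surrogate pair gets its own (invalid) character reference.
    static String escapeByChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(c).toUpperCase())
                   .append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F1E9 U+1F1EA written as UTF-16 surrogate pairs.
        String flag = "\uD83C\uDDE9\uD83C\uDDEA";
        System.out.println(flag.length());                         // UTF-16 code units
        System.out.println(flag.codePointCount(0, flag.length())); // code points
        System.out.println(escapeByCodePoint(flag));
        System.out.println(escapeByChar(flag));
    }
}
```

Comparing the two outputs for the sentence above should show whether the incorrect XML contains surrogate values (in the range D800 to DFFF) where a single reference per code point is expected.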
