Conversion of user-written markup turned out to be an interesting topic. We've identified a few particularities making the task distinct:
- Users do not really follow markup syntax. The texts commonly contain copy-pasted sections without any protection, effectively creating a "zoo" of random markup tags that were never intended to have special meaning.
- Markup specifications tend to be informal and vague. Even though CommonMark and GFM in particular went in the right direction with their extensive specs, the exact behavior is always determined by each particular implementation and its version.
- Markup specs only deal with well-formed sources. They give very few clues on how to interpret malformed input.
- The conversion is made on versioned content. Individual versions differ both in syntax and in semantic content. Users expect the diffs between versions to look similar before and after conversion. This becomes even trickier if the content represents some form of contract, where modifications are binding.
Based on how they treat syntax, conversion approaches fall into two categories:
- Source-level transformations - the source markup is parsed mostly at the lexical level and the conversion is done by substituting one set of known constructs with another set of constructs.
- Render-level transformations - the source markup is parsed and rendered to HTML using its normal rendering mechanism. Then some sort of inverse rendering is applied to produce the new markup format.
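The two categories can be sketched on a toy markup subset. The Python below is an illustration only, not the code of any real converter: the names `source_level`, `render_to_html` and `html_to_markdown` are hypothetical, and the supported subset (Textile `*bold*` and `* item` bullets) is an assumption chosen for brevity.

```python
import re
from html.parser import HTMLParser

TEXTILE = "* first *important* item\n* second item"

# Toy rule for inline bold: asterisks that hug non-space text.
BOLD = re.compile(r"(?<![\w*])\*(\S(?:.*?\S)?)\*(?![\w*])")


def source_level(textile: str) -> str:
    """Source-level: substitute known Textile constructs with Markdown ones."""
    out = []
    for line in textile.splitlines():
        line = BOLD.sub(r"**\1**", line)      # *text*  -> **text**
        line = re.sub(r"^\* ", "- ", line)    # "* item" -> "- item"
        out.append(line)
    return "\n".join(out)


def render_to_html(textile: str) -> str:
    """Stand-in for the source format's normal renderer (toy subset only)."""
    parts, in_list = [], False
    for line in textile.splitlines():
        is_item = line.startswith("* ")
        if is_item and not in_list:
            parts.append("<ul>")
        if not is_item and in_list:
            parts.append("</ul>")
        in_list = is_item
        text = BOLD.sub(r"<strong>\1</strong>", line[2:] if is_item else line)
        parts.append(f"<li>{text}</li>" if is_item else f"<p>{text}</p>")
    if in_list:
        parts.append("</ul>")
    return "".join(parts)


class MarkdownWriter(HTMLParser):
    """Inverse rendering: walk the HTML and emit Markdown."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.out.append("- ")
        elif tag == "strong":
            self.out.append("**")

    def handle_endtag(self, tag):
        if tag == "strong":
            self.out.append("**")
        elif tag == "li":
            self.out.append("\n")
        elif tag == "p":
            self.out.append("\n\n")

    def handle_data(self, data):
        self.out.append(data)


def html_to_markdown(html: str) -> str:
    writer = MarkdownWriter()
    writer.feed(html)
    writer.close()
    return "".join(writer.out).strip()


print(source_level(TEXTILE))                      # source-level path
print(html_to_markdown(render_to_html(TEXTILE)))  # render-level path
```

On this well-formed input both paths yield the same Markdown. The approaches start to differ once the source contains constructs that the renderer interprets differently from what the user intended.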
The render-level transformation is appropriate if:
- the rendered HTML output is semantic enough to reconstruct semantic features in the target format
- it is applied to well-formed documents
- any mistakes in the source markup have a local and limited impact
- there is a benefit in using HTML as a common intermediate format
The source-level transformations are more appropriate if:
- the markup sources are treated as the authoritative information
- the conditions for render-level transformations are not met
- the source markup can render into HTML that can't be easily inverse-rendered to the target markup
- we want to treat typical markup issues in our data smartly and consistently
Consider the following example. Users were typing lists with a leading space (revision r1), which started to behave differently as of Redmine 3.4.7. A user then removed the leading spaces in revision r2. The table compares the behavior of:
- Source→MD, represented by the `TextileToMarkdown` converter, and
- Render→MD, represented by an external (`Ws`) converter derived from Turndown, chained to the `RedmineFormatter` converter.
| | Textile | Source→MD | Render→MD |
|---|---|---|---|
| v3.4.6 r1 | | | |
| v3.4.7 r1 | | | |
| v3.4.7 r2 | | | |
| v3.4.7 diff r1 r2 | `- * a`<br>`- * b`<br>`+* a`<br>`+* b` | (empty) | `- - a * b`<br>`+ - a`<br>`+ - b` |
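The Textile and Source→MD cells of the diff row can be reproduced with a small sketch. The Python below is an illustration, not the plugin's code: `to_markdown_source_level` is a hypothetical stand-in for a source-level converter, and its rule of accepting one optional leading space before a bullet is an assumed normalization. The Render→MD column is not simulated here.

```python
import difflib
import re

R1 = " * a\n * b"  # revision r1: bullets typed with a leading space
R2 = "* a\n* b"    # revision r2: the user removed the leading spaces


def to_markdown_source_level(textile: str) -> str:
    """Hypothetical converter: treat ' * x' and '* x' as the same bullet."""
    return "\n".join(
        re.sub(r"^ ?\* ", "- ", line) for line in textile.splitlines()
    )


def changed_lines(old: str, new: str) -> list:
    """Return the added/removed lines of a unified diff, without headers."""
    diff = list(
        difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    )
    # The first two entries are the '---' / '+++' file headers (if any diff exists).
    return [line for line in diff[2:] if not line.startswith("@@")]


# Diff of the Textile sources between r1 and r2.
print(changed_lines(R1, R2))

# Diff of the converted Markdown between r1 and r2.
print(changed_lines(to_markdown_source_level(R1), to_markdown_source_level(R2)))
```

Because the assumed converter maps both revisions to the same Markdown, the second diff is empty, so the whitespace-only change in the source does not show up as a change after conversion.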