HTML Transform: Erroneous HTML Table Parsing #635

@Lalit7374

Bug Report 🐛

Whenever an HTML table is defined with a caption, the transformation to Markdown yields an invalid Markdown table.

Expected Behavior

The following HTML table,

<table>
<caption>Average monthly active recipients of the service, in the EU region over prior 6 months (est.)
</caption>
<tbody><tr>
<th>
</th>
<th>Aug. 2022 - Jan. 2023
</th>
<th>Feb. 2023 - July 2023
</th></tr>
<tr>
<td>Wikibooks
</td>
<td>6,919,000
</td>
<td>1,611,000
</td></tr>
<tr>
<td>Wikidata
</td>
<td>1,056,000
</td>
<td>1,051,000
</td></tr>
<tr>
<td>Wikimedia Commons
</td>
<td>2,845,000
</td>
<td>3,272,000
</td></tr>
<tr>
<td>Wikinews
</td>
<td>6,283,000
</td>
<td>1,035,000
</td></tr>
<tr>
<td>Wikipedia
</td>
<td><b>151,556,000</b>
</td>
<td><b>151,088,000</b>
</td></tr>
<tr>
<td>Wikiquote
</td>
<td>6,811,000
</td>
<td>1,548,000
</td></tr>
<tr>
<td>Wikisource
</td>
<td>7,106,000
</td>
<td>1,845,000
</td></tr>
<tr>
<td>Wikispecies
</td>
<td>29,000
</td>
<td>37,000
</td></tr>
<tr>
<td>Wikiversity
</td>
<td>6,360,000
</td>
<td>1,082,000
</td></tr>
<tr>
<td>Wikivoyage
</td>
<td>616,000
</td>
<td>632,000
</td></tr>
<tr>
<td>Wiktionary
</td>
<td>8,955,000
</td>
<td>8,425,000
</td></tr>
<tr>
<td><i><span style="color: gray; white-space: pre-wrap">Est. devices per person</span></i>
</td>
<td>2.4<sup id="cite_ref-Cisco_1-0" class="reference"><a href="#cite_note-Cisco-1">&#91;1&#93;</a></sup>
</td>
<td>2.4<sup id="cite_ref-Cisco_1-1" class="reference"><a href="#cite_note-Cisco-1">&#91;1&#93;</a></sup>
</td></tr></tbody></table>

should be transformed into the following valid Markdown,

|     |     |     |
| --- | --- | --- |  
Average monthly active recipients of the service, in the EU region over prior 6 months (est.)
|     | Aug. 2022 - Jan. 2023 | Feb. 2023 - July 2023 |
| Wikibooks | 6,919,000 | 1,611,000 |
| Wikidata | 1,056,000 | 1,051,000 |
| Wikimedia Commons | 2,845,000 | 3,272,000 |
| Wikinews | 6,283,000 | 1,035,000 |
| Wikipedia | **151,556,000** | **151,088,000** |
| Wikiquote | 6,811,000 | 1,548,000 |
| Wikisource | 7,106,000 | 1,845,000 |
| Wikispecies | 29,000 | 37,000 |
| Wikiversity | 6,360,000 | 1,082,000 |
| Wikivoyage | 616,000 | 632,000 |
| Wiktionary | 8,955,000 | 8,425,000 |
|     | 2.4[\[1\]](#cite_note-Cisco-1) | 2.4[\[1\]](#cite_note-Cisco-1) |

which renders as a valid Markdown table:

|     |     |     |
| --- | --- | --- |
| Average monthly active recipients of the service, in the EU region over prior 6 months (est.) |     |     |
|     | Aug. 2022 - Jan. 2023 | Feb. 2023 - July 2023 |
| Wikibooks | 6,919,000 | 1,611,000 |
| Wikidata | 1,056,000 | 1,051,000 |
| Wikimedia Commons | 2,845,000 | 3,272,000 |
| Wikinews | 6,283,000 | 1,035,000 |
| Wikipedia | 151,556,000 | 151,088,000 |
| Wikiquote | 6,811,000 | 1,548,000 |
| Wikisource | 7,106,000 | 1,845,000 |
| Wikispecies | 29,000 | 37,000 |
| Wikiversity | 6,360,000 | 1,082,000 |
| Wikivoyage | 616,000 | 632,000 |
| Wiktionary | 8,955,000 | 8,425,000 |
|     | 2.4[1] | 2.4[1] |
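
For illustration, here is a minimal sketch of one way a captioned HTML table could be flattened into a pipe table. It is not the Accord Project implementation: the class `TableToMarkdown` and the function `table_to_markdown` are hypothetical names, and the sketch uses only Python's standard `html.parser`. It makes two simplifying choices that differ from the expected output above: the caption is emitted as a paragraph before the table instead of as a row, and the first `<tr>` is used as the header row. Inline formatting such as bold text and links is reduced to plain text.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collects the caption and cell text of a single, non-nested HTML table."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.caption = ""
        self.rows = []      # list of rows, each a list of cell strings
        self._buf = None    # text buffer for the currently open caption or cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td", "caption"):
            self._buf = []

    def handle_data(self, data):
        # Text outside a caption or cell (e.g. newlines between tags) is ignored.
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag not in ("th", "td", "caption") or self._buf is None:
            return
        # Collapse newlines and runs of spaces so each cell stays on one line.
        text = " ".join("".join(self._buf).split())
        self._buf = None
        if tag == "caption":
            self.caption = text
        elif self.rows:
            # Escape literal pipes so they do not split the cell.
            self.rows[-1].append(text.replace("|", "\\|"))


def table_to_markdown(html):
    """Return a pipe table (preceded by the caption, if any) for `html`."""
    parser = TableToMarkdown()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return parser.caption
    header, *body = parser.rows
    lines = [parser.caption, ""] if parser.caption else []
    lines.append("| " + " | ".join(header) + " |")
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

The whitespace collapse in `handle_endtag` is the step that keeps a cell written as `<td>Wikibooks` followed by a newline and `</td>` on a single table row.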

Current Behavior

Given the HTML table above, which includes a caption, the tool transforms the HTML into the following Markdown content,

Average monthly active recipients of the service, in the EU region over prior 6 months (est.)
| | Aug. 2022 - Jan. 2023
| Feb. 2023 - July 2023
|
| Wikibooks
| 6,919,000
| 1,611,000
|
| Wikidata
| 1,056,000
| 1,051,000
|
| Wikimedia Commons
| 2,845,000
| 3,272,000
|
| Wikinews
| 6,283,000
| 1,035,000
|
| Wikipedia
| 151,556,000 | 151,088,000 |
| Wikiquote
| 6,811,000
| 1,548,000
|
| Wikisource
| 7,106,000
| 1,845,000
|
| Wikispecies
| 29,000
| 37,000
|
| Wikiversity
| 6,360,000
| 1,082,000
|
| Wikivoyage
| 616,000
| 632,000
|
| Wiktionary
| 8,955,000
| 8,425,000
|
| Est. devices per person | 2.4[[1]](#cite_note-Cisco-1 "") | 2.4[[1]](#cite_note-Cisco-1 "") |

which is an invalid Markdown table:

Average monthly active recipients of the service, in the EU region over prior 6 months (est.)
| | Aug. 2022 - Jan. 2023
| Feb. 2023 - July 2023
|
| Wikibooks
| 6,919,000
| 1,611,000
|
| Wikidata
| 1,056,000
| 1,051,000
|
| Wikimedia Commons
| 2,845,000
| 3,272,000
|
| Wikinews
| 6,283,000
| 1,035,000
|
| Wikipedia
| 151,556,000 | 151,088,000 |
| Wikiquote
| 6,811,000
| 1,548,000
|
| Wikisource
| 7,106,000
| 1,845,000
|
| Wikispecies
| 29,000
| 37,000
|
| Wikiversity
| 6,360,000
| 1,082,000
|
| Wikivoyage
| 616,000
| 632,000
|
| Wiktionary
| 8,955,000
| 8,425,000
|
| Est. devices per person | 2.4[1] | 2.4[1] |
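
To make "invalid" concrete: under the GitHub Flavored Markdown table extension, a table is only recognised when a header row is immediately followed by a delimiter row with the same number of cells. The sketch below checks that condition for a pair of lines. The helper names `split_row` and `starts_gfm_table` are hypothetical, and the check is a simplification of the specification (it ignores pipes inside code spans, for example).

```python
import re

# A delimiter cell is a run of hyphens with an optional colon on either side.
DELIMITER_CELL = re.compile(r"^:?-+:?$")


def split_row(line):
    """Split a pipe-table row into cells, ignoring one leading/trailing pipe."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|") and not line.endswith("\\|"):
        line = line[:-1]
    # Split on pipes that are not escaped with a backslash.
    return [cell.strip() for cell in re.split(r"(?<!\\)\|", line)]


def starts_gfm_table(lines):
    """True if lines[0] and lines[1] form a header row plus delimiter row."""
    if len(lines) < 2 or "|" not in lines[1]:
        return False
    header = split_row(lines[0])
    delimiter = split_row(lines[1])
    return (
        len(header) == len(delimiter)
        and all(DELIMITER_CELL.match(cell) for cell in delimiter)
    )


# The first lines of the expected output and of the actual output above.
expected = ["|     |     |     |", "| --- | --- | --- |"]
actual = [
    "Average monthly active recipients of the service, "
    "in the EU region over prior 6 months (est.)",
    "| | Aug. 2022 - Jan. 2023",
    "| Feb. 2023 - July 2023",
    "|",
]
print(starts_gfm_table(expected))
print(any(starts_gfm_table(actual[i:i + 2]) for i in range(len(actual) - 1)))
```

The generated output contains no delimiter row at all, and the line breaks inside cells split each row across several lines, so no pair of adjacent lines can start a table.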

Steps to Reproduce

  1. npm install -g @accordproject/markdown-cli
  2. wget https://foundation.wikimedia.org/wiki/Legal:EU_DSA_Userbase_Statistics --output-document test.html
  3. markus transform --from html --to markdown --input test.html --output test.md
  4. Open test.md in a Markdown renderer to visualise the invalid table.
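
A smaller input may make it easier to check whether the caption alone triggers the problem. The sketch below writes a trimmed copy of the table from this report (caption, header row, one data row) to a file. The file name `minimal.html` and the helper `write_minimal_case` are hypothetical, and this reduced case has not been confirmed to reproduce the bug.

```python
from pathlib import Path

# A trimmed copy of the table from this report: caption, header row, one data row.
MINIMAL_HTML = """<table>
<caption>Average monthly active recipients of the service, in the EU region over prior 6 months (est.)
</caption>
<tbody><tr>
<th>
</th>
<th>Aug. 2022 - Jan. 2023
</th>
<th>Feb. 2023 - July 2023
</th></tr>
<tr>
<td>Wikibooks
</td>
<td>6,919,000
</td>
<td>1,611,000
</td></tr></tbody></table>
"""


def write_minimal_case(path="minimal.html"):
    """Write the trimmed table to `path` and return the path written."""
    target = Path(path)
    target.write_text(MINIMAL_HTML, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(write_minimal_case())
```

The resulting file can be passed to the command from step 3 with the file names swapped: `markus transform --from html --to markdown --input minimal.html --output minimal.md`.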

Context (Environment)

Parsing HTML to Markdown for web archiving.

Desktop


OS: Ubuntu Linux
Version: Markus 0.16.22 (markdown-cli)
