diff --git a/arxmliv.md b/arxmliv.md
index 3185898cbd2453a62ed5939b8c7767e25de24dfb..0109a3b9f55d49cbce86edfcdae54584307ff2ca 100644
--- a/arxmliv.md
+++ b/arxmliv.md
@@ -1,4 +1,6 @@
-An HTML dataset of arXiv.org
+# An HTML5 dataset for arXiv.org
+
+Part of the [arXMLiv](https://kwarc.info/systems/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group
 
 ### Current release
  - 08.2017
@@ -6,17 +8,21 @@ An HTML dataset of arXiv.org
 ### License
 TODO: Official SIGMathLing license link
 
-### Generated by
+### Generated via
  - [LaTeXML 0.8.2](https://github.com/brucemiller/LaTeXML/releases/tag/v0.8.2),
  - [CorTeX 0.2](https://github.com/dginev/CorTeX/releases/tag/0.2.0)
 
-### Details:
- - Size: `todo-add-size`GB archived,
- - MD5: `todo-add-hash` arXMLiv_08_2017.zip
- - Contents:
- - 1,088,375 HTML5 documents
- - By conversion severity: 112,088 `no_problem`, 574,642 `warning`, 401,645 `error`
- 
+### Contents
+ - 1,088,370 HTML5 documents
+ - Three separate archive bundles separated by LaTeXML conversion severity
+
+| subset | MD5 | number of documents | size archived | size unpacked |
+| --- | --- | --- | --- | --- |
+| arXMLiv_08_2017_no_problem.zip | `036945755c7cc75ea1577cf04ca4fead` | 112,088 | 5 GB | 37 GB |
+| arXMLiv_08_2017_warning.zip | md5 | 574,638 | | 595 GB |
+| arXMLiv_08_2017_error.zip | md5 | 401,644 | | 421 GB |
+
+
 ### Description:
 This is a first public release of the arXMLiv dataset generated by the [KWARC](https://kwarc.info/) research group. Its intended redistribution is confined to the scope of the [SIGMathLing] interest group, and access is members-only..
 
@@ -30,3 +36,4 @@ An HTML dataset of arXiv.org
 
 ### Download
 [Download link (password-protected)](https://gl.kwarc.info/SIGMathLing/dataset-arXMLiv-08-2017)
+ 
\ No newline at end of file
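The new table publishes an MD5 checksum per archive bundle and a document count per subset. As a rough illustration of how a downloaded bundle could be checked against those values, here is a minimal Python sketch. It is not part of the arXMLiv tooling: the helper names (`md5_of_file`, `matches_published_md5`) and the two dictionaries are hypothetical, and only the `no_problem` checksum is filled in because the diff still carries `md5` placeholders for the other two bundles.

```python
import hashlib
from pathlib import Path

# Checksums copied from the table in the diff. Only one value is published
# so far; the other two rows still hold the placeholder "md5".
PUBLISHED_MD5 = {
    "arXMLiv_08_2017_no_problem.zip": "036945755c7cc75ea1577cf04ca4fead",
}

# Document counts per subset, also copied from the table.
DOCUMENT_COUNTS = {
    "no_problem": 112_088,
    "warning": 574_638,
    "error": 401_644,
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path):
    """Compare a local archive against the checksum published for its file name."""
    name = Path(path).name
    if name not in PUBLISHED_MD5:
        raise KeyError(f"no published MD5 for {name}")
    return md5_of_file(path) == PUBLISHED_MD5[name]


# The per-subset counts add up to the total stated under "Contents".
assert sum(DOCUMENT_COUNTS.values()) == 1_088_370
```

Under these assumptions, calling `matches_published_md5("arXMLiv_08_2017_no_problem.zip")` on a local copy would return `True` only if the download is byte-identical to the published bundle. Reading in chunks matters here because the archives are several gigabytes in size.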