Commit 46f63796 authored by Deyan Ginev

caution on the larger size estimates

parent d84d43c1
Pipeline #5931 passed
@@ -12,7 +12,7 @@ title: ar5iv 04.2024 - An HTML5 dataset for arXiv.org
- 2,170,799 HTML documents
- Three separate archive bundles, separated by LaTeXML conversion severity
- For more information on severity, see [LaTeXML manual: Error Codes](https://math.nist.gov/~BMiller/LaTeXML/manual/errorcodes/)
-- The HTML sources total `318 GB` packaged, and `2.83 TB` unpacked.
+- The HTML sources total `318 GB` packaged, and `2.9 TB` unpacked.
- you also need 2.18 million free inodes to unpack the full data (check via `df -ih .`)
- This dataset is HTML-only and does not include images.
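
The storage requirements listed above (about `2.9 TB` unpacked and 2.18 million free inodes) can also be checked programmatically before unpacking. The following is a minimal sketch, not part of the dataset tooling: the function names `enough` and `free_resources` are hypothetical, it treats "TB" as 10^12 bytes (an assumption), and it relies on `os.statvfs`, which is only available on Unix-like systems.

```python
import os

# Figures quoted in the README; treating "TB" as 10**12 bytes is an assumption.
NEED_BYTES = 2_900 * 10**9
NEED_INODES = 2_180_000


def enough(free_bytes, free_inodes, need_bytes=NEED_BYTES, need_inodes=NEED_INODES):
    """Return (space_ok, inodes_ok) for the given free resources."""
    return free_bytes >= need_bytes, free_inodes >= need_inodes


def free_resources(path="."):
    """Read free bytes and free inodes for the filesystem holding `path` (Unix only)."""
    st = os.statvfs(path)
    # f_bavail and f_favail count blocks and inodes available to unprivileged users.
    return st.f_bavail * st.f_frsize, st.f_favail


if __name__ == "__main__":
    space_ok, inodes_ok = enough(*free_resources("."))
    print(f"space ok: {space_ok}, inodes ok: {inodes_ok}")
```

Some filesystems do not report a meaningful inode count, so a failed inode check is best confirmed with `df -ih .` as the README suggests.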
@@ -20,7 +20,7 @@ title: ar5iv 04.2024 - An HTML5 dataset for arXiv.org
| subset ID | number of documents | size archived | size unpacked |
| :--- | ---: | ---: | ---: |
| no\_problem | 366,232 | 20 GB | 155 GB |
-| warning | 1,304,052 | 216 GB | 1989 GB |
+| warning | 1,304,052 | 216 GB | 2 TB |
| error | 500,515 | 82 GB | 753 GB |
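
As a quick consistency check on the figures in this diff, the per-subset rows can be summed and compared with the totals in the summary list. This sketch uses the pre-rounding `1989 GB` value for the `warning` subset. Reading the old `2.83 TB` figure as the same total divided by 1024 is only one possible interpretation and is not confirmed by the commit.

```python
# Rows as listed in the table: (documents, GB archived, GB unpacked).
rows = {
    "no_problem": (366_232, 20, 155),
    "warning": (1_304_052, 216, 1989),
    "error": (500_515, 82, 753),
}

docs = sum(r[0] for r in rows.values())
archived_gb = sum(r[1] for r in rows.values())
unpacked_gb = sum(r[2] for r in rows.values())

print(docs)                              # 2170799
print(archived_gb)                       # 318
print(unpacked_gb)                       # 2897
print(round(unpacked_gb / 1000, 2))      # 2.9  (dividing by 1000)
print(round(unpacked_gb / 1024, 2))      # 2.83 (dividing by 1024)
```

The document count and archived size match the summary exactly, and the unpacked total rounds to the new `2.9 TB` figure when divided by 1000.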