diff --git a/resources/ar5iv-dataset-2024.md b/resources/ar5iv-dataset-2024.md
index 060db597f631624602742d55b7d79129e9a50cb3..f0ed0b9813e19b3b4d8f8f5bf40183c1bcad18d7 100644
--- a/resources/ar5iv-dataset-2024.md
+++ b/resources/ar5iv-dataset-2024.md
@@ -12,7 +12,7 @@ title: ar5iv 04.2024 - An HTML5 dataset for arXiv.org
 - 2,170,799 HTML documents
 - Three separate archive bundles, separated by LaTeXML conversion severity
 - For more information on severity, see [LaTeXML manual: Error Codes](https://math.nist.gov/~BMiller/LaTeXML/manual/errorcodes/)
-- The HTML sources total `318 GB` packaged, and `2.83 TB` unpacked.
+- The HTML sources total `318 GB` packaged, and `2.9 TB` unpacked.
 - you also need 2.18 million free inodes to unpack the full data (check via `df -ih .`)
 - This dataset is HTML-only and does not include images.
 
@@ -20,7 +20,7 @@ title: ar5iv 04.2024 - An HTML5 dataset for arXiv.org
 | subset ID | number of documents | size archived | size unpacked |
 | :--- | ---: | ---: | ---: |
 | no\_problem | 366,232 | 20 GB | 155 GB |
-| warning | 1,304,052 | 216 GB | 1989 GB |
+| warning | 1,304,052 | 216 GB | 2 TB |
 | error | 500,515 | 82 GB | 753 GB |
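
The totals touched by this change can be cross-checked against the per-subset table rows. The sketch below is only a sanity check, not part of the patch: the dictionary layout and the `totals` helper are hypothetical names introduced for illustration, the figures are copied from the table (using the pre-change `1989 GB` for the `warning` subset, since `2 TB` is a rounded value), and it assumes decimal units (1 TB = 1000 GB).

```python
# Figures copied from the table rows in the diff; sizes are in GB.
# The "warning" row uses the pre-change 1989 GB, since "2 TB" is rounded.
subsets = {
    "no_problem": {"documents": 366_232, "archived_gb": 20, "unpacked_gb": 155},
    "warning": {"documents": 1_304_052, "archived_gb": 216, "unpacked_gb": 1989},
    "error": {"documents": 500_515, "archived_gb": 82, "unpacked_gb": 753},
}


def totals(rows):
    """Sum each numeric column across all subsets (hypothetical helper)."""
    keys = ("documents", "archived_gb", "unpacked_gb")
    return {key: sum(row[key] for row in rows.values()) for key in keys}


t = totals(subsets)
print(t["documents"])                      # 2170799
print(t["archived_gb"])                    # 318
print(t["unpacked_gb"])                    # 2897
# Assumes decimal units: 1 TB = 1000 GB.
print(round(t["unpacked_gb"] / 1000, 1))   # 2.9
```

Under these assumptions the rows add up to 2,170,799 documents and `318 GB` packaged, and the unpacked sum of 2897 GB rounds to the `2.9 TB` introduced by the change rather than the previous `2.83 TB`.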