SIGMathLing / website, commit 84d41370
Authored Jul 09, 2019 by Michael Kohlhase
Commit message: more
Parent: d63bd576
Pipeline #1694 passed with stage in 2 minutes and 10 seconds
_posts/2018-09-13.md (new file, mode 0 → 100644)
---
layout: post
title: 2018 arXiv Datasets (1.2 Million scientific HTML5 documents from arXiv and token models)
---
SIGMathLing has published the second set of arXiv data sets. The content of these data sets is licensed to [SIGMathLing members](/member/) for research and tool development purposes, subject to the [SIGMathLing Non-Disclosure-Agreement](/nda/).

This collection of 1.2 million HTML5 documents was developed as part of the [arXMLiv](https://kwarc.info/projects/arXMLiv/) project at the [KWARC](https://kwarc.info/) research group. It was created by converting the [arXiv collection of scientific preprints until August 2018](http://arxiv.org) via [LaTeXML](https://github.com/brucemiller/LaTeXML) using the [CorTeX corpus management system](https://github.com/dginev/CorTeX).

The token models were generated from this document collection using the [LLaMaPuN 0.2](https://github.com/KWARC/llamapun/releases/tag/0.2.0) and [GloVe 1.2](https://github.com/stanfordnlp/GloVe/tree/07d59d5e6584e27ec758080bba8b51fce30f69d8) libraries.

Details can be found on the [SIGMathLing Resource page](/resources/).