Skip to content
Snippets Groups Projects
Commit c6929f56 authored by jfschaefer's avatar jfschaefer
Browse files

add resource for quantity expressions

parent 7370e2f6
No related branches found
No related tags found
No related merge requests found
Pipeline #
......@@ -5,6 +5,7 @@ title: SIGMathLing - Datasets and Resources
1. [arXMLiv corpus, 08.2017 release](/resources/arxmliv-dataset-082017/)
2. [arXMLiv word embeddings, 08.2017 release](/resources/arxmliv-embeddings-082017)
3. [quantity expressions](/resources/quantity-expressions)
Additional resources are en route, see the [plan](/technical/) for details.
## Quantity Expressions Dataset
This dataset contains the results of Ulrich Rabenstein's
[master thesis](https://gl.kwarc.info/supervision/MSc-archive/blob/master/2017/urabenstein/Rabenstein.pdf),
in which he developed a framework for the detection of quantity expressions in STEM documents.
### Accessibility and License
The content of this Dataset is licensed to [SIGMathLing members](/member/) for research
and tool development purposes.
Access is restricted to [SIGMathLing members](/member/) under the
[SIGMathLing Non-Disclosure-Agreement](/nda/) as for most [arXiv](http://arxiv.org)
articles, the right of distribution was only given (or assumed) to arXiv itself.
### Contents
* `Annotations.zip`: All quantity expressions detected by the spotter in a format suitable for the [Kwarc Annotation Tool (KAT)](https://github.com/kwarc/kat).
* `Documents.zip`: The documents in which quantity expressions were searched. These are modified arXMLiv documents in which each word is wrapped by a `<span>`. This was required by KAT to annotate words.
* `Harvest.zip`: Data for math web search.
* `screen-reader-documents.zip`: The documents prepared in a way that enables screen readers to read out units ("two kilometers" instead of "two k m" for "2km").
### Remarks on Annotation Format
The annotations are stored as RDF in a way suitable for the [Kwarc Annotation Tool (KAT)](https://github.com/kwarc/kat).
For more information on KAT and the KAT format consider reading [this](https://gl.kwarc.info/KAT/papers/blob/master/cicm14/paper.pdf)
and [this](https://gl.kwarc.info/KAT/papers/blob/master/cicm16/paper.pdf) paper.
In the example annotation below,
```
cse(%2F%2F*%5B%40id%3D'S1.p10.1'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w270'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w272'%5D)
```
describes the annotated quantity expression.
URL-decoding the expression in the parentheses, we can obtain the three comma-separated XPaths
```
//*[@id='S1.p10.1'],//*[@id='S1.p10.1.w270'],//*[@id='S1.p10.1.w272']
```
where the first path is the common parent, the second path is the start of the annotated range,
and the third path is the end of the annotated range.
```
<rdf:Description rdf:nodeID="KAT_5764208381">
<kat:run rdf:nodeID="kat_run"/>
<kat:kannspec rdf:nodeID="KAT_1_QuantityExpression"/>
<kat:concept>QuantityExpression</kat:concept>
<kat:type rdf:resource="http://kwarc.info/semanticextraction/KAnnSpec#quantityexpression"/>
<kat:annotates rdf:resource="http://localhost/procl.html#cse(%2F%2F*%5B%40id%3D'S1.p10.1'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w270'%5D%2C%2F%2F*%5B%40id%3D'S1.p10.1.w272'%5D)"/>
<kat:contentmathml rdf:parseType="Literal" score="1">
<apply>
<times/>
<cn>21</cn>
<apply>
<times/>
<apply>
<csymbol cd="Prefix">Prefix</csymbol>
<csymbol cd="centi">c</csymbol>
<csymbol cd="meter">m</csymbol>
</apply>
</apply>
</apply>
</kat:contentmathml>
</rdf:Description>
```
### Download
From [this repository](https://gl.kwarc.info/SIGMathLing/quantity-expressions) (only for [SIGMathLing members](/member/)).
### Evaluation
According to the thesis, a manual validation of 50 randomly selected documents containing in total 646 quantity expressions yielded the following values:
* Precision: 75%
* Recall: 93%
* F-Score: 83%
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment