Commit fef0d72d authored by Ulrich

reworked the whole thesis

parent e4243a1a
In this section, we introduce the technologies necessary for the thesis. They include the markup language
MathML (Section \ref{ssec:mathml}), LaTeXML -- the converter from LaTeX to XML -- and the arXMLiv corpus (Section~\ref{ssec:latexml}),
the annotation tool KAT (Section~\ref{ssec:kat}) and the unit converter from the community Python package Astropy (Section~\ref{ssec:astropy}).
We assume basic knowledge of LaTeX, HTML and XML.
\label{sec:basics}
\subsection{MathML}
\label{ssec:mathml}
MathML~\cite{Miner:14:MML} is a mathematical markup language which can be used to write formulae
in HTML.
There are two different strains of it -- presentation MathML and content
MathML. Presentation MathML describes the -- possibly ambiguous -- layout of
......@@ -23,7 +23,7 @@ We assume basic knowledge of LaTeX and HTML.
\inline{semantics} child and write presentation MathML directly as a
child of the \inline{math} node.
We use angle brackets to denote meta variables in XML like \meta{Presentation MathML},
\meta{Content MathML} and \meta{LaTeX} in Figure~\ref{fig:generalMathML}.
We briefly introduce
presentation and content MathML and point to~\cite{Miner:14:MML} for details.
......@@ -32,26 +32,26 @@ We assume basic knowledge of LaTeX and HTML.
\begin{figure}
\lstinputlisting[language=XML,frame=single]{xml/mathml.xml}
\centering
\caption{A General Frame for a MathML Expression.}
\label{fig:generalMathML}
\end{figure}
\subsubsection{Presentation MathML}
Figure~\ref{fig:presMathMLexample} shows the presentation MathML encoding of
the formula \mq{$3.0 \cdot 10^{-17}\si{\micro\metre}$} as an example. There are two kinds of
elements in presentation MathML -- token elements and layout schemata.
Token elements include numbers (\inline{mn}), operators (\inline{mo}),
identifiers (\inline{mi}) and text (\inline{mtext}), while layout schemata
include horizontal grouping (\inline{mrow}), superscript (\inline{msup}) and
subscript (\inline{msub}). Tags can be modified by attributes, such as the
\tt{mathvariant} attribute. Its value \tt{normal} ensures an upright
font for the correct presentation of a unit in MathML -- for details, see
the W3C note about units in MathML~\cite{Devitt:03:UM}.
\begin{figure}
\lstinputlisting[language=XML,frame=single]{xml/presMathMLExample.xml}
\centering
\caption{A Presentation MathML Expression for \mq{$3.0 \cdot 10^{-17} \text{µm}$}.}
\label{fig:presMathMLexample}
\end{figure}
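The element kinds described above can be illustrated with a minimal Python sketch that assembles presentation MathML for the example quantity. The function name \inline{quantity_to_pmml} and the exact nesting are assumptions made for illustration; the encoding in Figure~\ref{fig:presMathMLexample} may differ in detail.

```python
import xml.etree.ElementTree as ET


def quantity_to_pmml(mantissa, exponent, unit):
    """Build presentation MathML for `mantissa * 10^exponent unit`.

    Hypothetical helper: `mantissa` and `unit` are strings, `exponent` is an int.
    """
    math = ET.Element("math")
    row = ET.SubElement(math, "mrow")          # layout schema: horizontal grouping
    ET.SubElement(row, "mn").text = mantissa   # token element: number
    ET.SubElement(row, "mo").text = "\u22c5"   # token element: operator (dot)
    sup = ET.SubElement(row, "msup")           # layout schema: superscript
    ET.SubElement(sup, "mn").text = "10"
    if exponent < 0:
        # A negative exponent is written as a minus operator followed by a number.
        exp_row = ET.SubElement(sup, "mrow")
        ET.SubElement(exp_row, "mo").text = "-"
        ET.SubElement(exp_row, "mn").text = str(-exponent)
    else:
        ET.SubElement(sup, "mn").text = str(exponent)
    # mathvariant="normal" requests an upright font for the unit symbol.
    ET.SubElement(row, "mi", mathvariant="normal").text = unit
    return math


markup = ET.tostring(quantity_to_pmml("3.0", -17, "\u00b5m"), encoding="unicode")
print(markup)
```

The sketch keeps the unit in a single \inline{mi} element; splitting prefix and unit into separate elements would be an equally plausible design.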
......@@ -66,10 +66,8 @@ We assume basic knowledge of LaTeX and HTML.
corresponds to the presentational tag \inline{mn}, but content MathML
distinguishes between \inline{csymbol} and \inline{ci} for identifiers.
\inline{csymbol} is used for concrete mathematical terms and the
\tt{cd} attribute can be used to point to the definition of this term.
Local identifiers are written using \inline{ci}.
\begin{figure}
\lstinputlisting[language=XML,frame=single]{xml/contentMathMLExample.xml}
......@@ -95,7 +93,7 @@ We assume basic knowledge of LaTeX and HTML.
LaTeXML was used to convert the online e-print archive, arXiv, from LaTeX
to HTML5~\cite{stamerjohanns2010transforming}. The resulting corpus is called arXMLiv. The
arXiv contains articles from areas such as physics, mathematics, computer science,
quantitative biology, quantitative finance and statistics. The articles are mostly uploaded
as LaTeX files and published as PDF and Postscript documents. PDF and Postscript files are
obviously impractical for any further processing, but also the LaTeX source is not intended for
......@@ -119,7 +117,7 @@ We assume basic knowledge of LaTeX and HTML.
with support
for mathematics which is implemented in JavaScript and executed by a browser.
It allows the definition of annotation formats in the form of
\mq{KAT Annotation Specifications (KAnnSpecs)}. Besides an annotation mode, the tool also contains
a review mode, in which a user can rate existing annotations -- for instance for the
evaluation of an automated annotation system.
Annotations can be stored as Resource Description Framework (RDF,~\cite{RDFPrime}) documents
......
......@@ -15,8 +15,8 @@ end of this section.
\label{ssec:categorication}
\input{tex/tables/simpmucat.tex}
\input{tex/tables/simpdivcat.tex}
We distinguish quantity expressions from a syntactic and a semantic point
of view and, for instance, have a category containing relatively plain
expressions with units without superscripts (e.g.
\mq{$1\;\rm m$}; syntactically) and a category containing range expressions
......@@ -24,52 +24,60 @@ end of this section.
This intentionally does not lead to a strictly disjoint categorization,
as the examples demonstrate.
For every example, we show the presentational part of the quantity expression as well as
its LaTeX source and a reference to the document it occurred in.
It is sufficient to present short LaTeX snippets here, since the same observations
apply to the HTML code which is created from the LaTeX source.
\input{tex/tables/complexcat.tex}
\input{tex/tables/superscriptcat.tex}
\input{tex/tables/textualcat.tex}
The most basic category is the one just mentioned for \textit{simple
multiplicative quantity expressions} (Table~\ref{tab:simpmutcat}).
We regard a quantity
expression as simple if it contains any kind of numeric expression
followed by one or more unit symbols in a multiplicative way (e.g.
\mq{$3\;\text{Nm}$}). This excludes unit symbols that are in superscript, like
\mq{$30^\circ$}, and we also exclude written-out unit symbols (e.g. meter) and
textual numbers for this class.
Instead, there is a separate category for textual unit symbols and
quantity expressions with textual numbers (Table~\ref{tab:textualcat}).
In addition to the first category, we introduce one for \textit{simple
divisive quantity expressions} (Table~\ref{tab:simpdivcat}), which differs
only by the fact that units may also occur in a divisive way (e.g.
\mq{$4\;\text{m/s}$}).
We extend the simple multiplicative and divisive quantity expressions
to the category for \textit{complex quantity expressions}
(Table~\ref{tab:complexcat}) which contains expressions from the former
categories, but allows additional superscripts for units (e.g.
\mq{$5\;\text{m}^2$} or \mq{$5\;\text{m/s}^2$}). This category subsumes the first two.
\input{tex/tables/rangecat.tex}
\input{tex/tables/unitprodcat.tex}
Additionally, we have a category for quantity expressions where unit symbols
are part of the superscript -- as in \mq{$30^\circ$}. Although written in their own
manner, these expressions are, from a semantic point of view,
closely related to the simple multiplicative units.
\input{tex/tables/constantcat.tex}
\input{tex/tables/onlyunitcat.tex}
Furthermore, there are extra classes for \textit{range expressions}
(Table~\ref{tab:rangecat}) and \textit{unit products}
(Table~\ref{tab:unitprodcat}) because they include the previous expressions but contain additional semantic
information.
For instance, \mq{$23\;\mu\text{m} \times 23\;\mu\text{m}$}
describes not only an area of size \mq{$529\;\mu\text{m}^2$}, but also contains the information
that the area is square and has an edge length of \mq{$23\;\mu\text{m}$}.
Additional information is necessary to handle quantity expressions involving constants (Table~\ref{tab:constantcat}).
For instance, we need to know that
\begin{quote}
\mq{$\Omega_{a}$ is the ratio of the axion energy density to the
critical density in the Universe}~\cite{hep-ph/9807232}
\end{quote}
and that \mq{$h$} is the Hubble constant to understand
the meaning of Example 2 of
this table.
The examples in Table~\ref{tab:onlyunitcat} do not describe quantity expressions, but
indicate that certain formulae are written in these units. Hence, these terms also form their own class.
Figure~\ref{fig:taxonomy} summarizes the relations between the categories in a taxonomy.
\begin{figure}
\begin{tikzpicture}[
>=stealth,
......@@ -114,26 +122,26 @@ end of this section.
\subsection{Discussion}
\label{ssec:goaldiscussion}
In this section, we discuss how the examples from the previous section are perceived.
First, we look at the presentational part and then at the corresponding LaTeX source.
From the rendered result, we observe that unit strings can be ambiguous.
For instance, \mq{GHz} is very likely to stand for \mq{gigahertz}, but could also denote
\mq{gauß\footnote{Gauß is a unit of magnetic induction, commonly
abbreviated as \mq{G}.}~$\cdot$~hertz}. Similarly, \mq{Pa} has two possible meanings --
\mq{petayear} and \mq{pascal\footnote{Pascal is a unit of pressure, commonly abbreviated as \mq{Pa}.}}.
Additionally, it is not always clear to which part of the expression exponents refer.
In Example 4 from Table~\ref{tab:complexcat}, \mq{$0.5$ eV/Å${}^{3}$}, the exponent might refer
only to \mq{Ångström}\footnote{Ångström is a unit of length ($1 \; \rm \text{\AA} = 10^{-10} \; m$).},
but also to the whole expression. Hence we have the two possible meanings
\mq{$0.5 \; \text{eV/(Å}^3\text{)}$} and \mq{$0.5 \; \text{(eV/Å)}^3$}. Given the context of the paper,
we see that the former was meant in this case.
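The ambiguity of unit strings discussed above can be made concrete with a small Python sketch that enumerates every reading of a string as a product of optionally prefixed units. The lookup tables \inline{PREFIXES} and \inline{UNITS} and the function \inline{readings} are hypothetical and deliberately tiny; they are not the data or the algorithm used in this thesis.

```python
# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
PREFIXES = {"G": "giga", "P": "peta", "k": "kilo", "m": "milli"}
UNITS = {
    "Hz": "hertz",
    "G": "gauss",
    "Pa": "pascal",
    "a": "year",
    "m": "metre",
    "s": "second",
}


def readings(symbols):
    """Return all ways to read `symbols` as a product of (prefixed) units."""
    if not symbols:
        return [[]]  # exactly one way to read the empty string
    results = []
    for end in range(1, len(symbols) + 1):
        head, tail = symbols[:end], symbols[end:]
        candidates = []
        if head in UNITS:
            # The head is a plain unit symbol.
            candidates.append(UNITS[head])
        if head[0] in PREFIXES and head[1:] in UNITS:
            # The head is a one-letter prefix followed by a unit symbol.
            candidates.append(PREFIXES[head[0]] + UNITS[head[1:]])
        for name in candidates:
            for rest in readings(tail):
                results.append([name] + rest)
    return results


for text in ("GHz", "Pa"):
    print(text, readings(text))
```

With these tables, both \mq{GHz} and \mq{Pa} yield two readings each, so a later step has to choose the most plausible one from the context.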
In the rendered result, the changes between text and math mode do not complicate the understanding of the
expressions for humans.
But when looking at the \LaTeX\ source, we observe that this adds a lot of noise to the data.
Authors of scientific papers tend to misuse text mode to ensure an upright font for units, instead of using the
``\verb|\rm|'' command for this purpose. This leads to somewhat surprising encodings with frequent changes of math and text mode like
\begin{quote}
``\verb|$100$| km s\verb|${}^{-1}$| Mpc \verb|${}^{-1}$|'',
\end{quote}
......
......@@ -25,6 +25,7 @@
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage[indention=0cm]{caption}
\usepackage{hyperref}
\def\UrlBreaks{\do\/\do-}
......@@ -41,6 +42,7 @@
%]{biblatex}
%\addbibresource{literatur.bib}
\setcapindent{0pt}
\lstset{
basicstyle=\ttfamily,
......@@ -211,13 +213,13 @@ Erlangen, \today
This section lists suggestions for the further development of the presented system.
The first two items focus on the use of additional natural language processing tools and on the
detection of more quantity expressions. New technologies that can enhance this system
are suggested in items 3 and 4. The last recommendation concerns the runtime.
\begin{enumerate}
\item Part-of-speech tagging can be used during text tokenization (compare
Section~\ref{ssec:tokenization}). This can, for instance, be exploited to
correctly detect that \mq{as} is a part of the text and not an abbreviation
for \mq{attoseconds}. This misclassification is currently ruled out quite
naively by a scorer (compare Section~\ref{sssec:scoring}) and could be
improved by part-of-speech tagging.
\item This thesis restricts its attention to only a subset of all quantity expressions. A
......@@ -242,7 +244,7 @@ are suggested in item 3 and 4. The last recommendation mentions the runtime.
% extracts the semantics of the user input and translates it to the query language
% of MathWebSearch. Several possible languages for user input can be investigated for that.
\item The spotter and its scoring system are currently prototypical implementations.
One can improve their speed in order to allow them to scale to potentially all of
the arXMLiv documents.
\end{enumerate}
......@@ -250,8 +252,9 @@ are suggested in item 3 and 4. The last recommendation mentions the runtime.
\label{sec:conclusion}
In this thesis, we have described the extraction
of meaning from STEM documents with a special focus on quantity expressions and units.
We have presented a rule-based and modular implementation thereof on the arXMLiv corpus which delivers promising results.
The architecture is easily extensible by additional detection methods and proved to work well with other spotters,
such as Frederik Schäfer's declaration spotter.
We have exploited the detection results to offer useful semantic services like the automatic conversion of
units in scientific papers, which helps users stay focused while reading
and helps them avoid calculation errors by letting them convert quantity expressions with a right click.
......@@ -261,7 +264,8 @@ visually impaired people to participate in the scientific discourse.
Additionally, we converted the spotting results in such a way that they can be exploited by Tom Wiesing's semantic search
engine for quantity expressions. It searches not only for the entered expression, but also takes equivalent forms into
account; for example, it also finds 212 degrees Fahrenheit when searching for 100 degrees Celsius.
These applications demonstrate the additional benefit of semantic information compared to common purely syntactic data.
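The equivalence of 212 degrees Fahrenheit and 100 degrees Celsius mentioned above can be checked with a short Python sketch that normalizes both quantities to kelvin before comparing them. The unit labels \inline{degC}, \inline{degF} and \inline{K} and the function \inline{equivalent} are assumptions for illustration; the sketch does not describe how the semantic search engine is actually implemented.

```python
# Conversion of supported temperature units to kelvin.
# The unit labels are hypothetical and chosen for this sketch only.
TO_KELVIN = {
    "K": lambda value: value,
    "degC": lambda value: value + 273.15,
    "degF": lambda value: (value - 32.0) * 5.0 / 9.0 + 273.15,
}


def equivalent(value_a, unit_a, value_b, unit_b, tolerance=1e-9):
    """Return True if both temperatures denote the same physical quantity."""
    kelvin_a = TO_KELVIN[unit_a](value_a)
    kelvin_b = TO_KELVIN[unit_b](value_b)
    # Compare with a tolerance to absorb floating-point rounding.
    return abs(kelvin_a - kelvin_b) <= tolerance


print(equivalent(212, "degF", 100, "degC"))
print(equivalent(100, "degF", 100, "degC"))
```

Normalizing to one base unit keeps the comparison symmetric: a query in any supported unit matches stored expressions in any other.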
\newpage
......
<apply>
<csymbol>superscript</csymbol>
<ci> or <mtext> node
(*\meta{Numeric expression}*)
</apply>
<math>
<apply>
<times/>
<cn>30</cn>
<csymbol cd="degree">(*°*)</csymbol>
</apply>
</math>
<apply>
<divide/>
<apply>
<times/>
(*\meta{Numeric expression}*)
(*\meta{Unit symbol expressions}*)
</apply>
(*\meta{Unit symbol expressions}*)
</apply>
<rdf:Description rdf:nodeID="KAT_123">
<kat:annotates rdf:resource=
"http://localhost:3000/content/sample.html#cse(
//*[@id='S2.p5.1'],
//*[@id='S2.p5.1.w36'],
//*[@id='S2.p5.1.w38'])">
</kat:annotates>
<kat:contentmathml rdf:parseType="Literal">
<apply>
<times/>
<cn>1.42</cn>
<apply>
<csymbol cd="Prefix">Prefix</csymbol>
<csymbol cd="Giga">G</csymbol>
<csymbol cd="Hertz">Hz</csymbol>
</apply>
</apply>
</kat:contentmathml>
<kat:contentmathml rdf:parseType="Literal">
<apply>
<times/>
<cn>1.42</cn>
<apply>
<times/>
<csymbol cd="Gauss">G</csymbol>
<csymbol cd="Hertz">Hz</csymbol>
</apply>
</apply>
</kat:contentmathml>
</rdf:Description>
......@@ -3,7 +3,7 @@
<mrow>(*\meta{Presentation MathML}*)</mrow>
<annotation-xml>(*\meta{Content MathML}*)</annotation-xml>
<annotation encoding="application/x-tex">
(*\meta{LaTeX code}*)
</annotation>
</semantics>
</math>
<mws:harvest xmlns:m="http://www.w3.org/1998/Math/MathML"
xmlns:mws="http://search.mathweb.org/ns">
<mws:expr url="http://localhost/sample.html">
Content MathML
</mws:expr>
</mws:harvest>
<math>
<mn>1500</mn>
<mi aria-label="Volt">V</mi>
</math>
<div role="math" aria-label="1500 Volt">
<math>
<mn>1500</mn>
<mi>V</mi>
</math>
</div>
<div role="math" aria-label="1500 Volt">
<span>1500</span>
<span> </span>
<span>V</span>
</div>
<?xml version="1.0" ?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:mathml="http://www.w3.org/1998/Math/MathML">
<rdf:Description rdf:about=(*\meta{Identifier}*)>
<rdf:XMLLiteral>
(*\meta{Content MathML}*)
</rdf:XMLLiteral>
</rdf:Description>
</rdf:RDF>