Commit 95ce3ff2 authored by Ulrich

changes superscript and ci dot to power and times in numeric expressions

parent fef0d72d
......@@ -477,7 +477,7 @@ function calculate_number(elem){
return res1 - res2;
}
}else if (first.nodeName.toLowerCase() == "times"){
-var erg = 1;
+var erg = 1.0;
for (var i = 1; i < elem.childNodes.length; i++){
erg = erg * calculate_number(elem.childNodes[i]);
}
......
use std::collections::HashMap;
use regex::Regex;
static PREFIX_CONSTRUCTOR : &'static str = "<csymbol cd=\"Prefix\">Prefix</csymbol>";
......@@ -275,7 +275,11 @@ impl SpottedQE {
let num_string;
let num = self.opt_number.as_ref().unwrap();
if self.num_is_mathml{
-num_string = num.clone();
+let re1 = Regex::new(r"<csymbol[^>]*>superscript</csymbol>").unwrap();
+let mut result = re1.replace_all(&num.clone(), "<power/>");
+let re2 = Regex::new(r"<ci[^>]*>⋅</ci>").unwrap();
+result = re2.replace_all(&result,"<times/>");
+num_string = result.to_string();
}else{
num_string = format!("<cn>{}</cn>", num);
}
......
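The hunk above rewrites two LaTeXML artefacts inside a MathML number expression: `<csymbol ...>superscript</csymbol>` becomes `<power/>` and `<ci ...>⋅</ci>` becomes `<times/>`. The following is a minimal, dependency-free sketch of that normalisation. It swaps in plain string scanning for the `regex` crate used in the commit, so that it compiles with the standard library alone. The function names `replace_element` and `normalize_number_mathml` and the sample input (including the `cd="ambiguous"` attribute) are assumptions made for illustration and do not come from the repository.

```rust
/// Replaces every element `<tag ...>body</tag>` in `input` with `replacement`.
/// This mirrors the pattern `<tag[^>]*>body</tag>` from the hunk above,
/// using plain string scanning instead of the `regex` crate.
fn replace_element(input: &str, tag: &str, body: &str, replacement: &str) -> String {
    let open = format!("<{}", tag);
    let tail = format!("{}</{}>", body, tag);
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find(&open) {
        // Copy everything before the candidate opening tag unchanged.
        out.push_str(&rest[..start]);
        let after_open = &rest[start + open.len()..];
        // The attribute list ends at the first '>' (the `[^>]*>` part).
        match after_open.find('>') {
            Some(gt) if after_open[gt + 1..].starts_with(&tail) => {
                out.push_str(replacement);
                rest = &after_open[gt + 1 + tail.len()..];
            }
            _ => {
                // No match here: keep the opening text and keep scanning.
                out.push_str(&open);
                rest = after_open;
            }
        }
    }
    out.push_str(rest);
    out
}

/// Hypothetical helper: applies both rewrites in the same order as the commit.
fn normalize_number_mathml(num: &str) -> String {
    let with_power = replace_element(num, "csymbol", "superscript", "<power/>");
    replace_element(&with_power, "ci", "⋅", "<times/>")
}

fn main() {
    // Illustrative input; the attribute value is an assumption.
    let input = "<apply><ci>⋅</ci><cn>3.72</cn><apply>\
                 <csymbol cd=\"ambiguous\">superscript</csymbol>\
                 <cn>10</cn><cn>10</cn></apply></apply>";
    println!("{}", normalize_number_mathml(input));
}
```

Elements such as `<csymbol cd="Prefix">Prefix</csymbol>` are left untouched, because only a `csymbol` whose content is exactly `superscript` is rewritten. After the rewrite, a script such as `calculate_number` can presumably evaluate the number part with its existing `times` branch.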
......@@ -2,6 +2,7 @@ extern crate libxml;
extern crate llamapun;
extern crate time;
extern crate unidecode;
+extern crate regex;
pub mod data;
......
......@@ -106,6 +106,8 @@ an initiative for better access to STEM documents for visually impaired people -
to investigate methods to improve the output of screen reading programs
with semantic information.
\newpage
\subsubsection{Motivation}
In a joint report\ednote{How do I cite this?}, Daniel Hajas explains that
``access to Science, Technology, Engineering, Mathematics (STEM) information can
......
......@@ -143,13 +143,13 @@ end of this section.
Authors of scientific papers tend to misuse text mode to ensure an upright font for units, instead of using the
``\verb|\rm|'' command for this purpose. This leads to somewhat surprising encodings with frequent changes of math and text mode like
\begin{quote}
-``\verb|$100$| km s\verb|${}^{-1}$| Mpc \verb|${}^{-1}$|'',
+``\verb|$100$| km s\verb|${}^{-1}$| Mpc \verb|${}^{-1}$|'' (from~\cite{hep-ph/9807232}),
\end{quote}
\begin{quote}
-``10\verb|$^{-4}$|M\verb|$_\odot$|''
+``10\verb|$^{-4}$|M\verb|$_\odot$|'' (from~\cite{astro-ph/9211002})
\end{quote} and especially
\begin{quote}
-``3.72 \verb|$\times$| 10 \verb|${}^{10}$| cm \verb|${}^{-2}$|''.
+``3.72 \verb|$\times$| 10 \verb|${}^{10}$| cm \verb|${}^{-2}$|'' (from~\cite{astro-ph/9211002}).
\end{quote}
Example 1 from Table~\ref{tab:complexcat} is also a remarkable case, because here,
we see a difference between the semantics of the LaTeX
......@@ -161,6 +161,7 @@ end of this section.
\subsection{Restrictions for this Thesis}
\label{ssec:restriction}
The detection of all kinds of quantity expressions and units in STEM documents is a task
which exceeds the scope of this thesis and the author thus has to restrict his
attention to only a part of this problem.
......
......@@ -37,7 +37,7 @@
\draw[-{latex}] (invis2) |- (search);
\end{tikzpicture}
-\caption{The architecture of the implementation.}
+\caption{The Architecture of the Implementation.}
\label{fig:architecture}
\end{figure}
......@@ -99,7 +99,7 @@ separated from words but wrapped in exactly the same way. This
tokenization could for instance be further improved by adding
part-of-speech tags to every node.
-\subsection{Spotting of Quantity Expressions}
+\subsection{Spotting Quantity Expressions}
\label{ssec:spotting}
In Section~\ref{sec:problem}, we derived the requirement for an approach which tolerates the mixture of
math and text mode. There is only one format for text,
......@@ -162,7 +162,7 @@ We will refer to these examples in the next section where we describe the detect
\lstinputlisting[language=XML]{xml/second.xml}
\end{subfigure}
\end{mdframed}
-\caption{The XML representation of Example 4 (\mq{$6\;\si{\micro\meter}$}, left side) and 6 (\mq{200 $\si{\micro}$m}, right side) from Table~\ref{tab:simpmutcat}.}
+\caption{The XML Representation of Example 4 (\mq{$6\;\si{\micro\meter}$}, left side) and 6 (\mq{200 $\si{\micro}$m}, right side) from Table~\ref{tab:simpmutcat}.}
\label{fig:first}
\end{figure}
......@@ -177,7 +177,7 @@ We will refer to these examples in the next section where we describe the detect
\lstinputlisting[language=XML]{xml/fourth.xml}
\end{subfigure}
\end{mdframed}
-\caption{The XML representation of Example 2 (\mq{$1.0\;\rm Wcm^{-2} \si{\micro\meter}^2$}, left side) and 5 (\mq{$100$ km s$^{-1}$ Mpc$^{-1}$}, right side) from Table~\ref{tab:complexcat}.}
+\caption{The XML Representation of Example 2 (\mq{$1.0\;\rm Wcm^{-2} \si{\micro\meter}^2$}, left side) and 5 (\mq{$100$ km s$^{-1}$ Mpc$^{-1}$}, right side) from Table~\ref{tab:complexcat}.}
\label{fig:second}
\end{figure}
......@@ -192,7 +192,7 @@ We will refer to these examples in the next section where we describe the detect
\lstinputlisting[language=XML]{xml/degreeCorrect.xml}
\end{subfigure}
\end{mdframed}
-\caption{The XML representation of Example 1 (\mq{$30^\circ$}) from Table~\ref{tab:superscriptcat}.
+\caption{The XML Representation of Example 1 (\mq{$30^\circ$}) from Table~\ref{tab:superscriptcat}.
On the left, we see the semantically incorrect encoding generated by LaTeXML. We present a correct encoding on the right.}
\label{fig:third}
\end{figure}
......@@ -291,21 +291,21 @@ respectively. The latter example is encoded with the operator priority \mq{$(5 \
\begin{figure}
\lstinputlisting[language=XML, frame=single]{xml/basicPattern.xml}
\centering
-\caption{Basic pattern for quantity expressions in Content MathML.}
+\caption{Basic Pattern for the Detection of Quantity Expressions in Content MathML.}
\label{fig:PatternCNML}
\end{figure}
\begin{figure}
\lstinputlisting[language=XML, frame=single]{xml/dividePattern.xml}
\centering
-\caption{Divide pattern for quantity expressions in Content MathML.}
+\caption{Divide Pattern for the Detection of Quantity Expressions in Content MathML.}
\label{fig:DivPatternCNML}
\end{figure}
Two samples matching the presented patterns are displayed on the left side of
Figure~\ref{fig:first} and~\ref{fig:second}.
For the former, we have the numerical expressions \inline{<cn>6</cn>} as well as the unit symbol expressions
-\inline{<ci>}{\footnotesize \textmu}\inline{<\ci>} and \inline{<mtext>m</mtext>}.
+\inline{<ci>}{\footnotesize \textmu}\inline{</ci>} and \inline{<mtext>m</mtext>}.
For the further analysis, we simplify the unit symbol expressions
to a list of strings with the corresponding exponents being attached.
%For the example from Figure~\ref{fig:first}, we detect two unit symbol expressions,
......@@ -451,7 +451,7 @@ at the end:
\begin{mdframed}
\includegraphics[scale=0.34]{screenshots/KAT.png}
\end{mdframed}
-\caption{Highlighted Annotations for the Document~\cite{hep-ph/9807232} in KAT.}
+\caption{Highlighted Annotations in KAT for the Quantity Expressions Spotted in ~\cite{hep-ph/9807232}.}
\label{fig:KATEntryExample}
\end{figure}
......@@ -462,9 +462,11 @@ at the end:
\end{figure}
The spotted quantity expressions are stored in a RDF-based
-format~\cite{RDFPrime} suitable for annotation tool KAT (compare Section~\ref{ssec:kat}).
+format~\cite{RDFPrime} suitable for the annotation tool KAT (compare Section~\ref{ssec:kat}).
Figure~\ref{fig:KATEntryExample}
-displays an example of quantity expressions annotated in KAT.
+displays an example of highlighted quantity expressions in KAT. The expressions were spotted by the presented application
+and are thus limited by its restrictions (compare Section~\ref{ssec:restriction}). Hence the expression
+\mq{$10^{-12}\; M_\odot \Omega_a h^2$} is not detected as it involves constants.
These visualizations can easily be used for instance for quality control.
The format is described in detail in~\cite{KATCICM14} and ~\cite{KATCICM16}.
This section gives only a
......
......@@ -29,7 +29,7 @@
\usepackage{hyperref}
\def\UrlBreaks{\do\/\do-}
-\onehalfspacing
+%\onehalfspacing
%\usepackage[
% backend=biber,
......@@ -45,7 +45,7 @@
\setcapindent{0pt}
\lstset{
-basicstyle=\ttfamily,
+basicstyle=\sf,
columns=fullflexible,
showstringspaces=false,
commentstyle=\color{gray}\upshape,
......
<apply>
-<csymbol>superscript</csymbol>
+<power/>
<apply>
<csymbol cd="Prefix">Prefix</csymbol>
<csymbol cd=(*\meta{PrefixName}*)>(*\meta{PrefixSymbol}*)</csymbol>
......