Commit bd60490b authored by ulrich

merge

parents 128781a8 5d88e81a
Document,astro-ph0001117,astro-ph0008271,astro-ph0011211,astro-ph0012434,astro-ph0102308,cond-mat0001199,cond-mat0002133,cond-mat0003070,Cond-mat0007243,cond-mat0009225,cond-mat0011175,cond-mat0102209,hep-ex0006009,hep-lat0007038,hep-lat0008012,hep-ph0010254,hep-th0011187,math0101174,physics0007034,physics0011001,astro-ph0003441,astro-ph0007228,astro-ph0009290,astro-ph0011084,astro-ph0103106,cond-mat0004489,cond-mat0005248,gr-qc0103092,hep-ph0001210,hep-ph0004206,hep-ph0008135,hep-ph0010093,hep-ph0010276,hep-ph0011266,hep-th0010258,hep-th0102151,math0002164,nucl-th0004040,nucl-th0006044,physics0002002,quant-ph0010061,quant-ph0102128,physics0001038,nlin0004036,hep-ph0006076,hep-ph0003278,hep-ex0009004,cond-mat0012299,astro-ph0102504,astro-ph0012391
correct Qes (TP),51,9,47,5,33,19,3,0,2,5,1,8,33,19,11,11,0,0,5,5,94,21,10,17,24,0,0,0,13,24,25,12,5,1,0,0,0,0,12,0,11,0,6,0,13,3,4,6,24,6
missed Qes (FN),7,5,6,0,3,0,0,0,0,0,0,0,0,2,1,4,0,0,0,0,0,4,3,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,1,5,0
false Qes (FP),3,3,2,1,5,1,0,5,1,1,2,0,0,31,2,0,1,3,0,2,5,9,2,0,1,19,43,2,1,1,0,1,0,1,4,4,0,7,0,2,0,16,2,5,0,0,0,3,4,0
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Total Documents ,50,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Total TP,598,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Total FN,48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Total FP,195,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Total Qes,646,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Precision,0.7540983607,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Recall,0.9256965944,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
F-Score,0.8311327311,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
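The summary rows above follow from the totals by the usual definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-score as the harmonic mean of the two. The following is a minimal sketch of that arithmetic, not code from this commit; the function names are hypothetical, and returning 0.0 for an empty denominator is an assumption made only to keep the sketch total.

```rust
/// Precision: fraction of reported quantity expressions that are correct.
/// Hypothetical helper; returns 0.0 when nothing was reported (assumption).
fn precision(tp: u32, fp: u32) -> f64 {
    let reported = tp + fp;
    if reported == 0 { 0.0 } else { tp as f64 / reported as f64 }
}

/// Recall: fraction of the quantity expressions in the documents that were found.
/// Hypothetical helper; returns 0.0 when there was nothing to find (assumption).
fn recall(tp: u32, false_neg: u32) -> f64 {
    let actual = tp + false_neg;
    if actual == 0 { 0.0 } else { tp as f64 / actual as f64 }
}

/// F-score: harmonic mean of precision and recall.
fn f_score(p: f64, r: f64) -> f64 {
    if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) }
}

fn main() {
    // Totals taken from the table: TP = 598, FN = 48, FP = 195.
    let (tp, false_neg, fp) = (598, 48, 195);
    let p = precision(tp, fp);
    let r = recall(tp, false_neg);
    println!("Precision,{:.10}", p);
    println!("Recall,{:.10}", r);
    println!("F-Score,{:.10}", f_score(p, r));
}
```

With these totals, TP + FN = 646 matches the "Total Qes" row, and the three computed values agree with the Precision, Recall and F-Score rows to the ten decimal places shown there.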
......@@ -115,7 +115,8 @@ pub fn main() {
}
}
pub fn read_document(document_path : String, annotation_location : &str, config : &Config) -> (i32, i32){
pub fn read_document(document_path : String, annotation_location : &str, config : &Config){
println!("document {}", document_path);
let parser = Parser::default();
let opt_doc = parser.parse_file(annotation_location);
if opt_doc.is_err(){
......
......@@ -24,7 +24,9 @@
\usepackage{mdframed}
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage{hyperref}
\def\UrlBreaks{\do\/\do-}
\onehalfspacing
......@@ -57,6 +59,8 @@
identifierstyle=\color{blue},
keywordstyle=\color{cyan},
columns=fullflexible,
basicstyle=\sf,
basicstyle=\footnotesize,
morekeywords={xmlns,version,type},
escapeinside={(*}{*)}
}
......@@ -64,6 +68,7 @@
\def\memph#1{``\textit{#1}''}
\def\mq#1{``#1''}
\def\meta#1{\hspace{1mm}\ensuremath{\langle\langle\hspace{1mm}\text{#1}\hspace{1mm}\rangle\rangle}\hspace{1mm}}
\def\tt#1{{\footnotesize \texttt{#1}}}
\title{Meaning Extraction and Semantic Services in STEM-Documents}
......@@ -196,60 +201,70 @@ Erlangen, \today
\label{sec:implementation}
\input{tex/implementation.tex}
\section{Evaluation}
\ednote{Discuss with Michael}
\subsection{Qualitative Evaluation}
\ednote{Describe some errors and their causes here}
\begin{itemize}
\item problems in bigger formulas (astro-ph9211002), due to me and due to MathML
\end{itemize}
\subsection{Quantitative Evaluation}
\section{Applications}
\label{sec:applications}
\input{tex/applications.tex}
\section{Future Work}
This section lists suggestions for the further development of the presented system.
The first two items focus on the use of additional natural language processing tools and on the
detected of more quantity expressions. New technologies that can enhance this system
are suggested in item 3 and 4. Item 5 refers to the adaption of MathWebSearch for the search with
quantity expressions and the last recommendation mentions the runtime.
detection of more quantity expressions. New technologies that can enhance this system
are suggested in items 3 and 4. The last recommendation mentions the runtime.
\begin{enumerate}
\item Use part-of-speech tagging during text tokenization (compare
Section~\ref{ssec:tokenization}). This can for instance be used to
correctly detect that ``as'' is a part of the text and not an abbreviation
correctly detect that \mq{as} is a part of the text and not an abbreviation
for attoseconds. This misclassification is currently ruled out quite
naively by a scorer (compare Section~\ref{sssec:scoring}) and could be
improved by part-of-speech tagging.
\item This thesis restricts its attention to a subset of all quantity expressions. A
possible extension would be to include more expressions, for instance, by detecting
also quantity expressions with textual numbers, such as ``five seconds''.
also quantity expressions with textual numbers, such as \mq{five seconds}.
For that, one could use a natural language processing tool to detect textual numbers.
Range expressions, like 20 to 100 kilometers and $1-5 \rm m^2$, are also an important
Range expressions, like \mq{20 to 100 kilometers} and \mq{$1-5 \rm m^2$}, are also an important
extension, which requires adapting not only the detection schema but also
the annotation format.
\item The evaluation of machine learning technologies for semantics extraction might also
proof useful. A good starting point for that might be the implementation of a
prove useful. A good starting point for that might be the implementation of a
scoring system based on machine learning. One can evaluate it either using the current
results of the rule-based approach or by letting the
users of the unit conversion service disambiguate manually and using this data for training and testing.
\item For his declaration spotter, Jan Frederik Schäfer developed an XML-based pattern language.
\item For his declaration spotter~\cite{janbsc}, Jan Frederik Schäfer developed an XML-based pattern language.
Its use for the detection of quantity expressions can be evaluated, which can lead either to
the implementation of an additional spotter or to a reimplementation of the current spotter
using the pattern language. An advantage of the pattern language is that the detection
patterns are separated from the source code of the program and can be extended more easily.
However, the pattern language does not yet support content MathML.
\item Once MathWebSearch works again, one can extend it by a suitable frontend which
extracts the semantics of the user input and translates it to the query language
of MathWebSearch. Several possible languages for user input can be investigated for that.
% \item Once MathWebSearch works again, one can extend it by a suitable frontend which
% extracts the semantics of the user input and translates it to the query language
% of MathWebSearch. Several possible languages for user input can be investigated for that.
\item The spotter and its scoring system are currently prototypical implementations.
One can try to improve their speed so that they can potentially scale to all of
the arXMLiv documents.\ednote{mention current runtime estimates here}
the arXMLiv documents.
\end{enumerate}
\section{Conclusions}
In this thesis, we have described how to extract the meaning of quantity expressions and units from STEM documents
and presented an implementation thereof on the arXMLiv corpus. It proved to be easily extensible, scalable and to
deliver promising results.
We have exploited these results to offer useful semantic services like the automatic conversion of
units in scientific papers, which helps users stay focused while reading
and avoid calculation errors.
With the semantic enhancement of screen reading programs, we have also contributed
to the field of accessibility for STEM documents and thereby lowered the barrier for
visually impaired people to participate in the scientific discourse.
These applications demonstrate the benefit of semantic information.\ednote{How to extend this?}
\newpage
\addcontentsline{toc}{section}{References}
\bibliographystyle{halpha} \bibliography{literatur}
\newpage
......@@ -262,13 +277,5 @@ quantity expressions and the last recommendation mentions the runtime.
\addcontentsline{toc}{section}{Curriculum Vit\ae} \input{cv.tex}
\newpage
\addcontentsline{toc}{section}{References}
\bibliographystyle{halpha} \bibliography{literatur}
\end{document}
<apply>
<times/> or <divide/>
Numeric expression
Unit symbol expressions
(*\meta{Numeric expression}*)
(*\meta{Unit symbol expressions}*)
</apply>
<apply>
<times/>
Numeric Term
(*\meta{Numeric Term}*)
<apply>
<divide/>
<apply>
<times/>
Units
(*\meta{Units}*)
</apply>
<apply>
<times/>
Units
(*\meta{Units}*)
</apply>
</apply>
</apply>
......@@ -2,8 +2,8 @@
<csymbol>superscript</csymbol>
<apply>
<csymbol cd="Prefix">Prefix</csymbol>
<csymbol cd=PrefixName>PrefixSymbol</csymbol>
<csymbol cd=UnitName>UnitSymbol</csymbol>
<csymbol cd=(*\meta{PrefixName}*)>(*\meta{PrefixSymbol}*)</csymbol>
<csymbol cd=(*\meta{UnitName}*)>(*\meta{UnitSymbol}*)</csymbol>
</apply>
Numeric Term
(*\meta{Numeric Term}*)
</apply>
<rdf:Description rdf:nodeID="KAT_123">
<kat:annotates rdf:resource=
"http://localhost:3000/content/sample.html#cse(
"http://localhost:3000/content/hep-ph9807232.html#cse(
//*[@id='S2.p5.1'],
//*[@id='S2.p5.1.w36'],
//*[@id='S2.p5.1.w38'])">
......@@ -8,22 +8,29 @@
<kat:contentmathml rdf:parseType="Literal">
<apply>
<times/>
<cn>1.42</cn>
<apply>
<csymbol>superscript</csymbol>
<cn>10</cn>
<cn>10</cn>
</apply>
<apply>
<csymbol cd="Prefix">Prefix</csymbol>
<csymbol cd="Giga">G</csymbol>
<csymbol cd="Hertz">Hz</csymbol>
<csymbol cd="electron volt">eV</csymbol>
</apply>
</apply>
</kat:contentmathml>
<kat:contentmathml rdf:parseType="Literal">
<apply>
<times/>
<cn>1.42</cn>
<csymbol>superscript</csymbol>
<cn>10</cn>
<cn>10</cn>
</apply>
<apply>
<times/>
<csymbol cd="Gauss">G</csymbol>
<csymbol cd="Hertz">Hz</csymbol>
<csymbol cd="electron volt">eV</csymbol>
</apply>
</apply>
</kat:contentmathml>
......
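The listings in this hunk share one shape: a numeric term multiplied by a unit, where a prefixed unit is itself an `apply` of the `Prefix` symbol to a prefix symbol and a unit symbol. As an illustration only, the sketch below renders that shape as a string. The `Unit` struct and both functions are hypothetical names, not part of the spotter in this commit, and the sketch assumes its inputs need no XML escaping.

```rust
/// Hypothetical description of a unit with an optional prefix.
/// `prefix` holds (content dictionary name, symbol), e.g. ("Giga", "G").
struct Unit<'a> {
    prefix: Option<(&'a str, &'a str)>,
    name: &'a str,
    symbol: &'a str,
}

/// Render a unit as Content MathML, wrapping it in a `Prefix` application
/// when a prefix is present. Inputs are assumed to be XML-safe (assumption).
fn unit_to_cmml(u: &Unit) -> String {
    let unit = format!("<csymbol cd=\"{}\">{}</csymbol>", u.name, u.symbol);
    match u.prefix {
        Some((prefix_name, prefix_symbol)) => format!(
            "<apply><csymbol cd=\"Prefix\">Prefix</csymbol><csymbol cd=\"{}\">{}</csymbol>{}</apply>",
            prefix_name, prefix_symbol, unit
        ),
        None => unit,
    }
}

/// Render "value times unit" as a single Content MathML `apply` element.
fn quantity_to_cmml(value: &str, u: &Unit) -> String {
    format!("<apply><times/><cn>{}</cn>{}</apply>", value, unit_to_cmml(u))
}

fn main() {
    // One of the readings shown in the hunk: 1.42 with prefix Giga and unit Hertz.
    let ghz = Unit { prefix: Some(("Giga", "G")), name: "Hertz", symbol: "Hz" };
    println!("{}", quantity_to_cmml("1.42", &ghz));
}
```

A real implementation would more likely build the tree with an XML library than with string formatting; the sketch only makes the nesting of the `apply` elements explicit.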