Commit fadbcfed authored by Ulrich

more text

parent 5600ffc1
Document,astro-ph0001117,astro-ph0008271,astro-ph0011211,astro-ph0012434,astro-ph0102308,cond-mat0001199,cond-mat0002133,cond-mat0003070,Cond-mat0007243,cond-mat0009225,cond-mat0011175,cond-mat0102209,hep-ex0006009,hep-lat0007038,hep-lat0008012,hep-ph0010254,hep-th0011187,math0101174,physics0007034,physics0011001
correct Qes (TP),51,9,47,5,33,19,3,0,2,5,1,8,33,19,11,11,0,0,5,5
missed Qes (FN),7,5,6,0,3,0,0,0,0,0,0,0,0,2,1,4,0,0,0,0
false Qes (FP),3,3,2,1,5,1,0,5,1,1,2,0,0,31,2,0,1,3,0,2
,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,
Total Documents ,=COUNTA(B1:Z1),,,,,,,,,,,,,,,,,,,
Total TP,=SUM(B2:Z2),,,,,,,,,,,,,,,,,,,
Total FN,=SUM(B3:Z3),,,,,,,,,,,,,,,,,,,
Total FP,=SUM(B4:Z4),,,,,,,,,,,,,,,,,,,
Total Qes,=B10+B11,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,
Precision,=B10/(B10+B12),,,,,,,,,,,,,,,,,,,
Recall,=B10/(B10 + B11),,,,,,,,,,,,,,,,,,,
F-Score,=2*B15*B16/(B15 + B16),,,,,,,,,,,,,,,,,,,
Document,astro-ph0001117,astro-ph0008271,astro-ph0011211,astro-ph0012434,astro-ph0102308,cond-mat0001199,cond-mat0002133,cond-mat0003070,Cond-mat0007243,cond-mat0009225,cond-mat0011175,cond-mat0102209,hep-ex0006009,hep-lat0007038,hep-lat0008012,hep-ph0010254,hep-th0011187,math0101174,physics0007034,physics0011001,astro-ph0003441,astro-ph0007228,astro-ph0009290,astro-ph0011084,astro-ph0103106,cond-mat0004489,cond-mat0005248,gr-qc0103092,hep-ph0001210,hep-ph0004206,hep-ph0008135,hep-ph0010093,hep-ph0010276,hep-ph0011266,hep-th0010258,hep-th0102151,math0002164,nucl-th0004040,nucl-th0006044,physics0002002,quant-ph0010061,quant-ph0102128,physics0001038,nlin0004036,hep-ph0006076,hep-ph0003278,hep-ex0009004,cond-mat0012299,astro-ph0102504,astro-ph0012391
correct Qes (TP),51,9,47,5,33,19,3,0,2,5,1,8,33,19,11,11,0,0,5,5,94,21,10,17,24,0,0,0,13,24,25,12,5,1,0,0,0,0,12,0,11,0,6,0,13,3,4,6,24,6
missed Qes (FN),7,5,6,0,3,0,0,0,0,0,0,0,0,2,1,4,0,0,0,0,0,4,3,0,2,0,0,0,0,0,1,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,1,5,0
false Qes (FP),3,3,2,1,5,1,0,5,1,1,2,0,0,31,2,0,1,3,0,2,5,9,2,0,1,19,43,2,1,1,0,1,0,1,4,4,0,7,0,2,0,16,2,5,0,0,0,3,4,0
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Total Documents ,50,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Total TP,598,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Total FN,48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Total FP,195,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Total Qes,646,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Precision,0.7540983607,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Recall,0.9256965944,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
F-Score,0.8311327311,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
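The summary rows can be reproduced directly from the per-document counts. The short script below is a sketch added for illustration (it is not part of the repository); it assumes the layout shown above, with the document ids in row 1 and the TP, FN and FP counts in rows 2 to 4 of evaluation.csv.

import csv

def summarize(path="evaluation.csv"):
    with open(path, newline="") as f:
        rows = list(csv.reader(f))

    # Column A holds the row label; the remaining cells hold per-document counts.
    def counts(row):
        return [int(cell) for cell in row[1:] if cell.strip().isdigit()]

    tp, fn, fp = (sum(counts(rows[i])) for i in (1, 2, 3))
    documents = sum(1 for cell in rows[0][1:] if cell.strip())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    print(f"documents={documents}  TP={tp}  FN={fn}  FP={fp}")
    print(f"precision={precision:.4f}  recall={recall:.4f}  F-score={f_score:.4f}")

if __name__ == "__main__":
    summarize()

On the 50-document table this prints a precision of 0.7541, a recall of 0.9257 and an F-score of 0.8311, matching the stored values.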
The author implemented three prototypical applications to demonstrate the benefit of semantic information.
The first is a service for the conversion of units in documents (Section~\ref{ssec:implunitconv}).
We then describe an enhancement for screen reading applications with semantic information (Section~\ref{ssec:implscreen}).
In Section~\ref{ssec:implharvest}, we explain how to exploit semantic information for search.
\subsection{Conversion inside Documents}
......
......@@ -544,7 +544,7 @@ at the end:
again by attaching a negative score to them.
\subsection{Evaluation}
\ednote{Discuss with Michael}
In this section, we analyze the results of the implementation. \ednote{Add blabla here}
\subsubsection{Quantitative Evaluation}
The implementation was tested on a set of about 35,000 documents and successfully processed nearly all of them.
......@@ -553,27 +553,54 @@ processor. This is equivalent to a runtime of about 70 seconds per document on a
and scoring of quantity expressions. Note that the scoring task involves the execution of Schäfer's declaration spotter, which adds
a significant amount of runtime.
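As a rough back-of-the-envelope figure (an estimate added for orientation, not a measurement reported in the thesis), processing the whole test set strictly sequentially at this rate would take
\[
  35\,000 \times 70\,\mathrm{s} \approx 2.45 \times 10^{6}\,\mathrm{s} \approx 28\ \text{days},
\]
which is one reason why the runtime is picked up again in the future work section.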
The author evaluated the quality of the implementation by manually validating 50 randomly selected documents.
They include 646 quantity expressions, of which 598 were successfully recognized (true positives).
We regard the detection of a quantity expression as correct when its highest-scored possible meaning reflects the correct
meaning of the expression. 48 quantity expressions were not detected (false negatives), and 195 expressions were
wrongly marked as quantity expressions (false positives).
In this setup, true negatives correspond to non-quantity expressions that were correctly not detected.
However, there is no meaningful way to quantify these expressions in this setting, and we thus omit them.
The evaluation results in a precision of $598 \,/\, (598 + 195) \approx 75\%$ and a recall of
$598 \,/\, (598 + 48) \approx 93\%$, which yields an F-score of about $83\%$.
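Written out explicitly as the harmonic mean of precision and recall, the F-score computation is
\[
  F \;=\; \frac{2PR}{P+R} \;=\; \frac{2 \cdot 0.754 \cdot 0.926}{0.754 + 0.926} \;\approx\; 0.83 .
\]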
While the recall of the implementation is already very promising, we observe that the precision leaves room for
improvement. From this we conclude that the implemented pattern is too permissive rather than too restrictive, in the sense that it
marks too many terms as quantity expressions.
The detailed results of the evaluation, including the number of detected quantity expressions per document, are available in the author's repository\footnote{\url{https://gl.kwarc.info/urabenstein/Semanticextraction/blob/master/evaluation/evaluation.csv}}.
\subsubsection{Qualitative Evaluation}
In this section, we discuss some common sources of misclassification observed by the author of this thesis.
Humans can easily recognize that an expression such as \mq{$10^{-7}-10^{-6}$ yr$^{-1}$} from \cite{astro-ph/0003441}
denotes a range of values and that the \mq{$-$} in the expression does not stand for subtraction.
However, in the generation of content MathML, LaTeXML converts this symbol to a minus operation, and such expressions are thus
detected as plain quantity expressions instead of range expressions.
Abbreviations and variables are also a source of errors. For example, the following snippet defines the meaning of \mq{$s$} and \mq{$S$} in the
paper~\cite{cond-mat/0005248}.
\begin{quote}
\mq{Ground-state magnetization curves of ferrimagnetic Heisenberg chains of alternating spins $S$ and $s$ are numerically investigated.}
\end{quote}
Later in the paper, the authors frequently mention the \mq{$2s$-plateau magnetization curve} and the \mq{$2S$ chain period}. The spotter
recognizes the former as \mq{two seconds} and the latter as \mq{two siemens\footnote{The siemens is a unit of electric conductance.}},
which causes more than forty misclassifications in this document.
Common phrases such as \mq{in the 70s} (from~\cite{astro-ph/0009290}) also result in errors where \mq{70 seconds} is detected although
the authors refer to the decade from 1970 to 1979.
Astronomical names cause errors in a similar way. The authors of~\cite{astro-ph/9807288}, for instance, reference the supernovae
\mq{1979C, 1992H and 1993W}, resulting in the false detection of \mq{1979 coulomb\footnote{The coulomb is a unit of electric charge.}}, \mq{1992 henry\footnote{The henry is a unit of inductance.}} and \mq{1993 watt}.
Many of these misclassifications involve single-letter unit strings, such as \mq{C}, \mq{H} or \mq{W}, since they are --
without additional context -- often hard to distinguish from plain identifiers.
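To make this failure mode concrete, the following toy spotter (a deliberately simplified illustration written for this discussion, not the implementation described in this thesis) applies a purely lexical number-plus-unit pattern and reproduces exactly the kinds of false positives listed above:
\begin{verbatim}
import re

# Toy unit table containing only the single-letter symbols discussed above.
UNITS = {"s": "second", "S": "siemens", "C": "coulomb", "H": "henry", "W": "watt"}

# Purely lexical pattern: a number directly followed by a known unit symbol.
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(" + "|".join(UNITS) + r")\b")

def spot(text):
    return [(m.group(1), UNITS[m.group(2)]) for m in PATTERN.finditer(text)]

print(spot("the 2s-plateau and the 2S chain period"))
# [('2', 'second'), ('2', 'siemens')]  -- both are false positives
print(spot("in the 70s, supernovae 1979C and 1993W"))
# [('70', 'second'), ('1979', 'coulomb'), ('1993', 'watt')]  -- all false positives
\end{verbatim}
A spotter that only looks at the surface string cannot avoid these readings; additional context, for instance via the scoring discussed above, is needed to rule them out.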
%\begin{itemize}
%\item problems in bigger formuals (astro-ph9211002, physics-0002002), due to me and due to MathML
%\item abkürzungen (MC)
%\item range expression (astro-ph0003441, astro-ph0103106)
%\item 606W, 702W at astro-ph0003441
%\item 70s at astro-ph0009290
%\item 2s 2S at cont-mat0005248
%\end{itemize}
......
......@@ -209,22 +209,21 @@ Erlangen, \today
\section{Future Work}
This section lists suggestions for the further development of the presented system.
The first two items focus on the use of additional natural language processing tools and on the
detection of more quantity expressions. New technologies that could enhance the system
are suggested in items 3 and 4, and the last recommendation concerns the runtime.
\begin{enumerate}
\item Use part-of-speech tagging during text tokenization (compare
Section~\ref{ssec:tokenization}). This can for instance be used to
correctly detect that \mq{as} is part of the text and not an abbreviation
for attoseconds. This misclassification is currently ruled out quite
naively by a scorer (compare Section~\ref{sssec:scoring}) and could be
improved by part-of-speech tagging (a small sketch of this idea follows this list).
\item This thesis restricts its attention to only a subset of all quantity expressions. A
possible extension would be to include more expressions, for instance, by detecting
also quantity expressions with textual numbers, such as \mq{five seconds}.
For that, one could use a natural language processing tool to detect textual numbers.
Range expressions, like \mq{20 to 100 kilometers} and \mq{$1-5 \rm m^2$}, are also an important
extension, which involves adapting not only the detection schema but also
the annotation format.
\item The evaluation of machine learning technologies for semantics extraction might also
......@@ -238,15 +237,27 @@ quantity expressions and the last recommendation mentions the runtime.
using the pattern language. An advantage of the pattern language is that the detection
patterns are separated from the source code of the program and can be extended more easily.
However, the pattern language does not yet support content MathML.
% \item Once MathWebSearch works again, one can extend it by a suitable frontend which
% extracts the semantics of the user input and translates it to the query language
% of MathWebSearch. Several possible languages for user input can be investigated for that.
\item The spotter and its scoring system are currently prototypical implementations.
One can try to improve their speed so that they can potentially scale to all of
the arXMLiv documents.
\end{enumerate}
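The following is a minimal sketch of the part-of-speech idea from item 1, assuming NLTK and its pretrained English tagger (an illustration added here, not part of the thesis implementation): a candidate unit token such as \mq{as} is discarded whenever the tagger labels it as a function word in its sentence. The tag names follow the Penn Treebank conventions used by NLTK's default tagger.
\begin{verbatim}
# Requires: pip install nltk
# plus the tagger models: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")
import nltk

# Prepositions/subordinating conjunctions, coordinating conjunctions, adverbs,
# determiners and "to" -- tags that indicate ordinary prose rather than a unit.
FUNCTION_WORD_TAGS = {"IN", "CC", "RB", "DT", "TO"}

def looks_like_prose(token, sentence):
    """True if the POS tagger labels `token` as a function word in this sentence,
    in which case the unit reading (e.g. "as" = attoseconds) should be discarded."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return any(tag in FUNCTION_WORD_TAGS for word, tag in tagged if word == token)

print(looks_like_prose("as", "The result is as good as expected."))  # True -> not a unit
\end{verbatim}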
\section{Conclusions}
In this thesis, we have described how to extract the meaning of quantity expressions and units from STEM documents
and presented an implementation thereof on the arXMLiv corpus. The implementation proved to be easily extensible
and scalable, and it delivers promising results.
We have exploited these results to offer useful semantic services, such as the automatic conversion of
units in scientific papers, which helps readers stay focused and avoid calculation errors.
With the semantic enhancement of screen reading programs, we have also contributed
to the accessibility of STEM documents and thereby lowered the barrier for
visually impaired people to participate in scientific discourse.
These applications demonstrate the benefit of semantic information.\ednote{How to extend this?}
\newpage
......