Commit 58c3059d authored by Ulrich

more text

parent ac26c725
@@ -158,21 +158,7 @@ end of this section.
% abbreviated as ``G''.}~$\cdot$~hertz''. In the same way, ``Pa'' has two
% possible meanings -- ``petayear'' and ``pascal''.
%
% These uncertainties need to be taken care of. One possible way to do so
% would be to implement heuristics that try to guess the correct meaning
% during the parsing of the expression.
% This has the major disadvantage that we would lose information very
% early in the process.
% Instead, the desired behavior is to allow multiple meanings for an
% expression. So rather than disambiguating an expression directly, we want
% the spotter to return a set of possible meanings for an expression.
% An application based on the results of the spotter can then decide how
% to deal with the ambiguities. A search engine could, for instance,
% include all ambiguities. In contrast, a tool for the automatic
% conversion of quantity expressions could ask its user for help with the
% disambiguation or use the most likely meaning, where computing the
% most likely meaning from a set of ambiguities is an additional
% and independent task.
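The idea of deferring disambiguation, i.e. a spotter that returns the set of all possible meanings of a token, can be sketched as follows. This is a minimal, hypothetical illustration; the dictionary contents and function names are invented for this sketch and are not the thesis implementation:

```python
# Hypothetical sketch: a spotter that returns ALL candidate meanings of a
# unit token instead of guessing one meaning early during parsing.
UNIT_MEANINGS = {
    "Pa": {"pascal", "petayear"},
    "G": {"giga (prefix)", "gravitational constant"},
    "as": {"attosecond"},
}

def spot(token):
    """Return the set of candidate meanings for a token (empty if none)."""
    return UNIT_MEANINGS.get(token, set())

# A downstream application decides how to handle the ambiguity; e.g. a
# search engine could simply index every candidate meaning.
assert spot("Pa") == {"pascal", "petayear"}
assert spot("kg") == set()
```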
\subsection{Restrictions for this Thesis}
......
@@ -187,48 +187,65 @@ Erlangen, \today
\input{tex/basics.tex}
\section{Research Problem}
\label{sec:problem}
\input{tex/problem.tex}
\section{Implementation}
\label{sec:implementation}
\input{tex/implementation.tex}
\section{Evaluation}
\ednote{Discuss with Michael}
\subsection{Qualitative Evaluation}
\ednote{Describe some errors and their causes here}
\begin{itemize}
\item Problems in bigger formulas (astro-ph9211002), caused partly by this implementation and partly by MathML
\end{itemize}
\subsection{Quantitative Evaluation}
\section{Future Work}
This section lists suggestions for the further development of the presented system.
The first two items focus on the use of additional natural language processing tools and on the
detection of more quantity expressions. New technologies that can enhance this system
are suggested in items 3 and 4. Item 5 refers to the adaptation of MathWebSearch for searching with
quantity expressions, and the last recommendation concerns the runtime.
\begin{itemize}
\begin{enumerate}
\item Use part-of-speech tagging during text tokenization (compare
Section~\ref{ssec:tokenization}). This can for instance be used to
correctly detect, that ``as'' is a part of the text and not a abbreviation
correctly detect, that ``as'' is a part of the text and not an abbreviation
for attoseconds. This misclassification is currently ruled out quite
naively by a scorer (compare Section~\ref{sssec:scoring}) and could be
improved by part-of-speech tagging.
\item Use or enhance Frederik Schäfer's pattern matcher to avoid hardcoding rules
\item Allow the users of the unit conversion tool to perform the disambiguation
themselves and store their choices. One could use this data, for instance, to train a
scorer.
\item Use a natural language processing tool, such as Senna, to
detect numbers written in text form and exploit that to detect
quantity expressions in text form.
\item Detect more kinds of quantity expressions. Think of a good extension
of the current setup for range expressions.
\item Create bindings for siunitx.sty
\item Once MathWebSearch works again, build a frontend for it to allow
search involving quantity expressions.
\item Speed up and scale the tool so that it can potentially run on the whole archive
\end{itemize}
\item This thesis restricts its attention to only a subset of all quantity expressions. A
possible extension would be to include more expressions, for instance, by also detecting
quantity expressions with textual numbers, such as ``five seconds''.
For that, one could use a natural language processing tool to detect textual numbers.
Range expressions, like ``20 to 100 kilometers'' and $1$--$5\,\mathrm{m}^2$, are also an important
extension, which involves adapting not only the detection schema but also
the annotation format.
\item The evaluation of machine learning technologies for semantics extraction might also
prove useful. A good starting point might be the implementation of a
scoring system based on machine learning. One can evaluate it either against the current
results of the rule-based approach or by allowing manual disambiguation for the
users of the unit conversion service and using this data for training and testing.
\item For his declaration spotter, Jan Frederik Schäfer developed an XML-based pattern language.
Its use for the detection of quantity expressions could be evaluated, which could lead either to
the implementation of an additional spotter or to a reimplementation of the current spotter
using the pattern language. An advantage of the pattern language is that the detection
patterns are separated from the source code of the program and can be extended more easily.
However, the pattern language does not yet support content MathML.
\item Once MathWebSearch works again, one can extend it with a suitable frontend that
extracts the semantics of the user input and translates it into the query language
of MathWebSearch. Several possible languages for user input could be investigated for that.
\item The spotter and its scoring system are currently prototypical implementations.
One can try to improve their speed so that they can potentially scale to all of
the arXMLiv documents.\ednote{mention current runtime estimates here}
\end{enumerate}
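The part-of-speech idea from the first item above could look roughly like this. This is a pure-Python illustration with a hardcoded tagged sentence; the abbreviation table and function names are invented for this sketch, and a real system would obtain the tags from an actual POS tagger:

```python
# Illustrative sketch: use part-of-speech tags to rule out unit readings
# of ordinary words such as "as" (which could otherwise be misread as
# "attosecond"). The tagged sentence is hardcoded; a real system would
# run a POS tagger first.
UNIT_ABBREVIATIONS = {"as": "attosecond", "s": "second", "m": "metre"}

def unit_candidates(tagged_tokens):
    """Keep only tokens whose POS tag is noun-like and thus plausibly a unit."""
    candidates = []
    for word, tag in tagged_tokens:
        if word in UNIT_ABBREVIATIONS and tag.startswith("NN"):
            candidates.append((word, UNIT_ABBREVIATIONS[word]))
    return candidates

# "as" is tagged IN (preposition), so it is not treated as attoseconds,
# while "s" (tagged as a noun) survives as a unit candidate.
tagged = [("lasts", "VBZ"), ("as", "IN"), ("long", "RB"),
          ("as", "IN"), ("5", "CD"), ("s", "NN")]
assert unit_candidates(tagged) == [("s", "second")]
```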
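The detection of textual numbers such as ``five seconds'' could be bootstrapped with a simple word-to-number table before bringing in a full natural language processing tool. The following is a hypothetical sketch; the tables and function name are invented for illustration:

```python
# Hypothetical sketch: detect quantity expressions whose number is
# written out in words, e.g. "five seconds" -> (5, "s").
WORD_TO_NUMBER = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
UNIT_WORDS = {"second": "s", "seconds": "s", "metre": "m", "metres": "m"}

def detect_textual_quantities(tokens):
    """Find (value, unit) pairs written out in words, e.g. 'five seconds'."""
    quantities = []
    # Scan adjacent token pairs for a number word followed by a unit word.
    for first, second in zip(tokens, tokens[1:]):
        if first in WORD_TO_NUMBER and second in UNIT_WORDS:
            quantities.append((WORD_TO_NUMBER[first], UNIT_WORDS[second]))
    return quantities

assert detect_textual_quantities(["wait", "five", "seconds"]) == [(5, "s")]
assert detect_textual_quantities(["no", "quantity", "here"]) == []
```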
\section{Conclusions}
......
@@ -3,5 +3,4 @@
<math>
<ci>(*\textit{\textmu}*)</ci>
</math>
<span> </span>
<span>m</span>
<math>
<apply>
<times/>
<apply>
<ci>(*$\cdot$*)</ci>
<cn>2.0</cn>
<apply>
<csymbol>
superscript
</csymbol>
<cn>10</cn>
<cn>18</cn>
</apply>
</apply>
<cn>1.0</cn>
<apply>
<csymbol>
superscript
......