Ulrich Rabenstein / Semanticextraction / Commits / 58c3059d
Commit 58c3059d authored Apr 12, 2017 by Ulrich
more text
parent ac26c725
Showing 5 changed files with 120 additions and 116 deletions
thesis/tex/goal.tex (+1 -15)
thesis/tex/implementation.tex (+82 -70)
thesis/thesis.tex (+36 -19)
thesis/xml/second.xml (+0 -1)
thesis/xml/third.xml (+1 -11)
thesis/tex/goal.tex
...
...
@@ -158,21 +158,7 @@ end of this section.
% abbreviated as ``G''.}~$\cdot$~hertz''. In the same way, ``Pa'' has two
% possible meanings -- ``petayear'' and ``pascal''.
%
% These uncertainties need to be taken care of. One possible way to do so
% would be to implement heuristics which
% try to guess the correct meaning during the parsing of the expression.
% This has the major disadvantage that we would lose information very
% early in the process.
% Instead, the desired behavior is to allow multiple meanings for an
% expression. So rather than disambiguating an expression directly, we want
% the spotter to return a set of possible meanings for an expression.
% An application based on the results of the spotter can then decide how
% to deal with the ambiguities. A search engine could, for instance,
% include all ambiguities. In contrast, a tool for the automatic
% conversion of quantity expressions could ask its user for help with the
% disambiguation or use the most likely meaning, where computing the
% most likely meaning from a set of ambiguities is an additional
% and independent task.
\subsection{Restrictions for this Thesis}
...
...
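The commented-out passage in goal.tex argues that the spotter should return a set of possible meanings instead of disambiguating early. A minimal Python sketch of that idea follows. It assumes two tiny lookup tables (`PREFIXES`, `UNITS`) and a helper named `possible_meanings`; all three are hypothetical illustrations and are not taken from the thesis code.

```python
# Hypothetical, deliberately tiny lookup tables (assumption for illustration);
# a real spotter would need the full set of SI prefixes and units.
PREFIXES = {"": "", "P": "peta", "G": "giga", "m": "milli", "a": "atto"}
UNITS = {"Pa": "pascal", "a": "year", "s": "second", "m": "metre", "Hz": "hertz"}


def possible_meanings(symbol):
    """Return every prefix+unit reading of `symbol` instead of picking one."""
    meanings = set()
    for prefix, prefix_name in PREFIXES.items():
        if symbol.startswith(prefix):
            rest = symbol[len(prefix):]
            # Only keep readings whose remainder is a known unit symbol.
            if rest in UNITS:
                meanings.add(prefix_name + UNITS[rest])
    return meanings


# "Pa" keeps both readings; the caller decides how to handle the ambiguity.
print(possible_meanings("Pa"))
```

A search engine could index every element of the returned set, while a conversion tool could ask the user or apply a separate scoring step to the same set.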
thesis/tex/implementation.tex
This diff is collapsed.
thesis/thesis.tex
...
...
@@ -187,48 +187,65 @@ Erlangen, \today
\input{tex/basics.tex}
\section{Research Problem}
\label{sec:problem}
\input{tex/problem.tex}
\section{Implementation}
\label{sec:implementation}
\input{tex/implementation.tex}
\section{Evaluation}
Discuss with Michael
\ednote{Discuss with Michael}
\subsection{Qualitative Evaluation}
\ednote{Describe some errors and their causes here}
\begin{itemize}
\item problems in bigger formulas (astro-ph9211002), due to me and due to MathML
\end{itemize}
\subsection{Quantitative Evaluation}
\section{Future Work}
This section lists suggestions for the further development of the presented system.
blabla
The first two items focus on the use of additional natural language processing tools and on the
detection of more quantity expressions. New technologies that can enhance this system
are suggested in items 3 and 4. Item 5 refers to the adaptation of MathWebSearch for search with
quantity expressions, and the last recommendation concerns the runtime.
\begin{itemize}
\begin{enumerate}
\item Use part-of-speech tagging during text tokenization (compare
Section~\ref{ssec:tokenization}). This can for instance be used to
correctly detect, that ``as'' is a part of the text and not a abbreviation
correctly detect, that ``as'' is a part of the text and not an abbreviation
for attoseconds. This misclassification is currently ruled out quite
naively by a scorer (compare Section~\ref{sssec:scoring}) and could be
improved by part-of-speech tagging.
\item Use / Enhance Frederik Schäfer's Pattern Matcher to not hardcode rules
\item Allow the user of the unit conversion tool to disambiguate him/herself
and store the choices. One could use this data, for instance, to train a
scorer.
\item Use a natural language processing tool, such as Senna, to
detect numbers written in text form and exploit that to detect
quantity expressions in text form.
\item Detect more kinds of quantity expressions. Think of a good extension
of the current setup for range expressions.
\item Create bindings for siunitx.sty
\item Once MathWebSearch works again, build a frontend for it to allow
search involving quantity expressions.
\item Speed up and scale the tool to potentially run on the whole archive
\end{itemize}
\item This thesis restricts its attention to a subset of all quantity expressions. A
possible extension would be to include more expressions, for instance, by also
detecting quantity expressions with textual numbers, such as ``five seconds''.
For that, one could use a natural language processing tool to detect textual numbers.
Range expressions, like ``20 to 100 kilometers'' and $1 - 5 \rm m^2$, are also an important
extension, which involves the adaptation not only of the detection schema but also of
the annotation format.
\item The evaluation of machine learning technologies for semantics extraction might also
prove useful. A good starting point for that might be the implementation of a
scoring system based on machine learning. One can evaluate it either using the current
results of the rule-based approach or by allowing manual disambiguation for the
users of the unit conversion service and using this data for training and testing.
\item For his declaration spotter, Jan Frederik Schäfer developed an XML-based pattern language.
Its use for the detection of quantity expressions can be evaluated, which can lead either to
the implementation of an additional spotter or to a reimplementation of the current spotter
using the pattern language. An advantage of the pattern language is that the detection
patterns are separated from the source code of the program and can be extended more easily.
However, the pattern language does not yet support content MathML.
\item Once MathWebSearch works again, one can extend it with a suitable frontend which
extracts the semantics of the user input and translates it to the query language
of MathWebSearch. Several possible languages for user input can be investigated for that.
\item The spotter and its scoring system are currently prototypical implementations.
One can try to improve their speed so that they can potentially scale to all of
the arXMLiv documents.
\ednote{mention current runtime estimates here}
\end{enumerate}
\section{Conclusions}
...
...
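The Future Work list in thesis.tex mentions range expressions such as ``20 to 100 kilometers''. A rough Python sketch of a first detection step is given here, assuming plain-text input only. The pattern `RANGE` and the helper `find_ranges` are hypothetical names and not part of the thesis implementation; ranges inside MathML, and the required changes to the annotation format, are not covered.

```python
import re

# Hypothetical pattern (assumption for illustration): two numbers joined by
# "to" or a hyphen, followed by whitespace and an alphabetic unit token.
RANGE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)\s+([A-Za-z]+)")


def find_ranges(text):
    """Return (lower, upper, unit) tuples for simple range expressions."""
    return [(float(lo), float(hi), unit) for lo, hi, unit in RANGE.findall(text)]


print(find_ranges("a distance of 20 to 100 kilometers"))
```

The unit token is returned uninterpreted, so it could be passed on to whatever unit-resolution step the spotter already uses.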
thesis/xml/second.xml
...
...
@@ -3,5 +3,4 @@
<math><ci>(*\textit{\textmu}*)</ci></math>
<span></span>
<span>m</span>
thesis/xml/third.xml
<math>
  <apply>
    <times/>
    <apply>
      <ci>(*$\cdot$*)</ci>
      <cn>2.0</cn>
      <apply>
        <csymbol>superscript</csymbol>
        <cn>10</cn>
        <cn>18</cn>
      </apply>
    </apply>
    <cn>1.0</cn>
    <apply>
      <csymbol>
      superscript
...
...