Commit 4b4dce1a authored by Michael Kohlhase

moved here from SVN

See also https://svn.kwarc.info/repos/bourbaki
This folder used to have externals into Stefan's repository.
proposal https://svn.kwarc.info/repos/sanca/Documents/IUB/Smart%20Systems/Third%20Semester/Research/proposal
thesis https://svn.kwarc.info/repos/sanca/Documents/IUB/Smart%20Systems/Fourth%20Semester/masters/thesis
Since these are not accessible anymore, I replaced them with copies of the last revision I had.
\documentclass[10pt,oneside,onecolumn,a4paper]{article}
\usepackage[show]{ed}
\usepackage{url,moreverb}
\usepackage{wrapfig}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{lscape}
\usepackage[a4paper=true, bookmarks=true, linkcolor=blue,citecolor=blue,urlcolor=blue,colorlinks=true,pagecolor=black,breaklinks=true, bookmarksopen=true]{hyperref}
\lstset{float=htb,columns=flexible,frame=lines,basicstyle=\scriptsize,numbers=left,stepnumber=5,numberstyle=\tiny,showstringspaces=false}
\def\latexmlpost{\scsys{LaTeXMLpost}}
\def\openmath{\scsys{OpenMath}}
\def\mathml{\scsys{MathML}}
\def\latexml{\scsys{LaTeXML}}
\def\ltxml{\scsys{LTXML}}
\def\texfourht{\scsys{TeX4HT}}
\def\perl{\scsys{Perl}}
\def\wordnet{\scsys{WordNet}}
\def\omdoc{\scsys{OMDoc}}
\def\arxmliv{\scsys{arXMLiv}}
\def\latex{\LaTeX}
\def\arxiv{\scsys{arXiv}}
\def\werewolf{\scsys{Werewolf}}
\title{\textbf{A Comparison between Idiom Spotting Methods}}
\author{\c Stefan Anca\\
\textsf{Jacobs University Bremen}}
\date{}
\begin{document}
\maketitle
\bigskip
\bigskip
\bigskip
\begin{abstract}
\textsf{Information extraction from scientific documents containing mathematics and text is a vast field of research which can be approached from many different angles. One starting point is idiom extraction, the search for fixed-structure sentences containing both text and mathematics. This project looks at three different approaches to mathematical idiom spotting, all based on NLP tools: a heuristic pattern-matching approach based on predefined cleartext patterns, a syntactical analysis approach based on syntax parsing and syntax tree fragment matching, and a Discourse Representation Theory analysis based on matching patterns in the resulting Discourse Representation Structures (DRSes). The three methods will be run on a common corpus and compared against each other in order to find the one with the best idiom recall rate.}
\end{abstract}
\bigskip
\bigskip
\bigskip
\section{Introduction}
The task of this project is to search scientific texts for fixed-format natural language formulations containing mathematics, called \textit{idioms}, and to extract them. Large corpora of scientific papers and articles, such as the \textit{arXMLiv}\index{arXMLiv}, the \textit{DLMF}\index{DLMF} or the \emph{Connexions} repository, are generally assumed to contain a lot of text that describes old theories, proposes new ones or compares them. Such texts therefore contain many idioms, which are very common in scientific language. For example, a definition will most probably be found in a sentence containing one of the patterns ``\textit{We define X as Y}'' or ``\textit{X is defined as Y}'', and a theorem will be expressed using a pattern such as ``\textit{Given X and Y we conclude that Z}'' or ``\textit{If X then Y}''. Idioms are formed from fixed words or \emph{keywords}\index{keywords}, like ``\textit{define}'' or ``\textit{if}'', and \emph{placeholders}\index{placeholders}, like \textit{X} or \textit{Y}, arranged in a given pattern. A language idiom expresses a semantic relation between the placeholders: for example, ``\textit{We define X as Y}'' states that \textit{X} is related to \textit{Y} by the equality relation. Therefore, identifying the idioms extracts meaning from the text~\cite{AnPa09}.
The \emph{placeholders} in an idiom are of different types. In order to name the different types of terms used, the authors of~\cite{AnPa09} borrowed the terminology of \textit{hypothesis}\index{hypothesis} and \textit{conclusion}\index{conclusion}. The hypothesis is considered to be the term that receives the property imposed by the relation that the idiom defines. For example, in the case of a definition, the \textit{hypothesis} is the definiendum, while the \textit{conclusion} is the definiens.
\section{Idiom Spotting Methods}
In order to extract these formulations from scientific papers, this project will employ three different methods based on NLP, thoroughly described in \cite{SAMP}.
The first approach is to perform \textbf{heuristic pattern matching} looking at keywords and their order in a sentence. If they correspond to a particular pattern, they are then analyzed and the relevant conclusions and hypotheses are extracted, as text found in between the keywords (replacing a placeholder in the idiom pattern). The main challenge of this approach comes from defining the idiom patterns in a very comprehensive way (i.e. to catch multiple idioms with just one pattern) and from extracting only the desired information from the placeholders.
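
As a minimal sketch of this first approach, the following Python fragment matches two of the patterns from the introduction and returns the text captured by the placeholders. The pattern table, the function name \texttt{spot\_idioms} and the regular expressions themselves are illustrative assumptions and not the patterns used in the project.

\begin{lstlisting}[language=Python]
import re

# Hypothetical pattern table: idiom name -> regex with named placeholders.
# These two patterns are illustrative only, not the project's pattern list.
PATTERNS = {
    "if-then": re.compile(
        r"\bif\b\s+(?P<hypothesis>.+?),?\s+\bthen\b\s+(?P<conclusion>.+?)\.?$",
        re.IGNORECASE),
    "define-as": re.compile(
        r"\bwe\s+define\b\s+(?P<hypothesis>.+?)\s+\bas\b\s+(?P<conclusion>.+?)\.?$",
        re.IGNORECASE),
}

def spot_idioms(sentence):
    """Return (pattern name, hypothesis, conclusion) for each matching pattern."""
    found = []
    for name, regex in PATTERNS.items():
        match = regex.search(sentence.strip())
        if match:
            found.append((name,
                          match.group("hypothesis"),
                          match.group("conclusion")))
    return found
\end{lstlisting}

For the sentence \textit{If X is even, then X + 1 is odd.}, the sketch reports the pattern \texttt{if-then} with hypothesis \textit{X is even} and conclusion \textit{X + 1 is odd}. Any filler words standing between the keywords would be captured as part of the placeholders as well.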
The second approach to idiom spotting tries to avoid the problems created by the superfluous words inherent to natural language (e.g. \textit{as shown before}, \textit{based on Theorem 3}, etc.), which undesirably get caught up in a conclusion or hypothesis. This is attempted by \textbf{syntactical structure analysis}. Running a syntax parser on natural language text determines the syntactical roles of the words in the sentence, which can then be used as labels for hypotheses and conclusions. In order to run such parsers on scientific texts, all non-trivial math formulae need to be replaced with a generic natural language equivalent. For example, \textit{\textbf{If} we take $n\geq0$, \textbf{then} we know that $\sum_{i=0}^{n}i=\frac{n\cdot(n+1)}2$}
would be transformed into \textit{\textbf{If} we take \emph{some math}, \textbf{then} we know that \emph{some math}}. Once this word or sequence of words is in place, the sentence can be parsed and the resulting syntax tree analyzed. Knowing which part of the sentence we are looking for, we can extract just the relevant information from the idiom, based solely on the syntactical role that the math replacement plays in the sentence structure. The choice of the generic replacement words for math will be coordinated with the choice of the parser used for syntactic analysis, in order to find the best tool for syntax-oriented idiom spotting.
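
The replacement step described above could look roughly like the following Python sketch. It assumes that formulae are delimited by single dollar signs, as in the example sentence, whereas the actual corpora would need the same step applied to their own math markup; the function name \texttt{replace\_math} and the default placeholder are likewise assumptions made for illustration.

\begin{lstlisting}[language=Python]
import re

# Assumption: inline formulae are written as $...$, as in the example
# sentence. Real corpora would need a matcher for their own math markup.
INLINE_MATH = re.compile(r"\$[^$]+\$")

def replace_math(sentence, placeholder="some math"):
    """Replace every inline formula by a placeholder.

    Returns the rewritten sentence together with the removed formulae in
    their original order, so that they can be restored after parsing.
    """
    formulae = INLINE_MATH.findall(sentence)
    rewritten = INLINE_MATH.sub(lambda match: placeholder, sentence)
    return rewritten, formulae
\end{lstlisting}

Keeping the removed formulae in order makes it possible to map each occurrence of the placeholder in the parse tree back to the formula it stands for.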
The third approach to finding idioms is \textbf{Discourse Representation Theory analysis}, i.e. extracting First Order Logic semantics from the text before looking inside the idioms. Johan Bos's Boxer tool~\cite{Curran07linguisticallymotivated} generates semantic representations of natural language text based on a Combinatory Categorial Grammar, producing Discourse Representation Structures (DRSes). Once again, the mathematical formulae are replaced by a suitable generic word that fits the semantics of the sentence, and then Boxer is run over the text to produce DRSes. By looking at the DRS output of Boxer on ordinary idiom text, DRS patterns can be identified and the positions where the hypothesis and the conclusion appear in such structures can be determined. These rules can then be reused on text in which the math has been replaced, in the hope that the DRSes will not change and the words replacing mathematical formulae will fall exactly into the places where the conclusions and hypotheses are expected. The Boxer analysis can prove to be a very elegant way to get rid of unwanted words in the conclusion, as the resulting logical structures are not influenced by filler words.
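
To illustrate the idea of reading hypotheses and conclusions off fixed positions in a DRS, the following Python sketch uses a toy data structure in which a conditional sentence yields an implication between two sub-DRSes. The classes \texttt{DRS} and \texttt{Implication}, the marker prefix \texttt{MATH} and the function \texttt{extract} are hypothetical stand-ins; they do not reproduce the output format of Boxer.

\begin{lstlisting}[language=Python]
from dataclasses import dataclass, field

# Toy stand-in for a DRS. This is NOT the output format of Boxer.
@dataclass
class DRS:
    referents: list = field(default_factory=list)
    conditions: list = field(default_factory=list)  # str or Implication

@dataclass
class Implication:
    antecedent: DRS
    consequent: DRS

def atoms(drs):
    """Collect all atomic conditions of a DRS, including nested ones."""
    collected = []
    for condition in drs.conditions:
        if isinstance(condition, Implication):
            collected += atoms(condition.antecedent)
            collected += atoms(condition.consequent)
        else:
            collected.append(condition)
    return collected

def extract(drs, marker="MATH"):
    """For every top-level implication, return (hypotheses, conclusions).

    Math placeholders found in the antecedent count as hypotheses, those
    found in the consequent as conclusions; all other conditions are
    ignored.
    """
    results = []
    for condition in drs.conditions:
        if isinstance(condition, Implication):
            hyps = [a for a in atoms(condition.antecedent)
                    if a.startswith(marker)]
            concs = [a for a in atoms(condition.consequent)
                     if a.startswith(marker)]
            results.append((hyps, concs))
    return results
\end{lstlisting}

In this toy setting, conditions that stem from filler words simply do not carry the marker and are therefore skipped, which mirrors the hoped-for behaviour of the DRS-based method.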
\section{Evaluation and Conclusion}
The main evaluation criterion is the idiom recall rate of the three algorithms. The methods will be evaluated with respect to the variety of idiom patterns used and the correctness of the idioms retrieved from the texts. The greater aim is to retrieve idioms that contain fewer superfluous words and have a more complex structure, such as those with multiple hypotheses or conclusions. The three methods of idiom spotting will be run on the same corpus of documents in order to allow a direct comparison of the results.
This short description presents a project which aims to make a small contribution to the field of combined mathematics and natural language text processing, focusing on these fixed-structure sentences. The three NLP methods for idiom spotting described here all try to apply knowledge from regular text processing to documents that combine mathematics and text. By comparing their empirical success rates, this project aims not only at retrieving more idioms but also at learning more about the word occurrence, structure and frequency of idioms in natural language texts.
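
One way to make the recall rate precise is the standard definition from information retrieval. It assumes a manually annotated gold standard, which this description does not specify: let $G$ be the set of idioms in the gold standard and $S$ the set of idioms reported by a method. Precision is given alongside for contrast, since it captures the correctness of the retrieved idioms:
\[
  \mathrm{recall} = \frac{|S\cap G|}{|G|},
  \qquad
  \mathrm{precision} = \frac{|S\cap G|}{|S|}.
\]
For instance, a method that reports $|S|=50$ idioms, of which $|S\cap G|=40$ are among $|G|=80$ annotated ones, would have a recall of $0.5$ and a precision of $0.8$.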
\bibliographystyle{alpha}
\bibliography{SArefs}
\end{document}
\documentclass{beamer}
\usepackage{setspace}
\usepackage{url, wrapfig}
\usepackage{beamerthemesplit}
%\usepackage{beamerthemeshadow}
\usepackage{color}
\usepackage{amsmath}
\usepackage{cite}
\usepackage{pgf}
\usepackage{tikz}
\usetikzlibrary{shapes}
\usetikzlibrary{arrows}
\title{A Comparison between Idiom Spotting Methods\\Reloaded}
\author{\c Stefan Anca \\ last (re)presentation as Jacobs student}
% \institute{Project supervisor: Prof. Michael Kohlhase}
\date{May 20, 2009}
\def\arXiv{{\scshape{arXiv}}}
\def\xmath{{\scshape{XMath}}}
\def\mathml{{\scshape{MathML}}}
\def\openmath{{\scshape{OpenMath}}}
\def\connexions{{\scshape{Connexions}}}
\begin{document}
\frame{\titlepage}
\frame
{
\frametitle{This presentation goes out to ...}
\begin{figure}[!htpb]
\centering
\includegraphics[height = 10cm]{flobo.jpg}
% \label{fig1}
\end{figure}
}
\section{Introduction and Reminder}
\frame{\tableofcontents}
\subsection{Idioms}
\frame{
\frametitle{Idiom Definition (in case you forgot ...)}
\begin{itemize}
\item \textbf{Idiom} := a formulation following a certain \textcolor{red}{fixed} word and syntax \textcolor{red}{pattern}.
\item Ex: \textit{\textbf{Let} Florian be a regular professor, \textbf{then} Florian should be in class.}
\item Ex: \textit{\textbf{If} $x\in \mathbb{R}$, \textbf{then} ${x}^2\geq 0$.}
\item Ex: \textit{Thus, we \textbf{conclude} that this presentation deserves a $1.0$.}
\end{itemize}
}
\frame{
\frametitle{Idiom Patterns (just to make sure you get it)}
\textbf{Idiom} := \textcolor{red}{pattern}, \textcolor{blue}{keywords}, \textcolor{green}{placeholders}.
\begin{itemize}
\item Ex pattern: \textcolor{red}{Definition:} \textit{We \textcolor{blue}{define} \textcolor{green}{X} as \textcolor{green}{Y}}.
\item Ex pattern: \textcolor{red}{Condition:} \textit{\textcolor{blue}{If} \textcolor{green}{X} \textcolor{blue}{then} \textcolor{green}{Y}}.
% \item Ex pattern: \textcolor{red}{Theorem:} \textit{\textcolor{blue}{Given} \textcolor{green}{X} and \textcolor{green}{Y} we \textcolor{blue}{conclude} that \textcolor{green}{Z}}.
\item \textcolor{blue}{keywords}: \textit{define}, \textit{if}, \textit{then}
\item \textcolor{green}{placeholders}: \textit{X}, \textit{Y}, \textit{Z}
%
% \item An idiom expresses a semantic relation between the placeholders.
% \item \textcolor{blue}{\textit{We define X as Y}} translates to \textcolor{red}{\textit{X} relates to \textit{Y} by the equality relation}.
% \pause
% \item Extract meaning from the text!
\end{itemize}
\pause
\textbf{Formalization approach}: Split Idioms into \textcolor{red}{hypotheses} and \textcolor{blue}{conclusions}.
Example:
\begin{center}
\textit{\textbf{Let} \textcolor{red}{$f:A\rightarrow B$} and \textcolor{red}{$g:B\rightarrow C$}, \textbf{then} \textcolor{blue}{$(g\circ f)(x) = g(f(x))$}}.
\end{center}
\begin{itemize}
\item \textcolor{red}{Hypotheses}: {$f:A\rightarrow B$, $g:B\rightarrow C$}
\item \textcolor{blue}{Conclusions}: {$(g\circ f)(x) = g(f(x))$}
\end{itemize}
}
\frame
{
\frametitle{Mathematics in Idioms (the juicy part)}
\begin{itemize}
\item \textbf{Let} us assume $x,y\in \mathbb{N}$, \textbf{then} \textcolor{red}{$x + y\in \mathbb{N}$}.
\item We prove that \textbf{if} $f$ is continuous on $[a,b]$, \textbf{then} \textcolor{red}{$f$ is bounded on $[a,b]$} \textbf{and} \textcolor{red}{$f$ is integrable}.
\item \textbf{Let} $f:A\rightarrow B$ and $g:B\rightarrow C$, \textbf{then} \textcolor{red}{$(g\circ f)(x) = g(f(x))$}.
\vspace{1cm}
\pause
\item \textbf{\textcolor{blue}{Let's try to capture the idioms because they have juicy \textcolor{red}{math} inside!}}
\item \textcolor{blue}{Later we can even index it and search against it!}
\end{itemize}
}
\section{The Project}
\subsection{Goals}
\frame
{
\frametitle{Seminar Project: What is the best way I can get my hands on some fresh idioms?}
\begin{itemize}
\item Need as many idioms (\textcolor{red}{with math conclusions}) as possible!
\item 3 Idiom Spotting Approaches - all (ab)using NLP tools.
\item Run all methods on the same corpus.
\item Compare and find the one with best recall rate (most indexable math conclusion idioms found).
\end{itemize}
\begin{enumerate}
\item \textbf{Heuristic Pattern Matching} - match language patterns and extract the placeholders
\item \textbf{Syntactical Structure Analysis} - match syntactical patterns and extract the right syntactical parts
\item \textbf{Discourse Representation Theory Analysis} - match DRS relations and extract the entities
\end{enumerate}
}
\frame
{
\frametitle{The bigger picture ... you missed it, didn't you?}
\begin{figure}[!htpb]
\centering
\includegraphics[scale=0.22]{architecture_clean.png}
% \label{fig1}
\end{figure}
}
\frame
{
\frametitle{Did you see where my project fits in?}
\begin{figure}[!htpb]
\centering
\includegraphics[scale=0.22]{architecture_stef.png}
% \label{fig1}
\end{figure}
}
\subsection{Dreams}
\frame
{
\frametitle{Applicable Theorem Search: Master Thesis or just a Dream?}
\textbf{Master Thesis Project: Applicable Theorem Search - a 3 step system:}
\begin{itemize}
\item \textcolor{red}{Spot all the idioms in a corpus according to predefined patterns.}
\item Index the mathematical formulae in the conclusion parts with MathWebSearch.
\item Provide a GUI for idiom retrieval, similar to the MathWebSearch page.
\vspace{0.5cm}
\item Doctor (MiKo)'s orders: \textbf{Working system needs to run before Idiom Comparison can be performed! :(}
\end{itemize}
}
\subsection{Reality}
\frame[containsverbatim]{
\frametitle{Progress: Heuristic Pattern Matching}
Implementation improvements from Fall 2008 lab scripts:
\begin{itemize}
\item Fixed a lot of bugs :)
\item Switched from \verb|tex.xml| to \verb|xhtml| and from \xmath\ to \mathml
\item Now only capturing idioms with math in \textcolor{red}{conclusion}
\item Extended the list of \textcolor{blue}{keywords} and \textcolor{red}{idiom patterns} (Zinn, 2004)
\item Hardcore testing on multiple (3) corpora
\item Gained a lot of wisdom about math formulations in \arXiv\ texts
\end{itemize}
}
\frame{
\frametitle{Corpus Wisdom (Statistics)}
\tiny
\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline \textbf{Sandbox} & \textbf{LaMaPUn} & \textbf{WebDev} & \textbf{CNX}\\
\hline Total files &2266 math papers &5006 papers & 12229 pages\\
\hline Files with idioms &882 (39\%) &1654 (33\%) &468 (3.8\%)\\
\hline Idioms found &3603 (avg 1.6/4 ipf) &4452 (avg 0.89/2.7 ipf)& 839 (avg 0.068/1.79 ipf)\\
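\hline \multicolumn{4}{|l|}{ipf: idioms per file, averaged over all files / over files with idioms}\\
\hline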
\end{tabular}
\vspace{0.2cm}
\begin{tabular}{|c|c|c|c|}
\hline \textbf{Idiom pattern} & \textbf{Frequency LaMaPUn} & \textbf{Frequency WebDev}& \textbf{Frequency CNX}\\
\hline assume H1 then C1 &51 &76 &17 \\
\hline conclude H1 is C1 &68 &146 & 15\\
\hline define H1 to be C1 &102 &100 & 38\\
\hline given H1 then C1 &45 &70 & 24\\
\hline H1 if and only if C1 &456 &356 & 49\\
\hline H1 implies C1 &703 &1318 & 151\\
\hline H1 only if C1 &624 &679 &84\\
\hline H1 only when C1 &128 &236 &32\\
\hline if H1 and if H2 then C1 &44 &45 &19\\
\hline if H1 and if H2 then C1 and C2 &34 &43 &8\\
\hline if H1 then C1 &1292 &1325 &373\\
\hline let H1 then C1 &48 &48 & 21\\
\hline suppose H1 then C1 & 8 &10 & 8\\
\hline
\end{tabular}
\end{center}
}
\section{Conclusions}
\subsection{Future}
\frame{
\frametitle{One down, two to go!}
\begin{enumerate}
\item Syntactical Structure Analysis
\begin{itemize}
\item Parser choice narrowed down
\item Start from keywords, replace the math
\item Take more than one derivation tree
\item If all goes well with the MSc project, approach in \textcolor{red}{late June}
\end{itemize}
\item Discourse Representation Theory Analysis