Commit ea14b646 by Ulrich

... ... @@ -123,37 +123,40 @@ end of this section.
abbreviated as \quote{G}.}~$\cdot$~Hertz}.
Similarly, \quote{Pa} has two possible meanings -- \quote{Petayear} and
\quote{Pascal\footnote{Pascal is a unit of pressure, commonly abbreviated as \quote{Pa}.}}.
Additionally, it is not always clear to which part of the expression exponents refer to.
In Example 4 from Table~\ref{tab:complexcat}, $0.5$ eV/Å${}^{3}$, the exponent might refer only to Ångström, but also to the whole expression.
Hence we have the two possible meanings $0.5 \; \text{eV/(Å}^3\text{)}$ and $0.5 \; \text{(eV/Å)}^3$.
Given the context of the paper,
In Example 4 from Table~\ref{tab:complexcat}, \quote{$0.5$ eV/Å${}^{3}$}, the exponent might refer only to Ångström\footnote{Ångström is a unit of length ($1 \; \rm \text{\AA} = 10^{-10} \; m$).}., but also to the whole expression.
Hence we have the two possible meanings \quote{$0.5 \; \text{eV/(Å}^3\text{)}$} and \quote{$0.5 \; \text{(eV/Å)}^3$}.
Given the context of the paper,
we see that the former was meant in this case.
In the rendered result, the changes between text and math mode do not complicate the understanding of the expressions for humans.
But when looking at the LaTeX source, we observe that this add a lot of noise to the data.
People tend to misuse text mode to ensure an upright font for units, instead of using the
\verb|\rm| command. This leads to somewhat surprising encodings like
\verb|$100$| km s\verb|${}^{-1}$| Mpc \verb|${}^{-1}$|,
10\verb|$^{-4}$|M\verb|$_\odot$| and especially
3.72 \verb|$\times$| 10 \verb|${}^{10}$| cm \verb|${}^{-2}$|.
\verb|\rm| command. This leads to somewhat surprising encodings like \\
\verb|$100$| km s\verb|${}^{-1}$| Mpc \verb|${}^{-1}$|'', \\
10\verb|$^{-4}$|M\verb|$_\odot$|'' and especially \\
3.72 \verb|$\times$| 10 \verb|${}^{10}$| cm \verb|${}^{-2}$|''.
\\ Example 1 from Table~\ref{tab:complexcat} is also a remarkable case, because here, we see a difference between the semantics of the LaTeX source and the rendered result.
From its source, we can assume the meaning $(\text{W}/\text{cm})^2$, while the presentational part intends $\text{W}/(\text{cm}^2)$.
The latter is correct here, because the expression describes
From its source, we can assume the meaning \quote{$\rm (W/cm)^2$}, while the presentational part seems more consistent with \quote{$\rm W/(cm^2)$}.
The latter is correct here, because the expression describes
the intensity of a laser beam.
\subsection{Restrictions for this Thesis}
For the further part of the thesis, we restrict our attention to the
The detection of all kinds of quantity expressions and units in STEM documents is a task which exceeds the scope of this thesis and the author thus has to restrict his attention to only a part of this problem.
We further investigate the
detection of quantity expressions from the categories in the Tables~\ref{tab:simpmutcat} to~\ref{tab:superscriptcat}.
We omit textual quantity expressions, because they
We omit textual quantity expressions and single units, because they
hardly occur in the documents and because they need to be handled differently.
We also omit unit products and range expressions, due the additional complexity of handling their semantics, as well expressions involving constants which would require to also detect or infer the definition of the constants.
The detection and processing of single units (Table~\ref{tab:onlyunitcat}) is also out of the scope of this thesis.
% The different kinds of units and quantity expressions require
... ...