Semantics extraction based on machine learning
We have a couple of corpora from which we want to extract semantic features, such as:
- quantity expressions like "3m/s" (three meters per second) or "two furlongs per fortnight"
- polarity of identifiers in formulae (essentially, which symbols in a formula can be substituted for)
- the location of definitions, theorems, and assumptions (and their definienda, definientia, and statements)
- or, more generally, the content form of a formula
If we knew any of these, we could build useful semantic services on top of them relatively directly (e.g. better screen readers for visually impaired people or better scientific search engines).
Large corpora are available, e.g. the arXMLiv corpus and the data behind the On-Line Encyclopedia of Integer Sequences (OEIS).
All of them are (probably) amenable to machine-learning methods. In some cases, we already have some data about the phenomena above, which can act as a baseline.
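To give a feel for what a simple baseline might look like for the first item (quantity expressions), here is a minimal rule-based sketch in Python. It is not the group's existing data or tooling: the function name extract_quantities, the tiny NUMBER_WORDS and UNITS vocabularies, and the output format are all assumptions made for illustration. A learned model would be measured against something like this.

```python
import re

# Hypothetical, deliberately tiny vocabularies (assumptions for illustration);
# a real baseline would need a much larger unit and number lexicon.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
UNITS = {
    "m": "meter", "meter": "meter", "meters": "meter",
    "s": "second", "second": "second", "seconds": "second",
    "furlong": "furlong", "furlongs": "furlong",
    "fortnight": "fortnight", "fortnights": "fortnight",
}

# A number is either a decimal literal or one of the known number words.
_number = r"(?P<num>\d+(?:\.\d+)?|" + "|".join(NUMBER_WORDS) + r")"
# Longest unit names first, so "meters" is tried before "m".
_unit = "(?:" + "|".join(sorted(UNITS, key=len, reverse=True)) + ")"

# number, unit, then an optional "/ unit" or "per unit" part.
QUANTITY = re.compile(
    rf"\b{_number}\s*(?P<unit>{_unit})(?:\s*(?:/|per)\s*(?P<per>{_unit}))?\b",
    re.IGNORECASE,
)


def extract_quantities(text):
    """Return (value, unit, per_unit) triples for each quantity found in text."""
    results = []
    for match in QUANTITY.finditer(text):
        raw = match.group("num").lower()
        value = NUMBER_WORDS[raw] if raw in NUMBER_WORDS else float(raw)
        unit = UNITS[match.group("unit").lower()]
        per = match.group("per")
        results.append((value, unit, UNITS[per.lower()] if per else None))
    return results


if __name__ == "__main__":
    print(extract_quantities("a speed of 3m/s or two furlongs per fortnight"))
```

Such a pattern-based extractor is brittle by design (it ignores case distinctions between unit symbols and knows only a handful of units), which is exactly why it is only a starting point for comparison with statistical methods.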
The topic is to pick one or more of these aspects of semantics, see how far contemporary statistical AI methods can scale their extraction up to corpus size, and develop a symbolic application on top of the results (possibly with a lot of help from the group).