...
 
Commits (2)
  • no message · c5cc1154
    Florian Rabe authored
  • no message · d27919af
    Florian Rabe authored
    Merge branch 'master' of gl.kwarc.info:mathhub/data-proposal
    
    # Conflicts:
    #	Proposal/proposal.pdf
......@@ -149,18 +149,18 @@ But it is important to appreciate its community penetration and its ability to d
Mathematics faces a number of specific challenges that make Open Data arguably harder or at least differently hard than for other sciences.
-\paragraph{Institutional and Financial Challenge}
-The mathematical community is mainly made up from small research groups.
+\paragraph*{Institutional and Financial Challenge}
+The mathematical community is mainly made up of small research groups.
There are very few large research teams as are common in engineering and experimental sciences.
-Large collaborations (e.g., the Classification of Finite Simple Groups or the Polymath project) are driven by individuals and do not have a formal permanent institutional setting.
+Large collaborations (e.g., the Classification of Finite Simple Groups or the Polymath project) are driven by individuals and do not have a permanent institutional backing.
Therefore, most datasets are collected and most services are provided by individual mathematicians or small communities and are not archived systematically.
-This is connected to a financial challenge: even top researchers in mathematics have little research funding that could be devoted to hosting large databases and services.
-That precludes them from the computing resources necessary to host TRL 8 services, which shows the need for and potential of an EOSC-like initiative.
+This is connected to a financial challenge: even top researchers in mathematics have little research funding that can be devoted to hosting large databases and services.
+That precludes them from employing the computing resources necessary to host TRL 8 services, which shows the need for and potential of an EOSC-like initiative.
But so far only one mathematical dataset has been published via the EOSC infrastructure --- Jukka Kohonen's collection of lattices.
In conversations with him and others, we learned that this is primarily due to a lack of awareness of the EOSC infrastructure and a lack of semantics-aware services provided by it.
-\paragraph{Cultural challenge}
+\paragraph*{Cultural challenge}
Mathematics is traditionally performed in journal articles.
Even though there is a growing mathematical Open Source community, mathematicians receive little reputation and career benefit from maintaining services and sharing data.
Consequently, data sharing is often only an afterthought, and mathematicians publish datasets at whatever site makes sharing easiest.
......@@ -175,18 +175,18 @@ This shows the need for a systematic FAIR culture in mathematics, and our experi
But they need a project like \TheProject that offers a single, highly user-oriented, and widely visible service giving dataset authors as much added value as possible while requiring as little added work as possible.
% citing datasets and provides an official count of downloads.
-\paragraph{Technological challenge}
+\paragraph*{Technological challenge}
Even though most mathematicians are enthusiastic about Open Data, it is in most cases not in their interest or expertise to maintain databases and services.
-Even when not, the technological difficulties often make a mathematician's time too valuable to spend on data sharing.
+Even when it is, the technological difficulties often make a mathematician's time too valuable to spend on data sharing.
Moreover, and contrary to other disciplines, there are many different software suites that mathematicians use to interact with datasets, even within the same area of mathematics and for a single dataset.
Therefore, mathematicians need a standard format for datasets supported by strongly user-oriented services that lets them share their data as widely, easily, and seamlessly as possible.
-\paragraph{Theoretical challenge}
+\paragraph*{Theoretical challenge}
The rich structure and semantics of mathematics make it impractical to apply standard techniques like SQL or RDF to mathematical objects directly.
Before mathematical data can be stored (ultimately as strings of bits), it must be converted into a form that is much further from the meaning of the data as perceived by a mathematician than is usual in other sciences.
To find such a representation in the first place, many related objects and conventions must be considered and most interesting mathematical objects have multiple representations.
-Often deep mathematical theorems must be established to find each such representation.
+Often deep mathematical theorems must be established for each such representation.
Therefore, the standard format for datasets must be tightly integrated with symbolic representations of the objects and rigorously connect these objects to their encodings.
Only then is it possible to build the semantics-aware services envisioned by \TheProject that will attract mathematicians to the EOSC infrastructure.
......@@ -197,9 +197,9 @@ Only then is it possible to build the semantics-aware services envisioned by \Th
\eucommentary{Describe the advance your proposal would provide beyond the state-of-the-art}
Our ambitious proposal is to provide a \textbf{comprehensive} \textbf{semantics-aware} \textbf{FAIR} Open Data solution for mathematics.
-The \TheProject service offering will be the first of its kind in each respect.
+The \TheProject service offering will be the first of its kind in each of the following respects.
-\paragraph{FAIR}
+\paragraph*{FAIR}
As a rule, mathematicians strongly support the Open Science movement and happily make their datasets public.
This is accompanied by a vibrant and growing community of Open Source software for computational mathematics.
However, today most mathematical databases are shared in an ad hoc manner that makes FAIR sharing hard to impossible.
......@@ -212,11 +212,11 @@ It will provide a central service for mathematicians to make their datasets acce
It will allow users to browse, search, retrieve, and compute with these datasets.
And it will support system interoperability via automated import/export of datasets, including the transcoding from the original into the desired format.
-\paragraph{Semantics-Aware}
+\paragraph*{Semantics-Aware}
Generally, reusing shared data requires that the reuser be able to understand the semantics of the data~\cite[Rec. 7]{FAIR}.
This is particularly difficult for system interoperability where the semantics must not only be evident but must itself be accessible for automated processing~\cite[Rec. 8]{FAIR}, and it is particularly critical where data is used in safety-critical systems.
While this problem exists for all data, it is particularly challenging for mathematical data and similar data in related disciplines, where the semantics is very difficult to specify.
-Today there are virtually no mathematical datasets whose semantics is itself accessible.
+Therefore, today there are virtually no mathematical datasets whose semantics is itself accessible.
\TheProject will deliver the first Open Data solution that can \textit{understand} mathematical data.
This is essential to retain mathematical rigor in seamless Open Data collaboration where data provider and data user will often not interact with each other, e.g., data passed between systems must be translated according to its mathematical meaning, not just its textual presentation.
......@@ -236,11 +236,11 @@ This will allow \TheProject to build several services that have not been realize
\item translating objects from one concrete representation to another.
\end{compactitem}
-\paragraph{Comprehensive}
+\paragraph*{Comprehensive}
The best existing solutions for mathematical data focus on one kind of data.
For example, LMFDB handles only concrete data from one narrow area of mathematics.
zbMATH and swMATH handle mostly linked data for publication metadata, community reviews, software information, and symbolic data for formulas occurring in publications.
-Wikidata stores mostly linked data and objects visualizations.
+Wikidata stores mostly linked data and object visualizations.
But mathematics and research in general thrive on transferring results across disciplines and across environments.
\TheProject will deliver the first Open Data solution that supports all three kinds of data --- symbolic, encoded, and linked data.
......@@ -275,10 +275,10 @@ While the individual services and databases already exist, \TheProject will empl
These are in particular the data representation framework and the idea of mathematical schemas, described in Section~\ref{sec:method}.
The \pn project will focus on developing these for mathematical data, but similar ideas and principles apply to other sciences as well, and by providing a blueprint for deep FAIRness, \TheProject will act as a catalyst for innovation outside the project.
-The coherent of the various existing services in a deep FAIR framework constitutes a novel service that is significa
+The coherent integration of the various existing services on a uniform deep FAIR platform constitutes a novel service that is a significant improvement on the current state.
-Finally -- while not directly targeted in the \pn project -- the mathematical datasets provided by the \pn project in a uniform semantic framework will make them very attractive for applying Machine Learning (ML) methods to derive new (mathematical) conjectures that can then be attacked by conventional mathematical methods.
-Such methods of ``\emph{experimental mathematics}'' are already becoming popular in mathematics wherever there is enough data.
+Finally --- while not directly targeted in the \pn project --- the hosting of mathematical datasets in a uniform semantic framework will make them very attractive for applying Machine Learning (ML) methods, e.g., to derive new mathematical conjectures that can then be attacked by conventional methods.
+Such methods of \emph{experimental mathematics} are already becoming popular in mathematics wherever there is enough data.
The highly connected and semantically enhanced \pn datasets are going to be an enabling resource for experimental, ML-based technologies.
\subsubsection{Technologies and TRLs}\label{sec:trls}
......
......@@ -349,7 +349,9 @@ It provides uniform encodings of symbolic data in a single standardized concrete
All datasets and all objects in them will have URIs via which they become accessible.
We will also define a metadata standard that allows for tracking provenance, version, and license of mathematical datasets and their entries.
These URIs and the associated metadata form a linked dataset themselves, which is also stored in the framework.
-These research efforts are detailed in \WPref{foundations}.
+Finally, to ensure the sustainability of the \TheProject standard, we will submit it for ISO standardization by the end of the project duration.
+These research efforts are detailed in \WPref{foundations} as well as (for the ISO standard) in \WPref{management}.
\paragraph{Pilot datasets}\label{sec:pilotsets}
\strut\par\noindent\highlight{We integrate a representative selection of major datasets from different areas of mathematics into our infrastructure.}
......