@@ -13,12 +13,11 @@ In the sequel, we discuss the four FAIR principles and the challenges they pose

\noindent\textbf{\emph{Accessible}}

Mathematical datasets are typically accessible in the sense of FAIR: they are made available online, include metadata, and can be retrieved via their identifier using standardized and open protocols.

However, this does not allow accessing the rich internal structure of mathematical datasets.

Because this functionality is so critical, mathematical datasets are sometimes shared in a way that assigns persistent and globally unique identifiers to each entry in a dataset or even to every subobject of each entry (e.g., OEIS \cite{OEIS:on}, LMFDB \cite{lmfdb:on}, FindStat \cite{findstat}, and others).

However, this does not allow accessing their rich internal structure.

\inparahighlight{The level of accessibility needed in practice is much harder due to the wide variety of internal structure in mathematical datasets}.

Therefore, it is usually done ad hoc, identifiers are not standardized across datasets and may not be persistent, and communication protocols are dataset-specific.

Because this functionality is so critical, many mathematical datasets are already shared in a way that assigns persistent and globally unique identifiers to each entry in the dataset or even to every subobject of each entry (e.g., OEIS \cite{OEIS:on}, LMFDB \cite{lmfdb:on}, FindStat \cite{findstat}, and others).

But this is usually done ad hoc, identifiers are not standardized across datasets and may not be persistent, and communication protocols are dataset-specific.

%This requires each object to have a globally unique and persistent identifier, and not just the entire dataset (see \cite{BilTen:fingerprint13}).

...

...

@@ -30,12 +29,12 @@ Therefore, it is usually done ad hoc, identifiers are not standardized across da

\textbf{\emph{Reusable}}

Mathematical datasets are typically not reusable or very hard to reuse in the sense of FAIR.

First of all, they are often shared without licenses, under the implicit but legally false assumption that putting them online makes them public domain.

Moreover, the associated documentation often does not cover how precisely the data was created.

More critically, the associated documentation often does not cover how precisely the data was created.

This documentation is usually provided in ad hoc text files or implicitly in journal papers or software source code that potential users may not be aware of and whose detailed connection to the dataset may be elusive.

The problem is that the meaning and provenance of mathematical datasets must usually be given in the form of complex mathematical objects themselves --- not just as simple metadata that can be easily annotated.

Therefore, it is usually provided in ad hoc text files or implicitly in journal papers or software source code that potential users may not be aware of and whose detailed connection to the dataset may be elusive.

The problem is that the meaning and provenance of mathematical data must usually be given in the form of complex mathematical data themselves --- not just as simple metadata that can be easily annotated.

And \inparahighlight{the lack of a standard for associating complex semantics and provenance data effectively precludes or impedes most reuse in practice}.

\inparahighlight{The lack of a standard for associating mathematical semantics and provenance to data effectively precludes most reuse in practice} or at least makes it very difficult.

\medskip

\textbf{\emph{Findable}}

...

...

@@ -43,13 +42,12 @@ Mathematical data is typically somewhat findable in the sense of FAIR in that ob

This is particularly successful for bibliographic metadata (e.g., in Math Reviews, zbMATH, or swMATH).

However, for individual datasets, identifiers are often non-persistent, e.g., when shared on researchers' homepages.

But finding an object by its identifier or metadata is an easy problem in practice.

But in any case, finding a mathematical object by its identifier or metadata is an easy problem in practice.

It is much more important and difficult to find objects according to their internal structure or semantic properties.

The indexing necessary to find such objects is very difficult.

The indexing necessary for this is very difficult.

For example, consider an engineer who wants to prevent an electrical system from overheating and thus needs a tight estimate for the term $\int_a^b |V(t)I(t)|\,dt$ for all $a,b$, where $V$ is the voltage and $I$ the current.

Search engines like Google index accessible mathematical articles, but this is restricted to word-based searches.

This barely helps with finding mathematical objects because there are no keywords to search for.

Search engines like Google are restricted to word-based searches of mathematical articles, which barely helps with finding mathematical objects because there are no keywords to search for.

Computer algebra systems cannot help either since they do not incorporate the necessary special knowledge.

But the needed information is out there, e.g., in the form of

\begin{quote}

...

...

@@ -67,10 +65,9 @@ and will even extend the calculation

after the engineer chooses $p=q=2$ (Cauchy-Schwarz inequality).

Estimating the individual values of $V$ and $I$ is now a much simpler problem.

Admittedly, Google would have found the information by querying for ``\texttt{Cauchy-Schwarz H\"older}'', but that keyword itself was the crucial information the engineer was missing in the first place.

Admittedly, Google would have found the information by querying for ``Cauchy-Schwarz H\"older'', but that keyword itself was the crucial information the engineer was missing in the first place.

In fact, \inparahighlight{it is not unusual for mathematical datasets to be so large that determining the identifier of the sought-after object is harder than recreating the object itself}.\medskip

\textbf{\emph{Interoperable}}

...

...

@@ -80,8 +77,7 @@ Therefore, existing interoperability solutions tend to be domain-specific, limit

For trivial examples, consider the dihedral group of order 8, which is called $D_4$ in SageMath but $D_8$ in GAP due to differing conventions in different mathematical communities (geometry vs. abstract algebra).

Similarly, $0\,^\circ\mathrm{C}$ in Europe is ``called'' $273.15\,\mathrm{K}$ in physics.
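
To make the two examples concrete, the following display sketches the underlying translations. The superscripts $\mathrm{geom}$ and $\mathrm{alg}$ are labels introduced here only for illustration and are not notation used by either system; $n$ denotes the number of vertices of the regular polygon whose symmetries form the group.
\[
  D_n^{\mathrm{geom}} \;\cong\; D_{2n}^{\mathrm{alg}},
  \qquad
  \bigl|D_n^{\mathrm{geom}}\bigr| \;=\; \bigl|D_{2n}^{\mathrm{alg}}\bigr| \;=\; 2n ,
  \qquad
  T_{\mathrm{K}} \;=\; T_{^\circ\mathrm{C}} + 273.15 .
\]
For $n=4$, both names denote the same group of order $8$: the geometric convention indexes by the number of vertices, the algebraic one by the group order.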

In principle, this problem can be tackled by standardizing (mathematical) vocabularies.

But in the face of millions of defined concepts in mathematics, this has so far proved elusive.

In principle, this problem can be tackled by standardizing mathematical vocabularies, but in the face of millions of defined concepts in mathematics, this has so far proved elusive.

Moreover, large mathematical datasets are usually shared in highly optimized encodings (or even a hierarchy of consecutive encodings), which knowledge representation languages must capture as well to allow for data interoperability.

The proposers have developed or been involved with multiple leading candidates for such representation languages that will be integrated into a standard language by \TheProject.

...

...

@@ -93,10 +89,9 @@ The proposers have developed or been involved with multiple leading candidates f

A main idea of \TheProject is the following \inparahighlight{novel categorization of mathematical data}, which allows analyzing the specific challenges to FAIR data sharing.

\noindent\textbf{Symbolic data} consists of formal expressions such as formulas, formal proofs, programs, graphs, diagrams, etc.

\textbf{Symbolic data} consists of formal expressions such as formulas, formal proofs, programs, graphs, diagrams, etc.

These are written in a variety of highly-structured formal languages specifically designed for individual domains.

Because it allows for abstraction principles such as underspecification, quantification, and variable binding, symbolic data can (in contrast to the other two kinds) capture the full semantics of mathematical objects.

This comes at the price of being context-sensitive: expressions must be interpreted relative to their context and cannot be easily moved across environments, which makes \emph{Finding}, \emph{Reusing}, and \emph{Interoperability} difficult.

...

...

@@ -110,7 +105,7 @@ In response to this problematic situation, standard formats have been designed s

The latter has been used as an interoperability format for computer algebra systems in the OpenDreamKit project and already offers comprehensive services for symbolic data such as querying.

We get back to this in Section~\ref{sec:state_of_the_art}.

\noindent\textbf{Concrete data} employs representation theorems that allow encoding mathematical objects as simple data structures built from numbers, strings, lists, and records.

\textbf{Concrete data} employs representation theorems that allow encoding mathematical objects as simple data structures built from numbers, strings, lists, and records.

Thus, contrary to the other two kinds of mathematical data, concrete data combines optimized storage and processing with capturing the whole semantics of the objects.

But such representation theorems do not always exist because sets and functions, which are the foundation of most mathematics, are inherently hard to represent concretely.

Moreover, representation theorems may be very difficult to establish and understand, and there may be multiple different representations for the same object.

@@ -10,13 +10,13 @@ This proposal relates to the topic ``Prototyping new innovative services (INFRAE

\eucommentary{Develop an agile, fit-for-purpose and sustainable service offering accessible through the EOSC hub that can satisfy the evolving needs of the scientific community by stimulating the design and prototyping of novel innovative digital services. Innovative models of collaboration that genuinely include incentive mechanisms for a user oriented open science approach should be considered.}

The objectives of \TheProject fit the scope of the call perfectly.

\inparahighlight{The objectives of \TheProject fit the scope of the call perfectly.}

There are currently a variety of highly innovative and widely used services for digital data in the mathematical sciences.

These are developed by researchers, often in very agile open source communities, in the respective field in response to the specific needs in their own community and therefore fit their purpose exactly.

These are developed by researchers, often in very agile open source communities, in response to the specific needs in their own community and therefore fit their purpose exactly.

But they currently form a patchwork of disparate and mostly ad-hoc systems.

But they currently form a patchwork of disparate and mostly ad-hoc --- albeit mature and powerful --- services.

Therefore, the \TheProject consortium brings together a representative selection of experts in the development, maintenance, and application of such services.

\TheProject will integrate the most powerful and most used of these services into a coherent, scalable, easy-to-use service offering for the mathematical sciences that can be readily deployed at the EOSC.

\TheProject will integrate the most powerful and most used of these services into a coherent, scalable, easy-to-use service offering for the mathematical sciences that can be readily deployed on the EOSC Hub.

In the mathematical sciences, a systematic collaboration between Open Science-committed and user-oriented service providers is very innovative.

Our proposal is inspired by and partially driven by the results of the OpenDreamKit project (Horizon 2020, 2015--2019), which pioneered this collaboration model.

...

...

@@ -55,10 +55,10 @@ In particular, \TheProject includes the development of multiple client applicati

Crucially, these are integrated with existing widely-used systems, thus making it possible for users from other disciplines and industry to discover our services and integrate them into their existing workflows.

\TheProject is \textbf{strongly committed to providing a prototype service that can be readily integrated with the EOSC} (see \WPref{services}).

To maximize our impact, we ensure that many representative and well-known mathematical datasets, like the ones surveyed in \cite{bercic:cmo:table,Bercic:cmo:wiki}, are already deployed on this prototype service (see Figure~\ref{fig:datasets} and \WPref{cases}).

To maximize our impact, we ensure that many representative and well-known mathematical datasets, like the ones surveyed in \cite{bercic:cmo:table,Bercic:cmo:wiki}, will already be deployed on this prototype service (see Figure~\ref{fig:datasets} and \WPref{cases}).

Besides increasing the popularity of the EOSC, this will provide a well-greased pathway for other users to share their data via the EOSC.

In particular, this can salvage the many large and practically used datasets, which are currently generated and lost soon thereafter.

In particular, this can salvage the many large and practically used datasets that are currently generated and lost soon thereafter.

The latter happens because many datasets are created in the scope of small underfunded or unfunded research projects, often by junior researchers or PhD students, who are currently forced to abandon their datasets when they change research areas or pursue a non-academic career.