A main idea of \TheProject is a \inparahighlight{novel categorization of mathematical data}, which allows analyzing the specific challenges to FAIR data sharing.

An overview is given in Figure~\ref{fig:mathematical-data}.


\caption{Kinds of mathematical data}\label{fig:mathematical-data}

\end{figure}

\textbf{Symbolic data} consists of formal expressions such as formulas, formal proofs, programs, graphs, diagrams, etc.

These are written in a variety of highly-structured formal languages specifically designed for individual domains.

Because it allows for abstraction principles such as underspecification, quantification, and variable binding, symbolic data can (in contrast to the other two kinds) capture the full semantics of mathematical objects.

This comes at the price of being context-sensitive: \inparahighlight{expressions must be interpreted relative to their context and cannot be easily moved across environments, which makes \emph{Finding}, \emph{Reusing}, and \emph{Interoperability} difficult}.

Working with symbolic data in mathematics can be subdivided based on the area of application into \textbf{modeling}, \textbf{deduction}, and \textbf{computation}.

Each area employs a wide variety of sophisticated formal languages: modeling languages, logics, and programming languages, respectively.

Multiple different, often mutually non-interoperable, representation formats have been developed for symbolic data, usually growing out of small research projects and reaching different degrees of standardization, tool support, and user following.

These are usually optimized for specific applications, and little cross-format sharing is possible.

In response to this problematic situation, standard formats have been designed such as MathML~\cite{CarlisleEd:MathML3:on} and OMDoc/MMT~\cite{uniformal:on}.

The latter has been used as an interoperability format for computer algebra systems in the OpenDreamKit project and already offers comprehensive services for symbolic data such as querying.

We get back to this in Section~\ref{sec:state_of_the_art}.

...

...

But such representation theorems do not always exist because sets and functions, ...

Moreover, representation theorems may be very difficult to establish and understand, and there may be multiple different representations for the same object.

In any case, \emph{Access} is difficult because users need to know the representation theorems to understand the encoding, and this is often very complex.

Therefore, \inparahighlight{even if the representation function is documented, \emph{Finding}, \emph{Reuse}, and \emph{Interoperability} are difficult and error-prone}.

For example, consider the following recent incident from January 2019:

There are two encoding formats for directed graphs, both called \texttt{digraph6}: Brendan McKay's \cite{McKayFormats:on} and the one used by the GAP package Digraphs \cite{GAPDigraphFormat:on}, whose authors were unaware of McKay's format and essentially reinvented a similar one \cite{digraph6issue:on}.

The resulting problem has since been resolved but not without causing some misunderstandings first.

Concrete data can be subdivided into \textbf{record} data, where datasets are sets of records conforming to the same schema, and \textbf{array} data, which consists of very large, multidimensional arrays that require optimized management. %% tree and graph data?

Array data tends to come up in settings with large but simply-structured datasets such as simulation time series, while record data is often needed to represent complex objects, especially those from pure mathematics.

Record data and its querying are very well-standardized by the relational (SQL) model.

However, if encodings are used, SQL can never answer queries about the semantics of the original object.

The younger array databases, which offer efficient access to contiguous --- possibly lower-dimensional --- sub-arrays of datasets (voxels), are less standardized, but OPeNDAP~\cite{OPenNDAP:on} is becoming increasingly recognized even outside the GeoData community, where it originated.


\textbf{Linked data} introduces identifiers for objects and then treats them as blackboxes, only representing the identifier and not the original object.

The internal structure and the semantics of the object remain unspecified except for maintaining a set of named relations and attributions for these identifiers.

For example, it is general practice to use the URIs of the respective Wikipedia articles unless more specific ontologies are available.

The named relations allow forming large networks of objects, and the attributions of concrete values provide limited information about each one.

Linked data can be subdivided into \textbf{knowledge graphs} and \textbf{metadata}, e.g., as used in publication indexing services.

As linked data forms the backbone of the Semantic Web, linked data formats are very well-standardized: data formats come as RDF~\cite{RDF1.1primer}, the relations and attributes are expressed as ontologies in OWL2~\cite{w3c:owl2-overview}, and RDF-based databases (also called triplestores) can be queried via SPARQL~\cite{w3c:sparql11-query}.

For example, services like DBpedia~\cite{LehIseJak:dlsmkbew13} and Yago~\cite{HofSucBer:yago2a} crawl various aspects of Wikipedia to extract linked data collections and provide SPARQL endpoints.
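To make the blackbox character of linked data concrete, consider the following toy Python sketch; the relation names and the selection of triples are illustrative, not an actual DBpedia extract.

```python
# Toy illustration of linked data: objects are reduced to opaque URIs
# related by named relations; the objects' internal structure is gone.
# (Relation names and triples are invented for illustration.)

euler = "http://dbpedia.org/resource/Leonhard_Euler"

triples = [
    (euler, "field", "Mathematics"),
    (euler, "birthYear", "1707"),
]

# Only the exposed relations can be queried; a "deep" question about,
# say, the statement of Euler's theorem cannot even be asked here.
print([p for (s, p, o) in triples if s == euler])  # ['field', 'birthYear']
```

This is exactly the trade-off described below: queries over such networks are easy and standardized, but they can only ever reach the small exposed fragment of the semantics.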

%They supply a valuable fallback for the linked data ontologies to be developed in the \pn project.

The WikiData database~\cite{wikidata:on} collects such linked data and uses them to answer queries about the objects.

%This makes WikiData a primary target for the purposes of linked data management in \pn.

Thus, contrary to the other two kinds, linked data has very good FAIR-readiness, in particular allowing for URI-based \emph{Access}, efficient \emph{Finding} via query languages, and URI-mediated \emph{Reuse} and \emph{Interoperability}.

However, this \inparahighlight{FAIR-readiness comes at the price of not capturing the complete semantics of the objects so that \emph{Access} and \emph{Finding} are limited and \emph{Interoperability} and \emph{Reuse} are subject to misinterpretation}.

Thus, all kinds of data have distinct strengths and weaknesses that a universal approach must take into account.

Some of those strengths and weaknesses are summarized in Figure~\ref{fig:datakinds}.


\begin{figure}[hbt]\centering

\begin{tabular}{|l||ccc|}

...

...

Easy to process & \textbf{\large--} & \textbf{\large+} & \textbf{\large+}\\

\paragraph{Semantics-aware Open Data and Deep FAIRness}\label{sec:saod}

Concrete and linked data can be easily processed and shared using standardized formats such as CSV (comma-separated value lists) or RDF~\cite{RDF1.1primer}.

But in doing so, the semantics of the original data is not part of the shared resource: in concrete data, the semantics requires knowing the encoding function; and in linked data, almost the entire semantics is abstracted away.

For datasets with very simple semantics, this can be remedied by attaching informal labels (e.g., column heads), metadata, or free-text documentation.

But this is not sufficient for datasets in mathematics and related scientific disciplines where the semantics is itself very complex.

For example, an object's semantic type (e.g., ``polynomial with integer coefficients'') is typically very different from the type as which it is encoded and shared (e.g., ``list of integers'').

The latter allows reconstructing the original, but only if its type and encoding function (e.g., ``the entries in the list are the coefficients in order of decreasing degree'') are known.

Already for polynomials, the subtleties make this a problem in practice, e.g., consider different coefficient orders, sparse vs. dense encodings, or multivariate polynomials.
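The coefficient-order subtlety can be made concrete with a small Python sketch; the two decoding functions are invented for illustration and simply spell out the two most common conventions.

```python
# Hypothetical sketch: the same shared "list of integers" decodes to
# different polynomials depending on an undocumented encoding convention.
# Polynomials are represented as {degree: coefficient} dicts.

def decode_decreasing(coeffs):
    """Read coeffs as coefficients in order of decreasing degree."""
    n = len(coeffs) - 1
    return {n - i: c for i, c in enumerate(coeffs) if c != 0}

def decode_increasing(coeffs):
    """Read coeffs as coefficients in order of increasing degree."""
    return {i: c for i, c in enumerate(coeffs) if c != 0}

data = [3, 0, 1]  # the shared resource: just a list of integers
print(decode_decreasing(data))  # {2: 3, 0: 1}, i.e. 3x^2 + 1
print(decode_increasing(data))  # {0: 3, 2: 1}, i.e. 3 + x^2
```

Nothing in the dataset itself tells a consumer which of the two readings was intended; that information lives only in the (often undocumented) encoding function.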

Even worse, it is already a problem for seemingly trivial cases like integers: for example, the various datasets in the LMFDB use at least 3 different encodings for integers, because the trivial encoding as the CPU's built-in integers does not work when the numbers involved are too big.
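The following sketch shows the kind of divergence meant here; the three big-integer encodings are illustrative, not the actual LMFDB ones.

```python
# Hypothetical sketch: three ways a large integer might be shared.
# (Illustrative encodings, not the actual LMFDB conventions.)

n = 12345678901234567890123456789

as_string = str(n)                    # decimal string
as_digits = [int(d) for d in str(n)]  # list of decimal digits
as_limbs = []                         # base-2^32 limbs, least significant first
m = n
while m:
    as_limbs.append(m % 2**32)
    m //= 2**32

# All three round-trip to the same integer -- but only if the consumer
# knows which encoding function was used.
assert int(as_string) == n
assert int("".join(map(str, as_digits))) == n
assert sum(limb * (2**32)**i for i, limb in enumerate(as_limbs)) == n
```

Each encoding is perfectly reasonable in isolation; the interoperability problem arises only when the choice is not part of the shared resource.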

But mathematicians routinely use much more complex objects like graphs, surfaces, or algebraic structures.

We speak of \textbf{accessible semantics} if data has metadata annotations that allow recovering the exact semantics of the data.

Notably, in mathematics, this semantics metadata is itself very complex, usually symbolic mathematical data, which cannot be easily annotated.
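A minimal sketch of what accessible semantics could look like for the polynomial example above; all field names are invented for illustration.

```python
# Hypothetical sketch of "accessible semantics": the dataset carries a
# machine-readable description of its own encoding alongside the entries.
# (All schema field names are invented.)

dataset = {
    "schema": {
        "semantic_type": "polynomial with integer coefficients",
        "encoded_type": "list of int",
        "encoding": "coefficients in order of decreasing degree",
    },
    "entries": [[1, 0, -2], [5]],
}

# A deep service can now interpret each entry without out-of-band knowledge:
print(dataset["schema"]["encoding"])
```

In practice the `semantic_type` and `encoding` fields would themselves be symbolic expressions in a formal language rather than free-text strings, which is precisely why annotating them is hard.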

But \inparahighlight{without knowing the semantics, mathematical datasets only allow FAIR services that operate on the dataset as a whole}, which we call \textbf{shallow} FAIR services.

Much more important to users, however, are \textbf{deep} services, i.e., services that process individual entries of the dataset.

Figure~\ref{fig:deepfair} gives some examples of the contrast between shallow and deep services.

Deep services are only possible if the service can access and understand the semantics of the dataset.

...

...

Identification & DOI for a dataset & DOIs for each entry \\

Provenance & who created the dataset? & how was each entry computed? \\

Validation & is this valid XML? & does this XML represent a set of polynomials? \\

Access & download a dataset & download a specific fragment\\

Finding & find a dataset & find entries with certain properties\\

Reuse &\multicolumn{2}{c|}{impractical without accessible semantics}\\

Interoperability &\multicolumn{2}{c|}{impossible without accessible semantics}\\

\hline

...

...


Data & Findable & Accessible & Interoperable & Reusable \\\hline\hline

Symbolic & Hard & Easy & Hard & Hard \\\hline

Concrete &\multicolumn{4}{c|}{Impossible without access to the encoding function}\\\hline

Linked &\multicolumn{4}{c|}{Easy but only applicable to the small fragment of the semantics that is exposed}\\\hline

\end{tabular}

\caption{Deep FAIR readiness of mathematical data}\label{fig:FAIR-readiness}

\end{figure}

Note that \inparahighlight{the advantages of deep services are not limited to mathematics at all}.

For example, in 2016 \cite{ZieEreElO:GeneErrors16}, researchers found widespread errors in papers in genomics journals with supplementary Microsoft Excel gene lists.

About 20\% of them contain erroneous gene name conversions, because the software misinterpreted string-encoded genes as months.


In engineering, encoding mistakes can quickly become safety-critical, e.g., if a dataset of numbers is shared without their physical units, precision, and measurement type.

With accessible semantics, datasets can be validated automatically against their semantic type to avoid errors such as falsely interpreting a measurement in inch as a measurement in meters, a gene name as a month, or a column-vector matrix as a row-vector matrix.
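The unit case can be sketched in a few lines of Python; the conversion table and function names are illustrative, not part of any proposed standard.

```python
# Minimal sketch of semantics-aware validation (names are invented):
# a value is only accepted together with its declared unit, so an
# inch/meter mix-up becomes an explicit error instead of a silent bug.

UNITS_TO_METERS = {"m": 1.0, "inch": 0.0254}

def as_meters(value, unit):
    """Normalize a length to meters; reject values with unknown units."""
    if unit not in UNITS_TO_METERS:
        raise ValueError(f"unknown unit: {unit}")
    return value * UNITS_TO_METERS[unit]

# 10 inches and 10 meters are no longer interchangeable raw numbers:
print(as_meters(10, "inch"))
print(as_meters(10, "m"))
```

The same pattern generalizes from units to arbitrary semantic types: validation becomes a check of the data against its declared semantics rather than against its raw encoding.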

\highlight{\textbf{A Data Representation Standard and Implementation Framework for Deep FAIR Services}\\

We can now circle back to our objectives and state them more concisely: \textbf{the objective of \TheProject is building Deep FAIR services for mathematics and related sciences}.

The central idea is to integrate the existing standards for different kinds of data into a coherent representation standard for mathematical data that systematically makes the semantics accessible.

This enables (i) prototyping universally applicable Deep FAIR services that improve on the existing ad hoc or limited solutions and (ii) making a wide variety of existing datasets available via a central platform.