Commit d27919af authored by Florian Rabe

no message

Merge branch 'master' of gl.kwarc.info:mathhub/data-proposal

# Conflicts:
#	Proposal/proposal.pdf
parents c5cc1154 7a752ca3
......@@ -38,9 +38,9 @@ The Open Archive of Formalizations (OAF) is a collection of proof assistant libr
This includes specifying their semantics in the format itself via meta-logics.
In this task we complete the conversion and convert them into the \TheProject standard format.
In particular, the representation of proofs will be very challenging.
Moreover, each proof assistant library is a difficult very conversion task by itself because each uses a different, very complex meta-logic.
Therefore, we will only be pick very few proof assistant libraries as examples.
In particular, the representation of proofs will be challenging.
Moreover, each proof assistant library is a difficult conversion task by itself because each uses a different, complex meta-logic.
Therefore, we will only pick a few proof assistant libraries as examples.
This task will be led by \site{FAU}, which has already built the OAF.
\end{task}
......
......@@ -15,7 +15,7 @@
\end{inparaenum}
This includes:
\begin{compactitem}
\item ensuring awareness of the results in the user community
\item ensuring awareness of the results in the user community,
\item engaging in cross-community discussions to foster scientific collaboration and joint development,
\item spreading the expertise through workshops and training sessions,
\item providing training for dataset developers on how to make their datasets more visible.
......@@ -108,7 +108,7 @@ PM=12,partners={CAE,CHA,FAU,FIZ,UL,PS}]
In this task, we conduct general outreach activities targeted at researchers and industry practitioners.
These will take the form of official communications, workshops at major mathematical meetings, and targeted communications to specific communities.
This includes transdisciplinary outreach to related fields that involve mathematical data, in particular computer science, physics, life sciences, and engineering.
We will build on existing research communities connected with the partners, such as the more than 2,000 individual EMS members or more than 7,000 zbMATH reviewers. APIs designed for specific community needs will propel the adaptation of the services further. As a model serves, e.g., the open zbMATH API for the MathOverflow community site \cite{MO:on}, which is currently the largest online community in mathematics ($\approx 90 K$ registered users). A lean API facilitates the integration of references into the discussed questions there, and allows vice versa for the seamless interlinking of the literature with ongoing research. Similar advantages can be expected from availability of research data. Beyond mathematics, this task will be synergistically supported by the system APIs to RADAR research data hosting service provided by \site{FIZ}: researchers from other domains can host their research data on RADAR and integrate mathematical aspects into the \pn framework. The expertise developed in \TheProject on standardization and dissemination of symbolic data resources will fed into FORCE11 community standards like previously for software standardization.
We will build on existing research communities connected with the partners, such as the more than 2,000 individual EMS members or more than 7,000 zbMATH reviewers. APIs designed for specific community needs will further propel the adoption of the services. One model is the open zbMATH API for the MathOverflow community site~\cite{MO:on}, which is currently the largest online community in mathematics ($\approx 90 K$ registered users). A lean API facilitates the integration of references into the questions discussed there and, vice versa, allows for the seamless interlinking of the literature with ongoing research. Similar advantages can be expected from the availability of research data. Beyond mathematics, this task will be synergistically supported by the system APIs to the RADAR research data hosting service provided by \site{FIZ}: researchers from other domains can host their research data on RADAR and integrate mathematical aspects into the \pn framework. The expertise developed in \TheProject on standardization and dissemination of symbolic data resources will be fed into FORCE11 community standards, as previously for software standardization.
This task will be led by \site{EMS}, which has the institutional support and recognition to conduct formal outreach activities.
All other sites will contribute.
......
......@@ -43,15 +43,15 @@ For symbolic data, this has been solved by the use of meta-logical frameworks an
For linked and concrete data, the use of ontology languages like OWL or database schemas, respectively, is common; but these only offer general-purpose datatypes like numbers, strings, and untyped lists, which are too weak for the complex datatypes that pervade the mathematical sciences, like polynomials, multidimensional arrays, graphs, towers of algebraic structures (e.g.
matrices over polynomials over algebraic extensions over finite fields), physical quantities, or numbers with error intervals.
%These include base types such as string, integers, boolean; collection types such as finite sets, lists, vectors, matrices; aggregation types such as products, unions, and records; algebraic types such as rings and fields; and symbolic types such as rational fields and polynomials.
These have to \emph{encoded} in terms of the low-level datatypes.
These have to be \emph{encoded} in terms of the low-level datatypes.
If these encodings are not described in detail, the data is not reusable.
For example, Kohonen's lattice dataset uses 5 encoding steps: lattices are encoded as graphs with canonically labelled nodes, the graphs as adjacency matrices, the adjacency matrices as bit vectors and the bit vectors as \texttt{digraph6} strings (similar to \texttt{base64}), and finally the entire file containing many lattices is gzipped.
Similar steps are needed for the graph datasets.
Even when these encodings are documented, they are tedious and error-prone, and make difficult any automated processing as needed for data validation, reproduction, or machine learning.
Even when these encodings are documented, they are tedious and error-prone, and make difficult any automated processing needed for data validation, reproduction, or machine learning.
In the OpenDreamKit project, the FAU group has developed a systematic solution by annotating datasets with formal schemas that specify both the high-level mathematical type and the encoding function.
In this task, we expand on this efforts.
In this task, we expand on these efforts.
We standardize a fixed set of mathematical datatypes (such as the ones mentioned above) that subsumes at least all datatypes occurring in the datasets of \WPref{cases}.
Moreover, we standardize encodings for these datatypes, again subsuming those used by practitioners in building their datasets.
The biggest subtask here is in surveying the practically used datasets and making sure our standard is comprehensive enough.
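
The encoding chain described above for the lattice dataset (lattice, graph, adjacency matrix, bit vector, printable string, gzipped file) can be sketched as follows. This is only an illustrative sketch under stated assumptions: all function names are hypothetical, the canonical labelling of nodes is omitted, the string layout follows the \texttt{digraph6} convention as commonly documented (a leading \texttt{\&}, one size character, then 6-bit groups offset by 63) and is restricted to at most 62 nodes, and nothing here is taken from the dataset's actual tooling.

```python
import gzip


def lattice_to_adjacency(n, covers):
    """Steps 1-2: view the lattice as a directed graph of its covering
    relation and store that graph as an n x n adjacency matrix.
    Nodes are assumed to be labelled 0..n-1 already (no canonical labelling)."""
    matrix = [[0] * n for _ in range(n)]
    for lower, upper in covers:
        matrix[lower][upper] = 1
    return matrix


def matrix_to_bits(matrix):
    """Step 3: flatten the adjacency matrix row by row into a bit vector."""
    return [bit for row in matrix for bit in row]


def bits_to_string(n, bits):
    """Step 4: pack the bit vector into printable characters
    (digraph6-style layout, assumed; only n <= 62 is handled)."""
    if n > 62:
        raise ValueError("this sketch only handles graphs with at most 62 nodes")
    padded = bits + [0] * (-len(bits) % 6)  # pad to a multiple of 6 bits
    chars = ["&", chr(n + 63)]
    for start in range(0, len(padded), 6):
        value = 0
        for bit in padded[start:start + 6]:
            value = (value << 1) | bit
        chars.append(chr(value + 63))
    return "".join(chars)


def encode_file(lattices):
    """Step 5: write one string per lattice and gzip the whole file."""
    lines = [
        bits_to_string(n, matrix_to_bits(lattice_to_adjacency(n, covers)))
        for n, covers in lattices
    ]
    return gzip.compress(("\n".join(lines) + "\n").encode("ascii"))


def string_to_matrix(text):
    """Inverse of steps 3-4: recover the adjacency matrix from one string."""
    if not text.startswith("&"):
        raise ValueError("expected a string starting with '&'")
    n = ord(text[1]) - 63
    bits = []
    for ch in text[2:]:
        value = ord(ch) - 63
        bits.extend((value >> shift) & 1 for shift in range(5, -1, -1))
    return [bits[row * n:(row + 1) * n] for row in range(n)]


# Usage: the four-element "diamond" lattice 0 < 1, 2 < 3.
diamond = (4, [(0, 1), (0, 2), (1, 3), (2, 3)])
encoded_line = bits_to_string(4, matrix_to_bits(lattice_to_adjacency(*diamond)))
archive = encode_file([diamond])
```

A reuser who is handed only the gzipped file has to know every one of these steps, in the right order, to get back to the lattice; a formal schema that names the high-level type and the encoding function makes that inverse (here \texttt{string\_to\_matrix}) derivable instead of folklore.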
......
......@@ -86,7 +86,7 @@ This will be a from-scratch implementation but building on previous experience o
In addition to a human-oriented user interface for the shallow services, we will develop two advanced accessibility services that are specific to the needs of users in the mathematical sciences and engineering.
Firstly, we develop accessibility services for users with disabilities, e.g., to read out mathematical datasets for blind users.
This critically requires the codec framework developed in \taskref{foundations}{dtypes} as reading the encoded data is useless to humans --- only the decoded data yields a mathematical object that can be communicated to a user
This critically requires the codec framework developed in \taskref{foundations}{dtypes} as reading the encoded data is useless to humans --- only the decoded data yields a mathematical object that can be communicated to a user.
Secondly, we adapt the existing visualization components developed by the partners to make mathematical data accessible in ways more enticing and practical for human users.
This includes the browsing and management of large symbolic datasets (MathHub, \site{FAU}), the visualization of large graphs of mathematical objects (TGView(3D), \site{FAU}), the semantic interaction with symbolic data (MMT, \site{FAU}), property-based presentation of mathematical objects (Sage-explorer, \site{PS}), native visualization of mathematical objects within computational systems (Sage, \site{PS}), and the exploration of datasets of mathematical objects via their mathematical invariants (DiscreteZOO, \site{UL}; LMFDB, \site{CHA}).
......@@ -103,7 +103,7 @@ This includes a substitution tree index for all symbolic data, a value index of
We use this index to build an efficient search service and integrate it into the user interface.
This will also allow for an innovative form of conjecturing by finding connections between seemingly unrelated data objects in different datasets that share sub-objects.
This will be based on several technologies already developed by project partners: the MathWebSearch search engine for symbolic data at \site{FAU}, the publication meta-data search service at \site{FIZ}, and the search capabilities developed for concrete mathematical data in the LMFDB \site{CHA}.
This will be based on several technologies already developed by project partners: the MathWebSearch search engine for symbolic data at \site{FAU}, the publication meta-data search service at \site{FIZ}, and the search capabilities developed for concrete mathematical data in the LMFDB (\site{CHA}).
This task will be led by \site{UL} and \site{FAU}, with contributions from \site{FIZ} and \site{CHA} as indicated above.
\end{task}
......
......@@ -330,7 +330,7 @@ Note that the technologies used internally in \pn are listed and discussed separ
FR: the example activities above are EU-speak for the innovation cycle; more relevant for us are the activities in the scope section, which I've used to derive the formulations used below; avoid deleting those phrases when revising this section}
\paragraph{Open data framework}
\highlight{We will develop a framework for representing mathematical datasets using symbolic, concrete, and linked data with accessible semantics.}
\strut\par\noindent\highlight{We will develop a framework for representing mathematical datasets using symbolic, concrete, and linked data with accessible semantics.}
%This will allow the automated discovery and reuse of datasets and items within these sets.
%This will critically boost user-oriented open science because it covers the typical usage scenario where the reuser is not familiar with the details of the dataset she is hoping to find or reuse.
......@@ -354,8 +354,7 @@ Finally, to ensure the sustainability of the \TheProject standard, we will submi
These research efforts are detailed in \WPref{foundations} as well as (for the ISO standard) in \WPref{management}.
\paragraph{Pilot datasets}\label{sec:pilotsets}
\highlight{We integrate a representative selection of major datasets from different areas of mathematics into our infrastructure.}
\strut\par\noindent\highlight{We integrate a representative selection of major datasets from different areas of mathematics into our infrastructure.}
We have carefully put together our consortium such that these communities are represented by partners, i.e., for all pilot datasets there are partners who are the maintainers themselves or have close ties to them.
......@@ -377,8 +376,7 @@ It also indicates
These research efforts are detailed in \WPref{cases}.
\paragraph{Service prototypes and their integration into the EOSC Hub}
\highlight{The core service prototyped by \TheProject is the semantics-aware FAIR data sharing infrastructure that allows the uniform integration and interoperation of mathematical services across all datasets.}
\strut\par\noindent\highlight{The core service prototyped by \TheProject is the semantics-aware FAIR data sharing infrastructure that allows the uniform integration and interoperation of mathematical services across all datasets.}
This service will be deployed on a major server or small cluster of servers that are funded by \TheProject.
A reference instance of these services will be maintained by the coordinating site \site{FAU}, but all hardware and software will be designed such that the services can be easily ported or replicated by other providers such as the \site{FIZ} site.
......@@ -392,8 +390,7 @@ These include in particular the specification of hardware and software interface
These research efforts are detailed in \WPref{services}.
\paragraph{Outreach and Adoption in Different User Communities}
\highlight{We ensure the scalability of our results by providing our services to user communities from different disciplines during the \TheProject lifetime.}
\strut\par\noindent\highlight{We ensure the scalability of our results by providing our services to user communities from different disciplines during the \TheProject lifetime.}
We will start with outreach activities immediately at the start of the project and gradually ramp them up.
This will also aid with raising the awareness of FAIR concepts in the mathematical community.
......
......@@ -45,7 +45,7 @@ The \pn project will integrate the OEIS and similar datasets into the EOSC and t
\smallskip
\noindent\textbf{KPIs}: To evaluate the impact, we will track the number of databases on the \pn server (both the ones described in \WPref{cases} and the ones contributed by mathematicians at the summer events) and the numbers of searches (finding) and downloads (access).
We expect that the latter matches the former in number, and that the number of downloads goes into the dozens for each dataset from pure mathematics and higher for those from other disciplines.
We expect that the contributed databases match the \pn ones in number, and that the number of downloads goes into the dozens for each dataset from pure mathematics and higher for those from other disciplines.
\paragraph{Expected Impact: Opening up the EOSC ecosystem to new innovative actors}
The mathematical community is currently very under-represented on the EOSC: there is only one mathematical dataset and the EOSC is currently virtually unknown in the community.
......@@ -86,7 +86,7 @@ Generally speaking, it will improve the capacities of multiple groups of ESOC ec
Moreover, the \pn platform gives dataset providers an easy framework for making their results available and academically recognized.
This will also free scarce resources of highly trained researchers that are currently tied up by the overhead of data sharing.
\item \emph{Natural and Social Scientists} can do the same for their interests, e.g., searching possibly connected phenomena via the count sequences (e.g. petal numbers in flowers related to rabbit populations and crystal faces).
\item \emph{Educators} will have an easy-to-use resource for mathematical examples (the \pn datasets) that are integrated into mathematical education infrastructures like CoCalc~\cite{cocalc:on} (via the SageMath interfaces: CoCalc is based on SageMath).
\item \emph{Educators} will have an easy-to-use resource for mathematical examples (the \pn datasets) that are integrated into mathematical education infrastructures like CoCalc~\cite{cocalc:on} (via the SageMath interfaces from \taskref{services}{I}: CoCalc is based on SageMath).
\item \emph{Funding agencies and Hiring Committees} have a central framework to judge direct contributions and re-use patterns in mathematical data. This will ultimately better integrate the recognition of data contributions into the academic reputation economy and make data production a more attractive proposition for junior researchers.
\item \emph{Industry} --- in particular engineering companies --- can use mathematical model databases and related services.
Industrial stakeholders will be directly involved in the development of the data standards and framework, so that the services will be exactly tailored to their specific needs as well as to the needs of the scientific community.
......@@ -116,7 +116,7 @@ Ease of use &
First-time experiences are crucial to gain acceptance. \TheProject will design an ergonomic multi-user web-based graphical user interface, following web standards to best support a large array of browsers, including cell phones and tablets. We will explore opportunities for integration in interactive boards, as an aid for teaching and collaborative research.\\\hline
\end{longtable}
This analysis directly shows the road towards exploitation: in our dataset work package \WPref{cases} and out community outreach work package \WPref{dissem}, we budget extensive resources to increase the visibility and use of both the general EOSC infrastructure in general and the \TheProject service in particular.
This analysis directly shows the road towards exploitation: in our dataset work package \WPref{cases} and our community outreach work package \WPref{dissem}, we budget extensive resources to increase the visibility and use of both the EOSC infrastructure in general and the \TheProject service in particular.
\inparahighlight{By EOSC-publishing a collection of important datasets ourselves, deeply involving community multipliers in our research, and integrating our services with existing widely used systems, we jump-start the realization of the impacts predicted above.}
\noindent\textbf{KPIs}: Being inherently a catalyst, the exploitation is hard to measure.
......@@ -131,7 +131,7 @@ As described in \WPref{services}, our entire software infrastructure will be des
That includes in particular the compatibility of \TheProject's deep FAIR services with the existing shallow FAIR services on the EOSC Hub, such as B2Handle, B2Share, etc.
\inparahighlight{Deep FAIR services are a critical prerequisite for large-scale adoption of the EOSC, especially for transdisciplinary reuse} because they allow much more meaningful interaction between researchers.
Especially for large and complex datasets --- as typical for but not limited to the mathematical sciences --- it is important that researchers can find/access/reuse/interoperate with dataset fragments and moreover do so using automated tools.
Especially for large and complex datasets --- as typical for, but not limited to, the mathematical sciences --- it is important that researchers can find/access/reuse/interoperate with dataset fragments and moreover do so using automated tools.
This can only be accomplished if the semantics of datasets is described in machine-accessible ways, i.e., by using technologies as envisioned by \TheProject.
For example, with shallow FAIR services only, a researcher has virtually no chance of even finding a relevant dataset because the shared version of the dataset uses encodings that make them non-understandable to a search engine.
Only by attaching machine-readable semantics, i.e., by enabling deep FAIR services, can a search engine determine which datasets could be interesting for the potential reuser.
......@@ -318,7 +318,7 @@ The \inparahighlight{maximization of long-term impact is baked into the proposal
Moreover, \inparahighlight{the partners are committed to post-project efforts} including the following activities:
\begin{compactenum}
\item We will continue the dissemination to the scientific community and industrial stakeholders through participation in international conferences and publications, including software demonstrations during the conferences.
\item We will training students an colleagues in using the \TheProject technologies to publish and work with mathematical datasets.
\item We will train students and colleagues in using the \TheProject technologies to publish and work with mathematical datasets.
\item We will expand the \TheProject user base by continuing the research collaborations with existing users and identifying new scientific (specifically from neighbouring fields) and industrial users.
\item We will apply for funding at European and national levels for related projects, in particular to deepen the integration with the EOSC and with national research data initiatives.
\end{compactenum}
......
......@@ -99,13 +99,13 @@ The SC is chaired by the Project Coordinator and includes one principal investig
\noindent\textbf{Members} The AB consists of top level experts from partner and external organizations, including both experts from the project scientific area, and experts on legal and social matters. Potential candidates include
\begin{compactenum}
\item Prof. Neil Sloane, founder of the Online Encyclopedia of Integer Sequences (OEIS)
\item Prof. Bettina Eick, heavy contributor (computational packages and data) to the GAP computer algebra system
\item Prof. Bruno Buchberger, founder of the Research Institute of Symbolic Computation (RISC Linz)
\item Prof. Ursula Martin, Mathematical Social Machines
\item a math-affine member of a suitable DIN or ISO standardization committee
\item Prof. Stephen M. Watt, Dean of Mathematics at University of Waterloo, co-founder of Maplesoft, co-founder of the International Mathematical Knowledge Trust (IMKT).
\item Dr. Jukka Kohonen, has published the only mathematical EOSC dataset so far.
\item Prof. Neil Sloane, founder of the Online Encyclopedia of Integer Sequences (OEIS),
\item Prof. Bettina Eick, heavy contributor (computational packages and data) to the GAP computer algebra system,
\item Prof. Bruno Buchberger, founder of the Research Institute of Symbolic Computation (RISC Linz),
\item Prof. Ursula Martin, Mathematical Social Machines,
\item a math-affine member of a suitable DIN or ISO standardization committee,
\item Prof. Stephen M. Watt, Dean of Mathematics at University of Waterloo, co-founder of Maplesoft, co-founder of the International Mathematical Knowledge Trust (IMKT),
\item Dr. Jukka Kohonen, who has published the only mathematical EOSC dataset so far,
\item Dr. Paul-Olivier Dehaye, social entrepreneur, mathematician, and data protection activist.
\end{compactenum}
Some of these have been contacted and have already expressed their willingness to serve in the AB.\smallskip
......@@ -128,10 +128,10 @@ The EUG consists of active and visible users of the mathematical data and servic
\item Dr. Martin Otter and Prof. Peter Fritzson, chair-people of the Modelica Association,
\item Prof. Anne Fr\"uhbis-Kr\"uger, computer algebra \& data specialist,
\item Dr. David Carlisle (Numerical Algorithms Group (NAG), Oxford), {\LaTeX} guru, Editor of the MathML Recommendations, XSLT hacker,
\item Andre Gaul, developer of PaperHive, an open annotation and review platform for academic publications
\item Prof Vivian Pons, SageMath developer/ user, educator, and academic diversity advocate.
\item Prof. Jörg Arndt (TH Simon Ohm, N\"urnberg) OEIS Editor
\item Prof. John Cremona and Dr. David Farmer, senior members of the $L$-functions and Modular Forms Database (LMFDB)
\item Andre Gaul, developer of PaperHive, an open annotation and review platform for academic publications,
\item Prof. Vivian Pons, SageMath developer/user, educator, and academic diversity advocate,
\item Prof. Jörg Arndt (TH Simon Ohm, N\"urnberg), OEIS Editor,
\item Prof. John Cremona and Dr. David Farmer, senior members of the $L$-functions and Modular Forms Database (LMFDB),
\item Dietmar Winkler (University of South-Eastern Norway) works on modeling of hydroelectric power systems using Modelica, has developed commercial and open-source libraries and is a member of the Modelica Association.
% https://www.usn.no/english/about/contact-us/employees/dietmar-winkler-article196791-7531.html
% Peter Harman: I've asked him and he is interested in the project and giving some feedback.
......@@ -162,7 +162,7 @@ We have therefore chosen to schedule the first milestone (\mileref{startup}) to
In milestone \mileref{proto1}, we synchronize the first fully functional version of the \pn data standard and showcase first prototypes of the services, which will be essentially completed in milestone \mileref{eval}, which also marks initial evaluation activities.
Milestone \mileref{wrapup} wraps up the project and marks the completion of all deliverables.
The milestones have been scheduled before the project yearly meetings in early summer, where they can be discussed in detail, tracking the progress in ea3ch work package through status reports on the tasks and deliverables and take corrective measures, where necessary, and critical decisions regarding further plans. The later milestones coincide as well with the formal project reviews for demonstration, assessment and discussion with the reviewers. We envisage that this setup will give the project the vital coherence in spite of the broad interdisciplinary mix of various backgrounds of the participants.\smallskip
The milestones have been scheduled before the project yearly meetings in early summer, where they can be discussed in detail: we track the progress in each work package through status reports on the tasks and deliverables, take corrective measures where necessary, and make critical decisions regarding further plans. The later milestones also coincide with the formal project reviews for demonstration, assessment, and discussion with the reviewers. We envisage that this setup will give the project the vital coherence in spite of the broad interdisciplinary mix of various backgrounds of the participants.\smallskip
\noindent\textbf{General Milestones}
......