Commit b53eed3f authored by Andreas Schärtl
report: versioning: write about multiple graphdb repositories

parent e0d9b662
...

stream as-is to an HTTP endpoint provided by our GraphDB instance.
maybe I'll also write an Importer for another DB to show that the
choice of database is not that important.)}

\subsection{Scheduling and Version Management}

Collector and Importer were implemented as library code that can be
called from various front ends. For this project, we provide both a
...

to schedule an import of a given Git repository every seven days to a
given GraphDB instance.

Automated job control that regularly imports data from the same
sources leads us to the problem of versioning. In our current design,
multiple ULO exports~$\mathcal{E}_i$ depend on original third party
libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$ through the
workflow of Collector and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
  \mathcal{L}_1 \rightarrow \; &\mathcal{E}_1 \rightarrow \mathcal{D} \\
  \mathcal{L}_2 \rightarrow \; &\mathcal{E}_2 \rightarrow \mathcal{D} \\
  &\vdots{} \\
  \mathcal{L}_n \rightarrow \; &\mathcal{E}_n \rightarrow \mathcal{D}
\end{align*}
from $n$~individual libraries~$\mathcal{L}_i$ into a single
database storage~$\mathcal{D}$ that is used for querying.

However, mathematical knowledge isn't static. When a given
library~$\mathcal{L}^{t}_i$ at revision~$t$ gets updated to a new
version~$\mathcal{L}^{t+1}_i$, this change will eventually propagate
to the associated export and result in a new set of RDF
triplets~$\mathcal{E}^{t+1}_i$. Our global database
state~$\mathcal{D}$ then needs to be updated to match the changes
between~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$. Finding an
efficient implementation for this problem is not trivial. While it
should be possible to compute the difference between two
exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and derive the
changes to apply to~$\mathcal{D}$, the large number of triplets makes
this appear unfeasible. As any implementer of a greater tetrapodal
search system will encounter this problem, we suggest two possible
approaches to solving it.
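To make the diffing step concrete, a naive delta computation over two
serialized exports might look as follows. This is a sketch under the
assumption that exports are serialized as canonical N-Triples, one
triplet per line; blank-node renaming and the sheer number of triplets
are what make this hard in practice, and this sketch handles neither.

```python
# Sketch: delta between two export versions E^t and E^{t+1}, assuming
# canonical N-Triples serialization (one triplet per line). A real RDF
# diff would also have to account for blank-node renaming, which this
# line-based comparison ignores.

def export_delta(old_export: str, new_export: str):
    """Return (to_remove, to_add) triplet sets for updating D."""
    old_triplets = {line for line in old_export.splitlines() if line.strip()}
    new_triplets = {line for line in new_export.splitlines() if line.strip()}
    to_remove = old_triplets - new_triplets  # in E^t but no longer in E^{t+1}
    to_add = new_triplets - old_triplets     # new in E^{t+1}
    return to_remove, to_add
```

Both set differences are linear in the number of triplets, so the cost
of this approach is dominated by having to materialize and compare the
full exports, which is exactly the scalability concern raised above.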
One approach is to annotate each triplet in~$\mathcal{D}$ with
versioning information about which particular
export~$\mathcal{E}^{t}_i$ it was derived from. During an import
of~$\mathcal{E}^{t}_i$ into~$\mathcal{D}$, we could (1)~first remove
all triplets in~$\mathcal{D}$ that were derived from the previous
version~$\mathcal{E}^{t-1}_i$ and (2)~then import all triplets from
the current version~$\mathcal{E}^{t}_i$. Annotating triplets with
versioning information is an approach that should work, but it does
introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$, where
$n$~is the number of triplets in~$\mathcal{D}$. This effectively
doubles the required database storage, a not very satisfying solution.
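One way to realize this version annotation without materializing extra
triplets per statement is to place each export version in its own named
graph, so that step~(1) becomes a single \texttt{DROP} and step~(2) a
batch insert. The sketch below builds the two SPARQL updates; the graph
IRI scheme is our own assumption, not something prescribed by
\emph{ulo-storage} or GraphDB.

```python
# Sketch: per-export versioning via named graphs (an assumed design,
# not part of ulo-storage). Each export version E^t_i lives in its own
# graph, so removing the previous version is a DROP and importing the
# current one is an INSERT DATA into a fresh graph.

def build_update(library: str, old_rev: str, new_rev: str, triplets):
    """Return (drop_old, insert_new) SPARQL update strings."""
    def graph(rev):
        # hypothetical graph IRI scheme: one graph per library revision
        return f"<http://example.org/graph/{library}/{rev}>"

    drop_old = f"DROP SILENT GRAPH {graph(old_rev)}"
    insert_new = (
        f"INSERT DATA {{ GRAPH {graph(new_rev)} {{\n"
        + "\n".join(triplets)
        + "\n}} }}".replace("}} }}", "} }")
    )
    return drop_old, insert_new
```

Queries that should span all versions would then need to address the
union of graphs, which SPARQL's default graph semantics or GraphDB
configuration can provide.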
Another approach is to regularly re-create the full data
set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
all problems related to updating existing data sets, but it does come
with additional computation requirements. It also means that changes
in a given library~$\mathcal{L}_i$ take some time to propagate
to~$\mathcal{D}$.
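The scheduling side of this approach is simple; a minimal sketch of a
periodic full rebuild could look like the following, where
\texttt{rebuild} stands in for the Collector and Importer library
calls (the function names and the injected clock are our illustration,
not the actual front end).

```python
import time

# Sketch: periodic full re-creation of D. rebuild() is a placeholder
# for "drop D, then re-run Collector and Importer for every E_i"; the
# seven-day period matches the interval suggested in the text.

REBUILD_PERIOD = 7 * 24 * 60 * 60  # seven days, in seconds

def run_scheduler(rebuild, now=time.time, sleep=time.sleep, iterations=None):
    """Run rebuild() every REBUILD_PERIOD seconds.

    now/sleep are injectable for testing; iterations bounds the number
    of rebuilds (None means run forever).
    """
    last_run = None
    done = 0
    while iterations is None or done < iterations:
        if last_run is None or now() - last_run >= REBUILD_PERIOD:
            rebuild()      # full re-import of all exports E_i
            last_run = now()
            done += 1
        else:
            sleep(60)      # check again in a minute
```

In practice one would delegate this to cron or a systemd timer rather
than a long-running loop; the sketch only illustrates the policy.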
Building on top of this idea, an advanced version of this approach
could forgo the requirement of only one single database
storage~$\mathcal{D}$. Instead of only maintaining one global database
state~$\mathcal{D}$, we suggest the use of dedicated database
instances~$\mathcal{D}_i$ for each given library~$\mathcal{L}_i$. The
advantage here is that re-creating a given database
representation~$\mathcal{D}_i$ is fast as exports~$\mathcal{E}_i$ are
comparably small. The disadvantage is that we still want to query the
whole data set~$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup
\cdots \cup \mathcal{D}_n$. This requires the development of some
cross-repository query mechanism, something GraphDB currently only
offers limited support for~\cite{graphdbnested}.
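Absent full cross-repository support in the triple store, a client-side
fallback is to evaluate the same query against each~$\mathcal{D}_i$ and
merge the results. The sketch below assumes a hypothetical
\texttt{query\_repo} interface that runs a SPARQL query against one
repository and returns a set of result rows.

```python
# Sketch: client-side union over per-library repositories D_i, as a
# fallback where the database lacks cross-repository queries.
# query_repo(repo, sparql) is an assumed interface returning a set of
# result rows for one repository.

def query_all(repositories, sparql, query_repo):
    """Evaluate sparql against each D_i and union the result rows."""
    results = set()
    for repo in repositories:
        results |= query_repo(repo, sparql)
    return results
```

Note that this only works for queries whose results distribute over
union, such as simple \texttt{SELECT} patterns; aggregates or joins
that span repositories would still require a real cross-repository
mechanism.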
\subsection{Endpoints}\label{sec:endpoints}
...
  year={2017},
  publisher={Packt Publishing Ltd}
}
@online{graphdbnested,
  title = {Nested Repositories},
  organization = {Ontotext},
  date = {2020},
  urldate = {2020-09-23},
  url = {http://graphdb.ontotext.com/documentation/standard/nested-repositories.html},
}