Commit 8ecb0765 authored by Andreas Schärtl's avatar Andreas Schärtl

write about multiple graphdb repositories

parent 491113a5
@@ -112,53 +112,65 @@ to schedule an import of a given Git repository every seven days to a
given GraphDB instance.
Automated job control that regularly imports data from the same
sources leads us to the problem of versioning. ULO
exports~$\mathcal{E}$ depend on an original third party
library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of
Collector and Importer, we get some database
sources leads us to the problem of versioning. In our current design,
multiple ULO exports~$\mathcal{E}_i$ depend on original third party
libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$ through the
workflow of Collector and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
\mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D}
\mathcal{L}_1 \rightarrow \; &\mathcal{E}_1 \rightarrow \mathcal{D} \\
\mathcal{L}_2 \rightarrow \; &\mathcal{E}_2 \rightarrow \mathcal{D} \\
&\vdots{} \\
\mathcal{L}_n \rightarrow \; &\mathcal{E}_n \rightarrow \mathcal{D}
\end{align*}
which means that if records in~$\mathcal{L}$ change, this will
probably result in different triplets~$\mathcal{E}$ which in turn
results in a need to update~$\mathcal{D}$. Finding an efficient
implementation for this problem is not trivial. As it stands,
\emph{ulo-storage} only knows about what is in~$\mathcal{E}$. While
it should be possible to find out the difference between a new version
of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and compute
the changes necessary to be applied to~$\mathcal{D}$, the big number
of triplets makes this appear unfeasible. While this is not exactly a
burning issue for \emph{ulo-storage} itself, it is a problem an
implementor of a greater tetrapodal search system will encounter. We
from $n$~individual libraries~$\mathcal{L}_i$ into a single
database storage~$\mathcal{D}$ that is used for querying.
However, mathematical knowledge is not static. When a given
library~$\mathcal{L}^{t}_i$ at revision~$t$ gets updated to a new
version~$\mathcal{L}^{t+1}_i$, this change will eventually propagate
to the associated export and result in a new set of RDF
triplets~$\mathcal{E}^{t+1}_i$. Our global database
state~$\mathcal{D}$ needs to get updated to match the changes
between~$\mathcal{E}^{t}_i$ and~$\mathcal{E}^{t+1}_i$. Finding an
efficient implementation for this problem is not trivial. While it
should be possible to compute the difference between two
exports~$\mathcal{E}^{t}_i$ and~$\mathcal{E}^{t+1}_i$ and derive the
changes that need to be applied to~$\mathcal{D}$, the large number of
triplets makes this appear infeasible. As this is a problem an
implementer of a greater tetrapodal search system will encounter, we
suggest two possible approaches to solving this problem.
One approach is to annotate each triplet in~$\mathcal{D}$ with
versioning information about which particular~$\mathcal{E}$ it was
derived from. During an import from~$\mathcal{E}$ into~$\mathcal{D}$,
we could (1)~first remove all triplets in~$\mathcal{D}$ that were
derived from a previous version of~$\mathcal{E}$ and (2)~then re-import
all triplets from the current version of~$\mathcal{E}$. Annotating
triplets with versioning information is an approach that should work,
but introduces~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$
where $n$~is the number of triplets in~$\mathcal{E}$. This does mean
versioning information about which particular
export~$\mathcal{E}^{t}_i$ it was derived from. During an import
from~$\mathcal{E}^{s}_i$ into~$\mathcal{D}$, we could (1)~first remove all
triplets in~$\mathcal{D}$ that were derived from the previous
version~$\mathcal{E}^{s-1}_i$ and (2)~then re-import all triplets from the current
version~$\mathcal{E}^{s}_i$. Annotating triplets with versioning
information is an approach that should work, but it does
introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$ where
$n$~is the number of triplets in~$\mathcal{D}$. This does mean
effectively doubling the database storage space, a not very satisfying
solution.
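The two steps can be written as a single set operation. This is only a
sketch: the notation~$\mathcal{D}[i]$ for the triplets
in~$\mathcal{D}$ annotated as originating from export~$i$, and the
name~$\mathcal{E}'_i$ for the newly imported export, are introduced
here for illustration and are not part of \emph{ulo-storage}.
\begin{align*}
    \mathcal{D}' = \left( \mathcal{D} \setminus \mathcal{D}[i] \right) \cup \mathcal{E}'_i
\end{align*}
Under this reading, no difference between two exports ever has to be
computed. The price is the bookkeeping needed to
identify~$\mathcal{D}[i]$, which is where the additional triplets come
from.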
Another approach is to regularly re-create the full data
set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
all problems related to updating existing data sets, but it does mean
additional computation requirements. It also means that changes
in~$\mathcal{L}$ take some time to propagate to~$\mathcal{D}$. An advanced
version of this approach could forgo the requirement of only one
single database storage~$\mathcal{D}$. Instead of only running one
database instance, we could decide to run dedicated database servers
for each export~$\mathcal{E}$. The advantage here is that re-creating
a database representation~$\mathcal{D}$ is fast. The disadvantage is
that we still want to query the whole data set. This requires the
development of some cross-repository query mechanism, something
GraphDB currently only offers limited support
for~\cite{graphdbnested}.
all problems related to updating existing data sets, but it does have
additional computation requirements. It also means that changes in a
given library~$\mathcal{L}_i$ take some time to propagate to~$\mathcal{D}$.
Building on this idea, an advanced version of this approach
could forgo the requirement of a single database
storage~$\mathcal{D}$. Instead of maintaining only one global database
state~$\mathcal{D}$, we suggest the use of dedicated database
instances~$\mathcal{D}_i$ for each given library~$\mathcal{L}_i$. The
advantage here is that re-creating a given database
representation~$\mathcal{D}_i$ is fast, as exports~$\mathcal{E}_i$ are
comparatively small. The disadvantage is that we still want to query the
whole data set~$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup
\cdots \cup \mathcal{D}_n$. This requires the development of some
cross-repository query mechanism, something GraphDB currently only
offers limited support for~\cite{graphdbnested}.
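To see why simply merging per-instance results is not enough, consider
a query~$Q$ evaluated on each instance separately. The following is a
general observation about monotone queries, not a statement about
GraphDB in particular, and it assumes that~$Q$ is built only from
triple patterns and joins, without negation or aggregation. In that
case we only have
\begin{align*}
    Q(\mathcal{D}_1) \cup Q(\mathcal{D}_2) \cup \cdots \cup Q(\mathcal{D}_n)
    \subseteq Q(\mathcal{D}_1 \cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n)
\end{align*}
and the inclusion can be strict when a match combines triplets from two
different instances, for example a theorem in~$\mathcal{D}_1$ that
refers to a definition in~$\mathcal{D}_2$. A cross-repository query
mechanism therefore has to evaluate joins across instances rather than
concatenate per-instance answers.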
\subsection{Endpoints}\label{sec:endpoints}