From 8ecb0765641375a18f885b4711d0fe1c4ac29106 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andreas=20Sch=C3=A4rtl?= <andreas@schaertl.me> Date: Wed, 23 Sep 2020 13:07:46 +0200 Subject: [PATCH] write about multiple graphdb repositories --- doc/report/implementation.tex | 84 ++++++++++++++++++++--------------- 1 file changed, 48 insertions(+), 36 deletions(-) diff --git a/doc/report/implementation.tex b/doc/report/implementation.tex index 3711ac0..7e09957 100644 --- a/doc/report/implementation.tex +++ b/doc/report/implementation.tex @@ -112,53 +112,65 @@ to schedule an import of a given Git repository every seven days to a given GraphDB instance. Automated job control that regularly imports data from the same -sources leads us to the problem of versioning. ULO -exports~$\mathcal{E}$ depend on an original third party -library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of -Collector and Importer, we get some database +sources leads us to the problem of versioning. In our current design, +multiple ULO exports~$\mathcal{E}_i$ depend on original third party +libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$ through the +workflow of Collector and Importer, we get some database representation~$\mathcal{D}$. We see that data flows \begin{align*} - \mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D} + \mathcal{L}_1 \rightarrow \; &\mathcal{E}_1 \rightarrow \mathcal{D} \\ + \mathcal{L}_2 \rightarrow \; &\mathcal{E}_2 \rightarrow \mathcal{D} \\ + &\vdots{} \\ + \mathcal{L}_n \rightarrow \; &\mathcal{E}_n \rightarrow \mathcal{D} \end{align*} -which means that if records in~$\mathcal{L}$ change, this will -probably result in different triplets~$\mathcal{E}$ which in turn -results in a need to update~$\mathcal{D}$. Finding an efficient -implementation for this problem is not trivial. As it stands, -\emph{ulo-storage} only knows about what is in~$\mathcal{E}$. 
While
-it should be possible to find out the difference between a new version
-of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and compute
-the changes necessary to be applied to~$\mathcal{D}$, the big number
-of triplets makes this appear unfeasible. While this is not exactly a
-burning issue for \emph{ulo-storage} itself, it is a problem an
-implementor of a greater tetrapodal serach system will encounter. We
+from $n$~individual libraries~$\mathcal{L}_i$ into a single
+database storage~$\mathcal{D}$ that is used for querying.
+
+However, mathematical knowledge is not static. When a given
+library~$\mathcal{L}^{t}_i$ at revision~$t$ gets updated to a new
+version~$\mathcal{L}^{t+1}_i$, this change will eventually propagate
+to the associated export and result in a new set of RDF
+triplets~$\mathcal{E}^{t+1}_i$. Our global database
+state~$\mathcal{D}$ needs to be updated to match the changes
+between~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$. Finding an
+efficient implementation for this problem is not trivial. While it
+should be possible to determine the difference between two
+exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and compute the
+changes necessary to be applied to~$\mathcal{D}$, the sheer number of
+triplets makes this appear infeasible. As this is a problem an
+implementer of a greater tetrapodal search system will encounter, we
 suggest two possible approaches to solving this problem.
 
 One approach is to annotate each triplet in~$\mathcal{D}$ with
-versioning information about which particular~$\mathcal{E}$ it was
-derived from. During an import from~$\mathcal{E}$ into~$\mathcal{D}$,
-we could (1)~first remove all triplets in~$\mathcal{D}$ that were
-derived from a previous version of~$\mathcal{E}$ and (2)~then re-import
-all triplets from the current version of~$\mathcal{E}$. 
Annotating
-triplets with versioning information is an approach that should work,
-but introduces~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$
-where $n$~is the number of triplets in~$\mathcal{E}$. This does mean
+versioning information about which particular
+export~$\mathcal{E}^{t}_i$ it was derived from. During an import
+from~$\mathcal{E}^{t}_i$ into~$\mathcal{D}$, we could (1)~first remove all
+triplets in~$\mathcal{D}$ that were derived from the previous version
+of~$\mathcal{E}^{t-1}_i$ and (2)~then re-import all triplets from the current
+version~$\mathcal{E}^{t}_i$. Annotating triplets with versioning
+information is an approach that should work, but it does
+introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$ where
+$n$~is the number of triplets in~$\mathcal{D}$. This does mean
 effectively doubling the database storage space, a not very satisfying
 solution.
 
 Another approach is to regularly re-create the full data
 set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
-all problems related to updating existing data sets, but it does mean
-additional computation requirements. It also means that changes
-in~$\mathcal{L}$ take some to propagate to~$\mathcal{D}$. An advanced
-version of this approach could forgo the requirement of only one
-single database storage~$\mathcal{D}$. Instead of only running one
-database instace, we could decide to run dedicated database servers
-for each export~$\mathcal{E}$. The advantage here is that re-creating
-a database representation~$\mathcal{D}$ is fast. The disadvantage is
-that we still want to query the whole data set. This requires the
-development of some cross-repository query mechanism, something
-GraphDB currently only offers limited support
-for~\cite{graphdbnested}.
+all problems related to updating existing data sets, but it does have
+additional computation requirements. It also means that changes in a
+given library~$\mathcal{L}_i$ take some time to propagate to~$\mathcal{D}$. 
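To make the annotation-based approach above concrete, the following is a minimal, hypothetical sketch of the remove-then-re-import step. It stands in no relation to the actual ulo-storage code base: an in-memory dict plays the role of the database~$\mathcal{D}$, and all export names, predicates, and triplets are invented for illustration.

```python
# Sketch of the annotation-based update (approach one). NOT part of
# ulo-storage: the database D is modeled as a dict mapping the annotation
# (export_id, version) to the set of triplets derived from that export,
# so the two-step remove/re-import can be demonstrated in memory.

def reimport(db, export_id, version, triples):
    """Import export `export_id` at revision `version` into `db`."""
    # (1) remove all triplets derived from older versions of this export
    db = {key: value for key, value in db.items()
          if key[0] != export_id or key[1] >= version}
    # (2) re-import all triplets of the current version
    db[(export_id, version)] = set(triples)
    return db

# two revisions of a toy export E_1 (identifiers are made up)
e1_t0 = {("a", "ulo:declares", "b"), ("a", "ulo:uses", "c")}
e1_t1 = {("a", "ulo:declares", "b"), ("a", "ulo:uses", "d")}

db = reimport({}, "E1", 0, e1_t0)
db = reimport(db, "E1", 1, e1_t1)  # drops the revision-0 triplets first
```

In an actual triplet store, the per-export annotation could instead be realized as one named graph per export, in which case step (1) amounts to clearing that graph before re-importing.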
+Building on top of this idea, an advanced version of this approach
+could forgo the requirement of a single database
+storage~$\mathcal{D}$. Instead of only maintaining one global database
+state~$\mathcal{D}$, we suggest the use of dedicated database
+instances~$\mathcal{D}_i$ for each given library~$\mathcal{L}_i$. The
+advantage here is that re-creating a given database
+representation~$\mathcal{D}_i$ is fast, as exports~$\mathcal{E}_i$ are
+comparatively small. The disadvantage is that we still want to query the
+whole data set~$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup
+\cdots \cup \mathcal{D}_n$. This requires the development of some
+cross-repository query mechanism, something GraphDB currently only
+offers limited support for~\cite{graphdbnested}.
 
 \subsection{Endpoints}\label{sec:endpoints}
-- 
GitLab
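The cross-repository mechanism discussed above essentially has to evaluate a query against every instance~$\mathcal{D}_i$ and merge the per-instance answers. The following is a hypothetical in-memory sketch, not a GraphDB API: each repository is a plain Python set of triplets, a "query" is a simple predicate filter, and all identifiers are invented.

```python
# Sketch of a cross-repository query over dedicated instances D_1 ... D_n.
# Each repository is modeled as a set of (subject, predicate, object)
# triplets; the answer over the whole data set D is the deduplicated
# union of the per-repository answers.

def query_union(repositories, predicate):
    """Evaluate the same triple pattern on every D_i and merge results."""
    matches = set()
    for triples in repositories:
        matches |= {t for t in triples if t[1] == predicate}
    return matches

# two toy repositories standing in for D_1 and D_2
d1 = {("a", "ulo:declares", "b")}
d2 = {("c", "ulo:declares", "d"), ("c", "ulo:uses", "a")}

declares = query_union([d1, d2], "ulo:declares")
```

Simple pattern queries distribute over the union like this; queries whose joins span repositories are the hard part, which is where dedicated federation support in the triplet store becomes necessary.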