Commit b53eed3f authored by Andreas Schärtl
report: versioning: write about multiple graphdb repositories

parent e0d9b662
...

stream as-is to an HTTP endpoint provided by our GraphDB instance.
maybe I'll also write an Importer for another DB to show that the
choice of database is not that important.)}

\subsection{Scheduling and Version Management}

Collector and Importer were implemented as library code that can be
called from various front ends. For this project, we provide both a
...

to schedule an import of a given Git repository every seven days to a
given GraphDB instance.

Automated job control that regularly imports data from the same
sources leads us to the problem of versioning. In our current design,
multiple ULO exports~$\mathcal{E}_i$ depend on original third party
libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$ through the
workflow of Collector and Importer, we get some database
representation~$\mathcal{D}$. We see that data flows
\begin{align*}
  \mathcal{L}_1 \rightarrow \; &\mathcal{E}_1 \rightarrow \mathcal{D} \\
  \mathcal{L}_2 \rightarrow \; &\mathcal{E}_2 \rightarrow \mathcal{D} \\
  &\vdots{} \\
  \mathcal{L}_n \rightarrow \; &\mathcal{E}_n \rightarrow \mathcal{D}
\end{align*}
from $n$~individual libraries~$\mathcal{L}_i$ into a single
database storage~$\mathcal{D}$ that is used for querying.

However, mathematical knowledge isn't static. When a given
library~$\mathcal{L}^{t}_i$ at revision~$t$ gets updated to a new
version~$\mathcal{L}^{t+1}_i$, this change will eventually propagate
to the associated export and result in a new set of RDF
triplets~$\mathcal{E}^{t+1}_i$. Our global database
state~$\mathcal{D}$ then needs to be updated to match the changes
between~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$. Finding an
efficient implementation for this problem is not trivial. While it
should be possible to compute the difference between two
exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and derive the
changes to apply to~$\mathcal{D}$, the large number of triplets makes
this appear unfeasible. As any implementer of a greater tetrapodal
search system will encounter this problem, we suggest two possible
approaches to solving it.
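To make the diffing step concrete, a naive delta computation over two
serialized exports might look as follows. This is a sketch under the
assumption that exports are serialized as canonical N-Triples, one
triplet per line; blank-node renaming and the sheer number of triplets
are what make this hard in practice, and this sketch handles neither.

```python
# Sketch: delta between two export versions E^t and E^{t+1}, assuming
# canonical N-Triples serialization (one triplet per line). A real RDF
# diff would also have to account for blank-node renaming, which this
# line-based comparison ignores.

def export_delta(old_export: str, new_export: str):
    """Return (to_remove, to_add) triplet sets for updating D."""
    old_triplets = {line for line in old_export.splitlines() if line.strip()}
    new_triplets = {line for line in new_export.splitlines() if line.strip()}
    to_remove = old_triplets - new_triplets  # in E^t but no longer in E^{t+1}
    to_add = new_triplets - old_triplets     # new in E^{t+1}
    return to_remove, to_add
```

Both set differences are linear in the number of triplets, so the cost
of this approach is dominated by having to materialize and compare the
full exports, which is exactly the scalability concern raised above.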
One approach is to annotate each triplet in~$\mathcal{D}$ with
versioning information about which particular
export~$\mathcal{E}^{t}_i$ it was derived from. During an import
of~$\mathcal{E}^{t}_i$ into~$\mathcal{D}$, we could (1)~first remove
all triplets in~$\mathcal{D}$ that were derived from the previous
version~$\mathcal{E}^{t-1}_i$ and (2)~then import all triplets from
the current version~$\mathcal{E}^{t}_i$. Annotating triplets with
versioning information is an approach that should work, but it does
introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$, where
$n$~is the number of triplets in~$\mathcal{D}$. This effectively
doubles the required database storage, a not very satisfying solution.
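One way to realize this version annotation without materializing extra
triplets per statement is to place each export version in its own named
graph, so that step~(1) becomes a single \texttt{DROP} and step~(2) a
batch insert. The sketch below builds the two SPARQL updates; the graph
IRI scheme is our own assumption, not something prescribed by
\emph{ulo-storage} or GraphDB.

```python
# Sketch: per-export versioning via named graphs (an assumed design,
# not part of ulo-storage). Each export version E^t_i lives in its own
# graph, so removing the previous version is a DROP and importing the
# current one is an INSERT DATA into a fresh graph.

def build_update(library: str, old_rev: str, new_rev: str, triplets):
    """Return (drop_old, insert_new) SPARQL update strings."""
    def graph(rev):
        # hypothetical graph IRI scheme: one graph per library revision
        return f"<http://example.org/graph/{library}/{rev}>"

    drop_old = f"DROP SILENT GRAPH {graph(old_rev)}"
    insert_new = (
        f"INSERT DATA {{ GRAPH {graph(new_rev)} {{\n"
        + "\n".join(triplets)
        + "\n}} }}".replace("}} }}", "} }")
    )
    return drop_old, insert_new
```

Queries that should span all versions would then need to address the
union of graphs, which SPARQL's default graph semantics or GraphDB
configuration can provide.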
Another approach is to regularly re-create the full data
set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
all problems related to updating existing data sets, but it does come
with additional computation requirements. It also means that changes
in a given library~$\mathcal{L}_i$ take some time to propagate
to~$\mathcal{D}$.
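The scheduling side of this approach is simple; a minimal sketch of a
periodic full rebuild could look like the following, where
\texttt{rebuild} stands in for the Collector and Importer library
calls (the function names and the injected clock are our illustration,
not the actual front end).

```python
import time

# Sketch: periodic full re-creation of D. rebuild() is a placeholder
# for "drop D, then re-run Collector and Importer for every E_i"; the
# seven-day period matches the interval suggested in the text.

REBUILD_PERIOD = 7 * 24 * 60 * 60  # seven days, in seconds

def run_scheduler(rebuild, now=time.time, sleep=time.sleep, iterations=None):
    """Run rebuild() every REBUILD_PERIOD seconds.

    now/sleep are injectable for testing; iterations bounds the number
    of rebuilds (None means run forever).
    """
    last_run = None
    done = 0
    while iterations is None or done < iterations:
        if last_run is None or now() - last_run >= REBUILD_PERIOD:
            rebuild()      # full re-import of all exports E_i
            last_run = now()
            done += 1
        else:
            sleep(60)      # check again in a minute
```

In practice one would delegate this to cron or a systemd timer rather
than a long-running loop; the sketch only illustrates the policy.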
Building on top of this idea, an advanced version of this approach
could forgo the requirement of only one single database
storage~$\mathcal{D}$. Instead of only maintaining one global database
state~$\mathcal{D}$, we suggest the use of dedicated database
instances~$\mathcal{D}_i$ for each given library~$\mathcal{L}_i$. The
advantage here is that re-creating a given database
representation~$\mathcal{D}_i$ is fast as exports~$\mathcal{E}_i$ are
comparably small. The disadvantage is that we still want to query the
whole data set~$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup
\cdots \cup \mathcal{D}_n$. This requires the development of some
cross-repository query mechanism, something GraphDB currently only
offers limited support for~\cite{graphdbnested}.
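Absent full cross-repository support in the triple store, a client-side
fallback is to evaluate the same query against each~$\mathcal{D}_i$ and
merge the results. The sketch below assumes a hypothetical
\texttt{query\_repo} interface that runs a SPARQL query against one
repository and returns a set of result rows.

```python
# Sketch: client-side union over per-library repositories D_i, as a
# fallback where the database lacks cross-repository queries.
# query_repo(repo, sparql) is an assumed interface returning a set of
# result rows for one repository.

def query_all(repositories, sparql, query_repo):
    """Evaluate sparql against each D_i and union the result rows."""
    results = set()
    for repo in repositories:
        results |= query_repo(repo, sparql)
    return results
```

Note that this only works for queries whose results distribute over
union, such as simple \texttt{SELECT} patterns; aggregates or joins
that span repositories would still require a real cross-repository
mechanism.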
\subsection{Endpoints}\label{sec:endpoints}
...
  year={2017},
  publisher={Packt Publishing Ltd}
}
@online{graphdbnested,
  title = {Nested Repositories},
  organization = {Ontotext},
  date = {2020},
  urldate = {2020-09-23},
  url = {http://graphdb.ontotext.com/documentation/standard/nested-repositories.html},
}