From 8ecb0765641375a18f885b4711d0fe1c4ac29106 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andreas=20Sch=C3=A4rtl?= <andreas@schaertl.me> Date: Wed, 23 Sep 2020 13:07:46 +0200 Subject: [PATCH] write about multiple graphdb repositories --- doc/report/implementation.tex | 84 ++++++++++++++++++++--------------- 1 file changed, 48 insertions(+), 36 deletions(-) diff --git a/doc/report/implementation.tex b/doc/report/implementation.tex index 3711ac0..7e09957 100644 --- a/doc/report/implementation.tex +++ b/doc/report/implementation.tex @@ -112,53 +112,65 @@ to schedule an import of a given Git repository every seven days to a given GraphDB instance. Automated job control that regularly imports data from the same -sources leads us to the problem of versioning. ULO -exports~$\mathcal{E}$ depend on an original third party -library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of -Collector and Importer, we get some database +sources leads us to the problem of versioning. In our current design, +multiple ULO exports~$\mathcal{E}_i$ depend on original third party +libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$ through the +workflow of Collector and Importer, we get some database representation~$\mathcal{D}$. We see that data flows \begin{align*} - \mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D} + \mathcal{L}_1 \rightarrow \; &\mathcal{E}_1 \rightarrow \mathcal{D} \\ + \mathcal{L}_2 \rightarrow \; &\mathcal{E}_2 \rightarrow \mathcal{D} \\ + &\vdots{} \\ + \mathcal{L}_n \rightarrow \; &\mathcal{E}_n \rightarrow \mathcal{D} \end{align*} -which means that if records in~$\mathcal{L}$ change, this will -probably result in different triplets~$\mathcal{E}$ which in turn -results in a need to update~$\mathcal{D}$. Finding an efficient -implementation for this problem is not trivial. As it stands, -\emph{ulo-storage} only knows about what is in~$\mathcal{E}$. 
While
-it should be possible to find out the difference between a new version
-of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and compute
-the changes necessary to be applied to~$\mathcal{D}$, the big number
-of triplets makes this appear unfeasible. While this is not exactly a
-burning issue for \emph{ulo-storage} itself, it is a problem an
-implementor of a greater tetrapodal serach system will encounter. We
+from $n$~individual libraries~$\mathcal{L}_i$ into a single
+database storage~$\mathcal{D}$ that is used for querying.
+
+However, mathematical knowledge is not static. When a given
+library~$\mathcal{L}^{t}_i$ at revision~$t$ gets updated to a new
+version~$\mathcal{L}^{t+1}_i$, this change will eventually propagate
+to the associated export and result in a new set of RDF
+triplets~$\mathcal{E}^{t+1}_i$. Our global database
+state~$\mathcal{D}$ needs to be updated to match the changes
+between~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$. Finding an
+efficient implementation for this problem is not trivial. While it
+should be possible to determine the difference between two
+exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and compute the
+changes necessary to be applied to~$\mathcal{D}$, the sheer number of
+triplets makes this appear infeasible. As this is a problem an
+implementer of a greater tetrapodal search system will encounter, we
 suggest two possible approaches to solving this problem.
 
 One approach is to annotate each triplet in~$\mathcal{D}$ with
-versioning information about which particular~$\mathcal{E}$ it was
-derived from. During an import from~$\mathcal{E}$ into~$\mathcal{D}$,
-we could (1)~first remove all triplets in~$\mathcal{D}$ that were
-derived from a previous version of~$\mathcal{E}$ and (2)~then re-import
-all triplets from the current version of~$\mathcal{E}$. 
Annotating
-triplets with versioning information is an approach that should work,
-but introduces~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$
-where $n$~is the number of triplets in~$\mathcal{E}$. This does mean
+versioning information about which particular
+export~$\mathcal{E}^{t}_i$ it was derived from. During an import
+from~$\mathcal{E}^{t}_i$ into~$\mathcal{D}$, we could (1)~first remove all
+triplets in~$\mathcal{D}$ that were derived from the previous version
+of~$\mathcal{E}^{t-1}_i$ and (2)~then re-import all triplets from the current
+version~$\mathcal{E}^{t}_i$. Annotating triplets with versioning
+information is an approach that should work, but it does
+introduce~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$ where
+$n$~is the number of triplets in~$\mathcal{D}$. This does mean
 effectively doubling the database storage space, a not very satisfying
 solution.
 
 Another approach is to regularly re-create the full data
 set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
-all problems related to updating existing data sets, but it does mean
-additional computation requirements. It also means that changes
-in~$\mathcal{L}$ take some to propagate to~$\mathcal{D}$. An advanced
-version of this approach could forgo the requirement of only one
-single database storage~$\mathcal{D}$. Instead of only running one
-database instace, we could decide to run dedicated database servers
-for each export~$\mathcal{E}$. The advantage here is that re-creating
-a database representation~$\mathcal{D}$ is fast. The disadvantage is
-that we still want to query the whole data set. This requires the
-development of some cross-repository query mechanism, something
-GraphDB currently only offers limited support
-for~\cite{graphdbnested}.
+all problems related to updating existing data sets, but it does have
+additional computation requirements. It also means that changes in a
+given library~$\mathcal{L}_i$ take some time to propagate to~$\mathcal{D}$. 
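To make the annotation-based approach above concrete, the following is a minimal, hypothetical sketch of the remove-then-re-import step. It stands in no relation to the actual ulo-storage code base: an in-memory dict plays the role of the database~$\mathcal{D}$, and all export names, predicates, and triplets are invented for illustration.

```python
# Sketch of the annotation-based update (approach one). NOT part of
# ulo-storage: the database D is modeled as a dict mapping the annotation
# (export_id, version) to the set of triplets derived from that export,
# so the two-step remove/re-import can be demonstrated in memory.

def reimport(db, export_id, version, triples):
    """Import export `export_id` at revision `version` into `db`."""
    # (1) remove all triplets derived from older versions of this export
    db = {key: value for key, value in db.items()
          if key[0] != export_id or key[1] >= version}
    # (2) re-import all triplets of the current version
    db[(export_id, version)] = set(triples)
    return db

# two revisions of a toy export E_1 (identifiers are made up)
e1_t0 = {("a", "ulo:declares", "b"), ("a", "ulo:uses", "c")}
e1_t1 = {("a", "ulo:declares", "b"), ("a", "ulo:uses", "d")}

db = reimport({}, "E1", 0, e1_t0)
db = reimport(db, "E1", 1, e1_t1)  # drops the revision-0 triplets first
```

In an actual triplet store, the per-export annotation could instead be realized as one named graph per export, in which case step (1) amounts to clearing that graph before re-importing.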
+Building on top of this idea, an advanced version of this approach
+could forgo the requirement of a single database
+storage~$\mathcal{D}$. Instead of only maintaining one global database
+state~$\mathcal{D}$, we suggest the use of dedicated database
+instances~$\mathcal{D}_i$ for each given library~$\mathcal{L}_i$. The
+advantage here is that re-creating a given database
+representation~$\mathcal{D}_i$ is fast, as exports~$\mathcal{E}_i$ are
+comparatively small. The disadvantage is that we still want to query the
+whole data set~$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup
+\cdots \cup \mathcal{D}_n$. This requires the development of some
+cross-repository query mechanism, something GraphDB currently only
+offers limited support for~\cite{graphdbnested}.
 
 \subsection{Endpoints}\label{sec:endpoints}
-- 
GitLab
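The cross-repository mechanism discussed above essentially has to evaluate a query against every instance~$\mathcal{D}_i$ and merge the per-instance answers. The following is a hypothetical in-memory sketch, not a GraphDB API: each repository is a plain Python set of triplets, a "query" is a simple predicate filter, and all identifiers are invented.

```python
# Sketch of a cross-repository query over dedicated instances D_1 ... D_n.
# Each repository is modeled as a set of (subject, predicate, object)
# triplets; the answer over the whole data set D is the deduplicated
# union of the per-repository answers.

def query_union(repositories, predicate):
    """Evaluate the same triple pattern on every D_i and merge results."""
    matches = set()
    for triples in repositories:
        matches |= {t for t in triples if t[1] == predicate}
    return matches

# two toy repositories standing in for D_1 and D_2
d1 = {("a", "ulo:declares", "b")}
d2 = {("c", "ulo:declares", "d"), ("c", "ulo:uses", "a")}

declares = query_union([d1, d2], "ulo:declares")
```

Simple pattern queries distribute over the union like this; queries whose joins span repositories are the hard part, which is where dedicated federation support in the triplet store becomes necessary.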