diff --git a/doc/report/implementation.tex b/doc/report/implementation.tex index 5a50b3f82a152880876e6b29b9659e522ccf8cbd..ff720ce7c775da2f67403ada50d8a5b21b42ceee 100644 --- a/doc/report/implementation.tex +++ b/doc/report/implementation.tex @@ -100,7 +100,7 @@ stream as-is to an HTTP endpoint provided by our GraphDB instance. maybe I'll also write an Importer for another DB to show that the choice of database is not that important.)} -\subsubsection{Scheduling and Version Management} +\subsection{Scheduling and Version Management} Collector and Importer were implemented as library code that can be called from various front ends. For this project, we provide both a @@ -112,30 +112,65 @@ to schedule an import of a given Git repository every seven days to a given GraphDB instance. Automated job control that regularly imports data from the same -sources leads us to the problem of versioning. ULO -exports~$\mathcal{E}$ depend on an original third party -library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of -Collector and Importer, we get some database +sources leads us to the problem of versioning. In our current design, +multiple ULO exports~$\mathcal{E}_i$ depend on original third party +libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$ through the +workflow of Collector and Importer, we get some database representation~$\mathcal{D}$. We see that data flows \begin{align*} - \mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D} + \mathcal{L}_1 \rightarrow \; &\mathcal{E}_1 \rightarrow \mathcal{D} \\ + \mathcal{L}_2 \rightarrow \; &\mathcal{E}_2 \rightarrow \mathcal{D} \\ + &\vdots{} \\ + \mathcal{L}_n \rightarrow \; &\mathcal{E}_n \rightarrow \mathcal{D} \end{align*} -which means that if records in~$\mathcal{L}$ change, this will -probably result in different triplets~$\mathcal{E}$ which in turn -results in a need to update~$\mathcal{D}$. This is non-trivial. As it -stands, \emph{ulo-storage} only knows about what is in~$\mathcal{E}$. 
-While it should be possible to find out the difference between a new -version of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and -compute the changes necessary to be applied to~$\mathcal{D}$, the big -number of triplets makes this appear unfeasible. So far, our only -suggestion to solve the problem of changing third party libraries is -to regularly re-create the full data set~$\mathcal{D}$ from scratch, -say every seven days. This circumvents all problems related to -updating existing data sets, but it does mean additional computation -requirements. It also means that changes in~$\mathcal{L}$ take some -to propagate to~$\mathcal{D}$. If the number of triplets raises -by orders of magnitude, this approach will eventually not be scalable -anymore. +from $n$~individual libraries~$\mathcal{L}_i$ into a single +database storage~$\mathcal{D}$ that is used for querying. + +However, mathematical knowledge is not static. When a given +library~$\mathcal{L}^{t}_i$ at revision~$t$ is updated to a new +version~$\mathcal{L}^{t+1}_i$, this change will eventually propagate +to the associated export and result in a new set of RDF +triplets~$\mathcal{E}^{t+1}_i$. Our global database +state~$\mathcal{D}$ needs to be updated to match the changes +between~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$. Finding an +efficient implementation for this problem is not trivial. While it +should be possible to compute the difference between two +exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and derive the +changes that need to be applied to~$\mathcal{D}$, the large number of +triplets makes this appear infeasible. As this is a problem that any +implementer of a greater tetrapodal search system will encounter, we +suggest two possible approaches to solving it. + +One approach is to annotate each triplet in~$\mathcal{D}$ with +versioning information about which particular +export~$\mathcal{E}^{t}_i$ it was derived from. 
During an import +from~$\mathcal{E}^{t+1}_i$ into~$\mathcal{D}$, we could (1)~first remove all +triplets in~$\mathcal{D}$ that were derived from the previous +version~$\mathcal{E}^{t}_i$ and (2)~then re-import all triplets from the current +version~$\mathcal{E}^{t+1}_i$. Annotating triplets with versioning +information is an approach that should work, but it does +introduce~$\mathcal{O}(m)$ additional triplets in~$\mathcal{D}$ where +$m$~is the number of triplets in~$\mathcal{D}$. This effectively +doubles the required database storage space, which is not a very +satisfying solution. + +Another approach is to regularly re-create the full data +set~$\mathcal{D}$ from scratch, say every seven days. This circumvents +all problems related to updating existing data sets, but it does impose +additional computation requirements. It also means that changes in a +given library~$\mathcal{L}_i$ take some time to propagate to~$\mathcal{D}$. +Building on top of this idea, an advanced version of this approach +could forgo the requirement of only one single database +storage~$\mathcal{D}$. Instead of only maintaining one global database +state~$\mathcal{D}$, we suggest the use of dedicated database +instances~$\mathcal{D}_i$ for each given library~$\mathcal{L}_i$. The +advantage here is that re-creating a given database +representation~$\mathcal{D}_i$ is fast as exports~$\mathcal{E}_i$ are +comparatively small. The disadvantage is that we still want to query the +whole data set~$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup +\cdots \cup \mathcal{D}_n$. This requires the development of some +cross-repository query mechanism, something GraphDB currently only +offers limited support for~\cite{graphdbnested}. 
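+
+To make the update problem discussed in this section more concrete, it
+can be stated in terms of set differences. The following is only a
+rough sketch: it assumes that each export can be treated as a plain
+set of triplets (which ignores, for example, blank nodes), and the
+auxiliary sets~$\Delta^{+}_i$ and~$\Delta^{-}_i$ are notation
+introduced here purely for illustration.
+\begin{align*}
+  \Delta^{+}_i &= \mathcal{E}^{t+1}_i \setminus \mathcal{E}^{t}_i \\
+  \Delta^{-}_i &= \mathcal{E}^{t}_i \setminus \mathcal{E}^{t+1}_i \\
+  \mathcal{D}^{\prime}_i &= (\mathcal{D}_i \setminus \Delta^{-}_i) \cup \Delta^{+}_i
+\end{align*}
+For a dedicated instance with~$\mathcal{D}_i = \mathcal{E}^{t}_i$, this
+update yields exactly~$\mathcal{D}^{\prime}_i = \mathcal{E}^{t+1}_i$.
+Applied to a shared storage~$\mathcal{D}$ instead, the same update is
+only correct under the additional assumption that no triplet
+in~$\Delta^{-}_i$ also occurs in the export of another library, as
+such a triplet would otherwise be removed even though it is still
+needed.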
\subsection{Endpoints}\label{sec:endpoints} diff --git a/doc/report/references.bib b/doc/report/references.bib index 9e99f35de0ce7ad42ca440d91cd357db1b57154c..5870c8571fb09178bf54c136a3f5b395ae5e001a 100644 --- a/doc/report/references.bib +++ b/doc/report/references.bib @@ -350,3 +350,11 @@ year={2017}, publisher={Packt Publishing Ltd} } + +@online{graphdbnested, + title = {Nested Repositories}, + organization = {Ontotext}, + date = {2020}, + urldate = {2020-09-23}, + url = {http://graphdb.ontotext.com/documentation/standard/nested-repositories.html}, +} \ No newline at end of file