diff --git a/doc/report/implementation.tex b/doc/report/implementation.tex
index 5a50b3f82a152880876e6b29b9659e522ccf8cbd..3711ac0cfb7b00f733614e1165ebbd509576183e 100644
--- a/doc/report/implementation.tex
+++ b/doc/report/implementation.tex
@@ -122,20 +122,43 @@ representation~$\mathcal{D}$. We see that data flows
 \end{align*}
 which means that if records in~$\mathcal{L}$ change, this will
 probably result in different triplets~$\mathcal{E}$ which in turn
-results in a need to update~$\mathcal{D}$. This is non-trivial. As it
-stands, \emph{ulo-storage} only knows about what is in~$\mathcal{E}$.
-While it should be possible to find out the difference between a new
-version of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and
-compute the changes necessary to be applied to~$\mathcal{D}$, the big
-number of triplets makes this appear unfeasible. So far, our only
-suggestion to solve the problem of changing third party libraries is
-to regularly re-create the full data set~$\mathcal{D}$ from scratch,
-say every seven days. This circumvents all problems related to
-updating existing data sets, but it does mean additional computation
-requirements. It also means that changes in~$\mathcal{L}$ take some
-to propagate to~$\mathcal{D}$. If the number of triplets raises
-by orders of magnitude, this approach will eventually not be scalable
-anymore.
+results in a need to update~$\mathcal{D}$. Finding an efficient
+implementation for this problem is not trivial. As it stands,
+\emph{ulo-storage} only knows about what is in~$\mathcal{E}$. While
+it should be possible to find out the difference between a new version
+of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and compute
+the changes necessary to be applied to~$\mathcal{D}$, the big number
+of triplets makes this appear unfeasible. While this is not exactly a
+burning issue for \emph{ulo-storage} itself, it is a problem an
+implementor of a greater tetrapodal search system will encounter. We
+suggest two possible approaches to solving this problem.
+
+One approach is to annotate each triplet in~$\mathcal{D}$ with
+versioning information about which particular~$\mathcal{E}$ it was
+derived from. During an import from~$\mathcal{E}$ into~$\mathcal{D}$,
+we could (1)~first remove all triplets in~$\mathcal{D}$ that were
+derived from a previous version of~$\mathcal{E}$ and (2)~then re-import
+all triplets from the current version of~$\mathcal{E}$. Annotating
+triplets with versioning information is an approach that should work,
+but introduces~$\mathcal{O}(n)$ additional triplets in~$\mathcal{D}$
+where $n$~is the number of triplets in~$\mathcal{E}$. This does mean
+effectively doubling the database storage space, a not very satisfying
+solution.
+
+Another approach is to regularly re-create the full data
+set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
+all problems related to updating existing data sets, but it does mean
+additional computation requirements. It also means that changes
+in~$\mathcal{L}$ take some time to propagate to~$\mathcal{D}$. An advanced
+version of this approach could forgo the requirement of only one
+single database storage~$\mathcal{D}$. Instead of only running one
+database instance, we could decide to run dedicated database servers
+for each export~$\mathcal{E}$. The advantage here is that re-creating
+a database representation~$\mathcal{D}$ is fast. The disadvantage is
+that we still want to query the whole data set. This requires the
+development of some cross-repository query mechanism, something
+GraphDB currently only offers limited support
+for~\cite{graphdbnested}.
 
 \subsection{Endpoints}\label{sec:endpoints}
 
diff --git a/doc/report/references.bib b/doc/report/references.bib
index 9e99f35de0ce7ad42ca440d91cd357db1b57154c..5870c8571fb09178bf54c136a3f5b395ae5e001a 100644
--- a/doc/report/references.bib
+++ b/doc/report/references.bib
@@ -350,3 +350,11 @@
   year={2017},
   publisher={Packt Publishing Ltd}
 }
+
+@online{graphdbnested,
+    title = {Nested Repositories},
+    organization = {Ontotext},
+    date = {2020},
+    urldate = {2020-09-23},
+    url = {http://graphdb.ontotext.com/documentation/standard/nested-repositories.html},
+}
\ No newline at end of file