supervision / schaertl_andreas · Commits

Commit 8ecb0765, authored 4 years ago by Andreas Schärtl

write about multiple graphdb repositories

Parent: 491113a5
Showing 1 changed file: doc/report/implementation.tex (+48 additions, −36 deletions)
@@ -112,53 +112,65 @@ to schedule an import of a given Git repository every seven days to a
 given GraphDB instance.
 
 Automated job control that regularly imports data from the same
-sources leads us to the problem of versioning. ULO
-exports~$\mathcal{E}$ depend on an original third party
-library~$\mathcal{L}$. Running~$\mathcal{E}$ through the workflow of
-Collector and Importer, we get some database
+sources leads us to the problem of versioning. In our current design,
+multiple ULO exports~$\mathcal{E}_i$ depend on original third party
+libraries~$\mathcal{L}_i$. Running~$\mathcal{E}_i$ through the
+workflow of Collector and Importer, we get some database
 representation~$\mathcal{D}$. We see that data flows
 \begin{align*}
-  \mathcal{L} \rightarrow \mathcal{E} \rightarrow \mathcal{D}
+  \mathcal{L}_1 \rightarrow \; & \mathcal{E}_1 \rightarrow \mathcal{D} \\
+  \mathcal{L}_2 \rightarrow \; & \mathcal{E}_2 \rightarrow \mathcal{D} \\
+                               & \vdots{} \\
+  \mathcal{L}_n \rightarrow \; & \mathcal{E}_n \rightarrow \mathcal{D}
 \end{align*}
-which means that if records in~$\mathcal{L}$ change, this will
-probably result in different triplets~$\mathcal{E}$, which in turn
-results in a need to update~$\mathcal{D}$. Finding an efficient
-implementation for this problem is not trivial. As it stands,
-\emph{ulo-storage} only knows about what is in~$\mathcal{E}$. While
-it should be possible to find out the difference between a new
-version of~$\mathcal{E}$ and the current version of~$\mathcal{D}$ and
-compute the changes necessary to be applied to~$\mathcal{D}$, the
-large number of triplets makes this appear unfeasible. While this is
-not exactly a burning issue for \emph{ulo-storage} itself, it is a
-problem an implementer of a greater tetrapodal search system will
-encounter. We
+from $n$~individual libraries~$\mathcal{L}_i$ into a single
+database storage~$\mathcal{D}$ that is used for querying.
+
+However, mathematical knowledge isn't static. When a given
+library~$\mathcal{L}^{t}_i$ at revision~$t$ gets updated to a new
+version~$\mathcal{L}^{t+1}_i$, this change will eventually propagate
+to the associated export and result in a new set of RDF
+triplets~$\mathcal{E}^{t+1}_i$. Our global database
+state~$\mathcal{D}$ needs to get updated to match the changes
+between~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$. Finding an
+efficient implementation for this problem is not trivial. While it
+should be possible to find out the difference between two
+exports~$\mathcal{E}^{t}_i$ and $\mathcal{E}^{t+1}_i$ and compute the
+changes necessary to be applied to~$\mathcal{D}$, the large number of
+triplets makes this appear unfeasible. As this is a problem an
+implementer of a greater tetrapodal search system will encounter, we
 suggest two possible approaches to solving this problem.
 One approach is to annotate each triplet in~$\mathcal{D}$ with
-versioning information about which particular~$\mathcal{E}$ it was
-derived from. During an import from~$\mathcal{E}$ into~$\mathcal{D}$,
-we could (1)~first remove all triplets in~$\mathcal{D}$ that were
-derived from a previous version of~$\mathcal{E}$ and (2)~then
-re-import all triplets from the current version of~$\mathcal{E}$.
-Annotating triplets with versioning information is an approach that
-should work, but introduces~$\mathcal{O}(n)$ additional triplets
-in~$\mathcal{D}$ where $n$~is the number of triplets
-in~$\mathcal{E}$. This does mean
+versioning information about which particular
+export~$\mathcal{E}^{t}_i$ it was derived from. During an import
+from~$\mathcal{E}^{s}_i$ into~$\mathcal{D}$, we could (1)~first
+remove all triplets in~$\mathcal{D}$ that were derived from the
+previous version~$\mathcal{E}^{s-1}_i$ and (2)~then re-import all
+triplets from the current version~$\mathcal{E}^{s}_i$. Annotating
+triplets with versioning information is an approach that should work,
+but it does introduce~$\mathcal{O}(n)$ additional triplets
+in~$\mathcal{D}$ where $n$~is the number of triplets
+in~$\mathcal{D}$. This does mean
 effectively doubling the database storage space, a not very
 satisfying solution.
 
 Another approach is to regularly re-create the full data
 set~$\mathcal{D}$ from scratch, say every seven days. This circumvents
-all problems related to updating existing data sets, but it does mean
-additional computation requirements. It also means that changes
-in~$\mathcal{L}$ take some time to propagate to~$\mathcal{D}$. An
-advanced version of this approach could forgo the requirement of only
-one single database storage~$\mathcal{D}$. Instead of only running
-one database instance, we could decide to run dedicated database
-servers for each export~$\mathcal{E}$. The advantage here is that
-re-creating a database representation~$\mathcal{D}$ is fast. The
-disadvantage is that we still want to query the whole data set. This
-requires the development of some cross-repository query mechanism,
-something GraphDB currently only offers limited support
-for~\cite{graphdbnested}.
+all problems related to updating existing data sets, but it does have
+additional computation requirements. It also means that changes in a
+given library~$\mathcal{L}_i$ take some time to propagate
+to~$\mathcal{D}$.
+
+Building on top of this idea, an advanced version of this approach
+could forgo the requirement of only one single database
+storage~$\mathcal{D}$. Instead of only maintaining one global
+database state~$\mathcal{D}$, we suggest the use of dedicated
+database instances~$\mathcal{D}_i$ for each given
+library~$\mathcal{L}_i$. The advantage here is that re-creating a
+given database representation~$\mathcal{D}_i$ is fast, as
+exports~$\mathcal{E}_i$ are comparably small. The disadvantage is
+that we still want to query the whole data
+set~$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup \cdots \cup
+\mathcal{D}_n$. This requires the development of some
+cross-repository query mechanism, something GraphDB currently only
+offers limited support for~\cite{graphdbnested}.
 
 \subsection{Endpoints}\label{sec:endpoints}
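
The export-diffing idea dismissed in this change can at least be prototyped. Below is a minimal sketch, assuming two revisions of an export~$\mathcal{E}_i$ have been serialized to Turtle files; the file names are hypothetical, and rdflib's set operators do the actual comparison.

# Sketch: triple-level difference between two revisions of an export E_i,
# assuming both revisions are available as Turtle files (placeholder names).
from rdflib import Graph

old = Graph().parse("export_i_t.ttl", format="turtle")    # E_i at revision t
new = Graph().parse("export_i_t1.ttl", format="turtle")   # E_i at revision t+1

# rdflib graphs behave like sets of triples; subtraction yields the
# triples present in one revision but missing from the other.
removed = old - new   # triples that would have to be deleted from D
added = new - old     # triples that would have to be inserted into D

print(f"{len(removed)} triples to remove, {len(added)} triples to add")

Note that this holds both revisions in memory and compares blank nodes purely syntactically, which is consistent with the report's judgment that diffing large exports is unfeasible at scale.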
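The annotation approach could also be realized with named graphs, which GraphDB inherits from the RDF4J framework: keeping each export~$\mathcal{E}_i$ in a dedicated graph records which export a triplet was derived from without materializing $\mathcal{O}(n)$ extra triplets. A sketch of the drop-and-reimport cycle, assuming a GraphDB repository reachable over HTTP; the endpoint URL, graph IRI, and file name are placeholders, not part of ulo-storage.

# Sketch: replace all triples previously derived from export E_i by
# (1) dropping the named graph holding the old version and
# (2) re-importing the current version into the same graph.
# Endpoint URL, graph IRI, and file name are placeholders.
import requests

STATEMENTS = "http://localhost:7200/repositories/ulo/statements"
GRAPH = "http://example.org/graphs/export-i"

# Step 1: SPARQL 1.1 Update that removes the previous version of E_i.
resp = requests.post(
    STATEMENTS,
    data=f"DROP SILENT GRAPH <{GRAPH}>",
    headers={"Content-Type": "application/sparql-update"},
)
resp.raise_for_status()

# Step 2: load the current version of E_i into the now-empty graph.
with open("export_i.ttl", "rb") as f:
    resp = requests.post(
        STATEMENTS,
        params={"context": f"<{GRAPH}>"},
        data=f,
        headers={"Content-Type": "text/turtle"},
    )
resp.raise_for_status()

Because the graph IRI itself encodes the originating export, step (1) is a single drop rather than a scan over per-triplet version annotations.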
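Finally, for the multi-repository design with per-library stores~$\mathcal{D}_i$, one candidate for the cross-repository query mechanism is SPARQL 1.1 federation via the SERVICE keyword, which GraphDB supports with the limitations noted above. A sketch that queries the union $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2$; the repository URLs and the queried predicate are illustrative placeholders.

# Sketch: query two per-library repositories D_1 and D_2 from one place
# using SPARQL 1.1 federation (SERVICE). URLs and predicate are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?subject ?name WHERE {
  { SERVICE <http://localhost:7200/repositories/library-1>
      { ?subject <http://example.org/ulo#name> ?name . } }
  UNION
  { SERVICE <http://localhost:7200/repositories/library-2>
      { ?subject <http://example.org/ulo#name> ?name . } }
}
"""

endpoint = SPARQLWrapper("http://localhost:7200/repositories/library-1")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["subject"]["value"], row["name"]["value"])

Whether such federated queries remain fast as the number of libraries $n$ grows is precisely the open question the report raises.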