# Seyfu: Versioned Structured Data

## Abstract

The use of a version control system to store open data is a good thing as it draws a clear path for reproducible science. But none, meets all the expectations. Seyfu aims to replace the use of git and make practical cooperation around the creation, maintenance, publication, storage and re-use of knowledge bases that are possibly bigger than memory. In order to achieve that, we propose a novel approach to tackle the challenge of querying versioned triples in a Direct-Acyclic-Graph (DAG). Seyfu only stores changes between versions. We introduce Seyfu Scheme implementation and compare it to existing solutions.

Keywords:

• Databases
• Version Control Systems
• Knowledge Bases
• Reproducible Science
• Scheme Programming Language

## Introduction

Resource Description Framework offers a good canvas for cooperation around open data but there is no solution that is good enough. TODO: explain what are those features required for cooperation and why those features are important. Maybe cite an article about the importance of git and git hosting solution in the context of software development. Seyfu use a novel approach to query versioned data based on a topological graph ordering of changes. Unlike Cassidy et al., Seyfu does not rely on the Theory of Patches introduced by Darcs. Seyfu use WiredTiger database storage engine, an ordered key-value store, to deliver a pragmatic ACID-compliant versioned quad store. The first part will describe the implementation of Seyfu. The second part will present benchmarks.

## Implementation

• Space and Time complexity for each algorithm
• In the spirit of Scheme programming language, Seyfu try to solve the problem using a minimalist core of powerful primitives upon which one can build abstractions to solve bigger problems. The TupleStore is such an abstraction that allows to take advantage of WiredTiger key-value store features and performance without scarifying too much expressiveness.
• This in turn allows to use the same query language to ask the metadata store about the DAG history of the version quad store and query that quad store. The query language is based a logic language embedded in Scheme called minikanren.
• Merge commits will resolve conflicts in a way that makes it possible to define a history significance measure that allows to linearize with a topological sort the Direct-Acyclic-Graph of changes. Without that, querying versions at any point is not pratical see the second part.
• Cost based query engine was dropped for the time being.

### Indices

Definition: values

Let ${\displaystyle X}$ be a set of values for which there is a binary relation ${\displaystyle \leq }$ that is a total order on ${\displaystyle X}$. Then the following statements hold for all ${\displaystyle a,b}$ and ${\displaystyle c}$ in ${\displaystyle X}$:

• Antisymmetry: If ${\displaystyle a\leq b}$ and ${\displaystyle b\leq a}$ then ${\displaystyle a=b}$;
• Transitivity: If ${\displaystyle a\leq b}$ and ${\displaystyle b\leq c}$ then ${\displaystyle a\leq c}$;
• Connexity: ${\displaystyle a\leq b}$ or ${\displaystyle b\leq a}$.

Definition: n-tuples

n-tuples are the ordered set ${\displaystyle X^{n}}$ where ${\displaystyle n\geq 1}$ ie. ${\displaystyle X^{n}=\{(\alpha _{1},\ldots ,\alpha _{n})\backslash \forall i\in [1..n]\ x_{i}\in X\}}$

Definition: knowledge base

Let ${\displaystyle K\subseteq X^{n}}$that is the subset of n-tuples that are stored in the knowledge base.

Definition: subsets

All the subsets of ${\displaystyle X}$are noted ${\displaystyle \phi (X)}$. TODO define all subsets of X

Definition: variable

Let ${\displaystyle V}$be a disjoint set of ${\displaystyle X}$ie. ${\displaystyle V\cap X=\emptyset }$ if ${\displaystyle v\in V}$then ${\displaystyle v}$ is a variable.

Definition: binding

A binding is a function ${\displaystyle b:V\longrightarrow X\cup \{\varnothing \}}$. The set of all bindings is called ${\displaystyle B}$.

Definition: pattern

A pattern is defined as the association of a template and a function as: ${\displaystyle p=\left({t \atop f}\right)}$

• ${\displaystyle t=(\pi _{1},\pi _{2},\ldots ,\pi _{n})}$ and ${\displaystyle t[i]=\pi _{i}\in V\cup X}$, the number of variables in ${\displaystyle t}$ is noted ${\displaystyle |t|=m}$.
• ${\displaystyle f:B\longrightarrow (X\cup \{\varnothing \})^{n}}$
• ${\displaystyle f(b)=(\alpha _{1},\ldots ,\alpha _{n})}$
• ${\displaystyle f(b)[i]=\alpha _{i}}$
• ${\displaystyle f(b)[i]={\begin{cases}b(\pi _{i}),&{\text{if }}\pi _{i}\in V\\\pi _{i},&{\text{if }}\pi _{i}\in X\end{cases}}}$

Definition: pattern binding

Binding ${\displaystyle b}$ is called a pattern binding of ${\displaystyle p=\left({t \atop f}\right)}$ when ${\displaystyle f(b)\in K}$

Definition: pattern bindings

Pattern bindings ${\displaystyle S}$ for ${\displaystyle p=\left({t \atop f}\right)}$ is the biggest subset of ${\displaystyle B}$ where ${\displaystyle \forall b\in S,\ f(b)\in K}$

Definition: pattern image

Given a pattern bindings ${\displaystyle S}$ for ${\displaystyle p=\left({t \atop f}\right)}$, the pattern image is defined as ${\displaystyle I=\{y{\text{ where }}\forall b\in S,\ f(b)=y\}}$

Property: a pattern image is a subset of ${\displaystyle \phi (K)}$

Proof: TODO

Definition: permutation

it is a function from S to S for which every element occurs exactly once as an image value. This is related to the rearrangement of the elements of S in which each element s is replaced by the corresponding f(s). For example, the permutation (3,1,2) is described by the function ${\displaystyle \alpha }$ defined as:

${\displaystyle \alpha (1)=3,\quad \alpha (2)=1,\quad \alpha (3)=2}$.

Definition: index

squi couvre K

An index ${\displaystyle I_{(}\epsilon _{1},\ldots ,\epsilon _{n})}$is associated with the permutation ${\displaystyle i=(\epsilon _{1},\ldots ,\epsilon _{n})}$so that ${\displaystyle \forall x\in K\Rightarrow i(x)\in I}$

Definition: prefix range

Let prefix range be a function ${\displaystyle f_{I}:P_{n}\longrightarrow \phi (K)}$

TODO: define what querying means

TODO: define what index cover pattern means

#### Property: When n>=2, there is a strict subset of all indices that cover all patterns

Property: the rotations of the base index can be part of a minimal set of covering indices

What is the goal? What are the axioms? What are the properties? What are if ... then ...?!

1. We can choose a permutation of p that is less complex
1. Let's consider a permutation ${\displaystyle s:K\longrightarrow K_{s}}$
2. TODO: how to define a permutation? is permutation always reversible?
3. Similarly there is permutation ${\displaystyle \sigma (p)=\sigma (\langle \pi _{1},\pi _{2},\ldots ,\pi _{n}\rangle )=\langle \omega _{1},\omega _{2},\ldots ,\omega _{n}\rangle =h}$ where ${\displaystyle \omega _{i}\in K_{s}}$ or is a variable
4. when ${\displaystyle p(\alpha _{1},\ldots ,\alpha _{m})=y}$ and ${\displaystyle h(\beta _{1},\ldots ,\beta _{m})=z}$ we can do ${\displaystyle s(z)=y}$
5. we can choose the permutation of p that is the less complex
6. We can permute the pattern so that the operation is less complex
2. Where is the n-permutations of [1..n]
3. Where is the product of Bool^n with [1..n]

#### Examples

All rotations of the 3-tuple (subject, predicate, object) is a solution:

• (subject, predicate, object) aka. spo ie. rotation 0 of the 3-tuple
• (predicate, object, subject) aka. pos ie. rotation 1 of the 3-tuple
• (object, subject, predicate) aka. osp ie. rotation 2 of the 3-tuple

The remaining permutations are ['sop', 'pso', 'ops']. All of their prefixes have a permutation that is prefix of one the index:

• s prefix of sop is also prefix of the index spo
• p prefix of pso is also prefix of the index pos
• o prefix of ops is also prefix of the index osp
• so prefix of sop has os as permutation which is a prefix of the index osp
• ps prefix of pso has sp as permutation which is a prefix of the index spo
• op prefix of ops has po as permutation which is a prefix of the index pos

All rotations of the 4-tuple (graph, subject, predicate, object) plus two permutations that have a prefix that is not the permutation of existing solution prefixes:

• (graph, subject, predicate, object) aka. gspo ie. rotation 0 of the 4-tuple
• (subject, predicate, object, graph) aka. spog ie. rotation 1 of the 4-tuple
• (predicate, object, graph, subject) aka. pogs ie. rotation 2 of the 4-tuple
• (object, graph, subject, predicate) aka. ogsp ie. rotation 3 of the 4-tuple
• (graph, predicate, subject, object) aka. gpsp
• (subject, object, graph, predicate) aka. sogp

#### Problem

TODO: what is the (formal?) problem

#### Python solution

 1 from itertools import permutations
2
3
4 def indices(tab):
5     # Compute all the required columns permutations that will
6     # allow to query solutions for every kind of patterns.
7
8     # Pattern kinds are the product of N booleans where N is the
9     # number of columns.
10
11     # At last I find out that, to be able to minimize the number of
12     # indices and while avoiding a manual join against the commit
13     # table, all indices must at least have one prefix that is not the
14     # permutations of other indices perfixes. It leads to the creation
15     # of a number of tables. Wiredtiger does try to cache (all?!) keys
16     # from tables. So, it's a space / cpu tradeoff because a manual
17     # join would be much smaller but would consume less memory and
18     # disk space.
19
20     # All rotations are solutions
21     solutions = [(tab[i:] + tab[:i]) for i in range(len(tab))]
22
23     # Other solutions are permutations
24     candidates = ["".join(x) for x in permutations(tab) if "".join(x) not in solutions]
25
26     for candidate in candidates:
27         # At least one prefix must not be the permutation of the
28         # prefixes of an existing solution
29         prefixes = [candidate[0:i] for i in range(1, len(candidate))]
30
31         for prefix in prefixes:
32             # Check whether permutations of prefix are not an existing prefix
33             for pattern in ["".join(x) for x in permutations(prefix)]:
34                 if any(solution.startswith(pattern) for solution in solutions):
35                     break  # this prefix has a solution, try next prefix
36             else:
37                 # no permutation of prefix is the prefix of an existing solution
38                 solutions.append(candidate)
39                 break
40
41     return solutions


#### Scheme solution

 1 (define-library (indices)
2   (export indices)
3   (export permutations)
4
5   (import (scheme base))
6   (import (scheme list))
7
8   (begin
9
10     (define (remove* i lst)
11       (remove (lambda (x) (= x i)) lst))
12
13     (define (permutations lst)
14       ;; taken from https://stackoverflow.com/q/23635911/140837
15       (if (= (length lst) 1)
16           (list lst)
17           (apply append (map (lambda (i)
18                                (map (lambda (j) (cons i j)) (permutations (remove* i lst))))
19                              lst))))
20
21     (define (shift lst index)
22       (append (drop lst index) (take lst index)))
23
24     (define (rotations lst)
25       (reverse! (map (lambda (index) (shift lst index)) (iota (length lst)))))
26
27     (define (prefixes lst)
28       (let loop ((out '())
29                  (indices (cdr (iota (length lst)))))
30         (if (null? indices)
31             out
32             (loop (cons (take lst (car indices)) out)
33                   (cdr indices)))))
34
35     (define (prefix? lst other)
36       (let loop ((lst lst)
37                  (other other))
38         (if (null? lst)
39             #t
40             (if (= (car lst) (car other))
41                 (loop (cdr lst) (cdr other))
42                 #f))))
43
44     (define (indices lst)
45       ;; Compute all the required columns permutations that will allow
46       ;; to do queries for every kind of patterns
47       ;;
48       ;; Pattern kinds are the product of N booleans where N is the
49       ;; number of columns
50       ;;
51       ;; At last I find out that, to be able to minimize the number of
52       ;; indices and while avoiding a manual join against the commit
53       ;; table, all indices must at least have one prefix that is not the
54       ;; permutations of other indices perfixes. It leads to the creation
55       ;; of a number of tables. Wiredtiger does try to cache (all?!) keys
56       ;; from tables. So, it's a space / cpu tradeoff because a manual
57       ;; join would be much smaller but would consume less memory and
58       ;; disk space.
59
60       ;; A candidate is a solution if at least one its prefix
61       ;; permutation is not a prefix of an existing solution
62       (let loop1 ((out (rotations lst)) ;; all rotations are solutions
63                   (candidates (permutations lst))) ;; other solutions are permutations
64         (if (null? candidates)
65             (reverse! out)
66             (let ((candidate (car candidates)))
67               ;; look for a prefix that has a permutation that is
68               ;; not prefix of an existing solution
69               (let loop2 ((prefixes* (prefixes candidate)))
70                 (if (null? prefixes*)
71                     ;; none of the prefixes of candidate has a new pattern
72                     (loop1 out (cdr candidates))
73                     (let loop3 ((patterns (permutations (car prefixes*))))
74                       (if (null? patterns)
75                           ;; none of the patterns of the current prefix is new
76                           (loop1 (cons candidate out) (cdr candidates))
77                           ;; try to find a solution that has pattern as prefix
78                           (let loop4 ((solutions out))
79                             (if (null? solutions)
80                                 ;; TODO
81                                 (loop3 (cdr patterns))
82                                 (if (prefix? (car patterns) (car solutions))
83                                     (loop2 (cdr prefixes*))
84                                     ;; try another solution
85                                     (loop4 (cdr solutions)))))))))))))
86     ))


## Benchmarks

There is three existing solutions: R&WBase, R43ples and Quit Store. Those are the only solution with an implementation that is:

• distributed in the sens of version control systems. That is that allows cooperation behavior enabled by branching and merging features,
• semantically compliant in the sens that it implements versioning in a way that the commit is understandable by machines with a query.

The benchmark will try to quantitatively compare those software regarding several aspects:

• Memory storage performance
• Disk storage performance
• Time required to access different versions

Those results are compared to baselines namely git and virtuoso.

The benchmark baseline is Virtuoso X.Y. https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples

Linux kernel git repository is exported as quadruples and loaded it into the solutions. The benchmark baseline is performances of git because it is the industry standard for versioning source code. Git is also the most common solution used to track changes in open data. It will provide a baseline for assessing the performance usability of Seyfu regarding the cooperation features.