# User:Iamamz3/Seyfu: Versioned Structured Data

## Contents

# Seyfu: Versioned Structured Data[edit]

This is a research project at Wikiversity. |

## Abstract[edit]

The use of a version control system to store open data is a good thing as it draws a clear path for reproducible science. But none, meets all the expectations. *Seyfu* aims to replace the use of git and make practical cooperation around the creation, maintenance, publication, storage and re-use of knowledge bases that are possibly bigger than memory. In order to achieve that, we propose a novel approach to tackle the challenge of querying versioned triples in a Direct-Acyclic-Graph (DAG). *Seyfu* only stores changes between versions. We introduce *Seyfu* Scheme implementation and compare it to existing solutions.

**Keywords :**

*Databases**Version Control Systems**Knowledge Bases**Reproducible Science**Scheme Programming Language*

## Introduction[edit]

Resource Description Framework offers a good canvas for cooperation around open data but there is no solution that is good enough. *TODO: explain what are those features required for cooperation and why those features are important. Maybe cite an article about the importance of git and git hosting solution in the context of software development*. *Seyfu* use a novel approach to query versioned data based on a topological graph ordering of changes. Unlike Cassidy et al., Seyfu does not rely on the Theory of Patches introduced by Darcs. Seyfu use WiredTiger database storage engine, an ordered key-value store, to deliver a pragmatic ACID-compliant versioned quad store. The first part will describe the implementation of *Seyfu*. The second part will present benchmarks.

## Background[edit]

### Resource Description Framework[edit]

### Version Control Systems[edit]

### Scheme programming language[edit]

### Ordered key-value stores[edit]

## Implementation[edit]

- Space and Time complexity for each algorithm
- In the spirit of Scheme programming language, Seyfu try to solve the problem using a minimalist core of powerful primitives upon which one can build abstractions to solve bigger problems. The
`TupleStore`

is such an abstraction that allows to take advantage of WiredTiger key-value store features and performance without scarifying too much expressiveness. - This in turn allows to use the same query language to ask the metadata store about the DAG history of the version quad store and query that quad store. The query language is based a logic language embedded in Scheme called minikanren.
- TODO: implement optional and union, see https://github.com/Swirrl/matcha/commit/c8d21c1ec54020ec1f0cc002d4ccaefca1106cbf

- Merge commits will resolve conflicts in a way that makes it possible to define a
*history significance*measure that allows to linearize with a topological sort the Direct-Acyclic-Graph of changes. Without that, querying versions at any point is not pratical see the second part. - Cost based query engine was dropped for the time being.

### Indices[edit]

**Definition:** values

Let be a set of values for which there is a binary relation that is a total order on . Then the following statements hold for all and in :

- Antisymmetry: If and then ;
- Transitivity: If and then ;
- Connexity: or .

**Definition:** n-tuples

n-tuples are the ordered set where ie.

**Definition:** knowledge base

Let that is the subset of n-tuples that are stored in the knowledge base.

**Definition:** subsets

All the subsets of are noted . TODO define all subsets of X

**Definition:** variable

Let be a disjoint set of ie. if then is a variable.

**Definition:** binding

A binding is a function . The set of all bindings is called .

**Definition:** pattern

A pattern is defined as the association of a template and a function as:

- and , the number of variables in is noted .

**Definition:** pattern binding

Binding is called a *pattern binding* of when

**Definition:** pattern bindings

*Pattern bindings* for is the biggest subset of where

**Definition:** pattern image

Given a pattern bindings for , the pattern image is defined as

**Property:** a pattern image is a subset of

**Proof: TODO**

Definition: permutation

it is a function from *S* to *S* for which every element occurs exactly once as an image value. This is related to the rearrangement of the elements of *S* in which each element *s* is replaced by the corresponding *f*(*s*). For example, the permutation (3,1,2) is described by the function defined as:

- .

**Definition:** index

__squi couvre K__

__An index is associated with the permutation so that __

**Definition:** prefix range

__Let prefix range be a function __

TODO: define what querying means

TODO: define what index cover pattern means

#### Property: All indices cover all patterns[edit]

#### Property: When n>=2, there is a strict subset of all indices that cover all patterns[edit]

**Property: the rotations of the base index can be part of a minimal set of covering indices**

What is the goal? What are the axioms? What are the properties? What are if ... then ...?!

- We can choose a permutation of p that is less complex
- Let's consider a permutation
- TODO: how to define a permutation? is permutation always reversible?
- Similarly there is permutation where or is a variable
- when and we can do
- we can choose the permutation of p that is the less complex
- We can permute the pattern so that the operation is less complex

- Where is the
**n-permutations**of [1..n] - Where is the
**product**of Bool^n with [1..n]

#### Examples[edit]

All rotations of the 3-tuple `(subject, predicate, object)`

is a solution:

`(subject, predicate, object)`

aka.`spo`

ie. rotation 0 of the 3-tuple`(predicate, object, subject)`

aka.`pos`

ie. rotation 1 of the 3-tuple`(object, subject, predicate)`

aka.`osp`

ie. rotation 2 of the 3-tuple

The remaining permutations are `['sop', 'pso', 'ops']`

. All of their prefixes have a permutation that is prefix of one the index:

`s`

prefix of`sop`

is also prefix of the index`spo`

`p`

prefix of`pso`

is also prefix of the index`pos`

`o`

prefix of`ops`

is also prefix of the index`osp`

`so`

prefix of`sop`

has`os`

as permutation which is a prefix of the index`osp`

`ps`

prefix of`pso`

has`sp`

as permutation which is a prefix of the index`spo`

`op`

prefix of`ops`

has`po`

as permutation which is a prefix of the index`pos`

All rotations of the 4-tuple (graph, subject, predicate, object) plus two permutations that have a prefix that is not the permutation of existing solution prefixes:

`(graph, subject, predicate, object)`

aka.`gspo`

ie. rotation 0 of the 4-tuple`(subject, predicate, object, graph)`

aka.`spog`

ie. rotation 1 of the 4-tuple`(predicate, object, graph, subject)`

aka.`pogs`

ie. rotation 2 of the 4-tuple`(object, graph, subject, predicate)`

aka.`ogsp`

ie. rotation 3 of the 4-tuple`(graph, predicate, subject, object)`

aka.`gpsp`

`(subject, object, graph, predicate)`

aka.`sogp`

#### Problem[edit]

TODO: what is the (formal?) problem

#### Python solution[edit]

```
1 from itertools import permutations
2
3
4 def indices(tab):
5 # Compute all the required columns permutations that will
6 # allow to query solutions for every kind of patterns.
7
8 # Pattern kinds are the product of N booleans where N is the
9 # number of columns.
10
11 # At last I find out that, to be able to minimize the number of
12 # indices and while avoiding a manual join against the commit
13 # table, all indices must at least have one prefix that is not the
14 # permutations of other indices perfixes. It leads to the creation
15 # of a number of tables. Wiredtiger does try to cache (all?!) keys
16 # from tables. So, it's a space / cpu tradeoff because a manual
17 # join would be much smaller but would consume less memory and
18 # disk space.
19
20 # All rotations are solutions
21 solutions = [(tab[i:] + tab[:i]) for i in range(len(tab))]
22
23 # Other solutions are permutations
24 candidates = ["".join(x) for x in permutations(tab) if "".join(x) not in solutions]
25
26 for candidate in candidates:
27 # At least one prefix must not be the permutation of the
28 # prefixes of an existing solution
29 prefixes = [candidate[0:i] for i in range(1, len(candidate))]
30
31 for prefix in prefixes:
32 # Check whether permutations of prefix are not an existing prefix
33 for pattern in ["".join(x) for x in permutations(prefix)]:
34 if any(solution.startswith(pattern) for solution in solutions):
35 break # this prefix has a solution, try next prefix
36 else:
37 # no permutation of prefix is the prefix of an existing solution
38 solutions.append(candidate)
39 break
40
41 return solutions
```

#### Scheme solution[edit]

```
1 (define-library (indices)
2 (export indices)
3 (export permutations)
4
5 (import (scheme base))
6 (import (scheme list))
7
8 (begin
9
10 (define (remove* i lst)
11 (remove (lambda (x) (= x i)) lst))
12
13 (define (permutations lst)
14 ;; taken from https://stackoverflow.com/q/23635911/140837
15 (if (= (length lst) 1)
16 (list lst)
17 (apply append (map (lambda (i)
18 (map (lambda (j) (cons i j)) (permutations (remove* i lst))))
19 lst))))
20
21 (define (shift lst index)
22 (append (drop lst index) (take lst index)))
23
24 (define (rotations lst)
25 (reverse! (map (lambda (index) (shift lst index)) (iota (length lst)))))
26
27 (define (prefixes lst)
28 (let loop ((out '())
29 (indices (cdr (iota (length lst)))))
30 (if (null? indices)
31 out
32 (loop (cons (take lst (car indices)) out)
33 (cdr indices)))))
34
35 (define (prefix? lst other)
36 (let loop ((lst lst)
37 (other other))
38 (if (null? lst)
39 #t
40 (if (= (car lst) (car other))
41 (loop (cdr lst) (cdr other))
42 #f))))
43
44 (define (indices lst)
45 ;; Compute all the required columns permutations that will allow
46 ;; to do queries for every kind of patterns
47 ;;
48 ;; Pattern kinds are the product of N booleans where N is the
49 ;; number of columns
50 ;;
51 ;; At last I find out that, to be able to minimize the number of
52 ;; indices and while avoiding a manual join against the commit
53 ;; table, all indices must at least have one prefix that is not the
54 ;; permutations of other indices perfixes. It leads to the creation
55 ;; of a number of tables. Wiredtiger does try to cache (all?!) keys
56 ;; from tables. So, it's a space / cpu tradeoff because a manual
57 ;; join would be much smaller but would consume less memory and
58 ;; disk space.
59
60 ;; A candidate is a solution if at least one its prefix
61 ;; permutation is not a prefix of an existing solution
62 (let loop1 ((out (rotations lst)) ;; all rotations are solutions
63 (candidates (permutations lst))) ;; other solutions are permutations
64 (if (null? candidates)
65 (reverse! out)
66 (let ((candidate (car candidates)))
67 ;; look for a prefix that has a permutation that is
68 ;; not prefix of an existing solution
69 (let loop2 ((prefixes* (prefixes candidate)))
70 (if (null? prefixes*)
71 ;; none of the prefixes of candidate has a new pattern
72 (loop1 out (cdr candidates))
73 (let loop3 ((patterns (permutations (car prefixes*))))
74 (if (null? patterns)
75 ;; none of the patterns of the current prefix is new
76 (loop1 (cons candidate out) (cdr candidates))
77 ;; try to find a solution that has pattern as prefix
78 (let loop4 ((solutions out))
79 (if (null? solutions)
80 ;; TODO
81 (loop3 (cdr patterns))
82 (if (prefix? (car patterns) (car solutions))
83 (loop2 (cdr prefixes*))
84 ;; try another solution
85 (loop4 (cdr solutions)))))))))))))
86 ))
```

## Benchmarks[edit]

There is three existing solutions: R&WBase, R43ples and Quit Store. Those are the only solution with an implementation that is:

*distributed*in the sens of version control systems. That is that allows cooperation behavior enabled by branching and merging features,*semantically compliant*in the sens that it implements versioning in a way that the commit is understandable by machines with a query.

The benchmark will try to *quantitatively* compare those software regarding several aspects:

- Memory storage performance
- Disk storage performance
- Time required to access different versions

Those results are compared to baselines namely git and virtuoso.

### Task 1: Wikidata[edit]

The benchmark baseline is Virtuoso X.Y. https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples

### Task 2: Linux[edit]

Linux kernel git repository is exported as quadruples and loaded it into the solutions. The benchmark baseline is performances of git because it is the industry standard for versioning source code. Git is also the most common solution used to track changes in open data. It will provide a baseline for assessing the performance usability of *Seyfu* regarding the cooperation features.

#### Sub-Task 2.0: Clone[edit]

#### Sub-Task 2.1: Checkout[edit]

#### Sub-Task 2.2: Log[edit]

#### Sub-Task 2.3: Show[edit]

#### Sub-Task 2.4: Merge[edit]

#### Sub-Task 2.4: Commit[edit]

#### Sub-Task 2.4: Revert[edit]

## Conclusion[edit]

hello, world!