User:Iamamz3/Seyfu: Versioned Structured Data
- 1 Seyfu: Versioned Structured Data
- 1.1 Abstract
- 1.2 Introduction
- 1.3 Background
- 1.4 Implementation
- 1.5 Benchmarks
- 1.6 Conclusion
- 1.7 Bibliography
Seyfu: Versioned Structured Data
|This is a research project at Wikiversity.|
The use of a version control system to store open data is a good thing as it draws a clear path for reproducible science. But none, meets all the expectations. Seyfu aims to replace the use of git and make practical cooperation around the creation, maintenance, publication, storage and re-use of knowledge bases that are possibly bigger than memory. In order to achieve that, we propose a novel approach to tackle the challenge of querying versioned triples in a Direct-Acyclic-Graph (DAG). Seyfu only stores changes between versions. We introduce Seyfu Scheme implementation and compare it to existing solutions.
- Version Control Systems
- Knowledge Bases
- Reproducible Science
- Scheme Programming Language
Resource Description Framework offers a good canvas for cooperation around open data but there is no solution that is good enough. TODO: explain what are those features required for cooperation and why those features are important. Maybe cite an article about the importance of git and git hosting solution in the context of software development. Seyfu use a novel approach to query versioned data based on a topological graph ordering of changes. Unlike Cassidy et al., Seyfu does not rely on the Theory of Patches introduced by Darcs. Seyfu use WiredTiger database storage engine, an ordered key-value store, to deliver a pragmatic ACID-compliant versioned quad store. The first part will describe the implementation of Seyfu. The second part will present benchmarks.
Resource Description Framework
Version Control Systems
Scheme programming language
Ordered key-value stores
- Space and Time complexity for each algorithm
- In the spirit of Scheme programming language, Seyfu try to solve the problem using a minimalist core of powerful primitives upon which one can build abstractions to solve bigger problems. The
TupleStoreis such an abstraction that allows to take advantage of WiredTiger key-value store features and performance without scarifying too much expressiveness.
- This in turn allows to use the same query language to ask the metadata store about the DAG history of the version quad store and query that quad store. The query language is based a logic language embedded in Scheme called minikanren.
- TODO: implement optional and union, see https://github.com/Swirrl/matcha/commit/c8d21c1ec54020ec1f0cc002d4ccaefca1106cbf
- Merge commits will resolve conflicts in a way that makes it possible to define a history significance measure that allows to linearize with a topological sort the Direct-Acyclic-Graph of changes. Without that, querying versions at any point is not pratical see the second part.
- Cost based query engine was dropped for the time being.
Let be a set of values for which there is a binary relation that is a total order on . Then the following statements hold for all and in :
- Antisymmetry: If and then ;
- Transitivity: If and then ;
- Connexity: or .
n-tuples are the ordered set where ie.
Definition: knowledge base
Let that is the subset of n-tuples that are stored in the knowledge base.
All the subsets of are noted . TODO define all subsets of X
Let be a disjoint set of ie. if then is a variable.
A binding is a function . The set of all bindings is called .
A pattern is defined as the association of a template and a function as:
- and , the number of variables in is noted .
Definition: pattern binding
Binding is called a pattern binding of when
Definition: pattern bindings
Pattern bindings for is the biggest subset of where
Definition: pattern image
Given a pattern bindings for , the pattern image is defined as
Property: a pattern image is a subset of
it is a function from S to S for which every element occurs exactly once as an image value. This is related to the rearrangement of the elements of S in which each element s is replaced by the corresponding f(s). For example, the permutation (3,1,2) is described by the function defined as:
squi couvre K
An index is associated with the permutation so that
Definition: prefix range
Let prefix range be a function
TODO: define what querying means
TODO: define what index cover pattern means
Property: All indices cover all patterns
Property: When n>=2, there is a strict subset of all indices that cover all patterns
Property: the rotations of the base index can be part of a minimal set of covering indices
What is the goal? What are the axioms? What are the properties? What are if ... then ...?!
- We can choose a permutation of p that is less complex
- Let's consider a permutation
- TODO: how to define a permutation? is permutation always reversible?
- Similarly there is permutation where or is a variable
- when and we can do
- we can choose the permutation of p that is the less complex
- We can permute the pattern so that the operation is less complex
- Where is the n-permutations of [1..n]
- Where is the product of Bool^n with [1..n]
All rotations of the 3-tuple
(subject, predicate, object) is a solution:
(subject, predicate, object)aka.
spoie. rotation 0 of the 3-tuple
(predicate, object, subject)aka.
posie. rotation 1 of the 3-tuple
(object, subject, predicate)aka.
ospie. rotation 2 of the 3-tuple
The remaining permutations are
['sop', 'pso', 'ops']. All of their prefixes have a permutation that is prefix of one the index:
sopis also prefix of the index
psois also prefix of the index
opsis also prefix of the index
osas permutation which is a prefix of the index
spas permutation which is a prefix of the index
poas permutation which is a prefix of the index
All rotations of the 4-tuple (graph, subject, predicate, object) plus two permutations that have a prefix that is not the permutation of existing solution prefixes:
(graph, subject, predicate, object)aka.
gspoie. rotation 0 of the 4-tuple
(subject, predicate, object, graph)aka.
spogie. rotation 1 of the 4-tuple
(predicate, object, graph, subject)aka.
pogsie. rotation 2 of the 4-tuple
(object, graph, subject, predicate)aka.
ogspie. rotation 3 of the 4-tuple
(graph, predicate, subject, object)aka.
(subject, object, graph, predicate)aka.
TODO: what is the (formal?) problem
1 from itertools import permutations 2 3 4 def indices(tab): 5 # Compute all the required columns permutations that will 6 # allow to query solutions for every kind of patterns. 7 8 # Pattern kinds are the product of N booleans where N is the 9 # number of columns. 10 11 # At last I find out that, to be able to minimize the number of 12 # indices and while avoiding a manual join against the commit 13 # table, all indices must at least have one prefix that is not the 14 # permutations of other indices perfixes. It leads to the creation 15 # of a number of tables. Wiredtiger does try to cache (all?!) keys 16 # from tables. So, it's a space / cpu tradeoff because a manual 17 # join would be much smaller but would consume less memory and 18 # disk space. 19 20 # All rotations are solutions 21 solutions = [(tab[i:] + tab[:i]) for i in range(len(tab))] 22 23 # Other solutions are permutations 24 candidates = ["".join(x) for x in permutations(tab) if "".join(x) not in solutions] 25 26 for candidate in candidates: 27 # At least one prefix must not be the permutation of the 28 # prefixes of an existing solution 29 prefixes = [candidate[0:i] for i in range(1, len(candidate))] 30 31 for prefix in prefixes: 32 # Check whether permutations of prefix are not an existing prefix 33 for pattern in ["".join(x) for x in permutations(prefix)]: 34 if any(solution.startswith(pattern) for solution in solutions): 35 break # this prefix has a solution, try next prefix 36 else: 37 # no permutation of prefix is the prefix of an existing solution 38 solutions.append(candidate) 39 break 40 41 return solutions
1 (define-library (indices) 2 (export indices) 3 (export permutations) 4 5 (import (scheme base)) 6 (import (scheme list)) 7 8 (begin 9 10 (define (remove* i lst) 11 (remove (lambda (x) (= x i)) lst)) 12 13 (define (permutations lst) 14 ;; taken from https://stackoverflow.com/q/23635911/140837 15 (if (= (length lst) 1) 16 (list lst) 17 (apply append (map (lambda (i) 18 (map (lambda (j) (cons i j)) (permutations (remove* i lst)))) 19 lst)))) 20 21 (define (shift lst index) 22 (append (drop lst index) (take lst index))) 23 24 (define (rotations lst) 25 (reverse! (map (lambda (index) (shift lst index)) (iota (length lst))))) 26 27 (define (prefixes lst) 28 (let loop ((out '()) 29 (indices (cdr (iota (length lst))))) 30 (if (null? indices) 31 out 32 (loop (cons (take lst (car indices)) out) 33 (cdr indices))))) 34 35 (define (prefix? lst other) 36 (let loop ((lst lst) 37 (other other)) 38 (if (null? lst) 39 #t 40 (if (= (car lst) (car other)) 41 (loop (cdr lst) (cdr other)) 42 #f)))) 43 44 (define (indices lst) 45 ;; Compute all the required columns permutations that will allow 46 ;; to do queries for every kind of patterns 47 ;; 48 ;; Pattern kinds are the product of N booleans where N is the 49 ;; number of columns 50 ;; 51 ;; At last I find out that, to be able to minimize the number of 52 ;; indices and while avoiding a manual join against the commit 53 ;; table, all indices must at least have one prefix that is not the 54 ;; permutations of other indices perfixes. It leads to the creation 55 ;; of a number of tables. Wiredtiger does try to cache (all?!) keys 56 ;; from tables. So, it's a space / cpu tradeoff because a manual 57 ;; join would be much smaller but would consume less memory and 58 ;; disk space. 59 60 ;; A candidate is a solution if at least one its prefix 61 ;; permutation is not a prefix of an existing solution 62 (let loop1 ((out (rotations lst)) ;; all rotations are solutions 63 (candidates (permutations lst))) ;; other solutions are permutations 64 (if (null? candidates) 65 (reverse! out) 66 (let ((candidate (car candidates))) 67 ;; look for a prefix that has a permutation that is 68 ;; not prefix of an existing solution 69 (let loop2 ((prefixes* (prefixes candidate))) 70 (if (null? prefixes*) 71 ;; none of the prefixes of candidate has a new pattern 72 (loop1 out (cdr candidates)) 73 (let loop3 ((patterns (permutations (car prefixes*)))) 74 (if (null? patterns) 75 ;; none of the patterns of the current prefix is new 76 (loop1 (cons candidate out) (cdr candidates)) 77 ;; try to find a solution that has pattern as prefix 78 (let loop4 ((solutions out)) 79 (if (null? solutions) 80 ;; TODO 81 (loop3 (cdr patterns)) 82 (if (prefix? (car patterns) (car solutions)) 83 (loop2 (cdr prefixes*)) 84 ;; try another solution 85 (loop4 (cdr solutions))))))))))))) 86 ))
There is three existing solutions: R&WBase, R43ples and Quit Store. Those are the only solution with an implementation that is:
- distributed in the sens of version control systems. That is that allows cooperation behavior enabled by branching and merging features,
- semantically compliant in the sens that it implements versioning in a way that the commit is understandable by machines with a query.
The benchmark will try to quantitatively compare those software regarding several aspects:
- Memory storage performance
- Disk storage performance
- Time required to access different versions
Those results are compared to baselines namely git and virtuoso.
Task 1: Wikidata
The benchmark baseline is Virtuoso X.Y. https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples
Task 2: Linux
Linux kernel git repository is exported as quadruples and loaded it into the solutions. The benchmark baseline is performances of git because it is the industry standard for versioning source code. Git is also the most common solution used to track changes in open data. It will provide a baseline for assessing the performance usability of Seyfu regarding the cooperation features.
Sub-Task 2.0: Clone
Sub-Task 2.1: Checkout
Sub-Task 2.2: Log
Sub-Task 2.3: Show
Sub-Task 2.4: Merge
Sub-Task 2.4: Commit
Sub-Task 2.4: Revert