Big Data/CumulusRDF

CumulusRDF is a distributed RDF store that keeps its RDF triples in the key-value store Apache Cassandra.

Each RDF triple consists of a subject (S), a property (P) and an object (O). Several RDF triples form a graph in which each P is a labelled directed edge starting at S and leading to O. The graph can be queried using eight triple patterns, the building blocks of basic graph patterns (BGPs): each of the three positions is either bound to a value or left as a variable (?), which gives 2 × 2 × 2 = 8 combinations. To answer these queries efficiently, CumulusRDF provides three indices: SPO, POS and OSP. The following table shows which index is used to answer each triple pattern:

Triple Pattern    Index
(spo)             SPO, POS, OSP
(sp?)             SPO
(?po)             POS
(s?o)             OSP
(?p?)             POS
(s??)             SPO
(??o)             OSP
(???)             SPO, POS, OSP
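
The lookup in the table can be sketched in Python as follows. This is only an illustration, not CumulusRDF's own code: the names `choose_index` and `INDEX_FOR_BOUND` are hypothetical, an unbound position (?) is represented by `None`, and where the table lists all three indices the sketch arbitrarily picks SPO.

```python
from typing import Optional, Tuple

# A triple pattern is (s, p, o); None marks an unbound position ("?").
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Keys say which of (s, p, o) are bound; values follow the table.
# Assumption: when the table allows any index, SPO is chosen.
INDEX_FOR_BOUND = {
    (True, True, True): "SPO",     # (spo)
    (True, True, False): "SPO",    # (sp?)
    (False, True, True): "POS",    # (?po)
    (True, False, True): "OSP",    # (s?o)
    (False, True, False): "POS",   # (?p?)
    (True, False, False): "SPO",   # (s??)
    (False, False, True): "OSP",   # (??o)
    (False, False, False): "SPO",  # (???)
}


def choose_index(pattern: Pattern) -> str:
    """Return the name of an index that can answer the given triple pattern."""
    bound = tuple(term is not None for term in pattern)
    return INDEX_FOR_BOUND[bound]
```

In each row of the table, the bound positions come first in the chosen index order (for example, s and o are the leading components of OSP for the pattern (s?o)), so the query can be answered by a lookup on those leading components.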

CumulusRDF supports two storage representations for RDF triples of the form (s, p, o):

Hierarchical Layout
{ s : { p : { o : - } } }, { o : { s : { p : - } } } and { p : { o : { s : - } } } are inserted.
Flat Layout
{ s : { po : - } }, { o : { sp : - } }, { po : { s : - } } and { po : { 'p' : p } } are inserted.
The third entry uses the property concatenated with the object (po) as its key, rather than the property alone, because Apache Cassandra stores all entries with the same key on the same data node. Since some properties, such as rdf:type, are used very often, using p alone as the key would lead to an unbalanced load distribution.
The fourth entry is required to find all triples with the same property, i.e., to answer the pattern (?p?). It is used in a secondary index that maps each value p to all keys (po) in which this value occurs.
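
The two layouts can be illustrated with a small in-memory model in Python. This is a sketch under stated assumptions, not CumulusRDF's implementation: plain dictionaries stand in for Cassandra rows, the class and method names are hypothetical, the separator `SEP` used to concatenate terms is an arbitrary choice, and the secondary index is modelled as an explicit dictionary from p to the set of po keys.

```python
from collections import defaultdict

SEP = "\x00"  # hypothetical separator for concatenating two terms into one key


def _nested():
    # key -> key -> key -> value, with inner dictionaries created on demand
    return defaultdict(lambda: defaultdict(dict))


class HierarchicalLayout:
    """Models { s : { p : { o : - } } }, { o : { s : { p : - } } }, { p : { o : { s : - } } }."""

    def __init__(self):
        self.spo, self.osp, self.pos = _nested(), _nested(), _nested()

    def insert(self, s, p, o):
        self.spo[s][p][o] = None
        self.osp[o][s][p] = None
        self.pos[p][o][s] = None  # every triple with property p lands under one key


class FlatLayout:
    """Models { s : { po : - } }, { o : { sp : - } }, { po : { s : - } }, { po : { 'p' : p } }."""

    def __init__(self):
        self.spo = defaultdict(dict)
        self.osp = defaultdict(dict)
        self.pos = defaultdict(dict)
        self.p_index = defaultdict(set)  # stand-in for the secondary index: p -> {po}

    def insert(self, s, p, o):
        po, sp = p + SEP + o, s + SEP + p
        self.spo[s][po] = None
        self.osp[o][sp] = None
        self.pos[po][s] = None
        # Fourth entry. Assumption: no subject is literally named "p".
        self.pos[po]["p"] = p
        self.p_index[p].add(po)

    def match_property(self, p):
        """Answer (?p?): visit every po key recorded for p, then list its subjects."""
        for po in sorted(self.p_index.get(p, ())):
            o = po.split(SEP, 1)[1]
            for column in self.pos[po]:
                if column != "p":  # skip the marker column of the fourth entry
                    yield (column, p, o)
```

If several triples that use rdf:type with different objects are inserted, the hierarchical model collects them all under the single key rdf:type in its POS structure, whereas the flat model spreads them over one key per (property, object) combination and still finds them again through `match_property`.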
