How to store RDF data on the SAFE Network


#1

Even that the title suggests this is very specific to the SAFE Network, and how to implement an RDF store on it, this is in my belief a problem that many others could also face when trying to store RDF triples on any type of key-value storage.

I’ll be explaining the problem from the SAFE Network perspective since I believe there are details that are important to consider.

Background

A few months ago we thought that we needed the SAFE Network to have RDF data to be a first-class citizen, and we started researching and doing some PoC implementations of APIs/utilities which allow us and devlopers to store application’s data on the network as RDF data.

Based on the currently available data types on the SAFE Network, the question that I’m trying to bring in here with this post is simple: What's the best way to store mutable RDF triples on a key-value storage?

When I say what’s the best way, I mean, the best way to make it efficient not only for storing them, but also for querying the graphs (queries which may be trying to filter by certain subject or predicate as an example) minimising the network traffic needed for it.

I’ll be trying here to leave out of the discussion how the API for this would look like, just to focus on the topic of serialisation of the RDF docs/graphs to store them on the network. And note it’s important to make this very clear, we want interoperability with other systems, but we need to decide how to serialise the RDF data to store it regardless how the API consumers will send/retrieve the RDF docs.

Data types currently available

In the SAFE Network there are currently two storage data types:

  • ImmutableData: a blob stored at a predictable location which is based on the hash of its contents.
  • MutableData: a key-value storage type which allows independent mutations of each key-value pair, as well as fetching a key-value pair by providing a key.

Storing Immutable RDFs

Storing immutable RDF documents is straight forward, since the document can be stored as an ImmutableData object on the network, and the user decides which serialisation format to use, i.e. Turtle, JSON-LD, etc.

Storing Mutable RDFs

When it comes to store mutable RDFs is when we need to think of several options, and where we are trying to get more input/ideas for. Since as I explained above, the only way to have mutable documents on the SAFE Network is by storing them as MutableData objects, which is a key-value storage, at least that’s what it’s currently available.

The SAFE APIs could actually store the data in any non standard RDF serialisation format, as long as the API serialises/deserialises the data when it’s being written/fetched, then from a developers perspective using that API it’s transparent and it’s ok, he/she will be seeing RDF docs going in and coming out.
However, since that’s just one of the APIs, other developers may decide to bypass it and access the data stored on the network using the lower level APIs, in which case they will see data which is not an standard serialised RDF at all. This is why we thought we wanted not just APIs which can help developers with RDF, but we also want the data to be stored using any of the standard serialisation formats (Turtle, JSON-LD, etc.).

But there is one more thing we considered, even storing the data in one of the standard serialisation formats, if I (as a developer) am trying to read/write it using a programming language or framework where I don’t have RDF utilities/libraries available to parse that type of serialisation, I’m then also in the same situation as the scenario above, and I’ll have to write my own parser and utility to read and make queries on it.
This is why we thought that JSON-LD provided an advantage, from this aspect, since we can assume everyone knows how to treat JSON objects and there is a vast amount of libraries and utilities available in all type of programming languages to parse and manage them.

So what we have on the table is:

  1. A key-value storage for our serialised RDFs
  2. We want to minimise the network traffic when making queries which may result in only a subset of the graphs contained in a single RDF doc
  3. We want to store the RDF using any of the standard serialisation formats
  4. We want developers to be able to parse and read the RDF data even if they don’t have access to any RDF lib/utility/API, just to help in the process of adoption of RDF.

For a separate discussion, on top of all this, we want to expose APIs/utilities which make it easy for developers to manage RDF data, but which also provide interoperability with any other systems, including Solid apps.
For example, by supporting several serialisation formats for import/export RDF data, like in the case of Solid where Turtle is mainly to be supported as per what I understood from Tim?

PoC implementation

At this point it’s probably already apparent the path we took to implement our PoC, so trying to be brief here and to detail a bit of it, this is what we’ve done:

  • Since for this PoC we wanted to demonstrate scenarios where we needed to mutate RDF docs, we decided to store them on MutableData objects (this is related to item #1 above). These scenarios are:
  • We wanted to make sure in the future we are able to fetch only some subgraphs from a single RDF doc from the network, we thought that storing each subgraph as a key-value pair would be beneficial since we can send a request to fetch a subset of graphs to the network (this is related to item #2 above).
  • Since the MutableData object is a key-value storage, and we put the subgraph id as the key, then we thought that having the mapped value to be stored as JSON-LD would cover the scenario decribed above in item #4.
  • We used rdflib.js library since that gave us the caabilities to serialise/deserialise RDF docs in several formats, and gives us also some querying mechanisms we can start playing with. We exposed an RDF API which makes use of this library plus the functionality to store/retrieve the triples to/from the SAFE Network.

We also created some utilities to abstract the CRUD operations for WebIDs, but again, I’m keeping this out of this topic.

Just as an example, the following image is a snapshot of some RDF data stored on the SAFE Network using our PoC. In this case three MutableData’s used for a WebID profile doc published on a human readable safe:// URL. Each rounded box is an entry in its corresponding MutableData object. The strings on the left of each rounded box are the entries’ keys, and the content of the rounded boxes are the entries’ values.

I hope it’s clear enough to trigger the discussion. Again, the intention here is to share details of what we’ve done, the considerations we took when implementing our PoC, and hopefully get some feedback, ideas, suggestions to consider in what regards to the aspects explained herein.


Solid backend examples/reference implementations
#2

Reading about minimum self-contained graphs (MSGs) in this paper about RDFSync, about subgraphs here and RDF molecules.

In the last paper about RDF molecules:

If an RDF graph is free of blank nodes, each triple constitutes a molecule; otherwise, triples are grouped into molecules according to the background ontology and the decomposition algorithm. In figure 2, each graph is a two-triple molecule.

image

What should be the minimum unit that we store on the network?
Should this be an immutable unit?
Eventually, all data will be immutable on the network, so mutability will consist of replacing a resource with another.
An RDF molecule, according to above quoted definition, could be stored as ImmutableData on the network and anything larger can be composed with other RDF molecules.
How could Mutable Data be used to contain the composition of RDF molecules? Is it a suitable data type for this purpose?


#3

RDFLib seems the most likely model for us since RDFLib uses a key/value store for persistent storage.

Possibly combined with Hexastore and/or HDT binary format to improve performance.

Applications that need to query a graph will first be needing to compose a graph from retrieved RDF documents on the network.
It seems like performance will be primarily dependent on the time it takes to retrieve documents from the network. Is this in practice not that big of a deal or should we be focused on improving this point?

That question led me into interesting reading about Storing RDF as a graph.
Any one playing around with this? Especially implementing an OODB on a peer-to-peer network?

[UPDATE]
Found this paper on implementing an OODB on a peer-to-peer network.