HDT for dynamic and small graphs


#1

Grateful to have learned about HDT today from @bergos.
What I’m seeing is its use for very large archived data sets, on the order of millions to billions of triples.

Are there any SOLID apps out there creating relatively small graphs, say a simple user profile model containing a small number of triples, that update frequently and are stored in HDT format?


#2

The only extra steps for an application would be to decompress the .hdt resource to .ttl or .nt in order to update it, then compress it back to .hdt for storage.
That process could easily be abstracted away in an API operation.
What are the pieces I might be missing?
How expensive would this be CPU-wise?
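To make the cycle described above concrete, here is a minimal sketch of how the decode, update, re-encode steps could sit behind a single API call. The `decode`/`encode` pair is hypothetical, and zlib is only a stand-in so the example runs; it is not HDT, and a real HDT library would supply its own functions. The example URIs are made up as well.

```python
import zlib
from typing import Callable, List

# Hypothetical codec pair. zlib is only a stand-in so the sketch runs;
# it is NOT HDT. A real HDT library would provide the actual encode/decode.
def decode(blob: bytes) -> List[str]:
    """Unpack stored bytes into a list of N-Triples lines."""
    return zlib.decompress(blob).decode("utf-8").splitlines()

def encode(lines: List[str]) -> bytes:
    """Pack N-Triples lines back into the stored binary form."""
    return zlib.compress("\n".join(lines).encode("utf-8"))

def update_resource(blob: bytes, edit: Callable[[List[str]], List[str]]) -> bytes:
    """Decode the resource, apply the edit to its triples, and re-encode it."""
    return encode(edit(decode(blob)))

# A tiny user profile graph (made-up example data).
profile = [
    '<https://example.org/alice#me> <http://xmlns.com/foaf/0.1/name> "Alice" .',
]
new_triple = '<https://example.org/alice#me> <http://xmlns.com/foaf/0.1/nick> "al" .'

stored = encode(profile)
stored = update_resource(stored, lambda lines: lines + [new_triple])
result = decode(stored)
```

Every update pays for a full decode and a full encode of the resource, so the CPU cost grows with the size of the graph rather than the size of the change. Timing `update_resource` with an actual HDT codec in place of the stand-in would be one way to answer the cost question.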


#3

To get the file size benefit of HDT for smaller graphs as well, the format would need some small changes, like skipping or reducing the header and skipping the index (which would already be possible). I wrote my ideas down in a small gist. If somebody would like to work on this topic, I would join for JavaScript and C++/Arduino implementations.


#4

@bergos - I’m also interested in a binary serialization for smaller graphs (especially for IoT use cases, as yours seems to be as well).

So the other day, I came across a really interesting paper, Towards a Binary Object Notation for RDF. It seems to have been deleted from the conference upload folder, but I found a Google cache of it.

It basically does a literature survey of various approaches to compact binary RDF serializations, including HDT, and then runs a bunch of experiments to test them. I think they settle on JSON-LD-to-CBOR encoding, but with some sort of additional step (a dictionary, like HDT? It wasn’t clear).

I’m wondering - do you happen to know the authors? Should we reach out to them to talk?


#5

Forgive me if it’s a very silly question: is the reason you want a binary serialisation for IoT that the data being transferred is assumed to be mostly binary, e.g. a video camera stream? I’m thinking that in IoT there would be devices generating non-binary data, which can work with text-based serialisations just as efficiently as with binary ones, right?


#6

I know an older paper from that group and I talked with one of the authors. In the older paper they simply assumed that the parties talking to each other use a known namespace, and didn’t count the namespace overhead in the comparison for their own format. But they used the full HDT header, including metadata and the dictionary. Excluding the HDT dictionary would be very similar to skipping the namespace, but they didn’t do it. I only had a quick look at the new paper, but it looks like they haven’t changed their approach.

Siemens always pushed EXI in the WoT group. It’s interesting that they have a patent which describes the benefit of using EXI directly as a store. That was never mentioned on the W3C mailing list, so I’m not sure if it’s covered by the W3C patent policy. If not, implementing a .match method directly on top of incoming data, without first parsing it into an internal data structure, would violate the patent.

IMHO you should not take papers on that topic from that group seriously. I’m open to fair comparisons, but the numbers in these papers should just be ignored.

Some thoughts about HDT vs. CBOR encoded JSON-LD:

  • The algorithms of HDT are much better suited to namespaced triple data, and I expect better results for HDT in a fair comparison.
  • The benefit of CBOR encoded JSON-LD could be a lower code complexity for encoding.
  • Generating an efficient JSON-LD context could get as complex as generating an HDT dictionary.
  • Parsing CBOR encoded JSON-LD could get very complex because of the different possible forms of JSON-LD.
  • Parsing could be simplified by defining expanded form as the only valid form, but that would have negative effects on the size.
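
To illustrate the dictionary idea behind this comparison, here is a minimal sketch of dictionary encoding for triples: each distinct term is stored once and the triples become tuples of integer IDs. This is a simplified illustration only, not the actual HDT layout; the function names and the example data are assumptions made for the sketch.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]
IdTriple = Tuple[int, int, int]

def dictionary_encode(triples: List[Triple]) -> Tuple[List[str], List[IdTriple]]:
    """Store each distinct term once and replace it in the triples by its ID."""
    # Sorting the terms makes the dictionary deterministic.
    terms = sorted({term for triple in triples for term in triple})
    ids = {term: i for i, term in enumerate(terms)}
    encoded = [(ids[s], ids[p], ids[o]) for s, p, o in triples]
    return terms, encoded

def dictionary_decode(terms: List[str], encoded: List[IdTriple]) -> List[Triple]:
    """Look every ID up in the dictionary to recover the original triples."""
    return [(terms[s], terms[p], terms[o]) for s, p, o in encoded]

# Two triples sharing one subject (made-up example data).
graph = [
    ("https://example.org/alice#me", "http://xmlns.com/foaf/0.1/name", '"Alice"'),
    ("https://example.org/alice#me", "http://xmlns.com/foaf/0.1/nick", '"al"'),
]

terms, encoded = dictionary_encode(graph)
restored = dictionary_decode(terms, encoded)
```

The long URIs live only in the dictionary, so the ID triples stay compact no matter how often a term repeats. For a very small graph, though, the dictionary itself is most of the payload, which is why it matters whether a comparison counts it or not.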