Migrating centrally-stored user data to Solid Pods

schmudde · March 23, 2023, 9:33am

Are there any case studies for this? We are building features that we would like to migrate down the line. Are there best practices? Should we adopt a triple store database today to make things easier tomorrow?

Smag0 · March 23, 2023, 7:46pm

Hi, good question. I think the first thing to decide is what kind of data you want to put on the user’s pod. Is there any interaction between users? Between users’ data?

schmudde · March 24, 2023, 4:31pm

We would like to store personal attributes as they relate to organizations. For example:

Create: my username for forum.solidproject.org is schmudde.
Read: What is my username for forum.solidproject.org? ⇒ schmudde.

Updates are immutable:

My username was schmudde at forum.solidproject.org from 2/January/2023-24/March/2023.
My username is david.schmudde at forum.solidproject.org from 25/March/2023-present.
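
One way to picture these immutable updates is as a list of time-bounded facts where nothing is ever overwritten. The following is only a minimal sketch: the `Fact` class, its field names, the `"me"` subject, and the `value_as_of` helper are all hypothetical and not part of any Solid or Datomic API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """One immutable assertion about an attribute, valid over a date range."""
    subject: str                      # hypothetical identifier for the user
    organization: str                 # e.g. "forum.solidproject.org"
    attribute: str                    # e.g. "username"
    value: str
    valid_from: date
    valid_to: Optional[date] = None   # None means "present"


def value_as_of(facts, organization, attribute, on):
    """Return the value that was valid on the given date, or None."""
    for f in facts:
        if (f.organization == organization
                and f.attribute == attribute
                and f.valid_from <= on
                and (f.valid_to is None or on <= f.valid_to)):
            return f.value
    return None


# An "update" appends a new fact; the old one is kept as history.
facts = [
    Fact("me", "forum.solidproject.org", "username", "schmudde",
         date(2023, 1, 2), date(2023, 3, 24)),
    Fact("me", "forum.solidproject.org", "username", "david.schmudde",
         date(2023, 3, 25)),
]

print(value_as_of(facts, "forum.solidproject.org", "username", date(2023, 2, 1)))
print(value_as_of(facts, "forum.solidproject.org", "username", date(2023, 4, 1)))
```

The read from the example above ("What is my username?") becomes a lookup for today's date, while older values stay queryable.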

Databases like Datomic are a natural fit for this kind of data. But since we want to federate this data, it’s probably not the easiest path.

schmudde · May 5, 2023, 1:30pm

I’m surprised there isn’t more public information and interest on this topic. I would hope there are many companies with the same desire: migrating user data off our servers and onto personal Pods.

Absent any discussion or precedent for best practices, we have created our own solution. I’ll share it here in case others want to build an app today with an eye towards Solid Pods in the future.

We already use PostgreSQL, so the simplest and most portable solution was to store data that follows a publicly developed schema (such as FOAF) as JSONB.

According to Designing JSON Documents in the PostgreSQL documentation, updating could get tricky:

Although storing large documents is practicable, keep in mind that any update acquires a row-level lock on the whole row. Consider limiting JSON documents to a manageable size in order to decrease lock contention among updating transactions.

This is a bit of a concern because we currently want to store history as a series of events (see my previous post). I’m also a bit blind to any practical concerns related to event-driven data stored on Pods, but it’s a risk we can take because our business logic can also tolerate a simplified view of this particular data.

Section 8.14.2 of the PostgreSQL documentation (Designing JSON Documents) also had some other advice I found helpful:

The structure is typically unenforced (though enforcing some business rules declaratively is possible), but having a predictable structure makes it easier to write queries that usefully summarize a set of “documents” (datums) in a table.

As @RubenVerborgh notes in Let’s talk about pods, Pods are currently document-centric. So I also like that this approach with JSONB may grant us more flexibility when migrating from a data-centric database model to a model of documents that contain structured semantic data.

At least that’s my thinking. Even though we have already started implementation, I would gladly welcome more conversation with folks that have grappled with (or even thought deeply about) this problem.

jeffz · May 6, 2023, 9:03pm

I suggest you take this topic up on https://app.gitter.im/#/room/#solid_chat:gitter.im. There are definitely people working on it, but they may not be participating in the forum. My suggestion, possibly not too relevant, is that you study up on the difference between JSON and JSON-LD: with the former, you’ll need more conversion steps before it becomes usable as linked data.
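
To make the JSON versus JSON-LD difference concrete, here is a rough sketch. It assumes the FOAF vocabulary mentioned earlier in the thread. The `to_json_ld` helper, the sample values, and the `https://example.org/profile#me` identifier are made up for illustration.

```python
import json

# Plain JSON: the keys are just strings with no globally defined meaning.
plain = {"nick": "schmudde", "name": "A. Person"}  # placeholder values

# JSON-LD: a @context maps each key to an IRI, here from the FOAF vocabulary.
FOAF = "http://xmlns.com/foaf/0.1/"
context = {"nick": FOAF + "nick", "name": FOAF + "name"}


def to_json_ld(doc, context, subject_iri):
    """Wrap a plain JSON object as JSON-LD by attaching a @context and an @id.

    Hypothetical helper: it only adds the two keywords and copies the data.
    """
    return {"@context": context, "@id": subject_iri, **doc}


linked = to_json_ld(plain, context, "https://example.org/profile#me")
print(json.dumps(linked, indent=2))
```

If the stored JSONB already uses keys that line up with a public vocabulary, the conversion step can be as small as attaching a context. Otherwise each key needs an explicit mapping first.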

michielbdejong · May 10, 2023, 2:50pm

Hi @schmudde!

migrating from a data-centric database model to a model of documents that contain structured semantic data

Can you give a more concrete example of what you mean by this?

I guess the main concern when you go from Postgres to Solid would be that you can no longer query across documents. This may be manageable if you don’t have too much data (e.g. less than 1 MB) per user.

You need to anticipate the queries you will want to do and then either:

load the whole dataset from a limited number of documents (e.g. < 100) on the pod
use the path tree to structure your search (e.g. the folder documents/2023/05/ may contain documents from May 2023)
use indexes (files that contain pointers to where data can be found, similar to how a database engine would index its data)
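
The last two options can be sketched roughly as follows. This is an in-memory illustration only: `container_for`, `build_index`, the `.ttl` file names, and the "login"/"rename" facets are hypothetical, and on a real pod the index would itself be stored as a file.

```python
from datetime import date


def container_for(day: date, root: str = "documents/") -> str:
    """Map a date to a month-level folder path, e.g. documents/2023/05/."""
    return f"{root}{day.year:04d}/{day.month:02d}/"


def build_index(docs):
    """Build an index: facet value -> list of document paths.

    `docs` is an iterable of (path, facet_value) pairs.
    """
    index = {}
    for path, facet in docs:
        index.setdefault(facet, []).append(path)
    return index


docs = [
    (container_for(date(2023, 5, 10)) + "a.ttl", "login"),
    (container_for(date(2023, 5, 11)) + "b.ttl", "rename"),
    (container_for(date(2023, 6, 1)) + "c.ttl", "login"),
]
index = build_index(docs)

# A time-based query only needs to list one folder ...
print(container_for(date(2023, 5, 10)))
# ... and a facet-based query only needs to read the index.
print(index["login"])
```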

I would be happy to have a video call to brainstorm about your specific requirements, and see if I can help you find your way around this!

schmudde · May 11, 2023, 4:15pm

Thanks @michielbdejong.

In the back of my head, I’m still thinking about a proper graph database. It’s probably what we would choose if we didn’t want more flexibility in the future.

But now that we’re kicking this off in Postgres, I think the migration is actually easier than had we started from a graph db. That’s one thing I’d like someone’s informed opinion on.

This is the actual concern looking ahead. In the link above, I gave a trivial example:

I’m not worried about moving this to a document. However, we also endeavor to record a history of actions taken on the platform with rich semantic content. This could get a) large and b) difficult to plan for. A traditional graph database would easily allow us to query this history of a particular action across multiple domains or query which actions occurred within a specific time.

timbl · May 12, 2023, 1:30pm

If you haven’t seen this discussed, it may be because lots of people are doing it, but for all kinds of cases and with all kinds of systems of different shapes and sizes. So it is hard to answer the general question. A few random points:

Think about how you would store the data in the user’s home folder in a Unix system. Look at, e.g., how Apple does it on a Mac in iPhoto libraries, AddressBooks, etc. Small database files become RDF files in a pod, indexes become RDF files, and other media are just stored as files.
Think about how the data grows. If it is constant per user but the number of users grows, then use a folder per user. If it grows constantly with time, then folders by year/month/day may be good.
If some parts of the data are write-once, then consider putting them in append-only resources which can be made unwritable later.
You can add indexes which store data by a particular facet for speed, especially for immutable data.
Does SolidOS store data like that? See how they do it.
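
As a rough illustration of the date-based folders and append-only resources mentioned above, here is an in-memory sketch. `AppendOnlyLog`, `day_path`, `record`, and the `events/` root are hypothetical names. On a real pod, making a resource unwritable would be done through the server's access control rather than a flag in application code.

```python
from datetime import datetime


class AppendOnlyLog:
    """In-memory stand-in for an append-only resource that can be sealed."""

    def __init__(self):
        self._entries = []
        self._sealed = False

    def append(self, entry):
        if self._sealed:
            raise PermissionError("log is sealed; no further writes allowed")
        self._entries.append(entry)

    def seal(self):
        """Make the log unwritable from now on."""
        self._sealed = True

    def entries(self):
        return tuple(self._entries)


def day_path(ts: datetime, root: str = "events/") -> str:
    """Map a timestamp to a per-day resource path under year/month/day."""
    return f"{root}{ts:%Y/%m/%d}/log.ttl"


logs = {}


def record(ts: datetime, entry: str) -> str:
    """Append an entry to the log for that day, creating it if needed."""
    path = day_path(ts)
    logs.setdefault(path, AppendOnlyLog()).append(entry)
    return path


path = record(datetime(2023, 5, 12, 13, 30), "renamed account")
logs[path].seal()   # the day is over: freeze that day's log

try:
    logs[path].append("late write")
except PermissionError as err:
    print("rejected:", err)
```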

rather random points off the top of my head.

schmudde · May 16, 2023, 6:53pm

Thanks! These are some good tips.

Great. We do have some more sophisticated append-only timestamp data that we could treat this way. This would sit alongside some simpler files which contain Personally Identifying Information.

I’ll dig into the practices of SolidOS. I started off by looking at the GitHub repository and taking a look at their Pod.

I’ll continue to share our practice as it develops.
