Strategy for "large" Datasets?

I’m getting an async-lock timed out error for a Dataset that was stored locally on disk as 1200 JSON files totalling 0.5 megabytes (I previously said 5 megabytes, as I was misinterpreting du output). In Solid I put all the files into one Dataset.

I was wondering what design patterns people use to handle this. The application is a microblogging site, and each file has a unique id, title, content and some other fields. All the files need to be available while the application is in use: several hundred can appear in a single view at a time, since the app offers various knowledge graph views through which you interact with the components. It also lets you link between components (files), so to support a search function they all need to be present locally in the user’s browser.

At the moment I don’t know what to do. Splitting the single Dataset into arbitrary datasets would limit the maximum size of any one Dataset, but it also adds a lot of complexity to ensure that a post / item is always in exactly one chunk, and that its ID is never duplicated in another chunk or missing entirely. It would therefore rely on things like atomic commits, which Solid does not yet provide out of the box, so I would have to set up some kind of journal.

An example of the data:

| id  | field | value |
| --- | ----- | ----- |
| wc1 | json  | `{"id":"wc1","title":"@@wc2 has adopted structured data","description":"@@wc2 on boarded...","created_at":"2021-02-11T12:02:34.098Z"...}` |
| wc1 | title | @@wc22325919551149920 has adopted structured data |
| wc2 | json  | `{"id":"wc2","title":"xyz entity name","description":"This group is interesting in using structured data.","created_at":"2021-02-11T12:02:34.098Z",...}` |
| wc2 | title | xyz entity name |

I really want to use Solid but I think this might be a deal breaker for now :slightly_frowning_face: Thank you so much again for any time or advice you might have. :crossed_fingers: we can find a solution :slightly_smiling_face:

**Follow up:** I’ve filed a bug, as I cannot delete the file to try to replace it.

To start with, it’s worth noting that the performance of Node Solid Server (used by solidcommunity.net) can be improved on, so it shouldn’t be considered indicative; that’s part of the reason the community server is being developed.

Generally, I think this would be a relatively common task with Solid. Each file would probably be assigned a unique, immutable URL, e.g. https://my.pod/microblog/w1

Each file would be linked to from an index file. The files would also be listed in the container https://my.pod/microblog/, but it’s better practice to use an appropriate predicate to describe the semantic relationship of each blogpost to the blog as a whole.
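
A minimal sketch of such an index using @inrupt/solid-client (the index URL, the `#blog` fragment and the schema.org predicates are placeholder choices, not something the spec mandates):

```typescript
import {
  buildThing,
  createSolidDataset,
  createThing,
  saveSolidDatasetAt,
  setThing,
} from "@inrupt/solid-client";
import { fetch } from "@inrupt/solid-client-authn-browser";

// Hypothetical locations; adjust to your own pod layout.
const blogIndexUrl = "https://my.pod/microblog/index";
const postUrls = [
  "https://my.pod/microblog/w1",
  "https://my.pod/microblog/w2",
];

async function writeIndex() {
  // One Thing representing the blog, linking to each post with a chosen
  // predicate (schema.org's hasPart is used here purely as an example).
  let blog = buildThing(createThing({ url: `${blogIndexUrl}#blog` }))
    .addStringNoLocale("https://schema.org/name", "My microblog");
  for (const postUrl of postUrls) {
    blog = blog.addUrl("https://schema.org/hasPart", postUrl);
  }

  let dataset = createSolidDataset();
  dataset = setThing(dataset, blog.build());
  await saveSolidDatasetAt(blogIndexUrl, dataset, { fetch });
}
```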

Ideally cross references within blogposts should be encoded with semantic triples too. You could get this fairly easily by using json-ld and specifying an appropriate context for your existing json.
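
For instance, a context along these lines could map your existing fields onto shared vocabularies; the terms below are illustrative assumptions about your data, not a required vocabulary:

```typescript
// A hypothetical JSON-LD context for the existing post fields.
const postContext = {
  "@context": {
    id: "@id",
    title: "https://schema.org/name",
    description: "https://schema.org/description",
    created_at: {
      "@id": "https://schema.org/dateCreated",
      "@type": "http://www.w3.org/2001/XMLSchema#dateTime",
    },
    // A cross-reference like "@@wc2" could become a real link between resources.
    references: { "@id": "https://schema.org/mentions", "@type": "@id" },
  },
};

// Applying the context to one of the existing post objects:
const post = {
  ...postContext,
  id: "https://my.pod/microblog/wc1",
  title: "@@wc2 has adopted structured data",
  created_at: "2021-02-11T12:02:34.098Z",
  references: ["https://my.pod/microblog/wc2"],
};
```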

The app would query the index file and then fetch all of the files. This does mean we’re talking about 1200 GET requests.
This has been discussed a fair bit. One issue is server performance, but as noted, Node Solid Server is not a good reference here. A second point is that HTTP/2 reduces the number of round trips for these requests if the client and server both support it (again, Node Solid Server does not).
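
A rough sketch of that read path, assuming the index shape from the earlier snippet:

```typescript
import { getSolidDataset, getThing, getUrlAll } from "@inrupt/solid-client";
import { fetch } from "@inrupt/solid-client-authn-browser";

// Hypothetical index location and predicate, matching the sketch above.
const blogIndexUrl = "https://my.pod/microblog/index";

async function loadAllPosts() {
  const index = await getSolidDataset(blogIndexUrl, { fetch });
  const blog = getThing(index, `${blogIndexUrl}#blog`);
  const postUrls = blog ? getUrlAll(blog, "https://schema.org/hasPart") : [];

  // One GET per post; with HTTP/2 these can be multiplexed over one connection.
  return Promise.all(postUrls.map((url) => getSolidDataset(url, { fetch })));
}
```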

It might still be that you would want to group the blogposts in some way, e.g. by year:
https://my.pod/microblog/2021#w1
which would also have the effect of returning a large number of posts with a single request (see the sketch below).
You might also want to have a sync functionality to load the posts in the background or make them available offline.
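
A sketch of the yearly grouping idea above, where each post becomes a Thing with a fragment identifier inside one yearly resource (the URLs and predicate are again placeholders):

```typescript
import {
  buildThing,
  createSolidDataset,
  createThing,
  saveSolidDatasetAt,
  setThing,
  type SolidDataset,
} from "@inrupt/solid-client";
import { fetch } from "@inrupt/solid-client-authn-browser";

const yearUrl = "https://my.pod/microblog/2021";

// Each post is a Thing like https://my.pod/microblog/2021#w1, so one GET of
// the yearly resource returns all of that year's posts at once.
function addPost(dataset: SolidDataset, id: string, title: string): SolidDataset {
  const post = buildThing(createThing({ url: `${yearUrl}#${id}` }))
    .addStringNoLocale("https://schema.org/name", title)
    .build();
  return setThing(dataset, post);
}

async function saveYear() {
  let dataset = createSolidDataset();
  dataset = addPost(dataset, "w1", "@@wc2 has adopted structured data");
  dataset = addPost(dataset, "w2", "xyz entity name");
  await saveSolidDatasetAt(yearUrl, dataset, { fetch });
}
```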

It’s not entirely clear to me whether you do have an additional need for atomicity, but it sounds like this does cover your use case?


Thank you for your response. That’s all very helpful information and context.

Yes, atomic commits are vital to ensure the persisted application state retains its integrity. Do you know if saveSolidDatasetAt / the spec for the endpoint that handles it is atomic? I.e. does the specification require that the whole request body be received before the server starts modifying the file? Does the spec also require something like a WAL (write-ahead log) to handle failures mid-write? Finally, do you know if the spec makes any requirements about concurrent writes?

Thank you very much for your earlier response.

In short, yes.
There are also PATCH requests, so a whole RDF resource doesn’t need to be posted every time, and ETags can be used to ensure the resource on the server hasn’t changed.
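
For illustration, a conditional partial update with plain fetch might look like the sketch below. It assumes the server accepts SPARQL Update bodies for PATCH (some servers expect N3 Patch instead), and the URL and predicate are placeholders:

```typescript
import { fetch } from "@inrupt/solid-client-authn-browser";

const resourceUrl = "https://my.pod/microblog/2021";

async function renamePost() {
  // Read the current representation and remember its ETag.
  const res = await fetch(resourceUrl, { headers: { Accept: "text/turtle" } });
  const etag = res.headers.get("ETag");

  const patch = `
    DELETE DATA { <${resourceUrl}#w1> <https://schema.org/name> "Old title" . } ;
    INSERT DATA { <${resourceUrl}#w1> <https://schema.org/name> "New title" . }
  `;

  // If-Match makes the write conditional: it fails with 412 Precondition
  // Failed if someone else changed the resource since we read it.
  const patchRes = await fetch(resourceUrl, {
    method: "PATCH",
    headers: {
      "Content-Type": "application/sparql-update",
      ...(etag ? { "If-Match": etag } : {}),
    },
    body: patch,
  });
  if (patchRes.status === 412) {
    // The resource changed concurrently; re-read and retry.
  }
}
```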

As far as I know there isn’t any support for atomic transactions, which is the part of your use case I was unsure you needed. If you have to modify multiple files atomically, I think you would have to handle the possible error cases yourself, though someone else might correct me on this.
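
As a purely illustrative workaround at the application level (not something the spec or the client libraries provide), you could keep copies of the resources you are about to overwrite and try to restore them if a later write fails:

```typescript
import { fetch } from "@inrupt/solid-client-authn-browser";

// Best-effort multi-resource update with manual rollback. This is NOT a real
// atomic transaction: a crash between writes can still leave the pod in a
// mixed state, which is why a journal was mentioned earlier in the thread.
async function saveAllOrRollback(updates: Map<string, string>) {
  // Keep a copy of each resource's current Turtle before overwriting it.
  const originals = new Map<string, string>();
  for (const url of updates.keys()) {
    const res = await fetch(url, { headers: { Accept: "text/turtle" } });
    originals.set(url, await res.text());
  }

  const written: string[] = [];
  try {
    for (const [url, body] of updates) {
      const res = await fetch(url, {
        method: "PUT",
        headers: { "Content-Type": "text/turtle" },
        body,
      });
      if (!res.ok) throw new Error(`PUT ${url} failed: ${res.status}`);
      written.push(url);
    }
  } catch (err) {
    // Try to restore the resources we already overwrote.
    for (const url of written) {
      await fetch(url, {
        method: "PUT",
        headers: { "Content-Type": "text/turtle" },
        body: originals.get(url)!,
      });
    }
    throw err;
  }
}
```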
