I just finished the first version of a Solid app for tracking movies that I’ve been working on. I’ll post about that next week, but for now I’d like to discuss the state of the art for querying large containers (as in LDP containers).
I am using node-solid-server, and I have a container with 1411 documents. I wouldn’t say that this is “a large container”, but I’m calling it that because I’ve struggled a lot to make the application performant. The request to the container alone takes more than a second, which is already slow for a single GET request. But then trying to query the documents is a nightmare. I tried using globbing, but after loading for a while I get a 500 error (probably due to an out-of-memory exception, although I haven’t checked). So I have to request the documents one by one, and even making parallel requests in chunks, it takes almost 3 minutes for the application to load.
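For context, the chunked fetching looks roughly like this (a simplified sketch, not the actual app code; the chunk size and `documentUrls` are illustrative):

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunkArray(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Fetch each chunk's documents in parallel, but process chunks
// sequentially, to avoid opening 1411 connections at once.
async function fetchInChunks(documentUrls, size = 20) {
  const results = [];
  for (const chunk of chunkArray(documentUrls, size)) {
    const bodies = await Promise.all(
      chunk.map((url) => fetch(url).then((response) => response.text()))
    );
    results.push(...bodies);
  }
  return results;
}
```

Even with this, the total time is dominated by the sheer number of round trips, which is why it adds up to minutes.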
Given this situation, the only solution I could think of is caching all the documents locally. This still doesn’t fix the initial loading time of 3 minutes (much worse on mobile), but at least subsequent sessions are acceptable.
But there is still an additional problem. Given that I’m caching all the documents, I need to know when a document has been updated to invalidate the cache. I am using the purl:modified of each document returned in the container request, and that’s fine. But as I said, the container request takes more than a second, and it is much worse on mobile, although I suspect that’s related to parsing a big turtle document (that’s a discussion for another day). So I’d like to avoid this request as well. What I’ve done is read the purl:modified of the container itself, by requesting the container’s parent first. But one problem I’ve found is that this value seems to change every time I perform a GET request on the container. I don’t understand why that happens; I was assuming that reading the container didn’t cause it to be modified. The consequence is that every time I use the app on a different device, the other devices will be slower to boot up.
In case you’re wondering why I need all the documents when I start the application, that’s because, as far as I know, SPARQL is still not supported. So I wouldn’t be able to filter and sort documents on the server, and doing without that is not an option for this application.
I don’t know, in general I’ve done everything I could think of and the application is disappointingly slow. Where am I going wrong? Is node-solid-server the problem? Is there something I’m missing?
I heard about CRDTs some time ago while listening to a podcast episode, and they seem interesting. They would certainly be an improvement, but I don’t think the problem is inherently related to having a server. A normal database-backed server can easily handle returning data from 1000+ entities. And the limitation with searching/filtering would be solved if SPARQL were supported.
That’s why I’m saying “state of the art”. Because I’m interested in what’s feasible today, not what could theoretically be done in the future.
I don’t really have a solution to your problem, but here are some remarks and maybe starting points for you.
But then trying to query the documents is a nightmare. I tried using globbing, but after loading for a while I get a 500 error (probably due to an out-of-memory exception, although I haven’t checked).
Globbing is going to be removed from the spec, so I wouldn’t suggest using it anyway (source).
I’m not sure exactly about the role of SPARQL in Solid, but it doesn’t seem like it will be fully supported publicly. I think this issue will be relevant to you (I only skimmed through the answers): https://github.com/solid/specification/issues/162
The TL;DR from there:
I think that the question here is “fast access to multiple documents” and that the appropriate answer is “HTTP/2” (and a decent server implementation).
Maybe you will find more relevant issues and discussions in that specification repository; I didn’t look into it that much.
Given that I’m caching all the documents, I need to know when a document has been updated to invalidate the cache.
Regarding caching, you could take a look into ETags (mdn reference). The server sends an ETag with each resource, which changes whenever the resource changes, and it can be used for conditional fetching. ETags are part of the api-rest spec, so you can rely on their existence in Solid.
EDIT: And maybe you could also try merging all the files into a single one. I haven’t thought through the pros and cons of this, but I could imagine it speeding up initial loading and other bulk operations.
Globbing is going to be removed from the spec, so I wouldn’t suggest to use it anyway
I am aware of that, and I’m not particularly fond of globbing. But it was the only alternative I had, given that SPARQL didn’t work. For this application, I am not using it anymore because of the problems I mentioned.
I am using HTTP/2 on my server and I still have those problems, so maybe the issue here is the “a decent server implementation” part.
Regarding caching, you could take a look into ETags.
If I’m not mistaken, though, this ETag will give me the same information as reading the purl:modified property, right? I guess it’d be an improvement if I could do a HEAD request instead, and hopefully that doesn’t cause the container to be modified. I’ll give it a try, thanks for the suggestion.
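Something like this is what I have in mind (a sketch only; whether node-solid-server handles HEAD without touching the container is exactly what I’d be testing):

```javascript
// Extract the ETag header from any fetch response.
function readEtag(response) {
  return response.headers.get('ETag');
}

// A HEAD request should return the same headers as GET, but no body,
// so we can check the container's ETag without downloading its listing.
async function fetchContainerEtag(containerUrl) {
  const response = await fetch(containerUrl, { method: 'HEAD' });
  return readEtag(response);
}
```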
And maybe you could also try merging all the files to a single one.
I’d only do this as a last resort, and I hope it doesn’t come to that. It also wouldn’t allow me to fetch an individual resource on its own, which is useful now when I add a new movie on another device, given the caching I mentioned.
It’s only slightly different. First off, ETags are part of the spec (both in the LDP part that Solid uses and explicitly in Solid). I’m not sure if purl:modified is also specified somewhere.
Secondly, with ETags you can make a conditional request saying GET/PUT/... this resource only if its ETag differs from "some_cached_etag". In general that’s useful for handling write conflicts (e.g. only update the version you currently worked on; don’t update if someone else updated it in the meantime). In your case, you could use it to fetch a resource only if it was updated in the meantime, with the server omitting the body otherwise (GET this resource with If-None-Match: "33a64df551425fcc55e4d42a148795d9f25f89d4").
But I think a disadvantage in your scenario would be that you still need one request per resource (though the responses for unchanged resources would have no body, thanks to If-None-Match). So I’m not sure whether it actually improves performance.
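To make the idea concrete, a conditional fetch with a local ETag cache could look like this (a minimal sketch; the cache shape and the injectable `fetchFn` are my own assumptions, not anything from Solid):

```javascript
// Sketch of ETag-based conditional fetching.
// `cache` is a Map from URL to { etag, body }.
// `fetchFn` is injectable so the logic can be tested without a network.
async function cachedFetch(url, cache, fetchFn = fetch) {
  const cached = cache.get(url);
  const headers = cached ? { 'If-None-Match': cached.etag } : {};
  const response = await fetchFn(url, { headers });
  if (response.status === 304 && cached) {
    // Not modified: reuse the cached body; no payload was transferred.
    return cached.body;
  }
  const body = await response.text();
  cache.set(url, { etag: response.headers.get('ETag'), body });
  return body;
}
```

The per-resource request is still there, as you say; what you save is only the response body for unchanged resources.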
I already raised an issue about SPARQL not being supported a year and a half ago, and it’s still open. I’m not saying I shouldn’t open an issue just because of that, but before opening more issues I’d like to know what the proper way to do these things is.
Globbing clearly isn’t, given the performance problems and the fact that it’s in danger of being removed. My intuition tells me that SPARQL is the solution, but it’s not implemented. I don’t think opening an issue about globbing’s performance problems would help.
The issue I’ll probably open is that containers get modified on read. But I’ll wait to see how this discussion evolves.
Looking into the source code and existing issues, I think the cause of GET requests modifying containers is the locking mechanism, which creates files. There are already a couple of issues about that, so I don’t think I’ll open a new one: 1372 and 1460.