Solid at scale - querying large datasets

stockcarracer · December 8, 2022, 8:18pm

Hi all, new to this forum but have been reading a bit.
I’m not a coder but work in the data side of diversity & inclusion and privacy. I’m very interested in the potential of Solid but am struggling to understand how advanced it is for querying large datasets. For example, say I’m connected to 1 million pods, each representing a unique individual who has given permission to share certain demographic information such as age, and I want to know what % of the 1 million people are aged between 50 and 55.
Is this possible and computationally practical?
Very grateful for any guidance, or direction as to where I can find it - thanks!!

hochstenbach · December 9, 2022, 6:19am

In Flanders, Belgium, there is a project to provide Solid pods to millions of Flemish citizens: see https://solidlab.be. This should give you an idea of the scale at which Solid is used.

About searching information that is stored on pods: the insight is that there can be many ways of providing search functionality over many pods, from traditional crawling (or pushing notifications) to search engine services, federated searching, pods that themselves include search capabilities, and link traversal searching. With Solid protocols there are more techniques available for searching across nodes than there are with traditional web APIs (HTTP + HTML). For the (traditional) web at large there is just one option: crawling (mainly) unstructured data.

If one can build a search over standard web servers, then certainly one can build a search over pods.
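To make the "federated" option concrete for the age question above, here is a minimal sketch, not an actual Solid client. It assumes a hypothetical `fetch_age` function that returns the age a pod owner has agreed to share, or `None` when access is denied or the value is missing. The pod URLs and the in-memory `fake_pods` dictionary are invented stand-ins so the example runs without a network; a real deployment would need authenticated requests, and at a million pods it would also need parallelism or an index.

```python
from typing import Callable, Iterable, Optional

# Hypothetical fetcher type: given a pod URL, return the shared age,
# or None if the pod does not expose one to us.
AgeFetcher = Callable[[str], Optional[int]]


def share_in_age_range(
    pod_urls: Iterable[str],
    fetch_age: AgeFetcher,
    low: int,
    high: int,
) -> Optional[float]:
    """Return the percentage of responding pods whose age is in [low, high].

    Only running counts are kept; no individual's age is stored.
    """
    answered = 0
    matched = 0
    for url in pod_urls:
        age = fetch_age(url)
        if age is None:
            continue  # no permission or no data: skip this pod
        answered += 1
        if low <= age <= high:
            matched += 1
    if answered == 0:
        return None  # nothing to compute a percentage from
    return 100.0 * matched / answered


# Invented stand-in for real pods (assumption, for illustration only).
fake_pods = {
    "https://alice.example/": 52,
    "https://bob.example/": 34,
    "https://carol.example/": 55,
    "https://dave.example/": None,  # has not shared an age
}

result = share_in_age_range(fake_pods, fake_pods.get, 50, 55)
print(result)
```

The aggregation itself is trivial; the cost at scale is in the one-request-per-pod loop, which is why the other techniques listed above (indexes, notifications, search services) matter.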

stockcarracer · December 9, 2022, 4:44pm

Thanks @hochstenbach. Re: searching - for my use case this really needs to work with something nearing the efficiency of a database configured for that purpose. Are you suggesting that is possible? If so, do you know if anyone has achieved or is working on that (I’ll keep my eye on the Flemish project)? Many thanks!

hochstenbach · December 9, 2022, 7:01pm

I come from the library world, where we are used to harvesting data from many library websites and repositories with many millions of records. In our field we do all of this with protocols that were created in the late 1990s. In principle I don’t see a technical reason why Solid protocols couldn’t do this much more elegantly and efficiently, while making it much easier to keep search indexes up to date.

That said, I don’t have a broad enough view of the Solid pod indexing field to know which project has actually implemented this at that scale (there are not millions of pods available now). But if you want to learn more about all kinds of scalable searching over pods, you might read the publications of Ruben Taelman, Miel Vander Sande and Ruben Verborgh on Google Scholar (or contact them directly).

stockcarracer · December 11, 2022, 3:51pm

Thanks @hochstenbach. Very useful to know. I will explore further.

lecoqlibre · December 18, 2022, 1:40pm

We are also investigating this problem with INRIA (France).

One way to solve this is with indexes: indexing data on each pod. Solid currently has a proposal to index object types (to find objects of a certain type).

We are trying to add the ability to index properties and other information.
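As a rough illustration of the difference between the two kinds of index, here is a minimal in-memory sketch. It is not the Solid type index format (which is expressed in RDF) and not the INRIA design; the resource URLs, the `type` and `properties` keys, and the `ageBand` property are all assumptions made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Invented description of what one pod holds (assumption, for illustration).
Resource = Dict[str, object]


def build_indexes(
    resources: Dict[str, Resource],
) -> Tuple[Dict[str, List[str]], Dict[Tuple[str, str], List[str]]]:
    """Build a type index and a property index for one pod.

    type_index:     type name         -> resource URLs of that type
    property_index: (property, value) -> resource URLs with that value
    """
    type_index: Dict[str, List[str]] = defaultdict(list)
    property_index: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for url, doc in resources.items():
        type_index[str(doc["type"])].append(url)
        for prop, value in dict(doc["properties"]).items():
            property_index[(prop, str(value))].append(url)
    return dict(type_index), dict(property_index)


pod_resources = {
    "/profile/card": {"type": "Person", "properties": {"ageBand": "50-55"}},
    "/notes/1": {"type": "Note", "properties": {"topic": "solid"}},
}

type_index, property_index = build_indexes(pod_resources)

# A type index answers "where are the Person resources?"
print(type_index.get("Person"))
# A property index can answer "does anything here have ageBand 50-55?"
# without reading the resources themselves.
print(property_index.get(("ageBand", "50-55")))
```

With only a type index, a client still has to fetch and inspect each matching resource; a property index would let it answer a question like the age-range one from the index alone.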

jeffz · December 18, 2022, 5:55pm

That is an old version of the type-indexes. The latest spec (still under development) is here.

toychicken · December 19, 2022, 4:31pm

I’m new to this, but my understanding is that once the user has given you permission, there’s nothing to stop you taking the data and storing it in your own DB. Obviously, it’s then relatively easy to query that data.

As for maintaining the data, I guess there’d need to be a mechanism to get updates from Pods. Also, as a responsible data owner, you might want to have some mechanisms to remove users’ data from your systems if they rescind your access - but I don’t know how feasible that is.
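A minimal sketch of what such a local store might look like, assuming some hypothetical mechanism calls `upsert` when a pod's data changes and `revoke` when the owner withdraws access. How those events would actually be delivered by a pod is not shown, and the class and method names are invented for the example.

```python
from typing import Dict, List


class ConsentedCache:
    """Local copy of shared data, keyed by pod URL (illustrative only)."""

    def __init__(self) -> None:
        self._ages: Dict[str, int] = {}

    def upsert(self, pod_url: str, age: int) -> None:
        # Called on first access and again whenever the pod reports an update.
        self._ages[pod_url] = age

    def revoke(self, pod_url: str) -> None:
        # Called when access is rescinded: drop everything held for that pod.
        self._ages.pop(pod_url, None)

    def ages(self) -> List[int]:
        return list(self._ages.values())


cache = ConsentedCache()
cache.upsert("https://alice.example/", 52)
cache.upsert("https://bob.example/", 34)
cache.revoke("https://bob.example/")  # bob withdraws permission
print(cache.ages())
```

Deleting on revocation here depends entirely on the data holder honouring the event; nothing in this sketch lets the pod owner enforce it.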

stockcarracer · December 20, 2022, 5:44pm

Thanks @lecoqlibre and @jeffz, good to know, cheers. @toychicken, I appreciate the suggestion, but this would defeat the purpose for our use case (users retaining control over their data). Thanks.
