What is the best way to gather data from multiple pods so that differential privacy can be applied?

Hello!

I am new to Solid and the Community Solid Server (CSS), and I am currently working on a thesis on implementing data privacy in the Solid ecosystem. The goal is to create anonymous pods so that interested entities (like universities or research centers) can gather pod data from many people for statistical research. That way, accurate statistical population studies can be carried out without violating the privacy of the pod owners' data.

At the moment I am trying to figure out the following:

  1. How can I create a large number of pods on the CSS so that I can fill them with random data?
  2. How do I access and extract the data from all these pods so that a data curator can, for example, apply differential privacy and afterwards send the data to the interested entities?

Regarding the second question, do I create a client app for the role of curator, or do I modify the CSS so that the CSS itself can act as curator and apply differential privacy?

Thanks in advance

Hi,

CSS provides an API to manage accounts, which is documented here: JSON API - Community Solid Server
Here is a sample usage of it that I quickly wrote down some weeks ago (replace baseUrl with your server's URL): API for creating solid pods - #15 by A_A

I’m not really familiar with differential privacy and how exactly you want to achieve it. So I may misinterpret the details behind your question. However, I’d see three possible implementations:

(1) You tell the users to give the researchers access to the data in question. Then the researchers download the raw data and process it locally (on the researchers' server) in a way that preserves differential privacy.
(2) You have 3 entities: users, a differential privacy proxy, and the researchers. The users give the differential privacy proxy access to the data. The proxy fetches the data and processes it in a way that preserves differential privacy. The proxy stores this output in its own pod and gives researchers access to it so they can use it for their statistics. It's more or less the same as (1), but the proxy could be run by a trusted third party instead of by individual researchers.
(3) You create a custom CSS implementation that has a custom API for differential privacy. You tell users to use this custom CSS pod (or migrate their data there). Then they give researchers access to the data and the researchers use the custom (non-standard) API to access the data.

The first two have the advantage that any user with a Solid pod could contribute data. The third one would diverge from the Solid standards, so only users of your custom CSS implementation could contribute data.

Regarding tools to achieve this, I'd suggest looking at Tools and libraries overview · Solid. In particular, Inrupt's solid-client is pretty good imo.

Thank you for your answer!
I will check out this JSON API. Does a Java implementation of this API exist by any chance?

Regarding the differential privacy, option 3 is the one that interests me the most. But why does this option diverge from the Solid standards? Is it because of the interoperability aspect of the Solid standards, or is it something else?

There is no Java implementation of the CSS accounts API. However, you can use the JavaScript example I gave you as a starting point and convert it to Java (probably a good task for ChatGPT if you don't have any starting point; the implementation is not super-specific to Solid, so it should be able to understand it).

In short:
Imagine your custom CSS implementation has the following endpoint to retrieve data in a way that respects differential privacy:

GET /differential-privacy

Only your custom server has this endpoint, so you can only gather data from users on this server. Other pods would give you an error when you try to GET /differential-privacy, because they only implemented the APIs defined in the Solid specifications.

The goal with Solid applications is usually that they work with all Solid pods, as long as the pod follows the specifications. This gives application developers a larger user base for their app, and it gives users more choice in which apps they want to use.

In long:
Solid standards define (among other things) the APIs of Solid pods (called "server" in the specs). So as a client, e.g. an app running in the browser, you know how you can store data in the pod, how you can fetch the data, how you can give access to other people, etc. For each of these things there is already a defined API (usually a GET/POST/etc. request to the resource you want to fetch or update). One advantage is that a client can rely on these APIs existing on any Solid pod, no matter whether the provider is using a CSS pod, an NSS pod, or another custom implementation. Users are free to choose their pod provider and can still expect it to work with any Solid application, as both work with the same well-defined APIs.

Now, if you create a custom CSS implementation to provide a new API (for differential privacy) and then build an application using this new API, things change: Your app uses a custom API that only works with your custom CSS implementation. If someone stores their data on a regular CSS instance, this pod does not have the API for differential privacy, so they can’t use any application that relies on the custom API. Or vice versa, if you want to collect data from users, you cannot do this if the pod does not have the custom API.

If you want it to work with any Solid pod, I'd suggest option 2 (renaming the "differential privacy proxy" to curator, which is hopefully what you mean by this word):

Users give the curator access to their pod and consent to their data being collected.

The curator collects the data from the pod, and processes it to remove anything that could be traced back to their original pods (which is how I understand differential privacy currently).

The researchers request the data from the curator. This could be done via a custom API. Alternatively, the curator could store the processed data in a pod, and the researchers could then use Solid APIs to fetch the data from the curator's pod.

The curator can use the Solid APIs to gather the data from the end users. The data can be stored on any server implementation, as the curator only uses the standardized APIs. So you would have the benefit that any user can provide data for research, rather than only those who store their data on your custom server.
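
As a rough illustration of the processing step, differential privacy is usually achieved by adding calibrated random noise to aggregate results rather than by only stripping identifiers. Below is a minimal sketch of the Laplace mechanism that a curator could apply after collecting the values. The function names, the clamping bounds and the choice of epsilon are assumptions made for this example; they are not part of Solid or CSS, and a real deployment should rely on a vetted differential privacy library.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of values matching predicate.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def dp_mean(values, lower: float, upper: float, epsilon: float,
            rng: random.Random) -> float:
    """Noisy mean of values clamped to [lower, upper].

    Assumption: the number of contributors is public, so replacing one
    record shifts the mean by at most (upper - lower) / n.
    """
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    return sum(clamped) / n + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical ages collected by the curator from five pods.
ages = [23, 35, 41, 58, 64]
rng = random.Random(42)  # fixed seed for reproducibility; do not seed in production
noisy_over_40 = dp_count(ages, lambda a: a > 40, epsilon=0.5, rng=rng)
noisy_mean_age = dp_mean(ages, lower=18, upper=90, epsilon=0.5, rng=rng)
```

A smaller epsilon gives stronger privacy and noisier answers, so with only a handful of pods the released numbers will be far from the true values.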

Thanks again for the comprehensive reply, you are really helping me out here!

So if I pursue the differential privacy proxy as curator, the curator will be a Solid client through which the interested entities can access the CSS for the anonymised data? And if I pursue the CSS custom API, the CSS itself will be the curator through which they get access to the data?

I’ll try to take a look at both use cases and see which one best matches my end goal for this thesis (the end goal is not yet final so I have some slack).

Yes and no :)
In the scenario I described, the curator would be both a client and a server.

The client part of it fetches data from the users' pods. Maybe it fetches data regularly, processes it and stores it locally in a database. Or it does so upon request from researchers. Either way, in this interaction (Curator ↔ Solid pod) it acts as a client and will need to use the APIs defined in the Solid standards.

The server part of it serves the processed data to the researchers. For example, via a REST API it would authenticate the researchers and, upon request, serve them the processed data. In this interaction (Researcher ↔ Curator) it is a server, and you could use any custom API you want (or store the processed information in a pod and use the APIs from the Solid standards as well).
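
To make the two roles concrete, here is a small structural sketch. Everything in it is hypothetical: the `Curator` class, the injected `fetch_value` callable (which stands in for an authenticated Solid read of a resource) and the `release` callable (which stands in for the privacy-preserving processing). It only shows where the client side and the server side sit, not how authentication or HTTP handling would actually be done.

```python
from typing import Callable, Dict, Iterable, List


class Curator:
    """Hypothetical curator: a client towards pods, a server towards researchers."""

    def __init__(self,
                 fetch_value: Callable[[str], float],
                 release: Callable[[List[float]], float],
                 allowed_researchers: Iterable[str]) -> None:
        self._fetch_value = fetch_value      # stand-in for an authenticated Solid GET
        self._release = release              # stand-in for the privacy-preserving step
        self._allowed = set(allowed_researchers)
        self._values: List[float] = []

    def collect(self, resource_urls: Iterable[str]) -> int:
        """Client role: read one value from each resource the users shared."""
        self._values = [self._fetch_value(url) for url in resource_urls]
        return len(self._values)

    def handle_request(self, researcher_id: str) -> Dict[str, object]:
        """Server role: answer a researcher with processed data only."""
        if researcher_id not in self._allowed:
            return {"status": 403}
        if not self._values:
            return {"status": 404}
        return {"status": 200, "result": self._release(self._values)}
```

The raw values never leave the `Curator` object; researchers only ever see what `release` returns, which is the property that distinguishes option (2) from option (1).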

Yes, exactly.

And for completeness, if you follow option (1) from my initial answer, the researchers would also be the curators (which is probably not what you want, from my shallow understanding of differential privacy. I guess you want to give the researchers only the already processed data).

Good luck with the thesis :slight_smile:

You can write a script or a program that interacts with the CSS API to create multiple pods. You'll need to generate a unique identifier for each pod, as well as the associated access control lists (ACLs) that manage permissions for the data within those pods.

Once the pods are created, you can populate them with random data. This could involve generating synthetic data based on statistical distributions or using existing datasets. Again, you'll interact with the CSS API to upload data into each pod.

You'll also need a mechanism to access and extract data from all the pods. This could involve writing another script or program that iterates through each pod, retrieves the data, and aggregates it. You'll use the CSS API to interact with each pod and fetch the relevant data.

Alternatively, you could extend the functionality of CSS to include differential privacy mechanisms. This would involve modifying the CSS codebase to incorporate privacy-preserving techniques directly into the server. While this approach could provide tighter integration and potentially better performance, it also requires more extensive development effort and may be more challenging to maintain and update.
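
A minimal sketch of the "fill pods with random data" step, under several assumptions: the pods already exist (created through the account API mentioned earlier in the thread), each pod lives at `<base URL>/<pod name>/` as in a default suffix-based CSS setup, and the record fields, pod name prefix and resource path are made up for this example. The code only builds the HTTP PUT requests; sending them additionally requires authentication or public write access, which is left out here.

```python
import json
import random
import urllib.request


def pod_names(count: int, prefix: str = "test-pod") -> list:
    """Generate unique, predictable pod names (hypothetical naming scheme)."""
    return [f"{prefix}-{i:04d}" for i in range(count)]


def synthetic_record(rng: random.Random) -> dict:
    """Draw one random record; the fields are invented for illustration."""
    return {
        "age": rng.randint(18, 90),
        "restingHeartRate": round(rng.gauss(70, 10), 1),
    }


def build_upload_request(base_url: str, pod_name: str,
                         record: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT request that stores the record in a pod.

    Assumption: the resource path research/record.json is free to choose.
    """
    url = f"{base_url.rstrip('/')}/{pod_name}/research/record.json"
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


rng = random.Random(2024)  # fixed seed so the test data is reproducible
requests_to_send = [
    build_upload_request("http://localhost:3000/", name, synthetic_record(rng))
    for name in pod_names(100)
]
# Each request could then be sent with urllib.request.urlopen(request)
# once an authenticated session or public write access is in place.
```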

Thanks for the extensive response. I have decided to go with the differential privacy proxy server setup. This seemed the most interesting and most feasible approach for achieving my goal. As @A_A suggested, a proxy server would follow the Solid standards more closely, and I think building a differential privacy implementation into the CSS would be more time-consuming.

Just as an additional note: there is an alternative to Option 2 that does not rely on a central proxy, but uses (Secure) Multi-Party Computation (MPC) instead.
The benefit is that no one will ever see the raw data, except for the data provider himself/herself (and his/her individual trusted server for performing secret-sharing; which could be the Solid server if being extended).
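
To give a feel for the secret-sharing idea, here is a toy sketch of additive secret sharing for computing a sum. This is not the protocol from the paper mentioned in this post; it is a textbook-style illustration that assumes honest-but-curious computing parties who do not collude, non-negative integer inputs, and a modulus chosen for this example.

```python
import secrets

PRIME = 2**61 - 1  # a Mersenne prime used as the modulus (choice made for this sketch)


def share(value: int, n_parties: int) -> list:
    """Split value into n_parties random shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(all_shares: list) -> int:
    """Combine shares from many providers into the total.

    all_shares[i][j] is the share that provider i sent to computing party j.
    Each party only adds up the shares it received, and only those
    per-party partial sums are combined at the end.
    """
    partial_sums = [sum(column) % PRIME for column in zip(*all_shares)]
    return sum(partial_sums) % PRIME


# Three data providers, three computing parties, hypothetical values.
private_values = [23, 45, 31]
all_shares = [share(v, n_parties=3) for v in private_values]
total = secure_sum(all_shares)  # 99, while no single party saw 23, 45 or 31
```

Any individual share is a uniformly random number, so a single computing party learns nothing about a provider's value on its own. Noise for differential privacy could then be added to the total before it is released.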

We have a paper for performing this on Solid while respecting user autonomy: [2309.16365] Libertas: Privacy-Preserving Computation for Decentralised Personal Data Stores. We also evaluated it using differential privacy as one scenario.

Of course, extending the Solid server with a DP endpoint is also valid, especially if you consider (the accuracy of) local differential privacy to be good enough.
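
For comparison, a classic example of local differential privacy is randomized response, where every data provider perturbs their own answer before it leaves their side. The sketch below assumes a single yes/no attribute per person; the function names and parameters are chosen for this illustration only.

```python
import math
import random


def randomize_bit(true_bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return true_bit if rng.random() < p_truth else 1 - true_bit


def estimate_fraction(reports: list, epsilon: float) -> float:
    """Debias the observed fraction of 1-reports.

    E[observed] = f * p + (1 - f) * (1 - p), where f is the true fraction
    and p the probability of telling the truth, so
    f = (observed - (1 - p)) / (2p - 1).
    """
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(1)  # fixed seed for a reproducible demo
true_bits = [1] * 3000 + [0] * 7000             # true fraction is 0.3
reports = [randomize_bit(b, epsilon=1.0, rng=rng) for b in true_bits]
estimate = estimate_fraction(reports, epsilon=1.0)
```

Because every single report is noisy, the estimate only becomes accurate with many contributors, which is the accuracy trade-off of the local model compared with a trusted curator adding noise once to the aggregate.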

Very interesting, I will take a look at it!
Thanks!