Index service for a social app


I’m trying to answer the question “How do I find people who are interested in the same stuff as me?” Or “How do I find people living in a certain area?” :mag:

This seems to be a fundamental issue of distributed social networks that aspire to connect strangers with each other (hospitality exchange, collaboration, meetups…). :dancing_women:

The solutions I can think of:

  1. :pig_nose: Follow your nose through foaf:knows links, and hope you discover the relevant people in the social graph. (e.g. used in friend crawler)
  2. :page_with_curl: Make a list/group, and people register there, and find each other there (e.g. used in ohn-solid)
  3. :mag: Make an index
  4. :grey_question: Anything else?

In this post I specifically ask about the 3rd option. How to implement an index?

I came up with the following architecture and I’m interested in your feedback before implementing it:

Let’s say we want to index people and their interests.

Input API is a (REST?) API that allows a human via app to suggest a Person or other Thing for indexing
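A minimal sketch of what such a suggestion endpoint might validate before queueing anything for the crawler. The payload shape and function names here are my assumptions, not anything standard:

```typescript
// Hypothetical payload for the Input API: an app suggests a resource for indexing.
interface Suggestion {
  uri: string;          // the Person or other Thing to crawl
  suggestedAt?: string; // optional ISO timestamp
}

// Validate a suggestion before queueing it for the crawler.
// Returns an error message, or null if the suggestion looks usable.
function validateSuggestion(body: unknown): string | null {
  if (typeof body !== "object" || body === null) return "body must be a JSON object";
  const uri = (body as { uri?: unknown }).uri;
  if (typeof uri !== "string") return "uri must be a string";
  try {
    const parsed = new URL(uri);
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      return "uri must be http(s)";
    }
  } catch {
    return "uri must be an absolute URL";
  }
  return null;
}
```

Any HTTP framework could call this from a POST handler and enqueue the valid URIs for the crawler.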

Linked Data Fragments is the API through which a human via app can access the indexed info (I would use Linked Data Fragments Server.js for this, and query it with Comunica from the app)
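For concreteness, the app-side lookup could build a SPARQL query like the one below and hand it to Comunica, which resolves it into triple-pattern fragment requests against the LDF server. The function name and topic IRI are illustrative:

```typescript
// Build a SPARQL query that finds people sharing a given interest.
// Comunica (@comunica/query-sparql) can evaluate a query like this
// against an LDF (Triple Pattern Fragments) source.
function peopleWithInterest(topicIri: string): string {
  return `
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE {
  ?person a foaf:Person ;
          foaf:topic_interest <${topicIri}> .
}`.trim();
}
```

With Comunica this string would be passed to something like `new QueryEngine().queryBindings(query, { sources: ["https://index.example/fragments"] })`; the endpoint URL here is made up.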

Crawler (Bot) :robot: is a bot that visits the Persons or other Things suggested via the Input API, or revisits those that are outdated. It then updates the index with the specific relationships it finds that we’re interested in (e.g. (Person) --(foaf:topic_interest)--> (Thing)). If the bot is a little bit curious/evil, it will also perhaps crawl foaf:knows links, to discover new people and stuff.
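The crawler’s core filtering step might look like this, assuming some RDF parser has already produced plain triples (the triple shape and predicate list are my assumptions):

```typescript
// A parsed RDF triple, as e.g. an N3/Turtle parser would produce it.
interface Triple { subject: string; predicate: string; object: string }

const FOAF = "http://xmlns.com/foaf/0.1/";
const RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

// Predicates the index cares about; everything else the crawler fetched is dropped.
const INDEXED_PREDICATES = new Set([FOAF + "topic_interest"]);

// Keep only the relationships we index, plus "a foaf:Person" statements.
function filterForIndex(triples: Triple[]): Triple[] {
  return triples.filter(
    (t) =>
      INDEXED_PREDICATES.has(t.predicate) ||
      (t.predicate === RDF_TYPE && t.object === FOAF + "Person")
  );
}

// The curious variant: follow foaf:knows links to discover new people to visit.
function discoveredPeople(triples: Triple[]): string[] {
  return triples.filter((t) => t.predicate === FOAF + "knows").map((t) => t.object);
}
```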

RDF is just an abstraction layer: the underlying data are stored as RDF triples, perhaps limited to
?person --foaf:topic_interest--> ?thing
?person --a--> foaf:Person

Storage can be anything - I’d probably use (My)SQL for speed, but even a Turtle document (within a Solid Pod?) (or whatever) may do. Or not. We also want to store at least the date of the last visit, and the suggested Things to add (neither would appear in the LDF output).
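An in-memory sketch of that bookkeeping, just to make the shape concrete (the class and field names are my guesses; a SQL version would be roughly two tables, `people` and `interests`):

```typescript
// One entry per indexed person, mirroring the two triple patterns above,
// plus crawler bookkeeping that never appears in the LDF output.
interface IndexEntry {
  person: string;          // ?person IRI
  interests: Set<string>;  // objects of foaf:topic_interest
  lastVisited: Date;
}

class Index {
  private entries = new Map<string, IndexEntry>();
  private suggested: string[] = []; // queue of Things suggested via the Input API

  suggest(uri: string): void {
    if (!this.entries.has(uri) && !this.suggested.includes(uri)) this.suggested.push(uri);
  }

  // Called by the crawler after visiting a person.
  update(person: string, interests: string[]): void {
    this.suggested = this.suggested.filter((u) => u !== person);
    this.entries.set(person, { person, interests: new Set(interests), lastVisited: new Date() });
  }

  // URIs the crawler should (re)visit: new suggestions plus outdated entries.
  dueForVisit(maxAgeMs: number, now = Date.now()): string[] {
    const stale = [...this.entries.values()]
      .filter((e) => now - e.lastVisited.getTime() > maxAgeMs)
      .map((e) => e.person);
    return [...this.suggested, ...stale];
  }
}
```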

As you can see, I haven’t concerned myself with anything like distributed hash tables (DHTs). I simply don’t understand how those are supposed to work, so far. If you can suggest some resources on this topic, please do!

I’m particularly wondering:

Has anybody already implemented something similar (i.e. to solve the discovery issue)? I don’t want to reinvent the wheel.

I haven’t decided whether one would need to authenticate/authorize for the Input API and the Linked Data Fragments API. Perhaps Solid OIDC could be used for this (I don’t know much about it). Or we would only index public stuff, and people would perhaps have to solve a CAPTCHA or something to make suggestions via the Input API.

The Input API doesn’t feel very “standard”. It would be just some REST-like endpoints. What would be a nicer (Linked Data-y) way of suggesting things to the Index?

The Linked Data Fragments (LDF) API would not suffice for searching people by location; AFAIK LDF doesn’t support geospatial search (finding things located within a bounding box).
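One workaround could be to fetch candidate people via LDF/Comunica and post-filter by location on the client. The bounding-box check itself is small; the assumption here is that people publish coordinates as something like WGS84 `geo:lat`/`geo:long`:

```typescript
interface Point { lat: number; long: number }
interface BBox { south: number; west: number; north: number; east: number }

// True if the point lies inside the box. A box with west > east is
// treated as crossing the antimeridian.
function inBBox(p: Point, b: BBox): boolean {
  const latOk = p.lat >= b.south && p.lat <= b.north;
  const longOk =
    b.west <= b.east
      ? p.long >= b.west && p.long <= b.east
      : p.long >= b.west || p.long <= b.east;
  return latOk && longOk;
}
```

Post-filtering obviously downloads more data than a real geospatial index would, so it only scales to a modest number of people.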

And yes, such a system wouldn’t be very discoverable. Its whereabouts would probably just be hardcoded into the app that needs it.

But well, I just want something working! :woman_shrugging:

I’m interested in any [other] ideas[, too]! :heart: :muscle:t3: :sunflower: :badger:


Hi mrkvon, this is an interesting problem, and it sounds like exactly the kind of thing that graphs are supposed to be excellent at solving; sorry to say I don’t know more about it, but it’s something I’ll probably be looking into soon.

Have you tried any of the Gitter channels? They seem to be a lot more active than the forum. The “chat” channel is the biggest and seems very active, but you might also want to look at “app-development” or perhaps “data-interoperability”.

Hello @mrkvon, I am familiar with implementing this sort of index, and there are some general patterns you might find useful here.

If you are planning to only index public documents, it makes the AuthN/AuthZ portion much simpler, though I would still recommend considering user privacy. It would, for example, be considerate to allow users to opt-in to a service like this rather than requiring them to opt-out.

One common way that content publishers can indicate sharing and/or distribution constraints is with a Creative Commons license. This also exists as an RDF vocabulary. For instance, a public document that allows for non-commercial copying and redistribution (with attribution) might contain these triples:

@prefix cc: <http://creativecommons.org/ns#> .
@prefix ex: <http://example.org/ns#> .

</resource> a ex:UserData ;
    cc:permits cc:Reproduction , cc:Distribution ;
    cc:requires cc:Attribution ;
    cc:prohibits cc:CommercialUse ;
    ex:data "Some data" .

It is always good to respect the constraints of a data publisher, and this is a well-understood way to do that.

In terms of the mechanics of fetching resources and updating your app’s index, a crawler is one approach, and this is how much of the Web is indexed today by the major search engine companies. Another approach is to look at the Solid Notifications protocol. For those documents that have a notification API, your service could subscribe to updates: the WebHook subscription type would likely be the best mechanism here. But for resources that do not have a notification API, you will want to rely on Last-Modified and/or ETag headers with your crawler, since you do not need to re-fetch resources that haven’t changed.
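The conditional-request part of that could be as simple as the following sketch (the names are mine; the headers and the 304 status are standard HTTP):

```typescript
// Cache validators the crawler stored from its last visit to a resource.
interface CacheInfo { etag?: string; lastModified?: string }

// Build conditional request headers so an unchanged resource answers
// 304 Not Modified instead of sending a full body.
function conditionalHeaders(cached: CacheInfo): Record<string, string> {
  const headers: Record<string, string> = {};
  if (cached.etag) headers["If-None-Match"] = cached.etag;
  if (cached.lastModified) headers["If-Modified-Since"] = cached.lastModified;
  return headers;
}

// After the response: 304 means skip re-parsing entirely;
// anything else means re-index and store the new validators.
function shouldReindex(status: number): boolean {
  return status !== 304;
}
```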

Where this all gets much more complicated is for non-public resources. There are ways to address that in a secure and reliable fashion – they typically involve interacting with an OAuth 2.0 Authorization Server (UMA is one example), but the flows are not entirely trivial. For non-public resources, you would need to ensure that your indexer app has read permission to the resources in question (you would still want to consider any Creative Commons assertions).

One final consideration is that of provenance. If a Linked Data Fragments server is providing an API to this data, you will likely want to ensure that the link to the original resource is present. Typically, the subject node in an RDF triple is minimally sufficient for this, but you may also wish to consider the use of PROV-O, a vocabulary specifically for representing provenance.
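One lightweight way to keep that link is to add a `prov:wasDerivedFrom` statement per extracted subject; `prov:wasDerivedFrom` is a real PROV-O property, while the triple shape and function below are my own sketch:

```typescript
const PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom";

interface Triple { subject: string; predicate: string; object: string }

// For every distinct subject extracted from sourceDoc, emit one extra
// triple recording which document the data came from.
function withProvenance(triples: Triple[], sourceDoc: string): Triple[] {
  const subjects = new Set(triples.map((t) => t.subject));
  const prov: Triple[] = [...subjects].map((s) => ({
    subject: s,
    predicate: PROV_DERIVED,
    object: sourceDoc,
  }));
  return [...triples, ...prov];
}
```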


It seems like an Index requires solutions 1 and 2 anyway, either actively or passively: the contents of an Index need to be sourced somehow, and solutions 1 and 2 are examples of how.

You mention two things specifically:

  • Input API: this would be your listed solution #2.
  • Crawler Bot: this would be your listed solution #1.