Constructive criticism from an experienced developer

LukeRissacher · October 19, 2020, 3:45am

Wanted to offer a few thoughts on Solid from a developer’s perspective, having been diving deeply into the specs & documentation the last few days (it’s been on my research list for a while after Bruce Schneier posted about it early this year).

First off, I really appreciate the intention & goals; as a user I would very much appreciate having this kind of control of my data, where I could grant or revoke access to, say, Facebook Messenger to my chat history, Google Calendar or LukesCustomCalendar to my calendar data, or have a secure way for some IoT thermostat to share data with the robotic window-openers.

It’s an inspiring vision that tech people clearly have resonated with, and that I hope non-tech people will too, especially as the humane-tech movement gains momentum.

As a developer who’s been building web apps, server software, game engines, and other systems for 20+ years, I worry there might be some fundamental design mistakes here that could prevent Solid from ever getting off the ground or offering a smooth user experience compared with existing cloud / centralized server approaches - so at the risk of annoying everybody I thought I’d weigh in with a few thoughts.

From the specs, Solid’s design appears to be essentially a filesystem over the internet - of necessity a fairly high-latency filesystem compared to, say, a hard drive or RAM. Functionally similar to FTP, if you will, or WebDAV, utilizing RDF dialects extensively as file formats. Its model is folders and files (containers and resources), which you can GET, POST, PUT, and DELETE as units via HTTP calls.

Filesystem or database?

Fundamentally, I wonder if the filesystem model might be a mistake - the bread and butter of most apps is to quickly query and filter large datasets - i.e. extract just the relevant data the app and user needs at that moment. Or to modify the data, generally one small piece at a time. In other words, apps would most commonly want Pods to fulfill the role of a database manager - a flexible filter engine sitting on top of an efficiently-searchable chunk of data (stored in b-trees, etc.).

As I was brainstorming potential app scenarios for Solid - notes, email archives, personal finances, calendar, photo library - I noticed all of them get difficult or start to fall apart when indexing and querying requirements come into the picture, which in nearly all these examples they do - search photos by tag/subject, find my next dentist appointment, summarize my yearly finances by category.

Files work OK for local apps where the code has low-latency access to storage, but over the network things will tend to grind to a halt as datasets grow, which they inevitably do.

If I want to find how much I paid for electricity the last 10 years in a personal finance app, a database like SQLite can rip through an index or table of tens of thousands of rows and return a response in microseconds, from a locally attached disk/SSD or from the disk cache in RAM. To do that over a Solid “filesystem”, an app (if it doesn’t cache into a database of its own) would need to transfer the entire RDF dataset over the wire, parse the somewhat space-inefficient syntax, iterate through the parsed representation and sum up electricity transactions. As the dataset grows this gets more and more inefficient - users will likely wait multiple seconds to answer queries like that, which they’re used to getting near-instantly with modern apps and websites.
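To make the database side of that comparison concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (LineItems, Category, Date, Amount) follow the SQL example further down in this post; the index name and the sample rows are made up purely for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up line items."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE LineItems (Category TEXT, Date TEXT, Amount REAL)")
    # Hypothetical index so the query can skip rows from other categories.
    conn.execute("CREATE INDEX idx_category_date ON LineItems (Category, Date)")
    rows = [
        ("Electric Bills", "2009-12-15", 80.0),   # before the cutoff
        ("Electric Bills", "2015-03-01", 100.0),
        ("Electric Bills", "2019-07-01", 120.5),
        ("Groceries", "2019-07-02", 60.0),        # different category
    ]
    conn.executemany("INSERT INTO LineItems VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def electricity_total(conn: sqlite3.Connection, since: str) -> float:
    """Sum electricity spending dated after `since` (an ISO date string)."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(Amount), 0) FROM LineItems "
        "WHERE Category = ? AND Date > ?",
        ("Electric Bills", since),
    ).fetchone()
    return total


conn = build_demo_db()
total = electricity_total(conn, "2010-01-01")
```

Only the single summed value leaves the database; the rows themselves never have to be transferred or parsed by the app.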

Users already have to take on the necessary mental complexity of understanding & managing Pods & WebIDs, so it is critical that their experience with Solid is reliable and snappy, on par with what centralized “cloud” apps offer.

The “network filesystem” model puts a burden on app developers too. To work with Solid where any kind of intelligent querying & data filtering is needed (i.e. almost always), Solid apps will either have to maintain homebrew index files for the queries one might want to do (with no transactional guarantees when writing multiple files), or they’ll have to maintain their own internal cache of the RDF data in a more convenient form like a local database. Not only is this error-prone and a lot of work for the developer (distributed data sync is hard), but it’s a worse user experience than an app + database approach - I picture a big “loading data from your pod…” message at startup while the app transfers & parses a giant chunk of pod data to populate its cache.

I think it’s wise not to neglect performance matters here or hand-wave them away (“HTTP/2 will solve it”) - storing records as RDF/Turtle, for instance, likely incurs a 10x or so storage-space penalty over a compact binary representation like SQLite and other DBMSs would use (with an accompanying compute penalty, since RAM access latency is one of the biggest factors in performance). That ultimately means roughly 10x the cost for Pods, making them more expensive for users to obtain or requiring more ad-supported shenanigans to offer them for free, plus more wasted network bandwidth and compute resources, a greater carbon footprint, and so on.
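A rough way to get a feel for the size difference is to serialize one record both ways and compare byte counts. This is only a sketch under loud assumptions: the ex: vocabulary is invented, the binary layout is a hand-rolled fixed-width record rather than SQLite's actual on-disk format, and real ratios will shift with prefixes, string lengths, indexes and compression.

```python
import struct
from datetime import date

EPOCH = date(1970, 1, 1)


def as_turtle(item_id: int, category: str, day: date, amount_cents: int) -> str:
    """One record as Turtle text, using a made-up ex: vocabulary."""
    return (
        f"<#item{item_id}> a ex:LineItem ;\n"
        f'    ex:category "{category}" ;\n'
        f'    ex:date "{day.isoformat()}"^^xsd:date ;\n'
        f'    ex:amount "{amount_cents / 100:.2f}"^^xsd:decimal .\n'
    )


def as_binary(category_id: int, day: date, amount_cents: int) -> bytes:
    """The same record as fixed-width binary: 2 + 4 + 8 bytes, no padding."""
    days_since_epoch = (day - EPOCH).days
    return struct.pack("<HIq", category_id, days_since_epoch, amount_cents)


day = date(2019, 7, 1)
turtle_size = len(as_turtle(1, "Electric Bills", day, 12050).encode("utf-8"))
binary_size = len(as_binary(7, day, 12050))
ratio = turtle_size / binary_size  # how many times larger the text form is
```

The binary form assumes categories are stored as small integer IDs in a lookup table, which is part of why it comes out so much smaller.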

Unfortunately my Inrupt Pod crashed with the sample apps I tried to load (got stuck in an infinite redirect loop in the authentication pop-up windows) but from perusing the forums a bit it sounds like other developers have had struggles with performance once data sets get in the hundreds of records or more.

I do like that the PATCH method is in the spec; at least that implies developers can update files partially and not have to retransfer the entire file over the wire every time.

But that’s a pale shadow of the power & flexibility you get with a SQL type interface; picture, for a moment, that your pod contains SQLite databases (or equivalent one-file databases):

/notes.db
/calendar.db

And rather than reading the whole file/database via GET, you can run SQL queries over the network for precisely the data your user wants:

GET /calendar.db   
  query="SELECT SUM(Amount) FROM LineItems 
    WHERE Category = 'Electric Bills' AND Date > '2010-01-01'"

One string of SQL goes out, one sum number comes back, instead of shuttling 10 years of data across.
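On the pod side, handling such a read request could be sketched roughly as follows. This is not part of any Solid spec or server; `run_readonly_query`, `make_demo_db` and the demo file name are hypothetical. The one real mechanism relied on is SQLite's `mode=ro` URI parameter, which makes the database engine itself refuse writes on that connection.

```python
import os
import sqlite3
import tempfile


def make_demo_db(path: str) -> None:
    """Create a small made-up database file to query against."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE LineItems (Category TEXT, Date TEXT, Amount REAL)")
    conn.executemany(
        "INSERT INTO LineItems VALUES (?, ?, ?)",
        [
            ("Electric Bills", "2015-03-01", 100.0),
            ("Electric Bills", "2019-07-01", 120.5),
            ("Groceries", "2019-07-02", 60.0),
        ],
    )
    conn.commit()
    conn.close()


def run_readonly_query(db_path: str, sql: str, params: tuple = ()) -> list:
    """Run `sql` against the SQLite file at `db_path`, opened read-only."""
    # mode=ro: any statement that tries to modify the file raises an error.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
make_demo_db(db_path)
rows = run_readonly_query(
    db_path,
    "SELECT SUM(Amount) FROM LineItems WHERE Category = ? AND Date > ?",
    ("Electric Bills", "2010-01-01"),
)
```

A real endpoint would also need authentication, query timeouts and result-size limits, none of which are shown here.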

Likewise on the “write” side, you can, with a very small string of code, update or delete thousands of rows of data on your Pod, right on the disk your Pod has convenient and fast access to:

POST /calendar.db   
  "UPDATE LineItems SET Category = 12 WHERE Category = 7"

SQL (and perhaps something like GraphQL or SPARQL) gives you the full power of a programming language, with arbitrarily complex expressions to express concisely what data you want and what to do with it. It’s like a drawer full of surgical tools versus the crude chainsaw that HTTP GET/POST/PUT/DELETE gives you on single files.
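For the write-side UPDATE above, a pod-side handler might look something like this sketch. The function name `recategorize` and the demo data are assumptions; the point is that the whole bulk change runs inside one transaction next to the data, which is exactly the guarantee that writing multiple separate files lacks.

```python
import sqlite3


def recategorize(conn: sqlite3.Connection, old: int, new: int) -> int:
    """Move every line item from category `old` to `new`; return rows changed."""
    # `with conn` commits if the block succeeds and rolls back on an exception.
    with conn:
        cursor = conn.execute(
            "UPDATE LineItems SET Category = ? WHERE Category = ?",
            (new, old),
        )
        return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LineItems (Category INTEGER, Amount REAL)")
# Made-up data: 1000 rows in category 7, 10 rows in category 3.
conn.executemany(
    "INSERT INTO LineItems VALUES (?, ?)",
    [(7, 1.0)] * 1000 + [(3, 1.0)] * 10,
)
conn.commit()

changed = recategorize(conn, old=7, new=12)
```

The request that triggers this is a few dozen bytes regardless of how many rows end up being changed.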

WebACL could work at the database level perhaps.
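One possible shape for that, sketched with SQLite's authorizer hook (exposed in Python as `Connection.set_authorizer`): the pod would translate an agent's granted access into a set of readable tables before running the agent's SQL. How WebACL rules would actually map to tables is an open question and entirely an assumption here; `restrict_to_tables` and the table names are hypothetical.

```python
import sqlite3


def restrict_to_tables(conn: sqlite3.Connection, readable_tables: set) -> None:
    """Allow only SELECTs that read columns from `readable_tables`."""

    def authorizer(action, arg1, arg2, db_name, source):
        if action == sqlite3.SQLITE_SELECT:
            return sqlite3.SQLITE_OK
        # For SQLITE_READ, arg1 is the table name and arg2 the column name.
        if action == sqlite3.SQLITE_READ and arg1 in readable_tables:
            return sqlite3.SQLITE_OK
        # Everything else (writes, schema changes, other tables) is refused.
        return sqlite3.SQLITE_DENY

    conn.set_authorizer(authorizer)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Calendar (Title TEXT)")
conn.execute("CREATE TABLE Notes (Body TEXT)")
conn.execute("INSERT INTO Calendar VALUES ('Dentist')")
conn.commit()

# Pretend the requesting app was granted read access to Calendar only.
restrict_to_tables(conn, {"Calendar"})
titles = conn.execute("SELECT Title FROM Calendar").fetchall()
```

With this in place, a query against Notes or any write statement fails with a "not authorized" database error before it touches any data. Note that this strict version also refuses SQL functions such as SUM, so a usable policy would need to be more permissive than the sketch.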

Meta-confusion

As a newcomer to the Semantic Web aspects of this, it’s a bit daunting to see entities defined with 100+ fields (https://schema.org/Person), often each field with its own complex sub-schemas, with mind-bendingly abstract hierarchies, links to 20 other 50-page specs, defining what “is” is, what a Thing is, what the definition of a definition is, etc. In the absence of artificial general intelligence, human programmers will have to end up reading this stuff anyway to make any sense of it and get Solid to work, so sparing their poor brains and simplifying to the essence of the problem really helps. I worry this stuff is veering off into architecture astronautism: https://www.joelonsoftware.com/2001/04/21/dont-let-architecture-astronauts-scare-you/

For interop, common simple-as-possible schemas (column names & types) could be defined for useful data - standard column subsets for tables like Calendar, Contacts, OvenSettings, etc. - ideally in the form of nice documentation for the programmers rather than tricky machine-readable semantic meta-languages.
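As an illustration of how lightweight such a convention could be, here is a sketch that checks whether a table contains an agreed column subset. The column names in `CALENDAR_COLUMNS` are invented for the example and do not come from any existing standard; apps would be free to add their own extra columns.

```python
import sqlite3

# Hypothetical "standard subset" for a Calendar table (names are made up).
CALENDAR_COLUMNS = {"Title": "TEXT", "StartsAt": "TEXT", "EndsAt": "TEXT"}


def missing_columns(conn: sqlite3.Connection, table: str, required: dict) -> set:
    """Return the required column names that `table` lacks or types differently."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    # `table` is interpolated directly, so it must come from trusted code.
    found = {
        row[1]: row[2].upper()
        for row in conn.execute(f"PRAGMA table_info({table})")
    }
    return {name for name, col_type in required.items() if found.get(name) != col_type}


conn = sqlite3.connect(":memory:")
# Conforming table with one extra app-specific column.
conn.execute("CREATE TABLE Calendar (Title TEXT, StartsAt TEXT, EndsAt TEXT, Color TEXT)")
# Non-conforming table: EndsAt is absent.
conn.execute("CREATE TABLE OtherCalendar (Title TEXT, StartsAt TEXT)")

ok = missing_columns(conn, "Calendar", CALENDAR_COLUMNS)
problems = missing_columns(conn, "OtherCalendar", CALENDAR_COLUMNS)
```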

Given the personal or small-organization nature of data pods, the “single file” approach SQLite uses (which generally has enough performance & concurrency to support all but huge sites with millions of users) could keep things very simple for backups, transfers to a different pod, downloading & browsing a copy of your data etc. A networked DBMS like MySQL or Postgres could also potentially work, though with a lot more complexity.
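The backup case in particular is short enough to sketch. Python's sqlite3 module exposes SQLite's online backup API as `Connection.backup`, which copies a consistent snapshot even while the source is in use. The file names and the `backup_database` helper are hypothetical.

```python
import os
import sqlite3
import tempfile


def backup_database(src_path: str, dest_path: str) -> None:
    """Copy the SQLite database at `src_path` into a new file at `dest_path`."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # copies every page of the main database
    finally:
        dest.close()
        src.close()


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "pod.db")
dest_path = os.path.join(workdir, "pod-backup.db")

# Made-up source data.
conn = sqlite3.connect(src_path)
conn.execute("CREATE TABLE Notes (Body TEXT)")
conn.execute("INSERT INTO Notes VALUES ('remember the milk')")
conn.commit()
conn.close()

backup_database(src_path, dest_path)

check = sqlite3.connect(dest_path)
restored = check.execute("SELECT Body FROM Notes").fetchall()
check.close()
```

Moving to a different pod would then amount to transferring that one file.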

Anyway figured I’d share these thoughts, in case they haven’t been considered already. Databases are critical, performance matters - and if your audience is developers, and not futuristic semantic AIs, then concise & simple docs will do wonders.

I might be off-base or sound like a SQLite zealot here, and am probably missing some important details, so definitely curious to hear the thoughts of some of the smart people involved -

Best,

-Luke

julien_leicher · October 19, 2020, 8:39am

Hi

I share most of your thoughts, coming from the same background I guess. Node Solid Server is indeed built on top of the filesystem, but that doesn’t necessarily mean Solid is too. Another Solid server may implement it differently, as long as it provides a representation mandated by the spec.

Like you, I don’t understand why SPARQL has only been partially implemented and why it’s kind of “on hold”, but I’m probably missing some background. AFAIK SPARQL was created precisely to query the web of Linked Data no matter where the data comes from, be it a Turtle file on the filesystem or a graph database.

There are a lot of challenges to tackle, but I think this kind of discussion will really move Solid forward. We all agree that things need to evolve, and providing users with the best UX should be a top priority if we want them to use the platform. Querying data fast is one of the things users are accustomed to, as you said.

josephguillaume · October 19, 2020, 9:49am

Avoiding thinking in database and query terms is probably one of the things I’m also finding most difficult, but I think this is also meant to be a feature rather than a bug of a linked data approach:
https://medium.com/virtuoso-blog/what-is-small-data-and-why-is-it-important-fbf5f267884

Smag0 · October 20, 2020, 4:25pm

The decentralized web is about combining multiple sources, I think.
User data on a Pod combined with something like SemApps microservices sources looks like a good compromise.

jucole · October 27, 2020, 8:59am

@LukeRissacher I too come from a front-end developer background (of 20 years) and I completely agree with your comments.

stresler · October 29, 2020, 4:37am

Been noodling on the same problem. Figured any app, once granted access, would need to construct a cache that lives on the pod and can be queried quickly for app-relevant data.

Alternatively, an app could build its own cache on its own hardware, but it would need to be encrypted with an expiring key and require some sort of trust network that doesn’t yet exist.

aschrijver · October 29, 2020, 6:58am

Hi @LukeRissacher,

I think you brought up some great points. But I don’t know if by posting here your feedback gets any visibility with the core team working on Solid. You might want to repost to one of the Solid Panels (https://github.com/solid/process) or talk about it on Gitter.

LukeRissacher · November 2, 2020, 10:49pm

Thanks all for the discussions & thoughts -

@aschrijver good idea - just posted it as an issue on the solid spec repo (https://github.com/solid/specification/issues/207). Hopefully an appropriate place for it.

aschrijver · November 4, 2020, 10:16am

Hi @LukeRissacher,

> I know Solid has deep SemWeb roots so I’m probably going to lose people here, and I may be misunderstanding this sentence, but - you’re never going to design a universal set of schemas to fit all possible use cases in the entire world, it’s a futile fantasy. Just something simple like addresses or time zones is a crazily complex multicultural nightmare, it’s a wonder we’ve made them work as well as we have. Different software has very different needs, and always, always comes down to unique specifics. It’s great to standardize some universal simple subsets that can be shared, but making big schemas that attempt to support all possible use cases means the length of the specs will tend toward infinity and nobody will use them.

I agree with you on this one. I think OpenEngiadina.net and TerminusDB.com and Go-Fed.org are also focusing on closed-world vocabularies to avoid the vastness of a unified SemWeb. For my own Fediverse-related plans I will follow a similar route.

cristianvasquez · November 9, 2020, 1:26pm

This is so true, and so often ignored.

The right schema complexity depends on the task and the community of use.

Sometimes it is easy to use common schemas, which is the case when everybody agrees (the price of something, latitude-longitude, etc.).

In the case of personal dataspaces this is often not the case. Instead of using big fat schemas from the beginning, each node should start as simple as possible and incrementally map the data to new vocabularies only when it makes sense. These mappings can happen through middleware, mapping languages, adapters, or whatever else the pod owners choose.

cristianvasquez · November 9, 2020, 1:40pm

I love the idea of self-contained little databases

Regarding interoperability, there are mapping languages that go from SPARQL to SQL.

R2RML: RDB to RDF Mapping Language (there are implementations)

So you could SPARQL the databases using the simple-as-possible schemas in these mappings.
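The core idea behind such a mapping can be shown without a real R2RML processor. The sketch below is not R2RML itself, just a hand-written column-to-predicate table that turns rows into N-Triples lines. The base IRI `https://pod.example/calendar/`, the table layout and the helper names are all assumptions; the two schema.org property IRIs are real.

```python
import sqlite3

BASE = "https://pod.example/calendar/"  # hypothetical pod location

# Which column becomes which RDF predicate.
COLUMN_TO_PREDICATE = {
    "Title": "https://schema.org/name",
    "StartsAt": "https://schema.org/startDate",
}


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def rows_to_ntriples(conn, table: str, key_column: str, mapping: dict) -> list:
    """Emit one N-Triples line per mapped, non-NULL column of every row."""
    cursor = conn.cursor()
    cursor.row_factory = sqlite3.Row  # allows access to columns by name
    lines = []
    # `table` is interpolated directly, so it must come from trusted code.
    for row in cursor.execute(f"SELECT * FROM {table}"):
        subject = f"<{BASE}{row[key_column]}>"
        for column, predicate in mapping.items():
            if row[column] is not None:
                literal = escape_literal(str(row[column]))
                lines.append(f'{subject} <{predicate}> "{literal}" .')
    return lines


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Calendar (Id INTEGER PRIMARY KEY, Title TEXT, StartsAt TEXT)")
conn.execute("INSERT INTO Calendar VALUES (1, 'Dentist', '2020-11-20T09:00:00')")
conn.commit()

triples = rows_to_ntriples(conn, "Calendar", "Id", COLUMN_TO_PREDICATE)
```

An actual R2RML setup would express the same mapping declaratively and let an engine translate SPARQL queries into SQL, rather than exporting triples up front.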

aschrijver · November 9, 2020, 1:45pm

Just an FYI, but the text you quote should be attributed to @LukeRissacher, who phrased it very well. In my prior post I used Markdown quoting and not Discourse quotes, which is probably the reason why it now seems I wrote that.

cristianvasquez · November 9, 2020, 1:48pm

you’re right.
