Wanted to offer a few thoughts on Solid from a developer’s perspective, having spent the last few days diving deeply into the specs & documentation (it’s been on my research list for a while, ever since Bruce Schneier posted about it early this year).
First off, I really appreciate the intention & goals; as a user I would very much appreciate having this kind of control over my data - where I could grant or revoke Facebook Messenger’s access to my chat history, point Google Calendar or LukesCustomCalendar at my calendar data, or have a secure way for some IoT thermostat to share data with the robotic window-openers.
It’s an inspiring vision that has clearly resonated with tech people, and that I hope will resonate with non-tech people too, especially as the humane-tech movement gains momentum.
As a developer who’s been building web apps, server software, game engines, and other systems for 20+ years, I worry there may be some fundamental design mistakes here that could prevent Solid from ever getting off the ground, or from offering a user experience as smooth as existing cloud / centralized-server approaches - so, at the risk of annoying everybody, I thought I’d weigh in with a few thoughts.
From the specs, Solid’s design appears to be essentially a filesystem over the internet - of necessity a fairly high-latency filesystem compared to, say, a hard drive or RAM. Functionally it’s similar to FTP, if you will, or WebDAV, using RDF dialects extensively as file formats. Its model is folders and files (containers and resources), which you can GET, POST, PUT, and DELETE as units via HTTP calls.
Filesystem or database?
Fundamentally, I wonder if the filesystem model might be a mistake. The bread and butter of most apps is quickly querying and filtering large datasets - i.e. extracting just the relevant data the app and user need at that moment - or modifying the data, generally one small piece at a time. In other words, apps would most commonly want Pods to fulfill the role of a database manager: a flexible filter engine sitting on top of an efficiently-searchable chunk of data (stored in b-trees, etc.).
As I was brainstorming potential app scenarios for Solid - notes, email archives, personal finances, calendar, photo library - I noticed all of them get difficult or start to fall apart when indexing and querying requirements come into the picture, which in nearly all these examples they do - search photos by tag/subject, find my next dentist appointment, summarize my yearly finances by category.
Files work OK for local apps where the code has low-latency access to storage, but over the network things will tend to grind to a halt as datasets grow, which they inevitably do.
If I want to find how much I paid for electricity over the last 10 years in a personal finance app, a database like SQLite can rip through an index or a table of tens of thousands of rows and return a response in microseconds, from a locally attached disk/SSD or from the disk cache in RAM. To do that over a Solid “filesystem”, an app (if it doesn’t cache into a database of its own) would need to transfer the entire RDF dataset over the wire, parse the somewhat space-inefficient syntax, iterate through the parsed representation, and sum up the electricity transactions. As the dataset grows this gets more and more inefficient - users will likely wait multiple seconds for answers to queries like that, which they’re used to getting near-instantly from modern apps and websites.
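To make the comparison concrete, here’s a minimal sketch of the local-database version (Python’s built-in sqlite3 module; the schema is hypothetical):

import sqlite3

# Hypothetical personal-finance schema, purely for illustration.
db = sqlite3.connect("finances.db")
db.execute("""CREATE TABLE IF NOT EXISTS LineItems (
                  Date TEXT, Category TEXT, Amount REAL)""")
# The index is what lets SQLite seek straight to the relevant rows
# instead of scanning the whole table.
db.execute("""CREATE INDEX IF NOT EXISTS idx_category_date
              ON LineItems (Category, Date)""")

# Over tens of thousands of rows this returns in microseconds-to-
# milliseconds, served from local disk or the OS disk cache.
(total,) = db.execute(
    "SELECT COALESCE(SUM(Amount), 0) FROM LineItems"
    " WHERE Category = ? AND Date > ?",
    ("Electric Bills", "2010-01-01")).fetchone()
print(f"Spent on electricity since 2010: ${total:,.2f}")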
Given the mental complexity users must already take on to understand & manage Pods & WebIDs, it’s critical that their experience with Solid be reliable and snappy, on par with what centralized “cloud” apps offer.
The “network filesystem” model puts a burden on app developers too. To work with Solid wherever any kind of intelligent querying & data filtering is needed (i.e. almost always), apps will either have to maintain homebrew index files for the queries one might want to run (with no transactional guarantees when writing multiple files), or maintain their own internal cache of the RDF data in a more convenient form like a local database. Not only is that error-prone and a lot of work for the developer (distributed data sync is hard), it’s a worse user experience than an app + database approach - I picture a big “loading data from your pod…” message at startup while the app transfers & parses a giant chunk of pod data to populate its cache.
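To illustrate the cache approach, here’s a deliberately naive sketch of the boilerplate every such app ends up writing (the finances.ttl resource and fin: vocabulary are made up; rdflib is a third-party RDF parser):

import sqlite3
import urllib.request
from rdflib import Graph, Namespace

POD_RESOURCE = "https://alice.example/finances.ttl"  # hypothetical pod URL
FIN = Namespace("https://example.org/finance#")      # made-up vocabulary

# Step 1: pull the ENTIRE resource over the wire, however big it has grown.
with urllib.request.urlopen(POD_RESOURCE) as resp:
    turtle = resp.read().decode("utf-8")

# Step 2: parse the Turtle, then re-load it into a local SQLite cache
# just to get back to something efficiently queryable.
graph = Graph().parse(data=turtle, format="turtle")
cache = sqlite3.connect("cache.db")
cache.execute("""CREATE TABLE IF NOT EXISTS LineItems (
                     Date TEXT, Category TEXT, Amount REAL)""")
cache.execute("DELETE FROM LineItems")  # naive full refresh; real sync is much harder
for txn in graph.subjects(FIN.amount, None):
    cache.execute("INSERT INTO LineItems VALUES (?, ?, ?)",
                  (str(graph.value(txn, FIN.date)),
                   str(graph.value(txn, FIN.category)),
                   float(graph.value(txn, FIN.amount))))
cache.commit()

And that’s the easy direction - pushing local edits back to the pod and reconciling conflicts is where the real pain starts.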
I think it’s wise not to neglect performance here or hand-wave it away (“HTTP/2 will solve it”). Storing records as RDF/Turtle, for instance, likely incurs a 10x or so storage-space penalty over a compact binary representation like SQLite and other DBMSs use (with an accompanying compute penalty, since RAM access latency is one of the biggest factors in performance). That ultimately means 10x the cost for Pods - making them more expensive for users to obtain, or requiring more ad-supported shenanigans to offer them for free - plus more wasted network bandwidth and compute, a greater carbon footprint, and so on.
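For a rough feel of where the space penalty comes from, compare one transaction as Turtle (vocabulary made up for illustration) with its tabular equivalent:

<#txn-10452> a fin:Transaction ;
    fin:date "2019-06-01"^^xsd:date ;
    fin:category "Electric Bills" ;
    fin:amount "84.20"^^xsd:decimal .

That’s ~150 bytes of text to transfer, tokenize, and parse - versus a few dozen bytes in a packed binary row that a b-tree index can seek to directly.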
Unfortunately my Inrupt Pod fell over with the sample apps I tried to load (it got stuck in an infinite redirect loop in the authentication pop-up windows), but from perusing the forums a bit it sounds like other developers have struggled with performance once datasets get into the hundreds of records or more.
I do like that the PATCH method is in the spec; at least that implies developers can update files partially rather than retransfer the entire file over the wire every time.
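For instance, node-solid-server has accepted SPARQL Update bodies on PATCH, so re-categorizing a single record might look roughly like this (resource name & vocabulary are hypothetical, and the exact PATCH dialect varies - newer spec drafts use N3 Patch instead):

PATCH /finances.ttl HTTP/1.1
Content-Type: application/sparql-update

PREFIX fin: <https://example.org/finance#>
DELETE DATA { <#txn-10452> fin:category "Electric Bills" } ;
INSERT DATA { <#txn-10452> fin:category "Utilities" }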
But that’s a pale shadow of the power & flexibility you get with a SQL-type interface. Picture, for a moment, that your pod contains SQLite databases (or equivalent one-file databases):
/notes.db
/calendar.db
/finances.db
And rather than reading the whole file/database via GET, you can run SQL queries over the network for precisely the data your user wants:
GET /finances.db
query="SELECT SUM(Amount) FROM LineItems
WHERE Category = 'Electric Bills' AND Date > '2010-01-01'"
One string of SQL goes out, one sum number comes back, instead of shuttling 10 years of data across.
Likewise on the “write” side, you can, with a very small string of code, update or delete thousands of rows of data on your Pod, right on the disk your Pod has convenient and fast access to:
POST /finances.db
"UPDATE LineItems SET Category = 'Utilities' WHERE Category = 'Electric Bills'"
SQL (and perhaps something like GraphQL or SPARQL) gives you the full power of a programming language, with arbitrarily complex expressions to state concisely what data you want and what to do with it. It’s like a drawer full of surgical tools versus the crude chainsaw that HTTP GET/POST/PUT/DELETE on whole files gives you.
WebACL could perhaps work at the database level.
Meta-confusion
As a newcomer to the Semantic Web aspects of this, it’s a bit daunting to see entities defined with 100+ fields (https://schema.org/Person), often each with its own complex sub-schemas, mind-bendingly abstract hierarchies, and links to 20 other 50-page specs defining what “is” is, what a Thing is, what the definition of a definition is, etc. In the absence of artificial general intelligence, human programmers will end up having to read all of this anyway to make sense of it and get Solid to work, so sparing their poor brains and simplifying to the essence of the problem really helps. I worry this stuff is veering off into architecture astronautism: https://www.joelonsoftware.com/2001/04/21/dont-let-architecture-astronauts-scare-you/
For interop, common simple-as-possible schemas (column names & types) could be defined for useful data - standard column subsets for tables like Calendar, Contacts, OvenSettings, etc. - ideally in the form of nice documentation for programmers rather than tricky machine-readable semantic meta-languages.
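As a purely hypothetical sketch, the interop doc for Calendar might amount to little more than a page of prose plus:

-- Hypothetical standard Calendar table (names illustrative).
CREATE TABLE Events (
    Id       INTEGER PRIMARY KEY,
    Title    TEXT NOT NULL,
    StartsAt TEXT NOT NULL,  -- ISO 8601, e.g. '2020-06-01T09:00:00Z'
    EndsAt   TEXT,           -- NULL for open-ended events
    Location TEXT
);

Apps needing more can add their own columns; the standard subset is what guarantees interop.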
Given the personal or small-organization nature of data pods, the “single file” approach SQLite uses (which generally has enough performance & concurrency to support all but huge sites with millions of users) could keep things very simple for backups, transfers to a different pod, downloading & browsing a copy of your data, etc. A networked DBMS like MySQL or Postgres could also work, though with a lot more complexity.
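As a small illustration of the single-file payoff: a consistent online backup of a pod database, using SQLite’s built-in backup API from Python, is just

import sqlite3

# Copies the whole database safely, even while it's in use (Python 3.7+).
src = sqlite3.connect("calendar.db")
dst = sqlite3.connect("calendar-backup.db")
with dst:
    src.backup(dst)
src.close()
dst.close()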
Anyway, I figured I’d share these thoughts in case they haven’t been considered already. Databases are critical, performance matters - and if your audience is developers rather than futuristic semantic AIs, concise & simple docs will do wonders.
I might be off-base or sound like a SQLite zealot here, and I’m probably missing some important details, so I’m definitely curious to hear the thoughts of some of the smart people involved -
Best,
-Luke