So I just finished reading “Let’s talk about pods” by Ruben Verborgh, and I feel he is probably right that the “document” as the atom of Pod data is doomed to not scale well and not promote data reuse between client apps, and that the only fix is for Pods to provide a triple query interface over all data.
Traditionally this has been backed by a database in the centralised app approach. Maintaining a database centrally seems easier than distributing this job to every user, which really raises the bar technically. Perhaps this is why Ruben’s vision will never take off, as file storage is much, much easier to set up than a database… unless a performant DB interface can run on files, like Athena (AWS) or BigQuery (GCP)? Files are also easier for the average Joe to understand.
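To make the files-versus-query idea a bit more concrete, here is a rough Python sketch of what a triple-pattern query across all of a Pod's files could look like. Everything in it is made up for illustration: the document paths, the `ex:` names, and the `match` helper are assumptions, not part of any Solid spec or library. It scans every file on each query, which is exactly the part a real Pod would need an index or database for.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Hypothetical Pod contents: each "document" (file) holds a list of triples.
POD_DOCUMENTS: Dict[str, List[Triple]] = {
    "/profile/card": [
        ("ex:alice", "foaf:name", "Alice"),
        ("ex:alice", "foaf:knows", "ex:bob"),
    ],
    "/posts/2024-01": [
        ("ex:post1", "schema:author", "ex:alice"),
        ("ex:post1", "schema:headline", "Hello Solid"),
    ],
    "/contacts/bob": [
        ("ex:bob", "foaf:name", "Bob"),
    ],
}


def match(
    pattern: Pattern, documents: Dict[str, List[Triple]]
) -> Iterator[Tuple[str, Triple]]:
    """Yield (document path, triple) for every triple matching the pattern.

    A None in the pattern acts as a wildcard. This is a naive full scan:
    the client does not need to know which file the data lives in, but
    the cost grows with the total size of the Pod.
    """
    for path, triples in documents.items():
        for triple in triples:
            if all(want is None or want == got for want, got in zip(pattern, triple)):
                yield path, triple


if __name__ == "__main__":
    # "Give me every name in the Pod", regardless of which file holds it.
    for path, triple in match((None, "foaf:name", None), POD_DOCUMENTS):
        print(path, triple)
```

The point of the sketch is only that the app asks for a pattern and never names a document; how to make that fast over plain files is the open question.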
What Ruben doesn’t mention in the article is data persistence. Self-hosted data is fine when that data doesn’t interact with other individuals’ self-hosted data or apps. But self-hosted data inevitably goes down, is removed, or moves (try finding self-hosted blog posts from 10-20 years ago; archive.org is usually your one and only hope). The great thing about something like Reddit is that I can look at the very first post from 18 years ago and see all the comments. If this were all Pod hosted, it’s highly unlikely that most, if any, of the Pods would still be operating and/or in the same place. This would be a very poor user experience, not to mention terrible for posterity. So this leads me to believe that Pods are not good for everything, and centralised apps, particularly where data from multiple Pods is interacting, are still going to be useful.
Not to mention, how do you make a distributed data app like a forum performant? It will only be as good as the slowest Pod, I imagine. You could cache the data, but then isn’t that just another way of centralising, and how does ownership work in that instance?
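Here is a toy simulation of the “slowest Pod” worry, as a sketch only. The Pod names, the latencies, and the `fetch_comment` and `load_thread` helpers are all hypothetical, and `asyncio.sleep` stands in for a real network request. Even when every Pod is queried concurrently, the thread is complete only when the slowest Pod answers; the alternative is a timeout, which makes the page fast but leaves holes in the thread.

```python
import asyncio
import time
from typing import Dict, Optional

# Hypothetical response times in seconds, one entry per commenter's Pod.
POD_LATENCIES: Dict[str, float] = {"alice": 0.01, "bob": 0.02, "carol": 0.3}


async def fetch_comment(pod: str, delay: float) -> str:
    """Pretend to fetch one comment from one Pod."""
    await asyncio.sleep(delay)  # stand-in for the network round trip
    return f"comment from {pod}"


async def load_thread(
    latencies: Dict[str, float], timeout: Optional[float] = None
) -> Dict[str, Optional[str]]:
    """Fetch all comments concurrently; a Pod that times out yields None."""

    async def fetch_or_none(pod: str, delay: float) -> Optional[str]:
        try:
            return await asyncio.wait_for(fetch_comment(pod, delay), timeout)
        except asyncio.TimeoutError:
            return None

    results = await asyncio.gather(
        *(fetch_or_none(pod, delay) for pod, delay in latencies.items())
    )
    return dict(zip(latencies, results))


if __name__ == "__main__":
    start = time.perf_counter()
    full = asyncio.run(load_thread(POD_LATENCIES))
    print("complete thread:", full, f"{time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    partial = asyncio.run(load_thread(POD_LATENCIES, timeout=0.1))
    print("with timeout:", partial, f"{time.perf_counter() - start:.2f}s")
```

Without a timeout the total time tracks the slowest Pod, not the average one. With a timeout the slow Pod's comment is simply missing, which is the user-experience problem in a different form.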
I think the future is probably a hybrid of traditional centralised data and personal Pods. Perhaps if Solid takes off, all of these problems will find solutions. This is just me thinking aloud, reflecting on a great article.
I think the whole idea is that Solid can comply with “the right to be forgotten”.
If you want to save your data in a durable way, you probably do need to make back-ups or move some data (data you still want to keep but no longer wish to share) to another place.
Eventually, you’ll need to pay the price to fight entropy if you want persistence though.
Sure, I’m just concerned that that price is centralised corporate ownership.
I like Reddit, particularly one of its founders, Aaron Swartz, who is a hero of mine and whose ideas and struggles were all about freedom of information. The fact that you can go back 18 years on Reddit is a testament to its infrastructure and design, and to the fact that it was a great idea. Everyone is anonymous on Reddit, yet it’s all about community.
Perhaps ownership, the inward focus on the self, is antithetical to the idea of community, which cannot exist without people giving to the collective good?
As I understand it from the latest news, Reddit is agreeing to sell its raw data to companies like Google to train artificial intelligence models (article in Spanish).
As he says himself, the late Aaron Swartz, co-founder of Reddit, would be turning in his grave if he knew about this deal.
Training ML is not against community or sharing so I don’t know how Aaron would feel about it. He was a freedom of information guy.
People didn’t care if their comments were out there until someone had a use for them (despite bots scraping Reddit for years). I guess I’d prefer them selling them for AI rather than having ads in my face all the time. Somehow the servers have to be paid for…
Yes, it may not be fair to make a hypothetical statement, especially involving Aaron. My apologies.
Now, I don’t think training artificial intelligence is right or wrong. What does seem bad to me is that the material for this training is obtained without the explicit permission of the users who generated it.
You may not feel AI harvesting data without permission is a violation, but others do. I am reminded of Steven Wright’s quip: “To steal ideas from one person is plagiarism; to steal from many is research.”
Or from Oscar Wilde: “Talent borrows but genius steals”! Or perhaps better, from T. S. Eliot: “Immature poets imitate; mature poets steal; bad poets deface what they take, and good poets make it into something better, or at least something different. The good poet welds his theft into a whole of feeling which is unique, utterly different from that from which it was torn; the bad poet throws it into something which has no cohesion. A good poet will usually borrow from authors remote in time, or alien in language, or diverse in interest.”
Why do we suddenly care if the text is used to train an AI? We don’t care if Google uses it for search (and to sell ads), or if a human reads it and quotes it, or if it influences their thoughts going forward. What’s different?
Why do we suddenly care if the text is used to train an AI?
So ChatGPT now has a free, no-login tier. All of your interactions are harvested to make their product better so they can sell it, with the option to opt out of harvesting reserved for paid subscribers. So data privacy costs money. Do I care if my interactions are commercialized to make someone else rich and potentially to build something for profit that I have no say in or control over? Yes, I do.
Yes, you stated this before. The question was: why do you care now? I’m trying to tease out the difference between Google indexing your data for search and providing free search in return so they can make money off you via ads, versus training an AI model on your data and providing access to a large language model. Do you see how they have been using your data since the 90s, but in both cases giving you something in return?
As they say, if you’re not paying for the product, you are the product.
One benefit the Solid community has, theoretically, is similar to the Mastodon system. Since Reddit and the like are each owned by one entity, I am subject to terms, agreements, and interactions under their system, regardless of which subreddit I interact with.
With Mastodon and Solid, I can pick a provider for a community I feel more comfortable with. And (hopefully) with good interop capabilities, if I ever start to disagree with the server hosting my Pod space, I can move it to another Solid-compliant server whose Terms of Service, usage policies, or other values I agree with.
In a similar vein, I think it may not be possible to treat client apps as a fixed set of apps that can interact with every instance of a server. Judging by the Solid standards themselves, this almost seems encouraged due to the nature of agents.
Having an efficient query interface over the entirety of a Pod would be nice and is probably feasible given enough time and fleshing out of specs.
But on the topic of “why Solid”, from an economic perspective it may simply be about choice. Someone using Solid right now, as opposed to other services, may be satisfied with slower service and fewer data persistence guarantees because they value the Solid community and its ethics over an alternative platform. And if it evolves to payment, they may choose to pay for privacy. I imagine at least a few of the users here do in fact pay for VPNs, secure email, and other privacy-enhancing services.
But this is all just my drop in the bucket of opinions. I do like your thought-provoking post, though.
Yeah, it makes sense from an individual-rights and legal perspective, but these things don’t govern infrastructure performance. There was a good post here from another experienced developer who has concerns about the whole feasibility of distributed content from a performance perspective.