Validation and filter feature


#1

@kjetilk: "There have been a number of issues reported (linked in #882) where it seems that users’ data files have been corrupted. Some of these things should be fixed on the frontend, so that it makes sure invalid data is not sent. However, the backend should also check, e.g. by doing an RDF validation as suggested in #882, but we could also imagine SHACL validation, etc. If Inrupt becomes a large POD provider, it may also come with legal requirements.

Also, validation may not only be a boolean accept or error, but possibly also filtering to accept valid parts. For v.next, we need to have an architectural element that does this, but we may need to address parts of this problem already for 5.0.0."

Consider this a super-issue for discussing what should be in 5.0.0, and if we should try to make that reusable in v.next. Also, we may discuss if we do not attempt to solve it on the backend and refer to frontends to do it.

We’ve had a meeting about this here in Ghent between @rubensworks and myself. Since this is a detailed code discussion, it belongs here, but there is a broader discussion too, about what exactly we should validate and if we should also do filtering and/or transformation, and that is a discussion we could have on Discourse.

This is what we have arrived at:

We have decided that the architectural implications of filtering built on the proposed validation (i.e. accept/reject) framework are small, and therefore, we decided to go for validation in the first iteration.

We found that we should use the try/catch system and design a pipeline where accept (resulting in e.g. a 200 response) is issued if the pipeline doesn’t throw. Since we have not found a typed error library for JS, we figured that modules in the pipeline would add a type attribute to the error object declaring that it is a validation error. The error object would also have a message and an error name (matching its class name, and probably exposable as an RDF class).

The calling code (e.g. the HTTP handler) would then catch the error and, by checking the type attribute, respond with a 400 error. The response should include a SHACL Validation report, with the message as an sh:message.
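
As a rough illustration of the pipeline described above, here is a minimal JavaScript sketch. All names (ValidationError, RdfSyntaxError, checkRdfSyntax, runPipeline, handleWrite) are hypothetical and not taken from node-solid-server, and the syntax check is a toy placeholder rather than a real RDF parser. A real response would also carry the SHACL validation report, which is left out here.

```javascript
// Hypothetical names throughout; this is an illustration, not NSS code.

class ValidationError extends Error {
  constructor (message) {
    super(message)
    this.name = this.constructor.name // error name matches its class name
    this.type = 'validation' // the type attribute the calling code checks
  }
}

class RdfSyntaxError extends ValidationError {}

// Toy placeholder for an RDF syntax checker: it only compares the number
// of '<' and '>' characters. A real module would run an RDF parser.
function checkRdfSyntax (body) {
  const opens = (body.match(/</g) || []).length
  const closes = (body.match(/>/g) || []).length
  if (opens !== closes) {
    throw new RdfSyntaxError('Unbalanced angle brackets in IRI')
  }
}

// Runs every validator in order; returns normally only if none throws.
function runPipeline (validators, body) {
  for (const validate of validators) validate(body)
}

// Calling code, e.g. the HTTP handler: accept if the pipeline does not
// throw, answer 400 for validation errors, rethrow anything else.
function handleWrite (body, validators = [checkRdfSyntax]) {
  try {
    runPipeline(validators, body)
    return { status: 200 }
  } catch (err) {
    if (err.type === 'validation') {
      return { status: 400, error: err.name, message: err.message }
    }
    throw err // not a validation problem, let it propagate
  }
}
```

Adding another check later would then only mean appending a function to the validators array, as long as it throws an error carrying the same type attribute.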

In the first iteration, the only validation class in the pipeline would be an RDF syntax checker as done in #882. I haven’t studied SHACL very deeply, but it seems that for e.g. the RDF checker, we could do:

What we haven’t yet decided is how to configure the pipeline, but that’s on a different abstraction layer.


#2

To play devil’s advocate, why should the back end check Turtle syntax?
Will this slow down the system, for large files particularly?
Can it be done in parallel with streaming it to disk, on a separate CPU? Not really if the file directly overwrites the existing data…
Response time is critical… but maybe mainly for patch?


#3

Feature creep into the back end makes it more complex to implement and could reduce options to slot in different implementations.


#4

@timbl There seem to be quite a few errors happening for developers that arise from Turtle files having invalid syntax (linked in solid/node-solid-server#882), but these assumptions haven’t been verified, so they might be faulty.

In general I think it’s good to have validation on the back-end when it comes to input from users, as everyone makes errors at times. But that might not be the case for Solid servers, as they can assume that requests from users will be well-formed?

There will be an overhead wrt performance, but I think the value of making sure Turtle (and other RDF files) have a valid syntax is greater than this performance cost. But this is another assumption that might be wrong. Maybe an option is to make it opt-in for POD providers (e.g. a flag in config, in CLI, and/or it could be on by default when running bin/solid-test)? Or make it opt-in for app developers, e.g. some option available to set on requests?
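
To make the opt-in idea a bit more concrete, here is a minimal sketch of how the two switches could combine. The names validateRdf (server config) and validate (per-request option) are made up for illustration; no such flag or request option exists in NSS as far as this thread goes.

```javascript
// Hypothetical option names, for illustration only.
const defaults = { validateRdf: false } // validation is off unless opted in

// Validation runs if the POD provider enabled it in the server config,
// or if the app explicitly asked for it on this request.
function shouldValidate (serverConfig = {}, requestOptions = {}) {
  const config = { ...defaults, ...serverConfig }
  return config.validateRdf === true || requestOptions.validate === true
}
```

With this shape a request can switch validation on but never off, so a provider that requires validation keeps it for every write.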

@happybeing I realize that we might have introduced this feature a bit too hastily, and should have gone the route of discussing it with the community first. Maybe this is something that could be discussed and possibly standardized through the Solid Community Group?

In any case, I do think it is an important feature to at least have opt-in, as people make mistakes.


#5

I am tempted to say the server should validate incoming turtle files - mostly because it would have saved me some debugging while I was trying to send an, yes, invalid turtle file - and it showed up way later when trying to use the data.

But I don’t think it is a good idea.

  • You don’t expect a webserver to validate HTML files you upload - right? Neither do you expect it to validate your JavaScript files.

  • And what about images? Should it verify that all images are valid?

  • And once it validates syntax, you soon want it to validate against some sort of schema (or shape) definition.

  • Where should it stop? Why only turtle files?

  • And then there’s performance as Timbl points out.

It is an inherent weakness of Solid that everything must be validated by the client - there is no “intelligent server” here, as there would be if you made a backend for a normal web-app. It is much like building desktop clients that execute SQL directly against a central database - there is no way to enforce any kind of business validation rules on the server.

Having said all that, I think Solid servers may be forced to implement “copyright infringement filters” and similar for terrorism and child abuse etc. according to future law discussions. See for instance https://www.theverge.com/2018/6/19/17480344/eu-european-union-parliament-copyright-article-13-upload-filter. Which is a much bigger issue …


#6

Yeah, my concern with validation started with simple RDF validation on TTL files, but I was planning to support simple RDF validation on other triple files such as RDF/XML, N3, JSON-LD, etc. But this is a much bigger issue, and needs a broader discussion, which I’m glad we’re having now.

FYI: I’ve pushed my work on this to feature/simple-validation on NSS; it’s not much, but could be continued at some later point when we know more how we want to handle this.

@JornWildt You already know of this thread, but I wanted to link “Illegal content - copyrighted material and so on”, as parts of what you mention relate to it.