Mixing Linked Data and Text?

Imagine I’m building a personal notebook that also serves as a knowledge base. Naturally, I would have this:

  • The knowledge base part is Linked Data;
  • The notebook part is semi-structured text (e.g. using Markdown syntax).

Is there a way to make them play nicely together? That is, how can I mix text and Linked Data?

For example, imagine a note is stored as follows (in Turtle):

:some-text a :Note;
  :content "##Main\nThis note refers to @:alice who is a person in my knowledge base".

which essentially describes a note whose content is this Markdown-syntax text:

##Main
This note refers to @:alice who is a person in my knowledge base

Is there an existing way to make @:alice a reference to an entity (:alice) in my knowledge base, based on standards?
It’s completely fine to change how the note is represented in RDF / Linked Data, as long as it’s RDF/LD and it’s possible to represent a note (of long text).

Surely I can post-process the text in my hypothetical application and render it however I like, and thus find and display the link there. But that does not seem elegant.

The one way I can think of, which isn’t terribly ergonomic, is RDFa, i.e. Linked Data embedded in HTML.
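
To illustrate, the note body embedded in HTML with RDFa could look roughly like this (just a sketch; the ontology terms and IRIs are placeholders, not an existing vocabulary):

<div about="http://example.com/notes#some-text" typeof="http://example.com/ontology#Note">
  <h2>Main</h2>
  <p>This note refers to
     <a property="http://example.com/ontology#mentions"
        href="http://example.com/people.ttl#Alice">Alice</a>
     who is a person in my knowledge base.</p>
</div>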

If Alice is defined at http://example.com/people.ttl#Alice, then putting that URL in the note would link to the right fragment in the RDF. Just use URLs instead of @alice.
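
For example (assuming the note body is Markdown), it could just contain the link directly:

This note refers to [Alice](http://example.com/people.ttl#Alice) who is a person in my knowledge base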

It seems to be a possibility. But it requires HTML, if I understand correctly? Sounds like a heavy dependency…

Doesn’t this fall into the same category as post-processing the text (which I hope could be avoided)?
In particular, I believe a tool parsing RDF will not interpret this URI inside the text as a reference to the entity, will it? That means there will not be an edge in the knowledge graph from this (part of the) text node to Alice’s node.

Hi @renyuneyun, I maintain my personal knowledge graph using Markdown. Perhaps this lib is interesting for your use case: GitHub - cristianvasquez/vault-triplifier

Very cool!

There are no existing standards that I am aware of for RDF in Markdown, but there are plenty of possible solutions, including Cristian’s.

RDFa works by annotating links and containing elements. It can therefore be implemented in different ways in different markdown dialects.

Link attributes/link types/link flavors are probably the most basic way of handling this: the rendered HTML will include an RDFa edge from the page URL to the link target.

Here’s a range of syntaxes that I’ve come across.

I find that the cleanest approach is using shortcut reference links with the multimarkdown syntax, so I can write:

This note refers to [alice] who is a person in my knowledge base

[alice]: http://example.com/people.ttl#Alice rel="http://example.com/ontology#mentions"

And this would generate the triple:

<> <http://example.com/ontology#mentions> <http://example.com/people.ttl#Alice>.

For this approach to be manageable I would still want editor support to autocomplete both predicates and subjects, and to create reference links.

The pandoc notation appears to have a bit more support among renderers, and in principle it can also be used with reference links:

[alice]: http://example.com/people.ttl#Alice {rel="http://example.com/ontology#mentions"}

It also has the advantage that it can be used on other container elements, as in the examples in GitHub - javalent/markdown-attributes: Add attributes to elements in Obsidian

This then allows something like (untested):

Bob {about="http://example.com/people.ttl#Bob"}
- knows [Alice] 

[Alice]: http://example.com/people.ttl#Alice {rel="foaf:knows"}

to generate

<http://example.com/people.ttl#Bob> foaf:knows <http://example.com/people.ttl#Alice>.

Markdown is really not designed to include non-human readable content, so all these solutions are still a little awkward.

For a more complete solution, it really does make sense to use HTML editing software that allows adding RDFa and/or web annotations, e.g. https://dokie.li/

Personally, I’ve been using Semantic MediaWiki (https://www.semantic-mediawiki.org) for several years, but I’m still experimenting with alternative solutions.

Needless to say, SolidOS does not yet provide support for anything like this, though it eventually could, by switching to a different Markdown renderer or enabling a plugin.

P.S. apologies if the formatting of this post still doesn’t work after multiple edits.

Thank you, Joseph! It’s very useful to learn about so many existing approaches in similar directions.
That’s a lot of information for me to digest.

If I understand correctly, they are all different ways to “post-process” the (Markdown) text after it is obtained from somewhere else (either as a file or as a text node in the knowledge base). That is, they focus on Markdown => RDF.

Yeah, I understand that. That’s why I was saying “text (e.g. Markdown)”.
To me, the best thing would be something in RDF that can contain (ordered) text and RDF content. Something like this:

some-text a :Note;
  :content [
    :type :markdown;
    :data [ 
      rdf:List (
        "##Main\nThis note refers to ",
        [ a :Link; :text "Alice"; :link :alice ],
        " who is a person in my knowledge base"
      )
    ]
  ]

(I may be wrong about the list syntax, but that’s the closest thing I can think of.)

Therefore, the App can parse this RDF, which automatically results in sensible edges between nodes in the graph (and when presenting it to users, it can play some tricks to contract/compress the “sub-parts” of the note into a single node). Also, the App will re-serialize the text as something like "##Main\nThis note refers to [Alice](https://url-to/:alice) who is a person in my knowledge base", which is sent to the Markdown parser for rendering.
This would make it generic enough for other semi-structured texts, e.g. reStructuredText, Org-mode.

Of course, there are still some issues… In particular, I remember rdf:List does not play well with OWL, but unfortunately I don’t remember exactly why. And there are lots of details that need to be carefully designed, including the conversion from an RDF node to text (e.g. Markdown) content, from text to RDF, the complexity of the “nested” nodes supported, editor support…

Yes, dokieli is the first thing I wanted to consult… However, I never got it working properly; in particular, the annotations and comments never showed up… But that’s off-topic here.

Thanks for the clarification.

This would be valid syntax:

@prefix : <#>.
:some-text :data (
        "##Main\nThis note refers to "
        [ a :Link; :text "Alice"; :link :alice ]
        " who is a person in my knowledge base"
      ) .

Without the shorthand syntax, it’s

@prefix : <#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.

:some-text
    :data
            [
                rdf:first
                    """##Main
This note refers to """;
                rdf:rest
                        [
                            rdf:first [ a :Link; :link :alice; :text "Alice" ];
                            rdf:rest
                                    [
                                        rdf:first
                                            " who is a person in my knowledge base.";
                                        rdf:rest rdf:nil
                                    ]
                        ]
            ].

At the moment this doesn’t really add anything more than including the URL in the markdown and then interpreting all links as rel=:link.

However, you could use a notation like

[ a :Link; :object :alice; :predicate :mentions; :text "Alice" ]

This could then be used to infer the triple:

:some-note :mentions :alice.

What this brings to mind is the JSON serialisation of many different rich text editors, e.g.

https://prosemirror.net/docs/guide/#doc
https://editorjs.io/base-concepts/

It should be possible to define a JSON-LD context for one or more of these, define a link component that includes a predicate, and then you’d get editor support more or less out of the box, as well as various format conversion tools, depending on which editor/format you pick.
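
As an untested sketch of the idea (the property names below are invented for illustration, not taken from any editor’s actual schema), a document with an inline link/mention component could be given a JSON-LD context roughly like this:

{
  "@context": {
    "ex": "http://example.com/ontology#",
    "content": { "@id": "ex:content", "@container": "@list" },
    "text": "ex:text",
    "mentions": { "@id": "ex:mentions", "@type": "@id" }
  },
  "@id": "http://example.com/notes#some-text",
  "content": [
    { "text": "This note refers to " },
    { "text": "Alice", "mentions": "http://example.com/people.ttl#Alice" },
    { "text": " who is a person in my knowledge base" }
  ]
}

Expanded, this gives an ordered list of text nodes, with the middle node carrying an ex:mentions link to Alice, which is roughly the structure sketched earlier in the thread.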

I was doing some more thinking about this, and the most promising rich text editors/formats to start with would probably be:

  1. ProseMirror, because it would have the advantage of existing Markdown compatibility and a clear schema for which a JSON-LD context could be defined (see the ProseMirror markdown example).

  2. Quill, because its delta-based format is designed to support collaborative editing. An implementation in Solid could therefore use an append-only RDF file with patches for every edit. There are obviously other approaches to collaborative editing, but if it’s a priority feature, then this is a good starting point.
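
To give a feel for the format, the example sentence as a Quill delta would look roughly like this (the link attribute is a built-in Quill format; a rel attribute carrying the predicate would be a custom format you’d have to register yourself):

{
  "ops": [
    { "insert": "This note refers to " },
    { "insert": "Alice",
      "attributes": {
        "link": "http://example.com/people.ttl#Alice",
        "rel": "http://example.com/ontology#mentions"
      }
    },
    { "insert": " who is a person in my knowledge base\n" }
  ]
}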

Indeed. This is something I would expect as well. I didn’t discuss it further because it may be endless… How exactly do we represent it; what is the set of features to be supported; how to transform it to the text representation; how to transform from the text representation to RDF…
(Not sure whether the S-expression syntax used by TeXmacs would help here as well.)

Lots of information again. LOL

I did a quick skim through, and some of them did not actually describe what their serialization looks like (e.g. Lotion, Quill).

For those that did describe it, I wonder how they relate to pandoc’s schema?
I imagine pandoc started earlier than most of them and has the richest set of features (unless we’re talking about HTML or derivatives of HTML).

Anyway, interesting to know them!

That’s interesting. Is it using CRDTs? Or is it more efficient than CRDTs?
I remember some other Solid Apps posted in the forum (SolidCryptPad, Umai) mentioned CRDT for collaboration / synchronization. Not sure how they would compare with each other.

(Just some random comments. Didn’t expect answers.)

Apologies if you’ve addressed this above; it’s a lot to read :-). Why not use RDFa rather than reinventing a parsing mechanism? Use something like Markdown with Attributes to do something like this:

[Alice](http://ex.com/Alice){typeof="schema:Person" property="foaf:name"}
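
Assuming the renderer passes the attributes through (untested), the rendered HTML would be something like:

<a href="http://ex.com/Alice" typeof="schema:Person" property="foaf:name">Alice</a>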

An advantage of this is that both the HTML and the Markdown are readable by humans and machines.

Sorry :slight_smile:
Yes, the PHP markdown syntax works, though I prefer multimarkdown’s syntax for link attributes in particular.

There’s a screenshot of the document format + examples. From the package.json it looks like they’re using tiptap with the starter kit + a few extras.

Tiptap is itself based on ProseMirror, so the interest in Lotion is mainly to see what kind of interface already exists.

Full documentation of the format is here:

Actually, all those I cited are extensible, so you can define your own custom components, e.g. to provide a custom interface to input/import an entire Linked Data shape.
It does look like you could use pandoc’s JSON AST syntax too. I was just focused on (web-based) rich text editors.

Quill uses operational transforms.

There are CRDT bindings using Yjs for a number of rich text editors, but it’s not clear to me how I’d use them with Solid.

I had a go at converting the pandoc json format into JSON-LD, out of curiosity and because I had pandoc installed anyway.

It turns out it’s not really viable as an RDF-based format, because (1) it uses array position to express semantics, so preprocessing is necessary, and (2) nearly everything is a blank node.

I have a proof of concept to fix (1):

Unfortunately I think (2) is a deal-breaker: while the preprocessing step could add identifiers, they won’t survive processing through pandoc, so they’re only really useful if the user only edits the text as RDF, which defeats the point of being able to convert to other formats.

If someone goes down the rich text editor route, I’d therefore suggest using a format that assigns IDs to every block (and ideally all inline elements too).

Needless to say, converting the Markdown to HTML and then reading the RDFa is a much easier path from Markdown to RDF, but the RDF will only include the structured data, not the text.

Edit: I’ve just understood skolemization, so I may need to revisit this conclusion later.

In my opinion, it’s crucial to allow users the freedom to produce RDF in the way that best suits their needs. It’s also valuable to include provenance information so that it’s easy to trace back to the original representation.

Consider, for example, a collection of text representing notes taken over a lifetime. To extract RDF from this text, you may use various mechanisms, such as a custom script or the OpenAI API. Each time you update the text, your knowledge graph will be updated accordingly. Your applications can use this RDF and benefit from pointers to the origin, using selectors (to show the rich text, for example).

An example selector could have these elements:

selector:
	source: http://alice/my-notes/Alice.md
	start-offset: char 20
	end-offset: char 84

But it can also be a query, a fragment identifier (#), an XPath, etc.

You can apply this mechanism to anything you want to represent as a triple: Excel files, git logs, tweets, or anything else. You can materialize such RDF, but you can also use adapters to produce RDF on demand. It depends on the use case.

My use case is to profit from a personal knowledge graph and continue using my tools of choice. I use selectors to represent relationships between different portions of notes.

Here’s an example:

relation:
	data: "anything you want to say"
	source: 
		source: file://home/my-notes/Alice.md
		start-offset: char 50
		end-offset: char 60
	target: 
		source: file://home/my-notes/Bob.md
		start-offset: char 10
		end-offset: char 30
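
For what it’s worth, the W3C Web Annotation vocabulary already has terms for this kind of selector; the relation above could be sketched roughly as follows (untested, and whether the two spans go into the body or into two targets is a design choice):

@prefix oa: <http://www.w3.org/ns/oa#> .

<#relation-1> a oa:Annotation ;
    oa:motivatedBy oa:linking ;
    # the free-form "data" field could become an additional oa:TextualBody
    oa:hasBody [
        a oa:SpecificResource ;
        oa:hasSource <file://home/my-notes/Alice.md> ;
        oa:hasSelector [ a oa:TextPositionSelector ; oa:start 50 ; oa:end 60 ]
    ] ;
    oa:hasTarget [
        a oa:SpecificResource ;
        oa:hasSource <file://home/my-notes/Bob.md> ;
        oa:hasSelector [ a oa:TextPositionSelector ; oa:start 10 ; oa:end 30 ]
    ] .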