Errors parsing XML with rdflib.js in the browser


#1

I am attempting to parse a Project Gutenberg catalog.rdf for an in-browser eReader, hence rdflib.js.

My initial idea was to use SPARQL queries to extract the data. However, I can’t seem to get rdflib.js to execute the queries and return results (I assume I just don’t know how to wire them up).

Following the tutorials I seem to have figured out the basic use of the match routines, and have managed to get a list of the books in the catalogue. Unfortunately, when I then attempt to get the corresponding properties (title and author) they are coming back as [object NodeList]

Extremely stripped down example of book RDF:

<pgterms:etext rdf:ID="etext27785">
  <dc:title rdf:parseType="Literal">A Book About Lawyers</dc:title>
</pgterms:etext>

My code:

let store = $rdf.graph();
$rdf.parse(stm,store,baseUrl,'application/rdf+xml');
let books = store.match(undefined, types.RDF('type') , types.PGb('etext')).map(t=>t.subject);
let lib = books.map(b=>{
    let props = store.match(b, null, undefined);
    console.debug("Book: " + schema['_id']);
    props.forEach(a=>{
        console.debug(a);
    });
});

Resulting triple (note object.value is "[object NodeList]"):

{
  "subject": {
    "termType": "NamedNode",
    "value": "http://www.gutenberg.org/feeds/catalog.rdf#etext14600"
  },
  "predicate": {
    "termType": "NamedNode",
    "value": "http://purl.org/dc/elements/1.1/title"
  },
  "object": {
    "termType": "Literal",
    "value": "[object NodeList]",
    "datatype": {
        "termType": "NamedNode",
        "value": "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"
    }
  },
  "why": {
    "termType": "NamedNode",
    "value": "https://example.com/datasets/gutenberg/catalog.rdf.gz"
  }
}

Am I using the library incorrectly?


#2

Forgive me if you know Javascript, but have you tried looking at object.value? It would appear to hold a list of titles returned by the query. So console.debug(a.object.value) instead of console.debug(a) to see it. Console commands stop after one level of dereferencing which means you won’t see the internals of objects other than the main object. If that’s not your problem, you should tell us what is the problem - what did you expect to happen and what happened instead?


#3

So I double checked my setup, and my inspection of the value. Unfortunately, it still seems to be an issue; a.object.value does not contain a list, it contains the string value: "[object NodeList]"

I inspected using the console inspector, as well as modifying the test code to explicitly report the datatype.

Exact Input:

    <pgterms:etext rdf:ID="etext27785">
      <dc:publisher>&pg;</dc:publisher>
      <dc:title rdf:parseType="Literal">A Book About Lawyers</dc:title>
      <dc:creator rdf:parseType="Literal">Jeaffreson, John Cordy, 1831-1901</dc:creator>
      <pgterms:friendlytitle rdf:parseType="Literal">A Book About Lawyers by John Cordy Jeaffreson</pgterms:friendlytitle>
      <dc:language><dcterms:ISO639-2><rdf:value>en</rdf:value></dcterms:ISO639-2></dc:language>
      <dc:subject><dcterms:LCSH><rdf:value>Lawyers -- Great Britain -- Anecdotes</rdf:value></dcterms:LCSH></dc:subject>
      <dc:subject><dcterms:LCC><rdf:value>KD</rdf:value></dcterms:LCC></dc:subject>
      <dc:created><dcterms:W3CDTF><rdf:value>2009-01-12</rdf:value></dcterms:W3CDTF></dc:created>
      <pgterms:downloads><xsd:nonNegativeInteger><rdf:value>20</rdf:value></xsd:nonNegativeInteger></pgterms:downloads>
      <dc:rights rdf:resource="&lic;" />
    </pgterms:etext>

Code:

let store = $rdf.graph();
$rdf.parse(stm,store,baseUrl,'application/rdf+xml');
let books = store.match(undefined, types.RDF('type') , types.PGb('etext')).map(t=>t.subject);
let lib = books.map(b=>{
    let props = store.match(b, null, undefined);
    console.debug("Book: " + schema['_id']);
    props.forEach(a=>{
        console.debug(a.predicate.value + '===' + a.object.value);
        console.debug(typeof a.object.value);
    });
});

Output:

Book: etext27785
http://www.w3.org/1999/02/22-rdf-syntax-ns#type===http://www.gutenberg.org/rdfterms/etext
string
http://purl.org/dc/elements/1.1/publisher===Project Gutenberg
string
http://purl.org/dc/elements/1.1/title===[object NodeList]
string
...

EXPECTED:
The text value of the title as “A Book About Lawyers”

ACTUAL:
The text value of the title as “[object NodeList]”

Notes

  1. The title is a text literal (no sub elements)
  2. There is only one “title” item in the book record
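
To see which properties are affected, here is a minimal sketch that filters statements by the symptom shown above. It runs on plain objects shaped like the triple I posted earlier; `isMangledLiteral` and the sample `statements` array are hypothetical names for illustration, not part of rdflib.js.

```javascript
// Datatype IRI taken from the triple dump earlier in this thread.
const XML_LITERAL = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral';

// Hypothetical helper: true when a statement's object is an XMLLiteral
// whose value is the stringified NodeList instead of the element text.
function isMangledLiteral(st) {
  const o = st.object;
  return o.termType === 'Literal' &&
    o.value === '[object NodeList]' &&
    Boolean(o.datatype) &&
    o.datatype.value === XML_LITERAL;
}

// Sample data mirroring the output above (plain objects, not real rdflib terms).
const statements = [
  {
    predicate: { termType: 'NamedNode', value: 'http://purl.org/dc/elements/1.1/publisher' },
    object: { termType: 'Literal', value: 'Project Gutenberg' },
  },
  {
    predicate: { termType: 'NamedNode', value: 'http://purl.org/dc/elements/1.1/title' },
    object: {
      termType: 'Literal',
      value: '[object NodeList]',
      datatype: { termType: 'NamedNode', value: XML_LITERAL },
    },
  },
];

// Collect the predicates whose values were mangled.
const affected = statements.filter(isMangledLiteral).map(st => st.predicate.value);
console.log(affected);
```

Against a real store, the same filter could presumably be applied to the array returned by store.match(b, null, undefined).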

#4

I put pg1.rdf from the Gutenberg tarball in the same directory as this script and it works for me. Does it work for you?

let $rdf = require('rdflib')
let a = $rdf.sym('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')
let ebook = $rdf.sym('http://www.gutenberg.org/2009/pgterms/ebook')
let title = $rdf.sym('http://purl.org/dc/terms/title')

let location  = './pg1.rdf'
let subj = 'file://'+location

let txt = require('fs').readFileSync( location, 'utf-8' )
let store = $rdf.graph()
$rdf.parse( txt, store, subj,'application/rdf+xml' )
let books = store.match( undefined, a, ebook )
for (const st of books) {   // for...of avoids the implicit global and index iteration of for(b in books)
    let book = st.subject
    let bookTitle = store.match( book, title ,undefined)[0].object.value
    console.log(bookTitle)
}

#5

Or, even simpler, since you know there is only one book in the file and there is only one title per book, instead of the last six lines of the code above, use:

let book = store.any(undefined, a, ebook)
let bookTitle = store.any(book,title,undefined).value
console.log(bookTitle)


#6

That looks node-like (require('fs')): is that meant to be run from the command-line?

When I run that from the command line, I’m getting an error: async popupLogin(options). I think my VM defaults to an old version of node… I’ll need a bit to finish setting this test up.


#7

Yes, that script runs from the command line to read the file, but you can just ignore that part and define txt as whatever you want rather than reading it from a file.


#8

Just comment out the require and define the text of the RDF however you want and this should work in the browser just the same.


#9

OK, got it working in both browser and in node. I agree, pg1.rdf parses and searches cleanly, and the code you gave and the code I originally have are (basically) the same.

Is catalog.rdf incorrect in some way?


#10

Not sure what catalog.rdf is, but if you feed it to the same script, it should work if it uses terms similar to those in pg1.rdf.


#11

catalog.rdf is one of the three files available as a feed from PG:

  • today.rss
  • catalog.rdf / catalog.marc
  • rdf-files.tar

I chose catalog.rdf because it was complete, but smaller. The terms it uses are similar (as shown in the example I gave), but do not work.

Interestingly, I notice that all of the elements that come back as a NodeList have attributes associated with them. No attribute == it works?

That’s it.

Original example:

<pgterms:etext rdf:ID="etext27785">
  <dc:title rdf:parseType="Literal">A Book About Lawyers</dc:title>
</pgterms:etext>

Notice the title has an attribute parseType. In the file you used (pg1.rdf), there is no attribute on those. If I remove the attribute from the title tag in catalog it works perfectly.

  1. Is there something extra I was supposed to do to support attributes?
    • Wouldn’t surprise me
  2. Is that invalid RDF syntax?
    • I’m pretty sure I saw parseType was standard (though I’m pretty new to RDF)
  3. Is there a bug in rdflib?
    • I’ve been working to rig up an rdflib I can step through to help me wrap my head around what is going on
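
As a stopgap, here is a minimal sketch of the manual fix described above: strip the attribute from the XML text before handing it to $rdf.parse. It assumes the affected elements contain only plain text, as in the catalog.rdf sample. `stripParseTypeLiteral` is a hypothetical helper, not part of rdflib.js, and the values would then parse as plain literals rather than XMLLiterals.

```javascript
// Hypothetical helper: remove rdf:parseType="Literal" (single or double quoted)
// from every element in an RDF/XML string. Only reasonable when the element
// bodies are plain text; real embedded markup would no longer be valid here.
function stripParseTypeLiteral(xml) {
  return xml.replace(/\s+rdf:parseType\s*=\s*(["'])Literal\1/g, '');
}

const sample =
  '<pgterms:etext rdf:ID="etext27785">\n' +
  '  <dc:title rdf:parseType="Literal">A Book About Lawyers</dc:title>\n' +
  '</pgterms:etext>';

const cleaned = stripParseTypeLiteral(sample);
console.log(cleaned);
// The dc:title line becomes: <dc:title>A Book About Lawyers</dc:title>
```

The cleaned string would then take the place of stm in the $rdf.parse call from my original code.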

#12

Hmm, I don’t have catalog.rdf on hand to test at the moment but I did paste rdf:parseType="Literal" into the dc:title tag in pg1.rdf and it still parsed fine for me.


#13

“Works on my machine” :wink:

There is a bug in rdflib.js in the way it interprets Literals in the browser. You didn’t see it because you are testing in node.

The implementation of DOMParser imported for NodeJS handles serialization of nodelists differently than the browser implementation (I almost want to say incorrectly… except I like it).

I’ve tested in Firefox and Chrome, and get the same behaviour. I want to check to make sure my fix doesn’t break node implementations before suggesting it.

Looks like the defect was introduced in response to issue #75.
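
To illustrate where the string comes from, here is a rough sketch. It uses a stand-in object rather than a real DOM NodeList so that it runs outside the browser; `fakeNodeList` and `nodeListText` are hypothetical names, and this is not the actual rdflib.js code or my proposed patch.

```javascript
// Stand-in for a browser NodeList holding one text node. Real NodeLists
// stringify the same way because their Symbol.toStringTag is 'NodeList'.
const fakeNodeList = {
  0: { textContent: 'A Book About Lawyers' },
  length: 1,
  [Symbol.toStringTag]: 'NodeList',
};

// Coercing the list to a string yields the tag, not the contents.
console.log(String(fakeNodeList)); // [object NodeList]

// One way to get the text instead: concatenate each node's textContent.
function nodeListText(nodes) {
  return Array.from(nodes, n => n.textContent).join('');
}

console.log(nodeListText(fakeNodeList)); // A Book About Lawyers
```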