Semantic Web, Linked Data, RDF

Maybe you have heard about the Semantic Web, Linked Data or RDF before. Maybe you haven’t, and that’s why you arrived on this page.

The 3 terms all refer to the same idea. RDF is the data format used to describe Resources. The Semantic Web just means that, as we have the web of pages (linked by hyper references (href) using the HTTP protocol), we can also create a web of data, and that’s basically what the Semantic Web is all about. We say semantic because it gets close enough to the data to bring meaning to otherwise unstructured strings of text.

Linked Data says exactly the same thing, and is just another name for a concept that Tim Berners-Lee envisioned back in 1999 (the Semantic Web) and reformulated as Linked Data in 2006 and as the Giant Global Graph in 2007.

SPARQL is the query language of choice for RDF data, and NextGraph supports both at its core. You can see SPARQL as the equivalent of SQL for a relational database.

OWL is a language (based on RDF) that expresses the schema (also called Ontology or Vocabulary) of the data.

When we are talking about publicly available data, most of the time concerning immutable facts like academic data, we use the term LOD, for Linked Open Data. But semantic data doesn’t have to be public or open, and in NextGraph we use RDF to store private and encrypted data that very few people will ever see.

In the end, RDF is just another data format, like JSON, XML or CSV, except that it has its own characteristics, which are very interesting and which we will detail here.

You can find an introduction to RDF here and more details about SPARQL here.

RDF data is organized in the form of triples. And when we use a database to store and query those triples, we call this database a triplestore.

Triples

The essential thing to understand about RDF is that it encodes the data in the form of triples, each of which is composed of 3 elements.

The 3 elements are called: Subject -> Predicate -> Object. That’s one triple. A semantic database is just a set of triples.

  • The Subject represents the Resource we are establishing facts about.
  • The Predicate indicates the “key” or “property” we want to specify about the Resource.
  • And the Object represents the “value”.

Hence, if we want to say that “Bob owns a bicycle”, then we write it this way: Bob -> Owns -> Bicycle. Bob is the subject, Owns is the predicate, and Bicycle is the object.

We can also say that Bob -> color_of_eyes -> Blue and so on.
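As a minimal sketch, here is how these example facts could be written with a SPARQL update, assuming a made-up ex: prefix for our example resources (SPARQL uses the same Turtle-like syntax to write triples):

```sparql
# Hypothetical example: "ex:" is an invented prefix used only for illustration.
PREFIX ex: <http://example.org/>

INSERT DATA {
  ex:Bob ex:owns           ex:Bicycle .   # Bob -> Owns -> Bicycle
  ex:Bob ex:color_of_eyes  "Blue" .       # Bob -> color_of_eyes -> Blue
}
```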

In addition, the values (aka, the Object part of the triple) can also be a reference to another Resource. So basically we can link Resources together.

If we have a triple saying Alice -> lives_in -> Wonderland, and we know that Alice and Wonderland are 2 RDF resources that have their own triples, then we say that lives_in is a predicate that represents a relationship between the 2 RDF resources Alice and Wonderland.

Then let’s say there is another resource in the system called Bob and we also want to say that Bob -> is_friend_with -> Alice.

Here we have linked the 2 Resources together, and the Predicate is_friend_with is not just a property, but it is in fact a relationship.

If Alice also considers Bob as a friend, then we could say the inverse relationship Alice -> is_friend_with -> Bob.

We can see that the Predicates of the RDF world correspond to the keys and properties of the JS/JSON world. Values of the JS world are equivalent to Objects of RDF, although in JS you cannot encode a link to another (possibly remote) Resource as a value, while in RDF you can.

Finally, the Subject of a resource is its unique identifier. NextGraph assigns a unique ID to each Document, and that’s the Subject of the triples it contains.

In the classical Semantic Web, Resources are identified with URLs, but because NextGraph is not using HTTP, we identify the Resources with unique IDs of the form did:ng:o:[44 chars of the ID], for example did:ng:o:EghEnCqhpzp4Z7KXdbTx0LkQ1dUaaqwC0DGVS-0BAKAA, as explained in the Documents chapter.

So in fact, we don’t use names like Alice, Wonderland, or Bob as subjects or objects, but we use their did:ng:... identifiers instead.

Then of course, we attach a nice and easy-to-read text label to each resource, so we can see and understand what the resource is about. This is often done with a predicate called rdfs:label, which is used pervasively in the semantic web for giving a “title” to anything.
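As a sketch, here is how such a label could be attached to a resource, reusing the example identifier from above (the label value itself is made up):

```sparql
# Sketch: giving a human-readable title to a resource identified by its Nuri.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

INSERT DATA {
  <did:ng:o:EghEnCqhpzp4Z7KXdbTx0LkQ1dUaaqwC0DGVS-0BAKAA> rdfs:label "Alice" .
}
```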

Ontologies

As you can see, predicate names are often written as 2 parts separated by a colon, like rdfs:label. This means that we are referring to the prefix rdfs and to the fragment label inside it. The prefix rdfs must have been defined somewhere else beforehand, and it always points to a full URI that contains the ontology.

In the classical semantic web, this URI is a URL, in NextGraph it is a Nuri (a NextGraph DID URI) or it can also be a URL if needed.

So this “file” that contains the ontology, most often in the OWL format (which is itself RDF), describes the classes and properties, and how they can be combined (which properties belong to which classes, the cardinality of relationships, etc.).

Each entry in the ontology gets a name that can be used later on as a predicate, like label, which can be found in the RDFS ontology here: https://www.w3.org/2000/01/rdf-schema#label

When this predicate is saved in the triplestore, it is the long-form “fully qualified” version (“http://www.w3.org/2000/01/rdf-schema#label”) that is saved, and not the “rdfs:label” version, because prefixes can change, so the prefix is replaced by its real value before the triple is saved.

When we retrieve the triples, we can give some prefixes and the SPARQL engine will do the reverse operation of changing the long-form to the prefixed form.
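As a small sketch, a query declaring the rdfs prefix could look like this; the engine expands rdfs:label to its fully qualified IRI when matching triples:

```sparql
# Sketch: the PREFIX declaration lets us write rdfs:label instead of the full IRI
# <http://www.w3.org/2000/01/rdf-schema#label>.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?resource ?label WHERE {
  ?resource rdfs:label ?label .
}
```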

What is really interesting here is that Ontologies can be shared across documents, and also across triplestores. In fact, there already exists a good number of ontologies that have been adopted worldwide to represent the most common properties and relationships that we use in data.

Then some specialized ontologies have also emerged, often created in cooperation between several actors of a specific field of concern, and those ontologies became standards.

They form a shared schema that has been agreed upon globally and that can be reused, and of course amended if needed, by anybody.

If need be, it is always possible to define your own ontologies as well. And if they are of any interest to others, they might be published and reused too.

This mechanism tends to foster interoperability.

First of all, because the very technology of predicates encoded as URIs supports interoperability in and of itself.

But also because groups of interest tend to gather and establish standard ontologies in many fields of concern, which further enhances the interoperability and portability of data.

At NextGraph, we strive to gather the best existing ontologies out there and propose them to you so you can reuse them. We also make it easy for you to create new ones with graphical tools and editors, so you don’t get a headache trying to understand exactly how OWL works. If you know about UML, object-oriented programming, or data modelling, then you can easily create new ontologies. It is all about defining classes, properties, and relationships between them.

One last word about RDF and ontologies: because the schema of the data is encoded inside each triple (in the predicate part), there is no need to keep track of the schema of your data separately, as we normally do with relational databases or even in JSON. The advantage of RDF is that predicates are assigned globally unique identifiers too, so there is never any ambiguity about the schema of the data. No separate schema definition that sits outside of the data itself (with the problem of keeping the 2 in sync). No migrations needed. No data inconsistency either.

SPARQL

SPARQL is the query language for RDF. And as we have explained earlier in the Data-First chapter, we want to offer the developer a very simple interface for accessing and modifying the data: in the form of a reactive store with JavaScript objects (POJOs).

But sometimes, you also need to run complex queries: not only to query the data and traverse the graph, but also to update it with specific conditions, which SPARQL will help you do more efficiently than going through the reactive store.

In any case, rest assured that with our framework, you always have access to your data both ways: via the reactive store, and also via SPARQL.

SPARQL is a query language that looks a bit off-putting at first contact. Many developers do not like it at first. This happened to me too in the past. But I can tell you from experience that once you learn it a little bit, everything gets much simpler and also very powerful. We will also provide some graphical tools that will generate SPARQL queries for you, and we also plan to add GraphQL support, which is a query language that more developers know about.

It is not for us to give you a course on SPARQL, but we will try to give you the basics. You can refer to this page as a starting point, and there are many other tutorials online.

What is important to understand, for someone coming from SQL and relational databases, is that RDF does not require you to plan in advance all the relations you will need, normalize them, and add the corresponding foreign keys to your tables so you can later do some JOINs across tables. Instead, all the RDF data, all the triples, are JOINABLE by default, and you don’t need to plan for it ahead of time. We believe that this is a very important feature!

The 2 main types of queries are SELECT and CONSTRUCT.

SELECT is similar to SQL: it returns a table whose columns represent the variables that you have defined in your SPARQL query, with one row for each record that has been found. The thing to understand is that in the WHERE part, you put filters and also patterns for traversing the graph, which means that you can ask the SPARQL engine to “navigate” inside your RDF data and hop from one triple to another, from one resource to another, until it reaches the desired combination of “match patterns”.

For example, you can ask to find “all the contacts that I have who live in Barcelona, who are software engineers, and who live less than 100m from an ice cream parlor, and you want to see their name, date of birth, phone number, and profile picture.” This obviously will hardly work, because we usually don’t store information about ice cream parlors on a geo map. But if you had the data, then it would work! You can see from this example that the SPARQL engine needs to go through several resources before being able to deliver the results. From Contact it goes to City and then to Shop, and in those 3 types of Documents it checks some properties: it filters by Contact.job=“software_engineer” and keeps the date_of_birth, phone_number, profile_pic and geo_location for later, then follows the “lives_in” predicate of the contact, which redirects us to a city, then filters by City.label=“Barcelona”, then follows the “has_shop” predicates and filters by Shop.type=“ice_cream_parlor” and by Shop.geo_loc being within 100m of Contact.geo_location. This is exactly the same as doing JOINs in SQL, except that you do not need to normalize your tables in advance in order to establish foreign keys. Instead, all the semantic data is always “JOINABLE” by all its predicates.
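A rough sketch of such a SELECT query is shown below. Every class and predicate in it (ex:job, ex:lives_in, ex:has_shop, and so on) belongs to a hypothetical ontology invented for this illustration, and the geospatial part is only hinted at in a comment:

```sparql
# Sketch only: the "ex:" ontology is made up for this example.
PREFIX ex: <http://example.org/>

SELECT ?name ?birth ?phone ?pic WHERE {
  ?contact ex:job            "software_engineer" ;
           ex:name           ?name ;
           ex:date_of_birth  ?birth ;
           ex:phone_number   ?phone ;
           ex:profile_pic    ?pic ;
           ex:lives_in       ?city .           # hop from the contact to its city
  ?city    ex:label          "Barcelona" ;
           ex:has_shop       ?shop .           # hop from the city to its shops
  ?shop    ex:type           "ice_cream_parlor" .
  # A real query would also compare the contact's and the shop's coordinates
  # with a geospatial function, which depends on the engine's extensions.
}
```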

CONSTRUCT queries are a special type of query that always returns full triples. They work the same as SELECT ... WHERE, but you cannot have arbitrary variable projections: the result is always triples of the form subject predicate object. But you can of course specify which filters and patterns you want to follow.
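As a small sketch, still with the same hypothetical ex: ontology, a CONSTRUCT query could extract a subgraph of friendships like this:

```sparql
# Sketch: CONSTRUCT returns triples instead of a table of variable bindings.
PREFIX ex: <http://example.org/>

CONSTRUCT {
  ?person ex:is_friend_with ?friend .
} WHERE {
  ?person ex:is_friend_with ?friend .
  ?friend ex:lives_in       ?city .
}
```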

Until now, we explained that each Document can hold some RDF triples, but we didn’t explain how they are stored and how the SPARQL engine is able to run queries that span all the RDF Documents that are present locally.

There is in fact an option in the “SPARQL Query” tool (in the Document Menu, under “Graph” / “View as …”) that lets you query all the documents that you have present locally at once. If you do not toggle this option, you will only get results about the triples of the current Document. With this “Query all docs” option activated, the SPARQL engine will search in all your documents, regardless of whether they are in the Public store, Protected store, Private store, or in any Group or Dialog store.

What matters is that the documents must be present locally.

When you are using the native app, all your documents from all your stores are always present locally, because they are stored in the UserStorage.

For the webapp, on the other hand, you only have locally the documents that you have manually opened since your last login. This is because the webapp, for now, cannot store all your documents locally, as there is not enough room for that. This will be improved at some point, but it needs more work on our side.

In the future, we will also be able to run federated queries, which means that part or all of the query is going to be run on someone else’s data, remotely. This is not ready yet, but that’s the goal of NextGraph. If we want to query the social graph, for example, we have to go to our contacts, friends, and followers and run some queries there on their data.
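For reference, standard SPARQL 1.1 already defines federation through the SERVICE keyword, which delegates part of a query to a remote endpoint. The sketch below only illustrates the shape of such a query; the endpoint URI and predicates are hypothetical, and NextGraph’s own federated queries are not available yet:

```sparql
# Sketch of SPARQL 1.1 federation: endpoint and predicates are hypothetical.
PREFIX ex: <http://example.org/>

SELECT ?post WHERE {
  <did:ng:o:EghEnCqhpzp4Z7KXdbTx0LkQ1dUaaqwC0DGVS-0BAKAA> ex:is_friend_with ?friend .
  SERVICE <https://example.org/sparql> {        # run this part on a remote endpoint
    ?friend ex:published ?post .
  }
}
```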

Of course, this will only work if we have been granted permission to run those remote queries. About that, you should read our chapter about permissions.

In the same manner, we just explained that when you query with “Query all docs”, you directly have access to all your local documents. First of all, we have to be precise and say that the set of documents you have access to is scoped per user/identity: if you have several Identities in your wallet, then only the current identity’s documents will be queried.

Secondly, we have to clarify that only the official apps can have such unlimited access to all your documents. This is because those apps have been coded and audited by us, they are open source, and we know they are not going to do malicious things with your data. They will never connect to remote machines and send your data there. They will never gather statistics or track you. They can only manipulate your data locally, and they have to behave! You also trust those apps because you trust us. If you didn’t trust us in general, then there would be absolutely no point in using our products. Your trust, though, is not just “requested” from you: you can easily check our claims about security, privacy and encryption, as all our code is open source, and you can also compile it yourself, or ask an expert programmer to audit it or compile it for you.

Then, when you install third party apps, those apps will NOT get unlimited access to all your data. Those applications will have to request permissions from you, before being able to read or write some of your data. At any moment, you can revoke this grant. More about that in the permissions chapter.

Those 3rd-party apps are mostly safe, because we also review them. But because you also have the option to bypass the App Store and install any app that you want, those apps will obviously not be reviewed by us; so, as a matter of principle, any third-party app needs to present some capability in order to access your data, even if you are the author of such an app.

It should be noted that permissions can span a whole UserStorage, or a whole Store, or a set of Documents, or even, for read permission only, a specific branch or block.

includes

As we already briefly mentioned when we talked about blocks here, you can include other blocks inside a document. This has the effect of also including all the triples of such a block inside the document where the include is declared.

This is very handy when you are in Document A and you need to access some extra data coming from another Document or Branch B, and you want to make sure that anybody who reads the current Document A will also fetch and include the other Document or Block B automatically. If you have some logic in your document that depends on such data, in a SPARQL query for example, this include mechanism will solve the headache of fetching and caching the foreign data.

The included block can be from the same document, from another document in the same store or in a different one, and even from someone else’s document on a remote machine. Thanks to the pub/sub and synchronization mechanism of NextGraph, this foreign block will always stay up to date, as it will synchronize with the original source.

Named graphs and blank nodes

And this leads us to an explanation about what happens to named graphs in NextGraph.

Named Graphs are an RDF and SPARQL feature that lets you organize your triples into a bag. This bag contains your triples, and we call this bag a Graph. It also gets an ID in the form of a URI (normally a URL; in NextGraph, a Nuri).

SPARQL has options to specify which named graph(s) you want the query to relate to.
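As a sketch, here is what restricting a query to one named graph looks like, reusing the example Nuri from above as the graph name (in NextGraph, a Document’s ID):

```sparql
# Sketch: the GRAPH clause limits the pattern to triples stored in one named graph.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?label WHERE {
  GRAPH <did:ng:o:EghEnCqhpzp4Z7KXdbTx0LkQ1dUaaqwC0DGVS-0BAKAA> {
    ?s rdfs:label ?label .
  }
}
```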

Theoretically, a triplestore can put any kind of triple in any graph. By the way, when a triplestore understands the concept of a named graph, we call it a quadstore, because each triple then has a 4th part telling which graph the triple is stored in. So triples become quads (they have 4 parts instead of 3), and the triplestore becomes a quadstore.

NextGraph is a quadstore, but there are some limitations.

We do not let the user create graphs manually and arbitrarily. Instead, we associate each new resource/document (with its unique subject ID of the form did:ng:o:...) with a new graph of the same name as the resource.

Even creating a new resource or document doesn’t happen freely in the quadstore. Instead, there is a special API call for creating a new document, which must be called before any triple can be inserted into this Document. This API call returns the newly generated document ID.

Then it is possible to add new triples to this Document, and the ID of the document has to be passed to the SPARQL query, as a named graph, or as an argument in the API call itself. This way, we always know in which named graph the data should be saved or retrieved from.

In this Document/Named Graph, the user can add triples, and most of the time, they will add triples that have this Document ID as subject. That’s what we call authoritative triples: because the subject is the same ID as the named graph (the document ID), and because we have signatures on every commit, and threshold signatures too, we can prove that those triples have been authored by the users who claim to be editors of such a document.

In order to facilitate adding those kinds of authoritative triples with a SPARQL UPDATE, or retrieving them with a SPARQL QUERY, the user has access to the BASE shortcut, which is <> in SPARQL, and which represents the current document. It will be replaced internally by the exact ID of the current document. This placeholder is handy and helps you manipulate the authoritative triples of your document. The default graph of any SPARQL query is also the current Document, so you do not need to specify it explicitly (except when you select the “Query all docs” option; in this case, the default graph is the union graph of all the graphs).
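A minimal sketch of an update that uses the <> shortcut could look like this (the predicate and value are hypothetical):

```sparql
# Sketch: <> stands for the current document, so this adds an authoritative triple.
PREFIX ex: <http://example.org/>

INSERT DATA {
  <> ex:status "published" .
}
```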

If we had stopped here, there would be no real interest in having a named graph mechanism.

But you are also able to add triples in the Document/Named Graph that are not authoritative. Those are the triples that have as subject some other ID than the current Document’s.

What is it useful for? RDF lets anybody establish facts about any resource. If there is a foreign Document that I am using in my system, and I want to add extra information about this resource, but I don’t have write permission on that foreign Document, I can add the triples in one of the Documents that I own. External people who see those triples that I added would immediately understand that they are not authoritative, because they were not signed with the private key of the Document ID that they establish facts about (the subject of the triples). So it is possible to say, for example, that London -> belongs_to -> African_continent, but of course, this is not the official point of view of the author that manages the London Document. It only is “my point of view”, and people who see this triple will also be notified that it isn’t authoritative (I think they can easily understand that by themselves without the need for signatures).
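As a sketch, such a non-authoritative statement is just a regular insert into one of my own documents, where the subject is a foreign resource’s identifier (here stood in for by a hypothetical ex:London):

```sparql
# Sketch: the subject is not this document's ID, so the triple is non-authoritative.
PREFIX ex: <http://example.org/>

INSERT DATA {
  ex:London ex:belongs_to ex:African_continent .
}
```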

Then we have other use cases for extra triples in the Document:

  • fragments, which are prefixed with the authoritative ID and followed by a hash and a string (like in our previous example, #label).

  • blank nodes that have been skolemized. They get a Nuri of the form did:ng:o:...:u:.... This is because blank nodes cannot exist as such in a local-first system, as we need to give them a unique ID. This is done with the skolemization procedure. For the user or programmer, skolemization is transparent: you can use blank nodes in a SPARQL UPDATE and they will be automatically translated to skolems (see the sketch below). For a SPARQL QUERY anyway, blank nodes are just hidden variables, so there is no impact.

But those extra triples (fragments and skolems) are all prefixed with the authoritative ID, so they are considered authoritative too.

Note that non-authoritative triples can also have fragments and skolemized blank nodes, but their prefix will be a foreign ID, so they won’t be considered authoritative either.
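Here is a minimal sketch of an update that uses a blank node (with a hypothetical ex: ontology); as described above, NextGraph would skolemize _:addr into a stable did:ng:o:...:u:... identifier when storing it:

```sparql
# Sketch: the blank node _:addr will be skolemized into a unique identifier.
PREFIX ex: <http://example.org/>

INSERT DATA {
  <> ex:address _:addr .                  # <> is the current document
  _:addr ex:city   "Barcelona" ;
         ex:street "Carrer Exemple" .
}
```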

Now that we’ve explored the Semantic Web and OWL, we can dive more into Schema definition in NextGraph.