Publishing Linked Data
Once the RDF dataset has been created and interlinked, the publishing process involves the following tasks:
- Metadata creation for describing the dataset
- Making the dataset accessible
- Exposing the dataset in linked data repositories
- Validating the dataset
These will be described in the following four subsections.
3.6.1 Providing metadata about the dataset
A published RDF dataset should have metadata about itself that can be processed by search engines. This metadata allows for:
- Efficient and effective search of datasets.
- Selection of appropriate datasets (for consumption or interlinking).
- Acquiring general statistics about the dataset such as its size.
A frequently used vocabulary for describing RDF datasets is VoID (Vocabulary of Interlinked Datasets). An RDF dataset is expressed as being of the type void:Dataset.
The VoID schema covers four types of metadata:
- General metadata
- Structural metadata
- Descriptions of linksets
- Access metadata
General metadata is intended to help users identify appropriate datasets. This contains general information such as the title, description and publication date. It also identifies contributors, creators and authors of the dataset. The VoID schema makes use of both Dublin Core and FOAF predicates. A list of general VoID properties is shown in Figure 7.
Figure 7: VoID General metadata.
The VoID general metadata also describes the licensing terms of the dataset using the dcterms:license property (see  for a discussion of licensing issues). The topics and domains of the data are expressed using the dcterms:subject property. The property void:feature can be used to express technical features of the dataset such as its serialisation formats (e.g. RDF/XML, Turtle).
Structural metadata provides high-level information about the internal structure of the dataset. This metadata is useful when exploring or querying the dataset and includes information about resources, vocabularies used in the dataset, statistics and examples of resources in the dataset.
In the example below a URI (which happens to represent The Beatles) is identified as being an example resource in the MusicBrainz dataset.
:MusicBrainz a void:Dataset;
It is also possible to specify the string that prefixes all new entity URIs created in the dataset. Below, all new entities in the MusicBrainz dataset are specified as beginning with the string http://musicbrainz.org/.
:MusicBrainz a void:Dataset;
void:uriSpace "http://musicbrainz.org/" .
The property void:vocabulary identifies the most relevant vocabularies used in the dataset. It is not intended to be an exhaustive list. The example below states that the Music Ontology is a vocabulary used by the MusicBrainz dataset. This property can only be used for entire vocabularies. It cannot be used to state that a subset of the vocabulary occurs in the dataset.
:MusicBrainz a void:Dataset;
void:vocabulary <http://purl.org/ontology/mo/> .
A further set of properties is used to express statistics about the dataset such as the number of classes, properties and triples. These statistics can also be expressed for any subset of the dataset.
Figure 8: VoID statistics about a dataset.
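As a rough illustration of where such statistics come from, the Python sketch below counts a few of them over an in-memory list of triples. The sample triples and the example.org URIs are hypothetical. The dictionary keys are VoID property names, but the counting code itself is only an assumption about how one might compute them, not a tool described in this chapter.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
MO = "http://purl.org/ontology/mo/"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical sample data: (subject, predicate, object) tuples.
triples = [
    ("http://example.org/artist/1", RDF_TYPE, MO + "MusicGroup"),
    ("http://example.org/artist/1", FOAF + "name", '"The Beatles"'),
    ("http://example.org/artist/2", RDF_TYPE, MO + "MusicArtist"),
    ("http://example.org/artist/1", MO + "member", "http://example.org/artist/2"),
]


def void_statistics(triples):
    """Count basic VoID statistics for a list of triples."""
    return {
        "void:triples": len(triples),
        # Distinct resources that occur as objects of rdf:type triples.
        "void:classes": len({o for s, p, o in triples if p == RDF_TYPE}),
        # Distinct predicates.
        "void:properties": len({p for s, p, o in triples}),
        "void:distinctSubjects": len({s for s, p, o in triples}),
        "void:distinctObjects": len({o for s, p, o in triples}),
    }


stats = void_statistics(triples)
for name, value in stats.items():
    print(name, value)
```

The same function could be applied to any subset of the triples to obtain statistics for that subset.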
The void:subset property defines parts of a dataset. The example below states that MusicBrainzArtists is a subset of the MusicBrainz dataset.
:MusicBrainz a void:Dataset;
void:subset :MusicBrainzArtists .
The properties void:classPartition and void:propertyPartition are subproperties of void:subset. A subset that is the void:classPartition of another dataset contains only triples that describe entities that are individuals of this class. A subset that is the void:propertyPartition of another dataset contains only triples using that property as the predicate. A class partition has exactly one void:class property. Similarly, a property partition has exactly one void:property property. The example below asserts that there is a class partition of MusicBrainz containing triples describing individuals of mo:Release. It also asserts that there is a property partition that contains triples using mo:member as the predicate.
:MusicBrainz a void:Dataset;
void:classPartition [ void:class mo:Release ] ;
void:propertyPartition [ void:property mo:member ] .
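The two kinds of partition can be pictured as simple filters over the triples of a dataset. The Python sketch below follows the definitions given above; the sample triples and the example.org URIs are hypothetical, and the function names are chosen here for illustration only.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
MO = "http://purl.org/ontology/mo/"
DCTERMS = "http://purl.org/dc/terms/"

RELEASE = "http://example.org/release/1"  # hypothetical URIs
GROUP = "http://example.org/artist/1"
PERSON = "http://example.org/artist/2"

triples = [
    (RELEASE, RDF_TYPE, MO + "Release"),
    (RELEASE, DCTERMS + "title", '"Abbey Road"'),
    (GROUP, RDF_TYPE, MO + "MusicGroup"),
    (GROUP, MO + "member", PERSON),
]


def class_partition(triples, cls):
    """Triples that describe entities which are individuals of cls."""
    individuals = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    return [t for t in triples if t[0] in individuals]


def property_partition(triples, prop):
    """Triples that use prop as their predicate."""
    return [t for t in triples if t[1] == prop]


releases = class_partition(triples, MO + "Release")
memberships = property_partition(triples, MO + "member")
print(len(releases), len(memberships))
```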
Descriptions of linksets
A linkset is a set of RDF triples in which the subject and object are described in different datasets. A linkset is therefore a collection of links between two datasets. The RDF links in a linkset often use the owl:sameAs predicate to link the two datasets. In the example below, LS1 is declared as a subset of the DS1 dataset. LS1 is a linkset using the owl:sameAs predicate. The linkset declares sameAs relations to entities in another dataset (DS2).
Figure 9: A collection of links between two datasets. Based on .
In the MusicBrainz example below a class partition named MBArtists is defined. This is a linkset that has skos:exactMatch links between MusicBrainz and DBpedia.
:MusicBrainz a void:Dataset .
:DBpedia a void:Dataset .
:MusicBrainz void:classPartition :MBArtists .
:MBArtists void:class mo:MusicArtist .
:MBArtists a void:Linkset;
void:linkPredicate skos:exactMatch;
void:target :MusicBrainz, :DBpedia .
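To make the idea of a linkset concrete, the Python sketch below selects the triples that use a given link predicate and whose subject and object fall in different URI spaces. The MusicBrainz URI space is the one given earlier in this section. The DBpedia URI space and the sample triples are assumptions made for the sake of the example.

```python
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

MUSICBRAINZ_SPACE = "http://musicbrainz.org/"
DBPEDIA_SPACE = "http://dbpedia.org/"  # assumed URI space

BEATLES_MB = "http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d"
BEATLES_DBP = "http://dbpedia.org/resource/The_Beatles"

# Illustrative triples, not taken from either dataset.
triples = [
    (BEATLES_MB, SKOS_EXACT_MATCH, BEATLES_DBP),
    (BEATLES_MB, FOAF_NAME, '"The Beatles"'),
]


def linkset(triples, subject_space, object_space, link_predicate):
    """Triples linking an entity in one URI space to one in another."""
    return [
        (s, p, o)
        for s, p, o in triples
        if p == link_predicate
        and s.startswith(subject_space)
        and o.startswith(object_space)
    ]


links = linkset(triples, MUSICBRAINZ_SPACE, DBPEDIA_SPACE, SKOS_EXACT_MATCH)
print(links)
```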
The VoID schema can also be used to describe methods for accessing the dataset, for example the location of a URI where entities in the dataset can be inspected, a SPARQL endpoint or file containing the data. The predicate void:rootResource can be used to express the top terms in a hierarchically structured dataset.
Figure 10: Methods for accessing metadata.
3.6.2 Providing access to the dataset
A dataset can be accessed via four different mechanisms:
- Dereferencing HTTP URIs
- RDFa
- RDF dump
- SPARQL endpoint
These will be described below.
As we saw earlier, the first two linked data principles state that URIs should be used as names for things and that HTTP URIs should be used so that users can look up those names. Dereferencing is the process of looking up the definition of an HTTP URI.
The URI http://dbpedia.org/resource/The_Beatles is used to name The Beatles. It is not possible to send The Beatles over HTTP. However, if you access this URI you will be forwarded to a document at some other location that can provide you with information about The Beatles. The HTTP conversation goes as follows:
- You request data from a URI used to name a thing such as The Beatles (e.g. http://dbpedia.org/resource/The_Beatles). You may request data in a particular format such as HTML or RDF/XML.
- The server responds with a 303 status (meaning redirect) and another location from which the data in the preferred format can be accessed.
- You request data from the location to which you were redirected.
- The server responds with a 200 status (meaning your request has been successful) and a document in the preferred format.
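The four steps above can be sketched as a small simulation that needs no network access. The respond function is a hypothetical stand-in for a linked data server, and the document locations are the DBpedia ones discussed in this section. The 406 and 404 responses are assumptions about how such a server might treat other requests.

```python
THING = "http://dbpedia.org/resource/The_Beatles"

# Preferred media type -> document that describes the thing.
DOCUMENTS = {
    "text/html": "http://dbpedia.org/page/The_Beatles",
    "application/rdf+xml": "http://dbpedia.org/data/The_Beatles.rdf",
}


def respond(uri, accept):
    """Hypothetical server: return (status, headers) for a request."""
    if uri == THING:
        if accept in DOCUMENTS:
            # The thing itself cannot be sent, so redirect to a document.
            return 303, {"Location": DOCUMENTS[accept]}
        return 406, {}
    for media_type, location in DOCUMENTS.items():
        if uri == location:
            return 200, {"Content-Type": media_type}
    return 404, {}


def dereference(uri, accept):
    """Follow at most one 303 redirect and record each (uri, status)."""
    conversation = []
    status, headers = respond(uri, accept)
    conversation.append((uri, status))
    if status == 303:
        uri = headers["Location"]
        status, headers = respond(uri, accept)
        conversation.append((uri, status))
    return conversation


for step in dereference(THING, "application/rdf+xml"):
    print(step)
```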
If you request data about The Beatles in HTML format you will be redirected to a web page (http://dbpedia.org/page/The_Beatles). If you request the data in RDF/XML format then you will be redirected to an alternative document (http://dbpedia.org/data/The_Beatles.rdf). If you are providing rather than requesting the data then you need to decide which RDF triples should be returned from your dataset in response to dereferencing an HTTP URI about an entity (such as The Beatles) which cannot itself be returned. Guidance on what to return can be found in . This can be summarised as follows:
- Immediate description: All of the triples in the dataset in which the originally requested URI was the subject.
- Backlinks: All triples in which the URI is the object. This allows browsers or crawlers to traverse the data in two directions.
- Related descriptions: Triples not directly linked to the resource but likely to be of interest. For example, information about the author could be sent with information about a book as this is likely to be of interest.
- Metadata: Information about the data, along the lines described in section 3.6.1, such as the author of the data and licensing information.
- Syntax: There are a number of ways of serializing RDF triples. The data source may be able to provide RDF in more than one format, for example as Turtle as well as RDF/XML.
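The first two items in this list can be sketched as two filters over the dataset. In the Python below, the sample triples are made up for illustration and the function name describe is hypothetical. Related descriptions, metadata and the choice of syntax are left out because they depend on the dataset and the server.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
MO = "http://purl.org/ontology/mo/"

BEATLES = "http://dbpedia.org/resource/The_Beatles"
LENNON = "http://dbpedia.org/resource/John_Lennon"
ABBEY_ROAD = "http://dbpedia.org/resource/Abbey_Road"

# Illustrative triples only; not copied from DBpedia.
triples = [
    (BEATLES, FOAF + "name", '"The Beatles"'),
    (BEATLES, MO + "member", LENNON),
    (ABBEY_ROAD, FOAF + "maker", BEATLES),
    (LENNON, FOAF + "name", '"John Lennon"'),
]


def describe(uri, triples):
    """Collect the immediate description and the backlinks of uri."""
    return {
        # Triples in which the requested URI is the subject.
        "immediate": [t for t in triples if t[0] == uri],
        # Triples in which the requested URI is the object.
        "backlinks": [t for t in triples if t[2] == uri],
    }


description = describe(BEATLES, triples)
print(len(description["immediate"]), len(description["backlinks"]))
```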
RDFa stands for “RDF in attributes”. RDFa is an extension to HTML5 for embedding RDF within HTML documents. The advantage of RDFa is that a single document can be used for both human and machine consumption of the data. A human accessing the data via a web browser need not be aware that an alternative RDF representation is embedded within the page. RDFa can be thought of as a bridge between the Web of Data and the Web of (human readable) Documents.
Figure 11 lists the main attributes of RDFa. The about attribute specifies the subject that the metadata is about. The typeof attribute specifies the rdf:type of the subject. The property attribute specifies the type of relationship between the subject and another resource. The vocab and prefix attributes specify the default vocabulary and prefix mappings.
Figure 11: RDFa attributes.
Below we can see a portion of HTML+RDFa contained in an HTML <div> element. The subject of this fragment of RDFa is specified using the about attribute. Here the subject is the MusicBrainz URI for The Beatles. In the line below we see the typeof attribute which is used to specify the rdf:type of the subject. The type of The Beatles is specified as the MusicGroup concept from the Music Ontology.
Below we can see this RDF triple extracted from the HTML+RDFa.
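As a hedged illustration of how the about and typeof attributes give rise to an rdf:type triple, the Python sketch below reads them from a small fragment using only the standard library. The markup is a simplified assumption and is not copied from the MusicBrainz page. The class TypeofExtractor is hypothetical and handles only the about, typeof and prefix attributes, so it is far from a complete RDFa parser.

```python
from html.parser import HTMLParser

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumed markup, written for this example only.
FRAGMENT = """
<div prefix="mo: http://purl.org/ontology/mo/"
     about="http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d"
     typeof="mo:MusicGroup">
  <span>The Beatles</span>
</div>
"""


class TypeofExtractor(HTMLParser):
    """Collect rdf:type triples from about/typeof attribute pairs."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "prefix" in attrs:
            # The value alternates between "name:" and an IRI.
            parts = attrs["prefix"].split()
            for name, iri in zip(parts[0::2], parts[1::2]):
                self.prefixes[name.rstrip(":")] = iri
        if "about" in attrs and "typeof" in attrs:
            for curie in attrs["typeof"].split():
                prefix, _, local = curie.partition(":")
                # Expand known prefixes; leave anything else unchanged.
                base = self.prefixes.get(prefix, prefix + ":")
                self.triples.append((attrs["about"], RDF_TYPE, base + local))


extractor = TypeofExtractor()
extractor.feed(FRAGMENT)
for subject, predicate, obj in extractor.triples:
    print("<%s> <%s> <%s> ." % (subject, predicate, obj))
```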
Figure 12 shows an example of a page in HTML+RDFa format. This is the MusicBrainz page about The Beatles (http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d). As mentioned earlier, the human reader need not be aware of the RDF embedded within the page.
Figure 12: MusicBrainz page in HTML+RDFa format.
An RDFa distiller and parser can be used to extract the RDF representation. In Figure 13 the URL for the MusicBrainz page about The Beatles has been entered into the form (http://www.w3.org/2007/08/pyRdfa).
Figure 13: Extracting RDF from a MusicBrainz page in HTML+RDFa format.
Figure 14 shows a fragment of the RDF contained in the page, represented in the N-Triples format.
Figure 14: Extracted RDF in N-Triples format.
An RDF dump is a file that contains the whole or some subset of an RDF dataset. A dataset may be split over several data dumps. An RDF dump may use one of a number of formats. RDF/XML encodes RDF in XML syntax. N-Triples is a subset of the Turtle format in which the RDF is represented as a list of triples, each terminated by a dot. The format N-Quads is an extension of N-Triples in which a fourth element specifies the context or named graph of each triple. A site that maintains a list of available RDF data dumps can be found in .
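The relationship between the two line-based formats can be seen by writing the same statement both ways. The Python sketch below is a minimal illustration: the helper names are hypothetical, the graph URI is made up, and the literal escaping covers only a few common characters.

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
BEATLES = "http://dbpedia.org/resource/The_Beatles"
GRAPH = "http://example.org/graph/artists"  # hypothetical named graph


def iri(value):
    """Write an IRI in angle brackets."""
    return "<" + value + ">"


def literal(value):
    """Write a plain string literal, escaping a few special characters."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n"))
    return '"' + escaped + '"'


def n_triple(s, p, o):
    """One N-Triples line: subject, predicate, object, dot."""
    return "%s %s %s ." % (s, p, o)


def n_quad(s, p, o, g):
    """One N-Quads line: the triple plus a fourth element for the graph."""
    return "%s %s %s %s ." % (s, p, o, g)


triple_line = n_triple(iri(BEATLES), iri(FOAF_NAME), literal("The Beatles"))
quad_line = n_quad(iri(BEATLES), iri(FOAF_NAME), literal("The Beatles"), iri(GRAPH))
print(triple_line)
print(quad_line)
```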
SPARQL is a language that can be used to query an RDF dataset. A SPARQL endpoint is a service that processes SPARQL queries and returns results. SPARQL queries can be used to retrieve particular subsets of the dataset. See chapter 2 for more information on SPARQL. Lists of publicly available SPARQL endpoints are maintained at  and .
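A query is typically sent to an endpoint over HTTP, with the query text passed in a parameter named query. The Python sketch below only builds such a request URL and does not contact any server. The endpoint address is an assumption (it is not given in this section), and the query itself is just an example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # assumed endpoint address

QUERY = """SELECT ?property ?value WHERE {
  <http://dbpedia.org/resource/The_Beatles> ?property ?value .
} LIMIT 10"""


def build_request_url(endpoint, query):
    """Encode the query text into the 'query' parameter of a GET URL."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could then be requested with any HTTP client, usually together with an Accept header naming the desired result format.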
3.6.3 Exposing the dataset in linked data repositories
Data catalogs, markets or repositories are platforms that provide access to a wide range of contributed datasets. They assist data consumers in finding and accessing new datasets. Catalogs generally offer relevant metadata about the dataset. The open source platform CKAN can be used for managing and providing access to a large number of datasets. CKAN would be recommended for a large institution that wanted to manage access to a number of datasets. The Data Hub is a public linked data catalog to which datasets can be contributed. CKAN and Data Hub will be described in more detail in section 3.7.
3.6.4 Validating the dataset
There are three different ways in which an RDF dataset can be validated. The first set (labelled accessibility in Figure 15) checks that URIs are dereferenced correctly, using the HTTP client-server dialogue described in section 3.6.2. A second set (labelled parsing and syntax) is used to validate the syntax of the RDF that is returned. Separate services are available for validating RDF/XML and RDFa markup. Finally, RDF:Alerts is a general purpose validation service for checking syntax and other problems such as undefined properties and classes and data type errors.
Figure 15: Ways of validating an RDF dataset.
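To give a flavour of the third kind of check, the Python sketch below reports properties and classes that are not defined in the vocabularies a dataset claims to use. It is not how RDF:Alerts or any other service is implemented. The lists of defined terms are a hypothetical excerpt, and the sample triples contain deliberate misspellings.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
MO = "http://purl.org/ontology/mo/"

# Hypothetical excerpt of the terms defined by the vocabularies in use.
DEFINED_CLASSES = {MO + "MusicGroup", MO + "MusicArtist"}
DEFINED_PROPERTIES = {RDF_TYPE, MO + "member"}

# Sample data with two deliberate misspellings.
triples = [
    ("http://example.org/artist/1", RDF_TYPE, MO + "MusicGroup"),
    ("http://example.org/artist/1", MO + "memberr", "http://example.org/artist/2"),
    ("http://example.org/artist/2", RDF_TYPE, MO + "MusicGrup"),
]


def undefined_terms(triples):
    """List (problem, term) pairs for terms missing from the vocabularies."""
    problems = []
    for s, p, o in triples:
        if p not in DEFINED_PROPERTIES:
            problems.append(("undefined property", p))
        elif p == RDF_TYPE and o not in DEFINED_CLASSES:
            problems.append(("undefined class", o))
    return problems


problems = undefined_terms(triples)
for problem, term in problems:
    print(problem, term)
```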