Architecture of Linked Data Applications
5.4 Architecture of Linked Data applications[edit | edit source]
In the previous section we saw a number of example Linked Data applications. Each application not only had a different purpose but also made different architectural decisions in terms of how the data was accessed, processed and stored. In this section we present generic architectural patterns of Linked Data applications and different design decisions associated with these patterns.
The term software architecture describes the components of a software system and the relationships between those elements. For a web-based system the components could include software modules, databases and web servers. Some parts of the architecture may be legacy components brought forward from previous systems. The relationships between the components of a system indicate which components communicate during operation of the system and the mechanisms by which that communication takes place. As well as specifying the structure of the system, the software architecture also stipulates a set of design practices to be followed in order to create and maintain the architecture.
5.5 Multitier architecture[edit | edit source]
An important architectural pattern used in system development is the multitier architecture. A multitier architecture separates functionality into a number of layers from low-level data storage through to user interaction components. This architecture is commonly used for many kinds of web application. As many Linked Data applications are also web applications, they tend to conform to this architectural approach.
An important advantage of the tiered architecture is that it logically separates the functionality of the system into a series of layers and specifies the communication between those layers. This separation makes it far easier to replace a layer of the architecture or reuse a layer of an existing architecture in a new application. For example, an application may have a layer or tier dedicated to data storage. This functionality may be provided by a particular database or triplestore. The storage layer could be reused in an alternative application or replaced by another storage layer providing the same functionality and communication with other layers.
The most commonly used multitier architecture is the three-tier architecture. First, a presentation tier provides a user interface that can accept user input and render results in a human-readable form. Second, a logic tier implements the business logic of the application. This takes the available data and analyses and transforms it to meet the needs of the user. Third, a data tier stores the underlying data in a form independent from the business logic applied to it in the application. Figure 13 illustrates our music example as a three-tier architecture. The presentation tier handles user queries and the returned outputs including textual results and visualisations. The logic tier transforms user queries into SPARQL queries and aggregates the RDF results. The data tier is responsible for storage and in this case uses a triplestore. One important aspect to note in the case of Linked Data applications is that the dividing line between the data tier and the logic tier may not be so clear-cut. If a relational database is used as a storage layer, then all processing of the data beyond the returning of results to a database query is done in the logic layer. However, triplestores are capable of performing various types of reasoning and therefore in some cases a significant part of the business logic can be carried within the triplestore. For reasons of performance it is advantageous to perform reasoning on a lower level as possible.
Figure 13: Music example of the three-tier architecture.
Figure 14 shows the architecture of our music application with a particular emphasis on the components within the data tier. The presentation layer produces the kinds of visualisations we saw in Chapter 4. The logic layer processes data for presentation. The data tier as well as implementing some of the logic does much more work than just data storage. Data stored within the data tier may be consumed from a number of sources. These may be consumed from SPARQL endpoints or RDF dumps. Wrappers may be required to covert the data into an appropriate format. As described in chapter 3 R2RML may be used to transform data from a relational database into RDF.
Also, as covered in Chapter 3, the retrieved data may use different vocabularies on the schema level and may also on the instance level have different individuals referring to the same thing. As described in Chapter 3, languages such as SKOS can be used to express relationships across vocabularies and the owl:sameAs property can be used to relate resources. Chapter 3 also describes how tools such as the SILK framework, could be used to identify and express relationships across datasets.
Some data cleansing may also be required, for example, to identify ambiguities between names of the resources within the datasets and to fix these ambiguities. Therefore, vocabulary mapping, interlinking and cleansing need to be carried out to produce a consistent dataset. As well as being used by the logic layer, a direct facility to republish the data may be provided.
Figure 14: General architecture of Linked Data applications.
5.6 Architectural patterns[edit | edit source]
The data consumed and integrated within the application may be accessed in a number of ways. Three main architectural patterns can be identified. First, there is the crawling pattern, in which data is loaded in advance. The data may also need to be transformed as described above. The data is managed in one triplestore so that it can be accessed efficiently. The disadvantage of this pattern is that the data might not be up to date. Second, there is the On-The-Fly Dereferencing Pattern. Here, URIs are dereferenced at the moment that the application requires the data. This pattern retrieves up to date data but performance is affected when the application must dereference many URIs. Third, there is a (Federated) Query Pattern in which complex queries are submitted to a fixed set of data sources. This approach enables applications to work with current data directly retrieved from the sources. However, finding optimal query execution plans over a large number of sources is a complex problem. This third pattern in specific situations can offer a way of accessing up-to-date data with adequate response times.
5.7 Data layer[edit | edit source]
In the data tier new Linked Data may be consumed from a SPARQL endpoint in RDF. As discussed above, if the data is in another form such as CSV (Comma Separated Values) then a wrapper would be used to translate the data to RDF. Linked Data applications may implement a Mediator-Wrapper Architecture to access heterogeneous sources in which wrappers are built around each data source in order to provide an unified view of the retrieved data. The Mediator-Wrapper Architecture could be used with any of the three architectural patterns (i.e. crawling pattern, on-the-fly dereferencing pattern, federated query pattern). The most appropriate patterns will depend on a number of factors such as the number of sources that need top be accessed, how up-to-date the data is required to be and the speed of response required by the data layer.
A number of tools are available that can be used when implementing a data access component. Linked Data Crawlers are web crawlers designed to harvest RDF data. Linked Data client libraries support the access and traversal of Linked Data. SPARQL Client Libraries provide an API for accessing SPARQL endpoints. Federated SPARQL Engines provide a single access point for multiple heterogeneous data sources. Finally, Search Engine APIs such as Sindice (see chapter 4, section 10.2) support semantic searches that return RDF documents.
Figure 15: Linked Data access mechanisms.
The integrated dataset can then be kept in a local triplestore. A triplestore is required unless data is retrieved on-the-fly, and discarded just after building a response to the user request. A number of commercial or free RDF triplestores are available including OWLIM , Jena TDB , Cumulus , AllegroGraph , Virtuoso Universal Server  and RDF3x . As described in Chapter 3, SPARQL endpoints and RDF dumps can be used to make RDF data available. As we will see later, data can also be made available via APIs or using functionality provided by your chosen application framework.
5.8 Logic and presentation layers[edit | edit source]
Once the integrated data is available on-the-fly or in a triplestore, it can be used and accessed by the logic and presentation layers. As mentioned above, some of the logic may be implemented in the data layer by reasoning over the triplestore. Other forms of processing that cannot be implemented on the data later are carried out in the logic layer. As described in section 23 of Chapter 4, this may involve the application of statistical or machine learning processes to make inferences from the data. Other forms of business logic such as workflow are also implemented within this layer. Finally, the presentation layer displays the information to the user in various formats, including text, diagrams or other types of visualization techniques. A range of tools and formalisms for the display of information were outlined in Chapter 4.