Open Data Consultancy Final Report
This report presents the outcomes of the Open Data Consultancy study that Swirrl IT Limited was commissioned to carry out for the Scottish Government from September to November 2013.
‘Open Data’ is data that is accessible to anyone (usually via the internet), in a machine readable form, free of restriction on use. Adoption of this approach to information sharing is growing in the public sector, with anticipated benefits for transparency, efficiency of government and economic growth.
5. Pilot of linked open data publishing
The open data consultancy study included establishing Open Data Scotland (http://www.opendatascotland.org). This is a pilot data publishing website intended to demonstrate the possibilities of the linked open data approach and to investigate the feasibility and implications of applying this approach more widely to publishing Scottish Government Statistics.
The site incorporates:
- A selection of existing datasets published as linked data: the Scottish Index of Multiple Deprivation (SIMD) for 2004, 2006, 2009 and 2012; school establishments and associated data on school pupil and school leaver numbers; supporting geographical datasets including council areas, data zones, intermediate zones and postcodes
- A showcase website for SIMD, including visualisations and data navigation features, built on top of the linked data SPARQL endpoint (http://simd.opendatascotland.org/)
- A tutorial on using open data in a web context, based on the schools data incorporated in the site (http://schools.opendatascotland.org/)
- A 'technical' section of the site, aimed primarily at developers and data specialists, where the datasets can be explored, and accessed in machine readable forms via a range of Application Programming Interfaces (APIs). (http://data.opendatascotland.org/)
Our main objectives with the pilot were to consider the range of users of Scottish Government analytical services data in general, and of the selected datasets in particular. A particular emphasis was placed on potential new users of the data. Exploiting the possibilities offered by the linked data approach, we have experimented with new data presentation approaches to meet their needs.
5.1 Details of the approach
5.1.1 Overview
The starting point of the pilot was the selection of datasets to be incorporated. It was agreed to use a small representative selection of datasets that we could work with in depth.
The Scottish Index of Multiple Deprivation was chosen as it is already a widely used dataset with many applications. We chose to include datasets about schools in the pilot, because education data is another area with broad interest, and using school-based data required us to tackle data with a different structure to the majority of SNS datasets, which are mainly statistical data organised by council area or by data zone.
In addition we needed to include geographical reference data: we created linked data datasets of council areas, intermediate zones and data zones, the hierarchical relationships between them and the connection from each area to a representation of its boundary. We also included a dataset of postcodes, as these are popular and widely used as a geographical reference point. The postcode data was obtained from the Ordnance Survey and the Office for National Statistics. It includes the coordinates of postcode centroids and a link from each postcode to the various geographical regions that contain that postcode.
Via the pilot, we wanted to explore and demonstrate the following points:
- To provide data in linked data form, enabling queryable access to a collection of datasets
- To use the strengths of the web in presenting the data, by ensuring that all data collections and views have a persistent URL that can be linked to
- To show that it is straightforward to create dynamic visualisations that draw data live from the underlying linked data database
- To connect user-friendly presentations with access to the underlying data, so that others can create their own visualisations
- To provide learning materials and data access that encourage others to create their own uses of the data
In doing this, we aimed to meet the needs of each of the audience types discussed in the chapter 'Requirements for Effective Use of Open Data'.
5.1.2 Representing the data as linked data
The starting point of the work was to convert data from its original formats to '5 star' open data in RDF format. We followed the approach set out in the Linked Data Cookbook[26], as documented by the World Wide Web Consortium Government Linked Data Working Group. The Cookbook approach is summarised as follows:
"The 7 Best Practices for Producing Linked Data
1. Model the Data
2. Name things with URIs
3. Re-use vocabularies whenever possible
4. Publish human and machine readable descriptions
5. Convert data to RDF
6. Specify an appropriate license
7. Host the Linked Data Set Publicly and Announce it!"
Of these, steps 1, 2, 3 and 5 are the most onerous in terms of effort and required knowledge. We will describe each of the steps in more detail:
1. Modelling the data is essentially a process of understanding the structure of the data and deciding how to represent it in the triple based structure used in RDF. It involves identifying the most important entities in the data, what attributes these entities have, and how the entities are related to each other and to external entities that we might refer to but not describe in detail.
In carrying out this process, we drew on experience of common data structures and good ways of representing these in RDF. A very useful resource in this process is the online book 'Linked Data Patterns'[27] by Dodds and Davis, which documents a large number of frequently encountered data structures with notes on how to represent them as RDF.
2. In RDF all resources are assigned a URI (Uniform Resource Identifier). Therefore having decided what the important entities are in the data, we need to decide how to assign URIs to them. The most important decision at this point is to decide whether the entities of interest already have a maintained URI from a trusted source. If that is the case, it is generally good practice to use that identifier. In our case, we chose to use ONS identifiers for geographical areas and data.gov.uk identifiers for time intervals. This is an aspect of the 'linking' part of linked data, which makes it easy for users of the data to combine data from different sources.
If no suitable identifier already exists, we have to create our own. As explained in the Linked Data Cookbook, this needs to be done in a web domain that we control - in this case http://opendatascotland.org. In designing the format of these URIs, we followed the advice in the URI Patterns[28] document developed by the UK Linked Data Working Group.
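The two cases described in step 2 (re-using a maintained identifier, or minting one in our own domain) can be sketched as follows. This is an illustration only: the ONS geography URI prefix and the `/id/<kind>/<label>` path layout are assumptions made for the example, and `slugify`, `geography_uri` and `local_uri` are hypothetical helper names, not part of the pilot's code.

```python
import re

BASE = "http://opendatascotland.org"  # web domain named in the report
# Assumed prefix for ONS-maintained geography identifiers (illustrative).
ONS_GEOGRAPHY = "http://statistics.data.gov.uk/id/statistical-geography/"


def slugify(label: str) -> str:
    """Lower-case a label and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def geography_uri(area_code: str) -> str:
    """Re-use an externally maintained identifier instead of minting a new one."""
    return ONS_GEOGRAPHY + area_code


def local_uri(kind: str, label: str) -> str:
    """Mint an identifier in the publisher's own domain (hypothetical path layout)."""
    return f"{BASE}/id/{slugify(kind)}/{slugify(label)}"


# The school name below is a made-up placeholder.
example_school = local_uri("School", "St. Mary's Primary")
example_area = geography_uri("S01000001")
```

Keeping the URI construction in one place like this makes it easier to apply a single pattern consistently across datasets.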
3. Step 2 relates to identifying entities in the system. We also need to assign URIs to properties and classes that describe the interrelations of the main entities. These properties and classes are often referred to as a vocabulary or ontology. As with identifiers for entities, a first step is to consider whether a suitable vocabulary already exists in a trusted well-maintained form. If it does, it is good practice to re-use it as this makes it easier for users to understand the data and relate it to other data sources. When using an existing vocabulary, it is important to consider carefully if its documented meaning is a good match for our context. Existing terms should only be used if we accept that their documented meaning is the meaning we want to express. The Linked Data Cookbook lists some of the most commonly used vocabularies[29]. A useful resource for discovering existing vocabularies is the 'Linked Open Vocabularies' site[30] developed by the DataLift project.
A particularly important vocabulary for Open Data Scotland and for many statistical data applications is the RDF Data Cube[31], currently going through the W3C standardisation process.
4. A principle of linked data is that if you look up the identifier for an entity (often called a 'resource') then the system should return a description of that entity. Best practice is to provide both human readable descriptions in HTML form as well as machine readable descriptions, possibly in a variety of formats. The 'content negotiation' methods built in to the standard HTTP protocol[32] are used to specify which format a user wants. This linked data resource look-up approach is a standard feature of Swirrl's PublishMyData platform that we used as the basis of the pilot.
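As a sketch of how content negotiation looks from the client side, the snippet below prepares two requests for the same resource URI that differ only in the HTTP `Accept` header. The resource URI is a made-up placeholder and the set of formats a given server supports is an assumption; nothing is sent over the network here.

```python
from urllib.request import Request

# Hypothetical resource identifier, used only for illustration.
RESOURCE = "http://opendatascotland.org/id/example-resource"


def describe_request(resource_uri: str, media_type: str) -> Request:
    """Build a look-up request asking for one representation of a resource."""
    return Request(resource_uri, headers={"Accept": media_type})


# Same identifier, two representations: a web page and machine-readable RDF.
html_request = describe_request(RESOURCE, "text/html")
turtle_request = describe_request(RESOURCE, "text/turtle")
```

Passing either request to `urllib.request.urlopen` would perform the look-up; the server is expected to choose the representation from the `Accept` header.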
5. Once we have decided the structure of the data and the system of identifiers we want to use, we can go through the mechanics of converting the data into RDF. Our starting point for this project was a set of spreadsheets and CSV files. Our approach was to write simple scripts in the Ruby programming language, drawing on established code libraries that handle many of the details. These scripts read in the source data, work through each data point and output its RDF representation.
This kind of data processing script is reasonably straightforward to create for someone with modest programming skills. Supporting libraries for working with RDF exist in most programming languages.
Some graphical user interface driven tools exist for converting tabular data to RDF, though the process of configuring these tools is often just as complex as simple programming.
Swirrl, through the EU-funded 'OpenCube'[33] project, is working on user-friendly tools to support this data conversion process.
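The kind of conversion script described in step 5 can be sketched as below. The pilot's scripts were written in Ruby and used RDF libraries; this example uses Python with only the standard library, and it is not the pilot's actual code. The column names, the sample values, the dataset and measure URIs, and the area and year URI prefixes are all assumptions made for illustration. The RDF Data Cube and SDMX dimension terms are taken from the published vocabularies.

```python
import csv
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
# The four URIs below are hypothetical naming choices for this sketch.
DATASET = "http://opendatascotland.org/data/example-rank"
MEASURE = "http://opendatascotland.org/def/measure-properties/rank"
AREA = "http://statistics.data.gov.uk/id/statistical-geography/"
YEAR = "http://reference.data.gov.uk/id/year/"


def triple(subject, predicate, obj, literal_type=None):
    """Format one N-Triples line; obj is a URI unless literal_type is given."""
    if literal_type:
        # No escaping is done here: the sketch assumes purely numeric values.
        rendered = f'"{obj}"^^<{literal_type}>'
    else:
        rendered = f"<{obj}>"
    return f"<{subject}> <{predicate}> {rendered} ."


def rows_to_ntriples(csv_text):
    """Turn each CSV row into one Data Cube observation (five triples)."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        obs = f"{DATASET}/{row['year']}/{row['datazone']}"
        lines += [
            triple(obs, RDF_TYPE, QB + "Observation"),
            triple(obs, QB + "dataSet", DATASET),
            triple(obs, SDMX_DIM + "refArea", AREA + row["datazone"]),
            triple(obs, SDMX_DIM + "refPeriod", YEAR + row["year"]),
            triple(obs, MEASURE, row["rank"], XSD_INTEGER),
        ]
    return lines


# Made-up sample rows, not real SIMD values.
SAMPLE = "datazone,year,rank\nS01000001,2012,4000\nS01000002,2012,1500\n"
ntriples = rows_to_ntriples(SAMPLE)
```

A production script would normally delegate serialisation and literal escaping to an RDF library rather than formatting strings by hand.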
6. It is important that users of the system know what they are allowed to do with the data. Therefore a data publisher should always state a specific licence for any data they publish. In the case of Open Data Scotland, all Scottish Government owned data is released under the Open Government Licence[34]. The system also incorporates some data that belongs to the Ordnance Survey and in these cases the data is made available under the Ordnance Survey Open Data Licence[35].
7. The datasets we have created are hosted at the site http://www.opendatascotland.org, which consists of user-friendly navigation and visualisation pages as well as direct access to the underlying linked data via http://data.opendatascotland.org.
5.1.3 Structure of the system
The Open Data Scotland site is built on top of Swirrl's PublishMyData[36] linked data publishing platform. The basic structure of the system is illustrated in Figure 3. The core of the system is the triple store, which uses the open source Apache Jena system and holds the RDF data used by the site. The data in the triple store is accessed via SPARQL queries and used to create the dataset navigation and browsing facilities at http://data.opendatascotland.org and to extract data required for the visualisations at http://simd.opendatascotland.org. The geographical visualisations also make use of a set of geographical boundary files which are stored on the server in TopoJSON format. These TopoJSON files were created from original data in ESRI Shapefile format, provided by the Scottish Government. TopoJSON was chosen as a working format because it is compact and easy to use in our chosen visualisation libraries (see next section).
As well as being used internally to generate web pages and graphics, the SPARQL endpoint is also directly available to external users who want to run their own queries against the database.
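An external user's query against the endpoint might be prepared as in the sketch below, which follows the standard SPARQL protocol convention of passing the query text in a `query` parameter. The endpoint path `/sparql` is an assumption (the report does not state the exact address), and the query itself is only an example; the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint address; check the site documentation for the real one.
ENDPOINT = "http://data.opendatascotland.org/sparql"

# Example query: count Data Cube observations per dataset.
QUERY = """
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?dataset (COUNT(?obs) AS ?observations)
WHERE { ?obs a qb:Observation ; qb:dataSet ?dataset . }
GROUP BY ?dataset
LIMIT 10
"""


def sparql_request(endpoint: str, query: str) -> Request:
    """Build a GET request carrying the query and asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = sparql_request(ENDPOINT, QUERY)
```

Passing `request` to `urllib.request.urlopen` would run the query; the same pattern is what a browser-based visualisation does when it draws data live from the endpoint.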
The system incorporates an administration interface, used for creating and updating datasets and their metadata as well as gathering analytics on system use.
The triple store, administration interface, SPARQL endpoint and Linked Data browsing are all parts of the standard PublishMyData platform. The web pages and visualisations available at http://www.opendatascotland.org, http://simd.opendatascotland.org and http://schools.opendatascotland.org/ were designed and created specifically for this project.
Figure 3 High level system structure
5.1.4 Approach to creating visualisations
The visualisations in the site were created using HTML, CSS and Javascript. The main Javascript library used was D3 ('Data Driven Documents')[37]. The visualisations obtain their data by running SPARQL queries on the PublishMyData SPARQL endpoint.
The architecture of these visualisations is such that they could be hosted on any website - they use the publicly available query endpoint to get their data and they are created with open source libraries.
This illustrates how the Linked Data approach not only enables the data owner to create a data portal, but also allows third parties to build rich views of the data to suit their own purposes.
5.2 How could this approach be applied more widely?
The pilot incorporates only a small sample of the data available from the Scottish Neighbourhood Statistics site, together with other related education and geographical data.
What would be involved in applying this approach to the entire Scottish Neighbourhood Statistics site?
The overall SNS contains around 1000 indicators, many available for a range of time periods and in some cases at more than one level of geography.
The basic pattern applied in the pilot to representing the SIMD as linked data is well suited for extension to the whole of SNS. Because of the large number of indicators, it will be necessary to apply a reasonably high degree of automation to the process. This is certainly feasible and the process will benefit from the high degree of structural similarity between the different datasets.
The SNS data all fits well into the RDF Data Cube model. Each dataset will have a geography dimension and a time dimension and often has one or more further dimensions such as age range or gender.
Where an indicator is available for multiple time periods, then in general all time periods should be included in the same Data Cube dataset (as long as other dimensions and the method of indicator calculation remain the same across all time periods).
In many cases, several indicators should be combined into a single Data Cube dataset, where the same quantity is being reported but for several values of a dimension such as age range. In SNS for example, there are separate employment level indicators for age ranges 16-24, 25-34, 35-49 and 50-64, as well as a 'total' age range of 16+. These can be combined into a single employment levels dataset, with a dimension of age range.
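The combination described here can be sketched as follows: several per-age-range tables become one list of observations, each tagged with an age-range dimension value. The area keys and counts are made-up placeholders, and the age-range URI prefix is a hypothetical naming choice.

```python
# Hypothetical URI prefix for age-range dimension values.
AGE_RANGE_BASE = "http://opendatascotland.org/def/concept/age-range/"

# One table per age range, as separate indicators (placeholder values).
separate_indicators = {
    "16-24": {"area-a": 100, "area-b": 200},
    "25-34": {"area-a": 150, "area-b": 250},
}


def combine(indicators):
    """Merge per-age-range indicators into one set of observations."""
    observations = []
    for age_range, values in indicators.items():
        for area, value in values.items():
            observations.append(
                {
                    "area": area,
                    # The same URI is used wherever this age range appears.
                    "age_range": AGE_RANGE_BASE + age_range,
                    "value": value,
                }
            )
    return observations


cube = combine(separate_indicators)
```

Each observation now carries its age range as data, so a single dataset definition covers what were previously separate indicators.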
The total size of a linked data version of SNS is therefore likely to run to several hundred datasets. The number of triples in the system will be dominated by those indicators that are provided for the smallest (and hence most numerous) geographical areas - i.e. at data zone level. (A dataset at local authority level will be around 200 times smaller than a data zone dataset).
A data cube dataset typically includes around 5-8 triples per data point. Therefore for a data-zone based dataset, we can expect an average of around 40,000 triples per indicator per time period. If there are say 500 indicators at data zone level with an average of 5 time periods each, then we will have around 100,000,000 triples in the data store. This is well within the standard capabilities of established triple stores and so would not present significant technical challenges.
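The estimate can be checked with a short calculation. The figures of 40,000 triples, 500 indicators and 5 time periods come from the text above; the counts of 6,505 data zones and 32 council areas are assumptions added here (they are not stated in this report) to show that the per-indicator figure and the "around 200 times smaller" ratio are mutually consistent.

```python
# Figures quoted in the report.
TRIPLES_PER_INDICATOR_PERIOD = 40_000
INDICATORS = 500
PERIODS = 5

total_triples = TRIPLES_PER_INDICATOR_PERIOD * INDICATORS * PERIODS

# Assumed geography counts (not stated in the report).
DATA_ZONES = 6_505
COUNCIL_AREAS = 32

# 5-8 triples per data point, one data point per data zone.
low_estimate = DATA_ZONES * 5
high_estimate = DATA_ZONES * 8

# How much smaller a council-area dataset is than a data zone dataset.
size_ratio = DATA_ZONES / COUNCIL_AREAS
```

Under these assumptions the per-indicator range brackets the quoted 40,000, the total comes to 100,000,000 triples, and the size ratio is a little over 200.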
Establishing which patterns in the data can be automated will require a review of the indicators, dimensions and values, in order to develop a series of URI patterns that can be used consistently across the SNS data. This will involve a decision on the top level domain that the data should be contained in. The URI patterns should be designed taking into account possible future Scottish Government linked data publishing on other topics.
This analysis should include identifying where the same dimensions or dimension values appear in connection with multiple indicators and so can be re-used. For example the same set of age ranges is used for the Employment Level indicators and Employment Rate indicators (amongst others). A particular age range should be described using the same URI wherever it appears.
The pilot has produced geographical datasets for council areas, intermediate zones, data zones and postcodes. To cover the whole SNS will require a number of other geographical datasets to be established: health boards, wards, community health partnerships, parliamentary constituencies and community regeneration areas.
An important aspect of the data creation is providing metadata and documentation about the data. SNS already includes systematic metadata about indicators. As this is in a consistent form, it could be extracted from the original system and processed into a linked data form. It would be useful to consider how this could be improved and extended in future. There are opportunities to create or link to existing richer descriptions of the background of how data collection and data processing has been carried out. The administration system for managers of a new linked data SNS will need to incorporate a user interface for managing and editing metadata.
A system will be required for managing the data in the system, principally for statisticians to add updates as new data becomes available, but also to allow correction of errors, improvements to metadata and so on. This should be designed in consultation with the statisticians responsible for maintaining the data. Such a system could be designed to work in a similar way to the existing SNS data entry system, to minimise unnecessary changes in working patterns.
It would be possible, and highly desirable, to provide an API for adding and updating data (in addition to a user-interface driven process). This creates possibilities for greater automation of SNS data maintenance in future.
An important aspect to consider is how to design the site navigation to enable a user to find and access the data they are interested in. The pilot site does not tackle this navigation question in depth as it incorporates only a small number of datasets, although there are elements in the navigation design of the pilot that could form a useful part of a broader system. Generally speaking, we anticipate the need to allow data navigation both by theme and by geography. The system should allow both browsing and Google-style search.
It is not necessary and probably not practical to present all datasets in the SNS in the rich way that we have implemented for the SIMD data in the pilot. The majority of datasets could have a simpler, but still user friendly, presentation. However, the approach taken to mapping the SIMD in the pilot could be easily extended to other datasets and other geographies.
Given the richness of the data in SNS, there are many other opportunities for developing interesting visual ways of exploring and navigating the data and if the approaches demonstrated in the pilot were to be extended to the whole of SNS, it would be useful to investigate ideas in this area.
5.3 Assessing success of the pilot
At the time of writing of this report, Open Data Scotland has just gone live and become accessible to the public. To help decide whether to apply this approach more widely, it will be important to gather feedback on how users of the data react to the site: to find which aspects of it they find useful, which aspects might need modification or improvement and how users regard Open Data Scotland in comparison to existing data sources such as http://www.sns.gov.uk or the various spreadsheets that can be downloaded from http://www.scotland.gov.uk.
The user engagement process has therefore only just begun and it is too soon to report on user views. It is important that sufficient effort is devoted to identifying and communicating with users and potential users of the site.
User engagement will involve informing existing and potential users of SNS of the existence of the site, assisting them where necessary in getting started with using Open Data Scotland and gathering their feedback on what they like or don't like about the site.
Overall use of the system will be monitored using Google Analytics and PublishMyData API Analytics and regular reports provided to the Scottish Government.
Contact
Email: Sara Grainger