If you’ve ever had to combine two datasets, you’ve probably faced the problem of making sense of the fields, values and even terms used. Now imagine scaling this up from datasets on your computer to data shared across your colleagues, team and organisation, then to state-wide or nationwide data, or even international datasets. Making sense of all of those datasets is a real challenge.
Source: Image courtesy of WIRED magazine.
At its essence, the issue is about enabling agreement on how to interpret and/or encode the data. This is much easier for an individual (assuming you are ok with agreeing with yourself). As more people and organisations become involved, the challenge grows more complex, and often involves dealing with data silos or variations of them.
It is in this context that the idea of “Linked Data” was proposed by the inventor of the World Wide Web, Tim Berners-Lee.
What is Linked Data?
Linked Data is a set of practices for publishing structured data so that it can be interlinked on the web and support cross-dataset queries, much like how the Web works today, except that it's for data instead of documents. Linked Data is a slice of the overall Semantic Web stack and relies on web standards such as HTTP, URIs, and RDF to link bits of information together.
Hypertext Transfer Protocol (HTTP) is the foundation of communication for the World Wide Web. It provides the protocol for applications and web servers to pass messages to each other. For example, web browsers submit HTTP requests to web servers and receive content back.
Uniform Resource Identifier (URI) schemes and Uniform Resource Locators (URLs) give identity (names) and help applications locate resources on the web. URIs and hyperlinks create web links.
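To make the roles of HTTP and URIs a little more concrete, here is a minimal Python sketch using only the standard library. The resource URI is a hypothetical example.org address chosen for illustration, and the request is only built, not sent. Asking for Turtle through the Accept header is one common way a client can signal that it wants data rather than a web page, though what a given server supports will vary.

```python
from urllib.parse import urlparse
from urllib.request import Request

# Hypothetical resource URI, used only for illustration.
resource_uri = "http://example.org/people/Bob"

# A URI breaks down into parts such as the scheme, the host and the path.
parts = urlparse(resource_uri)
print(parts.scheme, parts.netloc, parts.path)

# Build (but do not send) an HTTP GET request for that URI.
# The Accept header asks the server for an RDF serialisation (Turtle).
request = Request(resource_uri, headers={"Accept": "text/turtle"})
print(request.get_method(), request.full_url, request.get_header("Accept"))
```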
Resource Description Framework (RDF), a W3C standard, is used as a worldwide lingua franca to express information and relationships. RDF allows statements to be expressed as triples, which follow a subject-predicate-object form. URIs provide unambiguous identifiers for the resources that appear as subjects, predicates, and objects.
e.g. Bob (subject) is a type of (predicate) Person (object). In Turtle, assuming a hypothetical example.org URI for Bob:
<http://example.org/Bob> rdf:type <http://schema.org/Person> .
A collection of RDF statements or triples is called a graph. Graphs are often stored and queried in RDF triple stores.
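As a rough illustration of the triple and graph ideas, the sketch below models a graph as a plain Python set of (subject, predicate, object) tuples and queries it with a simple pattern match. This is not how a real triple store is implemented. The example.org URIs and the match helper are assumptions made for this example, and the distinction between URIs and literal values is glossed over.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "http://schema.org/"
EX = "http://example.org/"  # hypothetical namespace for illustration

# A graph is a collection of (subject, predicate, object) triples.
graph = {
    (EX + "Bob", RDF_TYPE, SCHEMA + "Person"),
    (EX + "Bob", SCHEMA + "name", "Bob"),
    (EX + "Alice", RDF_TYPE, SCHEMA + "Person"),
    (EX + "Bob", SCHEMA + "knows", EX + "Alice"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# "Which resources are of type Person?"
people = [s for s, _, _ in match(graph, predicate=RDF_TYPE, obj=SCHEMA + "Person")]
print(people)
```

Triple stores expose the same kind of pattern matching through a query language such as SPARQL, with indexing that makes it practical on large graphs.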
RDF provides a general data model which can be serialised into different formats such as Turtle, RDF/XML and JSON-LD. Serialisation in different formats allows the data to be used in a wider range of tools. Conversely, formats like JSON and CSV now have RDF profiles [3,4] and can be transformed into RDF to interlink and integrate with other graphs for enhanced data discovery and use. Using RDF and Linked Data principles, a graph of knowledge can be constructed that is web-enabled and web-scalable.
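The sketch below illustrates one data model being written out in several serialisations: it turns a small CSV table into triples, then writes those triples as N-Triples and as a simple JSON-LD document. It is a hand-rolled illustration using only the Python standard library, not an implementation of the CSV or JSON profiles cited above. The column names, the example.org namespace and the helper functions are all assumptions.

```python
import csv
import io
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "http://schema.org/"
EX = "http://example.org/people/"  # hypothetical namespace for illustration

csv_text = "id,name\nbob,Bob\nalice,Alice\n"


def rows_to_triples(text):
    """Map each CSV row to a typed resource with a name."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = EX + row["id"]
        triples.append((subject, RDF_TYPE, SCHEMA + "Person"))
        triples.append((subject, SCHEMA + "name", row["name"]))
    return triples


def to_ntriples(triples):
    """Write one triple per line; assumes single-line literal values."""
    lines = []
    for s, p, o in triples:
        if o.startswith("http://"):
            obj = f"<{o}>"  # treat as a URI
        else:
            # Minimal escaping of backslashes and quotes for a string literal.
            obj = '"' + o.replace("\\", "\\\\").replace('"', '\\"') + '"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


def to_jsonld(triples):
    """Group triples by subject; only literal-valued properties are handled."""
    nodes = {}
    for s, p, o in triples:
        node = nodes.setdefault(s, {"@id": s})
        if p == RDF_TYPE:
            node["@type"] = o
        else:
            node[p] = o
    return json.dumps({"@graph": list(nodes.values())}, indent=2)


triples = rows_to_triples(csv_text)
print(to_ntriples(triples))
print(to_jsonld(triples))
```

In practice an RDF library would normally handle parsing and serialisation. The point here is only that the same triples underlie each format.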
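Tim Berners-Lee outlined four principles for Linked Data:
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.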
How can Linked Data help?
Linked Data vocabularies
Linked Data approaches and provenance standards and ontologies can be used to build the foundations for capturing, recording and analysing provenance. These can be used in applications and domains that need to represent, exchange, and integrate provenance information generated in different systems and under different contexts. For example, the OzNome team is currently working to embed provenance in scientific workflows for the Biodiversity Baseline Assessments project, to enable reproducibility and traceability of indicators, their inputs and workflow runs.
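As a sketch of what provenance expressed as Linked Data might look like, the Python example below records a workflow run using three properties from the W3C PROV-O ontology (prov:wasGeneratedBy, prov:used and prov:wasDerivedFrom) and then traces an indicator back to everything upstream of it. The resource names, the example.org namespace and the trace_inputs helper are invented for illustration and do not reflect the OzNome team's actual workflows or data.

```python
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/workflow/"  # hypothetical namespace for illustration

# Provenance statements as (subject, predicate, object) triples.
provenance = [
    (EX + "indicator-1", PROV + "wasGeneratedBy", EX + "run-42"),
    (EX + "run-42", PROV + "used", EX + "survey-data"),
    (EX + "run-42", PROV + "used", EX + "habitat-map"),
    (EX + "habitat-map", PROV + "wasDerivedFrom", EX + "satellite-imagery"),
]


def trace_inputs(entity, triples):
    """Follow provenance links backwards and list every upstream resource."""
    links = {PROV + "wasGeneratedBy", PROV + "used", PROV + "wasDerivedFrom"}
    found, queue = [], [entity]
    while queue:
        current = queue.pop(0)
        for s, p, o in triples:
            if s == current and p in links and o not in found:
                found.append(o)
                queue.append(o)
    return found


lineage = trace_inputs(EX + "indicator-1", provenance)
for resource in lineage:
    print(resource)
```

Because each run, input and indicator has its own URI, provenance recorded by different systems can in principle be merged into one graph and traced in the same way.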
Spatial Linked Data