The research leading to these results has received
support from the Innovative Medicines Initiative
(IMI) Joint Undertaking under
grant agreement n° 115191, resources of which are composed of financial
contribution from the European Union's Seventh Framework Programme
(FP7/2007-2013) and EFPIA companies’ in kind contribution.
These RDF-guidelines are intended for data providers who want to expose their data as RDF to the Open PHACTS platform.
There are many sides to making data semantic. This guidelines document restricts itself to using RDF, and will not go into ontological discussions, such as when to use a class or an instance. The document also limits itself to giving pointers and some rules of thumb, and the reader is strongly encouraged to consult the further reading listed below.
The most important message is not to use RDF to find the best representation for your data, but to be explicit in how you represent your data.
Open PHACTS requires:
Somewhere, probably later in this specification, we should detail that relations (like interactions) can be modeled as classes as well as predicates.
Open PHACTS does not specify requirements or guidelines around:
Before you start thinking about converting something into RDF, the first two questions you should ask yourself are:
Because this information is also important for everyone who will want to use your data, you must provide these crucial pieces of information as metadata along with the shared data. This step does not imply that the data must be Open, but it does simplify a lot of things when it is. The least you must do is to provide clarity as to whether the data is Open or not.
VoID should be used to encode license information for data sets to be used in Open PHACTS [[Gray2012]].
The Dublin Core ontology [[Nilsson2008]] is a common alternative; the example given here is purely
illustrative for encoding license information in RDF:
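For example (with a hypothetical dataset IRI, and a CC0 waiver chosen purely for illustration):

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .

# Hypothetical dataset IRI; replace with the IRI of your own data set.
<http://example.org/dataset/pets>
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> .
```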
When creating triples from your data, it is important to think about the data in terms of concepts and their relations in scientific terms, not in terms of database terminologies. The triples must in no way reflect concepts like database tables or other details that originate from the format in which the data was previously stored.
So, the following code example shows bad practices. This generated example RDF shows a pet database, listing pets living in the same household in European capitals, including the food these pets eat. The RDF output reflects the original data structure, and adds little useful meaning (i.e. the semantics) to the data.
Input (with a header on the first line):
Pet;Species;Subspecies;Owner;Address;Food
Doger;Dog;Dachshund;Frank Smith;Mainstreet 1 London;Bones
This example needs to be replaced with one from the life sciences.
@prefix any23: <http://any23.org/tmp/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix csv: <http://vocab.sindice.net/csv/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

any23:col0 rdfs:label "Pet" ; csv:columnPosition "0"^^xsd:integer .
any23:col1 rdfs:label "Species" ; csv:columnPosition "1"^^xsd:integer .
any23:col2 rdfs:label "Subspecies" ; csv:columnPosition "2"^^xsd:integer .
any23:col3 rdfs:label "Owner" ; csv:columnPosition "3"^^xsd:integer .
any23:col4 rdfs:label "Address" ; csv:columnPosition "4"^^xsd:integer .
any23:col5 rdfs:label "Food" ; csv:columnPosition "5"^^xsd:integer .

<http://any23.org/tmp/row/0> a csv:Row ;
    any23:col0 "Doger"^^xsd:string ;
    any23:col1 "Dog"^^xsd:string ;
    any23:col2 "Dachshund"^^xsd:string ;
    any23:col3 "Frank Smith"^^xsd:string ;
    any23:col4 "Mainstreet 1 London"^^xsd:string ;
    any23:col5 "Bones"^^xsd:string .

<http://any23.org/tmp/> csv:row <http://any23.org/tmp/row/0> .
Importantly, the use of columns and rows in the RDF must be removed. Better would be:
@prefix any23: <http://any23.org/tmp/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

any23:Pet rdfs:label "Pet" .
any23:species rdfs:label "Species" .
any23:subspecies rdfs:label "Subspecies" .
any23:owner rdfs:label "Owner" .
any23:address rdfs:label "Address" .
any23:food rdfs:label "Food" .

any23:doger a any23:Pet ;
    rdfs:label "Doger"^^xsd:string ;
    any23:species "Dog"^^xsd:string ;
    any23:subspecies "Dachshund"^^xsd:string ;
    any23:owner "Frank Smith"^^xsd:string ;
    any23:address "Mainstreet 1 London"^^xsd:string ;
    any23:food "Bones"^^xsd:string .
The first step in creating your RDF is to create a list of all concepts that are found in your data. Are there proteins, metabolites, cell cultures, organisms, targets, assays? At what level are those concepts represented in your data? Are they protein names without exactly known point mutations? Are they accurate masses resulting from metabolomics experiments? Are the exact metabolite structures known? Are multiple identifiers known? Does your data contain references with a PubMed identifier or a DOI? In this step you define the types, rather than the entities: you observe that they are proteins, but do not enumerate each of them.
The purpose here is not to develop an ontology, but to get clear what the content of your data is, allowing you to identify existing ontologies (see Step 4) that capture that information. To support this process, for each concept found in your data a human readable label and a short definition must be provided, both in English. Here too, the underlying rule is that everything must have an explicit and well-defined meaning. This list may be kept in a Word document or Excel spreadsheet, but also in RDF itself, for example using SKOS [[SKOS]]. The format, however, should be chosen such that it helps you think about the concepts.
|"Activity"||Biochemical property that chemical entities exhibit in some experiment.|
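As a sketch, the "Activity" concept from the table could be recorded in SKOS along the following lines (the ex: namespace and concept IRI are hypothetical):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/concepts/> .  # hypothetical namespace

ex:Activity a skos:Concept ;
    skos:prefLabel "Activity"@en ;
    skos:definition "Biochemical property that chemical entities exhibit in some experiment."@en .
```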
Once you know what concepts are found in your data, it is time to identify how those concepts are linked in your data set. These relations must be identified and listed too, in the same manner as in Step 2. These relations preferably have a verb form, making them easier to understand. For example, a predicate label "has name" is preferred over just "name".
For each relation, the list should provide a human readable label and a short definition, again in English. Again, the method of recording the list of relations and properties should be chosen such that it helps you think about the information in the data.
|"has sequence"||Protein property linking an amino acid sequence to a protein.|
|"binds to"||Relation between a drug and a drug target reflecting a chemical interaction.|
Like in Step 2, you focus on the types only, not on actual binding interactions, etc.
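If you choose to record the relation list in RDF itself, a minimal sketch could look as follows (the ex: namespace and property IRIs are hypothetical):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/relations/> .  # hypothetical namespace

ex:hasSequence a rdf:Property ;
    rdfs:label "has sequence"@en ;
    rdfs:comment "Protein property linking an amino acid sequence to a protein."@en .

ex:bindsTo a rdf:Property ;
    rdfs:label "binds to"@en ;
    rdfs:comment "Relation between a drug and a drug target reflecting a chemical interaction."@en .
```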
Because existing software already knows about existing, common ontologies, you should use those ontologies if you care about having an impact. This section lists a number of suggestions below, for the various data types that will be covered in Open PHACTS. You should explore those ontologies and check whether you find matching entries for each of your concepts and relations. If only a small number of items are missing, you should contact the ontology authors and see if the missing terms can be added. Only if that fails should you look for less common ontologies and see if these provide substantially higher coverage. Services that allow you to find such ontologies are listed below. You can use the SKOS vocabulary to express relatedness to existing concepts.
As an alternative, rdfs:subClassOf should be discussed here.
It is acceptable to keep a number of entries unmapped to existing ontologies. In that case the entries have to be made openly available, e.g. at purl.org. The same of course applies to creating a new ontology.
In all cases, you must never use ontologies that you are not allowed to share with your data: that would effectively leave you with triplified data whose users have no means, in the future, to figure out what is what, making the data "meaningless".
Importantly, the following two resources must be watched with respect to recommended vocabularies: first, Open PHACTS project deliverables, such as D1.6 and D1.7 [[D16D17]]; second, the Open PHACTS project on BioPortal: http://bioportal.bioontology.org/projects/163. These resources take precedence over the following suggestions.
Below is a list of pointers to ontologies and vocabularies related to the scope of Open PHACTS. For each ontology, the prefix or website is also given. Furthermore, a list of search engines is provided where further ontologies and vocabularies can be found.
For document identifiers use (in order, if existing)
This requires that the structures have been deposited with ChemSpider already. If not, then use in descending order of preference:
The use of the CHEMINF ontology is encouraged for these identifiers [[Hastings2011]]. Also, you should register small molecule names with ChemSpider. Documentation on how to deposit individual structures in ChemSpider can be found at https://www.chemspider.com/Help_DepositStructures.aspx. Larger sets of compounds can be deposited as SD files, but if the purpose is to have those exposed via Open PHACTS, the Scientific Advisory Board should be contacted to give permission for that exposure.
The following search engines can be useful to find suitable ontologies.
The next step is to explore what related data sets are available as Linked (Open) Data,
and link out to those data sets. For example, if your data contains ChemSpider, ChEBI,
ChEMBL, PubChem, DrugBank, KEGG, UniProt, and PDB identifiers, you can link to the
respective RDF variants of those databases. Various RDF versions of these databases are
around, including Bio2RDF [[Belleau2008]], LODD [[Samwald2011]], and Chem2Bio2RDF [[Chen2010]],
but linking directly to the original source is preferred. The figure below (CC-BY-SA, [[Cyganiak2011]])
shows a diagram of the larger network, including Linked Data relevant to Open PHACTS:
Careful consideration must be given here to which relation (predicate) is used. The table below outlines various options, their specific meaning, and how and when each predicate can be used; the guidelines around the use of these predicates in Open PHACTS are under development, and will supersede this table when available.
|rdfs:seeAlso||General link that indicates that the resource linked to is relevant to the subject. See http://www.w3.org/TR/rdf-schema/.|
|skos:closeMatch||This link indicates that the linked resources are the same, under some assumptions or applications. This link is not transitive. See http://www.w3.org/2004/02/skos/core.html.|
|rdfs:subClassOf||This link indicates that the subject class is a subclass of the object class. See http://www.w3.org/TR/rdf-schema/.|
|owl:sameAs||This link indicates that the subject and the object are instances that identify the same resource. This link is transitive. See http://www.w3.org/TR/owl-ref/.|
|owl:equivalentClass||The same as owl:sameAs, but linking two classes whose members are the same. See http://www.w3.org/TR/owl-ref/.|
owl:equivalentClass predicates are very powerful and should be used with
care, since all attributes and relations of the two connected entities are merged
together. In Open PHACTS the use of the less restrictive skos:exactMatch is recommended.
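For example, a link-out from a local compound resource to ChEBI using the recommended skos:exactMatch could look like this (the ex: namespace is hypothetical; the ChEBI IRI is that of methane):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/compound/> .  # hypothetical namespace

ex:methane skos:exactMatch <http://purl.obolibrary.org/obo/CHEBI_16183> .
```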
Once these first steps have ensured that you have IRIs for all resources and predicates, and you know where to put all relations, it is time to create triples. The tool used is irrelevant to the triple creation process, and it is thus up to the user to pick whatever tool they find most convenient. Triples can be created with dedicated semantic web tools, as listed below, but also using simple regular expressions, or scripting tools in any language. Of course, generated triples should be validated, but the tool to create them is merely a tool; there is nothing semantic about that. The output in which the triples are serialized can be in any of the standardized or proposed RDF serialization formats, such as RDF/XML [[RDFXML]], Notation3 [[Notation3]], Turtle (preferred) [[Turtle]], or plain N-Triples [[NTriples]]. These guidelines neither encourage nor disallow named graphs; the user is free to use them, but they are not required.
Importantly, this process should be well documented. You must keep track of which versions of the input data were used, who created the RDF data, when that was done, and preferably what tools were used. This information should be available to users along with the data itself. Provenance is really important in the process of creating RDF, and you should track in detail how the transformation was done. However, the exact guidelines for tracking provenance information are under development, and future Open PHACTS guidelines will document in detail how it is captured. The reader is referred to the W3C PROV Model Primer as a reference for now [[Gil2012]].
Update the above paragraph to point to specifications how Open PHACTS decides to expose provenance information.
Because the Open PHACTS GUI is human language oriented, all entities in the data must be associated with a human readable label. It is important that for all texts, like labels and definitions, the language it is represented in is explicitly identified. For example (not a full RDF serialization):
ex:methane rdfs:label "methaan"@nl .
Occasionally, there are concepts in your data that do not have labels in the data source, for example, the interaction between two proteins or the property of a molecule. Labels like "The interaction between protein A and protein B" and "The molecular weight of molecule A" can be autogenerated. If such a label follows implicitly from the semantic typing of the entities and relations between those entities, then a label may be omitted. An example is the molecular weight property in the next section.
Blank nodes must not be used, following, for example, the Banff Manifesto [[banffmanifesto]]; each concept or thing should have a unique IRI, which may be similar to those of more principal resources. For example, the following CHEMINF example describes a molecule with one of its properties, where the property is an instance itself and has an IRI quite similar to that of the molecule it characterizes:
ex:m1 cheminf:CHEMINF_000200 ex:m1/full_mwt .
ex:m1/full_mwt a cheminf:CHEMINF_000198 .
ex:m1/full_mwt cheminf:SIO_000300 "341.748" .
Below is a brief overview of tools that may assist the triple generation.
Description: Sesame is a Java framework for handling RDF data. It includes functionality for parsing, storing, inferencing, and querying of RDF data. Development is supported by the Dutch company Aduna.
Audience: Java programmers
Description: Jena is another Java framework for building Semantic Web applications, originally developed by HP and now under the Apache umbrella. It provides an environment for handling RDF, RDFS, OWL, and SPARQL.
Audience: Java programmers
Description: Tripliser is a Java library and command-line tool for creating triple graphs from XML. It is particularly suitable for data exhibiting any of the following characteristics: messy - missing data, badly formatted data, changeable structure; bulky - large volumes; and volatile - ongoing changes to data and structure.
Audience: Java programmers
Description: A Ruby library for working with RDF data.
Audience: Ruby programmers
Description: A tool that can convert anything to triples, supporting microformats, RDFa, Microdata, RDF/XML, Turtle, N-Triples, and NQuads.
While dedicated semantic web tools make it hard to introduce syntactic errors, it is still possible to make mistakes in the resulting RDF, and the generated triples should be validated.
There are various levels at which the data should be validated. First, it should be validated that the created syntax notation is correct, for which various online services are available. Remark: some encodings of special characters may pose problems and may have to be converted or replaced. One such validator tool is the W3C RDF Validation Service, at http://www.w3.org/RDF/Validator/.
Second, the output should be checked to ensure the selected common ontologies are used correctly, for example, that predicates expecting literal values are indeed used with literals in the output. An example of common misuse is using the wrong Dublin Core namespace [[Nilsson2008]]: there are two, both defining a title predicate, but only one of them should be used with literal values.
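To illustrate the namespace pitfall: both Dublin Core namespaces define a title predicate, and it is easy to pick the wrong one. A sketch, with a hypothetical dataset IRI:

```turtle
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Both namespaces define a "title" predicate; check which one the
# ontology you selected expects, and use it consistently.
<http://example.org/dataset/pets> dc:title "Pet data set" .
```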
This also applies to the use of links as outlined in step 5, where these linking predicates can make claims of the nature of resources. For example, skos:closeMatch implies that the subject and object resources are also SKOS concepts. That should not conflict with other triples.
One aspect here is that the resulting data should be verified for internal consistency. This is particularly important if the common ontologies used define relations (predicates) that specify what types of resources they link (RDF domain and range). Tools like Protégé (http://protege.stanford.edu/plugins/owl/api/) and Pellet (http://clarkparsia.com/pellet/) can be used for that.
Last but not least, the whole transformation should be unit tested. This testing can be done as part of this step, or after later steps. These tests make assertions about the number of resources in the RDF data, checking that they match those in the original data. Additionally, the tests should verify that the anticipated RDF structure is accurately reflected in the triple data set.
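Such a unit test could, for example, run a SPARQL query against the generated triples and assert that the count matches the number of records in the original data (the class IRI is hypothetical):

```sparql
PREFIX ex: <http://example.org/concepts/>

# The test asserts that this count equals the number of
# data rows in the original input file.
SELECT (COUNT(?pet) AS ?count)
WHERE { ?pet a ex:Pet . }
```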
Below is a brief overview of tools that may assist the validation.
Description: Webpage that accepts raw triple content and URLs pointing to RDF documents.
Description: Command line utility.
Add details about the VoID validation tool being developed by the Manchester partners.
Look into cwm as validation tool.
There are various ways to make your data available for others to use:
Linked Open Data requires the data to be linkable, and therefore that IRIs are dereferencable (see the Linked Open Data star system). Dereferencable means that IRIs identifying resources can be resolved, via the web architecture (domain names and web servers), into triples about those resources. For example, the following resource IRI for methane is dereferencable:
However, because data used in Open PHACTS will be loaded into a central cache, all data must be available for bulk download. This means that all triples, including provenance, etc., are archived into a .zip or .tar.bz2 file, and shared via an HTTP or FTP server, allowing others to download all triples and use them locally.
Additionally, a third option is highly recommended as a minimal way to make the triples accessible: via a SPARQL end point. Various tools are available for this purpose, including tools mentioned earlier to create triples, such as Sesame and Jena. These both provide store functionality, including SPARQL functionality, but are primarily APIs, and can wrap around triple stores that scale better, such as Virtuoso and OWLIM. A comparison of some triple stores was done by the FU Berlin and can be found here, but we also note that performance depends strongly on your use case [[Erling2011]]. Information about the capacity of triple stores can be found at w3.org (link). We note that these statistics change every half year, and the reader is strongly encouraged to look up recent numbers.
The list of tools that provide SPARQL end point functionality includes those below. Other overviews exist, like this one on w3.org.
The final step in creating RDF is to advertise your RDF so that it gets used and linked to. Various options can be considered, such as announcing the data on mailing lists, on the Open PHACTS website (or in the Open PHACTS weblogs), or through more traditional channels like presenting a poster at a conference.
Like conference posters, advertising RDF comes with certain requirements. Similar to the requirement that conference posters must be of a certain size, advertised RDF data sets must include, for example, license (or waiver) information (see Step 0), what ontologies are used (see Step 4), and their embedding in the Linked Open Data network (see Step 5). For example, this can be done by providing a specification using VoID (Vocabulary of Interlinked Datasets [[Cyganiak2011b]]); VoID is the Open PHACTS selection, as specified in the Open PHACTS Identity Mapping Specification [[Gray2012]].
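Such a VoID specification could, as a minimal sketch with hypothetical IRIs, look like:

```turtle
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Hypothetical dataset IRI, download location, and end point.
<http://example.org/dataset/pets> a void:Dataset ;
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
    void:dataDump <http://example.org/downloads/pets.ttl.gz> ;
    void:sparqlEndpoint <http://example.org/sparql> .
```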
Additionally, your data should be registered with the appropriate registries. One of these is the Data Hub, formerly known as CKAN (http://thedatahub.org/).
To make data provided by a SPARQL end point available as linked data, the PHP library Puelia can be used.
It is useful to compare your results at the end with what others have been doing. One option is to look at the Linked Open Data stars scheme.
As an additional feature, the steps are complemented with details on how each step addresses the requirements for Linked Open Data outlined by Berners-Lee [[BernersLee2006]] and popularized as the Linked Open Data Star Scheme by Hausenblas [[Hausenblas2012]], which can be found at http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/. These are provided as further information on the context of those steps, rather than as requirements resulting from these guidelines. In short, the stars have the following meaning according to Hausenblas:
|★||make your stuff available on the Web (whatever format) under an open license|
|★★||make it available as structured data (e.g., Excel instead of image scan of a table)|
|★★★||use non-proprietary formats (e.g., CSV instead of Excel)|
|★★★★||use URIs to identify things, so that people can point at your stuff|
|★★★★★||link your data to other data to provide context|
To comply with any of the stars, the license is the first step, and in these guidelines too (see Step 0). For most readers of this document, the second star comes for free: the data you are converting into RDF is most likely already in a structured format (see Step 1). The third star does not require your data to be RDF either, but does insist on using Open Standards (see Step 6). Therefore, once your data is converted into some form of RDF, you have reached a three-star state. The fourth star requires the use of RDF (see Step 6), but also requires you to provide a linked data version of your RDF data (see Step 8). Linked Data is also about people linking to your data; the interlinking of your RDF data with other linked data is rewarded with a fifth star (see Step 5).