This is a “how to” guide for exposing your data in RDF in the Open PHACTS system. The guidelines build upon [[Marshall2012]]. For an extensive explanation on RDF see the Introduction section below.


The research leading to these results has received support from the Innovative Medicines Initiative (IMI) Joint Undertaking under grant agreement n° 115191, resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in kind contribution.

Intended audience

These RDF-guidelines are intended for data providers who want to expose their data as RDF to the Open PHACTS platform.

HTML Source code

The HTML source code and full revision history of this document can be found in this git repository.


There are many sides to making data semantic. This guidelines document restricts itself to using RDF and will not go into ontological discussions, such as when to use a class or an instance. The document is also limited to giving pointers and some rules of thumb, and the reader is invited to consult the further reading listed below.

The most important message is that the goal of using RDF is not to find the best representation for your data, but to be explicit about how you represent your data.

General Principles

Open PHACTS requires:

  1. Resource Description Framework (RDF) to be used [[RDFPrimer]]
  2. Every primary concept should be typed, and have a label (we recommend rdfs:label and skos:prefLabel), including language specification
  3. Specific relations, like interaction data, modeled as concepts (rather than predicate) do not require a label
  4. Ontologies used for classes and predicates must be openly available
  5. A VoID specification of your data (see Step 9)

Open PHACTS does not specify requirements or guidelines around:

Step 0: determine who owns the copyright of the data and under what license you are sharing it

Before you start thinking about converting something into RDF, you should ask yourself two questions:

  1. who owns the data (if anyone), and
  2. under what license or waiver can you modify and reshare the data.
This is important to ensure you have the permission to convert it into RDF and share that version with others.

Because this information is also important for everyone who wants to use your data, you must provide it as metadata along with the shared data. This step does not imply that the data must be Open, but it does simplify a lot of things when it is. The least you must do is to provide clarity as to whether the data is Open or not.

VoID should be used to encode license information for data sets to be used in Open PHACTS [[OPSDD]]. The Dublin Core ontology [[Nilsson2008]] is a common alternative, and the given example is purely illustrative for encoding license information in RDF:
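As a rough illustration of the kind of license statement meant here, the following Python sketch prints a Turtle fragment that types a data set as a void:Dataset and attaches a license using dcterms:license. The data set IRI and the choice of the CC0 waiver are placeholders assumed for this example; [[OPSDD]] remains the authoritative source on which properties Open PHACTS expects.

```python
# Minimal sketch: emit a Turtle fragment stating the license of a data set.
# The data set IRI and license IRI used below are illustrative placeholders.

PREFIXES = (
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
    "@prefix void: <http://rdfs.org/ns/void#> .\n"
)


def license_turtle(dataset_iri: str, license_iri: str) -> str:
    """Return Turtle typing the data set and linking it to its license."""
    return (
        f"{PREFIXES}\n"
        f"<{dataset_iri}> a void:Dataset ;\n"
        f"  dcterms:license <{license_iri}> .\n"
    )


if __name__ == "__main__":
    print(license_turtle(
        "http://example.org/dataset/compounds",  # hypothetical data set IRI
        "http://creativecommons.org/publicdomain/zero/1.0/",  # CC0, as an example
    ))
```

Keeping the license in the same file as the data set description means that anyone who obtains the description also obtains the terms of reuse.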

Step 1: think in terms of meaning, rather than structure

When creating triples from your data, it is important to think about the data in terms of concepts and their relations in scientific terms, not in terms of database terminologies. The triples must in no way reflect concepts like database tables or other details that originate from the format in which the data was previously stored.

The following code example therefore shows bad practice. This generated example RDF shows a compound database, listing molecules, their synonyms, molecular weight and SMILES representations. The RDF output reflects the original data structure, and adds little useful meaning (i.e. the semantics) to the data.

Input tables with header on the first line, for names, smiles, and properties:

1:acetyl salicylic acid


@prefix any23: <> .
@prefix rdfs: <> .
@prefix props: <> .
@prefix names: <> .
@prefix smiles: <> .

  any23:molid "1" ;
  any23:mw "180.1578" .

  any23:molid "2" ;
  any23:mw "151.1629" .

  any23:molid "1" ;
  rdfs:label "aspirin" .

  any23:molid "1" ;
  rdfs:label "acetyl salicylic acid" .

  any23:molid "2" ;
  rdfs:label "paracetamol" .

  any23:molid "1" ;
  any23:smiles "O=C(O)c1ccccc1(OC(=O)C)" .

  any23:molid "1" ;
  any23:smiles "CC(=O)Oc1ccccc1C(O)=O" .

  any23:molid "2" ;
  any23:smiles "O=C(Nc1ccc(O)cc1)C" .

Importantly, the notion of columns and rows in the RDF must be removed. Better would be:

@prefix any23: <> .
@prefix rdfs: <> .
@prefix compound: <> .

  rdfs:label "aspirin" ;
  rdfs:label "acetyl salicylic acid" ;
  any23:smiles "O=C(O)c1ccccc1(OC(=O)C)" ;
  any23:smiles "CC(=O)Oc1ccccc1C(O)=O" ;
  any23:mw "180.1578" .

  rdfs:label "paracetamol" ;
  any23:smiles "O=C(Nc1ccc(O)cc1)C" ;
  any23:mw "151.1629" .

One can notice that identifiers used in relational databases typically find a role in the URI of resources, as was used in this example too.
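The regrouping from rows to resources can be sketched in Python. The three lists below mirror the names, smiles and props tables of the example; the namespace in BASE is a hypothetical placeholder, and the serializer does no escaping, which is only acceptable because these particular values contain no quotes or backslashes.

```python
# Sketch: regroup row-oriented table data around the compound it describes.
from collections import defaultdict

# Rows as (molid, value) pairs, mirroring the names, smiles and props tables.
NAMES = [("1", "aspirin"), ("1", "acetyl salicylic acid"), ("2", "paracetamol")]
SMILES = [("1", "O=C(O)c1ccccc1(OC(=O)C)"), ("1", "CC(=O)Oc1ccccc1C(O)=O"),
          ("2", "O=C(Nc1ccc(O)cc1)C")]
PROPS = [("1", "180.1578"), ("2", "151.1629")]

BASE = "http://example.org/compound/"  # hypothetical namespace


def group_by_compound(tables):
    """Map each compound IRI to its list of (predicate, value) pairs."""
    compounds = defaultdict(list)
    for predicate, rows in tables:
        for molid, value in rows:
            # The database identifier ends up in the IRI, not in a triple.
            compounds[BASE + molid].append((predicate, value))
    return dict(compounds)


def to_turtle(compounds):
    """Serialize one Turtle block per compound (prefix declarations omitted)."""
    blocks = []
    for iri, pairs in compounds.items():
        body = " ;\n".join(f'  {p} "{v}"' for p, v in pairs)
        blocks.append(f"<{iri}>\n{body} .")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    tables = [("rdfs:label", NAMES), ("any23:smiles", SMILES), ("any23:mw", PROPS)]
    print(to_turtle(group_by_compound(tables)))
```

The table and row structure disappears in the output: what remains is one description per compound, keyed by an IRI built from the former database identifier.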

Also data types

Another important difference is that tables require an external schema to provide meaning. RDF is much more self-explanatory. Therefore, one must not only think about the structure, but also about data types. With this approach, we can further improve the semantic equivalent of the three tables:

@prefix any23: <> .
@prefix rdfs: <> .
@prefix compound: <> .
@prefix xsd: <> .

  rdfs:label "aspirin"@en ;
  rdfs:label "acetyl salicylic acid"@en ;
  any23:smiles "O=C(O)c1ccccc1(OC(=O)C)" ;
  any23:smiles "CC(=O)Oc1ccccc1C(O)=O" ;
  any23:mw "180.1578"^^xsd:float .

  rdfs:label "paracetamol"@en ;
  any23:smiles "O=C(Nc1ccc(O)cc1)C" ;
  any23:mw "151.1629"^^xsd:float .

Beware of making assumptions that are not true. For example, the enriched example above assumes that all labels are in English, which is not generally true for compound databases.
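One way to keep data types and language tags explicit during conversion is to derive the literal form from the value itself. The helper below is a minimal sketch with a hypothetical name, turtle_literal; it assumes the xsd prefix is declared in the output and only adds a language tag when the caller actually knows the language.

```python
# Sketch: choose a Turtle literal form from the Python type of a value.


def turtle_literal(value, lang=None):
    """Return a Turtle literal; numbers and booleans get an XSD datatype."""
    if isinstance(value, bool):  # checked before int, since bool is a subclass of int
        return f'"{str(value).lower()}"^^xsd:boolean'
    if isinstance(value, int):
        return f'"{value}"^^xsd:integer'
    if isinstance(value, float):
        return f'"{value}"^^xsd:float'
    # Escape backslashes first, then double quotes and newlines.
    text = (str(value).replace("\\", "\\\\")
            .replace('"', '\\"').replace("\n", "\\n"))
    # Only add a language tag when the language is actually known.
    return f'"{text}"@{lang}' if lang else f'"{text}"'
```

For instance, turtle_literal("aspirin", "en") yields a language-tagged label, while a SMILES string passed without a language stays a plain literal, since a SMILES string is not in any human language.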

Step 2: what are the concepts in your data?

The first step in creating your RDF is to create a list of all concepts that are found in your data. Are there proteins, metabolites, cell cultures, organisms, targets, assays? At what level are those concepts represented in your data? Are they protein names without exactly known point mutations? Are they accurate masses resulting from metabolomics experiments? Are the exact metabolic structures known? Are there multiple identifiers known? Does your data contain references with a PubMed identifier or a DOI? In this step you define the types, rather than the entities: you observe that they are proteins, but do not enumerate each of them.

The purpose here is not to develop an ontology, but to get a clear picture of the content of your data, allowing you to identify existing ontologies (see Step 4) that capture that information. To support this process, a human readable label and a short definition must be provided for each concept found in your data, both in English. Here too, the underlying rule is that everything must have an explicit and well-defined meaning. This list may be kept in a Word document or an Excel spreadsheet, but also in RDF itself, for example using SKOS [[SKOS]]. Whichever form is chosen, it must help you think clearly about the concepts.
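If the list is kept in RDF, a SKOS record per concept could be generated roughly as follows. This is a sketch only: the namespace in BASE and the way the local name is derived from the label are assumptions made for the example, and the skos prefix is assumed to be declared elsewhere in the output.

```python
# Sketch: record one entry of the concept list as a SKOS concept in Turtle.

BASE = "http://example.org/concept/"  # hypothetical namespace


def concept_turtle(label, definition):
    """Return a Turtle block with an English label and definition."""
    # Assumed naming rule: lower-case the label and replace spaces.
    local = label.strip().lower().replace(" ", "_")
    return (
        f"<{BASE}{local}> a skos:Concept ;\n"
        f'  skos:prefLabel "{label}"@en ;\n'
        f'  skos:definition "{definition}"@en .'
    )


if __name__ == "__main__":
    print(concept_turtle(
        "Activity",
        "Biochemical property that chemical entities exhibit in some experiment.",
    ))
```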

For example:

"Activity": Biochemical property that chemical entities exhibit in some experiment.

Step 3: what are the relations that link those concepts?

Once you know what concepts are found in your data, it is time to identify how those concepts are linked in your data set. These relations must be identified and listed too, in the same manner as in Step 2. These relations preferably have a verb form, making them easier to understand. For example, the predicate label "has name" is preferred over just "name".

For each relation, the list should provide a human readable label and a short definition, again in English. Again, the method of recording the list of relations and properties must be chosen so that it helps you think clearly about the information in the data.

For example:

"has sequence": Protein property linking an amino acid sequence to a protein.
"binds to": Relation between a drug and a drug target reflecting a chemical interaction.

As in Step 2, you focus on the types only, not on actual binding interactions, etc.

Step 4: identify common vocabularies matching your concepts and relations.

Because existing software already knows about existing, common ontologies, you should use those ontologies if you care about having an impact. This section lists a number of suggestions below for the various data types that will be covered in Open PHACTS. You should explore those ontologies and check whether you find matching entries for each concept and relation. If you find that only a small number of items are missing, you should contact the ontology authors and see if the missing terms can be added. Only if that fails should you look for less common ontologies and see if these provide substantially higher coverage. Services that allow you to find uncommon ontologies are listed below. You can use the SKOS vocabulary to express relatedness to existing concepts.

As an alternative, rdfs:subClassOf should be discussed here.

It is OK to keep a number of entries not mapped to existing ontologies. In this case the entries have to be made openly available, i.e at This of course also applies for creating a new ontology.

In all cases, you must never use ontologies that you are not allowed to share with your data, as that will effectively leave you with triplified data, of which your users have no means in the future to figure out what is what, and thus is "meaningless".

Importantly, the following two resources must be watched with respect to recommended vocabularies: first, Open PHACTS project deliverables, such as D1.6 and D1.7 [[D16D17]]; second, the Open PHACTS project on BioPortal. These resources take precedence over the following suggestions.

Possible vocabularies and ontologies

Below is a list of pointers to ontologies and vocabularies related to the scope of Open PHACTS. For each ontology, the prefix or website is also given. Furthermore, a list of search engines is provided where further ontologies and vocabularies can be found.


For document identifiers use (in order, if existing)

  1. DOI -
  2. PubMed -
  3. PubMed Central -
  4. Webpage


This requires that the structures have been deposited with ChemSpider already. If not, then use in descending order of preference:

  1. InChI String
  2. InChI Key

The use of the CHEMINF ontology is encouraged for these identifiers [[Hastings2011]]. Also, you should register small molecule names with ChemSpider. Documentation how to deposit individual structures in ChemSpider can be found at Larger sets of compounds can be deposited as SD files, but if the purpose is to have those exposed via Open PHACTS, the Scientific Advisory Board should be contacted to give permission for that exposure.

Structures / Hierarchies

Genomic data




TextMining and Manual Annotations



Ontology search engines

The following search engines can be useful to find suitable ontologies. is a very generic ontology. It may not be directly applicable to life sciences data, but is adopted by major search engines, like It can be considered to use this ontology in addition to more detailed domain ontologies, and as such make your data more easily found. It has types for creative works, non-text objects, events, health and medical types, organization, persons, places, products, and reviews.

Step 5: linking out to other Linked Data

The next step is to explore what related data sets are available as Linked (Open) Data, and link out to those data sets. For example, if your data contains ChemSpider, ChEBI, ChEMBL, PubChem, DrugBank, KEGG, Uniprot, and PDB identifiers, you can link to the respective RDF variants of those databases. Various RDF versions of these databases are around, including Bio2RDF [[Belleau2008]], LODD [[Samwald2011]], and Chem2Bio2RDF [[Chen2010]], but links should preferably point to the original source directly. The figure below (CC-BY-SA, [[Cyganiak2011]]) shows a diagram of the larger network, including Linked Data relevant to Open PHACTS:

Careful consideration must be given here to which relation (predicate) is used. The table below outlines various options, their specific meaning, and how and when each predicate can be used. The Data Set guidelines specify Open PHACTS standards in detail [[OPSDD]] and are the definitive documentation, but here follows a general outline of predicates demonstrating the different implications they have:

rdfs:seeAlso General link that indicates that the linked resource is relevant to the subject.
skos:relatedMatch This link indicates that the linked resources are somewhat related.
skos:closeMatch This link indicates that the linked resources are the same under some assumptions or for some applications. This link is not transitive.
skos:exactMatch This link is a sub-property of skos:closeMatch but is stronger: the sameness now holds across a wide range of applications, implying that the link is transitive.
owl:sameAs Link that indicates that the subject and the object are both instances and denote the same resource. This link is transitive.
owl:equivalentClass The same as owl:sameAs, but for OWL classes instead of instances. This link is transitive.

The owl:sameAs and owl:equivalentClass predicates are very powerful and should be used with care since all attributes and relations of two therewith connected entities are merged together. In Open PHACTS the use of the less restrictive skos:exactMatch is recommended.
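The practical consequence of a transitive link can be made concrete with a small Python sketch. It treats a set of match links as symmetric and transitive, as SKOS defines skos:exactMatch, and lists every pair of resources that is thereby implied to match. The function name and the ex: identifiers are hypothetical, and this is a plain illustration, not how any particular reasoner is implemented.

```python
# Sketch: which matches follow from symmetric, transitive match links?
from itertools import combinations


def match_closure(links):
    """Return all unordered pairs implied by the given (a, b) match links."""
    groups = []  # disjoint sets of resources known to match each other
    for a, b in links:
        merged = {a, b}
        rest = []
        for group in groups:
            if group & merged:
                merged |= group  # the new link connects this group
            else:
                rest.append(group)
        groups = rest + [merged]
    return {frozenset(pair)
            for group in groups
            for pair in combinations(sorted(group), 2)}


if __name__ == "__main__":
    asserted = [("ex:a", "ex:b"), ("ex:b", "ex:c")]
    for pair in sorted(sorted(p) for p in match_closure(asserted)):
        print(pair)
```

Two asserted links already imply a third match between ex:a and ex:c that nobody stated explicitly. With skos:closeMatch no such conclusion may be drawn, which is why the choice of predicate matters.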

Step 6: converting your data into RDF

Now that the first steps have ensured you have IRIs for all resources and predicates, and you know where to put all relations, it is time to create triples. Which tool is used is irrelevant to the triple creation process, so it is up to the user to pick whatever tool they find most convenient. Triples can be created with dedicated semantic web tools, as listed below, but also using simple regular expressions, or scripting tools in any language. Of course, generated triples should be validated, but the tool to create them is merely a tool; there is nothing semantic about that. The output in which the triples are serialized can be in any of the standardized or proposed RDF serialization formats, such as RDF/XML [[RDFXML]], Notation3 [[Notation3]], Turtle (preferred) [[Turtle]], or plain N-Triples [[NTriples]]. These guidelines neither encourage nor disallow named graphs; the user is free to use them, but it is not required.

Importantly, this process should be well documented. You must keep track of which version of the input data was used, who created the RDF data, when that was done, and preferably what tools were used. This information should be available to users along with the data itself. Provenance is really important in the process of creating RDF, and you should track in detail how the transformation was done. However, the exact guidelines for tracking provenance information are under development, and future Open PHACTS guidelines will document in detail how it is captured. The reader is referred to the W3C PROV Model Primer as reference for now [[Gil2012]].
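Until those guidelines are available, the four items named above (input version, creator, date, tools) could be recorded along the following lines. The sketch uses terms from the W3C PROV ontology plus dcterms:description; the choice of these particular properties, the function name and all IRIs in the usage example are assumptions for illustration, not an Open PHACTS requirement.

```python
# Sketch: record basic provenance of a conversion run as Turtle.
from datetime import datetime, timezone


def provenance_turtle(dataset_iri, source_iri, creator_iri, tool, when):
    """Describe what the RDF was derived from, by whom, when, and with what tool."""
    # 'when' should be a timezone-aware datetime; it is normalized to UTC.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
        "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n\n"
        f"<{dataset_iri}>\n"
        f"  prov:wasDerivedFrom <{source_iri}> ;\n"
        f"  prov:wasAttributedTo <{creator_iri}> ;\n"
        f'  prov:generatedAtTime "{stamp}"^^xsd:dateTime ;\n'
        # The tool name is inserted unescaped; keep it free of quotes.
        f'  dcterms:description "Converted with {tool}"@en .\n'
    )


if __name__ == "__main__":
    print(provenance_turtle(
        "http://example.org/dataset/compounds-rdf",    # hypothetical
        "http://example.org/dataset/compounds-v1.2",   # hypothetical input version
        "http://example.org/people/converter",         # hypothetical
        "convert.py 0.1",                               # hypothetical tool name
        datetime.now(timezone.utc),
    ))
```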

Update the above paragraph to point to specifications how Open PHACTS decides to expose provenance information.

Because the Open PHACTS GUI is human language oriented, all entities in the data must be associated with a human readable label. It is important that for all texts, like labels and definitions, the language it is represented in is explicitly identified. For example (not a full RDF serialization):

ex:methane rdfs:label "methaan"@nl .
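Whether every entity carries a language-tagged label is easy to check mechanically. The sketch below assumes the triples are available in memory as (subject, predicate, object) tuples of strings, with objects written as Turtle literals; the function name is hypothetical.

```python
# Sketch: find subjects that lack an rdfs:label with an explicit language tag.
import re

# A quoted string followed by a language tag such as @nl or @en-GB.
LANG_TAGGED = re.compile(r'^".*"@[A-Za-z]+(-[A-Za-z0-9]+)*$', re.DOTALL)


def subjects_without_tagged_label(triples):
    """Return, sorted, all subjects that have no language-tagged rdfs:label."""
    subjects = {s for s, _, _ in triples}
    labelled = {s for s, p, o in triples
                if p == "rdfs:label" and LANG_TAGGED.match(o)}
    return sorted(subjects - labelled)


if __name__ == "__main__":
    data = [
        ("ex:methane", "rdfs:label", '"methaan"@nl'),
        ("ex:ethane", "rdfs:label", '"ethane"'),  # label present, language missing
    ]
    print(subjects_without_tagged_label(data))
```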

Occasionally, there are concepts in your data that do not have labels in the data source. For example, the interaction between two proteins or the property of a molecule. Labels like "The interaction between protein A and protein B" and "The molecular weight of molecule A" can be autogenerated. If this label follows implicitly from the semantic typing of the entities and relations between those entities, then a label may be omitted. An example is the molecular weight property in the next section.

Why sometimes relationship labels may be omitted

Relations can be modeled as either a predicate or a concept. In the former case it is commonly the relation type that is represented by a predicate. For example, "binds to" clearly has a label. But if the relationship is a unique one, for example, a specific binding affinity with which you want to associate further information, you would commonly model this as a concept instead of a predicate. In such cases, you can omit the label, which is favored over labels like "an affinity between target X and compound Y".

No blank nodes

Blank nodes must not be used, following for example the Banff Manifesto [[banffmanifesto]]; each concept or thing should have a unique IRI, which may be similar to those of more principal resources. For example, the following CHEMINF example describes a molecule with one of its properties, where the property is an instance itself and has an IRI quite similar to that of the molecule it characterizes:

ex:m1 cheminf:CHEMINF_000200 ex:m1/full_mwt .
ex:m1/full_mwt a cheminf:CHEMINF_000198 .
ex:m1/full_mwt cheminf:SIO_000300 "341.748" .
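A converter can follow this pattern by deriving the IRI of the property instance from the IRI of the molecule, so that no blank node is ever needed. The sketch below writes the same three statements as N-Triples with full IRIs. Both function names are hypothetical, and the vocabulary namespace is passed in as a parameter because the prefix IRIs are not spelled out here; the value used in the usage example is a placeholder, not the real CHEMINF namespace.

```python
# Sketch: mint an IRI for a property instance instead of using a blank node.

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def property_iri(resource_iri, local_name):
    """Derive a stable IRI for a property instance from its parent resource."""
    return resource_iri.rstrip("/") + "/" + local_name


def molecular_weight_ntriples(molecule_iri, value, ns):
    """Return the three statements of the example as N-Triples lines."""
    mw = property_iri(molecule_iri, "full_mwt")
    return [
        f"<{molecule_iri}> <{ns}CHEMINF_000200> <{mw}> .",
        f"<{mw}> <{RDF_TYPE}> <{ns}CHEMINF_000198> .",
        f'<{mw}> <{ns}SIO_000300> "{value}" .',
    ]


if __name__ == "__main__":
    # Placeholder namespace; substitute the namespace of the ontology you use.
    for line in molecular_weight_ntriples(
            "http://example.org/m1", "341.748", "http://example.org/vocab/"):
        print(line)
```

Because the derived IRI is deterministic, rerunning the conversion yields the same identifier for the same property, which a blank node cannot guarantee.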

Other considerations

At this moment, no further restrictions on the RDF triple structure are made, but it may be useful to read up on the considerations others have made in the past regarding URI patterns, label names, etc. For example, the OBO Foundry wrote up a series of principles which provide an interesting read into what the consequences can be of practices you adopt.

Tools available for triple generation

Below is a brief overview of tools that may assist the triple generation.


Description: Sesame is a Java framework for handling RDF data. It includes functionality for parsing, storing, inferencing and querying of RDF data. Development is supported by the Dutch company Aduna.
Audience: Java programmers

Jena Semantic Web framework

Description: Jena is another Java framework for building Semantic Web applications, originally developed by HP, now under the Apache umbrella. It provides an environment for handling RDF, RDFS, OWL and SPARQL.
Audience: Java programmers


Description: Tripliser is a Java library and command-line tool for creating triple graphs from XML. It is particularly suitable for data exhibiting any of the following characteristics: messy - missing data, badly formatted data, changeable structure; bulky - large volumes; and volatile - ongoing changes to data and structure.
Audience: Java programmers


Description: A Ruby library for working with RDF data.
Audience: Ruby programmers


Description: A tool that can convert anything to triples, supporting microformats, RDFa, Microdata, RDF/XML, Turtle, N-Triples and NQuads.
Audience: Everybody


Description: An Eclipse plugin [[Marx2013]].
Audience: Everybody

Step 7: validate your triples

While dedicated semantic web tools make it hard to introduce syntactic errors, it is still possible to make mistakes in the resulting RDF, and the generated triples should be validated.

There are various levels at which the data should be validated. First, it should be validated that the created syntax notation is correct, for which various online services are available. Remark: some encodings of special characters may pose problems and may have to be converted or replaced. One such validator tool is the W3C RDF Validation Service.

Second, it should be checked that the selected common ontologies are used correctly in the output. For example, predicates with literal ranges should indeed be used with literal values. A common example of misuse is using the wrong Dublin Core namespace [[Nilsson2008]]; there are two, both defining a dc:title predicate, but only one namespace should be used with literal values.

This also applies to the use of links as outlined in Step 5, where these linking predicates can make claims about the nature of resources. For example, skos:closeMatch implies that the subject and object resources are also SKOS concepts. That should not conflict with other triples.

One aspect here is that the resulting data should be verified for internal consistency. This is particularly important if the common ontologies used define relations (predicates) that specify what types of resources they link (RDF domain and range). Tools like Protégé and Pellet can be used for that.

Last but not least, the whole transformation should be unit tested. This testing can be done as part of this step, or after later steps. These tests make assertions regarding the number of resources in the RDF data, testing that they match those in the original data. Additionally, the tests should check that the anticipated RDF structure is accurately reflected in the triple data set. Unit tests can be expressed as SPARQL queries, and a query tool like Rasqal can then be used to see if the expected results are returned.
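For converters written as scripts, the resource-count assertion can also be made directly in the scripting language, as an alternative to SPARQL-based tests. The Python sketch below counts distinct subject IRIs in N-Triples output and compares that number with the number of source records. The sample output and identifiers are hypothetical, and the simple regular expression only handles IRI subjects at the start of a line.

```python
# Sketch: assert that every source record produced exactly one RDF resource.
import re

SUBJECT = re.compile(r"^<([^>]+)>\s")


def distinct_subjects(ntriples_text):
    """Return the set of subject IRIs found in N-Triples text."""
    subjects = set()
    for line in ntriples_text.splitlines():
        match = SUBJECT.match(line.strip())
        if match:  # blank lines and comments do not match
            subjects.add(match.group(1))
    return subjects


def check_resource_count(source_ids, ntriples_text):
    """Fail if the number of resources differs from the number of source records."""
    found = len(distinct_subjects(ntriples_text))
    expected = len(set(source_ids))
    assert found == expected, f"expected {expected} resources, found {found}"


# Hypothetical converter output for two source records.
OUTPUT = """\
<http://example.org/compound/1> <http://www.w3.org/2000/01/rdf-schema#label> "aspirin"@en .
<http://example.org/compound/1> <http://www.w3.org/2000/01/rdf-schema#label> "acetyl salicylic acid"@en .
<http://example.org/compound/2> <http://www.w3.org/2000/01/rdf-schema#label> "paracetamol"@en .
"""

check_resource_count(["1", "2"], OUTPUT)
```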

Open PHACTS validators

In addition to the tools mentioned above, the Manchester University team has developed a few validation webservices, including validators for RDF documents.

Tools available for validation

Below is a brief overview of tools that may assist the triple validation.

W3C RDF Validation

Description: Webpage that accepts raw triple content and URLs pointing to RDF documents.


Description: Command line utility.

        cat data.ttl | rapper -i turtle -t -q - . > /dev/null


Description: Command line utility.

        roqet --results=ntriples -i sparql -e 'CONSTRUCT WHERE { ?s ?p ?o } LIMIT 5' -D data.ttl

RDF Triple-Checker

Description: Website that spots common mistakes.


Description: A general-purpose command line utility for the semantic web.

Step 8: choose the methods with which people will access the data

There are various ways to make your data available for others to use:

These approaches are complementary, rather than mutually exclusive.

Linked Open Data requires the data to be linkable, and therefore that URIs are dereferenceable (see the Linked Open Data star system). Dereferenceable means that the IRIs identifying resources can be resolved, using the web infrastructure (domain names and web servers), to triples about those resources. For example, the following resource IRI for methane is dereferenceable:
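Whether a given IRI resolves to RDF can be checked with an ordinary HTTP request that asks for an RDF media type. The following Python sketch shows the idea; the function names are hypothetical, the IRI in the usage example is a placeholder rather than the methane IRI referred to above, and actually running dereference requires network access and a server that supports content negotiation.

```python
# Sketch: dereference a resource IRI, asking the server for Turtle.
from urllib.request import Request, urlopen


def rdf_request(iri, media_type="text/turtle"):
    """Build an HTTP GET request asking for an RDF representation of the IRI."""
    return Request(iri, headers={"Accept": media_type})


def dereference(iri, timeout=10):
    """Fetch the IRI and return (content type, body); needs network access."""
    with urlopen(rdf_request(iri), timeout=timeout) as response:
        return response.headers.get("Content-Type"), response.read().decode("utf-8")


if __name__ == "__main__":
    request = rdf_request("http://example.org/resource/methane")  # placeholder IRI
    print(request.get_method(), request.full_url, request.get_header("Accept"))
```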

However, because data used in Open PHACTS will be loaded into a central cache, all data must be available for bulk download. This means that all triples, including provenance, etc, are archived into a .zip or .tar.bz2 file and shared via an HTTP or FTP server, allowing others to download all triples and use them locally.
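Producing such an archive takes only a few lines in most scripting languages. As a sketch, the Python standard library can write a .tar.bz2 file as follows; the function name and the file names in the usage comment are hypothetical.

```python
# Sketch: pack RDF files into a bzip2-compressed tar archive for bulk download.
import tarfile
from pathlib import Path


def make_dump(files, archive_path):
    """Add each file to the archive under its base name and return the archive path."""
    with tarfile.open(archive_path, "w:bz2") as archive:
        for file_path in files:
            archive.add(file_path, arcname=Path(file_path).name)
    return archive_path


# Example call, assuming these files exist in the working directory:
# make_dump(["data.ttl", "void.ttl", "provenance.ttl"], "dataset-dump.tar.bz2")
```

Storing files under their base name keeps local directory structure out of the published archive.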

Additionally, a third option is highly recommended as a minimal way to make the triples accessible: via a SPARQL end point. Various tools are available for this purpose, including tools mentioned earlier for creating triples, such as Sesame and Jena. Both provide store functionality, including SPARQL functionality, but are primarily APIs, and can wrap around triple stores that scale better, such as Virtuoso and Owlim. A comparison of some triple stores was done by the FU Berlin and can be found here, but we also note that performance depends strongly on your use case [[Erling2011]]. Information about the capacity of triple stores can be found at (link). We note that these statistics change every half year, and the reader is strongly encouraged to look up recent numbers.
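From the client side, a SPARQL end point is queried over HTTP. Under the SPARQL Protocol, a query can be sent with a GET request that carries the query text in the query parameter. The Python sketch below only builds such a request; the function name and end point URL are hypothetical, and sending the request would need a running end point.

```python
# Sketch: build a SPARQL Protocol GET request for a SELECT query.
from urllib.parse import urlencode
from urllib.request import Request


def sparql_get_request(endpoint, query):
    """Put the URL-encoded query in the 'query' parameter and ask for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    request = sparql_get_request(
        "http://example.org/sparql",  # hypothetical end point
        "SELECT * WHERE { ?s ?p ?o } LIMIT 5",
    )
    print(request.full_url)
```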

The list of tools that provide SPARQL end point functionality includes those below. Other overviews exist, like this one on

















Step 9: write up the metadata

Now that you know what data you started with, what the RDF looks like, how you generated it, and how you make it available, you must document this provenance. In Open PHACTS, the Dataset Descriptions for the Open Pharmacological Space specification describes in detail how you are expected to do this [[OPSDD]]. The key ontology used by this specification is VoID, the Vocabulary of Interlinked Datasets [[Cyganiak2011b]].

The data you should record includes but is not limited to:

Open PHACTS VoID validator

The Manchester University team has developed a validator specifically aimed at VoID provenance information, which verifies compliance with the Open PHACTS data set description specification [[OPSDD]].

Step 10: advertise your data (in Open PHACTS)

The final step in creating RDF is to advertise your RDF, so that it gets used and linked to. Various options can be considered, such as announcing the data on mailing lists, on the Open PHACTS website (or in the Open PHACTS weblogs), or through more traditional channels like presenting a poster at a conference.

As with conference posters, advertising RDF comes with certain requirements. Similar to the requirement that conference posters must be of a certain size, advertised RDF data sets must include, for example, license (or waiver) information (see Step 0), what ontologies are used (see Step 4), and their embedding in the Linked Open Data network (see Step 5). The VoID-encoded metadata from Step 9 can be reused.

Additionally, your data point should be registered with the appropriate registries. One of these is the Data Hub, formerly known as CKAN.

To expose data provided by a SPARQL end point as linked data, the PHP library Puelia can be used.

Step 11: compare your results with community standards

It is useful to compare your results at the end with what others have been doing. One option is to look at the Linked Open Data stars scheme.

Linked Open Data stars

As an additional feature, the steps are complemented with details on how each step addresses the requirements for Linked Open Data outlined by Berners-Lee [[BernersLee2006]] and popularized as the Linked Open Data star scheme by Hausenblas [[Hausenblas2012]]. These are provided as further information on the context of those steps, rather than as requirements resulting from these guidelines. In short, the stars have the following meaning according to Hausenblas:

★ make your stuff available on the Web (whatever format) under an open license
★★ make it available as structured data (e.g., Excel instead of image scan of a table)
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to identify things, so that people can point at your stuff
★★★★★ link your data to other data to provide context

To earn any stars, an open license is the first step, as it is in these guidelines (see Step 0). For most readers of this document, the second star comes for free: the data you are converting into RDF is most likely already in a structured format (see Step 1). The third star does not require your data to be RDF either, but does insist on using open standards (see Step 6). Therefore, once your data is converted into RDF, you have reached a three-star state. The fourth star requires the use of RDF (see Step 6), and also requires you to provide a linked data version of your RDF data (see Step 8). Linked Data is also about people linking to your data. Interlinking your RDF data with other linked data is rewarded with a fifth star (see Step 5).
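Because each star presupposes the ones before it, a self-assessment can be written as a cumulative count. The following Python sketch is only a reading aid for the scheme; the function and parameter names are hypothetical.

```python
# Sketch: count Linked Open Data stars cumulatively.


def lod_stars(open_license, structured, open_format, uses_uris, linked):
    """A star only counts if all earlier criteria are met as well."""
    stars = 0
    for met in (open_license, structured, open_format, uses_uris, linked):
        if not met:
            break
        stars += 1
    return stars


if __name__ == "__main__":
    # RDF under an open license, but not yet published as linked data.
    print(lod_stars(True, True, True, False, False))
```

Without an open license the count stays at zero regardless of the other criteria, which is why Step 0 comes first.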