This is a specification for the metadata used to describe datasets, and the linksets that relate them, to enable their use within the Open PHACTS platform. The specification defines the metadata properties that are expected to describe datasets and linksets. Details of deploying dataset descriptions and an exchange format for linkset files are also given.

Disclaimer

The research leading to these results has received support from the Innovative Medicines Initiative (IMI) Joint Undertaking under grant agreement n° 115191, resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in kind contribution.

Intended audience

This dataset and linkset description specification is intended for data providers who want to publish their data as RDF, and link it to other datasets. A basic knowledge of RDF is assumed.

Details for converting a dataset and publishing it as RDF are given in the Open PHACTS RDF guidelines specification [[OPS-RDF]].

Introduction

The Open PHACTS platform [[OPS-ARCH]] relies on data and the interlinks published by a variety of sources. For example, details of chemicals are derived from ChemSpider, ChEMBL, and DrugBank. This specification provides details of the metadata expected to describe the datasets and the links that relate the instances in those datasets.

The Open PHACTS project has produced a set of guidelines aimed at data providers for publishing their data within the OPS system [[!OPS-RDF]]. The RDF guide provides details about modelling your data as RDF. This specification builds on the RDF Guidelines by defining the metadata that should be published to describe the dataset and the links to other datasets.

The dataset description defined in this specification declares the properties that should be included in the description of a dataset or its links. The information is exchanged using the Vocabulary of Interlinked Datasets [[VOID]]. The VoID Editor can be used to create dataset descriptions, although in general they should be generated as part of the data creation pipeline.

Document Conventions

RDF Syntax and Namespaces

All examples in this document are written in the Turtle RDF syntax [[TURTLE]]. Throughout the document, the following namespaces are used:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dul: <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix pav: <http://purl.org/pav/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix voag: <http://voag.linkedmodel.org/schema/voag#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
		

Furthermore, we assume that the empty prefix is bound to the base URL of the current file like this:

@prefix : <#> .

This allows us to quickly mint new identifiers in the local namespace.
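
As a minimal sketch of this convention, assume the description is served from a hypothetical file at http://example.org/void.ttl. The identifier :myDataset is invented for illustration only.

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Hypothetical identifier. With the document located at
# http://example.org/void.ttl, :myDataset resolves to
# <http://example.org/void.ttl#myDataset>.
:myDataset a void:Dataset .
```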

Vocabularies for Describing Datasets and Mapping Relationships

In this section, we introduce the vocabularies that will be used for capturing the dataset descriptions and the mappings between the datasets.

Vocabulary of Interlinked Datasets (VoID)

The Vocabulary of Interlinked Datasets (VoID) [[!VOID]] is a W3C interest group note that specifies a vocabulary for describing the metadata about a dataset and its relationship with other datasets. The vocabulary builds upon existing metadata vocabularies, e.g. Dublin Core Terms [[DCTERMS]], and captures four categories of metadata:

Dataset

The VoID specification [[VOID]] defines a dataset to be, "a set of RDF triples that are published, maintained or aggregated by a single provider." The dataset itself may contain logical subsets, which can be captured in the VoID description of the dataset, e.g. ChEMBL can be split into subsets for compounds, targets, etc.

The information captured about a dataset focuses on the general metadata, the access metadata, and the structural metadata.

Linksets

The VoID specification [[VOID]] defines a link to be, "an RDF triple whose subject and object are described in different datasets." A link captures the mapping from an identifier in one dataset to an identifier in another dataset. VoID is agnostic about which relationship is used; possible predicates are given in the next section.

The VoID specification defines a linkset to be, "a collection of such RDF links between two datasets." The linkset captures details of the links, i.e. the datasets that are linked and the relationship, as well as the metadata associated with the links, e.g. provenance information about who created the mapping and the specific versions of the datasets related. The VoID specification enables a separation between (1) the datasets involved in a linkset, and (2) who publishes the linkset.

Note that a VoID linkset is defined to link two datasets via a single link predicate (void:linkPredicate). As such, there can exist multiple linksets relating the same pair of datasets, as illustrated in the figure below. The figure depicts four distinct linksets: two sourced from ChemSpider depicted in blue which use different link predicates; one sourced from ChEMBL depicted in red; and one sourced from a third party depicted in green. Each of the linksets uses a different link relationship. Those shown with a double arrow head are symmetric while those with just a single arrow are directional links.

Depiction of links relating ChemSpider and ChEMBL
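
The situation of several linksets relating the same pair of datasets can be sketched as follows. This is an illustration under assumptions, not one of the actual ChemSpider or ChEMBL linksets: all identifiers in the local namespace are hypothetical, and only the properties that distinguish the two linksets are shown.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Hypothetical dataset identifiers.
:chemspider a void:Dataset .
:chembl a void:Dataset .

# Two linksets over the same pair of datasets,
# each with exactly one link predicate.
:chemspider_chembl_exact a void:Linkset ;
    void:subjectsTarget :chemspider ;
    void:objectsTarget :chembl ;
    void:linkPredicate skos:exactMatch .

:chemspider_chembl_close a void:Linkset ;
    void:subjectsTarget :chemspider ;
    void:objectsTarget :chembl ;
    void:linkPredicate skos:closeMatch .
```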

Expressing Mapping Relationships

A mapping is expressed as a VoID link, i.e. it is an RDF triple that relates an identifier in one dataset with an identifier in another dataset with some predicate which provides the meaning of the mapping. A justification for the mapping, e.g. two chemical compounds are deemed equivalent as they have the same InChI key, is expressed with the link in the linkset metadata.

Mapping Vocabularies

The mapping predicate captures the way in which the two identifiers are related. The mapping should respect the semantics of the relationship, e.g. the owl:sameAs relationship must only be used when the two identifiers are completely interchangeable.

Standard, and widely used, generic mapping relationships are given in the table below. (A fuller mapping ontology is given in [[Halpin2010]], but it is expected that the main relationships used will be those given in the table.)

Standard generic mapping predicates
Relationship | Description | Properties
rdfs:seeAlso | General link that indicates that the resource linked to is relevant to the subject. See http://www.w3.org/TR/rdf-schema/#ch_seealso. | -
skos:relatedMatch | This link indicates that the linked resources are in some way associated. See http://www.w3.org/TR/skos-reference/#mapping. | Symmetric
skos:closeMatch | This link indicates that the linked resources are the same, under some assumptions or applications. See http://www.w3.org/TR/skos-reference/#mapping. | Symmetric
skos:exactMatch | This link indicates that the linked resources are the same, under the assumptions of most applications. See http://www.w3.org/TR/skos-reference/#mapping. | Transitive, Symmetric
owl:sameAs | This link indicates that the linked resources are the same under all assumptions and can be used interchangeably. Note that if this link is used for classes, then reasoning tasks will fall under OWL Full semantics. See http://www.w3.org/TR/2009/REC-owl2-quick-reference-20091027/#Axioms. | Transitive, Symmetric
owl:equivalentClass | This link indicates that the linked resources (which are both classes in some ontology) are the same. See http://www.w3.org/TR/2009/REC-owl2-quick-reference-20091027/#Axioms. | Transitive, Symmetric

Note that there are domain-specific relationships that may also be used to express the relationship between two data instances. For example, ChEBI provides a set of relationships including http://purl.obolibrary.org/obo#has_part and http://purl.obolibrary.org/obo#is_conjugate_acid_of.

Mapping Justifications

A key feature of the Open PHACTS platform is its ability to allow multiple views of the linked data which is achieved by applying scientific lenses over the data [[OPS-LENSES]]. For example, when performing an early stage exploratory task it is desirable to retrieve as much data as possible and as such the system enables the use of a lens whereby compounds are matched on their structural skeleton. Once the research has progressed further, the need for stricter relationships becomes apparent and the structure lens is switched for one which matches compounds on their full chemical structure, i.e. including their charges and stereo-chemistry.

The ability to classify linksets for use under different scientific lenses relies on the justification given in the linkset. A set of vocabulary terms for providing justifications is given in Appendix A.3.

Example

The following example declares that the ChemSpider α-Ketoisovaleric acid concept with the CSID 48 shares many properties with the ChEMBL 3-Methyl-2-oxobutanoic acid concept with the ChEMBL-RDF ChEMBL ID CHEMBL146554. The relationship is drawn from ChemSpider and is justified by the compounds sharing the same structure according to their InChI Keys (in this case the compound has the InChI Key QHKABHOOEWYVLI-UHFFFAOYSA-N in both datasets). Only the triples directly related to declaring the link are given in the example.

@prefix chembl: <http://linkedchemistry.info/chembl/chemblid/> .

:cs2chembl_inchi void:linkPredicate skos:exactMatch .
:cs2chembl_inchi dul:expresses <http://semanticscience.org/resource/CHEMINF_000059> .

<http://rdf.chemspider.com/48> skos:exactMatch chembl:CHEMBL146554 .
      	

Describing Datasets

We assume that an RDF representation of the dataset has been generated according to the Open PHACTS RDF Guidelines [[OPS-RDF]]. This section describes the information that must appear in the VoID description for the dataset.

It is recommended that the VoID description for a dataset is generated as part of the creation of the RDF version of the dataset. To help you get started and to understand the principles of a VoID description, there is a VoID Generator which supports the creation of dataset descriptions.

General Dataset Metadata

Type

For more details see Section 1.3 of the VoID Specification.

Title

For more details see Section 2.2 of the VoID Specification.

Description

For more details see Section 2.2 of the VoID Specification.

Webpages

An example of an external page that may be linked is the bioDBcore [[BioDBcore]] entry, see Section 7.2.

For more details see Section 2.1 of the VoID Specification.

License

Where possible, we recommend publishing data under an open license to enable reuse of the data with appropriate acknowledgement. For Open PHACTS, the recommended license is CC-BY-SA. A list of alternative licenses is available in Section 2.4 of the W3C note on VoID.

For more details see Section 2.4 of the VoID Specification.

Namespace

Note that the object of the void:uriSpace predicate is a literal, not a URI.

For more details see Section 4.2 of the VoID Specification.
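
The general metadata can be sketched as below. The choice of properties mirrors the example descriptors in Appendix B and is not an authoritative list of what is required; the dataset identifier and all example.org URIs are hypothetical.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <#> .

# Hypothetical dataset used for illustration only.
:exampleDataset a void:Dataset ;                                       # type
    dcterms:title "Example Dataset"@en ;                               # title
    dcterms:description "An illustrative dataset of compounds."@en ;   # description
    foaf:homepage <http://example.org/> ;                              # home page
    foaf:page <http://example.org/about> ;                             # other web page
    dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ; # license
    # The namespace is a literal, not a URI.
    void:uriSpace "http://example.org/resource/"^^xsd:string .
```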

Provenance and Version

Version

Not all datasets have the notion of a version number. Where the dataset does have this notion, it must be given in the description. For other datasets, the versioning information will be inferred from the last modified date provided as part of the provenance information.

Provenance of Data Origin

Details of the origin of the data must be provided using one of the following groups of predicates.

Note that the object of the pav:xxxBy predicate is a URI.

We currently do not require a specific form of URI for capturing the details of a person or entity. Future versions of this specification may recommend the use of ORCID Identifiers [[ORCID]].
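
A hedged sketch of version and origin metadata follows. It uses the pav:importedFrom, pav:importedBy and pav:importedOn group that also appears in the ChEMBL-RDF example of Appendix B; whether this group is the appropriate one depends on how the data was actually obtained. The dataset, source and agent URIs are hypothetical.

```turtle
@prefix pav: <http://purl.org/pav/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <#> .

# Hypothetical dataset, source and agent identifiers.
:exampleDataset a void:Dataset ;
    pav:version "2.1" ;                                     # only if the dataset is versioned
    pav:importedFrom <http://example.org/source-database> ; # where the data came from
    pav:importedBy <http://example.org/people#alice> ;      # a URI, not a literal
    pav:importedOn "2012-05-15T15:34:40Z"^^xsd:dateTime .
```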

Distinguishing Subsets

Data source

Metadata that is common between the subsets can be declared for the parent only. However, the subsets allow for more specific linking between datasets and for providing details of the subject.

In the VoID note [[VOID]], the void:subset property is used for both subset of and has subset. The declared semantics for the property is has subset. Within Open PHACTS, the property must be used with the has subset semantics.

For more details see Section 4.4 of the VoID Specification.
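
The direction of void:subset can be illustrated with the sketch below, in which the parent dataset is the subject and each subset is an object. The identifiers are hypothetical, and the license is declared once on the parent as an example of metadata shared by the subsets.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Parent dataset: "has subset" semantics, parent is the subject.
:exampleDataset a void:Dataset ;
    dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
    void:subset :exampleDataset_compounds, :exampleDataset_targets .

# Subsets carry only the metadata that is specific to them.
:exampleDataset_compounds a void:Dataset ;
    dcterms:title "Example Dataset Compounds"@en .

:exampleDataset_targets a void:Dataset ;
    dcterms:title "Example Dataset Targets"@en .
```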

Vocabularies, Topics, and Example Resources

Dataset Vocabularies

For more details see Section 4.3 of the VoID Specification.

Dataset Topics

Multiple topics can be declared. If the data is split into subsets, then the topics should be associated with the subsets. BioPortal [[BioPortalWeb]] [[BioPortal]] can be used to search for suitable vocabulary terms for topics. DBPedia URIs may also be used. A list of common terms relevant for Open PHACTS is given in Appendix A.1.

For more details see Section 2.5 of the VoID Specification.

Example Resources

Multiple resources can be declared. If the data is split into subsets, then the example resources should be associated with the subsets.

For more details see Section 4.1 of the VoID Specification.
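
As a sketch, the vocabulary, topic and example resource declarations for a subset might look like the following. The subset identifier and the example resource URI are hypothetical; the properties are those used in the descriptors of Appendix B.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Hypothetical subset of compounds.
:exampleDataset_compounds a void:Dataset ;
    # Vocabularies used in the data, one URI per vocabulary.
    void:vocabulary <http://purl.org/dc/terms/> ,
        <http://www.w3.org/2004/02/skos/core#> ;
    # Topic of the subset; multiple topics may be declared.
    dcterms:subject <http://dbpedia.org/resource/Molecule> ;
    # A representative resource from the subset.
    void:exampleResource <http://example.org/resource/compound/1> .
```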

Dataset Access

Data Dump

For more details see Section 3.3 of the VoID Specification.

SPARQL Endpoint

For more details see Section 3.2 of the VoID Specification.

Dataset Update Frequency

The terms from the Frequency Vocabulary have been reproduced in Appendix A.2.
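
Access and update frequency metadata could be declared as in this sketch. The dump and endpoint URLs are hypothetical, and freq:semiannual is simply one of the frequency terms used elsewhere in this document.

```turtle
@prefix freq: <http://purl.org/cld/freq/> .
@prefix voag: <http://voag.linkedmodel.org/schema/voag#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Hypothetical dataset with both access mechanisms.
:exampleDataset a void:Dataset ;
    void:dataDump <http://example.org/dumps/example.nt.gz> ;  # downloadable RDF dump
    void:sparqlEndpoint <http://example.org/sparql> ;         # query endpoint
    voag:frequencyOfChange freq:semiannual .                  # expected update frequency
```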

Other Metadata

Other metadata that can be associated with a dataset may be included, see the VoID specification [[VOID]] for additional properties that may be incorporated. The dataset may also have a provenance graph associated with it providing more detailed information about the creation, authorship, and derivation of the dataset. This provenance graph should be expressed using the W3C Provenance Ontology [[PROV-O]].

Examples

Example VoID dataset descriptions can be found in Appendix B. The first example is for the ChemSpider dataset (see Appendix B.1). Note this example is derived from the existing non-conformant ChemSpider VoID description available from http://rdf.chemspider.com/void.rdf.

The second example is for the RDF representation of the ChEMBL database, given in Appendix B.2. This demonstrates the level of information required to track the provenance from a source dataset through to the RDF representation. It also contains subset definitions.

Describing Linksets

A linkset is itself a dataset, and as such should provide the metadata about its content and how it was created. The metadata associated with a link is essential for enabling its reuse by others. It enables a consumer of the link to understand which datasets are linked (including which version), who claimed the link, under what circumstances, and which (if any) tools were used to generate the link (e.g. [[SILK]]).

General Linkset Metadata

Type

For more details see Section 1.4 of the VoID Specification.

Title

For more details see Section 2.2 of the VoID Specification.

Description

For more details see Section 2.2 of the VoID Specification.

Note that for linksets there is unlikely to be a web page directly dedicated to the linkset; as such, that part of the dataset metadata is not stated as a requirement.

License

Where possible, we recommend publishing data under an open license to enable reuse of the data with appropriate acknowledgement. For Open PHACTS, the recommended license is CC-BY-SA. A list of alternative licenses is available in Section 2.4 of the W3C note on VoID.

The license under which a linkset is published may be different from that of the datasets that it links, even if it is a subset of one of the datasets.

For more details see Section 2.4 of the VoID Specification.

Provenance

Subset

Note that the dataset URI is the subject of the void:subset triple and the linkset URI is the object. If the dataset VoID description contains the declaration, there is no need to repeat it in the linkset document. However, the linkset documents can be used to declare additional subsets. A subset inherits properties from its parent.

For more details see Section 5.2 of the VoID Specification.
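
A minimal sketch of the subset declaration for a linkset is given here. The dataset URI is the ChemSpider one used in this document; the linkset identifier is hypothetical.

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Hypothetical linkset identifier.
:chemspider_chembl_linkset a void:Linkset .

# The dataset is the subject and the linkset is the object.
<http://rdf.chemspider.com/void.ttl#chemSpiderDataset>
    void:subset :chemspider_chembl_linkset .
```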

Link Source

Additional information about the settings used when running an automated link generation tool such as Silk [[SILK]] can be captured in an associated provenance graph encoded using PROV-O [[PROV-O]].

Linkset Creation Details

In this version of the specification, it is assumed that the creator of the linkset has accessed the most recent version of the dataset.

Linkset Version

Linkset Statistics

Providing the number of triples included in the linkset allows for applications using the linkset to validate that the entire linkset has been successfully loaded.

For more details see Section 4.6 of the VoID Specification.
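
Bringing the linkset metadata together, a description might be sketched as below. This is an illustration under assumptions rather than the definitive set of required properties: the linkset, dataset, agent and resource identifiers are hypothetical, and the justification URI is the one used in the example of Section "Expressing Mapping Relationships". The void:triples value matches the two link triples shown.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dul: <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#> .
@prefix pav: <http://purl.org/pav/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <#> .

# Hypothetical linkset relating two hypothetical datasets.
:example_linkset a void:Linkset ;
    dcterms:title "Dataset A to Dataset B Linkset"@en ;
    dcterms:description "Links compounds in A to compounds in B."@en ;
    dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
    void:subjectsTarget :datasetA ;          # dataset of the link subjects
    void:objectsTarget :datasetB ;           # dataset of the link objects
    void:linkPredicate skos:exactMatch ;     # the single link predicate
    dul:expresses <http://semanticscience.org/resource/CHEMINF_000059> ;
    pav:createdBy <http://example.org/people#alice> ;
    pav:createdOn "2012-08-10T13:52:12Z"^^xsd:dateTime ;
    void:triples "2"^^xsd:integer .          # number of link triples below

# The links themselves.
<http://example.org/a/1> skos:exactMatch <http://example.org/b/1> .
<http://example.org/a/2> skos:exactMatch <http://example.org/b/2> .
```

A loader can compare the number of link triples it has read against the void:triples value to detect a truncated file.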

Minimal Dataset Description

A linkset should point to the dataset descriptions of the datasets that it uses. These descriptions should be provided by the dataset provider as part of the dataset publishing process [[OPS-RDF]]. However, there are occasions when one or both of the linked datasets do not provide a VoID dataset description. The following set of properties must then be provided in the linkset document. Other properties may also be given.

Examples

Example VoID linkset descriptions can be found in Appendix C. The first example shows how existing dataset descriptors can be reused (see Appendix C.1). The second example shows the declaration of a dataset's metadata in the linkset file (see Appendix C.2).

Deploying and Exchanging VoID Documents

The two preceding sections have prescribed the metadata required to describe datasets and the linksets that inter-relate them. This section outlines the expected deployment and exchange mechanisms and should be read in conjunction with Section 6 of the VoID specification for more details.

VoID Document Metadata

VoID documents describing datasets and linksets must contain a metadata block describing the VoID document using the following properties:

Of course, other properties may also be declared, e.g. a title using dcterms:title. For more details, see Section 6.2 of the VoID specification [[VOID]].

An example is given below for the ChemSpider deployment. Note the use of an empty-string relative URI (<>) as a syntactic shortcut for the URI of the document that contains the statements.

<> a void:DatasetDescription ;
    dcterms:title "ChemSpider VoID Description"^^xsd:string ;
    pav:createdBy <http://www.chemspider.com/> ;
    pav:createdOn "2012-05-02T13:50:34Z"^^xsd:dateTime ;
    pav:lastUpdateOn "2012-08-10T13:52:12Z"^^xsd:dateTime ;
    foaf:primaryTopic :chemSpiderDataset .
		

Deploying VoID Descriptions

Several mechanisms for deploying VoID descriptions are given in Section 6 of the VoID Note [[VOID]].

Deploying a Dataset Description

For datasets that are published in their own domain with dereferenceable URIs, we recommend placing the dataset's VoID description in the root directory in a file called void.ttl, with a local "hash URI" for the dataset (and any subsets). For example, for ChemSpider we would have a URI such as http://rdf.chemspider.com/void.ttl#chemSpiderDataset. When the data is hosted on an external service, the VoID descriptor should be provided in the dataset's home directory. For example, for the RDF encoding of ChEMBL this would be http://linkedchemistry.info/chembl/void.ttl#chembl-rdf. Examples of the VoID descriptors are given in Appendix B.

Each RDF document containing the data should then contain a backlink to the dataset descriptor. For example, for the ChEMBL-RDF molecule m1, there would be the triple:

<http://linkedchemistry.info/chembl/chemblid/molecule/m1> void:inDataset 
    <http://linkedchemistry.info/chembl/chemblid/void.ttl#chembl-rdf_compounds> .    
			

Deploying a Linkset Description

For the purposes of Open PHACTS, it is anticipated that linksets will be materialised as separate documents from the datasets. This is to allow their loading into the identity mapping service [[IMS]]. These linkset files will contain the metadata about the linkset as well as the links.

It is also permitted to separate the link triples from the linkset metadata. In this case, the file containing the links must provide a link back to the linkset description using the void:inDataset predicate. The example below shows how a set of links can refer back to the linkset description given in the ChEMBL-RDF VoID description file.

<> void:inDataset <http://linkedchemistry.info/chembl/chemblid/void.ttl#chembl-rdf_targets-uniprot-linkset> .

<http://linkedchemistry.info/chembl/target/t1> skos:exactMatch <http://purl.uniprot.org/uniprot/O43451> .
...
			

External VoID Files

Tooling being developed within Open PHACTS should support the predicates stated in this document. However, they should also be able to read VoID files from external sources that do not comply completely with this specification, but do comply with the VoID standard [[VOID]]. An example would be the use of the void:target predicate instead of the void:subjectsTarget and void:objectsTarget predicates. Such usage should not be the norm and should result in warnings being generated.
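
The difference between the two forms can be sketched as follows, using hypothetical dataset and linkset identifiers. The first linkset uses the predicates expected by this specification; the second uses void:target, which VoID permits but which does not state which dataset provides the subjects and which the objects.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Form expected by this specification: the direction is explicit.
:strict_linkset a void:Linkset ;
    void:subjectsTarget :datasetA ;
    void:objectsTarget :datasetB ;
    void:linkPredicate skos:closeMatch .

# Form allowed by VoID: both datasets are named, direction is not.
# Tooling should accept this but generate a warning.
:lenient_linkset a void:Linkset ;
    void:target :datasetA , :datasetB ;
    void:linkPredicate skos:closeMatch .
```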

Related Standards

Nanopublications

Nanopublications [[NANOPUB]] provide a means for data providers to obtain credit for their data contribution, in particular data that can be described in the form of a minimal set of assertions: a minimal piece of information that represents value for which credit is due. Such information is closely related to a link relating instances in two datasets. In some cases it may be desirable to publish a link as a nanopublication. This should not violate a link being published in a linkset according to this specification.

BioDBcore

BioDBcore defines the following properties as the set of metadata that should be published in relation to a dataset. The aim of BioDBcore is different from that of VoID, but many of the elements defined are covered in the Open PHACTS dataset description.

An example BioDBcore record for ChEMBL.

The metadata specified in Section 4 covers the functional data required from BioDBcore. The aspects not covered are those relating to discovering who is responsible for a dataset and the publications about the dataset. It is expected that such information can be discovered from the dataset's homepage and is not within the use case scope for the description of the dataset. Such information may be added as additional statements in the VoID description.

Provenance Vocabularies

A wide range of provenance vocabularies has been proposed. This section gives brief pointers to related vocabularies that could be used in a dataset or linkset description. For more information about the state of provenance vocabularies, the interested reader is referred to [[PROV-XG]].

Provenance Ontology (PROV-O)

The Provenance Ontology (PROV-O) [[PROV-O]] is a W3C candidate recommendation for representing provenance information about documents, datasets, workflow runs, etc. It is broadly based on the Open Provenance Model [[OPM]]. It is capable of expressing complex provenance relationships.

Provenance, Authoring and Versioning Ontology

The Provenance, Authoring and Versioning Ontology (PAV) [[PAV]] provides a comprehensive set of relationships for capturing basic provenance information.

Provenance Vocabulary

The Provenance Vocabulary [[PRV]] is another lightweight vocabulary of provenance predicates with an emphasis on data creation and data access on the Web.

Vocabulary for Data and Dataset Provenance (voidp)

Defined as an extension to VoID, the vocabulary for data and dataset provenance (voidp) [[VOIDP]] is a vocabulary for defining provenance relationships of data and datasets. The vocabulary focuses on four specific pieces of provenance information:

"for a piece of data, x :
  • when was x derived,
  • how was x derived,
  • what data had been used to derive x,
  • who carried out the transformations that resulted in the current value of x." [[VOIDP]]

These are a subset of the information that needs to be captured for the Open PHACTS linksets.

Suggested Vocabulary Terms

Vocabulary Terms for Dataset Topics

Complete list of suggested URIs.

This section lists suggested vocabulary terms for the dataset topics metadata that are relevant for Open PHACTS. The term in bold is the preferred URI.

Drug:
Molecule:
Protein:
Target:

Frequency of Change Vocabulary

Below are the terms from the frequency of change vocabulary.

Example Dataset VoID Descriptors

ChemSpider VoID Descriptor

The ChemSpider site already has a VoID descriptor available from http://rdf.chemspider.com/void.rdf. This has been created with a previous version of the VoID specification. Below is a suggested updated version which conforms with the requirements of this specification.

@prefix : <http://rdf.chemspider.com/void.ttl#>.

@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix pav: <http://purl.org/pav/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix voag: <http://voag.linkedmodel.org/schema/voag#> .
@prefix void: <http://rdfs.org/ns/void#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

# Metadata about this file
<http://rdf.chemspider.com/void.ttl> 
	a void:DatasetDescription;
	dcterms:title "A VoID Description of the ChemSpider Dataset"@en;
	dcterms:description 
		"""This is an example VoID description for the ChemSpider dataset. 
		   It is derived from the existing VoID description and updating it to the latest
		   version of the VoID specification."""@en;
	pav:createdBy <http://www.cs.man.ac.uk/~graya/me.ttl>;
	pav:createdOn "2012-05-02T14:48:03Z"^^xsd:dateTime;
	pav:lastUpdateOn "2012-08-10T09:32:49Z"^^xsd:dateTime;
	pav:derivedFrom <http://rdf.chemspider.com/void.rdf>;
	foaf:primaryTopic :chemSpiderDataset;
	.

# Description of the ChemSpider dataset
:chemSpiderDataset 
# General metadata
	a void:Dataset;
	dcterms:title "ChemSpider"@en;
	dcterms:description "ChemSpider's Public Dataset"@en;
	foaf:homepage <http://rdf.chemspider.com/>;
	foaf:page <http://www.chemspider.com/>;
	dcterms:license <http://www.chemspider.com/Disclaimer.aspx>;
	void:uriSpace "http://rdf.chemspider.com/"^^xsd:string;
#Provenance
	dcterms:publisher <http://www.chemspider.com/>;
	dcterms:created "2007-03-01T00:00:00"^^xsd:dateTime;
	dcterms:modified "2012-10-16T00:00:00"^^xsd:dateTime;
#Subsets
	void:subset :chemSpiderDataset_chembl_subset,:chemSpiderDataset_drugbank_subset;
#Vocabularies, topics, resources
	void:vocabulary <http://purl.org/dc/elements/1.1/>,
		<http://purl.org/dc/terms/>,
		<http://www.openarchives.org/ore/terms/>,
		<http://www.polymerinformatics.com/ChemAxiom/ChemDomain.owl#>,
		<http://xmlns.com/foaf/0.1/>;
	dcterms:subject <http://dbpedia.org/resource/Molecule>;
	void:exampleResource <http://rdf.chemspider.com/2157>;
#Dataset Access	
	void:sparqlEndpoint <http://rdf.chemspider.com/sparql>;
#Update Frequency
	voag:frequencyOfChange freq:continuous;
#Other Metadata
	# Technical features
	void:feature <http://www.w3.org/ns/formats/RDF_XML>;
	# Dataset statistics
	void:triples "1157624328"^^xsd:nonNegativeInteger;
	.

:chemSpiderDataset_chembl_subset
#General Metadata
	a void:Dataset;
	dcterms:title "ChemSpider ChEMBL Subset"@en;
	dcterms:description "The slice of ChemSpider data that corresponds to ChEMBL molecules."@en;
#Provenance
	pav:retrievedFrom <ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_13/chembl_13.sdf.gz>;
	pav:retrievedOn "2012-08-02T10:23:56Z"^^xsd:dateTime;
	pav:retrievedBy <http://www.chemspider.com/> ;
#Dataset Access
	void:dataDump <https://www.dropbox.com/sh/6zboa8z9i9vrzyl/7NxayhkUH0/ChEMBL20120731.zip>;
	.

# Description of the ChemSpider subset relating to DrugBank
:chemSpiderDataset_drugbank_subset
#General Metadata
	a void:Dataset;
	dcterms:title "ChemSpider DrugBank Subset"@en;
	dcterms:description "Data corresponding to DrugBank."@en;
#Provenance
	pav:retrievedFrom <http://www.drugbank.ca/system/downloads/current/structures/all.sdf.zip>;
	pav:retrievedOn "2012-08-02T10:24:06Z"^^xsd:dateTime;
	pav:retrievedBy <http://www.chemspider.com/> ;
#Dataset Access
	void:dataDump <https://www.dropbox.com/sh/6zboa8z9i9vrzyl/qcFZzbLM77/DrugBank20120731.zip>;
	.
    	

ChEMBL-RDF VoID Descriptor

Below we provide the VoID document for the ChEMBL-RDF dataset, which is a conversion of the ChEMBL database. The VoID file would be located at http://linkedchemistry.info/void.ttl.

@prefix : <http://linkedchemistry.info/void.ttl#>.

@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix pav: <http://purl.org/pav/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix voag: <http://voag.linkedmodel.org/schema/voag#> .
@prefix void: <http://rdfs.org/ns/void#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

# Metadata about this file
<> 
    a void:DatasetDescription;
    dcterms:title "ChEMBL-RDF VoID Description"@en;
    dcterms:description 
        """This is the VoID description for a ChEMBL-RDF dataset."""@en;
    pav:createdBy <http://egonw.github.com/#me> ;
    pav:createdOn "2012-08-12T16:56:07Z"^^xsd:dateTime;
    pav:lastUpdateOn "2012-09-14T12:21:12Z"^^xsd:dateTime;
    pav:previousVersion <http://semantics.bigcat.unimaas.nl/chembl/v13_ops/chembl-rdf-void.ttl>;
    foaf:primaryTopic :chemblrdf_dataset.

:chemblrdf_dataset
# General metadata
    a void:Dataset;
    dcterms:title "ChEMBL-RDF 13.OPS.2"@en;
    dcterms:description "The ChEMBL database in RDF format."@en;
    foaf:homepage <http://github.com/egonw/chembl.rdf/>;
    foaf:page <http://www.biosharing.org/biodbcore-000015>;
    dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
    void:uriSpace "http://linkedchemistry.info/chembl/"^^xsd:string ;
# Provenance
	pav:version "v13_ops";
	pav:previousVersion <http://semantics.bigcat.unimaas.nl/chembl/v13_ops/chembl-rdf-void.ttl#chemblrdf_dataset>;
    pav:importedFrom :chembl_dataset;
    pav:importedBy <http://egonw.github.com/#me> ;
    pav:importedOn "2012-05-15T15:34:40Z"^^xsd:dateTime;
    pav:createdWith <https://github.com/openphacts/chembl.rdf>;
# Subsets
    void:subset :chemblrdf_compounds, :chemblrdf_targets ;
# Vocabularies, topics, resources
	void:vocabulary 
		<http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
		<http://www.w3.org/2000/01/rdf-schema#>,
		<http://www.w3.org/2002/07/owl#>,
		<http://www.w3.org/2001/XMLSchema#>,
		<http://purl.org/dc/elements/1.1/>, 
		<http://purl.org/ontology/bibo/>,
		<http://xmlns.com/foaf/0.1/>, 
		<http://purl.org/spar/cito/>, 
		<http://www.w3.org/2004/02/skos/core#> ,
		<http://purl.obolibrary.org/obo#>,
		<http://www.blueobelisk.org/ontologies/chemoinformatics-algorithms/#>,
		<http://www.blueobelisk.org/chemistryblogs/>,
		<http://www.nmrshiftdb.org/onto#>,
		<http://www.ifomis.org/bfo/1.1/snap#>,
		<http://semanticscience.org/resource/>,
		<http://purl.org/obo/owl/CHEBI#>;
# Update frequency
	#Need to verify the update frequency!
	voag:frequencyOfChange freq:semiannual;
# Other metadata
    void:distinctSubjects "18018451"^^xsd:integer ;
	.

:chembl_dataset
# Metadata about the original data source
    a <http://purl.org/dc/dcmitype/Dataset>;  # DCMI Type "Dataset"; the dcterms: namespace defines no Dataset class
    dcterms:title "ChEMBL"@en;
    dcterms:description """ChEMBL is a database of bioactive drug-like small molecules, it contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data)."""@en;
    foaf:homepage <http://www.ebi.ac.uk/chembl/>;
    dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
    pav:version "13"^^xsd:integer ;
    dcterms:publisher <http://www.ebi.ac.uk/chembl/>;
    dcterms:created "2012-02-29T00:00:00Z"^^xsd:dateTime;
    dcterms:modified "2012-02-29T00:00:00Z"^^xsd:dateTime;
    .

:chemblrdf_compounds
# Subset metadata
    a void:Dataset ;
    dcterms:title "ChEMBL Molecules"@en;
    dcterms:description "The subset of ChEMBL data relating to molecules."@en;
    void:uriSpace "http://linkedchemistry.info/chembl/molecule/"^^xsd:string ;
    dcterms:subject <http://dbpedia.org/resource/Molecule> ;
    void:exampleResource <http://linkedchemistry.info/chembl/molecule/m1>;	
    void:dataDump <http://semantics.bigcat.unimaas.nl/chembl/v13_ops/compounds.nt.gz> ;
    .
  
:chemblrdf_targets
# Subset metadata
    a void:Dataset ;
    dcterms:title "ChEMBL Targets"@en;
    dcterms:description "The subset of ChEMBL data relating to targets which are single proteins."@en;
    void:uriSpace "http://linkedchemistry.info/chembl/target/"^^xsd:string ;
    dcterms:subject <http://dbpedia.org/resource/Protein> ;
    void:exampleResource <http://linkedchemistry.info/chembl/target/t1>;	
    void:dataDump <http://semantics.bigcat.unimaas.nl/chembl/v13_ops/targets.nt.gz> ;
    .

    	

Example Linksets

ChemSpider to ChEMBL-RDF Compounds Linkset

Below is the start of the linkset file relating ChemSpider compounds with ChEMBL-RDF compounds. The linkset reuses the VoID descriptions already provided for the datasets, but augments these with additional metadata.

@prefix : <http://linkedchemistry.info/void.ttl#>.

@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix dul: <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix pav: <http://purl.org/pav/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix voag: <http://voag.linkedmodel.org/schema/voag#> .
@prefix void: <http://rdfs.org/ns/void#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

@prefix chembl: <http://linkedchemistry.info/chemblid/> .

# Metadata about this file
<>
    a void:DatasetDescription;
    dcterms:title "ChEMBL-RDF Compounds to ChemSpider linkset"@en;
    dcterms:description """A VoID linkset that links compounds in ChEMBL-RDF
        with compounds in ChemSpider."""@en; 
    pav:createdBy <http://egonw.github.com/#me> ;
    pav:createdOn "2012-08-13T15:00:00Z"^^xsd:dateTime ;
    pav:lastUpdateOn "2012-09-14T11:15:32Z"^^xsd:dateTime ;
    foaf:primaryTopic :chembl-rdf-compounds_cs_linkset ;
    .

# Pointer to the subset description in the ChEMBL VoID file
<http://linkedchemistry.info/void.ttl#chemblrdf_compounds>
	# Linkset declared as a subset. Inherits properties
    void:subset :chembl-rdf-compounds_cs_linkset .

:chembl-rdf-compounds_cs_linkset
# General linkset metadata
    a void:Linkset ;
    dcterms:title "ChEMBL-RDF Compounds ChemSpider Linkset"@en;
    dcterms:description "Linkset relating ChEMBL-RDF compounds to ChemSpider compounds."@en;
    #Explicit declaration of license for the linkset is preferred
    dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
# Link information
    void:subjectsTarget <http://linkedchemistry.info/void.ttl#chemblrdf_compounds> ;
    void:objectsTarget <http://rdf.chemspider.com/void.ttl#chemSpiderDataset_chembl_subset> ;
    void:linkPredicate skos:exactMatch ;
    dul:expresses <http://semanticscience.org/resource/CHEMINF_000059> ;
# Linkset provenance
    pav:authoredOn "2012-02-22T10:59:38Z"^^xsd:dateTime ;
    pav:authoredBy <http://www.chemspider.com/> ;
    pav:createdBy <http://egonw.github.com/#me> ;
    pav:createdOn "2012-05-15T11:29:01Z"^^xsd:dateTime ;
# Linkset statistics
    void:triples "1073967"^^xsd:integer ;
    .

chembl:CHEMBL1236438 skos:exactMatch <http://rdf.chemspider.com/43> .
chembl:CHEMBL144103 skos:exactMatch <http://rdf.chemspider.com/60> .
#...
		

An alternative deployment strategy for this linkset would be to include the linkset metadata in the ChEMBL-RDF VoID description file as a subset of the ChEMBL-RDF dataset. The linkset metadata must then include the predicate void:dataDump to point to the file containing the links. The file containing the links must include the corresponding predicate void:inDataset pointing back to the linkset description.
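
As a minimal sketch of this alternative, the linkset description could be added to the ChEMBL-RDF VoID description file as shown below. The dump location http://example.org/chembl-chemspider-links.ttl is a hypothetical placeholder, not a published file, and the default prefix is assumed to be the namespace of the ChEMBL-RDF VoID file. The remaining linkset metadata (license, provenance, statistics) would be the same as in the example above and is omitted here.

```turtle
# Fragment of the ChEMBL-RDF VoID description file
@prefix : <http://linkedchemistry.info/void.ttl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .

# The linkset is declared as a subset of the compounds dataset
:chemblrdf_compounds
    void:subset :chembl-rdf-compounds_cs_linkset .

:chembl-rdf-compounds_cs_linkset
    a void:Linkset ;
    void:subjectsTarget :chemblrdf_compounds ;
    void:objectsTarget <http://rdf.chemspider.com/void.ttl#chemSpiderDataset_chembl_subset> ;
    void:linkPredicate skos:exactMatch ;
    # Hypothetical location of the file containing the links
    void:dataDump <http://example.org/chembl-chemspider-links.ttl> .
```

The file containing the links would then carry only the backlink and the links themselves. The two link triples are taken from the example above.

```turtle
# Contents of the (hypothetical) links file
@prefix chembl: <http://linkedchemistry.info/chemblid/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .

# Backlink from this file to the linkset description
<> void:inDataset <http://linkedchemistry.info/void.ttl#chembl-rdf-compounds_cs_linkset> .

chembl:CHEMBL1236438 skos:exactMatch <http://rdf.chemspider.com/43> .
chembl:CHEMBL144103 skos:exactMatch <http://rdf.chemspider.com/60> .
#...
```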

ChemSpider to DrugBank Linkset

Below is the linkset file relating ChemSpider compounds to DrugBank drugs. As no VoID file is available for the DrugBank dataset, the dataset metadata is included in the linkset file, limited to the minimal information prescribed in Section 5.5. For ChemSpider, the linkset refers to the VoID description published at the ChemSpider location.

@prefix : <http://rdf.chemspider.com/void.ttl#>.

@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix dul: <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix pav: <http://purl.org/pav/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix voag: <http://voag.linkedmodel.org/schema/voag#> .
@prefix void: <http://rdfs.org/ns/void#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

# Metadata about this file
<> 
	a void:DatasetDescription;
	dcterms:title "ChemSpider to DrugBank Linkset VoID Description"@en;
	dcterms:description 
		"""This is an example VoID description for a ChemSpider linkset. 
		   The linkset relates ChemSpider identifiers with DrugBank identifiers.
		   The links have been generated as the chemicals share the same structure."""@en;
	pav:createdBy <http://www.cs.man.ac.uk/~graya/me.ttl>;
	pav:createdOn "2012-08-13T16:43:25Z"^^xsd:dateTime;
	pav:lastUpdateOn "2012-10-16T10:22:43Z"^^xsd:dateTime;
	foaf:primaryTopic :chemSpider_drugbank_linkset;
	.
	
# Pointer to the ChemSpider dataset
<http://rdf.chemspider.com/void.ttl#chemSpiderDataset_drugbank_subset>
	# declare that the linkset is a subset of ChemSpider
	void:subset :chemSpider_drugbank_linkset;
	.
	
# Need to declare dataset metadata about DrugBank as no VoID file
## Note this is just a minimal amount of information as prescribed in Section 5.5
:drugbank_drugs_dataset
# General Dataset Metadata
	a void:Dataset;
	dcterms:title "DrugBank Drugs Dataset"@en;
	dcterms:description "A subset of the DrugBank database containing the information relating to drugs."@en;	
	foaf:homepage <http://www4.wiwiss.fu-berlin.de/drugbank/>;
	foaf:page <http://www.drugbank.ca/>;
	dcterms:license <http://www.drugbank.ca/about#cite>;
	void:uriSpace "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/"^^xsd:string;
# Provenance and Version	
	pav:version "3.0";
	pav:retrievedOn "2011-11-30T11:01:59Z"^^xsd:dateTime;
	pav:retrievedFrom <http://www4.wiwiss.fu-berlin.de/drugbank/drugbank_dump.nt.bz2>;
	pav:retrievedBy [
		a foaf:Person;
		foaf:name "Antonis Loizou"^^xsd:string
		];
	.
	
# Description of the linkset from ChemSpider to DrugBank
:chemSpider_drugbank_linkset
# General Linkset Metadata
	a void:Linkset;
	dcterms:title "ChemSpider DrugBank Linkset"@en;
	dcterms:description "Linkset relating ChemSpider compounds to DrugBank drugs."@en;
	dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
# Link Information
	void:subjectsTarget <http://rdf.chemspider.com/void.ttl#chemSpiderDataset_drugbank_subset>;
	void:objectsTarget :drugbank_drugs_dataset;
	void:linkPredicate skos:exactMatch;
	dul:expresses <http://semanticscience.org/resource/CHEMINF_000059>;
# Linkset Provenance 
	pav:authoredBy <http://www.chemspider.com/>;
	pav:authoredOn "2012-02-23T09:08:29Z"^^xsd:dateTime;
	pav:createdBy <http://www.cs.man.ac.uk/~graya/me.ttl>;
	pav:createdOn "2012-08-13T15:29:31Z"^^xsd:dateTime;
# Linkset statistics
	void:triples "6428"^^xsd:nonNegativeInteger;
	.

# Location of the triples
## Note that the triples would not contain the void header information as 
## that is in this file. I have assumed the same location as the subset 
## above, but this needs to change to a correct location.
## The file containing the triples should contain the following backlink
## |uriOfTheData| void:inDataset <http://rdf.chemspider.com/void-example.rdf#chemSpider_drugbank_linkset>.