This is a specification for the metadata used to describe datasets, and the linksets that relate them, to enable their use within the Open PHACTS discovery platform. The specification defines the metadata properties that are expected to describe datasets and linksets, detailing the creation and publication of each dataset. Details of deploying dataset descriptions and an exchange format for linkset files are also given.

Disclaimer

The research leading to these results has received support from the Innovative Medicines Initiative (IMI) Joint Undertaking under grant agreement n° 115191, resources of which are composed of financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in kind contribution.

Intended audience

This dataset and linkset description specification is intended for data providers. The principal data format in Open PHACTS is RDF; as such, a basic knowledge of RDF is assumed.

Details for converting a dataset and publishing it as RDF are given in the Open PHACTS RDF guidelines specification [[OPS-RDF]].

Introduction

The Open PHACTS Discovery Platform [[OPS-ARCH]] relies on data and the interlinks published by a variety of sources. For example, details of chemicals are derived from ChemSpider, ChEMBL, and DrugBank. This specification provides details of the metadata expected to describe the datasets and the links that relate the instances in those datasets.

Open PHACTS has produced a set of guidelines aimed at data providers for publishing their data within the Open PHACTS Discovery Platform [[!OPS-RDF]]. The RDF guide provides details about modelling your data as RDF. This specification builds on the RDF Guidelines by defining the metadata that should be published to describe the dataset and the links to other datasets.

The dataset description defined in this specification declares the properties that should be included in the description of a dataset or its links. The information is exchanged using the Vocabulary of Interlinked Datasets [[VOID]].

Tooling

The VoID Editor can be used to create dataset descriptions. This is a useful tool for prototyping the first version of a dataset description. Ideally the generation of dataset descriptions should be incorporated as part of the data creation pipeline.

A validator is provided to verify whether a dataset description conforms to the latest stable version of these specifications. This could be accessed automatically by the data creation pipeline to ensure that newly created dataset descriptions remain conformant.

Document Conventions

RDF Syntax and Namespaces

All examples in this document are written in the Turtle RDF syntax [[TURTLE]]. Throughout the document, the following namespaces are used:

@prefix bdb: <http://vocabularies.bridgedb.org/ops#> .
@prefix cito: <http://purl.org/spar/cito/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix eco: <http://purl.obolibrary.org/obo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix pav: <http://purl.org/pav/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
		

Furthermore, we assume that the empty prefix is bound to the base URL of the current file like this:

@prefix : <#> .

This allows us to quickly mint new identifiers in the local namespace.

Vocabularies for Describing Datasets and Mapping Relationships

In this section, we introduce the vocabularies that will be used for capturing the dataset descriptions and the mappings between the datasets.

Vocabulary of Interlinked Datasets (VoID)

The Vocabulary of Interlinked Datasets (VoID) [[!VOID]] is a W3C Interest Group Note that specifies a vocabulary for describing the metadata about a dataset and its relationship with other datasets. The vocabulary builds upon existing metadata vocabularies, e.g. Dublin Core Terms [[DCTERMS]], and captures four categories of metadata: general metadata, access metadata, structural metadata, and descriptions of links between datasets.

Dataset

The VoID specification [[VOID]] defines a dataset to be, "a set of RDF triples that are published, maintained or aggregated by a single provider." The dataset itself may contain logical subsets, which can be captured in the VoID description of the dataset, e.g. ChEMBL can be split into subsets for compounds, targets, etc.

The information captured about a dataset focuses on the general metadata, the access metadata, and the structural metadata.

Linksets

The VoID specification [[VOID]] defines a link to be, "an RDF triple whose subject and object are described in different datasets." A link captures the mapping of an identifier in one dataset to an identifier in another dataset. VoID is agnostic about the relationship used; possible predicates are given in the next section.

The VoID specification defines a linkset to be, "a collection of such RDF links between two datasets." The linkset captures details of the links, i.e. the datasets that are linked and the relationship, as well as the metadata associated with the links, e.g. provenance information about who created the mapping and the specific versions of the datasets related. The VoID specification enables a separation between (1) the datasets involved in a linkset, and (2) who publishes the linkset.

Note that a VoID linkset is defined to link two datasets via a single link predicate (void:linkPredicate). As such, there can exist multiple linksets relating the same pair of datasets, as illustrated in the figure below. The figure depicts four distinct linksets: two sourced from ChemSpider depicted in blue which use different link predicates; one sourced from ChEMBL depicted in red; and one sourced from a third party depicted in green. Each of the linksets uses a different link relationship. Those shown with a double arrow head are symmetric while those with just a single arrow are directional links.

Depiction of links relating ChemSpider and ChEMBL
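
As a minimal sketch of the situation described above, the following Turtle declares two linksets that relate the same pair of datasets but use different link predicates. The resource names (:chemspider, :chembl, :cs2chembl_exact and :cs2chembl_close) are hypothetical identifiers introduced for illustration only; they are not taken from any deployed description.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Two dataset resources (illustrative local names only).
:chemspider a void:Dataset .
:chembl a void:Dataset .

# First linkset: one link predicate per linkset.
:cs2chembl_exact a void:Linkset ;
    void:subjectsTarget :chemspider ;
    void:objectsTarget :chembl ;
    void:linkPredicate skos:exactMatch .

# Second linkset over the same pair of datasets, with a weaker predicate.
:cs2chembl_close a void:Linkset ;
    void:subjectsTarget :chemspider ;
    void:objectsTarget :chembl ;
    void:linkPredicate skos:closeMatch .
```

Because each linkset carries exactly one void:linkPredicate, a consumer can select or ignore a whole linkset according to the strength of the relationship it asserts.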

Expressing Mapping Relationships

A mapping is expressed as a VoID link, i.e. it is an RDF triple that relates an identifier in one dataset with an identifier in another dataset with some predicate which provides the meaning of the mapping. A justification for the mapping, e.g. two chemical compounds are deemed equivalent as they have the same InChI key, is expressed with the link in the linkset metadata.

Mapping Vocabularies

The mapping predicate captures the way in which the two identifiers are related. The mapping should respect the semantics of the relationship, e.g. the owl:sameAs relationship must only be used when the two identifiers are completely interchangeable.

Standard, and widely used, generic mapping relationships are given in the table below. (A fuller mapping ontology is given in [[Halpin2010]], but it is expected that the main relationships used will be those given in the table.)

Standard generic mapping predicates

Relationship | Description | Properties
rdfs:seeAlso | General link that indicates that the resource linked to is relevant to the subject. See http://www.w3.org/TR/rdf-schema/#ch_seealso. | -
skos:relatedMatch | This link indicates that the linked resources are in some way associated. See http://www.w3.org/TR/skos-reference/#mapping. | Symmetric
skos:closeMatch | This link indicates that the linked resources are the same, under some assumptions or applications. See http://www.w3.org/TR/skos-reference/#mapping. | Symmetric
skos:exactMatch | This link indicates that the linked resources are the same, under the assumptions of most applications. See http://www.w3.org/TR/skos-reference/#mapping. | Transitive, Symmetric
owl:sameAs | This link indicates that the linked resources are the same under all assumptions and can be used interchangeably. Note that if this link is used for classes, then reasoning tasks will fall under OWL Full semantics. See http://www.w3.org/TR/2009/REC-owl2-quick-reference-20091027/#Axioms. | Transitive, Symmetric
owl:equivalentClass | This link indicates that the linked resources (which are both classes in some ontology) are the same. See http://www.w3.org/TR/2009/REC-owl2-quick-reference-20091027/#Axioms. | Transitive, Symmetric

Hierarchical Relationship | Description | Properties
skos:broadMatch | This link indicates that the target resource is more general than the subject resource. See http://www.w3.org/TR/skos-reference/#mapping. | Inverse of skos:narrowMatch
skos:narrowMatch | This link indicates that the target resource is more specific than the subject resource. See http://www.w3.org/TR/skos-reference/#mapping. | Inverse of skos:broadMatch

Mapping Justifications

A key feature of the Open PHACTS Discovery Platform is its ability to allow multiple views of the linked data which is achieved by applying scientific lenses over the data [[OPS-LENSES]]. For example, when performing an early stage exploratory task it is desirable to retrieve as much data as possible and as such the system enables the use of a lens whereby compounds are matched on their structural skeleton. Once the research has progressed further, the need for stricter relationships becomes apparent and the structure lens is switched for one which matches compounds on their full chemical structure, i.e. including their charges and stereo-chemistry.

The ability to classify linksets for use under different scientific lenses relies on the justification given in the linkset, i.e. the notion of operational equivalence, or alternatively the interpretation, that is captured by the links. A set of vocabulary terms for providing justifications is given in Appendix B.2.

Example

The following example declares that the ChemSpider α-Ketoisovaleric acid concept with the CSID 48 shares many properties with the ChEMBL 3-Methyl-2-oxobutanoic acid concept with the ChEMBL-RDF ChEMBL ID CHEMBL146554. The relationship is drawn from ChemSpider and is justified by the compounds having the same structure according to their InChI Keys (in this case the compound has the same InChI Key QHKABHOOEWYVLI-UHFFFAOYSA-N in both datasets). Only the triples directly related to declaring the link are given in the example.

@prefix chembl: <http://linkedchemistry.info/chembl/chemblid/> .

:cs2chembl_inchi void:linkPredicate skos:exactMatch .
:cs2chembl_inchi bdb:linksetJustification <http://semanticscience.org/resource/CHEMINF_000059> .

<http://rdf.chemspider.com/48> skos:exactMatch chembl:CHEMBL146554 .
      	

Describing Datasets


In this section we specify the metadata expected to describe a dataset – its origins, publication and content. These descriptions should be made at the level of the entire dataset, not each individual record, i.e. it is the metadata that would be expected within a catalogue record such as MIRIAM or Datahub. We define a dataset as follows:

A dataset is a collection of records that are published, maintained or aggregated by a single provider.

Note that this definition is more general than the one given above for a VoID Dataset. Specifically it includes datasets that are not represented as RDF. As such, in the following we sometimes distinguish between RDF and non-RDF datasets; many of the VoID predicates are typed for void:Dataset, i.e. a set of RDF triples, and cannot be applied to non-RDF datasets.

For datasets included in the Open PHACTS linked data cache [[OPS-ARCH]] we assume that an RDF representation of the dataset has been generated according to the Open PHACTS RDF Guidelines [[OPS-RDF]].

The tooling section contains details of tools to help with the creation of dataset descriptions. However, it is recommended that the generation of the VoID description for a dataset is carried out as part of the creation of the RDF version of the dataset and that these descriptions are passed through the dataset description validator.

Dataset Description Checklists

The following gives a checklist of the properties for describing a dataset and associated resources. Subsequent sections give guidance on the values to use with these properties and full details of the predicates used can be found in Appendix A.

Dataset Description

Dataset Distribution

Guidance

Basic Metadata

The title given for a dataset is used in the apps built upon the Open PHACTS Discovery Platform and in tabular lists of the datasets that have been loaded into the platform. As such, the title should be unique and short. For example, we would suggest using strings like "ChEMBL" or "UniProt". Each distinguished subset should have its own unique title so that it is clear in a user interface which part of a dataset is being used, e.g. "ChEMBL Molecule" for the subset of ChEMBL that contains data about molecules.

The description for a dataset should allow someone knowledgeable of the domain, but not familiar with the dataset, to understand the content of the dataset and decide upon its merit. Ideally it should be no more than a paragraph. Typically the paragraph about the dataset on the public web page is suitable.

The publisher for a dataset is the organisation that is responsible for the creation and maintenance of the dataset. This predicate should point to the web page for the organisation. For example, the publisher of the ChemSpider dataset is the Royal Society of Chemistry and the value for this property is http://www.rsc.org/. Ideally the publisher's page should be marked up with RDFa so that it is machine processable.

The landing page for a dataset should point to the public facing web page that gives details of the dataset. For a dataset such as the RDF conversion of DrugBank we suggest that this predicate points to the original DrugBank web page, viz. http://www.drugbank.ca/, rather than the base resource of the RDF conversion. This allows scientific users of the data to get straight to the information about the dataset. Other pages associated with the dataset, e.g. the Identifiers.org page, can be linked to using foaf:page. Note that it is important not to use foaf:homepage, because that predicate is an inverse functional property.

For most datasets a license has already been chosen and this needs to be stated as a URI in the description. For licenses that require a citation to be given, the citation information can be captured using the cito:citeAsAuthority property. For new datasets a suitable license should be chosen. Below is a list of licenses suggested in VoID [[VOID]]. Note that the final two are not specifically designed for data.

The date issued property captures the date on which a dataset is issued for public release, which may also be the date it was created. This can be used to distinguish versions of a dataset where there are no version numbers, e.g. datasets that are continually updated, particularly those that contain on-going user generated content. The date issued should be specified as an xsd:dateTime literal. Where part of the value is unknown, it should be set to the start of the time period, e.g. for a dataset issued on 23 July 2013 where the time of day does not need to be captured, we would use the literal value "2013-07-23T00:00:00"^^xsd:dateTime.

For datasets which have a version number, this should be provided as a string literal. For example, ChEMBL would use the literal value "16"^^xsd:string for their version 16 release.

It is useful to point to an example resource in the dataset to allow a user to quickly look at a record in the data. For each subset, a suitable example should be supplied. Other information such as the URI namespace and the identifier pattern can also be supplied.

Other metadata about the dataset, e.g. the author and creator of the content, can be captured using PAV [[PAV]]. The values for these resources should not be locally defined values, but link to external representations for individuals that can be reused. We recommend the use of ORCID identifiers.
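
The guidance above can be sketched as a single description for a hypothetical dataset called "ExampleDB". Everything under http://example.org/ is invented for illustration, the ORCID is the fictitious sample identifier published by ORCID, and the choice of properties (for example dcat:landingPage and pav:version) is an assumption based on the namespaces listed earlier; Appendix A remains the authoritative list of predicates.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix pav: <http://purl.org/pav/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <#> .

:exampleDataset a void:Dataset ;
    # Short, unique title suitable for display in apps.
    dcterms:title "ExampleDB"@en ;
    # One-paragraph description aimed at a domain expert.
    dcterms:description "A hypothetical collection of compound records, used here only to illustrate the properties."@en ;
    # Web page of the publishing organisation (hypothetical).
    dcterms:publisher <http://example.org/> ;
    # Public facing page about the dataset (hypothetical).
    dcat:landingPage <http://example.org/exampledb/> ;
    # License stated as a URI.
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
    # Time of day unknown, so set to the start of the day.
    dcterms:issued "2013-07-23T00:00:00"^^xsd:dateTime ;
    # Version number as a string literal.
    pav:version "16"^^xsd:string ;
    # An example record and the URI namespace of the data (hypothetical).
    void:exampleResource <http://example.org/exampledb/compound/1> ;
    void:uriSpace "http://example.org/exampledb/" ;
    # Author identified by an external, reusable identifier.
    pav:authoredBy <http://orcid.org/0000-0002-1825-0097> .
```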

Structure

A dataset that can be separated into multiple parts, e.g. ChEMBL can be split into ChEMBL.molecule, ChEMBL.Target, etc., should have this structure captured in the dataset description using the void:subset property. Each distinct subset should be described as a dataset in its own right, although common parts of information can be inherited from the parent. However, each should have a meaningful title and description for display purposes.

In the VoID note [[VOID]], the void:subset property is used for both 'subset of' and 'has subset'. The declared semantics for the property is has subset. Within Open PHACTS, the property is always interpreted with the has subset semantics.

The vocabularies used in the RDF representation of a dataset can be declared using the void:vocabulary property.
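
A minimal sketch of how such structure might be declared is given below. The dataset and subset names are hypothetical, and the two vocabularies listed are chosen only as examples. Note the direction of void:subset: the parent dataset is the subject.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Parent dataset: the subject of void:subset ("has subset").
:exampleDataset a void:Dataset ;
    dcterms:title "ExampleDB"@en ;
    void:subset :exampleDataset_molecules , :exampleDataset_targets ;
    # Vocabularies used in the RDF representation (illustrative choice).
    void:vocabulary <http://purl.org/dc/terms/> ,
        <http://www.w3.org/2004/02/skos/core#> .

# Each subset is a dataset in its own right with its own title and description.
:exampleDataset_molecules a void:Dataset ;
    dcterms:title "ExampleDB Molecule"@en ;
    dcterms:description "The subset of ExampleDB that contains data about molecules."@en .

:exampleDataset_targets a void:Dataset ;
    dcterms:title "ExampleDB Target"@en ;
    dcterms:description "The subset of ExampleDB that contains data about targets."@en .
```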

Provenance and Change

A dataset description should link to the description of the previous version of the dataset. This provides a backward pointing chain to all the previous versions of the dataset.

The anticipated update frequency of the dataset is captured to allow for the detection of dataset updates.

The provenance of an RDF conversion of an existing dataset is captured using the PAV imported properties [[PAV]]. The imported from property should point to the dataset description for the original dataset. Where this does not exist, it should be supplied conforming to these guidelines. Details of the script used for the conversion including the version number are captured with the pav:createdWith property.

Sources of data from which the dataset derives should be captured by pointing to dataset descriptions for the sources. Where these are not available, or it is unclear, the associated Identifiers.org page may be linked to. However, the version or file used should be captured.
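
The provenance guidance above might be expressed as in the following sketch. All resources are hypothetical, and the specific properties used for the previous version (pav:previousVersion), the update frequency (dcterms:accrualPeriodicity with a term from the freq: vocabulary) and the import source (pav:importedFrom) are assumptions based on the PAV and Dublin Core vocabularies; only pav:createdWith is named explicitly in the text above.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix pav: <http://purl.org/pav/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <#> .

# RDF conversion of version 2 of a hypothetical source database.
:exampleDataset_v2 a void:Dataset ;
    pav:version "2"^^xsd:string ;
    # Backward pointer to the description of the previous version
    # (hypothetical location of the earlier VoID file).
    pav:previousVersion <http://example.org/exampledb/v1/void.ttl#exampleDataset_v1> ;
    # Anticipated update frequency.
    dcterms:accrualPeriodicity freq:quarterly ;
    # Original (non-RDF) dataset that was converted.
    pav:importedFrom :exampleSource_v2 ;
    # Conversion script, including its version (hypothetical URI).
    pav:createdWith <http://example.org/tools/exampledb2rdf/1.3> .

# Description of the original dataset, supplied because none existed.
:exampleSource_v2 a dctypes:Dataset ;
    dcterms:title "ExampleDB (original release)"@en ;
    pav:version "2"^^xsd:string .
```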

Distribution Information

For an RDF dataset the file(s) containing the data should be linked to; standard file extensions and content-negotiation should be used to allow for the automated parsing of the file. This is to enable the Open PHACTS Discovery Platform to load the data and make it available to users of the platform. The SPARQL endpoint can also be supplied when this exists.

For a non-RDF dataset it is unknown what format the data file will be provided in, so a description of the file should be provided. Standard media types from IANA should be used. Some common ones for the datasets in Open PHACTS are included below. (A comprehensive list is available from Sitepoint.)
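
As a hedged sketch of the two cases, the first description below links an RDF dataset to its data file and SPARQL endpoint using VoID, and the second describes the file of a non-RDF dataset using DCAT. All URLs are hypothetical, and the use of dcat:distribution, dcat:downloadURL and a literal-valued dcat:mediaType is an assumption about how the DCAT vocabulary would be applied here rather than a requirement stated in this section.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# RDF dataset: link to the data file(s) and, where available, the endpoint.
:exampleDataset a void:Dataset ;
    void:dataDump <http://example.org/exampledb/20130723/exampledb.ttl> ;
    void:sparqlEndpoint <http://example.org/sparql> .

# Non-RDF dataset: describe the file and state its IANA media type.
:exampleSource a dctypes:Dataset ;
    dcat:distribution :exampleSource_csv .

:exampleSource_csv a dcat:Distribution ;
    dcterms:description "Comma separated export of all ExampleDB records."@en ;
    dcat:downloadURL <http://example.org/exampledb/20130723/exampledb.csv> ;
    dcat:mediaType "text/csv" .
```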

Examples

Example VoID dataset descriptions can be found in Appendix C. The first example is for part of the ChEMBL dataset (see Appendix C.1). The second example is for the RDF representation of the DrugBank database, given in Appendix C.2. This demonstrates the level of information required to track the provenance from a source dataset through to the RDF representation. Both datasets contain subset definitions.

Describing Linksets

A linkset is itself a dataset, and as such should provide metadata about its content and how it was created. The metadata associated with a link is essential for enabling its reuse by others. It enables a consumer of the link to understand which datasets are linked (including version information), who claimed the link, under what circumstances, the level of curation of the links, and which (if any) tools were used to generate the link (e.g. [[SILK]]).

A linkset should point to the dataset descriptions of the datasets that it uses. These descriptions should be provided by the dataset provider as part of the dataset publishing process [[OPS-RDF]]. However, there are occasions when one or both of the linked datasets do not provide a VoID dataset description. In this case enough information should be given to identify the dataset linked to, and if it is known, the version. Ideally this information should be provided as a dataset description, but linking to the Identifiers.org page for the dataset may be all that is possible.

Linkset Description Checklist

The properties for describing a linkset are:

Guidance

Basic Metadata

The title given to a linkset is used in apps when displaying details of the linkset. A title should be short, ideally stating which datasets are linked, e.g. ConceptWiki-ChemSpider Linkset.

The description for a linkset should give a textual summary of the linkset. That is, it should state which datasets have been linked and the reason for the equivalence. It should be expressed in terminology aimed at an app user and is expected to be a few sentences long.

See Section 4.2 for guidance about publisher, license, issued date and data dump. Note that for linksets there is unlikely to be a web page directly dedicated to the linkset; as such, that part of the dataset metadata is not stated as a requirement. The license under which a linkset is published may be different from that of the datasets that it links, even if it is a subset of one of the datasets.

Link Metadata

Formally, the VoID specification [[VOID]] states that a Linkset links two RDF resources. However, there is a discussion about enabling Linksets to link to non-RDF resources. We permit this usage of Linksets within Open PHACTS.

The link predicate is used to specify the mapping relationship used to relate the concepts in the linkset. This is used to state the degree of equivalence there is between the resources in the subject and object dataset. See Section 3.2 for the different mapping relationships and their meaning. Within Open PHACTS we encourage the use of skos:exactMatch or skos:closeMatch rather than owl:sameAs as the linking predicate. This is because owl:sameAs has a very precise meaning with the consequence that the two resources can be merged together. In general skos:exactMatch provides the appropriate level of equivalence.

The link justification is used to state the notion of equivalence that holds between the resources in the two datasets, e.g. they are conceptually the same gene (http://semanticscience.org/resource/SIO_010035) or the gene produces the protein and thus is used as a proxy for the protein (http://semanticscience.org/resource/SIO_000985). Appendix B.2 gives the set of justifications used in the Open PHACTS Discovery Platform.

The subjects target and objects target predicates should point to dataset descriptions contained in a VoID file. This may be a resource in the same file or in some other file. However, they should point to a description of a specific version of a dataset. For example, for a link to version 16 of the ChEMBL molecules dataset the objects target value would be http://rdf.ebi.ac.uk/dataset/chembl/16.2/void.ttl#chembl_rdf_molecule_dataset. Descriptions of the datasets loaded in the Open PHACTS Discovery Platform are available through the http://beta.openphacts.org/sources method.

The subjects datatype and objects datatype predicates should specify the type of the resources that are linked. This may be different from the datatype of the whole dataset. For example, ConceptWiki has, among other things, data about genes, proteins, and chemicals. However, a specific linkset, such as the one to ChemSpider, only contains identifiers for chemicals. This linkset would have the subjects datatype set to chemical entity (http://semanticscience.org/resource/SIO_010004). The set of data types used in the Open PHACTS Discovery Platform is available in Appendix B.1.

The subjects species and objects species predicates are used to state when a linkset is limited to biological targets from a particular species. For example, for a linkset between mouse proteins and human proteins the subjects species would be mouse (http://purl.obolibrary.org/obo/NCBITaxon_10090) while the objects species would be human (http://purl.obolibrary.org/obo/NCBITaxon_9606). Appendix B.3 gives the set of species used in the Open PHACTS Discovery Platform.
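
The link metadata described above might be combined as in the following sketch. Both linksets and all http://example.org/ resources are hypothetical. The predicate bdb:linksetJustification appears in the example in Section 3; the names bdb:subjectsDatatype, bdb:objectsDatatype, bdb:subjectsSpecies and bdb:objectsSpecies are assumptions about how the datatype and species predicates are spelled, so Appendix A should be consulted for the authoritative terms.

```turtle
@prefix bdb: <http://vocabularies.bridgedb.org/ops#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Hypothetical linkset between chemical records in two versioned datasets.
:exampleA2exampleB_inchi a void:Linkset ;
    dcterms:title "ExampleA-ExampleB Linkset"@en ;
    # Targets point to descriptions of specific dataset versions.
    void:subjectsTarget <http://example.org/a/1.0/void.ttl#dataset> ;
    void:objectsTarget <http://example.org/b/2.0/void.ttl#dataset> ;
    void:linkPredicate skos:exactMatch ;
    # Justification: same InChI Key, as in the earlier example.
    bdb:linksetJustification <http://semanticscience.org/resource/CHEMINF_000059> ;
    # Both sides of every link are chemical entities.
    bdb:subjectsDatatype <http://semanticscience.org/resource/SIO_010004> ;
    bdb:objectsDatatype <http://semanticscience.org/resource/SIO_010004> .

# Hypothetical linkset limited by species: mouse proteins to human proteins.
# A justification from Appendix B.2 would also be stated; it is left out
# here rather than guessed.
:mouse2human_proteins a void:Linkset ;
    void:subjectsTarget <http://example.org/a/1.0/void.ttl#dataset> ;
    void:objectsTarget <http://example.org/b/2.0/void.ttl#dataset> ;
    void:linkPredicate skos:relatedMatch ;
    bdb:subjectsSpecies <http://purl.obolibrary.org/obo/NCBITaxon_10090> ;
    bdb:objectsSpecies <http://purl.obolibrary.org/obo/NCBITaxon_9606> .
```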

Provenance Metadata

The subset predicate is used to identify those linksets that have been extracted from a larger dataset, e.g. linksets extracted from ChEMBL are a subset of the ChEMBL dataset. Note that the dataset resource is the subject of the triple and the linkset resource is the object as the predicate expresses a has subset relationship.
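
The direction of the subset triple is easy to get wrong, so a minimal sketch is given here. The resource names are hypothetical.

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# The dataset is the subject and the linkset is the object ("has subset").
:exampleDataset a void:Dataset ;
    void:subset :exampleDataset_targets_linkset .

:exampleDataset_targets_linkset a void:Linkset .
```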

Author and creator information can be provided to allow the tracing of who is making the claims of equivalence represented by the links in the linkset. The values for these resources should not be locally defined values, but link to external representations for individuals that can be reused. We recommend the use of ORCID identifiers.

Example

An example linkset description can be found in Appendix D. The example shows how a linkset can be included as part of a complete dataset description (see Appendix C.1) and also link to an external dataset description.

Deploying and Exchanging VoID Documents

The two preceding sections have prescribed the metadata required to describe datasets and the linksets that inter-relate them. This section outlines the expected deployment and exchange mechanisms and should be read in conjunction with Section 6 of the VoID specification for more details.

VoID Document Metadata

The primary purpose of the VoID document metadata is to provide details of who created the description of the dataset and when. It also points to the main (parent) dataset resource described in the document.

Metadata Checklist

VoID documents describing datasets and linksets MUST contain a metadata block describing the VoID document using the following properties:

Of course, other properties may also be declared. For more details, see Section 6.2 of the VoID specification [[VOID]].

Guidance

The date issued property is used to capture the date that the dataset description is ready for use, i.e. its publication date. This is captured as an xsd:dateTime literal, where unknown values should be set to the start of the valid time period. The dataset description issued date may be different from the date of issue of the dataset and also from the creation and last modified dates; the latter two may be captured using the predicates from the PAV vocabulary.

The primary topic of the dataset description document is the resource that describes the dataset. This should be a locally defined resource.

The person responsible for publishing the dataset as RDF is most likely also the creator of the VoID description for it. This is captured with the created by property. We recommend the use of an ORCID identifier, but any valid URI will suffice. The resulting resource should be available as RDF.

The tool used to create the VoID document can be captured using pav:createdWith. The VoID Editor would be one example, but ideally this should be the script that runs the data publishing pipeline, of which the generation of the VoID description is a part.

When a title is provided, it should be short as it will be used in apps to display details of the dataset description. When a description is provided, it should summarise the process by which the dataset was created and be understandable to those outside of the immediate development team.

Example

An example is given below based on the ChemSpider deployment. (Note the use of an empty-string relative URI (<>) as a syntactic shortcut for the URI of the document that contains the statements; the real deployment has a date versioned URI such as ftp://ftp.rsc-us.org/OPS/20130408/void_2013-04-08.ttl.)

<> a void:DatasetDescription ;
    dcterms:issued "2013-08-22T10:33:00Z"^^xsd:dateTime ;
    pav:createdBy <http://orcid.org/0000-0002-5711-4872> ;
    pav:createdOn "2012-05-02T13:50:34Z"^^xsd:dateTime ;
    pav:lastUpdateOn "2012-08-10T13:52:12Z"^^xsd:dateTime ;
    foaf:primaryTopic :chemSpiderDataset .

Deploying VoID Descriptions

Several mechanisms for deploying VoID descriptions are given in Section 6 of the VoID Note [[VOID]]. In this section we provide recommendations for good practice within Open PHACTS.

We recommend that the dataset description is made available as a separate file from the data that it describes. This enables the use of the description, without needing to download the entire dataset, in catalogues/registries and by tools such as the Open PHACTS Identity Mapping Service (IMS) [[IMS]].

Deploying a Dataset Description

The current dataset description for a dataset SHOULD be available from a well known location relative to the dataset. Best practice for this well known location is a file called void.ttl in the root directory for the dataset. However, we recommend that an HTTP 302 redirect is used to dereference to a versioned copy of the VoID file.

For example, for ChemSpider we would have a URI such as http://rdf.chemspider.com/void.ttl#chemSpiderDataset which redirects to a versioned URI such as http://rdf.chemspider.com/20130408/void.ttl#chemSpiderDataset.

Dataset Deployment

Each RDF document containing the data MUST contain a backlink using void:inDataset to the dataset descriptor. For example, for the ChEMBL-RDF molecule m1, there would be the triple:

<http://linkedchemistry.info/chembl/chemblid/molecule/m1> void:inDataset 
    <http://linkedchemistry.info/chembl/chemblid/void.ttl#chembl-rdf_compounds> .    
			

Guidance

The above deployment example points to the unversioned dataset description, which SHOULD redirect to the current latest version. This is the approach for serving linked data pages and is valid providing that the data item remains in the latest version of the dataset.

When creating a data dump, the void:inDataset link should point to the corresponding versioned dataset description.
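
For instance, reusing the illustrative ChemSpider URIs given earlier in this section (which are examples of the pattern rather than confirmed deployment addresses), a record inside a dated data dump might carry a backlink such as:

```turtle
@prefix void: <http://rdfs.org/ns/void#> .

# Inside a data dump the backlink names the versioned description,
# so the triple stays correct after newer versions are released.
<http://rdf.chemspider.com/48> void:inDataset
    <http://rdf.chemspider.com/20130408/void.ttl#chemSpiderDataset> .
```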

Deploying a Linkset Description

For the purposes of Open PHACTS, it is anticipated that linksets will be materialised as separate documents from the datasets. This is to allow their loading into the identity mapping service [[IMS]].

It is recommended that the linkset is described in the main dataset description document. Where there is no dataset description document it is recommended that a separate dataset description document is created for the linkset. The file containing the links MUST provide a link back to the linkset description using the void:inDataset predicate. The example below shows how a set of links can refer back to the linkset description given in the ChEMBL-RDF VoID description file.

<> void:inDataset <http://linkedchemistry.info/chembl/chemblid/void.ttl#chembl-rdf_targets-uniprot-linkset> .

<http://linkedchemistry.info/chembl/target/t1> skos:exactMatch <http://purl.uniprot.org/uniprot/O43451> .
...
			

External VoID Files

Tooling being developed within Open PHACTS MUST support the predicates stated in this document. However, it SHOULD also be able to read VoID files from external sources that do not comply completely with this specification, but do comply with the VoID standard [[VOID]]. An example would be the use of the void:target predicate instead of the void:subjectsTarget and void:objectsTarget predicates. Such usage SHOULD NOT be the norm and SHOULD result in warnings being generated.
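
The difference between the two forms can be sketched as follows. The linkset and dataset names are hypothetical. The first linkset uses void:target, which is valid VoID but does not say which dataset supplies the subjects of the links; the second uses the directional predicates expected by this specification.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Form that an external VoID file may use: readable, but should raise a warning.
:external_linkset a void:Linkset ;
    void:target :datasetA , :datasetB ;
    void:linkPredicate skos:exactMatch .

# Form expected by this specification: the direction of the links is explicit.
:conformant_linkset a void:Linkset ;
    void:subjectsTarget :datasetA ;
    void:objectsTarget :datasetB ;
    void:linkPredicate skos:exactMatch .
```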

Related Standards

Nanopublications

Nanopublications [[NANOPUB]] provide a means for data providers to obtain credit for their data contribution, in particular data that can be described in the form of a minimal set of assertions: a minimal piece of information that represents value for which credit is due. Such information is closely related to a link relating instances in two datasets. In some cases it may be desirable to publish a link as a nanopublication. This should not violate a link being published in a linkset according to this specification.

BioDBcore

BioDBcore defines the following properties as the set of metadata that should be published in relation to a dataset. The aim of BioDBcore is different from that of VoID, but many of the elements defined are covered in the Open PHACTS dataset description.

An example BioDBcore record for ChEMBL.

The metadata specified in Section 4 covers the functional data required from BioDBcore. The aspects not covered are those relating to discovering who is responsible for a dataset and the publications about the dataset. It is expected that such information can be discovered from the dataset's homepage and is not within the use case scope for the description of the dataset. Such information may be added as additional statements in the VoID description.

Provenance Vocabularies

A wide range of provenance vocabularies has been proposed. This section gives brief pointers to related vocabularies that could be used in a dataset or linkset description. For more information about the state of provenance vocabularies, the interested reader is referred to [[PROV-XG]].

Provenance Ontology (PROV-O)

The Provenance Ontology (PROV-O) [[PROV-O]] is a W3C candidate recommendation for representing provenance information about documents, datasets, workflow runs, etc. It is broadly based on the Open Provenance Model [[OPM]]. It is capable of expressing complex provenance relationships.

Provenance, Authoring and Versioning Ontology

The Provenance, Authoring and Versioning Ontology (PAV) [[PAV]] provides a comprehensive set of relationships for capturing basic provenance information.

Provenance Vocabulary

The Provenance Vocabulary [[PRV]] is another lightweight vocabulary of provenance predicates with an emphasis on data creation and data access on the Web.

Vocabulary for Data and Dataset Provenance (voidp)

Defined as an extension to VoID, the vocabulary for data and dataset provenance (voidp) [[VOIDP]] is a vocabulary for defining provenance relationships of data and datasets. The vocabulary focuses on four specific pieces of provenance information:

"for a piece of data, x :
  • when was x derived,
  • how was x derived,
  • what data had been used to derive x,
  • who carried out the transformations that resulted in the current value of x." [[VOIDP]]

These are a subset of the information that needs to be captured for the Open PHACTS linksets.

Vocabularies for Dataset Descriptions

BridgeDB Mapping Vocabulary

Properties

RDF Property bdb:assertionMethod
Definition: The method by which the links in the Linkset have been asserted.
Domain: void:Linkset
Range: rdfs:Resource
Usage note: Appendix B.4 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform for declaring the curation level of a linkset.
Example: A set of links that have been computer generated but verified by a human would have their assertion method declared to be manual, i.e.
<> bdb:assertionMethod eco:ECO_0000218 .
A set of links that have been computer generated and not verified, e.g. those created by the IMS transitive computation, would have their assertion method declared to be automatic, i.e.
<> bdb:assertionMethod eco:ECO_0000203 .
RDF Property bdb:linksetJustification
Definition: The reason why the Linkset claims these concepts are equivalent.
Domain: void:Linkset
Range: rdfs:Resource
Usage note: Appendix B.2 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform for declaring equivalence conditions. These include saying that the datasets are conceptually about the same type of concept, that they match in their chemical structure, or that they represent the protein generated by some gene.
Example: A linkset between two chemicals that share the same chemical structure would use the value http://semanticscience.org/resource/CHEMINF_000059.
A linkset between two proteins would say that they were conceptually the same protein using the value http://semanticscience.org/resource/SIO_010043.
A linkset between a protein and a gene would say that they are conceptually the same by stating that it is a protein coding gene with http://semanticscience.org/resource/SIO_000985.
RDF Property bdb:objectsDatatype
Definition: The type of concept the objects of the triples in the Linkset represent.
Domain: void:Linkset
Range: rdfs:Resource
Usage note: Appendix B.1 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform.
Example: A linkset between ChemSpider and ConceptWiki would declare its objects datatype to be Chemical Entities. 
RDF Property bdb:objectsSpecies
Definition: The species of the concept the objects of the triples in the Linkset represent.
Domain: void:Linkset
Range: rdfs:Resource
Usage note: Appendix B.3 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform.
Example: A linkset between the mouse proteins in UniProt and the mouse proteins in the Mouse Genome Database would have its objects species set to http://purl.obolibrary.org/obo/NCBITaxon_10090
RDF Property bdb:subjectsDatatype
Definition: The type of concept the subjects of the triples in the Linkset represent.
Domain: void:Linkset
Range: rdfs:Resource
Usage note: Appendix B.1 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform.
Example: A linkset between ConceptWiki and ChemSpider would declare its subjects datatype to be Chemical Entities. 
RDF Property bdb:subjectsSpecies
Definition: The species of the concept the subjects of the triples in the Linkset represent.
Domain: void:Linkset
Range: rdfs:Resource
Usage note: Appendix B.3 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform.
Example: A linkset between the mouse proteins in UniProt and the mouse proteins in the Mouse Genome Database would have its subjects species set to http://purl.obolibrary.org/obo/NCBITaxon_10090
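
The properties above can be combined in a single linkset description. The following is a minimal sketch for the UniProt to Mouse Genome Database example, not an actual published description: the example.org namespace and the linkset name are hypothetical, and the choice of an automatic assertion method is an assumption made only for illustration.

```turtle
@prefix bdb:  <http://vocabularies.bridgedb.org/ops#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix ex:   <http://example.org/void.ttl#> .   # hypothetical namespace

ex:uniprot_mgd_mouse_linkset a void:Linkset ;
    # Both sides of each link represent proteins.
    bdb:subjectsDatatype <http://semanticscience.org/resource/SIO_010043> ;
    bdb:objectsDatatype  <http://semanticscience.org/resource/SIO_010043> ;
    # Both sides are restricted to mouse.
    bdb:subjectsSpecies  <http://purl.obolibrary.org/obo/NCBITaxon_10090> ;
    bdb:objectsSpecies   <http://purl.obolibrary.org/obo/NCBITaxon_10090> ;
    # The linked resources are conceptually the same protein.
    bdb:linksetJustification <http://semanticscience.org/resource/SIO_010043> ;
    # Assumed here: computer generated and not verified (automatic).
    bdb:assertionMethod <http://purl.obolibrary.org/obo/ECO_0000203> .
```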

Citation Typing Ontology

Properties

RDF Property cito:citeAsAuthority
Definition: The citing entity cites the cited entity as one that provides an authoritative description or definition of the subject under discussion.
Usage note: Used to specify the recommended citation for a dataset; particularly when this is a requirement of the dataset license.
Example: The UniProt license requires that credit be given by citing a specific publication.

Data Catalog Vocabulary

Classes

RDF Class dcat:Dataset
Definition: A collection of data, published or curated by a single source, and available for access or download in one or more formats.
Sub class of: dctypes:Dataset
RDF Class dcat:Distribution
Definition: Represents a specific available form of a dataset. Each dataset might be available in different forms, these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed.

Properties

RDF Property dcat:byteSize
Definition: The size of a distribution in bytes.
Range: rdfs:Literal typed as xsd:decimal.
Usage note: The size in bytes can be approximated when the precise size is not known.
Example: The compressed ChEMBL 16 molecule data is available from ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.0/chembl_16_molecule.ttl.gz and is 483MB in size. This would be captured with the byte size predicate for this resource as
dcat:byteSize "500000000"^^xsd:integer .
RDF Property dcat:distribution
Definition: Connects a dataset to its available distributions.
Domain: dcat:Dataset
Range: dcat:Distribution
Usage note: Used to specify a link between a dataset and a description of a distribution of the dataset.
Example: The ChEMBL dataset is available in multiple distribution formats. Each one would be described giving properties such as the format and download URL.
RDF Property dcat:downloadURL
Definition: This is a direct link to a downloadable file in a given format. For example, CSV file or RDF file. The format is described by the distribution's dcat:mediaType.
Range: rdfs:Resource
Usage note: The value is a URL. The URL may require access credentials, e.g. username and password.
RDF Property dcat:landingPage
Definition: A Web page that can be navigated to in a Web browser to gain access to the dataset, its distributions and/or additional information.
Sub property of: foaf:page
Domain: dcat:Dataset
Range: foaf:Document
Usage note: Used to specify a link to a human readable web page documenting the underlying dataset.
Example: For the RDF conversion of the DrugBank dataset this property should point to http://www.drugbank.ca/ since this resource provides information about the contents of the dataset.
See also: foaf:homepage, foaf:page
RDF Property dcat:mediaType
Definition: The media type of the distribution as defined by IANA.
Range: dcterms:MediaTypeOrExtent
Usage note: This property SHOULD be used when the media type of the distribution is defined in IANA, otherwise dcterms:format MAY be used with different values.
RDF Property dcat:theme
Definition: The main category of the dataset. A dataset can have multiple themes.
Sub property of: dcterms:subject
Range: skos:Concept
Usage note: Used to provide details of the types of concepts covered by the dataset. These should be drawn from existing well used domain vocabularies. The dataset provider may want to consider the semantic types selected from the Semantic Science Integrated Ontology as given in Appendix B.1.
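
As a sketch of how the DCAT terms fit together, the fragment below links a dataset to one distribution and describes that distribution. The download URL, landing page and approximate size are taken from the examples in this document; the example.org namespace and the resource names are hypothetical, and the byte size is typed as xsd:decimal to follow the stated range.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/void.ttl#> .   # hypothetical namespace

ex:chembl_molecule_dataset a dcat:Dataset ;
    dcat:landingPage <https://www.ebi.ac.uk/chembl> ;
    dcat:distribution ex:chembl_molecule_turtle .

ex:chembl_molecule_turtle a dcat:Distribution ;
    dcat:downloadURL <ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.0/chembl_16_molecule.ttl.gz> ;
    dcat:byteSize "500000000"^^xsd:decimal .   # approximate size in bytes
```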

Dublin Core Terms

Properties

RDF Property dcterms:accrualPeriodicity
Definition: The frequency with which items are added to a collection.
Range: dcterms:Frequency
Usage note: Indicates the expected update frequency of the dataset.
RDF Property dcterms:description
Definition: An account of the resource.
Range: rdfs:Literal
Usage note: Provides a human readable description of the dataset, for example for use in GUIs.
For more details see Section 2.2 of the VoID Specification.
RDF Property dcterms:issued
Definition: Date of formal issuance (e.g., publication) of the resource.
Range: rdfs:Literal typed as xsd:date .
Usage note:

The date is encoded as a literal in "yyyy-mm-dd" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified.

See also: pav:createdOn
RDF Property dcterms:license
Definition: A legal document giving official permission to do something with the resource.
Range: rdfs:Resource
Usage note:

Declares the license under which the dataset is published.

Where possible, we recommend publishing data under an open license to enable reuse of the data with appropriate acknowledgement. For Open PHACTS, the recommended license is CC-BY-SA. A list of alternative licenses is available in Section 2.4 of the W3C note on VoID.

For more details see Section 2.4 of the VoID Specification.

RDF Property dcterms:publisher
Definition: An entity responsible for making the resource available.
Range: rdfs:Resource
Example: For ChEMBL, the publisher would be the EBI: http://www.ebi.ac.uk/. This is different from the value for the homepage in this case.
See also: dcat:landingPage, foaf:homepage
RDF Property dcterms:subject
Definition: The topic of the dataset.
Range: rdfs:Resource
Usage note:

Declares the topics covered by the dataset. Multiple topics can be declared. If the data is split into subsets, then the topics should be associated with the subsets. BioPortal [[BioPortalWeb]] [[BioPortal]] can be used to search for suitable vocabulary terms for topics. DBPedia URIs may also be used. A list of common terms relevant for Open PHACTS is given in Appendix A.2.

For more details see Section 2.5 of the VoID Specification.

RDF Property dcterms:title
Definition: A name given to the resource.
Range: rdfs:Literal
Usage note: Used to declare the short name of the dataset, for example for use in GUIs.
For more details see Section 2.2 of the VoID Specification.
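
The Dublin Core properties above might be used together as in the following sketch. The dataset, its title, description, publisher and update frequency are all hypothetical and serve only to show the expected value forms; the license URI is the CC-BY-SA license recommended above, and the subject is the Chemical Entity term from the Semantic Science Integrated Ontology.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix freq:    <http://purl.org/cld/freq/> .
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/void.ttl#> .   # hypothetical namespace

ex:example_dataset a void:Dataset ;
    dcterms:title "Example Compounds" ;
    dcterms:description "A hypothetical dataset of chemical compounds, used only to illustrate the properties." ;
    dcterms:issued "2013-08-01"^^xsd:date ;   # day not known, so 01 is given
    dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
    dcterms:publisher <http://example.org/> ;
    dcterms:subject <http://semanticscience.org/resource/SIO_010004> ;   # Chemical Entity
    dcterms:accrualPeriodicity freq:quarterly .   # assumed update frequency
```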

Dublin Core Types

Classes

RDF Class dctypes:Dataset
Definition: Data encoded in a defined structure.
Usage note: Used to declare the type of a dataset published in a format other than RDF. For a dataset published in RDF use void:Dataset.

Friend of a Friend

Properties

RDF Property foaf:homepage
Definition: A homepage for the dataset.
Range: foaf:Document
Usage note: Specifies the homepage for the resource. For more details see Section 2.1 of the VoID Specification.
Usage note: foaf:homepage is an inverse functional property (IFP) which means that it should be unique and precisely identify the catalog. This allows smushing various descriptions of the catalog when different URIs are used.
Example: The foaf:homepage for the ChemSpider RDF dataset is http://rdf.chemspider.com/.
See also: foaf:page, dcat:landingPage
RDF Property foaf:page
Definition: A page about the dataset.
Range: foaf:Document
Usage note: Specifies a webpage or document for the resource. Points to secondary sources of information, e.g. catalogue records.
Example: For the ChemSpider RDF dataset, the foaf:page property could point to the associated MIRIAM record http://www.ebi.ac.uk/miriam/main/collections/MIR:00000138. For more details see Section 2.1 of the VoID Specification.
See also: foaf:homepage, dcat:landingPage
RDF Property foaf:primaryTopic
Definition: The primary topic of some page or document.
Domain: foaf:Document
Usage note:

Relates a document, e.g. a DatasetDescription, to the main thing the document is about, e.g. Dataset. The value of this property is a URI that appears in the dataset description.

foaf:primaryTopic is a functional property which means that it can be used at most once.

See also: foaf:topic
RDF Property foaf:topic
Definition: A topic of some page or document.
Domain: foaf:Document
Usage note:

Relates a document, e.g. a DatasetDescription, to the things that the document is about, e.g. Datasets.

See also: foaf:primaryTopic
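
A minimal sketch of the FOAF properties in use is given below. The homepage and MIRIAM page are the ChemSpider values quoted in the examples above; the description document URI and the dataset name under example.org are hypothetical.

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix ex:   <http://example.org/void.ttl#> .   # hypothetical namespace

# The description document points at the dataset it is mainly about.
<http://example.org/void.ttl> a void:DatasetDescription ;
    foaf:primaryTopic ex:chemspider_dataset .

# The dataset has one homepage and may have any number of secondary pages.
ex:chemspider_dataset a void:Dataset ;
    foaf:homepage <http://rdf.chemspider.com/> ;
    foaf:page <http://www.ebi.ac.uk/miriam/main/collections/MIR:00000138> .
```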

Provenance, Authoring and Versioning Ontology

Properties

RDF Property pav:authoredBy
Definition: An agent that originated or gave existence to the work that is expressed by the digital resource.
Range: rdfs:Resource
Usage note:

The author of the content of a resource may be different from the creator of the resource representation (although they are often the same). See pav:createdBy for a discussion.

Example: The object of the triple should point to a URL that represents the person responsible for authoring the content, e.g. http://orcid.org/0000-0002-5711-4872.
See also: pav:createdBy
RDF Property pav:authoredOn
Definition: The date this resource was authored.
Range: rdfs:Literal typed as xsd:dateTime .
Usage note: The dateTime is encoded as a literal in "yyyy-mm-ddThh:mm:sszzzzz" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified. If the specific hour, minute or second are not known, then 00 should be specified. If the timezone is not known then Z should be specified.
RDF Property pav:createdBy
Definition: An agent primarily responsible for making the digital artifact or resource representation.
Range: rdfs:Resource
Usage note:

This property is distinct from forming the content, which is indicated with pav:contributedBy or its subproperties; pav:authoredBy, which identifies who authored the knowledge expressed by this resource; and pav:curatedBy, which identifies who curated the knowledge into its current form.

Example: The object of the triple should point to a URL that represents the person or tool responsible for creating the digital artifact, e.g. http://orcid.org/0000-0002-5711-4872.
See also: pav:authoredBy, pav:createdWith
RDF Property pav:createdOn
Definition: The date of creation of the resource.
Range: rdfs:Literal typed as xsd:dateTime .
Usage note: The dateTime is encoded as a literal in "yyyy-mm-ddThh:mm:sszzzzz" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified. If the specific hour, minute or second are not known, then 00 should be specified. If the timezone is not known then Z should be specified.
See also: dcterms:issued, pav:lastUpdateOn
RDF Property pav:createdWith
Definition: The software/tool used by the creator (pav:createdBy) when making the digital resource.
Range: rdfs:Resource
Usage note:

The specific version of a tool/script that was used to create the resource.

Example: The versioned GitHub URL used to perform an RDF conversion: https://github.com/openphacts/chembl.rdf/blob/97c900460b46481aac07dfb11807a3f49fc92b2e/ops.ttl
RDF Property pav:derivedFrom
Definition: Derived from a different resource.
Range: rdfs:Resource
Usage note:

Derivation concerns itself with derived knowledge. If this resource has the same content as the other resource, but has simply been transcribed to fit a different model (like XML -> RDF or SQL -> CSV), use pav:importedFrom. If a resource was simply retrieved, use pav:retrievedFrom. If, however, the content has been further refined or modified, pav:derivedFrom should be used.

Details about who performed the derivation may be indicated with pav:contributedBy and its subproperties.

See also: pav:importedFrom
RDF Property pav:importedBy
Definition: An entity responsible for importing the data.
Range: rdfs:Resource
Usage note:

The importer is usually a software entity which has done the transcription from the original source.

Note that pav:importedBy may overlap with pav:createdWith.

The source for the import should be given with pav:importedFrom. The time of the import should be given with pav:importedOn.

See also: pav:createdWith
RDF Property pav:importedFrom
Definition: The original source of imported information.
Range: rdfs:Resource
Usage note:

Import means that the content has been preserved, but transcribed somehow, for instance to fit a different representation model. Examples of import are when the original was JSON and the current resource is RDF, or where the original was a document scan, and this resource is the plain text found through OCR.

The imported resource does not have to be complete, but should be consistent with the knowledge conveyed by the original resource.

If additional knowledge has been contributed, pav:derivedFrom would be more appropriate.

If the resource has been copied verbatim from the original representation (e.g. downloaded), use pav:retrievedFrom.

To indicate which agent(s) performed the import, use pav:importedBy. Use pav:importedOn to indicate when it happened.

See also: pav:derivedFrom
RDF Property pav:importedOn
Definition: The date this resource was imported from a source (pav:importedFrom).
Range: rdfs:Literal typed as xsd:dateTime .
Usage note:

If the source is later reimported, this should be indicated with pav:lastRefreshedOn.

The source of the import should be given with pav:importedFrom. The agent that performed the import should be given with pav:importedBy.

The dateTime is encoded as a literal in "yyyy-mm-ddThh:mm:sszzzzz" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified. If the specific hour, minute or second are not known, then 00 should be specified. If the timezone is not known then Z should be specified.

RDF Property pav:lastRefreshedOn
Definition: The date of the last re-import of the resource.
Range: rdfs:Literal typed as xsd:dateTime .
Usage note:

This property is used in addition to pav:importedOn if this version has been updated due to a re-import. If the re-import created a new resource rather than refreshing an existing one, then pav:importedOn should be used together with pav:previousVersion.

The dateTime is encoded as a literal in "yyyy-mm-ddThh:mm:sszzzzz" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified. If the specific hour, minute or second are not known, then 00 should be specified. If the timezone is not known then Z should be specified.

See also: pav:importedOn
RDF Property pav:lastUpdateOn
Definition: The date of the last update of the resource.
Range: rdfs:Literal typed as xsd:dateTime .
Usage note:

An update is a change which did not warrant making a new resource related using pav:previousVersion, for instance correcting a spelling mistake.

The dateTime is encoded as a literal in "yyyy-mm-ddThh:mm:sszzzzz" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified. If the specific hour, minute or second are not known, then 00 should be specified. If the timezone is not known then Z should be specified.

See also: pav:createdOn
RDF Property pav:previousVersion
Definition: The previous version of a resource in a lineage.
Range: rdfs:Resource
Usage note:

For instance a news article updated to correct factual information would point to the previous version of the article with pav:previousVersion. If however the content has significantly changed so that the two resources no longer share lineage (say a new news article that talks about the same facts), they should be related using pav:derivedFrom.

A version number of this resource can be provided using the data property pav:version.

See also: pav:version
RDF Property pav:version
Definition: The version number of a resource.
Range: rdfs:Literal
Usage note: This is a freetext string; typical values are "1.5" or "21". The URI identifying the previous version can be provided using pav:previousVersion.
See also: pav:previousVersion
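
The distinction between importing and deriving, and the use of the date and version properties, can be sketched as follows. All resources under example.org, the version strings and the dates are hypothetical; the time of day is given as 00:00:00 and the timezone as Z, as suggested above for unknown values.

```turtle
@prefix pav:  <http://purl.org/pav/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/void.ttl#> .   # hypothetical namespace

# An RDF conversion with the same content as its source: an import.
ex:dataset_v2 a void:Dataset ;
    pav:version "2" ;
    pav:previousVersion ex:dataset_v1 ;
    pav:importedFrom <http://example.org/source/export.csv> ;
    pav:importedBy <http://example.org/tools/converter> ;
    pav:importedOn "2013-08-01T00:00:00Z"^^xsd:dateTime ;
    pav:lastRefreshedOn "2013-09-01T00:00:00Z"^^xsd:dateTime .

# A dataset whose content was refined beyond transcription: a derivation.
ex:curated_dataset a void:Dataset ;
    pav:derivedFrom ex:dataset_v2 ;
    pav:authoredBy <http://example.org/people/curator> ;
    pav:createdOn "2013-09-15T00:00:00Z"^^xsd:dateTime .
```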

Vocabulary of Interlinked Datasets

Classes

RDF Class void:Dataset
Definition: A set of RDF triples that are published, maintained or aggregated by a single provider.
Superclass: dctypes:Dataset
Subclass: void:Linkset
Usage note: Used to declare the type of a dataset published in RDF. For a dataset not published in RDF use dctypes:Dataset.
RDF Class void:DatasetDescription
Definition: A web resource whose foaf:primaryTopic or foaf:topics include void:Datasets.
Superclass: foaf:Document
Usage note: Used to provide metadata about the dataset description document. For more details, see Section 6.2 of the VoID specification [[VOID]].
RDF Class void:Linkset
Definition: A collection of RDF links between two void:Datasets.
Superclass: void:Dataset
Usage note: For more details see Section 1.4 of the VoID Specification.

Properties

RDF Property void:dataDump
Definition: An RDF dump, partial or complete, of a void:Dataset.
Domain: void:Dataset
Range: rdfs:Resource
Usage note:

Defines the location of an RDF dump file that should be provided in one of the standard RDF serializations, and may be compressed. If the dataset is contained in more than one file, then several values of this property should be given. The files may be password protected.

For more details see Section 3.3 of the VoID Specification.

RDF Property void:exampleResource
Definition: example resource of dataset.
Domain: void:Dataset
Range: rdfs:Resource
Usage note:

Multiple resources can be declared. If the data is split into subsets, then the example resources should be associated with the subsets.

For more details see Section 4.1 of the VoID Specification.

RDF Property void:inDataset
Definition: Points to the void:Dataset that a document is a part of.
Domain: foaf:Document
Range: void:Dataset
Superproperty: dcterms:isPartOf
Usage note:

Links a file containing the RDF dataset to the description of the dataset.

For more details see Section 6.3 of the VoID Specification.

RDF Property void:linkPredicate
Definition: a link predicate
Domain: void:Linkset
Range: rdf:Property
Usage note:

Specifies the mapping relationship which should be one of the properties given in the Mapping Relationships Table.

For more details see Section 5.3 of the VoID Specification.

RDF Property void:objectsTarget
Definition: The dataset describing the objects of the triples contained in the Linkset.
Domain: void:Linkset
Range: void:Dataset
Superproperty: void:target
Usage note:

Should point to a versioned URI of a dataset that appears in this or another VoID file. The datasets may themselves be a subset of a larger dataset. Where the datasets do not provide a VoID description, the minimal required information must be provided in the linkset description. This is detailed in Section 5.5.

This property may only appear once per Linkset (owl:FunctionalProperty).

For more details see Section 5.1 of the VoID Specification.

See also: void:subjectsTarget
RDF Property void:sparqlEndpoint
Definition: has a SPARQL endpoint at
Domain: void:Dataset
Range: rdfs:Resource
Usage note:

Defines the location of a SPARQL endpoint where the data may be queried.

For more details see Section 3.2 of the VoID Specification.

RDF Property void:subjectsTarget
Definition: The dataset describing the subjects of triples contained in the Linkset.
Domain: void:Linkset
Range: void:Dataset
Superproperty: void:target
Usage note:

Should point to a versioned URI of a dataset that appears in this or another VoID file. The datasets may themselves be a subset of a larger dataset. Where the datasets do not provide a VoID description, the minimal required information must be provided in the linkset description. This is detailed in Section 5.5.

This property may only appear once per Linkset (owl:FunctionalProperty).

For more details see Section 5.1 of the VoID Specification.

See also: void:objectsTarget
RDF Property void:subset
Definition: has subset
Domain: void:Dataset
Range: void:Dataset
Usage note:

In the VoID note [[VOID]], the void:subset property is used for both subset of and has subset. The declared semantics for the property is has subset. Within Open PHACTS, the property must be used with the has subset semantics.

For more details see Section 4.4 of the VoID Specification.

RDF Property void:target
Definition: One of the two datasets linked by the Linkset.
Domain: void:Linkset
Range: void:Dataset
Subproperties: void:subjectsTarget, void:objectsTarget
Usage note: It is recommended that one of the sub-properties be used.
RDF Property void:triples
Definition: The total number of triples contained in a void:Dataset.
Domain: void:Dataset
Range: rdfs:Literal typed as xsd:integer .
Usage note:

Providing the number of triples included in the linkset allows applications using the linkset to validate that the entire linkset has been successfully loaded.

For more details see Section 4.6 of the VoID Specification.

Example: :chemSpider_drugbank_linkset void:triples "6428"^^xsd:nonNegativeInteger.
RDF Property void:uriRegexPattern
Definition: Defines a regular expression pattern matching URIs in the dataset.
Domain: void:Dataset
Usage note: For more details see Section 4.2 of the VoID Specification.
Example: For the ChemSpider RDF dataset, the value would be "^http://rdf\\.chemspider\\.com/".
See also: void:uriSpace
RDF Property void:uriSpace
Definition: A URI that is a common string prefix of all the entity URIs in a void:Dataset.
Domain: void:Dataset
Range: rdfs:Literal
Usage note: For more details see Section 4.2 of the VoID Specification.
Example: For the ChemSpider RDF dataset, the value would be "http://rdf.chemspider.com/".
See also: void:uriRegexPattern
RDF Property void:vocabulary
Definition: A vocabulary that is used in the dataset.
Domain: void:Dataset
Range: rdfs:Resource
Usage note:

Declare the vocabularies and ontologies that encode the data.

For more details see Section 4.3 of the VoID Specification.
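
The following sketch shows how several of the VoID properties above might appear together, including the has subset direction of void:subset and the directional linkset targets. Every resource under example.org is hypothetical, the triple count is illustrative, and skos:exactMatch is assumed here to be an acceptable mapping relationship.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/void.ttl#> .   # hypothetical namespace

# void:subset is read as "has subset": the parent dataset is the subject.
ex:parent_dataset a void:Dataset ;
    void:uriSpace "http://example.org/resource/" ;
    void:subset ex:target_subset .

ex:target_subset a void:Dataset .

# The linkset states which dataset supplies the subjects and which the objects.
ex:targets_uniprot_linkset a void:Linkset ;
    void:subjectsTarget ex:target_subset ;
    void:objectsTarget  ex:uniprot_dataset ;
    void:linkPredicate  skos:exactMatch ;
    void:triples "6428"^^xsd:integer ;
    void:dataDump <http://example.org/linksets/targets-uniprot.ttl.gz> .
```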

Suggested Vocabulary Terms

Vocabulary Terms for Dataset Topics

This section lists suggested vocabulary terms for the dataset topics metadata that are relevant for Open PHACTS. Where possible, we draw our terminology from the Semantic Science Integrated Ontology.

Concept Type URI Description
Annotation http://semanticscience.org/resource/SIO_001166 An annotation is a written explanatory or critical description, or other in-context information (e.g., pattern, motif, link), that has been associated with data or other types of information.
Chemical Entity http://semanticscience.org/resource/SIO_010004 A chemical entity is a material entity that pertains to chemistry.
Disease http://semanticscience.org/resource/SIO_010299 disease is the outward manifestation of one or more disorders
Drug http://semanticscience.org/resource/SIO_010038 A drug is a chemical entity that regulates a biological process.
Gene http://semanticscience.org/resource/SIO_010035 A gene is part of a nucleic acid that contains all the necessary elements to encode a functional transcript.
Ligand http://semanticscience.org/resource/SIO_010432 a ligand is a molecule that is part of a complex by weakly interacting with another molecule
mRNA http://semanticscience.org/resource/SIO_010099 a messenger RNA is a ribonucleic acid that contains an untranslated region (UTR) and protein coding sequence and lacks introns.
Pathway http://semanticscience.org/resource/SIO_001107 a pathway is an effective specification that outlines a set of actions that forms a way to achieve an objective.
Protein http://semanticscience.org/resource/SIO_010043 a protein is an organic polymer that is composed of one or more linear polymers of amino acids.
RNA http://semanticscience.org/resource/SIO_010009 a ribonucleic acid is an organic polymer composed of a sequence of ribonucleotide residues.
Target http://semanticscience.org/resource/SIO_010423 (None given in ontology)

Species Terms used in OPS

The species terms used in the Open PHACTS Discovery Platform are drawn from the NCBI Taxonomy. The subset of species that appear in the Discovery Platform is given below. These terms can be discovered using the BioPortal browser.

Assertion Method

Below are the expected terms for specifying the assertion method of the links:

Note that a manual assertion includes computer generated assertions that have been manually verified.

Frequency of Change Vocabulary

Below are the terms from the frequency of change vocabulary.

Example Dataset Descriptions

ChEMBL-RDF Dataset Description

Below we provide an example of part of the dataset description document for the ChEMBL-RDF dataset, derived from the file available at ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.0/void.ttl.gz.

The dataset description consists of a parent resource with three distinct subsets. (Note that the actual ChEMBL description contains many more subsets.)

The example dataset description is available as:

@prefix : <http://rdf.ebi.ac.uk/dataset/chembl/16.example/void.ttl#> .
@prefix cco: <http://rdf.ebi.ac.uk/terms/chembl#> .
@prefix chembl: <http://rdf.ebi.ac.uk/resource/chembl/> .
@prefix chembl_activity: <http://rdf.ebi.ac.uk/resource/chembl/activity/> .
@prefix chembl_assay: <http://rdf.ebi.ac.uk/resource/chembl/assay/> .
@prefix chembl_bio_cmpt: <http://rdf.ebi.ac.uk/resource/chembl/biocomponent/> .
@prefix chembl_document: <http://rdf.ebi.ac.uk/resource/chembl/document/> .
@prefix chembl_journal: <http://rdf.ebi.ac.uk/resource/chembl/journal/> .
@prefix chembl_molecule: <http://rdf.ebi.ac.uk/resource/chembl/molecule/> .
@prefix chembl_protclass: <http://rdf.ebi.ac.uk/resource/chembl/protclass/> .
@prefix chembl_source: <http://rdf.ebi.ac.uk/resource/chembl/source/> .
@prefix chembl_target: <http://rdf.ebi.ac.uk/resource/chembl/target/> .
@prefix chembl_target_cmpt: <http://rdf.ebi.ac.uk/resource/chembl/targetcomponent/> .

@prefix bdb: <http://vocabularies.bridgedb.org/ops#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix eco: <http://purl.obolibrary.org/obo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix pav: <http://purl.org/pav/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix sio: <http://semanticscience.org/resource/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix uniprot: <http://purl.uniprot.org/uniprot/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://rdf.ebi.ac.uk/dataset/chembl/16.example/void.ttl#> a void:DatasetDescription ;
	pav:createdBy <http://orcid.org/0000-0002-8011-0300> ;
	pav:contributedBy <http://orcid.org/0000-0002-5711-4872> ;
	pav:createdOn "2009-10-28T00:00:00.000Z"^^xsd:dateTime ;
	pav:lastUpdateOn "2013-08-23T00:00:00.000+01:00"^^xsd:dateTime ;
	dcterms:issued "2013-08-23T00:00:00.000+01:00"^^xsd:dateTime;
	pav:previousVersion <http://rdf.ebi.ac.uk/dataset/chembl/16.2/void.ttl#>;
	foaf:primaryTopic :chembl_rdf_dataset .

:chembl_rdf_dataset a void:Dataset ;
	dcterms:title "The ChEMBL Database" ;
	dcterms:description "ChEMBL is a database of bioactive drug-like small molecules; it contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data is abstracted and curated from the primary scientific literature, and covers a significant fraction of the SAR and discovery of modern drugs." ;
	pav:createdBy <http://orcid.org/0000-0002-8011-0300> ;
	pav:createdOn "2009-10-28T00:00:00.000Z"^^xsd:dateTime ;
	pav:lastUpdateOn "2013-05-07T00:00:00.000+01:00"^^xsd:dateTime ;
	dcterms:issued "2013-08-23T00:00:00.000+01:00"^^xsd:dateTime ;
	pav:version "16.example" ;
	pav:previousVersion <http://rdf.ebi.ac.uk/dataset/chembl/16.2/void.ttl#chembl_rdf_dataset> ;
	dcat:landingPage <https://www.ebi.ac.uk/chembl> ;
	foaf:page <ftp://ftp.ebi.ac.uk/pub/databases/chembl/> ;
	dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
	void:uriSpace "http://rdf.ebi.ac.uk/resource/chembl/" ;
	dcterms:publisher <http://www.ebi.ac.uk/> ;
	void:subset :chembl_rdf_molecule_dataset , :chembl_rdf_target_dataset , :chembl_rdf_targetcmpt_dataset ;
	void:vocabulary <http://purl.org/ontology/bibo> , <http://www.bioassayontology.org/bao> , <http://semanticscience.org/ontology/cheminf.owl> , <http://purl.org/spar/cito> , <http://purl.org/dc/terms> , <http://www.w3.org/2002/07/owl> , <http://www.w3.org/1999/02/22-rdf-syntax-ns> , <http://www.w3.org/2000/01/rdf-schema> , <http://semanticscience.org/ontology/sio.owl> , <http://www.w3.org/2004/02/skos/core> , <http://www.w3.org/2001/XMLSchema> ;
	void:exampleResource chembl_molecule:CHEMBL941 ;
	void:sparqlEndpoint <http://rdf.ebi.ac.uk/dataset/chembl/sparql> ;
	dcterms:accrualPeriodicity freq:quarterly .

:chembl_rdf_molecule_dataset a void:Dataset ;
	dcterms:title "ChEMBL Molecules Dataset" ;
	dcterms:description "The ChEMBL Molecules Dataset describes drug-like small molecules and contains calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.)." ;
	void:uriSpace "http://rdf.ebi.ac.uk/resource/chembl/molecule/" ;
	void:exampleResource chembl_molecule:CHEMBL941 ;
	void:dataDump <ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.example/chembl_16.example_molecule.ttl.gz> ;
	dcat:theme sio:SIO_010004 .

:chembl_rdf_target_dataset a void:Dataset ;
	dcterms:title "ChEMBL Target Dataset" ;
	dcterms:description "The ChEMBL Target Dataset contains abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data)." ;
	void:uriSpace "http://rdf.ebi.ac.uk/resource/chembl/target/" ;
	void:exampleResource chembl_target:CHEMBL2242 ;
	void:dataDump <ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.example/chembl_16.example_target.ttl.gz> ;
	void:subset :chembl_target_targetcmpt_linkset ;
	dcat:theme sio:SIO_010423 .

:chembl_rdf_targetcmpt_dataset a void:Dataset ;
	dcterms:title "ChEMBL Target Component Dataset" ;
	dcterms:description "The ChEMBL Target Component Dataset describes proteins that are contained in targets." ;
	void:uriSpace "http://rdf.ebi.ac.uk/resource/chembl/targetcomponent/" ;
	void:exampleResource chembl_target_cmpt:CHEMBL_TC_583 ;
	void:dataDump <ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.example/chembl_16.example_targetcmpt.ttl.gz> ;
	void:subset :chembl_targetcmpt_uniprot_linkset ;
	dcat:theme sio:SIO_010043 .

        

DrugBank RDF Conversion Dataset Description

Below we provide an example of the dataset description document for the RDF conversion of the DrugBank dataset. The dataset description consists of a parent resource describing the RDF representation of DrugBank, a description of the original DrugBank data from which the RDF was generated (including details of its distribution), and the two distinct subsets that make up the DrugBank RDF dataset. The example dataset description is available as:

@prefix bdb: <http://vocabularies.bridgedb.org/ops#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix pav: <http://purl.org/pav/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix sio: <http://semanticscience.org/resource/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix uniprot: <http://purl.uniprot.org/uniprot/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <#> .

# VoID Header for the DrugBank RDF dataset
<>
	rdf:type void:DatasetDescription ;
	dcterms:title "DrugBank VoID Description"@en ;
	dcterms:description "The VoID description for the RDF representation of the DrugBank dataset."@en ;
	pav:createdBy <https://orcid.org/0000-0002-5711-4872> ;
	pav:createdOn "2012-10-30T16:08:36Z"^^xsd:dateTime ;
	pav:lastUpdateOn "2013-08-23T16:00:00Z"^^xsd:dateTime ;
	dcterms:issued "2013-08-23T16:00:00Z"^^xsd:dateTime ;
	foaf:primaryTopic :drugbank-rdf .

# Metadata about the original DrugBank dataset
:drugbank rdf:type dctypes:Dataset ;
	dcterms:title "DrugBank dataset"@en ;
	dcterms:description "The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 6811 drug entries including 1528 FDA-approved small molecule drugs, 150 FDA-approved biotech (protein/peptide) drugs, 87 nutraceuticals and 5080 experimental drugs. Additionally, 4294 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each DrugCard entry contains more than 150 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data."@en ;
	dcterms:publisher <http://www.drugbank.ca/> ;
	dcat:landingPage <http://www.drugbank.ca/> ;
	dcterms:license <http://www.drugbank.ca/about#cite> ;
	dcterms:issued "2009-01-01T00:00:00Z"^^xsd:dateTime;
	dcat:distribution :drugbank-distribution;
	pav:version "2.5" ;
	foaf:page <http://thedatahub.org/dataset/drugbank> ;
	. 

# Metadata about the distribution of the DrugBank dataset
:drugbank-distribution rdf:type dcat:Distribution ;
	dcat:mediaType "text";
	dcat:downloadURL <http://www.drugbank.ca/system/downloads/2.5/drugcards.zip>.


# Metadata about the RDF representation of DrugBank
:drugbank-rdf rdf:type void:Dataset ;
	dcterms:title "DrugBank RDF"@en;
	dcterms:description """An RDF representation of the DrugBank dataset taken from Free University of Berlin.
	
	The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 6811 drug entries including 1528 FDA-approved small molecule drugs, 150 FDA-approved biotech (protein/peptide) drugs, 87 nutraceuticals and 5080 experimental drugs. Additionally, 4294 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each DrugCard entry contains more than 150 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data."""@en;
	dcterms:publisher <http://www4.wiwiss.fu-berlin.de/> ;
	dcat:landingPage <http://www.drugbank.ca/> ;
	dcterms:license <http://www.drugbank.ca/about#cite> ;
	dcterms:issued 	"2010-08-31T00:00:00Z"^^xsd:dateTime;
	void:dataDump <http://www4.wiwiss.fu-berlin.de/drugbank/drugbank_dump.nt> ;
	pav:version "2.5" ;
	pav:importedFrom :drugbank ;
	pav:importedBy <mailto:anja@anjeve.de> ;
	pav:importedOn "2008-11-17T20:52:39"^^xsd:dateTime ;
	pav:createdWith <https://github.com/anjeve/lodd/tree/master/datasets/DrugBank> ;
	void:subset :db-drugs, :db-targets ; 
	void:sparqlEndpoint <http://www4.wiwiss.fu-berlin.de/drugbank/sparql> ;
	void:uriSpace "http://www4.wiwiss.fu-berlin.de/drugbank/resource/"^^xsd:string ;
	void:vocabulary <http://www.w3.org/1999/02/22-rdf-syntax-ns#>,
		<http://www.w3.org/2002/07/owl#>,
		<http://xmlns.com/foaf/0.1/>,
		<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>,
		<http://www.w3.org/2000/01/rdf-schema#>;	
	foaf:page <http://www4.wiwiss.fu-berlin.de/drugbank/> ;
	foaf:page <http://thedatahub.org/dataset/fu-berlin-drugbank> ;
	void:triples "765936"^^xsd:integer .

# Subset containing drug compound information
:db-drugs rdf:type void:Dataset ;
	void:uriSpace "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/"^^xsd:string ;
	dcat:theme <http://semanticscience.org/resource/SIO_010038> ;
	void:exampleResource <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00001> ;
	void:triples "4772"^^xsd:integer .

# Subset containing target information
:db-targets rdf:type void:Dataset ;
	void:uriSpace "http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/"^^xsd:string ;
	dcat:theme <http://semanticscience.org/resource/SIO_010423> ;
	void:exampleResource <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/1> .
        

Example Linksets

ChEMBL Target Component to UniProt Linkset

Below is the linkset description relating the ChEMBL target component subset of ChEMBL to the UniProt dataset. The linkset description is included in the dataset description for the ChEMBL dataset and the links are contained in a separate file with a backlink using the void:inDataset predicate.

The example linkset description is available as:

:chembl_targetcmpt_uniprot_linkset a void:Linkset ;
	dcterms:title "ChEMBL Target Component to UniProt Linkset" ;
	dcterms:description "The ChEMBL Target Component to UniProt Linkset relating target components to their corresponding entry in UniProt based on their sequence." ;
	dcterms:publisher <http://www.ebi.ac.uk/> ;
	dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
	dcterms:issued "2013-07-26T10:02:03.000+01:00"^^xsd:dateTime ;
	void:dataDump <ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.example/chembl_16.example_targetcmpt_uniprot_ls.ttl.gz> ;
	void:subjectsTarget :chembl_rdf_targetcmpt_dataset ;
	bdb:subjectsDatatype sio:SIO_010043;
	void:objectsTarget <http://purl.uniprot.org/void#uniprotdataset_2013_08> ;
	bdb:objectsDatatype sio:SIO_010043;
	void:linkPredicate skos:exactMatch ;
	bdb:linksetJustification sio:SIO_010043 ;
	pav:authoredBy <http://orcid.org/0000-0002-8011-0300> ;
	pav:authoredOn "2013-07-26T10:02:03.000+01:00"^^xsd:dateTime ;
	pav:createdBy <http://orcid.org/0000-0002-8011-0300> ;
	pav:createdOn "2013-07-26T10:02:03.000+01:00"^^xsd:dateTime ;
	bdb:assertionMethod eco:ECO_0000218 .
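
The separate links file mentioned above could look roughly like the following. This is a minimal sketch rather than the published ChEMBL file: it assumes the linkset description is identified relative to the 16.example description document IRI used in the ChEMBL example, and the UniProt accession is a hypothetical placeholder, not a real mapping for CHEMBL_TC_583.

```turtle
@prefix chembl_target_cmpt: <http://rdf.ebi.ac.uk/resource/chembl/targetcomponent/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix uniprot: <http://purl.uniprot.org/uniprot/> .
@prefix void: <http://rdfs.org/ns/void#> .

# Backlink from this links file to its linkset description.
# Assumption: the linkset is described in the 16.example VoID document.
<> void:inDataset <http://rdf.ebi.ac.uk/dataset/chembl/16.example/void.ttl#chembl_targetcmpt_uniprot_linkset> .

# One link per triple, using the predicate declared by void:linkPredicate.
# ACCESSION_PLACEHOLDER is hypothetical and stands in for a UniProt accession.
chembl_target_cmpt:CHEMBL_TC_583 skos:exactMatch uniprot:ACCESSION_PLACEHOLDER .
```

Keeping the backlink in the links file lets a consumer that retrieves only the data dump find the linkset description, and from it the justification and provenance of the links.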