The research leading to these results has received support from the
Innovative Medicines Initiative (IMI)
Joint Undertaking under grant agreement n° 115191, resources of which
are composed of financial contribution from the European Union's Seventh
Framework Programme (FP7/2007-2013) and EFPIA companies’ in kind
contribution.
This dataset and linkset description specification is intended for data providers. The principal data format in Open PHACTS is RDF; as such, a basic knowledge of RDF is assumed.
Details for converting a dataset and publishing it as RDF are given in the Open PHACTS RDF guidelines specification [[OPS-RDF]].
The Open PHACTS Discovery Platform [[OPS-ARCH]] relies on data and the interlinks published by a variety of sources. For example, details of chemicals are derived from ChemSpider, ChEMBL, and DrugBank. This specification provides details of the metadata expected to describe the datasets and the links that relate the instances in those datasets.
Open PHACTS has produced a set of guidelines aimed at data providers for publishing their data within the Open PHACTS Discovery Platform [[!OPS-RDF]]. The RDF guide provides details about modelling your data as RDF. This specification builds on the RDF Guidelines by defining the metadata that should be published to describe the dataset and the links to other datasets.
The dataset description defined in this specification declares the properties that should be included in the description of a dataset or its links. The information is exchanged using the Vocabulary of Interlinked Datasets [[VOID]].
The VoID Editor can be used to create dataset descriptions. This is a useful tool for prototyping the first version of a dataset description. Ideally the generation of dataset descriptions should be incorporated as part of the data creation pipeline.
A validator is provided to verify whether a dataset description conforms to the latest stable version of these specifications. This could be accessed automatically by the data creation pipeline to ensure that newly created dataset descriptions remain conformant.
All examples in this document are written in the Turtle RDF syntax [[TURTLE]]. Throughout the document, the following namespaces are used:
@prefix bdb: <http://vocabularies.bridgedb.org/ops#> .
@prefix cito: <http://purl.org/spar/cito/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix eco: <http://purl.obolibrary.org/obo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix pav: <http://purl.org/pav/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
Furthermore, we assume that the empty prefix is bound to the base URL of the current file like this:
@prefix : <#> .
This allows us to quickly mint new identifiers in the local namespace.
In this section, we introduce the vocabularies that will be used for capturing the dataset descriptions and the mappings between the datasets.
The Vocabulary of Interlinked Datasets (VoID) [[!VOID]] is a W3C interest group note that specifies a vocabulary for describing the metadata about a dataset and its relationship with other datasets. The vocabulary builds upon existing metadata vocabularies, e.g. Dublin Core Terms [[DCTERMS]], and captures four categories of metadata:
The VoID specification [[VOID]] defines a dataset to be, "a set of RDF triples that are published, maintained or aggregated by a single provider." The dataset itself may contain logical subsets, which can be captured in the VoID description of the dataset, e.g. ChEMBL can be split into subsets for compounds, targets, etc.
The information captured about a dataset focuses on the general metadata, the access metadata, and the structural metadata.
The VoID specification [[VOID]] defines a link to be, "an RDF triple whose subject and object are described in different datasets." The links capture the mapping of an identifier in one dataset to be related to an identifier in another dataset. VoID is agnostic to the relationship to use: possible predicates are given in the next section.
The VoID specification defines a linkset to be, "a collection of such RDF links between two datasets." The linkset captures details of the links, i.e. the datasets that are linked and the relationship, as well as the metadata associated with the links, e.g. provenance information about who created the mapping and the specific versions of the datasets related. The VoID specification enables a separation between (1) the datasets involved in a linkset, and (2) who publishes the linkset.
Note that a VoID linkset is defined to link two datasets via a single link predicate (void:linkPredicate). As such, there can exist multiple linksets relating the same pair of datasets, as illustrated in the figure below. The figure depicts four distinct linksets: two sourced from ChemSpider, depicted in blue, which use different link predicates; one sourced from ChEMBL, depicted in red; and one sourced from a third party, depicted in green. Each of the linksets uses a different link relationship. Those shown with a double arrow head are symmetric, while those with just a single arrow are directional links.
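As a minimal sketch of the point above, the following Turtle describes two linksets over the same pair of datasets, each carrying exactly one link predicate. The resource names (:datasetA, :datasetB, :a2b_exact, :a2b_close) are hypothetical and do not correspond to any published description.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Hypothetical datasets, declared only to keep the sketch self-contained.
:datasetA a void:Dataset .
:datasetB a void:Dataset .

# First linkset: a single link predicate.
:a2b_exact a void:Linkset ;
    void:subjectsTarget :datasetA ;
    void:objectsTarget :datasetB ;
    void:linkPredicate skos:exactMatch .

# Second linkset over the same pair of datasets, with a different predicate.
:a2b_close a void:Linkset ;
    void:subjectsTarget :datasetA ;
    void:objectsTarget :datasetB ;
    void:linkPredicate skos:closeMatch .
```

Keeping one predicate per linkset means a consumer can select or ignore a whole linkset according to the strength of the relationship it asserts.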
A mapping is expressed as a VoID link, i.e. it is an RDF triple that relates an identifier in one dataset with an identifier in another dataset with some predicate which provides the meaning of the mapping. A justification for the mapping, e.g. two chemical compounds are deemed equivalent as they have the same InChI key, is expressed with the link in the linkset metadata.
The mapping predicate captures the way in which the two identifiers are related. The mapping should respect the semantics of the relationship, e.g. the owl:sameAs relationship must only be used when the two identifiers are completely interchangeable.
Standard, and widely used, generic mapping relationships are given in the table below. (A fuller mapping ontology is given in [[Halpin2010]], but it is expected that the main relationships used will be those given in the table.)
Relationship | Description | Properties |
---|---|---|
rdfs:seeAlso | General link, that indicates that the resource linked to is relevant to the subject. See http://www.w3.org/TR/rdf-schema/#ch_seealso. | |
skos:relatedMatch | This link indicates that the linked resources are in some way associated. See http://www.w3.org/TR/skos-reference/#mapping. | Symmetric |
skos:closeMatch | This link indicates that the linked resources are the same, under some assumptions or applications. See http://www.w3.org/TR/skos-reference/#mapping. | Symmetric |
skos:exactMatch | This link indicates that the linked resources are the same, under the assumptions of most applications. See http://www.w3.org/TR/skos-reference/#mapping. | Transitive Symmetric |
owl:sameAs | This link indicates that the linked resources are the same under all assumptions and can be used interchangeably. Note that if this link is used for classes, then reasoning tasks will fall under OWL Full semantics. See http://www.w3.org/TR/2009/REC-owl2-quick-reference-20091027/#Axioms. | Transitive Symmetric |
owl:equivalentClass | This link indicates that the linked resources (which are both classes in some ontology) are the same. See http://www.w3.org/TR/2009/REC-owl2-quick-reference-20091027/#Axioms. | Transitive Symmetric |

Hierarchical Relationship | Description | Properties |
---|---|---|
skos:broadMatch | This link indicates that the target resource is more general than the subject resource. See http://www.w3.org/TR/skos-reference/#mapping. | Inverse of skos:narrowMatch |
skos:narrowMatch | This link indicates that the target resource is more specific than the subject resource. See http://www.w3.org/TR/skos-reference/#mapping. | Inverse of skos:broadMatch |
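The direction of the hierarchical relationships is easy to confuse, so a small sketch may help. The two concepts below live in hypothetical example.org namespaces and are illustrative only; the two triples say the same thing from opposite directions.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix exa: <http://example.org/datasetA/> .
@prefix exb: <http://example.org/datasetB/> .

# The target (exb:analgesic) is more general than the subject (exa:aspirin).
exa:aspirin skos:broadMatch exb:analgesic .

# Inverse statement: the target (exa:aspirin) is more specific than the subject.
exb:analgesic skos:narrowMatch exa:aspirin .
```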
A key feature of the Open PHACTS Discovery Platform is its ability to allow multiple views of the linked data which is achieved by applying scientific lenses over the data [[OPS-LENSES]]. For example, when performing an early stage exploratory task it is desirable to retrieve as much data as possible and as such the system enables the use of a lens whereby compounds are matched on their structural skeleton. Once the research has progressed further, the need for stricter relationships becomes apparent and the structure lens is switched for one which matches compounds on their full chemical structure, i.e. including their charges and stereo-chemistry.
The ability to classify linksets for use under different scientific lenses relies on the justification given in the linkset, i.e. the notion of operational equivalence, or alternatively the interpretation, that is captured by the links. A set of vocabulary terms for providing justifications is given in Appendix B.2.
The following example declares that the ChemSpider α-Ketoisovaleric acid concept with the CSID 48 shares many properties with the ChEMBL 3-Methyl-2-oxobutanoic acid concept with the ChEMBL-RDF ChEMBL ID CHEMBL146554. The link is sourced from ChemSpider and is justified by the two compounds having the same structure according to their InChI Keys (in this case the compound has the same InChI Key QHKABHOOEWYVLI-UHFFFAOYSA-N in both datasets). Only the triples directly related to declaring the link are given in the example.
@prefix chembl: <http://linkedchemistry.info/chembl/chemblid/> .
:cs2chembl_inchi void:linkPredicate skos:exactMatch .
:cs2chembl_inchi bdb:linksetJustification <http://semanticscience.org/resource/CHEMINF_000059> .
<http://rdf.chemspider.com/48> skos:exactMatch chembl:CHEMBL146554 .
In this section we specify the metadata expected to describe a dataset –
its origins, publication and content. These descriptions should be made
at the level of the entire dataset, not each individual record, i.e. it
is the metadata that would be expected within a catalogue record such as
MIRIAM or Datahub. We define a
dataset as follows:
A dataset is a collection of records that are published, maintained or aggregated by a single provider.
Note that this definition is more general than the one given above for a VoID Dataset. Specifically it includes datasets that are not represented as RDF. As such, in the following we sometimes distinguish between RDF and non-RDF datasets; many of the VoID predicates are typed for void:Dataset, i.e. a set of RDF triples, and cannot be applied to non-RDF datasets.
For datasets included in the Open PHACTS linked data cache [[OPS-ARCH]] we assume that an RDF representation of the dataset has been generated according to the Open PHACTS RDF Guidelines [[OPS-RDF]].
The tooling section contains details of tools to help with the creation of dataset descriptions. However, it is recommended that the generation of the VoID description for a dataset is carried out as part of the creation of the RDF version of the dataset and that these descriptions are passed through the dataset description validator.
The following gives a checklist of the properties for describing a dataset and associated resources. Subsequent sections give guidance on the values to use with these properties and full details of the predicates used can be found in Appendix A.
The title given for a dataset is used in the apps built upon the Open PHACTS Discovery Platform and tabular lists of the datasets that have been loaded into the platform. As such, the title should be unique and short. For example, we would suggest using strings like "ChEMBL" or "UniProt". Each distinguished subset should have its own unique title so that it is clear in a user interface which part of a dataset is being used, e.g. "ChEMBL Molecule" for the subset of ChEMBL that contains data about molecules.
The description for a dataset should allow someone knowledgeable of the domain, but not familiar with the dataset, to understand the content of the dataset and decide upon its merit. Ideally it should be no more than a paragraph. Typically the paragraph about the dataset on the public web page is suitable.
The publisher for a dataset is the organisation that is responsible for the creation and maintenance of the dataset. This predicate should point to the web page for the organisation. For example, the publisher of the ChemSpider dataset is the Royal Society of Chemistry and the value for this property is http://www.rsc.org/. Ideally the publisher's page should be marked up with RDFa so that it is machine processable.
The landing page for a dataset should point to the public facing web page that gives details of the dataset. For a dataset such as the RDF conversion of DrugBank we suggest that this predicate points to the original DrugBank web page, viz. http://www.drugbank.ca/, rather than the base resource of the RDF conversion. This allows scientific users of the data to get straight to the information about the dataset. Other pages associated with the dataset, e.g. the Identifiers.org page, can be linked to using foaf:page. Note that it is important not to use foaf:homepage, because that predicate is an inverse functional property.
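The general metadata discussed so far could be sketched as follows. The predicate choices (dcterms:title, dcterms:description, dcterms:publisher, dcat:landingPage) are assumptions based on the namespaces declared in this document; the property checklist and Appendix A remain authoritative. The description text and the example.org URLs are placeholders.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

:chemSpiderDataset a void:Dataset ;
    # Short, unique title used in apps and tabular lists.
    dcterms:title "ChemSpider" ;
    # Placeholder; a real description would be about a paragraph long.
    dcterms:description "Placeholder description of the dataset content." ;
    # Web page of the organisation that creates and maintains the dataset.
    dcterms:publisher <http://www.rsc.org/> ;
    # Hypothetical public-facing page for the dataset (assumed URL).
    dcat:landingPage <http://example.org/chemspider/> ;
    # Hypothetical secondary page; foaf:page rather than foaf:homepage.
    foaf:page <http://example.org/registry/chemspider> .
```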
For most datasets a license has already been chosen and this needs to be stated as a URI in the description. For licenses that require a citation to be given, the citation information can be captured using the cito:citesAsAuthority property. For new datasets a suitable license should be chosen. Below is a list of licenses suggested in VoID [[VOID]]. Note that the final two are not specifically designed for data.
http://www.opendatacommons.org/licenses/pddl/
http://www.opendatacommons.org/licenses/by/
http://www.opendatacommons.org/licenses/odbl/
http://creativecommons.org/publicdomain/zero/1.0/
http://creativecommons.org/licenses/by-sa/3.0/
http://www.gnu.org/copyleft/fdl.html
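A license statement might look like the sketch below. It assumes dcterms:license as the license predicate and uses the CiTO term as spelled in the CiTO ontology (cito:citesAsAuthority); the dataset resource and the citation target are hypothetical.

```turtle
@prefix cito: <http://purl.org/spar/cito/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

:exampleDataset a void:Dataset ;
    # License given as a URI, here one of the licenses listed above.
    dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
    # Hypothetical publication that the license asks users to cite.
    cito:citesAsAuthority <http://example.org/publications/example-dataset-paper> .
```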
The date issued property is used to capture the date a dataset is issued for public release, which may also be the date it was created. This can be used to distinguish the version of a dataset where there are no version numbers, e.g. datasets that are continually updated and do not have fixed version numbers, particularly those that contain on-going user generated content. The date issued should be specified as an xsd:dateTime literal. Where parts of the value are unknown, they should be set to the start of the time period; e.g. for a dataset issued on 23 July 2013 where the time does not need to be captured, we would use the literal value "2013-07-23T00:00:00"^^xsd:dateTime.
For datasets which have a version number, this should be provided as a string literal. For example, ChEMBL would use the literal value "16"^^xsd:string for their version 16 release.
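The two literals quoted above could be attached to a dataset as in this sketch. The predicates dcterms:issued and pav:version are assumptions (dcterms:issued is the predicate used for the issued date elsewhere in this document), and :exampleDataset is a hypothetical resource.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix pav: <http://purl.org/pav/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <#> .

:exampleDataset a void:Dataset ;
    # Time of day unknown, so it is set to the start of the day.
    dcterms:issued "2013-07-23T00:00:00"^^xsd:dateTime ;
    # Version number as a string literal, not a number.
    pav:version "16"^^xsd:string .
```

Typing the version as a string avoids numeric comparison surprises for versions such as "16.2" or "2013-04".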
It is useful to point to an example resource in the dataset to allow a user to quickly look at a record in the data. For each subset, a suitable example should be supplied. Other information such as the URI namespace and the identifier pattern can also be supplied.
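For instance, a subset description might carry the following statements. The properties are standard VoID terms; the example resource and namespace come from the ChEMBL example earlier in this document, while the subset name and the regular expression are illustrative assumptions.

```turtle
@prefix chembl: <http://linkedchemistry.info/chembl/chemblid/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

:chembl_molecule a void:Dataset ;
    # A record that a user can look at to get a feel for the data.
    void:exampleResource chembl:CHEMBL146554 ;
    # Namespace shared by all entity URIs in this subset.
    void:uriSpace "http://linkedchemistry.info/chembl/chemblid/" ;
    # Assumed identifier pattern; the backslashes are escaped for Turtle.
    void:uriRegexPattern "^http://linkedchemistry\\.info/chembl/chemblid/CHEMBL\\d+$" .
```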
Other metadata about the dataset, e.g. the author and creator of the content, can be captured using PAV [[PAV]]. The values for these resources should not be locally defined values, but link to external representations for individuals that can be reused. We recommend the use of ORCID identifiers.
A dataset that can be separated into multiple parts, e.g. ChEMBL can be split into ChEMBL.molecule, ChEMBL.Target, etc., should have this structure captured in the dataset description using the void:subset property. Each distinct subset should be described as a dataset in its own right, although common information can be inherited from the parent. However, each should have a meaningful title and description for display purposes.
In the VoID note [[VOID]], the void:subset property is used for both 'subset of' and 'has subset'. The declared semantics for the property is has subset. Within Open PHACTS, the property is always interpreted with the has subset semantics.
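Under the has subset reading, the parent dataset is the subject of the triple. A minimal sketch, with hypothetical local resource names and dcterms:title assumed as the title predicate:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Parent dataset: it "has subset" each of its parts.
:chembl a void:Dataset ;
    dcterms:title "ChEMBL" ;
    void:subset :chembl_molecule , :chembl_target .

# Each subset is a dataset in its own right, with its own title.
:chembl_molecule a void:Dataset ;
    dcterms:title "ChEMBL Molecule" .

:chembl_target a void:Dataset ;
    dcterms:title "ChEMBL Target" .
```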
The vocabularies used in the RDF representation of a dataset can be declared using the void:vocabulary property.
A dataset description should link to the description of the previous version of the dataset. This provides a backward pointing chain to all the previous versions of the dataset.
The anticipated update frequency of the dataset is captured to allow for the detection of dataset updates.
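One way to state the update frequency is sketched below. It assumes dcterms:accrualPeriodicity as the predicate, with a value from the frequency vocabulary bound to the freq prefix in this document; the dataset resource and the chosen frequency are hypothetical.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

:exampleDataset a void:Dataset ;
    # A new release is expected roughly every three months.
    dcterms:accrualPeriodicity freq:quarterly .
```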
The provenance of an RDF conversion of an existing dataset is captured using the PAV imported properties [[PAV]]. The imported from property should point to the dataset description for the original dataset. Where this does not exist, it should be supplied conforming to these guidelines. Details of the script used for the conversion, including the version number, are captured with the pav:createdWith property.
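The conversion provenance described above might be recorded as in this sketch. pav:importedFrom, pav:importedOn and pav:createdWith are PAV terms; the resource names, the date and the script URL are hypothetical.

```turtle
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix pav: <http://purl.org/pav/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <#> .

# Description of the original, non-RDF source dataset.
:drugbankOriginal a dctypes:Dataset .

# The RDF conversion, pointing back at its source and at the tool used.
:drugbankRdf a void:Dataset ;
    pav:importedFrom :drugbankOriginal ;
    pav:importedOn "2013-01-01T00:00:00"^^xsd:dateTime ;
    # The version number is part of the script's URL in this sketch.
    pav:createdWith <http://example.org/tools/drugbank2rdf/1.0> .
```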
Sources of data from which the dataset derives should be captured by pointing to dataset descriptions for the sources. Where these are not available, or it is unclear, the associated Identifiers.org page may be linked to. However, the version or file used should be captured.
For an RDF dataset the file(s) containing the data should be linked to; standard file extensions and content-negotiation should be used to allow for the automated parsing of the file. This is to enable the Open PHACTS Discovery Platform to load the data and make it available to users of the platform. The SPARQL endpoint can also be supplied when this exists.
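For an RDF dataset, the access metadata could be given with the standard VoID properties, as sketched here. The dataset resource and both URLs are hypothetical.

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

:exampleDataset a void:Dataset ;
    # One triple per dump file; the extension indicates the serialisation.
    void:dataDump <http://example.org/data/20130408/compounds.ttl> ,
                  <http://example.org/data/20130408/synonyms.ttl> ;
    # Optional: only when a public SPARQL endpoint exists.
    void:sparqlEndpoint <http://example.org/sparql> .
```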
For a non-RDF dataset it is unknown what format the data file will be provided in, so a description of the file should be provided. Standard media types from IANA should be used. Some common ones for the datasets in Open PHACTS are included below. (A comprehensive list is available from Sitepoint.)
Example VoID dataset descriptions can be found in Appendix C. The first example is for part of the ChEMBL dataset (see Appendix C.1). The second example is for the RDF representation of the DrugBank database, given in Appendix C.2. This demonstrates the level of information required to track the provenance from a source dataset through to the RDF representation. Both datasets contain subset definitions.
A linkset is itself a dataset, and as such should provide metadata about its content and how it was created. The metadata associated with a link is essential for enabling its reuse by others. It enables a consumer of the link to understand which datasets are linked (including version information), who claimed the link, under what circumstances, the level of curation of the links, and which (if any) tools were used to generate the link (e.g. [[SILK]]).
A linkset should point to the dataset descriptions of the datasets that it uses. These descriptions should be provided by the dataset provider as part of the dataset publishing process [[OPS-RDF]]. However, there are occasions when one or both of the linked datasets do not provide a VoID dataset description. In this case enough information should be given to identify the dataset linked to, and if it is known, the version. Ideally this information should be provided as a dataset description, but linking to the Identifiers.org page for the dataset may be all that is possible.
The properties for a dataset converted into RDF are:
The title given to a linkset is used in apps when displaying details of the linkset. A title should be short, ideally stating which datasets are linked, e.g. ConceptWiki-ChemSpider Linkset.
The description for a linkset should give a textual summary of the linkset. That is, it should state which datasets have been linked and the reason for the equivalence. It should be expressed in terminology aimed at an app user. It should be expected to be a few sentences.
See Section 4.2 for guidance about publisher, license, issued date and data dump. Note that for linksets there is unlikely to be a web page directly dedicated to the linkset; as such, that part of the dataset metadata is not stated as a requirement. The license under which a linkset is published may be different from that of the datasets that it links, even if it is a subset of one of the datasets.
Formally, the VoID specification [[VOID]] states that a Linkset links two RDF resources. However, there is a discussion to enable Linksets to link to non-RDF resources. This usage is permitted within Open PHACTS.
The link predicate is used to specify the mapping relationship used to relate the concepts in the linkset. This is used to state the degree of equivalence there is between the resources in the subject and object dataset. See Section 3.2 for the different mapping relationships and their meaning.
Within Open PHACTS we encourage the use of skos:exactMatch or skos:closeMatch rather than owl:sameAs as the linking predicate. This is because owl:sameAs has a very precise meaning with the consequence that the two resources can be merged together. In general skos:exactMatch provides the appropriate level of equivalence.
The link justification is used to capture the notion of equivalence captured between the resources in the two datasets, e.g. they are conceptually the same gene (http://semanticscience.org/resource/SIO_010035) or the gene produces the protein and thus is used as a proxy for the protein (http://semanticscience.org/resource/SIO_000985). Appendix B.2 gives the set of justifications used in the Open PHACTS Discovery Platform.
The subjects target and objects target predicates should point to dataset descriptions contained in a VoID file. This may be a resource in the same file or in some other file. However, they should point to a description of a specific version of a dataset. For example, for a link to version 16 of the ChEMBL molecules dataset the objects target value would be http://rdf.ebi.ac.uk/dataset/chembl/16.2/void.ttl#chembl_rdf_molecule_dataset.
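A sketch of a linkset that points at versioned dataset descriptions is given below. Both target URIs are taken from examples in this document; the linkset name and the choice of ChemSpider as the subject dataset are illustrative assumptions.

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

:cs2chembl_linkset a void:Linkset ;
    # Versioned description of the dataset whose identifiers are the link subjects.
    void:subjectsTarget <http://rdf.chemspider.com/20130408/void.ttl#chemSpiderDataset> ;
    # Versioned description of the dataset whose identifiers are the link objects.
    void:objectsTarget <http://rdf.ebi.ac.uk/dataset/chembl/16.2/void.ttl#chembl_rdf_molecule_dataset> .
```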
Descriptions of the datasets loaded in the Open PHACTS Discovery
Platform are available through the http://beta.openphacts.org/sources
method.
The subjects datatype and objects datatype predicates should specify the type of the resources that are linked. This may be different from the datatype of the whole dataset. For example, ConceptWiki has, among other things, data about genes, proteins, and chemicals. However, a specific linkset such as the one to ChemSpider only contains identifiers that are chemicals. This linkset would have the subjects datatype set to chemical entity (http://semanticscience.org/resource/SIO_010004). The set of data types used in the Open PHACTS Discovery Platform is available in Appendix B.1.
The subjects species and objects species predicates are used to state when a linkset is limited to biological targets from a particular species. For example, for a linkset between mouse proteins and human proteins the subjects species would be mouse (http://purl.obolibrary.org/obo/NCBITaxon_10090) while the objects species would be human (http://purl.obolibrary.org/obo/NCBITaxon_9606). Appendix B.3 gives the set of species used in the Open PHACTS Discovery Platform.
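The datatype and species examples above could be written as follows. Note the assumption: the predicate names bdb:subjectsDatatype, bdb:objectsDatatype, bdb:subjectsSpecies and bdb:objectsSpecies are inferred from the property names used in the prose and should be checked against Appendix A. The linkset resources are hypothetical; the value URIs are those quoted in the text.

```turtle
@prefix bdb: <http://vocabularies.bridgedb.org/ops#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# ConceptWiki-ChemSpider linkset: both ends are chemical entities.
:conceptwiki2chemspider a void:Linkset ;
    bdb:subjectsDatatype <http://semanticscience.org/resource/SIO_010004> ;
    bdb:objectsDatatype <http://semanticscience.org/resource/SIO_010004> .

# Linkset from mouse proteins to human proteins.
:mouse2human_proteins a void:Linkset ;
    bdb:subjectsSpecies <http://purl.obolibrary.org/obo/NCBITaxon_10090> ;
    bdb:objectsSpecies <http://purl.obolibrary.org/obo/NCBITaxon_9606> .
```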
The subset predicate is used to identify those linksets that have been extracted from a larger dataset, e.g. linksets extracted from ChEMBL are a subset of the ChEMBL dataset. Note that the dataset resource is the subject of the triple and the linkset resource is the object as the predicate expresses a has subset relationship.
Author and creator information can be provided to allow the tracing of who is making the claims of equivalence represented by the links in the linkset. The values for these resources should not be locally defined values, but link to external representations for individuals that can be reused. We recommend the use of ORCID identifiers.
An example linkset description can be found in Appendix D. The example shows how a linkset can be included as part of a complete dataset description (see Appendix C.1) and also link to an external dataset description.
The two preceding sections have prescribed the metadata required to describe datasets and the linksets that inter-relate them. This section outlines the expected deployment and exchange mechanisms and should be read in conjunction with Section 6 of the VoID specification for more details.
The primary purpose of the VoID document metadata is to provide details of who created the description of the dataset and when. It also points to the main (parent) dataset resource described in the document.
VoID documents describing datasets and linksets MUST contain a metadata block describing the VoID document using the following properties:
Of course, other properties may also be declared. For more details, see Section 6.2 of the VoID specification [[VOID]].
The date issued property is used to capture the date that the dataset description is ready for use, i.e. its publication date. This is captured as an xsd:dateTime literal, where unknown values should be set to the start of the valid time period. The dataset description issued date may be different from the date of issue of the dataset and also from the creation and last modified dates; the latter two may be captured using the predicates from the PAV vocabulary.
The primary topic of the dataset description document is the resource that describes the dataset. This should be a locally defined resource.
The person responsible for publishing the dataset as RDF is most likely also the creator of the VoID description for it. This is captured with the created by property. We recommend the use of an ORCID identifier, but any valid URI will suffice. The resulting resource should be available as RDF.
The tool used to create the VoID document can be captured using pav:createdWith. The VoID Editor would be one example, but ideally this should be the script that implements the data publishing pipeline, of which the generation of the VoID description is a part.
When a title is provided, it should be short as it will be used in apps to display details of the dataset description. When a description is provided, it should provide a summary of the process of the creation of the dataset and be understandable to those outside of the immediate development team.
An example is given below based on the ChemSpider deployment. (Note the use of an empty-string relative URI (<>) as a syntactic shortcut for the URI of the document that contains the statements; the real deployment has a date versioned URI such as ftp://ftp.rsc-us.org/OPS/20130408/void_2013-04-08.ttl.)
<> a void:DatasetDescription ;
    dcterms:issued "2013-08-22T10:33:00Z"^^xsd:dateTime ;
    pav:createdBy <http://orcid.org/0000-0002-5711-4872> ;
    pav:createdOn "2012-05-02T13:50:34Z"^^xsd:dateTime ;
    pav:lastUpdateOn "2012-08-10T13:52:12Z"^^xsd:dateTime ;
    foaf:primaryTopic :chemSpiderDataset .
Several mechanisms for deploying VoID descriptions are given in Section 6 of the VoID Note [[VOID]]. In this section we provide recommendations for good practice within Open PHACTS.
We recommend that the dataset description is made available as a separate file from the data that it describes. This enables the use of the description, without needing to download the entire dataset, in catalogues/registries and by tools such as the Open PHACTS Identity Mapping Service (IMS) [[IMS]].
The current dataset description for a dataset SHOULD be available from a well known location relative to the dataset. Best practice for this well known location is a file called void.ttl in the root directory for the dataset. However, we recommend that an HTTP 302 redirect is used to dereference to a versioned copy of the VoID file.
For example, for ChemSpider we would have a URI such as http://rdf.chemspider.com/void.ttl#chemSpiderDataset which redirects to a versioned URI such as http://rdf.chemspider.com/20130408/void.ttl#chemSpiderDataset.
Each RDF document containing the data MUST contain a backlink using void:inDataset to the dataset descriptor. For example, for the ChEMBL-RDF molecule m1, there would be the triple:
<http://linkedchemistry.info/chembl/chemblid/molecule/m1> void:inDataset <http://linkedchemistry.info/chembl/chemblid/void.ttl#chembl-rdf_compounds> .
The above deployment example points to the unversioned dataset description, which SHOULD redirect to the current latest version. This is the approach for serving linked data pages and is valid providing that the data item remains in the latest version of the dataset.
When creating a datadump, the in dataset link should point to the corresponding versioned dataset description.
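For a data dump, the backlink could therefore look like the sketch below. Both URIs are taken from the ChemSpider examples in this document; pairing this particular record with this particular release is an assumption made for illustration.

```turtle
@prefix void: <http://rdfs.org/ns/void#> .

# In a dump file the backlink names the versioned description,
# so the record stays tied to the release it was published in.
<http://rdf.chemspider.com/48>
    void:inDataset <http://rdf.chemspider.com/20130408/void.ttl#chemSpiderDataset> .
```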
For the purposes of Open PHACTS, it is anticipated that linksets will be materialised as separate documents from the datasets. This is to allow their loading into the identity mapping service [[IMS]].
It is recommended that the linkset is described in the main dataset description document. Where there is no dataset description document it is recommended that a separate dataset description document is created for the linkset. The file containing the links MUST provide a link back to the linkset description using the void:inDataset predicate. The example below shows how a set of links can refer back to the linkset description given in the ChEMBL-RDF VoID description file.
<> void:inDataset <http://linkedchemistry.info/chembl/chemblid/void.ttl#chembl-rdf_targets-uniprot-linkset> .
<http://linkedchemistry.info/chembl/target/t1> skos:exactMatch <http://purl.uniprot.org/uniprot/O43451> .
...
Tooling being developed within Open PHACTS MUST support the predicates stated in this document. However, it SHOULD also be able to read VoID files from external sources that do not comply completely with this specification, but do comply with the VoID standard [[VOID]]. An example would be the use of the void:target predicate instead of the void:subjectsTarget and void:objectsTarget predicates. Such usage SHOULD NOT be the norm and SHOULD result in warnings being generated.
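To make the difference concrete, the sketch below shows the same hypothetical linkset written first with the plain VoID predicate and then in the form this specification expects. All resource names are illustrative.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <#> .

# Plain VoID form: readable by tooling, but it should trigger a warning
# because the direction of the links is not stated.
:externalLinkset a void:Linkset ;
    void:target :datasetA , :datasetB ;
    void:linkPredicate skos:closeMatch .

# Form expected by this specification: the direction is explicit.
:conformantLinkset a void:Linkset ;
    void:subjectsTarget :datasetA ;
    void:objectsTarget :datasetB ;
    void:linkPredicate skos:closeMatch .
```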
Nanopublications [[NANOPUB]] provide a means for data providers to obtain credit for their data contribution, in particular data that can be described in the form of a minimal set of assertions: a minimal piece of information that represents value for which credit is due. Such information is closely related to a link relating instances in two datasets. In some cases it may be desirable to publish a link as a nanopublication. This should not violate a link being published in a linkset according to this specification.
BioDBcore defines the following properties as the set of metadata that should be published in relation to a dataset. The aim of BioDBcore is different from that of VoID, but many of the elements defined are covered in the Open PHACTS dataset description.
An example BioDBcore record for ChEMBL.
The metadata specified in Section 4 covers the functional data required from BioDBcore. The aspects not covered are those relating to discovering who is responsible for a dataset and the publications about the dataset. It is expected that such information can be discovered from the dataset's homepage and is not within the use case scope for the description of the dataset. Such information may be added as additional statements in the VoID description.
A wide range of provenance vocabularies has been proposed. This section gives brief pointers to related vocabularies that could be used in a dataset or linkset description. For more information about the state of provenance vocabularies, the interested reader is referred to [[PROV-XG]].
The Provenance Ontology (PROV-O) [[PROV-O]] is a W3C candidate recommendation for representing provenance information about documents, datasets, workflow runs, etc. It is broadly based on the Open Provenance Model [[OPM]]. It is capable of expressing complex provenance relationships.
The Provenance, Authoring and Versioning Ontology (PAV) [[PAV]] provides a comprehensive set of relationships for capturing basic provenance information.
The Provenance Vocabulary [[PRV]] is another lightweight vocabulary of provenance predicates with an emphasis on data creation and data access on the Web.
Defined as an extension to VoID, the vocabulary for data and dataset provenance (voidp) [[VOIDP]] is a vocabulary for defining provenance relationships of data and datasets. The vocabulary focuses on four specific pieces of provenance information:
"for a piece of data, x :
- when was x derived,
- how was x derived,
- what data had been used to derive x,
- who carried out the transformations that resulted in the current value of x." [[VOIDP]]
These are a subset of the information that needs to be captured for the Open PHACTS linksets.
RDF Property | bdb:assertionMethod |
---|---|
Definition: | The method by which the links in the Linkset have been asserted. |
Domain: | void:Linkset |
Range: | rdfs:Resource |
Usage note: | Appendix B.4 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform for declaring the curation level of a linkset. |
Example: | A set of links that have been computer generated but verified
by a human would have their assertion method declared to be
manual, i.e. <> bdb:assertionMethod eco:ECO_0000218 . as opposed to
automatic, i.e. <> bdb:assertionMethod eco:ECO_0000203 . |
RDF Property | bdb:linksetJustification |
---|---|
Definition: | The reason why the Linkset claims these concepts are equivalent. |
Domain: | void:Linkset |
Range: | rdfs:Resource |
Usage note: | Appendix B.2 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform for declaring equivalence conditions. These include saying that the datasets are conceptually about the same type of concept, that they match in their chemical structure, or that they represent the protein generated by some gene. |
Example: | A linkset between two chemicals that share the same chemical
structure would use the value http://semanticscience.org/resource/CHEMINF_000059. A linkset between two proteins would say that they were conceptually the same protein using the value http://semanticscience.org/resource/SIO_010043. A linkset between a protein and a gene would say that they are conceptually the same by stating that it is a protein coding gene with http://semanticscience.org/resource/SIO_000985. |
RDF Property | bdb:objectsDatatype |
---|---|
Definition: | The type of concept the objects of the triples in the Linkset represent. |
Domain: | void:Linkset |
Range: | rdfs:Resource |
Usage note: | Appendix B.1 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform. |
Example: | A linkset between ChemSpider and ConceptWiki would declare its objects datatype to be Chemical Entities. |
RDF Property | bdb:objectsSpecies |
---|---|
Definition: | The species of the concept the objects of the triples in the Linkset represent. |
Domain: | void:Linkset |
Range: | rdfs:Resource |
Usage note: | Appendix B.3 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform. |
Example: | A linkset between the mouse proteins in UniProt and the mouse proteins in the Mouse Genome Database would have its objects species set to http://purl.obolibrary.org/obo/NCBITaxon_10090. |
RDF Property | bdb:subjectsDatatype |
---|---|
Definition: | The type of concept the subjects of the triples in the Linkset represent. |
Domain: | void:Linkset |
Range: | rdfs:Resource |
Usage note: | Appendix B.1 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform. |
Example: | A linkset between ConceptWiki and ChemSpider would declare its subjects datatype to be Chemical Entities. |
RDF Property | bdb:subjectsSpecies |
---|---|
Definition: | The species of the concept the subjects of the triples in the Linkset represent. |
Domain: | void:Linkset |
Range: | rdfs:Resource |
Usage note: | Appendix B.3 gives the set of values that are expected to be used within the Open PHACTS Discovery Platform. |
Example: | A linkset between the mouse proteins in UniProt and the mouse proteins in the Mouse Genome Database would have its subjects species set to http://purl.obolibrary.org/obo/NCBITaxon_10090. |
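The bdb: properties defined above can be combined in a single linkset description. The following is a minimal sketch rather than a normative example: the resource names under http://example.org/ are hypothetical, and the choice of justification for each linkset is an assumption made for illustration. The term URIs are those given in the examples and appendices of this specification.

```turtle
@prefix :     <http://example.org/void.ttl#> .
@prefix bdb:  <http://vocabularies.bridgedb.org/ops#> .
@prefix eco:  <http://purl.obolibrary.org/obo/> .
@prefix sio:  <http://semanticscience.org/resource/> .
@prefix void: <http://rdfs.org/ns/void#> .

# Hypothetical linkset between two chemical datasets, matched on InChI Key.
:conceptwiki_chemspider_linkset a void:Linkset ;
    bdb:subjectsDatatype     sio:SIO_010004 ;      # Chemical Entity
    bdb:objectsDatatype      sio:SIO_010004 ;      # Chemical Entity
    bdb:linksetJustification sio:CHEMINF_000059 ;  # InChI Key
    bdb:assertionMethod      eco:ECO_0000218 .     # manual assertion

# Hypothetical linkset between mouse proteins in two datasets.
:uniprot_mgd_mouse_linkset a void:Linkset ;
    bdb:subjectsDatatype     sio:SIO_010043 ;      # Protein
    bdb:objectsDatatype      sio:SIO_010043 ;      # Protein
    bdb:subjectsSpecies      <http://purl.obolibrary.org/obo/NCBITaxon_10090> ;
    bdb:objectsSpecies       <http://purl.obolibrary.org/obo/NCBITaxon_10090> ;
    bdb:linksetJustification sio:SIO_010043 .      # conceptually the same protein
```

The species properties are only shown on the second linkset because they are not meaningful for chemical entities.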
RDF Property |
cito:citeAsAuthority |
---|---|
Definition: | The citing entity cites the cited entity as one that provides an authoritative description or definition of the subject under discussion. |
Usage note: | Used to specify the recommended citation for a dataset; particularly when this is a requirement of the dataset license. |
Example: | The UniProt license requires credit is given by citing a specific publication. |
RDF Class |
dcat:Dataset |
---|---|
Definition: | A collection of data, published or curated by a single source, and available for access or download in one or more formats. |
Sub class of: | dctypes:Dataset |
RDF Class |
dcat:Distribution |
---|---|
Definition: | Represents a specific available form of a dataset. Each dataset might be available in different forms, these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed. |
RDF Property |
dcat:byteSize |
---|---|
Definition: | The size of a distribution in bytes. |
Range: | rdfs:Literal typed as xsd:decimal. |
Usage note: | The size in bytes can be approximated when the precise size is not known. |
Example: | The compressed ChEMBL 16 molecule data is available from ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.0/chembl_16_molecule.ttl.gz
and is 483MB in size. This would be captured with the byte size
predicate for this resource as dcat:byteSize
"500000000"^^xsd:integer . |
RDF Property |
dcat:distribution |
---|---|
Definition: | Connects a dataset to its available distributions. |
Domain: | dcat:Dataset |
Range: | dcat:Distribution |
Usage note: | Used to specify a link between a dataset and a description of a distribution of the dataset. |
Example: | The ChEMBL dataset is available in multiple distribution formats. Each one would be described giving properties such as the format and download URL. |
RDF Property |
dcat:downloadURL |
---|---|
Definition: | This is a direct link to a downloadable file in a given format. For example, CSV file or RDF file. The format is described by the distribution's dcat:mediaType. |
Range: | rdfs:Resource |
Usage note: | The value is a URL. The URL may require access credentials, e.g. username and password. |
RDF Property |
dcat:landingPage |
---|---|
Definition: | A Web page that can be navigated to in a Web browser to gain access to the dataset, its distributions and/or additional information. |
Sub property of: | foaf:page |
Domain: | dcat:Dataset |
Range: | foaf:Document |
Usage note: | Used to specify a link to a human readable web page documenting the underlying dataset. |
Example: | For the RDF conversion of the DrugBank dataset this property should point to http://www.drugbank.ca/ since this resource provides information about the contents of the dataset. |
See also: | foaf:homepage, foaf:page |
RDF Property |
dcat:mediaType |
---|---|
Definition: | The media type of the distribution as defined by IANA. |
Range: | dcterms:MediaTypeOrExtent |
Usage note: | This property SHOULD be used when the media type of the distribution is defined in IANA, otherwise dcterms:format MAY be used with different values. |
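As an illustration of how the dcat: distribution properties fit together, the sketch below describes one downloadable form of a dataset. The resource names under http://example.org/ are hypothetical. Writing the media type as a plain string is an assumption made for readability; the stated range of dcat:mediaType is dcterms:MediaTypeOrExtent, so a provider may prefer a resource here.

```turtle
@prefix :     <http://example.org/void.ttl#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# The dataset points at each of its available distributions.
:chembl16_molecule a dcat:Dataset ;
    dcat:distribution :chembl16_molecule_turtle .

# One distribution: a compressed Turtle file.
:chembl16_molecule_turtle a dcat:Distribution ;
    dcat:downloadURL <ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.0/chembl_16_molecule.ttl.gz> ;
    dcat:mediaType   "text/turtle" ;             # assumption: plain string form
    dcat:byteSize    "500000000"^^xsd:decimal .  # approximate size (483MB)
```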
RDF Property | dcat:theme
|
---|---|
Definition: | The main category of the dataset. A dataset can have multiple themes. |
Sub property of: | dcterms:subject |
Range: | skos:Concept |
Usage note: | Used to provide details of the types of concepts covered by the dataset. These should be drawn from existing well used domain vocabularies. The dataset provider may want to consider the semantic types selected from the Semantic Science Integrated Ontology as given in Appendix B.1. |
RDF Property |
dcterms:accrualPeriodicity |
---|---|
Definition: | The frequency with which items are added to a collection. |
Range: | dcterms:Frequency |
Usage note: | Indicates the expected update frequency of the dataset. |
RDF Property |
dcterms:description |
---|---|
Definition: | An account of the resource. |
Range: | rdfs:Literal |
Usage note: | Provides a human readable description of the dataset, for
example for use in GUIs. For more details see Section 2.2 of the VoID Specification. |
RDF Property | dcterms:issued
|
---|---|
Definition: | Date of formal issuance (e.g., publication) of the resource. |
Range: | rdfs:Literal typed as xsd:date . |
Usage note: |
The date is encoded as a literal in "yyyy-mm-dd" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified. |
See also: | pav:createdOn |
RDF Property |
dcterms:license |
---|---|
Definition: | A legal document giving official permission to do something with the resource. |
Range: | rdfs:Resource |
Usage note: |
Declares the license under which the dataset is published. Where possible, we recommend publishing data under an open license to enable reuse of the data with appropriate acknowledgement. For Open PHACTS, the recommended license is CC-BY-SA. A list of alternative licenses and further details are available in Section 2.4 of the VoID Specification. |
RDF Property |
dcterms:publisher |
---|---|
Definition: | An entity responsible for making the resource available. |
Range: | rdfs:Resource |
Example: | For ChEMBL, the publisher would be the EBI: http://www.ebi.ac.uk/. This is different from the value for the homepage in this case. |
See also: | dcat:landingPage, foaf:homepage |
RDF Property |
dcterms:subject |
---|---|
Definition: | The topic of the dataset. |
Range: | rdfs:Resource |
Usage note: |
Declares the topics covered by the dataset. Multiple topics can be declared. If the data is split into subsets, then the topics should be associated with the subsets. BioPortal [[BioPortalWeb]] [[BioPortal]] can be used to search for suitable vocabulary terms for topics. DBPedia URIs may also be used. A list of common terms relevant for Open PHACTS is given in Appendix A.2. For more details see Section 2.5 of the VoID Specification. |
RDF Property | dcterms:title
|
---|---|
Definition: | A name given to the resource. |
Range: | rdfs:Literal |
Usage note: | Used to declare the short name of the dataset, for example
for use in GUIs. For more details see Section 2.2 of the VoID Specification. |
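The Dublin Core properties above might be used together as in the following sketch. The dataset, its title, description and publisher are all hypothetical; the license URI is the Creative Commons CC-BY-SA 3.0 license, and the issued date assumes that only the month of publication (May 2013) is known, so the day is set to 01.

```turtle
@prefix :        <http://example.org/void.ttl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix sio:     <http://semanticscience.org/resource/> .
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

:exampleDataset a void:Dataset ;
    dcterms:title       "Example Chemistry Dataset"@en ;
    dcterms:description "A hypothetical dataset of chemical structures, used here only for illustration."@en ;
    dcterms:license     <http://creativecommons.org/licenses/by-sa/3.0/> ;
    dcterms:publisher   <http://example.org/> ;
    dcterms:issued      "2013-05-01"^^xsd:date ;  # day unknown, so 01 is used
    dcterms:subject     sio:SIO_010004 .          # Chemical Entity
```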
RDF Class |
dctypes:Dataset |
---|---|
Definition: | Data encoded in a defined structure. |
Usage note: | Used to declare the type of a dataset published in a format
other than RDF. For a dataset published in RDF use void:Dataset .
|
RDF Property |
foaf:homepage |
---|---|
Definition: | A homepage for the dataset. |
Range: | foaf:Document |
Usage note: | Specifies the homepage for the resource. For more details see Section 2.1 of the VoID Specification. |
Usage note: | foaf:homepage is an inverse functional property (IFP) which means that it should be unique and precisely identify the dataset. This allows smushing various descriptions of the dataset when different URIs are used. |
Example: | The foaf:homepage for the ChemSpider RDF dataset is http://rdf.chemspider.com/. |
See also: | foaf:page, dcat:landingPage |
RDF Property |
foaf:page |
---|---|
Definition: | A page about the dataset. |
Range: | foaf:Document |
Usage note: | Specifies a webpage or document for the resource. Points to secondary sources of information, e.g. catalogue records. |
Example: | For the ChemSpider RDF dataset, the foaf:page property could point to the associated MIRIAM record http://www.ebi.ac.uk/miriam/main/collections/MIR:00000138. For more details see Section 2.1 of the VoID Specification. |
See also: | foaf:homepage, dcat:landingPage |
RDF Property |
foaf:primaryTopic |
---|---|
Definition: | The primary topic of some page or document. |
Domain: | foaf:Document |
Usage note: |
Relates a document, e.g. a DatasetDescription, to the main thing the document is about, e.g. Dataset. The value of this property is a URI that appears in the dataset description. foaf:primaryTopic is a functional property which means that it can be used at most once. |
See also: | foaf:topic |
RDF Property |
foaf:topic |
---|---|
Definition: | A topic of some page or document. |
Domain: | foaf:Document |
Usage note: |
Relates a document, e.g. a DatasetDescription, to the things that the document is about, e.g. Datasets. |
See also: | foaf:primaryTopic |
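A minimal sketch of how foaf:primaryTopic and foaf:topic relate a description document to the datasets it describes is given below. All resource names under http://example.org/ are hypothetical.

```turtle
@prefix :     <http://example.org/void.ttl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix void: <http://rdfs.org/ns/void#> .

# The description document itself; foaf:primaryTopic is used at most once.
<http://example.org/void.ttl> a void:DatasetDescription ;
    foaf:primaryTopic :exampleDataset ;
    foaf:topic        :exampleDataset_molecules .

# The datasets described in the document.
:exampleDataset a void:Dataset .
:exampleDataset_molecules a void:Dataset .
```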
RDF Property | pav:authoredBy |
---|---|
Definition: | An agent that originated or gave existence to the work that is expressed by the digital resource. |
Range: | rdfs:Resource |
Usage note: |
The author of the content of a resource may be different from the creator of the resource representation (although they are often the same). See pav:createdBy for a discussion. |
Example: | The object of the triple should point to a URL that represents the person responsible for authoring the content, e.g. http://orcid.org/0000-0002-5711-4872. |
See also: | pav:createdBy |
RDF Property | pav:authoredOn |
---|---|
Definition: | The date this resource was authored. |
Range: | rdfs:Literal typed as xsd:dateTime . |
Usage note: | The dateTime is encoded as a literal in "yyyy-mm-ddThh:mm:sszzzzz" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified. If the specific hour, minute or second are not known, then 00 should be specified. If the timezone is not known then Z should be specified. |
RDF Property | pav:createdBy |
---|---|
Definition: | An agent primarily responsible for making the digital artifact or resource representation. |
Range: | rdfs:Resource |
Usage note: |
This property is distinct from forming the content, which is indicated with pav:contributedBy or its subproperties; pav:authoredBy, which identifies who authored the knowledge expressed by this resource; and pav:curatedBy, which identifies who curated the knowledge into its current form. |
Example: | The object of the triple should point to a URL that represents the person or tool responsible for creating the digital artifact, e.g. http://orcid.org/0000-0002-5711-4872. |
See also: | pav:authoredBy, pav:createdWith |
RDF Property | pav:createdOn |
---|---|
Definition: | The date of creation of the resource. |
Range: | rdfs:Literal typed as xsd:dateTime . |
Usage note: | The dateTime is encoded as a literal in "yyyy-mm-ddThh:mm:sszzzzz" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified. If the specific hour, minute or second are not known, then 00 should be specified. If the timezone is not known then Z should be specified. |
See also: | dcterms:issued, pav:lastUpdateOn |
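The defaulting rules for xsd:dateTime values can be illustrated with two cases. This is a sketch with hypothetical dataset names and dates: the first value is fully known, while the second assumes that only the year and month are known.

```turtle
@prefix :    <http://example.org/void.ttl#> .
@prefix pav: <http://purl.org/pav/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Date, time and timezone offset are all known.
:datasetA pav:createdOn "2013-05-14T09:30:00+01:00"^^xsd:dateTime .

# Only May 2013 is known: day defaults to 01, time to 00:00:00, timezone to Z.
:datasetB pav:createdOn "2013-05-01T00:00:00Z"^^xsd:dateTime .
```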
RDF Property | pav:createdWith |
---|---|
Definition: | The software/tool used by the creator (pav:createdBy) when making the digital resource. |
Range: | rdfs:Resource |
Usage note: |
The specific version of a tool/script that was used to create the resource. |
Example: | The versioned GitHub URL used to perform an RDF conversion: https://github.com/openphacts/chembl.rdf/blob/97c900460b46481aac07dfb11807a3f49fc92b2e/ops.ttl |
RDF Property | pav:derivedFrom |
---|---|
Definition: | Derived from a different resource. |
Range: | rdfs:Resource |
Usage note: |
Derivation concerns itself with derived knowledge. If this resource has the same content as the other resource, but has simply been transcribed to fit a different model (like XML -> RDF or SQL -> CSV), use pav:importedFrom. If a resource was simply retrieved, use pav:retrievedFrom. If the content has however been further refined or modified, pav:derivedFrom should be used. Details about who performed the derivation may be indicated with pav:contributedBy and its subproperties. |
See also: | pav:importedFrom |
RDF Property | pav:importedBy |
---|---|
Definition: | An entity responsible for importing the data. |
Range: | rdfs:Resource |
Usage note: |
The importer is usually a software entity which has done the transcription from the original source. Note that pav:importedBy may overlap with pav:createdWith. The source for the import should be given with pav:importedFrom. The time of the import should be given with pav:importedOn. |
See also: | pav:createdWith |
RDF Property | pav:importedFrom |
---|---|
Definition: | The original source of imported information. |
Range: | rdfs:Resource |
Usage note: |
Import means that the content has been preserved, but transcribed somehow, for instance to fit a different representation model. Examples of import are when the original was JSON and the current resource is RDF, or where the original was a document scan, and this resource is the plain text found through OCR. The imported resource does not have to be complete, but should be consistent with the knowledge conveyed by the original resource. If additional knowledge has been contributed, pav:derivedFrom would be more appropriate. If the resource has been copied verbatim from the original representation (e.g. downloaded), use pav:retrievedFrom. To indicate which agent(s) performed the import, use pav:importedBy. Use pav:importedOn to indicate when it happened. |
See also: | pav:derivedFrom |
RDF Property | pav:importedOn |
---|---|
Definition: | The date this resource was imported from a source (pav:importedFrom). |
Range: | rdfs:Literal typed as xsd:dateTime . |
Usage note: |
If the source is later reimported, this should be indicated with pav:lastRefreshedOn. The source of the import should be given with pav:importedFrom. The agent that performed the import should be given with pav:importedBy. The dateTime is encoded as a literal in "yyyy-mm-ddThh:mm:sszzzzz" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified. If the specific hour, minute or second are not known, then 00 should be specified. If the timezone is not known then Z should be specified. |
RDF Property | pav:lastRefreshedOn |
---|---|
Definition: | The date of the last re-import of the resource. |
Range: | rdfs:Literal typed as xsd:dateTime . |
Usage note: |
This property is used in addition to pav:importedOn if this version has been updated due to a re-import. If the re-import created a new resource rather than refreshing an existing, then pav:importedOn should be used together with pav:previousVersion. The dateTime is encoded as a literal in "yyyy-mm-ddThh:mm:sszzzzz" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified. If the specific hour, minute or second are not known, then 00 should be specified. If the timezone is not known then Z should be specified. |
See also: | pav:importedOn |
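The import properties are intended to be used together. The sketch below assumes a hypothetical RDF dataset that was transcribed from an SQL dump by a conversion tool and later re-imported in place; the source file, the tool URI and both dates are invented for illustration.

```turtle
@prefix :     <http://example.org/void.ttl#> .
@prefix pav:  <http://purl.org/pav/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

:exampleRdf a void:Dataset ;
    pav:importedFrom    <http://example.org/downloads/source.sql> ;   # original source
    pav:importedBy      <http://example.org/tools/sql2rdf> ;          # software that did the transcription
    pav:importedOn      "2013-05-14T00:00:00Z"^^xsd:dateTime ;        # first import
    pav:lastRefreshedOn "2013-08-01T00:00:00Z"^^xsd:dateTime .        # later re-import of the same resource
```

If the re-import had produced a new resource instead, that resource would carry its own pav:importedOn and a pav:previousVersion pointing at the earlier one.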
RDF Property | pav:lastUpdateOn |
---|---|
Definition: | The date of the last update of the resource. |
Range: | rdfs:Literal typed as xsd:dateTime . |
Usage note: |
An update is a change which did not warrant making a new resource related using pav:previousVersion, for instance correcting a spelling mistake. The dateTime is encoded as a literal in "yyyy-mm-ddThh:mm:sszzzzz" form (ISO 8601 Date and Time Formats). If the specific day or month are not known, then 01 should be specified. If the specific hour, minute or second are not known, then 00 should be specified. If the timezone is not known then Z should be specified. |
See also: | pav:createdOn |
RDF Property | pav:previousVersion |
---|---|
Definition: | The previous version of a resource in a lineage. |
Range: | rdfs:Resource |
Usage note: |
For instance a news article updated to correct factual information would point to the previous version of the article with pav:previousVersion. If however the content has significantly changed so that the two resources no longer share lineage (say a new news article that talks about the same facts), they should be related using pav:derivedFrom. A version number of this resource can be provided using the data property pav:version. |
See also: | pav:version |
RDF Property | pav:version |
---|---|
Definition: | The version number of a resource. |
Range: | rdfs:Literal |
Usage note: | This is a freetext string, typical values are "1.5" or "21". The URI identifying the previous version can be provided using pav:previousVersion. |
See also: | pav:previousVersion |
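Version information might be recorded as in the following sketch, where two hypothetical releases of the same dataset are related in a lineage. The resource names and version strings are assumptions.

```turtle
@prefix :     <http://example.org/void.ttl#> .
@prefix pav:  <http://purl.org/pav/> .
@prefix void: <http://rdfs.org/ns/void#> .

# The current release points back at the release it supersedes.
:exampleDataset_v2 a void:Dataset ;
    pav:version         "2" ;
    pav:previousVersion :exampleDataset_v1 .

:exampleDataset_v1 a void:Dataset ;
    pav:version "1" .
```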
RDF Class | void:Dataset
|
---|---|
Definition: | A set of RDF triples that are published, maintained or aggregated by a single provider. |
Superclass: | dctypes:Dataset
|
Subclass: | void:Linkset |
Usage note: | Used to declare the type of a dataset published in RDF. For a
dataset not published in RDF use dctypes:Dataset .
|
RDF Class |
void:DatasetDescription |
---|---|
Definition: | A web resource whose foaf:primaryTopic or foaf:topics include void:Datasets. |
Superclass: | foaf:Document |
Usage note: | Used to provide metadata about the dataset description document. For more details, see Section 6.2 of the VoID specification [[VOID]]. |
RDF Class | void:Linkset
|
---|---|
Definition: | A collection of RDF links between two void:Datasets. |
Superclass: | void:Dataset |
Usage note: | For more details see Section 1.4 of the VoID Specification. |
RDF Property |
void:dataDump |
---|---|
Definition: | An RDF dump, partial or complete, of a void:Dataset. |
Domain: | void:Dataset |
Range: | rdfs:Resource |
Usage note: |
Defines the location of an RDF dump file that should be provided in one of the standard RDF serializations, and may be compressed. If the dataset is contained in more than one file, then several values of this property should be given. The files may be password protected. For more details see Section 3.3 of the VoID Specification. |
RDF Property |
void:exampleResource |
---|---|
Definition: | example resource of dataset. |
Domain: | void:Dataset |
Range: | rdfs:Resource |
Usage note: |
Multiple resources can be declared. If the data is split into subsets, then the example resources should be associated with the subsets. For more details see Section 4.1 of the VoID Specification. |
RDF Property |
void:inDataset |
---|---|
Definition: | Points to the void:Dataset that a document is a part of. |
Domain: | foaf:Document |
Range: | void:Dataset |
Superproperty: | dcterms:isPartOf |
Usage note: |
Links a file containing the RDF dataset to the description of the dataset. For more details see Section 6.3 of the VoID Specification. |
RDF Property |
void:linkPredicate |
---|---|
Definition: | a link predicate |
Domain: | void:Linkset |
Range: | rdfs:Property |
Usage note: |
Specifies the mapping relationship which should be one of the properties given in the Mapping Relationships Table. For more details see Section 5.3 of the VoID Specification. |
RDF Property |
void:objectsTarget |
---|---|
Definition: | The dataset describing the objects of the triples contained in the Linkset. |
Domain: | void:Linkset |
Range: | void:Dataset |
Superproperty: | void:target |
Usage note: |
Should point to a versioned URI of a dataset that appears in this or another VoID file. The datasets may themselves be a subset of a larger dataset. Where the datasets do not provide a VoID description, the minimal required information must be provided in the linkset description. This is detailed in Section 5.5. This property may only appear once per Linkset (owl:FunctionalProperty). For more details see Section 5.1 of the VoID Specification. |
See also: | void:subjectsTarget |
RDF Property |
void:sparqlEndpoint |
---|---|
Definition: | has a SPARQL endpoint at |
Domain: | void:Dataset |
Range: | rdfs:Resource |
Usage note: |
Defines the location of a SPARQL endpoint where the data may be queried. For more details see Section 3.2 of the VoID Specification. |
RDF Property |
void:subjectsTarget |
---|---|
Definition: | The dataset describing the subjects of triples contained in the Linkset. |
Domain: | void:Linkset |
Range: | void:Dataset |
Superproperty: | void:target |
Usage note: |
Should point to a versioned URI of a dataset that appears in this or another VoID file. The datasets may themselves be a subset of a larger dataset. Where the datasets do not provide a VoID description, the minimal required information must be provided in the linkset description. This is detailed in Section 5.5. This property may only appear once per Linkset (owl:FunctionalProperty). For more details see Section 5.1 of the VoID Specification. |
See also: | void:objectsTarget |
RDF Property | void:subset
|
---|---|
Definition: | has subset |
Domain: | void:Dataset |
Range: | void:Dataset |
Usage note: |
For more details see Section 4.4 of the VoID Specification [[VOID]]. |
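A parent dataset with subsets could be described as sketched below. The dataset names are hypothetical; the topic URIs are the Chemical Entity and Target terms from the Semantic Science Integrated Ontology listed in this specification. Topics are attached to the subsets rather than to the parent, as recommended for dcterms:subject.

```turtle
@prefix :        <http://example.org/void.ttl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix sio:     <http://semanticscience.org/resource/> .
@prefix void:    <http://rdfs.org/ns/void#> .

# Parent dataset listing its subsets.
:exampleDataset a void:Dataset ;
    void:subset :exampleDataset_molecules , :exampleDataset_targets .

# Each subset declares its own topic.
:exampleDataset_molecules a void:Dataset ;
    dcterms:subject sio:SIO_010004 .   # Chemical Entity

:exampleDataset_targets a void:Dataset ;
    dcterms:subject sio:SIO_010423 .   # Target
```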
RDF Property | void:target
|
---|---|
Definition: | One of the two datasets linked by the Linkset. |
Domain: | void:Linkset |
Range: | void:Dataset |
Subproperties: | void:subjectsTarget, void:objectsTarget |
Usage note: | It is recommended that one of the sub-properties be used. |
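The linkset properties can be combined as in the sketch below. The target dataset names are hypothetical versioned URIs, and the use of skos:exactMatch as the link predicate is an assumption, since the Mapping Relationships Table is not reproduced in this section. The triple count is included so that a consumer can check that the whole linkset was loaded.

```turtle
@prefix :     <http://example.org/void.ttl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

:chemSpider_drugbank_linkset a void:Linkset ;
    void:subjectsTarget :chemSpiderDataset_2013_05 ;  # dataset of the link subjects
    void:objectsTarget  :drugBankDataset_3 ;          # dataset of the link objects
    void:linkPredicate  skos:exactMatch ;             # assumption: illustrative predicate
    void:triples        "6428"^^xsd:integer .

# Minimal typing of the target datasets.
:chemSpiderDataset_2013_05 a void:Dataset .
:drugBankDataset_3 a void:Dataset .
```

Using the two specific sub-properties rather than void:target records the direction of the links.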
RDF Property | void:triples
|
---|---|
Definition: | The total number of triples contained in a void:Dataset. |
Domain: | void:Dataset |
Range: | rdfs:Literal typed as xsd:integer . |
Usage note: |
Providing the number of triples included in the linkset allows for applications using the linkset to validate that the entire linkset has been successfully loaded. For more details see Section 4.6 of the VoID Specification. |
Example: | :chemSpider_drugbank_linkset void:triples
"6428"^^xsd:nonNegativeInteger. |
RDF Property |
void:uriRegexPattern |
---|---|
Definition: | Defines a regular expression pattern matching URIs in the dataset. |
Domain: | void:Dataset |
Usage note: | For more details see Section 4.2 of the VoID Specification. |
Example: | For the ChemSpider RDF dataset, the value would be "^http://rdf\\.chemspider\\.com/" .
|
See also: | void:uriSpace |
RDF Property |
void:uriSpace |
---|---|
Definition: | A URI that is a common string prefix of all the entity URIs in a void:Dataset. |
Domain: | void:Dataset |
Range: | rdfs:Literal |
Usage note: | For more details see Section 4.2 of the VoID Specification. |
Example: | For the ChemSpider RDF dataset, the value would be "http://rdf.chemspider.com/" .
|
See also: | void:uriRegexPattern |
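The two URI properties can be given side by side, as in this sketch for the ChemSpider values quoted in the examples above (the dataset name is hypothetical). Note that in Turtle a backslash inside a string must itself be escaped, so the regular expression `\.` is written as `\\.`.

```turtle
@prefix :     <http://example.org/void.ttl#> .
@prefix void: <http://rdfs.org/ns/void#> .

:chemSpiderDataset a void:Dataset ;
    # Plain string prefix shared by all entity URIs.
    void:uriSpace        "http://rdf.chemspider.com/" ;
    # Regular expression; "\\." in Turtle yields the regex token \. (a literal dot).
    void:uriRegexPattern "^http://rdf\\.chemspider\\.com/" .
```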
RDF Property |
void:vocabulary |
---|---|
Definition: | A vocabulary that is used in the dataset. |
Domain: | void:Dataset |
Range: | rdfs:Resource |
Usage note: |
Declare the vocabularies and ontologies that encode the data. For more details see Section 4.3 of the VoID Specification. |
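The access-related VoID properties might be used together as follows. This is only a sketch: every URL under http://example.org/ is hypothetical, and the two vocabularies listed are chosen purely for illustration.

```turtle
@prefix :     <http://example.org/void.ttl#> .
@prefix void: <http://rdfs.org/ns/void#> .

:exampleDataset a void:Dataset ;
    # A dataset split over two dump files has one value per file.
    void:dataDump        <http://example.org/dumps/part1.ttl.gz> ,
                         <http://example.org/dumps/part2.ttl.gz> ;
    void:sparqlEndpoint  <http://example.org/sparql> ;
    void:exampleResource <http://example.org/resource/compound/1> ;
    # Vocabularies used to encode the data.
    void:vocabulary      <http://semanticscience.org/resource/> ,
                         <http://www.w3.org/2004/02/skos/core#> .
```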
This section lists suggested vocabulary terms for the dataset topics metadata that are relevant for Open PHACTS. Where possible, we draw our terminology from the Semantic Science Integrated Ontology.
Concept Type | URI | Description |
---|---|---|
Annotation | http://semanticscience.org/resource/SIO_001166 | An annotation is a written explanatory or critical description, or other in-context information (e.g., pattern, motif, link), that has been associated with data or other types of information. |
Chemical Entity | http://semanticscience.org/resource/SIO_010004 | A chemical entity is a material entity that pertains to chemistry. |
Disease | http://semanticscience.org/resource/SIO_010299 | disease is the outward manifestation of one or more disorders |
Drug | http://semanticscience.org/resource/SIO_010038 | A drug is a chemical entity that regulates a biological process. |
Gene | http://semanticscience.org/resource/SIO_010035 | A gene is part of a nucleic acid that contains all the necessary elements to encode a functional transcript. |
Ligand | http://semanticscience.org/resource/SIO_010432 | a ligand is a molecule that is part of a complex by weakly interacting with another molecule |
mRNA | http://semanticscience.org/resource/SIO_010099 | a messenger RNA is a ribonucleic acid that contains an untranslated region (UTR) and protein coding sequence and lacks introns. |
Pathway | http://semanticscience.org/resource/SIO_001107 | a pathway is an effective specification that outlines a set of actions that forms a way to achieve an objective. |
Protein | http://semanticscience.org/resource/SIO_010043 | a protein is an organic polymer that is composed of one or more linear polymers of amino acids. |
RNA | http://semanticscience.org/resource/SIO_010009 | a ribonucleic acid is an organic polymer composed of a sequence of ribonucleotide residues. |
Target | http://semanticscience.org/resource/SIO_010423 | (None given in ontology) |
The following subsections provide justifications for relating two datasets.
Linkset Justification | URI | Usage note |
---|---|---|
Chemical entity | http://semanticscience.org/resource/SIO_010004 | Used to denote that the resources linked are conceptually the same chemical entity. |
Has component with uncharged counterpart | http://semanticscience.org/resource/CHEMINF_000480 | |
Has isotopically unspecified parent | http://semanticscience.org/resource/CHEMINF_000459 | |
Has major tautomer at pH 7.4 | http://semanticscience.org/resource/CHEMINF_000486 | |
Has OPS normalized counterpart | http://semanticscience.org/resource/CHEMINF_000458 | |
Has part | http://purl.obolibrary.org/obo#has_part | |
Has stereoundefined parent | http://semanticscience.org/resource/CHEMINF_000456 | |
Has uncharged counterpart | http://semanticscience.org/resource/CHEMINF_000460 | |
InChI Key | http://semanticscience.org/resource/CHEMINF_000059 | Used to denote that the related chemical entities share the same InChI Key. |
Is tautomer of | http://purl.obolibrary.org/obo#is_tautomer_of | Used to denote that the related chemical entities are tautomers. |
Linkset Justification | URI | Usage note |
---|---|---|
Pathway | http://semanticscience.org/resource/SIO_001107 | Used to denote that pathway resources linked are conceptually the same pathway. |
Pathway name | http://edamontology.org/data_2342 | Used to denote that the pathway resources linked have been matched based on their name. |
Linkset Justification | URI | Usage note |
---|---|---|
Functional RNA coding gene | http://semanticscience.org/resource/SIO_000986 | Used to denote that a gene or protein resource and an RNA resource are being treated as equivalent. |
Functional mRNA coding gene | | Used to denote that a gene or protein resource and an mRNA resource are being treated as equivalent. |
Gene | http://semanticscience.org/resource/SIO_010035 | Used to denote that the gene resources linked are conceptually the same gene. |
Has part | http://www.obofoundry.org/ro/ro.owl#has_part | Used to denote that one resource is a part of the other. (The resulting resource will keep the type of the parent resource.) |
mRNA | http://semanticscience.org/resource/SIO_010099 | Used to denote that the mRNA resources linked are conceptually the same mRNA. |
Protein | http://semanticscience.org/resource/SIO_010043 | Used to denote that the protein resources linked are conceptually the same protein. |
Protein coding gene | http://semanticscience.org/resource/SIO_000985 | Used to denote that a gene resource and a protein resource are being treated as equivalent. |
Protein structure | http://edamontology.org/data_1460 | Used to denote that two protein resources are being treated as equivalent where one is a protein sequence and the other the protein structure. |
RNA | http://semanticscience.org/resource/SIO_010009 | Used to denote that the RNA resources linked are conceptually the same RNA. |
Target | http://semanticscience.org/resource/SIO_010423 | Used to denote that the target resources linked are conceptually the same target. |
Linkset Justification | URI | Usage note |
---|---|---|
Annotation | http://semanticscience.org/resource/SIO_001166 | Used to denote that annotation resources linked are conceptually the same annotation. |
Database cross-reference | http://semanticscience.org/resource/SIO_001171 | Used to denote a database cross-reference where no information about its context is available. |
The species terms used in the Open PHACTS Discovery Platform are drawn from the NCBI Taxonomy. The subset of species that appear in the Discovery Platform is given below. These terms can be discovered using the BioPortal browser.
Below are the expected terms for specifying the assertion method of the links:
Note that a manual assertion includes computer-generated assertions that have been manually verified.
Below are the terms from the frequency of change vocabulary.
freq:triennial
freq:biennial
freq:annual
freq:semiannual
freq:threeTimesAYear
freq:quarterly
freq:bimonthly
freq:monthly
freq:semimonthly
freq:biweekly
freq:threeTimesAMonth
freq:weekly
freq:semiweekly
freq:threeTimesAWeek
freq:daily
freq:continuous
freq:irregular
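The ChEMBL example in this document attaches one of these terms to a dataset with dcterms:accrualPeriodicity. A minimal sketch of the same pattern for a dataset that is re-released monthly follows; the dataset name `:example_dataset` and the `<#>` base prefix are hypothetical.

```turtle
@prefix :        <#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix freq:    <http://purl.org/cld/freq/> .
@prefix void:    <http://rdfs.org/ns/void#> .

# Hypothetical dataset that is re-released once a month.
:example_dataset a void:Dataset ;
    dcterms:accrualPeriodicity freq:monthly .
```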
Below we provide an example of part of the dataset description document for the ChEMBL-RDF dataset, derived from the file available at ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.0/void.ttl.gz.
The dataset description consists of a parent resource with three distinct subsets. (Note that the actual ChEMBL description contains many more subsets.)
The example dataset description is available as:
@prefix : <http://rdf.ebi.ac.uk/dataset/chembl/16.example/void.ttl#> .
@prefix cco: <http://rdf.ebi.ac.uk/terms/chembl#> .
@prefix chembl: <http://rdf.ebi.ac.uk/resource/chembl/> .
@prefix chembl_activity: <http://rdf.ebi.ac.uk/resource/chembl/activity/> .
@prefix chembl_assay: <http://rdf.ebi.ac.uk/resource/chembl/assay/> .
@prefix chembl_bio_cmpt: <http://rdf.ebi.ac.uk/resource/chembl/biocomponent/> .
@prefix chembl_document: <http://rdf.ebi.ac.uk/resource/chembl/document/> .
@prefix chembl_journal: <http://rdf.ebi.ac.uk/resource/chembl/journal/> .
@prefix chembl_molecule: <http://rdf.ebi.ac.uk/resource/chembl/molecule/> .
@prefix chembl_protclass: <http://rdf.ebi.ac.uk/resource/chembl/protclass/> .
@prefix chembl_source: <http://rdf.ebi.ac.uk/resource/chembl/source/> .
@prefix chembl_target: <http://rdf.ebi.ac.uk/resource/chembl/target/> .
@prefix chembl_target_cmpt: <http://rdf.ebi.ac.uk/resource/chembl/targetcomponent/> .
@prefix bdb: <http://vocabularies.bridgedb.org/ops#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix eco: <http://purl.obolibrary.org/obo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix pav: <http://purl.org/pav/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix sio: <http://semanticscience.org/resource/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix uniprot: <http://purl.uniprot.org/uniprot/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# Metadata about the dataset description document itself
<http://rdf.ebi.ac.uk/dataset/chembl/16.example/void.ttl#> a void:DatasetDescription ;
    pav:createdBy <http://orcid.org/0000-0002-8011-0300> ;
    pav:contributedBy <http://orcid.org/0000-0002-5711-4872> ;
    pav:createdOn "2009-10-28T00:00:00.000Z"^^xsd:dateTime ;
    pav:lastUpdateOn "2013-08-23T00:00:00.000+01:00"^^xsd:dateTime ;
    dcterms:issued "2013-08-23T00:00:00.000+01:00"^^xsd:dateTime ;
    pav:previousVersion <http://rdf.ebi.ac.uk/dataset/chembl/16.2/void.ttl#> ;
    foaf:primaryTopic :chembl_rdf_dataset .
# Parent resource describing the ChEMBL-RDF dataset
:chembl_rdf_dataset a void:Dataset ;
    dcterms:title "The ChEMBL Database" ;
    dcterms:description "ChEMBL is a database of bioactive drug-like small molecules, it contains 2-D structures, calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data). The data is abstracted and curated from the primary scientific literature, and cover a significant fraction of the SAR and discovery of modern drugs." ;
    pav:createdBy <http://orcid.org/0000-0002-8011-0300> ;
    pav:createdOn "2009-10-28T00:00:00.000Z"^^xsd:dateTime ;
    pav:lastUpdateOn "2013-05-07T00:00:00.000+01:00"^^xsd:dateTime ;
    dcterms:issued "2013-08-23T00:00:00.000+01:00"^^xsd:dateTime ;
    pav:version "16.example" ;
    pav:previousVersion <http://rdf.ebi.ac.uk/dataset/chembl/16.2/void.ttl#chembl_rdf_dataset> ;
    dcat:landingPage <https://www.ebi.ac.uk/chembl> ;
    foaf:page <ftp://ftp.ebi.ac.uk/pub/databases/chembl/> ;
    dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
    void:uriSpace "http://rdf.ebi.ac.uk/resource/chembl/" ;
    dcterms:publisher <http://www.ebi.ac.uk/> ;
    void:subset :chembl_rdf_molecule_dataset ,
        :chembl_rdf_target_dataset ,
        :chembl_rdf_targetcmpt_dataset ;
    void:vocabulary <http://purl.org/ontology/bibo> ,
        <http://www.bioassayontology.org/bao> ,
        <http://semanticscience.org/ontology/cheminf.owl> ,
        <http://purl.org/spar/cito> ,
        <http://purl.org/dc/terms> ,
        <http://www.w3.org/2002/07/owl> ,
        <http://www.w3.org/1999/02/22-rdf-syntax-ns> ,
        <http://www.w3.org/2000/01/rdf-schema> ,
        <http://semanticscience.org/ontology/sio.owl> ,
        <http://www.w3.org/2004/02/skos/core> ,
        <http://www.w3.org/2001/XMLSchema> ;
    void:exampleResource chembl_molecule:CHEMBL941 ;
    void:sparqlEndpoint <http://rdf.ebi.ac.uk/dataset/chembl/sparql> ;
    dcterms:accrualPeriodicity freq:quarterly .
# Subset containing molecule information
:chembl_rdf_molecule_dataset a void:Dataset ;
    dcterms:title "ChEMBL Molecules Dataset" ;
    dcterms:description "The ChEMBL Molecules Dataset about drug-like small molecules containing calculated properties (e.g. logP, Molecular Weight, Lipinski Parameters, etc.)." ;
    void:uriSpace "http://rdf.ebi.ac.uk/resource/chembl/molecule/" ;
    void:exampleResource chembl_molecule:CHEMBL941 ;
    void:dataDump <ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.example/chembl_16.example_molecule.ttl.gz> ;
    dcat:theme sio:SIO_010004 .
# Subset containing target information
:chembl_rdf_target_dataset a void:Dataset ;
    dcterms:title "ChEMBL Target Dataset" ;
    dcterms:description "The ChEMBL Target Dataset containing abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data)" ;
    void:uriSpace "http://rdf.ebi.ac.uk/resource/chembl/target/" ;
    void:exampleResource chembl_target:CHEMBL2242 ;
    void:dataDump <ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.example/chembl_16.example_target.ttl.gz> ;
    void:subset :chembl_target_targetcmpt_linkset ;
    dcat:theme sio:SIO_010423 .
# Subset containing target component information
:chembl_rdf_targetcmpt_dataset a void:Dataset ;
    dcterms:title "ChEMBL Target Component Dataset" ;
    dcterms:description "The ChEMBL Target Component Dataset about proteins that are contained in targets." ;
    void:uriSpace "http://rdf.ebi.ac.uk/resource/chembl/targetcomponent/" ;
    void:exampleResource chembl_target_cmpt:CHEMBL_TC_583 ;
    void:dataDump <ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/16.example/chembl_16.example_targetcmpt.ttl.gz> ;
    void:subset :chembl_targetcmpt_uniprot_linkset ;
    dcat:theme sio:SIO_010043 .
Below we provide an example of the dataset description document for the RDF conversion of the DrugBank dataset. The dataset description consists of a parent resource describing the RDF representation of DrugBank, the original DrugBank data that was used to generate the RDF including details of its distribution, and the two distinct subsets that make up the DrugBank dataset. The example dataset description is available as:
@prefix bdb: <http://vocabularies.bridgedb.org/ops#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix pav: <http://purl.org/pav/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sd: <http://www.w3.org/ns/sparql-service-description#> .
@prefix sio: <http://semanticscience.org/resource/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix uniprot: <http://purl.uniprot.org/uniprot/> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <#> .
# VoID Header for the DrugBank RDF dataset
<> rdf:type void:DatasetDescription ;
    dcterms:title "DrugBank VoID Description"@en ;
    dcterms:description "The VoID description for the RDF representation of the DrugBank dataset."@en ;
    pav:createdBy <https://orcid.org/0000-0002-5711-4872> ;
    pav:createdOn "2012-10-30T16:08:36Z"^^xsd:dateTime ;
    pav:lastUpdateOn "2013-08-23T16:00:00Z"^^xsd:dateTime ;
    dcterms:issued "2013-08-23T16:00:00Z"^^xsd:dateTime ;
    foaf:primaryTopic :drugbank-rdf .
# Metadata about the original DrugBank dataset
:drugbank rdf:type dctypes:Dataset ;
    dcterms:title "DrugBank dataset"@en ;
    dcterms:description "The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 6811 drug entries including 1528 FDA-approved small molecule drugs, 150 FDA-approved biotech (protein/peptide) drugs, 87 nutraceuticals and 5080 experimental drugs. Additionally, 4294 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each DrugCard entry contains more than 150 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data."@en ;
    dcterms:publisher <http://www.drugbank.ca/> ;
    dcat:landingPage <http://www.drugbank.ca/> ;
    dcterms:license <http://www.drugbank.ca/about#cite> ;
    dcterms:issued "2009-01-01T00:00:00Z"^^xsd:dateTime ;
    dcat:distribution :drugbank-distribution ;
    pav:version "2.5" ;
    foaf:page <http://thedatahub.org/dataset/drugbank> .
# Metadata about the distribution of the DrugBank dataset
:drugbank-distribution rdf:type dcat:Distribution ;
    dcat:mediaType "text" ;
    dcat:downloadURL <http://www.drugbank.ca/system/downloads/2.5/drugcards.zip> .
# Metadata about the RDF representation of DrugBank
:drugbank-rdf rdf:type void:Dataset ;
    dcterms:title "DrugBank RDF"@en ;
    dcterms:description """An RDF representation of the DrugBank dataset taken from Free University of Berlin. The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 6811 drug entries including 1528 FDA-approved small molecule drugs, 150 FDA-approved biotech (protein/peptide) drugs, 87 nutraceuticals and 5080 experimental drugs. Additionally, 4294 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each DrugCard entry contains more than 150 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data."""@en ;
    dcterms:publisher <http://www4.wiwiss.fu-berlin.de/> ;
    dcat:landingPage <http://www.drugbank.ca/> ;
    dcterms:license <http://www.drugbank.ca/about#cite> ;
    dcterms:issued "2010-08-31T00:00:00Z"^^xsd:dateTime ;
    void:dataDump <http://www4.wiwiss.fu-berlin.de/drugbank/drugbank_dump.nt> ;
    pav:version "2.5" ;
    pav:importedFrom :drugbank ;
    pav:importedBy <mailto:anja@anjeve.de> ;
    pav:importedOn "2008-11-17T20:52:39"^^xsd:dateTime ;
    pav:createdWith <https://github.com/anjeve/lodd/tree/master/datasets/DrugBank> ;
    void:subset :db-drugs , :db-targets ;
    void:sparqlEndpoint <http://www4.wiwiss.fu-berlin.de/drugbank/sparql> ;
    void:uriSpace "http://www4.wiwiss.fu-berlin.de/drugbank/resource/"^^xsd:string ;
    void:vocabulary <http://www.w3.org/1999/02/22-rdf-syntax-ns#> ,
        <http://www.w3.org/2002/07/owl#> ,
        <http://xmlns.com/foaf/0.1/> ,
        <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/> ,
        <http://www.w3.org/2000/01/rdf-schema#> ;
    foaf:page <http://www4.wiwiss.fu-berlin.de/drugbank/> ;
    foaf:page <http://thedatahub.org/dataset/fu-berlin-drugbank> ;
    void:triples "765936"^^xsd:integer .
# Subset containing drug compound information
:db-drugs rdf:type void:Dataset ;
    void:uriSpace "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/"^^xsd:string ;
    dcat:theme <http://semanticscience.org/resource/SIO_010038> ;
    void:exampleResource <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00001> ;
    void:triples "4772"^^xsd:integer .
# Subset containing target information
:db-targets rdf:type void:Dataset ;
    void:uriSpace "http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/"^^xsd:string ;
    dcat:theme <http://semanticscience.org/resource/SIO_010423> ;
    void:exampleResource <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/1> .
Below is the linkset description relating the ChEMBL target component
subset of ChEMBL to the UniProt dataset. The linkset description is
included in the dataset description for the ChEMBL dataset and the links
are contained in a separate file with a backlink using the void:inDataset
predicate.
The example linkset description is available as:
:chembl_targetcmpt_uniprot_linkset a void:Linkset ;
    dcterms:title "ChEMBL Target Component to UniProt Linkset" ;
    dcterms:description "The ChEMBL Target Component to UniProt Linkset relating target components to their corresponding entry in UniProt based on their sequence." ;
    dcterms:publisher <http://www.ebi.ac.uk/> ;
    dcterms:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
    dcterms:issued "2013-07-26T10:02:03.000+01:00"^^xsd:dateTime ;
    void:dataDump <ftp://ftp.ebi.ac.uk/pub/databases/chembl/chembl-rdf/16.example/chembl_16.example_targetcmpt_uniprot_ls.ttl.gz> ;
    void:subjectsTarget :chembl_rdf_targetcmpt_dataset ;
    bdb:subjectsDatatype sio:SIO_010043 ;
    void:objectsTarget <http://purl.uniprot.org/void#uniprotdataset_2013_08> ;
    bdb:objectsDatatype sio:SIO_010043 ;
    void:linkPredicate skos:exactMatch ;
    bdb:linksetJustification sio:SIO_010043 ;
    pav:authoredBy <http://orcid.org/0000-0002-8011-0300> ;
    pav:authoredOn "2013-07-26T10:02:03.000+01:00"^^xsd:dateTime ;
    pav:createdBy <http://orcid.org/0000-0002-8011-0300> ;
    pav:createdOn "2013-07-26T10:02:03.000+01:00"^^xsd:dateTime ;
    bdb:assertionMethod eco:ECO_0000218 .
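The separate links file mentioned above is not shown in this document. A minimal sketch of what it might contain follows, assuming the linkset is described under the 16.example description document; the linkset URI must match the one under which the linkset is actually published, and the UniProt accession is a placeholder rather than a real mapping.

```turtle
@prefix chembl_target_cmpt: <http://rdf.ebi.ac.uk/resource/chembl/targetcomponent/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix uniprot: <http://purl.uniprot.org/uniprot/> .
@prefix void:    <http://rdfs.org/ns/void#> .

# Backlink from this links file to the linkset description (assumed URI).
<> void:inDataset <http://rdf.ebi.ac.uk/dataset/chembl/16.example/void.ttl#chembl_targetcmpt_uniprot_linkset> .

# One link, using the predicate declared by void:linkPredicate.
# EXAMPLE_ACCESSION is a hypothetical placeholder, not a real UniProt entry.
chembl_target_cmpt:CHEMBL_TC_583 skos:exactMatch uniprot:EXAMPLE_ACCESSION .
```

Keeping the backlink in the links file lets a consumer that only has the data dump find the metadata (justification, assertion method, provenance) that applies to every link in it.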