FAIR Research Data Coursebook: All in One View

Content from Introduction

Last updated on 2026-04-14

Overview

Questions

Does FAIR data mean open data?
What are digital objects and persistent identifiers?
What kinds of persistent identifiers are commonly used?

Objectives

Understand that FAIR does not simply mean open.
Explain the difference between human-readable and machine-friendly digital objects.
Recognize DOI as one type of PID used to identify digital objects.

Does FAIR data mean open data?

Callout

FAIR is not identical to open

FAIR means that data and related research objects are designed to be easy for humans and machines to find, understand, access, and reuse. Some FAIR data can also be open, but openness and FAIRness are not the same thing.

Diagram connecting FAIR and Open Science concepts

Discussion

Human-readable vs machine-friendly

Human-readable content is easy for a person to inspect directly, but that does not guarantee that software can parse and reuse it automatically. Machine-readable content uses an explicit structure, such as CSV, JSON, or XML, so that a computer can interpret relationships and values without guessing.

During this lesson, “machine-friendly” is used broadly to mean material that is machine-readable and also supports further automated action and interoperability.
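The difference can be sketched in a few lines of Python. This is only an illustration: the sentence, the CSV text, the values, and the function name read_measurements are assumptions made for this example and are not part of the lesson material.

```python
import csv
import io

# Human-readable: a person understands this, but software has to guess the structure.
sentence = "The participant weighs 70 kg and is 175 cm tall."

# Machine-readable: the same (illustrative) facts with explicit columns.
csv_text = "variable,value,unit\nweight,70,kg\nheight,175,cm\n"


def read_measurements(text):
    """Parse CSV text into a dict mapping variable name to (value, unit)."""
    rows = csv.DictReader(io.StringIO(text))
    return {row["variable"]: (float(row["value"]), row["unit"]) for row in rows}


measurements = read_measurements(csv_text)
print(measurements["weight"])  # (70.0, 'kg')
```

Extracting the same values from the sentence would need custom text processing that breaks as soon as the wording changes, which is the practical cost of content that is only human-readable.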

Illustration about machine-friendly digital objects

What are digital objects and persistent identifiers?

A digital object is a sequence of bits stored in digital memory that has informational value on its own. Examples include:

a scientific publication
a dataset
a rich metadata record
a README file describing access and reuse conditions

A persistent identifier, or PID, is a durable reference to a digital or physical resource. PIDs are backed by technical infrastructure and governance arrangements that help them continue resolving even when the resource itself changes location.

Common uses of PIDs include identifying:

articles, datasets, and software
researchers
organizations and funders
projects and instruments
physical samples and media objects

DOI is one well-known PID type and is commonly used for datasets and publications. ORCID is another example, focused on researcher identity.

For a short explainer, see the FREYA project video on the importance of PIDs.

Challenge

Exercise

Visit the recent machine learning submissions on arXiv.

Pick a paper, open its PDF, and search for http or doi.

What kinds of links do the authors use for data or software references, and why is a DOI usually more robust than a personal website or repository link alone?

Solution

Authors often link to GitHub repositories, project pages, or other web pages for software and data. Those locations may move or disappear over time. A DOI is more robust because it resolves through persistent infrastructure and points to current metadata about the object even if the storage location changes.
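If you copy the text of a paper into Python, the search from the exercise can be roughly automated. The sketch below is a simplification under stated assumptions: the regular expressions cover common DOI and URL shapes only, trailing punctuation is stripped naively, and the sample text with its GitHub address is invented for illustration.

```python
import re

# "10.", a 4-9 digit registrant code, "/", then a suffix without spaces or quotes.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
URL_PATTERN = re.compile(r"https?://[^\s\"<>]+")


def classify_links(text):
    """Split the links found in text into DOI names and other URLs."""
    dois = [match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)]
    others = []
    for url in URL_PATTERN.findall(text):
        url = url.rstrip(".,;)")
        # A URL that contains a DOI name is counted as a DOI, not as "other".
        if not DOI_PATTERN.search(url):
            others.append(url)
    return {"doi": dois, "other": others}


# Hypothetical sentence in the style of a "code and data availability" section.
sample = (
    "Code: https://github.com/example-lab/example-code. "
    "Data: https://doi.org/10.1594/PANGAEA.750883."
)
links = classify_links(sample)
print(links)
```

A paper whose "other" list is long and whose "doi" list is empty depends entirely on locations that may move or disappear.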

Discussion

Reflection

How does your discipline usually share data? Is there a data journal, domain repository, or another community-specific mechanism for making research outputs findable and reusable?

Key Points

FAIR means making research objects more usable by humans and machines, not automatically making them open.
Machine-friendly objects are structured so software can interpret and reuse them reliably.
Digital objects include datasets, publications, metadata records, and related documentation.
DOI is a common PID used for datasets and publications.

Content from Set up your own terms

Last updated on 2026-04-14

Overview

Questions

What are data terms of use?
What should a data terms of use statement contain?
What format should terms of use be written in?
What standard licenses are available for data?

Objectives

Understand what data terms of use are and why they matter.
Identify the minimum components of a basic terms of use statement.
Recognize when a standard license is sufficient and when a custom agreement is needed.

Callout

FAIR principles used in data terms of use

Accessible:

FM-A2 Metadata Longevity: https://doi.org/10.25504/FAIRsharing.A2W4nz

Reusable:

FM-R1.1 Accessible Usage License: https://doi.org/10.25504/FAIRsharing.fsB7NK

What are data terms of use?

Data terms of use are a textual statement that sets out the rules, conditions, licenses, and legal considerations that govern reuse of a data source.

Examples:

World Bank terms of use for datasets: https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
Numbeo terms of use: https://www.numbeo.com/common/terms_of_use.jsp

These examples show that terms of use usually describe the resource, the conditions under which it may be reused, and any expectations around attribution or restrictions.

What must a terms of use statement contain?

As a minimum, a data terms of use statement should cover the following elements:

Section	Description	Example
Description	What the statement refers to and which digital objects it covers	“These terms apply to the Happy Dataset.”
License	Under which conditions reuse is allowed	“The Happy Dataset is in the public domain.”
Attribution	How the data should be cited or acknowledged	“Please cite the Happy Dataset.”
Disclaimer	Important limitations or caveats	“The last 100 records may contain selection bias.”

Depending on the context, the statement may need additional clauses for multiple databases, sensitive data, embargoes, or obligations coming from a larger funded project.
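The four minimum sections can be assembled into a plain-text file with a short script. The following is a minimal sketch, not a legal template: the function name build_terms_of_use and the Markdown layout are assumptions, and the example sentences are the ones from the table above.

```python
def build_terms_of_use(description, license_text, attribution, disclaimer):
    """Return a Markdown terms-of-use statement with the four minimum sections."""
    sections = [
        ("Description", description),
        ("License", license_text),
        ("Attribution", attribution),
        ("Disclaimer", disclaimer),
    ]
    lines = ["# Terms of use", ""]
    for heading, body in sections:
        # One second-level heading per section, followed by its text.
        lines += [f"## {heading}", "", body, ""]
    return "\n".join(lines)


terms = build_terms_of_use(
    description="These terms apply to the Happy Dataset.",
    license_text="The Happy Dataset is in the public domain.",
    attribution="Please cite the Happy Dataset.",
    disclaimer="The last 100 records may contain selection bias.",
)
print(terms)
```

The resulting string can be saved next to the data, for example in a README or LICENSE file. Additional clauses for sensitive data or embargoes would be extra entries in the sections list.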

Callout

Terms of use are part of the legal basis for reuse

The terms of use statement is the formal basis on which others may access and reuse a data source. If your work sits inside a larger project or policy framework, check whether terms already exist before drafting a new statement.

An example of a broader policy framework is the FAIRsharing record for the 1958 Birth Cohort policy:

FAIRsharing entry: https://fairsharing.org/FAIRsharing.z09fg9
Original policy document: https://cpb-eu-w2.wpmucdn.com/blogs.bristol.ac.uk/dist/7/314/files/2015/07/POLICY-DOCUMENT-FINAL-Vsn-4.0-DEC-2014.pdf

What format should terms of use be written in?

Terms of use should be stored as plain text in a machine-friendly format such as .txt, .md, or .html. The exact length and level of detail will vary by project, but the statement should be easy for people to read and for systems to preserve.

Callout

Keep the statement in an accessible text format

You can draft terms of use in almost any editor, but the final version should be stored in a format that does not depend on proprietary software to read it.

Many projects place the statement in a README, LICENSE, or similar documentation file.

Challenge

Exercise

Visit the City of Philadelphia terms-of-use file:

https://github.com/CityOfPhiladelphia/terms-of-use/blob/master/LICENSE.md

Answer the following:

What kind of resource is it about?
In what format is the statement written?
On which platform is it published?

Solution

It is a terms-of-use style statement published for city-maintained digital resources.
The file format is Markdown (.md).
It is published on GitHub.

The statement itself often lives next to the data documentation:

Folder structure containing a README or license file

Some repositories let you define tailored reuse conditions directly on the platform. Dataverse, for example, defaults to a CC0 waiver but also allows custom terms after dataset creation.

Repository interface showing license and terms options

Challenge

Exercise

Is it possible to edit terms of use in DataverseNL?

Check the DataverseNL documentation or FAQ and find the answer.

Solution

Yes. After creating a dataset, you can go to the Terms tab, choose not to apply CC0, and provide your own terms and conditions.

Useful reference:

Sample Data Usage Agreement: https://dataverse.org/best-practices/sample-dua

Are there standard licenses we can pick from?

Two commonly used licensing families for data are:

Creative Commons: https://creativecommons.org/about/cclicenses/
Open Data Commons: https://opendatacommons.org/licenses/index.html

Creative Commons licenses are easy to understand and widely recognized, even if they were not designed only for data. Open Data Commons licenses are more explicitly data-oriented.

Mark	Meaning
BY	Creator must be credited
SA	Adaptations must be shared under the same license
NC	Only non-commercial uses are allowed
ND	No derivatives are allowed

Challenge

Exercise

Choose a Creative Commons license for simulation data under these conditions:

others cannot modify the work
commercial reuse is allowed

Which license fits?

Solution

Attribution-NoDerivatives 4.0 International, or CC BY-ND 4.0.
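The reasoning behind this answer can be written as a small decision function. This is a sketch for teaching purposes, assuming the four marks from the table above and version 4.0 of the licenses; the function name and its parameters are hypothetical, and it is no substitute for the official Creative Commons license chooser.

```python
def choose_cc_license(allow_derivatives=True, allow_commercial=True, share_alike=False):
    """Return the short name of the CC 4.0 license matching the given conditions."""
    if share_alike and not allow_derivatives:
        # SA is a condition on derivatives, so it cannot be combined with ND.
        raise ValueError("SA cannot be combined with ND")
    marks = ["BY"]  # all six CC licenses require attribution
    if not allow_commercial:
        marks.append("NC")
    if not allow_derivatives:
        marks.append("ND")
    elif share_alike:
        marks.append("SA")
    return "CC " + "-".join(marks) + " 4.0"


# The exercise: no modification allowed, commercial reuse allowed.
print(choose_cc_license(allow_derivatives=False, allow_commercial=True))  # CC BY-ND 4.0
```

The function can only return one of the six standard combinations, which mirrors the fact that SA and ND never appear in the same license.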

Discussion

Scenario

You are collaborating on a study of quality of life in children and plan to collect potentially sensitive information about bullying, social media, and family structure.

What should be considered when drafting terms of use for this study? Should the statement be drafted only by the researchers, or should legal and governance support be involved?

Key Points

A data terms of use statement defines the legal and practical basis for reuse.
A license is the minimum requirement, but some projects need richer terms or a custom agreement.
Store terms of use in an accessible text format such as .md or .txt.
If a standard license does not fit the project, a tailored terms-of-use statement or usage agreement may be necessary.

Content from Speak the same language

Last updated on 2026-04-14

Overview

Questions

What are data descriptions?
How can data descriptions be reused?
Are there standard ways to create them?
What is the relationship between data descriptions and linked data?

Objectives

Recognize that different communities use different names for data descriptions.
Learn how to build machine-friendly data description files.
Understand why reusing ontology terms improves interoperability.

Callout

FAIR principles used in data descriptions

Interoperable:

FM-I1 Use a Knowledge Representation Language: https://doi.org/10.25504/FAIRsharing.jLpL6i
FM-I2 Use FAIR Vocabularies: https://doi.org/10.25504/FAIRsharing.0A9kNV

Reusable:

FM-R1.3 Meets Community Standards: https://doi.org/10.25504/FAIRsharing.cuyPH9

What are data descriptions?

Data descriptions document the meaning of each data attribute or variable in a dataset.

Depending on the discipline, similar documents may be called:

codebooks
data dictionaries
labels or data tags
data glossaries

Example:

Variable name	Description	Unit
Weight	Weight of a human	kilograms
Height	Height of a human	centimetres
Age	Age of a human	years
Blood glucose level	Blood glucose level of a human	mg/dL

Callout

Different fields use different names

No matter what the document is called, the goal is the same: describe what the dataset variables mean so others can interpret and reuse the data correctly.

How can data descriptions be reused?

Writing documentation takes time, so when possible you should reuse accepted community definitions instead of inventing a new description every time.

BioPortal is a good example of a registry where community-maintained ontology terms can be searched and reused:

BioPortal: https://bioportal.bioontology.org/

BioPortal search results interface

Different ontology matches for the same concept

Ontology entry with persistent identifier

Callout

Reuse ontology terms when possible

Advantages:

you do not need to redefine common concepts repeatedly
you gain a persistent identifier for the concept
other systems can align variables across datasets more easily

Disadvantages:

sometimes no existing ontology fits the exact concept you need

Are there standard ways to create data descriptions?

There is no single universal format, but the minimum useful description usually includes:

the variable name
a definition
ideally, a link to the reused ontology term or community definition
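A data description with these three elements can be kept as a small CSV file. The sketch below is one possible layout, not a standard: the column names and function names are assumptions, and the term links point to example.org placeholders rather than real ontology identifiers, which you would look up in a registry such as BioPortal.

```python
import csv
import io

# Placeholder links on example.org; replace them with real ontology term URIs.
variables = [
    {"variable": "Weight", "definition": "Weight of a human",
     "unit": "kilograms", "term_uri": "https://example.org/terms/weight"},
    {"variable": "Height", "definition": "Height of a human",
     "unit": "centimetres", "term_uri": "https://example.org/terms/height"},
    {"variable": "Age", "definition": "Age of a human",
     "unit": "years", "term_uri": ""},  # no reused term recorded yet
]


def write_data_dictionary(rows):
    """Serialize the variable descriptions as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["variable", "definition", "unit", "term_uri"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def missing_links(rows):
    """List the variables that do not yet point to a reused term."""
    return [row["variable"] for row in rows if not row.get("term_uri")]


dictionary_csv = write_data_dictionary(variables)
print(dictionary_csv)
print("Variables without a term link:", missing_links(variables))
```

Checking for missing links is a simple way to see how much of a dataset is already aligned with community definitions.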

Some general-purpose vocabularies and metadata schemas include:

Resource	Link	Use
Schema.org	https://schema.org/	Generic web concepts
DBpedia	https://www.dbpedia.org/resources/lookup/	Concepts derived from Wikipedia
DCAT	https://www.w3.org/TR/vocab-dcat-2/	Data catalog concepts
Dublin Core	https://www.dublincore.org/specifications/dublin-core/dcmi-terms/	General metadata terms

Public registries that help locate vocabularies include:

Linked Open Vocabularies: https://lov.linkeddata.es/dataset/lov/
EU Vocabularies: https://op.europa.eu/en/web/eu-vocabularies
BioPortal: https://bioportal.bioontology.org/
AgroPortal: https://agroportal.lirmm.fr/
EcoPortal: https://ecoportal.lifewatchitaly.eu/
Ontology Lookup Service: https://www.ebi.ac.uk/ols/index
Bioschemas: https://bioschemas.org/

Challenge

Exercise

Visit BioPortal and search for a term describing blood glucose level.

What ontology term would you reuse, and what persistent identifier does it provide?

Solution

There is more than one possible match, but a valid answer is to identify a term such as the clinical measurement concept and record its ontology URI or other persistent identifier provided by the registry.

Data descriptions are often maintained as tables in .csv, .xlsx, or similar formats. For databases, it is also useful to provide both a machine-readable schema and a human-readable diagram.

Illustration comparing data and data dictionary concepts

Other human-friendly approaches include dataset nutrition labels and automated codebook tools, but the FAIR principles push further toward structured, machine-actionable formats.

What is the relationship between data descriptions and linked data?

When data descriptions reuse ontology terms and stable identifiers, datasets can be transformed into linked data formats such as RDF and become easier to combine with other resources.
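To make the idea concrete, the sketch below writes a few variable descriptions as RDF in the N-Triples syntax using only the Python standard library. It is a simplified illustration: the base URI and term links are example.org placeholders, the choice of rdfs:label and rdfs:seeAlso as predicates is an assumption made for this example, and real conversions are better done with the tools listed below.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def to_ntriples(base_uri, rows):
    """Serialize (name, label, term_uri) rows as N-Triples text."""
    lines = []
    for name, label, term_uri in rows:
        subject = f"<{base_uri}{name}>"
        # Escape backslashes and double quotes inside the literal.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{subject} <{RDFS_LABEL}> "{escaped}" .')
        lines.append(f"{subject} <{RDFS_SEE_ALSO}> <{term_uri}> .")
    return "\n".join(lines)


# Hypothetical identifiers; a real dataset would use its own stable URIs.
rows = [
    ("weight", "Weight of a human", "https://example.org/terms/weight"),
    ("height", "Height of a human", "https://example.org/terms/height"),
]
ntriples = to_ntriples("https://example.org/dataset/variable/", rows)
print(ntriples)
```

Each line is one statement of the form subject, predicate, object. Because every part is an identifier or a typed value, statements from different datasets that reuse the same term URI can be merged without renaming columns.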

Tools that can help with this include:

Tool	Source	GUI	Note
OpenRefine	https://openrefine.org/	Yes	Flexible but can be heavy to install
RMLMapper	https://github.com/RMLio/rmlmapper-java/releases	No	Powerful, but technical
SDM-RDFizer	https://github.com/SDM-TIB/SDM-RDFizer	No	Requires programming familiarity
SPARQL-Generate	https://ci.mines-stetienne.fr/sparql-generate/	Yes	Good if you want to learn SPARQL
Virtuoso Universal Server	https://virtuoso.openlinksw.com/	Yes	Commercial licensing may apply
UM LDWizard	https://github.com/MaastrichtU-IDS/ldwizard-humanities	Yes	Quick route to publishable linked data

Challenge

Exercise

Transform a dataset from XLSX to RDF using UM LDWizard:

Download the mock dataset: MOCK DATA
Convert the dataset into RDF
Record which ontology terms you reused for the variables

Solution

There is no single correct answer. The important part is to reuse existing ontology terms where possible and document the choices you made.

Discussion

Scenario

Your marine biology group has discovered new organisms, but you cannot find an existing ontology that fits the concepts you need.

What should you do? Should you adapt an existing vocabulary, create local terms, or work with the community toward a new ontology extension?

Key Points

Data descriptions may appear under names such as codebook or data dictionary.
Reusing ontology terms reduces ambiguity and improves interoperability.
A useful description links dataset variables to accepted community concepts.
Linked data becomes more feasible when descriptions are structured and identifier-based.

Content from Securely share

Last updated on 2026-04-14

Overview

Questions

What are data access protocols?
Is open access a data access protocol?
Can data be exposed as a service through FAIR API protocols?

Objectives

Learn what data access protocols are.
Distinguish between human-facing and machine-facing access instructions.
Explore how FAIR API protocols can expose data as a service.

Callout

FAIR principles used in data access protocols

Accessible:

FM-A1.1 Access Protocol: https://doi.org/10.25504/FAIRsharing.yDJci5
FM-A1.2 Access Authorization: https://doi.org/10.25504/FAIRsharing.EwnE1n

Interoperable:

FM-I3 Use Qualified References: https://doi.org/10.25504/FAIRsharing.B2sbNh

What are data access protocols?

Data access protocols are the explicit rules and steps that tell humans or machines how to access a data source.

Electronic access control system with keypad

Just as you need the correct key, code, or procedure to enter a secure room, data access often depends on following the correct process.

Access protocol	Example	Note
Communication between machines	HTTP request or API call	A machine requests information using a standard network protocol
Communication between humans	Data request form or email workflow	A person follows explicit instructions to request access

Example of a human-oriented access workflow

Screenshot of a human-facing data request form

Callout

Protocols are rules communicated in a shared language

Humans may use a written access policy or form, while machines use APIs and network protocols such as HTTP. Common machine-facing interfaces include SOAP and REST APIs and SPARQL endpoints.

Is open access a data access protocol?

Strictly speaking, open access is a policy framework rather than a network protocol, but in practice it functions as an access model with explicit rules: the data can be retrieved openly without individual authorization.

Depositing data in a public repository usually gives the resource an open-access route automatically. Some open datasets also offer machine-readable access through APIs or SPARQL endpoints.

Example:

EU public data SPARQL interface: https://data.europa.eu/data/sparql
Endpoint: https://data.europa.eu/sparql
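A machine-facing request to such an endpoint is an ordinary HTTP request. The sketch below builds one with the Python standard library, assuming the endpoint accepts the standard SPARQL protocol form (a GET request with a query parameter) and can return JSON results. The function name and the query are illustrative, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://data.europa.eu/sparql"  # endpoint listed above


def build_sparql_request(endpoint, query):
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for results in the SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Illustrative query: fetch any five statements.
request = build_sparql_request(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. Whether a given endpoint answers this particular query, and in which formats, depends on the service.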

Challenge

Exercise

Visit the Zenodo COVID-19 community:

https://zenodo.org/communities/covid-19/

What is the default data access protocol visible for the displayed records?

Solution

The records are presented as open access resources, which is visible through the repository interface and download behavior.

Callout

Open access can be human-friendly and machine-friendly

Humans see direct download options. Machines see stable web requests and, in some systems, structured service endpoints.

Can data be exposed as a service through FAIR API protocols?

Yes, but it requires technical infrastructure and some familiarity with linked data or knowledge graph tooling.

Useful tools include:

Tool	Source	GUI	Note
TriplyDB	https://triplydb.com/	Yes	Quick hosted option for exposing data
GraphDB	https://www.ontotext.com/products/graphdb/	Yes	Requires a server or managed deployment
FAIR Data Point	https://github.com/fair-data/fairdatapoint	No	Powerful but technical
rdflib-endpoint	https://pypi.org/project/rdflib-endpoint/	No	Fast local route for experimentation

Callout

Register FAIR APIs where others can find them

If you expose a FAIR API or similar service endpoint, register it in an appropriate service registry such as SMART API:

https://www.smart-api.info/

Challenge

Exercise

Expose RDF data through a service endpoint:

Reuse the RDF data generated from the data descriptions episode, or bring another RDF file.
Install rdflib-endpoint.
Run a local service and confirm that the data can be queried.

Solution

This is intentionally optional. The goal is to understand what it takes to move from static downloadable data to a service-oriented access model.

Discussion

Scenario

You have sensitive financial information that cannot be openly disclosed, but you still want to enable legitimate research access.

What kind of human and machine access protocols would you design? What should be openly visible, and what should only be available behind review or controlled authorization?

Key Points

Data access protocols describe how humans and machines gain access to data.
Open access is an access model that can still be expressed through explicit rules and interfaces.
Human-facing request forms and machine-facing APIs are both important access patterns in FAIR data practice.
FAIR service endpoints are more useful when they are registered and discoverable.

Content from Publish and preserve

Last updated on 2026-04-14

Overview

Questions

What is data archiving?
What are data repositories?
What is a DOI, and why is it important?

Objectives

Understand the role of trusted repositories in long-term preservation.
Learn why repositories improve discovery and citation.
Understand how DOI supports persistence and citation.

Callout

FAIR principles used in data archiving

Findable:

FM-F1A Identifier Uniqueness: https://doi.org/10.25504/FAIRsharing.r49beq
FM-F3 Resource Identifier in Metadata: https://doi.org/10.25504/FAIRsharing.o8TYnW

Accessible:

FM-A2 Metadata Longevity: https://doi.org/10.25504/FAIRsharing.A2W4nz

What is data archiving?

Data archiving is the long-term preservation of research data and related digital objects.

Funders and journals increasingly expect enough of the data and methods to be preserved so that results can be inspected, understood, and, where possible, replicated.

What are data repositories?

Data repositories are storage locations for digital objects such as datasets, code, supplementary files, and metadata.

Repositories make research outputs easier to find and can also provide:

preservation
backup
citation infrastructure
versioning
controlled access options

Examples:

Repository	About
DataverseNL	Community repository supporting Dutch universities and research centers
4TU Data	Repository originally developed by Dutch technical universities
PANGAEA	Domain repository for Earth and environmental sciences
Figshare	General-purpose repository for many digital object types

Callout

General repository recommendations

Zenodo: https://zenodo.org/
SURF Repository: https://repository.surfsara.nl/
DataverseNL: https://dataverse.nl/

Aim for a community repository when it exists. If not, a trusted general repository is usually better than leaving data only on a local drive or project website.

Typical strengths of general-purpose repositories include:

persistent identifiers
rapid publication
versioning
backup and preservation
usage statistics
open or restricted access modes

Additional registries that help identify suitable repositories:

re3data: https://www.re3data.org/
PLOS recommended repositories: https://journals.plos.org/plosone/s/recommended-repositories
NIH repositories guidance: https://sharing.nih.gov/data-management-and-sharing-policy/sharing-scientific-data/repositories-for-sharing-scientific-data
OpenDOAR: https://v2.sherpa.ac.uk/opendoar/

For context on the original Dataverse model, see Harvard Dataverse:

https://dataverse.harvard.edu/dataverse/harvard
Introductory video: https://www.youtube.com/watch?v=MPQ0Tpgaxt0

Challenge

Exercise

Visit https://dataverse.org/ and inspect the current installation map and the DataverseNL portal.

Answer:

How many Dataverse installations are listed now?
How many are in the Netherlands?
Roughly how many datasets are visible in DataverseNL at the time of your visit?

Solution

These counts change over time, so record the values you observe at the moment you complete the exercise rather than relying on historical numbers from older lesson versions.

What is a DOI, and why is it important?

A DOI (Digital Object Identifier) is a persistent identifier commonly used for articles, datasets, reports, and other scholarly outputs.

Two major DOI registration services in research are:

Crossref: https://www.crossref.org/
DataCite: https://datacite.org/

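Written as a URL, such as https://doi.org/10.1594/PANGAEA.750883, a DOI has three conceptual parts:

the resolver service (https://doi.org/)
the registrant prefix (10.1594)
the locally assigned suffix (PANGAEA.750883)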

DOIs make resources easier to cite and track, and they continue resolving even when storage details change.
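The three parts can be separated with a few lines of Python. This is a minimal sketch that assumes the DOI is written as a URL; the function name split_doi is hypothetical, and the example DOI is the PANGAEA dataset referenced elsewhere in this lesson.

```python
def split_doi(doi_url):
    """Split a DOI URL into resolver, registrant prefix, and suffix."""
    resolver, marker, rest = doi_url.partition("/10.")
    if not marker:
        raise ValueError(f"no DOI name found in: {doi_url}")
    # The DOI name is "prefix/suffix"; the suffix may itself contain slashes.
    prefix, slash, suffix = ("10." + rest).partition("/")
    if not slash or not suffix:
        raise ValueError(f"DOI name has no suffix: {doi_url}")
    return {"resolver": resolver + "/", "prefix": prefix, "suffix": suffix}


parts = split_doi("https://doi.org/10.1594/PANGAEA.750883")
print(parts)
# {'resolver': 'https://doi.org/', 'prefix': '10.1594', 'suffix': 'PANGAEA.750883'}
```

Only the prefix and suffix identify the object. The resolver is the service that redirects to wherever the object currently lives, which is why the citation keeps working when storage details change.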

Challenge

Exercise

Upload a test dataset to the DataverseNL demo or another training repository:

Download the mock dataset: MOCK DATA
Deposit it in a suitable demo or sandbox repository
Record the DOI or persistent identifier you receive

Solution

There is no single correct answer. The point is to experience how a repository assigns persistent identifiers and what metadata are required before publication.

Discussion

Scenario

You conducted personal interviews for an ethnographic migration study. The raw transcripts cannot be openly shared, even under restricted access.

How should archiving be handled in this case? What metadata, derived outputs, or access statements should still be preserved and published?

Key Points

Repositories improve preservation, discovery, and citation.
Prefer a trusted community repository when one fits your discipline.
General repositories such as Zenodo still provide a strong baseline for FAIR publication.
DOI is a key persistent identifier for datasets and publications.

Content from Make machines work for you

Last updated on 2026-04-14

Overview

Questions

What is the difference between metadata and rich metadata?
How can a rich metadata file be created?
Where should rich metadata be stored?

Objectives

Understand the difference between plain metadata and rich metadata.
Learn where machine-readable metadata comes from and how to generate it.
Identify good places to publish rich metadata files.

Callout

FAIR principles used for rich metadata

Findable:

FM-F2 Machine-readability of Metadata: https://doi.org/10.25504/FAIRsharing.ztr3n9

Interoperable:

FM-I1 Use a Knowledge Representation Language: https://doi.org/10.25504/FAIRsharing.jLpL6i
FM-I2 Use FAIR Vocabularies: https://doi.org/10.25504/FAIRsharing.0A9kNV

What is the difference between metadata and rich metadata?

Metadata is data about the data. It describes properties of a digital object, such as title, creator, publisher, size, or identifier.

Example of metadata fields shown in a repository

Metadata type	Example
Descriptive metadata	DOI
Structural metadata	Data size
Administrative metadata	Publisher
Statistical metadata	Number of files

Rich metadata goes further. It is:

standardized
structured
machine-readable
based on shared vocabularies
suitable for search engines and automated reuse

Callout

Rich metadata is more than plain text

Metadata alone can be descriptive but still hard for machines to interpret. Rich metadata uses a structured format such as JSON-LD and shared schemas such as Schema.org, DataCite, or Dublin Core.
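As an illustration, the sketch below turns a handful of plain metadata values into a JSON-LD record that uses Schema.org terms. All values, including the creator name and the function name to_jsonld, are invented for this example, and the DOI is left as a placeholder. A repository export will typically contain many more fields.

```python
import json


def to_jsonld(title, description, creator_name, license_url, identifier):
    """Wrap plain metadata values in a Schema.org Dataset description."""
    return {
        "@context": "https://schema.org/",  # shared vocabulary
        "@type": "Dataset",                 # tells machines what kind of object this is
        "name": title,
        "description": description,
        "creator": {"@type": "Person", "name": creator_name},
        "license": license_url,
        "identifier": identifier,
    }


record = to_jsonld(
    title="Dummy Data",
    description="This is an example for the coursebook",
    creator_name="Jane Doe",  # hypothetical creator
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    identifier="https://doi.org/XXXXXX",  # placeholder, not a real DOI
)
print(json.dumps(record, indent=2))
```

The information is the same as in a plain list of fields. What changes is that every key now has a shared, published meaning that search engines and aggregators can interpret.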

How can a rich metadata file be created?

Researchers usually do not need to write rich metadata by hand. In many cases, it can be exported from a repository or generated through a form-based tool.

Platform	Source	Online	Note
Dataverse export button	https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/Q80QUE	Yes	Fastest path for datasets
FAIR Metadata Wizard	https://maastrichtu-ids.github.io/fair-metadata-wizard/	Yes	Tailored to scientific projects
NSDRA JSON-LD generator	https://nsdra.github.io/nsdra-jsonld-metadata-generator-webapp/#	Yes	Community-specific but adaptable
Steal Our JSON-LD	https://jsonld.com/json-ld-generator/	Yes	General-purpose
JSON-LD Schema Generator for SEO	https://hallanalysis.com/json-ld-generator/	Yes	Broad but SEO-oriented

Repository export interface showing JSON-LD options

Where should a rich metadata file be stored?

A simple rule is:

Callout

Publish rich metadata everywhere the data lives

in the project root folder
in the data repository
in the GitHub repository
on the project website

The exact schema may differ by context, but common choices include:

DataCite Metadata Schema: https://schema.datacite.org
Dublin Core: https://www.dublincore.org
discipline-specific schemas such as DDI or Darwin Core

Rich metadata also helps connect publications, datasets, and project websites. Without structured metadata, search engines and aggregators may not recognize the resource as a dataset at all.

Discussion

Scenario

You are creating a project website for a research consortium.

How should rich metadata relate to that website? Which information belongs in the HTML, which belongs in repository records, and how do those pieces work together to improve discovery?

Key Points

Rich metadata combines descriptive metadata with shared vocabularies and a structured machine-readable format.
JSON-LD is a common way to publish rich metadata.
Repositories can often generate rich metadata automatically.
Rich metadata should be published anywhere the digital object is stored or represented.

Content from Responsibly reuse

Last updated on 2026-04-14

Overview

Questions

How should data be cited when reusing a data source?
How can you check whether data is likely to be reused and discovered?

Objectives

Understand the importance of citing datasets.
Learn how discoverability testing supports future data reuse.
Connect reuse to rich metadata and structured web data.

Callout

FAIR principles used in data reuse

Findable:

FM-F1B Identifier persistence: https://doi.org/10.25504/FAIRsharing.TUq8Zj
FM-F4 Indexed in a Searchable Resource: https://doi.org/10.25504/FAIRsharing.Lcws1N

Reusable:

FM-R1.2 Detailed Provenance: https://doi.org/10.25504/FAIRsharing.qcziIV

How should data be cited when reusing a data source?

Data reuse means working with an existing research data source rather than generating all material from scratch.

At minimum, a dataset citation should include:

creator
publication year
title
resource type or version
publisher
persistent identifier, usually DOI

Example citation:

Neff, Roni A.; Spiker, Marie L.; Truant, Patricia L. (2016). Wasted Food: U.S. Consumers’ Reported Awareness, Attitudes, and Behaviors. PLOS ONE. Dataset. https://doi.org/10.1371/journal.pone.0127881
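When the citation elements are stored as separate fields, a citation in this pattern can be generated consistently. The sketch below assumes the element order shown above. The function name is hypothetical, and the creators, title, publisher, and DOI in the usage example are placeholders rather than a real dataset.

```python
def format_citation(creators, year, title, publisher, resource_type, doi):
    """Format a dataset citation from its minimum elements."""
    authors = "; ".join(creators)
    return (
        f"{authors} ({year}). {title}. {publisher}. "
        f"{resource_type}. https://doi.org/{doi}"
    )


# Placeholder values for illustration only.
citation = format_citation(
    creators=["Doe, Jane", "Roe, Richard"],
    year=2024,
    title="Dummy Data",
    publisher="Example Repository",
    resource_type="Dataset",
    doi="10.1234/example",
)
print(citation)
# Doe, Jane; Roe, Richard (2024). Dummy Data. Example Repository. Dataset. https://doi.org/10.1234/example
```

Many repositories offer the same service through a "cite" button, which is usually the safer source for a real citation.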

Original dataset example:

https://figshare.com/articles/dataset/_Wasted_Food_U_S_Consumers_Reported_Awareness_Attitudes_and_Behaviors_/1445199

Dataset repository page with citation tools

Callout

Standard citations improve discoverability

When datasets are cited in recognizable formats, citation indexes and data aggregators can track those references much more effectively.

How can you check whether data is likely to be reused and discovered?

Datasets are digital objects on the web. To be reused, they must first be easy to find. That depends heavily on rich metadata and structured web markup.

Google’s Rich Results Test is one way to inspect whether a dataset page exposes structured information machines can recognize:

https://search.google.com/test/rich-results

Example test:

https://search.google.com/test/rich-results?url=http%3A%2F%2Fdoi.org%2F10.1594%2FPANGAEA.750883
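Test links of this form can be built for any dataset page by percent-encoding its address. The sketch below only reproduces the URL pattern visible in the example above; the function name is hypothetical, and the pattern is not a documented Google API and may change.

```python
from urllib.parse import quote

RICH_RESULTS_TEST = "https://search.google.com/test/rich-results"


def rich_results_url(page_url):
    """Return a Rich Results Test link for the given page address."""
    # safe="" also encodes ":" and "/" so the address fits in one parameter.
    return f"{RICH_RESULTS_TEST}?url={quote(page_url, safe='')}"


print(rich_results_url("http://doi.org/10.1594/PANGAEA.750883"))
# https://search.google.com/test/rich-results?url=http%3A%2F%2Fdoi.org%2F10.1594%2FPANGAEA.750883
```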

Entering a dataset URL in Google Rich Results Test

Test results after checking a dataset URL

Callout

Without rich metadata, data can be effectively invisible

A dataset may be accessible to humans through a webpage and still remain hard for search engines or data aggregators to interpret.
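A rough local version of this check is to look for embedded JSON-LD in a page's HTML. The sketch below uses only the Python standard library and assumes the common convention of placing structured data in a script element of type application/ld+json. The class and function names and both sample pages are invented for illustration, and the check is much simpler than what a search engine actually does.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the JSON-LD blocks embedded in an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def find_datasets(html):
    """Return the embedded JSON-LD objects whose type is Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return [b for b in parser.blocks if isinstance(b, dict) and b.get("@type") == "Dataset"]


rich_page = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org/", "@type": "Dataset", "name": "Dummy Data"}'
    "</script></head><body><h1>Dummy Data</h1></body></html>"
)
plain_page = "<html><body><h1>Dummy Data</h1><p>Download the file here.</p></body></html>"

print(len(find_datasets(rich_page)), len(find_datasets(plain_page)))  # 1 0
```

Both pages look the same to a human visitor, but only the first one tells a machine that it describes a dataset.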

Challenge

Exercise

Visit the Culture Crime News database:

https://news.culturecrime.org/all

Run Google’s Rich Results Test on that resource.

Did the test detect a structured dataset?

Solution

No. This is a useful reminder that a human-friendly page is not automatically a machine-friendly dataset description.

If you want to go further, review Google’s explanation of the Rich Results Test:

https://support.google.com/webmasters/answer/7445569

Search engines and data aggregators rely on shared standards such as:

Schema.org dataset markup: https://schema.org/Dataset
DCAT: https://www.w3.org/TR/vocab-dcat/
Semantic Web concepts more broadly: https://en.wikipedia.org/wiki/Semantic_Web

An example of dataset-oriented JSON-LD:

JSON

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Dummy Data",
  "description": "This is an example for the coursebook",
  "url": "https://example.com",
  "identifier": ["https://doi.org/XXXXXX"],
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "isAccessibleForFree": true
}

Challenge

Exercise

Go back to the dataset you uploaded in the data archiving episode and export its rich metadata as JSON-LD.

What similarities do you notice between that export and the example above?

Solution

You should usually see:

JSON-LD as the format
Schema.org or another shared schema as the vocabulary context
Dataset or an equivalent type
explicit fields for identifier, title, description, and license
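The comparison can also be done programmatically. The sketch below checks a JSON-LD document for the fields listed above, assuming Schema.org naming, where the title is stored under name. The function name and the list of expected fields are choices made for this example, and repository exports may use other vocabularies or a list of types, which this simple check does not handle.

```python
import json

# "name" is the Schema.org property that carries the title.
EXPECTED_FIELDS = ("name", "description", "identifier", "license")


def missing_fields(jsonld_text):
    """Return the expected Dataset fields that are absent or empty."""
    record = json.loads(jsonld_text)
    if record.get("@type") != "Dataset":
        raise ValueError("the document does not describe a Dataset")
    return [field for field in EXPECTED_FIELDS if not record.get(field)]


# The coursebook example shown earlier in this episode.
example = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Dummy Data",
  "description": "This is an example for the coursebook",
  "url": "https://example.com",
  "identifier": ["https://doi.org/XXXXXX"],
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "isAccessibleForFree": true
}
"""
print(missing_fields(example))  # []
```

Running the same function on your own export shows at a glance which of the expected fields the repository filled in for you.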

Discussion

Pick one or more datasets used in this lesson and test them with a FAIR assessment tool such as FAIR enough:

https://fair-enough.semanticscience.org/

What do the scores suggest about reuse readiness, and which missing metadata or links would most improve them?

Key Points

Reuse starts with clear citation and persistent identifiers.
Rich metadata is essential for search engines and aggregators to discover datasets.
Human-friendly web pages are not enough; machine-readable structure matters.
Testing discoverability is a practical part of making data reusable.