Cleaning and preparing data

Before research data can be shared, they must be prepared and documented. To be interpretable and reliable, the data should be accompanied by comprehensive documentation, including details on research methods, sampling procedures, and the coding or definition of variables.

The first step in this process is data cleaning: reviewing and correcting inconsistencies, errors, and incomplete entries. At the same time, researchers should assess the risk of disclosure by identifying any personal or sensitive information that could reveal the identity of participants.

Once the data have been cleaned and disclosure risks addressed, the accompanying metadata should be compiled and finalized to ensure that the dataset is complete, comprehensible, and ready for responsible sharing.

Metadata

Metadata is data about data: the information needed to understand, interpret, and reuse a dataset. Well-described metadata also improves the searchability and discoverability of research data, making them easier for both humans and machines to find and use.

Metadata typically includes details such as:

How and when the data were collected
The methods or instruments used
The context of the study
The license and access conditions governing data use

According to the FAIR Principles, metadata should remain openly accessible even if the dataset itself is subject to access restrictions.

Metadata requirements for DATICE

When publishing data with DATICE, specific metadata elements must accompany each dataset. DATICE follows two internationally recognized standards: DDI 2.5 from the Data Documentation Initiative and the CESSDA Metadata Model (CMM).

An overview of the annotation of metadata required to deposit data with DATICE can be found here.

Checklist for cleaning and preparing data files

Before sharing or archiving a dataset, it is essential to ensure that the data are accurate, well-documented, and free of identifiable information. The following checklist can help researchers systematically review data files for quality, consistency, and reusability.

Personal identifiers and confidentiality

Have all direct personal identifiers been removed from the dataset?
Does the dataset contain indirect identifiers (such as age, occupation, or location) that could reveal identities when combined with other data? If so, has the risk of disclosure been assessed, and have any necessary measures been applied?
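
The two checks above can be sketched in a few lines of Python. This is only a minimal illustration: the column names, the sample rows, and the threshold of 2 are assumptions made for the example, not DATICE requirements, and a real disclosure risk assessment usually needs more than a frequency count.

```python
from collections import Counter

# Hypothetical column names, chosen for illustration only.
DIRECT_IDENTIFIERS = {"name", "email", "national_id"}
INDIRECT_IDENTIFIERS = ("age_group", "occupation", "municipality")


def drop_direct_identifiers(rows):
    """Return copies of the rows without the direct-identifier columns."""
    return [
        {key: value for key, value in row.items() if key not in DIRECT_IDENTIFIERS}
        for row in rows
    ]


def rare_combinations(rows, keys=INDIRECT_IDENTIFIERS, threshold=2):
    """Return indirect-identifier combinations shared by fewer than `threshold` rows."""
    counts = Counter(tuple(row.get(key) for key in keys) for row in rows)
    return {combo: n for combo, n in counts.items() if n < threshold}


# Made-up sample records.
rows = [
    {"name": "A", "email": "a@example.org", "age_group": "30-39",
     "occupation": "nurse", "municipality": "X"},
    {"name": "B", "email": "b@example.org", "age_group": "30-39",
     "occupation": "nurse", "municipality": "X"},
    {"name": "C", "email": "c@example.org", "age_group": "60-69",
     "occupation": "pilot", "municipality": "Y"},
]

shared = drop_direct_identifiers(rows)
flagged = rare_combinations(shared)  # combinations that occur only once
```

Combinations that end up in `flagged` describe a single respondent and are candidates for coarser categories or suppression before sharing.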

Removal of unnecessary variables

Have irrelevant or sensitive variables been excluded from the version of the data intended for publication or sharing?
Are all remaining variables essential for analysis or for replicating the study?
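
One way to keep only the essential variables is to write the shared file from an explicit list of variables to keep, so that anything not listed is left out by default. The sketch below uses Python's standard `csv` module; the variable names and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical list of variables needed for analysis or replication.
KEEP = ["respondent_id", "age_group", "q1_trust", "q2_satisfaction"]

# Made-up raw file; in practice this would be opened from disk.
raw = io.StringIO(
    "respondent_id,age_group,interviewer_notes,q1_trust,q2_satisfaction\n"
    "1,30-39,called twice,4,5\n"
    "2,60-69,,2,3\n"
)

out = io.StringIO()
# extrasaction="ignore" silently skips columns that are not in KEEP.
writer = csv.DictWriter(out, fieldnames=KEEP, extrasaction="ignore",
                        lineterminator="\n")
writer.writeheader()
for row in csv.DictReader(raw):
    writer.writerow(row)

published = out.getvalue()
```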

Variable names, labels, and values

Are all variable names clear and descriptive, avoiding generic labels?
Have values been coded clearly and consistently?
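
Both questions can be partly automated. The following sketch flags variable names that look generic and values that are not defined in a codebook. The naming pattern, the codebook, and the sample rows are assumptions for illustration; adjust them to the conventions of your own project.

```python
import re

# Assumed pattern for "generic" names such as v1, var12, x, or column_2.
GENERIC_NAME = re.compile(r"(v|var|x|q|col|column)?_?\d*", re.IGNORECASE)

# Hypothetical codebook: variable name -> {code: label}.
CODEBOOK = {"gender": {1: "female", 2: "male", 3: "other", -9: "missing"}}


def generic_names(names):
    """Return the variable names that match the generic pattern in full."""
    return [name for name in names if GENERIC_NAME.fullmatch(name)]


def undefined_codes(rows, codebook=CODEBOOK):
    """Return, per variable, the values that have no entry in the codebook."""
    problems = {}
    for variable, labels in codebook.items():
        bad = sorted({row[variable] for row in rows if row[variable] not in labels},
                     key=str)
        if bad:
            problems[variable] = bad
    return problems


names_to_review = generic_names(["v1", "age_group", "var12", "q1_trust"])
codes_to_review = undefined_codes([{"gender": 1}, {"gender": 2}, {"gender": "F"}])
```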

Accuracy and consistency

Have all spelling or typing errors been corrected in variable labels, category names, and free-text fields?
Have implausible values (e.g., unrealistic ages, impossible measurements, or incorrect decimal places) been identified and checked?
Have duplicate records or double entries been identified and removed?
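
Range checks and duplicate detection are easy to script. The sketch below assumes plausible ranges for two hypothetical variables and treats two records as duplicates only when every field is identical; both choices are illustrative and should be adapted to the dataset at hand.

```python
# Assumed plausible ranges for two hypothetical variables.
PLAUSIBLE_RANGES = {"age": (0, 120), "height_cm": (30, 250)}


def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def implausible_values(rows, ranges=PLAUSIBLE_RANGES):
    """Return (row index, variable, value) for values outside the assumed ranges."""
    flagged = []
    for index, row in enumerate(rows):
        for variable, (low, high) in ranges.items():
            value = row.get(variable)
            if value is not None and not (low <= value <= high):
                flagged.append((index, variable, value))
    return flagged


# Made-up records: the second has an extra digit in age and a height
# entered in metres instead of centimetres; the third repeats the first.
rows = [
    {"id": 1, "age": 34, "height_cm": 172},
    {"id": 2, "age": 340, "height_cm": 1.68},
    {"id": 1, "age": 34, "height_cm": 172},
]

unique_rows = drop_duplicates(rows)
to_review = implausible_values(unique_rows)
```

Flagged values are listed for review rather than changed automatically, because the correct value can usually only be established from the original source.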

Missing data

Have missing values been clearly coded (e.g., “NA”)?
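
Missing values are often entered in several ad hoc ways within the same file. A minimal sketch of recoding them to a single code is shown below. The list of markers is an assumption and must be checked against the data, since a marker such as "-99" could be a valid value in some variables.

```python
# Assumed ad hoc markers for missing values, compared in lower case.
AD_HOC_MISSING = {"", "n/a", "na", "null", "none", "-", ".", "-99"}


def standardize_missing(rows, markers=AD_HOC_MISSING, code="NA"):
    """Return copies of the rows with every ad hoc missing marker replaced by `code`."""
    cleaned = []
    for row in rows:
        cleaned.append({
            key: code if str(value).strip().lower() in markers else value
            for key, value in row.items()
        })
    return cleaned


# Made-up records with blank, numeric, textual, and None markers.
rows = [
    {"income": "  ", "region": "North"},
    {"income": "-99", "region": "n/a"},
    {"income": "52000", "region": None},
]

cleaned = standardize_missing(rows)
```

Whatever code is chosen, it should be documented in the metadata so that users do not mistake it for a valid value.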

Structure and logical order

Are variables organized in a logical order, for example by theme, section, or time of collection?
In large datasets, have variables been grouped into sections that reflect their purpose or topic?

Metadata and documentation

Has metadata been created or updated to describe the dataset thoroughly, covering each of the following?
Title, creator(s), and project name
Purpose and description of the study
Methods used for data collection and analysis
Definitions of variables and coding schemes
Units of measurement and file formats
Access conditions and licensing information
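
The items above can be kept in a simple machine-readable record and checked for completeness before deposit. The sketch below is only an illustration: the field names and placeholder values are hypothetical and do not correspond to DDI 2.5 or CMM element names.

```python
import json

# Hypothetical field names mirroring the checklist above.
REQUIRED_FIELDS = [
    "title", "creators", "project", "description", "methods",
    "variables", "units", "file_formats", "access_conditions", "license",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Placeholder record with two gaps left in on purpose.
record = {
    "title": "Example survey",
    "creators": ["Example Researcher"],
    "project": "Example project",
    "description": "Placeholder description of the study purpose.",
    "methods": "Placeholder description of data collection and analysis.",
    "variables": {"q1_trust": "Trust in institutions, 1 (low) to 5 (high)"},
    "units": "",
    "file_formats": ["csv"],
    "access_conditions": "open",
}

gaps = missing_fields(record)
metadata_json = json.dumps(record, indent=2, ensure_ascii=False)
```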