Lesson 3: Data Description & Documentation

3.1 Defining data description and documentation

Describing your data thoroughly and at multiple levels is one of the most important data management practices you can adopt, both for yourself and for others.

Personal benefits

  • Returning to a project after a few days is hard enough; imagine returning to it after months or years! Clear, thorough documentation helps you understand your original work and processes long after you have finished them.
  • If you publish your work, well-described and documented data help protect your work from retraction due to missing data or errors caused by poor data management practices.
  • Well-documented data is more usable to others. When others are able to reuse your data, they can also cite your data!

Benefits for others

  • Good documentation makes your data more useful to others. They’ll have to spend less time trying to understand what you did at a certain step in your data cleaning or what a certain variable name means.
  • Well-described data and clear documentation of your work also help make your work reproducible and support research integrity.

Description and documentation are also sometimes referred to as metadata. Metadata is information about your data or processes that provides context for understanding your research data. For example, metadata that you capture about your data might include when it was collected, who collected it, how it was collected, any instruments used, or other technical information.

Metadata is important to have at the project or folder level as well as at the item or file level. Describing your data at multiple levels helps provide a more complete picture of how the data was produced, gathered, cleaned, and analyzed.
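As an illustration of item-level metadata, the fields mentioned above (when, who, how, and which instruments) can be recorded in a small structured file stored next to the data file. The following is a minimal sketch, not a required format: the file name, field names, and values are all hypothetical placeholders.

```python
import json
from datetime import date

# Hypothetical item-level metadata for one data file.
# Every name and value here is a placeholder for illustration.
metadata = {
    "file": "plant_counts.csv",
    "collected_on": date(2024, 5, 14).isoformat(),  # ISO 8601 date string
    "collected_by": "Field team A",
    "collection_method": "Quadrat count along a fixed transect",
    "instruments": ["1 m x 1 m quadrat frame", "handheld GPS unit"],
}


def to_sidecar_text(record: dict) -> str:
    """Serialize a metadata record as readable JSON for a sidecar file."""
    return json.dumps(record, indent=2, sort_keys=True)


print(to_sidecar_text(metadata))
```

Saving this text as, for example, plant_counts.metadata.json keeps the description next to the file it describes.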

3.2 Describing data at the item level

  • Clear and meaningful file names are a simple way to embed key information about your data in a file.
  • You can create data dictionaries to describe your data.
    • A data dictionary is a central document describing important information such as variable names, units of measurement, the range of valid values, and anything else others may need to know to interpret your data. The Open Science Framework provides a quick how-to.
  • You can use metadata standards common in your discipline.
    • A metadata standard is a defined way to describe an object that ensures that you’re capturing the same information for each object and the information is structured the same way.
      • In the social sciences, the Data Documentation Initiative (DDI) standard is often used as it was created to describe data produced by surveys and observational methods.
      • In biology, a commonly used standard for describing biological diversity is Darwin Core.
      • If you would like to view the standards used in different research domains, you can go to the Metadata Standards Catalog from the Research Data Alliance.
    • Some disciplines will not have a standard to use, so it’ll be up to you to brainstorm the important information to capture about your data.
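To make the data dictionary idea concrete, here is a minimal sketch that stores a dictionary as a list of entries, writes it out as CSV, and uses the valid ranges to flag suspicious values. The variable names, units, and ranges are hypothetical, and the column layout is only one possible choice, not a prescribed standard.

```python
import csv
import io

# Column layout for the data dictionary (an assumption, not a standard).
FIELDS = ["variable", "description", "units", "valid_min", "valid_max"]

# Hypothetical entries: one per variable in the dataset.
DATA_DICTIONARY = [
    {"variable": "plot_id", "description": "Survey plot identifier",
     "units": "none", "valid_min": 1, "valid_max": 50},
    {"variable": "stem_count", "description": "Number of living stems counted in the plot",
     "units": "stems", "valid_min": 0, "valid_max": 500},
    {"variable": "canopy_cover", "description": "Estimated canopy cover over the plot",
     "units": "percent", "valid_min": 0, "valid_max": 100},
]


def dictionary_as_csv(entries):
    """Return the data dictionary as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def out_of_range(row, entries):
    """Return the names of variables in `row` whose values fall outside the valid range."""
    problems = []
    for entry in entries:
        value = row[entry["variable"]]
        if not entry["valid_min"] <= value <= entry["valid_max"]:
            problems.append(entry["variable"])
    return problems


print(dictionary_as_csv(DATA_DICTIONARY))
print(out_of_range({"plot_id": 3, "stem_count": 42, "canopy_cover": 130}, DATA_DICTIONARY))
```

Keeping the valid ranges in the dictionary means the same document serves both readers and simple automated checks.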

3.3 Describing data at the project or folder level

Create a README file. A README file is a plain text file that provides important overarching information about the project and the files within it. It should include, at minimum, the information that someone else would need in order to reuse the dataset.

What goes in a README file?

  • Names and contact information for those associated with the project
  • Funding sources or institutional support
  • A list of files and folders, a description of their contents, and how to use them
  • Processing, analyses, or other important information to know about the data
  • Limitations of the data or project
  • Copyright and licensing information; citation preferences

Best practices for README files:

  • Name the README so that it is easily associated with the data file(s) it describes
  • Write your README document as a plain text file (.txt)
  • Format multiple README files identically
  • Use standards for dates
  • Follow the scientific conventions for your discipline for taxonomic, geospatial, and geologic names and keywords (controlled vocabulary or ontology)

If you choose to create a README file, be aware that there is a suggested minimum amount of information to capture about your work to ensure that your data is reusable.

This information includes:

  • Data and file overview
    • For each filename, a short description of what data it contains
    • Date that the file was created
  • Sharing and access information
    • Licenses or restrictions placed on the data
  • Methodological information
    • Description of methods for data collection or generation (include links or references to publications or other documentation containing experimental design or protocols used)
    • Description of methods used for data processing (describe how the data were generated from the raw or collected data)
  • Data-specific information
    • Variable list, including full names and definitions of column headings for tabular data
      • Spell out abbreviated words
    • Units of measurement
    • Definitions for codes or symbols used to record missing data [1]
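
The minimum information listed above can be assembled into a plain-text README with a short script. This is a sketch under stated assumptions: the function name build_readme, the section headings, and all sample values (names, email addresses, file names, license) are hypothetical, and your discipline or repository may expect a different layout.

```python
from datetime import date


def build_readme(title, contacts, files, license_text, methods,
                 variables, missing_codes, created=None):
    """Assemble a plain-text README from the minimum suggested information."""
    created = created or date.today()
    lines = [
        title,
        f"README created: {created.isoformat()}",  # ISO 8601 (YYYY-MM-DD)
        "",
        "CONTACTS",
    ]
    lines += [f"  {name} <{email}>" for name, email in contacts]
    lines += ["", "DATA AND FILE OVERVIEW"]
    lines += [f"  {name} (created {made}): {desc}" for name, made, desc in files]
    lines += ["", "SHARING AND ACCESS INFORMATION", f"  License: {license_text}"]
    lines += ["", "METHODOLOGICAL INFORMATION", f"  {methods}"]
    lines += ["", "DATA-SPECIFIC INFORMATION"]
    lines += [f"  {name} [{units}]: {definition}" for name, units, definition in variables]
    lines += [f"  Missing data code {code}: {meaning}" for code, meaning in missing_codes.items()]
    return "\n".join(lines) + "\n"


# All values below are placeholders for illustration.
readme_text = build_readme(
    title="Plant census project",
    contacts=[("A. Researcher", "a.researcher@example.org")],
    files=[("plant_counts.csv", "2024-05-14", "Stem counts per survey plot")],
    license_text="CC BY 4.0",
    methods="Stems were counted by hand in 1 m x 1 m quadrats.",
    variables=[("stem_count", "stems", "Number of living stems in the plot")],
    missing_codes={"-999": "Plot could not be surveyed"},
    created=date(2024, 6, 1),
)
print(readme_text)
```

The returned text can be saved as a .txt file alongside the data, and reusing the same function for every dataset keeps multiple README files formatted identically.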

3.4 Lesson Review

Suppose you’re a botanist who’s leading a team that will count how many specimens of a certain plant are living within an area.

Which of the following are examples of metadata you might record? Choose all that apply.

Answer: A, B, C, D, and E. Metadata can include any information that may provide important context later on about the data that you’re collecting.

References

[1] Cornell University Research Data Management Service Group, “Guide to writing ‘readme’ style metadata,” https://data.research.cornell.edu/content/readme.