Lesson 1: Data Has Value

1.1 What do we mean when we say “research data”?

Broadly speaking, research data is the information needed to produce and support your research findings. Research data takes many forms, both physical and digital, and there is no single shared definition. What falls within the scope of “research data” may depend on your discipline, whether you work in academia, and whether you’re subject to funding agency guidelines or university policy.

For Example

If you’re funded by a federal funding agency, the following definition is often used:

Research data is “…the recorded factual information commonly accepted in the scientific community as necessary to validate research findings.”

INCLUDES: code, figures, statistics, interviews, transcripts
EXCLUDES: preliminary analyses, drafts of papers, plans for further research, communications and peer reviews, physical samples
– White House, OMB Circular A-110, 2013

If you’re working here at the University of Wisconsin-Madison, you’d be subject to the following definition from the Policy on Data Stewardship, Access, and Retention, which shares similarities with the federal one above:

“Data means recorded factual material, regardless of the form or media on which it may be recorded, that is commonly accepted in the research community as necessary to validate research findings. For example, data may include writings, films, sound recordings, pictorial reproductions, drawings, designs, or other graphic representations, procedural manuals, forms, diagrams, work flow charts, equipment descriptions, data files, statistical records, and other research data.

…This definition of data excludes research results based on data such as preliminary analyses, drafts of research papers, published papers, plans for future research, peer reviews, or communications with colleagues.”

This module will focus on best practices for managing digital research data. However, physical data such as paper lab notebooks or physical samples, as well as non-research data files, are also important to manage well.

Practices for managing physical data should follow standards in your discipline or research group if they exist, and questions can be directed to local resources. Here at UW-Madison, Research Data Services is able to answer your questions.

Questions about managing non-research data files can usually be directed to your local records manager. Here at UW-Madison, our University Records Officer can assist you.


1.2 Forms, types, and stages of research data

While definitions of research data may be broad, you can categorize your data by its common forms, its general type, and the stages it moves through during the research process. Understanding your research data at this level will help you make more informed decisions as you begin to manage it. Certain data types, or data at certain stages of the research cycle, may be harder to recreate or re-collect once lost, so you may choose strategies that provide extra protection for data at greater risk.

Common Forms:

The most common data forms can vary by discipline. Below we’ve included examples of some commonly used data across a few broad domains.

Examples of commonly used research data forms across general domains:

Domain: Research Data Forms

Hard Sciences:
  • Measurements generated by sensors or laboratory instruments
  • Computer modeling
  • Simulations
  • Observations and/or field studies
  • Specimens
Social Sciences:
  • Survey responses
  • Focus group and individual interviews
  • Economic indicators
  • Demographics
  • Opinion polling
Arts and Humanities:
  • Text – including the text of novels, poems, historical letters or documents, etc.
  • Images
  • Geospatial data, historical maps
  • Video – films, recordings, etc.
  • Music or other audio recordings

Discipline-Specific Data Types [1]

    Types of Data

    Research data can be grouped into a few general types based on how it is collected.

    • Observational
      • Captured in real time via observation, sensors, or instruments
      • Cannot be reproduced or recaptured; sometimes called “unique data”
    • Experimental
      • Data from lab equipment under controlled conditions, produced when a researcher intervenes by altering a variable to bring about a change
      • Often reproducible, but reproducing it can be expensive
    • Simulation
      • Data generated from test models that imitate a real-world process or study an actual or theoretical system
      • For this data, the models and metadata (information about the data) may be just as valuable as the output data, if not more so
    • Compiled or derived
      • Results of data analysis, or data aggregated from multiple existing sources
      • Can often be reproduced, but doing so is very expensive and time-consuming
    • Reference or canonical
      • Fixed or organically growing collections of datasets, usually peer-reviewed, and often published and curated
      • Typically drawn from existing, widely used data sources such as census data or gene sequence data banks

    Stages of Research Data

    As we noted above, research data will also move through the following stages during your project:

    • Raw Data
      • Depending on the type of data you’re using, your raw data may be very valuable and not reproducible. We always suggest keeping a raw copy separate from your other data.
    • Processed Data
      • As long as your processing and cleaning steps are well documented, this data is likely reproducible, though reproducing it may be time-consuming.
    • Finalized and/or Published Data
      • At this point, you have a final dataset ready for publication or sharing. This can also lead to your data moving into another stage where it can be reused by yourself and others.
    • Reused Data or Data Combined with Existing Data
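
The advice above to keep a raw copy separate from your other data can be put into practice in many ways. The following is a minimal sketch in Python, not a prescribed method: the function name `preserve_raw_copy` and the idea of a dedicated `raw` folder are assumptions made for illustration. It copies a file into a separate folder, refuses to overwrite an existing copy, and removes write permission from the copy.

```python
import shutil
import stat
from pathlib import Path


def preserve_raw_copy(source: Path, raw_dir: Path) -> Path:
    """Copy a raw data file into raw_dir and mark the copy read-only.

    Hypothetical helper for illustration; adapt folder names to your project.
    """
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / source.name
    if target.exists():
        # Never silently replace a raw copy that was already preserved.
        raise FileExistsError(f"{target} already exists; refusing to overwrite raw data")
    shutil.copy2(source, target)  # copy2 also keeps file timestamps
    # Remove write permission so the raw copy is not edited by accident.
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target
```

For example, `preserve_raw_copy(Path("readings.csv"), Path("project/raw"))` would place a read-only copy in `project/raw`, leaving you free to clean and process a working copy elsewhere. A read-only flag only guards against accidental edits; it is not a substitute for backups.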

    These different stages of research data are often represented in a more formal model that we call a data life cycle. In reality, research doesn’t move in quite such an orderly fashion, and many of these steps often happen simultaneously.

    However, the data life cycle is a helpful mental model because there are best practices for managing data at each stage. Visualizing how your data moves through your project can remind you of key practices to incorporate. While this module won’t cover every stage in the life cycle, it will provide a primer on some essential practices for getting started with managing your data.

    [Interactive figure: The Research Data Management Life Cycle, showing the definition of each stage of research data management and that stage’s role in the life cycle.]


    1.3 Why is it important to manage research data?

    Personal benefits

    While data management can sound like a lot of work for little payoff, managing your research data well provides many personal and practical benefits. Well-managed and well-described data is easier to sort through, access, and understand, making your research project more efficient. Having a good system also reduces the frustration of data loss after a hardware failure or other accident, because you will spend less time recovering lost data or redoing your work. Well-managed data can also help prevent publication retractions, which can be an unintended consequence of poor data management when it leads to errors in data or the loss of data that supports published material.

    Changes in research expectations

    Larger changes in the research community have led to increased attention to research data management. First, research is increasingly computational, data-driven, and collaborative. As methods, instruments, and processes continue to advance, so too does the amount of data we are able to create and capture. The increasing size of data and the corresponding infrastructure needed for storage and computing require us to be more responsible, proactive data managers.

    Second, funding agencies, especially federal agencies that provide funding through tax dollars, are increasingly interested in ensuring that publications and data from funded research are openly available to the public. They’ve put policies in place that require data to be managed and shared, something we’ll talk about in another course.

    Third, emphasis is increasingly being placed on the reproducibility and reusability of research. Reproducibility means that a researcher can understand another researcher’s methods well enough to start from the same raw data or starting point and reproduce the results of the work.

    Caring for a valuable and delicate asset

    Another important reason to manage data is that data is often both a valuable asset and a very delicate one. Depending on the type of work, data can be expensive, both in the money spent on the instruments or infrastructure needed to collect it and in the time spent working with it. You can maximize the investment you make in your data by describing and sharing it so that others can reuse or build upon it. For example, if a researcher has access to a prohibitively expensive instrument, sharing the data from their project makes it available to other researchers who may not have the same resources.

    Data is also more fragile than you may imagine. It can be easy to think that digital data is somehow safer than the physical samples or notebooks we may keep in a lab, but digital data relies on hardware that physically exists somewhere in the world. Digital data lives on computers in our offices, servers in the basement, instruments in our labs, and flash drives in our backpacks. That physical hardware can be damaged by natural causes or accidents, files can be corrupted, and data formats can be rendered inaccessible by rapidly changing technology. Managing your data well can help prevent these losses.
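
One common way to notice that a file has been corrupted or changed is to record a checksum when the file is created and compare it again later. The sketch below is an illustration under stated assumptions rather than a recommended tool: it uses Python’s standard `hashlib` module, and the helper names `sha256_of` and `is_unchanged` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_checksum: str) -> bool:
    """Compare a file's current checksum with one recorded earlier."""
    return sha256_of(path) == recorded_checksum
```

In practice you might store the value returned by `sha256_of` in a text file next to the data, then call `is_unchanged` after copying the data to a new drive or restoring it from a backup. A mismatch tells you the file differs from the original, but not why, so keeping more than one copy still matters.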

    Overall benefits

    Research data management also:

    • Ensures you’re maximizing the effective use and value of your data and information assets
    • Helps continually improve the quality of the data, including its accuracy, integrity, integration, timeliness of capture and presentation, relevance, and usefulness
    • Ensures appropriate use of your data and information
    • Facilitates data sharing
    • Ensures sustainability and accessibility in the long term for re-use in science and the advancement of new discoveries

    1.4 Lesson Review

    Which of the following are considered research data? Select all that apply.

    Answer: the satellite images, the code to visualize the data, and the audio recording (options A, B, and C).

    According to the definitions of research data covered in this section, drafts for publication are not considered research data. The satellite images, the code to visualize the data, and the audio recording would all be important for another researcher to be able to understand, interpret, and reproduce your work. However, another researcher would likely have little use for your drafts for publication.

    Challenge Activity

    Choose one of the following questions to answer:

    • What are common types of data in your area of study or work? List them.
    • If you’re currently conducting research, list some of the data you will produce.

    References

    [1] Lamar Soutter Library, University of Massachusetts Medical School. “Module 2: Types, Formats, and Stages of Data.” New England Collaborative Data Management Curriculum. Licensed under CC BY-SA 4.0. https://library.umassmed.edu/resources/necdmc/index