Section 4: Data Use

Introduction

As a researcher who will use and/or create data, you are responsible for its care and management. Complying with the policies and regulations your data may be subject to is an important part of caring for it properly. Another important part is ensuring that the data can be reused. Below we highlight some of the areas to consider when managing and working responsibly with data.

Topics in this Section:

  1. Data as Intellectual Property and Data Citation
  2. Storage and Backup
  3. General Security
  4. File Naming and Organization
  5. Machine Readability
  6. File Formats

1. Data as Intellectual Property and Data Citation

Much like scholarly publications, research data is a scholarly output of a researcher’s work. Because of this, it is important to understand when research data is considered intellectual property, as well as how to cite it correctly so that it can contribute to the scholarly discourse. In this section we’ll provide a brief introduction to copyright and licensing, data citation, and Digital Object Identifiers (DOIs).

In the United States, research data that is considered factual cannot be copyrighted. However, the associated metadata, databases, figures, software, or other work that could be considered a “creative” output may be an asset whose reuse or redistribution you want to control by applying an appropriate license. Licenses define how others may interact with, reuse, modify, or redistribute your work. Choosing a license for your data helps ensure that other researchers use it appropriately.

For scientific or factual data, many researchers choose to apply a Creative Commons Zero (CC0) license, so that it is clear the data is being distributed freely. At most, apply a CC BY (Creative Commons Attribution) license.

For creative works, there are varying levels of Creative Commons licenses that you can choose to apply. The license levels build on one another and range from unrestricted to fairly restrictive terms. The Creative Commons website has a good guide that can help you decide what restrictions you may want to apply, including requirements for attribution or non-commercial use. To learn more about Creative Commons licenses and how they differ from copyright, view Lesson 2 of the Copyright and Fair Use micro-course.

For software or code, there are multiple choices. Select a license you are comfortable with, such as one of the GNU licenses, the MIT license, or the Apache licenses. For databases and their contents, the Open Data Commons licenses can be used. Databases are a more complicated situation: the database itself may have copyright protection while the data it contains does not, or the data may carry a separate copyright. We recommend reviewing Cornell’s guidance on licensing databases.

You should cite datasets for the same reasons you cite books and journal articles: for dataset creators to receive appropriate credit for their work, and to make clear the antecedents to your research.

Data citation standards may vary between disciplines and some professional organizations, academic journals, and repositories may also have guidance on preferred data citation formats. However, in general, the information you capture in a data citation is similar to the information included in a citation for any other work.

The Inter-university Consortium for Political and Social Research (ICPSR) suggests the minimum elements of data citation as:

  • Title
  • Author
  • Date
  • Version
  • Persistent identifier (such as the Digital Object Identifier, Uniform Resource Name (URN), or Handle System)
    • Note: Persistent identifiers are preferred, but if not available a URL will suffice.
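
As a rough illustration, the minimum elements above could be captured in a small record and turned into a citation string. This is only a sketch: the class name, field names, ordering, and the sample values are all hypothetical, and your discipline, journal, or repository may prefer a different format.

```python
from dataclasses import dataclass


@dataclass
class DataCitation:
    """The minimum citation elements listed above (field names are illustrative)."""
    author: str
    date: str
    title: str
    version: str
    identifier: str  # persistent identifier if available, otherwise a URL

    def as_text(self) -> str:
        # One possible ordering; check the style your journal or repository prefers.
        return (f"{self.author} ({self.date}). {self.title} "
                f"(Version {self.version}). {self.identifier}")


# Placeholder values only; this is not a real dataset or a real DOI.
citation = DataCitation(
    author="Doe, J.",
    date="2020",
    title="Example Survey of Campus Data Practices",
    version="1.0",
    identifier="https://doi.org/10.xxxx/example",
)
print(citation.as_text())
```

Keeping the elements as separate fields makes it easier to reformat the same citation later if a publisher asks for a different style.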

As mentioned above, a persistent identifier is a useful piece of information for data citation. A commonly used persistent identifier for research data is the DOI. A DOI is a series of alphanumeric characters that serves as a unique identifier for a specific publication, dataset, or other digital object. The DOI for that object won’t change over time, as a URL might if the web page is moved or the website is changed. Instead, the object’s current URL is recorded alongside the DOI and can be updated over time.

A DOI makes a dataset easier for other researchers to locate, access, and cite. Reliable location, identification, and citation by others are critical for enabling reuse of your data and replication of your work, and for tracking the impact of your research.

A DOI has to be provided by a DOI Registration Agency; however, many publishers, repositories, and institutions work with such agencies in order to provide DOIs for their communities. When sharing your data, we recommend checking with your publisher or repository to see if a DOI is provided for you.
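
Because a DOI stays fixed while the underlying URL can change, a citation usually links through the public resolver at https://doi.org/ rather than to the landing page directly. The sketch below shows one way to turn a DOI written in different styles into a resolver link. The function name is hypothetical, and the pattern check is a loose heuristic we have assumed, not an official validation rule.

```python
import re

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Return a resolver link for a DOI written with or without a prefix."""
    cleaned = doi.strip()
    # Drop a leading "doi:" label or an existing resolver address, if present.
    cleaned = re.sub(r"^(doi:\s*|https?://(dx\.)?doi\.org/)", "", cleaned,
                     flags=re.IGNORECASE)
    # Loose sanity check (an assumption): "10.", some digits, a slash, a suffix.
    if not re.match(r"^10\.\d{4,9}/\S+$", cleaned):
        raise ValueError(f"Does not look like a DOI: {doi!r}")
    return DOI_RESOLVER + cleaned


print(doi_to_url("doi:10.1000/182"))
```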


2. Storage and Backup

Let’s define storage and backup, both essential practices for ensuring the care and keeping of your data.

Storage is the act of keeping your data in a secure location that you can access readily. Files in storage should be the working copies of your files that you access and change regularly.

Backup is the practice of keeping additional copies of your data in physical or cloud locations separate from your files in storage. Backup copies are the copies you would turn to in case of data loss or when you need to access previous versions of your work.

Backup can be done manually or automatically – automatic is best and your IT contacts can help you identify a good solution. If done manually, ensure you have a schedule and strategy for backing up your data.

Good storage and backup practices help protect your data and research from losses due to hardware failure, natural disaster, or file corruption. You spend a lot of time collecting your data, so ensuring you have a good system for backing up your data will prevent you from having to spend time trying to recover your files, recollect data, or redo any cleaning or analysis. Other benefits:

  • A granting agency may require that you retain data for a given period and may ask you to explain in a data plan how you will store and back it up.
  • Storing and backing up your data ensures that it will be there when you need to use it for publications, theses, or grant proposals.
  • Good preservation practices help make your data available to researchers in your lab/research group, department, or discipline in the future.

Storing and using sensitive data requires extra protections, both to comply with important legal obligations and to appropriately protect human subjects’ information.

Here at UW-Madison, if you have sensitive or restricted data, you should only be using approved tools for that type of data. You should follow campus guidance regarding handling sensitive data and reach out to your departmental IT or DoIT for guidance on what approved tools are available to you for your data type.

The Research Data Services website hosts a comprehensive comparison of UW-Madison storage and backup options. Note that some of these options are not suitable for storing sensitive data. Your departmental IT or DoIT can help you come up with appropriate storage and backup solutions for your data.

If you are looking at cloud applications for your data storage, backup, or computing, always read the terms of service. It is important that you understand the permissions you are granting to the company that supports the application and how your data might be shared. Part of protecting your data is understanding the risks to it, including risks that could come through your storage and backup tools.

For those going to school or working at UW-Madison, we recommend that you always use your institutionally provided accounts over your personal accounts (for example, Google Drive). UW-Madison has an agreement with these providers to provide more intellectual property protections than your personal accounts would provide.

A good rule of thumb to remember is LOCKSS, or Lots of Copies Keep Stuff Safe. However, you don’t need to go overboard with the number of copies you have. Typically, the rule to follow is the rule of three: three copies, in at least two physically separate locations, on more than one type of storage hardware. This might look like:

  • A copy in active storage. This is a copy you are regularly accessing and working on during your research. It will likely be on your computer or a lab’s shared network drive.
  • A second copy on a different device on- or off-site, such as an external hard drive in your office or a backup server provided by your IT department.
    • Note: Be cautious about using USB flash drives to back up your data. They have some advantages that can make them an appealing option: they’re affordable, they’re convenient, and you probably own at least one already. However, their portability makes flash drives easy to misplace, have stolen, or accidentally break.
  • A third copy, preferably off-site. This might be on a cloud application like Box, Google Drive, or another appropriate cloud solution.

The goal here is to keep your backups and working storage physically far enough apart that a single disaster, such as a fire or flood in the lab where you’re doing research, cannot destroy every copy. If your backups are all housed together, one event could ruin both the primary copy of your data and any backups kept in the same building. Having at least one off-site backup increases the chances that you can restore your data if a disaster happens.
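
If you back up manually, a short script can make the routine repeatable. The following is a minimal sketch rather than a recommended tool: it copies one file into each backup location you give it and compares checksums to confirm that every copy matches the original. The function names and the date-prefixed naming of the copies are our own assumptions, and the script is not appropriate for sensitive data unless the destinations are approved for it.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dirs: list[Path]) -> list[Path]:
    """Copy `source` into each backup directory and verify every copy."""
    copies = []
    for directory in backup_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        # Prefix with today's date so copies made on different days sit side by side.
        target = directory / f"{date.today():%Y%m%d}_{source.name}"
        shutil.copy2(source, target)
        if sha256_of(target) != sha256_of(source):
            raise IOError(f"Backup copy does not match the original: {target}")
        copies.append(target)
    return copies
```

For the rule of three, you might call `back_up` with your working file and two destinations, for example a folder on an external drive and a folder synced to an institutionally provided cloud account. Those destinations are placeholders for whatever your IT contacts recommend.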


3. General Security

Your departmental IT and DoIT can help identify the most appropriate security solutions for your research data. As mentioned above, sensitive data requires extra security measures and you should only use approved tools.

However, there are also some day-to-day basic security measures you can take to protect your non-sensitive research data.

  • Keep your computer software and applications up-to-date.
  • Use strong passwords, never reuse a password across accounts, and store passwords securely.
  • Limit access: Regardless of the storage and backup solutions you choose, limiting access to your data is an easy way to provide an extra layer of security.
    • Ways that you can do this include: limit physical access to data and storage solutions by keeping offices locked or restricted as appropriate, remove old collaborators who no longer need access from shared solutions, and don’t travel with your data on a physical device if you can avoid it.
  • DoIT provides some other security tips on their website including using anti-virus software, using a VPN, using a firewall, preventing device theft, and avoiding phishing attempts and suspicious links.

You can find more information about security and tools available to you on the DoIT website.
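
As one small example of the password advice above, a strong random password can be generated with Python’s standard `secrets` module. This is a sketch under our own assumptions: the function name, the default length of 20, and the minimum of 12 are illustrative choices, not campus policy, and some systems reject certain punctuation characters.

```python
import secrets
import string


def generate_password(length: int = 20) -> str:
    """Return a random password drawn from letters, digits, and punctuation."""
    if length < 12:
        # Illustrative minimum only; follow your institution's password policy.
        raise ValueError("Choose a length of at least 12 characters.")
    alphabet = string.ascii_letters + string.digits + string.punctuation
    # secrets.choice uses a cryptographically secure random source.
    return "".join(secrets.choice(alphabet) for _ in range(length))


print(generate_password())
```

A generated password is only useful if you store it securely, for instance in a password manager approved by your IT department.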


4. File Naming and Organization

Having a good file naming and organization method is one of the simplest things you can do to make a huge impact on your data management! However, it can also be one of the hardest things to change in your data management practices, because it’s often something we do by hand and changing our personal habits can be difficult.

Though it can be difficult to implement, a good file naming convention and folder organization method can quickly improve your research process. It makes your data easier to search through and makes it easier to distinguish similar files or versions from one another. It also provides a built-in description of the contents of each file and can make it easier to share documents with collaborators, as they’ll be able to find and understand the files.

One of the most important things for file naming is to develop a naming convention, a template of standard information you’ll use in most file names, and to always use that convention anytime you have multiple related files in a folder. Without a set convention, you may end up recording haphazard information, or not capturing enough important information, each time you create a file. This will make it harder to remember what keywords you can use to search for the file.

There is not one recommended naming convention that will work for everyone. Each project and person is different, but below we’ve laid out some suggestions for creating a file naming convention.

For more information on how to create a file naming convention, visit Lesson 2 of the Research Data Management Micro-course.
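
To make the idea of a template concrete, here is a minimal sketch of one possible convention: date, project, short description, and a two-digit version number, joined by underscores. The convention itself, the function name, and the sample values are hypothetical. Choose the elements that matter for your own project.

```python
import re
from datetime import date


def build_file_name(project: str, description: str, when: date,
                    version: int, extension: str) -> str:
    """Build a file name as YYYYMMDD_project_description_vNN.ext."""
    def clean(text: str) -> str:
        # Replace spaces and special characters with hyphens for portability.
        return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")

    return (f"{when:%Y%m%d}_{clean(project)}_{clean(description)}"
            f"_v{version:02d}.{extension.lstrip('.')}")


name = build_file_name("SoilStudy", "site 4 moisture readings",
                       date(2021, 3, 9), 2, ".csv")
print(name)  # 20210309_SoilStudy_site-4-moisture-readings_v02.csv
```

Putting the date first in year-month-day order means files sort chronologically, and zero-padded version numbers keep v02 ahead of v10.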

Another key component of an effective file management strategy is establishing a well-organized hierarchical folder structure to go along with your file naming conventions. While there’s no easy answer as to how many folders you should have or how best to organize them, the trick is to create a structure that balances breadth and depth.

Try to limit the number of top-level folders you create, and try to limit the number of folders nested within those as well. You don’t want so many layers in your hierarchy that reaching the actual data files becomes difficult, but you also don’t want so many files within each folder that finding your data becomes difficult. The micro-course on research data management covers hierarchical organization and folder best practices in more detail.
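
One way to check the balance between breadth and depth is to scan an existing project folder and flag anything that looks too deep or too crowded. The sketch below does this with the standard library. The thresholds (four levels, fifty files) and the function name are arbitrary assumptions for illustration, so adjust them to what feels workable for your project.

```python
import os
from pathlib import Path

MAX_DEPTH = 4               # hypothetical limit on nesting below the project root
MAX_FILES_PER_FOLDER = 50   # hypothetical limit on files in a single folder


def audit_folders(root: Path, max_depth: int = MAX_DEPTH,
                  max_files: int = MAX_FILES_PER_FOLDER) -> list[str]:
    """Return warnings for folders that are nested too deeply or hold too many files."""
    warnings = []
    for current, _subfolders, files in os.walk(root):
        # Depth is the number of folder levels between the root and this folder.
        depth = len(Path(current).relative_to(root).parts)
        if depth > max_depth:
            warnings.append(f"too deep ({depth} levels): {current}")
        if len(files) > max_files:
            warnings.append(f"too many files ({len(files)}): {current}")
    return warnings
```

Calling `audit_folders(Path("my_project"))`, where `my_project` stands in for your own project folder, returns a list of messages you can review before reorganizing.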


5. Machine Readability

For a computer to read and process your data, the data you collect must be in a machine-readable format. To ensure that your data can still be read and processed over time, choose formats that are less susceptible to changes in technology. Here are some quick tips:

Best practices for reducing the impact of changing technology:

In general, pick file formats that:

  • Are non-proprietary: the format is not owned by a company or manufacturer. If a file can only be opened in one particular software package, it’s likely a proprietary format. The legacy .DOC file type produced by older versions of Microsoft Word is one example. Common proprietary formats often have new versions, and software upgrades are required to use the files. Over time, new versions of the software may stop opening old versions of the format, and you can lose file information or the ability to read the file entirely.
  • Have seen wide adoption in your discipline: Sometimes there isn’t a non-proprietary option for you to use and that’s okay. If a file format is widely used in your field, it’s likely that it will continue to be supported and openable into the future.
  • Have a history of backward compatibility: new versions of the software are still able to support old file versions.

Physical devices and media:

Physical devices and media have a lifespan. While the devices we use for storing data have improved greatly over the last twenty years, expect external hardware devices to have a lifespan of 3 to 5 years. If you use an external hard drive, plan on migrating your files every few years to prevent data loss!


6. File Formats

For each type of data, the recommended formats are listed first, followed by other acceptable formats.

Tabular data with extensive metadata (variable labels, code labels, and defined missing values)

  • Recommended formats:
    • SPSS portable format (.por)
    • delimited text and command (‘setup’) file (SPSS, Stata, SAS, etc.)
    • structured text or mark-up file of metadata information, e.g. DDI XML file
  • Acceptable formats:
    • proprietary formats of statistical packages: SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb)

Tabular data with minimal metadata (column headings, variable names)

  • Recommended formats:
    • comma-separated values (.csv)
    • tab-delimited file (.tab)
    • delimited text with SQL data definition statements
  • Acceptable formats:
    • delimited text (.txt) with characters not present in data used as delimiters
    • widely-used formats: MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf), OpenDocument Spreadsheet (.ods)

Geospatial data (vector and raster data)

  • Recommended formats:
    • ESRI Shapefile (.shp, .shx, .dbf, .prj, .sbx, .sbn optional)
    • geo-referenced TIFF (.tif, .tfw)
    • CAD data (.dwg)
    • tabular GIS attribute data
    • Geography Markup Language (.gml)
  • Acceptable formats:
    • ESRI Geodatabase format (.mdb)
    • MapInfo Interchange Format (.mif) for vector data
    • Keyhole Mark-up Language (.kml)
    • Adobe Illustrator (.ai), CAD data (.dxf or .svg)
    • binary formats of GIS and CAD packages

Textual data

  • Recommended formats:
    • Rich Text Format (.rtf)
    • plain text, ASCII (.txt)
    • eXtensible Mark-up Language (.xml) text according to an appropriate Document Type Definition (DTD) or schema
  • Acceptable formats:
    • Hypertext Mark-up Language (.html)
    • widely-used formats: MS Word (.doc/.docx)
    • some software-specific formats: NUD*IST, NVivo and ATLAS.ti

Image data

  • Recommended formats:
    • TIFF 6.0 uncompressed (.tif)
  • Acceptable formats:
    • JPEG (.jpeg, .jpg, .jp2) if original created in this format
    • GIF (.gif)
    • TIFF other versions (.tif, .tiff)
    • RAW image format (.raw)
    • Photoshop files (.psd)
    • BMP (.bmp)
    • PNG (.png)
    • Adobe Portable Document Format (PDF/A, PDF) (.pdf)

Audio data

  • Recommended formats:
    • Free Lossless Audio Codec (FLAC) (.flac)
  • Acceptable formats:
    • MPEG-1 Audio Layer 3 (.mp3) if original created in this format
    • Audio Interchange File Format (.aif)
    • Waveform Audio Format (.wav)

Video data

  • Recommended formats:
    • MPEG-4 (.mp4)
    • OGG video (.ogv, .ogg)
    • motion JPEG 2000 (.mj2)
  • Acceptable formats:
    • AVCHD video (.avchd)

Documentation and scripts

  • Recommended formats:
    • Rich Text Format (.rtf)
    • PDF/UA, PDF/A or PDF (.pdf)
    • XHTML or HTML (.xhtml, .htm)
    • OpenDocument Text (.odt)
  • Acceptable formats:
    • plain text (.txt)
    • widely-used formats: MS Word (.doc/.docx), MS Excel (.xls/.xlsx)
    • XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0

File Formats [1]
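
The recommendations above can also be turned into a simple lookup, which is handy when you want to check a batch of files before depositing them. The sketch below encodes only three of the data types (tabular data with minimal metadata, audio, and video) and matches on file extension alone. The dictionary keys and function name are our own, and an extension does not guarantee what a file actually contains.

```python
from pathlib import Path

# Extensions taken from the format recommendations above; the keys are illustrative.
FORMATS = {
    "tabular": {  # tabular data with minimal metadata
        "recommended": {".csv", ".tab"},
        "acceptable": {".txt", ".xls", ".xlsx", ".mdb", ".accdb", ".dbf", ".ods"},
    },
    "audio": {
        "recommended": {".flac"},
        # .mp3 is acceptable only if the original was created in that format.
        "acceptable": {".mp3", ".aif", ".wav"},
    },
    "video": {
        "recommended": {".mp4", ".ogv", ".ogg", ".mj2"},
        "acceptable": {".avchd"},
    },
}


def classify(file_name: str, data_type: str) -> str:
    """Return 'recommended', 'acceptable', or 'not listed' for a file name."""
    extension = Path(file_name).suffix.lower()
    for status, extensions in FORMATS[data_type].items():
        if extension in extensions:
            return status
    return "not listed"


print(classify("survey_responses.csv", "tabular"))   # recommended
print(classify("survey_responses.xlsx", "tabular"))  # acceptable
```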

References

[1] UK Data Service, “Recommended formats”, University of Essex, University of Manchester and Jisc. Retrieved from https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats