Lesson 5: Obsolescence and Sustainability

5.1 Defining Obsolescence and Sustainability

Technology changes quickly. As it does, previously preferred file formats and media fall out of use in favor of newer, more advantageous ones. This is called obsolescence.

This matters for research data because the methods we currently use to create, store, and share data could themselves become obsolete and be replaced by new formats.

[Image: a stack of black computer floppy disks. Caption: Floppy disks, an outdated storage option.]

To limit the effects of obsolescence, we want to pick the most sustainable options for our data over time. Sustainability here means choosing formats or media that are less susceptible to changes in technology. Some tips follow.

Best practices for reducing the impact of changing technology:

In general, pick file formats that:

  • Are non-proprietary: the format is not owned by a company or manufacturer. If a file can only be opened in one particular software package, it is likely a proprietary format. The .DOCX file type you may be familiar with, produced by Microsoft Word, is a proprietary format. Common proprietary formats often go through new versions, and software upgrades are required to keep using the files. Over time, new versions may stop supporting files saved in old versions of the format, and you can lose file information or the ability to read the file entirely.
  • Have seen wide adoption in your discipline: Sometimes there isn’t a non-proprietary option for you to use, and that’s okay. If a file format is widely used in your field, it’s likely that it will continue to be supported and openable into the future.
  • Have a history of backward compatibility: new versions of the software are still able to support old file versions.

Physical devices and media:

Physical devices and media have a limited lifespan. Although the devices we use for storing data have improved greatly over the last twenty years, expect external hardware devices to last 3 to 5 years. If you use an external hard drive, plan on migrating your files every few years to prevent data loss!
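
One way to make a migration between drives safer is to confirm that each copied file is identical to its original. The sketch below is only an illustration under stated assumptions: the function names `sha256_of` and `migrate` are hypothetical, the target folder is assumed to sit outside the source folder, and SHA-256 checksums are just one reasonable choice for comparing files.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source_dir: Path, target_dir: Path) -> list:
    """Copy every file under source_dir into target_dir.

    Returns the relative paths of any copies whose checksum does not
    match the original, so an empty list means every copy verified.
    Assumes target_dir is not located inside source_dir.
    """
    mismatches = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        target_file = target_dir / relative
        target_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_file, target_file)  # copy2 also keeps timestamps
        if sha256_of(source_file) != sha256_of(target_file):
            mismatches.append(relative)
    return mismatches
```

Calling `migrate(Path("old_drive/data"), Path("new_drive/data"))` (both paths are placeholders) would copy the folder and return a list of files to re-copy or investigate. Keeping the old drive until the list comes back empty is a sensible precaution.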


5.2 File Formats

You can follow best practices for your type of data when selecting a file format to increase its sustainability. Choosing a sustainable file format will increase the likelihood that you and others will be able to access your data in the future. For each type of data, the chart below lists recommended formats, which should be your first choice, along with other acceptable formats.

Type of Data | Recommended Formats | Acceptable Formats

Quantitative tabular data with extensive metadata. A dataset with variable labels, code labels, and defined missing values, in addition to the matrix of data.

  • Proprietary formats of statistical packages, e.g. SPSS (.sav), Stata (.dta), .sas7bdat.
  • Delimited text and command (‘setup’) file (SPSS, Stata, SAS, etc.) containing metadata information.
  • Some structured text or mark-up file containing metadata information, e.g. DDI XML file.
  • SPSS portable format (.por).
  • MS Access (.mdb/.accdb).

Quantitative tabular data with minimal metadata. A matrix of data with or without column headings or variable names, but no other metadata or labeling.

  • Comma-separated values (CSV) file (.csv).
  • Tab-delimited file (.tab).
  • Including delimited text of given character set with SQL data definition statements where appropriate.
  • Delimited text of given character set – only characters not present in the data may be used as delimiters (.txt).
  • Widely-used formats: MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), OpenDocument Spreadsheet (.ods).

Geospatial data. Vector and raster data.

  • ESRI Shapefile (essential – .shp, .shx, .dbf, optional – .prj, .sbx, .sbn).
  • Geo-referenced TIFF (.tif, .tfw).
  • CAD data (.dwg).
  • Tabular GIS attribute data.
  • ESRI Geodatabase format (.mdb).
  • MapInfo Interchange Format (.mif) for vector data.
  • Keyhole Mark-up Language (.kml).
  • Adobe Illustrator (.ai), CAD data (.dxf or .svg).
  • Binary formats of GIS and CAD packages.

Qualitative data. Textual.

  • eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml).
  • Rich Text Format (.rtf).
  • Plain text data, ASCII (.txt).
  • Hypertext Mark-up Language (.html).
  • Widely-used formats: MS Word (.doc/.docx).
  • Some software-specific formats: NUD*IST, NVivo and ATLAS.ti.

Digital image data.

  • TIFF version 6 uncompressed (.tif).
  • Digital Imaging and Communications in Medicine (DICOM) (.dcm, .dcm30) – for CT/MRI data.
  • JPEG (.jpeg, .jpg) but only if created in this format.
  • TIFF (other versions) (.tif, .tiff).
  • Adobe Portable Document Format (PDF/A, PDF) (.pdf).
  • Standard applicable RAW image format (.raw).
  • Photoshop files (.psd).
  • BMP (.bmp) but only if created in this format.
  • PNG (.png) but only if created in this format.

Digital audio data.

  • Free Lossless Audio Codec (FLAC) (.flac).
  • MPEG-1 Audio Layer 3 (.mp3) if original created in this format.
  • Audio Interchange File Format (.aif).
  • Waveform Audio Format (.wav).

Digital video data.

  • MPEG-4 (.mp4).
  • OGG video (.ogv, .ogg).
  • Motion JPEG 2000 (.mj2).
  • MOV (.mov).
  • Windows Media Video (WMV) (.wmv).
  • WebM (.webm).

Documentation and scripts.

  • Rich Text Format (.rtf).
  • PDF/A or PDF (.pdf).
  • HTML (.htm).
  • OpenDocument Text (.odt).
  • R Markdown files (.rmd) (with HTML version as well).
  • Plain text (.txt).
  • Widely-used proprietary formats: MS Word (.doc/.docx), MS Excel (.xls/.xlsx).
  • XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0.

UK Data Service [1]
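
A chart like this can also be applied to an existing project folder by checking file extensions. The following is a rough sketch, not a definitive tool: the function name `flag_files` is hypothetical, and the `PREFERRED_EXTENSIONS` set is an assumed, deliberately small example that you would replace with the formats appropriate to your own data and discipline.

```python
from pathlib import Path

# Assumed example set for illustration only. Replace it with the
# formats that are recommended for your own type of data.
PREFERRED_EXTENSIONS = {".csv", ".flac", ".txt", ".tif"}


def flag_files(folder: Path, preferred=PREFERRED_EXTENSIONS) -> dict:
    """Sort the files under folder into two lists by extension.

    "preferred" holds files whose extension is in the preferred set.
    "review" holds everything else, which may be worth converting or
    documenting before long-term storage.
    """
    report = {"preferred": [], "review": []}
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            key = "preferred" if path.suffix.lower() in preferred else "review"
            report[key].append(path.relative_to(folder).as_posix())
    return report
```

A file landing in the "review" list is not necessarily a problem, since a widely adopted format in your field may be acceptable. The list is only a prompt to check each file against the chart.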

Check Your Understanding

Open, non-proprietary file formats are in less danger from obsolescence. Which of these four file formats would you use for the following types of research data?

  • .csv
  • .flac
  • .txt
  • .tif

Images

Answer: .tif (TIFF) is a more sustainable format for images. Some image formats can compress the image and lose some of the image information. The TIFF format helps avoid that.

Spreadsheets

Answer: .csv (Comma-Separated Values) is a plain-text, more sustainable format for spreadsheets or tabular data. Other tabular data formats, like .xlsx from Excel, are proprietary and rely on specific software to read them.

Text

Answer: .txt (plain text) is much more sustainable for text-based files. Other textual file formats, like .docx from Microsoft Word, include extra proprietary information and rely on specific software to read them.

Audio

Answer: .flac (Free Lossless Audio Codec) is a more sustainable format for audio files, though you may often see .wav and .mp3 also recommended.
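
Returning to the spreadsheet question, part of what makes .csv sustainable is that it is ordinary text that many tools can read and write. The sketch below illustrates this using only Python's standard `csv` module. The sample rows and the function names `to_csv_text` and `from_csv_text` are made up for illustration.

```python
import csv
import io

# Made-up sample records, used only to demonstrate the round trip.
rows = [
    {"participant_id": "P001", "age": "34", "response": "agree"},
    {"participant_id": "P002", "age": "51", "response": "disagree, strongly"},
]


def to_csv_text(records):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv_text(text):
    """Parse CSV text back into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


csv_text = to_csv_text(rows)
restored = from_csv_text(csv_text)
```

Because the result is plain text, `csv_text` could be opened in any text editor. Note that every value comes back as a string, so details such as data types and variable labels need to be recorded separately in your documentation.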

References

[1] UK Data Service, “Recommended formats,” University of Essex, University of Manchester, Jisc, UCL, and University of Edinburgh. Retrieved from https://ukdataservice.ac.uk/learning-hub/research-data-management/format-your-data/recommended-formats/.