Technology changes quickly. As it does, previously preferred file formats and media fall out of use in favor of newer, more advantageous ones. This is called obsolescence.
This is a concern when working with research data because the methods we currently use to create, store, and share data could become obsolete and be replaced by new formats.
To guard against obsolescence as best we can, we want to pick the most sustainable options for our data over time. Sustainability in this case means formats or media that are less susceptible to changes in technology. Below you’ll find some tips for this.
In general, pick file formats that:
Physical devices and media have a lifespan. Although the devices we use for storing data have improved greatly over the last twenty years, expect external hardware devices to last 3 to 5 years. If you use an external hard drive, plan on migrating your files every few years to prevent data loss!
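When you migrate files to a new drive, it helps to confirm that each copy is identical to the original. The following is a minimal Python sketch of one way to do that, using only the standard library. The function names `sha256_of` and `migrate_file` are hypothetical, chosen for illustration, and SHA-256 is simply one common checksum choice, not a requirement.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source: Path, destination: Path) -> str:
    """Copy one file to new storage and check the copy against the original."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)  # copy2 also tries to keep timestamps
    original, copied = sha256_of(source), sha256_of(destination)
    if original != copied:
        raise IOError(f"Checksum mismatch after copying {source}")
    return copied  # keep this value to re-check the file in later years
```

Storing the returned checksum alongside the data lets you re-verify the file at the next migration, even if the original drive is no longer available.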
Type of Data | Recommended Formats | Acceptable Formats |
---|---|---|
Tabular data with extensive metadata: variable labels, code labels, and defined missing values | SPSS portable format (.por); delimited text and command (‘setup’) file (SPSS, Stata, SAS, etc.); structured text or mark-up file of metadata information, e.g. DDI XML file | proprietary formats of statistical packages: SPSS (.sav), Stata (.dta), MS Access (.mdb/.accdb) |
Tabular data with minimal metadata: column headings, variable names | comma-separated values (.csv); tab-delimited file (.tab); delimited text with SQL data definition statements | delimited text (.txt) with characters not present in data used as delimiters; widely-used formats: MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf), OpenDocument Spreadsheet (.ods) |
Geospatial data: vector and raster data | ESRI Shapefile (.shp, .shx, .dbf, .prj, .sbx, .sbn optional); geo-referenced TIFF (.tif, .tfw); CAD data (.dwg); tabular GIS attribute data; Geography Markup Language (.gml) | ESRI Geodatabase format (.mdb); MapInfo Interchange Format (.mif) for vector data; Keyhole Mark-up Language (.kml); Adobe Illustrator (.ai), CAD data (.dxf or .svg); binary formats of GIS and CAD packages |
Textual data | Rich Text Format (.rtf); plain text, ASCII (.txt); eXtensible Mark-up Language (.xml) text according to an appropriate Document Type Definition (DTD) or schema | Hypertext Mark-up Language (.html); widely-used formats: MS Word (.doc/.docx); some software-specific formats: NUD*IST, NVivo and ATLAS.ti |
Image data | TIFF 6.0 uncompressed (.tif) | JPEG (.jpeg, .jpg, .jp2) if original created in this format; GIF (.gif); TIFF other versions (.tif, .tiff); RAW image format (.raw); Photoshop files (.psd); BMP (.bmp); PNG (.png); Adobe Portable Document Format (PDF/A, PDF) (.pdf) |
Audio data | Free Lossless Audio Codec (FLAC) (.flac) | MPEG-1 Audio Layer 3 (.mp3) if original created in this format; Audio Interchange File Format (.aif); Waveform Audio Format (.wav) |
Video data | MPEG-4 (.mp4); OGG video (.ogv, .ogg); motion JPEG 2000 (.mj2) | AVCHD video (.avchd) |
Documentation and scripts | Rich Text Format (.rtf); PDF/UA, PDF/A or PDF (.pdf); XHTML or HTML (.xhtml, .htm); OpenDocument Text (.odt) | plain text (.txt); widely-used formats: MS Word (.doc/.docx), MS Excel (.xls/.xlsx); XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0 |
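If you have many files to review, a small script can flag which ones already use a format from the table. The Python sketch below is illustrative only: the `FORMATS` dictionary and the `classify` function are hypothetical names, and the dictionary covers just three rows of the table (tabular data with minimal metadata, audio, and video). It checks the file extension only, so it cannot tell you whether the file contents are valid or which version of a format was used.

```python
from pathlib import Path

# Partial, illustrative subset of the table above, keyed by data type.
# Extend it with the remaining rows as needed.
FORMATS = {
    "tabular": {
        "recommended": {".csv", ".tab"},
        "acceptable": {".txt", ".xls", ".xlsx", ".mdb", ".accdb", ".dbf", ".ods"},
    },
    "audio": {
        "recommended": {".flac"},
        "acceptable": {".mp3", ".aif", ".wav"},
    },
    "video": {
        "recommended": {".mp4", ".ogv", ".ogg", ".mj2"},
        "acceptable": {".avchd"},
    },
}


def classify(filename: str, data_type: str) -> str:
    """Return 'recommended', 'acceptable', or 'not listed' for a file name."""
    suffix = Path(filename).suffix.lower()  # compare extensions case-insensitively
    for status, extensions in FORMATS[data_type].items():
        if suffix in extensions:
            return status
    return "not listed"
```

For example, `classify("survey.xlsx", "tabular")` returns `"acceptable"`, which suggests exporting a `.csv` copy for long-term storage.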
Open, non-proprietary file formats are in less danger from obsolescence. Which file format would you use for each of the following types of research data?
Images
Spreadsheets
Text
Audio
[1] UK Data Service, “Recommended formats,” University of Essex, University of Manchester and Jisc. Retrieved from https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats.