Data Files and Manipulation

author: Brian High
date: 2014-12-23
transition: fade

Research Computing and Data Management

http://github.com/brianhigh/research-computing

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Data File Types

Two main categories of files:

  • Binary
  • Plain Text

Important considerations:

  • layout
  • filename extensions

Binary Files

  • May be efficient (compressed)
  • Often proprietary format ("opaque")
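
A minimal sketch of the trade-off above, assuming three made-up measurement values: the same numbers are stored once as plain text and once as packed binary. The binary form is smaller, but it is unreadable unless you already know its layout (here, three little-endian 8-byte doubles).

```python
import struct

# Hypothetical sample values, invented for illustration only.
values = [3.141592653589793, 2.718281828459045, 1.4142135623730951]

# Plain text: one human-readable number per line.
text_bytes = "\n".join(repr(v) for v in values).encode("ascii")

# Binary: three little-endian 8-byte doubles, back to back.
binary_bytes = struct.pack("<3d", *values)

print(len(text_bytes), "bytes as text;", len(binary_bytes), "bytes as binary")
print(binary_bytes[:8])  # opaque unless the layout is known

# Knowing the layout ("<3d") makes the data readable again.
decoded = list(struct.unpack("<3d", binary_bytes))
print(decoded)
```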

Plain Text Files

Character Encodings:

  • ASCII
  • UTF-8 (Unicode)

ASCII formatted files are ...

  • Usually open, standard, and "transparent"

Unicode formatted files are ...

  • Like ASCII, but support many more character sets
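
A small sketch of the relationship between the two encodings, using two example strings chosen for illustration. Text limited to ASCII characters produces identical bytes in both encodings, while a character outside ASCII needs UTF-8.

```python
text_ascii = "cafe"
text_unicode = "caf\u00e9"  # "café", with an accented e

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
print(text_ascii.encode("ascii"))
print(text_ascii.encode("utf-8"))

# The accented character takes two bytes in UTF-8.
utf8_bytes = text_unicode.encode("utf-8")
print(utf8_bytes)

# ASCII has no code for the accented character, so encoding fails.
try:
    text_unicode.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as err:
    ascii_ok = False
    print("not representable in ASCII:", err.reason)
```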

Example: XML

  • Plain Text
  • Self-describing
  • Structured
  • Standard
  • Easy to parse
  • Example: xlsx (a zip archive of XML files)
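
To illustrate "self-describing" and "easy to parse", here is a minimal sketch using Python's standard library. The element names (`samples`, `sample`, `site`, `ph`) and the values are hypothetical and not taken from any real dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tags themselves describe the data.
xml_text = """<samples>
  <sample id="s1"><site>north</site><ph>7.2</ph></sample>
  <sample id="s2"><site>south</site><ph>6.8</ph></sample>
</samples>"""

root = ET.fromstring(xml_text)

# Turn each <sample> element into a plain dictionary.
records = [
    {
        "id": sample.get("id"),
        "site": sample.findtext("site"),
        "ph": float(sample.findtext("ph")),
    }
    for sample in root.findall("sample")
]
print(records)
```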

Delimited Text Files

A delimiter is used to separate the fields (columns).

Delimiter may be ...

  • Comma (CSV)
  • Tab (TSV)

Sometimes the data records span multiple lines (rows).
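
A brief sketch of reading delimited text with Python's `csv` module, using invented sample data. The first record's `notes` field is quoted and contains both a comma and a line break, so one record spans two physical lines. Splitting on commas or newlines by hand would break it, which is why a real parser is worth using.

```python
import csv
import io

# Hypothetical data: the quoted field contains a comma and a newline.
csv_text = 'id,site,notes\n1,north,"clear water,\nno odor"\n2,south,turbid\n'
tsv_text = "id\tsite\tnotes\n1\tnorth\tclear water\n"

# Comma is the default delimiter.
csv_rows = list(csv.reader(io.StringIO(csv_text, newline="")))

# For tab-separated data, only the delimiter changes.
tsv_rows = list(csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t"))

print(len(csv_rows), "records (header included) from 4 physical lines")
print(csv_rows[1])
print(tsv_rows[1])
```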

Data File Considerations

  • Layout:
    • Structure data files as simple "columns and rows"
  • Filename extensions:
    • The letters after the (last) dot (period)
  • File type associations:
    • Determines which application opens a file
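
As a rough illustration of "the letters after the (last) dot", the sketch below inspects a few made-up filenames with `pathlib`. It also asks the standard `mimetypes` module to guess a type from the name alone. That lookup resembles a file type association, but the results depend on the system's type database, so none are shown here.

```python
import mimetypes
from pathlib import Path

# Hypothetical filenames, chosen to show the edge cases.
names = ["results.csv", "archive.tar.gz", "notes.final.txt", "README"]

extensions = {}
for name in names:
    path = Path(name)
    # .suffix is only the part after the last dot; .suffixes lists them all.
    extensions[name] = (path.suffix, path.suffixes)
    # guess_type() looks only at the name, never at the file contents.
    print(name, path.suffix, path.suffixes, mimetypes.guess_type(name))
```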

Data File Manipulation

Conversion and Pipelines:

  • ETL: extract, transform, and load
  • Data cleanup is often the bulk of the work (a common rule of thumb says 90%)
  • Pipelines: link utilities together to transform data streams
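
The three ETL steps can be sketched as three small functions. This is only an illustration under stated assumptions: the input text, the `readings` table, and the cleanup rules (trim and lowercase site names, drop rows with a non-numeric pH) are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical messy input: stray spaces, mixed case, a missing value.
raw = "site,ph\n North ,7.2\nsouth,n/a\nSOUTH,6.8\n"

def extract(text):
    """Extract: parse delimited text into dictionaries."""
    return csv.DictReader(io.StringIO(text, newline=""))

def transform(rows):
    """Transform: normalize site names and drop rows with a non-numeric pH."""
    for row in rows:
        site = row["site"].strip().lower()
        try:
            ph = float(row["ph"])
        except ValueError:
            continue  # skip records that cannot be cleaned
        yield site, ph

def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute("CREATE TABLE readings (site TEXT, ph REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
rows = conn.execute("SELECT site, ph FROM readings ORDER BY ph").fetchall()
print(rows)
```

Most of the code sits in the transform step, which is where the cleanup effort noted above tends to go.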

Command Line Interface (CLI) Pipes

  • Use a special character (e.g., the "pipe", |) to connect commands
  • Data output from one command streams into the next
  • Example utilities: cut, tr, sed, "one-liners", scripts
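
The same streaming idea can be sketched in Python with generators, where each stage consumes the output of the one before it. The function names `select_fields` and `substitute` are hypothetical stand-ins that only loosely mimic `cut` and `sed`, and the input lines are invented.

```python
import re

# Hypothetical input lines, as they might arrive from a file or stdin.
lines = ["1,North Site,7.2", "2,South Site,6.8"]

def select_fields(rows, indexes, delimiter=","):
    """Keep only the chosen fields of each line (loosely like cut)."""
    for row in rows:
        parts = row.split(delimiter)
        yield delimiter.join(parts[i] for i in indexes)

def substitute(rows, pattern, replacement):
    """Apply a regex substitution to each line (loosely like sed)."""
    for row in rows:
        yield re.sub(pattern, replacement, row)

# Roughly comparable shell pipeline: cut -d, -f2,3 | sed 's/ Site//'
pipeline = substitute(select_fields(lines, [1, 2]), r" Site", "")
result = list(pipeline)
print(result)
```

Because generators are lazy, each line passes through every stage before the next line is read, much as data streams through a shell pipe.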

File Conversion Utilities

  • Domain-specific utilities:
    • Genomics
    • Markup conversion