author: Brian High
date: 2014-12-23
transition: fade
http://github.com/brianhigh/research-computing
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Two main categories of files:
- Binary
- Plain Text
Important considerations:
- layout
- filename extensions
- May be efficient (compressed)
- Often proprietary format ("opaque")
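To make the binary versus plain-text distinction concrete, here is a minimal Python sketch that stores the same number both ways. The example value and the 4-byte little-endian layout are illustrative assumptions, not something these slides specify.

```python
import struct

value = 1234567  # arbitrary example value

# Plain text: the digits are stored as readable characters.
text_bytes = str(value).encode("ascii")

# Binary: a fixed-width 4-byte little-endian integer
# (layout chosen only for illustration).
binary_bytes = struct.pack("<i", value)

print(text_bytes)    # readable in any text editor
print(binary_bytes)  # opaque unless you know the layout
print(len(text_bytes), len(binary_bytes))

# Reading the binary form back requires knowing the exact layout.
(decoded,) = struct.unpack("<i", binary_bytes)
```

The binary form is more compact here, but it can only be read back by a program that already knows how the bytes are laid out.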
Character Encodings:
- ASCII
- UTF-8 (Unicode)
ASCII formatted files are ...
- Usually open, standard, and "transparent"
Unicode formatted files are ...
- Like ASCII, but support many more character sets
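The relationship between ASCII and UTF-8 can be checked with a short Python sketch. The sample strings are made up for illustration; the non-ASCII characters are written as escapes so the snippet itself stays ASCII-only.

```python
ascii_text = "data"
unicode_text = "na\u00efve caf\u00e9"  # contains two non-ASCII characters

# ASCII characters encode to identical bytes in ASCII and in UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Non-ASCII characters need more than one byte each in UTF-8.
utf8_bytes = unicode_text.encode("utf-8")
print(len(unicode_text), "characters ->", len(utf8_bytes), "bytes")

# Trying to store non-ASCII text as ASCII fails.
try:
    unicode_text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

This is why a file that is valid ASCII is also valid UTF-8, but not the other way around.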
- Plain Text
- Self-describing
- Structured
- Standard
- Easy to parse
- Example: xlsx
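Assuming the list above describes XML-style markup (the xlsx format is built on XML), the following Python sketch shows what "self-describing" and "easy to parse" can look like in practice. The document, tag names, and values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record set; the tag names describe the values they hold.
doc = """<samples>
  <sample id="s1"><site>river</site><ph>7.2</ph></sample>
  <sample id="s2"><site>lake</site><ph>6.8</ph></sample>
</samples>"""

root = ET.fromstring(doc)

# Pull each sample into a plain tuple: (id, site, pH).
rows = [
    (s.get("id"), s.findtext("site"), float(s.findtext("ph")))
    for s in root.findall("sample")
]
print(rows)
```

Because the structure is carried by the tags, a standard-library parser is enough; no format-specific reader is needed.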
A delimiter is used to separate the fields (columns).
Delimiter may be ...
- Comma (CSV)
- Tab (TSV)
Sometimes the data records span multiple lines (rows).
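As a small illustration of delimiters and multi-line records, this Python sketch reads comma- and tab-delimited text with the standard `csv` module. The sample data is invented; note the quoted fields, one containing a comma and one spanning two lines.

```python
import csv
import io

# Invented sample data. Quoted fields may contain the delimiter
# or a line break without ending the field or the record.
csv_text = 'id,site,note\n1,river,"clear, cold"\n2,lake,"two\nlines"\n'
tsv_text = "id\tsite\n1\triver\n2\tlake\n"

csv_rows = list(csv.reader(io.StringIO(csv_text)))
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

for row in csv_rows:
    print(row)
```

The CSV text has four physical lines after the header but only two records, which is why splitting on newlines and commas by hand is fragile.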
- Layout:
- Structure data files as simple "columns and rows"
- Filename extensions:
- The letters after the (last) dot (period)
- File type associations:
- Determines which application opens a file
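The extension and association ideas above can be sketched in a few lines of Python. The `ASSOCIATIONS` table and the helper names are hypothetical; real associations are maintained by the operating system, not by a dictionary like this.

```python
import os.path

# Hypothetical association table, for illustration only.
ASSOCIATIONS = {
    ".csv": "spreadsheet",
    ".txt": "text editor",
    ".gz": "archive tool",
}

def extension(filename):
    """Return the part after the last dot (with the dot), lower-cased."""
    return os.path.splitext(filename)[1].lower()

def opener(filename):
    """Look up which (hypothetical) application would open the file."""
    return ASSOCIATIONS.get(extension(filename), "unknown")

for name in ["Results.CSV", "archive.tar.gz", "README"]:
    print(name, "->", repr(extension(name)), "->", opener(name))
```

Only the last dot counts, so `archive.tar.gz` is treated as a `.gz` file, and a name without a dot has no extension at all.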
Conversion and Pipelines:
- ETL: extract, transform, and load
- Data cleanup is often 90% of the work
- Pipelines: link utilities together to transform data streams
- Use a special character (the "pipe", |) to separate commands
- Data output from one command streams into the next
- Example utilities: cut, tr, sed, "one-liners", scripts
- Domain-Specific Utilities
- Genomics
- Markup conversion
- Conversion utilities:
- csvkit, rio
- See: http://goo.gl/H6PhfS
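The pipeline idea, where each step consumes the previous step's output as a stream, can be imitated in Python with generators. This is only a sketch: the function names are made up, the steps loosely mirror `cut` and `tr` rather than reproduce them, and the naive split does not handle quoted fields.

```python
def cut(lines, fields, delimiter=","):
    """Keep only the selected 1-based fields, loosely like `cut -d, -f...`."""
    for line in lines:
        parts = line.split(delimiter)
        yield delimiter.join(parts[i - 1] for i in fields)

def tr_upper(lines):
    """Upper-case each line, loosely like `tr a-z A-Z`."""
    for line in lines:
        yield line.upper()

def to_tsv(lines, delimiter=","):
    """Convert the delimiter to tabs (a tiny CSV-to-TSV transform)."""
    for line in lines:
        yield line.replace(delimiter, "\t")

# Invented input; in a real ETL job this would be read from a file.
raw = ["id,site,ph", "1,river,7.2", "2,lake,6.8"]

# Each stage pulls lines from the stage before it, one at a time.
result = list(to_tsv(tr_upper(cut(raw, fields=[1, 2]))))
for line in result:
    print(line)
```

Reading the nested call from the inside out gives the same order as a shell pipeline read left to right: extract columns, transform the text, then load it in the new format.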