datafiles
- Data File Types: Binary vs. plain text
- Structured, Semi-structured, and Unstructured data
- Delimited and Multi-Line file formats
There are essentially two main categories of digital file types, binary and plain text.
If a file is binary, the file just contains "zeros and ones". While this is technically true of any digital file stored within a binary computer system, the contents of a binary file do not conform to any standard character encoding system. The format may be highly efficient for storage or processing, but it is essentially opaque: by simply looking at a binary file’s contents, you can’t really know what the format is or how to read it.[1] Examples of binary files are database files, multimedia files, and compressed files (such as zip files).
Plain text files, on the other hand, are composed of characters. Typically they are ASCII or Unicode characters represented by one or more bytes, where a byte is (generally) 8 bits. A bit can be considered either zero (off) or one (on). Plain text file formats are usually open and standard. Examples are web pages (HTML), XML, and CSV (comma-separated values) data files.
Filenames generally have an extension, which is the part at the end ("suffix") of the filename, consisting of the last dot (.) and the characters that follow it.[2]
Examples of binary filename extensions for images are .png and .jpeg. To launch "executable" programs on Windows systems you will often launch an .exe file. The .dmg ("disk image") filename extension is used on OS X. Common extensions for binary data files are .xls and .sas7bdat.
Plain text file formats for data files include .csv, .tsv, .txt, .xml, and .json, among others. Program source code is usually stored in plain text files, with extensions such as .R, .py, .pl, .c, .sh, .bat, and .do.
The extension is used to determine which "default application" should open the file. Within the operating system, each extension is mapped to a default application. Mappings such as these are called file type associations.[3]
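The rule for finding an extension can be sketched in Python with the standard pathlib module. The file names and the small association table here are hypothetical and only illustrate the idea; a real operating system keeps its own mapping.

```python
from pathlib import Path

# Hypothetical file type associations, for illustration only.
ASSOCIATIONS = {
    ".csv": "spreadsheet program",
    ".png": "image viewer",
    ".py": "text editor",
}


def default_application(filename):
    """Look up a default application by the file's extension."""
    # Path.suffix is the last dot and the characters that follow it.
    extension = Path(filename).suffix.lower()
    return ASSOCIATIONS.get(extension, "no association")


for name in ["pie.png", "survey.CSV", "archive.tar.gz", "README"]:
    print(name, repr(Path(name).suffix), default_application(name))
```

Note that only the last dot counts, so a name like archive.tar.gz has the extension .gz, and a name without a dot has no extension at all.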
When viewing the raw contents of files, whether they are binary or text files, we will often make use of a hexadecimal dump.
Hexadecimal is a base-16 number system with digits 0-F:
0 1 2 3 4 5 6 7 8 9 A B C D E F
Whereas binary has two possible digits, 0 and 1, hexadecimal has 16: the ten decimal digits plus the letters A-F. One hexadecimal digit represents four bits, so a byte can always be written as two hexadecimal digits.
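A minimal Python sketch of the same idea, showing one value in decimal, binary, and hexadecimal. The value 181 is an arbitrary example chosen to fit in a single byte.

```python
# Show one byte value in decimal, binary, and hexadecimal notation.
value = 181  # arbitrary example value that fits in one byte

binary_text = format(value, "08b")  # eight binary digits
hex_text = format(value, "02X")     # two hexadecimal digits

print(value, binary_text, hex_text)

# Each hexadecimal digit stands for exactly four bits.
high_bits, low_bits = binary_text[:4], binary_text[4:]
print(high_bits, "->", format(int(high_bits, 2), "X"))
print(low_bits, "->", format(int(low_bits, 2), "X"))
```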
Let’s "dump" files in "hex" with hexdump…
$ hexdump -C -n 64 filename
Where the options we are using in this example are:
- -C = display in hex and ASCII
- -n 64 = show only the first 64 bytes
- filename = name of file to view
In this example, we will view the first 64 bytes of an SVG image file. The file format stores information about the image in text, even though the file is displayed as a graphical image. Our filename is pie.svg.
$ hexdump -C -n 64 pie.svg
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 75 74  |.0" encoding="ut|
00000020  66 2d 38 22 3f 3e 0a 3c  21 44 4f 43 54 59 50 45  |f-8"?>.<!DOCTYPE|
00000030  20 73 76 67 20 50 55 42  4c 49 43 20 22 2d 2f 2f  | svg PUBLIC "-//|
00000040
We see a column of numbers on the left which show, in hexadecimal, the offset of the first byte on each line. In the center of the output, each line shows the hexadecimal values of 16 bytes, and on the right is the text equivalent (in "ASCII") of those byte values.
We can see that this is an "XML" document with a version number, and the character encoding is shown as "UTF-8". The document type ("DOCTYPE") is "svg". All of this is contained in XML tags, similar in structure to HTML (the language of most web pages). The hexadecimal numbers correlate to the ASCII characters because the first 128 characters of the UTF-8 encoding scheme are the same as the ASCII character set. We go into more detail on this matter later in this chapter.
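To make the layout of the dump concrete, here is a rough Python sketch that formats bytes in a similar way. The function name hexdump_c is made up for this illustration and is not part of hexdump; it only approximates the tool's output. The sample is the first 48 bytes of the SVG file as shown in the dump.

```python
def hexdump_c(data, width=16):
    """Return lines resembling `hexdump -C` output for a bytes object."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Hex values are printed in two groups of eight bytes.
        left = " ".join(f"{b:02x}" for b in chunk[:8])
        right = " ".join(f"{b:02x}" for b in chunk[8:])
        hex_part = f"{left:<23}  {right:<23}"
        # Printable ASCII is shown as-is; everything else becomes a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part}  |{text}|")
    # The final line holds the total number of bytes shown.
    lines.append(f"{len(data):08x}")
    return lines


sample = b'<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE'
for line in hexdump_c(sample):
    print(line)
```

The newline byte (0a) has no printable form, which is why it appears as a dot in the text column.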
Even if the file were not a text file, and the ASCII printout looked like random characters, we would still be able to look at the hexadecimal dump to learn about the file.
For example, here is the PNG (binary) version of that same image. We will use the same syntax with hexdump, but look inside the pie.png file.
$ hexdump -C -n 64 pie.png
00000000  89 50 4e 47 0d 0a 1a 0a  00 00 00 0d 49 48 44 52  |.PNG........IHDR|
00000010  00 00 01 2c 00 00 02 26  10 04 00 00 00 13 97 a3  |...,...&........|
00000020  46 00 00 00 04 67 41 4d  41 00 00 b1 8f 0b fc 61  |F....gAMA......a|
00000030  05 00 00 00 20 63 48 52  4d 00 00 7a 26 00 00 80  |.... cHRM..z&...|
00000040
We have the same format of output. On the right, we see that the file is identified[4] as a PNG file, as shown in the first few ASCII characters, but most of the other ASCII characters appear random (meaningless). Dots are shown for "non-printing" characters. Since the file is binary, and not encoded as characters, the ASCII interpretation produced by hexdump is not very useful for learning anything more about the image. We will just have to open the image in a graphics viewer to see what it is. Although both image files would display the same picture, you can see that there is a big difference between the contents of plain text and binary file formats.
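The first eight bytes of the PNG dump are a fixed signature, which suggests a simple way for a program to tell the two files apart. The sketch below is a heuristic, not a description of how any particular tool works; the function names are illustrative, and the sample bytes are copied from the first line of each dump above.

```python
# The eight-byte signature that begins every PNG file.
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])


def looks_like_png(data):
    """Check whether the data starts with the PNG signature."""
    return data[:8] == PNG_SIGNATURE


def looks_like_text(data):
    """Rough guess: no NUL bytes, and the bytes decode as UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


png_head = bytes.fromhex("89504e470d0a1a0a0000000d49484452")
svg_head = b'<?xml version="1'

for label, head in [("pie.png", png_head), ("pie.svg", svg_head)]:
    print(label, looks_like_png(head), looks_like_text(head))
```

A check like looks_like_text can be fooled, for example when a slice ends in the middle of a multi-byte character, so it should be treated as a guess.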
We will now take a closer look at the most popular character encodings for text files.
- ASCII (7-bit): The best-known standard for text.
- Extended ASCII (8-bit): The extra bit allows for a few more special symbols.
- Unicode (1-4 bytes): The current standard.
Some key points to know about ASCII are:
- "American Standard Code for Information Interchange"[5]
- ASCII standard first published in 1963
- Current version of US ASCII is ANSI X3.4-1986
- ASCII was internationalized as ISO 646:1983
- 7-bit character set with 128 characters (2^7 = 128)
The ascii command prints all 128 ASCII characters.
$ ascii
Usage: ascii [-dxohv] [-t] [char-alias...]
   -t = one-line output  -d = Decimal table  -o = octal table  -x = hex table
   -h = This help screen -v = version information
Prints all aliases of an ASCII character. Args may be chars, C \-escapes,
English names, ^-escapes, ASCII mnemonics, or numerics in decimal/octal/hex.

Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex
  0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
  1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
  2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
  3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
  4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
  5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
  6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
  7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
  8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
  9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
 10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
 11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
 12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
 13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
 14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
 15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL
You will see that there is a header showing the command usage followed by an ASCII table listing. The listing is arranged in 8 sets of columns, with each set showing the decimal (Dec) and hexadecimal (Hex) value for each character. Starting from zero (0), the first 32 characters (and the 128th) are the so-called "non-printing" characters, so those are shown with 2-3 letter codes describing the character. The 33rd character is the "Space" so nothing is shown. All other characters are symbols which appear on the standard US keyboard. The punctuation characters and decimal digits are followed by capital letters, more punctuation, lower-case letters, more punctuation, and finally ending with "DEL" (Delete), the 128th character (numbered 127, or 7F in hexadecimal). To have more characters, we would need more bits in our encoding standard, which we will look into next.
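The grouping described above can be checked with a few lines of Python, using the built-in ord() and chr() functions. This is only an illustrative sketch; the variable names are arbitrary.

```python
# Classify the 128 ASCII codes into the groups described above.
control = [code for code in range(128) if code < 32 or code == 127]
printable = [code for code in range(128) if 32 <= code < 127]

print(len(control), "non-printing codes,", len(printable), "printable characters")

# ord() gives a character's code; chr() goes the other way.
for ch in ["A", "a", "0", " ", "~"]:
    print(repr(ch), ord(ch), format(ord(ch), "02X"))

# Upper-case and lower-case letters are exactly 32 codes apart.
print(ord("a") - ord("A"))
```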
- ISO-8859-1 is an 8-bit extension with 191 characters
- ISO-8859-1 ("ISO Latin 1") was first published in 1987
- ISO-8859-1 was extended to Windows-1252
- Windows-1252 is sometimes (incorrectly) called "ANSI"[6]
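The difference between ISO-8859-1 and Windows-1252 shows up in the byte range 0x80-0x9F. The following sketch uses Python's built-in codecs, which Python calls latin-1 and cp1252; the sample bytes are arbitrary.

```python
# Five arbitrary byte values from the upper half of the 8-bit range.
raw = bytes([0x80, 0x93, 0x94, 0xB5, 0xE9])

as_cp1252 = raw.decode("cp1252")   # Windows-1252
as_latin1 = raw.decode("latin-1")  # ISO-8859-1

# ascii() escapes non-ASCII characters, so this prints safely anywhere.
print(ascii(as_cp1252))
print(ascii(as_latin1))

# Bytes 0xA0-0xFF mean the same thing in both encodings.
print(as_cp1252[3:] == as_latin1[3:])
```

In Windows-1252 the first three bytes are the euro sign and a pair of curly quotes, while ISO-8859-1 treats the same bytes as control codes.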
We can see how differences in character encodings can matter with a few simple examples. Let’s first generate a table of characters with Python.
The following Python script will show the printable characters of the Windows-1252 character set when run on a Windows system using a graphical Python interpreter such as IDLE or PyScripter.
Here is the full code listing for that Python script.
# If run on a Windows system in a graphical environment such as
# IDLE's Python Shell, by default, this will print the Windows
# Latin 1 character set, a.k.a. Windows-1252 (WinLatin1).
# Note: written for Python 2, where chr(i) is a single byte and a
# bare "print" statement emits a newline.
import sys

# Print Extended ASCII table from character 32 to 255.
# (Skip non-printing characters numbered 0-31.)
start = 32
for i in range(start, 256):
    # Replace each non-printing character with a space.
    if i not in [127, 129, 141, 143, 144, 157, 160]:
        sys.stdout.write(chr(i))
    else:
        sys.stdout.write(" ")
    # Print a newline every 16 characters.
    if i > start and (i + 1) % 16 == 0:
        print
We can see the characters properly in a non-Windows environment if we specifically set the character encoding in the application.
However, this is not the default setting. Without knowing the output was encoded as Windows-1252, we might have thought our program had a bug.
So, how can we know the character encoding of "plain text" output? Let’s save the output as a file and test the file for its character encoding.
To save program output as a file, we can use file redirection. We will run the program on the Windows computer in a DOS shell and redirect with the '>' operator.[7]
C:\> python asciitable.py > asciitable.txt
Redirection allows us to save to a file, but that file just contains the numeric codes for the characters. There is nothing in the file stating the actual character encoding format. We will have to guess, using the file command.
file
We can check the character encoding and other file properties using the file command. This command is available on Unix, Linux, and OS X systems. Here we will run the file command from a Bash shell.
$ file asciitable.txt
asciitable.txt: Non-ISO extended-ASCII text, with CRLF, NEL line terminators
While this tells us a little about the text format, we still don’t know the specific encoding standard used.
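One reason a tool can only guess is that the same bytes are often valid in more than one encoding. The sketch below tries a short list of candidate encodings and reports which ones decode without error. It is not how the file command works internally; the function name, the candidate list, and the sample text are assumptions made for illustration.

```python
def candidate_encodings(data, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return the candidate encodings that decode the data without error."""
    fits = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        fits.append(name)
    return fits


# Sample bytes as a Windows program might write them (hypothetical text).
sample = "50 \u00b5g \u2013 caf\u00e9\r\n".encode("cp1252")

print(candidate_encodings(sample))
print(candidate_encodings(b"plain text\n"))
```

Decoding without error does not prove an encoding is the right one: latin-1 accepts any byte sequence, so the result narrows the choices but does not settle them.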
Unicode provides an internationalized character encoding standard, to "encompass the characters of all the world’s living languages".[10]
- Like ASCII, but supports over 110,000 characters
- Unicode standard was published in 1991
- Most commonly used encodings are UTF-8 and UTF-16[11]
You can browse the Unicode code charts to get an idea of the many character sets available.
The character µ, with Unicode[12] name "MICRO SIGN" is encoded:
Encodings | Decimal | Hex |
---|---|---|
Unicode | 181 | U+00B5 |
Extended ASCII | 181 | B5 |
HTML numeric character reference | &#181; | &#xB5; |
HTML named character entity | &micro; | |
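The same values can be reproduced in Python, which is a handy way to check a character code. This is a small sketch using only the standard library; the variable names are arbitrary.

```python
import html

ch = "\u00b5"  # MICRO SIGN

code_point = ord(ch)
print(code_point, format(code_point, "04X"))

# In the 8-bit "extended ASCII" encodings the character is one byte.
print(ch.encode("latin-1"), ch.encode("cp1252"))

# Build the HTML numeric character references from the code point.
decimal_ref = f"&#{code_point};"
hex_ref = f"&#x{code_point:X};"
print(decimal_ref, hex_ref)

# html.unescape() turns references and named entities back into text.
for reference in [decimal_ref, hex_ref, "&micro;"]:
    print(reference, html.unescape(reference) == ch)
```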
How do you type the µ character into your computer?
Use these character codes:
Name | Decimal | Hex |
---|---|---|
MICRO SIGN | 181 | 00B5 |
With these operating systems:[13]
- Windows: [Alt]decimal (using numeric keypad) … or … hex[Alt][x] (does not require numeric keypad)
- OS X: for µ, you can simply use [Opt][m] … or … [Command][Ctrl][Space] … Search by name … or … use Unicode Hex Input (Input Source) and hex
- Linux: [Shift][Ctrl][u]hex
Character Name | Char. | Entity | Num. Entity | Hex. Entity |
---|---|---|---|---|
DEGREE SYMBOL | ° | &deg; | &#176; | &#xB0; |
MICRO MU SYMBOL | µ | &micro; | &#181; | &#xB5; |
LOWER CASE SIGMA | σ | &sigma; | &#963; | &#x3C3; |
N-ARY SUMMATION | ∑ | &sum; | &#8721; | &#x2211; |
GREEK SMALL LETTER PI | π | &pi; | &#960; | &#x3C0; |
GREEK SMALL LETTER ALPHA | α | &alpha; | &#945; | &#x3B1; |
GREEK SMALL LETTER BETA | β | &beta; | &#946; | &#x3B2; |
GREEK SMALL LETTER GAMMA | γ | &gamma; | &#947; | &#x3B3; |
INCREMENT | Δ | &Delta; | &#8710; | &#x2206; |
GREEK SMALL LETTER EPSILON | ε | &epsilon; | &#949; | &#x3B5; |
INFINITY | ∞ | &infin; | &#8734; | &#x221E; |
PLUS OR MINUS | ± | &plusmn; | &#177; | &#xB1; |
NOT EQUALS | ≠ | &ne; | &#8800; | &#x2260; |
ALMOST EQUAL | ≈ | &asymp; | &#8776; | &#x2248; |
GREATER THAN OR EQUAL TO | ≥ | &ge; | &#8805; | &#x2265; |
LESS THAN OR EQUAL TO | ≤ | &le; | &#8804; | &#x2264; |
DIVISION SIGN | ÷ | &divide; | &#247; | &#xF7; |
SUPERSCRIPT TWO | ² | &sup2; | &#178; | &#xB2; |
SUPERSCRIPT THREE | ³ | &sup3; | &#179; | &#xB3; |
For example, in Windows, you can use the "Num. Entity" column for [Alt] codes such as [Alt]946 for β (beta).
- UTF-8 (1993) is a variable-length 8-bit character encoding
- A UTF-8 character will use one to four 8-bit bytes
- ASCII characters are the first 128 characters of UTF-8
- Use of UTF-8 surpassed ASCII on the Web in Dec. 2007
- UTF-8 is the default encoding for HTML5 and JSON
UTF-8 and UTF-16 are the standard encodings for Unicode text in HTML documents, with UTF-8 as the preferred and most used encoding.[14]
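The variable length of UTF-8 is easy to observe in Python. The four sample characters are arbitrary examples chosen to need one, two, three, and four bytes; the bytes.hex() separator argument assumes Python 3.8 or later.

```python
# One character each that needs 1, 2, 3, and 4 bytes in UTF-8.
samples = ["A", "\u00b5", "\u2211", "\U0001F600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X}", len(encoded), "byte(s):", encoded.hex(" "))

# An ASCII character has the same single byte in ASCII and in UTF-8.
print("A".encode("ascii") == "A".encode("utf-8"))
```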
UTF-8
We can convert a file encoded as Windows-1252 into UTF-8 with iconv.[15]
$ iconv -f windows-1252 -t utf-8 asciitable.txt > asciitable2.txt
$ file asciitable2.txt
asciitable2.txt: UTF-8 Unicode text
As you can see, you can use file to verify that this is a Unicode file encoded as UTF-8.
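Where iconv is not available, the same conversion can be sketched in Python by decoding the bytes with the source encoding and encoding them again with the target. The function name and sample bytes are hypothetical; for a real file you would read and write it in binary mode so that line endings are left untouched.

```python
def convert_encoding(data, source="cp1252", target="utf-8"):
    """Re-encode bytes from one character encoding to another."""
    return data.decode(source).encode(target)


# A micro sign, the letter g, and a Windows line ending, in Windows-1252.
original = b"\xb5g\r\n"
converted = convert_encoding(original)

print(original)
print(converted)
```

The micro sign grows from one byte to two, while the ASCII characters and the CRLF line ending are unchanged.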
Tip: "Normalize" text data files to a common, universal encoding format like UTF-8 to ensure characters are displayed with the intended symbols.
- Structured: Formal and rigorous design
  - Example: Relational database
- Semi-structured: Self-describing, validatable
  - Markup using tags or key-value pairs
- Unstructured
  - Multimedia and text document files
  - Any internal structure, if present, is assumed or unreliable
  - Example: email "body" ("header" is semi-structured)
  - May have "implied" structure, like "delimited text"
- Open format (ISO and ECMA standards)
- Human-readable text
- For transmitting data objects
- Attribute–value pairs
- Often used in Ajax web applications
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "height_cm": 167.6,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" }
  ],
  "children": [],
  "spouse": null
}
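Because the format is self-describing, a program can load such a document without being told its layout in advance. Here is a minimal sketch with Python's standard json module, using a shortened copy of the example; the variable names are arbitrary.

```python
import json

# A shortened copy of the example document above.
text = """
{
  "firstName": "John",
  "isAlive": true,
  "age": 25,
  "address": {"city": "New York"},
  "phoneNumbers": [{"type": "home", "number": "212 555-1234"}],
  "children": [],
  "spouse": null
}
"""

person = json.loads(text)

# JSON objects become dicts, arrays become lists, and
# true/false/null become True/False/None.
print(person["firstName"], person["age"], person["isAlive"])
print(person["address"]["city"])
print(person["phoneNumbers"][0]["number"])
print(person["children"], person["spouse"])
```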
Files formatted with delimiter-separated values use:
- Comma (e.g., "CSV")
- Tab (e.g., "TSV")
- Pipe (vertical bar: |)
… or another single character as a separator between values.
The records (rows) are separated by line-ending characters (newlines):
- Carriage-return (CR)
- Line-feed (LF)
- Carriage-return, Line-feed (CRLF)
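Both choices, the delimiter and the line ending, matter when reading such a file. The sketch below parses a small pipe-delimited sample with CRLF line endings using Python's standard csv module; the sample values are made up for illustration.

```python
import csv
import io

# Pipe-delimited records with CRLF line endings (illustrative data).
raw = "subID|height|weight\r\n1|58|115\r\n2|59|117\r\n"

# newline="" leaves the line endings alone so the csv module can
# handle CR, LF, or CRLF itself.
reader = csv.reader(io.StringIO(raw, newline=""), delimiter="|")
rows = list(reader)

header, records = rows[0], rows[1:]
print(header)
for record in records:
    print(dict(zip(header, record)))
```

Every value arrives as a string; converting columns to numbers is a separate step.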
- Text files arranged in neatly formatted columns
- Space filled with varying numbers of spaces or tabs
- Easier to look at, but a little harder to parse
- Lines are separated with newlines
                   mpg cyl disp
Mazda RX4         21.0   6  160
Mazda RX4 Wag     21.0   6  160
Datsun 710        22.8   4  108
Hornet 4 Drive    21.4   6  258
Hornet Sportabout 18.7   8  360
Valiant           18.1   6  225
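The "harder to parse" point can be seen with a short Python sketch. Because the car names contain spaces, splitting on whitespace gives the wrong number of fields, so the sketch slices each line at fixed character positions instead. The positions are an assumption based on this particular sample, padded so that every line is 31 characters wide.

```python
# A few rows of the table, padded to fixed column positions.
LINES = [
    "                   mpg cyl disp",
    "Mazda RX4         21.0   6  160",
    "Mazda RX4 Wag     21.0   6  160",
    "Hornet Sportabout 18.7   8  360",
]

# (start, end) character positions for each column: an assumption
# made by inspecting this sample, not a general rule.
COLUMNS = {"model": (0, 17), "mpg": (18, 22), "cyl": (22, 26), "disp": (26, 31)}


def parse_line(line):
    """Slice one fixed-width line into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, (start, end) in COLUMNS.items()}


records = [parse_line(line) for line in LINES[1:]]
for record in records:
    print(record)

# Splitting on whitespace breaks names that contain spaces.
print(LINES[2].split())
```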
Some popular genomics file formats use multi-line records.
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
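With multi-line records, a reader cannot treat each line as one record; it has to collect lines until the next record begins. A minimal sketch for the first format shown (a header line starting with ">" followed by sequence lines) might look like this. The function name and the short sample records are made up for illustration, and dedicated libraries handle many cases this sketch ignores.

```python
def parse_records(lines):
    """Yield (header, sequence) pairs from '>'-headed multi-line records."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    # Emit the last record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


sample = ">seq1 example\nLCLYTHIGRN\nIYYGSYLYSE\n>seq2\nGGGTGATGG\n"
records = list(parse_records(sample.splitlines()))
for header, sequence in records:
    print(header, len(sequence), sequence)
```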
Structure data files as simple "columns and rows" …
subID | height | weight |
---|---|---|
1 | 58 | 115 |
2 | 59 | 117 |
3 | 60 | 120 |
… to make them easier to import and analyze.
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
5.7 | 2.8 | 4.1 | 1.3 | versicolor |
Flower.Id | Species | Flower.Part | Length | Width |
---|---|---|---|---|
1 | setosa | Petal | 1.4 | 0.2 |
1 | setosa | Sepal | 5.1 | 3.5 |
100 | versicolor | Petal | 4.1 | 1.3 |
100 | versicolor | Sepal | 5.7 | 2.8 |
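The step from the wide table to the long one can be sketched in plain Python. This is not the method used to produce the tables above, only an illustration of the reshaping; it assumes that Flower.Id is the row number of each flower in the original data, as the values 1 and 100 suggest.

```python
# The two flowers from the wide table, as one dict per row.
wide = [
    {"Flower.Id": 1, "Species": "setosa",
     "Sepal.Length": 5.1, "Sepal.Width": 3.5,
     "Petal.Length": 1.4, "Petal.Width": 0.2},
    {"Flower.Id": 100, "Species": "versicolor",
     "Sepal.Length": 5.7, "Sepal.Width": 2.8,
     "Petal.Length": 4.1, "Petal.Width": 1.3},
]

# Emit one row per flower part instead of one row per flower.
long_rows = []
for row in wide:
    for part in ("Petal", "Sepal"):
        long_rows.append({
            "Flower.Id": row["Flower.Id"],
            "Species": row["Species"],
            "Flower.Part": part,
            "Length": row[part + ".Length"],
            "Width": row[part + ".Width"],
        })

for row in long_rows:
    print(row)
```

After reshaping, Flower.Part is an ordinary variable with its own column, so it can be used for grouping like any other.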
Now we can "facet" a plot by Species and Flower.Part.
ggplot(data=iris, aes(x=Width, y=Length)) +
    geom_point() +
    facet_grid(Species ~ Flower.Part, scale="free") +
    geom_smooth(method="lm") +
    theme_bw(base_size=16)
python command from the Bash shell in Unix, Linux, or OS X.
iconv is another tool originally developed for Unix, Linux and OS X systems, though Windows versions are available and can be found with an Internet search.
The latest version of this document is online at: https://github.com/brianhigh/research-computing/wiki Copyright © The Research Computing Team. This information is provided for educational purposes only. See LICENSE for more information. Creative Commons Attribution 4.0 International Public License.