JKSN, prononced as "Jackson", is an abbreviation of "JKSN Compressed Serialize Notation", aims to be a binary serialization format compatible with JSON.
JKSN is suitable for the situation that network bandwidth is expensive but processing time is cheap.
JKSN stream may be further compressed with GZIP to get better results. It does not conflict with GZIP. Instead, since GZIP compresses better if similar data are not 32 KiB far apart, JKSN rearranges your data so that similar data are put together.
[magic header] [control byte] [checksum] [control byte] [data bytes] [control byte] [data bytes] ...
One JSON stream must only produce one value. So once a complete value is parsed, the JKSN stream must be terminated.
MIME type application/x-jksn
and filename extension .jksn
are preferred. (It may change in the future.)
This part contains three bytes jk!
, it is used for safe Internet transmission and data type recognition. If the network bandwidth is limited, magic header may be omitted.
No data bytes are followed after 0x0n
.
0x00: undefined
0x01: null
0x02: false
0x03: true
0x0f: a representation (a control byte and data bytes) of a string containing a plain text JSON literal is followed
An integer is followed after 0x1b
0x1c
0x1d
0x1e
0x1f
.
All integers are big endian.
0x1n (where 0<=n<=a): represents an integer that is n
0x1b: a signed 32-bit integer is followed
0x1c: a signed 16-bit integer is followed
0x1d: a signed 8-bit integer is followed
0x1e: a negative variable length integer is followed
0x1f: a positive variable length integer is followed
An IEEE 754 floating point number is followed after 0x2b
0x2c
0x2d
.
The most significant bit, that is the IEEE 754 sign bit, is transferred first.
0x20: NaN
0x2b: an IEEE 754 long double (80-bit, 10 bytes) number is followed
0x2c: an IEEE 754 double (64-bit) number is followed
0x2d: an IEEE 754 float (32-bit) number is followed
0x2e: -Infinity
0x2f: Infinity
IEEE 754 long double (80-bit) numbers may be unavailable on some platform. In most situations, floating point numbers are encouraged to be encoded as double (64-bit).
All UTF-16 strings are little endian, that is for ASCII characters, the second byte is zero.
0x3n (where 0<=n<=b): a UTF-16 string containing n byte pairs is followed
0x3c: an unsigned 8-bit integer is followed, representing the nearest previous UTF-16 or UTF-8 string with this DJB Hash value.
0x3d: an unsigned 16-bit integer and a UTF-16 string containing that amount of byte pairs is followed
0x3e: an unsigned 8-bit integer and a UTF-16 string containing that amount of byte pairs is followed
0x3f: a positive variable length integer and a UTF-16 string containing that amount of byte pairs is followed
Some strings (such as East Asian languages) takes less space with UTF-16, while others (such as European languages) with UTF-8.
The JKSN encoder should try to use the character encoding that takes the least space.
0x4n (where 0<=n<=c): a UTF-8 string containing n bytes is followed
0x4d: an unsigned 16-bit integer and a UTF-8 string containing that amount of bytes is followed
0x4e: an unsigned 8-bit integer and a UTF-8 string containing that amount of bytes is followed
0x4f: a positive variable length integer and a UTF-8 string containing that amount of bytes is followed
Blob strings are those strings not being tried to convert into different character encodings during the encoding of the JKSN stream, that is, the JKSN encoder will not touch the contents of blob strings.
0x5n (where 0<=n<=b): a blob string containing n bytes is followed
0x5c: an unsigned 8-bit integer is followed, representing the nearest previous blob string with this DJB Hash value.
0x5d: an unsigned 16-bit integer and a blob string containing that amount of bytes is followed
0x5e: an unsigned 8-bit integer and a blob string containing that amount of bytes is followed
0x5f: a positive variable length integer and a blob string containing that amount of bytes is followed
Hashtable refresher is an array of string that can be transferred before the value or inside arrays or objects. It does not produce any values but forces the build of a hashtable that can be used with 0x3c
or 0x5c
.
Normally hashtable referesher is not useful since the hashtable is built during the first occurence of a string, however it is useful if there is a persistent connection exchanging continuous JKSN stream.
There are two hashtables with size 256, one for text strings (UTF-16, UTF-8), the other for blob strings.
0x70: clear the hashtable
0x7n (where 1<=n<=c): an array of string containing n strings is followed
0x7d: an unsigned 16-bit integer and an array of string containing that amount of strings is followed
0x7e: an unsigned 8-bit integer and an array of string containing that amount of strings is followed
0x7f: a positive variable length integer and an array of string containing that amount of strings is followed
Items are nested layers of control bytes and data bytes.
0x8n (where 0<=n<=c): an array containing n items is followed
0x8d: an unsigned 16-bit integer and an array containing that amount of items is followed
0x8e: an unsigned 8-bit integer and an array containing that amount of items is followed
0x8f: a positive variable length integer and an array containing that amount of items is followed
0x9n (where 0<=n<=c): an object containing n string-item pairs is followed
0x9d: an unsigned 16-bit integer and an object containing that amount of string-item pairs is followed
0x9e: an unsigned 8-bit integer and an object containing that amount of string-item pairs is followed
0x9f: a positive variable length integer and an object containing that amount of string-item pairs is followed
0xa0: "unspecified", used to skip columns that are not specified
0xan (where 1<=n<=c): a row-col swapped array containing n columns is followed
0xad: an unsigned 16-bit integer and a row-col swapped array containing that amount of columns is followed
0xae: an unsigned 8-bit integer and a row-col swapped array containing that amount of columns is followed
0xaf: a positive variable length integer and a row-col swapped array containing that amount of columns is followed
A lengthless array can be used when the length of the array is unknown before transmission.
0xc8: an unlimited number of items is followed, with a terminating 0xa0
The use of lengthless array is discouraged, since it reduces the robustness and burdens the JKSN decoder.
0xca: another control byte is followed
Padding byte does nothing. It can be used when the transport requires some padding.
An Integer is followed after 0xdb
0xdc
0xdd
0xde
0xdf
, representing an integer that is larger than the previous occurred integer by n.
0xd0: represents the same previous occurred integer
0xdn (where 1<=n<=5): represents an integer that is larger than the previous occurred integer by n
0xdn (where 6<=n<=a): represents an integer that is larger than the previous occurred integer by n-11
0xdb: a signed 32-bit integer is followed
0xdc: a signed 16-bit integer is followed
0xdd: a signed 8-bit integer is followed
0xde: a negative variable length integer is followed
0xdf: a positive variable length integer is followed
0xen: application can define it or extend it
Implementations can do anything with 0xen
. Make sure that both sender and receiver understand the same extensions.
0xf0: a DJBHash checksum will be immediately followed
0xf1: a CRC32 checksum will be immediately followed
0xf2: an MD5 checksum will be immediately followed
0xf3: a SHA-1 checksum will be immediately followed
0xf4: a SHA-256 checksum will be immediately followed
0xf5: a SHA-512 checksum will be immediately followed
0xf8: a delayed DJBHash checksum will be present at the end of the stream
0xf9: a delayed CRC32 checksum will be present at the end of the stream
0xfa: a delayed MD5 checksum will be present at the end of the stream
0xfb: a delayed SHA-1 checksum will be present at the end of the stream
0xfc: a delayed SHA-256 checksum will be present at the end of the stream
0xfd: a delayed SHA-512 checksum will be present at the end of the stream
Pragmas are interpreter-specific directives. They are stored in the form of values, transferred before the real value or inside objects or arrays, but do not produce values. Interpreter that can not understand them may simply ignore them.
0xff: a representation (a control byte and data bytes) of a value (usually a string) is followed
Every byte is separated into two parts, 1 continuation bit and 7 data bits. The last byte has a continuation bit of 0, others have 1.
For example
A B C D
1ddddddd 1ddddddd 1ddddddd 0ddddddd
So that the result number will be (A & 0x7f)<<21 | (B & 0x7F)<<14 | (C & 0x7F)<<7 | D
.
If the control byte indicates it should be a negative number, the result will be 0-((A & 0x7f)<<21 | (B & 0x7F)<<14 | (C & 0x7F)<<7 | D)
.
Imagine that you have an array like this:
[
{"name": "Jason", "email": "[email protected]", "phone": "777-777-7777"},
{"name": "Jackson", "age": 17, "email": "[email protected]", "phone": "888-888-8888"}
]
It is better to preprocess it before transfer:
{
"name": ["Jason", "Jackson"],
"age": [unspecified, 17],
"email": ["[email protected]", "[email protected]"],
"phone": ["777-777-7777", "888-888-8888"]
}
This is a row-col swapped array. JKSN will transparently transform an array to a row-col swapped array if it takes less space, then transform it back after receiving.
This example, represented in JKSN without row-col swapping, is:
"jk!" 0x82 0x93 0x44 "name" 0x45 "Jason" 0x45 "email" 0x4e 0x11 "[email protected]"
0x45 "phone" 0x4c "777-777-7777"
0x94 0x3c 0xc1 0x47 "Jackson" 0x43 "age" 0x1d 0x11
0x3c 0xc8 0x4e 0x13 "[email protected]" 0x3c 0x9a 0x4c "888-888-8888"
If represented in row-col swapped JKSN, is:
"jk!" 0xa4 0x44 "name" 0x82 0x45 "Jason" 0x47 "Jackson"
0x43 "age" 0x82 0xa0 0x1d 0x11
0x45 "email" 0x82 0x4e 0x11 "[email protected]" 0x4e 0x13 "[email protected]"
0x45 "phone" 0x82 0x4c "777-777-7777" 0x4c "888-888-8888"
Row-col swapped array can definitely be nested. However, nested row-col swapped array might cost too much computing time. It is recommended to choose a balanced nesting depth.
The algorithm of DJB Hash is listed below:
def DJBHash(string):
hash = 0
for i in string:
hash = hash + (hash << 5) + ord(i)
return hash & 0xff
When hashing UTF-16 strings, the hash function processes every byte instead of every byte pair.
Checksum is a optional part of JKSN stream. If the transmission media is reliable, or if you decided to GZIP the JKSN stream, checksum may be omitted.
Checksum indicates the checksum from the position immediately after the checksum to the very end of the JKSN stream.
A delayed checksum rearranges the form of JKSN, which puts the checksum to the end of JKSN stream, as the following format:
[magic header] [0xf8..0xfd] [control byte] [data bytes] ... [control byte] [data bytes] [checksum]
An JSON stream has multiple JKSN representation. The JKSN encoder should decide which method generates the smallest JKSN stream in reasonable time.
Since implementing the full functionality of JKSN costs too much resources and does not suit all situation, the application may only implement a subset of JKSN specification. This requires the sender and receiver using features both supported.
Take the example used in row-col swapping section.
Compression method | Size | Percent saved |
---|---|---|
JSON, not stripping whitespace | 170 bytes | 13.3% larger |
JSON, stripped whitespace | 150 bytes | base |
JSON, row-col swapped, stripped | 143 bytes | 4.6% smaller |
JSON, stripped, gzip -9 | 112 bytes | 25.3% smaller |
JSON, swapped, stripped, gzip -9 | 127 bytes | 15.3% smaller |
JKSN, no swapping | 112 bytes | 25.3% smaller |
JKSN, transparent swapped | 109 bytes | 27.3% smaller |
JKSN, no swap, gzip -9 | 111 bytes | 26.0% smaller |
JKSN, transparent swapped, gzip -9 | 107 bytes | 28.7% smaller |
A non gzipped JKSN does even better than JSON compressed in any method above.
What about BSON, MessagePack, BJSON, ubJSON?
Format | Size | Implementation |
---|---|---|
BSON | 172 bytes | Python 3 |
Universal Binary Json | 150 bytes | Node.JS |
MessagePack | 120 bytes | Node.JS |
JSONH | 116 bytes | Python 2 |
BJSON with builtin adaptive huffman encoding | 115 bytes | Node.JS |
JKSN | 109 bytes | Python 3 |
There is a Python reference implementation in the python
directory in this repository.
There are other implementations in other languages as well. Some of them implemented a part of the functionality.
Although there is a detailed specification file, it is unable to be avoided that there is something not stated clearly.
Reference implementations are designed by the designer of JKSN so that you can regard the behavior of the reference implementation as the de facto standard unless the behavior is related to a bug.
In addition, though the reference implementations are not the ones which produce the smallest JKSN streams, neither the ones which do the fastest processing, we want you to have a chance to try JKSN easily.
If you enjoy JKSN, you can contribute to JKSN by write up a new implementation. We will fork you at GitHub, or even include your code as a reference implementation if the conditions above are met.
This document, as well as JKSN specification, is licensed under BSD license.
Copyright (c) 2014 StarBrilliant <[email protected]>. All rights reserved.
Redistribution and use in source and binary forms are permitted provided that the above copyright notice and this paragraph are duplicated in all such forms and that any documentation, advertising materials, and other materials related to such distribution and use acknowledge that the software was developed by StarBrilliant. The name of StarBrilliant may not be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED AS IS AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.