-
Notifications
You must be signed in to change notification settings - Fork 70
/
Copy pathdictionary-format-v6.txt
204 lines (155 loc) · 6.3 KB
/
dictionary-format-v6.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
This is a quick write-up of the old dictionary file format, v6.
It is the format that is (unfortunately) still used by the Tolino
ebook readers. The ConvertToV6 tool can be used to convert
the new format to this old one.
v6 is troublesome as it relies on Java serialization and thus
there will be references to Java types.
This hasn't been checked much for correctness and likely has some bugs.
Also, I really should have used some standard format for writing this...
===========================================
Some basic types:
[Short]
2 bytes: big-endian, signed value (note: negative values generally not used here)
[Int]
4 bytes: big-endian, signed value (note: negative values generally not used here)
[Long]
8 bytes: big-endian, signed value (note: negative values generally not used here)
[String]
[Short]: string length
n bytes: string, modified UTF-8, n is value from previous element
note: no zero termination
======================================================
[Dictionary]
[Int]: version, fixed value 6
[Long]: file creation time (in milliseconds since Jan. 1st 1970)
[String]: dictionary information (human-readable)
list_of([source])
list_of([pair_entry])
list_of([text_entry])
list_of([html_entry]) (since v5)
list_of([index])
[String]: string "END OF DICTIONARY" (length value 17)
===========================
All list_of entries describe a list of elements.
These elements can have variable size, thus an index (table-of-contents, TOC)
is needed.
To reduce the cost of this table and enable more efficient compression,
multiple entries can be stored in a block that gets one single index entry.
I.e. it is only possible to do random-access to the start of a block,
seeking to elements further inside the block must be done via reading.
Caching should be used to reduce the performance impact of this (so
that when entries 5, 4, 3 etc. of a block are read sequentially,
parsing and decompression is done only once).
These lists have the following base format:
[Int]: number of entries in the list (must be >= 0) (<size>)
<toc size>=<size>*8 + 8 bytes:
table-of-contents.
[Long] offset value for each block of entries.
Followed by a final [Long] offset value to the end of the list data (<end offset>).
Each offset is an absolute file position.
<end offset>-<toc size>-<start of toc> bytes:
entry data
==========================================================
[source]
[String]: name of source, e.g. "enwiktionary"
[Int]: number of entries from that source (since v3) (I kind of wouldn't rely on that one
being useful/correct...)
========================================================
[pair entry]
[Short]: source index (see list_of([source])) (since v1)
[Int]: number of pairs in this entry (<num_pairs>)
<num_pairs> times:
[String]: in first language
[String]: in second language (possibly empty)
=================================================
[text_entry]
[Short]: source index (see list_of([source])) (since v1)
[String]: text
===========================================
[html_entry]
[Short]: source index (see list_of([source])) (since v1)
[String]: title for HTML entry
[Int]: length of decompressed data in bytes (<declen>)
[Int]: length of compressed data in bytes (<len>)
<len> bytes: HTML page data, UTF-8 encoded, gzip compressed
=====================================
[index]
Note: this structure is used for binary search.
It is thus critical that all entries are correctly
sorted.
The sorting is according to libicu, however as Java
and Android versions do not match special hacks
have been added, like ignoring "-" for the comparison
(unless that makes them equal, then they are
compared including the dash).
[String]: index short name
[String]: index long name
[String]: language ISO code (sort order depends on this)
[String]: ICU normalizer rules to apply for sorting/searching
1 byte: swap pair entries (if != 0, this index is for the second language entries in [pair_entry])
[Int]: number of main tokens (?) (since v2)
list_of([index_entry])
[Int]: size of stop list set following (since v4)
Set<String> stop list words (since v4)
uniform_list_of([row])
with uniform_list_of:
[Int]: number of entries in list <num_entries>
[Int]: size of entry <entry_size>
<num_entries>*<entry_size> bytes: data
================================================
[index_entry]
[String]: token
[Int]: start index into uniform_list_of([row])
[Int]: number of rows covered
1 byte: <has_normalized>
if <has_normalized> != 0:
[String]: normalized token
list_of([Int]) list of indices into list_of(html_entry) (since v6)
=======================================
[row]
1 byte: <type>
[Int]: index
<type> means:
1: index into list_of([pair_entry])
2: index into list_of([index_entry]) (mark as "main word header" entry)
3: index into list_of([text_entry])
4: index into list_of([index_entry]) (mark as "extra info/translation" entry)
5: index into list_of([html_entry])
=======================================
Set<String>
Java serialization of java.util.HashSet.
First part consists always the same 40 bytes:
0xac, 0xed, // magic
0x00, 0x05, // version
0x73, // object
0x72, // class
// Java String "java.util.HashSet"
0x00, 0x11, 0x6a, 0x61, 0x76, 0x61, 0x2e, 0x75, 0x74, 0x69,
0x6c, 0x2e, 0x48, 0x61, 0x73, 0x68, 0x53, 0x65, 0x74,
// serialization ID
0xba, 0x44, 0x85, 0x95, 0x96, 0xb8, 0xb7, 0x34,
0x03, // flags: serialized, custom serialization function
0x00, 0x00, // fields count
0x78, // blockdata end
0x70, // null (superclass)
0x77, 0x0c // blockdata short, 0xc bytes
[Int]: capacity. Not used for anything, but set to >= <num_entries>
[Float]: capacity factor. May affect performance of old QuickDic versions, set to 0.75f
[Int]: <num_entries>
<num_entries> times:
1 byte 0x74: String type
[String]: stop word
1 byte 0x78: blockdata end
Note: Some even older dictionaries wrote out a LinkedHashSet instead of a
HashSet.
That adds the following bytes describing LinkedHashSet before the 0x72 above:
0x72, // class
// Java String "java.util.LinkedHashSet"
0x00, 0x17, 0x6a, 0x61, 0x76, 0x61, 0x2e, 0x75, 0x74, 0x69,
0x6c, 0x2e, 0x4c, 0x69, 0x6e, 0x6b, 0x65, 0x64, 0x48, 0x61,
0x73, 0x68, 0x53, 0x65, 0x74,
// serialization ID
0xd8, 0x6c, 0xd7, 0x5a, 0x95, 0xdd, 0x2a, 0x1e,
0x02, // flags
0x00, 0x00, // fields count
0x78 // blockdata end