updated format spec
diff --git a/zstd_compression_format.md b/zstd_compression_format.md
index ce17f56..94b1dc5 100644
--- a/zstd_compression_format.md
+++ b/zstd_compression_format.md
@@ -76,9 +76,9 @@
General Structure of Zstandard Frame format
-------------------------------------------
-| MagicNb | F. Header | Block | (More blocks) | EndMark |
-|:-------:|:----------:| ----- | ------------- | ------- |
-| 4 bytes | 2-14 bytes | | | 3 bytes |
+| MagicNb | Frame Header | Block | (More blocks) | EndMark |
+|:-------:|:-------------:| ----- | ------------- | ------- |
+| 4 bytes | 2-14 bytes | | | 3 bytes |
__Magic Number__
@@ -87,11 +87,11 @@
__Frame Header__
-2 to 14 Bytes, to be detailed in the next part.
+2 to 14 Bytes, detailed in [next part](#frame-header).
__Data Blocks__
-To be detailed later on.
+Detailed in [next chapter](#data-blocks).
That’s where compressed data is stored.
__EndMark__
@@ -101,12 +101,12 @@
##### __Content Checksum__
-Content Checksum verify that frame content has been regenrated correctly.
+Content Checksum verify that frame content has been regenerated correctly.
The content checksum is the result
of [xxh64() hash function](https://www.xxHash.com)
digesting the original (decoded) data as input, and a seed of zero.
Bits from 11 to 32 (included) are extracted to form a 22 bits checksum
-stored into the last block header.
+stored into the endmark body.
```
mask22bits = (1<<22)-1;
contentChecksum = (XXH64(content, size, 0) >> 11) & mask22bits;
@@ -224,6 +224,10 @@
that will be present within compressed data.
This information is useful for decoders to allocate enough memory.
+`WD` byte is optional. It's not present in `single segment` mode.
+In which case, the maximum back-reference distance is the content size itself,
+which can be any value from 1 to 2^64-1 bytes (16 EB).
+
| BitNb | 7-3 | 0-2 |
| --------- | -------- | -------- |
| FieldName | Exponent | Mantissa |
@@ -236,20 +240,16 @@
windowSize = windowBase + windowAdd;
```
The minimum window size is 1 KB.
-The maximum size is (15*(2^38))-1 bytes, which is almost 1.875 TB.
+The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
To properly decode compressed data,
a decoder will need to allocate a buffer of at least `windowSize` bytes.
-Note that `WD` byte is optional. It's not present in `single segment` mode.
-In which case, the maximum back-reference distance is the content size itself,
-which can be any value from 1 to 2^64-1 bytes (16 EB).
-
In order to preserve decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
which requests a memory size beyond decoder's authorized range.
-For better interoperability,
+For improved interoperability,
decoders are recommended to be compatible with window sizes of 8 MB.
Encoders are recommended to not request more than 8 MB.
It's merely a recommendation though,
@@ -266,10 +266,10 @@
Field size depends on __Dictionary ID flag__.
1 byte can represent an ID 0-255.
2 bytes can represent an ID 0-65535.
-4 bytes can represent an ID 0-(2^32-1).
+4 bytes can represent an ID 0-4294967295.
It's allowed to represent a small ID (for example `13`)
-with a large 4-bytes dictionary ID, losing some efficiency in the process.
+with a large 4-bytes dictionary ID, losing some compacity in the process.
__Frame Content Size__
@@ -331,7 +331,7 @@
while the "compressed" block is just 1 byte (the byte to repeat).
- EndMark : this is not a block. Signal the end of the frame.
The rest of the field may be optionally filled by a checksum
- (see [__Content Checksum__]).
+ (see [Content Checksum](#content-checksum)).
Block sizes must respect a few rules :
- In compressed mode, compressed size if always strictly `< decompressed size`.
@@ -345,9 +345,9 @@
It might be compressed or not, depending on previous field indications.
A data block is not necessarily "full" :
since an arbitrary “flush” may happen anytime,
-block decompressed content can be any size, up to Block Maximum Size.
-Block Maximum Decompressed Size is the smallest of :
-- Max back-reference distance
+block decompressed content can be any size,
+up to Block Maximum Decompressed Size, which is the smallest of :
+- Maximum back-reference distance
- 128 KB
@@ -364,7 +364,9 @@
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.
-Skippable frames defined in this specification are compatible with LZ4 ones.
+Skippable frames defined in this specification are compatible with [LZ4] ones.
+
+[LZ4]:http://www.lz4.org
__Magic Number__ :
@@ -393,8 +395,8 @@
See [Data Blocks](#data-blocks) for more details.
A compressed block consists of 2 sections :
-- Literals section
-- Sequences section
+- [Literals section](#literals-section)
+- [Sequences section](#sequences-section)
### Prerequisites
To decode a compressed block, the following elements are necessary :
@@ -439,12 +441,11 @@
| Block Type | Compressed | Repeat | Raw | RLE |
- Compressed : This is a standard huffman-compressed block,
- starting with a huffman tree description.
- See details below.
+ starting with a huffman tree description.
+ See details below.
- Repeat Stats : This is a huffman-compressed block,
- using huffman tree from previous huffman-compressed block.
- Huffman tree description will be skipped.
- Compressed stream is equivalent to "compressed" block type.
+ using huffman tree _from previous huffman-compressed literals block_.
+ Huffman tree description will be skipped.
- Raw : Literals are stored uncompressed.
- RLE : Literals consist of a single byte value repeated N times.
@@ -476,24 +477,24 @@
__Sizes format for Compressed literals block__ :
Note : also applicable to "repeat-stats" blocks.
-- Value : 00 : 4 streams
- Compressed and regenerated sizes use 10 bits (0-1023)
- Total literal header size is 3 bytes
-- Value : 01 : _Single stream_
- Compressed and regenerated sizes use 10 bits (0-1023)
- Total literal header size is 3 bytes
-- Value : 10 : 4 streams
- Compressed and regenerated sizes use 14 bits (0-16383)
- Total literal header size is 4 bytes
-- Value : 10 : 4 streams
- Compressed and regenerated sizes use 18 bits (0-262143)
- Total literal header size is 5 bytes
+- Value : 00 : 4 streams.
+ Compressed and regenerated sizes use 10 bits (0-1023).
+ Total literal header size is 3 bytes.
+- Value : 01 : _Single stream_.
+ Compressed and regenerated sizes use 10 bits (0-1023).
+ Total literal header size is 3 bytes.
+- Value : 10 : 4 streams.
+ Compressed and regenerated sizes use 14 bits (0-16383).
+ Total literal header size is 4 bytes.
+- Value : 10 : 4 streams.
+ Compressed and regenerated sizes use 18 bits (0-262143).
+ Total literal header size is 5 bytes.
Compressed and regenerated size fields follow big endian convention.
#### Huffman Tree description
-This section is only present when block type is `Compressed` (`0`).
+This section is only present when literals block type is `Compressed` (`0`).
Prefix coding represents symbols from an a priori known alphabet
by bit sequences (codes), one code for each symbol,
@@ -546,7 +547,7 @@
The decoder will do the inverse operation :
having collected weights of literals from `0` to `4`,
it knows the last literal, `5`, is present with a non-zero weight.
-The weight of `5` can be deduced by joining to the nearest power of 2.
+The weight of `5` can be deducted by joining to the nearest power of 2.
Sum of 2^(weight-1) (excluding 0) is :
`8 + 4 + 2 + 0 + 1 = 15`
Nearest power of 2 is 16.
@@ -558,24 +559,17 @@
which tells how to decode the list of weights.
- if headerByte >= 242 : this is one of 14 pre-defined weight distributions :
- + 242 : 1x1 (+ 1x1)
- + 243 : 2x1 (+ 1x2)
- + 244 : 3x1 (+ 1x1)
- + 245 : 4x1 (+ 1x4)
- + 246 : 7x1 (+ 1x1)
- + 247 : 8x1 (+ 1x8)
- + 248 : 15x1 (+ 1x1)
- + 249 : 16x1 (+ 1x16)
- + 250 : 31x1 (+ 1x1)
- + 251 : 32x1 (+ 1x32)
- + 252 : 63x1 (+ 1x1)
- + 253 : 64x1 (+ 1x64)
- + 254 :127x1 (+ 1x1)
- + 255 :128x1 (+ 1x128)
+
+| value |242|243|244|245|246|247|248|249|250|251|252|253|254|255|
+| -------- |
+| Nb of 1s | 1 | 2 | 3 | 4 | 7 | 8 | 15| 16| 31| 32| 63| 64|127|128|
+|Complement| 1 | 2 | 1 | 4 | 1 | 8 | 1 | 16| 1 | 32| 1 | 64| 1 |128|
+
+_Note_ : complement is by using the "join to nearest power of 2" rule.
- if headerByte >= 128 : this is a direct representation,
where each weight is written directly as a 4 bits field (0-15).
- The full representation occupies ((nbSymbols+1)/2) bytes,
+ The full representation occupies `((nbSymbols+1)/2)` bytes,
meaning it uses a last full byte even if nbSymbols is odd.
`nbSymbols = headerByte - 127;`
@@ -593,13 +587,13 @@
Compressed size is provided by `headerByte`.
It's also necessary to know its maximum decompressed size.
In this case, it's `255`, since literal values range from `0` to `255`,
-and the last symbol value is not represented.
+and last symbol value is not represented.
An FSE bitstream starts by a header, describing probabilities distribution.
It will create a Decoding Table.
It is necessary to know the maximum accuracy of distribution
to properly allocate space for the Table.
-For a list of huffman weights, this maximum is 8 bits.
+For a list of huffman weights, this maximum is 7 bits.
FSE header and bitstreams are described in a separated chapter.
@@ -652,13 +646,12 @@
For single stream, header provides both the compressed and regenerated size.
For 4-streams though,
header only provides compressed and regenerated size of all 4 streams combined.
-
In order to properly decode the 4 streams,
it's necessary to know the compressed and regenerated size of each stream.
Regenerated size is easiest :
each stream has a size of `(totalSize+3)/4`,
-except the last one, which is up to 3 bytes smaller, to reach totalSize.
+except the last one, which is up to 3 bytes smaller, to reach `totalSize`.
Compressed size must be provided explicitly : in the 4-streams variant,
bitstreams are preceded by 3 unsigned Little Endian 16-bits values.
@@ -704,11 +697,11 @@
The offset gives the position to copy from,
which can stand within a previous block.
-These are 3 symbol types, `literalLength`, `matchLength` and `offset`,
+There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
which are encoded together, interleaved in a single _bitstream_.
-Each symbol decoding consists of a _code_,
-which specifies a baseline and a number of additional bits.
+Each symbol is a _code_ in its own context,
+which specifies a baseline and a number of bits to add.
_Codes_ are FSE compressed,
and interleaved with raw additional bits in the same bitstream.
@@ -717,7 +710,7 @@
followed by the bitstream.
To decode the Sequence section, it's required to know its size.
-This size is deducted from "blockSize - literalSectionSize".
+This size is deducted from `blockSize - literalSectionSize`.
#### Sequences section header
@@ -733,9 +726,9 @@
- `if (byte0 == 0)` : there are no sequences.
The sequence section stops there.
Regenerated content is defined entirely by literals section.
-- `if (byte0 < 128)` : nbSeqs = byte0 . Uses 1 byte.
-- `if (byte0 < 255)` : nbSeqs = ((byte0-128) << 8) + byte1 . Uses 2 bytes.
-- `if (byte0 == 255)`: nbSeqs = byte1 + (byte2<<8) + 0x7F00 . Uses 3 bytes.
+- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
+- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
+- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
__Symbol compression modes__
@@ -760,8 +753,8 @@
- "RLE" : it's a single code, repeated `nbSeqs` times.
- "Repeat" : re-use distribution table from previous compressed block.
- "FSE" : standard FSE compression.
- Symbol type requires a distribution table,
- which will be described in next part.
+ A distribution table will be present.
+ It will be described in [next part](#distribution-tables).
#### Symbols decoding
@@ -772,8 +765,9 @@
| Code | 0-15 |
| ------ | ---- |
-| nbBits | 0 |
| value | Code |
+| nbBits | 0 |
+
| Code | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
@@ -792,7 +786,7 @@
__Default distribution__
-When "compression mode" is defined as "default distribution",
+When "compression mode" is "predef"",
a pre-defined distribution is used for FSE compression.
Here is its definition. It uses an accuracy of 6 bits (64 states).
@@ -810,8 +804,8 @@
| Code | 0-31 |
| ------ | -------- |
-| nbBits | 0 |
| value | Code + 3 |
+| nbBits | 0 |
| Code | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
@@ -847,8 +841,8 @@
Offset codes are values ranging from `0` to `N`,
with `N` being limited by maximum backreference distance.
-A decoder is free to limit its maximum `N` supported,
-although the recommendation is to support at least up to `22`.
+A decoder is free to limit its maximum `N` supported.
+Recommendation is to support at least up to `22`.
For information, at the time of this writing.
the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
@@ -863,14 +857,15 @@
OFValue from 1 to 3 are special : they define "repeat codes",
which means one of the previous offsets will be repeated.
They are sorted in recency order, with 1 meaning the most recent one.
+See [Repeat offsets](#repeat-offsets) paragraph.
__Default distribution__
-When "compression mode" is defined as "default distribution",
+When "compression mode" is defined as "predef",
a pre-defined distribution is used for FSE compression.
Here is its definition. It uses an accuracy of 5 bits (32 states),
-and support a maximum `N` of 28, allowing offset values up to 536,870,908 .
+and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
If any sequence in the compressed block requires an offset larger than this,
it's not possible to use the default distribution to represent it.
@@ -920,7 +915,7 @@
__example__ :
Presuming an AccuracyLog of 8,
and presuming 100 probabilities points have already been distributed,
- the decoder may discover value from `0` to `255 - 100 + 1 == 156` (included).
+ the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
Therefore, it must read `log2sup(156) == 8` bits.
- Value decoded : small values use 1 less bit :
@@ -946,7 +941,7 @@
It means value `0` becomes negative probability `-1`.
`-1` is a special probability, which means `less than 1`.
-Its effect on distribution table is described in a later paragraph.
+Its effect on distribution table is described in next paragraph.
For the purpose of calculating cumulated distribution, it counts as one.
When a symbol has a probability of `zero`,
@@ -988,8 +983,10 @@
Cell allocation is spreaded, not linear :
each successor position follow this rule :
-`position += (tableSize>>1) + (tableSize>>3) + 3);
-position &= tableSize-1;`
+```
+position += (tableSize>>1) + (tableSize>>3) + 3;
+position &= tableSize-1;
+```
A position is skipped if already occupied,
typically by a "less than 1" probability symbol.
@@ -999,11 +996,11 @@
To get the Number of bits and baseline required for next state,
it's first necessary to sort all states in their natural order.
-The lower states will need 1 bit more than higher ones.
+The lower states will need 1 more bit than higher ones.
__Example__ :
Presuming a symbol has a probability of 5.
-It receives 5 state values.
+It receives 5 state values. States are sorted in natural order.
Next power of 2 is 8.
Space of probabilities is divided into 8 equal parts.
@@ -1034,8 +1031,8 @@
It is therefore necessary to know the bitstream size,
which is deducted from compressed block size.
-The exact last bit of the stream is followed by a set-bit-flag.
-The highest bit of last byte is this flag.
+The bit of the stream is followed by a set-bit-flag.
+Highest bit of last byte is this flag.
It does not belong to the useful part of the bitstream.
Therefore, last byte has 0-7 useful bits.
Note that it also means that last byte cannot be `0`.