yann.collet.73@gmail.com | 5748f62 | 2014-01-07 18:47:50 +0000 | [diff] [blame] | 1 | LZ4 Format Description |
| 2 | Last revised: 2012-02-27 |
| 3 | Author : Y. Collet |
| 4 | |
| 5 | |
| 6 | |
| 7 | This small specification intents to provide enough information |
| 8 | to anyone willing to produce LZ4-compatible compressed data blocks |
| 9 | using any programming language. |
| 10 | |
| 11 | LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding. |
| 12 | The most important design principle behind LZ4 is simplicity. |
| 13 | It helps to create an easy to read and maintain source code. |
| 14 | It also helps later on for optimisations, compactness, and speed. |
| 15 | There is no entropy encoder backend nor framing layer. |
| 16 | The latter is assumed to be handled by other parts of the system. |
| 17 | |
| 18 | This document only describes the format, |
| 19 | not how the LZ4 compressor nor decompressor actually work. |
| 20 | The correctness of the decompressor should not depend |
| 21 | on implementation details of the compressor, and vice versa. |
| 22 | |
| 23 | |
| 24 | |
| 25 | -- Compressed block format -- |
| 26 | |
| 27 | An LZ4 compressed block is composed of sequences. |
| 28 | Schematically, a sequence is a suite of literals, followed by a match copy. |
| 29 | |
| 30 | Each sequence starts with a token. |
| 31 | The token is a one byte value, separated into two 4-bits fields. |
| 32 | Therefore each field ranges from 0 to 15. |
| 33 | |
| 34 | |
| 35 | The first field uses the 4 high-bits of the token. |
| 36 | It provides the length of literals to follow. |
| 37 | (Note : a literal is a not-compressed byte). |
| 38 | If the field value is 0, then there is no literal. |
| 39 | If it is 15, then we need to add some more bytes to indicate the full length. |
| 40 | Each additionnal byte then represent a value from 0 to 255, |
| 41 | which is added to the previous value to produce a total length. |
| 42 | When the byte value is 255, another byte is output. |
| 43 | There can be any number of bytes following the token. There is no "size limit". |
| 44 | (Sidenote this is why a not-compressible input block is expanded by 0.4%). |
| 45 | |
| 46 | Example 1 : A length of 48 will be represented as : |
| 47 | - 15 : value for the 4-bits High field |
| 48 | - 33 : (=48-15) remaining length to reach 48 |
| 49 | |
| 50 | Example 2 : A length of 280 will be represented as : |
| 51 | - 15 : value for the 4-bits High field |
| 52 | - 255 : following byte is maxed, since 280-15 >= 255 |
| 53 | - 10 : (=280 - 15 - 255) ) remaining length to reach 280 |
| 54 | |
| 55 | Example 3 : A length of 15 will be represented as : |
| 56 | - 15 : value for the 4-bits High field |
| 57 | - 0 : (=15-15) yes, the zero must be output |
| 58 | |
| 59 | Following the token and optional length bytes, are the literals themselves. |
| 60 | They are exactly as numerous as previously decoded (length of literals). |
| 61 | It's possible that there are zero literal. |
| 62 | |
| 63 | |
| 64 | Following the literals is the match copy operation. |
| 65 | |
| 66 | It starts by the offset. |
| 67 | This is a 2 bytes value, in little endian format. |
| 68 | |
| 69 | The offset represents the position of the match to be copied from. |
| 70 | 1 means "current position - 1 byte". |
| 71 | The maximum offset value is 65535, 65536 cannot be coded. |
| 72 | Note that 0 is an invalid value, not used. |
| 73 | |
| 74 | Then we need to extract the match length. |
| 75 | For this, we use the second token field, the low 4-bits. |
| 76 | Value, obviously, ranges from 0 to 15. |
| 77 | However here, 0 means that the copy operation will be minimal. |
| 78 | The minimum length of a match, called minmatch, is 4. |
| 79 | As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes. |
| 80 | Similar to literal length, on reaching the highest possible value (15), |
| 81 | we output additional bytes, one at a time, with values ranging from 0 to 255. |
| 82 | They are added to total to provide the final match length. |
| 83 | A 255 value means there is another byte to read and add. |
| 84 | There is no limit to the number of optional bytes that can be output this way. |
| 85 | (This points towards a maximum achievable compression ratio of ~250). |
| 86 | |
| 87 | With the offset and the matchlength, |
| 88 | the decoder can now proceed to copy the data from the already decoded buffer. |
| 89 | On decoding the matchlength, we reach the end of the compressed sequence, |
| 90 | and therefore start another one. |
| 91 | |
| 92 | |
| 93 | -- Parsing restrictions -- |
| 94 | |
| 95 | There are specific parsing rules to respect in order to remain compatible |
| 96 | with assumptions made by the decoder : |
| 97 | 1) The last 5 bytes are always literals |
| 98 | 2) The last match must start at least 12 bytes before end of block |
| 99 | Consequently, a block with less than 13 bytes cannot be compressed. |
| 100 | These rules are in place to ensure that the decoder |
| 101 | will never read beyond the input buffer, nor write beyond the output buffer. |
| 102 | |
| 103 | Note that the last sequence is also incomplete, |
| 104 | and stops right after literals. |
| 105 | |
| 106 | |
| 107 | -- Additional notes -- |
| 108 | |
| 109 | There is no assumption nor limits to the way the compressor |
| 110 | searches and selects matches within the source data block. |
| 111 | It could be a fast scan, a multi-probe, a full search using BST, |
| 112 | standard hash chains or MMC, well whatever. |
| 113 | |
| 114 | Advanced parsing strategies can also be implemented, such as lazy match, |
| 115 | or full optimal parsing. |
| 116 | |
| 117 | All these trade-off offer distinctive speed/memory/compression advantages. |
| 118 | Whatever the method used by the compressor, its result will be decodable |
| 119 | by any LZ4 decoder if it follows the format specification described above. |
| 120 | |