:mod:`lzma` --- Compression using the LZMA algorithm
====================================================

.. module:: lzma
   :synopsis: A Python wrapper for the liblzma compression library.
.. moduleauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>
.. sectionauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>

.. versionadded:: 3.3


This module provides classes and convenience functions for compressing and
decompressing data using the LZMA compression algorithm. Also included is a file
interface supporting the ``.xz`` and legacy ``.lzma`` file formats used by the
:program:`xz` utility, as well as raw compressed streams.

For related file formats, see the :mod:`bz2`, :mod:`gzip`, :mod:`zipfile`, and
:mod:`tarfile` modules.

The interface provided by this module is very similar to that of the :mod:`bz2`
module. However, note that :class:`LZMAFile` is *not* thread-safe, unlike
:class:`bz2.BZ2File`, so if you need to use a single :class:`LZMAFile` instance
from multiple threads, it is necessary to protect it with a lock.

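As a minimal sketch of the locking advice above, the following shares one
:class:`LZMAFile` between several writer threads and guards every call with a
:class:`threading.Lock`. The scratch path, the ``worker`` helper and the thread
count are illustrative assumptions, not part of the module.

```python
import lzma
import os
import tempfile
import threading

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "shared.xz")

lock = threading.Lock()  # guards every call made on the shared LZMAFile


def worker(lzf, line, count):
    """Write *line* to the shared file *count* times, one locked call at a time."""
    for _ in range(count):
        with lock:
            lzf.write(line)


with lzma.LZMAFile(path, "w") as lzf:
    threads = [
        threading.Thread(target=worker, args=(lzf, b"thread %d\n" % n, 100))
        for n in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Read the file back to check that no write was lost or interleaved.
with lzma.LZMAFile(path) as lzf:
    lines = lzf.read().splitlines()
```

The lock serializes individual calls only; the order in which the threads'
lines end up in the file is still unspecified.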

.. exception:: LZMAError

   This exception is raised when an error occurs during compression or
   decompression, or while initializing the compressor/decompressor state.


Reading and writing compressed files
------------------------------------

.. class:: LZMAFile(filename=None, mode="r", \*, fileobj=None, format=None, check=-1, preset=None, filters=None)

   Open an LZMA-compressed file.

   An :class:`LZMAFile` can wrap an existing :term:`file object` (given by
   *fileobj*), or operate directly on a named file (named by *filename*).
   Exactly one of these two parameters should be provided. If *fileobj* is
   provided, it is not closed when the :class:`LZMAFile` is closed.

   The *mode* argument can be either ``"r"`` for reading (default), ``"w"`` for
   overwriting, or ``"a"`` for appending. If *fileobj* is provided, a mode of
   ``"w"`` does not truncate the file, and is instead equivalent to ``"a"``.

   When opening a file for reading, the input file may be the concatenation of
   multiple separate compressed streams. These are transparently decoded as a
   single logical stream.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   :class:`LZMAFile` supports all the members specified by
   :class:`io.BufferedIOBase`, except for :meth:`detach` and :meth:`truncate`.
   Iteration and the :keyword:`with` statement are supported.

   The following method is also provided:

   .. method:: peek(size=-1)

      Return buffered data without advancing the file position. At least one
      byte of data will be returned, unless EOF has been reached. The exact
      number of bytes returned is unspecified (the *size* argument is ignored).

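The multi-stream behaviour and :meth:`peek` described above can be sketched as
follows. The scratch path is a hypothetical name chosen for the illustration.
Appending with mode ``"a"`` is assumed to add a second, separate stream after
the first one.

```python
import lzma
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "multi.xz")

# Each write session below produces its own complete compressed stream.
with lzma.LZMAFile(path, "w") as f:
    f.write(b"first stream\n")
with lzma.LZMAFile(path, "a") as f:
    f.write(b"second stream\n")

with lzma.LZMAFile(path) as f:
    head = f.peek()   # some buffered bytes; the file position does not move
    lines = list(f)   # iteration yields lines from both streams in order
```

Because the number of bytes returned by :meth:`peek` is unspecified, code
should rely only on ``head`` being a non-empty prefix of the remaining data.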

Compressing and decompressing data in memory
--------------------------------------------

.. class:: LZMACompressor(format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Create a compressor object, which can be used to compress data incrementally.

   For a more convenient way of compressing a single chunk of data, see
   :func:`compress`.

   The *format* argument specifies what container format should be used.
   Possible values are:

   * :const:`FORMAT_XZ`: The ``.xz`` container format.
     This is the default format.

   * :const:`FORMAT_ALONE`: The legacy ``.lzma`` container format.
     This format is more limited than ``.xz`` -- it does not support integrity
     checks or multiple filters.

   * :const:`FORMAT_RAW`: A raw data stream, not using any container format.
     This format specifier does not support integrity checks, and requires that
     you always specify a custom filter chain (for both compression and
     decompression). Additionally, data compressed in this manner cannot be
     decompressed using :const:`FORMAT_AUTO` (see :class:`LZMADecompressor`).

   The *check* argument specifies the type of integrity check to include in the
   compressed data. This check is used when decompressing, to ensure that the
   data has not been corrupted. Possible values are:

   * :const:`CHECK_NONE`: No integrity check.
     This is the default (and the only acceptable value) for
     :const:`FORMAT_ALONE` and :const:`FORMAT_RAW`.

   * :const:`CHECK_CRC32`: 32-bit Cyclic Redundancy Check.

   * :const:`CHECK_CRC64`: 64-bit Cyclic Redundancy Check.
     This is the default for :const:`FORMAT_XZ`.

   * :const:`CHECK_SHA256`: 256-bit Secure Hash Algorithm.

   If the specified check is not supported, an :class:`LZMAError` is raised.

   The compression settings can be specified either as a preset compression
   level (with the *preset* argument), or in detail as a custom filter chain
   (with the *filters* argument).

   The *preset* argument (if provided) should be an integer between ``0`` and
   ``9`` (inclusive), optionally OR-ed with the constant
   :const:`PRESET_EXTREME`. If neither *preset* nor *filters* is given, the
   default behavior is to use :const:`PRESET_DEFAULT` (preset level ``6``).
   Higher presets produce smaller output, but make the compression process
   slower.

   .. note::

      In addition to being more CPU-intensive, compression with higher presets
      also requires much more memory (and produces output that needs more memory
      to decompress). With preset ``9`` for example, the overhead for an
      :class:`LZMACompressor` object can be as high as 800 MiB. For this reason,
      it is generally best to stick with the default preset.

   The *filters* argument (if provided) should be a filter chain specifier.
   See :ref:`filter-chain-specs` for details.

   .. method:: compress(data)

      Compress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing compressed data for at least part of the input. Some of
      *data* may be buffered internally, for use in later calls to
      :meth:`compress` and :meth:`flush`. The returned data should be
      concatenated with the output of any previous calls to :meth:`compress`.

   .. method:: flush()

      Finish the compression process, returning a :class:`bytes` object
      containing any data stored in the compressor's internal buffers.

      The compressor cannot be used after this method has been called.

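To make the *format*, *check* and *preset* options above concrete, here is a
small sketch that runs the same chunks through three differently configured
compressors. The ``compress_chunks`` helper and the sample data are assumptions
made for the illustration, and :const:`CHECK_SHA256` is assumed to be supported
by the local :program:`liblzma` build.

```python
import lzma

chunks = [b"alpha " * 1000, b"beta " * 1000, b"gamma " * 1000]


def compress_chunks(chunks, **options):
    """Feed chunks through one LZMACompressor and collect all of its output."""
    lzc = lzma.LZMACompressor(**options)
    parts = [lzc.compress(chunk) for chunk in chunks]
    parts.append(lzc.flush())  # flush() finishes the stream; lzc is now spent
    return b"".join(parts)


# Defaults: .xz container with a CRC64 integrity check.
xz_default = compress_chunks(chunks)

# .xz container with a SHA-256 check and an "extreme" variant of preset 3.
xz_sha256 = compress_chunks(chunks, check=lzma.CHECK_SHA256,
                            preset=3 | lzma.PRESET_EXTREME)

# Legacy .lzma container, which carries no integrity check.
legacy = compress_chunks(chunks, format=lzma.FORMAT_ALONE)

original = b"".join(chunks)
```

All three results decompress back to ``original`` with :func:`decompress` and
its default :const:`FORMAT_AUTO`.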

.. class:: LZMADecompressor(format=FORMAT_AUTO, memlimit=None, filters=None)

   Create a decompressor object, which can be used to decompress data
   incrementally.

   For a more convenient way of decompressing an entire compressed stream at
   once, see :func:`decompress`.

   The *format* argument specifies the container format that should be used. The
   default is :const:`FORMAT_AUTO`, which can decompress both ``.xz`` and
   ``.lzma`` files. Other possible values are :const:`FORMAT_XZ`,
   :const:`FORMAT_ALONE`, and :const:`FORMAT_RAW`.

   The *memlimit* argument specifies a limit (in bytes) on the amount of memory
   that the decompressor can use. When this argument is used, decompression will
   fail with an :class:`LZMAError` if it is not possible to decompress the input
   within the given memory limit.

   The *filters* argument specifies the filter chain that was used to create
   the stream being decompressed. This argument is required if *format* is
   :const:`FORMAT_RAW`, but should not be used for other formats.
   See :ref:`filter-chain-specs` for more information about filter chains.

   .. note::
      This class does not transparently handle inputs containing multiple
      compressed streams, unlike :func:`decompress` and :class:`LZMAFile`. To
      decompress a multi-stream input with :class:`LZMADecompressor`, you must
      create a new decompressor for each stream.

   .. method:: decompress(data)

      Decompress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing the decompressed data for at least part of the input.
      Some of *data* may be buffered internally, for use in later calls to
      :meth:`decompress`. The returned data should be concatenated with the
      output of any previous calls to :meth:`decompress`.

   .. attribute:: check

      The ID of the integrity check used by the input stream. This may be
      :const:`CHECK_UNKNOWN` until enough of the input has been decoded to
      determine what integrity check it uses.

   .. attribute:: eof

      ``True`` if the end-of-stream marker has been reached.

   .. attribute:: unused_data

      Data found after the end of the compressed stream.

      Before the end of the stream is reached, this will be ``b""``.

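The note about multi-stream input can be sketched with the :attr:`eof` and
:attr:`unused_data` attributes: once a stream ends, whatever follows it is
handed to a fresh decompressor. The ``decompress_all`` helper is a hypothetical
name, and the sketch assumes the input holds nothing but complete streams.

```python
import lzma


def decompress_all(data):
    """Decompress every stream in data, using one LZMADecompressor per stream."""
    results = []
    while data:
        lzd = lzma.LZMADecompressor()
        results.append(lzd.decompress(data))
        if not lzd.eof:
            raise lzma.LZMAError("compressed data ended before the "
                                 "end-of-stream marker was reached")
        data = lzd.unused_data  # bytes that follow the finished stream
    return b"".join(results)


two_streams = lzma.compress(b"first ") + lzma.compress(b"second")
combined = decompress_all(two_streams)
```

A single decompressor given ``two_streams`` would return only ``b"first "`` and
leave the second stream in :attr:`unused_data`.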

.. function:: compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Compress *data* (a :class:`bytes` object), returning the compressed data as a
   :class:`bytes` object.

   See :class:`LZMACompressor` above for a description of the *format*, *check*,
   *preset* and *filters* arguments.


.. function:: decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)

   Decompress *data* (a :class:`bytes` object), returning the uncompressed data
   as a :class:`bytes` object.

   If *data* is the concatenation of multiple distinct compressed streams,
   decompress all of these streams, and return the concatenation of the results.

   See :class:`LZMADecompressor` above for a description of the *format*,
   *memlimit* and *filters* arguments.

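A brief sketch of the two behaviours just described: concatenated streams are
decompressed in one call, and a *memlimit* that is too small makes the call
fail with :class:`LZMAError`. The 1 KiB limit is deliberately unrealistic and
was picked only to trigger the failure path.

```python
import lzma

payload = b"x" * 100000
blob = lzma.compress(payload) + lzma.compress(b"tail")

# Both streams come back as a single bytes object.
restored = lzma.decompress(blob)

# A 1 KiB limit is far below what the default preset needs to decompress.
try:
    lzma.decompress(blob, memlimit=1024)
    limit_hit = False
except lzma.LZMAError:
    limit_hit = True
```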

Miscellaneous
-------------

.. function:: check_is_supported(check)

   Return true if the given integrity check is supported on this system.

   :const:`CHECK_NONE` and :const:`CHECK_CRC32` are always supported.
   :const:`CHECK_CRC64` and :const:`CHECK_SHA256` may be unavailable if you are
   using a version of :program:`liblzma` that was compiled with a limited
   feature set.


.. _filter-chain-specs:

Specifying custom filter chains
-------------------------------

A filter chain specifier is a sequence of dictionaries, where each dictionary
contains the ID and options for a single filter. Each dictionary must contain
the key ``"id"``, and may contain additional keys to specify filter-dependent
options. Valid filter IDs are as follows:

* Compression filters:

  * :const:`FILTER_LZMA1` (for use with :const:`FORMAT_ALONE`)
  * :const:`FILTER_LZMA2` (for use with :const:`FORMAT_XZ` and :const:`FORMAT_RAW`)

* Delta filter:

  * :const:`FILTER_DELTA`

* Branch-Call-Jump (BCJ) filters:

  * :const:`FILTER_X86`
  * :const:`FILTER_IA64`
  * :const:`FILTER_ARM`
  * :const:`FILTER_ARMTHUMB`
  * :const:`FILTER_POWERPC`
  * :const:`FILTER_SPARC`

A filter chain can consist of up to 4 filters, and cannot be empty. The last
filter in the chain must be a compression filter, and any other filters must be
delta or BCJ filters.

Compression filters support the following options (specified as additional
entries in the dictionary representing the filter):

   * ``preset``: A compression preset to use as a source of default values for
     options that are not specified explicitly.
   * ``dict_size``: Dictionary size in bytes. This should be between 4 KiB and
     1.5 GiB (inclusive).
   * ``lc``: Number of literal context bits.
   * ``lp``: Number of literal position bits. The sum ``lc + lp`` must be at
     most 4.
   * ``pb``: Number of position bits; must be at most 4.
   * ``mode``: :const:`MODE_FAST` or :const:`MODE_NORMAL`.
   * ``nice_len``: What should be considered a "nice length" for a match.
     This should be 273 or less.
   * ``mf``: What match finder to use -- :const:`MF_HC3`, :const:`MF_HC4`,
     :const:`MF_BT2`, :const:`MF_BT3`, or :const:`MF_BT4`.
   * ``depth``: Maximum search depth used by match finder. 0 (default) means to
     select automatically based on other filter options.

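As an illustration of the options listed above, the following sketch builds a
one-filter chain and uses it with :const:`FORMAT_RAW`, which needs the same
chain for compression and decompression. The option values are arbitrary picks
inside the documented ranges, not recommendations.

```python
import lzma

# A single compression filter; every value here is an illustrative choice.
raw_filters = [
    {
        "id": lzma.FILTER_LZMA2,
        "preset": 6,            # source of defaults for unspecified options
        "dict_size": 1 << 20,   # 1 MiB dictionary
        "lc": 3,
        "lp": 1,                # lc + lp == 4, the documented maximum
        "mode": lzma.MODE_NORMAL,
        "mf": lzma.MF_BT4,
    },
]

data = b"The quick brown fox jumps over the lazy dog. " * 200

# A raw stream has no header, so the chain must be supplied on both sides.
packed = lzma.compress(data, format=lzma.FORMAT_RAW, filters=raw_filters)
unpacked = lzma.decompress(packed, format=lzma.FORMAT_RAW, filters=raw_filters)
```
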
The delta filter stores the differences between bytes, producing more repetitive
input for the compressor in certain circumstances. It supports one option,
``dist``. This indicates the distance between bytes to be subtracted. The
default is 1, i.e. take the differences between adjacent bytes.

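To show what "differences between bytes" means, the sketch below first models
the transformation in pure Python and then applies the real filter to a run of
32-bit counters. The ``delta_encode`` helper is a hypothetical stand-in for
illustration, not the code :program:`liblzma` runs, and the sample data is
made up.

```python
import lzma
import struct


def delta_encode(data, dist=1):
    """Illustration only: subtract the byte *dist* positions back, modulo 256."""
    return bytes(
        (b - (data[i - dist] if i >= dist else 0)) % 256
        for i, b in enumerate(data)
    )


# The first byte is kept; the rest become the differences 2, 3 and 4.
preview = delta_encode(bytes([10, 12, 15, 19]))

# 32-bit little-endian counters: with dist=4 each byte is compared with the
# byte in the same position of the previous value.
samples = b"".join(struct.pack("<I", 1000 + 7 * n) for n in range(5000))

delta_chain = [
    {"id": lzma.FILTER_DELTA, "dist": 4},
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]
with_delta = lzma.compress(samples, filters=delta_chain)
without_delta = lzma.compress(samples)
```

Comparing ``len(with_delta)`` with ``len(without_delta)`` shows whether the
filter helps for a given kind of data; it is not guaranteed to.
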
The BCJ filters are intended to be applied to machine code. They convert
relative branches, calls and jumps in the code to use absolute addressing, with
the aim of increasing the redundancy that can be exploited by the compressor.
These filters support one option, ``start_offset``. This specifies the address
that should be mapped to the beginning of the input data. The default is 0.


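A BCJ filter is placed in front of the compression filter in the same way as
the delta filter. The following is a minimal sketch for x86 code; the byte
string is a made-up stand-in for real machine code, and the x86 filter is
assumed to be available in the local :program:`liblzma` build.

```python
import lzma

bcj_chain = [
    {"id": lzma.FILTER_X86, "start_offset": 0},  # 0 is also the default
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]

# Stand-in bytes; in practice this would be the contents of an x86 executable.
machine_code = bytes([0xE8, 0x10, 0x00, 0x00, 0x00, 0x90]) * 500

packed_code = lzma.compress(machine_code, filters=bcj_chain)
```

The filter chain is recorded in the ``.xz`` container, so
``lzma.decompress(packed_code)`` needs no *filters* argument.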

Examples
--------

Reading in a compressed file::

   import lzma
   with lzma.LZMAFile("file.xz") as f:
       file_content = f.read()

Creating a compressed file::

   import lzma
   data = b"Insert Data Here"
   with lzma.LZMAFile("file.xz", "w") as f:
       f.write(data)

Compressing data in memory::

   import lzma
   data_in = b"Insert Data Here"
   data_out = lzma.compress(data_in)

Incremental compression::

   import lzma
   lzc = lzma.LZMACompressor()
   out1 = lzc.compress(b"Some data\n")
   out2 = lzc.compress(b"Another piece of data\n")
   out3 = lzc.compress(b"Even more data\n")
   out4 = lzc.flush()
   # Concatenate all the partial results:
   result = b"".join([out1, out2, out3, out4])

Writing compressed data to an already-open file::

   import lzma
   with open("file.xz", "wb") as f:
       f.write(b"This data will not be compressed\n")
       with lzma.LZMAFile(fileobj=f, mode="w") as lzf:
           lzf.write(b"This *will* be compressed\n")
       f.write(b"Not compressed\n")

Creating a compressed file using a custom filter chain::

   import lzma
   my_filters = [
       {"id": lzma.FILTER_DELTA, "dist": 5},
       {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
   ]
   with lzma.LZMAFile("file.xz", "w", filters=my_filters) as f:
       f.write(b"blah blah blah")