:mod:`lzma` --- Compression using the LZMA algorithm
====================================================

.. module:: lzma
   :synopsis: A Python wrapper for the liblzma compression library.
.. moduleauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>
.. sectionauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>

.. versionadded:: 3.3


This module provides classes and convenience functions for compressing and
decompressing data using the LZMA compression algorithm. Also included is a file
interface supporting the ``.xz`` and legacy ``.lzma`` file formats used by the
:program:`xz` utility, as well as raw compressed streams.

The interface provided by this module is very similar to that of the :mod:`bz2`
module. However, note that :class:`LZMAFile` is *not* thread-safe, unlike
:class:`bz2.BZ2File`, so if you need to use a single :class:`LZMAFile` instance
from multiple threads, it is necessary to protect it with a lock.


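The locking mentioned above could look roughly like the following sketch. It
assumes a single ``threading.Lock`` shared by all writer threads;
``write_record`` is a hypothetical helper, not part of this module.

```python
import io
import lzma
import threading

buf = io.BytesIO()
lock = threading.Lock()


def write_record(lzf, record):
    # Serialize access: only one thread may touch the LZMAFile at a time.
    with lock:
        lzf.write(record)


with lzma.LZMAFile(buf, "w") as lzf:
    threads = [
        threading.Thread(target=write_record, args=(lzf, b"record %d\n" % i))
        for i in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# The order in which the threads ran is not fixed, so sort before comparing.
records = sorted(lzma.decompress(buf.getvalue()).splitlines())
```

The lock only guarantees that individual calls do not interleave; the order of
the records still depends on thread scheduling.
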
.. exception:: LZMAError

   This exception is raised when an error occurs during compression or
   decompression, or while initializing the compressor/decompressor state.


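As an illustration, one way to provoke and handle :class:`LZMAError` is to pass
an incomplete stream to :func:`decompress`. The payload and the number of bytes
cut off are arbitrary choices made for this sketch.

```python
import lzma

payload = lzma.compress(b"example payload")

# Drop the tail of the stream so that it can no longer be decoded completely.
truncated = payload[:-8]

try:
    lzma.decompress(truncated)
except lzma.LZMAError as exc:
    # Keep the message for logging or reporting.
    error_message = str(exc)
else:
    error_message = None
```
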
Reading and writing compressed files
------------------------------------

.. function:: open(filename, mode="rb", \*, format=None, check=-1, preset=None, filters=None, encoding=None, errors=None, newline=None)

   Open an LZMA-compressed file in binary or text mode, returning a :term:`file
   object`.

   The *filename* argument can be either an actual file name (given as a
   :class:`str` or :class:`bytes` object), in which case the named file is
   opened, or it can be an existing file object to read from or write to.

   The *mode* argument can be any of ``"r"``, ``"rb"``, ``"w"``, ``"wb"``,
   ``"x"``, ``"xb"``, ``"a"`` or ``"ab"`` for binary mode, or ``"rt"``,
   ``"wt"``, ``"xt"``, or ``"at"`` for text mode. The default is ``"rb"``.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   For binary mode, this function is equivalent to the :class:`LZMAFile`
   constructor: ``LZMAFile(filename, mode, ...)``. In this case, the *encoding*,
   *errors* and *newline* arguments must not be provided.

   For text mode, a :class:`LZMAFile` object is created, and wrapped in an
   :class:`io.TextIOWrapper` instance with the specified encoding, error
   handling behavior, and line ending(s).

   .. versionchanged:: 3.4
      Added support for the ``"x"``, ``"xb"`` and ``"xt"`` modes.


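A short sketch of the text modes of :func:`open`, written against a temporary
directory. The file name ``notes.txt.xz`` is made up for the example.

```python
import lzma
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "notes.txt.xz")  # hypothetical file name

    # "wt" wraps the LZMAFile in an io.TextIOWrapper that encodes str to bytes.
    with lzma.open(path, "wt", encoding="utf-8", newline="\n") as f:
        f.write("first line\n")
        f.write("second line\n")

    # "rt" decodes the decompressed bytes back into str.
    with lzma.open(path, "rt", encoding="utf-8") as f:
        lines = f.read().splitlines()

    # "xt" (exclusive creation) refuses to overwrite the existing file.
    try:
        lzma.open(path, "xt", encoding="utf-8")
    except FileExistsError:
        exclusive_refused = True
    else:
        exclusive_refused = False
```
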
.. class:: LZMAFile(filename=None, mode="r", \*, format=None, check=-1, preset=None, filters=None)

   Open an LZMA-compressed file in binary mode.

   An :class:`LZMAFile` can wrap an already-open :term:`file object`, or operate
   directly on a named file. The *filename* argument specifies either the file
   object to wrap, or the name of the file to open (as a :class:`str` or
   :class:`bytes` object). When wrapping an existing file object, the wrapped
   file will not be closed when the :class:`LZMAFile` is closed.

   The *mode* argument can be either ``"r"`` for reading (default), ``"w"`` for
   overwriting, ``"x"`` for exclusive creation, or ``"a"`` for appending. These
   can equivalently be given as ``"rb"``, ``"wb"``, ``"xb"`` and ``"ab"``
   respectively.

   If *filename* is a file object (rather than an actual file name), a mode of
   ``"w"`` does not truncate the file, and is instead equivalent to ``"a"``.

   When opening a file for reading, the input file may be the concatenation of
   multiple separate compressed streams. These are transparently decoded as a
   single logical stream.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   :class:`LZMAFile` supports all the members specified by
   :class:`io.BufferedIOBase`, except for :meth:`detach` and :meth:`truncate`.
   Iteration and the :keyword:`with` statement are supported.

   The following method is also provided:

   .. method:: peek(size=-1)

      Return buffered data without advancing the file position. At least one
      byte of data will be returned, unless EOF has been reached. The exact
      number of bytes returned is unspecified (the *size* argument is ignored).

      .. note:: While calling :meth:`peek` does not change the file position of
         the :class:`LZMAFile`, it may change the position of the underlying
         file object (e.g. if the :class:`LZMAFile` was constructed by passing a
         file object for *filename*).

   .. versionchanged:: 3.4
      Added support for the ``"x"`` and ``"xb"`` modes.

   .. versionchanged:: 3.5
      The :meth:`~io.BufferedIOBase.read` method now accepts an argument of
      ``None``.


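The behavior of :class:`LZMAFile` when it wraps a file object can be sketched
as follows, using an in-memory ``io.BytesIO`` as a stand-in for a real file.
The sketch shows that ``"w"`` does not truncate, that the wrapped object stays
open, and that concatenated streams are read back as one.

```python
import io
import lzma

buf = io.BytesIO()

# Each LZMAFile writes one complete .xz stream at the current position of buf.
with lzma.LZMAFile(buf, "w") as lzf:
    lzf.write(b"first stream\n")

# With a file object, "w" does not truncate: this stream follows the first.
with lzma.LZMAFile(buf, "w") as lzf:
    lzf.write(b"second stream\n")

# Closing the LZMAFile objects left the wrapped BytesIO open.
still_open = not buf.closed

buf.seek(0)
with lzma.LZMAFile(buf) as lzf:
    head = lzf.peek()      # look ahead without moving the LZMAFile position
    content = lzf.read()   # both streams, decoded as one logical stream
```

Because the length of the data returned by ``peek()`` is unspecified, code
should only rely on it being a non-empty prefix of what ``read()`` returns
next.
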
Compressing and decompressing data in memory
--------------------------------------------

.. class:: LZMACompressor(format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Create a compressor object, which can be used to compress data incrementally.

   For a more convenient way of compressing a single chunk of data, see
   :func:`compress`.

   The *format* argument specifies what container format should be used.
   Possible values are:

   * :const:`FORMAT_XZ`: The ``.xz`` container format.
     This is the default format.

   * :const:`FORMAT_ALONE`: The legacy ``.lzma`` container format.
     This format is more limited than ``.xz`` -- it does not support integrity
     checks or multiple filters.

   * :const:`FORMAT_RAW`: A raw data stream, not using any container format.
     This format specifier does not support integrity checks, and requires that
     you always specify a custom filter chain (for both compression and
     decompression). Additionally, data compressed in this manner cannot be
     decompressed using :const:`FORMAT_AUTO` (see :class:`LZMADecompressor`).

   The *check* argument specifies the type of integrity check to include in the
   compressed data. This check is used when decompressing, to ensure that the
   data has not been corrupted. Possible values are:

   * :const:`CHECK_NONE`: No integrity check.
     This is the default (and the only acceptable value) for
     :const:`FORMAT_ALONE` and :const:`FORMAT_RAW`.

   * :const:`CHECK_CRC32`: 32-bit Cyclic Redundancy Check.

   * :const:`CHECK_CRC64`: 64-bit Cyclic Redundancy Check.
     This is the default for :const:`FORMAT_XZ`.

   * :const:`CHECK_SHA256`: 256-bit Secure Hash Algorithm.

   If the specified check is not supported, an :class:`LZMAError` is raised.

   The compression settings can be specified either as a preset compression
   level (with the *preset* argument), or in detail as a custom filter chain
   (with the *filters* argument).

   The *preset* argument (if provided) should be an integer between ``0`` and
   ``9`` (inclusive), optionally OR-ed with the constant
   :const:`PRESET_EXTREME`. If neither *preset* nor *filters* are given, the
   default behavior is to use :const:`PRESET_DEFAULT` (preset level ``6``).
   Higher presets produce smaller output, but make the compression process
   slower.

   .. note::

      In addition to being more CPU-intensive, compression with higher presets
      also requires much more memory (and produces output that needs more memory
      to decompress). With preset ``9`` for example, the overhead for an
      :class:`LZMACompressor` object can be as high as 800 MiB. For this reason,
      it is generally best to stick with the default preset.

   The *filters* argument (if provided) should be a filter chain specifier.
   See :ref:`filter-chain-specs` for details.

   .. method:: compress(data)

      Compress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing compressed data for at least part of the input. Some of
      *data* may be buffered internally, for use in later calls to
      :meth:`compress` and :meth:`flush`. The returned data should be
      concatenated with the output of any previous calls to :meth:`compress`.

   .. method:: flush()

      Finish the compression process, returning a :class:`bytes` object
      containing any data stored in the compressor's internal buffers.

      The compressor cannot be used after this method has been called.


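The *format*, *check* and *preset* arguments might be combined as in the
sketch below. The low preset and the choice of :const:`CHECK_CRC32` are
arbitrary; they were picked only to keep the example light.

```python
import lzma

chunks = [b"alpha\n", b"beta\n", b"gamma\n"]

compressor = lzma.LZMACompressor(
    format=lzma.FORMAT_XZ,
    check=lzma.CHECK_CRC32,
    preset=1 | lzma.PRESET_EXTREME,   # preset level OR-ed with the extreme flag
)

# Collect the output of every compress() call, then the final flush().
parts = [compressor.compress(chunk) for chunk in chunks]
parts.append(compressor.flush())
compressed = b"".join(parts)

# The decompressor reports which integrity check the stream carries.
decompressor = lzma.LZMADecompressor()
restored = decompressor.decompress(compressed)
stream_check = decompressor.check
```
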
.. class:: LZMADecompressor(format=FORMAT_AUTO, memlimit=None, filters=None)

   Create a decompressor object, which can be used to decompress data
   incrementally.

   For a more convenient way of decompressing an entire compressed stream at
   once, see :func:`decompress`.

   The *format* argument specifies the container format that should be used. The
   default is :const:`FORMAT_AUTO`, which can decompress both ``.xz`` and
   ``.lzma`` files. Other possible values are :const:`FORMAT_XZ`,
   :const:`FORMAT_ALONE`, and :const:`FORMAT_RAW`.

   The *memlimit* argument specifies a limit (in bytes) on the amount of memory
   that the decompressor can use. When this argument is used, decompression will
   fail with an :class:`LZMAError` if it is not possible to decompress the input
   within the given memory limit.

   The *filters* argument specifies the filter chain that was used to create
   the stream being decompressed. This argument is required if *format* is
   :const:`FORMAT_RAW`, but should not be used for other formats.
   See :ref:`filter-chain-specs` for more information about filter chains.

   .. note::
      This class does not transparently handle inputs containing multiple
      compressed streams, unlike :func:`decompress` and :class:`LZMAFile`. To
      decompress a multi-stream input with :class:`LZMADecompressor`, you must
      create a new decompressor for each stream.

   .. method:: decompress(data, max_length=-1)

      Decompress *data* (a :term:`bytes-like object`), returning
      uncompressed data as bytes. Some of *data* may be buffered
      internally, for use in later calls to :meth:`decompress`. The
      returned data should be concatenated with the output of any
      previous calls to :meth:`decompress`.

      If *max_length* is nonnegative, returns at most *max_length*
      bytes of decompressed data. If this limit is reached and further
      output can be produced, the :attr:`~.needs_input` attribute will
      be set to ``False``. In this case, the next call to
      :meth:`~.decompress` may provide *data* as ``b''`` to obtain
      more of the output.

      If all of the input data was decompressed and returned (either
      because this was less than *max_length* bytes, or because
      *max_length* was negative), the :attr:`~.needs_input` attribute
      will be set to ``True``.

      Attempting to decompress data after the end of stream is reached
      raises an :exc:`EOFError`. Any data found after the end of the
      stream is ignored and saved in the :attr:`~.unused_data` attribute.

      .. versionchanged:: 3.5
         Added the *max_length* parameter.

   .. attribute:: check

      The ID of the integrity check used by the input stream. This may be
      :const:`CHECK_UNKNOWN` until enough of the input has been decoded to
      determine what integrity check it uses.

   .. attribute:: eof

      ``True`` if the end-of-stream marker has been reached.

   .. attribute:: unused_data

      Data found after the end of the compressed stream.

      Before the end of the stream is reached, this will be ``b""``.

   .. attribute:: needs_input

      ``False`` if the :meth:`.decompress` method can provide more
      decompressed data before requiring new compressed input.

      .. versionadded:: 3.5

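A minimal sketch of bounded-output decompression with *max_length* and
:attr:`~LZMADecompressor.needs_input`, as described for
:meth:`LZMADecompressor.decompress`. The 16 KiB limit and the sample data are
arbitrary values chosen for the example.

```python
import lzma

compressed = lzma.compress(b"x" * 100_000)

decompressor = lzma.LZMADecompressor()
pieces = []

# Feed the whole input once, but accept at most 16 KiB of output per call.
pieces.append(decompressor.decompress(compressed, max_length=16 * 1024))

# While needs_input is False, more output is available without new input,
# so pass b"" and keep draining until the end of the stream.
while not decompressor.eof and not decompressor.needs_input:
    pieces.append(decompressor.decompress(b"", max_length=16 * 1024))

restored = b"".join(pieces)
```

Limiting the output of each call keeps memory use predictable when a small
input expands to a very large result.
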
.. function:: compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Compress *data* (a :class:`bytes` object), returning the compressed data as a
   :class:`bytes` object.

   See :class:`LZMACompressor` above for a description of the *format*, *check*,
   *preset* and *filters* arguments.


.. function:: decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)

   Decompress *data* (a :class:`bytes` object), returning the uncompressed data
   as a :class:`bytes` object.

   If *data* is the concatenation of multiple distinct compressed streams,
   decompress all of these streams, and return the concatenation of the results.

   See :class:`LZMADecompressor` above for a description of the *format*,
   *memlimit* and *filters* arguments.


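The difference between :func:`decompress` and :class:`LZMADecompressor` for
multi-stream input, and the effect of *memlimit*, can be sketched as follows.
The per-stream loop is one possible approach, and the 1 KiB limit is
deliberately unrealistic so that it triggers the error.

```python
import lzma

stream_a = lzma.compress(b"stream A;")
stream_b = lzma.compress(b"stream B;")
combined = stream_a + stream_b

# The one-shot function decodes every stream and joins the results.
all_at_once = lzma.decompress(combined)

# LZMADecompressor stops at the end of the first stream, so a new object is
# needed for each following stream; unused_data holds the remaining input.
outputs = []
remaining = combined
while remaining:
    decompressor = lzma.LZMADecompressor()
    outputs.append(decompressor.decompress(remaining))
    remaining = decompressor.unused_data
stream_by_stream = b"".join(outputs)

# A deliberately tiny memlimit (1 KiB, chosen for illustration) makes
# decompression fail with LZMAError instead of allocating more memory.
try:
    lzma.decompress(combined, memlimit=1024)
except lzma.LZMAError:
    memlimit_hit = True
else:
    memlimit_hit = False
```
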
Miscellaneous
-------------

.. function:: is_check_supported(check)

   Return ``True`` if the given integrity check is supported on this system.

   :const:`CHECK_NONE` and :const:`CHECK_CRC32` are always supported.
   :const:`CHECK_CRC64` and :const:`CHECK_SHA256` may be unavailable if you are
   using a version of :program:`liblzma` that was compiled with a limited
   feature set.


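One possible use of :func:`is_check_supported` is to pick a check at run time
and fall back when a build of :program:`liblzma` lacks it.
``strongest_supported_check`` is a hypothetical helper, and its order of
preference is an assumption of this sketch.

```python
import lzma


def strongest_supported_check():
    """Return the first supported check ID, trying the largest checks first."""
    for check in (lzma.CHECK_SHA256, lzma.CHECK_CRC64, lzma.CHECK_CRC32):
        if lzma.is_check_supported(check):
            return check
    return lzma.CHECK_NONE


check = strongest_supported_check()
compressed = lzma.compress(b"payload", check=check)
restored = lzma.decompress(compressed)
```
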
.. _filter-chain-specs:

Specifying custom filter chains
-------------------------------

A filter chain specifier is a sequence of dictionaries, where each dictionary
contains the ID and options for a single filter. Each dictionary must contain
the key ``"id"``, and may contain additional keys to specify filter-dependent
options. Valid filter IDs are as follows:

* Compression filters:

  * :const:`FILTER_LZMA1` (for use with :const:`FORMAT_ALONE`)
  * :const:`FILTER_LZMA2` (for use with :const:`FORMAT_XZ` and :const:`FORMAT_RAW`)

* Delta filter:

  * :const:`FILTER_DELTA`

* Branch-Call-Jump (BCJ) filters:

  * :const:`FILTER_X86`
  * :const:`FILTER_IA64`
  * :const:`FILTER_ARM`
  * :const:`FILTER_ARMTHUMB`
  * :const:`FILTER_POWERPC`
  * :const:`FILTER_SPARC`

A filter chain can consist of up to 4 filters, and cannot be empty. The last
filter in the chain must be a compression filter, and any other filters must be
delta or BCJ filters.

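To illustrate the rules above, the following sketch builds a two-filter chain
for :const:`FORMAT_RAW` and a single-filter chain for :const:`FORMAT_ALONE`.
The sample data and the option values are arbitrary.

```python
import lzma

data = bytes(range(256)) * 64

# Delta filter first, compression filter last, as the chain rules require.
raw_chain = [
    {"id": lzma.FILTER_DELTA, "dist": 4},
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]

# A raw stream stores no header, so the same chain must be supplied again
# when decompressing.
raw = lzma.compress(data, format=lzma.FORMAT_RAW, filters=raw_chain)
from_raw = lzma.decompress(raw, format=lzma.FORMAT_RAW, filters=raw_chain)

# FORMAT_ALONE takes a chain consisting of a single LZMA1 filter.
alone_chain = [{"id": lzma.FILTER_LZMA1, "preset": 6}]
alone = lzma.compress(data, format=lzma.FORMAT_ALONE, filters=alone_chain)
from_alone = lzma.decompress(alone)   # FORMAT_AUTO recognizes .lzma input
```
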
Compression filters support the following options (specified as additional
entries in the dictionary representing the filter):

* ``preset``: A compression preset to use as a source of default values for
  options that are not specified explicitly.
* ``dict_size``: Dictionary size in bytes. This should be between 4 KiB and
  1.5 GiB (inclusive).
* ``lc``: Number of literal context bits.
* ``lp``: Number of literal position bits. The sum ``lc + lp`` must be at
  most 4.
* ``pb``: Number of position bits; must be at most 4.
* ``mode``: :const:`MODE_FAST` or :const:`MODE_NORMAL`.
* ``nice_len``: What should be considered a "nice length" for a match.
  This should be 273 or less.
* ``mf``: What match finder to use -- :const:`MF_HC3`, :const:`MF_HC4`,
  :const:`MF_BT2`, :const:`MF_BT3`, or :const:`MF_BT4`.
* ``depth``: Maximum search depth used by match finder. 0 (default) means to
  select automatically based on other filter options.

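A sketch of an LZMA2 filter dictionary that sets each of the options listed
above. The values are illustrative only and are not tuning recommendations.

```python
import lzma

lzma2_options = {
    "id": lzma.FILTER_LZMA2,
    "preset": 6,               # source of defaults for anything not set below
    "dict_size": 1 << 20,      # 1 MiB dictionary
    "lc": 3,                   # literal context bits
    "lp": 0,                   # literal position bits (lc + lp <= 4)
    "pb": 2,                   # position bits (<= 4)
    "mode": lzma.MODE_NORMAL,
    "nice_len": 64,            # must be 273 or less
    "mf": lzma.MF_BT4,         # match finder
    "depth": 0,                # 0 selects the search depth automatically
}

data = b"The quick brown fox jumps over the lazy dog. " * 200
compressed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=[lzma2_options])

# The .xz container records the filter chain, so no filters are needed here.
restored = lzma.decompress(compressed)
```
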
The delta filter stores the differences between bytes, producing more repetitive
input for the compressor in certain circumstances. It supports one option,
``dist``. This indicates the distance between bytes to be subtracted. The
default is 1, i.e. take the differences between adjacent bytes.

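As a rough illustration of the ``dist`` option, the sketch below compresses
made-up 16-bit samples with and without a delta filter whose distance matches
the sample width. Whether the filter helps depends entirely on the data, so
the sizes are printed rather than assumed.

```python
import lzma
import struct

# Hypothetical sample data: a rising ramp of 16-bit little-endian values.
samples = struct.pack("<2000H", *range(1000, 3000))

delta_chain = [
    {"id": lzma.FILTER_DELTA, "dist": 2},   # subtract the byte 2 positions back
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]

plain = lzma.compress(samples)
with_delta = lzma.compress(samples, filters=delta_chain)
restored = lzma.decompress(with_delta)

print(len(plain), len(with_delta))
```
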
The BCJ filters are intended to be applied to machine code. They convert
relative branches, calls and jumps in the code to use absolute addressing, with
the aim of increasing the redundancy that can be exploited by the compressor.
These filters support one option, ``start_offset``. This specifies the address
that should be mapped to the beginning of the input data. The default is 0.


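A BCJ filter is placed before the compression filter in the chain. The sketch
below uses :const:`FILTER_X86` and assumes the underlying :program:`liblzma`
was built with that filter. The input is placeholder bytes, not real machine
code, so it shows only how the chain is specified.

```python
import lzma

bcj_chain = [
    {"id": lzma.FILTER_X86, "start_offset": 0},   # 0 is also the default
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]

# Placeholder bytes standing in for real x86 machine code.
code_like = bytes([0xE8, 0x10, 0x00, 0x00, 0x00, 0x90]) * 500

compressed = lzma.compress(code_like, filters=bcj_chain)

# The filter is reversed on decompression, so the original bytes come back.
restored = lzma.decompress(compressed)
```
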
Examples
--------

Reading in a compressed file::

    import lzma
    with lzma.open("file.xz") as f:
        file_content = f.read()

Creating a compressed file::

    import lzma
    data = b"Insert Data Here"
    with lzma.open("file.xz", "w") as f:
        f.write(data)

Compressing data in memory::

    import lzma
    data_in = b"Insert Data Here"
    data_out = lzma.compress(data_in)

Incremental compression::

    import lzma
    lzc = lzma.LZMACompressor()
    out1 = lzc.compress(b"Some data\n")
    out2 = lzc.compress(b"Another piece of data\n")
    out3 = lzc.compress(b"Even more data\n")
    out4 = lzc.flush()
    # Concatenate all the partial results:
    result = b"".join([out1, out2, out3, out4])

Writing compressed data to an already-open file::

    import lzma
    with open("file.xz", "wb") as f:
        f.write(b"This data will not be compressed\n")
        with lzma.open(f, "w") as lzf:
            lzf.write(b"This *will* be compressed\n")
        f.write(b"Not compressed\n")

Creating a compressed file using a custom filter chain::

    import lzma
    my_filters = [
        {"id": lzma.FILTER_DELTA, "dist": 5},
        {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
    ]
    with lzma.open("file.xz", "w", filters=my_filters) as f:
        f.write(b"blah blah blah")