:mod:`lzma` --- Compression using the LZMA algorithm
====================================================

.. module:: lzma
   :synopsis: A Python wrapper for the liblzma compression library.

.. moduleauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>
.. sectionauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>

.. versionadded:: 3.3

**Source code:** :source:`Lib/lzma.py`

--------------

This module provides classes and convenience functions for compressing and
decompressing data using the LZMA compression algorithm. Also included is a file
interface supporting the ``.xz`` and legacy ``.lzma`` file formats used by the
:program:`xz` utility, as well as raw compressed streams.

The interface provided by this module is very similar to that of the :mod:`bz2`
module. However, note that :class:`LZMAFile` is *not* thread-safe, unlike
:class:`bz2.BZ2File`, so if you need to use a single :class:`LZMAFile` instance
from multiple threads, it is necessary to protect it with a lock.


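The locking requirement described above could be met along the following
lines. This is only a minimal sketch: the ``write_record`` helper, the
in-memory ``raw`` buffer and the record contents are illustrative assumptions,
not part of the module.

```python
import io
import lzma
import threading

# Assumption: an in-memory buffer stands in for a real file on disk.
raw = io.BytesIO()
lock = threading.Lock()

def write_record(lzf, record):
    # Hypothetical helper: serialize every access to the shared LZMAFile.
    with lock:
        lzf.write(record)

with lzma.LZMAFile(raw, "w") as lzf:
    threads = [
        threading.Thread(target=write_record, args=(lzf, b"record %d\n" % i))
        for i in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# The order of the records depends on thread scheduling.
lines = lzma.decompress(raw.getvalue()).splitlines()
```

Every read, write and seek on the shared object needs to take the same lock;
guarding only the writes would not be enough if other threads also read.
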
.. exception:: LZMAError

   This exception is raised when an error occurs during compression or
   decompression, or while initializing the compressor/decompressor state.


Reading and writing compressed files
------------------------------------

.. function:: open(filename, mode="rb", \*, format=None, check=-1, preset=None, filters=None, encoding=None, errors=None, newline=None)

   Open an LZMA-compressed file in binary or text mode, returning a :term:`file
   object`.

   The *filename* argument can be either an actual file name (given as a
   :class:`str` or :class:`bytes` object), in which case the named file is
   opened, or it can be an existing file object to read from or write to.

   The *mode* argument can be any of ``"r"``, ``"rb"``, ``"w"``, ``"wb"``,
   ``"x"``, ``"xb"``, ``"a"`` or ``"ab"`` for binary mode, or ``"rt"``,
   ``"wt"``, ``"xt"``, or ``"at"`` for text mode. The default is ``"rb"``.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   For binary mode, this function is equivalent to the :class:`LZMAFile`
   constructor: ``LZMAFile(filename, mode, ...)``. In this case, the *encoding*,
   *errors* and *newline* arguments must not be provided.

   For text mode, a :class:`LZMAFile` object is created, and wrapped in an
   :class:`io.TextIOWrapper` instance with the specified encoding, error
   handling behavior, and line ending(s).

   .. versionchanged:: 3.4
      Added support for the ``"x"``, ``"xb"`` and ``"xt"`` modes.


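As an illustration of the text modes of :func:`.open` described above, the
following sketch writes and reads back two lines of text. The in-memory
``buf`` object and the sample strings are assumptions made for the example; a
file name would work the same way.

```python
import io
import lzma

# Assumption: a BytesIO object stands in for a file on disk.
buf = io.BytesIO()

# "wt" wraps the LZMAFile in an io.TextIOWrapper, so str objects are written.
with lzma.open(buf, "wt", encoding="utf-8", newline="\n") as f:
    f.write("first line\n")
    f.write("second line\n")

# Rewind the underlying file object before reading the stream back.
buf.seek(0)
with lzma.open(buf, "rt", encoding="utf-8") as f:
    lines = f.readlines()
```

Passing *encoding*, *errors* or *newline* together with a binary mode such as
``"wb"`` is rejected, as noted in the description of the function.
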
.. class:: LZMAFile(filename=None, mode="r", \*, format=None, check=-1, preset=None, filters=None)

   Open an LZMA-compressed file in binary mode.

   An :class:`LZMAFile` can wrap an already-open :term:`file object`, or operate
   directly on a named file. The *filename* argument specifies either the file
   object to wrap, or the name of the file to open (as a :class:`str` or
   :class:`bytes` object). When wrapping an existing file object, the wrapped
   file will not be closed when the :class:`LZMAFile` is closed.

   The *mode* argument can be either ``"r"`` for reading (default), ``"w"`` for
   overwriting, ``"x"`` for exclusive creation, or ``"a"`` for appending. These
   can equivalently be given as ``"rb"``, ``"wb"``, ``"xb"`` and ``"ab"``
   respectively.

   If *filename* is a file object (rather than an actual file name), a mode of
   ``"w"`` does not truncate the file, and is instead equivalent to ``"a"``.

   When opening a file for reading, the input file may be the concatenation of
   multiple separate compressed streams. These are transparently decoded as a
   single logical stream.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   :class:`LZMAFile` supports all the members specified by
   :class:`io.BufferedIOBase`, except for :meth:`detach` and :meth:`truncate`.
   Iteration and the :keyword:`with` statement are supported.

   The following method is also provided:

   .. method:: peek(size=-1)

      Return buffered data without advancing the file position. At least one
      byte of data will be returned, unless EOF has been reached. The exact
      number of bytes returned is unspecified (the *size* argument is ignored).

      .. note:: While calling :meth:`peek` does not change the file position of
         the :class:`LZMAFile`, it may change the position of the underlying
         file object (e.g. if the :class:`LZMAFile` was constructed by passing a
         file object for *filename*).

   .. versionchanged:: 3.4
      Added support for the ``"x"`` and ``"xb"`` modes.

   .. versionchanged:: 3.5
      The :meth:`~io.BufferedIOBase.read` method now accepts an argument of
      ``None``.


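The behaviour of :class:`LZMAFile` with wrapped file objects and concatenated
streams can be sketched as follows. The ``buf`` object and the sample data are
assumptions made for illustration only.

```python
import io
import lzma

# Assumption: a BytesIO object plays the role of an already-open file.
buf = io.BytesIO()

# Each LZMAFile writes one complete stream. Because buf is a file object,
# mode "w" does not truncate it, so the second stream follows the first.
for chunk in (b"first stream\n", b"second stream\n"):
    with lzma.LZMAFile(buf, "w") as lzf:
        lzf.write(chunk)

# The wrapped object is still open after each LZMAFile has been closed.
buf.seek(0)
with lzma.LZMAFile(buf) as lzf:
    head = lzf.peek()       # buffered data; the LZMAFile position is unchanged
    content = lzf.read()    # both streams, decoded as one logical stream
```

Since the number of bytes returned by ``peek()`` is unspecified, code that
uses it should only rely on the result being a non-empty prefix of the
remaining data.
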
Compressing and decompressing data in memory
--------------------------------------------

.. class:: LZMACompressor(format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Create a compressor object, which can be used to compress data incrementally.

   For a more convenient way of compressing a single chunk of data, see
   :func:`compress`.

   The *format* argument specifies what container format should be used.
   Possible values are:

   * :const:`FORMAT_XZ`: The ``.xz`` container format.
      This is the default format.

   * :const:`FORMAT_ALONE`: The legacy ``.lzma`` container format.
      This format is more limited than ``.xz`` -- it does not support integrity
      checks or multiple filters.

   * :const:`FORMAT_RAW`: A raw data stream, not using any container format.
      This format specifier does not support integrity checks, and requires that
      you always specify a custom filter chain (for both compression and
      decompression). Additionally, data compressed in this manner cannot be
      decompressed using :const:`FORMAT_AUTO` (see :class:`LZMADecompressor`).

   The *check* argument specifies the type of integrity check to include in the
   compressed data. This check is used when decompressing, to ensure that the
   data has not been corrupted. Possible values are:

   * :const:`CHECK_NONE`: No integrity check.
     This is the default (and the only acceptable value) for
     :const:`FORMAT_ALONE` and :const:`FORMAT_RAW`.

   * :const:`CHECK_CRC32`: 32-bit Cyclic Redundancy Check.

   * :const:`CHECK_CRC64`: 64-bit Cyclic Redundancy Check.
     This is the default for :const:`FORMAT_XZ`.

   * :const:`CHECK_SHA256`: 256-bit Secure Hash Algorithm.

   If the specified check is not supported, an :class:`LZMAError` is raised.

   The compression settings can be specified either as a preset compression
   level (with the *preset* argument), or in detail as a custom filter chain
   (with the *filters* argument).

   The *preset* argument (if provided) should be an integer between ``0`` and
   ``9`` (inclusive), optionally OR-ed with the constant
   :const:`PRESET_EXTREME`. If neither *preset* nor *filters* are given, the
   default behavior is to use :const:`PRESET_DEFAULT` (preset level ``6``).
   Higher presets produce smaller output, but make the compression process
   slower.

   .. note::

      In addition to being more CPU-intensive, compression with higher presets
      also requires much more memory (and produces output that needs more memory
      to decompress). With preset ``9`` for example, the overhead for an
      :class:`LZMACompressor` object can be as high as 800 MiB. For this reason,
      it is generally best to stick with the default preset.

   The *filters* argument (if provided) should be a filter chain specifier.
   See :ref:`filter-chain-specs` for details.

   .. method:: compress(data)

      Compress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing compressed data for at least part of the input. Some of
      *data* may be buffered internally, for use in later calls to
      :meth:`compress` and :meth:`flush`. The returned data should be
      concatenated with the output of any previous calls to :meth:`compress`.

   .. method:: flush()

      Finish the compression process, returning a :class:`bytes` object
      containing any data stored in the compressor's internal buffers.

      The compressor cannot be used after this method has been called.


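To make the *format*, *check* and *preset* arguments of
:class:`LZMACompressor` concrete, here is a small sketch that selects them
explicitly. The particular choices (CRC32 and preset ``1``) and the sample
data are arbitrary assumptions, not recommendations.

```python
import lzma

# Explicit settings: .xz container, CRC32 integrity check, fast preset.
lzc = lzma.LZMACompressor(
    format=lzma.FORMAT_XZ,
    check=lzma.CHECK_CRC32,
    preset=1,
)

# compress() may return b"" while data is buffered internally, so all
# pieces are collected and flush() is called exactly once at the end.
pieces = [
    lzc.compress(b"spam " * 1000),
    lzc.compress(b"eggs " * 1000),
    lzc.flush(),
]
compressed = b"".join(pieces)

# The decompressor reports the check that was stored in the stream.
lzd = lzma.LZMADecompressor()
restored = lzd.decompress(compressed)
stored_check = lzd.check
```
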
.. class:: LZMADecompressor(format=FORMAT_AUTO, memlimit=None, filters=None)

   Create a decompressor object, which can be used to decompress data
   incrementally.

   For a more convenient way of decompressing an entire compressed stream at
   once, see :func:`decompress`.

   The *format* argument specifies the container format that should be used. The
   default is :const:`FORMAT_AUTO`, which can decompress both ``.xz`` and
   ``.lzma`` files. Other possible values are :const:`FORMAT_XZ`,
   :const:`FORMAT_ALONE`, and :const:`FORMAT_RAW`.

   The *memlimit* argument specifies a limit (in bytes) on the amount of memory
   that the decompressor can use. When this argument is used, decompression will
   fail with an :class:`LZMAError` if it is not possible to decompress the input
   within the given memory limit.

   The *filters* argument specifies the filter chain that was used to create
   the stream being decompressed. This argument is required if *format* is
   :const:`FORMAT_RAW`, but should not be used for other formats.
   See :ref:`filter-chain-specs` for more information about filter chains.

   .. note::
      This class does not transparently handle inputs containing multiple
      compressed streams, unlike :func:`decompress` and :class:`LZMAFile`. To
      decompress a multi-stream input with :class:`LZMADecompressor`, you must
      create a new decompressor for each stream.

   .. method:: decompress(data, max_length=-1)

      Decompress *data* (a :term:`bytes-like object`), returning
      uncompressed data as bytes. Some of *data* may be buffered
      internally, for use in later calls to :meth:`decompress`. The
      returned data should be concatenated with the output of any
      previous calls to :meth:`decompress`.

      If *max_length* is nonnegative, returns at most *max_length*
      bytes of decompressed data. If this limit is reached and further
      output can be produced, the :attr:`~.needs_input` attribute will
      be set to ``False``. In this case, the next call to
      :meth:`~.decompress` may provide *data* as ``b''`` to obtain
      more of the output.

      If all of the input data was decompressed and returned (either
      because this was less than *max_length* bytes, or because
      *max_length* was negative), the :attr:`~.needs_input` attribute
      will be set to ``True``.

      Attempting to decompress data after the end of stream is reached
      raises an :exc:`EOFError`. Any data found after the end of the
      stream is ignored and saved in the :attr:`~.unused_data` attribute.

      .. versionchanged:: 3.5
         Added the *max_length* parameter.

   .. attribute:: check

      The ID of the integrity check used by the input stream. This may be
      :const:`CHECK_UNKNOWN` until enough of the input has been decoded to
      determine what integrity check it uses.

   .. attribute:: eof

      ``True`` if the end-of-stream marker has been reached.

   .. attribute:: unused_data

      Data found after the end of the compressed stream.

      Before the end of the stream is reached, this will be ``b""``.

   .. attribute:: needs_input

      ``False`` if the :meth:`.decompress` method can provide more
      decompressed data before requiring new compressed input.

      .. versionadded:: 3.5

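The interplay of *max_length*, :attr:`~LZMADecompressor.needs_input`,
:attr:`~LZMADecompressor.eof` and :attr:`~LZMADecompressor.unused_data` can be
sketched with a small generator. The ``decompress_streams`` helper, its
``chunk_size`` parameter and the sample payload are hypothetical names chosen
for this example; the sketch assumes the whole input is already in memory and
that nothing other than complete streams follows the first stream.

```python
import lzma

def decompress_streams(data, chunk_size=64):
    """Yield pieces of at most chunk_size bytes from one or more
    concatenated streams (hypothetical helper, not part of lzma)."""
    while data:
        # One decompressor per stream, as required for multi-stream input.
        lzd = lzma.LZMADecompressor()
        piece = lzd.decompress(data, max_length=chunk_size)
        while True:
            if piece:
                yield piece
            if lzd.eof:
                break
            if lzd.needs_input:
                # All input was already supplied, so the data is truncated.
                raise EOFError("compressed data ended before the "
                               "end-of-stream marker was reached")
            # More output is pending; ask for it without new input.
            piece = lzd.decompress(b"", max_length=chunk_size)
        # Whatever follows the finished stream starts the next one.
        data = lzd.unused_data

payload = lzma.compress(b"a" * 100) + lzma.compress(b"b" * 100)
pieces = list(decompress_streams(payload))
```

Bounding each call with *max_length* keeps the amount of decompressed data
held at any one time small, which may matter when the input comes from an
untrusted source.
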
.. function:: compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Compress *data* (a :class:`bytes` object), returning the compressed data as a
   :class:`bytes` object.

   See :class:`LZMACompressor` above for a description of the *format*, *check*,
   *preset* and *filters* arguments.


.. function:: decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)

   Decompress *data* (a :class:`bytes` object), returning the uncompressed data
   as a :class:`bytes` object.

   If *data* is the concatenation of multiple distinct compressed streams,
   decompress all of these streams, and return the concatenation of the results.

   See :class:`LZMADecompressor` above for a description of the *format*,
   *memlimit* and *filters* arguments.


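A short sketch of the one-shot functions follows, showing multi-stream input
and the effect of *memlimit*. The sample data and the deliberately tiny limit
of 1024 bytes are assumptions chosen so that the limit is certain to be
exceeded.

```python
import lzma

first = lzma.compress(b"first stream, ")
second = lzma.compress(b"second stream")

# Concatenated streams are decompressed and joined into one result.
combined = lzma.decompress(first + second)

# Assumption: 1024 bytes is far less than the decoder needs for the
# default preset, so this call is expected to fail with LZMAError.
try:
    lzma.decompress(first, memlimit=1024)
    limit_exceeded = False
except lzma.LZMAError:
    limit_exceeded = True
```
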
Miscellaneous
-------------

.. function:: is_check_supported(check)

   Return ``True`` if the given integrity check is supported on this system.

   :const:`CHECK_NONE` and :const:`CHECK_CRC32` are always supported.
   :const:`CHECK_CRC64` and :const:`CHECK_SHA256` may be unavailable if you are
   using a version of :program:`liblzma` that was compiled with a limited
   feature set.


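One possible use of :func:`is_check_supported` is to pick a check at run time
instead of assuming a full-featured :program:`liblzma`. The
``pick_supported_check`` helper and its preference order are assumptions made
for this sketch.

```python
import lzma

def pick_supported_check():
    """Return the first supported check ID from a preference list
    (hypothetical helper; the order is an arbitrary assumption)."""
    for check in (lzma.CHECK_SHA256, lzma.CHECK_CRC64, lzma.CHECK_CRC32):
        if lzma.is_check_supported(check):
            return check
    return lzma.CHECK_NONE

check = pick_supported_check()
compressed = lzma.compress(b"payload", check=check)
restored = lzma.decompress(compressed)
```
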
.. _filter-chain-specs:

Specifying custom filter chains
-------------------------------

A filter chain specifier is a sequence of dictionaries, where each dictionary
contains the ID and options for a single filter. Each dictionary must contain
the key ``"id"``, and may contain additional keys to specify filter-dependent
options. Valid filter IDs are as follows:

* Compression filters:

  * :const:`FILTER_LZMA1` (for use with :const:`FORMAT_ALONE`)
  * :const:`FILTER_LZMA2` (for use with :const:`FORMAT_XZ` and :const:`FORMAT_RAW`)

* Delta filter:

  * :const:`FILTER_DELTA`

* Branch-Call-Jump (BCJ) filters:

  * :const:`FILTER_X86`
  * :const:`FILTER_IA64`
  * :const:`FILTER_ARM`
  * :const:`FILTER_ARMTHUMB`
  * :const:`FILTER_POWERPC`
  * :const:`FILTER_SPARC`

A filter chain can consist of up to 4 filters, and cannot be empty. The last
filter in the chain must be a compression filter, and any other filters must be
delta or BCJ filters.

Compression filters support the following options (specified as additional
entries in the dictionary representing the filter):

   * ``preset``: A compression preset to use as a source of default values for
     options that are not specified explicitly.
   * ``dict_size``: Dictionary size in bytes. This should be between 4 KiB and
     1.5 GiB (inclusive).
   * ``lc``: Number of literal context bits.
   * ``lp``: Number of literal position bits. The sum ``lc + lp`` must be at
     most 4.
   * ``pb``: Number of position bits; must be at most 4.
   * ``mode``: :const:`MODE_FAST` or :const:`MODE_NORMAL`.
   * ``nice_len``: What should be considered a "nice length" for a match.
     This should be 273 or less.
   * ``mf``: What match finder to use -- :const:`MF_HC3`, :const:`MF_HC4`,
     :const:`MF_BT2`, :const:`MF_BT3`, or :const:`MF_BT4`.
   * ``depth``: Maximum search depth used by match finder. 0 (default) means to
     select automatically based on other filter options.

The delta filter stores the differences between bytes, producing more repetitive
input for the compressor in certain circumstances. It supports one option,
``dist``. This indicates the distance between bytes to be subtracted. The
default is 1, i.e. take the differences between adjacent bytes.

The BCJ filters are intended to be applied to machine code. They convert
relative branches, calls and jumps in the code to use absolute addressing, with
the aim of increasing the redundancy that can be exploited by the compressor.
These filters support one option, ``start_offset``. This specifies the address
that should be mapped to the beginning of the input data. The default is 0.


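The filter chain rules above can be illustrated with a raw stream, for which a
chain is mandatory on both sides. The option values and the sample data (a
sequence of 4-byte little-endian counters, matched by ``dist`` 4) are
assumptions picked for the sketch, not tuned settings.

```python
import lzma

# Assumed chain: a delta filter first, a compression filter last.
my_filters = [
    {"id": lzma.FILTER_DELTA, "dist": 4},
    {
        "id": lzma.FILTER_LZMA2,
        "preset": 6,            # source of defaults for unspecified options
        "dict_size": 1 << 20,   # 1 MiB, within the allowed range
        "lc": 3,
        "lp": 0,                # lc + lp must be at most 4
        "pb": 2,
    },
]

# Sample input: 32-bit little-endian counters.
data = b"".join(i.to_bytes(4, "little") for i in range(10000))

# FORMAT_RAW stores no header, so the same chain must be supplied again
# when decompressing.
raw = lzma.compress(data, format=lzma.FORMAT_RAW, filters=my_filters)
restored = lzma.decompress(raw, format=lzma.FORMAT_RAW, filters=my_filters)
```

Because a raw stream does not record its filter chain, the chain has to be
stored or agreed upon separately by whoever decompresses the data.
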
Examples
--------

Reading in a compressed file::

    import lzma
    with lzma.open("file.xz") as f:
        file_content = f.read()

Creating a compressed file::

    import lzma
    data = b"Insert Data Here"
    with lzma.open("file.xz", "w") as f:
        f.write(data)

Compressing data in memory::

    import lzma
    data_in = b"Insert Data Here"
    data_out = lzma.compress(data_in)

Incremental compression::

    import lzma
    lzc = lzma.LZMACompressor()
    out1 = lzc.compress(b"Some data\n")
    out2 = lzc.compress(b"Another piece of data\n")
    out3 = lzc.compress(b"Even more data\n")
    out4 = lzc.flush()
    # Concatenate all the partial results:
    result = b"".join([out1, out2, out3, out4])

Writing compressed data to an already-open file::

    import lzma
    with open("file.xz", "wb") as f:
        f.write(b"This data will not be compressed\n")
        with lzma.open(f, "w") as lzf:
            lzf.write(b"This *will* be compressed\n")
        f.write(b"Not compressed\n")

Creating a compressed file using a custom filter chain::

    import lzma
    my_filters = [
        {"id": lzma.FILTER_DELTA, "dist": 5},
        {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
    ]
    with lzma.open("file.xz", "w", filters=my_filters) as f:
        f.write(b"blah blah blah")
Nadeem Vawda3ff069e2011-11-30 00:25:06 +0200426 f.write(b"blah blah blah")