:mod:`lzma` --- Compression using the LZMA algorithm
====================================================

.. module:: lzma
   :synopsis: A Python wrapper for the liblzma compression library.
.. moduleauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>
.. sectionauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>

.. versionadded:: 3.3


This module provides classes and convenience functions for compressing and
decompressing data using the LZMA compression algorithm. Also included is a file
interface supporting the ``.xz`` and legacy ``.lzma`` file formats used by the
:program:`xz` utility, as well as raw compressed streams.

The interface provided by this module is very similar to that of the :mod:`bz2`
module. However, note that :class:`LZMAFile` is *not* thread-safe, unlike
:class:`bz2.BZ2File`, so if you need to use a single :class:`LZMAFile` instance
from multiple threads, it is necessary to protect it with a lock.

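The locking requirement described above can be sketched as follows. This is a
minimal illustration rather than a recommended design: the ``worker`` function,
the shared ``lock`` and the in-memory ``buf`` are assumptions made for the
example and are not part of the module.

```python
import io
import lzma
import threading

buf = io.BytesIO()          # stands in for a real file on disk
lock = threading.Lock()     # guards every access to the shared LZMAFile


def worker(lzf, n):
    # Hypothetical worker: each thread writes one line under the lock.
    line = ("line from worker %d\n" % n).encode("ascii")
    with lock:
        lzf.write(line)


with lzma.LZMAFile(buf, "w") as lzf:
    threads = [threading.Thread(target=worker, args=(lzf, n)) for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# The order of the lines depends on thread scheduling, so sort before comparing.
lines = sorted(lzma.decompress(buf.getvalue()).splitlines())
```

The lock only serializes individual calls. If several writes must appear
together in the output, hold the lock around the whole group.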

.. exception:: LZMAError

   This exception is raised when an error occurs during compression or
   decompression, or while initializing the compressor/decompressor state.


Reading and writing compressed files
------------------------------------

.. function:: open(filename, mode="rb", \*, format=None, check=-1, preset=None, filters=None, encoding=None, errors=None, newline=None)

   Open an LZMA-compressed file in binary or text mode, returning a :term:`file
   object`.

   The *filename* argument can be either an actual file name (given as a
   :class:`str` or :class:`bytes` object), in which case the named file is
   opened, or it can be an existing file object to read from or write to.

   The *mode* argument can be any of ``"r"``, ``"rb"``, ``"w"``, ``"wb"``,
   ``"x"``, ``"xb"``, ``"a"`` or ``"ab"`` for binary mode, or ``"rt"``,
   ``"wt"``, ``"xt"``, or ``"at"`` for text mode. The default is ``"rb"``.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   For binary mode, this function is equivalent to the :class:`LZMAFile`
   constructor: ``LZMAFile(filename, mode, ...)``. In this case, the *encoding*,
   *errors* and *newline* arguments must not be provided.

   For text mode, a :class:`LZMAFile` object is created, and wrapped in an
   :class:`io.TextIOWrapper` instance with the specified encoding, error
   handling behavior, and line ending(s).

   .. versionchanged:: 3.4
      Added support for the ``"x"``, ``"xb"`` and ``"xt"`` modes.

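The difference between text and binary mode in :func:`.open` can be sketched
as below. The file name ``example.txt.xz`` and the temporary directory are
assumptions made so that the example is self-contained.

```python
import lzma
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name inside a throwaway directory.
    path = os.path.join(tmpdir, "example.txt.xz")

    # Text mode: str goes in; the encoding is applied by io.TextIOWrapper.
    with lzma.open(path, "wt", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")

    # Text mode again: str comes out, decoded with the given encoding.
    with lzma.open(path, "rt", encoding="utf-8") as f:
        lines = f.readlines()

    # Binary mode: the same content is returned as bytes.
    with lzma.open(path, "rb") as f:
        raw = f.read()
```

Passing *encoding*, *errors* or *newline* together with a binary mode raises
an error, so those arguments belong only in the text-mode calls.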

.. class:: LZMAFile(filename=None, mode="r", \*, format=None, check=-1, preset=None, filters=None)

   Open an LZMA-compressed file in binary mode.

   An :class:`LZMAFile` can wrap an already-open :term:`file object`, or operate
   directly on a named file. The *filename* argument specifies either the file
   object to wrap, or the name of the file to open (as a :class:`str` or
   :class:`bytes` object). When wrapping an existing file object, the wrapped
   file will not be closed when the :class:`LZMAFile` is closed.

   The *mode* argument can be either ``"r"`` for reading (default), ``"w"`` for
   overwriting, ``"x"`` for exclusive creation, or ``"a"`` for appending. These
   can equivalently be given as ``"rb"``, ``"wb"``, ``"xb"`` and ``"ab"``
   respectively.

   If *filename* is a file object (rather than an actual file name), a mode of
   ``"w"`` does not truncate the file, and is instead equivalent to ``"a"``.

   When opening a file for reading, the input file may be the concatenation of
   multiple separate compressed streams. These are transparently decoded as a
   single logical stream.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   :class:`LZMAFile` supports all the members specified by
   :class:`io.BufferedIOBase`, except for :meth:`detach` and :meth:`truncate`.
   Iteration and the :keyword:`with` statement are supported.

   The following method is also provided:

   .. method:: peek(size=-1)

      Return buffered data without advancing the file position. At least one
      byte of data will be returned, unless EOF has been reached. The exact
      number of bytes returned is unspecified (the *size* argument is ignored).

      .. note:: While calling :meth:`peek` does not change the file position of
         the :class:`LZMAFile`, it may change the position of the underlying
         file object (e.g. if the :class:`LZMAFile` was constructed by passing a
         file object for *filename*).

   .. versionchanged:: 3.4
      Added support for the ``"x"`` and ``"xb"`` modes.

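A short sketch of three :class:`LZMAFile` behaviors described above: wrapping
an existing file object, reading concatenated streams as one, and
:meth:`~LZMAFile.peek`. The :class:`io.BytesIO` buffer is an assumption that
stands in for a real file.

```python
import io
import lzma

# Two independently compressed .xz streams, stored back to back.
buf = io.BytesIO(
    lzma.compress(b"first stream\n") + lzma.compress(b"second stream\n")
)

with lzma.LZMAFile(buf) as f:
    head = f.peek()       # some buffered data; the position is not advanced
    content = f.read()    # both streams, decoded as a single logical stream

# The wrapped file object is left open when the LZMAFile is closed.
still_open = not buf.closed
```

Because the number of bytes returned by ``peek()`` is unspecified, the example
only relies on ``head`` being a non-empty prefix of ``content``.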

Compressing and decompressing data in memory
--------------------------------------------

.. class:: LZMACompressor(format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Create a compressor object, which can be used to compress data incrementally.

   For a more convenient way of compressing a single chunk of data, see
   :func:`compress`.

   The *format* argument specifies what container format should be used.
   Possible values are:

   * :const:`FORMAT_XZ`: The ``.xz`` container format.
     This is the default format.

   * :const:`FORMAT_ALONE`: The legacy ``.lzma`` container format.
     This format is more limited than ``.xz`` -- it does not support integrity
     checks or multiple filters.

   * :const:`FORMAT_RAW`: A raw data stream, not using any container format.
     This format specifier does not support integrity checks, and requires that
     you always specify a custom filter chain (for both compression and
     decompression). Additionally, data compressed in this manner cannot be
     decompressed using :const:`FORMAT_AUTO` (see :class:`LZMADecompressor`).

   The *check* argument specifies the type of integrity check to include in the
   compressed data. This check is used when decompressing, to ensure that the
   data has not been corrupted. Possible values are:

   * :const:`CHECK_NONE`: No integrity check.
     This is the default (and the only acceptable value) for
     :const:`FORMAT_ALONE` and :const:`FORMAT_RAW`.

   * :const:`CHECK_CRC32`: 32-bit Cyclic Redundancy Check.

   * :const:`CHECK_CRC64`: 64-bit Cyclic Redundancy Check.
     This is the default for :const:`FORMAT_XZ`.

   * :const:`CHECK_SHA256`: 256-bit Secure Hash Algorithm.

   If the specified check is not supported, an :class:`LZMAError` is raised.

   The compression settings can be specified either as a preset compression
   level (with the *preset* argument), or in detail as a custom filter chain
   (with the *filters* argument).

   The *preset* argument (if provided) should be an integer between ``0`` and
   ``9`` (inclusive), optionally OR-ed with the constant
   :const:`PRESET_EXTREME`. If neither *preset* nor *filters* are given, the
   default behavior is to use :const:`PRESET_DEFAULT` (preset level ``6``).
   Higher presets produce smaller output, but make the compression process
   slower.

   .. note::

      In addition to being more CPU-intensive, compression with higher presets
      also requires much more memory (and produces output that needs more memory
      to decompress). With preset ``9`` for example, the overhead for an
      :class:`LZMACompressor` object can be as high as 800 MiB. For this reason,
      it is generally best to stick with the default preset.

   The *filters* argument (if provided) should be a filter chain specifier.
   See :ref:`filter-chain-specs` for details.

   .. method:: compress(data)

      Compress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing compressed data for at least part of the input. Some of
      *data* may be buffered internally, for use in later calls to
      :meth:`compress` and :meth:`flush`. The returned data should be
      concatenated with the output of any previous calls to :meth:`compress`.

   .. method:: flush()

      Finish the compression process, returning a :class:`bytes` object
      containing any data stored in the compressor's internal buffers.

      The compressor cannot be used after this method has been called.

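The *format*, *check* and *preset* arguments can be combined as in the sketch
below. The helper ``compress_chunks`` and the sample payload are hypothetical
names introduced for illustration; the preset values are arbitrary choices kept
low to limit memory use.

```python
import lzma

data = b"example payload " * 1000


def compress_chunks(chunks, **options):
    # Hypothetical helper: feed the chunks to one compressor, then flush it.
    lzc = lzma.LZMACompressor(**options)
    parts = [lzc.compress(chunk) for chunk in chunks]
    parts.append(lzc.flush())   # the compressor cannot be reused after this
    return b"".join(parts)


chunks = [data[i:i + 4096] for i in range(0, len(data), 4096)]

# .xz container with an explicit integrity check and an "extreme" preset.
xz_bytes = compress_chunks(
    chunks,
    format=lzma.FORMAT_XZ,
    check=lzma.CHECK_CRC32,
    preset=3 | lzma.PRESET_EXTREME,
)

# Legacy .lzma container: no integrity check is available, so check is omitted.
alone_bytes = compress_chunks(chunks, format=lzma.FORMAT_ALONE, preset=1)
```

Both results can be read back with :func:`decompress` using the default
:const:`FORMAT_AUTO`.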

.. class:: LZMADecompressor(format=FORMAT_AUTO, memlimit=None, filters=None)

   Create a decompressor object, which can be used to decompress data
   incrementally.

   For a more convenient way of decompressing an entire compressed stream at
   once, see :func:`decompress`.

   The *format* argument specifies the container format that should be used. The
   default is :const:`FORMAT_AUTO`, which can decompress both ``.xz`` and
   ``.lzma`` files. Other possible values are :const:`FORMAT_XZ`,
   :const:`FORMAT_ALONE`, and :const:`FORMAT_RAW`.

   The *memlimit* argument specifies a limit (in bytes) on the amount of memory
   that the decompressor can use. When this argument is used, decompression will
   fail with an :class:`LZMAError` if it is not possible to decompress the input
   within the given memory limit.

   The *filters* argument specifies the filter chain that was used to create
   the stream being decompressed. This argument is required if *format* is
   :const:`FORMAT_RAW`, but should not be used for other formats.
   See :ref:`filter-chain-specs` for more information about filter chains.

   .. note::
      This class does not transparently handle inputs containing multiple
      compressed streams, unlike :func:`decompress` and :class:`LZMAFile`. To
      decompress a multi-stream input with :class:`LZMADecompressor`, you must
      create a new decompressor for each stream.

   .. method:: decompress(data)

      Decompress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing the decompressed data for at least part of the input.
      Some of *data* may be buffered internally, for use in later calls to
      :meth:`decompress`. The returned data should be concatenated with the
      output of any previous calls to :meth:`decompress`.

   .. attribute:: check

      The ID of the integrity check used by the input stream. This may be
      :const:`CHECK_UNKNOWN` until enough of the input has been decoded to
      determine what integrity check it uses.

   .. attribute:: eof

      ``True`` if the end-of-stream marker has been reached.

   .. attribute:: unused_data

      Data found after the end of the compressed stream.

      Before the end of the stream is reached, this will be ``b""``.

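The note about multi-stream input can be illustrated with the following
sketch, which creates a new :class:`LZMADecompressor` for each stream and uses
:attr:`~LZMADecompressor.eof` and :attr:`~LZMADecompressor.unused_data` to find
the stream boundaries. The function name ``decompress_all`` is hypothetical,
and the error handling is deliberately simple; :func:`decompress` may treat
trailing data differently.

```python
import lzma


def decompress_all(data):
    # Hypothetical helper: decode every stream found in *data*.
    results = []
    while data:
        decomp = lzma.LZMADecompressor()     # one decompressor per stream
        results.append(decomp.decompress(data))
        if not decomp.eof:
            raise lzma.LZMAError(
                "compressed data ended before the end-of-stream marker"
            )
        data = decomp.unused_data            # whatever follows this stream
    return b"".join(results)


blob = lzma.compress(b"alpha") + lzma.compress(b"beta")
joined = decompress_all(blob)
```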

.. function:: compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Compress *data* (a :class:`bytes` object), returning the compressed data as a
   :class:`bytes` object.

   See :class:`LZMACompressor` above for a description of the *format*, *check*,
   *preset* and *filters* arguments.


.. function:: decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)

   Decompress *data* (a :class:`bytes` object), returning the uncompressed data
   as a :class:`bytes` object.

   If *data* is the concatenation of multiple distinct compressed streams,
   decompress all of these streams, and return the concatenation of the results.

   See :class:`LZMADecompressor` above for a description of the *format*,
   *memlimit* and *filters* arguments.

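The effect of *memlimit* can be sketched as follows. The two limits used here
(1 KiB and 64 MiB) are arbitrary values chosen for illustration, on the
assumption that a stream written with the default preset needs more than the
first and less than the second to decode.

```python
import lzma

original = b"x" * 100000
compressed = lzma.compress(original)   # default preset

# A limit far below what the decoder needs: decompression is refused.
try:
    lzma.decompress(compressed, memlimit=1024)
except lzma.LZMAError as exc:
    limit_error = exc
else:
    limit_error = None

# A generous limit: decompression proceeds normally.
restored = lzma.decompress(compressed, memlimit=64 * 1024 * 1024)
```

A limit like this is mainly useful when decompressing data from an untrusted
source, where the memory required is not known in advance.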

Miscellaneous
-------------

.. function:: is_check_supported(check)

   Return ``True`` if the given integrity check is supported on this system.

   :const:`CHECK_NONE` and :const:`CHECK_CRC32` are always supported.
   :const:`CHECK_CRC64` and :const:`CHECK_SHA256` may be unavailable if you are
   using a version of :program:`liblzma` that was compiled with a limited
   feature set.

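One possible use of :func:`is_check_supported` is to pick the strongest check
that the local :program:`liblzma` provides. The preference order in
``preferred`` is an assumption made for this sketch, not something the module
prescribes.

```python
import lzma

# Hypothetical preference order, strongest check first. CHECK_CRC32 is
# always supported, so the search below always finds a match.
preferred = [lzma.CHECK_SHA256, lzma.CHECK_CRC64, lzma.CHECK_CRC32]
check = next(c for c in preferred if lzma.is_check_supported(c))

compressed = lzma.compress(b"payload", check=check)

# The decompressor reports which check the stream actually carries.
decomp = lzma.LZMADecompressor()
payload = decomp.decompress(compressed)
used = decomp.check
```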

.. _filter-chain-specs:

Specifying custom filter chains
-------------------------------

A filter chain specifier is a sequence of dictionaries, where each dictionary
contains the ID and options for a single filter. Each dictionary must contain
the key ``"id"``, and may contain additional keys to specify filter-dependent
options. Valid filter IDs are as follows:

* Compression filters:

  * :const:`FILTER_LZMA1` (for use with :const:`FORMAT_ALONE`)
  * :const:`FILTER_LZMA2` (for use with :const:`FORMAT_XZ` and :const:`FORMAT_RAW`)

* Delta filter:

  * :const:`FILTER_DELTA`

* Branch-Call-Jump (BCJ) filters:

  * :const:`FILTER_X86`
  * :const:`FILTER_IA64`
  * :const:`FILTER_ARM`
  * :const:`FILTER_ARMTHUMB`
  * :const:`FILTER_POWERPC`
  * :const:`FILTER_SPARC`

A filter chain can consist of up to 4 filters, and cannot be empty. The last
filter in the chain must be a compression filter, and any other filters must be
delta or BCJ filters.

Compression filters support the following options (specified as additional
entries in the dictionary representing the filter):

   * ``preset``: A compression preset to use as a source of default values for
     options that are not specified explicitly.
   * ``dict_size``: Dictionary size in bytes. This should be between 4 KiB and
     1.5 GiB (inclusive).
   * ``lc``: Number of literal context bits.
   * ``lp``: Number of literal position bits. The sum ``lc + lp`` must be at
     most 4.
   * ``pb``: Number of position bits; must be at most 4.
   * ``mode``: :const:`MODE_FAST` or :const:`MODE_NORMAL`.
   * ``nice_len``: What should be considered a "nice length" for a match.
     This should be 273 or less.
   * ``mf``: What match finder to use -- :const:`MF_HC3`, :const:`MF_HC4`,
     :const:`MF_BT2`, :const:`MF_BT3`, or :const:`MF_BT4`.
   * ``depth``: Maximum search depth used by match finder. 0 (default) means to
     select automatically based on other filter options.

The delta filter stores the differences between bytes, producing more repetitive
input for the compressor in certain circumstances. It supports one option,
``dist``. This indicates the distance between bytes to be subtracted. The
default is 1, i.e. take the differences between adjacent bytes.

The BCJ filters are intended to be applied to machine code. They convert
relative branches, calls and jumps in the code to use absolute addressing, with
the aim of increasing the redundancy that can be exploited by the compressor.
These filters support one option, ``start_offset``. This specifies the address
that should be mapped to the beginning of the input data. The default is 0.

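A filter chain is required for :const:`FORMAT_RAW`, both when compressing and
when decompressing. The sketch below shows such a round trip. The particular
option values (``dist`` of 4, a 1 MiB dictionary) and the sample data are
arbitrary choices made for illustration.

```python
import lzma

# A delta filter followed by the compression filter, which must come last.
filters = [
    {"id": lzma.FILTER_DELTA, "dist": 4},
    {"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 20},
]

samples = bytes(range(256)) * 64

# A raw stream has no header describing the filters, so the same chain has
# to be supplied again when decompressing.
raw = lzma.compress(samples, format=lzma.FORMAT_RAW, filters=filters)
restored = lzma.decompress(raw, format=lzma.FORMAT_RAW, filters=filters)
```

With :const:`FORMAT_XZ` the chain is recorded in the container, so *filters*
is only passed on the compression side.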

Examples
--------

Reading in a compressed file::

   import lzma
   with lzma.open("file.xz") as f:
       file_content = f.read()

Creating a compressed file::

   import lzma
   data = b"Insert Data Here"
   with lzma.open("file.xz", "w") as f:
       f.write(data)

Compressing data in memory::

   import lzma
   data_in = b"Insert Data Here"
   data_out = lzma.compress(data_in)

Incremental compression::

   import lzma
   lzc = lzma.LZMACompressor()
   out1 = lzc.compress(b"Some data\n")
   out2 = lzc.compress(b"Another piece of data\n")
   out3 = lzc.compress(b"Even more data\n")
   out4 = lzc.flush()
   # Concatenate all the partial results:
   result = b"".join([out1, out2, out3, out4])

Writing compressed data to an already-open file::

   import lzma
   with open("file.xz", "wb") as f:
       f.write(b"This data will not be compressed\n")
       with lzma.open(f, "w") as lzf:
           lzf.write(b"This *will* be compressed\n")
       f.write(b"Not compressed\n")

Creating a compressed file using a custom filter chain::

   import lzma
   my_filters = [
       {"id": lzma.FILTER_DELTA, "dist": 5},
       {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
   ]
   with lzma.open("file.xz", "w", filters=my_filters) as f:
       f.write(b"blah blah blah")