:mod:`lzma` --- Compression using the LZMA algorithm
====================================================

.. module:: lzma
   :synopsis: A Python wrapper for the liblzma compression library.
.. moduleauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>
.. sectionauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>

.. versionadded:: 3.3


This module provides classes and convenience functions for compressing and
decompressing data using the LZMA compression algorithm. Also included is a file
interface supporting the ``.xz`` and legacy ``.lzma`` file formats used by the
:program:`xz` utility, as well as raw compressed streams.

The interface provided by this module is very similar to that of the :mod:`bz2`
module. However, note that :class:`LZMAFile` is *not* thread-safe, unlike
:class:`bz2.BZ2File`, so if you need to use a single :class:`LZMAFile` instance
from multiple threads, it is necessary to protect it with a lock.


.. exception:: LZMAError

   This exception is raised when an error occurs during compression or
   decompression, or while initializing the compressor/decompressor state.


Reading and writing compressed files
------------------------------------

.. function:: open(filename, mode="rb", \*, format=None, check=-1, preset=None, filters=None, encoding=None, errors=None, newline=None)

   Open an LZMA-compressed file in binary or text mode, returning a :term:`file
   object`.

   The *filename* argument can be either an actual file name (given as a
   :class:`str` or :class:`bytes` object), in which case the named file is
   opened, or it can be an existing file object to read from or write to.

   The *mode* argument can be any of ``"r"``, ``"rb"``, ``"w"``, ``"wb"``,
   ``"x"``, ``"xb"``, ``"a"`` or ``"ab"`` for binary mode, or ``"rt"``,
   ``"wt"``, ``"xt"``, or ``"at"`` for text mode. The default is ``"rb"``.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   For binary mode, this function is equivalent to the :class:`LZMAFile`
   constructor: ``LZMAFile(filename, mode, ...)``. In this case, the *encoding*,
   *errors* and *newline* arguments must not be provided.

   For text mode, a :class:`LZMAFile` object is created, and wrapped in an
   :class:`io.TextIOWrapper` instance with the specified encoding, error
   handling behavior, and line ending(s).

   .. versionchanged:: 3.4
      Added support for the ``"x"``, ``"xb"`` and ``"xt"`` modes.

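As a minimal sketch of the text-mode behavior of :func:`open` described above,
the following writes and re-reads a small file. The file name ``notes.txt.xz``
and the temporary directory are assumptions made only for this illustration.

```python
import lzma
import os
import tempfile

# Hypothetical file name, created inside a throwaway directory for the demo.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "notes.txt.xz")

    # "wt" wraps the LZMAFile in an io.TextIOWrapper, so str objects are written.
    with lzma.open(path, "wt", encoding="utf-8", newline="\n") as f:
        f.write("first line\nsecond line\n")

    # "rt" decodes the decompressed bytes back into str.
    with lzma.open(path, "rt", encoding="utf-8") as f:
        lines = f.readlines()
```

Passing *encoding* explicitly keeps the result independent of the platform's
default encoding.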

.. class:: LZMAFile(filename=None, mode="r", \*, format=None, check=-1, preset=None, filters=None)

   Open an LZMA-compressed file in binary mode.

   An :class:`LZMAFile` can wrap an already-open :term:`file object`, or operate
   directly on a named file. The *filename* argument specifies either the file
   object to wrap, or the name of the file to open (as a :class:`str` or
   :class:`bytes` object). When wrapping an existing file object, the wrapped
   file will not be closed when the :class:`LZMAFile` is closed.

   The *mode* argument can be either ``"r"`` for reading (default), ``"w"`` for
   overwriting, ``"x"`` for exclusive creation, or ``"a"`` for appending. These
   can equivalently be given as ``"rb"``, ``"wb"``, ``"xb"`` and ``"ab"``
   respectively.

   If *filename* is a file object (rather than an actual file name), a mode of
   ``"w"`` does not truncate the file, and is instead equivalent to ``"a"``.

   When opening a file for reading, the input file may be the concatenation of
   multiple separate compressed streams. These are transparently decoded as a
   single logical stream.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   :class:`LZMAFile` supports all the members specified by
   :class:`io.BufferedIOBase`, except for :meth:`detach` and :meth:`truncate`.
   Iteration and the :keyword:`with` statement are supported.

   The following method is also provided:

   .. method:: peek(size=-1)

      Return buffered data without advancing the file position. At least one
      byte of data will be returned, unless EOF has been reached. The exact
      number of bytes returned is unspecified (the *size* argument is ignored).

   .. versionchanged:: 3.4
      Added support for the ``"x"`` and ``"xb"`` modes.

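The following sketch illustrates three points made for :class:`LZMAFile`:
wrapping an existing file object, reading concatenated streams as one logical
stream, and :meth:`peek`. An in-memory :class:`io.BytesIO` stands in for a real
file here, which is an assumption chosen to keep the example self-contained.

```python
import io
import lzma

# Two independently compressed streams, concatenated back to back.
raw = lzma.compress(b"first stream\n") + lzma.compress(b"second stream\n")

buffer = io.BytesIO(raw)
with lzma.LZMAFile(buffer) as f:
    head = f.peek()     # some buffered bytes; the file position is unchanged
    content = f.read()  # both streams, decoded as a single logical stream

# The wrapped object is left open when the LZMAFile is closed.
still_open = not buffer.closed
```

Because the amount returned by :meth:`peek` is unspecified, the only safe
assumption is that ``head`` is a non-empty prefix of ``content``.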

Compressing and decompressing data in memory
--------------------------------------------

.. class:: LZMACompressor(format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Create a compressor object, which can be used to compress data incrementally.

   For a more convenient way of compressing a single chunk of data, see
   :func:`compress`.

   The *format* argument specifies what container format should be used.
   Possible values are:

   * :const:`FORMAT_XZ`: The ``.xz`` container format.
     This is the default format.

   * :const:`FORMAT_ALONE`: The legacy ``.lzma`` container format.
     This format is more limited than ``.xz`` -- it does not support integrity
     checks or multiple filters.

   * :const:`FORMAT_RAW`: A raw data stream, not using any container format.
     This format specifier does not support integrity checks, and requires that
     you always specify a custom filter chain (for both compression and
     decompression). Additionally, data compressed in this manner cannot be
     decompressed using :const:`FORMAT_AUTO` (see :class:`LZMADecompressor`).

   The *check* argument specifies the type of integrity check to include in the
   compressed data. This check is used when decompressing, to ensure that the
   data has not been corrupted. Possible values are:

   * :const:`CHECK_NONE`: No integrity check.
     This is the default (and the only acceptable value) for
     :const:`FORMAT_ALONE` and :const:`FORMAT_RAW`.

   * :const:`CHECK_CRC32`: 32-bit Cyclic Redundancy Check.

   * :const:`CHECK_CRC64`: 64-bit Cyclic Redundancy Check.
     This is the default for :const:`FORMAT_XZ`.

   * :const:`CHECK_SHA256`: 256-bit Secure Hash Algorithm.

   If the specified check is not supported, an :class:`LZMAError` is raised.

   The compression settings can be specified either as a preset compression
   level (with the *preset* argument), or in detail as a custom filter chain
   (with the *filters* argument).

   The *preset* argument (if provided) should be an integer between ``0`` and
   ``9`` (inclusive), optionally OR-ed with the constant
   :const:`PRESET_EXTREME`. If neither *preset* nor *filters* are given, the
   default behavior is to use :const:`PRESET_DEFAULT` (preset level ``6``).
   Higher presets produce smaller output, but make the compression process
   slower.

   .. note::

      In addition to being more CPU-intensive, compression with higher presets
      also requires much more memory (and produces output that needs more memory
      to decompress). With preset ``9`` for example, the overhead for an
      :class:`LZMACompressor` object can be as high as 800 MiB. For this reason,
      it is generally best to stick with the default preset.

   The *filters* argument (if provided) should be a filter chain specifier.
   See :ref:`filter-chain-specs` for details.

   .. method:: compress(data)

      Compress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing compressed data for at least part of the input. Some of
      *data* may be buffered internally, for use in later calls to
      :meth:`compress` and :meth:`flush`. The returned data should be
      concatenated with the output of any previous calls to :meth:`compress`.

   .. method:: flush()

      Finish the compression process, returning a :class:`bytes` object
      containing any data stored in the compressor's internal buffers.

      The compressor cannot be used after this method has been called.

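A possible way to combine the *format*, *check* and *preset* arguments of
:class:`LZMACompressor` is sketched below. The helper name ``compress_chunks``
and its default values are hypothetical and not part of the module.

```python
import lzma


def compress_chunks(chunks, preset=6, check=lzma.CHECK_CRC32):
    """Hypothetical helper: compress an iterable of bytes into one .xz stream."""
    lzc = lzma.LZMACompressor(format=lzma.FORMAT_XZ, check=check, preset=preset)
    # compress() may return b"" while data is still buffered internally.
    parts = [lzc.compress(chunk) for chunk in chunks]
    # flush() emits whatever is left; the compressor is unusable afterwards.
    parts.append(lzc.flush())
    return b"".join(parts)


# A low preset keeps memory use small for this demonstration.
compressed = compress_chunks([b"spam " * 100, b"eggs " * 100], preset=3)
```

The result is an ordinary ``.xz`` stream, so it can be passed to
:func:`decompress` or written to a file unchanged.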

.. class:: LZMADecompressor(format=FORMAT_AUTO, memlimit=None, filters=None)

   Create a decompressor object, which can be used to decompress data
   incrementally.

   For a more convenient way of decompressing an entire compressed stream at
   once, see :func:`decompress`.

   The *format* argument specifies the container format that should be used. The
   default is :const:`FORMAT_AUTO`, which can decompress both ``.xz`` and
   ``.lzma`` files. Other possible values are :const:`FORMAT_XZ`,
   :const:`FORMAT_ALONE`, and :const:`FORMAT_RAW`.

   The *memlimit* argument specifies a limit (in bytes) on the amount of memory
   that the decompressor can use. When this argument is used, decompression will
   fail with an :class:`LZMAError` if it is not possible to decompress the input
   within the given memory limit.

   The *filters* argument specifies the filter chain that was used to create
   the stream being decompressed. This argument is required if *format* is
   :const:`FORMAT_RAW`, but should not be used for other formats.
   See :ref:`filter-chain-specs` for more information about filter chains.

   .. note::
      This class does not transparently handle inputs containing multiple
      compressed streams, unlike :func:`decompress` and :class:`LZMAFile`. To
      decompress a multi-stream input with :class:`LZMADecompressor`, you must
      create a new decompressor for each stream.

   .. method:: decompress(data)

      Decompress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing the decompressed data for at least part of the input.
      Some of *data* may be buffered internally, for use in later calls to
      :meth:`decompress`. The returned data should be concatenated with the
      output of any previous calls to :meth:`decompress`.

   .. attribute:: check

      The ID of the integrity check used by the input stream. This may be
      :const:`CHECK_UNKNOWN` until enough of the input has been decoded to
      determine what integrity check it uses.

   .. attribute:: eof

      True if the end-of-stream marker has been reached.

   .. attribute:: unused_data

      Data found after the end of the compressed stream.

      Before the end of the stream is reached, this will be ``b""``.

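The note about multi-stream input can be illustrated with a short sketch that
creates one :class:`LZMADecompressor` per stream and uses
:attr:`~LZMADecompressor.unused_data` to find where the next stream starts.
The helper name ``decompress_all_streams`` is hypothetical, and the sketch
assumes the whole input is already in memory.

```python
import lzma


def decompress_all_streams(data, memlimit=None):
    """Hypothetical helper: decode every stream in *data*, one decompressor each."""
    results = []
    while data:
        lzd = lzma.LZMADecompressor(memlimit=memlimit)
        results.append(lzd.decompress(data))
        if not lzd.eof:
            raise lzma.LZMAError(
                "compressed data ended before the end-of-stream marker")
        # Whatever follows the finished stream is the start of the next one.
        data = lzd.unused_data
    return b"".join(results)


payload = lzma.compress(b"alpha") + lzma.compress(b"beta")
combined = decompress_all_streams(payload)
```

Trailing bytes that are not a valid stream cause the next decompressor to raise
:class:`LZMAError`; a caller that wants to tolerate such data would need to
catch that exception.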

.. function:: compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Compress *data* (a :class:`bytes` object), returning the compressed data as a
   :class:`bytes` object.

   See :class:`LZMACompressor` above for a description of the *format*, *check*,
   *preset* and *filters* arguments.


.. function:: decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)

   Decompress *data* (a :class:`bytes` object), returning the uncompressed data
   as a :class:`bytes` object.

   If *data* is the concatenation of multiple distinct compressed streams,
   decompress all of these streams, and return the concatenation of the results.

   See :class:`LZMADecompressor` above for a description of the *format*,
   *memlimit* and *filters* arguments.

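As a rough illustration of the one-shot functions, the sketch below performs a
round trip and then repeats the decompression with a deliberately tiny
*memlimit*. The value ``1024`` is an arbitrary assumption, chosen only because
it is far below what any preset needs.

```python
import lzma

original = b"Insert Data Here" * 1000
packed = lzma.compress(original, preset=1)
restored = lzma.decompress(packed)

try:
    # 1024 bytes is far too little memory for the decoder.
    lzma.decompress(packed, memlimit=1024)
    limit_hit = False
except lzma.LZMAError:
    limit_hit = True
```

A realistic limit would be sized from the largest dictionary the application is
prepared to accept from untrusted input.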

Miscellaneous
-------------

.. function:: is_check_supported(check)

   Returns true if the given integrity check is supported on this system.

   :const:`CHECK_NONE` and :const:`CHECK_CRC32` are always supported.
   :const:`CHECK_CRC64` and :const:`CHECK_SHA256` may be unavailable if you are
   using a version of :program:`liblzma` that was compiled with a limited
   feature set.

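One way to use :func:`is_check_supported` is to pick the strongest check that
the local :program:`liblzma` build offers. The helper ``pick_check`` and its
preference order are assumptions made for this sketch.

```python
import lzma


def pick_check(preferred=(lzma.CHECK_SHA256, lzma.CHECK_CRC64, lzma.CHECK_CRC32)):
    """Hypothetical helper: return the first supported check from *preferred*."""
    for check in preferred:
        if lzma.is_check_supported(check):
            return check
    # CHECK_NONE is always supported, so it is a safe last resort.
    return lzma.CHECK_NONE


check = pick_check()
data = lzma.compress(b"payload", check=check)
```

Probing first avoids the :class:`LZMAError` that the compressor would otherwise
raise for an unsupported check.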

.. _filter-chain-specs:

Specifying custom filter chains
-------------------------------

A filter chain specifier is a sequence of dictionaries, where each dictionary
contains the ID and options for a single filter. Each dictionary must contain
the key ``"id"``, and may contain additional keys to specify filter-dependent
options. Valid filter IDs are as follows:

* Compression filters:

  * :const:`FILTER_LZMA1` (for use with :const:`FORMAT_ALONE`)
  * :const:`FILTER_LZMA2` (for use with :const:`FORMAT_XZ` and :const:`FORMAT_RAW`)

* Delta filter:

  * :const:`FILTER_DELTA`

* Branch-Call-Jump (BCJ) filters:

  * :const:`FILTER_X86`
  * :const:`FILTER_IA64`
  * :const:`FILTER_ARM`
  * :const:`FILTER_ARMTHUMB`
  * :const:`FILTER_POWERPC`
  * :const:`FILTER_SPARC`

A filter chain can consist of up to 4 filters, and cannot be empty. The last
filter in the chain must be a compression filter, and any other filters must be
delta or BCJ filters.

Compression filters support the following options (specified as additional
entries in the dictionary representing the filter):

   * ``preset``: A compression preset to use as a source of default values for
     options that are not specified explicitly.
   * ``dict_size``: Dictionary size in bytes. This should be between 4 KiB and
     1.5 GiB (inclusive).
   * ``lc``: Number of literal context bits.
   * ``lp``: Number of literal position bits. The sum ``lc + lp`` must be at
     most 4.
   * ``pb``: Number of position bits; must be at most 4.
   * ``mode``: :const:`MODE_FAST` or :const:`MODE_NORMAL`.
   * ``nice_len``: What should be considered a "nice length" for a match.
     This should be 273 or less.
   * ``mf``: What match finder to use -- :const:`MF_HC3`, :const:`MF_HC4`,
     :const:`MF_BT2`, :const:`MF_BT3`, or :const:`MF_BT4`.
   * ``depth``: Maximum search depth used by match finder. 0 (default) means to
     select automatically based on other filter options.

The delta filter stores the differences between bytes, producing more repetitive
input for the compressor in certain circumstances. It supports only one option,
``dist``. This indicates the distance between bytes to be subtracted. The
default is 1, i.e. take the differences between adjacent bytes.

The BCJ filters are intended to be applied to machine code. They convert
relative branches, calls and jumps in the code to use absolute addressing, with
the aim of increasing the redundancy that can be exploited by the compressor.
These filters support one option, ``start_offset``. This specifies the address
that should be mapped to the beginning of the input data. The default is 0.

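The sketch below shows a filter chain used with :const:`FORMAT_RAW`, where the
same chain has to be supplied for both compression and decompression. The
particular option values (``dist`` of 4, a 1 MiB ``dict_size``) and the sample
data are assumptions chosen for illustration, not recommendations.

```python
import lzma

# A delta filter followed by the mandatory final compression filter.
raw_filters = [
    {"id": lzma.FILTER_DELTA, "dist": 4},
    {"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 20},
]

samples = bytes(range(256)) * 64

# FORMAT_RAW stores no header, so the chain is not recorded in the output.
packed = lzma.compress(samples, format=lzma.FORMAT_RAW, filters=raw_filters)

# The identical chain must therefore be passed again when decompressing.
restored = lzma.decompress(packed, format=lzma.FORMAT_RAW, filters=raw_filters)
```

With :const:`FORMAT_XZ` the chain is stored in the container, so only the
compressing side needs to specify *filters*.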

Examples
--------

Reading in a compressed file::

   import lzma
   with lzma.open("file.xz") as f:
       file_content = f.read()

Creating a compressed file::

   import lzma
   data = b"Insert Data Here"
   with lzma.open("file.xz", "w") as f:
       f.write(data)

Compressing data in memory::

   import lzma
   data_in = b"Insert Data Here"
   data_out = lzma.compress(data_in)

Incremental compression::

   import lzma
   lzc = lzma.LZMACompressor()
   out1 = lzc.compress(b"Some data\n")
   out2 = lzc.compress(b"Another piece of data\n")
   out3 = lzc.compress(b"Even more data\n")
   out4 = lzc.flush()
   # Concatenate all the partial results:
   result = b"".join([out1, out2, out3, out4])

Writing compressed data to an already-open file::

   import lzma
   with open("file.xz", "wb") as f:
       f.write(b"This data will not be compressed\n")
       with lzma.open(f, "w") as lzf:
           lzf.write(b"This *will* be compressed\n")
       f.write(b"Not compressed\n")

Creating a compressed file using a custom filter chain::

   import lzma
   my_filters = [
       {"id": lzma.FILTER_DELTA, "dist": 5},
       {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
   ]
   with lzma.open("file.xz", "w", filters=my_filters) as f:
       f.write(b"blah blah blah")