:mod:`lzma` --- Compression using the LZMA algorithm
====================================================

.. module:: lzma
   :synopsis: A Python wrapper for the liblzma compression library.
.. moduleauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>
.. sectionauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>

.. versionadded:: 3.3


This module provides classes and convenience functions for compressing and
decompressing data using the LZMA compression algorithm. Also included is a file
interface supporting the ``.xz`` and legacy ``.lzma`` file formats used by the
:program:`xz` utility, as well as raw compressed streams.

The interface provided by this module is very similar to that of the :mod:`bz2`
module. However, note that :class:`LZMAFile` is *not* thread-safe, unlike
:class:`bz2.BZ2File`, so if you need to use a single :class:`LZMAFile` instance
from multiple threads, it is necessary to protect it with a lock.

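One way to share a single :class:`LZMAFile` between threads is sketched below.
The ``worker`` helper, the ``lock`` name and the temporary file path are
illustrative assumptions, not part of the module; the only requirement taken
from the text above is that every call on the shared object happens while the
lock is held.

```python
import lzma
import os
import tempfile
import threading

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "shared.xz")
lock = threading.Lock()

def worker(lzf, n):
    # Hold the lock for the whole call on the shared LZMAFile.
    with lock:
        lzf.write(("line from thread %d\n" % n).encode("ascii"))

with lzma.LZMAFile(path, "w") as lzf:
    threads = [threading.Thread(target=worker, args=(lzf, n)) for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Read everything back; the order of the lines depends on thread scheduling.
with lzma.LZMAFile(path) as lzf:
    lines = sorted(lzf.read().splitlines())
```

The lock serializes access, so the threads gain no parallelism for the
compression itself; it only keeps the file object's internal state consistent.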

.. exception:: LZMAError

   This exception is raised when an error occurs during compression or
   decompression, or while initializing the compressor/decompressor state.

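As a small illustration, the sketch below feeds bytes that are not a valid
compressed stream to :func:`decompress` and catches the resulting
:class:`LZMAError`. The input bytes and the ``error_message`` variable are
made up for the example.

```python
import lzma

try:
    # Plain text is neither a valid .xz nor a valid .lzma stream.
    lzma.decompress(b"this is not a compressed stream")
except lzma.LZMAError as exc:
    error_message = str(exc)
else:
    error_message = None
```

The exact wording of the message comes from the underlying library and may
differ between versions, so it is best used for logging rather than matching.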

Reading and writing compressed files
------------------------------------

.. class:: LZMAFile(filename=None, mode="r", \*, format=None, check=-1, preset=None, filters=None)

   Open an LZMA-compressed file in binary mode.

   An :class:`LZMAFile` can wrap an already-open :term:`file object`, or operate
   directly on a named file. The *filename* argument specifies either the file
   object to wrap, or the name of the file to open (as a :class:`str` or
   :class:`bytes` object). When wrapping an existing file object, the wrapped
   file will not be closed when the :class:`LZMAFile` is closed.

   The *mode* argument can be either ``"r"`` for reading (default), ``"w"`` for
   overwriting, or ``"a"`` for appending. These can equivalently be given as
   ``"rb"``, ``"wb"``, and ``"ab"`` respectively.

   If *filename* is a file object (rather than an actual file name), a mode of
   ``"w"`` does not truncate the file, and is instead equivalent to ``"a"``.

   When opening a file for reading, the input file may be the concatenation of
   multiple separate compressed streams. These are transparently decoded as a
   single logical stream.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   :class:`LZMAFile` supports all the members specified by
   :class:`io.BufferedIOBase`, except for :meth:`detach` and :meth:`truncate`.
   Iteration and the :keyword:`with` statement are supported.

   The following method is also provided:

   .. method:: peek(size=-1)

      Return buffered data without advancing the file position. At least one
      byte of data will be returned, unless EOF has been reached. The exact
      number of bytes returned is unspecified (the *size* argument is ignored).

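The following sketch exercises the behavior described for :class:`LZMAFile`:
appending creates a second compressed stream, reading decodes both streams as
one, and :meth:`peek` does not advance the file position. The scratch path and
the sample contents are assumptions made for the example.

```python
import lzma
import os
import tempfile

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "log.xz")

with lzma.LZMAFile(path, "w") as f:
    f.write(b"first stream\n")

# Appending adds a second, separate compressed stream to the same file.
with lzma.LZMAFile(path, "a") as f:
    f.write(b"second stream\n")

with lzma.LZMAFile(path) as f:
    head = f.peek()   # some buffered bytes; the position is unchanged
    lines = list(f)   # iteration yields lines across both streams
```

Because the number of bytes returned by :meth:`peek` is unspecified, code
should only rely on the first byte being present before EOF.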

Compressing and decompressing data in memory
--------------------------------------------

.. class:: LZMACompressor(format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Create a compressor object, which can be used to compress data incrementally.

   For a more convenient way of compressing a single chunk of data, see
   :func:`compress`.

   The *format* argument specifies what container format should be used.
   Possible values are:

   * :const:`FORMAT_XZ`: The ``.xz`` container format.
     This is the default format.

   * :const:`FORMAT_ALONE`: The legacy ``.lzma`` container format.
     This format is more limited than ``.xz`` -- it does not support integrity
     checks or multiple filters.

   * :const:`FORMAT_RAW`: A raw data stream, not using any container format.
     This format specifier does not support integrity checks, and requires that
     you always specify a custom filter chain (for both compression and
     decompression). Additionally, data compressed in this manner cannot be
     decompressed using :const:`FORMAT_AUTO` (see :class:`LZMADecompressor`).

   The *check* argument specifies the type of integrity check to include in the
   compressed data. This check is used when decompressing, to ensure that the
   data has not been corrupted. Possible values are:

   * :const:`CHECK_NONE`: No integrity check.
     This is the default (and the only acceptable value) for
     :const:`FORMAT_ALONE` and :const:`FORMAT_RAW`.

   * :const:`CHECK_CRC32`: 32-bit Cyclic Redundancy Check.

   * :const:`CHECK_CRC64`: 64-bit Cyclic Redundancy Check.
     This is the default for :const:`FORMAT_XZ`.

   * :const:`CHECK_SHA256`: 256-bit Secure Hash Algorithm.

   If the specified check is not supported, an :class:`LZMAError` is raised.

   The compression settings can be specified either as a preset compression
   level (with the *preset* argument), or in detail as a custom filter chain
   (with the *filters* argument).

   The *preset* argument (if provided) should be an integer between ``0`` and
   ``9`` (inclusive), optionally OR-ed with the constant
   :const:`PRESET_EXTREME`. If neither *preset* nor *filters* are given, the
   default behavior is to use :const:`PRESET_DEFAULT` (preset level ``6``).
   Higher presets produce smaller output, but make the compression process
   slower.

   .. note::

      In addition to being more CPU-intensive, compression with higher presets
      also requires much more memory (and produces output that needs more memory
      to decompress). With preset ``9`` for example, the overhead for an
      :class:`LZMACompressor` object can be as high as 800 MiB. For this reason,
      it is generally best to stick with the default preset.

   The *filters* argument (if provided) should be a filter chain specifier.
   See :ref:`filter-chain-specs` for details.

137
138 .. method:: compress(data)
139
140 Compress *data* (a :class:`bytes` object), returning a :class:`bytes`
141 object containing compressed data for at least part of the input. Some of
142 *data* may be buffered internally, for use in later calls to
143 :meth:`compress` and :meth:`flush`. The returned data should be
144 concatenated with the output of any previous calls to :meth:`compress`.
145
146 .. method:: flush()
147
148 Finish the compression process, returning a :class:`bytes` object
149 containing any data stored in the compressor's internal buffers.
150
151 The compressor cannot be used after this method has been called.
152
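The options of :class:`LZMACompressor` can be combined as in the sketch below.
The ``compress_chunks`` helper and the sample data are assumptions made for
illustration; the presets are deliberately low to keep memory use small.

```python
import lzma

data = b"abcdefgh" * 4096
chunks = [data[i:i + 1024] for i in range(0, len(data), 1024)]

def compress_chunks(chunks, **options):
    """Feed chunks to one compressor and join all partial outputs."""
    compressor = lzma.LZMACompressor(**options)
    parts = [compressor.compress(chunk) for chunk in chunks]
    parts.append(compressor.flush())  # the compressor is unusable after this
    return b"".join(parts)

# Default settings: .xz container, CRC64 check, default preset.
xz_default = compress_chunks(chunks)

# Explicit integrity check and an "extreme" variant of preset 3.
xz_crc32 = compress_chunks(
    chunks, check=lzma.CHECK_CRC32, preset=3 | lzma.PRESET_EXTREME)

# Legacy .lzma container; no integrity check is available here.
legacy = compress_chunks(chunks, format=lzma.FORMAT_ALONE, preset=1)
```

All three results can be passed to :func:`decompress` with the default
:const:`FORMAT_AUTO`, since none of them uses :const:`FORMAT_RAW`.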

.. class:: LZMADecompressor(format=FORMAT_AUTO, memlimit=None, filters=None)

   Create a decompressor object, which can be used to decompress data
   incrementally.

   For a more convenient way of decompressing an entire compressed stream at
   once, see :func:`decompress`.

   The *format* argument specifies the container format that should be used. The
   default is :const:`FORMAT_AUTO`, which can decompress both ``.xz`` and
   ``.lzma`` files. Other possible values are :const:`FORMAT_XZ`,
   :const:`FORMAT_ALONE`, and :const:`FORMAT_RAW`.

   The *memlimit* argument specifies a limit (in bytes) on the amount of memory
   that the decompressor can use. When this argument is used, decompression will
   fail with an :class:`LZMAError` if it is not possible to decompress the input
   within the given memory limit.

   The *filters* argument specifies the filter chain that was used to create
   the stream being decompressed. This argument is required if *format* is
   :const:`FORMAT_RAW`, but should not be used for other formats.
   See :ref:`filter-chain-specs` for more information about filter chains.

   .. note::
      This class does not transparently handle inputs containing multiple
      compressed streams, unlike :func:`decompress` and :class:`LZMAFile`. To
      decompress a multi-stream input with :class:`LZMADecompressor`, you must
      create a new decompressor for each stream.

   .. method:: decompress(data)

      Decompress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing the decompressed data for at least part of the input.
      Some of *data* may be buffered internally, for use in later calls to
      :meth:`decompress`. The returned data should be concatenated with the
      output of any previous calls to :meth:`decompress`.

   .. attribute:: check

      The ID of the integrity check used by the input stream. This may be
      :const:`CHECK_UNKNOWN` until enough of the input has been decoded to
      determine what integrity check it uses.

   .. attribute:: eof

      ``True`` if the end-of-stream marker has been reached.

   .. attribute:: unused_data

      Data found after the end of the compressed stream.

      Before the end of the stream is reached, this will be ``b""``.

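The note about multi-stream input can be handled roughly as follows. This is a
minimal sketch, assuming the whole input is already in memory; the
``decompress_streams`` helper and the sample payload are hypothetical.

```python
import lzma

# Two independent .xz streams stored back to back.
payload = lzma.compress(b"first ") + lzma.compress(b"second")

def decompress_streams(data):
    """Return the decompressed contents of each stream in *data*."""
    results = []
    while data:
        # A decompressor handles exactly one stream, so create a new one
        # for every stream found in the input.
        decompressor = lzma.LZMADecompressor()
        results.append(decompressor.decompress(data))
        if not decompressor.eof:
            raise lzma.LZMAError("input ended before the end-of-stream marker")
        # Anything after the end of this stream belongs to the next one.
        data = decompressor.unused_data
    return results

streams = decompress_streams(payload)
```

When the per-stream boundaries do not matter, :func:`decompress` and
:class:`LZMAFile` do this automatically.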

.. function:: compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Compress *data* (a :class:`bytes` object), returning the compressed data as a
   :class:`bytes` object.

   See :class:`LZMACompressor` above for a description of the *format*, *check*,
   *preset* and *filters* arguments.


.. function:: decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)

   Decompress *data* (a :class:`bytes` object), returning the uncompressed data
   as a :class:`bytes` object.

   If *data* is the concatenation of multiple distinct compressed streams,
   decompress all of these streams, and return the concatenation of the results.

   See :class:`LZMADecompressor` above for a description of the *format*,
   *memlimit* and *filters* arguments.

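A short round trip through the one-shot functions might look like this. The
sample text and variable names are assumptions made for the example.

```python
import lzma

original = b"The quick brown fox jumps over the lazy dog.\n" * 100

# Compress with a low preset, then restore with format auto-detection.
packed = lzma.compress(original, preset=2)
restored = lzma.decompress(packed)

# The legacy .lzma container is also recognized by FORMAT_AUTO.
legacy = lzma.compress(original, format=lzma.FORMAT_ALONE)
restored_legacy = lzma.decompress(legacy)

# Concatenated streams are decompressed and joined in order.
joined = lzma.decompress(packed + packed)
```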

Miscellaneous
-------------

.. function:: is_check_supported(check)

   Return ``True`` if the given integrity check is supported on this system.

   :const:`CHECK_NONE` and :const:`CHECK_CRC32` are always supported.
   :const:`CHECK_CRC64` and :const:`CHECK_SHA256` may be unavailable if you are
   using a version of :program:`liblzma` that was compiled with a limited
   feature set.

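Code that wants the strongest available check could fall back as sketched
below. The ``preferred`` ordering is an assumption made for illustration, not
a recommendation from the module.

```python
import lzma

# Try the strongest check first; CHECK_CRC32 is always supported, so the
# search is guaranteed to find something.
preferred = [lzma.CHECK_SHA256, lzma.CHECK_CRC64, lzma.CHECK_CRC32]
check = next(c for c in preferred if lzma.is_check_supported(c))

packed = lzma.compress(b"payload", check=check)
```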

.. function:: encode_filter_properties(filter)

   Return a :class:`bytes` object encoding the options (properties) of the
   filter specified by *filter* (a dictionary).

   *filter* is interpreted as a filter specifier, as described in
   :ref:`filter-chain-specs`.

   The returned data does not include the filter ID itself, only the options.

   This function is primarily of interest to users implementing custom file
   formats.


.. function:: decode_filter_properties(filter_id, encoded_props)

   Return a dictionary describing a filter with ID *filter_id*, and options
   (properties) decoded from the :class:`bytes` object *encoded_props*.

   The returned dictionary is a filter specifier, as described in
   :ref:`filter-chain-specs`.

   This function is primarily of interest to users implementing custom file
   formats.


.. _filter-chain-specs:

Specifying custom filter chains
-------------------------------

A filter chain specifier is a sequence of dictionaries, where each dictionary
contains the ID and options for a single filter. Each dictionary must contain
the key ``"id"``, and may contain additional keys to specify filter-dependent
options. Valid filter IDs are as follows:

* Compression filters:

  * :const:`FILTER_LZMA1` (for use with :const:`FORMAT_ALONE`)
  * :const:`FILTER_LZMA2` (for use with :const:`FORMAT_XZ` and :const:`FORMAT_RAW`)

* Delta filter:

  * :const:`FILTER_DELTA`

* Branch-Call-Jump (BCJ) filters:

  * :const:`FILTER_X86`
  * :const:`FILTER_IA64`
  * :const:`FILTER_ARM`
  * :const:`FILTER_ARMTHUMB`
  * :const:`FILTER_POWERPC`
  * :const:`FILTER_SPARC`

A filter chain can consist of up to 4 filters, and cannot be empty. The last
filter in the chain must be a compression filter, and any other filters must be
delta or BCJ filters.

Compression filters support the following options (specified as additional
entries in the dictionary representing the filter):

   * ``preset``: A compression preset to use as a source of default values for
     options that are not specified explicitly.
   * ``dict_size``: Dictionary size in bytes. This should be between 4 KiB and
     1.5 GiB (inclusive).
   * ``lc``: Number of literal context bits.
   * ``lp``: Number of literal position bits. The sum ``lc + lp`` must be at
     most 4.
   * ``pb``: Number of position bits; must be at most 4.
   * ``mode``: :const:`MODE_FAST` or :const:`MODE_NORMAL`.
   * ``nice_len``: What should be considered a "nice length" for a match.
     This should be 273 or less.
   * ``mf``: What match finder to use -- :const:`MF_HC3`, :const:`MF_HC4`,
     :const:`MF_BT2`, :const:`MF_BT3`, or :const:`MF_BT4`.
   * ``depth``: Maximum search depth used by match finder. 0 (default) means to
     select automatically based on other filter options.

The delta filter stores the differences between bytes, producing more repetitive
input for the compressor in certain circumstances. It supports only one option,
``dist``. This indicates the distance between bytes to be subtracted. The
default is 1, i.e. take the differences between adjacent bytes.

The BCJ filters are intended to be applied to machine code. They convert
relative branches, calls and jumps in the code to use absolute addressing, with
the aim of increasing the redundancy that can be exploited by the compressor.
These filters support one option, ``start_offset``. This specifies the address
that should be mapped to the beginning of the input data. The default is 0.

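As a sketch of how a filter chain specifier is passed around, the example
below compresses a raw stream with a delta filter followed by LZMA2 and then
restores it. The sample data, the ``dist`` of 4 (chosen to match the 4-byte
records) and the 1 MiB dictionary size are assumptions made for illustration.

```python
import lzma
import struct

# Sample input: consecutive 32-bit little-endian integers.
data = b"".join(struct.pack("<I", n) for n in range(10000))

# The delta filter comes first; the compression filter must be last.
filters = [
    {"id": lzma.FILTER_DELTA, "dist": 4},
    {"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 20},
]

# FORMAT_RAW stores no header, so the same chain is needed on both sides.
raw = lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters)
restored = lzma.decompress(raw, format=lzma.FORMAT_RAW, filters=filters)
```

Because a raw stream does not record its filter chain, the chain has to be
stored or agreed on separately by whoever decompresses the data.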

Examples
--------

Reading in a compressed file::

   import lzma
   with lzma.LZMAFile("file.xz") as f:
       file_content = f.read()

Creating a compressed file::

   import lzma
   data = b"Insert Data Here"
   with lzma.LZMAFile("file.xz", "w") as f:
       f.write(data)

Compressing data in memory::

   import lzma
   data_in = b"Insert Data Here"
   data_out = lzma.compress(data_in)

Incremental compression::

   import lzma
   lzc = lzma.LZMACompressor()
   out1 = lzc.compress(b"Some data\n")
   out2 = lzc.compress(b"Another piece of data\n")
   out3 = lzc.compress(b"Even more data\n")
   out4 = lzc.flush()
   # Concatenate all the partial results:
   result = b"".join([out1, out2, out3, out4])

Writing compressed data to an already-open file::

   import lzma
   with open("file.xz", "wb") as f:
       f.write(b"This data will not be compressed\n")
       with lzma.LZMAFile(f, "w") as lzf:
           lzf.write(b"This *will* be compressed\n")
       f.write(b"Not compressed\n")

Creating a compressed file using a custom filter chain::

   import lzma
   my_filters = [
       {"id": lzma.FILTER_DELTA, "dist": 5},
       {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
   ]
   with lzma.LZMAFile("file.xz", "w", filters=my_filters) as f:
       f.write(b"blah blah blah")