:mod:`lzma` --- Compression using the LZMA algorithm
====================================================

.. module:: lzma
   :synopsis: A Python wrapper for the liblzma compression library.
.. moduleauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>
.. sectionauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>

.. versionadded:: 3.3


This module provides classes and convenience functions for compressing and
decompressing data using the LZMA compression algorithm. Also included is a file
interface supporting the ``.xz`` and legacy ``.lzma`` file formats used by the
:program:`xz` utility, as well as raw compressed streams.

The interface provided by this module is very similar to that of the :mod:`bz2`
module. However, note that :class:`LZMAFile` is *not* thread-safe, unlike
:class:`bz2.BZ2File`, so if you need to use a single :class:`LZMAFile` instance
from multiple threads, it is necessary to protect it with a lock.
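
The locking advice above can be sketched as follows. This is only an
illustration, assuming several threads append records to one shared writer; the
names ``writer_lock`` and ``write_record`` are hypothetical and not part of the
module:

```python
import io
import lzma
import threading

# An in-memory file object stands in for a real file on disk.
buffer = io.BytesIO()
writer = lzma.LZMAFile(buffer, "w")
writer_lock = threading.Lock()  # hypothetical name: guards every use of `writer`


def write_record(record: bytes) -> None:
    """Write one record while holding the lock."""
    with writer_lock:
        writer.write(record)


threads = [
    threading.Thread(target=write_record, args=(b"record %d\n" % i,))
    for i in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
writer.close()  # the wrapped BytesIO object stays open

# The order of the records depends on thread scheduling, so sort before use.
lines = sorted(lzma.decompress(buffer.getvalue()).splitlines())
```

The lock serializes access to the object; it does not impose any ordering on
the records themselves.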


.. exception:: LZMAError

   This exception is raised when an error occurs during compression or
   decompression, or while initializing the compressor/decompressor state.


Reading and writing compressed files
------------------------------------

.. class:: LZMAFile(filename=None, mode="r", \*, format=None, check=-1, preset=None, filters=None)

   Open an LZMA-compressed file in binary mode.

   An :class:`LZMAFile` can wrap an already-open :term:`file object`, or operate
   directly on a named file. The *filename* argument specifies either the file
   object to wrap, or the name of the file to open (as a :class:`str` or
   :class:`bytes` object). When wrapping an existing file object, the wrapped
   file will not be closed when the :class:`LZMAFile` is closed.

   The *mode* argument can be either ``"r"`` for reading (default), ``"w"`` for
   overwriting, or ``"a"`` for appending. If *filename* is an existing file
   object, a mode of ``"w"`` does not truncate the file, and is instead
   equivalent to ``"a"``.

   When opening a file for reading, the input file may be the concatenation of
   multiple separate compressed streams. These are transparently decoded as a
   single logical stream.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   :class:`LZMAFile` supports all the members specified by
   :class:`io.BufferedIOBase`, except for :meth:`detach` and :meth:`truncate`.
   Iteration and the :keyword:`with` statement are supported.

   The following method is also provided:

   .. method:: peek(size=-1)

      Return buffered data without advancing the file position. At least one
      byte of data will be returned, unless EOF has been reached. The exact
      number of bytes returned is unspecified (the *size* argument is ignored).
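
As a rough illustration of the behavior described for :class:`LZMAFile`, the
sketch below wraps an in-memory file object holding two concatenated streams,
peeks at the buffered data, and then iterates over the lines. The sample data
and variable names are made up for the example:

```python
import io
import lzma

# Two independently compressed streams, stored back to back.
raw = lzma.compress(b"first stream\n") + lzma.compress(b"second stream\n")

with lzma.LZMAFile(io.BytesIO(raw)) as f:
    # peek() returns some buffered data without moving the file position,
    # so the iteration below still starts at the first byte.
    head = f.peek()
    # Both streams are decoded as one logical stream.
    lines = list(f)
```

Because the number of bytes returned by ``peek()`` is unspecified, the sketch
only relies on ``head`` being a non-empty prefix of the decompressed data.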


Compressing and decompressing data in memory
--------------------------------------------

.. class:: LZMACompressor(format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Create a compressor object, which can be used to compress data incrementally.

   For a more convenient way of compressing a single chunk of data, see
   :func:`compress`.

   The *format* argument specifies what container format should be used.
   Possible values are:

   * :const:`FORMAT_XZ`: The ``.xz`` container format.
     This is the default format.

   * :const:`FORMAT_ALONE`: The legacy ``.lzma`` container format.
     This format is more limited than ``.xz`` -- it does not support integrity
     checks or multiple filters.

   * :const:`FORMAT_RAW`: A raw data stream, not using any container format.
     This format specifier does not support integrity checks, and requires that
     you always specify a custom filter chain (for both compression and
     decompression). Additionally, data compressed in this manner cannot be
     decompressed using :const:`FORMAT_AUTO` (see :class:`LZMADecompressor`).

   The *check* argument specifies the type of integrity check to include in the
   compressed data. This check is used when decompressing, to ensure that the
   data has not been corrupted. Possible values are:

   * :const:`CHECK_NONE`: No integrity check.
     This is the default (and the only acceptable value) for
     :const:`FORMAT_ALONE` and :const:`FORMAT_RAW`.

   * :const:`CHECK_CRC32`: 32-bit Cyclic Redundancy Check.

   * :const:`CHECK_CRC64`: 64-bit Cyclic Redundancy Check.
     This is the default for :const:`FORMAT_XZ`.

   * :const:`CHECK_SHA256`: 256-bit Secure Hash Algorithm.

   If the specified check is not supported, an :class:`LZMAError` is raised.

   The compression settings can be specified either as a preset compression
   level (with the *preset* argument), or in detail as a custom filter chain
   (with the *filters* argument).

   The *preset* argument (if provided) should be an integer between ``0`` and
   ``9`` (inclusive), optionally OR-ed with the constant
   :const:`PRESET_EXTREME`. If neither *preset* nor *filters* are given, the
   default behavior is to use :const:`PRESET_DEFAULT` (preset level ``6``).
   Higher presets produce smaller output, but make the compression process
   slower.

   .. note::

      In addition to being more CPU-intensive, compression with higher presets
      also requires much more memory (and produces output that needs more memory
      to decompress). With preset ``9`` for example, the overhead for an
      :class:`LZMACompressor` object can be as high as 800 MiB. For this reason,
      it is generally best to stick with the default preset.

   The *filters* argument (if provided) should be a filter chain specifier.
   See :ref:`filter-chain-specs` for details.

   .. method:: compress(data)

      Compress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing compressed data for at least part of the input. Some of
      *data* may be buffered internally, for use in later calls to
      :meth:`compress` and :meth:`flush`. The returned data should be
      concatenated with the output of any previous calls to :meth:`compress`.

   .. method:: flush()

      Finish the compression process, returning a :class:`bytes` object
      containing any data stored in the compressor's internal buffers.

      The compressor cannot be used after this method has been called.
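
The *format*, *check* and *preset* arguments described above can be combined as
in the following sketch. The helper ``compress_chunks`` is a hypothetical name
introduced for this example, and the low preset is chosen only to keep memory
use small:

```python
import lzma

payload = b"example payload " * 1000
chunks = [payload[i:i + 4096] for i in range(0, len(payload), 4096)]


def compress_chunks(chunks, **options):
    """Feed the chunks to one LZMACompressor and join all of its output."""
    compressor = lzma.LZMACompressor(**options)
    parts = [compressor.compress(chunk) for chunk in chunks]
    parts.append(compressor.flush())  # the compressor is unusable after this
    return b"".join(parts)


# .xz container with an explicit CRC32 integrity check.
xz_data = compress_chunks(
    chunks, format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC32, preset=1
)

# Legacy .lzma container; it carries no integrity check, so *check* is omitted.
alone_data = compress_chunks(chunks, format=lzma.FORMAT_ALONE, preset=1)
```

Individual calls to ``compress()`` may return an empty :class:`bytes` object
while input is being buffered, which is why all parts are joined at the end.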


.. class:: LZMADecompressor(format=FORMAT_AUTO, memlimit=None, filters=None)

   Create a decompressor object, which can be used to decompress data
   incrementally.

   For a more convenient way of decompressing an entire compressed stream at
   once, see :func:`decompress`.

   The *format* argument specifies the container format that should be used. The
   default is :const:`FORMAT_AUTO`, which can decompress both ``.xz`` and
   ``.lzma`` files. Other possible values are :const:`FORMAT_XZ`,
   :const:`FORMAT_ALONE`, and :const:`FORMAT_RAW`.

   The *memlimit* argument specifies a limit (in bytes) on the amount of memory
   that the decompressor can use. When this argument is used, decompression will
   fail with an :class:`LZMAError` if it is not possible to decompress the input
   within the given memory limit.

   The *filters* argument specifies the filter chain that was used to create
   the stream being decompressed. This argument is required if *format* is
   :const:`FORMAT_RAW`, but should not be used for other formats.
   See :ref:`filter-chain-specs` for more information about filter chains.

   .. note::
      This class does not transparently handle inputs containing multiple
      compressed streams, unlike :func:`decompress` and :class:`LZMAFile`. To
      decompress a multi-stream input with :class:`LZMADecompressor`, you must
      create a new decompressor for each stream.

   .. method:: decompress(data)

      Decompress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing the decompressed data for at least part of the input.
      Some of *data* may be buffered internally, for use in later calls to
      :meth:`decompress`. The returned data should be concatenated with the
      output of any previous calls to :meth:`decompress`.

   .. attribute:: check

      The ID of the integrity check used by the input stream. This may be
      :const:`CHECK_UNKNOWN` until enough of the input has been decoded to
      determine what integrity check it uses.

   .. attribute:: eof

      ``True`` if the end-of-stream marker has been reached.

   .. attribute:: unused_data

      Data found after the end of the compressed stream.

      Before the end of the stream is reached, this will be ``b""``.
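
The note about multi-stream inputs can be sketched as a loop that creates a new
decompressor for each stream and continues from ``unused_data``. The function
name ``decompress_all_streams`` is hypothetical, and the sketch assumes the
whole input is already in memory:

```python
import lzma


def decompress_all_streams(data: bytes) -> bytes:
    """Decompress every stream in *data*, one LZMADecompressor per stream."""
    results = []
    while data:
        decompressor = lzma.LZMADecompressor()
        results.append(decompressor.decompress(data))
        if not decompressor.eof:
            # The input ran out before the current stream was complete.
            raise lzma.LZMAError("input ended before the end-of-stream marker")
        # Whatever follows the finished stream belongs to the next one.
        data = decompressor.unused_data
    return b"".join(results)


two_streams = lzma.compress(b"a" * 10) + lzma.compress(b"b" * 10)
combined = decompress_all_streams(two_streams)
```

For data that is already in memory, :func:`decompress` handles this case
directly; the loop is mainly useful as a starting point for incremental
processing.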


.. function:: compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Compress *data* (a :class:`bytes` object), returning the compressed data as a
   :class:`bytes` object.

   See :class:`LZMACompressor` above for a description of the *format*, *check*,
   *preset* and *filters* arguments.


.. function:: decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)

   Decompress *data* (a :class:`bytes` object), returning the uncompressed data
   as a :class:`bytes` object.

   If *data* is the concatenation of multiple distinct compressed streams,
   decompress all of these streams, and return the concatenation of the results.

   See :class:`LZMADecompressor` above for a description of the *format*,
   *memlimit* and *filters* arguments.
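
A short sketch of the one-shot functions with :const:`FORMAT_RAW`, where the
same filter chain has to be passed to both calls. The filter chain and the
sample data are assumptions made for the example:

```python
import lzma

# FORMAT_RAW stores no header, so the filter chain must be supplied
# explicitly when compressing and again when decompressing.
raw_filters = [{"id": lzma.FILTER_LZMA2, "preset": 6}]

data = b"raw stream payload " * 100
packed = lzma.compress(data, format=lzma.FORMAT_RAW, filters=raw_filters)
unpacked = lzma.decompress(packed, format=lzma.FORMAT_RAW, filters=raw_filters)
```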


Miscellaneous
-------------

.. function:: is_check_supported(check)

   Return ``True`` if the given integrity check is supported on this system.

   :const:`CHECK_NONE` and :const:`CHECK_CRC32` are always supported.
   :const:`CHECK_CRC64` and :const:`CHECK_SHA256` may be unavailable if you are
   using a version of :program:`liblzma` that was compiled with a limited
   feature set.
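
One possible use of :func:`is_check_supported` is to fall back to a weaker
check when a stronger one is unavailable. The helper ``pick_check`` and its
preference order are hypothetical choices for this sketch:

```python
import lzma


def pick_check(preferred=(lzma.CHECK_SHA256, lzma.CHECK_CRC64, lzma.CHECK_CRC32)):
    """Return the first integrity check in *preferred* that liblzma supports."""
    for check in preferred:
        if lzma.is_check_supported(check):
            return check
    return lzma.CHECK_NONE  # always supported


check = pick_check()
data = lzma.compress(b"checked payload", check=check)
```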


.. function:: encode_filter_properties(filter)

   Return a :class:`bytes` object encoding the options (properties) of the
   filter specified by *filter* (a dictionary).

   *filter* is interpreted as a filter specifier, as described in
   :ref:`filter-chain-specs`.

   The returned data does not include the filter ID itself, only the options.

   This function is primarily of interest to users implementing custom file
   formats.


.. function:: decode_filter_properties(filter_id, encoded_props)

   Return a dictionary describing a filter with ID *filter_id*, and options
   (properties) decoded from the :class:`bytes` object *encoded_props*.

   The returned dictionary is a filter specifier, as described in
   :ref:`filter-chain-specs`.

   This function is primarily of interest to users implementing custom file
   formats.


.. _filter-chain-specs:

Specifying custom filter chains
-------------------------------

A filter chain specifier is a sequence of dictionaries, where each dictionary
contains the ID and options for a single filter. Each dictionary must contain
the key ``"id"``, and may contain additional keys to specify filter-dependent
options. Valid filter IDs are as follows:

* Compression filters:

  * :const:`FILTER_LZMA1` (for use with :const:`FORMAT_ALONE`)
  * :const:`FILTER_LZMA2` (for use with :const:`FORMAT_XZ` and :const:`FORMAT_RAW`)

* Delta filter:

  * :const:`FILTER_DELTA`

* Branch-Call-Jump (BCJ) filters:

  * :const:`FILTER_X86`
  * :const:`FILTER_IA64`
  * :const:`FILTER_ARM`
  * :const:`FILTER_ARMTHUMB`
  * :const:`FILTER_POWERPC`
  * :const:`FILTER_SPARC`

A filter chain can consist of up to 4 filters, and cannot be empty. The last
filter in the chain must be a compression filter, and any other filters must be
delta or BCJ filters.

Compression filters support the following options (specified as additional
entries in the dictionary representing the filter):

   * ``preset``: A compression preset to use as a source of default values for
     options that are not specified explicitly.
   * ``dict_size``: Dictionary size in bytes. This should be between 4 KiB and
     1.5 GiB (inclusive).
   * ``lc``: Number of literal context bits.
   * ``lp``: Number of literal position bits. The sum ``lc + lp`` must be at
     most 4.
   * ``pb``: Number of position bits; must be at most 4.
   * ``mode``: :const:`MODE_FAST` or :const:`MODE_NORMAL`.
   * ``nice_len``: What should be considered a "nice length" for a match.
     This should be 273 or less.
   * ``mf``: What match finder to use -- :const:`MF_HC3`, :const:`MF_HC4`,
     :const:`MF_BT2`, :const:`MF_BT3`, or :const:`MF_BT4`.
   * ``depth``: Maximum search depth used by match finder. 0 (default) means to
     select automatically based on other filter options.
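
The options listed above might be combined as in the following sketch. The
particular values are arbitrary choices for illustration, not recommendations;
options that are left out (here ``depth``) take their defaults from the preset:

```python
import lzma

custom_filters = [
    {
        "id": lzma.FILTER_LZMA2,
        "preset": 6,             # source of defaults for unspecified options
        "dict_size": 1 << 20,    # 1 MiB dictionary
        "lc": 3,                 # lc + lp must be at most 4
        "lp": 0,
        "pb": 2,
        "mode": lzma.MODE_NORMAL,
        "nice_len": 64,
        "mf": lzma.MF_BT4,
    },
]

data = b"filter chain example " * 500
packed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=custom_filters)
# The .xz container records the filter chain, so no filters are needed here.
unpacked = lzma.decompress(packed)
```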

The delta filter stores the differences between bytes, producing more repetitive
input for the compressor in certain circumstances. It supports only one option,
``dist``. This indicates the distance between bytes to be subtracted. The
default is 1, i.e. take the differences between adjacent bytes.
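
A small sketch of the delta filter, assuming the input is a sequence of 16-bit
little-endian samples so that a ``dist`` of 2 subtracts each byte from the
corresponding byte of the previous sample. The sample values are invented for
the example:

```python
import lzma
import struct

# A slowly increasing series of 16-bit values, packed little-endian.
samples = [1000 + 3 * i for i in range(5000)]
data = struct.pack("<%dH" % len(samples), *samples)

delta_filters = [
    {"id": lzma.FILTER_DELTA, "dist": 2},   # one sample is 2 bytes wide
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]
plain_filters = [{"id": lzma.FILTER_LZMA2, "preset": 6}]

with_delta = lzma.compress(data, filters=delta_filters)
without_delta = lzma.compress(data, filters=plain_filters)

# Whether the delta filter helps depends on the data, so compare the sizes.
print(len(with_delta), len(without_delta))
```

Both outputs decompress to the same bytes; only the compressed size differs.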

The BCJ filters are intended to be applied to machine code. They convert
relative branches, calls and jumps in the code to use absolute addressing, with
the aim of increasing the redundancy that can be exploited by the compressor.
These filters support one option, ``start_offset``. This specifies the address
that should be mapped to the beginning of the input data. The default is 0.
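
A BCJ filter is placed before the compression filter in the chain, as in this
sketch. The input here is an arbitrary byte pattern standing in for x86 machine
code, so it shows only how the chain is specified, not any compression benefit:

```python
import lzma

bcj_filters = [
    {"id": lzma.FILTER_X86, "start_offset": 0},   # 0 is the default offset
    {"id": lzma.FILTER_LZMA2, "preset": 6},       # compression filter goes last
]

# Placeholder input; real use would pass the contents of an executable.
code_like = bytes(range(256)) * 64
packed = lzma.compress(code_like, filters=bcj_filters)
unpacked = lzma.decompress(packed)
```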


Examples
--------

Reading in a compressed file::

   import lzma
   with lzma.LZMAFile("file.xz") as f:
       file_content = f.read()

Creating a compressed file::

   import lzma
   data = b"Insert Data Here"
   with lzma.LZMAFile("file.xz", "w") as f:
       f.write(data)

Compressing data in memory::

   import lzma
   data_in = b"Insert Data Here"
   data_out = lzma.compress(data_in)

Incremental compression::

   import lzma
   lzc = lzma.LZMACompressor()
   out1 = lzc.compress(b"Some data\n")
   out2 = lzc.compress(b"Another piece of data\n")
   out3 = lzc.compress(b"Even more data\n")
   out4 = lzc.flush()
   # Concatenate all the partial results:
   result = b"".join([out1, out2, out3, out4])

Writing compressed data to an already-open file::

   import lzma
   with open("file.xz", "wb") as f:
       f.write(b"This data will not be compressed\n")
       with lzma.LZMAFile(f, "w") as lzf:
           lzf.write(b"This *will* be compressed\n")
       f.write(b"Not compressed\n")

Creating a compressed file using a custom filter chain::

   import lzma
   my_filters = [
       {"id": lzma.FILTER_DELTA, "dist": 5},
       {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
   ]
   with lzma.LZMAFile("file.xz", "w", filters=my_filters) as f:
       f.write(b"blah blah blah")