:mod:`lzma` --- Compression using the LZMA algorithm
====================================================

.. module:: lzma
   :synopsis: A Python wrapper for the liblzma compression library.
.. moduleauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>
.. sectionauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>

.. versionadded:: 3.3


This module provides classes and convenience functions for compressing and
decompressing data using the LZMA compression algorithm. Also included is a file
interface supporting the ``.xz`` and legacy ``.lzma`` file formats used by the
:program:`xz` utility, as well as raw compressed streams.

For related file formats, see the :mod:`bz2`, :mod:`gzip`, :mod:`zipfile`, and
:mod:`tarfile` modules.

The interface provided by this module is very similar to that of the :mod:`bz2`
module. However, note that :class:`LZMAFile` is *not* thread-safe, unlike
:class:`bz2.BZ2File`, so if you need to use a single :class:`LZMAFile` instance
from multiple threads, it is necessary to protect it with a lock.

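The locking advice above could be followed roughly as in the sketch below. The
scratch file path and the ``write_record`` helper are hypothetical names
introduced only for this illustration; any other mutual-exclusion scheme would
serve equally well.

```python
import lzma
import os
import tempfile
import threading

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "shared.xz")
lock = threading.Lock()

def write_record(lzf, record):
    # Serialize access: only one thread touches the LZMAFile at a time.
    with lock:
        lzf.write(record)

with lzma.LZMAFile(path, "w") as lzf:
    threads = [
        threading.Thread(target=write_record,
                         args=(lzf, ("record %d\n" % i).encode("ascii")))
        for i in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Read everything back; the order of records depends on thread scheduling.
with lzma.LZMAFile(path) as lzf:
    records = sorted(lzf.read().splitlines())
```

The lock guarantees only that individual writes do not interleave; it does not
impose any ordering between threads, which is why the records are sorted after
reading.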

.. exception:: LZMAError

   This exception is raised when an error occurs during compression or
   decompression, or while initializing the compressor/decompressor state.


Reading and writing compressed files
------------------------------------

.. class:: LZMAFile(filename=None, mode="r", fileobj=None, format=None, check=-1, preset=None, filters=None)

   Open an LZMA-compressed file.

   An :class:`LZMAFile` can wrap an existing :term:`file object` (given by
   *fileobj*), or operate directly on a named file (named by *filename*).
   Exactly one of these two parameters should be provided. If *fileobj* is
   provided, it is not closed when the :class:`LZMAFile` is closed.

   The *mode* argument can be either ``"r"`` for reading (default), ``"w"`` for
   overwriting, or ``"a"`` for appending. If *fileobj* is provided, a mode of
   ``"w"`` does not truncate the file, and is instead equivalent to ``"a"``.

   When opening a file for reading, the input file may be the concatenation of
   multiple separate compressed streams. These are transparently decoded as a
   single logical stream.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   :class:`LZMAFile` supports all the members specified by
   :class:`io.BufferedIOBase`, except for :meth:`detach` and :meth:`truncate`.
   Iteration and the :keyword:`with` statement are supported.

   The following method is also provided:

   .. method:: peek(size=-1)

      Return buffered data without advancing the file position. At least one
      byte of data will be returned, unless EOF has been reached. The exact
      number of bytes returned is unspecified (the *size* argument is ignored).

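As a rough illustration of the behaviour described above, the sketch below
writes one stream, appends a second one, and then reads the file back as a
single logical stream, using :meth:`peek` and iteration along the way. The
temporary path is a hypothetical name chosen for this example.

```python
import lzma
import os
import tempfile

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.xz")

# The first open creates the file; the second appends another compressed stream.
with lzma.LZMAFile(path, "w") as f:
    f.write(b"line one\n")
with lzma.LZMAFile(path, "a") as f:
    f.write(b"line two\n")

with lzma.LZMAFile(path) as f:
    head = f.peek()   # buffered data; the file position does not move
    lines = list(f)   # iteration still starts from the first byte
```

Because the number of bytes returned by :meth:`peek` is unspecified, code
should treat the result only as a prefix of what a later read will return.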

Compressing and decompressing data in memory
--------------------------------------------

.. class:: LZMACompressor(format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Create a compressor object, which can be used to compress data incrementally.

   For a more convenient way of compressing a single chunk of data, see
   :func:`compress`.

   The *format* argument specifies what container format should be used.
   Possible values are:

   * :const:`FORMAT_XZ`: The ``.xz`` container format.
     This is the default format.

   * :const:`FORMAT_ALONE`: The legacy ``.lzma`` container format.
     This format is more limited than ``.xz`` -- it does not support integrity
     checks or multiple filters.

   * :const:`FORMAT_RAW`: A raw data stream, not using any container format.
     This format specifier does not support integrity checks, and requires that
     you always specify a custom filter chain (for both compression and
     decompression). Additionally, data compressed in this manner cannot be
     decompressed using :const:`FORMAT_AUTO` (see :class:`LZMADecompressor`).

   The *check* argument specifies the type of integrity check to include in the
   compressed data. This check is used when decompressing, to ensure that the
   data has not been corrupted. Possible values are:

   * :const:`CHECK_NONE`: No integrity check.
     This is the default (and the only acceptable value) for
     :const:`FORMAT_ALONE` and :const:`FORMAT_RAW`.

   * :const:`CHECK_CRC32`: 32-bit Cyclic Redundancy Check.

   * :const:`CHECK_CRC64`: 64-bit Cyclic Redundancy Check.
     This is the default for :const:`FORMAT_XZ`.

   * :const:`CHECK_SHA256`: 256-bit Secure Hash Algorithm.

   If the specified check is not supported, an :class:`LZMAError` is raised.

   The compression settings can be specified either as a preset compression
   level (with the *preset* argument), or in detail as a custom filter chain
   (with the *filters* argument).

   The *preset* argument (if provided) should be an integer between ``0`` and
   ``9`` (inclusive), optionally OR-ed with the constant
   :const:`PRESET_EXTREME`. If neither *preset* nor *filters* are given, the
   default behavior is to use :const:`PRESET_DEFAULT` (preset level ``6``).
   Higher presets produce smaller output, but make compression more CPU- and
   memory-intensive, and also increase the memory required for decompression.

   The *filters* argument (if provided) should be a filter chain specifier.
   See :ref:`filter-chain-specs` for details.

   .. method:: compress(data)

      Compress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing compressed data for at least part of the input. Some of
      *data* may be buffered internally, for use in later calls to
      :meth:`compress` and :meth:`flush`. The returned data should be
      concatenated with the output of any previous calls to :meth:`compress`.

   .. method:: flush()

      Finish the compression process, returning a :class:`bytes` object
      containing any data stored in the compressor's internal buffers.

      The compressor cannot be used after this method has been called.

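To make the *format*, *check* and *preset* options more concrete, here is a
minimal sketch that compresses the same data into two different containers.
The ``compress_with`` helper is a hypothetical convenience defined only for
this example; :const:`CHECK_CRC32` is used because it is described below as
always being supported.

```python
import lzma

data = b"example payload " * 64

def compress_with(**options):
    # Hypothetical helper: run one complete compress()/flush() cycle.
    lzc = lzma.LZMACompressor(**options)
    return lzc.compress(data) + lzc.flush()

# .xz container with an explicit integrity check and the highest preset.
xz_crc32 = compress_with(format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC32,
                         preset=9 | lzma.PRESET_EXTREME)

# Legacy .lzma container: no integrity check is possible here.
legacy = compress_with(format=lzma.FORMAT_ALONE, preset=1)
```

Both results can be passed to :func:`decompress` with its default
:const:`FORMAT_AUTO` setting, since that mode accepts ``.xz`` as well as
``.lzma`` input.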

.. class:: LZMADecompressor(format=FORMAT_AUTO, memlimit=None, filters=None)

   Create a decompressor object, which can be used to decompress data
   incrementally.

   For a more convenient way of decompressing an entire compressed stream at
   once, see :func:`decompress`.

   The *format* argument specifies the container format that should be used. The
   default is :const:`FORMAT_AUTO`, which can decompress both ``.xz`` and
   ``.lzma`` files. Other possible values are :const:`FORMAT_XZ`,
   :const:`FORMAT_ALONE`, and :const:`FORMAT_RAW`.

   The *memlimit* argument specifies a limit (in bytes) on the amount of memory
   that the decompressor can use. When this argument is used, decompression will
   fail with an :class:`LZMAError` if it is not possible to decompress the input
   within the given memory limit.

   The *filters* argument specifies the filter chain that was used to create
   the stream being decompressed. This argument is required if *format* is
   :const:`FORMAT_RAW`, but should not be used for other formats.
   See :ref:`filter-chain-specs` for more information about filter chains.

   .. note::
      This class does not transparently handle inputs containing multiple
      compressed streams, unlike :func:`decompress` and :class:`LZMAFile`. To
      decompress a multi-stream input with :class:`LZMADecompressor`, you must
      create a new decompressor for each stream.

   .. method:: decompress(data)

      Decompress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing the decompressed data for at least part of the input.
      Some of *data* may be buffered internally, for use in later calls to
      :meth:`decompress`. The returned data should be concatenated with the
      output of any previous calls to :meth:`decompress`.

   .. attribute:: check

      The ID of the integrity check used by the input stream. This may be
      :const:`CHECK_UNKNOWN` until enough of the input has been decoded to
      determine what integrity check it uses.

   .. attribute:: eof

      True if the end-of-stream marker has been reached.

   .. attribute:: unused_data

      Data found after the end of the compressed stream.

      Before the end of the stream is reached, this will be ``b""``.

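The note about multi-stream input can be handled along the lines of the
following sketch, which creates a fresh decompressor for each stream and feeds
it whatever the previous one left in ``unused_data``. The function name
``decompress_all_streams`` is hypothetical and not part of the module.

```python
import lzma

def decompress_all_streams(data):
    # Hypothetical helper: decode each stream with its own decompressor.
    results = []
    while data:
        lzd = lzma.LZMADecompressor()
        results.append(lzd.decompress(data))
        if not lzd.eof:
            raise lzma.LZMAError("input ended before the end-of-stream marker")
        # Whatever follows the finished stream belongs to the next one.
        data = lzd.unused_data
    return b"".join(results)

two_streams = lzma.compress(b"first stream\n") + lzma.compress(b"second stream\n")
combined = decompress_all_streams(two_streams)
```

This is only needed when working with :class:`LZMADecompressor` directly;
:func:`decompress` and :class:`LZMAFile` already treat concatenated streams as
one.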

.. function:: compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Compress *data* (a :class:`bytes` object), returning the compressed data as a
   :class:`bytes` object.

   See :class:`LZMACompressor` above for a description of the *format*, *check*,
   *preset* and *filters* arguments.


.. function:: decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)

   Decompress *data* (a :class:`bytes` object), returning the uncompressed data
   as a :class:`bytes` object.

   If *data* is the concatenation of multiple distinct compressed streams,
   decompress all of these streams, and return the concatenation of the results.

   See :class:`LZMADecompressor` above for a description of the *format*,
   *memlimit* and *filters* arguments.

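A short sketch of the one-shot functions follows. It shows concatenated
streams being decoded in a single call, and a deliberately tiny *memlimit*
being rejected. The value 1024 is an arbitrary figure picked for this example
on the assumption that it is far below what the default preset needs.

```python
import lzma

payload = b"spam and eggs " * 100
blob = lzma.compress(payload)

# Two streams back to back are decoded and joined in one call.
doubled = lzma.decompress(blob + blob)

# An unrealistically small memory limit makes decompression fail.
try:
    lzma.decompress(blob, memlimit=1024)
    limit_hit = False
except lzma.LZMAError:
    limit_hit = True
```

In practice a limit would be chosen to bound the cost of decoding untrusted
input rather than to force a failure.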

Miscellaneous
-------------

.. function:: check_is_supported(check)

   Return ``True`` if the given integrity check is supported on this system.

   :const:`CHECK_NONE` and :const:`CHECK_CRC32` are always supported.
   :const:`CHECK_CRC64` and :const:`CHECK_SHA256` may be unavailable if you are
   using a version of :program:`liblzma` that was compiled with a limited
   feature set.


.. _filter-chain-specs:

Specifying custom filter chains
-------------------------------

A filter chain specifier is a sequence of dictionaries, where each dictionary
contains the ID and options for a single filter. Each dictionary must contain
the key ``"id"``, and may contain additional keys to specify filter-dependent
options. Valid filter IDs are as follows:

* Compression filters:

  * :const:`FILTER_LZMA1` (for use with :const:`FORMAT_ALONE`)
  * :const:`FILTER_LZMA2` (for use with :const:`FORMAT_XZ` and :const:`FORMAT_RAW`)

* Delta filter:

  * :const:`FILTER_DELTA`

* Branch-Call-Jump (BCJ) filters:

  * :const:`FILTER_X86`
  * :const:`FILTER_IA64`
  * :const:`FILTER_ARM`
  * :const:`FILTER_ARMTHUMB`
  * :const:`FILTER_POWERPC`
  * :const:`FILTER_SPARC`

A filter chain can consist of up to 4 filters, and cannot be empty. The last
filter in the chain must be a compression filter, and any other filters must be
delta or BCJ filters.

Compression filters support the following options (specified as additional
entries in the dictionary representing the filter):

   * ``preset``: A compression preset to use as a source of default values for
     options that are not specified explicitly.
   * ``dict_size``: Dictionary size in bytes. This should be between 4 KiB and
     1.5 GiB (inclusive).
   * ``lc``: Number of literal context bits.
   * ``lp``: Number of literal position bits. The sum ``lc + lp`` must be at
     most 4.
   * ``pb``: Number of position bits; must be at most 4.
   * ``mode``: :const:`MODE_FAST` or :const:`MODE_NORMAL`.
   * ``nice_len``: What should be considered a "nice length" for a match.
     This should be 273 or less.
   * ``mf``: What match finder to use -- :const:`MF_HC3`, :const:`MF_HC4`,
     :const:`MF_BT2`, :const:`MF_BT3`, or :const:`MF_BT4`.
   * ``depth``: Maximum search depth used by match finder. 0 (default) means to
     select automatically based on other filter options.

The delta filter stores the differences between bytes, producing more repetitive
input for the compressor in certain circumstances. It supports only one option,
``dist``, which indicates the distance between bytes to be subtracted. The
default is 1, i.e. take the differences between adjacent bytes.

The BCJ filters are intended to be applied to machine code. They convert
relative branches, calls and jumps in the code to use absolute addressing, with
the aim of increasing the redundancy that can be exploited by the compressor.
These filters support one option, ``start_offset``. This specifies the address
that should be mapped to the beginning of the input data. The default is 0.

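The rules in this section can be tried out with a sketch like the one below,
which builds a three-filter chain and uses it with :const:`FORMAT_RAW`, where
the same chain has to be supplied for both compression and decompression. The
option values are arbitrary choices for illustration, and the payload is not
real machine code, so the BCJ filter is included only to show the syntax.

```python
import lzma

# Delta and BCJ filters come first; the compression filter must be last.
raw_filters = [
    {"id": lzma.FILTER_DELTA, "dist": 4},
    {"id": lzma.FILTER_X86, "start_offset": 0},
    {"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 20},
]

payload = bytes(range(256)) * 16

# A raw stream carries no header, so the chain is passed on both sides.
raw = lzma.compress(payload, format=lzma.FORMAT_RAW, filters=raw_filters)
restored = lzma.decompress(raw, format=lzma.FORMAT_RAW, filters=raw_filters)
```

With :const:`FORMAT_XZ` the chain is recorded in the container, so only the
compressing side would need the *filters* argument.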

Examples
--------

Reading in a compressed file::

   import lzma
   with lzma.LZMAFile("file.xz") as f:
       file_content = f.read()

Creating a compressed file::

   import lzma
   data = b"Insert Data Here"
   with lzma.LZMAFile("file.xz", "w") as f:
       f.write(data)

Compressing data in memory::

   import lzma
   data_in = b"Insert Data Here"
   data_out = lzma.compress(data_in)

Incremental compression::

   import lzma
   lzc = lzma.LZMACompressor()
   out1 = lzc.compress(b"Some data\n")
   out2 = lzc.compress(b"Another piece of data\n")
   out3 = lzc.compress(b"Even more data\n")
   out4 = lzc.flush()
   # Concatenate all the partial results:
   result = b"".join([out1, out2, out3, out4])

Writing compressed data to an already-open file::

   import lzma
   with open("file.xz", "wb") as f:
       f.write(b"This data will not be compressed\n")
       with lzma.LZMAFile(fileobj=f, mode="w") as lzf:
           lzf.write(b"This *will* be compressed\n")
       f.write(b"Not compressed\n")

Creating a compressed file using a custom filter chain::

   import lzma
   my_filters = [
       {"id": lzma.FILTER_DELTA, "dist": 5},
       {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
   ]
   with lzma.LZMAFile("file.xz", "w", filters=my_filters) as f:
       f.write(b"blah blah blah")
344 f.write(b"blah blah blah")