:mod:`lzma` --- Compression using the LZMA algorithm
====================================================

.. module:: lzma
   :synopsis: A Python wrapper for the liblzma compression library.
.. moduleauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>
.. sectionauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>

.. versionadded:: 3.3


This module provides classes and convenience functions for compressing and
decompressing data using the LZMA compression algorithm. Also included is a file
interface supporting the ``.xz`` and legacy ``.lzma`` file formats used by the
:program:`xz` utility, as well as raw compressed streams.

The interface provided by this module is very similar to that of the :mod:`bz2`
module. However, note that :class:`LZMAFile` is *not* thread-safe, unlike
:class:`bz2.BZ2File`, so if you need to use a single :class:`LZMAFile` instance
from multiple threads, it is necessary to protect it with a lock.
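
For instance, a single writer shared between threads can be guarded with a
:class:`threading.Lock`. The following is only a sketch: the helper
``write_record``, the temporary file name and the thread count are assumptions
made for illustration, not part of this module::

    import lzma
    import os
    import tempfile
    import threading

    path = os.path.join(tempfile.mkdtemp(), "shared.xz")
    lock = threading.Lock()

    def write_record(lzf, record):
        # Hypothetical helper: serialize access to the shared LZMAFile.
        with lock:
            lzf.write(record)

    with lzma.LZMAFile(path, "w") as lzf:
        threads = [
            threading.Thread(target=write_record, args=(lzf, b"record\n"))
            for _ in range(4)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    # Read the file back to confirm that every record was written intact.
    with lzma.LZMAFile(path) as lzf:
        records = lzf.read().splitlines()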


.. exception:: LZMAError

   This exception is raised when an error occurs during compression or
   decompression, or while initializing the compressor/decompressor state.


Reading and writing compressed files
------------------------------------

.. class:: LZMAFile(filename=None, mode="r", \*, fileobj=None, format=None, check=-1, preset=None, filters=None)

   Open an LZMA-compressed file.

   An :class:`LZMAFile` can wrap an existing :term:`file object` (given by
   *fileobj*), or operate directly on a named file (named by *filename*).
   Exactly one of these two parameters should be provided. If *fileobj* is
   provided, it is not closed when the :class:`LZMAFile` is closed.

   The *mode* argument can be either ``"r"`` for reading (default), ``"w"`` for
   overwriting, or ``"a"`` for appending. If *fileobj* is provided, a mode of
   ``"w"`` does not truncate the file, and is instead equivalent to ``"a"``.

   When opening a file for reading, the input file may be the concatenation of
   multiple separate compressed streams. These are transparently decoded as a
   single logical stream.
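
   As an illustrative sketch (the temporary file name is an assumption),
   appending to an existing file adds a second compressed stream, and reading
   the file back returns the data of both streams as one sequence of bytes::

      import lzma
      import os
      import tempfile

      path = os.path.join(tempfile.mkdtemp(), "multi.xz")
      with lzma.LZMAFile(path, "w") as f:
          f.write(b"first stream\n")
      # Appending leaves the first stream untouched and adds a second one.
      with lzma.LZMAFile(path, "a") as f:
          f.write(b"second stream\n")

      # Both streams are decoded as a single logical stream.
      with lzma.LZMAFile(path) as f:
          combined = f.read()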

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   :class:`LZMAFile` supports all the members specified by
   :class:`io.BufferedIOBase`, except for :meth:`detach` and :meth:`truncate`.
   Iteration and the :keyword:`with` statement are supported.

   The following method is also provided:

   .. method:: peek(size=-1)

      Return buffered data without advancing the file position. At least one
      byte of data will be returned, unless EOF has been reached. The exact
      number of bytes returned is unspecified (the *size* argument is ignored).
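
      A minimal sketch of looking ahead without consuming input. The temporary
      file is an assumption for illustration, and because the amount of data
      returned is unspecified, the code relies only on getting at least one
      byte::

         import lzma
         import os
         import tempfile

         path = os.path.join(tempfile.mkdtemp(), "peek.xz")
         with lzma.LZMAFile(path, "w") as f:
             f.write(b"header: payload")

         with lzma.LZMAFile(path) as f:
             first = f.peek()[:1]  # inspect the first byte only
             whole = f.read()      # the peeked byte is still returned here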


Compressing and decompressing data in memory
--------------------------------------------

.. class:: LZMACompressor(format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Create a compressor object, which can be used to compress data incrementally.

   For a more convenient way of compressing a single chunk of data, see
   :func:`compress`.

   The *format* argument specifies what container format should be used.
   Possible values are:

   * :const:`FORMAT_XZ`: The ``.xz`` container format.
     This is the default format.

   * :const:`FORMAT_ALONE`: The legacy ``.lzma`` container format.
     This format is more limited than ``.xz`` -- it does not support integrity
     checks or multiple filters.

   * :const:`FORMAT_RAW`: A raw data stream, not using any container format.
     This format specifier does not support integrity checks, and requires that
     you always specify a custom filter chain (for both compression and
     decompression). Additionally, data compressed in this manner cannot be
     decompressed using :const:`FORMAT_AUTO` (see :class:`LZMADecompressor`).

   The *check* argument specifies the type of integrity check to include in the
   compressed data. This check is used when decompressing, to ensure that the
   data has not been corrupted. Possible values are:

   * :const:`CHECK_NONE`: No integrity check.
     This is the default (and the only acceptable value) for
     :const:`FORMAT_ALONE` and :const:`FORMAT_RAW`.

   * :const:`CHECK_CRC32`: 32-bit Cyclic Redundancy Check.

   * :const:`CHECK_CRC64`: 64-bit Cyclic Redundancy Check.
     This is the default for :const:`FORMAT_XZ`.

   * :const:`CHECK_SHA256`: 256-bit Secure Hash Algorithm.

   If the specified check is not supported, an :class:`LZMAError` is raised.

   The compression settings can be specified either as a preset compression
   level (with the *preset* argument), or in detail as a custom filter chain
   (with the *filters* argument).

   The *preset* argument (if provided) should be an integer between ``0`` and
   ``9`` (inclusive), optionally OR-ed with the constant
   :const:`PRESET_EXTREME`. If neither *preset* nor *filters* are given, the
   default behavior is to use :const:`PRESET_DEFAULT` (preset level ``6``).
   Higher presets produce smaller output, but make the compression process
   slower.

   .. note::

      In addition to being more CPU-intensive, compression with higher presets
      also requires much more memory (and produces output that needs more memory
      to decompress). With preset ``9`` for example, the overhead for an
      :class:`LZMACompressor` object can be as high as 800MiB. For this reason,
      it is generally best to stick with the default preset.
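
   The following sketch compresses the same input with two presets. The sample
   data is an assumption chosen for illustration, and the resulting sizes
   depend on the input and on the :program:`liblzma` version, so none are
   quoted here::

      import lzma

      data = b"The quick brown fox jumps over the lazy dog.\n" * 1000

      # Lowest preset: fastest, usually the largest output.
      fast = lzma.compress(data, preset=0)
      # Default level OR-ed with PRESET_EXTREME: slower, usually smaller.
      small = lzma.compress(
          data, preset=lzma.PRESET_DEFAULT | lzma.PRESET_EXTREME)

      # Either output decompresses to the original data.
      roundtrip_ok = (lzma.decompress(fast) == data
                      and lzma.decompress(small) == data)
      sizes = {"preset 0": len(fast), "preset 6 extreme": len(small)}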

   The *filters* argument (if provided) should be a filter chain specifier.
   See :ref:`filter-chain-specs` for details.

   .. method:: compress(data)

      Compress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing compressed data for at least part of the input. Some of
      *data* may be buffered internally, for use in later calls to
      :meth:`compress` and :meth:`flush`. The returned data should be
      concatenated with the output of any previous calls to :meth:`compress`.

   .. method:: flush()

      Finish the compression process, returning a :class:`bytes` object
      containing any data stored in the compressor's internal buffers.

      The compressor cannot be used after this method has been called.


.. class:: LZMADecompressor(format=FORMAT_AUTO, memlimit=None, filters=None)

   Create a decompressor object, which can be used to decompress data
   incrementally.

   For a more convenient way of decompressing an entire compressed stream at
   once, see :func:`decompress`.

   The *format* argument specifies the container format that should be used. The
   default is :const:`FORMAT_AUTO`, which can decompress both ``.xz`` and
   ``.lzma`` files. Other possible values are :const:`FORMAT_XZ`,
   :const:`FORMAT_ALONE`, and :const:`FORMAT_RAW`.

   The *memlimit* argument specifies a limit (in bytes) on the amount of memory
   that the decompressor can use. When this argument is used, decompression will
   fail with an :class:`LZMAError` if it is not possible to decompress the input
   within the given memory limit.
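
   A sketch of guarding against excessive memory use. The limit of 1 KiB is an
   assumption, picked to be far too small for data compressed with the default
   preset so that the failure path is exercised::

      import lzma

      compressed = lzma.compress(b"some data" * 100)

      try:
          lzma.LZMADecompressor(memlimit=1024).decompress(compressed)
      except lzma.LZMAError:
          # The input cannot be decoded within the given memory limit.
          limit_exceeded = True
      else:
          limit_exceeded = False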

   The *filters* argument specifies the filter chain that was used to create
   the stream being decompressed. This argument is required if *format* is
   :const:`FORMAT_RAW`, but should not be used for other formats.
   See :ref:`filter-chain-specs` for more information about filter chains.

   .. note::
      This class does not transparently handle inputs containing multiple
      compressed streams, unlike :func:`decompress` and :class:`LZMAFile`. To
      decompress a multi-stream input with :class:`LZMADecompressor`, you must
      create a new decompressor for each stream.

   .. method:: decompress(data)

      Decompress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing the decompressed data for at least part of the input.
      Some of *data* may be buffered internally, for use in later calls to
      :meth:`decompress`. The returned data should be concatenated with the
      output of any previous calls to :meth:`decompress`.

   .. attribute:: check

      The ID of the integrity check used by the input stream. This may be
      :const:`CHECK_UNKNOWN` until enough of the input has been decoded to
      determine what integrity check it uses.

   .. attribute:: eof

      True if the end-of-stream marker has been reached.

   .. attribute:: unused_data

      Data found after the end of the compressed stream.

      Before the end of the stream is reached, this will be ``b""``.
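
   The sketch below feeds a stream to a decompressor in two pieces and then
   inspects the attributes described above. The payload and the trailing bytes
   are assumptions made for illustration::

      import lzma

      payload = b"example payload " * 64
      compressed = lzma.compress(payload)
      stream = compressed + b"trailing bytes"
      half = len(compressed) // 2

      decomp = lzma.LZMADecompressor()
      first = decomp.decompress(stream[:half])
      eof_midway = decomp.eof            # the stream has not ended yet
      second = decomp.decompress(stream[half:])

      # Concatenate the partial results, as described for decompress().
      result = first + second
      leftover = decomp.unused_data      # bytes after the end of the stream
      check_id = decomp.check            # CRC64 is the default for .xz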


.. function:: compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Compress *data* (a :class:`bytes` object), returning the compressed data as a
   :class:`bytes` object.

   See :class:`LZMACompressor` above for a description of the *format*, *check*,
   *preset* and *filters* arguments.


.. function:: decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)

   Decompress *data* (a :class:`bytes` object), returning the uncompressed data
   as a :class:`bytes` object.

   If *data* is the concatenation of multiple distinct compressed streams,
   decompress all of these streams, and return the concatenation of the results.

   See :class:`LZMADecompressor` above for a description of the *format*,
   *memlimit* and *filters* arguments.
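
   To illustrate the difference in multi-stream handling, the sketch below
   decompresses a two-stream input both ways. The helper
   ``decompress_streams`` is hypothetical and only approximates what
   :func:`decompress` does::

      import lzma

      def decompress_streams(data):
          # Hypothetical helper: one LZMADecompressor per compressed stream.
          results = []
          while data:
              decomp = lzma.LZMADecompressor()
              results.append(decomp.decompress(data))
              if not decomp.eof:
                  raise lzma.LZMAError("compressed data ended prematurely")
              # Whatever follows the finished stream starts the next one.
              data = decomp.unused_data
          return b"".join(results)

      two_streams = lzma.compress(b"first ") + lzma.compress(b"second")
      manual = decompress_streams(two_streams)
      automatic = lzma.decompress(two_streams)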


Miscellaneous
-------------

.. function:: is_check_supported(check)

   Returns true if the given integrity check is supported on this system.

   :const:`CHECK_NONE` and :const:`CHECK_CRC32` are always supported.
   :const:`CHECK_CRC64` and :const:`CHECK_SHA256` may be unavailable if you are
   using a version of :program:`liblzma` that was compiled with a limited
   feature set.
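
   A sketch of choosing a check at run time, falling back to
   :const:`CHECK_CRC32` (which is always supported) when SHA-256 is not
   available. The payload is an assumption made for illustration::

      import lzma

      if lzma.is_check_supported(lzma.CHECK_SHA256):
          check = lzma.CHECK_SHA256
      else:
          check = lzma.CHECK_CRC32

      compressed = lzma.compress(b"payload", check=check)

      decomp = lzma.LZMADecompressor()
      restored = decomp.decompress(compressed)
      used_check = decomp.check  # ID of the check found in the stream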


.. function:: encode_filter_properties(filter)

   Return a :class:`bytes` object encoding the options (properties) of the
   filter specified by *filter* (a dictionary).

   *filter* is interpreted as a filter specifier, as described in
   :ref:`filter-chain-specs`.

   The returned data does not include the filter ID itself, only the options.

   This function is primarily of interest to users implementing custom file
   formats.


.. function:: decode_filter_properties(filter_id, encoded_props)

   Return a dictionary describing a filter with ID *filter_id*, and options
   (properties) decoded from the :class:`bytes` object *encoded_props*.

   The returned dictionary is a filter specifier, as described in
   :ref:`filter-chain-specs`.

   This function is primarily of interest to users implementing custom file
   formats.


.. _filter-chain-specs:

Specifying custom filter chains
-------------------------------

A filter chain specifier is a sequence of dictionaries, where each dictionary
contains the ID and options for a single filter. Each dictionary must contain
the key ``"id"``, and may contain additional keys to specify filter-dependent
options. Valid filter IDs are as follows:

* Compression filters:

  * :const:`FILTER_LZMA1` (for use with :const:`FORMAT_ALONE`)
  * :const:`FILTER_LZMA2` (for use with :const:`FORMAT_XZ` and :const:`FORMAT_RAW`)

* Delta filter:

  * :const:`FILTER_DELTA`

* Branch-Call-Jump (BCJ) filters:

  * :const:`FILTER_X86`
  * :const:`FILTER_IA64`
  * :const:`FILTER_ARM`
  * :const:`FILTER_ARMTHUMB`
  * :const:`FILTER_POWERPC`
  * :const:`FILTER_SPARC`

A filter chain can consist of up to 4 filters, and cannot be empty. The last
filter in the chain must be a compression filter, and any other filters must be
delta or BCJ filters.

Compression filters support the following options (specified as additional
entries in the dictionary representing the filter):

* ``preset``: A compression preset to use as a source of default values for
  options that are not specified explicitly.
* ``dict_size``: Dictionary size in bytes. This should be between 4KiB and
  1.5GiB (inclusive).
* ``lc``: Number of literal context bits.
* ``lp``: Number of literal position bits. The sum ``lc + lp`` must be at
  most 4.
* ``pb``: Number of position bits; must be at most 4.
* ``mode``: :const:`MODE_FAST` or :const:`MODE_NORMAL`.
* ``nice_len``: What should be considered a "nice length" for a match.
  This should be 273 or less.
* ``mf``: What match finder to use -- :const:`MF_HC3`, :const:`MF_HC4`,
  :const:`MF_BT2`, :const:`MF_BT3`, or :const:`MF_BT4`.
* ``depth``: Maximum search depth used by match finder. 0 (default) means to
  select automatically based on other filter options.

The delta filter stores the differences between bytes, producing more repetitive
input for the compressor in certain circumstances. It supports only one option,
``dist``. This indicates the distance between bytes to be subtracted. The
default is 1, i.e. take the differences between adjacent bytes.

The BCJ filters are intended to be applied to machine code. They convert
relative branches, calls and jumps in the code to use absolute addressing, with
the aim of increasing the redundancy that can be exploited by the compressor.
These filters support one option, ``start_offset``. This specifies the address
that should be mapped to the beginning of the input data. The default is 0.
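
As a sketch of how a filter chain is used with :const:`FORMAT_RAW`, the same
chain is passed to both :func:`compress` and :func:`decompress`. The sample data
and the option values are assumptions chosen for illustration::

    import lzma
    import struct

    # 32-bit little-endian counters: the difference between bytes that are
    # four positions apart is mostly constant, which suits dist=4.
    data = b"".join(struct.pack("<I", n) for n in range(0, 40000, 4))

    filters = [
        {"id": lzma.FILTER_DELTA, "dist": 4},
        {"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 20},
    ]

    # A raw stream carries no header, so the chain must be given both times.
    raw = lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters)
    restored = lzma.decompress(raw, format=lzma.FORMAT_RAW, filters=filters)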


Examples
--------

Reading in a compressed file::

    import lzma
    with lzma.LZMAFile("file.xz") as f:
        file_content = f.read()

Creating a compressed file::

    import lzma
    data = b"Insert Data Here"
    with lzma.LZMAFile("file.xz", "w") as f:
        f.write(data)

Compressing data in memory::

    import lzma
    data_in = b"Insert Data Here"
    data_out = lzma.compress(data_in)

Incremental compression::

    import lzma
    lzc = lzma.LZMACompressor()
    out1 = lzc.compress(b"Some data\n")
    out2 = lzc.compress(b"Another piece of data\n")
    out3 = lzc.compress(b"Even more data\n")
    out4 = lzc.flush()
    # Concatenate all the partial results:
    result = b"".join([out1, out2, out3, out4])

Writing compressed data to an already-open file::

    import lzma
    with open("file.xz", "wb") as f:
        f.write(b"This data will not be compressed\n")
        with lzma.LZMAFile(fileobj=f, mode="w") as lzf:
            lzf.write(b"This *will* be compressed\n")
        f.write(b"Not compressed\n")

Creating a compressed file using a custom filter chain::

    import lzma
    my_filters = [
        {"id": lzma.FILTER_DELTA, "dist": 5},
        {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
    ]
    with lzma.LZMAFile("file.xz", "w", filters=my_filters) as f:
        f.write(b"blah blah blah")
375 f.write(b"blah blah blah")