:mod:`lzma` --- Compression using the LZMA algorithm
====================================================

.. module:: lzma
   :synopsis: A Python wrapper for the liblzma compression library.
.. moduleauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>
.. sectionauthor:: Nadeem Vawda <nadeem.vawda@gmail.com>

.. versionadded:: 3.3


This module provides classes and convenience functions for compressing and
decompressing data using the LZMA compression algorithm. Also included is a file
interface supporting the ``.xz`` and legacy ``.lzma`` file formats used by the
:program:`xz` utility, as well as raw compressed streams.

The interface provided by this module is very similar to that of the :mod:`bz2`
module. However, note that :class:`LZMAFile` is *not* thread-safe, unlike
:class:`bz2.BZ2File`, so if you need to use a single :class:`LZMAFile` instance
from multiple threads, it is necessary to protect it with a lock.


.. exception:: LZMAError

   This exception is raised when an error occurs during compression or
   decompression, or while initializing the compressor/decompressor state.


Reading and writing compressed files
------------------------------------

.. function:: open(filename, mode="rb", \*, format=None, check=-1, preset=None, filters=None, encoding=None, errors=None, newline=None)

   Open an LZMA-compressed file in binary or text mode, returning a :term:`file
   object`.

   The *filename* argument can be either an actual file name (given as a
   :class:`str` or :class:`bytes` object), in which case the named file is
   opened, or it can be an existing file object to read from or write to.

   The *mode* argument can be any of ``"r"``, ``"rb"``, ``"w"``, ``"wb"``,
   ``"a"`` or ``"ab"`` for binary mode, or ``"rt"``, ``"wt"``, or ``"at"`` for
   text mode. The default is ``"rb"``.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   For binary mode, this function is equivalent to the :class:`LZMAFile`
   constructor: ``LZMAFile(filename, mode, ...)``. In this case, the *encoding*,
   *errors* and *newline* arguments must not be provided.

   For text mode, a :class:`LZMAFile` object is created, and wrapped in an
   :class:`io.TextIOWrapper` instance with the specified encoding, error
   handling behavior, and line ending(s).

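   As a minimal sketch of text mode, the following writes and reads back a
   small file. The file name and the use of a temporary directory are
   placeholders chosen for illustration::

      import lzma
      import os
      import tempfile

      # Hypothetical file name inside a throwaway directory.
      path = os.path.join(tempfile.mkdtemp(), "notes.txt.xz")

      # "wt" wraps the LZMAFile in an io.TextIOWrapper, so str is written.
      with lzma.open(path, "wt", encoding="utf-8", newline="\n") as f:
          f.write("first line\n")
          f.write("second line\n")

      # "rt" decodes the decompressed bytes back into str.
      with lzma.open(path, "rt", encoding="utf-8") as f:
          lines = f.read().splitlines()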

.. class:: LZMAFile(filename=None, mode="r", \*, format=None, check=-1, preset=None, filters=None)

   Open an LZMA-compressed file in binary mode.

   An :class:`LZMAFile` can wrap an already-open :term:`file object`, or operate
   directly on a named file. The *filename* argument specifies either the file
   object to wrap, or the name of the file to open (as a :class:`str` or
   :class:`bytes` object). When wrapping an existing file object, the wrapped
   file will not be closed when the :class:`LZMAFile` is closed.

   The *mode* argument can be either ``"r"`` for reading (default), ``"w"`` for
   overwriting, or ``"a"`` for appending. These can equivalently be given as
   ``"rb"``, ``"wb"``, and ``"ab"`` respectively.

   If *filename* is a file object (rather than an actual file name), a mode of
   ``"w"`` does not truncate the file, and is instead equivalent to ``"a"``.

   When opening a file for reading, the input file may be the concatenation of
   multiple separate compressed streams. These are transparently decoded as a
   single logical stream.

   When opening a file for reading, the *format* and *filters* arguments have
   the same meanings as for :class:`LZMADecompressor`. In this case, the *check*
   and *preset* arguments should not be used.

   When opening a file for writing, the *format*, *check*, *preset* and
   *filters* arguments have the same meanings as for :class:`LZMACompressor`.

   :class:`LZMAFile` supports all the members specified by
   :class:`io.BufferedIOBase`, except for :meth:`detach` and :meth:`truncate`.
   Iteration and the :keyword:`with` statement are supported.

   The following method is also provided:

   .. method:: peek(size=-1)

      Return buffered data without advancing the file position. At least one
      byte of data will be returned, unless EOF has been reached. The exact
      number of bytes returned is unspecified (the *size* argument is ignored).

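   The multi-stream behavior and :meth:`peek` can be illustrated with an
   in-memory file object. This is only a sketch; the use of
   :class:`io.BytesIO` in place of a real file is an assumption made to keep
   it self-contained::

      import io
      import lzma

      # Two independently compressed streams, concatenated.
      raw = lzma.compress(b"first stream\n") + lzma.compress(b"second stream\n")

      with lzma.LZMAFile(io.BytesIO(raw)) as f:
          head = f.peek()     # some buffered bytes; the position does not move
          content = f.read()  # both streams, decoded as one logical stream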

Compressing and decompressing data in memory
--------------------------------------------

.. class:: LZMACompressor(format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Create a compressor object, which can be used to compress data incrementally.

   For a more convenient way of compressing a single chunk of data, see
   :func:`compress`.

   The *format* argument specifies what container format should be used.
   Possible values are:

   * :const:`FORMAT_XZ`: The ``.xz`` container format.
     This is the default format.

   * :const:`FORMAT_ALONE`: The legacy ``.lzma`` container format.
     This format is more limited than ``.xz`` -- it does not support integrity
     checks or multiple filters.

   * :const:`FORMAT_RAW`: A raw data stream, not using any container format.
     This format specifier does not support integrity checks, and requires that
     you always specify a custom filter chain (for both compression and
     decompression). Additionally, data compressed in this manner cannot be
     decompressed using :const:`FORMAT_AUTO` (see :class:`LZMADecompressor`).

   The *check* argument specifies the type of integrity check to include in the
   compressed data. This check is used when decompressing, to ensure that the
   data has not been corrupted. Possible values are:

   * :const:`CHECK_NONE`: No integrity check.
     This is the default (and the only acceptable value) for
     :const:`FORMAT_ALONE` and :const:`FORMAT_RAW`.

   * :const:`CHECK_CRC32`: 32-bit Cyclic Redundancy Check.

   * :const:`CHECK_CRC64`: 64-bit Cyclic Redundancy Check.
     This is the default for :const:`FORMAT_XZ`.

   * :const:`CHECK_SHA256`: 256-bit Secure Hash Algorithm.

   If the specified check is not supported, an :class:`LZMAError` is raised.

   The compression settings can be specified either as a preset compression
   level (with the *preset* argument), or in detail as a custom filter chain
   (with the *filters* argument).

   The *preset* argument (if provided) should be an integer between ``0`` and
   ``9`` (inclusive), optionally OR-ed with the constant
   :const:`PRESET_EXTREME`. If neither *preset* nor *filters* are given, the
   default behavior is to use :const:`PRESET_DEFAULT` (preset level ``6``).
   Higher presets produce smaller output, but make the compression process
   slower.

   .. note::

      In addition to being more CPU-intensive, compression with higher presets
      also requires much more memory (and produces output that needs more memory
      to decompress). With preset ``9`` for example, the overhead for an
      :class:`LZMACompressor` object can be as high as 800 MiB. For this reason,
      it is generally best to stick with the default preset.

   The *filters* argument (if provided) should be a filter chain specifier.
   See :ref:`filter-chain-specs` for details.

   .. method:: compress(data)

      Compress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing compressed data for at least part of the input. Some of
      *data* may be buffered internally, for use in later calls to
      :meth:`compress` and :meth:`flush`. The returned data should be
      concatenated with the output of any previous calls to :meth:`compress`.

   .. method:: flush()

      Finish the compression process, returning a :class:`bytes` object
      containing any data stored in the compressor's internal buffers.

      The compressor cannot be used after this method has been called.

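   One possible way to combine the *format*, *check* and *preset* arguments
   with :meth:`compress` and :meth:`flush` is sketched below. The helper name
   ``compress_chunks`` and its default arguments are hypothetical and not part
   of this module::

      import lzma

      def compress_chunks(chunks, check=lzma.CHECK_CRC32, preset=1):
          """Compress an iterable of bytes objects into a single .xz stream."""
          lzc = lzma.LZMACompressor(format=lzma.FORMAT_XZ,
                                    check=check, preset=preset)
          # Each call may return b"" while data is buffered internally.
          parts = [lzc.compress(chunk) for chunk in chunks]
          # flush() emits whatever is still buffered and ends the stream.
          parts.append(lzc.flush())
          return b"".join(parts)

      compressed = compress_chunks([b"spam " * 100, b"eggs " * 100])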

.. class:: LZMADecompressor(format=FORMAT_AUTO, memlimit=None, filters=None)

   Create a decompressor object, which can be used to decompress data
   incrementally.

   For a more convenient way of decompressing an entire compressed stream at
   once, see :func:`decompress`.

   The *format* argument specifies the container format that should be used. The
   default is :const:`FORMAT_AUTO`, which can decompress both ``.xz`` and
   ``.lzma`` files. Other possible values are :const:`FORMAT_XZ`,
   :const:`FORMAT_ALONE`, and :const:`FORMAT_RAW`.

   The *memlimit* argument specifies a limit (in bytes) on the amount of memory
   that the decompressor can use. When this argument is used, decompression will
   fail with an :class:`LZMAError` if it is not possible to decompress the input
   within the given memory limit.

   The *filters* argument specifies the filter chain that was used to create
   the stream being decompressed. This argument is required if *format* is
   :const:`FORMAT_RAW`, but should not be used for other formats.
   See :ref:`filter-chain-specs` for more information about filter chains.

   .. note::
      This class does not transparently handle inputs containing multiple
      compressed streams, unlike :func:`decompress` and :class:`LZMAFile`. To
      decompress a multi-stream input with :class:`LZMADecompressor`, you must
      create a new decompressor for each stream.

   .. method:: decompress(data)

      Decompress *data* (a :class:`bytes` object), returning a :class:`bytes`
      object containing the decompressed data for at least part of the input.
      Some of *data* may be buffered internally, for use in later calls to
      :meth:`decompress`. The returned data should be concatenated with the
      output of any previous calls to :meth:`decompress`.

   .. attribute:: check

      The ID of the integrity check used by the input stream. This may be
      :const:`CHECK_UNKNOWN` until enough of the input has been decoded to
      determine what integrity check it uses.

   .. attribute:: eof

      True if the end-of-stream marker has been reached.

   .. attribute:: unused_data

      Data found after the end of the compressed stream.

      Before the end of the stream is reached, this will be ``b""``.

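   The note above about multi-stream input could be handled along the
   following lines, using :attr:`eof` and :attr:`unused_data` to move from one
   stream to the next. The helper name ``decompress_all`` is hypothetical, and
   the sketch assumes the streams are concatenated with no padding between
   them::

      import lzma

      def decompress_all(data):
          """Decompress every stream in *data*, one decompressor per stream."""
          results = []
          while data:
              lzd = lzma.LZMADecompressor()
              results.append(lzd.decompress(data))
              if not lzd.eof:
                  raise lzma.LZMAError("input ended before the end of a stream")
              # Whatever follows the finished stream starts the next one.
              data = lzd.unused_data
          return b"".join(results)

      two_streams = lzma.compress(b"alpha") + lzma.compress(b"beta")
      combined = decompress_all(two_streams)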

.. function:: compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None)

   Compress *data* (a :class:`bytes` object), returning the compressed data as a
   :class:`bytes` object.

   See :class:`LZMACompressor` above for a description of the *format*, *check*,
   *preset* and *filters* arguments.


.. function:: decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)

   Decompress *data* (a :class:`bytes` object), returning the uncompressed data
   as a :class:`bytes` object.

   If *data* is the concatenation of multiple distinct compressed streams,
   decompress all of these streams, and return the concatenation of the results.

   See :class:`LZMADecompressor` above for a description of the *format*,
   *memlimit* and *filters* arguments.

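   For instance, two separately compressed streams can be decompressed in a
   single call. The 64 MiB memory limit below is an arbitrary value chosen for
   illustration::

      import lzma

      streams = lzma.compress(b"Hello, ") + lzma.compress(b"world!")

      # Both streams are decoded and their contents joined.
      text = lzma.decompress(streams, memlimit=64 * 1024 * 1024)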

Miscellaneous
-------------

.. function:: is_check_supported(check)

   Returns true if the given integrity check is supported on this system.

   :const:`CHECK_NONE` and :const:`CHECK_CRC32` are always supported.
   :const:`CHECK_CRC64` and :const:`CHECK_SHA256` may be unavailable if you are
   using a version of :program:`liblzma` that was compiled with a limited
   feature set.

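   A caller that prefers a particular check but needs to run on limited builds
   might fall back as sketched here. The helper name ``pick_check`` and the
   preference order are assumptions made for illustration::

      import lzma

      def pick_check():
          """Return the first supported check from a preferred order."""
          for check in (lzma.CHECK_SHA256, lzma.CHECK_CRC64, lzma.CHECK_CRC32):
              if lzma.is_check_supported(check):
                  return check
          return lzma.CHECK_NONE

      check = pick_check()
      data = lzma.compress(b"payload", check=check)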

.. _filter-chain-specs:

Specifying custom filter chains
-------------------------------

A filter chain specifier is a sequence of dictionaries, where each dictionary
contains the ID and options for a single filter. Each dictionary must contain
the key ``"id"``, and may contain additional keys to specify filter-dependent
options. Valid filter IDs are as follows:

* Compression filters:

  * :const:`FILTER_LZMA1` (for use with :const:`FORMAT_ALONE`)
  * :const:`FILTER_LZMA2` (for use with :const:`FORMAT_XZ` and :const:`FORMAT_RAW`)

* Delta filter:

  * :const:`FILTER_DELTA`

* Branch-Call-Jump (BCJ) filters:

  * :const:`FILTER_X86`
  * :const:`FILTER_IA64`
  * :const:`FILTER_ARM`
  * :const:`FILTER_ARMTHUMB`
  * :const:`FILTER_POWERPC`
  * :const:`FILTER_SPARC`

A filter chain can consist of up to 4 filters, and cannot be empty. The last
filter in the chain must be a compression filter, and any other filters must be
delta or BCJ filters.

Compression filters support the following options (specified as additional
entries in the dictionary representing the filter):

   * ``preset``: A compression preset to use as a source of default values for
     options that are not specified explicitly.
   * ``dict_size``: Dictionary size in bytes. This should be between 4 KiB and
     1.5 GiB (inclusive).
   * ``lc``: Number of literal context bits.
   * ``lp``: Number of literal position bits. The sum ``lc + lp`` must be at
     most 4.
   * ``pb``: Number of position bits; must be at most 4.
   * ``mode``: :const:`MODE_FAST` or :const:`MODE_NORMAL`.
   * ``nice_len``: What should be considered a "nice length" for a match.
     This should be 273 or less.
   * ``mf``: What match finder to use -- :const:`MF_HC3`, :const:`MF_HC4`,
     :const:`MF_BT2`, :const:`MF_BT3`, or :const:`MF_BT4`.
   * ``depth``: Maximum search depth used by match finder. 0 (default) means to
     select automatically based on other filter options.

The delta filter stores the differences between bytes, producing more repetitive
input for the compressor in certain circumstances. It supports only one option,
``dist``, which indicates the distance between bytes to be subtracted. The
default is 1, i.e. take the differences between adjacent bytes.

The BCJ filters are intended to be applied to machine code. They convert
relative branches, calls and jumps in the code to use absolute addressing, with
the aim of increasing the redundancy that can be exploited by the compressor.
These filters support one option, ``start_offset``. This specifies the address
that should be mapped to the beginning of the input data. The default is 0.

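As a rough sketch of how a filter chain is used with :const:`FORMAT_RAW`, the
same chain is passed to both :func:`compress` and :func:`decompress`. The
particular chain and the sample data are assumptions chosen for illustration::

    import lzma

    # Hypothetical chain: a delta filter followed by LZMA2 as the last filter.
    filters = [
        {"id": lzma.FILTER_DELTA, "dist": 4},
        {"id": lzma.FILTER_LZMA2, "preset": 6},
    ]

    # 4-byte little-endian records whose values change slowly.
    original = b"".join((1000 + i).to_bytes(4, "little") for i in range(256))

    # A raw stream records no filter information, so the chain is given twice.
    packed = lzma.compress(original, format=lzma.FORMAT_RAW, filters=filters)
    restored = lzma.decompress(packed, format=lzma.FORMAT_RAW, filters=filters)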

Examples
--------

Reading in a compressed file::

    import lzma
    with lzma.open("file.xz") as f:
        file_content = f.read()

Creating a compressed file::

    import lzma
    data = b"Insert Data Here"
    with lzma.open("file.xz", "w") as f:
        f.write(data)

Compressing data in memory::

    import lzma
    data_in = b"Insert Data Here"
    data_out = lzma.compress(data_in)

Incremental compression::

    import lzma
    lzc = lzma.LZMACompressor()
    out1 = lzc.compress(b"Some data\n")
    out2 = lzc.compress(b"Another piece of data\n")
    out3 = lzc.compress(b"Even more data\n")
    out4 = lzc.flush()
    # Concatenate all the partial results:
    result = b"".join([out1, out2, out3, out4])

Writing compressed data to an already-open file::

    import lzma
    with open("file.xz", "wb") as f:
        f.write(b"This data will not be compressed\n")
        with lzma.open(f, "w") as lzf:
            lzf.write(b"This *will* be compressed\n")
        f.write(b"Not compressed\n")

Creating a compressed file using a custom filter chain::

    import lzma
    my_filters = [
        {"id": lzma.FILTER_DELTA, "dist": 5},
        {"id": lzma.FILTER_LZMA2, "preset": 7 | lzma.PRESET_EXTREME},
    ]
    with lzma.open("file.xz", "w", filters=my_filters) as f:
        f.write(b"blah blah blah")