'''"Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here. Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

import codecs
import pickle
import re
import sys

__all__ = ['dis', 'genops', 'optimize']

bytes_types = pickle.bytes_types

# Other ideas:
#
# - A pickle verifier: read a pickle and check it exhaustively for
#   well-formedness. dis() does a lot of this already.
#
# - A protocol identifier: examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#   dis() already prints this info at the end.
#
# - A pickle optimizer: for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive. Or lots of times a PUT is generated that's never accessed
#   by a later GET.


"""
"A pickle" is a program for a virtual pickle machine (PM, but more accurately
called an unpickling machine). It's a sequence of opcodes, interpreted by the
PM, building an arbitrarily complex Python object.

For the most part, the PM is very simple: there are no looping, testing, or
conditional instructions, no arithmetic and no function calls. Opcodes are
executed once each, from first to last, until a STOP opcode is reached.

The PM has two data areas, "the stack" and "the memo".

Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
integer object on the stack, whose value is taken from a decimal string
literal immediately following the INT opcode in the pickle bytestream. Other
opcodes take Python objects off the stack. The result of unpickling is
whatever object is left on the stack when the final STOP opcode is executed.

The memo is simply an array of objects, or it can be implemented as a dict
mapping little integers to objects. The memo serves as the PM's "long term
memory", and the little integers indexing the memo are akin to variable
names. Some opcodes copy the object on top of the stack into the memo at a
given index (without popping it), and others push a memo object at a given
index onto the stack again.
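
As a small illustration of the stack and the memo (a sketch only; which
opcodes appear depends on the protocol and on the Python version, and the
variable names are made up for this example), pickling a list that holds
the same inner list twice stores the inner list in the memo once and
fetches it back for the second reference:

```python
import pickle
import pickletools

# The outer list refers to the same inner list twice.
inner = [1, 2]
data = pickle.dumps([inner, inner], protocol=2)

# genops() yields (opcode, arg, position) triples; keep the opcode names.
names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]

# Under protocol 2, BINPUT copies the stack top into the memo and BINGET
# pushes a memo entry back onto the stack. STOP always comes last.
print(names)
```

pickletools.dis(data) prints the same opcodes with their arguments and
stream positions.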

At heart, that's all the PM has. Subtleties arise for these reasons:

+ Object identity. Objects can be arbitrarily complex, and subobjects
  may be shared (for example, the list [a, a] refers to the same object a
  twice). It can be vital that unpickling recreate an isomorphic object
  graph, faithfully reproducing sharing.

+ Recursive objects. For example, after "L = []; L.append(L)", L is a
  list, and L[0] is the same list. This is related to the object identity
  point, and some sequences of pickle opcodes are subtle in order to
  get the right result in all cases.

+ Things pickle doesn't know everything about. Examples of things pickle
  does know everything about are Python's builtin scalar and container
  types, like ints and tuples. They generally have opcodes dedicated to
  them. For things like module references and instances of user-defined
  classes, pickle's knowledge is limited. Historically, many enhancements
  have been made to the pickle protocol in order to do a better (faster,
  and/or more compact) job on those.

+ Backward compatibility and micro-optimization. As explained below,
  pickle opcodes never go away, not even when better ways to do a thing
  get invented. The repertoire of the PM just keeps growing over time.
  For example, protocol 0 had two opcodes for building Python integers (INT
  and LONG), protocol 1 added three more for more-efficient pickling of short
  integers, and protocol 2 added two more for more-efficient pickling of
  long integers (before protocol 2, the only ways to pickle a Python long
  took time quadratic in the number of digits, for both pickling and
  unpickling). "Opcode bloat" isn't so much a subtlety as a source of
  wearying complication.
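
The first two points can be checked directly. The following is a minimal
sketch using only pickle.dumps() and pickle.loads(); the variable names are
chosen for this example:

```python
import pickle

# Object identity: both slots of the unpickled list are one object.
a = [1, 2, 3]
shared = pickle.loads(pickle.dumps([a, a]))
assert shared[0] is shared[1]

# Recursive objects: the unpickled list contains itself.
L = []
L.append(L)
recursive = pickle.loads(pickle.dumps(L))
assert recursive[0] is recursive
```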


Pickle protocols:

For compatibility, the meaning of a pickle opcode never changes. Instead new
pickle opcodes get added, and each version's unpickler can handle all the
pickle opcodes in all protocol versions to date. So old pickles continue to
be readable forever. The pickler can generally be told to restrict itself to
the subset of opcodes available under previous protocol versions too, so that
users can create pickles under the current version readable by older
versions. However, a protocol 0 or 1 pickle does not contain its version
number embedded within it. If an older unpickler tries to read a pickle using
a later protocol, the result is most likely an exception due to seeing an
unknown (in the older unpickler) opcode.

The original pickle used what's now called "protocol 0", and what was called
"text mode" before Python 2.3. The entire pickle bytestream is made up of
printable 7-bit ASCII characters, plus the newline character, in protocol 0.
That's why it was called text mode. Protocol 0 is small and elegant, but
sometimes painfully inefficient.

The second major set of additions is now called "protocol 1", and was called
"binary mode" before Python 2.3. This added many opcodes with arguments
consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
bytes. Binary mode pickles can be substantially smaller than equivalent
text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
int as 4 bytes following the opcode, which is cheaper to unpickle than the
(perhaps) 11-character decimal string attached to INT. Protocol 1 also added
a number of opcodes that operate on many stack elements at once (like APPENDS
and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
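
The size difference is easy to observe. This sketch pickles one large 4-byte
int under both protocols. The exact bytes produced may vary between Python
versions, so only the properties described above are checked:

```python
import pickle

value = 2**31 - 1
text = pickle.dumps(value, protocol=0)    # decimal string after the opcode
binary = pickle.dumps(value, protocol=1)  # raw bytes after the opcode

# A protocol 0 pickle consists solely of 7-bit ASCII characters.
assert all(byte < 128 for byte in text)

# The binary form of the same int is shorter than the text form.
assert len(binary) < len(text)
```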

The third major set of additions came in Python 2.3, and is called "protocol
2". This added:

- A better way to pickle instances of new-style classes (NEWOBJ).

- A way for a pickle to identify its protocol (PROTO).

- Time- and space-efficient pickling of long ints (LONG{1,4}).

- Shortcuts for small tuples (TUPLE{1,2,3}).

- Dedicated opcodes for bools (NEWTRUE, NEWFALSE).

- The "extension registry", a vector of popular objects that can be pushed
  efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
  the registry contents are predefined (there's nothing akin to the memo's
  PUT).
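
Several of these opcodes show up in even the smallest protocol 2 pickles.
The sketch below assumes only the documented pickletools.genops() interface;
the helper name opcode_names is made up for this example:

```python
import pickle
import pickletools

def opcode_names(obj):
    # Hypothetical helper: names of the opcodes in a protocol 2 pickle.
    data = pickle.dumps(obj, protocol=2)
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]

# A bool needs only PROTO, the dedicated bool opcode, and STOP.
true_names = opcode_names(True)
assert true_names == ['PROTO', 'NEWTRUE', 'STOP']

# A two-element tuple uses the TUPLE2 shortcut.
pair_names = opcode_names((1, 2))
assert 'TUPLE2' in pair_names
```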

Another independent change with Python 2.3 is the abandonment of any
pretense that it might be safe to load pickles received from untrusted
parties -- no sufficient security analysis has been done to guarantee
this and there isn't a use case that warrants the expense of such an
analysis.

To this end, all tests for __safe_for_unpickling__ or for
copyreg.safe_constructors are removed from the unpickling code.
References to these variables in the descriptions below are to be seen
as describing unpickling in Python 2.2 and before.
"""

# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
# descriptors could be constructed. If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
# of ArgumentDescriptor. These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1  = -2   # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4  = -3   # num bytes is 4-byte signed little-endian int
TAKEN_FROM_ARGUMENT4U = -4   # num bytes is 4-byte unsigned little-endian int

class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4,4U} are negative values for variable-length
        # cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(n, int) and (n >= 0 or
                                       n in (UP_TO_NEWLINE,
                                             TAKEN_FROM_ARGUMENT1,
                                             TAKEN_FROM_ARGUMENT4,
                                             TAKEN_FROM_ARGUMENT4U))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc

from struct import unpack as _unpack

def read_uint1(f):
    r"""
    >>> import io
    >>> read_uint1(io.BytesIO(b'\xff'))
    255
    """

    data = f.read(1)
    if data:
        return data[0]
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
    name='uint1',
    n=1,
    reader=read_uint1,
    doc="One-byte unsigned integer.")


def read_uint2(f):
    r"""
    >>> import io
    >>> read_uint2(io.BytesIO(b'\xff\x00'))
    255
    >>> read_uint2(io.BytesIO(b'\xff\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
    name='uint2',
    n=2,
    reader=read_uint2,
    doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    r"""
    >>> import io
    >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
    name='int4',
    n=4,
    reader=read_int4,
    doc="Four-byte signed integer, little-endian, 2's complement.")


def read_uint4(f):
    r"""
    >>> import io
    >>> read_uint4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_uint4(io.BytesIO(b'\x00\x00\x00\x80')) == 2**31
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<I", data)[0]
    raise ValueError("not enough data in stream to read uint4")

uint4 = ArgumentDescriptor(
    name='uint4',
    n=4,
    reader=read_uint4,
    doc="Four-byte unsigned integer, little-endian.")


def read_stringnl(f, decode=True, stripquotes=True):
    r"""
    >>> import io
    >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n"))
    'abcd'

    >>> read_stringnl(io.BytesIO(b"\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around b''

    >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False)
    ''

    >>> read_stringnl(io.BytesIO(b"''\n"))
    ''

    >>> read_stringnl(io.BytesIO(b'"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'"))
    'a\n\\b\x00c\td'
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        for q in (b'"', b"'"):
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    if decode:
        data = codecs.escape_decode(data)[0].decode("ascii")
    return data

stringnl = ArgumentDescriptor(
    name='stringnl',
    n=UP_TO_NEWLINE,
    reader=read_stringnl,
    doc="""A newline-terminated string.

    This is a repr-style string, with embedded escapes, and
    bracketing quotes.
    """)

def read_stringnl_noescape(f):
    return read_stringnl(f, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
    name='stringnl_noescape',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape,
    doc="""A newline-terminated string.

    This is a str-style string, without embedded escapes,
    or bracketing quotes. It should consist solely of
    printable ASCII characters.
    """)

def read_stringnl_noescape_pair(f):
    r"""
    >>> import io
    >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
    name='stringnl_noescape_pair',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape_pair,
    doc="""A pair of newline-terminated strings.

    These are str-style strings, without embedded
    escapes, or bracketing quotes. They should
    consist solely of printable ASCII characters.
    The pair is returned as a single string, with
    a single blank separating the two strings.
    """)

def read_string4(f):
    r"""
    >>> import io
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    ''
    >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    'abc'
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
    name="string4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_string4,
    doc="""A counted string.

    The first argument is a 4-byte little-endian signed int giving
    the number of bytes in the string, and the second argument is
    that many bytes.
    """)


def read_string1(f):
    r"""
    >>> import io
    >>> read_string1(io.BytesIO(b"\x00"))
    ''
    >>> read_string1(io.BytesIO(b"\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
    name="string1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_string1,
    doc="""A counted string.

    The first argument is a 1-byte unsigned int giving the number
    of bytes in the string, and the second argument is that many
    bytes.
    """)


def read_bytes1(f):
    r"""
    >>> import io
    >>> read_bytes1(io.BytesIO(b"\x00"))
    b''
    >>> read_bytes1(io.BytesIO(b"\x03abcdef"))
    b'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes1, but only %d remain" %
                     (n, len(data)))

bytes1 = ArgumentDescriptor(
    name="bytes1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_bytes1,
    doc="""A counted bytes string.

    The first argument is a 1-byte unsigned int giving the number
    of bytes, and the second argument is that many bytes.
    """)


def read_bytes4(f):
    r"""
    >>> import io
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    b''
    >>> read_bytes4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    b'abc'
    >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a bytes4, but only 6 remain
    """

    n = read_uint4(f)
    if n > sys.maxsize:
        raise ValueError("bytes4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data
    raise ValueError("expected %d bytes in a bytes4, but only %d remain" %
                     (n, len(data)))

bytes4 = ArgumentDescriptor(
    name="bytes4",
    n=TAKEN_FROM_ARGUMENT4U,
    reader=read_bytes4,
    doc="""A counted bytes string.

    The first argument is a 4-byte little-endian unsigned int giving
    the number of bytes, and the second argument is that many bytes.
    """)


def read_unicodestringnl(f):
    r"""
    >>> import io
    >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd'
    True
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return str(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
    name='unicodestringnl',
    n=UP_TO_NEWLINE,
    reader=read_unicodestringnl,
    doc="""A newline-terminated Unicode string.

    This is raw-unicode-escape encoded, so consists of
    printable ASCII characters, and may contain embedded
    escape sequences.
    """)

def read_unicodestring4(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc), 0, 0, 0])  # little-endian 4-byte length
    >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_uint4(f)
    if n > sys.maxsize:
        raise ValueError("unicodestring4 byte count > sys.maxsize: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
    name="unicodestring4",
    n=TAKEN_FROM_ARGUMENT4U,
    reader=read_unicodestring4,
    doc="""A counted Unicode string.

    The first argument is a 4-byte little-endian unsigned int
    giving the number of bytes in the string, and the second
    argument -- the UTF-8 encoding of the Unicode string --
    contains that many bytes.
    """)


def read_decimalnl_short(f):
    r"""
    >>> import io
    >>> read_decimalnl_short(io.BytesIO(b"1234\n56"))
    1234

    >>> read_decimalnl_short(io.BytesIO(b"1234L\n56"))
    Traceback (most recent call last):
    ...
    ValueError: invalid literal for int() with base 10: b'1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)

    # There's a hack for True and False here.
    if s == b"00":
        return False
    elif s == b"01":
        return True

    return int(s)

def read_decimalnl_long(f):
    r"""
    >>> import io

    >>> read_decimalnl_long(io.BytesIO(b"1234L\n56"))
    1234

    >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6"))
    123456789012345678901234
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s[-1:] == b'L':
        s = s[:-1]
    return int(s)


decimalnl_short = ArgumentDescriptor(
    name='decimalnl_short',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_short,
    doc="""A newline-terminated decimal integer literal.

    This never has a trailing 'L', and the integer fit
    in a short Python int on the box where the pickle
    was written -- but there's no guarantee it will fit
    in a short Python int on the box where the pickle
    is read.
    """)

decimalnl_long = ArgumentDescriptor(
    name='decimalnl_long',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_long,
    doc="""A newline-terminated decimal integer literal.

    This has a trailing 'L', and can represent integers
    of any size.
    """)


def read_floatnl(f):
    r"""
    >>> import io
    >>> read_floatnl(io.BytesIO(b"-1.25\n6"))
    -1.25
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    return float(s)

floatnl = ArgumentDescriptor(
    name='floatnl',
    n=UP_TO_NEWLINE,
    reader=read_floatnl,
    doc="""A newline-terminated decimal floating literal.

    In general this requires 17 significant digits for roundtrip
    identity, and pickling then unpickling infinities, NaNs, and
    minus zero doesn't work across boxes, or on some boxes even
    on itself (e.g., Windows can't read the strings it produces
    for infinities or NaNs).
    """)

def read_float8(f):
    r"""
    >>> import io, struct
    >>> raw = struct.pack(">d", -1.25)
    >>> raw
    b'\xbf\xf4\x00\x00\x00\x00\x00\x00'
    >>> read_float8(io.BytesIO(raw + b"\n"))
    -1.25
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")


float8 = ArgumentDescriptor(
    name='float8',
    n=8,
    reader=read_float8,
    doc="""An 8-byte binary representation of a float, big-endian.

    The format is unique to Python, and shared with the struct
    module (format string '>d') "in theory" (the struct and pickle
    implementations don't share the code -- they should). It's
    strongly related to the IEEE-754 double format, and, in normal
    cases, is in fact identical to the big-endian 754 double format.
    On other boxes the dynamic range is limited to that of a 754
    double, and "add a half and chop" rounding is used to reduce
    the precision to 53 bits. However, even on a 754 box,
    infinities, NaNs, and minus zero may not be handled correctly
    (may not survive roundtrip pickling intact).
    """)
692
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000693# Protocol 2 formats
694
Tim Petersc0c12b52003-01-29 00:56:17 +0000695from pickle import decode_long
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000696
697def read_long1(f):
698 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000699 >>> import io
700 >>> read_long1(io.BytesIO(b"\x00"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000701 0
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000702 >>> read_long1(io.BytesIO(b"\x02\xff\x00"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000703 255
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000704 >>> read_long1(io.BytesIO(b"\x02\xff\x7f"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000705 32767
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000706 >>> read_long1(io.BytesIO(b"\x02\x00\xff"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000707 -256
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000708 >>> read_long1(io.BytesIO(b"\x02\x00\x80"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000709 -32768
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000710 """
711
712 n = read_uint1(f)
713 data = f.read(n)
714 if len(data) != n:
715 raise ValueError("not enough data in stream to read long1")
716 return decode_long(data)
717
long1 = ArgumentDescriptor(
    name="long1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_long1,
    doc="""A binary long, little-endian, using 1-byte size.

    This first reads one byte as an unsigned size, then reads that
    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the int 0.
    """)

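# A minimal sketch of the long1 wire format described above: one unsigned
# size byte followed by that many little-endian two's-complement bytes.
# It assumes pickle.encode_long() as the counterpart of decode_long(); the
# helper names frame_long1 and parse_long1 are hypothetical and exist only
# for this illustration.

```python
import io
import pickle


def frame_long1(value):
    """Return value encoded as a long1 argument (size byte + payload)."""
    payload = pickle.encode_long(value)  # empty bytes for 0
    if len(payload) > 255:
        raise ValueError("value needs more than 255 bytes; use long4")
    return bytes([len(payload)]) + payload


def parse_long1(stream):
    """Mirror of read_long1: read the size byte, then decode the payload."""
    size = stream.read(1)[0]
    payload = stream.read(size)
    if len(payload) != size:
        raise ValueError("not enough data in stream to read long1")
    return pickle.decode_long(payload)


# Round-trip the same values used in the read_long1 doctests.
samples = [0, 255, 32767, -256, -32768]
roundtrip = [parse_long1(io.BytesIO(frame_long1(v))) for v in samples]
```

# The framed bytes match the inputs of the doctests in read_long1, so the
# two can be compared side by side.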
def read_long4(f):
    r"""
    >>> import io
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00"))
    255
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f"))
    32767
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff"))
    -256
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80"))
    -32768
    >>> read_long4(io.BytesIO(b"\x00\x00\x00\x00"))
    0
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("long4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long4")
    return decode_long(data)

long4 = ArgumentDescriptor(
    name="long4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_long4,
    doc="""A binary representation of a long, little-endian.

    This first reads four bytes as a signed size (but requires the
    size to be >= 0), then reads that many bytes and interprets them
    as a little-endian 2's-complement long. If the size is 0, that's taken
    as a shortcut for the int 0, although LONG1 should really be used
    then instead (and in any case where # of bytes < 256).
    """)


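# A short sketch of how the long1 and long4 argument formats show up in real
# pickles. It assumes the standard pickle and pickletools modules; the
# variable names are illustrative only. Protocol 2 is requested explicitly
# so that no framing opcodes appear in the output.

```python
import pickle
import pickletools

# An int too large for the four-byte BININT opcode, but under 256 bytes.
small = pickle.dumps(1 << 64, protocol=2)
# An int whose encoding needs well over 255 bytes.
large = pickle.dumps(1 << 4000, protocol=2)

small_ops = [op.name for op, arg, pos in pickletools.genops(small)]
large_ops = [op.name for op, arg, pos in pickletools.genops(large)]

# Hand-built pickles: opcode, size field, payload, then STOP (b".").
from_long1 = pickle.loads(b"\x8a\x02\xff\x7f.")
from_long4 = pickle.loads(b"\x8b\x02\x00\x00\x00\xff\x7f.")
```

# Both hand-built pickles carry the same two payload bytes; only the width of
# the size field differs.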
##############################################################################
# Object descriptors. The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
    __slots__ = (
        # name of descriptor record, for info only
        'name',

        # type of object, or tuple of type objects (meaning the object can
        # be of any type in the tuple)
        'obtype',

        # human-readable docs for this kind of stack object; a string
        'doc',
    )

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)
        self.obtype = obtype

        assert isinstance(doc, str)
        self.doc = doc

    def __repr__(self):
        return self.name


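# A small sketch of how a StackObject's obtype can be used to check a value.
# Because obtype is either a single type or a tuple of types, it can be
# passed straight to isinstance(). The helper name matches is hypothetical;
# the descriptors are looked up on the installed pickletools module.

```python
import pickletools


def matches(value, stack_object):
    """Return True if value is an instance of the descriptor's obtype."""
    # isinstance() accepts a type or a tuple of types, which is exactly
    # what StackObject.__init__ asserts about obtype.
    return isinstance(value, stack_object.obtype)


int_name = repr(pickletools.pyint)  # __repr__ returns the descriptor name
checks = (
    matches(3, pickletools.pyint),
    matches(True, pickletools.pyinteger_or_bool),
    matches("text", pickletools.pyint),
    matches(object(), pickletools.anyobject),
)
```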
pyint = StackObject(
    name='int',
    obtype=int,
    doc="A short (as opposed to long) Python integer object.")

pylong = StackObject(
    name='long',
    obtype=int,
    doc="A long (as opposed to short) Python integer object.")

pyinteger_or_bool = StackObject(
    name='int_or_bool',
    obtype=(int, bool),
    doc="A Python integer object (short or long), or "
        "a Python bool.")

pybool = StackObject(
    name='bool',
    obtype=(bool,),
    doc="A Python bool object.")

pyfloat = StackObject(
    name='float',
    obtype=float,
    doc="A Python float object.")

pystring = StackObject(
    name='string',
    obtype=bytes,
    doc="A Python (8-bit) string object.")

pybytes = StackObject(
    name='bytes',
    obtype=bytes,
    doc="A Python bytes object.")

pyunicode = StackObject(
    name='str',
    obtype=str,
    doc="A Python (Unicode) string object.")

pynone = StackObject(
    name="None",
    obtype=type(None),
    doc="The Python None object.")

pytuple = StackObject(
    name="tuple",
    obtype=tuple,
    doc="A Python tuple object.")

pylist = StackObject(
    name="list",
    obtype=list,
    doc="A Python list object.")

pydict = StackObject(
    name="dict",
    obtype=dict,
    doc="A Python dict object.")

anyobject = StackObject(
    name='any',
    obtype=object,
    doc="Any kind of object whatsoever.")

markobject = StackObject(
    name="mark",
    obtype=StackObject,
    doc="""'The mark' is a unique object.

    Opcodes that operate on a variable number of objects
    generally don't embed the count of objects in the opcode,
    or pull it off the stack. Instead the MARK opcode is used
    to push a special marker object on the stack, and then
    some other opcodes grab all the objects from the top of
    the stack down to (but not including) the topmost marker
    object.
    """)

stackslice = StackObject(
    name="stackslice",
    obtype=StackObject,
    doc="""An object representing a contiguous slice of the stack.

    This is used in conjunction with markobject, to represent all
    of the stack following the topmost markobject. For example,
    the POP_MARK opcode changes the stack from

        [..., markobject, stackslice]
    to
        [...]

    No matter how many objects are on the stack after the topmost
    markobject, POP_MARK gets rid of all of them (including the
    topmost markobject too).
    """)

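# A minimal sketch of markobject and stackslice in action, using two
# hand-written protocol 0 pickles and the standard pickle module. The byte
# strings are illustrative inputs, not output captured from a pickler.

```python
import pickle

# MARK, INT 1, INT 2, INT 3, LIST, STOP:
# LIST gathers everything above the topmost mark into one list.
built_list = pickle.loads(b"(I1\nI2\nI3\nl.")

# MARK, INT 1, INT 2, POP_MARK, INT 7, STOP:
# POP_MARK discards the mark and the slice above it, so only 7 remains.
after_pop_mark = pickle.loads(b"(I1\nI2\n1I7\n.")
```

# In both cases the number of objects in the slice is never written into the
# pickle; the mark alone delimits it.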
##############################################################################
# Descriptors for pickle opcodes.

class OpcodeInfo(object):

    __slots__ = (
        # symbolic name of opcode; a string
        'name',

        # the code used in a bytestream to represent the opcode; a
        # one-character string
        'code',

        # If the opcode has an argument embedded in the byte string, an
        # instance of ArgumentDescriptor specifying its type. Note that
        # arg.reader(s) can be used to read and decode the argument from
        # the bytestream s, and arg.doc documents the format of the raw
        # argument bytes. If the opcode doesn't have an argument embedded
        # in the bytestream, arg should be None.
        'arg',

        # what the stack looks like before this opcode runs; a list
        'stack_before',

        # what the stack looks like after this opcode runs; a list
        'stack_after',

        # the protocol number in which this opcode was introduced; an int
        'proto',

        # human-readable docs for this opcode; a string
        'doc',
    )

    def __init__(self, name, code, arg,
                 stack_before, stack_after, proto, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(code, str)
        assert len(code) == 1
        self.code = code

        assert arg is None or isinstance(arg, ArgumentDescriptor)
        self.arg = arg

        assert isinstance(stack_before, list)
        for x in stack_before:
            assert isinstance(x, StackObject)
        self.stack_before = stack_before

        assert isinstance(stack_after, list)
        for x in stack_after:
            assert isinstance(x, StackObject)
        self.stack_after = stack_after

        assert isinstance(proto, int) and 0 <= proto <= pickle.HIGHEST_PROTOCOL
        self.proto = proto

        assert isinstance(doc, str)
        self.doc = doc

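# A brief sketch of reading OpcodeInfo records from outside this module. It
# assumes the installed pickletools exposes the opcodes list defined below;
# the by_name mapping and the other variable names are illustrative only.

```python
import pickletools

# Index the opcode table by symbolic name.
by_name = {op.name: op for op in pickletools.opcodes}

binint1 = by_name["BININT1"]
summary = (
    binint1.code,                                # byte used in the stream
    binint1.arg.name,                            # ArgumentDescriptor name
    binint1.proto,                               # protocol that added it
    [repr(obj) for obj in binint1.stack_after],  # StackObject names
)

# Opcodes with no embedded argument have arg set to None.
argless_proto0 = sorted(op.name for op in pickletools.opcodes
                        if op.arg is None and op.proto == 0)
```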
I = OpcodeInfo
opcodes = [

    # Ways to spell integers.

    I(name='INT',
      code='I',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[pyinteger_or_bool],
      proto=0,
      doc="""Push an integer or bool.

      The argument is a newline-terminated decimal literal string.

      The intent may have been that this always fit in a short Python int,
      but INT can be generated in pickles written on a 64-bit box that
      require a Python long on a 32-bit box. The difference between this
      and LONG then is that INT skips a trailing 'L', and produces a short
      int whenever possible.

      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, builtin names True and False were also added to
      2.2.2, mapping to ints 1 and 0. For compatibility in both directions,
      True gets pickled as "I01\\n" (INT with the argument "01"), and False
      as "I00\\n" (INT with the argument "00"). Leading zeroes are never
      produced for a genuine integer. The 2.3 (and later) unpicklers
      special-case these and return bool instead; earlier unpicklers ignore
      the leading "0" and return the int.
      """),

    I(name='BININT',
      code='J',
      arg=int4,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a four-byte signed integer.

      This handles the full range of Python (short) integers on a 32-bit
      box, directly as binary bytes (1 for the opcode and 4 for the integer).
      If the integer is non-negative and fits in 1 or 2 bytes, pickling via
      BININT1 or BININT2 saves space.
      """),

    I(name='BININT1',
      code='K',
      arg=uint1,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a one-byte unsigned integer.

      This is a space optimization for pickling very small non-negative ints,
      in range(256).
      """),

    I(name='BININT2',
      code='M',
      arg=uint2,
      stack_before=[],
      stack_after=[pyint],
      proto=1,
      doc="""Push a two-byte unsigned integer.

      This is a space optimization for pickling small positive ints, in
      range(256, 2**16). Integers in range(256) can also be pickled via
      BININT2, but BININT1 instead saves a byte.
      """),

    I(name='LONG',
      code='L',
      arg=decimalnl_long,
      stack_before=[],
      stack_after=[pylong],
      proto=0,
      doc="""Push a long integer.

      The same as INT, except that the literal ends with 'L', and always
      unpickles to a Python long. There doesn't seem to be a real purpose to
      the trailing 'L'.

      Note that LONG takes time quadratic in the number of digits when
      unpickling (this is simply due to the nature of decimal->binary
      conversion). Proto 2 added linear-time (in C; still quadratic-time
      in Python) LONG1 and LONG4 opcodes.
      """),

    I(name="LONG1",
      code='\x8a',
      arg=long1,
      stack_before=[],
      stack_after=[pylong],
      proto=2,
      doc="""Long integer using one-byte length.

      A more efficient encoding of a Python long; the long1 encoding
      says it all."""),

    I(name="LONG4",
      code='\x8b',
      arg=long4,
      stack_before=[],
      stack_after=[pylong],
      proto=2,
      doc="""Long integer using four-byte length.

      A more efficient encoding of a Python long; the long4 encoding
      says it all."""),

    # Ways to spell strings (8-bit, not Unicode).

    I(name='STRING',
      code='S',
      arg=stringnl,
      stack_before=[],
      stack_after=[pystring],
      proto=0,
      doc="""Push a Python string object.

      The argument is a repr-style string, with bracketing quote characters,
      and perhaps embedded escapes. The argument extends until the next
      newline character. (Actually, the bytes are decoded into a str instance
      using the encoding given to the Unpickler constructor, or the default,
      'ASCII'.)
      """),

    I(name='BINSTRING',
      code='T',
      arg=string4,
      stack_before=[],
      stack_after=[pystring],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments: the first is a 4-byte little-endian signed int
      giving the number of bytes in the string, and the second is that many
      bytes, which are taken literally as the string content. (Actually,
      they are decoded into a str instance using the encoding given to the
      Unpickler constructor, or the default, 'ASCII'.)
      """),

    I(name='SHORT_BINSTRING',
      code='U',
      arg=string1,
      stack_before=[],
      stack_after=[pystring],
      proto=1,
      doc="""Push a Python string object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes in the string, and the second is that many bytes,
      which are taken literally as the string content. (Actually, they
      are decoded into a str instance using the encoding given to the
      Unpickler constructor, or the default, 'ASCII'.)
      """),

    # Bytes (protocol 3 only; older protocols don't support bytes at all)

    I(name='BINBYTES',
      code='B',
      arg=bytes4,
      stack_before=[],
      stack_after=[pybytes],
      proto=3,
      doc="""Push a Python bytes object.

      There are two arguments: the first is a 4-byte little-endian unsigned int
      giving the number of bytes, and the second is that many bytes, which are
      taken literally as the bytes content.
      """),

    I(name='SHORT_BINBYTES',
      code='C',
      arg=bytes1,
      stack_before=[],
      stack_after=[pybytes],
      proto=3,
      doc="""Push a Python bytes object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes, and the second is that many bytes, which are taken
      literally as the bytes content.
      """),

    # Ways to spell None.

    I(name='NONE',
      code='N',
      arg=None,
      stack_before=[],
      stack_after=[pynone],
      proto=0,
      doc="Push None on the stack."),

    # Ways to spell bools, starting with proto 2. See INT for how this was
    # done before proto 2.

    I(name='NEWTRUE',
      code='\x88',
      arg=None,
      stack_before=[],
      stack_after=[pybool],
      proto=2,
      doc="""True.

      Push True onto the stack."""),

    I(name='NEWFALSE',
      code='\x89',
      arg=None,
      stack_before=[],
      stack_after=[pybool],
      proto=2,
      doc="""False.

      Push False onto the stack."""),

    # Ways to spell Unicode strings.

    I(name='UNICODE',
      code='V',
      arg=unicodestringnl,
      stack_before=[],
      stack_after=[pyunicode],
      proto=0,  # this may be pure-text, but it's a later addition
      doc="""Push a Python Unicode string object.

      The argument is a raw-unicode-escape encoding of a Unicode string,
      and so may contain embedded escape sequences. The argument extends
      until the next newline character.
      """),

    I(name='BINUNICODE',
      code='X',
      arg=unicodestring4,
      stack_before=[],
      stack_after=[pyunicode],
      proto=1,
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is a 4-byte little-endian unsigned int
      giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
      """),

    # Ways to spell floats.

    I(name='FLOAT',
      code='F',
      arg=floatnl,
      stack_before=[],
      stack_after=[pyfloat],
      proto=0,
      doc="""Newline-terminated decimal float literal.

      The argument is repr(a_float), and in general requires 17 significant
      digits for roundtrip conversion to be an identity (this is so for
      IEEE-754 double precision values, which is what Python float maps to
      on most boxes).

      In general, FLOAT cannot be used to transport infinities, NaNs, or
      minus zero across boxes (or even on a single box, if the platform C
      library can't read the strings it produces for such things -- Windows
      is like that), but may do less damage than BINFLOAT on boxes with
      greater precision or dynamic range than IEEE-754 double.
      """),

    I(name='BINFLOAT',
      code='G',
      arg=float8,
      stack_before=[],
      stack_after=[pyfloat],
      proto=1,
      doc="""Float stored in binary form, with 8 bytes of data.

      This generally requires less than half the space of FLOAT encoding.
      In general, BINFLOAT cannot be used to transport infinities, NaNs, or
      minus zero, raises an exception if the exponent exceeds the range of
      an IEEE-754 double, and retains no more than 53 bits of precision (if
      there are more than that, "add a half and chop" rounding is used to
      cut it back to 53 significant bits).
      """),

    # Ways to build lists.

    I(name='EMPTY_LIST',
      code=']',
      arg=None,
      stack_before=[],
      stack_after=[pylist],
      proto=1,
      doc="Push an empty list."),

    I(name='APPEND',
      code='a',
      arg=None,
      stack_before=[pylist, anyobject],
      stack_after=[pylist],
      proto=0,
      doc="""Append an object to a list.

      Stack before: ... pylist anyobject
      Stack after:  ... pylist+[anyobject]

      although pylist is really extended in-place.
      """),

    I(name='APPENDS',
      code='e',
      arg=None,
      stack_before=[pylist, markobject, stackslice],
      stack_after=[pylist],
      proto=1,
      doc="""Extend a list by a slice of stack objects.

      Stack before: ... pylist markobject stackslice
      Stack after:  ... pylist+stackslice

      although pylist is really extended in-place.
      """),

    I(name='LIST',
      code='l',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pylist],
      proto=0,
      doc="""Build a list out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python list, which single list object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... [1, 2, 3, 'abc']
      """),

    # Ways to build tuples.

    I(name='EMPTY_TUPLE',
      code=')',
      arg=None,
      stack_before=[],
      stack_after=[pytuple],
      proto=1,
      doc="Push an empty tuple."),

    I(name='TUPLE',
      code='t',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pytuple],
      proto=0,
      doc="""Build a tuple out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python tuple, which single tuple object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... (1, 2, 3, 'abc')
      """),

    I(name='TUPLE1',
      code='\x85',
      arg=None,
      stack_before=[anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a one-tuple out of the topmost item on the stack.

      This code pops one value off the stack and pushes a tuple of
      length 1 whose one item is that value back onto it. In other
      words:

          stack[-1] = tuple(stack[-1:])
      """),

    I(name='TUPLE2',
      code='\x86',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a two-tuple out of the top two items on the stack.

      This code pops two values off the stack and pushes a tuple of
      length 2 whose items are those values back onto it. In other
      words:

          stack[-2:] = [tuple(stack[-2:])]
      """),

    I(name='TUPLE3',
      code='\x87',
      arg=None,
      stack_before=[anyobject, anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Build a three-tuple out of the top three items on the stack.

      This code pops three values off the stack and pushes a tuple of
      length 3 whose items are those values back onto it. In other
      words:

          stack[-3:] = [tuple(stack[-3:])]
      """),

    # Ways to build dicts.

    I(name='EMPTY_DICT',
      code='}',
      arg=None,
      stack_before=[],
      stack_after=[pydict],
      proto=1,
      doc="Push an empty dict."),

    I(name='DICT',
      code='d',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[pydict],
      proto=0,
      doc="""Build a dict out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python dict, which single dict object replaces all of the
      stack from the topmost markobject onward. The stack slice alternates
      key, value, key, value, .... For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... {1: 2, 3: 'abc'}
      """),

    I(name='SETITEM',
      code='s',
      arg=None,
      stack_before=[pydict, anyobject, anyobject],
      stack_after=[pydict],
      proto=0,
      doc="""Add a key+value pair to an existing dict.

      Stack before: ... pydict key value
      Stack after:  ... pydict

      where pydict has been modified via pydict[key] = value.
      """),

    I(name='SETITEMS',
      code='u',
      arg=None,
      stack_before=[pydict, markobject, stackslice],
      stack_after=[pydict],
      proto=1,
      doc="""Add an arbitrary number of key+value pairs to an existing dict.

      The slice of the stack following the topmost markobject is taken as
      an alternating sequence of keys and values, added to the dict
      immediately under the topmost markobject. Everything at and after the
      topmost markobject is popped, leaving the mutated dict at the top
      of the stack.

      Stack before: ... pydict markobject key_1 value_1 ... key_n value_n
      Stack after:  ... pydict

      where pydict has been modified via pydict[key_i] = value_i for i in
      1, 2, ..., n, and in that order.
      """),

    # Stack manipulation.

    I(name='POP',
      code='0',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="Discard the top stack item, shrinking the stack by one item."),

    I(name='DUP',
      code='2',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject, anyobject],
      proto=0,
      doc="Push the top stack item onto the stack again, duplicating it."),

    I(name='MARK',
      code='(',
      arg=None,
      stack_before=[],
      stack_after=[markobject],
      proto=0,
      doc="""Push markobject onto the stack.

      markobject is a unique object, used by other opcodes to identify a
      region of the stack containing a variable number of objects for them
      to work on. See markobject.doc for more detail.
      """),

    I(name='POP_MARK',
      code='1',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[],
      proto=1,
      doc="""Pop all the stack objects at and above the topmost markobject.

      When an opcode using a variable number of stack objects is done,
      POP_MARK is used to remove those objects, and to remove the markobject
      that delimited their starting position on the stack.
      """),

    # Memo manipulation. There are really only two operations (get and put),
    # each in all-text, "short binary", and "long binary" flavors.

    I(name='GET',
      code='g',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the newline-terminated
      decimal string following. BINGET and LONG_BINGET are space-optimized
      versions.
      """),

    I(name='BINGET',
      code='h',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 1-byte unsigned
      integer following.
      """),

    I(name='LONG_BINGET',
      code='j',
      arg=uint4,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 4-byte unsigned
      little-endian integer following.
      """),

    I(name='PUT',
      code='p',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[],
      proto=0,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the
      newline-terminated decimal string following. BINPUT and LONG_BINPUT
      are space-optimized versions.
      """),

    I(name='BINPUT',
      code='q',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 1-byte
      unsigned integer following.
      """),

    I(name='LONG_BINPUT',
      code='r',
      arg=uint4,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 4-byte
      unsigned little-endian integer following.
      """),

    # Access the extension registry (predefined objects). Akin to the GET
    # family.

    I(name='EXT1',
      code='\x82',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      This code and the similar EXT2 and EXT4 allow using a registry
      of popular objects that are pickled by name, typically classes.
      It is envisioned that through a global negotiation and
      registration process, third parties can set up a mapping between
      ints and object names.

      In order to guarantee pickle interchangeability, the extension
      code registry ought to be global, although a range of codes may
      be reserved for private use.

      EXT1 has a 1-byte integer argument. This is used to index into the
      extension registry, and the object at that index is pushed on the stack.
      """),

    I(name='EXT2',
      code='\x83',
      arg=uint2,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1. EXT2 has a two-byte integer argument.
      """),

    I(name='EXT4',
      code='\x84',
      arg=int4,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1. EXT4 has a four-byte integer argument.
      """),

    # Push a class object, or module function, on the stack, via its module
    # and name.

    I(name='GLOBAL',
      code='c',
      arg=stringnl_noescape_pair,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push a global object (module.attr) on the stack.

      Two newline-terminated strings follow the GLOBAL opcode. The first is
      taken as a module name, and the second as a class name. The class
      object module.class is pushed on the stack. More accurately, the
      object returned by self.find_class(module, class) is pushed on the
      stack, so unpickling subclasses can override this form of lookup.
      """),

    # Ways to build objects of classes pickle doesn't know about directly
    # (user-defined classes). I despair of documenting this accurately
    # and comprehensibly -- you really have to read the pickle code to
    # find all the special cases.

    I(name='REDUCE',
      code='R',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object built from a callable and an argument tuple.

      The opcode is named after the __reduce__() method.

      Stack before: ... callable pytuple
      Stack after:  ... callable(*pytuple)

      The callable and the argument tuple are the first two items returned
      by a __reduce__ method. Applying the callable to the argtuple is
      supposed to reproduce the original object, or at least get it started.
      If the __reduce__ method returns a 3-tuple, the last component is an
      argument to be passed to the object's __setstate__, and then the REDUCE
      opcode is followed by code to create setstate's argument, and then a
      BUILD opcode to apply __setstate__ to that argument.

      If not isinstance(callable, type), REDUCE complains unless the
      callable has been registered with the copyreg module's
      safe_constructors dict, or the callable has a magic
      '__safe_for_unpickling__' attribute with a true value. I'm not sure
      why it does this, but I've sure seen this complaint often enough when
      I didn't want to <wink>.
      """),

    I(name='BUILD',
      code='b',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Finish building an object, via __setstate__ or dict update.

      Stack before: ... anyobject argument
      Stack after:  ... anyobject

      where anyobject may have been mutated, as follows:

      If the object has a __setstate__ method,

          anyobject.__setstate__(argument)

      is called.

      Else the argument must be a dict, the object must have a __dict__, and
      the object is updated via

          anyobject.__dict__.update(argument)
      """),

    I(name='INST',
      code='i',
      arg=stringnl_noescape_pair,
      stack_before=[markobject, stackslice],
      stack_after=[anyobject],
      proto=0,
      doc="""Build a class instance.

      This is the protocol 0 version of protocol 1's OBJ opcode.
      INST is followed by two newline-terminated strings, giving a
      module and class name, just as for the GLOBAL opcode (and see
      GLOBAL for more details about that). self.find_class(module, name)
      is used to get a class object.

      In addition, all the objects on the stack following the topmost
      markobject are gathered into a tuple and popped (along with the
      topmost markobject), just as for the TUPLE opcode.

      Now it gets complicated. If all of these are true:

        + The argtuple is empty (markobject was at the top of the stack
          at the start).

        + The class object does not have a __getinitargs__ attribute.

      then we want to create an old-style class instance without invoking
      its __init__() method (pickle has waffled on this over the years; not
      calling __init__() is current wisdom). In this case, an instance of
      an old-style dummy class is created, and then we try to rebind its
      __class__ attribute to the desired class object. If this succeeds,
      the new instance object is pushed on the stack, and we're done.

      Else (the argtuple is not empty, it's not an old-style class object,
      or the class object does have a __getinitargs__ attribute), the code
      first insists that the class object have a __safe_for_unpickling__
      attribute. Unlike as for the __safe_for_unpickling__ check in REDUCE,
      it doesn't matter whether this attribute has a true or false value, it
      only matters whether it exists (XXX this is a bug). If
      __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.

      Else (the class object does have a __safe_for_unpickling__ attr),
      the class object obtained from INST's arguments is applied to the
      argtuple obtained from the stack, and the resulting instance object
      is pushed on the stack.

      NOTE: checks for __safe_for_unpickling__ went away in Python 2.3.
      NOTE: the distinction between old-style and new-style classes does
            not make sense in Python 3.
      """),

    I(name='OBJ',
      code='o',
      arg=None,
      stack_before=[markobject, anyobject, stackslice],
      stack_after=[anyobject],
      proto=1,
      doc="""Build a class instance.

      This is the protocol 1 version of protocol 0's INST opcode, and is
      very much like it. The major difference is that the class object
      is taken off the stack, allowing it to be retrieved from the memo
      repeatedly if several instances of the same class are created. This
      can be much more efficient (in both time and space) than repeatedly
      embedding the module and class names in INST opcodes.

      Unlike INST, OBJ takes no arguments from the opcode stream. Instead
      the class object is taken off the stack, immediately above the
      topmost markobject:

        Stack before: ... markobject classobject stackslice
        Stack after:  ... new_instance_object

      As for INST, the remainder of the stack above the markobject is
      gathered into an argument tuple, and then the logic seems identical,
      except that no __safe_for_unpickling__ check is done (XXX this is
      a bug). See INST for the gory details.

      NOTE: In Python 2.3, INST and OBJ are identical except for how they
      get the class object. That was always the intent; the implementations
      had diverged for accidental reasons.
      """),

    I(name='NEWOBJ',
      code='\x81',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=2,
      doc="""Build an object instance.

      The stack before should be thought of as containing a class
      object followed by an argument tuple (the tuple being the stack
      top). Call these cls and args. They are popped off the stack,
      and the value returned by cls.__new__(cls, *args) is pushed back
      onto the stack.
      """),

    # Machine control.

    I(name='PROTO',
      code='\x80',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=2,
      doc="""Protocol version indicator.

      For protocol 2 and above, a pickle must start with this opcode.
      The argument is the protocol version, an int in range(2, 256).
      """),

    I(name='STOP',
      code='.',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="""Stop the unpickling machine.

      Every pickle ends with this opcode. The object at the top of the stack
      is popped, and that's the result of unpickling. The stack should be
      empty then.
      """),

    # Ways to deal with persistent IDs.

    I(name='PERSID',
      code='P',
      arg=stringnl_noescape,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object identified by a persistent ID.

      The pickle module doesn't define what a persistent ID means. PERSID's
      argument is a newline-terminated str-style (no embedded escapes, no
      bracketing quote characters) string, which *is* "the persistent ID".
      The unpickler passes this string to self.persistent_load(). Whatever
      object that returns is pushed on the stack. There is no implementation
      of persistent_load() in Python's unpickler: it must be supplied by an
      unpickler subclass.
      """),

    I(name='BINPERSID',
      code='Q',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject],
      proto=1,
      doc="""Push an object identified by a persistent ID.

      Like PERSID, except the persistent ID is popped off the stack (instead
      of being a string embedded in the opcode bytestream). The persistent
      ID is passed to self.persistent_load(), and whatever object that
      returns is pushed on the stack. See PERSID for more detail.
      """),
]
del I

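# Illustrative sketch, not part of this module: one way the PERSID/BINPERSID
# machinery described above can be exercised from a separate script. The
# names REGISTRY, RefPickler and RefUnpickler are hypothetical; only
# persistent_id() and persistent_load() are hooks defined by the pickle
# module, and the meaning of the ID itself is chosen by the application.

```python
import io
import pickle
import pickletools

# Hypothetical external store; each key plays the role of a persistent ID.
REGISTRY = {"row-7": {"name": "example"}}


class RefPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Return an ID for objects that live in REGISTRY, None otherwise
        # (None tells the pickler to serialize the object normally).
        for key, value in REGISTRY.items():
            if obj is value:
                return key
        return None


class RefUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # Resolve the ID back to the externally stored object.
        return REGISTRY[pid]


buf = io.BytesIO()
RefPickler(buf, protocol=2).dump(["header", REGISTRY["row-7"]])
data = buf.getvalue()

opcode_names = [op.name for op, arg, pos in pickletools.genops(data)]
restored = RefUnpickler(io.BytesIO(data)).load()
```

# With a binary protocol the ID is pushed on the stack and consumed by
# BINPERSID; the restored list refers to the very object held in REGISTRY
# rather than to a copy.
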
# Verify uniqueness of .name and .code members.
name2i = {}
code2i = {}

for i, d in enumerate(opcodes):
    if d.name in name2i:
        raise ValueError("repeated name %r at indices %d and %d" %
                         (d.name, name2i[d.name], i))
    if d.code in code2i:
        raise ValueError("repeated code %r at indices %d and %d" %
                         (d.code, code2i[d.code], i))

    name2i[d.name] = i
    code2i[d.code] = i

del name2i, code2i, i, d

##############################################################################
# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
# Also ensure we've got the same stuff as pickle.py, although the
# introspection here is dicey.

code2op = {}
for d in opcodes:
    code2op[d.code] = d
del d

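# Illustrative sketch, not part of this module: looking up opcode metadata
# through the code2op dict built above, from a separate script. Keys are
# one-character str values (the opcode byte decoded as latin-1). The
# variable names are made up for the example.

```python
import pickletools

reduce_info = pickletools.code2op['R']
proto_info = pickletools.code2op['\x80']

# Each value is an OpcodeInfo record with the fields used in the table above.
summary = (reduce_info.name,
           reduce_info.proto,
           len(reduce_info.stack_before))
```
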
def assure_pickle_consistency(verbose=False):
    """Check that our opcode table agrees with the constants in pickle.py."""

    copy = code2op.copy()
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            if verbose:
                print("skipping %r: it doesn't look like an opcode name" % name)
            continue
        picklecode = getattr(pickle, name)
        if not isinstance(picklecode, bytes) or len(picklecode) != 1:
            if verbose:
                print(("skipping %r: value %r doesn't look like a pickle "
                       "code" % (name, picklecode)))
            continue
        picklecode = picklecode.decode("latin-1")
        if picklecode in copy:
            if verbose:
                print("checking name %r w/ code %r for consistency" % (
                      name, picklecode))
            d = copy[picklecode]
            if d.name != name:
                raise ValueError("for pickle code %r, pickle.py uses name %r "
                                 "but we're using name %r" % (picklecode,
                                                              name,
                                                              d.name))
            # Forget this one. Any left over in copy at the end are a problem
            # of a different kind.
            del copy[picklecode]
        else:
            raise ValueError("pickle.py appears to have a pickle opcode with "
                             "name %r and code %r, but we don't" %
                             (name, picklecode))
    if copy:
        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
        for code, d in copy.items():
            msg.append(" name %r with code %r" % (d.name, code))
        raise ValueError("\n".join(msg))

assure_pickle_consistency()
del assure_pickle_consistency

##############################################################################
# A pickle opcode generator.

def genops(pickle):
    """Generate all the opcodes in a pickle.

    'pickle' is a binary file-like object, or a bytes object, containing
    the pickle.

    Each opcode in the pickle is generated, from the current pickle position,
    stopping after a STOP opcode is delivered. A triple is generated for
    each opcode:

        opcode, arg, pos

    opcode is an OpcodeInfo record, describing the current opcode.

    If the opcode has an argument embedded in the pickle, arg is its decoded
    value, as a Python object. If the opcode doesn't have an argument, arg
    is None.

    If the pickle has a tell() method, pos was the value of pickle.tell()
    before reading the current opcode. If the pickle is a bytes object,
    it's wrapped in a BytesIO object, and the latter's tell() result is
    used. Else (the pickle doesn't have a tell(), and it's not obvious how
    to query its current position) pos is None.
    """

    if isinstance(pickle, bytes_types):
        import io
        pickle = io.BytesIO(pickle)

    if hasattr(pickle, "tell"):
        getpos = pickle.tell
    else:
        getpos = lambda: None

    while True:
        pos = getpos()
        code = pickle.read(1)
        opcode = code2op.get(code.decode("latin-1"))
        if opcode is None:
            if code == b"":
                raise ValueError("pickle exhausted before seeing STOP")
            else:
                raise ValueError("at position %s, opcode %r unknown" % (
                                 "<unknown>" if pos is None else pos,
                                 code))
        if opcode.arg is None:
            arg = None
        else:
            arg = opcode.arg.reader(pickle)
        yield opcode, arg, pos
        if code == b'.':
            assert opcode.name == 'STOP'
            break

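# Illustrative sketch, not part of this module: walking a small hand-written
# protocol 0 pickle with genops() from a separate script. The byte string
# below is an assumption made for the example; it uses GLOBAL to fetch
# builtins.complex and REDUCE to call it with the tuple (1.0, 2.0).

```python
import pickle
import pickletools

# GLOBAL 'builtins complex', MARK, FLOAT 1.0, FLOAT 2.0, TUPLE, REDUCE, STOP
raw = b"cbuiltins\ncomplex\n(F1.0\nF2.0\ntR."

# Each triple is (OpcodeInfo, decoded argument or None, byte offset).
steps = [(op.name, arg, pos) for op, arg, pos in pickletools.genops(raw)]

# Loading the same bytes applies the callable to the argument tuple.
value = pickle.loads(raw)
```

# Only load hand-written or untrusted pickles when the callables they name
# are known to be harmless, as complex is here.
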
##############################################################################
# A pickle optimizer.

def optimize(p):
    'Optimize a pickle (a bytes object) by removing unused PUT opcodes'
    gets = set()            # set of args used by a GET opcode
    puts = []               # (arg, startpos, stoppos) for the PUT opcodes
    prevpos = None          # set to pos if previous opcode was a PUT
    for opcode, arg, pos in genops(p):
        if prevpos is not None:
            puts.append((prevarg, prevpos, pos))
            prevpos = None
        if 'PUT' in opcode.name:
            prevarg, prevpos = arg, pos
        elif 'GET' in opcode.name:
            gets.add(arg)

    # Copy the pickle bytes except for PUTs without a corresponding GET
    s = []
    i = 0
    for arg, start, stop in puts:
        j = stop if (arg in gets) else start
        s.append(p[i:j])
        i = stop
    s.append(p[i:])
    return b''.join(s)

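# Illustrative sketch, not part of this module: calling optimize() from a
# separate script. The list pickled here never needs its memo entry, so the
# BINPUT that stores it is dropped. Variable names are made up for the
# example.

```python
import pickle
import pickletools

original = pickle.dumps([1, 2, 3], protocol=2)
slimmed = pickletools.optimize(original)

# Opcode names remaining after optimization.
slimmed_names = [op.name for op, arg, pos in pickletools.genops(slimmed)]
```

# The optimized pickle is shorter but still loads to an equal object.
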
##############################################################################
# A symbolic pickle disassembler.

def dis(pickle, out=None, memo=None, indentlevel=4, annotate=0):
    """Produce a symbolic disassembly of a pickle.

    'pickle' is a binary file-like object, or a bytes object, containing at
    least one pickle. The pickle is disassembled from the current position,
    through the first STOP opcode encountered.

    Optional arg 'out' is a file-like object to which the disassembly is
    printed. It defaults to sys.stdout.

    Optional arg 'memo' is a Python dict, used as the pickle's memo. It
    may be mutated by dis(), if the pickle contains PUT, BINPUT or
    LONG_BINPUT opcodes.
    Passing the same memo object to another dis() call then allows disassembly
    to proceed across multiple pickles that were all created by the same
    pickler with the same memo. Ordinarily you don't need to worry about this.

    Optional arg 'indentlevel' is the number of blanks by which to indent
    a new MARK level. It defaults to 4.

    Optional arg 'annotate', if nonzero, instructs dis() to add a short
    description of the opcode on each line of disassembled output.
    The value given to 'annotate' must be an integer and is used as a
    hint for the column where annotation should start. The default
    value is 0, meaning no annotations.

    In addition to printing the disassembly, some sanity checks are made:

    + All embedded opcode arguments "make sense".

    + Explicit and implicit pop operations have enough items on the stack.

    + When an opcode implicitly refers to a markobject, a markobject is
      actually on the stack.

    + A memo entry isn't referenced before it's defined.

    + The markobject isn't stored in the memo.

    + A memo entry isn't redefined.
    """

    # Most of the hair here is for sanity checks, but most of it is needed
    # anyway to detect when a protocol 0 POP takes a MARK off the stack
    # (which in turn is needed to indent MARK blocks correctly).

    stack = []          # crude emulation of unpickler stack
    if memo is None:
        memo = {}       # crude emulation of unpickler memo
    maxproto = -1       # max protocol number seen
    markstack = []      # bytecode positions of MARK opcodes
    indentchunk = ' ' * indentlevel
    errormsg = None
    annocol = annotate  # column hint for annotations
    for opcode, arg, pos in genops(pickle):
        if pos is not None:
            print("%5d:" % pos, end=' ', file=out)

        line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
                              indentchunk * len(markstack),
                              opcode.name)

        maxproto = max(maxproto, opcode.proto)
        before = opcode.stack_before    # don't mutate
        after = opcode.stack_after      # don't mutate
        numtopop = len(before)

        # See whether a MARK should be popped.
        markmsg = None
        if markobject in before or (opcode.name == "POP" and
                                    stack and
                                    stack[-1] is markobject):
            assert markobject not in after
            if __debug__:
                if markobject in before:
                    assert before[-1] is stackslice
            if markstack:
                markpos = markstack.pop()
                if markpos is None:
                    markmsg = "(MARK at unknown opcode offset)"
                else:
                    markmsg = "(MARK at %d)" % markpos
                # Pop everything at and after the topmost markobject.
                while stack[-1] is not markobject:
                    stack.pop()
                stack.pop()
                # Stop later code from popping too much.
                try:
                    numtopop = before.index(markobject)
                except ValueError:
                    assert opcode.name == "POP"
                    numtopop = 0
            else:
                errormsg = markmsg = "no MARK exists on stack"

        # Check for correct memo usage.
        if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"):
            assert arg is not None
            if arg in memo:
                errormsg = "memo key %r already defined" % arg
            elif not stack:
                errormsg = "stack is empty -- can't store into memo"
            elif stack[-1] is markobject:
                errormsg = "can't store markobject in the memo"
            else:
                memo[arg] = stack[-1]

        elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
            if arg in memo:
                assert len(after) == 1
                after = [memo[arg]]     # for better stack emulation
            else:
                errormsg = "memo key %r has never been stored into" % arg

        if arg is not None or markmsg:
            # make a mild effort to align arguments
            line += ' ' * (10 - len(opcode.name))
            if arg is not None:
                line += ' ' + repr(arg)
            if markmsg:
                line += ' ' + markmsg
        if annotate:
            line += ' ' * (annocol - len(line))
            # make a mild effort to align annotations
            annocol = len(line)
            if annocol > 50:
                annocol = annotate
            line += ' ' + opcode.doc.split('\n', 1)[0]
        print(line, file=out)

        if errormsg:
            # Note that we delayed complaining until the offending opcode
            # was printed.
            raise ValueError(errormsg)

        # Emulate the stack effects.
        if len(stack) < numtopop:
            raise ValueError("tries to pop %d items from stack with "
                             "only %d items" % (numtopop, len(stack)))
        if numtopop:
            del stack[-numtopop:]
        if markobject in after:
            assert markobject not in before
            markstack.append(pos)

        stack.extend(after)

    print("highest protocol among opcodes =", maxproto, file=out)
    if stack:
        raise ValueError("stack not empty after STOP: %r" % stack)

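# Illustrative sketch, not part of this module: using the 'out' and 'memo'
# arguments of dis() from a separate script. Two pickles written by the same
# Pickler share a memo, so the second one can only be disassembled if the
# memo built while disassembling the first is passed along. Variable names
# are made up for the example.

```python
import io
import pickle
import pickletools

stream = io.BytesIO()
pickler = pickle.Pickler(stream, 2)
shared = [1, 2, 3]
pickler.dump(shared)
pickler.dump(shared)    # second dump only refers back to the memo entry
stream.seek(0)

memo = {}
report = io.StringIO()  # collect the disassembly instead of printing it
pickletools.dis(stream, out=report, memo=memo)
pickletools.dis(stream, out=report, memo=memo)
text = report.getvalue()
```

# Without the shared memo, the second call would raise ValueError because
# its BINGET refers to a memo key that was never stored.
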
# For use in the doctest, simply as an example of a class to pickle.
class _Example:
    def __init__(self, value):
        self.value = value

_dis_test = r"""
>>> import pickle
>>> x = [1, 2, (3, 4), {b'abc': "def"}]
>>> pkl0 = pickle.dumps(x, 0)
>>> dis(pkl0)
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: L    LONG       1
    9: a    APPEND
   10: L    LONG       2
   14: a    APPEND
   15: (    MARK
   16: L        LONG       3
   20: L        LONG       4
   24: t        TUPLE      (MARK at 15)
   25: p    PUT        1
   28: a    APPEND
   29: (    MARK
   30: d        DICT       (MARK at 29)
   31: p    PUT        2
   34: c    GLOBAL     '_codecs encode'
   50: p    PUT        3
   53: (    MARK
   54: V        UNICODE    'abc'
   59: p        PUT        4
   62: V        UNICODE    'latin1'
   70: p        PUT        5
   73: t        TUPLE      (MARK at 53)
   74: p    PUT        6
   77: R    REDUCE
   78: p    PUT        7
   81: V    UNICODE    'def'
   86: p    PUT        8
   89: s    SETITEM
   90: a    APPEND
   91: .    STOP
highest protocol among opcodes = 0

Try again with a "binary" pickle.

>>> pkl1 = pickle.dumps(x, 1)
>>> dis(pkl1)
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: K        BININT1    1
    6: K        BININT1    2
    8: (        MARK
    9: K            BININT1    3
   11: K            BININT1    4
   13: t            TUPLE      (MARK at 8)
   14: q        BINPUT     1
   16: }        EMPTY_DICT
   17: q        BINPUT     2
   19: c        GLOBAL     '_codecs encode'
   35: q        BINPUT     3
   37: (        MARK
   38: X            BINUNICODE 'abc'
   46: q            BINPUT     4
   48: X            BINUNICODE 'latin1'
   59: q            BINPUT     5
   61: t            TUPLE      (MARK at 37)
   62: q        BINPUT     6
   64: R        REDUCE
   65: q        BINPUT     7
   67: X        BINUNICODE 'def'
   75: q        BINPUT     8
   77: s        SETITEM
   78: e        APPENDS    (MARK at 3)
   79: .    STOP
highest protocol among opcodes = 1

Exercise the INST/OBJ/BUILD family.

>>> import pickletools
>>> dis(pickle.dumps(pickletools.dis, 0))
    0: c    GLOBAL     'pickletools dis'
   17: p    PUT        0
   20: .    STOP
highest protocol among opcodes = 0

>>> from pickletools import _Example
>>> x = [_Example(42)] * 2
>>> dis(pickle.dumps(x, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: c    GLOBAL     'copy_reg _reconstructor'
   30: p    PUT        1
   33: (    MARK
   34: c        GLOBAL     'pickletools _Example'
   56: p        PUT        2
   59: c        GLOBAL     '__builtin__ object'
   79: p        PUT        3
   82: N        NONE
   83: t        TUPLE      (MARK at 33)
   84: p    PUT        4
   87: R    REDUCE
   88: p    PUT        5
   91: (    MARK
   92: d        DICT       (MARK at 91)
   93: p    PUT        6
   96: V    UNICODE    'value'
  103: p    PUT        7
  106: L    LONG       42
  111: s    SETITEM
  112: b    BUILD
  113: a    APPEND
  114: g    GET        5
  117: a    APPEND
  118: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(x, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: c        GLOBAL     'copy_reg _reconstructor'
   29: q        BINPUT     1
   31: (        MARK
   32: c            GLOBAL     'pickletools _Example'
   54: q            BINPUT     2
   56: c            GLOBAL     '__builtin__ object'
   76: q            BINPUT     3
   78: N            NONE
   79: t            TUPLE      (MARK at 31)
   80: q        BINPUT     4
   82: R        REDUCE
   83: q        BINPUT     5
   85: }        EMPTY_DICT
   86: q        BINPUT     6
   88: X        BINUNICODE 'value'
   98: q        BINPUT     7
  100: K        BININT1    42
  102: s        SETITEM
  103: b        BUILD
  104: h        BINGET     5
  106: e        APPENDS    (MARK at 3)
  107: .    STOP
highest protocol among opcodes = 1

Try "the canonical" recursive-object test.

>>> L = []
>>> T = L,
>>> L.append(T)
>>> L[0] is T
True
>>> T[0] is L
True
>>> L[0][0] is L
True
>>> T[0][0] is T
True
>>> dis(pickle.dumps(L, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: (    MARK
    6: g        GET        0
    9: t        TUPLE      (MARK at 5)
   10: p    PUT        1
   13: a    APPEND
   14: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(L, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: h        BINGET     0
    6: t        TUPLE      (MARK at 3)
    7: q    BINPUT     1
    9: a    APPEND
   10: .    STOP
highest protocol among opcodes = 1

Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
has to emulate the stack in order to realize that the POP opcode at 16 gets
rid of the MARK at 0.

>>> dis(pickle.dumps(T, 0))
    0: (    MARK
    1: (        MARK
    2: l            LIST       (MARK at 1)
    3: p        PUT        0
    6: (        MARK
    7: g            GET        0
   10: t            TUPLE      (MARK at 6)
   11: p        PUT        1
   14: a        APPEND
   15: 0        POP
   16: 0        POP        (MARK at 0)
   17: g    GET        1
   20: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(T, 1))
    0: (    MARK
    1: ]        EMPTY_LIST
    2: q        BINPUT     0
    4: (        MARK
    5: h            BINGET     0
    7: t            TUPLE      (MARK at 4)
    8: q        BINPUT     1
   10: a        APPEND
   11: 1        POP_MARK   (MARK at 0)
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 1

Try protocol 2.

>>> dis(pickle.dumps(L, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: .    STOP
highest protocol among opcodes = 2

>>> dis(pickle.dumps(T, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: 0    POP
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 2

Try protocol 3 with annotations:

>>> dis(pickle.dumps(T, 3), annotate=1)
    0: \x80 PROTO      3 Protocol version indicator.
    2: ]    EMPTY_LIST   Push an empty list.
    3: q    BINPUT     0 Store the stack top into the memo.  The stack is not popped.
    5: h    BINGET     0 Read an object from the memo and push it on the stack.
    7: \x85 TUPLE1       Build a one-tuple out of the topmost item on the stack.
    8: q    BINPUT     1 Store the stack top into the memo.  The stack is not popped.
   10: a    APPEND       Append an object to a list.
   11: 0    POP          Discard the top stack item, shrinking the stack by one item.
   12: h    BINGET     1 Read an object from the memo and push it on the stack.
   14: .    STOP         Stop the unpickling machine.
highest protocol among opcodes = 2

"""

_memo_test = r"""
>>> import pickle
>>> import io
>>> f = io.BytesIO()
>>> p = pickle.Pickler(f, 2)
>>> x = [1, 2, 3]
>>> p.dump(x)
>>> p.dump(x)
>>> f.seek(0)
0
>>> memo = {}
>>> dis(f, memo=memo)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: K        BININT1    3
   12: e        APPENDS    (MARK at 5)
   13: .    STOP
highest protocol among opcodes = 2
>>> dis(f, memo=memo)
   14: \x80 PROTO      2
   16: h    BINGET     0
   18: .    STOP
highest protocol among opcodes = 2
"""

__test__ = {'disassembler_test': _dis_test,
            'disassembler_memo_test': _memo_test,
           }

def _test():
    import doctest
    return doctest.testmod()

if __name__ == "__main__":
    import sys, argparse
    parser = argparse.ArgumentParser(
        description='disassemble one or more pickle files')
    parser.add_argument(
        'pickle_file', type=argparse.FileType('br'),
        nargs='*', help='the pickle file')
    parser.add_argument(
        '-o', '--output', default=sys.stdout, type=argparse.FileType('w'),
        help='the file where the output should be written')
    parser.add_argument(
        '-m', '--memo', action='store_true',
        help='preserve memo between disassemblies')
    parser.add_argument(
        '-l', '--indentlevel', default=4, type=int,
        help='the number of blanks by which to indent a new MARK level')
    parser.add_argument(
        '-a', '--annotate', action='store_true',
        help='annotate each line with a short opcode description')
    parser.add_argument(
        '-p', '--preamble', default="==> {name} <==",
        help='if more than one pickle file is specified, print this before'
        ' each disassembly')
    parser.add_argument(
        '-t', '--test', action='store_true',
        help='run self-test suite')
    parser.add_argument(
        '-v', action='store_true',
        help='run verbosely; only affects self-test run')
    args = parser.parse_args()
    if args.test:
        _test()
    else:
        annotate = 30 if args.annotate else 0
        if not args.pickle_file:
            parser.print_help()
        elif len(args.pickle_file) == 1:
            dis(args.pickle_file[0], args.output, None,
                args.indentlevel, annotate)
        else:
            memo = {} if args.memo else None
            for f in args.pickle_file:
                preamble = args.preamble.format(name=f.name)
                args.output.write(preamble + '\n')
                dis(f, args.output, memo, args.indentlevel, annotate)
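
# Illustrative sketch, not part of this module: driving the command-line
# interface defined above from a separate script. It assumes the module is
# installed as 'pickletools' and runnable with 'python -m pickletools'; the
# temporary file and variable names are made up for the example.

```python
import os
import pickle
import subprocess
import sys
import tempfile

# Write a small pickle to a temporary file for the CLI to read.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as handle:
    handle.write(pickle.dumps([1, 2, 3], protocol=2))
    path = handle.name

try:
    # -a asks for a short description of each opcode on every line.
    result = subprocess.run(
        [sys.executable, "-m", "pickletools", "-a", path],
        capture_output=True, text=True, check=True)
finally:
    os.remove(path)

cli_output = result.stdout
```
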