blob: e92146deb985db41cc86f5caa0ebe194c64b01df [file] [log] [blame]
Skip Montanaro54455942003-01-29 15:41:33 +00001'''"Executable documentation" for the pickle module.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002
3Extensive comments about the pickle protocols and pickle-machine opcodes
4can be found here. Some functions meant for external use:
5
6genops(pickle)
7 Generate all the opcodes in a pickle, as (opcode, arg, position) triples.
8
Andrew M. Kuchlingd0c53fe2004-08-07 16:51:30 +00009dis(pickle, out=None, memo=None, indentlevel=4)
Tim Peters8ecfc8e2003-01-27 18:51:48 +000010 Print a symbolic disassembly of a pickle.
Skip Montanaro54455942003-01-29 15:41:33 +000011'''
Tim Peters8ecfc8e2003-01-27 18:51:48 +000012
Walter Dörwald42748a82007-06-12 16:40:17 +000013import codecs
Guido van Rossum98297ee2007-11-06 21:34:58 +000014import pickle
15import re
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -070016import sys
Walter Dörwald42748a82007-06-12 16:40:17 +000017
Christian Heimes3feef612008-02-11 06:19:17 +000018__all__ = ['dis', 'genops', 'optimize']
Tim Peters90cf2122004-11-06 23:45:48 +000019
Guido van Rossum98297ee2007-11-06 21:34:58 +000020bytes_types = pickle.bytes_types
21
Tim Peters8ecfc8e2003-01-27 18:51:48 +000022# Other ideas:
23#
24# - A pickle verifier: read a pickle and check it exhaustively for
Tim Petersc1c2b3e2003-01-29 20:12:21 +000025# well-formedness. dis() does a lot of this already.
Tim Peters8ecfc8e2003-01-27 18:51:48 +000026#
27# - A protocol identifier: examine a pickle and return its protocol number
28# (== the highest .proto attr value among all the opcodes in the pickle).
Tim Petersc1c2b3e2003-01-29 20:12:21 +000029# dis() already prints this info at the end.
Tim Peters8ecfc8e2003-01-27 18:51:48 +000030#
31# - A pickle optimizer: for example, tuple-building code is sometimes more
32# elaborate than necessary, catering for the possibility that the tuple
33# is recursive. Or lots of times a PUT is generated that's never accessed
34# by a later GET.
35
36
Victor Stinner765531d2013-03-26 01:11:54 +010037# "A pickle" is a program for a virtual pickle machine (PM, but more accurately
38# called an unpickling machine). It's a sequence of opcodes, interpreted by the
39# PM, building an arbitrarily complex Python object.
40#
41# For the most part, the PM is very simple: there are no looping, testing, or
42# conditional instructions, no arithmetic and no function calls. Opcodes are
43# executed once each, from first to last, until a STOP opcode is reached.
44#
45# The PM has two data areas, "the stack" and "the memo".
46#
47# Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
48# integer object on the stack, whose value is gotten from a decimal string
49# literal immediately following the INT opcode in the pickle bytestream. Other
50# opcodes take Python objects off the stack. The result of unpickling is
51# whatever object is left on the stack when the final STOP opcode is executed.
52#
53# The memo is simply an array of objects, or it can be implemented as a dict
54# mapping little integers to objects. The memo serves as the PM's "long term
55# memory", and the little integers indexing the memo are akin to variable
56# names. Some opcodes pop a stack object into the memo at a given index,
57# and others push a memo object at a given index onto the stack again.
58#
59# At heart, that's all the PM has. Subtleties arise for these reasons:
60#
61# + Object identity. Objects can be arbitrarily complex, and subobjects
62# may be shared (for example, the list [a, a] refers to the same object a
63# twice). It can be vital that unpickling recreate an isomorphic object
64# graph, faithfully reproducing sharing.
65#
66# + Recursive objects. For example, after "L = []; L.append(L)", L is a
67# list, and L[0] is the same list. This is related to the object identity
68# point, and some sequences of pickle opcodes are subtle in order to
69# get the right result in all cases.
70#
71# + Things pickle doesn't know everything about. Examples of things pickle
72# does know everything about are Python's builtin scalar and container
73# types, like ints and tuples. They generally have opcodes dedicated to
74# them. For things like module references and instances of user-defined
75# classes, pickle's knowledge is limited. Historically, many enhancements
76# have been made to the pickle protocol in order to do a better (faster,
77# and/or more compact) job on those.
78#
79# + Backward compatibility and micro-optimization. As explained below,
80# pickle opcodes never go away, not even when better ways to do a thing
81# get invented. The repertoire of the PM just keeps growing over time.
82# For example, protocol 0 had two opcodes for building Python integers (INT
83# and LONG), protocol 1 added three more for more-efficient pickling of short
84# integers, and protocol 2 added two more for more-efficient pickling of
85# long integers (before protocol 2, the only ways to pickle a Python long
86# took time quadratic in the number of digits, for both pickling and
87# unpickling). "Opcode bloat" isn't so much a subtlety as a source of
88# wearying complication.
89#
90#
91# Pickle protocols:
92#
93# For compatibility, the meaning of a pickle opcode never changes. Instead new
94# pickle opcodes get added, and each version's unpickler can handle all the
95# pickle opcodes in all protocol versions to date. So old pickles continue to
96# be readable forever. The pickler can generally be told to restrict itself to
97# the subset of opcodes available under previous protocol versions too, so that
98# users can create pickles under the current version readable by older
99# versions. However, a pickle does not contain its version number embedded
100# within it. If an older unpickler tries to read a pickle using a later
101# protocol, the result is most likely an exception due to seeing an unknown (in
102# the older unpickler) opcode.
103#
104# The original pickle used what's now called "protocol 0", and what was called
105# "text mode" before Python 2.3. The entire pickle bytestream is made up of
106# printable 7-bit ASCII characters, plus the newline character, in protocol 0.
107# That's why it was called text mode. Protocol 0 is small and elegant, but
108# sometimes painfully inefficient.
109#
110# The second major set of additions is now called "protocol 1", and was called
111# "binary mode" before Python 2.3. This added many opcodes with arguments
112# consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
113# bytes. Binary mode pickles can be substantially smaller than equivalent
114# text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
115# int as 4 bytes following the opcode, which is cheaper to unpickle than the
116# (perhaps) 11-character decimal string attached to INT. Protocol 1 also added
117# a number of opcodes that operate on many stack elements at once (like APPENDS
118# and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
119#
120# The third major set of additions came in Python 2.3, and is called "protocol
121# 2". This added:
122#
123# - A better way to pickle instances of new-style classes (NEWOBJ).
124#
125# - A way for a pickle to identify its protocol (PROTO).
126#
127# - Time- and space- efficient pickling of long ints (LONG{1,4}).
128#
129# - Shortcuts for small tuples (TUPLE{1,2,3}}.
130#
131# - Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
132#
133# - The "extension registry", a vector of popular objects that can be pushed
134# efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
135# the registry contents are predefined (there's nothing akin to the memo's
136# PUT).
137#
138# Another independent change with Python 2.3 is the abandonment of any
139# pretense that it might be safe to load pickles received from untrusted
140# parties -- no sufficient security analysis has been done to guarantee
141# this and there isn't a use case that warrants the expense of such an
142# analysis.
143#
144# To this end, all tests for __safe_for_unpickling__ or for
145# copyreg.safe_constructors are removed from the unpickling code.
146# References to these variables in the descriptions below are to be seen
147# as describing unpickling in Python 2.2 and before.
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000148
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000149
150# Meta-rule: Descriptions are stored in instances of descriptor objects,
151# with plain constructors. No meta-language is defined from which
152# descriptors could be constructed. If you want, e.g., XML, write a little
153# program to generate XML from the objects.
154
155##############################################################################
156# Some pickle opcodes have an argument, following the opcode in the
157# bytestream. An argument is of a specific type, described by an instance
158# of ArgumentDescriptor. These are not to be confused with arguments taken
159# off the stack -- ArgumentDescriptor applies only to arguments embedded in
160# the opcode stream, immediately following an opcode.
161
162# Represents the number of bytes consumed by an argument delimited by the
163# next newline character.
164UP_TO_NEWLINE = -1
165
166# Represents the number of bytes consumed by a two-argument opcode where
167# the first argument gives the number of bytes in the second argument.
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -0700168TAKEN_FROM_ARGUMENT1 = -2 # num bytes is 1-byte unsigned int
169TAKEN_FROM_ARGUMENT4 = -3 # num bytes is 4-byte signed little-endian int
170TAKEN_FROM_ARGUMENT4U = -4 # num bytes is 4-byte unsigned little-endian int
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000171
172class ArgumentDescriptor(object):
173 __slots__ = (
174 # name of descriptor record, also a module global name; a string
175 'name',
176
177 # length of argument, in bytes; an int; UP_TO_NEWLINE and
Tim Petersfdb8cfa2003-01-28 00:13:19 +0000178 # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length
179 # cases
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000180 'n',
181
182 # a function taking a file-like object, reading this kind of argument
183 # from the object at the current position, advancing the current
184 # position by n bytes, and returning the value of the argument
185 'reader',
186
187 # human-readable docs for this arg descriptor; a string
188 'doc',
189 )
190
191 def __init__(self, name, n, reader, doc):
192 assert isinstance(name, str)
193 self.name = name
194
195 assert isinstance(n, int) and (n >= 0 or
Tim Petersfdb8cfa2003-01-28 00:13:19 +0000196 n in (UP_TO_NEWLINE,
197 TAKEN_FROM_ARGUMENT1,
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -0700198 TAKEN_FROM_ARGUMENT4,
199 TAKEN_FROM_ARGUMENT4U))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000200 self.n = n
201
202 self.reader = reader
203
204 assert isinstance(doc, str)
205 self.doc = doc
206
207from struct import unpack as _unpack
208
209def read_uint1(f):
Tim Peters55762f52003-01-28 16:01:25 +0000210 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000211 >>> import io
212 >>> read_uint1(io.BytesIO(b'\xff'))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000213 255
214 """
215
216 data = f.read(1)
217 if data:
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000218 return data[0]
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000219 raise ValueError("not enough data in stream to read uint1")
220
221uint1 = ArgumentDescriptor(
222 name='uint1',
223 n=1,
224 reader=read_uint1,
225 doc="One-byte unsigned integer.")
226
227
228def read_uint2(f):
Tim Peters55762f52003-01-28 16:01:25 +0000229 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000230 >>> import io
231 >>> read_uint2(io.BytesIO(b'\xff\x00'))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000232 255
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000233 >>> read_uint2(io.BytesIO(b'\xff\xff'))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000234 65535
235 """
236
237 data = f.read(2)
238 if len(data) == 2:
239 return _unpack("<H", data)[0]
240 raise ValueError("not enough data in stream to read uint2")
241
242uint2 = ArgumentDescriptor(
243 name='uint2',
244 n=2,
245 reader=read_uint2,
246 doc="Two-byte unsigned integer, little-endian.")
247
248
249def read_int4(f):
Tim Peters55762f52003-01-28 16:01:25 +0000250 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000251 >>> import io
252 >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00'))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000253 255
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000254 >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000255 True
256 """
257
258 data = f.read(4)
259 if len(data) == 4:
260 return _unpack("<i", data)[0]
261 raise ValueError("not enough data in stream to read int4")
262
263int4 = ArgumentDescriptor(
264 name='int4',
265 n=4,
266 reader=read_int4,
267 doc="Four-byte signed integer, little-endian, 2's complement.")
268
269
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -0700270def read_uint4(f):
271 r"""
272 >>> import io
273 >>> read_uint4(io.BytesIO(b'\xff\x00\x00\x00'))
274 255
275 >>> read_uint4(io.BytesIO(b'\x00\x00\x00\x80')) == 2**31
276 True
277 """
278
279 data = f.read(4)
280 if len(data) == 4:
281 return _unpack("<I", data)[0]
282 raise ValueError("not enough data in stream to read uint4")
283
284uint4 = ArgumentDescriptor(
285 name='uint4',
286 n=4,
287 reader=read_uint4,
288 doc="Four-byte unsigned integer, little-endian.")
289
290
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000291def read_stringnl(f, decode=True, stripquotes=True):
Tim Peters55762f52003-01-28 16:01:25 +0000292 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000293 >>> import io
294 >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000295 'abcd'
296
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000297 >>> read_stringnl(io.BytesIO(b"\n"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000298 Traceback (most recent call last):
299 ...
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000300 ValueError: no string quotes around b''
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000301
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000302 >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000303 ''
304
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000305 >>> read_stringnl(io.BytesIO(b"''\n"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000306 ''
307
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000308 >>> read_stringnl(io.BytesIO(b'"abcd"'))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000309 Traceback (most recent call last):
310 ...
311 ValueError: no newline found when trying to read stringnl
312
313 Embedded escapes are undone in the result.
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000314 >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'"))
Tim Peters55762f52003-01-28 16:01:25 +0000315 'a\n\\b\x00c\td'
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000316 """
317
Guido van Rossum26986312007-07-17 00:19:46 +0000318 data = f.readline()
Guido van Rossum26d95c32007-08-27 23:18:54 +0000319 if not data.endswith(b'\n'):
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000320 raise ValueError("no newline found when trying to read stringnl")
321 data = data[:-1] # lose the newline
322
323 if stripquotes:
Guido van Rossum26d95c32007-08-27 23:18:54 +0000324 for q in (b'"', b"'"):
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000325 if data.startswith(q):
326 if not data.endswith(q):
327 raise ValueError("strinq quote %r not found at both "
328 "ends of %r" % (q, data))
329 data = data[1:-1]
330 break
331 else:
332 raise ValueError("no string quotes around %r" % data)
333
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000334 if decode:
Guido van Rossum98297ee2007-11-06 21:34:58 +0000335 data = codecs.escape_decode(data)[0].decode("ascii")
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000336 return data
337
338stringnl = ArgumentDescriptor(
339 name='stringnl',
340 n=UP_TO_NEWLINE,
341 reader=read_stringnl,
342 doc="""A newline-terminated string.
343
344 This is a repr-style string, with embedded escapes, and
345 bracketing quotes.
346 """)
347
348def read_stringnl_noescape(f):
Guido van Rossum98297ee2007-11-06 21:34:58 +0000349 return read_stringnl(f, stripquotes=False)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000350
351stringnl_noescape = ArgumentDescriptor(
352 name='stringnl_noescape',
353 n=UP_TO_NEWLINE,
354 reader=read_stringnl_noescape,
355 doc="""A newline-terminated string.
356
357 This is a str-style string, without embedded escapes,
358 or bracketing quotes. It should consist solely of
359 printable ASCII characters.
360 """)
361
362def read_stringnl_noescape_pair(f):
Tim Peters55762f52003-01-28 16:01:25 +0000363 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000364 >>> import io
365 >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
Tim Petersd916cf42003-01-27 19:01:47 +0000366 'Queue Empty'
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000367 """
368
Tim Petersd916cf42003-01-27 19:01:47 +0000369 return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000370
371stringnl_noescape_pair = ArgumentDescriptor(
372 name='stringnl_noescape_pair',
373 n=UP_TO_NEWLINE,
374 reader=read_stringnl_noescape_pair,
375 doc="""A pair of newline-terminated strings.
376
377 These are str-style strings, without embedded
378 escapes, or bracketing quotes. They should
379 consist solely of printable ASCII characters.
380 The pair is returned as a single string, with
Tim Petersd916cf42003-01-27 19:01:47 +0000381 a single blank separating the two strings.
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000382 """)
383
384def read_string4(f):
Tim Peters55762f52003-01-28 16:01:25 +0000385 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000386 >>> import io
387 >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000388 ''
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000389 >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000390 'abc'
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000391 >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000392 Traceback (most recent call last):
393 ...
394 ValueError: expected 50331648 bytes in a string4, but only 6 remain
395 """
396
397 n = read_int4(f)
398 if n < 0:
399 raise ValueError("string4 byte count < 0: %d" % n)
400 data = f.read(n)
401 if len(data) == n:
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000402 return data.decode("latin-1")
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000403 raise ValueError("expected %d bytes in a string4, but only %d remain" %
404 (n, len(data)))
405
406string4 = ArgumentDescriptor(
407 name="string4",
Tim Petersfdb8cfa2003-01-28 00:13:19 +0000408 n=TAKEN_FROM_ARGUMENT4,
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000409 reader=read_string4,
410 doc="""A counted string.
411
412 The first argument is a 4-byte little-endian signed int giving
413 the number of bytes in the string, and the second argument is
414 that many bytes.
415 """)
416
417
418def read_string1(f):
Tim Peters55762f52003-01-28 16:01:25 +0000419 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000420 >>> import io
421 >>> read_string1(io.BytesIO(b"\x00"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000422 ''
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000423 >>> read_string1(io.BytesIO(b"\x03abcdef"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000424 'abc'
425 """
426
427 n = read_uint1(f)
428 assert n >= 0
429 data = f.read(n)
430 if len(data) == n:
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000431 return data.decode("latin-1")
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000432 raise ValueError("expected %d bytes in a string1, but only %d remain" %
433 (n, len(data)))
434
435string1 = ArgumentDescriptor(
436 name="string1",
Tim Petersfdb8cfa2003-01-28 00:13:19 +0000437 n=TAKEN_FROM_ARGUMENT1,
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000438 reader=read_string1,
439 doc="""A counted string.
440
441 The first argument is a 1-byte unsigned int giving the number
442 of bytes in the string, and the second argument is that many
443 bytes.
444 """)
445
446
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -0700447def read_bytes1(f):
448 r"""
449 >>> import io
450 >>> read_bytes1(io.BytesIO(b"\x00"))
451 b''
452 >>> read_bytes1(io.BytesIO(b"\x03abcdef"))
453 b'abc'
454 """
455
456 n = read_uint1(f)
457 assert n >= 0
458 data = f.read(n)
459 if len(data) == n:
460 return data
461 raise ValueError("expected %d bytes in a bytes1, but only %d remain" %
462 (n, len(data)))
463
464bytes1 = ArgumentDescriptor(
465 name="bytes1",
466 n=TAKEN_FROM_ARGUMENT1,
467 reader=read_bytes1,
468 doc="""A counted bytes string.
469
470 The first argument is a 1-byte unsigned int giving the number
471 of bytes, and the second argument is that many bytes.
472 """)
473
474
475def read_bytes4(f):
476 r"""
477 >>> import io
478 >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x00abc"))
479 b''
480 >>> read_bytes4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
481 b'abc'
482 >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
483 Traceback (most recent call last):
484 ...
485 ValueError: expected 50331648 bytes in a bytes4, but only 6 remain
486 """
487
488 n = read_uint4(f)
489 if n > sys.maxsize:
490 raise ValueError("bytes4 byte count > sys.maxsize: %d" % n)
491 data = f.read(n)
492 if len(data) == n:
493 return data
494 raise ValueError("expected %d bytes in a bytes4, but only %d remain" %
495 (n, len(data)))
496
497bytes4 = ArgumentDescriptor(
498 name="bytes4",
499 n=TAKEN_FROM_ARGUMENT4U,
500 reader=read_bytes4,
501 doc="""A counted bytes string.
502
503 The first argument is a 4-byte little-endian unsigned int giving
504 the number of bytes, and the second argument is that many bytes.
505 """)
506
507
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000508def read_unicodestringnl(f):
Tim Peters55762f52003-01-28 16:01:25 +0000509 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000510 >>> import io
511 >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd'
512 True
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000513 """
514
Guido van Rossum26986312007-07-17 00:19:46 +0000515 data = f.readline()
Guido van Rossum26d95c32007-08-27 23:18:54 +0000516 if not data.endswith(b'\n'):
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000517 raise ValueError("no newline found when trying to read "
518 "unicodestringnl")
519 data = data[:-1] # lose the newline
Guido van Rossumef87d6e2007-05-02 19:09:54 +0000520 return str(data, 'raw-unicode-escape')
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000521
522unicodestringnl = ArgumentDescriptor(
523 name='unicodestringnl',
524 n=UP_TO_NEWLINE,
525 reader=read_unicodestringnl,
526 doc="""A newline-terminated Unicode string.
527
528 This is raw-unicode-escape encoded, so consists of
529 printable ASCII characters, and may contain embedded
530 escape sequences.
531 """)
532
533def read_unicodestring4(f):
Tim Peters55762f52003-01-28 16:01:25 +0000534 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000535 >>> import io
536 >>> s = 'abcd\uabcd'
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000537 >>> enc = s.encode('utf-8')
538 >>> enc
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000539 b'abcd\xea\xaf\x8d'
540 >>> n = bytes([len(enc), 0, 0, 0]) # little-endian 4-byte length
541 >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk'))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000542 >>> s == t
543 True
544
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000545 >>> read_unicodestring4(io.BytesIO(n + enc[:-1]))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000546 Traceback (most recent call last):
547 ...
548 ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
549 """
550
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -0700551 n = read_uint4(f)
552 if n > sys.maxsize:
553 raise ValueError("unicodestring4 byte count > sys.maxsize: %d" % n)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000554 data = f.read(n)
555 if len(data) == n:
Victor Stinner485fb562010-04-13 11:07:24 +0000556 return str(data, 'utf-8', 'surrogatepass')
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000557 raise ValueError("expected %d bytes in a unicodestring4, but only %d "
558 "remain" % (n, len(data)))
559
560unicodestring4 = ArgumentDescriptor(
561 name="unicodestring4",
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -0700562 n=TAKEN_FROM_ARGUMENT4U,
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000563 reader=read_unicodestring4,
564 doc="""A counted Unicode string.
565
566 The first argument is a 4-byte little-endian signed int
567 giving the number of bytes in the string, and the second
568 argument-- the UTF-8 encoding of the Unicode string --
569 contains that many bytes.
570 """)
571
572
573def read_decimalnl_short(f):
Tim Peters55762f52003-01-28 16:01:25 +0000574 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000575 >>> import io
576 >>> read_decimalnl_short(io.BytesIO(b"1234\n56"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000577 1234
578
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000579 >>> read_decimalnl_short(io.BytesIO(b"1234L\n56"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000580 Traceback (most recent call last):
581 ...
Serhiy Storchaka95949422013-08-27 19:40:23 +0300582 ValueError: invalid literal for int() with base 10: b'1234L'
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000583 """
584
585 s = read_stringnl(f, decode=False, stripquotes=False)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000586
Serhiy Storchaka95949422013-08-27 19:40:23 +0300587 # There's a hack for True and False here.
Jeremy Hyltona5dc3db2007-08-29 19:07:40 +0000588 if s == b"00":
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000589 return False
Jeremy Hyltona5dc3db2007-08-29 19:07:40 +0000590 elif s == b"01":
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000591 return True
592
Florent Xicluna2bb96f52011-10-23 22:11:00 +0200593 return int(s)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000594
595def read_decimalnl_long(f):
Tim Peters55762f52003-01-28 16:01:25 +0000596 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000597 >>> import io
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000598
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000599 >>> read_decimalnl_long(io.BytesIO(b"1234L\n56"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000600 1234
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000601
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000602 >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000603 123456789012345678901234
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000604 """
605
606 s = read_stringnl(f, decode=False, stripquotes=False)
Mark Dickinson8dd05142009-01-20 20:43:58 +0000607 if s[-1:] == b'L':
608 s = s[:-1]
Guido van Rossume2a383d2007-01-15 16:59:06 +0000609 return int(s)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000610
611
612decimalnl_short = ArgumentDescriptor(
613 name='decimalnl_short',
614 n=UP_TO_NEWLINE,
615 reader=read_decimalnl_short,
616 doc="""A newline-terminated decimal integer literal.
617
618 This never has a trailing 'L', and the integer fit
619 in a short Python int on the box where the pickle
620 was written -- but there's no guarantee it will fit
621 in a short Python int on the box where the pickle
622 is read.
623 """)
624
625decimalnl_long = ArgumentDescriptor(
626 name='decimalnl_long',
627 n=UP_TO_NEWLINE,
628 reader=read_decimalnl_long,
629 doc="""A newline-terminated decimal integer literal.
630
631 This has a trailing 'L', and can represent integers
632 of any size.
633 """)
634
635
636def read_floatnl(f):
Tim Peters55762f52003-01-28 16:01:25 +0000637 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000638 >>> import io
639 >>> read_floatnl(io.BytesIO(b"-1.25\n6"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000640 -1.25
641 """
642 s = read_stringnl(f, decode=False, stripquotes=False)
643 return float(s)
644
645floatnl = ArgumentDescriptor(
646 name='floatnl',
647 n=UP_TO_NEWLINE,
648 reader=read_floatnl,
649 doc="""A newline-terminated decimal floating literal.
650
651 In general this requires 17 significant digits for roundtrip
652 identity, and pickling then unpickling infinities, NaNs, and
653 minus zero doesn't work across boxes, or on some boxes even
654 on itself (e.g., Windows can't read the strings it produces
655 for infinities or NaNs).
656 """)
657
658def read_float8(f):
Tim Peters55762f52003-01-28 16:01:25 +0000659 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000660 >>> import io, struct
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000661 >>> raw = struct.pack(">d", -1.25)
662 >>> raw
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000663 b'\xbf\xf4\x00\x00\x00\x00\x00\x00'
664 >>> read_float8(io.BytesIO(raw + b"\n"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000665 -1.25
666 """
667
668 data = f.read(8)
669 if len(data) == 8:
670 return _unpack(">d", data)[0]
671 raise ValueError("not enough data in stream to read float8")
672
673
674float8 = ArgumentDescriptor(
675 name='float8',
676 n=8,
677 reader=read_float8,
678 doc="""An 8-byte binary representation of a float, big-endian.
679
680 The format is unique to Python, and shared with the struct
Guido van Rossum99603b02007-07-20 00:22:32 +0000681 module (format string '>d') "in theory" (the struct and pickle
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000682 implementations don't share the code -- they should). It's
683 strongly related to the IEEE-754 double format, and, in normal
684 cases, is in fact identical to the big-endian 754 double format.
685 On other boxes the dynamic range is limited to that of a 754
686 double, and "add a half and chop" rounding is used to reduce
687 the precision to 53 bits. However, even on a 754 box,
688 infinities, NaNs, and minus zero may not be handled correctly
689 (may not survive roundtrip pickling intact).
690 """)
691
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000692# Protocol 2 formats
693
Tim Petersc0c12b52003-01-29 00:56:17 +0000694from pickle import decode_long
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000695
696def read_long1(f):
697 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000698 >>> import io
699 >>> read_long1(io.BytesIO(b"\x00"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000700 0
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000701 >>> read_long1(io.BytesIO(b"\x02\xff\x00"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000702 255
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000703 >>> read_long1(io.BytesIO(b"\x02\xff\x7f"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000704 32767
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000705 >>> read_long1(io.BytesIO(b"\x02\x00\xff"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000706 -256
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000707 >>> read_long1(io.BytesIO(b"\x02\x00\x80"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000708 -32768
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000709 """
710
711 n = read_uint1(f)
712 data = f.read(n)
713 if len(data) != n:
714 raise ValueError("not enough data in stream to read long1")
715 return decode_long(data)
716
717long1 = ArgumentDescriptor(
718 name="long1",
Tim Petersfdb8cfa2003-01-28 00:13:19 +0000719 n=TAKEN_FROM_ARGUMENT1,
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000720 reader=read_long1,
721 doc="""A binary long, little-endian, using 1-byte size.
722
723 This first reads one byte as an unsigned size, then reads that
Tim Petersbdbe7412003-01-27 23:54:04 +0000724 many bytes and interprets them as a little-endian 2's-complement long.
Tim Peters4b23f2b2003-01-31 16:43:39 +0000725 If the size is 0, that's taken as a shortcut for the long 0L.
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000726 """)
727
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000728def read_long4(f):
729 r"""
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000730 >>> import io
731 >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000732 255
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000733 >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000734 32767
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000735 >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000736 -256
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000737 >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000738 -32768
Guido van Rossumcfe5f202007-05-08 21:26:54 +0000739 >>> read_long1(io.BytesIO(b"\x00\x00\x00\x00"))
Guido van Rossume2b70bc2006-08-18 22:13:04 +0000740 0
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000741 """
742
743 n = read_int4(f)
744 if n < 0:
Neal Norwitz784a3f52003-01-28 00:20:41 +0000745 raise ValueError("long4 byte count < 0: %d" % n)
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000746 data = f.read(n)
747 if len(data) != n:
Neal Norwitz784a3f52003-01-28 00:20:41 +0000748 raise ValueError("not enough data in stream to read long4")
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000749 return decode_long(data)
750
751long4 = ArgumentDescriptor(
752 name="long4",
Tim Petersfdb8cfa2003-01-28 00:13:19 +0000753 n=TAKEN_FROM_ARGUMENT4,
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000754 reader=read_long4,
755 doc="""A binary representation of a long, little-endian.
756
757 This first reads four bytes as a signed size (but requires the
758 size to be >= 0), then reads that many bytes and interprets them
Tim Peters4b23f2b2003-01-31 16:43:39 +0000759 as a little-endian 2's-complement long. If the size is 0, that's taken
Guido van Rossume2a383d2007-01-15 16:59:06 +0000760 as a shortcut for the int 0, although LONG1 should really be used
Tim Peters4b23f2b2003-01-31 16:43:39 +0000761 then instead (and in any case where # of bytes < 256).
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000762 """)
763
764
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000765##############################################################################
766# Object descriptors. The stack used by the pickle machine holds objects,
767# and in the stack_before and stack_after attributes of OpcodeInfo
768# descriptors we need names to describe the various types of objects that can
769# appear on the stack.
770
771class StackObject(object):
772 __slots__ = (
773 # name of descriptor record, for info only
774 'name',
775
776 # type of object, or tuple of type objects (meaning the object can
777 # be of any type in the tuple)
778 'obtype',
779
780 # human-readable docs for this kind of stack object; a string
781 'doc',
782 )
783
784 def __init__(self, name, obtype, doc):
Guido van Rossum3172c5d2007-10-16 18:12:55 +0000785 assert isinstance(name, str)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000786 self.name = name
787
788 assert isinstance(obtype, type) or isinstance(obtype, tuple)
789 if isinstance(obtype, tuple):
790 for contained in obtype:
791 assert isinstance(contained, type)
792 self.obtype = obtype
793
Guido van Rossum3172c5d2007-10-16 18:12:55 +0000794 assert isinstance(doc, str)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000795 self.doc = doc
796
Tim Petersc1c2b3e2003-01-29 20:12:21 +0000797 def __repr__(self):
798 return self.name
799
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000800
801pyint = StackObject(
802 name='int',
803 obtype=int,
804 doc="A short (as opposed to long) Python integer object.")
805
806pylong = StackObject(
807 name='long',
Guido van Rossume2a383d2007-01-15 16:59:06 +0000808 obtype=int,
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000809 doc="A long (as opposed to short) Python integer object.")
810
811pyinteger_or_bool = StackObject(
812 name='int_or_bool',
Florent Xicluna02ea12b22010-07-28 16:39:41 +0000813 obtype=(int, bool),
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000814 doc="A Python integer object (short or long), or "
815 "a Python bool.")
816
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000817pybool = StackObject(
818 name='bool',
819 obtype=(bool,),
820 doc="A Python bool object.")
821
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000822pyfloat = StackObject(
823 name='float',
824 obtype=float,
825 doc="A Python float object.")
826
827pystring = StackObject(
Guido van Rossumf4169812008-03-17 22:56:06 +0000828 name='string',
829 obtype=bytes,
830 doc="A Python (8-bit) string object.")
831
832pybytes = StackObject(
Guido van Rossum98297ee2007-11-06 21:34:58 +0000833 name='bytes',
834 obtype=bytes,
835 doc="A Python bytes object.")
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000836
837pyunicode = StackObject(
Guido van Rossum98297ee2007-11-06 21:34:58 +0000838 name='str',
Guido van Rossumef87d6e2007-05-02 19:09:54 +0000839 obtype=str,
Guido van Rossumf4169812008-03-17 22:56:06 +0000840 doc="A Python (Unicode) string object.")
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000841
842pynone = StackObject(
843 name="None",
844 obtype=type(None),
845 doc="The Python None object.")
846
847pytuple = StackObject(
848 name="tuple",
849 obtype=tuple,
850 doc="A Python tuple object.")
851
852pylist = StackObject(
853 name="list",
854 obtype=list,
855 doc="A Python list object.")
856
857pydict = StackObject(
858 name="dict",
859 obtype=dict,
860 doc="A Python dict object.")
861
862anyobject = StackObject(
863 name='any',
864 obtype=object,
865 doc="Any kind of object whatsoever.")
866
867markobject = StackObject(
868 name="mark",
869 obtype=StackObject,
870 doc="""'The mark' is a unique object.
871
872 Opcodes that operate on a variable number of objects
873 generally don't embed the count of objects in the opcode,
874 or pull it off the stack. Instead the MARK opcode is used
875 to push a special marker object on the stack, and then
876 some other opcodes grab all the objects from the top of
877 the stack down to (but not including) the topmost marker
878 object.
879 """)
880
881stackslice = StackObject(
882 name="stackslice",
883 obtype=StackObject,
884 doc="""An object representing a contiguous slice of the stack.
885
Ezio Melotti30b9d5d2013-08-17 15:50:46 +0300886 This is used in conjunction with markobject, to represent all
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000887 of the stack following the topmost markobject. For example,
888 the POP_MARK opcode changes the stack from
889
890 [..., markobject, stackslice]
891 to
892 [...]
893
894 No matter how many object are on the stack after the topmost
895 markobject, POP_MARK gets rid of all of them (including the
896 topmost markobject too).
897 """)
898
899##############################################################################
900# Descriptors for pickle opcodes.
901
902class OpcodeInfo(object):
903
904 __slots__ = (
905 # symbolic name of opcode; a string
906 'name',
907
908 # the code used in a bytestream to represent the opcode; a
909 # one-character string
910 'code',
911
912 # If the opcode has an argument embedded in the byte string, an
913 # instance of ArgumentDescriptor specifying its type. Note that
914 # arg.reader(s) can be used to read and decode the argument from
915 # the bytestream s, and arg.doc documents the format of the raw
916 # argument bytes. If the opcode doesn't have an argument embedded
917 # in the bytestream, arg should be None.
918 'arg',
919
920 # what the stack looks like before this opcode runs; a list
921 'stack_before',
922
923 # what the stack looks like after this opcode runs; a list
924 'stack_after',
925
926 # the protocol number in which this opcode was introduced; an int
927 'proto',
928
929 # human-readable docs for this opcode; a string
930 'doc',
931 )
932
933 def __init__(self, name, code, arg,
934 stack_before, stack_after, proto, doc):
Guido van Rossum3172c5d2007-10-16 18:12:55 +0000935 assert isinstance(name, str)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000936 self.name = name
937
Guido van Rossum3172c5d2007-10-16 18:12:55 +0000938 assert isinstance(code, str)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000939 assert len(code) == 1
940 self.code = code
941
942 assert arg is None or isinstance(arg, ArgumentDescriptor)
943 self.arg = arg
944
945 assert isinstance(stack_before, list)
946 for x in stack_before:
947 assert isinstance(x, StackObject)
948 self.stack_before = stack_before
949
950 assert isinstance(stack_after, list)
951 for x in stack_after:
952 assert isinstance(x, StackObject)
953 self.stack_after = stack_after
954
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -0700955 assert isinstance(proto, int) and 0 <= proto <= pickle.HIGHEST_PROTOCOL
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000956 self.proto = proto
957
Guido van Rossum3172c5d2007-10-16 18:12:55 +0000958 assert isinstance(doc, str)
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000959 self.doc = doc
960
961I = OpcodeInfo
962opcodes = [
963
964 # Ways to spell integers.
965
966 I(name='INT',
967 code='I',
968 arg=decimalnl_short,
969 stack_before=[],
970 stack_after=[pyinteger_or_bool],
971 proto=0,
972 doc="""Push an integer or bool.
973
974 The argument is a newline-terminated decimal literal string.
975
976 The intent may have been that this always fit in a short Python int,
977 but INT can be generated in pickles written on a 64-bit box that
978 require a Python long on a 32-bit box. The difference between this
979 and LONG then is that INT skips a trailing 'L', and produces a short
980 int whenever possible.
981
982 Another difference is due to that, when bool was introduced as a
983 distinct type in 2.3, builtin names True and False were also added to
984 2.2.2, mapping to ints 1 and 0. For compatibility in both directions,
985 True gets pickled as INT + "I01\\n", and False as INT + "I00\\n".
986 Leading zeroes are never produced for a genuine integer. The 2.3
987 (and later) unpicklers special-case these and return bool instead;
988 earlier unpicklers ignore the leading "0" and return the int.
989 """),
990
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000991 I(name='BININT',
992 code='J',
993 arg=int4,
994 stack_before=[],
995 stack_after=[pyint],
996 proto=1,
997 doc="""Push a four-byte signed integer.
998
999 This handles the full range of Python (short) integers on a 32-bit
1000 box, directly as binary bytes (1 for the opcode and 4 for the integer).
1001 If the integer is non-negative and fits in 1 or 2 bytes, pickling via
1002 BININT1 or BININT2 saves space.
1003 """),
1004
1005 I(name='BININT1',
1006 code='K',
1007 arg=uint1,
1008 stack_before=[],
1009 stack_after=[pyint],
1010 proto=1,
1011 doc="""Push a one-byte unsigned integer.
1012
1013 This is a space optimization for pickling very small non-negative ints,
1014 in range(256).
1015 """),
1016
1017 I(name='BININT2',
1018 code='M',
1019 arg=uint2,
1020 stack_before=[],
1021 stack_after=[pyint],
1022 proto=1,
1023 doc="""Push a two-byte unsigned integer.
1024
1025 This is a space optimization for pickling small positive ints, in
1026 range(256, 2**16). Integers in range(256) can also be pickled via
1027 BININT2, but BININT1 instead saves a byte.
1028 """),
1029
Tim Petersfdc03462003-01-28 04:56:33 +00001030 I(name='LONG',
1031 code='L',
1032 arg=decimalnl_long,
1033 stack_before=[],
1034 stack_after=[pylong],
1035 proto=0,
1036 doc="""Push a long integer.
1037
1038 The same as INT, except that the literal ends with 'L', and always
1039 unpickles to a Python long. There doesn't seem a real purpose to the
1040 trailing 'L'.
1041
1042 Note that LONG takes time quadratic in the number of digits when
1043 unpickling (this is simply due to the nature of decimal->binary
1044 conversion). Proto 2 added linear-time (in C; still quadratic-time
1045 in Python) LONG1 and LONG4 opcodes.
1046 """),
1047
1048 I(name="LONG1",
1049 code='\x8a',
1050 arg=long1,
1051 stack_before=[],
1052 stack_after=[pylong],
1053 proto=2,
1054 doc="""Long integer using one-byte length.
1055
1056 A more efficient encoding of a Python long; the long1 encoding
1057 says it all."""),
1058
1059 I(name="LONG4",
1060 code='\x8b',
1061 arg=long4,
1062 stack_before=[],
1063 stack_after=[pylong],
1064 proto=2,
1065 doc="""Long integer using found-byte length.
1066
1067 A more efficient encoding of a Python long; the long4 encoding
1068 says it all."""),
1069
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001070 # Ways to spell strings (8-bit, not Unicode).
1071
1072 I(name='STRING',
1073 code='S',
1074 arg=stringnl,
1075 stack_before=[],
1076 stack_after=[pystring],
1077 proto=0,
1078 doc="""Push a Python string object.
1079
1080 The argument is a repr-style string, with bracketing quote characters,
1081 and perhaps embedded escapes. The argument extends until the next
Guido van Rossumf4169812008-03-17 22:56:06 +00001082 newline character. (Actually, they are decoded into a str instance
1083 using the encoding given to the Unpickler constructor. or the default,
1084 'ASCII'.)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001085 """),
1086
1087 I(name='BINSTRING',
1088 code='T',
1089 arg=string4,
1090 stack_before=[],
1091 stack_after=[pystring],
1092 proto=1,
1093 doc="""Push a Python string object.
1094
1095 There are two arguments: the first is a 4-byte little-endian signed int
1096 giving the number of bytes in the string, and the second is that many
Guido van Rossumf4169812008-03-17 22:56:06 +00001097 bytes, which are taken literally as the string content. (Actually,
1098 they are decoded into a str instance using the encoding given to the
1099 Unpickler constructor. or the default, 'ASCII'.)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001100 """),
1101
1102 I(name='SHORT_BINSTRING',
1103 code='U',
1104 arg=string1,
1105 stack_before=[],
1106 stack_after=[pystring],
1107 proto=1,
1108 doc="""Push a Python string object.
1109
1110 There are two arguments: the first is a 1-byte unsigned int giving
1111 the number of bytes in the string, and the second is that many bytes,
Guido van Rossumf4169812008-03-17 22:56:06 +00001112 which are taken literally as the string content. (Actually, they
1113 are decoded into a str instance using the encoding given to the
1114 Unpickler constructor. or the default, 'ASCII'.)
1115 """),
1116
1117 # Bytes (protocol 3 only; older protocols don't support bytes at all)
1118
1119 I(name='BINBYTES',
1120 code='B',
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -07001121 arg=bytes4,
Guido van Rossumf4169812008-03-17 22:56:06 +00001122 stack_before=[],
1123 stack_after=[pybytes],
1124 proto=3,
1125 doc="""Push a Python bytes object.
1126
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -07001127 There are two arguments: the first is a 4-byte little-endian unsigned int
1128 giving the number of bytes, and the second is that many bytes, which are
1129 taken literally as the bytes content.
Guido van Rossumf4169812008-03-17 22:56:06 +00001130 """),
1131
1132 I(name='SHORT_BINBYTES',
1133 code='C',
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -07001134 arg=bytes1,
Guido van Rossumf4169812008-03-17 22:56:06 +00001135 stack_before=[],
1136 stack_after=[pybytes],
Collin Wintere61d4372009-05-20 17:46:47 +00001137 proto=3,
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -07001138 doc="""Push a Python bytes object.
Guido van Rossumf4169812008-03-17 22:56:06 +00001139
1140 There are two arguments: the first is a 1-byte unsigned int giving
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -07001141 the number of bytes, and the second is that many bytes, which are taken
1142 literally as the string content.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001143 """),
1144
1145 # Ways to spell None.
1146
1147 I(name='NONE',
1148 code='N',
1149 arg=None,
1150 stack_before=[],
1151 stack_after=[pynone],
1152 proto=0,
1153 doc="Push None on the stack."),
1154
Tim Petersfdc03462003-01-28 04:56:33 +00001155 # Ways to spell bools, starting with proto 2. See INT for how this was
1156 # done before proto 2.
1157
1158 I(name='NEWTRUE',
1159 code='\x88',
1160 arg=None,
1161 stack_before=[],
1162 stack_after=[pybool],
1163 proto=2,
1164 doc="""True.
1165
1166 Push True onto the stack."""),
1167
1168 I(name='NEWFALSE',
1169 code='\x89',
1170 arg=None,
1171 stack_before=[],
1172 stack_after=[pybool],
1173 proto=2,
1174 doc="""True.
1175
1176 Push False onto the stack."""),
1177
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001178 # Ways to spell Unicode strings.
1179
1180 I(name='UNICODE',
1181 code='V',
1182 arg=unicodestringnl,
1183 stack_before=[],
1184 stack_after=[pyunicode],
1185 proto=0, # this may be pure-text, but it's a later addition
1186 doc="""Push a Python Unicode string object.
1187
1188 The argument is a raw-unicode-escape encoding of a Unicode string,
1189 and so may contain embedded escape sequences. The argument extends
1190 until the next newline character.
1191 """),
1192
1193 I(name='BINUNICODE',
1194 code='X',
1195 arg=unicodestring4,
1196 stack_before=[],
1197 stack_after=[pyunicode],
1198 proto=1,
1199 doc="""Push a Python Unicode string object.
1200
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -07001201 There are two arguments: the first is a 4-byte little-endian unsigned int
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001202 giving the number of bytes in the string. The second is that many
1203 bytes, and is the UTF-8 encoding of the Unicode string.
1204 """),
1205
1206 # Ways to spell floats.
1207
1208 I(name='FLOAT',
1209 code='F',
1210 arg=floatnl,
1211 stack_before=[],
1212 stack_after=[pyfloat],
1213 proto=0,
1214 doc="""Newline-terminated decimal float literal.
1215
1216 The argument is repr(a_float), and in general requires 17 significant
1217 digits for roundtrip conversion to be an identity (this is so for
1218 IEEE-754 double precision values, which is what Python float maps to
1219 on most boxes).
1220
1221 In general, FLOAT cannot be used to transport infinities, NaNs, or
1222 minus zero across boxes (or even on a single box, if the platform C
1223 library can't read the strings it produces for such things -- Windows
1224 is like that), but may do less damage than BINFLOAT on boxes with
1225 greater precision or dynamic range than IEEE-754 double.
1226 """),
1227
1228 I(name='BINFLOAT',
1229 code='G',
1230 arg=float8,
1231 stack_before=[],
1232 stack_after=[pyfloat],
1233 proto=1,
1234 doc="""Float stored in binary form, with 8 bytes of data.
1235
1236 This generally requires less than half the space of FLOAT encoding.
1237 In general, BINFLOAT cannot be used to transport infinities, NaNs, or
1238 minus zero, raises an exception if the exponent exceeds the range of
1239 an IEEE-754 double, and retains no more than 53 bits of precision (if
1240 there are more than that, "add a half and chop" rounding is used to
1241 cut it back to 53 significant bits).
1242 """),
1243
1244 # Ways to build lists.
1245
1246 I(name='EMPTY_LIST',
1247 code=']',
1248 arg=None,
1249 stack_before=[],
1250 stack_after=[pylist],
1251 proto=1,
1252 doc="Push an empty list."),
1253
1254 I(name='APPEND',
1255 code='a',
1256 arg=None,
1257 stack_before=[pylist, anyobject],
1258 stack_after=[pylist],
1259 proto=0,
1260 doc="""Append an object to a list.
1261
1262 Stack before: ... pylist anyobject
1263 Stack after: ... pylist+[anyobject]
Tim Peters81098ac2003-01-28 05:12:08 +00001264
1265 although pylist is really extended in-place.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001266 """),
1267
1268 I(name='APPENDS',
1269 code='e',
1270 arg=None,
1271 stack_before=[pylist, markobject, stackslice],
1272 stack_after=[pylist],
1273 proto=1,
1274 doc="""Extend a list by a slice of stack objects.
1275
1276 Stack before: ... pylist markobject stackslice
1277 Stack after: ... pylist+stackslice
Tim Peters81098ac2003-01-28 05:12:08 +00001278
1279 although pylist is really extended in-place.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001280 """),
1281
1282 I(name='LIST',
1283 code='l',
1284 arg=None,
1285 stack_before=[markobject, stackslice],
1286 stack_after=[pylist],
1287 proto=0,
1288 doc="""Build a list out of the topmost stack slice, after markobject.
1289
1290 All the stack entries following the topmost markobject are placed into
1291 a single Python list, which single list object replaces all of the
1292 stack from the topmost markobject onward. For example,
1293
1294 Stack before: ... markobject 1 2 3 'abc'
1295 Stack after: ... [1, 2, 3, 'abc']
1296 """),
1297
1298 # Ways to build tuples.
1299
1300 I(name='EMPTY_TUPLE',
1301 code=')',
1302 arg=None,
1303 stack_before=[],
1304 stack_after=[pytuple],
1305 proto=1,
1306 doc="Push an empty tuple."),
1307
1308 I(name='TUPLE',
1309 code='t',
1310 arg=None,
1311 stack_before=[markobject, stackslice],
1312 stack_after=[pytuple],
1313 proto=0,
1314 doc="""Build a tuple out of the topmost stack slice, after markobject.
1315
1316 All the stack entries following the topmost markobject are placed into
1317 a single Python tuple, which single tuple object replaces all of the
1318 stack from the topmost markobject onward. For example,
1319
1320 Stack before: ... markobject 1 2 3 'abc'
1321 Stack after: ... (1, 2, 3, 'abc')
1322 """),
1323
Tim Petersfdc03462003-01-28 04:56:33 +00001324 I(name='TUPLE1',
1325 code='\x85',
1326 arg=None,
1327 stack_before=[anyobject],
1328 stack_after=[pytuple],
1329 proto=2,
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001330 doc="""Build a one-tuple out of the topmost item on the stack.
Tim Petersfdc03462003-01-28 04:56:33 +00001331
1332 This code pops one value off the stack and pushes a tuple of
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001333 length 1 whose one item is that value back onto it. In other
1334 words:
Tim Petersfdc03462003-01-28 04:56:33 +00001335
1336 stack[-1] = tuple(stack[-1:])
1337 """),
1338
1339 I(name='TUPLE2',
1340 code='\x86',
1341 arg=None,
1342 stack_before=[anyobject, anyobject],
1343 stack_after=[pytuple],
1344 proto=2,
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001345 doc="""Build a two-tuple out of the top two items on the stack.
Tim Petersfdc03462003-01-28 04:56:33 +00001346
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001347 This code pops two values off the stack and pushes a tuple of
1348 length 2 whose items are those values back onto it. In other
1349 words:
Tim Petersfdc03462003-01-28 04:56:33 +00001350
1351 stack[-2:] = [tuple(stack[-2:])]
1352 """),
1353
1354 I(name='TUPLE3',
1355 code='\x87',
1356 arg=None,
1357 stack_before=[anyobject, anyobject, anyobject],
1358 stack_after=[pytuple],
1359 proto=2,
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001360 doc="""Build a three-tuple out of the top three items on the stack.
Tim Petersfdc03462003-01-28 04:56:33 +00001361
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001362 This code pops three values off the stack and pushes a tuple of
1363 length 3 whose items are those values back onto it. In other
1364 words:
Tim Petersfdc03462003-01-28 04:56:33 +00001365
1366 stack[-3:] = [tuple(stack[-3:])]
1367 """),
1368
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001369 # Ways to build dicts.
1370
1371 I(name='EMPTY_DICT',
1372 code='}',
1373 arg=None,
1374 stack_before=[],
1375 stack_after=[pydict],
1376 proto=1,
1377 doc="Push an empty dict."),
1378
1379 I(name='DICT',
1380 code='d',
1381 arg=None,
1382 stack_before=[markobject, stackslice],
1383 stack_after=[pydict],
1384 proto=0,
1385 doc="""Build a dict out of the topmost stack slice, after markobject.
1386
1387 All the stack entries following the topmost markobject are placed into
1388 a single Python dict, which single dict object replaces all of the
1389 stack from the topmost markobject onward. The stack slice alternates
1390 key, value, key, value, .... For example,
1391
1392 Stack before: ... markobject 1 2 3 'abc'
1393 Stack after: ... {1: 2, 3: 'abc'}
1394 """),
1395
1396 I(name='SETITEM',
1397 code='s',
1398 arg=None,
1399 stack_before=[pydict, anyobject, anyobject],
1400 stack_after=[pydict],
1401 proto=0,
1402 doc="""Add a key+value pair to an existing dict.
1403
1404 Stack before: ... pydict key value
1405 Stack after: ... pydict
1406
1407 where pydict has been modified via pydict[key] = value.
1408 """),
1409
1410 I(name='SETITEMS',
1411 code='u',
1412 arg=None,
1413 stack_before=[pydict, markobject, stackslice],
1414 stack_after=[pydict],
1415 proto=1,
1416 doc="""Add an arbitrary number of key+value pairs to an existing dict.
1417
1418 The slice of the stack following the topmost markobject is taken as
1419 an alternating sequence of keys and values, added to the dict
1420 immediately under the topmost markobject. Everything at and after the
1421 topmost markobject is popped, leaving the mutated dict at the top
1422 of the stack.
1423
1424 Stack before: ... pydict markobject key_1 value_1 ... key_n value_n
1425 Stack after: ... pydict
1426
1427 where pydict has been modified via pydict[key_i] = value_i for i in
1428 1, 2, ..., n, and in that order.
1429 """),
1430
1431 # Stack manipulation.
1432
1433 I(name='POP',
1434 code='0',
1435 arg=None,
1436 stack_before=[anyobject],
1437 stack_after=[],
1438 proto=0,
1439 doc="Discard the top stack item, shrinking the stack by one item."),
1440
1441 I(name='DUP',
1442 code='2',
1443 arg=None,
1444 stack_before=[anyobject],
1445 stack_after=[anyobject, anyobject],
1446 proto=0,
1447 doc="Push the top stack item onto the stack again, duplicating it."),
1448
1449 I(name='MARK',
1450 code='(',
1451 arg=None,
1452 stack_before=[],
1453 stack_after=[markobject],
1454 proto=0,
1455 doc="""Push markobject onto the stack.
1456
1457 markobject is a unique object, used by other opcodes to identify a
1458 region of the stack containing a variable number of objects for them
1459 to work on. See markobject.doc for more detail.
1460 """),
1461
1462 I(name='POP_MARK',
1463 code='1',
1464 arg=None,
1465 stack_before=[markobject, stackslice],
1466 stack_after=[],
Collin Wintere61d4372009-05-20 17:46:47 +00001467 proto=1,
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001468 doc="""Pop all the stack objects at and above the topmost markobject.
1469
1470 When an opcode using a variable number of stack objects is done,
1471 POP_MARK is used to remove those objects, and to remove the markobject
1472 that delimited their starting position on the stack.
1473 """),
1474
1475 # Memo manipulation. There are really only two operations (get and put),
1476 # each in all-text, "short binary", and "long binary" flavors.
1477
1478 I(name='GET',
1479 code='g',
1480 arg=decimalnl_short,
1481 stack_before=[],
1482 stack_after=[anyobject],
1483 proto=0,
1484 doc="""Read an object from the memo and push it on the stack.
1485
Ezio Melotti13925002011-03-16 11:05:33 +02001486 The index of the memo object to push is given by the newline-terminated
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001487 decimal string following. BINGET and LONG_BINGET are space-optimized
1488 versions.
1489 """),
1490
1491 I(name='BINGET',
1492 code='h',
1493 arg=uint1,
1494 stack_before=[],
1495 stack_after=[anyobject],
1496 proto=1,
1497 doc="""Read an object from the memo and push it on the stack.
1498
1499 The index of the memo object to push is given by the 1-byte unsigned
1500 integer following.
1501 """),
1502
1503 I(name='LONG_BINGET',
1504 code='j',
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -07001505 arg=uint4,
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001506 stack_before=[],
1507 stack_after=[anyobject],
1508 proto=1,
1509 doc="""Read an object from the memo and push it on the stack.
1510
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -07001511 The index of the memo object to push is given by the 4-byte unsigned
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001512 little-endian integer following.
1513 """),
1514
1515 I(name='PUT',
1516 code='p',
1517 arg=decimalnl_short,
1518 stack_before=[],
1519 stack_after=[],
1520 proto=0,
1521 doc="""Store the stack top into the memo. The stack is not popped.
1522
1523 The index of the memo location to write into is given by the newline-
1524 terminated decimal string following. BINPUT and LONG_BINPUT are
1525 space-optimized versions.
1526 """),
1527
1528 I(name='BINPUT',
1529 code='q',
1530 arg=uint1,
1531 stack_before=[],
1532 stack_after=[],
1533 proto=1,
1534 doc="""Store the stack top into the memo. The stack is not popped.
1535
1536 The index of the memo location to write into is given by the 1-byte
1537 unsigned integer following.
1538 """),
1539
1540 I(name='LONG_BINPUT',
1541 code='r',
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -07001542 arg=uint4,
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001543 stack_before=[],
1544 stack_after=[],
1545 proto=1,
1546 doc="""Store the stack top into the memo. The stack is not popped.
1547
1548 The index of the memo location to write into is given by the 4-byte
Alexandre Vassalotti8db89ca2013-04-14 03:30:35 -07001549 unsigned little-endian integer following.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001550 """),
1551
Tim Petersfdc03462003-01-28 04:56:33 +00001552 # Access the extension registry (predefined objects). Akin to the GET
1553 # family.
1554
1555 I(name='EXT1',
1556 code='\x82',
1557 arg=uint1,
1558 stack_before=[],
1559 stack_after=[anyobject],
1560 proto=2,
1561 doc="""Extension code.
1562
1563 This code and the similar EXT2 and EXT4 allow using a registry
1564 of popular objects that are pickled by name, typically classes.
1565 It is envisioned that through a global negotiation and
1566 registration process, third parties can set up a mapping between
1567 ints and object names.
1568
1569 In order to guarantee pickle interchangeability, the extension
1570 code registry ought to be global, although a range of codes may
1571 be reserved for private use.
1572
1573 EXT1 has a 1-byte integer argument. This is used to index into the
1574 extension registry, and the object at that index is pushed on the stack.
1575 """),
1576
1577 I(name='EXT2',
1578 code='\x83',
1579 arg=uint2,
1580 stack_before=[],
1581 stack_after=[anyobject],
1582 proto=2,
1583 doc="""Extension code.
1584
1585 See EXT1. EXT2 has a two-byte integer argument.
1586 """),
1587
1588 I(name='EXT4',
1589 code='\x84',
1590 arg=int4,
1591 stack_before=[],
1592 stack_after=[anyobject],
1593 proto=2,
1594 doc="""Extension code.
1595
1596 See EXT1. EXT4 has a four-byte integer argument.
1597 """),
1598
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001599 # Push a class object, or module function, on the stack, via its module
1600 # and name.
1601
1602 I(name='GLOBAL',
1603 code='c',
1604 arg=stringnl_noescape_pair,
1605 stack_before=[],
1606 stack_after=[anyobject],
1607 proto=0,
1608 doc="""Push a global object (module.attr) on the stack.
1609
1610 Two newline-terminated strings follow the GLOBAL opcode. The first is
1611 taken as a module name, and the second as a class name. The class
1612 object module.class is pushed on the stack. More accurately, the
1613 object returned by self.find_class(module, class) is pushed on the
1614 stack, so unpickling subclasses can override this form of lookup.
1615 """),
1616
1617 # Ways to build objects of classes pickle doesn't know about directly
1618 # (user-defined classes). I despair of documenting this accurately
1619 # and comprehensibly -- you really have to read the pickle code to
1620 # find all the special cases.
1621
1622 I(name='REDUCE',
1623 code='R',
1624 arg=None,
1625 stack_before=[anyobject, anyobject],
1626 stack_after=[anyobject],
1627 proto=0,
1628 doc="""Push an object built from a callable and an argument tuple.
1629
1630 The opcode is named to remind of the __reduce__() method.
1631
1632 Stack before: ... callable pytuple
1633 Stack after: ... callable(*pytuple)
1634
1635 The callable and the argument tuple are the first two items returned
1636 by a __reduce__ method. Applying the callable to the argtuple is
1637 supposed to reproduce the original object, or at least get it started.
1638 If the __reduce__ method returns a 3-tuple, the last component is an
1639 argument to be passed to the object's __setstate__, and then the REDUCE
1640 opcode is followed by code to create setstate's argument, and then a
1641 BUILD opcode to apply __setstate__ to that argument.
1642
Guido van Rossum13257902007-06-07 23:15:56 +00001643 If not isinstance(callable, type), REDUCE complains unless the
Alexandre Vassalottif7fa63d2008-05-11 08:55:36 +00001644 callable has been registered with the copyreg module's
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001645 safe_constructors dict, or the callable has a magic
1646 '__safe_for_unpickling__' attribute with a true value. I'm not sure
1647 why it does this, but I've sure seen this complaint often enough when
1648 I didn't want to <wink>.
1649 """),
1650
1651 I(name='BUILD',
1652 code='b',
1653 arg=None,
1654 stack_before=[anyobject, anyobject],
1655 stack_after=[anyobject],
1656 proto=0,
1657 doc="""Finish building an object, via __setstate__ or dict update.
1658
1659 Stack before: ... anyobject argument
1660 Stack after: ... anyobject
1661
1662 where anyobject may have been mutated, as follows:
1663
1664 If the object has a __setstate__ method,
1665
1666 anyobject.__setstate__(argument)
1667
1668 is called.
1669
1670 Else the argument must be a dict, the object must have a __dict__, and
1671 the object is updated via
1672
1673 anyobject.__dict__.update(argument)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001674 """),
1675
1676 I(name='INST',
1677 code='i',
1678 arg=stringnl_noescape_pair,
1679 stack_before=[markobject, stackslice],
1680 stack_after=[anyobject],
1681 proto=0,
1682 doc="""Build a class instance.
1683
1684 This is the protocol 0 version of protocol 1's OBJ opcode.
1685 INST is followed by two newline-terminated strings, giving a
1686 module and class name, just as for the GLOBAL opcode (and see
1687 GLOBAL for more details about that). self.find_class(module, name)
1688 is used to get a class object.
1689
1690 In addition, all the objects on the stack following the topmost
1691 markobject are gathered into a tuple and popped (along with the
1692 topmost markobject), just as for the TUPLE opcode.
1693
1694 Now it gets complicated. If all of these are true:
1695
1696 + The argtuple is empty (markobject was at the top of the stack
1697 at the start).
1698
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001699 + The class object does not have a __getinitargs__ attribute.
1700
1701 then we want to create an old-style class instance without invoking
1702 its __init__() method (pickle has waffled on this over the years; not
1703 calling __init__() is current wisdom). In this case, an instance of
1704 an old-style dummy class is created, and then we try to rebind its
1705 __class__ attribute to the desired class object. If this succeeds,
Guido van Rossuma8add0e2007-05-14 22:03:55 +00001706 the new instance object is pushed on the stack, and we're done.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001707
1708 Else (the argtuple is not empty, it's not an old-style class object,
1709 or the class object does have a __getinitargs__ attribute), the code
1710 first insists that the class object have a __safe_for_unpickling__
1711 attribute. Unlike as for the __safe_for_unpickling__ check in REDUCE,
1712 it doesn't matter whether this attribute has a true or false value, it
Guido van Rossum99603b02007-07-20 00:22:32 +00001713 only matters whether it exists (XXX this is a bug). If
1714 __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001715
1716 Else (the class object does have a __safe_for_unpickling__ attr),
1717 the class object obtained from INST's arguments is applied to the
1718 argtuple obtained from the stack, and the resulting instance object
1719 is pushed on the stack.
Tim Peters2b93c4c2003-01-30 16:35:08 +00001720
1721 NOTE: checks for __safe_for_unpickling__ went away in Python 2.3.
Florent Xiclunaaa6c1d22011-12-12 18:54:29 +01001722 NOTE: the distinction between old-style and new-style classes does
1723 not make sense in Python 3.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001724 """),
1725
1726 I(name='OBJ',
1727 code='o',
1728 arg=None,
1729 stack_before=[markobject, anyobject, stackslice],
1730 stack_after=[anyobject],
1731 proto=1,
1732 doc="""Build a class instance.
1733
1734 This is the protocol 1 version of protocol 0's INST opcode, and is
1735 very much like it. The major difference is that the class object
1736 is taken off the stack, allowing it to be retrieved from the memo
1737 repeatedly if several instances of the same class are created. This
1738 can be much more efficient (in both time and space) than repeatedly
1739 embedding the module and class names in INST opcodes.
1740
1741 Unlike INST, OBJ takes no arguments from the opcode stream. Instead
1742 the class object is taken off the stack, immediately above the
1743 topmost markobject:
1744
1745 Stack before: ... markobject classobject stackslice
1746 Stack after: ... new_instance_object
1747
1748 As for INST, the remainder of the stack above the markobject is
1749 gathered into an argument tuple, and then the logic seems identical,
Guido van Rossumecb11042003-01-29 06:24:30 +00001750 except that no __safe_for_unpickling__ check is done (XXX this is
Guido van Rossum99603b02007-07-20 00:22:32 +00001751 a bug). See INST for the gory details.
Tim Peters2b93c4c2003-01-30 16:35:08 +00001752
1753 NOTE: In Python 2.3, INST and OBJ are identical except for how they
1754 get the class object. That was always the intent; the implementations
1755 had diverged for accidental reasons.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001756 """),
1757
Tim Petersfdc03462003-01-28 04:56:33 +00001758 I(name='NEWOBJ',
1759 code='\x81',
1760 arg=None,
1761 stack_before=[anyobject, anyobject],
1762 stack_after=[anyobject],
1763 proto=2,
1764 doc="""Build an object instance.
1765
1766 The stack before should be thought of as containing a class
1767 object followed by an argument tuple (the tuple being the stack
1768 top). Call these cls and args. They are popped off the stack,
1769 and the value returned by cls.__new__(cls, *args) is pushed back
1770 onto the stack.
1771 """),
1772
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001773 # Machine control.
1774
Tim Petersfdc03462003-01-28 04:56:33 +00001775 I(name='PROTO',
1776 code='\x80',
1777 arg=uint1,
1778 stack_before=[],
1779 stack_after=[],
1780 proto=2,
1781 doc="""Protocol version indicator.
1782
1783 For protocol 2 and above, a pickle must start with this opcode.
1784 The argument is the protocol version, an int in range(2, 256).
1785 """),
1786
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001787 I(name='STOP',
1788 code='.',
1789 arg=None,
1790 stack_before=[anyobject],
1791 stack_after=[],
1792 proto=0,
1793 doc="""Stop the unpickling machine.
1794
1795 Every pickle ends with this opcode. The object at the top of the stack
1796 is popped, and that's the result of unpickling. The stack should be
1797 empty then.
1798 """),
1799
1800 # Ways to deal with persistent IDs.
1801
1802 I(name='PERSID',
1803 code='P',
1804 arg=stringnl_noescape,
1805 stack_before=[],
1806 stack_after=[anyobject],
1807 proto=0,
1808 doc="""Push an object identified by a persistent ID.
1809
1810 The pickle module doesn't define what a persistent ID means. PERSID's
1811 argument is a newline-terminated str-style (no embedded escapes, no
1812 bracketing quote characters) string, which *is* "the persistent ID".
1813 The unpickler passes this string to self.persistent_load(). Whatever
1814 object that returns is pushed on the stack. There is no implementation
1815 of persistent_load() in Python's unpickler: it must be supplied by an
1816 unpickler subclass.
1817 """),
1818
1819 I(name='BINPERSID',
1820 code='Q',
1821 arg=None,
1822 stack_before=[anyobject],
1823 stack_after=[anyobject],
1824 proto=1,
1825 doc="""Push an object identified by a persistent ID.
1826
1827 Like PERSID, except the persistent ID is popped off the stack (instead
1828 of being a string embedded in the opcode bytestream). The persistent
1829 ID is passed to self.persistent_load(), and whatever object that
1830 returns is pushed on the stack. See PERSID for more detail.
1831 """),
1832]
1833del I
1834
1835# Verify uniqueness of .name and .code members.
1836name2i = {}
1837code2i = {}
1838
1839for i, d in enumerate(opcodes):
1840 if d.name in name2i:
1841 raise ValueError("repeated name %r at indices %d and %d" %
1842 (d.name, name2i[d.name], i))
1843 if d.code in code2i:
1844 raise ValueError("repeated code %r at indices %d and %d" %
1845 (d.code, code2i[d.code], i))
1846
1847 name2i[d.name] = i
1848 code2i[d.code] = i
1849
1850del name2i, code2i, i, d
1851
1852##############################################################################
1853# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
1854# Also ensure we've got the same stuff as pickle.py, although the
1855# introspection here is dicey.
1856
1857code2op = {}
1858for d in opcodes:
1859 code2op[d.code] = d
1860del d
1861
1862def assure_pickle_consistency(verbose=False):
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001863
1864 copy = code2op.copy()
1865 for name in pickle.__all__:
1866 if not re.match("[A-Z][A-Z0-9_]+$", name):
1867 if verbose:
Guido van Rossumbe19ed72007-02-09 05:37:30 +00001868 print("skipping %r: it doesn't look like an opcode name" % name)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001869 continue
1870 picklecode = getattr(pickle, name)
Guido van Rossum617dbc42007-05-07 23:57:08 +00001871 if not isinstance(picklecode, bytes) or len(picklecode) != 1:
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001872 if verbose:
Guido van Rossumbe19ed72007-02-09 05:37:30 +00001873 print(("skipping %r: value %r doesn't look like a pickle "
1874 "code" % (name, picklecode)))
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001875 continue
Guido van Rossum617dbc42007-05-07 23:57:08 +00001876 picklecode = picklecode.decode("latin-1")
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001877 if picklecode in copy:
1878 if verbose:
Guido van Rossumbe19ed72007-02-09 05:37:30 +00001879 print("checking name %r w/ code %r for consistency" % (
1880 name, picklecode))
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001881 d = copy[picklecode]
1882 if d.name != name:
1883 raise ValueError("for pickle code %r, pickle.py uses name %r "
1884 "but we're using name %r" % (picklecode,
1885 name,
1886 d.name))
1887 # Forget this one. Any left over in copy at the end are a problem
1888 # of a different kind.
1889 del copy[picklecode]
1890 else:
1891 raise ValueError("pickle.py appears to have a pickle opcode with "
1892 "name %r and code %r, but we don't" %
1893 (name, picklecode))
1894 if copy:
1895 msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
1896 for code, d in copy.items():
1897 msg.append(" name %r with code %r" % (d.name, code))
1898 raise ValueError("\n".join(msg))
1899
1900assure_pickle_consistency()
Tim Petersc0c12b52003-01-29 00:56:17 +00001901del assure_pickle_consistency
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001902
1903##############################################################################
1904# A pickle opcode generator.
1905
1906def genops(pickle):
Guido van Rossuma72ded92003-01-27 19:40:47 +00001907 """Generate all the opcodes in a pickle.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001908
1909 'pickle' is a file-like object, or string, containing the pickle.
1910
1911 Each opcode in the pickle is generated, from the current pickle position,
1912 stopping after a STOP opcode is delivered. A triple is generated for
1913 each opcode:
1914
1915 opcode, arg, pos
1916
1917 opcode is an OpcodeInfo record, describing the current opcode.
1918
1919 If the opcode has an argument embedded in the pickle, arg is its decoded
1920 value, as a Python object. If the opcode doesn't have an argument, arg
1921 is None.
1922
1923 If the pickle has a tell() method, pos was the value of pickle.tell()
Guido van Rossum34d19282007-08-09 01:03:29 +00001924 before reading the current opcode. If the pickle is a bytes object,
1925 it's wrapped in a BytesIO object, and the latter's tell() result is
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001926 used. Else (the pickle doesn't have a tell(), and it's not obvious how
1927 to query its current position) pos is None.
1928 """
1929
Guido van Rossum98297ee2007-11-06 21:34:58 +00001930 if isinstance(pickle, bytes_types):
Guido van Rossumcfe5f202007-05-08 21:26:54 +00001931 import io
1932 pickle = io.BytesIO(pickle)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001933
1934 if hasattr(pickle, "tell"):
1935 getpos = pickle.tell
1936 else:
1937 getpos = lambda: None
1938
1939 while True:
1940 pos = getpos()
1941 code = pickle.read(1)
Guido van Rossumcfe5f202007-05-08 21:26:54 +00001942 opcode = code2op.get(code.decode("latin-1"))
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001943 if opcode is None:
Guido van Rossumcfe5f202007-05-08 21:26:54 +00001944 if code == b"":
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001945 raise ValueError("pickle exhausted before seeing STOP")
1946 else:
1947 raise ValueError("at position %s, opcode %r unknown" % (
1948 pos is None and "<unknown>" or pos,
1949 code))
1950 if opcode.arg is None:
1951 arg = None
1952 else:
1953 arg = opcode.arg.reader(pickle)
1954 yield opcode, arg, pos
Guido van Rossumcfe5f202007-05-08 21:26:54 +00001955 if code == b'.':
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001956 assert opcode.name == 'STOP'
1957 break
1958
1959##############################################################################
Christian Heimes3feef612008-02-11 06:19:17 +00001960# A pickle optimizer.
1961
1962def optimize(p):
1963 'Optimize a pickle string by removing unused PUT opcodes'
1964 gets = set() # set of args used by a GET opcode
1965 puts = [] # (arg, startpos, stoppos) for the PUT opcodes
1966 prevpos = None # set to pos if previous opcode was a PUT
1967 for opcode, arg, pos in genops(p):
1968 if prevpos is not None:
1969 puts.append((prevarg, prevpos, pos))
1970 prevpos = None
1971 if 'PUT' in opcode.name:
1972 prevarg, prevpos = arg, pos
1973 elif 'GET' in opcode.name:
1974 gets.add(arg)
1975
1976 # Copy the pickle string except for PUTS without a corresponding GET
1977 s = []
1978 i = 0
1979 for arg, start, stop in puts:
1980 j = stop if (arg in gets) else start
1981 s.append(p[i:j])
1982 i = stop
1983 s.append(p[i:])
Christian Heimes126d29a2008-02-11 22:57:17 +00001984 return b''.join(s)
Christian Heimes3feef612008-02-11 06:19:17 +00001985
1986##############################################################################
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001987# A symbolic pickle disassembler.
1988
Alexander Belopolsky929d3842010-07-17 15:51:21 +00001989def dis(pickle, out=None, memo=None, indentlevel=4, annotate=0):
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001990 """Produce a symbolic disassembly of a pickle.
1991
1992 'pickle' is a file-like object, or string, containing a (at least one)
1993 pickle. The pickle is disassembled from the current position, through
1994 the first STOP opcode encountered.
1995
1996 Optional arg 'out' is a file-like object to which the disassembly is
1997 printed. It defaults to sys.stdout.
1998
Tim Peters62235e72003-02-05 19:55:53 +00001999 Optional arg 'memo' is a Python dict, used as the pickle's memo. It
2000 may be mutated by dis(), if the pickle contains PUT or BINPUT opcodes.
2001 Passing the same memo object to another dis() call then allows disassembly
2002 to proceed across multiple pickles that were all created by the same
2003 pickler with the same memo. Ordinarily you don't need to worry about this.
2004
Alexander Belopolsky929d3842010-07-17 15:51:21 +00002005 Optional arg 'indentlevel' is the number of blanks by which to indent
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002006 a new MARK level. It defaults to 4.
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002007
Alexander Belopolsky929d3842010-07-17 15:51:21 +00002008 Optional arg 'annotate' if nonzero instructs dis() to add short
2009 description of the opcode on each line of disassembled output.
2010 The value given to 'annotate' must be an integer and is used as a
2011 hint for the column where annotation should start. The default
2012 value is 0, meaning no annotations.
2013
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002014 In addition to printing the disassembly, some sanity checks are made:
2015
2016 + All embedded opcode arguments "make sense".
2017
2018 + Explicit and implicit pop operations have enough items on the stack.
2019
2020 + When an opcode implicitly refers to a markobject, a markobject is
2021 actually on the stack.
2022
2023 + A memo entry isn't referenced before it's defined.
2024
2025 + The markobject isn't stored in the memo.
2026
2027 + A memo entry isn't redefined.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002028 """
2029
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002030 # Most of the hair here is for sanity checks, but most of it is needed
2031 # anyway to detect when a protocol 0 POP takes a MARK off the stack
2032 # (which in turn is needed to indent MARK blocks correctly).
2033
2034 stack = [] # crude emulation of unpickler stack
Tim Peters62235e72003-02-05 19:55:53 +00002035 if memo is None:
Ezio Melotti30b9d5d2013-08-17 15:50:46 +03002036 memo = {} # crude emulation of unpickler memo
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002037 maxproto = -1 # max protocol number seen
2038 markstack = [] # bytecode positions of MARK opcodes
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002039 indentchunk = ' ' * indentlevel
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002040 errormsg = None
Ezio Melotti30b9d5d2013-08-17 15:50:46 +03002041 annocol = annotate # column hint for annotations
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002042 for opcode, arg, pos in genops(pickle):
2043 if pos is not None:
Guido van Rossumbe19ed72007-02-09 05:37:30 +00002044 print("%5d:" % pos, end=' ', file=out)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002045
Tim Petersd0f7c862003-01-28 15:27:57 +00002046 line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
2047 indentchunk * len(markstack),
2048 opcode.name)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002049
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002050 maxproto = max(maxproto, opcode.proto)
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002051 before = opcode.stack_before # don't mutate
2052 after = opcode.stack_after # don't mutate
Tim Peters43277d62003-01-30 15:02:12 +00002053 numtopop = len(before)
2054
2055 # See whether a MARK should be popped.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002056 markmsg = None
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002057 if markobject in before or (opcode.name == "POP" and
2058 stack and
2059 stack[-1] is markobject):
2060 assert markobject not in after
Tim Peters43277d62003-01-30 15:02:12 +00002061 if __debug__:
2062 if markobject in before:
2063 assert before[-1] is stackslice
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002064 if markstack:
2065 markpos = markstack.pop()
2066 if markpos is None:
2067 markmsg = "(MARK at unknown opcode offset)"
2068 else:
2069 markmsg = "(MARK at %d)" % markpos
2070 # Pop everything at and after the topmost markobject.
2071 while stack[-1] is not markobject:
2072 stack.pop()
2073 stack.pop()
Tim Peters43277d62003-01-30 15:02:12 +00002074 # Stop later code from popping too much.
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002075 try:
Tim Peters43277d62003-01-30 15:02:12 +00002076 numtopop = before.index(markobject)
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002077 except ValueError:
2078 assert opcode.name == "POP"
Tim Peters43277d62003-01-30 15:02:12 +00002079 numtopop = 0
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002080 else:
2081 errormsg = markmsg = "no MARK exists on stack"
2082
2083 # Check for correct memo usage.
2084 if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"):
Tim Peters43277d62003-01-30 15:02:12 +00002085 assert arg is not None
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002086 if arg in memo:
2087 errormsg = "memo key %r already defined" % arg
2088 elif not stack:
2089 errormsg = "stack is empty -- can't store into memo"
2090 elif stack[-1] is markobject:
2091 errormsg = "can't store markobject in the memo"
2092 else:
2093 memo[arg] = stack[-1]
2094
2095 elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
2096 if arg in memo:
2097 assert len(after) == 1
2098 after = [memo[arg]] # for better stack emulation
2099 else:
2100 errormsg = "memo key %r has never been stored into" % arg
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002101
2102 if arg is not None or markmsg:
2103 # make a mild effort to align arguments
2104 line += ' ' * (10 - len(opcode.name))
2105 if arg is not None:
2106 line += ' ' + repr(arg)
2107 if markmsg:
2108 line += ' ' + markmsg
Alexander Belopolsky929d3842010-07-17 15:51:21 +00002109 if annotate:
2110 line += ' ' * (annocol - len(line))
2111 # make a mild effort to align annotations
2112 annocol = len(line)
2113 if annocol > 50:
2114 annocol = annotate
2115 line += ' ' + opcode.doc.split('\n', 1)[0]
Guido van Rossumbe19ed72007-02-09 05:37:30 +00002116 print(line, file=out)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002117
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002118 if errormsg:
2119 # Note that we delayed complaining until the offending opcode
2120 # was printed.
2121 raise ValueError(errormsg)
2122
2123 # Emulate the stack effects.
Tim Peters43277d62003-01-30 15:02:12 +00002124 if len(stack) < numtopop:
2125 raise ValueError("tries to pop %d items from stack with "
2126 "only %d items" % (numtopop, len(stack)))
2127 if numtopop:
2128 del stack[-numtopop:]
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002129 if markobject in after:
Tim Peters43277d62003-01-30 15:02:12 +00002130 assert markobject not in before
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002131 markstack.append(pos)
2132
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002133 stack.extend(after)
2134
Guido van Rossumbe19ed72007-02-09 05:37:30 +00002135 print("highest protocol among opcodes =", maxproto, file=out)
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002136 if stack:
2137 raise ValueError("stack not empty after STOP: %r" % stack)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002138
Tim Peters90718a42005-02-15 16:22:34 +00002139# For use in the doctest, simply as an example of a class to pickle.
2140class _Example:
2141 def __init__(self, value):
2142 self.value = value
2143
Guido van Rossum03e35322003-01-28 15:37:13 +00002144_dis_test = r"""
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002145>>> import pickle
Guido van Rossumf4169812008-03-17 22:56:06 +00002146>>> x = [1, 2, (3, 4), {b'abc': "def"}]
2147>>> pkl0 = pickle.dumps(x, 0)
2148>>> dis(pkl0)
Tim Petersd0f7c862003-01-28 15:27:57 +00002149 0: ( MARK
2150 1: l LIST (MARK at 0)
2151 2: p PUT 0
Guido van Rossumf4100002007-01-15 00:21:46 +00002152 5: L LONG 1
Mark Dickinson8dd05142009-01-20 20:43:58 +00002153 9: a APPEND
2154 10: L LONG 2
2155 14: a APPEND
2156 15: ( MARK
2157 16: L LONG 3
2158 20: L LONG 4
2159 24: t TUPLE (MARK at 15)
2160 25: p PUT 1
2161 28: a APPEND
2162 29: ( MARK
2163 30: d DICT (MARK at 29)
2164 31: p PUT 2
Alexandre Vassalotti3bfc65a2011-12-13 13:08:09 -05002165 34: c GLOBAL '_codecs encode'
2166 50: p PUT 3
2167 53: ( MARK
2168 54: V UNICODE 'abc'
Antoine Pitroud9dfaa92009-06-04 20:32:06 +00002169 59: p PUT 4
Alexandre Vassalotti3bfc65a2011-12-13 13:08:09 -05002170 62: V UNICODE 'latin1'
2171 70: p PUT 5
2172 73: t TUPLE (MARK at 53)
2173 74: p PUT 6
2174 77: R REDUCE
2175 78: p PUT 7
2176 81: V UNICODE 'def'
2177 86: p PUT 8
2178 89: s SETITEM
2179 90: a APPEND
2180 91: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002181highest protocol among opcodes = 0
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002182
2183Try again with a "binary" pickle.
2184
Guido van Rossumf4169812008-03-17 22:56:06 +00002185>>> pkl1 = pickle.dumps(x, 1)
2186>>> dis(pkl1)
Tim Petersd0f7c862003-01-28 15:27:57 +00002187 0: ] EMPTY_LIST
2188 1: q BINPUT 0
2189 3: ( MARK
2190 4: K BININT1 1
2191 6: K BININT1 2
2192 8: ( MARK
2193 9: K BININT1 3
2194 11: K BININT1 4
2195 13: t TUPLE (MARK at 8)
2196 14: q BINPUT 1
2197 16: } EMPTY_DICT
2198 17: q BINPUT 2
Alexandre Vassalotti3bfc65a2011-12-13 13:08:09 -05002199 19: c GLOBAL '_codecs encode'
2200 35: q BINPUT 3
2201 37: ( MARK
2202 38: X BINUNICODE 'abc'
2203 46: q BINPUT 4
2204 48: X BINUNICODE 'latin1'
2205 59: q BINPUT 5
2206 61: t TUPLE (MARK at 37)
2207 62: q BINPUT 6
2208 64: R REDUCE
2209 65: q BINPUT 7
2210 67: X BINUNICODE 'def'
2211 75: q BINPUT 8
2212 77: s SETITEM
2213 78: e APPENDS (MARK at 3)
2214 79: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002215highest protocol among opcodes = 1
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002216
2217Exercise the INST/OBJ/BUILD family.
2218
Mark Dickinsoncddcf442009-01-24 21:46:33 +00002219>>> import pickletools
2220>>> dis(pickle.dumps(pickletools.dis, 0))
2221 0: c GLOBAL 'pickletools dis'
2222 17: p PUT 0
2223 20: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002224highest protocol among opcodes = 0
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002225
Tim Peters90718a42005-02-15 16:22:34 +00002226>>> from pickletools import _Example
2227>>> x = [_Example(42)] * 2
Guido van Rossumf29d3d62003-01-27 22:47:53 +00002228>>> dis(pickle.dumps(x, 0))
Tim Petersd0f7c862003-01-28 15:27:57 +00002229 0: ( MARK
2230 1: l LIST (MARK at 0)
2231 2: p PUT 0
Antoine Pitroud9dfaa92009-06-04 20:32:06 +00002232 5: c GLOBAL 'copy_reg _reconstructor'
2233 30: p PUT 1
2234 33: ( MARK
2235 34: c GLOBAL 'pickletools _Example'
2236 56: p PUT 2
2237 59: c GLOBAL '__builtin__ object'
2238 79: p PUT 3
2239 82: N NONE
2240 83: t TUPLE (MARK at 33)
2241 84: p PUT 4
2242 87: R REDUCE
2243 88: p PUT 5
2244 91: ( MARK
2245 92: d DICT (MARK at 91)
2246 93: p PUT 6
2247 96: V UNICODE 'value'
2248 103: p PUT 7
2249 106: L LONG 42
2250 111: s SETITEM
2251 112: b BUILD
Mark Dickinson8dd05142009-01-20 20:43:58 +00002252 113: a APPEND
Antoine Pitroud9dfaa92009-06-04 20:32:06 +00002253 114: g GET 5
2254 117: a APPEND
2255 118: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002256highest protocol among opcodes = 0
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002257
2258>>> dis(pickle.dumps(x, 1))
Tim Petersd0f7c862003-01-28 15:27:57 +00002259 0: ] EMPTY_LIST
2260 1: q BINPUT 0
2261 3: ( MARK
Antoine Pitroud9dfaa92009-06-04 20:32:06 +00002262 4: c GLOBAL 'copy_reg _reconstructor'
2263 29: q BINPUT 1
2264 31: ( MARK
2265 32: c GLOBAL 'pickletools _Example'
2266 54: q BINPUT 2
2267 56: c GLOBAL '__builtin__ object'
2268 76: q BINPUT 3
2269 78: N NONE
2270 79: t TUPLE (MARK at 31)
2271 80: q BINPUT 4
2272 82: R REDUCE
2273 83: q BINPUT 5
2274 85: } EMPTY_DICT
2275 86: q BINPUT 6
2276 88: X BINUNICODE 'value'
2277 98: q BINPUT 7
2278 100: K BININT1 42
2279 102: s SETITEM
2280 103: b BUILD
2281 104: h BINGET 5
2282 106: e APPENDS (MARK at 3)
2283 107: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002284highest protocol among opcodes = 1
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002285
2286Try "the canonical" recursive-object test.
2287
2288>>> L = []
2289>>> T = L,
2290>>> L.append(T)
2291>>> L[0] is T
2292True
2293>>> T[0] is L
2294True
2295>>> L[0][0] is L
2296True
2297>>> T[0][0] is T
2298True
Guido van Rossumf29d3d62003-01-27 22:47:53 +00002299>>> dis(pickle.dumps(L, 0))
Tim Petersd0f7c862003-01-28 15:27:57 +00002300 0: ( MARK
2301 1: l LIST (MARK at 0)
2302 2: p PUT 0
2303 5: ( MARK
2304 6: g GET 0
2305 9: t TUPLE (MARK at 5)
2306 10: p PUT 1
2307 13: a APPEND
2308 14: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002309highest protocol among opcodes = 0
2310
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002311>>> dis(pickle.dumps(L, 1))
Tim Petersd0f7c862003-01-28 15:27:57 +00002312 0: ] EMPTY_LIST
2313 1: q BINPUT 0
2314 3: ( MARK
2315 4: h BINGET 0
2316 6: t TUPLE (MARK at 3)
2317 7: q BINPUT 1
2318 9: a APPEND
2319 10: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002320highest protocol among opcodes = 1
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002321
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002322Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
2323has to emulate the stack in order to realize that the POP opcode at 16 gets
2324rid of the MARK at 0.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002325
Guido van Rossumf29d3d62003-01-27 22:47:53 +00002326>>> dis(pickle.dumps(T, 0))
Tim Petersd0f7c862003-01-28 15:27:57 +00002327 0: ( MARK
2328 1: ( MARK
2329 2: l LIST (MARK at 1)
2330 3: p PUT 0
2331 6: ( MARK
2332 7: g GET 0
2333 10: t TUPLE (MARK at 6)
2334 11: p PUT 1
2335 14: a APPEND
2336 15: 0 POP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002337 16: 0 POP (MARK at 0)
2338 17: g GET 1
2339 20: . STOP
2340highest protocol among opcodes = 0
2341
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002342>>> dis(pickle.dumps(T, 1))
Tim Petersd0f7c862003-01-28 15:27:57 +00002343 0: ( MARK
2344 1: ] EMPTY_LIST
2345 2: q BINPUT 0
2346 4: ( MARK
2347 5: h BINGET 0
2348 7: t TUPLE (MARK at 4)
2349 8: q BINPUT 1
2350 10: a APPEND
2351 11: 1 POP_MARK (MARK at 0)
2352 12: h BINGET 1
2353 14: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002354highest protocol among opcodes = 1
Tim Petersd0f7c862003-01-28 15:27:57 +00002355
2356Try protocol 2.
2357
2358>>> dis(pickle.dumps(L, 2))
2359 0: \x80 PROTO 2
2360 2: ] EMPTY_LIST
2361 3: q BINPUT 0
2362 5: h BINGET 0
2363 7: \x85 TUPLE1
2364 8: q BINPUT 1
2365 10: a APPEND
2366 11: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002367highest protocol among opcodes = 2
Tim Petersd0f7c862003-01-28 15:27:57 +00002368
2369>>> dis(pickle.dumps(T, 2))
2370 0: \x80 PROTO 2
2371 2: ] EMPTY_LIST
2372 3: q BINPUT 0
2373 5: h BINGET 0
2374 7: \x85 TUPLE1
2375 8: q BINPUT 1
2376 10: a APPEND
2377 11: 0 POP
2378 12: h BINGET 1
2379 14: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002380highest protocol among opcodes = 2
Alexander Belopolsky929d3842010-07-17 15:51:21 +00002381
2382Try protocol 3 with annotations:
2383
2384>>> dis(pickle.dumps(T, 3), annotate=1)
2385 0: \x80 PROTO 3 Protocol version indicator.
2386 2: ] EMPTY_LIST Push an empty list.
2387 3: q BINPUT 0 Store the stack top into the memo. The stack is not popped.
2388 5: h BINGET 0 Read an object from the memo and push it on the stack.
2389 7: \x85 TUPLE1 Build a one-tuple out of the topmost item on the stack.
2390 8: q BINPUT 1 Store the stack top into the memo. The stack is not popped.
2391 10: a APPEND Append an object to a list.
2392 11: 0 POP Discard the top stack item, shrinking the stack by one item.
2393 12: h BINGET 1 Read an object from the memo and push it on the stack.
2394 14: . STOP Stop the unpickling machine.
2395highest protocol among opcodes = 2
2396
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002397"""
2398
Tim Peters62235e72003-02-05 19:55:53 +00002399_memo_test = r"""
2400>>> import pickle
Guido van Rossumcfe5f202007-05-08 21:26:54 +00002401>>> import io
2402>>> f = io.BytesIO()
Tim Peters62235e72003-02-05 19:55:53 +00002403>>> p = pickle.Pickler(f, 2)
2404>>> x = [1, 2, 3]
2405>>> p.dump(x)
2406>>> p.dump(x)
2407>>> f.seek(0)
Guido van Rossumcfe5f202007-05-08 21:26:54 +000024080
Tim Peters62235e72003-02-05 19:55:53 +00002409>>> memo = {}
2410>>> dis(f, memo=memo)
2411 0: \x80 PROTO 2
2412 2: ] EMPTY_LIST
2413 3: q BINPUT 0
2414 5: ( MARK
2415 6: K BININT1 1
2416 8: K BININT1 2
2417 10: K BININT1 3
2418 12: e APPENDS (MARK at 5)
2419 13: . STOP
2420highest protocol among opcodes = 2
2421>>> dis(f, memo=memo)
2422 14: \x80 PROTO 2
2423 16: h BINGET 0
2424 18: . STOP
2425highest protocol among opcodes = 2
2426"""
2427
Guido van Rossum57028352003-01-28 15:09:10 +00002428__test__ = {'disassembler_test': _dis_test,
Tim Peters62235e72003-02-05 19:55:53 +00002429 'disassembler_memo_test': _memo_test,
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002430 }
2431
2432def _test():
2433 import doctest
2434 return doctest.testmod()
2435
2436if __name__ == "__main__":
Alexander Belopolsky60c762b2010-07-03 20:35:53 +00002437 import sys, argparse
2438 parser = argparse.ArgumentParser(
2439 description='disassemble one or more pickle files')
2440 parser.add_argument(
2441 'pickle_file', type=argparse.FileType('br'),
2442 nargs='*', help='the pickle file')
2443 parser.add_argument(
2444 '-o', '--output', default=sys.stdout, type=argparse.FileType('w'),
2445 help='the file where the output should be written')
2446 parser.add_argument(
2447 '-m', '--memo', action='store_true',
2448 help='preserve memo between disassemblies')
2449 parser.add_argument(
2450 '-l', '--indentlevel', default=4, type=int,
2451 help='the number of blanks by which to indent a new MARK level')
2452 parser.add_argument(
Alexander Belopolsky929d3842010-07-17 15:51:21 +00002453 '-a', '--annotate', action='store_true',
2454 help='annotate each line with a short opcode description')
2455 parser.add_argument(
Alexander Belopolsky60c762b2010-07-03 20:35:53 +00002456 '-p', '--preamble', default="==> {name} <==",
2457 help='if more than one pickle file is specified, print this before'
2458 ' each disassembly')
2459 parser.add_argument(
2460 '-t', '--test', action='store_true',
2461 help='run self-test suite')
2462 parser.add_argument(
2463 '-v', action='store_true',
2464 help='run verbosely; only affects self-test run')
2465 args = parser.parse_args()
2466 if args.test:
2467 _test()
2468 else:
Alexander Belopolsky929d3842010-07-17 15:51:21 +00002469 annotate = 30 if args.annotate else 0
Alexander Belopolsky60c762b2010-07-03 20:35:53 +00002470 if not args.pickle_file:
2471 parser.print_help()
2472 elif len(args.pickle_file) == 1:
Alexander Belopolsky929d3842010-07-17 15:51:21 +00002473 dis(args.pickle_file[0], args.output, None,
2474 args.indentlevel, annotate)
Alexander Belopolsky60c762b2010-07-03 20:35:53 +00002475 else:
2476 memo = {} if args.memo else None
2477 for f in args.pickle_file:
2478 preamble = args.preamble.format(name=f.name)
2479 args.output.write(preamble + '\n')
Alexander Belopolsky929d3842010-07-17 15:51:21 +00002480 dis(f, args.output, memo, args.indentlevel, annotate)