'''"Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here. Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

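# A minimal usage sketch for the two functions named in the docstring above.
# It assumes the installed standard-library pickletools module is importable;
# the variable names (payload, opcode_names) are illustrative only.

```python
import pickle
import pickletools

# Serialize a small list with protocol 2 and walk its opcode stream.
payload = pickle.dumps([1, 2], protocol=2)

opcode_names = []
for opcode, arg, pos in pickletools.genops(payload):
    # opcode is an OpcodeInfo record, arg is the decoded inline argument
    # (or None), and pos is the byte offset of the opcode in the pickle.
    opcode_names.append(opcode.name)
    print(pos, opcode.name, arg)

# dis() writes a symbolic disassembly of the same bytes to stdout.
pickletools.dis(payload)
```

# The exact opcode sequence depends on the pickler and protocol in use, so
# the sketch only relies on the stream ending with STOP.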
import codecs
import pickle
import re

__all__ = ['dis', 'genops', 'optimize']

bytes_types = pickle.bytes_types

# Other ideas:
#
# - A pickle verifier: read a pickle and check it exhaustively for
#   well-formedness. dis() does a lot of this already.
#
# - A protocol identifier: examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#   dis() already prints this info at the end.
#
# - A pickle optimizer: for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive. Or lots of times a PUT is generated that's never accessed
#   by a later GET.


# "A pickle" is a program for a virtual pickle machine (PM, but more accurately
# called an unpickling machine). It's a sequence of opcodes, interpreted by the
# PM, building an arbitrarily complex Python object.
#
# For the most part, the PM is very simple: there are no looping, testing, or
# conditional instructions, no arithmetic and no function calls. Opcodes are
# executed once each, from first to last, until a STOP opcode is reached.
#
# The PM has two data areas, "the stack" and "the memo".
#
# Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
# integer object on the stack, whose value is gotten from a decimal string
# literal immediately following the INT opcode in the pickle bytestream. Other
# opcodes take Python objects off the stack. The result of unpickling is
# whatever object is left on the stack when the final STOP opcode is executed.
#
# The memo is simply an array of objects, or it can be implemented as a dict
# mapping little integers to objects. The memo serves as the PM's "long term
# memory", and the little integers indexing the memo are akin to variable
# names. Some opcodes pop a stack object into the memo at a given index,
# and others push a memo object at a given index onto the stack again.
#
# At heart, that's all the PM has. Subtleties arise for these reasons:
#
# + Object identity. Objects can be arbitrarily complex, and subobjects
#   may be shared (for example, the list [a, a] refers to the same object a
#   twice). It can be vital that unpickling recreate an isomorphic object
#   graph, faithfully reproducing sharing.
#
# + Recursive objects. For example, after "L = []; L.append(L)", L is a
#   list, and L[0] is the same list. This is related to the object identity
#   point, and some sequences of pickle opcodes are subtle in order to
#   get the right result in all cases.
#
# + Things pickle doesn't know everything about. Examples of things pickle
#   does know everything about are Python's builtin scalar and container
#   types, like ints and tuples. They generally have opcodes dedicated to
#   them. For things like module references and instances of user-defined
#   classes, pickle's knowledge is limited. Historically, many enhancements
#   have been made to the pickle protocol in order to do a better (faster,
#   and/or more compact) job on those.
#
# + Backward compatibility and micro-optimization. As explained below,
#   pickle opcodes never go away, not even when better ways to do a thing
#   get invented. The repertoire of the PM just keeps growing over time.
#   For example, protocol 0 had two opcodes for building Python integers (INT
#   and LONG), protocol 1 added three more for more-efficient pickling of short
#   integers, and protocol 2 added two more for more-efficient pickling of
#   long integers (before protocol 2, the only ways to pickle a Python long
#   took time quadratic in the number of digits, for both pickling and
#   unpickling). "Opcode bloat" isn't so much a subtlety as a source of
#   wearying complication.
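#
# The object-identity and recursion points above can be observed directly.
# The following is a small sketch, assuming the standard pickle and
# pickletools modules; the names (shared, original, recursive) are
# illustrative and not part of this module.

```python
import pickle
import pickletools

shared = []
original = [shared, shared]          # the same list object appears twice

payload = pickle.dumps(original, protocol=0)
restored = pickle.loads(payload)

# The second reference is written as a memo fetch rather than a second copy.
names = [op.name for op, arg, pos in pickletools.genops(payload)]
uses_memo_fetch = any(n in ("GET", "BINGET", "LONG_BINGET") for n in names)

# A recursive list: L[0] is L itself, before and after the round trip.
recursive = []
recursive.append(recursive)
recursive_copy = pickle.loads(pickle.dumps(recursive, protocol=0))
```

# After the round trip, restored[0] and restored[1] are one object again,
# which is what "an isomorphic object graph" means in practice.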
#
#
# Pickle protocols:
#
# For compatibility, the meaning of a pickle opcode never changes. Instead new
# pickle opcodes get added, and each version's unpickler can handle all the
# pickle opcodes in all protocol versions to date. So old pickles continue to
# be readable forever. The pickler can generally be told to restrict itself to
# the subset of opcodes available under previous protocol versions too, so that
# users can create pickles under the current version readable by older
# versions. However, a pickle does not contain its version number embedded
# within it. If an older unpickler tries to read a pickle using a later
# protocol, the result is most likely an exception due to seeing an unknown (in
# the older unpickler) opcode.
#
# The original pickle used what's now called "protocol 0", and what was called
# "text mode" before Python 2.3. The entire pickle bytestream is made up of
# printable 7-bit ASCII characters, plus the newline character, in protocol 0.
# That's why it was called text mode. Protocol 0 is small and elegant, but
# sometimes painfully inefficient.
#
# The second major set of additions is now called "protocol 1", and was called
# "binary mode" before Python 2.3. This added many opcodes with arguments
# consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
# bytes. Binary mode pickles can be substantially smaller than equivalent
# text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
# int as 4 bytes following the opcode, which is cheaper to unpickle than the
# (perhaps) 11-character decimal string attached to INT. Protocol 1 also added
# a number of opcodes that operate on many stack elements at once (like APPENDS
# and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
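#
# The difference between "text mode" and "binary mode" can be checked with a
# short sketch. It assumes only the standard pickle module; the sample data
# and variable names are illustrative.

```python
import pickle

data = list(range(100))

text_pickle = pickle.dumps(data, protocol=0)    # "text mode"
binary_pickle = pickle.dumps(data, protocol=1)  # "binary mode"

# Protocol 0 output consists of 7-bit ASCII bytes only.
is_ascii = all(byte < 128 for byte in text_pickle)

print(len(text_pickle), len(binary_pickle), is_ascii)
```

# Both pickles load back to the same list; for this input the protocol 1
# form is shorter because small ints take two bytes each and the list is
# filled with a single APPENDS.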
#
# The third major set of additions came in Python 2.3, and is called "protocol
# 2". This added:
#
# - A better way to pickle instances of new-style classes (NEWOBJ).
#
# - A way for a pickle to identify its protocol (PROTO).
#
# - Time- and space-efficient pickling of long ints (LONG{1,4}).
#
# - Shortcuts for small tuples (TUPLE{1,2,3}).
#
# - Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
#
# - The "extension registry", a vector of popular objects that can be pushed
#   efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
#   the registry contents are predefined (there's nothing akin to the memo's
#   PUT).
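#
# Several of the protocol 2 additions listed above show up in even a tiny
# pickle. This is a sketch assuming the standard pickle and pickletools
# modules; the sample tuple is arbitrary, chosen so that the integer does
# not fit in four bytes.

```python
import pickle
import pickletools

payload = pickle.dumps((True, 2**70), protocol=2)

# Collect the names of all opcodes that appear in the pickle.
names = {op.name for op, arg, pos in pickletools.genops(payload)}
print(sorted(names))
```

# The set is expected to include PROTO, NEWTRUE, LONG1 and TUPLE2; other
# opcodes (such as a memo PUT) may appear depending on the pickler.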
#
# Another independent change with Python 2.3 is the abandonment of any
# pretense that it might be safe to load pickles received from untrusted
# parties -- no sufficient security analysis has been done to guarantee
# this and there isn't a use case that warrants the expense of such an
# analysis.
#
# To this end, all tests for __safe_for_unpickling__ or for
# copyreg.safe_constructors are removed from the unpickling code.
# References to these variables in the descriptions below are to be seen
# as describing unpickling in Python 2.2 and before.


# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
# descriptors could be constructed. If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
# of ArgumentDescriptor. These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1 = -2   # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4 = -3   # num bytes is 4-byte signed little-endian int

class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length
        # cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(n, int) and (n >= 0 or
                                       n in (UP_TO_NEWLINE,
                                             TAKEN_FROM_ARGUMENT1,
                                             TAKEN_FROM_ARGUMENT4))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc

from struct import unpack as _unpack

def read_uint1(f):
    r"""
    >>> import io
    >>> read_uint1(io.BytesIO(b'\xff'))
    255
    """

    data = f.read(1)
    if data:
        return data[0]
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
    name='uint1',
    n=1,
    reader=read_uint1,
    doc="One-byte unsigned integer.")


def read_uint2(f):
    r"""
    >>> import io
    >>> read_uint2(io.BytesIO(b'\xff\x00'))
    255
    >>> read_uint2(io.BytesIO(b'\xff\xff'))
    65535
    """

    data = f.read(2)
    if len(data) == 2:
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
    name='uint2',
    n=2,
    reader=read_uint2,
    doc="Two-byte unsigned integer, little-endian.")


def read_int4(f):
    r"""
    >>> import io
    >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00'))
    255
    >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31)
    True
    """

    data = f.read(4)
    if len(data) == 4:
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
    name='int4',
    n=4,
    reader=read_int4,
    doc="Four-byte signed integer, little-endian, 2's complement.")


def read_stringnl(f, decode=True, stripquotes=True):
    r"""
    >>> import io
    >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n"))
    'abcd'

    >>> read_stringnl(io.BytesIO(b"\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around b''

    >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False)
    ''

    >>> read_stringnl(io.BytesIO(b"''\n"))
    ''

    >>> read_stringnl(io.BytesIO(b'"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'"))
    'a\n\\b\x00c\td'
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        for q in (b'"', b"'"):
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    if decode:
        data = codecs.escape_decode(data)[0].decode("ascii")
    return data

stringnl = ArgumentDescriptor(
    name='stringnl',
    n=UP_TO_NEWLINE,
    reader=read_stringnl,
    doc="""A newline-terminated string.

    This is a repr-style string, with embedded escapes, and
    bracketing quotes.
    """)

def read_stringnl_noescape(f):
    return read_stringnl(f, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
    name='stringnl_noescape',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape,
    doc="""A newline-terminated string.

    This is a str-style string, without embedded escapes,
    or bracketing quotes. It should consist solely of
    printable ASCII characters.
    """)

def read_stringnl_noescape_pair(f):
    r"""
    >>> import io
    >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
    name='stringnl_noescape_pair',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape_pair,
    doc="""A pair of newline-terminated strings.

    These are str-style strings, without embedded
    escapes, or bracketing quotes. They should
    consist solely of printable ASCII characters.
    The pair is returned as a single string, with
    a single blank separating the two strings.
    """)

def read_string4(f):
    r"""
    >>> import io
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc"))
    ''
    >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
    'abc'
    >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ...
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("string4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string4, but only %d remain" %
                     (n, len(data)))

string4 = ArgumentDescriptor(
    name="string4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_string4,
    doc="""A counted string.

    The first argument is a 4-byte little-endian signed int giving
    the number of bytes in the string, and the second argument is
    that many bytes.
    """)


def read_string1(f):
    r"""
    >>> import io
    >>> read_string1(io.BytesIO(b"\x00"))
    ''
    >>> read_string1(io.BytesIO(b"\x03abcdef"))
    'abc'
    """

    n = read_uint1(f)
    assert n >= 0
    data = f.read(n)
    if len(data) == n:
        return data.decode("latin-1")
    raise ValueError("expected %d bytes in a string1, but only %d remain" %
                     (n, len(data)))

string1 = ArgumentDescriptor(
    name="string1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_string1,
    doc="""A counted string.

    The first argument is a 1-byte unsigned int giving the number
    of bytes in the string, and the second argument is that many
    bytes.
    """)


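# The descriptors defined so far can be used as plain objects: each one pairs
# a byte count (or a negative sentinel for variable-length arguments) with a
# reader function. A small sketch, assuming the installed standard-library
# pickletools exposes the same names as this file; the stream contents are
# made up for illustration.

```python
import io
import pickletools

# Two arguments back to back: a uint2, then a string1.
stream = io.BytesIO(b"\xff\x00" + b"\x03abc")

first = pickletools.uint2.reader(stream)      # fixed width: 2 bytes
second = pickletools.string1.reader(stream)   # 1-byte count, then the bytes

fixed_width = pickletools.uint2.n
variable_width = pickletools.string1.n
```

# Each reader advances the stream past its own argument, so consecutive
# calls consume consecutive arguments.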
def read_unicodestringnl(f):
    r"""
    >>> import io
    >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd'
    True
    """

    data = f.readline()
    if not data.endswith(b'\n'):
        raise ValueError("no newline found when trying to read "
                         "unicodestringnl")
    data = data[:-1]    # lose the newline
    return str(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
    name='unicodestringnl',
    n=UP_TO_NEWLINE,
    reader=read_unicodestringnl,
    doc="""A newline-terminated Unicode string.

    This is raw-unicode-escape encoded, so consists of
    printable ASCII characters, and may contain embedded
    escape sequences.
    """)

def read_unicodestring4(f):
    r"""
    >>> import io
    >>> s = 'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> enc
    b'abcd\xea\xaf\x8d'
    >>> n = bytes([len(enc), 0, 0, 0])  # little-endian 4-byte length
    >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk'))
    >>> s == t
    True

    >>> read_unicodestring4(io.BytesIO(n + enc[:-1]))
    Traceback (most recent call last):
    ...
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("unicodestring4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) == n:
        return str(data, 'utf-8', 'surrogatepass')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
    name="unicodestring4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_unicodestring4,
    doc="""A counted Unicode string.

    The first argument is a 4-byte little-endian signed int
    giving the number of bytes in the string, and the second
    argument -- the UTF-8 encoding of the Unicode string --
    contains that many bytes.
    """)


def read_decimalnl_short(f):
    r"""
    >>> import io
    >>> read_decimalnl_short(io.BytesIO(b"1234\n56"))
    1234

    >>> read_decimalnl_short(io.BytesIO(b"1234L\n56"))
    Traceback (most recent call last):
    ...
    ValueError: trailing 'L' not allowed in b'1234L'
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s.endswith(b"L"):
        raise ValueError("trailing 'L' not allowed in %r" % s)

    # It's not necessarily true that the result fits in a Python short int:
    # the pickle may have been written on a 64-bit box. There's also a hack
    # for True and False here.
    if s == b"00":
        return False
    elif s == b"01":
        return True

    return int(s)

def read_decimalnl_long(f):
    r"""
    >>> import io

    >>> read_decimalnl_long(io.BytesIO(b"1234L\n56"))
    1234

    >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6"))
    123456789012345678901234
    """

    s = read_stringnl(f, decode=False, stripquotes=False)
    if s[-1:] == b'L':
        s = s[:-1]
    return int(s)


decimalnl_short = ArgumentDescriptor(
    name='decimalnl_short',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_short,
    doc="""A newline-terminated decimal integer literal.

    This never has a trailing 'L', and the integer fit
    in a short Python int on the box where the pickle
    was written -- but there's no guarantee it will fit
    in a short Python int on the box where the pickle
    is read.
    """)

decimalnl_long = ArgumentDescriptor(
    name='decimalnl_long',
    n=UP_TO_NEWLINE,
    reader=read_decimalnl_long,
    doc="""A newline-terminated decimal integer literal.

    This has a trailing 'L', and can represent integers
    of any size.
    """)


def read_floatnl(f):
    r"""
    >>> import io
    >>> read_floatnl(io.BytesIO(b"-1.25\n6"))
    -1.25
    """
    s = read_stringnl(f, decode=False, stripquotes=False)
    return float(s)

floatnl = ArgumentDescriptor(
    name='floatnl',
    n=UP_TO_NEWLINE,
    reader=read_floatnl,
    doc="""A newline-terminated decimal floating literal.

    In general this requires 17 significant digits for roundtrip
    identity, and pickling then unpickling infinities, NaNs, and
    minus zero doesn't work across boxes, or on some boxes even
    on itself (e.g., Windows can't read the strings it produces
    for infinities or NaNs).
    """)

def read_float8(f):
    r"""
    >>> import io, struct
    >>> raw = struct.pack(">d", -1.25)
    >>> raw
    b'\xbf\xf4\x00\x00\x00\x00\x00\x00'
    >>> read_float8(io.BytesIO(raw + b"\n"))
    -1.25
    """

    data = f.read(8)
    if len(data) == 8:
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")


float8 = ArgumentDescriptor(
    name='float8',
    n=8,
    reader=read_float8,
    doc="""An 8-byte binary representation of a float, big-endian.

    The format is unique to Python, and shared with the struct
    module (format string '>d') "in theory" (the struct and pickle
    implementations don't share the code -- they should). It's
    strongly related to the IEEE-754 double format, and, in normal
    cases, is in fact identical to the big-endian 754 double format.
    On other boxes the dynamic range is limited to that of a 754
    double, and "add a half and chop" rounding is used to reduce
    the precision to 53 bits. However, even on a 754 box,
    infinities, NaNs, and minus zero may not be handled correctly
    (may not survive roundtrip pickling intact).
    """)

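# The two float encodings described above can be compared side by side. This
# sketch assumes only the standard pickle and struct modules and a platform
# with IEEE-754 doubles; the variable names are illustrative.

```python
import pickle
import struct

value = -1.25
raw = struct.pack(">d", value)        # 8 bytes, big-endian IEEE-754 double

# Under protocol 1 a float is written as the BINFLOAT opcode (b"G"),
# its float8 argument, and then STOP (b".").
payload = pickle.dumps(value, protocol=1)
expected = b"G" + raw + b"."

# Protocol 0 uses the FLOAT opcode (b"F") with a newline-terminated
# decimal literal, i.e. a floatnl argument.
text_payload = pickle.dumps(value, protocol=0)
```

# The binary form has a fixed size regardless of the value, while the text
# form grows with the number of digits needed to round-trip.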
# Protocol 2 formats

from pickle import decode_long

def read_long1(f):
    r"""
    >>> import io
    >>> read_long1(io.BytesIO(b"\x00"))
    0
    >>> read_long1(io.BytesIO(b"\x02\xff\x00"))
    255
    >>> read_long1(io.BytesIO(b"\x02\xff\x7f"))
    32767
    >>> read_long1(io.BytesIO(b"\x02\x00\xff"))
    -256
    >>> read_long1(io.BytesIO(b"\x02\x00\x80"))
    -32768
    """

    n = read_uint1(f)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long1")
    return decode_long(data)

long1 = ArgumentDescriptor(
    name="long1",
    n=TAKEN_FROM_ARGUMENT1,
    reader=read_long1,
    doc="""A binary long, little-endian, using 1-byte size.

    This first reads one byte as an unsigned size, then reads that
    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the int 0.
    """)

def read_long4(f):
    r"""
    >>> import io
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00"))
    255
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f"))
    32767
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff"))
    -256
    >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80"))
    -32768
    >>> read_long4(io.BytesIO(b"\x00\x00\x00\x00"))
    0
    """

    n = read_int4(f)
    if n < 0:
        raise ValueError("long4 byte count < 0: %d" % n)
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long4")
    return decode_long(data)

long4 = ArgumentDescriptor(
    name="long4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_long4,
    doc="""A binary representation of a long, little-endian.

    This first reads four bytes as a signed size (but requires the
    size to be >= 0), then reads that many bytes and interprets them
    as a little-endian 2's-complement long. If the size is 0, that's taken
    as a shortcut for the int 0, although LONG1 should really be used
    then instead (and in any case where # of bytes < 256).
    """)


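# The little-endian 2's-complement interpretation used by long1 and long4 can
# be cross-checked against int.from_bytes. This is a sketch assuming the
# standard pickle module's decode_long helper (the one imported above); the
# sample byte strings mirror the doctests.

```python
import pickle

samples = [b"", b"\xff\x00", b"\xff\x7f", b"\x00\xff", b"\x00\x80"]

for data in samples:
    via_pickle = pickle.decode_long(data)
    # int.from_bytes with signed=True applies the same little-endian
    # 2's-complement interpretation.
    via_int = int.from_bytes(data, byteorder="little", signed=True)
    print(data, via_pickle, via_int)

decoded = [pickle.decode_long(data) for data in samples]
```

# The empty byte string decodes to 0, which is the size-0 shortcut mentioned
# in the long1 and long4 docs.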
##############################################################################
# Object descriptors. The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
    __slots__ = (
        # name of descriptor record, for info only
        'name',

        # type of object, or tuple of type objects (meaning the object can
        # be of any type in the tuple)
        'obtype',

        # human-readable docs for this kind of stack object; a string
        'doc',
    )

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)
        self.obtype = obtype

        assert isinstance(doc, str)
        self.doc = doc

    def __repr__(self):
        return self.name


720pyint = StackObject(
721 name='int',
722 obtype=int,
723 doc="A short (as opposed to long) Python integer object.")
724
725pylong = StackObject(
726 name='long',
Guido van Rossume2a383d2007-01-15 16:59:06 +0000727 obtype=int,
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000728 doc="A long (as opposed to short) Python integer object.")
729
730pyinteger_or_bool = StackObject(
731 name='int_or_bool',
Florent Xicluna02ea12b22010-07-28 16:39:41 +0000732 obtype=(int, bool),
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000733 doc="A Python integer object (short or long), or "
734 "a Python bool.")
735
Guido van Rossum5a2d8f52003-01-27 21:44:25 +0000736pybool = StackObject(
737 name='bool',
738 obtype=(bool,),
739 doc="A Python bool object.")
740
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000741pyfloat = StackObject(
742 name='float',
743 obtype=float,
744 doc="A Python float object.")
745
746pystring = StackObject(
Guido van Rossumf4169812008-03-17 22:56:06 +0000747 name='string',
748 obtype=bytes,
749 doc="A Python (8-bit) string object.")
750
751pybytes = StackObject(
Guido van Rossum98297ee2007-11-06 21:34:58 +0000752 name='bytes',
753 obtype=bytes,
754 doc="A Python bytes object.")
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000755
756pyunicode = StackObject(
Guido van Rossum98297ee2007-11-06 21:34:58 +0000757 name='str',
Guido van Rossumef87d6e2007-05-02 19:09:54 +0000758 obtype=str,
Guido van Rossumf4169812008-03-17 22:56:06 +0000759 doc="A Python (Unicode) string object.")
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000760
761pynone = StackObject(
762 name="None",
763 obtype=type(None),
764 doc="The Python None object.")
765
766pytuple = StackObject(
767 name="tuple",
768 obtype=tuple,
769 doc="A Python tuple object.")
770
771pylist = StackObject(
772 name="list",
773 obtype=list,
774 doc="A Python list object.")
775
776pydict = StackObject(
777 name="dict",
778 obtype=dict,
779 doc="A Python dict object.")
780
781anyobject = StackObject(
782 name='any',
783 obtype=object,
784 doc="Any kind of object whatsoever.")
785
786markobject = StackObject(
787 name="mark",
788 obtype=StackObject,
789 doc="""'The mark' is a unique object.
790
791 Opcodes that operate on a variable number of objects
792 generally don't embed the count of objects in the opcode,
793 or pull it off the stack. Instead the MARK opcode is used
794 to push a special marker object on the stack, and then
795 some other opcodes grab all the objects from the top of
796 the stack down to (but not including) the topmost marker
797 object.
798 """)
799
800stackslice = StackObject(
801 name="stackslice",
802 obtype=StackObject,
803 doc="""An object representing a contiguous slice of the stack.
804
    This is used in conjunction with markobject, to represent all
806 of the stack following the topmost markobject. For example,
807 the POP_MARK opcode changes the stack from
808
809 [..., markobject, stackslice]
810 to
811 [...]
812
    No matter how many objects are on the stack after the topmost
814 markobject, POP_MARK gets rid of all of them (including the
815 topmost markobject too).
816 """)
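
# A small standalone sketch of how mark-delimited opcodes consume a stack
# slice, following the markobject and stackslice descriptions above. The
# names MARK, pop_mark, op_list, op_tuple and op_dict are hypothetical
# helpers for illustration; the real unpickler is organized differently.

```python
MARK = object()  # stand-in for the pickle machine's unique markobject


def pop_mark(stack):
    """Remove the topmost MARK and everything above it; return the items above it."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] is MARK:
            items = stack[i + 1:]
            del stack[i:]
            return items
    raise ValueError("no MARK on the stack")


def op_list(stack):
    # LIST: [..., markobject, stackslice] -> [..., list]
    stack.append(pop_mark(stack))


def op_tuple(stack):
    # TUPLE: [..., markobject, stackslice] -> [..., tuple]
    stack.append(tuple(pop_mark(stack)))


def op_dict(stack):
    # DICT: the slice alternates key, value, key, value, ...
    items = pop_mark(stack)
    stack.append(dict(zip(items[0::2], items[1::2])))


stack = ["below", MARK, 1, 2, 3, "abc"]
op_list(stack)
print(stack)
```

# Because the slice is found by searching for MARK, no object count needs to
# be embedded in the opcode or pushed on the stack.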
817
818##############################################################################
819# Descriptors for pickle opcodes.
820
class OpcodeInfo(object):

    __slots__ = (
        # symbolic name of opcode; a string
        'name',

        # the code used in a bytestream to represent the opcode; a
        # one-character string
        'code',

        # If the opcode has an argument embedded in the byte string, an
        # instance of ArgumentDescriptor specifying its type. Note that
        # arg.reader(s) can be used to read and decode the argument from
        # the bytestream s, and arg.doc documents the format of the raw
        # argument bytes. If the opcode doesn't have an argument embedded
        # in the bytestream, arg should be None.
        'arg',

        # what the stack looks like before this opcode runs; a list
        'stack_before',

        # what the stack looks like after this opcode runs; a list
        'stack_after',

        # the protocol number in which this opcode was introduced; an int
        'proto',

        # human-readable docs for this opcode; a string
        'doc',
    )

    def __init__(self, name, code, arg,
                 stack_before, stack_after, proto, doc):
        assert isinstance(name, str)
        self.name = name

        assert isinstance(code, str)
        assert len(code) == 1
        self.code = code

        assert arg is None or isinstance(arg, ArgumentDescriptor)
        self.arg = arg

        assert isinstance(stack_before, list)
        for x in stack_before:
            assert isinstance(x, StackObject)
        self.stack_before = stack_before

        assert isinstance(stack_after, list)
        for x in stack_after:
            assert isinstance(x, StackObject)
        self.stack_after = stack_after

        assert isinstance(proto, int) and 0 <= proto <= 3
        self.proto = proto

        assert isinstance(doc, str)
        self.doc = doc
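
# A short sketch of how the OpcodeInfo attributes can be consumed from client
# code, meant to be run as a separate script that imports the installed
# pickletools module. The helper names highest_protocol and describe are
# hypothetical; genops() and the attributes used are the ones documented in
# this file.

```python
import pickle
import pickletools


def highest_protocol(data):
    """Return the highest .proto value among the opcodes in a pickle."""
    return max(opcode.proto for opcode, arg, pos in pickletools.genops(data))


def describe(data):
    """List (position, name, proto, stack_before, stack_after) per opcode."""
    rows = []
    for opcode, arg, pos in pickletools.genops(data):
        before = [obj.name for obj in opcode.stack_before]
        after = [obj.name for obj in opcode.stack_after]
        rows.append((pos, opcode.name, opcode.proto, before, after))
    return rows


for row in describe(pickle.dumps([1, 2], protocol=2)):
    print(row)
```

# The highest .proto value is the "protocol identifier" idea mentioned near
# the top of this file; dis() reports the same figure at the end of its
# output.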
879
880I = OpcodeInfo
881opcodes = [
882
883 # Ways to spell integers.
884
885 I(name='INT',
886 code='I',
887 arg=decimalnl_short,
888 stack_before=[],
889 stack_after=[pyinteger_or_bool],
890 proto=0,
891 doc="""Push an integer or bool.
892
893 The argument is a newline-terminated decimal literal string.
894
895 The intent may have been that this always fit in a short Python int,
896 but INT can be generated in pickles written on a 64-bit box that
897 require a Python long on a 32-bit box. The difference between this
898 and LONG then is that INT skips a trailing 'L', and produces a short
899 int whenever possible.
900
      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, the builtin names True and False were also added
      to 2.2.2, mapping to ints 1 and 0. For compatibility in both
      directions, True gets pickled as "I01\\n" (INT followed by "01\\n"),
      and False as "I00\\n". Leading zeroes are never produced for a genuine
      integer. The 2.3 (and later) unpicklers special-case these and return
      bool instead; earlier unpicklers ignore the leading "0" and return the
      int.
908 """),
909
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000910 I(name='BININT',
911 code='J',
912 arg=int4,
913 stack_before=[],
914 stack_after=[pyint],
915 proto=1,
916 doc="""Push a four-byte signed integer.
917
918 This handles the full range of Python (short) integers on a 32-bit
919 box, directly as binary bytes (1 for the opcode and 4 for the integer).
920 If the integer is non-negative and fits in 1 or 2 bytes, pickling via
921 BININT1 or BININT2 saves space.
922 """),
923
924 I(name='BININT1',
925 code='K',
926 arg=uint1,
927 stack_before=[],
928 stack_after=[pyint],
929 proto=1,
930 doc="""Push a one-byte unsigned integer.
931
932 This is a space optimization for pickling very small non-negative ints,
933 in range(256).
934 """),
935
936 I(name='BININT2',
937 code='M',
938 arg=uint2,
939 stack_before=[],
940 stack_after=[pyint],
941 proto=1,
942 doc="""Push a two-byte unsigned integer.
943
944 This is a space optimization for pickling small positive ints, in
945 range(256, 2**16). Integers in range(256) can also be pickled via
946 BININT2, but BININT1 instead saves a byte.
947 """),
948
Tim Petersfdc03462003-01-28 04:56:33 +0000949 I(name='LONG',
950 code='L',
951 arg=decimalnl_long,
952 stack_before=[],
953 stack_after=[pylong],
954 proto=0,
955 doc="""Push a long integer.
956
      The same as INT, except that the literal ends with 'L', and always
      unpickles to a Python long. There doesn't seem to be a real purpose to
      the trailing 'L'.
960
961 Note that LONG takes time quadratic in the number of digits when
962 unpickling (this is simply due to the nature of decimal->binary
963 conversion). Proto 2 added linear-time (in C; still quadratic-time
964 in Python) LONG1 and LONG4 opcodes.
965 """),
966
967 I(name="LONG1",
968 code='\x8a',
969 arg=long1,
970 stack_before=[],
971 stack_after=[pylong],
972 proto=2,
973 doc="""Long integer using one-byte length.
974
975 A more efficient encoding of a Python long; the long1 encoding
976 says it all."""),
977
978 I(name="LONG4",
979 code='\x8b',
980 arg=long4,
981 stack_before=[],
982 stack_after=[pylong],
983 proto=2,
      doc="""Long integer using four-byte length.
985
986 A more efficient encoding of a Python long; the long4 encoding
987 says it all."""),
988
Tim Peters8ecfc8e2003-01-27 18:51:48 +0000989 # Ways to spell strings (8-bit, not Unicode).
990
991 I(name='STRING',
992 code='S',
993 arg=stringnl,
994 stack_before=[],
995 stack_after=[pystring],
996 proto=0,
997 doc="""Push a Python string object.
998
      The argument is a repr-style string, with bracketing quote characters,
      and perhaps embedded escapes. The argument extends until the next
      newline character. (Actually, the bytes are decoded into a str
      instance using the encoding given to the Unpickler constructor, or the
      default, 'ASCII'.)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001004 """),
1005
1006 I(name='BINSTRING',
1007 code='T',
1008 arg=string4,
1009 stack_before=[],
1010 stack_after=[pystring],
1011 proto=1,
1012 doc="""Push a Python string object.
1013
      There are two arguments: the first is a 4-byte little-endian signed int
      giving the number of bytes in the string, and the second is that many
      bytes, which are taken literally as the string content. (Actually,
      they are decoded into a str instance using the encoding given to the
      Unpickler constructor, or the default, 'ASCII'.)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001019 """),
1020
1021 I(name='SHORT_BINSTRING',
1022 code='U',
1023 arg=string1,
1024 stack_before=[],
1025 stack_after=[pystring],
1026 proto=1,
1027 doc="""Push a Python string object.
1028
      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes in the string, and the second is that many bytes,
      which are taken literally as the string content. (Actually, they
      are decoded into a str instance using the encoding given to the
      Unpickler constructor, or the default, 'ASCII'.)
1034 """),
1035
1036 # Bytes (protocol 3 only; older protocols don't support bytes at all)
1037
1038 I(name='BINBYTES',
1039 code='B',
1040 arg=string4,
1041 stack_before=[],
1042 stack_after=[pybytes],
1043 proto=3,
1044 doc="""Push a Python bytes object.
1045
      There are two arguments: the first is a 4-byte little-endian signed int
      giving the number of bytes, and the second is that many bytes, which
      are taken literally as the bytes content.
1049 """),
1050
1051 I(name='SHORT_BINBYTES',
1052 code='C',
1053 arg=string1,
1054 stack_before=[],
1055 stack_after=[pybytes],
Collin Wintere61d4372009-05-20 17:46:47 +00001056 proto=3,
      doc="""Push a Python bytes object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes, and the second is that many bytes, which are
      taken literally as the bytes content.
1062 """),
1063
1064 # Ways to spell None.
1065
1066 I(name='NONE',
1067 code='N',
1068 arg=None,
1069 stack_before=[],
1070 stack_after=[pynone],
1071 proto=0,
1072 doc="Push None on the stack."),
1073
Tim Petersfdc03462003-01-28 04:56:33 +00001074 # Ways to spell bools, starting with proto 2. See INT for how this was
1075 # done before proto 2.
1076
1077 I(name='NEWTRUE',
1078 code='\x88',
1079 arg=None,
1080 stack_before=[],
1081 stack_after=[pybool],
1082 proto=2,
1083 doc="""True.
1084
1085 Push True onto the stack."""),
1086
1087 I(name='NEWFALSE',
1088 code='\x89',
1089 arg=None,
1090 stack_before=[],
1091 stack_after=[pybool],
1092 proto=2,
      doc="""False.
1094
1095 Push False onto the stack."""),
1096
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001097 # Ways to spell Unicode strings.
1098
1099 I(name='UNICODE',
1100 code='V',
1101 arg=unicodestringnl,
1102 stack_before=[],
1103 stack_after=[pyunicode],
1104 proto=0, # this may be pure-text, but it's a later addition
1105 doc="""Push a Python Unicode string object.
1106
1107 The argument is a raw-unicode-escape encoding of a Unicode string,
1108 and so may contain embedded escape sequences. The argument extends
1109 until the next newline character.
1110 """),
1111
1112 I(name='BINUNICODE',
1113 code='X',
1114 arg=unicodestring4,
1115 stack_before=[],
1116 stack_after=[pyunicode],
1117 proto=1,
1118 doc="""Push a Python Unicode string object.
1119
1120 There are two arguments: the first is a 4-byte little-endian signed int
1121 giving the number of bytes in the string. The second is that many
1122 bytes, and is the UTF-8 encoding of the Unicode string.
1123 """),
1124
1125 # Ways to spell floats.
1126
1127 I(name='FLOAT',
1128 code='F',
1129 arg=floatnl,
1130 stack_before=[],
1131 stack_after=[pyfloat],
1132 proto=0,
1133 doc="""Newline-terminated decimal float literal.
1134
1135 The argument is repr(a_float), and in general requires 17 significant
1136 digits for roundtrip conversion to be an identity (this is so for
1137 IEEE-754 double precision values, which is what Python float maps to
1138 on most boxes).
1139
1140 In general, FLOAT cannot be used to transport infinities, NaNs, or
1141 minus zero across boxes (or even on a single box, if the platform C
1142 library can't read the strings it produces for such things -- Windows
1143 is like that), but may do less damage than BINFLOAT on boxes with
1144 greater precision or dynamic range than IEEE-754 double.
1145 """),
1146
1147 I(name='BINFLOAT',
1148 code='G',
1149 arg=float8,
1150 stack_before=[],
1151 stack_after=[pyfloat],
1152 proto=1,
1153 doc="""Float stored in binary form, with 8 bytes of data.
1154
1155 This generally requires less than half the space of FLOAT encoding.
1156 In general, BINFLOAT cannot be used to transport infinities, NaNs, or
1157 minus zero, raises an exception if the exponent exceeds the range of
1158 an IEEE-754 double, and retains no more than 53 bits of precision (if
1159 there are more than that, "add a half and chop" rounding is used to
1160 cut it back to 53 significant bits).
1161 """),
1162
1163 # Ways to build lists.
1164
1165 I(name='EMPTY_LIST',
1166 code=']',
1167 arg=None,
1168 stack_before=[],
1169 stack_after=[pylist],
1170 proto=1,
1171 doc="Push an empty list."),
1172
1173 I(name='APPEND',
1174 code='a',
1175 arg=None,
1176 stack_before=[pylist, anyobject],
1177 stack_after=[pylist],
1178 proto=0,
1179 doc="""Append an object to a list.
1180
1181 Stack before: ... pylist anyobject
1182 Stack after: ... pylist+[anyobject]
Tim Peters81098ac2003-01-28 05:12:08 +00001183
1184 although pylist is really extended in-place.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001185 """),
1186
1187 I(name='APPENDS',
1188 code='e',
1189 arg=None,
1190 stack_before=[pylist, markobject, stackslice],
1191 stack_after=[pylist],
1192 proto=1,
1193 doc="""Extend a list by a slice of stack objects.
1194
1195 Stack before: ... pylist markobject stackslice
1196 Stack after: ... pylist+stackslice
Tim Peters81098ac2003-01-28 05:12:08 +00001197
1198 although pylist is really extended in-place.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001199 """),
1200
1201 I(name='LIST',
1202 code='l',
1203 arg=None,
1204 stack_before=[markobject, stackslice],
1205 stack_after=[pylist],
1206 proto=0,
1207 doc="""Build a list out of the topmost stack slice, after markobject.
1208
1209 All the stack entries following the topmost markobject are placed into
1210 a single Python list, which single list object replaces all of the
1211 stack from the topmost markobject onward. For example,
1212
1213 Stack before: ... markobject 1 2 3 'abc'
1214 Stack after: ... [1, 2, 3, 'abc']
1215 """),
1216
1217 # Ways to build tuples.
1218
1219 I(name='EMPTY_TUPLE',
1220 code=')',
1221 arg=None,
1222 stack_before=[],
1223 stack_after=[pytuple],
1224 proto=1,
1225 doc="Push an empty tuple."),
1226
1227 I(name='TUPLE',
1228 code='t',
1229 arg=None,
1230 stack_before=[markobject, stackslice],
1231 stack_after=[pytuple],
1232 proto=0,
1233 doc="""Build a tuple out of the topmost stack slice, after markobject.
1234
1235 All the stack entries following the topmost markobject are placed into
1236 a single Python tuple, which single tuple object replaces all of the
1237 stack from the topmost markobject onward. For example,
1238
1239 Stack before: ... markobject 1 2 3 'abc'
1240 Stack after: ... (1, 2, 3, 'abc')
1241 """),
1242
Tim Petersfdc03462003-01-28 04:56:33 +00001243 I(name='TUPLE1',
1244 code='\x85',
1245 arg=None,
1246 stack_before=[anyobject],
1247 stack_after=[pytuple],
1248 proto=2,
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001249 doc="""Build a one-tuple out of the topmost item on the stack.
Tim Petersfdc03462003-01-28 04:56:33 +00001250
1251 This code pops one value off the stack and pushes a tuple of
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001252 length 1 whose one item is that value back onto it. In other
1253 words:
Tim Petersfdc03462003-01-28 04:56:33 +00001254
1255 stack[-1] = tuple(stack[-1:])
1256 """),
1257
1258 I(name='TUPLE2',
1259 code='\x86',
1260 arg=None,
1261 stack_before=[anyobject, anyobject],
1262 stack_after=[pytuple],
1263 proto=2,
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001264 doc="""Build a two-tuple out of the top two items on the stack.
Tim Petersfdc03462003-01-28 04:56:33 +00001265
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001266 This code pops two values off the stack and pushes a tuple of
1267 length 2 whose items are those values back onto it. In other
1268 words:
Tim Petersfdc03462003-01-28 04:56:33 +00001269
1270 stack[-2:] = [tuple(stack[-2:])]
1271 """),
1272
1273 I(name='TUPLE3',
1274 code='\x87',
1275 arg=None,
1276 stack_before=[anyobject, anyobject, anyobject],
1277 stack_after=[pytuple],
1278 proto=2,
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001279 doc="""Build a three-tuple out of the top three items on the stack.
Tim Petersfdc03462003-01-28 04:56:33 +00001280
Alexander Belopolsky44c2ffd2010-07-16 14:39:45 +00001281 This code pops three values off the stack and pushes a tuple of
1282 length 3 whose items are those values back onto it. In other
1283 words:
Tim Petersfdc03462003-01-28 04:56:33 +00001284
1285 stack[-3:] = [tuple(stack[-3:])]
1286 """),
1287
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001288 # Ways to build dicts.
1289
1290 I(name='EMPTY_DICT',
1291 code='}',
1292 arg=None,
1293 stack_before=[],
1294 stack_after=[pydict],
1295 proto=1,
1296 doc="Push an empty dict."),
1297
1298 I(name='DICT',
1299 code='d',
1300 arg=None,
1301 stack_before=[markobject, stackslice],
1302 stack_after=[pydict],
1303 proto=0,
1304 doc="""Build a dict out of the topmost stack slice, after markobject.
1305
1306 All the stack entries following the topmost markobject are placed into
1307 a single Python dict, which single dict object replaces all of the
1308 stack from the topmost markobject onward. The stack slice alternates
1309 key, value, key, value, .... For example,
1310
1311 Stack before: ... markobject 1 2 3 'abc'
1312 Stack after: ... {1: 2, 3: 'abc'}
1313 """),
1314
1315 I(name='SETITEM',
1316 code='s',
1317 arg=None,
1318 stack_before=[pydict, anyobject, anyobject],
1319 stack_after=[pydict],
1320 proto=0,
1321 doc="""Add a key+value pair to an existing dict.
1322
1323 Stack before: ... pydict key value
1324 Stack after: ... pydict
1325
1326 where pydict has been modified via pydict[key] = value.
1327 """),
1328
1329 I(name='SETITEMS',
1330 code='u',
1331 arg=None,
1332 stack_before=[pydict, markobject, stackslice],
1333 stack_after=[pydict],
1334 proto=1,
1335 doc="""Add an arbitrary number of key+value pairs to an existing dict.
1336
1337 The slice of the stack following the topmost markobject is taken as
1338 an alternating sequence of keys and values, added to the dict
1339 immediately under the topmost markobject. Everything at and after the
1340 topmost markobject is popped, leaving the mutated dict at the top
1341 of the stack.
1342
1343 Stack before: ... pydict markobject key_1 value_1 ... key_n value_n
1344 Stack after: ... pydict
1345
1346 where pydict has been modified via pydict[key_i] = value_i for i in
1347 1, 2, ..., n, and in that order.
1348 """),
1349
1350 # Stack manipulation.
1351
1352 I(name='POP',
1353 code='0',
1354 arg=None,
1355 stack_before=[anyobject],
1356 stack_after=[],
1357 proto=0,
1358 doc="Discard the top stack item, shrinking the stack by one item."),
1359
1360 I(name='DUP',
1361 code='2',
1362 arg=None,
1363 stack_before=[anyobject],
1364 stack_after=[anyobject, anyobject],
1365 proto=0,
1366 doc="Push the top stack item onto the stack again, duplicating it."),
1367
1368 I(name='MARK',
1369 code='(',
1370 arg=None,
1371 stack_before=[],
1372 stack_after=[markobject],
1373 proto=0,
1374 doc="""Push markobject onto the stack.
1375
1376 markobject is a unique object, used by other opcodes to identify a
1377 region of the stack containing a variable number of objects for them
1378 to work on. See markobject.doc for more detail.
1379 """),
1380
1381 I(name='POP_MARK',
1382 code='1',
1383 arg=None,
1384 stack_before=[markobject, stackslice],
1385 stack_after=[],
Collin Wintere61d4372009-05-20 17:46:47 +00001386 proto=1,
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001387 doc="""Pop all the stack objects at and above the topmost markobject.
1388
1389 When an opcode using a variable number of stack objects is done,
1390 POP_MARK is used to remove those objects, and to remove the markobject
1391 that delimited their starting position on the stack.
1392 """),
1393
1394 # Memo manipulation. There are really only two operations (get and put),
1395 # each in all-text, "short binary", and "long binary" flavors.
1396
1397 I(name='GET',
1398 code='g',
1399 arg=decimalnl_short,
1400 stack_before=[],
1401 stack_after=[anyobject],
1402 proto=0,
1403 doc="""Read an object from the memo and push it on the stack.
1404
Ezio Melotti13925002011-03-16 11:05:33 +02001405 The index of the memo object to push is given by the newline-terminated
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001406 decimal string following. BINGET and LONG_BINGET are space-optimized
1407 versions.
1408 """),
1409
1410 I(name='BINGET',
1411 code='h',
1412 arg=uint1,
1413 stack_before=[],
1414 stack_after=[anyobject],
1415 proto=1,
1416 doc="""Read an object from the memo and push it on the stack.
1417
1418 The index of the memo object to push is given by the 1-byte unsigned
1419 integer following.
1420 """),
1421
1422 I(name='LONG_BINGET',
1423 code='j',
1424 arg=int4,
1425 stack_before=[],
1426 stack_after=[anyobject],
1427 proto=1,
1428 doc="""Read an object from the memo and push it on the stack.
1429
1430 The index of the memo object to push is given by the 4-byte signed
1431 little-endian integer following.
1432 """),
1433
1434 I(name='PUT',
1435 code='p',
1436 arg=decimalnl_short,
1437 stack_before=[],
1438 stack_after=[],
1439 proto=0,
1440 doc="""Store the stack top into the memo. The stack is not popped.
1441
1442 The index of the memo location to write into is given by the newline-
1443 terminated decimal string following. BINPUT and LONG_BINPUT are
1444 space-optimized versions.
1445 """),
1446
1447 I(name='BINPUT',
1448 code='q',
1449 arg=uint1,
1450 stack_before=[],
1451 stack_after=[],
1452 proto=1,
1453 doc="""Store the stack top into the memo. The stack is not popped.
1454
1455 The index of the memo location to write into is given by the 1-byte
1456 unsigned integer following.
1457 """),
1458
1459 I(name='LONG_BINPUT',
1460 code='r',
1461 arg=int4,
1462 stack_before=[],
1463 stack_after=[],
1464 proto=1,
1465 doc="""Store the stack top into the memo. The stack is not popped.
1466
1467 The index of the memo location to write into is given by the 4-byte
1468 signed little-endian integer following.
1469 """),
1470
Tim Petersfdc03462003-01-28 04:56:33 +00001471 # Access the extension registry (predefined objects). Akin to the GET
1472 # family.
1473
1474 I(name='EXT1',
1475 code='\x82',
1476 arg=uint1,
1477 stack_before=[],
1478 stack_after=[anyobject],
1479 proto=2,
1480 doc="""Extension code.
1481
1482 This code and the similar EXT2 and EXT4 allow using a registry
1483 of popular objects that are pickled by name, typically classes.
1484 It is envisioned that through a global negotiation and
1485 registration process, third parties can set up a mapping between
1486 ints and object names.
1487
1488 In order to guarantee pickle interchangeability, the extension
1489 code registry ought to be global, although a range of codes may
1490 be reserved for private use.
1491
1492 EXT1 has a 1-byte integer argument. This is used to index into the
1493 extension registry, and the object at that index is pushed on the stack.
1494 """),
1495
1496 I(name='EXT2',
1497 code='\x83',
1498 arg=uint2,
1499 stack_before=[],
1500 stack_after=[anyobject],
1501 proto=2,
1502 doc="""Extension code.
1503
1504 See EXT1. EXT2 has a two-byte integer argument.
1505 """),
1506
1507 I(name='EXT4',
1508 code='\x84',
1509 arg=int4,
1510 stack_before=[],
1511 stack_after=[anyobject],
1512 proto=2,
1513 doc="""Extension code.
1514
1515 See EXT1. EXT4 has a four-byte integer argument.
1516 """),
1517
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001518 # Push a class object, or module function, on the stack, via its module
1519 # and name.
1520
1521 I(name='GLOBAL',
1522 code='c',
1523 arg=stringnl_noescape_pair,
1524 stack_before=[],
1525 stack_after=[anyobject],
1526 proto=0,
1527 doc="""Push a global object (module.attr) on the stack.
1528
1529 Two newline-terminated strings follow the GLOBAL opcode. The first is
1530 taken as a module name, and the second as a class name. The class
1531 object module.class is pushed on the stack. More accurately, the
1532 object returned by self.find_class(module, class) is pushed on the
1533 stack, so unpickling subclasses can override this form of lookup.
1534 """),
1535
1536 # Ways to build objects of classes pickle doesn't know about directly
1537 # (user-defined classes). I despair of documenting this accurately
1538 # and comprehensibly -- you really have to read the pickle code to
1539 # find all the special cases.
1540
1541 I(name='REDUCE',
1542 code='R',
1543 arg=None,
1544 stack_before=[anyobject, anyobject],
1545 stack_after=[anyobject],
1546 proto=0,
1547 doc="""Push an object built from a callable and an argument tuple.
1548
      The opcode is named after the __reduce__() method.
1550
1551 Stack before: ... callable pytuple
1552 Stack after: ... callable(*pytuple)
1553
1554 The callable and the argument tuple are the first two items returned
1555 by a __reduce__ method. Applying the callable to the argtuple is
1556 supposed to reproduce the original object, or at least get it started.
1557 If the __reduce__ method returns a 3-tuple, the last component is an
1558 argument to be passed to the object's __setstate__, and then the REDUCE
1559 opcode is followed by code to create setstate's argument, and then a
1560 BUILD opcode to apply __setstate__ to that argument.
1561
Guido van Rossum13257902007-06-07 23:15:56 +00001562 If not isinstance(callable, type), REDUCE complains unless the
Alexandre Vassalottif7fa63d2008-05-11 08:55:36 +00001563 callable has been registered with the copyreg module's
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001564 safe_constructors dict, or the callable has a magic
1565 '__safe_for_unpickling__' attribute with a true value. I'm not sure
1566 why it does this, but I've sure seen this complaint often enough when
1567 I didn't want to <wink>.
1568 """),
1569
1570 I(name='BUILD',
1571 code='b',
1572 arg=None,
1573 stack_before=[anyobject, anyobject],
1574 stack_after=[anyobject],
1575 proto=0,
1576 doc="""Finish building an object, via __setstate__ or dict update.
1577
1578 Stack before: ... anyobject argument
1579 Stack after: ... anyobject
1580
1581 where anyobject may have been mutated, as follows:
1582
1583 If the object has a __setstate__ method,
1584
1585 anyobject.__setstate__(argument)
1586
1587 is called.
1588
1589 Else the argument must be a dict, the object must have a __dict__, and
1590 the object is updated via
1591
1592 anyobject.__dict__.update(argument)
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001593 """),
1594
1595 I(name='INST',
1596 code='i',
1597 arg=stringnl_noescape_pair,
1598 stack_before=[markobject, stackslice],
1599 stack_after=[anyobject],
1600 proto=0,
1601 doc="""Build a class instance.
1602
1603 This is the protocol 0 version of protocol 1's OBJ opcode.
1604 INST is followed by two newline-terminated strings, giving a
1605 module and class name, just as for the GLOBAL opcode (and see
1606 GLOBAL for more details about that). self.find_class(module, name)
1607 is used to get a class object.
1608
1609 In addition, all the objects on the stack following the topmost
1610 markobject are gathered into a tuple and popped (along with the
1611 topmost markobject), just as for the TUPLE opcode.
1612
1613 Now it gets complicated. If all of these are true:
1614
1615 + The argtuple is empty (markobject was at the top of the stack
1616 at the start).
1617
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001618 + The class object does not have a __getinitargs__ attribute.
1619
1620 then we want to create an old-style class instance without invoking
1621 its __init__() method (pickle has waffled on this over the years; not
1622 calling __init__() is current wisdom). In this case, an instance of
1623 an old-style dummy class is created, and then we try to rebind its
1624 __class__ attribute to the desired class object. If this succeeds,
Guido van Rossuma8add0e2007-05-14 22:03:55 +00001625 the new instance object is pushed on the stack, and we're done.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001626
1627 Else (the argtuple is not empty, it's not an old-style class object,
1628 or the class object does have a __getinitargs__ attribute), the code
1629 first insists that the class object have a __safe_for_unpickling__
      attribute. Unlike the __safe_for_unpickling__ check in REDUCE,
1631 it doesn't matter whether this attribute has a true or false value, it
Guido van Rossum99603b02007-07-20 00:22:32 +00001632 only matters whether it exists (XXX this is a bug). If
1633 __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001634
1635 Else (the class object does have a __safe_for_unpickling__ attr),
1636 the class object obtained from INST's arguments is applied to the
1637 argtuple obtained from the stack, and the resulting instance object
1638 is pushed on the stack.
Tim Peters2b93c4c2003-01-30 16:35:08 +00001639
1640 NOTE: checks for __safe_for_unpickling__ went away in Python 2.3.
Florent Xiclunaaa6c1d22011-12-12 18:54:29 +01001641 NOTE: the distinction between old-style and new-style classes does
1642 not make sense in Python 3.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00001643 """),
1644
    I(name='OBJ',
      code='o',
      arg=None,
      stack_before=[markobject, anyobject, stackslice],
      stack_after=[anyobject],
      proto=1,
      doc="""Build a class instance.

      This is the protocol 1 version of protocol 0's INST opcode, and is
      very much like it.  The major difference is that the class object
      is taken off the stack, allowing it to be retrieved from the memo
      repeatedly if several instances of the same class are created.  This
      can be much more efficient (in both time and space) than repeatedly
      embedding the module and class names in INST opcodes.

      Unlike INST, OBJ takes no arguments from the opcode stream.  Instead
      the class object is taken off the stack, immediately above the
      topmost markobject:

      Stack before: ... markobject classobject stackslice
      Stack after:  ... new_instance_object

      As for INST, the remainder of the stack above the markobject is
      gathered into an argument tuple, and then the logic seems identical,
      except that no __safe_for_unpickling__ check is done (XXX this is
      a bug).  See INST for the gory details.

      NOTE:  In Python 2.3, INST and OBJ are identical except for how they
      get the class object.  That was always the intent; the implementations
      had diverged for accidental reasons.
      """),

    I(name='NEWOBJ',
      code='\x81',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=2,
      doc="""Build an object instance.

      The stack before should be thought of as containing a class
      object followed by an argument tuple (the tuple being the stack
      top).  Call these cls and args.  They are popped off the stack,
      and the value returned by cls.__new__(cls, *args) is pushed back
      onto the stack.
      """),

    # Machine control.

    I(name='PROTO',
      code='\x80',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=2,
      doc="""Protocol version indicator.

      For protocol 2 and above, a pickle must start with this opcode.
      The argument is the protocol version, an int in range(2, 256).
      """),

    I(name='STOP',
      code='.',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="""Stop the unpickling machine.

      Every pickle ends with this opcode.  The object at the top of the stack
      is popped, and that's the result of unpickling.  The stack should be
      empty then.
      """),

    # Ways to deal with persistent IDs.

    I(name='PERSID',
      code='P',
      arg=stringnl_noescape,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object identified by a persistent ID.

      The pickle module doesn't define what a persistent ID means.  PERSID's
      argument is a newline-terminated str-style (no embedded escapes, no
      bracketing quote characters) string, which *is* "the persistent ID".
      The unpickler passes this string to self.persistent_load().  Whatever
      object that returns is pushed on the stack.  There is no implementation
      of persistent_load() in Python's unpickler:  it must be supplied by an
      unpickler subclass.
      """),

    I(name='BINPERSID',
      code='Q',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject],
      proto=1,
      doc="""Push an object identified by a persistent ID.

      Like PERSID, except the persistent ID is popped off the stack (instead
      of being a string embedded in the opcode bytestream).  The persistent
      ID is passed to self.persistent_load(), and whatever object that
      returns is pushed on the stack.  See PERSID for more detail.
      """),
]
del I

# Verify uniqueness of .name and .code members.
name2i = {}
code2i = {}

for i, d in enumerate(opcodes):
    if d.name in name2i:
        raise ValueError("repeated name %r at indices %d and %d" %
                         (d.name, name2i[d.name], i))
    if d.code in code2i:
        raise ValueError("repeated code %r at indices %d and %d" %
                         (d.code, code2i[d.code], i))

    name2i[d.name] = i
    code2i[d.code] = i

del name2i, code2i, i, d

##############################################################################
# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
# Also ensure we've got the same stuff as pickle.py, although the
# introspection here is dicey.

code2op = {}
for d in opcodes:
    code2op[d.code] = d
del d

def assure_pickle_consistency(verbose=False):

    copy = code2op.copy()
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            if verbose:
                print("skipping %r: it doesn't look like an opcode name" % name)
            continue
        picklecode = getattr(pickle, name)
        if not isinstance(picklecode, bytes) or len(picklecode) != 1:
            if verbose:
                print(("skipping %r: value %r doesn't look like a pickle "
                       "code" % (name, picklecode)))
            continue
        picklecode = picklecode.decode("latin-1")
        if picklecode in copy:
            if verbose:
                print("checking name %r w/ code %r for consistency" % (
                      name, picklecode))
            d = copy[picklecode]
            if d.name != name:
                raise ValueError("for pickle code %r, pickle.py uses name %r "
                                 "but we're using name %r" % (picklecode,
                                                              name,
                                                              d.name))
            # Forget this one.  Any left over in copy at the end are a problem
            # of a different kind.
            del copy[picklecode]
        else:
            raise ValueError("pickle.py appears to have a pickle opcode with "
                             "name %r and code %r, but we don't" %
                             (name, picklecode))
    if copy:
        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
        for code, d in copy.items():
            msg.append("    name %r with code %r" % (d.name, code))
        raise ValueError("\n".join(msg))

assure_pickle_consistency()
del assure_pickle_consistency

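
# Illustration only (not part of the module): a minimal sketch of the same
# name/code cross-check, written as a separate script against the installed
# pickle and pickletools modules.  The helper name opcode_constants() and the
# variable mismatches are hypothetical; the sketch assumes that opcode
# constants are exported through pickle.__all__ as one-byte bytes objects,
# which is what the check above relies on.

```python
import pickle
import pickletools
import re


def opcode_constants():
    """Yield (name, code) for each one-byte opcode constant exported by pickle."""
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            continue                    # not shaped like an opcode name
        value = getattr(pickle, name)
        if isinstance(value, bytes) and len(value) == 1:
            yield name, value.decode("latin-1")


# Collect every constant whose pickletools record is missing or misnamed.
mismatches = []
for name, code in opcode_constants():
    info = pickletools.code2op.get(code)
    if info is None or info.name != name:
        mismatches.append((name, code))

print("inconsistent opcodes:", mismatches)
```

# Because the module runs the check at import time, a successful
# "import pickletools" already implies that this list is empty.
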
##############################################################################
# A pickle opcode generator.

def genops(pickle):
    """Generate all the opcodes in a pickle.

    'pickle' is a file-like object, or bytes object, containing the pickle.

    Each opcode in the pickle is generated, from the current pickle position,
    stopping after a STOP opcode is delivered.  A triple is generated for
    each opcode:

        opcode, arg, pos

    opcode is an OpcodeInfo record, describing the current opcode.

    If the opcode has an argument embedded in the pickle, arg is its decoded
    value, as a Python object.  If the opcode doesn't have an argument, arg
    is None.

    If the pickle has a tell() method, pos was the value of pickle.tell()
    before reading the current opcode.  If the pickle is a bytes object,
    it's wrapped in a BytesIO object, and the latter's tell() result is
    used.  Else (the pickle doesn't have a tell(), and it's not obvious how
    to query its current position) pos is None.
    """

    if isinstance(pickle, bytes_types):
        import io
        pickle = io.BytesIO(pickle)

    if hasattr(pickle, "tell"):
        getpos = pickle.tell
    else:
        getpos = lambda: None

    while True:
        pos = getpos()
        code = pickle.read(1)
        opcode = code2op.get(code.decode("latin-1"))
        if opcode is None:
            if code == b"":
                raise ValueError("pickle exhausted before seeing STOP")
            else:
                raise ValueError("at position %s, opcode %r unknown" % (
                                 "<unknown>" if pos is None else pos,
                                 code))
        if opcode.arg is None:
            arg = None
        else:
            arg = opcode.arg.reader(pickle)
        yield opcode, arg, pos
        if code == b'.':
            assert opcode.name == 'STOP'
            break

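
# Illustration only (not part of the module): a minimal usage sketch for
# genops(), meant to run from a separate script.  The helpers opcode_names()
# and max_protocol() are hypothetical names; max_protocol() follows the
# "protocol identifier" idea from the comments at the top of this file
# (the highest .proto value among the opcodes in a pickle).

```python
import pickle
import pickletools


def opcode_names(data):
    """Return the names of the opcodes in the first pickle found in data."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]


def max_protocol(data):
    """Return the highest .proto value among the opcodes of the first pickle."""
    return max(opcode.proto for opcode, arg, pos in pickletools.genops(data))


data = pickle.dumps([1, 2], protocol=2)
names = opcode_names(data)

# Each triple carries the OpcodeInfo record, the decoded argument (or None),
# and the byte offset at which the opcode starts.
for opcode, arg, pos in pickletools.genops(data):
    print(pos, opcode.name, arg)
```

# genops() stops after the first STOP opcode, so a stream holding several
# pickles needs one genops() call per pickle on the same file-like object.
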
##############################################################################
# A pickle optimizer.

def optimize(p):
    'Optimize a pickle (a bytes object) by removing unused PUT opcodes'
    gets = set()            # set of args used by a GET opcode
    puts = []               # (arg, startpos, stoppos) for the PUT opcodes
    prevpos = None          # set to pos if previous opcode was a PUT
    for opcode, arg, pos in genops(p):
        if prevpos is not None:
            # The previous opcode was a PUT; it ends where this one starts.
            puts.append((prevarg, prevpos, pos))
            prevpos = None
        if 'PUT' in opcode.name:
            prevarg, prevpos = arg, pos
        elif 'GET' in opcode.name:
            gets.add(arg)

    # Copy the pickle except for PUTs without a corresponding GET
    s = []
    i = 0
    for arg, start, stop in puts:
        j = stop if (arg in gets) else start
        s.append(p[i:j])
        i = stop
    s.append(p[i:])
    return b''.join(s)

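
# Illustration only (not part of the module): a minimal sketch of what
# optimize() is expected to do, meant to run from a separate script.  The
# helper put_count() is a hypothetical name.  The sample object shares no
# references, so no GET opcode is needed and every PUT should be removable.

```python
import pickle
import pickletools


def put_count(data):
    """Count the PUT-family opcodes in the first pickle found in data."""
    return sum('PUT' in opcode.name
               for opcode, arg, pos in pickletools.genops(data))


obj = [1, 2, (3, 4)]
original = pickle.dumps(obj, protocol=2)
optimized = pickletools.optimize(original)

print("PUT opcodes before:", put_count(original))
print("PUT opcodes after: ", put_count(optimized))
print("bytes saved:", len(original) - len(optimized))
```

# The optimized pickle should still load to an equal object; only memo
# stores that no GET ever reads are dropped.
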
##############################################################################
# A symbolic pickle disassembler.

def dis(pickle, out=None, memo=None, indentlevel=4, annotate=0):
    """Produce a symbolic disassembly of a pickle.

    'pickle' is a file-like object, or bytes object, containing at least
    one pickle.  The pickle is disassembled from the current position,
    through the first STOP opcode encountered.

    Optional arg 'out' is a file-like object to which the disassembly is
    printed.  It defaults to sys.stdout.

    Optional arg 'memo' is a Python dict, used as the pickle's memo.  It
    may be mutated by dis(), if the pickle contains PUT, BINPUT or
    LONG_BINPUT opcodes.  Passing the same memo object to another dis()
    call then allows disassembly to proceed across multiple pickles that
    were all created by the same pickler with the same memo.  Ordinarily
    you don't need to worry about this.

    Optional arg 'indentlevel' is the number of blanks by which to indent
    a new MARK level.  It defaults to 4.

    Optional arg 'annotate', if nonzero, instructs dis() to add a short
    description of the opcode on each line of disassembled output.
    The value given to 'annotate' must be an integer and is used as a
    hint for the column where annotation should start.  The default
    value is 0, meaning no annotations.

    In addition to printing the disassembly, some sanity checks are made:

    + All embedded opcode arguments "make sense".

    + Explicit and implicit pop operations have enough items on the stack.

    + When an opcode implicitly refers to a markobject, a markobject is
      actually on the stack.

    + A memo entry isn't referenced before it's defined.

    + The markobject isn't stored in the memo.

    + A memo entry isn't redefined.
    """

    # Most of the hair here is for sanity checks, but most of it is needed
    # anyway to detect when a protocol 0 POP takes a MARK off the stack
    # (which in turn is needed to indent MARK blocks correctly).

    stack = []          # crude emulation of unpickler stack
    if memo is None:
        memo = {}       # crude emulation of unpickler memo
    maxproto = -1       # max protocol number seen
    markstack = []      # bytecode positions of MARK opcodes
    indentchunk = ' ' * indentlevel
    errormsg = None
    annocol = annotate  # column hint for annotations
    for opcode, arg, pos in genops(pickle):
        if pos is not None:
            print("%5d:" % pos, end=' ', file=out)

        line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
                              indentchunk * len(markstack),
                              opcode.name)

        maxproto = max(maxproto, opcode.proto)
        before = opcode.stack_before    # don't mutate
        after = opcode.stack_after      # don't mutate
        numtopop = len(before)

        # See whether a MARK should be popped.
        markmsg = None
        if markobject in before or (opcode.name == "POP" and
                                    stack and
                                    stack[-1] is markobject):
            assert markobject not in after
            if __debug__:
                if markobject in before:
                    assert before[-1] is stackslice
            if markstack:
                markpos = markstack.pop()
                if markpos is None:
                    markmsg = "(MARK at unknown opcode offset)"
                else:
                    markmsg = "(MARK at %d)" % markpos
                # Pop everything at and after the topmost markobject.
                while stack[-1] is not markobject:
                    stack.pop()
                stack.pop()
                # Stop later code from popping too much.
                try:
                    numtopop = before.index(markobject)
                except ValueError:
                    assert opcode.name == "POP"
                    numtopop = 0
            else:
                errormsg = markmsg = "no MARK exists on stack"

        # Check for correct memo usage.
        if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"):
            assert arg is not None
            if arg in memo:
                errormsg = "memo key %r already defined" % arg
            elif not stack:
                errormsg = "stack is empty -- can't store into memo"
            elif stack[-1] is markobject:
                errormsg = "can't store markobject in the memo"
            else:
                memo[arg] = stack[-1]

        elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
            if arg in memo:
                assert len(after) == 1
                after = [memo[arg]]     # for better stack emulation
            else:
                errormsg = "memo key %r has never been stored into" % arg

        if arg is not None or markmsg:
            # make a mild effort to align arguments
            line += ' ' * (10 - len(opcode.name))
            if arg is not None:
                line += ' ' + repr(arg)
            if markmsg:
                line += ' ' + markmsg
        if annotate:
            line += ' ' * (annocol - len(line))
            # make a mild effort to align annotations
            annocol = len(line)
            if annocol > 50:
                annocol = annotate
            line += ' ' + opcode.doc.split('\n', 1)[0]
        print(line, file=out)

        if errormsg:
            # Note that we delayed complaining until the offending opcode
            # was printed.
            raise ValueError(errormsg)

        # Emulate the stack effects.
        if len(stack) < numtopop:
            raise ValueError("tries to pop %d items from stack with "
                             "only %d items" % (numtopop, len(stack)))
        if numtopop:
            del stack[-numtopop:]
        if markobject in after:
            assert markobject not in before
            markstack.append(pos)

        stack.extend(after)

    print("highest protocol among opcodes =", maxproto, file=out)
    if stack:
        raise ValueError("stack not empty after STOP: %r" % stack)

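
# Illustration only (not part of the module): a minimal sketch of calling
# dis() from a separate script, capturing the listing in a string and
# observing one of the sanity checks.  The helper disassemble() is a
# hypothetical name, and b"0." is a hand-written, deliberately malformed
# pickle (a POP on an empty stack followed by STOP).

```python
import io
import pickle
import pickletools


def disassemble(data, annotate=0):
    """Return the dis() listing for data as a string instead of printing it."""
    buffer = io.StringIO()
    pickletools.dis(data, out=buffer, annotate=annotate)
    return buffer.getvalue()


listing = disassemble(pickle.dumps([1, 2], protocol=2))
print(listing)

# The stack emulation should reject a POP when nothing is on the stack.
try:
    disassemble(b"0.")
except ValueError as exc:
    problem = str(exc)
else:
    problem = None
print("malformed pickle reported:", problem)
```

# Passing annotate=30 (the value the command-line -a flag uses) appends the
# first line of each opcode's documentation to the corresponding output line.
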
# For use in the doctest, simply as an example of a class to pickle.
class _Example:
    def __init__(self, value):
        self.value = value

Guido van Rossum03e35322003-01-28 15:37:13 +00002063_dis_test = r"""
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002064>>> import pickle
Guido van Rossumf4169812008-03-17 22:56:06 +00002065>>> x = [1, 2, (3, 4), {b'abc': "def"}]
2066>>> pkl0 = pickle.dumps(x, 0)
2067>>> dis(pkl0)
Tim Petersd0f7c862003-01-28 15:27:57 +00002068 0: ( MARK
2069 1: l LIST (MARK at 0)
2070 2: p PUT 0
Guido van Rossumf4100002007-01-15 00:21:46 +00002071 5: L LONG 1
Mark Dickinson8dd05142009-01-20 20:43:58 +00002072 9: a APPEND
2073 10: L LONG 2
2074 14: a APPEND
2075 15: ( MARK
2076 16: L LONG 3
2077 20: L LONG 4
2078 24: t TUPLE (MARK at 15)
2079 25: p PUT 1
2080 28: a APPEND
2081 29: ( MARK
2082 30: d DICT (MARK at 29)
2083 31: p PUT 2
Alexandre Vassalotti3bfc65a2011-12-13 13:08:09 -05002084 34: c GLOBAL '_codecs encode'
2085 50: p PUT 3
2086 53: ( MARK
2087 54: V UNICODE 'abc'
Antoine Pitroud9dfaa92009-06-04 20:32:06 +00002088 59: p PUT 4
Alexandre Vassalotti3bfc65a2011-12-13 13:08:09 -05002089 62: V UNICODE 'latin1'
2090 70: p PUT 5
2091 73: t TUPLE (MARK at 53)
2092 74: p PUT 6
2093 77: R REDUCE
2094 78: p PUT 7
2095 81: V UNICODE 'def'
2096 86: p PUT 8
2097 89: s SETITEM
2098 90: a APPEND
2099 91: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002100highest protocol among opcodes = 0
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002101
2102Try again with a "binary" pickle.
2103
Guido van Rossumf4169812008-03-17 22:56:06 +00002104>>> pkl1 = pickle.dumps(x, 1)
2105>>> dis(pkl1)
Tim Petersd0f7c862003-01-28 15:27:57 +00002106 0: ] EMPTY_LIST
2107 1: q BINPUT 0
2108 3: ( MARK
2109 4: K BININT1 1
2110 6: K BININT1 2
2111 8: ( MARK
2112 9: K BININT1 3
2113 11: K BININT1 4
2114 13: t TUPLE (MARK at 8)
2115 14: q BINPUT 1
2116 16: } EMPTY_DICT
2117 17: q BINPUT 2
Alexandre Vassalotti3bfc65a2011-12-13 13:08:09 -05002118 19: c GLOBAL '_codecs encode'
2119 35: q BINPUT 3
2120 37: ( MARK
2121 38: X BINUNICODE 'abc'
2122 46: q BINPUT 4
2123 48: X BINUNICODE 'latin1'
2124 59: q BINPUT 5
2125 61: t TUPLE (MARK at 37)
2126 62: q BINPUT 6
2127 64: R REDUCE
2128 65: q BINPUT 7
2129 67: X BINUNICODE 'def'
2130 75: q BINPUT 8
2131 77: s SETITEM
2132 78: e APPENDS (MARK at 3)
2133 79: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002134highest protocol among opcodes = 1
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002135
2136Exercise the INST/OBJ/BUILD family.
2137
Mark Dickinsoncddcf442009-01-24 21:46:33 +00002138>>> import pickletools
2139>>> dis(pickle.dumps(pickletools.dis, 0))
2140 0: c GLOBAL 'pickletools dis'
2141 17: p PUT 0
2142 20: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002143highest protocol among opcodes = 0
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002144
Tim Peters90718a42005-02-15 16:22:34 +00002145>>> from pickletools import _Example
2146>>> x = [_Example(42)] * 2
Guido van Rossumf29d3d62003-01-27 22:47:53 +00002147>>> dis(pickle.dumps(x, 0))
Tim Petersd0f7c862003-01-28 15:27:57 +00002148 0: ( MARK
2149 1: l LIST (MARK at 0)
2150 2: p PUT 0
Antoine Pitroud9dfaa92009-06-04 20:32:06 +00002151 5: c GLOBAL 'copy_reg _reconstructor'
2152 30: p PUT 1
2153 33: ( MARK
2154 34: c GLOBAL 'pickletools _Example'
2155 56: p PUT 2
2156 59: c GLOBAL '__builtin__ object'
2157 79: p PUT 3
2158 82: N NONE
2159 83: t TUPLE (MARK at 33)
2160 84: p PUT 4
2161 87: R REDUCE
2162 88: p PUT 5
2163 91: ( MARK
2164 92: d DICT (MARK at 91)
2165 93: p PUT 6
2166 96: V UNICODE 'value'
2167 103: p PUT 7
2168 106: L LONG 42
2169 111: s SETITEM
2170 112: b BUILD
Mark Dickinson8dd05142009-01-20 20:43:58 +00002171 113: a APPEND
Antoine Pitroud9dfaa92009-06-04 20:32:06 +00002172 114: g GET 5
2173 117: a APPEND
2174 118: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002175highest protocol among opcodes = 0
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002176
2177>>> dis(pickle.dumps(x, 1))
Tim Petersd0f7c862003-01-28 15:27:57 +00002178 0: ] EMPTY_LIST
2179 1: q BINPUT 0
2180 3: ( MARK
Antoine Pitroud9dfaa92009-06-04 20:32:06 +00002181 4: c GLOBAL 'copy_reg _reconstructor'
2182 29: q BINPUT 1
2183 31: ( MARK
2184 32: c GLOBAL 'pickletools _Example'
2185 54: q BINPUT 2
2186 56: c GLOBAL '__builtin__ object'
2187 76: q BINPUT 3
2188 78: N NONE
2189 79: t TUPLE (MARK at 31)
2190 80: q BINPUT 4
2191 82: R REDUCE
2192 83: q BINPUT 5
2193 85: } EMPTY_DICT
2194 86: q BINPUT 6
2195 88: X BINUNICODE 'value'
2196 98: q BINPUT 7
2197 100: K BININT1 42
2198 102: s SETITEM
2199 103: b BUILD
2200 104: h BINGET 5
2201 106: e APPENDS (MARK at 3)
2202 107: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002203highest protocol among opcodes = 1
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002204
2205Try "the canonical" recursive-object test.
2206
2207>>> L = []
2208>>> T = L,
2209>>> L.append(T)
2210>>> L[0] is T
2211True
2212>>> T[0] is L
2213True
2214>>> L[0][0] is L
2215True
2216>>> T[0][0] is T
2217True
Guido van Rossumf29d3d62003-01-27 22:47:53 +00002218>>> dis(pickle.dumps(L, 0))
Tim Petersd0f7c862003-01-28 15:27:57 +00002219 0: ( MARK
2220 1: l LIST (MARK at 0)
2221 2: p PUT 0
2222 5: ( MARK
2223 6: g GET 0
2224 9: t TUPLE (MARK at 5)
2225 10: p PUT 1
2226 13: a APPEND
2227 14: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002228highest protocol among opcodes = 0
2229
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002230>>> dis(pickle.dumps(L, 1))
Tim Petersd0f7c862003-01-28 15:27:57 +00002231 0: ] EMPTY_LIST
2232 1: q BINPUT 0
2233 3: ( MARK
2234 4: h BINGET 0
2235 6: t TUPLE (MARK at 3)
2236 7: q BINPUT 1
2237 9: a APPEND
2238 10: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002239highest protocol among opcodes = 1
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002240
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002241Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
2242has to emulate the stack in order to realize that the POP opcode at 16 gets
2243rid of the MARK at 0.
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002244
Guido van Rossumf29d3d62003-01-27 22:47:53 +00002245>>> dis(pickle.dumps(T, 0))
Tim Petersd0f7c862003-01-28 15:27:57 +00002246 0: ( MARK
2247 1: ( MARK
2248 2: l LIST (MARK at 1)
2249 3: p PUT 0
2250 6: ( MARK
2251 7: g GET 0
2252 10: t TUPLE (MARK at 6)
2253 11: p PUT 1
2254 14: a APPEND
2255 15: 0 POP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002256 16: 0 POP (MARK at 0)
2257 17: g GET 1
2258 20: . STOP
2259highest protocol among opcodes = 0
2260
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002261>>> dis(pickle.dumps(T, 1))
Tim Petersd0f7c862003-01-28 15:27:57 +00002262 0: ( MARK
2263 1: ] EMPTY_LIST
2264 2: q BINPUT 0
2265 4: ( MARK
2266 5: h BINGET 0
2267 7: t TUPLE (MARK at 4)
2268 8: q BINPUT 1
2269 10: a APPEND
2270 11: 1 POP_MARK (MARK at 0)
2271 12: h BINGET 1
2272 14: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002273highest protocol among opcodes = 1
Tim Petersd0f7c862003-01-28 15:27:57 +00002274
2275Try protocol 2.
2276
2277>>> dis(pickle.dumps(L, 2))
2278 0: \x80 PROTO 2
2279 2: ] EMPTY_LIST
2280 3: q BINPUT 0
2281 5: h BINGET 0
2282 7: \x85 TUPLE1
2283 8: q BINPUT 1
2284 10: a APPEND
2285 11: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002286highest protocol among opcodes = 2
Tim Petersd0f7c862003-01-28 15:27:57 +00002287
2288>>> dis(pickle.dumps(T, 2))
2289 0: \x80 PROTO 2
2290 2: ] EMPTY_LIST
2291 3: q BINPUT 0
2292 5: h BINGET 0
2293 7: \x85 TUPLE1
2294 8: q BINPUT 1
2295 10: a APPEND
2296 11: 0 POP
2297 12: h BINGET 1
2298 14: . STOP
Tim Petersc1c2b3e2003-01-29 20:12:21 +00002299highest protocol among opcodes = 2
Alexander Belopolsky929d3842010-07-17 15:51:21 +00002300
2301Try protocol 3 with annotations:
2302
2303>>> dis(pickle.dumps(T, 3), annotate=1)
2304 0: \x80 PROTO 3 Protocol version indicator.
2305 2: ] EMPTY_LIST Push an empty list.
2306 3: q BINPUT 0 Store the stack top into the memo. The stack is not popped.
2307 5: h BINGET 0 Read an object from the memo and push it on the stack.
2308 7: \x85 TUPLE1 Build a one-tuple out of the topmost item on the stack.
2309 8: q BINPUT 1 Store the stack top into the memo. The stack is not popped.
2310 10: a APPEND Append an object to a list.
2311 11: 0 POP Discard the top stack item, shrinking the stack by one item.
2312 12: h BINGET 1 Read an object from the memo and push it on the stack.
2313 14: . STOP Stop the unpickling machine.
2314highest protocol among opcodes = 2
2315
Tim Peters8ecfc8e2003-01-27 18:51:48 +00002316"""
2317
Tim Peters62235e72003-02-05 19:55:53 +00002318_memo_test = r"""
2319>>> import pickle
Guido van Rossumcfe5f202007-05-08 21:26:54 +00002320>>> import io
2321>>> f = io.BytesIO()
Tim Peters62235e72003-02-05 19:55:53 +00002322>>> p = pickle.Pickler(f, 2)
2323>>> x = [1, 2, 3]
2324>>> p.dump(x)
2325>>> p.dump(x)
2326>>> f.seek(0)
Guido van Rossumcfe5f202007-05-08 21:26:54 +000023270
Tim Peters62235e72003-02-05 19:55:53 +00002328>>> memo = {}
2329>>> dis(f, memo=memo)
2330 0: \x80 PROTO 2
2331 2: ] EMPTY_LIST
2332 3: q BINPUT 0
2333 5: ( MARK
2334 6: K BININT1 1
2335 8: K BININT1 2
2336 10: K BININT1 3
2337 12: e APPENDS (MARK at 5)
2338 13: . STOP
2339highest protocol among opcodes = 2
2340>>> dis(f, memo=memo)
2341 14: \x80 PROTO 2
2342 16: h BINGET 0
2343 18: . STOP
2344highest protocol among opcodes = 2
2345"""
2346
__test__ = {'disassembler_test': _dis_test,
            'disassembler_memo_test': _memo_test,
           }

def _test():
    import doctest
    return doctest.testmod()

if __name__ == "__main__":
    import sys, argparse
    parser = argparse.ArgumentParser(
        description='disassemble one or more pickle files')
    parser.add_argument(
        'pickle_file', type=argparse.FileType('br'),
        nargs='*', help='the pickle file')
    parser.add_argument(
        '-o', '--output', default=sys.stdout, type=argparse.FileType('w'),
        help='the file where the output should be written')
    parser.add_argument(
        '-m', '--memo', action='store_true',
        help='preserve memo between disassemblies')
    parser.add_argument(
        '-l', '--indentlevel', default=4, type=int,
        help='the number of blanks by which to indent a new MARK level')
    parser.add_argument(
        '-a', '--annotate', action='store_true',
        help='annotate each line with a short opcode description')
    parser.add_argument(
        '-p', '--preamble', default="==> {name} <==",
        help='if more than one pickle file is specified, print this before'
        ' each disassembly')
    parser.add_argument(
        '-t', '--test', action='store_true',
        help='run self-test suite')
    parser.add_argument(
        '-v', action='store_true',
        help='run verbosely; only affects self-test run')
    args = parser.parse_args()
    if args.test:
        _test()
    else:
        annotate = 30 if args.annotate else 0
        if not args.pickle_file:
            parser.print_help()
        elif len(args.pickle_file) == 1:
            dis(args.pickle_file[0], args.output, None,
                args.indentlevel, annotate)
        else:
            memo = {} if args.memo else None
            for f in args.pickle_file:
                preamble = args.preamble.format(name=f.name)
                args.output.write(preamble + '\n')
                dis(f, args.output, memo, args.indentlevel, annotate)