Tim Peters8ecfc8e2003-01-27 18:51:48 +00001""""Executable documentation" for the pickle module.
2
3Extensive comments about the pickle protocols and pickle-machine opcodes
4can be found here. Some functions meant for external use:
5
6genops(pickle)
7 Generate all the opcodes in a pickle, as (opcode, arg, position) triples.
8
9dis(pickle, out=None, indentlevel=4)
10 Print a symbolic disassembly of a pickle.
11"""
12
13# Other ideas:
14#
15# - A pickle verifier: read a pickle and check it exhaustively for
16# well-formedness.
17#
18# - A protocol identifier: examine a pickle and return its protocol number
19# (== the highest .proto attr value among all the opcodes in the pickle).
20#
21# - A pickle optimizer: for example, tuple-building code is sometimes more
22# elaborate than necessary, catering for the possibility that the tuple
23# is recursive. Or lots of times a PUT is generated that's never accessed
24# by a later GET.
25
26
27"""
28"A pickle" is a program for a virtual pickle machine (PM, but more accurately
29called an unpickling machine). It's a sequence of opcodes, interpreted by the
30PM, building an arbitrarily complex Python object.
31
32For the most part, the PM is very simple: there are no looping, testing, or
33conditional instructions, no arithmetic and no function calls. Opcodes are
34executed once each, from first to last, until a STOP opcode is reached.
35
36The PM has two data areas, "the stack" and "the memo".
37
38Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
39integer object on the stack, whose value is gotten from a decimal string
40literal immediately following the INT opcode in the pickle bytestream. Other
41opcodes take Python objects off the stack. The result of unpickling is
42whatever object is left on the stack when the final STOP opcode is executed.
43
44The memo is simply an array of objects, or it can be implemented as a dict
45mapping little integers to objects. The memo serves as the PM's "long term
46memory", and the little integers indexing the memo are akin to variable
47names. Some opcodes store the top stack object in the memo at a given index,
48and others push a memo object at a given index onto the stack again.
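
As a rough illustration of the stack and the memo, here is a toy machine.
It is only a sketch: the opcode names are hypothetical stand-ins, not real
pickle opcodes, the code targets Python 3, and PUT is assumed to store the
top of the stack without removing it.

```python
def run_toy_pm(program):
    stack = []  # working area for objects under construction
    memo = {}   # "long term memory": little integer -> object
    for op, arg in program:
        if op == "PUSH_INT":
            stack.append(int(arg))   # arg is a decimal string literal
        elif op == "EMPTY_LIST":
            stack.append([])
        elif op == "APPEND":
            item = stack.pop()       # take the item off the stack ...
            stack[-1].append(item)   # ... and append it to the list below it
        elif op == "PUT":
            memo[arg] = stack[-1]    # remember the top object under index arg
        elif op == "GET":
            stack.append(memo[arg])  # push the remembered object again
        elif op == "STOP":
            return stack.pop()       # the result is whatever is left on top
        else:
            raise ValueError("unknown toy opcode %r" % (op,))
    raise ValueError("program ended without STOP")

program = [
    ("EMPTY_LIST", None),  # outer list
    ("EMPTY_LIST", None),  # inner list
    ("PUT", 0),            # memo[0] = inner
    ("PUSH_INT", "7"),
    ("APPEND", None),      # inner.append(7)
    ("APPEND", None),      # outer.append(inner)
    ("GET", 0),            # push inner again, from the memo
    ("APPEND", None),      # outer.append(inner)
    ("STOP", None),
]
result = run_toy_pm(program)
```

Here result is a list holding the same inner list twice, because the second
reference came out of the memo rather than being rebuilt.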
49
50At heart, that's all the PM has. Subtleties arise for these reasons:
51
52+ Object identity. Objects can be arbitrarily complex, and subobjects
53 may be shared (for example, the list [a, a] refers to the same object a
54 twice). It can be vital that unpickling recreate an isomorphic object
55 graph, faithfully reproducing sharing.
56
57+ Recursive objects. For example, after "L = []; L.append(L)", L is a
58 list, and L[0] is the same list. This is related to the object identity
59 point, and some sequences of pickle opcodes are subtle in order to
60 get the right result in all cases.
61
62+ Things pickle doesn't know everything about. Examples of things pickle
63 does know everything about are Python's builtin scalar and container
64 types, like ints and tuples. They generally have opcodes dedicated to
65 them. For things like module references and instances of user-defined
66 classes, pickle's knowledge is limited. Historically, many enhancements
67 have been made to the pickle protocol in order to do a better (faster,
68 and/or more compact) job on those.
69
70+ Backward compatibility and micro-optimization. As explained below,
71 pickle opcodes never go away, not even when better ways to do a thing
72 get invented. The repertoire of the PM just keeps growing over time.
73 So, e.g., there are now five distinct opcodes for building a Python integer,
74 four of them devoted to "short" integers. Even so, the only way to pickle
75 a Python long int takes time quadratic in the number of digits, for both
76 pickling and unpickling. This isn't so much a subtlety as a source of
77 wearying complication.
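
The object identity and recursion points above can be checked with a short
round trip. This is a sketch assuming Python 3's standard pickle module; the
variable names are made up for illustration.

```python
import pickle

# Shared subobject: both slots of the list refer to one object.
a = [1, 2]
pair = [a, a]
pair_copy = pickle.loads(pickle.dumps(pair))

# Recursive object: the list contains itself.
L = []
L.append(L)
L_copy = pickle.loads(pickle.dumps(L))
```

After unpickling, pair_copy[0] and pair_copy[1] are again one object, and
L_copy[0] is L_copy itself, so the object graph keeps its shape.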
78
79
80Pickle protocols:
81
82For compatibility, the meaning of a pickle opcode never changes. Instead new
83pickle opcodes get added, and each version's unpickler can handle all the
84pickle opcodes in all protocol versions to date. So old pickles continue to
85be readable forever. The pickler can generally be told to restrict itself to
86the subset of opcodes available under previous protocol versions too, so that
87users can create pickles under the current version readable by older
88versions. However, a pickle does not contain its version number embedded
89within it. If an older unpickler tries to read a pickle using a later
90protocol, the result is most likely an exception due to seeing an unknown (in
91the older unpickler) opcode.
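
A sketch of how a pickle's protocol can be inferred from its opcodes, and of
restricting the pickler to an older protocol. It assumes Python 3 with this
module importable as pickletools; highest_protocol is a hypothetical helper,
not part of the module.

```python
import pickle
import pickletools

def highest_protocol(data):
    # Highest .proto value among the opcodes actually present in the pickle.
    return max(op.proto for op, arg, pos in pickletools.genops(data))

obj = [1, "abc", 2.5]
old_style = pickle.dumps(obj, protocol=0)  # restricted to protocol 0 opcodes
newer = pickle.dumps(obj, protocol=1)      # may also use protocol 1 opcodes
```

An unpickler that knows only protocol 0 opcodes could read old_style, but
would be expected to fail on the first protocol 1 opcode in newer.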
92
93The original pickle used what's now called "protocol 0", and what was called
94"text mode" before Python 2.3. The entire pickle bytestream is made up of
95printable 7-bit ASCII characters, plus the newline character, in protocol 0.
96That's why it was called text mode.
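
The "printable ASCII plus newline" property can be checked directly. This is
a sketch assuming Python 3, where iterating over bytes yields integers;
is_text_mode_bytes is a hypothetical helper name.

```python
import pickle

def is_text_mode_bytes(data):
    # True if every byte is printable 7-bit ASCII or a newline.
    return all(32 <= b < 127 or b == 10 for b in data)

text_pickle = pickle.dumps([1, "abc", 2.5], protocol=0)
```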
97
98The second major set of additions is now called "protocol 1", and was called
99"binary mode" before Python 2.3. This added many opcodes with arguments
100consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
101bytes. Binary mode pickles can be substantially smaller than equivalent
102text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
103int as 4 bytes following the opcode, which is cheaper to unpickle than the
104(perhaps) 11-character decimal string attached to INT.
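
The size comparison can be made concrete with the struct module. This sketch
compares only the argument bytes that would follow the opcode, not complete
pickles; the variable names are illustrative.

```python
import struct

value = -2**31                              # most negative 4-byte signed int
text_argument = str(value)                  # decimal spelling, as INT would use
binary_argument = struct.pack("<i", value)  # 4-byte little-endian, as BININT

decoded = struct.unpack("<i", binary_argument)[0]
```

The decimal form needs 11 characters (plus a newline) and a string-to-int
conversion, while the binary form is always 4 bytes.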
105
106The third major set of additions came in Python 2.3, and is called "protocol
1072". XXX Write a short blurb when Guido figures out what they are <wink>. XXX
108"""
109
110# Meta-rule: Descriptions are stored in instances of descriptor objects,
111# with plain constructors. No meta-language is defined from which
112# descriptors could be constructed. If you want, e.g., XML, write a little
113# program to generate XML from the objects.
114
115##############################################################################
116# Some pickle opcodes have an argument, following the opcode in the
117# bytestream. An argument is of a specific type, described by an instance
118# of ArgumentDescriptor. These are not to be confused with arguments taken
119# off the stack -- ArgumentDescriptor applies only to arguments embedded in
120# the opcode stream, immediately following an opcode.
121
122# Represents the number of bytes consumed by an argument delimited by the
123# next newline character.
124UP_TO_NEWLINE = -1
125
126# Represents the number of bytes consumed by a two-argument opcode where
127# the first argument gives the number of bytes in the second argument.
128TAKEN_FROM_ARGUMENT1 = -2 # num bytes is 1-byte unsigned int
129TAKEN_FROM_ARGUMENT4 = -3 # num bytes is 4-byte signed little-endian int
130
131class ArgumentDescriptor(object):
132 __slots__ = (
133 # name of descriptor record, also a module global name; a string
134 'name',
135
136 # length of argument, in bytes; an int; UP_TO_NEWLINE and
137 # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length
138 # cases
139 'n',
140
141 # a function taking a file-like object, reading this kind of argument
142 # from the object at the current position, advancing the current
143 # position by n bytes, and returning the value of the argument
144 'reader',
145
146 # human-readable docs for this arg descriptor; a string
147 'doc',
148 )
149
150 def __init__(self, name, n, reader, doc):
151 assert isinstance(name, str)
152 self.name = name
153
154 assert isinstance(n, int) and (n >= 0 or
155 n in (UP_TO_NEWLINE,
156 TAKEN_FROM_ARGUMENT1,
157 TAKEN_FROM_ARGUMENT4))
158 self.n = n
159
160 self.reader = reader
161
162 assert isinstance(doc, str)
163 self.doc = doc
164
165from struct import unpack as _unpack
166
167def read_uint1(f):
168 """
169 >>> import StringIO
170 >>> read_uint1(StringIO.StringIO('\\xff'))
171 255
172 """
173
174 data = f.read(1)
175 if data:
176 return ord(data)
177 raise ValueError("not enough data in stream to read uint1")
178
179uint1 = ArgumentDescriptor(
180 name='uint1',
181 n=1,
182 reader=read_uint1,
183 doc="One-byte unsigned integer.")
184
185
186def read_uint2(f):
187 """
188 >>> import StringIO
189 >>> read_uint2(StringIO.StringIO('\\xff\\x00'))
190 255
191 >>> read_uint2(StringIO.StringIO('\\xff\\xff'))
192 65535
193 """
194
195 data = f.read(2)
196 if len(data) == 2:
197 return _unpack("<H", data)[0]
198 raise ValueError("not enough data in stream to read uint2")
199
200uint2 = ArgumentDescriptor(
201 name='uint2',
202 n=2,
203 reader=read_uint2,
204 doc="Two-byte unsigned integer, little-endian.")
205
206
207def read_int4(f):
208 """
209 >>> import StringIO
210 >>> read_int4(StringIO.StringIO('\\xff\\x00\\x00\\x00'))
211 255
212 >>> read_int4(StringIO.StringIO('\\x00\\x00\\x00\\x80')) == -(2**31)
213 True
214 """
215
216 data = f.read(4)
217 if len(data) == 4:
218 return _unpack("<i", data)[0]
219 raise ValueError("not enough data in stream to read int4")
220
221int4 = ArgumentDescriptor(
222 name='int4',
223 n=4,
224 reader=read_int4,
225 doc="Four-byte signed integer, little-endian, 2's complement.")
226
227
228def read_stringnl(f, decode=True, stripquotes=True):
229 """
230 >>> import StringIO
231 >>> read_stringnl(StringIO.StringIO("'abcd'\\nefg\\n"))
232 'abcd'
233
234 >>> read_stringnl(StringIO.StringIO("\\n"))
235 Traceback (most recent call last):
236 ...
237 ValueError: no string quotes around ''
238
239 >>> read_stringnl(StringIO.StringIO("\\n"), stripquotes=False)
240 ''
241
242 >>> read_stringnl(StringIO.StringIO("''\\n"))
243 ''
244
245 >>> read_stringnl(StringIO.StringIO('"abcd"'))
246 Traceback (most recent call last):
247 ...
248 ValueError: no newline found when trying to read stringnl
249
250 Embedded escapes are undone in the result.
251 >>> read_stringnl(StringIO.StringIO("'a\\\\nb\\x00c\\td'\\n'e'"))
252 'a\\nb\\x00c\\td'
253 """
254
255 data = f.readline()
256 if not data.endswith('\n'):
257 raise ValueError("no newline found when trying to read stringnl")
258 data = data[:-1] # lose the newline
259
260 if stripquotes:
261 for q in "'\"":
262 if data.startswith(q):
263 if not data.endswith(q):
264 raise ValueError("string quote %r not found at both "
265 "ends of %r" % (q, data))
266 data = data[1:-1]
267 break
268 else:
269 raise ValueError("no string quotes around %r" % data)
270
271 # I'm not sure when 'string_escape' was added to the std codecs; it's
272 # crazy not to use it if it's there.
273 if decode:
274 data = data.decode('string_escape')
275 return data
276
277stringnl = ArgumentDescriptor(
278 name='stringnl',
279 n=UP_TO_NEWLINE,
280 reader=read_stringnl,
281 doc="""A newline-terminated string.
282
283 This is a repr-style string, with embedded escapes, and
284 bracketing quotes.
285 """)
286
287def read_stringnl_noescape(f):
288 return read_stringnl(f, decode=False, stripquotes=False)
289
290stringnl_noescape = ArgumentDescriptor(
291 name='stringnl_noescape',
292 n=UP_TO_NEWLINE,
293 reader=read_stringnl_noescape,
294 doc="""A newline-terminated string.
295
296 This is a str-style string, without embedded escapes,
297 or bracketing quotes. It should consist solely of
298 printable ASCII characters.
299 """)
300
301def read_stringnl_noescape_pair(f):
302 """
303 >>> import StringIO
304 >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\\nEmpty\\njunk"))
305 'Queue Empty'
306 """
307
308 return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))
309
310stringnl_noescape_pair = ArgumentDescriptor(
311 name='stringnl_noescape_pair',
312 n=UP_TO_NEWLINE,
313 reader=read_stringnl_noescape_pair,
314 doc="""A pair of newline-terminated strings.
315
316 These are str-style strings, without embedded
317 escapes, or bracketing quotes. They should
318 consist solely of printable ASCII characters.
319 The pair is returned as a single string, with
320 a single blank separating the two strings.
321 """)
322
323def read_string4(f):
324 """
325 >>> import StringIO
326 >>> read_string4(StringIO.StringIO("\\x00\\x00\\x00\\x00abc"))
327 ''
328 >>> read_string4(StringIO.StringIO("\\x03\\x00\\x00\\x00abcdef"))
329 'abc'
330 >>> read_string4(StringIO.StringIO("\\x00\\x00\\x00\\x03abcdef"))
331 Traceback (most recent call last):
332 ...
333 ValueError: expected 50331648 bytes in a string4, but only 6 remain
334 """
335
336 n = read_int4(f)
337 if n < 0:
338 raise ValueError("string4 byte count < 0: %d" % n)
339 data = f.read(n)
340 if len(data) == n:
341 return data
342 raise ValueError("expected %d bytes in a string4, but only %d remain" %
343 (n, len(data)))
344
345string4 = ArgumentDescriptor(
346 name="string4",
347 n=TAKEN_FROM_ARGUMENT4,
348 reader=read_string4,
349 doc="""A counted string.
350
351 The first argument is a 4-byte little-endian signed int giving
352 the number of bytes in the string, and the second argument is
353 that many bytes.
354 """)
355
356
357def read_string1(f):
358 """
359 >>> import StringIO
360 >>> read_string1(StringIO.StringIO("\\x00"))
361 ''
362 >>> read_string1(StringIO.StringIO("\\x03abcdef"))
363 'abc'
364 """
365
366 n = read_uint1(f)
367 assert n >= 0
368 data = f.read(n)
369 if len(data) == n:
370 return data
371 raise ValueError("expected %d bytes in a string1, but only %d remain" %
372 (n, len(data)))
373
374string1 = ArgumentDescriptor(
375 name="string1",
376 n=TAKEN_FROM_ARGUMENT1,
377 reader=read_string1,
378 doc="""A counted string.
379
380 The first argument is a 1-byte unsigned int giving the number
381 of bytes in the string, and the second argument is that many
382 bytes.
383 """)
384
385
386def read_unicodestringnl(f):
387 """
388 >>> import StringIO
389 >>> read_unicodestringnl(StringIO.StringIO("abc\\uabcd\\njunk"))
390 u'abc\\uabcd'
391 """
392
393 data = f.readline()
394 if not data.endswith('\n'):
395 raise ValueError("no newline found when trying to read "
396 "unicodestringnl")
397 data = data[:-1] # lose the newline
398 return unicode(data, 'raw-unicode-escape')
399
400unicodestringnl = ArgumentDescriptor(
401 name='unicodestringnl',
402 n=UP_TO_NEWLINE,
403 reader=read_unicodestringnl,
404 doc="""A newline-terminated Unicode string.
405
406 This is raw-unicode-escape encoded, so consists of
407 printable ASCII characters, and may contain embedded
408 escape sequences.
409 """)
410
411def read_unicodestring4(f):
412 """
413 >>> import StringIO
414 >>> s = u'abcd\\uabcd'
415 >>> enc = s.encode('utf-8')
416 >>> enc
417 'abcd\\xea\\xaf\\x8d'
418 >>> n = chr(len(enc)) + chr(0) * 3 # little-endian 4-byte length
419 >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))
420 >>> s == t
421 True
422
423 >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
424 Traceback (most recent call last):
425 ...
426 ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
427 """
428
429 n = read_int4(f)
430 if n < 0:
431 raise ValueError("unicodestring4 byte count < 0: %d" % n)
432 data = f.read(n)
433 if len(data) == n:
434 return unicode(data, 'utf-8')
435 raise ValueError("expected %d bytes in a unicodestring4, but only %d "
436 "remain" % (n, len(data)))
437
438unicodestring4 = ArgumentDescriptor(
439 name="unicodestring4",
440 n=TAKEN_FROM_ARGUMENT4,
441 reader=read_unicodestring4,
442 doc="""A counted Unicode string.
443
444 The first argument is a 4-byte little-endian signed int
445 giving the number of bytes in the string, and the second
446 argument-- the UTF-8 encoding of the Unicode string --
447 contains that many bytes.
448 """)
449
450
451def read_decimalnl_short(f):
452 """
453 >>> import StringIO
454 >>> read_decimalnl_short(StringIO.StringIO("1234\\n56"))
455 1234
456
457 >>> read_decimalnl_short(StringIO.StringIO("1234L\\n56"))
458 Traceback (most recent call last):
459 ...
460 ValueError: trailing 'L' not allowed in '1234L'
461 """
462
463 s = read_stringnl(f, decode=False, stripquotes=False)
464 if s.endswith("L"):
465 raise ValueError("trailing 'L' not allowed in %r" % s)
466
467 # It's not necessarily true that the result fits in a Python short int:
468 # the pickle may have been written on a 64-bit box. There's also a hack
469 # for True and False here.
470 if s == "00":
471 return False
472 elif s == "01":
473 return True
474
475 try:
476 return int(s)
477 except OverflowError:
478 return long(s)
479
480def read_decimalnl_long(f):
481 """
482 >>> import StringIO
483
484 >>> read_decimalnl_long(StringIO.StringIO("1234\\n56"))
485 Traceback (most recent call last):
486 ...
487 ValueError: trailing 'L' required in '1234'
488
489 Someday the trailing 'L' will probably go away from this output.
490
491 >>> read_decimalnl_long(StringIO.StringIO("1234L\\n56"))
492 1234L
493
494 >>> read_decimalnl_long(StringIO.StringIO("123456789012345678901234L\\n6"))
495 123456789012345678901234L
496 """
497
498 s = read_stringnl(f, decode=False, stripquotes=False)
499 if not s.endswith("L"):
500 raise ValueError("trailing 'L' required in %r" % s)
501 return long(s)
502
503
504decimalnl_short = ArgumentDescriptor(
505 name='decimalnl_short',
506 n=UP_TO_NEWLINE,
507 reader=read_decimalnl_short,
508 doc="""A newline-terminated decimal integer literal.
509
510 This never has a trailing 'L', and the integer fit
511 in a short Python int on the box where the pickle
512 was written -- but there's no guarantee it will fit
513 in a short Python int on the box where the pickle
514 is read.
515 """)
516
517decimalnl_long = ArgumentDescriptor(
518 name='decimalnl_long',
519 n=UP_TO_NEWLINE,
520 reader=read_decimalnl_long,
521 doc="""A newline-terminated decimal integer literal.
522
523 This has a trailing 'L', and can represent integers
524 of any size.
525 """)
526
527
528def read_floatnl(f):
529 """
530 >>> import StringIO
531 >>> read_floatnl(StringIO.StringIO("-1.25\\n6"))
532 -1.25
533 """
534 s = read_stringnl(f, decode=False, stripquotes=False)
535 return float(s)
536
537floatnl = ArgumentDescriptor(
538 name='floatnl',
539 n=UP_TO_NEWLINE,
540 reader=read_floatnl,
541 doc="""A newline-terminated decimal floating literal.
542
543 In general this requires 17 significant digits for roundtrip
544 identity, and pickling then unpickling infinities, NaNs, and
545 minus zero doesn't work across boxes, or on some boxes even
546 on itself (e.g., Windows can't read the strings it produces
547 for infinities or NaNs).
548 """)
549
550def read_float8(f):
551 """
552 >>> import StringIO, struct
553 >>> raw = struct.pack(">d", -1.25)
554 >>> raw
555 '\\xbf\\xf4\\x00\\x00\\x00\\x00\\x00\\x00'
556 >>> read_float8(StringIO.StringIO(raw + "\\n"))
557 -1.25
558 """
559
560 data = f.read(8)
561 if len(data) == 8:
562 return _unpack(">d", data)[0]
563 raise ValueError("not enough data in stream to read float8")
564
565
566float8 = ArgumentDescriptor(
567 name='float8',
568 n=8,
569 reader=read_float8,
570 doc="""An 8-byte binary representation of a float, big-endian.
571
572 The format is unique to Python, and shared with the struct
573 module (format string '>d') "in theory" (the struct and cPickle
574 implementations don't share the code -- they should). It's
575 strongly related to the IEEE-754 double format, and, in normal
576 cases, is in fact identical to the big-endian 754 double format.
577 On other boxes the dynamic range is limited to that of a 754
578 double, and "add a half and chop" rounding is used to reduce
579 the precision to 53 bits. However, even on a 754 box,
580 infinities, NaNs, and minus zero may not be handled correctly
581 (may not survive roundtrip pickling intact).
582 """)
583
584# Protocol 2 formats
585
586def decode_long(data):
587 r"""Decode a long from a two's complement little-endian binary string.
588 >>> decode_long("\xff\x00")
589 255L
590 >>> decode_long("\xff\x7f")
591 32767L
592 >>> decode_long("\x00\xff")
593 -256L
594 >>> decode_long("\x00\x80")
595 -32768L
596 >>> decode_long("\x80")
597 -128L
598 >>> decode_long("\x7f")
599 127L
600 """
601 x = 0L
602 i = 0L
603 for c in data:
604 x |= long(ord(c)) << i
605 i += 8L
606 if data and ord(c) >= 0x80:
607 x -= 1L << i
608 return x
609
610def read_long1(f):
611 r"""
612 >>> import StringIO
613 >>> read_long1(StringIO.StringIO("\x02\xff\x00"))
614 255L
615 >>> read_long1(StringIO.StringIO("\x02\xff\x7f"))
616 32767L
617 >>> read_long1(StringIO.StringIO("\x02\x00\xff"))
618 -256L
619 >>> read_long1(StringIO.StringIO("\x02\x00\x80"))
620 -32768L
621 >>>
622 """
623
624 n = read_uint1(f)
625 data = f.read(n)
626 if len(data) != n:
627 raise ValueError("not enough data in stream to read long1")
628 return decode_long(data)
629
630long1 = ArgumentDescriptor(
631 name="long1",
632 n=TAKEN_FROM_ARGUMENT1,
633 reader=read_long1,
634 doc="""A binary long, little-endian, using 1-byte size.
635
636 This first reads one byte as an unsigned size, then reads that
637 many bytes and interprets them as a little-endian 2's-complement long.
638 """)
639
640def read_long4(f):
641 r"""
642 >>> import StringIO
643 >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x00"))
644 255L
645 >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x7f"))
646 32767L
647 >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\xff"))
648 -256L
649 >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\x80"))
650 -32768L
651 >>>
652 """
653
654 n = read_int4(f)
655 if n < 0:
656 raise ValueError("long4 byte count < 0: %d" % n)
657 data = f.read(n)
658 if len(data) != n:
659 raise ValueError("not enough data in stream to read long4")
660 return decode_long(data)
661
662long4 = ArgumentDescriptor(
663 name="long4",
664 n=TAKEN_FROM_ARGUMENT4,
665 reader=read_long4,
666 doc="""A binary representation of a long, little-endian.
667
668 This first reads four bytes as a signed size (but requires the
669 size to be >= 0), then reads that many bytes and interprets them
670 as a little-endian 2's-complement long.
671 """)
672
673
674##############################################################################
675# Object descriptors. The stack used by the pickle machine holds objects,
676# and in the stack_before and stack_after attributes of OpcodeInfo
677# descriptors we need names to describe the various types of objects that can
678# appear on the stack.
679
680class StackObject(object):
681 __slots__ = (
682 # name of descriptor record, for info only
683 'name',
684
685 # type of object, or tuple of type objects (meaning the object can
686 # be of any type in the tuple)
687 'obtype',
688
689 # human-readable docs for this kind of stack object; a string
690 'doc',
691 )
692
693 def __init__(self, name, obtype, doc):
694 assert isinstance(name, str)
695 self.name = name
696
697 assert isinstance(obtype, type) or isinstance(obtype, tuple)
698 if isinstance(obtype, tuple):
699 for contained in obtype:
700 assert isinstance(contained, type)
701 self.obtype = obtype
702
703 assert isinstance(doc, str)
704 self.doc = doc
705
706
707pyint = StackObject(
708 name='int',
709 obtype=int,
710 doc="A short (as opposed to long) Python integer object.")
711
712pylong = StackObject(
713 name='long',
714 obtype=long,
715 doc="A long (as opposed to short) Python integer object.")
716
717pyinteger_or_bool = StackObject(
718 name='int_or_bool',
719 obtype=(int, long, bool),
720 doc="A Python integer object (short or long), or "
721 "a Python bool.")
722
723pybool = StackObject(
724 name='bool',
725 obtype=(bool,),
726 doc="A Python bool object.")
727
728pyfloat = StackObject(
729 name='float',
730 obtype=float,
731 doc="A Python float object.")
732
733pystring = StackObject(
734 name='str',
735 obtype=str,
736 doc="A Python string object.")
737
738pyunicode = StackObject(
739 name='unicode',
740 obtype=unicode,
741 doc="A Python Unicode string object.")
742
743pynone = StackObject(
744 name="None",
745 obtype=type(None),
746 doc="The Python None object.")
747
748pytuple = StackObject(
749 name="tuple",
750 obtype=tuple,
751 doc="A Python tuple object.")
752
753pylist = StackObject(
754 name="list",
755 obtype=list,
756 doc="A Python list object.")
757
758pydict = StackObject(
759 name="dict",
760 obtype=dict,
761 doc="A Python dict object.")
762
763anyobject = StackObject(
764 name='any',
765 obtype=object,
766 doc="Any kind of object whatsoever.")
767
768markobject = StackObject(
769 name="mark",
770 obtype=StackObject,
771 doc="""'The mark' is a unique object.
772
773 Opcodes that operate on a variable number of objects
774 generally don't embed the count of objects in the opcode,
775 or pull it off the stack. Instead the MARK opcode is used
776 to push a special marker object on the stack, and then
777 some other opcodes grab all the objects from the top of
778 the stack down to (but not including) the topmost marker
779 object.
780 """)
781
782stackslice = StackObject(
783 name="stackslice",
784 obtype=StackObject,
785 doc="""An object representing a contiguous slice of the stack.
786
787 This is used in conjuction with markobject, to represent all
788 of the stack following the topmost markobject. For example,
789 the POP_MARK opcode changes the stack from
790
791 [..., markobject, stackslice]
792 to
793 [...]
794
795 No matter how many objects are on the stack after the topmost
796 markobject, POP_MARK gets rid of all of them (including the
797 topmost markobject too).
798 """)
799
800##############################################################################
801# Descriptors for pickle opcodes.
802
803class OpcodeInfo(object):
804
805 __slots__ = (
806 # symbolic name of opcode; a string
807 'name',
808
809 # the code used in a bytestream to represent the opcode; a
810 # one-character string
811 'code',
812
813 # If the opcode has an argument embedded in the byte string, an
814 # instance of ArgumentDescriptor specifying its type. Note that
815 # arg.reader(s) can be used to read and decode the argument from
816 # the bytestream s, and arg.doc documents the format of the raw
817 # argument bytes. If the opcode doesn't have an argument embedded
818 # in the bytestream, arg should be None.
819 'arg',
820
821 # what the stack looks like before this opcode runs; a list
822 'stack_before',
823
824 # what the stack looks like after this opcode runs; a list
825 'stack_after',
826
827 # the protocol number in which this opcode was introduced; an int
828 'proto',
829
830 # human-readable docs for this opcode; a string
831 'doc',
832 )
833
834 def __init__(self, name, code, arg,
835 stack_before, stack_after, proto, doc):
836 assert isinstance(name, str)
837 self.name = name
838
839 assert isinstance(code, str)
840 assert len(code) == 1
841 self.code = code
842
843 assert arg is None or isinstance(arg, ArgumentDescriptor)
844 self.arg = arg
845
846 assert isinstance(stack_before, list)
847 for x in stack_before:
848 assert isinstance(x, StackObject)
849 self.stack_before = stack_before
850
851 assert isinstance(stack_after, list)
852 for x in stack_after:
853 assert isinstance(x, StackObject)
854 self.stack_after = stack_after
855
856 assert isinstance(proto, int) and 0 <= proto <= 2
857 self.proto = proto
858
859 assert isinstance(doc, str)
860 self.doc = doc
861
862I = OpcodeInfo
863opcodes = [
864
865 # Ways to spell integers.
866
867 I(name='INT',
868 code='I',
869 arg=decimalnl_short,
870 stack_before=[],
871 stack_after=[pyinteger_or_bool],
872 proto=0,
873 doc="""Push an integer or bool.
874
875 The argument is a newline-terminated decimal literal string.
876
877 The intent may have been that this always fit in a short Python int,
878 but INT can be generated in pickles written on a 64-bit box that
879 require a Python long on a 32-bit box. The difference between this
880 and LONG then is that INT skips a trailing 'L', and produces a short
881 int whenever possible.
882
883 Another difference arises because, when bool was introduced as a
884 distinct type in 2.3, builtin names True and False were also added to
885 2.2.2, mapping to ints 1 and 0. For compatibility in both directions,
886 True gets pickled as "I01\\n" (INT + "01\\n"), and False as "I00\\n".
887 Leading zeroes are never produced for a genuine integer. The 2.3
888 (and later) unpicklers special-case these and return bool instead;
889 earlier unpicklers ignore the leading "0" and return the int.
890 """),
891
892 I(name='LONG',
893 code='L',
894 arg=decimalnl_long,
895 stack_before=[],
896 stack_after=[pylong],
897 proto=0,
898 doc="""Push a long integer.
899
900 The same as INT, except that the literal ends with 'L', and always
901 unpickles to a Python long. There doesn't seem to be a real purpose to the
902 trailing 'L'.
903 """),
904
905 I(name='BININT',
906 code='J',
907 arg=int4,
908 stack_before=[],
909 stack_after=[pyint],
910 proto=1,
911 doc="""Push a four-byte signed integer.
912
913 This handles the full range of Python (short) integers on a 32-bit
914 box, directly as binary bytes (1 for the opcode and 4 for the integer).
915 If the integer is non-negative and fits in 1 or 2 bytes, pickling via
916 BININT1 or BININT2 saves space.
917 """),
918
919 I(name='BININT1',
920 code='K',
921 arg=uint1,
922 stack_before=[],
923 stack_after=[pyint],
924 proto=1,
925 doc="""Push a one-byte unsigned integer.
926
927 This is a space optimization for pickling very small non-negative ints,
928 in range(256).
929 """),
930
931 I(name='BININT2',
932 code='M',
933 arg=uint2,
934 stack_before=[],
935 stack_after=[pyint],
936 proto=1,
937 doc="""Push a two-byte unsigned integer.
938
939 This is a space optimization for pickling small positive ints, in
940 range(256, 2**16). Integers in range(256) can also be pickled via
941 BININT2, but BININT1 instead saves a byte.
942 """),
943
944 # Ways to spell strings (8-bit, not Unicode).
945
946 I(name='STRING',
947 code='S',
948 arg=stringnl,
949 stack_before=[],
950 stack_after=[pystring],
951 proto=0,
952 doc="""Push a Python string object.
953
954 The argument is a repr-style string, with bracketing quote characters,
955 and perhaps embedded escapes. The argument extends until the next
956 newline character.
957 """),
958
959 I(name='BINSTRING',
960 code='T',
961 arg=string4,
962 stack_before=[],
963 stack_after=[pystring],
964 proto=1,
965 doc="""Push a Python string object.
966
967 There are two arguments: the first is a 4-byte little-endian signed int
968 giving the number of bytes in the string, and the second is that many
969 bytes, which are taken literally as the string content.
970 """),
971
972 I(name='SHORT_BINSTRING',
973 code='U',
974 arg=string1,
975 stack_before=[],
976 stack_after=[pystring],
977 proto=1,
978 doc="""Push a Python string object.
979
980 There are two arguments: the first is a 1-byte unsigned int giving
981 the number of bytes in the string, and the second is that many bytes,
982 which are taken literally as the string content.
983 """),
984
985 # Ways to spell None.
986
987 I(name='NONE',
988 code='N',
989 arg=None,
990 stack_before=[],
991 stack_after=[pynone],
992 proto=0,
993 doc="Push None on the stack."),
994
995 # Ways to spell Unicode strings.
996
997 I(name='UNICODE',
998 code='V',
999 arg=unicodestringnl,
1000 stack_before=[],
1001 stack_after=[pyunicode],
1002 proto=0, # this may be pure-text, but it's a later addition
1003 doc="""Push a Python Unicode string object.
1004
1005 The argument is a raw-unicode-escape encoding of a Unicode string,
1006 and so may contain embedded escape sequences. The argument extends
1007 until the next newline character.
1008 """),
1009
1010 I(name='BINUNICODE',
1011 code='X',
1012 arg=unicodestring4,
1013 stack_before=[],
1014 stack_after=[pyunicode],
1015 proto=1,
1016 doc="""Push a Python Unicode string object.
1017
1018 There are two arguments: the first is a 4-byte little-endian signed int
1019 giving the number of bytes in the string. The second is that many
1020 bytes, and is the UTF-8 encoding of the Unicode string.
1021 """),
1022
1023 # Ways to spell floats.
1024
1025 I(name='FLOAT',
1026 code='F',
1027 arg=floatnl,
1028 stack_before=[],
1029 stack_after=[pyfloat],
1030 proto=0,
1031 doc="""Newline-terminated decimal float literal.
1032
1033 The argument is repr(a_float), and in general requires 17 significant
1034 digits for roundtrip conversion to be an identity (this is so for
1035 IEEE-754 double precision values, which is what Python float maps to
1036 on most boxes).
1037
1038 In general, FLOAT cannot be used to transport infinities, NaNs, or
1039 minus zero across boxes (or even on a single box, if the platform C
1040 library can't read the strings it produces for such things -- Windows
1041 is like that), but may do less damage than BINFLOAT on boxes with
1042 greater precision or dynamic range than IEEE-754 double.
1043 """),
1044
1045 I(name='BINFLOAT',
1046 code='G',
1047 arg=float8,
1048 stack_before=[],
1049 stack_after=[pyfloat],
1050 proto=1,
1051 doc="""Float stored in binary form, with 8 bytes of data.
1052
1053 This generally requires less than half the space of FLOAT encoding.
1054 In general, BINFLOAT cannot be used to transport infinities, NaNs, or
1055 minus zero, raises an exception if the exponent exceeds the range of
1056 an IEEE-754 double, and retains no more than 53 bits of precision (if
1057 there are more than that, "add a half and chop" rounding is used to
1058 cut it back to 53 significant bits).
1059 """),
1060
1061 # Ways to build lists.
1062
1063 I(name='EMPTY_LIST',
1064 code=']',
1065 arg=None,
1066 stack_before=[],
1067 stack_after=[pylist],
1068 proto=1,
1069 doc="Push an empty list."),
1070
1071 I(name='APPEND',
1072 code='a',
1073 arg=None,
1074 stack_before=[pylist, anyobject],
1075 stack_after=[pylist],
1076 proto=0,
1077 doc="""Append an object to a list.
1078
1079 Stack before: ... pylist anyobject
1080 Stack after: ... pylist+[anyobject]
1081 """),
1082
1083 I(name='APPENDS',
1084 code='e',
1085 arg=None,
1086 stack_before=[pylist, markobject, stackslice],
1087 stack_after=[pylist],
1088 proto=1,
1089 doc="""Extend a list by a slice of stack objects.
1090
1091 Stack before: ... pylist markobject stackslice
1092 Stack after: ... pylist+stackslice
1093 """),
1094
1095 I(name='LIST',
1096 code='l',
1097 arg=None,
1098 stack_before=[markobject, stackslice],
1099 stack_after=[pylist],
1100 proto=0,
1101 doc="""Build a list out of the topmost stack slice, after markobject.
1102
1103 All the stack entries following the topmost markobject are placed into
1104 a single Python list, which single list object replaces all of the
1105 stack from the topmost markobject onward. For example,
1106
1107 Stack before: ... markobject 1 2 3 'abc'
1108 Stack after: ... [1, 2, 3, 'abc']
1109 """),
1110
1111 # Ways to build tuples.
1112
1113 I(name='EMPTY_TUPLE',
1114 code=')',
1115 arg=None,
1116 stack_before=[],
1117 stack_after=[pytuple],
1118 proto=1,
1119 doc="Push an empty tuple."),
1120
1121 I(name='TUPLE',
1122 code='t',
1123 arg=None,
1124 stack_before=[markobject, stackslice],
1125 stack_after=[pytuple],
1126 proto=0,
1127 doc="""Build a tuple out of the topmost stack slice, after markobject.
1128
1129 All the stack entries following the topmost markobject are placed into
1130 a single Python tuple, which single tuple object replaces all of the
1131 stack from the topmost markobject onward. For example,
1132
1133 Stack before: ... markobject 1 2 3 'abc'
1134 Stack after: ... (1, 2, 3, 'abc')
1135 """),
1136
1137 # Ways to build dicts.
1138
1139 I(name='EMPTY_DICT',
1140 code='}',
1141 arg=None,
1142 stack_before=[],
1143 stack_after=[pydict],
1144 proto=1,
1145 doc="Push an empty dict."),
1146
1147 I(name='DICT',
1148 code='d',
1149 arg=None,
1150 stack_before=[markobject, stackslice],
1151 stack_after=[pydict],
1152 proto=0,
1153 doc="""Build a dict out of the topmost stack slice, after markobject.
1154
1155 All the stack entries following the topmost markobject are placed into
1156 a single Python dict, which single dict object replaces all of the
1157 stack from the topmost markobject onward. The stack slice alternates
1158 key, value, key, value, .... For example,
1159
1160 Stack before: ... markobject 1 2 3 'abc'
1161 Stack after: ... {1: 2, 3: 'abc'}
1162 """),
1163
1164 I(name='SETITEM',
1165 code='s',
1166 arg=None,
1167 stack_before=[pydict, anyobject, anyobject],
1168 stack_after=[pydict],
1169 proto=0,
1170 doc="""Add a key+value pair to an existing dict.
1171
1172 Stack before: ... pydict key value
1173 Stack after: ... pydict
1174
1175 where pydict has been modified via pydict[key] = value.
1176 """),
1177
1178 I(name='SETITEMS',
1179 code='u',
1180 arg=None,
1181 stack_before=[pydict, markobject, stackslice],
1182 stack_after=[pydict],
1183 proto=1,
1184 doc="""Add an arbitrary number of key+value pairs to an existing dict.
1185
1186 The slice of the stack following the topmost markobject is taken as
1187 an alternating sequence of keys and values, added to the dict
1188 immediately under the topmost markobject. Everything at and after the
1189 topmost markobject is popped, leaving the mutated dict at the top
1190 of the stack.
1191
1192 Stack before: ... pydict markobject key_1 value_1 ... key_n value_n
1193 Stack after: ... pydict
1194
1195 where pydict has been modified via pydict[key_i] = value_i for i in
1196 1, 2, ..., n, and in that order.
1197 """),
1198
1199 # Stack manipulation.
1200
1201 I(name='POP',
1202 code='0',
1203 arg=None,
1204 stack_before=[anyobject],
1205 stack_after=[],
1206 proto=0,
1207 doc="Discard the top stack item, shrinking the stack by one item."),
1208
1209 I(name='DUP',
1210 code='2',
1211 arg=None,
1212 stack_before=[anyobject],
1213 stack_after=[anyobject, anyobject],
1214 proto=0,
1215 doc="Push the top stack item onto the stack again, duplicating it."),
1216
1217 I(name='MARK',
1218 code='(',
1219 arg=None,
1220 stack_before=[],
1221 stack_after=[markobject],
1222 proto=0,
1223 doc="""Push markobject onto the stack.
1224
1225 markobject is a unique object, used by other opcodes to identify a
1226 region of the stack containing a variable number of objects for them
1227 to work on. See markobject.doc for more detail.
1228 """),
1229
1230 I(name='POP_MARK',
1231 code='1',
1232 arg=None,
1233 stack_before=[markobject, stackslice],
1234 stack_after=[],
1235 proto=0,
1236 doc="""Pop all the stack objects at and above the topmost markobject.
1237
1238 When an opcode using a variable number of stack objects is done,
1239 POP_MARK is used to remove those objects, and to remove the markobject
1240 that delimited their starting position on the stack.
1241 """),
1242
1243 # Memo manipulation. There are really only two operations (get and put),
1244 # each in all-text, "short binary", and "long binary" flavors.
1245
    I(name='GET',
      code='g',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the newline-terminated
      decimal string following. BINGET and LONG_BINGET are space-optimized
      versions.
      """),
1258
1259 I(name='BINGET',
1260 code='h',
1261 arg=uint1,
1262 stack_before=[],
1263 stack_after=[anyobject],
1264 proto=1,
1265 doc="""Read an object from the memo and push it on the stack.
1266
1267 The index of the memo object to push is given by the 1-byte unsigned
1268 integer following.
1269 """),
1270
1271 I(name='LONG_BINGET',
1272 code='j',
1273 arg=int4,
1274 stack_before=[],
1275 stack_after=[anyobject],
1276 proto=1,
1277 doc="""Read an object from the memo and push it on the stack.
1278
1279 The index of the memo object to push is given by the 4-byte signed
1280 little-endian integer following.
1281 """),
1282
1283 I(name='PUT',
1284 code='p',
1285 arg=decimalnl_short,
1286 stack_before=[],
1287 stack_after=[],
1288 proto=0,
1289 doc="""Store the stack top into the memo. The stack is not popped.
1290
1291 The index of the memo location to write into is given by the newline-
1292 terminated decimal string following. BINPUT and LONG_BINPUT are
1293 space-optimized versions.
1294 """),
1295
1296 I(name='BINPUT',
1297 code='q',
1298 arg=uint1,
1299 stack_before=[],
1300 stack_after=[],
1301 proto=1,
1302 doc="""Store the stack top into the memo. The stack is not popped.
1303
1304 The index of the memo location to write into is given by the 1-byte
1305 unsigned integer following.
1306 """),
1307
1308 I(name='LONG_BINPUT',
1309 code='r',
1310 arg=int4,
1311 stack_before=[],
1312 stack_after=[],
1313 proto=1,
1314 doc="""Store the stack top into the memo. The stack is not popped.
1315
1316 The index of the memo location to write into is given by the 4-byte
1317 signed little-endian integer following.
1318 """),
1319
1320 # Push a class object, or module function, on the stack, via its module
1321 # and name.
1322
1323 I(name='GLOBAL',
1324 code='c',
1325 arg=stringnl_noescape_pair,
1326 stack_before=[],
1327 stack_after=[anyobject],
1328 proto=0,
1329 doc="""Push a global object (module.attr) on the stack.
1330
1331 Two newline-terminated strings follow the GLOBAL opcode. The first is
1332 taken as a module name, and the second as a class name. The class
1333 object module.class is pushed on the stack. More accurately, the
1334 object returned by self.find_class(module, class) is pushed on the
1335 stack, so unpickling subclasses can override this form of lookup.
1336 """),
1337
1338 # Ways to build objects of classes pickle doesn't know about directly
1339 # (user-defined classes). I despair of documenting this accurately
1340 # and comprehensibly -- you really have to read the pickle code to
1341 # find all the special cases.
1342
1343 I(name='REDUCE',
1344 code='R',
1345 arg=None,
1346 stack_before=[anyobject, anyobject],
1347 stack_after=[anyobject],
1348 proto=0,
1349 doc="""Push an object built from a callable and an argument tuple.
1350
      The opcode is named to remind you of the __reduce__() method.
1352
1353 Stack before: ... callable pytuple
1354 Stack after: ... callable(*pytuple)
1355
1356 The callable and the argument tuple are the first two items returned
1357 by a __reduce__ method. Applying the callable to the argtuple is
1358 supposed to reproduce the original object, or at least get it started.
1359 If the __reduce__ method returns a 3-tuple, the last component is an
1360 argument to be passed to the object's __setstate__, and then the REDUCE
1361 opcode is followed by code to create setstate's argument, and then a
1362 BUILD opcode to apply __setstate__ to that argument.
1363
1364 There are lots of special cases here. The argtuple can be None, in
1365 which case callable.__basicnew__() is called instead to produce the
1366 object to be pushed on the stack. This appears to be a trick unique
1367 to ExtensionClasses, and is deprecated regardless.
1368
1369 If type(callable) is not ClassType, REDUCE complains unless the
1370 callable has been registered with the copy_reg module's
1371 safe_constructors dict, or the callable has a magic
1372 '__safe_for_unpickling__' attribute with a true value. I'm not sure
1373 why it does this, but I've sure seen this complaint often enough when
1374 I didn't want to <wink>.
1375 """),
1376
1377 I(name='BUILD',
1378 code='b',
1379 arg=None,
1380 stack_before=[anyobject, anyobject],
1381 stack_after=[anyobject],
1382 proto=0,
1383 doc="""Finish building an object, via __setstate__ or dict update.
1384
1385 Stack before: ... anyobject argument
1386 Stack after: ... anyobject
1387
1388 where anyobject may have been mutated, as follows:
1389
1390 If the object has a __setstate__ method,
1391
1392 anyobject.__setstate__(argument)
1393
1394 is called.
1395
1396 Else the argument must be a dict, the object must have a __dict__, and
1397 the object is updated via
1398
1399 anyobject.__dict__.update(argument)
1400
1401 This may raise RuntimeError in restricted execution mode (which
1402 disallows access to __dict__ directly); in that case, the object
1403 is updated instead via
1404
1405 for k, v in argument.items():
1406 anyobject[k] = v
1407 """),
1408
1409 I(name='INST',
1410 code='i',
1411 arg=stringnl_noescape_pair,
1412 stack_before=[markobject, stackslice],
1413 stack_after=[anyobject],
1414 proto=0,
1415 doc="""Build a class instance.
1416
1417 This is the protocol 0 version of protocol 1's OBJ opcode.
1418 INST is followed by two newline-terminated strings, giving a
1419 module and class name, just as for the GLOBAL opcode (and see
1420 GLOBAL for more details about that). self.find_class(module, name)
1421 is used to get a class object.
1422
1423 In addition, all the objects on the stack following the topmost
1424 markobject are gathered into a tuple and popped (along with the
1425 topmost markobject), just as for the TUPLE opcode.
1426
1427 Now it gets complicated. If all of these are true:
1428
1429 + The argtuple is empty (markobject was at the top of the stack
1430 at the start).
1431
1432 + It's an old-style class object (the type of the class object is
1433 ClassType).
1434
1435 + The class object does not have a __getinitargs__ attribute.
1436
1437 then we want to create an old-style class instance without invoking
1438 its __init__() method (pickle has waffled on this over the years; not
1439 calling __init__() is current wisdom). In this case, an instance of
1440 an old-style dummy class is created, and then we try to rebind its
1441 __class__ attribute to the desired class object. If this succeeds,
1442 the new instance object is pushed on the stack, and we're done. In
1443 restricted execution mode it can fail (assignment to __class__ is
1444 disallowed), and I'm not really sure what happens then -- it looks
1445 like the code ends up calling the class object's __init__ anyway,
1446 via falling into the next case.
1447
1448 Else (the argtuple is not empty, it's not an old-style class object,
1449 or the class object does have a __getinitargs__ attribute), the code
1450 first insists that the class object have a __safe_for_unpickling__
1451 attribute. Unlike as for the __safe_for_unpickling__ check in REDUCE,
1452 it doesn't matter whether this attribute has a true or false value, it
1453 only matters whether it exists (XXX this smells like a bug). If
      __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.
1455
1456 Else (the class object does have a __safe_for_unpickling__ attr),
1457 the class object obtained from INST's arguments is applied to the
1458 argtuple obtained from the stack, and the resulting instance object
1459 is pushed on the stack.
1460 """),
1461
1462 I(name='OBJ',
1463 code='o',
1464 arg=None,
1465 stack_before=[markobject, anyobject, stackslice],
1466 stack_after=[anyobject],
1467 proto=1,
1468 doc="""Build a class instance.
1469
1470 This is the protocol 1 version of protocol 0's INST opcode, and is
1471 very much like it. The major difference is that the class object
1472 is taken off the stack, allowing it to be retrieved from the memo
1473 repeatedly if several instances of the same class are created. This
1474 can be much more efficient (in both time and space) than repeatedly
1475 embedding the module and class names in INST opcodes.
1476
1477 Unlike INST, OBJ takes no arguments from the opcode stream. Instead
1478 the class object is taken off the stack, immediately above the
1479 topmost markobject:
1480
1481 Stack before: ... markobject classobject stackslice
1482 Stack after: ... new_instance_object
1483
1484 As for INST, the remainder of the stack above the markobject is
1485 gathered into an argument tuple, and then the logic seems identical,
1486 except that no __safe_for_unpickling__ check is done (XXX this smells
1487 like a bug). See INST for the gory details.
1488 """),
1489
1490 # Machine control.
1491
1492 I(name='STOP',
1493 code='.',
1494 arg=None,
1495 stack_before=[anyobject],
1496 stack_after=[],
1497 proto=0,
1498 doc="""Stop the unpickling machine.
1499
1500 Every pickle ends with this opcode. The object at the top of the stack
1501 is popped, and that's the result of unpickling. The stack should be
1502 empty then.
1503 """),
1504
1505 # Ways to deal with persistent IDs.
1506
1507 I(name='PERSID',
1508 code='P',
1509 arg=stringnl_noescape,
1510 stack_before=[],
1511 stack_after=[anyobject],
1512 proto=0,
1513 doc="""Push an object identified by a persistent ID.
1514
1515 The pickle module doesn't define what a persistent ID means. PERSID's
1516 argument is a newline-terminated str-style (no embedded escapes, no
1517 bracketing quote characters) string, which *is* "the persistent ID".
1518 The unpickler passes this string to self.persistent_load(). Whatever
1519 object that returns is pushed on the stack. There is no implementation
1520 of persistent_load() in Python's unpickler: it must be supplied by an
1521 unpickler subclass.
1522 """),
1523
1524 I(name='BINPERSID',
1525 code='Q',
1526 arg=None,
1527 stack_before=[anyobject],
1528 stack_after=[anyobject],
1529 proto=1,
1530 doc="""Push an object identified by a persistent ID.
1531
1532 Like PERSID, except the persistent ID is popped off the stack (instead
1533 of being a string embedded in the opcode bytestream). The persistent
1534 ID is passed to self.persistent_load(), and whatever object that
1535 returns is pushed on the stack. See PERSID for more detail.
1536 """),
1538 # Protocol 2 opcodes
1539
1540 I(name='PROTO',
1541 code='\x80',
1542 arg=uint1,
1543 stack_before=[],
1544 stack_after=[],
1545 proto=2,
1546 doc="""Protocol version indicator.
1547
1548 For protocol 2 and above, a pickle must start with this opcode.
1549 The argument is the protocol version, an int in range(2, 256).
1550 """),
1551
1552 I(name='NEWOBJ',
1553 code='\x81',
1554 arg=None,
1555 stack_before=[anyobject, anyobject],
1556 stack_after=[anyobject],
1557 proto=2,
1558 doc="""Build an object instance.
1559
1560 The stack before should be thought of as containing a class
1561 object followed by an argument tuple (the tuple being the stack
1562 top). Call these cls and args. They are popped off the stack,
1563 and the value returned by cls.__new__(cls, *args) is pushed back
1564 onto the stack.
1565 """),
1566
1567 I(name='EXT1',
1568 code='\x82',
1569 arg=uint1,
1570 stack_before=[],
1571 stack_after=[anyobject],
1572 proto=2,
1573 doc="""Extension code.
1574
1575 This code and the similar EXT2 and EXT4 allow using a registry
1576 of popular objects that are pickled by name, typically classes.
1577 It is envisioned that through a global negotiation and
1578 registration process, third parties can set up a mapping between
1579 ints and object names.
1580
1581 In order to guarantee pickle interchangeability, the extension
1582 code registry ought to be global, although a range of codes may
1583 be reserved for private use.
1584 """),
1585
1586 I(name='EXT2',
1587 code='\x83',
1588 arg=uint2,
1589 stack_before=[],
1590 stack_after=[anyobject],
1591 proto=2,
1592 doc="""Extension code.
1593
1594 See EXT1.
1595 """),
1596
1597 I(name='EXT4',
1598 code='\x84',
1599 arg=int4,
1600 stack_before=[],
1601 stack_after=[anyobject],
1602 proto=2,
1603 doc="""Extension code.
1604
1605 See EXT1.
1606 """),
1607
1608 I(name='TUPLE1',
1609 code='\x85',
1610 arg=None,
1611 stack_before=[anyobject],
1612 stack_after=[pytuple],
1613 proto=2,
1614 doc="""One-tuple.
1615
1616 This code pops one value off the stack and pushes a tuple of
1617 length 1 whose one item is that value back onto it. IOW:
1618
1619 stack[-1] = tuple(stack[-1:])
1620 """),
1621
    I(name='TUPLE2',
      code='\x86',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Two-tuple.

      This code pops two values off the stack and pushes a tuple
      of length 2 whose items are those values back onto it. IOW:

          stack[-2:] = [tuple(stack[-2:])]
      """),
1635
    I(name='TUPLE3',
      code='\x87',
      arg=None,
      stack_before=[anyobject, anyobject, anyobject],
      stack_after=[pytuple],
      proto=2,
      doc="""Three-tuple.

      This code pops three values off the stack and pushes a tuple
      of length 3 whose items are those values back onto it. IOW:

          stack[-3:] = [tuple(stack[-3:])]
      """),
1649
1650 I(name='NEWTRUE',
1651 code='\x88',
1652 arg=None,
1653 stack_before=[],
1654 stack_after=[pybool],
1655 proto=2,
1656 doc="""True.
1657
1658 Push True onto the stack."""),
1659
    I(name='NEWFALSE',
      code='\x89',
      arg=None,
      stack_before=[],
      stack_after=[pybool],
      proto=2,
      doc="""False.

      Push False onto the stack."""),
1669
1670 I(name="LONG1",
1671 code='\x8a',
1672 arg=long1,
1673 stack_before=[],
1674 stack_after=[pylong],
1675 proto=2,
1676 doc="""Long integer using one-byte length.
1677
1678 A more efficient encoding of a Python long; the long1 encoding
1679 says it all."""),
1680
    I(name="LONG4",
      code='\x8b',
      arg=long4,
      stack_before=[],
      stack_after=[pylong],
      proto=2,
      doc="""Long integer using four-byte length.

      A more efficient encoding of a Python long; the long4 encoding
      says it all."""),
1691
]
1693del I
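
# A minimal sketch, not part of this module, of how the fixed-width integer
# arguments described above (uint1 for BININT1, uint2 for BININT2, int4 for
# BININT) can be decoded. It is written for Python 3, unlike the Python 2
# code in this file. The helper names are hypothetical; the little-endian
# byte order of uint2 is an assumption taken from the argument descriptors
# defined earlier in the file.

```python
import struct

def read_uint1(data):
    """Decode a BININT1-style argument: one unsigned byte."""
    return data[0]

def read_uint2(data):
    """Decode a BININT2-style argument: two bytes, unsigned, little-endian."""
    return struct.unpack("<H", data[:2])[0]

def read_int4(data):
    """Decode a BININT-style argument: four bytes, signed, little-endian."""
    return struct.unpack("<i", data[:4])[0]

def smallest_int_opcode(n):
    """Name the most compact of BININT1/BININT2/BININT able to hold n.

    Returns None when n does not fit in a signed four-byte integer.
    """
    if 0 <= n < 256:
        return "BININT1"
    if 256 <= n < 2**16:
        return "BININT2"
    if -2**31 <= n < 2**31:
        return "BININT"
    return None
```

# smallest_int_opcode mirrors the space optimization the docs describe:
# prefer the one-byte form, then the two-byte form, then the full four bytes.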
1694
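# The MARK-delimited opcodes documented in the table above (LIST, TUPLE,
# DICT, APPENDS, SETITEMS) all consume "the topmost stack slice". The sketch
# below is a toy emulation of that behavior in Python 3, assuming the stack
# is a plain Python list. MARK and the op_* function names are hypothetical
# stand-ins, not names used by pickle.py or by this module.

```python
MARK = object()  # hypothetical stand-in for the PM's unique markobject

def pop_to_mark(stack):
    """Remove the topmost MARK and everything above it; return that slice."""
    for k in range(len(stack) - 1, -1, -1):
        if stack[k] is MARK:
            items = stack[k + 1:]
            del stack[k:]
            return items
    raise ValueError("no MARK on the stack")

def op_list(stack):
    # LIST: ... markobject 1 2 3 'abc'  ->  ... [1, 2, 3, 'abc']
    stack.append(pop_to_mark(stack))

def op_tuple(stack):
    # TUPLE: ... markobject 1 2 3 'abc'  ->  ... (1, 2, 3, 'abc')
    stack.append(tuple(pop_to_mark(stack)))

def op_dict(stack):
    # DICT: the slice alternates key, value, key, value, ...
    items = pop_to_mark(stack)
    stack.append(dict(zip(items[0::2], items[1::2])))

def op_appends(stack):
    # APPENDS: ... pylist markobject stackslice  ->  ... pylist+stackslice
    items = pop_to_mark(stack)
    stack[-1].extend(items)

def op_setitems(stack):
    # SETITEMS: apply pydict[key_i] = value_i in stack order.
    items = pop_to_mark(stack)
    for key, value in zip(items[0::2], items[1::2]):
        stack[-1][key] = value

stack = [MARK, 1, 2, 3, 'abc']
op_dict(stack)
print(stack)  # [{1: 2, 3: 'abc'}]
```

# Only the slice handling is modeled here; the memo and all other opcodes
# are deliberately left out.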
# Verify uniqueness of .name and .code members.
name2i = {}
code2i = {}

for i, d in enumerate(opcodes):
    if d.name in name2i:
        raise ValueError("repeated name %r at indices %d and %d" %
                         (d.name, name2i[d.name], i))
    if d.code in code2i:
        raise ValueError("repeated code %r at indices %d and %d" %
                         (d.code, code2i[d.code], i))

    name2i[d.name] = i
    code2i[d.code] = i

del name2i, code2i, i, d
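
# The uniqueness check above raises at the first repeat. As a hedged
# alternative, the sketch below collects every repeated value instead. It
# assumes Python 3 and runs against the standard library's pickletools.opcodes
# list rather than the list in this file; find_duplicates is a hypothetical
# helper name.

```python
import collections
import pickletools

def find_duplicates(values):
    """Return a sorted list of the values that occur more than once."""
    counts = collections.Counter(values)
    return sorted(v for v, n in counts.items() if n > 1)

# Both lists are expected to be empty for a well-formed opcode table.
dup_names = find_duplicates(op.name for op in pickletools.opcodes)
dup_codes = find_duplicates(op.code for op in pickletools.opcodes)
print(dup_names, dup_codes)
```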
1711
##############################################################################
# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
# Also ensure we've got the same stuff as pickle.py, although the
# introspection here is dicey.

code2op = {}
for d in opcodes:
    code2op[d.code] = d
del d
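
# A short usage sketch for a code2op-style mapping, assuming Python 3's
# standard pickletools module, where code2op is keyed by one-character
# strings. describe is a hypothetical helper, not part of the module.

```python
import pickletools

def describe(code):
    """Return (name, argument kind, protocol) for a one-character opcode."""
    op = pickletools.code2op[code]
    arg_name = op.arg.name if op.arg is not None else None
    return op.name, arg_name, op.proto

print(describe("K"))  # ('BININT1', 'uint1', 1)
print(describe("."))  # ('STOP', None, 0)
```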
1721
def assure_pickle_consistency(verbose=False):
    import pickle, re

    copy = code2op.copy()
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            if verbose:
                print "skipping %r: it doesn't look like an opcode name" % name
            continue
        picklecode = getattr(pickle, name)
        if not isinstance(picklecode, str) or len(picklecode) != 1:
            if verbose:
                print ("skipping %r: value %r doesn't look like a pickle "
                       "code" % (name, picklecode))
            continue
        if picklecode in copy:
            if verbose:
                print "checking name %r w/ code %r for consistency" % (
                      name, picklecode)
            d = copy[picklecode]
            if d.name != name:
                raise ValueError("for pickle code %r, pickle.py uses name %r "
                                 "but we're using name %r" % (picklecode,
                                                              name,
                                                              d.name))
            # Forget this one. Any left over in copy at the end are a problem
            # of a different kind.
            del copy[picklecode]
        else:
            raise ValueError("pickle.py appears to have a pickle opcode with "
                             "name %r and code %r, but we don't" %
                             (name, picklecode))
    if copy:
        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
        for code, d in copy.items():
            msg.append(" name %r with code %r" % (d.name, code))
        raise ValueError("\n".join(msg))

assure_pickle_consistency()
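
# A hedged Python 3 sketch of the same cross-check that reports discrepancies
# instead of raising. It assumes Python 3's pickle module, where opcode
# constants are length-1 bytes objects rather than the length-1 str values
# tested above; pickle_opcode_mismatches is a hypothetical name.

```python
import pickle
import pickletools
import re

def pickle_opcode_mismatches():
    """Return a list of human-readable discrepancies (empty if consistent)."""
    remaining = dict(pickletools.code2op)
    problems = []
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            continue  # doesn't look like an opcode name
        value = getattr(pickle, name)
        if not isinstance(value, bytes) or len(value) != 1:
            continue  # doesn't look like a pickle code
        code = value.decode("latin-1")
        op = remaining.pop(code, None)
        if op is None:
            problems.append("pickle has %s=%r but pickletools does not"
                            % (name, value))
        elif op.name != name:
            problems.append("code %r is %s in pickle but %s in pickletools"
                            % (value, name, op.name))
    # Anything left over is known to pickletools but not to pickle.
    for code, op in remaining.items():
        problems.append("pickletools has %s=%r but pickle does not"
                        % (op.name, code))
    return problems

print(pickle_opcode_mismatches())
```

# Returning a list makes the check usable from a test suite without relying
# on an import-time exception.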
1761
1762##############################################################################
1763# A pickle opcode generator.
1764
def genops(pickle):
    """Generate all the opcodes in a pickle.

    'pickle' is a file-like object, or string, containing the pickle.

    Each opcode in the pickle is generated, from the current pickle position,
    stopping after a STOP opcode is delivered. A triple is generated for
    each opcode:

        opcode, arg, pos

    opcode is an OpcodeInfo record, describing the current opcode.

    If the opcode has an argument embedded in the pickle, arg is its decoded
    value, as a Python object. If the opcode doesn't have an argument, arg
    is None.

    If the pickle has a tell() method, pos was the value of pickle.tell()
    before reading the current opcode. If the pickle is a string object,
    it's wrapped in a StringIO object, and the latter's tell() result is
    used. Else (the pickle doesn't have a tell(), and it's not obvious how
    to query its current position) pos is None.
    """

    import cStringIO as StringIO

    if isinstance(pickle, str):
        pickle = StringIO.StringIO(pickle)

    if hasattr(pickle, "tell"):
        getpos = pickle.tell
    else:
        getpos = lambda: None

    while True:
        pos = getpos()
        code = pickle.read(1)
        opcode = code2op.get(code)
        if opcode is None:
            if code == "":
                raise ValueError("pickle exhausted before seeing STOP")
            else:
                raise ValueError("at position %s, opcode %r unknown" % (
                                 pos is None and "<unknown>" or pos,
                                 code))
        if opcode.arg is None:
            arg = None
        else:
            arg = opcode.arg.reader(pickle)
        yield opcode, arg, pos
        if code == '.':
            assert opcode.name == 'STOP'
            break
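
# A usage sketch for genops, assuming Python 3's standard pickletools module
# (whose genops accepts a bytes object) rather than the Python 2 function
# above. The pickle is written out by hand so the example does not depend on
# what any particular pickler emits; summarize is a hypothetical helper.

```python
import pickle
import pickletools

# Hand-written protocol 0 pickle of the list [1, 2].
PICKLE = b"(lp0\nI1\naI2\na."

def summarize(data):
    """Return (opcode name, decoded arg, position) for each opcode."""
    return [(op.name, arg, pos) for op, arg, pos in pickletools.genops(data)]

rows = summarize(PICKLE)
for name, arg, pos in rows:
    print(pos, name, arg)

# Sanity check: the same bytes really do unpickle to [1, 2].
print(pickle.loads(PICKLE))
```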
1818
1819##############################################################################
1820# A symbolic pickle disassembler.
1821
def dis(pickle, out=None, indentlevel=4):
    """Produce a symbolic disassembly of a pickle.

    'pickle' is a file-like object, or string, containing a (at least one)
    pickle. The pickle is disassembled from the current position, through
    the first STOP opcode encountered.

    Optional arg 'out' is a file-like object to which the disassembly is
    printed. It defaults to sys.stdout.

    Optional arg indentlevel is the number of blanks by which to indent
    a new MARK level. It defaults to 4.
    """

    markstack = []
    indentchunk = ' ' * indentlevel
    for opcode, arg, pos in genops(pickle):
        if pos is not None:
            print >> out, "%5d:" % pos,

        line = "%s %s%s" % (opcode.code,
                            indentchunk * len(markstack),
                            opcode.name)

        markmsg = None
        if markstack and markobject in opcode.stack_before:
            assert markobject not in opcode.stack_after
            markpos = markstack.pop()
            if markpos is not None:
                markmsg = "(MARK at %d)" % markpos

        if arg is not None or markmsg:
            # make a mild effort to align arguments
            line += ' ' * (10 - len(opcode.name))
            if arg is not None:
                line += ' ' + repr(arg)
            if markmsg:
                line += ' ' + markmsg
        print >> out, line

        if markobject in opcode.stack_after:
            assert markobject not in opcode.stack_before
            markstack.append(pos)
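
# A usage sketch for dis that captures the disassembly as a string instead of
# printing it. It assumes Python 3's standard pickletools.dis, which also
# accepts the out and indentlevel arguments documented above; disassemble is
# a hypothetical wrapper name, and the pickle bytes are hand-written.

```python
import io
import pickletools

def disassemble(data, indentlevel=4):
    """Return the symbolic disassembly of a pickle as one string."""
    out = io.StringIO()
    pickletools.dis(data, out=out, indentlevel=indentlevel)
    return out.getvalue()

# Hand-written protocol 0 pickle of the list [1, 2].
text = disassemble(b"(lp0\nI1\naI2\na.")
print(text)
```

# Passing a file-like object as out is what makes the output testable; the
# exact column layout is left to the library.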
1865
1866
1867_dis_test = """
1868>>> import pickle
1869>>> x = [1, 2, (3, 4), {'abc': u"def"}]
>>> pik = pickle.dumps(x, 0)
>>> dis(pik)
1872 0: ( MARK
1873 1: l LIST (MARK at 0)
1874 2: p PUT 0
1875 5: I INT 1
1876 8: a APPEND
1877 9: I INT 2
1878 12: a APPEND
1879 13: ( MARK
1880 14: I INT 3
1881 17: I INT 4
1882 20: t TUPLE (MARK at 13)
1883 21: p PUT 1
1884 24: a APPEND
1885 25: ( MARK
1886 26: d DICT (MARK at 25)
1887 27: p PUT 2
1888 30: S STRING 'abc'
1889 37: p PUT 3
1890 40: V UNICODE u'def'
1891 45: p PUT 4
1892 48: s SETITEM
1893 49: a APPEND
1894 50: . STOP
1895
1896Try again with a "binary" pickle.
1897
1898>>> pik = pickle.dumps(x, 1)
1899>>> dis(pik)
1900 0: ] EMPTY_LIST
1901 1: q BINPUT 0
1902 3: ( MARK
1903 4: K BININT1 1
1904 6: K BININT1 2
1905 8: ( MARK
1906 9: K BININT1 3
1907 11: K BININT1 4
1908 13: t TUPLE (MARK at 8)
1909 14: q BINPUT 1
1910 16: } EMPTY_DICT
1911 17: q BINPUT 2
1912 19: U SHORT_BINSTRING 'abc'
1913 24: q BINPUT 3
1914 26: X BINUNICODE u'def'
1915 34: q BINPUT 4
1916 36: s SETITEM
1917 37: e APPENDS (MARK at 3)
1918 38: . STOP
1919
1920Exercise the INST/OBJ/BUILD family.
1921
1922>>> import random
>>> dis(pickle.dumps(random.random, 0))
    0: c GLOBAL     'random random'
   15: p PUT        0
1926 18: . STOP
1927
1928>>> x = [pickle.PicklingError()] * 2
>>> dis(pickle.dumps(x, 0))
    0: ( MARK
1931 1: l LIST (MARK at 0)
1932 2: p PUT 0
1933 5: ( MARK
    6: i     INST       'pickle PicklingError' (MARK at 5)
   28: p PUT        1
1936 31: ( MARK
1937 32: d DICT (MARK at 31)
1938 33: p PUT 2
1939 36: S STRING 'args'
1940 44: p PUT 3
1941 47: ( MARK
1942 48: t TUPLE (MARK at 47)
1943 49: p PUT 4
1944 52: s SETITEM
1945 53: b BUILD
1946 54: a APPEND
1947 55: g GET 1
1948 58: a APPEND
1949 59: . STOP
1950
1951>>> dis(pickle.dumps(x, 1))
1952 0: ] EMPTY_LIST
1953 1: q BINPUT 0
1954 3: ( MARK
1955 4: ( MARK
    5: c         GLOBAL     'pickle PicklingError'
   27: q         BINPUT     1
1958 29: o OBJ (MARK at 4)
1959 30: q BINPUT 2
1960 32: } EMPTY_DICT
1961 33: q BINPUT 3
1962 35: U SHORT_BINSTRING 'args'
1963 41: q BINPUT 4
1964 43: ) EMPTY_TUPLE
1965 44: s SETITEM
1966 45: b BUILD
1967 46: h BINGET 2
1968 48: e APPENDS (MARK at 3)
1969 49: . STOP
1970
1971Try "the canonical" recursive-object test.
1972
1973>>> L = []
1974>>> T = L,
1975>>> L.append(T)
1976>>> L[0] is T
1977True
1978>>> T[0] is L
1979True
1980>>> L[0][0] is L
1981True
1982>>> T[0][0] is T
1983True
>>> dis(pickle.dumps(L, 0))
    0: ( MARK
1986 1: l LIST (MARK at 0)
1987 2: p PUT 0
1988 5: ( MARK
1989 6: g GET 0
1990 9: t TUPLE (MARK at 5)
1991 10: p PUT 1
1992 13: a APPEND
1993 14: . STOP
1994>>> dis(pickle.dumps(L, 1))
1995 0: ] EMPTY_LIST
1996 1: q BINPUT 0
1997 3: ( MARK
1998 4: h BINGET 0
1999 6: t TUPLE (MARK at 3)
2000 7: q BINPUT 1
2001 9: a APPEND
2002 10: . STOP
2003
2004The protocol 0 pickle of the tuple causes the disassembly to get confused,
2005as it doesn't realize that the POP opcode at 16 gets rid of the MARK at 0
2006(so the output remains indented until the end). The protocol 1 pickle
2007doesn't trigger this glitch, because the disassembler realizes that
2008POP_MARK gets rid of the MARK. Doing a better job on the protocol 0
2009pickle would require the disassembler to emulate the stack.
2010
>>> dis(pickle.dumps(T, 0))
    0: ( MARK
2013 1: ( MARK
2014 2: l LIST (MARK at 1)
2015 3: p PUT 0
2016 6: ( MARK
2017 7: g GET 0
2018 10: t TUPLE (MARK at 6)
2019 11: p PUT 1
2020 14: a APPEND
2021 15: 0 POP
2022 16: 0 POP
2023 17: g GET 1
2024 20: . STOP
2025>>> dis(pickle.dumps(T, 1))
2026 0: ( MARK
2027 1: ] EMPTY_LIST
2028 2: q BINPUT 0
2029 4: ( MARK
2030 5: h BINGET 0
2031 7: t TUPLE (MARK at 4)
2032 8: q BINPUT 1
2033 10: a APPEND
2034 11: 1 POP_MARK (MARK at 0)
2035 12: h BINGET 1
2036 14: . STOP
2037"""
2038
__test__ = {'disassembler_test': _dis_test,
           }
2041
2042def _test():
2043 import doctest
2044 return doctest.testmod()
2045
2046if __name__ == "__main__":
2047 _test()