# Blame: Tim Peters, commit 8ecfc8e, 2003-01-27 18:51:48 +0000
""""Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here. Some functions meant for external use:

genops(pickle)
   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
"""

# Other ideas:
#
# - A pickle verifier: read a pickle and check it exhaustively for
#   well-formedness.
#
# - A protocol identifier: examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#
# - A pickle optimizer: for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive. Or lots of times a PUT is generated that's never accessed
#   by a later GET.


"""
"A pickle" is a program for a virtual pickle machine (PM, but more accurately
called an unpickling machine). It's a sequence of opcodes, interpreted by the
PM, building an arbitrarily complex Python object.

For the most part, the PM is very simple: there are no looping, testing, or
conditional instructions, no arithmetic and no function calls. Opcodes are
executed once each, from first to last, until a STOP opcode is reached.

The PM has two data areas, "the stack" and "the memo".

Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
integer object on the stack, whose value is gotten from a decimal string
literal immediately following the INT opcode in the pickle bytestream. Other
opcodes take Python objects off the stack. The result of unpickling is
whatever object is left on the stack when the final STOP opcode is executed.

The memo is simply an array of objects, or it can be implemented as a dict
mapping little integers to objects. The memo serves as the PM's "long term
memory", and the little integers indexing the memo are akin to variable
names. Some opcodes copy the object on top of the stack into the memo at a
given index (without popping it), and others push a memo object at a given
index onto the stack again.
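
The stack-plus-memo model can be sketched as a toy interpreter. This is a
minimal illustration in Python 3, not the real unpickler: the opcode names
are borrowed from this file, but the (name, arg) program format, the run()
function and the MARK sentinel are assumptions made up for the example.

```python
MARK = object()  # hypothetical sentinel standing in for "the mark"

def run(program):
    # program is a list of (opcode_name, arg) pairs; this format is an
    # assumption of the sketch, not the real pickle bytestream.
    stack, memo = [], {}
    for op, arg in program:
        if op == "INT":
            stack.append(arg)
        elif op == "MARK":
            stack.append(MARK)
        elif op == "LIST":
            # Collect everything above the topmost mark into one list.
            i = len(stack) - 1
            while stack[i] is not MARK:
                i -= 1
            items = stack[i + 1:]
            del stack[i:]
            stack.append(items)
        elif op == "PUT":
            # Remember the stack top under a small integer; do not pop it.
            memo[arg] = stack[-1]
        elif op == "GET":
            stack.append(memo[arg])
        elif op == "STOP":
            return stack.pop()
        else:
            raise ValueError("unknown opcode %r" % op)
    raise ValueError("program ended without STOP")

program = [
    ("MARK", None),
    ("MARK", None), ("INT", 7), ("LIST", None),  # builds [7]
    ("PUT", 0),                                  # memo[0] = [7]
    ("GET", 0),                                  # push the same [7] again
    ("LIST", None),                              # builds [[7], [7]]
    ("STOP", None),
]
result = run(program)
print(result, result[0] is result[1])
```

Because GET pushes the memoized object itself, the two inner lists are one
and the same object, which is how sharing survives a round trip.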

At heart, that's all the PM has. Subtleties arise for these reasons:

+ Object identity. Objects can be arbitrarily complex, and subobjects
  may be shared (for example, the list [a, a] refers to the same object a
  twice). It can be vital that unpickling recreate an isomorphic object
  graph, faithfully reproducing sharing.

+ Recursive objects. For example, after "L = []; L.append(L)", L is a
  list, and L[0] is the same list. This is related to the object identity
  point, and some sequences of pickle opcodes are subtle in order to
  get the right result in all cases.

+ Things pickle doesn't know everything about. Examples of things pickle
  does know everything about are Python's builtin scalar and container
  types, like ints and tuples. They generally have opcodes dedicated to
  them. For things like module references and instances of user-defined
  classes, pickle's knowledge is limited. Historically, many enhancements
  have been made to the pickle protocol in order to do a better (faster,
  and/or more compact) job on those.

+ Backward compatibility and micro-optimization. As explained below,
  pickle opcodes never go away, not even when better ways to do a thing
  get invented. The repertoire of the PM just keeps growing over time.
  So, e.g., there are now six distinct opcodes for building a Python integer,
  five of them devoted to "short" integers. Even so, the only way to pickle
  a Python long int takes time quadratic in the number of digits, for both
  pickling and unpickling. This isn't so much a subtlety as a source of
  wearying complication.
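
The object identity and recursion points can be checked directly. The
sketch below assumes a Python 3 interpreter and its standard pickle module
(a descendant of the module described here); the variable names are
illustrative only.

```python
import pickle

# Shared subobject: both slots of the list refer to the same inner list.
a = []
shared = [a, a]
shared_clone = pickle.loads(pickle.dumps(shared, protocol=0))
print(shared_clone[0] is shared_clone[1])

# Recursive object: the list contains itself.
L = []
L.append(L)
recursive_clone = pickle.loads(pickle.dumps(L, protocol=0))
print(recursive_clone[0] is recursive_clone)
```

In both cases the unpickled graph has the same shape as the original, but
is built from fresh objects.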


Pickle protocols:

For compatibility, the meaning of a pickle opcode never changes. Instead new
pickle opcodes get added, and each version's unpickler can handle all the
pickle opcodes in all protocol versions to date. So old pickles continue to
be readable forever. The pickler can generally be told to restrict itself to
the subset of opcodes available under previous protocol versions too, so that
users can create pickles under the current version readable by older
versions. However, a pickle does not contain its version number embedded
within it. If an older unpickler tries to read a pickle using a later
protocol, the result is most likely an exception due to seeing an unknown (in
the older unpickler) opcode.
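
Since the version number is not assumed to be stored in the pickle, one way
to recover it is to take the highest .proto value among the opcodes used.
The following is a rough sketch, assuming Python 3 and its standard
pickletools.genops(); the function name highest_protocol_used is
hypothetical.

```python
import pickle
import pickletools

def highest_protocol_used(data):
    # genops yields (opcode, arg, position) triples; each opcode
    # descriptor records the protocol that introduced it in .proto.
    return max(op.proto for op, arg, pos in pickletools.genops(data))

for proto in (0, 1):
    data = pickle.dumps([1, 2], protocol=proto)
    print(proto, highest_protocol_used(data))
```

Note that this reports the newest opcode actually present, which can be
lower than the protocol the pickler was asked to use.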

The original pickle used what's now called "protocol 0", and what was called
"text mode" before Python 2.3. The entire pickle bytestream is made up of
printable 7-bit ASCII characters, plus the newline character, in protocol 0.
That's why it was called text mode.

The second major set of additions is now called "protocol 1", and was called
"binary mode" before Python 2.3. This added many opcodes with arguments
consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
bytes. Binary mode pickles can be substantially smaller than equivalent
text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
int as 4 bytes following the opcode, which is cheaper to unpickle than the
(perhaps) 11-character decimal string attached to INT.
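
The size difference between the two modes is easy to observe. This sketch
assumes Python 3's standard pickle module; the exact bytes produced may
differ between Python versions, so only the lengths are compared.

```python
import pickle

value = 2**31 - 1  # largest value that fits in a signed 4-byte int

text = pickle.dumps(value, protocol=0)    # "text mode"
binary = pickle.dumps(value, protocol=1)  # "binary mode"

print(len(text), len(binary))
print(pickle.loads(text) == pickle.loads(binary) == value)
```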

The third major set of additions came in Python 2.3, and is called "protocol
2". XXX Write a short blurb when Guido figures out what they are <wink>. XXX
"""

# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
# descriptors could be constructed. If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
# of ArgumentDescriptor. These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.
UP_TO_NEWLINE = -1

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT = -2

class ArgumentDescriptor(object):
    __slots__ = (
        # name of descriptor record, also a module global name; a string
        'name',

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT are negative values for variable-length cases
        'n',

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument
        'reader',

        # human-readable docs for this arg descriptor; a string
        'doc',
    )

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)
        self.name = name

        # Compare by value, not identity: "is" on ints only works by
        # accident of small-integer caching.
        assert isinstance(n, int) and (n >= 0 or
                                       n in (UP_TO_NEWLINE,
                                             TAKEN_FROM_ARGUMENT))
        self.n = n

        self.reader = reader

        assert isinstance(doc, str)
        self.doc = doc
161
162from struct import unpack as _unpack
163
164def read_uint1(f):
165 """
166 >>> import StringIO
167 >>> read_uint1(StringIO.StringIO('\\xff'))
168 255
169 """
170
171 data = f.read(1)
172 if data:
173 return ord(data)
174 raise ValueError("not enough data in stream to read uint1")
175
176uint1 = ArgumentDescriptor(
177 name='uint1',
178 n=1,
179 reader=read_uint1,
180 doc="One-byte unsigned integer.")
181
182
183def read_uint2(f):
184 """
185 >>> import StringIO
186 >>> read_uint2(StringIO.StringIO('\\xff\\x00'))
187 255
188 >>> read_uint2(StringIO.StringIO('\\xff\\xff'))
189 65535
190 """
191
192 data = f.read(2)
193 if len(data) == 2:
194 return _unpack("<H", data)[0]
195 raise ValueError("not enough data in stream to read uint2")
196
197uint2 = ArgumentDescriptor(
198 name='uint2',
199 n=2,
200 reader=read_uint2,
201 doc="Two-byte unsigned integer, little-endian.")
202
203
204def read_int4(f):
205 """
206 >>> import StringIO
207 >>> read_int4(StringIO.StringIO('\\xff\\x00\\x00\\x00'))
208 255
209 >>> read_int4(StringIO.StringIO('\\x00\\x00\\x00\\x80')) == -(2**31)
210 True
211 """
212
213 data = f.read(4)
214 if len(data) == 4:
215 return _unpack("<i", data)[0]
216 raise ValueError("not enough data in stream to read int4")
217
218int4 = ArgumentDescriptor(
219 name='int4',
220 n=4,
221 reader=read_int4,
222 doc="Four-byte signed integer, little-endian, 2's complement.")
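
# The fixed-width readers above can be sketched for Python 3 as follows.
# This is only an illustration, assuming io.BytesIO in place of the
# Python 2 StringIO objects used in the doctests; read_exact,
# read_uint2_py3 and read_int4_py3 are hypothetical names that are not
# part of this module.

```python
import io
import struct

def read_exact(f, n, what):
    # Read exactly n bytes or fail, mirroring the length checks above.
    data = f.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read %s" % what)
    return data

def read_uint2_py3(f):
    # "<H": little-endian unsigned 16-bit integer.
    return struct.unpack("<H", read_exact(f, 2, "uint2"))[0]

def read_int4_py3(f):
    # "<i": little-endian signed 32-bit integer (two's complement).
    return struct.unpack("<i", read_exact(f, 4, "int4"))[0]

print(read_uint2_py3(io.BytesIO(b"\xff\x00")))
print(read_int4_py3(io.BytesIO(b"\x00\x00\x00\x80")) == -(2**31))
```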
223
224
def read_stringnl(f, decode=True, stripquotes=True):
    """
    >>> import StringIO
    >>> read_stringnl(StringIO.StringIO("'abcd'\\nefg\\n"))
    'abcd'

    >>> read_stringnl(StringIO.StringIO("\\n"))
    Traceback (most recent call last):
    ...
    ValueError: no string quotes around ''

    >>> read_stringnl(StringIO.StringIO("\\n"), stripquotes=False)
    ''

    >>> read_stringnl(StringIO.StringIO("''\\n"))
    ''

    >>> read_stringnl(StringIO.StringIO('"abcd"'))
    Traceback (most recent call last):
    ...
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(StringIO.StringIO("'a\\\\nb\\x00c\\td'\\n'e'"))
    'a\\nb\\x00c\\td'
    """

    data = f.readline()
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

    if stripquotes:
        # Accept either quote character, but require it at both ends.
        for q in "'\"":
            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            raise ValueError("no string quotes around %r" % data)

    # I'm not sure when 'string_escape' was added to the std codecs; it's
    # crazy not to use it if it's there.
    if decode:
        data = data.decode('string_escape')
    return data
273
274stringnl = ArgumentDescriptor(
275 name='stringnl',
276 n=UP_TO_NEWLINE,
277 reader=read_stringnl,
278 doc="""A newline-terminated string.
279
280 This is a repr-style string, with embedded escapes, and
281 bracketing quotes.
282 """)
283
284def read_stringnl_noescape(f):
285 return read_stringnl(f, decode=False, stripquotes=False)
286
287stringnl_noescape = ArgumentDescriptor(
288 name='stringnl_noescape',
289 n=UP_TO_NEWLINE,
290 reader=read_stringnl_noescape,
291 doc="""A newline-terminated string.
292
293 This is a str-style string, without embedded escapes,
294 or bracketing quotes. It should consist solely of
295 printable ASCII characters.
296 """)
297
def read_stringnl_noescape_pair(f):
    """
    >>> import StringIO
    >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\\nEmpty\\njunk"))
    'Queue Empty'
    """

    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
    name='stringnl_noescape_pair',
    n=UP_TO_NEWLINE,
    reader=read_stringnl_noescape_pair,
    doc="""A pair of newline-terminated strings.

    These are str-style strings, without embedded
    escapes, or bracketing quotes. They should
    consist solely of printable ASCII characters.
    The pair is returned as a single string, with
    a single blank separating the two strings.
    """)
319
320def read_string4(f):
321 """
322 >>> import StringIO
323 >>> read_string4(StringIO.StringIO("\\x00\\x00\\x00\\x00abc"))
324 ''
325 >>> read_string4(StringIO.StringIO("\\x03\\x00\\x00\\x00abcdef"))
326 'abc'
327 >>> read_string4(StringIO.StringIO("\\x00\\x00\\x00\\x03abcdef"))
328 Traceback (most recent call last):
329 ...
330 ValueError: expected 50331648 bytes in a string4, but only 6 remain
331 """
332
333 n = read_int4(f)
334 if n < 0:
335 raise ValueError("string4 byte count < 0: %d" % n)
336 data = f.read(n)
337 if len(data) == n:
338 return data
339 raise ValueError("expected %d bytes in a string4, but only %d remain" %
340 (n, len(data)))
341
342string4 = ArgumentDescriptor(
343 name="string4",
344 n=TAKEN_FROM_ARGUMENT,
345 reader=read_string4,
346 doc="""A counted string.
347
348 The first argument is a 4-byte little-endian signed int giving
349 the number of bytes in the string, and the second argument is
350 that many bytes.
351 """)
352
353
354def read_string1(f):
355 """
356 >>> import StringIO
357 >>> read_string1(StringIO.StringIO("\\x00"))
358 ''
359 >>> read_string1(StringIO.StringIO("\\x03abcdef"))
360 'abc'
361 """
362
363 n = read_uint1(f)
364 assert n >= 0
365 data = f.read(n)
366 if len(data) == n:
367 return data
368 raise ValueError("expected %d bytes in a string1, but only %d remain" %
369 (n, len(data)))
370
371string1 = ArgumentDescriptor(
372 name="string1",
373 n=TAKEN_FROM_ARGUMENT,
374 reader=read_string1,
375 doc="""A counted string.
376
377 The first argument is a 1-byte unsigned int giving the number
378 of bytes in the string, and the second argument is that many
379 bytes.
380 """)
381
382
383def read_unicodestringnl(f):
384 """
385 >>> import StringIO
386 >>> read_unicodestringnl(StringIO.StringIO("abc\\uabcd\\njunk"))
387 u'abc\\uabcd'
388 """
389
390 data = f.readline()
391 if not data.endswith('\n'):
392 raise ValueError("no newline found when trying to read "
393 "unicodestringnl")
394 data = data[:-1] # lose the newline
395 return unicode(data, 'raw-unicode-escape')
396
397unicodestringnl = ArgumentDescriptor(
398 name='unicodestringnl',
399 n=UP_TO_NEWLINE,
400 reader=read_unicodestringnl,
401 doc="""A newline-terminated Unicode string.
402
403 This is raw-unicode-escape encoded, so consists of
404 printable ASCII characters, and may contain embedded
405 escape sequences.
406 """)
407
408def read_unicodestring4(f):
409 """
410 >>> import StringIO
411 >>> s = u'abcd\\uabcd'
412 >>> enc = s.encode('utf-8')
413 >>> enc
414 'abcd\\xea\\xaf\\x8d'
415 >>> n = chr(len(enc)) + chr(0) * 3 # little-endian 4-byte length
416 >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))
417 >>> s == t
418 True
419
420 >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
421 Traceback (most recent call last):
422 ...
423 ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
424 """
425
426 n = read_int4(f)
427 if n < 0:
428 raise ValueError("unicodestring4 byte count < 0: %d" % n)
429 data = f.read(n)
430 if len(data) == n:
431 return unicode(data, 'utf-8')
432 raise ValueError("expected %d bytes in a unicodestring4, but only %d "
433 "remain" % (n, len(data)))
434
435unicodestring4 = ArgumentDescriptor(
436 name="unicodestring4",
437 n=TAKEN_FROM_ARGUMENT,
438 reader=read_unicodestring4,
439 doc="""A counted Unicode string.
440
441 The first argument is a 4-byte little-endian signed int
442 giving the number of bytes in the string, and the second
443 argument-- the UTF-8 encoding of the Unicode string --
444 contains that many bytes.
445 """)
446
447
448def read_decimalnl_short(f):
449 """
450 >>> import StringIO
451 >>> read_decimalnl_short(StringIO.StringIO("1234\\n56"))
452 1234
453
454 >>> read_decimalnl_short(StringIO.StringIO("1234L\\n56"))
455 Traceback (most recent call last):
456 ...
457 ValueError: trailing 'L' not allowed in '1234L'
458 """
459
460 s = read_stringnl(f, decode=False, stripquotes=False)
461 if s.endswith("L"):
462 raise ValueError("trailing 'L' not allowed in %r" % s)
463
464 # It's not necessarily true that the result fits in a Python short int:
465 # the pickle may have been written on a 64-bit box. There's also a hack
466 # for True and False here.
467 if s == "00":
468 return False
469 elif s == "01":
470 return True
471
472 try:
473 return int(s)
474 except OverflowError:
475 return long(s)
476
477def read_decimalnl_long(f):
478 """
479 >>> import StringIO
480
481 >>> read_decimalnl_long(StringIO.StringIO("1234\\n56"))
482 Traceback (most recent call last):
483 ...
484 ValueError: trailing 'L' required in '1234'
485
486 Someday the trailing 'L' will probably go away from this output.
487
488 >>> read_decimalnl_long(StringIO.StringIO("1234L\\n56"))
489 1234L
490
491 >>> read_decimalnl_long(StringIO.StringIO("123456789012345678901234L\\n6"))
492 123456789012345678901234L
493 """
494
495 s = read_stringnl(f, decode=False, stripquotes=False)
496 if not s.endswith("L"):
497 raise ValueError("trailing 'L' required in %r" % s)
498 return long(s)
499
500
501decimalnl_short = ArgumentDescriptor(
502 name='decimalnl_short',
503 n=UP_TO_NEWLINE,
504 reader=read_decimalnl_short,
505 doc="""A newline-terminated decimal integer literal.
506
507 This never has a trailing 'L', and the integer fit
508 in a short Python int on the box where the pickle
509 was written -- but there's no guarantee it will fit
510 in a short Python int on the box where the pickle
511 is read.
512 """)
513
514decimalnl_long = ArgumentDescriptor(
515 name='decimalnl_long',
516 n=UP_TO_NEWLINE,
517 reader=read_decimalnl_long,
518 doc="""A newline-terminated decimal integer literal.
519
520 This has a trailing 'L', and can represent integers
521 of any size.
522 """)
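
# The "00"/"01" bool hack in read_decimalnl_short can be sketched for
# Python 3 as below.  This is a simplified illustration, assuming the
# argument has already been read and stripped of its newline;
# parse_decimalnl_short is a hypothetical name.  Python 3 ints are
# unbounded, so the int/long fallback above has no counterpart here.

```python
def parse_decimalnl_short(text):
    # text is the decimal literal without its trailing newline.
    if text.endswith("L"):
        raise ValueError("trailing 'L' not allowed in %r" % text)
    # Genuine integers never carry a leading zero, so these two
    # spellings are free to mean False and True.
    if text == "00":
        return False
    if text == "01":
        return True
    return int(text)

for literal in ("00", "01", "0", "1", "1234"):
    print(literal, repr(parse_decimalnl_short(literal)))
```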
523
524
525def read_floatnl(f):
526 """
527 >>> import StringIO
528 >>> read_floatnl(StringIO.StringIO("-1.25\\n6"))
529 -1.25
530 """
531 s = read_stringnl(f, decode=False, stripquotes=False)
532 return float(s)
533
534floatnl = ArgumentDescriptor(
535 name='floatnl',
536 n=UP_TO_NEWLINE,
537 reader=read_floatnl,
538 doc="""A newline-terminated decimal floating literal.
539
540 In general this requires 17 significant digits for roundtrip
541 identity, and pickling then unpickling infinities, NaNs, and
542 minus zero doesn't work across boxes, or on some boxes even
543 on itself (e.g., Windows can't read the strings it produces
544 for infinities or NaNs).
545 """)
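
# The 17-significant-digit requirement mentioned above can be checked
# with a small Python 3 experiment.  It assumes the platform float is an
# IEEE-754 double, which is an assumption of this sketch rather than a
# guarantee made by this module.

```python
x = 0.1 + 0.2  # not exactly representable; differs from 0.3 in the last bit

short = "%.16g" % x  # 16 significant digits
full = "%.17g" % x   # 17 significant digits

print(short, float(short) == x)
print(full, float(full) == x)
```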
546
547def read_float8(f):
548 """
549 >>> import StringIO, struct
550 >>> raw = struct.pack(">d", -1.25)
551 >>> raw
552 '\\xbf\\xf4\\x00\\x00\\x00\\x00\\x00\\x00'
553 >>> read_float8(StringIO.StringIO(raw + "\\n"))
554 -1.25
555 """
556
557 data = f.read(8)
558 if len(data) == 8:
559 return _unpack(">d", data)[0]
560 raise ValueError("not enough data in stream to read float8")
561
562
563float8 = ArgumentDescriptor(
564 name='float8',
565 n=8,
566 reader=read_float8,
567 doc="""An 8-byte binary representation of a float, big-endian.
568
569 The format is unique to Python, and shared with the struct
570 module (format string '>d') "in theory" (the struct and cPickle
571 implementations don't share the code -- they should). It's
572 strongly related to the IEEE-754 double format, and, in normal
573 cases, is in fact identical to the big-endian 754 double format.
574 On other boxes the dynamic range is limited to that of a 754
575 double, and "add a half and chop" rounding is used to reduce
576 the precision to 53 bits. However, even on a 754 box,
577 infinities, NaNs, and minus zero may not be handled correctly
578 (may not survive roundtrip pickling intact).
579 """)
580
581##############################################################################
582# Object descriptors. The stack used by the pickle machine holds objects,
583# and in the stack_before and stack_after attributes of OpcodeInfo
584# descriptors we need names to describe the various types of objects that can
585# appear on the stack.
586
587class StackObject(object):
588 __slots__ = (
589 # name of descriptor record, for info only
590 'name',
591
592 # type of object, or tuple of type objects (meaning the object can
593 # be of any type in the tuple)
594 'obtype',
595
596 # human-readable docs for this kind of stack object; a string
597 'doc',
598 )
599
600 def __init__(self, name, obtype, doc):
601 assert isinstance(name, str)
602 self.name = name
603
604 assert isinstance(obtype, type) or isinstance(obtype, tuple)
605 if isinstance(obtype, tuple):
606 for contained in obtype:
607 assert isinstance(contained, type)
608 self.obtype = obtype
609
610 assert isinstance(doc, str)
611 self.doc = doc
612
613
614pyint = StackObject(
615 name='int',
616 obtype=int,
617 doc="A short (as opposed to long) Python integer object.")
618
619pylong = StackObject(
620 name='long',
621 obtype=long,
622 doc="A long (as opposed to short) Python integer object.")
623
624pyinteger_or_bool = StackObject(
625 name='int_or_bool',
626 obtype=(int, long, bool),
627 doc="A Python integer object (short or long), or "
628 "a Python bool.")
629
630pyfloat = StackObject(
631 name='float',
632 obtype=float,
633 doc="A Python float object.")
634
635pystring = StackObject(
636 name='str',
637 obtype=str,
638 doc="A Python string object.")
639
640pyunicode = StackObject(
641 name='unicode',
642 obtype=unicode,
643 doc="A Python Unicode string object.")
644
645pynone = StackObject(
646 name="None",
647 obtype=type(None),
648 doc="The Python None object.")
649
650pytuple = StackObject(
651 name="tuple",
652 obtype=tuple,
653 doc="A Python tuple object.")
654
655pylist = StackObject(
656 name="list",
657 obtype=list,
658 doc="A Python list object.")
659
660pydict = StackObject(
661 name="dict",
662 obtype=dict,
663 doc="A Python dict object.")
664
665anyobject = StackObject(
666 name='any',
667 obtype=object,
668 doc="Any kind of object whatsoever.")
669
670markobject = StackObject(
671 name="mark",
672 obtype=StackObject,
673 doc="""'The mark' is a unique object.
674
675 Opcodes that operate on a variable number of objects
676 generally don't embed the count of objects in the opcode,
677 or pull it off the stack. Instead the MARK opcode is used
678 to push a special marker object on the stack, and then
679 some other opcodes grab all the objects from the top of
680 the stack down to (but not including) the topmost marker
681 object.
682 """)
683
stackslice = StackObject(
    name="stackslice",
    obtype=StackObject,
    doc="""An object representing a contiguous slice of the stack.

    This is used in conjunction with markobject, to represent all
    of the stack following the topmost markobject. For example,
    the POP_MARK opcode changes the stack from

        [..., markobject, stackslice]
    to
        [...]

    No matter how many objects are on the stack after the topmost
    markobject, POP_MARK gets rid of all of them (including the
    topmost markobject too).
    """)
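
# How a mark delimits a stack slice can be sketched in Python 3 as below.
# The MARK sentinel and both helper names are hypothetical; the stack is
# modelled as a plain list whose last element is the top.  The dict helper
# follows the key, value, key, value, ... layout described for the DICT
# opcode further down.

```python
MARK = object()  # hypothetical stand-in for the unique mark object

def split_at_topmost_mark(stack):
    # Walk down from the top of the stack to the nearest mark.
    i = len(stack) - 1
    while i >= 0 and stack[i] is not MARK:
        i -= 1
    if i < 0:
        raise ValueError("no mark on the stack")
    # Everything below the mark survives; everything above is the slice.
    return stack[:i], stack[i + 1:]

def dict_from_slice(items):
    # The slice alternates key, value, key, value, ...
    return dict(zip(items[0::2], items[1::2]))

stack = ["x", MARK, "y", MARK, 1, 2, 3, "abc"]
rest, stackslice = split_at_topmost_mark(stack)
print(len(rest), stackslice, dict_from_slice(stackslice))
```

# Only the topmost mark is consumed; an earlier mark stays on the stack for
# an enclosing container to use.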
701
702##############################################################################
703# Descriptors for pickle opcodes.
704
705class OpcodeInfo(object):
706
707 __slots__ = (
708 # symbolic name of opcode; a string
709 'name',
710
711 # the code used in a bytestream to represent the opcode; a
712 # one-character string
713 'code',
714
715 # If the opcode has an argument embedded in the byte string, an
716 # instance of ArgumentDescriptor specifying its type. Note that
717 # arg.reader(s) can be used to read and decode the argument from
718 # the bytestream s, and arg.doc documents the format of the raw
719 # argument bytes. If the opcode doesn't have an argument embedded
720 # in the bytestream, arg should be None.
721 'arg',
722
723 # what the stack looks like before this opcode runs; a list
724 'stack_before',
725
726 # what the stack looks like after this opcode runs; a list
727 'stack_after',
728
729 # the protocol number in which this opcode was introduced; an int
730 'proto',
731
732 # human-readable docs for this opcode; a string
733 'doc',
734 )
735
736 def __init__(self, name, code, arg,
737 stack_before, stack_after, proto, doc):
738 assert isinstance(name, str)
739 self.name = name
740
741 assert isinstance(code, str)
742 assert len(code) == 1
743 self.code = code
744
745 assert arg is None or isinstance(arg, ArgumentDescriptor)
746 self.arg = arg
747
748 assert isinstance(stack_before, list)
749 for x in stack_before:
750 assert isinstance(x, StackObject)
751 self.stack_before = stack_before
752
753 assert isinstance(stack_after, list)
754 for x in stack_after:
755 assert isinstance(x, StackObject)
756 self.stack_after = stack_after
757
758 assert isinstance(proto, int) and 0 <= proto <= 2
759 self.proto = proto
760
761 assert isinstance(doc, str)
762 self.doc = doc
763
764I = OpcodeInfo
765opcodes = [
766
767 # Ways to spell integers.
768
    I(name='INT',
      code='I',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[pyinteger_or_bool],
      proto=0,
      doc="""Push an integer or bool.

      The argument is a newline-terminated decimal literal string.

      The intent may have been that this always fit in a short Python int,
      but INT can be generated in pickles written on a 64-bit box that
      require a Python long on a 32-bit box. The difference between this
      and LONG then is that INT skips a trailing 'L', and produces a short
      int whenever possible.

      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, the builtin names True and False were also
      added to 2.2.2, mapping to ints 1 and 0. For compatibility in both
      directions, True gets pickled as "I01\\n" (the INT opcode followed by
      "01\\n"), and False as "I00\\n". Leading zeroes are never produced
      for a genuine integer. The 2.3 (and later) unpicklers special-case
      these and return bool instead; earlier unpicklers ignore the leading
      "0" and return the int.
      """),
793
794 I(name='LONG',
795 code='L',
796 arg=decimalnl_long,
797 stack_before=[],
798 stack_after=[pylong],
799 proto=0,
800 doc="""Push a long integer.
801
802 The same as INT, except that the literal ends with 'L', and always
803 unpickles to a Python long. There doesn't seem a real purpose to the
804 trailing 'L'.
805 """),
806
807 I(name='BININT',
808 code='J',
809 arg=int4,
810 stack_before=[],
811 stack_after=[pyint],
812 proto=1,
813 doc="""Push a four-byte signed integer.
814
815 This handles the full range of Python (short) integers on a 32-bit
816 box, directly as binary bytes (1 for the opcode and 4 for the integer).
817 If the integer is non-negative and fits in 1 or 2 bytes, pickling via
818 BININT1 or BININT2 saves space.
819 """),
820
821 I(name='BININT1',
822 code='K',
823 arg=uint1,
824 stack_before=[],
825 stack_after=[pyint],
826 proto=1,
827 doc="""Push a one-byte unsigned integer.
828
829 This is a space optimization for pickling very small non-negative ints,
830 in range(256).
831 """),
832
833 I(name='BININT2',
834 code='M',
835 arg=uint2,
836 stack_before=[],
837 stack_after=[pyint],
838 proto=1,
839 doc="""Push a two-byte unsigned integer.
840
841 This is a space optimization for pickling small positive ints, in
842 range(256, 2**16). Integers in range(256) can also be pickled via
843 BININT2, but BININT1 instead saves a byte.
844 """),
845
846 # Ways to spell strings (8-bit, not Unicode).
847
848 I(name='STRING',
849 code='S',
850 arg=stringnl,
851 stack_before=[],
852 stack_after=[pystring],
853 proto=0,
854 doc="""Push a Python string object.
855
856 The argument is a repr-style string, with bracketing quote characters,
857 and perhaps embedded escapes. The argument extends until the next
858 newline character.
859 """),
860
861 I(name='BINSTRING',
862 code='T',
863 arg=string4,
864 stack_before=[],
865 stack_after=[pystring],
866 proto=1,
867 doc="""Push a Python string object.
868
869 There are two arguments: the first is a 4-byte little-endian signed int
870 giving the number of bytes in the string, and the second is that many
871 bytes, which are taken literally as the string content.
872 """),
873
874 I(name='SHORT_BINSTRING',
875 code='U',
876 arg=string1,
877 stack_before=[],
878 stack_after=[pystring],
879 proto=1,
880 doc="""Push a Python string object.
881
882 There are two arguments: the first is a 1-byte unsigned int giving
883 the number of bytes in the string, and the second is that many bytes,
884 which are taken literally as the string content.
885 """),
886
887 # Ways to spell None.
888
889 I(name='NONE',
890 code='N',
891 arg=None,
892 stack_before=[],
893 stack_after=[pynone],
894 proto=0,
895 doc="Push None on the stack."),
896
897 # Ways to spell Unicode strings.
898
899 I(name='UNICODE',
900 code='V',
901 arg=unicodestringnl,
902 stack_before=[],
903 stack_after=[pyunicode],
904 proto=0, # this may be pure-text, but it's a later addition
905 doc="""Push a Python Unicode string object.
906
907 The argument is a raw-unicode-escape encoding of a Unicode string,
908 and so may contain embedded escape sequences. The argument extends
909 until the next newline character.
910 """),
911
912 I(name='BINUNICODE',
913 code='X',
914 arg=unicodestring4,
915 stack_before=[],
916 stack_after=[pyunicode],
917 proto=1,
918 doc="""Push a Python Unicode string object.
919
920 There are two arguments: the first is a 4-byte little-endian signed int
921 giving the number of bytes in the string. The second is that many
922 bytes, and is the UTF-8 encoding of the Unicode string.
923 """),
924
925 # Ways to spell floats.
926
927 I(name='FLOAT',
928 code='F',
929 arg=floatnl,
930 stack_before=[],
931 stack_after=[pyfloat],
932 proto=0,
933 doc="""Newline-terminated decimal float literal.
934
935 The argument is repr(a_float), and in general requires 17 significant
936 digits for roundtrip conversion to be an identity (this is so for
937 IEEE-754 double precision values, which is what Python float maps to
938 on most boxes).
939
940 In general, FLOAT cannot be used to transport infinities, NaNs, or
941 minus zero across boxes (or even on a single box, if the platform C
942 library can't read the strings it produces for such things -- Windows
943 is like that), but may do less damage than BINFLOAT on boxes with
944 greater precision or dynamic range than IEEE-754 double.
945 """),
946
947 I(name='BINFLOAT',
948 code='G',
949 arg=float8,
950 stack_before=[],
951 stack_after=[pyfloat],
952 proto=1,
953 doc="""Float stored in binary form, with 8 bytes of data.
954
955 This generally requires less than half the space of FLOAT encoding.
956 In general, BINFLOAT cannot be used to transport infinities, NaNs, or
957 minus zero, raises an exception if the exponent exceeds the range of
958 an IEEE-754 double, and retains no more than 53 bits of precision (if
959 there are more than that, "add a half and chop" rounding is used to
960 cut it back to 53 significant bits).
961 """),
962
963 # Ways to build lists.
964
965 I(name='EMPTY_LIST',
966 code=']',
967 arg=None,
968 stack_before=[],
969 stack_after=[pylist],
970 proto=1,
971 doc="Push an empty list."),
972
973 I(name='APPEND',
974 code='a',
975 arg=None,
976 stack_before=[pylist, anyobject],
977 stack_after=[pylist],
978 proto=0,
979 doc="""Append an object to a list.
980
981 Stack before: ... pylist anyobject
982 Stack after: ... pylist+[anyobject]
983 """),
984
985 I(name='APPENDS',
986 code='e',
987 arg=None,
988 stack_before=[pylist, markobject, stackslice],
989 stack_after=[pylist],
990 proto=1,
991 doc="""Extend a list by a slice of stack objects.
992
993 Stack before: ... pylist markobject stackslice
994 Stack after: ... pylist+stackslice
995 """),
996
997 I(name='LIST',
998 code='l',
999 arg=None,
1000 stack_before=[markobject, stackslice],
1001 stack_after=[pylist],
1002 proto=0,
1003 doc="""Build a list out of the topmost stack slice, after markobject.
1004
1005 All the stack entries following the topmost markobject are placed into
1006 a single Python list, which single list object replaces all of the
1007 stack from the topmost markobject onward. For example,
1008
1009 Stack before: ... markobject 1 2 3 'abc'
1010 Stack after: ... [1, 2, 3, 'abc']
1011 """),
1012
1013 # Ways to build tuples.
1014
1015 I(name='EMPTY_TUPLE',
1016 code=')',
1017 arg=None,
1018 stack_before=[],
1019 stack_after=[pytuple],
1020 proto=1,
1021 doc="Push an empty tuple."),
1022
1023 I(name='TUPLE',
1024 code='t',
1025 arg=None,
1026 stack_before=[markobject, stackslice],
1027 stack_after=[pytuple],
1028 proto=0,
1029 doc="""Build a tuple out of the topmost stack slice, after markobject.
1030
1031 All the stack entries following the topmost markobject are placed into
1032 a single Python tuple, which single tuple object replaces all of the
1033 stack from the topmost markobject onward. For example,
1034
1035 Stack before: ... markobject 1 2 3 'abc'
1036 Stack after: ... (1, 2, 3, 'abc')
1037 """),
1038
1039 # Ways to build dicts.
1040
1041 I(name='EMPTY_DICT',
1042 code='}',
1043 arg=None,
1044 stack_before=[],
1045 stack_after=[pydict],
1046 proto=1,
1047 doc="Push an empty dict."),
1048
1049 I(name='DICT',
1050 code='d',
1051 arg=None,
1052 stack_before=[markobject, stackslice],
1053 stack_after=[pydict],
1054 proto=0,
1055 doc="""Build a dict out of the topmost stack slice, after markobject.
1056
1057 All the stack entries following the topmost markobject are placed into
1058 a single Python dict, which single dict object replaces all of the
1059 stack from the topmost markobject onward. The stack slice alternates
1060 key, value, key, value, .... For example,
1061
1062 Stack before: ... markobject 1 2 3 'abc'
1063 Stack after: ... {1: 2, 3: 'abc'}
1064 """),
1065
1066 I(name='SETITEM',
1067 code='s',
1068 arg=None,
1069 stack_before=[pydict, anyobject, anyobject],
1070 stack_after=[pydict],
1071 proto=0,
1072 doc="""Add a key+value pair to an existing dict.
1073
1074 Stack before: ... pydict key value
1075 Stack after: ... pydict
1076
1077 where pydict has been modified via pydict[key] = value.
1078 """),
1079
1080 I(name='SETITEMS',
1081 code='u',
1082 arg=None,
1083 stack_before=[pydict, markobject, stackslice],
1084 stack_after=[pydict],
1085 proto=1,
1086 doc="""Add an arbitrary number of key+value pairs to an existing dict.
1087
1088 The slice of the stack following the topmost markobject is taken as
1089 an alternating sequence of keys and values, added to the dict
1090 immediately under the topmost markobject. Everything at and after the
1091 topmost markobject is popped, leaving the mutated dict at the top
1092 of the stack.
1093
1094 Stack before: ... pydict markobject key_1 value_1 ... key_n value_n
1095 Stack after: ... pydict
1096
1097 where pydict has been modified via pydict[key_i] = value_i for i in
1098 1, 2, ..., n, and in that order.
1099 """),
1100
1101 # Stack manipulation.
1102
1103 I(name='POP',
1104 code='0',
1105 arg=None,
1106 stack_before=[anyobject],
1107 stack_after=[],
1108 proto=0,
1109 doc="Discard the top stack item, shrinking the stack by one item."),
1110
1111 I(name='DUP',
1112 code='2',
1113 arg=None,
1114 stack_before=[anyobject],
1115 stack_after=[anyobject, anyobject],
1116 proto=0,
1117 doc="Push the top stack item onto the stack again, duplicating it."),
1118
1119 I(name='MARK',
1120 code='(',
1121 arg=None,
1122 stack_before=[],
1123 stack_after=[markobject],
1124 proto=0,
1125 doc="""Push markobject onto the stack.
1126
1127 markobject is a unique object, used by other opcodes to identify a
1128 region of the stack containing a variable number of objects for them
1129 to work on. See markobject.doc for more detail.
1130 """),
1131
1132 I(name='POP_MARK',
1133 code='1',
1134 arg=None,
1135 stack_before=[markobject, stackslice],
1136 stack_after=[],
1137 proto=0,
1138 doc="""Pop all the stack objects at and above the topmost markobject.
1139
1140 When an opcode using a variable number of stack objects is done,
1141 POP_MARK is used to remove those objects, and to remove the markobject
1142 that delimited their starting position on the stack.
1143 """),
1144
1145 # Memo manipulation. There are really only two operations (get and put),
1146 # each in all-text, "short binary", and "long binary" flavors.
1147
1148 I(name='GET',
1149 code='g',
1150 arg=decimalnl_short,
1151 stack_before=[],
1152 stack_after=[anyobject],
1153 proto=0,
1154 doc="""Read an object from the memo and push it on the stack.
1155
1156 The index of the memo object to push is given by the newline-terminated
1157 decimal string following. BINGET and LONG_BINGET are space-optimized
1158 versions.
1159 """),
1160
1161 I(name='BINGET',
1162 code='h',
1163 arg=uint1,
1164 stack_before=[],
1165 stack_after=[anyobject],
1166 proto=1,
1167 doc="""Read an object from the memo and push it on the stack.
1168
1169 The index of the memo object to push is given by the 1-byte unsigned
1170 integer following.
1171 """),
1172
1173 I(name='LONG_BINGET',
1174 code='j',
1175 arg=int4,
1176 stack_before=[],
1177 stack_after=[anyobject],
1178 proto=1,
1179 doc="""Read an object from the memo and push it on the stack.
1180
1181 The index of the memo object to push is given by the 4-byte signed
1182 little-endian integer following.
1183 """),
1184
1185 I(name='PUT',
1186 code='p',
1187 arg=decimalnl_short,
1188 stack_before=[],
1189 stack_after=[],
1190 proto=0,
1191 doc="""Store the stack top into the memo. The stack is not popped.
1192
1193 The index of the memo location to write into is given by the newline-
1194 terminated decimal string following. BINPUT and LONG_BINPUT are
1195 space-optimized versions.
1196 """),
1197
1198 I(name='BINPUT',
1199 code='q',
1200 arg=uint1,
1201 stack_before=[],
1202 stack_after=[],
1203 proto=1,
1204 doc="""Store the stack top into the memo. The stack is not popped.
1205
1206 The index of the memo location to write into is given by the 1-byte
1207 unsigned integer following.
1208 """),
1209
1210 I(name='LONG_BINPUT',
1211 code='r',
1212 arg=int4,
1213 stack_before=[],
1214 stack_after=[],
1215 proto=1,
1216 doc="""Store the stack top into the memo. The stack is not popped.
1217
1218 The index of the memo location to write into is given by the 4-byte
1219 signed little-endian integer following.
1220 """),
1221
1222 # Push a class object, or module function, on the stack, via its module
1223 # and name.
1224
1225 I(name='GLOBAL',
1226 code='c',
1227 arg=stringnl_noescape_pair,
1228 stack_before=[],
1229 stack_after=[anyobject],
1230 proto=0,
1231 doc="""Push a global object (module.attr) on the stack.
1232
1233 Two newline-terminated strings follow the GLOBAL opcode. The first is
1234 taken as a module name, and the second as a class name. The class
1235 object module.class is pushed on the stack. More accurately, the
1236 object returned by self.find_class(module, class) is pushed on the
1237 stack, so unpickling subclasses can override this form of lookup.
1238 """),
1239
1240 # Ways to build objects of classes pickle doesn't know about directly
1241 # (user-defined classes). I despair of documenting this accurately
1242 # and comprehensibly -- you really have to read the pickle code to
1243 # find all the special cases.
1244
1245 I(name='REDUCE',
1246 code='R',
1247 arg=None,
1248 stack_before=[anyobject, anyobject],
1249 stack_after=[anyobject],
1250 proto=0,
1251 doc="""Push an object built from a callable and an argument tuple.
1252
1253 The opcode is named after the __reduce__() method.
1254
1255 Stack before: ... callable pytuple
1256 Stack after: ... callable(*pytuple)
1257
1258 The callable and the argument tuple are the first two items returned
1259 by a __reduce__ method. Applying the callable to the argtuple is
1260 supposed to reproduce the original object, or at least get it started.
1261 If the __reduce__ method returns a 3-tuple, the last component is an
1262 argument to be passed to the object's __setstate__, and then the REDUCE
1263 opcode is followed by code to create setstate's argument, and then a
1264 BUILD opcode to apply __setstate__ to that argument.
1265
1266 There are lots of special cases here. The argtuple can be None, in
1267 which case callable.__basicnew__() is called instead to produce the
1268 object to be pushed on the stack. This appears to be a trick unique
1269 to ExtensionClasses, and is deprecated regardless.
1270
1271 If type(callable) is not ClassType, REDUCE complains unless the
1272 callable has been registered with the copy_reg module's
1273 safe_constructors dict, or the callable has a magic
1274 '__safe_for_unpickling__' attribute with a true value. I'm not sure
1275 why it does this, but I've sure seen this complaint often enough when
1276 I didn't want to <wink>.
1277 """),
1278
1279 I(name='BUILD',
1280 code='b',
1281 arg=None,
1282 stack_before=[anyobject, anyobject],
1283 stack_after=[anyobject],
1284 proto=0,
1285 doc="""Finish building an object, via __setstate__ or dict update.
1286
1287 Stack before: ... anyobject argument
1288 Stack after: ... anyobject
1289
1290 where anyobject may have been mutated, as follows:
1291
1292 If the object has a __setstate__ method,
1293
1294 anyobject.__setstate__(argument)
1295
1296 is called.
1297
1298 Else the argument must be a dict, the object must have a __dict__, and
1299 the object is updated via
1300
1301 anyobject.__dict__.update(argument)
1302
1303 This may raise RuntimeError in restricted execution mode (which
1304 disallows access to __dict__ directly); in that case, the object
1305 is updated instead via
1306
1307 for k, v in argument.items():
1308 anyobject[k] = v
1309 """),
1310
1311 I(name='INST',
1312 code='i',
1313 arg=stringnl_noescape_pair,
1314 stack_before=[markobject, stackslice],
1315 stack_after=[anyobject],
1316 proto=0,
1317 doc="""Build a class instance.
1318
1319 This is the protocol 0 version of protocol 1's OBJ opcode.
1320 INST is followed by two newline-terminated strings, giving a
1321 module and class name, just as for the GLOBAL opcode (and see
1322 GLOBAL for more details about that). self.find_class(module, name)
1323 is used to get a class object.
1324
1325 In addition, all the objects on the stack following the topmost
1326 markobject are gathered into a tuple and popped (along with the
1327 topmost markobject), just as for the TUPLE opcode.
1328
1329 Now it gets complicated. If all of these are true:
1330
1331 + The argtuple is empty (markobject was at the top of the stack
1332 at the start).
1333
1334 + It's an old-style class object (the type of the class object is
1335 ClassType).
1336
1337 + The class object does not have a __getinitargs__ attribute.
1338
1339 then we want to create an old-style class instance without invoking
1340 its __init__() method (pickle has waffled on this over the years; not
1341 calling __init__() is current wisdom). In this case, an instance of
1342 an old-style dummy class is created, and then we try to rebind its
1343 __class__ attribute to the desired class object. If this succeeds,
1344 the new instance object is pushed on the stack, and we're done. In
1345 restricted execution mode it can fail (assignment to __class__ is
1346 disallowed), and I'm not really sure what happens then -- it looks
1347 like the code ends up calling the class object's __init__ anyway,
1348 via falling into the next case.
1349
1350 Else (the argtuple is not empty, it's not an old-style class object,
1351 or the class object does have a __getinitargs__ attribute), the code
1352 first insists that the class object have a __safe_for_unpickling__
1353 attribute. Unlike the __safe_for_unpickling__ check in REDUCE, it
1354 doesn't matter whether this attribute has a true or false value; it
1355 only matters whether it exists (XXX this smells like a bug). If
1356 __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.
1357
1358 Else (the class object does have a __safe_for_unpickling__ attr),
1359 the class object obtained from INST's arguments is applied to the
1360 argtuple obtained from the stack, and the resulting instance object
1361 is pushed on the stack.
1362 """),
1363
1364 I(name='OBJ',
1365 code='o',
1366 arg=None,
1367 stack_before=[markobject, anyobject, stackslice],
1368 stack_after=[anyobject],
1369 proto=1,
1370 doc="""Build a class instance.
1371
1372 This is the protocol 1 version of protocol 0's INST opcode, and is
1373 very much like it. The major difference is that the class object
1374 is taken off the stack, allowing it to be retrieved from the memo
1375 repeatedly if several instances of the same class are created. This
1376 can be much more efficient (in both time and space) than repeatedly
1377 embedding the module and class names in INST opcodes.
1378
1379 Unlike INST, OBJ takes no arguments from the opcode stream. Instead
1380 the class object is taken off the stack, immediately above the
1381 topmost markobject:
1382
1383 Stack before: ... markobject classobject stackslice
1384 Stack after: ... new_instance_object
1385
1386 As for INST, the remainder of the stack above the markobject is
1387 gathered into an argument tuple, and then the logic seems identical,
1388 except that no __safe_for_unpickling__ check is done (XXX this smells
1389 like a bug). See INST for the gory details.
1390 """),
1391
1392 # Machine control.
1393
1394 I(name='STOP',
1395 code='.',
1396 arg=None,
1397 stack_before=[anyobject],
1398 stack_after=[],
1399 proto=0,
1400 doc="""Stop the unpickling machine.
1401
1402 Every pickle ends with this opcode. The object at the top of the stack
1403 is popped, and that's the result of unpickling. The stack should be
1404 empty then.
1405 """),
1406
1407 # Ways to deal with persistent IDs.
1408
1409 I(name='PERSID',
1410 code='P',
1411 arg=stringnl_noescape,
1412 stack_before=[],
1413 stack_after=[anyobject],
1414 proto=0,
1415 doc="""Push an object identified by a persistent ID.
1416
1417 The pickle module doesn't define what a persistent ID means. PERSID's
1418 argument is a newline-terminated str-style (no embedded escapes, no
1419 bracketing quote characters) string, which *is* "the persistent ID".
1420 The unpickler passes this string to self.persistent_load(). Whatever
1421 object that returns is pushed on the stack. There is no implementation
1422 of persistent_load() in Python's unpickler: it must be supplied by an
1423 unpickler subclass.
1424 """),
1425
1426 I(name='BINPERSID',
1427 code='Q',
1428 arg=None,
1429 stack_before=[anyobject],
1430 stack_after=[anyobject],
1431 proto=1,
1432 doc="""Push an object identified by a persistent ID.
1433
1434 Like PERSID, except the persistent ID is popped off the stack (instead
1435 of being a string embedded in the opcode bytestream). The persistent
1436 ID is passed to self.persistent_load(), and whatever object that
1437 returns is pushed on the stack. See PERSID for more detail.
1438 """),
1439]
1440del I
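
##############################################################################
# Illustration only, not part of the original module.  The stack diagrams in
# the opcode docs above (MARK, LIST, TUPLE, DICT, APPEND, APPENDS, SETITEMS,
# POP_MARK, PUT, GET) can be played out on a toy machine.  Every name below
# (_MARK, _pop_to_mark, _pairs, run_sketch, the 'PUSH' pseudo-opcode) is
# hypothetical, and the sketch ignores the byte-level encoding entirely; it
# is a rough model of the documented stack effects, not how pickle.py
# implements them.

```python
_MARK = object()  # stands in for markobject


def _pop_to_mark(stack):
    """Remove the topmost mark and everything above it; return what was above."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] is _MARK:
            items = stack[i + 1:]
            del stack[i:]
            return items
    raise ValueError("no MARK on the stack")


def _pairs(items):
    """Split [k1, v1, k2, v2, ...] into (key, value) pairs, in order."""
    return list(zip(items[0::2], items[1::2]))


def run_sketch(program):
    """Run (opcode name, arg) pairs; return the final stack as a list."""
    stack, memo = [], {}
    for name, arg in program:
        if name == 'PUSH':              # stand-in for INT, STRING, etc.
            stack.append(arg)
        elif name == 'MARK':
            stack.append(_MARK)
        elif name == 'EMPTY_LIST':
            stack.append([])
        elif name == 'EMPTY_DICT':
            stack.append({})
        elif name == 'LIST':
            stack.append(_pop_to_mark(stack))
        elif name == 'TUPLE':
            stack.append(tuple(_pop_to_mark(stack)))
        elif name == 'DICT':
            stack.append(dict(_pairs(_pop_to_mark(stack))))
        elif name == 'APPEND':
            value = stack.pop()
            stack[-1].append(value)
        elif name == 'APPENDS':
            items = _pop_to_mark(stack)
            stack[-1].extend(items)
        elif name == 'SETITEMS':
            items = _pop_to_mark(stack)
            for key, value in _pairs(items):
                stack[-1][key] = value
        elif name == 'POP_MARK':
            _pop_to_mark(stack)
        elif name == 'PUT':
            memo[arg] = stack[-1]       # the stack is not popped
        elif name == 'GET':
            stack.append(memo[arg])
        else:
            raise ValueError("unsupported opcode %r" % (name,))
    return stack


# The DICT example from its doc: ... markobject 1 2 3 'abc'
dict_stack = run_sketch([('MARK', None), ('PUSH', 1), ('PUSH', 2),
                         ('PUSH', 3), ('PUSH', 'abc'), ('DICT', None)])

# A recursive list, L = [(L,)], built with the memo the way a protocol 1
# pickle does it: EMPTY_LIST, put, MARK, get, TUPLE, put, APPEND.
rec_stack = run_sketch([('EMPTY_LIST', None), ('PUT', 0), ('MARK', None),
                        ('GET', 0), ('TUPLE', None), ('PUT', 1),
                        ('APPEND', None)])
```

# The memo is what makes the recursive case work: GET pushes the very same
# list object that PUT recorded, so the tuple ends up containing the list
# that contains it.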
1441
1442# Verify uniqueness of .name and .code members.
1443name2i = {}
1444code2i = {}
1445
1446for i, d in enumerate(opcodes):
1447 if d.name in name2i:
1448 raise ValueError("repeated name %r at indices %d and %d" %
1449 (d.name, name2i[d.name], i))
1450 if d.code in code2i:
1451 raise ValueError("repeated code %r at indices %d and %d" %
1452 (d.code, code2i[d.code], i))
1453
1454 name2i[d.name] = i
1455 code2i[d.code] = i
1456
1457del name2i, code2i, i, d
1458
1459##############################################################################
1460# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
1461# Also ensure we've got the same stuff as pickle.py, although the
1462# introspection here is dicey.
1463
1464code2op = {}
1465for d in opcodes:
1466 code2op[d.code] = d
1467del d
1468
1469def assure_pickle_consistency(verbose=False):
1470 import pickle, re
1471
1472 copy = code2op.copy()
1473 for name in pickle.__all__:
1474 if not re.match("[A-Z][A-Z0-9_]+$", name):
1475 if verbose:
1476 print "skipping %r: it doesn't look like an opcode name" % name
1477 continue
1478 picklecode = getattr(pickle, name)
1479 if not isinstance(picklecode, str) or len(picklecode) != 1:
1480 if verbose:
1481 print ("skipping %r: value %r doesn't look like a pickle "
1482 "code" % (name, picklecode))
1483 continue
1484 if picklecode in copy:
1485 if verbose:
1486 print "checking name %r w/ code %r for consistency" % (
1487 name, picklecode)
1488 d = copy[picklecode]
1489 if d.name != name:
1490 raise ValueError("for pickle code %r, pickle.py uses name %r "
1491 "but we're using name %r" % (picklecode,
1492 name,
1493 d.name))
1494 # Forget this one. Any left over in copy at the end are a problem
1495 # of a different kind.
1496 del copy[picklecode]
1497 else:
1498 raise ValueError("pickle.py appears to have a pickle opcode with "
1499 "name %r and code %r, but we don't" %
1500 (name, picklecode))
1501 if copy:
1502 msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
1503 for code, d in copy.items():
1504 msg.append(" name %r with code %r" % (d.name, code))
1505 raise ValueError("\n".join(msg))
1506
1507assure_pickle_consistency()
1508
1509##############################################################################
1510# A pickle opcode generator.
1511
1512def genops(pickle):
1513 """Generate all the opcodes in a pickle.
1514
1515 'pickle' is a file-like object, or string, containing the pickle.
1516
1517 Each opcode in the pickle is generated, from the current pickle position,
1518 stopping after a STOP opcode is delivered. A triple is generated for
1519 each opcode:
1520
1521 opcode, arg, pos
1522
1523 opcode is an OpcodeInfo record, describing the current opcode.
1524
1525 If the opcode has an argument embedded in the pickle, arg is its decoded
1526 value, as a Python object. If the opcode doesn't have an argument, arg
1527 is None.
1528
1529 If the pickle has a tell() method, pos was the value of pickle.tell()
1530 before reading the current opcode. If the pickle is a string object,
1531 it's wrapped in a StringIO object, and the latter's tell() result is
1532 used. Else (the pickle doesn't have a tell(), and it's not obvious how
1533 to query its current position) pos is None.
1534 """
1535
1536 import cStringIO as StringIO
1537
1538 if isinstance(pickle, str):
1539 pickle = StringIO.StringIO(pickle)
1540
1541 if hasattr(pickle, "tell"):
1542 getpos = pickle.tell
1543 else:
1544 getpos = lambda: None
1545
1546 while True:
1547 pos = getpos()
1548 code = pickle.read(1)
1549 opcode = code2op.get(code)
1550 if opcode is None:
1551 if code == "":
1552 raise ValueError("pickle exhausted before seeing STOP")
1553 else:
1554 raise ValueError("at position %s, opcode %r unknown" % (
1555 pos is None and "<unknown>" or pos,
1556 code))
1557 if opcode.arg is None:
1558 arg = None
1559 else:
1560 arg = opcode.arg.reader(pickle)
1561 yield opcode, arg, pos
1562 if code == '.':
1563 assert opcode.name == 'STOP'
1564 break
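
# Illustration only, not part of the original module.  A minimal sketch of
# consuming the (opcode, arg, pos) triples described in the genops docstring.
# It assumes this module is importable under the standard-library name
# pickletools; the pickle bytes are assembled by hand for the example (a
# protocol 0 pickle of [1, 2]) rather than copied from any pickler's output.

```python
import pickletools  # assumption: this module is importable under this name

# MARK LIST PUT INT APPEND INT APPEND STOP, written out by hand.
data = b"(lp0\nI1\naI2\na."

# Keep only the opcode name from each OpcodeInfo record.
triples = [(opcode.name, arg, pos)
           for opcode, arg, pos in pickletools.genops(data)]

for name, arg, pos in triples:
    print("%3d: %-6s %r" % (pos, name, arg))
```

# Opcodes without an embedded argument (MARK, LIST, APPEND, STOP) report
# arg as None; PUT and INT report their decoded integer.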
1565
1566##############################################################################
1567# A symbolic pickle disassembler.
1568
1569def dis(pickle, out=None, indentlevel=4):
1570 """Produce a symbolic disassembly of a pickle.
1571
1572 'pickle' is a file-like object, or string, containing at least one
1573 pickle. The pickle is disassembled from the current position, through
1574 the first STOP opcode encountered.
1575
1576 Optional arg 'out' is a file-like object to which the disassembly is
1577 printed. It defaults to sys.stdout.
1578
1579 Optional arg indentlevel is the number of blanks by which to indent
1580 a new MARK level. It defaults to 4.
1581 """
1582
1583 markstack = []
1584 indentchunk = ' ' * indentlevel
1585 for opcode, arg, pos in genops(pickle):
1586 if pos is not None:
1587 print >> out, "%5d:" % pos,
1588
1589 line = "%s %s%s" % (opcode.code,
1590 indentchunk * len(markstack),
1591 opcode.name)
1592
1593 markmsg = None
1594 if markstack and markobject in opcode.stack_before:
1595 assert markobject not in opcode.stack_after
1596 markpos = markstack.pop()
1597 if markpos is not None:
1598 markmsg = "(MARK at %d)" % markpos
1599
1600 if arg is not None or markmsg:
1601 # make a mild effort to align arguments
1602 line += ' ' * (10 - len(opcode.name))
1603 if arg is not None:
1604 line += ' ' + repr(arg)
1605 if markmsg:
1606 line += ' ' + markmsg
1607 print >> out, line
1608
1609 if markobject in opcode.stack_after:
1610 assert markobject not in opcode.stack_before
1611 markstack.append(pos)
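
# Illustration only, not part of the original module.  A minimal sketch of
# the 'out' argument described in the dis docstring: the disassembly is
# captured in an in-memory file instead of going to sys.stdout.  It assumes
# this module is importable as pickletools; the pickle bytes are the same
# hand-assembled protocol 0 pickle of [1, 2] used for illustration.

```python
import pickletools  # assumption: this module is importable under this name

try:
    from StringIO import StringIO   # Python 2
except ImportError:
    from io import StringIO         # Python 3

data = b"(lp0\nI1\naI2\na."         # hand-assembled protocol 0 pickle of [1, 2]

buf = StringIO()
pickletools.dis(data, out=buf)      # nothing is printed to sys.stdout
listing = buf.getvalue()
```

# The captured text can then be searched or compared, which is handy when
# checking which opcodes a pickler emitted.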
1612
1613
1614_dis_test = """
1615>>> import pickle
1616>>> x = [1, 2, (3, 4), {'abc': u"def"}]
1617>>> pik = pickle.dumps(x)
1618>>> dis(pik)
1619 0: ( MARK
1620 1: l LIST (MARK at 0)
1621 2: p PUT 0
1622 5: I INT 1
1623 8: a APPEND
1624 9: I INT 2
1625 12: a APPEND
1626 13: ( MARK
1627 14: I INT 3
1628 17: I INT 4
1629 20: t TUPLE (MARK at 13)
1630 21: p PUT 1
1631 24: a APPEND
1632 25: ( MARK
1633 26: d DICT (MARK at 25)
1634 27: p PUT 2
1635 30: S STRING 'abc'
1636 37: p PUT 3
1637 40: V UNICODE u'def'
1638 45: p PUT 4
1639 48: s SETITEM
1640 49: a APPEND
1641 50: . STOP
1642
1643Try again with a "binary" pickle.
1644
1645>>> pik = pickle.dumps(x, 1)
1646>>> dis(pik)
1647 0: ] EMPTY_LIST
1648 1: q BINPUT 0
1649 3: ( MARK
1650 4: K BININT1 1
1651 6: K BININT1 2
1652 8: ( MARK
1653 9: K BININT1 3
1654 11: K BININT1 4
1655 13: t TUPLE (MARK at 8)
1656 14: q BINPUT 1
1657 16: } EMPTY_DICT
1658 17: q BINPUT 2
1659 19: U SHORT_BINSTRING 'abc'
1660 24: q BINPUT 3
1661 26: X BINUNICODE u'def'
1662 34: q BINPUT 4
1663 36: s SETITEM
1664 37: e APPENDS (MARK at 3)
1665 38: . STOP
1666
1667Exercise the INST/OBJ/BUILD family.
1668
1669>>> import random
1670>>> dis(pickle.dumps(random.random))
1671 0: c GLOBAL 'random random'
1672 15: p PUT 0
1673 18: . STOP
1674
1675>>> x = [pickle.PicklingError()] * 2
1676>>> dis(pickle.dumps(x))
1677 0: ( MARK
1678 1: l LIST (MARK at 0)
1679 2: p PUT 0
1680 5: ( MARK
1681 6: i INST 'pickle PicklingError' (MARK at 5)
1682 28: p PUT 1
1683 31: ( MARK
1684 32: d DICT (MARK at 31)
1685 33: p PUT 2
1686 36: S STRING 'args'
1687 44: p PUT 3
1688 47: ( MARK
1689 48: t TUPLE (MARK at 47)
1690 49: p PUT 4
1691 52: s SETITEM
1692 53: b BUILD
1693 54: a APPEND
1694 55: g GET 1
1695 58: a APPEND
1696 59: . STOP
1697
1698>>> dis(pickle.dumps(x, 1))
1699 0: ] EMPTY_LIST
1700 1: q BINPUT 0
1701 3: ( MARK
1702 4: ( MARK
1703 5: c GLOBAL 'pickle PicklingError'
1704 27: q BINPUT 1
1705 29: o OBJ (MARK at 4)
1706 30: q BINPUT 2
1707 32: } EMPTY_DICT
1708 33: q BINPUT 3
1709 35: U SHORT_BINSTRING 'args'
1710 41: q BINPUT 4
1711 43: ) EMPTY_TUPLE
1712 44: s SETITEM
1713 45: b BUILD
1714 46: h BINGET 2
1715 48: e APPENDS (MARK at 3)
1716 49: . STOP
1717
1718Try "the canonical" recursive-object test.
1719
1720>>> L = []
1721>>> T = L,
1722>>> L.append(T)
1723>>> L[0] is T
1724True
1725>>> T[0] is L
1726True
1727>>> L[0][0] is L
1728True
1729>>> T[0][0] is T
1730True
1731>>> dis(pickle.dumps(L))
1732 0: ( MARK
1733 1: l LIST (MARK at 0)
1734 2: p PUT 0
1735 5: ( MARK
1736 6: g GET 0
1737 9: t TUPLE (MARK at 5)
1738 10: p PUT 1
1739 13: a APPEND
1740 14: . STOP
1741>>> dis(pickle.dumps(L, 1))
1742 0: ] EMPTY_LIST
1743 1: q BINPUT 0
1744 3: ( MARK
1745 4: h BINGET 0
1746 6: t TUPLE (MARK at 3)
1747 7: q BINPUT 1
1748 9: a APPEND
1749 10: . STOP
1750
1751The protocol 0 pickle of the tuple causes the disassembly to get confused,
1752as it doesn't realize that the POP opcode at 16 gets rid of the MARK at 0
1753(so the output remains indented until the end). The protocol 1 pickle
1754doesn't trigger this glitch, because the disassembler realizes that
1755POP_MARK gets rid of the MARK. Doing a better job on the protocol 0
1756pickle would require the disassembler to emulate the stack.
1757
1758>>> dis(pickle.dumps(T))
1759 0: ( MARK
1760 1: ( MARK
1761 2: l LIST (MARK at 1)
1762 3: p PUT 0
1763 6: ( MARK
1764 7: g GET 0
1765 10: t TUPLE (MARK at 6)
1766 11: p PUT 1
1767 14: a APPEND
1768 15: 0 POP
1769 16: 0 POP
1770 17: g GET 1
1771 20: . STOP
1772>>> dis(pickle.dumps(T, 1))
1773 0: ( MARK
1774 1: ] EMPTY_LIST
1775 2: q BINPUT 0
1776 4: ( MARK
1777 5: h BINGET 0
1778 7: t TUPLE (MARK at 4)
1779 8: q BINPUT 1
1780 10: a APPEND
1781 11: 1 POP_MARK (MARK at 0)
1782 12: h BINGET 1
1783 14: . STOP
1784"""
1785
1786__test__ = {'disassembler_test': _dis_test,
1787 }
1788
1789def _test():
1790 import doctest
1791 return doctest.testmod()
1792
1793if __name__ == "__main__":
1794 _test()