Intro
=====

The basic rule for dealing with weakref callbacks (and __del__ methods too,
for that matter) during cyclic gc:

    Once gc has computed the set of unreachable objects, no Python-level
    code can be allowed to access an unreachable object.

If that can happen, then the Python code can resurrect unreachable objects
too, and gc can't detect that without starting over. Since gc eventually
runs tp_clear on all unreachable objects, if an unreachable object is
resurrected then tp_clear will eventually be called on it (or may already
have been called before resurrection). At best (and this has been an
historically common bug), tp_clear empties an instance's __dict__, and
"impossible" AttributeErrors result. At worst, tp_clear leaves behind an
insane object at the C level, and segfaults result (historically, most
often by setting a class's mro pointer to NULL, after which attribute
lookups performed by the class can segfault).

OTOH, it's OK to run Python-level code that can't access unreachable
objects, and sometimes that's necessary. The chief example is the callback
attached to a reachable weakref W to an unreachable object O. Since O is
going away, and W is still alive, the callback must be invoked. Because W
is still alive, everything reachable from its callback is also reachable,
so it's also safe to invoke the callback (although that's trickier than it
sounds, since other reachable weakrefs to other unreachable objects may
still exist, and be accessible to the callback -- there are lots of painful
details like this covered in the rest of this file).
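
A minimal sketch of the W/O case just described, assuming a current
CPython 3. The names Node, callback and seen are hypothetical and exist
only for this illustration. W is held by a module-level name, so it stays
reachable while O becomes cyclic trash:

```python
import gc
import weakref


class Node:
    """Hypothetical class; each instance refers to itself, forming a cycle."""

    def __init__(self):
        self.me = self


seen = []


def callback(ref):
    # By the time gc invokes this, the weakref has already been cleared,
    # so the callback cannot reach the dying object through it.
    seen.append(ref() is None)


o = Node()
w = weakref.ref(o, callback)   # W stays reachable via the module globals
del o                          # O is now unreachable cyclic trash
gc.collect()

print(seen)   # [True]
```

The callback runs exactly once, and it only ever sees a dead weakref.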

Python 2.4/2.3.5
================

The "Before 2.3.3" section below turned out to be wrong in some ways, but
I'm leaving it as-is because it's more right than wrong, and serves as a
wonderful example of how painful analysis can miss not only the forest for
the trees, but also miss the trees for the aphids sucking the trees
dry <wink>.

The primary thing it missed is that when a weakref to a piece of cyclic
trash (CT) exists, then any call to any Python code whatsoever can end up
materializing a strong reference to that weakref's CT referent, and so
possibly resurrect an insane object (one for which cyclic gc has called -- or
will call before it's done -- tp_clear()). It's not even necessarily that a
weakref callback or __del__ method does something nasty on purpose: as
soon as we execute Python code, threads other than the gc thread can run
too, and they can do ordinary things with weakrefs that end up resurrecting
CT while gc is running.

    http://www.python.org/sf/1055820

shows how innocent it can be, and also how nasty. Variants of the three
focussed test cases attached to that bug report are now part of Python's
standard Lib/test/test_gc.py.

Jim Fulton gave the best nutshell summary of the new (in 2.4 and 2.3.5)
approach:

    Clearing cyclic trash can call Python code. If there are weakrefs to
    any of the cyclic trash, then those weakrefs can be used to resurrect
    the objects. Therefore, *before* clearing cyclic trash, we need to
    remove any weakrefs. If any of the weakrefs being removed have
    callbacks, then we need to save the callbacks and call them *after* all
    of the weakrefs have been cleared.

Alas, doing just that much doesn't work, because it overlooks what turned
out to be the much subtler problems that were fixed earlier, and described
below. We do clear all weakrefs to CT now before breaking cycles, but not
all callbacks encountered can be run later. That's explained in horrid
detail below.
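
A minimal sketch of the "clear first, call back afterwards" ordering,
assuming a current CPython 3. Node, callback and observations are
hypothetical names. Two objects form one trash cycle, each with its own
reachable weakref, and every callback checks both weakrefs:

```python
import gc
import weakref


class Node:
    """Hypothetical class used only for this sketch."""


observations = []


def callback(ref):
    # Record whether each weakref was already dead when a callback ran.
    observations.append((wa() is None, wb() is None))


a, b = Node(), Node()
a.other, b.other = b, a          # a <-> b form a reference cycle
wa = weakref.ref(a, callback)
wb = weakref.ref(b, callback)
del a, b                         # the whole cycle is now trash
gc.collect()

print(observations)   # [(True, True), (True, True)]
```

Whichever callback runs first, it already finds both weakrefs dead, so
neither callback can use the other weakref to resurrect trash.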

Older text follows, with some later comments in [] brackets:

Before 2.3.3
============

Before 2.3.3, Python's cyclic gc didn't pay any attention to weakrefs.
Segfaults in Zope3 resulted.

weakrefs in Python are designed to, at worst, let *other* objects learn
that a given object has died, via a callback function. The weakly
referenced object itself is not passed to the callback, and the presumption
is that the weakly referenced object is unreachable trash at the time the
callback is invoked.
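
A minimal sketch of that design in the simple, cycle-free case, assuming
CPython's reference counting. Thing, on_death and received are
hypothetical names. The callback is handed the weakref, not the referent:

```python
import weakref


class Thing:
    """Hypothetical class used only for this sketch."""


received = []


def on_death(ref):
    # The callback receives the weakref object itself, and by now the
    # weakref is dead: calling it yields None rather than the referent.
    received.append((ref is w, ref() is None))


obj = Thing()
w = weakref.ref(obj, on_death)
del obj        # last strong reference gone; CPython reclaims obj at once

print(received)   # [(True, True)]
```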

That's usually true, but not always. Suppose a weakly referenced object
becomes part of a clump of cyclic trash. When enough cycles are broken by
cyclic gc that the object is reclaimed, the callback is invoked. If it's
possible for the callback to get at objects in the cycle(s), then it may be
possible for those objects to access (via strong references in the cycle)
the weakly referenced object being torn down, or other objects in the cycle
that have already suffered a tp_clear() call. There's no guarantee that an
object is in a sane state after tp_clear(). Bad things (including
segfaults) can happen right then, during the callback's execution, or can
happen at any later time if the callback manages to resurrect an insane
object.

[That missed that, in addition, a weakref to CT can exist outside CT, and
any callback into Python can use such a non-CT weakref to resurrect its CT
referent. The same bad kinds of things can happen then.]

Note that if it's possible for the callback to get at objects in the trash
cycles, it must also be the case that the callback itself is part of the
trash cycles. Else the callback would have acted as an external root to
the current collection, and nothing reachable from it would be in cyclic
trash either.

[Except that a non-CT callback can also use a non-CT weakref to get at
CT objects.]

More, if the callback itself is in cyclic trash, then the weakref to which
the callback is attached must also be trash, and for the same kind of
reason: if the weakref acted as an external root, then the callback could
not have been cyclic trash.

So a problem here requires that a weakref, that weakref's callback, and the
weakly referenced object, all be in cyclic trash at the same time. This
isn't easy to stumble into by accident while Python is running, and, indeed,
it took quite a while to dream up failing test cases. Zope3 saw segfaults
during shutdown, during the second call of gc in Py_Finalize, after most
modules had been torn down. That creates many trash cycles (esp. those
involving classes), making the problem much more likely. Once you
know what's required to provoke the problem, though, it's easy to create
tests that segfault before shutdown.

In 2.3.3, before breaking cycles, we first clear all the weakrefs with
callbacks in cyclic trash. Since the weakrefs *are* trash, and there's no
defined -- or even predictable -- order in which tp_clear() gets called on
cyclic trash, it's defensible to first clear weakrefs with callbacks. It's
a feature of Python's weakrefs too that when a weakref goes away, the
callback (if any) associated with it is thrown away too, unexecuted.
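
A minimal sketch of that last feature, assuming CPython's reference
counting. Thing and calls are hypothetical names. The weakref dies before
its referent, so its callback is discarded without ever being run:

```python
import weakref


class Thing:
    """Hypothetical class used only for this sketch."""


calls = []

obj = Thing()
w = weakref.ref(obj, lambda ref: calls.append("called"))
del w      # the weakref goes away first, taking its callback with it
del obj    # nothing is left to notify

print(calls)   # []
```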

[In 2.4/2.3.5, we first clear all weakrefs to CT objects, whether or not
those weakrefs are themselves CT, and whether or not they have callbacks.
The callbacks (if any) on non-CT weakrefs (if any) are invoked later,
after all weakrefs-to-CT have been cleared. The callbacks (if any) on CT
weakrefs (if any) are never invoked, for the excruciating reasons
explained here.]
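
A minimal sketch contrasting a CT weakref with a non-CT weakref, assuming
a current CPython 3. Node, inner, outer, kept and calls are hypothetical
names. One weakref is stored on its own referent and so becomes trash with
it; the other is held by a module-level name:

```python
import gc
import weakref


class Node:
    """Hypothetical class; each instance refers to itself, forming a cycle."""

    def __init__(self):
        self.me = self


calls = []

inner = Node()
# This weakref is stored on its own referent, so it becomes trash with it.
inner.wr = weakref.ref(inner, lambda ref: calls.append("weakref inside the trash"))

outer = Node()
# This weakref is held by a module-level name, so it survives the collection.
kept = weakref.ref(outer, lambda ref: calls.append("weakref outside the trash"))

del inner, outer
gc.collect()

print(calls)   # ['weakref outside the trash']
```

Only the callback on the surviving weakref is invoked; the one attached to
the trash weakref is dropped silently.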

Just that much is almost enough to prevent problems, by throwing away
*almost* all the weakref callbacks that could get triggered by gc. The
problem remaining is that clearing a weakref with a callback decrefs the
callback object, and the callback object may *itself* be weakly referenced,
via another weakref with another callback. So the process of clearing
weakrefs can trigger callbacks attached to other weakrefs, and those
latter weakrefs may or may not be part of cyclic trash.
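
A minimal sketch of a callback on a callback, outside of gc and assuming
CPython's reference counting. Thing, Callback, w1, w2 and events are
hypothetical names. The callback object is a callable instance so that it
can itself be weakly referenced:

```python
import weakref


class Thing:
    """Hypothetical class used only for this sketch."""


class Callback:
    """A callable instance, so that it can itself be weakly referenced."""

    def __call__(self, ref):
        events.append("first callback ran")


events = []

obj = Thing()
cb = Callback()
w1 = weakref.ref(obj, cb)    # w1's callback is cb
w2 = weakref.ref(cb, lambda ref: events.append("callback on the callback ran"))
del cb    # now only w1 keeps the callback object alive
del w1    # w1 goes away; cb is decref'ed without being called, and dies

print(events)   # ['callback on the callback ran']
```

Getting rid of w1 never runs w1's own callback, but it does run Python
code: the callback attached to w2.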

So, to prevent any Python code from running while gc is invoking tp_clear()
on all the objects in cyclic trash,

    [That was always wrong: we can't stop Python code from running when gc
    is breaking cycles. If an object with a __del__ method is not itself in
    a cycle, but is reachable only from CT, then breaking cycles will, as a
    matter of course, drop the refcount on that object to 0, and its __del__
    will run right then. What we can and must stop is running any Python
    code that could access CT.]

it's not quite enough just to invoke
tp_clear() on weakrefs with callbacks first. Instead the weakref module
grew a new private function (_PyWeakref_ClearRef) that does only part of
tp_clear(): it removes the weakref from the weakly-referenced object's list
of weakrefs, but does not decref the callback object. So calling
_PyWeakref_ClearRef(wr) ensures that wr's callback object will never
trigger, and (unlike weakref's tp_clear()) also prevents any callback
associated *with* wr's callback object from triggering.

[Although we may trigger such callbacks later, as explained below.]

Then we can call tp_clear on all the cyclic objects and never trigger
Python code.

[As above, not so: it means never trigger Python code that can access CT.]
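
A minimal sketch of the bracketed correction, assuming a current
CPython 3. Resource, Node and log are hypothetical names. The object with
__del__ is not in a cycle itself, but it is reachable only from cyclic
trash. A current interpreter may reach the finalizer by a different
internal path than the refcount drop described here, but the observable
result is the same: Python code runs while gc is collecting.

```python
import gc


class Resource:
    """Hypothetical object with a finalizer; it is not part of any cycle."""

    def __del__(self):
        log.append("Resource.__del__ ran")


class Node:
    """Hypothetical class; each instance refers to itself, forming a cycle."""

    def __init__(self):
        self.me = self
        self.resource = Resource()   # reachable only through this Node


log = []

node = Node()
del node          # the Node is cyclic trash; its Resource lives on until gc runs
gc.collect()      # collecting the cycle also finalizes the Resource

print(log)   # ['Resource.__del__ ran']
```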

After we do that, the callback objects still need to be decref'ed. Callbacks
(if any) *on* the callback objects that were also part of cyclic trash won't
get invoked, because we cleared all trash weakrefs with callbacks at the
start. Callbacks on the callback objects that were not part of cyclic trash
acted as external roots to everything reachable from them, so nothing
reachable from them was part of cyclic trash, so gc didn't do any damage to
objects reachable from them, and it's safe to call them at the end of gc.

[That's so. In addition, now we also invoke (if any) the callbacks on
non-CT weakrefs to CT objects, during the same pass that decrefs the
callback objects.]

An alternative would have been to treat objects with callbacks like objects
with __del__ methods, refusing to collect them, appending them to gc.garbage
instead. That would have been much easier. Jim Fulton gave a strong
argument against that (on Python-Dev):

    There's a big difference between __del__ and weakref callbacks.
    The __del__ method is "internal" to a design. When you design a
    class with a del method, you know you have to avoid including the
    class in cycles.

    Now, suppose you have a design that has no __del__ methods but
    that does use cyclic data structures. You reason about the design,
    run tests, and convince yourself you don't have a leak.

    Now, suppose some external code creates a weakref to one of your
    objects. All of a sudden, you start leaking. You can look at your
    code all you want and you won't find a reason for the leak.

    IOW, a class designer can out-think __del__ problems, but has no control
    over who creates weakrefs to his classes or class instances. The class
    user has little chance either of predicting when the weakrefs he creates
    may end up in cycles.

Callbacks on weakref callbacks are executed in an arbitrary order, and
that's not good (a primary reason not to collect cycles with objects with
__del__ methods is to avoid running finalizers in an arbitrary order).
However, a weakref callback on a weakref callback has got to be rare.
It's possible to do such a thing, so gc has to be robust against it, but
I doubt anyone has done it outside the test case I wrote for it.

[The callbacks (if any) on non-CT weakrefs to CT objects are also executed
in an arbitrary order now. But they were before too, depending on the
vagaries of when tp_clear() happened to break enough cycles to trigger
them. People simply shouldn't try to use __del__ or weakref callbacks to
do fancy stuff.]