Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#13896 closed defect (fixed)

Fix cython's gc_track and gc_untrack

Reported by: nbruin Owned by: rlm
Priority: blocker Milestone: sage-5.6
Component: memleak Keywords:
Cc: SimonKing, jpflori Merged in: sage-5.6.beta3
Authors: Robert Bradshaw Reviewers: Jeroen Demeyer
Report Upstream: Completely fixed; Fix reported upstream Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Description (last modified by jdemeyer)

In a long sage-devel thread we eventually found in this message that a GC during a weakref callback on a Cython class can lead to double deallocation of an instance of that class. In Python's Objects/typeobject.c, line 1024 and onwards, there are some comments indicating that earlier versions of Python were bitten by this problem too. The solution is to insert the appropriate PyObject_GC_UnTrack and PyObject_GC_Track calls in Cython's deallocation code. This is best fixed in Cython itself.
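The triggering condition can be sketched in pure Python. This is only an illustration of the control flow, assuming a plain Python class `Node` as a hypothetical stand-in for a Cython class: a weakref callback runs while its referent is being torn down and forces a collection at that moment. In pure Python this is harmless; the crash discussed here additionally needs an extension type whose dealloc leaves the object visible to the collector.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a weakref-able Cython class."""


events = []


def on_dead(ref):
    # Called while the referent is being deallocated.
    events.append("callback")
    # Force a full collection in the middle of that deallocation.
    gc.collect()
    events.append("collected")


obj = Node()
ref = weakref.ref(obj, on_dead)

# Dropping the last strong reference starts deallocation, which clears
# the weak reference and invokes the callback before the memory is freed.
del obj

print(events, ref())
```

On CPython the callback fires as part of `del obj`, so the forced collection really does run while the object is half torn down.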

Install only the new spkg at http://boxen.math.washington.edu/home/jdemeyer/spkg/cython-0.17.4.spkg

Attachments (3)

double-free-crash.patch (924 bytes) - added by nbruin 7 years ago.
Patch to more reliably produce crash
cython-0.17.3.p0.diff (2.3 KB) - added by jpflori 7 years ago.
double_dealloc_T796.pyx (1.2 KB) - added by nbruin 7 years ago.
Robert's cython test case (I spent quite some time twice to find it, so I'm storing it here for future reference)

Change History (34)

Changed 7 years ago by nbruin

Patch to more reliably produce crash

comment:1 Changed 7 years ago by nbruin

With attached patch applied to 5.6.beta2 (and probably also other versions close to it),

sage -t devel/sage/sage/modules/module.pyx

will crash relatively reliably on several machines (including sage.math)

comment:2 Changed 7 years ago by SimonKing

  • Cc SimonKing added

comment:3 follow-up: Changed 7 years ago by jpflori

  • Cc jpflori added

I'd like to see this ticket as a blocker, anyone against this idea?

comment:4 in reply to: ↑ 3 Changed 7 years ago by nbruin

Replying to jpflori:

I'd like to see this ticket as a blocker, anyone against this idea?

Since this is the ultimate "can generate segfaults anywhere", it's a prime candidate for blocker status. However, we're fully at the mercy of cython developers as to when this gets fixed. Also, if we release with this bug unfixed, we might as well leave #715 in too, since this one has a much wider possible impact :-).

comment:5 Changed 7 years ago by jpflori

  • Priority changed from major to blocker

OK, I've marked it as a blocker.

For those who want to play while waiting for upstream, I've posted a p0 Cython spkg which does "something" with PyObject_GC_[Un]Track. Not sure it makes any sense, but it seems to make our bug disappear. It's at http://boxen.math.washington.edu/home/jpflori/cython-0.17.3.p0.spkg

comment:6 Changed 7 years ago by nbruin

  • Description modified (diff)

Apologies. I saw I linked to the wrong file. Include/object.h also has some interesting information, but it looks like it is a bit out-of-date on some bits. In particular, if you look at the actual use of the TRASHCAN macros:

    PyObject_GC_UnTrack(self);
    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_BEGIN(self);
    --_PyTrash_delete_nesting;
...
  endlabel:
    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_END(self);
    --_PyTrash_delete_nesting;

with the explanation a little lower:

       Q. Why the bizarre (net-zero) manipulation of
          _PyTrash_delete_nesting around the trashcan macros?

       A. Some base classes (e.g. list) also use the trashcan mechanism.
          The following scenario used to be possible:

          - suppose the trashcan level is one below the trashcan limit

          - subtype_dealloc() is called

          - the trashcan limit is not yet reached, so the trashcan level
            is incremented and the code between trashcan begin and end is
            executed

          - this destroys much of the object's contents, including its
            slots and __dict__

          - basedealloc() is called; this is really list_dealloc(), or
            some other type which also uses the trashcan macros

          - the trashcan limit is now reached, so the object is put on the
            trashcan's to-be-deleted-later list

          - basedealloc() returns

          - subtype_dealloc() decrefs the object's type

          - subtype_dealloc() returns

          - later, the trashcan code starts deleting the objects from its
            to-be-deleted-later list

          - subtype_dealloc() is called *AGAIN* for the same object

          - at the very least (if the destroyed slots and __dict__ don't
            cause problems) the object's type gets decref'ed a second
            time, which is *BAD*!!!

          The remedy is to make sure that if the code between trashcan
          begin and end in subtype_dealloc() is called, the code between
          trashcan begin and end in basedealloc() will also be called.
          This is done by decrementing the level after passing into the
          trashcan block, and incrementing it just before leaving the
          block.

          But now it's possible that a chain of objects consisting solely
          of objects whose deallocator is subtype_dealloc() will defeat
          the trashcan mechanism completely: the decremented level means
          that the effective level never reaches the limit.  Therefore, we
          *increment* the level *before* entering the trashcan block, and
          matchingly decrement it after leaving.  This means the trashcan
          code will trigger a little early, but that's no big deal.

It's probably better to leave out the trashcan for now. It seems like rather tricky code, and I'm not sure it's part of the official Python C-API (it might be something internal, just as they themselves use some macros that they consider unsafe for use in extension modules).
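The scenario in the quoted comment can be replayed with a toy model. This is a sketch, not CPython's implementation: `LIMIT`, `Trashcan`, `base_dealloc`, `subtype_dealloc` and `run` are hypothetical names, the limit is kept tiny, and the trashcan macros are reduced to a counter check plus a deferred list. The `dance` flag switches the net-zero adjustment of the nesting level on and off.

```python
LIMIT = 3  # hypothetical trashcan limit, kept tiny for illustration


class Trashcan:
    def __init__(self):
        self.nesting = 0   # models _PyTrash_delete_nesting
        self.later = []    # models the to-be-deleted-later list


def base_dealloc(tc, obj, log):
    # Models list_dealloc(): bare trashcan macros, no adjustment.
    if tc.nesting < LIMIT:
        tc.nesting += 1
        log.append("base body")
        tc.nesting -= 1
    else:
        tc.later.append(obj)   # deferred: dealloc will restart later


def subtype_dealloc(tc, obj, log, dance):
    pre = 1 if dance else 0
    tc.nesting += pre              # ++ before the BEGIN macro
    if tc.nesting < LIMIT:
        tc.nesting += 1            # done by the BEGIN macro
        tc.nesting -= pre          # -- right after BEGIN
        log.append("subtype body")
        base_dealloc(tc, obj, log)
        tc.nesting += pre          # ++ right before END
        tc.nesting -= 1            # done by the END macro
    else:
        tc.later.append(obj)
    tc.nesting -= pre              # -- after the END macro


def run(dance):
    tc = Trashcan()
    tc.nesting = LIMIT - 1         # one below the limit, as in the scenario
    log = []
    subtype_dealloc(tc, "obj", log, dance)
    tc.nesting = 0                 # outer deallocations have unwound
    while tc.later:
        subtype_dealloc(tc, tc.later.pop(), log, dance)
    return log.count("subtype body")


print(run(dance=False), run(dance=True))
```

In this model the subtype body runs twice without the adjustment and once with it, because the adjusted check defers the object before any of it has been torn down.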

comment:7 follow-up: Changed 7 years ago by jpflori

I saw and read about these additional steps surrounding the macros, but I was not sure they were also needed here.

Anyway, I agree it is better to leave that out for now; upstream will decide what is best.

So I've updated the spkg to not include the trashcan parts.

Changed 7 years ago by jpflori

comment:8 in reply to: ↑ 7 ; follow-up: Changed 7 years ago by nbruin

Replying to jpflori:

I saw and read about these additional steps surrounding the macros, but I was not sure they were also needed here.

In fact, I think the precautions taken are not enough for general cython classes. With the little

    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_BEGIN(self);
    --_PyTrash_delete_nesting;
    ...
    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_END(self);
    --_PyTrash_delete_nesting;

dance they are making sure there is room for one extra level of trashcan nesting, provided that the nested call doesn't use the same trick. However, a Cython class could have a whole inheritance hierarchy going here (whose deallocs would all use this trick too!), so I'm pretty sure that the exact scenario they describe could still happen. You'd need to know the depth of the inheritance chain (for deallocs, multiple inheritance can't happen, right?) and ensure there's enough room for all those levels.
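The argument above can be checked with a small counting model. It is a sketch under simplifying assumptions: `LIMIT` and `dealloc` are hypothetical, and each level of the dealloc chain is reduced to its trashcan check, where a "dance" level tests `nesting + 1` against the limit and a "plain" level tests `nesting`. In both cases the body, and hence the call to the base dealloc, runs one level deeper.

```python
LIMIT = 5  # hypothetical trashcan limit, chosen small for illustration


def dealloc(kinds, nesting, log):
    """Walk a dealloc chain, most-derived type first.

    Each entry of `kinds` is "dance" (the ++/-- adjustment around the
    trashcan macros) or "plain" (bare macros). Returns True if some
    level defers the object after a more-derived body has already run,
    which is the double-deallocation scenario.
    """
    if not kinds:
        return False
    kind, rest = kinds[0], kinds[1:]
    level_checked = nesting + 1 if kind == "dance" else nesting
    if level_checked >= LIMIT:
        # Object goes on the delete-later list; harmful only if part of
        # it has already been destroyed.
        return bool(log)
    log.append(kind)
    return dealloc(rest, nesting + 1, log)


# A dancing subtype over a plain base is safe one step below the edge.
print(dealloc(["dance", "plain"], LIMIT - 2, []))
# Two dancing levels at the same starting depth are not.
print(dealloc(["dance", "dance"], LIMIT - 2, []))
```

In this model one adjustment buys room for exactly one plain base level, so a chain in which every level uses the same trick can still defer a partially destroyed object.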

comment:9 Changed 7 years ago by robertwb

comment:10 Changed 7 years ago by jpflori

Just one potentially naive question: shouldn't the object get retracked iff you're going to call another dealloc method? Or conversely, if the type does not extend a previous type, shouldn't the object stay untracked when you call tp_free? I'm not sure it would really matter if the object is still tracked in this latter case, but I got this feeling when staring at CPython's code today.

Anyway, it made me wonder what happens if your extension class is GC tracked but the base class is not. In this case you're lost, because if you track your object before calling the base dealloc, then it will not be untracked there. Is that even possible? And anyway, if a class is not GC tracked, or is not a container, I guess it cannot be weakrefed...

comment:11 follow-up: Changed 7 years ago by robertwb

The final call to the (generic) tp_free calls PyObject_GC_UnTrack iff the GC flags are set in the type flags. If the base class is not GC tracked then its dealloc method won't touch these bits.
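Whether a given object currently participates in cyclic GC can be observed from Python with `gc.is_tracked`. The sketch below only illustrates the tracked/untracked distinction for built-in types; it says nothing about what Cython-generated types do, and the variable names are made up for the example.

```python
import gc

atomic_int = 42          # atomic object: cannot be part of a cycle
atomic_str = "spkg"      # likewise atomic
container = [atomic_int] # container: may hold references to other objects

status = {
    "int": gc.is_tracked(atomic_int),
    "str": gc.is_tracked(atomic_str),
    "list": gc.is_tracked(container),
}
print(status)
```

Atomic objects such as ints and strings are not tracked, while a list is, which is the same split the type flags encode on the C side.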

comment:12 Changed 7 years ago by jpflori

Thanks for pointing that out.

comment:13 follow-up: Changed 7 years ago by robertwb

Spkg up at http://sage.math.washington.edu/home/robertwb/patches/cython-0.17.4pre.spkg , if this looks good I'll cut a release and make an actual spkg based on that.

comment:14 in reply to: ↑ 8 ; follow-up: Changed 7 years ago by nbruin

trashcan issues now tracked on #13901 (yes, you can easily crash cython because it's not using the trashcan)

comment:15 in reply to: ↑ 13 Changed 7 years ago by nbruin

Replying to robertwb:

Spkg up at http://sage.math.washington.edu/home/robertwb/patches/cython-0.17.4pre.spkg , if this looks good I'll cut a release and make an actual spkg based on that.

This does look good to me. JP has already confirmed that this fixed the issue (as does your elegant test in the cython suite). Your pre.spkg has some different files in it, but I guess that's why you don't consider it an actual spkg.

comment:16 in reply to: ↑ 11 ; follow-up: Changed 7 years ago by jpflori

Replying to robertwb:

The final call to the (generic) tp_free calls PyObject_GC_UnTrack iff the GC flags are set in the type flags. If the base class is not GC tracked then its dealloc method won't touch these bits.

Sorry to insist a little bit, but while looking at the trashcan stuff, I thought again about it and in fact what I was worried about was rather the converse.

If the base type does not have the GC_FLAG, and you've retracked it in the subclass, then the final tp_free will indeed not touch anything related to gc, but won't that leave an invalid object in the gc tracked object list? In particular, won't a call to gc_list_remove(o) be missing?

comment:17 in reply to: ↑ 16 ; follow-up: Changed 7 years ago by robertwb

Replying to jpflori:

Replying to robertwb:

The final call to the (generic) tp_free calls PyObject_GC_UnTrack iff the GC flags are set in the type flags. If the base class is not GC tracked then its dealloc method won't touch these bits.

Sorry to insist a little bit, but while looking at the trashcan stuff, I thought again about it and in fact what I was worried about was rather the converse.

If the base type does not have the GC_FLAG, and you've retracked it in the subclass, then the final tp_free will indeed not touch anything related to gc, but won't that leave an invalid object in the gc tracked object list? In particular, won't a call to gc_list_remove(o) be missing?

The base tp_free looks at the actual type's flags (which will have GC_FLAG set) to determine what gc (un)tracking to do. Any intermediate superclasses will either leave this alone or do the untrack/track dance.

comment:18 in reply to: ↑ 14 Changed 7 years ago by robertwb

Replying to nbruin:

trashcan issues now tracked on #13901 (yes, you can easily crash cython because it's not using the trashcan)

Yeah, this is a separate (and more complicated to resolve) issue.

comment:19 in reply to: ↑ 17 Changed 7 years ago by nbruin

Replying to robertwb:

The base tp_free looks at the actual type's flags (which will have GC_FLAG set) to determine what gc (un)tracking to do. Any intermediate superclasses will either leave this alone or do the untrack/track dance.

... so suppose we have a superclass that doesn't do the untrack/track dance (so this must be a non-container superclass of a container class. We're entering rather hypothetical territory here). We'll be entering its dealloc with tracking SET. I guess the actual memory free happens by our class, so I guess the list of GC-tracked objects will be properly amended eventually. Can we prove that no GC or trashcan-shelving of this intermediate object will happen in between? I guess it's unlikely because non-container types should be easy to deallocate ... unless some callous person writes an extension class that does hold references to other objects but is convinced that those will never lead to cycles and hence makes it non-GC-tracked. Some weakref callbacks and a GC could then find a partially torn down object tracked by the GC. Multithreaded stuff could make this even worse, but I guess we're protected by the GIL here.

It should probably be mandated that any container type has to participate in GC. For a non-container type it's hard to see how a dealloc could ever be interrupted or interleaved by a GC. So this note is probably more a request for clarification (addition to documentation somewhere?) why this is not a problem than a diagnosis of a bug.

Last edited 7 years ago by nbruin

comment:20 follow-up: Changed 7 years ago by robertwb

I think it helps to look at the generated code. Suppose one has

cdef class A: ...
cdef class B(A): ...
cdef class C(B): ...
...

In this case one has, roughly,

tp_dealloc_A(self) {
   [optional untrack]
   bodyA
   [optional track]
   Py_TYPE(self)->tp_free(self)
}

tp_dealloc_B(self) {
   [optional untrack]
   bodyB
   [optional track]
   tp_dealloc_A(self)
}

tp_dealloc_C(self) {
   [optional untrack]
   bodyC
   [optional track]
   tp_dealloc_B(self)
}

...

bodyX consists of decrefing Python members, traversing weakrefs, and (if present)

Py_REFCNT(self)++;
X.__dealloc__(self);
Py_REFCNT(self)--;

The track/untrack markers are added exactly when Python/weakref members are present, which is where a garbage collection might happen. (When executing __dealloc__ the refcount is incremented, also preventing garbage collection of the object.)

What could be an issue is a non-gc-tracked container class that is subclassed by a gc-tracked class, but we don't have those in Cython.
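The chained deallocs above can be mimicked with a toy model to see what a collection triggered inside each body would observe. This is a sketch only: the set `tracked` stands in for the collector's list of tracked objects, and `body`, `make_dealloc` and `tp_free` are hypothetical helpers, not generated Cython code. Every level is assumed to take the optional untrack/track steps.

```python
tracked = set()    # stand-in for the collector's list of tracked objects
seen_by_gc = []    # (level, was the object tracked?) at each body


def body(name, obj):
    # A decref or weakref callback here could start a collection;
    # record whether the collector would still see the object.
    seen_by_gc.append((name, obj in tracked))


def make_dealloc(name, base_dealloc):
    def dealloc(obj):
        tracked.discard(obj)   # [optional untrack]
        body(name, obj)
        tracked.add(obj)       # [optional track]
        base_dealloc(obj)      # chain to the base type's dealloc
    return dealloc


def tp_free(obj):
    tracked.discard(obj)       # final untrack before memory is released


dealloc_A = make_dealloc("A", tp_free)
dealloc_B = make_dealloc("B", dealloc_A)
dealloc_C = make_dealloc("C", dealloc_B)

tracked.add("obj")
dealloc_C("obj")
print(seen_by_gc, "obj" in tracked)
```

In this model the object is never visible to the collector while any body runs, and it is untracked again by the time the memory is released.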

comment:21 in reply to: ↑ 20 Changed 7 years ago by jpflori

What could be an issue is a non-gc-tracked container class that is subclassed by a gc-tracked class, but we don't have those in Cython.

That is exactly what I was thinking about, and IIRC what is looked for in the CPython subtype_dealloc when looking for the base type.

If you say it cannot happen in Cython, I'm very happy with that!

comment:22 Changed 7 years ago by jpflori

Are you sure this is the case, e.g., for category_object and sage_object? I see a TPFLAGS_HAVE_GC on the former but not on the latter.

Changed 7 years ago by nbruin

Robert's cython test case (I spent quite some time twice to find it, so I'm storing it here for future reference)

comment:23 Changed 7 years ago by vbraun

comment:24 Changed 7 years ago by robertwb

  • Description modified (diff)
  • Status changed from new to needs_review

comment:25 Changed 7 years ago by jdemeyer

  • Authors set to Robert Bradshaw
  • Status changed from needs_review to needs_work

Typo in the version number:

=== cython-0.17.3 (Robert Bradshaw, 3 January 2013) ===

should be

=== cython-0.17.4 (Robert Bradshaw, 3 January 2013) ===

comment:26 Changed 7 years ago by jdemeyer

  • Description modified (diff)
  • Reviewers set to Jeroen Demeyer
  • Status changed from needs_work to positive_review

Fixed SPKG.txt.

comment:27 Changed 7 years ago by jdemeyer

  • Report Upstream changed from Reported upstream. Developers acknowledge bug. to Completely fixed; Fix reported upstream

comment:28 Changed 7 years ago by robertwb

D'oh. Thanks.

comment:29 Changed 7 years ago by jdemeyer

  • Merged in set to sage-5.6.beta3
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:30 Changed 7 years ago by jdemeyer

I have not seen any more segmentation faults regarding #715, so this might have fixed it.

comment:31 Changed 7 years ago by vbraun

Yay! Congratulations to everybody and a special thanks to Simon for pushing the weak caches!
