Opened 9 years ago

Last modified 9 years ago

#13896 closed defect

Fix cython's gc_track and gc_untrack — at Version 6

Reported by: nbruin Owned by: rlm
Priority: blocker Milestone: sage-5.6
Component: memleak Keywords:
Cc: SimonKing, jpflori Merged in:
Authors: Reviewers:
Report Upstream: Completely fixed; Fix reported upstream Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Status badges

Description (last modified by nbruin)

In a long sage-devel thread we eventually found in this message that a GC during a weakref callback on a Cython class can lead to double deallocation of that class. In Python's Objects/typeobject.c, line 1024 and onwards, there are some comments that indicate that earlier version of Python were bitten by this problem too. The solution is to insert the appropriate PyObject_GC_Untrack and PyObject_GC_Track in cython's deallocation code. This is best fixed in cython itself.

Change History (7)

Changed 9 years ago by nbruin

Patch to more reliably produce crash

comment:1 Changed 9 years ago by nbruin

With attached patch applied to 5.6.beta2 (and probably also other versions close to it),

sage -t devel/sage/sage/modules/module.pyx

will crash relatively reliably on several machines (including sage.math)

comment:2 Changed 9 years ago by SimonKing

  • Cc SimonKing added

comment:3 follow-up: Changed 9 years ago by jpflori

  • Cc jpflori added

I'd like to see this ticket as a blocker, anyone against this idea?

comment:4 in reply to: ↑ 3 Changed 9 years ago by nbruin

Replying to jpflori:

I'd like to see this ticket as a blocker, anyone against this idea?

Since this is the ultimate "can generate segfaults anywhere", it's a prime candidate for blocker status. However, we're fully at the mercy of cython developers as to when this gets fixed. Also, if we release with this bug unfixed, we might as well leave #715 in too, since this one has a much wider possible impact :-).

comment:5 Changed 9 years ago by jpflori

  • Priority changed from major to blocker

Ok, Ive put it as blocker.

For those who want to play while waiting for upstream, I've posted a p0 Cython spkg which does "something" with PyObject_GC_[Un]Track. Not sure it makes any sense, but it seems to make our bug disappear. It's at http://boxen.math.washington.edu/home/jpflori/cython-0.17.3.p0.spkg

comment:6 Changed 9 years ago by nbruin

  • Description modified (diff)

Apologies. I saw I linked to the wrong file. Include/object.h also has some interesting information, but it looks like it is a bit out-of-date on some bits. In particular, if you look at the actual use of the TRASHCAN macros:

    PyObject_GC_UnTrack(self);
    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_BEGIN(self);
    --_PyTrash_delete_nesting;
...
  endlabel:
    ++_PyTrash_delete_nesting;
    Py_TRASHCAN_SAFE_END(self);
    --_PyTrash_delete_nesting;

with the explanation a little lower:

       Q. Why the bizarre (net-zero) manipulation of
          _PyTrash_delete_nesting around the trashcan macros?

       A. Some base classes (e.g. list) also use the trashcan mechanism.
          The following scenario used to be possible:

          - suppose the trashcan level is one below the trashcan limit

          - subtype_dealloc() is called

          - the trashcan limit is not yet reached, so the trashcan level
        is incremented and the code between trashcan begin and end is
        executed

          - this destroys much of the object's contents, including its
        slots and __dict__

          - basedealloc() is called; this is really list_dealloc(), or
        some other type which also uses the trashcan macros

          - the trashcan limit is now reached, so the object is put on the
        trashcan's to-be-deleted-later list

          - basedealloc() returns

          - subtype_dealloc() decrefs the object's type

          - subtype_dealloc() returns

          - later, the trashcan code starts deleting the objects from its
        to-be-deleted-later list

          - subtype_dealloc() is called *AGAIN* for the same object

          - at the very least (if the destroyed slots and __dict__ don't
        cause problems) the object's type gets decref'ed a second
        time, which is *BAD*!!!

          The remedy is to make sure that if the code between trashcan
          begin and end in subtype_dealloc() is called, the code between
          trashcan begin and end in basedealloc() will also be called.
          This is done by decrementing the level after passing into the
          trashcan block, and incrementing it just before leaving the
          block.

          But now it's possible that a chain of objects consisting solely
          of objects whose deallocator is subtype_dealloc() will defeat
          the trashcan mechanism completely: the decremented level means
          that the effective level never reaches the limit.      Therefore, we
          *increment* the level *before* entering the trashcan block, and
          matchingly decrement it after leaving.  This means the trashcan
          code will trigger a little early, but that's no big deal.

It's probably better to leave out the trashcan for now. It seems like rather tricky code and I'm not sure it's part of the official Python C-API (it might be something internal, just like they use some macros themselves they find unsafe for use in extension modules)

Note: See TracTickets for help on using tickets.