Opened 7 years ago

Closed 7 years ago

#13904 closed defect (duplicate)

Better deletion of items of TripleDict

Reported by: SimonKing Owned by: rlm
Priority: major Milestone: sage-duplicate/invalid/wontfix
Component: memleak Keywords:
Cc: nbruin, vbraun, jpflori Merged in:
Authors: Reviewers: Simon King, Jean-Pierre Flori
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: #13896, #12313 Stopgaps:

Description (last modified by jdemeyer)

In #715, TripleDict has been modified, so that the key items are compared by identity rather than equality, and so that weak keys are used when possible.

However, as it turns out, the deletion of the items tracked in a TripleDict was done improperly. The aim of this ticket is to fix that.

Superseded by #13387.

Attachments (4)

trac_13904_fix_del.patch (632 bytes) - added by SimonKing 7 years ago.
Attempt to fix TripleDict.__delitem__
trac_13904_debug_only.patch (1002 bytes) - added by SimonKing 7 years ago.
For debugging only. Increase the likelyhood that a gc occurs during a weakref callback
test.sage (1.7 KB) - added by nbruin 7 years ago.
Script that tries to cause a double dealloc but fails because python is mature
trac13904_use_del.patch (833 bytes) - added by SimonKing 7 years ago.
Use del rather then calling __delitem__ directly, in TripleDict and MonoDict

Download all attachments as: .zip

Change History (38)

Changed 7 years ago by SimonKing

Attempt to fix TripleDict.__delitem__

comment:1 follow-ups: Changed 7 years ago by nbruin

There's the more severe problem that during GC cleanup, a weakref callback could lead to the deletion of a non-weakreffable key, which could retrigger the dealloc of the object that has just generated the weakref. That would lead to a double dealloc.

The following might be a solution:

  • equip TripleDict? with an attribute self.deathrow=[].
  • when TripleDictEraser? finds itself removing a non-weakreffed key (other than None, int, float, etc), it appends the key to self.deathrow instead.

This way, we don't trigger the deletion (and the whole cascade that may follow) in the callback.

Deathrow will be cleaned up eventually, as part of the regular cleanup code for deallocation of TripleDict?.

The problem is that weakref callback operates in a strange environment: There is a dealloc call higher up the stack somewhere. Normally python code doesn't run under those circumstances. It shouldn't make much difference, but you should absolutely not delete that object again. Hence, I think the default should be: Don't delete objects in callback. You can modify this rule, but only on a case-by-case basis, by carefully reasoning that the deletions that you do allow won't lead to nasty unpredictable deletion cascades.

comment:2 Changed 7 years ago by nbruin

I don't think we need this ticket for the __delitem__ optimization. That can be wrapped into #13387.

comment:3 Changed 7 years ago by SimonKing

Nils, indeed it might be an idea to use the ideas from #13387. However, the ticket here is not about optimization.

Anyway.

I am starting with Sage's debug version as in #13864. With the first patch applied, one gets a crash in modules/module.pyx

With both patches applied, one gets

sage -t  "devel/sage-main/sage/modules/module.pyx"          
         [3.5 s]

So, that looks like a progress.

However, at least with a slightly different version of the first patch, make ptest still results in

        sage -t  -force_lib devel/sage/sage/schemes/elliptic_curves/ell_point.py # Killed/crashed
        sage -t  -force_lib devel/sage/sage/algebras/free_algebra_quotient.py # Killed/crashed
        sage -t  -force_lib devel/sage/sage/algebras/free_algebra_quotient_element.py # Killed/crashed
        sage -t  -force_lib devel/sage/sage/categories/modules_with_basis.py # Killed/crashed
        sage -t  -force_lib devel/sage/sage/categories/homset.py # 1 doctests failed

when both patches are applied. Note, however, that I can not reproduce these crashes when I run them independently.

Hence, one needs to dig deeper. See comment:1.

comment:4 in reply to: ↑ 1 ; follow-up: Changed 7 years ago by SimonKing

Replying to nbruin:

There's the more severe problem that during GC cleanup, a weakref callback could lead to the deletion of a non-weakreffable key, which could retrigger the dealloc of the object that has just generated the weakref. That would lead to a double dealloc.

Ah, you say that the problem actually isn't so much the __delitem__ but the callback? OK, I'll have a look into that. And of course we should try to rebase #13387.

comment:5 Changed 7 years ago by SimonKing

Oops, I just notice that the patch attempts to fix TripleDictEraser.__call__, hence, the callback, but not TripleDict.__delitem__. So, I did have a look into the callback function...

comment:6 in reply to: ↑ 4 Changed 7 years ago by nbruin

Replying to SimonKing:

Ah, you say that the problem actually isn't so much the __delitem__ but the callback? OK, I'll have a look into that. And of course we should try to rebase #13387.

The __delitem__ isn't wrong. It just happens to be slightly more memory intensive that the straight del, so it happened to trigger a GC. The crash was due to Cython not handling GC during deallocation properly. That's getting fixed with #13896. The potential issue I described above hasn't been observed. I suspect that one could engineer an example. Perhaps one should (Robert's cython test case for #13896 can serve as a template). It's just that memory management is such a sensitive business that one should really (informally) prove bad scenarios can't happen rather than wait until they do. I think the scenario I describe can happen and I don't see what measures in the python code would prevent it.

One way to solve this would be to forbid non-weakreffable container types to be used as keys. The problem is that I can't think of a reliable test for that. Since the consequences of violating might lead to incredibly insidious memory corruptions, I think we should guard against it. A consequence is that any non-weakreffable (non-whitelisted such as None or int) key automatically has its life span bounded below by the lifetime of the TripleDict, even if the entry in which it served got removed due to a weakref dying (explicit key deletions would be fine). I don't see how we could get hold of a hook where clearing deathrow would be safe.

comment:7 in reply to: ↑ 1 ; follow-up: Changed 7 years ago by SimonKing

Replying to nbruin:

There's the more severe problem that during GC cleanup, a weakref callback could lead to the deletion of a non-weakreffable key, which could retrigger the dealloc of the object that has just generated the weakref. That would lead to a double dealloc.

Is what you describe as follows?

Let's say we have a key (A,B,C), where A is weakrefed and got deleted, so that the key (A,B,C) gets removed by means of A's callback function. Now let B be non-weakrefable. Assume further that the key triple is the last reference to B, and assume that B has an attribute that constitutes a (strong? weak?) reference to A. Then, deletion of B would trigger a second deletion of A.

Or did I misunderstand? Can you elaborate a bit more, please?

comment:8 in reply to: ↑ 7 Changed 7 years ago by nbruin

Replying to SimonKing:

Let's say we have a key (A,B,C), where A is weakrefed and got deleted, so that the key (A,B,C) gets removed by means of A's callback function. Now let B be non-weakrefable. Assume further that the key triple is the last reference to B, and assume that B has an attribute that constitutes a (strong? weak?) reference to A. Then, deletion of B would trigger a second deletion of A.

Yes, that's correct. It would have to be a strong reference from B to A, because deletion of weakrefs should not trigger deletion cascades. Of course, that means the reference count to A isn't 0, so normally A would not be eligible for deletion ... but if A and B are found to be cyclic garbage, deallocation could still occur. We'd be at the mercy of GC for how the chain gets broken up and deleted.

In fact, moving B to deathrow could be tricky, because that would resurrect B (generally frowned upon), albeit only within the cyclic garbage. OK, I'll try and see if I can make a testcase that triggers this behaviour. At this point I don't even know how to fix this. Perhaps we should just forbid non-weakreffable keys that refer to another key component. Mind you, that forbids

(None, ZZ, ZZ(1) )

as a key, which doesn't immediately look completely unlikely. We don't generate keys like that, but someone else might ...

comment:9 in reply to: ↑ 1 ; follow-up: Changed 7 years ago by SimonKing

Replying to nbruin:

There's the more severe problem that during GC cleanup, a weakref callback could lead to the deletion of a non-weakreffable key, which could retrigger the dealloc of the object that has just generated the weakref. That would lead to a double dealloc.

PS: Why can this only happen with a non-weakreffable key? In particular, what are the keys in our crashing example? Namely, in its current use, I don't think that there are any non-weakreffable keys other than None and ints.

The following might be a solution:

  • equip TripleDict? with an attribute self.deathrow=[].
  • when TripleDictEraser? finds itself removing a non-weakreffed key (other than None, int, float, etc), it appends the key to self.deathrow instead.

This way, we don't trigger the deletion (and the whole cascade that may follow) in the callback.

Deathrow will be cleaned up eventually, as part of the regular cleanup code for deallocation of TripleDict?.

The problem is that weakref callback operates in a strange environment: There is a dealloc call higher up the stack somewhere. Normally python code doesn't run under those circumstances. It shouldn't make much difference, but you should absolutely not delete that object again. Hence, I think the default should be: Don't delete objects in callback.

One could provide a new method TripleDict.kill_deathrow, that will be called in any other method of TripleDict. Hence, the deletions would occur when the TripleDict will be used another time, and in particular this will only happen after the "dealloc call higher up the stack" has been completed.

Perhaps kill_deathrow could also be called at the beginning of the callback function - so that it performs the deletions that were scheduled by the preceding callback function.

Problem could be a severe slow-down.

comment:10 in reply to: ↑ 9 ; follow-up: Changed 7 years ago by nbruin

Replying to SimonKing:

PS: Why can this only happen with a non-weakreffable key? In particular, what are the keys in our crashing example? Namely, in its current use, I don't think that there are any non-weakreffable keys other than None and ints.

The cython-induced crash has only weakreffable keys. The hypothetical behaviour has not been observed.

One could provide a new method TripleDict.kill_deathrow, that will be called in any other method of TripleDict. Hence, the deletions would occur when the TripleDict will be used another time, and in particular this will only happen after the "dealloc call higher up the stack" has been completed.

But arbitrary code can be executed during weakref callbacks, including arbitrary TripleDict operations. So you'd never be sure if it's safe to kill deathrow.

Really, this is starting to look more like a design deficiency in CPython: it should really mark objects on which dealloc is running to prevent GC, decref, whatever, from reentering dealloc. Preventing this by code protocol just seems to unduly restrict allowed data structures.

Perhaps kill_deathrow could also be called at the beginning of the callback function - so that it performs the deletions that were scheduled by the preceding callback function.

You wouldn't be sure that deathrow entries are now safe to delete ...

EXECUTIVE DECISION: we mandate that key components to TripleDict be either weakreffable or do not hold references to other objects (i.e., are non container-types). Perhaps we can relax this rule once we come up with a smart mechanism.

comment:11 in reply to: ↑ 10 Changed 7 years ago by SimonKing

Replying to nbruin:

Replying to SimonKing:

PS: Why can this only happen with a non-weakreffable key? In particular, what are the keys in our crashing example? Namely, in its current use, I don't think that there are any non-weakreffable keys other than None and ints.

The cython-induced crash has only weakreffable keys...

... and is an upstream bug in Cython?

One could provide a new method TripleDict.kill_deathrow, that will be called in any other method of TripleDict. Hence, the deletions would occur when the TripleDict will be used another time, and in particular this will only happen after the "dealloc call higher up the stack" has been completed.

But arbitrary code can be executed during weakref callbacks, including arbitrary TripleDict operations. So you'd never be sure if it's safe to kill deathrow.

Not quite, because we Know exactly what is happening in our weakref callbacks (namely: In the __call__ method of TripleDictEraser). The callback function will not call any method of TripleDict.

Perhaps kill_deathrow could also be called at the beginning of the callback function - so that it performs the deletions that were scheduled by the preceding callback function.

You wouldn't be sure that deathrow entries are now safe to delete ...

Agreed.

EXECUTIVE DECISION: we mandate that key components to TripleDict be either weakreffable or do not hold references to other objects (i.e., are non container-types). Perhaps we can relax this rule once we come up with a smart mechanism.

I guess this would be the easiest solution on the short run. Do you think of a whitelist, i.e., explicitly allowing NoneType, int, long and so on?

comment:12 Changed 7 years ago by SimonKing

By the way, my first attachment is wrong anyway, because I mispelled gc.isenabled().

Changed 7 years ago by SimonKing

For debugging only. Increase the likelyhood that a gc occurs during a weakref callback

comment:13 follow-ups: Changed 7 years ago by SimonKing

The "debug_only" patch is updated. With it, I tried the following:

sage: cython("""
....: cdef class A:
....:     cdef object x
....:     def __init__(self, x):
....:         self.x = x
....: """)
sage: import gc
sage: from sage.structure.coerce_dict import TripleDict
sage: T = TripleDict(31)
sage: while(1):
....:     b = B()
....:     a = A(b)
....:     T[b,a,None] = a
....:     del a,b
....:     foo = gc.collect()
....:     

That should be exactly the theoretical situation you were describing, isn't it?

But there is no crash. When I interrupt it after a while, with Ctrl-C, I get

sage: len(T)
1681

This I don't understand. Is TripleDict broken, in the sense of it doesn't allow its weak keys be cleared??

comment:14 in reply to: ↑ 13 Changed 7 years ago by SimonKing

Replying to SimonKing:

This I don't understand. Is TripleDict broken, in the sense of it doesn't allow its weak keys be cleared??

Well, certainly the weak keys can be cleared. But apparently they can not be cleared, if there is a non-weakreffable item in the key that holds a reference to a weakreffable item in the key.

Hence, on the one hand, it seems that the situation you constructed is a non-issue, because the callback function for the weak reference will not be called.

On the other hand, it is yet another reason to not allow non-weakreffable containers as key.

Changed 7 years ago by nbruin

Script that tries to cause a double dealloc but fails because python is mature

comment:15 follow-up: Changed 7 years ago by nbruin

HOORAY! As Modules/gc_weakref.txt shows, Zope has already put very good stresstests on python's clearing of weakrefs in cyclic garbage. I think we're safe with TripleDict as it is, because python takes special precautions when running weakref callbacks in cleaning up cyclic garbage. In our case the weakref itself is part of the cyclic garbage. That means Python always stacks the deck such that the weakref appears to be cleared prior to the deletion of what it references. Hence, the callback doesn't happen! I think we're safe with TripleDict? as it is.

comment:16 in reply to: ↑ 13 Changed 7 years ago by nbruin

Replying to SimonKing:

But there is no crash. When I interrupt it after a while, with Ctrl-C, I get

sage: len(T)
1681

This I don't understand. Is TripleDict broken, in the sense of it doesn't allow its weak keys be cleared??

No, nothing is broken. You haven't included the definition of B, but I assume it's a weakreffable type.

Although you delete a,b, your dictionary T is still holding a strong reference to a (because it can't weakref it) and a is holding a strong reference to b, so b is reachable. Nothing gets collected.

Things are collectible once you throw away T but thanks to the special handling of callbacks in cyclic garbage this is always handled properly.

comment:17 in reply to: ↑ 15 ; follow-up: Changed 7 years ago by SimonKing

Replying to nbruin:

I think we're safe with TripleDict as it is.

OK. But I wouldn't like to resolve this ticket as "wontfix" or "duplicate".

Still, we should use "del", not "__delitem__", because it gives shorter C-code. And in addition to that, we should see whether the new Cython from #13896 fixes the problem.

comment:18 Changed 7 years ago by SimonKing

  • Dependencies set to #13896

comment:19 in reply to: ↑ 17 Changed 7 years ago by nbruin

Replying to SimonKing:

Still, we should use "del", not "__delitem__", because it gives shorter C-code. And in addition to that, we should see whether the new Cython from #13896 fixes the problem.

Yes, go ahead. Since there is no urgent issue anymore, use this ticket for whatever you want (and yes, that cython does solve the problem, if you compile the whole library with it. I'm now pretty confident I know exactly what went wrong, after seeing it happen on bit-level ...). The gc_weakref.txt is a fascinating read and should be required for anybody intending to meddle with sage's memory management.

comment:20 follow-up: Changed 7 years ago by SimonKing

It seems that indeed #13896 fixes the problem. Hooray \^.^/

Hence, the priority regarding #715 is to upgrade Cython. The change from calling __delitem__ to del should be done (here) as well, but not so urgent.

comment:21 Changed 7 years ago by jpflori

  • Cc jpflori added; jpflory removed

comment:22 in reply to: ↑ 20 Changed 7 years ago by jpflori

Can we get the change from delitem to del quickly in? It is trivial, produce better C code, and potentially avoid horribly complicated gc problems?

From what I've really quickly read, Nils pointed out another problem with weakrefs and callbacks, but let's go one step at a time and deal with that elsewhere (or move the trivial fix on a separate ticket).

comment:23 follow-up: Changed 7 years ago by jpflori

  • Summary changed from Proper deletion of items of TripleDict to Better deletion of items of TripleDict

Oh in fact, I've read a little less quickly and the problem Nils thought about does not occur? So let's just get the trivial improvement here. I agree it is not a fix, but an imporvement, so I've changed the ticket title to reflect that as well.

And in case one really wants to let garbage collection occurs during weakref callback or whenever, there are always ways to do that.

comment:24 in reply to: ↑ 23 Changed 7 years ago by SimonKing

Replying to jpflori:

Oh in fact, I've read a little less quickly and the problem Nils thought about does not occur?

Yes, because "Python is more mature than we thought"...

So let's just get the trivial improvement here. I agree it is not a fix, but an imporvement

But, for the record, it is fixed with the new Cython spkg from #13896.

comment:25 Changed 7 years ago by SimonKing

Note that #12313 comprises a patch that makes the callback of a TripleDictEraser (hence, the deletion of dead items of a TripleDict) more secure. But I forgot to include the little improvement from here. Hence, I still think this ticket here is valid.

Changed 7 years ago by SimonKing

Use del rather then calling __delitem__ directly, in TripleDict and MonoDict

comment:26 Changed 7 years ago by SimonKing

  • Authors set to Simon King
  • Dependencies changed from #13896 to #13896, #12313
  • Status changed from new to needs_review

I think using del rather than calling __delitem__ makes sense even though the segfaults from #715 seem to be fixed by the Cython upgrade: Apparently del results in shorter C code.

The new patch uses del both on TripleDict and the new MonoDict, hence, it depends on #12313, where the latter was introduced.

Apply trac13904_use_del.patch

comment:27 Changed 7 years ago by jpflori

  • Description modified (diff)
  • Reviewers set to Jean-Pierre Flori
  • Status changed from needs_review to positive_review

comment:28 Changed 7 years ago by jdemeyer

  • Milestone changed from sage-5.6 to sage-5.7

comment:29 Changed 7 years ago by jdemeyer

  • Milestone changed from sage-5.7 to sage-5.8

comment:30 Changed 7 years ago by jdemeyer

  • Merged in set to sage-5.8.beta0
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:31 Changed 7 years ago by jdemeyer

  • Merged in sage-5.8.beta0 deleted
  • Resolution fixed deleted
  • Status changed from closed to new

Unmerging because of trouble with #12313.

Last edited 7 years ago by jdemeyer (previous) (diff)

comment:32 Changed 7 years ago by nbruin

Perhaps try #13387 instead? With the solution there, this ticket is superseded there. It improves the performance of these dictionaries in some other ways as well.

Last edited 7 years ago by nbruin (previous) (diff)

comment:33 Changed 7 years ago by jpflori

  • Status changed from new to needs_review

So this one should get closed as #12313 and #13387 will get merged together?

comment:34 Changed 7 years ago by jdemeyer

  • Authors Simon King deleted
  • Description modified (diff)
  • Milestone changed from sage-5.8 to sage-duplicate/invalid/wontfix
  • Resolution set to duplicate
  • Reviewers changed from Jean-Pierre Flori to Simon King, Jean-Pierre Flori
  • Status changed from needs_review to closed
Note: See TracTickets for help on using tickets.