#14471 closed defect (fixed)
Category refinement sometimes changes the hash of parents
Reported by: | vbraun | Owned by: | rlm |
---|---|---|---|
Priority: | critical | Milestone: | sage-5.12 |
Component: | memleak | Keywords: | |
Cc: | SimonKing, nbruin, nthiery, hivert | Merged in: | sage-5.12.beta0 |
Authors: | Volker Braun | Reviewers: | Simon King |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | Stopgaps: |
Description (last modified by )
Some objects with dynamical classes use hash(self.__class__)
. Since this is not an invariant of dynamical classes bad things will happen.
One such instance is that, under some circumstances, a WeakValueDictionary remove callback is being called but cannot find the object in the dictionary since the hash changed.
Apply
Attachments (7)
Change History (47)
Changed 7 years ago by
comment:1 Changed 7 years ago by
- Description modified (diff)
comment:2 Changed 7 years ago by
The debug_ignored_exceptions.patch is a backport from Python-3.3 to give better diagnosis of ignored exceptions. With it, I get
Exception ignored in: <function remove at 0x2472938> Traceback (most recent call last): File "/home/vbraun/opt/sage-5.9.beta5/local/lib/python/weakref.py", line 51, in remove del self.data[wr.key] KeyError: (((<class 'sage.categories.algebras.Algebras'>, Callable function ring with arguments (x, y)), ()),)
so the KeyError really comes from the WeakValueDictionary remove callback.
comment:3 Changed 7 years ago by
- Cc SimonKing nbruin added
Adding some debugging in remove()
shows that the wr.key
is still in self.dict.keys()
but del self.dict[wr.key]
fails with a KeyError. Presumably because the whole thing is running during the garbage collection cycle. I tried #14159 but it didn't change anything.
I haven't figured out exactly which cache in the Sage library is the culprit, maybe Simon knows?
comment:4 Changed 7 years ago by
WeakValueDict
is a standard python library class, not one of our own brew. So once we can put the finger on the bug we should likely report upstream. See also #13394, although I doubt the issue pointed out there is what is at play here.
My first guess would be that the cache at play is a "cached_method" or a "cached_function" cache, probably the one from UniqueRepresentation
, but I'm not entirely sure. The entries in:
super(MatrixSpace,sage.matrix.matrix_space.MatrixSpace).__classcall__.get_cache().keys()
aren't quite of the shape you're finding.
A quick analysis; This is WeakValueDictionary.__init__
:
def __init__(self, *args, **kw): def remove(wr, selfref=ref(self)): self = selfref() if self is not None: del self.data[wr.key] self._remove = remove UserDict.UserDict.__init__(self, *args, **kw)
As you can see, remove
only stores a link to the dictionary via a weak reference, so that the dictionary can get deallocated before the callback (and the callback finds that the dict doesn't exist anymore and simply does nothing). So that's correct.
We also see that remove
only removes the key. It doesn't check whether the value is still what it's supposed to be and it also doesn't check if the key is still there in the first place. That would be safe if there's a guarantee that as soon as a value gets removed from the dictionary, the KeyedRef
gets removed as well (in which case the callback dies without getting executed).
The bug we're seeing indicates that we end up with:
weakvaluedict[key]=value
key:KeyedRef(value)
pair gets removed from dict - this should more or less atomically destroy theKeyedRef(value)
.- somehow, the callback from the
KeyedRef(value)
stored gets called before it gets destroyed.
Since the KeyedRef(value)
should really only be referenced from the weakvaluedict
, it's a bit hard to let it survive: As soon it gets decreffed it should get deallocated.
My guess is that this happens IN cyclic garbage removal, where callbacks do get separated from deletions. Python tries hard to avoid calling such callbacks unnecessarily or in situations where it's unsafe to do so, but perhaps cleaning up extremely complicated cyclic structures (which we do create) could fool the detection.
If this is right, then these keyerrors should be relatively harmless: They would be happening on a dictionary that is slated for demolition anyway. That would suggest we could just catch and ignore the keyerror in the WeakValueDictionary? removal function. It would be good to identify the scenario that justifies doing so, however.
comment:5 follow-ups: ↓ 6 ↓ 12 Changed 7 years ago by
I agree and verified with gc.get_referrers
that it is sage.structure.unique_representation.CachedRepresentation.__classcall__.cache
If I change the remove() callback to
import sys print >> sys.stderr, 'removing', wr.key in self.data.keys() del self.data[wr.key] print >> sys.stderr, 'finished removing'
then I get
... removing True finished removing removing True Exception KeyError: (((<class 'sage.categories.algebras.Algebras'>, Callable function ring with arguments (x, y)), ()),) in <function remove at 0x1acf758> ignored removing True finished removing removing True finished removing ...
So the weakvaluedict (self.data) is in a bad state...
comment:6 in reply to: ↑ 5 ; follow-up: ↓ 7 Changed 7 years ago by
Replying to vbraun:
So the weakvaluedict (self.data) is in a bad state...
self.data
is a normal dict. Your example doesn't quite show the dict is in a "bad state". It shows that between testing whether the key is in the dict and doing the delete, the key has been removed. That's not quite the same as self.data
beng corrupted.
Something is going wrong of course, though. Something seems to be mutating self.data in between. The only methods that seem to be mutating the underlying self.data
for a WeakValueDictionary
are __setitem__
, popitem
, pop
, setdefault
, update
(and of course __delitem__
and clear
it inherits from UserDict
). Perhaps I don't have enough imagination, but indeed the only scenario in which I could see one of those happening during a callback is when a GC gets triggered and basically the whole dict gets cleaned up.
Some of the "iterator" methods on WeakValueDict
don't quite seem safe to me, by the way. The iteritems
method, for instance, iterates over self.data.itervalues()
, but self.data
can be mutated outside of your control due to GC callbacks, and iterators over a mutating dict are generally not safe.
I doubt that's what's hurting us here, though.
comment:7 in reply to: ↑ 6 ; follow-up: ↓ 9 Changed 7 years ago by
Replying to nbruin:
Replying to vbraun:
So the weakvaluedict (self.data) is in a bad state...
self.data
is a normal dict. Your example doesn't quite show the dict is in a "bad state".
Fair enough, but even after I catch the KeyError the wr.key in self.data.keys()
is still True
. So the self.data
dict is in a weird place for sure.
I agree that this smells like something else is mutating self.data
in-between. It is possible that deallocation somewhere triggers a call to CachedRepresentation
, but I don't see where that would be the case.
comment:8 Changed 7 years ago by
I've added an unguarded_abort
to force a gdb backtrace but it looks as it should to me: A object is being deallocated and triggers the WeakValueDictionary remove callback. The only dict that is being cleaned at that time are the globals of the doctest runner.
comment:9 in reply to: ↑ 7 Changed 7 years ago by
Replying to vbraun:
Fair enough, but even after I catch the KeyError the
wr.key in self.data.keys()
is stillTrue
. So theself.data
dict is in a weird place for sure.
That is weird, and sounds to me like self.data
is indeed corrupted. Given that key/value pairs are stored together in a dictionary, it's very hard for this to be true too. Caching the result from data.keys wouldn't make sense at all, so I don't think python will be doing that. Perhaps check that something like wr.key in self.data
or self.data.__contains__(wr.key)
returns true and/or if value=self.data[wr.key]
succeeds (and what kind of value you get back: whether it's wr
)
That might give you some indication of what exactly is wrong with the entry and hence where the problem might be coming from.
One thing your traceback is confirming:
- There is no GC going on, so callbacks are happening when you think they should.
- Hence there's also not a circular data structure directly involved
- Things are happening during cleanup, so these callbacks are probably indeed happening on a dict that will be deleted too. It's still bad they don't properly execute, of course, but we've seen before that order in cleanup can sometimes lead to subtle problems (i.e., whether things are stored in a cdef attribute or in a conventional dict).
One possible scenario:
- self.data is in the process of being cleared. The dict is actually emptied then before any entries are decreffed (to prevent dict reshaping during deallocation).
- One of the weakreffed values also occurs as a key in the dict (and that's the only strong ref)
- decreffing that key causes the callback, but the callback finds a dict that is now empty, so of course we get a key error.
Your diagnosis that wr.key in self.data
is still true is not consistent with that, but the scenario indeed leads to ignored keyerrors:
import weakref class A(object): def __init__(self,n): self.n=n def __repr__(self): return "A(%d)"%self.n def mess(n): D=weakref.WeakValueDictionary() L=[A(i) for i in range(n)] for i in range(n-1): j=(i+10)%n D[L[i]]=L[j] return D D=mess(10000) D.clear()
(how many you ignored keyerrors you get depends a bit on memory layout etc.)
This is definitely a bug in the python library. The problem is that remove
does check if the dict still exists (via the weakref, which is mainly to ensure that presence of callbacks doesn't extend the life of the dict), but doesn't check if the dict is otherwise in a funny state, such as "in the process of a clear
".
This is now Python issue 17816. Python3 does not seem to have this problem and it may not be the problem we're running into here.
comment:10 follow-up: ↓ 11 Changed 7 years ago by
I've looked at the Python 3 code and there weakref keeps a separate list of keys to remove to work around this issue.
Any further non-Sage related discussion should probably move to the Python bugtracker.
As for Sage, should we patch Python for this? I hit this bug in #14469 and its likely that we'll trip over it again.
comment:11 in reply to: ↑ 10 Changed 7 years ago by
Replying to vbraun:
As for Sage, should we patch Python for this? I hit this bug in #14469 and its likely that we'll trip over it again.
I would suggest a paranoid remove
function for this:
def remove(wr, selfref=ref(self)): self = selfref() if self is not None and self.data.get(wr.key) is wr: del self.data[wr.key]
The get
method avoids exception handling and hopefully avoids allocation as much as possible (thus avoiding triggering GC) and will return None
if the reference is not present. Furthermore, we verify that the correct value is still stored in the dictionary, in case a callback gets delayed so much that another value makes it into that dict under the same key (that shouldn't really happen, so that's the paranoid bit).
I tried. Making the test
if self is not None and wr.key in self.data:
also makes the error go away (but is less paranoid than the option above), but indeed, making the test
if self is not None and wr.key in self.data.keys():
does not!! So there is something wrong with self.data.keys(). Indeed, putting
def remove(wr, selfref=ref(self)): self = selfref() if self is not None: if wr.key in self.data.keys() and wr.key not in self.data: raise RuntimeError("BAD KEY:%s"%(wr.key,)) del self.data[wr.key]
in there makes for "ignored runtime error". On the other hand, if I run the example above with the BAD KEY
test, I do not get these problems. Therefore, the discrepancy between in data.keys()
and in data
does not occur in the example above.
So:
- there is a fix that apparently solves both problems
- there is something different in the problem originally reported here, since
keys()
seems to report an erroneous list.
One difference between data.keys()
and __contains__
is that dictobject.c/dict_keys
just iterates over the table storing the dict and extracts the keys of all dictentries that have a non-NULL value field, whereas __contains__
will go through a dictlookup
.
Given that some of the key components may well be in the process of being torn down, could it be that their hash is broken? One would still be able to find the corresponding key by iterating through all of them (as wr.key in data.keys()
does), but finding it by hash would not work. Both __contains__
and __delitem__
would suffer from that.
This could happen if the MRO of some class changed to insert/remove WithEqualityById
in it. Changing the hash on a key component for which a callback happens would have exactly this effect. So my bet is that it's an artifact of the mro voodoo the dynamical category stuff does or it's an unfortunate clearing/changing of a cached hash value somewhere due to a deletion-in-progress.
comment:12 in reply to: ↑ 5 Changed 7 years ago by
Replying to vbraun:
Exception KeyError: (((<class 'sage.categories.algebras.Algebras'>, Callable function ring with arguments (x, y)), ()),) in <function remove at 0x1acf758> ignored
This entry is produced by the __call__
doctest in matrix0.pyx
:
sage: f(x,y) = x^2+y
sage: m = matrix([[f,f*f],[f^3,f^4]]); m
[ (x, y) |--> x^2 + y (x, y) |--> (x^2 + y)^2]
[(x, y) |--> (x^2 + y)^3 (x, y) |--> (x^2 + y)^4]
sage: m(1,2)
[ 3 9]
[27 81]
sage: m(y=2,x=1)
[ 3 9]
[27 81]
sage: m(2,1)
[ 5 25]
[125 625]
+ sage: D=UniqueRepresentation.__classcall__.get_cache()
+ sage: [k for k in D.data.keys() if k not in D.data]
+ []
With the added doctest, we get different behaviour depending on whether attachment:trac_14471_demonstrate.patch is applied or not: Without the patch, the test passes; with the patch the test fails. That means that the line
hasattr(obj, '_foobar_')
changes the hash of one of the key components involved; probably the "callable function ring", i.e., parent(f)
. Indeed:
sage: f(x,y) = x^2+y sage: H=hash(parent(f)) sage: m = matrix([[f,f*f],[f^3,f^4]]); sage: hasattr(m,'a') False sage: H == hash(parent(f)) False
that last one's really bad! The hasattr
on that matrix apparently clobbers the hash of its base ring!
comment:13 Changed 7 years ago by
- Cc nthierry added
I think this illustrates the problem rather well:
sage: f(x,y) = x^2+y sage: P=parent(f) sage: P.__class__ sage.symbolic.callable.CallableSymbolicExpressionRing_class sage: H=hash(P.__class__) sage: m = matrix([[f,f*f],[f^3,f^4]]) sage: H == hash(P.__class__) True sage: hasattr(m,'a') False sage: P.__class__ sage.symbolic.callable.CallableSymbolicExpressionRing_class_with_category sage: H == hash(P.__class__) False
We have that P.__hash__
looks like:
def __hash__(self): return hash((self.__class__, self._arguments))
Obviously, when class is not fixed, it's a bad ingredient for a hash. Given that
sage: T=type(P) sage: [t for t in T.mro() if '__hash__' in t.__dict__] [sage.symbolic.callable.CallableSymbolicExpressionRing_class, sage.symbolic.ring.SymbolicRing, sage.structure.category_object.CategoryObject, sage.structure.sage_object.SageObject, object]
we do need to do something about hashing, since sage.symbolic.ring.SymbolicRing.__hash__
is just a fixed value and these CallableExpression? rings do come in different flavours. The __cmp__
method there also depends on self.__class__
which should obviously not be relied upon.
Given that this code stems from #5930, it predates these dynamic classes voodoo by quite a bit, so the fact that it doesn't operate well with it is a bug in the dynamic classes stuff. A job for Simon, Nicholas and the gang!
Furthermore, note that type(parent(m))
inherits from UniqueRepresentation
but that type(parent(f))
does not. It doesn't seem easy to do, but it might be possible to make two equal-but-not-identical callable expression rings. Building matrix rings over them will wreak havoc, because the constructor will use equality testing on the ingredients (and hence return identical matrix rings), but the coercion framework will not honour "equality" between the non-identical base rings.
EDIT: The last bit is not the case. CallableSymbolicExpressionRing
is a UniqueFactory
. Furthermore, it seems the coercion framework does detect "equal but nonidentical" parent and can sometimes figure out coercions between them.
comment:14 Changed 7 years ago by
- Cc hivert added
comment:15 Changed 7 years ago by
I did a very quick search for lines in source containing both hash
and __class__
. Lines 73 and 75 in sage/combinat/species/recursive_species.py
are also guilty, so whoever is going to fix this might want to look at those lines too.
comment:16 follow-up: ↓ 17 Changed 7 years ago by
Isn't this super-dangerous, changing the hash can potentially give you wrong results.
Also, I thought it is already wrong to make the bare class (not _with_category
) accessible through parent()
? Of course that ought to be spelled out in the documentation.
comment:17 in reply to: ↑ 16 ; follow-up: ↓ 18 Changed 7 years ago by
Replying to vbraun:
Isn't this super-dangerous, changing the hash can potentially give you wrong results.
Experience shows you don't have to change hashes for that, but yes, feel free to up the severity of the ticket if you feel strongly about this.
Also, I thought it is already wrong to make the bare class (not _with_category) accessible through parent()? Of course that ought to be spelled out in the documentation.
I think the system just gets tricked into that here and I think it's unavoidable: If I remember correctly, the category initialization gets delayed for efficiency reasons. Fixing that would probably cause unacceptable loss of performance in other places.
comment:18 in reply to: ↑ 17 Changed 7 years ago by
Replying to nbruin:
I think the system just gets tricked into that here and I think it's unavoidable: If I remember correctly, the category initialization gets delayed for efficiency reasons. Fixing that would probably cause unacceptable loss of performance in other places.
Is that actually true? This is about constructing a parent, after all. Though maybe somebody who actually wrote the category code could chime in and tell us what he/she envisioned and where it is documented (j/k).
comment:19 Changed 7 years ago by
See sage.matrix.matrix_space.MatrixSpace.full_category_initialisation
for documentation. See #11900 for timings.
The difference of MatrixSpace
with respect to other parents is that some code (namely on elliptic curves) creates many different matrix spaces, but uses them only as a container, but not as a fully grown parent.
comment:20 Changed 7 years ago by
PS: A further difference is that finding the correct category is a bit more involved than in other examples. It may be an algebra or a module, depending on whether we talk about general or square matrices. And concerning module, this involves checking whether the base ring is a field, which also tends to be slow. And all this effort would be in vain, because the category framework isn't used for matrix spaces in the elliptic curve code.
comment:21 follow-up: ↓ 22 Changed 7 years ago by
But the problem is CallableSymbolicExpressionRing_class
, not MatrixSpace
? The matrices only trigger the problem by initializing the CallableSymbolicExpressionRing_class_with_category
. Presumably matrices are smart enough to not use __class__
for hashes and comparison if they delay category initialization.
comment:22 in reply to: ↑ 21 Changed 7 years ago by
Replying to vbraun:
But the problem is
CallableSymbolicExpressionRing_class
, notMatrixSpace
? The matrices only trigger the problem by initializing theCallableSymbolicExpressionRing_class_with_category
.
As much as I know, I never touched CallableSymbolicExpressionRing
.
Presumably matrices are smart enough to not use
__class__
for hashes and comparison if they delay category initialization.
I hope so.
comment:23 Changed 7 years ago by
- Status changed from new to needs_review
Ok, fixing the offending __hash__
functions is easy of course. I thought long about how we can ensure that it doesn't happen again, and the best way I can see is to check that the hash does not change in `Parent._refine_category_' during doctests. The patch implements that.
comment:24 Changed 7 years ago by
- Description modified (diff)
comment:25 Changed 7 years ago by
- Description modified (diff)
- Summary changed from Exception KeyError: in WeakValueDictionary remove ignored to Category refinement sometimes changes the hash of parents
comment:26 Changed 7 years ago by
Patchbot:
Apply trac_14471_dynamic_class_hash.patch
comment:27 Changed 7 years ago by
- Description modified (diff)
- Priority changed from major to critical
I'm setting the priority to critical since it is a structural issue that we must get right in order to trust computations.
Rediffed to sage-5.11.beta3
comment:28 Changed 7 years ago by
patchbot:
apply trac_14471_dynamic_class_hash.patch
comment:29 Changed 7 years ago by
- Reviewers set to Simon King
All tests pass (based on sage-5.11.b3). I think the patch solves the problem in a clean way.
I think one should add a new test to _refine_category
, demonstrating that the class is not changed, if its hash depends on it. I plan to add this in a review patch.
One might argue that any new class (such as DebugOptions_class
) requires new doctests. But it is just a means to store certain attributes, hence, I think doctests will not be needed in this case.
Unfortunately the patchbot has not finished testing yet. But unless there is a clear regression in startup time, I'll give it a positive review (as soon as I have posted my review patch).
comment:30 Changed 7 years ago by
- Status changed from needs_review to needs_work
There is no regression in startup time. However, I did not succeed to get the warning when changing the hash. I tried:
Here is a test against :trac:`14471`. Refining the category will issue a warning, if this change affects the hash value (note that this will only be seen in doctest mode):: sage: class MyParent(Parent): ....: def __hash__(self): ....: return hash(type(self)) ....: sage: a = MyParent(category=Rings()) sage: h = hash(a) sage: a._refine_category_(Algebras(QQ)) hash of <class '__main__.MyParent_with_category'> changed in Parent._refine_category_ sage: h == hash(a) False
If I understand correctly, the check of refine category will by default be done in doctests. But I get:
Failed example: a._refine_category_(Algebras(QQ)) Expected: hash of <class '__main__.MyParent_with_category'> changed in Parent._refine_category_ Got: <BLANKLINE>
So, what shall one do?
comment:31 Changed 7 years ago by
Fixed. The debug flag was actually not set during doctests (initialization order). Now is being set explicitly. I added your doctest.
comment:32 Changed 7 years ago by
- Status changed from needs_work to needs_review
Changed 7 years ago by
comment:33 Changed 7 years ago by
- Description modified (diff)
- Status changed from needs_review to positive_review
I think it would be a good idea to show that the hash actually did change. I extended the new test accordingly, in a review patch.
Positive review, then!
comment:34 Changed 7 years ago by
- Status changed from positive_review to needs_info
No, I spoke to soon. Now, as the debug flag is set, it makes sense to run the full doc tests. After all, it could be that the debug flag uncovers a bug.
comment:35 follow-up: ↓ 36 Changed 7 years ago by
- Status changed from needs_info to needs_review
Doctests pass on my machine, of course
comment:36 in reply to: ↑ 35 Changed 7 years ago by
- Status changed from needs_review to positive_review
comment:37 Changed 7 years ago by
- Milestone changed from sage-5.11 to sage-5.12
- Status changed from positive_review to needs_work
The PDF documentation doesn't build due to the use of single instead of double backticks:
! Missing { inserted. <to be read again> _ l.14636 when the constructor calls $Parent.__ init__()$ directly or ? ! Emergency stop. <to be read again> _ l.14636 when the constructor calls $Parent.__ init__()$ directly or ! ==> Fatal error occurred, no output PDF file produced! Transcript written on geometry.log. [180[combinat ] /mazur/release/merger/sage-5.11.rc0/devel/sage/doc/en/reference/combinat/sage/combinat/crystals/alcove_path.rst:: WARNING: unusable reference target fo und: ../../../../../../../html/en/reference/structure/sage/structure/unique_representation.html#sage.structure.unique_representation.UniqueRepresentation [combinat ] /mazur/release/merger/sage-5.11.rc0/devel/sage/doc/en/reference/combinat/sage/combinat/crystals/alcove_path.rst:: WARNING: unusable reference target found: ../../../../../../../html/en/reference/structure/sage/structure/parent.html#sage.structure.parent.Parent make[1]: *** [geometry.pdf] Error 1
comment:38 Changed 7 years ago by
- Description modified (diff)
- Status changed from needs_work to positive_review
Fixed.
comment:39 Changed 7 years ago by
- Merged in set to sage-5.12.beta0
- Resolution set to fixed
- Status changed from positive_review to closed
comment:40 Changed 7 years ago by
- Cc nthiery added; nthierry removed
Initial patch