#12313 closed defect (fixed)
Fix yet another memory leak caused by caching of coercion data
Reported by: | SimonKing | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | sage-5.8 |
Component: | memleak | Keywords: | coercion weak dictionary |
Cc: | jpflori, vbraun, robertwb, AlexanderDreyer, PolyBoRi, nthiery | Merged in: | sage-5.8.beta2 |
Authors: | Simon King, Jean-Pierre Flori | Reviewers: | Simon King, Jean-Pierre Flori, John Perry, Nils Bruin |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | #715, #11521, #12215, #13746, #13378; merge with #13387 | Stopgaps: |
Description (last modified by )
The following could not be fixed in #715:
sage: K = GF(1<<55,'t') sage: for i in range(50): ....: a = K.random_element() ....: E = EllipticCurve(j=a) ....: b = K.has_coerce_map_from(E) ....: sage: import gc sage: gc.collect() 0
The problem is that any parent has a dictionary that stores any coerce map (and a different dictionary for conversion maps) ending at this parent. The keys are given by the domains of the maps. So, in the example above, the field K
has an attribute that is a dictionary whose keys are the different elliptic curves.
In coercion, it is usually best to compare parents not by equality but by identity. Therefore, I suggest to implement a container called MonoDict
that works similar to the new weak TripleDict
(see #715), but takes a single item as a key.
First tests show that one can actually gain a lot of speed: MonoDict
can access its items much faster than a usual dictionary.
Apply
Attachments (27)
Change History (361)
comment:1 follow-up: ↓ 2 Changed 5 years ago by
comment:2 in reply to: ↑ 1 Changed 5 years ago by
Replying to SimonKing:
Perhaps one could do the same thing at #715, rather than having a weak reference to the underlying set of an action?
Nope! There, the situation was different, since there is strong reference to TripleDict
anyway (namely in the coercion model), so that a weak reference from the eraser won't suffice.
comment:3 Changed 5 years ago by
- Description modified (diff)
comment:4 Changed 5 years ago by
- Status changed from new to needs_review
With the attached patch (to be applied after #715), a new kind of weak dictionary is introduced, and used to cache coercions and conversions. All tests pass for me.
Concerning raw speed: Without the patch, we have
sage: RDF.coerce_map_from(ZZ) Native morphism: From: Integer Ring To: Real Double Field sage: timeit("RDF.coerce_map_from(ZZ)",number=10^5) 100000 loops, best of 3: 518 ns per loop
With the patch, it is 466 ns, hence, one gains 10% in an operation that is very frequent.
The total speed seems to be fine as well, make ptest
has not been slower than before.
Concerning memory leaks:
sage: import gc sage: _ = gc.collect() sage: K = GF(1<<55,'t') sage: for i in range(50): ... a = K.random_element() ... E = EllipticCurve(j=a) ... b = K.has_coerce_map_from(E) sage: _ = gc.collect() sage: len([x for x in gc.get_objects() if isinstance(x,type(E))]) 1
That is a new doctest and tests against a memory leak that has not been fixed by #715.
comment:5 Changed 5 years ago by
I have extended my patch by another coercion-related application of MonoDict
.
I found that sage.categories.groupoid.Groupoid
is often resulting in a memory leak. For example, by #715, polynomial rings are only weakly cached. But if the groupoid of a polynomial ring is created, then a strong reference on the polynomial ring is created, preventing it from being garbage collected. But a groupoid is created as soon as there is an action of that polynomial ring on another ring. Hence, a memory leak.
The new patch version is overriding the __classcall__
method of sage.categories.groupoid.Groupoid
(that would normally be inherited from UniqueRepresentation
) by a method that uses a MonoDict
for caching. In addition, there is now only a weak reference to the underlying set of a groupoid. By consequence, if there is no other strong reference then Groupoid(P)
can be garbage collected, but a strong reference to P prevents Groupoid(P)
from being garbage collected.
Still needs review.
comment:6 Changed 5 years ago by
- Status changed from needs_review to needs_work
comment:7 Changed 5 years ago by
And what if a user directly use a Groupoid object ?
Will the underlying set be garbage collected as in #715 for Actions?
Another reason to split this ticket in two (once again :) )
comment:8 Changed 5 years ago by
To finish my thoughts for now, we could as well leave the tickets as they are (except for some more doc and easy fixing) and potentially implement the other behavior (where a choice is possible between strong and weak refs) in a later ticket.
The current behavior with the error raising (if it gets a description of how not to get garbage collected in addition to telling that it got garbage collected) seems viable enough to me (unless that many people do actually use Groupoids and Actions directly).
comment:9 Changed 5 years ago by
Perhaps you are right, and we could re-organise stuff. Namely, for now (that's to say: for #715 and for #12313) we could keep strong references from an action to the underlying set, and from a groupoid to the underlying set. The unfortunate consequence would be that not all memory leaks would be fixed. But one could create a new ticket where the idea of having weak references to the underlying set is tested.
And here is yet another idea, at least for the groupoid case. Let S be any object and G its groupoid. Ideally, S should keep G alive and G should keep S alive. Hence, it would be fine to have a strong reference from S to G and from G to S -- Python's garbage collection can deal with circular references.
However, there should be no strong references to S or G, unless the user explicitly provides it. In particular, when creating Groupoid(S)
, there should no external cache be used.
Potential solution: The cache for the Groupoid.__classcall__
should be in S! Perhaps like this:
class Groupoid(Category): @staticmethod def __classcall__(cls, S): try: return getattr(S, '_cached_'+cls.__name__) except AttributeError: pass instance = type.__call__(cls, S) assert(isinstance(instance,cls)) if instance.__class__.__reduce__ == UniqueRepresentation.__reduce__: instance._reduction = (cls, args, options) try: setattr(S, '_cached_'+cls.__name__, instance) return except AttributeError: pass try: # Parents have a __cached_methods attribute, by #11115 S.__cached_methods['_cached_'+cls.__name__] = instance return except AttributeError: pass raise TypeError, "%s instance must either allow attribute assignment or be instances of %s"(cls, Parent)
Note that the problem can hardly be solved in the same way for actions. They need to be stored externally, namely in the coercion model.
comment:10 Changed 5 years ago by
- Work issues set to Split ticket
I think it is best to follow your suggestion and split stuff.
Soon (perhaps tomorrow) I'll return to the old patch, i.e. without touching the groupoid stuff. In a to-be-created ticket, I'll try the idea for caching groupoids that avoids memory leaks but uses strong references, as explained in my previous post.
However, I think the weak references to the underlying set of actions is an integral part of #715 - without it, the memory leak would not be fixed at all; so, I'd not split #715.
comment:11 Changed 5 years ago by
Python's name mangling sucks.
Name mangling occurs when an attribute is requested whose name starts with double underscores and does not end with an underscore. Thus, it happens when S.__cached_method
is requested in the code above.
However, for a reason that I don't understand, Python does not use the name of the class of S for mangling, but the name of cls -- for example, if S==ZZ
then Python is looking for S._Groupoid__cached_methods
, not S._<name of class of S>__cached_methods
. In addition, since __cached_methods
is a cpdef attribute, it should not be subject to name mangling at all.
I'll see if doing things in Cython helps.
Changed 5 years ago by
Weak coerce map dictionary with weakly referenced unique keys and optimised access speed
comment:12 Changed 5 years ago by
- Status changed from needs_work to needs_review
- Work issues Split ticket deleted
I have updated my patch, so that groupoids are not touched any more. Needs review, again.
Groupoids are now dealt with at #12357 (needs review), and I think the solution provided there is much better than the improper use of weak references.
comment:13 Changed 5 years ago by
- Status changed from needs_review to needs_work
- Work issues set to Find out why some attribute of a parent is deleted before the parent's elements are deleted
I am not totally sure whether the following problem is caused by this patch or by #715 or #11521. Anyway, with all three patches applied, I obtain:
sage: K.<z> = GF(4) sage: P.<x> = K[] sage: del P sage: del x sage: import gc sage: gc.collect() Exception AttributeError: 'PolynomialRing_field_with_category' object has no attribute '_modulus' in ignored 148
As you can observe, some attribute error is ignored, but worse: Cython is unable to tell where it is ignored.
The cython-users helped me to track the problem down, using gdb.
The attribute error occurs in the following cdef method defined in sage/rings/polynomial/polynomial_zz_pex.pyx:
cdef cparent get_cparent(parent) except? NULL: if parent is None: return NULL cdef ntl_ZZ_pEContext_class c c = parent._modulus return &(c.x)
By the way it is defined, the error is propagated. However, get_cparent is in fact called in a deallocation method, namely in sage/rings/polynomial/polynomial_template.pxi:
def __dealloc__(self): """ EXAMPLE:: sage: P.<x> = GF(2)[] sage: del x """ celement_destruct(&self.x, get_cparent((<Polynomial_template>self)._parent))
That is interesting for two reasons: (1) Cython can not tell the location of the error, which I think is a bug in Cython. (2) The attribute _modulus of self._parent
is destroyed before self is destroyed.
I find the latter rather strange. As long as self is alive, self._parent
can not be deallocated, but when self._parent
can not be deallocated, why is self._parent._modulus
deleted??
comment:14 Changed 5 years ago by
Sorry, I gave an incomplete link: The discussion at cython-users is here.
comment:15 Changed 5 years ago by
I added some print statements into the init and dealloc methods of ntl_ZZ_pEContext
(namely, it is the type of self._parent._modulus
). However, even though self._parent._modulus
is gone, the deallocation method of self._parent._modulus
is not called.
Still. Of course one could add a try-except clause into the deallocation of a polynomial template. But it should simply not be the case that an attribute of an existing object disappears!
comment:16 Changed 5 years ago by
- Description modified (diff)
- Status changed from needs_work to needs_review
I have attached a second patch, that catches the attribute error that may occur during deallocation.
Here is why I think the problem arose in the first place. The error occurs in a __dealloc__
method of a polynomial. The __dealloc__
method is, as much as I know, called after deletion of the python attributes. In particular, since self._parent
can be garbage collected with my patches (without #715, #11521 and #12313, it would stay in memory forever), self._parent._modulus
is deleted before calling the polynomial's __dealloc__
. Hence, the error.
I guess, the cleanest solution would be to keep a direct pointer to self._parent._modulus
as an attribute of self
. But that would create additional memory consumption. And since, by the example below, no memory leak is created with my second patch, I think the solution of simply catching the attribute error is good enough.
With the second patch, the above example works without a warning. Moreover, it seems to come without a speed penalty:
sage: K.<z> = GF(4) sage: def test(K): ....: for n in xrange(10000): ....: P.<x> = K[] ....: del P ....: del x sage: %timeit test(K) 5 loops, best of 3: 376 ms per loop
with the second patch, but 427 without the second patch.
I also tested whether the second patch introduces a hidden memory leak (after all, when the attribute error occurs for P, the deallocation of x is interrupted): In the above example, test(K)
does not seem to leak.
sage: test(K) sage: get_memory_usage() 825.640625 sage: test(K) sage: get_memory_usage() 825.640625
Needs review, then!
Apply trac12313_mono_dict.patch trac12313_polynomial_template_dealloc.patch
comment:17 follow-up: ↓ 18 Changed 5 years ago by
You are correct, in __dealloc__
it is not safe to look at any Python attribute. Usually they are still around, especially if they are being kept alive by some other reference. But not always, and there is no way to tell if they are.
You are, in fact, leaking memory: celement_destruct()
is never called if the AttributeError
is raised. I guess it only happens rarely in your test loop, so you don't notice it in the total memory? A while ago (when we were both in Kaiserslautern, actually) I fixed essentially the same error in multi_polynomial_libsingular.pyx
: To deallocate a singular ring element you need the singular ring, but the Python parent can be gone by then. There, I added a new cdef pointer to the singular ring to the ring elements.
Usually in OOP you are not allowed to throw exceptions from a destructor. Same with C++, for example.
comment:18 in reply to: ↑ 17 ; follow-up: ↓ 19 Changed 5 years ago by
- Work issues Find out why some attribute of a parent is deleted before the parent's elements are deleted deleted
Replying to vbraun:
You are, in fact, leaking memory:
celement_destruct()
is never called if theAttributeError
is raised.
Sure. But to my surprise, it isn't noticed in that loop. Surprise, because...
I guess it only happens rarely in your test loop, so you don't notice it in the total memory?
... I thought it would be called in each round of the loop.
A while ago (when we were both in Kaiserslautern, actually) I fixed essentially the same error in
multi_polynomial_libsingular.pyx
: To deallocate a singular ring element you need the singular ring, but the Python parent can be gone by then. There, I added a new cdef pointer to the singular ring to the ring elements.
So, you agree with the statement from my previous post that keeping a reference from the element to the modulus would be the cleanest solution.
Shall I modify sage/rings/polynomial/polynomial_template.pxi accordingly?
But there would be a memory penalty (one additional pointer per polynomial) and also a time penalty (the pointer needs to be assigned).
Usually in OOP you are not allowed to throw exceptions from a destructor.
Sure. The main problem for debugging was that Cython was not able to provide the location of the error: Note that it just says "Attribute error in ignored". In fact, it tries to obtain the string representation of the polynomial, but it fails for the same reason: The parent's modulus is already gone, and without the modulus, the string representation of the element can't be determined.
comment:19 in reply to: ↑ 18 Changed 5 years ago by
Replying to SimonKing:
So, you agree with the statement from my previous post that keeping a reference from the element to the modulus would be the cleanest solution.
Yes, agreed.
In terms of space, adding a C pointer on top of a Python object is nothing. And I don't think that the time to store the address is measurable. Might even be free depending on how the CPU out-of-order execution. You'll certainly make up the time by not resolving a Python attribute in the destructor.
comment:20 Changed 5 years ago by
I had a look at the code, and I think it would be quite some work to rewrite sage/rings/polynomial/polynomial_template.pxi so that cparent (I guess that is what we really want) is not (only) stored as an attribute of the parent but mainly as a cdef attribute of a polynomial.
I think I have currently no resources to spare, and better focus on the project I'm getting money for. So, for now, I leave it like this.
comment:21 Changed 5 years ago by
FYI, I'm working on a cleaner solution to the cparent stuff, implementing Simon's idea of an attribute in each element.
comment:22 follow-up: ↓ 23 Changed 5 years ago by
- Dependencies changed from #715 to #715, #11599
Here comes a first draft of the patch.
It adds a cdef'ed cparent attribute and makes sure it is set to the right value everywhere it should.
Or at least everywhere needed to pass "make ptest" on my computer (once the ticket is rebased on top of 11599).
Some point to clarify :
- PY_NEW could be replaced by .__new__ (possible from Cython >= 0.12)
- I put (<Polynomial_template>self)._cparent when access to _cparent is needed somehow inspired by how (<Polynomial_template>self)._parent was accessed, but I feel that just self._cparent without casting is enough, isn't it ? If it is the code would be more readable (and scripting the change is trivial.)
comment:23 in reply to: ↑ 22 Changed 5 years ago by
comment:24 Changed 5 years ago by
It's coming :) Just a second.
And by the way :
In fact there is (at least) a problem left.
After applying the "cparent" patch a 1 becomes a 2 in some output of L.stats() in coerce_dict.pyx.
We have to investigate if this is expected or if I screwed up something.
comment:25 Changed 5 years ago by
- Description modified (diff)
- Status changed from needs_review to needs_work
- Work issues set to minor formatting issues, L.stats() doctest problem
comment:26 Changed 5 years ago by
Strange I've popped and pushed and rebuilt Sage, and now the coerce_dict doctest is not failing anymore... I'm running a new "make ptest" to be sure something else did not go wrong.
comment:27 Changed 5 years ago by
Oh and I should add Simon test back in.
comment:28 Changed 5 years ago by
"Should add" means that the current patches are not ready for review yet?
comment:29 Changed 5 years ago by
Unfortunately yes. Anyway, I'm also adding some doc and removing some spaces from the coerce_dict file, but if you wanna have a look at the cparent stuff, please do, at least, just to let me know how you feel about it. This is not supposed to change that much (unless i remove all the Polynomial_template casts).
By the way, why did you remove the len and _get_buckets method from the TripleDict? class but left them in the MonoDict? one ? I feel len is not needed (and I am currently rebuilding Sage without it) and as _get_buckets is for debugging...
Anyway, I'll have to go now, so I'm not sure I'll be able to finish it tonight.
comment:30 Changed 5 years ago by
Sorry about the len stuff above, I did not see that TripleDict? actually extended MonoDict? in the pxd file what I suspected but could not find and I was confused.
comment:31 follow-up: ↓ 33 Changed 5 years ago by
- Status changed from needs_work to needs_review
- Work issues minor formatting issues, L.stats() doctest problem deleted
Here comes a "reviewer" patch, quickly produced, so potentially full of mistakes. And it adds Simon's original test for polynomial stuff.
There might still be a problem with L.stats() in coerce_dict.pyx. Not sure it has the same behavior depending on whether I run it alone "sage -t ..." or within "make ptest".
comment:32 Changed 5 years ago by
- Description modified (diff)
comment:33 in reply to: ↑ 31 Changed 5 years ago by
Replying to jpflori:
There might still be a problem with L.stats() in coerce_dict.pyx.
Concerning stats(): Since weak references are used, the stats can certainly change when a garbage collection happens. In order to make such test fail safe, one could swith garbage collection off at the beginning of the "flaky" test (gc.disable()), and switch it on in the end (gc.enable()).
comment:34 follow-up: ↓ 35 Changed 5 years ago by
PS: I think the gc.disable/enable trick should only be used if the test really remains problematic.
comment:35 in reply to: ↑ 34 Changed 5 years ago by
Replying to SimonKing:
PS: I think the gc.disable/enable trick should only be used if the test really remains problematic.
Of course.
I'm not at work anymore so I can't confirm that before tomorrow morning (GMT+2).
comment:36 follow-up: ↓ 37 Changed 5 years ago by
I was wrong about the error itself. On my machine (amd64, last (quite old) ubuntu LTS, sage-5.0.beta7, several tickets applied...), I indeed get one error in coerce_dict.pyx when running "make ptest", but not when running only "./sage -t -force_lib -long devel/sage/sage/structure/coerce_dict.pyx" and could reproduce this behavior several times.
The failing doctest is
sage: list(sorted(L.iteritems())) Expected: [(-15, 3), ('a', 1), ('ab', 2)] Got: [('a', 1), ('ab', 2), (-15, 3)]
I'm currently building sage-5.0.beta10 on the same computer and a different one to check that it's not only something I messed up with.
comment:37 in reply to: ↑ 36 ; follow-up: ↓ 38 Changed 5 years ago by
Replying to jpflori:
sage: list(sorted(L.iteritems()))
How can the sorting of integers and strings depend on whether or not you are testing in parallel? That's very strange.
comment:38 in reply to: ↑ 37 Changed 5 years ago by
Replying to SimonKing:
How can the sorting of integers and strings depend on whether or not you are testing in parallel? That's very strange.
In the worst case, we could transfom the two lists of pairs into sets, and compare the sets.
comment:39 Changed 5 years ago by
comment:40 Changed 5 years ago by
Changed 5 years ago by
Add a cdef'ed reference to the parent in each polynomial; rebased on top of sage-5.0.beta10
comment:41 Changed 5 years ago by
- Description modified (diff)
comment:42 Changed 5 years ago by
I get the same kind of discrepancy on another computer (intel quadcore 2, amd64, debian experimental) on top of sage-5.0.beta10 + #715 + #11521 + this ticket.
In fact, on this other installation, I got the same failing doctest about list items ordering when running "make ptest" (checked three times); and then the "stats()" failure while running "./sage -t .../coerce_dict.pyx" only the first of ten times (and not the other failure so same behavior as on the other computer) and no other error.
So I guess the stats() error can really be ignored, the other one is stranger. Can someone else confirm this ?
comment:43 Changed 5 years ago by
I think the problem is independent from these tickets.
I added these two lines in a clean sage 4.7.2 tree:
--- a/sage/structure/coerce_dict.pyx +++ b/sage/structure/coerce_dict.pyx @@ -35,7 +35,9 @@ - + + sage: list(sorted([('a', 1), (-15,3), ('ab', 2)])) + [(-15, 3), ('a', 1), ('ab', 2)]
and run a make ptest
. It failed with error:
File "/home/defeo/sage-4.7.2/devel/sage-main/sage/structure/coerce_dict.pyx", li ne 39: sage: list(sorted([('a', 1), (-15,3), ('ab', 2)])) Expected: [(-15, 3), ('a', 1), ('ab', 2)] Got: [('a', 1), ('ab', 2), (-15, 3)]
I think some doctest hidden in some file tested before coerce_dict.pyx
may be redefining sorted
. If this bug is confirmed in newer releases, we should make a separate ticket.
comment:44 Changed 5 years ago by
This is indeed really strange and I agree that it should be dealt with somewhere else. I'll post a slightly modified "reviewer" patch taking care of this issue by replacing the (useless) call to list in front by set.
To summarize our investigation of yesterday, we confirmed the bug when running "make ptest" on :
- one computer running either beta10
- two other computers running beta11
- one other computer running beta10 and beta11
All of them are quite or very recent Intel amd64 cpus running more or less recent Debian or Ubuntu OSes. On all these computers the test passed when running "sage -t devel/sage/sage/structure/coerce_dict.pyx". I also checked on some of them that running "sage -tp 0 devel/sage/sage" does NOT raise the error, so the culprit is to be found in some magic of the "ptest" target.
Finally I could also NOT reproduce the bug on a last corei7 (Lenovo X220) computer running beta11 (but there I had some doctest failure in fast_digraph.pyx or something like that, not sure of the name).
comment:45 Changed 5 years ago by
By the way, "make test" seems to be broken... (and that's the only target really different from the other test ones).
Update: This is #12568.
comment:46 Changed 5 years ago by
I've just updated the "reviewer" patch. On my side, I'm ok with Simon's patches. So if someone else is ok with mine, feel free to set the ticket to positive review.
comment:47 Changed 5 years ago by
- Status changed from needs_review to needs_work
Too bad. With this ticket and its dependencies applied, I get at least
sage -t -force_lib "devel/sage/sage/structure/coerce_dict.pyx" ********************************************************************** File "/mnt/local/king/SAGE/stable/sage-5.0.beta10-gcc/devel/sage/sage/structure/coerce_dict.pyx", line 735: sage: len(LE) # indirect doctest Expected: 1 Got: 50 ********************************************************************** File "/mnt/local/king/SAGE/stable/sage-5.0.beta10-gcc/devel/sage/sage/structure/coerce_dict.pyx", line 305: sage: len(LE) # indirect doctest Expected: 1 Got: 51 ********************************************************************** 2 items had failures: 1 of 37 in __main__.example_19 1 of 39 in __main__.example_6
and the test suite isn't finished yet.
So, apparently the fixed memoy leak creeped back at a different ticket.
comment:48 Changed 5 years ago by
This is really strange.
I can ensure you that we did ran the tests on at least 4 different computers yesterday on sage-5.0.beta11 + #715 + #11521 + here and got no such failures... I also observed the same correct behavior on top of beta10 before. I've also been playing this morning with "make test" and everything was fine on top of sage-5.0.beta11.
Did you tried this on top of a fresh install of sage-5.0.beta10 ?
comment:49 Changed 5 years ago by
comment:50 Changed 5 years ago by
comment:51 Changed 5 years ago by
comment:52 Changed 5 years ago by
- Status changed from needs_work to needs_review
comment:53 Changed 5 years ago by
- Dependencies changed from #715, #11599 to #715, #11521, #11599
Changed 5 years ago by
Weak coerce map dictionary with weakly referenced unique keys and optimised access speed
comment:54 Changed 5 years ago by
- Description modified (diff)
comment:55 Changed 5 years ago by
- Status changed from needs_review to needs_work
I've messed up something while rebasing the patch over the new version of #715, whence needs work.
comment:56 Changed 5 years ago by
The problem seems to arise from the last patch (I've not checked yet that with all but that last one, there's no problem on 32 bits).
comment:57 Changed 5 years ago by
Ok, I think I get it. The hash value is stored, but I moved out the modulo operation to make clear what could go wrong, thus modifying the hash value which is stored for MonoDict? objects.
comment:58 Changed 5 years ago by
Same question as with #715: Isn't there a Py_size_t that one should use to replace Py_ssize_t?
comment:59 Changed 5 years ago by
Some precisions : even without the last patch here, the first 3 ones make Sage segfault on 32 bits. (although it seems fine on 64 bits as I reported yesterday.) I'll try Simon's solution now.
Changed 5 years ago by
Changed 5 years ago by
Changed 5 years ago by
Changed 5 years ago by
Changed 5 years ago by
Changed 5 years ago by
comment:60 Changed 5 years ago by
- Description modified (diff)
- Status changed from needs_work to needs_review
comment:61 Changed 5 years ago by
Changed 5 years ago by
Weak coerce map dictionary with weakly referenced unique keys and optimised access speed, plus the necessary changes in cparent to make things work, plus reviewer patch.
comment:62 Changed 5 years ago by
- Description modified (diff)
- Reviewers set to Simon King
I have attached a combined patch, that is obtained by folding all patches that have previously to be applied. In particular, it contains the fix "use size_t instead of Py_ssize_t".
For me (64 bit), all tests pass (thus, positive review from me for the parts of code written by Jean-Pierre), but we should wait for Jean-Pierre's results on 32 bit.
For the patchbot:
Apply trac_12313-mono_dict-combined.patch
comment:63 Changed 5 years ago by
comment:64 Changed 5 years ago by
From sage-devel there is the need for a 32-bit test. I have 64-bit hardware but running 32-bits:
Linux blitzen 2.6.32-5-686 #1 SMP Mon Jan 16 16:04:25 UTC 2012 i686 Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz GenuineIntel GNU/Linux
From hg qapplied on top of sage-5.0.beta13 I have:
trac_715_combined.patch trac_11521_homset_weakcache_combined.patch trac_12313-mono_dict-combined.patch
After sage -br all long tests pass here.
comment:65 follow-up: ↓ 66 Changed 5 years ago by
- Status changed from needs_review to needs_work
I get a failure on boxen (x86_64 Linux):
sage -t -force_lib devel/sage/sage/structure/coerce_dict.pyx ********************************************************************** File "/padic/scratch/jdemeyer/merger/sage-5.1.beta0/devel/sage-main/sage/structure/coerce_dict.pyx", line 260: sage: L.stats() # min, avg, max (bucket length) Expected: (0, 0.06451..., 1) Got: (0, 0.06451612903225806, 2) **********************************************************************
comment:66 in reply to: ↑ 65 ; follow-up: ↓ 69 Changed 5 years ago by
comment:67 follow-up: ↓ 68 Changed 5 years ago by
I already reported such an error in http://trac.sagemath.org/sage_trac/ticket/12313#comment:24
I did not really pay attention because I could only reproduce it once on two fresh install of Sage.
It seemed to have been caused by the cparent stuff...
comment:68 in reply to: ↑ 67 Changed 5 years ago by
Replying to jpflori:
I already reported such an error in http://trac.sagemath.org/sage_trac/ticket/12313#comment:24
I did not really pay attention because I could only reproduce it once on two fresh install of Sage.
It seemed to have been caused by the cparent stuff...
OK. But I wonder: Is it reliably reproducible on the same machine? Namely, as I said, I have never seen the "2" on my own machine.
What shall we do about that error? After all, it is about how the data are distributed in the hash table, and the hash is clearly machine dependent (since the hash depends on the address of the object).
Hence, the output of "stat()" is machine dependent. Perhaps we could mark the output "random"?
comment:69 in reply to: ↑ 66 Changed 5 years ago by
comment:70 follow-up: ↓ 71 Changed 5 years ago by
It does not fail consistently, it succeeds most of the time. By chance, it failed on me the first two times I tried (so I assumed it always failed).
comment:71 in reply to: ↑ 70 ; follow-up: ↓ 72 Changed 5 years ago by
Replying to jdemeyer:
It does not fail consistently, it succeeds most of the time. By chance, it failed on me the first two times I tried (so I assumed it always failed).
OK, so it is random.
Since the purpose of that method is merely diagnostics, I suggest to remove the problem by marking the test "random".
comment:72 in reply to: ↑ 71 ; follow-up: ↓ 73 Changed 5 years ago by
Replying to SimonKing:
Since the purpose of that method is merely diagnostics, I suggest to remove the problem by marking the test "random".
I think it's dubious that the test has educational value, so you might as well remove the test completely.
comment:73 in reply to: ↑ 72 Changed 5 years ago by
Replying to jdemeyer:
I think it's dubious that the test has educational value, so you might as well remove the test completely.
Removing the test or removing the method? I think that the method has certainly been useful while developing the code, but perhaps it is now of no use?
comment:74 Changed 5 years ago by
I meant removing the test, but I don't really care.
comment:75 follow-ups: ↓ 77 ↓ 79 Changed 5 years ago by
My machine has tested about 1/3 of the directories when running sage -t sage/
. So, far there is one doctest error:
sage -t "devel/sage-main/sage/misc/sagedoc.py" ********************************************************************** File "/home/sage-5.0.beta13/devel/sage-main/sage/misc/sagedoc.py", line 566: sage: 'abvar/homology' in _search_src_or_doc('doc', 'homology', 'variety', interact=False) Expected: True Got: False ********************************************************************** 1 items had failures: 1 of 7 in __main__.example_8 ***Test Failed*** 1 failures.
I will update as time permits. I am not running long tests; let me know if I should.
comment:76 Changed 5 years ago by
Another:
sage -t "devel/sage-main/sage/modular/local_comp/type_space.py" *** *** Error: TIMED OUT! PROCESS KILLED! *** *** [360.5 s]
comment:77 in reply to: ↑ 75 ; follow-up: ↓ 78 Changed 5 years ago by
Replying to john_perry:
My machine has tested about 1/3 of the directories when running
sage -t sage/
. So, far there is one doctest error:sage -t "devel/sage-main/sage/misc/sagedoc.py" ********************************************************************** File "/home/sage-5.0.beta13/devel/sage-main/sage/misc/sagedoc.py", line 566: sage: 'abvar/homology' in _search_src_or_doc('doc', 'homology', 'variety', interact=False) Expected: True Got: False ********************************************************************** 1 items had failures: 1 of 7 in __main__.example_8 ***Test Failed*** 1 failures.I will update as time permits. I am not running long tests; let me know if I should.
I also got that one within my virtual machine, but did not worry too much because I cannot even think how this could be related to coercion and caching. In fact, I think it also happened without #12313 applied, but only #715 and #11521. I'll check without any of them right now (I should have done this before, but VirtualBox? is horribly slow for some reason and it's a pain to rebuild Sage and retest it).
comment:78 in reply to: ↑ 77 Changed 5 years ago by
Replying to jpflori:
I also got that one within my virtual machine, but did not worry too much because I cannot even think how this could be related to coercion and caching.
I cannot think of a relation, either.
comment:79 in reply to: ↑ 75 ; follow-up: ↓ 81 Changed 5 years ago by
Replying to john_perry:
My machine has tested about 1/3 of the directories when running
sage -t sage/
. So, far there is one doctest error:sage -t "devel/sage-main/sage/misc/sagedoc.py" ********************************************************************** File "/home/sage-5.0.beta13/devel/sage-main/sage/misc/sagedoc.py", line 566: sage: 'abvar/homology' in _search_src_or_doc('doc', 'homology', 'variety', interact=False) Expected: True Got: False ********************************************************************** 1 items had failures: 1 of 7 in __main__.example_8 ***Test Failed*** 1 failures.
This is a failure that you get if don't build the Sage documentation. Did you run make
and let it finish?
comment:80 Changed 5 years ago by
- Status changed from needs_work to needs_review
comment:81 in reply to: ↑ 79 Changed 5 years ago by
Replying to jdemeyer:
This is a failure that you get if don't build the Sage documentation. Did you run
make
and let it finish?
Yes. I guess jpflori did, too, but he reports the same result.
comment:82 Changed 5 years ago by
I've just rebuild (sage -br) beta13 after popping all the patches and testing that file alone (sage -t ..../sagedoc.py) raises the same error. I'm now running "make" again to see if anything changes afterward.
Whatsoever, this definitely looks unrelated.
comment:83 follow-up: ↓ 84 Changed 5 years ago by
FYI, make failed while building the doc because of lack of memory. I allocated 1Go to the virtual machine, isn't it enough? Looking at the memory consumption in top, it goes up to 65% (for the python process alone, not counting other processes) at 98% of the "reading source" stage and fails (on the last test it was at sage/notebook/twist). Quite strangely, there was still some memory available, about 70 Mo at least, not counting some memory cached, but the system might have refused to allocate some more to the python process.
I think that's the only reason for the above failure on my side, and the other ones I reported here (or in #715) on 32 bits.
comment:84 in reply to: ↑ 83 Changed 5 years ago by
- Status changed from needs_review to needs_info
Replying to jpflori:
FYI, make failed while building the doc because of lack of memory. I allocated 1Go to the virtual machine, isn't it enough?
I have 512M on my machine, so if the memory is an issue, that would explain it. What seems odd to me is that, as I recall, sage didn't report errors when it finished building. How do I determine whether this is an issue? A naive search of install.log
doesn't hit anything.
Whatsoever, this definitely looks unrelated.
Is that the case with the timeout error, as well? If so, I can give it a positive review, since doctests have finished with only the two errors I mentioned above: sagedoc.py
and timeout, the latter in several places. Unless, that is, you want me to run long tests as well. Let me know.
comment:85 Changed 5 years ago by
- Description modified (diff)
I just noticed that Jean-Pierre has provided a new combined patch, with randomisation of the "random" stats test. So, for the patchbot:
Apply trac_12313-mono_dict-combined-random.patch
comment:86 follow-up: ↓ 89 Changed 5 years ago by
- Status changed from needs_info to needs_review
At the end of "make", I can spot a myriad of warnings about objects.inv and more specifically at the end of the [reading sources] phase (which does not reach 100 %), I have an
Exception occured: File ... OSError: [Errno 12] Cannot allocate memory
This is not logged into install.log but into dochtml.log.
And this also happens within the virtual machine without any patch patch applied...
comment:87 Changed 5 years ago by
By the way, it's quite awful that Sage builds correctly with 512 Mo of memory, but that the doc does not.
comment:88 Changed 5 years ago by
Indepently of this ticket, I was already investigating the "docbuilding memory runaway" problem. I agree that there is indeed a problem.
comment:89 in reply to: ↑ 86 Changed 5 years ago by
Replying to jpflori:
- Status changed from needs_info to needs_review
What needs review? the ..._random.patch
? I think that was the one I checked; the problems were only w/timeout and sagedoc.py
, which people seem to agree isn't a problem w/this patch? I'll verify in a little while, just in case.
I'm happy to do it; I'm just not sure what it is needs doing now.
comment:90 Changed 5 years ago by
Nothing more. I was just waiting for John Cremona's report on sage-devel.
But we could as well put the ticket back to positive review, I'm quite convinced the problem we experienced on 32 bits are unrelated to the ticket itself.
comment:91 Changed 5 years ago by
- Reviewers changed from Simon King to Simon King, John Perry
- Status changed from needs_review to positive_review
comment:92 follow-up: ↓ 94 Changed 5 years ago by
Sorry for the delay, I left my laptop at home. I have just applied the patch (to 5.0-beta13) and am rebuilding, then will do a full test. It may not be finished until an overnight run...
comment:93 Changed 5 years ago by
- Milestone changed from sage-5.0 to sage-5.1
comment:94 in reply to: ↑ 92 Changed 5 years ago by
Replying to cremona:
Sorry for the delay, I left my laptop at home. I have just applied the patch (to 5.0-beta13) and am rebuilding, then will do a full test. It may not be finished until an overnight run...
All (long) tests passed. Let's hope 5.1 will not be too far down the line...
comment:95 Changed 5 years ago by
- Reviewers changed from Simon King, John Perry to Simon King, Jean-Pierre Flori, John Perry
Just putting myself as reviewer because I originally reviewed Simon code a while ago.
comment:96 Changed 5 years ago by
I get some rare doctest failures while testing a very preliminary version of sage-5.1.beta1. The annoying thing is that the failures aren't reproducible. The problems seem to be related to coercion. I'm only guessing this ticket is to blame, it could also be #12877 or some other ticket listed in http://boxen.math.washington.edu/home/release/sage-5.1.beta1/tickets.html
All the observed failures have been within the sage/schemes/elliptic_curves
directory.
Often one just gets a segmentation fault:
sage -t --long -force_lib devel/sage/sage/schemes/elliptic_curves/heegner.py Segmentation fault [101.7 s]
But there are also failures of the form:
sage -t --long -force_lib devel/sage/sage/schemes/elliptic_curves/ell_tate_curve.py ********************************************************************** File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/devel/sage-main/sage/schemes/elliptic_curves/ell_tate_curve.py", line 332: sage: eq.parametrisation_onto_tate_curve(1+5+5^2+O(5^10)) Exception raised: Traceback (most recent call last): File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/ncadoctest.py", line 1231, in run_one_test self.run_one_example(test, example, filename, compileflags) File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/sagedoctest.py", line 38, in run_one_example OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags) File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/ncadoctest.py", line 1172, in run_one_example compileflags, 1) in test.globs File "<doctest __main__.example_12[3]>", line 1, in <module> eq.parametrisation_onto_tate_curve(Integer(1)+Integer(5)+Integer(5)**Integer(2)+O(Integer(5)**Integer(10)))###line 332: sage: eq.parametrisation_onto_tate_curve(1+5+5^2+O(5^10)) File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/lib/python/site-packages/sage/schemes/elliptic_curves/ell_tate_curve.py", line 340, in parametrisation_onto_tate_curve un = u * q**(-(u.valuation()/q.valuation()).floor()) File "element.pyx", line 1687, in sage.structure.element.RingElement.__mul__ (sage/structure/element.c:12755) File "coerce.pyx", line 797, in sage.structure.coerce.CoercionModel_cache_maps.bin_op (sage/structure/coerce.c:7555) TypeError: unsupported operand parent(s) for '*': '5-adic Ring with capped relative precision 20' and '5-adic Field with capped relative precision 20' ********************************************************************** File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/devel/sage-main/sage/schemes/elliptic_curves/ell_tate_curve.py", line 529: sage: eq.parametrisation_onto_original_curve(1+5+5^2+O(5^10)) Exception raised: Traceback (most recent call last): File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/ncadoctest.py", line 1231, in run_one_test self.run_one_example(test, example, filename, compileflags) File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/sagedoctest.py", line 38, in run_one_example OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags) File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/ncadoctest.py", line 1172, in run_one_example compileflags, 1) in test.globs File "<doctest __main__.example_17[3]>", line 1, in <module> eq.parametrisation_onto_original_curve(Integer(1)+Integer(5)+Integer(5)**Integer(2)+O(Integer(5)**Integer(10)))###line 529: sage: eq.parametrisation_onto_original_curve(1+5+5^2+O(5^10)) File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/lib/python/site-packages/sage/schemes/elliptic_curves/ell_tate_curve.py", line 545, in parametrisation_onto_original_curve P = self.parametrisation_onto_tate_curve(u,prec=20) File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/lib/python/site-packages/sage/schemes/elliptic_curves/ell_tate_curve.py", line 340, in parametrisation_onto_tate_curve un = u * q**(-(u.valuation()/q.valuation()).floor()) File "element.pyx", line 1687, in sage.structure.element.RingElement.__mul__ (sage/structure/element.c:12755) File "coerce.pyx", line 797, in sage.structure.coerce.CoercionModel_cache_maps.bin_op (sage/structure/coerce.c:7555) TypeError: unsupported operand parent(s) for '*': '5-adic Ring with capped relative precision 20' and '5-adic Field with capped relative precision 20' **********************************************************************
I have also seen an error similar to the above but with multiplying RealNumber
s, but I didn't copy the exact error message.
comment:97 Changed 5 years ago by
I have seen the above error on various systems: Linux, OS X, 32-bit, 64-bit so it seems unrelated to the operating system.
comment:98 follow-up: ↓ 99 Changed 5 years ago by
Just to be sure: this didn't occur with beta0?
Because what is done here is somehow similar to what is done in #715 which was merged in beta0.
comment:99 in reply to: ↑ 98 Changed 5 years ago by
comment:100 Changed 5 years ago by
FYI, I could reproduce the problem in ell_tate_curve on top of vanilla sage-5.1.beta0.
I'm also testing this on top of sage-5.0.beta13 plus all coercion patches and #12877 but got no failure so far (ran the test about 50 times, got the failure with 5.1.beta0 twice in 20 runs).
comment:101 Changed 5 years ago by
You mean sage-5.1.beta0 + #12313?
comment:102 follow-up: ↓ 103 Changed 5 years ago by
Sorry about the above comment: I meant that this is repoducible on top of 5.0.beta13 + coercion patches and #12877 and not on top of 5.1.beta0
comment:103 in reply to: ↑ 102 Changed 5 years ago by
comment:104 Changed 5 years ago by
Hi Jean-Pierre and Jeroen!
I am about to go to some conference and will probably not have much time to trace the problem down until Sunday.
Anyway, it would help me if you could state precisely in what setting you find the problem and in what setting you don't. By "precisely", I mean a list of patches, since I can't imagine what you mean by "coercion patches".
comment:105 follow-up: ↓ 106 Changed 5 years ago by
Ok, I'll try to be more clear and avoid to replace random zeros by ones in the version number I write.
So far I could reproduce the problem in ell_tate_curve above (after about 10 runs, so quite often) on:
- 5.0.beta13 + #715 + #11521 + #12313 (+ #11943 + #12357 + #12877 but without these three last ones as well).
And could not after 100 runs of ell_tate_curve on:
- 5.1.beta0 vanilla
I'm currently testing 5.1.beta0 + #12313 (after about 40 tries, no errors, which is quite disturbing).
comment:106 in reply to: ↑ 105 ; follow-up: ↓ 111 Changed 5 years ago by
Hi Jean-Pierre,
Replying to jpflori:
So far I could reproduce the problem in ell_tate_curve above (after about 10 runs, so quite often) on:
... I'm currently testing 5.1.beta0 + #12313 (after about 40 tries, no errors, which is quite disturbing).
Aha. So, seemingly it is the combination of the weak coercion caches from #715 and #11521, plus the weak unique parent caches from here. Is there any hint what objects are involved in the crashes?
My gut feeling is that actions are involved: By #715, they only store a weak reference to the underlying set (which I felt was dangerous, but was needed to fix some memory leak, and seemed justifiable after all). I could imagine that the underlying set of some action becomes garbage collectable with #12313.
comment:107 Changed 5 years ago by
Looking at where the error produced by ell_tate_curve seems to indicate that the problems occurs while multiplying elements of the 5-adic ring and field.
I'm trying to build a simple loop reproducing the error based on that but failed so far.
comment:108 Changed 5 years ago by
Another problem is that the bug seems to occur very rarely...
I've not seen it yet on top of 5.1.beta0 + #12313 although I have the feeling it should occur (or I'm wrong :) ).
Even on top of 5.0.beta13 + #715 + #11521 + #12313 it does not happen so frequently when looping over the tests of ell_tate_curve.
I'll run some (I mean a LOT) test without #12313 now, just to be sure this is indeed the culprit.
comment:109 Changed 5 years ago by
Mmmm, this seems really tricky.
I've built 5.0.rc0 and applied #715, #11521, #12313, #11943 and #12357 and tried to reproduce the ell_tate_curve error but failed after a few hundreds of tries.
So I decided to pop #12357 and #11943 and tada ! I got the error back in quite quickly.
Of course, this could only be some random lottery... Or the changes introduced by #11943 could make the error even rarer.
Indeed, I finally produced the error with #11943 applied back after additional tries.
After a few hundreds of tries, I'm still unable to reproduce the error on 5.1.beta0 + #12313 and 5.1.beta0 + #12313 + #12877... but that might happen after more tests.
comment:110 Changed 5 years ago by
Some more data... I've run successfuly ell_tate_curve (i.e. without error) 10000 on 5.0.rc0 plus #715, #11521, #12313, #11943 and #12357 over the night.
I've just retested it without #11943 and #12357 and got the failure after 19 and the 13 tries.
On 5.1.beta0 plus #12313 I got not failure after about 14200 runs of ell_tate_curve. I'm relaunching it with #12877 applied to see if I get more (un)lucky.
Anyway, I'll use 5.0.rc0 plus #715, #11521, #12313 as a base to try to investigate that.
comment:111 in reply to: ↑ 106 Changed 5 years ago by
Replying to SimonKing:
My gut feeling is that actions are involved: By #715, they only store a weak reference to the underlying set (which I felt was dangerous, but was needed to fix some memory leak, and seemed justifiable after all). I could imagine that the underlying set of some action becomes garbage collectable with #12313.
The problem occurs while multiplying an element of a ring and its field of fractions. I did not think it through at all, but my natural guess would be that this is done by (canonical?) coercion of the ring element into the field of fractions and then the action takes place there.
That would correspond to the "canonical_coercion" and PyObject_CallObject(...) calls within the bin_op call made by the coercion model, hence no actions involved, but I'm still confused by all of this ;)
comment:112 Changed 5 years ago by
- Milestone changed from sage-5.1 to sage-pending
I'm going to move this to sage-pending for now, with the idea of doing some more tests later. In any case, this ticket will not be merged before sage-5.1.beta2.
comment:113 Changed 5 years ago by
- Merged in set to sage-5.1.beta2
- Milestone changed from sage-pending to sage-5.1
- Resolution set to fixed
- Status changed from positive_review to closed
I haven't seen any more errors, so maybe #11943 did fix this...
comment:114 Changed 5 years ago by
Well, #12357 still gives segmentation faults...
comment:115 follow-up: ↓ 116 Changed 5 years ago by
I might have some bad news. It seems that this might have caused the following rare (about 1% of the time for the first one, the second one is even more rare) failures on some machines while merging sage-5.2.beta0 (not sure which patch in sage-5.2.beta0 exposes it):
sage -t -force_lib devel/sage-main/sage/rings/padics/padic_base_leaves.py ********************************************************************** File "/home/buildbot/build/sage/rosemary-1/rosemary_full/build/sage-5.2.beta0/devel/sage-main/sage/rings/padics/padic_base_leaves.py", line 326: sage: K.has_coerce_map_from(ZpCA(17,10)) Expected: True Got: False **********************************************************************
and
sage -t --long -force_lib devel/sage/sage/rings/number_field/number_field.py ********************************************************************** File "/home/buildbot/build/sage/mark-1/mark_full/build/sage-5.1.rc0/devel/sage-main/sage/rings/number_field/number_field.py", line 302: sage: RR.coerce_map_from(K) Expected: Composite map: From: Number Field in a with defining polynomial x^3 - 2 To: Real Field with 53 bits of precision Defn: Generic morphism: From: Number Field in a with defining polynomial x^3 - 2 To: Real Lazy Field Defn: a -> 1.259921049894873? then Conversion via _mpfr_ method map: From: Real Lazy Field To: Real Field with 53 bits of precision Got nothing **********************************************************************
This means I will probably unmerge this from sage-5.1.
comment:116 in reply to: ↑ 115 ; follow-up: ↓ 117 Changed 5 years ago by
Replying to jdemeyer:
I might have some bad news. It seems that this might have caused the following rare (about 1% of the time) failure on some machines
I hate these sporadic errors! They are so difficult to debug! But if I recall correctly, Nicolas Thiéry posted an example in sage-combinat-devel exposing problems with the weak caching stuff. I hope I will at some point find the time to analyse his example. Right now, I can't.
comment:117 in reply to: ↑ 116 Changed 5 years ago by
Replying to SimonKing:
I hate these sporadic errors! They are so difficult to debug!
I know, it has happened to me too. The worst for me is when I discover such a sporadic error but I can't tell which ticket caused it.
Any chance that this error is also caused by #12313?
comment:118 Changed 5 years ago by
- Merged in sage-5.1.beta2 deleted
- Milestone changed from sage-5.1 to sage-5.2
- Resolution fixed deleted
- Status changed from closed to new
comment:119 Changed 5 years ago by
comment:120 Changed 5 years ago by
- Dependencies changed from #715, #11521, #11599 to #715, #11521, #11599, #12969
- Status changed from new to needs_review
Any chance we could test whether #12969 fixes the problem? I added it as a dependency.
comment:121 Changed 5 years ago by
- Description modified (diff)
I had to rebase the patch (and rename, because Jean-Pierre has submitted the previous version).
Apply trac_12313-mono_dict-combined-random-sk.patch
comment:122 Changed 5 years ago by
FWIW: On bsd.math, all tests pass with
trac12969_fix_coercion_cache.patch trac12215_weak_cached_function-sk.patch trac12215_segfault_fixes.patch trac_715_combined.patch trac_11521_homset_weakcache_combined.patch trac_12313-mono_dict-combined-random-sk.patch
on top of sage-5.2.rc0.
comment:123 Changed 5 years ago by
- Dependencies changed from #715, #11521, #11599, #12969 to #715, #11521, #11599, #12969, #11310
- Status changed from needs_review to needs_work
It turns out that the current patch won't apply, due to #11310.
comment:124 Changed 5 years ago by
- Dependencies changed from #715, #11521, #11599, #12969, #11310 to #715, #11521, #11599, #12969
- Status changed from needs_work to needs_review
Arrgh, sorry, I have posted on the wrong ticket. The patch seems fine.
Apply trac_12313-mono_dict-combined-random-sk.patch
comment:125 follow-ups: ↓ 128 ↓ 149 Changed 5 years ago by
sage/structure/coerce_dict.pyx:
line 96:
Perhaps you can explain the mechanism here. If I understand correctly, _refcache
is a global dictionary, keyed with the "key" part of KeyedRef
s and as values
lists of KeyedRef
s with that key. The deletion consists of looking up the key
in the registered MonoDict
and removing the entry if it's there (it could
already have disappeared if the entry was explicitly deleted from the
MonoDict
, because MonoDict.__delitem__
doesn't bother updating _refcache
)
It then proceeds to remove the now defunct KeyedRef
instance from _refcache
.
But doesn't that mean that even if we have len(L)==0
as a result from that,
the entry k : L
remains in _refcache
? That means we get a whole pile-up of
integers and empty lists in memory as a result (something I have indeed
observed). These objects aren't so big, so this leaks much slower of course,
but it's still a leak. I suspect you copied this from TripleDict
which will
have the same problem.
line 103:
for i from 0 <= i < PyList_GET_SIZE(bucket) by 4: if <size_t><object>PyList_GET_ITEM(bucket,i)==k: del bucket[i:i+2]
It reads to me that that should be ... by 2
corresponding tro the loops in
MonoDict.__setitem__
and MonoDict.__getitem__
line 497: (in Monodict._getitem__
)
How bad is the race condition here? I guess we're trusting that
bucket
doesn't change and that bucket
has been purged of any dead references
(i.e., that all MonoDictEraser
callbacks have happened). Does keeping the GIL
guarantee this?
Does the line
return <object>PyList_GET_ITEM(bucket, i+1)
increase a refcount on the object we retrieve? Does the <object>
typecast do
that? Or does the return
do that? (now that the object has been retrieved it
should be protected from deletion)
line 536:
Am I reading correctly that if k cannot be keyreffed then k is going to get a
strong ref in _refcache and will be eternalized? Is that what we want? If so,
write this in a comment! Especially because currently even explicit deletion
would leave the reference in _refcache
intact, so you'd be leaking in cases
where a normal dictionary would not.
comment:126 Changed 5 years ago by
I will first try to fix #715 according to your remarks, and then return to here.
comment:127 Changed 5 years ago by
- Status changed from needs_review to needs_work
- Work issues set to Rebase wrt the new version of #715. Take into account Nils' comments
comment:128 in reply to: ↑ 125 Changed 5 years ago by
- Dependencies changed from #715, #11521, #11599, #12969 to #715, #11521, #11599, #12969, #12215
- Status changed from needs_work to needs_review
Replying to nbruin:
sage/structure/coerce_dict.pyx:
line 96:
Perhaps you can explain the mechanism here.
I did explain it on #715, and I changed the patch similar to #715, according to your comments (_refcache is now local, etc).
line 103:
for i from 0 <= i < PyList_GET_SIZE(bucket) by 4: if <size_t><object>PyList_GET_ITEM(bucket,i)==k: del bucket[i:i+2]It reads to me that that should be
... by 2
Right! Thanks for spotting that.
line 497: (in
Monodict._getitem__
)How bad is the race condition here? I guess we're trusting that
bucket
doesn't change and thatbucket
has been purged of any dead references (i.e., that allMonoDictEraser
callbacks have happened). Does keeping the GIL guarantee this?
That is a good question.
I tried to stress test it:
L = MonoDict(31) class A: pass while 1: keys = [A() for _ in range(100)] for i,k in enumerate(keys): L[k] = i a = keys[99] keys = None print L[a]
I suppose that one could expect that garbage collection would sometimes occur while getting L[a]. Hence, one would expect a segfault to occur at some point. But it works fine (it constantly prints "99").
Does the line
return <object>PyList_GET_ITEM(bucket, i+1)increase a refcount on the object we retrieve? Does the
<object>
typecast do that? Or does thereturn
do that?
As much as I know, the <object>
cast does that. Note that without the cast, the code wouldn't even compile, because "return" can not return the result of PyList_GET_ITEM
(which is just a reference to an object).
Here is some test that (I think) demonstrates that the reference count is right:
sage: cython(""" ....: include "../ext/python_list.pxi" ....: def test(list L): ....: return <object>PyList_GET_ITEM(L,4) ....: """) sage: import gc sage: class A(object): pass ....: sage: L = [A() for _ in range(100)] sage: id(L[4]) 80680784 sage: a = test(L) sage: del L sage: _ = gc.collect() sage: [id(x) for x in gc.get_objects() if isinstance(x,A)] [80680784] sage: id(a) 80680784 sage: del a sage: _ = gc.collect() sage: [id(x) for x in gc.get_objects() if isinstance(x,A)] []
line 536:
Am I reading correctly that if k cannot be keyreffed then k is going to get a strong ref in _refcache and will be eternalized?
See the comments at #715: That used to be a problem, but is now solved.
Some comments:
- This patch fixes a typo introduced at #715. I suppose that #715, #11521, #12215 and #12313 will be merged in one go anyway, so, that should be fine.
- There was a doctest in sage/categories/modules_with_basis.py that explicitly relied on an "eternal cache". Namely, some module was defined in one doctest and named "X". In another doctest, the same module was taken, and it was still called "X", because it has not been garbage collected. With my patch, garbage collection works, and thus the custom name chosen in the first doctest is not present in the second doctest.
Apply trac_12313-mono_dict-combined-random-sk.patch
comment:129 Changed 5 years ago by
- Dependencies changed from #715, #11521, #11599, #12969, #12215 to #11521, #11599, #12969, #12215
comment:130 Changed 5 years ago by
Apply trac_12313-mono_dict-combined-random-sk.patch
comment:131 Changed 5 years ago by
Now I'm puzzled. The patchbot applies the same patches on top of the same version of Sage in the same order, but it gets an error when applying the patch from here, while it works fine for me.
For the record: I have
trac_715_combined.patch trac_715_local_refcache.patch trac_715_specification.patch trac_11521_homset_weakcache_combined.patch trac_11521_callback.patch trac12215_weak_cached_function-sk.patch trac12215_segfault_fixes.patch trac_12313-mono_dict-combined-random-sk.patch
on top of sage-5.3.beta2
Apply trac_12313-mono_dict-combined-random-sk.patch
comment:132 Changed 5 years ago by
It is frustrating. The patchbot reports a problem in sage.rings.padics.padic_base_leaves, that is most likely a cache problem in coercion: At some point, the absence of coercion is cached even though there is a coercion.
I thought that that kind of problem was fixed by #12969! And worse: I do not get the error reported by the patchbot. And even worse: I do get the error reported by the patchbot if I additionally apply #13370. And still worse: For #13370, the patchbot also reports a segfault in another test - that I can not reproduce.
My plan: Using #13370, I try to find out what side-effect is responsible for the wrong coercion cache. Hopefully it can then be fixed, in a new dependency for this ticket.
comment:133 follow-up: ↓ 134 Changed 5 years ago by
Meanwhile it seems to me that the __contains__
method of both MonoDict
and TripleDict
is too complicated: Currently, the content of the buckets is looked up. However, self._refcache
should also indicate what keys exist. So, it should be faster to look up self._refcache
, and not the buckets.
Perhaps it is a bit improper to fix the __contains__
method of TripleDict
here, even though it was introduced in #715 which isn't merged. Hence, my plan is now to fix TripleDict.__contains__
there, rebase the patches from the other tickets if needed, and fix MonoDict.__contains__
here.
comment:134 in reply to: ↑ 133 Changed 5 years ago by
comment:135 Changed 5 years ago by
- Owner changed from rlm to (none)
Or: Perhaps I need to change both patches after all. Namely, I found that the __delitem__
methods do not update the _refcache - and one of them belongs to #715.
comment:136 Changed 5 years ago by
OK: Both the patches at #715 and here are updated. The __contains__
methods should now be more efficient, and the __delitem__
method will remove the item not only from the buckets but also from the _refcache.
Needs review! And I hope the patchbot doesn't time out when finding the dependencies...
Apply trac_12313-mono_dict-combined-random-sk.patch
comment:137 Changed 5 years ago by
The patchbot times out again, but all tests pass for me.
comment:138 Changed 5 years ago by
Some confusion on my part made me post possibly relevant note on the wrong ticket: #11521 comment 152
Perhaps we should not be caching conversions at all, because they can too easily cause memory leaks, for instance:
for p in prime_range(1,1000): k=GF(p) a=k(1) b=ZZ(a)
(the conversion maps get cached on ZZ
, where the strong references to the domains will keep the finite fields alive)
I'd think conversions are allowed to be expensive.
comment:139 follow-up: ↓ 140 Changed 5 years ago by
I think the point of this ticket is exactly to solve the small example you posted (which is nothing but what the ticket description suggests if I'm not getting too tired):
- maps are cached in ZZ in a MonoDict? which key is a weakref to k
- k dies, and the MonoDict? is smart enough to let garbage collection occurs.
The other way around (maps from ZZ to k, then k dies) should be trivial as well.
Is that right?
comment:140 in reply to: ↑ 139 ; follow-up: ↓ 141 Changed 5 years ago by
Replying to jpflori:
I think the point of this ticket is exactly to solve the small example you posted (which is nothing but what the ticket description suggests if I'm not getting too tired):
- maps are cached in ZZ in a MonoDict? which key is a weakref to k
Yes, but the map stores a refcounted ref to k so the presence of the map on ZZ prevents k from being collected.
We still get the benefit on this patch that the absence of a map (which is the case with GF(p)
to ZZ
) gets cached by None
so in that case only the weakref to k exists on ZZ
and k
can be collected.
- k dies, and the MonoDict? is smart enough to let garbage collection occurs.
Only if the stored value doesn't strongly reference k
.
The other way around (maps from ZZ to k, then k dies) should be trivial as well.
Not "should be trivial as well" but "is already trivial thanks to cyclic garbage removal". The map is cached on k
and not on ZZ
. The map references k
, but if all other references to k
disappear we just have a cyclic reference, which python's GC should be able to figure out.
So: YES, this ticket is an improvement over what we had before (and once all issues are worked out should be merged) but it doesn't offer a complete solution to leaking due to caching yet:
for p in prime_range(1,1000): _=SR.coerce_map_from(GF(p))
will still leak like a disgruntled intelligence chief.
Conversion is a bit worse because conversion maps exist more often, but is also slightly better because the system won't ask for them by itself and won't investigate the transitive closure of the conversion graph. Therefore, I'm not sure conversion should be given the same caching treatment as coercion. The proposed patch is propagating such a caching treatment (but didn't introduce it).
comment:141 in reply to: ↑ 140 Changed 5 years ago by
Replying to nbruin:
Yes, but the map stores a refcounted ref to k so the presence of the map on ZZ prevents k from being collected.
Ok, I get it now, the example in the ticket description returns None, not the other one you proposed.
comment:142 Changed 5 years ago by
Another remark, the changes in this ticket do not affect the caching of maps themselves, but only of Booleans indicating whether such maps exists, which are now stored in MonoDict? rather than dict.
comment:143 Changed 5 years ago by
Oh no, I'm wrong again, the _coerce_from_hash stuff is definitely targetting the caching of the maps themselves.
comment:144 Changed 5 years ago by
So would storing only weakrefs to the domain in Maps/Morphism? suit you? The same approach was taken for Action and codomain in #715.
comment:145 Changed 5 years ago by
Both these work fine for me on a pristine 5.3b2 with the patches for this ticket applied:
sage -t -force_lib devel/sage-12313/sage/rings/padics/padic_base_leaves.py # 1 doctests failed sage -t -force_lib devel/sage-12313/sage/rings/polynomial/infinite_polynomial_element.py # Killed/crashed
See also #13370, where Simon can reproduce the padic_base_leaves failure.
I tried the following code to try and recreate the circumstances Simon finds there, but no problem occurs (it does usually succeed fairly quickly in finding R with the same id as the K before)
import gc while True: k=Qp(7) L=[c^2 for c in range(2^randint(1,15))] #to stir up memory use print id(k) K=Qp(7,40) iK = id(K) assert k.has_coerce_map_from(K) is False del K gc.collect() R=Zp(7,40) iR = id(R) if iR == iK: break del k,R gc.collect() assert k.has_coerce_map_from(R)
comment:146 Changed 5 years ago by
- Status changed from needs_review to needs_work
- Work issues changed from Rebase wrt the new version of #715. Take into account Nils' comments to Test activity of weak references if addresses coincide
I'd like to try to add a test that would prevent the race condition that seems to be behind the doctest failures.
comment:147 follow-up: ↓ 148 Changed 5 years ago by
- Status changed from needs_work to needs_review
- Work issues Test activity of weak references if addresses coincide deleted
I have updated the patch. Change: Before returning a value of the dictionary or assert that a key is in the dictionary, I test whether the keyed references involved are still active. If they are not, then the key just happens to use the old memory location for new values.
I have already tested that the error (that I only find with #13370) disappears with the new patches. I am now running make ptest with #13370.
I only wonder whether the additional tests imply an unacceptable slow-down. Could the reviewer do some tests? I am about to be on holiday, so, very likely I can't test much in the next few days.
Apply trac_12313-mono_dict-combined-random-sk.patch
comment:148 in reply to: ↑ 147 ; follow-up: ↓ 152 Changed 5 years ago by
Replying to SimonKing:
I have updated the patch. Change: Before returning a value of the dictionary or assert that a key is in the dictionary, I test whether the keyed references involved are still active. If they are not, then the key just happens to use the old memory location for new values.
So basically a MonoDict
is now just a WeakKeyDict
with identity testing rather than equality testing. Unless sage parents have really bad hash values, there should really not be a difference, right? That's one thing to try. TripleDict
still has its value, because of how it behaves if any one of the key components disappears.
EDIT: NO! silly me. You can make this much faster. Since you'll be accessing everything via the weakref anyway, you don't need a refcache
. You can just store a refcounted reference to the weakref directly in the bucket. Lookup is lightning fast, because you hash on id-based keys and you can walk the bucket as you did before.
You don't have a race condition, because you are holding a strong ref to the object that gets weakreffed until your MonoDict
or TripleDict
is in a consistent state to receive a callback (should it happen).
It looks like we might be slightly leaking if callbacks indeed sometimes never come ...
comment:149 in reply to: ↑ 125 Changed 5 years ago by
Replying to nbruin:
Does the line
return <object>PyList_GET_ITEM(bucket, i+1)increase a refcount on the object we retrieve?
Here is another evidence that <object> does what we need:
sage: cython(""" ....: def test(size_t k): ....: K = <object><void *>k ....: print K ....: del K ....: """) sage: class A(object): pass ....: sage: a = A() sage: test(id(a)) <__main__.A object at 0x44a17d0> sage: a <__main__.A object at 0x44a17d0> sage: del a sage: import gc sage: _ = gc.collect() sage: quit Exiting Sage (CPU time 0m0.16s, Wall time 0m39.96s).
If the reference count of the A-instance would not be increase by the <object> cast in the test function, I guess a segfault would result, wouldn't it?
comment:150 follow-up: ↓ 153 Changed 5 years ago by
Wouldn't it be easier to directly look at the Cython code for the <object> refcount?
As far as testing is concerned, I'll setup up Sage installs during the week and report here and on the other tickets. Hopefully, this should be done by the beginning of next week.
comment:151 Changed 5 years ago by
I obviously meant the C code generated by Cython...
comment:152 in reply to: ↑ 148 Changed 5 years ago by
Replying to nbruin:
Replying to SimonKing: So basically a
MonoDict
is now just aWeakKeyDict
with identity testing rather than equality testing.
Seems so.
Unless sage parents have really bad hash values, there should really not be a difference, right?
Unfortunately, parents P do have very bad hash values. Namely, the default for hash(P) is hash(repr(P)). This value is not cached, and repr(P) may be slow. Even worse, repr(P) may change, because there is a P.rename() method. There is some ticket for fixing that apparent problem.
Anyway, at this point it would make sense to compare usual WeakKeyDictionary
and MonoDict
timing-wise. That should be done using unique parents (there, the hash and also the comparison is fast).
comment:153 in reply to: ↑ 150 ; follow-up: ↓ 155 Changed 5 years ago by
Replying to jpflori:
Wouldn't it be easier to directly look at the Cython code for the <object> refcount?
No idea how to do so.
comment:154 follow-up: ↓ 157 Changed 5 years ago by
Sorry, the edited comment http://trac.sagemath.org/sage_trac/ticket/12313#comment:148 probably fell out of scope for you. In short: do away with the _refcache
and put the weakref directly in the bucket. That saves storage and a superfluous lookup. That way MonoDict
and TripleDict
should still be a bit faster than normal dicts
(because key comparison is trivial).
comment:155 in reply to: ↑ 153 Changed 5 years ago by
Replying to SimonKing:
Replying to jpflori:
Wouldn't it be easier to directly look at the Cython code for the <object> refcount?
No idea how to do so.
For the Sage library, you can find the C[++] generated files in devel/sage... For files generated on the file, you have to look in .sage/temp/... or something like this, involving the PID of the process.
Anyway, I had a look at the parent.c file produced from parent.p?? on a vanilla from coercion 5.3.beta2 and on the lines corresponding to a return <object> PY_TYPE(x)
there indeed seems to be a _ _Pyx_INCREF called on the address of the returned value.
comment:156 follow-up: ↓ 158 Changed 5 years ago by
From http://www.cython.org/release/Cython-0.11.3/Cython/Includes/python.pxd
... # it doesn't increment the reference count. When it sees PyObject* # in order to use the result you must explicitly cast to <object>, # and when you do that Cython increments the reference count wether # you want it to or not, forcing you to an explicit DECREF (or leak memory). # To avoid this we make the above convention. Note, you can ...
which also supports the above. See also https://groups.google.com/forum/?fromgroups=#!topic/cython-users/pZFurj3rUyg
comment:157 in reply to: ↑ 154 ; follow-up: ↓ 160 Changed 5 years ago by
Replying to nbruin:
Sorry, the edited comment http://trac.sagemath.org/sage_trac/ticket/12313#comment:148 probably fell out of scope for you. In short: do away with the
_refcache
and put the weakref directly in the bucket. That saves storage and a superfluous lookup. That wayMonoDict
andTripleDict
should still be a bit faster than normaldicts
(because key comparison is trivial).
I think that would not work, for two reasons:
- You seem to assume that all keys of
MonoDict
allow weak references. Is that granted? I am not sure. Perhaps this is a non-issue. - In a very early stage of my work, I did put the weak references into the buckets. However, my impression was that that would create a massive slow-down, as I explain now.
With the current patch, the buckets are of the form [h1,v1,h2,v2,h3,v3,...]
, where h1,h2,h3 are the memory addresses of the keys, and v1,v2,v3 are the values. Now assume that we would put the (weak) references r1,r2,r3 to the keys into the bucket: [r1,v1,r2,v2,r3,v3,...]
.
The weak references should have a callback function removing entries from the dictionary. Hence, they are not just weakref.ref
, but weakref.KeyedRef
.
Let K be some weak-reffable key and f be some callback function. While weakref.ref(K) is weakref.ref(K)
, we unfortunately have weakref.KeyedRef(K,f,id(K)) is not weakref.KeyedRef(K,f,id(K))
.
We want to compare K by identity. Hence, either do if r1() is K
or if r1.key == <size_t><void *>K and r1() is not None
. Both tests are slow (r1.key is a Python attribute lookup). Even worse: They are slow, regardless whether K is in the dictionary or not.
With the new code, we would first test h1==<size_t><void *>K
and then self._refcache[h1]() is not None
. The second test is slow, but apparently we need it, or some doctests will fail. But the point is: If K is not in the dictionary then except in very rare cases the first test will recognise it. And the first test is indeed fast. Hence, non-existing keys are fast to test, existing keys are probably not worse to test than with a usual WeakKeyDictionary
. Will test it soon...
comment:158 in reply to: ↑ 156 Changed 5 years ago by
Replying to jpflori:
# it doesn't increment the reference count. When it sees PyObject* # in order to use the result you must explicitly cast to <object>, # and when you do that Cython increments the reference count wether # you want it to or not, forcing you to an explicit DECREF (or leak memory). # To avoid this we make the above convention. Note, you can
Yep, I think I saw that text, but didn't recall where to find it. So, apparently it works.
comment:159 Changed 5 years ago by
Here are some timings. They are quite conclusive: MonoDict
is very fast, even in the "validate weak references before giving an answer" version!
First, something that is quite realistic. We have many polynomial rings over different finite base fields - something that often occurs in elliptic curve computations. Then, we create a WeakKeyDictionary
and a MonoDict
keyed with these polynomial rings, and then test containment with a ring that is in the dictionary and one that isn't:
sage: L = [] sage: for p in prime_range(10000): ....: L.append(GF(p)['x','y']) ....: sage: import weakref sage: from sage.structure.coerce_dict import MonoDict sage: M = MonoDict(53) sage: for i,K in enumerate(L): ....: M[K] = i ....: sage: W = weakref.WeakKeyDictionary() sage: for i,K in enumerate(L): ....: W[K] = i ....: sage: K = GF(97)['x','y'] sage: K2 = GF(next_prime(p))['x','y'] sage: K in W True sage: K in M True sage: K2 in W False sage: K2 in M False sage: %timeit K in W 625 loops, best of 3: 55.9 µs per loop sage: %timeit K in M 625 loops, best of 3: 533 ns per loop sage: %timeit K2 in W 625 loops, best of 3: 2.41 µs per loop sage: %timeit K2 in M 625 loops, best of 3: 1.56 µs per loop sage: %timeit a = W[K] 625 loops, best of 3: 54.9 µs per loop sage: %timeit a = M[K] 625 loops, best of 3: 1.08 µs per loop
Hence, in this realistic example, MonoDict
is a lot faster than WeakKeyDictionary
-- even though I am talking about the recent version of MonoDict
where the weak references are validated.
The polynomial rings have a decently fast hash, but comparison is slow:
sage: %timeit hash(K) 625 loops, best of 3: 649 ns per loop sage: %timeit hash(K2) 625 loops, best of 3: 678 ns per loop sage: %timeit K == K 625 loops, best of 3: 53.2 µs per loop sage: %timeit K == K2 625 loops, best of 3: 65.7 µs per loop
So, let us see what happens if we do "the same" with UniqueRepresentation
:
sage: class A(UniqueRepresentation): ....: def __init__(self,p): ....: self.p = p ....: sage: L = [] sage: for p in prime_range(10000): ....: L.append(A(p)) ....: sage: K = A(97) sage: K2 = A(next_prime(p)) sage: import weakref sage: W = weakref.WeakKeyDictionary() sage: from sage.structure.coerce_dict import MonoDict sage: M = MonoDict(53) sage: for i,K in enumerate(L): ....: W[K] = i ....: M[K] = i ....: sage: K in M True sage: K in W True sage: K2 in M False sage: K2 in W False sage: %timeit K in M 625 loops, best of 3: 552 ns per loop sage: %timeit K in W 625 loops, best of 3: 3.67 µs per loop sage: %timeit K2 in M 625 loops, best of 3: 1.6 µs per loop sage: %timeit K2 in W 625 loops, best of 3: 2.72 µs per loop sage: %timeit M[K] 625 loops, best of 3: 1.52 µs per loop sage: %timeit W[K] 625 loops, best of 3: 3.22 µs per loop sage: %timeit hash(K) 625 loops, best of 3: 944 ns per loop sage: %timeit K == K 625 loops, best of 3: 910 ns per loop sage: %timeit K == K2 625 loops, best of 3: 918 ns per loop
Hence, comparison is now a lot faster, and this is noticeable in WeakKeyDictionary
.
However: MonoDict
still outperforms WeakKeyDictionary
. I am slightly surprised - I didn't expect this would happen, after introducing the validation of weak references. Note that testing K in M
is even faster than a single comparison K==K
!
comment:160 in reply to: ↑ 157 ; follow-up: ↓ 161 Changed 5 years ago by
With the current patch, the buckets are of the form
[h1,v1,h2,v2,h3,v3,...]
, where h1,h2,h3 are the memory addresses of the keys, and v1,v2,v3 are the values. Now assume that we would put the (weak) references r1,r2,r3 to the keys into the bucket:[r1,v1,r2,v2,r3,v3,...]
.The weak references should have a callback function removing entries from the dictionary. Hence, they are not just
weakref.ref
, butweakref.KeyedRef
.
That's not what I meant. You make the buckets larger, so that their layout becomes:
[h1,r1,v1,h2,r2,v2,h3,r3,v3,...]
where for, say, the key-value pair k1: v1
we have
h1=id(k1) r1=weakref.KeyedRef(k1,f,h1)
This way you can do your whole lookup without doing an extra search in the now non-existent _refcache
.
for i from 0 <= i < PyList_GET_SIZE(bucket) by 3: tmp = <object>PyList_GET_ITEM(bucket, i) if <size_t>tmp == h: r = <object>PyList_GET_ITEM(bucket, i+1) if isinstance(r, KeyedRef) and r() is None: # the object got collected but we didn't receive a callback (yet?) # either remove it now: del r #we dont WANT the callback anymore! del bucket[i:i+3] self._size -= 1 # or skip that step and hope we'll get our callback later. In any case: raise KeyError, k else: # We've just checked that r's ref (the actual key) was still alive # so the fact that id(k) equals tmp means that # k is r() [or k is r if not weakreffable, but that would have been true anyway] return <object>PyList_GET_ITEM(bucket, i+2) raise KeyError, k
Deletion and setting are easily similarly modified.
This should be a bit faster because you avoid an extra self._refcache[h]
lookup and all the management that comes with maintaining an
independent hashtable that has exactly the same keys!
It is also more memory efficient because you're now storing h only once instead of twice.
For TripleDict
you can do it in exactly the same way, just a little more work.
comment:161 in reply to: ↑ 160 Changed 5 years ago by
Replying to nbruin:
That's not what I meant. You make the buckets larger, so that their layout becomes:
[h1,r1,v1,h2,r2,v2,h3,r3,v3,...]where for, say, the key-value pair
k1: v1
we haveh1=id(k1) r1=weakref.KeyedRef(k1,f,h1)
I see.
This way you can do your whole lookup without doing an extra search in the now non-existent
_refcache
.for i from 0 <= i < PyList_GET_SIZE(bucket) by 3: tmp = <object>PyList_GET_ITEM(bucket, i) if <size_t>tmp == h: r = <object>PyList_GET_ITEM(bucket, i+1) if isinstance(r, KeyedRef) and r() is None: # the object got collected but we didn't receive a callback (yet?) # either remove it now: del r #we dont WANT the callback anymore! del bucket[i:i+3] self._size -= 1 # or skip that step and hope we'll get our callback later. In any case: raise KeyError, k else: # We've just checked that r's ref (the actual key) was still alive # so the fact that id(k) equals tmp means that # k is r() [or k is r if not weakreffable, but that would have been true anyway] return <object>PyList_GET_ITEM(bucket, i+2) raise KeyError, k
Since you already have code, it should be relatively easy to test the performance.
This should be a bit faster because you avoid an extra
self._refcache[h]
lookup and all the management that comes with maintaining an independent hashtable that has exactly the same keys!
I wouldn't say that it is obviously faster. Let's simply test it, I'd say!
It is also more memory efficient because you're now storing h only once instead of twice.
For
TripleDict
you can do it in exactly the same way, just a little more work.
Not that much. Actually, I wonder whether one could have a TupleDict
, that is created with an additional parameter that prescribes the length L of the key items. Then, the data in each bucket would be stored by L references (weak, if possible), L memory locations, and one value. The look-up would be uniform.
I think one should create a proof of concept, to test the performance. If the performance is as good as with the current MonoDict
and TripleDict
, then we have to think how to organise work. Perhaps #715 and #12313 will be superseded by a new ticket, introducing TupleDict
?
comment:162 follow-up: ↓ 164 Changed 5 years ago by
OK! example implementation done. A couple of things I found:
- Some of your doctests for
MonoDict
testTripleDict
!
- your iterators can recurse quite deeply if there are lots of empty buckets in a row. That can easily happen.
- Casting to
<object>
in Cython indeed gives you a kind of<PyObject *>
where Cython takes care of intelligent casting and refcounting for you. That comes at some cost. Especially fortmp
I think we can largely do away with that for a little gain in speed. This might apply more widely.
- It's a shame we can't store our
h
values immediately in the list but that we have to store them in an integer, which means we're storing a pointer to that integer in the list. If our buckets would be arrays of longs we could save one level of indirection for the storage of theh
values, at the expense of having to do manual ref counting on the other parts, as opposed to letting Cython take care of it. A point to keep in mind if we need that last nanosecond ... (it's tempting to simply cast the value of h to aPyObject*
and store that directly in the bucket but that would wreak havoc with the cyclic garbage collector. We could mangle the value (add one? all pointer values are divisible by 4 nowadays) but I think that would lead us to stepping outside Python's defined C-API. [if they ever get more lispy, they'll start using those bottom bits for type info or so]
- What happens if we find a dead entry in setitem? How can we be sure the pending callback for that doesn't happen on our new value? I think explicitly deleting the weakref should do the trick, but then we should ensure that nobody else gets a chance to keep those weakrefs alive! I've gone with that for now. An alternative is to let the callback check that the weakref stored in the relevant bucket is indeed a weakref and dead. Then at least we won't be removing a new entry under a key that happens to have reused the previous memory address. Given that our use is caching, it would only lead to reduced performance, not wrong results, by the way.
Timings: I've run the same tests as above:
sage: import weakref sage: from sage.structure.coerce_dict import MonoDict, MonoDictNoRefCache sage: sage: L = [] sage: for p in prime_range(10000): ....: L.append(GF(p)['x','y']) ....: sage: K = GF(97)['x','y'] sage: K2 = GF(next_prime(p))['x','y'] sage: sage: M = MonoDict(53) sage: for i,K in enumerate(L): ....: M[K] = i ....: sage: N = MonoDictNoRefCache(53) sage: for i,K in enumerate(L): ....: N[K] = i ....: sage: W = weakref.WeakKeyDictionary() sage: for i,K in enumerate(L): ....: W[K] = i ....: sage: K in W True sage: K in M True sage: K in N True sage: sage: K2 in W False sage: K2 in M False sage: K2 in N False sage: sage: %timeit K in W 625 loops, best of 3: 36.7 µs per loop sage: %timeit K in M 625 loops, best of 3: 162 ns per loop sage: %timeit K in N 625 loops, best of 3: 198 ns per loop sage: sage: %timeit K2 in W 625 loops, best of 3: 637 ns per loop sage: %timeit K2 in M 625 loops, best of 3: 395 ns per loop sage: %timeit K2 in N 625 loops, best of 3: 145 ns per loop sage: sage: %timeit a = W[K] 625 loops, best of 3: 36.3 µs per loop sage: %timeit a = M[K] 625 loops, best of 3: 229 ns per loop sage: %timeit a = N[K] 625 loops, best of 3: 203 ns per loop
With the UniqueRepresentation
:
sage: class A(UniqueRepresentation): ....: def __init__(self,p): ....: self.p = p ....: sage: L = [] sage: for p in prime_range(10000): ....: L.append(A(p)) ....: sage: K = A(97) sage: K2 = A(next_prime(p)) sage: sage: import weakref sage: from sage.structure.coerce_dict import MonoDict, MonoDictNoRefCache sage: M = MonoDict(53) sage: for i,K in enumerate(L): ....: M[K] = i ....: sage: N = MonoDictNoRefCache(53) sage: for i,K in enumerate(L): ....: N[K] = i ....: sage: W = weakref.WeakKeyDictionary() sage: for i,K in enumerate(L): ....: W[K] = i ....: sage: K in W True sage: K in M True sage: K in N True sage: sage: K2 in W False sage: K2 in M False sage: K2 in N False sage: sage: %timeit K in W 625 loops, best of 3: 1.04 µs per loop sage: %timeit K in M 625 loops, best of 3: 173 ns per loop sage: %timeit K in N 625 loops, best of 3: 195 ns per loop sage: sage: %timeit K2 in W 625 loops, best of 3: 766 ns per loop sage: %timeit K2 in M 625 loops, best of 3: 440 ns per loop sage: %timeit K2 in N 625 loops, best of 3: 135 ns per loop sage: sage: %timeit a = W[K] 625 loops, best of 3: 926 ns per loop sage: %timeit a = M[K] 625 loops, best of 3: 208 ns per loop sage: %timeit a = N[K] 625 loops, best of 3: 179 ns per loop
If you do these tests repeatedly there is quite some noise, but these represent
what generally seems to happen. Oddly enough, retrieving an item is often a
little faster than getting True
back on K in N
! Must be something subtle or
overhead. Deciding that K in M
is the only thing where the old MonoDict
is
slightly faster. That is not the case we're interested in optimizing, though;
Lookup is! And MonoDictNoRefCache
is the clear winner there (as it is for
deciding that a key is NOT in there. Also a good operation to have).
Furthermore, note that MonoDictNoRefCache
is a little more thorough in its
membership test by also checking that the weakref is still alive.
Changed 5 years ago by
Example implementation of a MonoDict?? storing the weakrefs in the buckets as well
comment:163 follow-up: ↓ 165 Changed 5 years ago by
Observation: The test above is entirely unrepresentative because there is no resizing whatsoever (default threshold=0
leads to no resizing). So the test above was just using 53
buckets, meaning it was mainly doing linear search! Impressive how fast that is, of course, but more representative is (2/3
is the threshold ratio that Python's dict
uses by default):
M = MonoDict(53,threshold=2/3) N = MonoDictNoRefCache(53,threshold=2/3) ... sage: sage: %timeit K in W 625 loops, best of 3: 1.06 µs per loop sage: sage: %timeit K in M 625 loops, best of 3: 169 ns per loop sage: sage: %timeit K in N 625 loops, best of 3: 128 ns per loop sage: sage: sage: sage: %timeit K2 in W 625 loops, best of 3: 800 ns per loop sage: sage: %timeit K2 in M 625 loops, best of 3: 426 ns per loop sage: sage: %timeit K2 in N 625 loops, best of 3: 67.1 ns per loop sage: sage: sage: sage: %timeit a = W[K] 625 loops, best of 3: 941 ns per loop sage: sage: %timeit a = M[K] 625 loops, best of 3: 169 ns per loop sage: sage: %timeit a = N[K] 625 loops, best of 3: 122 ns per loop
That's why MonoDict
won for membership. It was using a properly calibrated dict
versus a linear search in MonoDictNoRefCache
. I think our design is clear now.
Concerning TupleDict
: Sure we can implement it but if our only use cases are MonoDict
and TripleDict
anyway, we're better off with the unrolled loops.
Somewhere in the next few days I'll prepare a patch for #715 to move TripleDict
to this design and then put a patch on #12313 for MonoDict
, if that's OK with people. In the mean time, comments on design welcome.
comment:164 in reply to: ↑ 162 Changed 5 years ago by
Replying to nbruin:
OK! example implementation done. A couple of things I found:
- Some of your doctests for
MonoDict
testTripleDict
!
Oops.
- your iterators can recurse quite deeply if there are lots of empty buckets in a row. That can easily happen.
Yes? I thought that the number of buckets is quite small (so, no deep recursion) and the whole thing is designed to distribute the items uniformly over the buckets
- Casting to
<object>
in Cython indeed gives you a kind of<PyObject *>
where Cython takes care of intelligent casting and refcounting for you. That comes at some cost. Especially fortmp
I think we can largely do away with that for a little gain in speed. This might apply more widely.
I wrote some general code for what I now call IdKeyDict
: A weak key dictionary (that also accepts items that aren't weak-reffable) with keys of prescribed length, keys being compared by identity. There, the tmp is not used.
- It's a shame we can't store our
h
values immediately in the list but that we have to store them in an integer, which means we're storing a pointer to that integer in the list.
I don't understand that comment.
- What happens if we find a dead entry in setitem? How can we be sure the pending callback for that doesn't happen on our new value?
My new code takes care of that. The methods are a lot more modularised. There is one method that simply finds the correct bucket (according to the hash) and another method that finds the position of an item on that bucket. The second method will automatically delete a dead item from the bucket. And of course, that method is both used in __setitem__
, __getitem__
and __contains__
.
I think explicitly
deleting the weakref should do the trick, but then we should ensure that nobody else gets a chance to keep those weakrefs alive!
What would be the problem if the weakrefs survive? OK, at some point the callback will start working, but would that be a problem? I mean, if the callback tries to remove an item that has already been deleted, then the callback would simply do nothing
comment:165 in reply to: ↑ 163 ; follow-up: ↓ 167 Changed 5 years ago by
Replying to nbruin:
Observation: The test above is entirely unrepresentative because there is no resizing whatsoever (default
threshold=0
leads to no resizing). So the test above was just using53
buckets, meaning it was mainly doing linear search! Impressive how fast that is, of course, but more representative is (2/3
is the threshold ratio that Python'sdict
uses by default)
Good point! By your comment, I have now threshold=0.7 as default in my experimental code.
That's why
MonoDict
won for membership. It was using a properly calibrateddict
versus a linear search inMonoDictNoRefCache
.
Yes, that would explain it.
Concerning
TupleDict
: Sure we can implement it but if our only use cases areMonoDict
andTripleDict
anyway, we're better off with the unrolled loops.
Last night I already wrote experimental code (calling the thing "IdKeyDict
", because the keys are compared by identity). I am now doing some tests - perhaps we are lucky and it is fast...
MonoDict
and TripleDict
are the only use cases yet. Until not so long time ago, TripleDict
has been the only use case. One could imagine that an IdKeyDict
being useful for a weak_cached_function, as in #12215: If you have a function with a fixed list of arguments (hence, no *args and **keys), and if you are happy to compare the arguments by identity, IdKeyDict
may be exactly the thing you want.
Plus, one avoids code duplication. And aren't modern compilers intelligent enough to do some loop unrolling automatically?
Anyway, since I already wrote code, I'll first test the performance, rather than immediately dropping the idea of unifying MonoDict
and TripleDict
.
comment:166 Changed 5 years ago by
What would be the problem if the weakrefs survive? OK, at some point the callback will start working, but would that be a problem? I mean, if the callback tries to remove an item that has already been deleted, then the callback would simply do nothing
Not if there's now a new entry stored under that id. The trick is, id(object)
is guaranteed to uniquely identify object
for its lifetime but, as you've seen, not beyond it.
By the way, as long as we ensure that absolutely no-one else keeps a reference to our weakrefs, then deleting them should make them disappear immediately, making it impossible to pass them as a parameter to a callback ... I hope that python uses that to purge its callback queue after a cyclic garbage collection.
See #13387 for a modification of MonoDict
and TripleDict
to not use a refcache. It's also a little more careful in handling dangling weakrefs (I think).
Hashtables should not have a particularly small number of buckets! About 1.5 times the number of elements! (the birthday paradox still means quite some collisions). And we since we don't ever resize down, the load could drop quite low! (although not particularly in our main applications).
comment:167 in reply to: ↑ 165 Changed 5 years ago by
Replying to SimonKing:
Plus, one avoids code duplication. And aren't modern compilers intelligent enough to do some loop unrolling automatically?
If you have a preprocessor, macro, or templating system then the N
in N-Dict
can indeed be resolved at compile-time and then the compiler can unroll. But I don't think Cython has parametric code generation (yet?)
Anyway, since I already wrote code, I'll first test the performance, rather than immediately dropping the idea of unifying
MonoDict
andTripleDict
.
OK! Compare with #13387. I suspect that your elusive ptest
SEGVs might be due to somewhat careless handling in the Dict
s. When I was rewriting them I found some places where Weakrefs were not properly checked before concluding presence.
comment:168 Changed 5 years ago by
Do everyone agree that further changes to *Dict should happen elsewhere? Even the potential fix for the elusive segfault Simon encountered and Nils just mentioned?
Or should #13387 (and further tickets) really be merged with these ones?
comment:169 Changed 5 years ago by
I just added an experimental patch idkey_dict that implements what I proposed in my previous posts. The timings seem very appealing:
sage: from sage.structure.idkey_dict import IdKeyDict sage: from sage.structure.coerce_dict import TripleDict sage: import weakref sage: I = IdKeyDict(3,53, threshold=0.7) sage: T = TripleDict(53, threshold=0.7) sage: W = weakref.WeakKeyDictionary() sage: L = [] sage: for p in prime_range(10000): ....: L.append(GF(p)['x','y']) ....: sage: K = GF(97)['x','y'] sage: K2 = GF(next_prime(p))['x','y'] sage: for i,K in enumerate(L): ....: W[K] = i ....: I[K,K,True] = i ....: T[K,K,True] = i ....: sage: (K,K,True) in I True sage: (K,K,True) in T True sage: K in W True sage: %timeit (K,K,True) in I 625 loops, best of 3: 1.31 µs per loop sage: %timeit (K,K,True) in T 625 loops, best of 3: 2.14 µs per loop sage: %timeit K in W 625 loops, best of 3: 56.4 µs per loop sage: (K2,K,True) in I False sage: (K2,K,True) in T False sage: K2 in W False sage: %timeit (K2,K,True) in I 625 loops, best of 3: 523 ns per loop sage: %timeit (K2,K,True) in T 625 loops, best of 3: 2.07 µs per loop sage: %timeit K2 in W 625 loops, best of 3: 2.4 µs per loop sage: I[K,K,True] 1228 sage: T[K,K,True] 1228 sage: W[K] 1228 sage: %timeit I[K,K,True] 625 loops, best of 3: 1.24 µs per loop sage: %timeit T[K,K,True] 625 loops, best of 3: 2.19 µs per loop sage: %timeit W[K] 625 loops, best of 3: 55.6 µs per loop
Hence, the general IdKeyDict
, that implements any fixed key length, already beats TripleDict
. But not all is good: When I tried to compare with MonoDict
, I got a segfault in __setitem__
Back to the drawing board, then...
comment:170 Changed 5 years ago by
OK, patch is updated.
Here are timings for key length 1:
sage: from sage.structure.idkey_dict import IdKeyDict sage: from sage.structure.coerce_dict import MonoDict sage: import weakref sage: I = IdKeyDict(1,53, threshold=0.7) sage: M = MonoDict(53, threshold=0.7) sage: W = weakref.WeakKeyDictionary() sage: L = [] sage: for p in prime_range(10000): ....: L.append(GF(p)['x','y']) ....: sage: K = GF(97)['x','y'] sage: K2 = GF(next_prime(p))['x','y'] sage: for i,K in enumerate(L): ....: W[K] = i ....: M[K] = i ....: I[K] = i ....: sage: K in W True sage: K in M True sage: K in I True sage: K2 in W False sage: K2 in M False sage: K2 in I False sage: %timeit K in W 625 loops, best of 3: 56 µs per loop sage: %timeit K in M 625 loops, best of 3: 550 ns per loop sage: %timeit K in I 625 loops, best of 3: 571 ns per loop sage: %timeit K2 in W 625 loops, best of 3: 2.46 µs per loop sage: %timeit K2 in M 625 loops, best of 3: 1.55 µs per loop sage: %timeit K2 in I 625 loops, best of 3: 387 ns per loop sage: W[K] 1228 sage: M[K] 1228 sage: I[K] 1228 sage: %timeit W[K] 625 loops, best of 3: 55.1 µs per loop sage: %timeit M[K] 625 loops, best of 3: 616 ns per loop sage: %timeit I[K] 625 loops, best of 3: 523 ns per loop
and concerning garbage collection
sage: del L sage: import gc sage: gc.collect() 321790 sage: len(W) 1 sage: len(M) 1 sage: len(I) 1 sage: list(I.iteritems()) [((194308608, <weakref at 0xbbd9110; to 'sage.rings.polynomial.multi_polynomial_libsingular' at 0xb94ea00 (JoinCategory.parent_class)>), 1228)]
Hence, the timings of the general implementation (general key length) are competitive. What do you think about it? Of course, one would need to add doctests.
comment:171 Changed 5 years ago by
For completeness: IdKeyDict
is also faster than a Python dict, but of course that should mainly be true because we are comparing by identity, not equality.
sage: from sage.structure.idkey_dict import IdKeyDict sage: I1 = IdKeyDict(1,53) sage: I3 = IdKeyDict(3,53) sage: D1 = {} sage: D3 = {} sage: L = [] sage: for p in prime_range(10000): ....: L.append(GF(p)['x','y']) ....: sage: for i,K in enumerate(L): ....: I1[K] = i ....: I3[K,K,True] = i ....: D1[K] = i ....: D3[K,K,True] = i ....: sage: K = GF(97)['x','y'] sage: K2 = GF(next_prime(p))['x','y'] sage: K in I1 True sage: K in D1 True sage: K2 in I1 False sage: K2 in D1 False sage: (K,K,True) in D3 True sage: (K,K,True) in I3 True sage: %timeit K in D1 625 loops, best of 3: 720 ns per loop sage: %timeit K in I1 625 loops, best of 3: 560 ns per loop sage: %timeit K2 in D1 625 loops, best of 3: 763 ns per loop sage: %timeit K2 in I1 625 loops, best of 3: 400 ns per loop sage: %timeit (K,K,True) in D3 625 loops, best of 3: 1.53 µs per loop sage: %timeit (K,K,True) in I3 625 loops, best of 3: 1.3 µs per loop sage: %timeit (K2,K,True) in D3 625 loops, best of 3: 1.45 µs per loop sage: %timeit (K2,K,True) in I3 625 loops, best of 3: 555 ns per loop sage: %timeit D1[K] 625 loops, best of 3: 647 ns per loop sage: %timeit I1[K] 625 loops, best of 3: 502 ns per loop sage: %timeit D3[K,K,True] 625 loops, best of 3: 1.48 µs per loop sage: %timeit I3[K,K,True] 625 loops, best of 3: 1.23 µs per loop
comment:172 follow-ups: ↓ 174 ↓ 177 Changed 5 years ago by
Running make ptest on a quite recent computer (12 threads), makes the runtime go from about 399 seconds on vanilla 5.3.beta2 to about 407 seconds on 5.3.beta2 + #715 + #11521 + #12313 averaging on a few runs.
On sage/schemes/elliptic_curves/sha_tate.py alone with long tests it goes from about 79 secs to about 82.5 secs.
comment:173 follow-up: ↓ 180 Changed 5 years ago by
I wrote some general code for what I now call
IdKeyDict
: A weak key dictionary (that also accepts items that aren't weak-reffable) with keys of prescribed length, keys being compared by identity. There, the tmp is not used.
<size_t><object>...
has two effects: It does de refcount dance (mostly superfluous) and it inserts PyInt_AsSsize_t call (necessary). By introducing the right header template you can get the second and avoid the first. See the code at #13387.
- It's a shame we can't store our
h
values immediately in the list but that we have to store them in an integer, which means we're storing a pointer to that integer in the list.
OK, perhaps a little cryptic indeed. What I meant was that for a monodict the bucket entries are really of the form
( key_id, ref, value )
which one could store as
(<size_t>, <PyObject*>, <PyObject*>)
but we're storing it as (Lists accept nothing else)
(<PyObject*>, <PyObject*>, <PyObject*>)
which means that the first entry gets stored by a pointer to a system-word-sized allocated entry (a PyInt? to be precise) instead of stored directly in the field that would be big enough anyway. That's wasting memory locality and hence CPU cache space on the bit we use the most. (weakref and the values obviously have to be pointers).
Perhaps JP has a point -- the main concern on this ticket is to get a correct MonoDict
. Discussions about optimizing it should go on #13387. The memory leaks desperately need solving. Why don't you move IdKeyDict
there? If you can get it to perform sufficiently well, we could drop it in. Having the code a little more structured would be a benefit from a maintenance point of view.
comment:174 in reply to: ↑ 172 Changed 5 years ago by
Replying to jpflori:
Running make ptest on a quite recent computer (12 threads), makes the runtime go from about 399 seconds on vanilla 5.3.beta2 to about 407 seconds on 5.3.beta2 + #715 + #11521 + #12313 averaging on a few runs.
On sage/schemes/elliptic_curves/sha_tate.py alone with long tests it goes from about 79 secs to about 82.5 secs.
Well, cacheing does have a benefit. If you have enough memory, it's sometimes faster to simply not do any deallocation or garbage collection. It's nice if we can do this without performance loss but I think not leaking memory is worth some speed. Speed in exchange for a leak often means that you'll be able to run out of memory faster.
I guess the key is getting confident that the time is lost in recomputing things that have been rightfully collected rather than computations performing slower. People can solve the first by keeping references to things they know they'll need again.
comment:175 follow-up: ↓ 176 Changed 5 years ago by
I agree. I'm only pointing out that on at least one computer, the speed regression, which had to be expected, seems more than negligible to me and should definitely not be taken into account for the inclusin of these tickets.
comment:176 in reply to: ↑ 175 Changed 5 years ago by
Replying to jpflori:
I agree. I'm only pointing out that on at least one computer, the speed regression, which had to be expected, seems more than negligible to me and should definitely not be taken into account for the inclusin of these tickets.
If someone is really interested in tracking down where the speed regression is coming from, they could sprinkle parents with an "elixir of life". I think it would be enough to patch UniqueRepresentation
with a global list to which any newly created structure gets appended. That should bless all parents with eternal life and hence make all our weak caching permanent once again.
Code that significantly benefits from this elixir should be marked for restructuring. It should keep better track of what parents it reuses. It's unrealistic to expect that the system is clairvoyant in this matter.
comment:177 in reply to: ↑ 172 ; follow-up: ↓ 181 Changed 5 years ago by
Replying to jpflori:
Running make ptest on a quite recent computer (12 threads), makes the runtime go from about 399 seconds on vanilla 5.3.beta2 to about 407 seconds on 5.3.beta2 + #715 + #11521 + #12313 averaging on a few runs.
On sage/schemes/elliptic_curves/sha_tate.py alone with long tests it goes from about 79 secs to about 82.5 secs.
sage: 8./399 0.0200501253132832 sage: 3.5/79 0.0443037974683544
How significant is that?
comment:178 follow-up: ↓ 185 Changed 5 years ago by
As far as I'm concerned it not significant at all in comparaison with the memleaks it prevents.
Without the patches, I'd have to restart Sage every couple of hours because of lack of memory.
With the patches, I'll just have to wait a couple of extra minutes more.
comment:179 Changed 5 years ago by
And provided #13387 gives good results, this will become even less significant.
comment:180 in reply to: ↑ 173 ; follow-up: ↓ 182 Changed 5 years ago by
Replying to nbruin:
I wrote some general code for what I now call
IdKeyDict
: A weak key dictionary (that also accepts items that aren't weak-reffable) with keys of prescribed length, keys being compared by identity. There, the tmp is not used.
<size_t><object>...
has two effects: It does de refcount dance (mostly superfluous) and it inserts PyInt_AsSsize_t call (necessary). By introducing the right header template you can get the second and avoid the first. See the code at #13387.
Thank you for the pointer!
Perhaps JP has a point -- the main concern on this ticket is to get a correct
MonoDict
. Discussions about optimizing it should go on #13387. The memory leaks desperately need solving. Why don't you moveIdKeyDict
there? If you can get it to perform sufficiently well, we could drop it in. Having the code a little more structured would be a benefit from a maintenance point of view.
I see these alternative approaches:
- We essentially keep the code from #715 and the main patch from here, for now. However, we need to do some fixes, because of the segfaults and stuff. But apart from that fix, #13387 should focus on performance and perhaps generality
IdKeyDict
. #12313 will be a dependency for #13387. - The performance of
IdKeyDict
is decent, actually seems to be better thanMonoDict
andTripleDict
. So, why not use the more general structure for the coercion dicts? Hence, we could make #13387 be independent of #715 and #12313 and mark both #715 and #12313 as duplicates.
To decide which way to go, we should probably ask: How much additional work is needed to fix the current segfault problems with #715 and #12313? If it isn't much, then the first way may be better. Otherwise, we may choose the second way.
As much as I see, what needs to be fixed (or just needs to be reviewed, because it already is in the patches): After testing equality of memory addresses, test whether the weak reference is still valid. If it isn't then delete the weak reference, in order to avoid its callback being active.
So, 1. or 2., or a third way?
comment:181 in reply to: ↑ 177 Changed 5 years ago by
Replying to SimonKing:
sage: 8./399 0.0200501253132832 sage: 3.5/79 0.0443037974683544
Not very significant in my opinion, but because this is fundamental infrastructure it would be worthwhile if someone investigated at some point. I did some initial test to equip UniqueRepresentation
and UniqueFactory
with a permanent store for anything they produced and the effect wasn't very big for sha_tate
. I might be overlooking some other construction that's contributing. Finding a smoking gun for this particular regression would be reassuring, but it might not exist.
Perhaps it's best to just profile some of these examples properly and see what's costing (and perhaps compare profiles?). If it's just the garbage collector doing its thing, we cannot do much about it. If it's recreating collected parents all the time, people should redesign that piece of code to keep the parents.
However, if we're finding that actions take much longer to perform now because they have to look up weakrefs rather than follow normal refs (or something similar), perhaps actions need a different way of operating (or at least an alternative, unsafe path that can be used in time critical parts)
If we're finding that the system is now spending significantly more time deleting things because so many more deletions involve swaths of weakref callbacks, we might want to think of more efficient mechanisms there, or just accept that this is part of the reasonable cost for memory management.
I doubt python-level profiling will cut it here. Anybody with good lower level profiling skills? Can we extract the commands run through sage for doctesting sha_tate.py
and compare profiles between 5.3b2 and 5.3b2 + #715 + #11521 + #12313 ? Both python level and c-level profiles would be useful.
Again, none of this will hold back a positive review from me and they're more issues for follow-up tickets.
comment:182 in reply to: ↑ 180 ; follow-ups: ↓ 183 ↓ 184 Changed 5 years ago by
As much as I see, what needs to be fixed (or just needs to be reviewed, because it already is in the patches): After testing equality of memory addresses, test whether the weak reference is still valid. If it isn't then delete the weak reference, in order to avoid its callback being active.
I've seen complete doctest successes for #715 + #11521 + #12313, but the last thing I saw, some bots were persistent about error that they didn't report for other patches, which makes me think they may be real issues that are hard to trigger. The problems I currently see:
TripleDict
set method is not dangling callback safe
TripleDict
delitem method doesn't raise an error when deleting a dead entry (don't think that's much of an issue and arguable whether it should do that anyway. In fact, withTripleDictEraser
using it, it can't)
TripleDictIter
isn't safe wrt to dangling weakrefs. That one's a real segfault waiting to happen, because it instantiates object references by pointer cast!
TripleDictEraser
hold a strong reference to the TripleDict?. I suspect that should be weak, although that adds to deletion costs...
- You're importing
ConstantFunction
but you don't use it.
If you have a reliable way of generating the segfaults, it might be worth trying with #13387. The iterator there should be safe. There might be other issues, though.
I have a preference to getting #715 + #11521 in mergeable state, because I think that enhances our chances of actually making progress (in terms of getting stuff merged). Making #13387 one huge patch bomb is probably hard to review. This code really needs to be read for review.
#12215: Bot is reporting a SIGABRT somewhere, which is probably an "assert" failing in C code. Smells like memory corruption, so it would be really useful to get a report on. Code in the patches is really simple, so I'm suspecting a nasty interaction with #715,#11521 that now gets triggered because now things are actually getting collected.
#12313: Bot reports a SIGSEGV somewhere, Same story as above.
#13370: Bot reports a SIGSEGV
Most likely explanation is that our MonoDict? and TripleDict? implementations corrupt memory in rare cases and that the other tickets trigger that by changing other things so that they get exercised more.
Can we get access to the bot or equivalent configuration to recreate the circumstances? Or can we get someone involved who can (one is called volker-desktop ...).
comment:183 in reply to: ↑ 182 ; follow-up: ↓ 188 Changed 5 years ago by
Replying to nbruin:
TripleDictIter
isn't safe wrt to dangling weakrefs. That one's a real segfault waiting to happen, because it instantiates object references by pointer cast!
OK.
TripleDictEraser
hold a strong reference to the TripleDict?. I suspect that should be weak, although that adds to deletion costs...
Why? References to TripleDictEraser
come from the TripleDict
it points to, and from the weak references that are stored in said TripleDict
. Would this be more than having a reference cycle that only involves KeyedRef
, TripleDict
and TripleDictEraser
? None of them has a __del__
method, isn't it?
- You're importing
ConstantFunction
but you don't use it.
Right.
If you have a reliable way of generating the segfaults, it might be worth trying with #13387. The iterator there should be safe. There might be other issues, though.
I have a preference to getting #715 + #11521 in mergeable state,
Hence, if you have a working iterator at #13387 then we should use that. Note that idkey_dict has a different iterator, thanks to Cython now supporting "yield". Do you think that iterator is safe?
Most likely explanation is that our MonoDict? and TripleDict? implementations corrupt memory in rare cases and that the other tickets trigger that by changing other things so that they get exercised more.
Can we get access to the bot or equivalent configuration to recreate the circumstances? Or can we get someone involved who can (one is called volker-desktop ...).
I'll meet Volker in St Andrews next week, if I'm not mistaken.
In the next few days, my resources for working will be a bit limited, but I'll try to get my patches from #715 and here in a better state, according to your remarks.
comment:184 in reply to: ↑ 182 Changed 5 years ago by
Replying to nbruin:
TripleDict
delitem method doesn't raise an error when deleting a dead entry (don't think that's much of an issue and arguable whether it should do that anyway. In fact, withTripleDictEraser
using it, it can't)
TripleDictEraser
isn't using it.
comment:185 in reply to: ↑ 178 Changed 5 years ago by
Replying to jpflori:
As far as I'm concerned it not significant at all in comparaison with the memleaks it prevents.
Without the patches, I'd have to restart Sage every couple of hours because of lack of memory.
With the patches, I'll just have to wait a couple of extra minutes more.
I concur. Even more, I think fixing a memory leak is much more important than efficiency, thus it is not fair to compare both timings.
Thank you very much to all of you for the good work!
Paul
comment:186 Changed 5 years ago by
comment:187 Changed 5 years ago by
- Status changed from needs_review to needs_work
- Work issues set to Rebase, and take into account Nils' comments
comment:188 in reply to: ↑ 183 ; follow-up: ↓ 189 Changed 5 years ago by
Replying to SimonKing:
Why? References to
TripleDictEraser
come from theTripleDict
it points to, and from the weak references that are stored in saidTripleDict
. Would this be more than having a reference cycle that only involvesKeyedRef
,TripleDict
andTripleDictEraser
? None of them has a__del__
method, isn't it?
Hm ... For MonoDictEraser? you used a weakref. That's mainly what led me to believe it's better. I don't know what's more costly. Weakrefs make individual deletions more expensive because of the callbacks. Cyclic garbage buildup adds cost in that the GC is triggered more frequently. It depends a bit on the connectivity of the graph if it's worthwhile introducing enough weakrefs to break cycles. The following check suggests that weakrefs can sometimes be cheaper:
sage: class refstore(object): ....: def __init__(self,ref): ....: self.ref=ref ....: sage: class cyclic(object): ....: def __init__(self): ....: self.me=refstore(self) ....: sage: N=0 sage: def callback(c): ....: global N ....: N+=1 ....: sage: import weakref sage: sage: class weakcyc(object): ....: def __init__(self): ....: self.me=refstore(weakref.ref(self,callback)) ....: sage: #import gc sage: #gc.set_debug(gc.DEBUG_STATS) sage: %time for i in range(80000): c = cyclic() CPU times: user 0.33 s, sys: 0.01 s, total: 0.35 s Wall time: 0.36 s sage: %time for i in range(80000): c = weakcyc() CPU times: user 0.20 s, sys: 0.00 s, total: 0.20 s Wall time: 0.21 s sage: %time for i in range(80000): c = cyclic() CPU times: user 0.39 s, sys: 0.00 s, total: 0.39 s Wall time: 0.40 s sage: %time for i in range(80000): c = weakcyc() CPU times: user 0.20 s, sys: 0.00 s, total: 0.20 s Wall time: 0.21 s sage: %time for i in range(80000): c = cyclic() CPU times: user 0.33 s, sys: 0.00 s, total: 0.33 s Wall time: 0.35 s sage: %time for i in range(80000): c = weakcyc() CPU times: user 0.20 s, sys: 0.00 s, total: 0.20 s Wall time: 0.21 s sage: N 239999
(it's fun to turn on gc stats and see that the weakrefs trigger no GC!)
Hence, if you have a working iterator at #13387 then we should use that. Note that idkey_dict has a different iterator, thanks to Cython now supporting "yield". Do you think that iterator is safe?
Apart from the fact that it's returning garbage keys at the moment: absolutely! See #13387. Or otherwise just do a nested while loop, storing the position using bucket number and index in bucket. That's easier logic than the recursive approach. But of course validate the weakrefs.
In the next few days, my resources for working will be a bit limited, but I'll try to get my patches from #715 and here in a better state, according to your remarks.
Have fun!
comment:189 in reply to: ↑ 188 Changed 5 years ago by
Replying to nbruin:
Replying to SimonKing:
Why? References to
TripleDictEraser
come from theTripleDict
it points to, and from the weak references that are stored in saidTripleDict
. Would this be more than having a reference cycle that only involvesKeyedRef
,TripleDict
andTripleDictEraser
? None of them has a__del__
method, isn't it?Hm ... For MonoDictEraser? you used a weakref. That's mainly what led me to believe it's better.
Yes, while working at rebasing my patch, I just noticed.
Not sure. Since I already have updated my patch at #715 without weak reference (for TripleDict
), I keep it like this there. But now I am rebasing my patch from here, and I think I will change it into a weak reference here (in both cases).
Hence, if you have a working iterator at #13387 then we should use that. Note that idkey_dict has a different iterator, thanks to Cython now supporting "yield". Do you think that iterator is safe?
Apart from the fact that it's returning garbage keys at the moment: absolutely!
Yesyes, there are no garbage keys with my new patch at #715.
comment:190 Changed 5 years ago by
- Status changed from needs_work to needs_review
OK, the patch is updated. I am now running tests.
Apply trac_12313-mono_dict-combined-random-sk.patch
comment:191 Changed 5 years ago by
Arrgh. There are already several segfaults in the first few tests. So, there is something seriously wrong with either the new patch at #715 or here.
comment:192 Changed 5 years ago by
- Work issues Rebase, and take into account Nils' comments deleted
Fixed the segfaults: Since TripleDictEraser
now has a weak reference to a TripleDict
, the callback function of the KeyedReference
s has to test whether that weak reference is still valid. This is what I forgot. Now I am running tests for the updated patch.
Apply trac_12313-mono_dict-combined-random-sk.patch
comment:193 Changed 5 years ago by
I had to update the patch again. It is now almost fine.
Only remaining problem (on my machine, that is):
sage -t -force_lib devel/sage/sage/combinat/integer_vectors_mod_permgroup.py
results in about 20 open processes that never finish. Timeout.
Running the test --verbose, one finds that all tests pass. But then, it is
271 tests in 40 items. 271 passed and 0 failed. Test passed. A workspace appears to have been corrupted... automatically rebuilding (this is harmless).
Harmless - but timeout.
Since the problem appears after running the tests, it seems to come from the doctest framework. Any idea what happens?
comment:194 Changed 5 years ago by
The message seems to come from "sage/interfaces/gap.py".
comment:195 Changed 5 years ago by
... and the problem can not be fixed by doing "gap_reset_workspace()".
comment:196 follow-up: ↓ 199 Changed 5 years ago by
I let the Gap._start
method print a stack when there is a problem that leads to the "this is harmless" message.
After finishing the tests, there is a RuntimeError
"Unable to start gap", and the stack is
(<frame object at 0x117d940>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/gap.py', 964, '_start', [' for bla in inspect.stack():\n'], 0) (<frame object at 0x117d6f0>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/gap.py', 536, '_eval_line', [' self._start()\n'], 0) (<frame object at 0x4a027e0>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/gap.py', 1092, 'set', [' out = self._eval_line(cmd, allow_use_file=True)\n'], 0) (<frame object at 0x42d1d50>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/interface.py', 385, '_create', [' self.set(name, value)\n'], 0) (<frame object at 0x4a07970>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/expect.py', 1327, '__init__', [' self._name = parent._create(value, name=name)\n'], 0) (<frame object at 0x4d3b9e0>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/interface.py', 195, '__call__', [' return cls(self, x, name=name)\n'], 0) (<frame object at 0x49ec260>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/groups/perm_gps/permgroup.py', 562, '__cmp__', [' gSelf = self._gap_()\n'], 0) (<frame object at 0x1408240>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python2.7/weakref.py', 51, 'remove', [' del self.data[wr.key]\n'], 0)
So, apparently one has to look at the __cmp__
method in permgroup.py and make sure that it works when gap can not start.
comment:197 Changed 5 years ago by
The RuntimeError
occurs in the following line in sage/interfaces/gap.py
Expect._start(self, "Failed to start GAP.")
The odd thing is: sage.interfaces.expect.Expect._start
has the arguments "alt_message" and "block_during_init". While the latter is used, the first argument does not occur in the method!
comment:198 Changed 5 years ago by
It hangs during gap_reset_workspace(), more precisely: While loading the Gap package ctbllib.
comment:199 in reply to: ↑ 196 ; follow-up: ↓ 200 Changed 5 years ago by
After finishing the tests, there is a
RuntimeError
"Unable to start gap", and the stack is(<frame object at 0x1408240>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python2.7/weakref.py', 51, 'remove', [' del self.data[wr.key]\n'], 0)
This is in the removal callback of a WeakValueDictionary
. So, what I *think* is
happening is that an object that is dependent on Gap for its equality test is
used in a cache somewhere.
Probably during shutting down, the value gets deleted and the dictionary gets its callback to remove the value. However, this is during shutdown! Things are in funny states. Probably, the deletion from the dictionary doesn't really matter anyway, because that's probably next in line to be destroyed.
However, the callback doesn't know that!
This is a nasty complication of these weakreffed dictionaries: the callback code can run under funny conditions and an equality test can involve a lot in some sage objects.
If we could tell our WeakValueDictionaries
something about their codomain
their callback could check if that still exists. If not then none of the entries
matter, so you don't have to bother to locate the appropriate key. Just delete
them all.
EDIT: I guess a solution here could be to store a strong ref to the gap interface on the object that needs gap comparison. That way, gap will surely still exist (and useable?) whenever such an object is alive. The possibilities of callbacks on objects mean you have to be more explicit about what they need to function! Nasty ...
comment:200 in reply to: ↑ 199 Changed 5 years ago by
Replying to nbruin:
If we could tell our
WeakValueDictionaries
something about their codomain
Unfortunately, these aren't our WeakValueDictionaries
, but Python's. So, we can not.
EDIT: I guess a solution here could be to store a strong ref to the gap interface on the object that needs gap comparison.
Sounds like a good idea.
Another one or two ideas:
- We find out where that weak dictionary lives, and delete the whole dictionary explicitly when Sage shuts down.
- We find out where that weak dictionary lives, and if we are lucky then we can replace it by a
MonoDict
orTripleDict
.
comment:201 Changed 5 years ago by
I just tried to add an attribute pointing to gap in each instance of PermutationGroup_generic
. It didn't help. I tested: The stored gap instance is not running, and it won't start.
comment:202 Changed 5 years ago by
It turns out that the problematic dictionary is UniqueRepresentation.__classcall__.cache
, which is a weak dictionary by #12215, and which can not be turned into a MonoDict
or TripleDict
.
Is it an option to delete it explicitly when Sage starts to shut down?
comment:203 follow-up: ↓ 207 Changed 5 years ago by
There is a function quit_sage
in sage.all. The first thing it does is to quit all sage interfaces.
If one prepends it with a garbage collection, then the problem vanishes. It would not help to move quitting the interfaces to the very end of quit_sage().
So, question: Is an explicit garbage collection an option?
comment:204 Changed 5 years ago by
- Description modified (diff)
I made a patch that makes Sage force a garbage collection before closing the interfaces. I am confident that now all tests will pass.
Apply trac_12313-mono_dict-combined-random-sk.patch trac_12313_quit_sage.patch
comment:205 Changed 5 years ago by
Alas. While running the tests, I accidentally saw
sage -t -force_lib devel/sage/sage/misc/preparser.py [7.5 s] sage -t -force_lib devel/sage/doc/en/thematic_tutorials/lie/crystals.rst [5.8 s] sage -t -force_lib devel/sage/sage/groups/group.pyx [5.9 s] sage -t -force_lib devel/sage/sage/rings/residue_field.pyx [7.7 s] *** glibc detected *** python: double free or corruption (!prev): 0x000000000757a7b0 *** sage -t -force_lib devel/sage/sage/plot/animate.py [4.5 s] sage -t -force_lib devel/sage/sage/schemes/plane_curves/affine_curve.py [6.5 s] sage -t -force_lib devel/sage/sage/coding/source_coding/huffman.py [4.2 s] sage -t -force_lib devel/sage/sage/groups/abelian_gps/abelian_group.py [4.6 s] sage -t -force_lib devel/sage/doc/fr/tutorial/tour_algebra.rst [7.6 s]
I give so much context, since I am running 4 tests in parallel. No idea which of the tests results in the "glibc detected" being printed.
comment:206 Changed 5 years ago by
If I run any of the tests mentioned in my previous post, no "glibc detected" occurs.
comment:207 in reply to: ↑ 203 Changed 5 years ago by
Really, it's a bit silly that WeakValueDictionary
does a lookup by key on its removal callback.
An alternative, better, but more laborious solution is to improve WeakValueDictionary
. What we need is a dictionary with an extra method:
def delete_by_hash_and_value_id(self,hash,id): find bucket based on hash find entry by walking the bucket to find the given *value* id. delete that entry
Then your WeakValueDictionary
can inherit from it and do:
def __setitem__(self,key,value): r=KeyedRef(value,(h,id(value)), lambda ref: self.delete_by_hash_and_value_id(ref.key[0],ref.key[1]) super(self).__setitem__(key,r)
The most attractive approach would be to subclass dict, but getting access to its internal storage to do this modified search would be painful.
comment:208 follow-ups: ↓ 209 ↓ 210 Changed 5 years ago by
My bet is that this:
sage: search_src("WeakValueDict") ... rings/polynomial/pbori.pyx:255:rings = weakref.WeakValueDictionary() ...
is responsible for the SIGABRT in much the same way.
comment:209 in reply to: ↑ 208 Changed 5 years ago by
Replying to nbruin:
My bet is that this:
sage: search_src("WeakValueDict") ... rings/polynomial/pbori.pyx:255:rings = weakref.WeakValueDictionary() ...is responsible for the SIGABRT in much the same way.
Funny that you name pbori.pyx! Meanwhile the doctests are finished, and I got the error exactly there! And it is reproducible. Namely:
sage -t --verbose -force_lib "devel/sage/sage/rings/polynomial/pbori.pyx" 2342 tests in 280 items. 2342 passed and 0 failed. Test passed. *** glibc detected *** python: double free or corruption (!prev): 0x0000000007367de0 ***
and then htop shows me processes like
python /mnt/local/king/.sage/tmp/pbori_22813.py python /mnt/local/king/.sage/tmp/mpc622-24779/pbori_9077.py
that are running at 0% CPU. Timeout, and again the error occurs when shutting down Sage.
But that's good, because it is reproducible! And if pbori.pyx uses a weak dictionary (I hope it wasn't I who introduced it...), it absolutely makes sense to see an error there.
comment:210 in reply to: ↑ 208 ; follow-up: ↓ 212 Changed 5 years ago by
Replying to nbruin:
rings/polynomial/pbori.pyx:255:rings = weakref.WeakValueDictionary?()
But is that dictionary used? I can't find a spot in pbori.pyx where it is modified. Not in pbori itself, and also I couldn't find another module where it is imported.
comment:211 Changed 5 years ago by
I inserted some lines into sage_quit, such that the length and the contents of sage.rings.polynomial.pbori are printed: It was empty. Even if I create a boolean polynomial ring, the dictionary is empty when starting to shut Sage down. Hence, it seems that it is not to blame.
comment:212 in reply to: ↑ 210 Changed 5 years ago by
Replying to SimonKing:
Replying to nbruin:
rings/polynomial/pbori.pyx:255:rings = weakref.WeakValueDictionary?()
But is that dictionary used? I can't find a spot in pbori.pyx where it is modified. Not in pbori itself, and also I couldn't find another module where it is imported.
I ran a hypersearch, and don't see anywhere else that it appears, either. Have you tried deleting that line?
(Believe it or not, I've been following this...)
comment:213 follow-ups: ↓ 215 ↓ 216 Changed 5 years ago by
It looks like that garbage collection is a great idea! The error is now completely reproducible:
sage -t --verbose -force_lib "devel/sage/sage/rings/polynomial/pbori.pyx" # Killed/crashed
I've instrumented the remove function on the WeakValueDictionary
a little bit:
def remove(wr, selfref=ref(self)):
self = selfref()
if self is not None:
+ if type(wr.key) is tuple:
+ T=[]
+ for t in wr.key:
+ try:
+ T.append(str(t))
+ except:
+ T.append("TYPE: %s"%type(t))
+ print('removal on key %s'%T)
del self.data[wr.key]
self._remove = remove
UserDict.UserDict.__init__(self, *args, **kw)
to find (now doctests are failing due to the extra output):
... ***Test Failed*** 18 failures. removal on key ['Rational Field', "('z', 'y')", '2', 'False', 'Degree reverse lexicographic term order'] *** glibc detected *** python: double free or corruption (!prev): 0x0000000006baaaa0 *** ... Unhandled SIGABRT: An abort() occurred in Sage.
That looks an awful lot like the kind of key that a polynomial ring constructor would get. So I guess it's just one of those WeakCachedFunction
dictionaries again. I suspect that part of polybori is already torn down when this entry is looked up. It really seems that the alternatives are:
- fix all those libraries to keep sufficient references to do equality testing till the very end (may be a bit unfair. We're really just calling them to throw them away)
- make a WeakValueDictionary? that doesn't need key comparisons on its callback and use that instead
- convince the release manager that these tickets are important and that these doctests are not the fault of these tickets. Remove any components that fail due to these changes.
Option 2 looks quite doable by comparison ... I'd say use a design rather similar to MonoDict
. We probably won't beat WeakValueDict
for standard performance, but we're making another trade-off: we want safe removal, even at the price of a little performance. So perhaps two variants, WeakCachedFunction
and SafeWeakCachedFunction
.
comment:214 Changed 5 years ago by
Thank you for looking into the problem. I found that the problem seems to be related with the doctest framework, too.
Namely, put the following into a file "test.py":
""" sage: B.<a,b,c,d,e,f> = BooleanPolynomialRing() sage: I = ideal(a*b + a + b*e + c*e + 1, a + b + c*d + c + 1, a*c + c + d*f + d + 1, a*c + c*f + c + d*f + 1, c*f + c + d + e + 1, a + b*c + b*d + e*f + 1) sage: I.groebner_basis() [1] """
Then, run sage -t on it. You will find the "glibc detected"
You may also try to run the same example interactively in Sage, and quit. Everything is fine (or at least, it was for me).
comment:215 in reply to: ↑ 213 Changed 5 years ago by
Replying to nbruin:
... ***Test Failed*** 18 failures. removal on key ['Rational Field', "('z', 'y')", '2', 'False', 'Degree reverse lexicographic term order'] *** glibc detected *** python: double free or corruption (!prev): 0x0000000006baaaa0 *** ... Unhandled SIGABRT: An abort() occurred in Sage.That looks an awful lot like the kind of key that a polynomial ring constructor would get.
Right! Could be a BooleanPolynomialRing
, though.
So I guess it's just one of those
WeakCachedFunction
dictionaries again.
Indeed, sage.rings.polynomial.polynomial_ring_constructor contains
import weakref _cache = weakref.WeakValueDictionary()
And guess who has introduced that line in #715 :-/
comment:216 in reply to: ↑ 213 ; follow-up: ↓ 223 Changed 5 years ago by
Replying to nbruin:
I've instrumented the remove function on the
WeakValueDictionary
a little bit:
I did a similar thing, printing a message naming the ref.key before deletion and after deletion. Then I ran sage -t --verbose on my small test file.
Problem: All deletions were successful!
To be precise, it all starts with many deletions, happening at Sage startup. Apparently Sage computes a lot of things that are then immediately garbage collected. And then, the tests themselves look like this:
Trying: set_random_seed(0L) Expecting nothing ok Trying: change_warning_output(sys.stdout) Expecting nothing ok Trying: B = BooleanPolynomialRing(names=('a', 'b', 'c', 'd', 'e', 'f',)); (a, b, c, d, e, f,) = B._first_ngens(6)###line 2:_sage_ >>> B.<a,b,c,d,e,f> = BooleanPolynomialRing() Expecting nothing ok Trying: I = ideal(a*b + a + b*e + c*e + Integer(1), a + b + c*d + c + Integer(1), a*c + c + d*f + d + Integer(1), a*c + c*f + c + d*f + Integer(1), c*f + c + d + e + Integer(1), a + b*c + b*d + e*f + Integer(1))###line 3:_sage_ >>> I = ideal(a*b + a + b*e + c*e + 1, a + b + c*d + c + 1, a*c + c + d*f + d + 1, a*c + c*f + c + d*f + 1, c*f + c + d + e + 1, a + b*c + b*d + e*f + 1) Expecting nothing ok Trying: I.groebner_basis()###line 4:_sage_ >>> I.groebner_basis() Expecting: [1] ********************************************************************** File "/home/king/SAGE/tests/pbori/pboriNeu/test_12.py", line ?, in __main__.example_0 Failed example: I.groebner_basis()###line 4:_sage_ >>> I.groebner_basis() Expected: [1] Got: try to remove (('RingHomset_generic_with_category', (<class 'sage.rings.homset.RingHomset_generic'>, <class 'sage.categories.category.JoinCategory.parent_class'>), None, None, <class 'sage.rings.homset.RingHomset_generic'>), ()) successful on (('RingHomset_generic_with_category', (<class 'sage.rings.homset.RingHomset_generic'>, <class 'sage.categories.category.JoinCategory.parent_class'>), None, None, <class 'sage.rings.homset.RingHomset_generic'>), ()) [1] Trying: sig_on_count() Expecting: 0 ok 4 items had no tests: __main__ __main__.change_warning_output __main__.check_with_tolerance __main__.warning_function ********************************************************************** 1 items had failures: 1 of 6 in __main__.example_0 6 tests in 5 items. 5 passed and 1 failed. ***Test Failed*** 1 failures. *** glibc detected *** python: double free or corruption (out): 0x0000000004be7e60 *** ^CAborting further tests. KeyboardInterrupt -- interrupted after 16.0 seconds! ---------------------------------------------------------------------- All tests passed!
So, there is only a Homset (for which there is a weak cache, by #11521) being collected during the tests, and perhaps it hasn't been the weak dictionary of polynomial rings, after all.
How can one find out what kind of thing got freed twice? I am afraid gdb does not seem to work on doc tests, and I can not reproduce the problem interactively.
comment:217 Changed 5 years ago by
I think two dictionaries are to blame: _cache
in sage/rings/polynomial/polynomial_ring_constructor.py
and self._cache
in UniqueFactory
. If I change both to dict()
I don't get an error. If either is Weak
, I do get the error.
In the process I saw how precarious remove
methods are. I wanted to put a sys.stderr.write
in the remove
function in WeakValueDictionary
. I ended up with
def __init__(self, *args, **kw): MYSTDERR=sys.stderr def remove(wr, selfref=ref(self)): self = selfref() if self is not None: MYSTDERR.write("REMOVING A KEY\n"); del self.data[wr.key] self._remove = remove UserDict.UserDict.__init__(self, *args, **kw)
(so remove
is a closure that keeps a reference to MYSTDERR
) because anything less would lead to errors. Apparently you cannot assume that the globals dictionary of your module is still intact when your remove method runs. That is such a restriction that I think we cannot expect people to write their equality testers to survive in such a harsh environment. We either have to track down all caches and tear them down in a more friendly setting or we have to go with a "safe" WeakValueDictionary?, which doesn't call methods on keys upon removal.
I'm learning much more about Python than I ever wanted to know ...
comment:218 Changed 5 years ago by
Would it help to explicitly delete (or explicitly clear) the "bad" weak dictionaries?
If it does, one could proceed as follows:
- Introduce a function that returns (usual) weak key dictionaries, and stores a (weak?) reference to all the weak key dictionaries it creates.
- In quit_sage, explicitly delete (or clear) all the weak key dictionaries in store.
Your suggestion to create a new a safe weak key dictionary sounds like much work to me...
comment:219 Changed 5 years ago by
Yes that might work as a workaround for now. The store indeed needs to be weak. It could itself be a WeakValueDictionary
, keyed on the id of the WeakValueDictionaries
it stores. You'd like to put them in a WeakSet
, but they're not hashable.
If we're unlucky you'll get into an infinite loop, though. It's not impossible that sage, upon equality testing, decides to construct some new parents to coerce things into in order to compare them. These parents would then get created and entered into caches that we might just have cleared! If the process involves creating a new category, a dynamic class might even end up producing a new UniqueRepresentation
and enter a new cache into our "to clear" dictionary. Dictionaries behave very badly when they are modified while being iterated over.
So: Worth a try for now. Let's do that to get this ticket out the door.
Future Work
I foresee crazy bugs in the future. A WeakValueDict
with safe removal is probably ultimately unavoidable. And is the work really so bad?
We could just go with roughly the design for IdKeyDict
, but then using Python's hash function and simpler entries. Normal lookup and delete would be via hash and then equality on the keys. Special deletion would be by hash and id of the value. Once we have a working prototype we can put it out there for the Python community. With any luck someone adopts it and improves, tunes and optimizes it. I don't think our applications are so super time critical. It's all for creation of parents, which is expensive anyway.
Our applications only need a rudimentary Dictionary
interface, but adding the other methods would not be a huge job.
See #13394
comment:220 follow-ups: ↓ 221 ↓ 222 Changed 5 years ago by
Too bad ... We never did positively confirm that the problem here was some method on a key getting called. The patch doesn't help for the polybori problem. I suspect it really is a double memory release somewhere that simply didn't ever occur before because the memory didn't get released in the first place? (perhaps python doesn't deallocate all memory upon exit? why would they?
I don't think a more fancy implementation of a safeweakvaluedict would work either. By keeping references to all keys while *clearing* the dictionary (that's just the clear that's inherited on dict), I think we are avoiding all calls to keys. Purging a dict doesn't involve key lookups and deleting the keys afterwards is just that. That should be possible. By then we're not holding weakrefs on them anymore, so no callbacks.
So, while I still think the current WeakValueDict
design has a silly callback, the only problem with it that we have actually observed can be solved with a well-timed gc.collect()
.
comment:221 in reply to: ↑ 220 Changed 5 years ago by
What I just tried was: Take all instances of WeakValueDictionary
in gc.get_objects()
and call clear()
on each. Then do gc.collect()
. But it didn't help, there still is the double free, which occurs during gc.collect()
(I have a print statement directly before and directly after gc.collect()
, but only the first is executed).
comment:222 in reply to: ↑ 220 Changed 5 years ago by
Replying to nbruin:
I don't think a more fancy implementation of a safeweakvaluedict would work either. By keeping references to all keys while *clearing* the dictionary (that's just the clear that's inherited on dict), I think we are avoiding all calls to keys.
I am not sure what you mean here by "keys". Are you talking about the keys of the weak references? Are you talking about the keys of WeakValueDictionary
?
comment:223 in reply to: ↑ 216 ; follow-up: ↓ 225 Changed 5 years ago by
Replying to SimonKing:> How can one find out what kind of thing got freed twice? I am afraid gdb does not seem to work on doc tests, and I can not reproduce the problem interactively.
You should be able to run a doctest inside gdb using
./sage -t -gdb devel/sage/...
comment:224 Changed 5 years ago by
Now that's funny. I tried to keep the store of any KeyedReference
. But it turns out that there is always exactly one dead KeyedReference
that raises an attribute error when trying to get the key. Namely, I inserted the following at the beginning of sage.all.quit_sage:
import gc, weakref KeepRefKeys = [] for r in gc.get_objects(): if isinstance(r, weakref.KeyedRef): try: KeepRefKeys.append(r.key) except AttributeError: print "Warning, no key for",r del r print "all keys are in store" gc.collect() print "collection done"
Then, quitting a normal Sage session, I get
sage: quit Exiting Sage (CPU time 0m0.05s, Wall time 0m0.85s). Warning, no key for <weakref at 0x3d281d0; dead> all keys are in store collection done
and executing the "bad" three-line doctest, I get
6 tests in 5 items. 6 passed and 0 failed. Test passed. Warning, no key for <weakref at 0x3a8f290; dead> all keys are in store *** glibc detected *** python: double free or corruption (out): 0x0000000004be6cf0 *** ^CKeyboardInterrupt -- interrupted after 5.4 seconds! Aborting further tests.
comment:225 in reply to: ↑ 223 Changed 5 years ago by
Replying to jpflori:
Replying to SimonKing:> How can one find out what kind of thing got freed twice? I am afraid gdb does not seem to work on doc tests, and I can not reproduce the problem interactively.
You should be able to run a doctest inside gdb using
./sage -t -gdb devel/sage/...
Not really. What I get from that is
sage -t -gdb "test_12.py" ******************************************************************************** Type r at the (gdb) prompt to run the doctests. Type bt if there is a crash to see a traceback. ******************************************************************************** 'import site' failed; use -v for traceback GNU gdb (GDB) 7.0.1-debian Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/bin/python...done. (gdb) r Starting program: /mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/bin/python /mnt/local/king/.sage/tmp/test_12_1233.py [Thread debugging using libthread_db enabled] Traceback (most recent call last): File "/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib64/libstdc++.so.6.0.16-gdb.py", line 19, in <module> import os ImportError: No module named os
Then wait a long time, finally one has
*** glibc detected *** /mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/bin/python: double free or corruption (out): 0x0000000004be4ec0 ***
And then? Keyboard Interrupt, or what?
^C Program received signal SIGINT, Interrupt. __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
Nothing happens, in particular I do not get a gdb prompt any more.
comment:226 Changed 5 years ago by
You could try valgrind but good luck with decrypting the output.
comment:227 Changed 5 years ago by
gdb just totally hangs. 0% CPU according to htop.
comment:228 Changed 5 years ago by
IIRC, Volker is quite good at C-level debugging. I hope he can help me, next week.
comment:229 follow-up: ↓ 236 Changed 5 years ago by
Replying to SimonKing:
I am not sure what you mean here by "keys". Are you talking about the keys of the weak references? Are you talking about the keys of
WeakValueDictionary
?
Yes. they are the same :-). I was mainly talking about what I tried in safeweakdict
and that I think we achieve the same safety that a proper SafeWeakValueDict
would provide.
By the way, for me (on Fedora 16) sage -t -gdb
works like a charm and, indeed, the backtrace shows this is indeed gcmodule
leading to Delete<polybori::groebner::ReductionStrategy>
leading to an error raise in libc
. So either the way polybori is wrapped is wrong or polybori itself has a "memory drain" (or whatever the reverse of a leak is). Time for a polybori expert I'd say.
I think it happens via libs/polybori/decl.pxd
void PBRedStrategy_delete "Delete<ReductionStrategy>"(PBRedStrategy *mem)
comment:230 follow-up: ↓ 235 Changed 5 years ago by
Replying to SimonKing:
gdb just totally hangs. 0% CPU according to htop.
I got a similar problem in #12173. FLINT wants to free the memory it uses upon quitting, but Sage/Python? tries to do the same upon deletion of Sage objects. The solution we adopted was to let the FLINT memory (not all of it fortunately) in place, although that would to memory leaks if one would load and quit Sage several times from one Python interpreter. We never figured out how in which order things were deallocated when Python was shutdown.
And GDB just hanged the same on the double free errors.
On your example, quite strangely, I get a nice segfault and backtrace as Nils.
Anyway, it could be possible that the situation is as in FLINT: some Sage method tries to directly deallocate something after we called the Polybori cleanup method which already freed taht piece of memory and it crashes.
comment:231 Changed 5 years ago by
Running the minimal doctest provided by Simon and adding a print statement in the _dealloc_ method of ReductionStrategy? indeed shows that PBRedStrategy_delete is called twice on the same memory address.
comment:232 Changed 5 years ago by
When I run it in a Sage interactive session, the address is different.
comment:233 Changed 5 years ago by
Mmm, disregard the above, even in the doctesting mode, the addresses seem different generically.
comment:234 Changed 5 years ago by
Here is the seemingly relevant part of the Valgrind output:
==14712== Invalid free() / delete / delete[] / realloc() ==14712== at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477) ==14712== by 0x2C653D62: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_ReductionStrategy(_object*) (ccobject.h:77) ==14712== by 0x2C5E9B52: __pyx_tp_clear_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (pbori.cpp:52486) ==14712== by 0x4F5AF96: collect (gcmodule.c:769) ==14712== by 0x4F5B728: gc_collect (gcmodule.c:1067) ==14712== by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F23BD1: PyEval_EvalCode (ceval.c:667) ==14712== by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353) ==14712== by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943) ==14712== by 0x4F5A294: Py_Main (main.c:639) ==14712== by 0x544A76C: (below main) (libc-start.c:226) ==14712== Address 0x21352000 is 96 bytes inside a block of size 448 alloc'd ==14712== at 0x4C2C054: operator new(unsigned long) (vg_replace_malloc.c:295) ==14712== by 0x2C6541D9: __pyx_pw_4sage_5rings_10polynomial_5pbori_16GroebnerStrategy_1__init__(_object*, _object*, _object*) (pbori.cpp:39499) ==14712== by 0x4EDB457: type_call (typeobject.c:737) ==14712== by 0x4E7E7A2: PyObject_Call (abstract.c:2529) ==14712== by 0x2C61104A: __pyx_pf_4sage_5rings_10polynomial_5pbori_22BooleanPolynomialIdeal_4groebner_basis(_object*, _object*, _object*) [clone .isra.97] (pbori.cpp:31442) ==14712== by 0x2C6116D6: __pyx_pw_4sage_5rings_10polynomial_5pbori_22BooleanPolynomialIdeal_5groebner_basis(_object*, _object*, _object*) (pbori.cpp:31182) ==14712== by 0x4E7E7A2: PyObject_Call (abstract.c:2529) ==14712== by 0x4F20A6C: PyEval_EvalFrameEx (ceval.c:4239) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F23BD1: PyEval_EvalCode (ceval.c:667) ==14712== by 0x4F21BFE: PyEval_EvalFrameEx (ceval.c:4718) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4EA674B: function_call (funcobject.c:526) ==14712== by 0x4E7E7A2: PyObject_Call (abstract.c:2529) ==14712== by 0x4E8BF1E: instancemethod_call (classobject.c:2578) ==14712== by 0x4E7E7A2: PyObject_Call (abstract.c:2529) ==14712== by 0x4F20A6C: PyEval_EvalFrameEx (ceval.c:4239) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4EA674B: function_call (funcobject.c:526) ==14712== by 0x4E7E7A2: PyObject_Call (abstract.c:2529) ==14712== by 0x4E8BF1E: instancemethod_call (classobject.c:2578) ==14712== by 0x4E7E7A2: PyObject_Call (abstract.c:2529) ==14712== by 0x4F20A6C: PyEval_EvalFrameEx (ceval.c:4239) ==14712== ... ==14712== Invalid free() / delete / delete[] / realloc() ==14712== at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477) ==14712== by 0x2C673D19: polybori::groebner::ReductionStrategy::~ReductionStrategy() (new_allocator.h:98) ==14712== by 0x2C653F2E: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (GroebnerStrategy.h:41) ==14712== by 0x4EB9E9A: PyDict_Clear (dictobject.c:891) ==14712== by 0x4EB9FF8: dict_tp_clear (dictobject.c:2096) ==14712== by 0x4F5AF96: collect (gcmodule.c:769) ==14712== by 0x4F5B728: gc_collect (gcmodule.c:1067) ==14712== by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F23BD1: PyEval_EvalCode (ceval.c:667) ==14712== by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353) ==14712== by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943) ==14712== by 0x4F5A294: Py_Main (main.c:639) ==14712== by 0x544A76C: (below main) (libc-start.c:226) ==14712== Address 0xcd903f0 is 0 bytes inside a block of size 96 free'd ==14712== at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477) ==14712== by 0x2C673D19: polybori::groebner::ReductionStrategy::~ReductionStrategy() (new_allocator.h:98) ==14712== by 0x2C653D5A: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_ReductionStrategy(_object*) (ccobject.h:77) ==14712== by 0x2C5E9B52: __pyx_tp_clear_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (pbori.cpp:52486) ==14712== by 0x4F5AF96: collect (gcmodule.c:769) ==14712== by 0x4F5B728: gc_collect (gcmodule.c:1067) ==14712== by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F23BD1: PyEval_EvalCode (ceval.c:667) ==14712== by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353) ==14712== by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943) ==14712== by 0x4F5A294: Py_Main (main.c:639) ==14712== by 0x544A76C: (below main) (libc-start.c:226) ==14712== ... ==14712== Invalid free() / delete / delete[] / realloc() ==14712== at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477) ==14712== by 0x2C6739D0: std::_Rb_tree<polybori::BooleMonomial, std::pair<polybori::BooleMonomial const, int>, std::_Select1st<std::pair<polybori::BooleMonomial const, int> >, polybori::symmetric_composition<std::less<polybori::CCuddNavigator>, polybori::navigates<polybori::BoolePolynomial> >, std::allocator<std::pair<polybori::BooleMonomial const, int> > >::_M_erase(std::_Rb_tree_node<std::pair<polybori::BooleMonomial const, int> >*) (new_allocator.h:98) ==14712== by 0x2C673D26: polybori::groebner::ReductionStrategy::~ReductionStrategy() (stl_tree.h:639) ==14712== by 0x2C653F2E: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (GroebnerStrategy.h:41) ==14712== by 0x4EB9E9A: PyDict_Clear (dictobject.c:891) ==14712== by 0x4EB9FF8: dict_tp_clear (dictobject.c:2096) ==14712== by 0x4F5AF96: collect (gcmodule.c:769) ==14712== by 0x4F5B728: gc_collect (gcmodule.c:1067) ==14712== by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F23BD1: PyEval_EvalCode (ceval.c:667) ==14712== by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353) ==14712== by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943) ==14712== by 0x4F5A294: Py_Main (main.c:639) ==14712== by 0x544A76C: (below main) (libc-start.c:226) ==14712== Address 0x21353bc0 is 0 bytes inside a block of size 80 free'd ==14712== at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477) ==14712== by 0x2C6739D0: std::_Rb_tree<polybori::BooleMonomial, std::pair<polybori::BooleMonomial const, int>, std::_Select1st<std::pair<polybori::BooleMonomial const, int> >, polybori::symmetric_composition<std::less<polybori::CCuddNavigator>, polybori::navigates<polybori::BoolePolynomial> >, std::allocator<std::pair<polybori::BooleMonomial const, int> > >::_M_erase(std::_Rb_tree_node<std::pair<polybori::BooleMonomial const, int> >*) (new_allocator.h:98) ==14712== by 0x2C673D26: polybori::groebner::ReductionStrategy::~ReductionStrategy() (stl_tree.h:639) ==14712== by 0x2C653D5A: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_ReductionStrategy(_object*) (ccobject.h:77) ==14712== by 0x2C5E9B52: __pyx_tp_clear_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (pbori.cpp:52486) ==14712== by 0x4F5AF96: collect (gcmodule.c:769) ==14712== by 0x4F5B728: gc_collect (gcmodule.c:1067) ==14712== by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F23BD1: PyEval_EvalCode (ceval.c:667) ==14712== by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353) ==14712== by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943) ==14712== by 0x4F5A294: Py_Main (main.c:639) ==14712== by 0x544A76C: (below main) (libc-start.c:226) ==14712== ... ==14712== Invalid free() / delete / delete[] / realloc() ==14712== at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477) ==14712== by 0x2C653F2E: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (GroebnerStrategy.h:41) ==14712== by 0x4EB9E9A: PyDict_Clear (dictobject.c:891) ==14712== by 0x4EB9FF8: dict_tp_clear (dictobject.c:2096) ==14712== by 0x4F5AF96: collect (gcmodule.c:769) ==14712== by 0x4F5B728: gc_collect (gcmodule.c:1067) ==14712== by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F23BD1: PyEval_EvalCode (ceval.c:667) ==14712== by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353) ==14712== by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943) ==14712== by 0x4F5A294: Py_Main (main.c:639) ==14712== by 0x544A76C: (below main) (libc-start.c:226) ==14712== Address 0xcdfa840 is 0 bytes inside a block of size 432 free'd ==14712== at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477) ==14712== by 0x2C653D5A: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_ReductionStrategy(_object*) (ccobject.h:77) ==14712== by 0x2C5E9B52: __pyx_tp_clear_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (pbori.cpp:52486) ==14712== by 0x4F5AF96: collect (gcmodule.c:769) ==14712== by 0x4F5B728: gc_collect (gcmodule.c:1067) ==14712== by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117) ==14712== by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253) ==14712== by 0x4F23BD1: PyEval_EvalCode (ceval.c:667) ==14712== by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353) ==14712== by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943) ==14712== by 0x4F5A294: Py_Main (main.c:639) ==14712== by 0x544A76C: (below main) (libc-start.c:226) ==14712==
comment:235 in reply to: ↑ 230 Changed 5 years ago by
Replying to jpflori:
Anyway, it could be possible that the situation is as in FLINT: some Sage method tries to directly deallocate something after we called the Polybori cleanup method which already freed taht piece of memory and it crashes.
That sounds like the solution should be as in the segfault fix for pari at #12215.
Generally, underlying C-data should be freed by using a method __dealloc__
(not __del__
, please!). You need to be careful, because during __dealloc__
you can not assume that the Python attributes of an instance are still working.
Hence, it is (in my opinion) very wrong that we had
import sage.matrix.matrix_mod2_dense sage.matrix.matrix_mod2_dense.free_m4ri() import sage.libs.flint.flint sage.libs.flint.flint.free_flint_stack() pari._unsafe_deallocate_pari_stack()
in sage.all.quit_sage (I only removed the pari._unsafe_deallocate... in #12215).
comment:236 in reply to: ↑ 229 Changed 5 years ago by
Replying to nbruin:
Replying to SimonKing: By the way, for me (on Fedora 16)
sage -t -gdb
works like a charm
Why does it not work for me? Does one need to do anything special to make it work with Sage?
and, indeed, the backtrace shows this is indeed
gcmodule
leading toDelete<polybori::groebner::ReductionStrategy>
leading to an error raise inlibc
.
OK, then we have to see at what point the reduction strategies are supposed to be deallocated.
I think it happens via
libs/polybori/decl.pxd
void PBRedStrategy_delete "Delete<ReductionStrategy>"(PBRedStrategy *mem)
I can not work on it right now. But thank you for the pointer!
comment:237 Changed 5 years ago by
Here is some hyptothesis about what goes wrong:
- the deallocator for ReductionStrategy? gets called before the GroebnerStrategy? one.
- then GroebnerStrategy? deallocator calls the C++ destructor of self._strat which in turn calls the one for ReductionStrategy?.
- bang.
comment:238 Changed 5 years ago by
I'm not yet convinced the explanation is as what I posted above, but here is a patch which seems logical to me, and prevents the polybori segfaults.
comment:239 Changed 5 years ago by
- Status changed from needs_review to needs_work
- Work issues set to Merge Jean-Pierre's fix for pbori
The patch sounds logical to me. I am now merging it (without the print statements, of course) into my patch, unless you beet me to it.
Changed 5 years ago by
Do a garbage collection before closing the interfaces. Do not delete a borrowed reference to reduction strategies in pbori
comment:240 Changed 5 years ago by
- Status changed from needs_work to needs_review
- Work issues Merge Jean-Pierre's fix for pbori deleted
My second patch is now updated, it includes Jean-Pierre's fix for pbori.
Let us hope that the tests will pass now. The tests in pbori.pyx do pass!
Apply trac_12313-mono_dict-combined-random-sk.patch trac_12313_quit_sage.patch
comment:241 follow-up: ↓ 242 Changed 5 years ago by
Replying to jpflori:
I'm not yet convinced the explanation is as what I posted above, but here is a patch which seems logical to me, and prevents the polybori segfaults.
I agree with your assessment as well. As far as I can see:
GroebnerStrategy.__init__
:
- Creates a new
ReductionStrategy
wrapper and stores it inself.reduction_strategy
- Asks polybori to delete the
self.reduction_strategy._strat
attribute on it - Then fills that slot with
&self._strat.generators
So GroebnerStrategy.reduction_strategy._strat
is now definitely a borrowed reference from ReductionStrategy
. However, GroebnerStrategy
has a reference to the latter, so it's borrowing from something it keeps alive.
GroebnerStrategy.__dealloc__
:
- zeros out
self.reduction_strategy._strat = NULL
- calls
del self._strat
, which I assume amounts to asking polybori to delete it as well.
This strategy works if GroebnerStrategy.__dealloc__
would get called before ReductionStrategy.__dealloc
, but as we know python attributes might already no longer exist at __dealloc__
time, so we're not guaranteed a call order.
So I think the code here should be
if self.reduction_strategy != NULL and self.reduction_strategy._strat != NULL: self.reduction_strategy._strat = NULL if self._count.released(): del self._strat
but this is assuming that attributes get zeroed out on deallocation (Can we even assume that self.reductio_strategy
is still alive at this point?)
Then you could do in ReductionStrategy.__dealloc__
:
if self._strat: PBRedStrategy_delete(self._strat) self._strat = NULL
However, that's assuming that we can establish whether GroebnerStrategy.reduction_strategy
is still valid upon deallocation. Since GroebnerStrategy
has borrowed the reference without ReductionStrategy
really knowing, I think GroebnerStrategy
should just not delete _strat
. In due time, ReductionStrategy.__dealloc__
gets called and will clean up the ref. It's hers!
comment:242 in reply to: ↑ 241 Changed 5 years ago by
Replying to nbruin:
Replying to jpflori:
GroebnerStrategy.__init__
:
- Creates a new
ReductionStrategy
wrapper and stores it inself.reduction_strategy
- Asks polybori to delete the
self.reduction_strategy._strat
attribute on it- Then fills that slot with
&self._strat.generators
So
GroebnerStrategy.reduction_strategy._strat
is now definitely a borrowed reference fromReductionStrategy
.
No, the reduction strategy borrows from the Gröbner strategy.
However, that's assuming that we can establish whether
GroebnerStrategy.reduction_strategy
is still valid upon deallocation. SinceGroebnerStrategy
has borrowed the reference withoutReductionStrategy
really knowing, I thinkGroebnerStrategy
should just not delete_strat
.
But who else should?
In due time,
ReductionStrategy.__dealloc__
gets called and will clean up the ref. It's hers!
No. reduction_strategy._strat
does not borrow a reference to _strat
(hence, ReductionStrategy.__dealloc__
does not dealloc GroebnerStrategy._strat
), but it borrows a reference to _strat.generators
.
That's why I think Jean-Pierre's solution is logical: Explicitly tell ReductionStrategy
whether its _strat
is borrowed from a GroebnerStrategy
's _strat.generators
.
comment:243 Changed 5 years ago by
comment:244 follow-up: ↓ 246 Changed 5 years ago by
Replying to SimonKing:
No, the reduction strategy borrows from the GrÃ¶bner strategy.
Yes, you're right. Sorry. But the problem remains of what to trust on dealloc. The whole borrowing thing complicates that. I think we can simplify it here.
Since GroebnerStrategy
always lends a reference to the ReductionStrategy
it creates, why not give it outright? Then GroebnerStrategy
never deletes _strat
and ReductionStrategy
always does that.
The zeroing out of GroebnerStrategy.reduction_strategy._struct
via that path makes me a little queasy. Plus, what happens if someone goes digging in GroebnerStrategy
and keeps ReductionStrategy
alive for longer? I think handing the reference to a suborbinate on "loan" just makes the code a little harder to reason about. It means you have to enforce that the subordinate doesn't go and do some work for someone else with the lent goods. It's so much easier if you don't have to enforce that at all.
Logically, I think GroebnerStrategy
shouldn't have a _strat
attribute. It should always access GroebnerStrategy.reduction_strategy._strat
. The fact that GroebnerStrategy._strat
does exist should be viewed as an optimization, but not affect program logic.
[Note: this is just squabbling over keeping code more easily maintainable. I agree JP's patch solves the problem. It's just nice if we can resolve the problem by reducing code complexity rather than increasing it]
EDIT: Plus the complication about GroebnerStrategy._strat.generators
and ReductionStrategy._strat
. I suspect this whole bit can be cleaned up without resorting to borrowed
flags but this is not the place.
comment:245 follow-up: ↓ 247 Changed 5 years ago by
YES! All tests pass for me too. I think the saying is "Mission Accomplished".
Concerning speed effects:
5.3b2 + (#715, #11521, #12215, #12313):
Total time for all tests: 1001.4 seconds Total time for all tests: 1002.6 seconds
plain 5.3b2:
Total time for all tests: 928.5 seconds Total time for all tests: 932.6 seconds
so people might want to profile their code every now and again to see if they are losing much time on construction/destruction of parents.
I'll try to finalize the positive reviews on all these tickets and then hopefully these can be the first ones in on 5.4.
comment:246 in reply to: ↑ 244 ; follow-up: ↓ 249 Changed 5 years ago by
- Cc AlexanderDreyer PolyBoRi added
Replying to nbruin:
Replying to SimonKing: Then
GroebnerStrategy
never deletes_strat
andReductionStrategy
always does that.
Again, the problem is that GroebnerStrategy._strat
is not the thing that is lend to ReductionStrategy._strat
. The reference that is lend is GroebnerStrategy._strat.generators
. But since GroebnerStrategy._strat
is more than that, GroebnerStrategy.__dealloc__
must take care of it. "Take care of that" means: Call a C++ destructor on GroebnerStrategy._strat
. Hence, changing it would mean to change the polybori spkg. I am ccing the authors, perhaps they consider such change for the next version.
The zeroing out of
GroebnerStrategy.reduction_strategy._struct
via that path makes me a little queasy. Plus, what happens if someone goes digging inGroebnerStrategy
and keepsReductionStrategy
alive for longer? I think handing the reference to a suborbinate on "loan" just makes the code a little harder to reason about. It means you have to enforce that the subordinate doesn't go and do some work for someone else with the lent goods. It's so much easier if you don't have to enforce that at all.
Probably you are right. But the practical question is whether this should better be on a "proper" PolyBoRi
ticket. If you see an easy way to fix the problem properly, then it may be done here. But my impression is that a proper solution needs a non-trivial amount of changes to sage/rings/polynomial/pbori.pyx; I'd prefer to have a non-trivial change on a new ticket.
Logically, I think
GroebnerStrategy
shouldn't have a_strat
attribute. It should always accessGroebnerStrategy.reduction_strategy._strat
. The fact thatGroebnerStrategy._strat
does exist should be viewed as an optimization, but not affect program logic.
Well, they are different things.
EDIT: Plus the complication about
GroebnerStrategy._strat.generators
andReductionStrategy._strat
. I suspect this whole bit can be cleaned up without resorting toborrowed
flags but this is not the place.
Well, but how, without opening the polybori spkg?
comment:247 in reply to: ↑ 245 Changed 5 years ago by
Replying to nbruin:
YES! All tests pass for me too. I think the saying is "Mission Accomplished".
Concerning speed effects:
5.3b2 + (#715, #11521, #12215, #12313):
Total time for all tests: 1001.4 seconds Total time for all tests: 1002.6 secondsplain 5.3b2:
Total time for all tests: 928.5 seconds Total time for all tests: 932.6 seconds
That is almost 8% regression. That isn't good. Can you find out what tests contribute most to the regression?
When I hacked into the weakref module for debugging, I found that many weak references die during startup of Sage. Hence, while Sage starts, many things are created without keeping a reference to them, so that they will be garbage collection before the user has even the chance to see them. I think it would be good (also for Sage startup time) to avoid that to happen. But I reckon this should be on a new ticket.
comment:248 Changed 5 years ago by
- Reviewers changed from Simon King, Jean-Pierre Flori, John Perry to Simon King, Jean-Pierre Flori, John Perry, Nils Bruin
- Status changed from needs_review to positive_review
Speed regression mitigation is definitely going onto another ticket; #13400 I'd say. The consensus here is clearly that reasonable memory use trumps a (relatively small) speed penalty. I'm not so sure that the doctests are such a good measure fore "performance" anyway. They mainly measure construction of tiny examples, so tend to concentrate on overhead that is fairly irrelevant (because not in inner loops) in practice. I'm not surprised we're not doing so well in that park, but we already have some small improvements on future tickets (although I bet the main penalty is coming from NOT having things cached anymore sometimes).
Anyway, the code here (and on the dependencies of this ticket!) has been discussed and vetted to death already. Bot's complaining about a bit of whitespace, which should be trivial to fix. In anticipation of that getting removed:
POSITIVE REVIEW!!'''
Congratulations, to all who participated and especially Simon for an incredible diligence and calm in pushing through this at times seemingly hopeless process.
There are a lot of people in the review field of this ticket. If any of them dissent with the status, they can take action.
comment:249 in reply to: ↑ 246 ; follow-up: ↓ 250 Changed 5 years ago by
Replying to SimonKing:
EDIT: Plus the complication about
GroebnerStrategy._strat.generators
andReductionStrategy._strat
. I suspect this whole bit can be cleaned up without resorting toborrowed
flags but this is not the place.Well, but how, without opening the polybori spkg?
Thanks for pointing out this issue. But it is not related to the spkg. In the spkg you have a composed C++ class with public attributes, that is managed with C++ constructors and destructors.
But cython seems not to be able to wrap the c++ attributes (unlike PolyBoRi
's original interface based on boost/python). The only solution I can think of is the a flag, which was exactly what this ticket came up with. Alternatively a reference count. (I think boost/python uses one or the other internally, too.)
comment:250 in reply to: ↑ 249 ; follow-up: ↓ 251 Changed 5 years ago by
Replying to AlexanderDreyer:
... In the spkg you have a composed C++ class with public attributes, that is managed with C++ constructors and destructors. But cython seems not to be able to wrap the c++ attributes (unlike
PolyBoRi
's original interface based on boost/python). The only solution I can think of is the a flag, which was exactly what this ticket came up with. Alternatively a reference count. (I think boost/python uses one or the other internally, too.)
Ah, so the more general solution would be a little class StratGuard
whose instances shepherd the _strat
objects that GroebnerStrategy._strat.generators
and ReductionStrategy._strat
refer to. Its __dealloc__
would be the only one that calls the C++ delete. Python's own refcounting would do the rest. Whether the actual access to the pointer happens through the StratGuard
object or through a separately held pointer (for speed) is a design choice. If this is the only use case, though, keeping a flag might be a more light-weight solution.
comment:251 in reply to: ↑ 250 Changed 5 years ago by
Replying to nbruin:
Ah, so the more general solution would be a little class
StratGuard
whose instances shepherd the_strat
objects thatGroebnerStrategy._strat.generators
andReductionStrategy._strat
refer to. Its__dealloc__
would be the only one that calls the C++ delete. Python's own refcounting would do the rest. Whether the actual access to the pointer happens through theStratGuard
object or through a separately held pointer (for speed) is a design choice. If this is the only use case, though, keeping a flag might be a more light-weight solution.
Indeed, I was even thinking in the very same direction, when I read about the issue. But the current solution causes less overhead and avoids a one-time-use class.
comment:252 Changed 5 years ago by
- Milestone changed from sage-5.3 to sage-5.4
comment:253 Changed 5 years ago by
- Milestone changed from sage-5.4 to sage-pending
comment:254 Changed 5 years ago by
- Dependencies changed from #11521, #11599, #12969, #12215 to #715, #11521, #12215
comment:255 Changed 5 years ago by
- Dependencies changed from #715, #11521, #12215 to #715, #11521, #13447, #12215
comment:256 follow-up: ↓ 258 Changed 5 years ago by
Is #13447 really needed as a dependency? From what I remember, #715 had to permanently keep libsingular rings alive because of the bsd.math bug. And #13447 intends to replace this workaround by a proper fix. I've not followed the discussion at #13447, but it seems to me that #12313 is needed by #13447 to get proper garbage collection, but it does not seem obvious that #12313 depends on #13447. Is there some rebase needed? Did I miss some other step?
comment:257 Changed 5 years ago by
I'll give Jeroen a chance to motivate the change in dependencies, but I think it shouldn't be there. The positive review (and all the patches to be applied) predate the existence of #13447, so they can't possibly depend on it.
Any issues that might require #13447 should be resolved with the strong caching workaround introduced in #715.
If the patches here were to require #13447 to function properly then it should be marked as "only merge together with"; not as a merge dependency (which would suggest the wrong merge order). Do we want to add "post merge dependencies" to trac?
comment:258 in reply to: ↑ 256 ; follow-up: ↓ 259 Changed 5 years ago by
- Dependencies changed from #715, #11521, #13447, #12215 to #715, #11521, #12215
I only made the dependencies consistent. The problem was that the dependencies of #12215 were wrong. I hope all dependencies are correct now.
comment:259 in reply to: ↑ 258 Changed 5 years ago by
Replying to jdemeyer:
I only made the dependencies consistent. The problem was that the dependencies of #12215 were wrong. I hope all dependencies are correct now.
Doesn't the patch bot do transitive dependency resolution anyway? I'd say one should only provide a minimal set of dependencies, unless there is some conceptual reason to make an indirect dependency explicit. Is there a reason to do otherwise?
comment:260 Changed 5 years ago by
- Milestone changed from sage-pending to sage-5.5
comment:261 Changed 5 years ago by
- Milestone changed from sage-5.5 to sage-5.6
comment:262 Changed 4 years ago by
- Dependencies changed from #715, #11521, #12215 to #715, #11521, #12215, #13746
- Description modified (diff)
I've put the pbori stuff at #13746 and made it a dependency here.
comment:263 Changed 4 years ago by
- Milestone changed from sage-5.6 to sage-pending
comment:264 Changed 4 years ago by
comment:265 Changed 4 years ago by
patchbot
Apply trac_12313-mono_dict-combined-random-sk.patch
comment:266 Changed 4 years ago by
Unfortunately, doctests don't all succeed for me. I've built 5.6rc0 from scratch and applied #12215 and #12313. I get a SIGFPE in
./sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/heegner.py
gdb:
Program received signal SIGFPE, Arithmetic exception. __pyx_pf_4sage_9structure_11coerce_dict_16TripleDictEraser_2__call__ (__pyx_v_r=<optimized out>, __pyx_v_self=<optimized out>) at sage/structure/coerce_dict.c:1107 1107 __pyx_t_10 = PyList_GET_ITEM(__pyx_t_1, (__pyx_v_h % PyList_GET_SIZE(__pyx_t_4))); (gdb) l 1102 */ 1103 __pyx_t_1 = __pyx_v_self->D->buckets; 1104 __Pyx_INCREF(__pyx_t_1); 1105 __pyx_t_4 = __pyx_v_self->D->buckets; 1106 __Pyx_INCREF(__pyx_t_4); 1107 __pyx_t_10 = PyList_GET_ITEM(__pyx_t_1, (__pyx_v_h % PyList_GET_SIZE(__pyx_t_4))); (gdb) p PyString_AsString(PyObject_Repr(__pyx_t_4)) $4 = 0xd5090e4 "None"
So at this point self.D.buckets==None
. Doing list-type transactions using non-error checking macros is probably not a good idea. This code corresponds to structure/coerce_dict.pyx:121
. Apparently, a callback is happening while the dictionary itself is already being torn down!
A traceback indeed shows that this is all happening as part of a garbage collection of a TripleDict? (I think that's what frame 15 indicates),ironically, triggered by the allocation of a new TripleDict?. I'm including rather a lot of the backtrace, because frame 100 is already in the sage library: Apparently, some operations easily trigger 100 level deep stacks!
(gdb) bt #0 __pyx_pf_4sage_9structure_11coerce_dict_16TripleDictEraser_2__call__ (__pyx_v_r=<optimized out>, __pyx_v_self=<optimized out>) at sage/structure/coerce_dict.c:1107 #1 __pyx_pw_4sage_9structure_11coerce_dict_16TripleDictEraser_3__call__ (__pyx_v_self=0xb4f17c0, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at sage/structure/coerce_dict.c:966 #2 0x00007ffff7c6f403 in PyObject_Call (func=0xb4f17c0, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #3 0x00007ffff7c6fcf0 in PyObject_CallFunctionObjArgs (callable=0xb4f17c0) at Objects/abstract.c:2760 #4 0x00007ffff7cd7b91 in handle_callback (callback=0xb4f17c0, ref=<optimized out>) at Objects/weakrefobject.c:881 #5 PyObject_ClearWeakRefs (object=<optimized out>) at Objects/weakrefobject.c:965 #6 0x00007fffea42e0e8 in __pyx_tp_dealloc_4sage_9structure_15category_object_CategoryObject (o=0x7843e30) at sage/structure/category_object.c:8990 #7 0x00007ffff7cc7b76 in subtype_dealloc (self=0x7843e30) at Objects/typeobject.c:1014 #8 0x00007ffff7cc4fa3 in tupledealloc (op=0xc1518d0) at Objects/tupleobject.c:220 #9 0x00007ffff7cc4fa3 in tupledealloc (op=0xc455c30) at Objects/tupleobject.c:220 #10 0x00007ffff7ca905f in dict_dealloc (mp=0x7856310) at Objects/dictobject.c:985 #11 0x00007ffff7cc7bc4 in subtype_dealloc (self=0xbdf2f10) at Objects/typeobject.c:999 #12 0x00007fffe90f8cc7 in __pyx_tp_dealloc_4sage_10categories_7functor_Functor (o=0xa7e09d0) at sage/categories/functor.c:3209 #13 0x00007ffff7c9bb5a in list_dealloc (op=0xbbede60) at Objects/listobject.c:309 #14 0x00007ffff7c9bb5a in list_dealloc (op=0xb771170) at Objects/listobject.c:309 #15 0x00007fffe9dee8df in __pyx_tp_clear_4sage_9structure_11coerce_dict_TripleDict (o=0xb8d73b0) at sage/structure/coerce_dict.c:5921 #16 0x00007ffff7d4b637 in delete_garbage (old=0x7ffff7fe19e0, collectable=0x7fffffff80f0) at Modules/gcmodule.c:769 #17 collect (generation=2) at Modules/gcmodule.c:930 #18 0x00007ffff7d4c1a8 in collect_generations () at Modules/gcmodule.c:996 #19 _PyObject_GC_Malloc (basicsize=<optimized out>) at Modules/gcmodule.c:1457 #20 _PyObject_GC_Malloc (basicsize=<optimized out>) at Modules/gcmodule.c:1439 #21 0x00007ffff7cc7c19 in PyType_GenericAlloc (type=0x7fffea004d00, nitems=0) at Objects/typeobject.c:753 #22 0x00007fffe9dee7ef in __pyx_tp_new_4sage_9structure_11coerce_dict_TripleDict (t=<optimized out>, a=<optimized out>, k=<optimized out>) at sage/structure/coerce_dict.c:5881 #23 0x00007ffff7ccbd83 in type_call (type=0x7fffea004d00, args=0xf507d0, kwds=0x0) at Objects/typeobject.c:721 #24 0x00007ffff7c6f403 in PyObject_Call (func=0x7fffea004d00, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #25 0x00007fffea65da43 in __pyx_f_4sage_9structure_6parent_6Parent_init_coerce (__pyx_v_self=0x5083a80, __pyx_optional_args=<optimized out>) at sage/structure/parent.c:5765 #26 0x00007fffea8a00ff in __pyx_f_4sage_9structure_10parent_old_6Parent_init_coerce (__pyx_v_self=<optimized out>, __pyx_optional_args=<optimized out>) at sage/structure/parent_old.c:1634 #27 0x00007fffea8a6cc4 in __pyx_pf_4sage_9structure_10parent_old_6Parent___init__ (__pyx_v_category=0x7ffff7fc6840, __pyx_v_embeddings=0x1038908, __pyx_v_actions=0x10388c0, __pyx_v_coerce_from=0x10381b8, __pyx_v_self=0x5083a80) at sage/structure/parent_old.c:1329 #28 __pyx_pw_4sage_9structure_10parent_old_6Parent_1__init__ (__pyx_v_self=0x5083a80, __pyx_args=<optimized out>, __pyx_kwds=0x508bf00) at sage/structure/parent_old.c:1297 #29 0x00007ffff7cc664c in wrap_init (self=<optimized out>, args=<optimized out>, wrapped=<optimized out>, kwds=<optimized out>) at Objects/typeobject.c:4719 #30 0x00007ffff7c6f403 in PyObject_Call (func=0xd30f710, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #31 0x00007ffff7d0cc97 in PyEval_CallObjectWithKeywords (func=0xd30f710, arg=0x7ffff7bbb050, kw=<optimized out>) at Python/ceval.c:3890 #32 0x00007ffff7c86e69 in wrapperdescr_call (descr=<optimized out>, args=0x7ffff7bbb050, kwds=0x508bf00) at Objects/descrobject.c:306 #33 0x00007ffff7c6f403 in PyObject_Call (func=0xfc0960, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #34 0x00007fffeaab5269 in __pyx_pf_4sage_9structure_11parent_base_14ParentWithBase___init__ ( __pyx_v_category=0x7ffff7fc6840, __pyx_v_embeddings=0x1038908, __pyx_v_actions=0x10388c0, __pyx_v_coerce_from=0x10381b8, __pyx_v_base=0x8593c60, __pyx_v_self=0x5083a80) at sage/structure/parent_base.c:1258 ---Type <return> to continue, or q <return> to quit--- #35 __pyx_pw_4sage_9structure_11parent_base_14ParentWithBase_1__init__ (__pyx_v_self=0x5083a80, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at sage/structure/parent_base.c:1221 #36 0x00007ffff7cc664c in wrap_init (self=<optimized out>, args=<optimized out>, wrapped=<optimized out>, kwds=<optimized out>) at Objects/typeobject.c:4719 #37 0x00007ffff7c6f403 in PyObject_Call (func=0xd30f6d0, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #38 0x00007ffff7d0cc97 in PyEval_CallObjectWithKeywords (func=0xd30f6d0, arg=0xd2f7d90, kw=<optimized out>) at Python/ceval.c:3890 #39 0x00007ffff7c86e69 in wrapperdescr_call (descr=<optimized out>, args=0xd2f7d90, kwds=0x508bde0) at Objects/descrobject.c:306 #40 0x00007ffff7c6f403 in PyObject_Call (func=0x1033320, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #41 0x00007fffe746a727 in __pyx_pf_4sage_9structure_11parent_gens_14ParentWithGens___init__ ( __pyx_v_category=0x7ffff7fc6840, __pyx_v_normalize=<optimized out>, __pyx_v_names=0x7ffff7fc6840, __pyx_v_base=0x8593c60, __pyx_v_self=0x5083a80) at sage/structure/parent_gens.c:2781 #42 __pyx_pw_4sage_9structure_11parent_gens_14ParentWithGens_1__init__ (__pyx_v_self=0x5083a80, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at sage/structure/parent_gens.c:2697 #43 0x00007ffff7cc664c in wrap_init (self=<optimized out>, args=<optimized out>, wrapped=<optimized out>, kwds=<optimized out>) at Objects/typeobject.c:4719 #44 0x00007ffff7c6f403 in PyObject_Call (func=0xd29f0d0, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #45 0x00007ffff7d0cc97 in PyEval_CallObjectWithKeywords (func=0xd29f0d0, arg=0xd2a2290, kw=<optimized out>) at Python/ceval.c:3890 #46 0x00007ffff7c86e69 in wrapperdescr_call (descr=<optimized out>, args=0xd2a2290, kwds=0x0) at Objects/descrobject.c:306 #47 0x00007ffff7c6f403 in PyObject_Call (func=0x1033cd0, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #48 0x00007ffff7d1124d in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fffffff8830, func=0x1033cd0) at Python/ceval.c:4239 #49 call_function (oparg=<optimized out>, pp_stack=0x7fffffff8830) at Python/ceval.c:4044 #50 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666 #51 0x00007ffff7d14275 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=5, kws=0x7ffff7bbb068, kwcount=0, defs=0x1c61bf0, defcount=2, closure=0x0) at Python/ceval.c:3253 #52 0x00007ffff7c97273 in function_call (func=0x1db1488, arg=0xd1b27d0, kw=0x50af1a0) at Objects/funcobject.c:526 #53 0x00007ffff7c6f403 in PyObject_Call (func=0x1db1488, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #54 0x00007ffff7c7cb4f in instancemethod_call (func=0x1db1488, arg=0xd1b27d0, kw=0x50af1a0) at Objects/classobject.c:2578 #55 0x00007ffff7c6f403 in PyObject_Call (func=0xd2d7230, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #56 0x00007ffff7ccffd0 in slot_tp_init (self=0x5083a80, args=0xd1de100, kwds=0x50af1a0) at Objects/typeobject.c:5663 #57 0x00007ffff7ccbdc8 in type_call (type=<optimized out>, args=0xd1de100, kwds=0x50af1a0) at Objects/typeobject.c:737 #58 0x00007fffed3f5766 in __pyx_pf_4sage_4misc_19classcall_metaclass_typecall (__pyx_v_opts=0x50af1a0, __pyx_v_args=0xd1de100, __pyx_v_cls=0x1ee43a0, __pyx_self=<optimized out>) at sage/misc/classcall_metaclass.c:1294 #59 __pyx_pw_4sage_4misc_19classcall_metaclass_1typecall (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at sage/misc/classcall_metaclass.c:1257 #60 0x00007ffff7d12198 in ext_do_call (nk=0, na=<optimized out>, flags=<optimized out>, pp_stack=0x7fffffff8e60, func=0x995c20) at Python/ceval.c:4331 #61 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2705 #62 0x00007ffff7d14275 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=5, kws=0x7ffff7bbb068, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253 #63 0x00007ffff7c97273 in function_call (func=0x99d050, arg=0xd1d4710, kw=0x6004d00) at Objects/funcobject.c:526 #64 0x00007ffff7c6f403 in PyObject_Call (func=0x99d050, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 ---Type <return> to continue, or q <return> to quit--- #65 0x00007fffed816e8d in __pyx_pf_4sage_4misc_9cachefunc_18WeakCachedFunction_2__call__ (__pyx_v_kwds=0x6004d00, __pyx_v_args=0xd1d4710, __pyx_v_self=0x919f30) at sage/misc/cachefunc.c:5080 #66 __pyx_pw_4sage_4misc_9cachefunc_18WeakCachedFunction_3__call__ (__pyx_v_self=0x919f30, __pyx_args=0xd1d4710, __pyx_kwds=<optimized out>) at sage/misc/cachefunc.c:4888 #67 0x00007ffff7c6f403 in PyObject_Call (func=0x919f30, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #68 0x00007ffff7d1124d in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fffffff9220, func=0x919f30) at Python/ceval.c:4239 #69 call_function (oparg=<optimized out>, pp_stack=0x7fffffff9220) at Python/ceval.c:4044 #70 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666 #71 0x00007ffff7d14275 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=4, kws=0xd2ff608, kwcount=1, defs=0x1bf4968, defcount=2, closure=0x0) at Python/ceval.c:3253 #72 0x00007ffff7c97273 in function_call (func=0x1db1140, arg=0xd2caaa0, kw=0x6456560) at Objects/funcobject.c:526 #73 0x00007ffff7c6f403 in PyObject_Call (func=0x1db1140, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #74 0x00007fffed3f4a13 in __pyx_pf_4sage_4misc_19classcall_metaclass_18ClasscallMetaclass_2__call__ ( __pyx_v_opts=0x6456560, __pyx_v_args=0xd2a4690, __pyx_v_cls=0x1ee43a0) at sage/misc/classcall_metaclass.c:932 #75 __pyx_pw_4sage_4misc_19classcall_metaclass_18ClasscallMetaclass_3__call__ (__pyx_v_cls=0x1ee43a0, __pyx_args=0xd2a4690, __pyx_kwds=<optimized out>) at sage/misc/classcall_metaclass.c:872 #76 0x00007ffff7c6f403 in PyObject_Call (func=0x1ee43a0, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #77 0x00007fffd42b707d in __pyx_f_4sage_6matrix_20matrix_integer_dense_20Matrix_integer_dense__mod_int_c ( __pyx_v_self=0xd2725f0, __pyx_v_p=19) at sage/matrix/matrix_integer_dense.c:13674 #78 0x00007fffd429678a in __pyx_pf_4sage_6matrix_20matrix_integer_dense_20Matrix_integer_dense_46_mod_int ( __pyx_v_modulus=0xbbe8390, __pyx_v_self=0xd2725f0) at sage/matrix/matrix_integer_dense.c:13277 #79 __pyx_pw_4sage_6matrix_20matrix_integer_dense_20Matrix_integer_dense_47_mod_int (__pyx_v_self=0xd2725f0, __pyx_v_modulus=0xbbe8390) at sage/matrix/matrix_integer_dense.c:13196 #80 0x00007ffff7c6f403 in PyObject_Call (func=0xd2967a0, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #81 0x00007fffd3c13807 in __pyx_pf_4sage_6matrix_21matrix_rational_dense_21Matrix_rational_dense_56change_ring ( __pyx_v_R=<optimized out>, __pyx_v_self=0xd302398) at sage/matrix/matrix_rational_dense.c:14605 #82 __pyx_pw_4sage_6matrix_21matrix_rational_dense_21Matrix_rational_dense_57change_ring (__pyx_v_self=0xd302398, __pyx_v_R=<optimized out>) at sage/matrix/matrix_rational_dense.c:14120 #83 0x00007ffff7d13158 in call_function (oparg=<optimized out>, pp_stack=0x7fffffff9710) at Python/ceval.c:4009 #84 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666 #85 0x00007ffff7d133cb in fast_function (nk=<optimized out>, na=2, n=<optimized out>, pp_stack=0x7fffffff9870, func=0x195a398) at Python/ceval.c:4107 #86 call_function (oparg=<optimized out>, pp_stack=0x7fffffff9870) at Python/ceval.c:4042 #87 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666 #88 0x00007ffff7d14275 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253 #89 0x00007ffff7c9717c in function_call (func=0x195a230, arg=0xd3122d8, kw=0x0) at Objects/funcobject.c:526 #90 0x00007ffff7c6f403 in PyObject_Call (func=0x195a230, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #91 0x00007ffff7c7cb4f in instancemethod_call (func=0x195a230, arg=0xd3122d8, kw=0x0) at Objects/classobject.c:2578 #92 0x00007ffff7c6f403 in PyObject_Call (func=0xd2a49b0, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #93 0x00007fffd62ed50b in __pyx_pf_4sage_5rings_13residue_field_19ResidueFieldFactory_2create_object ( __pyx_v_key=<optimized out>, __pyx_self=<optimized out>, __pyx_v_self=<optimized out>, __pyx_v_version=<optimized out>, __pyx_v_kwds=<optimized out>) at sage/rings/residue_field.c:5002 #94 __pyx_pw_4sage_5rings_13residue_field_19ResidueFieldFactory_3create_object (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at sage/rings/residue_field.c:4295 #95 0x00007ffff7c6f403 in PyObject_Call (func=0x1c8c1a0, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 ---Type <return> to continue, or q <return> to quit--- #96 0x00007ffff7c7cb4f in instancemethod_call (func=0x1c8c1a0, arg=0xd2a4370, kw=0x64430f0) at Objects/classobject.c:2578 #97 0x00007ffff7c6f403 in PyObject_Call (func=0xd2faa50, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #98 0x00007fffdec2d5d6 in __pyx_f_4sage_9structure_7factory_13UniqueFactory_get_object (__pyx_v_self=0x1c8e0a8, __pyx_v_version=0x162df50, __pyx_v_key=0xd0378c0, __pyx_v_extra_args=0x64430f0, __pyx_skip_dispatch=<optimized out>) at sage/structure/factory.c:1311 #99 0x00007fffdec2cb70 in __pyx_pf_4sage_9structure_7factory_13UniqueFactory_4__call__ (__pyx_v_kwds=0x64430f0, __pyx_v_args=0xd30f1d0, __pyx_v_self=0x1c8e0a8) at sage/structure/factory.c:1119 #100 __pyx_pw_4sage_9structure_7factory_13UniqueFactory_5__call__ (__pyx_v_self=0x1c8e0a8, __pyx_args=0xd30f1d0, __pyx_kwds=<optimized out>) at sage/structure/factory.c:987 #101 0x00007ffff7c6f403 in PyObject_Call (func=0x1c8e0a8, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2529 #102 0x00007ffff7d1124d in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fffffffa360, func=0x1c8e0a8) at Python/ceval.c:4239
Anyway, I had originally thought that callbacks in these situations would be collected and discarded, but apparently we have a scenario here that's not detected by the python GC. If my diagnosis is correct, we should not assume that attributes on self.D
are still valid in TripleDictEraser
, which surprises me a bit. It's easily checked of course: If self.D.buckets
is None
, we don't have to do anything. Relevant code:
static int __pyx_tp_clear_4sage_9structure_11coerce_dict_TripleDict(PyObject *o) { struct __pyx_obj_4sage_9structure_11coerce_dict_TripleDict *p = (struct __pyx_obj_4sage_9structure_11coerce_dict_TripleDict *)o; PyObject* tmp; tmp = ((PyObject*)p->buckets); p->buckets = Py_None; Py_INCREF(Py_None); Py_XDECREF(tmp); tmp = ((PyObject*)p->_refcache); p->_refcache = ((PyObject*)Py_None); Py_INCREF(Py_None); Py_XDECREF(tmp); tmp = ((PyObject*)p->eraser); p->eraser = ((struct __pyx_obj_4sage_9structure_11coerce_dict_TripleDictEraser *)Py_None); Py_INCREF(Py_None); Py_XDECREF(tmp); return 0; }
so indeed, D.buckets
is set to None
before D.buckets
is deleted. I'm surprised that this deletion (of something that is garbage, so the eraser, which has an indirect strong link to it, must be garbage as well) apparently can still trigger a callback. It wouldn't surprise me if we could engineer this into something that would be considered a bug in Python. Anyway, to be safe, check that self.D.buckets is still alive before we do something with it! (either that or my sage/python build is seriously screwed and I'm seeing something that shouldn't happen).
comment:267 Changed 4 years ago by
- Status changed from positive_review to needs_work
Attached patch seems to fix the problem for me. My take is self.D
can't be None
. You can't go calling self.__call__()
if self
has been mutilated! On the other hand, it seems that in rare circumstances it can be that self.D.buckets
is already erased, because D
is in the process of being torn down. In that case, the callback is not necessary anyway, so we just return.
I think we just have to live with this: cycle breaking can be arbitrarily complicated, so we cannot expect Python to fully protect against it. Any routine involved in deallocation has to be prepared to find attributes already cleared.
Exercise: prove or disprove: determining whether a callback triggered by a GC needs to be executed is NP-hard.
comment:268 follow-up: ↓ 269 Changed 4 years ago by
- Can we be sure that
self.D
is not None when theTripleDictEraser
is called? Scenario: TheTripleDict
used to contain an object O. For some reason, theTripleDict
is part of a collectable reference cycle (hence, the strong reference held by theTripleDictEraser
would not prevent deletion of theTripleDict
), and gc decided to delete it first. Hence, when calling the eraser, theTripleDict
itself and not just its buckets is already gone. - Shouldn't the same be done with the
MonoDictEraser
?
comment:269 in reply to: ↑ 268 ; follow-up: ↓ 270 Changed 4 years ago by
Replying to SimonKing:
- Can we be sure that
self.D
is not None when theTripleDictEraser
is called?
I think so. Setting self.D=None
must be part of a dealloc on the TripleDictEraser
itself. I don't see how Python could consider it not a bug to call a __call__
method that is in the process of being dealloc-ed (see my remark)
... and gc decided to delete it first. Hence, when calling the eraser, the
TripleDict
itself and not just its buckets is already gone.
That's why the Python GC does take some efforts to avoid calling avoidable weakref callbacks.
- Shouldn't the same be done with the
MonoDictEraser
?
I figured this was probably triggered by an interaction of a non-weakreffable key component holding a reference to a weakreffable key component, which doesn't happen in MonoDict
. On the other hand, objects in MonoDict
could have a TripleDict
on an attribute somewhere and I don't immediately see a proof that such a construct could still lead to this kind of nasty cycle (especially because I don't quite understand the nature of the loop here that makes python execute this unnecessary callback. Apparently Python doesn't purge all garbage callbacks before doing any deletions on cyclic trash)
comment:270 in reply to: ↑ 269 Changed 4 years ago by
Replying to nbruin:
Replying to SimonKing:
- Can we be sure that
self.D
is not None when theTripleDictEraser
is called?I think so. Setting
self.D=None
must be part of a dealloc on theTripleDictEraser
itself.
Yes, right. So, the only problem is when we access an attribute of the TripleDict
that is attribute of the TripleDictEraser
, because the TripleDict
could already be destroyed at that point.
- Shouldn't the same be done with the
MonoDictEraser
?I figured this was probably triggered by an interaction of a non-weakreffable key component holding a reference to a weakreffable key component, which doesn't happen in
MonoDict
. On the other hand, objects inMonoDict
could have aTripleDict
on an attribute somewhere and I don't immediately see a proof that such a construct could still lead to this kind of nasty cycle (especially because I don't quite understand the nature of the loop here that makes python execute this unnecessary callback. Apparently Python doesn't purge all garbage callbacks before doing any deletions on cyclic trash)
I'd feel safer to test for the bucket being not None also in the case of MonoDict
. I'll submit a patch after lunch.
comment:271 Changed 4 years ago by
- Description modified (diff)
- Status changed from needs_work to needs_review
Here is the announced patch.
Apply trac_12313-mono_dict-combined-random-sk.patch trac_12313-safe_callback.patch
comment:272 Changed 4 years ago by
- Status changed from needs_review to positive_review
Works for me! By the way, as you can see from the patch I attached earlier, something must have gone wrong with applying the patches when I tested it. In the current version, *Eraser
has a weak reference to its dict, and weakrefs normally get cleared first, before any attributes get destroyed. So before D.buckets
becomes None
, we would already have self.D() == None
. Therefore I'd be surprised if this latest patch is actually necessary. Anyway, given how tricky these deallocators are, I have no problem with a little extra (perhaps superfluous) tests.
We have a future ticket to reorganize/optimize these dictionaries anyway. Let's get this in!
comment:273 Changed 4 years ago by
- Dependencies changed from #715, #11521, #12215, #13746 to #715, #11521, #12215, #13746, to be merged with #13378
Thank you! I'd like to see this merged together with #13378 - namely, #12313 introduces the problem that the repeated use of ZZ.ideal(5)
accumulates useless data (i.e., the information that there is no coercion map that has 5 as domain) in some coercion dict and makes the command incrementally slower. And this problem is fixed in #13378.
comment:274 Changed 4 years ago by
- Dependencies changed from #715, #11521, #12215, #13746, to be merged with #13378 to #715, #11521, #12215, #13746, #13378
comment:275 Changed 4 years ago by
- Milestone changed from sage-pending to sage-5.7
comment:276 follow-up: ↓ 277 Changed 4 years ago by
- Status changed from positive_review to needs_work
Unfortunately, the combination #13378 + #12215 + #12313 gives a serious slowdown for
from sage.combinat.sf.k_dual import DualkSchurFunctions Sym = SymmetricFunctions(QQ['t'].fraction_field()) dks4 = DualkSchurFunctions(Sym.kBoundedQuotient(4)) time TestSuite(dks4).run()
from 8 seconds before the patches to 90 seconds with the patches.
Just #13378 alone is fine.
comment:277 in reply to: ↑ 276 Changed 4 years ago by
comment:278 Changed 4 years ago by
The patch causing the slowdown is trac_12313-mono_dict-combined-random-sk.patch
In other words: #12215 is fine.
comment:279 follow-up: ↓ 281 Changed 4 years ago by
Here is the outcome of
sage: from sage.combinat.sf.k_dual import DualkSchurFunctions sage: Sym = SymmetricFunctions(QQ['t'].fraction_field()) sage: dks4 = DualkSchurFunctions(Sym.kBoundedQuotient(4)) sage: %prun TestSuite(dks4).run()
First, sage-5.6.rc0 with #12215:
9365 7.254 0.001 11.881 0.001 {method 'subs' of 'sage.structure.element.Element' objects} 30786 2.596 0.000 3.067 0.000 fraction_field.py:271(<lambda>) 5358/84 1.781 0.000 26.087 0.311 {sage.combinat.dict_addition.dict_linear_combination} 1778 1.295 0.001 13.615 0.008 sfa.py:988(_from_cache) 54541 1.061 0.000 2.015 0.000 infinity.py:883(_element_constructor_) 456782 1.051 0.000 1.051 0.000 {isinstance} 27795 0.815 0.000 1.931 0.000 partition.py:3510(Partitions) 124520 0.754 0.000 0.754 0.000 fraction_field.py:430(ring) 1715 0.518 0.000 2.106 0.001 integer_list.py:192(rightmost_pivot) 3951 0.433 0.000 0.547 0.000 free_module.py:411(_coefficient_fast) 4015 0.416 0.000 1.096 0.000 polynomial_ring.py:290(_element_constructor_) 54541 0.352 0.000 0.352 0.000 infinity.py:942(__init__) 2236 0.348 0.000 0.448 0.000 free_module.py:562(support) 80417 0.347 0.000 0.347 0.000 combinat.py:840(__hash__) 37514 0.340 0.000 0.340 0.000 {method 'sub' of '_sre.SRE_Pattern' objects} 37514 0.275 0.000 0.938 0.000 re.py:144(sub) 51922 0.275 0.000 0.275 0.000 partition.py:4003(__repr__) 19970 0.262 0.000 2.203 0.000 partition.py:2716(parent) 16195 0.259 0.000 0.515 0.000 free_module.py:2142(_from_dict) 37514 0.244 0.000 0.322 0.000 re.py:226(_compile) 4015 0.238 0.000 1.933 0.000 fraction_field.py:456(_element_constructor_) 39940 0.232 0.000 0.580 0.000 {repr} 9365 0.230 0.000 0.292 0.000 fraction_field.py:532(gen) 35636 0.186 0.000 0.305 0.000 infinity.py:285(__cmp__) 1715 0.186 0.000 0.634 0.000 integer_list.py:38(first)
Second, adding #13378
9365 7.525 0.001 12.378 0.001 {method 'subs' of 'sage.structure.element.Element' objects} 30786 2.666 0.000 3.166 0.000 fraction_field.py:271(<lambda>) 5363/85 1.908 0.000 29.846 0.351 {sage.combinat.dict_addition.dict_linear_combination} 1778 1.340 0.001 15.319 0.009 sfa.py:988(_from_cache) 494409 1.269 0.000 1.269 0.000 {isinstance} 55877 1.196 0.000 2.254 0.000 infinity.py:883(_element_constructor_) 27906 0.876 0.000 2.117 0.000 partition.py:3510(Partitions) 127929 0.809 0.000 0.809 0.000 fraction_field.py:430(ring) 1855 0.561 0.000 2.359 0.001 integer_list.py:192(rightmost_pivot) 4248/4191 0.493 0.000 1.256 0.000 polynomial_ring.py:290(_element_constructor_) 4070 0.461 0.000 0.591 0.000 free_module.py:411(_coefficient_fast) 82386 0.397 0.000 0.397 0.000 combinat.py:840(__hash__) 2236 0.374 0.000 0.498 0.000 free_module.py:562(support) 55877 0.372 0.000 0.372 0.000 infinity.py:942(__init__) 37516 0.367 0.000 0.369 0.000 {method 'sub' of '_sre.SRE_Pattern' objects} 46893 0.326 0.000 0.764 0.000 {repr} 56811 0.318 0.000 0.318 0.000 partition.py:4003(__repr__) 37514 0.304 0.000 1.020 0.000 re.py:144(sub) 16322 0.301 0.000 0.601 0.000 free_module.py:2142(_from_dict) 19970 0.290 0.000 2.418 0.000 partition.py:2716(parent) 37517 0.265 0.000 0.362 0.000 re.py:226(_compile) 4015 0.261 0.000 2.149 0.001 fraction_field.py:456(_element_constructor_) 9426 0.239 0.000 0.312 0.000 fraction_field.py:532(gen) 17385 0.223 0.000 0.365 0.000 combinat.py:667(__init__) 36359 0.216 0.000 0.349 0.000 infinity.py:285(__cmp__)
And finally adding #12313:
952735/392465 59.690 0.000 204.150 0.001 {method 'has_coerce_map_from' of 'sage.structure.parent.Parent' objects} 3818389 44.077 0.000 91.652 0.000 {repr} 1626 27.430 0.017 243.763 0.150 k_dual.py:1277(<genexpr>) 475630 24.414 0.000 71.594 0.000 {method 'coerce_map_from' of 'sage.structure.parent.Parent' objects} 471632 23.574 0.000 204.044 0.000 fraction_field.py:194(_coerce_map_from_) 1909173 15.683 0.000 111.239 0.000 combinat.py:1051(__cmp__) 1425904 15.023 0.000 23.084 0.000 polynomial_ring.py:1369(_repr_) 475629/475628 13.746 0.000 90.469 0.000 polynomial_ring.py:468(_coerce_map_from_) 3286022 12.918 0.000 17.094 0.000 unique_representation.py:531(__hash__) 558 11.842 0.021 73.743 0.132 k_dual.py:1234(<genexpr>) 475332 11.785 0.000 19.538 0.000 fraction_field.py:380(_repr_) 1925108 11.632 0.000 11.632 0.000 partition.py:4003(__repr__) 1439041 8.110 0.000 8.110 0.000 polynomial_ring.py:716(__hash__) 1425904 8.061 0.000 8.061 0.000 polynomial_ring.py:725(_repr_) 9365 7.681 0.001 12.918 0.001 {method 'subs' of 'sage.structure.element.Element' objects} 16080 7.001 0.000 9.528 0.001 homset.py:353(__init__) 1934768 5.774 0.000 5.774 0.000 {isinstance} 564 5.764 0.010 33.683 0.060 k_dual.py:1110(<genexpr>) 3334205 4.241 0.000 4.241 0.000 {id} 1927477 3.937 0.000 3.937 0.000 {cmp} 1450944 3.416 0.000 3.416 0.000 rational_field.py:217(__hash__) 479620 3.123 0.000 24.605 0.000 rational_field.py:261(_coerce_map_from_) 30786 3.106 0.000 3.593 0.000 fraction_field.py:271(<lambda>) 16060 2.354 0.000 13.278 0.001 homset.py:80(Hom) 315 2.335 0.007 50.641 0.161 k_dual.py:1076(retract)
I'm afraid I have to leave now. But the fact that repr is called so often makes me guess that there is indeed caching involved - namely with a hash function that relies on the string representation. But if that was correct, neither MonoDict
nor TripleDict
are to blame: They compare by identity and thus don't rely on the hash of the keys.
comment:280 Changed 4 years ago by
PS: Also I don't like that so many homsets become involved. I wonder how many of them are distinct.
comment:281 in reply to: ↑ 279 ; follow-up: ↓ 282 Changed 4 years ago by
We've seen things like this before. This is exactly what one would see from code that greatly benefited from rings (and hence their coercions!) being nailed in memory. It's worth investigating whether there are some obvious optimizations to make.
However, aren't testsuites
made from automatic tests? In that case wouldn't it be a very atypical mess of tiny little self-contained constructions to test, e.g., commutativity, property of identity etc? We shouldn't be optimizing code for those.
Otherwise just stick a strong cache in DualSchurFunctions
somewhere. That's such a specialized component that the memory leaking will likely only affect those who know enough about the code to fix it once it's necessary.
comment:282 in reply to: ↑ 281 Changed 4 years ago by
Replying to nbruin:
We've seen things like this before. This is exactly what one would see from code that greatly benefited from rings (and hence their coercions!) being nailed in memory.
Well, they are still nailed in memory, as long as you keep a pointer to domain and codomain of the coercion.
It's worth investigating whether there are some obvious optimizations to make.
Sure.
However, aren't
testsuites
made from automatic tests? In that case wouldn't it be a very atypical mess of tiny little self-contained constructions to test, e.g., commutativity, property of identity etc? We shouldn't be optimizing code for those.
Yes - and the fact that these tiny little constructions take so much time means that they are totally useless and have to be improved.
Otherwise just stick a strong cache in
DualSchurFunctions
somewhere.
No, I don't think this would be a solution. The dramatic increase in running time probably points to a general problem.
comment:283 follow-up: ↓ 284 Changed 4 years ago by
Had a quick look. To simplify and speed up the tests a little bit, you can concentrate on one of the tests that's slow:
sage: cmd='dks4._test_associativity()' sage: import cProfile,pstats sage: s=pstats.Stats(cProfile.Profile().runctx(cmd,globals(),{})) sage: S=s.sort_stats('cumulative') sage: S.print_stats() ... sage: S.print_callers() ...
the latter one essentially gives you the call graph, weighted by time contributed.
In fact it shows you pretty quickly that ShurFunctions? were apparently VERY heavily relying on parents being cached:
X=dks4[0]+ 2*dks4[1] + 3*dks4[2] (X*X)*X #wait quite a long time
This mainly uses dks4.product
and you can follow the code from there to see this has not been written with runtime optimization in mind. Probably if you make some things cached the way there, you'll regain the original performance.
My first impression is that this code indeed will make a concrete realization of some structure, perform the operation, and throw away the concrete realization. Elements on the way basically ARE homs, hence all the homsets. I'm sure someone will be grateful if you clean up this design, but I'm not so sure it should be part of this ticket or a requirement for its merger. Good luck!
comment:284 in reply to: ↑ 283 Changed 4 years ago by
- Cc nthiery added
Replying to nbruin:
...
In fact it shows you pretty quickly that ShurFunctions? were apparently VERY heavily relying on parents being cached: ...
My first impression is that this code indeed will make a concrete realization of some structure, perform the operation, and throw away the concrete realization. Elements on the way basically ARE homs, hence all the homsets. I'm sure someone will be grateful if you clean up this design, but I'm not so sure it should be part of this ticket or a requirement for its merger. Good luck!
Cc to Nicolas, perhaps he has remarks on it.
comment:285 Changed 4 years ago by
The patch needs to be rebased, because of a change in #12215.
comment:286 Changed 4 years ago by
- Description modified (diff)
- Work issues set to Speed regression
The patch is updated. I used the occasion to combine the two patches. Hence:
Apply trac_12313-mono_dict-combined-random-sk.patch
It still needs work, because of the speed regression.
comment:287 follow-up: ↓ 288 Changed 4 years ago by
Perhaps follow up speed regressions on #13991? Once it's clearer what the problem is, we can decide if that ticket should be a dependency for this one. Obviously, if it turns out to be a design flaw there, we shouldn't hold the rest of sage hostage to its resolution. The problem addressed by this ticket is a real one.
comment:288 in reply to: ↑ 287 Changed 4 years ago by
Replying to nbruin:
Perhaps follow up speed regressions on #13991? Once it's clearer what the problem is, we can decide if that ticket should be a dependency for this one. Obviously, if it turns out to be a design flaw there, we shouldn't hold the rest of sage hostage to its resolution. The problem addressed by this ticket is a real one.
I'd prefer this approach as well, being more affected by the cache problems than by the schur things I don't use. And I assume this should be the case of the majority of users.
comment:289 follow-up: ↓ 291 Changed 4 years ago by
But how are we sure that the speed regression is limited to that one file about Schur functions? It seems to me that it might just be an obvious place where the regression is observed. A priori, many other parts of the Sage library might also be affected.
comment:290 Changed 4 years ago by
Good point (although a speed regression is to be expected anyway, we cache less things so will have to rebuild them, unless being very careful; I guess the goal is that this speed regression is not terrible on average).
Do you have scripts that automatically look for these? Let's say by comparing the runtime of doctests? Or do you have to find them by inspection?
comment:291 in reply to: ↑ 289 ; follow-up: ↓ 292 Changed 4 years ago by
Replying to jdemeyer:
A priori, many other parts of the Sage library might also be affected.
And on the other hand, many other parts of the Sage library are affected by keeping garbage in memory indefinitely.
I think the normal behaviour of a CAS should be that a structure S is considered garbage and can be collected, unless the user has constructed some object O and keeps a reference to it, and O keeps a chain of references to S. Sadly, that was not the case for Sage, but this ticket is part of an attempt to make Sage behave "normally" in that department.
Hence, if the same structure is needed repeatedly during a computation, then the programmer should keep a reference to it locally, rather than relying on an external cache. And if different methods of an object O will use the same structure S, then it makes sense to store S as an attribute of O, in order to prevent S from being garbage collected as long as O is alive (but not longer than that).
I think what Nils wanted to say is: If other parts of the Sage library do not obey the general rules formulated in the preceding paragraph, but rely on an infinite external cache, then they need to be fixed. But the need to fix them should not prevent us from changing the infinite external cache to a finite external cache, because that is of benefit for all parts in the Sage library that follow the general rules.
Jeroen, I guess you have access to doc test timings for various versions of Sage. Would it be possible for you to find out which tests become slower with #12313? That might give us an idea how bad of a regression we would have.
comment:292 in reply to: ↑ 291 Changed 4 years ago by
Replying to SimonKing:
Jeroen, I guess you have access to doc test timings for various versions of Sage. Would it be possible for you to find out which tests become slower with #12313? That might give us an idea how bad of a regression we would have.
OK, can we put a freeze on this ticket then and keep the patch such that I'm not chasing a moving target.
comment:293 Changed 4 years ago by
- Status changed from needs_work to positive_review
- Work issues Speed regression deleted
comment:294 Changed 4 years ago by
I made a testing release: http://sage.math.washington.edu/home/release/sage-5.7.alpha12313/
comment:295 follow-up: ↓ 300 Changed 4 years ago by
Another different regression:
def test(RR): for d in range(-20,0): if abs(RR(quadratic_L_function__numerical(1, d, 10000) - quadratic_L_function__exact(1, d))) > 0.001: print "Oops! We have a problem at d = ", d, " exact = ", RR(quadratic_L_function__exact(1, d)), " numerical = ", RR(quadratic_L_function__numerical(1, d)) time test(RealField(50))
From 3.0 seconds in sage-5.7.beta0 to 4.2 seconds with #12215 + #12313.
comment:296 Changed 4 years ago by
Another one:
M = ModularSymbols(DirichletGroup(5).0, 7) time MM = M.change_ring(CyclotomicField(8))
from 1.3 seconds to 2.1 seconds.
comment:297 follow-up: ↓ 298 Changed 4 years ago by
Maybe we hsould open a "metaticket" to track this regression and point there to tickets that will actually fix each of these regressions?
It's jsut that the ticket here is getting quite long and we risk to lose track of the regressions (if more are to come).
comment:298 in reply to: ↑ 297 Changed 4 years ago by
Replying to jpflori:
Maybe we hsould open a "metaticket" to track this regression and point there to tickets that will actually fix each of these regressions?
To my understanding, the memory leak fixes are fine up to and including #12215. One could argue that the ticket from here is also fixing a general memory leak, and fixing a general leak counts more than a few speed regressions in specialised code. But one could argue that we should be happy to have the fixes up to #12215 in the next Sage release, and should not spoil it with several speed regressions - hence, #12313 should not be merged until we found a way to lift the regression.
I don't know what Jeroen prefers.
If he intends to merge #12313 soon, then it would make sense to change #13991 into a task (=metaticket), the original purpose of #13991 probably being solved by #13605.
If he intends to not merge #12313 before all regressions are solved, then I see no benefit in crating a separate metaticket. Instead, one should put #12313 as "needs work" and create new tickets for the separate aspects of regression, making them dependencies for #12313.
comment:299 Changed 4 years ago by
I don't know really, it's not an easy for me either. Probably I'll wait a bit to merge this to assess the situation better.
comment:300 in reply to: ↑ 295 Changed 4 years ago by
Replying to jdemeyer:
The profile calls look exactly the same. More time is spent in sage.rings.arith.kronecker_symbol
, though and oddly enough the time seems to go into conversion from ZZ
to QQ
:
x=-20 def test(): for n in xrange(10**7): _=QQ(x) sage: time test()
goes from Time: CPU 2.97 s, Wall: 2.98 s
to Time: CPU 8.53 s, Wall: 8.57 s
. Python level profiling doesn't tell us anything for this.
I suspect that the whole QQ(x).numerator() * QQ(x).denominator()
thing in kroneckersymbol
is actually unnecessary for the purposes where it's used (it doesn't do anything of the sort for y
), but that's rather irrelevant: This seems to be a case where discovery of a rather obvious conversion seems to have become quite a bit more expensive!
EDIT:
It could just be this:
sage: M=sage.structure.coerce_dict.MonoDict(23) sage: M[ZZ]=1 sage: %timeit _=M[ZZ] 625 loops, best of 3: 331 ns per loop sage: D={ZZ:1} sage: %timeit _=D[ZZ] 625 loops, best of 3: 110 ns per loop
which is fair enough: MonoDict
is a weak key dictionary, so looking up an entry has an extra level of indirection. We're bound to lose some time there. I don't think we can count that as a regression, though: We weren't doing enough work before.
Compared to the (python) weakref.WeakKeyDictionary
we do pretty well:
sage: W=weakref.WeakKeyDictionary() sage: W[ZZ]=1 sage: %timeit _=W[ZZ] 625 loops, best of 3: 1.36 µs per loop
It would be good if someone else can confirm that this explains the regression. I don't think it's a deadly one: If someone wants to do a conversion in an inner loop, they probably know the type of object they're working with, so they could just extract the morphism and apply it directly. That avoids any lookup. I'd deem this one a necessary regression that doesn't need to be addressed, or at least can be addressed partially later, as with #13387 (which presently mainly indexes comments on this megaticket)
comment:301 Changed 4 years ago by
- Milestone changed from sage-5.7 to sage-5.8
comment:302 follow-up: ↓ 304 Changed 4 years ago by
Note the following:
sage: M=sage.structure.coerce_dict.MonoDict(23) sage: R = ZZ['x','y','z'] sage: M[R]=1 sage: D={R:1} sage: timeit("_=M[R]", number=10^6) 1000000 loops, best of 3: 296 ns per loop sage: timeit("_=D[R]", number=10^6) 1000000 loops, best of 3: 348 ns per loop sage: D[ZZ]=2 sage: M[ZZ]=2 sage: timeit("_=M[R]", number=10^6) 1000000 loops, best of 3: 307 ns per loop sage: timeit("_=D[R]", number=10^6) 1000000 loops, best of 3: 351 ns per loop
So, MonoDict
isn't that bad, speed-wise. I am afraid I have no clue what the regression in _=QQ(x)
really comes from. Python calls seem not to be involved, not with and not without #12313.
comment:303 Changed 4 years ago by
Together with #13387, one gets
sage: M=sage.structure.coerce_dict.MonoDict(23) sage: R = ZZ['x','y','z'] sage: M[R]=1 sage: D={R:1} sage: timeit("_=M[R]", number=10^6) 1000000 loops, best of 3: 204 ns per loop sage: timeit("_=D[R]", number=10^6) 1000000 loops, best of 3: 361 ns per loop sage: D[ZZ]=2 sage: M[ZZ]=2 sage: timeit("_=M[R]", number=10^6) 1000000 loops, best of 3: 204 ns per loop sage: timeit("_=D[R]", number=10^6) 1000000 loops, best of 3: 359 ns per loop sage: timeit("_=M[ZZ]", number=10^6) 1000000 loops, best of 3: 206 ns per loop sage: timeit("_=D[ZZ]", number=10^6) 1000000 loops, best of 3: 98.4 ns per loop
Hence, it becomes faster, but a Python dictionary is still faster if "easy" targets such as ZZ are involved.
Indeed, ZZ is very quick at basic operations:
sage: timeit("hash(ZZ)", number=10^6) 1000000 loops, best of 3: 103 ns per loop sage: timeit("id(ZZ)", number=10^6) 1000000 loops, best of 3: 102 ns per loop sage: timeit("hash(R)", number=10^6) 1000000 loops, best of 3: 329 ns per loop
Hence, the hash of ZZ (used in dict) is as fast as id (used in MonoDict
). No surprise that it is that fast.
comment:304 in reply to: ↑ 302 ; follow-up: ↓ 305 Changed 4 years ago by
Replying to SimonKing:
... 1000000 loops, best of 3: 348 ns per loop sage: D[ZZ]=2 sage: M[ZZ]=2 sage: timeit("_=M[R]", number=10^6) 1000000 loops, best of 3: 307 ns per loop sage: timeit("_=D[R]", number=10^6) 1000000 loops, best of 3: 351 ns per loop
Those timing lines don't relate to ZZ
. Since your other timings are in line with what I found doing exactly the same example above, you'd probably also see that lookup of ZZ
in D is about 3 times as fast as it is in M
.
It's interesting to see that many rings in sage are so much slower (unique parents should always be able to use id
for equality and for their hash, right?)
Anyway, going from a strong dict to a weak key dict is *supposed* to be a little slower, so this is not an issue. We'll just have to live with it.
A more serious issue is the limited use of MonoDict
in this case. Essentially, this MonoDict
only has an effect for the cases where we cache that there is NO coercion: In the other cases, we have an entry
M[Domain] = coercion map from Domain to This Parent
i.e., the value in the weak key dict will hold a strong reference to the key. A self-quotation from sage-flame seems appropriate.
I think this is an fundamental problem and just one people will have to hold in mind: Coercing TO something will make the object you coerce FROM stay in memory for as long as the target is there. It's just a little disappointing that all this effort for a weak key dict is only going to make a difference for the entries that store None
as a value. Oh well. That does happen a lot. AND we are getting speed increases in many cases!
comment:305 in reply to: ↑ 304 ; follow-up: ↓ 312 Changed 4 years ago by
Replying to nbruin:
Replying to SimonKing:
... 1000000 loops, best of 3: 348 ns per loop sage: D[ZZ]=2 sage: M[ZZ]=2 sage: timeit("_=M[R]", number=10^6) 1000000 loops, best of 3: 307 ns per loop sage: timeit("_=D[R]", number=10^6) 1000000 loops, best of 3: 351 ns per loopThose timing lines don't relate to
ZZ
.
I didn't claim they do. This was in order to show that under some not so unlikely circumstances MonoDict
is faster than a usual dict-
Since your other timings are in line with what I found doing exactly the same example above, you'd probably also see that lookup of
ZZ
in D is about 3 times as fast as it is inM
.
Yep, I posted these timings as well, didn't I?
A more serious issue is the limited use of
MonoDict
in this case. Essentially, thisMonoDict
only has an effect for the cases where we cache that there is NO coercion: In the other cases, we have an entryM[Domain] = coercion map from Domain to This Parenti.e., the value in the weak key dict will hold a strong reference to the key.
I am not so sure, because I think this was one of the memleaks I explicitly addressed somewhere. But since my Sage install currently is without #12313, I need to recompile now -- which takes an eternity. I hope I'll be able to show you an example.
Anyway.
Here is a timing that shows that this patch has a bad effect both on looking up the coercion map and applying the coercion:
Without #12313
sage: timeit("phi=QQ.coerce_map_from(ZZ)", number=10^6) 1000000 loops, best of 3: 503 ns per loop sage: phi=QQ.coerce_map_from(ZZ) sage: x = -20 sage: timeit("phi(x)", number=10^6) 1000000 loops, best of 3: 563 ns per loop
With #12313:
sage: timeit("phi=QQ.coerce_map_from(ZZ)", number=10^6) 1000000 loops, best of 3: 738 ns per loop sage: phi=QQ.coerce_map_from(ZZ) sage: x = -20 sage: timeit("phi(x)", number=10^6) 1000000 loops, best of 3: 616 ns per loop
comment:306 Changed 4 years ago by
Hmm. The example I tried did not work. I could be mistaken, but I think there is one of my tickets addressing this problem.
comment:307 follow-up: ↓ 308 Changed 4 years ago by
Could someone doctest 5.7.beta2 with MALLOC_CHECK_=3?
As I reported here https://groups.google.com/d/msg/sage-release/nvm8f82Vs-U/Y994aicmNQYJ, it could be the case that this plus some tickets merged in beta2 revealed some new bugs.
comment:308 in reply to: ↑ 307 Changed 4 years ago by
Replying to jpflori:
Could someone doctest 5.7.beta2 with MALLOC_CHECK_=3?
As I reported here https://groups.google.com/d/msg/sage-release/nvm8f82Vs-U/Y994aicmNQYJ, it could be the case that this plus some tickets merged in beta2 revealed some new bugs.
Arrgh. Not again :(
In your sage-release post, you talk about #12215, but you also say that everything seems fine with your 5.7.beta1 install (which does contain #12215). So, you say sage-5.7.beta2 (not beta1) plus #12313 crash?
comment:309 Changed 4 years ago by
Oops, I'm deeply sorry, I thought I was posting on #12215.
And I'm indeed aware that #12215 was merged in beta1 and that the tests pass on it, but the backtrace suggests at first sight at least that #12215 has something to do with that.
Hopefully, we only need a simple fix for GroebnerStrategy? objects as we did in #11521 (IIRC). Indeed the new crashes seem to involve similar objects but in different files from before.
comment:310 Changed 4 years ago by
Aha, so, is the crash with vanilla sage-5.7.beta1 and sage-5.7.beta2, without additional patches?
comment:311 Changed 4 years ago by
The crash is:
- not happening on vanilla 5.7.beta1
- happening on vanilla 5.7.beta2
So it might be the case that #12215 is not the culprit but it looks like it is involved from the backtraces produced. Maybe its the combination of tickets in beta2 which triggers the errors. Remember just how problematic it was to reproduce the problem caused by #715 et al.
comment:312 in reply to: ↑ 305 ; follow-up: ↓ 313 Changed 4 years ago by
Replying to SimonKing:
Yep, I posted these timings as well, didn't I?
You didn't report the timings for ZZ
lookup with #12313 without #13387. Probably just a pasting mistake. Doesn't matter, we have the general picture.
I am not so sure, because I think this was one of the memleaks I explicitly addressed somewhere.
You did something for actions somewhere, where you ensured the maps were on underlying sets, for this reason. But the plain case of one ring coercing into another (e.g., virtually any ring into SR
) does nail
the ring into memory. Some examples (with #12313):
Having no coerce map indeed works fine now:
sage: import gc sage: T=type(GF(2)) sage: sage: for p in prime_range(2,300): ....: k=GF(p) ....: _=ZZ.has_coerce_map_from(k) ....: sage: del k sage: gc.collect() 1412 sage: len(list(o for o in gc.get_objects() if type(o) is T)) 1
and having coerce maps in the opposite direction doesn't hurt either:
sage: for p in prime_range(2,300): ....: k=GF(p) ....: _=k.has_coerce_map_from(ZZ) ....: sage: del k sage: sage: gc.collect() 7852 sage: len(list(o for o in gc.get_objects() if type(o) is T)) 1
But having for instance a convert map (these are cached in the same way as coercions) things do get nailed into memory:
sage: for p in prime_range(2,300): ....: k=GF(p) ....: _=ZZ.convert_map_from(k) #this nails k in memory ....: sage: del k sage: gc.collect() 0 sage: len(list(o for o in gc.get_objects() if type(o) is T)) 62
And this is why coercing into SR
(anything goes there) is so detrimental:
sage: import gc sage: sage: T=type(GF(2)) sage: len(list(o for o in gc.get_objects() if type(o) is T)) 1 sage: for p in prime_range(2,300): ....: k=GF(p) ....: _=SR.has_coerce_map_from(k) ....: sage: del k, _ sage: gc.collect() 271 sage: len(list(o for o in gc.get_objects() if type(o) is T)) 62
Back to worrying things. Just to double check that element creation is not a problem:
sage: import gc sage: sage: T=type(GF(2)) sage: len(list(o for o in gc.get_objects() if type(o) is T)) 1 sage: for p in prime_range(2,300): ....: k=GF(p) ....: _=k(1) ....: sage: del k, _ sage: gc.collect() 2656 sage: len(list(o for o in gc.get_objects() if type(o) is T)) 1
so one would hope that the following would be OK, but it's not. I guess binops get cached somewhere as well?
sage: for p in prime_range(2,300): ....: k=GF(p) ....: _=k(1)+ZZ(1) ....: sage: del k, _ sage: gc.collect() 0 sage: len(list(o for o in gc.get_objects() if type(o) is T)) 62
(I've checked that ZZ(1)+k(1)
gives the same result). That is a worrying leak, I think. Adding ZZ(1)
to a ring element is something people are quite likely to think is reasonable. But that seems to nail the ring into memory again! I think this means there is room for a follow-up ticket. I think a nice lead is the result from:
sage: cm = sage.structure.element.get_coercion_model() sage: maps, actions = cm.get_cache()
this seems to hold plenty of references to finite fields after the last example, both in keys and in values (I suppose the coercion model keeps a lot of these things in TripleDict
, so the key references shouldn't matter but the references in values do, because those would contribute to refcounts.
I'm afraid the binary operation discovery of the coercion framework is a memory leak, so the proper solution is probably to give people good tools to avoid depending on it.
comment:313 in reply to: ↑ 312 Changed 4 years ago by
Replying to nbruin:
I'm afraid the binary operation discovery of the coercion framework is a memory leak, so the proper solution is probably to give people good tools to avoid depending on it.
#14058 is making good progress on resolving that issue and depends on this ticket to have effect.
The speed regression reported in 276 has been diagnosed and fixed on #13991. This conflicts with work on the much broader scoped #13605, but that ticket doesn't seem to be going in any time soon, so we should probably just push #13991 in before it. I think one can even consider merging #12313 without it.
The speed regression reported on 295 is minor and probably essential to some degree, because a lookup in a weakly keyed dict is fundamentally more expensive than in a normal dict. #13387 mitigates the effects to some extent, but again merging #12313 without it is acceptable in my opinion.
comment:314 follow-up: ↓ 315 Changed 4 years ago by
Now that you're diagnosing anyway, how about 296?
comment:315 in reply to: ↑ 314 Changed 4 years ago by
Replying to jdemeyer:
Now that you're diagnosing anyway, how about 296?
I guess same kind of overhead as 295. Anyway, by including and using cdef access to MonoDict
in the coercion lookups I have all but eliminated these speed regressions. Previously, this code was dictionary access, which Cython probably special cases to very fast code. The migration to MonoDict
must have reverted that to a more generic and slower __getitem__
and __setitem__
slot lookup. See #13387 for the new patches.
It's hard to get times that are consistent across separate build of identical code! It seems things like branch name and choices made by the compiler have quite a bit of influence on these tests. The time differences I am observing now are within the variation I see when just rebuilding the same code.
Ticket #13991 will still be necessary to solve the regression it is intended to address.
comment:316 Changed 4 years ago by
- Merged in set to sage-5.8.beta0
- Resolution set to fixed
- Status changed from positive_review to closed
comment:317 Changed 4 years ago by
- Merged in sage-5.8.beta0 deleted
- Resolution fixed deleted
- Status changed from closed to new
This causes problems during a test build of sage-5.8.beta1:
sage -t -force_lib devel/sage/sage/rings/padics/padic_generic_element.pyx ********************************************************************** File "/release/merger/sage-5.8.beta1/devel/sage-main/sage/rings/padics/padic_generic_element.pyx", line 1005: sage: R2 = Qp(2,20,'capped-rel') Exception raised: Traceback (most recent call last): File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1231, in run_one_test self.run_one_example(test, example, filename, compileflags) File "/release/merger/sage-5.8.beta1/local/bin/sagedoctest.py", line 38, in run_one_example OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags) File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1172, in run_one_example compileflags, 1) in test.globs File "<doctest __main__.example_24[24]>", line 1, in <module> R2 = Qp(Integer(2),Integer(20),'capped-rel')###line 1005: sage: R2 = Qp(2,20,'capped-rel') File "factory.pyx", line 143, in sage.structure.factory.UniqueFactory.__call__ (sage/structure/factory.c:1119) File "factory.pyx", line 170, in sage.structure.factory.UniqueFactory.get_object (sage/structure/factory.c:1311) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/factory.py", line 493, in create_object return pAdicFieldCappedRelative(p, prec, {'mode': print_mode, 'pos': print_pos, 'sep': print_sep, 'alphabet': print_alphabet, 'ram_name': name, 'max_ram_terms': print_max_terms}, name) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_leaves.py", line 457, in __init__ pAdicFieldBaseGeneric.__init__(self, p, prec, print_mode, names, pAdicCappedRelativeElement) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_generic.py", line 27, in __init__ coerce_list = [pAdicCoercion_ZZ_CR(self), pAdicCoercion_QQ_CR(self)] File "padic_base_coercion.pyx", line 79, in sage.rings.padics.padic_base_coercion.pAdicCoercion_ZZ_CR.__init__ (sage/rings/padics/padic_base_coercion.c:5839) RingHomomorphism_coercion.__init__(self, ZZ.Hom(R), check=False) File "parent_base.pyx", line 128, in sage.structure.parent_base.ParentWithBase.Hom (sage/structure/parent_base.c:1870) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/categories/homset.py", line 266, in Hom _cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category))) File "coerce_dict.pyx", line 1059, in sage.structure.coerce_dict.TripleDict.__setitem__ (sage/structure/coerce_dict.c:6614) File "coerce_dict.pyx", line 1079, in sage.structure.coerce_dict.TripleDict.set (sage/structure/coerce_dict.c:6880) KeyError: (17671424, 87123360, 20657216) ********************************************************************** File "/release/merger/sage-5.8.beta1/devel/sage-main/sage/rings/padics/padic_generic_element.pyx", line 1068: sage: Qp(7)(0).abs() Exception raised: Traceback (most recent call last): File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1231, in run_one_test self.run_one_example(test, example, filename, compileflags) File "/release/merger/sage-5.8.beta1/local/bin/sagedoctest.py", line 38, in run_one_example OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags) File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1172, in run_one_example compileflags, 1) in test.globs File "<doctest __main__.example_25[4]>", line 1, in <module> Qp(Integer(7))(Integer(0)).abs()###line 1068: sage: Qp(7)(0).abs() File "factory.pyx", line 143, in sage.structure.factory.UniqueFactory.__call__ (sage/structure/factory.c:1119) File "factory.pyx", line 170, in sage.structure.factory.UniqueFactory.get_object (sage/structure/factory.c:1311) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/factory.py", line 493, in create_object return pAdicFieldCappedRelative(p, prec, {'mode': print_mode, 'pos': print_pos, 'sep': print_sep, 'alphabet': print_alphabet, 'ram_name': name, 'max_ram_terms': print_max_terms}, name) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_leaves.py", line 457, in __init__ pAdicFieldBaseGeneric.__init__(self, p, prec, print_mode, names, pAdicCappedRelativeElement) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_generic.py", line 27, in __init__ coerce_list = [pAdicCoercion_ZZ_CR(self), pAdicCoercion_QQ_CR(self)] File "padic_base_coercion.pyx", line 79, in sage.rings.padics.padic_base_coercion.pAdicCoercion_ZZ_CR.__init__ (sage/rings/padics/padic_base_coercion.c:5839) RingHomomorphism_coercion.__init__(self, ZZ.Hom(R), check=False) File "parent_base.pyx", line 128, in sage.structure.parent_base.ParentWithBase.Hom (sage/structure/parent_base.c:1870) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/categories/homset.py", line 266, in Hom _cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category))) File "coerce_dict.pyx", line 1059, in sage.structure.coerce_dict.TripleDict.__setitem__ (sage/structure/coerce_dict.c:6614) File "coerce_dict.pyx", line 1079, in sage.structure.coerce_dict.TripleDict.set (sage/structure/coerce_dict.c:6880) KeyError: (17671424, 87232416, 20657216) ********************************************************************** File "/release/merger/sage-5.8.beta1/devel/sage-main/sage/rings/padics/padic_generic_element.pyx", line 1070: sage: Qp(7)(0).abs(prec=20) Exception raised: Traceback (most recent call last): File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1231, in run_one_test self.run_one_example(test, example, filename, compileflags) File "/release/merger/sage-5.8.beta1/local/bin/sagedoctest.py", line 38, in run_one_example OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags) File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1172, in run_one_example compileflags, 1) in test.globs File "<doctest __main__.example_25[5]>", line 1, in <module> Qp(Integer(7))(Integer(0)).abs(prec=Integer(20))###line 1070: sage: Qp(7)(0).abs(prec=20) File "factory.pyx", line 143, in sage.structure.factory.UniqueFactory.__call__ (sage/structure/factory.c:1119) File "factory.pyx", line 170, in sage.structure.factory.UniqueFactory.get_object (sage/structure/factory.c:1311) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/factory.py", line 493, in create_object return pAdicFieldCappedRelative(p, prec, {'mode': print_mode, 'pos': print_pos, 'sep': print_sep, 'alphabet': print_alphabet, 'ram_name': name, 'max_ram_terms': print_max_terms}, name) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_leaves.py", line 457, in __init__ pAdicFieldBaseGeneric.__init__(self, p, prec, print_mode, names, pAdicCappedRelativeElement) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_generic.py", line 27, in __init__ coerce_list = [pAdicCoercion_ZZ_CR(self), pAdicCoercion_QQ_CR(self)] File "padic_base_coercion.pyx", line 79, in sage.rings.padics.padic_base_coercion.pAdicCoercion_ZZ_CR.__init__ (sage/rings/padics/padic_base_coercion.c:5839) RingHomomorphism_coercion.__init__(self, ZZ.Hom(R), check=False) File "parent_base.pyx", line 128, in sage.structure.parent_base.ParentWithBase.Hom (sage/structure/parent_base.c:1870) File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/categories/homset.py", line 266, in Hom _cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category))) File "coerce_dict.pyx", line 1059, in sage.structure.coerce_dict.TripleDict.__setitem__ (sage/structure/coerce_dict.c:6614) File "coerce_dict.pyx", line 1079, in sage.structure.coerce_dict.TripleDict.set (sage/structure/coerce_dict.c:6880) KeyError: (17671424, 82669344, 20657216) **********************************************************************
comment:318 Changed 4 years ago by
I am not able to reproduce this error, however "coerce_dict.pyx", line 1079 is
# Test whether the old references are still active r1,r2,r3 = <tuple>(self._refcache[h1,h2,h3])
the line "homset.py", line 266:
_cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))
was introduced in #11521, the trac_11521_callback.patch, to be precise.
I'm a little worried about this implementation, because it's relying on implementation details of a tripledict that aren't really part of the API (in particular, here it's relying on the TripleDict.eraser
attribute and its call convention). I think this should be refactored to a method
_cache.set_weakvalued(key, H)
so that the tripledict can figure out how to put a properly formatted callback on the value.
A quick "solution" might be to merge #13387 as well, since that gets rid of self._refcache
. However, I'm a little puzzled how this error can occur:
- By the time we arrive at line 1079 we have succeeded in finding the key triple
(h1,h2,h3)
. - The KeyError? indicates that
_refcache
doesn't contain the key anymore. - But key removal from
_refcache
and frombucket
are inextricably linked! This means that a key removal was triggered somewhere between finding the element in the bucket and looking up in_refcache
. But that should mean thatbucket
got modified too! That's really worrisome, because at this point we're ready to scribble inbucket
at a location we've counted before.
Indeed, if we are allocating memory anywhere in the loop that walks through the bucket we could be triggering a garbage collection, which could indeed include callbacks that modify this very bucket. I don't think any of the lines here could do that, though, except for
(self._refcache[h1,h2,h3])
itself, if that needs to construct a tuple or something like that.
So my guess is: The entry that's keyed (17671424, 82669344, 20657216)
is indeed liable to get a callback from a garbage collection and thus be removed, but that hasn't happened yet. However, in preparing the call
self._refcache[h1,h2,h3]
the system needs to make a python tuple (h1,h2,h3) to pass to __getitem__
and that triggers the very GC that deletes the entry from bucket
and _refcache
; hence the key error.
In other words, we need to make absolutely sure that in this lookup loop, no memory allocations whatsoever are done, because any of those could trigger a GC that can corrupt our view of what the bucket should look like.
Looking at the lookup loop at #13387 for this, I think no memory allocations are made there, so if my diagnosis is correct, then merging that ticket would in fact be a proper fix.
I still think we shouldn't be making KeyedRef
instances using the eraser
of a TripleDict
outside the TripleDict
itself, though.
Looking at the code cython generates I think my diagnosis is a possibility:
* # Test whether the old references are still active * r1,r2,r3 = <tuple>(self._refcache[h1,h2,h3]) # <<<<<<<<<<<<<< * if (isinstance(r1,KeyedRef) and r1() is None) or \ * (isinstance(r2,KeyedRef) and r2() is None) or \ */ ... __pyx_t_9 = PyTuple_New(3); if (unlikely(!__pyx_t_9)) __Pyx_GOTREF(__pyx_t_9); PyTuple_SET_ITEM(__pyx_t_9, 0, __pyx_t_1); __Pyx_GIVEREF(__pyx_t_1); PyTuple_SET_ITEM(__pyx_t_9, 1, __pyx_t_5); __Pyx_GIVEREF(__pyx_t_5); PyTuple_SET_ITEM(__pyx_t_9, 2, __pyx_t_8); __Pyx_GIVEREF(__pyx_t_8); ... __pyx_t_8 = __Pyx_PyDict_GetItem(((PyObject *)__pyx_v_self->__pyx_base._refcache), ((PyObject *)__pyx_t_9)); if (!__pyx_t_8) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 1079; __pyx_clineno = __LINE__; goto __pyx_L1_error;} ...
As you see the lookup code does start with a PyTuple_New
that could trigger a GC leading to removal of the key that gets looked up in the last line (where the KeyError
would originate).
TL;DR: merge #13387
comment:319 Changed 4 years ago by
- Dependencies changed from #715, #11521, #12215, #13746, #13378 to #715, #11521, #12215, #13746, #13378; merge with #13387
- Status changed from new to needs_review
comment:320 follow-up: ↓ 322 Changed 4 years ago by
Conjectures on how the bug in 317 can occur
I had a bit of trouble coming up with a scenario where a key triple could get deleted during its setting. The reason is that when someone is setting a key, presumably they are holding (strong) references to the key components, so those can't possibly be slated for deletion.
However, the very statement introduces by #11521 changes that:
- A value in a TripleDict? may be stored via a weak reference (this was already the case)
- It installs a callback on the weakref that removes the keytriple:weakref pair from the dictionary (this IS new).
However, since actual deallocation of this weakreffed value may be delayed, it is very well possible that by the time the callback gets executed, the value stored under the key triple is no longer the original value. Thus, deletion is unwarranted! Therefore, the callback on this weakreffed value should check that the key component is as expected AND that the stored value is still as expected (a dead weakref would do, but checking that it's the exact weakref that we expect there would be even better). Otherwise, we may be erasing entries that don't have to do with the just deallocated object!
My guess is that #14084 might have something to do with that. Are there ever situations where the homset cached for two objects gets changed (category refinement comes to mind as one case where this might happen ...).? That could well lead to the scenario sketched above.
In short: trac_11521_callback.patch doesn't seem entirely safe to me, because the callback installed on the value only checks the key for deletion. It doesn't check that the value hasn't changed in the mean time.
We still need #13387 anyway, because ANY GC event in the lookup loop can have disastrous consequences which, like any memory corruption, could go unnoticed for a long time and then lead to undefined behaviour later on. The PyTuple_New
in the currently generated code means we cannot rule this out.
comment:321 Changed 4 years ago by
- Status changed from needs_review to positive_review
I'm setting this back to positive review, keeping in mind that it needs to be merged with #13387.
comment:322 in reply to: ↑ 320 Changed 4 years ago by
Replying to nbruin:
However, the very statement introduces by #11521 changes that:
- A value in a TripleDict? may be stored via a weak reference (this was already the case)
- It installs a callback on the weakref that removes the keytriple:weakref pair from the dictionary (this IS new).
And note that originally I did *not* use that callback. So, I wouldn't mind to simply revert this patch.
In other words: Homsets would be stored in a TripleDict
, via a weak reference without a callback. Hence, if the Homset becomes garbage collected, then we will have a dead weak references remaining in the TripleDict
(the original purpose of the callback was to remove this dead entry). Of course, it is not nice if the dead entry would stay in the TripleDict
forever. But it is conceivable that domain and/or codomain of the Homset will become garbage collectable if the Homset becomes garbage collectable. Hence, the dead entry would disappear anyway, by means of the callbacks from the keys (not from the value).
Hence, I suggest to not use a callback for the weak reference to a Homset.
comment:323 Changed 4 years ago by
I am trying to apply this patch to 5.8.beta0, but it doesn't work at all. Is it merged in 5.8.beta0?
comment:324 Changed 4 years ago by
comment:325 Changed 4 years ago by
- Description modified (diff)
- Status changed from positive_review to needs_work
I have added a patch, so that no callback is used for homsets. The rationale is given in comment:322.
For the patchbot (in an attempt to trick it, because the main patch seems to be in 5.8.beta0...):
Apply trac_12313-revert_callback_from_11521.patch to sage-5.8.beta0
comment:326 Changed 4 years ago by
- Status changed from needs_work to needs_review
comment:327 Changed 4 years ago by
comment:328 Changed 4 years ago by
Could it be that #12313 is not in sage-5.8.beta0? The version that I have downloaded before the official announcement does contain it.
Anyway, let's try:
Apply trac_12313-mono_dict-combined-random-sk.patch trac_12313-revert_callback_from_11521.patch to sage-5.8.beta0
comment:329 follow-up: ↓ 330 Changed 4 years ago by
- Status changed from needs_review to positive_review
Why add that patch here when it belongs on #14159? This ticket already has positive review. It's bad form to start changing the code unless absolutely necessary and that new patch has nothing to do with this ticket. I've uploaded your patch there (and changed the commit message to reflect the appropriate ticket number). I hope that's OK with you. If so, you can just remove the ticket here from the list and move the ticket back to "Positive". We really want to get this merged, especially #13387 (which depends on this ticket) because that fixes some issues with TripleDict
that could already bite its uses in 5.6 and 5.7.
comment:330 in reply to: ↑ 329 Changed 4 years ago by
- Description modified (diff)
Replying to nbruin:
Why add that patch here when it belongs on #14159? This ticket already has positive review. It's bad form to start changing the code unless absolutely necessary.
I thought it was absolutely necessary, and since it wasn't merged yet, I thought it was OK to do it here.
I've uploaded your patch there (and changed the commit message to reflect the appropriate ticket number). I hope that's OK with you.
If so, you can just remove the ticket here from the list
I can't remove the attachment, but I can remove it from the list of to-be-applied-patches.
and move the ticket back to "Positive".
OK.
Apply trac_12313-mono_dict-combined-random-sk.patch
comment:331 Changed 4 years ago by
comment:332 Changed 4 years ago by
- Description modified (diff)
comment:333 Changed 4 years ago by
- Merged in set to sage-5.8.beta2
- Resolution set to fixed
- Status changed from positive_review to closed
comment:334 Changed 4 years ago by
- Cc zimmerma removed
Memory leaks are quite tricky!!
The post here is mainly a memo for myself, documenting what went wrong in the last couple of days.
I tried to directly modify my code from #715. The approach was:
TripleDict
orMonoDict
should store the data in lists, with some list items being the address of the keys; and in a separate dictionary, aweakref.KeyedRef
with a callback function is preserved.And what happened when I replaced
Parent._coerce_from_hash
by aMonoDict
? A memory leak was created by using weak references! Namely,weakref.KeyedRef
.weakref.KeyedRef
has a strong reference to the callback functionMonoDictEraser
(similar to theTripleDictEraser
from #715)MonoDict
MonoDict
has a strong reference to the coerce map.I think the best place to break the reference chain is from the Eraser to the
MonoDict
. Perhaps one could do the same thing at #715, rather than having a weak reference to the underlying set of an action?