Opened 13 years ago
Last modified 7 years ago
#715 closed defect
Parents probably not reclaimed due to too much caching — at Version 88
Reported by: | robertwb | Owned by: | somebody |
---|---|---|---|
Priority: | major | Milestone: | sage-5.5 |
Component: | coercion | Keywords: | weak cache coercion Cernay2012 |
Cc: | jpflori, zimmerma, vbraun, robertwb, nbruin, malb, mjo | Merged in: | |
Authors: | Simon King | Reviewers: | |
Report Upstream: | N/A | Work issues: | avoid regression |
Branch: | Commit: | ||
Dependencies: | #9138, #11900 | Stopgaps: |
Description (last modified by )
Here is a small example illustrating the issue.
The memory footprint of the following piece of code grows indefinitely.
sage: K = GF(1<<55,'t') sage: a = K.random_element() sage: while 1: ....: E = EllipticCurve(j=a); P = E.random_point(); 2*P; del E, P;
E and P get deleted, but when 2*P is computed, the action of integers on A, the abelian group of rational points of the ellitpic curve, gets cached in the corecion model.
A key-value pair is left in coercion_model._action_maps dict:
(ZZ,A,*) : IntegerMulAction?
Moreover there is at least also references to A in the IntegerMulAction? and one in ZZ._action_hash.
So weak refs should be used in all these places if it does not slow things too much.
Apply:
Change History (91)
comment:1 Changed 13 years ago by
- Milestone set to sage-feature
comment:2 Changed 12 years ago by
- Milestone changed from sage-feature to sage-2.10.2
comment:3 Changed 11 years ago by
comment:4 Changed 11 years ago by
The coercion model needs to use weakrefs so that parents aren't needlessly referenced when they're discarded. It is nontrivial to see where the weakrefs need to go, and how to do so without slowing the code down.
The ticket is still valid.
comment:5 Changed 9 years ago by
- Component changed from basic arithmetic to coercion
- Description modified (diff)
- Report Upstream set to N/A
comment:6 Changed 9 years ago by
- Cc jpflori added
comment:7 Changed 9 years ago by
- Description modified (diff)
With the piece of code in the desrciption, there is only one reference to these objects in that ZZ._hash_actions dictionary because to build it we test if A1 == A2 and not A1 is A2 as in coercion_model._action_maps, and because of the current implementation of ellitpic curves, see http://groups.google.com/group/sage-nt/browse_thread/thread/ec8d0ad14a819082 and #11474, and decause the above code use only one j-invariant, only ones gets finally stored.
However with random curves, I guess there would be all of them.
About the weakref, the idea should more be to build something like WeakKeyDictionnary? if it does not slow down coercion too much...
comment:8 Changed 9 years ago by
The following example also exhibits a suspicious, steady growth in memory use. The only reason I can think of why that would happen is that references to the created finite field remain lying around somewhere, preventing deallocation:
sage: L=prime_range(10^8) sage: for p in L: k=GF(p)
If you change it to the creation of a polynomial ring the memory use rises much faster:
sage: L=prime_range(10^8) sage: for p in L: k=GF(p)['t']
Are "unique" parents simply *never* deallocated?
comment:9 Changed 9 years ago by
Be aware that polynomial rings are also cached because of uniqueness of parents, explaining somehow your second memory consumption; see #5970 for example.
For finite fields I did not check.
comment:10 Changed 8 years ago by
- Cc zimmerma added
comment:11 Changed 8 years ago by
See #11521 for some concrete instances of this problem and some advice to investigate it.
comment:12 Changed 8 years ago by
In my code for the computation Ext algebras of basic algebras, I use letterplace algebras (see #7797), and they involve the creation of many polynomial rings. Only one of them is used at a time, so, the others could be garbage collected. But they aren't, and I suspect this is because of using strong references in the coercion cache.
See the following example (using #7797)
sage: F.<a,b,c> = FreeAlgebra(GF(4,'z'), implementation='letterplace') sage: import gc sage: len(gc.get_objects()) 170947 sage: a*b*c*b*c*a*b*c a*b*c*b*c*a*b*c sage: len(gc.get_objects()) 171556 sage: del F,a,b,c sage: gc.collect() 81 sage: len(gc.get_objects()) 171448 sage: cm = sage.structure.element.get_coercion_model() sage: cm.reset_cache() sage: gc.collect() 273 sage: len(gc.get_objects()) 171108
That is certainly not a proof of my claim, but it indicates that it might be worth while to investigate.
In order to facilitate work, I am providing some other tickets that may be related to this:
I guess that one should use a similar cache model to what I did in #11521: The key for the cache should not just be (domain,codomain)
, because we want that garbage collection of the cache item is already allowed if just one of domain or codomain is collectable.
comment:13 Changed 8 years ago by
I try to wrap my mind around weak references. I found that when creating a weak reference, one can also provide a method that is called when the weak reference becomes invalid. I propose to use such method to erase the deleted object from the cache, regardless whether it appears as domain or codomain.
Here is a proof of concept:
sage: ref = weakref.ref sage: D = {} sage: def remove(x): ....: for a,b,c in D.keys(): ....: if a is x or b is x or c is x: ....: D.__delitem__((a,b,c)) ....: sage: class A: ....: def __init__(self,x): ....: self.x = x ....: def __repr__(self): ....: return str(self.x) ....: def __del__(self): ....: print "deleting",self.x ....: sage: a = A(5) sage: b = A(6) sage: r = ref(a,remove) sage: s = ref(b,remove) sage: D[r,r,s] = 1 sage: D[s,r,s] = 2 sage: D[s,s,s] = 3 sage: D[s,s,1] = 4 sage: D[r,s,1] = 5 sage: D.values() [5, 3, 1, 4, 2] sage: del a deleting 5 sage: D.values() [4, 3] sage: del b deleting 6 sage: D.values() []
comment:14 Changed 8 years ago by
It turns out that using weak references in the coercion cache will not be enough. Apparently there are other direct references that have to be dealt with.
comment:15 Changed 8 years ago by
I wonder whether the problem has already been solved. I just tested the example from the ticket description, and get (at least with #11900, #11521 and #11115):
sage: K = GF(1<<55,'t') sage: a = K.random_element() sage: m0 = get_memory_usage() sage: for i in range(1000): ....: E = EllipticCurve(j=a); P = E.random_point(); PP = 2*P ....: sage: get_memory_usage() - m0 15.22265625
I think that this is not particularly scary. I'll repeat the test with vanilla sage-4.8.alpha3, but this will take a while to rebuild.
comment:16 Changed 8 years ago by
No, even in vanilla sage-4.8.alpha3 I don't find a scary memory leak in this example.
Do we have a better example? One could, of course, argue that one should use weak references for caching even if we do not find an apparent memory leak. I am preparing a patch for it now.
comment:17 Changed 8 years ago by
- Cc vbraun added
Here is an experimental patch.
A new test shows that the weak caching actually works.
Note that the patch also introduces a weak cache for polynomial rings, which might be better to put into #5970. Well, we can sort things out later...
comment:18 Changed 8 years ago by
It needs work, though. Some tests in sage/structure fail, partially because of pickling, partially because some tests do not follow the new specification of TripleDict
(namely that the first two parts of each key triple and the associated value must be weak referenceable.
comment:19 Changed 8 years ago by
Now I wonder: Should I try to use weak references and make it accept stuff that does not allow for weak references?
In the intended applications, weak references are possible. But in some tests and in the pickle jar, the "wrong" type of keys (namely strings and ints) are used.
comment:20 Changed 8 years ago by
The only place where the weak references are created is in the set(...)
method of TripleDict
. I suggest to simply catch the error that may occur when creating a weak reference, and then use a different way of storing the key. I am now running tests, and I hope that this ticket will be "needs review" in a few hours.
comment:21 Changed 8 years ago by
- Keywords weak cache coercion added
- Status changed from new to needs_review
With the attached patch, all tests pass for me, and the new features are doctested. Needs review!
comment:22 Changed 8 years ago by
- Dependencies set to #11900
comment:23 Changed 8 years ago by
I was able to apply this patch to vanilla 4.7.2. Should I continue reviewing it like this?
Paul
comment:24 Changed 8 years ago by
on top of vanilla 4.7.2 several doctests fail:
sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/calculus/interpolators.pyx # 0 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/databases/database.py # 15 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/finance/time_series.pyx # 0 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/graph_list.py # 4 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/graph_database.py # 28 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/graph.py # 6 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/generic_graph.py # 4 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/matrix/matrix2.pyx # 3 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/hecke/hecke_operator.py # 1 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/hecke/ambient_module.py # 2 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/subspace.py # 6 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/boundary.py # 3 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/space.py # 3 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/modsym.py # 1 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/ambient.py # 11 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/abvar/abvar.py # 0 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/schemes/elliptic_curves/heegner.py # 9 doctests failed sage -t 4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/sandpiles/sandpile.py # Time out
Paul
comment:25 Changed 8 years ago by
I'll try again on top of vanilla sage-4.8.alpha3. You are right, the patch does apply (almost) cleanly even without #11900. That surprises me, because at some point there was an inconsistency.
Hopefully I can see later today whether I get the same errors as you.
comment:26 Changed 8 years ago by
- Dependencies #11900 deleted
It turns out that #11900 is indeed not needed.
I can not reproduce any of the errors you mention.
Moreover, the file "sage/devel/sage/databases/database.py", for which you reported an error, does not exist in vanilla sage (not in 4.7.2 and not in 4.8.alpha3).
Did you test other patches before returning to vanilla 4.7.2? Namely, when a patch changes a module from python to cython, and one wants to remove the patch, then it is often needed to also remove any reference to the cython module in build/sage/...
and in build/*/sage/...
. For example, when I had #11115 applied and want to remove it again, then I would do rm build/sage/misc/cachefunc.*
and rm build/*/sage/misc/cachefunc.*
.
comment:27 follow-up: ↓ 28 Changed 8 years ago by
comment:28 in reply to: ↑ 27 Changed 8 years ago by
Replying to zimmerma:
yes I tried other patches (#10983, #8720, #10596) before #715, but each one with a different clone.
But where does the databases/database.py file come from?
And could you post one or two examples for the errors you are getting (i.e. not just which files are problematic, but what commands exactly fail)?
comment:29 Changed 8 years ago by
comment:30 Changed 8 years ago by
I have simplified the routine that removes cache items when a weak reference became invalid. The tests all pass for me.
Apply trac715_weak_coercion_cache.patch
comment:31 Changed 8 years ago by
- Dependencies set to #9138, #11900
comment:32 Changed 8 years ago by
One question: Currently, my patch uses weak references only for the first two parts of the key. Should it also use weak references to the value, when possible?
By "when possible", I mean that not all values allow weak references - if it is possible then a weak reference is used, otherwise a strong reference is used. This might contribute to fixing the memory leak in #11521, but it might have a speed penalty.
Concerning #11521: The point is that an action (which currently does not allow weak references, but that might change) has a strong reference to the objects that are used for storing it in the cache. Hence, an action is not collectable with the current patch.
Thoughts?
comment:33 Changed 8 years ago by
I have slightly updated some of the new examples: In the old patch version, I had created TripleDict(10)
, but meanwhile I learnt that the given parameter should better be odd (actually a prime). So, in the new patch version, it is TripleDict(11)
.
comment:34 Changed 8 years ago by
- Status changed from needs_review to needs_work
- Work issues set to Comparison of the third key items
I think I need to modify one detail:
For efficiency and since domain/codomain of a map must be identic with (and not just equal to) the given keys, my patch compares them by "is" rather than "==". But I think one should still compare the third item of a key via "==" and not "is". I need to do some tests first...
comment:35 Changed 8 years ago by
It really is not an easy question whether or not we should have "is" or "==".
On the one hand, we have the lines
!python if y_mor is not None: all.append("Coercion on right operand via") all.append(y_mor) if res is not None and res is not y_mor.codomain(): raise RuntimeError, ("BUG in coercion model: codomains not equal!", x_mor, y_mor)
in sage/structure/coerce.pyx seem to imply that comparison via "is" is the right thing to do.
But in the same file, the coercion model copes with the fact that some parents are not unique:
!python # Make sure the domains are correct if R_map.domain() is not R: if fix: connecting = R_map.domain().coerce_map_from(R) if connecting is not None: R_map = R_map * connecting if R_map.domain() is not R: raise RuntimeError, ("BUG in coercion model, left domain must be original parent", R, R_map) if S_map is not None and S_map.domain() is not S: if fix: connecting = S_map.domain().coerce_map_from(S) if connecting is not None: S_map = S_map * connecting if S_map.domain() is not S: raise RuntimeError, ("BUG in coercion model, right domain must be original parent", S, S_map)
That would suggest that comparison by "==" (the old behaviour or TripleDict
) is fine.
Perhaps we should actually have to variants of TripleDict
, one using "is" and one using "==".
Note another detail of sage/structure/coerce.pyx: We have
cpdef verify_action(self, action, R, S, op, bint fix=True):
but
cpdef verify_coercion_maps(self, R, S, homs, bint fix=False):
Note the different default value for "fix". If "fix" is True then the coercion model tries to cope with non-unique parents by prepending a conversion between the two equal copies of a parent.
Since the default is to fix non-unique parents for actions, but not for coercion maps, I suggest that a "=="-TripleDict
should be used for actions and an "is"-TripleDict
for coercions.
comment:36 Changed 8 years ago by
I guess a choice has to be made and that it should at lest be as consistent as possible. What you propose makes sense to me, is not too far from the current model and gives a little more conssitency. Moreover, when both TripleDicts? will have been implemented, changing our mind later will be trivial.
comment:37 Changed 8 years ago by
There is another detail. Even in the old version of TripleDict
, we have
It is implemented as a list of lists (hereafter called buckets). The bucket is chosen according to a very simple hash based on the object pointer. and each bucket is of the form [k1, k2, k3, value, k1, k2, k3, value, ...] on which a linear search is performed.
So, the choice of a bucket is based on the object pointer - but then it is not consequent to compare by "==".
comment:38 Changed 8 years ago by
To be precise: The old behaviour was not consequent. The bucket depended on id(k1),id(k2),id(k3)
, but the comparison was by "==" rather than by "is".
Experimentally, I will provide two versions of TripleDict
, one using "hash"for determining the bucket and doing comparison by "==", the other using "id" for determining the bucket and doing comparison by "is".
comment:39 Changed 8 years ago by
- Work issues changed from Comparison of the third key items to fix doctests
As announced, I have attached an experimental patch. It provides two variants of TripleDict
, namely using "==" or "is" for comparison, respectively. Both are used, namely for caching coerce maps or actions, respectively.
It could be that a last-minute change was interfering, but I am confident that all but the following three tests pass:
sage -t devel/sage-main/doc/en/bordeaux_2008/nf_introduction.rst # 1 doctests failed sage -t devel/sage-main/sage/modular/modsym/space.py # Killed/crashed sage -t devel/sage-main/sage/structure/coerce_dict.pyx # 3 doctests failed
The memory leak exposed in the ticket description is fixed (more or less):
sage: K = GF(1<<55,'t') sage: a = K.random_element() sage: for i in range(500): ....: E = EllipticCurve(j=a) ....: P = E.random_point() ....: Q = 2*P ....: sage: import gc sage: gc.collect() 862 sage: from sage.schemes.generic.homset import SchemeHomsetModule_abelian_variety_coordinates_field sage: LE = [x for x in gc.get_objects() if isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)] sage: len(LE) 2
I am not sure whether this makes #11521 redundant.
For now, it is "needs work, because of the doctests. But you can already play with the patch.
comment:40 Changed 8 years ago by
Sorry, only TWO doctests should fail: The tests of sage/structure/coerce_dict.pyx are, of course, fixed.
comment:41 Changed 8 years ago by
The segfault in sage -t devel/sage-main/sage/modular/modsym/space.py
seems difficult to debug.
Inspecting a core dump with gdb did not help at all:
(gdb) bt #0 0x00007f61d12ca097 in kill () from /lib64/libc.so.6 #1 0x00007f61d0044a40 in sigdie () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libcsage.so #2 0x00007f61d0044646 in sage_signal_handler () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libcsage.so #3 <signal handler called> #4 0x00007f61cf080520 in mpn_submul_1 () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libgmp.so.8 #5 0x00007f61cf0b4f0f in __gmpn_sb_bdiv_q () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libgmp.so.8 #6 0x00007f61cf0b6428 in __gmpn_divexact () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libgmp.so.8 #7 0x00007f61ccbf4d64 in ?? () ... #191 0x55c0ade81d9aeecf in ?? () #192 0xffffe4b8b6920b7b in ?? () #193 0x000000000ac854cf in ?? () #194 0x0000000000000000 in ?? ()
How could one proceed? What other debugging techniques can you recommend?
comment:42 follow-up: ↓ 43 Changed 8 years ago by
Looks like you did not tell gdb about the executable you were running. You should run
gdb --core=<corefile> $SAGE_LOCAL/bin/python
comment:43 in reply to: ↑ 42 Changed 8 years ago by
Replying to vbraun:
Looks like you did not tell gdb about the executable you were running.
No, I did tell it. I did
gdb --core=715doublecore ~/SAGE/sage-4.8.alpha3/local/bin/python
Should I do it inside a Sage shell?
comment:44 Changed 8 years ago by
No, doing the same inside a sage shell did not help either.
comment:45 Changed 8 years ago by
I am now printing some debugging information into a file, which hopefully means that I am coming closer to the source of the problem. The segfault arises in line 2165 of sage/modular/modsym/space.py
comment:46 Changed 8 years ago by
Sorry, it was the wrong line number.
comment:47 Changed 8 years ago by
Meanwhile I am rather desperate: I have not the faintest idea how the segfault occurs.
Therefore I used some debugging function that I registered using sys.settrace(...)
, so that all Python commands in the critical example are written into a file.
I posted logs for the unpatched and the patched version.
There is one obvious difference of the two logs: The hash is called more often in the patched version. Calling the hash is rather inefficient for matrix spaces: Each time when the hash of a matrix space is called, the matrix space's string representation is created, which is slow. I suggest to cache the hash value (like what I did for polynomial rings in #9944), but this should be on a different ticket.
Apart from that, I can't spot an obvious difference. Do you have any clue?
comment:48 Changed 8 years ago by
It turns out that using TripleDictById
for the _action_maps cache makes the segfault disappear.
If one uses TripleDict
for _coercion_maps then
sage -t devel/sage-main/sage/modular/modsym/space.py
takes 30 seconds, but if one also uses TripleDictById
then it only takes 23 seconds.
My conclusion:
- The old version of
TripleDict
was buggy: It usesid(...)
for the hash table, but==
for comparison. I think that had to be fixed. - The new version of
TripleDict
useshash(...)
for the hash table and==
for comparison. That should be fine, but (1) it leads to a segfault and (2) it leads to a slowdown. After all, callinghash(...)
is a lot slower than determining the address. - The new
TripleDictById
usesid(...)
for the hash table and... is ...
for comparison. Problem: It would probably not fix the memory leak.
However, the fact that using TripleDictById
fixes the segfault makes me wonder: Perhaps the segfault occurs when calling hash(...)
on a parent? Namely, in some cases, and action will already be constructed during initialisation of a parent. But if the hash is determined based on cdef data that aren't initialised, a segfault can easily occur.
I'll investigate that further. In any case, we need to keep an eye on the potential slow-down.
comment:49 Changed 8 years ago by
The segfault does not occur while computing a hash. It occurs in line 468 of sage/matrix/matrix_rational_dense.pyx, namely
mpq_mul(y, w._entries[j], self._matrix[j][i])
I also tested, just before that line, that w[j]
and self.get_unsafe(j,i)
(which accesses w._entries[j]
and self._matrix[j],[i]
) works.
At this point, I am at my wits' end. To me, it looks like a change in the way of comparing dictionary keys modifies internals of mpir (IIRC, this is where mpq_mul is defined). gdb can not decipher the core file, and I don't know how valgrind can be used.
What else?
comment:50 follow-up: ↓ 51 Changed 8 years ago by
Which patches did you apply? With only trac715_two_tripledicts.patch
applied sage doesn't start.
comment:51 in reply to: ↑ 50 Changed 8 years ago by
Replying to vbraun:
Which patches did you apply? With only
trac715_two_tripledicts.patch
applied sage doesn't start.
What???
According to hg qapplied
, I have
trac_12057_fix_doctests.patch 9138_flat.patch trac_11319_prime_field_coercion.patch trac_11319_number_field_example.patch trac11900_category_speedup_combined.patch 11115_flat.patch trac_11115_docfix.patch trac715_two_tripledicts.patch
Remark: I work on openSUSE
, hence, I had to apply #12131 and thus also its dependency #12057. I doubt that the absence of #11115 is responsible for Sage not starting. And all other patches are dependencies.
What error occurs when you start Sage with my patch? If we are lucky, it gives some clue why the segfault in the one doctest occurs.
Best regards,
Simon
comment:52 Changed 8 years ago by
PS: I started on top of sage-4.8.alpha3.
comment:53 Changed 8 years ago by
Meanwhile I built sage-5.0.prealpha0 and applied #11780 and trac715_two_tripledicts.patch. Sage starts fine.
So, Volker, what had you have applied when Sage didn't start?
comment:54 Changed 8 years ago by
I think I made a progress: I found that the vector space that is part of the crash is not unique! So, the VectorMatrixAction
is defined for a vector space that is equal to but not identical with the vector space it is acting on!
The natural solution is to try and find out why the vector space is not unique. Vector spaces should be created using the VectorSpace
constructor, that relies on a UniqueFactory
. But apparently some very old code is constructing a vector space directly - it wouldn't be the first time that this is causing trouble.
comment:55 Changed 8 years ago by
PS: Note that vector spaces with different inner product are considered equal.
sage: V = QQ^5 sage: M = random_matrix(QQ,5,5) sage: M.set_immutable() sage: W = VectorSpace(QQ,5,inner_product_matrix=M) sage: V Vector space of dimension 5 over Rational Field sage: W Ambient quadratic space of dimension 5 over Rational Field Inner product matrix: [ 0 1/2 1 -1 -1] [ 0 0 0 1 -1/2] [ -2 0 0 0 0] [ 1 0 2 0 0] [ 0 -2 0 1 0] sage: V==W True sage: type(V)==type(W) False
But this is not the problem here: The two equal vector spaces involved in the crash have default inner product.
The non-uniqueness makes me think of another potential solution: The coercion model has a method "verify_action". This is only called when a new action is found, but not when an action is taken from the cache.
So, in addition to fixing the non-unique vector space in the modular symbols code, one could always verify the action. Probably this would be too slow, though.
comment:56 Changed 8 years ago by
Aha! We have a sparse versus a dense vector space! Here is our problem!
comment:57 Changed 8 years ago by
I did manage to install it and reproduce the crash. The core dump shows that the stack is completely corrupted before we called into gmp code.
comment:58 Changed 8 years ago by
Hi Volker,
good that you managed to install it. Meanwhile I think I can debug it without the core dump - I think mistaking a sparse with a dense vector space is a pretty convincing reason for a segfault.
However, I hate that old code!!
I tried verify_action
, but then hundreds of tests fail in sage/modular/modsym/space.py. So, apparently it is very common to have non-unique parents in such a way that the action can not be fixed!
For example, I see errors like
TypeError: Coercion of [Infinity] - [0] (of type <class 'sage.modular.modsym.boundary.BoundarySpaceElement'>) into Space of Boundary Modular Symbols for Congruence Subgroup Gamma0(43) of weight 2 and over Rational Field not (yet) defined.
Anyway, verify_action
is no solution.
comment:59 Changed 8 years ago by
comment:60 Changed 8 years ago by
Hi Jean-Pierre,
don't start ptestlong - I am about to update the new patch such that the segfault does not occur and the time for executing the test is fine and the memleak is gone!
Changed 8 years ago by
Use weak references to the keys of TripleDict
. Compare by "==" or by "is", depending on the application. Use weak references for storing actions.
comment:61 Changed 8 years ago by
- Description modified (diff)
- Status changed from needs_work to needs_review
- Work issues fix doctests deleted
See the updated patch:
Apply trac715_two_tripledicts.patch
comment:62 Changed 8 years ago by
Here some remarks on the new patch:
I use TripleDictById
for storing actions, since otherwise we have trouble with non-unique parents and get segfaults.
In addition, I do not directly store the action but only a weak reference to it, since otherwise I couldn't fix the memory leak.
Sometimes, the stored action is in fact None
, for which we can't use a weak references. Instead, I use a constant function. For technical reasons it returns False and not None (namely, this is to avoid confusion with a weak reference that has become invalid).
Features
- The segfault in
sage -t sage/modular/modsym/space.py
is gone. - The time for executing that test remains fine, namely 20.7 seconds (unpatched sage-5.0.prealpha0) versus 21.4 seconds (with patch).
- The example from the ticket description does not leak anymore!
Thus, needs, review.
comment:63 follow-up: ↓ 64 Changed 8 years ago by
Ok, I'll give the new patch a go and report after make ptestlong and checking for the memleaks.
comment:64 in reply to: ↑ 63 Changed 8 years ago by
Replying to jpflori:
Ok, I'll give the new patch a go and report after make ptestlong
So do I.
I guess at least one thing is needed: Provide a doc test that demonstrates the fix of the memory leak. This should be similar to the example for the patch that I have posted at #11521. Note that #11521 is in fact a duplicate: The examples from the two ticket descriptions are almost identical.
comment:65 Changed 8 years ago by
Actions have strong references to domain and codomain, so its no surprise that they keep their coercion cache entry alive. But I don't understand how storing a weak reference to the action can work; Nothing else keeps the action alive unless it happens to be used while the garbage collector is running. So actions are essentially not cached any more. It seem that either actions should only store weak references to domain/codomain or we implement some ring buffer that keeps the last N coerce maps unconditionally alive.
In fact, the action's reference to domain and codomain seem to be for convenience only. After all you know domain and codomain when you constuct the action and when you pick it from the cache, so there shouldn't be much incentive to look it up. Perhaps it would be easy to make them weak refs, did you look into that?
comment:66 Changed 8 years ago by
I agree with Volker and would like to test putting weak refs to domain and codomain in Functor as I suggested in #11521 and letting an option to use strong ref by default so that a user building an action but not storing its domain elsewhere won't see it disappear magically.
Unfortunately I do not have much time to do anything more than testing till the end of the week.
comment:67 Changed 8 years ago by
I wouldn't use weak references for anything but caching. In particular, having a weak reference from a functor to its domain or codomain seems a no-go to me.
In one point I agree: There should be a mechanism to keep an action alive as long as domain and codomain exist. But perhaps this is already the case? Isn't there an action cache as an attribute of any parent? And isn't the action stored there (and not only in the cache of the coercion model) when an action is discovered?
So, before thinking of a weak reference from the functor to domain and codomain, I would first test whether the problem you describe actually occurs.
comment:68 Changed 8 years ago by
Just two mental notes:
One test in sage/structure/coerce.pyx fails, because it explicitly uses the action cache (ignoring the fact that it now contains weak references and not actions).
And: The long tests of these two files
devel/sage/sage/graphs/graph_generators.py devel/sage/sage/graphs/generic_graph.py
take 10 minutes each. Is my patch to blame, or has it been like that before?
comment:69 Changed 8 years ago by
- Status changed from needs_review to needs_work
With sage4.8.alpha5 plus #9138 #11900 #1115 #715 and #11521 applied some tests fail, namely in the files:
- sage.rings.padic.padic_base_generic_element.pyx (1 test failed) but did NOT when rerun and did again the next time,
- sage.rings.number_field.number_field.py (1) and did NOT again,
- sage.structure.coerce.pyx (5) and did again,
- sage.algebras.quatalg.quaternion_algebra.py (1) and did again once and then did NOT again,
- lots of them in sage.homology.* (20+25+50+93+1) and did again.
The random behavior of some of the above tests fails with:
- IndexError?: list index out of range (padic)
- Attribute Error: QuaternionAlgebra_abstract_with_category object has no attribute _a (quatalg)
and at some point in the stack TripleDicts? of the coercion model are present.
comment:70 Changed 8 years ago by
Oops, this should be more readable:
With sage4.8.alpha5 plus #9138 #11900 #1115 #715 and #11521 applied some tests fail, namely in the files:
- sage.rings.padic.padic_base_generic_element.pyx (1 test failed) but did NOT when rerun and did again the next time,
- sage.rings.number_field.number_field.py (1) and did NOT again,
- sage.structure.coerce.pyx (5) and did again,
- sage.algebras.quatalg.quaternion_algebra.py (1) and did again once and then did NOT again,
- lots of them in sage.homology.* (20+25+50+93+1) and did again.
The random behavior of some of the above tests fails with:
- IndexError?: list index out of range (padic)
- Attribute Error: QuaternionAlgebra?_abstract_with_category object has no attribute _a (quatalg) and at some point in the stack TripleDicts? of the coercion model are present.
comment:71 Changed 8 years ago by
For info, the number_field test also fails with an "IndexError: list out of range".
comment:72 Changed 8 years ago by
- Work issues set to avoid regression
The flaky behaviour probably means that sometimes something gets garbage collected when it shouldn't.
But why do you have the patch from #11521 applied?
Note that with sage-5.0.prealpha0 + #11780 + the new patch from here, I get two tests with errors, namely
sage -t --long -force_lib devel/sage/sage/structure/coerce_dict.pyx # 1 doctests failed sage -t --long -force_lib devel/sage/sage/structure/coerce.pyx # 5 doctests failed
However, the tests took rather long in total: 12100 seconds with the new patch versus 4569 seconds unpatched.
I think the regression is not acceptable.
Well, perhaps you are right and we should experiment with weak references on domain and codomain.
comment:73 follow-up: ↓ 74 Changed 8 years ago by
Good point about #11521, I'd say because that what I was firstly interested in.
Without it applied, the flaky behavior seem to disappear.
I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.
Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.
comment:74 in reply to: ↑ 73 ; follow-ups: ↓ 75 ↓ 76 Changed 8 years ago by
Replying to jpflori:
Good point about #11521, I'd say because that what I was firstly interested in.
Without it applied, the flaky behavior seem to disappear.
Good!
I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.
OK, but I guess the timings I provided should be enough to show that the patch can not remain as it is now.
Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.
Yep. Two potential solutions:
- Find out why apparently not all actions are registered in the parent (because then we would have a strong reference as long as at least the domain is alive).
- Play with the idea to have a strong reference on the action but a weak reference from a functor to its domain and codomain.
I'm trying the latter now.
comment:75 in reply to: ↑ 74 Changed 8 years ago by
Replying to SimonKing:
Replying to jpflori:
Good point about #11521, I'd say because that what I was firstly interested in. Without it applied, the flaky behavior seem to disappear.
Good!
I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.
OK, but I guess the timings I provided should be enough to show that the patch can not remain as it is now.
Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.
Yep. Two potential solutions: 1. Find out why apparently not all actions are registered in the parent (because then we would have a strong reference as long as at least the domain is alive). 2. Play with the idea to have a strong reference on the action but a weak reference from a functor to its domain and codomain. I'm trying the latter now.
Just to summarize, here is the current problem, please correct me if some of the following is wrong: we want to let a codomain (resp. domain) get garbage collected when its only weak reffed outside of the coercion model.
Before the current patch the situation is as follows for actions:
- when an action is resolved, it is cached in a triple dict in the coercion model with the domain and codomains as keys
- the action is also cached in the dictionnaries in the domain and the codomain much in the same way
- there is also a similar cache for homsets
The current patch let weakrefs be used for the keys to the above dictionaries and a weak ref to the corresponding value (which is the action).
The problem is that as the action is only weak reffed everywhere now, it gets garbage collected all the time (to be confirmed).
If it is not, then the codomain (resp. domain) will in turn not get garbage collected, because it will be strongly reffed in the action strongly reffed in the domain (resp. codomain) (to be confirmed).
The problem for the homset patch is slightly different and is being discussed in #11521.
comment:76 in reply to: ↑ 74 Changed 8 years ago by
Replying to SimonKing:
- Find out why apparently not all actions are registered in the parent (because then we would have a strong reference as long as at least the domain is alive).
That's why:
sage: search_src("register_action") structure/parent.pyx:1698: self.register_action(action) structure/parent.pyx:1791: cpdef register_action(self, action): structure/parent.pyx:1841: sage: R.register_action(act) structure/parent.pxd:29: cpdef register_action(self, action)
So, simply register action isn't used at all - which makes me think why some actions are stored in the parent.
comment:77 Changed 8 years ago by
I see. register_action is not to be used after any coercion was established.
comment:78 follow-up: ↓ 79 Changed 8 years ago by
Just as a remark from the side lines, it seems that consistently storing a reference in the parent would be the cleanest solution. Perhaps the testsuite stuff can be used to verify that all parents do that?
comment:79 in reply to: ↑ 78 Changed 8 years ago by
Replying to vbraun:
Just as a remark from the side lines, it seems that consistently storing a reference in the parent would be the cleanest solution.
But perhaps a difficult one. The condition that register_action
must not be used after defining any coercion is probably there for a reason.
Perhaps the testsuite stuff can be used to verify that all parents do that?
How could it? By hooking into the coercion model, look up any action there and verify that all are registered?
comment:80 Changed 8 years ago by
I have posted an experimental patch, that has to be applied on top of trac715_two_tripledicts.patch.
With the experimental patch, the coercion model stores strong references to the actions (hence, it restores the original behaviour), but functors will only store weak references to their domains and codomains.
Unfortunately, this does not fix the memory leak. But perhaps you want to play with it...
Ah! And I just see that "sage.categories.functor" was the wrong location to do the change.
comment:81 Changed 8 years ago by
Or I should say: Action.__domain
is not what the action acts on, but it is a groupoid, and is not used. So, forget the experimental patch.
comment:82 follow-up: ↓ 83 Changed 8 years ago by
An action of G on S stores direct references to G and to S.
The action is a functor, and as a functor, it additionally stores a reference to Groupoid(G)
, which stores another reference to G, and to the category of S.
In some cases, the category of S will store references to the base ring of S (for example, if S is an algebra), which might have a pointer back to S (for example if the action of S.base_ring()
on S was registered during initialisation). In this case, we are lost, since categories are unique parents and thus strongly cached (unless we apply #12215, which poses some problems).
For the same reason, creating the groupoid of G will result in an eternal reference on G (Groupoid(G)
is strongly cached and it points to G). So, the best that we can hope for is that we can free S at some point, but we will never be able to free G.
It starts to be complicated. Time to call it a day...
Perhaps the idea to register actions in the parents (in addition to a weak cache in the coercion model) is better?
comment:83 in reply to: ↑ 82 ; follow-up: ↓ 86 Changed 8 years ago by
Replying to SimonKing:
An action of G on S stores direct references to G and to S. The action is a functor, and as a functor, it additionally stores a reference to
Groupoid(G)
, which stores another reference to G, and to the category of S. In some cases, the category of S will store references to the base ring of S (for example, if S is an algebra), which might have a pointer back to S (for example if the action ofS.base_ring()
on S was registered during initialisation). In this case, we are lost, since categories are unique parents and thus strongly cached (unless we apply #12215, which poses some problems). For the same reason, creating the groupoid of G will result in an eternal reference on G (Groupoid(G)
is strongly cached and it points to G). So, the best that we can hope for is that we can free S at some point, but we will never be able to free G. It starts to be complicated. Time to call it a day... Perhaps the idea to register actions in the parents (in addition to a weak cache in the coercion model) is better?
But if you store the actions in both parents (with strong references), you will never be able to free any of the two domain and codomain.
In the ticket example for example you would get a strong reference to the action in the ZZ cache (which will hopefully never get deleted) (in fact that is what is happening with the current Sage version anyway, isn't that strange according to what you posted, because I guess is already initialized once the for loop is executed?) so the elliptic curves (in the ticket example you only get one stored in that cache because comarison was made with "==", if you let the j invariant change within the for loop you would get a growing number of curves in that cache) will stay strongly refed forever as well...
comment:84 Changed 8 years ago by
comment:85 Changed 8 years ago by
I got about 3350 sec on top of vanilla sage-4.8.alpha5.
comment:86 in reply to: ↑ 83 Changed 8 years ago by
Hi Jean-Pierre,
Replying to jpflori:
But if you store the actions in both parents (with strong references), you will never be able to free any of the two domain and codomain.
This is not necessarily the case. You would merely get circular references, and they would not obstruct garbage collection, unless one item in the cycle has a __del__
method.
One problem, however, is that many actions start with ZZ
. And if ZZ
is contained in the cycle, then it can not be collected, since ZZ
will live forever -- but you know that.
In the ticket example for example you would get a strong reference to the action in the ZZ cache (which will hopefully never get deleted) (in fact that is what is happening with the current Sage version anyway, isn't that strange according to what you posted, because I guess is already initialized once the for loop is executed?)
Yes. And is it really sure that the actions are stored in ZZ
?
Anyway. They are stored by ==
, and thus only one copy remains alive.
so the elliptic curves (in the ticket example you only get one stored in that cache because comarison was made with "==", if you let the j invariant change within the for loop you would get a growing number of curves in that cache) will stay strongly refed forever as well...
Yes. And that is a problem that, again, might be solved using weak references.
Namely:
Consider an action A of G on S. Typically, G is immortal (like ZZ
), but we are willing to let A and S die if we do not have any "external" strong reference to S. In particular, the existence of A should not be enough to keep S alive.
I think this can be accomplished as follows:
- For quick access and for backwards compatibility, we want that actions remain stored in the coercion model. We use weak references to the keys (G,S), but a strong reference to the action (this is what the previous version of trac715_two_tripledicts.patch did).
- In addition to that, A should only have a weak reference to S; I think it doesn't matter whether the reference from A to G is strong or weak.
Let us analyse what happens with G, S and A:
- G will remain alive forever, even without an external reference. Namely, the coercion cache has a strong reference to A; as a functor, A points to
Groupoid(G)
;Groupoid(G)
is strongly cached (unless we use weak caching forUniqueRepresentation
) and must have a reference to G. If we decide to use weak caching forUniqueRepresentation
, then we would only have a strong reference from G to A and a weak or strong reference from A to G. That would be fine for garbage collection. Anyway, I think keeping G alive will not hurt. - Assume that there is no external reference to S. There is a weak reference to S from the cache in the coercion model, namely as key of the cache. Moreover, there is another weak reference from A to S. Hence, S could be garbage collected.
- Assume that there is no external reference to A. If S is garbage collected (see the previous point), then it will remove itself from the coercion cache, and thus the strong reference to A would vanish - it could be collected. But if S is alive, then A will remain alive as well.
However, this is how the experimental patch should work - and it does not fix the leak. Perhaps this is, again, due to caching the homsets? So, we would need the patch from #12215 as well. Difficult topic.
comment:87 Changed 8 years ago by
Sorry, I meant "we would need the patch from #11521 as well".
comment:88 Changed 8 years ago by
- Description modified (diff)
I have attached another patch, which implements the ideas sketched above. I think it corresponds to what you suggested ("use a weak reference from the action to the domain").
One detail: We have to distinguish between the underlying set, the domain and the codomain of an action. In fact, the new patch only uses a weak reference to the underlying set, and introduces a cdef function (hence, hopefully with little overhead) returning it.
I consider sage-5.0.prealpha0 plus trac11780_unique_auxiliar_polyring.patch (probably not needed) plus trac715_two_tripledicts.patch plus trac715_weak_action.patch.
At least sage -t sage/modular/modsym/space.py
passes, but I need to run the whole test suite.
The example from the ticket description does not leak. However, if the j-invariant varies, it seems that for each elliptic curve one copy is preserved:
sage: K = GF(1<<55,'t') sage: for i in range(500): ....: a = K.random_element() ....: E = EllipticCurve(j=a) ....: P = E.random_point() ....: Q = 2*P ....: sage: import gc sage: gc.collect() 2124 sage: from sage.schemes.generic.homset import SchemeHomsetModule_abelian_variety_coordinates_field sage: LE = [x for x in gc.get_objects() if isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)] sage: len(LE) 500
In any case, the original leak is fixed with the two patches. The question is whether the second patch suffices to keep actions alive, whether it avoids a regression, and whether all tests pass.
If everything is alright, we may still try to find out where the remaining strong reference to an elliptic curve comes from.
I think this is a bit too vague for a ticket. Robert, could you be more specific or close this?