Opened 13 years ago

Last modified 7 years ago

#715 closed defect

Parents probably not reclaimed due to too much caching — at Version 166

Reported by: robertwb Owned by: somebody
Priority: major Milestone: sage-5.5
Component: coercion Keywords: weak cache coercion Cernay2012
Cc: jpflori, zimmerma, vbraun, robertwb, nbruin, malb, mjo Merged in:
Authors: Simon King Reviewers: Jean-Pierre Flori
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: #9138, #11900, #11599, #11521 Stopgaps:

Description (last modified by davidloeffler)

Here is a small example illustrating the issue.

The memory footprint of the following piece of code grows indefinitely.

sage: K = GF(1<<55,'t') 
sage: a = K.random_element() 
sage: while 1: 
....:     E = EllipticCurve(j=a); P = E.random_point(); 2*P; del E, P;

E and P get deleted, but when 2*P is computed, the action of the integers on A, the abelian group of rational points of the elliptic curve, gets cached in the coercion model.

A key-value pair is left in coercion_model._action_maps dict:

(ZZ,A,*) : IntegerMulAction

Moreover, there is at least one more reference to A in the IntegerMulAction and one in ZZ._action_hash.

So weak references should be used in all these places, provided this does not slow things down too much.
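The intended effect can be sketched in plain Python (a toy model, not Sage's actual TripleDict; the Parent class and probe helper below are made up for illustration): an entry in an ordinary dict keeps its key alive indefinitely, while a weakly-keyed entry lets the garbage collector reclaim the parent once nothing else references it.

```python
import gc
import weakref

class Parent:
    """Stand-in for a Sage parent (hypothetical, for illustration)."""

def probe(make_cache):
    # Cache a value under the parent, drop the last strong reference,
    # and report whether the parent was reclaimed.
    cache = make_cache()
    p = Parent()
    cache[p] = "cached action"
    r = weakref.ref(p)
    del p
    gc.collect()
    return r() is None

reclaimed_strong = probe(dict)                     # False: the cache pins the parent
reclaimed_weak = probe(weakref.WeakKeyDictionary)  # True: entry dies with the parent
```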

To be merged with #11521. Apply:

and then the tickets from #11521.

Change History (175)

comment:1 Changed 13 years ago by mabshoff

  • Milestone set to sage-feature

comment:2 Changed 12 years ago by mabshoff

  • Milestone changed from sage-feature to sage-2.10.2

comment:3 Changed 11 years ago by mhansen

I think this is a bit too vague for a ticket. Robert, could you be more specific or close this?

comment:4 Changed 11 years ago by robertwb

The coercion model needs to use weakrefs so that parents aren't needlessly referenced when they're discarded. It is nontrivial to see where the weakrefs need to go, and how to do so without slowing the code down.

The ticket is still valid.

comment:5 Changed 9 years ago by davidloeffler

  • Component changed from basic arithmetic to coercion
  • Description modified (diff)
  • Report Upstream set to N/A

comment:6 Changed 9 years ago by jpflori

  • Cc jpflori added

comment:7 Changed 9 years ago by jpflori

  • Description modified (diff)

With the piece of code in the description, there is only one reference to these objects in that ZZ._hash_actions dictionary, because to build it we test whether A1 == A2 rather than A1 is A2 (as in coercion_model._action_maps), and because of the current implementation of elliptic curves, see http://groups.google.com/group/sage-nt/browse_thread/thread/ec8d0ad14a819082 and #11474. And because the above code uses only one j-invariant, only one of them finally gets stored.

However with random curves, I guess there would be all of them.

As for the weak references, the idea would rather be to build something like a WeakKeyDictionary, if it does not slow down coercion too much...

comment:8 Changed 9 years ago by nbruin

The following example also exhibits a suspicious, steady growth in memory use. The only reason I can think of why that would happen is that references to the created finite field remain lying around somewhere, preventing deallocation:

sage: L=prime_range(10^8)
sage: for p in L: k=GF(p)

If you change it to the creation of a polynomial ring the memory use rises much faster:

sage: L=prime_range(10^8)
sage: for p in L: k=GF(p)['t']

Are "unique" parents simply *never* deallocated?

comment:9 Changed 9 years ago by jpflori

Be aware that polynomial rings are also cached because of the uniqueness of parents, which somewhat explains your second observation on memory consumption; see #5970 for example.

For finite fields I did not check.

comment:10 Changed 8 years ago by zimmerma

  • Cc zimmerma added

comment:11 Changed 8 years ago by jpflori

See #11521 for some concrete instances of this problem and some advice to investigate it.

comment:12 Changed 8 years ago by SimonKing

In my code for the computation of Ext algebras of basic algebras, I use letterplace algebras (see #7797), and they involve the creation of many polynomial rings. Only one of them is used at a time, so the others could be garbage collected. But they aren't, and I suspect this is because of the strong references in the coercion cache.

See the following example (using #7797)

sage: F.<a,b,c> = FreeAlgebra(GF(4,'z'), implementation='letterplace')
sage: import gc
sage: len(gc.get_objects())
170947
sage: a*b*c*b*c*a*b*c
a*b*c*b*c*a*b*c
sage: len(gc.get_objects())
171556
sage: del F,a,b,c
sage: gc.collect()
81
sage: len(gc.get_objects())
171448
sage: cm = sage.structure.element.get_coercion_model()
sage: cm.reset_cache()
sage: gc.collect()
273
sage: len(gc.get_objects())
171108

That is certainly not a proof of my claim, but it indicates that it might be worthwhile to investigate.

In order to facilitate work, I am providing some other tickets that may be related to this:

I guess that one should use a similar cache model to what I did in #11521: The key for the cache should not just be (domain,codomain), because we want that garbage collection of the cache item is already allowed if just one of domain or codomain is collectable.

comment:13 Changed 8 years ago by SimonKing

I am trying to wrap my mind around weak references. I found that when creating a weak reference, one can also provide a callback that is called when the weak reference becomes invalid. I propose to use such a callback to erase the deleted object from the cache, regardless of whether it appears as domain or codomain.

Here is a proof of concept:

sage: ref = weakref.ref
sage: D = {}
sage: def remove(x):
....:     for a,b,c in D.keys():
....:         if a is x or b is x or c is x:
....:             D.__delitem__((a,b,c))
....:             
sage: class A:
....:     def __init__(self,x):
....:         self.x = x
....:     def __repr__(self):
....:         return str(self.x)
....:     def __del__(self):
....:         print "deleting",self.x
....:         
sage: a = A(5)
sage: b = A(6)
sage: r = ref(a,remove)
sage: s = ref(b,remove)
sage: D[r,r,s] = 1
sage: D[s,r,s] = 2
sage: D[s,s,s] = 3
sage: D[s,s,1] = 4
sage: D[r,s,1] = 5
sage: D.values()
[5, 3, 1, 4, 2]
sage: del a
deleting 5
sage: D.values()
[4, 3]
sage: del b
deleting 6
sage: D.values()
[]

comment:14 Changed 8 years ago by SimonKing

It turns out that using weak references in the coercion cache will not be enough. Apparently there are other direct references that have to be dealt with.

comment:15 Changed 8 years ago by SimonKing

I wonder whether the problem has already been solved. I just tested the example from the ticket description, and get (at least with #11900, #11521 and #11115):

sage: K = GF(1<<55,'t')
sage: a = K.random_element()
sage: m0 = get_memory_usage()
sage: for i in range(1000):
....:     E = EllipticCurve(j=a); P = E.random_point(); PP = 2*P
....:     
sage: get_memory_usage() - m0
15.22265625

I think that this is not particularly scary. I'll repeat the test with vanilla sage-4.8.alpha3, but this will take a while to rebuild.

comment:16 Changed 8 years ago by SimonKing

No, even in vanilla sage-4.8.alpha3 I don't find a scary memory leak in this example.

Do we have a better example? One could, of course, argue that one should use weak references for caching even if we do not find an apparent memory leak. I am preparing a patch for it now.

comment:17 Changed 8 years ago by SimonKing

  • Cc vbraun added

Here is an experimental patch.

A new test shows that the weak caching actually works.

Note that the patch also introduces a weak cache for polynomial rings, which might be better to put into #5970. Well, we can sort things out later...

comment:18 Changed 8 years ago by SimonKing

It needs work, though. Some tests in sage/structure fail, partially because of pickling, partially because some tests do not follow the new specification of TripleDict (namely that the first two parts of each key triple and the associated value must be weak referenceable).

comment:19 Changed 8 years ago by SimonKing

Now I wonder: Should I try to use weak references and make it accept stuff that does not allow for weak references?

In the intended applications, weak references are possible. But in some tests and in the pickle jar, the "wrong" type of keys (namely strings and ints) are used.

comment:20 Changed 8 years ago by SimonKing

The only place where the weak references are created is in the set(...) method of TripleDict. I suggest to simply catch the error that may occur when creating a weak reference, and then use a different way of storing the key. I am now running tests, and I hope that this ticket will be "needs review" in a few hours.
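The fallback described here can be sketched as follows (a hypothetical helper, not the actual TripleDict.set code): attempt to create a weak reference, and for types such as str or int that reject weak references, store the key itself.

```python
import weakref

class Parent:
    """Stand-in for a weak-referenceable parent (illustration only)."""

def wrap_key(k):
    # Prefer a weak reference; fall back to storing the key directly
    # when its type does not support weak referencing (e.g. str, int).
    try:
        return weakref.ref(k)
    except TypeError:
        return k

p = Parent()
wrapped = [wrap_key(k) for k in (p, "some string key", 42)]
assert isinstance(wrapped[0], weakref.ref)  # parent: weakly referenced
assert wrapped[1] == "some string key"      # str: stored directly
assert wrapped[2] == 42                     # int: stored directly
```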

comment:21 Changed 8 years ago by SimonKing

  • Authors set to Simon King
  • Keywords weak cache coercion added
  • Status changed from new to needs_review

With the attached patch, all tests pass for me, and the new features are doctested. Needs review!

comment:22 Changed 8 years ago by SimonKing

  • Dependencies set to #11900

It turns out that this patch only cleanly applies after #11900. So, I introduce #11900 as a dependency. My statement on "doctests passing" was with #11900 anyway.

comment:23 Changed 8 years ago by zimmerma

I was able to apply this patch to vanilla 4.7.2. Should I continue reviewing it like this?

Paul

comment:24 Changed 8 years ago by zimmerma

on top of vanilla 4.7.2 several doctests fail:

        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/calculus/interpolators.pyx # 0 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/databases/database.py # 15 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/finance/time_series.pyx # 0 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/graph_list.py # 4 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/graph_database.py # 28 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/graph.py # 6 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/generic_graph.py # 4 doctests failed

        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/matrix/matrix2.pyx # 3 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/hecke/hecke_operator.py # 1 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/hecke/ambient_module.py # 2 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/subspace.py # 6 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/boundary.py # 3 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/space.py # 3 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/modsym.py # 1 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/ambient.py # 11 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/abvar/abvar.py # 0 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/schemes/elliptic_curves/heegner.py # 9 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/sandpiles/sandpile.py # Time out

Paul

comment:25 Changed 8 years ago by SimonKing

I'll try again on top of vanilla sage-4.8.alpha3. You are right, the patch does apply (almost) cleanly even without #11900. That surprises me, because at some point there was an inconsistency.

Hopefully I can see later today whether I get the same errors as you.

comment:26 Changed 8 years ago by SimonKing

  • Dependencies #11900 deleted

It turns out that #11900 is indeed not needed.

I can not reproduce any of the errors you mention.

Moreover, the file "sage/devel/sage/databases/database.py", for which you reported an error, does not exist in vanilla sage (not in 4.7.2 and not in 4.8.alpha3).

Did you test other patches before returning to vanilla 4.7.2? Namely, when a patch changes a module from python to cython, and one wants to remove the patch, then it is often needed to also remove any reference to the cython module in build/sage/... and in build/*/sage/.... For example, when I had #11115 applied and want to remove it again, then I would do rm build/sage/misc/cachefunc.* and rm build/*/sage/misc/cachefunc.*.

comment:27 follow-up: Changed 8 years ago by zimmerma

yes I tried other patches (#10983, #8720, #10596) before #715, but each one with a different clone.

Paul

comment:28 in reply to: ↑ 27 Changed 8 years ago by SimonKing

Replying to zimmerma:

yes I tried other patches (#10983, #8720, #10596) before #715, but each one with a different clone.

But where does the databases/database.py file come from?

And could you post one or two examples for the errors you are getting (i.e. not just which files are problematic, but what commands exactly fail)?

comment:29 Changed 8 years ago by SimonKing

FWIW: I started with sage-4.8.alpha3, have #9138, #11900 and #715 applied, and all doctests pass. I don't know why the patchbot isn't even trying (although it says "retry: True"), but from my point of view, everything is alright.

comment:30 Changed 8 years ago by SimonKing

I have simplified the routine that removes cache items when a weak reference became invalid. The tests all pass for me.

Apply trac715_weak_coercion_cache.patch

comment:31 Changed 8 years ago by vbraun

  • Dependencies set to #9138, #11900

comment:32 Changed 8 years ago by SimonKing

One question: Currently, my patch uses weak references only for the first two parts of the key. Should it also use weak references to the value, when possible?

By "when possible", I mean that not all values allow weak references - if it is possible then a weak reference is used, otherwise a strong reference is used. This might contribute to fixing the memory leak in #11521, but it might have a speed penalty.

Concerning #11521: The point is that an action (which currently does not allow weak references, but that might change) has a strong reference to the objects that are used for storing it in the cache. Hence, an action is not collectable with the current patch.
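The obstruction can be reproduced in a few lines of plain Python (a toy model; the Parent and Action classes are made up for illustration): once the cached value holds a strong reference back to its weakly-referenced key, the entry, and hence the action, can never be collected.

```python
import gc
import weakref

class Parent:
    """Stand-in for a Sage parent (illustration only)."""

class Action:
    def __init__(self, domain):
        self.domain = domain  # strong reference back to the cache key

cache = weakref.WeakKeyDictionary()
p = Parent()
cache[p] = Action(p)  # the value keeps the key alive
r = weakref.ref(p)
del p
gc.collect()
assert r() is not None  # the parent survives despite the weak key
```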

Thoughts?

Changed 8 years ago by SimonKing

Use weak references in the coercion cache

comment:33 Changed 8 years ago by SimonKing

I have slightly updated some of the new examples: In the old patch version, I had created TripleDict(10), but meanwhile I learnt that the given parameter should better be odd (actually a prime). So, in the new patch version, it is TripleDict(11).

comment:34 Changed 8 years ago by SimonKing

  • Status changed from needs_review to needs_work
  • Work issues set to Comparison of the third key items

I think I need to modify one detail:

For efficiency, and since the domain/codomain of a map must be identical to (and not just equal to) the given keys, my patch compares them by "is" rather than "==". But I think one should still compare the third item of a key via "==" and not "is". I need to do some tests first...

comment:35 Changed 8 years ago by SimonKing

It really is not an easy question whether we should use "is" or "==".

On the one hand, we have the following lines in sage/structure/coerce.pyx:

!python
            if y_mor is not None:
                all.append("Coercion on right operand via")
                all.append(y_mor)
                if res is not None and res is not y_mor.codomain():
                    raise RuntimeError, ("BUG in coercion model: codomains not equal!", x_mor, y_mor)

which seem to imply that comparison via "is" is the right thing to do.

But in the same file, the coercion model copes with the fact that some parents are not unique:

!python
        # Make sure the domains are correct
        if R_map.domain() is not R:
            if fix:
                connecting = R_map.domain().coerce_map_from(R)
                if connecting is not None:
                    R_map = R_map * connecting
            if R_map.domain() is not R:
                raise RuntimeError, ("BUG in coercion model, left domain must be original parent", R, R_map)
        if S_map is not None and S_map.domain() is not S:
            if fix:
                connecting = S_map.domain().coerce_map_from(S)
                if connecting is not None:
                    S_map = S_map * connecting
            if S_map.domain() is not S:
                raise RuntimeError, ("BUG in coercion model, right domain must be original parent", S, S_map)

That would suggest that comparison by "==" (the old behaviour of TripleDict) is fine.

Perhaps we should actually have two variants of TripleDict, one using "is" and one using "==".

Note another detail of sage/structure/coerce.pyx: We have

    cpdef verify_action(self, action, R, S, op, bint fix=True):

but

    cpdef verify_coercion_maps(self, R, S, homs, bint fix=False):

Note the different default value for "fix". If "fix" is True then the coercion model tries to cope with non-unique parents by prepending a conversion between the two equal copies of a parent.

Since the default is to fix non-unique parents for actions, but not for coercion maps, I suggest that a "=="-TripleDict should be used for actions and an "is"-TripleDict for coercions.

comment:36 Changed 8 years ago by jpflori

I guess a choice has to be made, and it should at least be as consistent as possible. What you propose makes sense to me, is not too far from the current model, and gives a little more consistency. Moreover, once both TripleDicts have been implemented, changing our mind later will be trivial.

comment:37 Changed 8 years ago by SimonKing

There is another detail. Even in the old version of TripleDict, we have

    It is implemented as a list of lists (hereafter called buckets). The bucket
    is chosen according to a very simple hash based on the object pointer,
    and each bucket is of the form [k1, k2, k3, value, k1, k2, k3, value, ...],
    on which a linear search is performed.

So, the choice of a bucket is based on the object pointer - but then it is not consistent to compare by "==".

comment:38 Changed 8 years ago by SimonKing

To be precise: the old behaviour was not consistent. The bucket depended on id(k1), id(k2), id(k3), but the comparison was by "==" rather than by "is".

Experimentally, I will provide two versions of TripleDict, one using "hash" for determining the bucket and doing comparison by "==", the other using "id" for determining the bucket and doing comparison by "is".
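The inconsistency can be illustrated in plain Python (a toy model of the bucket logic, not the Cython implementation): two keys that compare equal under == are still distinguishable by id(), so id-based indexing and ==-based search disagree.

```python
k1 = (1, 2)
k2 = tuple([1, 2])  # equal to k1, but a distinct object
assert k1 == k2 and k1 is not k2

# ==-based lookup (the old linear search through a bucket) finds the entry:
store_eq = {k1: "value"}
assert k2 in store_eq

# id-based lookup (how the old code chose the bucket) misses it:
store_id = {id(k1): "value"}
assert id(k2) not in store_id
```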

comment:39 Changed 8 years ago by SimonKing

  • Work issues changed from Comparison of the third key items to fix doctests

As announced, I have attached an experimental patch. It provides two variants of TripleDict, namely using "==" or "is" for comparison, respectively. Both are used, namely for caching coerce maps or actions, respectively.

It could be that a last-minute change was interfering, but I am confident that all but the following three tests pass:

        sage -t  devel/sage-main/doc/en/bordeaux_2008/nf_introduction.rst # 1 doctests failed
        sage -t  devel/sage-main/sage/modular/modsym/space.py # Killed/crashed
        sage -t  devel/sage-main/sage/structure/coerce_dict.pyx # 3 doctests failed

The memory leak exposed in the ticket description is fixed (more or less):

sage: K = GF(1<<55,'t')
sage: a = K.random_element()
sage: for i in range(500):
....:     E = EllipticCurve(j=a)
....:     P = E.random_point()
....:     Q = 2*P
....:     
sage: import gc
sage: gc.collect()
862
sage: from sage.schemes.generic.homset import SchemeHomsetModule_abelian_variety_coordinates_field
sage: LE = [x for x in gc.get_objects() if  isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)]
sage: len(LE)
2

I am not sure whether this makes #11521 redundant.

For now, it is "needs work", because of the doctests. But you can already play with the patch.

comment:40 Changed 8 years ago by SimonKing

Sorry, only TWO doctests should fail: The tests of sage/structure/coerce_dict.pyx are, of course, fixed.

comment:41 Changed 8 years ago by SimonKing

The segfault in sage -t devel/sage-main/sage/modular/modsym/space.py seems difficult to debug.

Inspecting a core dump with gdb did not help at all:

(gdb) bt
#0  0x00007f61d12ca097 in kill () from /lib64/libc.so.6
#1  0x00007f61d0044a40 in sigdie () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libcsage.so
#2  0x00007f61d0044646 in sage_signal_handler () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libcsage.so
#3  <signal handler called>
#4  0x00007f61cf080520 in mpn_submul_1 () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libgmp.so.8
#5  0x00007f61cf0b4f0f in __gmpn_sb_bdiv_q () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libgmp.so.8
#6  0x00007f61cf0b6428 in __gmpn_divexact () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libgmp.so.8
#7  0x00007f61ccbf4d64 in ?? ()
...
#191 0x55c0ade81d9aeecf in ?? ()
#192 0xffffe4b8b6920b7b in ?? ()
#193 0x000000000ac854cf in ?? ()
#194 0x0000000000000000 in ?? ()

How could one proceed? What other debugging techniques can you recommend?

comment:42 follow-up: Changed 8 years ago by vbraun

Looks like you did not tell gdb about the executable you were running. You should run

gdb --core=<corefile> $SAGE_LOCAL/bin/python

comment:43 in reply to: ↑ 42 Changed 8 years ago by SimonKing

Replying to vbraun:

Looks like you did not tell gdb about the executable you were running.

No, I did tell it. I did

gdb --core=715doublecore ~/SAGE/sage-4.8.alpha3/local/bin/python

Should I do it inside a Sage shell?

comment:44 Changed 8 years ago by SimonKing

No, doing the same inside a sage shell did not help either.

comment:45 Changed 8 years ago by SimonKing

I am now printing some debugging information into a file, which hopefully means that I am coming closer to the source of the problem. The segfault arises in line 2165 of sage/modular/modsym/space.py

comment:46 Changed 8 years ago by SimonKing

Sorry, it was the wrong line number.

comment:47 Changed 8 years ago by SimonKing

Meanwhile I am rather desperate: I have not the faintest idea how the segfault occurs.

Therefore I used some debugging function that I registered using sys.settrace(...), so that all Python commands in the critical example are written into a file.

I posted logs for the unpatched and the patched version.

There is one obvious difference between the two logs: the hash is called more often in the patched version. Calling the hash is rather inefficient for matrix spaces: each time the hash of a matrix space is called, the matrix space's string representation is created, which is slow. I suggest caching the hash value (like what I did for polynomial rings in #9944), but this should be on a different ticket.

Apart from that, I can't spot an obvious difference. Do you have any clue?

comment:48 Changed 8 years ago by SimonKing

It turns out that using TripleDictById for the _action_maps cache makes the segfault disappear.

If one uses TripleDict for _coercion_maps then

sage -t  devel/sage-main/sage/modular/modsym/space.py

takes 30 seconds, but if one also uses TripleDictById then it only takes 23 seconds.

My conclusion:

  • The old version of TripleDict was buggy: It uses id(...) for the hash table, but == for comparison. I think that had to be fixed.
  • The new version of TripleDict uses hash(...) for the hash table and == for comparison. That should be fine, but (1) it leads to a segfault and (2) it leads to a slowdown. After all, calling hash(...) is a lot slower than determining the address.
  • The new TripleDictById uses id(...) for the hash table and "is" for comparison. Problem: it would probably not fix the memory leak.

However, the fact that using TripleDictById fixes the segfault makes me wonder: perhaps the segfault occurs when calling hash(...) on a parent? Namely, in some cases, an action will already be constructed during the initialisation of a parent. But if the hash is determined based on cdef data that aren't initialised yet, a segfault can easily occur.

I'll investigate that further. In any case, we need to keep an eye on the potential slow-down.

comment:49 Changed 8 years ago by SimonKing

The segfault does not occur while computing a hash. It occurs in line 468 of sage/matrix/matrix_rational_dense.pyx, namely

                mpq_mul(y, w._entries[j], self._matrix[j][i])

I also tested, just before that line, that w[j] and self.get_unsafe(j,i) (which access w._entries[j] and self._matrix[j][i]) work.

At this point, I am at my wits' end. To me, it looks like a change in the way dictionary keys are compared modifies internals of mpir (IIRC, this is where mpq_mul is defined). gdb cannot decipher the core file, and I don't know how valgrind can be used.

What else?

comment:50 follow-up: Changed 8 years ago by vbraun

Which patches did you apply? With only trac715_two_tripledicts.patch applied sage doesn't start.

comment:51 in reply to: ↑ 50 Changed 8 years ago by SimonKing

Replying to vbraun:

Which patches did you apply? With only trac715_two_tripledicts.patch applied sage doesn't start.

What???

According to hg qapplied, I have

trac_12057_fix_doctests.patch
9138_flat.patch
trac_11319_prime_field_coercion.patch
trac_11319_number_field_example.patch
trac11900_category_speedup_combined.patch
11115_flat.patch
trac_11115_docfix.patch
trac715_two_tripledicts.patch

Remark: I work on openSUSE, hence, I had to apply #12131 and thus also its dependency #12057. I doubt that the absence of #11115 is responsible for Sage not starting. And all other patches are dependencies.

What error occurs when you start Sage with my patch? If we are lucky, it gives some clue why the segfault in the one doctest occurs.

Best regards,

Simon

comment:52 Changed 8 years ago by SimonKing

PS: I started on top of sage-4.8.alpha3.

comment:53 Changed 8 years ago by SimonKing

Meanwhile I built sage-5.0.prealpha0 and applied #11780 and trac715_two_tripledicts.patch. Sage starts fine.

So, Volker, what had you have applied when Sage didn't start?

comment:54 Changed 8 years ago by SimonKing

I think I made some progress: I found that the vector space that is involved in the crash is not unique! So, the VectorMatrixAction is defined for a vector space that is equal to, but not identical with, the vector space it is acting on!

The natural solution is to find out why the vector space is not unique. Vector spaces should be created using the VectorSpace constructor, which relies on a UniqueFactory. But apparently some very old code constructs a vector space directly - it wouldn't be the first time that this causes trouble.

comment:55 Changed 8 years ago by SimonKing

PS: Note that vector spaces with different inner product are considered equal.

sage: V = QQ^5
sage: M = random_matrix(QQ,5,5)
sage: M.set_immutable()
sage: W = VectorSpace(QQ,5,inner_product_matrix=M)
sage: V
Vector space of dimension 5 over Rational Field
sage: W
Ambient quadratic space of dimension 5 over Rational Field
Inner product matrix:
[   0  1/2    1   -1   -1]
[   0    0    0    1 -1/2]
[  -2    0    0    0    0]
[   1    0    2    0    0]
[   0   -2    0    1    0]
sage: V==W
True
sage: type(V)==type(W)
False

But this is not the problem here: The two equal vector spaces involved in the crash have default inner product.

The non-uniqueness makes me think of another potential solution: The coercion model has a method "verify_action". This is only called when a new action is found, but not when an action is taken from the cache.

So, in addition to fixing the non-unique vector space in the modular symbols code, one could always verify the action. Probably this would be too slow, though.

comment:56 Changed 8 years ago by SimonKing

Aha! We have a sparse versus a dense vector space! Here is our problem!

comment:57 Changed 8 years ago by vbraun

I did manage to install it and reproduce the crash. The core dump shows that the stack is completely corrupted before we called into gmp code.

comment:58 Changed 8 years ago by SimonKing

Hi Volker,

good that you managed to install it. Meanwhile I think I can debug it without the core dump - I think mistaking a sparse for a dense vector space is a pretty convincing reason for a segfault.

However, I hate that old code!!

I tried verify_action, but then hundreds of tests fail in sage/modular/modsym/space.py. So, apparently it is very common to have non-unique parents in such a way that the action can not be fixed!

For example, I see errors like

    TypeError: Coercion of [Infinity] - [0] (of type <class 'sage.modular.modsym.boundary.BoundarySpaceElement'>) into Space of Boundary Modular Symbols for Congruence Subgroup Gamma0(43) of weight 2 and over Rational Field not (yet) defined.

Anyway, verify_action is no solution.

comment:59 Changed 8 years ago by jpflori

Hi all,

Just wanted to say I had no problem installing the new patch on top of sage-4.8.alpha5 with tickets #9138 #11900 #1115 #715 and #11521 in that order, and Sage launches. I'll start a make ptestlong now.

comment:60 Changed 8 years ago by SimonKing

Hi Jean-Pierre,

don't start ptestlong - I am about to update the new patch so that the segfault does not occur, the time for executing the test stays fine, and the memleak is gone!

Changed 8 years ago by SimonKing

Use weak references to the keys of TripleDict. Compare by "==" or by "is", depending on the application. Use weak references for storing actions.

comment:61 Changed 8 years ago by SimonKing

  • Description modified (diff)
  • Status changed from needs_work to needs_review
  • Work issues fix doctests deleted

See the updated patch:

Apply trac715_two_tripledicts.patch

comment:62 Changed 8 years ago by SimonKing

Here some remarks on the new patch:

I use TripleDictById for storing actions, since otherwise we have trouble with non-unique parents and get segfaults.

In addition, I do not directly store the action but only a weak reference to it, since otherwise I couldn't fix the memory leak.

Sometimes, the stored action is in fact None, for which we can't use a weak reference. Instead, I use a constant function. For technical reasons it returns False and not None (namely, to avoid confusion with a weak reference that has become invalid).
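The sentinel trick can be sketched in plain Python (a hypothetical store helper, not the actual coercion-model code): weakref.ref(None) raises TypeError, so a constant function returning False stands in for "the cached action is None" and stays distinguishable from a dead weak reference, which returns None when called.

```python
import gc
import weakref

class Action:
    """Stand-in for a discovered action (illustration only)."""

_NO_ACTION = lambda: False  # sentinel: "the cached action is None"

def store(value):
    # Store a weak reference when possible; use the sentinel for None.
    if value is None:
        return _NO_ACTION
    return weakref.ref(value)

a = Action()
live = store(a)
cached_none = store(None)
dead = store(Action())  # nothing else keeps this action alive
gc.collect()

assert live() is a              # live action: reference still resolves
assert cached_none() is False   # "None was cached", not a dead reference
assert dead() is None           # dead reference: the action was collected
```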

Features

  • The segfault in sage -t sage/modular/modsym/space.py is gone.
  • The time for executing that test remains fine, namely 20.7 seconds (unpatched sage-5.0.prealpha0) versus 21.4 seconds (with patch).
  • The example from the ticket description does not leak anymore!

Thus, needs review.

comment:63 follow-up: Changed 8 years ago by jpflori

Ok, I'll give the new patch a go and report after make ptestlong and checking for the memleaks.

comment:64 in reply to: ↑ 63 Changed 8 years ago by SimonKing

Replying to jpflori:

Ok, I'll give the new patch a go and report after make ptestlong

So do I.

I guess at least one thing is needed: Provide a doc test that demonstrates the fix of the memory leak. This should be similar to the example for the patch that I have posted at #11521. Note that #11521 is in fact a duplicate: The examples from the two ticket descriptions are almost identical.

comment:65 Changed 8 years ago by vbraun

Actions have strong references to domain and codomain, so it's no surprise that they keep their coercion cache entry alive. But I don't understand how storing a weak reference to the action can work; nothing else keeps the action alive unless it happens to be used while the garbage collector is running. So actions are essentially not cached any more. It seems that either actions should only store weak references to domain/codomain, or we implement some ring buffer that keeps the last N coerce maps unconditionally alive.

In fact, the action's references to domain and codomain seem to be for convenience only. After all, you know domain and codomain when you construct the action and when you pick it from the cache, so there shouldn't be much incentive to look them up. Perhaps it would be easy to make them weak refs; did you look into that?
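The objection can be demonstrated in a few lines of plain Python (illustrative stand-ins, not Sage classes): if the cache holds only a weak reference to the action, the action dies as soon as no other strong reference exists, so the cache is effectively empty.

```python
import gc
import weakref

class Parent:
    """Hypothetical stand-in for a Sage parent."""

class Action:
    def __init__(self, G, S):
        # strong references to domain and codomain, as described above
        self.G = G
        self.S = S

G, S = Parent(), Parent()
A = Action(G, S)

# The coercion model now only keeps a weak reference to the action:
cached_action = weakref.ref(A)

del A            # no other strong reference to the action exists...
gc.collect()
assert cached_action() is None  # ...so the cached entry is already dead
```

This is exactly the "actions are essentially not cached any more" scenario: every lookup after a collection would have to recompute the action.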

comment:66 Changed 8 years ago by jpflori

I agree with Volker and would like to test putting weak refs to domain and codomain in Functor, as I suggested in #11521, while leaving an option to use strong refs by default, so that a user who builds an action but does not store its domain elsewhere won't see it disappear magically.

Unfortunately I do not have much time to do anything more than testing till the end of the week.

comment:67 Changed 8 years ago by SimonKing

I wouldn't use weak references for anything but caching. In particular, having a weak reference from a functor to its domain or codomain seems a no-go to me.

In one point I agree: There should be a mechanism to keep an action alive as long as domain and codomain exist. But perhaps this is already the case? Isn't there an action cache as an attribute of any parent? And isn't the action stored there (and not only in the cache of the coercion model) when an action is discovered?

So, before thinking of a weak reference from the functor to domain and codomain, I would first test whether the problem you describe actually occurs.

comment:68 Changed 8 years ago by SimonKing

Just two mental notes:

One test in sage/structure/coerce.pyx fails, because it explicitly uses the action cache (ignoring the fact that it now contains weak references and not actions).

And: The long tests of these two files

devel/sage/sage/graphs/graph_generators.py
devel/sage/sage/graphs/generic_graph.py

take 10 minutes each. Is my patch to blame, or has it been like that before?

comment:69 Changed 8 years ago by jpflori

  • Status changed from needs_review to needs_work

With sage-4.8.alpha5 plus #9138, #11900, #11115, #715 and #11521 applied, some tests fail, namely in the files:

  • sage.rings.padic.padic_base_generic_element.pyx (1 test failed) but did NOT when rerun and did again the next time,
  • sage.rings.number_field.number_field.py (1) and did NOT again,
  • sage.structure.coerce.pyx (5) and did again,
  • sage.algebras.quatalg.quaternion_algebra.py (1) and did again once and then did NOT again,
  • lots of them in sage.homology.* (20+25+50+93+1) and did again.

The randomly failing tests above fail with:

  • IndexError: list index out of range (padic)
  • AttributeError: QuaternionAlgebra_abstract_with_category object has no attribute _a (quatalg)

and at some point in the stack, TripleDicts of the coercion model are present.

comment:70 Changed 8 years ago by jpflori

Oops, this should be more readable:

With sage-4.8.alpha5 plus #9138, #11900, #11115, #715 and #11521 applied, some tests fail, namely in the files:

  • sage.rings.padic.padic_base_generic_element.pyx (1 test failed) but did NOT when rerun and did again the next time,
  • sage.rings.number_field.number_field.py (1) and did NOT again,
  • sage.structure.coerce.pyx (5) and did again,
  • sage.algebras.quatalg.quaternion_algebra.py (1) and did again once and then did NOT again,
  • lots of them in sage.homology.* (20+25+50+93+1) and did again.

The randomly failing tests above fail with:

  • AttributeError: QuaternionAlgebra_abstract_with_category object has no attribute _a (quatalg), and at some point in the stack TripleDicts of the coercion model are present.

comment:71 Changed 8 years ago by jpflori

For info, the number_field test also fails with an "IndexError: list out of range".

comment:72 Changed 8 years ago by SimonKing

  • Work issues set to avoid regression

The flaky behaviour probably means that sometimes something gets garbage collected when it shouldn't.

But why do you have the patch from #11521 applied?

Note that with sage-5.0.prealpha0 + #11780 + the new patch from here, I get two tests with errors, namely

        sage -t  --long -force_lib devel/sage/sage/structure/coerce_dict.pyx # 1 doctests failed
        sage -t  --long -force_lib devel/sage/sage/structure/coerce.pyx # 5 doctests failed

However, the tests took rather long in total: 12100 seconds with the new patch versus 4569 seconds unpatched.

I think the regression is not acceptable.

Well, perhaps you are right and we should experiment with weak references on domain and codomain.

comment:73 follow-up: Changed 8 years ago by jpflori

Good point about #11521; I'd say because that's what I was initially interested in.

Without it applied, the flaky behavior seems to disappear.

I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.

Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.

comment:74 in reply to: ↑ 73 ; follow-ups: Changed 8 years ago by SimonKing

Replying to jpflori:

Good point about #11521, I'd say because that what I was firstly interested in.

Without it applied, the flaky behavior seem to disappear.

Good!

I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.

OK, but I guess the timings I provided should be enough to show that the patch can not remain as it is now.

Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.

Yep. Two potential solutions:

  1. Find out why apparently not all actions are registered in the parent (because then we would have a strong reference as long as at least the domain is alive).
  2. Play with the idea to have a strong reference on the action but a weak reference from a functor to its domain and codomain.

I'm trying the latter now.

comment:75 in reply to: ↑ 74 Changed 8 years ago by jpflori

Replying to SimonKing:

Replying to jpflori:

Good point about #11521; I'd say because that's what I was initially interested in. Without it applied, the flaky behavior seems to disappear.

Good!

I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.

OK, but I guess the timings I provided should be enough to show that the patch can not remain as it is now.

Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.

Yep. Two potential solutions:

  1. Find out why apparently not all actions are registered in the parent (because then we would have a strong reference as long as at least the domain is alive).
  2. Play with the idea to have a strong reference on the action but a weak reference from a functor to its domain and codomain.

I'm trying the latter now.

Just to summarize, here is the current problem; please correct me if some of the following is wrong: we want to let a codomain (resp. domain) get garbage collected when it is only weakly referenced outside of the coercion model.

Before the current patch the situation is as follows for actions:

  • when an action is resolved, it is cached in a triple dict in the coercion model with the domain and codomains as keys
  • the action is also cached in the dictionaries in the domain and the codomain in much the same way
  • there is also a similar cache for homsets

The current patch lets weakrefs be used for the keys of the above dictionaries, together with a weak ref to the corresponding value (which is the action).

The problem is that, as the action is now only weakly referenced everywhere, it gets garbage collected all the time (to be confirmed).

If it is not, then the codomain (resp. domain) will in turn not get garbage collected, because it will be strongly referenced by the action, which is itself strongly referenced by the domain (resp. codomain) (to be confirmed).
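The second half of this dilemma, where a weakly keyed cache entry can never be evicted because its value strongly references the key, can be reproduced with a plain-Python sketch (hypothetical `Domain`/`Action` stand-ins, not Sage code):

```python
import gc
import weakref

class Domain:
    """Hypothetical stand-in for a parent used as a cache key."""

class Action:
    def __init__(self, domain):
        self.domain = domain   # strong reference back to the key

# weak keys, strong values: the layout sketched above
cache = weakref.WeakKeyDictionary()

D = Domain()
cache[D] = Action(D)

del D
gc.collect()
# The strong reference value -> key keeps the weakly referenced key
# alive, so the entry is never removed: this is the leak.
assert len(cache) == 1
```

So weak keys alone do not help as long as the strongly held value closes the loop back to the key.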

The problem for the homset patch is slightly different and is being discussed in #11521.

comment:76 in reply to: ↑ 74 Changed 8 years ago by SimonKing

Replying to SimonKing:

  1. Find out why apparently not all actions are registered in the parent (because then we would have a strong reference as long as at least the domain is alive).

That's why:

sage: search_src("register_action")
structure/parent.pyx:1698:            self.register_action(action)
structure/parent.pyx:1791:    cpdef register_action(self, action):
structure/parent.pyx:1841:            sage: R.register_action(act)
structure/parent.pxd:29:    cpdef register_action(self, action)

So register_action is simply not used, which makes me wonder why some actions end up stored in the parent at all.

comment:77 Changed 8 years ago by SimonKing

I see. register_action is not to be used after any coercion was established.

comment:78 follow-up: Changed 8 years ago by vbraun

Just as a remark from the side lines, it seems that consistently storing a reference in the parent would be the cleanest solution. Perhaps the testsuite stuff can be used to verify that all parents do that?

comment:79 in reply to: ↑ 78 Changed 8 years ago by SimonKing

Replying to vbraun:

Just as a remark from the side lines, it seems that consistently storing a reference in the parent would be the cleanest solution.

But perhaps a difficult one. The condition that register_action must not be used after defining any coercion is probably there for a reason.

Perhaps the testsuite stuff can be used to verify that all parents do that?

How could it? By hooking into the coercion model, look up any action there and verify that all are registered?

Changed 8 years ago by SimonKing

Experimental patch using weak references on domain and codomain of functors

comment:80 Changed 8 years ago by SimonKing

I have posted an experimental patch, that has to be applied on top of trac715_two_tripledicts.patch.

With the experimental patch, the coercion model stores strong references to the actions (hence, it restores the original behaviour), but functors will only store weak references to their domains and codomains.

Unfortunately, this does not fix the memory leak. But perhaps you want to play with it...

Ah! And I just see that "sage.categories.functor" was the wrong location to do the change.

comment:81 Changed 8 years ago by SimonKing

Or I should say: Action.__domain is not what the action acts on, but it is a groupoid, and is not used. So, forget the experimental patch.

comment:82 follow-up: Changed 8 years ago by SimonKing

An action of G on S stores direct references to G and to S.

The action is a functor, and as a functor, it additionally stores a reference to Groupoid(G), which stores another reference to G, and to the category of S.

In some cases, the category of S will store references to the base ring of S (for example, if S is an algebra), which might have a pointer back to S (for example if the action of S.base_ring() on S was registered during initialisation). In this case, we are lost, since categories are unique parents and thus strongly cached (unless we apply #12215, which poses some problems).

For the same reason, creating the groupoid of G will result in an eternal reference on G (Groupoid(G) is strongly cached and it points to G). So, the best that we can hope for is that we can free S at some point, but we will never be able to free G.

It starts to be complicated. Time to call it a day...

Perhaps the idea to register actions in the parents (in addition to a weak cache in the coercion model) is better?

comment:83 in reply to: ↑ 82 ; follow-up: Changed 8 years ago by jpflori

Replying to SimonKing:

An action of G on S stores direct references to G and to S.

The action is a functor, and as a functor, it additionally stores a reference to Groupoid(G), which stores another reference to G, and to the category of S.

In some cases, the category of S will store references to the base ring of S (for example, if S is an algebra), which might have a pointer back to S (for example if the action of S.base_ring() on S was registered during initialisation). In this case, we are lost, since categories are unique parents and thus strongly cached (unless we apply #12215, which poses some problems).

For the same reason, creating the groupoid of G will result in an eternal reference on G (Groupoid(G) is strongly cached and it points to G). So, the best that we can hope for is that we can free S at some point, but we will never be able to free G.

It starts to be complicated. Time to call it a day...

Perhaps the idea to register actions in the parents (in addition to a weak cache in the coercion model) is better?

But if you store the actions in both parents (with strong references), you will never be able to free either the domain or the codomain.

In the ticket example, for instance, you would get a strong reference to the action in the ZZ cache (which will hopefully never get deleted). In fact, that is what happens with the current Sage version anyway; isn't that strange given what you posted, since I guess ZZ is already initialized once the for loop is executed? So the elliptic curves will stay strongly referenced forever as well. (In the ticket example you only get one curve stored in that cache because comparison was made with "=="; if you let the j-invariant change within the for loop, you would get a growing number of curves in that cache.)

comment:84 Changed 8 years ago by jpflori

My timings

  • with everything up to #11521: about 3650 sec (errors as above)
  • with everything up to #715: about 3350 sec (only errors in coerce.pyx)
  • with everything up to #11115: about 3350 sec as well (no errors)

So, unless I did something wrong, I do not get any significant slow down...

comment:85 Changed 8 years ago by jpflori

I got about 3350 sec on top of vanilla sage-4.8.alpha5.

comment:86 in reply to: ↑ 83 Changed 8 years ago by SimonKing

Hi Jean-Pierre,

Replying to jpflori:

But if you store the actions in both parents (with strong references), you will never be able to free any of the two domain and codomain.

This is not necessarily the case. You would merely get circular references, and they would not obstruct garbage collection, unless one item in the cycle has a __del__ method.

One problem, however, is that many actions start with ZZ. And if ZZ is contained in the cycle, then it can not be collected, since ZZ will live forever -- but you know that.
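The claim that mere circular references are harmless can be checked with a small plain-Python experiment: the cyclic garbage collector reclaims a pure reference cycle once no external strong reference remains. (One caveat: in Python 2, which Sage used at the time of this discussion, a `__del__` method anywhere in the cycle blocked collection; since Python 3.4, PEP 442, even such cycles are collected.)

```python
import gc
import weakref

class Node:
    """Two of these will reference each other, forming a cycle."""

a, b = Node(), Node()
a.partner = b
b.partner = a              # reference cycle: a -> b -> a

probe = weakref.ref(a)
del a, b                   # only the internal cycle references remain
collected = gc.collect()   # the cyclic garbage collector breaks the cycle
assert probe() is None     # the cycle was reclaimed
```

So a cycle between an action and its domain/codomain is collectable in principle; the real obstruction is an immortal member like ZZ.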

In the ticket example, for instance, you would get a strong reference to the action in the ZZ cache (which will hopefully never get deleted). (In fact, that is what happens with the current Sage version anyway; isn't that strange given what you posted, since I guess ZZ is already initialized once the for loop is executed?)

Yes. And is it really certain that the actions are stored in ZZ?

Anyway. They are stored by ==, and thus only one copy remains alive.

so the elliptic curves will stay strongly referenced forever as well... (In the ticket example you only get one stored in that cache because comparison was made with "=="; if you let the j-invariant change within the for loop, you would get a growing number of curves in that cache.)

Yes. And that is a problem that, again, might be solved using weak references.

Namely:

Consider an action A of G on S. Typically, G is immortal (like ZZ), but we are willing to let A and S die if we do not have any "external" strong reference to S. In particular, the existence of A should not be enough to keep S alive.

I think this can be accomplished as follows:

  • For quick access and for backwards compatibility, we want that actions remain stored in the coercion model. We use weak references to the keys (G,S), but a strong reference to the action (this is what the previous version of trac715_two_tripledicts.patch did).
  • In addition to that, A should only have a weak reference to S; I think it doesn't matter whether the reference from A to G is strong or weak.

Let us analyse what happens with G, S and A:

  1. G will remain alive forever, even without an external reference. Namely, the coercion cache has a strong reference to A; as a functor, A points to Groupoid(G); Groupoid(G) is strongly cached (unless we use weak caching for UniqueRepresentation) and must have a reference to G. If we decide to use weak caching for UniqueRepresentation, then we would only have a strong reference from G to A and a weak or strong reference from A to G. That would be fine for garbage collection. Anyway, I think keeping G alive will not hurt.
  2. Assume that there is no external reference to S. There is a weak reference to S from the cache in the coercion model, namely as key of the cache. Moreover, there is another weak reference from A to S. Hence, S could be garbage collected.
  3. Assume that there is no external reference to A. If S is garbage collected (see the previous point), then it will remove itself from the coercion cache, and thus the strong reference to A would vanish - it could be collected. But if S is alive, then A will remain alive as well.

However, this is how the experimental patch should work - and it does not fix the leak. Perhaps this is, again, due to caching the homsets? So, we would need the patch from #12215 as well. Difficult topic.
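The lifetime analysis in points 1-3 can be sketched in plain Python (hypothetical stand-ins; the real TripleDict machinery differs, and keying by `id` is only the TripleDictById idea). A weakref callback on S evicts the cache entry when S dies, which in turn releases the strong reference to the action A:

```python
import gc
import weakref

class Parent:
    """Hypothetical stand-in for a Sage parent."""

class Action:
    def __init__(self, underlying_set):
        # only a weak reference to the underlying set, as in the patch
        self._S = weakref.ref(underlying_set)

coercion_cache = {}   # keys weak by construction below, values strong

def cache_action(G, S, action):
    key = (id(G), id(S))
    def erase(ref, key=key):
        # S was collected: evict the entry, dropping the strong
        # reference to the action
        coercion_cache.pop(key, None)
    coercion_cache[key] = (weakref.ref(S, erase), action)

G, S = Parent(), Parent()
A = Action(S)
cache_action(G, S, A)
probe = weakref.ref(A)

del S, A
gc.collect()
assert not coercion_cache   # S's death evicted the cache entry...
assert probe() is None      # ...which in turn released the action
```

While S is alive, the cache's strong value reference keeps A alive; once S goes, both the entry and A become collectable, matching point 3 above.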

comment:87 Changed 8 years ago by SimonKing

Sorry, I meant "we would need the patch from #11521 as well".

comment:88 Changed 8 years ago by SimonKing

  • Description modified (diff)

I have attached another patch, which implements the ideas sketched above. I think it corresponds to what you suggested ("use a weak reference from the action to the domain").

One detail: We have to distinguish between the underlying set, the domain and the codomain of an action. In fact, the new patch only uses a weak reference to the underlying set, and introduces a cdef function (hence, hopefully with little overhead) returning it.

I consider sage-5.0.prealpha0 plus trac11780_unique_auxiliar_polyring.patch (probably not needed) plus trac715_two_tripledicts.patch plus trac715_weak_action.patch.

At least sage -t sage/modular/modsym/space.py passes, but I need to run the whole test suite.

The example from the ticket description does not leak. However, if the j-invariant varies, it seems that for each elliptic curve one copy is preserved:

sage: K = GF(1<<55,'t')
sage: for i in range(500):
....:     a = K.random_element()
....:     E = EllipticCurve(j=a)
....:     P = E.random_point()
....:     Q = 2*P
....:     
sage: import gc
sage: gc.collect()
2124
sage: from sage.schemes.generic.homset import SchemeHomsetModule_abelian_variety_coordinates_field
sage: LE = [x for x in gc.get_objects() if  isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)]
sage: len(LE)
500

In any case, the original leak is fixed with the two patches. The question is whether the second patch suffices to keep actions alive, whether it avoids a regression, and whether all tests pass.

If everything is alright, we may still try to find out where the remaining strong reference to an elliptic curve comes from.

comment:89 Changed 8 years ago by SimonKing

PS: The additional application of #11521 does not suffice to avoid the remaining strong reference to an elliptic curve.

comment:90 Changed 8 years ago by SimonKing

  • Status changed from needs_work to needs_info

With the two patches applied, I get some doctest errors that seem trivial to fix, and the test run takes 10905 seconds in total. Now I am not sure: originally, unpatched Sage took much less time.

But perhaps my computer was in a different state at that time? Jean-Pierre, if I understood correctly, you did not find any significant slowdown, right?

The first (i.e., the "official") patch is enough to fix the leak for the original example. According to Jean-Pierre, the timings are fine, it does not matter whether we have no patch, the official patch only, or the first experimental patch. And according to my own test, it does not matter whether we have the first or the second experimental patch.

So, the further proceeding depends on the following questions:

  • The experimental patches provide two different approaches to fix a potential problem, namely actions being deallocated when they are still needed. However, is this potential problem a real problem? Only then would it make sense to consider the experimental patches!
  • Do we also want to fix the leak in the more difficult example, namely when the j-invariant varies? In this case, we need to find out why the actions are registered in ZZ. It is not clear yet whether one really needs one of the experimental patches to get rid of it.

What is your answer to the questions?

comment:91 Changed 8 years ago by SimonKing

There are two occasions on which something is written into Parent._action_hash: during initialisation, via register_action; and when a new action is found while calling get_action, in which case it is also stored in the parent.

Perhaps we should distinguish the two cases: The actions that are stored during initialisation should probably be "immortal". But the actions that are stored on the fly should only be weakly cached.

I think this can be solved by changing Parent._action_hash into a dictionary that uses weak references to both the keys and the values. There is one difference between register_action and get_action: register_action additionally stores the actions in a list, but get_action doesn't. Hence, indeed the actions registered during initialisation will survive, but the stuff stored by get_action could become collectable.
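The proposed distinction can be sketched with a plain-Python `WeakValueDictionary` (hypothetical `Parent`/`Action` stand-ins; the actual patch uses Sage's own dict types): registered actions are additionally pinned by a strong list, while actions discovered on the fly are only weakly held and may be collected.

```python
import gc
import weakref

class Action:
    """Hypothetical stand-in for an action."""

class Parent:
    def __init__(self):
        # weak values: actions discovered on the fly may be collected
        self._action_hash = weakref.WeakValueDictionary()
        # actions registered during initialisation are additionally
        # pinned by this list, so they stay alive
        self._registered_actions = []

    def register_action(self, key, action):
        self._action_hash[key] = action
        self._registered_actions.append(action)

    def discover_action(self, key, action):
        self._action_hash[key] = action  # no strong reference kept

P = Parent()
P.register_action("init", Action())
P.discover_action("fly", Action())
gc.collect()
assert "init" in P._action_hash      # pinned by the list: survives
assert "fly" not in P._action_hash   # weakly held only: collected
```

This mirrors the one difference noted above: register_action keeps the action in a list, get_action does not.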

comment:92 Changed 8 years ago by SimonKing

  • Status changed from needs_info to needs_review
  • Work issues avoid regression deleted

Yesss!! It suffices (in addition to what I did before) to use a TripleDictById (which uses weak references to the keys, but strong references to the value) for Parent._action_hash!!!

The leak is now completely gone:

sage: K = GF(1<<55,'t')
sage: for i in range(50):
....:     a = K.random_element()
....:     E = EllipticCurve(j=a)
....:     P = E.random_point()
....:     Q = 2*P
....:     
sage: from sage.schemes.generic.homset import SchemeHomsetModule_abelian_variety_coordinates_field
sage: import gc, objgraph
sage: gc.collect()
882
sage: LE = [x for x in gc.get_objects() if  isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)]
sage: len(LE)
1

I need to add a test (or better: modify the test introduced by the first patch), demonstrating that the "big" leak is fixed, and I need to add tests for the new code I wrote, and of course I need to run ptestlong.

Nevertheless, I think you can start reviewing. And please store the doc test times, so that we can detect any regression.

Apply trac715_two_tripledicts.patch trac715_weak_action.patch

comment:93 follow-up: Changed 8 years ago by jpflori

Good!

I've just built the last Sage prealpha and am quite busy today, but I'll at least run ptestlong with and without patches to get timings in the afternoon.

comment:94 in reply to: ↑ 93 Changed 8 years ago by SimonKing

Replying to jpflori:

I've just built the last Sage prealpha and am quite busy today, but I'll at least run ptestlong with and without patches to get timings in the afternoon.

Good! I am sure that there will be errors (for example, the current test against the leak expects len(LE)==2), but the timings will certainly be interesting. And of course, it would be interesting whether an action can die prematurely.

comment:95 Changed 8 years ago by SimonKing

I am sure that there is a regression compared with vanilla sage-5.0.prealpha0. For example, when I originally ran sage -ptestlong, sage/schemes/hyperelliptic_curves/hyperelliptic_padic_field.py took 57 seconds, but with the patches it takes 160 seconds.

comment:96 Changed 8 years ago by SimonKing

Meanwhile I took some timings on a different machine. Based on the experience that the schemes code tends to slow down a lot when one does fancy stuff (see #11900 and #11935), I use "sage/schemes" as a test bed.

I find (based on sage-4.8.alpha3):

With patch

king@mpc622:/mnt/local/king/SAGE/rebase/sage-4.8.alpha3/devel/sage$ hg qapplied 
trac_12149.3.patch
9138_flat.patch
trac11900_category_speedup_combined.patch
11115_flat.patch
trac_11115_docfix.patch
trac715_two_tripledicts.patch
trac715_weak_action.patch
king@mpc622:/mnt/local/king/SAGE/rebase/sage-4.8.alpha3/devel/sage$ ../../sage -t sage/schemes/
...
----------------------------------------------------------------------
All tests passed!
Total time for all tests: 625.1 seconds

Here are the five worst:

sage -t  "devel/sage-main/sage/schemes/elliptic_curves/ell_rational_field.py"
	 [58.1 s]
sage -t  "devel/sage-main/sage/schemes/elliptic_curves/heegner.py"
	 [51.1 s]
sage -t  "devel/sage-main/sage/schemes/elliptic_curves/ell_number_field.py"
	 [35.2 s]
sage -t  "devel/sage-main/sage/schemes/elliptic_curves/padic_lseries.py"
	 [26.9 s]
sage -t  "devel/sage-main/sage/schemes/elliptic_curves/sha_tate.py"
	 [25.7 s]

Now, the same without the two patches from here:

king@mpc622:/mnt/local/king/SAGE/rebase/sage-4.8.alpha3/devel/sage$ hg qapplied 
trac_12149.3.patch
9138_flat.patch
trac11900_category_speedup_combined.patch
11115_flat.patch
trac_11115_docfix.patch
king@mpc622:/mnt/local/king/SAGE/rebase/sage-4.8.alpha3/devel/sage$ ../../sage -t sage/schemes/
...
----------------------------------------------------------------------
All tests passed!
Total time for all tests: 597.0 seconds

And the five worst, comparing with the times from above, are:

sage -t  "devel/sage-main/sage/schemes/elliptic_curves/ell_rational_field.py"
	 [55.4 s] (was: [58.1 s])
sage -t  "devel/sage-main/sage/schemes/elliptic_curves/heegner.py"
	 [47.2 s] (was: [51.1 s])
sage -t  "devel/sage-main/sage/schemes/elliptic_curves/ell_number_field.py"
	 [34.1 s] (was: [35.2 s])
sage -t  "devel/sage-main/sage/schemes/elliptic_curves/padic_lseries.py"
	 [26.1 s] (was: [26.9 s])
sage -t  "devel/sage-main/sage/schemes/elliptic_curves/sha_tate.py"
	 [24.9 s] (was: [25.7 s])

Hence, we have a slow-down of, overall, (625.1 - 597)/625.1 = 4.5%, the slow-down seems to be systematic (you hardly find an example that became faster), and in some cases we have a slow-down of 10%.

I expected it to be worse (after all, coercion affects everything). But still, the question is: Can the slow-down be avoided?

comment:97 follow-up: Changed 8 years ago by jpflori

So here are my global timings for ptestlong on sage.5.0.prealpha0:

  • 3092.9 seconds with no errors on vanilla
  • 3097.8 seconds with 3 errors in sage.matrix.action.pyx and 1 in sage.structure.coerce_dict.pyx on vanilla + the two patches in the current ticket (#715) description.

That's kind of strange. Maybe the slowdown is absorbed by the fact that the tests are running in parallel?

I'll just test sage.schemes with a single core and report.

comment:98 in reply to: ↑ 97 Changed 8 years ago by SimonKing

Replying to jpflori:

  • 3092.9 seconds with no errors on vanilla
  • 3097.8 seconds with 3 errors in sage.matrix.action.pyx and 1 in sage.structure.corece_dict.pyx on vanille + the two patches in the current ticket (#715) description.

Cool!

Note that the tests in sage/schemes only take 597.5 seconds when I apply #11943 and #11935 on top. Hence, if there really is a slow-down then it can be countered.

One detail about trac: If you want to link to a ticket, you can just provide the number after the "hash symbol", hence, #715 and not [http://.../715 #715].

That's kind of strange. Maybe the slowdown is absorbed by the fact that the tests are running in parallel?

I don't know the typical standard deviation of the timings.

comment:99 Changed 8 years ago by SimonKing

Interestingly, I get slightly different errors:

        sage -t  --long -force_lib devel/sage/sage/structure/coerce_dict.pyx # 1 doctests failed
        sage -t  --long -force_lib devel/sage/sage/structure/parent.pyx # 1 doctests failed
        sage -t  --long -force_lib devel/sage/sage/matrix/action.pyx # 5 doctests failed

Anyway, this should be easy to fix...

comment:100 Changed 8 years ago by SimonKing

... and it took me 12187.2 seconds.

comment:101 Changed 8 years ago by SimonKing

The second patch is updated, more examples are added (in particular, it is demonstrated that the memory leak is fixed even when the j-invariant varies), and the errors that I had with the previous version are gone.

Hence, it can now be reviewed. Please try to find regressions!

Apply trac715_two_tripledicts.patch trac715_weak_action.patch

comment:102 follow-up: Changed 8 years ago by jpflori

Testing sage.schemes with only one core gave me:

  • 1526.0 sec on vanilla

This is more than acceptable according to me (if it does reflect anything... it might only be random stuff).

Running five tests of sage.schemes.elliptic_curves.padic_lseries gave me:

  • 51.0, 48.7, 47.0, 47.0, 47.1 on vanilla
  • 49.0, 47.2, 48.4, 47.4, 47.7 on vanilla + #715

Still surprising that I don't find any slow-down as you did, but it might also be good news :)

My next step is to check for the memory leaks (same j-invariant, different j-invariants, the finite field example of Paul in #11521; or does that last one need a patch for the HomSet cache? If so, it won't prevent this ticket from being closed, but should be treated in #11521; otherwise #11521 can be closed as a duplicate) and that actions do not get continuously deleted.

Afterward, I'll properly review your code and examples (that I've already seen many times obviously :)).

comment:103 in reply to: ↑ 102 Changed 8 years ago by SimonKing

Replying to jpflori:

Testing sage.schemes with only one core gave me:

  • 1526.0 sec on vanilla

That's very good news indeed!

Running five tests of sage.schemes.elliptic_curves.padic_lseries gave me:

  • 51.0, 48.7, 47.0, 47.0, 47.1 on vanilla
  • 49.0, 47.2, 48.4, 47.4, 47.7 on vanilla + #715

That looks like quite some randomness.

My next step is to check for the memory leaks (same j invariant, different j invariants, finite field example of Paul in #11521 ? or do that last one need a patch for the HomSet cache ?

Yes, the finite field example is not solved:

sage: for p in prime_range(10^5):
....:     K = GF(p)
....:     a = K(0)
....:     
sage: import gc
sage: gc.collect()
0

So, I am going to modify the ticket description of #11521, indicating that the original elliptic curve example has been tackled here, but that there remains an orthogonal problem.

Afterward, I'll properly review your code and examples (that I've already seen many times obviously :)).

Not so many times: Some examples are only in the very latest version of the second patch.

comment:104 Changed 8 years ago by jpflori

Please be careful with the non slow-down I reported above.

Something must have gone wrong with my installation, sorry for that, as I realized that the leak was not fixed.

I'll investigate all of this more carefully ASAP.

comment:105 Changed 8 years ago by SimonKing

There is one thing, related with regressions, that I didn't do: The TripleDict is cimported in sage/structure/coerce.pyx, and thus I could use the cdef'd methods "set" and "get". But instead, I'm currently using the usual Python __getitem__ and __setitem__, so some overhead could still be avoided. Will test it a bit later.

Changed 8 years ago by SimonKing

Use weak references to the underlying set of an action. Use TripleDictById to store actions in parents. Disregard the orphan_functor patch!

comment:106 Changed 8 years ago by SimonKing

I have updated the second patch, so please replace it with the new version. I am sorry that this came too late for "ptestlong", but perhaps the timings with the old patch version indicate what it might make sense to look at with the new version.

Apply trac715_two_tripledicts.patch trac715_weak_action.patch

comment:107 Changed 8 years ago by SimonKing

By the way, here is an example that shows that a TripleDictById finds its items faster than a usual dict, even though it has the additional advantage of weak keys. If one uses Cython, one can save even more, which is what I did in the preceding change of the second patch.

I create a list of pairs of rings:

sage: L = []
sage: for p in prime_range(10^3):
....:     K = GF(p)
....:     P = K['x','y']
....:     L.append((K,P))
....: 
sage: len(L)
168

I create a TripleDictById and a usual dictionary, and fill them with the same data:

sage: from sage.structure.coerce_dict import TripleDictById
sage: D = TripleDictById(113)
sage: for i,(K,P) in enumerate(L):
....:     D[K,P,True] = i
....:     
sage: E = {}
sage: for i,(K,P) in enumerate(L):
....:     E[K,P,True] = i
....:     
sage: len(D)
168
sage: len(E)
168

I create Cython functions that know about the types. In the first, I use the Python way of accessing data from TripleDictById; in the second, I use the special cdef'd get() method; and the third is for usual dictionaries.

sage: cython("""
....: from sage.structure.coerce_dict cimport TripleDictById
....: def testD(TripleDictById D, list L):
....:     for K,P in L:
....:         n = D[K,P,True]
....: def testDget(TripleDictById D, list L):
....:     for K,P in L:
....:         n = D.get(K,P,True)
....: def testE(dict D, list L):
....:     for K,P in L:
....:         n = D[K,P,True]
....: """)

Even though Cython is supposed to be quite good at optimising dictionary access (mind that testE(...) knows that it will receive a dictionary!), I was surprised by how much faster the TripleDictById is:

sage: %timeit testD(D,L)
625 loops, best of 3: 67.8 µs per loop
sage: %timeit testDget(D,L)
625 loops, best of 3: 52.1 µs per loop
sage: %timeit testE(E,L)
125 loops, best of 3: 3.26 ms per loop

Forty to sixty times faster! So, I think it was a good idea to use TripleDictById not only in the coercion model, but also as an attribute of Parent.

comment:108 Changed 8 years ago by SimonKing

  • Status changed from needs_review to needs_work
  • Work issues set to Avoid a regression

We have a big regression.

I considered the doctests of sage/modules/free_module.py and took each timing twice, in order to be on the safe side.

Vanilla 5.0.prealpha0

sage -t  "devel/sage-main/sage/modules/free_module.py"
         [11.9 s]
sage -t  "devel/sage-main/sage/modules/free_module.py"
         [10.3 s]

With the first patch from here:

sage -t  "devel/sage-main/sage/modules/free_module.py"
         [24.1 s]
sage -t  "devel/sage-main/sage/modules/free_module.py"
         [25.7 s]

With both patches from here:

sage -t  "devel/sage-main/sage/modules/free_module.py"
         [26.0 s]
sage -t  "devel/sage-main/sage/modules/free_module.py"
         [25.8 s]

I think such a huge regression can't be accepted. Thus, it is "needs work".

comment:109 Changed 8 years ago by jpflori

  • Status changed from needs_work to needs_info
  • Work issues changed from Avoid a regression to Get some timings

Your timings are kind of strange in comparison with what I get.

I was going to post what follows which I double checked:

I can finally confirm I do not get any serious speed regression with the last couple of patches.

ptestlong gives something between 3100 and 3250 seconds with 5.0.prealpha0 vanilla or 715 applied.

Testing only sage.schemes with one core gives me between 1550 and 1600 with both in the same way.

And this time I confirm that the test with random j-invariants is fixed by 715 and is not without it (I'm getting paranoid now), as well as with a fixed j-invariant.

I'll review the code and example next.

So when I saw your post, I ran "sage -t devel/sage-main/sage/modules/free_module.py" with both my installations (vanilla and vanilla+715) and got about 13 sec several times for both! Maybe the mean is a little lower for vanilla (at most 0.5 sec less).

I should mention I also get quite a big variance, not sure why, because my system is not heavily loaded; maybe it's because the disk is on NFS.

E.g. I got between 12.3 and 20.0 (just once) for vanilla and between 12.7 and 19 (at the same time, so maybe the network was loaded then??)

For info, I have a quite recent multicore Xeon running an outdated 64-bit version of Ubuntu.

comment:110 Changed 8 years ago by jpflori

Groumpf, trac did not like my blank lines.

So the original part I was about to post is in between "I can finally confirm..." and "example next."

comment:111 Changed 8 years ago by SimonKing

  • Status changed from needs_info to needs_review

It is always possible that there is a regression on some hardware, but not on all.

I made an extensive log, i.e., I logged all Python commands. It turns out that there are only small differences with or without the patch. Hence, I am sure that the regression does not come from excessive garbage collection (otherwise, I would have seen some objects being created repeatedly). So, I guess the regression comes from the C level.

There is one thing that could make my code too slow: I use weak references in my version of TripleDict and TripleDictById; however, when getting dictionary items, I call the weak reference in order to get the object it points to, and then compare. That is slow.

I was thinking: Perhaps I could manage to use id(X) as key of TripleDictById, rather than a weak reference to X. The weak reference to X could be stored elsewhere.

Anyway, here is a data point: Unpatched (there is only TripleDict, no TripleDictById):

sage: from sage.structure.coerce_dict import TripleDict
sage: D = TripleDict(113)
sage: L = []
sage: for p in prime_range(10^3):
....:     K = GF(p)
....:     P = K['x','y']
....:     L.append((K,P))
....:     
sage: for i,(K,P) in enumerate(L):
....:     D[K,P,True] = i
....:     
sage: cython("""
....: def testD(D, list L):
....:     for K,P in L:
....:         n = D[K,P,True]
....: """)
sage: %timeit testD(D,L)
625 loops, best of 3: 30.6 µs per loop

Patched (comparing TripleDict and TripleDictById):

sage: from sage.structure.coerce_dict import TripleDict, TripleDictById
sage: D = TripleDict(113)
sage: E = TripleDictById(113)
sage: L = []
sage: for p in prime_range(10^3):
....:     K = GF(p)
....:     P = K['x','y']
....:     L.append((K,P))
....:     
sage: for i,(K,P) in enumerate(L):
....:     D[K,P,True] = i
....:     E[K,P,True] = i
....:                                                                                                                        
sage: cython("""                                                                                                             
....: def testD(D, list L):
....:     for K,P in L:
....:         n = D[K,P,True]
....: """)
sage: %timeit testD(D,L)
25 loops, best of 3: 21 ms per loop
sage: %timeit testD(E,L)
625 loops, best of 3: 61.9 µs per loop

In the applications, I am mainly using TripleDictById. Nevertheless, it is only half as fast as the old TripleDict. So, this is what I have to work at!

comment:112 Changed 8 years ago by SimonKing

  • Status changed from needs_review to needs_work

comment:113 follow-up: Changed 8 years ago by vbraun

This is a bit hackish, but we could also store a strong reference as before but manually Py_DECREF it by one. The eraser then has to make sure that cache entries are removed when they fall out of use, or we'll segfault....

comment:114 in reply to: ↑ 113 ; follow-ups: Changed 8 years ago by SimonKing

Replying to vbraun:

This is a bit hackish, but we could also store a strong reference as before but manually Py_DECREF it by one. The eraser then has to make sure that cache entries are removed when they fall out of use, or we'll segfault....

How should the eraser know which entry is to be removed? I wouldn't like to re-implement the weakref module...

At least on my machine, I have a regression. In order to avoid it, I am now experimenting with some ideas to speed up access to dictionary items: with my current patch, I do something like

    if k1 is bucket[i]()

where bucket[i] is a weak reference. But calling the reference takes a lot of time.

For example, since k1 is compared by identity (not equality), it might make sense to store id(bucket[i]()) as an attribute of the weak reference. This is possible by weakref.KeyedRef. And bucket[i].key is a bit faster than bucket[i]().
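The weakref.KeyedRef trick can be demonstrated in pure Python; KeyedRef is the weakref.ref subclass that the weakref module itself uses inside WeakValueDictionary, and it simply carries one extra attribute, .key:

```python
import weakref

class Parent:
    """Stand-in for any weakref-able key object."""

k1 = Parent()
# a KeyedRef behaves like weakref.ref but stores an extra .key attribute
r = weakref.KeyedRef(k1, None, id(k1))

# the two comparison styles under discussion:
assert k1 is r()           # call the reference, then compare by identity
assert id(k1) == r.key     # compare the stored address, no call needed
print("both comparisons agree")
```

Which of the two is faster in practice is exactly the timing question raised in the following comments.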

comment:115 in reply to: ↑ 114 Changed 8 years ago by SimonKing

Replying to SimonKing:

For example, since k1 is compared by identity (not equality), it might make sense to store id(bucket[i]()) as an attribute of the weak reference. This is possible by weakref.KeyedRef. And bucket[i].key is a bit faster than bucket[i]().

... but k1 is bucket[i]() is a lot faster than id(k1)==bucket[i].key. Too bad.

comment:116 in reply to: ↑ 114 Changed 8 years ago by vbraun

Replying to SimonKing:

How should the eraser know which entry is to be removed? I wouldn't like to re-implement the weakref module...

As you said in the preceding comment, you'd have to store a weak reference elsewhere. The only advantage is that comparing could be done on the actual reference.

comment:117 Changed 8 years ago by SimonKing

Perhaps as follows: we currently have one ensemble of buckets. Instead, we could have two ensembles, say, self.keys and self.refs. Each bucket in both ensembles is a list of length 3*n. Let X,Y,Z be a key, let h be the hash of that triple and V the value associated with X,Y,Z.

Then, one could store X,Y,Z as self.keys[h][i:i+3], artificially decrementing the reference count for X and Y (but not for Z, which usually is True, False, None, operator.mul and so on), as suggested by Volker. And self.refs[h][i:i+3] would consist of a weak reference to X, a weak reference to Y, and V. The two weak references have a callback function that, when a reference self.refs[h][j] becomes invalid, finds it and deletes the corresponding triple both in self.refs[h] and in self.keys[h].

Two weak references with callback functions pointing to the same object are distinct (they are only identical if they don't have a callback function). Hence, each reference occurs in the TripleDict exactly once, and it makes sense to store the hash value of the triple X,Y,Z as additional data in the references to X and to Y - which is possible with weakref.KeyedRef. In that way, deleting an entry when a reference becomes invalid is much faster than with my current patch, since there is no need to search all buckets.

I will try it tomorrow.
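A minimal pure-Python sketch of that callback idea (bucket layout simplified, names illustrative, not those of the patch): the hash is stored on the KeyedRef, so the callback cleans exactly one bucket instead of searching all of them.

```python
import gc
import weakref

nbuckets = 7
refs = [[] for _ in range(nbuckets)]   # h -> list of (KeyedRef, value)

def put(X, value):
    h = id(X) % nbuckets
    def erase(ref):
        # ref.key is the stored hash: only one bucket needs searching
        refs[ref.key][:] = [(r, v) for (r, v) in refs[ref.key] if r is not ref]
    refs[h].append((weakref.KeyedRef(X, erase, h), value))

class Key:
    """Any weakref-able object."""

k = Key()
put(k, "some value")
assert sum(len(b) for b in refs) == 1
del k
gc.collect()                           # callback has fired by now (CPython)
assert sum(len(b) for b in refs) == 0
print("stale entry removed via its own bucket")
```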

comment:118 Changed 8 years ago by SimonKing

  • Description modified (diff)

Here is a preliminary combined patch, implementing the ideas sketched in the previous post -- except that I forgot to explicitly decref stuff... Trying that now. I hope it doesn't segfault.

comment:119 Changed 8 years ago by SimonKing

Hm. When adding a "Py_DECREF", some doctest segfaults, and also the memory leak is not completely fixed: In the test where one creates 50 elliptic curves with random j-invariant, 12 of them survive garbage collection. That's better than 50, but worse than 1.

comment:120 Changed 8 years ago by SimonKing

First of all, in my current patch, I forgot the case k1 is k2: in that case, I would decrement the reference count twice for the same object. However, even when I avoid that, I get a double free.

I wonder: Could the double free result from the fact that I do del self.key_buckets[h][i:i+3] when the reference count for self.key_buckets[h][i] is already zero? Or would that be no problem?

comment:121 Changed 8 years ago by SimonKing

I made some progress by using Py_CLEAR instead of Py_DECREF. Now, it is "only" signal 11, not a double-free.

comment:122 Changed 8 years ago by SimonKing

Sorry, it seems that I have no idea whatsoever of reference counting. I made experiments with Py_DECREF resp. Py_CLEAR applied to list elements, but in all cases I get a segfault when the next garbage collection occurs.

comment:123 Changed 8 years ago by SimonKing

  • Work issues changed from Get some timings to Improve timing and provide documentation

I have updated the patch. Instead of storing the original key and using Py_DECREF, I store its address instead (for TripleDictId) resp. use another weak reference.

With the new patch, sage -t "devel/sage-main/sage/modules/free_module.py" works and is about as fast as in vanilla sage.

Moreover, the memleak is fixed.

However, the patch isn't fully tested or documented yet. And TripleDictById is still only half as fast as the old TripleDict (but recall: the old one is buggy and uses strong references). So, it isn't ready for review, but of course I'd appreciate preliminary comments.

Apply trac715_tripledict_combined.patch

comment:124 Changed 8 years ago by SimonKing

  • Work issues changed from Improve timing and provide documentation to Rename `TripleDictById` into `TripleDict`. Improve timing and update documentation

OMG!!

I totally misinterpreted how the keys were compared in the original version of TripleDict. When I saw the line if PyList_GET_ITEM(bucket, i) == <PyObject*>k1 in the old code, I thought that this means to compare the objects by equality.

But now I learnt that this is comparison by identity. Arrgh! The behaviour that I provide with TripleDictById was there all along!

Conclusion: I should erase my version of TripleDict (which really compares by equality, not identity), rename my TripleDictById into TripleDict, and then finally try to get things up to speed.

Changed 8 years ago by SimonKing

Introduce weak references to coercion dicts, and refactor the hashtables.

comment:125 Changed 8 years ago by SimonKing

  • Status changed from needs_work to needs_info
  • Work issues changed from Rename `TripleDictById` into `TripleDict`. Improve timing and update documentation to Should we rename `TripleDictById` into `TripleDict`, or do we want a "weak triple dictionary with comparison by equality"?

I have posted a new patch version.

Recall that we want a dictionary whose keys are triples; we want to compare all three key items by identity, and we want that there is only a weak reference to the first two key items (the third may have a strong reference).

The TripleDictById is now based on the following idea:

  • There is one list that stores the memory addresses of the first two key items and the third key item. In particular, I don't need to decref the key items, since we only store their addresses.
  • There is another list that stores the value corresponding to the key triple, and stores weak references with a callback function to the first two key items.
  • When accessing the dictionary, the address of the first two key items is compared with the stored address, and the third is compared by identity with the stored data.
  • Only when iterating over the TripleDictById, the weak references are called (of course: iteritems is supposed to return the keys, not just the address of the keys).
  • There are two reasons for storing the weak references (and not only the addresses): The callback function of the weak references removes unused entries of the dictionary, and we also need it for iteration over the dictionary.

Status of the patch

  • The "raw" speed seems to be almost as good as in the unpatched version, the speed of doctests seems to be OK, and I don't observe segfaults.
  • The memleak is fixed.
  • The documentation of sage/structure/coerce_dict.pyx needs more polishing, and last but not least I did not run the doctests yet.

The patch still contains both TripleDict (which compares weak keys by equality) and TripleDictById (which compares keys by identity, similar to what TripleDict does in unpatched Sage, but using weak references).

What do you think: Should comparison by equality be provided in the patch?

Contra:

We don't use it in the rest of Sage, so, why should we add it?

Pro:

A "triple dict by comparison" is slower than a usual (strong) dictionary, but on the other hand weakref.WeakKeyDictionary does not work if the keys are tuples - hence, "triple dict by comparison" adds a new feature.
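The "Pro" point can be checked directly in plain Python, no Sage needed: tuples do not support weak references at all, so a WeakKeyDictionary keyed by triples fails immediately.

```python
import weakref

d = weakref.WeakKeyDictionary()
try:
    d[(1, 2, True)] = "value"   # try a triple as the key
    print("unexpectedly worked")
except TypeError as e:
    print("rejected:", e)        # tuples cannot be weakly referenced
```

This is why a dedicated triple-keyed structure with per-item weak references adds something a stock WeakKeyDictionary cannot provide.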

comment:126 Changed 8 years ago by SimonKing

  • Description modified (diff)
  • Status changed from needs_info to needs_review
  • Work issues Should we rename `TripleDictById` into `TripleDict`, or do we want a "weak triple dictionary with comparison by equality"? deleted

To answer my own question: I believe that comparison by equality does not make sense (yet), since it isn't used in the coercion model.

Therefore, I have produced a new patch. Idea: the TripleDict stores the addresses of the keys. In addition, there is a dictionary of weak references with callback functions. The only purpose of these references is that their callbacks erase invalid dictionary items.
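A pure-Python sketch of this design (illustrative names and layout; the real TripleDict is a bucketed Cython hashtable): the actual key is the triple of memory addresses, so nothing here keeps the key objects alive, and the weakrefs exist only so their callbacks can erase entries whose keys have died.

```python
import gc
import weakref

class TripleDictSketch:
    """Sketch: values are stored under (id(X), id(Y), Z); weak
    references with callbacks clean up entries whose keys died."""
    def __init__(self):
        self._data = {}   # (id(X), id(Y), Z) -> value
        self._refs = {}   # same key -> (weakref to X, weakref to Y)

    def set(self, X, Y, Z, value):
        key = (id(X), id(Y), Z)
        def erase(ref, key=key):
            self._data.pop(key, None)
            self._refs.pop(key, None)
        self._data[key] = value
        self._refs[key] = (weakref.ref(X, erase), weakref.ref(Y, erase))

    def get(self, X, Y, Z):
        return self._data[id(X), id(Y), Z]

class Dummy:
    """Stand-in for a parent."""

D = TripleDictSketch()
x, y = Dummy(), Dummy()
D.set(x, y, True, 42)
assert D.get(x, y, True) == 42
del x
gc.collect()
print(len(D._data))   # 0: the entry disappeared with its key (CPython)
```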

"Raw" speed

Patched:

sage: from sage.structure.coerce_dict import TripleDict
sage: D = TripleDict(113)
sage: L = []
sage: for p in prime_range(10^3):
....:     K = GF(p)
....:     P = K['x','y']
....:     L.append((K,P))
....:
sage: for i,(K,P) in enumerate(L):
....:     D[K,P,True] = i
....:
sage: cython("""
....: from sage.structure.coerce_dict cimport TripleDict
....: def testTriple(TripleDict D, list L):
....:     for K,P in L:
....:         n = D[K,P,True]
....: def testTripleGet(TripleDict D, list L):
....:     for K,P in L:
....:         n = D.get(K,P,True)
....: def testTripleSet(list L):
....:     cdef TripleDict D = TripleDict(113)
....:     for i,(K,P) in enumerate(L):
....:         D.set(K,P,True, i)
....: """)
sage: %timeit testTriple(D,L)
625 loops, best of 3: 42.4 µs per loop
sage: %timeit testTripleGet(D,L)
625 loops, best of 3: 28.3 µs per loop
sage: %timeit testTripleSet(L)
125 loops, best of 3: 2.66 ms per loop

Unpatched:

sage: %timeit testTriple(D,L)
625 loops, best of 3: 31.2 µs per loop
sage: %timeit testTripleGet(D,L)
625 loops, best of 3: 17.5 µs per loop
sage: %timeit testTripleSet(L)
625 loops, best of 3: 79.2 µs per loop

Doctest speed

Patched

sage -t  "devel/sage-main/sage/modules/free_module.py"
         [11.4 s]
sage -t  "devel/sage-main/sage/modules/free_module.py"
         [11.7 s]

Unpatched

sage -t  "devel/sage-main/sage/modules/free_module.py"
         [11.7 s]
sage -t  "devel/sage-main/sage/modules/free_module.py"
         [11.5 s]

Conclusion

Using weak references, things become a bit slower, but it is a lot better than with the previous patches. According to the timing of the doc tests, the regression doesn't matter in applications.

I guess there is no free lunch, and thus the regression is small enough, given that the memory leak is fixed (which is checked in a new test).

I have not run the full test suite yet, but I think it can be reviewed.

Apply trac715_one_triple_dict.patch

Changed 8 years ago by SimonKing

Drop the distinction of TripleDict versus TripleDictById. Use the memory addresses as dictionary keys

comment:127 Changed 8 years ago by SimonKing

make ptest reported only one error, and the error was in fact a misprint. Hence, I have updated my patch, and with it, all tests should pass.

comment:128 Changed 8 years ago by jpflori

Here are finally some first timings for 'make ptest' (this time I first checked that the memory leak is actually fixed...)

  • sage-5.0.prealpha1 vanilla: 937.4 sec
  • sage-5.0.prealpha1 + 715: 948.8 sec

No errors for both. I'll report on make ptestlong tomorrow, try to check that actions do not get continuously deleted and finally review the code.

comment:129 follow-up: Changed 8 years ago by jpflori

Running "make ptestlong" gave me:

  • sage-5.0.prealpha1 vanilla: 1397.9 sec
  • sage-5.0.prealpha1 + 715: 1415.0 sec

with no errors for both (I noticed that testing sandpile.py was horribly long with my previous install of sage-5.0.prealpha0, something like 1350 sec for it alone; in between I've updated my Ubuntu and recompiled everything for prealpha1, so that might explain why my new timings are so much faster).

Hence no regression!

comment:130 in reply to: ↑ 129 Changed 8 years ago by SimonKing

Replying to jpflori:

Running "make ptestlong" gave me:

  • sage-5.0.prealpha1 vanilla: 1397.9 sec
  • sage-5.0.prealpha1 + 715: 1415.0 sec

I'm glad that this time (in contrast to what I did in #9138 and was fixing in #11900) it seems that I am not creating a terrible slow-down!

comment:131 Changed 8 years ago by SimonKing

I found another memory leak:

sage: K = GF(1<<55,'t')
sage: for i in range(50):
....:     a = K.random_element()
....:     E = EllipticCurve(j=a)
....:     b = K.has_coerce_map_from(E)
....:     
sage: import gc
sage: gc.collect()
0

Namely, K.coerce_map_from(E) stores the resulting map (or None) in a strong dictionary.

Several questions: Would it suffice to change the dictionary into a WeakKeyDictionary? If so: would it cause a regression? I guess the answer to the second question is "yes", since getting an item out of a weak key dictionary is quite slow and requesting a coerce map is a very frequent operation.

So, I suppose one could introduce another type of dictionary, analogous to TripleDict, which would not test for equality but for identity.

But should this be done here or on a new ticket? I think the patch here is big enough; hence, do it on a different ticket, but you can try to convince me to do it here.

Since the patchbot tried to use the wrong patches:

Apply trac715_one_tripledict.patch

comment:132 Changed 8 years ago by SimonKing

I don't know why the patchbot keeps trying to apply all patches.

Anyway. First experiments show that a MonoDict (which would be my name for a dictionary that uses weak keys, compares the keys by identity and expects a single item as a key) is a lot faster than a usual dictionary, if the keys are frequently used parents such as finite fields. "A lot" means: more than 20 times faster.

I will simply try whether things still work when I replace dictionaries by MonoDict in the coercion model. If they do, I'll post here. If there are difficult problems, I'll move it to a different ticket.

comment:133 Changed 8 years ago by jpflori

I'd say we'd better put your MonoDict fix in another ticket, even if no difficult problems arise, to keep the patch readable enough and the problems clearly separated.

And close this one asap... sorry, I should be the one finally reviewing your code (I already checked for speed regressions and the actual fix of the leak, as mentioned above), but I do not have much time these days.

I'd say I'll do that on Thursday (at worst, I hope), as there is a Sage meeting in Paris that day.

comment:134 Changed 8 years ago by SimonKing

See #12313 for the other memleak.

comment:135 Changed 8 years ago by SimonKing

You raised the question whether actions (and perhaps maps as well) are garbage collected too often. I inserted some lines of code into the init methods of sage.categories.map.Map and sage.categories.action.Action that count how often the init method is called (namely by appending one character to some file on my disk). Then, I ran the doctests in sage.schemes. Result:

With #11780 only

  • 76102 maps
  • 41381 actions
  • 647.3 seconds

With #11780 and #715

  • 76192 maps
  • 46157 actions
  • 658 seconds

So, actions are created about 10% more often than without the patch, while the speed regression is not so dramatic.

Two explanations:

  1. These 10% of actions would have been needed, and it is bad that they were garbage collected.
  2. One file contains many tests, and often these tests are quite similar. In particular, many actions will occur in many different tests. Without the patch, the actions created by the first test are strongly cached and are thus still available for the second, third, ... test. But with the patch, the actions created by the first test will be garbage collected when the first test is done. Hence, it is good that they were garbage collected.

In order to find out whether 1. or 2. is the correct explanation, I'll determine the number of maps and actions created in "single" computations, namely in the benchmarks discussed at #11900.

comment:136 Changed 8 years ago by SimonKing

Here is some more data. In all cases, I give the number of maps and actions created, first with #11780 only and then with #11780+#715.

First test: Start Sage!

-> 191 maps, 44 actions versus 191 maps, 44 actions. Fine!

Second test:

E = J0(46).endomorphism_ring()
g = E.gens()

-> 597 maps, 320 actions versus 611 maps, 481 actions. That's about 50% more actions and is thus not good.

Third test:

L = EllipticCurve('960d1').prove_BSD()

-> 3550 maps, 97 actions versus 3550 maps, 97 actions. Fine!

Fourth test:

E = EllipticCurve('389a')
for p in prime_range(10000):
        if p != 389:
                G = E.change_ring(GF(p)).abelian_group()

-> 14969 maps, 9884 actions versus 14969 maps, 9885 actions. Fine!

Question to the reviewer: How bad do you think the "missing action" in the second example is? Would it be worthwhile to fix it in the method E.gens?

Do you even think I should try to modify TripleDict so that a list of strong references is kept, but with a maximal length (popping the oldest references from the list when new ones are appended)? In that way, one could extend the lifetime of the cache, while avoiding unbounded memory growth.

It is a shame that Python only has strong and weak references, but no soft references!
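Python indeed lacks Java-style soft references, but the bounded-strong-reference idea sketched above can be approximated with a fixed-size deque of strong references next to a weak-value dictionary (illustrative code, not part of any patch):

```python
import gc
import weakref
from collections import deque

class Value:
    """Stand-in for a cached object such as an action."""

class SoftishCache:
    """Weak-value cache plus a bounded buffer of strong references:
    the N most recently used values are kept alive, older ones may
    be garbage collected, roughly imitating soft references."""
    def __init__(self, maxlen=10):
        self._data = weakref.WeakValueDictionary()
        self._recent = deque(maxlen=maxlen)  # oldest strong ref is dropped

    def __setitem__(self, key, value):
        self._data[key] = value
        self._recent.append(value)

    def __getitem__(self, key):
        value = self._data[key]       # KeyError if already collected
        self._recent.append(value)    # mark as recently used again
        return value

cache = SoftishCache(maxlen=10)
for i in range(50):
    cache[i] = Value()
gc.collect()
print(len(cache._data))   # only the 10 most recent survive (CPython)
```

Memory use stays bounded by maxlen, while recently used entries survive between consecutive lookups.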

comment:137 Changed 8 years ago by SimonKing

  • Cc robertwb added

At sage-devel, Robert Bradshaw suggested the following benchmark, measuring the impact of the new TripleDict on multiplication of integers with RDF (which does involve actions and thus does involve lookup in TripleDict):

sage: def test(n):
....:     a = Integer(10)
....:     b = QQ(20)
....:     s = RDF(30)
....:     for x in xrange(10**n):
....:         s += a*b*x
....:

With Sage-5.0.prealpha0+#11780:

sage: %time test(6)
CPU times: user 7.25 s, sys: 0.04 s, total: 7.29 s
Wall time: 7.31 s

and with the patch from here added

sage: %time test(6)
CPU times: user 7.29 s, sys: 0.01 s, total: 7.31 s
Wall time: 7.31 s

So, yet another supporting data point!

comment:138 follow-up: Changed 8 years ago by SimonKing

Question: How urgent do you see implementing a ring buffer for TripleDict? Namely, right now, I'd prefer to work on #12313. Since #12313 changes sage/structure/coerce_dict.pxd, it would probably be easier for me to coordinate work by postponing the ring buffer to a different ticket (or perhaps introduce it at #12313?).

What do you think?

comment:139 in reply to: ↑ 138 ; follow-up: Changed 8 years ago by jpflori

Replying to SimonKing:

Question: How urgent do you see implementing a ring buffer for TripleDict? Namely, right now, I'd prefer to work on #12313. Since #12313 changes sage/structure/coerce_dict.pxd, it would probably be easier for me to coordinate work by postponing the ring buffer to a different ticket (or perhaps introduce it at #12313?). What do you think?

I think we'd better close this one asap, especially now that it seems that no speed regression occurs, and provide a speed-up in a subsequent ticket (as you did for #9138 and #11900 and a couple of other tickets).

Of course one could argue that we get no speed regression because we go faster when accessing the dicts but delete actions more often, so the situation for object creation is not exactly as before; but I do not think anybody or any function relied on the lifetime of these objects (or should have).

If you do agree, I'll review the ticket tomorrow as I already planned to do and mentioned a few comments above.

comment:140 in reply to: ↑ 139 Changed 8 years ago by SimonKing

Replying to jpflori:

If you do agree, I'll review the ticket tomorrow

Thank you! Yes, I'd prefer it that way. Having the ring buffer means modifying coerce_dict.pxd, which essentially means recompiling almost the whole Sage library, and that takes almost an hour on my laptop. So, it is better for me to not switch back and forth between #715 and #12313.

comment:141 Changed 8 years ago by jpflori

  • Description modified (diff)

comment:142 follow-up: Changed 8 years ago by jpflori

  • Status changed from needs_review to needs_info

I've finally read your code and have to say: bravo!

However I've got one request, or rather one question.

With the current implementation, Actions always use a weak ref for the underlying set so that it can and will be garbage collected if it is not strong refed elsewhere.

You illustrate and mention that in some examples in action.pyx.

You also modify an example involving Action and MatrixSpace? to make sure that no gc occurs.

I do not think this is the right solution, I mean that the user should be able to use Action has before (and anyway it does not feel right to me that you can create something that can magically disappear).

You could also argue that nobody actually uses Actions directly (I do not for example :) ), those who do will have to be careful.

I see two solutions:

  • Add a big fat warning in Action documentation (red, in a bloc, at the start, etc.)
  • Implement somehow an option to choose whether to use weak references (which will be set for the coercion model) or strong ones (set by default, so the "normal" previous behaviour remains the default). It basically means passing an additional boolean that guides the construction of underlying_set, is saved, and modifies the behaviour of underlying_set() (i.e. whether to call the reference or not).

What does everybody think?

comment:143 Changed 8 years ago by jpflori

Note to myself: could use type(E) rather than importing the AbelianGroup... type (with its very long name) as Simon did in #12313.

(type(E) is not the Abelian... type, but the memory leak can be tested with it as well)

This should make the example more understandable.

The same remark applies to the ticket about homsets.

comment:144 in reply to: ↑ 142 Changed 8 years ago by SimonKing

Replying to jpflori:

With the current implementation, Actions always use a weak reference for the underlying set, so that it can and will be garbage collected if it is not strongly referenced elsewhere.

You illustrate and mention that in some examples in action.pyx.

You also modify an example involving Action and MatrixSpace to make sure that no gc occurs.

Or rather: That it does not occur too late.

I do not think this is the right solution; I mean that the user should be able to use Action as before (and anyway, it does not feel right to me that you can create something that can magically disappear).

I believe that it is fine. Namely, what use would an action have if you do not have any other strong reference to the underlying set S?

That's to say: you forgot S and all of its elements. But what use would an action on S be if you cannot even provide a single element of S?

You could also argue that nobody actually uses Actions directly (I do not for example :) ), those who do will have to be careful.

I think so.

  • Add a big fat warning in Action documentation (red, in a bloc, at the start, etc.)

OK, that would need more than the short remarks in my added examples.

  • Implement somehow an option to choose whether to use weak references (which will be set for the coercion model) or strong ones (set by default, so the "normal" previous behaviour remains the default). It basically means passing an additional boolean that guides the construction of underlying_set, is saved, and modifies the behaviour of underlying_set() (i.e. whether to call the reference or not).

One could store the underlying set S either by

    self.S = weakref.ref(S)

resulting in a weak reference, or by

    self.S = ConstantFunction(S)

resulting in a strong reference.

The advantage is that underlying_set() could remain as it is. In particular, we don't need to make the syntax (return self.S versus return self.S()) depend on any parameter used during initialisation. Note that calling a ConstantFunction takes almost no time.

However, it might even be faster to do

    if self.use_weak_references:
        return self.S()
    else:
        return self.S

where self.use_weak_references is a cdef bint parameter assigned during initialisation.

I can't test it right now.
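The two storage options can be sketched side by side in pure Python; ConstantFunction here is a minimal stand-in for sage.misc.constant_function.ConstantFunction, and the Action class is illustrative, not Sage's:

```python
import gc
import weakref

class ConstantFunction:
    """Minimal stand-in for sage.misc.constant_function.ConstantFunction."""
    def __init__(self, value):
        self._value = value
    def __call__(self):
        return self._value

class ActionSketch:
    """Illustrative Action: self.S is always *called*, so
    underlying_set() has a single code path for both storage modes."""
    def __init__(self, S, use_weak_references=True):
        if use_weak_references:
            self.S = weakref.ref(S)       # weak: the set may disappear
        else:
            self.S = ConstantFunction(S)  # strong: the set is kept alive

    def underlying_set(self):
        return self.S()

class MySet:
    """Stand-in for the underlying set of an action."""

s1, s2 = MySet(), MySet()
strong = ActionSketch(s1, use_weak_references=False)
weak = ActionSketch(s2, use_weak_references=True)
del s1, s2
gc.collect()
print(strong.underlying_set())   # still alive
print(weak.underlying_set())     # None: it was garbage collected
```

This keeps underlying_set() uniform, at the cost of one cheap call in the strong case, exactly the trade-off discussed above.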

comment:145 follow-up: Changed 8 years ago by jpflori

I'll have some time to work on this today or friday.

Any progress on your side ?

For example, implementing my preferred solution with "use_weak_references"? :)

comment:146 in reply to: ↑ 145 Changed 8 years ago by SimonKing

Replying to jpflori:

I'll have some time to work on this today or friday.

Any progress on your side ?

For example, implementing my preferred solution with "use_weak_references"? :)

No. Currently, I focus on computing Ext algebras of finite dimensional path algebra quotients (that's what I get my money for), and to fix my old group cohomology spkg (which wouldn't work with the most recent version of Sage for at least three independent reasons).

comment:147 Changed 8 years ago by jpflori

  • Keywords Cernay2012 added

I've posted a first draft of a patch to make the use of weakrefs optional (I did not add doc, nor change the tests added or modified by Simon yet).

I've surely forgotten some places where actions are defined, etc.

After doing that, I've begun thinking that Simon is right and that Actions are too closely tied to the coercion system for this approach to be valid.

Maybe using weakrefs all the time, even though objects can become unusable is good enough.

Changed 8 years ago by jpflori

Make use of weakrefs optional: off by default, on for coercion

comment:148 Changed 8 years ago by jpflori

Some further thoughts:

  • Currently my piece of code does not take into account classes overriding get_action.
  • For this approach to be consistent, I guess the get and discover action methods should by default return strongly referenced actions, so we should also add optional arguments to all of them...

comment:149 Changed 8 years ago by jpflori

This last idea won't really be consistent anyway, because the get_action function caches its result in _action_hash...

So I'm now quite convinced that one should use weak refs all the time and that providing documentation about that is sufficient.
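The caching problem described here can be sketched in a few lines of plain Python (the names get_action and _action_cache are illustrative, not Sage's actual API): once a result is cached under a key that does not include the reference-semantics flag, later calls requesting different semantics silently get the first result back.

```python
# Hypothetical sketch: a cache keyed only on (group, set, operation)
# hands back whatever action was built first, so a per-call flag asking
# for different reference semantics is ignored on every cache hit.
_action_cache = {}

def get_action(G, S, op, use_weak_references=False):
    key = (G, S, op)
    if key not in _action_cache:
        _action_cache[key] = {"action": (G, S, op),
                              "weak": use_weak_references}
    return _action_cache[key]

first = get_action("ZZ", "A", "*", use_weak_references=True)
second = get_action("ZZ", "A", "*", use_weak_references=False)
assert first is second          # the cached object is returned
assert second["weak"] is True   # the second call's flag had no effect
```

This is why making the flag optional per call cannot be consistent without also keying (or bypassing) the cache on it.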

comment:150 Changed 8 years ago by jpflori

I'm finally trying to add some doc to this ticket and realized that in the matrix.action file you state the usual spiel about underlying sets eventually getting garbage collected.

However, this is not the case in your examples, for the good reason that matrix spaces are cached.

I'll try to provide an example where matrices act on something that is not cached, and we've won ourselves a new ticket where your constructions should be used to cache objects :)
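The kind of check the doctests on this ticket perform (counting live instances via gc.get_objects()) can be reduced to a minimal sketch. The class name here is a hypothetical stand-in for a parent that is not held in any global cache; for a cached class like matrix spaces, the final assertions would fail, which is exactly the pitfall noted above.

```python
import gc
import weakref

class UncachedParent:
    """Hypothetical stand-in for a parent not held in a global cache."""
    pass

P = UncachedParent()
tracker = weakref.ref(P)

# While P is strongly referenced, it appears among the live objects:
assert any(isinstance(x, UncachedParent) for x in gc.get_objects())

del P
gc.collect()

# With no cache (and no action) holding a strong reference, the
# instance is gone and the weak reference is dead:
assert tracker() is None
assert not any(isinstance(x, UncachedParent) for x in gc.get_objects())
```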

Changed 8 years ago by jpflori

Reviewer patch; added doc

comment:151 Changed 8 years ago by jpflori

  • Status changed from needs_info to needs_review

comment:152 Changed 8 years ago by jpflori

  • Description modified (diff)

I've added warning blocks at the top of the files modified by Simon (and fixed minor typos without introducing new ones, I hope). The generated doc looks ok.

All tests pass on my computer, and the numerical evidence we've gathered so far indicates that there is no speed regression.

If Simon or someone else could have a look at my "reviewer patch", this can be put to positive review.

Personally, I'm happy with Simon's patches.

comment:153 Changed 8 years ago by SimonKing

  • Reviewers set to Jean-Pierre Flori
  • Status changed from needs_review to positive_review

Hi Jean-Pierre,

your reviewer patch looks fine to me! Thank you for fixing the typos and explaining things a bit more clearly!

So, I change it into "positive review", naming you as a reviewer.

comment:154 Changed 8 years ago by jpflori

Great!

And sorry for the delay. I'll try to tackle the related tickets this afternoon.

comment:155 Changed 8 years ago by jdemeyer

  • Dependencies changed from #9138, #11900 to #9138, #11900, #11599
  • Status changed from positive_review to needs_work

This seems to conflict with #11599. With #11599 applied, I get doctest errors:

sage -t  -force_lib devel/sage/sage/structure/coerce_dict.pyx
**********************************************************************
File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/devel/sage-main/sage/structure/coerce_dict.pyx", line 210:
    sage: from sage.schemes.generic.homset import SchemeHomsetModule_abelian_variety_coordinates_field
Exception raised:
    Traceback (most recent call last):
      File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/local/bin/ncadoctest.py", line 1231, in run_one_test
        self.run_one_example(test, example, filename, compileflags)
      File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/local/bin/sagedoctest.py", line 38, in run_one_example
        OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags)
      File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/local/bin/ncadoctest.py", line 1172, in run_one_example
        compileflags, 1) in test.globs
      File "<doctest __main__.example_3[33]>", line 1, in <module>
        from sage.schemes.generic.homset import SchemeHomsetModule_abelian_variety_coordinates_field###line 210:
    sage: from sage.schemes.generic.homset import SchemeHomsetModule_abelian_variety_coordinates_field
    ImportError: cannot import name SchemeHomsetModule_abelian_variety_coordinates_field
**********************************************************************
File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/devel/sage-main/sage/structure/coerce_dict.pyx", line 211:
    sage: LE = [x for x in gc.get_objects() if  isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)]
Exception raised:
    Traceback (most recent call last):
      File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/local/bin/ncadoctest.py", line 1231, in run_one_test
        self.run_one_example(test, example, filename, compileflags)
      File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/local/bin/sagedoctest.py", line 38, in run_one_example
        OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags)
      File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/local/bin/ncadoctest.py", line 1172, in run_one_example
        compileflags, 1) in test.globs
      File "<doctest __main__.example_3[34]>", line 1, in <module>
        LE = [x for x in gc.get_objects() if  isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)]###line 211:
    sage: LE = [x for x in gc.get_objects() if  isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)]
    NameError: name 'SchemeHomsetModule_abelian_variety_coordinates_field' is not defined
**********************************************************************
File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/devel/sage-main/sage/structure/coerce_dict.pyx", line 212:
    sage: len(LE)    # indirect doctest
Exception raised:
    Traceback (most recent call last):
      File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/local/bin/ncadoctest.py", line 1231, in run_one_test
        self.run_one_example(test, example, filename, compileflags)
      File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/local/bin/sagedoctest.py", line 38, in run_one_example
        OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags)
      File "/mnt/usb1/scratch/jdemeyer/merger/sage-5.0.beta8/local/bin/ncadoctest.py", line 1172, in run_one_example
        compileflags, 1) in test.globs
      File "<doctest __main__.example_3[35]>", line 1, in <module>
        len(LE)    # indirect doctest###line 212:
    sage: len(LE)    # indirect doctest
    NameError: name 'LE' is not defined
**********************************************************************

comment:156 Changed 8 years ago by jpflori

SchemeHomsetModule_abelian_variety_coordinates_field was indeed renamed to SchemeHomset_points_abelian_variety_field in #11599.

We have two solutions:

  • do the same renaming in the doctests here
  • use the EllipticCurve class, which provides basically the same test (that's the one I originally pointed out) and which I find more explicit.

I'll provide a patch for this second solution.

comment:157 Changed 8 years ago by jpflori

Except that the changes introduced in #11599 seem to break the work done here by reintroducing some caching...

comment:158 Changed 8 years ago by jpflori

More precisely, both of my proposed solutions fix the import error (with EllipticCurve_finite_field for the second one), but then LE still has length 50, so no garbage collection occurred.

comment:159 Changed 8 years ago by jpflori

Applying #11521 solves the problem again, so I guess this ticket and #11521 should be merged at once, as they depend on each other.

comment:160 Changed 8 years ago by jpflori

Here comes a patch.

Changed 8 years ago by jpflori

Rebase on top of #11599, now circularly depends on #11521

comment:161 Changed 8 years ago by jpflori

  • Dependencies changed from #9138, #11900, #11599 to #9138, #11900, #11599, #11521
  • Description modified (diff)
  • Status changed from needs_work to needs_review

comment:162 follow-up: Changed 8 years ago by davidloeffler

A data point that might be helpful: all doctests pass on 5.0.beta10 on 64-bit Linux with the following patch queue (qseries):

trac715_one_triple_dict.patch
trac_715-reviewer.patch
trac_715-rebase_11599.patch
trac11521_triple_homset.patch
trac_11521-reviewer.patch

What is there here that still needs review? I can confirm that the change in jpflori's reviewer patch does not affect the doctest, in the sense that the new patched doctest fails without this ticket applied but succeeds with it. Is this ready to go in?

comment:163 in reply to: ↑ 162 Changed 8 years ago by SimonKing

Replying to davidloeffler:

What is there here that still needs review? I can confirm that the change in jpflori's reviewer patch does not affect the doctest, in the sense that the new patched doctest fails without this ticket applied but succeeds with it. Is this ready to go in?

From my perspective, it is. But I think I am not entitled to set it to positive review, since Jean-Pierre did not explicitly state that he gives his OK.

comment:164 Changed 8 years ago by jpflori

Oh, that's my bad, I just wanted to be sure that Simon was ok with my rebase... (and did not want to set it back to positive review because I did the rebase myself)

Sorry about that!

comment:165 Changed 8 years ago by jpflori

  • Status changed from needs_review to positive_review

And I'm putting the ticket back to positive review because the three of us seem happy with it.

comment:166 Changed 8 years ago by davidloeffler

  • Description modified (diff)