Opened 5 years ago

Closed 4 years ago

Last modified 4 years ago

#12313 closed defect (fixed)

Fix yet another memory leak caused by caching of coercion data

Reported by: SimonKing Owned by:
Priority: major Milestone: sage-5.8
Component: memleak Keywords: coercion weak dictionary
Cc: jpflori, vbraun, robertwb, AlexanderDreyer, PolyBoRi, nthiery Merged in: sage-5.8.beta2
Authors: Simon King, Jean-Pierre Flori Reviewers: Simon King, Jean-Pierre Flori, John Perry, Nils Bruin
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: #715, #11521, #12215, #13746, #13378; merge with #13387 Stopgaps:

Description (last modified by jdemeyer)

The following could not be fixed in #715:

sage: K = GF(1<<55,'t')
sage: for i in range(50):
....:     a = K.random_element()
....:     E = EllipticCurve(j=a)
....:     b = K.has_coerce_map_from(E)
....:     
sage: import gc
sage: gc.collect()
0

The problem is that every parent has a dictionary that stores every coerce map ending at this parent (and a different dictionary for conversion maps). The keys are given by the domains of the maps. So, in the example above, the field K has an attribute that is a dictionary whose keys are the different elliptic curves.

In coercion, it is usually best to compare parents not by equality but by identity. Therefore, I suggest implementing a container called MonoDict that works similarly to the new weak TripleDict (see #715), but takes a single item as its key.

First tests show that one can actually gain a lot of speed: MonoDict can access its items much faster than a usual dictionary.
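As a rough pure-Python illustration of the idea (a sketch only, not the proposed MonoDict implementation), compare an ordinary dictionary, which calls the key's __hash__ and __eq__ and keeps the key alive, with a cache keyed by id(), which needs only a pointer comparison and can hold the key through a weak reference:

import weakref

equality_cache = {}                 # ordinary dict: uses __hash__/__eq__ of the key, strong key
identity_cache = {}                 # keyed by id(): pointer comparison, weak key

K = GF(1 << 55, 't')
equality_cache[K] = "coerce map"    # K stays alive as long as this cache does
identity_cache[id(K)] = (weakref.ref(K), "coerce map")
# K can now be garbage collected; the stale entry must then be removed by a
# weakref callback, which is what the proposed MonoDict takes care of.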

Apply

Attachments (27)

trac12313_mono_dict.patch (30.8 KB) - added by SimonKing 5 years ago.
Weak coerce map dictionary with weakly referenced unique keys and optimised access speed
trac12313_polynomial_template_dealloc.patch (1.1 KB) - added by SimonKing 5 years ago.
Catch an attribute error that may occur during deallocation of polynomials
trac_12313-rebase_11599.patch (965 bytes) - added by jpflori 5 years ago.
Rebase on top of #11599
trac_12313-cparent.patch (33.7 KB) - added by jpflori 5 years ago.
Add a cdef'ed reference to the parent in each polynomial
trac_12313-reviewer.patch (20.6 KB) - added by jpflori 5 years ago.
Reviewer patch; added doc, some formatting changes
trac_12313-cparent-rebase_5.0.beta10.patch (35.5 KB) - added by jpflori 5 years ago.
Add a cdef'ed reference to the parent in each polynomial; rebased on top of sage-5.0.beta10
trac_12568-reviewer.patch (962 bytes) - added by jpflori 5 years ago.
Reviewer patch; added doc, some formatting changes, problematic test
trac_12313-mono_dict+modulo.patch (30.8 KB) - added by jpflori 5 years ago.
Weak coerce map dictionary with weakly referenced unique keys and optimised access speed
trac_12313-reviewer+modulo.patch (21.1 KB) - added by jpflori 5 years ago.
Reviewer patch; added doc, some formatting changes, problematic test
trac_12313-modulo.patch (4.7 KB) - added by jpflori 5 years ago.
C modulo operator problem
trac_12313-mono_dict-take2.patch (30.6 KB) - added by jpflori 5 years ago.
trac_12313-rebase_11599.2.patch (964 bytes) - added by jpflori 5 years ago.
trac_12313-rebase_11599-take2.patch (964 bytes) - added by jpflori 5 years ago.
trac_12313-cparent-take2.patch (35.5 KB) - added by jpflori 5 years ago.
trac_12313-reviewer-take2.patch (14.5 KB) - added by jpflori 5 years ago.
trac_12313-size_t-take2.patch (7.1 KB) - added by jpflori 5 years ago.
trac_12313-mono_dict-combined.patch (68.8 KB) - added by SimonKing 5 years ago.
Weak coerce map dictionary with weakly referenced unique keys and optimised access speed, plus the necessary changes in cparent to make things work, plus reviewer patch.
trac_12313-mono_dict-combined-random.patch (68.8 KB) - added by jpflori 5 years ago.
Flaky doctest marked as random.
MonoDictNoRefCache.patch (22.7 KB) - added by nbruin 5 years ago.
Example implementation of a MonoDict storing the weakrefs in the buckets as well
idkey_dict (24.3 KB) - added by SimonKing 5 years ago.
Experimental implementation of IdKeyDict
safeweakdict.patch (6.4 KB) - added by nbruin 5 years ago.
Trying to explicitly tear down WeakValueDicts in sage
pbori.diff (1.9 KB) - added by jpflori 5 years ago.
Modify ReductionStrategy deallocation.
trac_12313_quit_sage.patch (2.4 KB) - added by SimonKing 5 years ago.
Do a garbage collection before closing the interfaces. Do not delete a borrowed reference to reduction strategies in pbori
trac_12313-safer_tripledicteraser.patch (1.4 KB) - added by nbruin 4 years ago.
make TripleDictEraser a little safer
trac_12313-safe_callback.patch (1.5 KB) - added by SimonKing 4 years ago.
Making sure Python attributes exist in a weakref callback
trac_12313-mono_dict-combined-random-sk.patch (68.4 KB) - added by SimonKing 4 years ago.
Combined patch - only apply this patch
trac_12313-revert_callback_from_11521.patch (1015 bytes) - added by SimonKing 4 years ago.
Use weak reference without callback for homsets (reverting a part of #11521)


Change History (361)

comment:1 follow-up: Changed 5 years ago by SimonKing

Memory leaks are quite tricky!!

The post here is mainly a memo for myself, documenting what went wrong in the last couple of days.

I tried to directly modify my code from #715. The approach was: TripleDict or MonoDict should store the data in lists, with some list items being the address of the keys; and in a separate dictionary, a weakref.KeyedRef with a callback function is preserved.

And what happened when I replaced Parent._coerce_from_hash by a MonoDict? A memory leak was created by using weak references! Namely,

  • There is a strong reference to weakref.KeyedRef.
  • The weakref.KeyedRef has a strong reference to the callback function
  • The callback function is an instance of something that I'll call MonoDictEraser (similar to the TripleDictEraser from #715)
  • The Eraser has a strong reference to the MonoDict
  • The MonoDict has a strong reference to the coerce map.
  • The coerce map has a strong reference to domain and codomain
  • Hence, domain and codomain can't be collected.

I think the best place to break the reference chain is from the Eraser to the MonoDict. Perhaps one could do the same thing at #715, rather than having a weak reference to the underlying set of an action?
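A minimal pure-Python sketch of that suggestion (a toy layout, not the actual MonoDict code; here the KeyedRef is stored in the cache itself rather than in a separate dictionary): the eraser holds only a weak reference to the cache, so it cannot root the chain cache -> value -> key.

import weakref

class MonoDictEraser(object):
    # Callback attached to each KeyedRef: it removes the corresponding entry
    # when the key is garbage collected.  It keeps only a WEAK reference to
    # the cache, which is precisely where the chain is broken.
    def __init__(self, cache):
        self._cache_ref = weakref.ref(cache)

    def __call__(self, keyed_ref):
        cache = self._cache_ref()
        if cache is not None:              # the cache may itself be gone already
            cache._data.pop(keyed_ref.key, None)

class ToyMonoDict(object):
    # Toy cache with identity-compared, weakly referenced keys.
    def __init__(self):
        self._data = {}                    # id(key) -> (KeyedRef, value)
        self._eraser = MonoDictEraser(self)

    def __setitem__(self, key, value):
        ref = weakref.KeyedRef(key, self._eraser, id(key))
        self._data[id(key)] = (ref, value)

    def __getitem__(self, key):
        ref, value = self._data[id(key)]
        if ref() is not key:               # id() values may be reused
            raise KeyError(key)
        return value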

comment:2 in reply to: ↑ 1 Changed 5 years ago by SimonKing

Replying to SimonKing:

Perhaps one could do the same thing at #715, rather than having a weak reference to the underlying set of an action?

Nope! There, the situation was different, since there is a strong reference to the TripleDict anyway (namely in the coercion model), so that a weak reference from the eraser won't suffice.

comment:3 Changed 5 years ago by SimonKing

  • Description modified (diff)

comment:4 Changed 5 years ago by SimonKing

  • Authors set to Simon King
  • Status changed from new to needs_review

With the attached patch (to be applied after #715), a new kind of weak dictionary is introduced, and used to cache coercions and conversions. All tests pass for me.

Concerning raw speed: Without the patch, we have

sage: RDF.coerce_map_from(ZZ)
Native morphism:
  From: Integer Ring
  To:   Real Double Field
sage: timeit("RDF.coerce_map_from(ZZ)",number=10^5)
100000 loops, best of 3: 518 ns per loop

With the patch, it is 466 ns; hence, one gains about 10% in a very frequent operation.

The total speed seems to be fine as well, make ptest has not been slower than before.

Concerning memory leaks:

sage: import gc
sage: _ = gc.collect()
sage: K = GF(1<<55,'t')
sage: for i in range(50):
...     a = K.random_element()
...     E = EllipticCurve(j=a)
...     b = K.has_coerce_map_from(E)
sage: _ = gc.collect()
sage: len([x for x in gc.get_objects() if isinstance(x,type(E))])
1

That is a new doctest and tests against a memory leak that has not been fixed by #715.

comment:5 Changed 5 years ago by SimonKing

I have extended my patch by another coercion-related application of MonoDict.

I found that sage.categories.groupoid.Groupoid often results in a memory leak. For example, by #715, polynomial rings are only weakly cached. But if the groupoid of a polynomial ring is created, then a strong reference to the polynomial ring is created, preventing it from being garbage collected. And a groupoid is created as soon as there is an action of that polynomial ring on another ring. Hence, a memory leak.

The new patch version overrides the __classcall__ method of sage.categories.groupoid.Groupoid (which would normally be inherited from UniqueRepresentation) with a method that uses a MonoDict for caching. In addition, there is now only a weak reference to the underlying set of a groupoid. As a consequence, if there is no other strong reference to P, then Groupoid(P) can be garbage collected, while a strong reference to P prevents Groupoid(P) from being garbage collected.

Still needs review.

comment:6 Changed 5 years ago by jpflori

  • Status changed from needs_review to needs_work

Could you add the second problem fixed to the ticket description?

Maybe this should be split into two tickets to be more readable, or at least two patches.

The title is a little vague as well :)

Sorry for the delay on this and #715 and #11521, but this is now my top priority (for Sage).

comment:7 Changed 5 years ago by jpflori

And what if a user directly uses a Groupoid object?

Will the underlying set be garbage collected as in #715 for Actions?

Another reason to split this ticket in two (once again :) )

comment:8 Changed 5 years ago by jpflori

To finish my thoughts for now, we could as well leave the tickets as they are (except for some more doc and easy fixing) and potentially implement the other behavior (where a choice is possible between strong and weak refs) in a later ticket.

The current behavior of raising an error (provided the message describes how to avoid garbage collection, in addition to saying that the object got garbage collected) seems viable enough to me (unless many people actually do use Groupoids and Actions directly).

comment:9 Changed 5 years ago by SimonKing

Perhaps you are right, and we could re-organise stuff. Namely, for now (that's to say: for #715 and for #12313) we could keep strong references from an action to the underlying set, and from a groupoid to the underlying set. The unfortunate consequence would be that not all memory leaks would be fixed. But one could create a new ticket where the idea of having weak references to the underlying set is tested.

And here is yet another idea, at least for the groupoid case. Let S be any object and G its groupoid. Ideally, S should keep G alive and G should keep S alive. Hence, it would be fine to have a strong reference from S to G and from G to S -- Python's garbage collection can deal with circular references.

However, there should be no strong references to S or G unless the user explicitly provides them. In particular, when creating Groupoid(S), no external cache should be used.

Potential solution: The cache for the Groupoid.__classcall__ should be in S! Perhaps like this:

class Groupoid(Category):
    @staticmethod
    def __classcall__(cls, S):
        try:
            return getattr(S, '_cached_'+cls.__name__)
        except AttributeError:
            pass
        instance = type.__call__(cls, S)
        assert(isinstance(instance,cls))
        if instance.__class__.__reduce__ == UniqueRepresentation.__reduce__:
            instance._reduction = (cls, (S,), {})
        try:
            setattr(S, '_cached_'+cls.__name__, instance)
            return instance
        except AttributeError:
            pass
        try:
            # Parents have a __cached_methods attribute, by #11115
            S.__cached_methods['_cached_'+cls.__name__] = instance
            return instance
        except AttributeError:
            pass
        raise TypeError, "%s instances must either allow attribute assignment or be instances of %s" % (cls, Parent)

Note that the problem can hardly be solved in the same way for actions. They need to be stored externally, namely in the coercion model.

comment:10 Changed 5 years ago by SimonKing

  • Work issues set to Split ticket

I think it is best to follow your suggestion and split stuff.

Soon (perhaps tomorrow) I'll return to the old patch, i.e. without touching the groupoid stuff. In a to-be-created ticket, I'll try the idea for caching groupoids that avoids memory leaks but uses strong references, as explained in my previous post.

However, I think the weak references to the underlying sets of actions are an integral part of #715 - without them, the memory leak would not be fixed at all; so I'd not split #715.

comment:11 Changed 5 years ago by SimonKing

Python's name mangling sucks.

Name mangling occurs when an attribute is requested whose name starts with two underscores and ends with at most one underscore. Thus, it happens when S.__cached_methods is requested in the code above.

However, for a reason that I don't understand, Python does not use the name of the class of S for mangling, but the name of cls -- for example, if S==ZZ then Python is looking for S._Groupoid__cached_methods, not S._<name of class of S>__cached_methods. In addition, since __cached_methods is a cpdef attribute, it should not be subject to name mangling at all.

I'll see if doing things in Cython helps.
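For illustration (plain Python, independent of the Sage classes involved): mangling is applied at compile time, based on the class body in which the attribute access textually appears, not on the runtime class of the object.

class Container(object):
    def __init__(self):
        self.__cached_methods = {}      # mangled to self._Container__cached_methods

class Groupoid(object):
    @staticmethod
    def fetch(S):
        # This line appears inside the class body of Groupoid, so it is
        # compiled as S._Groupoid__cached_methods, whatever the class of S is.
        return S.__cached_methods

S = Container()
S._Container__cached_methods            # exists
Groupoid.fetch(S)                       # raises AttributeError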

Changed 5 years ago by SimonKing

Weak coerce map dictionary with weakly referenced unique keys and optimised access speed

comment:12 Changed 5 years ago by SimonKing

  • Status changed from needs_work to needs_review
  • Work issues Split ticket deleted

I have updated my patch, so that groupoids are not touched any more. Needs review, again.

Groupoids are now dealt with at #12357 (needs review), and I think the solution provided there is much better than the improper use of weak references.

comment:13 Changed 5 years ago by SimonKing

  • Status changed from needs_review to needs_work
  • Work issues set to Find out why some attribute of a parent is deleted before the parent's elements are deleted

I am not totally sure whether the following problem is caused by this patch or by #715 or #11521. Anyway, with all three patches applied, I obtain:

sage: K.<z> = GF(4)
sage: P.<x> = K[]
sage: del P
sage: del x
sage: import gc
sage: gc.collect()
Exception AttributeError: 'PolynomialRing_field_with_category' object has no attribute '_modulus' in  ignored
148

As you can observe, some attribute error is ignored, but worse: Cython is unable to tell where it is ignored.

The cython-users helped me to track the problem down, using gdb.

The attribute error occurs in the following cdef method defined in sage/rings/polynomial/polynomial_zz_pex.pyx:

cdef cparent get_cparent(parent) except? NULL:
    if parent is None:
        return NULL
    cdef ntl_ZZ_pEContext_class c
    c = parent._modulus
    return &(c.x)

By the way it is defined, the error is propagated. However, get_cparent is in fact called in a deallocation method, namely in sage/rings/polynomial/polynomial_template.pxi:

    def __dealloc__(self):
        """
        EXAMPLE::

            sage: P.<x> = GF(2)[]
            sage: del x
        """
        celement_destruct(&self.x, get_cparent((<Polynomial_template>self)._parent))

That is interesting for two reasons: (1) Cython can not tell the location of the error, which I think is a bug in Cython. (2) The attribute _modulus of self._parent is destroyed before self is destroyed.

I find the latter rather strange. As long as self is alive, self._parent cannot be deallocated; but if self._parent cannot be deallocated, why is self._parent._modulus deleted??

comment:14 Changed 5 years ago by SimonKing

Sorry, I gave an incomplete link: The discussion at cython-users is here.

comment:15 Changed 5 years ago by SimonKing

I added some print statements into the init and dealloc methods of ntl_ZZ_pEContext (namely, it is the type of self._parent._modulus). However, even though self._parent._modulus is gone, the deallocation method of self._parent._modulus is not called.

Still, of course one could add a try-except clause to the deallocation of a polynomial template. But it should simply not be the case that an attribute of an existing object disappears!

Changed 5 years ago by SimonKing

Catch an attribute error that may occur during deallocation of polynomials

comment:16 Changed 5 years ago by SimonKing

  • Description modified (diff)
  • Status changed from needs_work to needs_review

I have attached a second patch that catches the attribute error that may occur during deallocation.

Here is why I think the problem arose in the first place. The error occurs in a __dealloc__ method of a polynomial. The __dealloc__ method is, as far as I know, called after deletion of the Python attributes. In particular, since self._parent can be garbage collected with my patches (without #715, #11521 and #12313, it would stay in memory forever), self._parent._modulus is deleted before calling the polynomial's __dealloc__. Hence, the error.

I guess, the cleanest solution would be to keep a direct pointer to self._parent._modulus as an attribute of self. But that would create additional memory consumption. And since, by the example below, no memory leak is created with my second patch, I think the solution of simply catching the attribute error is good enough.

With the second patch, the above example works without a warning. Moreover, it seems to come without a speed penalty:

sage: K.<z> = GF(4)
sage: def test(K):
....:    for n in xrange(10000):
....:        P.<x> = K[]
....:        del P
....:        del x
sage: %timeit test(K)
5 loops, best of 3: 376 ms per loop

with the second patch, but 427 ms without it.

I also tested whether the second patch introduces a hidden memory leak (after all, when the attribute error occurs for P, the deallocation of x is interrupted): In the above example, test(K) does not seem to leak.

sage: test(K)
sage: get_memory_usage()
825.640625
sage: test(K)
sage: get_memory_usage()
825.640625

Needs review, then!

Apply trac12313_mono_dict.patch trac12313_polynomial_template_dealloc.patch

comment:17 follow-up: Changed 5 years ago by vbraun

You are correct, in __dealloc__ it is not safe to look at any Python attribute. Usually they are still around, especially if they are being kept alive by some other reference. But not always, and there is no way to tell if they are.

You are, in fact, leaking memory: celement_destruct() is never called if the AttributeError is raised. I guess it only happens rarely in your test loop, so you don't notice it in the total memory? A while ago (when we were both in Kaiserslautern, actually) I fixed essentially the same error in multi_polynomial_libsingular.pyx: To deallocate a singular ring element you need the singular ring, but the Python parent can be gone by then. There, I added a new cdef pointer to the singular ring to the ring elements.

Usually in OOP you are not allowed to throw exceptions from a destructor. Same with C++, for example.
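The pattern vbraun describes can be sketched in Cython as follows (a toy ElementSketch, not the actual polynomial_template.pxi code; the real version would store the cparent handle returned by get_cparent and call celement_construct/celement_destruct): the handle needed for deallocation is copied to a C-level attribute when the element is created, and __dealloc__ touches only that cached data.

from libc.stdlib cimport malloc, free

cdef class ElementSketch:
    cdef double* x         # stand-in for the element's C data (celement x)
    cdef void* _cparent    # stand-in for the parent's C-level handle (cparent)
    cdef object _parent

    def __cinit__(self, parent):
        self._parent = parent
        # Resolve the Python attribute ONCE, while the parent is fully alive;
        # in polynomial_template.pxi this would be get_cparent(parent).
        self._cparent = NULL
        self.x = <double*> malloc(sizeof(double))

    def __dealloc__(self):
        # Only cached C-level data is used here: no Python attribute lookup on
        # self._parent, so nothing can raise even if the parent is already
        # half torn down.
        if self.x != NULL:
            free(self.x)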

comment:18 in reply to: ↑ 17 ; follow-up: Changed 5 years ago by SimonKing

  • Work issues Find out why some attribute of a parent is deleted before the parent's elements are deleted deleted

Replying to vbraun:

You are, in fact, leaking memory: celement_destruct() is never called if the AttributeError is raised.

Sure. But to my surprise, it isn't noticed in that loop. Surprise, because...

I guess it only happens rarely in your test loop, so you don't notice it in the total memory?

... I thought it would be called in each round of the loop.

A while ago (when we were both in Kaiserslautern, actually) I fixed essentially the same error in multi_polynomial_libsingular.pyx: To deallocate a singular ring element you need the singular ring, but the Python parent can be gone by then. There, I added a new cdef pointer to the singular ring to the ring elements.

So, you agree with the statement from my previous post that keeping a reference from the element to the modulus would be the cleanest solution.

Shall I modify sage/rings/polynomial/polynomial_template.pxi accordingly?

But there would be a memory penalty (one additional pointer per polynomial) and also a time penalty (the pointer needs to be assigned).

Usually in OOP you are not allowed to throw exceptions from a destructor.

Sure. The main problem for debugging was that Cython was not able to provide the location of the error: Note that it just says "Attribute error in ignored". In fact, it tries to obtain the string representation of the polynomial, but it fails for the same reason: The parent's modulus is already gone, and without the modulus, the string representation of the element can't be determined.

comment:19 in reply to: ↑ 18 Changed 5 years ago by vbraun

Replying to SimonKing:

So, you agree with the statement from my previous post that keeping a reference from the element to the modulus would be the cleanest solution.

Yes, agreed.

In terms of space, adding a C pointer on top of a Python object is nothing. And I don't think that the time to store the address is measurable. It might even be free, depending on the CPU's out-of-order execution. You'll certainly make up the time by not resolving a Python attribute in the destructor.

comment:20 Changed 5 years ago by SimonKing

I had a look at the code, and I think it would be quite some work to rewrite sage/rings/polynomial/polynomial_template.pxi so that cparent (I guess that is what we really want) is not (only) stored as an attribute of the parent but mainly as a cdef attribute of a polynomial.

I think I have currently no resources to spare, and better focus on the project I'm getting money for. So, for now, I leave it like this.

comment:21 Changed 5 years ago by jpflori

FYI, I'm working on a cleaner solution to the cparent stuff, implementing Simon's idea of an attribute in each element.

comment:22 follow-up: Changed 5 years ago by jpflori

  • Dependencies changed from #715 to #715, #11599

Here comes a first draft of the patch.

It adds a cdef'ed cparent attribute and makes sure it is set to the right value everywhere it should be.

Or at least everywhere needed to pass "make ptest" on my computer (once the ticket is rebased on top of #11599).

Some points to clarify:

  1. PY_NEW could be replaced by .__new__  (possible from Cython >= 0.12)
  2. I put (<Polynomial_template>self)._cparent wherever access to _cparent is needed, somewhat inspired by how (<Polynomial_template>self)._parent was accessed, but I feel that just self._cparent without the cast is enough, isn't it? If it is, the code would be more readable (and scripting the change is trivial).

comment:23 in reply to: ↑ 22 Changed 5 years ago by SimonKing

Replying to jpflori:

Here comes a first draft of the patch.

Where?

comment:24 Changed 5 years ago by jpflori

It's coming :) Just a second.

And by the way :

In fact there is (at least) a problem left.

After applying the "cparent" patch a 1 becomes a 2 in some output of L.stats() in coerce_dict.pyx.

We have to investigate if this is expected or if I screwed up something.

Changed 5 years ago by jpflori

Rebase on top of #11599

Changed 5 years ago by jpflori

Add a cdef'ed reference to the parent in each polynomial

comment:25 Changed 5 years ago by jpflori

  • Authors changed from Simon King to Simon King, Jean-Pierre Flori
  • Description modified (diff)
  • Status changed from needs_review to needs_work
  • Work issues set to minor formatting issues, L.stats() doctest problem

comment:26 Changed 5 years ago by jpflori

Strange: I've popped and pushed and rebuilt Sage, and now the coerce_dict doctest is not failing anymore... I'm running a new "make ptest" to be sure nothing else went wrong.

comment:27 Changed 5 years ago by jpflori

Oh, and I should add Simon's test back in.

comment:28 Changed 5 years ago by SimonKing

"Should add" means that the current patches are not ready for review yet?

comment:29 Changed 5 years ago by jpflori

Unfortunately yes. Anyway, I'm also adding some doc and removing some spaces from the coerce_dict file, but if you want to have a look at the cparent stuff, please do, at least just to let me know how you feel about it. This is not supposed to change that much (unless I remove all the Polynomial_template casts).

By the way, why did you remove the len and _get_buckets methods from the TripleDict class but leave them in the MonoDict one? I feel len is not needed (and I am currently rebuilding Sage without it), and _get_buckets is just for debugging...

Anyway, I'll have to go now, so I'm not sure I'll be able to finish it tonight.

comment:30 Changed 5 years ago by jpflori

Sorry about the len stuff above: I did not see that TripleDict actually extends MonoDict in the pxd file, which is what I suspected but could not find, so I was confused.

comment:31 follow-up: Changed 5 years ago by jpflori

  • Status changed from needs_work to needs_review
  • Work issues minor formatting issues, L.stats() doctest problem deleted

Here comes a "reviewer" patch, quickly produced, so potentially full of mistakes. And it adds Simon's original test for polynomial stuff.

There might still be a problem with L.stats() in coerce_dict.pyx. I'm not sure whether it behaves the same when I run it alone ("sage -t ...") or within "make ptest".

comment:32 Changed 5 years ago by jpflori

  • Description modified (diff)

Changed 5 years ago by jpflori

Reviewer patch; added doc, some formatting changes

comment:33 in reply to: ↑ 31 Changed 5 years ago by SimonKing

Replying to jpflori:

There might still be a problem with L.stats() in coerce_dict.pyx.

Concerning stats(): Since weak references are used, the stats can certainly change when a garbage collection happens. In order to make such a test fail-safe, one could switch garbage collection off at the beginning of the "flaky" test (gc.disable()) and switch it back on at the end (gc.enable()).
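A sketch of that pattern, assuming L is the MonoDict from the existing coerce_dict.pyx doctest:

sage: import gc
sage: gc.disable()          # cyclic garbage collection is off until gc.enable()
sage: L.stats()             # min, avg, max (bucket length)
(0, 0.06451..., 1)
sage: gc.enable()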

comment:34 follow-up: Changed 5 years ago by SimonKing

PS: I think the gc.disable/enable trick should only be used if the test really remains problematic.

comment:35 in reply to: ↑ 34 Changed 5 years ago by jpflori

Replying to SimonKing:

PS: I think the gc.disable/enable trick should only be used if the test really remains problematic.

Of course.

I'm not at work anymore so I can't confirm that before tomorrow morning (GMT+2).

comment:36 follow-up: Changed 5 years ago by jpflori

I was wrong about the error itself. On my machine (amd64, latest (quite old) Ubuntu LTS, sage-5.0.beta7, several tickets applied...), I indeed get one error in coerce_dict.pyx when running "make ptest", but not when running only "./sage -t -force_lib -long devel/sage/sage/structure/coerce_dict.pyx", and I could reproduce this behavior several times.

The failing doctest is

    sage: list(sorted(L.iteritems()))
Expected:
    [(-15, 3), ('a', 1), ('ab', 2)]
Got:
    [('a', 1), ('ab', 2), (-15, 3)]

I'm currently building sage-5.0.beta10 on the same computer and a different one to check that it's not just something I messed up.

comment:37 in reply to: ↑ 36 ; follow-up: Changed 5 years ago by SimonKing

Replying to jpflori:

sage: list(sorted(L.iteritems()))

How can the sorting of integers and strings depend on whether or not you are testing in parallel? That's very strange.

comment:38 in reply to: ↑ 37 Changed 5 years ago by SimonKing

Replying to SimonKing:

How can the sorting of integers and strings depend on whether or not you are testing in parallel? That's very strange.

In the worst case, we could transform the two lists of pairs into sets, and compare the sets.

comment:39 Changed 5 years ago by jpflori

My cparent patch has to be rebased on top of sage-5.0.beta10 because of #12093.

This is done and I'm currently rebuilding beta10 with #715, #11521 and this ticket applied.

I'll report asap.

comment:40 Changed 5 years ago by jpflori

Running "make ptest" and "./sage -t .../coerce_dict.pyx" twice each on the same computer as before on top of sage-5.0.beta10 + #715 + #11521 + this ticket resulted in the same behavior.

Strange indeed, I'll try now on another computer.

Changed 5 years ago by jpflori

Add a cdef'ed reference to the parent in each polynomial; rebased on top of sage-5.0.beta10

comment:41 Changed 5 years ago by jpflori

  • Description modified (diff)

comment:42 Changed 5 years ago by jpflori

I get the same kind of discrepancy on another computer (intel quadcore 2, amd64, debian experimental) on top of sage-5.0.beta10 + #715 + #11521 + this ticket.

In fact, on this other installation, I got the same failing doctest about list item ordering when running "make ptest" (checked three times); and then the "stats()" failure when running "./sage -t .../coerce_dict.pyx", but only the first of ten times (and not the other failure, so the same behavior as on the other computer), and no other error.

So I guess the stats() error can really be ignored; the other one is stranger. Can someone else confirm this?

comment:43 Changed 5 years ago by defeo

I think the problem is independent from these tickets.

I added these two lines in a clean sage 4.7.2 tree:

--- a/sage/structure/coerce_dict.pyx
+++ b/sage/structure/coerce_dict.pyx
@@ -35,7 +35,9 @@
-    
+
+        sage: list(sorted([('a', 1), (-15,3), ('ab', 2)]))
+        [(-15, 3), ('a', 1), ('ab', 2)]

and ran a make ptest. It failed with this error:

File "/home/defeo/sage-4.7.2/devel/sage-main/sage/structure/coerce_dict.pyx", li
ne 39:
    sage: list(sorted([('a', 1), (-15,3), ('ab', 2)]))
Expected:
    [(-15, 3), ('a', 1), ('ab', 2)]
Got:
    [('a', 1), ('ab', 2), (-15, 3)]

I think some doctest hidden in some file tested before coerce_dict.pyx may be redefining sorted. If this bug is confirmed in newer releases, we should make a separate ticket.

comment:44 Changed 5 years ago by jpflori

This is indeed really strange and I agree that it should be dealt with somewhere else. I'll post a slightly modified "reviewer" patch taking care of this issue by replacing the (useless) call to list in front by set.

To summarize yesterday's investigation, we confirmed the bug when running "make ptest" on:

  • one computer running either beta10
  • two other computers running beta11
  • one other computer running beta10 and beta11

All of them have quite or very recent Intel amd64 CPUs running more or less recent Debian or Ubuntu OSes. On all these computers the test passed when running "sage -t devel/sage/sage/structure/coerce_dict.pyx". I also checked on some of them that running "sage -tp 0 devel/sage/sage" does NOT raise the error, so the culprit is to be found in some magic of the "ptest" target.

Finally, I could also NOT reproduce the bug on one last Core i7 (Lenovo X220) computer running beta11 (but there I had some doctest failure in fast_digraph.pyx or something like that; I'm not sure of the name).

comment:45 Changed 5 years ago by jpflori

By the way, "make test" seems to be broken... (and that's the only target really different from the other test ones).

Update: This is #12568.

Last edited 5 years ago by jpflori

Changed 5 years ago by jpflori

Reviewer patch; added doc, some formatting changes, problematic test

comment:46 Changed 5 years ago by jpflori

I've just updated the "reviewer" patch. On my side, I'm ok with Simon's patches. So if someone else is ok with mine, feel free to set the ticket to positive review.

comment:47 Changed 5 years ago by SimonKing

  • Status changed from needs_review to needs_work

Too bad. With this ticket and its dependencies applied, I get at least

sage -t  -force_lib "devel/sage/sage/structure/coerce_dict.pyx"
**********************************************************************
File "/mnt/local/king/SAGE/stable/sage-5.0.beta10-gcc/devel/sage/sage/structure/coerce_dict.pyx", line 735:
    sage: len(LE)    # indirect doctest
Expected:
    1
Got:
    50
**********************************************************************
File "/mnt/local/king/SAGE/stable/sage-5.0.beta10-gcc/devel/sage/sage/structure/coerce_dict.pyx", line 305:
    sage: len(LE)    # indirect doctest
Expected:
    1
Got:
    51
**********************************************************************
2 items had failures:
   1 of  37 in __main__.example_19
   1 of  39 in __main__.example_6

and the test suite isn't finished yet.

So, apparently the fixed memory leak crept back in at a different ticket.

comment:48 Changed 5 years ago by jpflori

This is really strange.

I can assure you that we did run the tests on at least 4 different computers yesterday on sage-5.0.beta11 + #715 + #11521 + this ticket and got no such failures... I also observed the same correct behavior on top of beta10 before. I've also been playing this morning with "make test" and everything was fine on top of sage-5.0.beta11.

Did you try this on top of a fresh install of sage-5.0.beta10?

Last edited 5 years ago by jpflori

comment:49 Changed 5 years ago by SimonKing

It has not been on top of a clean installation, I am afraid: I have #11768, #12641, #7797, #7797, and #12313 applied on top of sage-5.0.beta10.

So, I need to find out where the change in caching came from.

comment:50 Changed 5 years ago by jpflori

I guess you also have #715 and #11521 (just to be sure...) ?

comment:51 Changed 5 years ago by SimonKing

I do have #715 (the second copy of #7797 was a typo), but indeed I do not have #11521. My bad.

comment:52 Changed 5 years ago by SimonKing

  • Status changed from needs_work to needs_review

comment:53 Changed 5 years ago by jpflori

  • Dependencies changed from #715, #11599 to #715, #11521, #11599

Changed 5 years ago by jpflori

Weak coerce map dictionary with weakly referenced unique keys and optimised access speed

Changed 5 years ago by jpflori

Reviewer patch; added doc, some formatting changes, problematic test

Changed 5 years ago by jpflori

C modulo operator problem

comment:54 Changed 5 years ago by jpflori

  • Description modified (diff)

comment:55 Changed 5 years ago by jpflori

  • Status changed from needs_review to needs_work

I've messed up something while rebasing the patch over the new version of #715, whence needs work.

comment:56 Changed 5 years ago by jpflori

The problem seems to arise from the last patch (I've not checked yet that with all but that last one, there's no problem on 32 bits).

comment:57 Changed 5 years ago by jpflori

Ok, I think I get it. The hash value is stored, but I moved the modulo operation out to make clear what could go wrong, thus modifying the hash value which is stored for MonoDict objects.

comment:58 Changed 5 years ago by SimonKing

Same question as with #715: Isn't there a Py_size_t that one should use to replace Py_ssize_t?

comment:59 Changed 5 years ago by jpflori

Some clarifications: even without the last patch here, the first 3 make Sage segfault on 32 bits (although it seems fine on 64 bits, as I reported yesterday). I'll try Simon's solution now.

Changed 5 years ago by jpflori

Changed 5 years ago by jpflori

Changed 5 years ago by jpflori

Changed 5 years ago by jpflori

Changed 5 years ago by jpflori

Changed 5 years ago by jpflori

comment:60 Changed 5 years ago by jpflori

  • Description modified (diff)
  • Status changed from needs_work to needs_review

comment:61 Changed 5 years ago by jpflori

With the new patches of #715 + #11521 + #12313, Sage starts on both 32 and 64 bits, and make test passes on 64 bits (32 bits not finished yet).

Changed 5 years ago by SimonKing

Weak coerce map dictionary with weakly referenced unique keys and optimised access speed, plus the necessary changes in cparent to make things work, plus reviewer patch.

comment:62 Changed 5 years ago by SimonKing

  • Description modified (diff)
  • Reviewers set to Simon King

I have attached a combined patch, obtained by folding all the patches that previously had to be applied. In particular, it contains the fix "use size_t instead of Py_ssize_t".

For me (64 bit), all tests pass (thus, positive review from me for the parts of code written by Jean-Pierre), but we should wait for Jean-Pierre's results on 32 bit.

For the patchbot:

Apply trac_12313-mono_dict-combined.patch

comment:63 Changed 5 years ago by jpflori

Mmm, I think the tests I described in #715 in fact also included this ticket.

That is sage-5.0.beta13 + #715 + #11521 + #12313

The 4 failing doctests did not look obviously related to these tickets.

They might be a consequence of Sage running within a virtual machine (which is quite slow for me).

comment:64 Changed 5 years ago by strogdon

From sage-devel there is a need for a 32-bit test. I have 64-bit hardware but am running 32 bits:

Linux blitzen 2.6.32-5-686 #1 SMP Mon Jan 16 16:04:25 UTC 2012 i686 Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz GenuineIntel GNU/Linux

From hg qapplied on top of sage-5.0.beta13 I have:

trac_715_combined.patch
trac_11521_homset_weakcache_combined.patch
trac_12313-mono_dict-combined.patch

After sage -br all long tests pass here.

comment:65 follow-up: Changed 5 years ago by jdemeyer

  • Status changed from needs_review to needs_work

I get a failure on boxen (x86_64 Linux):

sage -t  -force_lib devel/sage/sage/structure/coerce_dict.pyx
**********************************************************************
File "/padic/scratch/jdemeyer/merger/sage-5.1.beta0/devel/sage-main/sage/structure/coerce_dict.pyx", line 260:
    sage: L.stats()             # min, avg, max (bucket length)
Expected:
    (0, 0.06451..., 1)
Got:
    (0, 0.06451612903225806, 2)
**********************************************************************

comment:66 in reply to: ↑ 65 ; follow-up: Changed 5 years ago by SimonKing

Replying to jdemeyer:

I get a failure on boxen (x86_64 Linux): sage -t -force_lib devel/sage/sage/structure/coerce_dict.pyx

Amazing (because I have x86_64 Linux as well). Does it also arise if only #715 and #11521 are applied?

comment:67 follow-up: Changed 5 years ago by jpflori

I already reported such an error in http://trac.sagemath.org/sage_trac/ticket/12313#comment:24

I did not really pay attention because I could only reproduce it once on two fresh install of Sage.

It seemed to have been caused by the cparent stuff...

comment:68 in reply to: ↑ 67 Changed 5 years ago by SimonKing

Replying to jpflori:

I already reported such an error in http://trac.sagemath.org/sage_trac/ticket/12313#comment:24

I did not really pay attention because I could only reproduce it once on two fresh install of Sage.

It seemed to have been caused by the cparent stuff...

OK. But I wonder: Is it reliably reproducible on the same machine? Namely, as I said, I have never seen the "2" on my own machine.

What shall we do about that error? After all, it is about how the data are distributed in the hash table, and the hash is clearly machine dependent (since the hash depends on the address of the object).

Hence, the output of "stat()" is machine dependent. Perhaps we could mark the output "random"?

comment:69 in reply to: ↑ 66 Changed 5 years ago by jdemeyer

Replying to SimonKing:

Amazing (because I have x86_64 Linux as well). Does it also arise if only #715 and #11521 are applied?

The test which fails is about stuff added in this ticket, so this question is not relevant.

comment:70 follow-up: Changed 5 years ago by jdemeyer

It does not fail consistently, it succeeds most of the time. By chance, it failed on me the first two times I tried (so I assumed it always failed).

comment:71 in reply to: ↑ 70 ; follow-up: Changed 5 years ago by SimonKing

Replying to jdemeyer:

It does not fail consistently, it succeeds most of the time. By chance, it failed on me the first two times I tried (so I assumed it always failed).

OK, so it is random.

Since the purpose of that method is merely diagnostics, I suggest to remove the problem by marking the test "random".

comment:72 in reply to: ↑ 71 ; follow-up: Changed 5 years ago by jdemeyer

Replying to SimonKing:

Since the purpose of that method is merely diagnostics, I suggest to remove the problem by marking the test "random".

I think it's dubious that the test has educational value, so you might as well remove the test completely.

comment:73 in reply to: ↑ 72 Changed 5 years ago by SimonKing

Replying to jdemeyer:

I think it's dubious that the test has educational value, so you might as well remove the test completely.

Removing the test or removing the method? I think that the method has certainly been useful while developing the code, but perhaps it is now of no use?

comment:74 Changed 5 years ago by jdemeyer

I meant removing the test, but I don't really care.

Changed 5 years ago by jpflori

Flaky doctest marked as random.

comment:75 follow-ups: Changed 5 years ago by john_perry

My machine has tested about 1/3 of the directories when running sage -t sage/. So far, there is one doctest error:

sage -t  "devel/sage-main/sage/misc/sagedoc.py"             
**********************************************************************
File "/home/sage-5.0.beta13/devel/sage-main/sage/misc/sagedoc.py", line 566:
    sage: 'abvar/homology' in _search_src_or_doc('doc', 'homology', 'variety', interact=False)
Expected:
    True
Got:
    False
**********************************************************************
1 items had failures:
   1 of   7 in __main__.example_8
***Test Failed*** 1 failures.

I will update as time permits. I am not running long tests; let me know if I should.

comment:76 Changed 5 years ago by john_perry

Another:

sage -t  "devel/sage-main/sage/modular/local_comp/type_space.py"
*** *** Error: TIMED OUT! PROCESS KILLED! *** ***

	 [360.5 s]

comment:77 in reply to: ↑ 75 ; follow-up: Changed 5 years ago by jpflori

Replying to john_perry:

My machine has tested about 1/3 of the directories when running sage -t sage/. So far, there is one doctest error:

sage -t  "devel/sage-main/sage/misc/sagedoc.py"             
**********************************************************************
File "/home/sage-5.0.beta13/devel/sage-main/sage/misc/sagedoc.py", line 566:
    sage: 'abvar/homology' in _search_src_or_doc('doc', 'homology', 'variety', interact=False)
Expected:
    True
Got:
    False
**********************************************************************
1 items had failures:
   1 of   7 in __main__.example_8
***Test Failed*** 1 failures.

I will update as time permits. I am not running long tests; let me know if I should.

I also got that one within my virtual machine, but did not worry too much because I cannot even think how this could be related to coercion and caching. In fact, I think it also happened without #12313 applied, with only #715 and #11521. I'll check without any of them right now (I should have done this before, but VirtualBox is horribly slow for some reason and it's a pain to rebuild Sage and retest it).

comment:78 in reply to: ↑ 77 Changed 5 years ago by SimonKing

Replying to jpflori:

I also got that one within my virtual machine, but did not worry too much because I cannot even think how this could be related to coercion and caching.

I cannot think of a relation, either.

comment:79 in reply to: ↑ 75 ; follow-up: Changed 5 years ago by jdemeyer

Replying to john_perry:

My machine has tested about 1/3 of the directories when running sage -t sage/. So far, there is one doctest error:

sage -t  "devel/sage-main/sage/misc/sagedoc.py"             
**********************************************************************
File "/home/sage-5.0.beta13/devel/sage-main/sage/misc/sagedoc.py", line 566:
    sage: 'abvar/homology' in _search_src_or_doc('doc', 'homology', 'variety', interact=False)
Expected:
    True
Got:
    False
**********************************************************************
1 items had failures:
   1 of   7 in __main__.example_8
***Test Failed*** 1 failures.

This is a failure that you get if you don't build the Sage documentation. Did you run make and let it finish?

comment:80 Changed 5 years ago by jdemeyer

  • Status changed from needs_work to needs_review

comment:81 in reply to: ↑ 79 Changed 5 years ago by john_perry

Replying to jdemeyer:

This is a failure that you get if don't build the Sage documentation. Did you run make and let it finish?

Yes. I guess jpflori did, too, but he reports the same result.

comment:82 Changed 5 years ago by jpflori

I've just rebuilt (sage -br) beta13 after popping all the patches, and testing that file alone (sage -t ..../sagedoc.py) raises the same error. I'm now running "make" again to see if anything changes afterward.

In any case, this definitely looks unrelated.

comment:83 follow-up: Changed 5 years ago by jpflori

FYI, make failed while building the doc because of a lack of memory. I allocated 1 GB to the virtual machine; isn't that enough? Looking at the memory consumption in top, it goes up to 65% (for the python process alone, not counting other processes) at 98% of the "reading sources" stage and then fails (on the last test it was at sage/notebook/twist). Quite strangely, there was still some memory available, at least about 70 MB, not counting cached memory, but the system might have refused to allocate more to the python process.

I think that's the only reason for the above failure on my side, and the other ones I reported here (or in #715) on 32 bits.

comment:84 in reply to: ↑ 83 Changed 5 years ago by john_perry

  • Status changed from needs_review to needs_info

Replying to jpflori:

FYI, make failed while building the doc because of a lack of memory. I allocated 1 GB to the virtual machine; isn't that enough?

I have 512M on my machine, so if the memory is an issue, that would explain it. What seems odd to me is that, as I recall, sage didn't report errors when it finished building. How do I determine whether this is an issue? A naive search of install.log doesn't hit anything.

In any case, this definitely looks unrelated.

Is that the case with the timeout error, as well? If so, I can give it a positive review, since doctests have finished with only the two errors I mentioned above: sagedoc.py and timeout, the latter in several places. Unless, that is, you want me to run long tests as well. Let me know.

comment:85 Changed 5 years ago by SimonKing

  • Description modified (diff)

I just noticed that Jean-Pierre has provided a new combined patch, with randomisation of the "random" stats test. So, for the patchbot:

Apply trac_12313-mono_dict-combined-random.patch

comment:86 follow-up: Changed 5 years ago by jpflori

  • Status changed from needs_info to needs_review

At the end of "make", I can spot a myriad of warnings about objects.inv and more specifically at the end of the [reading sources] phase (which does not reach 100 %), I have an

Exception occured:
  File ...
OSError: [Errno 12] Cannot allocate memory

This is not logged into install.log but into dochtml.log.

And this also happens within the virtual machine without any patch applied...

comment:87 Changed 5 years ago by jpflori

By the way, it's quite awful that Sage builds correctly with 512 MB of memory, but the doc does not.

comment:88 Changed 5 years ago by jdemeyer

Independently of this ticket, I was already investigating the "docbuilding memory runaway" problem. I agree that there is indeed a problem.

comment:89 in reply to: ↑ 86 Changed 5 years ago by john_perry

Replying to jpflori:

  • Status changed from needs_info to needs_review

What needs review? The ..._random.patch? I think that was the one I checked; the problems were only with the timeout and sagedoc.py, which people seem to agree aren't a problem with this patch. I'll verify in a little while, just in case.

I'm happy to do it; I'm just not sure what it is needs doing now.

comment:90 Changed 5 years ago by jpflori

Nothing more. I was just waiting for John Cremona's report on sage-devel.

But we could as well put the ticket back to positive review; I'm quite convinced the problems we experienced on 32 bits are unrelated to the ticket itself.

comment:91 Changed 5 years ago by john_perry

  • Reviewers changed from Simon King to Simon King, John Perry
  • Status changed from needs_review to positive_review

comment:92 follow-up: Changed 5 years ago by cremona

Sorry for the delay, I left my laptop at home. I have just applied the patch (to 5.0-beta13) and am rebuilding, then will do a full test. It may not be finished until an overnight run...

comment:93 Changed 5 years ago by jdemeyer

  • Milestone changed from sage-5.0 to sage-5.1

comment:94 in reply to: ↑ 92 Changed 5 years ago by cremona

Replying to cremona:

Sorry for the delay, I left my laptop at home. I have just applied the patch (to 5.0-beta13) and am rebuilding, then will do a full test. It may not be finished until an overnight run...

All (long) tests passed. Let's hope 5.1 will not be too far down the line...

comment:95 Changed 5 years ago by jpflori

  • Reviewers changed from Simon King, John Perry to Simon King, Jean-Pierre Flori, John Perry

Just putting myself as reviewer because I originally reviewed Simon's code a while ago.

comment:96 Changed 5 years ago by jdemeyer

I get some rare doctest failures while testing a very preliminary version of sage-5.1.beta1. The annoying thing is that the failures aren't reproducible. The problems seem to be related to coercion. I'm only guessing that this ticket is to blame; it could also be #12877 or some other ticket listed in http://boxen.math.washington.edu/home/release/sage-5.1.beta1/tickets.html

All the observed failures have been within the sage/schemes/elliptic_curves directory.

Often one just gets a segmentation fault:

sage -t  --long -force_lib devel/sage/sage/schemes/elliptic_curves/heegner.py
Segmentation fault

	 [101.7 s]

But there are also failures of the form:

sage -t  --long -force_lib devel/sage/sage/schemes/elliptic_curves/ell_tate_curve.py
**********************************************************************
File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/devel/sage-main/sage/schemes/elliptic_curves/ell_tate_curve.py", line 332:
    sage: eq.parametrisation_onto_tate_curve(1+5+5^2+O(5^10))
Exception raised:
    Traceback (most recent call last):
      File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/ncadoctest.py", line 1231, in run_one_test
        self.run_one_example(test, example, filename, compileflags)
      File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/sagedoctest.py", line 38, in run_one_example
        OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags)
      File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/ncadoctest.py", line 1172, in run_one_example
        compileflags, 1) in test.globs
      File "<doctest __main__.example_12[3]>", line 1, in <module>
        eq.parametrisation_onto_tate_curve(Integer(1)+Integer(5)+Integer(5)**Integer(2)+O(Integer(5)**Integer(10)))###line 332:
    sage: eq.parametrisation_onto_tate_curve(1+5+5^2+O(5^10))
      File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/lib/python/site-packages/sage/schemes/elliptic_curves/ell_tate_curve.py", line 340, in parametrisation_onto_tate_curve
        un = u * q**(-(u.valuation()/q.valuation()).floor())
      File "element.pyx", line 1687, in sage.structure.element.RingElement.__mul__ (sage/structure/element.c:12755)
      File "coerce.pyx", line 797, in sage.structure.coerce.CoercionModel_cache_maps.bin_op (sage/structure/coerce.c:7555)
    TypeError: unsupported operand parent(s) for '*': '5-adic Ring with capped relative precision 20' and '5-adic Field with capped relative precision 20'
**********************************************************************
File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/devel/sage-main/sage/schemes/elliptic_curves/ell_tate_curve.py", line 529:
    sage: eq.parametrisation_onto_original_curve(1+5+5^2+O(5^10))
Exception raised:
    Traceback (most recent call last):
      File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/ncadoctest.py", line 1231, in run_one_test
        self.run_one_example(test, example, filename, compileflags)
      File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/sagedoctest.py", line 38, in run_one_example
        OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags)
      File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/bin/ncadoctest.py", line 1172, in run_one_example
        compileflags, 1) in test.globs
      File "<doctest __main__.example_17[3]>", line 1, in <module>
        eq.parametrisation_onto_original_curve(Integer(1)+Integer(5)+Integer(5)**Integer(2)+O(Integer(5)**Integer(10)))###line 529:
    sage: eq.parametrisation_onto_original_curve(1+5+5^2+O(5^10))
      File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/lib/python/site-packages/sage/schemes/elliptic_curves/ell_tate_curve.py", line 545, in parametrisation_onto_original_curve
        P = self.parametrisation_onto_tate_curve(u,prec=20)
      File "/Users/buildbot/build/sage/bsd-1/bsd_64_full/build/sage-5.1.beta1/local/lib/python/site-packages/sage/schemes/elliptic_curves/ell_tate_curve.py", line 340, in parametrisation_onto_tate_curve
        un = u * q**(-(u.valuation()/q.valuation()).floor())
      File "element.pyx", line 1687, in sage.structure.element.RingElement.__mul__ (sage/structure/element.c:12755)
      File "coerce.pyx", line 797, in sage.structure.coerce.CoercionModel_cache_maps.bin_op (sage/structure/coerce.c:7555)
    TypeError: unsupported operand parent(s) for '*': '5-adic Ring with capped relative precision 20' and '5-adic Field with capped relative precision 20'
**********************************************************************

I have also seen an error similar to the above but with multiplying RealNumbers, but I didn't copy the exact error message.

Last edited 5 years ago by jdemeyer

comment:97 Changed 5 years ago by jdemeyer

I have seen the above error on various systems: Linux, OS X, 32-bit, 64-bit so it seems unrelated to the operating system.

comment:98 follow-up: Changed 5 years ago by jpflori

Just to be sure: this didn't occur with beta0?

Because what is done here is somewhat similar to what is done in #715, which was merged in beta0.

comment:99 in reply to: ↑ 98 Changed 5 years ago by jdemeyer

Replying to jpflori:

Just to be sure: this didn't occur with beta0?

I have not seen any failures with #715, but I can't exclude it. I haven't found a way to reliably reproduce the failures.

I have seen the segmentation fault with heegner.py several times while testing sage-5.1.beta1.

comment:100 Changed 5 years ago by jpflori

FYI, I could reproduce the problem in ell_tate_curve on top of vanilla sage-5.1.beta0.

I'm also testing this on top of sage-5.0.beta13 plus all coercion patches and #12877 but got no failure so far (ran the test about 50 times, got the failure with 5.1.beta0 twice in 20 runs).

comment:101 Changed 5 years ago by jdemeyer

You mean sage-5.1.beta0 + #12313?

comment:102 follow-up: Changed 5 years ago by jpflori

Sorry about the above comment: I meant that this is reproducible on top of 5.0.beta13 + coercion patches and #12877, and not on top of 5.1.beta0

Last edited 5 years ago by jpflori

comment:103 in reply to: ↑ 102 Changed 5 years ago by jdemeyer

Replying to jpflori:

Sorry about the above comment: I meant that this is reproducible on top of 5.0.beta13 + coercion patches and #12877, and not on top of 5.1.beta0

I am still confused. You mean that it's not reproducible with sage-5.1.beta0 without any patches?

comment:104 Changed 5 years ago by SimonKing

Hi Jean-Pierre and Jeroen!

I am about to go to some conference and will probably not have much time to trace the problem down until Sunday.

Anyway, it would help me if you could state precisely in what setting you find the problem and in what setting you don't. By "precisely", I mean a list of patches, since I can't imagine what you mean by "coercion patches".

comment:105 follow-up: Changed 5 years ago by jpflori

Ok, I'll try to be clearer and avoid replacing random zeros by ones in the version numbers I write.

So far I could reproduce the problem in ell_tate_curve above (after about 10 runs, so quite often) on:

And could not after 100 runs of ell_tate_curve on:

  • 5.1.beta0 vanilla

I'm currently testing 5.1.beta0 + #12313 (after about 40 tries, no errors, which is quite disturbing).

comment:106 in reply to: ↑ 105 ; follow-up: Changed 5 years ago by SimonKing

Hi Jean-Pierre,

Replying to jpflori:

So far I could reproduce the problem in ell_tate_curve above (after about 10 runs, so quite often) on:

... I'm currently testing 5.1.beta0 + #12313 (after about 40 tries, no errors, which is quite disturbing).

Aha. So, seemingly it is the combination of the weak coercion caches from #715 and #11521, plus the weak unique parent caches from here. Is there any hint as to which objects are involved in the crashes?

My gut feeling is that actions are involved: By #715, they only store a weak reference to the underlying set (which I felt was dangerous, but was needed to fix some memory leak, and seemed justifiable after all). I could imagine that the underlying set of some action becomes garbage collectable with #12313.
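
To make the suspected scenario concrete, here is a minimal sketch (toy classes, not Sage's actual Action implementation) of how a weakly referenced underlying set could vanish from under a cached action:

# A minimal illustration (toy classes, not Sage's Action) of the failure mode:
# an object that keeps only a weak reference to its underlying set can outlive
# that set once nothing else references it.
import gc
import weakref

class UnderlyingSet(object):
    pass

class CachedAction(object):
    def __init__(self, S):
        self._S = weakref.ref(S)      # only a weak reference, as after #715

    def underlying_set(self):
        S = self._S()
        if S is None:
            raise RuntimeError("underlying set was garbage collected")
        return S

S = UnderlyingSet()
act = CachedAction(S)
del S
gc.collect()
act.underlying_set()                  # raises RuntimeError: the cached action is unusable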

comment:107 Changed 5 years ago by jpflori

Looking at where the error is produced by ell_tate_curve seems to indicate that the problem occurs while multiplying elements of the 5-adic ring and field.

I'm trying to build a simple loop that reproduces the error based on that, but have failed so far.

comment:108 Changed 5 years ago by jpflori

Another problem is that the bug seems to occur very rarely...

I've not seen it yet on top of 5.1.beta0 + #12313 although I have the feeling it should occur (or I'm wrong :) ).

Even on top of 5.0.beta13 + #715 + #11521 + #12313 it does not happen so frequently when looping over the tests of ell_tate_curve.

I'll now run some tests (I mean a LOT of tests) without #12313, just to be sure this is indeed the culprit.

comment:109 Changed 5 years ago by jpflori

Mmmm, this seems really tricky.

I've built 5.0.rc0 and applied #715, #11521, #12313, #11943 and #12357 and tried to reproduce the ell_tate_curve error, but failed after a few hundred tries.

So I decided to pop #12357 and #11943 and, tada! I got the error back quite quickly.

Of course, this could just be random luck... Or the changes introduced by #11943 could make the error even rarer.

Indeed, I finally reproduced the error with #11943 applied back, after additional tries.

After a few hundred tries, I'm still unable to reproduce the error on 5.1.beta0 + #12313 and 5.1.beta0 + #12313 + #12877... but that might happen after more tests.

comment:110 Changed 5 years ago by jpflori

Some more data... I've successfully run ell_tate_curve (i.e. without error) 10000 times on 5.0.rc0 plus #715, #11521, #12313, #11943 and #12357 over the night.

I've just retested it without #11943 and #12357 and got the failure after 19 and 13 tries respectively.

On 5.1.beta0 plus #12313 I got no failure after about 14200 runs of ell_tate_curve. I'm relaunching it with #12877 applied to see if I get more (un)lucky.

Anyway, I'll use 5.0.rc0 plus #715, #11521, #12313 as a base to try to investigate that.

comment:111 in reply to: ↑ 106 Changed 5 years ago by jpflori

Replying to SimonKing:

My gut feeling is that actions are involved: By #715, they only store a weak reference to the underlying set (which I felt was dangerous, but was needed to fix some memory leak, and seemed justifiable after all). I could imagine that the underlying set of some action becomes garbage collectable with #12313.

The problem occurs while multiplying an element of a ring and its field of fractions. I did not think it through at all, but my natural guess would be that this is done by (canonical?) coercion of the ring element into the field of fractions and then the action takes place there.

That would correspond to the "canonical_coercion" and PyObject_CallObject(...) calls within the bin_op call made by the coercion model, hence no actions involved, but I'm still confused by all of this ;)

comment:112 Changed 5 years ago by jdemeyer

  • Milestone changed from sage-5.1 to sage-pending

I'm going to move this to sage-pending for now, with the idea of doing some more tests later. In any case, this ticket will not be merged before sage-5.1.beta2.

comment:113 Changed 5 years ago by jdemeyer

  • Merged in set to sage-5.1.beta2
  • Milestone changed from sage-pending to sage-5.1
  • Resolution set to fixed
  • Status changed from positive_review to closed

I haven't seen any more errors, so maybe #11943 did fix this...

comment:114 Changed 5 years ago by jdemeyer

Well, #12357 still gives segmentation faults...

comment:115 follow-up: Changed 5 years ago by jdemeyer

I might have some bad news. It seems that this ticket might have caused the following rare failures (about 1% of the time for the first one; the second one is even rarer) on some machines while merging sage-5.2.beta0 (not sure which patch in sage-5.2.beta0 exposes it):

sage -t -force_lib devel/sage-main/sage/rings/padics/padic_base_leaves.py
**********************************************************************
File "/home/buildbot/build/sage/rosemary-1/rosemary_full/build/sage-5.2.beta0/devel/sage-main/sage/rings/padics/padic_base_leaves.py", line 326:
    sage: K.has_coerce_map_from(ZpCA(17,10))
Expected:
    True
Got:
    False
**********************************************************************

and

sage -t  --long -force_lib devel/sage/sage/rings/number_field/number_field.py
**********************************************************************
File "/home/buildbot/build/sage/mark-1/mark_full/build/sage-5.1.rc0/devel/sage-main/sage/rings/number_field/number_field.py", line 302:
    sage: RR.coerce_map_from(K)
Expected:
    Composite map:
      From: Number Field in a with defining polynomial x^3 - 2
      To:   Real Field with 53 bits of precision
      Defn:   Generic morphism:
              From: Number Field in a with defining polynomial x^3 - 2
              To:   Real Lazy Field
              Defn: a -> 1.259921049894873?
            then
              Conversion via _mpfr_ method map:
              From: Real Lazy Field
              To:   Real Field with 53 bits of precision
Got nothing
**********************************************************************

This means I will probably unmerge this from sage-5.1.

Last edited 5 years ago by jdemeyer (previous) (diff)

comment:116 in reply to: ↑ 115 ; follow-up: Changed 5 years ago by SimonKing

Replying to jdemeyer:

I might have some bad news. It seems that this might have caused the following rare (about 1% of the time) failure on some machines

I hate these sporadic errors! They are so difficult to debug! But if I recall correctly, Nicolas Thiéry posted an example in sage-combinat-devel exposing problems with the weak caching stuff. I hope I will at some point find the time to analyse his example. Right now, I can't.

comment:117 in reply to: ↑ 116 Changed 5 years ago by jdemeyer

Replying to SimonKing:

I hate these sporadic errors! They are so difficult to debug!

I know, it has happened to me too. The worst for me is when I discover such a sporadic error but I can't tell which ticket caused it.

Any chance that this error is also caused by #12313?

comment:118 Changed 5 years ago by jdemeyer

  • Merged in sage-5.1.beta2 deleted
  • Milestone changed from sage-5.1 to sage-5.2
  • Resolution fixed deleted
  • Status changed from closed to new

comment:119 Changed 5 years ago by SimonKing

Mental note for me: test whether #12486 and/or #12969 are related to the error here.

comment:120 Changed 5 years ago by SimonKing

  • Dependencies changed from #715, #11521, #11599 to #715, #11521, #11599, #12969
  • Status changed from new to needs_review

Any chance we could test whether #12969 fixes the problem? I added it as a dependency.

comment:121 Changed 5 years ago by SimonKing

  • Description modified (diff)

I had to rebase the patch (and rename it, because Jean-Pierre had submitted the previous version).

Apply trac_12313-mono_dict-combined-random-sk.patch

comment:122 Changed 5 years ago by SimonKing

FWIW: On bsd.math, all tests pass with

trac12969_fix_coercion_cache.patch
trac12215_weak_cached_function-sk.patch
trac12215_segfault_fixes.patch
trac_715_combined.patch
trac_11521_homset_weakcache_combined.patch
trac_12313-mono_dict-combined-random-sk.patch

on top of sage-5.2.rc0.

comment:123 Changed 5 years ago by SimonKing

  • Dependencies changed from #715, #11521, #11599, #12969 to #715, #11521, #11599, #12969, #11310
  • Status changed from needs_review to needs_work

It turns out that the current patch won't apply, due to #11310.

comment:124 Changed 5 years ago by SimonKing

  • Dependencies changed from #715, #11521, #11599, #12969, #11310 to #715, #11521, #11599, #12969
  • Status changed from needs_work to needs_review

Arrgh, sorry, I have posted on the wrong ticket. The patch seems fine.

Apply trac_12313-mono_dict-combined-random-sk.patch

comment:125 follow-ups: Changed 5 years ago by nbruin

sage/structure/coerce_dict.pyx:

line 96:

Perhaps you can explain the mechanism here. If I understand correctly, _refcache is a global dictionary keyed by the "key" part of KeyedRefs, with lists of KeyedRefs having that key as values. The deletion consists of looking up the key in the registered MonoDict and removing the entry if it's there (it could already have disappeared if the entry was explicitly deleted from the MonoDict, because MonoDict.__delitem__ doesn't bother updating _refcache).

It then proceeds to remove the now defunct KeyedRef instance from _refcache. But doesn't that mean that even if we end up with len(L)==0, the entry k : L remains in _refcache? That means we get a whole pile-up of integers and empty lists in memory as a result (something I have indeed observed). These objects aren't so big, so this leaks much more slowly of course, but it's still a leak. I suspect you copied this from TripleDict, which will have the same problem.
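
To spell out the suspected pile-up, here is a toy model (hypothetical names, not the code under review):

# The eraser removes the KeyedRef from the per-key list, but never removes the
# now-empty list itself, so the cache accumulates "id -> []" entries.
_refcache = {}

def register(key_id, keyed_ref):
    _refcache.setdefault(key_id, []).append(keyed_ref)

def eraser_callback(key_id, keyed_ref):
    L = _refcache[key_id]
    L.remove(keyed_ref)
    # Even if len(L) == 0 now, the entry key_id: [] stays behind.
    # A fix would be:
    #     if not L:
    #         del _refcache[key_id]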

line 103:

        for i from 0 <= i < PyList_GET_SIZE(bucket) by 4:
            if <size_t><object>PyList_GET_ITEM(bucket,i)==k:
                del bucket[i:i+2] 

It reads to me that that should be ... by 2, corresponding to the loops in MonoDict.__setitem__ and MonoDict.__getitem__

line 497: (in Monodict._getitem__)

How bad is the race condition here? I guess we're trusting that bucket doesn't change and that bucket has been purged of any dead references (i.e., that all MonoDictEraser callbacks have happened). Does keeping the GIL guarantee this?

Does the line

                return <object>PyList_GET_ITEM(bucket, i+1) 

increase a refcount on the object we retrieve? Does the <object> typecast do that? Or does the return do that? (now that the object has been retrieved it should be protected from deletion)

line 536:

Am I reading correctly that if k cannot be keyreffed then k is going to get a strong ref in _refcache and will be eternalized? Is that what we want? If so, write this in a comment! Especially because currently even explicit deletion would leave the reference in _refcache intact, so you'd be leaking in cases where a normal dictionary would not.
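
For the record, here is a quick reminder (plain Python, not tied to the patch) of which keys this concerns:

import weakref

# Not every Python object can be weakly referenced; plain ints, strs and tuples
# cannot.  For such keys the dictionary would have to fall back to a strong
# reference, which is exactly the "eternalized" entry discussed above.
try:
    weakref.ref(42)
except TypeError as e:
    print e          # cannot create weak reference to 'int' object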

comment:126 Changed 5 years ago by SimonKing

I will first try to fix #715 according to your remarks, and then return to here.

comment:127 Changed 5 years ago by SimonKing

  • Status changed from needs_review to needs_work
  • Work issues set to Rebase wrt the new version of #715. Take into account Nils' comments

comment:128 in reply to: ↑ 125 Changed 5 years ago by SimonKing

  • Dependencies changed from #715, #11521, #11599, #12969 to #715, #11521, #11599, #12969, #12215
  • Status changed from needs_work to needs_review

Replying to nbruin:

sage/structure/coerce_dict.pyx:

line 96:

Perhaps you can explain the mechanism here.

I did explain it at #715, and I changed the patch similarly to #715, according to your comments (_refcache is now local, etc.).

line 103:

        for i from 0 <= i < PyList_GET_SIZE(bucket) by 4:
            if <size_t><object>PyList_GET_ITEM(bucket,i)==k:
                del bucket[i:i+2] 

It reads to me that that should be ... by 2

Right! Thanks for spotting that.

line 497: (in Monodict._getitem__)

How bad is the race condition here? I guess we're trusting that bucket doesn't change and that bucket has been purged of any dead references (i.e., that all MonoDictEraser callbacks have happened). Does keeping the GIL guarantee this?

That is a good question.

I tried to stress test it:

from sage.structure.coerce_dict import MonoDict
L = MonoDict(31)
class A: pass
while 1:
    keys = [A() for _ in range(100)]
    for i,k in enumerate(keys):
        L[k] = i
    a = keys[99]
    keys = None    # drop all strong references except the one to keys[99]
    print L[a]

I suppose that one could expect that garbage collection would sometimes occur while getting L[a]. Hence, one would expect a segfault to occur at some point. But it works fine (it constantly prints "99").

Does the line

                return <object>PyList_GET_ITEM(bucket, i+1) 

increase a refcount on the object we retrieve? Does the <object> typecast do that? Or does the return do that?

As far as I know, the <object> cast does that. Note that without the cast, the code wouldn't even compile, because "return" cannot return the result of PyList_GET_ITEM (which is just a borrowed reference to an object).

Here is some test that (I think) demonstrates that the reference count is right:

sage: cython("""
....: include "../ext/python_list.pxi"
....: def test(list L):
....:     return <object>PyList_GET_ITEM(L,4)
....: """)
sage: import gc
sage: class A(object): pass
....: 
sage: L = [A() for _ in range(100)]
sage: id(L[4])
80680784
sage: a = test(L)
sage: del L
sage: _ = gc.collect()
sage: [id(x) for x in gc.get_objects() if isinstance(x,A)]
[80680784]
sage: id(a)
80680784
sage: del a
sage: _ = gc.collect()
sage: [id(x) for x in gc.get_objects() if isinstance(x,A)]
[]

line 536:

Am I reading correctly that if k cannot be keyreffed then k is going to get a strong ref in _refcache and will be eternalized?

See the comments at #715: That used to be a problem, but is now solved.

Some comments:

  • This patch fixes a typo introduced at #715. I suppose that #715, #11521, #12215 and #12313 will be merged in one go anyway, so, that should be fine.
  • There was a doctest in sage/categories/modules_with_basis.py that explicitly relied on an "eternal cache". Namely, some module was defined in one doctest and named "X". In another doctest, the same module was taken, and it was still called "X", because it had not been garbage collected. With my patch, garbage collection works, and thus the custom name chosen in the first doctest is no longer present in the second doctest.

Apply trac_12313-mono_dict-combined-random-sk.patch

comment:129 Changed 5 years ago by SimonKing

  • Dependencies changed from #715, #11521, #11599, #12969, #12215 to #11521, #11599, #12969, #12215

The patchbot got confused by the order of #715 and #11521, which are mutually dependent, but #715 needs to be applied first. Hence, I hope it helps to only provide #11521 as a dependency here.

comment:130 Changed 5 years ago by SimonKing

Apply trac_12313-mono_dict-combined-random-sk.patch

comment:131 Changed 5 years ago by SimonKing

Now I'm puzzled. The patchbot applies the same patches on top of the same version of Sage in the same order, but it gets an error when applying the patch from here, while it works fine for me.

For the record: I have

trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
trac12215_weak_cached_function-sk.patch
trac12215_segfault_fixes.patch
trac_12313-mono_dict-combined-random-sk.patch

on top of sage-5.3.beta2

Apply trac_12313-mono_dict-combined-random-sk.patch

comment:132 Changed 5 years ago by SimonKing

It is frustrating. The patchbot reports a problem in sage.rings.padics.padic_base_leaves that is most likely a cache problem in coercion: at some point, the absence of a coercion is cached even though there is a coercion.

I thought that that kind of problem was fixed by #12969! And worse: I do not get the error reported by the patchbot. And even worse: I do get the error reported by the patchbot if I additionally apply #13370. And still worse: for #13370, the patchbot also reports a segfault in another test - one that I cannot reproduce.

My plan: Using #13370, I try to find out what side-effect is responsible for the wrong coercion cache. Hopefully it can then be fixed, in a new dependency for this ticket.

comment:133 follow-up: Changed 5 years ago by SimonKing

Meanwhile it seems to me that the __contains__ method of both MonoDict and TripleDict is too complicated: Currently, the content of the buckets is looked up. However, self._refcache should also indicate what keys exist. So, it should be faster to look up self._refcache, and not the buckets.

Perhaps it is a bit improper to fix the __contains__ method of TripleDict here, even though it was introduced in #715 which isn't merged. Hence, my plan is now to fix TripleDict.__contains__ there, rebase the patches from the other tickets if needed, and fix MonoDict.__contains__ here.

Last edited 5 years ago by SimonKing (previous) (diff)

comment:134 in reply to: ↑ 133 Changed 5 years ago by SimonKing

Replying to SimonKing:

Perhaps it is a bit improper to fix the __contains__ method of TripleDict here, even though it was introduced in #715

Here I was wrong: Both __contains__ methods were introduced here. So, there fortunately is just one patch to change.

comment:135 Changed 5 years ago by SimonKing

  • Owner changed from rlm to (none)

Or: Perhaps I need to change both patches after all. Namely, I found that the __delitem__ methods do not update the _refcache - and one of them belongs to #715.

comment:136 Changed 5 years ago by SimonKing

OK: Both the patches at #715 and here are updated. The __contains__ methods should now be more efficient, and the __delitem__ method will remove the item not only from the buckets but also from the _refcache.

Needs review! And I hope the patchbot doesn't time out when finding the dependencies...

Apply trac_12313-mono_dict-combined-random-sk.patch

comment:137 Changed 5 years ago by SimonKing

The patchbot times out again, but all tests pass for me.

comment:138 Changed 5 years ago by nbruin

Some confusion on my part made me post a possibly relevant note on the wrong ticket: #11521 comment 152.

Perhaps we should not be caching conversions at all, because they can too easily cause memory leaks, for instance:

for p in prime_range(1,1000):
    k=GF(p)
    a=k(1)
    b=ZZ(a)

(the conversion maps get cached on ZZ, where the strong references to the domains will keep the finite fields alive)

I'd think conversions are allowed to be expensive.
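
A rough diagnostic sketch (not from the ticket) to see the effect is to count the finite fields that survive a collection after the loop:

# Count how many finite fields are still alive after a garbage collection.
import gc
for p in prime_range(1, 1000):
    k = GF(p)
    a = k(1)
    b = ZZ(a)
del k, a, b
_ = gc.collect()
print len([x for x in gc.get_objects() if 'FiniteField' in type(x).__name__])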

comment:139 follow-up: Changed 5 years ago by jpflori

I think the point of this ticket is exactly to solve the small example you posted (which is nothing but what the ticket description suggests if I'm not getting too tired):

  • maps are cached in ZZ in a MonoDict whose key is a weakref to k
  • k dies, and the MonoDict is smart enough to let garbage collection occur.

The other way around (maps from ZZ to k, then k dies) should be trivial as well.

Is that right?

comment:140 in reply to: ↑ 139 ; follow-up: Changed 5 years ago by nbruin

Replying to jpflori:

I think the point of this ticket is exactly to solve the small example you posted (which is nothing but what the ticket description suggests if I'm not getting too tired):

  • maps are cached in ZZ in a MonoDict? which key is a weakref to k

Yes, but the map stores a refcounted ref to k so the presence of the map on ZZ prevents k from being collected.

We still get the benefit of this patch that the absence of a map (which is the case from GF(p) to ZZ) gets cached as None, so in that case only the weakref to k exists on ZZ and k can be collected.

  • k dies, and the MonoDict is smart enough to let garbage collection occur.

Only if the stored value doesn't strongly reference k.

The other way around (maps from ZZ to k, then k dies) should be trivial as well.

Not "should be trivial as well" but "is already trivial thanks to cyclic garbage removal". The map is cached on k and not on ZZ. The map references k, but if all other references to k disappear we just have a cyclic reference, which python's GC should be able to figure out.

So: YES, this ticket is an improvement over what we had before (and once all issues are worked out should be merged) but it doesn't offer a complete solution to leaking due to caching yet:

for p in prime_range(1,1000):
    _=SR.coerce_map_from(GF(p))

will still leak like a disgruntled intelligence chief.

Conversion is a bit worse because conversion maps exist more often, but is also slightly better because the system won't ask for them by itself and won't investigate the transitive closure of the conversion graph. Therefore, I'm not sure conversion should be given the same caching treatment as coercion. The proposed patch is propagating such a caching treatment (but didn't introduce it).

comment:141 in reply to: ↑ 140 Changed 5 years ago by jpflori

Replying to nbruin:

Yes, but the map stores a refcounted ref to k so the presence of the map on ZZ prevents k from being collected.

Ok, I get it now, the example in the ticket description returns None, not the other one you proposed.

comment:142 Changed 5 years ago by jpflori

Another remark: the changes in this ticket do not affect the caching of the maps themselves, but only of the booleans indicating whether such maps exist, which are now stored in a MonoDict rather than a dict.

comment:143 Changed 5 years ago by jpflori

Oh no, I'm wrong again, the _coerce_from_hash stuff is definitely targeting the caching of the maps themselves.

comment:144 Changed 5 years ago by jpflori

So would storing only weakrefs to the domain in Maps/Morphisms suit you? The same approach was taken for the codomain of Action in #715.

comment:145 Changed 5 years ago by nbruin

Both these work fine for me on a pristine 5.3b2 with the patches for this ticket applied:

	sage -t  -force_lib devel/sage-12313/sage/rings/padics/padic_base_leaves.py # 1 doctests failed
	sage -t  -force_lib devel/sage-12313/sage/rings/polynomial/infinite_polynomial_element.py # Killed/crashed

See also #13370, where Simon can reproduce the padic_base_leaves failure.

I tried the following code to try and recreate the circumstances Simon finds there, but no problem occurs (it does usually succeed fairly quickly in finding R with the same id as the K before)

import gc
while True:
  k=Qp(7)
  L=[c^2 for c in range(2^randint(1,15))] #to stir up memory use
  print id(k)
  K=Qp(7,40)
  iK = id(K)
  assert k.has_coerce_map_from(K) is False
  del K
  gc.collect()
  R=Zp(7,40)
  iR = id(R)
  if iR == iK:
      break
  del k,R
  gc.collect()
  
assert k.has_coerce_map_from(R)

comment:146 Changed 5 years ago by SimonKing

  • Status changed from needs_review to needs_work
  • Work issues changed from Rebase wrt the new version of #715. Take into account Nils' comments to Test activity of weak references if addresses coincide

I'd like to try to add a test that would prevent the race condition that seems to be behind the doctest failures.

comment:147 follow-up: Changed 5 years ago by SimonKing

  • Status changed from needs_work to needs_review
  • Work issues Test activity of weak references if addresses coincide deleted

I have updated the patch. Change: before returning a value from the dictionary or asserting that a key is in the dictionary, I test whether the keyed references involved are still active. If they are not, then the key is a new object that merely happens to reuse the old memory location.

I have already tested that the error (that I only find with #13370) disappears with the new patches. I am now running make ptest with #13370.

I only wonder whether the additional tests imply an unacceptable slow-down. Could the reviewer do some tests? I am about to be on holiday, so, very likely I can't test much in the next few days.

Apply trac_12313-mono_dict-combined-random-sk.patch

comment:148 in reply to: ↑ 147 ; follow-up: Changed 5 years ago by nbruin

Replying to SimonKing:

I have updated the patch. Change: before returning a value from the dictionary or asserting that a key is in the dictionary, I test whether the keyed references involved are still active. If they are not, then the key is a new object that merely happens to reuse the old memory location.

So basically a MonoDict is now just a WeakKeyDict with identity testing rather than equality testing. Unless sage parents have really bad hash values, there should really not be a difference, right? That's one thing to try. TripleDict still has its value, because of how it behaves if any one of the key components disappears.

EDIT: NO! silly me. You can make this much faster. Since you'll be accessing everything via the weakref anyway, you don't need a refcache. You can just store a refcounted reference to the weakref directly in the bucket. Lookup is lightning fast, because you hash on id-based keys and you can walk the bucket as you did before.

You don't have a race condition, because you are holding a strong ref to the object that gets weakreffed until your MonoDict or TripleDict is in a consistent state to receive a callback (should it happen).

It looks like we might be slightly leaking if callbacks indeed sometimes never come ...

Last edited 5 years ago by nbruin (previous) (diff)

comment:149 in reply to: ↑ 125 Changed 5 years ago by SimonKing

Replying to nbruin:

Does the line

                return <object>PyList_GET_ITEM(bucket, i+1) 

increase a refcount on the object we retrieve?

Here is more evidence that <object> does what we need:

sage: cython("""
....: def test(size_t k):
....:     K = <object><void *>k
....:     print K
....:     del K
....: """)
sage: class A(object): pass
....: 
sage: a = A()
sage: test(id(a))
<__main__.A object at 0x44a17d0>
sage: a
<__main__.A object at 0x44a17d0>
sage: del a
sage: import gc
sage: _ = gc.collect()
sage: quit
Exiting Sage (CPU time 0m0.16s, Wall time 0m39.96s).

If the reference count of the A instance were not increased by the <object> cast in the test function, I guess a segfault would result, wouldn't it?

comment:150 follow-up: Changed 5 years ago by jpflori

Wouldn't it be easier to directly look at the Cython code for the <object> refcount?

As far as testing is concerned, I'll set up Sage installs during the week and report here and on the other tickets. Hopefully, this should be done by the beginning of next week.

comment:151 Changed 5 years ago by jpflori

I obviously meant the C code generated by Cython...

comment:152 in reply to: ↑ 148 Changed 5 years ago by SimonKing

Replying to nbruin:

Replying to SimonKing: So basically a MonoDict is now just a WeakKeyDict with identity testing rather than equality testing.

Seems so.

Unless sage parents have really bad hash values, there should really not be a difference, right?

Unfortunately, parents P do have very bad hash values. Namely, the default for hash(P) is hash(repr(P)). This value is not cached, and repr(P) may be slow. Even worse, repr(P) may change, because there is a P.rename() method. There is some ticket for fixing that apparent problem.
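
A toy illustration (hypothetical ToyParent class, not Sage's Parent) of why such a repr-based hash is fragile: renaming changes repr, hence the hash, so the object can get "lost" in an ordinary dict that relies on hashing.

class ToyParent(object):
    def __init__(self, name):
        self._name = name
    def rename(self, name):
        self._name = name
    def __repr__(self):
        return self._name
    def __hash__(self):
        return hash(repr(self))       # the default behaviour described above

P = ToyParent("Toy ring A")
D = {P: "cached data"}
P.rename("Toy ring B")
print P in D                          # usually False: the hash changed under us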

Anyway, at this point it would make sense to compare usual WeakKeyDictionary and MonoDict timing-wise. That should be done using unique parents (there, the hash and also the comparison is fast).

comment:153 in reply to: ↑ 150 ; follow-up: Changed 5 years ago by SimonKing

Replying to jpflori:

Wouldn't it be easier to directly look at the Cython code for the <object> refcount?

No idea how to do so.

comment:154 follow-up: Changed 5 years ago by nbruin

Sorry, the edited comment http://trac.sagemath.org/sage_trac/ticket/12313#comment:148 probably fell out of scope for you. In short: do away with the _refcache and put the weakref directly in the bucket. That saves storage and a superfluous lookup. That way MonoDict and TripleDict should still be a bit faster than normal dicts (because key comparison is trivial).

comment:155 in reply to: ↑ 153 Changed 5 years ago by jpflori

Replying to SimonKing:

Replying to jpflori:

Wouldn't it be easier to directly look at the Cython code for the <object> refcount?

No idea how to do so.

For the Sage library, you can find the generated C[++] files in devel/sage... For files generated on the fly, you have to look in .sage/temp/... or something like this, involving the PID of the process.

Anyway, I had a look at the parent.c file produced from parent.p?? on a vanilla 5.3.beta2, and on the lines corresponding to a return <object> PY_TYPE(x) there indeed seems to be a __Pyx_INCREF called on the address of the returned value.

comment:156 follow-up: Changed 5 years ago by jpflori

From http://www.cython.org/release/Cython-0.11.3/Cython/Includes/python.pxd

...
# it doesn't increment the reference count.  When it sees PyObject*
# in order to use the result you must explicitly cast to <object>,
# and when you do that Cython increments the reference count wether
# you want it to or not, forcing you to an explicit DECREF (or leak memory).
# To avoid this we make the above convention.  Note, you can
...

which also supports the above. See also https://groups.google.com/forum/?fromgroups=#!topic/cython-users/pZFurj3rUyg

comment:157 in reply to: ↑ 154 ; follow-up: Changed 5 years ago by SimonKing

Replying to nbruin:

Sorry, the edited comment http://trac.sagemath.org/sage_trac/ticket/12313#comment:148 probably fell out of scope for you. In short: do away with the _refcache and put the weakref directly in the bucket. That saves storage and a superfluous lookup. That way MonoDict and TripleDict should still be a bit faster than normal dicts (because key comparison is trivial).

I think that would not work, for two reasons:

  1. You seem to assume that all keys of MonoDict allow weak references. Is that granted? I am not sure. Perhaps this is a non-issue.
  2. In a very early stage of my work, I did put the weak references into the buckets. However, my impression was that that would create a massive slow-down, as I explain now.

With the current patch, the buckets are of the form [h1,v1,h2,v2,h3,v3,...], where h1,h2,h3 are the memory addresses of the keys, and v1,v2,v3 are the values. Now assume that we would put the (weak) references r1,r2,r3 to the keys into the bucket: [r1,v1,r2,v2,r3,v3,...].

The weak references should have a callback function removing entries from the dictionary. Hence, they are not just weakref.ref, but weakref.KeyedRef.

Let K be some weak-reffable key and f be some callback function. While weakref.ref(K) is weakref.ref(K), we unfortunately have weakref.KeyedRef(K,f,id(K)) is not weakref.KeyedRef(K,f,id(K)).
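
A quick sanity check of this fact (plain Python, independent of the patch):

import weakref

class A(object):
    pass

def callback(ref):
    pass

K = A()
print weakref.ref(K) is weakref.ref(K)        # True: plain weakrefs are shared
r1 = weakref.KeyedRef(K, callback, id(K))
r2 = weakref.KeyedRef(K, callback, id(K))
print r1 is r2                                # False: refs with callbacks are not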

We want to compare K by identity. Hence, either do if r1() is K, or if r1.key == <size_t><void *>K and r1() is not None. Both tests are slow (r1.key is a Python attribute lookup). Even worse: they are slow regardless of whether K is in the dictionary or not.

With the new code, we would first test h1==<size_t><void *>K and then self._refcache[h1]() is not None. The second test is slow, but apparently we need it, or some doctests will fail. But the point is: If K is not in the dictionary then except in very rare cases the first test will recognise it. And the first test is indeed fast. Hence, non-existing keys are fast to test, existing keys are probably not worse to test than with a usual WeakKeyDictionary. Will test it soon...

comment:158 in reply to: ↑ 156 Changed 5 years ago by SimonKing

Replying to jpflori:

# it doesn't increment the reference count.  When it sees PyObject*
# in order to use the result you must explicitly cast to <object>,
# and when you do that Cython increments the reference count wether
# you want it to or not, forcing you to an explicit DECREF (or leak memory).
# To avoid this we make the above convention.  Note, you can

Yep, I think I saw that text, but didn't recall where to find it. So, apparently it works.

comment:159 Changed 5 years ago by SimonKing

Here are some timings. They are quite conclusive: MonoDict is very fast, even in the "validate weak references before giving an answer" version!

First, something that is quite realistic. We have many polynomial rings over different finite base fields - something that often occurs in elliptic curve computations. Then, we create a WeakKeyDictionary and a MonoDict keyed with these polynomial rings, and then test containment with a ring that is in the dictionary and one that isn't:

sage: L = []
sage: for p in prime_range(10000):
....:     L.append(GF(p)['x','y'])
....: 
sage: import weakref
sage: from sage.structure.coerce_dict import MonoDict
sage: M = MonoDict(53)
sage: for i,K in enumerate(L):
....:     M[K] = i
....: 
sage: W = weakref.WeakKeyDictionary()
sage: for i,K in enumerate(L):
....:     W[K] = i
....: 
sage: K = GF(97)['x','y']
sage: K2 = GF(next_prime(p))['x','y']
sage: K in W
True
sage: K in M
True
sage: K2 in W
False
sage: K2 in M
False
sage: %timeit K in W
625 loops, best of 3: 55.9 µs per loop
sage: %timeit K in M
625 loops, best of 3: 533 ns per loop
sage: %timeit K2 in W
625 loops, best of 3: 2.41 µs per loop
sage: %timeit K2 in M
625 loops, best of 3: 1.56 µs per loop
sage: %timeit a = W[K]
625 loops, best of 3: 54.9 µs per loop
sage: %timeit a = M[K]
625 loops, best of 3: 1.08 µs per loop

Hence, in this realistic example, MonoDict is a lot faster than WeakKeyDictionary -- even though I am talking about the recent version of MonoDict where the weak references are validated.

The polynomial rings have a decently fast hash, but comparison is slow:

sage: %timeit hash(K)
625 loops, best of 3: 649 ns per loop
sage: %timeit hash(K2)
625 loops, best of 3: 678 ns per loop
sage: %timeit K == K
625 loops, best of 3: 53.2 µs per loop
sage: %timeit K == K2
625 loops, best of 3: 65.7 µs per loop

So, let us see what happens if we do "the same" with UniqueRepresentation:

sage: class A(UniqueRepresentation):
....:     def __init__(self,p):
....:         self.p = p
....:         
sage: L = []
sage: for p in prime_range(10000):
....:     L.append(A(p))
....:     
sage: K = A(97)
sage: K2 = A(next_prime(p))
sage: import weakref
sage: W = weakref.WeakKeyDictionary()
sage: from sage.structure.coerce_dict import MonoDict
sage: M = MonoDict(53)
sage: for i,K in enumerate(L):
....:     W[K] = i
....:     M[K] = i
....:     
sage: K in M
True
sage: K in W
True
sage: K2 in M
False
sage: K2 in W
False
sage: %timeit K in M
625 loops, best of 3: 552 ns per loop
sage: %timeit K in W
625 loops, best of 3: 3.67 µs per loop
sage: %timeit K2 in M
625 loops, best of 3: 1.6 µs per loop
sage: %timeit K2 in W
625 loops, best of 3: 2.72 µs per loop
sage: %timeit M[K]
625 loops, best of 3: 1.52 µs per loop
sage: %timeit W[K]
625 loops, best of 3: 3.22 µs per loop
sage: %timeit hash(K)
625 loops, best of 3: 944 ns per loop
sage: %timeit K == K
625 loops, best of 3: 910 ns per loop
sage: %timeit K == K2
625 loops, best of 3: 918 ns per loop

Hence, comparison is now a lot faster, and this is noticeable in WeakKeyDictionary.

However: MonoDict still outperforms WeakKeyDictionary. I am slightly surprised - I didn't expect this would happen, after introducing the validation of weak references. Note that testing K in M is even faster than a single comparison K==K!

comment:160 in reply to: ↑ 157 ; follow-up: Changed 5 years ago by nbruin

With the current patch, the buckets are of the form [h1,v1,h2,v2,h3,v3,...], where h1,h2,h3 are the memory addresses of the keys, and v1,v2,v3 are the values. Now assume that we would put the (weak) references r1,r2,r3 to the keys into the bucket: [r1,v1,r2,v2,r3,v3,...].

The weak references should have a callback function removing entries from the dictionary. Hence, they are not just weakref.ref, but weakref.KeyedRef.

That's not what I meant. You make the buckets larger, so that their layout becomes:

[h1,r1,v1,h2,r2,v2,h3,r3,v3,...]

where for, say, the key-value pair k1: v1 we have

h1=id(k1)
r1=weakref.KeyedRef(k1,f,h1)

This way you can do your whole lookup without doing an extra search in the now non-existent _refcache.

for i from 0 <= i < PyList_GET_SIZE(bucket) by 3:
    tmp = <object>PyList_GET_ITEM(bucket, i)
    if <size_t>tmp == h:
        r = <object>PyList_GET_ITEM(bucket, i+1)
        if isinstance(r, KeyedRef) and r() is None:
            # the object got collected but we didn't receive a callback (yet?)
            # either remove it now:
            del r #we dont WANT the callback anymore!
            del bucket[i:i+3]
            self._size -= 1
            # or skip that step and hope we'll get our callback later. In any case:
            raise KeyError, k
        else:
            # We've just checked that r's ref (the actual key) was still alive
            # so the fact that id(k) equals tmp means that
            # k is r() [or k is r if not weakreffable, but that would have been true anyway]
            return <object>PyList_GET_ITEM(bucket, i+2)
raise KeyError, k

Deletion and setting can be modified in a similar way; a rough sketch of the setting case is given below.
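
For instance, __setitem__ under the same conventions could look like this (a sketch only, not the final patch; names like self.eraser and self.buckets, and the omission of resizing, are assumptions):

h = <size_t><void *>k
bucket = self.buckets[h % PyList_GET_SIZE(self.buckets)]
for i from 0 <= i < PyList_GET_SIZE(bucket) by 3:
    if <size_t><object>PyList_GET_ITEM(bucket, i) == h:
        r = <object>PyList_GET_ITEM(bucket, i+1)
        if isinstance(r, KeyedRef) and r() is None:
            # stale entry for a reused address: drop it, then append below
            del bucket[i:i+3]
            self._size -= 1
            break
        else:
            # key already present and alive: just overwrite the value
            bucket[i+2] = value
            return
try:
    r = KeyedRef(k, self.eraser, h)    # weakref with callback, keyed by the address
except TypeError:
    r = k                              # key not weak-reffable: store it strongly
bucket.append(h)
bucket.append(r)
bucket.append(value)
self._size += 1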

This should be a bit faster because you avoid an extra self._refcache[h] lookup and all the management that comes with maintaining an independent hashtable that has exactly the same keys!

It is also more memory efficient because you're now storing h only once instead of twice.

For TripleDict you can do it in exactly the same way, just a little more work.

comment:161 in reply to: ↑ 160 Changed 5 years ago by SimonKing

Replying to nbruin:

That's not what I meant. You make the buckets larger, so that their layout becomes:

[h1,r1,v1,h2,r2,v2,h3,r3,v3,...]

where for, say, the key-value pair k1: v1 we have

h1=id(k1)
r1=weakref.KeyedRef(k1,f,h1)

I see.

This way you can do your whole lookup without doing an extra search in the now non-existent _refcache.

for i from 0 <= i < PyList_GET_SIZE(bucket) by 3:
    tmp = <object>PyList_GET_ITEM(bucket, i)
    if <size_t>tmp == h:
        r = <object>PyList_GET_ITEM(bucket, i+1)
        if isinstance(r, KeyedRef) and r() is None:
            # the object got collected but we didn't receive a callback (yet?)
            # either remove it now:
            del r #we dont WANT the callback anymore!
            del bucket[i:i+3]
            self._size -= 1
            # or skip that step and hope we'll get our callback later. In any case:
            raise KeyError, k
        else:
            # We've just checked that r's ref (the actual key) was still alive
            # so the fact that id(k) equals tmp means that
            # k is r() [or k is r if not weakreffable, but that would have been true anyway]
            return <object>PyList_GET_ITEM(bucket, i+2)
raise KeyError, k

Since you already have code, it should be relatively easy to test the performance.

This should be a bit faster because you avoid an extra self._refcache[h] lookup and all the management that comes with maintaining an independent hashtable that has exactly the same keys!

I wouldn't say that it is obviously faster. Let's simply test it, I'd say!

It is also more memory efficient because you're now storing h only once instead of twice.

For TripleDict you can do it in exactly the same way, just a little more work.

Not that much. Actually, I wonder whether one could have a TupleDict that is created with an additional parameter prescribing the length L of the key tuples. Then the data in each bucket would be stored as L references (weak, if possible), L memory locations, and one value. The look-up would be uniform; see the sketch below.
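
A possible shape for that uniform lookup (only a sketch; names like self._keylen and key_ids are assumptions, and the validity check of the weak references is omitted for brevity):

step = 2 * self._keylen + 1    # each entry: keylen addresses, keylen weakrefs, one value
i = 0
while i < PyList_GET_SIZE(bucket):
    match = True
    for j from 0 <= j < self._keylen:
        if <size_t><object>PyList_GET_ITEM(bucket, i+j) != key_ids[j]:
            match = False
            break
    if match:
        return <object>PyList_GET_ITEM(bucket, i + step - 1)
    i += step
raise KeyError, k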

I think one should create a proof of concept, to test the performance. If the performance is as good as with the current MonoDict and TripleDict, then we have to think how to organise work. Perhaps #715 and #12313 will be superseded by a new ticket, introducing TupleDict?

comment:162 follow-up: Changed 5 years ago by nbruin

OK! example implementation done. A couple of things I found:

  • Some of your doctests for MonoDict test TripleDict!
  • Your iterators can recurse quite deeply if there are lots of empty buckets in a row. That can easily happen.
  • Casting to <object> in Cython indeed gives you a kind of <PyObject *> where Cython takes care of intelligent casting and refcounting for you. That comes at some cost. Especially for tmp I think we can largely do away with that for a little gain in speed. This might apply more widely.

  • It's a shame we can't store our h values immediately in the list; we have to store them in an integer object, which means we're storing a pointer to that integer in the list. If our buckets were arrays of longs we could save one level of indirection for the storage of the h values, at the expense of having to do manual ref counting on the other parts, as opposed to letting Cython take care of it. A point to keep in mind if we need that last nanosecond... It's tempting to simply cast the value of h to a PyObject* and store that directly in the bucket, but that would wreak havoc with the cyclic garbage collector. We could mangle the value (add one? all pointer values are divisible by 4 nowadays), but I think that would lead us to stepping outside Python's defined C-API (if they ever get more lispy, they'll start using those bottom bits for type info or so).
  • What happens if we find a dead entry in setitem? How can we be sure the pending callback for that doesn't happen on our new value? I think explicitly deleting the weakref should do the trick, but then we should ensure that nobody else gets a chance to keep those weakrefs alive! I've gone with that for now. An alternative is to let the callback check that the weakref stored in the relevant bucket is indeed a weakref and dead. Then at least we won't be removing a new entry under a key that happens to have reused the previous memory address. Given that our use is caching, it would only lead to reduced performance, not wrong results, by the way.

Timings: I've run the same tests as above:

sage: import weakref
sage: from sage.structure.coerce_dict import MonoDict, MonoDictNoRefCache
sage: 
sage: L = []
sage: for p in prime_range(10000):
....:             L.append(GF(p)['x','y'])
....: 
sage: K = GF(97)['x','y']
sage: K2 = GF(next_prime(p))['x','y']   
sage: 
sage: M = MonoDict(53)
sage: for i,K in enumerate(L):
....:         M[K] = i
....: 
sage: N = MonoDictNoRefCache(53)
sage: for i,K in enumerate(L):
....:         N[K] = i
....: 
sage: W = weakref.WeakKeyDictionary()
sage: for i,K in enumerate(L):
....:         W[K] = i
....: 
sage: K in W
True
sage: K in M
True
sage: K in N
True
sage: 
sage: K2 in W
False
sage: K2 in M
False
sage: K2 in N
False
sage: 
sage: %timeit K in W
625 loops, best of 3: 36.7 µs per loop
sage: %timeit K in M
625 loops, best of 3: 162 ns per loop
sage: %timeit K in N
625 loops, best of 3: 198 ns per loop
sage: 
sage: %timeit K2 in W
625 loops, best of 3: 637 ns per loop
sage: %timeit K2 in M
625 loops, best of 3: 395 ns per loop
sage: %timeit K2 in N
625 loops, best of 3: 145 ns per loop
sage: 
sage: %timeit a = W[K]
625 loops, best of 3: 36.3 µs per loop
sage: %timeit a = M[K]
625 loops, best of 3: 229 ns per loop
sage: %timeit a = N[K]
625 loops, best of 3: 203 ns per loop

With the UniqueRepresentation:

sage: class A(UniqueRepresentation):
....:         def __init__(self,p):
....:             self.p = p
....:         
sage: L = []
sage: for p in prime_range(10000):
....:         L.append(A(p))
....:     
sage: K = A(97)
sage: K2 = A(next_prime(p))
sage: 
sage: import weakref
sage: from sage.structure.coerce_dict import MonoDict, MonoDictNoRefCache
sage: M = MonoDict(53)
sage: for i,K in enumerate(L):
....:         M[K] = i
....: 
sage: N = MonoDictNoRefCache(53)
sage: for i,K in enumerate(L):
....:         N[K] = i
....: 
sage: W = weakref.WeakKeyDictionary()
sage: for i,K in enumerate(L):
....:         W[K] = i
....: 
sage: K in W
True
sage: K in M
True
sage: K in N
True
sage: 
sage: K2 in W
False
sage: K2 in M
False
sage: K2 in N
False
sage: 
sage: %timeit K in W
625 loops, best of 3: 1.04 µs per loop
sage: %timeit K in M
625 loops, best of 3: 173 ns per loop
sage: %timeit K in N
625 loops, best of 3: 195 ns per loop
sage: 
sage: %timeit K2 in W
625 loops, best of 3: 766 ns per loop
sage: %timeit K2 in M
625 loops, best of 3: 440 ns per loop
sage: %timeit K2 in N
625 loops, best of 3: 135 ns per loop
sage: 
sage: %timeit a = W[K]
625 loops, best of 3: 926 ns per loop
sage: %timeit a = M[K]
625 loops, best of 3: 208 ns per loop
sage: %timeit a = N[K]
625 loops, best of 3: 179 ns per loop

If you do these tests repeatedly there is quite some noise, but these represent what generally seems to happen. Oddly enough, retrieving an item is often a little faster than getting True back on K in N! It must be something subtle, or just overhead. Deciding that K is in M is the only operation where the old MonoDict is slightly faster. That is not the case we're interested in optimizing, though; lookup is! And MonoDictNoRefCache is the clear winner there (as it is for deciding that a key is NOT in there - also a good operation to have). Furthermore, note that MonoDictNoRefCache is a little more thorough in its membership test, since it also checks that the weakref is still alive.

Last edited 5 years ago by nbruin (previous) (diff)

Changed 5 years ago by nbruin

Example implementation of a MonoDict storing the weakrefs in the buckets as well

comment:163 follow-up: Changed 5 years ago by nbruin

Observation: The test above is entirely unrepresentative because there is no resizing whatsoever (default threshold=0 leads to no resizing). So the test above was just using 53 buckets, meaning it was mainly doing linear search! Impressive how fast that is, of course, but more representative is (2/3 is the threshold ratio that Python's dict uses by default):

M = MonoDict(53,threshold=2/3)
N = MonoDictNoRefCache(53,threshold=2/3)
...
sage: sage: %timeit K in W
625 loops, best of 3: 1.06 µs per loop
sage: sage: %timeit K in M
625 loops, best of 3: 169 ns per loop
sage: sage: %timeit K in N
625 loops, best of 3: 128 ns per loop
sage: sage: 
sage: sage: %timeit K2 in W
625 loops, best of 3: 800 ns per loop
sage: sage: %timeit K2 in M
625 loops, best of 3: 426 ns per loop
sage: sage: %timeit K2 in N
625 loops, best of 3: 67.1 ns per loop
sage: sage: 
sage: sage: %timeit a = W[K]
625 loops, best of 3: 941 ns per loop
sage: sage: %timeit a = M[K]
625 loops, best of 3: 169 ns per loop
sage: sage: %timeit a = N[K]
625 loops, best of 3: 122 ns per loop

That's why MonoDict won for membership. It was using a properly calibrated dict versus a linear search in MonoDictNoRefCache. I think our design is clear now.

Concerning TupleDict: Sure we can implement it but if our only use cases are MonoDict and TripleDict anyway, we're better off with the unrolled loops.

Somewhere in the next few days I'll prepare a patch for #715 to move TripleDict to this design and then put a patch on #12313 for MonoDict, if that's OK with people. In the mean time, comments on design welcome.

comment:164 in reply to: ↑ 162 Changed 5 years ago by SimonKing

Replying to nbruin:

OK! example implementation done. A couple of things I found:

  • Some of your doctests for MonoDict test TripleDict!

Oops.

  • your iterators can recurse quite deeply if there are lots of empty buckets in a row. That can easily happen.

Yes? I thought that the number of buckets is quite small (so, no deep recursion) and that the whole thing is designed to distribute the items uniformly over the buckets.

  • Casting to <object> in Cython indeed gives you a kind of <PyObject *> where Cython takes care of intelligent casting and refcounting for you. That comes at some cost. Especially for tmp I think we can largely do away with that for a little gain in speed. This might apply more widely.

I wrote some general code for what I now call IdKeyDict: A weak key dictionary (that also accepts items that aren't weak-reffable) with keys of prescribed length, keys being compared by identity. There, the tmp is not used.

  • It's a shame we can't store our h values immediately in the list but that we have to store them in an integer, which means we're storing a pointer to that integer in the list.

I don't understand that comment.

  • What happens if we find a dead entry in setitem? How can we be sure the pending callback for that doesn't happen on our new value?

My new code takes care of that. The methods are a lot more modularised. There is one method that simply finds the correct bucket (according to the hash) and another method that finds the position of an item in that bucket. The second method will automatically delete a dead item from the bucket. And of course, that method is used in __setitem__, __getitem__ and __contains__.

I think explicitly deleting the weakref should do the trick, but then we should ensure that nobody else gets a chance to keep those weakrefs alive!

What would be the problem if the weakrefs survive? OK, at some point the callback will start working, but would that be a problem? I mean, if the callback tries to remove an item that has already been deleted, then the callback would simply do nothing.

comment:165 in reply to: ↑ 163 ; follow-up: Changed 5 years ago by SimonKing

Replying to nbruin:

Observation: The test above is entirely unrepresentative because there is no resizing whatsoever (default threshold=0 leads to no resizing). So the test above was just using 53 buckets, meaning it was mainly doing linear search! Impressive how fast that is, of course, but more representative is (2/3 is the threshold ratio that Python's dict uses by default)

Good point! Prompted by your comment, I now use threshold=0.7 as the default in my experimental code.

That's why MonoDict won for membership. It was using a properly calibrated dict versus a linear search in MonoDictNoRefCache.

Yes, that would explain it.

Concerning TupleDict: Sure we can implement it but if our only use cases are MonoDict and TripleDict anyway, we're better off with the unrolled loops.

Last night I already wrote experimental code (calling the thing "IdKeyDict", because the keys are compared by identity). I am now doing some tests - perhaps we are lucky and it is fast...

MonoDict and TripleDict are the only use cases so far. Until not so long ago, TripleDict was the only use case. One could imagine an IdKeyDict being useful for a weak_cached_function, as in #12215: if you have a function with a fixed list of arguments (hence, no *args and **keys), and if you are happy to compare the arguments by identity, IdKeyDict may be exactly the thing you want.

Plus, one avoids code duplication. And aren't modern compilers intelligent enough to do some loop unrolling automatically?

Anyway, since I already wrote code, I'll first test the performance, rather than immediately dropping the idea of unifying MonoDict and TripleDict.

Last edited 5 years ago by SimonKing (previous) (diff)

comment:166 Changed 5 years ago by nbruin

What would be the problem if the weakrefs survive? OK, at some point the callback will start working, but would that be a problem? I mean, if the callback tries to remove an item that has already been deleted, then the callback would simply do nothing

Not if there's now a new entry stored under that id. The trick is, id(object) is guaranteed to uniquely identify object for its lifetime but, as you've seen, not beyond it.
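
A quick illustration of the id-reuse issue (plain Python; the reuse is typical of CPython but not guaranteed):

class A(object):
    pass

old_id = id(A())         # this instance dies as soon as id() returns
new = A()
print old_id == id(new)  # often True on CPython: the freed address is reused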

By the way, as long as we ensure that absolutely no-one else keeps a reference to our weakrefs, then deleting them should make them disappear immediately, making it impossible to pass them as a parameter to a callback ... I hope that python uses that to purge its callback queue after a cyclic garbage collection.

See #13387 for a modification of MonoDict and TripleDict to not use a refcache. It's also a little more careful in handling dangling weakrefs (I think).

Hashtables should not have a particularly small number of buckets! About 1.5 times the number of elements! (The birthday paradox still means quite some collisions.) And since we don't ever resize down, the load could drop quite low (although not particularly in our main applications).

comment:167 in reply to: ↑ 165 Changed 5 years ago by nbruin

Replying to SimonKing:

Plus, one avoids code duplication. And aren't modern compilers intelligent enough to do some loop unrolling automatically?

If you have a preprocessor, macro, or templating system then the N in N-Dict can indeed be resolved at compile-time and then the compiler can unroll. But I don't think Cython has parametric code generation (yet?)

Anyway, since I already wrote code, I'll first test the performance, rather than immediately dropping the idea of unifying MonoDict and TripleDict.

OK! Compare with #13387. I suspect that your elusive ptest SEGVs might be due to somewhat careless handling in the Dicts. When I was rewriting them I found some places where Weakrefs were not properly checked before concluding presence.

comment:168 Changed 5 years ago by jpflori

Does everyone agree that further changes to *Dict should happen elsewhere? Even the potential fix for the elusive segfault Simon encountered and Nils just mentioned?

Or should #13387 (and further tickets) really be merged with these ones?

Last edited 5 years ago by jpflori (previous) (diff)

comment:169 Changed 5 years ago by SimonKing

I just added an experimental patch idkey_dict that implements what I proposed in my previous posts. The timings seem very appealing:

sage: from sage.structure.idkey_dict import IdKeyDict
sage: from sage.structure.coerce_dict import TripleDict
sage: import weakref
sage: I = IdKeyDict(3,53, threshold=0.7)
sage: T = TripleDict(53, threshold=0.7)
sage: W = weakref.WeakKeyDictionary()
sage: L = []
sage: for p in prime_range(10000):
....:     L.append(GF(p)['x','y'])
....: 
sage: K = GF(97)['x','y']
sage: K2 = GF(next_prime(p))['x','y']
sage: for i,K in enumerate(L):
....:     W[K] = i
....:     I[K,K,True] = i
....:     T[K,K,True] = i
....:
sage: (K,K,True) in I
True
sage: (K,K,True) in T
True
sage: K in W
True
sage: %timeit (K,K,True) in I
625 loops, best of 3: 1.31 µs per loop
sage: %timeit (K,K,True) in T
625 loops, best of 3: 2.14 µs per loop
sage: %timeit K in W
625 loops, best of 3: 56.4 µs per loop
sage: (K2,K,True) in I
False
sage: (K2,K,True) in T
False
sage: K2 in W
False
sage: %timeit (K2,K,True) in I
625 loops, best of 3: 523 ns per loop
sage: %timeit (K2,K,True) in T
625 loops, best of 3: 2.07 µs per loop
sage: %timeit K2 in W
625 loops, best of 3: 2.4 µs per loop
sage: I[K,K,True]
1228  
sage: T[K,K,True]
1228  
sage: W[K]
1228  
sage: %timeit I[K,K,True]
625 loops, best of 3: 1.24 µs per loop
sage: %timeit T[K,K,True]
625 loops, best of 3: 2.19 µs per loop
sage: %timeit W[K]
625 loops, best of 3: 55.6 µs per loop

Hence, the general IdKeyDict, which implements any fixed key length, already beats TripleDict. But not all is good: when I tried to compare with MonoDict, I got a segfault in __setitem__.

Back to the drawing board, then...

Changed 5 years ago by SimonKing

Experimental implementation of IdKeyDict

comment:170 Changed 5 years ago by SimonKing

OK, patch is updated.

Here are timings for key length 1:

sage: from sage.structure.idkey_dict import IdKeyDict
sage: from sage.structure.coerce_dict import MonoDict
sage: import weakref
sage: I = IdKeyDict(1,53, threshold=0.7)
sage: M = MonoDict(53, threshold=0.7)
sage: W = weakref.WeakKeyDictionary()
sage: L = []
sage: for p in prime_range(10000):
....:     L.append(GF(p)['x','y'])
....:
sage: K = GF(97)['x','y']
sage: K2 = GF(next_prime(p))['x','y']
sage: for i,K in enumerate(L):
....:     W[K] = i
....:     M[K] = i
....:     I[K] = i
....:
sage: K in W
True
sage: K in M
True  
sage: K in I
True  
sage: K2 in W
False 
sage: K2 in M
False 
sage: K2 in I
False 
sage: %timeit K in W
625 loops, best of 3: 56 µs per loop
sage: %timeit K in M
625 loops, best of 3: 550 ns per loop
sage: %timeit K in I
625 loops, best of 3: 571 ns per loop
sage: %timeit K2 in W
625 loops, best of 3: 2.46 µs per loop
sage: %timeit K2 in M
625 loops, best of 3: 1.55 µs per loop
sage: %timeit K2 in I
625 loops, best of 3: 387 ns per loop
sage: W[K]
1228  
sage: M[K]
1228  
sage: I[K]
1228  
sage: %timeit W[K]
625 loops, best of 3: 55.1 µs per loop
sage: %timeit M[K]
625 loops, best of 3: 616 ns per loop
sage: %timeit I[K]
625 loops, best of 3: 523 ns per loop

and concerning garbage collection

sage: del L        
sage: import gc
sage: gc.collect()
321790
sage: len(W)
1
sage: len(M)
1
sage: len(I)
1
sage: list(I.iteritems())
[((194308608, <weakref at 0xbbd9110; to 'sage.rings.polynomial.multi_polynomial_libsingular' at 0xb94ea00 (JoinCategory.parent_class)>), 1228)]

Hence, the timings of the general implementation (general key length) are competitive. What do you think about it? Of course, one would need to add doctests.

comment:171 Changed 5 years ago by SimonKing

For completeness: IdKeyDict is also faster than a Python dict, but of course that should mainly be true because we are comparing by identity, not equality.

sage: from sage.structure.idkey_dict import IdKeyDict
sage: I1 = IdKeyDict(1,53)
sage: I3 = IdKeyDict(3,53)
sage: D1 = {}
sage: D3 = {}
sage: L = []
sage: for p in prime_range(10000):
....:     L.append(GF(p)['x','y'])
....:
sage: for i,K in enumerate(L):
....:     I1[K] = i
....:     I3[K,K,True] = i
....:     D1[K] = i
....:     D3[K,K,True] = i
....:
sage: K = GF(97)['x','y']
sage: K2 = GF(next_prime(p))['x','y']
sage: K in I1
True
sage: K in D1
True
sage: K2 in I1
False
sage: K2 in D1
False
sage: (K,K,True) in D3
True
sage: (K,K,True) in I3
True
sage: %timeit K in D1
625 loops, best of 3: 720 ns per loop
sage: %timeit K in I1
625 loops, best of 3: 560 ns per loop
sage: %timeit K2 in D1
625 loops, best of 3: 763 ns per loop
sage: %timeit K2 in I1
625 loops, best of 3: 400 ns per loop
sage: %timeit (K,K,True) in D3
625 loops, best of 3: 1.53 µs per loop
sage: %timeit (K,K,True) in I3
625 loops, best of 3: 1.3 µs per loop
sage: %timeit (K2,K,True) in D3
625 loops, best of 3: 1.45 µs per loop
sage: %timeit (K2,K,True) in I3
625 loops, best of 3: 555 ns per loop
sage: %timeit D1[K]
625 loops, best of 3: 647 ns per loop
sage: %timeit I1[K]
625 loops, best of 3: 502 ns per loop
sage: %timeit D3[K,K,True]
625 loops, best of 3: 1.48 µs per loop
sage: %timeit I3[K,K,True]
625 loops, best of 3: 1.23 µs per loop

comment:172 follow-ups: Changed 5 years ago by jpflori

Running make ptest on a quite recent computer (12 threads) makes the runtime go from about 399 seconds on vanilla 5.3.beta2 to about 407 seconds on 5.3.beta2 + #715 + #11521 + #12313, averaging over a few runs.

On sage/schemes/elliptic_curves/sha_tate.py alone with long tests it goes from about 79 secs to about 82.5 secs.

comment:173 follow-up: Changed 5 years ago by nbruin

I wrote some general code for what I now call IdKeyDict: A weak key dictionary (that also accepts items that aren't weak-reffable) with keys of prescribed length, keys being compared by identity. There, the tmp is not used.

<size_t><object>... has two effects: it does the refcount dance (mostly superfluous) and it inserts a PyInt_AsSsize_t call (necessary). By introducing the right header template you can get the second and avoid the first. See the code at #13387.

  • It's a shame we can't store our h values immediately in the list but that we have to store them in an integer, which means we're storing a pointer to that integer in the list.

OK, perhaps a little cryptic indeed. What I meant was that for a MonoDict the bucket entries are really of the form

( key_id, ref, value )

which one could store as

(<size_t>, <PyObject*>, <PyObject*>)

but we're storing it as (Lists accept nothing else)

(<PyObject*>, <PyObject*>, <PyObject*>)

which means that the first entry gets stored by a pointer to a system-word-sized allocated entry (a PyInt? to be precise) instead of stored directly in the field that would be big enough anyway. That's wasting memory locality and hence CPU cache space on the bit we use the most. (weakref and the values obviously have to be pointers).

Perhaps JP has a point -- the main concern on this ticket is to get a correct MonoDict. Discussions about optimizing it should go on #13387. The memory leaks desperately need solving. Why don't you move IdKeyDict there? If you can get it to perform sufficiently well, we could drop it in. Having the code a little more structured would be a benefit from a maintenance point of view.

comment:174 in reply to: ↑ 172 Changed 5 years ago by nbruin

Replying to jpflori:

Running make ptest on a quite recent computer (12 threads) makes the runtime go from about 399 seconds on vanilla 5.3.beta2 to about 407 seconds on 5.3.beta2 + #715 + #11521 + #12313, averaging over a few runs.

On sage/schemes/elliptic_curves/sha_tate.py alone with long tests it goes from about 79 secs to about 82.5 secs.

Well, caching does have a benefit. If you have enough memory, it's sometimes faster to simply not do any deallocation or garbage collection. It's nice if we can do this without performance loss, but I think not leaking memory is worth some speed. Speed in exchange for a leak often means that you'll be able to run out of memory faster.

I guess the key is getting confident that the time is lost in recomputing things that have been rightfully collected, rather than in computations performing more slowly. People can solve the first by keeping references to things they know they'll need again.

comment:175 follow-up: Changed 5 years ago by jpflori

I agree. I'm only pointing out that on at least one computer, the speed regression, which had to be expected, seems more than negligible to me and should definitely not be taken into account for the inclusion of these tickets.

comment:176 in reply to: ↑ 175 Changed 5 years ago by nbruin

Replying to jpflori:

I agree. I'm only pointing out that on at least one computer, the speed regression, which had to be expected, seems more than negligible to me and should definitely not be taken into account for the inclusion of these tickets.

If someone is really interested in tracking down where the speed regression is coming from, they could sprinkle parents with an "elixir of life". I think it would be enough to patch UniqueRepresentation with a global list to which any newly created structure gets appended. That should bless all parents with eternal life and hence make all our weak caching permanent once again.
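For illustration, here is a toy sketch of that "elixir of life" idea on a plain Python class (this is not the actual UniqueRepresentation code; the class and list names are made up):

_elixir = []   # strong references that are never dropped

class EternalLife(object):
    # stand-in for the place where Sage parents get created,
    # e.g. UniqueRepresentation.__classcall__
    def __new__(cls, *args, **kwds):
        obj = super(EternalLife, cls).__new__(cls)
        _elixir.append(obj)   # the object can now never be garbage collected
        return obj

class SomeParent(EternalLife):   # stand-in for a Sage parent class
    def __init__(self, name):
        self.name = name

With such a hook in place, all the weak caches would behave as if they were strong again, which should show how much of the regression is due to rightful collections.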

Code that significantly benefits from this elixir should be marked for restructuring. It should keep better track of what parents it reuses. It's unrealistic to expect that the system is clairvoyant in this matter.

comment:177 in reply to: ↑ 172 ; follow-up: Changed 5 years ago by SimonKing

Replying to jpflori:

Running make ptest on a quite recent computer (12 threads) makes the runtime go from about 399 seconds on vanilla 5.3.beta2 to about 407 seconds on 5.3.beta2 + #715 + #11521 + #12313, averaging over a few runs.

On sage/schemes/elliptic_curves/sha_tate.py alone with long tests it goes from about 79 secs to about 82.5 secs.

sage: 8./399
0.0200501253132832
sage: 3.5/79
0.0443037974683544

How significant is that?

comment:178 follow-up: Changed 5 years ago by jpflori

As far as I'm concerned it is not significant at all in comparison with the memory leaks it prevents.

Without the patches, I'd have to restart Sage every couple of hours because of lack of memory.

With the patches, I'll just have to wait a couple of extra minutes.

comment:179 Changed 5 years ago by jpflori

And provided #13387 gives good results, this will become even less significant.

comment:180 in reply to: ↑ 173 ; follow-up: Changed 5 years ago by SimonKing

Replying to nbruin:

I wrote some general code for what I now call IdKeyDict: A weak key dictionary (that also accepts items that aren't weak-reffable) with keys of prescribed length, keys being compared by identity. There, the tmp is not used.

<size_t><object>... has two effects: it does the refcount dance (mostly superfluous) and it inserts a PyInt_AsSsize_t call (necessary). By introducing the right header template you can get the second and avoid the first. See the code at #13387.

Thank you for the pointer!

Perhaps JP has a point -- the main concern on this ticket is to get a correct MonoDict. Discussions about optimizing it should go on #13387. The memory leaks desperately need solving. Why don't you move IdKeyDict there? If you can get it to perform sufficiently well, we could drop it in. Having the code a little more structured would be a benefit from a maintenance point of view.

I see these alternative approaches:

  1. We essentially keep the code from #715 and the main patch from here, for now. However, we need to do some fixes, because of the segfaults and so on. But apart from those fixes, #13387 should focus on performance and perhaps on the generality of IdKeyDict. #12313 will be a dependency for #13387.
  2. The performance of IdKeyDict is decent and actually seems to be better than that of MonoDict and TripleDict. So, why not use the more general structure for the coercion dicts? Hence, we could make #13387 independent of #715 and #12313 and mark both #715 and #12313 as duplicates.

To decide which way to go, we should probably ask: How much additional work is needed to fix the current segfault problems with #715 and #12313? If it isn't much, then the first way may be better. Otherwise, we may choose the second way.

As far as I can see, what needs to be fixed (or just needs to be reviewed, because it already is in the patches) is this: after testing equality of memory addresses, test whether the weak reference is still valid. If it isn't, delete the weak reference, so that its callback can no longer fire.
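Schematically, the check amounts to something like this in plain Python (the flat bucket layout of (id, reference, value) triples is just an assumption for illustration; the real code is Cython):

import weakref

def bucket_lookup(bucket, key):
    # bucket is a flat list [id1, ref1, value1, id2, ref2, value2, ...]
    key_id = id(key)
    for i in range(0, len(bucket), 3):
        if bucket[i] == key_id:
            ref = bucket[i + 1]
            if isinstance(ref, weakref.ref) and ref() is None:
                # the original key has died and a new object now happens to
                # live at the same address: remove the stale entry so that
                # its callback can no longer act on this slot
                del bucket[i:i + 3]
                raise KeyError(key)
            return bucket[i + 2]
    raise KeyError(key)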

So, 1. or 2., or a third way?

comment:181 in reply to: ↑ 177 Changed 5 years ago by nbruin

Replying to SimonKing:

sage: 8./399
0.0200501253132832
sage: 3.5/79
0.0443037974683544

Not very significant in my opinion, but because this is fundamental infrastructure, it would be worthwhile if someone investigated at some point. I did some initial tests, equipping UniqueRepresentation and UniqueFactory with a permanent store for anything they produce, and the effect wasn't very big for sha_tate. I might be overlooking some other construction that's contributing. Finding a smoking gun for this particular regression would be reassuring, but it might not exist.

Perhaps it's best to just profile some of these examples properly and see where the cost is (and perhaps compare profiles?). If it's just the garbage collector doing its thing, we cannot do much about it. If it's recreating collected parents all the time, people should redesign that piece of code to keep the parents.

However, if we're finding that actions take much longer to perform now because they have to look up weakrefs rather than follow normal refs (or something similar), perhaps actions need a different way of operating (or at least an alternative, unsafe path that can be used in time critical parts)

If we're finding that the system is now spending significantly more time deleting things because so many more deletions involve swaths of weakref callbacks, we might want to think of more efficient mechanisms there, or just accept that this is part of the reasonable cost for memory management.

I doubt Python-level profiling will cut it here. Anybody with good lower-level profiling skills? Can we extract the commands run through Sage for doctesting sha_tate.py and compare profiles between 5.3b2 and 5.3b2 + #715 + #11521 + #12313? Both Python-level and C-level profiles would be useful.

Again, none of this will hold back a positive review from me and they're more issues for follow-up tickets.

comment:182 in reply to: ↑ 180 ; follow-ups: Changed 5 years ago by nbruin

As far as I can see, what needs to be fixed (or just needs to be reviewed, because it already is in the patches) is this: after testing equality of memory addresses, test whether the weak reference is still valid. If it isn't, delete the weak reference, so that its callback can no longer fire.

I've seen complete doctest successes for #715 + #11521 + #12313, but the last time I looked, some bots were persistent about errors that they didn't report for other patches, which makes me think these may be real issues that are hard to trigger. The problems I currently see:

#715,#11521:

  • The TripleDict set method is not safe against dangling callbacks
  • The TripleDict delitem method doesn't raise an error when deleting a dead entry (I don't think that's much of an issue, and it's arguable whether it should do that anyway. In fact, with TripleDictEraser using it, it can't)
  • TripleDictIter isn't safe with respect to dangling weakrefs. That one's a real segfault waiting to happen, because it instantiates object references by pointer cast!
  • TripleDictEraser holds a strong reference to the TripleDict. I suspect that should be weak, although that adds to deletion costs...
  • You're importing ConstantFunction but you don't use it.

If you have a reliable way of generating the segfaults, it might be worth trying with #13387. The iterator there should be safe. There might be other issues, though.

I have a preference for getting #715 + #11521 into a mergeable state, because I think that enhances our chances of actually making progress (in terms of getting stuff merged). Making #13387 one huge patch bomb is probably hard to review. This code really needs to be read for review.

#12215: Bot is reporting a SIGABRT somewhere, which is probably an "assert" failing in C code. It smells like memory corruption, so it would be really useful to get a report on it. The code in the patches is really simple, so I suspect a nasty interaction with #715/#11521 that now gets triggered because things are actually getting collected.

#12313: Bot reports a SIGSEGV somewhere; same story as above.

#13370: Bot reports a SIGSEGV

The most likely explanation is that our MonoDict and TripleDict implementations corrupt memory in rare cases and that the other tickets trigger that by changing other things so that they get exercised more.

Can we get access to the bot or equivalent configuration to recreate the circumstances? Or can we get someone involved who can (one is called volker-desktop ...).

comment:183 in reply to: ↑ 182 ; follow-up: Changed 5 years ago by SimonKing

Replying to nbruin:

  • TripleDictIter isn't safe with respect to dangling weakrefs. That one's a real segfault waiting to happen, because it instantiates object references by pointer cast!

OK.

  • TripleDictEraser holds a strong reference to the TripleDict. I suspect that should be weak, although that adds to deletion costs...

Why? References to TripleDictEraser come from the TripleDict it points to, and from the weak references that are stored in said TripleDict. Would this be more than having a reference cycle that only involves KeyedRef, TripleDict and TripleDictEraser? None of them has a __del__ method, right?

  • You're importing ConstantFunction but you don't use it.

Right.

If you have a reliable way of generating the segfaults, it might be worth trying with #13387. The iterator there should be safe. There might be other issues, though.

I have a preference for getting #715 + #11521 into a mergeable state,

Hence, if you have a working iterator at #13387 then we should use that. Note that idkey_dict has a different iterator, thanks to Cython now supporting "yield". Do you think that iterator is safe?

The most likely explanation is that our MonoDict and TripleDict implementations corrupt memory in rare cases and that the other tickets trigger that by changing other things so that they get exercised more.

Can we get access to the bot or equivalent configuration to recreate the circumstances? Or can we get someone involved who can (one is called volker-desktop ...).

I'll meet Volker in St Andrews next week, if I'm not mistaken.

In the next few days, my resources for working will be a bit limited, but I'll try to get my patches from #715 and here in a better state, according to your remarks.

comment:184 in reply to: ↑ 182 Changed 5 years ago by SimonKing

Replying to nbruin:

  • The TripleDict delitem method doesn't raise an error when deleting a dead entry (I don't think that's much of an issue, and it's arguable whether it should do that anyway. In fact, with TripleDictEraser using it, it can't)

TripleDictEraser isn't using it.

comment:185 in reply to: ↑ 178 Changed 5 years ago by zimmerma

Replying to jpflori:

As far as I'm concerned it is not significant at all in comparison with the memory leaks it prevents.

Without the patches, I'd have to restart Sage every couple of hours because of lack of memory.

With the patches, I'll just have to wait a couple of extra minutes.

I concur. What is more, I think fixing a memory leak is much more important than efficiency, so it is not really fair to compare the two timings.

Thank you very much to all of you for the good work!

Paul

comment:186 Changed 5 years ago by SimonKing

I added a patch at #715, fixing for TripleDict what we have discussed here. The patch from here of course no longer applies, so, for now, this ticket needs work (but #715 needs review, although admittedly I haven't run tests yet...).

comment:187 Changed 5 years ago by SimonKing

  • Status changed from needs_review to needs_work
  • Work issues set to Rebase, and take into account Nils' comments

comment:188 in reply to: ↑ 183 ; follow-up: Changed 5 years ago by nbruin

Replying to SimonKing:

Why? References to TripleDictEraser come from the TripleDict it points to, and from the weak references that are stored in said TripleDict. Would this be more than having a reference cycle that only involves KeyedRef, TripleDict and TripleDictEraser? None of them has a __del__ method, right?

Hm ... For MonoDictEraser you used a weakref. That's mainly what led me to believe it's better. I don't know what's more costly. Weakrefs make individual deletions more expensive because of the callbacks. Cyclic garbage buildup adds cost in that the GC is triggered more frequently. It depends a bit on the connectivity of the graph whether it's worthwhile introducing enough weakrefs to break cycles. The following check suggests that weakrefs can sometimes be cheaper:

sage: class refstore(object):
....:         def __init__(self,ref):
....:             self.ref=ref
....:     
sage: class cyclic(object):
....:         def __init__(self):
....:             self.me=refstore(self)
....: 
sage: N=0
sage: def callback(c):
....:     global N
....:     N+=1
....: 
sage: import weakref
sage: 
sage: class weakcyc(object):
....:       def __init__(self):
....:         self.me=refstore(weakref.ref(self,callback))
....: 
sage: #import gc
sage: #gc.set_debug(gc.DEBUG_STATS)
sage: %time for i in range(80000): c = cyclic()
CPU times: user 0.33 s, sys: 0.01 s, total: 0.35 s
Wall time: 0.36 s
sage: %time for i in range(80000): c = weakcyc()
CPU times: user 0.20 s, sys: 0.00 s, total: 0.20 s
Wall time: 0.21 s
sage: %time for i in range(80000): c = cyclic()
CPU times: user 0.39 s, sys: 0.00 s, total: 0.39 s
Wall time: 0.40 s
sage: %time for i in range(80000): c = weakcyc()
CPU times: user 0.20 s, sys: 0.00 s, total: 0.20 s
Wall time: 0.21 s
sage: %time for i in range(80000): c = cyclic()
CPU times: user 0.33 s, sys: 0.00 s, total: 0.33 s
Wall time: 0.35 s
sage: %time for i in range(80000): c = weakcyc()
CPU times: user 0.20 s, sys: 0.00 s, total: 0.20 s
Wall time: 0.21 s
sage: N
239999

(it's fun to turn on gc stats and see that the weakrefs trigger no GC!)

Hence, if you have a working iterator at #13387 then we should use that. Note that idkey_dict has a different iterator, thanks to Cython now supporting "yield". Do you think that iterator is safe?

Apart from the fact that it's returning garbage keys at the moment: absolutely! See #13387. Or otherwise just do a nested while loop, storing the position using bucket number and index in bucket. That's easier logic than the recursive approach. But of course validate the weakrefs.
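In plain Python, that nested-loop iterator would look roughly like this (the flat bucket layout of (id, reference, value) triples is an assumption made only for illustration):

import weakref

def iteritems(buckets):
    b = 0
    while b < len(buckets):
        bucket = buckets[b]
        i = 0
        while i < len(bucket):
            ref = bucket[i + 1]
            if isinstance(ref, weakref.ref):
                key = ref()          # validate the weak reference ...
                if key is None:      # ... and skip dead entries instead of
                    i += 3           # casting the stored address to an object
                    continue
            else:
                key = ref            # the key was not weak-reffable
            yield key, bucket[i + 2]
            i += 3
        b += 1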

In the next few days, my resources for working will be a bit limited, but I'll try to get my patches from #715 and here in a better state, according to your remarks.

Have fun!

comment:189 in reply to: ↑ 188 Changed 5 years ago by SimonKing

Replying to nbruin:

Replying to SimonKing:

Why? References to TripleDictEraser come from the TripleDict it points to, and from the weak references that are stored in said TripleDict. Would this be more than having a reference cycle that only involves KeyedRef, TripleDict and TripleDictEraser? None of them has a __del__ method, right?

Hm ... For MonoDictEraser you used a weakref. That's mainly what led me to believe it's better.

Yes, while working at rebasing my patch, I just noticed.

Not sure. Since I have already updated my patch at #715 without a weak reference (for TripleDict), I'll keep it like that there. But now I am rebasing my patch from here, and I think I will change it into a weak reference here (in both cases).

Hence, if you have a working iterator at #13387 then we should use that. Note that idkey_dict has a different iterator, thanks to Cython now supporting "yield". Do you think that iterator is safe?

Apart from the fact that it's returning garbage keys at the moment: absolutely!

Yes, yes: there are no garbage keys with my new patch at #715.

comment:190 Changed 5 years ago by SimonKing

  • Status changed from needs_work to needs_review

OK, the patch is updated. I am now running tests.

Apply trac_12313-mono_dict-combined-random-sk.patch

comment:191 Changed 5 years ago by SimonKing

Arrgh. There are already several segfaults in the first few tests. So, there is something seriously wrong with either the new patch at #715 or here.

comment:192 Changed 5 years ago by SimonKing

  • Work issues Rebase, and take into account Nils' comments deleted

Fixed the segfaults: since TripleDictEraser now has a weak reference to a TripleDict, the callback function of the KeyedRef objects has to test whether that weak reference is still valid. This is what I forgot. Now I am running tests for the updated patch.
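In schematic Python (not the actual Cython code from coerce_dict.pyx), the callback now does something like this:

import weakref

class TripleDictEraser(object):
    def __init__(self, D):
        # weak, so the eraser cannot keep the dictionary alive
        self.D = weakref.ref(D)

    def __call__(self, r):
        # r is the KeyedRef whose referent has just died
        D = self.D()
        if D is None:
            return   # the dictionary itself is already gone; nothing to erase
        # ... otherwise use r.key to locate and remove the corresponding entry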

Apply trac_12313-mono_dict-combined-random-sk.patch

comment:193 Changed 5 years ago by SimonKing

I had to update the patch again. It is now almost fine.

Only remaining problem (on my machine, that is):

sage -t  -force_lib devel/sage/sage/combinat/integer_vectors_mod_permgroup.py

results in about 20 open processes that never finish. Timeout.

Running the test with --verbose, one finds that all tests pass. But then one gets

271 tests in 40 items.
271 passed and 0 failed.
Test passed.
A workspace appears to have been corrupted... automatically rebuilding (this is harmless).

Harmless - but it results in a timeout.

Since the problem appears after running the tests, it seems to come from the doctest framework. Any idea what happens?

comment:194 Changed 5 years ago by SimonKing

The message seems to come from "sage/interfaces/gap.py".

comment:195 Changed 5 years ago by SimonKing

... and the problem cannot be fixed by doing "gap_reset_workspace()".

comment:196 follow-up: Changed 5 years ago by SimonKing

I let the Gap._start method print a stack when there is a problem that leads to the "this is harmless" message.

After finishing the tests, there is a RuntimeError "Unable to start gap", and the stack is

(<frame object at 0x117d940>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/gap.py', 964, '_start', ['            for bla in inspect.stack():\n'], 0)
(<frame object at 0x117d6f0>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/gap.py', 536, '_eval_line', ['                self._start()\n'], 0)
(<frame object at 0x4a027e0>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/gap.py', 1092, 'set', ['        out = self._eval_line(cmd, allow_use_file=True)\n'], 0)
(<frame object at 0x42d1d50>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/interface.py', 385, '_create', ['        self.set(name, value)\n'], 0)
(<frame object at 0x4a07970>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/expect.py', 1327, '__init__', ['                self._name = parent._create(value, name=name)\n'], 0)
(<frame object at 0x4d3b9e0>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/interfaces/interface.py', 195, '__call__', ['            return cls(self, x, name=name)\n'], 0)
(<frame object at 0x49ec260>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python/site-packages/sage/groups/perm_gps/permgroup.py', 562, '__cmp__', ['        gSelf = self._gap_()\n'], 0)
(<frame object at 0x1408240>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python2.7/weakref.py', 51, 'remove', ['                del self.data[wr.key]\n'], 0)

So, apparently one has to look at the __cmp__ method in permgroup.py and make sure that it works when gap cannot start.

comment:197 Changed 5 years ago by SimonKing

The RuntimeError occurs in the following line in sage/interfaces/gap.py

Expect._start(self, "Failed to start GAP.")

The odd thing is: sage.interfaces.expect.Expect._start has the arguments "alt_message" and "block_during_init". While the latter is used, the former does not occur anywhere in the method!

comment:198 Changed 5 years ago by SimonKing

It hangs during gap_reset_workspace(), more precisely, while loading the GAP package ctbllib.

comment:199 in reply to: ↑ 196 ; follow-up: Changed 5 years ago by nbruin

After finishing the tests, there is a RuntimeError "Unable to start gap", and the stack is

(<frame object at 0x1408240>, '/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/python2.7/weakref.py', 51, 'remove', ['                del self.data[wr.key]\n'], 0)

This is in the removal callback of a WeakValueDictionary. So, what I *think* is happening is that an object that is dependent on Gap for its equality test is used in a cache somewhere.

Probably during shutting down, the value gets deleted and the dictionary gets its callback to remove the value. However, this is during shutdown! Things are in funny states. Probably, the deletion from the dictionary doesn't really matter anyway, because that's probably next in line to be destroyed.

However, the callback doesn't know that!

This is a nasty complication of these weakreffed dictionaries: the callback code can run under funny conditions, and an equality test can involve quite a lot of machinery for some Sage objects.

If we could tell our WeakValueDictionaries something about their codomain, their callback could check whether that still exists. If it doesn't, then none of the entries matter, so you don't have to bother to locate the appropriate key. Just delete them all.

EDIT: I guess a solution here could be to store a strong ref to the gap interface on the object that needs gap comparison. That way, gap will surely still exist (and be usable?) whenever such an object is alive. The possibility of callbacks running on objects means you have to be more explicit about what they need in order to function! Nasty ...
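Schematically (with a made-up class and attribute name; PermutationGroup_generic would be the real candidate):

from sage.interfaces.gap import gap

class NeedsGapForComparison(object):
    # stand-in for a class whose __cmp__/__eq__ goes through the GAP interface
    def __init__(self):
        self._gap_interface = gap   # strong reference: GAP outlives this object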

Last edited 5 years ago by nbruin (previous) (diff)

comment:200 in reply to: ↑ 199 Changed 5 years ago by SimonKing

Replying to nbruin:

If we could tell our WeakValueDictionaries something about their codomain

Unfortunately, these aren't our WeakValueDictionaries, but Python's. So, we cannot.

EDIT: I guess a solution here could be to store a strong ref to the gap interface on the object that needs gap comparison.

Sounds like a good idea.

Another one or two ideas:

  1. We find out where that weak dictionary lives, and delete the whole dictionary explicitly when Sage shuts down.
  2. We find out where that weak dictionary lives, and if we are lucky then we can replace it by a MonoDict or TripleDict.

comment:201 Changed 5 years ago by SimonKing

I just tried to add an attribute pointing to gap in each instance of PermutationGroup_generic. It didn't help. I tested: The stored gap instance is not running, and it won't start.

comment:202 Changed 5 years ago by SimonKing

It turns out that the problematic dictionary is UniqueRepresentation.__classcall__.cache, which is a weak dictionary since #12215, and which cannot be turned into a MonoDict or TripleDict.

Is it an option to delete it explicitly when Sage starts to shut down?

comment:203 follow-up: Changed 5 years ago by SimonKing

There is a function quit_sage in sage.all. The first thing it does is to quit all sage interfaces.

If one prepends it with a garbage collection, then the problem vanishes. It would not help to move quitting the interfaces to the very end of quit_sage().

So, question: Is an explicit garbage collection an option?
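Concretely, the proposed change amounts to something like this (schematic; the real quit_sage does a lot more):

def quit_sage(verbose=True):
    import gc
    gc.collect()   # run all weakref callbacks while every module is still intact
    # ... only then quit the interfaces and do the usual cleanup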

comment:204 Changed 5 years ago by SimonKing

  • Description modified (diff)

I made a patch that makes Sage force a garbage collection before closing the interfaces. I am confident that now all tests will pass.

Apply trac_12313-mono_dict-combined-random-sk.patch trac_12313_quit_sage.patch

comment:205 Changed 5 years ago by SimonKing

Alas. While running the tests, I happened to see

sage -t  -force_lib devel/sage/sage/misc/preparser.py
         [7.5 s]
sage -t  -force_lib devel/sage/doc/en/thematic_tutorials/lie/crystals.rst
         [5.8 s]
sage -t  -force_lib devel/sage/sage/groups/group.pyx
         [5.9 s]
sage -t  -force_lib devel/sage/sage/rings/residue_field.pyx
         [7.7 s]
*** glibc detected *** python: double free or corruption (!prev): 0x000000000757a7b0 ***
sage -t  -force_lib devel/sage/sage/plot/animate.py
         [4.5 s]
sage -t  -force_lib devel/sage/sage/schemes/plane_curves/affine_curve.py
         [6.5 s]
sage -t  -force_lib devel/sage/sage/coding/source_coding/huffman.py
         [4.2 s]
sage -t  -force_lib devel/sage/sage/groups/abelian_gps/abelian_group.py
         [4.6 s]
sage -t  -force_lib devel/sage/doc/fr/tutorial/tour_algebra.rst
         [7.6 s]

I give this much context since I am running 4 tests in parallel; I have no idea which of the tests causes the "glibc detected" message to be printed.

comment:206 Changed 5 years ago by SimonKing

If I run any of the tests mentioned in my previous post, no "glibc detected" occurs.

comment:207 in reply to: ↑ 203 Changed 5 years ago by nbruin

Really, it's a bit silly that WeakValueDictionary does a lookup by key on its removal callback.

An alternative, better, but more laborious solution is to improve WeakValueDictionary. What we need is a dictionary with an extra method:

    def delete_by_hash_and_value_id(self, h, value_id):
        # find the bucket based on the hash h
        # walk the bucket to find the entry whose *value* has the given id
        # delete that entry
        pass

Then your WeakValueDictionary can inherit from it and do:

    def __setitem__(self, key, value):
        h = hash(key)
        # KeyedRef takes the referent, then the callback, then the key
        r = KeyedRef(value,
                     lambda ref: self.delete_by_hash_and_value_id(ref.key[0], ref.key[1]),
                     (h, id(value)))
        super(WeakValueDictionary, self).__setitem__(key, r)

The most attractive approach would be to subclass dict, but getting access to its internal storage to do this modified search would be painful.

comment:208 follow-ups: Changed 5 years ago by nbruin

My bet is that this:

sage: search_src("WeakValueDict")
...
rings/polynomial/pbori.pyx:255:rings = weakref.WeakValueDictionary()
...

is responsible for the SIGABRT in much the same way.

comment:209 in reply to: ↑ 208 Changed 5 years ago by SimonKing

Replying to nbruin:

My bet is that this:

sage: search_src("WeakValueDict")
...
rings/polynomial/pbori.pyx:255:rings = weakref.WeakValueDictionary()
...

is responsible for the SIGABRT in much the same way.

Funny that you mention pbori.pyx! Meanwhile the doctests have finished, and I got the error exactly there! And it is reproducible. Namely:

sage -t --verbose -force_lib "devel/sage/sage/rings/polynomial/pbori.pyx"
2342 tests in 280 items.
2342 passed and 0 failed.
Test passed.
*** glibc detected *** python: double free or corruption (!prev): 0x0000000007367de0 ***

and then htop shows me processes like

python /mnt/local/king/.sage/tmp/pbori_22813.py
python /mnt/local/king/.sage/tmp/mpc622-24779/pbori_9077.py

that are running at 0% CPU. Timeout, and again the error occurs when shutting down Sage.

But that's good, because it is reproducible! And if pbori.pyx uses a weak dictionary (I hope it wasn't I who introduced it...), it absolutely makes sense to see an error there.

comment:210 in reply to: ↑ 208 ; follow-up: Changed 5 years ago by SimonKing

Replying to nbruin:

rings/polynomial/pbori.pyx:255:rings = weakref.WeakValueDictionary()

But is that dictionary used? I can't find a spot where it is modified: not in pbori.pyx itself, and I also couldn't find another module that imports it.

comment:211 Changed 5 years ago by SimonKing

I inserted some lines into quit_sage, so that the length and the contents of that dictionary in sage.rings.polynomial.pbori are printed: it was empty. Even if I create a boolean polynomial ring, the dictionary is empty when Sage starts to shut down. Hence, it seems that it is not to blame.

comment:212 in reply to: ↑ 210 Changed 5 years ago by john_perry

Replying to SimonKing:

Replying to nbruin:

rings/polynomial/pbori.pyx:255:rings = weakref.WeakValueDictionary()

But is that dictionary used? I can't find a spot in pbori.pyx where it is modified. Not in pbori itself, and also I couldn't find another module where it is imported.

I ran a hypersearch, and don't see anywhere else that it appears, either. Have you tried deleting that line?

(Believe it or not, I've been following this...)

comment:213 follow-ups: Changed 5 years ago by nbruin

It looks like that garbage collection is a great idea! The error is now completely reproducible:

	sage -t --verbose -force_lib "devel/sage/sage/rings/polynomial/pbori.pyx" # Killed/crashed

I've instrumented the remove function on the WeakValueDictionary a little bit:

        def remove(wr, selfref=ref(self)):
            self = selfref()
            if self is not None:
+                if type(wr.key) is tuple:
+                    T=[]
+                    for t in wr.key:
+                        try:
+                            T.append(str(t))
+                        except:
+                            T.append("TYPE: %s"%type(t))
+                    print('removal on key %s'%T)
                del self.data[wr.key]
        self._remove = remove
        UserDict.UserDict.__init__(self, *args, **kw)

to find (now doctests are failing due to the extra output):

...
***Test Failed*** 18 failures.
removal on key ['Rational Field', "('z', 'y')", '2', 'False', 'Degree reverse lexicographic term order']
*** glibc detected *** python: double free or corruption (!prev): 0x0000000006baaaa0 ***
...
Unhandled SIGABRT: An abort() occurred in Sage.

That looks an awful lot like the kind of key that a polynomial ring constructor would get. So I guess it's just one of those WeakCachedFunction dictionaries again. I suspect that part of polybori is already torn down when this entry is looked up. It really seems that the alternatives are:

  • fix all those libraries to keep sufficient references to do equality testing till the very end (may be a bit unfair. We're really just calling them to throw them away)
  • make a WeakValueDictionary that doesn't need key comparisons on its callback and use that instead
  • convince the release manager that these tickets are important and that these doctest failures are not the fault of these tickets. Remove any components that fail due to these changes.

Option 2 looks quite doable by comparison ... I'd say use a design rather similar to MonoDict. We probably won't beat WeakValueDict for standard performance, but we're making another trade-off: we want safe removal, even at the price of a little performance. So perhaps two variants, WeakCachedFunction and SafeWeakCachedFunction.

Last edited 5 years ago by nbruin (previous) (diff)

comment:214 Changed 5 years ago by SimonKing

Thank you for looking into the problem. I found that the problem seems to be related to the doctest framework, too.

Namely, put the following into a file "test.py":

"""
            sage: B.<a,b,c,d,e,f> = BooleanPolynomialRing()
            sage: I = ideal(a*b + a + b*e + c*e + 1, a + b + c*d + c + 1, a*c + c + d*f + d + 1, a*c + c*f + c + d*f + 1, c*f + c + d + e + 1, a + b*c + b*d + e*f + 1)
            sage: I.groebner_basis()
            [1]
"""

Then, run sage -t on it. You will see the "glibc detected" message.

You may also try to run the same example interactively in Sage, and quit. Everything is fine (or at least, it was for me).

Last edited 5 years ago by SimonKing (previous) (diff)

comment:215 in reply to: ↑ 213 Changed 5 years ago by SimonKing

Replying to nbruin:

...
***Test Failed*** 18 failures.
removal on key ['Rational Field', "('z', 'y')", '2', 'False', 'Degree reverse lexicographic term order']
*** glibc detected *** python: double free or corruption (!prev): 0x0000000006baaaa0 ***
...
Unhandled SIGABRT: An abort() occurred in Sage.

That looks an awful lot like the kind of key that a polynomial ring constructor would get.

Right! Could be a BooleanPolynomialRing, though.

So I guess it's just one of those WeakCachedFunction dictionaries again.

Indeed, sage.rings.polynomial.polynomial_ring_constructor contains

import weakref
_cache = weakref.WeakValueDictionary()

And guess who introduced that line in #715 :-/

comment:216 in reply to: ↑ 213 ; follow-up: Changed 5 years ago by SimonKing

Replying to nbruin:

I've instrumented the remove function on the WeakValueDictionary a little bit:

I did a similar thing, printing a message naming the ref.key before deletion and after deletion. Then I ran sage -t --verbose on my small test file.

Problem: All deletions were successful!

To be precise, it all starts with many deletions, happening at Sage startup. Apparently Sage computes a lot of things that are then immediately garbage collected. And then, the tests themselves look like this:

Trying:
    set_random_seed(0L)
Expecting nothing
ok
Trying:
    change_warning_output(sys.stdout)
Expecting nothing
ok
Trying:
    B = BooleanPolynomialRing(names=('a', 'b', 'c', 'd', 'e', 'f',)); (a, b, c, d, e, f,) = B._first_ngens(6)###line 2:_sage_    >>> B.<a,b,c,d,e,f> = BooleanPolynomialRing()
Expecting nothing
ok
Trying:
    I = ideal(a*b + a + b*e + c*e + Integer(1), a + b + c*d + c + Integer(1), a*c + c + d*f + d + Integer(1), a*c + c*f + c + d*f + Integer(1), c*f + c + d + e + Integer(1), a + b*c + b*d + e*f + Integer(1))###line 3:_sage_    >>> I = ideal(a*b + a + b*e + c*e + 1, a + b + c*d + c + 1, a*c + c + d*f + d + 1, a*c + c*f + c + d*f + 1, c*f + c + d + e + 1, a + b*c + b*d + e*f + 1)
Expecting nothing
ok
Trying:
    I.groebner_basis()###line 4:_sage_    >>> I.groebner_basis()
Expecting:
    [1]
**********************************************************************
File "/home/king/SAGE/tests/pbori/pboriNeu/test_12.py", line ?, in __main__.example_0
Failed example:
    I.groebner_basis()###line 4:_sage_    >>> I.groebner_basis()
Expected:
    [1]
Got:
    try to remove (('RingHomset_generic_with_category', (<class 'sage.rings.homset.RingHomset_generic'>, <class 'sage.categories.category.JoinCategory.parent_class'>), None, None, <class 'sage.rings.homset.RingHomset_generic'>), ())
    successful on (('RingHomset_generic_with_category', (<class 'sage.rings.homset.RingHomset_generic'>, <class 'sage.categories.category.JoinCategory.parent_class'>), None, None, <class 'sage.rings.homset.RingHomset_generic'>), ())
    [1]
Trying:
    sig_on_count()
Expecting:
    0
ok
4 items had no tests:
    __main__
    __main__.change_warning_output
    __main__.check_with_tolerance
    __main__.warning_function
**********************************************************************
1 items had failures:
   1 of   6 in __main__.example_0
6 tests in 5 items.
5 passed and 1 failed.
***Test Failed*** 1 failures.
*** glibc detected *** python: double free or corruption (out): 0x0000000004be7e60 ***
^CAborting further tests.

KeyboardInterrupt -- interrupted after 16.0 seconds!
----------------------------------------------------------------------
All tests passed!

So, there is only a Homset (for which there is a weak cache, by #11521) being collected during the tests, and perhaps it hasn't been the weak dictionary of polynomial rings, after all.

How can one find out what kind of thing got freed twice? I am afraid gdb does not seem to work on doctests, and I cannot reproduce the problem interactively.

comment:217 Changed 5 years ago by nbruin

I think two dictionaries are to blame: _cache in sage/rings/polynomial/polynomial_ring_constructor.py and self._cache in UniqueFactory. If I change both to dict() I don't get an error. If either one is weak, I do get the error.

In the process I saw how precarious remove methods are. I wanted to put a sys.stderr.write in the remove function in WeakValueDictionary. I ended up with

    def __init__(self, *args, **kw):
        MYSTDERR=sys.stderr
        def remove(wr, selfref=ref(self)):
            self = selfref()
            if self is not None:
                MYSTDERR.write("REMOVING A KEY\n");
                del self.data[wr.key]
        self._remove = remove
        UserDict.UserDict.__init__(self, *args, **kw)

(so remove is a closure that keeps a reference to MYSTDERR) because anything less would lead to errors. Apparently you cannot assume that the globals dictionary of your module is still intact when your remove method runs. That is such a restriction that I think we cannot expect people to write their equality testers to survive in such a harsh environment. We either have to track down all caches and tear them down in a more friendly setting or we have to go with a "safe" WeakValueDictionary, which doesn't call methods on keys upon removal.

I'm learning much more about Python than I ever wanted to know ...

comment:218 Changed 5 years ago by SimonKing

Would it help to explicitly delete (or explicitly clear) the "bad" weak dictionaries?

If it does, one could proceed as follows:

  • Introduce a function that returns (usual) weak dictionaries, and stores a (weak?) reference to all the weak dictionaries it creates.
  • In quit_sage, explicitly delete (or clear) all the weak dictionaries in the store (a rough sketch follows below).

Your suggestion to create a new, safe weak dictionary sounds like a lot of work to me...
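For the registry idea above, something along these lines might do (the names are made up and this is untested; the store is kept as a list of weak references to the created dictionaries):

import weakref

_registered_caches = []          # weak references to the created dictionaries

def registered_weak_value_dict():
    D = weakref.WeakValueDictionary()
    _registered_caches.append(weakref.ref(D))
    return D

def clear_registered_caches():   # to be called early in quit_sage
    for r in _registered_caches:
        D = r()
        if D is not None:
            D.clear()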

Last edited 5 years ago by SimonKing (previous) (diff)

comment:219 Changed 5 years ago by nbruin

Yes that might work as a workaround for now. The store indeed needs to be weak. It could itself be a WeakValueDictionary, keyed on the id of the WeakValueDictionaries it stores. You'd like to put them in a WeakSet, but they're not hashable.

If we're unlucky you'll get into an infinite loop, though. It's not impossible that sage, upon equality testing, decides to construct some new parents to coerce things into in order to compare them. These parents would then get created and entered into caches that we might just have cleared! If the process involves creating a new category, a dynamic class might even end up producing a new UniqueRepresentation and enter a new cache into our "to clear" dictionary. Dictionaries behave very badly when they are modified while being iterated over.

So: Worth a try for now. Let's do that to get this ticket out the door.


Future Work

I foresee crazy bugs in the future. A WeakValueDict with safe removal is probably ultimately unavoidable. And is the work really so bad?

We could just go with roughly the design for IdKeyDict, but then using Python's hash function and simpler entries. Normal lookup and delete would be via hash and then equality on the keys. Special deletion would be by hash and id of the value. Once we have a working prototype we can put it out there for the Python community. With any luck someone adopts it and improves, tunes and optimizes it. I don't think our applications are so super time critical. It's all for creation of parents, which is expensive anyway.

Our applications only need a rudimentary Dictionary interface, but adding the other methods would not be a huge job.

See #13394

Last edited 5 years ago by nbruin (previous) (diff)

Changed 5 years ago by nbruin

Trying to explicitly tear down WeakValueDicts in sage

comment:220 follow-ups: Changed 5 years ago by nbruin

Too bad ... We never did positively confirm that the problem here was some method on a key getting called. The patch doesn't help for the polybori problem. I suspect it really is a double memory release somewhere that simply never occurred before because the memory didn't get released in the first place (perhaps Python doesn't deallocate all memory upon exit? why would it?).

I don't think a fancier implementation of a safe WeakValueDict would help here either. By keeping references to all keys while *clearing* the dictionary (that's just the clear that is inherited from dict), I think we are avoiding all calls to keys. Purging a dict doesn't involve key lookups, and deleting the keys afterwards is just that. That should be possible. By then we're not holding weakrefs on them anymore, so no callbacks.

So, while I still think the current WeakValueDict design has a silly callback, the only problem with it that we have actually observed can be solved with a well-timed gc.collect().

comment:221 in reply to: ↑ 220 Changed 5 years ago by SimonKing

What I just tried was: take all instances of WeakValueDictionary in gc.get_objects() and call clear() on each, then do gc.collect(). But it didn't help: there is still the double free, which occurs during gc.collect() (I have a print statement directly before and directly after gc.collect(), but only the first one is executed).
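Roughly, what was tried (a schematic reconstruction, not the exact code):

import gc
import weakref

for D in gc.get_objects():
    if isinstance(D, weakref.WeakValueDictionary):
        D.clear()               # drop all entries without any key lookups
gc.collect()                    # the double free still happens here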

comment:222 in reply to: ↑ 220 Changed 5 years ago by SimonKing

Replying to nbruin:

I don't think a fancier implementation of a safe WeakValueDict would help here either. By keeping references to all keys while *clearing* the dictionary (that's just the clear that is inherited from dict), I think we are avoiding all calls to keys.

I am not sure what you mean here by "keys". Are you talking about the keys of the weak references? Are you talking about the keys of WeakValueDictionary?

comment:223 in reply to: ↑ 216 ; follow-up: Changed 5 years ago by jpflori

Replying to SimonKing:

How can one find out what kind of thing got freed twice? I am afraid gdb does not seem to work on doctests, and I cannot reproduce the problem interactively.

You should be able to run a doctest inside gdb using

./sage -t -gdb devel/sage/...

comment:224 Changed 5 years ago by SimonKing

Now that's funny. I tried to keep a store of the keys of all KeyedRefs. But it turns out that there is always exactly one dead KeyedRef that raises an AttributeError when trying to get the key. Namely, I inserted the following at the beginning of sage.all.quit_sage:

    import gc, weakref
    KeepRefKeys = []
    for r in gc.get_objects():
        if isinstance(r, weakref.KeyedRef):
            try:
                KeepRefKeys.append(r.key)
            except AttributeError:
                print "Warning, no key for",r
    del r
    print "all keys are in store"
    gc.collect()
    print "collection done"

Then, quitting a normal Sage session, I get

sage: quit
Exiting Sage (CPU time 0m0.05s, Wall time 0m0.85s).
Warning, no key for <weakref at 0x3d281d0; dead>
all keys are in store
collection done

and executing the "bad" three-line doctest, I get

6 tests in 5 items.
6 passed and 0 failed.
Test passed.
Warning, no key for <weakref at 0x3a8f290; dead>
all keys are in store
*** glibc detected *** python: double free or corruption (out): 0x0000000004be6cf0 ***
^CKeyboardInterrupt -- interrupted after 5.4 seconds!
Aborting further tests.

comment:225 in reply to: ↑ 223 Changed 5 years ago by SimonKing

Replying to jpflori:

Replying to SimonKing:

How can one find out what kind of thing got freed twice? I am afraid gdb does not seem to work on doctests, and I cannot reproduce the problem interactively.

You should be able to run a doctest inside gdb using

./sage -t -gdb devel/sage/...

Not really. What I get from that is

sage -t -gdb "test_12.py"
********************************************************************************
Type r at the (gdb) prompt to run the doctests.
Type bt if there is a crash to see a traceback.
********************************************************************************
'import site' failed; use -v for traceback
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/bin/python...done.
(gdb) r
Starting program: /mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/bin/python /mnt/local/king/.sage/tmp/test_12_1233.py
[Thread debugging using libthread_db enabled]
Traceback (most recent call last):
  File "/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib64/libstdc++.so.6.0.16-gdb.py", line 19, in <module>
    import os
ImportError: No module named os

Then, after waiting a long time, one finally gets

*** glibc detected *** /mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/bin/python: double free or corruption (out): 0x0000000004be4ec0 ***

And then? Keyboard Interrupt, or what?

^C
Program received signal SIGINT, Interrupt.
__lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97

Nothing happens; in particular, I do not get a gdb prompt any more.

comment:226 Changed 5 years ago by jpflori

You could try valgrind, but good luck deciphering the output.

comment:227 Changed 5 years ago by SimonKing

gdb just totally hangs. 0% CPU according to htop.

comment:228 Changed 5 years ago by SimonKing

IIRC, Volker is quite good at C-level debugging. I hope he can help me next week.

comment:229 follow-up: Changed 5 years ago by nbruin

Replying to SimonKing:

I am not sure what you mean here by "keys". Are you talking about the keys of the weak references? Are you talking about the keys of WeakValueDictionary?

Yes, they are the same :-). I was mainly talking about what I tried in safeweakdict, and I think we achieve the same safety that a proper SafeWeakValueDict would provide.

By the way, for me (on Fedora 16) sage -t -gdb works like a charm and, indeed, the backtrace shows gcmodule leading to Delete<polybori::groebner::ReductionStrategy>, which leads to an error raised in libc. So either the way polybori is wrapped is wrong, or polybori itself has a "memory drain" (or whatever the reverse of a leak is). Time for a polybori expert, I'd say.

I think it happens via libs/polybori/decl.pxd

   void PBRedStrategy_delete "Delete<ReductionStrategy>"(PBRedStrategy *mem)

comment:230 follow-up: Changed 5 years ago by jpflori

Replying to SimonKing:

gdb just totally hangs. 0% CPU according to htop.

I got a similar problem in #12173. FLINT wants to free the memory it uses upon quitting, but Sage/Python tries to do the same upon deletion of Sage objects. The solution we adopted was to leave the FLINT memory (not all of it, fortunately) in place, although that would lead to memory leaks if one loaded and quit Sage several times from one Python interpreter. We never figured out in which order things were deallocated when Python was shut down.

And gdb hung in the same way on the double free errors.

On your example, quite strangely, I get a nice segfault and backtrace, just as Nils does.

Anyway, it could be that the situation is the same as with FLINT: some Sage method tries to directly deallocate something after we have called the PolyBoRi cleanup method, which already freed that piece of memory, and it crashes.

comment:231 Changed 5 years ago by jpflori

Running the minimal doctest provided by Simon and adding a print statement in the __dealloc__ method of ReductionStrategy indeed shows that PBRedStrategy_delete is called twice on the same memory address.

comment:232 Changed 5 years ago by jpflori

When I run it in a Sage interactive session, the address is different.

comment:233 Changed 5 years ago by jpflori

Mmm, disregard the above: even in doctesting mode, the addresses are generically different.

comment:234 Changed 5 years ago by jpflori

Here is the seemingly relevant part of the Valgrind output:

==14712== Invalid free() / delete / delete[] / realloc()
==14712==    at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477)
==14712==    by 0x2C653D62: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_ReductionStrategy(_object*) (ccobject.h:77)
==14712==    by 0x2C5E9B52: __pyx_tp_clear_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (pbori.cpp:52486)
==14712==    by 0x4F5AF96: collect (gcmodule.c:769)
==14712==    by 0x4F5B728: gc_collect (gcmodule.c:1067)
==14712==    by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F23BD1: PyEval_EvalCode (ceval.c:667)
==14712==    by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353)
==14712==    by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943)
==14712==    by 0x4F5A294: Py_Main (main.c:639)
==14712==    by 0x544A76C: (below main) (libc-start.c:226)
==14712==  Address 0x21352000 is 96 bytes inside a block of size 448 alloc'd
==14712==    at 0x4C2C054: operator new(unsigned long) (vg_replace_malloc.c:295)
==14712==    by 0x2C6541D9: __pyx_pw_4sage_5rings_10polynomial_5pbori_16GroebnerStrategy_1__init__(_object*, _object*, _object*) (pbori.cpp:39499)
==14712==    by 0x4EDB457: type_call (typeobject.c:737)
==14712==    by 0x4E7E7A2: PyObject_Call (abstract.c:2529)
==14712==    by 0x2C61104A: __pyx_pf_4sage_5rings_10polynomial_5pbori_22BooleanPolynomialIdeal_4groebner_basis(_object*, _object*, _object*) [clone .isra.97] (pbori.cpp:31442)
==14712==    by 0x2C6116D6: __pyx_pw_4sage_5rings_10polynomial_5pbori_22BooleanPolynomialIdeal_5groebner_basis(_object*, _object*, _object*) (pbori.cpp:31182)
==14712==    by 0x4E7E7A2: PyObject_Call (abstract.c:2529)
==14712==    by 0x4F20A6C: PyEval_EvalFrameEx (ceval.c:4239)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F23BD1: PyEval_EvalCode (ceval.c:667)
==14712==    by 0x4F21BFE: PyEval_EvalFrameEx (ceval.c:4718)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4EA674B: function_call (funcobject.c:526)
==14712==    by 0x4E7E7A2: PyObject_Call (abstract.c:2529)
==14712==    by 0x4E8BF1E: instancemethod_call (classobject.c:2578)
==14712==    by 0x4E7E7A2: PyObject_Call (abstract.c:2529)
==14712==    by 0x4F20A6C: PyEval_EvalFrameEx (ceval.c:4239)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4EA674B: function_call (funcobject.c:526)
==14712==    by 0x4E7E7A2: PyObject_Call (abstract.c:2529)
==14712==    by 0x4E8BF1E: instancemethod_call (classobject.c:2578)
==14712==    by 0x4E7E7A2: PyObject_Call (abstract.c:2529)
==14712==    by 0x4F20A6C: PyEval_EvalFrameEx (ceval.c:4239)
==14712== 
...

==14712== Invalid free() / delete / delete[] / realloc()
==14712==    at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477)
==14712==    by 0x2C673D19: polybori::groebner::ReductionStrategy::~ReductionStrategy() (new_allocator.h:98)
==14712==    by 0x2C653F2E: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (GroebnerStrategy.h:41)
==14712==    by 0x4EB9E9A: PyDict_Clear (dictobject.c:891)
==14712==    by 0x4EB9FF8: dict_tp_clear (dictobject.c:2096)
==14712==    by 0x4F5AF96: collect (gcmodule.c:769)
==14712==    by 0x4F5B728: gc_collect (gcmodule.c:1067)
==14712==    by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F23BD1: PyEval_EvalCode (ceval.c:667)
==14712==    by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353)
==14712==    by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943)
==14712==    by 0x4F5A294: Py_Main (main.c:639)
==14712==    by 0x544A76C: (below main) (libc-start.c:226)
==14712==  Address 0xcd903f0 is 0 bytes inside a block of size 96 free'd
==14712==    at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477)
==14712==    by 0x2C673D19: polybori::groebner::ReductionStrategy::~ReductionStrategy() (new_allocator.h:98)
==14712==    by 0x2C653D5A: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_ReductionStrategy(_object*) (ccobject.h:77)
==14712==    by 0x2C5E9B52: __pyx_tp_clear_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (pbori.cpp:52486)
==14712==    by 0x4F5AF96: collect (gcmodule.c:769)
==14712==    by 0x4F5B728: gc_collect (gcmodule.c:1067)
==14712==    by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F23BD1: PyEval_EvalCode (ceval.c:667)
==14712==    by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353)
==14712==    by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943)
==14712==    by 0x4F5A294: Py_Main (main.c:639)
==14712==    by 0x544A76C: (below main) (libc-start.c:226)
==14712== 
...

==14712== Invalid free() / delete / delete[] / realloc()
==14712==    at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477)
==14712==    by 0x2C6739D0: std::_Rb_tree<polybori::BooleMonomial, std::pair<polybori::BooleMonomial const, int>, std::_Select1st<std::pair<polybori::BooleMonomial const, int> >, polybori::symmetric_composition<std::less<polybori::CCuddNavigator>, polybori::navigates<polybori::BoolePolynomial> >, std::allocator<std::pair<polybori::BooleMonomial const, int> > >::_M_erase(std::_Rb_tree_node<std::pair<polybori::BooleMonomial const, int> >*) (new_allocator.h:98)
==14712==    by 0x2C673D26: polybori::groebner::ReductionStrategy::~ReductionStrategy() (stl_tree.h:639)
==14712==    by 0x2C653F2E: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (GroebnerStrategy.h:41)
==14712==    by 0x4EB9E9A: PyDict_Clear (dictobject.c:891)
==14712==    by 0x4EB9FF8: dict_tp_clear (dictobject.c:2096)
==14712==    by 0x4F5AF96: collect (gcmodule.c:769)
==14712==    by 0x4F5B728: gc_collect (gcmodule.c:1067)
==14712==    by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F23BD1: PyEval_EvalCode (ceval.c:667)
==14712==    by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353)
==14712==    by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943)
==14712==    by 0x4F5A294: Py_Main (main.c:639)
==14712==    by 0x544A76C: (below main) (libc-start.c:226)
==14712==  Address 0x21353bc0 is 0 bytes inside a block of size 80 free'd
==14712==    at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477)
==14712==    by 0x2C6739D0: std::_Rb_tree<polybori::BooleMonomial, std::pair<polybori::BooleMonomial const, int>, std::_Select1st<std::pair<polybori::BooleMonomial const, int> >, polybori::symmetric_composition<std::less<polybori::CCuddNavigator>, polybori::navigates<polybori::BoolePolynomial> >, std::allocator<std::pair<polybori::BooleMonomial const, int> > >::_M_erase(std::_Rb_tree_node<std::pair<polybori::BooleMonomial const, int> >*) (new_allocator.h:98)
==14712==    by 0x2C673D26: polybori::groebner::ReductionStrategy::~ReductionStrategy() (stl_tree.h:639)
==14712==    by 0x2C653D5A: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_ReductionStrategy(_object*) (ccobject.h:77)
==14712==    by 0x2C5E9B52: __pyx_tp_clear_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (pbori.cpp:52486)
==14712==    by 0x4F5AF96: collect (gcmodule.c:769)
==14712==    by 0x4F5B728: gc_collect (gcmodule.c:1067)
==14712==    by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F23BD1: PyEval_EvalCode (ceval.c:667)
==14712==    by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353)
==14712==    by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943)
==14712==    by 0x4F5A294: Py_Main (main.c:639)
==14712==    by 0x544A76C: (below main) (libc-start.c:226)
==14712== 
...

==14712== Invalid free() / delete / delete[] / realloc()
==14712==    at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477)
==14712==    by 0x2C653F2E: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (GroebnerStrategy.h:41)
==14712==    by 0x4EB9E9A: PyDict_Clear (dictobject.c:891)
==14712==    by 0x4EB9FF8: dict_tp_clear (dictobject.c:2096)
==14712==    by 0x4F5AF96: collect (gcmodule.c:769)
==14712==    by 0x4F5B728: gc_collect (gcmodule.c:1067)
==14712==    by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F23BD1: PyEval_EvalCode (ceval.c:667)
==14712==    by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353)
==14712==    by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943)
==14712==    by 0x4F5A294: Py_Main (main.c:639)
==14712==    by 0x544A76C: (below main) (libc-start.c:226)
==14712==  Address 0xcdfa840 is 0 bytes inside a block of size 432 free'd
==14712==    at 0x4C2B0BF: operator delete(void*) (vg_replace_malloc.c:477)
==14712==    by 0x2C653D5A: __pyx_tp_dealloc_4sage_5rings_10polynomial_5pbori_ReductionStrategy(_object*) (ccobject.h:77)
==14712==    by 0x2C5E9B52: __pyx_tp_clear_4sage_5rings_10polynomial_5pbori_GroebnerStrategy(_object*) (pbori.cpp:52486)
==14712==    by 0x4F5AF96: collect (gcmodule.c:769)
==14712==    by 0x4F5B728: gc_collect (gcmodule.c:1067)
==14712==    by 0x4F21E44: PyEval_EvalFrameEx (ceval.c:4021)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F21EBF: PyEval_EvalFrameEx (ceval.c:4117)
==14712==    by 0x4F23A94: PyEval_EvalCodeEx (ceval.c:3253)
==14712==    by 0x4F23BD1: PyEval_EvalCode (ceval.c:667)
==14712==    by 0x4F4633F: PyRun_FileExFlags (pythonrun.c:1353)
==14712==    by 0x4F46DDE: PyRun_SimpleFileExFlags (pythonrun.c:943)
==14712==    by 0x4F5A294: Py_Main (main.c:639)
==14712==    by 0x544A76C: (below main) (libc-start.c:226)
==14712== 
Last edited 5 years ago by jpflori (previous) (diff)

comment:235 in reply to: ↑ 230 Changed 5 years ago by SimonKing

Replying to jpflori:

Anyway, it could be possible that the situation is as in FLINT: some Sage method tries to directly deallocate something after we have called the PolyBoRi cleanup method, which already freed that piece of memory, and it crashes.

That sounds like the solution should be as in the segfault fix for pari at #12215.

Generally, underlying C data should be freed using a __dealloc__ method (not __del__, please!). You need to be careful, because during __dealloc__ you cannot assume that the Python attributes of an instance are still working.

Hence, it is (in my opinion) very wrong that we had

    import sage.matrix.matrix_mod2_dense
    sage.matrix.matrix_mod2_dense.free_m4ri()

    import sage.libs.flint.flint
    sage.libs.flint.flint.free_flint_stack()

    pari._unsafe_deallocate_pari_stack()

in sage.all.quit_sage (I only removed the pari._unsafe_deallocate... in #12215).
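
To illustrate the __dealloc__ pattern I mean, here is a minimal Cython sketch (the header, type and function names are made up, not actual Sage code): the C data is released in __dealloc__, which only touches the cdef'ed pointer and never relies on Python attributes of the instance.

# Minimal sketch with hypothetical names; free C data in __dealloc__ and
# touch only C-level fields there, never Python attributes.
cdef extern from "mylib.h":
    ctypedef struct my_data_t:
        pass
    my_data_t *my_data_new()
    void my_data_free(my_data_t *ptr)

cdef class MyWrapper:
    cdef my_data_t *_data     # C-level pointer, still valid in __dealloc__

    def __cinit__(self):
        self._data = my_data_new()

    def __dealloc__(self):
        # Python attributes of self may already be gone here; only the
        # cdef'ed pointer is guaranteed to be usable.
        if self._data != NULL:
            my_data_free(self._data)
            self._data = NULL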

comment:236 in reply to: ↑ 229 Changed 5 years ago by SimonKing

Replying to nbruin:

Replying to SimonKing: By the way, for me (on Fedora 16) sage -t -gdb works like a charm

Why does it not work for me? Does one need to do anything special to make it work with Sage?

and, indeed, the backtrace shows this is gcmodule leading to Delete<polybori::groebner::ReductionStrategy>, leading to an error raised in libc.

OK, then we have to see at what point the reduction strategies are supposed to be deallocated.

I think it happens via libs/polybori/decl.pxd

   void PBRedStrategy_delete "Delete<ReductionStrategy>"(PBRedStrategy *mem)

I cannot work on it right now, but thank you for the pointer!

comment:237 Changed 5 years ago by jpflori

Here is a hypothesis about what goes wrong:

comment:238 Changed 5 years ago by jpflori

I'm not yet convinced the explanation is exactly what I posted above, but here is a patch which seems logical to me and prevents the PolyBoRi segfaults.

Changed 5 years ago by jpflori

Modify ReductionStrategy deallocation.

comment:239 Changed 5 years ago by SimonKing

  • Status changed from needs_review to needs_work
  • Work issues set to Merge Jean-Pierre's fix for pbori

The patch sounds logical to me. I am now merging it (without the print statements, of course) into my patch, unless you beat me to it.

Changed 5 years ago by SimonKing

Do a garbage collection before closing the interfaces. Do not delete a borrowed reference to reduction strategies in pbori

comment:240 Changed 5 years ago by SimonKing

  • Status changed from needs_work to needs_review
  • Work issues Merge Jean-Pierre's fix for pbori deleted

My second patch is now updated, it includes Jean-Pierre's fix for pbori.

Let us hope that the tests will pass now. The tests in pbori.pyx do pass!

Apply trac_12313-mono_dict-combined-random-sk.patch trac_12313_quit_sage.patch

comment:241 follow-up: Changed 5 years ago by nbruin

Replying to jpflori:

I'm not yet convinced the explanation is exactly what I posted above, but here is a patch which seems logical to me and prevents the PolyBoRi segfaults.

I agree with your assessment as well. As far as I can see:

GroebnerStrategy.__init__:

  • Creates a new ReductionStrategy wrapper and stores it in self.reduction_strategy
  • Asks polybori to delete the self.reduction_strategy._strat attribute on it
  • Then fills that slot with &self._strat.generators

So GroebnerStrategy.reduction_strategy._strat is now definitely a borrowed reference from ReductionStrategy. However, GroebnerStrategy has a reference to the latter, so it's borrowing from something it keeps alive.

GroebnerStrategy.__dealloc__:

  • zeros out self.reduction_strategy._strat = NULL
  • calls del self._strat, which I assume amounts to asking polybori to delete it as well.

This strategy works if GroebnerStrategy.__dealloc__ were guaranteed to be called before ReductionStrategy.__dealloc__, but as we know, Python attributes might no longer exist at __dealloc__ time, so we are not guaranteed a call order.

So I think the code here should be

    if self.reduction_strategy != NULL and self.reduction_strategy._strat != NULL:
        self.reduction_strategy._strat = NULL
        if self._count.released():
            del self._strat

but this is assuming that attributes get zeroed out on deallocation (can we even assume that self.reduction_strategy is still alive at this point?)

Then you could do in ReductionStrategy.__dealloc__:

        if self._strat:
            PBRedStrategy_delete(self._strat)
            self._strat = NULL

However, that's assuming that we can establish whether GroebnerStrategy.reduction_strategy is still valid upon deallocation. Since GroebnerStrategy has borrowed the reference without ReductionStrategy really knowing, I think GroebnerStrategy should just not delete _strat. In due time, ReductionStrategy.__dealloc__ gets called and will clean up the ref. It's hers!

comment:242 in reply to: ↑ 241 Changed 5 years ago by SimonKing

Replying to nbruin:

Replying to jpflori: GroebnerStrategy.__init__:

  • Creates a new ReductionStrategy wrapper and stores it in self.reduction_strategy
  • Asks polybori to delete the self.reduction_strategy._strat attribute on it
  • Then fills that slot with &self._strat.generators

So GroebnerStrategy.reduction_strategy._strat is now definitely a borrowed reference from ReductionStrategy.

No, the reduction strategy borrows from the Gröbner strategy.

However, that's assuming that we can establish whether GroebnerStrategy.reduction_strategy is still valid upon deallocation. Since GroebnerStrategy has borrowed the reference without ReductionStrategy really knowing, I think GroebnerStrategy should just not delete _strat.

But who else should?

In due time, ReductionStrategy.__dealloc__ gets called and will clean up the ref. It's hers!

No. reduction_strategy._strat does not borrow a reference to _strat (hence, ReductionStrategy.__dealloc__ does not dealloc GroebnerStrategy._strat), but it borrows a reference to _strat.generators.

That's why I think Jean-Pierre's solution is logical: Explicitly tell ReductionStrategy whether its _strat is borrowed from a GroebnerStrategy's _strat.generators.
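
In a stripped-down sketch, the idea looks as follows (the flag name is made up and the real class in sage/rings/polynomial/pbori.pyx carries more state; PBRedStrategy and PBRedStrategy_delete are the declarations from libs/polybori/decl.pxd):

# Simplified sketch of the borrowed-flag idea, not the actual patch.
cdef class ReductionStrategy:
    cdef PBRedStrategy *_strat
    cdef bint _borrowed   # True if _strat points into a GroebnerStrategy's
                          # _strat.generators and is owned there

    def __dealloc__(self):
        # Only delete the C++ object if we own it; a borrowed pointer is
        # cleaned up by the owning GroebnerStrategy instead.
        if self._strat != NULL and not self._borrowed:
            PBRedStrategy_delete(self._strat)
        self._strat = NULL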

comment:243 Changed 5 years ago by SimonKing

Hooray! With the recent patch versions (#715, #11521, #12215, #12313), make ptest yields

All tests passed!
Total time for all tests: 2159.8 seconds

comment:244 follow-up: Changed 5 years ago by nbruin

Replying to SimonKing:

No, the reduction strategy borrows from the Gröbner strategy.

Yes, you're right. Sorry. But the problem remains of what to trust on dealloc. The whole borrowing thing complicates that. I think we can simplify it here.

Since GroebnerStrategy always lends a reference to the ReductionStrategy it creates, why not give it outright? Then GroebnerStrategy never deletes _strat and ReductionStrategy always does that.

The zeroing out of GroebnerStrategy.reduction_strategy._strat via that path makes me a little queasy. Plus, what happens if someone goes digging in GroebnerStrategy and keeps ReductionStrategy alive for longer? I think handing the reference to a subordinate on "loan" just makes the code a little harder to reason about. It means you have to enforce that the subordinate doesn't go and do some work for someone else with the lent goods. It's so much easier if you don't have to enforce that at all.

Logically, I think GroebnerStrategy shouldn't have a _strat attribute. It should always access GroebnerStrategy.reduction_strategy._strat. The fact that GroebnerStrategy._strat does exist should be viewed as an optimization, but not affect program logic.

[Note: this is just squabbling over keeping code more easily maintainable. I agree JP's patch solves the problem. It's just nice if we can resolve the problem by reducing code complexity rather than increasing it]

EDIT: Plus the complication about GroebnerStrategy._strat.generators and ReductionStrategy._strat. I suspect this whole bit can be cleaned up without resorting to borrowed flags but this is not the place.

Last edited 5 years ago by nbruin (previous) (diff)

comment:245 follow-up: Changed 5 years ago by nbruin

YES! All tests pass for me too. I think the saying is "Mission Accomplished".

Concerning speed effects:

5.3b2 + (#715, #11521, #12215, #12313):

Total time for all tests: 1001.4 seconds
Total time for all tests: 1002.6 seconds

plain 5.3b2:

Total time for all tests: 928.5 seconds
Total time for all tests: 932.6 seconds

so people might want to profile their code every now and again to see if they are losing much time on construction/destruction of parents.

I'll try to finalize the positive reviews on all these tickets and then hopefully these can be the first ones in on 5.4.

comment:246 in reply to: ↑ 244 ; follow-up: Changed 5 years ago by SimonKing

  • Cc AlexanderDreyer PolyBoRi added

Replying to nbruin:

Replying to SimonKing: Then GroebnerStrategy never deletes _strat and ReductionStrategy always does that.

Again, the problem is that GroebnerStrategy._strat is not the thing that is lent to ReductionStrategy._strat. The reference that is lent is GroebnerStrategy._strat.generators. But since GroebnerStrategy._strat is more than that, GroebnerStrategy.__dealloc__ must take care of it. "Take care of it" means: call a C++ destructor on GroebnerStrategy._strat. Hence, changing it would mean changing the PolyBoRi spkg. I am ccing the authors; perhaps they will consider such a change for the next version.

The zeroing out of GroebnerStrategy.reduction_strategy._strat via that path makes me a little queasy. Plus, what happens if someone goes digging in GroebnerStrategy and keeps ReductionStrategy alive for longer? I think handing the reference to a subordinate on "loan" just makes the code a little harder to reason about. It means you have to enforce that the subordinate doesn't go and do some work for someone else with the lent goods. It's so much easier if you don't have to enforce that at all.

Probably you are right. But the practical question is whether this should better be on a "proper" PolyBoRi ticket. If you see an easy way to fix the problem properly, then it may be done here. But my impression is that a proper solution needs a non-trivial amount of changes to sage/rings/polynomial/pbori.pyx; I'd prefer to have a non-trivial change on a new ticket.

Logically, I think GroebnerStrategy shouldn't have a _strat attribute. It should always access GroebnerStrategy.reduction_strategy._strat. The fact that GroebnerStrategy._strat does exist should be viewed as an optimization, but not affect program logic.

Well, they are different things.

EDIT: Plus the complication about GroebnerStrategy._strat.generators and ReductionStrategy._strat. I suspect this whole bit can be cleaned up without resorting to borrowed flags but this is not the place.

Well, but how, without opening the polybori spkg?

comment:247 in reply to: ↑ 245 Changed 5 years ago by SimonKing

Replying to nbruin:

YES! All tests pass for me too. I think the saying is "Mission Accomplished".

Concerning speed effects:

5.3b2 + (#715, #11521, #12215, #12313):

Total time for all tests: 1001.4 seconds
Total time for all tests: 1002.6 seconds

plain 5.3b2:

Total time for all tests: 928.5 seconds
Total time for all tests: 932.6 seconds

That is almost an 8% regression. That isn't good. Can you find out which tests contribute most to the regression?

When I hacked into the weakref module for debugging, I found that many weak references die during startup of Sage. Hence, while Sage starts, many things are created without keeping a reference to them, so that they will be garbage collected before the user even has a chance to see them. I think it would be good (also for Sage startup time) to avoid that. But I reckon this should be on a new ticket.

comment:248 Changed 5 years ago by nbruin

  • Reviewers changed from Simon King, Jean-Pierre Flori, John Perry to Simon King, Jean-Pierre Flori, John Perry, Nils Bruin
  • Status changed from needs_review to positive_review

Speed regression mitigation is definitely going onto another ticket; #13400 I'd say. The consensus here is clearly that reasonable memory use trumps a (relatively small) speed penalty. I'm not so sure that the doctests are such a good measure for "performance" anyway. They mainly measure construction of tiny examples, so they tend to concentrate on overhead that is fairly irrelevant in practice (because it is not in inner loops). I'm not surprised we're not doing so well in that regard, but we already have some small improvements on future tickets (although I bet the main penalty comes from NOT having things cached anymore sometimes).

Anyway, the code here (and on the dependencies of this ticket!) has been discussed and vetted to death already. The bot is complaining about a bit of whitespace, which should be trivial to fix. In anticipation of that getting removed:

POSITIVE REVIEW!!

Congratulations to all who participated, and especially to Simon for incredible diligence and calm in pushing through this at times seemingly hopeless process.

There are a lot of people in the review field of this ticket. If any of them dissent with the status, they can take action.

Last edited 5 years ago by nbruin (previous) (diff)

comment:249 in reply to: ↑ 246 ; follow-up: Changed 5 years ago by AlexanderDreyer

Replying to SimonKing:

EDIT: Plus the complication about GroebnerStrategy._strat.generators and ReductionStrategy._strat. I suspect this whole bit can be cleaned up without resorting to borrowed flags but this is not the place.

Well, but how, without opening the polybori spkg?

Thanks for pointing out this issue. But it is not related to the spkg. In the spkg you have a composed C++ class with public attributes, which is managed with C++ constructors and destructors. But Cython seems not to be able to wrap the C++ attributes (unlike PolyBoRi's original interface based on boost/python). The only solution I can think of is a flag, which is exactly what this ticket came up with. Alternatively, a reference count. (I think boost/python uses one or the other internally, too.)

comment:250 in reply to: ↑ 249 ; follow-up: Changed 5 years ago by nbruin

Replying to AlexanderDreyer:

... In the spkg you have a composed C++ class with public attributes, that is managed with C++ constructors and destructors. But cython seems not to be able to wrap the c++ attributes (unlike PolyBoRi's original interface based on boost/python). The only solution I can think of is the a flag, which was exactly what this ticket came up with. Alternatively a reference count. (I think boost/python uses one or the other internally, too.)

Ah, so the more general solution would be a little class StratGuard whose instances shepherd the _strat objects that GroebnerStrategy._strat.generators and ReductionStrategy._strat refer to. Its __dealloc__ would be the only one that calls the C++ delete. Python's own refcounting would do the rest. Whether the actual access to the pointer happens through the StratGuard object or through a separately held pointer (for speed) is a design choice. If this is the only use case, though, keeping a flag might be a more light-weight solution.

comment:251 in reply to: ↑ 250 Changed 5 years ago by AlexanderDreyer

Replying to nbruin:

Ah, so the more general solution would be a little class StratGuard whose instances shepherd the _strat objects that GroebnerStrategy._strat.generators and ReductionStrategy._strat refer to. Its __dealloc__ would be the only one that calls the C++ delete. Python's own refcounting would do the rest. Whether the actual access to the pointer happens through the StratGuard object or through a separately held pointer (for speed) is a design choice. If this is the only use case, though, keeping a flag might be a more light-weight solution.

Indeed, I was thinking in the very same direction when I read about the issue. But the current solution causes less overhead and avoids a one-time-use class.
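
For the record, a minimal sketch of what such a one-time-use guard class could look like (hypothetical, not part of the patches on this ticket; PBRedStrategy and PBRedStrategy_delete are the declarations from libs/polybori/decl.pxd):

# Hypothetical guard class: sole owner of the underlying C++ object.
# GroebnerStrategy and ReductionStrategy would both hold a Python-level
# reference to the guard, and plain refcounting decides when to delete.
cdef class StratGuard:
    cdef PBRedStrategy *ptr

    def __dealloc__(self):
        if self.ptr != NULL:
            PBRedStrategy_delete(self.ptr)
            self.ptr = NULL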

comment:252 Changed 5 years ago by jdemeyer

  • Milestone changed from sage-5.3 to sage-5.4

comment:253 Changed 5 years ago by jdemeyer

  • Milestone changed from sage-5.4 to sage-pending

comment:254 Changed 5 years ago by jdemeyer

  • Dependencies changed from #11521, #11599, #12969, #12215 to #715, #11521, #12215

comment:255 Changed 5 years ago by jdemeyer

  • Dependencies changed from #715, #11521, #12215 to #715, #11521, #13447, #12215

comment:256 follow-up: Changed 5 years ago by jpflori

Is #13447 really needed as a dependency? From what I remember, #715 had to permanently keep libsingular rings alive because of the bsd.math bug. And #13447 intends to replace this workaround by a proper fix. I've not followed the discussion at #13447, but it seems to me that #12313 is needed by #13447 to get proper garbage collection, but it does not seem obvious that #12313 depends on #13447. Is there some rebase needed? Did I miss some other step?

comment:257 Changed 5 years ago by nbruin

I'll give Jeroen a chance to motivate the change in dependencies, but I think it shouldn't be there. The positive review (and all the patches to be applied) predate the existence of #13447, so they can't possibly depend on it.

Any issues that might require #13447 should be resolved with the strong caching workaround introduced in #715.

If the patches here were to require #13447 to function properly then it should be marked as "only merge together with"; not as a merge dependency (which would suggest the wrong merge order). Do we want to add "post merge dependencies" to trac?

comment:258 in reply to: ↑ 256 ; follow-up: Changed 5 years ago by jdemeyer

  • Dependencies changed from #715, #11521, #13447, #12215 to #715, #11521, #12215

I only made the dependencies consistent. The problem was that the dependencies of #12215 were wrong. I hope all dependencies are correct now.

comment:259 in reply to: ↑ 258 Changed 5 years ago by nbruin

Replying to jdemeyer:

I only made the dependencies consistent. The problem was that the dependencies of #12215 were wrong. I hope all dependencies are correct now.

Doesn't the patch bot do transitive dependency resolution anyway? I'd say one should only provide a minimal set of dependencies, unless there is some conceptual reason to make an indirect dependency explicit. Is there a reason to do otherwise?

comment:260 Changed 5 years ago by jdemeyer

  • Milestone changed from sage-pending to sage-5.5

comment:261 Changed 5 years ago by jdemeyer

  • Milestone changed from sage-5.5 to sage-5.6

comment:262 Changed 4 years ago by jpflori

  • Dependencies changed from #715, #11521, #12215 to #715, #11521, #12215, #13746
  • Description modified (diff)

I've put the pbori stuff at #13746 and made it a dependency here.

comment:263 Changed 4 years ago by jdemeyer

  • Milestone changed from sage-5.6 to sage-pending

comment:264 Changed 4 years ago by SimonKing

FWIW: In sage-5.6.rc0 built with SAGE_DEBUG=yes (see #13864) plus #12215 plus #12313, all doctests pass on my x86_64 openSuse 12.1 laptop with MALLOC_CHECK_=3.

comment:265 Changed 4 years ago by nbruin

patchbot

Apply trac_12313-mono_dict-combined-random-sk.patch

comment:266 Changed 4 years ago by nbruin

Unfortunately, doctests don't all succeed for me. I've built 5.6rc0 from scratch and applied #12215 and #12313. I get a SIGFPE in

 ./sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/heegner.py

gdb:

Program received signal SIGFPE, Arithmetic exception.
__pyx_pf_4sage_9structure_11coerce_dict_16TripleDictEraser_2__call__ (__pyx_v_r=<optimized out>,
    __pyx_v_self=<optimized out>) at sage/structure/coerce_dict.c:1107
1107      __pyx_t_10 = PyList_GET_ITEM(__pyx_t_1, (__pyx_v_h % PyList_GET_SIZE(__pyx_t_4)));
(gdb) l
1102     */
1103      __pyx_t_1 = __pyx_v_self->D->buckets;
1104      __Pyx_INCREF(__pyx_t_1);
1105      __pyx_t_4 = __pyx_v_self->D->buckets;
1106      __Pyx_INCREF(__pyx_t_4);
1107      __pyx_t_10 = PyList_GET_ITEM(__pyx_t_1, (__pyx_v_h % PyList_GET_SIZE(__pyx_t_4)));
(gdb) p PyString_AsString(PyObject_Repr(__pyx_t_4))
$4 = 0xd5090e4 "None"

So at this point self.D.buckets==None. Doing list-type transactions using non-error checking macros is probably not a good idea. This code corresponds to structure/coerce_dict.pyx:121. Apparently, a callback is happening while the dictionary itself is already being torn down!

A traceback indeed shows that this is all happening as part of a garbage collection of a TripleDict (I think that's what frame 15 indicates), ironically triggered by the allocation of a new TripleDict. I'm including rather a lot of the backtrace, because frame 100 is already in the sage library: apparently, some operations easily trigger stacks 100 levels deep!

(gdb) bt
#0  __pyx_pf_4sage_9structure_11coerce_dict_16TripleDictEraser_2__call__ (__pyx_v_r=<optimized out>, 
    __pyx_v_self=<optimized out>) at sage/structure/coerce_dict.c:1107
#1  __pyx_pw_4sage_9structure_11coerce_dict_16TripleDictEraser_3__call__ (__pyx_v_self=0xb4f17c0, 
    __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at sage/structure/coerce_dict.c:966
#2  0x00007ffff7c6f403 in PyObject_Call (func=0xb4f17c0, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#3  0x00007ffff7c6fcf0 in PyObject_CallFunctionObjArgs (callable=0xb4f17c0) at Objects/abstract.c:2760
#4  0x00007ffff7cd7b91 in handle_callback (callback=0xb4f17c0, ref=<optimized out>) at Objects/weakrefobject.c:881
#5  PyObject_ClearWeakRefs (object=<optimized out>) at Objects/weakrefobject.c:965
#6  0x00007fffea42e0e8 in __pyx_tp_dealloc_4sage_9structure_15category_object_CategoryObject (o=0x7843e30)
    at sage/structure/category_object.c:8990
#7  0x00007ffff7cc7b76 in subtype_dealloc (self=0x7843e30) at Objects/typeobject.c:1014
#8  0x00007ffff7cc4fa3 in tupledealloc (op=0xc1518d0) at Objects/tupleobject.c:220
#9  0x00007ffff7cc4fa3 in tupledealloc (op=0xc455c30) at Objects/tupleobject.c:220
#10 0x00007ffff7ca905f in dict_dealloc (mp=0x7856310) at Objects/dictobject.c:985
#11 0x00007ffff7cc7bc4 in subtype_dealloc (self=0xbdf2f10) at Objects/typeobject.c:999
#12 0x00007fffe90f8cc7 in __pyx_tp_dealloc_4sage_10categories_7functor_Functor (o=0xa7e09d0)
    at sage/categories/functor.c:3209
#13 0x00007ffff7c9bb5a in list_dealloc (op=0xbbede60) at Objects/listobject.c:309
#14 0x00007ffff7c9bb5a in list_dealloc (op=0xb771170) at Objects/listobject.c:309
#15 0x00007fffe9dee8df in __pyx_tp_clear_4sage_9structure_11coerce_dict_TripleDict (o=0xb8d73b0)
    at sage/structure/coerce_dict.c:5921
#16 0x00007ffff7d4b637 in delete_garbage (old=0x7ffff7fe19e0, collectable=0x7fffffff80f0) at Modules/gcmodule.c:769
#17 collect (generation=2) at Modules/gcmodule.c:930
#18 0x00007ffff7d4c1a8 in collect_generations () at Modules/gcmodule.c:996
#19 _PyObject_GC_Malloc (basicsize=<optimized out>) at Modules/gcmodule.c:1457
#20 _PyObject_GC_Malloc (basicsize=<optimized out>) at Modules/gcmodule.c:1439
#21 0x00007ffff7cc7c19 in PyType_GenericAlloc (type=0x7fffea004d00, nitems=0) at Objects/typeobject.c:753
#22 0x00007fffe9dee7ef in __pyx_tp_new_4sage_9structure_11coerce_dict_TripleDict (t=<optimized out>, a=<optimized out>, 
    k=<optimized out>) at sage/structure/coerce_dict.c:5881
#23 0x00007ffff7ccbd83 in type_call (type=0x7fffea004d00, args=0xf507d0, kwds=0x0) at Objects/typeobject.c:721
#24 0x00007ffff7c6f403 in PyObject_Call (func=0x7fffea004d00, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#25 0x00007fffea65da43 in __pyx_f_4sage_9structure_6parent_6Parent_init_coerce (__pyx_v_self=0x5083a80, 
    __pyx_optional_args=<optimized out>) at sage/structure/parent.c:5765
#26 0x00007fffea8a00ff in __pyx_f_4sage_9structure_10parent_old_6Parent_init_coerce (__pyx_v_self=<optimized out>, 
    __pyx_optional_args=<optimized out>) at sage/structure/parent_old.c:1634
#27 0x00007fffea8a6cc4 in __pyx_pf_4sage_9structure_10parent_old_6Parent___init__ (__pyx_v_category=0x7ffff7fc6840, 
    __pyx_v_embeddings=0x1038908, __pyx_v_actions=0x10388c0, __pyx_v_coerce_from=0x10381b8, __pyx_v_self=0x5083a80)
    at sage/structure/parent_old.c:1329
#28 __pyx_pw_4sage_9structure_10parent_old_6Parent_1__init__ (__pyx_v_self=0x5083a80, __pyx_args=<optimized out>, 
    __pyx_kwds=0x508bf00) at sage/structure/parent_old.c:1297
#29 0x00007ffff7cc664c in wrap_init (self=<optimized out>, args=<optimized out>, wrapped=<optimized out>, 
    kwds=<optimized out>) at Objects/typeobject.c:4719
#30 0x00007ffff7c6f403 in PyObject_Call (func=0xd30f710, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#31 0x00007ffff7d0cc97 in PyEval_CallObjectWithKeywords (func=0xd30f710, arg=0x7ffff7bbb050, kw=<optimized out>)
    at Python/ceval.c:3890
#32 0x00007ffff7c86e69 in wrapperdescr_call (descr=<optimized out>, args=0x7ffff7bbb050, kwds=0x508bf00)
    at Objects/descrobject.c:306
#33 0x00007ffff7c6f403 in PyObject_Call (func=0xfc0960, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#34 0x00007fffeaab5269 in __pyx_pf_4sage_9structure_11parent_base_14ParentWithBase___init__ (
    __pyx_v_category=0x7ffff7fc6840, __pyx_v_embeddings=0x1038908, __pyx_v_actions=0x10388c0, 
    __pyx_v_coerce_from=0x10381b8, __pyx_v_base=0x8593c60, __pyx_v_self=0x5083a80) at sage/structure/parent_base.c:1258
---Type <return> to continue, or q <return> to quit---
#35 __pyx_pw_4sage_9structure_11parent_base_14ParentWithBase_1__init__ (__pyx_v_self=0x5083a80, 
    __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at sage/structure/parent_base.c:1221
#36 0x00007ffff7cc664c in wrap_init (self=<optimized out>, args=<optimized out>, wrapped=<optimized out>, 
    kwds=<optimized out>) at Objects/typeobject.c:4719
#37 0x00007ffff7c6f403 in PyObject_Call (func=0xd30f6d0, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#38 0x00007ffff7d0cc97 in PyEval_CallObjectWithKeywords (func=0xd30f6d0, arg=0xd2f7d90, kw=<optimized out>)
    at Python/ceval.c:3890
#39 0x00007ffff7c86e69 in wrapperdescr_call (descr=<optimized out>, args=0xd2f7d90, kwds=0x508bde0)
    at Objects/descrobject.c:306
#40 0x00007ffff7c6f403 in PyObject_Call (func=0x1033320, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#41 0x00007fffe746a727 in __pyx_pf_4sage_9structure_11parent_gens_14ParentWithGens___init__ (
    __pyx_v_category=0x7ffff7fc6840, __pyx_v_normalize=<optimized out>, __pyx_v_names=0x7ffff7fc6840, 
    __pyx_v_base=0x8593c60, __pyx_v_self=0x5083a80) at sage/structure/parent_gens.c:2781
#42 __pyx_pw_4sage_9structure_11parent_gens_14ParentWithGens_1__init__ (__pyx_v_self=0x5083a80, 
    __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at sage/structure/parent_gens.c:2697
#43 0x00007ffff7cc664c in wrap_init (self=<optimized out>, args=<optimized out>, wrapped=<optimized out>, 
    kwds=<optimized out>) at Objects/typeobject.c:4719
#44 0x00007ffff7c6f403 in PyObject_Call (func=0xd29f0d0, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#45 0x00007ffff7d0cc97 in PyEval_CallObjectWithKeywords (func=0xd29f0d0, arg=0xd2a2290, kw=<optimized out>)
    at Python/ceval.c:3890
#46 0x00007ffff7c86e69 in wrapperdescr_call (descr=<optimized out>, args=0xd2a2290, kwds=0x0)
    at Objects/descrobject.c:306
#47 0x00007ffff7c6f403 in PyObject_Call (func=0x1033cd0, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#48 0x00007ffff7d1124d in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fffffff8830, func=0x1033cd0)
    at Python/ceval.c:4239
#49 call_function (oparg=<optimized out>, pp_stack=0x7fffffff8830) at Python/ceval.c:4044
#50 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#51 0x00007ffff7d14275 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, 
    args=<optimized out>, argcount=5, kws=0x7ffff7bbb068, kwcount=0, defs=0x1c61bf0, defcount=2, closure=0x0)
    at Python/ceval.c:3253
#52 0x00007ffff7c97273 in function_call (func=0x1db1488, arg=0xd1b27d0, kw=0x50af1a0) at Objects/funcobject.c:526
#53 0x00007ffff7c6f403 in PyObject_Call (func=0x1db1488, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#54 0x00007ffff7c7cb4f in instancemethod_call (func=0x1db1488, arg=0xd1b27d0, kw=0x50af1a0) at Objects/classobject.c:2578
#55 0x00007ffff7c6f403 in PyObject_Call (func=0xd2d7230, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#56 0x00007ffff7ccffd0 in slot_tp_init (self=0x5083a80, args=0xd1de100, kwds=0x50af1a0) at Objects/typeobject.c:5663
#57 0x00007ffff7ccbdc8 in type_call (type=<optimized out>, args=0xd1de100, kwds=0x50af1a0) at Objects/typeobject.c:737
#58 0x00007fffed3f5766 in __pyx_pf_4sage_4misc_19classcall_metaclass_typecall (__pyx_v_opts=0x50af1a0, 
    __pyx_v_args=0xd1de100, __pyx_v_cls=0x1ee43a0, __pyx_self=<optimized out>) at sage/misc/classcall_metaclass.c:1294
#59 __pyx_pw_4sage_4misc_19classcall_metaclass_1typecall (__pyx_self=<optimized out>, __pyx_args=<optimized out>, 
    __pyx_kwds=<optimized out>) at sage/misc/classcall_metaclass.c:1257
#60 0x00007ffff7d12198 in ext_do_call (nk=0, na=<optimized out>, flags=<optimized out>, pp_stack=0x7fffffff8e60, 
    func=0x995c20) at Python/ceval.c:4331
#61 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2705
#62 0x00007ffff7d14275 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, 
    args=<optimized out>, argcount=5, kws=0x7ffff7bbb068, kwcount=0, defs=0x0, defcount=0, closure=0x0)
    at Python/ceval.c:3253
#63 0x00007ffff7c97273 in function_call (func=0x99d050, arg=0xd1d4710, kw=0x6004d00) at Objects/funcobject.c:526
#64 0x00007ffff7c6f403 in PyObject_Call (func=0x99d050, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
---Type <return> to continue, or q <return> to quit---
#65 0x00007fffed816e8d in __pyx_pf_4sage_4misc_9cachefunc_18WeakCachedFunction_2__call__ (__pyx_v_kwds=0x6004d00, 
    __pyx_v_args=0xd1d4710, __pyx_v_self=0x919f30) at sage/misc/cachefunc.c:5080
#66 __pyx_pw_4sage_4misc_9cachefunc_18WeakCachedFunction_3__call__ (__pyx_v_self=0x919f30, __pyx_args=0xd1d4710, 
    __pyx_kwds=<optimized out>) at sage/misc/cachefunc.c:4888
#67 0x00007ffff7c6f403 in PyObject_Call (func=0x919f30, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#68 0x00007ffff7d1124d in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fffffff9220, func=0x919f30)
    at Python/ceval.c:4239
#69 call_function (oparg=<optimized out>, pp_stack=0x7fffffff9220) at Python/ceval.c:4044
#70 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#71 0x00007ffff7d14275 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, 
    args=<optimized out>, argcount=4, kws=0xd2ff608, kwcount=1, defs=0x1bf4968, defcount=2, closure=0x0)
    at Python/ceval.c:3253
#72 0x00007ffff7c97273 in function_call (func=0x1db1140, arg=0xd2caaa0, kw=0x6456560) at Objects/funcobject.c:526
#73 0x00007ffff7c6f403 in PyObject_Call (func=0x1db1140, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#74 0x00007fffed3f4a13 in __pyx_pf_4sage_4misc_19classcall_metaclass_18ClasscallMetaclass_2__call__ (
    __pyx_v_opts=0x6456560, __pyx_v_args=0xd2a4690, __pyx_v_cls=0x1ee43a0) at sage/misc/classcall_metaclass.c:932
#75 __pyx_pw_4sage_4misc_19classcall_metaclass_18ClasscallMetaclass_3__call__ (__pyx_v_cls=0x1ee43a0, 
    __pyx_args=0xd2a4690, __pyx_kwds=<optimized out>) at sage/misc/classcall_metaclass.c:872
#76 0x00007ffff7c6f403 in PyObject_Call (func=0x1ee43a0, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#77 0x00007fffd42b707d in __pyx_f_4sage_6matrix_20matrix_integer_dense_20Matrix_integer_dense__mod_int_c (
    __pyx_v_self=0xd2725f0, __pyx_v_p=19) at sage/matrix/matrix_integer_dense.c:13674
#78 0x00007fffd429678a in __pyx_pf_4sage_6matrix_20matrix_integer_dense_20Matrix_integer_dense_46_mod_int (
    __pyx_v_modulus=0xbbe8390, __pyx_v_self=0xd2725f0) at sage/matrix/matrix_integer_dense.c:13277
#79 __pyx_pw_4sage_6matrix_20matrix_integer_dense_20Matrix_integer_dense_47_mod_int (__pyx_v_self=0xd2725f0, 
    __pyx_v_modulus=0xbbe8390) at sage/matrix/matrix_integer_dense.c:13196
#80 0x00007ffff7c6f403 in PyObject_Call (func=0xd2967a0, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#81 0x00007fffd3c13807 in __pyx_pf_4sage_6matrix_21matrix_rational_dense_21Matrix_rational_dense_56change_ring (
    __pyx_v_R=<optimized out>, __pyx_v_self=0xd302398) at sage/matrix/matrix_rational_dense.c:14605
#82 __pyx_pw_4sage_6matrix_21matrix_rational_dense_21Matrix_rational_dense_57change_ring (__pyx_v_self=0xd302398, 
    __pyx_v_R=<optimized out>) at sage/matrix/matrix_rational_dense.c:14120
#83 0x00007ffff7d13158 in call_function (oparg=<optimized out>, pp_stack=0x7fffffff9710) at Python/ceval.c:4009
#84 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#85 0x00007ffff7d133cb in fast_function (nk=<optimized out>, na=2, n=<optimized out>, pp_stack=0x7fffffff9870, 
    func=0x195a398) at Python/ceval.c:4107
#86 call_function (oparg=<optimized out>, pp_stack=0x7fffffff9870) at Python/ceval.c:4042
#87 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2666
#88 0x00007ffff7d14275 in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=<optimized out>, 
    args=<optimized out>, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253
#89 0x00007ffff7c9717c in function_call (func=0x195a230, arg=0xd3122d8, kw=0x0) at Objects/funcobject.c:526
#90 0x00007ffff7c6f403 in PyObject_Call (func=0x195a230, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#91 0x00007ffff7c7cb4f in instancemethod_call (func=0x195a230, arg=0xd3122d8, kw=0x0) at Objects/classobject.c:2578
#92 0x00007ffff7c6f403 in PyObject_Call (func=0xd2a49b0, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#93 0x00007fffd62ed50b in __pyx_pf_4sage_5rings_13residue_field_19ResidueFieldFactory_2create_object (
    __pyx_v_key=<optimized out>, __pyx_self=<optimized out>, __pyx_v_self=<optimized out>, 
    __pyx_v_version=<optimized out>, __pyx_v_kwds=<optimized out>) at sage/rings/residue_field.c:5002
#94 __pyx_pw_4sage_5rings_13residue_field_19ResidueFieldFactory_3create_object (__pyx_self=<optimized out>, 
    __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at sage/rings/residue_field.c:4295
#95 0x00007ffff7c6f403 in PyObject_Call (func=0x1c8c1a0, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
---Type <return> to continue, or q <return> to quit---
#96 0x00007ffff7c7cb4f in instancemethod_call (func=0x1c8c1a0, arg=0xd2a4370, kw=0x64430f0) at Objects/classobject.c:2578
#97 0x00007ffff7c6f403 in PyObject_Call (func=0xd2faa50, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#98 0x00007fffdec2d5d6 in __pyx_f_4sage_9structure_7factory_13UniqueFactory_get_object (__pyx_v_self=0x1c8e0a8, 
    __pyx_v_version=0x162df50, __pyx_v_key=0xd0378c0, __pyx_v_extra_args=0x64430f0, __pyx_skip_dispatch=<optimized out>)
    at sage/structure/factory.c:1311
#99 0x00007fffdec2cb70 in __pyx_pf_4sage_9structure_7factory_13UniqueFactory_4__call__ (__pyx_v_kwds=0x64430f0, 
    __pyx_v_args=0xd30f1d0, __pyx_v_self=0x1c8e0a8) at sage/structure/factory.c:1119
#100 __pyx_pw_4sage_9structure_7factory_13UniqueFactory_5__call__ (__pyx_v_self=0x1c8e0a8, __pyx_args=0xd30f1d0, 
    __pyx_kwds=<optimized out>) at sage/structure/factory.c:987
#101 0x00007ffff7c6f403 in PyObject_Call (func=0x1c8e0a8, arg=<optimized out>, kw=<optimized out>)
    at Objects/abstract.c:2529
#102 0x00007ffff7d1124d in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fffffffa360, func=0x1c8e0a8)
    at Python/ceval.c:4239

Anyway, I had originally thought that callbacks in these situations would be collected and discarded, but apparently we have a scenario here that's not detected by the python GC. If my diagnosis is correct, we should not assume that attributes on self.D are still valid in TripleDictEraser, which surprises me a bit. It's easily checked of course: If self.D.buckets is None, we don't have to do anything. Relevant code:

static int __pyx_tp_clear_4sage_9structure_11coerce_dict_TripleDict(PyObject *o) {
  struct __pyx_obj_4sage_9structure_11coerce_dict_TripleDict *p = (struct __pyx_obj_4sage_9structure_11coerce_dict_TripleDict *)o;
  PyObject* tmp;
  tmp = ((PyObject*)p->buckets);
  p->buckets = Py_None; Py_INCREF(Py_None);
  Py_XDECREF(tmp);
  tmp = ((PyObject*)p->_refcache);
  p->_refcache = ((PyObject*)Py_None); Py_INCREF(Py_None);
  Py_XDECREF(tmp);
  tmp = ((PyObject*)p->eraser);
  p->eraser = ((struct __pyx_obj_4sage_9structure_11coerce_dict_TripleDictEraser *)Py_None); Py_INCREF(Py_None);
  Py_XDECREF(tmp);
  return 0;
}

so indeed, D.buckets is set to None before D.buckets is deleted. I'm surprised that this deletion (of something that is garbage, so the eraser, which has an indirect strong link to it, must be garbage as well) apparently can still trigger a callback. It wouldn't surprise me if we could engineer this into something that would be considered a bug in Python. Anyway, to be safe, check that self.D.buckets is still alive before we do something with it! (either that or my sage/python build is seriously screwed and I'm seeing something that shouldn't happen).
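
In code, the guard I have in mind amounts to something like the following simplified sketch (the real TripleDictEraser of course also has to find and remove the dead entry in its bucket; only the guard is shown here):

# Simplified sketch of the guard only, not the actual coerce_dict code.
cdef class TripleDictEraser:
    cdef object D   # the TripleDict this eraser belongs to (simplified)

    def __call__(self, r):
        buckets = self.D.buckets
        if buckets is None:
            # The dictionary is already being torn down by the garbage
            # collector; there is nothing left for the callback to do.
            return
        # ... otherwise locate the entry whose weak reference r just died
        # in its bucket and remove it (omitted in this sketch) ...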

Changed 4 years ago by nbruin

Make TripleDictEraser a little safer

comment:267 Changed 4 years ago by nbruin

  • Status changed from positive_review to needs_work

The attached patch seems to fix the problem for me. My take is that self.D can't be None. You can't go calling self.__call__() if self has been mutilated! On the other hand, it seems that in rare circumstances self.D.buckets can already be erased, because D is in the process of being torn down. In that case, the callback is not necessary anyway, so we just return.

I think we just have to live with this: cycle breaking can be arbitrarily complicated, so we cannot expect Python to fully protect against it. Any routine involved in deallocation has to be prepared to find attributes already cleared.

Exercise: prove or disprove: determining whether a callback triggered by a GC needs to be executed is NP-hard.

comment:268 follow-up: Changed 4 years ago by SimonKing

  • Can we be sure that self.D is not None when the TripleDictEraser is called? Scenario: The TripleDict used to contain an object O. For some reason, the TripleDict is part of a collectable reference cycle (hence, the strong reference held by the TripleDictEraser would not prevent deletion of the TripleDict), and gc decided to delete it first. Hence, when calling the eraser, the TripleDict itself and not just its buckets is already gone.
  • Shouldn't the same be done with the MonoDictEraser?

comment:269 in reply to: ↑ 268 ; follow-up: Changed 4 years ago by nbruin

Replying to SimonKing:

  • Can we be sure that self.D is not None when the TripleDictEraser is called?

I think so. Setting self.D=None must be part of a dealloc on the TripleDictEraser itself. I don't see how Python could consider it not a bug to call a __call__ method that is in the process of being dealloc-ed (see my remark)

... and gc decided to delete it first. Hence, when calling the eraser, the TripleDict itself and not just its buckets is already gone.

That's why the Python GC does take some efforts to avoid calling avoidable weakref callbacks.

  • Shouldn't the same be done with the MonoDictEraser?

I figured this was probably triggered by an interaction of a non-weakreffable key component holding a reference to a weakreffable key component, which doesn't happen in MonoDict. On the other hand, objects in MonoDict could have a TripleDict on an attribute somewhere and I don't immediately see a proof that such a construct could still lead to this kind of nasty cycle (especially because I don't quite understand the nature of the loop here that makes python execute this unnecessary callback. Apparently Python doesn't purge all garbage callbacks before doing any deletions on cyclic trash)

comment:270 in reply to: ↑ 269 Changed 4 years ago by SimonKing

Replying to nbruin:

Replying to SimonKing:

  • Can we be sure that self.D is not None when the TripleDictEraser is called?

I think so. Setting self.D=None must be part of a dealloc on the TripleDictEraser itself.

Yes, right. So, the only problem is when we access an attribute of the TripleDict that is an attribute of the TripleDictEraser, because the TripleDict could already be destroyed at that point.

  • Shouldn't the same be done with the MonoDictEraser?

I figured this was probably triggered by an interaction of a non-weakreffable key component holding a reference to a weakreffable key component, which doesn't happen in MonoDict. On the other hand, objects in MonoDict could have a TripleDict on an attribute somewhere and I don't immediately see a proof that such a construct could still lead to this kind of nasty cycle (especially because I don't quite understand the nature of the loop here that makes python execute this unnecessary callback. Apparently Python doesn't purge all garbage callbacks before doing any deletions on cyclic trash)

I'd feel safer to test for the bucket being not None also in the case of MonoDict. I'll submit a patch after lunch.

Changed 4 years ago by SimonKing

Making sure Python attributes exist in a weakref callback

comment:271 Changed 4 years ago by SimonKing

  • Description modified (diff)
  • Status changed from needs_work to needs_review

Here is the announced patch.

Apply trac_12313-mono_dict-combined-random-sk.patch trac_12313-safe_callback.patch

comment:272 Changed 4 years ago by nbruin

  • Status changed from needs_review to positive_review

Works for me! By the way, as you can see from the patch I attached earlier, something must have gone wrong with applying the patches when I tested it. In the current version, *Eraser has a weak reference to its dict, and weakrefs normally get cleared first, before any attributes get destroyed. So before D.buckets becomes None, we would already have self.D() == None. Therefore I'd be surprised if this latest patch is actually necessary. Anyway, given how tricky these deallocators are, I have no problem with a few extra (perhaps superfluous) tests.

We have a future ticket to reorganize/optimize these dictionaries anyway. Let's get this in!

Last edited 4 years ago by nbruin (previous) (diff)

comment:273 Changed 4 years ago by SimonKing

  • Dependencies changed from #715, #11521, #12215, #13746 to #715, #11521, #12215, #13746, to be merged with #13378

Thank you! I'd like to see this merged together with #13378 - namely, #12313 introduces the problem that the repeated use of ZZ.ideal(5) accumulates useless data (i.e., the information that there is no coercion map that has 5 as domain) in some coercion dict and makes the command incrementally slower. And this problem is fixed in #13378.

comment:274 Changed 4 years ago by jdemeyer

  • Dependencies changed from #715, #11521, #12215, #13746, to be merged with #13378 to #715, #11521, #12215, #13746, #13378

comment:275 Changed 4 years ago by jdemeyer

  • Milestone changed from sage-pending to sage-5.7

comment:276 follow-up: Changed 4 years ago by jdemeyer

  • Status changed from positive_review to needs_work

Unfortunately, the combination #13378 + #12215 + #12313 gives a serious slowdown for

from sage.combinat.sf.k_dual import DualkSchurFunctions
Sym = SymmetricFunctions(QQ['t'].fraction_field())
dks4 = DualkSchurFunctions(Sym.kBoundedQuotient(4))
time TestSuite(dks4).run()

from 8 seconds before the patches to 90 seconds with the patches.

Just #13378 alone is fine.

Last edited 4 years ago by jdemeyer (previous) (diff)

comment:277 in reply to: ↑ 276 Changed 4 years ago by SimonKing

Replying to jdemeyer:

Unfortunately, the combination #13378 + #12215 + #12313 gives a serious slowdown...

Just #13378 alone is fine.

Did you also test #13378 plus #12215? That's to say, is #12313 to blame, or could it be #12215?

comment:278 Changed 4 years ago by jdemeyer

The patch causing the slowdown is trac_12313-mono_dict-combined-random-sk.patch

In other words: #12215 is fine.

comment:279 follow-up: Changed 4 years ago by SimonKing

Here is the outcome of

sage: from sage.combinat.sf.k_dual import DualkSchurFunctions
sage: Sym = SymmetricFunctions(QQ['t'].fraction_field())
sage: dks4 = DualkSchurFunctions(Sym.kBoundedQuotient(4))
sage: %prun TestSuite(dks4).run()

First, sage-5.6.rc0 with #12215:

     9365    7.254    0.001   11.881    0.001 {method 'subs' of 'sage.structure.element.Element' objects}
    30786    2.596    0.000    3.067    0.000 fraction_field.py:271(<lambda>)
  5358/84    1.781    0.000   26.087    0.311 {sage.combinat.dict_addition.dict_linear_combination}
     1778    1.295    0.001   13.615    0.008 sfa.py:988(_from_cache)
    54541    1.061    0.000    2.015    0.000 infinity.py:883(_element_constructor_)
   456782    1.051    0.000    1.051    0.000 {isinstance}
    27795    0.815    0.000    1.931    0.000 partition.py:3510(Partitions)
   124520    0.754    0.000    0.754    0.000 fraction_field.py:430(ring)
     1715    0.518    0.000    2.106    0.001 integer_list.py:192(rightmost_pivot)
     3951    0.433    0.000    0.547    0.000 free_module.py:411(_coefficient_fast)
     4015    0.416    0.000    1.096    0.000 polynomial_ring.py:290(_element_constructor_)
    54541    0.352    0.000    0.352    0.000 infinity.py:942(__init__)
     2236    0.348    0.000    0.448    0.000 free_module.py:562(support)
    80417    0.347    0.000    0.347    0.000 combinat.py:840(__hash__)
    37514    0.340    0.000    0.340    0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
    37514    0.275    0.000    0.938    0.000 re.py:144(sub)
    51922    0.275    0.000    0.275    0.000 partition.py:4003(__repr__)
    19970    0.262    0.000    2.203    0.000 partition.py:2716(parent)
    16195    0.259    0.000    0.515    0.000 free_module.py:2142(_from_dict)
    37514    0.244    0.000    0.322    0.000 re.py:226(_compile)
     4015    0.238    0.000    1.933    0.000 fraction_field.py:456(_element_constructor_)
    39940    0.232    0.000    0.580    0.000 {repr}
     9365    0.230    0.000    0.292    0.000 fraction_field.py:532(gen)
    35636    0.186    0.000    0.305    0.000 infinity.py:285(__cmp__)
     1715    0.186    0.000    0.634    0.000 integer_list.py:38(first)

Second, adding #13378

     9365    7.525    0.001   12.378    0.001 {method 'subs' of 'sage.structure.element.Element' objects}
    30786    2.666    0.000    3.166    0.000 fraction_field.py:271(<lambda>)
  5363/85    1.908    0.000   29.846    0.351 {sage.combinat.dict_addition.dict_linear_combination}
     1778    1.340    0.001   15.319    0.009 sfa.py:988(_from_cache)
   494409    1.269    0.000    1.269    0.000 {isinstance}
    55877    1.196    0.000    2.254    0.000 infinity.py:883(_element_constructor_)
    27906    0.876    0.000    2.117    0.000 partition.py:3510(Partitions)
   127929    0.809    0.000    0.809    0.000 fraction_field.py:430(ring)
     1855    0.561    0.000    2.359    0.001 integer_list.py:192(rightmost_pivot)
4248/4191    0.493    0.000    1.256    0.000 polynomial_ring.py:290(_element_constructor_)
     4070    0.461    0.000    0.591    0.000 free_module.py:411(_coefficient_fast)
    82386    0.397    0.000    0.397    0.000 combinat.py:840(__hash__)
     2236    0.374    0.000    0.498    0.000 free_module.py:562(support)
    55877    0.372    0.000    0.372    0.000 infinity.py:942(__init__)
    37516    0.367    0.000    0.369    0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
    46893    0.326    0.000    0.764    0.000 {repr}
    56811    0.318    0.000    0.318    0.000 partition.py:4003(__repr__)
    37514    0.304    0.000    1.020    0.000 re.py:144(sub)
    16322    0.301    0.000    0.601    0.000 free_module.py:2142(_from_dict)
    19970    0.290    0.000    2.418    0.000 partition.py:2716(parent)
    37517    0.265    0.000    0.362    0.000 re.py:226(_compile)
     4015    0.261    0.000    2.149    0.001 fraction_field.py:456(_element_constructor_)
     9426    0.239    0.000    0.312    0.000 fraction_field.py:532(gen)
    17385    0.223    0.000    0.365    0.000 combinat.py:667(__init__)
    36359    0.216    0.000    0.349    0.000 infinity.py:285(__cmp__)

And finally adding #12313:

952735/392465   59.690    0.000  204.150    0.001 {method 'has_coerce_map_from' of 'sage.structure.parent.Parent' objects}
  3818389   44.077    0.000   91.652    0.000 {repr}
     1626   27.430    0.017  243.763    0.150 k_dual.py:1277(<genexpr>)
   475630   24.414    0.000   71.594    0.000 {method 'coerce_map_from' of 'sage.structure.parent.Parent' objects}
   471632   23.574    0.000  204.044    0.000 fraction_field.py:194(_coerce_map_from_)
  1909173   15.683    0.000  111.239    0.000 combinat.py:1051(__cmp__)
  1425904   15.023    0.000   23.084    0.000 polynomial_ring.py:1369(_repr_)
475629/475628   13.746    0.000   90.469    0.000 polynomial_ring.py:468(_coerce_map_from_)
  3286022   12.918    0.000   17.094    0.000 unique_representation.py:531(__hash__)
      558   11.842    0.021   73.743    0.132 k_dual.py:1234(<genexpr>)
   475332   11.785    0.000   19.538    0.000 fraction_field.py:380(_repr_)
  1925108   11.632    0.000   11.632    0.000 partition.py:4003(__repr__)
  1439041    8.110    0.000    8.110    0.000 polynomial_ring.py:716(__hash__)
  1425904    8.061    0.000    8.061    0.000 polynomial_ring.py:725(_repr_)
     9365    7.681    0.001   12.918    0.001 {method 'subs' of 'sage.structure.element.Element' objects}
    16080    7.001    0.000    9.528    0.001 homset.py:353(__init__)
  1934768    5.774    0.000    5.774    0.000 {isinstance}
      564    5.764    0.010   33.683    0.060 k_dual.py:1110(<genexpr>)
  3334205    4.241    0.000    4.241    0.000 {id}
  1927477    3.937    0.000    3.937    0.000 {cmp}
  1450944    3.416    0.000    3.416    0.000 rational_field.py:217(__hash__)
   479620    3.123    0.000   24.605    0.000 rational_field.py:261(_coerce_map_from_)
    30786    3.106    0.000    3.593    0.000 fraction_field.py:271(<lambda>)
    16060    2.354    0.000   13.278    0.001 homset.py:80(Hom)
      315    2.335    0.007   50.641    0.161 k_dual.py:1076(retract)

I'm afraid I have to leave now. But the fact that repr is called so often makes me guess that there is indeed caching involved - namely with a hash function that relies on the string representation. But if that is correct, neither MonoDict nor TripleDict is to blame: they compare by identity and thus don't rely on the hash of the keys.
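
As a toy illustration of what I mean by comparing by identity (plain Python, not the actual MonoDict/TripleDict code, which uses weak references and its own bucket layout): keying by id() never calls __hash__ or __repr__ on the keys, whereas an ordinary dict calls __hash__ on every access.

# Toy illustration only -- not the actual MonoDict/TripleDict implementation.
class IdentityKeyedDict(object):
    """Store values keyed by the identity of the key object."""
    def __init__(self):
        self._data = {}                       # maps id(key) -> (key, value)

    def __setitem__(self, key, value):
        # Never calls key.__hash__ or key.__repr__.
        self._data[id(key)] = (key, value)

    def __getitem__(self, key):
        # Looks up by identity; raises KeyError if this exact object
        # was never stored.
        return self._data[id(key)][1]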

comment:280 Changed 4 years ago by SimonKing

PS: Also I don't like that so many homsets become involved. I wonder how many of them are distinct.

comment:281 in reply to: ↑ 279 ; follow-up: Changed 4 years ago by nbruin

We've seen things like this before. This is exactly what one would see from code that greatly benefited from rings (and hence their coercions!) being nailed in memory. It's worth investigating whether there are some obvious optimizations to make.

However, aren't testsuites made from automatic tests? In that case, wouldn't it be a very atypical mess of tiny, self-contained constructions that test, e.g., commutativity, properties of the identity, etc.? We shouldn't be optimizing code for those.

Otherwise just stick a strong cache in DualSchurFunctions somewhere. That's such a specialized component that the memory leaking will likely only affect those who know enough about the code to fix it once it's necessary.

comment:282 in reply to: ↑ 281 Changed 4 years ago by SimonKing

Replying to nbruin:

We've seen things like this before. This is exactly what one would see from code that greatly benefited from rings (and hence their coercions!) being nailed in memory.

Well, they are still nailed in memory, as long as you keep a pointer to domain and codomain of the coercion.

It's worth investigating whether there are some obvious optimizations to make.

Sure.

However, aren't testsuites made from automatic tests? In that case, wouldn't it be a very atypical mess of tiny, self-contained constructions that test, e.g., commutativity, properties of the identity, etc.? We shouldn't be optimizing code for those.

Yes - and the fact that these tiny little constructions take so much time means that they are totally useless and have to be improved.

Otherwise just stick a strong cache in DualSchurFunctions somewhere.

No, I don't think this would be a solution. The dramatic increase in running time probably points to a general problem.

comment:283 follow-up: Changed 4 years ago by nbruin

Had a quick look. To simplify and speed up the tests a little bit, you can concentrate on one of the tests that's slow:

sage: cmd='dks4._test_associativity()'
sage: import cProfile,pstats
sage: s=pstats.Stats(cProfile.Profile().runctx(cmd,globals(),{}))
sage: S=s.sort_stats('cumulative')
sage: S.print_stats()
...
sage: S.print_callers()
...

the latter one essentially gives you the call graph, weighted by time contributed.

In fact it shows you pretty quickly that SchurFunctions were apparently relying VERY heavily on parents being cached:

X = dks4[0] + 2*dks4[1] + 3*dks4[2]
(X*X)*X   # wait quite a long time

This mainly uses dks4.product, and you can follow the code from there to see that it has not been written with runtime optimization in mind. If you cache some of the things computed along the way there, you'll probably regain the original performance.
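
For illustration only, here is a minimal sketch of the kind of caching meant above; the class and method names are made up and are not the actual k_dual.py code:

from sage.misc.cachefunc import cached_method

class DualBasisSketch(object):
    # hypothetical stand-in for a parent whose product() keeps rebuilding the same data
    def __init__(self, n):
        self.n = n

    @cached_method
    def realization(self):
        # expensive construction, now computed only once per instance
        return sum(range(self.n))

The first call to realization() does the work; later calls return the cached value, which is the kind of change that should win back the time previously saved by the strong coercion cache.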

My first impression is that this code indeed will make a concrete realization of some structure, perform the operation, and throw away the concrete realization. Elements on the way basically ARE homs, hence all the homsets. I'm sure someone will be grateful if you clean up this design, but I'm not so sure it should be part of this ticket or a requirement for its merger. Good luck!

comment:284 in reply to: ↑ 283 Changed 4 years ago by SimonKing

  • Cc nthiery added

Replying to nbruin:

...

In fact it shows you pretty quickly that SchurFunctions were apparently relying VERY heavily on parents being cached: ...

My first impression is that this code indeed will make a concrete realization of some structure, perform the operation, and throw away the concrete realization. Elements on the way basically ARE homs, hence all the homsets. I'm sure someone will be grateful if you clean up this design, but I'm not so sure it should be part of this ticket or a requirement for its merger. Good luck!

Cc to Nicolas, perhaps he has remarks on it.

comment:285 Changed 4 years ago by SimonKing

The patch needs to be rebased, because of a change in #12215.

Changed 4 years ago by SimonKing

Combined patch - only apply this patch

comment:286 Changed 4 years ago by SimonKing

  • Description modified (diff)
  • Work issues set to Speed regression

The patch is updated. I used the occasion to combine the two patches. Hence:

Apply trac_12313-mono_dict-combined-random-sk.patch

It still needs work, because of the speed regression.

comment:287 follow-up: Changed 4 years ago by nbruin

Perhaps follow up speed regressions on #13991? Once it's clearer what the problem is, we can decide if that ticket should be a dependency for this one. Obviously, if it turns out to be a design flaw there, we shouldn't hold the rest of sage hostage to its resolution. The problem addressed by this ticket is a real one.

comment:288 in reply to: ↑ 287 Changed 4 years ago by jpflori

Replying to nbruin:

Perhaps follow up speed regressions on #13991? Once it's clearer what the problem is, we can decide if that ticket should be a dependency for this one. Obviously, if it turns out to be a design flaw there, we shouldn't hold the rest of sage hostage to its resolution. The problem addressed by this ticket is a real one.

I'd prefer this approach as well, being more affected by the cache problems than by the Schur functions, which I don't use. And I assume this is the case for the majority of users.

comment:289 follow-up: Changed 4 years ago by jdemeyer

But how are we sure that the speed regression is limited to that one file about Schur functions? It seems to me that it might just be an obvious place where the regression is observed. A priori, many other parts of the Sage library might also be affected.

comment:290 Changed 4 years ago by jpflori

Good point (although some speed regression is to be expected anyway: we cache fewer things, so we will have to rebuild them unless we are very careful; I guess the goal is that this regression is not terrible on average).

Do you have scripts that automatically look for these? Let's say by comparing the runtime of doctests? Or do you have to find them by inspection?

comment:291 in reply to: ↑ 289 ; follow-up: Changed 4 years ago by SimonKing

Replying to jdemeyer:

A priori, many other parts of the Sage library might also be affected.

And on the other hand, many other parts of the Sage library are affected by keeping garbage in memory indefinitely.

I think the normal behaviour of a CAS should be that a structure S is considered garbage and can be collected, unless the user has constructed some object O and keeps a reference to it, and O keeps a chain of references to S. Sadly, that was not the case for Sage, but this ticket is part of an attempt to make Sage behave "normally" in that department.

Hence, if the same structure is needed repeatedly during a computation, then the programmer should keep a reference to it locally, rather than relying on an external cache. And if different methods of an object O will use the same structure S, then it makes sense to store S as an attribute of O, in order to prevent S from being garbage collected as long as O is alive (but not longer than that).
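
A minimal sketch of that rule (the class is made up, purely for illustration):

from sage.all import PolynomialRing, QQ

class UsesAPolynomialRing(object):
    # hypothetical object O that needs the same structure S in several methods
    def __init__(self):
        # storing S as an attribute keeps it alive exactly as long as O is alive
        self._ring = PolynomialRing(QQ, 'x')

    def a_generator(self):
        return self._ring.gen()

    def a_zero(self):
        return self._ring.zero()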

I think what Nils wanted to say is: If other parts of the Sage library do not obey the general rules formulated in the preceding paragraph, but rely on an infinite external cache, then they need to be fixed. But the need to fix them should not prevent us from changing the infinite external cache to a finite external cache, because that is of benefit for all parts in the Sage library that follow the general rules.

Jeroen, I guess you have access to doc test timings for various versions of Sage. Would it be possible for you to find out which tests become slower with #12313? That might give us an idea how bad of a regression we would have.

comment:292 in reply to: ↑ 291 Changed 4 years ago by jdemeyer

Replying to SimonKing:

Jeroen, I guess you have access to doc test timings for various versions of Sage. Would it be possible for you to find out which tests become slower with #12313? That might give us an idea how bad of a regression we would have.

OK, can we put a freeze on this ticket then and keep the patch fixed, so that I'm not chasing a moving target?

comment:293 Changed 4 years ago by jdemeyer

  • Status changed from needs_work to positive_review
  • Work issues Speed regression deleted

comment:295 follow-up: Changed 4 years ago by jdemeyer

Another different regression:

def test(RR):
    for d in range(-20,0):
        if abs(RR(quadratic_L_function__numerical(1, d, 10000) - quadratic_L_function__exact(1, d))) > 0.001:
            print "Oops!  We have a problem at d = ", d, "    exact = ", RR(quadratic_L_function__exact(1, d)), "    numerical = ", RR(quadratic_L_function__numerical(1, d))

time test(RealField(50))

From 3.0 seconds in sage-5.7.beta0 to 4.2 seconds with #12215 + #12313.

comment:296 Changed 4 years ago by jdemeyer

Another one:

M = ModularSymbols(DirichletGroup(5).0, 7)
time MM = M.change_ring(CyclotomicField(8))

from 1.3 seconds to 2.1 seconds.

comment:297 follow-up: Changed 4 years ago by jpflori

Maybe we should open a "metaticket" to track these regressions and point from there to the tickets that will actually fix each of them?

It's just that this ticket is getting quite long and we risk losing track of the regressions (if more are to come).

comment:298 in reply to: ↑ 297 Changed 4 years ago by SimonKing

Replying to jpflori:

Maybe we should open a "metaticket" to track these regressions and point from there to the tickets that will actually fix each of them?

To my understanding, the memory leak fixes are fine up to and including #12215. One could argue that this ticket also fixes a general memory leak, and that fixing a general leak counts for more than a few speed regressions in specialised code. But one could also argue that we should be happy to have the fixes up to #12215 in the next Sage release and should not spoil it with several speed regressions; hence, #12313 should not be merged until we find a way to lift the regressions.

I don't know what Jeroen prefers.

If he intends to merge #12313 soon, then it would make sense to change #13991 into a task (=metaticket), the original purpose of #13991 probably being solved by #13605.

If he intends to not merge #12313 before all regressions are solved, then I see no benefit in creating a separate metaticket. Instead, one should put #12313 as "needs work" and create new tickets for the separate aspects of the regression, making them dependencies for #12313.

Last edited 4 years ago by SimonKing (previous) (diff)

comment:299 Changed 4 years ago by jdemeyer

I don't really know; it's not an easy decision for me either. Probably I'll wait a bit to merge this to assess the situation better.

comment:300 in reply to: ↑ 295 Changed 4 years ago by nbruin

Replying to jdemeyer:

The profile calls look exactly the same. More time is spent in sage.rings.arith.kronecker_symbol, though, and oddly enough the time seems to go into conversion from ZZ to QQ:

x = -20
def test():
    for n in xrange(10**7):
        _ = QQ(x)
 
sage: time test()

goes from Time: CPU 2.97 s, Wall: 2.98 s to Time: CPU 8.53 s, Wall: 8.57 s. Python-level profiling doesn't tell us anything here.

I suspect that the whole QQ(x).numerator() * QQ(x).denominator() thing in kronecker_symbol is actually unnecessary for the purposes where it's used (it doesn't do anything of the sort for y), but that's rather irrelevant: this seems to be a case where discovery of a rather obvious conversion has become quite a bit more expensive!

EDIT:

It could just be this:

sage: M=sage.structure.coerce_dict.MonoDict(23)
sage: M[ZZ]=1
sage: %timeit _=M[ZZ]
625 loops, best of 3: 331 ns per loop
sage: D={ZZ:1}
sage: %timeit _=D[ZZ]
625 loops, best of 3: 110 ns per loop

which is fair enough: MonoDict is a weak key dictionary, so looking up an entry has an extra level of indirection. We're bound to lose some time there. I don't think we can count that as a regression, though: We weren't doing enough work before.

Compared to the (python) weakref.WeakKeyDictionary we do pretty well:

sage: W=weakref.WeakKeyDictionary()
sage: W[ZZ]=1
sage: %timeit _=W[ZZ]
625 loops, best of 3: 1.36 µs per loop

It would be good if someone else could confirm that this explains the regression. I don't think it's a deadly one: if someone wants to do a conversion in an inner loop, they probably know the type of object they're working with, so they could just extract the morphism and apply it directly. That avoids any lookup. I'd deem this a necessary regression that doesn't need to be addressed, or at least one that can be addressed partially later, as with #13387 (which presently mainly indexes comments on this megaticket).
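
For instance, with the ZZ-to-QQ conversion from above, one would hoist the lookup out of the loop roughly like this:

sage: phi = QQ.coerce_map_from(ZZ)      # look up the morphism once, outside the loop
sage: [phi(ZZ(n)) for n in range(5)]    # then apply it directly; no per-call lookup
[0, 1, 2, 3, 4]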

Last edited 4 years ago by nbruin (previous) (diff)

comment:301 Changed 4 years ago by jdemeyer

  • Milestone changed from sage-5.7 to sage-5.8

comment:302 follow-up: Changed 4 years ago by SimonKing

Note the following:

sage: M=sage.structure.coerce_dict.MonoDict(23)
sage: R = ZZ['x','y','z']
sage: M[R]=1
sage: D={R:1}
sage: timeit("_=M[R]", number=10^6)
1000000 loops, best of 3: 296 ns per loop
sage: timeit("_=D[R]", number=10^6)
1000000 loops, best of 3: 348 ns per loop
sage: D[ZZ]=2
sage: M[ZZ]=2
sage: timeit("_=M[R]", number=10^6)
1000000 loops, best of 3: 307 ns per loop
sage: timeit("_=D[R]", number=10^6)
1000000 loops, best of 3: 351 ns per loop

So, MonoDict isn't that bad, speed-wise. I am afraid I have no clue where the regression in _=QQ(x) really comes from. Python calls seem not to be involved, neither with nor without #12313.

comment:303 Changed 4 years ago by SimonKing

Together with #13387, one gets

sage: M=sage.structure.coerce_dict.MonoDict(23)
sage: R = ZZ['x','y','z']
sage: M[R]=1
sage: D={R:1}
sage: timeit("_=M[R]", number=10^6)
1000000 loops, best of 3: 204 ns per loop
sage: timeit("_=D[R]", number=10^6)
1000000 loops, best of 3: 361 ns per loop
sage: D[ZZ]=2
sage: M[ZZ]=2
sage: timeit("_=M[R]", number=10^6)
1000000 loops, best of 3: 204 ns per loop
sage: timeit("_=D[R]", number=10^6)
1000000 loops, best of 3: 359 ns per loop
sage: timeit("_=M[ZZ]", number=10^6)
1000000 loops, best of 3: 206 ns per loop
sage: timeit("_=D[ZZ]", number=10^6)
1000000 loops, best of 3: 98.4 ns per loop

Hence, it becomes faster, but a Python dictionary is still faster if "easy" targets such as ZZ are involved.

Indeed, ZZ is very quick at basic operations:

sage: timeit("hash(ZZ)", number=10^6)
1000000 loops, best of 3: 103 ns per loop
sage: timeit("id(ZZ)", number=10^6)
1000000 loops, best of 3: 102 ns per loop
sage: timeit("hash(R)", number=10^6)
1000000 loops, best of 3: 329 ns per loop

Hence, the hash of ZZ (used in dict) is as fast as id (used in MonoDict). No surprise that it is that fast.

comment:304 in reply to: ↑ 302 ; follow-up: Changed 4 years ago by nbruin

Replying to SimonKing:

...
1000000 loops, best of 3: 348 ns per loop
sage: D[ZZ]=2
sage: M[ZZ]=2
sage: timeit("_=M[R]", number=10^6)
1000000 loops, best of 3: 307 ns per loop
sage: timeit("_=D[R]", number=10^6)
1000000 loops, best of 3: 351 ns per loop

Those timing lines don't relate to ZZ. Since your other timings are in line with what I found doing exactly the same example above, you'd probably also see that lookup of ZZ in D is about 3 times as fast as it is in M.

It's interesting to see that many rings in Sage are so much slower (unique parents should always be able to use id for equality and for their hash, right?).

Anyway, going from a strong dict to a weak key dict is *supposed* to be a little slower, so this is not an issue. We'll just have to live with it.

A more serious issue is the limited use of MonoDict in this case. Essentially, this MonoDict only has an effect for the cases where we cache that there is NO coercion: In the other cases, we have an entry

M[Domain] = coercion map from Domain to This Parent

i.e., the value in the weak key dict will hold a strong reference to the key. A self-quotation from sage-flame seems appropriate.

I think this is a fundamental problem and just one that people will have to keep in mind: coercing TO something will make the object you coerce FROM stay in memory for as long as the target is there. It's just a little disappointing that all this effort for a weak key dict is only going to make a difference for the entries that store None as a value. Oh well. That does happen a lot. AND we are getting speed increases in many cases!

comment:305 in reply to: ↑ 304 ; follow-up: Changed 4 years ago by SimonKing

Replying to nbruin:

Replying to SimonKing:

...
1000000 loops, best of 3: 348 ns per loop
sage: D[ZZ]=2
sage: M[ZZ]=2
sage: timeit("_=M[R]", number=10^6)
1000000 loops, best of 3: 307 ns per loop
sage: timeit("_=D[R]", number=10^6)
1000000 loops, best of 3: 351 ns per loop

Those timing lines don't relate to ZZ.

I didn't claim they do. This was in order to show that under some not-so-unlikely circumstances MonoDict is faster than a usual dict.

Since your other timings are in line with what I found doing exactly the same example above, you'd probably also see that lookup of ZZ in D is about 3 times as fast as it is in M.

Yep, I posted these timings as well, didn't I?

A more serious issue is the limited use of MonoDict in this case. Essentially, this MonoDict only has an effect for the cases where we cache that there is NO coercion: In the other cases, we have an entry

M[Domain] = coercion map from Domain to This Parent

i.e., the value in the weak key dict will hold a strong reference to the key.

I am not so sure, because I think this was one of the memleaks I explicitly addressed somewhere. But since my Sage install is currently without #12313, I need to recompile now -- which takes an eternity. I hope I'll be able to show you an example.

Anyway.

Here is a timing that shows that this patch has a bad effect both on looking up the coercion map and applying the coercion:

Without #12313

sage: timeit("phi=QQ.coerce_map_from(ZZ)", number=10^6)
1000000 loops, best of 3: 503 ns per loop
sage: phi=QQ.coerce_map_from(ZZ)
sage: x = -20
sage: timeit("phi(x)", number=10^6)
1000000 loops, best of 3: 563 ns per loop

With #12313:

sage: timeit("phi=QQ.coerce_map_from(ZZ)", number=10^6)
1000000 loops, best of 3: 738 ns per loop
sage: phi=QQ.coerce_map_from(ZZ)
sage: x = -20
sage: timeit("phi(x)", number=10^6)
1000000 loops, best of 3: 616 ns per loop
Last edited 4 years ago by SimonKing (previous) (diff)

comment:306 Changed 4 years ago by SimonKing

Hmm. The example I tried did not work. I could be mistaken, but I think one of my tickets addresses this problem.

comment:307 follow-up: Changed 4 years ago by jpflori

Could someone doctest 5.7.beta2 with MALLOC_CHECK_=3?

As I reported here https://groups.google.com/d/msg/sage-release/nvm8f82Vs-U/Y994aicmNQYJ, it could be the case that this plus some tickets merged in beta2 revealed some new bugs.

comment:308 in reply to: ↑ 307 Changed 4 years ago by SimonKing

Replying to jpflori:

Could someone doctest 5.7.beta2 with MALLOC_CHECK_=3?

As I reported here https://groups.google.com/d/msg/sage-release/nvm8f82Vs-U/Y994aicmNQYJ, it could be the case that this plus some tickets merged in beta2 revealed some new bugs.

Arrgh. Not again :(

In your sage-release post, you talk about #12215, but you also say that everything seems fine with your 5.7.beta1 install (which does contain #12215). So, you are saying that sage-5.7.beta2 (not beta1) plus #12313 crashes?

comment:309 Changed 4 years ago by jpflori

Oops, I'm deeply sorry, I thought I was posting on #12215.

And I'm indeed aware that #12215 was merged in beta1 and that the tests pass on it, but at first sight at least the backtraces suggest that #12215 has something to do with it.

Hopefully, we only need a simple fix for GroebnerStrategy objects, as we did in #11521 (IIRC). Indeed, the new crashes seem to involve similar objects, but in different files than before.

comment:310 Changed 4 years ago by SimonKing

Aha, so, is the crash with vanilla sage-5.7.beta1 and sage-5.7.beta2, without additional patches?

comment:311 Changed 4 years ago by jpflori

The crash is:

  • not happening on vanilla 5.7.beta1
  • happening on vanilla 5.7.beta2

So it might be the case that #12215 is not the culprit, but from the backtraces produced it looks like it is involved. Maybe it's the combination of tickets in beta2 that triggers the errors. Remember just how problematic it was to reproduce the problem caused by #715 et al.

comment:312 in reply to: ↑ 305 ; follow-up: Changed 4 years ago by nbruin

Replying to SimonKing:

Yep, I posted these timings as well, didn't I?

You didn't report the timings for ZZ lookup with #12313 without #13387. Probably just a pasting mistake. Doesn't matter, we have the general picture.

I am not so sure, because I think this was one of the memleaks I explicitly addressed somewhere.

You did something for actions somewhere, where you ensured the maps were on underlying sets, for this reason. But the plain case of one ring coercing into another (e.g., virtually any ring into SR) does nail the ring into memory. Some examples (with #12313):

Having no coerce map indeed works fine now:

sage: import gc
sage: T=type(GF(2))
sage: 
sage: for p in prime_range(2,300):
....:     k=GF(p)
....:     _=ZZ.has_coerce_map_from(k)
....: 
sage: del k
sage: gc.collect()
1412
sage: len(list(o for o in gc.get_objects() if type(o) is T))
1

and having coerce maps in the opposite direction doesn't hurt either:

sage: for p in prime_range(2,300):
....:     k=GF(p)
....:     _=k.has_coerce_map_from(ZZ)
....: 
sage: del k
sage: 
sage: gc.collect()
7852
sage: len(list(o for o in gc.get_objects() if type(o) is T))
1

But having for instance a convert map (these are cached in the same way as coercions) things do get nailed into memory:

sage: for p in prime_range(2,300):
....:     k=GF(p)
....:     _=ZZ.convert_map_from(k) #this nails k in memory
....: 
sage: del k
sage: gc.collect()
0
sage: len(list(o for o in gc.get_objects() if type(o) is T))
62

And this is why coercing into SR (anything goes there) is so detrimental:

sage: import gc
sage: sage: T=type(GF(2))
sage: len(list(o for o in gc.get_objects() if type(o) is T))
1
sage: for p in prime_range(2,300):
....:     k=GF(p)
....:     _=SR.has_coerce_map_from(k)
....: 
sage: del k, _
sage: gc.collect()
271
sage: len(list(o for o in gc.get_objects() if type(o) is T))
62

Back to worrying things. Just to double check that element creation is not a problem:

sage: import gc
sage: sage: T=type(GF(2))
sage: len(list(o for o in gc.get_objects() if type(o) is T))
1
sage: for p in prime_range(2,300):
....:     k=GF(p)
....:     _=k(1)
....: 
sage: del k, _
sage: gc.collect()
2656
sage: len(list(o for o in gc.get_objects() if type(o) is T))
1

so one would hope that the following would be OK, but it's not. I guess binops get cached somewhere as well?

sage: for p in prime_range(2,300):
....:     k=GF(p)
....:     _=k(1)+ZZ(1)
....: 
sage: del k, _
sage: gc.collect()
0
sage: len(list(o for o in gc.get_objects() if type(o) is T))
62

(I've checked that ZZ(1)+k(1) gives the same result). That is a worrying leak, I think. Adding ZZ(1) to a ring element is something people are quite likely to think is reasonable. But that seems to nail the ring into memory again! I think this means there is room for a follow-up ticket. I think a nice lead is the result from:

sage: cm = sage.structure.element.get_coercion_model()
sage: maps, actions = cm.get_cache()

this seems to hold plenty of references to finite fields after the last example, both in keys and in values (I suppose the coercion model keeps a lot of these things in a TripleDict, so the key references shouldn't matter, but the references in values do, because those contribute to refcounts).

I'm afraid the binary operation discovery of the coercion framework is a memory leak, so the proper solution is probably to give people good tools to avoid depending on it.
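
One such tool is simply to do the conversion yourself, so that the arithmetic stays inside a single parent and the cross-parent discovery is not consulted in the inner loop. A sketch of the pattern (whether it avoids all of the cache entries above would of course need checking):

sage: k = GF(101)
sage: one = k(ZZ(1))       # convert explicitly, once
sage: _ = k(1) + one       # same-parent addition: no binop discovery needed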

comment:313 in reply to: ↑ 312 Changed 4 years ago by nbruin

Replying to nbruin:

I'm afraid the binary operation discovery of the coercion framework is a memory leak, so the proper solution is probably to give people good tools to avoid depending on it.

#14058 is making good progress on resolving that issue and depends on this ticket to have effect.

The speed regression reported in comment 276 has been diagnosed and fixed on #13991. This conflicts with work on the much more broadly scoped #13605, but that ticket doesn't seem to be going in any time soon, so we should probably just push #13991 in before it. I think one can even consider merging #12313 without it.

The speed regression reported in comment 295 is minor and probably unavoidable to some degree, because a lookup in a weakly keyed dict is fundamentally more expensive than in a normal dict. #13387 mitigates the effects to some extent, but again, merging #12313 without it is acceptable in my opinion.

comment:314 follow-up: Changed 4 years ago by jdemeyer

Now that you're diagnosing anyway, how about 296?

comment:315 in reply to: ↑ 314 Changed 4 years ago by nbruin

Replying to jdemeyer:

Now that you're diagnosing anyway, how about 296?

I guess it's the same kind of overhead as in comment 295. Anyway, by including and using cdef access to MonoDict in the coercion lookups I have all but eliminated these speed regressions. Previously, this code was dictionary access, which Cython probably special-cases to very fast code. The migration to MonoDict must have reverted that to a more generic and slower __getitem__ and __setitem__ slot lookup. See #13387 for the new patches.

It's hard to get times that are consistent across separate builds of identical code! It seems things like the branch name and choices made by the compiler have quite a bit of influence on these tests. The time differences I am observing now are within the variation I see when just rebuilding the same code.

Ticket #13991 will still be necessary to solve the regression it is intended to address.

Last edited 4 years ago by nbruin (previous) (diff)

comment:316 Changed 4 years ago by jdemeyer

  • Merged in set to sage-5.8.beta0
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:317 Changed 4 years ago by jdemeyer

  • Merged in sage-5.8.beta0 deleted
  • Resolution fixed deleted
  • Status changed from closed to new

This causes problems during a test build of sage-5.8.beta1:

sage -t  -force_lib devel/sage/sage/rings/padics/padic_generic_element.pyx
**********************************************************************
File "/release/merger/sage-5.8.beta1/devel/sage-main/sage/rings/padics/padic_generic_element.pyx", line 1005:
    sage: R2 = Qp(2,20,'capped-rel')
Exception raised:
    Traceback (most recent call last):
      File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1231, in run_one_test
        self.run_one_example(test, example, filename, compileflags)
      File "/release/merger/sage-5.8.beta1/local/bin/sagedoctest.py", line 38, in run_one_example
        OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags)
      File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1172, in run_one_example
        compileflags, 1) in test.globs
      File "<doctest __main__.example_24[24]>", line 1, in <module>
        R2 = Qp(Integer(2),Integer(20),'capped-rel')###line 1005:
    sage: R2 = Qp(2,20,'capped-rel')
      File "factory.pyx", line 143, in sage.structure.factory.UniqueFactory.__call__ (sage/structure/factory.c:1119)
      File "factory.pyx", line 170, in sage.structure.factory.UniqueFactory.get_object (sage/structure/factory.c:1311)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/factory.py", line 493, in create_object
        return pAdicFieldCappedRelative(p, prec, {'mode': print_mode, 'pos': print_pos, 'sep': print_sep, 'alphabet': print_alphabet, 'ram_name': name, 'max_ram_terms': print_max_terms}, name)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_leaves.py", line 457, in __init__
        pAdicFieldBaseGeneric.__init__(self, p, prec, print_mode, names, pAdicCappedRelativeElement)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_generic.py", line 27, in __init__
        coerce_list = [pAdicCoercion_ZZ_CR(self), pAdicCoercion_QQ_CR(self)]
      File "padic_base_coercion.pyx", line 79, in sage.rings.padics.padic_base_coercion.pAdicCoercion_ZZ_CR.__init__ (sage/rings/padics/padic_base_coercion.c:5839)
        RingHomomorphism_coercion.__init__(self, ZZ.Hom(R), check=False)
      File "parent_base.pyx", line 128, in sage.structure.parent_base.ParentWithBase.Hom (sage/structure/parent_base.c:1870)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/categories/homset.py", line 266, in Hom
        _cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))
      File "coerce_dict.pyx", line 1059, in sage.structure.coerce_dict.TripleDict.__setitem__ (sage/structure/coerce_dict.c:6614)
      File "coerce_dict.pyx", line 1079, in sage.structure.coerce_dict.TripleDict.set (sage/structure/coerce_dict.c:6880)
    KeyError: (17671424, 87123360, 20657216)
**********************************************************************
File "/release/merger/sage-5.8.beta1/devel/sage-main/sage/rings/padics/padic_generic_element.pyx", line 1068:
    sage: Qp(7)(0).abs()
Exception raised:
    Traceback (most recent call last):
      File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1231, in run_one_test
        self.run_one_example(test, example, filename, compileflags)
      File "/release/merger/sage-5.8.beta1/local/bin/sagedoctest.py", line 38, in run_one_example
        OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags)
      File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1172, in run_one_example
        compileflags, 1) in test.globs
      File "<doctest __main__.example_25[4]>", line 1, in <module>
        Qp(Integer(7))(Integer(0)).abs()###line 1068:
    sage: Qp(7)(0).abs()
      File "factory.pyx", line 143, in sage.structure.factory.UniqueFactory.__call__ (sage/structure/factory.c:1119)
      File "factory.pyx", line 170, in sage.structure.factory.UniqueFactory.get_object (sage/structure/factory.c:1311)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/factory.py", line 493, in create_object
        return pAdicFieldCappedRelative(p, prec, {'mode': print_mode, 'pos': print_pos, 'sep': print_sep, 'alphabet': print_alphabet, 'ram_name': name, 'max_ram_terms': print_max_terms}, name)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_leaves.py", line 457, in __init__
        pAdicFieldBaseGeneric.__init__(self, p, prec, print_mode, names, pAdicCappedRelativeElement)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_generic.py", line 27, in __init__
        coerce_list = [pAdicCoercion_ZZ_CR(self), pAdicCoercion_QQ_CR(self)]
      File "padic_base_coercion.pyx", line 79, in sage.rings.padics.padic_base_coercion.pAdicCoercion_ZZ_CR.__init__ (sage/rings/padics/padic_base_coercion.c:5839)
        RingHomomorphism_coercion.__init__(self, ZZ.Hom(R), check=False)
      File "parent_base.pyx", line 128, in sage.structure.parent_base.ParentWithBase.Hom (sage/structure/parent_base.c:1870)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/categories/homset.py", line 266, in Hom
        _cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))
      File "coerce_dict.pyx", line 1059, in sage.structure.coerce_dict.TripleDict.__setitem__ (sage/structure/coerce_dict.c:6614)
      File "coerce_dict.pyx", line 1079, in sage.structure.coerce_dict.TripleDict.set (sage/structure/coerce_dict.c:6880)
    KeyError: (17671424, 87232416, 20657216)
**********************************************************************
File "/release/merger/sage-5.8.beta1/devel/sage-main/sage/rings/padics/padic_generic_element.pyx", line 1070:
    sage: Qp(7)(0).abs(prec=20)
Exception raised:
    Traceback (most recent call last):
      File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1231, in run_one_test
        self.run_one_example(test, example, filename, compileflags)
      File "/release/merger/sage-5.8.beta1/local/bin/sagedoctest.py", line 38, in run_one_example
        OrigDocTestRunner.run_one_example(self, test, example, filename, compileflags)
      File "/release/merger/sage-5.8.beta1/local/bin/ncadoctest.py", line 1172, in run_one_example
        compileflags, 1) in test.globs
      File "<doctest __main__.example_25[5]>", line 1, in <module>
        Qp(Integer(7))(Integer(0)).abs(prec=Integer(20))###line 1070:
    sage: Qp(7)(0).abs(prec=20)
      File "factory.pyx", line 143, in sage.structure.factory.UniqueFactory.__call__ (sage/structure/factory.c:1119)
      File "factory.pyx", line 170, in sage.structure.factory.UniqueFactory.get_object (sage/structure/factory.c:1311)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/factory.py", line 493, in create_object
        return pAdicFieldCappedRelative(p, prec, {'mode': print_mode, 'pos': print_pos, 'sep': print_sep, 'alphabet': print_alphabet, 'ram_name': name, 'max_ram_terms': print_max_terms}, name)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_leaves.py", line 457, in __init__
        pAdicFieldBaseGeneric.__init__(self, p, prec, print_mode, names, pAdicCappedRelativeElement)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/rings/padics/padic_base_generic.py", line 27, in __init__
        coerce_list = [pAdicCoercion_ZZ_CR(self), pAdicCoercion_QQ_CR(self)]
      File "padic_base_coercion.pyx", line 79, in sage.rings.padics.padic_base_coercion.pAdicCoercion_ZZ_CR.__init__ (sage/rings/padics/padic_base_coercion.c:5839)
        RingHomomorphism_coercion.__init__(self, ZZ.Hom(R), check=False)
      File "parent_base.pyx", line 128, in sage.structure.parent_base.ParentWithBase.Hom (sage/structure/parent_base.c:1870)
      File "/release/merger/sage-5.8.beta1/local/lib/python/site-packages/sage/categories/homset.py", line 266, in Hom
        _cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))
      File "coerce_dict.pyx", line 1059, in sage.structure.coerce_dict.TripleDict.__setitem__ (sage/structure/coerce_dict.c:6614)
      File "coerce_dict.pyx", line 1079, in sage.structure.coerce_dict.TripleDict.set (sage/structure/coerce_dict.c:6880)
    KeyError: (17671424, 82669344, 20657216)
**********************************************************************

comment:318 Changed 4 years ago by nbruin

I am not able to reproduce this error; however, "coerce_dict.pyx", line 1079 is

# Test whether the old references are still active
r1,r2,r3 = <tuple>(self._refcache[h1,h2,h3])

the line "homset.py", line 266:

    _cache[key] = KeyedRef(H, _cache.eraser, (id(X),id(Y),id(category)))

was introduced in #11521, the trac_11521_callback.patch, to be precise.

I'm a little worried about this implementation, because it relies on implementation details of a TripleDict that aren't really part of the API (in particular, it relies on the TripleDict.eraser attribute and its call convention). I think this should be refactored to a method

_cache.set_weakvalued(key, H)

so that the TripleDict can figure out how to put a properly formatted callback on the value.
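
A rough sketch of what such a method could look like (the name is hypothetical, and the body simply moves the line from homset.py quoted above into the TripleDict itself):

from weakref import KeyedRef

def set_weakvalued(self, key, value):
    # hypothetical TripleDict method: store `value` under the triple `key` via a
    # weak reference, with the callback formatted the way the eraser expects
    x, y, z = key
    self[key] = KeyedRef(value, self.eraser, (id(x), id(y), id(z)))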

A quick "solution" might be to merge #13387 as well, since that gets rid of self._refcache. However, I'm a little puzzled how this error can occur:

  • By the time we arrive at line 1079 we have succeeded in finding the key triple (h1,h2,h3).
  • The KeyError indicates that _refcache doesn't contain the key anymore.
  • But key removal from _refcache and from bucket is inextricably linked! This means that a key removal was triggered somewhere between finding the element in the bucket and looking it up in _refcache. But that should mean that bucket got modified too! That's really worrisome, because at this point we're ready to scribble in bucket at a location we've counted before.

Indeed, if we are allocating memory anywhere in the loop that walks through the bucket we could be triggering a garbage collection, which could indeed include callbacks that modify this very bucket. I don't think any of the lines here could do that, though, except for (self._refcache[h1,h2,h3]) itself, if that needs to construct a tuple or something like that.

So my guess is: The entry that's keyed (17671424, 82669344, 20657216) is indeed liable to get a callback from a garbage collection and thus be removed, but that hasn't happened yet. However, in preparing the call self._refcache[h1,h2,h3] the system needs to make a python tuple (h1,h2,h3) to pass to __getitem__ and that triggers the very GC that deletes the entry from bucket and _refcache; hence the key error.

In other words, we need to make absolutely sure that in this lookup loop, no memory allocations whatsoever are done, because any of those could trigger a GC that can corrupt our view of what the bucket should look like.

Looking at the lookup loop at #13387 for this, I think no memory allocations are made there, so if my diagnosis is correct, then merging that ticket would in fact be a proper fix.

I still think we shouldn't be making KeyedRef instances using the eraser of a TripleDict outside the TripleDict itself, though.

Looking at the code Cython generates, I think my diagnosis is a possibility:

 *                         # Test whether the old references are still active
 *                         r1,r2,r3 = <tuple>(self._refcache[h1,h2,h3])             # <<<<<<<<<<<<<<
 *                         if (isinstance(r1,KeyedRef) and r1() is None) or \
 *                            (isinstance(r2,KeyedRef) and r2() is None) or \
 */
...
          __pyx_t_9 = PyTuple_New(3); if (unlikely(!__pyx_t_9))
          __Pyx_GOTREF(__pyx_t_9);
          PyTuple_SET_ITEM(__pyx_t_9, 0, __pyx_t_1);
          __Pyx_GIVEREF(__pyx_t_1);
          PyTuple_SET_ITEM(__pyx_t_9, 1, __pyx_t_5);
          __Pyx_GIVEREF(__pyx_t_5);
          PyTuple_SET_ITEM(__pyx_t_9, 2, __pyx_t_8);
          __Pyx_GIVEREF(__pyx_t_8);
...
          __pyx_t_8 = __Pyx_PyDict_GetItem(((PyObject *)__pyx_v_self->__pyx_base._refcache), ((PyObject *)__pyx_t_9)); if (!__pyx_t_8) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 1079; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
...

As you can see, the lookup code does start with a PyTuple_New that could trigger a GC leading to removal of the key that gets looked up in the last line (where the KeyError would originate).

TL;DR: merge #13387

comment:319 Changed 4 years ago by nbruin

  • Dependencies changed from #715, #11521, #12215, #13746, #13378 to #715, #11521, #12215, #13746, #13378; merge with #13387
  • Status changed from new to needs_review

comment:320 follow-up: Changed 4 years ago by nbruin

Conjectures on how the bug in 317 can occur

I had a bit of trouble coming up with a scenario where a key triple could get deleted during its setting. The reason is that when someone is setting a key, presumably they are holding (strong) references to the key components, so those can't possibly be slated for deletion.

However, the very statement introduced by #11521 changes that:

  • A value in a TripleDict may be stored via a weak reference (this was already the case)
  • It installs a callback on the weakref that removes the keytriple:weakref pair from the dictionary (this IS new).

However, since actual deallocation of this weakreffed value may be delayed, it is quite possible that by the time the callback gets executed, the value stored under the key triple is no longer the original value. Thus, deletion is unwarranted! Therefore, the callback on this weakreffed value should check that the key component is as expected AND that the stored value is still as expected (a dead weakref would do, but checking that it's the exact weakref that we expect there would be even better). Otherwise, we may be erasing entries that have nothing to do with the just-deallocated object!

My guess is that #14084 might have something to do with that. Are there ever situations where the homset cached for two objects gets changed (category refinement comes to mind as one case where this might happen ...)? That could well lead to the scenario sketched above.

In short: trac_11521_callback.patch doesn't seem entirely safe to me, because the callback installed on the value only checks the key for deletion. It doesn't check that the value hasn't changed in the mean time.
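
To illustrate just the missing check, here is a sketch over a plain dict keyed by the id-triple (the real TripleDict walks its buckets instead, so this is not the actual coerce_dict code):

def value_checking_callback(D, ref):
    # `ref` is the dead KeyedRef; ref.key is the id-triple it was registered with
    key = ref.key
    if D.get(key) is ref:
        # the slot still holds this exact weakref, so erasing it is safe
        del D[key]
    # otherwise the value was replaced in the meantime: leave the entry alone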

We still need #13387 anyway, because ANY GC event in the lookup loop can have disastrous consequences which, like any memory corruption, could go unnoticed for a long time and then lead to undefined behaviour later on. The PyTuple_New in the currently generated code means we cannot rule this out.

comment:321 Changed 4 years ago by jdemeyer

  • Status changed from needs_review to positive_review

I'm setting this back to positive review, keeping in mind that it needs to be merged with #13387.

comment:322 in reply to: ↑ 320 Changed 4 years ago by SimonKing

Replying to nbruin:

However, the very statement introduces by #11521 changes that:

  • A value in a TripleDict? may be stored via a weak reference (this was already the case)
  • It installs a callback on the weakref that removes the keytriple:weakref pair from the dictionary (this IS new).

And note that originally I did *not* use that callback. So, I wouldn't mind simply reverting that patch.

In other words: Homsets would be stored in a TripleDict via a weak reference without a callback. Hence, if the Homset gets garbage collected, then we will have a dead weak reference remaining in the TripleDict (the original purpose of the callback was to remove this dead entry). Of course, it is not nice if the dead entry stays in the TripleDict forever. But it is conceivable that the domain and/or codomain of the Homset will become garbage collectable if the Homset does. Hence, the dead entry would disappear anyway, by means of the callbacks from the keys (not from the value).

Hence, I suggest to not use a callback for the weak reference to a Homset.
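
Concretely, the suggestion amounts to replacing the homset.py line quoted in comment 318 by something along these lines (a sketch; the surrounding code stays as it is):

import weakref
# no eraser callback: a dead entry lingers until the callbacks attached to the keys
# X, Y or the category remove the whole triple from the TripleDict
_cache[key] = weakref.ref(H)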

comment:323 Changed 4 years ago by SimonKing

I am trying to apply this patch to 5.8.beta0, but it doesn't work at all. Is it merged in 5.8.beta0?

comment:324 Changed 4 years ago by SimonKing

Aha. Apparently it is. And so are #13370 and #13904.

Changed 4 years ago by SimonKing

Use weak reference without callback for homsets (reverting a part of #11521)

comment:325 Changed 4 years ago by SimonKing

  • Description modified (diff)
  • Status changed from positive_review to needs_work

I have added a patch, so that no callback is used for homsets. The rationale is given in comment:322.

For the patchbot (in an attempt to trick it, because the main patch seems to be in 5.8.beta0...):

Apply trac_12313-revert_callback_from_11521.patch to sage-5.8.beta0

comment:326 Changed 4 years ago by SimonKing

  • Status changed from needs_work to needs_review

comment:327 Changed 4 years ago by SimonKing

OK, the patchbot has the problem that it thinks that #13387 is to be applied first. But in fact, one should first apply the single patch from here, and *then* #13387.

comment:328 Changed 4 years ago by SimonKing

Could it be that #12313 is not in sage-5.8.beta0? The version that I downloaded before the official announcement does contain it.

Anyway, let's try:

Apply trac_12313-mono_dict-combined-random-sk.patch trac_12313-revert_callback_from_11521.patch to sage-5.8.beta0

comment:329 follow-up: Changed 4 years ago by nbruin

  • Status changed from needs_review to positive_review

Why add that patch here when it belongs on #14159? This ticket already has positive review. It's bad form to start changing the code unless absolutely necessary, and that new patch has nothing to do with this ticket. I've uploaded your patch there (and changed the commit message to reflect the appropriate ticket number). I hope that's OK with you. If so, you can just remove the patch here from the list and move the ticket back to "Positive". We really want to get this merged, especially #13387 (which depends on this ticket), because that fixes some issues with TripleDict that could already bite its uses in 5.6 and 5.7.

comment:330 in reply to: ↑ 329 Changed 4 years ago by SimonKing

  • Description modified (diff)

Replying to nbruin:

Why add that patch here when it belongs on #14159? This ticket already has positive review. It's bad form to start changing the code unless absolutely necessary.

I thought it was absolutely necessary, and since it wasn't merged yet, I thought it was OK to do it here.

I've uploaded your patch there (and changed the commit message to reflect the appropriate ticket number). I hope that's OK with you.

If so, you can just remove the ticket here from the list

I can't remove the attachment, but I can remove it from the list of to-be-applied-patches.

and move the ticket back to "Positive".

OK.

Apply trac_12313-mono_dict-combined-random-sk.patch

comment:331 Changed 4 years ago by jdemeyer

This ticket is not merged (at some point it was in some preliminary version of sage-5.8.beta0).

You can check this by

  1. Noting that this ticket has positive_review (not closed) with an empty "merged in" field.
  2. Checking the release notes (either txt or html).
  3. Checking the announcement on sage-release.

comment:332 Changed 4 years ago by jdemeyer

  • Description modified (diff)

comment:333 Changed 4 years ago by jdemeyer

  • Merged in set to sage-5.8.beta2
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:334 Changed 4 years ago by zimmerma

  • Cc zimmerma removed