Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#13370 closed defect (fixed)

Do not cache the result of is_Field externally

Reported by: SimonKing Owned by: rlm
Priority: major Milestone: sage-5.8
Component: memleak Keywords:
Cc: nbruin, jpflori Merged in: sage-5.8.beta0
Authors: Simon King Reviewers: Nils Bruin
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: #11310, #12215, #11521, #12313, #13089, #12995 Stopgaps:

Description

In #11900, a cache was introduced for sage.rings.ring.is_Field in order to speed things up. Unfortunately, that caused the following leak:

sage: p = 1
sage: import gc
sage: gc.collect()
662
sage: for i in range(100):
....:     p = next_prime(p)
....:     R = ZZ.quotient(p)
....:     t = R in Fields()
....:     _ = gc.collect()
....:     print len([X for X in gc.get_objects() if isinstance(X, sage.rings.finite_rings.integer_mod_ring.IntegerModRing_generic)]),
....:     
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
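
The mechanism, schematically (this is not the literal code from #11900, just the pattern that produces such a leak; the cache name is hypothetical): a module-level dict keyed by the ring keeps a strong reference to every ring that was ever tested, so the quotient rings above can never be garbage collected.

from sage.categories.fields import Fields

_is_Field_cache = {}   # hypothetical name; the strong keys are what leaks

def is_Field(R):
    try:
        return _is_Field_cache[R]
    except KeyError:
        result = _is_Field_cache[R] = bool(R in Fields())
        return result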

I think the cleanest solution is:

  • deprecate is_Field
  • instead, use an uncached function _is_Field
  • in the sage tree, replace is_Field(R) by R in Fields()

Rationale:

is_Field(R) (or _is_Field(R)) will change R.category() into a sub-category of Fields(). But then, the next call to "R in Fields()" will be very fast. Hence, there is no reason (speed-wise) to cache is_Field. Actually, R in Fields() should be faster than is_Field(R) anyway, because of the cythonisation.
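
A minimal sketch of the proposed deprecation (assuming the deprecation helper from sage.misc.superseded; the actual patch also rewrites the call sites across the Sage tree):

from sage.misc.superseded import deprecation
from sage.categories.fields import Fields

def is_Field(R):
    deprecation(13370, "use 'R in Fields()' instead of is_Field(R)")
    return R in Fields()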

Attachments (1)

trac13370_deprecate_is_field.patch (22.2 KB) - added by SimonKing 7 years ago.
Deprecate is_Field(R) in favour of R in Fields().


Change History (58)

comment:1 Changed 7 years ago by SimonKing

Timings

I made some benchmarks, based on most of the tests from #11900 plus one new test. This is on my laptop, hence the timings differ from the ones that I stated on #11900.

sage-5.2.rc0 plus the memleak patches #12969, #715, #12215, #11521, #12313

sage: E = J0(46).endomorphism_ring()
sage: %time g = E.gens()
CPU times: user 7.60 s, sys: 0.24 s, total: 7.85 s
Wall time: 8.00 s
sage: %time L = EllipticCurve('960d1').prove_BSD()
CPU times: user 6.03 s, sys: 0.21 s, total: 6.24 s
Wall time: 7.83 s
sage: def test(E):
....:     for p in prime_range(10000):
....:         if p != 389:
....:             G = E.change_ring(GF(p)).abelian_group()
....:
sage: E = EllipticCurve('389a')
sage: %time test(E)
CPU times: user 31.59 s, sys: 0.21 s, total: 31.79 s
Wall time: 31.85 s
sage: %time for E in cremona_curves([11..100]): S = E.integral_points(both_signs=False)
CPU times: user 19.15 s, sys: 0.43 s, total: 19.58 s
Wall time: 20.13 s
sage: W.<z> = CyclotomicField(13)
sage: %time M = Matrix(W, 2, 3, [10^30*(1-z)^13, 1, 2, 3, 4, z]).echelon_form()
CPU times: user 3.81 s, sys: 0.01 s, total: 3.82 s
Wall time: 3.82 s

Here is a new test. First, with "cold" cache:

sage: L = [ZZ.quotient(p) for p in prime_range(10000)]
sage: %time X = [R in Fields() for R in L]
CPU times: user 0.42 s, sys: 0.01 s, total: 0.43 s
Wall time: 0.44 s

And now with a "warm" cache:

sage: %timeit X = [R in Fields() for R in L]
625 loops, best of 3: 1.37 ms per loop

With the patch from this ticket added, the same tests give:

sage: E = J0(46).endomorphism_ring()
sage: %time g = E.gens()
CPU times: user 7.46 s, sys: 0.33 s, total: 7.79 s
Wall time: 7.80 s
sage: %time L = EllipticCurve('960d1').prove_BSD()
CPU times: user 5.82 s, sys: 0.15 s, total: 5.97 s
Wall time: 6.12 s
sage: def test(E):
....:     for p in prime_range(10000):
....:         if p != 389:
....:             G = E.change_ring(GF(p)).abelian_group()
....:
sage: E = EllipticCurve('389a')
sage: %time test(E)
CPU times: user 31.43 s, sys: 0.10 s, total: 31.52 s
Wall time: 31.58 s
sage: %time for E in cremona_curves([11..100]): S = E.integral_points(both_signs=False)
CPU times: user 19.04 s, sys: 0.17 s, total: 19.22 s
Wall time: 19.26 s
sage: W.<z> = CyclotomicField(13)
sage: %time M = Matrix(W, 2, 3, [10^30*(1-z)^13, 1, 2, 3, 4, z]).echelon_form()
CPU times: user 2.88 s, sys: 0.00 s, total: 2.89 s
Wall time: 2.90 s
sage: L = [ZZ.quotient(p) for p in prime_range(10000)]
sage: %time X = [R in Fields() for R in L]
CPU times: user 0.40 s, sys: 0.00 s, total: 0.41 s
Wall time: 0.41 s
sage: %timeit X = [R in Fields() for R in L]
625 loops, best of 3: 1.29 ms per loop

Hence, no test became slower, but some became clearly faster - in particular the one with the echelon form computation. I think an echelon form computation was where Nils noticed the memory leak.

Fixing a memory leak

I just noticed that I forgot to add the following test to my patch. But anyway, it shows that a memory leak which had not been fixed before is fixed now:

sage: import gc
sage: _ = gc.collect()
sage: len([X for X in gc.get_objects() if isinstance(X, sage.rings.finite_rings.integer_mod_ring.IntegerModRing_generic)])
1
sage: for i in prime_range(100):
....:     R = ZZ.quotient(i)
....:     t = R in Fields()
....:     
sage: _ = gc.collect()
sage: len([X for X in gc.get_objects() if isinstance(X, sage.rings.finite_rings.integer_mod_ring.IntegerModRing_generic)])
2

However, not all is good. There is still another leak left, quite likely due to coercion.

sage: for i in prime_range(100):
....:     R = ZZ.quotient(i)
....:     t = 1 + R.one()
....:     
sage: _ = gc.collect()
sage: len([X for X in gc.get_objects() if isinstance(X, sage.rings.finite_rings.integer_mod_ring.IntegerModRing_generic)])
26

My suggestion is to deal with the remaining leak on a different ticket. But first I need to add one test to my patch...

comment:2 Changed 7 years ago by SimonKing

  • Authors set to Simon King
  • Status changed from new to needs_review

The doctest is added.

I did not run the full test suite yet, but I'll already switch this to "needs review".

comment:3 Changed 7 years ago by SimonKing

Oops, I think that the new doctest will only work with the other memleak fixes. If this turns out to be correct, I will need to update the dependencies.

comment:4 Changed 7 years ago by SimonKing

I don't know what the patchbot is trying to do. According to the logs, it looks like sage won't even start. But it does for me! Admittedly, I work with a prerelease of 5.2 (not 5.3).

comment:5 Changed 7 years ago by SimonKing

  • Dependencies set to #11310, #715, #12215, #11521, #12313

The patch would not apply to sage-5.3.beta2, because of #11310. Since I worked on top of the other memleak fixes, I name them as dependencies as well.

comment:6 Changed 7 years ago by SimonKing

  • Status changed from needs_review to needs_work
  • Work issues set to Cope with import changes in sage/schemes

Too bad. The patchbot is right: the current patch won't work with sage-5.3.beta2; there is an import error at startup. It is in sage/schemes --- again.

comment:7 Changed 7 years ago by SimonKing

  • Dependencies changed from #11310, #715, #12215, #11521, #12313 to #11310, #715, #12215, #11521, #12313, #13089

The reason is #13089.

comment:8 Changed 7 years ago by SimonKing

  • Status changed from needs_work to needs_review
  • Work issues Cope with import changes in sage/schemes deleted

The patch is rebased with respect to #13089. With the stated dependencies on top of sage-5.3.beta2, sage does start. So, I hope that the patchbot can work fine...

comment:9 Changed 7 years ago by SimonKing

  • Dependencies changed from #11310, #715, #12215, #11521, #12313, #13089 to #11310, #12215, #11521, #12313, #13089

#715 and #11521 are mutually dependent, but #715 needs to be applied first. Consequently, the patchbot got confused when I stated both as dependencies here. Hope it works now...

comment:10 Changed 7 years ago by SimonKing

The patchbot reports

sage -t  -force_lib devel/sage-13370/sage/rings/padics/padic_base_leaves.py # 1 doctests failed
sage -t  -force_lib devel/sage-13370/sage/rings/polynomial/infinite_polynomial_element.py # Killed/crashed

I do not get that crash, but I do get the error in padic_base_leaves.py. The strange thing is:

sage: K = Qp(17)
sage: S = Zp(17,40)
sage: K.has_coerce_map_from(S)
True

but in the doctest the last line returns "False".

In other words, there is yet again a case where the absence of a coerce map is cached when this is wrong. I really thought I had fixed that problem in #12969!!

comment:11 Changed 7 years ago by SimonKing

Arrgh, it is even worse!

When I take the complete failing test, namely

            sage: K = Qp(17)
            sage: K(1) + 1 #indirect doctest
            2 + O(17^20)                                                                                                                                                                                  
            sage: K.has_coerce_map_from(ZZ)                                                                                                                                                               
            True                                                                                                                                                                                          
            sage: K.has_coerce_map_from(int)
            True
            sage: K.has_coerce_map_from(QQ)
            True
            sage: K.has_coerce_map_from(RR)
            False
            sage: K.has_coerce_map_from(Qp(7))
            False
            sage: K.has_coerce_map_from(Qp(17,40))
            False
            sage: K.has_coerce_map_from(Qp(17,10))
            True
            sage: K.has_coerce_map_from(Zp(17,40))
            True

then every answer is as expected. But when I run the same thing as part of the test suite of sage.rings.padics.padic_base_leaves, the last line returns "False".

Hence, there is a side effect from another test. And keep in mind that this side effect results in wrongly caching the absence of a coercion, even though the patch only changes the syntax for detecting fields.

comment:12 Changed 7 years ago by SimonKing

It is worse than worse: According to the patchbot, the problem in padic_base_leaves also occurs with #12313, i.e., before applying #13370. But I do not get that error!

Hence: The patchbot finds a wrong coercion cache with #12313, which I only find with #13370. And the patchbot finds a segfault with #12313, which I do not find at all.

comment:13 Changed 7 years ago by SimonKing

I inserted some lines into sage.structure.parent that print type(S) and id(S) whenever the absence of a coercion gets cached. And I inserted a line into the failing test that prints type and id of the object that fails.

It turns out that by the test

sage: K.has_coerce_map_from(Qp(17,40))

in line 478 of sage/rings/padics/padic_base_leaves.py, the absence of a coercion gets cached for an object of type sage.rings.padics.padic_base_leaves.pAdicFieldCappedRelative_with_category at address 79378272. However, a few lines below, when the test fails, the object at address 79378272 is of type sage.rings.padics.padic_base_leaves.pAdicRingCappedRelative_with_category (a Ring, not a Field!!).

Hence, it seems that the following happens: The absence of a coercion from Qp(17,40) to K is correctly cached. But because of all my weakref patches, Qp(17,40) gets garbage collected. For an unknown reason, the corresponding item in the coercion cache does not get removed, and by coincidence the old address of Qp(17,40) is now used for Zp(17,40). But since the coercion cache is now operating with the addresses of keys, it gets the wrong item. That would explain why the error is so difficult to reproduce.

If this is true, then we have to understand why the callback function of the weak reference to Qp(17,40) is not called as soon as Qp(17,40) gets garbage collected.
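
A plain-Python illustration of the suspected mechanism (Python 2, matching the sessions above; names are made up). It only shows that id() values can be recycled after a cyclic collection, which is exactly what an address-keyed cache must guard against:

import gc, weakref

class Node(object):
    pass

a = Node(); a.cycle = a              # cyclic reference: only the GC can reclaim a
addr = id(a)
cache = {addr: weakref.ref(a)}       # keyed by address, like the coercion cache
del a
gc.collect()                         # a is collected; CPython may now recycle addr
b = Node()                           # possibly allocated at the old address
if id(b) == addr and cache[addr]() is None:
    print "stale entry: address recycled while the weakref is dead"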

comment:14 Changed 7 years ago by SimonKing

My hope was that the new fixes in #715 and #12313 ("let the __delitem__ methods not only remove stuff from the buckets but also from the reference cache") would also fix the problem here. But unfortunately that isn't the case. And since the patchbot is currently not usable, I cannot tell whether the meta-problem ("The patchbot finds the error already with #12313 but I don't, and the patchbot finds an additional error with #13370 but I don't.") persists.

comment:15 Changed 7 years ago by nbruin

I'm not getting any problems on 5.3b2 with the patches suggested here applied. See also http://trac.sagemath.org/sage_trac/ticket/12313#comment:145 for a test where I try to recreate the conditions suggested above. I'm not observing the erroneous behaviour. So the problem must be a little more subtle. The proper removal from the coercion cache does occur sometimes (always as far as I can see).

Hopefully this data point helps.

comment:16 follow-up: Changed 7 years ago by SimonKing

I was modifying the failing test by inserting some lines that show at what memory location stuff is stored, and what is cached when. So, the test is:

            sage: K = Qp(17)
            sage: K(1) + 1 #indirect doctest
            2 + O(17^20)
            sage: K.has_coerce_map_from(ZZ)
            True
            sage: K.has_coerce_map_from(int)
            True
            sage: K.has_coerce_map_from(QQ)
            True
            sage: K.has_coerce_map_from(RR)
            False
            sage: K.has_coerce_map_from(Qp(7))
            False
            sage: id(Qp(17,40))
            sage: K.has_coerce_map_from(Qp(17,40))
            False
            sage: K.has_coerce_map_from(Qp(17,10))
            True
            sage: S = Zp(17,40); id(S)
            sage: K.is_coercion_cached(S)
            sage: K.has_coerce_map_from(S)
            True
            sage: import gc
            sage: _ = gc.collect()
            sage: K.is_coercion_cached(S)
            sage: K.has_coerce_map_from(S)

and this results in

sage -t  "devel/sage/sage/rings/padics/padic_base_leaves.py"
**********************************************************************
File "/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/devel/sage/sage/rings/padics/padic_base_leaves.py", line 478:
    sage: id(Qp(17,40))
Expected nothing
Got:
    79371136
**********************************************************************
File "/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/devel/sage/sage/rings/padics/padic_base_leaves.py", line 483:
    sage: S = Zp(17,40); id(S)
Expected nothing
Got:
    79371136
**********************************************************************
File "/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/devel/sage/sage/rings/padics/padic_base_leaves.py", line 484:
    sage: K.is_coercion_cached(S)
Expected nothing
Got:
    True
**********************************************************************
File "/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/devel/sage/sage/rings/padics/padic_base_leaves.py", line 485:
    sage: K.has_coerce_map_from(S)
Expected:
    True
Got:
    False
**********************************************************************
File "/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/devel/sage/sage/rings/padics/padic_base_leaves.py", line 489:
    sage: K.is_coercion_cached(S)
Expected nothing
Got:
    True
**********************************************************************
File "/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/devel/sage/sage/rings/padics/padic_base_leaves.py", line 490:
    sage: K.has_coerce_map_from(S)
Expected nothing
Got:
    False

Hence:

  • In line 483, apparently the garbage collection of Qp(17,40) has already occurred, for otherwise Zp(17,40) could not have been created at the same address.
  • In line 484, apparently the entry for Qp(17,40) has still not been removed. So, for whatever reason, the callback function for the weak reference to Qp(17,40) is not called. Why?

My hope was that by the garbage collection, the callback function of the weak reference to Qp(17,40) would finally be called. But apparently that is not the case. It turns out that doing gc.collect() before defining S does not change the outcome.

comment:17 Changed 7 years ago by SimonKing

A potential solution comes to mind: in __contains__ and also in __getitem__, one could first check that the memory address of the key matches, and then check that the stored weak references are still alive. If they aren't, then the key is a new object at an old address. But that would belong to #715 and #11521.
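
A sketch of that validation (hypothetical names and storage layout; the real lookup would live in the MonoDict/TripleDict code in sage/structure/coerce_dict.pyx):

def __getitem__(self, k):
    addr = id(k)
    ref, value = self._data[addr]    # hypothetical layout: address -> (weakref, value)
    if ref() is not k:               # weakref dead, or the address was recycled
        del self._data[addr]         # purge the stale entry
        raise KeyError(k)
    return value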

comment:18 in reply to: ↑ 16 ; follow-up: Changed 7 years ago by nbruin

Of potential relevance is:

http://bugs.python.org/issue1055820

Python takes great care in dealing with weakrefs and cyclic garbage collection. See in particular the explanation and the code. From what I understand, Python's GC does not always call callbacks on weakrefs that are in the cyclic trash themselves, but it does promise to do callbacks on weakrefs to objects that are collected from cyclic trash. It does so by *first* clearing all the trash, saving the callbacks, and then making the callbacks.

I guess that means that callbacks might see a slightly warped world, where other weakreffed locations may already have been vacated but the callbacks on those haven't run yet. Our callbacks are so simple that I don't see how that could affect them, though.

We're storing our callbacks in a dict right on the MonoDict, so the weakrefs are never garbage if we still care about the MonoDict. If you're not getting callbacks then it seems that Python is breaking its promise. It might be worth investigating, because if Python is really at fault, we don't need a workaround but a fix to Python!
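
For reference, the promised behaviour is easy to check in plain Python (a minimal experiment, not the Sage code): when an object in a cycle is collected, a weakref stored outside the trash does get its callback.

import gc, weakref

class C(object):
    pass

def cb(ref):
    print "callback ran"

x = C(); x.cycle = x       # collectable only by the cyclic GC
r = weakref.ref(x, cb)     # the weakref itself lives outside the trash cycle
del x
gc.collect()               # prints "callback ran"
print r() is None          # True: the referent is gone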

comment:19 in reply to: ↑ 18 Changed 7 years ago by SimonKing

Hi Nils,

Thank you for the pointer!

Replying to nbruin:

It does so by *first* clearing all the trash, saving the callbacks, and then making the callbacks.

I don't know if waiting for a fix to Python is a good idea. But what you say confirms that the fix that I suggested makes sense: Python first clears all trash; in that moment, the weak references are invalid (i.e., if you call them, None is returned), but the callback function has not necessarily been called yet.

Hence, the existence of the callback does not guarantee that we have no dead weak references in store. That's to say: We must always validate the weak references. I just hope that the slow-down is not too much.

Anyway. I have already verified that the error in padic_base_leaves.py vanishes with the new patches from #715 and #12313, and am now running make ptest for #13370 (and will also kick the patchbot).

comment:20 follow-up: Changed 7 years ago by jpflori

Nice find; this may indeed (and hopefully) be the root of all the random failures we got previously (which, quite surprisingly, also involved Zp and Qp; maybe there is something else fishy with these sets).

I'll try testing and timing the patches ASAP. Are any other major modifications planned? E.g. some more intrusive changes to fix the memleak pointed out by Nils in #12313? Or do you plan on "saving" that for later?

comment:21 in reply to: ↑ 20 ; follow-up: Changed 7 years ago by SimonKing

Replying to jpflori:

Are any other major modifications planned? e.g. some more intrusive changes to fix the memleak pointed out by Nils in #12313? Or do you plan on "saving" that for later?

I think that progress is only possible if at some point one calls it a day. That's to say, if the patches fix some memory leaks and create no evident problem, then the existence of other not-yet-fixed memory leaks should not prevent them from being merged.

comment:22 in reply to: ↑ 21 Changed 7 years ago by jpflori

Replying to SimonKing:

Replying to jpflori:

Are any other major modifications planned? e.g. some more intrusive changes to fix the memleak pointed out by Nils in #12313? Or do you plan on "saving" that for later?

I think that progress is only possible if at some point one calls it a day. That's to say, if the patches fix some memory leaks and create no evident problem, then the existence of other not-yet-fixed memory leaks should not prevent them from being merged.

Agreed, I'll begin putting everything in place for testing with the current patches.

comment:23 Changed 7 years ago by SimonKing

Hooray, finally the patchbot and I agree! I get the same segfault in infinite_polynomial_element (embarrassingly, that module is my own code...). And if I only apply patches up to #12313, the test works fine.

Hence, I have to understand how the new syntax and/or avoiding an additional strong reference to a field after testing that it is a field can result in a segfault.

comment:24 Changed 7 years ago by SimonKing

But alas, the segfault is not reproducible. I got it with make ptest. But when I now run the test individually, about 30 times, everything goes fine (and 0.3 seconds faster than without the patch). That's frustrating.

comment:25 follow-up: Changed 7 years ago by SimonKing

According to the log, the mysterious segfault has something to do with actions:

sage -t  -force_lib devel/sage-13370/sage/rings/polynomial/infinite_polynomial_element.py
/opt/patchbot-5.3.beta2/local/lib/libcsage.so(print_backtrace+0x3b)[0xb6e80c49]
/opt/patchbot-5.3.beta2/local/lib/libcsage.so(sigdie+0x17)[0xb6e80c89]
/opt/patchbot-5.3.beta2/local/lib/libcsage.so(sage_signal_handler+0x212)[0xb6e8078c]
[0xb7704400]
/opt/patchbot-5.3.beta2/local/lib/libpython2.7.so.1.0(+0x29836)[0xb757e836]
/opt/patchbot-5.3.beta2/local/lib/libpython2.7.so.1.0(+0x4685a)[0xb759b85a]
/opt/patchbot-5.3.beta2/local/lib/libpython2.7.so.1.0(PyObject_Call+0x64)[0xb7583094]
/opt/patchbot-5.3.beta2/local/lib/python/site-packages/sage/categories/action.so(+0x9ef8)[0xb6125ef8]

Aha! Right now I got a bit of insight! Namely, I attempted parallel tests of sage/rings/polynomial/, and found a warning printed:

Exception NotImplementedError: NotImplementedError() in 'sage.rings.ring._is_Field' ignored

So, that's a problem. Since _is_Field is a cdef function, it won't propagate a NotImplementedError. I will now test whether catching the error and returning "False" when it occurs helps to solve the problem.
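
Roughly what that attempt looks like (a Python-level sketch, not the actual patch; in the Cython source one would rather declare an error value, e.g. "cdef bint _is_Field(R) except -2", so that the exception can propagate at all):

def _is_Field(R):
    from sage.categories.fields import Fields
    try:
        return R in Fields()
    except NotImplementedError:
        return False    # swallow the error instead of letting Cython ignore it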

comment:26 in reply to: ↑ 25 Changed 7 years ago by SimonKing

Replying to SimonKing:

So, that's a problem. Since _is_Field is a cdef function, it won't propagate a NotImplementedError. I will now test whether catching the error and returning "False" when it occurs helps to solve the problem.

It doesn't solve the problem. Instead, when running the tests of sage/rings/polynomial in parallel in four threads, I again get the segfault in infinite_polynomial_element.

So, the segfault depends on whether we run one or four doctest processes at a time. Nasty.

comment:27 Changed 7 years ago by SimonKing

With the new patch versions at #12313 and friends, the patchbot still reports

sage -t  -force_lib devel/sage-13370/sage/rings/polynomial/polynomial_real_mpfr_dense.pyx
The doctested process was killed by signal 11

Signal 11 is a segmentation fault.

However, I do not get that error. Whether I run it individually or as part of "sage -tp 4", it works fine! Or at least it does together with the patches from #13400.

comment:28 follow-up: Changed 7 years ago by SimonKing

With sage-5.3.beta2 plus #715, #11521, #12215, #12313 and #13370, make ptest works fine for me. So, once again, we have the problem that the patchbot reports a segfault that I can't reproduce.

comment:29 in reply to: ↑ 28 ; follow-up: Changed 7 years ago by nbruin

This is a pretty desperate suggestion, but given our earlier experience, the segfault reported by the bot is probably due to an actual bug that's hard to trigger. Getting the verbose traceback on the computer that runs the test would probably give the best lead, but if that's not possible:

Run the doctest under valgrind. I've tried that myself, but valgrind is giving me a SIGILL:

/usr/local/sage/5.3b2/local/lib/libcsage.so(print_backtrace+0x31)[0x1220a0a7]
/usr/local/sage/5.3b2/local/lib/libcsage.so(sigdie+0x14)[0x1220a0d9]
/usr/local/sage/5.3b2/local/lib/libcsage.so(sage_signal_handler+0x1ca)[0x12209c6a]
/lib64/libpthread.so.0(+0xf500)[0x5218500]
/usr/local/sage/5.3b2/local/lib/libmpfr.so.4(mpfr_set_d+0x14)[0x1b9d37c4]
/usr/local/sage/5.3b2/local/lib/python2.7/site-packages/sage/rings/real_mpfr.so(+0x24aa2)[0x1b76daa2]
/usr/local/sage/5.3b2/local/lib/libpython2.7.so.1.0(+0xa7d88)[0x4ed8d88]
/usr/local/sage/5.3b2/local/lib/libpython2.7.so.1.0(PyObject_Call+0x53)[0x4e7c3c3]
/usr/local/sage/5.3b2/local/lib/python2.7/site-packages/sage/rings/real_mpfr.so(+0x13a3a)[0x1b75ca3a]
/usr/local/sage/5.3b2/local/lib/python2.7/site-packages/sage/rings/real_mpfr.so(initreal_mpfr+0x4057)[0x1b788667]
...
------------------------------------------------------------------------
Unhandled SIGILL: An illegal instruction occurred in Sage.
This probably occurred because a *compiled* component of Sage has a bug
in it and is not properly wrapped with sig_on(), sig_off(). You might
want to run Sage under gdb with 'sage -gdb' to debug this.
Sage will now terminate.
------------------------------------------------------------------------
/usr/local/sage/5.3b2/local/bin/sage-valgrind: line 24: 10490 Illegal instruction     (core dumped) valgrind --tool=memcheck $MEMCHECK_FLAGS python -i

Valgrind is reporting in its log file:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x2E 0xC0 0xC5 0xFB
==10490== valgrind: Unrecognised instruction at address 0x1b9d37c4.
==10490== Your program just tried to execute an instruction that Valgrind
==10490== did not recognise.  There are two possible reasons for this.
==10490== 1. Your program has a bug and erroneously jumped to a non-code
==10490==    location.  If you are running Memcheck and you just saw a
==10490==    warning about a bad jump, it's probably your program's fault.
==10490== 2. The instruction is legitimate but Valgrind doesn't handle it,
==10490==    i.e. it's Valgrind's fault.  If you think this is the case or
==10490==    you are not sure, please let us know and we'll try to fix it.
==10490== Either way, Valgrind will now raise a SIGILL signal which will
==10490== probably kill your program.

That might just be that there's something funny with my valgrind. I'm not going to fix that, because valgrind output doesn't mean anything to me anyway. However, I do think it's peculiar that the traceback leads so directly to mpfr! That's just running sage -valgrind, by the way.

The main idea is that our first guess would be that the spurious segfaults are due to memory corruption, and that the cause of this corruption is probably a systematic misuse of memory somewhere. So even on a system where the segfault doesn't happen, the memory misuse should still happen, and hopefully valgrind would give an indication where.

comment:30 in reply to: ↑ 29 Changed 7 years ago by SimonKing

Replying to nbruin:

This is a pretty desperate suggestion, but given our earlier experience, the segfault reported by the bot is probably due to an actual bug that's hard to trigger. Getting the verbose traceback on the computer that runs the test would probably give the best lead, but if that's not possible:

Run the doctest under valgrind.

How to do so? When I was young, one needed to install a valgrind spkg. Is that still needed?

I've tried that myself, but valgrind is giving me a SIGILL:

Are you saying: even if I do not get a segfault (these sporadic segfaults are haunting me here and on several other tickets), valgrind might still point me to the problem?

I'd certainly try! But I need a pointer. sage -valgrind doesn't work for me:

king@mpc622:/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/devel/sage$ ../../sage -valgrind
----------------------------------------------------------------------
| Sage Version 5.3.beta2, Release Date: 2012-08-14                   |
| Type "notebook()" for the browser-based notebook interface.        |
| Type "help()" for help.                                            |
----------------------------------------------------------------------
**********************************************************************
*                                                                    *
* Warning: this is a prerelease version, and it may be unstable.     *
*                                                                    *
**********************************************************************
/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/bin/sage-ipython
Log file is /mnt/local/king/.sage/valgrind/sage-memcheck.%p
Using default flags:
--leak-resolution=high --log-file=/mnt/local/king/.sage/valgrind/sage-memcheck.%p --leak-check=full --num-callers=25 --suppressions=/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/valgrind/sage.supp
king@mpc622:/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/devel/sage$ 

Hence, I don't get to the Sage prompt; it quits before starting.

Aha! I just found that valgrind wrote into some file:

==23108== Memcheck, a memory error detector
==23108== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==23108== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==23108== Command: python -i
==23108== Parent PID: 22997
==23108== 
==23108== FATAL: can't open suppressions file "/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/local/lib/valgrind/sage.supp"

So, some file is missing.

comment:31 Changed 7 years ago by SimonKing

See #11918 for the valgrind issue with sage.supp

comment:32 Changed 7 years ago by jpflori

There still is some valgrind spkg, but it won't compile on any reasonably recent distro. There is some update in #7766, which will compile, but will not support AVX instructions yet. I guess the AVX instruction problem might be the reason for the SIGILL Nils gets (do you have a Core i7?). For that, you'll need valgrind 3.8.0, which was released a few weeks ago (and is not distributed in most distros yet, I think). You can just replace the src dir in the spkg with the new tarball.

If you used your system valgrind, it could be a reason why the suppression file is missing as well. #11918 is surely about that.

comment:33 follow-up: Changed 7 years ago by jpflori

Hum, I did not mean #7766, which points to 3.7.0, but #13060 which points to something in between 3.7.0 and 3.8.0.

comment:34 in reply to: ↑ 33 Changed 7 years ago by SimonKing

Thanks for the pointer, Jean-Pierre!

Replying to jpflori:

Hum, I did not mean #7766, which points to 3.7.0, but #13060 which points to something in between 3.7.0 and 3.8.0.

OK, I took the 3.7.0 spkg from #13060, replaced the src/ folder by the 3.8.0 tarball, and installed it. If I recall correctly, one needs sage -ba in order to make valgrind work right. Is that correct? Well, I just ran it, even if it may not be needed...

comment:35 Changed 7 years ago by SimonKing

I valgrind-tested sage/rings/polynomial/infinite_polynomial_ring, which used to segfault on Volker's patchbot with #715, #11521, #12215, #12313, #13370, #13378, #13412 and #12876. However, everything seems to work fine. Next, I'll test the real_mpfr problem.

comment:36 Changed 7 years ago by SimonKing

With the patches mentioned in my previous post and with a valgrind-3.8.0 spkg (replacing the src of Jean-Pierre's 3.7.0 spkg with the new sources), valgrinding polynomial_real_mpfr_dense.pyx seems to work fine:

...
==9230== LEAK SUMMARY:
==9230==    definitely lost: 441 bytes in 8 blocks
==9230==    indirectly lost: 320 bytes in 15 blocks
==9230==      possibly lost: 4,950,895 bytes in 5,597 blocks
==9230==    still reachable: 80,189,206 bytes in 41,378 blocks
==9230==         suppressed: 0 bytes in 0 blocks
==9230== Reachable blocks (those to which a pointer was found) are not shown.
==9230== To see them, rerun with: --leak-check=full --show-reachable=yes
==9230== 
==9230== For counts of detected and suppressed errors, rerun with: -v
==9230== Use --track-origins=yes to see where uninitialised values come from
==9230== ERROR SUMMARY: 12833 errors from 1495 contexts (suppressed: 1008 from 9)

In particular, there is no SIGILL.

comment:37 Changed 7 years ago by jpflori

I don't think sage -ba or even sage -b is needed, but it certainly does not hurt.

comment:38 follow-up: Changed 7 years ago by jpflori

Did you look for "Invalid free()" in the file?

comment:39 in reply to: ↑ 38 Changed 7 years ago by SimonKing

Replying to jpflori:

Did you look for "Invalid free()" in the file ?

I just did. Even "invalid " is not in the file.

comment:40 follow-up: Changed 7 years ago by jpflori

Mmmmm, not having any "Invalid" is strange. IIRC I got a bunch of invalid reads or writes even on vanilla Sage. And anyway, GMP/MPIR usually generates some which are perfectly legal but trouble Valgrind.

comment:41 in reply to: ↑ 40 Changed 7 years ago by SimonKing

Replying to jpflori:

Mmmmm, not having any "Invalid" is strange. IIRC I got a bunch of invalid reads or writes even on a vanilla Sage. And anyway GMP/MPIR usually generates some which are perfectly legal but trouble Valgrind.

OK, you are right. I repeated the whole thing and took care that everything matches the patchbot's setup.

I have sage-5.3.beta2 and

king@mpc622:/mnt/local/king/SAGE/prereleases/sage-5.3.beta2/devel/sage$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
trac12215_weak_cached_function-sk.patch
trac12215_segfault_fixes.patch
trac_12313-mono_dict-combined-random-sk.patch
trac_12313_quit_sage.patch
trac13370_deprecate_is_field.patch

(the same as the patchbot).

Then I did

king@mpc622:/mnt/local/king/SAGE/prereleases/sage-5.3.beta2$ ./sage -t -valgrind devel/sage/sage/rings/polynomial/polynomial_real_mpfr_dense.pyx
sage -t -valgrind "devel/sage/sage/rings/polynomial/polynomial_real_mpfr_dense.pyx"
         [53.0 s]

----------------------------------------------------------------------
All tests passed!
Total time for all tests: 53.1 seconds
king@mpc622:/mnt/local/king/SAGE/prereleases/sage-5.3.beta2$ ls -ltr /mnt/local/king/.sage/valgrind/
insgesamt 50760
-rw------- 1 king malg      386  3. Mai 15:58 sage-memcheck.5510
-rw------- 1 king malg  3788215  4. Mai 23:44 sage-memcheck.1426
-rw------- 1 king malg      398 30. Aug 22:45 sage-memcheck.23108
-rw------- 1 king malg  8770111 30. Aug 22:57 sage-memcheck.30735
-rw------- 1 king malg 10453868 30. Aug 23:00 sage-memcheck.18736
-rw------- 1 king malg  8924220 30. Aug 23:13 sage-memcheck.trac12876
-rw------- 1 king malg  3102893 31. Aug 09:56 sage-memcheck.11856
-rw------- 1 king malg  7761738 31. Aug 10:42 sage-memcheck.9230
-rw------- 1 king malg  9069029 31. Aug 11:49 sage-memcheck.12255

And I posted sage-memcheck.12255. It does contain invalid reads, but no invalid free.

comment:42 Changed 7 years ago by jpflori

  • Cc jpflori added

comment:43 Changed 7 years ago by SimonKing

  • Dependencies changed from #11310, #12215, #11521, #12313, #13089 to #11310, #12215, #11521, #12313, #13089, #13145

At #13145, refcounting for certain rings is fixed. I wonder whether this is related to the segfault that can be observed on some computers.

I cannot reproduce the segfault on Volker's patchbot. So, I'd like to see whether adding #13145 suffices to make the patchbot happy.

comment:44 Changed 7 years ago by SimonKing

Hooray!! With #13145, the segfault is gone on the two patchbots that had reported it previously! The remaining doctest failure on one of the two machines seems unrelated (it cannot find some HTML doc file for symbolic expressions). Hence, I am confident that #13145 was at least a step in the right direction.

comment:45 Changed 7 years ago by SimonKing

I am not sure whether #13145 really fixed the problem, or whether it was just an accident (these segfaults seem awfully flaky).

Anyway, I have updated the main patch at #715, avoiding a potential name conflict of attributes of Action and RingHomomorphism - just in case...

Kicking the patchbot now...

comment:46 Changed 7 years ago by SimonKing

  • Dependencies changed from #11310, #12215, #11521, #12313, #13089, #13145 to #11310, #12215, #11521, #12313, #13089

No segfault this time. I am retrying without #13145.

comment:47 Changed 7 years ago by SimonKing

Is there someone who'd review this?

comment:48 Changed 7 years ago by SimonKing

  • Dependencies changed from #11310, #12215, #11521, #12313, #13089 to #11310, #12215, #11521, #12313, #13089, #12995
  • Status changed from needs_review to needs_work

Needs to be rebased on top of sage-5.7.beta2

comment:49 Changed 7 years ago by SimonKing

  • Status changed from needs_work to needs_review

Patch is updated!

comment:50 Changed 7 years ago by nbruin

I'm all in favour of the idea of the patch. However, the bot seems to indicate there's still some work to do.

comment:51 Changed 7 years ago by SimonKing

Right. Polyhedron seems to be new, so this rather old patch did not take care of it.

Changed 7 years ago by SimonKing

Deprecate is_Field(R) in favour of R in Fields().

comment:52 Changed 7 years ago by SimonKing

I have updated the patch, and I hope it'll make the patchbot happy. At least, all tests in sage/geometry pass. Needs review!

comment:53 Changed 7 years ago by nbruin

  • Status changed from needs_review to positive_review

Cool stuff! Works for me, so positive review. Note that in comment:1 you also noted the issue that is now the topic of #14058.

comment:54 Changed 7 years ago by jdemeyer

  • Milestone changed from sage-5.7 to sage-5.8
  • Reviewers set to Nils Bruin

comment:55 Changed 7 years ago by nbruin

Something that struck me. With the current trick,

sage: R=ZZ.quo(7)
sage: %timeit R in Fields()

is going to be pretty efficient, because the category of R reflects that it's a field after the first test. However,

sage: R=ZZ.quo(15)
sage: %timeit R in Fields()

probably isn't efficient (but it probably wasn't before either), because there's no way to record that it's NOT a field in its category.

I recognize that frequently asking whether a field is indeed a field will happen a lot, whereas repeatedly confirming that something is NOT a field will probably be considerably less common, so perhaps it's not a problem. It does show that this solution accomplishes something quite different from just making is_Field efficient without an external cache.
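
To make the asymmetry visible (an indicative session; is_subcategory merely exposes the category refinement that "R in Fields()" performs):

sage: R = ZZ.quo(7)
sage: R in Fields()                             # first call refines R.category()
True
sage: R.category().is_subcategory(Fields())     # the positive answer is now recorded
True
sage: S = ZZ.quo(15)
sage: S in Fields()                             # no refinement possible here
False
sage: S.category().is_subcategory(Fields())     # the negative answer never is recorded
False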

comment:56 Changed 7 years ago by jdemeyer

  • Merged in set to sage-5.8.beta0
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:57 Changed 7 years ago by jdemeyer

Please see #14158 for a blocker follow-up.
