Opened 4 years ago
Closed 4 years ago
#24986 closed defect (fixed)
Inconsistent mpz_t state after interrupted sig_realloc()
Reported by: | embray | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | sage-8.7 |
Component: | cython | Keywords: | cygwin |
Cc: | jdemeyer, caruso | Merged in: | |
Authors: | Jeroen Demeyer | Reviewers: | Erik Bray |
Report Upstream: | N/A | Work issues: | |
Branch: | 521bac9 (Commits, GitHub, GitLab) | Commit: | 521bac96289ae33b5ecf337f1f36a7fd29d7bc8a |
Dependencies: | #27073 | Stopgaps: |
Description (last modified by )
TODO
☐ | Use sig_occurred() to check whether an exception from Cysignals is currently being handled while in Integer.tp_dealloc . If so, assume that the state of the object's mpz struct may not be consistent, so do not call mpz_clear on it and do not place it back in the free pool.
|
☐ | Don't forget to re-enable the test that was disabled in #25137 in order to test that this is fixed. |
As discussed on sage-devel, I'm fairly consistently (roughly 9 times out of 10) getting the following failure on Cygwin:
sage -t --warn-long 164.8 src/sage/structure/coerce_actions.pyx ********************************************************************** File "src/sage/structure/coerce_actions.pyx", line 786, in sage.structure.coerce_actions.IntegerMulAction._repr_name_ Failed example: IntegerMulAction(ZZ, GF5) Expected: Left Integer Multiplication by Integer Ring on Finite Field of size 5 Got: Left Integer Multiplication by Integer Ring on Finite Field of size 1 ********************************************************************** 1 item had failures: 1 of 4 in sage.structure.coerce_actions.IntegerMulAction._repr_name_ [143 tests, 1 failure, 3.30 s] ---------------------------------------------------------------------- sage -t --warn-long 164.8 src/sage/structure/coerce_actions.pyx # 1 doctest failed
Obviously Sage doesn't even allow creation of an order 1 field. In fact, I traced the cause of this to a specific line in FiniteFieldFactory.create_key_and_extra_args
where, by chance, an Integer
with a value of 1
is constructed (using fast_tp_new
) whose (mp_limb*)(Integer.value._mp_d)
member is assigned the same address as the _mp_d
of the Integer
that happens to hold the field's order.
The result is that the order
is then set to 1
as well. This happens after the check that order>1
so creation of the field still succeeds. Clearly there is a subtle bug either in fast_tp_new
, or in the memory allocator itself.
Change History (86)
comment:1 Changed 4 years ago by
- Description modified (diff)
comment:2 follow-up: ↓ 3 Changed 4 years ago by
comment:3 in reply to: ↑ 2 Changed 4 years ago by
Replying to jdemeyer:
We should also consider the possibility of a bug in Cygwin, as this has not been observed on other platforms.
Yep, I agree. Though I think there's also a possibility of a subtle bug. The code around which this is happening looks like this:
order = Integer(order) if order <= 1: raise ValueError("the order of a finite field must be at least 2") if order.is_prime(): p = order n = Integer(1)
It's the last line, n = Integer(1)
where that Integer
has the same _mp_d
as order
for some reason. Also in this case order
is already passed in as an Integer
, so I'm wondering if there's some subtle bug around order = Integer(order)
.
comment:4 Changed 4 years ago by
Maybe not--by the time I get to order = Integer(order)
the next two Integer
s in the pool already shared the same _mp_d
:
(gdb) p __pyx_v_4sage_5rings_7integer_integer_pool_count $37 = 7 (gdb) p ((struct __pyx_obj_4sage_5rings_7integer_Integer *)__pyx_v_4sage_5rings_7integer_integer_pool[6]).value $38 = {{_mp_alloc = 1, _mp_size = 0, _mp_d = 0x602b85c40}} (gdb) p ((struct __pyx_obj_4sage_5rings_7integer_Integer *)__pyx_v_4sage_5rings_7integer_integer_pool[5]).value $39 = {{_mp_alloc = 1, _mp_size = 0, _mp_d = 0x602b85c40}}
I'm going to add in some tracing and see if that reveals anything (or makes the problem go away entirely which would point more toward a Cygwin bug if it's that sensitive.
comment:5 Changed 4 years ago by
Can you compile the Sage library with -O0 -g
just to check for a potential compiler bug? You will need to do rm -rf src/build
first.
comment:6 Changed 4 years ago by
I'll give that a try in a bit. A compiler bug is also certainly a possibility. This is gcc 6.4.0.
comment:7 Changed 4 years ago by
It's not possible that there'd be anywhere in Sage where it creates Integer objects while the GIL is released, is there?
comment:8 Changed 4 years ago by
More generally, I think that Sage never releases the GIL. Libraries used by Sage (e.g. Numpy) might.
comment:9 Changed 4 years ago by
Aha! I put some code in fast_tp_dealloc
to check, before returning an Integer
to the pool, if there is already an entry in the pool with the same _mp_d
, and raise a RuntimeError
if so. And this condition first occurs in earlier test in the same module:
sage -t --warn-long 164.8 src/sage/structure/coerce_actions.pyx ********************************************************************** File "src/sage/structure/coerce_actions.pyx", line 758, in sage.structure.coerce_actions.IntegerMulAction._call_ Failed example: alarm(0.5); (2^(10^7)) * P Expected: Traceback (most recent call last): ... AlarmInterrupt Got: RuntimeError: [fast_tp_dealloc] DUPLICATE _mp_d: 0x0x60404ebe0L! (integer_pool_count: 3) Exception RuntimeError: RuntimeError('[fast_tp_dealloc] DUPLICATE _mp_d: 0x0x60404ebe0L!\n (integer_pool_count: 3)',) in 'sage.rings.integer.fast_tp_dealloc' ignored Traceback (most recent call last): File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 541, in _run self.compile_and_execute(example, compiler, test.globs) File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 951, in compile_and_execute exec(compiled, globs) File "<doctest sage.structure.coerce_actions.IntegerMulAction._call_[11]>", line 1, in <module> alarm(RealNumber('0.5')); (Integer(2)**(Integer(10)**Integer(7))) * P SystemError: error return without exception set
The presence of alarm()
also explains the slight randomness of this error. So it seems that somewhere the allocation/deallocation of integers not safe to interruption by a signal...
comment:10 follow-up: ↓ 12 Changed 4 years ago by
Specifically this is inside fast_mul
which is doing some pretty heavy creation and destruction of Integers. If I just run the same code manually at the prompt and hit Ctrl-C, depending on exactly where it is fast_mul
, sometimes I just get a KeyboardInterrupt
raised from cysignals, or sometimes my RuntimeError
gets raised.
comment:11 Changed 4 years ago by
Two new notes:
- I can reproduce the problem reliably on Cygwin by the following:
sage: E = EllipticCurve(GF(5), [4,0]) sage: set_random_seed(0); P = E.random_element() sage: alarm(0.5); (2^(10^7)) * P --------------------------------------------------------------------------- AlarmInterrupt Traceback (most recent call last) <ipython-input-3-1f9f5650ad9d> in <module>() ----> 1 alarm(RealNumber('0.5')); (Integer(2)**(Integer(10)**Integer(7))) * P ... /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer._shift_helper (build/cythonized/sage/rings/integer.c:40438)() 6371 # Now finally call into MPIR to do the shifting. 6372 cdef Integer z = PY_NEW(Integer) -> 6373 sig_on() 6374 if n < 0: 6375 mpz_fdiv_q_2exp(z.value, self.value, -n) src/cysignals/signals.pyx in cysignals.signals.sig_raise_exception() AlarmInterrupt: sage: a = Integer(5) sage: a 5 sage: b = Integer(1) sage: a 1
This same process works correctly on Linux though, which is still suspicious.
- If I remove the following lines from
fast_tp_dealloc
the problem appears to go away:so it's possible also something fishy is happening with# Here we free any extra memory used by the mpz_t by # setting it to a single limb. if o_mpz._mp_alloc > 10: _mpz_realloc(o_mpz, 1)
_mpz_realloc
.
comment:12 in reply to: ↑ 10 Changed 4 years ago by
Replying to embray:
Specifically this is inside
fast_mul
Just to be clear that we are talking about the same thing: you mean fast_mul
inside src/sage/structure/coerce_actions.pyx
?
comment:13 Changed 4 years ago by
Yes. Then at some point the _mpz_realloc
I mentioned above sets the pointer to the same address as one of the existing (presumably unused) integers in the pool.
comment:14 Changed 4 years ago by
It would be good to know exactly which code is executing when the interrupt/alarm happens.
There are two ways to find out:
- Run Python under
gdb
- Using
cysignals
assume that Cygwin has abacktrace()
function. You'll need to recompilecysignals
with debugging. Within Sage, it suffices to doSAGE_DEBUG=yes ./sage -f cysignals
comment:15 follow-up: ↓ 16 Changed 4 years ago by
Something else to try (just shooting in the dark here): explicitly call sage.ext.memory.init_memory_functions()
before doing your tests.
comment:16 in reply to: ↑ 15 ; follow-up: ↓ 17 Changed 4 years ago by
Replying to jdemeyer:
Something else to try (just shooting in the dark here): explicitly call
sage.ext.memory.init_memory_functions()
before doing your tests.
What would explicitly calling it do? Shouldn't it be called automatically anyways? I confirmed in gdb that mpir is using the sage/cysignals allocation functions. I actually tried disabling it and it didn't seem to make a difference on the problem.
comment:17 in reply to: ↑ 16 Changed 4 years ago by
Replying to embray:
Shouldn't it be called automatically anyways?
Of course it should. But the whole point of debugging is to verify that everything that should happen actually happens.
comment:18 Changed 4 years ago by
Does the cysignals testsuite pass (./sage -f -c cysignals
)? Does anything change if cysignals (and the Sage library) is compiled with -O0
?
comment:19 follow-ups: ↓ 21 ↓ 22 Changed 4 years ago by
After rebuilding with SAGE_DEBUG=yes
(mostly--singular is failing its debug builds for some reason, which is something I'll have to investigate another time), I can no longer reproduce the problem, which is disconcerting. I confirmed that the Sage modules compiled with -O0
.
For reference's sake, with the alarm(0.5)
(and actually with other values as well, such as alarm(1)
), the AlarmInterrupt
is almost always raised from the same place:
sage: alarm(1); (2^(10^7)) * P *** SIG 14 *** inside sig_on do_raise_exception(sig=14) PyErr_Occurred() = 0x0 Raising Python exception 0 ms after signals... --------------------------------------------------------------------------- AlarmInterrupt Traceback (most recent call last) <ipython-input-17-57b8340ec195> in <module>() ----> 1 alarm(Integer(1)); (Integer(2)**(Integer(10)**Integer(7))) * P /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer.__mul__ (build/cythonized/sage/rings/integer.c:12921)() 1856 return y 1857 -> 1858 return coercion_model.bin_op(left, right, operator.mul) 1859 1860 cpdef _mul_(self, right): /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/structure/coerce.pyx in sage.structure.coerce.CoercionModel_cache_maps.bin_op (build/cythonized/sage/structure/coerce.c:9695)() 1114 action = self.get_action(xp, yp, op, x, y) 1115 if action is not None: -> 1116 return (<Action>action)._call_(x, y) 1117 1118 try: /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/structure/coerce_actions.pyx in sage.structure.coerce_actions.IntegerMulAction._call_ (build/cythonized/sage/structure/coerce_actions.c:10587)() 772 return fast_mul_long(a, n_long) 773 --> 774 return fast_mul(a, nn) 775 776 def _repr_name_(self): /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/structure/coerce_actions.pyx in sage.structure.coerce_actions.fast_mul (build/cythonized/sage/structure/coerce_actions.c:11610)() 899 sig_check() 900 pow2a += pow2a --> 901 n = n >> 1 902 sum = pow2a 903 n = n >> 1 /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer.__rshift__ (build/cythonized/sage/rings/integer.c:40813)() 6444 if not isinstance(x, Integer): 6445 return x >> int(y) -> 6446 return (<Integer>x)._shift_helper(y, -1) 6447 6448 cdef _and(Integer self, Integer other): /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer._shift_helper (build/cythonized/sage/rings/integer.c:40452)() 6375 # Now finally call into MPIR to do the shifting. 6376 cdef Integer z = PY_NEW(Integer) -> 6377 sig_on() 6378 if n < 0: 6379 mpz_fdiv_q_2exp(z.value, self.value, -n) src/cysignals/signals.pyx in cysignals.signals.sig_raise_exception() AlarmInterrupt:
On occasion it's elsewhere, but most of the time it's in that shift call. To learn much else I'll have to drop into gdb, but first I'll have to rebuild some modules with -O(>0)
and see if the problem returns...
comment:20 Changed 4 years ago by
In other to narrow the problem down, it would be good to know if it matters whether cysignals is compiled with -O0
or whether the Sage library is compiled with -O0
.
comment:21 in reply to: ↑ 19 ; follow-up: ↓ 23 Changed 4 years ago by
Replying to embray:
On occasion it's elsewhere, but most of the time it's in that shift call.
The problem is that the sig_on()
hides the actual place where the signal is raised. I hope that we can learn something by knowing precisely what is interrupted.
comment:22 in reply to: ↑ 19 Changed 4 years ago by
Replying to embray:
After rebuilding with
SAGE_DEBUG=yes
(mostly--singular is failing its debug builds for some reason, which is something I'll have to investigate another time), I can no longer reproduce the problem, which is disconcerting. I confirmed that the Sage modules compiled with-O0
.
Oops--disregard that. I had forgotten exactly where I left off on this yesterday, and it appears I still had the troublesome _mpz_realloc
disabled in the source I was building from. After re-enabling that the problem returns, even in the debug build.
comment:23 in reply to: ↑ 21 ; follow-up: ↓ 26 Changed 4 years ago by
Replying to jdemeyer:
Replying to embray:
On occasion it's elsewhere, but most of the time it's in that shift call.
The problem is that the
sig_on()
hides the actual place where the signal is raised. I hope that we can learn something by knowing precisely what is interrupted.
Right, but doesn't that at least mean it's at least somewhere between that sig_on()
and the corresponding sig_off()
?
comment:24 Changed 4 years ago by
OK, so the problem persists with -O0
. Good, so it looks like a real bug (either in Sage, cysignals, MPIR or Cygwin).
comment:25 Changed 4 years ago by
Yes, likely one of those four things--probably not the compiler though I wouldn't rule it out completely.
comment:26 in reply to: ↑ 23 Changed 4 years ago by
Replying to embray:
Right, but doesn't that at least mean it's at least somewhere between that
sig_on()
and the correspondingsig_off()
?
Yes, but I would like to know the exact place in the MPIR code where it is interrupted.
comment:27 Changed 4 years ago by
It looks like it's being interrupted in the _mpz_realloc
call from mpz_mul_2exp(z.value, self.value, n)
, on the just-created Integer
"z".
I think I see the general problem here. The function _mpz_realloc
is not interrupt-safe. Even with GMP's realloc set to sig_realloc
, that just blocks signals before/after the actual realloc
. However, it's then only after realloc
succeeds that the _mp_d
pointer gets updated. If the signal handler is invoked before that happens then the old value in _mp_d
might now point to a freed address (in fact, and address to a size=1 array of limbs).
Then, as fast_mul
is cleaned up it again frees an integer (the variable "n"). This is a large integer so it trips the semi-arbitrary condition:
# Here we free any extra memory used by the mpz_t by # setting it to a single limb. if o_mpz._mp_alloc > 10: _mpz_realloc(o_mpz, 1)
and this realloc
happens to give back the same size=1
array that was previously just freed. So both _mp_d
end up with the same pointer.
comment:28 Changed 4 years ago by
I'm still doing more testing, but I seem to have fixed it by wrapping any call to _mpz_realloc
and anything that calls _mpz_realloc
with sig_block/unblock()
. But a better solution would be nice since the category of "anything that calls _mpz_realloc
" includes most mpz_
functions.
comment:29 Changed 4 years ago by
I'm afraid that this is essentially an unsolvable problem. It's well known that longjmp()
(used by cysignals) can mess up the internal state of objects. In cysignals, I implemented safe memory allocation functions like sig_malloc()
because those were the most common source of breakage. But of course, things can break in other ways.
comment:30 follow-up: ↓ 31 Changed 4 years ago by
Right--in general it's not completely solvable. For this case we do have some ways forward, however.
- In GMP/MPIR it's already possible to pass custom memory allocation functions. We can set our own
realloc
for it to use, but what we really need here is to be able to set our own function for_mpz_realloc
. A pluggable interface for this could be provided as well (for now requiring a patch of course).
- In the context of fast allocation of integers, we might just have to rethink the extent to which we muck around with the internals, at the cost of some efficiency in some cases.
I don't think this bug is specific to Cygwin. It just showed up there perhaps due to the larger overhead in memory allocation functions or similar reasons.
comment:31 in reply to: ↑ 30 Changed 4 years ago by
Replying to embray:
- In the context of fast allocation of integers, we might just have to rethink the extent to which we muck around with the internals, at the cost of some efficiency in some cases.
That's not really the problem here. The internal state is already messed up before that.
comment:32 follow-up: ↓ 33 Changed 4 years ago by
The potential for problems is reduce if, say, we mpz_clear
before placing objects back in the pool, and mpz_init
when taking them out. That's what I mean. That could mean a big loss in performance if integers of the same size are being created and destroyed rapidly, but there's also an extent to which we should trust the memory allocator to handle that case efficiently as well.
comment:33 in reply to: ↑ 32 Changed 4 years ago by
Replying to embray:
The potential for problems is reduce if, say, we
mpz_clear
before placing objects back in the pool, andmpz_init
when taking them out.
I don't see why. How is this bug even related to the integer pool?
comment:34 follow-up: ↓ 36 Changed 4 years ago by
Maybe I haven't explained the issue well enough. The bug has everything to do with the integer pool. Of course, as you say, there are many contexts where an interrupt could leave a datastructure in an incomplete state, as is happening here with the mpz_t
s.
This isn't usually a problem, because in most cases we have Cysignals turned on, which means an exception is raised and, in most cases (but not all--therein is still a problem) the computation is aborted and the relevant datastructures are cleaned up anyways and it doesn't entirely matter if they're consistent or not (here there is still a potential for double-frees and the like, and that would be a bug too, albeit a slightly different one).
When it comes to the integer pool that's not the case, because instead of completely cleaning up and destroying the relevant data structures, we're always assuming that when an Integer
is returned to the pool that it's in a sane, consistent state (or at the very least, that its _mp_d
pointer is still valid). And that's clearly not the case.
The deeper underlying problem here is not the bug. The bug is that the integer pool makes it too easy to keep around invalid datastructures.
comment:35 Changed 4 years ago by
I suppose one other possibility, is we replace the GMP memory allocation functions with ones that trace memory allocations relevant to the integer pool, and use that to perform a quick consistency check before returning objects to the pool. Would have to think about how to do this in a way that does not have too awful overhead...
No, this would suck, because there would be no obvious way to enable that behavior only for the relevant contexts.
comment:36 in reply to: ↑ 34 Changed 4 years ago by
Replying to embray:
When it comes to the integer pool that's not the case, because instead of completely cleaning up and destroying the relevant data structures, we're always assuming that when an
Integer
is returned to the pool that it's in a sane, consistent state
I see what you are trying to say, but I don't agree. I believe that "completely cleaning up and destroying the relevant data structures" is equally unsafe as reusing that data structure. In other words, instead of getting in trouble when reusing the mpz_t
, you will instead get in trouble when calling mpz_clear()
on it.
comment:37 follow-up: ↓ 38 Changed 4 years ago by
Just writing down a thought before I forget: temporarily disable the integer pool following an exception from cysignals, possibly even clear it. Only re-enable next time an Integer is explicitly created. Might be nice to have an alternative to sig_on()
that can register an error callback (without having to write an explicit try/except)
comment:38 in reply to: ↑ 37 Changed 4 years ago by
Replying to embray:
Just writing down a thought before I forget: temporarily disable the integer pool following an exception from cysignals, possibly even clear it.
In my opinion, this bug has nothing to do with the integer pool, so it can't be fixed by changing the integer pool implementation.
comment:39 follow-up: ↓ 40 Changed 4 years ago by
cysignals could run a gc cycle after interrupt and have a boolean flag to tell c/cython code that we are currently in this "dangerous" gc cycle. This seems to be a reasonable cysignals feature, allowing users to react to possibly inconsistent C data structueres.
The integer pool could use that to not recycle mpz_t
.
comment:40 in reply to: ↑ 39 Changed 4 years ago by
Replying to vbraun:
cysignals could run a gc cycle after interrupt and have a boolean flag to tell c/cython code that we are currently in this "dangerous" gc cycle. This seems to be a reasonable cysignals feature, allowing users to react to possibly inconsistent C data structueres.
I see what you mean. Something like that could be done without patching GMP/MPIR.
I don't think that cysignals should run the "gc cycle" because the broken objects will still be referenced at that time. Instead, we could somehow check if there is a live (not yet handled) KeyboardInterrupt
exception.
comment:41 Changed 4 years ago by
- Component changed from memleak to cython
- Summary changed from Possible bug in Integer's fast_tp_new causing subtle test failure to Inconsistent mpz_t state after interrupted sig_realloc()
More concretely, here is an idea:
- Whenever
cysignals
raises an exception, it stores a pointercysigs.exc_value
to the exception.
Integer.tp_dealloc
checks whether the exception saved by cysignals equals the currently pending exception. If not, it resetscysigs.exc_value = NULL
. If the exception matches, it assumes that theInteger
is in a broken state and it skipstp_dealloc
completely: it does not store in the pool and it does not callmpz_clear
on it.
comment:42 Changed 4 years ago by
At what point would cysigs.exc_value
get cleared otherwise? In sig_on()
?
I agree with you that the problem here is not just with the integer pool, but based on further analysis it's an area that is more likely than anywhere else to be affected by this, at least where Integers are concerned. And it's worse really--if you tried to mpz_clear
you get a double free an exception. But in this case, because invalid Integer
s are being recycled (but that still happen to contain valid pointers) you end up with far more insidious bugs (i.e. the values of some ints being flipped).
comment:43 Changed 4 years ago by
Otherwise, I think that's a pretty acceptable solution. It could lead to memory leaks, but the present situation is no better in that regard. And we can guess this will probably be a rare situation ...
comment:44 follow-up: ↓ 45 Changed 4 years ago by
Although, I also don't much see the downside to wrapping most mpz_ functions with sig_block/unblock()--any that modify the internal datastructures (in such a way that involves memory allocations) of one or more of its arguments--and that aren't expected to take long to run. Of course, this would be tedious, so we'd have to generate the wrappers automatically.
comment:45 in reply to: ↑ 44 Changed 4 years ago by
Replying to embray:
that aren't expected to take long to run.
Many interesting mpz
functions take time at least linear in the size of the input. Since we want to support large integers, they can all take a long time.
Perhaps the only obvious example to apply sig_block()
/sig_unblock()
to would be mpz_init()
.
comment:46 Changed 4 years ago by
- Dependencies set to #21509, #24111
comment:47 follow-up: ↓ 48 Changed 4 years ago by
That brings me back to my earlier point that it would be very helpful to be able to wrap/override _mpz_realloc
since that's worst offender with respect to memory allocations inside most mpz functions (mostly called to extend the limbs array to make room for the result).
comment:48 in reply to: ↑ 47 Changed 4 years ago by
Replying to embray:
That brings me back to my earlier point that it would be very helpful to be able to wrap/override
_mpz_realloc
That requires convincing upstream for the need of a hook there.
comment:49 follow-up: ↓ 51 Changed 4 years ago by
By the way, I tried to reproduce this problem on Linux by adding an artificial delay in sage_sig_realloc
(to increase the chances of the alarm happening there). Unfortunately, it could not reproduce the Cygwin problem. Maybe realloc()
is more hardened against this?
comment:50 Changed 4 years ago by
I have been thinking about concretely implementing 41. One problem is that Cython tends to run code without an active exception set, which means that it's non-trivial to check "the currently pending exception".
Of course, since Integer.tp_dealloc
uses completely custom code that Cython is not aware of, we could do it there. But I would rather prefer a general solution, which would also work without a custom tp_dealloc
.
Another idea is to check the refcount of the exception value to check whether it has been deleted. That works as long as the exception handler really deletes the exception and does not store it somehow.
Or, as you proposed in 42, we could explicitly clear the stored exception value under certain circumstances where we assume that we are not currently propagating an exception.
My feeling is that the refcount solution is the least likely to break: it seems unlikely that user code would catch the exception and store it in a variable. And I am assuming that REPLs don't do this either, but I'll certainly need to check that.
comment:51 in reply to: ↑ 49 Changed 4 years ago by
Replying to jdemeyer:
By the way, I tried to reproduce this problem on Linux by adding an artificial delay in
sage_sig_realloc
(to increase the chances of the alarm happening there). Unfortunately, it could not reproduce the Cygwin problem. Mayberealloc()
is more hardened against this?
Possibly? Or it could just be a coincidence in how the mallocs in question work. The only reason this really becomes a problem is that the next Integer that gets allocated some memory (for its _mp_d
) at the same address that was freed by a previous realloc()
. This seems to be a predicatable behavior of my malloc--I don't know the details but it's probably got some pool of small allocations with a FIFO or something. Or as you suggested it could be the malloc in glibc is more robust in cleaning things up if it's interrupted by a signal. Whatever the issue is it's very subtle, but could in principle happen anywhere.
comment:52 follow-up: ↓ 53 Changed 4 years ago by
My gut feeling is that anything that relies on refcounts is going to be very fragile in its own right, but it's worth a try.
Since this problem might be very difficult to solve in general, I wonder if it would be acceptable to make a temporary fix (like I already did) that just addresses this one, exact use case in the tests, just so that I don't have to have this test failing on Cygwin. With the goal of course of removing it later.
Alternatively, we should add something to the doctest framework to skip platform-specific known failures.
comment:53 in reply to: ↑ 52 Changed 4 years ago by
I'd certainly like to fix this in cysignals. However, I have lately been bitten by various breakages of cysignals on various platforms, so I'd like to improve the continuous integration of cysignals first. If you know of a free CI service that lets you run Python code on platforms other than Linux x86_64, let me know :-)
comment:54 follow-up: ↓ 55 Changed 4 years ago by
OSX is still tricky, though I think Travis-CI has it now?
I can set up CI for cysignals on Cygwin if you want.
comment:55 in reply to: ↑ 54 ; follow-up: ↓ 57 Changed 4 years ago by
Replying to embray:
OSX is still tricky, though I think Travis-CI has it now?
In theory, yes. In practice, not so much.
I can set up CI for cysignals on Cygwin if you want.
That would be super-awesome. How do you plan to do that?
comment:56 Changed 4 years ago by
Julian Rüth suggested to try conda for Travis CI on OS X. I'll try that.
comment:57 in reply to: ↑ 55 ; follow-up: ↓ 58 Changed 4 years ago by
Replying to jdemeyer:
Replying to embray:
OSX is still tricky, though I think Travis-CI has it now?
In theory, yes. In practice, not so much.
:(
I can set up CI for cysignals on Cygwin if you want.
That would be super-awesome. How do you plan to do that?
AppVeyor? supports Cygwin, and I have set up other projects to do Cygwin builds with AppVeyor?. It will be pretty straightforward, if you just give me admin on the cysignals project on GH (if I don't already have it)
comment:58 in reply to: ↑ 57 ; follow-up: ↓ 60 Changed 4 years ago by
Replying to embray:
It will be pretty straightforward, if you just give me admin on the cysignals project on GH (if I don't already have it)
I just did that. Could it support 32-bit and 64-bit builds? That would be great because the other CI services are 64-bit only.
comment:59 Changed 4 years ago by
You may also be interested in https://github.com/sagemath/cysignals/pull/76 which is about a work-in-progress native Windows port of cysignals, using mingw64.
comment:60 in reply to: ↑ 58 Changed 4 years ago by
Replying to jdemeyer:
Could it support 32-bit and 64-bit builds? That would be great because the other CI services are 64-bit only.
Well, it has both 64-bit and 32-bit Cygwin installed on their VMs, but the underlying machine architecture is still 64-bit.
comment:61 Changed 4 years ago by
- Description modified (diff)
comment:62 Changed 4 years ago by
- Milestone changed from sage-8.2 to sage-8.3
comment:63 Changed 4 years ago by
- Dependencies #21509, #24111 deleted
comment:64 Changed 4 years ago by
- Keywords cygwin added
comment:65 Changed 4 years ago by
- Milestone changed from sage-8.3 to sage-8.4
Conceivably doable. This will likely never have a perfect fix, but if we can get the sig_occurred()
feature in cysignals it will help.
comment:66 Changed 4 years ago by
Ping to self
comment:67 Changed 4 years ago by
- Milestone changed from sage-8.4 to sage-8.5
comment:68 Changed 4 years ago by
- Dependencies set to #26900
- Description modified (diff)
comment:69 Changed 4 years ago by
- Branch set to u/jdemeyer/inconsistent_mpz_t_state_after_interrupted_sig_realloc__
comment:70 Changed 4 years ago by
- Commit set to 2a7ee2b8e7a9379f6f492b1f5c3524deed93695b
This doesn't quite work yet because sig_occurred()
is being called with a live exception. Since sig_occurred()
may call gc.collect()
, this means trouble.
Really, sig_occurred
should check for PyErr_Occurred()
which is both an optimization and a protection against doing GC with an exception set.
New commits:
1d26b4b | Upgrade to cysignals-1.8.0
|
2a7ee2b | Throw away mpz_t after an interrupt happened
|
comment:71 Changed 4 years ago by
- Commit changed from 2a7ee2b8e7a9379f6f492b1f5c3524deed93695b to 539d0e5e7a5d2954616dc60ceb8f1899b6465112
comment:72 Changed 4 years ago by
- Status changed from new to needs_review
comment:73 Changed 4 years ago by
- Milestone changed from sage-8.5 to sage-8.7
Retargeting some of my tickets (somewhat optimistically for now).
comment:74 Changed 4 years ago by
- Commit changed from 539d0e5e7a5d2954616dc60ceb8f1899b6465112 to 521bac96289ae33b5ecf337f1f36a7fd29d7bc8a
comment:75 Changed 4 years ago by
- Dependencies #26900 deleted
comment:76 follow-up: ↓ 78 Changed 4 years ago by
Did you do anything about comment:70? It's not clear what you did to address this, if anything. Does cysignals 1.8.1 fix this?
comment:77 Changed 4 years ago by
- Status changed from needs_review to needs_info
comment:78 in reply to: ↑ 76 Changed 4 years ago by
- Status changed from needs_info to needs_review
Replying to embray:
Did you do anything about comment:70? It's not clear what you did to address this, if anything. Does cysignals 1.8.1 fix this?
Yes, it is fixed by cysignals 1.8.1: https://github.com/sagemath/cysignals/commit/7030f02c97f01adb50ae63b38b68c1260fe7482d
comment:79 Changed 4 years ago by
Okay, great. I'm testing now on Cygwin, and will give positive review assuming it passes.
Also need to make sure cysignals 1.8.1 lands in Debian but that's no reason to hold up the ticket.
comment:80 Changed 4 years ago by
- Reviewers set to Erik Bray
- Status changed from needs_review to positive_review
Been running this test in a loop for more than an hour now with no crash. Of course, this still isn't 100% fool-proof but it's definitely much better.
comment:82 Changed 4 years ago by
So this is genuinely breaking docbuild tests somehow. That's bizarre...
comment:83 Changed 4 years ago by
It's the doctest change from this ticket which is breaking things. Apparently the sphinxbuild.py
test is raising an exception and then it tries to display the traceback, long after the exception has been handled by the doctest framework. I consider this a fishy test, so the solution is fixing that test.
comment:84 Changed 4 years ago by
In fact, the test already fails on Python 3 with vanilla Sage 8.6.
comment:85 Changed 4 years ago by
- Dependencies set to #27073
- Status changed from needs_work to positive_review
comment:86 Changed 4 years ago by
- Branch changed from u/jdemeyer/inconsistent_mpz_t_state_after_interrupted_sig_realloc__ to 521bac96289ae33b5ecf337f1f36a7fd29d7bc8a
- Resolution set to fixed
- Status changed from positive_review to closed
We should also consider the possibility of a bug in Cygwin, as this has not been observed on other platforms.