Opened 2 years ago

Closed 16 months ago

#24986 closed defect (fixed)

Inconsistent mpz_t state after interrupted sig_realloc()

Reported by: embray Owned by:
Priority: major Milestone: sage-8.7
Component: cython Keywords: cygwin
Cc: jdemeyer, caruso Merged in:
Authors: Jeroen Demeyer Reviewers: Erik Bray
Report Upstream: N/A Work issues:
Branch: 521bac9 (Commits) Commit: 521bac96289ae33b5ecf337f1f36a7fd29d7bc8a
Dependencies: #27073 Stopgaps:

Description (last modified by jdemeyer)

TODO

Use sig_occurred() to check whether an exception from Cysignals is currently being handled while in Integer.tp_dealloc. If so, assume that the state of the object's mpz struct may not be consistent, so do not call mpz_clear on it and do not place it back in the free pool.
Don't forget to re-enable the test that was disabled in #25137 in order to test that this is fixed.

As discussed on sage-devel, I'm fairly consistently (roughly 9 times out of 10) getting the following failure on Cygwin:

sage -t --warn-long 164.8 src/sage/structure/coerce_actions.pyx
**********************************************************************
File "src/sage/structure/coerce_actions.pyx", line 786, in
sage.structure.coerce_actions.IntegerMulAction._repr_name_
Failed example:
    IntegerMulAction(ZZ, GF5)
Expected:
    Left Integer Multiplication by Integer Ring on Finite Field of size 5
Got:
    Left Integer Multiplication by Integer Ring on Finite Field of size 1
**********************************************************************
1 item had failures:
   1 of   4 in sage.structure.coerce_actions.IntegerMulAction._repr_name_
    [143 tests, 1 failure, 3.30 s]
----------------------------------------------------------------------
sage -t --warn-long 164.8 src/sage/structure/coerce_actions.pyx  # 1
doctest failed

Obviously Sage doesn't even allow creation of an order 1 field. In fact, I traced the cause of this to a specific line in FiniteFieldFactory.create_key_and_extra_args where, by chance, an Integer with a value of 1 is constructed (using fast_tp_new) whose (mp_limb*)(Integer.value._mp_d) member is assigned the same address as the _mp_d of the Integer that happens to hold the field's order.

The result is that the order is then set to 1 as well. This happens after the check that order>1 so creation of the field still succeeds. Clearly there is a subtle bug either in fast_tp_new, or in the memory allocator itself.

Change History (86)

comment:1 Changed 2 years ago by embray

  • Description modified (diff)

comment:2 follow-up: Changed 2 years ago by jdemeyer

We should also consider the possibility of a bug in Cygwin, as this has not been observed on other platforms.

comment:3 in reply to: ↑ 2 Changed 2 years ago by embray

Replying to jdemeyer:

We should also consider the possibility of a bug in Cygwin, as this has not been observed on other platforms.

Yep, I agree. Though I think there's also a possibility of a subtle bug. The code around which this is happening looks like this:

            order = Integer(order)
            if order <= 1:
                raise ValueError("the order of a finite field must be at least 2")

            if order.is_prime():
                p = order
                n = Integer(1)

It's the last line, n = Integer(1) where that Integer has the same _mp_d as order for some reason. Also in this case order is already passed in as an Integer, so I'm wondering if there's some subtle bug around order = Integer(order).

Last edited 2 years ago by embray (previous) (diff)

comment:4 Changed 2 years ago by embray

Maybe not--by the time I get to order = Integer(order) the next two Integers in the pool already shared the same _mp_d:

(gdb) p __pyx_v_4sage_5rings_7integer_integer_pool_count
$37 = 7
(gdb) p ((struct __pyx_obj_4sage_5rings_7integer_Integer *)__pyx_v_4sage_5rings_7integer_integer_pool[6]).value
$38 = {{_mp_alloc = 1, _mp_size = 0, _mp_d = 0x602b85c40}}
(gdb) p ((struct __pyx_obj_4sage_5rings_7integer_Integer *)__pyx_v_4sage_5rings_7integer_integer_pool[5]).value
$39 = {{_mp_alloc = 1, _mp_size = 0, _mp_d = 0x602b85c40}}

I'm going to add in some tracing and see if that reveals anything (or makes the problem go away entirely which would point more toward a Cygwin bug if it's that sensitive.

comment:5 Changed 2 years ago by jdemeyer

Can you compile the Sage library with -O0 -g just to check for a potential compiler bug? You will need to do rm -rf src/build first.

comment:6 Changed 2 years ago by embray

I'll give that a try in a bit. A compiler bug is also certainly a possibility. This is gcc 6.4.0.

comment:7 Changed 2 years ago by embray

It's not possible that there'd be anywhere in Sage where it creates Integer objects while the GIL is released, is there?

comment:8 Changed 2 years ago by jdemeyer

More generally, I think that Sage never releases the GIL. Libraries used by Sage (e.g. Numpy) might.

comment:9 Changed 2 years ago by embray

Aha! I put some code in fast_tp_dealloc to check, before returning an Integer to the pool, if there is already an entry in the pool with the same _mp_d, and raise a RuntimeError if so. And this condition first occurs in earlier test in the same module:

sage -t --warn-long 164.8 src/sage/structure/coerce_actions.pyx
**********************************************************************
File "src/sage/structure/coerce_actions.pyx", line 758, in sage.structure.coerce_actions.IntegerMulAction._call_
Failed example:
    alarm(0.5); (2^(10^7)) * P
Expected:
    Traceback (most recent call last):
    ...
    AlarmInterrupt
Got:
    RuntimeError: [fast_tp_dealloc] DUPLICATE _mp_d: 0x0x60404ebe0L!
                            (integer_pool_count: 3)
    Exception RuntimeError: RuntimeError('[fast_tp_dealloc] DUPLICATE _mp_d: 0x0x60404ebe0L!\n                        (integer_pool_count: 3)',) in 'sage.rings.integer.fast_tp_dealloc' ignored
    Traceback (most recent call last):
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 541, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 951, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.structure.coerce_actions.IntegerMulAction._call_[11]>", line 1, in <module>
        alarm(RealNumber('0.5')); (Integer(2)**(Integer(10)**Integer(7))) * P
    SystemError: error return without exception set

The presence of alarm() also explains the slight randomness of this error. So it seems that somewhere the allocation/deallocation of integers not safe to interruption by a signal...

comment:10 follow-up: Changed 2 years ago by embray

Specifically this is inside fast_mul which is doing some pretty heavy creation and destruction of Integers. If I just run the same code manually at the prompt and hit Ctrl-C, depending on exactly where it is fast_mul, sometimes I just get a KeyboardInterrupt raised from cysignals, or sometimes my RuntimeError gets raised.

comment:11 Changed 2 years ago by embray

Two new notes:

  1. I can reproduce the problem reliably on Cygwin by the following:
    sage: E = EllipticCurve(GF(5), [4,0])
    sage: set_random_seed(0); P = E.random_element()
    sage: alarm(0.5); (2^(10^7)) * P
    ---------------------------------------------------------------------------
    AlarmInterrupt                            Traceback (most recent call last)
    <ipython-input-3-1f9f5650ad9d> in <module>()
    ----> 1 alarm(RealNumber('0.5')); (Integer(2)**(Integer(10)**Integer(7))) * P
    ...
    /home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer._shift_helper (build/cythonized/sage/rings/integer.c:40438)()
       6371         # Now finally call into MPIR to do the shifting.
       6372         cdef Integer z = PY_NEW(Integer)
    -> 6373         sig_on()
       6374         if n < 0:
       6375             mpz_fdiv_q_2exp(z.value, self.value, -n)
    
    src/cysignals/signals.pyx in cysignals.signals.sig_raise_exception()
    
    AlarmInterrupt:
    sage: a = Integer(5)
    sage: a
    5
    sage: b = Integer(1)
    sage: a
    1
    
    This same process works correctly on Linux though, which is still suspicious.
  1. If I remove the following lines from fast_tp_dealloc the problem appears to go away:
            # Here we free any extra memory used by the mpz_t by
            # setting it to a single limb.
            if o_mpz._mp_alloc > 10:
                _mpz_realloc(o_mpz, 1)
    
    so it's possible also something fishy is happening with _mpz_realloc.

comment:12 in reply to: ↑ 10 Changed 2 years ago by jdemeyer

Replying to embray:

Specifically this is inside fast_mul

Just to be clear that we are talking about the same thing: you mean fast_mul inside src/sage/structure/coerce_actions.pyx?

comment:13 Changed 2 years ago by embray

Yes. Then at some point the _mpz_realloc I mentioned above sets the pointer to the same address as one of the existing (presumably unused) integers in the pool.

comment:14 Changed 2 years ago by jdemeyer

It would be good to know exactly which code is executing when the interrupt/alarm happens.

There are two ways to find out:

  1. Run Python under gdb
  1. Using cysignals assume that Cygwin has a backtrace() function. You'll need to recompile cysignals with debugging. Within Sage, it suffices to do
    SAGE_DEBUG=yes ./sage -f cysignals
    

comment:15 follow-up: Changed 2 years ago by jdemeyer

Something else to try (just shooting in the dark here): explicitly call sage.ext.memory.init_memory_functions() before doing your tests.

comment:16 in reply to: ↑ 15 ; follow-up: Changed 2 years ago by embray

Replying to jdemeyer:

Something else to try (just shooting in the dark here): explicitly call sage.ext.memory.init_memory_functions() before doing your tests.

What would explicitly calling it do? Shouldn't it be called automatically anyways? I confirmed in gdb that mpir is using the sage/cysignals allocation functions. I actually tried disabling it and it didn't seem to make a difference on the problem.

comment:17 in reply to: ↑ 16 Changed 2 years ago by jdemeyer

Replying to embray:

Shouldn't it be called automatically anyways?

Of course it should. But the whole point of debugging is to verify that everything that should happen actually happens.

comment:18 Changed 2 years ago by jdemeyer

Does the cysignals testsuite pass (./sage -f -c cysignals)? Does anything change if cysignals (and the Sage library) is compiled with -O0?

comment:19 follow-ups: Changed 2 years ago by embray

After rebuilding with SAGE_DEBUG=yes (mostly--singular is failing its debug builds for some reason, which is something I'll have to investigate another time), I can no longer reproduce the problem, which is disconcerting. I confirmed that the Sage modules compiled with -O0.

For reference's sake, with the alarm(0.5) (and actually with other values as well, such as alarm(1)), the AlarmInterrupt is almost always raised from the same place:

sage: alarm(1); (2^(10^7)) * P

*** SIG 14 *** inside sig_on
do_raise_exception(sig=14)
PyErr_Occurred() = 0x0
Raising Python exception 0 ms after signals...
---------------------------------------------------------------------------
AlarmInterrupt                            Traceback (most recent call last)
<ipython-input-17-57b8340ec195> in <module>()
----> 1 alarm(Integer(1)); (Integer(2)**(Integer(10)**Integer(7))) * P

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer.__mul__ (build/cythonized/sage/rings/integer.c:12921)()
   1856             return y
   1857
-> 1858         return coercion_model.bin_op(left, right, operator.mul)
   1859
   1860     cpdef _mul_(self, right):

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/structure/coerce.pyx in sage.structure.coerce.CoercionModel_cache_maps.bin_op (build/cythonized/sage/structure/coerce.c:9695)()
   1114             action = self.get_action(xp, yp, op, x, y)
   1115             if action is not None:
-> 1116                 return (<Action>action)._call_(x, y)
   1117
   1118         try:

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/structure/coerce_actions.pyx in sage.structure.coerce_actions.IntegerMulAction._call_ (build/cythonized/sage/structure/coerce_actions.c:10587)()
    772             return fast_mul_long(a, n_long)
    773
--> 774         return fast_mul(a, nn)
    775
    776     def _repr_name_(self):

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/structure/coerce_actions.pyx in sage.structure.coerce_actions.fast_mul (build/cythonized/sage/structure/coerce_actions.c:11610)()
    899         sig_check()
    900         pow2a += pow2a
--> 901         n = n >> 1
    902     sum = pow2a
    903     n = n >> 1

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer.__rshift__ (build/cythonized/sage/rings/integer.c:40813)()
   6444         if not isinstance(x, Integer):
   6445             return x >> int(y)
-> 6446         return (<Integer>x)._shift_helper(y, -1)
   6447
   6448     cdef _and(Integer self, Integer other):

/home/embray/src/sagemath/sage/local/lib/python2.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.Integer._shift_helper (build/cythonized/sage/rings/integer.c:40452)()
   6375         # Now finally call into MPIR to do the shifting.
   6376         cdef Integer z = PY_NEW(Integer)
-> 6377         sig_on()
   6378         if n < 0:
   6379             mpz_fdiv_q_2exp(z.value, self.value, -n)

src/cysignals/signals.pyx in cysignals.signals.sig_raise_exception()

AlarmInterrupt:

On occasion it's elsewhere, but most of the time it's in that shift call. To learn much else I'll have to drop into gdb, but first I'll have to rebuild some modules with -O(>0) and see if the problem returns...

comment:20 Changed 2 years ago by jdemeyer

In other to narrow the problem down, it would be good to know if it matters whether cysignals is compiled with -O0 or whether the Sage library is compiled with -O0.

comment:21 in reply to: ↑ 19 ; follow-up: Changed 2 years ago by jdemeyer

Replying to embray:

On occasion it's elsewhere, but most of the time it's in that shift call.

The problem is that the sig_on() hides the actual place where the signal is raised. I hope that we can learn something by knowing precisely what is interrupted.

comment:22 in reply to: ↑ 19 Changed 2 years ago by embray

Replying to embray:

After rebuilding with SAGE_DEBUG=yes (mostly--singular is failing its debug builds for some reason, which is something I'll have to investigate another time), I can no longer reproduce the problem, which is disconcerting. I confirmed that the Sage modules compiled with -O0.

Oops--disregard that. I had forgotten exactly where I left off on this yesterday, and it appears I still had the troublesome _mpz_realloc disabled in the source I was building from. After re-enabling that the problem returns, even in the debug build.

comment:23 in reply to: ↑ 21 ; follow-up: Changed 2 years ago by embray

Replying to jdemeyer:

Replying to embray:

On occasion it's elsewhere, but most of the time it's in that shift call.

The problem is that the sig_on() hides the actual place where the signal is raised. I hope that we can learn something by knowing precisely what is interrupted.

Right, but doesn't that at least mean it's at least somewhere between that sig_on() and the corresponding sig_off()?

comment:24 Changed 2 years ago by jdemeyer

OK, so the problem persists with -O0. Good, so it looks like a real bug (either in Sage, cysignals, MPIR or Cygwin).

comment:25 Changed 2 years ago by embray

Yes, likely one of those four things--probably not the compiler though I wouldn't rule it out completely.

comment:26 in reply to: ↑ 23 Changed 2 years ago by jdemeyer

Replying to embray:

Right, but doesn't that at least mean it's at least somewhere between that sig_on() and the corresponding sig_off()?

Yes, but I would like to know the exact place in the MPIR code where it is interrupted.

comment:27 Changed 2 years ago by embray

It looks like it's being interrupted in the _mpz_realloc call from mpz_mul_2exp(z.value, self.value, n), on the just-created Integer "z".

I think I see the general problem here. The function _mpz_realloc is not interrupt-safe. Even with GMP's realloc set to sig_realloc, that just blocks signals before/after the actual realloc. However, it's then only after realloc succeeds that the _mp_d pointer gets updated. If the signal handler is invoked before that happens then the old value in _mp_d might now point to a freed address (in fact, and address to a size=1 array of limbs).

Then, as fast_mul is cleaned up it again frees an integer (the variable "n"). This is a large integer so it trips the semi-arbitrary condition:

        # Here we free any extra memory used by the mpz_t by
        # setting it to a single limb.
        if o_mpz._mp_alloc > 10:
            _mpz_realloc(o_mpz, 1)

and this realloc happens to give back the same size=1 array that was previously just freed. So both _mp_d end up with the same pointer.

comment:28 Changed 2 years ago by embray

I'm still doing more testing, but I seem to have fixed it by wrapping any call to _mpz_realloc and anything that calls _mpz_realloc with sig_block/unblock(). But a better solution would be nice since the category of "anything that calls _mpz_realloc" includes most mpz_ functions.

Last edited 2 years ago by embray (previous) (diff)

comment:29 Changed 2 years ago by jdemeyer

I'm afraid that this is essentially an unsolvable problem. It's well known that longjmp() (used by cysignals) can mess up the internal state of objects. In cysignals, I implemented safe memory allocation functions like sig_malloc() because those were the most common source of breakage. But of course, things can break in other ways.

comment:30 follow-up: Changed 2 years ago by embray

Right--in general it's not completely solvable. For this case we do have some ways forward, however.

  1. In GMP/MPIR it's already possible to pass custom memory allocation functions. We can set our own realloc for it to use, but what we really need here is to be able to set our own function for _mpz_realloc. A pluggable interface for this could be provided as well (for now requiring a patch of course).
  1. In the context of fast allocation of integers, we might just have to rethink the extent to which we muck around with the internals, at the cost of some efficiency in some cases.

I don't think this bug is specific to Cygwin. It just showed up there perhaps due to the larger overhead in memory allocation functions or similar reasons.

comment:31 in reply to: ↑ 30 Changed 2 years ago by jdemeyer

Replying to embray:

  1. In the context of fast allocation of integers, we might just have to rethink the extent to which we muck around with the internals, at the cost of some efficiency in some cases.

That's not really the problem here. The internal state is already messed up before that.

Last edited 2 years ago by jdemeyer (previous) (diff)

comment:32 follow-up: Changed 2 years ago by embray

The potential for problems is reduce if, say, we mpz_clear before placing objects back in the pool, and mpz_init when taking them out. That's what I mean. That could mean a big loss in performance if integers of the same size are being created and destroyed rapidly, but there's also an extent to which we should trust the memory allocator to handle that case efficiently as well.

comment:33 in reply to: ↑ 32 Changed 2 years ago by jdemeyer

Replying to embray:

The potential for problems is reduce if, say, we mpz_clear before placing objects back in the pool, and mpz_init when taking them out.

I don't see why. How is this bug even related to the integer pool?

comment:34 follow-up: Changed 2 years ago by embray

Maybe I haven't explained the issue well enough. The bug has everything to do with the integer pool. Of course, as you say, there are many contexts where an interrupt could leave a datastructure in an incomplete state, as is happening here with the mpz_ts.

This isn't usually a problem, because in most cases we have Cysignals turned on, which means an exception is raised and, in most cases (but not all--therein is still a problem) the computation is aborted and the relevant datastructures are cleaned up anyways and it doesn't entirely matter if they're consistent or not (here there is still a potential for double-frees and the like, and that would be a bug too, albeit a slightly different one).

When it comes to the integer pool that's not the case, because instead of completely cleaning up and destroying the relevant data structures, we're always assuming that when an Integer is returned to the pool that it's in a sane, consistent state (or at the very least, that its _mp_d pointer is still valid). And that's clearly not the case.

The deeper underlying problem here is not the bug. The bug is that the integer pool makes it too easy to keep around invalid datastructures.

Last edited 2 years ago by embray (previous) (diff)

comment:35 Changed 2 years ago by embray

I suppose one other possibility, is we replace the GMP memory allocation functions with ones that trace memory allocations relevant to the integer pool, and use that to perform a quick consistency check before returning objects to the pool. Would have to think about how to do this in a way that does not have too awful overhead...

No, this would suck, because there would be no obvious way to enable that behavior only for the relevant contexts.

Last edited 2 years ago by embray (previous) (diff)

comment:36 in reply to: ↑ 34 Changed 2 years ago by jdemeyer

Replying to embray:

When it comes to the integer pool that's not the case, because instead of completely cleaning up and destroying the relevant data structures, we're always assuming that when an Integer is returned to the pool that it's in a sane, consistent state

I see what you are trying to say, but I don't agree. I believe that "completely cleaning up and destroying the relevant data structures" is equally unsafe as reusing that data structure. In other words, instead of getting in trouble when reusing the mpz_t, you will instead get in trouble when calling mpz_clear() on it.

comment:37 follow-up: Changed 2 years ago by embray

Just writing down a thought before I forget: temporarily disable the integer pool following an exception from cysignals, possibly even clear it. Only re-enable next time an Integer is explicitly created. Might be nice to have an alternative to sig_on() that can register an error callback (without having to write an explicit try/except)

comment:38 in reply to: ↑ 37 Changed 2 years ago by jdemeyer

Replying to embray:

Just writing down a thought before I forget: temporarily disable the integer pool following an exception from cysignals, possibly even clear it.

In my opinion, this bug has nothing to do with the integer pool, so it can't be fixed by changing the integer pool implementation.

comment:39 follow-up: Changed 2 years ago by vbraun

cysignals could run a gc cycle after interrupt and have a boolean flag to tell c/cython code that we are currently in this "dangerous" gc cycle. This seems to be a reasonable cysignals feature, allowing users to react to possibly inconsistent C data structueres.

The integer pool could use that to not recycle mpz_t.

comment:40 in reply to: ↑ 39 Changed 2 years ago by jdemeyer

Replying to vbraun:

cysignals could run a gc cycle after interrupt and have a boolean flag to tell c/cython code that we are currently in this "dangerous" gc cycle. This seems to be a reasonable cysignals feature, allowing users to react to possibly inconsistent C data structueres.

I see what you mean. Something like that could be done without patching GMP/MPIR.

I don't think that cysignals should run the "gc cycle" because the broken objects will still be referenced at that time. Instead, we could somehow check if there is a live (not yet handled) KeyboardInterrupt exception.

comment:41 Changed 2 years ago by jdemeyer

  • Component changed from memleak to cython
  • Summary changed from Possible bug in Integer's fast_tp_new causing subtle test failure to Inconsistent mpz_t state after interrupted sig_realloc()

More concretely, here is an idea:

  1. Whenever cysignals raises an exception, it stores a pointer cysigs.exc_value to the exception.
  1. Integer.tp_dealloc checks whether the exception saved by cysignals equals the currently pending exception. If not, it resets cysigs.exc_value = NULL. If the exception matches, it assumes that the Integer is in a broken state and it skips tp_dealloc completely: it does not store in the pool and it does not call mpz_clear on it.

comment:42 Changed 2 years ago by embray

At what point would cysigs.exc_value get cleared otherwise? In sig_on()?

I agree with you that the problem here is not just with the integer pool, but based on further analysis it's an area that is more likely than anywhere else to be affected by this, at least where Integers are concerned. And it's worse really--if you tried to mpz_clear you get a double free an exception. But in this case, because invalid Integers are being recycled (but that still happen to contain valid pointers) you end up with far more insidious bugs (i.e. the values of some ints being flipped).

comment:43 Changed 2 years ago by embray

Otherwise, I think that's a pretty acceptable solution. It could lead to memory leaks, but the present situation is no better in that regard. And we can guess this will probably be a rare situation ...

comment:44 follow-up: Changed 2 years ago by embray

Although, I also don't much see the downside to wrapping most mpz_ functions with sig_block/unblock()--any that modify the internal datastructures (in such a way that involves memory allocations) of one or more of its arguments--and that aren't expected to take long to run. Of course, this would be tedious, so we'd have to generate the wrappers automatically.

Last edited 2 years ago by embray (previous) (diff)

comment:45 in reply to: ↑ 44 Changed 2 years ago by jdemeyer

Replying to embray:

that aren't expected to take long to run.

Many interesting mpz functions take time at least linear in the size of the input. Since we want to support large integers, they can all take a long time.

Perhaps the only obvious example to apply sig_block()/sig_unblock() to would be mpz_init().

comment:46 Changed 2 years ago by jdemeyer

  • Dependencies set to #21509, #24111

comment:47 follow-up: Changed 2 years ago by embray

That brings me back to my earlier point that it would be very helpful to be able to wrap/override _mpz_realloc since that's worst offender with respect to memory allocations inside most mpz functions (mostly called to extend the limbs array to make room for the result).

comment:48 in reply to: ↑ 47 Changed 2 years ago by jdemeyer

Replying to embray:

That brings me back to my earlier point that it would be very helpful to be able to wrap/override _mpz_realloc

That requires convincing upstream for the need of a hook there.

comment:49 follow-up: Changed 2 years ago by jdemeyer

By the way, I tried to reproduce this problem on Linux by adding an artificial delay in sage_sig_realloc (to increase the chances of the alarm happening there). Unfortunately, it could not reproduce the Cygwin problem. Maybe realloc() is more hardened against this?

comment:50 Changed 2 years ago by jdemeyer

I have been thinking about concretely implementing 41. One problem is that Cython tends to run code without an active exception set, which means that it's non-trivial to check "the currently pending exception".

Of course, since Integer.tp_dealloc uses completely custom code that Cython is not aware of, we could do it there. But I would rather prefer a general solution, which would also work without a custom tp_dealloc.

Another idea is to check the refcount of the exception value to check whether it has been deleted. That works as long as the exception handler really deletes the exception and does not store it somehow.

Or, as you proposed in 42, we could explicitly clear the stored exception value under certain circumstances where we assume that we are not currently propagating an exception.

My feeling is that the refcount solution is the least likely to break: it seems unlikely that user code would catch the exception and store it in a variable. And I am assuming that REPLs don't do this either, but I'll certainly need to check that.

comment:51 in reply to: ↑ 49 Changed 2 years ago by embray

Replying to jdemeyer:

By the way, I tried to reproduce this problem on Linux by adding an artificial delay in sage_sig_realloc (to increase the chances of the alarm happening there). Unfortunately, it could not reproduce the Cygwin problem. Maybe realloc() is more hardened against this?

Possibly? Or it could just be a coincidence in how the mallocs in question work. The only reason this really becomes a problem is that the next Integer that gets allocated some memory (for its _mp_d) at the same address that was freed by a previous realloc(). This seems to be a predicatable behavior of my malloc--I don't know the details but it's probably got some pool of small allocations with a FIFO or something. Or as you suggested it could be the malloc in glibc is more robust in cleaning things up if it's interrupted by a signal. Whatever the issue is it's very subtle, but could in principle happen anywhere.

comment:52 follow-up: Changed 2 years ago by embray

My gut feeling is that anything that relies on refcounts is going to be very fragile in its own right, but it's worth a try.

Since this problem might be very difficult to solve in general, I wonder if it would be acceptable to make a temporary fix (like I already did) that just addresses this one, exact use case in the tests, just so that I don't have to have this test failing on Cygwin. With the goal of course of removing it later.

Alternatively, we should add something to the doctest framework to skip platform-specific known failures.

comment:53 in reply to: ↑ 52 Changed 2 years ago by jdemeyer

I'd certainly like to fix this in cysignals. However, I have lately been bitten by various breakages of cysignals on various platforms, so I'd like to improve the continuous integration of cysignals first. If you know of a free CI service that lets you run Python code on platforms other than Linux x86_64, let me know :-)

comment:54 follow-up: Changed 2 years ago by embray

OSX is still tricky, though I think Travis-CI has it now?

I can set up CI for cysignals on Cygwin if you want.

comment:55 in reply to: ↑ 54 ; follow-up: Changed 2 years ago by jdemeyer

Replying to embray:

OSX is still tricky, though I think Travis-CI has it now?

In theory, yes. In practice, not so much.

I can set up CI for cysignals on Cygwin if you want.

That would be super-awesome. How do you plan to do that?

comment:56 Changed 2 years ago by jdemeyer

Julian Rüth suggested to try conda for Travis CI on OS X. I'll try that.

comment:57 in reply to: ↑ 55 ; follow-up: Changed 2 years ago by embray

Replying to jdemeyer:

Replying to embray:

OSX is still tricky, though I think Travis-CI has it now?

In theory, yes. In practice, not so much.

:(

I can set up CI for cysignals on Cygwin if you want.

That would be super-awesome. How do you plan to do that?

AppVeyor? supports Cygwin, and I have set up other projects to do Cygwin builds with AppVeyor?. It will be pretty straightforward, if you just give me admin on the cysignals project on GH (if I don't already have it)

comment:58 in reply to: ↑ 57 ; follow-up: Changed 2 years ago by jdemeyer

Replying to embray:

It will be pretty straightforward, if you just give me admin on the cysignals project on GH (if I don't already have it)

I just did that. Could it support 32-bit and 64-bit builds? That would be great because the other CI services are 64-bit only.

comment:59 Changed 2 years ago by jdemeyer

You may also be interested in https://github.com/sagemath/cysignals/pull/76 which is about a work-in-progress native Windows port of cysignals, using mingw64.

comment:60 in reply to: ↑ 58 Changed 2 years ago by embray

Replying to jdemeyer:

Could it support 32-bit and 64-bit builds? That would be great because the other CI services are 64-bit only.

Well, it has both 64-bit and 32-bit Cygwin installed on their VMs, but the underlying machine architecture is still 64-bit.

comment:61 Changed 2 years ago by embray

  • Description modified (diff)

comment:62 Changed 2 years ago by embray

  • Milestone changed from sage-8.2 to sage-8.3

comment:63 Changed 2 years ago by jdemeyer

  • Dependencies #21509, #24111 deleted

comment:64 Changed 23 months ago by embray

  • Keywords cygwin added

comment:65 Changed 23 months ago by embray

  • Milestone changed from sage-8.3 to sage-8.4

Conceivably doable. This will likely never have a perfect fix, but if we can get the sig_occurred() feature in cysignals it will help.

comment:66 Changed 22 months ago by jdemeyer

Ping to self

comment:67 Changed 19 months ago by embray

  • Milestone changed from sage-8.4 to sage-8.5

comment:68 Changed 18 months ago by jdemeyer

  • Dependencies set to #26900
  • Description modified (diff)

comment:69 Changed 18 months ago by jdemeyer

  • Branch set to u/jdemeyer/inconsistent_mpz_t_state_after_interrupted_sig_realloc__

comment:70 Changed 18 months ago by jdemeyer

  • Commit set to 2a7ee2b8e7a9379f6f492b1f5c3524deed93695b

This doesn't quite work yet because sig_occurred() is being called with a live exception. Since sig_occurred() may call gc.collect(), this means trouble.

Really, sig_occurred should check for PyErr_Occurred() which is both an optimization and a protection against doing GC with an exception set.


New commits:

1d26b4bUpgrade to cysignals-1.8.0
2a7ee2bThrow away mpz_t after an interrupt happened

comment:71 Changed 17 months ago by git

  • Commit changed from 2a7ee2b8e7a9379f6f492b1f5c3524deed93695b to 539d0e5e7a5d2954616dc60ceb8f1899b6465112

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

a4a0621Upgrade to cysignals-1.8.1
fac8fc9doctest framework: do not store exception
539d0e5Throw away mpz_t after an interrupt happened

comment:72 Changed 17 months ago by jdemeyer

  • Authors set to Jeroen Demeyer
  • Status changed from new to needs_review

comment:73 Changed 17 months ago by embray

  • Milestone changed from sage-8.5 to sage-8.7

Retargeting some of my tickets (somewhat optimistically for now).

comment:74 Changed 17 months ago by git

  • Commit changed from 539d0e5e7a5d2954616dc60ceb8f1899b6465112 to 521bac96289ae33b5ecf337f1f36a7fd29d7bc8a

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

6d4bb9fdoctest framework: clear exception
521bac9Throw away mpz_t after an interrupt happened

comment:75 Changed 17 months ago by jdemeyer

  • Dependencies #26900 deleted

comment:76 follow-up: Changed 17 months ago by embray

Did you do anything about comment:70? It's not clear what you did to address this, if anything. Does cysignals 1.8.1 fix this?

comment:77 Changed 17 months ago by embray

  • Status changed from needs_review to needs_info

comment:78 in reply to: ↑ 76 Changed 17 months ago by jdemeyer

  • Status changed from needs_info to needs_review

Replying to embray:

Did you do anything about comment:70? It's not clear what you did to address this, if anything. Does cysignals 1.8.1 fix this?

Yes, it is fixed by cysignals 1.8.1: https://github.com/sagemath/cysignals/commit/7030f02c97f01adb50ae63b38b68c1260fe7482d

comment:79 Changed 17 months ago by embray

Okay, great. I'm testing now on Cygwin, and will give positive review assuming it passes.

Also need to make sure cysignals 1.8.1 lands in Debian but that's no reason to hold up the ticket.

comment:80 Changed 17 months ago by embray

  • Reviewers set to Erik Bray
  • Status changed from needs_review to positive_review

Been running this test in a loop for more than an hour now with no crash. Of course, this still isn't 100% fool-proof but it's definitely much better.

comment:81 Changed 16 months ago by vbraun

  • Status changed from positive_review to needs_work

See patchbot

comment:82 Changed 16 months ago by jdemeyer

So this is genuinely breaking docbuild tests somehow. That's bizarre...

comment:83 Changed 16 months ago by jdemeyer

It's the doctest change from this ticket which is breaking things. Apparently the sphinxbuild.py test is raising an exception and then it tries to display the traceback, long after the exception has been handled by the doctest framework. I consider this a fishy test, so the solution is fixing that test.

comment:84 Changed 16 months ago by jdemeyer

In fact, the test already fails on Python 3 with vanilla Sage 8.6.

comment:85 Changed 16 months ago by jdemeyer

  • Dependencies set to #27073
  • Status changed from needs_work to positive_review

comment:86 Changed 16 months ago by vbraun

  • Branch changed from u/jdemeyer/inconsistent_mpz_t_state_after_interrupted_sig_realloc__ to 521bac96289ae33b5ecf337f1f36a7fd29d7bc8a
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.