Opened 3 years ago

Closed 3 years ago

#11752 closed defect (fixed)

ecl.pyx should not touch SIGPWR neither SIGXCPU when initializing ecl

Reported by: pcpa Owned by: was
Priority: minor Milestone: sage-4.7.2
Component: interfaces Keywords:
Cc: Merged in: sage-4.7.2.alpha3
Authors: Paulo César Pereira de Andrade Reviewers: Nils Bruin
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Description (last modified by leif)

I experienced a problem in the Mandriva sagemath 4.7.1 that was causing sage to exit with the this somewhat funny message:

sage: from sage.interfaces.maxima_lib import maxima_lib

/usr/share/sage/local/bin/sage-sage: line 178:  7177 Power failure           sage-ipython "$@" -i

The problem is the logic in ecl.pyx that saves all signal handlers, initializes ecl and restores them, but this has the nasty side effect of messing with Boehm GC thread handling, that uses SIGXCPU and SIGPWR to stop/restart threads during gc.The attached patch corrects the problem for me, but is just a suggestion, and may need proper testing on non Linux systems, but on other systems Boehm GC should use a signal number larger than 32. Or, maybe add some comments about what those numbers are (SIGPWR = 30 and SIGXCPU = 24).


Solution turned out to be: tell ECL not to create a separate signal-handling thread (which it won't do if threads are not enabled, as in the standard Sage build of ECL).


Apply only 11752_ecl-nothread.patch to the Sage library.

Attachments (3)

ecl-log.txt (1.6 KB) - added by pcpa 3 years ago.
ecl-log.txt
sage-4.7.1-ecl_module.patch (919 bytes) - added by pcpa 3 years ago.
sage-4.7.1-ecl_module.patch
11752_ecl-nothread.patch (1.1 KB) - added by nbruin 3 years ago.
Set an option for ecl's cl_boot

Download all attachments as: .zip

Change History (22)

comment:1 Changed 3 years ago by nbruin

In principle the proposed change should be OK. Restoring the 32 signal handlers was just a general approach to avoid ECL's interrupt handlers.

Note however that sage's ECL and Boehm GC are normally built with threading disabled, and then they don't touch or use SIGPWR and SIGXCPU. ECL has some quite advanced features for dealing with threads, but I recall that the approach taken with signal handlers when threads are enabled is quite incompatible with sage.

Currently, to make SIGINT work in long-running ecl code, the signal handler simply gets swapped to the ecl one and swapped back to the sage one upon exit of the ecl code. This approach is obviously incompatible with multiple threads.

The approach taken in ecl, have one thread receive all asynchronous signals and let that distribute them over the other threads is a sound one, but in order to make that work in sage we would have to have such a thread in sage and make it aware of any ecl threads running.

So in practice, if the fix here is necessary you run the risk that you are running a configuration that handles signals quite poorly. You may want to test this.

comment:2 follow-up: Changed 3 years ago by pcpa

I need to make more complete tests, but the idea is to use as much as possible from system packages in my Mandriva sagemath rpm. The patch allows sage to load the ecl/maxima module and at least the (few) doctests I run did work correctly. Patch probably can be made a one liner by just not restoring handlers after cl_boot call, and for more clarity, use proper signal names, instead of numeric value.

If truly required, I may adapt the sagemath rpm package to use an alternate ecl, Bohem GC, and possibly maxima build, but at first I am trying to avoid it.

This should at least help in knowing beforehand issues that will happen in the future if sage is linked to other binaries, multi thread or not, that uses Bohem GC, or do their own signal management.

comment:3 in reply to: ↑ 2 Changed 3 years ago by nbruin

Replying to pcpa:

The patch allows sage to load the ecl/maxima module and at least the (few) doctests I run did work correctly.

This does not test asynchronous signal handling at all. The easiest way to do so is to start up a time-consuming job and send some Ctrl-C signals to the process (I don't think there are many other async signals Sage is required to handle properly). See #10818 for some relevant tests and ideas.

Patch probably can be made a one liner by just not restoring handlers after cl_boot call, and for more clarity, use proper signal names, instead of numeric value.

You'll at least have to restore the SIGINT handler.

This should at least help in knowing beforehand issues that will happen in the future if sage is linked to other binaries, multi thread or not, that use Boehm GC, or do their own signal management.

Absolutely! This is very useful work.

comment:4 follow-up: Changed 3 years ago by pcpa

It appears I have it completely broken in regarding to signal handling now. The attached log session is what I found. Running simple doctests does not test this behavior indeed. In the attached session log, the <<<<<< comment >>>>>>> lines are about what I pressed or typed when the console was not echoing.

comment:5 Changed 3 years ago by fbissey

Just looking at what's happening in sage-on-gentoo. We give a bit more freedom about what user can enable or not. I don't see the problem on my machine where I have boehm-gc compiled with threads but ecl without. However there was a report on bugzilla this morning of someone getting

$ LC_ALL=C sage
----------------------------------------------------------------------
| Sage Version 4.7.1, Release Date: 2011-08-11                       |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: var('i n')
(i, n)
sage: sum(4*i+1, i, 1, n)
/usr/bin/sage-sage: line 230:  8666 Power failure           sage-ipython "$@" -i

Which appear to me to be very similar to what you describe, I'll write again when I get more info from this user.

comment:6 in reply to: ↑ 4 ; follow-up: Changed 3 years ago by nbruin

Some observations:

  • the fact that you land in ECL's break-loop is indeed a sign something is broken, because we run the equivalent of a "try ... expect" in ecl_eval, so the interrupt should be caught and propagated up. I suspect this breaks because you are running it in a thread-enabled ECL. Perhaps the ctrl-C arrives at a different thread that is not running under such protection? There has to be a thread-aware option in ECL to change the handling of SIGINT, though. We could adjust that
  • You should exit ECL's break-loop with ":q", not "(quit)". The latter terminates the ECL session, but because ECL is not running its top-level (it's embedded), it cannot really do this and hence leaves the system in an inconsistent state. So the memory corruption you are observing is probably not due to erroneous signal handling.

Thank you for testing this! It looks like for now we should really advise people against running ECL in thread-enabled mode inside Sage. Unfortunately, it seems enabling/disabling threads is something done at configure time, not at ECL run time.

comment:7 in reply to: ↑ 6 ; follow-up: Changed 3 years ago by nbruin

There may be a boot-time option for ECL that gets around this problem. See ECL doc. If you can set SIGNAL_HANDLING_THREAD to false when calling cl_boot, you may be able to force ECL to not set up a separate signal handling thread, even if threading is enabled. I cannot test this here because I do not have a thread-enabled ECL, but in your setup it may be easy to try out.

comment:8 in reply to: ↑ 7 ; follow-up: Changed 3 years ago by pcpa

Replying to nbruin:

There may be a boot-time option for ECL that gets around this problem. See ECL doc. If you can set SIGNAL_HANDLING_THREAD to false when calling cl_boot, you may be able to force ECL to not set up a separate signal handling thread, even if threading is enabled. I cannot test this here because I do not have a thread-enabled ECL, but in your setup it may be easy to try out.

Thanks! This indeed appears to correct the problem.

[Updating patch and simple session log attachments next]

Changed 3 years ago by pcpa

ecl-log.txt

comment:9 in reply to: ↑ 8 Changed 3 years ago by nbruin

Replying to pcpa:

Thanks! This indeed appears to correct the problem.

Wonderful. The ecl_set_option(ECL_OPT_SIGNAL_HANDLING_THREAD, 0) is completely uncontroversial. There is a remote possibility that specifying this option might be enough to dissuade Boehm GC from using SIGPWR etc. Given that this fixes something that shouldn't occur in standard Sage configurations, such a minimal fix would be preferable. If that's enough, I can give your amended patch a positive review right away.

If it doesn't, I think SIGPWR and SIGXCPU are such exotic signals (and Sage doesn't do anything special with them anyway) that I think it should be acceptable to relinquish them to ECLs handlers. You could add them to cdef extern int SIGINT, SIGBUS, SIGSEGV (line 48 in ecl.pyx) to refer to them by name.

If they do need to be given to ecl, they probably need treatment in ecl_sig_on and ecl_sig_off as well. That doesn't make it thread-safe yet (imagine 2 sage threads, one of which executing ECL code that triggers a GC. What's the other thread going to do with the SIGPWR?). Perhaps jdemeyer can give some advice what to do in that case.

comment:10 Changed 3 years ago by pcpa

I just did a build of ecl.so without patching the code to save/restore signal handlers and it does still work, so, it is only required the call to ecl_set_option(). As long as there is only one thread, it should be ok, as Bohem GC for sure will not pthread_kill(pthread_self(), SIG{PWR,XCPU}), and one should not inject whatever code is required to create a thread in ecl_eval().

Only testing to see what would happen with multiple threads, what would be desirable would be to have "a single Bohem GC instance". In one quick test I did, it failed miserably, that was to link libcsage.so to -lpthread, do a s/sigprocmask/pthread_sigmask/ including gc.h and calling GC_init() before setting signal handles in interrupt.c. It did print some weird messages rooted in the cl_boot() call and crashed.

Updating again the patch, that now is only the ecl_set_option call, and sync to missing ECL_OPT values.

comment:11 Changed 3 years ago by nbruin

  • Authors set to Paulo César Pereira de Andrade
  • Priority changed from major to minor
  • Reviewers set to Nils Bruin
  • Status changed from new to needs_review

comment:12 Changed 3 years ago by nbruin

  • Status changed from needs_review to positive_review

The attached patch fixes something for non-standard Sage configurations. The change cannot possibly affect standard sage configurations negatively because we really do not want a separate signal handling thread in ECL (that messes thing up in exactly the way the author was observing). So: positive review from me.

Thanks to Paulo for tracking this down.

comment:13 Changed 3 years ago by nbruin

  • Status changed from positive_review to needs_work

Ouch, sorry. I tried

> sage -sh
> cd $SAGE_ROOT/devel/sage/sage
> hg qimport <patch file>
> hg qpush

and I get failures:

applying sage-4.7.1-ecl_module.patch
unable to find 'spkg/build/sage-4.7.1/sage/libs/ecl.pxd' for patching
1 out of 1 hunks FAILED -- saving rejects to file spkg/build/sage-4.7.1/sage/libs/ecl.pxd.rej
unable to find 'spkg/build/sage-4.7.1/sage/libs/ecl.pyx' for patching
1 out of 1 hunks FAILED -- saving rejects to file spkg/build/sage-4.7.1/sage/libs/ecl.pyx.rej
patch failed, unable to continue (try -v)
patch failed, rejects left in working dir
errors during apply, please fix and refresh sage-4.7.1-ecl_module.patch

It looks like the patch file isn't formatted properly! Can you follow the sage instructions for creating a patch file that applies cleanly to the distribution? The proposed change is fine, but we have to make sure the patch file is formatted so that the distribution manager can easily apply it.

While you're at it, also make sure to include a meaningful commit message mentioning #11752.

comment:14 Changed 3 years ago by pcpa

Sorry, the patch is  formatted for the procedure I use to rpmbuild sagemath. I am not much used to mercurial... Based on log, it should be something like "hg import -p4 sage-4.7.1-ecl_module.patch".

I will edit the patch to look like patches from raw mode view of commits to hg.sagemath.org, and add a description to the "patch header".

Changed 3 years ago by pcpa

sage-4.7.1-ecl_module.patch

Changed 3 years ago by nbruin

Set an option for ecl's cl_boot

comment:15 Changed 3 years ago by nbruin

  • Status changed from needs_work to positive_review

Thanks! Patch applied but it still didn't quite look like a "mercurial" patch (things like date and author missing) so I reexported the patch with "hg export qtip". I hope this patch is OK.

comment:16 Changed 3 years ago by nbruin

  • Description modified (diff)

comment:17 Changed 3 years ago by fbissey

Looks good to me too. Hopefully it will go in the next alpha.

comment:18 Changed 3 years ago by leif

  • Description modified (diff)

Please use proper URLs (in this case to the raw attachment) or trac wiki mark-up ("[attachment:here_comes_the_filename]") when referring to patches (or spkgs) in the description.

Also, an attachment comment "Apply only this patch." can make life easier.

comment:19 Changed 3 years ago by leif

  • Merged in set to sage-4.7.2.alpha3
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.