Opened 7 years ago

Closed 6 years ago

#13437 closed defect (invalid)

Clean up SIGALRM handling in p_iter_fork

Reported by: nbruin Owned by: tbd
Priority: minor Milestone: sage-duplicate/invalid/wontfix
Component: performance Keywords:
Cc: SimonKing Merged in:
Authors: Reviewers: Jeroen Demeyer
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: #13311 Stopgaps:

Description

sage.parallel.use_fork.p_iter_fork uses SIGALRM for timeouts to kill workers that are taking too long. It could be that the use of signals there isn't as robust as it could be. Perhaps clean it up?

Attachments (2)

trac_13437-sigalrm.patch (3.3 KB) - added by nbruin 7 years ago.
First try at cleaner use of signal.alarm
13437_ALRM_no_interrupt.patch (575 bytes) - added by jdemeyer 7 years ago.

Download all attachments as: .zip

Change History (17)

Changed 7 years ago by nbruin

First try at cleaner use of signal.alarm

comment:1 Changed 7 years ago by SimonKing

  • Cc SimonKing added

comment:2 Changed 7 years ago by SimonKing

This patch seems to solve the issue found with #715+#11521 on bsd.math (which is OS X on Intel chips, IIRC).

comment:3 follow-up: Changed 7 years ago by SimonKing

No, it only makes the problem vanish in an interactive gdb'ed session. In the doctest, I still get:

bash-3.2$ ../../sage -t sage/misc/cachefunc.pyx
sage -t  "devel/sage-main/sage/misc/cachefunc.pyx"          
The doctested process was killed by signal 11
         [19.9 s]
 
----------------------------------------------------------------------
The following tests failed:


        sage -t  "devel/sage-main/sage/misc/cachefunc.pyx" # Killed/crashed
Total time for all tests: 19.9 seconds
bash-3.2$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
trac_13437-sigalrm.patch

comment:4 in reply to: ↑ 3 Changed 7 years ago by nbruin

That's not too surprising. A SEGV is quite different from an EINTR and with what we know now, it is not so far fetched that under GDB's control, the occurrence of EINTRs is a little different. I think there is merit to the proposed patch here for general reasons.

For your SEGV problem: just dig into sage-doctest and extract the ....py file that gets run to test the doctests. That likely still triggers the SEGV and then you can run that file under other controls and/or with instrumentation. At least it'll get the doctest pipes out of your way. It's also a good way to see that doctests are run in a different environment than interactive sage use.

Changed 7 years ago by jdemeyer

comment:5 follow-up: Changed 7 years ago by jdemeyer

Apply 13437_ALRM_no_interrupt.patch and call

sage.ext.c_lib.SIGALRM_set_interruptible(0)

after every call to signal.signal(SIGALRM,...) to disable the interrupting of system calls due to SIGALRM (not sure whether you want this, though).

comment:6 Changed 7 years ago by jdemeyer

Actually, there is a Python interface for this: http://docs.python.org/library/signal.html#signal.siginterrupt

comment:7 follow-up: Changed 7 years ago by jdemeyer

Did the #715 bug ever appear on a non-OSX system?

comment:8 in reply to: ↑ 7 ; follow-up: Changed 7 years ago by SimonKing

Replying to jdemeyer:

Did the #715 bug ever appear on a non-OSX system?

Yes and no. First of all, I have never seen the bug with #715 alone.

  • On bsd.math, there is the signal 11 issue with #715 plus #11521 (which are supposed to be merged together anyway), in sage/misc/cachefunc.pyx
  • On Volker's patchbot, a signal 11 arose with an earlier version of #13370 (which depends on #11521). The error was with sage/rings/polynomial/polynomial_real_mpfr_dense.pyx.
  • With #12876 (which depends on #11521 as well), there is a signal 11 in sage/rings/polynomial/infinite_polynomial_ring.py, again on Volker's patchbot, and also on sagepad.org

comment:9 in reply to: ↑ 8 ; follow-up: Changed 7 years ago by jdemeyer

Replying to SimonKing:

  • On bsd.math, there is the signal 11 issue with #715 plus #11521 (which are supposed to be merged together anyway), in sage/misc/cachefunc.pyx
  • On Volker's patchbot, a signal 11 arose with an earlier version of #13370 (which depends on #11521). The error was with sage/rings/polynomial/polynomial_real_mpfr_dense.pyx.
  • With #12876 (which depends on #11521 as well), there is a signal 11 in sage/rings/polynomial/infinite_polynomial_ring.py, again on Volker's patchbot, and also on sagepad.org

Is there any reason to assume that these 3 failures are related? Since the doctest failures are in different files, I would guess that we're seeing 3 different bugs.

comment:10 in reply to: ↑ 9 Changed 7 years ago by SimonKing

Replying to jdemeyer:

Is there any reason to assume that these 3 failures are related? Since the doctest failures are in different files, I would guess that we're seeing 3 different bugs.

I can't tell. I have no access to a machine where these errors occur, nobody has ever produced a backtrace of it, and I even don't know in what test in each file the error occurs, what happens with gdb, what happens verbose, etc. The three errors just "look" the same, in the sense of the doctester reports the same problem. That's why I think that the underlying problem could be the same.

comment:11 in reply to: ↑ 5 Changed 7 years ago by SimonKing

Replying to jdemeyer:

Apply 13437_ALRM_no_interrupt.patch and call

sage.ext.c_lib.SIGALRM_set_interruptible(0)

after every call to signal.signal(SIGALRM,...) to disable the interrupting of system calls due to SIGALRM (not sure whether you want this, though).

I applied the patch and changed sage/parallel/use_fork.py as follows (this should cover all instances of signal.signal(SIGALRM,...) in that file):

diff --git a/sage/parallel/use_fork.py b/sage/parallel/use_fork.py
--- a/sage/parallel/use_fork.py
+++ b/sage/parallel/use_fork.py
@@ -11,6 +11,8 @@
 #                  http://www.gnu.org/licenses/
 ################################################################################
 
+from sage.ext.c_lib import SIGALRM_set_interruptible
+
 class p_iter_fork:
     """
     A parallel iterator implemented using ``fork()``.
@@ -105,6 +107,7 @@
 
         #and install our handler (saving the old one!)
         sigalrm_orig=signal.signal(signal.SIGALRM,my_alrm_handler)
+        SIGALRM_set_interruptible(0)
         try:
             while len(v) > 0 or len(workers) > 0:
                 # Spawn up to n subprocesses
@@ -179,6 +182,7 @@
 
         finally:
             signal.signal(signal.SIGALRM,sigalrm_orig) #restore original SIGALRM handler
+            SIGALRM_set_interruptible(0)
             # Clean up all temporary files.
             try:
                 for X in os.listdir(dir):

However, the signal 11 problem persists.

comment:12 follow-up: Changed 7 years ago by SimonKing

PS: I meant to say that I had applied both patches plus the diff given in my previous post.

comment:13 in reply to: ↑ 12 Changed 7 years ago by nbruin

Replying to SimonKing:

PS: I meant to say that I had applied both patches plus the diff given in my previous post.

The whole point of setting an alarm here is to interrupt a system call. Sage's default SIGALRM handler exits. So once we're not interested in alarms anymore, we should make sure we cancel any that we may have scheduled -- not that they get ignored.

It's mildly interesting to see that on at least one system, apparently a SIGALRM did trigger an unexpected system call termination. The sigalrm.patch changes two things: The type of exception that is caught (was a RuntimeError raised by the signal handler, now an OSError raised by python upon receiving an EINTR in os.wait) and the scheduling of alarms. Either one could have an effect, but note that it's quite likely that the observed differences in behaviour are almost certainly due to gdb's interference.

comment:14 Changed 6 years ago by jdemeyer

  • Dependencies set to #13311
  • Milestone changed from sage-5.10 to sage-duplicate/invalid/wontfix
  • Status changed from new to needs_review

What exactly is the problem that this ticket is trying to fix? Because there is no longer a problem with #715 + #11521. There is also #13311 which changes the working of alarm().

Proposal: close this as invalid. But if the problem can be more clearly stated and still persists with #13311, go ahead.

comment:15 Changed 6 years ago by jdemeyer

  • Resolution set to invalid
  • Reviewers set to Jeroen Demeyer
  • Status changed from needs_review to closed
Note: See TracTickets for help on using tickets.