Opened 7 years ago
Closed 6 years ago
#13437 closed defect (invalid)
Clean up SIGALRM handling in p_iter_fork
Reported by: | nbruin | Owned by: | tbd |
---|---|---|---|
Priority: | minor | Milestone: | sage-duplicate/invalid/wontfix |
Component: | performance | Keywords: | |
Cc: | SimonKing | Merged in: | |
Authors: | Reviewers: | Jeroen Demeyer | |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | #13311 | Stopgaps: |
Description
sage.parallel.use_fork.p_iter_fork
uses SIGALRM for timeouts to kill workers that are taking too long. It could be that the use of signals there isn't as robust as it could be. Perhaps clean it up?
Attachments (2)
Change History (17)
Changed 7 years ago by
comment:1 Changed 7 years ago by
- Cc SimonKing added
comment:2 Changed 7 years ago by
comment:3 follow-up: ↓ 4 Changed 7 years ago by
No, it only makes the problem vanish in an interactive gdb'ed session. In the doctest, I still get:
bash-3.2$ ../../sage -t sage/misc/cachefunc.pyx sage -t "devel/sage-main/sage/misc/cachefunc.pyx" The doctested process was killed by signal 11 [19.9 s] ---------------------------------------------------------------------- The following tests failed: sage -t "devel/sage-main/sage/misc/cachefunc.pyx" # Killed/crashed Total time for all tests: 19.9 seconds bash-3.2$ hg qa trac_715_combined.patch trac_715_local_refcache.patch trac_715_safer.patch trac_715_specification.patch trac_11521_homset_weakcache_combined.patch trac_11521_callback.patch trac_13437-sigalrm.patch
comment:4 in reply to: ↑ 3 Changed 7 years ago by
That's not too surprising. A SEGV is quite different from an EINTR and with what we know now, it is not so far fetched that under GDB's control, the occurrence of EINTRs is a little different. I think there is merit to the proposed patch here for general reasons.
For your SEGV problem: just dig into sage-doctest and extract the ....py file that gets run to test the doctests. That likely still triggers the SEGV and then you can run that file under other controls and/or with instrumentation. At least it'll get the doctest pipes out of your way. It's also a good way to see that doctests are run in a different environment than interactive sage use.
Changed 7 years ago by
comment:5 follow-up: ↓ 11 Changed 7 years ago by
Apply 13437_ALRM_no_interrupt.patch and call
sage.ext.c_lib.SIGALRM_set_interruptible(0)
after every call to signal.signal(SIGALRM,...)
to disable the interrupting of system calls due to SIGALRM
(not sure whether you want this, though).
comment:6 Changed 7 years ago by
Actually, there is a Python interface for this: http://docs.python.org/library/signal.html#signal.siginterrupt
comment:7 follow-up: ↓ 8 Changed 7 years ago by
Did the #715 bug ever appear on a non-OSX system?
comment:8 in reply to: ↑ 7 ; follow-up: ↓ 9 Changed 7 years ago by
Replying to jdemeyer:
Did the #715 bug ever appear on a non-OSX system?
Yes and no. First of all, I have never seen the bug with #715 alone.
- On bsd.math, there is the signal 11 issue with #715 plus #11521 (which are supposed to be merged together anyway), in sage/misc/cachefunc.pyx
- On Volker's patchbot, a signal 11 arose with an earlier version of #13370 (which depends on #11521). The error was with sage/rings/polynomial/polynomial_real_mpfr_dense.pyx.
- With #12876 (which depends on #11521 as well), there is a signal 11 in sage/rings/polynomial/infinite_polynomial_ring.py, again on Volker's patchbot, and also on sagepad.org
comment:9 in reply to: ↑ 8 ; follow-up: ↓ 10 Changed 7 years ago by
Replying to SimonKing:
- On bsd.math, there is the signal 11 issue with #715 plus #11521 (which are supposed to be merged together anyway), in sage/misc/cachefunc.pyx
- On Volker's patchbot, a signal 11 arose with an earlier version of #13370 (which depends on #11521). The error was with sage/rings/polynomial/polynomial_real_mpfr_dense.pyx.
- With #12876 (which depends on #11521 as well), there is a signal 11 in sage/rings/polynomial/infinite_polynomial_ring.py, again on Volker's patchbot, and also on sagepad.org
Is there any reason to assume that these 3 failures are related? Since the doctest failures are in different files, I would guess that we're seeing 3 different bugs.
comment:10 in reply to: ↑ 9 Changed 7 years ago by
Replying to jdemeyer:
Is there any reason to assume that these 3 failures are related? Since the doctest failures are in different files, I would guess that we're seeing 3 different bugs.
I can't tell. I have no access to a machine where these errors occur, nobody has ever produced a backtrace of it, and I even don't know in what test in each file the error occurs, what happens with gdb, what happens verbose, etc. The three errors just "look" the same, in the sense of the doctester reports the same problem. That's why I think that the underlying problem could be the same.
comment:11 in reply to: ↑ 5 Changed 7 years ago by
Replying to jdemeyer:
Apply 13437_ALRM_no_interrupt.patch and call
sage.ext.c_lib.SIGALRM_set_interruptible(0)after every call to
signal.signal(SIGALRM,...)
to disable the interrupting of system calls due toSIGALRM
(not sure whether you want this, though).
I applied the patch and changed sage/parallel/use_fork.py as follows (this should cover all instances of signal.signal(SIGALRM,...)
in that file):
diff --git a/sage/parallel/use_fork.py b/sage/parallel/use_fork.py --- a/sage/parallel/use_fork.py +++ b/sage/parallel/use_fork.py @@ -11,6 +11,8 @@ # http://www.gnu.org/licenses/ ################################################################################ +from sage.ext.c_lib import SIGALRM_set_interruptible + class p_iter_fork: """ A parallel iterator implemented using ``fork()``. @@ -105,6 +107,7 @@ #and install our handler (saving the old one!) sigalrm_orig=signal.signal(signal.SIGALRM,my_alrm_handler) + SIGALRM_set_interruptible(0) try: while len(v) > 0 or len(workers) > 0: # Spawn up to n subprocesses @@ -179,6 +182,7 @@ finally: signal.signal(signal.SIGALRM,sigalrm_orig) #restore original SIGALRM handler + SIGALRM_set_interruptible(0) # Clean up all temporary files. try: for X in os.listdir(dir):
However, the signal 11 problem persists.
comment:12 follow-up: ↓ 13 Changed 7 years ago by
PS: I meant to say that I had applied both patches plus the diff given in my previous post.
comment:13 in reply to: ↑ 12 Changed 7 years ago by
Replying to SimonKing:
PS: I meant to say that I had applied both patches plus the diff given in my previous post.
The whole point of setting an alarm here is to interrupt a system call. Sage's default SIGALRM handler exits. So once we're not interested in alarms anymore, we should make sure we cancel any that we may have scheduled -- not that they get ignored.
It's mildly interesting to see that on at least one system, apparently a SIGALRM did trigger an unexpected system call termination. The sigalrm.patch
changes two things: The type of exception that is caught (was a RuntimeError
raised by the signal handler, now an OSError
raised by python upon receiving an EINTR
in os.wait) and the scheduling of alarms. Either one could have an effect, but note that it's quite likely that the observed differences in behaviour are almost certainly due to gdb's interference.
comment:14 Changed 6 years ago by
- Dependencies set to #13311
- Milestone changed from sage-5.10 to sage-duplicate/invalid/wontfix
- Status changed from new to needs_review
comment:15 Changed 6 years ago by
- Resolution set to invalid
- Reviewers set to Jeroen Demeyer
- Status changed from needs_review to closed
First try at cleaner use of signal.alarm