Sage: Ticket #13437: Clean up SIGALRM handling in p_iter_fork
https://trac.sagemath.org/ticket/13437
<p>
<code>sage.parallel.use_fork.p_iter_fork</code> uses SIGALRM for timeouts to kill workers that are taking too long. It could be that the use of signals there isn't as robust as it could be. Perhaps clean it up?
</p>
en-usSagehttps://trac.sagemath.org/chrome/site/logo_sagemath_trac.png
https://trac.sagemath.org/ticket/13437
Trac 1.1.6nbruinFri, 07 Sep 2012 04:35:34 GMTattachment set
https://trac.sagemath.org/ticket/13437
https://trac.sagemath.org/ticket/13437
<ul>
<li><strong>attachment</strong>
set to <em>trac_13437-sigalrm.patch</em>
</li>
</ul>
<p>
First try at cleaner use of signal.alarm
</p>
TicketSimonKingFri, 07 Sep 2012 06:16:52 GMTcc set
https://trac.sagemath.org/ticket/13437#comment:1
https://trac.sagemath.org/ticket/13437#comment:1
<ul>
<li><strong>cc</strong>
<em>SimonKing</em> added
</li>
</ul>
TicketSimonKingFri, 07 Sep 2012 07:49:32 GMT
https://trac.sagemath.org/ticket/13437#comment:2
https://trac.sagemath.org/ticket/13437#comment:2
<p>
This patch seems to solve the issue found with <a class="closed ticket" href="https://trac.sagemath.org/ticket/715" title="defect: Parents probably not reclaimed due to too much caching (closed: fixed)">#715</a>+<a class="closed ticket" href="https://trac.sagemath.org/ticket/11521" title="defect: Use weak references to cache homsets (closed: fixed)">#11521</a> on bsd.math (which is OS X on Intel chips, IIRC).
</p>
TicketSimonKingFri, 07 Sep 2012 07:51:30 GMT
https://trac.sagemath.org/ticket/13437#comment:3
https://trac.sagemath.org/ticket/13437#comment:3
<p>
No, it only makes the problem vanish in an interactive gdb'ed session. In the doctest, I still get:
</p>
<pre class="wiki">bash-3.2$ ../../sage -t sage/misc/cachefunc.pyx
sage -t "devel/sage-main/sage/misc/cachefunc.pyx"
The doctested process was killed by signal 11
[19.9 s]
----------------------------------------------------------------------
The following tests failed:
sage -t "devel/sage-main/sage/misc/cachefunc.pyx" # Killed/crashed
Total time for all tests: 19.9 seconds
bash-3.2$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
trac_13437-sigalrm.patch
</pre>
TicketnbruinFri, 07 Sep 2012 08:40:56 GMT
https://trac.sagemath.org/ticket/13437#comment:4
https://trac.sagemath.org/ticket/13437#comment:4
<p>
That's not too surprising. A SEGV is quite different from an EINTR and with what we know now, it is not so far fetched that under GDB's control, the occurrence of EINTRs is a little different. I think there is merit to the proposed patch here for general reasons.
</p>
<p>
For your SEGV problem: just dig into sage-doctest and extract the ....py file that gets run to test the doctests. That likely still triggers the SEGV and then you can run that file under other controls and/or with instrumentation. At least it'll get the doctest pipes out of your way. It's also a good way to see that doctests are run in a different environment than interactive sage use.
</p>
TicketjdemeyerFri, 07 Sep 2012 09:24:06 GMTattachment set
https://trac.sagemath.org/ticket/13437
https://trac.sagemath.org/ticket/13437
<ul>
<li><strong>attachment</strong>
set to <em>13437_ALRM_no_interrupt.patch</em>
</li>
</ul>
TicketjdemeyerFri, 07 Sep 2012 09:26:20 GMT
https://trac.sagemath.org/ticket/13437#comment:5
https://trac.sagemath.org/ticket/13437#comment:5
<p>
Apply <a class="attachment" href="https://trac.sagemath.org/attachment/ticket/13437/13437_ALRM_no_interrupt.patch" title="Attachment '13437_ALRM_no_interrupt.patch' in Ticket #13437">13437_ALRM_no_interrupt.patch</a><a class="trac-rawlink" href="https://trac.sagemath.org/raw-attachment/ticket/13437/13437_ALRM_no_interrupt.patch" title="Download"></a> and call
</p>
<pre class="wiki">sage.ext.c_lib.SIGALRM_set_interruptible(0)
</pre><p>
<strong>after every call</strong> to <code>signal.signal(SIGALRM,...)</code> to disable the interrupting of system calls due to <code>SIGALRM</code> (not sure whether you want this, though).
</p>
TicketjdemeyerFri, 07 Sep 2012 09:28:07 GMT
https://trac.sagemath.org/ticket/13437#comment:6
https://trac.sagemath.org/ticket/13437#comment:6
<p>
Actually, there is a Python interface for this: <a class="ext-link" href="http://docs.python.org/library/signal.html#signal.siginterrupt"><span class="icon"></span>http://docs.python.org/library/signal.html#signal.siginterrupt</a>
</p>
TicketjdemeyerFri, 07 Sep 2012 09:37:25 GMT
https://trac.sagemath.org/ticket/13437#comment:7
https://trac.sagemath.org/ticket/13437#comment:7
<p>
Did the <a class="closed ticket" href="https://trac.sagemath.org/ticket/715" title="defect: Parents probably not reclaimed due to too much caching (closed: fixed)">#715</a> bug ever appear on a non-OSX system?
</p>
TicketSimonKingFri, 07 Sep 2012 11:23:33 GMT
https://trac.sagemath.org/ticket/13437#comment:8
https://trac.sagemath.org/ticket/13437#comment:8
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/13437#comment:7" title="Comment 7">jdemeyer</a>:
</p>
<blockquote class="citation">
<p>
Did the <a class="closed ticket" href="https://trac.sagemath.org/ticket/715" title="defect: Parents probably not reclaimed due to too much caching (closed: fixed)">#715</a> bug ever appear on a non-OSX system?
</p>
</blockquote>
<p>
Yes and no. First of all, I have never seen the bug with <a class="closed ticket" href="https://trac.sagemath.org/ticket/715" title="defect: Parents probably not reclaimed due to too much caching (closed: fixed)">#715</a> alone.
</p>
<ul><li>On bsd.math, there is the signal 11 issue with <a class="closed ticket" href="https://trac.sagemath.org/ticket/715" title="defect: Parents probably not reclaimed due to too much caching (closed: fixed)">#715</a> plus <a class="closed ticket" href="https://trac.sagemath.org/ticket/11521" title="defect: Use weak references to cache homsets (closed: fixed)">#11521</a> (which are supposed to be merged together anyway), in sage/misc/cachefunc.pyx
</li><li>On <a class="ext-link" href="http://patchbot.sagemath.org/log/13370/Fedora/17/x86_64/3.5.2-3.fc17.x86_64/volker-desktop.stp.dias.ie/2012-08-27%2023:35:38%20+0100"><span class="icon"></span>Volker's patchbot</a>, a signal 11 arose with an earlier version of <a class="closed ticket" href="https://trac.sagemath.org/ticket/13370" title="defect: Do not cache the result of is_Field externally (closed: fixed)">#13370</a> (which depends on <a class="closed ticket" href="https://trac.sagemath.org/ticket/11521" title="defect: Use weak references to cache homsets (closed: fixed)">#11521</a>). The error was with sage/rings/polynomial/polynomial_real_mpfr_dense.pyx.
</li><li>With <a class="closed ticket" href="https://trac.sagemath.org/ticket/12876" title="enhancement: Fix element and parent classes of Hom categories to be abstract, and ... (closed: fixed)">#12876</a> (which depends on <a class="closed ticket" href="https://trac.sagemath.org/ticket/11521" title="defect: Use weak references to cache homsets (closed: fixed)">#11521</a> as well), there is a signal 11 in sage/rings/polynomial/infinite_polynomial_ring.py, again on <a class="ext-link" href="http://patchbot.sagemath.org/log/12876/Fedora/17/x86_64/3.5.2-3.fc17.x86_64/volker-desktop.stp.dias.ie/2012-09-03%2013:32:42%20+0100"><span class="icon"></span>Volker's patchbot</a>, and also on <a class="ext-link" href="http://patchbot.sagemath.org/log/12876/Fedora/17/x86_64/3.4.6-2.fc17.x86_64/sagepad.org/2012-09-02%2002:07:43%20+0200"><span class="icon"></span>sagepad.org</a>
</li></ul>
TicketjdemeyerFri, 07 Sep 2012 11:49:52 GMT
https://trac.sagemath.org/ticket/13437#comment:9
https://trac.sagemath.org/ticket/13437#comment:9
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/13437#comment:8" title="Comment 8">SimonKing</a>:
</p>
<blockquote class="citation">
<ul><li>On bsd.math, there is the signal 11 issue with <a class="closed ticket" href="https://trac.sagemath.org/ticket/715" title="defect: Parents probably not reclaimed due to too much caching (closed: fixed)">#715</a> plus <a class="closed ticket" href="https://trac.sagemath.org/ticket/11521" title="defect: Use weak references to cache homsets (closed: fixed)">#11521</a> (which are supposed to be merged together anyway), in sage/misc/cachefunc.pyx
</li><li>On <a class="ext-link" href="http://patchbot.sagemath.org/log/13370/Fedora/17/x86_64/3.5.2-3.fc17.x86_64/volker-desktop.stp.dias.ie/2012-08-27%2023:35:38%20+0100"><span class="icon"></span>Volker's patchbot</a>, a signal 11 arose with an earlier version of <a class="closed ticket" href="https://trac.sagemath.org/ticket/13370" title="defect: Do not cache the result of is_Field externally (closed: fixed)">#13370</a> (which depends on <a class="closed ticket" href="https://trac.sagemath.org/ticket/11521" title="defect: Use weak references to cache homsets (closed: fixed)">#11521</a>). The error was with sage/rings/polynomial/polynomial_real_mpfr_dense.pyx.
</li><li>With <a class="closed ticket" href="https://trac.sagemath.org/ticket/12876" title="enhancement: Fix element and parent classes of Hom categories to be abstract, and ... (closed: fixed)">#12876</a> (which depends on <a class="closed ticket" href="https://trac.sagemath.org/ticket/11521" title="defect: Use weak references to cache homsets (closed: fixed)">#11521</a> as well), there is a signal 11 in sage/rings/polynomial/infinite_polynomial_ring.py, again on <a class="ext-link" href="http://patchbot.sagemath.org/log/12876/Fedora/17/x86_64/3.5.2-3.fc17.x86_64/volker-desktop.stp.dias.ie/2012-09-03%2013:32:42%20+0100"><span class="icon"></span>Volker's patchbot</a>, and also on <a class="ext-link" href="http://patchbot.sagemath.org/log/12876/Fedora/17/x86_64/3.4.6-2.fc17.x86_64/sagepad.org/2012-09-02%2002:07:43%20+0200"><span class="icon"></span>sagepad.org</a>
</li></ul></blockquote>
<p>
Is there any reason to assume that these 3 failures are related? Since the doctest failures are in different files, I would guess that we're seeing 3 different bugs.
</p>
TicketSimonKingFri, 07 Sep 2012 12:19:39 GMT
https://trac.sagemath.org/ticket/13437#comment:10
https://trac.sagemath.org/ticket/13437#comment:10
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/13437#comment:9" title="Comment 9">jdemeyer</a>:
</p>
<blockquote class="citation">
<p>
Is there any reason to assume that these 3 failures are related? Since the doctest failures are in different files, I would guess that we're seeing 3 different bugs.
</p>
</blockquote>
<p>
I can't tell. I have no access to a machine where these errors occur, nobody has ever produced a backtrace of it, and I even don't know in what test in each file the error occurs, what happens with gdb, what happens verbose, etc.
The three errors just "look" the same, in the sense of the doctester reports the same problem. That's why I think that the underlying problem could be the same.
</p>
TicketSimonKingFri, 07 Sep 2012 12:27:59 GMT
https://trac.sagemath.org/ticket/13437#comment:11
https://trac.sagemath.org/ticket/13437#comment:11
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/13437#comment:5" title="Comment 5">jdemeyer</a>:
</p>
<blockquote class="citation">
<p>
Apply <a class="attachment" href="https://trac.sagemath.org/attachment/ticket/13437/13437_ALRM_no_interrupt.patch" title="Attachment '13437_ALRM_no_interrupt.patch' in Ticket #13437">13437_ALRM_no_interrupt.patch</a><a class="trac-rawlink" href="https://trac.sagemath.org/raw-attachment/ticket/13437/13437_ALRM_no_interrupt.patch" title="Download"></a> and call
</p>
<pre class="wiki">sage.ext.c_lib.SIGALRM_set_interruptible(0)
</pre><p>
<strong>after every call</strong> to <code>signal.signal(SIGALRM,...)</code> to disable the interrupting of system calls due to <code>SIGALRM</code> (not sure whether you want this, though).
</p>
</blockquote>
<p>
I applied the patch and changed sage/parallel/use_fork.py as follows (this should cover all instances of <code>signal.signal(SIGALRM,...)</code> in that file):
</p>
<pre class="wiki">diff --git a/sage/parallel/use_fork.py b/sage/parallel/use_fork.py
--- a/sage/parallel/use_fork.py
+++ b/sage/parallel/use_fork.py
@@ -11,6 +11,8 @@
# http://www.gnu.org/licenses/
################################################################################
+from sage.ext.c_lib import SIGALRM_set_interruptible
+
class p_iter_fork:
"""
A parallel iterator implemented using ``fork()``.
@@ -105,6 +107,7 @@
#and install our handler (saving the old one!)
sigalrm_orig=signal.signal(signal.SIGALRM,my_alrm_handler)
+ SIGALRM_set_interruptible(0)
try:
while len(v) > 0 or len(workers) > 0:
# Spawn up to n subprocesses
@@ -179,6 +182,7 @@
finally:
signal.signal(signal.SIGALRM,sigalrm_orig) #restore original SIGALRM handler
+ SIGALRM_set_interruptible(0)
# Clean up all temporary files.
try:
for X in os.listdir(dir):
</pre><p>
However, the signal 11 problem persists.
</p>
TicketSimonKingFri, 07 Sep 2012 12:29:46 GMT
https://trac.sagemath.org/ticket/13437#comment:12
https://trac.sagemath.org/ticket/13437#comment:12
<p>
PS: I meant to say that I had applied <em>both</em> patches plus the diff given in my previous post.
</p>
TicketnbruinFri, 07 Sep 2012 16:02:23 GMT
https://trac.sagemath.org/ticket/13437#comment:13
https://trac.sagemath.org/ticket/13437#comment:13
<p>
Replying to <a class="ticket" href="https://trac.sagemath.org/ticket/13437#comment:12" title="Comment 12">SimonKing</a>:
</p>
<blockquote class="citation">
<p>
PS: I meant to say that I had applied <em>both</em> patches plus the diff given in my previous post.
</p>
</blockquote>
<p>
The whole point of setting an alarm here is to interrupt a system call. Sage's default SIGALRM handler exits. So once we're not interested in alarms anymore, we should make sure we cancel any that we may have scheduled -- not that they get ignored.
</p>
<p>
It's mildly interesting to see that on at least one system, apparently a SIGALRM did trigger an unexpected system call termination. The <code>sigalrm.patch</code> changes two things: The type of exception that is caught (was a <code>RuntimeError</code> raised by the signal handler, now an <code>OSError</code> raised by python upon receiving an <code>EINTR</code> in os.wait) and the scheduling of alarms. Either one could have an effect, but note that it's quite likely that the observed differences in behaviour are almost certainly due to gdb's interference.
</p>
TicketjdemeyerWed, 15 May 2013 13:42:57 GMTstatus, milestone changed; dependencies set
https://trac.sagemath.org/ticket/13437#comment:14
https://trac.sagemath.org/ticket/13437#comment:14
<ul>
<li><strong>status</strong>
changed from <em>new</em> to <em>needs_review</em>
</li>
<li><strong>dependencies</strong>
set to <em>#13311</em>
</li>
<li><strong>milestone</strong>
changed from <em>sage-5.10</em> to <em>sage-duplicate/invalid/wontfix</em>
</li>
</ul>
<p>
What exactly is the problem that this ticket is trying to fix? Because there is no longer a problem with <a class="closed ticket" href="https://trac.sagemath.org/ticket/715" title="defect: Parents probably not reclaimed due to too much caching (closed: fixed)">#715</a> + <a class="closed ticket" href="https://trac.sagemath.org/ticket/11521" title="defect: Use weak references to cache homsets (closed: fixed)">#11521</a>. There is also <a class="closed ticket" href="https://trac.sagemath.org/ticket/13311" title="defect: alarm() doesn't work for Cython code (closed: fixed)">#13311</a> which changes the working of <code>alarm()</code>.
</p>
<p>
Proposal: close this as invalid. But if the problem can be more clearly stated and still persists with <a class="closed ticket" href="https://trac.sagemath.org/ticket/13311" title="defect: alarm() doesn't work for Cython code (closed: fixed)">#13311</a>, go ahead.
</p>
TicketjdemeyerWed, 22 May 2013 09:10:08 GMTstatus changed; reviewer, resolution set
https://trac.sagemath.org/ticket/13437#comment:15
https://trac.sagemath.org/ticket/13437#comment:15
<ul>
<li><strong>status</strong>
changed from <em>needs_review</em> to <em>closed</em>
</li>
<li><strong>reviewer</strong>
set to <em>Jeroen Demeyer</em>
</li>
<li><strong>resolution</strong>
set to <em>invalid</em>
</li>
</ul>
Ticket