Opened 6 years ago

Closed 6 years ago

#18741 closed defect (fixed)

Random failure in sagespawn.pyx

Reported by: vbraun Owned by:
Priority: major Milestone: sage-6.8
Component: interfaces Keywords: random_fail
Cc: jdemeyer Merged in:
Authors: Jeroen Demeyer Reviewers: Volker Braun
Report Upstream: N/A Work issues:
Branch: 659a1f5 (Commits, GitHub, GitLab) Commit: 659a1f549535ddf1f08b36a33d71cd37b6971bee
Dependencies: Stopgaps:

Status badges

Description (last modified by jdemeyer)

There are at least two races left, see comments for details.

Change History (13)

comment:1 Changed 6 years ago by vbraun

  • Keywords random_fail added

On Arando (already reported at #17686):

sage -t --long src/sage/interfaces/sagespawn.pyx
**********************************************************************
File "src/sage/interfaces/sagespawn.pyx", line 125, in sage.interfaces.sagespawn.SageSpawn.close
Failed example:
    while s.isalive():  # long time (5 seconds)
        sleep(0.1)
Exception raised:
    Traceback (most recent call last):
      File "/home/buildslave-sage/slave/sage_git/build/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 496, in _run
        self.compile_and_execute(example, compiler, test.globs)
      File "/home/buildslave-sage/slave/sage_git/build/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 858, in compile_and_execute
        exec(compiled, globs)
      File "<doctest sage.interfaces.sagespawn.SageSpawn.close[3]>", line 1, in <module>
        while s.isalive():  # long time (5 seconds)
      File "/home/buildslave-sage/slave/sage_git/build/local/lib/python2.7/site-packages/pexpect.py", line 762, in isalive
        pid, status = os.waitpid(self.pid, waitpid_options)
      File "sage/ext/interrupt/interrupt.pyx", line 197, in sage.ext.interrupt.interrupt.sage_python_check_interrupt (/home/buildslave-sage/slave/sage_git/build/src/build/cythonized/sage/ext/interrupt/interrupt.c:1743)
        sig_check()
      File "sage/ext/interrupt/interrupt.pyx", line 86, in sage.ext.interrupt.interrupt.sig_raise_exception (/home/buildslave-sage/slave/sage_git/build/src/build/cythonized/sage/ext/interrupt/interrupt.c:884)
        raise KeyboardInterrupt
    KeyboardInterrupt
**********************************************************************
1 item had failures:
   1 of   5 in sage.interfaces.sagespawn.SageSpawn.close
    [24 tests, 1 failure, 5.08 s]

comment:2 Changed 6 years ago by vbraun

  • Cc jdemeyer added

On OSX:

sage -t --long src/sage/interfaces/sagespawn.pyx
    Bad exit: 1
**********************************************************************
Tests run before process (pid=73714) failed:
sage: from sage.interfaces.sagespawn import SageSpawn ## line 45 ##
sage: SageSpawn("sleep 1", name="Sleeping Beauty") ## line 46 ##
Sleeping Beauty with PID 73715 running /bin/sleep 1
sage: sig_on_count() ## line 48 ##
0
sage: from sage.interfaces.sagespawn import SageSpawn ## line 67 ##
sage: s = SageSpawn("true", name="stupid process") ## line 68 ##
sage: s  # indirect doctest ## line 69 ##
stupid process with PID 73717 running /usr/bin/true
sage: while s.isalive():  # Wait until the process finishes
    sleep(0.1) ## line 71 ##
sage: s  # indirect doctest ## line 73 ##
stupid process finished running /usr/bin/true
sage: sig_on_count() ## line 75 ##
0
sage: from sage.interfaces.sagespawn import SageSpawn ## line 96 ##
sage: s = SageSpawn("sh", ["-c", "while true; do sleep 1; done"]) ## line 97 ##
sage: s._keep_alive() ## line 98 ##
sage: pid = s.pid ## line 99 ##
sage: del s ## line 100 ##
sage: import gc ## line 101 ##
sage: _ = gc.collect() ## line 102 ##
sage: from signal import SIGTERM ## line 107 ##
sage: os.kill(pid, SIGTERM) ## line 108 ##
sage: sig_on_count() ## line 109 ##
0
sage: from sage.interfaces.sagespawn import SageSpawn ## line 122 ##
sage: s = SageSpawn("sleep 1000") ## line 123 ##
sage: s.close() ## line 124 ##
sage: while s.isalive():  # long time (5 seconds)
    sleep(0.1) ## line 125 ##
sage: sig_on_count() ## line 127 ##
0
sage: from sage.interfaces.sagespawn import SageSpawn ## line 157 ##
sage: s = SageSpawn("sh", ["-c", "while true; do sleep 1; done"]) ## line 158 ##
sage: s.terminate_async(interval=0.2) ## line 163 ##
sage: while True:
    try:
        os.kill(s.pid, 0)
    except OSError:
        sleep(0.1)
    else:
        break  # process got killed ## line 164 ##
sage: sig_on_count() ## line 171 ##
0

comment:3 Changed 6 years ago by vbraun

Same on Linux (Snapperkob):

sage -t --long src/sage/interfaces/sagespawn.pyx
    Killed due to terminate
**********************************************************************
Tests run before process (pid=10311) failed:
sage: from sage.interfaces.sagespawn import SageSpawn ## line 45 ##
sage: SageSpawn("sleep 1", name="Sleeping Beauty") ## line 46 ##
Sleeping Beauty with PID 10312 running /bin/sleep 1
sage: sig_on_count() ## line 48 ##
0
sage: from sage.interfaces.sagespawn import SageSpawn ## line 67 ##
sage: s = SageSpawn("true", name="stupid process") ## line 68 ##
sage: s  # indirect doctest ## line 69 ##
stupid process with PID 10316 running /bin/true
sage: while s.isalive():  # Wait until the process finishes
    sleep(0.1) ## line 71 ##
sage: s  # indirect doctest ## line 73 ##
stupid process finished running /bin/true
sage: sig_on_count() ## line 75 ##
0
sage: from sage.interfaces.sagespawn import SageSpawn ## line 96 ##
sage: s = SageSpawn("sh", ["-c", "while true; do sleep 1; done"]) ## line 97 ##
sage: s._keep_alive() ## line 98 ##
sage: pid = s.pid ## line 99 ##
sage: del s ## line 100 ##
sage: import gc ## line 101 ##
sage: _ = gc.collect() ## line 102 ##
sage: from signal import SIGTERM ## line 107 ##
sage: os.kill(pid, SIGTERM) ## line 108 ##
sage: sig_on_count() ## line 109 ##
0
sage: from sage.interfaces.sagespawn import SageSpawn ## line 122 ##
sage: s = SageSpawn("sleep 1000") ## line 123 ##
sage: s.close() ## line 124 ##
sage: while s.isalive():  # long time (5 seconds)
    sleep(0.1) ## line 125 ##
sage: sig_on_count() ## line 127 ##
0
sage: from sage.interfaces.sagespawn import SageSpawn ## line 157 ##
sage: s = SageSpawn("sh", ["-c", "while true; do sleep 1; done"]) ## line 158 ##
sage: s.terminate_async(interval=0.2) ## line 163 ##
sage: while True:
    try:
        os.kill(s.pid, 0)
    except OSError:
        sleep(0.1)
    else:
        break  # process got killed ## line 164 ##
sage: sig_on_count() ## line 171 ##
0

comment:4 Changed 6 years ago by jdemeyer

  • Authors set to Jeroen Demeyer

comment:5 Changed 6 years ago by jdemeyer

On my laptop, I can reproduce the first two, not the third. Those first two seem to be different issues. The first is the process-group-was-not-yet-changed race I mentioned in #17686. But I still need to figure out the "bad exit".

comment:6 Changed 6 years ago by jdemeyer

  • Branch set to u/jdemeyer/random_failure_in_sagespawn_pyx

comment:7 Changed 6 years ago by jdemeyer

  • Commit set to 659a1f549535ddf1f08b36a33d71cd37b6971bee
  • Status changed from new to needs_review

New commits:

659a1f5Fix some race conditions in SageSpawn

comment:8 Changed 6 years ago by jdemeyer

  • Description modified (diff)

comment:9 follow-up: Changed 6 years ago by vbraun

Looks good but the 0.125s is IMHO too slow in the poll; If used with a quick-running executable in @parallel then we might spend most of the time waiting. IMHO we should just sleep(0) (yield to the thread scheduler). Also, the total timeout should probably be larger than 20*0.125 which you should be able to hit in case of low memory / swapping.

comment:10 in reply to: ↑ 9 Changed 6 years ago by jdemeyer

Replying to vbraun:

Looks good but the 0.125s is IMHO too slow in the poll; If used with a quick-running executable in @parallel then we might spend most of the time waiting.

That code is only for the case where you spawn a pexpect subprocess (not for @parallel) and where the process is started and immediately killed. If you're doing that too often, you're doing something wrong anyway.

comment:11 Changed 6 years ago by vbraun

  • Status changed from needs_review to positive_review

comment:12 Changed 6 years ago by vbraun

  • Reviewers set to Volker Braun

comment:13 Changed 6 years ago by vbraun

  • Branch changed from u/jdemeyer/random_failure_in_sagespawn_pyx to 659a1f549535ddf1f08b36a33d71cd37b6971bee
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.