Opened 11 years ago
Closed 11 years ago
#11658 closed defect (fixed)
the timeout option is not working correctly in parallel computing
Reported by: | mderickx | Owned by: | tbd |
---|---|---|---|
Priority: | major | Milestone: | sage-4.7.2 |
Component: | performance | Keywords: | sleep time-out |
Cc: | Merged in: | sage-4.7.2.alpha2 | |
Authors: | Leif Leonhardy | Reviewers: | Volker Braun |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | Stopgaps: |
Description (last modified by )
With a timeout of 5 seconds I should not be able to perform a sleep of 40 seconds!
sage: from time import sleep sage: f=parallel(ncpus=4,timeout=5,verbose=True) sage: g=f(sleep) sage: time list(g([5,10,20,40])) [(((5,), {}), None), (((10,), {}), None), (((20,), {}), None), (((40,), {}), None)] Time: CPU 0.02 s, Wall: 40.07 s
Attachments (1)
Change History (12)
comment:1 Changed 11 years ago by
- Component changed from PLEASE CHANGE to performance
- Description modified (diff)
- Type changed from PLEASE CHANGE to defect
comment:2 Changed 11 years ago by
- Keywords sleep time-out added
- Summary changed from the timout option is not working correctly in parralel computing to the timeout option is not working correctly in parallel computing
comment:3 Changed 11 years ago by
P.S.:
If you use p_iter="fork"
(the only variant that's said to support timeout
), apparently indeed only ncpus
-1 child processes are created, so it seems to be as I guessed.
Depending on the selection / order of arguments, you may well get timeouts for [perhaps only some of] the child processes (i.e. they'll get killed), though not after the time you'd expect.
The behaviour is non-deterministic though, for whatever reason. (Try running the time
command repeatedly, i.e. within a loop.)
comment:4 Changed 11 years ago by
Oh, this is a more funny one (and not due to what I first guessed).
Looking at the code, it does spawn ncpus
processes, but there's a fat bug in the code (re)computing how long to wait the next time (the call to signal.alarm()
):
if timeout: def mysig(a,b): raise RuntimeError, "SIGALRM" oldest = min([X[1] for X in workers.values()]) signal.signal(signal.SIGALRM, mysig) signal.alarm(int(walltime() - oldest)+1)
(X[1]
is the wall time each child process was forked / started.)
This code is executed repeatedly; if wait()
was interrupted by a RuntimeError
triggered by SIGALRM
, it is checked whether any of the child processes timed out, and if so, they get killed.
Changed 11 years ago by
Sage library patch. Corrects time to wait for child processes (before they get killed) in the parallel fork decorator. Based on Sage 4.7.1.rc0.
comment:6 follow-up: ↓ 7 Changed 11 years ago by
- Reviewers set to Volker Braun
- Status changed from needs_review to positive_review
Yes its definitely better to have the timeout depend on the timeout variable :-)
comment:7 in reply to: ↑ 6 Changed 11 years ago by
Replying to vbraun:
Yes its definitely better to have the timeout depend on the timeout variable :-)
:D Just wondering if we should add a doctest for that, but now it has already positive review...
I was going to add one similar to Maarten's example (with some parallel sleeping processes, the default of ncpus
, and a timeout of about 5 seconds), marking it # long time
.
comment:8 Changed 11 years ago by
I already had a patch sorry guy's for making you do double effort but thanks for the quick fix :). I found this bug by reading the source code since I was trying to understand what they where doing. BTW I didn't respond earlier because the mailing system of the trac server was malfunctioning and didn't know others where working on it. The mailing server should work now. At least this message will test that for me :)
comment:9 follow-up: ↓ 10 Changed 11 years ago by
Ok trac wasn't fixed yet :(. Another test
comment:10 in reply to: ↑ 9 Changed 11 years ago by
Replying to mderickx:
Ok trac wasn't fixed yet :(. Another test
For the record, I did get this one. [Another test :) ]
comment:11 Changed 11 years ago by
- Merged in set to sage-4.7.2.alpha2
- Resolution set to fixed
- Status changed from positive_review to closed
Is this really an issue?
I guess the timeout simply sets a
SIGALRM
, which thesleep()
function(s) override, but I may be wrong.