Ticket #11658 (closed defect: fixed)
the timeout option is not working correctly in parallel computing
| Reported by: | mderickx | Owned by: | tbd |
|---|---|---|---|
| Priority: | major | Milestone: | sage-4.7.2 |
| Component: | performance | Keywords: | sleep time-out |
| Cc: | Work issues: | ||
| Report Upstream: | N/A | Reviewers: | Volker Braun |
| Authors: | Leif Leonhardy | Merged in: | sage-4.7.2.alpha2 |
| Dependencies: | Stopgaps: |
Description (last modified by mderickx) (diff)
With a timeout of 5 seconds I should not be able to perform a sleep of 40 seconds!
sage: from time import sleep
sage: f=parallel(ncpus=4,timeout=5,verbose=True)
sage: g=f(sleep)
sage: time list(g([5,10,20,40]))
[(((5,), {}), None), (((10,), {}), None), (((20,), {}), None), (((40,), {}), None)]
Time: CPU 0.02 s, Wall: 40.07 s
Attachments
Change History
comment:1 Changed 22 months ago by mderickx
- Type changed from PLEASE CHANGE to defect
- Component changed from PLEASE CHANGE to performance
- Description modified (diff)
comment:2 Changed 22 months ago by leif
- Keywords sleep time-out added
- Summary changed from the timout option is not working correctly in parralel computing to the timeout option is not working correctly in parallel computing
Is this really an issue?
I guess the timeout simply sets a SIGALRM, which the sleep() function(s) override, but I may be wrong.
comment:3 Changed 22 months ago by leif
P.S.:
If you use p_iter="fork" (the only variant that's said to support timeout), apparently indeed only ncpus-1 child processes are created, so it seems to be as I guessed.
Depending on the selection / order of arguments, you may well get timeouts for [perhaps only some of] the child processes (i.e. they'll get killed), though not after the time you'd expect.
The behaviour is non-deterministic though, for whatever reason. (Try running the time command repeatedly, i.e. within a loop.)
comment:4 Changed 22 months ago by leif
Oh, this is a more funny one (and not due to what I first guessed).
Looking at the code, it does spawn ncpus processes, but there's a fat bug in the code (re)computing how long to wait the next time (the call to signal.alarm()):
if timeout:
def mysig(a,b):
raise RuntimeError, "SIGALRM"
oldest = min([X[1] for X in workers.values()])
signal.signal(signal.SIGALRM, mysig)
signal.alarm(int(walltime() - oldest)+1)
(X[1] is the wall time each child process was forked / started.)
This code is executed repeatedly; if wait() was interrupted by a RuntimeError triggered by SIGALRM, it is checked whether any of the child processes timed out, and if so, they get killed.
Changed 22 months ago by leif
-
attachment
trac_11658-fix_timeout_in_parallel_decorator.sagelib.patch
added
Sage library patch. Corrects time to wait for child processes (before they get killed) in the parallel fork decorator. Based on Sage 4.7.1.rc0.
comment:5 Changed 22 months ago by leif
- Status changed from new to needs_review
A trivial patch is up.
comment:6 follow-up: ↓ 7 Changed 22 months ago by vbraun
- Status changed from needs_review to positive_review
- Reviewers set to Volker Braun
- Authors set to Leif Leonhardy
Yes its definitely better to have the timeout depend on the timeout variable :-)
comment:7 in reply to: ↑ 6 Changed 22 months ago by leif
Replying to vbraun:
Yes its definitely better to have the timeout depend on the timeout variable :-)
:D Just wondering if we should add a doctest for that, but now it has already positive review...
I was going to add one similar to Maarten's example (with some parallel sleeping processes, the default of ncpus, and a timeout of about 5 seconds), marking it # long time.
comment:8 Changed 22 months ago by mderickx
I already had a patch sorry guy's for making you do double effort but thanks for the quick fix :). I found this bug by reading the source code since I was trying to understand what they where doing. BTW I didn't respond earlier because the mailing system of the trac server was malfunctioning and didn't know others where working on it. The mailing server should work now. At least this message will test that for me :)
comment:9 follow-up: ↓ 10 Changed 22 months ago by mderickx
Ok trac wasn't fixed yet :(. Another test
comment:10 in reply to: ↑ 9 Changed 22 months ago by leif
Replying to mderickx:
Ok trac wasn't fixed yet :(. Another test
For the record, I did get this one. [Another test :) ]
comment:11 Changed 21 months ago by jdemeyer
- Status changed from positive_review to closed
- Resolution set to fixed
- Merged in set to sage-4.7.2.alpha2
