Opened 10 years ago

Closed 10 years ago

#11658 closed defect (fixed)

the timeout option is not working correctly in parallel computing

Reported by: mderickx Owned by: tbd
Priority: major Milestone: sage-4.7.2
Component: performance Keywords: sleep time-out
Cc: Merged in: sage-4.7.2.alpha2
Authors: Leif Leonhardy Reviewers: Volker Braun
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Status badges

Description (last modified by mderickx)

With a timeout of 5 seconds I should not be able to perform a sleep of 40 seconds!

sage: from time import sleep
sage: f=parallel(ncpus=4,timeout=5,verbose=True)
sage: g=f(sleep)
sage: time list(g([5,10,20,40]))
[(((5,), {}), None), (((10,), {}), None), (((20,), {}), None), (((40,), {}), None)]
Time: CPU 0.02 s, Wall: 40.07 s

Attachments (1)

trac_11658-fix_timeout_in_parallel_decorator.sagelib.patch (986 bytes) - added by leif 10 years ago.
Sage library patch. Corrects time to wait for child processes (before they get killed) in the parallel fork decorator. Based on Sage 4.7.1.rc0.

Download all attachments as: .zip

Change History (12)

comment:1 Changed 10 years ago by mderickx

  • Component changed from PLEASE CHANGE to performance
  • Description modified (diff)
  • Type changed from PLEASE CHANGE to defect

comment:2 Changed 10 years ago by leif

  • Keywords sleep time-out added
  • Summary changed from the timout option is not working correctly in parralel computing to the timeout option is not working correctly in parallel computing

Is this really an issue?

I guess the timeout simply sets a SIGALRM, which the sleep() function(s) override, but I may be wrong.

comment:3 Changed 10 years ago by leif

P.S.:

If you use p_iter="fork" (the only variant that's said to support timeout), apparently indeed only ncpus-1 child processes are created, so it seems to be as I guessed.

Depending on the selection / order of arguments, you may well get timeouts for [perhaps only some of] the child processes (i.e. they'll get killed), though not after the time you'd expect.

The behaviour is non-deterministic though, for whatever reason. (Try running the time command repeatedly, i.e. within a loop.)

comment:4 Changed 10 years ago by leif

Oh, this is a more funny one (and not due to what I first guessed).

Looking at the code, it does spawn ncpus processes, but there's a fat bug in the code (re)computing how long to wait the next time (the call to signal.alarm()):

                    if timeout:
                        def mysig(a,b):
                            raise RuntimeError, "SIGALRM"
                        oldest = min([X[1] for X in workers.values()])
                        signal.signal(signal.SIGALRM, mysig)
                        signal.alarm(int(walltime() - oldest)+1)

(X[1] is the wall time each child process was forked / started.)

This code is executed repeatedly; if wait() was interrupted by a RuntimeError triggered by SIGALRM, it is checked whether any of the child processes timed out, and if so, they get killed.

Changed 10 years ago by leif

Sage library patch. Corrects time to wait for child processes (before they get killed) in the parallel fork decorator. Based on Sage 4.7.1.rc0.

comment:5 Changed 10 years ago by leif

  • Status changed from new to needs_review

A trivial patch is up.

comment:6 follow-up: Changed 10 years ago by vbraun

  • Authors set to Leif Leonhardy
  • Reviewers set to Volker Braun
  • Status changed from needs_review to positive_review

Yes its definitely better to have the timeout depend on the timeout variable :-)

comment:7 in reply to: ↑ 6 Changed 10 years ago by leif

Replying to vbraun:

Yes its definitely better to have the timeout depend on the timeout variable :-)

:D Just wondering if we should add a doctest for that, but now it has already positive review...

I was going to add one similar to Maarten's example (with some parallel sleeping processes, the default of ncpus, and a timeout of about 5 seconds), marking it # long time.

comment:8 Changed 10 years ago by mderickx

I already had a patch sorry guy's for making you do double effort but thanks for the quick fix :). I found this bug by reading the source code since I was trying to understand what they where doing. BTW I didn't respond earlier because the mailing system of the trac server was malfunctioning and didn't know others where working on it. The mailing server should work now. At least this message will test that for me :)

comment:9 follow-up: Changed 10 years ago by mderickx

Ok trac wasn't fixed yet :(. Another test

comment:10 in reply to: ↑ 9 Changed 10 years ago by leif

Replying to mderickx:

Ok trac wasn't fixed yet :(. Another test

For the record, I did get this one. [Another test :) ]

comment:11 Changed 10 years ago by jdemeyer

  • Merged in set to sage-4.7.2.alpha2
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.