Opened 6 years ago

Closed 6 years ago

#1077 closed defect (fixed)

[with patch, with positive review] DSage restarts two workers after timeout

Reported by: jvoight
Owned by: yi
Priority: major
Milestone: sage-2.9
Component: packages: standard

Description

When a job times out, the worker restarts and then starts running two jobs at once. This slows things down and is not the expected behavior.

When one of those two jobs finishes, the worker performs a hard reset, which kills the second job, so that job never completes.

Change History (11)

comment:1 Changed 6 years ago by yi

  • Owner changed from was to yi

comment:2 Changed 6 years ago by mabshoff

  • Milestone set to sage-2.9

comment:3 Changed 6 years ago by yi

Could you please elaborate? What do you mean by "it restarts running two jobs"? Currently a job timing out counts as a failure, and by default each job has a failure threshold of 3 (i.e. it will fail three times before being removed from the job queue). Until now there was no easy way to change that. If you launch the server like this:

dsage.server(job_failure_threshold=0)

then each job will only fail once before it is removed from the queue. Find the bundle here:

http://sage.math.washington.edu/home/yqiang/dsage.hg

Please report back if this does not fix the problem for you.
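To make the threshold semantics described above concrete, here is a minimal sketch of a queue that drops a job after repeated failures. It is not DSage's actual code: the class `JobQueue` and its methods are hypothetical, and the rule used (drop the job once its failure count reaches `max(job_failure_threshold, 1)`) is only one reading that matches both statements in this comment, namely three failures at the default of 3 and a single failure at 0.

```python
class JobQueue:
    """Toy job queue illustrating a failure threshold (hypothetical, not DSage)."""

    def __init__(self, job_failure_threshold=3):
        self.job_failure_threshold = job_failure_threshold
        self.jobs = []       # job ids still eligible to run
        self.failures = {}   # job id -> number of recorded failures

    def submit(self, job_id):
        self.jobs.append(job_id)
        self.failures[job_id] = 0

    def report_failure(self, job_id):
        """Record one failure (e.g. a timeout); return True if the job was removed."""
        self.failures[job_id] += 1
        # Assumption: a threshold of 0 behaves like 1, so the first failure removes the job.
        limit = max(self.job_failure_threshold, 1)
        if self.failures[job_id] >= limit:
            self.jobs.remove(job_id)
            return True
        return False


if __name__ == "__main__":
    queue = JobQueue(job_failure_threshold=0)
    queue.submit("COZkyk31Am")
    print(queue.report_failure("COZkyk31Am"), queue.jobs)
```

With the default threshold the same job would stay in the queue through its first two failures and be removed on the third.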

comment:4 Changed 6 years ago by jvoight

Here's an example output. How can worker 0 be working on two jobs at once?

2007/11/07 22:28 -0700 [-] [Worker 0] Job COZkyk31Am failed!
2007/11/07 22:28 -0700 [-] Traceback:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jvoight/.sage/dsage/tmp_worker_files/COZkyk31Am/default_job.py", line 8, in <module>
    DSAGE_RESULT=enumerate_totallyreal_fields(Integer(9),Integer(28334269485),[Integer(70), Integer(1), -Integer(15), Integer(0), Integer(1)],return_seqs=True)
  File "/home/jvoight/sage/local/lib/python2.5/site-packages/sage/rings/number_field/totallyreal.py", line 225, in enumerate_totallyreal_fields
    [zk,d] = nf.nfbasis_d()
  File "/home/jvoight/sage/local/lib/python2.5/site-packages/sage/misc/misc.py", line 1300, in mysig
    raise KeyboardInterrupt, "computation timed out because alarm was set for %s seconds"%alarm_time
KeyboardInterrupt: computation timed out because alarm was set for 1800 seconds
2007/11/07 22:28 -0700 [-] [Worker 0] Performing hard reset.
2007/11/07 22:28 -0700 [-] [Worker: 0] Restarting...
2007/11/07 22:28 -0700 [Broker,client] [Worker 0] Starting job kLm2hihd1N
2007/11/07 22:28 -0700 [Broker,client] [Worker 0] Starting job jUtQDMnlOG
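For anyone trying to confirm the same symptom in their own logs, the following is a rough sketch of a scanner that flags a worker starting a second job while an earlier one is still outstanding. The function name `find_double_starts` and the regular expression are assumptions modeled only on the message shapes visible in the output above ("Starting job", "Job ... failed!", "Performing hard reset."). Other DSage log messages, such as those for successful completion, are not handled.

```python
import re

# Matches only the message shapes seen in the log excerpt above.
EVENT = re.compile(
    r"\[Worker:? (\d+)\] "
    r"(?:Starting job (\w+)|Job (\w+) failed!|(Performing hard reset)\.)"
)


def find_double_starts(log_text):
    """Return (worker, jobs) pairs where a worker holds more than one job."""
    active = {}    # worker id -> list of job ids started and not yet ended
    overlaps = []
    for match in EVENT.finditer(log_text):
        worker = int(match.group(1))
        started, failed, reset = match.group(2), match.group(3), match.group(4)
        jobs = active.setdefault(worker, [])
        if started:
            jobs.append(started)
            if len(jobs) > 1:
                overlaps.append((worker, tuple(jobs)))
        elif failed:
            if failed in jobs:
                jobs.remove(failed)
        elif reset:
            # A hard reset ends whatever the worker was running.
            jobs.clear()
    return overlaps


if __name__ == "__main__":
    sample = (
        "2007/11/07 22:28 -0700 [-] [Worker 0] Job COZkyk31Am failed!\n"
        "2007/11/07 22:28 -0700 [-] [Worker 0] Performing hard reset.\n"
        "2007/11/07 22:28 -0700 [-] [Worker: 0] Restarting...\n"
        "2007/11/07 22:28 -0700 [Broker,client] [Worker 0] Starting job kLm2hihd1N\n"
        "2007/11/07 22:28 -0700 [Broker,client] [Worker 0] Starting job jUtQDMnlOG\n"
    )
    print(find_double_starts(sample))
```

Because the scanner uses `finditer` over the whole text, it also works when several log entries are run together on one line.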

comment:5 Changed 6 years ago by yi

  • Milestone changed from sage-2.9 to sage-2.8.13
  • Summary changed from DSage restarts two workers after timeout to [WITH PATCH] DSage restarts two workers after timeout

comment:6 Changed 6 years ago by yi

  • Status changed from new to assigned

comment:7 Changed 6 years ago by mhansen

  • Summary changed from [WITH PATCH] DSage restarts two workers after timeout to [with patch] DSage restarts two workers after timeout

comment:8 Changed 6 years ago by mabshoff

Yi, could you please provide a patch or bundle once 2.8.13 is out? If I try the bundle above, it complains about an unknown parent, and it is unclear to me whether I should apply the other bundle first.

Cheers,

Michael

comment:9 Changed 6 years ago by yi

I've uploaded a bundle against 2.8.14:

http://sage.math.washington.edu/home/yqiang/dsage_latest.hg

comment:10 Changed 6 years ago by rlm

  • Summary changed from [with patch] DSage restarts two workers after timeout to [with patch, with positive review] DSage restarts two workers after timeout

comment:11 Changed 6 years ago by mabshoff

  • Resolution set to fixed
  • Status changed from assigned to closed

Merged in 2.9.rc0.
