Opened 4 years ago

Last modified 4 years ago

#26118 closed defect

sage -tp times out on a 160 core machine — at Version 4

Reported by: saraedum Owned by:
Priority: minor Milestone: sage-8.4
Component: doctest framework Keywords:
Cc: slelievre, roed Merged in:
Authors: Reviewers:
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:


Description (last modified by saraedum)

Strangely, SAGE_NUM_THREADS=160 sage -tp --long --all produces lots of timeouts on a 160-core machine.

It turns out that only a few cores are actually used (three or four most of the time), which seems to be related to the CPU affinity that is set for the process.
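To check whether the affinity mask really is the limiting factor, one could compare the number of CPUs the process is allowed to run on with the number of CPUs present. This is only a minimal sketch; the helper name affinity_report is made up for illustration, and os.sched_getaffinity is only available on Linux.

```python
import os

def affinity_report():
    """Return (allowed, total): CPUs this process may run on vs. CPUs present."""
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # Set of CPU ids the current process (pid 0 = self) is restricted to.
        allowed = len(os.sched_getaffinity(0))
    else:
        # Assumption: without the Linux API we cannot tell, so report all CPUs.
        allowed = total
    return allowed, total

allowed, total = affinity_report()
print("allowed CPUs: %d of %d" % (allowed, total))
```

If the first number is much smaller than the second, the process is pinned to a subset of the cores.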

This ugly workaround fixes it:

--- a/src/sage/doctest/forker.py
+++ b/src/sage/doctest/forker.py
@@ -1696,6 +1696,9 @@ class DocTestDispatcher(SageObject):
         # Logger
         log = self.controller.log
 
+        import os
+        os.system("taskset -p 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff %d 2>/dev/null >/dev/null" % os.getpid())
+
         from cysignals.pselect import PSelecter
         try:
             # Block SIGCHLD and SIGINT except during the pselect() call
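The same effect can probably be had without shelling out to taskset. The following is a sketch under assumptions, not what the patch above does: reset_cpu_affinity is a hypothetical helper, it relies on the Linux-only os.sched_setaffinity, and it assumes the CPU ids are 0 .. cpu_count-1.

```python
import os

def reset_cpu_affinity():
    """Try to let the current process run on every CPU.

    Returns the number of CPUs the process may use afterwards.
    """
    ncpus = os.cpu_count() or 1
    if not hasattr(os, "sched_setaffinity"):
        # Not on Linux: nothing to reset.
        return ncpus
    try:
        # pid 0 means the calling process; the mask is an iterable of CPU ids.
        os.sched_setaffinity(0, range(ncpus))
    except OSError:
        # The kernel may refuse the mask (e.g. inside a restricted cpuset);
        # in that case keep whatever affinity we already have.
        pass
    return len(os.sched_getaffinity(0))

print("usable CPUs after reset:", reset_cpu_affinity())
```

Child processes inherit the affinity mask, so this would have to run before the doctest workers are forked.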

A better workaround turned out to be setting OPENBLAS_NUM_THREADS=1.
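For a wrapper script that launches the tests, the variable can be put into the child environment like this. This is just a sketch; single_threaded_blas_env is a hypothetical helper name, and OPENBLAS_NUM_THREADS is the environment variable OpenBLAS reads to limit its thread count.

```python
import os

def single_threaded_blas_env(base=None):
    """Return a copy of the environment with OpenBLAS limited to one thread."""
    env = dict(os.environ if base is None else base)
    # Environment values must be strings.
    env["OPENBLAS_NUM_THREADS"] = "1"
    return env

env = single_threaded_blas_env()
print(env["OPENBLAS_NUM_THREADS"])
```

The resulting dict can be passed as env= to subprocess.run(["sage", "-tp", "--long", "--all"], ...); the variable has to be set before OpenBLAS is loaded, so exporting it in the shell before starting sage is the simplest route.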

Change History (4)

comment:1 Changed 4 years ago by saraedum

  • Description modified (diff)

comment:2 Changed 4 years ago by saraedum

  • Cc roed added

roed: As you often work on k8s, does this also happen there?

comment:3 Changed 4 years ago by saraedum

With an initial workaround that set the CPU affinity before forking, the following part of "A tour of Sage" often hangs, in all languages:

Trying (line 99):    m = random_matrix(RDF,500)
Expecting nothing
ok [0.09 s]
Trying (line 108):    e = m.eigenvalues()  # ungefähr 2 Sekunden
Expecting nothing

So maybe this has something to do with some BLAS?

Last edited 4 years ago by saraedum (previous) (diff)

comment:4 Changed 4 years ago by saraedum

  • Description modified (diff)

The hangs do not happen anymore with OPENBLAS_NUM_THREADS=1 and sage -tp --all finishes in 2:29 minutes (which is the time it takes to run the tests in sage.manifolds.differentiable.tensorfield).
