Opened 4 years ago
Last modified 4 years ago
#26118 closed defect
sage -tp times out on a 160 core machine — at Version 4
Reported by: | saraedum | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | sage-8.4 |
Component: | doctest framework | Keywords: | |
Cc: | slelievre, roed | Merged in: | |
Authors: | Reviewers: | ||
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | Stopgaps: |
Description (last modified by )
Strangely,

    SAGE_NUM_THREADS=160 sage -tp --long --all

produces lots of timeouts on a 160 core machine.
It turns out that only a few cores are actually used (three or four most of the time), which seems to be related to the CPU affinity that has been set for the process.
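To confirm the restriction, a quick stdlib check like the following (my sketch, Linux-only, not part of the ticket) can be run inside the dispatcher process:

```python
import os

# Linux-only: the set of CPU cores this process is currently allowed to run on.
allowed = os.sched_getaffinity(0)
print("process may use %d of %d cores" % (len(allowed), os.cpu_count()))
```

If the first number is three or four on a 160 core machine, the affinity mask, not the hardware, is the bottleneck.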
This ugly workaround fixes it:
--- a/src/sage/doctest/forker.py
+++ b/src/sage/doctest/forker.py
@@ -1696,6 +1696,9 @@ class DocTestDispatcher(SageObject):
         # Logger
         log = self.controller.log

+        import os
+        os.system("taskset -p 0xffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff %d 2>/dev/null >/dev/null" % os.getpid())
+
         from cysignals.pselect import PSelecter
         try:
             # Block SIGCHLD and SIGINT except during the pselect() call
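A less ugly variant of the same hack (my sketch, not what the patch above does) would reset the affinity with the stdlib directly instead of shelling out to taskset with a giant hex mask:

```python
import os

# Widen this process's CPU affinity (Linux-only). The patch does the same
# thing via "taskset -p 0xff...ff <pid>"; sched_setaffinity avoids the
# subprocess and the hand-written mask.
try:
    os.sched_setaffinity(0, range(os.cpu_count()))
except OSError:
    # Some cores may be off-limits (cgroups/cpusets); keep what we have.
    pass
print(len(os.sched_getaffinity(0)), "cores usable")
```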
A better workaround turned out to be setting OPENBLAS_NUM_THREADS=1.
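For illustration, a hypothetical launcher (not part of the ticket) that applies this workaround would pin OpenBLAS to a single thread per worker before starting the test run, so that the doctest forker's own parallelism is not fought by BLAS thread pools:

```python
import os
import subprocess

# Pin OpenBLAS to one thread in the child environment; the variable must be
# set before any BLAS-backed library initializes its thread pool.
env = dict(os.environ, OPENBLAS_NUM_THREADS="1")
# subprocess.run(["sage", "-tp", "--long", "--all"], env=env)  # command from the ticket
print(env["OPENBLAS_NUM_THREADS"])
```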
Change History (4)
comment:1 Changed 4 years ago by
- Description modified (diff)
comment:2 Changed 4 years ago by
- Cc roed added
comment:3 Changed 4 years ago by
I tried an initial workaround that set the CPU affinity before forking.
Btw., with that workaround, the following part of the "A tour of Sage" doctests (in all languages) often hangs:
Trying (line 99):    m = random_matrix(RDF,500)
Expecting nothing
ok [0.09 s]
Trying (line 108):   e = m.eigenvalues()  # ungefähr 2 Sekunden ("about 2 seconds")
Expecting nothing
So maybe this has something to do with BLAS threading?
comment:4 Changed 4 years ago by
- Description modified (diff)
The hangs do not happen anymore with OPENBLAS_NUM_THREADS=1, and sage -tp --all finishes in 2:29 minutes (which is the time it takes to run the tests in sage.manifolds.differentiable.tensorfield).
roed: As you often work on k8s, does this also happen there?