#26118 closed defect (fixed)
sage tp times out on a 160 core machine
Reported by:  saraedum  Owned by:  

Priority:  minor  Milestone:  sage8.4 
Component:  doctest framework  Keywords:  
Cc:  slelievre, roed  Merged in:  
Authors:  Julian Rüth  Reviewers:  Volker Braun 
Report Upstream:  N/A  Work issues:  
Branch:  dd20582 (Commits, GitHub, GitLab)  Commit:  
Dependencies:  Stopgaps: 
Description (last modified by )
Strangely SAGE_NUM_THREADS=160 sage tp long all
produces lots of timeouts on a 160 core machine.
It turns out that only a few cores are actually used (three or four most of the time) which seems to be related to the set CPU affinity.
One workaround export OPENBLAS_NUM_THREADS=1
.
Or we could export OPENBLAS_MAIN_FREE=1
to disable OPENBLAS' threading affinity.
The FAQ explains what these options do: https://github.com/xianyi/OpenBLAS/wiki/faq
Here are some benchmarks:
sage: m=load('/tmp/matrix') # a 500x500 RDF matrix sage: %timeit m.eigenvalues() 1 loop, best of 3: 1.78 s per loop # currently 1 loop, best of 3: 911 ms per loop # with OPENBLAS_NUM_THREADS=1
Note that the first one causes 100% load on 130 of the 160 cores and the second one only causes 100% load on a single core. (Yes, it's faster anyway.)
I also tried with all sorts of combinations of SAGE_NUM_THREADS*, OPENBLAS_MAIN_FREE and always got around 1.8s and 100% load on 130 cores. I'v sometimes seen segfaults that were not reproducible but never with OPENBLAS_NUM_THREADS=1.
Change History (32)
comment:1 Changed 4 years ago by
 Description modified (diff)
comment:2 Changed 4 years ago by
 Cc roed added
comment:3 Changed 4 years ago by
Btw., with that workaround often the following part of the "A tour of Sage" in all languages hangs:
Trying (line 99): m = random_matrix(RDF,500) Expecting nothing ok [0.09 s] Trying (line 108): e = m.eigenvalues() # ungefähr 2 Sekunden Expecting nothing
So maybe this has something to do with some BLAS?
comment:4 Changed 4 years ago by
 Description modified (diff)
The hangs do not happen anymore with OPENBLAS_NUM_THREADS=1
and sage tp all
finishes in 2:29 minutes (which is the time it takes to run the tests in sage.manifolds.differentiable.tensorfield
).
comment:5 followup: ↓ 7 Changed 4 years ago by
What I want to know is, where is this 160 core machine and can I use it to build sage?
comment:6 followup: ↓ 8 Changed 4 years ago by
I haven't seen this on k8s, though I haven't been using it much recently.
comment:7 in reply to: ↑ 5 Changed 4 years ago by
Replying to embray:
What I want to know is, where is this 160 core machine and can I use it to build sage?
I spawned it with the google cloud credits from slelievre. I think they still work for today if you want to play with it.
comment:8 in reply to: ↑ 6 Changed 4 years ago by
comment:9 Changed 4 years ago by
See #21323 for some background on OPENBLAS_NUM_THREADS
.
comment:10 Changed 4 years ago by
 Description modified (diff)
comment:11 Changed 4 years ago by
Setting export OPENBLAS_MAIN_FREE=1
causes hangs in the "a tour of Sage", same as in comment 3.
comment:12 Changed 4 years ago by
 Description modified (diff)
comment:13 Changed 4 years ago by
 Branch set to u/saraedum/26118
comment:14 Changed 4 years ago by
 Commit set to dd20582d28372ffc38d16444b3f81d5c2885289a
 Status changed from new to needs_review
New commits:
dd20582  Set OPENBLAS_NUM_THREADS=1

comment:15 Changed 4 years ago by
SAGE_NUM_THREADS=160 sage tp long all
btw crashes with segfaults in gentourng
; that's probably unrelated.
SAGE_NUM_THREADS=1 SAGE_NUM_THREADS_PARALLEL=160 sage tp long all
works however, and finishes after 4 minutes (which is the time that sage.schemes.elliptic_curves.ell_number_field takes…)
Note that the latter setting would not work without the changes introduced in this ticket (all 160 workers are scheduled on the same core.)
comment:16 Changed 4 years ago by
If we decide to set OPENBLAS_NUM_THREADS=1
, would it be possible to set it within python instead of in sageenv
? That way it would also work when using from sage import ...
.
comment:17 Changed 4 years ago by
 Status changed from needs_review to needs_work
Let us not set this is for all cases; in normal usage it is not a problem, only in the doctests. And only, it seems, in extreme cases. I would prefer to find out exactly what the limits are here and limit functionality only near those limits (and to maybe investigate what exactly is going on at those limits and maybe fix the underlying issue).
comment:18 Changed 4 years ago by
What makes you think this is only a doctest problem?
comment:19 Changed 4 years ago by
Yes, please look at my benchmarks above. This is a performance and a stability problem in normal usage.
comment:20 Changed 4 years ago by
 Status changed from needs_work to positive_review
I've also noted the wrong cpu affinity on the buildbot, it is a major nuisance.
comment:21 Changed 4 years ago by
 Reviewers set to Volker Braun
comment:22 Changed 4 years ago by
 Branch changed from u/saraedum/26118 to dd20582d28372ffc38d16444b3f81d5c2885289a
 Resolution set to fixed
 Status changed from positive_review to closed
comment:23 Changed 4 years ago by
 Commit dd20582d28372ffc38d16444b3f81d5c2885289a deleted
I don't see why you would want to set this for all use cases. It seems, rather, libraries that use their own multithreading and would conflict with openblas's should explicitly set openblas_set_num_threads
as in https://trac.sagemath.org/ticket/21323#comment:7
This just looks like a bruteforce workaround for a specific case.
comment:24 Changed 4 years ago by
I there some reliable way to reproduce this problem? I'm having trouble even doing that, though I'm sure it's possible.
comment:25 Changed 4 years ago by
It was quite reproducible. But afair the cpu affinity is only set in the course of running the tests, not immediately. Wait until make ptestlong is mostly done and then check the taskset output.
comment:26 followup: ↓ 27 Changed 4 years ago by
I'm not sure what I'm looking for here. Julian suggested that it's a problem that occurs in "normal usage". Is there a way to reproduce this problem without having to run make ptestlong
?
In any case it sounds like a bug that should be fixed in openblas, not hackishly worked around. The reason I'm confused is that in the doctests we fork for every test, so I don't see why there would be a problem with thread CPU affinity since from test to test it's not even going to be reusing the same threads.
Perhaps I need to study more carefully what exactly this option is doing in openblas.
comment:27 in reply to: ↑ 26 ; followup: ↓ 29 Changed 4 years ago by
Replying to embray:
I'm not sure what I'm looking for here. Julian suggested that it's a problem that occurs in "normal usage". Is there a way to reproduce this problem without having to run
make ptestlong
?
Yes, see ticket description.
comment:28 Changed 4 years ago by
I see; it's nothing to do with thread CPU affinity at all, but process CPU affinity. The mention of OPENBLAS_NUM_THREADS=1
had me thinking this was something about threads. That makes more sense.
What's weird is that gotoblas_affinity_init()
still sets up a new CPU affinity mask and installs it with sched_setaffinity()
does some stuff I don't understandand then if OPENBLAS_MAIN_FREE=1
and only if, calls sched_setaffinity()
again to set it back to the original mask.
comment:29 in reply to: ↑ 27 Changed 4 years ago by
Replying to saraedum:
Replying to embray:
I'm not sure what I'm looking for here. Julian suggested that it's a problem that occurs in "normal usage". Is there a way to reproduce this problem without having to run
make ptestlong
?Yes, see ticket description.
I tried that, but there was only one time I was able to get a kind of slow result, and I think I deleted that matrix. Now, no matter how many times I try to create a random matrix, I seem to be getting ones whose eigenvalues are solved very quickly and I barely see any problem. Perhaps I need some specific matrix that exhibits the problem.
Anyways, now that I've looked at the openblas code I have no problem with this workaround. I want to understand the code in openblas a bit better, but I agree that for the purposes of Sage we don't have any reason for openblas to mess with CPU affinity. Perhaps we should even build Sage's openblas with NO_AFFINITY=1
(but keep the environment variable set for nonSage builds of openblas).
comment:30 Changed 4 years ago by
I'm surprised OPENBLAS_NUM_THREADS=1
didn't work for you:
#ifdef USE_OPENMP numprocs = 0; #else numprocs = readenv_atoi("OPENBLAS_NUM_THREADS"); if (numprocs == 0) numprocs = readenv_atoi("GOTO_NUM_THREADS"); #endif if (numprocs == 0) numprocs = readenv_atoi("OMP_NUM_THREADS"); numnodes = 1; if (numprocs == 1) { disable_mapping = 1; return; }
comment:31 Changed 4 years ago by
Sorry, I don't understand what you mean with "didn't work for you".
comment:32 Changed 4 years ago by
Nevermind, apparently it did: https://trac.sagemath.org/ticket/26118#comment:4 I misread something else.
roed: As you are often working on k8s. Does this also happen there?