Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#26118 closed defect (fixed)

sage -tp times out on a 160 core machine

Reported by: Julian Rüth
Owned by:
Priority: minor
Milestone: sage-8.4
Component: doctest framework
Keywords:
Cc: Samuel Lelièvre, David Roe
Merged in:
Authors: Julian Rüth
Reviewers: Volker Braun
Report Upstream: N/A
Work issues:
Branch: dd20582
Commit:
Dependencies:
Stopgaps:


Description (last modified by Julian Rüth)

Strangely, SAGE_NUM_THREADS=160 sage -tp --long --all produces lots of timeouts on a 160-core machine.

It turns out that only a few cores are actually used (three or four most of the time), which seems to be related to the CPU affinity being set.

One workaround is to export OPENBLAS_NUM_THREADS=1.

Alternatively, we could export OPENBLAS_MAIN_FREE=1 to disable OpenBLAS's thread affinity.

The FAQ explains what these options do: https://github.com/xianyi/OpenBLAS/wiki/faq
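
For illustration, here is a minimal sketch (not part of this ticket's branch) of applying the same workaround from Python. The assumption is that OpenBLAS only reads these variables once, when the shared library is first loaded, so they have to be set before Sage imports it:

import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # single-threaded BLAS, no affinity mapping
# os.environ["OPENBLAS_MAIN_FREE"] = "1"   # alternative: keep threading but drop the affinity mask

from sage.all import random_matrix, RDF    # import Sage only after setting the variables
m = random_matrix(RDF, 500)
_ = m.eigenvalues()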


Here are some benchmarks:

sage: m=load('/tmp/matrix') # a 500x500 RDF matrix
sage: %timeit m.eigenvalues()
1 loop, best of 3: 1.78 s per loop # currently
1 loop, best of 3: 911 ms per loop # with OPENBLAS_NUM_THREADS=1

Note that the first one causes 100% load on 130 of the 160 cores and the second one only causes 100% load on a single core. (Yes, it's faster anyway.)

I also tried all sorts of combinations of SAGE_NUM_THREADS* and OPENBLAS_MAIN_FREE and always got around 1.8 s and 100% load on 130 cores. I've sometimes seen segfaults that were not reproducible, but never with OPENBLAS_NUM_THREADS=1.
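
(Not part of the original report, just a hypothetical way to observe the effect on Linux: compare the process's affinity mask before and after the first BLAS call.)

import os
print(len(os.sched_getaffinity(0)))   # number of CPUs the process may run on, e.g. 160

from sage.all import random_matrix, RDF
m = random_matrix(RDF, 500)
_ = m.eigenvalues()                   # first BLAS call initialises OpenBLAS

print(len(os.sched_getaffinity(0)))   # much smaller if OpenBLAS narrowed the affinity mask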

Change History (32)

comment:1 Changed 4 years ago by Julian Rüth

Description: modified (diff)

comment:2 Changed 4 years ago by Julian Rüth

Cc: David Roe added

roed: since you are often working on k8s, does this also happen there?

comment:3 Changed 4 years ago by Julian Rüth

I tried an initial workaround that sets the CPU affinity before forking (a rough sketch follows at the end of this comment).

Btw., with that workaround the following part of "A Tour of Sage" often hangs, in all languages:

Trying (line 99):    m = random_matrix(RDF,500)
Expecting nothing
ok [0.09 s]
Trying (line 108):    e = m.eigenvalues()  # ungefähr 2 Sekunden
Expecting nothing

So maybe this has something to do with some BLAS?
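
(For concreteness, a rough sketch of the kind of workaround meant above; the function name and its placement are hypothetical, and this is Linux-only:)

import os

def fork_doctest_worker(run_test):
    # reset the (possibly narrowed) affinity mask to all CPUs before forking
    os.sched_setaffinity(0, range(os.cpu_count() or 1))
    pid = os.fork()
    if pid == 0:       # child: run the test and exit
        run_test()
        os._exit(0)
    return pid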

Last edited 4 years ago by Julian Rüth (previous) (diff)

comment:4 Changed 4 years ago by Julian Rüth

Description: modified (diff)

The hangs do not happen anymore with OPENBLAS_NUM_THREADS=1, and sage -tp --all finishes in 2:29 minutes (which is the time it takes to run the tests in sage.manifolds.differentiable.tensorfield).

comment:5 Changed 4 years ago by Erik Bray

What I want to know is, where is this 160 core machine and can I use it to build sage?

comment:6 Changed 4 years ago by David Roe

I haven't seen this on k8s, though I haven't been using it much recently.

comment:7 in reply to:  5 Changed 4 years ago by Julian Rüth

Replying to embray:

What I want to know is, where is this 160 core machine and can I use it to build sage?

I spawned it with the Google Cloud credits from slelievre. I think they are still valid today if you want to play with it.

comment:8 in reply to:  6 Changed 4 years ago by Julian Rüth

Replying to roed:

I haven't seen this on k8s, though I haven't been using it much recently.

This seems to be a new issue caused by #24669.

comment:9 Changed 4 years ago by Julian Rüth

See #21323 for some background on OPENBLAS_NUM_THREADS.

comment:10 Changed 4 years ago by Julian Rüth

Description: modified (diff)

comment:11 Changed 4 years ago by Julian Rüth

Setting export OPENBLAS_MAIN_FREE=1 causes hangs in "A Tour of Sage", the same as in comment 3.

comment:12 Changed 4 years ago by Julian Rüth

Authors: Julian Rüth
Description: modified (diff)

comment:13 Changed 4 years ago by Julian Rüth

Branch: u/saraedum/26118

comment:14 Changed 4 years ago by Julian Rüth

Commit: dd20582d28372ffc38d16444b3f81d5c2885289a
Status: new → needs_review

New commits:

dd20582  Set OPENBLAS_NUM_THREADS=1

comment:15 Changed 4 years ago by Julian Rüth

Btw., SAGE_NUM_THREADS=160 sage -tp --long --all crashes with segfaults in gentourng; that's probably unrelated.

SAGE_NUM_THREADS=1 SAGE_NUM_THREADS_PARALLEL=160 sage -tp --long --all works, however, and finishes after 4 minutes (which is the time that sage.schemes.elliptic_curves.ell_number_field takes…).

Note that the latter setting would not work without the changes introduced in this ticket (otherwise all 160 workers are scheduled on the same core).

comment:16 Changed 4 years ago by Timo Kaufmann

If we decide to set OPENBLAS_NUM_THREADS=1, would it be possible to set it within python instead of in sage-env? That way it would also work when using from sage import ....
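
(A sketch of what that could look like; the placement, e.g. near the top of sage/all.py, is only an assumption, and the variable has to be set before OpenBLAS is loaded to have any effect:)

import os
# respect an explicit user setting, otherwise default to a single OpenBLAS thread
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")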

comment:17 Changed 4 years ago by Erik Bray

Status: needs_review → needs_work

Let us not set this for all cases; in normal usage it is not a problem, only in the doctests, and only, it seems, in extreme cases. I would prefer to find out exactly what the limits are here and restrict functionality only near those limits (and maybe investigate what exactly is going on at those limits and fix the underlying issue).

comment:18 Changed 4 years ago by Timo Kaufmann

What makes you think this is only a doctest problem?

comment:19 Changed 4 years ago by Julian Rüth

Yes, please look at my benchmarks above. This is a performance and a stability problem in normal usage.

comment:20 Changed 4 years ago by Volker Braun

Status: needs_work → positive_review

I've also noticed the wrong CPU affinity on the buildbot; it is a major nuisance.

comment:21 Changed 4 years ago by Volker Braun

Reviewers: Volker Braun

comment:22 Changed 4 years ago by Volker Braun

Branch: u/saraedum/26118 → dd20582d28372ffc38d16444b3f81d5c2885289a
Resolution: fixed
Status: positive_review → closed

comment:23 Changed 4 years ago by Erik Bray

Commit: dd20582d28372ffc38d16444b3f81d5c2885289a

I don't see why you would want to set this for all use cases. It seems, rather, that libraries which use their own multithreading and would conflict with OpenBLAS's should explicitly call openblas_set_num_threads, as in https://trac.sagemath.org/ticket/21323#comment:7

This just looks like a brute-force workaround for a specific case.
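
(For reference, a hedged sketch of that per-library approach from Python via ctypes; the library name is an assumption and may need a full path on a given system:)

import ctypes

openblas = ctypes.CDLL("libopenblas.so")   # name/path is an assumption
openblas.openblas_set_num_threads(1)       # restrict OpenBLAS to one thread at runtime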

comment:24 Changed 4 years ago by Erik Bray

Is there some reliable way to reproduce this problem? I'm having trouble even doing that, though I'm sure it's possible.

comment:25 Changed 4 years ago by Volker Braun

It was quite reproducible. But AFAIR the CPU affinity is only set in the course of running the tests, not immediately. Wait until make ptestlong is mostly done and then check the taskset output.

comment:26 Changed 4 years ago by Erik Bray

I'm not sure what I'm looking for here. Julian suggested that it's a problem that occurs in "normal usage". Is there a way to reproduce this problem without having to run make ptestlong?

In any case it sounds like a bug that should be fixed in openblas, not hackishly worked around. The reason I'm confused is that in the doctests we fork for every test, so I don't see why there would be a problem with thread CPU affinity since from test to test it's not even going to be reusing the same threads.

Perhaps I need to study more carefully what exactly this option is doing in openblas.

comment:27 in reply to:  26 ; Changed 4 years ago by Julian Rüth

Replying to embray:

I'm not sure what I'm looking for here. Julian suggested that it's a problem that occurs in "normal usage". Is there a way to reproduce this problem without having to run make ptestlong?

Yes, see ticket description.

comment:28 Changed 4 years ago by Erik Bray

I see; it has nothing to do with thread CPU affinity at all, but with process CPU affinity. The mention of OPENBLAS_NUM_THREADS=1 had me thinking this was something about threads. That makes more sense.

What's weird is that gotoblas_affinity_init() still sets up a new CPU affinity mask and installs it with sched_setaffinity(), does some stuff I don't understand, and then, if and only if OPENBLAS_MAIN_FREE=1, calls sched_setaffinity() again to set it back to the original mask.
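
(A small Linux-only illustration, not from the ticket, of why a narrowed process mask matters for the doctester: a mask set in the parent is inherited by every worker it forks:)

import os

before = os.sched_getaffinity(0)
os.sched_setaffinity(0, {0})          # narrow the mask to CPU 0, as OpenBLAS might do
pid = os.fork()
if pid == 0:
    print("child mask:", os.sched_getaffinity(0))   # the forked worker inherits {0}
    os._exit(0)
os.waitpid(pid, 0)
os.sched_setaffinity(0, before)       # restore the original mask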

comment:29 in reply to:  27 Changed 4 years ago by Erik Bray

Replying to saraedum:

Replying to embray:

I'm not sure what I'm looking for here. Julian suggested that it's a problem that occurs in "normal usage". Is there a way to reproduce this problem without having to run make ptestlong?

Yes, see ticket description.

I tried that, but only once was I able to get a somewhat slow result, and I think I deleted that matrix. Now, no matter how many times I create a random matrix, I seem to get ones whose eigenvalues are computed very quickly, and I barely see any problem. Perhaps I need some specific matrix that exhibits the problem.

Anyway, now that I've looked at the OpenBLAS code I have no problem with this workaround. I want to understand the code in OpenBLAS a bit better, but I agree that for the purposes of Sage we don't have any reason for OpenBLAS to mess with CPU affinity. Perhaps we should even build Sage's OpenBLAS with NO_AFFINITY=1 (but keep the environment variable set for non-Sage builds of OpenBLAS).

comment:30 Changed 4 years ago by Erik Bray

I'm surprised OPENBLAS_NUM_THREADS=1 didn't work for you:

#ifdef USE_OPENMP
  numprocs = 0;
#else
  numprocs = readenv_atoi("OPENBLAS_NUM_THREADS");
  if (numprocs == 0) numprocs = readenv_atoi("GOTO_NUM_THREADS");
#endif

  if (numprocs == 0) numprocs = readenv_atoi("OMP_NUM_THREADS");

  numnodes = 1;

  if (numprocs == 1) {
    disable_mapping = 1;
    return;
  }

comment:31 Changed 4 years ago by Julian Rüth

Sorry, I don't understand what you mean by "didn't work for you".

comment:32 Changed 4 years ago by Erik Bray

Never mind, apparently it did (https://trac.sagemath.org/ticket/26118#comment:4); I misread something else.
