Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#26118 closed defect (fixed)

sage -tp times out on a 160 core machine

Reported by: saraedum Owned by:
Priority: minor Milestone: sage-8.4
Component: doctest framework Keywords:
Cc: slelievre, roed Merged in:
Authors: Julian Rüth Reviewers: Volker Braun
Report Upstream: N/A Work issues:
Branch: dd20582 (Commits, GitHub, GitLab) Commit:
Dependencies: Stopgaps:


Description (last modified by saraedum)

Strangely, SAGE_NUM_THREADS=160 sage -tp --long --all produces lots of timeouts on a 160 core machine.

It turns out that only a few cores (three or four most of the time) are actually used, which seems to be related to the CPU affinity that gets set.

One workaround is to export OPENBLAS_NUM_THREADS=1.

Or we could export OPENBLAS_MAIN_FREE=1 to disable OpenBLAS' threading affinity.

The FAQ explains what these options do: https://github.com/xianyi/OpenBLAS/wiki/faq
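To see whether the affinity is actually being narrowed, a quick check from a Sage/Python session on Linux (a diagnostic sketch, not part of the ticket's branch) is:

```python
import os

# Which cores is this process allowed to run on? On an affected machine,
# this set shrinks to a handful of cores once OpenBLAS has initialized.
allowed = os.sched_getaffinity(0)  # 0 = the current process
print(len(allowed), "of", os.cpu_count(), "cores available")
```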


Here are some benchmarks:

sage: m=load('/tmp/matrix') # a 500x500 RDF matrix
sage: %timeit m.eigenvalues()
1 loop, best of 3: 1.78 s per loop # currently
1 loop, best of 3: 911 ms per loop # with OPENBLAS_NUM_THREADS=1

Note that the first run causes 100% load on 130 of the 160 cores, while the second one causes 100% load on only a single core. (Yes, the single-core run is faster anyway.)

I also tried all sorts of combinations of SAGE_NUM_THREADS* and OPENBLAS_MAIN_FREE and always got around 1.8 s and 100% load on 130 cores. I've sometimes seen segfaults that were not reproducible, but never with OPENBLAS_NUM_THREADS=1.
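The benchmark above relies on a pickled matrix in /tmp; a self-contained stand-in (my assumption: plain numpy linked against the same OpenBLAS shows the same effect, with a random matrix instead of the saved one) would be:

```python
import time
import numpy as np

# A random 500x500 real matrix, standing in for the pickled /tmp/matrix.
m = np.random.rand(500, 500)

t0 = time.perf_counter()
ev = np.linalg.eigvals(m)  # dispatches to LAPACK/OpenBLAS
elapsed = time.perf_counter() - t0
print(f"eigenvalues of a 500x500 matrix: {elapsed:.2f} s")
```

Run it once as-is and once with OPENBLAS_NUM_THREADS=1 set in the environment to compare the timings.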

Change History (32)

comment:1 Changed 3 years ago by saraedum

  • Description modified (diff)

comment:2 Changed 3 years ago by saraedum

  • Cc roed added

roed: As you often work on k8s, does this also happen there?

comment:3 Changed 3 years ago by saraedum

I tried an initial workaround that sets the CPU affinity before forking.

Btw., with that workaround the following part of the "A tour of Sage" doctests (in all languages) often hangs:

Trying (line 99):    m = random_matrix(RDF,500)
Expecting nothing
ok [0.09 s]
Trying (line 108):    e = m.eigenvalues()  # ungefähr 2 Sekunden
Expecting nothing

So maybe this has something to do with some BLAS?
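The workaround mentioned above is not spelled out here; a minimal sketch of the idea (snapshot the CPU mask early, restore it before forking each worker — names are mine, not from the branch) could look like:

```python
import os

# Snapshot the CPU mask early, before any BLAS library can narrow it.
INITIAL_AFFINITY = os.sched_getaffinity(0)

def reset_affinity():
    # Restore the startup mask, e.g. right before forking a doctest worker.
    os.sched_setaffinity(0, INITIAL_AFFINITY)

reset_affinity()
```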

Last edited 3 years ago by saraedum (previous) (diff)

comment:4 Changed 3 years ago by saraedum

  • Description modified (diff)

The hangs do not happen anymore with OPENBLAS_NUM_THREADS=1 and sage -tp --all finishes in 2:29 minutes (which is the time it takes to run the tests in sage.manifolds.differentiable.tensorfield).

comment:5 follow-up: Changed 3 years ago by embray

What I want to know is, where is this 160 core machine and can I use it to build sage?

comment:6 follow-up: Changed 3 years ago by roed

I haven't seen this on k8s, though I haven't been using it much recently.

comment:7 in reply to: ↑ 5 Changed 3 years ago by saraedum

Replying to embray:

What I want to know is, where is this 160 core machine and can I use it to build sage?

I spawned it with the google cloud credits from slelievre. I think they still work for today if you want to play with it.

comment:8 in reply to: ↑ 6 Changed 3 years ago by saraedum

Replying to roed:

I haven't seen this on k8s, though I haven't been using it much recently.

This seems to be a new issue caused by #24669.

comment:9 Changed 3 years ago by saraedum

See #21323 for some background on OPENBLAS_NUM_THREADS.

comment:10 Changed 3 years ago by saraedum

  • Description modified (diff)

comment:11 Changed 3 years ago by saraedum

Setting export OPENBLAS_MAIN_FREE=1 causes hangs in the "A tour of Sage", same as in comment 3.

comment:12 Changed 3 years ago by saraedum

  • Authors set to Julian Rüth
  • Description modified (diff)

comment:13 Changed 3 years ago by saraedum

  • Branch set to u/saraedum/26118

comment:14 Changed 3 years ago by saraedum

  • Commit set to dd20582d28372ffc38d16444b3f81d5c2885289a
  • Status changed from new to needs_review

New commits:

dd20582  Set OPENBLAS_NUM_THREADS=1

comment:15 Changed 3 years ago by saraedum

Btw., SAGE_NUM_THREADS=160 sage -tp --long --all crashes with segfaults in gentourng; that's probably unrelated.

SAGE_NUM_THREADS=1 SAGE_NUM_THREADS_PARALLEL=160 sage -tp --long --all works however, and finishes after 4 minutes (which is the time that sage.schemes.elliptic_curves.ell_number_field takes…)

Note that the latter setting would not work without the changes introduced in this ticket (all 160 workers would be scheduled on the same core).

comment:16 Changed 3 years ago by gh-timokau

If we decide to set OPENBLAS_NUM_THREADS=1, would it be possible to set it within python instead of in sage-env? That way it would also work when using from sage import ....
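One way to do that (a sketch of the suggestion, not what the branch actually does): set the variable at the very top of Sage's Python entry point, since OpenBLAS reads it only once, when the shared library initializes.

```python
import os

# Must run before the OpenBLAS shared library is first loaded; once it has
# initialized its thread pool, changing the variable has no effect.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

# Only afterwards import anything that links against OpenBLAS, e.g.
# from sage.all import *
```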

comment:17 Changed 3 years ago by embray

  • Status changed from needs_review to needs_work

Let us not set this for all cases; in normal usage it is not a problem, only in the doctests, and only, it seems, in extreme cases. I would prefer to find out exactly what the limits are, limit functionality only near those limits, and maybe investigate what exactly is going on at those limits and fix the underlying issue.

comment:18 Changed 3 years ago by gh-timokau

What makes you think this is only a doctest problem?

comment:19 Changed 3 years ago by saraedum

Yes, please look at my benchmarks above. This is a performance and a stability problem in normal usage.

comment:20 Changed 3 years ago by vbraun

  • Status changed from needs_work to positive_review

I've also noticed the wrong CPU affinity on the buildbot; it is a major nuisance.

comment:21 Changed 3 years ago by vbraun

  • Reviewers set to Volker Braun

comment:22 Changed 3 years ago by vbraun

  • Branch changed from u/saraedum/26118 to dd20582d28372ffc38d16444b3f81d5c2885289a
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:23 Changed 3 years ago by embray

  • Commit dd20582d28372ffc38d16444b3f81d5c2885289a deleted

I don't see why you would want to set this for all use cases. It seems, rather, that libraries which use their own multithreading and would conflict with OpenBLAS's should explicitly call openblas_set_num_threads, as in https://trac.sagemath.org/ticket/21323#comment:7

This just looks like a brute-force workaround for a specific case.

comment:24 Changed 3 years ago by embray

Is there some reliable way to reproduce this problem? I'm having trouble even doing that, though I'm sure it's possible.

comment:25 Changed 3 years ago by vbraun

It was quite reproducible. But AFAIR the CPU affinity is only set in the course of running the tests, not immediately. Wait until make ptestlong is mostly done and then check the taskset output.

comment:26 follow-up: Changed 3 years ago by embray

I'm not sure what I'm looking for here. Julian suggested that it's a problem that occurs in "normal usage". Is there a way to reproduce this problem without having to run make ptestlong?

In any case it sounds like a bug that should be fixed in openblas, not hackishly worked around. The reason I'm confused is that in the doctests we fork for every test, so I don't see why there would be a problem with thread CPU affinity since from test to test it's not even going to be reusing the same threads.

Perhaps I need to study more carefully what exactly this option is doing in openblas.

comment:27 in reply to: ↑ 26 ; follow-up: Changed 3 years ago by saraedum

Replying to embray:

I'm not sure what I'm looking for here. Julian suggested that it's a problem that occurs in "normal usage". Is there a way to reproduce this problem without having to run make ptestlong?

Yes, see ticket description.

comment:28 Changed 3 years ago by embray

I see; it has nothing to do with thread CPU affinity at all, but with process CPU affinity. The mention of OPENBLAS_NUM_THREADS=1 had me thinking this was something about threads. That makes more sense.

What's weird is that gotoblas_affinity_init() still sets up a new CPU affinity mask and installs it with sched_setaffinity() (plus some stuff I don't understand), and then, if and only if OPENBLAS_MAIN_FREE=1, calls sched_setaffinity() again to set it back to the original mask.
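In Python terms, the control flow described above amounts to something like the following (a paraphrase of the described behavior for illustration, not the actual C code in OpenBLAS):

```python
import os

def affinity_init_sketch(main_free: bool) -> set:
    """Paraphrase of the described gotoblas_affinity_init() flow."""
    original = os.sched_getaffinity(0)      # the caller's mask on entry
    os.sched_setaffinity(0, original)       # stand-in for the new mask OpenBLAS installs
    if main_free:                           # only with OPENBLAS_MAIN_FREE=1 ...
        os.sched_setaffinity(0, original)   # ... is the original mask restored
    return original

mask = affinity_init_sketch(main_free=True)
```

Without main_free, the process would be left running under whatever mask the library installed.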

comment:29 in reply to: ↑ 27 Changed 3 years ago by embray

Replying to saraedum:

Replying to embray:

I'm not sure what I'm looking for here. Julian suggested that it's a problem that occurs in "normal usage". Is there a way to reproduce this problem without having to run make ptestlong?

Yes, see ticket description.

I tried that, but there was only one time I was able to get a kind of slow result, and I think I deleted that matrix. Now, no matter how many times I try to create a random matrix, I seem to get ones whose eigenvalues are computed very quickly, and I barely see any problem. Perhaps I need some specific matrix that exhibits the problem.

Anyways, now that I've looked at the openblas code I have no problem with this workaround. I want to understand the code in openblas a bit better, but I agree that for the purposes of Sage we don't have any reason for openblas to mess with CPU affinity. Perhaps we should even build Sage's openblas with NO_AFFINITY=1 (but keep the environment variable set for non-Sage builds of openblas).

comment:30 Changed 3 years ago by embray

I'm surprised OPENBLAS_NUM_THREADS=1 didn't work for you:

#ifdef USE_OPENMP
  numprocs = 0;
#else
  numprocs = readenv_atoi("OPENBLAS_NUM_THREADS");
  if (numprocs == 0) numprocs = readenv_atoi("GOTO_NUM_THREADS");
#endif

  if (numprocs == 0) numprocs = readenv_atoi("OMP_NUM_THREADS");

  numnodes = 1;

  if (numprocs == 1) {
    disable_mapping = 1;
    return;
  }

comment:31 Changed 3 years ago by saraedum

Sorry, I don't understand what you mean by "didn't work for you".

comment:32 Changed 3 years ago by embray

Nevermind, apparently it did: https://trac.sagemath.org/ticket/26118#comment:4 I misread something else.

Note: See TracTickets for help on using tickets.