Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#26585 closed defect (fixed)

Calling openblas_set_num_threads in the main process of Sage can lead to segfaults

Reported by: cpernet Owned by:
Priority: major Milestone: sage-8.5
Component: linear algebra Keywords:
Cc: embray Merged in:
Authors: Erik Bray Reviewers: Erik Bray
Report Upstream: Fixed upstream, but not in a stable release. Work issues:
Branch: 9dc0772 (Commits, GitHub, GitLab) Commit:
Dependencies: Stopgaps:

Status badges

Description (last modified by embray)

Note: This does not happen currently in Sage, but does result from attempts to add multi-threading support to fflas-ffpack.

Multiplying a 1000x1000 matrix over GF(11) causes a segfault on a 32 core server.

sage: a=random_matrix(GF(11),1000)
sage: b = a*a
Erreur de segmentation

gdb says it comes from OpenBLAS thread manager.

Thread 48 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fe738c31700 (LWP 18829)]
0x00007fffef552d0e in blas_thread_server ()
   from /home/pernet/sage/local/lib/libopenblas.so.0
(gdb) where
#0  0x00007fffef552d0e in blas_thread_server ()
   from /home/pernet/sage/local/lib/libopenblas.so.0
#1  0x00007ffff7d73f2a in start_thread (arg=0x7fe738c31700)
    at pthread_create.c:463
#2  0x00007ffff7b08f0f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

but in valgrind mode, everything works just fine.

Upstream PR: https://github.com/xianyi/OpenBLAS/pull/1837/files

Change History (14)

comment:1 Changed 3 years ago by cpernet

  • Type changed from PLEASE CHANGE to defect

comment:2 Changed 3 years ago by embray

  • Cc embray added; ebray removed

I have a suspicion that this is an openblas bug but it's not clear yet.

comment:3 Changed 3 years ago by embray

Somehow, it seems like, the blas_thread_init routine is being called multiple times, when it shouldn't be.

comment:4 Changed 3 years ago by embray

Hmm; I think I see the problem. When the openblas library is first initialized it runs through a number of initialization routines, including a function called blas_thread_init. By default the blas_num_threads is 1, so nothing very interesting happens in blas_thread_init. Nevertheless, it concludes by setting a global variable blas_server_avail = 1. This prevents any subsequent calls to blas_thread_init from doing anything.

Normally the only way blas_server_avail gets reset to zero is if the function blas_thread_shutdown is called. This can be called manually, I think, though there is usually any reason to. But there is one case in which it does get called transparently: When openblas is initialized it registers a pthread_atfork handler to call blas_thread_shutdown prior to forking (so that the child process gets a clean threading state). In the child process the thread manager is then re-enabled (blas_thread_init). But in the parent it is not immediately re-initialized.

Rather, most functions in openblas that use threads (e.g. blas_exec_async) check blas_server_avail and call blas_thread_init if it is zero. However, openblas_set_num_threads does not perform such a check. But if it is called to increase the number of threads, it does go ahead and create new threads without checking blas_server_avail at all. The end result is as if blas_thread_init were called (more or less, with some exceptions), but openblas doesn't think threads have been initialized. Then when blas_thread_init does get called it starts mutating global data (such as mutexes) that the now running threads are already using, leading to segfaults in those threads.

comment:5 Changed 3 years ago by embray

Confirmed. Simply adding if (unlikely(blas_server_avail == 0)) blas_thread_init(); at the beginning of openblas_set_num_threads was good enough to fix it.

comment:6 Changed 3 years ago by embray

  • Description modified (diff)
  • Report Upstream changed from N/A to Reported upstream. No feedback yet.

Added upstream pull request; if someone wants to review it I can also add the patch to Sage's openblas package.

comment:7 Changed 3 years ago by vdelecroix

Erik openblas patch is available for testing at branch public/26585-openblas_patch.

comment:8 Changed 3 years ago by embray

  • Branch set to public/26585-openblas_patch
  • Commit set to 9dc0772cd68b9d66285cea502a13b0257ce123de

Cool, might as well add it to the ticket. If we need to update it later nbd.


New commits:

9dc0772patch for openblas

comment:9 Changed 3 years ago by embray

  • Authors set to Erik Bray
  • Report Upstream changed from Reported upstream. No feedback yet. to Fixed upstream, but not in a stable release.
  • Reviewers set to Erik Bray
  • Status changed from new to needs_review

Fix accepted upstream as-is.

comment:10 Changed 3 years ago by embray

  • Status changed from needs_review to positive_review

comment:11 Changed 3 years ago by vdelecroix

Good news. Thanks Erik for taking care of it!

comment:12 Changed 3 years ago by embray

  • Summary changed from Matrix product over GF(11) segfaults on multicore server to Calling openblas_set_num_threads in the main process of Sage can lead to segfaults

Updated the title of the ticket to reflect the broader problem (if I need to look for this later I won't remember the exact example of multiplying matrices in GF(11) :)

comment:13 Changed 3 years ago by vbraun

  • Branch changed from public/26585-openblas_patch to 9dc0772cd68b9d66285cea502a13b0257ce123de
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:14 Changed 3 years ago by embray

  • Commit 9dc0772cd68b9d66285cea502a13b0257ce123de deleted

I just realized this probably should have increased the patch version of the SPKG. Otherwise it won't get updated in incremental builds.

It's not clear to me that this patch is immediately needed though...

Note: See TracTickets for help on using tickets.