Opened 3 years ago

Closed 3 years ago

Last modified 17 months ago

#22021 closed defect (fixed)

OpenBLAS randomly crashes / deadlocks

Reported by: vbraun
Owned by:
Priority: major
Milestone: sage-7.5
Component: linear algebra
Keywords: random_fail
Cc: fbissey, jpflori, jdemeyer
Merged in:
Authors: Volker Braun
Reviewers: François Bissey
Report Upstream: N/A
Work issues:
Branch: 0266727 (Commits)
Commit:
Dependencies:
Stopgaps:

Description

OpenBLAS occasionally crashes or deadlocks, most often in src/sage/matrix/matrix_integer_dense.pyx but also in other places. It's always some longish linear algebra computation. Examples in the comments.

Change History (42)

comment:1 Changed 3 years ago by vbraun

Example 1:

sage -t --long src/sage/matrix/matrix_integer_dense.pyx
    Killed due to segmentation fault
**********************************************************************
Tests run before process (pid=8896) failed:
sage: a = matrix(ZZ, 3,3, range(9)); a ## line 20 ##
[0 1 2]
[3 4 5]
[6 7 8]
[...]
sage: A._solve_iml(B, right=False) ## line 4108 ##
sage: A._solve_iml(B, right=True) ## line 4112 ##
sage: A = random_matrix(ZZ, 2000, 2000) ## line 4119 ##
sage: B = random_matrix(ZZ, 2000, 2000) ## line 4120 ##
sage: t0 = walltime() ## line 4121 ##
sage: alarm(2); A._solve_iml(B)  # long time ## line 4122 ##
------------------------------------------------------------------------
/home/buildbot/slave/sage_git/build/local/lib/python2.7/site-packages/cysignals/signals.so(+0x3061)[0xf69ed061]
/home/buildbot/slave/sage_git/build/local/lib/python2.7/site-packages/cysignals/signals.so(+0x30cc)[0xf69ed0cc]
/home/buildbot/slave/sage_git/build/local/lib/python2.7/site-packages/cysignals/signals.so(+0x6ba3)[0xf69f0ba3]
[0xf76e8cd0]
/home/buildbot/slave/sage_git/build/local/lib/libopenblas.so.0(dgemm_kernel+0x7b)[0xf4dd55fb]
------------------------------------------------------------------------
sage: t = walltime(t0) ## line 4126 ##

**********************************************************************

Example 2:

sage -t --long src/sage/matrix/matrix_integer_dense.pyx
    Killed due to segmentation fault
**********************************************************************
Tests run before process (pid=13893) failed:
sage: a = matrix(ZZ, 3,3, range(9)); a ## line 20 ##
[0 1 2]
[3 4 5]
[6 7 8]
[...]
sage: A = random_matrix(ZZ, 2000, 2000) ## line 4285 ##
sage: B = random_matrix(ZZ, 2000, 2000) ## line 4286 ##
sage: t0 = walltime() ## line 4287 ##
sage: alarm(2); A._solve_flint(B)  # long time ## line 4288 ##
sage: t = walltime(t0) ## line 4292 ##
sage: t < 10 or t ## line 4293 ##
True
sage: sig_on_count() # check sig_on/off pairings (virtual doctest) ## line 4299 ##
0
sage: t = ModularSymbols(11,sign=1).hecke_matrix(2) ## line 4548 ##
------------------------------------------------------------------------

**********************************************************************

Example 3:

sage -t --long src/sage/schemes/elliptic_curves/ell_rational_field.py
    Timed out
**********************************************************************
Tests run before process (pid=5657) timed out:
sage: E = EllipticCurve([1,2,3,4,5]); E ## line 140 ##
Elliptic Curve defined by y^2 + x*y + 3*y = x^3 + 2*x^2 + 4*x + 5 over Rational Field
sage: EllipticCurve([4,5]).ainvs() ## line 146 ##
[...]
sage: factor(E.modular_degree()) ## line 3750 ##
2^7 * 2617
sage: E = EllipticCurve('11a') ## line 3755 ##
sage: for M in range(1,11): print(E.modular_degree(M=M)) # long time (20s on 2009 MBP) ## line 3756 ##
1
1
3
2
7
45

**********************************************************************
----------------------------------------------------------------------
sage -t --long src/sage/schemes/elliptic_curves/ell_rational_field.py  # Timed out

comment:2 Changed 3 years ago by vbraun

  • Cc fbissey jpflori jdemeyer added

comment:3 Changed 3 years ago by vbraun

Here is a stack trace of a deadlocked test on OS X:

(lldb) process attach --pid 30237
Process 30237 stopped
* thread #1: tid = 0x1eea45e, 0x00007fff8c08d10a libsystem_kernel.dylib`__semwait_signal + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff8c08d10a libsystem_kernel.dylib`__semwait_signal + 10
libsystem_kernel.dylib`__semwait_signal:
->  0x7fff8c08d10a <+10>: jae    0x7fff8c08d114            ; <+20>
    0x7fff8c08d10c <+12>: movq   %rax, %rdi
    0x7fff8c08d10f <+15>: jmp    0x7fff8c0877f2            ; cerror
    0x7fff8c08d114 <+20>: retq   

Executable module set to "/Users/buildslave-sage/slave/sage_git/build/local/bin/python".
Architecture set to: x86_64-apple-macosx.
(lldb) bt
* thread #1: tid = 0x1eea45e, 0x00007fff8c08d10a libsystem_kernel.dylib`__semwait_signal + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff8c08d10a libsystem_kernel.dylib`__semwait_signal + 10
    frame #1: 0x00007fff90e8f787 libsystem_pthread.dylib`pthread_join + 444
    frame #2: 0x000000010be46e9e libopenblas_sandybridgep-r0.2.19.dylib`blas_thread_shutdown_ + 238
    frame #3: 0x00007fff90e8fda7 libsystem_pthread.dylib`_pthread_fork_prepare + 85
    frame #4: 0x00007fff81da9a74 libSystem.B.dylib`libSystem_atfork_prepare + 24
    frame #5: 0x00007fff8f2a7f7c libsystem_c.dylib`fork + 12
    frame #6: 0x00007fff8f2a7310 libsystem_c.dylib`forkpty + 58
    frame #7: 0x000000010999f493 libpython2.7.dylib`posix_forkpty + 35
    frame #8: 0x00000001099514ef libpython2.7.dylib`PyEval_EvalFrameEx + 25631
    frame #9: 0x000000010995126e libpython2.7.dylib`PyEval_EvalFrameEx + 24990
    frame #10: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #11: 0x00000001098bc0dd libpython2.7.dylib`function_call + 349
    frame #12: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #13: 0x00000001098989ec libpython2.7.dylib`instancemethod_call + 140
    frame #14: 0x000000010b27f072 sagespawn.so`__pyx_pw_4sage_10interfaces_9sagespawn_9SageSpawn_3_spawnpty + 65 at sagespawn.c:4885
    frame #15: 0x000000010b27f031 sagespawn.so`__pyx_pw_4sage_10interfaces_9sagespawn_9SageSpawn_3_spawnpty + 123
    frame #16: 0x000000010b27efb6 sagespawn.so`__pyx_pw_4sage_10interfaces_9sagespawn_9SageSpawn_3_spawnpty(__pyx_self=<unavailable>, __pyx_args=<unavailable>, __pyx_kwds=<unavailable>) + 86
    frame #17: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #18: 0x000000010994de87 libpython2.7.dylib`PyEval_EvalFrameEx + 11703
    frame #19: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #20: 0x00000001099511cd libpython2.7.dylib`PyEval_EvalFrameEx + 24829
    frame #21: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #22: 0x00000001098bc0dd libpython2.7.dylib`function_call + 349
    frame #23: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #24: 0x00000001098989ec libpython2.7.dylib`instancemethod_call + 140
    frame #25: 0x000000010b27bb80 sagespawn.so`__Pyx_PyObject_Call(func=0x00000002403d9b90, arg=<unavailable>, kw=<unavailable>) + 64 at sagespawn.c:4885
    frame #26: 0x000000010b281fd0 sagespawn.so`__pyx_pw_4sage_10interfaces_9sagespawn_9SageSpawn_1__init__ + 1745 at sagespawn.c:1564
    frame #27: 0x000000010b2818ff sagespawn.so`__pyx_pw_4sage_10interfaces_9sagespawn_9SageSpawn_1__init__(__pyx_self=<unavailable>, __pyx_args=<unavailable>, __pyx_kwds=<unavailable>) + 111
    frame #28: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #29: 0x00000001098989ec libpython2.7.dylib`instancemethod_call + 140
    frame #30: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #31: 0x00000001099033ed libpython2.7.dylib`slot_tp_init + 109
    frame #32: 0x0000000109900d2a libpython2.7.dylib`type_call + 202
    frame #33: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #34: 0x000000010994eecd libpython2.7.dylib`PyEval_EvalFrameEx + 15869
    frame #35: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #36: 0x00000001098bbffc libpython2.7.dylib`function_call + 124
    frame #37: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #38: 0x00000001098989ec libpython2.7.dylib`instancemethod_call + 140
    frame #39: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #40: 0x000000010994eecd libpython2.7.dylib`PyEval_EvalFrameEx + 15869
    frame #41: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #42: 0x00000001099511cd libpython2.7.dylib`PyEval_EvalFrameEx + 24829
    frame #43: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #44: 0x00000001098bc0dd libpython2.7.dylib`function_call + 349
    frame #45: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #46: 0x000000010994de87 libpython2.7.dylib`PyEval_EvalFrameEx + 11703
    frame #47: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #48: 0x00000001098bc0dd libpython2.7.dylib`function_call + 349
    frame #49: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #50: 0x00000001098989ec libpython2.7.dylib`instancemethod_call + 140
    frame #51: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #52: 0x000000010994de87 libpython2.7.dylib`PyEval_EvalFrameEx + 11703
    frame #53: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #54: 0x0000000237667191 matrix_integer_dense.so`__Pyx_PyFunction_FastCallDict(func=0x0000000114422488, args=<unavailable>, nargs=<unavailable>, kwargs=0x9000000000000000) + 113 at matrix_integer_dense.c:57331
    frame #55: 0x00000002376affa6 matrix_integer_dense.so`__pyx_pw_4sage_6matrix_20matrix_integer_dense_20Matrix_integer_dense_157_singular_(__pyx_v_self=<unavailable>, __pyx_args=<unavailable>, __pyx_kwds=<unavailable>) + 6134 at matrix_integer_dense.c:48368
    frame #56: 0x000000010995169b libpython2.7.dylib`PyEval_EvalFrameEx + 26059
    frame #57: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #58: 0x00000001098bbffc libpython2.7.dylib`function_call + 124
    frame #59: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #60: 0x00000001098989ec libpython2.7.dylib`instancemethod_call + 140
    frame #61: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #62: 0x0000000109902f95 libpython2.7.dylib`slot_tp_call + 101
    frame #63: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #64: 0x000000010994eecd libpython2.7.dylib`PyEval_EvalFrameEx + 15869
    frame #65: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #66: 0x0000000109951373 libpython2.7.dylib`PyEval_EvalFrameEx + 25251
    frame #67: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #68: 0x00000001099511cd libpython2.7.dylib`PyEval_EvalFrameEx + 24829
    frame #69: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #70: 0x00000001099511cd libpython2.7.dylib`PyEval_EvalFrameEx + 24829
    frame #71: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #72: 0x00000001099511cd libpython2.7.dylib`PyEval_EvalFrameEx + 24829
    frame #73: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #74: 0x00000001098bbffc libpython2.7.dylib`function_call + 124
    frame #75: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #76: 0x00000001098989ec libpython2.7.dylib`instancemethod_call + 140
    frame #77: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #78: 0x0000000109902f95 libpython2.7.dylib`slot_tp_call + 101
    frame #79: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #80: 0x000000010994eecd libpython2.7.dylib`PyEval_EvalFrameEx + 15869
    frame #81: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #82: 0x00000001099511cd libpython2.7.dylib`PyEval_EvalFrameEx + 24829
    frame #83: 0x000000010995126e libpython2.7.dylib`PyEval_EvalFrameEx + 24990
    frame #84: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #85: 0x00000001098bbffc libpython2.7.dylib`function_call + 124
    frame #86: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #87: 0x00000001098989ec libpython2.7.dylib`instancemethod_call + 140
    frame #88: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #89: 0x00000001099033ed libpython2.7.dylib`slot_tp_init + 109
    frame #90: 0x0000000109900d2a libpython2.7.dylib`type_call + 202
    frame #91: 0x0000000109884403 libpython2.7.dylib`PyObject_Call + 67
    frame #92: 0x000000010994eecd libpython2.7.dylib`PyEval_EvalFrameEx + 15869
    frame #93: 0x000000010995126e libpython2.7.dylib`PyEval_EvalFrameEx + 24990
    frame #94: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #95: 0x00000001099511cd libpython2.7.dylib`PyEval_EvalFrameEx + 24829
    frame #96: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #97: 0x00000001099511cd libpython2.7.dylib`PyEval_EvalFrameEx + 24829
    frame #98: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #99: 0x00000001099511cd libpython2.7.dylib`PyEval_EvalFrameEx + 24829
    frame #100: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #101: 0x00000001099511cd libpython2.7.dylib`PyEval_EvalFrameEx + 24829
    frame #102: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #103: 0x00000001099511cd libpython2.7.dylib`PyEval_EvalFrameEx + 24829
    frame #104: 0x000000010995219c libpython2.7.dylib`PyEval_EvalCodeEx + 2124
    frame #105: 0x00000001099522b9 libpython2.7.dylib`PyEval_EvalCode + 25
    frame #106: 0x000000010997c2c3 libpython2.7.dylib`PyRun_FileExFlags + 291
    frame #107: 0x000000010997dd67 libpython2.7.dylib`PyRun_SimpleFileExFlags + 215
    frame #108: 0x0000000109997873 libpython2.7.dylib`Py_Main + 3443
    frame #109: 0x00007fff8f9385ad libdyld.dylib`start + 1
    frame #110: 0x00007fff8f9385ad libdyld.dylib`start + 1

comment:4 Changed 3 years ago by vbraun

Running while true ; do ./sage -t --long src/sage/matrix/matrix_integer_dense.pyx ; done usually hangs or crashes within about 10-100 tries; see the snippet below.
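For readability, the same stress loop as a shell snippet (run from SAGE_ROOT; the path is the one from this ticket):

# Repeatedly run the long doctests until OpenBLAS crashes or deadlocks;
# in practice this happens within roughly 10-100 iterations.
while true; do
    ./sage -t --long src/sage/matrix/matrix_integer_dense.pyx
done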

comment:5 Changed 3 years ago by vbraun

Another possibility: The alarm() could mess up OpenBLAS's internal thread pool...

comment:6 Changed 3 years ago by vbraun

Here is a similar backtrace from Linux on 64-bit Haswell-E (libopenblas_haswellp-r0.2.19.so):

#0  0x00007fa6cf0756bd in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fa6b9dcb51e in blas_thread_shutdown_ () from /home/vbraun/Code/sage.git/local/lib/libopenblas.so.0
#2  0x00007fa6ce663aa5 in fork () from /lib64/libc.so.6
#3  0x00007fa6cec67840 in forkpty () from /lib64/libutil.so.1
#4  0x00007fa6cf3d6583 in posix_forkpty (self=<optimized out>, noargs=<optimized out>) at ./Modules/posixmodule.c:4012
#5  0x00007fa6cf391cca in call_function (oparg=<optimized out>, pp_stack=0x7ffcbbf7d5d0) at Python/ceval.c:4019
#6  PyEval_EvalFrameEx (f=f@entry=0x7f9e6fa8b830, throwflag=throwflag@entry=0) at Python/ceval.c:2681
#7  0x00007fa6cf391ca0 in fast_function (nk=<optimized out>, na=0, n=0, pp_stack=0x7ffcbbf7d6f0, func=<optimized out>) at Python/ceval.c:4121
#8  call_function (oparg=<optimized out>, pp_stack=0x7ffcbbf7d6f0) at Python/ceval.c:4056
#9  PyEval_EvalFrameEx (f=f@entry=0x7f9e72d799c0, throwflag=throwflag@entry=0) at Python/ceval.c:2681
#10 0x00007fa6cf392b2c in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f9e740b9c80, 
    argcount=2, kws=kws@entry=0x7f9e6f4b5e78, kwcount=4, defs=0x7fa6bedf0cc8, defcount=5, closure=0x0) at Python/ceval.c:3267
#11 0x00007fa6cf30c8ed in function_call (func=0x7fa6be8d8320, arg=0x7f9e740b9c68, kw=0x7f9e6fa5a910) at Objects/funcobject.c:526
#12 0x00007fa6cf2dbe33 in PyObject_Call (func=func@entry=0x7fa6be8d8320, arg=arg@entry=0x7f9e740b9c68, kw=kw@entry=0x7f9e6fa5a910)
    at Objects/abstract.c:2529
#13 0x00007fa6cf2ed01c in instancemethod_call (func=0x7fa6be8d8320, arg=0x7f9e740b9c68, kw=0x7f9e6fa5a910) at Objects/classobject.c:2602
#14 0x00007fa6be661ab9 in __Pyx_PyObject_Call (kw=0x7f9e6fa5a910, arg=0x7f9e6fb68990, func=0x7f9e6f8e73c0)
    at /home/vbraun/Sage/git/src/build/cythonized/sage/interfaces/sagespawn.c:4885
#15 __pyx_pf_4sage_10interfaces_9sagespawn_9SageSpawn_2_spawnpty (__pyx_self=<optimized out>, __pyx_v_kwds=0x7f9e6fa5a910, 
    __pyx_v_args=<optimized out>, __pyx_v_self=<optimized out>) at /home/vbraun/Sage/git/src/build/cythonized/sage/interfaces/sagespawn.c:1797
#16 __pyx_pw_4sage_10interfaces_9sagespawn_9SageSpawn_3_spawnpty (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at /home/vbraun/Sage/git/src/build/cythonized/sage/interfaces/sagespawn.c:1763
#17 0x00007fa6cf2dbe33 in PyObject_Call (func=func@entry=0x7fa6beb1aad0, arg=arg@entry=0x7f9e74261560, kw=kw@entry=0x7f9e6f4e0398)
    at Objects/abstract.c:2529
#18 0x00007fa6cf38cede in ext_do_call (nk=<optimized out>, na=2, flags=<optimized out>, pp_stack=0x7ffcbbf7dc30, func=0x7fa6beb1aad0)
    at Python/ceval.c:4348
#19 PyEval_EvalFrameEx (f=f@entry=0x7f9e72d79580, throwflag=throwflag@entry=0) at Python/ceval.c:2720
#20 0x00007fa6cf392b2c in PyEval_EvalCodeEx (co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=<optimized out>, 
    argcount=argcount@entry=5, kws=0x7f9e72d2c6c8, kwcount=0, defs=0x7fa6bed85068, defcount=3, closure=0x0) at Python/ceval.c:3267

comment:7 Changed 3 years ago by vbraun

comment:8 Changed 3 years ago by fbissey

I am testing the OMP_NUM_THREADS=1 idea. It seems to apply even if you haven't compiled OpenBLAS with OpenMP. I have done quite a few loops without incident so far.

comment:9 Changed 3 years ago by fbissey

OMP_NUM_THREADS=1 seems effective at suppressing the problem, so threading is probably related to the issue. The tests also run quicker and use fewer resources.
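For reference, the test being run here is roughly the following (a sketch; it assumes the variable is exported in the invoking shell):

# Force single-threaded OpenMP (and hence OpenBLAS) for this shell,
# then hammer the doctest that was failing before.
export OMP_NUM_THREADS=1
while true; do ./sage -t --long src/sage/matrix/matrix_integer_dense.pyx; done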

comment:10 Changed 3 years ago by vbraun

  • Branch set to u/vbraun/openblas_randomly_crashes

comment:11 Changed 3 years ago by fbissey

  • Commit set to 27f412b65b7b13ec908eebec9f26d7036a374174

That's brutal.


New commits:

27f412b  Disable multi-threading in OpenBLAS
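(The commit itself is not reproduced here. For context, OpenBLAS exposes a build-time switch for this, so the spkg change presumably amounts to something like the sketch below; the exact spkg-install lines may differ.)

# A sketch only: USE_THREAD=0 builds a purely single-threaded libopenblas,
# leaving no internal thread pool to deadlock across fork().
$MAKE USE_THREAD=0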

comment:12 Changed 3 years ago by jpflori

But acceptable, as long as we don't properly use OpenBLAS threading in Sage, if that ever happens. IIRC even LinBox assumes (or assumed, and now enforces at runtime) that OpenBLAS is single-threaded to get optimal performance.

comment:13 follow-up: Changed 3 years ago by jpflori

Maybe setting OMP_NUM_THREADS by default to 1 at runtime would be less brutal?

comment:14 Changed 3 years ago by jpflori

See #21323, where a related discussion took place.

comment:15 in reply to: ↑ 13 Changed 3 years ago by fbissey

Replying to jpflori:

Maybe setting OMP_NUM_THREADS by default to 1 at runtime would be less brutal?

More complicated and error-prone - by that I mean someone will mess it up.

comment:16 Changed 3 years ago by fbissey

Doing it for each call from the code doesn't scale very well. You'd have to find all the relevant code and fix the non-Sage code, like cvxopt and scipy, as well. Good luck.

comment:17 follow-up: Changed 3 years ago by jpflori

I mean setting it globally in sage-env...
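Something along these lines in sage-env, say (a sketch only; the guard against clobbering a user's own setting is my assumption, not a committed change):

# Default to single-threaded OpenMP (and hence OpenBLAS) unless the user
# has already chosen a value themselves.
if [ -z "$OMP_NUM_THREADS" ]; then
    OMP_NUM_THREADS=1
    export OMP_NUM_THREADS
fi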

comment:18 in reply to: ↑ 17 Changed 3 years ago by fbissey

Replying to jpflori:

I mean setting it globally in sage-env...

That's what I understood before you mentioned #21323. Someone is bound to toy with stuff in sage-env. I am half surprised we don't have more incident reports because of it. I'll admit that little monster (sage-env) is intimidating.

comment:19 follow-up: Changed 3 years ago by vbraun

So, any conclusions? We are currently in pretty bad shape as far as running doctests is concerned; the testsuite fails for me about 50% of the time.

comment:20 in reply to: ↑ 19 Changed 3 years ago by fbissey

Replying to vbraun:

So any conclusion? We are currently in a pretty bad shape as far as running doctests is concerned, about 50% of the time the testsuite fails for me.

No bulletproof way, I guess. It can always be tampered with, one way or the other. Putting it in sage-env has the advantage that it can be picked up by distros, which may not be willing to package their blas/lapack (whichever it is) single-threaded. From my sage-on-distro point of view the only difficulty is when someone imports sage from python rather than by starting the sage script (yes, you can do that; env.py makes that possible, _I_ made sure of that).

So after bashing it, I will say that sage-env offers the most flexibility - and flexibility always cuts both ways.

comment:21 follow-up: Changed 3 years ago by vbraun

Globally setting OMP_NUM_THREADS=1 will affect all OpenMP programs, not just OpenBLAS. If that's what you really want then I'm fine with it; just pointing out the obvious.

comment:22 in reply to: ↑ 21 Changed 3 years ago by fbissey

Replying to vbraun:

Globally setting OMP_NUM_THREADS=1 will affect all OpenMP programs, not just OpenBLAS. If thats what you really want then I'm fine with it, just pointing out the obvious.

Thanks for pointing out the obvious. I guess your branch has the fewest side effects on Sage as a whole.

comment:23 Changed 3 years ago by vbraun

Is that a positive review? ;-)

comment:24 Changed 3 years ago by fbissey

  • Authors set to Volker Braun
  • Reviewers set to François Bissey
  • Status changed from new to needs_review

I was waiting for you to fill everything in and put it in "needs_review" but what the heck. Done!

comment:25 Changed 3 years ago by vbraun

  • Status changed from needs_review to positive_review

Thanks!

comment:26 Changed 3 years ago by vbraun

  • Status changed from positive_review to needs_work

Hmm, it fails on OS X:

[openblas-0.2.19.p0] gfortran -m128bit-long-double -Wall -m64  -L/Users/buildslave-sage/slave/sage_git/build/local/lib -Wl,-rpath,/Users/buildslave-sage/slave/sage_git/build/local/lib  -o sblat3 sblat3.o ../libopenblas_sandybridgep-r0.2.19.a -lpthread -lgfortran -lpthread -lgfortran 
[openblas-0.2.19.p0] ld: file too small (length=0) file '../libopenblas_sandybridgep-r0.2.19.a' for architecture x86_64
[openblas-0.2.19.p0] collect2: error: ld returned 1 exit status
[openblas-0.2.19.p0] make[4]: *** [cblat1] Error 1

comment:27 Changed 3 years ago by fbissey

Looks like a test program problem again. I will investigate ASAP.

comment:28 Changed 3 years ago by fbissey

Hmm, I cannot reproduce it locally; my line is slightly different:

gfortran -m128bit-long-double -Wall -m64  -L/Users/fbissey/build/sage/local/lib -Wl,-rpath,/Users/fbissey/build/sage/local/lib  -o sblat3 sblat3.o ../libopenblas_haswell-r0.2.19.a -lgfortran -lgfortran

Notably, I don't have -lpthread, which is suspicious. Was this a "binary" build?

comment:29 Changed 3 years ago by vbraun

No, just a normal (incremental) build

comment:30 Changed 3 years ago by fbissey

There are other suspicious bits in the build log:

gcc -O2 -DMAX_STACK_ALLOC=2048 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DNO_WARMUP -DMAX_CPU_NUMBER=4 -DASMNAME=_ -DASMFNAME=__ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I.. -all_load -headerpad_max_install_names -install_name "/Users/fbissey/build/sage/local/var/tmp/sage/build/openblas-0.2.19/src/exports/../libopenblas_haswell-r0.2.19.dylib" -dynamiclib -o ../libopenblas_haswell-r0.2.19.dylib ../libopenblas_haswell-r0.2.19.a -Wl,-exported_symbols_list,osx.def  -L/Users/fbissey/build/sage/local/lib -L/Users/fbissey/build/sage/local/lib/gcc/x86_64-apple-darwin16.1.0/5.4.0 -L/Users/fbissey/build/sage/local/lib/gcc/x86_64-apple-darwin16.1.0/5.4.0/../../.. -L/Users/fbissey/build/sage/local/lib -L/Users/fbissey/build/sage/local/lib/gcc/x86_64-apple-darwin16.1.0/5.4.0 -L/Users/fbissey/build/sage/local/lib/gcc/x86_64-apple-darwin16.1.0/5.4.0/../../..  -lgfortran -lSystem -lquadmath -lm -lSystem -lgfortran -lSystem -lquadmath -lm -lSystem  
ld: warning: object file (/Users/fbissey/build/sage/local/lib/gcc/x86_64-apple-darwin16.1.0/5.4.0/libgcc.a(_muldi3.o)) was built for newer OSX version (10.9) than being linked (10.6)
ld: warning: object file (/Users/fbissey/build/sage/local/lib/gcc/x86_64-apple-darwin16.1.0/5.4.0/libgcc.a(_negdi2.o)) was built for newer OSX version (10.9) than being linked (10.6)
....

I get those after the tests are run and before the install (so it happens after your error). I wonder if the problem they are alluding to is fatal in your build; it certainly merits investigation. Also, what is the exact version of OS X involved?

comment:31 Changed 3 years ago by fbissey

It looks like OpenBLAS wants a build compatible with 10.6. From Makefile.system:

ifeq ($(OSNAME), Darwin)
export MACOSX_DEPLOYMENT_TARGET=10.6
MD5SUM = md5 -r
endif

We currently set MACOSX_DEPLOYMENT_TARGET to the system value or 10.9, whichever is lower. I don't think OpenBLAS should override an external setting of that kind.
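One way around the hard-coded export would be to pass our value on the make command line, since command-line assignments take precedence over assignments made inside a Makefile (just a sketch, not what the branch actually does):

# Hypothetical spkg-install fragment: propagate Sage's deployment target,
# overriding the "export MACOSX_DEPLOYMENT_TARGET=10.6" in Makefile.system.
$MAKE MACOSX_DEPLOYMENT_TARGET="$MACOSX_DEPLOYMENT_TARGET"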

comment:32 follow-up: Changed 3 years ago by jpflori

Sorry, I don't have much Sage time these days. The correct env variable is OPENBLAS_NUM_THREADS. See:

comment:33 in reply to: ↑ 32 Changed 3 years ago by fbissey

Replying to jpflori:

Sorry, I don't have much sage time these days. The correct env variable is OPENBLAS_NUM_THREADS. See:

That is the right variable for runtime, not build time. Volker got it right, according to the same link. However, for some reason I think threads were not completely disabled in his build.
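To spell out the distinction (a sketch against a stock OpenBLAS, not our spkg specifically):

# Build time: compile OpenBLAS without any thread pool at all.
make USE_THREAD=0

# Run time: keep a threaded build but restrict it to a single thread.
export OPENBLAS_NUM_THREADS=1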

comment:34 Changed 3 years ago by fbissey

I cannot figure out where the -lpthread originates in your build, Volker.

comment:35 Changed 3 years ago by fbissey

I think I found where you get -lpthread from, and that may be the root cause. We may have a "shell" accident depending on the version. It is still rather strange that I cannot reproduce it.

comment:36 Changed 3 years ago by fbissey

  • Branch changed from u/vbraun/openblas_randomly_crashes to u/fbissey/openblas_randomly_crashes
  • Commit changed from 27f412b65b7b13ec908eebec9f26d7036a374174 to 026672777b73c63b58800135796caa3981faf924

Let's try with these bits of QA.


New commits:

0266727  Fix a few QA in Openblas

comment:37 Changed 3 years ago by fbissey

  • Status changed from needs_work to needs_review

Two of the patches I have added to this branch have now been upstreamed. Only the last one, about the use of the SMP variable in an ifdef, has not been submitted yet. That one cures the fact that you get -lpthread when you shouldn't. It feels like something other than a recent bash is used on that machine.

comment:38 Changed 3 years ago by vbraun

  • Status changed from needs_review to positive_review

comment:39 Changed 3 years ago by vbraun

Followup at #22100

comment:40 Changed 3 years ago by vbraun

  • Branch changed from u/fbissey/openblas_randomly_crashes to 026672777b73c63b58800135796caa3981faf924
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:41 Changed 2 years ago by mderickx

  • Commit 026672777b73c63b58800135796caa3981faf924 deleted

Apparently this still happens sometimes in Sage 8.1.beta6; I created #23933 for this.

comment:42 Changed 17 months ago by saraedum

You discarded the option to set OMP_NUM_THREADS=1. What about OPENBLAS_NUM_THREADS=1? This came up in #26118 (sage -tp does not scale to many cores).
