Opened 3 years ago

Closed 15 months ago

#27434 closed defect (worksforme)

Cygwin buildbot hangs in building the docs

Reported by: embray Owned by: embray
Priority: critical Milestone: sage-duplicate/invalid/wontfix
Component: porting: Cygwin Keywords:
Cc: Merged in:
Authors: Reviewers:
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Status badges


A few month ago a set up an experimental buildbot worker for Cygwin so that we can hopefully have consistent and regular Cygwin builds. (I used to have a Cygwin patchbot too but it's been mysteriously down for a while; I need to check on that).

The buildbot has had a few problems getting off the ground though. The biggest blocker, I think, is that it hangs during building the reference docs, specifically the manifolds docs (which are the first it tries to build).

If I modify the docbuilder code to run the docbuild serially (i.e. replace multiprocessing.map_async with plain map()) the docbuild completes successfully.

It turns out it's not just the docbuild. When running the sage.manifolds tests like

./sage -t --long -p 4 src/sage/manifolds/

a few of the test modules succeed, but eventually all of the works grind to a complete halt and get stuck.

I would suspect a problem like #27213, but I have built OpenBLAS on that system with the fix for that issue. So it's probably something else.

I don't observe this problem on any other system--just the one I happened to set up as the buildbot. It could also happen to be a bug in the Cygwin on that machine.

I'm just opening this ticket as a reminder to myself to investigate further.

Change History (11)

comment:1 Changed 3 years ago by embray

  • Owner changed from (none) to embray

comment:2 Changed 3 years ago by embray

One reason to suspect Cygwin is that the processes that get stuck are really stuck and unkillable through Cygwin. They no longer respond to signals including SIGKILL. Though this can also happen sometimes in multithreading deadlocks.

comment:3 Changed 3 years ago by embray

I tried running just

./sage -t --long --verbose src/sage/manifolds/

and it got stuck at

Trying (line 767):    g + gX # display of the point atop the coordinate grid
    Graphics object consisting of 20 graphics primitives

Then I tried

./sage -t --long --verbose src/sage/manifolds/

and it got stuck at

Trying (line 2233):    g
    Graphics object consisting of 18 graphics primitives

If I run each test with --serial it passes. So there's definitely a problem with fork. The fact that it's happen during displaying graphics also leads me again to suspect OpenBLAS but we'll see.

comment:4 Changed 3 years ago by embray

Yep. This looks almost exactly like a reprise of #27213. It's bizarre though--this shouldn't be happening.

comment:5 Changed 3 years ago by embray

facepalm it just occurred to me that the fix from #27256 was incorrect/insufficient. I didn't realize that there is an earlier call to sdh_make inside an if statement, and that the latter call (the one I actually fixed) is only run if the first one fails--it retries then with a more generic target.

Testing now...

comment:6 Changed 3 years ago by embray

Just looking at the logs from my last rebuild of OpenBLAS on the buildbot, I can see that USE_TLS was not defined when compiling the module it's relevant to, so this definitely has to be it. What a pain!

comment:7 Changed 3 years ago by embray

Fixing OpenBLAS definitely allowed me to make progress; I am going to open a separate ticket for that.

Despite that fix the docbuild still hangs though. It gets much further along, but ultimately crashes. I think possibly due to some other error occurring but not getting reported in a timely manner due to our stupid x.get(99999) with Pool.map_async.

comment:8 Changed 3 years ago by embray

This is insane--the docbuild is still, ultimately, hanging up somewhere in OpenBLAS, actually much in the same way (matrix inversion via Numpy) as it was in the first place, even though I apparently fixed it in the other case that this was crashing:

#0  0x00007ffb78d4aa24 in ntdll!ZwWaitForSingleObject () from /cygdrive/c/WINDOWS/SYSTEM32/ntdll.dll
#1  0x00007ffb75159252 in WaitForSingleObjectEx () from /cygdrive/c/WINDOWS/System32/KERNELBASE.dll
#2  0x00000003f349ff6a in exec_blas_async_wait () from /home/Admin/sage-buildbot-worker/sage_git/build/local/bin/cygopenblas.dll
#3  0x00000003f34a0010 in exec_blas () from /home/Admin/sage-buildbot-worker/sage_git/build/local/bin/cygopenblas.dll
#4  0x00000003f349e4c2 in gemm_thread_n () from /home/Admin/sage-buildbot-worker/sage_git/build/local/bin/cygopenblas.dll
#5  0x00000003f363528a in dgetrs_N_parallel () from /home/Admin/sage-buildbot-worker/sage_git/build/local/bin/cygopenblas.dll
#6  0x00000003f3267826 in dgesv_ () from /home/Admin/sage-buildbot-worker/sage_git/build/local/bin/cygopenblas.dll
#7  0x00000003ec2c278a in DOUBLE_inv () from /home/Admin/sage-buildbot-worker/sage_git/build/local/lib/python2.7/site-packages/numpy/linalg/_umath_linalg.dll
#8  0x00000003a3c8fdbc in PyUFunc_GenericFunction () from /home/Admin/sage-buildbot-worker/sage_git/build/local/lib/python2.7/site-packages/numpy/core/_multiarray_umath.dll
#9  0x00000003a3c90076 in ufunc_generic_call () from /home/Admin/sage-buildbot-worker/sage_git/build/local/lib/python2.7/site-packages/numpy/core/_multiarray_umath.dll
#10 0x00000003eaa64e91 in PyObject_Call () from /home/Admin/sage-buildbot-worker/sage_git/build/local/bin/libpython2.7.dll

I wonder if there's a different, previously unknown-to-me fork()+threads bug in OpenBLAS.

comment:9 Changed 3 years ago by embray

The fix in #27438 definitely fixes the more pervasive bug that is simply caused when OpenBLAS doesn't clean up its thread pool prior to forking.

However, with that issue fixed, there is another more subtle and I think non-deterministic issue that can crop up. It only shows up in the docbuild, specifically the plot3d docs, just because it gives the most opportunities for the problem to occur within a single process: almost each plot it creates involves at some point calling numpy.inv which ends up invoking an multithreaded BLAS routine. And eventually one of those threads gets locked. Specifically, the exec_blas function puts a task on the queue then signals a pool.filled event which worker threads are waiting on. Even with one worker thread, after enough calls in some case either the event doesn't get signaled, or the worker thread never receives it. The worker thread blocks forever waiting to receive the event, and the main thread blocks waiting for the worker thread to finish processing the task queue it never looks at.

I don't think it's a simple race condition. The code is well-written and everything looks correctly ordered, though I wouldn't rule it out. More debug logging and error checking might help me learn more.

One other point of note is that when OpenBLAS is handling fork properly it had a downside that every time the process forks (which happens frequently during the plot3d docbuild to run tachyon for each plot) it has to shutdown the thread pool, then later reinitialize it before running a multithreaded routine again. This seems to work fine normally, but it makes me wonder if there's some kind of resource leakage (e.g. handles not being closed) that ultimately leads to this problem. It's a possibility I need to investigate more.

In the meantime a possible workaround would be just disable OpenBLAS multithreading on Cygwin, at least during the docbuild. In principle the issue could occur anywhere but that seems to be the place where it is most likely to be exercised. I haven't seen anything like it elsewhere, including running the tests.

comment:10 Changed 3 years ago by embray

Quite possibly related but not sure: #27509. There is actually a fairly severe memory leak in OpenBLAS on Cygwin. The conditions that cause this leak (e.g. numerous interleaved multi-threaded BLAS operations and forking off subprocesses) are especially prevalent in the plot3d docs where docbuild was most frequently hanging.

Last edited 3 years ago by embray (previous) (diff)

comment:11 Changed 15 months ago by embray

  • Milestone changed from sage-pending to sage-duplicate/invalid/wontfix
  • Resolution set to worksforme
  • Status changed from new to closed

This can be closed. I haven't had an particular docbuild issues on Cygwin in a while, related to OpenBLAS or otherwise. Also the new doctest forker code that has since been added is more robust against hangs.

The buildbot this references hasn't been running in some time either.

Note: See TracTickets for help on using tickets.