#23973 closed defect (fixed)
doc build hangs on Cygwin
Reported by: | embray | Owned by: | |
---|---|---|---|
Priority: | blocker | Milestone: | sage-8.1 |
Component: | porting: Cygwin | Keywords: | windows cygwin ecl gc |
Cc: | Merged in: | ||
Authors: | Erik Bray | Reviewers: | Jeroen Demeyer |
Report Upstream: | Fixed upstream, but not in a stable release. | Work issues: | |
Branch: | 5863356 (Commits, GitHub, GitLab) | Commit: | |
Dependencies: | Stopgaps: |
Description (last modified by )
For a few weeks now (I haven't been able to pursue the issue much due to travel) I've had to take the Cygwin patchbot down due to an issue that didn't occur before where building the docs, particularly with multiple processes, hangs indefinitely.
For some reason, the problem appears to begin with this commit. Before this commit the docs build no problem. After this commit it will build several documents and then all the worker processes will just deadlock it seems.
Upstream PR: https://github.com/ivmai/bdwgc/pull/187
Change History (26)
comment:1 Changed 5 years ago by
comment:2 Changed 5 years ago by
- Description modified (diff)
comment:3 Changed 5 years ago by
Hard to say, but now that I take a careful look at the doc build log I think there might be an unhandled failure of some sort. I just tried a new build from scratch and here's what I found: The document en/constructions
, that was edited in that commit, starts building (again, this is a parallel build so the log is interleaved):
[dochtml] Building en/constructions. [dochtml] [dochtml] [construct] loading pickled environment... not yet created [dochtml] [construct] Compiling the master document [dochtml] [construct] building [mo]: targets for 0 po files that are out of date [dochtml] [construct] building [html]: targets for 16 source files that are out of date [dochtml] [construct] updating environment: 16 added, 0 changed, 0 removed [dochtml] [construct] reading sources... [ 6%] algebraic_geometry [dochtml] [a_tour_of] pickling environment... done [dochtml] [a_tour_of] checking consistency... done [dochtml] [a_tour_of] preparing documents... done [dochtml] [a_tour_of] writing output... [100%] index [dochtml] [construct] reading sources... [ 12%] calculus [dochtml] [a_tour_of] generating indices... genindex [dochtml] [a_tour_of] Merging js index files... [dochtml] [a_tour_of] ... done (108 js index entries) [dochtml] [a_tour_of] Writing js search indexes...writing additional pages... search [dochtml] [a_tour_of] copying images... [ 50%] sin_plot.png [dochtml] [a_tour_of] copying images... [100%] eigen_plot.png [dochtml] Insufficient memory for black list
It's not clear where "Insufficient memory for black list" is coming from but probably en/constructions
. After that there's no more output from that worker, but the remaining documents get built successfully, and then the build hangs.
So I guess this is a matter of poor error handling. Reminds me I should finish my work to generalize DocTestDispatcher.parallel_dispatch
:)
comment:4 Changed 5 years ago by
Well this is definitely strange. I don't know why it would give an "Insufficient memory" error. This error is apparently coming from gc which is trying to initialize about 1 MB, and it definitely should be able to do that...
comment:5 Changed 5 years ago by
Sorry for the obvious question, but are you sure that you are not actually running out of memory? It's well known that the Sage docbuilder requires a lot of memory.
comment:6 Changed 5 years ago by
Yeah it's a fair question, but I'm pretty sure not. I have 32GB available and got this even when closing most other applications. I watched memory usage during a build and it still did not even come close to using all my system's memory. In any case, it fails almost deterministically to make this one 1 MB allocation, so that seems unlikely. I fear it might be a more subtle bug with gc and/or ecl, possibly having to do with fork (it wouldn't be the first one I've encountered).
comment:7 Changed 5 years ago by
Yes, 32GB should be enough, unless you are using too many processes. Does it work when building the docs serially? See also #21389.
comment:8 Changed 5 years ago by
Yes, it seems to occur specifically when using multiprocessing (I'm only using 4 processes for the present test, but I will try manually undoing #21389 and seeing if it happens even with 1 process). Still, that all makes me think it's nothing to do with the doc build itself, though I'm failing to reproduce the issue outside that context.
comment:9 Changed 5 years ago by
Given the commit that you pointed to, it seems that something goes wrong when producing that plot in the context of multiprocessing.
Maybe a simple workaround would be to disallow docbuilding in parallel on Cygwin?
comment:10 Changed 5 years ago by
Well yes, clearly. But that would be shuffling the problem under the rug (which might be fine if I can't find something else out soon...). It's in reading the sources for that file where it crashes. But I can reproduce that plot fine on my own--even running that code directly using multiprocessing.Pool
appears to work fine. What's also strange is I can't reproduce the issue if I run a parallel doc build with just that document, or with a small handful of documents.
The code for that plot calls PiecewiseFunction.fourier_series_partial_sum
which is using maxima do some symbolic integrations, and that's specifically where the failure must be occurring. I'm running the doc build again with GC_PRINT_VERBOSE_STATS=1
to see if anything turns up...
comment:11 Changed 5 years ago by
There are a few potentially relevant bug fixes in libgc since the current version in Sage. I'm going to try merging #23700 (for starters) and see if that helps.
comment:12 follow-up: ↓ 13 Changed 5 years ago by
Unfortunately simply upgrading to gc 7.6 did not resolve the problem. All it gave me was a little additional error output but nothing immediately helpful.
comment:13 in reply to: ↑ 12 Changed 5 years ago by
Replying to embray:
Unfortunately simply upgrading to gc 7.6 did not resolve the problem. All it gave me was a little additional error output but nothing immediately helpful.
I take it back. Looking at the error message it gave it actually couldn't have been using gc 7.6. In fact for some reason when I upgraded gc it did not replace the old DLL.
comment:14 Changed 5 years ago by
I got gc 7.6 installed correctly this time, but unfortunately it still did not resolve the problem. The error message output during the build of this document is now slightly more helpful but not by much:
GC Warning: Out of memory - trying to allocate requested amount (8224 bytes)... Insufficient memory for black list
comment:15 Changed 5 years ago by
I think I may have run across a possible answer to this conundrum, from Hans Boehm himself: http://www.hpl.hp.com/hosted/linux/mail-archives/gc/2006-March/001214.html
Indeed, it seems gc on Cygwin is using sbrk to allocate memory, and this runs into the hard-coded upper limit on the data segment of the Cygwin process (most memory allocation in Cygwin uses mmap which avoids these issues). I'll try rebuilding gc with mmap support enabled (which it is not, by default, on Cygwin, though not for any particular reason apparently). That should probably fix it...
comment:16 Changed 5 years ago by
- Branch set to u/embray/cygwin/ticket-23973
- Commit set to 188ed32783282e72f3e76393a4b049576c0dfc67
- Keywords windows cygwin ecl gc added
- Report Upstream changed from N/A to Not yet reported upstream; Will do shortly.
- Status changed from new to needs_review
Here's a full summary of the issue so far, and how the attached patch addresses it:
- Cygwin processes reserve an address space for Cygwin's "private heap" for the process--most POSIX system calls that manipulate the heap (e.g.
brk
,sbrk
) are manipulating Cygwin's heap.- This private heap is of a fixed size and reserved at process startup--this allows it to be used freely without interference from Windows, which may be doing whatever it wants with the rest of the process's virtual address space.
- Most of the time this is not a problem--while Cygwin's
malloc
does make small allocations to the heap, larger allocations are made withmmap
and can be anywhere in the Windows process's VM space (not within the private heap). - However, a custom memory allocator (such as that used by libgc) may still run into problems here. In particular, the default configuration for libgc on Cygwin is to use
sbrk
only for memory allocations. - It just happened that running the doc build in the manner I was running it deterministically (but otherwise arbitrarily) caused
sbrk
to bump into the fixed size limit of the private heap (at least I think--I have not directly confirmed this but it seems likely). This was happening to me in the context of the doc build, but in principle this error could have come up in some other context.
- The obvious solution is to build libgc to use
mmap
(which generally works well on Cygwin) and not rely onsbrk
. Another workaround would be to manually increase the max heap size on theecl
executable (this can be done with thepeflags
utility) but that's only sweeping the problem under the rug somewhat.
- libgc hard-codes a lot of the defaults for various settings depending on platform, such as
USE_MMAP
for whether or not it should usemmap
for allocations (in fact, it seems on Linux this is not the default for some reason). The only way to force this at configure time, it seems, is to (seemingly strangely) pass--enable-munmap
toconfigure
.- However, on Cygwin this does not enable
USE_MMAP
. It seems like maybe it used to, but now it also setsUSE_WINALLOC
, a setting whereby gc uses the Windows API directly for managing memory allocation.- This has a surprising side-effect that it forces
--enable-handle-fork
to be disabled, as relying on direct handling of virtual memory allocations is inherently incompatible with Cygwin's fork emulation. However, we need this setting in order for ecl to work at all in a multi-process environment where it might be forked (see #22694) - Therefore we need to force
USE_MMAP
as well asUSE_MUNMAP
to be defined by manually passing them in viaCFLAGS
. Other workarounds are possible of course but this is the simplest.
- This has a surprising side-effect that it forces
- However, on Cygwin this does not enable
- However there's a reason
USE_MMAP
was explicitly disabled on Cygwin.- libgc's memory allocator uses algorithms that require it to be able to keep ranges of memory allocated but unusuable--in particular it remaps an existing
mmap
with thePROT_NONE
flag to do this. This runs afoul of the same problem we had in #22810, where Cygwin does not like this. - This can be fixed on Cygwin by simply using
mprotect(..., PROT_NONE)
here, instead ofmmap
. However, this alone is not sufficient and can lead quickly to failures because:- The
mprotect
call must be aligned to the start of themmap
'd region, which libgc calculates using itsGC_page_size
global variable, which is initialized early on. However, it currently calculates this incorrectly on Cygwin; rather it uses it incorrectly. Although it does get the page size right, what it really needs is a value called the allocation granularity--the alignment for the start address of mmap'd regions. On many systems this is equal to the page size but not on 64-bit windows where the page size is 4k but the allocation granularity is 64k. Regions allocated by Cygwin'smmap
are aligned to the latter. The easiest workaround here is to just use the allocation granularity forGC_page_size
. There's not much sense in trying to use the actual page size anywhere in this case.
- The
- libgc's memory allocator uses algorithms that require it to be able to keep ranges of memory allocated but unusuable--in particular it remaps an existing
New commits:
188ed32 | Enable munmap/mmap on Cygwin, but fix issues with its implementation on Cygwin;
|
comment:17 Changed 5 years ago by
- Description modified (diff)
- Report Upstream changed from Not yet reported upstream; Will do shortly. to Reported upstream. No feedback yet.
comment:18 follow-up: ↓ 20 Changed 5 years ago by
Pardon my ignorance, but are you sure that CYGWIN32
is the correct macro to check?
comment:19 Changed 5 years ago by
- Commit changed from 188ed32783282e72f3e76393a4b049576c0dfc67 to 5863356b68e1796ea3c6d92391753a6bfbc93ca7
Branch pushed to git repo; I updated commit sha1. New commits:
5863356 | Updated patch level
|
comment:20 in reply to: ↑ 18 Changed 5 years ago by
Replying to jdemeyer:
Pardon my ignorance, but are you sure that
CYGWIN32
is the correct macro to check?
Yeah that's just the macro they use internally for "cygwin". The "32" part is a misnomer.
comment:21 Changed 5 years ago by
Isn't __CYGWIN__
the official macro to check for Cygwin?
comment:22 Changed 5 years ago by
It doesn't matter--have a look at the source code for gc. It defines its own macros for platform checks. This is only consistent with their style.
comment:23 Changed 5 years ago by
- Reviewers set to Jeroen Demeyer
- Status changed from needs_review to positive_review
OK. I trust you on this one.
comment:24 Changed 5 years ago by
It would be helpful to have feedback from upstream in case there's somehow something outright wrong about this approach, but it at least resolved the problem, so...
comment:25 Changed 5 years ago by
- Branch changed from u/embray/cygwin/ticket-23973 to 5863356b68e1796ea3c6d92391753a6bfbc93ca7
- Resolution set to fixed
- Status changed from positive_review to closed
comment:26 Changed 5 years ago by
- Commit 5863356b68e1796ea3c6d92391753a6bfbc93ca7 deleted
- Report Upstream changed from Reported upstream. No feedback yet. to Fixed upstream, but not in a stable release.
Does this happen after a failure or during a successful build?