Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#23973 closed defect (fixed)

doc build hangs on Cygwin

Reported by: embray Owned by:
Priority: blocker Milestone: sage-8.1
Component: porting: Cygwin Keywords: windows cygwin ecl gc
Cc: Merged in:
Authors: Erik Bray Reviewers: Jeroen Demeyer
Report Upstream: Fixed upstream, but not in a stable release. Work issues:
Branch: 5863356 (Commits, GitHub, GitLab) Commit:
Dependencies: Stopgaps:

Status badges

Description (last modified by embray)

For a few weeks now (I haven't been able to pursue the issue much due to travel) I've had to take the Cygwin patchbot down due to an issue that didn't occur before where building the docs, particularly with multiple processes, hangs indefinitely.

For some reason, the problem appears to begin with this commit. Before this commit the docs build no problem. After this commit it will build several documents and then all the worker processes will just deadlock it seems.

Upstream PR:

Change History (26)

comment:1 Changed 5 years ago by jdemeyer

Does this happen after a failure or during a successful build?

comment:2 Changed 5 years ago by embray

  • Description modified (diff)

comment:3 Changed 5 years ago by embray

Hard to say, but now that I take a careful look at the doc build log I think there might be an unhandled failure of some sort. I just tried a new build from scratch and here's what I found: The document en/constructions, that was edited in that commit, starts building (again, this is a parallel build so the log is interleaved):

[dochtml] Building en/constructions.
[dochtml] [construct] loading pickled environment... not yet created
[dochtml] [construct] Compiling the master document
[dochtml] [construct] building [mo]: targets for 0 po files that are out of date
[dochtml] [construct] building [html]: targets for 16 source files that are out of date
[dochtml] [construct] updating environment: 16 added, 0 changed, 0 removed
[dochtml] [construct] reading sources... [  6%] algebraic_geometry
[dochtml] [a_tour_of] pickling environment... done
[dochtml] [a_tour_of] checking consistency... done
[dochtml] [a_tour_of] preparing documents... done
[dochtml] [a_tour_of] writing output... [100%] index
[dochtml] [construct] reading sources... [ 12%] calculus
[dochtml] [a_tour_of] generating indices... genindex
[dochtml] [a_tour_of] Merging js index files...
[dochtml] [a_tour_of] ... done (108 js index entries)
[dochtml] [a_tour_of] Writing js search indexes...writing additional pages... search
[dochtml] [a_tour_of] copying images... [ 50%] sin_plot.png
[dochtml] [a_tour_of] copying images... [100%] eigen_plot.png
[dochtml] Insufficient memory for black list

It's not clear where "Insufficient memory for black list" is coming from but probably en/constructions. After that there's no more output from that worker, but the remaining documents get built successfully, and then the build hangs.

So I guess this is a matter of poor error handling. Reminds me I should finish my work to generalize DocTestDispatcher.parallel_dispatch :)

Last edited 5 years ago by embray (previous) (diff)

comment:4 Changed 5 years ago by embray

Well this is definitely strange. I don't know why it would give an "Insufficient memory" error. This error is apparently coming from gc which is trying to initialize about 1 MB, and it definitely should be able to do that...

comment:5 Changed 5 years ago by jdemeyer

Sorry for the obvious question, but are you sure that you are not actually running out of memory? It's well known that the Sage docbuilder requires a lot of memory.

comment:6 Changed 5 years ago by embray

Yeah it's a fair question, but I'm pretty sure not. I have 32GB available and got this even when closing most other applications. I watched memory usage during a build and it still did not even come close to using all my system's memory. In any case, it fails almost deterministically to make this one 1 MB allocation, so that seems unlikely. I fear it might be a more subtle bug with gc and/or ecl, possibly having to do with fork (it wouldn't be the first one I've encountered).

comment:7 Changed 5 years ago by jdemeyer

Yes, 32GB should be enough, unless you are using too many processes. Does it work when building the docs serially? See also #21389.

comment:8 Changed 5 years ago by embray

Yes, it seems to occur specifically when using multiprocessing (I'm only using 4 processes for the present test, but I will try manually undoing #21389 and seeing if it happens even with 1 process). Still, that all makes me think it's nothing to do with the doc build itself, though I'm failing to reproduce the issue outside that context.

comment:9 Changed 5 years ago by jdemeyer

Given the commit that you pointed to, it seems that something goes wrong when producing that plot in the context of multiprocessing.

Maybe a simple workaround would be to disallow docbuilding in parallel on Cygwin?

comment:10 Changed 5 years ago by embray

Well yes, clearly. But that would be shuffling the problem under the rug (which might be fine if I can't find something else out soon...). It's in reading the sources for that file where it crashes. But I can reproduce that plot fine on my own--even running that code directly using multiprocessing.Pool appears to work fine. What's also strange is I can't reproduce the issue if I run a parallel doc build with just that document, or with a small handful of documents.

The code for that plot calls PiecewiseFunction.fourier_series_partial_sum which is using maxima do some symbolic integrations, and that's specifically where the failure must be occurring. I'm running the doc build again with GC_PRINT_VERBOSE_STATS=1 to see if anything turns up...

comment:11 Changed 5 years ago by embray

There are a few potentially relevant bug fixes in libgc since the current version in Sage. I'm going to try merging #23700 (for starters) and see if that helps.

comment:12 follow-up: Changed 5 years ago by embray

Unfortunately simply upgrading to gc 7.6 did not resolve the problem. All it gave me was a little additional error output but nothing immediately helpful.

comment:13 in reply to: ↑ 12 Changed 5 years ago by embray

Replying to embray:

Unfortunately simply upgrading to gc 7.6 did not resolve the problem. All it gave me was a little additional error output but nothing immediately helpful.

I take it back. Looking at the error message it gave it actually couldn't have been using gc 7.6. In fact for some reason when I upgraded gc it did not replace the old DLL.

comment:14 Changed 5 years ago by embray

I got gc 7.6 installed correctly this time, but unfortunately it still did not resolve the problem. The error message output during the build of this document is now slightly more helpful but not by much:

GC Warning: Out of memory - trying to allocate requested amount (8224 bytes)...
Insufficient memory for black list

comment:15 Changed 5 years ago by embray

I think I may have run across a possible answer to this conundrum, from Hans Boehm himself:

Indeed, it seems gc on Cygwin is using sbrk to allocate memory, and this runs into the hard-coded upper limit on the data segment of the Windows process (most memory allocation in Cygwin uses mmap which avoids these issues). I'll try rebuilding gc with mmap support enabled (which it is not, by default, on Cygwin, though not for any particular reason apparently). That should probably fix it...

Version 0, edited 5 years ago by embray (next)

comment:16 Changed 5 years ago by embray

  • Authors set to Erik Bray
  • Branch set to u/embray/cygwin/ticket-23973
  • Commit set to 188ed32783282e72f3e76393a4b049576c0dfc67
  • Keywords windows cygwin ecl gc added
  • Report Upstream changed from N/A to Not yet reported upstream; Will do shortly.
  • Status changed from new to needs_review

Here's a full summary of the issue so far, and how the attached patch addresses it:

  • Cygwin processes reserve an address space for Cygwin's "private heap" for the process--most POSIX system calls that manipulate the heap (e.g. brk, sbrk) are manipulating Cygwin's heap.
    • This private heap is of a fixed size and reserved at process startup--this allows it to be used freely without interference from Windows, which may be doing whatever it wants with the rest of the process's virtual address space.
    • Most of the time this is not a problem--while Cygwin's malloc does make small allocations to the heap, larger allocations are made with mmap and can be anywhere in the Windows process's VM space (not within the private heap).
    • However, a custom memory allocator (such as that used by libgc) may still run into problems here. In particular, the default configuration for libgc on Cygwin is to use sbrk only for memory allocations.
    • It just happened that running the doc build in the manner I was running it deterministically (but otherwise arbitrarily) caused sbrk to bump into the fixed size limit of the private heap (at least I think--I have not directly confirmed this but it seems likely). This was happening to me in the context of the doc build, but in principle this error could have come up in some other context.
    • The obvious solution is to build libgc to use mmap (which generally works well on Cygwin) and not rely on sbrk. Another workaround would be to manually increase the max heap size on the ecl executable (this can be done with the peflags utility) but that's only sweeping the problem under the rug somewhat.
  • libgc hard-codes a lot of the defaults for various settings depending on platform, such as USE_MMAP for whether or not it should use mmap for allocations (in fact, it seems on Linux this is not the default for some reason). The only way to force this at configure time, it seems, is to (seemingly strangely) pass --enable-munmap to configure.
    • However, on Cygwin this does not enable USE_MMAP. It seems like maybe it used to, but now it also sets USE_WINALLOC, a setting whereby gc uses the Windows API directly for managing memory allocation.
      • This has a surprising side-effect that it forces --enable-handle-fork to be disabled, as relying on direct handling of virtual memory allocations is inherently incompatible with Cygwin's fork emulation. However, we need this setting in order for ecl to work at all in a multi-process environment where it might be forked (see #22694)
      • Therefore we need to force USE_MMAP as well as USE_MUNMAP to be defined by manually passing them in via CFLAGS. Other workarounds are possible of course but this is the simplest.
  • However there's a reason USE_MMAP was explicitly disabled on Cygwin.
    • libgc's memory allocator uses algorithms that require it to be able to keep ranges of memory allocated but unusuable--in particular it remaps an existing mmap with the PROT_NONE flag to do this. This runs afoul of the same problem we had in #22810, where Cygwin does not like this.
    • This can be fixed on Cygwin by simply using mprotect(..., PROT_NONE) here, instead of mmap. However, this alone is not sufficient and can lead quickly to failures because:
      • The mprotect call must be aligned to the start of the mmap'd region, which libgc calculates using its GC_page_size global variable, which is initialized early on. However, it currently calculates this incorrectly on Cygwin; rather it uses it incorrectly. Although it does get the page size right, what it really needs is a value called the allocation granularity--the alignment for the start address of mmap'd regions. On many systems this is equal to the page size but not on 64-bit windows where the page size is 4k but the allocation granularity is 64k. Regions allocated by Cygwin's mmap are aligned to the latter. The easiest workaround here is to just use the allocation granularity for GC_page_size. There's not much sense in trying to use the actual page size anywhere in this case.

New commits:

188ed32Enable munmap/mmap on Cygwin, but fix issues with its implementation on Cygwin;

comment:17 Changed 5 years ago by embray

  • Description modified (diff)
  • Report Upstream changed from Not yet reported upstream; Will do shortly. to Reported upstream. No feedback yet.

comment:18 follow-up: Changed 5 years ago by jdemeyer

Pardon my ignorance, but are you sure that CYGWIN32 is the correct macro to check?

comment:19 Changed 5 years ago by git

  • Commit changed from 188ed32783282e72f3e76393a4b049576c0dfc67 to 5863356b68e1796ea3c6d92391753a6bfbc93ca7

Branch pushed to git repo; I updated commit sha1. New commits:

5863356Updated patch level

comment:20 in reply to: ↑ 18 Changed 5 years ago by embray

Replying to jdemeyer:

Pardon my ignorance, but are you sure that CYGWIN32 is the correct macro to check?

Yeah that's just the macro they use internally for "cygwin". The "32" part is a misnomer.

comment:21 Changed 5 years ago by jdemeyer

Isn't __CYGWIN__ the official macro to check for Cygwin?

comment:22 Changed 5 years ago by embray

It doesn't matter--have a look at the source code for gc. It defines its own macros for platform checks. This is only consistent with their style.

comment:23 Changed 5 years ago by jdemeyer

  • Reviewers set to Jeroen Demeyer
  • Status changed from needs_review to positive_review

OK. I trust you on this one.

comment:24 Changed 5 years ago by embray

It would be helpful to have feedback from upstream in case there's somehow something outright wrong about this approach, but it at least resolved the problem, so...

comment:25 Changed 5 years ago by vbraun

  • Branch changed from u/embray/cygwin/ticket-23973 to 5863356b68e1796ea3c6d92391753a6bfbc93ca7
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:26 Changed 5 years ago by embray

  • Commit 5863356b68e1796ea3c6d92391753a6bfbc93ca7 deleted
  • Report Upstream changed from Reported upstream. No feedback yet. to Fixed upstream, but not in a stable release.
Note: See TracTickets for help on using tickets.