Opened 9 months ago

Closed 9 months ago

Last modified 9 months ago

#27490 closed defect (fixed)

Simplistic multiprocessing.Pool replacement for parallel docbuild on older Cygwin

Reported by: embray
Owned by:
Priority: blocker
Milestone: sage-8.7
Component: porting: Cygwin
Keywords:
Cc: jdemeyer
Merged in:
Authors: Erik Bray
Reviewers: Jeroen Demeyer
Report Upstream: N/A
Work issues:
Branch: fe0e3ea
Commit:
Dependencies:
Stopgaps:

Description

For Cygwin versions less than 3.0.0 (and only on Cygwin), this replaces the use of multiprocessing.Pool in the sage_setup.docbuild.build_many function with a naïve but "good enough" (it works for this use case) parallel process pool that does not rely on starting processes from threads.

This is needed for #27214, because the specific combination of using MAP_NORESERVE mmaps and forking processes from a thread can trigger a bug in Cygwin (fixed in 3.0.0) which causes unhandled segfaults in any code run during the docbuild that uses libgap.

So this is really only needed so that the docs can continue to be built on systems (including my primary development environment, as well as the buildbot) that do not yet have Cygwin >= 3.0.0 once #27214 is applied.
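
Roughly, the idea is something like the following sketch (illustrative only, not the actual patch on the branch; the processes parameter and the slot-based bookkeeping here are assumptions):

import time
from multiprocessing import Process

def build_many(target, args, processes=8):
    # Sketch only: a fixed number of worker slots, each holding either a
    # running Process or None.  Workers are started directly from the main
    # process (no threads involved), and the pool is polled until all jobs
    # are done.
    pending = list(args)
    workers = [None] * processes

    try:
        while pending or not all(w is None for w in workers):
            for i, w in enumerate(workers):
                # Reap finished workers, failing loudly on a bad exit code.
                if w is not None and not w.is_alive():
                    workers[i] = None
                    if w.exitcode != 0:
                        raise RuntimeError(
                            "worker exited with code %s" % w.exitcode)
                # Fill any free slot with a new worker.
                if workers[i] is None and pending:
                    workers[i] = Process(target=target,
                                         args=(pending.pop(0),))
                    workers[i].start()
            time.sleep(5)  # naive polling; refined in the comments below
    finally:
        # Don't leave child processes running if we bail out with an error.
        for w in workers:
            if w is not None and w.is_alive():
                w.terminate()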

Change History (33)

comment:1 Changed 9 months ago by embray

  • Authors set to Erik Bray
  • Branch set to u/embray/ticket-27490
  • Cc jdemeyer added
  • Commit set to f9aabb755ce4cb89a6dd926bd26a08611cfaf389
  • Status changed from new to needs_review

Jeroen, you will likely have thoughts about this. Keep in mind, it's not meant to be at all robust--just a quick workaround so I don't have to spend too much more time on it. But if you have any thoughts on straightforward improvements to this I'm all for it.

Obviously a better workaround, if it were possible, would be to use the much talked-about idea of generalizing the parallel processing loop from the Sage doctester. But since we don't have that yet, this will do for now.


New commits:

f9aabb7  Trac #27490: Simplistic multiprocessing.Pool replacement for parallel docbuild on older Cygwin

comment:2 follow-up: Changed 9 months ago by jdemeyer

I'm just wondering why you even bother with parallel docbuilding in the first place. The obvious solution is just using a single process.

comment:3 follow-up: Changed 9 months ago by jdemeyer

Also, in general I don't like code of the form

if A_is_not_broken:
    A()
else:
    B()

If B works well enough to replace A, then why don't we just use B unconditionally?

This is especially relevant when A involves multiprocessing.Pool which has other issues (that's why I didn't use it in the doctester).

So I would like to know if there is a good reason to not use your "simplistic multiprocessing.Pool replacement" on all systems.

comment:4 follow-up: Changed 9 months ago by jdemeyer

  • Status changed from needs_review to needs_work

Some comments on the code:

  1. This should use os.wait() instead of time.sleep(5). Unless I'm missing something, this should be robust.
  2. A few comments would be helpful.
  3. not any(filter(None, workers)) is not an easy condition to understand. I would write it as all(w is None for w in workers).
  4. I don't think that there is a need to gracefully shut down the workers in the finally block. A hard kill os.kill(w.pid, signal.SIGKILL) may be better because it is guaranteed to kill the process (cysignals catches SIGTERM, which does sys.exit(), but that might not actually kill the process in a sufficiently messed-up state). For extra safety, maybe call is_alive() before killing the process.

In general, I like it. It's simple, but the use case is sufficiently simple that we don't need anything more complicated. And it says a lot about multiprocessing.Pool that this simple replacement might actually work better.
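
To make items 3 and 4 concrete, they amount to roughly the following (a sketch against an assumed list of worker slots where None means "free"; not the code on the branch):

import os
import signal

def pool_is_idle(workers):
    # Item 3: the loop-termination test, spelled out explicitly.
    return all(w is None for w in workers)

def hard_kill_all(workers):
    # Item 4: hard-kill anything still running (e.g. from a finally
    # block), checking is_alive() first as suggested.
    for w in workers:
        if w is not None and w.is_alive():
            os.kill(w.pid, signal.SIGKILL)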

comment:5 in reply to: ↑ 4 Changed 9 months ago by embray

Thanks for having a look.

Replying to jdemeyer:

Some comments on the code:

  1. This should use os.wait() instead of time.sleep(5). Unless I'm missing something, this should be robust.

Yes, that should be much better.

  2. A few comments would be helpful.

+1

  3. not any(filter(None, workers)) is not an easy condition to understand. I would write it as all(w is None for w in workers).

I thought it was pretty straightforward, but I guess the not any is a little confusing.

  4. I don't think that there is a need to gracefully shut down the workers in the finally block. A hard kill os.kill(w.pid, signal.SIGKILL) may be better because it is guaranteed to kill the process (cysignals catches SIGTERM, which does sys.exit(), but that might not actually kill the process in a sufficiently messed-up state). For extra safety, maybe call is_alive() before killing the process.

I think there is, or at least we should try SIGTERM first. The reason is that this block can be reached when one process exits in error while other processes are still working perfectly fine. You want to shut those down gracefully and ensure that their atexit handlers run, temp files get cleaned up, etc.
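
In other words, a cleanup along the following lines (sketch only, assuming the workers are multiprocessing.Process objects and a hypothetical timeout parameter) would give healthy workers a chance to exit cleanly before anything gets SIGKILLed:

import os
import signal

def shutdown_workers(workers, timeout=5):
    # Ask the surviving workers to exit gracefully first, so their atexit
    # handlers run and temporary files get cleaned up.
    for w in workers:
        if w is not None and w.is_alive():
            w.terminate()  # sends SIGTERM
    for w in workers:
        if w is not None:
            w.join(timeout)
    # Anything still alive after the timeout gets a hard kill.
    for w in workers:
        if w is not None and w.is_alive():
            os.kill(w.pid, signal.SIGKILL)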

In general, I like it. It's simple, but the use case is sufficiently simple that we don't need anything more complicated. And it says a lot about multiprocessing.Pool that this simple replacement might actually work better.

I don't know that it says a lot. I don't think it actually works "better" on the whole, just in this one case. Keep in mind also that multiprocessing.Pool implements a lot of generic functionality (e.g. multiple simultaneous map_async jobs) that simply isn't needed for this simpler use case.

One downside to this approach is that no data is returned from the child processes to the parent. So, for example, an exception raised in a worker cannot be re-raised from the parent. Instead I just raise a generic RuntimeError if the process exited with an error code. I simulated this case in testing and you can still see the worker's exception and traceback printed to stderr, so that was good enough for my purposes.
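
So the error handling on the parent side reduces to something like this check (sketch only; the names job and worker are placeholders, and the actual message in the branch may differ):

def check_worker(job, worker):
    # ``worker`` is a multiprocessing.Process that has already finished;
    # its own exception and traceback (if any) went to stderr in the child.
    if worker.exitcode != 0:
        raise RuntimeError(
            "worker for %r exited with non-zero exit code (%s)"
            % (job, worker.exitcode))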

comment:6 in reply to: ↑ 3 Changed 9 months ago by embray

Replying to jdemeyer:

Also, in general I don't like code of the form

if A_is_not_broken:
    A()
else:
    B()

If B works well enough to replace A, then why don't we just use B unconditionally?

This is especially relevant when A involves multiprocessing.Pool which has other issues (that's why I didn't use it in the doctester).

So I would like to know if there is a good reason to not use your "simplistic multiprocessing.Pool replacement" on all systems.

Partly for the reason I mentioned at the end of my previous comment, and partly because I need this now: although I'm convinced it's robust enough for my use, it's still not well-tested.

How about for now we special-case this, and then for the next release make it a priority to finally get at least an initial version of the doctest forker code released and replace it with that?

comment:7 in reply to: ↑ 2 Changed 9 months ago by embray

Replying to jdemeyer:

I'm just wondering why you even bother with parallel docbuilding in the first place. The obvious solution is just using a single process.

At first I did just replace this with plain map(target, args), but I wasn't satisfied: it was slow (obviously) and made memory usage, which is already bad, even worse over time. It still seems better, even if it's kind of brushing other problems under the rug, to build each sub-doc in its own process that can easily be cleaned up when it's done.

The parallel version only took about 20 minutes to get working.

comment:8 Changed 9 months ago by git

  • Commit changed from f9aabb755ce4cb89a6dd926bd26a08611cfaf389 to 88771dfbe0d6b90cf5574065d312d1e0a7f005af

Branch pushed to git repo; I updated commit sha1. New commits:

219e9c4  A little bit of import cleanup
88771df  Trac #27490: Address some review comments and other cleanup:

comment:9 follow-up: Changed 9 months ago by embray

  • Status changed from needs_work to needs_review

I addressed most of your comments, but I still have the big if/else. Since that sage_setup.docbuild.__init__ module is already quite large, perhaps I could move that code to a utility module at least.

I'd still be open to just using it on all platforms, but I'm wary, given that this isn't battle-tested.

comment:10 in reply to: ↑ 9 Changed 9 months ago by jdemeyer

Replying to embray:

I'd still be open to just using it on all platforms, but I'm wary, given that this isn't battle-tested.

I see your point.

comment:11 Changed 9 months ago by jdemeyer

  • Reviewers set to Jeroen Demeyer
  • Status changed from needs_review to positive_review

I'm willing to give this the benefit of the doubt. You'll probably be the only user of this code anyway.

I'm sure that there is room for improvement (I still don't like the finally block for example), but we can defer that to the point that we decide to use this code for all systems.

comment:12 Changed 9 months ago by embray

  • Status changed from positive_review to needs_work

Thanks. However, I need to double back now since my last change broke something. It's not starting new processes up after previous ones finish.

comment:13 Changed 9 months ago by embray

This sort of makes sense: it turns out that if you os.wait() a process run by multiprocessing.Process(), it breaks, because multiprocessing wants to os.waitpid() on the child itself so it can get its return code. So wait()-ing for that process manually means multiprocessing never finds out that the child finished (and it doesn't check for the error code that would indicate as much), so it always just returns None for its exitcode. This is a bit buggy but understandable.
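
A small standalone script (not from the branch) shows the interaction on a POSIX system with the fork start method:

import os
from multiprocessing import Process

def work():
    pass

if __name__ == '__main__':
    p = Process(target=work)
    p.start()

    # Reap the child ourselves, so multiprocessing never sees its status.
    pid, status = os.wait()
    assert pid == p.pid

    # multiprocessing's own waitpid() now fails (the child was already
    # reaped), which it treats as "not finished yet", so the exit code is
    # never filled in.
    print(p.exitcode)    # None instead of 0
    print(p.is_alive())  # still reports True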

comment:14 Changed 9 months ago by jdemeyer

I see. In the doctester, we solve this by checking for the SIGCHLD signal instead of os.wait(). Now I know why.
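
For reference, the SIGCHLD-based pattern looks roughly like this generic sketch (not the doctester's actual code):

import os
import signal

finished = {}  # pid -> raw wait status

def _sigchld_handler(signum, frame):
    # Reap every child that has exited since the last signal; WNOHANG
    # ensures the handler itself never blocks.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError:
            return  # no children left
        if pid == 0:
            return  # remaining children are still running
        finished[pid] = status

signal.signal(signal.SIGCHLD, _sigchld_handler)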

comment:15 Changed 9 months ago by embray

I thought of doing that as a solution, but then you start involving signal handling and basically reimplementing parts of the doctester. I have another solution that's a bit ugly but simpler.

It would be nice if there were a sort of multiprocessing.wait() that could wait for any one multiprocessing.Process to finish (or maybe one out of a specific list thereof). Since multiprocessing already tracks its child processes this could be done. Maybe I'll propose it...
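
Python 3's standard library does in fact offer something close to this: multiprocessing.connection.wait() can block on the sentinel descriptors of several Process objects at once (not available on Python 2, so it may not be an option here). A sketch of such a helper, with a hypothetical name:

from multiprocessing.connection import wait

def wait_for_any(workers, timeout=None):
    # Block until at least one of the given Process objects finishes
    # (or the timeout expires) and return the finished ones.
    sentinels = {w.sentinel: w for w in workers if w is not None}
    ready = wait(sentinels, timeout=timeout)
    done = [sentinels[s] for s in ready]
    for w in done:
        w.join()  # let multiprocessing collect the exit code itself
    return done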

comment:16 Changed 9 months ago by git

  • Commit changed from 88771dfbe0d6b90cf5574065d312d1e0a7f005af to 1e5b1f56b37e1fb9abdc75c5e3540fc137c5024c

Branch pushed to git repo; I updated commit sha1. New commits:

1e5b1f5  Trac #27490: Further fixes in use of os.wait()

comment:17 Changed 9 months ago by embray

  • Status changed from needs_work to needs_review

One can see how quickly this can become more complex.

comment:18 Changed 9 months ago by embray

  • Status changed from needs_review to positive_review

Setting back to positive review since Jeroen was okay with (or at least resigned to) the current approach. I've tested the updated code several times with different numbers of processes and am satisfied with it for now.

comment:19 Changed 9 months ago by vbraun

Can we move that abomination at least into a separate file?

comment:20 Changed 9 months ago by vbraun

  • Status changed from positive_review to needs_work

comment:21 Changed 9 months ago by embray

  • Status changed from needs_work to needs_info

Which "abomination"? I'm not necessarily going to do anything at your behest if you put it in such negative terms.

comment:22 Changed 9 months ago by embray

(Which is not to say I necessarily think this is pretty as-is, but it's still not even clear exactly what you're asking to move.)

comment:23 Changed 9 months ago by vbraun

I'm asking you to move the parallel replacement for multiprocessing into a separate file, ideally with a small docstring explaining what is going on here.

comment:24 Changed 9 months ago by embray

Both versions, or just the one I added?

comment:25 Changed 9 months ago by vbraun

At least for now just the one you added in the else branch. Thanks.

comment:26 Changed 9 months ago by embray

Okay, I'll do that.

comment:27 Changed 9 months ago by git

  • Commit changed from 1e5b1f56b37e1fb9abdc75c5e3540fc137c5024c to 78f89387636bbbe9b2b315e6ed801f3b4908b249

Branch pushed to git repo; I updated commit sha1. New commits:

78f8938  Trac #27490: Moved the alternate build_many implementation into a

comment:28 Changed 9 months ago by git

  • Commit changed from 78f89387636bbbe9b2b315e6ed801f3b4908b249 to 3a35e4d05c1ee547d25a30c51c57cdfdf5a9aa7a

Branch pushed to git repo; I updated commit sha1. New commits:

3a35e4d  fixup

comment:29 Changed 9 months ago by git

  • Commit changed from 3a35e4d05c1ee547d25a30c51c57cdfdf5a9aa7a to fe0e3ea1a8d85d41047f9308db48912f1bdcffcd

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

fe0e3ea  Trac #27490: Moved the alternate build_many implementation into a

comment:30 Changed 9 months ago by embray

  • Status changed from needs_info to needs_review

(Just removed an unused import.)

comment:31 Changed 9 months ago by vbraun

  • Status changed from needs_review to positive_review

comment:32 Changed 9 months ago by vbraun

  • Branch changed from u/embray/ticket-27490 to fe0e3ea1a8d85d41047f9308db48912f1bdcffcd
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:33 Changed 9 months ago by vbraun

  • Commit fe0e3ea1a8d85d41047f9308db48912f1bdcffcd deleted

Followup at #27514
