Opened 3 years ago

Closed 3 years ago

#25161 closed defect (fixed)

Sphinx build hangs when a BaseException occurs

Reported by: saraedum Owned by:
Priority: major Milestone: sage-8.3
Component: doctest framework Keywords:
Cc: infinity0 Merged in:
Authors: Julian Rüth Reviewers: Erik Bray
Report Upstream: N/A Work issues:
Branch: 319849a (Commits) Commit: 319849adf7ea1d1695f2ac67d3df00e72cc962dd
Dependencies: Stopgaps:

Description (last modified by saraedum)

Python's multiprocessing.Pool hangs when a BaseException that is not an Excepion happens in the function it runs; we need to wrap these in "normal" Exception's so they get thrown correctly by the pool's get.

Here's a minimal Python session to showcase this problem:

Python 2.7.14 (default, Jan  5 2018, 10:41:29) 
[GCC 7.2.1 20171224] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> def work(x):
...     raise BaseException(x)
... 
>>> from multiprocessing import Pool
>>> pool = Pool(2)
>>> pool.map_async(work, ["error", "error"]).get()
Process PoolWorker-1:
Process PoolWorker-2:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
  File "/usr/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 113, in worker
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 113, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 65, in mapstar
    result = (True, func(*args, **kwds))
    return map(*args)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 65, in mapstar
  File "<stdin>", line 2, in work
    return map(*args)
  File "<stdin>", line 2, in work
BaseException: error
BaseException: error
[hangs]

This upstreams part of Debian's u2-better-sphinx-failure-modes.patch.

Change History (26)

comment:1 Changed 3 years ago by saraedum

  • Description modified (diff)

comment:2 Changed 3 years ago by saraedum

  • Branch set to u/saraedum/docbuild-base-exception

comment:3 Changed 3 years ago by saraedum

  • Commit set to 6a02f359428139894908746c18d6967deaef7fae

Now, how could I possibly doctest this?


New commits:

6a02f35Wrap BaseException so Pool.get() aborts

comment:4 Changed 3 years ago by saraedum

  • Work issues set to write doctests

comment:5 Changed 3 years ago by saraedum

  • Status changed from new to needs_review

comment:6 Changed 3 years ago by saraedum

  • Status changed from needs_review to needs_work

Setting this to needs review as I'd like to get some feedback on this.

comment:7 follow-up: Changed 3 years ago by embray

I think I've seen a similar issue before, albeit rarely. Last year I started working on a generalized version of the parallel doctest runner from sage.doctest.forker that could also be used in docbuilds (and provided better exception handling for free).

However, I had some minor bugs in my implementation that I never quite finished working out, and then got distracted and never finished it. I'd really like that get that project up and running again...

comment:8 follow-up: Changed 3 years ago by embray

Do you have any idea what's causing a non-Exception exception to be raised from the docbuild?

The only thing I can think of is if it's an exception coming from Cysignals, and even then it's not clear what the exception would be in that case (an alarm, maybe?)

Otherwise the fix certainly makes sense.

comment:9 Changed 3 years ago by embray

  • Status changed from needs_work to needs_info

comment:10 in reply to: ↑ 7 ; follow-up: Changed 3 years ago by jdemeyer

Replying to embray:

Last year I started working on a generalized version of the parallel doctest runner from sage.doctest.forker that could also be used in docbuilds (and provided better exception handling for free).

Huge +1 (disclaimer: I wrote that doctest forker).

multiprocessing is broken in many ways. For this ticket, wouldn't it be possible to patch multiprocessing instead? That way, all users of multiprocessing would benefit, not only the docbuilder.

comment:11 in reply to: ↑ 8 ; follow-up: Changed 3 years ago by jdemeyer

Replying to embray:

Do you have any idea what's causing a non-Exception exception to be raised from the docbuild?

KeyboardInterrupt is a likely candidate. This probably explains why interrupting the docbuilder is unreliable.

comment:12 in reply to: ↑ 11 ; follow-up: Changed 3 years ago by saraedum

Replying to jdemeyer:

Replying to embray:

Do you have any idea what's causing a non-Exception exception to be raised from the docbuild?

Maybe something coming out of cysignals. But I have no clue really.

KeyboardInterrupt is a likely candidate. This probably explains why interrupting the docbuilder is unreliable.

I observed this problem in a CI run, so I don't think it's a KeyboardInterrupt?. But you're right, the above example can not be killed with Ctrl-c. It just prints the KeyboardInterrupt? on Python 2.7. (Works on Python 3.6.)

comment:13 in reply to: ↑ 10 ; follow-up: Changed 3 years ago by saraedum

Replying to jdemeyer:

Replying to embray:

Last year I started working on a generalized version of the parallel doctest runner from sage.doctest.forker that could also be used in docbuilds (and provided better exception handling for free).

Huge +1 (disclaimer: I wrote that doctest forker).

multiprocessing is broken in many ways. For this ticket, wouldn't it be possible to patch multiprocessing instead? That way, all users of multiprocessing would benefit, not only the docbuilder.

What do you mean? Patch our Python SPKG or monkey-patch the multiprocessing module?

comment:14 Changed 3 years ago by saraedum

  • Status changed from needs_info to needs_review
  • Work issues write doctests deleted

comment:15 Changed 3 years ago by saraedum

(Without the catch BaseException:, the added doctest just hangs btw.)

comment:16 Changed 3 years ago by git

  • Commit changed from 6a02f359428139894908746c18d6967deaef7fae to fcbfe1c7b83d2784186445651647c0a9fe408781

Branch pushed to git repo; I updated commit sha1. New commits:

6a7640dAdd doctest for #25161
fcbfe1cMore standard logging usage

comment:17 Changed 3 years ago by saraedum

Ok, here's what's been happening in #24655:

[dochtml] Exception in thread Thread-7:
[dochtml] Traceback (most recent call last):
[dochtml]   File "/home/sage/sage/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
[dochtml]     self.run()
[dochtml]   File "/home/sage/sage/local/lib/python2.7/threading.py", line 754, in run
[dochtml]     self.__target(*self.__args, **self.__kwargs)
[dochtml]   File "/home/sage/sage/local/lib/python2.7/multiprocessing/pool.py", line 328, in _handle_workers
[dochtml]     pool._maintain_pool()
[dochtml]   File "/home/sage/sage/local/lib/python2.7/multiprocessing/pool.py", line 232, in _maintain_pool
[dochtml]     self._repopulate_pool()
[dochtml]   File "/home/sage/sage/local/lib/python2.7/multiprocessing/pool.py", line 225, in _repopulate_pool
[dochtml]     w.start()
[dochtml]   File "/home/sage/sage/local/lib/python2.7/multiprocessing/process.py", line 130, in start
[dochtml]     self._popen = Popen(self)
[dochtml]   File "/home/sage/sage/local/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
[dochtml]     self.pid = os.fork()
[dochtml] OSError: [Errno 12] Cannot allocate memory

That seems to make the docbuild hang. But that's an Exception, so the problem there seems to be something else.

Last edited 3 years ago by saraedum (previous) (diff)

comment:18 Changed 3 years ago by git

  • Commit changed from fcbfe1c7b83d2784186445651647c0a9fe408781 to 319849adf7ea1d1695f2ac67d3df00e72cc962dd

Branch pushed to git repo; I updated commit sha1. New commits:

319849aAdded UTF-8 header to silence python warnings

comment:19 Changed 3 years ago by infinity0

This is only slightly related, but I wanted to bring this other patch to your attention:

https://salsa.debian.org/science-team/sagemath/blob/master/debian/patches/df-subprocess-sphinx.patch

If that is adopted, a side-effect (I think) would be to avoid the need for this patch, since all errors would instead become CalledProcessErrors?.

comment:20 Changed 3 years ago by saraedum

Ah, thanks for pointing that out. I'm working through all the Debian patches at the moment so I might create a ticket for that one as well. (But I think it's maybe more controversial.)

Apart from that multiprocessing seems to work much better in Python 3, so maybe the problem goes away there…

comment:21 in reply to: ↑ 13 Changed 3 years ago by embray

Replying to saraedum:

Replying to jdemeyer:

Replying to embray:

Last year I started working on a generalized version of the parallel doctest runner from sage.doctest.forker that could also be used in docbuilds (and provided better exception handling for free).

Huge +1 (disclaimer: I wrote that doctest forker).

multiprocessing is broken in many ways. For this ticket, wouldn't it be possible to patch multiprocessing instead? That way, all users of multiprocessing would benefit, not only the docbuilder.

What do you mean? Patch our Python SPKG or monkey-patch the multiprocessing module?

I think what Jeroen means is as a more robust alternative to using multiprocessing.Pool.map and the like, as is currently used by the docbuilder.

My prototype implementation was compatible with (and this is why it was a bit difficult to get right) concurrent.futures API, albeit extended to cover the additional features that the doctest runner has, such as capturing standard I/O from the worker processes.

comment:22 in reply to: ↑ 12 Changed 3 years ago by embray

Replying to saraedum:

Replying to jdemeyer:

Replying to embray:

Do you have any idea what's causing a non-Exception exception to be raised from the docbuild?

Maybe something coming out of cysignals. But I have no clue really.

KeyboardInterrupt is a likely candidate. This probably explains why interrupting the docbuilder is unreliable.

I observed this problem in a CI run, so I don't think it's a KeyboardInterrupt?. But you're right, the above example can not be killed with Ctrl-c. It just prints the KeyboardInterrupt? on Python 2.7. (Works on Python 3.6.)

It's possible that if a process is taking too long, or otherwise going beyond its allowed resource usage the CI runner tries to stop it, and it might start out "nice" and just send Ctrl-Cs, though it also might move on to SIGTERM and/or SIGKILL.

It sounds to me like the enormous memory usage of the docbuild might still be a problem for #24655. We may need to try to do more work on fixing that...

comment:23 Changed 3 years ago by saraedum

So, the underlying problem here was really the memory usage of the sequential docbuild, see #24655. Anyway, I think the change proposed here still makes sense.

Last edited 3 years ago by saraedum (previous) (diff)

comment:24 Changed 3 years ago by embray

  • Reviewers set to Erik Bray
  • Status changed from needs_review to positive_review

Agreed.

comment:25 Changed 3 years ago by embray

  • Milestone changed from sage-8.2 to sage-8.3

comment:26 Changed 3 years ago by vbraun

  • Branch changed from u/saraedum/docbuild-base-exception to 319849adf7ea1d1695f2ac67d3df00e72cc962dd
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.