Opened 7 months ago

Closed 6 months ago

#31395 closed defect (fixed)

RuntimeError: can't start new thread (due to memlimit)

Reported by: mjo Owned by:
Priority: blocker Milestone: sage-9.3
Component: user interface Keywords: memlimit, RuntimeError, threads
Cc: fbissey, vbraun, mkoeppe Merged in:
Authors: Michael Orlitzky Reviewers: Dima Pasechnik
Report Upstream: N/A Work issues:
Branch: cf0c05c (Commits, GitHub, GitLab) Commit: cf0c05c000e965d63d649dcc8db66e0fe7f23b8f
Dependencies: Stopgaps:

Status badges

Description

My home machine can't run sage -t in 9.3.beta7:

$ echo "" > foo.py
$ sage -t foo.py 
too many failed tests, not using stored timings
Running doctests with ID 2021-02-14-07-59-58-a18c55bc.
Git branch: u/mjo/ticket/31382
Using --optional=build,dochtml,gentoo,memlimit,pip,sage,sage_spkg
Doctesting 1 file.
Process DocTestWorker-1:
sage -t --random-seed=0 foo.py
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/mjo/src/sage.git/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 2179, in run
    task(self.options, self.outtmpfile, msgpipe, self.result_queue)
  File "/home/mjo/src/sage.git/local/lib/python3.9/site-packages/sage/doctest/forker.py", line 2529, in __call__
    result_queue.put(result, False)
  File "/usr/lib/python3.9/multiprocessing/queues.py", line 94, in put
    self._start_thread()
  File "/usr/lib/python3.9/multiprocessing/queues.py", line 179, in _start_thread
    self._thread.start()
  File "/usr/lib/python3.9/threading.py", line 874, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
    Bad exit: 1
**********************************************************************
Tests run before process (pid=13654) failed:

**********************************************************************
----------------------------------------------------------------------
sage -t --random-seed=0 foo.py  # Bad exit: 1
----------------------------------------------------------------------
Total time for all tests: 0.1 seconds
    cpu time: 0.0 seconds
    cumulative wall time: 0.0 seconds

Running strace on that command shows a bunch of ENOMEM:

[pid 13910] mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)

So I tried --memlimit=0, and that seems to work:

$ sage -t -m 0 foo.py 
too many failed tests, not using stored timings
Running doctests with ID 2021-02-14-08-08-27-1edc5213.
Git branch: u/mjo/ticket/31382
Using --optional=build,dochtml,gentoo,pip,sage,sage_spkg
Doctesting 1 file.
sage -t --random-seed=0 foo.py
    [0 tests, 0.00 s]
----------------------------------------------------------------------
All tests passed!
----------------------------------------------------------------------
Total time for all tests: 0.0 seconds
    cpu time: 0.0 seconds
    cumulative wall time: 0.0 seconds

I think the default memlimit of 3300 MiB is arbitrary anyway, so maybe it just needs to be raised a little? A slight increase to --memlimit=3400 solves the problem on that machine.

Change History (13)

comment:1 follow-up: Changed 7 months ago by mjo

Also: why do we reimplement ulimit -v in a mathematics suite in the first place?

Every time we hit the limit, we increase it, annoying a few people in the process. Seems like we could avoid the intermediate step by just deleting it with the same end result.

comment:2 in reply to: ↑ 1 Changed 6 months ago by mjo

Replying to mjo:

Also: why do we reimplement ulimit -v in a mathematics suite in the first place?

To answer my own question: there is exactly one test in the sage library designed to check that allocating a huge matrix space fails gracefully when you run out of memory. To prevent that test from crashing the machine in the process, it's necessary to artificially limit the amount of memory available.

The whole sage -t --memlimit thing looks like it exists to support that one test. (It could be used for others, but not too many people are interested in gracefully running out of memory I guess.) The matrix space test relies not only on a limit being set, but that the limit is low enough not to crash the machine. Thus the conflict in choosing a default value that is just low enough to fail in only the expected way.

Despite there being only one such test, it seems to me that the "correct" memory limit is actually test-dependent. The fact that we've been able to get away with a single default for so long is probably just due to there being only one test: it's easy to pick a default that fails that one test but does not crash the machine. If there were twenty such tests, the required memory limit for the twentieth one could be too high to trigger a failure in the first.

Point is, I think the global default limit is the wrong approach here anyway. With only one test to fix, we could just set a memory limit for that one test (on Linux...), and test it unconditionally without any optional - memlimit hijinks. This will eliminate the problem for now, without reducing test coverage. If anyone wants to add more memory-limit tests in the future, it could be made less awkward with a wrapper that does the getrlimit/setrlimit stuff for you and ignores the subsequent failures on non-linux platforms.

comment:3 Changed 6 months ago by dimpase

  • Cc fbissey added
  • Priority changed from major to critical

this needs a fix before the release. I think it blocks most if not all Gentoo systems from testing Sage well atm.

comment:4 Changed 6 months ago by mjo

  • Authors set to Michael Orlitzky
  • Branch set to u/mjo/ticket/31395
  • Commit set to cf0c05c000e965d63d649dcc8db66e0fe7f23b8f

Here's what it would look like if we removed the limit entirely.


New commits:

cf0c05cTrac #31395: eliminate the --memlimit option to the doctest runner.

comment:5 Changed 6 months ago by dimpase

  • Status changed from new to needs_review

comment:6 follow-up: Changed 6 months ago by dimpase

perhaps just set the default to 0, and run that #optional : memtest test only if it is >0 ?

then anyone who needs it can still have it.

comment:7 in reply to: ↑ 6 Changed 6 months ago by mjo

Replying to dimpase:

perhaps just set the default to 0, and run that #optional : memtest test only if it is >0 ?

then anyone who needs it can still have it.

I could settle for this. The reason I opted for removal instead is that (relative to complete removal) setting the default to zero has two cons:

  1. It leaves all of the scaffolding for --memlimit and "optional - memlimit" in place but unused by default.
  2. The one test needing a memory limit is disabled by default.

To be fair, I don't really think we should be running that one test anyway. It's still going to cause (much more isolated) problems down the road. But I thought it would be easier to get a positive review without removing/disabling any tests that are currently performed.

comment:8 Changed 6 months ago by mjo

And I guess if the user has to go out of his way to invoke the memlimit tests, we could just tell him to run ulimit -v <limit> before sage -t. That would let us get rid of the whole --memlimit infrastructure.

The big problem I foresee is that, without the default limit (that was customized for the one test...), the user has no idea what to choose as the memory limit.

As we've seen, 3300MB is too low on Gentoo... it crashes sage. But the one memlimit test does,

MatrixSpace(GF(2), 2^30)(1)

which constructs a matrix with 2^31 entries. Depending on the implementation, it shouldn't require too much more than 4GiB. So we're asking the user to guess a memlimit within a pretty small window; otherwise either sage itself or that doctest (the only one needing a memlimit in the first place) will fail.

comment:9 Changed 6 months ago by dimpase

  • Cc vbraun mkoeppe added

Should we drop --memlimit ? I fine with totally dropping it.

comment:10 Changed 6 months ago by dimpase

  • Reviewers set to Dima Pasechnik
  • Status changed from needs_review to positive_review

I think it has been argued well enough that memlimit in doctests is a misfeature

comment:11 Changed 6 months ago by gh-mwageringel

This problem was also discussed in #28106.

comment:12 Changed 6 months ago by mkoeppe

  • Priority changed from critical to blocker

Setting priority to blocker to bring this ticket to the attention of the release bot.

comment:13 Changed 6 months ago by vbraun

  • Branch changed from u/mjo/ticket/31395 to cf0c05c000e965d63d649dcc8db66e0fe7f23b8f
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.