Opened 9 years ago
Closed 5 years ago
#15585 closed defect (fixed)
Random failure in SimplicialComplex.is_cohen_macaulay
Reported by: | vbraun | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | sage-7.6 |
Component: | algebra | Keywords: | random_fail |
Cc: | jdemeyer, roed, slabbe | Merged in: | |
Authors: | Jeroen Demeyer | Reviewers: | Sébastien Labbé |
Report Upstream: | N/A | Work issues: | |
Branch: | 51b2030 (Commits, GitHub, GitLab) | Commit: | 51b2030a38280980cb8837d1f6b452ca1084ea8e |
Dependencies: | #22462 | Stopgaps: |
Description
This is fairly unlikely but occasionally comes up on the buildbot:
sage -t --long src/sage/homology/simplicial_complex.py
**********************************************************************
File "src/sage/homology/simplicial_complex.py", line 2236, in sage.homology.simplicial_complex.SimplicialComplex.is_cohen_macaulay
Failed example:
    S.is_cohen_macaulay(ncpus=3)
Expected:
    False
Got:
    [Errno 2] No such file or directory: '/home/buildbot/build/sage/snapperkob/sage_git/dot_sage/temp/snapperkob/10634/dir_n0BDmn/10759.out'
    False
This is caused by a race condition in @parallel. This is a parallel generator which forks processes, each process handling one item of the generator. The output of each finished process is stored as a pickle in a working directory and then yielded by the main process.
When the generator is closed (for example, the generator is used as argument to all() and a False condition is found), the following happens in a finally block:
- The working directory is removed.
- The remaining processes are killed.
This is a race condition because it can happen that a subprocess finishes between these steps. Then that process wants to write its output in the deleted directory. The fix is obvious: first kill the processes, then delete the directory.
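The corrected cleanup order can be sketched as follows. This is a minimal illustration under stated assumptions, not Sage's actual @parallel implementation: the names parallel_iter and is_small are hypothetical, the sketch relies on os.fork and so is Unix-only, and error handling for failed workers is omitted.

```python
import os
import pickle
import shutil
import signal
import tempfile


def parallel_iter(func, inputs):
    """Yield (x, func(x)) pairs, using one forked worker per input.

    Hypothetical stand-in for the forking generator described in this
    ticket; it is NOT the code from Sage.
    """
    workdir = tempfile.mkdtemp()
    workers = {}  # pid -> input handled by that worker
    try:
        for x in inputs:
            pid = os.fork()
            if pid == 0:
                # Child: pickle the result into the working directory, then exit.
                status = 0
                try:
                    path = os.path.join(workdir, "%d.out" % os.getpid())
                    with open(path, "wb") as f:
                        pickle.dump(func(x), f)
                except BaseException:
                    status = 1
                finally:
                    os._exit(status)
            workers[pid] = x
        while workers:
            pid, _ = os.wait()  # block until some worker finishes
            x = workers.pop(pid)
            with open(os.path.join(workdir, "%d.out" % pid), "rb") as f:
                result = pickle.load(f)
            yield x, result
    finally:
        # Runs on exhaustion and when the generator is closed early.
        # 1. Kill and reap the remaining workers, so none can still write.
        for pid in workers:
            try:
                os.kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass
            os.waitpid(pid, 0)
        # 2. Only then remove the working directory.
        shutil.rmtree(workdir, ignore_errors=True)


def is_small(n):
    # Hypothetical test function, standing in for the real computation.
    return n < 2


# all() stops at the first False result, which closes the generator
# while other workers may still be running.
print(all(ok for _, ok in parallel_iter(is_small, range(6))))
```

Swapping the two steps in the finally block gives the failing order described above: a worker that finishes after the directory is gone but before it is killed has nowhere to write its .out file.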
Change History (29)
comment:1 Changed 9 years ago by
comment:2 Changed 9 years ago by
My guess would be that a forked process writes a temp file, and the parent tries to read it after the fork quits. That is inherently racy since the sage cleaner will attempt to delete the child's temp files as it has another pid.
comment:3 Changed 9 years ago by
- Milestone changed from sage-6.1 to sage-6.2
comment:4 Changed 8 years ago by
- Milestone changed from sage-6.2 to sage-6.3
comment:5 Changed 8 years ago by
- Keywords random_fail added
comment:6 Changed 8 years ago by
- Milestone changed from sage-6.3 to sage-6.4
comment:7 Changed 7 years ago by
Still happens occasionally
comment:8 follow-up: ↓ 9 Changed 6 years ago by
Seen this today for the first time ever, I think. (Sage 7.3.rc0)
comment:9 in reply to: ↑ 8 Changed 6 years ago by
Replying to leif:
Seen this today for the first time ever, I think. (Sage 7.3.rc0)
P.S.: Same error, different test:
sage -t --long --warn-long 68.2 src/sage/homology/simplicial_complex.py
**********************************************************************
File "src/sage/homology/simplicial_complex.py", line 2813, in sage.homology.simplicial_complex.SimplicialComplex.is_cohen_macaulay
Failed example:
    X.is_cohen_macaulay(ZZ)
Expected:
    False
Got:
    [Errno 2] No such file or directory: '/home/leif/.sage/temp/tunguska/16183/dir_5o8pnH/16223.out'
    False
**********************************************************************
comment:10 Changed 6 years ago by
I've seen this a few times recently, too.
comment:11 Changed 6 years ago by
- Milestone changed from sage-6.4 to sage-7.4
comment:12 Changed 6 years ago by
- Milestone changed from sage-7.4 to sage-7.6
Still there in 7.5 and the 7.6 betas.
comment:13 follow-up: ↓ 14 Changed 5 years ago by
I wonder if it would help if the tests all used just 1 CPU.
comment:14 in reply to: ↑ 13 ; follow-up: ↓ 15 Changed 5 years ago by
Replying to jhpalmieri:
I wonder if it would help if the tests all used just 1 CPU.
Maybe, but that's not fixing the problem, just hiding it.
If you want to "fix" the problem but not hide it, just add # known bug.
comment:15 in reply to: ↑ 14 Changed 5 years ago by
Replying to jdemeyer:
Replying to jhpalmieri:
I wonder if it would help if the tests all used just 1 CPU.
Maybe, but that's not fixing the problem, just hiding it.
If you want to "fix" the problem but not hide it, just add # known bug.
At least it would be good to know what part of the code in question writes temp files with extension .out. (In my admittedly limited experience with parallel code I never saw slaves doing any file I/O; if they do they ought to clean up after themselves, otherwise there is no telling what will happen.)
comment:16 follow-ups: ↓ 17 ↓ 19 Changed 5 years ago by
It would be also nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.
comment:17 in reply to: ↑ 16 Changed 5 years ago by
Replying to jhpalmieri:
It would be also nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.
As a rule, I get it on a gentoo linux laptop, running on a 4-core Intel i7, and the usual ext4 file systems on an SSD.
comment:18 Changed 5 years ago by
See #22462.
comment:19 in reply to: ↑ 16 ; follow-up: ↓ 20 Changed 5 years ago by
Replying to jhpalmieri:
It would be also nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.
I almost always get this problem when running MAKE='make -j8' make ptestlong on Ubuntu 16.04, with 8 CPUs and an ext4 file system.
I am now running make testlong serially to see the difference.
comment:20 in reply to: ↑ 19 Changed 5 years ago by
I am now running make testlong serially to see the difference.
I get All tests passed! with make testlong.
comment:21 Changed 5 years ago by
- Cc slabbe added
- Dependencies set to #22462
comment:22 Changed 5 years ago by
- Description modified (diff)
comment:23 Changed 5 years ago by
- Branch set to u/jdemeyer/ticket/15585
comment:24 Changed 5 years ago by
- Commit set to 51b2030a38280980cb8837d1f6b452ca1084ea8e
- Status changed from new to needs_review
comment:25 Changed 5 years ago by
On a machine that almost always gives the error on is_cohen_macaulay, I get All tests passed! on a single run of MAKE='make -j6' make ptestlong.
comment:26 Changed 5 years ago by
- Status changed from needs_review to positive_review
On a second run of make ptestlong, I still do not get the error. -> Great! Positive review.
comment:27 Changed 5 years ago by
- Reviewers set to Sébastien Labbé
comment:28 Changed 5 years ago by
Thanks for investigating this. I've been seeing this problem too, but thought it was a weird case of the sage-cleaner being overly aggressive for some reason.
comment:29 Changed 5 years ago by
- Branch changed from u/jdemeyer/ticket/15585 to 51b2030a38280980cb8837d1f6b452ca1084ea8e
- Resolution set to fixed
- Status changed from positive_review to closed
Full log (but really nothing interesting) at http://build.sagemath.org/sage/builders/%20%20fast%20AIMS%20snapperkob%20%28Ubuntu%2012.04%20x86_64%29%20incremental/builds/28/steps/shell_3/logs/stdio