Opened 6 years ago

Closed 3 years ago

#15585 closed defect (fixed)

Random failure in SimplicialComplex.is_cohen_macaulay

Reported by: vbraun
Owned by:
Priority: major
Milestone: sage-7.6
Component: algebra
Keywords: random_fail
Cc: jdemeyer, roed, slabbe
Merged in:
Authors: Jeroen Demeyer
Reviewers: Sébastien Labbé
Report Upstream: N/A
Work issues:
Branch: 51b2030
Commit: 51b2030a38280980cb8837d1f6b452ca1084ea8e
Dependencies: #22462
Stopgaps:

Description (last modified by jdemeyer)

This is fairly unlikely but occasionally comes up on the buildbot:

sage -t --long src/sage/homology/simplicial_complex.py
**********************************************************************
File "src/sage/homology/simplicial_complex.py", line 2236, in sage.homology.simplicial_complex.SimplicialComplex.is_cohen_macaulay
Failed example:
    S.is_cohen_macaulay(ncpus=3)
Expected:
    False
Got:
    [Errno 2] No such file or directory: '/home/buildbot/build/sage/snapperkob/sage_git/dot_sage/temp/snapperkob/10634/dir_n0BDmn/10759.out'
    False

This is because of a race condition in @parallel. This is a parallel generator which forks processes, each process handling one item of the generator. The output of each finished process is stored as a pickle in a working directory and then yielded by the main process.

When the generator is closed (for example, the generator is used as argument to all() and a False condition is found), the following happens in a finally block:

  1. The working directory is removed.
  2. The remaining processes are killed.

This is a race condition because it can happen that a subprocess finishes between these steps. Then that process wants to write its output in the deleted directory. The fix is obvious: first kill the processes, then delete the directory.
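
A minimal sketch of the corrected cleanup order, assuming a POSIX system with os.fork. This is not Sage's p_iter_fork code: the names forked_results, is_small and demo and the base_dir parameter are hypothetical and only serve the illustration. The "<pid>.out" file naming mirrors the paths in the error message above.

```python
import os
import pickle
import shutil
import signal
import tempfile
import time


def forked_results(func, inputs, base_dir=None):
    """Yield (x, func(x)) pairs, computing each item in its own forked child."""
    workdir = tempfile.mkdtemp(dir=base_dir)
    children = {}  # pid -> input handled by that child
    try:
        for x in inputs:
            pid = os.fork()
            if pid == 0:
                # Child: pickle the result to <workdir>/<pid>.out, then leave
                # via os._exit so the parent's finally block never runs here.
                try:
                    out = os.path.join(workdir, "%d.out" % os.getpid())
                    with open(out, "wb") as f:
                        pickle.dump(func(x), f)
                finally:
                    os._exit(0)
            children[pid] = x
        while children:
            pid, _status = os.wait()
            if pid not in children:
                continue  # some unrelated child process; ignore it
            x = children.pop(pid)
            with open(os.path.join(workdir, "%d.out" % pid), "rb") as f:
                result = pickle.load(f)
            yield x, result
    finally:
        # Step 1: kill and reap every child that is still running ...
        for pid in list(children):
            try:
                os.kill(pid, signal.SIGKILL)
                os.waitpid(pid, 0)
            except OSError:
                pass
        # Step 2: ... and only then remove the directory they write into.
        shutil.rmtree(workdir, ignore_errors=True)


def is_small(n):
    time.sleep(0.1 * n)  # larger inputs finish later
    return n < 2


def demo():
    base = tempfile.mkdtemp()
    gen = forked_results(is_small, range(5), base_dir=base)
    answer = all(ok for _, ok in gen)  # stops consuming at the first False
    gen.close()  # triggers the finally block while slower children may be alive
    leftovers = os.listdir(base)
    shutil.rmtree(base, ignore_errors=True)
    return answer, leftovers
```

Calling demo() closes the generator early, as all() does in the failing doctest. Swapping the two steps in the finally block would reopen the window described above: a child that finishes after the directory is gone fails with Errno 2 when it tries to open its .out file.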

Change History (29)

comment:2 Changed 6 years ago by vbraun

My guess would be that the forked process writes a temp file, and the parent tries to read it after the child quits. That is inherently racy since the sage cleaner will attempt to delete the child's temp files, as the child has a different pid.

comment:3 Changed 6 years ago by vbraun_spam

  • Milestone changed from sage-6.1 to sage-6.2

comment:4 Changed 6 years ago by vbraun_spam

  • Milestone changed from sage-6.2 to sage-6.3

comment:5 Changed 6 years ago by vbraun

  • Keywords random_fail added

comment:6 Changed 6 years ago by vbraun_spam

  • Milestone changed from sage-6.3 to sage-6.4

comment:7 Changed 5 years ago by vbraun

Still happens occasionally

comment:8 follow-up: Changed 4 years ago by leif

Seen this today for the first time ever, I think. (Sage 7.3.rc0)

comment:9 in reply to: ↑ 8 Changed 4 years ago by leif

Replying to leif:

Seen this today for the first time ever, I think. (Sage 7.3.rc0)

P.S.: Same error, different test:

sage -t --long --warn-long 68.2 src/sage/homology/simplicial_complex.py
**********************************************************************
File "src/sage/homology/simplicial_complex.py", line 2813, in sage.homology.simplicial_complex.SimplicialComplex.is_cohen_macaulay
Failed example:
    X.is_cohen_macaulay(ZZ)
Expected:
    False
Got:
    [Errno 2] No such file or directory: '/home/leif/.sage/temp/tunguska/16183/dir_5o8pnH/16223.out'
    False
**********************************************************************

comment:10 Changed 4 years ago by jhpalmieri

I've seen this a few times recently, too.

comment:11 Changed 4 years ago by leif

  • Milestone changed from sage-6.4 to sage-7.4

comment:12 Changed 3 years ago by dimpase

  • Milestone changed from sage-7.4 to sage-7.6

Still there in 7.5 and the 7.6 betas.

comment:13 follow-up: Changed 3 years ago by jhpalmieri

I wonder if it would help if the tests all used just 1 CPU.

comment:14 in reply to: ↑ 13 ; follow-up: Changed 3 years ago by jdemeyer

Replying to jhpalmieri:

I wonder if it would help if the tests all used just 1 CPU.

Maybe, but that's not fixing the problem, just hiding it.

If you want to "fix" the problem but not hide it, just add # known bug.

comment:15 in reply to: ↑ 14 Changed 3 years ago by dimpase

Replying to jdemeyer:

Replying to jhpalmieri:

I wonder if it would help if the tests all used just 1 CPU.

Maybe, but that's not fixing the problem, just hiding it.

If you want to "fix" the problem but not hide it, just add # known bug.

At least it would be good to know what part of the code in question writes temp files with extension .out. (In my admittedly limited experience with parallel code I never saw slaves doing any file I/O; if they do, they ought to clean up after themselves, otherwise there is no telling what will happen.)

comment:16 follow-ups: Changed 3 years ago by jhpalmieri

It would be also nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.

comment:17 in reply to: ↑ 16 Changed 3 years ago by dimpase

Replying to jhpalmieri:

It would be also nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.

As a rule, I get it on a gentoo linux laptop, running on a 4-core Intel i7, and the usual ext4 file systems on an SSD.

comment:18 Changed 3 years ago by jdemeyer

See #22462.

comment:19 in reply to: ↑ 16 ; follow-up: Changed 3 years ago by slabbe

Replying to jhpalmieri:

It would be also nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.

I almost always get this problem when running MAKE='make -j8' make ptestlong on Ubuntu 16.04, with 8 CPUs and an ext4 file system.

I am now running make testlong serially to see the difference.

comment:20 in reply to: ↑ 19 Changed 3 years ago by slabbe

I am now running make testlong serially to see the difference.

I get All tests passed! with make testlong.

comment:21 Changed 3 years ago by jdemeyer

  • Authors set to Jeroen Demeyer
  • Cc slabbe added
  • Dependencies set to #22462

comment:22 Changed 3 years ago by jdemeyer

  • Description modified (diff)

comment:23 Changed 3 years ago by jdemeyer

  • Branch set to u/jdemeyer/ticket/15585

comment:24 Changed 3 years ago by jdemeyer

  • Commit set to 51b2030a38280980cb8837d1f6b452ca1084ea8e
  • Status changed from new to needs_review

New commits:

8f7ff57 Use ContainChildren to implement p_iter_fork
a4dddcc Further fixes to use_fork
51b2030 Fix race condition is p_iter_fork

comment:25 Changed 3 years ago by slabbe

On a machine that almost always gives the error on is_cohen_macaulay, I get All tests passed! on a single run of MAKE='make -j6' make ptestlong.

comment:26 Changed 3 years ago by slabbe

  • Status changed from needs_review to positive_review

On a second run of make ptestlong, I still do not get the error. -> Great! Positive review.

comment:27 Changed 3 years ago by slabbe

  • Reviewers set to Sébastien Labbé

comment:28 Changed 3 years ago by embray

Thanks for investigating this. I've been seeing this problem too, but thought it was a weird case of the sage-cleaner being overly aggressive for some reason.

comment:29 Changed 3 years ago by vbraun

  • Branch changed from u/jdemeyer/ticket/15585 to 51b2030a38280980cb8837d1f6b452ca1084ea8e
  • Resolution set to fixed
  • Status changed from positive_review to closed