Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#8677 closed defect (fixed)

Race condition creating dirs during Sage build

Reported by: GeorgSWeber Owned by: GeorgSWeber
Priority: blocker Milestone: sage-4.6
Component: build Keywords:
Cc: Merged in: sage-4.6.rc0
Authors: Volker Braun, William Stein Reviewers: Volker Braun, Mitesh Patel
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Description (last modified by mpatel)

In a recent thread: http://groups.google.com/group/sage-devel/browse_thread/thread/97f1e17019caf2d0# it was reported, that " sage 4.3.5 won't build on Red Hat Enterprise Linux 5.4: File exists: '/usr/local_machine/sage-4.3.5/locallib/python/site-packagessage/structure' " The partial log output was:

python `which cython` --embed-positions --directive cdivision=True
-I/usr/local_machine/sage-4.3.5/devel/sage-main -o sage/stats/hmm/chmm.c
sage/stats/hmm/chmm.pyx
sage/stats/hmm/chmm.pyx -->
/usr/local_machine/sage-4.3.5/local//lib/python/site-packages//sage/stats/hmm/chmm.pyx
python `which cython` --embed-positions --directive cdivision=True
-I/usr/local_machine/sage-4.3.5/devel/sage-main -o
sage/structure/coerce_dict.c sage/structure/coerce_dict.pyx
sage/structure/coerce_dict.pyx -->
/usr/local_machine/sage-4.3.5/local//lib/python/site-packages//sage/structure/coerce_dict.pyx
python `which cython` --emTraceback (most recent call last):
   File "setup.py", line 754, in <module>
     execute_list_of_commands(queue)
   File "setup.py", line 250, in execute_list_of_commands
     execute_list_of_commands_in_parallel(command_list, nthreads)
   File "setup.py", line 191, in execute_list_of_commands_in_parallel
     for r in p.imap(apply_pair, command_list):
   File
"/usr/local_machine/sage-4.3.5/local/lib/python/multiprocessing/pool.py",
line 520, in next
     raise value
OSError: [Errno 17] File exists:
'/usr/local_machine/sage-4.3.5/local//lib/python/site-packages//sage/structure'
sage: There was an error installing modified sage library code.

ERROR installing SAGE

real    1m56.499s
user    8m27.884s
sys     0m19.321s
sage: An error occurred while installing sage-4.3.5
Please email sage-devel http://groups.google.com/group/sage-devel
explaining the problem and send the relevant part of
of /usr/local_machine/sage-4.3.5/install.log.  Describe your computer,
operating system, etc.
If you want to try to fix the problem yourself, *don't* just cd to
/usr/local_machine/sage-4.3.5/spkg/build/sage-4.3.5 and type 'make
check' or whatever is appropriate.
Instead, the following commands setup all environment variables
correctly and load a subshell for you to debug the error:
(cd '/usr/local_machine/sage-4.3.5/spkg/build/sage-4.3.5' &&
'/usr/local_machine/sage-4.3.5/sage' -sh)
When you are done debugging, you can type "exit" to leave the
subshell.
make[1]: *** [installed/sage-4.3.5] Error 1
make[1]: Leaving directory `/usr/local_machine/sage-4.3.5/spkg'

real    59m35.961s
user    64m50.926s
sys     11m16.115s
Error building Sage.
warning:
/usr/local_machine/sage-4.3.5/devel/sage-main/sage/structure/parent_base.pyx:63:4:
Overriding cdef method with def method.
warning:
/usr/local_machine/sage-4.3.5/devel/sage-main/sage/structure/parent.pyx:118:4:
dict already a builtin Cython type 

My suspicion is, that the reason is some race condition in checking, whether during the build some directory has to be newly added, or already exists.

Release manager

Apply only trac_8677.4.patch.

Attachments (4)

trac_8677.patch (1.3 KB) - added by was 9 years ago.
trac_8677.2.patch (1.4 KB) - added by mpatel 9 years ago.
Alternate version, fix commit string. Apply just one patch.
trac_8677.3.patch (1.7 KB) - added by vbraun 9 years ago.
Yet another alternate version
trac_8677.4.patch (1.7 KB) - added by jhpalmieri 9 years ago.
Still another alternate version

Download all attachments as: .zip

Change History (22)

comment:1 Changed 9 years ago by tomc

That is consistent with the fact that I have previously built Sage successfully: this was exactly the same version of Sage, on exactly the same hardware.  The only difference between the two situations was that before the unsuccessful build I did 'export MAKE="make -j6"' to make the build process use 6 jobs in parallel, whereas the successful build used only one build job.

comment:2 Changed 9 years ago by craigcitro

Yep, this is a race condition. I hit exactly this problem for the "second" half of the parallel build stuff -- namely dispatching calls to gcc in parallel. The fix there would work here: at some point before the parallel dispatch, I walked the tree and made sure to create all the necessary directories. The other option would be to catch and ignore the OSError, but I think that's a horrible idea. Well, that's not completely true -- if you checked the error message, and confirmed that the filename it complained about was a directory that does now exist, it would probably be fine to ignore it and move on.

There's nothing RH specific about this -- I think it was just bad luck.

comment:3 Changed 9 years ago by was

I setup NFS on disk.math and sage.math, and when I export an NSF share using the async option two things happen:

(1) the filesystem is noticeably *much* faster,

(2) I can repeatedly get the error that is being discussed here with, e.g., MAKE="make -j8" on sage.math.

Probably the RedHat? system Tom Coates was using has a similar NFS setup.

comment:4 Changed 9 years ago by was

  • Priority changed from major to critical

Changed 9 years ago by was

comment:5 Changed 9 years ago by was

  • Priority changed from critical to blocker
  • Status changed from new to needs_review

comment:6 Changed 9 years ago by mpatel

  • Authors set to William Stein

With async set, do we need to worry more than usual about data corruption? According to this Linux NFS FAQ, async is potentially unsafe.

Should we do as Craig suggests above, e.g., pass in the except block if dirname now exists, but raise otherwise?

comment:7 Changed 9 years ago by vbraun

A correctly working NFSv3 shouldn't need the async options to achieve fast writes, but maybe some older servers/clients are being used? With async, data will be corrupted if the server crashes, but thats probably acceptable on a build system.

I take issue with the patch, however. The current race seems to be

    if not os.path.exists(path): 
        # another process can create path here
        os.makedirs(path) 
        # error!

The correct fix would be to replace the whole if clause with

    try: 
        os.makedirs(path) 
    except OSError: 
        pass 

But the patch mixes up both. It still contains the original race and then hacks around it to trap the error.

comment:8 follow-up: Changed 9 years ago by was

A correctly working NFSv3 shouldn't need the async options to achieve fast writes

Are you just making this up, or?!

The correct fix would be to replace the whole if clause with ...

I strongly disagree. I see no point in not including the

if not os.path.exists(dirname):

code, since that makes perfect sense when doing a serial build, which is what most people do.

comment:9 in reply to: ↑ 8 ; follow-up: Changed 9 years ago by vbraun

Replying to was:

Are you just making this up, or?!

Thats what the aforementioned FAQ says:

For the Linux implementation of NFS Version 3, using the "async" export option to allow faster writes is no longer necessary. NFS Version 3 explicitly allows a server to reply before writing data to disk, under controlled circumstances.

I strongly disagree. I see no point in not including the if not os.path.exists(dirname):

The point is that the pattern "if (look at filesystem) then (modify filesystem)" is fundamentally flawed. Its the origin of countless security problems from temp races etc. The world is a better place without it :-) And the only way to make it die is to stomp it out wherever we encounter it...

comment:10 in reply to: ↑ 9 ; follow-up: Changed 9 years ago by mpatel

Replying to vbraun:

Replying to was:

I strongly disagree. I see no point in not including the if not os.path.exists(dirname):

The point is that the pattern "if (look at filesystem) then (modify filesystem)" is fundamentally flawed. Its the origin of countless security problems from temp races etc. The world is a better place without it :-) And the only way to make it die is to stomp it out wherever we encounter it...

What if we use William's solution for the first "hunk" and Volker's for the second? Then we can avoid many unnecessary exceptions in the first. In the second, we only call os.makedirs if copying the Cython-compiled file fails.

Changed 9 years ago by mpatel

Alternate version, fix commit string. Apply just one patch.

comment:11 in reply to: ↑ 10 Changed 9 years ago by mpatel

  • Summary changed from race condition creating dirs during Sage build on RH 5.4 to Race condition creating dirs during Sage build

Replying to mpatel:

What if we use William's solution for the first "hunk" and Volker's for the second? Then we can avoid many unnecessary exceptions in the first. In the second, we only call os.makedirs if copying the Cython-compiled file fails.

I've add a patch that does this.

comment:12 Changed 9 years ago by vbraun

Sorry, maybe I did not understand the issue. Is this about improving performance? Is that microsecond really going to matter in a loop where we copy files and/or call the compiler?

In any case, I think mixing the two patterns is taking the worst of both worlds ;-) I'll attach my version of the patch in case I wasn't clear enough.

Changed 9 years ago by vbraun

Yet another alternate version

comment:13 Changed 9 years ago by mpatel

With a clean, built-from-scratch 4.6.alpha3 on sage.math, I applied v3, did rm -rf devel/sage/build/ and then env MAKE="make -j20" ./sage -ba-force. The latter fails after a number of python `which cython` commands:

[...]
Traceback (most recent call last):
  File "setup.py", line 799, in <module>
    execute_list_of_commands(queue)
  File "setup.py", line 319, in execute_list_of_commands
    execute_list_of_commands_in_parallel(command_list, nthreads)
  File "setup.py", line 252, in execute_list_of_commands_in_parallel
    process_command_results(p.imap(apply_pair, command_list))
  File "setup.py", line 256, in process_command_results
    for r in result_values:
  File "/mnt/usb1/scratch/mpatel/tmp/sage-4.6.alpha3-working2/local/lib/python/multiprocessing/pool.py", line 520, in next
    raise value
NameError: global name 'path' is not defined
sage: There was an error installing modified sage library code.

This doesn't happen with v1 or v2, but I'm not sure why. Thoughts?

comment:14 follow-up: Changed 9 years ago by jhpalmieri

This doesn't happen with v1 or v2, but I'm not sure why. Thoughts?

Probably on lines 687-690, "path" should be "dirname"? I'm attaching a new version with this change. Since all of the other patches list William as the author, I'm doing the same for this one...

Changed 9 years ago by jhpalmieri

Still another alternate version

comment:15 in reply to: ↑ 14 Changed 9 years ago by mpatel

Replying to jhpalmieri:

This doesn't happen with v1 or v2, but I'm not sure why. Thoughts?

Probably on lines 687-690, "path" should be "dirname"? [...]

"D'oh!" Thanks, John.

comment:16 Changed 9 years ago by mpatel

  • Authors changed from William Stein to Volker Braun, William Stein
  • Description modified (diff)
  • Reviewers set to Volker Braun, Mitesh Patel
  • Status changed from needs_review to positive_review

comment:17 Changed 9 years ago by mpatel

  • Merged in set to sage-4.6.rc0
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:18 Changed 9 years ago by leif

Just for the record:

Though I've seen this not very often (I think the first time with 4.6.alpha2), I can confirm that running into the race condition wasn't limited to NFS-mounted file systems...

Note: See TracTickets for help on using tickets.