Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#24647 closed enhancement (fixed)

Stricter locking around sage-rebase on Cygwin

Reported by: embray Owned by:
Priority: major Milestone: sage-8.2
Component: porting: Cygwin Keywords: rebase
Cc: Merged in:
Authors: Erik Bray Reviewers: Emmanuel Charpentier
Report Upstream: N/A Work issues:
Branch: 188b546 (Commits) Commit:
Dependencies: Stopgaps:

Description (last modified by embray)

In #23397 I added a file lock around running sage-rebase.sh at the end of every package install on Cygwin, in an effort to reduce errors when multiple packages run sage-rebase.sh simultaneously during parallel builds.

Unfortunately, it seems this was not enough--it's possible to get errors during parallel builds if sage-rebase.sh is run during the build/installation of some packages.

So with this change (on Cygwin only of course) all build processes obtain a shared lock on rebase.lock, and then sage-rebase.sh is run under an exclusive lock.

This unfortunately slows down parallel builds--for any N packages being built simultaneously, all packages in that set have to wait whichever one takes the longest to build. But I think this is a necessary sacrifice in order for the builds to not run into this problem. In the future it may be possible to make this more fine-grained; after all it isn't strictly necessary to run rebase for every package (only those that install DLLs).

Attachments (1)

dochtml.log.gz (73.2 KB) - added by charpent 3 years ago.
Log of two attempts to make the documentation with MAKE=="make -j4"

Download all attachments as: .zip

Change History (33)

comment:1 Changed 3 years ago by embray

  • Status changed from new to needs_review

comment:2 Changed 3 years ago by jdemeyer

Does Cygwin support /dev/fd? That might be a simpler solution to deal with locks from a file descriptor.

comment:3 Changed 3 years ago by embray

It does, but I'm not exactly sure what you're proposing. This is reasonably simple as-is and less platform-dependent.

comment:4 Changed 3 years ago by embray

  • Description modified (diff)

comment:5 follow-ups: Changed 3 years ago by charpent

  • Status changed from needs_review to needs_work

I think that this is related : the system hangs with a message telling "Waiting for rebase lock". I gave up after reasonable time (about ten minutes with *no* activity indication after this message) : C-c C-C, then closing all Cygwin processes and goig to cmd.exe to execute %SAGE_ROOT%\local\bin\sage-rebaseall.bat.

[BTW: I had to edit call :READREGISTRY HKEY_LOCAL_MACHINE\SOFTWARE\Cygwin\setup rootdir to call :READREGISTRY HKEY_CURRENT_USER\SOFTWARE\Cygwin\setup rootdir : I have no admin rights on the machine I'm trying this on...).

This happened twice. A third attempt worked after waiting about 20 minutes. Go figure...

This should probably documented at least in the information message ("Now waiting for the rebase lock. This might be long...") and possibly in the Cygwin port page.

==> tentatively needs_work

[ I'm almost sure this is not related, but when in doubt... ] : I still get lots of errors when trying to build in parallel, dur to fork() failing (Resource temporarily unavailable). Most times, it happens while trying to install a pip package.

Last edited 3 years ago by charpent (previous) (diff)

comment:6 in reply to: ↑ 5 Changed 3 years ago by embray

Replying to charpent:

[ I'm almost sure this is not related, but when in doubt... ] : I still get lots of errors when trying to build in parallel, dur to fork() failing (Resource temporarily unavailable). Most times, it happens while trying to install a pip package.

That's what this is trying to resolve.

comment:7 in reply to: ↑ 5 ; follow-up: Changed 3 years ago by embray

Replying to charpent:

I think that this is related : the system hangs with a message telling "Waiting for rebase lock". I gave up after reasonable time (about ten minutes with *no* activity indication after this message) : C-c C-C, then closing all Cygwin processes and goig to cmd.exe to execute %SAGE_ROOT%\local\bin\sage-rebaseall.bat.

I think in this case it's not that there's nothing happening, but rather the output from the other build process gets blocked until it's complete (so e.g. if some builds are waiting on mpir to build which can take a long time, you'll get blocked output and then a big output dump when the mpir build is done). I'm not exactly sure what that is but it should maybe be fixed...

comment:8 follow-up: Changed 3 years ago by embray

I think I might see the problem. It's the same reason I added an explicit echo "Waiting for rebase lock" in sage-spkg even though sage-flock also (normally) displays a message. However, the message displayed by sage-flock doesn't output a newline until the lock is actually obtained. This causes the line-buffered output to hang. I'm going to test that though, and if so I think it would suffice to add a --quiet flag to sage-flock so that it doesn't output anything for cases like this.

comment:9 in reply to: ↑ 7 ; follow-up: Changed 3 years ago by embray

Replying to embray:

Replying to charpent:

I think that this is related : the system hangs with a message telling "Waiting for rebase lock". I gave up after reasonable time (about ten minutes with *no* activity indication after this message) : C-c C-C, then closing all Cygwin processes and goig to cmd.exe to execute %SAGE_ROOT%\local\bin\sage-rebaseall.bat.

I think in this case it's not that there's nothing happening, but rather the output from the other build process gets blocked until it's complete (so e.g. if some builds are waiting on mpir to build which can take a long time, you'll get blocked output and then a big output dump when the mpir build is done). I'm not exactly sure what that is but it should maybe be fixed...

Regarding the above comment, I don't think MPIR specifically was the issue, but just s/mpir/whatever-other-slow-building-package/ in the above.

Changed 3 years ago by charpent

Log of two attempts to make the documentation with MAKE=="make -j4"

comment:10 in reply to: ↑ 9 ; follow-up: Changed 3 years ago by charpent

Replying to embray:

Replying to embray:

Replying to charpent:

I think that this is related : the system hangs with a message telling "Waiting for rebase lock". I gave up after reasonable time (about ten minutes with *no* activity indication after this message) : C-c C-C, then closing all Cygwin processes and goig to cmd.exe to execute %SAGE_ROOT%\local\bin\sage-rebaseall.bat.

I think in this case it's not that there's nothing happening, but rather the output from the other build process gets blocked until it's complete (so e.g. if some builds are waiting on mpir to build which can take a long time, you'll get blocked output and then a big output dump when the mpir build is done). I'm not exactly sure what that is but it should maybe be fixed...

Regarding the above comment, I don't think MPIR specifically was the issue, but just s/mpir/whatever-other-slow-building-package/ in the above.

That would be a problem of delay.

So this might be related : I can't finish make (with MAKE="make -j4") on this machine ; it fails reporting a timeout in a forked process. See the uploaded log.

I'll unset MAKE and try to finish this one.

Last edited 3 years ago by charpent (previous) (diff)

comment:11 Changed 3 years ago by embray

I can't seem to reproduce the problem of the output hanging, but I feel like I have seen it before, but I don't remember in exactly what context.

comment:12 in reply to: ↑ 10 Changed 3 years ago by charpent

Replying to charpent:

[ Snip... ]

I'll unset MAKE and try to finish this one.

A serial make finished successfully un abouy 15 minutes (the last unsuccesful attempt to parallel make (which was only for docs) took 23 minutes).

FWIW, my various attempts to parallel make took about 22h30min of registered computing time, not that far of the 27h12min and some change needed by my previous serial building of 8.2.beta5(IIRC).

The elapsed time was much longer, sice I didn't attend to all the build and discovered errors after coming back to my office after ( night | lunch | another night ).

At least on my present machine, parallel building seems currently a losing proposition.

I also noticed a strong tendency to trashing (a coexistent Chrome window completely stopped responding several times...) : 8 GB RAM might be insufficient for safely building sage on 4 threads...

comment:13 follow-up: Changed 3 years ago by embray

I don't know what the deal is with your machine then. I don't have a recent result for a full build from scratch on mine but it should't take that long. You probably have some kind of BLODA: https://cygwin.com/faq/faq.html#faq.using.bloda

Last edited 3 years ago by embray (previous) (diff)

comment:14 in reply to: ↑ 8 ; follow-up: Changed 3 years ago by embray

Replying to embray:

I think I might see the problem. It's the same reason I added an explicit echo "Waiting for rebase lock" in sage-spkg even though sage-flock also (normally) displays a message. However, the message displayed by sage-flock doesn't output a newline until the lock is actually obtained. This causes the line-buffered output to hang. I'm going to test that though, and if so I think it would suffice to add a --quiet flag to sage-flock so that it doesn't output anything for cases like this.

In retrospect, this shouldn't be necessary either since the sage-flock call's stderr is piped to /dev/null. I'm going to try again to reproduce the issue but if I can't I'd ask you set postive_review anyways because it definitely fixes all rebase issues during build for my personal machine, as well as the patchbot...

comment:15 follow-up: Changed 3 years ago by embray

I did find a different bug--if your don't already have $SAGE_LOCAL/var/lock this failed to create it; that might have been a problem in your case.

comment:16 Changed 3 years ago by git

  • Commit changed from 2f009ef30842c65fa67fcf0afc2d7e3e27ad5305 to 188b546d4a7e13ab5b567072d34e0afcd476e898

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

26f2603Adds a couple new features to sage-flock borrowed from the standard flock command:
7bfda95Stricter locking around sage-rebase on Cygwin
188b546Ensure that SAGE_LOCAL/var/lock actually exists

comment:17 follow-up: Changed 3 years ago by embray

Gonna start a fresh build from scratch with make -j4.

comment:18 in reply to: ↑ 13 Changed 3 years ago by charpent

Replying to embray:

I don't know what the deal is with your machine then. I don't have a recent result for a full build from scratch on mine but it should't take that long. You probably have some kind of BLODA: https://cygwin.com/faq/faq.html#faq.using.bloda

From this list, I can only suspect Google Chrome, if that is akin to the "Google Desktop" listed.

But since is a machine configured by our (somewhat paranoïd) administrators, I might well have, for example, an antivirus running in the background and invisible to me.

Care for a full install log (I'd have to send it directly to you, since it woould be about 2.1 MB gzipped...) ?


New commits:

26f2603Adds a couple new features to sage-flock borrowed from the standard flock command:
7bfda95Stricter locking around sage-rebase on Cygwin
188b546Ensure that SAGE_LOCAL/var/lock actually exists

comment:19 in reply to: ↑ 15 ; follow-up: Changed 3 years ago by charpent

Replying to embray:

I did find a different bug--if your don't already have $SAGE_LOCAL/var/lock this failed to create it; that might have been a problem in your case.

Current state :

$ ls -lR sage/local/var/lock/
sage/local/var/lock/:
total 0
-rw-r--r-- 1 121235 Domain Users 0 12 mars  19:03 pip2.lock
-rw-r--r-- 1 121235 Domain Users 0 12 mars  19:03 rebase.lock

comment:20 in reply to: ↑ 14 Changed 3 years ago by charpent

  • Status changed from needs_work to positive_review

Sorry, I didn't see the last part at first...

Replying to embray:

[ Snip... ]

In retrospect, this shouldn't be necessary either since the sage-flock call's stderr is piped to /dev/null. I'm going to try again to reproduce the issue but if I can't I'd ask you set postive_review anyways because it definitely fixes all rebase issues during build for my personal machine, as well as the patchbot...

I agree with you : I'm starting to suppose that my "work" machine's configuration (out of my control) is even worse than Windows...

I'll try to setup a fresh Windows VM on a sane (i. e. Linux) machine and reproduce the issues. If so, I'll re-open.

comment:21 in reply to: ↑ 19 ; follow-up: Changed 3 years ago by embray

Replying to charpent:

Replying to embray:

I did find a different bug--if your don't already have $SAGE_LOCAL/var/lock this failed to create it; that might have been a problem in your case.

Current state :

$ ls -lR sage/local/var/lock/
sage/local/var/lock/:
total 0
-rw-r--r-- 1 121235 Domain Users 0 12 mars  19:03 pip2.lock
-rw-r--r-- 1 121235 Domain Users 0 12 mars  19:03 rebase.lock

It would get created, eventually, at least after the first pip install. But with the fix I just pushed you might consider trying another make distclean and make from scratch. If you want to save time, make sure to use the --with-blas=atlas and SAGE_ATLAS_LIB=/usr/lib (as long as you have cygwin's lapack package installed).

comment:22 in reply to: ↑ 21 Changed 3 years ago by charpent

Replying to embray:

[ Snip... ]

It would get created, eventually, at least after the first pip install. But with the fix I just pushed you might consider trying another make distclean and make from scratch.

I'll try to do this eventually, but I won't be able to do this in the next two weeks at a minimum, at least on my problematic machine.

If you want to save time, make sure to use the --with-blas=atlas and SAGE_ATLAS_LIB=/usr/lib (as long as you have cygwin's lapack package installed).

That's what I did anyway...

comment:23 in reply to: ↑ 17 ; follow-up: Changed 3 years ago by charpent

Replying to embray:

Gonna start a fresh build from scratch with make -j4.

How did this went ?

comment:24 in reply to: ↑ 23 ; follow-up: Changed 3 years ago by embray

Replying to charpent:

Replying to embray:

Gonna start a fresh build from scratch with make -j4.

How did this went ?

In retrospect, maybe the times you got aren't so unreasonable after all:

real    361m29.365s
user    313m46.310s
sys     162m22.775s
Sage build/upgrade complete!

That's just over 6 hours. If it were a serial build it would be roughly 4 times that (roughly of course--just because it was using -j4 doesn't mean it's going to be exactly 4 times faster, especially given how much this patch slows things down).

I'll also say however that the build completed without any errors or requiring any intervention on my part, so it should be pretty stable. I think my last version of this patch may have been slightly broken because of the directory creation issue, so maybe it didn't quite work as advertised before.

I also didn't notice any long hangs with no output, but it's not as if I had my eyes glued to the screen the entire time either.

comment:25 in reply to: ↑ 24 ; follow-up: Changed 3 years ago by charpent

Replying to embray:

Replying to charpent:

Replying to embray:

Gonna start a fresh build from scratch with make -j4.

How did this went ?

In retrospect, maybe the times you got aren't so unreasonable after all:

real    361m29.365s
user    313m46.310s
sys     162m22.775s
Sage build/upgrade complete!

That's just over 6 hours. If it were a serial build it would be roughly 4 times that (roughly of course--just because it was using -j4 doesn't mean it's going to be exactly 4 times faster, especially given how much this patch slows things down).

The times I quoted were for attempts to parallel build... So my times are still unacceptably high.

Your hypothesis of an interference with an (unknown) program running on my config still deserves exploration. A first attempt of running with "CYGWIN=detect_bloda" didn't show anything. But I'll retry this after a cold start...

I'll also try a fresh build on a VM. As already told, this will have to wait a bit...

BTW, something else bothers me : the discrepancy between wall time and CPU time. From the (serial) testlong.log I ran after completing the build :

Total time for all tests: 63115.2 seconds
    cpu time: 17746.2 seconds
    cumulative wall time: 45802.2 seconds

and this is displayed below the call :

real    1074m14,375s
user    687m47,318s
sys     247m32,749s

Notes :

  • A bit less than 18 hours for the tests is a bit long in the tooth...
  • I get the two errors already reported (killing a process takes more than 4 seconds, unicode error in the doctest framework) plus a new one :
sage -t --long --warn-long 209.2 src/sage/structure/coerce_actions.pyx
**********************************************************************
File "src/sage/structure/coerce_actions.pyx", line 786, in sage.structure.coerce
_actions.IntegerMulAction._repr_name_
Failed example:
    IntegerMulAction(ZZ, GF5)
Expected:
    Left Integer Multiplication by Integer Ring on Finite Field of size 5
Got:
    Left Integer Multiplication by Integer Ring on Finite Field of size 1
**********************************************************************
1 item had failures:
   1 of   4 in sage.structure.coerce_actions.IntegerMulAction._repr_name_
    [143 tests, 1 failure, 11.83 s]

Care for a formal bug report ?

I'll also say however that the build completed without any errors or requiring any intervention on my part, so it should be pretty stable. I think my last version of this patch may have been slightly broken because of the directory creation issue, so maybe it didn't quite work as advertised before.

I also didn't notice any long hangs with no output, but it's not as if I had my eyes glued to the screen the entire time either.

comment:26 in reply to: ↑ 25 Changed 3 years ago by embray

Replying to charpent:

Replying to embray:

Replying to charpent:

Replying to embray:

Gonna start a fresh build from scratch with make -j4.

How did this went ?

In retrospect, maybe the times you got aren't so unreasonable after all:

real    361m29.365s
user    313m46.310s
sys     162m22.775s
Sage build/upgrade complete!

That's just over 6 hours. If it were a serial build it would be roughly 4 times that (roughly of course--just because it was using -j4 doesn't mean it's going to be exactly 4 times faster, especially given how much this patch slows things down).

The times I quoted were for attempts to parallel build... So my times are still unacceptably high.

Your hypothesis of an interference with an (unknown) program running on my config still deserves exploration. A first attempt of running with "CYGWIN=detect_bloda" didn't show anything. But I'll retry this after a cold start...

I'm pretty sure CYGWIN=detect_bloda is deprecated / doesn't do anything anymore.

comment:27 Changed 3 years ago by embray

I just ran a make ptestlong and I also got the coerce_actions failure, though the same module passed when run on its own. Seems like some kind of caching issue maybe--worth investigating.

comment:28 Changed 3 years ago by embray

I ran it again and now it's failing consistently, even though it passed once before.

Anyways, that's a separate issue from this--I'll open a ticket for it.

comment:29 Changed 3 years ago by chapoton

  • Reviewers set to Eric Charpentier

comment:30 Changed 3 years ago by chapoton

  • Reviewers changed from Eric Charpentier to Emmanuel Charpentier

comment:31 Changed 3 years ago by vbraun

  • Branch changed from u/embray/cygwin/more-rebase-locks to 188b546d4a7e13ab5b567072d34e0afcd476e898
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:32 Changed 3 years ago by embray

  • Commit 188b546d4a7e13ab5b567072d34e0afcd476e898 deleted

Heh, oh no! This has me confused for a moment.

Note: See TracTickets for help on using tickets.