#26011 closed enhancement (fixed)
Copying package files is really slow
Reported by: | jdemeyer | Owned by: | embray |
---|---|---|---|
Priority: | blocker | Milestone: | sage-8.4 |
Component: | build | Keywords: | |
Cc: | embray | Merged in: | |
Authors: | Erik Bray | Reviewers: | Dima Pasechnik |
Report Upstream: | N/A | Work issues: | |
Branch: | b58e16e (Commits) | Commit: | |
Dependencies: | Stopgaps: |
Description (last modified by )
For some packages (especially databases containing just files), the "Copying package files" bit of the install process takes a very long time. For example, the pari_galpol
package (see #26010) needs
real 3m19.377s user 2m22.153s sys 0m36.311s
to install. This is just a database package containing 14681 files to be installed (total size 80MB). Ideally, this requires only I/O operations to install and the user CPU time should be close to 0.
The reason is probably that the code to keep track of the installed files is actually O(n^2)
in the number of files.
Change History (22)
comment:1 Changed 2 years ago by
- Description modified (diff)
comment:2 follow-up: ↓ 4 Changed 2 years ago by
comment:3 Changed 2 years ago by
- Owner changed from (none) to embray
comment:4 in reply to: ↑ 2 Changed 2 years ago by
comment:5 Changed 2 years ago by
For my own records, I tried this with just the existing database_pari
package and got
real 2m40.827s user 0m46.358s sys 0m31.482s
(the majority of which, as you note, was spent in the "Copying files" phase).
I'm going to see what I can do to improve this a little bit without having to rewrite too much (yet).
comment:6 Changed 2 years ago by
- Branch set to u/embray/ticket-26011
- Commit set to 54d73ac929f32c1c47443f08c7b835cb0daae46c
- Status changed from new to needs_review
This brought the time for database_pari down to:
real 0m32.636s user 0m24.156s sys 0m3.019s
(still with the majority spent in copying). It should also fix #26013 at the same time (conflicting with the branch currently there).
I am concerned about the possibility of race conditions with cp
, and the sed call is cryptic even for my liking but I still plan to replace more of this with Python.
New commits:
54d73ac | simplify and speed up logic for package file copying and file manifest generation
|
comment:7 Changed 2 years ago by
- Status changed from needs_review to needs_work
There are merge conflicts.
comment:8 Changed 2 years ago by
- Commit changed from 54d73ac929f32c1c47443f08c7b835cb0daae46c to 09a2926255236ef3210a8c644439fbe58b7ccf07
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
09a2926 | simplify and speed up logic for package file copying and file manifest generation
|
comment:9 Changed 2 years ago by
I rebased this, but I'm having some other build issues on this branch, and I need to look into what's going on and if it's related to the change.
comment:10 Changed 2 years ago by
- Commit changed from 09a2926255236ef3210a8c644439fbe58b7ccf07 to b58e16e85602dac560aec89499672ee5d3c39686
Branch pushed to git repo; I updated commit sha1. New commits:
b58e16e | Ensure that we don't add an empty 'file' to the file list for packages that install no files (e.g. when installing 'appnope' on non-OSX
|
comment:11 Changed 2 years ago by
Using cp
instead of mv
is also a regression because moving should be a lot faster than copying. Why not use the sed
script but keep moving as before?
comment:12 Changed 2 years ago by
You're right of course, but the problem is that mv
does not handle moving whole directory trees very well. For example, if you try to mv
a directory over a directory of the same name that already exists you'll just get an annoying "Directory not empty" error.
If you know a way to efficiently mv
a directory tree on top of an existing one I'm all for it; off the top of my head I just don't know how to do that. There is no mv --recursive
(perhaps there should be).
comment:13 Changed 2 years ago by
I mean, the code I originally had was essentially a hand-rolled recursive mv
, but it was slow for one--in part just due to the use of some shell idioms. Perhaps I could rewrite just that part in Python though then you have the overhead of starting up Python (perl would be better but I'm not sure we want to start introducing perl into the build process :)
But regardless how it's implemented it seems there are also some tricky pitfalls to it, like handling symlinks properly (though if that's the only issue then it's easy enough to handle it).
comment:14 Changed 2 years ago by
I put a time
in front of the cp
call in the current version of this branch and it shows:
[database_pari-20161017] Moving package files from temporary location /home/embray/src/sagemath/sage/local/var/tmp/sage/build/database_pari-20161017/inst to /home/embray/src/sagemath/sage/local [database_pari-20161017] [database_pari-20161017] real 0m2.204s [database_pari-20161017] user 0m0.010s [database_pari-20161017] sys 0m0.619s
which seems fine. But with the for
loop calling mv
for each file (and also creating directories and removing symlinks as necessary) I get:
[database_pari-20161017] Moving package files from temporary location /home/embray/src/sagemath/sage/local/var/tmp/sage/build/database_pari-20161017/inst to /home/embray/src/sagemath/sage/local [database_pari-20161017] [database_pari-20161017] real 0m17.014s [database_pari-20161017] user 0m1.847s [database_pari-20161017] sys 0m11.742s
Of course, if I wrote a dedicated program to do this it would probably be a lot faster. If nothing else just running mv
for every file probably adds enormous overhead when there are a lot of files; at least that's my guess.
comment:15 Changed 2 years ago by
I tried dropping this Python script into sage-spkg
to do the moves:
time cat <<_EOF_ | python - "$PREFIX" "$SAGE_LOCAL" import os import sys src, dst = sys.argv[1:] for dirpath, dirnames, filenames in os.walk(src): dst_path = os.path.join(dst, os.path.relpath(dirpath, src)) for dirname in dirnames: dst_dirname = os.path.join(dst_path, dirname) if not os.path.exists(dst_dirname): os.makedirs(dst_dirname) for filename in filenames: src_filename = os.path.join(dirpath, filename) dst_filename = os.path.join(dst_path, filename) if os.path.islink(dst_filename): os.remove(dst_filename) os.rename(src_filename, dst_filename) _EOF_
and got
[database_pari-20161017] real 0m3.960s [database_pari-20161017] user 0m0.463s [database_pari-20161017] sys 0m0.312s
so still slower than the single cp
call, but much faster than the shell loop.
I think in the case of database_pari a lot of the overhead is just in the large number of files (over 8000) than in overall size of the files. Although in this case the total size of the files is still quite large (> 250 MB) most of the individual files are small, and moving has the same or nearly the same cost as copying. Whereas, if we had a package with a single very large file moving would definitely be faster.
One almost certain advantage to using mv/rename is that it's supposed to be atomic, whereas cp is not. This may have some implications for parallel builds but I'm not sure.
comment:16 Changed 2 years ago by
Using python -S
also speeds things up a bit, but not dramatically.
comment:17 Changed 2 years ago by
- Status changed from needs_work to needs_review
Jeroen, do you know off the top of your head a better solution for mv
-ing an entire directory tree?
Otherwise, I'm not sure there's a huge advantage to it over just using cp -r
, except in the case of very large files.
There's also the possibility of using the above Python snippet, which seems to work well (but is more complicated obviously).
I'm not convinced one way or the other what the best thing to do here is so I'm curious what solution you prefer.
comment:18 Changed 2 years ago by
- Priority changed from minor to blocker
comment:19 Changed 2 years ago by
- Reviewers set to Dima Pasechnik
- Status changed from needs_review to positive_review
it looks good.
comment:20 Changed 2 years ago by
- Branch changed from u/embray/ticket-26011 to b58e16e85602dac560aec89499672ee5d3c39686
- Resolution set to fixed
- Status changed from positive_review to closed
comment:21 Changed 2 years ago by
- Commit b58e16e85602dac560aec89499672ee5d3c39686 deleted
comment:22 Changed 2 years ago by
See #26642 for a blocker follow-up.
I admit, I wasn't sure how well-optimized substitutions like
FOO="${FOO} ..."
are in bash. I assume that's the part you're saying is O(n2)?I think I was trying to avoid a second loop there and crossing my fingers that there would be some optimization for that case, but I didn't exactly do much performance testing. It would probably be much better to use an array there, and then loop over it when writing the actual file list to the stamp file. Better O(2n) then O(n2).
That said, I wonder how much effort it would be to just rewrite
sage-spkg
in Python... For starters though I'll see how much improvement can be made with what we already have.