Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#26011 closed enhancement (fixed)

Copying package files is really slow

Reported by: jdemeyer Owned by: embray
Priority: blocker Milestone: sage-8.4
Component: build Keywords:
Cc: embray Merged in:
Authors: Erik Bray Reviewers: Dima Pasechnik
Report Upstream: N/A Work issues:
Branch: b58e16e (Commits, GitHub, GitLab) Commit:
Dependencies: Stopgaps:

Status badges

Description (last modified by jdemeyer)

For some packages (especially databases containing just files), the "Copying package files" bit of the install process takes a very long time. For example, the pari_galpol package (see #26010) needs

real    3m19.377s
user    2m22.153s
sys     0m36.311s

to install. This is just a database package containing 14681 files to be installed (total size 80MB). Ideally, this requires only I/O operations to install and the user CPU time should be close to 0.

The reason is probably that the code to keep track of the installed files is actually O(n^2) in the number of files.

Change History (22)

comment:1 Changed 3 years ago by jdemeyer

  • Description modified (diff)

comment:2 follow-up: Changed 3 years ago by embray

I admit, I wasn't sure how well-optimized substitutions like FOO="${FOO} ..." are in bash. I assume that's the part you're saying is O(n2)?

I think I was trying to avoid a second loop there and crossing my fingers that there would be some optimization for that case, but I didn't exactly do much performance testing. It would probably be much better to use an array there, and then loop over it when writing the actual file list to the stamp file. Better O(2n) then O(n2).

That said, I wonder how much effort it would be to just rewrite sage-spkg in Python... For starters though I'll see how much improvement can be made with what we already have.

Last edited 3 years ago by embray (previous) (diff)

comment:3 Changed 3 years ago by embray

  • Owner changed from (none) to embray

comment:4 in reply to: ↑ 2 Changed 3 years ago by jdemeyer

Replying to embray:

That said, I wonder how much effort it would be to just rewrite sage-spkg in Python...

That wouldn't be that hard, but it would be a boring job.

Maybe start with factoring out just the copy-from-DESTDIR stuff? That way, one could fix #26011 and #26018 at the same time.

comment:5 Changed 3 years ago by embray

For my own records, I tried this with just the existing database_pari package and got

real    2m40.827s
user    0m46.358s
sys     0m31.482s

(the majority of which, as you note, was spent in the "Copying files" phase).

I'm going to see what I can do to improve this a little bit without having to rewrite too much (yet).

comment:6 Changed 3 years ago by embray

  • Authors set to Erik Bray
  • Branch set to u/embray/ticket-26011
  • Commit set to 54d73ac929f32c1c47443f08c7b835cb0daae46c
  • Status changed from new to needs_review

This brought the time for database_pari down to:

real    0m32.636s
user    0m24.156s
sys     0m3.019s

(still with the majority spent in copying). It should also fix #26013 at the same time (conflicting with the branch currently there).

I am concerned about the possibility of race conditions with cp, and the sed call is cryptic even for my liking but I still plan to replace more of this with Python.


New commits:

54d73acsimplify and speed up logic for package file copying and file manifest generation

comment:7 Changed 3 years ago by saraedum

  • Status changed from needs_review to needs_work

There are merge conflicts.

comment:8 Changed 3 years ago by git

  • Commit changed from 54d73ac929f32c1c47443f08c7b835cb0daae46c to 09a2926255236ef3210a8c644439fbe58b7ccf07

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

09a2926simplify and speed up logic for package file copying and file manifest generation

comment:9 Changed 3 years ago by embray

I rebased this, but I'm having some other build issues on this branch, and I need to look into what's going on and if it's related to the change.

comment:10 Changed 3 years ago by git

  • Commit changed from 09a2926255236ef3210a8c644439fbe58b7ccf07 to b58e16e85602dac560aec89499672ee5d3c39686

Branch pushed to git repo; I updated commit sha1. New commits:

b58e16eEnsure that we don't add an empty 'file' to the file list for packages that install no files (e.g. when installing 'appnope' on non-OSX

comment:11 Changed 3 years ago by jdemeyer

Using cp instead of mv is also a regression because moving should be a lot faster than copying. Why not use the sed script but keep moving as before?

comment:12 Changed 3 years ago by embray

You're right of course, but the problem is that mv does not handle moving whole directory trees very well. For example, if you try to mv a directory over a directory of the same name that already exists you'll just get an annoying "Directory not empty" error.

If you know a way to efficiently mv a directory tree on top of an existing one I'm all for it; off the top of my head I just don't know how to do that. There is no mv --recursive (perhaps there should be).

Last edited 3 years ago by embray (previous) (diff)

comment:13 Changed 3 years ago by embray

I mean, the code I originally had was essentially a hand-rolled recursive mv, but it was slow for one--in part just due to the use of some shell idioms. Perhaps I could rewrite just that part in Python though then you have the overhead of starting up Python (perl would be better but I'm not sure we want to start introducing perl into the build process :)

But regardless how it's implemented it seems there are also some tricky pitfalls to it, like handling symlinks properly (though if that's the only issue then it's easy enough to handle it).

comment:14 Changed 3 years ago by embray

I put a time in front of the cp call in the current version of this branch and it shows:

[database_pari-20161017] Moving package files from temporary location /home/embray/src/sagemath/sage/local/var/tmp/sage/build/database_pari-20161017/inst to /home/embray/src/sagemath/sage/local
[database_pari-20161017]
[database_pari-20161017] real   0m2.204s
[database_pari-20161017] user   0m0.010s
[database_pari-20161017] sys    0m0.619s

which seems fine. But with the for loop calling mv for each file (and also creating directories and removing symlinks as necessary) I get:

[database_pari-20161017] Moving package files from temporary location /home/embray/src/sagemath/sage/local/var/tmp/sage/build/database_pari-20161017/inst to /home/embray/src/sagemath/sage/local
[database_pari-20161017]
[database_pari-20161017] real   0m17.014s
[database_pari-20161017] user   0m1.847s
[database_pari-20161017] sys    0m11.742s

Of course, if I wrote a dedicated program to do this it would probably be a lot faster. If nothing else just running mv for every file probably adds enormous overhead when there are a lot of files; at least that's my guess.

comment:15 Changed 3 years ago by embray

I tried dropping this Python script into sage-spkg to do the moves:

    time cat <<_EOF_ | python - "$PREFIX" "$SAGE_LOCAL"
import os
import sys

src, dst = sys.argv[1:]

for dirpath, dirnames, filenames in os.walk(src):
    dst_path = os.path.join(dst, os.path.relpath(dirpath, src))
    for dirname in dirnames:
        dst_dirname = os.path.join(dst_path, dirname)
        if not os.path.exists(dst_dirname):
            os.makedirs(dst_dirname)

    for filename in filenames:
        src_filename = os.path.join(dirpath, filename)
        dst_filename = os.path.join(dst_path, filename)

        if os.path.islink(dst_filename):
            os.remove(dst_filename)

        os.rename(src_filename, dst_filename)
_EOF_

and got

[database_pari-20161017] real   0m3.960s
[database_pari-20161017] user   0m0.463s
[database_pari-20161017] sys    0m0.312s

so still slower than the single cp call, but much faster than the shell loop. I think in the case of database_pari a lot of the overhead is just in the large number of files (over 8000) than in overall size of the files. Although in this case the total size of the files is still quite large (> 250 MB) most of the individual files are small, and moving has the same or nearly the same cost as copying. Whereas, if we had a package with a single very large file moving would definitely be faster.

One almost certain advantage to using mv/rename is that it's supposed to be atomic, whereas cp is not. This may have some implications for parallel builds but I'm not sure.

comment:16 Changed 3 years ago by embray

Using python -S also speeds things up a bit, but not dramatically.

comment:17 Changed 3 years ago by embray

  • Status changed from needs_work to needs_review

Jeroen, do you know off the top of your head a better solution for mv-ing an entire directory tree?

Otherwise, I'm not sure there's a huge advantage to it over just using cp -r, except in the case of very large files.

There's also the possibility of using the above Python snippet, which seems to work well (but is more complicated obviously).

I'm not convinced one way or the other what the best thing to do here is so I'm curious what solution you prefer.

comment:18 Changed 3 years ago by embray

  • Priority changed from minor to blocker

Changing the priority of this ticket since I am proposing it also as a solution to #26153.

Minus some resolution on this ticket I can make a more targeted fix just to #26153, but it seems unnecessary.

comment:19 Changed 3 years ago by dimpase

  • Reviewers set to Dima Pasechnik
  • Status changed from needs_review to positive_review

it looks good.

comment:20 Changed 3 years ago by vbraun

  • Branch changed from u/embray/ticket-26011 to b58e16e85602dac560aec89499672ee5d3c39686
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:21 Changed 3 years ago by slelievre

  • Commit b58e16e85602dac560aec89499672ee5d3c39686 deleted

The solution for problems with cp at #14166 and #14236 was to use tar instead.

Other cp related tickets were #7464 and #16155.

comment:22 Changed 3 years ago by jdemeyer

See #26642 for a blocker follow-up.

Note: See TracTickets for help on using tickets.