Opened 4 years ago
Last modified 4 years ago
#25111 new enhancement
In the built documentation, replace duplicate files by symlinks
Reported by: | jhpalmieri | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | sage-8.4 |
Component: | documentation | Keywords: | |
Cc: | gh-timokau | Merged in: | |
Authors: | | Reviewers: | |
Report Upstream: | N/A | Work issues: | |
Branch: | | Commit: | |
Dependencies: | | Stopgaps: | |
Description
Until Sage 8.2, the _static directories in the generated HTML documentation of the reference manual were symlinks to a single master _static directory. Now all the files are copied, leading to a huge explosion in size of the built documentation (from 1.8GB in Sage 8.2 to about 20GB in Sage 8.3).
Change History (28)
comment:1 Changed 4 years ago by
comment:2 Changed 4 years ago by
Here is some Python code which works for me. Is this the sort of thing you use?
    from filecmp import dircmp
    import os
    import shutil


    def directories_equal(left, right, ignore=None):
        """
        True if and only if the directories ``left`` and ``right`` have
        the same contents, file by file. Ignore any files listed in
        ``ignore``.
        """
        dcmp = dircmp(left, right, ignore=ignore)
        return (not dcmp.left_only and not dcmp.right_only
                and not dcmp.common_funny and not dcmp.funny_files
                and not dcmp.diff_files
                and all(directories_equal(os.path.join(left, a),
                                          os.path.join(right, a),
                                          ignore=ignore)
                        for a in dcmp.common_dirs))


    def replace_duplicates_with_symlinks(source, target):
        """
        INPUTS:

        - ``source``, ``target``: directories.

        If the two directories are identical, replace ``target`` with a
        symlink pointing to ``source``. Otherwise, for each file in
        ``target``, if a copy of it exists in ``source``, replace the
        copy in ``target`` with a symlink pointing to ``source``.
        """
        if directories_equal(source, target, ignore=['pdf.png']):
            if not os.path.islink(target):
                shutil.rmtree(target)
                os.symlink(source, target)
        else:
            # compare file by file, doing the replacement
            dcmp = dircmp(source, target)
            for d in dcmp.common_dirs:
                replace_duplicates_with_symlinks(os.path.join(source, d),
                                                 os.path.join(target, d))
            # same_files rather than common_files: only link files whose
            # contents match, so files that merely share a name are kept
            for f in dcmp.same_files:
                os.remove(os.path.join(target, f))
                os.symlink(os.path.join(source, f), os.path.join(target, f))


    def replace_with_master_directory(top_dir):
        """
        top_dir: top of html doc directory (so typically
        top_dir = local/share/doc/sage/html)
        """
        # top_dir should be absolute: the symlinks store this path verbatim
        master = os.path.join(top_dir, 'en', '_static')
        for lang in os.listdir(top_dir):
            for d in os.listdir(os.path.join(top_dir, lang)):
                target = os.path.join(top_dir, lang, d, '_static')
                if (os.path.isdir(target) and not os.path.islink(target)
                        and not os.path.samefile(master, target)):
                    replace_duplicates_with_symlinks(master, target)
comment:3 Changed 4 years ago by
This saves me almost 400 MB, by the way. ("This" = replace_with_master_directory(os.path.join(SAGE_LOCAL, 'share', 'doc', 'sage', 'html')).)
comment:4 Changed 4 years ago by
No, I don't use Python code, because I do it within the packaging script, in bash:
    # Prune _static folders
    cp -r build_doc/html/en/_static build_doc/html/ || die "failed to copy _static folder"
    for sdir in `find build_doc/html -name _static` ; do
        if [ $sdir != "build_doc/html/_static" ] ; then
            rm -rf $sdir || die "failed to remove $sdir"
            ln -rst ${sdir%_static} build_doc/html/_static
        fi
    done
Because I have the mathjax fonts by default and they are copied into all _static directories, the saving is in GB.
The last touch is replacing most of the mathjax stuff by symlinks in the master _static folder:
    # Linking to local copy of mathjax folders rather than copying them
    local mathjax_folders="config extensions fonts jax localization unpacked"
    for sdir in ${mathjax_folders} ; do
        rm -rf build_doc/html/_static/${sdir} \
            || die "failed to remove mathjax folder $sdir"
        ln -st build_doc/html/_static/ ../../../../mathjax/$sdir
    done
comment:5 Changed 4 years ago by
See possibly related discussion at #25089.
comment:6 Changed 4 years ago by
I'm confused by this ticket, because it already does that, per #25089...
comment:7 Changed 4 years ago by
I see the difference: it does already do this within the en/reference docs, where each "reference" section is treated as a sub-document of the reference "master document", and in that case the _static directories get symlinked up to the master document. My assumption was that all of the Sage docs (including "reference") were in turn treated as sub-documents of a higher-level master document, but apparently that's not the case.
IMO treating the entire tree of Sage docs as such a hierarchy with shared static resources would be the best approach.
comment:8 Changed 4 years ago by
Some parts of the documentation tree have slightly different _static directories, which is where the approach in comment 2 comes from: compare each _static directory to the top-level one, replacing files (and directories) with symlinks when possible.
comment:9 Changed 4 years ago by
Is it worth pursuing this? It could be part of the docbuild process, or it could be done only when you use make to build all of the Sage docs. I'm leaning toward the latter approach. In either case, all of the _static directories will be produced and then cleaned up later, so disk usage will increase during the build process before dropping at the end, although this happens throughout the build process. (I don't know how to deal with the symlinks on the fly. I also don't know if there is a way to tell Sphinx to look mainly in one place for shared static resources. Since documentation in different languages has different _static/translations.js files, we can't rely solely on a single _static folder.)
I also don't know what to do about Windows/cygwin and symbolic links.
comment:10 Changed 4 years ago by
Sphinx has a configuration option html_static_path which might do what we want. I'll look into it.
Edit: or maybe not: the documentation says that the files "are copied to the output’s _static directory after the theme’s static files". We don't want files copied; we want a single _static directory.
comment:11 Changed 4 years ago by
Couldn't a different _static/translations.js be used per language? That is, somehow namespace that file by the language in the first place. That or at least have an alternate location for it. html_static_path can be a list.
comment:12 Changed 4 years ago by
I don't think that html_static_path will help: it provides a list from which the output _static directories are produced; it does not provide a list of directories to use instead of the output _static directories. In fact, we already set html_static_path in src/doc/common/conf.py.
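To make the distinction concrete, here is a hypothetical conf.py fragment. The directory names are placeholders chosen for illustration and are not the values set in src/doc/common/conf.py. Every entry names a source directory whose contents Sphinx copies into the output _static directory, so the option controls what goes into each copy, not where the copy lives.

```python
import os

# Hypothetical conf.py fragment: the directory names below are
# placeholders, not the values used by the Sage documentation.
language = "fr"

# Paths are taken relative to the configuration directory. The files in
# each listed directory are copied into the output's _static directory.
html_static_path = [
    "static",                          # files shared by all documents
    os.path.join("static", language),  # assumed per-language extras
]
```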
comment:13 Changed 4 years ago by
- Description modified (diff)
- Milestone changed from sage-8.2 to sage-8.4
- Priority changed from major to critical
comment:14 follow-up: ↓ 18 Changed 4 years ago by
- Description modified (diff)
Too bad that I missed this ticket earlier. This should have been an 8.3 blocker.
comment:15 follow-up: ↓ 16 Changed 4 years ago by
As a workaround to #25089, for the Windows build I run a script that deletes all the duplicate _static directories and instead modifies links in the HTML to reference a single _static directory. My script just runs over the docs after they've been built, but it would probably be better to figure out how to do this directly in the Sphinx build.
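The script itself is not shown in this ticket. As a rough idea of what such a post-processing step could look like, the sketch below rewrites relative _static references in href and src attributes so that they point at one chosen master directory. The names rewrite_static_links and rewrite_tree, and the regular expression, are assumptions made for this example. It is only safe when the duplicate directories really have the same contents, and it does not touch references built in JavaScript.

```python
import os
import re

# Start of an href/src value that points into a _static directory,
# e.g. href="_static/ or src="../../_static/
STATIC_REF = re.compile(r'((?:href|src)=")((?:\.\./)*_static)/')


def rewrite_static_links(html, html_dir, master):
    """Return ``html`` with relative ``_static`` references aimed at ``master``.

    ``html_dir`` is the directory containing the HTML file.
    """
    rel = os.path.relpath(master, html_dir).replace(os.sep, "/")
    return STATIC_REF.sub(lambda m: m.group(1) + rel + "/", html)


def rewrite_tree(top_dir, master):
    """Rewrite every .html file under ``top_dir`` in place.

    Returns the number of files that were changed.
    """
    changed = 0
    for dirpath, dirnames, filenames in os.walk(top_dir):
        for name in filenames:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                old = f.read()
            new = rewrite_static_links(old, dirpath, master)
            if new != old:
                with open(path, "w", encoding="utf-8") as f:
                    f.write(new)
                changed += 1
    return changed
```

Once the links are rewritten, the duplicate _static directories can be deleted without leaving dangling references, and no symlinks are needed, which matters on Windows.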
comment:16 in reply to: ↑ 15 Changed 4 years ago by
Replying to embray:
As a workaround to #25089, for the Windows build I run a script that deletes all the duplicate _static directories and instead modifies links in the HTML to reference a single _static directory. My script just runs over the docs after they've been built, but it would probably be better to figure out how to do this directly in the Sphinx build.
I do exactly that in sage-on-gentoo as well. Would be great to know how to tell sphinx to create symlinks instead of copying.
comment:17 Changed 4 years ago by
- Description modified (diff)
comment:18 in reply to: ↑ 14 ; follow-ups: ↓ 22 ↓ 24 Changed 4 years ago by
- Description modified (diff)
Replying to jdemeyer:
Too bad that I missed this ticket earlier. This should have been an 8.3 blocker.
By the way, that 20GB figure is not really how much disk space is being taken up. There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax). This inexplicably contains a symlink to itself as $SAGE_LOCAL/share/mathjax/mathjax. In the _static directories, however, this symlink is being dereferenced and converted to a hard link as well, so you end up with an infinite loop of hardlinks, which tools like du don't handle well when counting (the size it's reporting is probably just being limited by some max depth parameter).
If I delete all those nonsense mathjax hardlinks I then get:
    $ du -sh local/share/doc/sage/html/
    779M local/share/doc/sage/html/
and
    $ du -sh local/share/doc/sage/
    2.0G local/share/doc/sage/
so I think it's not all as bad as it seems.
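As a rough cross-check of figures like these, one could sum file sizes in Python while counting each inode only once and never following symlinks. The sketch below is an illustration under those assumptions, not a replacement for du: the name disk_usage is made up, and it adds up apparent file sizes (st_size) rather than allocated blocks, so its totals will differ somewhat from what du reports.

```python
import os


def disk_usage(top):
    """Sum apparent file sizes under ``top``, counting each inode once.

    Directory symlinks are not followed, so a link that points back into
    the tree cannot cause an endless walk.
    """
    seen = set()
    total = 0
    for dirpath, dirnames, filenames in os.walk(top, followlinks=False):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another hard link to a file already counted
            seen.add(key)
            total += st.st_size
    return total
```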
comment:19 Changed 4 years ago by
- Cc gh-timokau added
comment:20 Changed 4 years ago by
I see now your sage-devel post where you reported the same.
comment:21 Changed 4 years ago by
It seems to be more subtle: sometimes the symlinks are correctly generated and sometimes not.
comment:22 in reply to: ↑ 18 Changed 4 years ago by
Replying to embray:
If I delete all those nonsense mathjax hardlinks I then get:
    $ du -sh local/share/doc/sage/html/
    779M local/share/doc/sage/html/
By the way, after using the script in comment:2, I get
    $ du -s -h local/share/doc/sage/html/
    315M local/share/doc/sage/html/
There are lots of symlinks, though.
comment:23 follow-up: ↓ 27 Changed 4 years ago by
I'm getting really confused here. Initially I thought that the problem was the _static directories in the reference manual were no longer symlinked, but that's not the problem.
The problem seems to be the few copies of _static for the various documents (one for each document). This was never a problem before, as long as _static remained small. But because of the mathjax issue, every _static directory contains a million copies of mathjax.
comment:24 in reply to: ↑ 18 ; follow-up: ↓ 26 Changed 4 years ago by
Replying to embray:
There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax).
I'm not sure why you think that it's a hard link. On my system (and probably most Unix-like systems), creating a hardlink to a directory is not even allowed. See https://askubuntu.com/questions/210741/why-are-hard-links-not-allowed-for-directories/525129
comment:25 Changed 4 years ago by
I created a new ticket #26152 specifically for the mathjax symlink issue.
comment:26 in reply to: ↑ 24 Changed 4 years ago by
Replying to jdemeyer:
Replying to embray:
There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax).
I'm not sure why you think that it's a hard link. On my system (and probably most Unix-like systems), creating a hardlink to a directory is not even allowed. See https://askubuntu.com/questions/210741/why-are-hard-links-not-allowed-for-directories/525129
To clarify: The directory is not a hard link but the files under it are, and there were deeply nested directories (I didn't confirm how deep) each containing what I presume were hard links to the files (since deleting them did not actually release much usage of my disk).
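One way to confirm a presumption like this is to group regular files by device and inode number: paths in the same group are hard links to one file. The following is a minimal sketch with a made-up function name, hardlink_groups. It only finds links that live under the directory being scanned.

```python
import os
import stat
from collections import defaultdict


def hardlink_groups(top):
    """Return sorted lists of paths under ``top`` that share an inode."""
    groups = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # st_nlink > 1 means the file has more than one directory entry.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                groups[(st.st_dev, st.st_ino)].append(path)
    return sorted(sorted(paths) for paths in groups.values()
                  if len(paths) > 1)
```

Deleting all but one path in a group frees almost no space, which matches the observation above.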
comment:27 in reply to: ↑ 23 Changed 4 years ago by
Replying to jdemeyer:
I'm getting really confused here. Initially I thought that the problem was the _static directories in the reference manual were no longer symlinked, but that's not the problem.
The problem seems to be the few copies of _static for the various documents (one for each document). This was never a problem before, as long as _static remained small. But because of the mathjax issue, every _static directory contains a million copies of mathjax.
And as noted above, the _static directories for the different documents can actually differ, so we can't (I think) have a single one. But we can symlink the mathjax parts in each one. They take up the bulk of the disk space.
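A minimal sketch of that idea, assuming each _static directory holds its MathJax copy in a subdirectory named mathjax (as described in comment 18) and that the shared copy lives outside the documentation tree: the function name link_mathjax and its parameters are made up for this illustration, and the links are relative so that the tree stays relocatable.

```python
import os
import shutil


def link_mathjax(top_dir, mathjax_dir, name="mathjax"):
    """Replace each ``_static/<name>`` directory under ``top_dir`` with a
    relative symlink to ``mathjax_dir``; return the replaced paths."""
    replaced = []
    shared = os.path.realpath(mathjax_dir)
    for dirpath, dirnames, filenames in os.walk(top_dir):
        if os.path.basename(dirpath) != "_static" or name not in dirnames:
            continue
        dirnames.remove(name)  # never descend into the copy itself
        target = os.path.join(dirpath, name)
        if os.path.islink(target):
            continue  # already a symlink, nothing to do
        shutil.rmtree(target)
        # The link target is resolved relative to the link's directory.
        rel = os.path.relpath(shared, os.path.realpath(dirpath))
        os.symlink(rel, target)
        replaced.append(target)
    return sorted(replaced)
```

Unlike replacing whole _static directories, this leaves any per-document differences in place and only deduplicates the MathJax files.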
comment:28 Changed 4 years ago by
In a recently built copy of the Sage documentation, there seem to be 27 copies of MathJax installed in various _static directories, which translates into about 430 MB of disk space on my computer.
I actually do this in sage-on-gentoo, but I do it at the packaging level rather than the building level.