Opened 5 years ago

# In the built documentation, replace duplicate files by symlinks

Reported by: Owned by: John Palmieri critical sage-8.4 documentation Timo Kaufmann N/A

Until Sage 8.2, the _static directories in the generated HTML documentation of the reference manual were symlinks to a single master _static directory. Now all the files are copied, leading to a huge explosion in size of the built documentation (from 1.8GB in Sage 8.2 to about 20GB in Sage 8.3).

### comment:1 Changed 5 years ago by François Bissey

I actually do this in sage-on-gentoo, but I do it at the packaging level rather than the building level.

### comment:2 Changed 5 years ago by John Palmieri

Here is some Python code which works for me. Is this the sort of thing you use?

    from filecmp import dircmp
    import os
    import shutil

    def directories_equal(left, right, ignore=None):
        """
        True if and only if the directories left and right have
        the same contents, file by file. Ignore any files listed in
        ignore.
        """
        dcmp = dircmp(left, right, ignore=ignore)
        return (not dcmp.left_only and not dcmp.right_only
                and not dcmp.common_funny and not dcmp.funny_files
                and not dcmp.diff_files and
                all(directories_equal(os.path.join(left, a), os.path.join(right, a), ignore=ignore)
                    for a in dcmp.common_dirs))

    # The "def" line and the symlink calls are missing from the code as
    # posted. The name do_symlinks is an assumption; the os.symlink calls
    # are reconstructed from the docstring.
    def do_symlinks(source, target):
        """
        INPUTS:

        - source, target: directories.

        If the two directories are identical, replace target with a
        symlink pointing to source. Otherwise, for each file in
        target, if a copy of it exists in source, replace the copy
        in target with a symlink pointing to source.
        """
        if directories_equal(source, target, ignore=['pdf.png']):
            shutil.rmtree(target)
            os.symlink(source, target)
        else:
            # compare file by file, doing the replacement
            dcmp = dircmp(source, target)
            for d in dcmp.common_dirs:
                do_symlinks(os.path.join(source, d),
                            os.path.join(target, d))
            # same_files rather than common_files, so that files which
            # differ from the master copy are left alone
            for f in dcmp.same_files:
                os.remove(os.path.join(target, f))
                os.symlink(os.path.join(source, f),
                           os.path.join(target, f))

    def replace_with_master_directory(top_dir):
        """
        top_dir: top of html doc directory (so typically
        top_dir = local/share/doc/sage/html)
        """
        master = os.path.join(top_dir, 'en', '_static')
        for lang in os.listdir(top_dir):
            for d in os.listdir(os.path.join(top_dir, lang)):
                target = os.path.join(top_dir, lang, d, '_static')
                if (os.path.isdir(target)
                        and not os.path.samefile(master, target)):
                    do_symlinks(master, target)

### comment:3 Changed 5 years ago by John Palmieri

This saves me almost 400 MB, by the way. ("This" = replace_with_master_directory(os.path.join(SAGE_LOCAL, 'share', 'doc', 'sage', 'html')).)

Last edited 5 years ago by John Palmieri

### comment:4 Changed 5 years ago by François Bissey

No, I don't use Python code, because I do it within the packaging script in bash:

    # Prune _static folders
    cp -r build_doc/html/en/_static build_doc/html/ || die "failed to copy _static folder"
    for sdir in $(find build_doc/html -name _static) ; do
        if [ $sdir != "build_doc/html/_static" ] ; then
            rm -rf $sdir || die "failed to remove $sdir"
            ln -rst ${sdir%_static} build_doc/html/_static
        fi
    done

Because I have the mathjax fonts by default and they are copied into all _static directories, the saving is in GB.

The last touch is replacing most of the mathjax stuff with symlinks in the master _static folder:

    # Linking to local copy of mathjax folders rather than copying them
    local mathjax_folders="config extensions fonts jax localization unpacked"
    for sdir in ${mathjax_folders} ; do
        rm -rf build_doc/html/_static/${sdir} \
            || die "failed to remove mathjax folder $sdir"
        ln -st build_doc/html/_static/ ../../../../mathjax/$sdir
    done

### comment:5 Changed 5 years ago by Samuel Lelièvre

See possibly related discussion at #25089.

### comment:6 Changed 5 years ago by Erik Bray

I'm confused by this ticket, because it already does that, per #25089...

### comment:7 Changed 5 years ago by Erik Bray

I see the difference--it does already do this within the en/reference docs, where each "reference" section is treated as a sub-document of the reference "master document", and in that case the _static directories get symlinked up to the master document. My assumption was that all of the Sage docs (including "reference") were in turn treated as sub-documents of a higher-level master document but apparently that's not the case.

IMO treating the entire tree of Sage docs as such a hierarchy with shared static resources would be the best approach.

### comment:8 Changed 5 years ago by John Palmieri

Some parts of the documentation tree have slightly different _static directories, which is where the approach in comment 2 comes from: compare each _static directory to the top-level one, replacing files (and directories) with symlinks when possible.

Last edited 5 years ago by John Palmieri

### comment:9 Changed 5 years ago by John Palmieri

Is it worth pursuing this? It could be part of the docbuild process, or it could be done only when you use make to build all of the Sage docs. I'm leaning toward the latter approach. In either case, all of the _static directories will be produced and then cleaned up later, so disk usage will increase during the build process before dropping at the end, although this happens throughout the build process. (I don't know how to deal with the symlinks on the fly. I also don't know if there is a way to tell Sphinx to look mainly in one place for shared static resources. Since documentation in different languages has different _static/translations.js files, we can't rely solely on a single _static folder.)

I also don't know what to do about Windows/cygwin and symbolic links.
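One possible fallback for platforms where creating symlinks is not permitted, sketched under assumptions: nothing like this was proposed in the ticket, and the helper name link_or_copy is hypothetical. It tries os.symlink first and falls back to a plain copy, so the space saving is lost only where symlinks are unavailable.

```python
import os
import shutil

def link_or_copy(source, target):
    """
    Hypothetical helper: make target a symlink to source if the platform
    allows it, otherwise copy source to target.

    source is assumed to be an absolute path and target must not exist.
    Returns True if a symlink was created, False if a copy was made.
    """
    is_dir = os.path.isdir(source)
    try:
        # target_is_directory only matters on Windows; it is ignored elsewhere.
        os.symlink(source, target, target_is_directory=is_dir)
        return True
    except (OSError, NotImplementedError):
        # For example, Windows without the privilege to create symlinks.
        if is_dir:
            shutil.copytree(source, target)
        else:
            shutil.copy2(source, target)
        return False
```

Whether Cygwin needs the fallback at all depends on how its symlink emulation is configured, which this sketch does not try to detect.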

### comment:10 Changed 5 years ago by John Palmieri

Sphinx has a configuration option html_static_path which might do what we want. I'll look into it.

Edit: or maybe not: the documentation says that the files "are copied to the output’s _static directory after the theme’s static files". We don't want files copied, we want a single _static directory.

Last edited 5 years ago by John Palmieri

### comment:11 Changed 5 years ago by Erik Bray

Couldn't a different _static/translations.js be used per language? That is, somehow namespace that file by the language in the first place. That or at least have an alternate location for it. html_static_path can be a list.

### comment:12 Changed 5 years ago by John Palmieri

I don't think that html_static_path will help: it provides a list from which the output _static directories are produced – it does not provide a list of directories to use instead of the output _static directories. In fact, we already set html_static_path in src/doc/common/conf.py.

### comment:13 Changed 4 years ago by Jeroen Demeyer

Description: modified. Milestone: sage-8.2 → sage-8.4. Priority: major → critical.

### comment:14 follow-up:  18 Changed 4 years ago by Jeroen Demeyer

Description: modified (diff)

Too bad that I missed this ticket earlier. This should have been an 8.3 blocker.

### comment:15 follow-up:  16 Changed 4 years ago by Erik Bray

As a workaround to #25089, for the Windows build I run a script that deletes all the duplicate _static directories and instead modifies links in the HTML to reference a single _static directory. My script just runs over the docs after they've been built, but it would probably be better to figure out how to do this directly in the Sphinx build.
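A rough sketch of what such a post-build pass could look like. This is not the actual script: the function name share_static, the regular expression, and the assumption that the single shared directory lives at top_dir/_static are all hypothetical. It also discards any per-document differences between _static directories, such as translations.js.

```python
import os
import re
import shutil

# Matches href="..." / src="..." attributes that point into some _static
# directory through zero or more "../" components (assumed link shape).
STATIC_REF = re.compile(r'''((?:href|src)=["'])(?:\.\./)*_static/''')

def share_static(top_dir):
    """
    Delete every _static directory under top_dir except top_dir/_static,
    and rewrite HTML references to point at that single directory.
    """
    master = os.path.join(top_dir, "_static")
    for root, dirs, files in os.walk(top_dir):
        if "_static" in dirs:
            dirs.remove("_static")  # never descend into a _static directory
            candidate = os.path.join(root, "_static")
            if os.path.abspath(candidate) != os.path.abspath(master):
                if os.path.islink(candidate):
                    os.remove(candidate)  # rmtree refuses to act on symlinks
                else:
                    shutil.rmtree(candidate)
        # Relative URL from this directory to the shared _static directory.
        rel = os.path.relpath(master, root).replace(os.sep, "/")
        for name in files:
            if not name.endswith(".html"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8") as f:
                text = f.read()
            new_text = STATIC_REF.sub(lambda m: m.group(1) + rel + "/", text)
            if new_text != text:
                with open(path, "w", encoding="utf-8") as f:
                    f.write(new_text)
```

Because it rewrites links instead of creating symlinks, this approach also works on filesystems without symlink support.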

### comment:16 in reply to:  15 Changed 4 years ago by François Bissey

As a workaround to #25089, for the Windows build I run a script that deletes all the duplicate _static directories and instead modifies links in the HTML to reference a single _static directory. My script just runs over the docs after they've been built, but it would probably be better to figure out how to do this directly in the Sphinx build.

I do exactly that in sage-on-gentoo as well. Would be great to know how to tell sphinx to create symlinks instead of copying.

### comment:17 Changed 4 years ago by Jeroen Demeyer

Description: modified (diff)

### comment:18 in reply to:  14 ; follow-ups:  22  24 Changed 4 years ago by Erik Bray

Description: modified (diff)

Too bad that I missed this ticket earlier. This should have been an 8.3 blocker.

By the way, that 20GB figure is not really how much disk space is being taken up. There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax). This inexplicably contains a symlink to itself as $SAGE_LOCAL/share/mathjax/mathjax. In the _static directories, however, this symlink is being dereferenced and converted to a hard link as well, so you end up with an infinite loop of hardlinks, which tools like du don't handle well when counting (the size it's reporting is probably just being limited by some max depth parameter).

If I delete all those nonsense mathjax hardlinks I then get:

    $ du -sh local/share/doc/sage/html/
    779M    local/share/doc/sage/html/

and

    $ du -sh local/share/doc/sage/
    2.0G    local/share/doc/sage/

so I think it's not all as bad as it seems.
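To cross-check a figure like this independently of du, one could sum file sizes while counting each inode only once and never following symlinks. The following is a minimal sketch; the function name disk_usage is made up for illustration, and it reports apparent file sizes, not allocated blocks, so it will not match du exactly.

```python
import os
import stat

def disk_usage(top):
    """
    Total apparent size in bytes of the regular files under top.
    Files that share an inode (hard links) are counted once, and
    symlinks are neither followed nor counted.
    """
    seen = set()
    total = 0
    for root, dirs, files in os.walk(top, followlinks=False):
        for name in files:
            st = os.lstat(os.path.join(root, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and other special files
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another hard link to a file already counted
            seen.add(key)
            total += st.st_size
    return total
```

For example, disk_usage('local/share/doc/sage/html') returns a byte count in which mathjax files that are hard-linked many times contribute only once.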

Last edited 4 years ago by Erik Bray

### comment:20 Changed 4 years ago by Erik Bray

I see now your sage-devel post where you reported the same.

### comment:21 Changed 4 years ago by Jeroen Demeyer

It seems to be more subtle: sometimes the symlinks are correctly generated and sometimes not.

### comment:22 in reply to:  18 Changed 4 years ago by John Palmieri

If I delete all those nonsense mathjax hardlinks I then get:

    $ du -sh local/share/doc/sage/html/
    779M    local/share/doc/sage/html/

By the way, after using the script in comment:2, I get

    $ du -s -h local/share/doc/sage/html/
    315M    local/share/doc/sage/html/

There are lots of symlinks, though.

### comment:23 follow-up:  27 Changed 4 years ago by Jeroen Demeyer

I'm getting really confused here. Initially I thought that the problem was the _static directories in the reference manual were no longer symlinked, but that's not the problem.

The problem seems to be the few copies of _static for the various documents (one for each document). This was never a problem before, as long as _static remained small. But because of the mathjax issue, every _static directory contains a million copies of mathjax.

Last edited 4 years ago by Jeroen Demeyer

### comment:24 in reply to:  18 ; follow-up:  26 Changed 4 years ago by Jeroen Demeyer

There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax).

I'm not sure why you think that it's a hard link. On my system (and probably most Unix-like systems), creating a hardlink to a directory is not even allowed. See https://askubuntu.com/questions/210741/why-are-hard-links-not-allowed-for-directories/525129

### comment:25 Changed 4 years ago by Jeroen Demeyer

I created a new ticket #26152 specifically for the mathjax symlink issue.

### comment:26 in reply to:  24 Changed 4 years ago by Erik Bray

Replying to jdemeyer:

Replying to embray:

There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax).

I'm not sure why you think that it's a hard link. On my system (and probably most Unix-like systems), creating a hardlink to a directory is not even allowed. See https://askubuntu.com/questions/210741/why-are-hard-links-not-allowed-for-directories/525129

To clarify: The directory is not a hard link but the files under it are, and there were deeply nested directories (I didn't confirm how deep) each containing what I presume were hard links to the files (since deleting them did not actually release much usage of my disk).
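The presumption that these files are hard links can be checked directly from the link count reported by the filesystem. The sketch below uses a hypothetical helper named hardlinked_files; it lists regular files whose inode has more than one name. To confirm that two specific paths are the same file, os.path.samefile(a, b) can be used.

```python
import os
import stat

def hardlinked_files(top):
    """
    Yield (path, link_count) for every regular file under top whose
    inode has more than one hard link. Symlinks are not followed.
    """
    for root, dirs, files in os.walk(top, followlinks=False):
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                yield path, st.st_nlink
```

An empty result for a _static/mathjax tree would mean its files are independent copies and not hard links.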