Opened 4 years ago

Last modified 3 years ago

#25111 new enhancement

In the built documentation, replace duplicate files by symlinks

Reported by: jhpalmieri
Owned by:
Priority: critical
Milestone: sage-8.4
Component: documentation
Keywords:
Cc: gh-timokau
Merged in:
Authors:
Reviewers:
Report Upstream: N/A
Work issues:
Branch:
Commit:
Dependencies:
Stopgaps:

Description (last modified by embray)

Until Sage 8.2, the _static directories in the generated HTML documentation of the reference manual were symlinks to a single master _static directory. Now all the files are copied, leading to a huge explosion in size of the built documentation (from 1.8GB in Sage 8.2 to about 20GB in Sage 8.3).

Change History (28)

comment:1 Changed 4 years ago by fbissey

I actually do this in sage-on-gentoo, but I do it at the packaging level rather than the building level.

comment:2 Changed 4 years ago by jhpalmieri

Here is some Python code which works for me. Is this the sort of thing you use?

from filecmp import dircmp
import os, shutil

def directories_equal(left, right, ignore=None):
    """
    True if and only if the directories ``left`` and ``right`` have
    the same contents, file by file. Ignore any files listed in
    ``ignore``.
    """
    dcmp = dircmp(left, right, ignore=ignore)
    return (not dcmp.left_only and not dcmp.right_only 
            and not dcmp.common_funny and not dcmp.funny_files
            and not dcmp.diff_files and 
            all(directories_equal(os.path.join(left, a), os.path.join(right, a), ignore=ignore) 
                for a in dcmp.common_dirs))


def replace_duplicates_with_symlinks(source, target):
    """
    INPUTS:

    - ``source``, ``target``: directories.

    If the two directories are identical, replace ``target`` with a
    symlink pointing to ``source``. Otherwise, for each file in
    ``target``, if a copy of it exists in ``source``, replace the copy
    in ``target`` with a symlink pointing to ``source``.  
    """
    if directories_equal(source, target, ignore=['pdf.png']):
        if not os.path.islink(target):
            shutil.rmtree(target)
            os.symlink(source, target)
    else:
        # Compare file by file. Only files that are identical in both
        # directories (dcmp.same_files) are replaced; files that merely
        # share a name but differ, such as translations.js, are kept.
        dcmp = dircmp(source, target)
        for d in dcmp.common_dirs:
            replace_duplicates_with_symlinks(os.path.join(source, d),
                                             os.path.join(target, d))
        for f in dcmp.same_files:
            os.remove(os.path.join(target, f))
            os.symlink(os.path.join(source, f),
                       os.path.join(target, f))
    

def replace_with_master_directory(top_dir):
    """
    top_dir: top of html doc directory (so typically 
    top_dir = local/share/doc/sage/html)
    """
    master = os.path.join(top_dir, 'en', '_static')
    for lang in os.listdir(top_dir):
        lang_dir = os.path.join(top_dir, lang)
        if not os.path.isdir(lang_dir):
            continue  # skip any plain files at the top level
        for d in os.listdir(lang_dir):
            target = os.path.join(lang_dir, d, '_static')
            if (os.path.isdir(target) 
                and not os.path.islink(target) 
                and not os.path.samefile(master, target)):
                replace_duplicates_with_symlinks(master, target)

comment:3 Changed 4 years ago by jhpalmieri

This saves me almost 400 MB, by the way. ("This" = replace_with_master_directory(os.path.join(SAGE_LOCAL, 'share', 'doc', 'sage', 'html')).)

Last edited 4 years ago by jhpalmieri

comment:4 Changed 4 years ago by fbissey

No, I don't use Python code, because I do it within the packaging script in bash:

			# Prune _static folders
			cp -r build_doc/html/en/_static build_doc/html/ || die "failed to copy _static folder"
			for sdir in $(find build_doc/html -name _static) ; do
				if [ "$sdir" != "build_doc/html/_static" ] ; then
					rm -rf "$sdir" || die "failed to remove $sdir"
					ln -rst "${sdir%_static}" build_doc/html/_static
				fi
			done

Because I have the mathjax fonts by default and they are copied into every _static directory, the savings are in the GB range.

The last touch is replacing most of the mathjax folders in the master _static folder with symlinks:

			# Linking to local copy of mathjax folders rather than copying them
			local mathjax_folders="config extensions fonts jax localization unpacked"
			for sdir in ${mathjax_folders} ; do
				rm -rf build_doc/html/_static/${sdir} \
					|| die "failed to remove mathjax folder $sdir"
				ln -st build_doc/html/_static/ ../../../../mathjax/$sdir
			done
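
For comparison with the Python code in comment 2, the first bash loop above can be sketched in Python roughly as follows. This is only a sketch: the function name link_static_dirs and its master_rel parameter are made up for illustration, and like the bash version it removes every other _static directory without checking whether its contents differ from the shared copy.

```python
import os
import shutil


def link_static_dirs(html_root, master_rel=os.path.join("en", "_static")):
    """
    Replace every ``_static`` directory below ``html_root`` by a relative
    symlink to ``html_root/_static``. The shared copy is created from
    ``html_root/master_rel`` if it does not exist yet.
    Hypothetical helper: contents are NOT compared before deletion.
    """
    shared = os.path.join(html_root, "_static")
    if not os.path.isdir(shared):
        shutil.copytree(os.path.join(html_root, master_rel), shared)
    replaced = []
    for dirpath, dirnames, _filenames in os.walk(html_root):
        if "_static" not in dirnames:
            continue
        dirnames.remove("_static")  # never descend into a _static directory
        candidate = os.path.join(dirpath, "_static")
        if os.path.islink(candidate) or os.path.samefile(candidate, shared):
            continue  # already a symlink, or the shared copy itself
        shutil.rmtree(candidate)
        # Relative link, like "ln -r -s" in the bash version.
        os.symlink(os.path.relpath(shared, dirpath), candidate)
        replaced.append(candidate)
    return replaced
```

Relative links keep the tree relocatable, which is presumably why the bash version passes -r to ln.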

comment:5 Changed 4 years ago by slelievre

See possibly related discussion at #25089.

comment:6 Changed 4 years ago by embray

I'm confused by this ticket, because it already does that, per #25089...

comment:7 Changed 4 years ago by embray

I see the difference--it does already do this within the en/reference docs, where each "reference" section is treated as a sub-document of the reference "master document", and in that case the _static directories get symlinked up to the master document. My assumption was that all of the Sage docs (including "reference") were in turn treated as sub-documents of a higher-level master document but apparently that's not the case.

IMO treating the entire tree of Sage docs as such a hierarchy with shared static resources would be the best approach.

comment:8 Changed 4 years ago by jhpalmieri

Some parts of the documentation tree have slightly different _static directories, which is where the approach in comment 2 comes from: compare each _static directory to the top-level one, replacing files (and directories) with symlinks when possible.

Last edited 4 years ago by jhpalmieri

comment:9 Changed 4 years ago by jhpalmieri

Is it worth pursuing this? It could be part of the docbuild process, or it could be done only when you use make to build all of the Sage docs. I'm leaning toward the latter approach. In either case, all of the _static directories will be produced first and cleaned up afterwards, so disk usage will increase during the build before dropping at the end, although disk usage rises throughout the build process anyway. (I don't know how to deal with the symlinks on the fly. I also don't know if there is a way to tell Sphinx to look mainly in one place for shared static resources. Since the documentation in different languages has different _static/translations.js files, we can't rely solely on a single _static folder.)

I also don't know what to do about Windows/cygwin and symbolic links.
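
To see exactly which files keep a given _static directory from being identical to the master one (translations.js being one example), something like the following sketch could be used. The function name static_differences is made up here. Note that filecmp.dircmp compares files shallowly by default, so two files with the same type, size and modification time are treated as equal without their contents being read.

```python
import os
from filecmp import dircmp


def static_differences(master, other):
    """
    Return the sorted relative paths of everything that differs between
    the directories ``master`` and ``other``: files with different
    contents, entries present on one side only, and entries that could
    not be compared. (Hypothetical helper, for inspection only.)
    """
    found = []

    def walk(dcmp, prefix):
        names = (dcmp.diff_files + dcmp.left_only + dcmp.right_only
                 + dcmp.funny_files + dcmp.common_funny)
        for name in names:
            found.append(os.path.join(prefix, name))
        # dcmp.subdirs maps each common subdirectory to its own dircmp
        for name, sub in dcmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(dircmp(master, other), "")
    return sorted(found)
```

An empty result means the directory could be replaced by a symlink as a whole; a short list suggests symlinking file by file.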

comment:10 Changed 4 years ago by jhpalmieri

Sphinx has a configuration option html_static_path which might do what we want. I'll look into it.

Edit: or maybe not: the documentation says that the files "are copied to the output’s _static directory after the theme’s static files". We don't want files copied, we want a single _static directory.

Last edited 4 years ago by jhpalmieri

comment:11 Changed 4 years ago by embray

Couldn't a different _static/translations.js be used per language? That is, somehow namespace that file by the language in the first place. That or at least have an alternate location for it. html_static_path can be a list.

comment:12 Changed 4 years ago by jhpalmieri

I don't think that html_static_path will help: it provides a list from which the output _static directories are produced – it does not provide a list of directories to use instead of the output _static directories. In fact, we already set html_static_path in src/doc/common/conf.py.
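
To illustrate the point: html_static_path is a list of source directories, given relative to the configuration directory, whose contents Sphinx copies into each output _static directory. A fragment like the one below therefore still yields one full copy per built document. The two directory names are hypothetical and are not the values used in src/doc/common/conf.py.

```python
# conf.py fragment (sketch only; both directory names are hypothetical).
# Every entry is a *source* of static files: their contents are copied
# into the _static directory of each document that is built.
html_static_path = ["static_common", "static_en"]
```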

comment:13 Changed 3 years ago by jdemeyer

  • Description modified (diff)
  • Milestone changed from sage-8.2 to sage-8.4
  • Priority changed from major to critical

comment:14 follow-up: Changed 3 years ago by jdemeyer

  • Description modified (diff)

Too bad that I missed this ticket earlier. This should have been an 8.3 blocker.

comment:15 follow-up: Changed 3 years ago by embray

As a workaround to #25089, for the Windows build I run a script that deletes all the duplicate _static directories and instead modifies links in the HTML to reference a single _static directory. My script just runs over the docs after they've been built, but it would probably be better to figure out how to do this directly in the Sphinx build.
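
The link-rewriting step described above might look roughly like the following. This is not the actual script, only a minimal sketch of the idea; the function name retarget_static_links is made up, and the regular expression assumes that static resources are referenced only through href="..." or src="..." attributes whose value starts with an optional run of ../ followed by _static/. References built in JavaScript are not handled.

```python
import os
import re

# href="_static/...", src="../_static/...", and so on.
_STATIC_REF = re.compile(r'((?:href|src)=")(?:\.\./)*_static/')


def retarget_static_links(html_text, page_dir, shared_static):
    """
    Return ``html_text`` with every matched ``_static/`` reference
    pointing at ``shared_static`` instead, expressed relative to
    ``page_dir``, the directory containing the HTML page.
    (Hypothetical helper.)
    """
    rel = os.path.relpath(shared_static, page_dir).replace(os.sep, "/")
    # A function is used as the replacement so that ``rel`` is inserted
    # literally, without any backslash processing.
    return _STATIC_REF.sub(lambda m: m.group(1) + rel + "/", html_text)
```

After every page has been rewritten this way, the per-document _static directories are no longer referenced and can be deleted, with no symlinks needed.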

comment:16 in reply to: ↑ 15 Changed 3 years ago by fbissey

Replying to embray:

> As a workaround to #25089, for the Windows build I run a script that deletes all the duplicate _static directories and instead modifies links in the HTML to reference a single _static directory. My script just runs over the docs after they've been built, but it would probably be better to figure out how to do this directly in the Sphinx build.

I do exactly that in sage-on-gentoo as well. It would be great to know how to tell Sphinx to create symlinks instead of copying.

comment:17 Changed 3 years ago by jdemeyer

  • Description modified (diff)

comment:18 in reply to: ↑ 14 ; follow-ups: Changed 3 years ago by embray

  • Description modified (diff)

Replying to jdemeyer:

> Too bad that I missed this ticket earlier. This should have been an 8.3 blocker.

By the way, that 20GB figure is not really how much disk space is being taken up. There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax). This inexplicably contains a symlink to itself as $SAGE_LOCAL/share/mathjax/mathjax. In the _static directories, however, this symlink is being dereferenced and converted to a hard link as well, so you end up with an infinite loop of hardlinks, which tools like du don't handle well when counting (the size it's reporting is probably just being limited by some max depth parameter).

If I delete all those nonsense mathjax hardlinks I then get:

$ du -sh local/share/doc/sage/html/
779M    local/share/doc/sage/html/

and

$ du -sh local/share/doc/sage/
2.0G    local/share/doc/sage/

so I think it's not all as bad as it seems.
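
One way to get a figure that is not inflated by hard links or symlinks is to count each inode only once. The sketch below does that; the function name disk_usage is made up, and it sums apparent file sizes (st_size) rather than allocated blocks, so the result will not match du exactly.

```python
import os
import stat


def disk_usage(top):
    """
    Sum the sizes of the regular files below ``top``, counting each
    (device, inode) pair once and never following symlinks.
    (Hypothetical helper; reports apparent size, not allocated blocks.)
    """
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(top, followlinks=False):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # symlinks and special files are not counted
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another hard link to a file already counted
            seen.add(key)
            total += st.st_size
    return total
```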

Last edited 3 years ago by embray

comment:19 Changed 3 years ago by gh-timokau

  • Cc gh-timokau added

comment:20 Changed 3 years ago by embray

I see now your sage-devel post where you reported the same.

comment:21 Changed 3 years ago by jdemeyer

It seems to be more subtle: sometimes the symlinks are correctly generated and sometimes not.

comment:22 in reply to: ↑ 18 Changed 3 years ago by jhpalmieri

Replying to embray:

> If I delete all those nonsense mathjax hardlinks I then get:
>
> $ du -sh local/share/doc/sage/html/
> 779M    local/share/doc/sage/html/

By the way, after using the script in comment:2, I get

$ du -s -h local/share/doc/sage/html/
315M	local/share/doc/sage/html/

There are lots of symlinks, though.

comment:23 follow-up: Changed 3 years ago by jdemeyer

I'm getting really confused here. Initially I thought that the problem was the _static directories in the reference manual were no longer symlinked, but that's not the problem.

The problem seems to be the few copies of _static for the various documents (one for each document). This was never a problem before, as long as _static remained small. But because of the mathjax issue, every _static directory contains a million copies of mathjax.

Last edited 3 years ago by jdemeyer

comment:24 in reply to: ↑ 18 ; follow-up: Changed 3 years ago by jdemeyer

Replying to embray:

> There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax).

I'm not sure why you think that it's a hard link. On my system (and probably most Unix-like systems), creating a hardlink to a directory is not even allowed. See https://askubuntu.com/questions/210741/why-are-hard-links-not-allowed-for-directories/525129

comment:25 Changed 3 years ago by jdemeyer

I created a new ticket #26152 specifically for the mathjax symlink issue.

comment:26 in reply to: ↑ 24 Changed 3 years ago by embray

Replying to jdemeyer:

> Replying to embray:
>
> > There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax).
>
> I'm not sure why you think that it's a hard link. On my system (and probably most Unix-like systems), creating a hardlink to a directory is not even allowed. See https://askubuntu.com/questions/210741/why-are-hard-links-not-allowed-for-directories/525129

To clarify: The directory is not a hard link but the files under it are, and there were deeply nested directories (I didn't confirm how deep) each containing what I presume were hard links to the files (since deleting them did not actually release much usage of my disk).
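
Whether the files really are hard links can be checked directly instead of being inferred from disk usage: a regular file whose link count (st_nlink) is greater than 1 has at least one other name somewhere on the same filesystem. A minimal sketch, with a made-up function name:

```python
import os
import stat


def hard_linked_files(top):
    """
    Return a sorted list of ``(link_count, paths)`` pairs, one per inode
    below ``top`` that has more than one hard link. ``paths`` lists the
    names found below ``top``; if it is shorter than ``link_count``, the
    remaining links live outside ``top``. (Hypothetical helper.)
    """
    found = {}
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                entry = found.setdefault((st.st_dev, st.st_ino),
                                         (st.st_nlink, []))
                entry[1].append(path)
    return sorted((nlink, sorted(paths)) for nlink, paths in found.values())
```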

comment:27 in reply to: ↑ 23 Changed 3 years ago by jhpalmieri

Replying to jdemeyer:

> I'm getting really confused here. Initially I thought that the problem was the _static directories in the reference manual were no longer symlinked, but that's not the problem.
>
> The problem seems to be the few copies of _static for the various documents (one for each document). This was never a problem before, as long as _static remained small. But because of the mathjax issue, every _static directory contains a million copies of mathjax.

And as noted above, the _static directories for the different documents can actually differ, so we can't (I think) have a single one. But we can symlink the mathjax parts in each one. They take up the bulk of the disk space.

comment:28 Changed 3 years ago by jhpalmieri

In a recently built copy of the Sage documentation, there seem to be 27 copies of MathJax installed in various _static directories, which translates into about 430 MB of disk space on my computer.
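
A count like this can be reproduced with a short script. The sketch below is not how the figure above was obtained; it assumes that each real copy of MathJax can be recognised by a regular file named MathJax.js, and the function name mathjax_copies is made up. Directories reached only through a symlink are not entered, so they are not counted as copies.

```python
import os


def mathjax_copies(top, marker="MathJax.js"):
    """
    Return the sorted list of directories below ``top`` that contain a
    real (non-symlink) file named ``marker``. os.walk does not enter
    symlinked directories by default, so linked copies are skipped.
    (Hypothetical helper; the marker file name is an assumption.)
    """
    copies = []
    for dirpath, _dirnames, filenames in os.walk(top):
        if marker in filenames and not os.path.islink(
                os.path.join(dirpath, marker)):
            copies.append(dirpath)
    return sorted(copies)
```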

Last edited 3 years ago by jhpalmieri