Opened 9 years ago

Last modified 3 years ago

#10768 needs_info enhancement

Revisit the pickle jar procedure

Reported by: nthiery
Owned by: was
Priority: major
Milestone: sage-wishlist
Component: pickling
Keywords:
Cc: sage-combinat, ohanar
Merged in:
Authors:
Reviewers:
Report Upstream: N/A
Work issues:
Branch:
Commit:
Dependencies:
Stopgaps:

Description (last modified by jdemeyer)

The current pickle jar mechanism has some drawbacks:

  • We never add new pickles to the pickle jar
  • We don't know how old pickles in the pickle jar are
  • We may be testing an old pickle, but not a recent one
  • Updating specific pickles is a bit tedious

Here is a new proposal:

  1. Pickles will no longer be stored in a .tar.bz2 file but simply as files within the directory extcode/pickle_jar/$VERSION. This will likely increase the on-disk space needed for a Sage install, but will not have a big influence on Sage distributions, since we have an extcode spkg anyway (which is tarred and compressed).
  2. Pickles will be under git control (this will now become possible).
  3. The $VERSION in the directory name refers to the Sage version used to create the pickle. Once a pickle has been made, it will remain in place in that directory, even in subsequent Sage versions (so sage-4.7.2 will contain pickle_jar/4.7, pickle_jar/4.7.1 and pickle_jar/4.7.2).
  4. When making a new release, the release manager will unpickle all old pickles and repickle them with the new Sage version. Whenever a pickle has changed, the new (changed) pickle will be stored in pickle_jar/$NEWVERSION. The old pickle is kept where it was.
  5. sage.structure.sage_object.unpickle_all will check all pickles (old and new).
  6. If some day a pickle rots away and it is decided by consensus not to support unpickling it anymore, then the patch author would simply remove the old pickle with git rm.
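
Steps 4 and 5 of the proposal could be scripted roughly as follows. This is only a sketch: it uses Python's standard pickle module rather than Sage's own save/load machinery, and the function name repickle_changed, the *.pickle suffix and the flat per-version directory layout are assumptions made for illustration.

```python
import pickle
from collections import defaultdict
from pathlib import Path


def repickle_changed(jar_root, new_version):
    """Repickle every stored pickle and save only the ones whose bytes changed.

    Returns the names of the files written to jar_root/new_version.
    """
    jar_root = Path(jar_root)
    new_dir = jar_root / new_version
    # Every pickle from every earlier version directory (old ones stay in place).
    old_files = sorted(
        f
        for d in jar_root.iterdir()
        if d.is_dir() and d != new_dir
        for f in d.glob("*.pickle")
    )
    # Byte strings already stored for each pickle name, in any version.
    known = defaultdict(set)
    for f in old_files:
        known[f.name].add(f.read_bytes())

    written = []
    for f in old_files:
        obj = pickle.loads(f.read_bytes())  # fails loudly if an old pickle rots
        new_bytes = pickle.dumps(obj)
        if new_bytes not in known[f.name]:
            new_dir.mkdir(exist_ok=True)
            (new_dir / f.name).write_bytes(new_bytes)
            known[f.name].add(new_bytes)
            written.append(f.name)
    return written
```

Calling repickle_changed("extcode/pickle_jar", "4.7.2") would then leave pickle_jar/4.7 and pickle_jar/4.7.1 untouched and create pickle_jar/4.7.2 only if at least one pickle changed.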

Change History (13)

comment:1 Changed 9 years ago by jdemeyer

While we're at it, why does the pickle jar need to be a tar.bz2 file as opposed to just a directory in data/extcode/pickle_jar? When distributing the pickle jar, it is contained in the extcode spkg anyway, so I don't see the gain of having an additional layer of tarring.

comment:2 follow-up: Changed 9 years ago by jdemeyer

One major advantage of not having the tar file would be that the pickle jar could be updated using standard hg commands. This would instantly solve 2 of the 3 complaints:

  1. Using hg log, we would know exactly how old everything is
  2. Updating specific pickles would become as easy as adding a patch to the Sage library.

comment:3 Changed 9 years ago by jdemeyer

Related ticket: #11069

comment:4 follow-up: Changed 9 years ago by jdemeyer

Nicolas, just to make sure I understand you correctly, is your proposal the following:

  1. Pickle jars are named after the Sage version (i.e. we would have a pickle_jar-4.6.2.tar.bz2 file or a pickle_jar-4.6.2 directory in my proposal).
  2. We always keep the old versions unchanged (so sage-4.7 would still contain pickle_jar-4.6.2).
  3. With every new Sage version, the release manager unpickles pickle_jar-$OLDVERSION, repickles them using the new Sage version and saves them as pickle_jar-$NEWVERSION.

I can see some merit to this proposal, however I would save only the pickles which actually changed. Otherwise you will end up with lots of copies of the same pickle.

comment:5 in reply to: ↑ 2 ; follow-up: Changed 9 years ago by nthiery

Replying to jdemeyer:

One major advantage of not having the tar file would be that the pickle jar could be updated using standard hg commands. This would instantly solve 2 of the 3 complaints:

  1. Using hg log, we would know exactly how old everything is
  2. Updating specific pickles would become as easy as adding a patch to the Sage library.

+1, definitely! Actually I did not suggest it earlier because I was worrying about the disk space usage, not for the Sage distribution but for the Sage install. But if there is a consensus that this is well used disk space, let's go for it.

I was also wondering whether this could possibly slow down unpickle_all since this would require loading lots of little files instead of slurping in one large archive. Any clue?

comment:6 in reply to: ↑ 4 ; follow-up: Changed 9 years ago by nthiery

Hi Jeroen!

Replying to jdemeyer:

Nicolas, just to make sure I understand you correctly, is your proposal the following:

I am going to use the occasion to amend the proposal a bit :-)

  1. Pickle jars are named after the Sage version (i.e. we would have a pickle_jar-4.6.2.tar.bz2 file or a pickle_jar-4.6.2 directory in my proposal).

Yes.

  2. We always keep the old versions unchanged (so sage-4.7 would still contain pickle_jar-4.6.2).

Yes. More precisely sage-4.7 would still contain the subset of the pickles in pickle_jar-4.6.2 that:

  • still unpickle properly in sage-4.7
  • differ from the corresponding pickle in 4.7 (and any intermediate version)

  3. With every new Sage version, the release manager unpickles pickle_jar-$OLDVERSION, repickles them using the new Sage version and saves them as pickle_jar-$NEWVERSION.

More precisely: the release manager recreates a fresh pickle jar by running all the Sage tests with SAGE_PICKLE_JAR set (as described in unpickle_all), and then removes from pickle_jar-$OLDVERSION the pickles that did not change. This is easy to script.
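
The pruning step could look something like the sketch below. It assumes both jars are plain directories in which a given pickle keeps the same file name across versions; the function name prune_unchanged is hypothetical, and the comparison is a byte-for-byte file comparison from the standard library.

```python
import filecmp
from pathlib import Path


def prune_unchanged(old_dir, new_dir):
    """Delete from old_dir every pickle that is byte-identical in new_dir.

    Returns the names of the deleted files.
    """
    removed = []
    # sorted() materialises the listing before any file is deleted.
    for old_file in sorted(Path(old_dir).iterdir()):
        new_file = Path(new_dir) / old_file.name
        if (
            old_file.is_file()
            and new_file.is_file()
            and filecmp.cmp(old_file, new_file, shallow=False)
        ):
            old_file.unlink()
            removed.append(old_file.name)
    return removed
```

After prune_unchanged("pickle_jar-4.6.2", "pickle_jar-4.7"), the old directory would hold only the pickles whose serialized form differs from the 4.7 one.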

I can see some merit to this proposal, however I would save only the pickles which actually changed. Otherwise you will end up with lots of copies of the same pickle.

+1; this is a good refinement of the last point in the ticket description. The comments above should take care of this.

Note that if the pickle_jar for 3.1 and 4.6.2 contain the same pickle X (version numbers just for the example), then I prefer to delete that of 3.1 and keep that of 4.6.2. Indeed, if X does not unpickle anymore with 4.7, then the relevant question is: "is it acceptable to not unpickle in 4.7 a pickle generated by 4.6.2?".

Do you mind rephrasing the ticket description accordingly, and then making a quick call for comments on sage-devel?

Thanks!

Cheers,

Nicolas

comment:7 in reply to: ↑ 6 Changed 9 years ago by jdemeyer

Replying to nthiery:

Note that if the pickle_jar for 3.1 and 4.6.2 contain the same pickle X (version numbers just for the example), then I prefer to delete that of 3.1 and keep that of 4.6.2.

If we use hg to track the pickles, I actually think it is better not to constantly move pickles from one version to another. So while I understand your point, from a practical point of view, I prefer to keep the pickle in the old directory of the old version.

comment:8 in reply to: ↑ 5 Changed 9 years ago by jdemeyer

Replying to nthiery:

+1, definitely! Actually I did not suggest it earlier because I was worrying about the disk space usage, not for the Sage distribution but for the Sage install.

Currently, the pickle jar contains 1174 files. Assuming each file takes 4kB of actual disk space, this would use a few megabytes (1174 × 4kB ≈ 4.7MB). I don't think this is an issue.

I was also wondering whether this could possibly slow down unpickle_all since this would require loading lots of little files instead of slurping in one large archive. Any clue?

This would depend very much on the operating system and file system... But yes, on some systems this will be slower. On the other hand, it could even speed up things by not having to decompress and untar.
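
One way to settle this on a given machine would be a small timing experiment like the sketch below. It is not Sage code: it builds a synthetic jar of small standard-library pickles, and the helper names build_jar, load_loose and load_archive are made up for the example. Timings will vary with the file system and cache state.

```python
import pickle
import tarfile
import tempfile
import time
from pathlib import Path


def build_jar(root, count=200):
    """Write `count` small pickles as loose files and as one tar.bz2 archive."""
    loose = Path(root) / "pickle_jar"
    loose.mkdir()
    for i in range(count):
        (loose / f"obj_{i}.pickle").write_bytes(pickle.dumps(list(range(i))))
    archive = Path(root) / "pickle_jar.tar.bz2"
    with tarfile.open(archive, "w:bz2") as tar:
        tar.add(loose, arcname="pickle_jar")
    return loose, archive


def load_loose(loose):
    """Unpickle every file found in the directory."""
    return [pickle.loads(p.read_bytes()) for p in sorted(loose.iterdir())]


def load_archive(archive):
    """Unpickle every regular file stored in the compressed archive."""
    objs = []
    with tarfile.open(archive, "r:bz2") as tar:
        for member in tar:
            if member.isfile():
                objs.append(pickle.load(tar.extractfile(member)))
    return objs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        loose, archive = build_jar(root)
        for loader, arg in ((load_loose, loose), (load_archive, archive)):
            start = time.perf_counter()
            n = len(loader(arg))
            elapsed = time.perf_counter() - start
            print(f"{loader.__name__}: {n} pickles in {elapsed:.4f}s")
```

For a meaningful comparison the experiment should be repeated with a jar of realistic size (about 1174 files) and with a cold file system cache.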

comment:9 Changed 9 years ago by jdemeyer

  • Description modified (diff)

comment:10 Changed 7 years ago by andrew.mathas

Hi Nicolas,

I want to add to your proposal that the pickle_jar be properly documented. As far as I am aware, there is currently no documentation on what the pickle jar is for, how it should be used, and what to do when a pickle breaks with

sage -t  devel/sage-sf/sage/structure/sage_object.pyx

for example. A non-trivial example for using register_unpickle_override should also be added.
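
Until such an example exists in the Sage documentation, the idea behind an unpickle override can be illustrated with the standard library alone. The sketch below does not use Sage's register_unpickle_override; it redirects a class reference inside an old pickle by overriding pickle.Unpickler.find_class. The module path old_combinat.partition, the class names and the UNPICKLE_OVERRIDES registry are all hypothetical.

```python
import io
import pickle


class NewPartition:
    """Current class; stands in for a class that was moved or renamed."""

    def __init__(self, parts):
        self.parts = list(parts)


# Hypothetical registry: (old module, old class name) -> replacement class.
UNPICKLE_OVERRIDES = {
    ("old_combinat.partition", "Partition_class"): NewPartition,
}


class OverridingUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Consult the registry first, then fall back to the normal import.
        override = UNPICKLE_OVERRIDES.get((module, name))
        if override is not None:
            return override
        return super().find_class(module, name)


def loads_with_overrides(data):
    """Like pickle.loads, but honouring UNPICKLE_OVERRIDES."""
    return OverridingUnpickler(io.BytesIO(data)).load()


# Hand-written protocol 0 pickle meaning
# "call old_combinat.partition.Partition_class([3, 2, 1])".
# It plays the role of a pickle created before the class was renamed.
OLD_PICKLE = b"cold_combinat.partition\nPartition_class\n((I3\nI2\nI1\nltR."
```

With this in place, loads_with_overrides(OLD_PICKLE) builds a NewPartition with parts [3, 2, 1], whereas plain pickle.loads(OLD_PICKLE) fails with an ImportError because the old module no longer exists.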

Secondly, I think that the procedure for adding new pickles to the jar needs to be streamlined. Again, I don't believe that it is described anywhere when or how this happens, but I do know that there are many "new" classes which are not represented in the pickle_jar, with the consequence that the pickle_jar is unable to check backward compatibility for these classes.

Andrew

comment:11 Changed 6 years ago by vbraun

  • Cc ohanar added
  • Description modified (diff)
  • Status changed from new to needs_info

Do we really put all that into the git repo? The current (incredibly old) pickle jar is about 2MB uncompressed. A new one is likely considerably larger. There are of the order of 10 minor Sage releases every year. I don't know how often the pickles change, but it seems likely that this'll generate on the order of 10MB/year that will be with us forever. The whole git repo is currently <100MB.

comment:12 Changed 6 years ago by nthiery

Hi Volker!

I don't have a good view of the orders of magnitude. Yet, with the proposed protocol, pickles that don't change don't get duplicated between versions, and I'd expect that only a few pickles get changed from one version to the next (especially if we emphasize pickling by construction rather than by internal data structure). A good experiment would be to regenerate a new pickle jar, and see how much we have added to it since last time!
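
To make "pickling by construction" concrete, here is a minimal sketch in plain Python. The Permutation class is a toy stand-in, not a Sage class, and the rebuild helper is hypothetical. The point is that __reduce__ records the constructor call instead of the instance's attribute dictionary.

```python
class Permutation:
    """Toy stand-in for a Sage object (hypothetical, not a Sage class)."""

    def __init__(self, images):
        # Internal representation; free to change between releases.
        self._images = tuple(images)

    def __reduce__(self):
        # Record the recipe "call Permutation(images)" rather than __dict__.
        return (Permutation, (list(self._images),))

    def __eq__(self, other):
        return isinstance(other, Permutation) and self._images == other._images

    def __hash__(self):
        return hash(self._images)


def rebuild(obj):
    """Replay the recipe returned by __reduce__, as an unpickler would."""
    constructor, args = obj.__reduce__()
    return constructor(*args)
```

A pickle of Permutation([2, 0, 1]) then stores only a reference to the class and the constructor arguments, so renaming _images or switching to another internal representation should leave existing pickles loadable as long as the constructor signature is kept.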

I don't have a strong opinion about whether the pickle jar should be maintained under git or not. If we can afford it, that makes things easier, as changes to the pickle jar can be done within the usual workflow. But if it's too big, it's too big.

Cheers,

Nicolas

comment:13 Changed 3 years ago by jdemeyer

  • Description modified (diff)