Opened 2 years ago

Last modified 2 years ago

#28302 new defect

stopgap for save/load that is not working accross different Sage versions

Reported by: vdelecroix Owned by:
Priority: major Milestone: sage-wishlist
Component: misc Keywords:
Cc: Merged in:
Authors: Reviewers:
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Status badges

Description

According to the manual the commands loads/save are supposed to give an easy way to save and retrieve data on a hard drive. Sadly this is not safe accross different Sage versions.

This should either be widely advertized in the documentation, emitting a warning or a stopgap.

Attachments (2)

demo_yaml.ipynb (39.2 KB) - added by soehms 2 years ago.
How can YAML be used for long-term data storage with Sage?
demo_yaml.html (346.8 KB) - added by soehms 2 years ago.
This is demo_yaml in html-format (produced with older Sage version 8.1)

Download all attachments as: .zip

Change History (19)

comment:1 Changed 2 years ago by klee

The link to the manual does not work.

I doubt save/load is supposed to work across different Sage versions.

The gist of the problem is whether a pickled object from an old Sage version can be unpickled and work in current Sage version. Python's pickling protocol does not guarantee this.

I think this problem is not directly related with whether we maintain pickle jar or not.

A sage object is tied to a specific Sage version. Hence if we want an object pickled in old Sage can be unpickled and working in current Sage, the object should be pickled with Sage version information and in unpickling, should be upgraded incrementally to current Sage.

I think what Sage lacks is this upgrading mechanism for objects pickled in old Sage version. Pickle jar would be necessary to check whether this upgrading mechanism is functioning.

Let me show my idea. The upgrading mechanism is coded to sage_object class:

    def __getstate__(self):
        if hasattr(self, '_getstate_'):
            state = self._getstate_()
        else:
            state = self.__dict__.copy()
        state['sage_version'] = 5  # current Sage version

    def __setstate__(self, state):
        if 'sage_version' in state:
            version = state['sage_version']
            del state['sage_version']
        else:
            version = None
        self.__dict__.update(state)
        if version is not None and hasattr(self, '_upgrade'):
            self._upgrade(from_vesion=version)

and this is written to the class of objects that needs to be upgraded from old Sage:

    def _upgrade(self, from_version=None):
        version = from_version
        if version < 4:
            # upgrade to version 4
            self.weight = 10
            # now self is an object of sage version 4
            version = 4
        if version < 5:
            # upgrade to version 5
            del self.weight
            # now self is an object of sage version 5
            version = 5    

But I guess the real solution would be far more complex, but this would be start.

comment:2 Changed 2 years ago by embray

  • Milestone changed from sage-8.9 to sage-wishlist

comment:3 follow-ups: Changed 2 years ago by embray

See also #24337.

I believe the only solution here is better documentation emphasizing that save()/load() use pickle, and are not guaranteed to work between different versions of Sage, or even different versions of Python, and should only be used for temporary storage, not long-term portable serialization.

comment:4 in reply to: ↑ 3 Changed 2 years ago by vdelecroix

Replying to embray:

See also #24337.

I believe the only solution here is better documentation emphasizing that save()/load() use pickle, and are not guaranteed to work between different versions of Sage, or even different versions of Python, and should only be used for temporary storage, not long-term portable serialization.

Indeed, the fact that we should advertise to not use save/load was even discussed this sage-devel thread.

Do you have any suggestion for "long-term portable serialization"?

comment:5 follow-up: Changed 2 years ago by soehms

I agree that we should avoid this misunderstanding about save and load. I had it one and a half year ago, too. But surely, it is much more annoying if you have published sobj-files that run out date as in the case from the sage-devel-thread!

On the other hand, if these functions will be disabled I would appreciate to have an option to force their use, even so. Similarly, if there will be a warning, an option to suppress it would be nice. The reason is, that I found a way, which works pretty good in my case where the Sage objects are matrices with entries in

Multivariate Polynomial Ring in u, v over Univariate Laurent Polynomial Ring in w over Integer Ring

respectively

Multivariate Laurent Polynomial Ring in a, b, c over Univariate Quotient Polynomial Ring in e3 over Integer Ring with modulus e3^2 + e3 + 1

I converted them (recursively) into python dictionaries (over python dictionaries ...) ending up with values which are nothing but python int. There is very good support by Sage to convert to and from such dictionaries (but this may be different in other cases, as for example with number fields).

Thus, my sobj-files have been robust against version upgrades since than (and the conversion saved some disk space, as well)! Sure, there is no guarantee that compatibility will keep forever. But the chance is better since this is closer to kernel functionality.

Maybe there is a way, to have such a solution being supported more systematically.

comment:6 in reply to: ↑ 5 ; follow-up: Changed 2 years ago by embray

Replying to soehms:

On the other hand, if these functions will be disabled I would appreciate to have an option to force their use, even so. Similarly, if there will be a warning, an option to suppress it would be nice. The reason is, that I found a way, which works pretty good in my case where the Sage objects are matrices with entries in

Certainly no one is proposing disabling or removing them. Just better clarifying what it is and is not appropriate for (I would say, short-term caching, mainly).

We could also maybe do better about storing with .sobj files the Sage version with which they were created (if we don't already; I think maybe we do?). Still allow them to be loaded with newer Sage versions if possible, but provide a better error message if not, including the Sage version that should be used to load them.

comment:7 in reply to: ↑ 6 Changed 2 years ago by soehms

Replying to embray:

Replying to soehms:

On the other hand, if these functions will be disabled I would appreciate to have an option to force their use, even so. Similarly, if there will be a warning, an option to suppress it would be nice. The reason is, that I found a way, which works pretty good in my case where the Sage objects are matrices with entries in

Certainly no one is proposing disabling or removing them. Just better clarifying what it is and is not appropriate for (I would say, short-term caching, mainly).

Maybe I misunderstood this. But, is there really a (generic) way, to suppress the warning of a deprecated function or method?

I had trouble with a deprecation (ticket #16613), which is not implemented completely, that means: there is still library-code left using the deprecated method (I implemented a fix for this in #26421, but better would have done this in a separate ticket). So, how can you get rid of the warning in such a case (apart from waiting until the fix is reviewed)?

We could also maybe do better about storing with .sobj files the Sage version with which they were created (if we don't already; I think maybe we do?). Still allow them to be loaded with newer Sage versions if possible, but provide a better error message if not, including the Sage version that should be used to load them.

To enlarge transparency about the version an sobj-file is created for is the one thing. But in addition, wouldn't it be worth to improve the stability against version upgrades as I was claiming above?

I think most of the objects, people want to save to disk are instances of element classes. Many of these classes have methods to convert to a dictionary, whose values are instances of element classes again. I think the latter should be more stable against version upgrades as the former, since usually it is closer to kernel functionality. In the cases I explained in my former post, the recursion ends up with integers.

Furthermore, many parent classes have an _element_constructor_ doing the opposite conversion. So in all such cases, the stability against version upgrades could be improved similarly as I have done it for the matrices over some types of polynomial rings.

comment:8 follow-up: Changed 2 years ago by SimonKing

Of course, the problem has two sides:

  • A Sage user may have old data and wants to use them in a new Sage version.
  • A Sage user may have new data and wants to use them in future Sage versions.

If I understand correctly what you are saying:

Sage doesn't even attempt to offer a way to permanently and reliably store user data. If a pickle of user data created by some months of computation time in Sage-x.y.z can not be read in Sage-x.y.(z+1), then it is just tough luck for the user, but isn't considered a bug. And all is good if we simply write in the documentation that a user shouldn't use Sage to store mathematical data.

If that understanding of your statements is correct then I very strongly disagree with you. And that's also something that I find very disturbing in Python-3: Apparently, Python-2 str basically is the same as Python-3 bytes, and Python-2 unicode basically is the same as Python-3 str. But, surprisesurprise: Python-3 attempts to unpickle the pickle of a Python-2 str as Python-3 str (not as Python-3 bytes) -- which isn't exactly consistent (and fails in some cases).

Granted, it can happen that a Sage-x.y.z pickle cannot be unpickled in Sage-x.y.(z+1). However, if that happens then it clearly is a bug and it is our duty to attempt to fix it. Within reason, of course. But in the first place it is our duty. And since the default way of storing Sage data used to be pickling, we should acknowledge that it is a bug if an old pickle cannot be unpickled by a new Sage version.

I believe that any self-respecting software project should provide at least one way of storing data with the promise that the stored data will still be useable in future versions of the software. Of course, there may be different approaches to keep that promise. Such us: Provide a conversion function that turns the data format of version x.y into the data format of version x.(y+1), so that the user can access old stored data version x.y by repeated application of such functions, and then use them in version x.(y+15). However, it is not ok to tell the user: "Well, you may have data from version x.y, but you will not be able to use the old data together with the cool new features of version x.(y+15)".

Last edited 2 years ago by SimonKing (previous) (diff)

comment:9 in reply to: ↑ 8 ; follow-ups: Changed 2 years ago by klee

Replying to SimonKing:

Provide a conversion function that turns the data format of version x.y into the data format of version x.(y+1), so that the user can access old stored data version x.y by repeated application of such functions, and then use them in version x.(y+15).

So, in the beginning of this thread, I have proposed an upgrade mechanism of sage objects from a version to another version of sage. I modified it further. The idea is to add into sage_object.pyx something like

    def __getstate__(self):
        if hasattr(self, '_getstate_'):
            state = self._getstate_()
        elif hasattr(self, '__dict__'):
            state = self.__dict__.copy()
        else:
            state = dict()

        from sage.env import SAGE_VERSION
        v = SAGE_VERSION.split('.')
        major = int(v[0])
        minor = int(v[1])
        state['sage_version'] = major * 100 + minor
        return state

    def __setstate__(self, state):
        if 'sage_version' in state:
            version = state['sage_version']
            del state['sage_version']
        else:
            version = None

        if hasattr(self, '_setstate_'):
            self._setstate_(state)
        elif hasattr(self, '__dict__'):
            self.__dict__.update(state)
        else:
            self.__dict__ = state

        if version is not None and hasattr(self, '_upgrade'):
            self._upgrade(from_version=version)

    def _upgrade(self, from_version):
        version = from_version
        if version < 800:
            # upgrade to version 801
            self._upgradable = True
            # now self is an object of Sage version 8.1
            version = 800
        if version < 809:
            # upgrade to version 902
            del self._upgradable
            # now self is an object of Sage version 9.2
            version = 809    

I wonder why no one makes a comment on the idea... Is it a bad idea?

comment:10 in reply to: ↑ 3 ; follow-up: Changed 2 years ago by vdelecroix

Replying to embray:

See also #24337.

I believe the only solution here is better documentation emphasizing that save()/load() use pickle, and are not guaranteed to work between different versions of Sage, or even different versions of Python, and should only be used for temporary storage, not long-term portable serialization.

Erik, this is actually a wrong statement. From the pickle documentation

The pickle serialization format is guaranteed to be backwards
compatible across Python releases provided a compatible pickle
protocol is chosen and pickling and unpickling code deals with
Python 2 to Python 3 type differences if your data is crossing
that unique breaking change language boundary.
Last edited 2 years ago by vdelecroix (previous) (diff)

comment:11 in reply to: ↑ 9 ; follow-up: Changed 2 years ago by SimonKing

Replying to klee:

Replying to SimonKing:

Provide a conversion function that turns the data format of version x.y into the data format of version x.(y+1), so that the user can access old stored data version x.y by repeated application of such functions, and then use them in version x.(y+15).

So, in the beginning of this thread, I have proposed an upgrade mechanism of sage objects from a version to another version of sage. ... I wonder why no one makes a comment on the idea... Is it a bad idea?

Sorry, I first wanted to let off some steam, as there recently were some comments in sage-support that I find rather discouraging. I understood these comments as: "Don't use Sage if you want to create a mathematical database", and I believe that's a very unhealthy attitude towards users. Also I suffered, too, from a backwards incompatibility between pickling strings in Python-2 and Python-3.

Concerning your proposal: If I understand correctly, it fits into the often-used approach to provide a reasonable default for double-underscore magical methods, and implement the details by a corresponding single-underscore method. So, that's sound. And it would help in those cases in which __getstate__/__setstate__ is used for pickling.

Is your suggestion that the _upgrade function has no default implementation, i.e., is custom? Another approach would be that, whenever in future a new incompatibility arises in version x.y, add a new method _upgrade_to_xy, whose job is to upgrade from x.(y-1) to x.y, and have a default _upgrade method, that would automatically call an appropriate sequence of the _upgrade_to_ab methods to come from version foo.bar to version x.y

But it seems to me that the __reduce__ protocol is also very commonly used. It would perhaps make sense to automatically add version information in a default implementation of __reduce__. Of course, something is different: The first return value of __reduce__ is an unpickling function, which is custom. And to solve incompatibilities between different versions, there probably is no way around fixing each individual custom unpickling function. Or do you see a way?

comment:12 in reply to: ↑ 10 ; follow-up: Changed 2 years ago by SimonKing

Replying to vdelecroix:

Erik, this is actually a wrong statement. From the pickle documentation

The pickle serialization format is guaranteed to be backwards
compatible across Python releases provided a compatible pickle
protocol is chosen and pickling and unpickling code deals with
Python 2 to Python 3 type differences if your data is crossing
that unique breaking change language boundary.

If Python makes that promise then I believe it really is a bug (and not just an oddity) that Python-3 cannot unpickle some strings that were pickled with Python-2. See #28444.

EDIT: "cannot" means that one needs to add the new keyword encoding="bytes" to successfully unpickle. But without the keyword, it fails.

Last edited 2 years ago by SimonKing (previous) (diff)

comment:13 in reply to: ↑ 12 ; follow-up: Changed 2 years ago by vdelecroix

Replying to SimonKing:

Replying to vdelecroix:

Erik, this is actually a wrong statement. From the pickle documentation

The pickle serialization format is guaranteed to be backwards
compatible across Python releases provided a compatible pickle
protocol is chosen and pickling and unpickling code deals with
Python 2 to Python 3 type differences if your data is crossing
that unique breaking change language boundary.

If Python makes that promise then I believe it really is a bug (and not just an oddity) that Python-3 cannot unpickle some strings that were pickled with Python-2. See #28444.

EDIT: "cannot" means that one needs to add the new keyword encoding="bytes" to successfully unpickle. But without the keyword, it fails.

It is indeed not a bug. The promise is "up to pickling protocol" and "Python2/Python3 datastructure differences".

comment:14 in reply to: ↑ 13 Changed 2 years ago by SimonKing

Replying to vdelecroix:

It is indeed not a bug. The promise is "up to pickling protocol" and "Python2/Python3 datastructure differences".

Well, I was told that the Python 2 str datastructure corresponds to the Python 3 bytes datastructure. The problem I try to solve in #28444 is exactly that Python 3 CHANGES the datastructure found in a pickle and turns it into a different datastructure that happens to have the same name (namely str, which in Python 3 is something else than in Python 2).

comment:15 in reply to: ↑ 11 Changed 2 years ago by klee

Replying to SimonKing:

Sorry, I first wanted to let off some steam, as there recently were some comments in sage-support that I find rather discouraging. I understood these comments as: "Don't use Sage if you want to create a mathematical database", and I believe that's a very unhealthy attitude towards users. Also I suffered, too, from a backwards incompatibility between pickling strings in Python-2 and Python-3.

As it is, the current save/load based on pickling is not provisioned to work across versions. But I think we should and could make efforts to make it so.

Concerning your proposal: If I understand correctly, it fits into the often-used approach to provide a reasonable default for double-underscore magical methods, and implement the details by a corresponding single-underscore method. So, that's sound. And it would help in those cases in which __getstate__/__setstate__ is used for pickling.

Is your suggestion that the _upgrade function has no default implementation, i.e., is custom?

_upgrade would be provided for a class of objects only when there is a change in the internal data structure of the class and thus unpickling breaks. The developer who is responsible of the breakage should implement the _upgrade function.

What could be "default" for _upgrade? I don't understand.

Another approach would be that, whenever in future a new incompatibility arises in version x.y, add a new method _upgrade_to_xy, whose job is to upgrade from x.(y-1) to x.y, and have a default _upgrade method, that would automatically call an appropriate sequence of the _upgrade_to_ab methods to come from version foo.bar to version x.y

That is done in the _upgrade function. See if version < ...: version ... statements. The sequence of if statements would upgrade the object in steps ultimately to the present version.

But it seems to me that the __reduce__ protocol is also very commonly used. It would perhaps make sense to automatically add version information in a default implementation of __reduce__. Of course, something is different: The first return value of __reduce__ is an unpickling function, which is custom. And to solve incompatibilities between different versions, there probably is no way around fixing each individual custom unpickling function. Or do you see a way?

For those classes of objects with custom __reduce__, perhaps we should customize the __reduce__ method further to incorporate the upgrading mechanism. But I didn't think about these details. Also I don't think classes with custom __reduce__ is common. What is common is those UniqueRepresentation classes. But for them, there is just one implementation of __reduce__.

Changed 2 years ago by soehms

How can YAML be used for long-term data storage with Sage?

comment:16 in reply to: ↑ 9 Changed 2 years ago by soehms

Replying to klee:

Replying to SimonKing:

Provide a conversion function that turns the data format of version x.y into the data format of version x.(y+1), so that the user can access old stored data version x.y by repeated application of such functions, and then use them in version x.(y+15).

So, in the beginning of this thread, I have proposed an upgrade mechanism of sage objects from a version to another version of sage. I modified it further. The idea is to add into sage_object.pyx something like

    def __getstate__(self):
        if hasattr(self, '_getstate_'):
            state = self._getstate_()
        elif hasattr(self, '__dict__'):
            state = self.__dict__.copy()
        else:
            state = dict()

        from sage.env import SAGE_VERSION
        v = SAGE_VERSION.split('.')
        major = int(v[0])
        minor = int(v[1])
        state['sage_version'] = major * 100 + minor
        return state

    def __setstate__(self, state):
        if 'sage_version' in state:
            version = state['sage_version']
            del state['sage_version']
        else:
            version = None

        if hasattr(self, '_setstate_'):
            self._setstate_(state)
        elif hasattr(self, '__dict__'):
            self.__dict__.update(state)
        else:
            self.__dict__ = state

        if version is not None and hasattr(self, '_upgrade'):
            self._upgrade(from_version=version)

    def _upgrade(self, from_version):
        version = from_version
        if version < 800:
            # upgrade to version 801
            self._upgradable = True
            # now self is an object of Sage version 8.1
            version = 800
        if version < 809:
            # upgrade to version 902
            del self._upgradable
            # now self is an object of Sage version 9.2
            version = 809    

I wonder why no one makes a comment on the idea... Is it a bad idea?

I'm sorry for that, as well. It is in no way a bad idea. But I have some doubts, that pickling could become robust against version upgrades in principle. In my point of view it looks like an raw egg and the question is whether to just increase its protection or take into account the option to boil it before transportation.

One problem I see is that pickling serializes all related instances independent whether they are simple structured or not. Thus the chance that changes in data structure will harm pickling is really high. Your suggestion would heal such violations, if they where detected. But how can you guarantee that developers and reviewers of all such harmful changes will notice that they should implement or adapt an _upgrade method?

I made some tests on how an option to boil the egg could look like (using your idea, as well) and documented this in a jupyter notebook I've attached to that ticket.

I think we should do both: Make pickling more save and think about another format!

Last edited 2 years ago by soehms (previous) (diff)

Changed 2 years ago by soehms

This is demo_yaml in html-format (produced with older Sage version 8.1)

comment:17 Changed 2 years ago by soehms

For convenience I added the demo_yaml in a directly readable format

Note: See TracTickets for help on using tickets.