How can YAML be used for long-term data storage with Sage?

In the sage-devel thread save/loads and the pickle jar and in #28302 there has been a discussion about data storage for long-term purposes. Since the sobj format doesn't seem suitable for this task (see also: getting rid of the pickle_jar and #24337), suggestions were made to use a human-readable data format instead. This worksheet tries to visualize what this could look like using YAML, giving some examples concerning the storage of matrices.

To reproduce this code you need to install ruamel.yaml (sage -pip install ruamel.yaml). I chose YAML since it is the markup language that focuses most on human readability. The methods add_multi_representer and add_multi_constructor are not the recommended way in ruamel.yaml to register custom conversion methods (they aren't even documented, being inherited from PyYAML). But for this demonstration I didn't want to change class structures, which would mean adding methods to_yaml and from_yaml (unfortunately not _to_yaml and _from_yaml) to Sage classes.

In general, the code is just intended to illustrate how YAML could be used with Sage. It doesn't claim to be complete, structurally reasonable or well tested. Its only purpose is to make the examples below work. These examples ran on stable 8.1 (Python 2), 8.9.rc0 (Python 3) and 8.9.rc1 (Python 2 and 3). The produced output differs only in the dumped version information.

Furthermore, I've chosen the names of the tags freely, just for this test (but in a systematic way, using names from Sage's global namespace). Surely, it would need an extensive process of agreement to find an appropriate scheme (taking into account OpenMath and other human-readable CAS formats).

Why think about another serialization format?

The advantage of pickling lies in performance, file size and the fact that you get back objects that are very close to what you saved. This is achieved by serializing all dependent objects exactly as they are, regardless of whether they are simply structured or not.

But if you want your data files to remain valid across version upgrades, this accuracy becomes a disadvantage, since the chance is high that the data structure of one of the dependent objects has changed. You can't guarantee that the developers and reviewers of all such changes will always notice that they have broken pickling.

Now, Sage has its own frameworks for how elements and parents are constructed. Python's pickling doesn't know anything about that. You can customize the __reduce__ or __getstate__/__setstate__ methods to tell Python a little bit more, but either way, pickling will always construct Python objects, not parents and elements.

Thus, in order to store data for long-term usage, it is better to base their reconstruction on intrinsic Sage construction methods such as _element_constructor_ and the various construction methods for parents.
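The principle can be sketched in plain Python (the Fraction registry below is purely hypothetical, not Sage code): the file records a construction call together with its arguments instead of raw object state, and loading replays that call through a registry of known constructors.

```python
from fractions import Fraction

# Hypothetical plain-Python sketch (not Sage code): the file records a
# construction call and its arguments instead of raw object state, and
# loading replays that call through a registry of known constructors.
REGISTRY = {'Fraction': Fraction}

def serialize(name, **kwargs):
    # this mapping is what would end up in the human-readable file
    return {'tag': name, 'args': kwargs}

def reconstruct(record):
    # the analogue of lookup_global(tag)(**args)
    return REGISTRY[record['tag']](**record['args'])

record = serialize('Fraction', numerator=3, denominator=4)
print(reconstruct(record))  # 3/4
```

As long as the construction call keeps its meaning, the stored record survives internal refactorings of the class.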

Sebastian Oehms, October 2019

In [1]:
from ruamel.yaml import YAML
yaml = YAML(typ='safe')

from sys import version_info
from sage.misc.banner import version_dict
from sage.structure.element import Matrix
from sage.structure.factory import lookup_global
from sage.matrix.matrix_space import MatrixSpace
from sage.matrix.matrix_gfpn_dense import Matrix_gfpn_dense, mtx_unpickle
from sage.matrix.matrix_generic_dense import Matrix_generic_dense
from sage.rings.integer_ring import IntegerRing_class, ZZ
from sage.rings.rational_field import QQ
from sage.rings.finite_rings.finite_field_prime_modn import FiniteField_prime_modn
from sage.rings.finite_rings.finite_field_base import FiniteField
from sage.rings.polynomial.laurent_polynomial_ring import LaurentPolynomialRing_univariate
from sage.rings.polynomial.multi_polynomial_ring import MPolynomialRing_polydict
from sage.rings.polynomial.polydict import ETuple


# -------------------------------------------------------------------------------------------------------------------
# dictionaries to recover implementation variants (these would better be shared with the corresponding library files)
# -------------------------------------------------------------------------------------------------------------------
matrix_space_implementations={'meataxe':Matrix_gfpn_dense, 'generic':Matrix_generic_dense}
finite_field_implementations={'modn':FiniteField_prime_modn}


# -----------------------------------------------------------------------------------------------------------
# helper functions
# -----------------------------------------------------------------------------------------------------------
def implementation_to_str(impl_dict, cls):
    implementation = None
    for k, v in impl_dict.items():
        if v in cls.__mro__:
            implementation = k
            break
    if implementation is None:
        raise NotImplementedError('unhandled implementation for %s' % cls)
    return implementation

def str_to_implementation(parent, impl_str):
    if parent == 'MatrixSpace':
        impl_dict = matrix_space_implementations
    elif parent == 'FiniteField':
        impl_dict = finite_field_implementations
    else:
        return impl_str        
    if impl_str in impl_dict.keys():
        return impl_dict[impl_str]
    return impl_str

def poly_to_dict_rec(p):
    dict_res = {}
    if hasattr(p, 'dict'):
        p_dict = p.dict()
        for k in p_dict.keys():
            if isinstance(k, ETuple):
                dict_res[tuple(k)] = poly_to_dict_rec(p_dict[k])
            else:
                dict_res[k] = poly_to_dict_rec(p_dict[k])
    else:
        if p in ZZ:
            return int(p)
        return p
    return dict_res

# -----------------------------------------------------------------------------------------------------------
# For Future use (in case arguments of construction calls did change)
# -----------------------------------------------------------------------------------------------------------
def upgrade(construction_call, arguments, version):
    this = version_dict()
    if version['major'] == this['major'] and version['minor'] == this['minor']:
        return arguments

    if version['major'] < 8 or (version['major'] == 8 and version['minor'] < 4):
        if construction_call == 'MatrixSpace':
            if arguments.get('implementation') == 'flint':
                # flint was the default implementation for MatrixSpace before
                # version 8.4 (:trac:`23719`). If that old default was recorded,
                # we remove it here (unless the base ring is ZZ or QQ, where
                # flint is still valid).
                base_ring = arguments['base_ring']
                if base_ring is not ZZ and base_ring is not QQ:
                    del arguments['implementation']
    return arguments


# -----------------------------------------------------------------------------------------------------------
# declaration of representers needed for the examples
# -----------------------------------------------------------------------------------------------------------
def yaml_matrix_representer(representer, data):
    if isinstance(data, Matrix_gfpn_dense): # to use meataxe_unpickle function
        mtx_args = data.__reduce__()[1][1:]
        dict_dump = {'parent':data.parent(), 'mtx_args':mtx_args}
    else:
        entries = {key:poly_to_dict_rec(value) for key, value in data.dict().items()}
        dict_dump = {'parent':data.parent(), 'data':entries}
    return representer.represent_mapping('!ElementMatrix', dict_dump)

def yaml_matrix_space_representer(representer, data):
    base_ring = data.base_ring()
    implementation = None
    if hasattr(data, 'Element'): # to have this work with version 8.1, too
        if not data._has_default_implementation():
            implementation = implementation_to_str(matrix_space_implementations, data.Element)
    else: # needed for versions before 8.4 (:trac:`23719`)
        if data._implementation != 'flint':  # flint has been default before 8.4
            if isinstance(data._implementation, type):                
                implementation = implementation_to_str(matrix_space_implementations, data._implementation)
            else:
                implementation = implementation_to_str(matrix_space_implementations, data._get_matrix_class())
    nrows, ncols = data.dims()
    dict_dump = {'base_ring':base_ring, 'nrows':nrows, 'ncols':ncols, 'sparse':data.is_sparse(), 'version':version_dict()}
    if implementation is not None:
        dict_dump['implementation'] = implementation
    return representer.represent_mapping('!ParentMatrixSpace', dict_dump)

def yaml_integer_ring_representer(representer, data):
    return representer.represent_mapping('!ParentIntegerRing', {})

def yaml_finite_field_representer(representer, data):
    implementation = implementation_to_str(finite_field_implementations, data.__class__)
    dict_dump = {'order':int(data.cardinality()), 'names':data.variable_name(), 'impl':implementation, 'version':version_dict()}
    return representer.represent_mapping('!ParentFiniteField', dict_dump)

def yaml_laurent_polynomial_ring_representer(representer, data):
    dict_dump = {'base_ring':data.base_ring(), 'names':data.variable_names(), 'version':version_dict()}
    return representer.represent_mapping('!ParentLaurentPolynomialRing', dict_dump)

def yaml_polynomial_ring_representer(representer, data):
    dict_dump = {'base_ring':data.base_ring(), 'names':data.variable_names(), 'version':version_dict()}
    return representer.represent_mapping('!ParentPolynomialRing', dict_dump)

# -----------------------------------------------------------------------------------------------------------
# declaration of constructors needed for the examples
# -----------------------------------------------------------------------------------------------------------
def yaml_element_constructor(constructor, tag_suffix, node):
    inp = constructor.construct_mapping(node, deep=True)
    parent  = inp['parent']
    if 'mtx_args' in inp.keys():
        # special case: use mtx_unpickle
        mtx_args = inp['mtx_args']
        return mtx_unpickle(parent, *mtx_args)
    data    = inp['data']
    return parent(data)

def yaml_parent_constructor(constructor, tag_suffix, node):
    args = constructor.construct_mapping(node, deep=True)
    constructor_name = tag_suffix

    if 'version' in args.keys():
        # check for version upgrades
        version = args['version']
        del args['version']
        args = upgrade(tag_suffix, args, version)
    if 'implementation' in args.keys():
        # set non-default implementation
        args['implementation'] = str_to_implementation(constructor_name, args['implementation'])

    if version_info.major == 2:
        constructor_name = bytes(tag_suffix)        
    return lookup_global(constructor_name)(**args)


# -----------------------------------------------------------------------------------------------------------
# registration of representers needed for the examples
# -----------------------------------------------------------------------------------------------------------
yaml.representer.add_multi_representer(Matrix,             yaml_matrix_representer)

yaml.representer.add_multi_representer(MatrixSpace,        yaml_matrix_space_representer)
yaml.representer.add_multi_representer(FiniteField,        yaml_finite_field_representer)
yaml.representer.add_multi_representer(LaurentPolynomialRing_univariate, yaml_laurent_polynomial_ring_representer)
yaml.representer.add_multi_representer(MPolynomialRing_polydict,   yaml_polynomial_ring_representer)
yaml.representer.add_multi_representer(IntegerRing_class,  yaml_integer_ring_representer)

# -----------------------------------------------------------------------------------------------------------
# registration of constructors needed for the examples
# -----------------------------------------------------------------------------------------------------------
yaml.constructor.add_multi_constructor('!Element',         yaml_element_constructor)
yaml.constructor.add_multi_constructor('!Parent',          yaml_parent_constructor)
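For readers unfamiliar with the multi-constructor mechanism, here is a rough plain-Python sketch (no ruamel.yaml needed, all names mocked) of how the tag dispatch above behaves: a handler is registered for a tag prefix, and the remaining tag suffix chooses the construction call, similar to what lookup_global does in Sage.

```python
# Rough plain-Python sketch (no ruamel.yaml needed) of the tag dispatch used
# above: a handler is registered for a tag *prefix*, and the remaining tag
# suffix chooses the construction call, like lookup_global does in Sage.
GLOBALS = {'MatrixSpace': lambda nrows, ncols: ('MatrixSpace', nrows, ncols)}
PREFIX_HANDLERS = {}

def add_multi_constructor(prefix, handler):
    PREFIX_HANDLERS[prefix] = handler

def construct(tag, mapping):
    # find the registered prefix that matches the tag
    for prefix, handler in PREFIX_HANDLERS.items():
        if tag.startswith(prefix):
            return handler(tag[len(prefix):], mapping)
    raise KeyError('no constructor registered for %s' % tag)

def parent_constructor(tag_suffix, args):
    # analogue of yaml_parent_constructor: look the name up and call it
    return GLOBALS[tag_suffix](**args)

add_multi_constructor('!Parent', parent_constructor)
print(construct('!ParentMatrixSpace', {'nrows': 2, 'ncols': 8}))  # ('MatrixSpace', 2, 8)
```

This is why a single registration for '!Parent' handles '!ParentMatrixSpace', '!ParentFiniteField' and so on.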

Simon King's example

The first example is taken from Trac ticket #28444, in which an incompatibility between Python 2 and Python 3 sobj files has been fixed. To reproduce the example you have to download the corresponding file from that ticket into your current directory.

In [2]:
if version_info.major == 2:
    M1 = load('Py2.sobj')  # if you use Python 2 on a version newer than 8.9.rc0 then Py3.sobj will not work
else:
    M1 = load('Py3.sobj')
stream = open('M1.yaml', 'w')
yaml.dump(M1,stream)
stream.close()

This will yield:

!ElementMatrix
mtx_args:
- 2
- 8
- !!binary |
  gB8=
- true
parent: !ParentMatrixSpace
  base_ring: !ParentFiniteField
    impl: modn
    names: x
    order: 2
    version: {major: 8, minor: 9, prerelease: true, tiny: 0}
  implementation: meataxe
  ncols: 8
  nrows: 2
  sparse: false
  version: {major: 8, minor: 9, prerelease: true, tiny: 0}

This result is the same whether you use Python 2 or Python 3 (diff shows nothing). Loading the file back works independently of the Python version as well:

In [3]:
stream = open('M1.yaml', 'r')
M1back = yaml.load(stream)
stream.close()
M1 == M1back
Out[3]:
True

Here we have used the same mtx_unpickle function as in the sobj case to store the entries of the matrix in binary form. If you want it to be more readable you have to change the matrix representer:

In [4]:
def yaml_matrix_representer(representer, data):
    if isinstance(data, Matrix_gfpn_dense) and False:  # deactivating mtx_unpickle
        mtx_args = data.__reduce__()[1][1:]
        dict_dump = {'parent':data.parent(), 'mtx_args':mtx_args}
    else:
        entries = {key:poly_to_dict_rec(value) for key, value in data.dict().items()}
        dict_dump = {'parent':data.parent(), 'data':entries}
    return representer.represent_mapping('!ElementMatrix', dict_dump)

yaml.representer.add_multi_representer(Matrix,      yaml_matrix_representer)
In [5]:
stream = open('M1entries.yaml', 'w')
yaml.dump(M1, stream)
stream.close()

The result now is this:

!ElementMatrix
data:
  ? [0, 0]
  : 1
  ? [1, 3]
  : 1
  ? [1, 4]
  : 1
  ? [1, 5]
  : 1
  ? [1, 6]
  : 1
  ? [1, 7]
  : 1
parent: !ParentMatrixSpace
  base_ring: !ParentFiniteField
    impl: modn
    names: x
    order: 2
    version: {major: 8, minor: 9, prerelease: true, tiny: 0}
  implementation: meataxe
  ncols: 8
  nrows: 2
  sparse: false
  version: {major: 8, minor: 9, prerelease: true, tiny: 0}

Reading it back:

In [6]:
stream = open('M1entries.yaml', 'r')
M1entries = yaml.load(stream)
stream.close()
M1 == M1entries
Out[6]:
True

But one thing is annoying:

In [7]:
M1back.parent() == M1.parent()
Out[7]:
False
In [8]:
M1entries.parent() == M1.parent()
Out[8]:
False

Even though:

In [9]:
M1back.parent()
Out[9]:
Full MatrixSpace of 2 by 8 dense matrices over Finite Field of size 2
In [10]:
M1entries.parent()
Out[10]:
Full MatrixSpace of 2 by 8 dense matrices over Finite Field of size 2
In [11]:
M1.parent()
Out[11]:
Full MatrixSpace of 2 by 8 dense matrices over Finite Field of size 2

The reason for this is that the _pyx_order attributes of the corresponding base rings cannot be identified as equal:

In [12]:
for k, v in M1.base_ring().__dict__.items():
    if M1back.base_ring().__dict__[k] != v:
        print("different item", k, v)
('different item', '_pyx_order', <sage.rings.finite_rings.integer_mod.NativeIntStruct object at 0x7f971300a9b0>)

I guess that this isn't an essential problem!

Loading from externally generated yaml-code

The above implementation of the constructors can also be used for classes that don't have a corresponding dump method:

In [13]:
sage: braid="""
....: !ElementBraid
....: data:
....:   [3,-2,3,-1]
....: parent: !ParentBraidGroup
....:     n: 5
....: """
sage: yaml.load(braid)
Out[13]:
s2*s1^-1*s2*s0^-1
In [14]:
_.parent()
Out[14]:
Braid group on 5 strands

As you can see, there is no mystery about element construction: it works just like in a Sage session. In case of an incompatibility you may make adaptations manually or with simple conversion tools.
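To make this concrete without Sage, here is a minimal mock of the pattern used by yaml_element_constructor (MockBraidGroup is purely hypothetical): once the parent has been rebuilt from its tag, the element is simply parent(data).

```python
# Minimal mock (not Sage) of the element-construction pattern: once the
# parent has been rebuilt from its tag, the element is simply parent(data).
class MockBraidGroup:
    def __init__(self, n):
        self.n = n

    def __call__(self, word):
        # a real Sage parent would run its _element_constructor_ here
        return ('braid', self.n, tuple(word))

parent = MockBraidGroup(5)       # stands in for the '!ParentBraidGroup' tag
print(parent([3, -2, 3, -1]))    # ('braid', 5, (3, -2, 3, -1))
```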

A larger matrix with polynomial entries

For an example with a larger matrix I use data describing a 648 x 648 matrix over a bivariate polynomial ring over a univariate Laurent polynomial ring over ZZ (this is a regular representation matrix of one of the generators of the cubic Hecke algebra on 4 strands, which can be downloaded from Ivan Marin's data file as a human-readable Maple file).

Notice: Loading the full data file (containing six such matrices) may take more than a minute (on an i5 about 70-80 seconds with Python 2 and 110-120 seconds with Python 3)!

In [15]:
from six.moves.urllib.request import urlopen
url_data = urlopen('http://www.lamfa.u-picardie.fr/marin/softs/H4/MatricesRegH4.maple').read().decode()
preparsed_data = url_data.replace(':=', '=').replace(';', '').replace('^', '**').replace('Matrix', 'matrix')
stream = open('MatricesRegH4.py',  'w')
stream.write(preparsed_data)
stream.close()

L.<w>   = LaurentPolynomialRing(ZZ)
R.<u,v> = PolynomialRing(L)

from sage.misc.misc import cputime

print('Start loading modified maple-file')
start = cputime()
load('MatricesRegH4.py')
end = cputime(start)
print('Finished loading modified maple-file after %s seconds' %end)
Start loading modified maple-file
Finished loading modified maple-file after 235.364 seconds

For the test we use the matrix mm1:

In [16]:
mm1.parent()
Out[16]:
Full MatrixSpace of 648 by 648 dense matrices over Multivariate Polynomial Ring in u, v over Univariate Laurent Polynomial Ring in w over Integer Ring
In [17]:
stream = open('mm1.yaml', 'w')
print('Start dumping YAML mm1')
start = cputime()
yaml.dump(mm1, stream=stream)
end = cputime(start)
print('Finished dumping  YAML mm1 after %s seconds' %end)
stream.close()
Start dumping YAML mm1
Finished dumping  YAML mm1 after 0.668 seconds
In [18]:
stream=open('mm1.yaml', 'r')
print('Start loading YAML mm1')
start = cputime()
mm1_yaml = yaml.load(stream)
end = cputime(start)
print('Finished loading YAML mm1 after %s seconds' %end)
stream.close()
Start loading YAML mm1
Finished loading YAML mm1 after 20.888 seconds
In [19]:
mm1 == mm1_yaml
Out[19]:
True
In [20]:
mm1.parent() == mm1_yaml.parent()
Out[20]:
True

The matrix looks similar to the second version of the first example, but the entries are a bit more complicated, since they represent recursive dictionaries. This comes from the conversion (helper function poly_to_dict_rec in the code above) before serializing. That is what I've tried to indicate in comment 5 and comment 7 of the ticket.

!ElementMatrix
data:
  ? [0, 27]
  : ? [0, 1]
    : {0: -1}
  ? [0, 54]
  : ? [0, 0]
    : {0: 1}
..............
  ? [646, 619]
  : ? [0, 0]
    : {1: 1}
  ? [647, 620]
  : ? [0, 0]
    : {1: 1}
parent: !ParentMatrixSpace
  base_ring: !ParentPolynomialRing
    base_ring: !ParentLaurentPolynomialRing
      base_ring: !ParentIntegerRing {}
      names: [w]
      version: {major: 8, minor: 9, prerelease: true, tiny: 0}
    names: [u, v]
    version: {major: 8, minor: 9, prerelease: true, tiny: 0}
  ncols: 648
  nrows: 648
  sparse: false
  version: {major: 8, minor: 9, prerelease: true, tiny: 0}
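To see what poly_to_dict_rec does to such an entry, here is a plain-Python mimic (the ETuple class below is only a stand-in for sage.rings.polynomial.polydict.ETuple): exponent tuples become plain tuples, and coefficients become nested {exponent: integer} dicts.

```python
# Plain-Python mimic of poly_to_dict_rec, illustrating the nested dicts in
# the dump above. ETuple is only a stand-in for Sage's exponent-tuple class.
class ETuple(tuple):
    """Stand-in for sage.rings.polynomial.polydict.ETuple."""

def poly_to_dict_rec(p):
    # 'p' is either a dict {exponents: coefficient} or a plain integer
    if isinstance(p, dict):
        return {tuple(k) if isinstance(k, ETuple) else k: poly_to_dict_rec(v)
                for k, v in p.items()}
    return int(p)

# an entry like -u + v*w: exponents of (u, v) map to Laurent coefficients in w
entry = {ETuple((1, 0)): {0: -1}, ETuple((0, 1)): {1: 1}}
print(poly_to_dict_rec(entry))  # {(1, 0): {0: -1}, (0, 1): {1: 1}}
```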

Now let's compare this with sobj:

In [21]:
print('Start dumping SOBJ mm1')
start = cputime()
save(mm1, 'mm1.sobj')
end = cputime(start)
print('Finished dumping  SOBJ mm1 after %s seconds' %end)
Start dumping SOBJ mm1
Finished dumping  SOBJ mm1 after 6.312 seconds
In [22]:
print('Start loading SOBJ mm1')
start = cputime()
mm1_sobj = load('mm1.sobj')
end = cputime(start)
print('Finished loading SOBJ mm1 after %s seconds' %end)
Start loading SOBJ mm1
Finished loading SOBJ mm1 after 3.012 seconds
In [23]:
mm1 == mm1_sobj
Out[23]:
True
In [24]:
mm1.parent() == mm1_sobj.parent()
Out[24]:
True

Difference in file-size:

-rw-r--r-- 1 sebastian sebastian   44565 Okt 14 07:59 mm1.yaml
-rw-r--r-- 1 sebastian sebastian 5475008 Okt 14 07:59 mm1.sobj

The reason for this difference is the following:

In [25]:
mm1.is_dense()
Out[25]:
True

Even though the matrix is not really dense, it has been constructed as a dense matrix, since the Maple file contains the full list of entries. Thus, sobj seems to serialize the full list as well. Note that the matrix read back from YAML is dense, too:

In [26]:
mm1_yaml.is_dense()
Out[26]:
True

So we may serialize a dense matrix as a dict and a sparse matrix as a list, depending on which way is more efficient with respect to file size!
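The trade-off can be checked back-of-the-envelope in plain Python (using json from the standard library instead of YAML, sizes measured in characters): a mostly-zero matrix is much smaller as a dict of its nonzero entries, while writing every entry into a dict carries more key overhead than a plain list.

```python
# Rough size comparison (json instead of YAML) of the dict-vs-list trade-off
# for a 100x100 matrix: sparse content favours a dict of nonzero entries,
# dense content favours a plain list of all entries.
import json

n = 100
sparse_dict = {f'{i},{i}': 1 for i in range(n)}                   # diagonal only
dense_list = [[1] * n for _ in range(n)]                          # every entry, compact
dense_dict = {f'{i},{j}': 1 for i in range(n) for j in range(n)}  # every entry, with keys

print(len(json.dumps(sparse_dict)))  # smallest: only 100 entries
print(len(json.dumps(dense_list)))   # all 10000 entries, but no keys
print(len(json.dumps(dense_dict)))   # all 10000 entries plus key overhead
```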

Converting mm1 to a sparse matrix wins back some ground for sobj:

In [27]:
MSsparse = MatrixSpace(R, 648, 648, sparse=True)
mm1s = MSsparse(mm1)
print('Start dumping SOBJ mm1s')
start = cputime()
save(mm1s, 'mm1s.sobj')
end = cputime(start)
print('Finished dumping  SOBJ mm1s after %s seconds' %end)
Start dumping SOBJ mm1s
Finished dumping  SOBJ mm1s after 0.024 seconds
In [28]:
print('Start loading SOBJ mm1s')
start = cputime()
mm1s_sobj = load('mm1s.sobj')
end = cputime(start)
print('Finished loading SOBJ mm1s after %s seconds' %end)
Start loading SOBJ mm1s
Finished loading SOBJ mm1s after 0.016 seconds

Difference in file-size now:

Python 2:

-rw-rw-r-- 1 sebastian sebastian   44565 Okt 15 18:08 mm1.yaml
-rw-rw-r-- 1 sebastian sebastian 3673860 Okt 15 18:09 mm1.sobj
-rw-rw-r-- 1 sebastian sebastian   24612 Okt 15 18:09 mm1s.sobj

Python 3:

-rw-r--r-- 1 sebastian sebastian   44565 Okt 14 07:59 mm1.yaml
-rw-r--r-- 1 sebastian sebastian 5475008 Okt 14 07:59 mm1.sobj
-rw-r--r-- 1 sebastian sebastian   39150 Okt 14 07:59 mm1s.sobj

Thus, as one would expect, for stable long-term serialization you have to pay with larger files and more CPU time. But I think users should be able to decide for themselves which feature is preferable.

Handling of version incompatibility

(following comment 1, comment 9 and comment 15 of the ticket)

Finally, to explain the purpose of the upgrade function in cell 2, suppose that yaml_matrix_space_representer had been implemented differently before version 8.4 (#23719), for example like this:

In [29]:
def yaml_matrix_space_representer(representer, data):
    nrows, ncols = data.dims()
    dict_dump = {'base_ring':data.base_ring(), 'nrows':nrows, 'ncols':ncols, 'sparse':data.is_sparse(), 'implementation':data._implementation, 'version':version_dict()}
    return representer.represent_mapping('!ParentMatrixSpace', dict_dump)

yaml.representer.add_multi_representer(MatrixSpace,        yaml_matrix_space_representer)

and that mm1 had been dumped in an old enough version:

In [30]:
if not hasattr(mm1.parent(), 'Element'):
    stream = open('mm1-old.yaml', 'w')
    print('Start dumping YAML mm1-old')
    start = cputime()
    yaml.dump(mm1, stream=stream)
    end = cputime(start)
    print('Finished dumping  YAML mm1-old after %s seconds' %end)
    stream.close()
Start dumping YAML mm1-old
Finished dumping  YAML mm1-old after 0.284 seconds

Then the difference between mm1.yaml and the corresponding file mm1-old.yaml will be (here using 8.1):

diff mm1.yaml mm1-old.yaml 
3248c3248
<       version: {major: 8, minor: 9, prerelease: true, tiny: 0}
---
>       version: {major: 8, minor: 1, prerelease: false, tiny: 0}
3250c3250,3251
<     version: {major: 8, minor: 9, prerelease: true, tiny: 0}
---
>     version: {major: 8, minor: 1, prerelease: false, tiny: 0}
>   implementation: flint
3254c3255
<   version: {major: 8, minor: 9, prerelease: true, tiny: 0}
---
>   version: {major: 8, minor: 1, prerelease: false, tiny: 0}

since flint was the default value for implementation at that time. Now, the advantage of the human-readable file is that an advanced Sage user can easily delete the line implementation: flint from the file. They will then be able to read it into the recent version.
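What such a user deletes by hand can of course be automated. Here is a simplified plain-Python mimic of that step (the real upgrade function above additionally checks the base ring): for files written before 8.4, an entry implementation: flint only records the old default and is dropped.

```python
# Simplified plain-Python mimic of the upgrade step (the real function
# above additionally checks the base ring): for files written before 8.4,
# 'implementation: flint' only records the old default and is dropped.
def upgrade(construction_call, arguments, version):
    old = (version['major'], version['minor']) < (8, 4)
    if old and construction_call == 'MatrixSpace':
        if arguments.get('implementation') == 'flint':
            arguments = dict(arguments)   # don't mutate the caller's dict
            del arguments['implementation']
    return arguments

args = {'nrows': 648, 'ncols': 648, 'implementation': 'flint'}
print(upgrade('MatrixSpace', args, {'major': 8, 'minor': 1}))  # {'nrows': 648, 'ncols': 648}
```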

Now, the task of the upgrade function is to do exactly the same implicitly, so that all Sage users may use the file in recent versions (the function could also be used in a conversion tool). With the above implementation of upgrade this works fine:

In [31]:
if version_info.major == 2:
    from exceptions import IOError as FileNotFoundError 
try:
    stream = open('mm1-old.yaml', 'r')
    mm1_old = yaml.load(stream)
    print('mm1-old successfully loaded')
except FileNotFoundError:
    mm1_old = mm1_yaml
    print('mm1-old not found')
mm1-old successfully loaded
In [32]:
mm1_old == mm1
Out[32]:
True
In [33]:
mm1_old.parent() == mm1.parent()
Out[33]:
True

On the other hand, deactivating the upgrade function:

In [34]:
def upgrade(construction_call, arguments, version):
    return arguments

will cause an error when reading the old file into a recent Sage version:

In [35]:
stream=open('mm1-old.yaml', 'r')
mm1_old=yaml.load(stream)

If mm1-old.yaml was created successfully as above, the previous cell will give the following on recent versions:

sage: stream = open('mm1-old.yaml', 'r')
sage: mm1_old = yaml.load(stream)
Traceback (most recent call last):
...
ValueError: unknown matrix implementation 'flint' over Multivariate Polynomial Ring in u, v over Univariate Laurent Polynomial Ring in w over Integer Ring