Opened 3 years ago

Closed 15 months ago

Last modified 9 months ago

#29662 closed task (fixed)

Deprecate sage.stats.basic_stats

Reported by: Karl-Dieter Crisman Owned by:
Priority: major Milestone: sage-9.5
Component: statistics Keywords:
Cc: David Coudert, Travis Scrimshaw Merged in:
Authors: Matthias Koeppe Reviewers: David Coudert
Report Upstream: N/A Work issues:
Branch: 7e7331a (Commits, GitHub, GitLab) Commit:
Dependencies: Stopgaps:

Status badges

Description (last modified by Matthias Köppe)

Right now in sage stats there are a few very basic things, and then lots of very technical stuff sampling from e.g. polynomials or for hidden markov modeling!

In this ticket, we deprecate sage.stats.basic_stats, which is underdeveloped. Users are usually better off to learn to use other facilities available in Python that provide better coverage and consistency. For example, mean happens to work with vectors (which is used in some doctests elsewhere in sage), but variance does not.

Note that Python 3 comes with a basic built-in stats module, but much of its functionality is incompatible with Sage's numbers in #28234.

In the deprecation messages, we instead point users to suitable numpy and scipy functions, as well as to pandas (for moving_average).

See also #29663

Change History (40)

comment:1 Changed 3 years ago by Dima Pasechnik

Description: modified (diff)

comment:2 Changed 3 years ago by Karl-Dieter Crisman

Thanks for that comment about Py3. So perhaps there are already namespace clashes ... though presumably that module can't handle (say) the mean of several Integers or even stranger objects, as-is.

comment:3 Changed 3 years ago by Karl-Dieter Crisman

Description: modified (diff)

comment:4 in reply to:  2 Changed 3 years ago by Dima Pasechnik

Replying to kcrisman:

Thanks for that comment about Py3. So perhaps there are already namespace clashes ... though presumably that module can't handle (say) the mean of several Integers or even stranger objects, as-is.

this might be an upstream bug. One should check whether this works with unusual numpy types, by the way. If it doesn't we have a case...

comment:5 Changed 16 months ago by Matthias Köppe

Unfortunately the Python3 statistics module is not compatible with Sage numbers. This is coming from #28234.

sage: import statistics                                                                                                                                                 
sage: statistics.mean([1, 2])                                                                                                                                           
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-c5dde83ca691> in <module>
----> 1 statistics.mean([Integer(1), Integer(2)])

/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/statistics.py in mean(data)
    314     if n < 1:
    315         raise StatisticsError('mean requires at least one data point')
--> 316     T, total, count = _sum(data)
    317     assert count == n
    318     return _convert(total / n, T)

/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/statistics.py in _sum(data, start)
    166         for n, d in map(_exact_ratio, values):
    167             count += 1
--> 168             partials[d] = partials_get(d, 0) + n
    169     if None in partials:
    170         # The sum will be a NAN or INF. We can ignore all the finite

TypeError: unsupported operand type(s) for +: 'int' and 'builtin_function_or_method'

comment:6 Changed 16 months ago by Matthias Köppe

Authors: Matthias Koeppe
Cc: David Coudert added

So to get rid of the single use of sage.stats within sage.graphs, a workaround like in this commit is needed. I don't know if this is acceptable

comment:7 Changed 16 months ago by Matthias Köppe

Python 3.8 introduces https://docs.python.org/3/library/statistics.html#statistics.fmean, which coerces to float first and therefore works fine; but we cannot use it yet as we still support Python 3.7. Also geometric_mean (Python >= 3.8), median, median_low, median_high, median_grouped work fine for the same reason.

mode, multimode, and quantiles also work fine. They do not convert to float.

OTOH, harmonic_mean, pstdev, pvariance, stdev, variance all lead to the same error as mean.

comment:8 Changed 16 months ago by Matthias Köppe

Branch: u/mkoeppe/clarify_stats_module_role

comment:9 Changed 16 months ago by Matthias Köppe

Commit: 2be6a71916d5f963b039a59a90f3f0e4d50daee7

numpy and scipy.stats functions do not have this problem; they just convert everything to numpy.float64 first.

So if we want to deprecate sage.stats.basic_stats, the deprecation messages should probably send users to a combination of numpy and scipy functions.


New commits:

2be6a71GenericGraph.clustering_average: Replace use of sage.stats.basic_stats by statistics

comment:10 Changed 16 months ago by Matthias Köppe

Cc: Travis Scrimshaw added
Milestone: sage-featuresage-9.5

comment:11 Changed 16 months ago by git

Commit: 2be6a71916d5f963b039a59a90f3f0e4d50daee7d10694a244dff78cb7c6624c222701d633150624

Branch pushed to git repo; I updated commit sha1. New commits:

d10694asage.stats.basic_stats: Add deprecations for all functions

comment:12 Changed 16 months ago by Matthias Köppe

Summary: Clarify stats module roleDeprecate sage.stats.basic_stats

comment:13 Changed 16 months ago by David Coudert

Currently, clustering_coeff returns a rational when implementation is "dense_copy" or "sparse_copy". Unfortunately, the rational field of sage is currently not able to take as input the type Fraction from Python library fractions.

A solution could be

from statistics import mean
from fractions import Fraction
return QQ(str(mean([Fraction(str(x)) for x in G.clustering_coeff(implementation=implementation).values()])))

comment:14 Changed 16 months ago by Matthias Köppe

OK. Sounds like having the result in QQ is important to you.

comment:15 Changed 16 months ago by git

Commit: d10694a244dff78cb7c6624c222701d633150624d3d374bf6ebad93051102102de5a7581c1edf48e

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

933f0aasage.stats.basic_stats: Add deprecations for all functions
d3d374bGenericGraph.clustering_average: Remove use of sage.stats.basic_stats

comment:16 Changed 16 months ago by Matthias Köppe

... so instead of going through Fraction I have just replaced it by sum followed by division

comment:17 Changed 16 months ago by git

Commit: d3d374bf6ebad93051102102de5a7581c1edf48e03e926f08886317fa83ec1991971bdb1a1df5cbc

Branch pushed to git repo; I updated commit sha1. New commits:

03e926fsrc/sage/stats/basic_stats.py: Update doctest outputs with deprecation warnings

comment:18 Changed 16 months ago by Matthias Köppe

Priority: minormajor
Status: newneeds_review

comment:19 Changed 16 months ago by David Coudert

Reviewers: David Coudert
Status: needs_reviewpositive_review

LGTM. Thank you.

comment:20 Changed 16 months ago by Matthias Köppe

Thanks!

comment:21 Changed 16 months ago by Volker Braun

Status: positive_reviewneeds_work
...
    :
    DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead
    See https://trac.sagemath.org/29662 for details.
**********************************************************************
1 item had failures:
   1 of  14 in sage.crypto.lwe.LWE.__init__
    [111 tests, 1 failure, 0.32 s]
----------------------------------------------------------------------
sage -t --long --warn-long 46.3 --random-seed=0 src/sage/crypto/lwe.py  # 1 doctest failed
----------------------------------------------------------------------

comment:22 Changed 16 months ago by git

Commit: 03e926f08886317fa83ec1991971bdb1a1df5cbc5966df91465a86652ae4c217d2ab0cfcd4f4ccd6

Branch pushed to git repo; I updated commit sha1. New commits:

5966df9src/sage/crypto/lwe.py: Replace use of sage.stats.basic_stats in doctest by numpy

comment:23 Changed 16 months ago by Matthias Köppe

Status: needs_workneeds_review

comment:24 Changed 16 months ago by David Coudert

Status: needs_reviewneeds_work

The patchbot reports several errors due to mean

sage -t --long --random-seed=0 src/sage/stats/distributions/discrete_gaussian_lattice.py  # 3 doctests failed
sage -t --long --random-seed=0 src/doc/en/prep/Quickstarts/Statistics-and-Distributions.rst  # 2 doctests failed
sage -t --long --random-seed=0 src/sage/stats/distributions/discrete_gaussian_polynomial.py  # 1 doctest failed
sage -t --long --random-seed=0 src/sage/coding/information_set_decoder.py  # 2 doctests failed
sage -t --long --random-seed=0 src/sage/graphs/graph_generators_pyx.pyx  # 1 doctest failed

comment:25 Changed 16 months ago by Matthias Köppe

Description: modified (diff)

comment:26 Changed 16 months ago by git

Commit: 5966df91465a86652ae4c217d2ab0cfcd4f4ccd688745e18d58b36df2df61aada8c270a7980f02a7

Branch pushed to git repo; I updated commit sha1. New commits:

88745e1src/sage/stats/distributions/discrete_gaussian_lattice.py: Remove use of sage.basic_stats.mean in doctests

comment:27 Changed 16 months ago by git

Commit: 88745e18d58b36df2df61aada8c270a7980f02a7c6e40a58ad8a607dcbcacf4503018d3b9ed8f97f

Branch pushed to git repo; I updated commit sha1. New commits:

fb10839src/sage/stats/distributions/discrete_gaussian_polynomial.py: Remove use of sage.stats.basic_stats.mean in doctest
768ebdfsrc/sage/graphs/graph_generators_pyx.pyx: Remove use of sage.stats.basic_stats.mean in doctest
c6e40a5src/sage/coding/information_set_decoder.py: Remove use of sage.stats.basic_stats; make imports more specific

comment:28 Changed 16 months ago by git

Commit: c6e40a58ad8a607dcbcacf4503018d3b9ed8f97f7e7331a3fe55bb85e6a583c20b532fcaaf52b6a3

Branch pushed to git repo; I updated commit sha1. New commits:

7e7331asrc/doc/en/prep/Quickstarts/Statistics-and-Distributions.rst: Update, stop recommending deprecated functions

comment:29 Changed 16 months ago by Matthias Köppe

Status: needs_workneeds_review

comment:30 Changed 15 months ago by David Coudert

Status: needs_reviewpositive_review

I think it's OK now. We have a few pyflake warnings that can be fixed afterward.

comment:31 Changed 15 months ago by Matthias Köppe

Thanks for reviewing!

comment:32 Changed 15 months ago by Volker Braun

Branch: u/mkoeppe/clarify_stats_module_role7e7331a3fe55bb85e6a583c20b532fcaaf52b6a3
Resolution: fixed
Status: positive_reviewclosed

comment:33 Changed 10 months ago by William Stein

Commit: 7e7331a3fe55bb85e6a583c20b532fcaaf52b6a3

The definition of "numpy.mean" is different than Sage's built in mean:

sage: mean([vector([1,2]),vector([3,4])])
<ipython-input-4-37be8bb3881f>:1: DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead
See https://trac.sagemath.org/29662 for details.
  mean([vector([Integer(1),Integer(2)]),vector([Integer(3),Integer(4)])])
(2, 3)
sage: numpy.mean([vector([1,2]),vector([3,4])])
2.5

Also, one of the examples in the Sage docs involving mean doesn't work with numpy.mean:

sage: v = stats.TimeSeries([1..10])
sage: mean(v)
<ipython-input-13-9d6545956b1d>:1: DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead
See https://trac.sagemath.org/29662 for details.
  mean(v)
5.5
sage: numpy.mean(v)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-2e91276b932b> in <module>
----> 1 numpy.mean(v)

<__array_function__ internals> in mean(*args, **kwargs)

/ext/sage/9.5/local/var/lib/sage/venv-python3.9.9/lib/python3.9/site-packages/numpy/core/fromnumeric.py in mean(a, axis, dtype, out, keepdims, where)
   3436             pass
   3437         else:
-> 3438             return mean(axis=axis, dtype=dtype, out=out, **kwargs)
   3439 
   3440     return _methods._mean(a, axis=axis, dtype=dtype,

TypeError: TimeSeries.mean() takes no keyword arguments

Maybe this is an argument to improve variance, rather than delete it?

For example, mean happens to work with vectors (which is used in some doctests elsewhere in sage), but variance does not.

comment:34 Changed 10 months ago by Dima Pasechnik

perhaps removing mean, variance, etc. from global namespace would be enough for the time being.

comment:35 in reply to:  34 Changed 10 months ago by Vincent Delecroix

Replying to dimpase:

perhaps removing mean, variance, etc. from global namespace would be enough for the time being.

I think that having sage versions of mean and variance in the global namespace are useful to many users. Much more than EllipticCurve let say.

numpy and python are floating point focused when it comes to mathematics. This is not what sage tries to teach.

comment:36 Changed 10 months ago by Dima Pasechnik

By default, EllipticCurve, or in fact anything that's not in vanilla Python, should not be in global namespace. This is in line with other Python packages.

By the way, one of the worst is currently sum, which overloads Python's sum at Sage prompt, but does not do this in the library.

Having said that, nothing then would prevent one from having "classical Sage" module, which populates the global namespace with stuff.

comment:37 in reply to:  36 Changed 10 months ago by Nils Bruin

Replying to dimpase:

By default, EllipticCurve, or in fact anything that's not in vanilla Python, should not be in global namespace. This is in line with other Python packages.

Indeed, and then with from sage.all import * people can get a standard set of stuff in their global namespace. And the sage command environment should probably preload that, just as the preparser. It's great if sage can work like a python package, but the sage-specific environment is very valuable for quick stuff. And for, for instance, teaching, it is very useful if there is a useful "standard" set, if only so that it's easily available in sagecell.

It's valuable that in maple, maxima, mathematica, you can start right away with typing in math one-liners and get results, without having to execute boiler plate. We have that in sagemath now as well. I think it would reduce the usability of sagemath in many informal settings if we were to do away with it. A python library is just not a very good match for what people expect from an interactive CAS, so I think it's good we have a shim layer that makes sagemath behave a little more like a traditional CAS.

comment:39 Changed 9 months ago by Samuel Lelièvre

Follow-up ticket: #33432

comment:40 Changed 9 months ago by Vincent Delecroix

Follow-up ticket: #33453

Note: See TracTickets for help on using tickets.