#29662 closed task (fixed)
Deprecate sage.stats.basic_stats
Reported by:  KarlDieter Crisman  Owned by:  

Priority:  major  Milestone:  sage9.5 
Component:  statistics  Keywords:  
Cc:  David Coudert, Travis Scrimshaw  Merged in:  
Authors:  Matthias Koeppe  Reviewers:  David Coudert 
Report Upstream:  N/A  Work issues:  
Branch:  7e7331a (Commits, GitHub, GitLab)  Commit:  
Dependencies:  Stopgaps: 
Description (last modified by )
Right now in sage stats there are a few very basic things, and then lots of very technical stuff sampling from e.g. polynomials or for hidden markov modeling!
In this ticket, we deprecate sage.stats.basic_stats
, which is underdeveloped. Users are usually better off to learn to use other facilities available in Python that provide better coverage and consistency. For example, mean
happens to work with vectors (which is used in some doctests elsewhere in sage), but variance
does not.
Note that Python 3 comes with a basic builtin stats module, but much of its functionality is incompatible with Sage's numbers in #28234.
In the deprecation messages, we instead point users to suitable numpy
and scipy
functions, as well as to pandas
(for moving_average
).
See also #29663
Change History (40)
comment:1 Changed 3 years ago by
Description:  modified (diff) 

comment:2 followup: 4 Changed 3 years ago by
comment:3 Changed 3 years ago by
Description:  modified (diff) 

comment:4 Changed 3 years ago by
Replying to kcrisman:
Thanks for that comment about Py3. So perhaps there are already namespace clashes ... though presumably that module can't handle (say) the mean of several
Integer
s or even stranger objects, asis.
this might be an upstream bug. One should check whether this works with unusual numpy types, by the way. If it doesn't we have a case...
comment:5 Changed 16 months ago by
Unfortunately the Python3 statistics
module is not compatible with Sage numbers. This is coming from #28234.
sage: import statistics sage: statistics.mean([1, 2])  TypeError Traceback (most recent call last) <ipythoninput3c5dde83ca691> in <module> > 1 statistics.mean([Integer(1), Integer(2)]) /usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/statistics.py in mean(data) 314 if n < 1: 315 raise StatisticsError('mean requires at least one data point') > 316 T, total, count = _sum(data) 317 assert count == n 318 return _convert(total / n, T) /usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/statistics.py in _sum(data, start) 166 for n, d in map(_exact_ratio, values): 167 count += 1 > 168 partials[d] = partials_get(d, 0) + n 169 if None in partials: 170 # The sum will be a NAN or INF. We can ignore all the finite TypeError: unsupported operand type(s) for +: 'int' and 'builtin_function_or_method'
comment:6 Changed 16 months ago by
Authors:  → Matthias Koeppe 

Cc:  David Coudert added 
So to get rid of the single use of sage.stats
within sage.graphs
, a workaround like in this commit is needed. I don't know if this is acceptable
comment:7 Changed 16 months ago by
Python 3.8 introduces https://docs.python.org/3/library/statistics.html#statistics.fmean, which coerces to float
first and therefore works fine; but we cannot use it yet as we still support Python 3.7. Also geometric_mean
(Python >= 3.8), median
, median_low
, median_high
, median_grouped
work fine for the same reason.
mode
, multimode
, and quantiles
also work fine. They do not convert to float
.
OTOH, harmonic_mean
, pstdev
, pvariance
, stdev
, variance
all lead to the same error as mean
.
comment:8 Changed 16 months ago by
Branch:  → u/mkoeppe/clarify_stats_module_role 

comment:9 Changed 16 months ago by
Commit:  → 2be6a71916d5f963b039a59a90f3f0e4d50daee7 

numpy and scipy.stats functions do not have this problem; they just convert everything to numpy.float64
first.
So if we want to deprecate sage.stats.basic_stats
, the deprecation messages should probably send users to a combination of numpy and scipy functions.
New commits:
2be6a71  GenericGraph.clustering_average: Replace use of sage.stats.basic_stats by statistics

comment:10 Changed 16 months ago by
Cc:  Travis Scrimshaw added 

Milestone:  sagefeature → sage9.5 
comment:11 Changed 16 months ago by
Commit:  2be6a71916d5f963b039a59a90f3f0e4d50daee7 → d10694a244dff78cb7c6624c222701d633150624 

Branch pushed to git repo; I updated commit sha1. New commits:
d10694a  sage.stats.basic_stats: Add deprecations for all functions

comment:12 Changed 16 months ago by
Summary:  Clarify stats module role → Deprecate sage.stats.basic_stats 

comment:13 Changed 16 months ago by
Currently, clustering_coeff
returns a rational when implementation
is "dense_copy"
or "sparse_copy"
. Unfortunately, the rational field of sage is currently not able to take as input the type Fraction
from Python library fractions
.
A solution could be
from statistics import mean from fractions import Fraction return QQ(str(mean([Fraction(str(x)) for x in G.clustering_coeff(implementation=implementation).values()])))
comment:15 Changed 16 months ago by
Commit:  d10694a244dff78cb7c6624c222701d633150624 → d3d374bf6ebad93051102102de5a7581c1edf48e 

comment:16 Changed 16 months ago by
... so instead of going through Fraction
I have just replaced it by sum
followed by division
comment:17 Changed 16 months ago by
Commit:  d3d374bf6ebad93051102102de5a7581c1edf48e → 03e926f08886317fa83ec1991971bdb1a1df5cbc 

Branch pushed to git repo; I updated commit sha1. New commits:
03e926f  src/sage/stats/basic_stats.py: Update doctest outputs with deprecation warnings

comment:18 Changed 16 months ago by
Priority:  minor → major 

Status:  new → needs_review 
comment:19 Changed 16 months ago by
Reviewers:  → David Coudert 

Status:  needs_review → positive_review 
LGTM. Thank you.
comment:21 Changed 16 months ago by
Status:  positive_review → needs_work 

... : DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead See https://trac.sagemath.org/29662 for details. ********************************************************************** 1 item had failures: 1 of 14 in sage.crypto.lwe.LWE.__init__ [111 tests, 1 failure, 0.32 s]  sage t long warnlong 46.3 randomseed=0 src/sage/crypto/lwe.py # 1 doctest failed 
comment:22 Changed 16 months ago by
Commit:  03e926f08886317fa83ec1991971bdb1a1df5cbc → 5966df91465a86652ae4c217d2ab0cfcd4f4ccd6 

Branch pushed to git repo; I updated commit sha1. New commits:
5966df9  src/sage/crypto/lwe.py: Replace use of sage.stats.basic_stats in doctest by numpy

comment:23 Changed 16 months ago by
Status:  needs_work → needs_review 

comment:24 Changed 16 months ago by
Status:  needs_review → needs_work 

The patchbot reports several errors due to mean
sage t long randomseed=0 src/sage/stats/distributions/discrete_gaussian_lattice.py # 3 doctests failed sage t long randomseed=0 src/doc/en/prep/Quickstarts/StatisticsandDistributions.rst # 2 doctests failed sage t long randomseed=0 src/sage/stats/distributions/discrete_gaussian_polynomial.py # 1 doctest failed sage t long randomseed=0 src/sage/coding/information_set_decoder.py # 2 doctests failed sage t long randomseed=0 src/sage/graphs/graph_generators_pyx.pyx # 1 doctest failed
comment:25 Changed 16 months ago by
Description:  modified (diff) 

comment:26 Changed 16 months ago by
Commit:  5966df91465a86652ae4c217d2ab0cfcd4f4ccd6 → 88745e18d58b36df2df61aada8c270a7980f02a7 

Branch pushed to git repo; I updated commit sha1. New commits:
88745e1  src/sage/stats/distributions/discrete_gaussian_lattice.py: Remove use of sage.basic_stats.mean in doctests

comment:27 Changed 16 months ago by
Commit:  88745e18d58b36df2df61aada8c270a7980f02a7 → c6e40a58ad8a607dcbcacf4503018d3b9ed8f97f 

Branch pushed to git repo; I updated commit sha1. New commits:
fb10839  src/sage/stats/distributions/discrete_gaussian_polynomial.py: Remove use of sage.stats.basic_stats.mean in doctest

768ebdf  src/sage/graphs/graph_generators_pyx.pyx: Remove use of sage.stats.basic_stats.mean in doctest

c6e40a5  src/sage/coding/information_set_decoder.py: Remove use of sage.stats.basic_stats; make imports more specific

comment:28 Changed 16 months ago by
Commit:  c6e40a58ad8a607dcbcacf4503018d3b9ed8f97f → 7e7331a3fe55bb85e6a583c20b532fcaaf52b6a3 

Branch pushed to git repo; I updated commit sha1. New commits:
7e7331a  src/doc/en/prep/Quickstarts/StatisticsandDistributions.rst: Update, stop recommending deprecated functions

comment:29 Changed 16 months ago by
Status:  needs_work → needs_review 

comment:30 Changed 15 months ago by
Status:  needs_review → positive_review 

I think it's OK now. We have a few pyflake warnings that can be fixed afterward.
comment:32 Changed 15 months ago by
Branch:  u/mkoeppe/clarify_stats_module_role → 7e7331a3fe55bb85e6a583c20b532fcaaf52b6a3 

Resolution:  → fixed 
Status:  positive_review → closed 
comment:33 Changed 10 months ago by
Commit:  7e7331a3fe55bb85e6a583c20b532fcaaf52b6a3 

The definition of "numpy.mean" is different than Sage's built in mean:
sage: mean([vector([1,2]),vector([3,4])]) <ipythoninput437be8bb3881f>:1: DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead See https://trac.sagemath.org/29662 for details. mean([vector([Integer(1),Integer(2)]),vector([Integer(3),Integer(4)])]) (2, 3) sage: numpy.mean([vector([1,2]),vector([3,4])]) 2.5
Also, one of the examples in the Sage docs involving mean doesn't work with numpy.mean:
sage: v = stats.TimeSeries([1..10]) sage: mean(v) <ipythoninput139d6545956b1d>:1: DeprecationWarning: sage.stats.basic_stats.mean is deprecated; use numpy.mean or numpy.nanmean instead See https://trac.sagemath.org/29662 for details. mean(v) 5.5 sage: numpy.mean(v)  TypeError Traceback (most recent call last) <ipythoninput142e91276b932b> in <module> > 1 numpy.mean(v) <__array_function__ internals> in mean(*args, **kwargs) /ext/sage/9.5/local/var/lib/sage/venvpython3.9.9/lib/python3.9/sitepackages/numpy/core/fromnumeric.py in mean(a, axis, dtype, out, keepdims, where) 3436 pass 3437 else: > 3438 return mean(axis=axis, dtype=dtype, out=out, **kwargs) 3439 3440 return _methods._mean(a, axis=axis, dtype=dtype, TypeError: TimeSeries.mean() takes no keyword arguments
Maybe this is an argument to improve variance, rather than delete it?
For example, mean happens to work with vectors (which is used in some doctests elsewhere in sage), but variance does not.
comment:34 followup: 35 Changed 10 months ago by
perhaps removing mean
, variance
, etc. from global namespace would be enough for the time being.
comment:35 Changed 10 months ago by
Replying to dimpase:
perhaps removing
mean
,variance
, etc. from global namespace would be enough for the time being.
I think that having sage versions of mean
and variance
in the global namespace are useful to many users. Much more than EllipticCurve
let say.
numpy and python are floating point focused when it comes to mathematics. This is not what sage tries to teach.
comment:36 followup: 37 Changed 10 months ago by
By default, EllipticCurve
, or in fact anything that's not in vanilla Python, should not be in global namespace.
This is in line with other Python packages.
By the way, one of the worst is currently sum
, which overloads Python's sum at Sage prompt, but does not do this in the library.
Having said that, nothing then would prevent one from having "classical Sage" module, which populates the global namespace with stuff.
comment:37 Changed 10 months ago by
Replying to dimpase:
By default,
EllipticCurve
, or in fact anything that's not in vanilla Python, should not be in global namespace. This is in line with other Python packages.
Indeed, and then with from sage.all import *
people can get a standard set of stuff in their global namespace. And the sage
command environment should probably preload that, just as the preparser. It's great if sage can work like a python package, but the sagespecific environment is very valuable for quick stuff. And for, for instance, teaching, it is very useful if there is a useful "standard" set, if only so that it's easily available in sagecell
.
It's valuable that in maple, maxima, mathematica, you can start right away with typing in math oneliners and get results, without having to execute boiler plate. We have that in sagemath now as well. I think it would reduce the usability of sagemath in many informal settings if we were to do away with it. A python library is just not a very good match for what people expect from an interactive CAS, so I think it's good we have a shim layer that makes sagemath behave a little more like a traditional CAS.
Thanks for that comment about Py3. So perhaps there are already namespace clashes ... though presumably that module can't handle (say) the mean of several
Integer
s or even stranger objects, asis.