statistics
Description
We implement a Sage-compatible version of the Python statistics module in stats/statistics.py. In particular, it will provide the mean and median functions that were recently deprecated for removal in #29662. See also #33432.
See also https://docs.python.org/3.10/whatsnew/3.8.html#statistics
Thank you, this looks like a great addition which should take care of the necessary problems but continue providing what is needed. I didn't see any obvious problems in the code but it would be good to get a second more detailed eye.
Your new function mean is not compatible with the builtin statistics.mean that specifies "If data is empty, StatisticsError will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean
(Also, the argument is called data, not v.)
Also, please include crossreferences to the numpy functions in the documentation so that this information is not lost.
Replying to mkoeppe:
Your new function mean is not compatible with the builtin statistics.mean that specifies "If data is empty, StatisticsError will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean
True. I did that on purpose to follow the numpy behaviour. The current code that I wrote only emits a RuntimeWarning.
(Also, the argument is called data, not v.)
Is the argument name really relevant? Though data is definitely much better.
Replying to vdelecroix:
Replying to mkoeppe:
Your new function mean is not compatible with the builtin statistics.mean that specifies "If data is empty, StatisticsError will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean
True. I did that on purpose to follow the numpy behaviour. The current code that I wrote only emits a RuntimeWarning.
That makes no sense: the point of the module is to be compatible not with numpy but with the builtin statistics module.
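For reference, the builtin behaviour under discussion can be checked directly in plain Python. This is only a small sketch: the helper name mean_or_error is made up for illustration and is not part of the ticket's code.

```python
import statistics


def mean_or_error(data):
    """Return statistics.mean(data), or the name of the exception it raises.

    Hypothetical helper, used only to make the empty-data case visible.
    """
    try:
        return statistics.mean(data)
    except statistics.StatisticsError as exc:
        return type(exc).__name__


print(mean_or_error([1, 2, 3, 4]))  # an ordinary mean
print(mean_or_error([]))            # the builtin raises StatisticsError here
```

A module that aims to be a drop-in replacement would presumably need to raise the same exception on empty input rather than emit a warning.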
Replying to vdelecroix:
(Also, the argument is called data, not v.)
Is the argument name really relevant? Though data is definitely much better.
Yes: because the signature is mean(data), not mean(data, /), users are allowed to call it as mean(data=...).
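A minimal sketch of why the parameter name is part of the public interface. The three functions below are hypothetical stand-ins, not the Sage or Python implementations; only their signatures matter.

```python
def mean_named_data(data):
    """Stand-in with the same parameter name as statistics.mean."""
    return sum(data) / len(data)


def mean_named_v(v):
    """Stand-in with the parameter named v instead."""
    return sum(v) / len(v)


def mean_positional_only(data, /):
    """Stand-in whose parameter cannot be passed by keyword."""
    return sum(data) / len(data)


def accepts_data_keyword(func):
    """Return True if func can be called as func(data=...)."""
    try:
        func(data=[1, 2, 3])
    except TypeError:
        return False
    return True


for func in (mean_named_data, mean_named_v, mean_positional_only):
    print(func.__name__, accepts_data_keyword(func))
```

Only the first variant accepts the keyword call, so naming the argument v would break callers who write mean(data=...).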
Note that Python argument naming is awful:
statistics.variance(data, xbar=None)
statistics.pvariance(data, mu=None)
Replying to mkoeppe:
Compatibility > beauty
This is not about beauty but coherence. However, this is a very minor point. It is perfectly fine to keep as much of the Python world as we can.
More importantly:
- if the user provides a numpy array then the appropriate numpy method is called. This is not the Python behaviour. Should I remove it?
- If data only contains builtin Python data (int, float, Fraction, Decimal, ...) should the code simply transfer to the Python statistics module? Usually, sage functions tend to use py_scalar_to_element which does convert int > Integer, float > RealNumber, etc.
The distinction of xbar and mu is deliberate, see discussion in https://bugs.python.org/issue20389
Replying to vdelecroix:
More importantly:
- if the user provides a numpy array then the appropriate numpy method is called. This is not the Python behaviour. Should I remove it?
Computing it via the numpy method I think is a good idea; but the result type / error handling must be compatible with the other types.
- If data only contains builtin Python data (int, float, Fraction, Decimal, ...) should the code simply transfer to the Python statistics module? Usually, sage functions tend to use py_scalar_to_element which does convert int > Integer, float > RealNumber, etc.
I think it would make sense to always make the result a Sage type, via py_scalar_to_element if necessary.
Replying to mkoeppe:
Also, please include crossreferences to the numpy functions in the documentation so that this information is not lost.
Not sure about what you meant here.
Can't just replace sage.stats.basic_stats.mean by lazy_import('sage.stats.statistics', 'mean', deprecation=33453). They have a different specification.
I have a very minor observation. I think there are a few typos in sage.stats.statistics in the updated deprecation notices:

basic_stats.sage
     sage: std([1..6], bias=True)
     doctest:warning...
-    DeprecationWarning: sage.stats.basic_stats.std is deprecated; use sage.stats.stat stics.stdev or sage.stats.statistics.pstdev instead
+    DeprecationWarning: sage.stats.basic_stats.std is deprecated; use sage.stats.statistics.stdev or sage.stats.statistics.pstdev instead
     See https://trac.sagemath.org/33453 for details.
     1/2*sqrt(35/3)
     sage: std([1..6], bias=False)
 …
     sage: std(data) # random
     0.29487771726609185
     """
-    deprecation(33453, 'sage.stats.basic_stats.std is deprecated; use sage.stats.stat stics.stdev or sage.stats.statistics.pstdev instead')
+    deprecation(33453, 'sage.stats.basic_stats.std is deprecated; use sage.stats.statistics.stdev or sage.stats.statistics.pstdev instead')

     if hasattr(v, 'standard_deviation'):
         return v.standard_deviation(bias=bias)
Replying to vdelecroix:
Replying to mkoeppe:
Also, please include crossreferences to the numpy functions in the documentation so that this information is not lost.
Not sure about what you meant here.
  We define the mean of the empty list to be the (symbolic) NaN, following the convention of MATLAB, Scipy, and R.
- This function is deprecated. Use ``numpy.mean`` or ``numpy.nanmean`` instead.
+ This function is deprecated. Use ``sage.stats.statistics.mean`` instead. The
+ differences with this function are
+
+ - the code does not try to call ``v.mean()``
+ - raises an error on empty input
Stuff like this: please don't remove crossreferences to numpy.
In
+ if is_numpy_type(type(data)):
+     import numpy
+     if isinstance(data, numpy.ndarray):
+         return data.mean()
I think the result should be coerced to a Sage number type (comment:18)
Replying to mkoeppe:
In
+ if is_numpy_type(type(data)):
+     import numpy
+     if isinstance(data, numpy.ndarray):
+         return data.mean()
I think the result should be coerced to a Sage number type (comment:18)
I don't like the conversion afterwards so much. The mean of a list of integers is a floating point in numpy.
sage: data = [1,2,7,11,15,23]
sage: statistics.mean(data)
37/6
sage: import sage.stats.statistics as statistics
sage: import numpy
sage: statistics.mean(numpy.array(data))
6.166666666666667
Maybe I should just remove these numpy shortcuts and simply document how to use the proper numpy methods in the documentation only?
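The same exact-versus-float split can be seen in plain Python, as a rough analogy only: Fraction stands in for Sage rationals and statistics.fmean for a float-returning numpy method. The data below is made up for illustration and is not the ticket's data.

```python
import statistics
from fractions import Fraction

# Illustrative data, not taken from the ticket.
values = [1, 2, 4]

# Exact inputs give an exact mean.
exact = statistics.mean([Fraction(x) for x in values])

# fmean always converts to float, much like a numpy mean would.
approx = statistics.fmean(values)

print(type(exact).__name__, exact)
print(type(approx).__name__, approx)
```

Converting the float back to an exact type afterwards would not recover the exact value, which may be why a post-hoc conversion feels unsatisfying.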
Reviewers:  → David Roe 

Status:  needs_review → needs_work 
There are pyflakes errors. Once those are fixed, I'm happy to give this a positive review.
comment:33 Changed 11 months ago by
Replying to roed:
There are pyflakes errors. Once those are fixed, I'm happy to give this a positive review.
Please don't. comment:30 has to be sorted out first.
Also, I would like to mitigate the warnings in basic_stats. Currently, the only deprecated behaviours of mean(v) are when either
- it calls the underlying v.mean() method of the object
- or when the input data v is empty.
In all other cases, we can suppress the deprecation. And similarly for all other functions in basic_stats.
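A rough sketch of what such selective warnings could look like, in plain Python. Everything here is hypothetical (legacy_mean, HasMean, count_warnings, and the NaN return value on empty input); it uses the standard warnings module rather than Sage's deprecation helper and is not the code from the ticket.

```python
import warnings


def legacy_mean(v):
    """Hypothetical stand-in for the old mean: warn only on the two legacy paths."""
    if hasattr(v, "mean"):
        # Legacy path 1: delegate to the object's own mean() method.
        warnings.warn("delegating to v.mean() is deprecated",
                      DeprecationWarning, stacklevel=2)
        return v.mean()
    if len(v) == 0:
        # Legacy path 2: empty input (float NaN assumed here for illustration).
        warnings.warn("mean of empty data is deprecated",
                      DeprecationWarning, stacklevel=2)
        return float("nan")
    # Ordinary input: behaviour is unchanged, so no warning is emitted.
    return sum(v) / len(v)


class HasMean:
    """Toy object exposing its own mean() method."""

    def mean(self):
        return 42


def count_warnings(v):
    """Return how many warnings legacy_mean(v) emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        legacy_mean(v)
    return len(caught)


print(count_warnings([1, 2, 3]), count_warnings([]), count_warnings(HasMean()))
```

With this shape, ordinary calls stay silent and only the two behaviours that actually change trigger a deprecation message.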
