Opened 11 months ago
Last modified 5 months ago
#33453 needs_work enhancement
statistics
Reported by:  Vincent Delecroix  Owned by:  

Priority:  major  Milestone:  sage9.8 
Component:  statistics  Keywords:  
Cc:  Matthias Köppe  Merged in:  
Authors:  Vincent Delecroix  Reviewers:  Matthias Koeppe, David Roe 
Report Upstream:  N/A  Work issues:  
Branch:  u/vdelecroix/33453 (Commits, GitHub, GitLab)  Commit:  9e82df16977691d2863f9aef065a1a4b8656b922 
Dependencies:  Stopgaps: 
Description (last modified by )
We implement a sage compatible version of the Python statistics module in stats/statistics.py
. In particular, it will provide the mean
and median
functions that were recently deprecated for removal in #29662. See also #33432.
See also https://docs.python.org/3.10/whatsnew/3.8.html#statistics
Change History (37)
comment:1 Changed 11 months ago by
Description:  modified (diff) 

comment:2 Changed 11 months ago by
Branch:  → u/vdelecroix/33453 

Commit:  → 3e13755f1e65e97460d6b7c763c3009007aebc6f 
comment:3 Changed 11 months ago by
Commit:  3e13755f1e65e97460d6b7c763c3009007aebc6f → 0abcc587c284e980920d2fc78a663e2710ed5116 

Branch pushed to git repo; I updated commit sha1. New commits:
0abcc58  33453: add statistics module to doc

comment:4 Changed 11 months ago by
Commit:  0abcc587c284e980920d2fc78a663e2710ed5116 → 0a44a4fb8c05d379fc472fcccc86bb6e099b22d0 

comment:5 Changed 11 months ago by
Cc:  Matthias Köppe added 

Status:  new → needs_review 
comment:6 Changed 11 months ago by
Thank you, this looks like a great addition which should take care of the necessary problems but continue providing what is needed. I didn't see any obvious problems in the code but it would be good to get a second more detailed eye.
comment:7 followup: 10 Changed 11 months ago by
Your new function mean
is not compatible with the builtin statistics.mean
 that specifies "If data is empty, StatisticsError
will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean
(Also, the argument is called data
, not v
.)
comment:8 followup: 21 Changed 11 months ago by
Also, please include crossreferences to the numpy
functions in the documentation so that this information is not lost.
comment:9 Changed 11 months ago by
Status:  needs_review → needs_work 

comment:10 followups: 11 12 Changed 11 months ago by
Replying to mkoeppe:
Your new function
mean
is not compatible with the builtinstatistics.mean
 that specifies "If data is empty,StatisticsError
will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean
True. I did that on purpose to follow the numpy behaviour. The current code that I wrote only emits a RuntimeWarning
.
(Also, the argument is called
data
, notv
.)
Is the argument name really relevant? Though data
is definitely much better.
comment:11 Changed 11 months ago by
Replying to vdelecroix:
Replying to mkoeppe:
Your new function
mean
is not compatible with the builtinstatistics.mean
 that specifies "If data is empty,StatisticsError
will be raised." https://docs.python.org/3/library/statistics.html#statistics.meanTrue. I did that on purpose to follow the numpy behaviour. The current code that I wrote only emits a
RuntimeWarning
.
That makes no sense  the point of the module is to be compatible not with numpy but with the builtin statistics module.
comment:12 Changed 11 months ago by
Replying to vdelecroix:
(Also, the argument is called
data
, notv
.)Is the argument name really relevant? Though
data
is definitely much better.
Yes, because the signature is mean(data)
, not mean(data, /)
, users are allowed to call it as mean(data=...)
.
comment:13 Changed 11 months ago by
Note thate Python argument naming is awful
statistics.variance(data, xbar=None)
statistics.pvariance(data, mu=None)
comment:15 Changed 11 months ago by
Replying to mkoeppe:
Compatibility > beauty
This is not about beauty but coherence. However this is a very minor point. It is perfectly fine to keep as much as the Python world as we can.
comment:16 followup: 18 Changed 11 months ago by
More importantly
 if the user provides a numpy array then the appropriate numpy method is called. This is not the Python behaviour. Should I remove it?
 If
data
only contains builtin Python data (int
,float
,Fraction
,Decimal
, ...) should the code simply transfer to the Pythonstatistics
module? Usually, sage functions tend to usepy_scalar_to_element
which does convertint > Integer
,float > RealNumber
, etc
comment:17 Changed 11 months ago by
The distinction of xbar
and mu
is deliberate, see discussion in https://bugs.python.org/issue20389
comment:18 Changed 11 months ago by
Replying to vdelecroix:
More importantly
 if the user provides a numpy array then the appropriate numpy method is called. This is not the Python behaviour. Should I remove it?
Computing it via the numpy method I think is a good idea; but the result type / error handling must be compatible with the other types.
 If
data
only contains builtin Python data (int
,float
,Fraction
,Decimal
, ...) should the code simply transfer to the Pythonstatistics
module? Usually, sage functions tend to usepy_scalar_to_element
which does convertint > Integer
,float > RealNumber
, etc
I think it would make sense to always make the result a Sage type, via py_scalar_to_element
if necessary
comment:19 Changed 11 months ago by
Commit:  0a44a4fb8c05d379fc472fcccc86bb6e099b22d0 → a60c096e5dfdb0d89e0be2a5c0f675ed11f42073 

Branch pushed to git repo; I updated commit sha1. New commits:
a60c096  33453: follow python specifications + improved doc

comment:20 Changed 11 months ago by
Status:  needs_work → needs_review 

comment:21 followup: 28 Changed 11 months ago by
Replying to mkoeppe:
Also, please include crossreferences to the
numpy
functions in the documentation so that this information is not lost.
Not sure about what you meant here.
comment:22 Changed 11 months ago by
Can't just replace sage.stats.basic_stats.mean
by lazy_import('sage.stats.statistics', 'mean', deprecation=33453)
. They have a different specification.
comment:23 Changed 11 months ago by
Status:  needs_review → needs_work 

comment:24 Changed 11 months ago by
I have a very minor observation. I think there are a few typos in sage.stats.statistics
in the updated deprecation notices:

basic_stats.sage
76 76 77 77 sage: std([1..6], bias=True) 78 78 doctest:warning... 79 DeprecationWarning: sage.stats.basic_stats.std is deprecated; use sage.stats.stat stics.stdev or sage.stats.statistics.pstdev instead79 DeprecationWarning: sage.stats.basic_stats.std is deprecated; use sage.stats.statistics.stdev or sage.stats.statistics.pstdev instead 80 80 See https://trac.sagemath.org/33453 for details. 81 81 1/2*sqrt(35/3) 82 82 sage: std([1..6], bias=False) … … 106 106 sage: std(data) # random 107 107 0.29487771726609185 108 108 """ 109 deprecation(33453, 'sage.stats.basic_stats.std is deprecated; use sage.stats.stat stics.stdev or sage.stats.statistics.pstdev instead')109 deprecation(33453, 'sage.stats.basic_stats.std is deprecated; use sage.stats.statistics.stdev or sage.stats.statistics.pstdev instead') 110 110 111 111 if hasattr(v, 'standard_deviation'): 112 112 return v.standard_deviation(bias=bias)
comment:25 Changed 11 months ago by
Commit:  a60c096e5dfdb0d89e0be2a5c0f675ed11f42073 → 002b83a90d085559fec50d32c7aefa10004aaf77 

comment:26 Changed 11 months ago by
Commit:  002b83a90d085559fec50d32c7aefa10004aaf77 → 7cf846cf49eafef72b2cdcf2e6dcb19f7e4594c8 

Branch pushed to git repo; I updated commit sha1. New commits:
7cf846c  33453: statstics > statistics

comment:27 Changed 11 months ago by
Status:  needs_work → needs_review 

comment:28 Changed 11 months ago by
Replying to vdelecroix:
Replying to mkoeppe:
Also, please include crossreferences to the
numpy
functions in the documentation so that this information is not lost.
Not sure about what you meant here.
We define the mean of the empty list to be the (symbolic) NaN, following the convention of MATLAB, Scipy, and R.  This function is deprecated. Use ``numpy.mean`` or ``numpy.nanmean``  instead. + This function is deprecated. Use ``sage.stats.statistics.mean`` instead. The + differences with this function are + +  the code does not try to call ``v.mean()`` +  raises an error on empty input
Stuff like this  please don't remove crossreferences to numpy.
comment:29 followup: 30 Changed 11 months ago by
In
+ if is_numpy_type(type(data)): + import numpy + if isinstance(data, numpy.ndarray): + return data.mean()
I think the result should be coerced to a Sage number type (comment:18)
comment:30 Changed 11 months ago by
Replying to mkoeppe:
In
+ if is_numpy_type(type(data)): + import numpy + if isinstance(data, numpy.ndarray): + return data.mean()I think the result should be coerced to a Sage number type (comment:18)
I don't like the conversion afterwards so much. The mean of a list of integers is a floating point in numpy.
sage: data = [1,2,7,11,15,23] sage: statistics.mean(data) 37/6 sage: import sage.stats.statistics as statistics sage: import numpy sage: statistics.mean(numpy.array(data)) 6.166666666666667
Maybe I should just remove these numpy shortcuts and simply document how to use the proper numpy methods in the documentation only?
comment:31 Changed 11 months ago by
Commit:  7cf846cf49eafef72b2cdcf2e6dcb19f7e4594c8 → 9e82df16977691d2863f9aef065a1a4b8656b922 

Branch pushed to git repo; I updated commit sha1. New commits:
9e82df1  33453: clean doc

comment:32 followup: 33 Changed 11 months ago by
Reviewers:  → David Roe 

Status:  needs_review → needs_work 
There are pyflakes errors. Once those are fixed, I'm happy to give this a positive review.
comment:33 Changed 11 months ago by
Replying to roed:
There are pyflakes errors. Once those are fixed, I'm happy to give this a positive review.
Please don't. comment:30 has to be sorted out first.
Also, I would like to mitigate the warnings in basic_stats
. Currently, the only deprecated behaviour of mean(v)
are when either
 it calls the underlying
v.mean()
method of the object  ot when the input data
v
is empty.
In all other cases, we can suppress the deprecation. And similarly for all other functions in basic_stats
.
comment:34 Changed 11 months ago by
Reviewers:  David Roe → Matthias Koeppe, David Roe 

comment:35 Changed 10 months ago by
Milestone:  sage9.6 → sage9.7 

comment:36 Changed 9 months ago by
Description:  modified (diff) 

comment:37 Changed 5 months ago by
Milestone:  sage9.7 → sage9.8 

New commits:
33453: a basic statistics module