Opened 11 months ago
Last modified 5 months ago
#33453 needs_work enhancement
statistics
Reported by: | Vincent Delecroix | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | sage-9.8 |
Component: | statistics | Keywords: | |
Cc: | Matthias Köppe | Merged in: | |
Authors: | Vincent Delecroix | Reviewers: | Matthias Koeppe, David Roe |
Report Upstream: | N/A | Work issues: | |
Branch: | u/vdelecroix/33453 (Commits, GitHub, GitLab) | Commit: | 9e82df16977691d2863f9aef065a1a4b8656b922 |
Dependencies: | Stopgaps: |
Description (last modified by )
We implement a sage compatible version of the Python statistics module in stats/statistics.py
. In particular, it will provide the mean
and median
functions that were recently deprecated for removal in #29662. See also #33432.
See also https://docs.python.org/3.10/whatsnew/3.8.html#statistics
Change History (37)
comment:1 Changed 11 months ago by
Description: | modified (diff) |
---|
comment:2 Changed 11 months ago by
Branch: | → u/vdelecroix/33453 |
---|---|
Commit: | → 3e13755f1e65e97460d6b7c763c3009007aebc6f |
comment:3 Changed 11 months ago by
Commit: | 3e13755f1e65e97460d6b7c763c3009007aebc6f → 0abcc587c284e980920d2fc78a663e2710ed5116 |
---|
Branch pushed to git repo; I updated commit sha1. New commits:
0abcc58 | 33453: add statistics module to doc
|
comment:4 Changed 11 months ago by
Commit: | 0abcc587c284e980920d2fc78a663e2710ed5116 → 0a44a4fb8c05d379fc472fcccc86bb6e099b22d0 |
---|
comment:5 Changed 11 months ago by
Cc: | Matthias Köppe added |
---|---|
Status: | new → needs_review |
comment:6 Changed 11 months ago by
Thank you, this looks like a great addition which should take care of the necessary problems but continue providing what is needed. I didn't see any obvious problems in the code but it would be good to get a second more detailed eye.
comment:7 follow-up: 10 Changed 11 months ago by
Your new function mean
is not compatible with the built-in statistics.mean
- that specifies "If data is empty, StatisticsError
will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean
(Also, the argument is called data
, not v
.)
comment:8 follow-up: 21 Changed 11 months ago by
Also, please include cross-references to the numpy
functions in the documentation so that this information is not lost.
comment:9 Changed 11 months ago by
Status: | needs_review → needs_work |
---|
comment:10 follow-ups: 11 12 Changed 11 months ago by
Replying to mkoeppe:
Your new function
mean
is not compatible with the built-instatistics.mean
- that specifies "If data is empty,StatisticsError
will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean
True. I did that on purpose to follow the numpy behaviour. The current code that I wrote only emits a RuntimeWarning
.
(Also, the argument is called
data
, notv
.)
Is the argument name really relevant? Though data
is definitely much better.
comment:11 Changed 11 months ago by
Replying to vdelecroix:
Replying to mkoeppe:
Your new function
mean
is not compatible with the built-instatistics.mean
- that specifies "If data is empty,StatisticsError
will be raised." https://docs.python.org/3/library/statistics.html#statistics.meanTrue. I did that on purpose to follow the numpy behaviour. The current code that I wrote only emits a
RuntimeWarning
.
That makes no sense - the point of the module is to be compatible not with numpy but with the built-in statistics module.
comment:12 Changed 11 months ago by
Replying to vdelecroix:
(Also, the argument is called
data
, notv
.)Is the argument name really relevant? Though
data
is definitely much better.
Yes, because the signature is mean(data)
, not mean(data, /)
, users are allowed to call it as mean(data=...)
.
comment:13 Changed 11 months ago by
Note thate Python argument naming is awful
statistics.variance(data, xbar=None)
statistics.pvariance(data, mu=None)
comment:15 Changed 11 months ago by
Replying to mkoeppe:
Compatibility > beauty
This is not about beauty but coherence. However this is a very minor point. It is perfectly fine to keep as much as the Python world as we can.
comment:16 follow-up: 18 Changed 11 months ago by
More importantly
- if the user provides a numpy array then the appropriate numpy method is called. This is not the Python behaviour. Should I remove it?
- If
data
only contains builtin Python data (int
,float
,Fraction
,Decimal
, ...) should the code simply transfer to the Pythonstatistics
module? Usually, sage functions tend to usepy_scalar_to_element
which does convertint -> Integer
,float -> RealNumber
, etc
comment:17 Changed 11 months ago by
The distinction of xbar
and mu
is deliberate, see discussion in https://bugs.python.org/issue20389
comment:18 Changed 11 months ago by
Replying to vdelecroix:
More importantly
- if the user provides a numpy array then the appropriate numpy method is called. This is not the Python behaviour. Should I remove it?
Computing it via the numpy method I think is a good idea; but the result type / error handling must be compatible with the other types.
- If
data
only contains builtin Python data (int
,float
,Fraction
,Decimal
, ...) should the code simply transfer to the Pythonstatistics
module? Usually, sage functions tend to usepy_scalar_to_element
which does convertint -> Integer
,float -> RealNumber
, etc
I think it would make sense to always make the result a Sage type, via py_scalar_to_element
if necessary
comment:19 Changed 11 months ago by
Commit: | 0a44a4fb8c05d379fc472fcccc86bb6e099b22d0 → a60c096e5dfdb0d89e0be2a5c0f675ed11f42073 |
---|
Branch pushed to git repo; I updated commit sha1. New commits:
a60c096 | 33453: follow python specifications + improved doc
|
comment:20 Changed 11 months ago by
Status: | needs_work → needs_review |
---|
comment:21 follow-up: 28 Changed 11 months ago by
Replying to mkoeppe:
Also, please include cross-references to the
numpy
functions in the documentation so that this information is not lost.
Not sure about what you meant here.
comment:22 Changed 11 months ago by
Can't just replace sage.stats.basic_stats.mean
by lazy_import('sage.stats.statistics', 'mean', deprecation=33453)
. They have a different specification.
comment:23 Changed 11 months ago by
Status: | needs_review → needs_work |
---|
comment:24 Changed 11 months ago by
I have a very minor observation. I think there are a few typos in sage.stats.statistics
in the updated deprecation notices:
-
basic_stats.sage
76 76 77 77 sage: std([1..6], bias=True) 78 78 doctest:warning... 79 DeprecationWarning: sage.stats.basic_stats.std is deprecated; use sage.stats.stat stics.stdev or sage.stats.statistics.pstdev instead79 DeprecationWarning: sage.stats.basic_stats.std is deprecated; use sage.stats.statistics.stdev or sage.stats.statistics.pstdev instead 80 80 See https://trac.sagemath.org/33453 for details. 81 81 1/2*sqrt(35/3) 82 82 sage: std([1..6], bias=False) … … 106 106 sage: std(data) # random 107 107 0.29487771726609185 108 108 """ 109 deprecation(33453, 'sage.stats.basic_stats.std is deprecated; use sage.stats.stat stics.stdev or sage.stats.statistics.pstdev instead')109 deprecation(33453, 'sage.stats.basic_stats.std is deprecated; use sage.stats.statistics.stdev or sage.stats.statistics.pstdev instead') 110 110 111 111 if hasattr(v, 'standard_deviation'): 112 112 return v.standard_deviation(bias=bias)
comment:25 Changed 11 months ago by
Commit: | a60c096e5dfdb0d89e0be2a5c0f675ed11f42073 → 002b83a90d085559fec50d32c7aefa10004aaf77 |
---|
comment:26 Changed 11 months ago by
Commit: | 002b83a90d085559fec50d32c7aefa10004aaf77 → 7cf846cf49eafef72b2cdcf2e6dcb19f7e4594c8 |
---|
Branch pushed to git repo; I updated commit sha1. New commits:
7cf846c | 33453: statstics -> statistics
|
comment:27 Changed 11 months ago by
Status: | needs_work → needs_review |
---|
comment:28 Changed 11 months ago by
Replying to vdelecroix:
Replying to mkoeppe:
Also, please include cross-references to the
numpy
functions in the documentation so that this information is not lost.
Not sure about what you meant here.
We define the mean of the empty list to be the (symbolic) NaN, following the convention of MATLAB, Scipy, and R. - This function is deprecated. Use ``numpy.mean`` or ``numpy.nanmean`` - instead. + This function is deprecated. Use ``sage.stats.statistics.mean`` instead. The + differences with this function are + + - the code does not try to call ``v.mean()`` + - raises an error on empty input
Stuff like this -- please don't remove cross-references to numpy.
comment:29 follow-up: 30 Changed 11 months ago by
In
+ if is_numpy_type(type(data)): + import numpy + if isinstance(data, numpy.ndarray): + return data.mean()
I think the result should be coerced to a Sage number type (comment:18)
comment:30 Changed 11 months ago by
Replying to mkoeppe:
In
+ if is_numpy_type(type(data)): + import numpy + if isinstance(data, numpy.ndarray): + return data.mean()I think the result should be coerced to a Sage number type (comment:18)
I don't like the conversion afterwards so much. The mean of a list of integers is a floating point in numpy.
sage: data = [1,2,7,-11,15,23] sage: statistics.mean(data) 37/6 sage: import sage.stats.statistics as statistics sage: import numpy sage: statistics.mean(numpy.array(data)) 6.166666666666667
Maybe I should just remove these numpy shortcuts and simply document how to use the proper numpy methods in the documentation only?
comment:31 Changed 11 months ago by
Commit: | 7cf846cf49eafef72b2cdcf2e6dcb19f7e4594c8 → 9e82df16977691d2863f9aef065a1a4b8656b922 |
---|
Branch pushed to git repo; I updated commit sha1. New commits:
9e82df1 | 33453: clean doc
|
comment:32 follow-up: 33 Changed 11 months ago by
Reviewers: | → David Roe |
---|---|
Status: | needs_review → needs_work |
There are pyflakes errors. Once those are fixed, I'm happy to give this a positive review.
comment:33 Changed 11 months ago by
Replying to roed:
There are pyflakes errors. Once those are fixed, I'm happy to give this a positive review.
Please don't. comment:30 has to be sorted out first.
Also, I would like to mitigate the warnings in basic_stats
. Currently, the only deprecated behaviour of mean(v)
are when either
- it calls the underlying
v.mean()
method of the object - ot when the input data
v
is empty.
In all other cases, we can suppress the deprecation. And similarly for all other functions in basic_stats
.
comment:34 Changed 11 months ago by
Reviewers: | David Roe → Matthias Koeppe, David Roe |
---|
comment:35 Changed 10 months ago by
Milestone: | sage-9.6 → sage-9.7 |
---|
comment:36 Changed 9 months ago by
Description: | modified (diff) |
---|
comment:37 Changed 5 months ago by
Milestone: | sage-9.7 → sage-9.8 |
---|
New commits:
33453: a basic statistics module