Opened 11 months ago

Last modified 5 months ago

#33453 needs_work enhancement

statistics

Reported by: Vincent Delecroix Owned by:
Priority: major Milestone: sage-9.8
Component: statistics Keywords:
Cc: Matthias Köppe Merged in:
Authors: Vincent Delecroix Reviewers: Matthias Koeppe, David Roe
Report Upstream: N/A Work issues:
Branch: u/vdelecroix/33453 (Commits, GitHub, GitLab) Commit: 9e82df16977691d2863f9aef065a1a4b8656b922
Dependencies: Stopgaps:

Status badges

Description (last modified by Matthias Köppe)

We implement a sage compatible version of the Python statistics module in stats/statistics.py. In particular, it will provide the mean and median functions that were recently deprecated for removal in #29662. See also #33432.

See also https://docs.python.org/3.10/whatsnew/3.8.html#statistics

Change History (37)

comment:1 Changed 11 months ago by Vincent Delecroix

Description: modified (diff)

comment:2 Changed 11 months ago by Vincent Delecroix

Branch: u/vdelecroix/33453
Commit: 3e13755f1e65e97460d6b7c763c3009007aebc6f

New commits:

3e1375533453: a basic statistics module

comment:3 Changed 11 months ago by git

Commit: 3e13755f1e65e97460d6b7c763c3009007aebc6f0abcc587c284e980920d2fc78a663e2710ed5116

Branch pushed to git repo; I updated commit sha1. New commits:

0abcc5833453: add statistics module to doc

comment:4 Changed 11 months ago by git

Commit: 0abcc587c284e980920d2fc78a663e2710ed51160a44a4fb8c05d379fc472fcccc86bb6e099b22d0

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

304d7db33453: a basic statistics module
2dc0bac33453: add statistics module to doc
0a44a4f33453: mitigate the effects of 29662

comment:5 Changed 11 months ago by Vincent Delecroix

Cc: Matthias Köppe added
Status: newneeds_review

comment:6 Changed 11 months ago by Karl-Dieter Crisman

Thank you, this looks like a great addition which should take care of the necessary problems but continue providing what is needed. I didn't see any obvious problems in the code but it would be good to get a second more detailed eye.

comment:7 Changed 11 months ago by Matthias Köppe

Your new function mean is not compatible with the built-in statistics.mean - that specifies "If data is empty, StatisticsError will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean

(Also, the argument is called data, not v.)

comment:8 Changed 11 months ago by Matthias Köppe

Also, please include cross-references to the numpy functions in the documentation so that this information is not lost.

comment:9 Changed 11 months ago by Matthias Köppe

Status: needs_reviewneeds_work

comment:10 in reply to:  7 ; Changed 11 months ago by Vincent Delecroix

Replying to mkoeppe:

Your new function mean is not compatible with the built-in statistics.mean - that specifies "If data is empty, StatisticsError will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean

True. I did that on purpose to follow the numpy behaviour. The current code that I wrote only emits a RuntimeWarning.

(Also, the argument is called data, not v.)

Is the argument name really relevant? Though data is definitely much better.

comment:11 in reply to:  10 Changed 11 months ago by Matthias Köppe

Replying to vdelecroix:

Replying to mkoeppe:

Your new function mean is not compatible with the built-in statistics.mean - that specifies "If data is empty, StatisticsError will be raised." https://docs.python.org/3/library/statistics.html#statistics.mean

True. I did that on purpose to follow the numpy behaviour. The current code that I wrote only emits a RuntimeWarning.

That makes no sense - the point of the module is to be compatible not with numpy but with the built-in statistics module.

comment:12 in reply to:  10 Changed 11 months ago by Matthias Köppe

Replying to vdelecroix:

(Also, the argument is called data, not v.)

Is the argument name really relevant? Though data is definitely much better.

Yes, because the signature is mean(data), not mean(data, /), users are allowed to call it as mean(data=...).

comment:13 Changed 11 months ago by Vincent Delecroix

Note thate Python argument naming is awful

  • statistics.variance(data, xbar=None)
  • statistics.pvariance(data, mu=None)

comment:14 Changed 11 months ago by Matthias Köppe

Compatibility > beauty

comment:15 in reply to:  14 Changed 11 months ago by Vincent Delecroix

Replying to mkoeppe:

Compatibility > beauty

This is not about beauty but coherence. However this is a very minor point. It is perfectly fine to keep as much as the Python world as we can.

comment:16 Changed 11 months ago by Vincent Delecroix

More importantly

  • if the user provides a numpy array then the appropriate numpy method is called. This is not the Python behaviour. Should I remove it?
  • If data only contains builtin Python data (int, float, Fraction, Decimal, ...) should the code simply transfer to the Python statistics module? Usually, sage functions tend to use py_scalar_to_element which does convert int -> Integer, float -> RealNumber, etc

comment:17 Changed 11 months ago by Matthias Köppe

The distinction of xbar and mu is deliberate, see discussion in https://bugs.python.org/issue20389

comment:18 in reply to:  16 Changed 11 months ago by Matthias Köppe

Replying to vdelecroix:

More importantly

  • if the user provides a numpy array then the appropriate numpy method is called. This is not the Python behaviour. Should I remove it?

Computing it via the numpy method I think is a good idea; but the result type / error handling must be compatible with the other types.

  • If data only contains builtin Python data (int, float, Fraction, Decimal, ...) should the code simply transfer to the Python statistics module? Usually, sage functions tend to use py_scalar_to_element which does convert int -> Integer, float -> RealNumber, etc

I think it would make sense to always make the result a Sage type, via py_scalar_to_element if necessary

comment:19 Changed 11 months ago by git

Commit: 0a44a4fb8c05d379fc472fcccc86bb6e099b22d0a60c096e5dfdb0d89e0be2a5c0f675ed11f42073

Branch pushed to git repo; I updated commit sha1. New commits:

a60c09633453: follow python specifications + improved doc

comment:20 Changed 11 months ago by Vincent Delecroix

Status: needs_workneeds_review

comment:21 in reply to:  8 ; Changed 11 months ago by Vincent Delecroix

Replying to mkoeppe:

Also, please include cross-references to the numpy functions in the documentation so that this information is not lost.

Not sure about what you meant here.

comment:22 Changed 11 months ago by Matthias Köppe

Can't just replace sage.stats.basic_stats.mean by lazy_import('sage.stats.statistics', 'mean', deprecation=33453). They have a different specification.

comment:23 Changed 11 months ago by Matthias Köppe

Status: needs_reviewneeds_work

comment:24 Changed 11 months ago by David Lowry-Duda

I have a very minor observation. I think there are a few typos in sage.stats.statistics in the updated deprecation notices:

  • basic_stats.sage

     
    7676
    7777        sage: std([1..6], bias=True)
    7878        doctest:warning...
    79         DeprecationWarning: sage.stats.basic_stats.std is deprecated; use sage.stats.statstics.stdev or sage.stats.statistics.pstdev instead
     79        DeprecationWarning: sage.stats.basic_stats.std is deprecated; use sage.stats.statistics.stdev or sage.stats.statistics.pstdev instead
    8080        See https://trac.sagemath.org/33453 for details.
    8181        1/2*sqrt(35/3)
    8282        sage: std([1..6], bias=False)
     
    106106        sage: std(data)  # random
    107107        0.29487771726609185
    108108    """
    109     deprecation(33453, 'sage.stats.basic_stats.std is deprecated; use sage.stats.statstics.stdev or sage.stats.statistics.pstdev instead')
     109    deprecation(33453, 'sage.stats.basic_stats.std is deprecated; use sage.stats.statistics.stdev or sage.stats.statistics.pstdev instead')
    110110
    111111    if hasattr(v, 'standard_deviation'):
    112112        return v.standard_deviation(bias=bias)

comment:25 Changed 11 months ago by git

Commit: a60c096e5dfdb0d89e0be2a5c0f675ed11f42073002b83a90d085559fec50d32c7aefa10004aaf77

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

7b8ba3c33453: mitigate the effects of 29662
9c82ac933453: follow python specifications + improved doc
002b83a33453: cleaning

comment:26 Changed 11 months ago by git

Commit: 002b83a90d085559fec50d32c7aefa10004aaf777cf846cf49eafef72b2cdcf2e6dcb19f7e4594c8

Branch pushed to git repo; I updated commit sha1. New commits:

7cf846c33453: statstics -> statistics

comment:27 Changed 11 months ago by Vincent Delecroix

Status: needs_workneeds_review

comment:28 in reply to:  21 Changed 11 months ago by Matthias Köppe

Replying to vdelecroix:

Replying to mkoeppe:

Also, please include cross-references to the numpy functions in the documentation so that this information is not lost.

Not sure about what you meant here.

     We define the mean of the empty list to be the (symbolic) NaN,
     following the convention of MATLAB, Scipy, and R.
 
-    This function is deprecated.  Use ``numpy.mean`` or ``numpy.nanmean``
-    instead.
+    This function is deprecated.  Use ``sage.stats.statistics.mean`` instead. The
+    differences with this function are
+
+    - the code does not try to call ``v.mean()``
+    - raises an error on empty input

Stuff like this -- please don't remove cross-references to numpy.

comment:29 Changed 11 months ago by Matthias Köppe

In

+    if is_numpy_type(type(data)):
+        import numpy
+        if isinstance(data, numpy.ndarray):
+            return data.mean()

I think the result should be coerced to a Sage number type (comment:18)

comment:30 in reply to:  29 Changed 11 months ago by Vincent Delecroix

Replying to mkoeppe:

In

+    if is_numpy_type(type(data)):
+        import numpy
+        if isinstance(data, numpy.ndarray):
+            return data.mean()

I think the result should be coerced to a Sage number type (comment:18)

I don't like the conversion afterwards so much. The mean of a list of integers is a floating point in numpy.

sage: data = [1,2,7,-11,15,23]
sage: statistics.mean(data)
37/6
sage: import sage.stats.statistics as statistics
sage: import numpy
sage: statistics.mean(numpy.array(data))
6.166666666666667

Maybe I should just remove these numpy shortcuts and simply document how to use the proper numpy methods in the documentation only?

comment:31 Changed 11 months ago by git

Commit: 7cf846cf49eafef72b2cdcf2e6dcb19f7e4594c89e82df16977691d2863f9aef065a1a4b8656b922

Branch pushed to git repo; I updated commit sha1. New commits:

9e82df133453: clean doc

comment:32 Changed 11 months ago by David Roe

Reviewers: David Roe
Status: needs_reviewneeds_work

There are pyflakes errors. Once those are fixed, I'm happy to give this a positive review.

comment:33 in reply to:  32 Changed 11 months ago by Vincent Delecroix

Replying to roed:

There are pyflakes errors. Once those are fixed, I'm happy to give this a positive review.

Please don't. comment:30 has to be sorted out first.

Also, I would like to mitigate the warnings in basic_stats. Currently, the only deprecated behaviour of mean(v) are when either

  • it calls the underlying v.mean() method of the object
  • ot when the input data v is empty.

In all other cases, we can suppress the deprecation. And similarly for all other functions in basic_stats.

comment:34 Changed 11 months ago by Matthias Köppe

Reviewers: David RoeMatthias Koeppe, David Roe

comment:35 Changed 10 months ago by Matthias Köppe

Milestone: sage-9.6sage-9.7

comment:36 Changed 9 months ago by Matthias Köppe

Description: modified (diff)

comment:37 Changed 5 months ago by Matthias Köppe

Milestone: sage-9.7sage-9.8
Note: See TracTickets for help on using tickets.