Opened 14 months ago

Last modified 14 months ago

#27579 needs_review enhancement

Finding median in linear time

Reported by: gh-Hrishabh-yadav Owned by:
Priority: major Milestone: sage-duplicate/invalid/wontfix
Component: number theory Keywords:
Cc: dcoudert Merged in:
Authors: Reviewers:
Report Upstream: N/A Work issues:
Branch: u/gh-Hrishabh-yadav/median (Commits) Commit: 2530c54eea95690df0f102d527a5719bbb5a0d93
Dependencies: Stopgaps:

Description

This ticket introduces a new technique for finding median of graph : https://en.wikipedia.org/wiki/Median_of_medians This algorithm runs at complexity of O(n) in worst case and will be a good addition to sage stats module.

Change History (14)

comment:1 Changed 14 months ago by gh-Hrishabh-yadav

  • Branch set to u/gh-Hrishabh-yadav/median

comment:2 Changed 14 months ago by gh-Hrishabh-yadav

  • Commit set to d82ca32486058bd1245ac2dfba812976a57c51ab

Implementation of median_of_medians code is done. Algorithm does run fast as compared to the one in function median(), but still slow compared to it's expectation. Results:

sage: A = [randint(0,i*100) for i in range (0,5000000)]
sage: median_of_medians(A,len(A)//2)
sage: median(A)
Time for median_of_medians : 2.66196990013
Time for median : 3.39063596725

New commits:

d82ca32Added median_of_medians code

comment:3 Changed 14 months ago by gh-Hrishabh-yadav

  • Status changed from new to needs_review

comment:4 Changed 14 months ago by dcoudert

Please follow the coding style of Sagemath: http://doc.sagemath.org/html/en/developer/coding_basics.html and https://www.python.org/dev/peps/pep-0008 . For instance,

  • def median_of_medians( A, i ): -> def median_of_medians(A, i ):
  • bullets in INPUT block must be aligned below INPUT
  • in these bullets, use ``A`` and not `A`
  • if(len(A)<=i): -> if len(A) <= i:
  • etc., etc., etc.

comment:5 Changed 14 months ago by git

  • Commit changed from d82ca32486058bd1245ac2dfba812976a57c51ab to 29676011291f86fb12a8da9ac0785e75faafb1d6

Branch pushed to git repo; I updated commit sha1. New commits:

2967601Modified coding style

comment:6 Changed 14 months ago by git

  • Commit changed from 29676011291f86fb12a8da9ac0785e75faafb1d6 to 80ffb1c75397b40fb64fc54bb69af91230b9254e

Branch pushed to git repo; I updated commit sha1. New commits:

80ffb1cRemoved Brackets

comment:7 Changed 14 months ago by gh-Hrishabh-yadav

I have modified the code.

comment:8 follow-up: Changed 14 months ago by dcoudert

The method says that it returns the i-th smallest integer in the array, but in the examples, you return the (i+1)-th.

The examples are for small values, but you have hardcoded items_per_column = 1000. Thus, your tests are only for sorted(A)[i].

Please add a description of the algorithm. Otherwise, it's very hard to check your code.

Do you really need recursive calls ? can't you avoid that ?

and why choosing 1000 ?

comment:9 in reply to: ↑ 8 Changed 14 months ago by gh-Hrishabh-yadav

Replying to dcoudert:

The method says that it returns the i-th smallest integer in the array, but in the examples, you return the (i+1)-th.

I will change it.

The examples are for small values, but you have hardcoded items_per_column = 1000. Thus, your tests are only for sorted(A)[i].

By taking 'items_per_column =1000' I got the best result, Although I will make sure it is given by the user.

Please add a description of the algorithm. Otherwise, it's very hard to check your code.

Yeah sure.

Do you really need recursive calls ? can't you avoid that ?

Actually no, I have a way to remove recursive calls, let me code it!

and why choosing 1000 ?

As mentioned above, it gave the best results among '10','100','1000','10000'and '100000'.

Thanks!

comment:10 follow-up: Changed 14 months ago by gh-Hrishabh-yadav

Apparently numpy has implemented introselect algorithm which is a mix of median_of_median and quickselct algorithm, and works quite fast as compared to the algorithm implemented in current branch.

Introselect: https://en.wikipedia.org/wiki/Introselect

numpy library function :-https://docs.scipy.org/doc/numpy/reference/generated/numpy.partition.htm

I have made a new branch in which numpy library algorithm is called rather the one implemented above. Or should i have both codes and let user decide which one should be implemented.

Last edited 14 months ago by gh-Hrishabh-yadav (previous) (diff)

comment:11 Changed 14 months ago by git

  • Commit changed from 80ffb1c75397b40fb64fc54bb69af91230b9254e to 2530c54eea95690df0f102d527a5719bbb5a0d93

Branch pushed to git repo; I updated commit sha1. New commits:

2530c54Added numpy.partition

comment:12 in reply to: ↑ 10 Changed 14 months ago by dcoudert

I have made a new branch in which numpy library algorithm is called rather the one implemented above. Or should i have both codes and let user decide which one should be implemented.

If there is a faster and simpler method, there is no need for adding another one, and in view of the proposed method, I think there is no need for adding i_smallest. The added value to Sagemath seems rather limited here.

So I suggest to change the milestone to wontfix.

comment:13 Changed 14 months ago by gh-Hrishabh-yadav

  • Milestone changed from sage-8.8 to sage-duplicate/invalid/wontfix

comment:14 Changed 14 months ago by gh-Hrishabh-yadav

Ok done!

Note: See TracTickets for help on using tickets.