Opened 8 years ago
Closed 8 years ago
#16374 closed enhancement (fixed)
better two_squares, three_squares, four_squares for small input
Reported by:  Vincent Delecroix  Owned by:  

Priority:  major  Milestone:  sage6.3 
Component:  number theory  Keywords:  
Cc:  Nathann Cohen, Jeroen Demeyer, Leif Leonhardy  Merged in:  
Authors:  Vincent Delecroix  Reviewers:  Jeroen Demeyer, Leif Leonhardy, Nathann Cohen 
Report Upstream:  N/A  Work issues:  
Branch:  51e36d8 (Commits, GitHub, GitLab)  Commit:  51e36d88f4f171ee241b0eb5e467b513ddfa312e 
Dependencies:  Stopgaps: 
Description
We implement a much more efficient version of square decomposition for small values.
Change History (47)
comment:1 Changed 8 years ago by
Branch:  → u/vdelecroix/16374 

Commit:  → 7cf5fff24e77135cc00032a7347d4e5a6402324b 
comment:2 Changed 8 years ago by
Commit:  7cf5fff24e77135cc00032a7347d4e5a6402324b → f1354e68519feecd5de198c10a532b89ee938a7b 

Branch pushed to git repo; I updated commit sha1. New commits:
f1354e6  trac #16374: two small optimization

comment:3 followup: 5 Changed 8 years ago by
Hello !
I added a commit in u/ncohen/16374 which contains some tests and comments.
I have two questions to ask though :
 Why
ii <= n/2
and notii<=jj
?
 Why a "while" instead of a "if" in the following code ?
j = (<unsigned int> sqrt(<double> n)) + 1 # (rounding is toward zero) while j*j > n: j = 1
Besides, there is a way to avoid many multiplications, and perhaps it can improve the code a bit (no idea, profiling is the only way :P
)
sage: i = 5 sage: ii = i**2 sage: i**2 == ii True sage: ii+=2*i+1 sage: i+=1 sage: i**2 == ii True
Nathann
comment:4 Changed 8 years ago by
Status:  new → needs_info 

comment:5 Changed 8 years ago by
Hi Nathann,
Thanks for reading it.
I added a commit in u/ncohen/16374 which contains some tests and comments.
Great. I also have something important to modify in three_squares_pyx
and four_squares_pyx
. It is much faster to first remove a huge square and then try to decompose the rest into a sum of less squares. The result is not the smallest for lexicographical order but the timings are x4 better.
It seems that you like C function declaration of the form int f(int res[3])
. It makes the specifications clearer but at compilation the results are strictly equivalent.
I have two questions to ask though :
 Why
ii <= n/2
and notii<=jj
?
Right.
 Why a "while" instead of a "if" in the following code ?
j = (<unsigned int> sqrt(<double> n)) + 1 # (rounding is toward zero) while j*j > n: j = 1
I am not sure of the specification (if any) of sqrt
which deals with double... I will try to improve it.
Besides, there is a way to avoid many multiplications, and perhaps it can improve the code a bit (no idea, profiling is the only way
:P
)
I thought about it. I am not sure that ii += 2*i + 1
is cheaper than ii = i*i
? Will have a look.
Vincent
comment:6 Changed 8 years ago by
Commit:  f1354e68519feecd5de198c10a532b89ee938a7b → a2b31ca5ec86eaed275a854e41cb5fd7a3ce5c95 

Branch pushed to git repo; I updated commit sha1. New commits:
a2b31ca  trac #16374: reviewer comment + further optimization

comment:7 followups: 10 17 Changed 8 years ago by
Status:  needs_info → needs_review 

Hi,
I confirm that there is no difference between "ii = i*i" and "ii = 2*i + 1". Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "O3". In the new commit there are the modifications relative to your comment and mine.
New timings
sage: from sage.rings.arith_pyx import four_squares_pyx sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)") 25 loops, best of 3: 11.3 ms per loop
About the dependencies, I am not sure whether this ticket should go before or after #16308...
Vincent
comment:8 Changed 8 years ago by
Dependencies:  #16308 

comment:9 Changed 8 years ago by
Instead of adding a new module arith_pyx
, either
 Rename this module such that it explicitly refers to sums of squares.
 Use
fast_arith.pyx
instead.
comment:10 Changed 8 years ago by
Replying to vdelecroix:
Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "O3".
Sure, write the most readable code and let the compiler do its job.
About the dependencies, I am not sure whether this ticket should go before or after #16308...
Even better: make these two tickets independent and then make a third ticket to integrate them.
comment:11 followup: 12 Changed 8 years ago by
Status:  needs_review → needs_work 

I would recommend to use unsigned long
instead of instead int
. If could imagine that the sum of 4 squares function is useful for numbers larger than 2^32
.
Also a typo: plateform
> platform
comment:12 Changed 8 years ago by
Replying to jdemeyer:
I would recommend to use
unsigned long
instead ofinstead int
. If could imagine that the sum of 4 squares function is useful for numbers larger than2^32
.
Of course. But for large entries, we would better use the *_squares
from #16308. I will do some timings, but if we loose something on small integers I will not change it.
Also a typo:
plateform
>platform
Thank you. I will modify it.
comment:13 followup: 14 Changed 8 years ago by
What's the point of <unsigned int> sqrt(<double> n + .5)
? Are you worried that the libm sqrt
function does not round correctly? I believe that IEEE754 requires this.
comment:14 followup: 15 Changed 8 years ago by
Replying to jdemeyer:
What's the point of
<unsigned int> sqrt(<double> n + .5)
? Are you worried that the libmsqrt
function does not round correctly? I believe that IEEE754 requires this.
I believe it is in IEEE754, but I am not sure that all compilers follow IEEE754. You think it is safe to remove this .5
?
comment:15 followup: 16 Changed 8 years ago by
Replying to vdelecroix:
I believe it is in IEEE754, but I am not sure that all compilers follow IEEE754. You think it is safe to remove this
.5
?
It is certainly not a compiler issue, but a libm
/hardware issue. I would expect IEEE754 and would remove the + 0.5
.
comment:16 Changed 8 years ago by
Replying to jdemeyer:
Replying to vdelecroix:
I believe it is in IEEE754, but I am not sure that all compilers follow IEEE754. You think it is safe to remove this
.5
?It is certainly not a compiler issue, but a
libm
/hardware issue. I would expect IEEE754 and would remove the+ 0.5
.
Yes. Integers up to 2^{53}1 can be represented exactly in a double
, at least on the plat[e]forms we support.
I like the type instead int
btw. (Google would have asked: "Did you mean instant int
?", or probably "insteady int
"?)
Regarding that, I'd write versions for / using integers of known, platformindependent width, that is, use uint8_t
, uint16_t
, uint32_t
etc., with a wrapper function that delegates to the appropriate function, or use ifthenelifelse such that the compiler is aware of the exact range of numbers. (Separate functions could more easily be checked for overflow conditions as well.)
comment:17 Changed 8 years ago by
Replying to vdelecroix:
I confirm that there is no difference between "ii = i*i" and "ii = 2*i + 1".
There of course is; just look at what the (C) compiler produces from it. (I don't know what Cython itself does; it may optimize it, but may also further obfuscate the code.)
On older machines, presumably any 32bit processor [we support], multiplication (or squaring) is more expensive than addition, increment, and/or a shift. (And its cost depends on the number of 1bits in the multiplicands.)
Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "O3".
Nope, you have to be more explicit. It of course does some strength reduction, e.g. replacing multiplication/division by powers of two by shifts, additions or nice addressing modes etc., but you cannot expect it to run a theorem prover... ;)
comment:18 Changed 8 years ago by
Commit:  a2b31ca5ec86eaed275a854e41cb5fd7a3ce5c95 → 352e255834a35c951f8bf22a6161df8cef56e51a 

comment:19 Changed 8 years ago by
Status:  needs_work → needs_review 

Hi,
The file is now sum_of_squares.pyx
.
I changed the unsigned int
to unsigned long
as Jeroen suggested. The loss in terms of timing seems to be around 1% (see below). Nevertheless, the function makes no sense for very large input as it is much faster to called the Python function from arith
(see also below).
For the rounding issue with sqrt
, it uses now <unsigned long> sqrt(<double> n)
and there is an exhaustive test in the doc of the function two_squares_pyx
.
unsigned int
sage: from sage.rings.sum_of_squares import four_squares_pyx sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)") 25 loops, best of 3: 11 ms per loop
unsigned long
sage: from sage.rings.sum_of_squares import four_squares_pyx sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)") 25 loops, best of 3: 11.3 ms per loop
Very large input
sage: from sage.rings.sum_of_squares import two_squares_pyx sage: %timeit two_squares_pyx(2**51+21) 10 loops, best of 3: 88.1 ms per loop sage: %timeit two_squares(2**51+21) 1000 loops, best of 3: 1.04 ms per loop
Vincent
comment:20 followup: 22 Changed 8 years ago by
I wouldn't call 2^{51}+21 "very large", at least not if the functions now take unsigned long
s. (Ok, on most of current machines that exceeds an unsigned int
.)
While we're at it, if n
is unsigned long
, you should take (unsigned long)sqrtl((long double)n)
, otherwise you'd probably have to add 1 to the result in case n
is a perfect square or slightly above. (On 32bit machines, where unsigned long
s are only 32bit as well, that doesn't matter.)
comment:21 Changed 8 years ago by
Commit:  352e255834a35c951f8bf22a6161df8cef56e51a → 17883e34f96f7f78f35f56bad527dcbaa728169d 

Branch pushed to git repo; I updated commit sha1. New commits:
17883e3  trac #16374: replace sqrt by sqrtl

comment:22 Changed 8 years ago by
Replying to leif:
I wouldn't call 2^{51}+21 "very large", at least not if the functions now take
unsigned long
s. (Ok, on most of current machines that exceeds anunsigned int
.)
Very large meant "the function is no more efficient with such input".
While we're at it, if
n
isunsigned long
, you should take(unsigned long)sqrtl((long double)n)
, otherwise you'd probably have to add 1 to the result in casen
is a perfect square or slightly above. (On 32bit machines, whereunsigned long
s are only 32bit as well, that doesn't matter.)
Done.
Vincent
comment:23 Changed 8 years ago by
Well... Theeeeeeeeeeeeeeeeeeeeeeen if there is no other comment, can this ticket be flagged as positive_review
? I agreed with what I read but I know can't follow you on the "hardware/programming standard" ground :)
Nathann
comment:24 followup: 25 Changed 8 years ago by
Just to mention: if there is an overflow, the error is very explicit
sage: from sage.rings.sum_of_squares import two_squares_pyx sage: two_squares_pyx(2**65+19) Traceback (most recent call last): ... OverflowError: long int too large to convert
comment:25 followup: 27 Changed 8 years ago by
Replying to vdelecroix:
Just to mention: if there is an overflow, the error is very explicit
sage: from sage.rings.sum_of_squares import two_squares_pyx sage: two_squares_pyx(2**65+19) Traceback (most recent call last): ... OverflowError: long int too large to convert
I rather meant overflows in the C/Cython code, where just modulo arithmetic happens, without throwing exceptions. (With using unsigned long
s you're probably on the safe[r] side, but that's something I(?)'ll have to check more carefully...)
comment:26 followup: 28 Changed 8 years ago by
P.S.: On modern (64bit) AMD CPUs, not using squaring is significantly faster, while on Intel CPUs there's apparently not much of a difference, with I think a slight advantage of doing squaring explicitly (i.e., integer multiplication), but it's hard to get reliable timings there.
(And GCC definitely doesn't do the transformation either way.)
comment:27 Changed 8 years ago by
Replying to leif:
Replying to vdelecroix:
Just to mention: if there is an overflow, the error is very explicit
sage: from sage.rings.sum_of_squares import two_squares_pyx sage: two_squares_pyx(2**65+19) Traceback (most recent call last): ... OverflowError: long int too large to convertI rather meant overflows in the C/Cython code, where just modulo arithmetic happens, without throwing exceptions. (With using
unsigned long
s you're probably on the safe[r] side, but that's something I(?)'ll have to check more carefully...)
There are only two places were overflow may occur:
 the input
 when we rely on long double to perform the square root (i.e. are we sure that an
unsigned long
fits into along double
?)
All integers that appear inside the function are smaller than the input. So, no trouble here.
Vincent
comment:28 Changed 8 years ago by
Replying to leif:
P.S.: On modern (64bit) AMD CPUs, not using squaring is significantly faster, while on Intel CPUs there's apparently not much of a difference, with I think a slight advantage of doing squaring explicitly (i.e., integer multiplication), but it's hard to get reliable timings there.
(And GCC definitely doesn't do the transformation either way.)
I would prefer to keep the multiplication as it is much more readable.
comment:29 followup: 32 Changed 8 years ago by
I think the long double
is overkill and would revert to using sqrt()
.
 It decreases portability (
sqrtl()
is C99 so at least you need to compile in C99 mode and not all systems have this function).
 You will never want to call
two_squares
in this range (2^53
) anyway.
 For
three_squares
andfour_squares
, there are many solutions, so it doesn't matter if the square root is off by 1.
comment:30 Changed 8 years ago by
Status:  needs_review → needs_work 

comment:31 Changed 8 years ago by
Also, these functions should be interruptible: http://www.sagemath.org/doc/developer/coding_in_cython.html#interruptandsignalhandling
comment:32 followup: 33 Changed 8 years ago by
Replying to jdemeyer:
I think the
long double
is overkill and would revert to usingsqrt()
.
 It decreases portability (
sqrtl()
is C99 so at least you need to compile in C99 mode and not all systems have this function).
:) Yes, C99 is only 15 years old. (Note that a lot of code in Sage requires C99, and any recent GCC defaults to std=gnu99
.)
While there are (very few) platforms where the math library at least used to lack some long double functions, I really wouldn't care here.
By the way, IMHO i
and j
should be unsigned
rather than unsigned long
. (And as mentioned, I'd rather use uint32_t
and uint64_t
, or even better, uint_fast32_t
instead of the former.)
Note that in
j = <unsigned> sqrt[l](<[long] double> n) j += 1  j%2
j
still cannot overflow, since sqrt[l]()
returns UINT_MAX
(which is odd) for n close to ULONG_MAX
.
However, I'd limit the input to values far below ULONG_MAX
(more precisely, UINT64_MAX
) anyway.
 You will never want to call
two_squares
in this range (2^53
) anyway.
Yes, and I'd actually bail out above some cutoff.
 I have to check the details, but I think that
<unsigned long> sqrt(<double> n)
is actually sufficiently precise that it computes the exact integer square root.
? You mean (unsigned long)sqrt((double)N) * (unsigned long)sqrt((double)N) <= N
(for all N < 2^{53}, say), or (unsigned long)sqrt((double)(N*N)) == N
(for all N < 2^{26}, say)?
I'm ok with using sqrt()
if we limit the input accordingly (to at most 2^{53}1).
comment:33 Changed 8 years ago by
Replying to leif:
While there are (very few) platforms where the math library at least used to lack some long double functions, I really wouldn't care here.
Yes indeed, I was referring to the math library part of C99.
 I have to check the details, but I think that
<unsigned long> sqrt(<double> n)
is actually sufficiently precise that it computes the exact integer square root.
(unsigned long)sqrt((double)(N*N)) == N
(for all N < 2^{26}, say)?
I meant this, which is true even for much larger values of N
(as long as N
is representable in a double
). But we actually require sqrt((double) n)**2 <= n
for all inputs n
, which is not true with my formula.
comment:34 Changed 8 years ago by
Commit:  17883e34f96f7f78f35f56bad527dcbaa728169d → 39c83cae8575fa8748dc97f67ee62b8788b79b48 

Branch pushed to git repo; I updated commit sha1. New commits:
39c83ca  trac #16374: unsigned long > uint_fast32_t and signal handling

comment:35 Changed 8 years ago by
Commit:  39c83cae8575fa8748dc97f67ee62b8788b79b48 → 8ba94c4a3cb84aec3bd71f8a8620a6639da4bfa6 

Branch pushed to git repo; I updated commit sha1. New commits:
8ba94c4  trac #16374: fix a signal handling error

comment:36 Changed 8 years ago by
Status:  needs_work → needs_review 

Hi,
The file now uses uint_fast32_t
and sqrt
. There is a check on the input size of the three cython functions (since on some platform a uint_fast32_t
is actually 64 bits). But I did not know how to test it properly... so it is # not tested
.
Please review! Vincent
comment:37 followup: 39 Changed 8 years ago by
Status:  needs_review → needs_work 

Small comment: you could make the input type uint32_t
and use uint_fast32_t
for the computation. Then your exceptions will be platformindependent.
comment:38 Changed 8 years ago by
Commit:  8ba94c4a3cb84aec3bd71f8a8620a6639da4bfa6 → fdf90e6da61efa184245e5e7fb61c27f79bf0396 

comment:39 Changed 8 years ago by
Status:  needs_work → needs_review 

Replying to jdemeyer:
Small comment: you could make the input type
uint32_t
and useuint_fast32_t
for the computation. Then your exceptions will be platformindependent.
Great. Thanks for the suggestion. Done in the new version and the tests are updated. I also add a function is_sum_of_square_pyx
which is much faster than using sum_of_squares_pyx
with try/except
, especially if the answer is False
.
comment:40 Changed 8 years ago by
Hllooooooooo !
I reread this thing again and I think that it is good to go. I added a small commit in public/16374 to replace 2^32
with 2^{32}
which produced a bad output in the html doc.
If you have no objection to that you can set the ticket to positive_review
:P
Nathann
comment:41 Changed 8 years ago by
Branch:  u/vdelecroix/16374 → public/16374 

Commit:  fdf90e6da61efa184245e5e7fb61c27f79bf0396 → 1b965f8361b779b8e888bb56a249ea48ac4b6852 
Reviewers:  → Jeroen Demeyer, Leif Leonhardy, Nathann Cohen 
Status:  needs_review → positive_review 
comment:42 Changed 8 years ago by
Status:  positive_review → needs_work 

Fails on 32 bit:
sage t long src/sage/rings/sum_of_squares.pyx ********************************************************************** File "src/sage/rings/sum_of_squares.pyx", line 156, in sage.rings.sum_of_squares.two_squares_pyx Failed example: two_squares_pyx(2**32) Expected: Traceback (most recent call last): ... OverflowError: value too large to convert to uint32_t Got: <BLANKLINE> Traceback (most recent call last): File "/home/buildslavesage/slave/sage_git/build/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 480, in _run self.execute(example, compiled, test.globs) File "/home/buildslavesage/slave/sage_git/build/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 839, in execute exec compiled in globs File "<doctest sage.rings.sum_of_squares.two_squares_pyx[6]>", line 1, in <module> two_squares_pyx(Integer(2)**Integer(32)) File "sum_of_squares.pyx", line 128, in sage.rings.sum_of_squares.two_squares_pyx (sage/rings/sum_of_squares.c:2759) OverflowError: long int too large to convert ********************************************************************** File "src/sage/rings/sum_of_squares.pyx", line 197, in sage.rings.sum_of_squares.is_sum_of_two_squares_pyx Failed example: is_sum_of_two_squares_pyx(2**32) Expected: Traceback (most recent call last): ... OverflowError: value too large to convert to uint32_t Got: <BLANKLINE> Traceback (most recent call last): File "/home/buildslavesage/slave/sage_git/build/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 480, in _run self.execute(example, compiled, test.globs) File "/home/buildslavesage/slave/sage_git/build/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 839, in execute exec compiled in globs File "<doctest sage.rings.sum_of_squares.is_sum_of_two_squares_pyx[2]>", line 1, in <module> is_sum_of_two_squares_pyx(Integer(2)**Integer(32)) File "sum_of_squares.pyx", line 184, in sage.rings.sum_of_squares.is_sum_of_two_squares_pyx (sage/rings/sum_of_squares.c:2916) OverflowError: long int too large to convert ********************************************************************** File "src/sage/rings/sum_of_squares.pyx", line 244, in sage.rings.sum_of_squares.three_squares_pyx Failed example: three_squares_pyx(2**32) Expected: Traceback (most recent call last): ... OverflowError: value too large to convert to uint32_t Got: <BLANKLINE> Traceback (most recent call last): File "/home/buildslavesage/slave/sage_git/build/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 480, in _run self.execute(example, compiled, test.globs) File "/home/buildslavesage/slave/sage_git/build/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 839, in execute exec compiled in globs File "<doctest sage.rings.sum_of_squares.three_squares_pyx[10]>", line 1, in <module> three_squares_pyx(Integer(2)**Integer(32)) File "sum_of_squares.pyx", line 212, in sage.rings.sum_of_squares.three_squares_pyx (sage/rings/sum_of_squares.c:3045) OverflowError: long int too large to convert ********************************************************************** File "src/sage/rings/sum_of_squares.pyx", line 291, in sage.rings.sum_of_squares.four_squares_pyx Failed example: four_squares_pyx(2**32) Expected: Traceback (most recent call last): ... OverflowError: value too large to convert to uint32_t Got: <BLANKLINE> Traceback (most recent call last): File "/home/buildslavesage/slave/sage_git/build/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 480, in _run self.execute(example, compiled, test.globs) File "/home/buildslavesage/slave/sage_git/build/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 839, in execute exec compiled in globs File "<doctest sage.rings.sum_of_squares.four_squares_pyx[5]>", line 1, in <module> four_squares_pyx(Integer(2)**Integer(32)) File "sum_of_squares.pyx", line 266, in sage.rings.sum_of_squares.four_squares_pyx (sage/rings/sum_of_squares.c:3209) OverflowError: long int too large to convert ********************************************************************** 4 items had failures: 1 of 10 in sage.rings.sum_of_squares.four_squares_pyx 1 of 4 in sage.rings.sum_of_squares.is_sum_of_two_squares_pyx 1 of 14 in sage.rings.sum_of_squares.three_squares_pyx 1 of 11 in sage.rings.sum_of_squares.two_squares_pyx [35 tests, 4 failures, 0.83 s]
comment:43 Changed 8 years ago by
Commit:  1b965f8361b779b8e888bb56a249ea48ac4b6852 → 51e36d88f4f171ee241b0eb5e467b513ddfa312e 

Branch pushed to git repo; I updated commit sha1. New commits:
51e36d8  trac #16374: fix OverflowError test to be platform independent

comment:44 Changed 8 years ago by
Status:  needs_work → needs_review 

Sorry! I did no think that the message of the OverflowError
was platform dependent... new version that needs review.
comment:47 Changed 8 years ago by
Branch:  public/16374 → 51e36d88f4f171ee241b0eb5e467b513ddfa312e 

Resolution:  → fixed 
Status:  positive_review → closed 
Hi,
I implemented a much better version than my proposal in #16308 (linear complexity in the input for
two_squares_pyx
). It is not yet branched inarith.py
, I will do this when #16308 is merged.timings:
Best Vincent
New commits:
trac #16374: new file arith_pyx.pyx