Opened 7 years ago
Closed 7 years ago
#16374 closed enhancement (fixed)
better two_squares, three_squares, four_squares for small input
Reported by: | vdelecroix | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | sage-6.3 |
Component: | number theory | Keywords: | |
Cc: | ncohen, jdemeyer, leif | Merged in: | |
Authors: | Vincent Delecroix | Reviewers: | Jeroen Demeyer, Leif Leonhardy, Nathann Cohen |
Report Upstream: | N/A | Work issues: | |
Branch: | 51e36d8 (Commits) | Commit: | 51e36d88f4f171ee241b0eb5e467b513ddfa312e |
Dependencies: | Stopgaps: |
Description
We implement a much more efficient version of square decomposition for small values.
Change History (47)
comment:1 Changed 7 years ago by
- Branch set to u/vdelecroix/16374
- Commit set to 7cf5fff24e77135cc00032a7347d4e5a6402324b
comment:2 Changed 7 years ago by
- Commit changed from 7cf5fff24e77135cc00032a7347d4e5a6402324b to f1354e68519feecd5de198c10a532b89ee938a7b
Branch pushed to git repo; I updated commit sha1. New commits:
f1354e6 | trac #16374: two small optimization
|
comment:3 follow-up: ↓ 5 Changed 7 years ago by
Hello !
I added a commit in u/ncohen/16374 which contains some tests and comments.
I have two questions to ask though :
- Why
ii <= n/2
and notii<=jj
?
- Why a "while" instead of a "if" in the following code ?
j = (<unsigned int> sqrt(<double> n)) + 1 # (rounding is toward zero) while j*j > n: j -= 1
Besides, there is a way to avoid many multiplications, and perhaps it can improve the code a bit (no idea, profiling is the only way :-P
)
sage: i = 5 sage: ii = i**2 sage: i**2 == ii True sage: ii+=2*i+1 sage: i+=1 sage: i**2 == ii True
Nathann
comment:4 Changed 7 years ago by
- Status changed from new to needs_info
comment:5 in reply to: ↑ 3 Changed 7 years ago by
Hi Nathann,
Thanks for reading it.
I added a commit in u/ncohen/16374 which contains some tests and comments.
Great. I also have something important to modify in three_squares_pyx
and four_squares_pyx
. It is much faster to first remove a huge square and then try to decompose the rest into a sum of less squares. The result is not the smallest for lexicographical order but the timings are x4 better.
It seems that you like C function declaration of the form int f(int res[3])
. It makes the specifications clearer but at compilation the results are strictly equivalent.
I have two questions to ask though :
- Why
ii <= n/2
and notii<=jj
?
Right.
- Why a "while" instead of a "if" in the following code ?
j = (<unsigned int> sqrt(<double> n)) + 1 # (rounding is toward zero) while j*j > n: j -= 1
I am not sure of the specification (if any) of sqrt
which deals with double... I will try to improve it.
Besides, there is a way to avoid many multiplications, and perhaps it can improve the code a bit (no idea, profiling is the only way
:-P
)
I thought about it. I am not sure that ii += 2*i + 1
is cheaper than ii = i*i
? Will have a look.
Vincent
comment:6 Changed 7 years ago by
- Commit changed from f1354e68519feecd5de198c10a532b89ee938a7b to a2b31ca5ec86eaed275a854e41cb5fd7a3ce5c95
Branch pushed to git repo; I updated commit sha1. New commits:
a2b31ca | trac #16374: reviewer comment + further optimization
|
comment:7 follow-ups: ↓ 10 ↓ 17 Changed 7 years ago by
- Status changed from needs_info to needs_review
Hi,
I confirm that there is no difference between "ii = i*i" and "ii -= 2*i + 1". Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "-O3". In the new commit there are the modifications relative to your comment and mine.
New timings
sage: from sage.rings.arith_pyx import four_squares_pyx sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)") 25 loops, best of 3: 11.3 ms per loop
About the dependencies, I am not sure whether this ticket should go before or after #16308...
Vincent
comment:8 Changed 7 years ago by
- Dependencies #16308 deleted
comment:9 Changed 7 years ago by
Instead of adding a new module arith_pyx
, either
- Rename this module such that it explicitly refers to sums of squares.
- Use
fast_arith.pyx
instead.
comment:10 in reply to: ↑ 7 Changed 7 years ago by
Replying to vdelecroix:
Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "-O3".
Sure, write the most readable code and let the compiler do its job.
About the dependencies, I am not sure whether this ticket should go before or after #16308...
Even better: make these two tickets independent and then make a third ticket to integrate them.
comment:11 follow-up: ↓ 12 Changed 7 years ago by
- Status changed from needs_review to needs_work
I would recommend to use unsigned long
instead of instead int
. If could imagine that the sum of 4 squares function is useful for numbers larger than 2^32
.
Also a typo: plateform
-> platform
comment:12 in reply to: ↑ 11 Changed 7 years ago by
Replying to jdemeyer:
I would recommend to use
unsigned long
instead ofinstead int
. If could imagine that the sum of 4 squares function is useful for numbers larger than2^32
.
Of course. But for large entries, we would better use the *_squares
from #16308. I will do some timings, but if we loose something on small integers I will not change it.
Also a typo:
plateform
->platform
Thank you. I will modify it.
comment:13 follow-up: ↓ 14 Changed 7 years ago by
What's the point of <unsigned int> sqrt(<double> n + .5)
? Are you worried that the libm sqrt
function does not round correctly? I believe that IEEE-754 requires this.
comment:14 in reply to: ↑ 13 ; follow-up: ↓ 15 Changed 7 years ago by
Replying to jdemeyer:
What's the point of
<unsigned int> sqrt(<double> n + .5)
? Are you worried that the libmsqrt
function does not round correctly? I believe that IEEE-754 requires this.
I believe it is in IEEE-754, but I am not sure that all compilers follow IEEE-754. You think it is safe to remove this .5
?
comment:15 in reply to: ↑ 14 ; follow-up: ↓ 16 Changed 7 years ago by
Replying to vdelecroix:
I believe it is in IEEE-754, but I am not sure that all compilers follow IEEE-754. You think it is safe to remove this
.5
?
It is certainly not a compiler issue, but a libm
/hardware issue. I would expect IEEE-754 and would remove the + 0.5
.
comment:16 in reply to: ↑ 15 Changed 7 years ago by
Replying to jdemeyer:
Replying to vdelecroix:
I believe it is in IEEE-754, but I am not sure that all compilers follow IEEE-754. You think it is safe to remove this
.5
?It is certainly not a compiler issue, but a
libm
/hardware issue. I would expect IEEE-754 and would remove the+ 0.5
.
Yes. Integers up to 253-1 can be represented exactly in a double
, at least on the plat[e]forms we support.
I like the type instead int
btw. (Google would have asked: "Did you mean instant int
?", or probably "insteady int
"?)
Regarding that, I'd write versions for / using integers of known, platform-independent width, that is, use uint8_t
, uint16_t
, uint32_t
etc., with a wrapper function that delegates to the appropriate function, or use if-then-elif-else such that the compiler is aware of the exact range of numbers. (Separate functions could more easily be checked for overflow conditions as well.)
comment:17 in reply to: ↑ 7 Changed 7 years ago by
Replying to vdelecroix:
I confirm that there is no difference between "ii = i*i" and "ii -= 2*i + 1".
There of course is; just look at what the (C) compiler produces from it. (I don't know what Cython itself does; it may optimize it, but may also further obfuscate the code.)
On older machines, presumably any 32-bit processor [we support], multiplication (or squaring) is more expensive than addition, increment, and/or a shift. (And its cost depends on the number of 1-bits in the multiplicands.)
Anyway, I am pretty sure that this is the kind of optimization that gcc takes care of with the option "-O3".
Nope, you have to be more explicit. It of course does some strength reduction, e.g. replacing multiplication/division by powers of two by shifts, additions or nice addressing modes etc., but you cannot expect it to run a theorem prover... ;-)
comment:18 Changed 7 years ago by
- Commit changed from a2b31ca5ec86eaed275a854e41cb5fd7a3ce5c95 to 352e255834a35c951f8bf22a6161df8cef56e51a
comment:19 Changed 7 years ago by
- Status changed from needs_work to needs_review
Hi,
The file is now sum_of_squares.pyx
.
I changed the unsigned int
to unsigned long
as Jeroen suggested. The loss in terms of timing seems to be around 1% (see below). Nevertheless, the function makes no sense for very large input as it is much faster to called the Python function from arith
(see also below).
For the rounding issue with sqrt
, it uses now <unsigned long> sqrt(<double> n)
and there is an exhaustive test in the doc of the function two_squares_pyx
.
unsigned int
sage: from sage.rings.sum_of_squares import four_squares_pyx sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)") 25 loops, best of 3: 11 ms per loop
unsigned long
sage: from sage.rings.sum_of_squares import four_squares_pyx sage: timeit("for n in xrange(10000,50000): x = four_squares_pyx(n)") 25 loops, best of 3: 11.3 ms per loop
Very large input
sage: from sage.rings.sum_of_squares import two_squares_pyx sage: %timeit two_squares_pyx(2**51+21) 10 loops, best of 3: 88.1 ms per loop sage: %timeit two_squares(2**51+21) 1000 loops, best of 3: 1.04 ms per loop
Vincent
comment:20 follow-up: ↓ 22 Changed 7 years ago by
I wouldn't call 251+21 "very large", at least not if the functions now take unsigned long
s. (Ok, on most of current machines that exceeds an unsigned int
.)
While we're at it, if n
is unsigned long
, you should take (unsigned long)sqrtl((long double)n)
, otherwise you'd probably have to add 1 to the result in case n
is a perfect square or slightly above. (On 32-bit machines, where unsigned long
s are only 32-bit as well, that doesn't matter.)
comment:21 Changed 7 years ago by
- Commit changed from 352e255834a35c951f8bf22a6161df8cef56e51a to 17883e34f96f7f78f35f56bad527dcbaa728169d
Branch pushed to git repo; I updated commit sha1. New commits:
17883e3 | trac #16374: replace sqrt by sqrtl
|
comment:22 in reply to: ↑ 20 Changed 7 years ago by
Replying to leif:
I wouldn't call 251+21 "very large", at least not if the functions now take
unsigned long
s. (Ok, on most of current machines that exceeds anunsigned int
.)
Very large meant "the function is no more efficient with such input".
While we're at it, if
n
isunsigned long
, you should take(unsigned long)sqrtl((long double)n)
, otherwise you'd probably have to add 1 to the result in casen
is a perfect square or slightly above. (On 32-bit machines, whereunsigned long
s are only 32-bit as well, that doesn't matter.)
Done.
Vincent
comment:23 Changed 7 years ago by
Well... Theeeeeeeeeeeeeeeeeeeeeeen if there is no other comment, can this ticket be flagged as positive_review
? I agreed with what I read but I know can't follow you on the "hardware/programming standard" ground :-)
Nathann
comment:24 follow-up: ↓ 25 Changed 7 years ago by
Just to mention: if there is an overflow, the error is very explicit
sage: from sage.rings.sum_of_squares import two_squares_pyx sage: two_squares_pyx(2**65+19) Traceback (most recent call last): ... OverflowError: long int too large to convert
comment:25 in reply to: ↑ 24 ; follow-up: ↓ 27 Changed 7 years ago by
Replying to vdelecroix:
Just to mention: if there is an overflow, the error is very explicit
sage: from sage.rings.sum_of_squares import two_squares_pyx sage: two_squares_pyx(2**65+19) Traceback (most recent call last): ... OverflowError: long int too large to convert
I rather meant overflows in the C/Cython code, where just modulo arithmetic happens, without throwing exceptions. (With using unsigned long
s you're probably on the safe[r] side, but that's something I(?)'ll have to check more carefully...)
comment:26 follow-up: ↓ 28 Changed 7 years ago by
P.S.: On modern (64-bit) AMD CPUs, not using squaring is significantly faster, while on Intel CPUs there's apparently not much of a difference, with I think a slight advantage of doing squaring explicitly (i.e., integer multiplication), but it's hard to get reliable timings there.
(And GCC definitely doesn't do the transformation either way.)
comment:27 in reply to: ↑ 25 Changed 7 years ago by
Replying to leif:
Replying to vdelecroix:
Just to mention: if there is an overflow, the error is very explicit
sage: from sage.rings.sum_of_squares import two_squares_pyx sage: two_squares_pyx(2**65+19) Traceback (most recent call last): ... OverflowError: long int too large to convertI rather meant overflows in the C/Cython code, where just modulo arithmetic happens, without throwing exceptions. (With using
unsigned long
s you're probably on the safe[r] side, but that's something I(?)'ll have to check more carefully...)
There are only two places were overflow may occur:
- the input
- when we rely on long double to perform the square root (i.e. are we sure that an
unsigned long
fits into along double
?)
All integers that appear inside the function are smaller than the input. So, no trouble here.
Vincent
comment:28 in reply to: ↑ 26 Changed 7 years ago by
Replying to leif:
P.S.: On modern (64-bit) AMD CPUs, not using squaring is significantly faster, while on Intel CPUs there's apparently not much of a difference, with I think a slight advantage of doing squaring explicitly (i.e., integer multiplication), but it's hard to get reliable timings there.
(And GCC definitely doesn't do the transformation either way.)
I would prefer to keep the multiplication as it is much more readable.
comment:29 follow-up: ↓ 32 Changed 7 years ago by
I think the long double
is overkill and would revert to using sqrt()
.
- It decreases portability (
sqrtl()
is C99 so at least you need to compile in C99 mode and not all systems have this function).
- You will never want to call
two_squares
in this range (2^53
) anyway.
- For
three_squares
andfour_squares
, there are many solutions, so it doesn't matter if the square root is off by 1.
comment:30 Changed 7 years ago by
- Status changed from needs_review to needs_work
comment:31 Changed 7 years ago by
Also, these functions should be interruptible: http://www.sagemath.org/doc/developer/coding_in_cython.html#interrupt-and-signal-handling
comment:32 in reply to: ↑ 29 ; follow-up: ↓ 33 Changed 7 years ago by
Replying to jdemeyer:
I think the
long double
is overkill and would revert to usingsqrt()
.
- It decreases portability (
sqrtl()
is C99 so at least you need to compile in C99 mode and not all systems have this function).
:-) Yes, C99 is only 15 years old. (Note that a lot of code in Sage requires C99, and any recent GCC defaults to -std=gnu99
.)
While there are (very few) platforms where the math library at least used to lack some long double functions, I really wouldn't care here.
By the way, IMHO i
and j
should be unsigned
rather than unsigned long
. (And as mentioned, I'd rather use uint32_t
and uint64_t
, or even better, uint_fast32_t
instead of the former.)
Note that in
j = <unsigned> sqrt[l](<[long] double> n) j += 1 - j%2
j
still cannot overflow, since sqrt[l]()
returns UINT_MAX
(which is odd) for n close to ULONG_MAX
.
However, I'd limit the input to values far below ULONG_MAX
(more precisely, UINT64_MAX
) anyway.
- You will never want to call
two_squares
in this range (2^53
) anyway.
Yes, and I'd actually bail out above some cutoff.
- I have to check the details, but I think that
<unsigned long> sqrt(<double> n)
is actually sufficiently precise that it computes the exact integer square root.
? You mean (unsigned long)sqrt((double)N) * (unsigned long)sqrt((double)N) <= N
(for all N < 253, say), or (unsigned long)sqrt((double)(N*N)) == N
(for all N < 226, say)?
I'm ok with using sqrt()
if we limit the input accordingly (to at most 253-1).
comment:33 in reply to: ↑ 32 Changed 7 years ago by
Replying to leif:
While there are (very few) platforms where the math library at least used to lack some long double functions, I really wouldn't care here.
Yes indeed, I was referring to the math library part of C99.
- I have to check the details, but I think that
<unsigned long> sqrt(<double> n)
is actually sufficiently precise that it computes the exact integer square root.
(unsigned long)sqrt((double)(N*N)) == N
(for all N < 226, say)?
I meant this, which is true even for much larger values of N
(as long as N
is representable in a double
). But we actually require sqrt((double) n)**2 <= n
for all inputs n
, which is not true with my formula.
comment:34 Changed 7 years ago by
- Commit changed from 17883e34f96f7f78f35f56bad527dcbaa728169d to 39c83cae8575fa8748dc97f67ee62b8788b79b48
Branch pushed to git repo; I updated commit sha1. New commits:
39c83ca | trac #16374: unsigned long -> uint_fast32_t and signal handling
|
comment:35 Changed 7 years ago by
- Commit changed from 39c83cae8575fa8748dc97f67ee62b8788b79b48 to 8ba94c4a3cb84aec3bd71f8a8620a6639da4bfa6
Branch pushed to git repo; I updated commit sha1. New commits:
8ba94c4 | trac #16374: fix a signal handling error
|
comment:36 Changed 7 years ago by
- Status changed from needs_work to needs_review
Hi,
The file now uses uint_fast32_t
and sqrt
. There is a check on the input size of the three cython functions (since on some platform a uint_fast32_t
is actually 64 bits). But I did not know how to test it properly... so it is # not tested
.
Please review! Vincent
comment:37 follow-up: ↓ 39 Changed 7 years ago by
- Status changed from needs_review to needs_work
Small comment: you could make the input type uint32_t
and use uint_fast32_t
for the computation. Then your exceptions will be platform-independent.
comment:38 Changed 7 years ago by
- Commit changed from 8ba94c4a3cb84aec3bd71f8a8620a6639da4bfa6 to fdf90e6da61efa184245e5e7fb61c27f79bf0396
comment:39 in reply to: ↑ 37 Changed 7 years ago by
- Status changed from needs_work to needs_review
Replying to jdemeyer:
Small comment: you could make the input type
uint32_t
and useuint_fast32_t
for the computation. Then your exceptions will be platform-independent.
Great. Thanks for the suggestion. Done in the new version and the tests are updated. I also add a function is_sum_of_square_pyx
which is much faster than using sum_of_squares_pyx
with try/except
, especially if the answer is False
.
comment:40 Changed 7 years ago by
Hllooooooooo !
I re-read this thing again and I think that it is good to go. I added a small commit in public/16374 to replace 2^32
with 2^{32}
which produced a bad output in the html doc.
If you have no objection to that you can set the ticket to positive_review
:-P
Nathann
comment:41 Changed 7 years ago by
- Branch changed from u/vdelecroix/16374 to public/16374
- Commit changed from fdf90e6da61efa184245e5e7fb61c27f79bf0396 to 1b965f8361b779b8e888bb56a249ea48ac4b6852
- Reviewers set to Jeroen Demeyer, Leif Leonhardy, Nathann Cohen
- Status changed from needs_review to positive_review
comment:42 Changed 7 years ago by
- Status changed from positive_review to needs_work
Fails on 32 bit:
sage -t --long src/sage/rings/sum_of_squares.pyx ********************************************************************** File "src/sage/rings/sum_of_squares.pyx", line 156, in sage.rings.sum_of_squares.two_squares_pyx Failed example: two_squares_pyx(2**32) Expected: Traceback (most recent call last): ... OverflowError: value too large to convert to uint32_t Got: <BLANKLINE> Traceback (most recent call last): File "/home/buildslave-sage/slave/sage_git/build/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 480, in _run self.execute(example, compiled, test.globs) File "/home/buildslave-sage/slave/sage_git/build/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 839, in execute exec compiled in globs File "<doctest sage.rings.sum_of_squares.two_squares_pyx[6]>", line 1, in <module> two_squares_pyx(Integer(2)**Integer(32)) File "sum_of_squares.pyx", line 128, in sage.rings.sum_of_squares.two_squares_pyx (sage/rings/sum_of_squares.c:2759) OverflowError: long int too large to convert ********************************************************************** File "src/sage/rings/sum_of_squares.pyx", line 197, in sage.rings.sum_of_squares.is_sum_of_two_squares_pyx Failed example: is_sum_of_two_squares_pyx(2**32) Expected: Traceback (most recent call last): ... OverflowError: value too large to convert to uint32_t Got: <BLANKLINE> Traceback (most recent call last): File "/home/buildslave-sage/slave/sage_git/build/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 480, in _run self.execute(example, compiled, test.globs) File "/home/buildslave-sage/slave/sage_git/build/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 839, in execute exec compiled in globs File "<doctest sage.rings.sum_of_squares.is_sum_of_two_squares_pyx[2]>", line 1, in <module> is_sum_of_two_squares_pyx(Integer(2)**Integer(32)) File "sum_of_squares.pyx", line 184, in sage.rings.sum_of_squares.is_sum_of_two_squares_pyx (sage/rings/sum_of_squares.c:2916) OverflowError: long int too large to convert ********************************************************************** File "src/sage/rings/sum_of_squares.pyx", line 244, in sage.rings.sum_of_squares.three_squares_pyx Failed example: three_squares_pyx(2**32) Expected: Traceback (most recent call last): ... OverflowError: value too large to convert to uint32_t Got: <BLANKLINE> Traceback (most recent call last): File "/home/buildslave-sage/slave/sage_git/build/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 480, in _run self.execute(example, compiled, test.globs) File "/home/buildslave-sage/slave/sage_git/build/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 839, in execute exec compiled in globs File "<doctest sage.rings.sum_of_squares.three_squares_pyx[10]>", line 1, in <module> three_squares_pyx(Integer(2)**Integer(32)) File "sum_of_squares.pyx", line 212, in sage.rings.sum_of_squares.three_squares_pyx (sage/rings/sum_of_squares.c:3045) OverflowError: long int too large to convert ********************************************************************** File "src/sage/rings/sum_of_squares.pyx", line 291, in sage.rings.sum_of_squares.four_squares_pyx Failed example: four_squares_pyx(2**32) Expected: Traceback (most recent call last): ... OverflowError: value too large to convert to uint32_t Got: <BLANKLINE> Traceback (most recent call last): File "/home/buildslave-sage/slave/sage_git/build/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 480, in _run self.execute(example, compiled, test.globs) File "/home/buildslave-sage/slave/sage_git/build/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 839, in execute exec compiled in globs File "<doctest sage.rings.sum_of_squares.four_squares_pyx[5]>", line 1, in <module> four_squares_pyx(Integer(2)**Integer(32)) File "sum_of_squares.pyx", line 266, in sage.rings.sum_of_squares.four_squares_pyx (sage/rings/sum_of_squares.c:3209) OverflowError: long int too large to convert ********************************************************************** 4 items had failures: 1 of 10 in sage.rings.sum_of_squares.four_squares_pyx 1 of 4 in sage.rings.sum_of_squares.is_sum_of_two_squares_pyx 1 of 14 in sage.rings.sum_of_squares.three_squares_pyx 1 of 11 in sage.rings.sum_of_squares.two_squares_pyx [35 tests, 4 failures, 0.83 s]
comment:43 Changed 7 years ago by
- Commit changed from 1b965f8361b779b8e888bb56a249ea48ac4b6852 to 51e36d88f4f171ee241b0eb5e467b513ddfa312e
Branch pushed to git repo; I updated commit sha1. New commits:
51e36d8 | trac #16374: fix OverflowError test to be platform independent
|
comment:44 Changed 7 years ago by
- Status changed from needs_work to needs_review
Sorry! I did no think that the message of the OverflowError
was platform dependent... new version that needs review.
comment:46 Changed 7 years ago by
Thanks Volker!
comment:47 Changed 7 years ago by
- Branch changed from public/16374 to 51e36d88f4f171ee241b0eb5e467b513ddfa312e
- Resolution set to fixed
- Status changed from positive_review to closed
Hi,
I implemented a much better version than my proposal in #16308 (linear complexity in the input for
two_squares_pyx
). It is not yet branched inarith.py
, I will do this when #16308 is merged.timings:
Best Vincent
New commits:
trac #16374: new file arith_pyx.pyx