Opened 10 years ago
Closed 9 years ago
#10195 closed defect (duplicate)
Occasional doctest failure in libs/fplll/fplll.pyx
Reported by: | mpatel | Owned by: | mvngu |
---|---|---|---|
Priority: | major | Milestone: | sage-duplicate/invalid/wontfix |
Component: | doctest coverage | Keywords: | |
Cc: | drkirkby, malb, jdemeyer | Merged in: | |
Authors: | Reviewers: | Jeroen Demeyer | |
Report Upstream: | Fixed upstream, in a later stable release. | Work issues: | |
Branch: | Commit: | ||
Dependencies: | #11130 | Stopgaps: |
Description
Reported on sage-devel:
I ran ./sage -t -long -force_lib "devel/sage/sage/libs/fplll/fplll.pyx" 1000 times in serial [1] with a 64-bit 4.6.rc0 built on OS X 10.6 (bsd.math). All but one of the runs pass. The failure:
Run 766 of 1000
Detected SAGE64 flag
Building Sage on OS X in 64-bit mode
sage -t -long -force_lib "devel/sage/sage/libs/fplll/fplll.pyx"
**********************************************************************
File "/Users/buildbot/build/sage/bsd-2/bsd_64_full/build/sage-4.6.0pre0/devel/sage/sage/libs/fplll/fplll.pyx", line 853:
    sage: L.echelon_form() == A.echelon_form()
Expected:
    True
Got:
    False
The error also occurs with a 32-bit build on bsd.math (OS X 10.6, 5 out of 1000 runs) and on sage.math (64-bit Ubuntu 8.04.4 LTS, 6 of 1000 runs).
David Kirkby does not get any incorrect results on OpenSolaris 06/2009 after more than 15000 runs with 4.6.rc0 and more than 16000 with 4.6.1.alpha0. He had a total of 31748 passes. However, he did experience 109 doctest failures, which are likely to be the result of doctesting two copies of Sage simultaneously, as these were using the same directory for temporary files ($HOME/.sage/tmp). His errors were like this:
Run 442 of 100000
sage -t -long -force_lib "devel/sage/sage/libs/fplll/fplll.pyx"
python: can't open file '/export/home/drkirkby/.sage//tmp/fplll.py': [Errno 2] No such file or directory
[0.2 s]
and never due to False being returned instead of True. His problems are probably the result of the issues discussed on #9739. Once David tested only one copy of Sage at a time, there were 0 failures in 17230 tests of sage-4.6.rc0. (Note: since Sage currently only works well on Solaris or OpenSolaris if built as a 32-bit application, this was a 32-bit build.)
Attachments (1)
Change History (27)
comment:1 Changed 10 years ago by
- Type changed from PLEASE CHANGE to defect
comment:2 Changed 10 years ago by
- Description modified (diff)
comment:3 Changed 10 years ago by
- Description modified (diff)
comment:4 Changed 10 years ago by
- Description modified (diff)
comment:5 follow-up: ↓ 6 Changed 10 years ago by
I reckon you Linux and OS X users should upgrade to Solaris!
Dave
comment:6 in reply to: ↑ 5 ; follow-up: ↓ 7 Changed 10 years ago by
Replying to drkirkby:
I reckon you Linux and OS X users should upgrade to Solaris!
Dave
What if those failures are intended by the fplll authors, but just do not work on your machine / operating system? Or just those "file not found" instances would have been the ones failing... Nobody knows. ;-)
(At least regarding memory and CPUs, I trust most of my machines; they didn't show any bit errors, which I certainly would have noticed during at least a month of continuous computations whose results I could verify...)
Btw, one could also try something like
$ export SAGE_TEST_GLOBAL_ITER=1000
$ ./sage -tp 1 -long devel/sage/sage/libs/fplll/fplll.pyx 2>&1 | tee fplll-test.log
$ echo "`grep -c All fplll-test.log` out of 1000 runs did NOT fail."
$ echo "`grep -c Got fplll-test.log` tests out of 1000 gave an unexpected result."
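The same tally can be sketched in Python for anyone who wants to post-process the log further. This is only an illustration: it mirrors the two `grep -c` calls above by counting lines that contain "All" and "Got", and the sample log text is made up for the example, not taken from a real run.

```python
def count_matching_lines(log_text, needle):
    """Count the lines containing `needle`, like `grep -c needle`."""
    return sum(1 for line in log_text.splitlines() if needle in line)


def summarize(log_text):
    """Tally passing runs and unexpected results in a doctest log."""
    return {
        "passed": count_matching_lines(log_text, "All"),
        "unexpected": count_matching_lines(log_text, "Got"),
    }


# Hypothetical log excerpt, invented for this example.
sample_log = """\
All tests passed!
Expected:
    True
Got:
    False
All tests passed!
"""

summary = summarize(sample_log)
print("%d runs did NOT fail." % summary["passed"])
print("%d tests gave an unexpected result." % summary["unexpected"])
```

To use it on a real run, read the file written by `tee` (for example `open("fplll-test.log").read()`) and pass the text to `summarize`.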
comment:7 in reply to: ↑ 6 Changed 10 years ago by
- Description modified (diff)
Replying to leif:
Replying to drkirkby:
I reckon you Linux and OS X users should upgrade to Solaris!
Dave
What if those failures are intended by the fplll authors, but just do not work on your machine / operating system? Or just those "file not found" instances would have been the ones failing... Nobody knows. ;-)
I was rather expecting you to have something to say on this;-)
FWIW, I'm now doctesting again, but this time only testing one instance of Sage. So errors due to using the one directory for temp files should not exist. I've only managed to run the tests 3723 times, but none of them have failed in any way whatsoever.
I personally think my initial failures were due to #9739.
I think the matrix used for this test might be random, which could explain failures which occur when (for example) all elements are zero. However, that would probably not explain why I don't get any failures.
comment:8 follow-up: ↓ 11 Changed 10 years ago by
- Cc malb added
I'm cc'ing Martin Albrecht on this, as he is the author of the library interface.
I'll take a guess at what is happening here. I believe gcc uses the 387 by default on 32-bit builds and the SSE instructions on 64-bit builds. So the precision of the results will be higher on 32-bit builds, as the x87 FPU uses 80 bits internally, not 64 as with the SSE instructions.
I suspect recompiling libfplll with -mfpmath=387 on 64-bit builds would solve this. See
http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options
Dave
comment:9 Changed 10 years ago by
More data (Ubuntu 9.04 x86_64, Core2):
Failures: 29/5000
Failed runs: 468 560 588 656 691 849 909 993 1054 1133 1391 1437 1443 1761 1845 1848 2293 2294 2529 2801 2848 2938 2992 3433 3665 3862 3940 4432 4661
Ubuntu 9.04 x86 with env SAGE_TEST_GLOBAL_ITER=1000 ./sage -tp 1 -long ...: 3/1000 failed.
comment:10 follow-up: ↓ 12 Changed 10 years ago by
Dave, what happens if you run the following?
from sage.libs.fplll.fplll import gen_ajtai
runs=1000000
verbose=1
failed=[]
for i in xrange(1,runs):
    A = gen_ajtai(10, 0.7)
    L = A.LLL()
    # L.is_LLL_reduced() # True
    if not (L.echelon_form() == A.echelon_form()):
        failed += [ i ]
        print "Test #%d failed:" % i
        if verbose:
            print "A:\n", A
            print "L:\n", L
        if verbose>1:
            print "A.echelon_form()\n", A.echelon_form()
            print "L.echelon_form()\n", L.echelon_form()
print "%d out of %d tests failed:\n" % (len(failed), runs)
print failed
comment:11 in reply to: ↑ 8 Changed 10 years ago by
Replying to drkirkby:
I suspect recompiling libfplll with -mfpmath=387 on 64-bit builds would solve this.
At least recompilation of libfplll and sage/libs/fplll/* with -mfpmath=387 on the (32-bit) Pentium 4 Prescott (on which I usually compile with -mfpmath=sse) doesn't change the outcome.
comment:12 in reply to: ↑ 10 Changed 10 years ago by
Replying to leif:
Dave, what happens if you run the following?
<snip>
Well, I did a quick check - just 10,000 runs, and there were 19 failures
sage: print "%d out of %d tests failed:\n" % (len(failed), runs)
19 out of 10000 tests failed:
sage: print failed
[432, 654, 1219, 1264, 1326, 1490, 1696, 2366, 2622, 3049, 3626, 3835, 4239, 5221, 5368, 6148, 7007, 9661, 9792]
See attached file
Dave
Changed 10 years ago by
19 failures in 10,000 runs on OpenSolaris using Leif's script. The actual doctest has never failed in many thousands of runs.
comment:13 Changed 10 years ago by
- Description modified (diff)
comment:14 Changed 10 years ago by
Replying to drkirkby:
Replying to leif:
Dave, what happens if you run the following?
<snip>
Well, I did a quick check - just 10,000 runs, and there were 19 failures
Wow, that's more than I get. (I did a couple of tests on three machines, with 100,000 and 1,000,000 "runs"; the failure rate in comparison to the doctests dropped to about 0.15-0.17%.)
I've also rebuilt rc0 from scratch with SAGE87=yes on x86_64, and besides 7 new doctest errors (three of them PARI "bugs", one numerical noise), the failures remain, at approximately the same ratio with the Python program.
comment:15 Changed 10 years ago by
This may actually be a bug in echelon_form(). Manual inspection of one of the failing matrices shows that A.echelon_form() doesn't actually return a matrix in HNF, but that the row spaces of A.echelon_form() and L.echelon_form() are in fact equal.
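For readers who want to repeat this kind of inspection outside Sage, here is a minimal pure-Python sketch. It is not the code used for the manual inspection above. The helper names `is_row_hnf` and `same_row_space` are hypothetical, and the HNF convention (row echelon shape, positive pivots, entries above each pivot reduced to the range 0 up to the pivot) is an assumption about what `echelon_form()` over the integers is meant to return. The row space comparison is done over the rationals, which is weaker than comparing the integer lattices.

```python
from fractions import Fraction


def is_row_hnf(rows):
    """Check the assumed row-style HNF conditions on an integer matrix."""
    prev_col = -1
    seen_zero_row = False
    for i, row in enumerate(rows):
        nonzero = [j for j, x in enumerate(row) if x != 0]
        if not nonzero:
            seen_zero_row = True
            continue
        if seen_zero_row:
            return False  # a nonzero row below a zero row
        col = nonzero[0]
        if col <= prev_col or row[col] <= 0:
            return False  # pivots must move right and be positive
        # Entries above a pivot must lie in [0, pivot).
        if any(not (0 <= rows[k][col] < row[col]) for k in range(i)):
            return False
        prev_col = col
    return True


def rank(rows):
    """Rank over Q, using exact Gaussian elimination with Fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    ncols = len(m[0]) if m else 0
    r = 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            if f != 0:
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def same_row_space(a, b):
    """True if a and b span the same row space over Q."""
    return rank(a) == rank(b) == rank(a + b)  # a + b stacks the rows


H = [[2, 1], [0, 3]]  # satisfies the assumed HNF conditions
N = [[2, 4], [0, 3]]  # first row is the sum of H's rows, so not reduced
print(is_row_hnf(H), is_row_hnf(N), same_row_space(H, N))
```

The toy pair `H`, `N` mimics the situation described: both matrices generate the same rows up to a unimodular change of basis, yet only one of them is in normal form, so an equality test between them fails.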
comment:16 Changed 10 years ago by
By the way, these failures are deterministic, so there's no need to do extensive loops.
One example that fails in pari/gp 2.4.3 (in sage), but works with pari/gp 2.3.1:
mathnf([0, 0, 0, 0, 0, 0, 0, 0, 0, 13; 0, 0, 0, 0, 0, 0, 0, 0, 23, 6; \
0, 0, 0, 0, 0, 0, 0, 23, -4, -7; 0, 0, 0, 0, 0, 0, 17, -3, 5, -5; \
0, 0, 0, 0, 0, 56, 16, -16, -15, -17; 0, 0, 0, 0, 57, 24, -16, -25, 2, -21; \
0, 0, 0, 114, 9, 56, 51, -52, 25, -55; 0, 0, 113, -31, -11, 24, 0, 28, 34, -16; \
0, 50, 3, 2, 16, -6, -2, 7, -19, -21; 118, 43, 51, 23, 37, -52, 18, 38, 51, 28], 0)
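One cheap sanity check on any candidate HNF of this matrix, independent of PARI, is the determinant. An HNF differs from the input by a unimodular transformation, so for a nonsingular square matrix the product of the HNF's diagonal must equal the absolute value of the input's determinant. The sketch below is an illustration only, not part of the upstream report. It transcribes the matrix above into Python and computes the determinant exactly. The helper name `det` is hypothetical, and `math.prod` needs Python 3.8 or later.

```python
from fractions import Fraction
from math import prod


def det(rows):
    """Exact determinant via Gaussian elimination with Fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n = len(m)
    d = Fraction(1)
    for c in range(n):
        pivot = next((i for i in range(c, n) if m[i][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            d = -d  # a row swap flips the sign
        d *= m[c][c]
        for i in range(c + 1, n):
            f = m[i][c] / m[c][c]
            if f != 0:
                m[i] = [a - f * b for a, b in zip(m[i], m[c])]
    return d


# The matrix from the mathnf() call above, one list per row.
A = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 13],
    [0, 0, 0, 0, 0, 0, 0, 0, 23, 6],
    [0, 0, 0, 0, 0, 0, 0, 23, -4, -7],
    [0, 0, 0, 0, 0, 0, 17, -3, 5, -5],
    [0, 0, 0, 0, 0, 56, 16, -16, -15, -17],
    [0, 0, 0, 0, 57, 24, -16, -25, 2, -21],
    [0, 0, 0, 114, 9, 56, 51, -52, 25, -55],
    [0, 0, 113, -31, -11, 24, 0, 28, 34, -16],
    [0, 50, 3, 2, 16, -6, -2, 7, -19, -21],
    [118, 43, 51, 23, 37, -52, 18, 38, 51, 28],
]

# A is zero above its anti-diagonal, so |det A| is the product of the
# anti-diagonal entries.
anti_diagonal = [A[i][len(A) - 1 - i] for i in range(len(A))]
det_A = det(A)
print(abs(det_A) == prod(anti_diagonal))
```

If the diagonal of a returned HNF does not multiply out to `abs(det_A)`, the result cannot be a correct HNF of this matrix. Passing the check does not prove correctness, though; it is only a necessary condition.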
I'll investigate some more, and then report this upstream.
comment:17 follow-up: ↓ 20 Changed 10 years ago by
- Report Upstream changed from N/A to Reported upstream. Little or no feedback.
comment:18 Changed 10 years ago by
- Cc jdemeyer added
comment:19 Changed 10 years ago by
- Report Upstream changed from Reported upstream. Little or no feedback. to Reported upstream. Developers acknowledge bug.
comment:20 in reply to: ↑ 17 Changed 10 years ago by
I have no documentation on "Batut's algorithm" (probably in Christian Batut's thesis), I'm reverse-engineering the code to try and understand what is going wrong.
This doesn't sound good :-)
comment:21 Changed 10 years ago by
- Report Upstream changed from Reported upstream. Developers acknowledge bug. to Fixed upstream, but not in a stable release.
According to Karim Belabas this is now fixed in pari svn r12889.
(How many more times am I expected to change the upstream status field? :-) )
comment:22 follow-up: ↓ 23 Changed 10 years ago by
See #11130.
comment:23 in reply to: ↑ 22 Changed 10 years ago by
- Dependencies set to #11130
- Milestone changed from sage-4.7.2 to sage-duplicate/invalid/wontfix
- Status changed from new to needs_review
comment:24 Changed 10 years ago by
- Report Upstream changed from Fixed upstream, but not in a stable release. to Fixed upstream, in a later stable release.
- Status changed from needs_review to positive_review
comment:25 Changed 10 years ago by
- Reviewers set to Jeroen Demeyer
comment:26 Changed 9 years ago by
- Resolution set to duplicate
- Status changed from positive_review to closed
Leif Leonhardy's results: