Opened 8 years ago

Closed 8 years ago

#10195 closed defect (duplicate)

Occasional doctest failure in libs/fplll/fplll.pyx

Reported by: mpatel Owned by: mvngu
Priority: major Milestone: sage-duplicate/invalid/wontfix
Component: doctest coverage Keywords:
Cc: drkirkby, malb, jdemeyer Merged in:
Authors: Reviewers: Jeroen Demeyer
Report Upstream: Fixed upstream, in a later stable release. Work issues:
Branch: Commit:
Dependencies: #11130 Stopgaps:

Description (last modified by drkirkby)

Reported on sage-devel:

I ran

./sage -t -long -force_lib "devel/sage/sage/libs/fplll/fplll.pyx"

1000 times in serial [1] with a 64-bit 4.6.rc0 built on OS X 10.6
(bsd.math).  All but one of the runs pass.  The failure:

Run 766 of 1000
Detected SAGE64 flag
Building Sage on OS X in 64-bit mode
sage -t -long -force_lib "devel/sage/sage/libs/fplll/fplll.pyx"
**********************************************************************
File "/Users/buildbot/build/sage/bsd-2/bsd_64_full/build/sage-4.6.0pre0/devel/sage/sage/libs/fplll/fplll.pyx", line 853:
    sage: L.echelon_form() == A.echelon_form()
Expected:
    True
Got:
    False

The error also occurs with a 32-bit build on bsd.math (OS X 10.6, 5 out of 1000 runs) and on sage.math (64-bit Ubuntu 8.04.4 LTS, 6 of 1000 runs).
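A serial repeat-and-count harness of the kind described above could be sketched as follows. This is only an illustration in modern Python 3, not the script actually used for these runs; the helper name count_failures is hypothetical, and it assumes that the test command exits with a nonzero status when a doctest fails.

```python
import subprocess


def count_failures(cmd, runs):
    """Run `cmd` `runs` times in serial and return the 1-based
    indices of the runs that exited with a nonzero status."""
    failed = []
    for i in range(1, runs + 1):
        # Capture output so a long series of runs does not flood the terminal.
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        if result.returncode != 0:
            failed.append(i)
    return failed


# Hypothetical usage, mirroring the command quoted in the report:
#   cmd = ["./sage", "-t", "-long", "-force_lib",
#          "devel/sage/sage/libs/fplll/fplll.pyx"]
#   failed = count_failures(cmd, 1000)
#   print("%d of 1000 runs failed: %s" % (len(failed), failed))
```

Running the copies strictly one after another, as here, also avoids two test processes sharing the same temporary directory.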

David Kirkby does not get any incorrect results on OpenSolaris 06/2009 after more than 15000 runs with 4.6.rc0 and more than 16000 with 4.6.1.alpha0. He had a total of 31748 passes. However, he did experience 109 doctest failures which are likely to be the result of doctesting two copies of Sage simultaneously, as these were using the same directory for temporary files ($HOME/.sage/tmp). His errors were like this:

Run 442 of 100000
sage -t -long -force_lib "devel/sage/sage/libs/fplll/fplll.pyx"
python: can't open file '/export/home/drkirkby/.sage//tmp/fplll.py':
[Errno 2] No such file or directory

        [0.2 s]

and never due to False being returned instead of True. His problems are probably the result of the issues discussed on #9739. Once David tested only one copy of Sage at a time, there were 0 failures in 17230 tests of sage-4.6.rc0. (Note: since Sage currently only works well on Solaris or OpenSolaris if built as a 32-bit application, this was a 32-bit build.)

Attachments (1)

OpenSolaris-failures-with-Leifs-program.txt (18.2 KB) - added by drkirkby 8 years ago.
19 failures in 10,000 runs on OpenSolaris using Leif's script. The actual doctest has never failed in many thousands of runs


Change History (27)

comment:1 Changed 8 years ago by mpatel

  • Type changed from PLEASE CHANGE to defect

comment:2 Changed 8 years ago by mpatel

  • Description modified (diff)

Leif Leonhardy's results:

3/1000 failures on Ubuntu 10.04 x86_64 (Core2), Sage 4.6.rc0, first
run.
4/1000 failures on Ubuntu 10.04 x86_64 (Core2), Sage 4.6.rc0, second
run.
3/1000 failures on Ubuntu 9.04 x86_64 (Core2), Sage 4.6.rc0, first
run.
2/ 500 failures on Ubuntu 9.04 x86_64 (Core2), Sage 4.6.rc0, second
run.
5/1000 failures on Ubuntu 9.04 x86 (Pentium 4 Prescott), Sage 4.6.rc0.

(Exactly the same as above, line 853, False instead of True.) 

comment:3 Changed 8 years ago by mpatel

  • Description modified (diff)

comment:4 Changed 8 years ago by drkirkby

  • Description modified (diff)

comment:5 follow-up: Changed 8 years ago by drkirkby

I reckon you Linux and OS X users should upgrade to Solaris!

Dave

comment:6 in reply to: ↑ 5 ; follow-up: Changed 8 years ago by leif

Replying to drkirkby:

I reckon you Linux and OS X users should upgrade to Solaris!

Dave

What if those failures are intended by the fplll authors, but just do not work on your machine / operating system? Or just those "file not found" instances would have been the ones failing... Nobody knows. ;-)

(At least regarding memory and CPUs, I trust most of my machines; they didn't show any bit errors, which I certainly would have noticed during at least a month of continuous computations whose results I could verify...)

Btw, one could also try something like

$ export SAGE_TEST_GLOBAL_ITER=1000
$ ./sage -tp 1 -long devel/sage/sage/libs/fplll/fplll.pyx 2>&1 | tee fplll-test.log
$ echo "`grep -c All fplll-test.log` out of 1000 runs did NOT fail."
$ echo "`grep -c Got fplll-test.log` tests out of 1000 gave an unexpected result."

comment:7 in reply to: ↑ 6 Changed 8 years ago by drkirkby

  • Description modified (diff)

Replying to leif:

Replying to drkirkby:

I reckon you Linux and OS X users should upgrade to Solaris!

Dave

What if those failures are intended by the fplll authors, but just do not work on your machine / operating system? Or just those "file not found" instances would have been the ones failing... Nobody knows. ;-)

I was rather expecting you to have something to say on this;-)

FWIW, I'm now doctesting again, but this time only testing one instance of Sage. So errors due to using the one directory for temp files should not exist. I've only managed to run the tests 3723 times, but none of them have failed in any way whatsoever.

I personally think my initial failures were due to #9739.

I think the matrix used for this test might be random, which could explain failures which occur when (for example) all elements are zero. However, that would probably not explain why I don't get any failures.

comment:8 follow-up: Changed 8 years ago by drkirkby

  • Cc malb added

I'm cc'ing Martin Albrecht on this, as he is the author of the library interface.

I'll take a guess at what is happening here. I believe gcc uses the 387 by default on 32-bit builds and the SSE instructions on 64-bit builds. So the precision of the results will be increased on 32-bit builds, as the FPU is using 80 bits internally, not 64 as it does with the SSE instructions.

I suspect recompiling libfplll with -mfpmath=387 on 64-bit builds would solve this. See

http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options

Dave

comment:9 Changed 8 years ago by leif

More data (Ubuntu 9.04 x86_64, Core2):

Failures: 29/5000
Failed runs: 468 560 588 656 691 849 909 993 1054 1133 1391 1437 1443 1761 1845 1848 2293 2294 2529 2801 2848 2938 2992 3433 3665 3862 3940 4432 4661

Ubuntu 9.04 x86 with env SAGE_TEST_GLOBAL_ITER=1000 ./sage -tp 1 -long ...: 3/1000 failed.

comment:10 follow-up: Changed 8 years ago by leif

Dave, what happens if you run the following?

from sage.libs.fplll.fplll import gen_ajtai

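runs = 1000000
verbose = 1
failed = []

# Run numbers are 1-based; cover exactly `runs` iterations.
for i in xrange(1, runs + 1):
    A = gen_ajtai(10, 0.7)
    L = A.LLL()
    # L.is_LLL_reduced()
    # True
    # LLL reduction should not change the row lattice,
    # so both echelon forms are expected to agree.
    if not (L.echelon_form() == A.echelon_form()):
        failed.append(i)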
        print "Test #%d failed:" % i
        if verbose:
            print "A:\n", A
            print "L:\n", L
            if verbose>1:
                print "A.echelon_form()\n", A.echelon_form()
                print "L.echelon_form()\n", L.echelon_form()

print "%d out of %d tests failed:\n" % (len(failed), runs)
print failed

comment:11 in reply to: ↑ 8 Changed 8 years ago by leif

Replying to drkirkby:

I suspect recompiling libfplll with -mfpmath=387 on 64-bit builds would solve this.

At least on the (32-bit) Pentium 4 Prescott, where I usually compile with -mfpmath=sse, recompiling libfplll and sage/libs/fplll/* with -mfpmath=387 doesn't change the outcome.

comment:12 in reply to: ↑ 10 Changed 8 years ago by drkirkby

Replying to leif:

Dave, what happens if you run the following?

<snip>

Well, I did a quick check - just 10,000 runs, and there were 19 failures

sage: print "%d out of %d tests failed:\n" % (len(failed), runs)
19 out of 10000 tests failed:

sage: print failed
[432, 654, 1219, 1264, 1326, 1490, 1696, 2366, 2622, 3049, 3626, 3835, 4239, 5221, 5368, 6148, 7007, 9661, 9792]

See attached file

Dave

Changed 8 years ago by drkirkby

19 failures in 10,000 runs on OpenSolaris using Leif's script. The actual doctest has never failed in many thousands of runs

comment:13 Changed 8 years ago by drkirkby

  • Description modified (diff)

comment:14 Changed 8 years ago by leif

Replying to drkirkby:

Replying to leif:

Dave, what happens if you run the following?

<snip>

Well, I did a quick check - just 10,000 runs, and there were 19 failures

Wow, that's more than I get. (I did a couple of tests on three machines, with 100,000 and 1,000,000 "runs"; the failure rate in comparison to the doctests dropped to about 0.15-0.17%.)

I've also rebuilt rc0 from scratch with SAGE87=yes on x86_64. Besides 7 new doctest errors (three of them PARI "bugs", one numerical noise), the failures remain, with approximately the same ratio with the Python program.

comment:15 Changed 8 years ago by wjp

This may actually be a bug in echelon_form(). Manual inspection of one of the failing matrices shows that A.echelon_form() doesn't actually return a matrix in HNF, but that the row spaces of A.echelon_form() and L.echelon_form() are in fact equal.
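The distinction drawn here (a matrix that is not in Hermite normal form, yet spans the same row lattice) can be illustrated with a small pure-Python sketch. It is not Sage's or PARI's implementation. The helper names is_row_hnf, solve_left and same_row_lattice are hypothetical, the HNF convention (positive pivots, entries above a pivot reduced to the range 0 <= x < pivot) is an assumption, and the lattice comparison only handles square invertible matrices.

```python
from fractions import Fraction


def is_row_hnf(rows):
    """Check the assumed row-HNF convention: zero rows last, pivot columns
    strictly increasing, pivots positive, entries above a pivot in [0, pivot)."""
    last_col = -1
    seen_zero_row = False
    for r, row in enumerate(rows):
        nonzero_cols = [c for c, x in enumerate(row) if x != 0]
        if not nonzero_cols:
            seen_zero_row = True
            continue
        col = nonzero_cols[0]
        if seen_zero_row or col <= last_col or row[col] <= 0:
            return False
        if any(not (0 <= above[col] < row[col]) for above in rows[:r]):
            return False
        last_col = col
    return True


def solve_left(B, A):
    """Solve X * B = A over the rationals (B square and invertible)."""
    n, m = len(B), len(A)
    # X * B = A is equivalent to B^T * X^T = A^T; eliminate on [B^T | A^T].
    aug = [[Fraction(B[j][i]) for j in range(n)] +
           [Fraction(A[j][i]) for j in range(m)] for i in range(n)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("B is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    # The right-hand block now holds X^T; transpose it back.
    return [[aug[i][n + j] for i in range(n)] for j in range(m)]


def same_row_lattice(A, B):
    """True if the rows of A and B generate the same integer lattice,
    i.e. each basis is an integer combination of the other."""
    X, Y = solve_left(B, A), solve_left(A, B)
    return all(x.denominator == 1 for M in (X, Y) for row in M for x in row)
```

For instance, with H = [[2, 1], [0, 3]] and G = [[2, 4], [0, 3]], H passes is_row_hnf and G does not (the entry 4 above the pivot 3 is not reduced), while same_row_lattice(H, G) holds because the first row of G is the sum of the two rows of H.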

comment:16 Changed 8 years ago by wjp

By the way, these failures are deterministic, so there's no need to do extensive loops.

One example that fails in pari/gp 2.4.3 (in sage), but works with pari/gp 2.3.1:

mathnf([0, 0, 0, 0, 0, 0, 0, 0, 0, 13; 0, 0, 0, 0, 0, 0, 0, 0, 23, 6; \
0, 0, 0, 0, 0, 0, 0, 23, -4, -7; 0, 0, 0, 0, 0, 0, 17, -3, 5, -5; \
0, 0, 0, 0, 0, 56, 16, -16, -15, -17; 0, 0, 0, 0, 57, 24, -16, -25, 2, -21; \
0, 0, 0, 114, 9, 56, 51, -52, 25, -55; 0, 0, 113, -31, -11, 24, 0, 28, 34, -16; \
0, 50, 3, 2, 16, -6, -2, 7, -19, -21; 118, 43, 51, 23, 37, -52, 18, 38, 51, 28], 0)

I'll investigate some more, and then report this upstream.

comment:17 follow-up: Changed 8 years ago by wjp

  • Report Upstream changed from N/A to Reported upstream. Little or no feedback.

comment:18 Changed 8 years ago by wjp

  • Cc jdemeyer added

comment:19 Changed 8 years ago by wjp

  • Report Upstream changed from Reported upstream. Little or no feedback. to Reported upstream. Developers acknowledge bug.

comment:20 in reply to: ↑ 17 Changed 8 years ago by jdemeyer

I have no documentation on "Batut's algorithm" (probably in Christian Batut's thesis), I'm reverse-engineering the code to try and understand what is going wrong.

This doesn't sound good :-)

comment:21 Changed 8 years ago by wjp

  • Report Upstream changed from Reported upstream. Developers acknowledge bug. to Fixed upstream, but not in a stable release.

According to Karim Belabas this is now fixed in pari svn r12889.

(How many more times am I expected to change the upstream status field? :-) )

comment:22 follow-up: Changed 8 years ago by jdemeyer

See #11130.

comment:23 in reply to: ↑ 22 Changed 8 years ago by leif

  • Dependencies set to #11130
  • Milestone changed from sage-4.7.2 to sage-duplicate/invalid/wontfix
  • Status changed from new to needs_review

Replying to jdemeyer:

See #11130.

I.e., #11130 will (also) fix this. [Hopefully.]

comment:24 Changed 8 years ago by jdemeyer

  • Report Upstream changed from Fixed upstream, but not in a stable release. to Fixed upstream, in a later stable release.
  • Status changed from needs_review to positive_review

comment:25 Changed 8 years ago by jdemeyer

  • Reviewers set to Jeroen Demeyer

comment:26 Changed 8 years ago by jdemeyer

  • Resolution set to duplicate
  • Status changed from positive_review to closed