Opened 10 years ago
Last modified 7 years ago
#14098 closed defect
zn_poly-0.9.p9 fails at least one of its tests on power7 — at Version 15
Reported by: | François Bissey | Owned by: | David Kirkby |
---|---|---|---|
Priority: | major | Milestone: | sage-5.8 |
Component: | porting | Keywords: | |
Cc: | Jeroen Demeyer | Merged in: | |
Authors: | Francois Bissey, David Harvey | Reviewers: | |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | Stopgaps: |
Description (last modified by )
On the login node of our power7 cluster (beatrice) zn_poly fails make check
    (sage-sh) frb15@p2n14-c:src$ make check
    test/test -quick all
    mpn_smp_basecase()... ok
    mpn_smp_kara()... make: *** [check] Segmentation fault (core dumped)
Here is a detailed backtrace
    (gdb) r mpn_smp_kara
    Starting program: /hpc/scratch/frb15/sandbox/sage-5.7.beta4/spkg/build/zn_poly-0.9.p9/src/test/test mpn_smp_kara
    mpn_smp_kara()...
    Program received signal SIGSEGV, Segmentation fault.
    0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
    67      random2.c: No such file or directory.
            in random2.c
    (gdb) bt
    #0  0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
    #1  0x00000400000dd360 in __gmpn_random2 (rp=0x0, n=-5198573331259894519) at random2.c:54
    #2  0x000000001002c634 in ZNP_mpn_random2 (res=0x0, n=13248170742449657097) at test/support.c:107
    #3  0x0000000010027224 in testcase_mpn_smp_kara (n=6624085371224828549) at test/mpn_mulmid-test.c:89
    #4  0x0000000010027434 in test_mpn_smp_kara (quick=0) at test/mpn_mulmid-test.c:125
    #5  0x00000000100210dc in run_test (target=0x10041488, quick=0) at test/test.c:187
    #6  0x0000000010021450 in main (argc=2, argv=0xfffffffe5a8) at test/test.c:235
    (gdb) bt full
    #0  0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
            bi = 4398113622176
            ranm = 268711088
            cap_chunksize = 0
            chunksize = 0
            i = 277803815622628616
    #1  0x00000400000dd360 in __gmpn_random2 (rp=0x0, n=-5198573331259894519) at random2.c:54
            rstate = 0x40000134db8
            bit_pos = 8
            ran = 3915822088
            ranm = 3915822088
    #2  0x000000001002c634 in ZNP_mpn_random2 (res=0x0, n=13248170742449657097) at test/support.c:107
            i = 0
    #3  0x0000000010027224 in testcase_mpn_smp_kara (n=6624085371224828549) at test/mpn_mulmid-test.c:89
            buf1 = 0x0
            buf2 = 0x0
            ref = 0x0
            res = 0x0
            success = 1
    #4  0x0000000010027434 in test_mpn_smp_kara (quick=0) at test/mpn_mulmid-test.c:125
            success = 1
            n = 6624085371224828549
            trial = 0
    #5  0x00000000100210dc in run_test (target=0x10041488, quick=0) at test/test.c:187
            success = 4095
    #6  0x0000000010021450 in main (argc=2, argv=0xfffffffe5a8) at test/test.c:235
            found = 1
            all_success = 1
            any_targets = 1
            quick = 0
            i = 33
            j = 1
    (gdb) q
It seems to point the finger at mpir.
New spkg:
Change History (16)
comment:1 Changed 10 years ago by
comment:2 Changed 10 years ago by
The __gmpn_random2 (rp=0x0, n=-5198573331259894519) call is very suspicious, since the second argument should be a size in limbs.
Paul
comment:3 Changed 10 years ago by
Hi Paul,
I suspect that the problem is triggered when enabling the debugging code. Furthermore, zn_poly itself is built with -DNDEBUG regardless of SAGE_DEBUG=yes. I am wondering if that could cause the problem.
Francois
comment:4 Changed 10 years ago by
Very odd. The main code is always compiled with -DNDEBUG, with no option to turn it off. But the code for the test which fails is all compiled with -DDEBUG, with no way of turning it off either. So it must be happening when SAGE_DEBUG is turned on for some other component of sage. Since no one else seems to have seen it before, it has to be a power7-specific problem.
comment:5 Changed 10 years ago by
To continue on what you started, Paul: in
testcase_mpn_smp_kara (n=6624085371224828549)
n is supposed to be a size_t, so I think we have a gross overflow somewhere earlier. The value originates from here:
    /* Tests mpn_smp_kara for a range of n. */
    int
    test_mpn_smp_kara (int quick)
    {
       int success = 1;
       size_t n;
       ulong trial;

       // first a dense range of small problems
       for (n = 2; n <= 30 && success; n++)
       for (trial = 0; trial < (quick ? 300 : 30000) && success; trial++)
          success = success && testcase_mpn_smp_kara (n);

       // now a few larger problems too
       for (trial = 0; trial < (quick ? 100 : 3000) && success; trial++)
       {
          n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2;   // <======= n generated here.
          success = success && testcase_mpn_smp_kara (n);
       }

       return success;
    }
comment:6 Changed 10 years ago by
On power7 it appears that ZNP_mpn_smp_kara_thresh is equal to SIZE_MAX which according to /usr/include/stdint.h is
    /* Limit of `size_t' type.  */
    # if __WORDSIZE == 64
    #  define SIZE_MAX              (18446744073709551615UL)
    # else
    #  define SIZE_MAX              (4294967295U)
    # endif
random_ulong is defined by
    ulong
    random_ulong (ulong max)
    {
       return gmp_urandomm_ui (randstate, max);
    }
So n needs to be a size_t, which is at most SIZE_MAX, but the test generates a random number between 0 and 3 * SIZE_MAX + 2. <sarcasm> Oh dear! I wonder why that doesn't work. </sarcasm>
I guess it is potentially fine if ZNP_mpn_smp_kara_thresh is not SIZE_MAX; I don't know how it is on other systems.
comment:7 Changed 10 years ago by
Francois, can you see how ZNP_mpn_smp_kara_thresh is defined on other 64-bit systems, and which kinds of values are generated by n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2?
Paul
comment:8 Changed 10 years ago by
I am certainly poking at that. The value of ZNP_mpn_smp_kara_thresh is computed by the tuning code, and it is clearly allowed to be equal to SIZE_MAX:
    // generate tuning.c file
    printf (header);
    x = ZNP_mpn_smp_kara_thresh;
    printf ("size_t ZNP_mpn_smp_kara_thresh = ");
    printf (x == SIZE_MAX ? "SIZE_MAX;\n" : "%lu;\n", x);
So someone potentially set themselves up for trouble in the test. However, after inserting a few printf calls in the code, the mystery deepens:
    mpn_smp_basecase()... ok
    mpn_smp_kara()... test: src/mpn_mulmid.c:241: ZNP_mpn_smp_kara: Assertion `n >= 2' failed.
    maxtrial= 98
    SIZE_MAX= 18446744073709551615
    maxtrial= 98
    n=31 n=24 n=38 n=40 n=74 n=24 n=28 n=32 n=77 n=67 n=76 n=64 n=13
    n=17 n=90 n=42 n=47 n=79 n=21 n=82 n=32 n=10 n=67 n=25 n=26 n=39
    n=77 n=90 n=97 n=7 n=74 n=59 n=70 n=87 n=23 n=6 n=70 n=97 n=78
    n=74 n=57 n=53 n=28 n=21 n=51 n=33 n=41 n=2 n=88 n=57 n=56 n=96
    n=46 n=38 n=69 n=93 n=11 n=61 n=24 n=25 n=45 n=46 n=6 n=44 n=32
    n=93 n=59 n=45 n=46 n=31 n=91 n=32 n=45 n=45 n=90 n=61 n=78 n=47
    n=33 n=75 n=71 n=37 n=92 n=94 n=50 n=84 n=8 n=43 n=15 n=31 n=31
    make: *** [check] Aborted (core dumped)
    Error running zn_poly's quick test suite ('make check').
I didn't hit the assertion before; after adding these printf calls we abort rather than segfault.
comment:9 Changed 10 years ago by
I guess there is a bug in the tuning code, which should not give ZNP_mpn_smp_kara_thresh a huge value.
Paul
comment:10 Changed 10 years ago by
I am the author.... thanks Paul for drawing my attention to this.
I haven't looked at this code for years so it's almost as mysterious to me as to everyone else here!
My guess is that the bug is in the test code rather than in the tuning code. I suspect that the threshold is allowed to be SIZE_MAX, but that the line
n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2;
should be replaced by e.g.
    if (ZNP_mpn_smp_kara_thresh == SIZE_MAX)
       n = random_ulong (100) + 2;
    else
       n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2;
It could also be a bug in the tuning code, but that would be much harder to fix. If I remember correctly what this threshold means, it is very surprising to me that its optimal value is SIZE_MAX on any real system.
comment:11 Changed 10 years ago by
Thanks for the code. My last error was due to me trying to do something similar and failing to read the original code properly (putting the +2 inside the bracket). power7 is a strange beast, but it is unlikely that SIZE_MAX is the optimal value. The tuning probably assumes something that is wrong on this platform, and that would indeed be difficult to find.
comment:12 Changed 10 years ago by
Not sure what happened. I wanted to do another run to post tuning.c, but the value of ZNP_mpn_smp_kara_thresh is now 133. I swear it was SIZE_MAX before. There are still plenty of SIZE_MAX values in the file:
    #include "zn_poly_internal.h"

    size_t ZNP_mpn_smp_kara_thresh = 133;
    size_t ZNP_mpn_mulmid_fallback_thresh = 4868;

    tuning_info_t tuning_info[] =
    {
       {  // bits = 0
       },
       {  // bits = 1
       },
       {  // bits = 2
          94,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          270,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          206,   // KS1 -> KS2 middle product threshold
          SIZE_MAX,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
          1000,   // nussbaumer multiplication threshold
          1000   // nussbaumer squaring threshold
       },
       {  // bits = 3
          105,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          270,   // KS1 -> KS2 squaring threshold
          9634,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          120,   // KS1 -> KS2 middle product threshold
          SIZE_MAX,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
          1000,   // nussbaumer multiplication threshold
          1000   // nussbaumer squaring threshold
       },
       {  // bits = 4
          123,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          154,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          132,   // KS1 -> KS2 middle product threshold
          SIZE_MAX,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
comment:13 Changed 10 years ago by
Francois, anyway, it does not hurt to implement what David suggests in comment 10. This should fix this ticket once and for all.
Paul
comment:14 Changed 10 years ago by
I can say it worked nicely, so I'll prepare a new spkg with it so this kind of thing cannot happen again. I think I found out what happened and what made things different. In the original build I used gcc-4.7.1; in this build the compiler was the gcc shipped with the distro, gcc-4.3.4. There could be some subtle bugs lurking in gcc itself or in the standard used to compile the tuning code.
    #include "zn_poly_internal.h"

    size_t ZNP_mpn_smp_kara_thresh = SIZE_MAX;
    size_t ZNP_mpn_mulmid_fallback_thresh = SIZE_MAX;

    tuning_info_t tuning_info[] =
    {
       {  // bits = 0
       },
       {  // bits = 1
       },
       {  // bits = 2
          94,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          218,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          216,   // KS1 -> KS2 middle product threshold
          SIZE_MAX,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
          1000,   // nussbaumer multiplication threshold
          1000   // nussbaumer squaring threshold
       },
       {  // bits = 3
          107,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          167,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          146,   // KS1 -> KS2 middle product threshold
          6889,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
          1000,   // nussbaumer multiplication threshold
          1000   // nussbaumer squaring threshold
       },
       {  // bits = 4
          68,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          187,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          95,   // KS1 -> KS2 middle product threshold
          7367,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
          1000,   // nussbaumer multiplication threshold
          1000   // nussbaumer squaring threshold
       },
       {  // bits = 5
          60,   // KS1 -> KS2 multiplication threshold
          18841,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          192,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          128,   // KS1 -> KS2 middle product threshold
          5037,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
Changed 10 years ago by
Attachment: | mpn_mulmid-test.c.patch added |
---|
patch added to zn_poly for review purposes
comment:15 Changed 10 years ago by
Authors: | → Francois Bissey, David Harvey |
---|---|
Description: | modified (diff) |
Milestone: | sage-5.7 → sage-5.8 |
Status: | new → needs_review |
OK, the new spkg is ready for review. I also attached the patch for review, but it is just David's code.
Note this is the quick test always run with zn_poly. It passes in 5.7beta3 without debug and it fails in beta4 with SAGE_DEBUG=yes.