#14098 closed defect (fixed)
zn_poly-0.9.p9 fails at least one of its tests on power7
Reported by: | fbissey | Owned by: | drkirkby |
---|---|---|---|
Priority: | major | Milestone: | sage-5.8 |
Component: | porting | Keywords: | |
Cc: | jdemeyer | Merged in: | sage-5.8.beta1 |
Authors: | François Bissey, David Harvey | Reviewers: | Paul Zimmermann, Jeroen Demeyer |
Report Upstream: | N/A | Work issues: | |
Branch: | | Commit: | |
Dependencies: | | Stopgaps: | |
Description
On the login node of our power7 cluster (beatrice) zn_poly fails make check
    (sage-sh) frb15@p2n14-c:src$ make check
    test/test -quick all
    mpn_smp_basecase()... ok
    mpn_smp_kara()... make: *** [check] Segmentation fault (core dumped)
Here is a detailed backtrace
    (gdb) r mpn_smp_kara
    Starting program: /hpc/scratch/frb15/sandbox/sage-5.7.beta4/spkg/build/zn_poly-0.9.p9/src/test/test mpn_smp_kara
    mpn_smp_kara()...
    Program received signal SIGSEGV, Segmentation fault.
    0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
    67      random2.c: No such file or directory.
            in random2.c
    (gdb) bt
    #0  0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
    #1  0x00000400000dd360 in __gmpn_random2 (rp=0x0, n=-5198573331259894519) at random2.c:54
    #2  0x000000001002c634 in ZNP_mpn_random2 (res=0x0, n=13248170742449657097) at test/support.c:107
    #3  0x0000000010027224 in testcase_mpn_smp_kara (n=6624085371224828549) at test/mpn_mulmid-test.c:89
    #4  0x0000000010027434 in test_mpn_smp_kara (quick=0) at test/mpn_mulmid-test.c:125
    #5  0x00000000100210dc in run_test (target=0x10041488, quick=0) at test/test.c:187
    #6  0x0000000010021450 in main (argc=2, argv=0xfffffffe5a8) at test/test.c:235
    (gdb) bt full
    #0  0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
            bi = 4398113622176
            ranm = 268711088
            cap_chunksize = 0
            chunksize = 0
            i = 277803815622628616
    #1  0x00000400000dd360 in __gmpn_random2 (rp=0x0, n=-5198573331259894519) at random2.c:54
            rstate = 0x40000134db8
            bit_pos = 8
            ran = 3915822088
            ranm = 3915822088
    #2  0x000000001002c634 in ZNP_mpn_random2 (res=0x0, n=13248170742449657097) at test/support.c:107
            i = 0
    #3  0x0000000010027224 in testcase_mpn_smp_kara (n=6624085371224828549) at test/mpn_mulmid-test.c:89
            buf1 = 0x0
            buf2 = 0x0
            ref = 0x0
            res = 0x0
            success = 1
    #4  0x0000000010027434 in test_mpn_smp_kara (quick=0) at test/mpn_mulmid-test.c:125
            success = 1
            n = 6624085371224828549
            trial = 0
    #5  0x00000000100210dc in run_test (target=0x10041488, quick=0) at test/test.c:187
            success = 4095
    #6  0x0000000010021450 in main (argc=2, argv=0xfffffffe5a8) at test/test.c:235
            found = 1
            all_success = 1
            any_targets = 1
            quick = 0
            i = 33
            j = 1
    (gdb) q
It seems to point the finger at mpir.
New spkg:
Attachments (1)
Change History (27)
comment:1 Changed 8 years ago by
comment:2 Changed 8 years ago by
The __gmpn_random2 (rp=0x0, n=-5198573331259894519) call is very suspicious, since the second argument should be a size in limbs.
Paul
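For readers following the backtrace in the description, the three sizes it shows are consistent with one another: the negative n that gdb prints in frame #1 has the same 64-bit pattern as the unsigned n in frame #2, and that value equals 2*n - 1 for the n in frame #3. The sketch below only checks this arithmetic. The helper names are hypothetical and are not part of zn_poly, GMP or MPIR.

```c
#include <stdint.h>

/* Hypothetical helper: reinterpret a signed 64-bit value (the way gdb shows
   the size argument in frame #1) as an unsigned one. Conversion from signed
   to unsigned is defined modulo 2^64. */
uint64_t as_unsigned64(int64_t v)
{
    return (uint64_t) v;
}

/* Hypothetical helper: 2*n - 1 in unsigned 64-bit arithmetic. */
uint64_t two_n_minus_one(uint64_t n)
{
    return 2u * n - 1u;
}
```

With these, as_unsigned64(-5198573331259894519) gives 13248170742449657097, which is also two_n_minus_one(6624085371224828549). This suggests the bad value was already present in the test driver and was merely passed down into the random number code.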
comment:3 Changed 8 years ago by
Hi Paul,
I suspect that the problem is triggered when enabling the debugging code. Furthermore, zn_poly itself is built with -DNDEBUG regardless of SAGE_DEBUG=yes. I am wondering whether that could cause the problem.
Francois
comment:4 Changed 8 years ago by
Very odd. The main code is always compiled with -DNDEBUG, with no option to turn it off. But the code for the test which fails is all compiled with -DDEBUG, with no way of turning it off either. So it must be happening when SAGE_DEBUG is turned on for some other component of sage. Since no one else seems to have seen it before, it has to be a power7-specific problem.
comment:5 Changed 8 years ago by
To continue what you started, Paul: in
    testcase_mpn_smp_kara (n=6624085371224828549)
n is supposed to be a size_t, so I think we have a gross overflow somewhere earlier. The value originates from here:
    /* Tests mpn_smp_kara for a range of n. */
    int
    test_mpn_smp_kara (int quick)
    {
       int success = 1;
       size_t n;
       ulong trial;

       // first a dense range of small problems
       for (n = 2; n <= 30 && success; n++)
       for (trial = 0; trial < (quick ? 300 : 30000) && success; trial++)
          success = success && testcase_mpn_smp_kara (n);

       // now a few larger problems too
       for (trial = 0; trial < (quick ? 100 : 3000) && success; trial++)
       {
          n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2;   // <======= n generated here.
          success = success && testcase_mpn_smp_kara (n);
       }

       return success;
    }
comment:6 Changed 8 years ago by
On power7 it appears that ZNP_mpn_smp_kara_thresh is equal to SIZE_MAX which according to /usr/include/stdint.h is
    /* Limit of `size_t' type. */
    # if __WORDSIZE == 64
    #  define SIZE_MAX (18446744073709551615UL)
    # else
    #  define SIZE_MAX (4294967295U)
    # endif
random_ulong is defined by
    ulong random_ulong (ulong max)
    {
       return gmp_urandomm_ui (randstate, max);
    }
so n needs to be a size_t, which is at most SIZE_MAX, but the test generates a random number between 0 and 3 * SIZE_MAX + 2. <sarcasm> Oh dear! I wonder why that doesn't work. </sarcasm>
I guess it is potentially fine if ZNP_mpn_smp_kara_thresh is not SIZE_MAX. I don't know how it is on other systems.
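One detail worth spelling out: the product 3 * ZNP_mpn_smp_kara_thresh is evaluated in unsigned arithmetic, so with a threshold of SIZE_MAX it wraps around instead of exceeding SIZE_MAX. Assuming the usual modular behaviour of size_t, the bound handed to random_ulong becomes SIZE_MAX - 2. Since gmp_urandomm_ui returns a value from 0 to max - 1, n can then land anywhere between 2 and SIZE_MAX - 1, which matches the enormous n seen in the backtrace. A minimal sketch, with a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper mirroring the bound passed to random_ulong() in the
   test: 3 * thresh, evaluated in size_t, i.e. modulo SIZE_MAX + 1. */
size_t kara_test_bound(size_t thresh)
{
    return 3 * thresh;
}
```

For a moderate threshold such as 133 the bound is simply 399. For SIZE_MAX it is SIZE_MAX - 2, so almost every draw produces a problem size far too large to allocate.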
comment:7 Changed 8 years ago by
Francois, can you see how ZNP_mpn_smp_kara_thresh is defined on other 64-bit systems, and which kinds of values are generated by n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2?
Paul
comment:8 Changed 8 years ago by
I am certainly poking at that. The value of ZNP_mpn_smp_kara_thresh is computed by the tuning code and it is clearly allowed to be equal to SIZE_MAX
    // generate tuning.c file
    printf (header);

    x = ZNP_mpn_smp_kara_thresh;
    printf ("size_t ZNP_mpn_smp_kara_thresh = ");
    printf (x == SIZE_MAX ? "SIZE_MAX;\n" : "%lu;\n", x);

So someone potentially set themselves up for trouble in the test. However, after inserting a few printf calls in the code, the mystery deepens:
    mpn_smp_basecase()... ok
    mpn_smp_kara()... test: src/mpn_mulmid.c:241: ZNP_mpn_smp_kara: Assertion `n >= 2' failed.
    maxtrial= 98
    SIZE_MAX= 18446744073709551615
    maxtrial= 98
    n=31 n=24 n=38 n=40 n=74 n=24 n=28 n=32 n=77 n=67 n=76 n=64 n=13 n=17 n=90
    n=42 n=47 n=79 n=21 n=82 n=32 n=10 n=67 n=25 n=26 n=39 n=77 n=90 n=97 n=7
    n=74 n=59 n=70 n=87 n=23 n=6 n=70 n=97 n=78 n=74 n=57 n=53 n=28 n=21 n=51
    n=33 n=41 n=2 n=88 n=57 n=56 n=96 n=46 n=38 n=69 n=93 n=11 n=61 n=24 n=25
    n=45 n=46 n=6 n=44 n=32 n=93 n=59 n=45 n=46 n=31 n=91 n=32 n=45 n=45 n=90
    n=61 n=78 n=47 n=33 n=75 n=71 n=37 n=92 n=94 n=50 n=84 n=8 n=43 n=15 n=31
    n=31
    make: *** [check] Aborted (core dumped)
    Error running zn_poly's quick test suite ('make check').
I didn't have the assertion before, and after putting these in we abort rather than segfault.
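As a side note on the generator line quoted in this comment: it picks the format string at run time and prints a size_t with %lu. A slightly more explicit variant is sketched below, assuming a C99 printf family so that %zu is available. The function name format_thresh is hypothetical and does not come from zn_poly.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write a threshold the way it would appear in
   tuning.c, using the symbolic name when the value is SIZE_MAX.
   Returns snprintf's result, i.e. the length of the full text. */
int format_thresh(char *buf, size_t buflen, size_t x)
{
    if (x == SIZE_MAX)
        return snprintf(buf, buflen, "SIZE_MAX;\n");
    return snprintf(buf, buflen, "%zu;\n", x);
}
```

Keeping the two cases in separate calls avoids passing an argument that one of the format strings never consumes, and %zu stays correct on platforms where size_t and unsigned long differ in width.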
comment:9 Changed 8 years ago by
I guess there is a bug in the tuning code, which should not give a huge value for ZNP_mpn_smp_kara_thresh.
Paul
comment:10 Changed 8 years ago by
I am the author.... thanks Paul for drawing my attention to this.
I haven't looked at this code for years so it's almost as mysterious to me as to everyone else here!
My guess is that the bug is in the test code rather than in the tuning code. I suspect that the threshold is allowed to be SIZE_MAX, but that the line
n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2;
should be replaced by e.g.
    if (ZNP_mpn_smp_kara_thresh == SIZE_MAX)
       n = random_ulong (100) + 2;
    else
       n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2;
It could also be a bug in the tuning code, but that would be much harder to fix. If I remember correctly what this threshold means, it is very surprising to me that its optimal value is SIZE_MAX on any real system.
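The guard suggested in this comment can be tried in isolation. The sketch below is not the actual patch: random_below is a stand-in for zn_poly's random_ulong built on rand() so that the example needs no GMP, and pick_large_n is a hypothetical wrapper around the suggested if/else. It also assumes that a threshold of SIZE_MAX is a sentinel rather than a real size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for random_ulong(): returns a value in [0, max).
   rand() is used only to keep the sketch self-contained; it is
   neither uniform nor wide enough for real testing. */
unsigned long random_below(unsigned long max)
{
    assert(max > 0);
    return (unsigned long) rand() % max;
}

/* Hypothetical wrapper: choose the size of a "larger" test problem.
   When the tuned threshold is SIZE_MAX, fall back to a small fixed
   range instead of multiplying the threshold by 3. */
size_t pick_large_n(size_t thresh)
{
    size_t n;

    if (thresh == SIZE_MAX)
        n = random_below(100) + 2;
    else
        n = random_below(3 * thresh) + 2;

    assert(n >= 2);   /* same precondition as the assertion in mpn_smp_kara */
    return n;
}
```

The + 2 stays outside the call, so n is at least 2 in both branches. The guard only covers the exact value SIZE_MAX: a threshold above SIZE_MAX / 3 would still wrap in the multiplication, and a threshold of 0 would make the bound 0, so a more defensive version might clamp the threshold first.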
comment:11 Changed 8 years ago by
Thanks for the code. My last error was due to me trying to do something similar and failing to read the original code properly (putting the +2 inside the bracket). power7 is a strange beast, but it is unlikely that SIZE_MAX is the optimal value. The tuning probably assumes something that is wrong on this platform, and that would indeed be difficult to find.
comment:12 Changed 8 years ago by
Not sure what happened. I wanted to do another run to post tuning.c, but the value of ZNP_mpn_smp_kara_thresh is now 133. I swear it was SIZE_MAX before. There are still plenty of SIZE_MAX values in the file:

    #include "zn_poly_internal.h"

    size_t ZNP_mpn_smp_kara_thresh = 133;
    size_t ZNP_mpn_mulmid_fallback_thresh = 4868;

    tuning_info_t tuning_info[] =
    {
       {  // bits = 0
       },
       {  // bits = 1
       },
       {  // bits = 2
          94,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          270,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          206,   // KS1 -> KS2 middle product threshold
          SIZE_MAX,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
          1000,   // nussbaumer multiplication threshold
          1000   // nussbaumer squaring threshold
       },
       {  // bits = 3
          105,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          270,   // KS1 -> KS2 squaring threshold
          9634,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          120,   // KS1 -> KS2 middle product threshold
          SIZE_MAX,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
          1000,   // nussbaumer multiplication threshold
          1000   // nussbaumer squaring threshold
       },
       {  // bits = 4
          123,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          154,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          132,   // KS1 -> KS2 middle product threshold
          SIZE_MAX,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
comment:13 Changed 8 years ago by
Francois, in any case it does not hurt to implement what David suggests in comment 10. This should fix this ticket once and for all.
Paul
comment:14 Changed 8 years ago by
I can say it worked nicely, so I'll prepare a new spkg with it so this kind of thing cannot happen again. I think I found out what happened and what made things different. In the original build I used gcc-4.7.1; in this build the compiler was the gcc shipped with the distro, gcc-4.3.4. There could be some subtle bugs lurking in gcc itself or in the standard used to compile the tuning code.
    #include "zn_poly_internal.h"

    size_t ZNP_mpn_smp_kara_thresh = SIZE_MAX;
    size_t ZNP_mpn_mulmid_fallback_thresh = SIZE_MAX;

    tuning_info_t tuning_info[] =
    {
       {  // bits = 0
       },
       {  // bits = 1
       },
       {  // bits = 2
          94,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          218,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          216,   // KS1 -> KS2 middle product threshold
          SIZE_MAX,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
          1000,   // nussbaumer multiplication threshold
          1000   // nussbaumer squaring threshold
       },
       {  // bits = 3
          107,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          167,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          146,   // KS1 -> KS2 middle product threshold
          6889,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
          1000,   // nussbaumer multiplication threshold
          1000   // nussbaumer squaring threshold
       },
       {  // bits = 4
          68,   // KS1 -> KS2 multiplication threshold
          SIZE_MAX,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          187,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          95,   // KS1 -> KS2 middle product threshold
          7367,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
          1000,   // nussbaumer multiplication threshold
          1000   // nussbaumer squaring threshold
       },
       {  // bits = 5
          60,   // KS1 -> KS2 multiplication threshold
          18841,   // KS2 -> KS4 multiplication threshold
          SIZE_MAX,   // KS4 -> FFT multiplication threshold
          192,   // KS1 -> KS2 squaring threshold
          SIZE_MAX,   // KS2 -> KS4 squaring threshold
          SIZE_MAX,   // KS4 -> FFT squaring threshold
          128,   // KS1 -> KS2 middle product threshold
          5037,   // KS2 -> KS4 middle product threshold
          SIZE_MAX,   // KS4 -> FFT middle product threshold
comment:15 Changed 8 years ago by
- Description modified (diff)
- Milestone changed from sage-5.7 to sage-5.8
- Status changed from new to needs_review
OK new spkg ready for review. I also attached the patch for review but it is just David's code.
comment:16 Changed 8 years ago by
- Cc jdemeyer added
The patch looks fine to me. However, since I have no access to a power7, I can only check the patch and new package on another computer. Jeroen, how should we proceed in that case, assuming Francois (the author of the patch and new package) is the only person to have access to a power7?
Paul
comment:17 Changed 8 years ago by
- Reviewers set to Paul Zimmermann
- Status changed from needs_review to positive_review
I don't mind giving positive_review in this case. We can reasonably expect that the author has tested the package on the failing machine.
comment:18 Changed 8 years ago by
> I don't mind giving positive_review in this case.

However, I'd like to first check that the new spkg works on my machine.
Paul
comment:19 Changed 8 years ago by
This doesn't even build:
    Applying patches to upstream sources...
    makemakefile.py.patch
    patching file makemakefile.py
    mpn_mulmid-test.c.patch
    patching file test/mpn_mulmid-test.c
    Hunk #1 FAILED at 121.
    1 out of 1 hunk FAILED -- saving rejects to file test/mpn_mulmid-test.c.rej
    Error: '../patches/mpn_mulmid-test.c.patch' failed to apply.

    real    0m0.011s
    user    0m0.010s
    sys     0m0.000s
    ************************************************************************
    Error installing package zn_poly-0.9.p10
    ************************************************************************
comment:20 Changed 8 years ago by
- Status changed from positive_review to needs_work
comment:21 Changed 8 years ago by
- Status changed from needs_work to needs_review
Sorry, I made a big mistake when preparing the final spkg (the source was not pristine, and that's rather unforgiving). It should be OK now (I double-checked).
comment:22 Changed 8 years ago by
- Reviewers changed from Paul Zimmermann to Paul Zimmermann, Jeroen Demeyer
- Status changed from needs_review to positive_review
All tests now pass on my computer (on top of Sage 5.6).
Paul
comment:23 Changed 8 years ago by
- Merged in set to sage-5.8.beta1
- Resolution set to fixed
- Status changed from positive_review to closed
comment:24 Changed 8 years ago by
zn_poly's tuning (and apparently due to that its test suite, too) is flaky on other systems as well: #13947
It would be nice if some of you could also take a look at that... :P
comment:25 Changed 8 years ago by
Yes it looks similar. Turning debugging on was somewhat helpful here.
Note this is the quick test always run with zn_poly. It passes in 5.7beta3 without debug and it fails in beta4 with SAGE_DEBUG=yes.