Opened 8 years ago

Closed 8 years ago

Last modified 5 years ago

#14098 closed defect (fixed)

zn_poly-0.9.p9 fails at least one of its tests on power7

Reported by: fbissey Owned by: drkirkby
Priority: major Milestone: sage-5.8
Component: porting Keywords:
Cc: jdemeyer Merged in: sage-5.8.beta1
Authors: François Bissey, David Harvey Reviewers: Paul Zimmermann, Jeroen Demeyer
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Description (last modified by fbissey)

On the login node of our power7 cluster (beatrice) zn_poly fails make check

(sage-sh) frb15@p2n14-c:src$ make check
test/test -quick all
mpn_smp_basecase()... ok
mpn_smp_kara()... make: *** [check] Segmentation fault (core dumped)

Here is a detailed backtrace

(gdb) r mpn_smp_kara
Starting program: /hpc/scratch/frb15/sandbox/sage-5.7.beta4/spkg/build/zn_poly-0.9.p9/src/test/test mpn_smp_kara
mpn_smp_kara()... 
Program received signal SIGSEGV, Segmentation fault.
0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
67      random2.c: No such file or directory.
        in random2.c
(gdb) bt
#0  0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
#1  0x00000400000dd360 in __gmpn_random2 (rp=0x0, n=-5198573331259894519) at random2.c:54
#2  0x000000001002c634 in ZNP_mpn_random2 (res=0x0, n=13248170742449657097) at test/support.c:107
#3  0x0000000010027224 in testcase_mpn_smp_kara (n=6624085371224828549) at test/mpn_mulmid-test.c:89
#4  0x0000000010027434 in test_mpn_smp_kara (quick=0) at test/mpn_mulmid-test.c:125
#5  0x00000000100210dc in run_test (target=0x10041488, quick=0) at test/test.c:187
#6  0x0000000010021450 in main (argc=2, argv=0xfffffffe5a8) at test/test.c:235
(gdb) bt full
#0  0x00000400000dd3f0 in gmp_rrandomb (rp=0x0, rstate=0x40000134db8, nbits=17779444199848231480) at random2.c:67
        bi = 4398113622176
        ranm = 268711088
        cap_chunksize = 0
        chunksize = 0
        i = 277803815622628616
#1  0x00000400000dd360 in __gmpn_random2 (rp=0x0, n=-5198573331259894519) at random2.c:54
        rstate = 0x40000134db8
        bit_pos = 8
        ran = 3915822088
        ranm = 3915822088
#2  0x000000001002c634 in ZNP_mpn_random2 (res=0x0, n=13248170742449657097) at test/support.c:107
        i = 0
#3  0x0000000010027224 in testcase_mpn_smp_kara (n=6624085371224828549) at test/mpn_mulmid-test.c:89
        buf1 = 0x0
        buf2 = 0x0
        ref = 0x0
        res = 0x0
        success = 1
#4  0x0000000010027434 in test_mpn_smp_kara (quick=0) at test/mpn_mulmid-test.c:125
        success = 1
        n = 6624085371224828549
        trial = 0
#5  0x00000000100210dc in run_test (target=0x10041488, quick=0) at test/test.c:187
        success = 4095
#6  0x0000000010021450 in main (argc=2, argv=0xfffffffe5a8) at test/test.c:235
        found = 1
        all_success = 1
        any_targets = 1
        quick = 0
        i = 33
        j = 1
(gdb) q

It seems to point the finger at mpir.

New spkg:

Attachments (1)

mpn_mulmid-test.c.patch (556 bytes) - added by fbissey 8 years ago.
patch added to zn_poly for review purposes

Change History (27)

comment:1 Changed 8 years ago by fbissey

Note this is the quick test always run with zn_poly. It passes in 5.7beta3 without debug and it fails in beta4 with SAGE_DEBUG=yes.

comment:2 Changed 8 years ago by zimmerma

the __gmpn_random2 (rp=0x0, n=-5198573331259894519) call is very suspicious, since the second argument should be a size in limbs.

Paul
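
A note on the numbers in the backtrace: the negative size in frame #1 and the huge unsigned size in frame #2 are the same 64-bit pattern, read once as signed and once as unsigned. The sketch below illustrates this, assuming a 64-bit limb count as on this platform; `reinterpret_as_signed` is a hypothetical helper written for this note and is not part of zn_poly, GMP or MPIR.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of zn_poly, GMP or MPIR): reinterpret the
   bits of an unsigned 64-bit value as a signed 64-bit value. */
static int64_t
reinterpret_as_signed (uint64_t u)
{
   int64_t s;

   assert (sizeof s == sizeof u);
   memcpy (&s, &u, sizeof s);   /* copy the bit pattern unchanged */
   return s;
}
```

Calling `reinterpret_as_signed (13248170742449657097ULL)` gives -5198573331259894519, the value shown for `__gmpn_random2`. The unsigned value is also exactly 2 * 6624085371224828549 - 1, so it appears to be derived from the `n` seen in frame #3 rather than being random memory garbage.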

comment:3 Changed 8 years ago by fbissey

Hi Paul,

I suspect that the problem is triggered by enabling the debugging code. Furthermore, zn_poly itself is built with -DNDEBUG regardless of SAGE_DEBUG=yes, and I am wondering if that could cause the problem.

Francois

comment:4 Changed 8 years ago by fbissey

Very odd. The main code is always compiled with -DNDEBUG, with no option to turn it off. But the code for the failing test is all compiled with -DDEBUG, with no way of turning that off either. So it must be happening when SAGE_DEBUG is turned on for some other component of sage. Since no one else seems to have seen it before, it has to be a power7-specific problem.

comment:5 Changed 8 years ago by fbissey

To continue what you started, Paul: in

testcase_mpn_smp_kara (n=6624085371224828549)

n is supposed to be a size_t so I think we have a gross overflow somewhere earlier. The value originates from here:

/*
   Tests mpn_smp_kara for a range of n.
*/
int
test_mpn_smp_kara (int quick)
{
   int success = 1;
   size_t n;
   ulong trial;

   // first a dense range of small problems
   for (n = 2; n <= 30 && success; n++)
   for (trial = 0; trial < (quick ? 300 : 30000) && success; trial++)
      success = success && testcase_mpn_smp_kara (n);

   // now a few larger problems too
   for (trial = 0; trial < (quick ? 100 : 3000) && success; trial++)
   {
      n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2;      // <== n is generated here
      success = success && testcase_mpn_smp_kara (n);
   }

   return success;
}

comment:6 Changed 8 years ago by fbissey

On power7 it appears that ZNP_mpn_smp_kara_thresh is equal to SIZE_MAX which according to /usr/include/stdint.h is

/* Limit of `size_t' type.  */
# if __WORDSIZE == 64
#  define SIZE_MAX              (18446744073709551615UL)
# else
#  define SIZE_MAX              (4294967295U)
# endif

random_ulong is defined by

ulong
random_ulong (ulong max)
{
   return gmp_urandomm_ui (randstate, max);
}

So n needs to be a size_t, which is at most SIZE_MAX, but the test generates a random number between 0 and 3 * SIZE_MAX + 2. <sarcasm> Oh dear! I wonder why that doesn't work. </sarcasm>

I guess it is potentially fine if ZNP_mpn_smp_kara_thresh is not SIZE_MAX; I don't know how it is on other systems.
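
To make the overflow concrete: in unsigned arithmetic the product 3 * SIZE_MAX does not trap, it wraps modulo SIZE_MAX + 1 and becomes SIZE_MAX - 2. The following is a minimal sketch of that arithmetic, not code from zn_poly; the helper names `wrapped_bound`, `bound_overflows` and `checked_bound` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the bound handed to random_ulong (), computed with
   the same unsigned arithmetic as 3 * ZNP_mpn_smp_kara_thresh. */
static size_t
wrapped_bound (size_t thresh)
{
   return 3 * thresh;   /* unsigned: wraps modulo SIZE_MAX + 1 */
}

/* Returns 1 if 3 * thresh is not representable in a size_t. */
static int
bound_overflows (size_t thresh)
{
   return thresh > SIZE_MAX / 3;
}

/* Variant that aborts (when assertions are enabled) instead of wrapping. */
static size_t
checked_bound (size_t thresh)
{
   assert (!bound_overflows (thresh));
   return 3 * thresh;
}
```

With a threshold of 133, `wrapped_bound` returns 399 as intended. With SIZE_MAX it returns SIZE_MAX - 2, so, given that gmp_urandomm_ui returns a value below its bound, the test can draw an n of almost any 64-bit magnitude, which matches the absurd sizes in the backtrace.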

comment:7 Changed 8 years ago by zimmerma

Francois, can you see how ZNP_mpn_smp_kara_thresh is defined on other 64-bit systems, and which kinds of values are generated by n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2?

Paul

comment:8 Changed 8 years ago by fbissey

I am certainly poking at that. The value of ZNP_mpn_smp_kara_thresh is computed by the tuning code and it is clearly allowed to be equal to SIZE_MAX:

   // generate tuning.c file
   printf (header);

   x = ZNP_mpn_smp_kara_thresh;
   printf ("size_t ZNP_mpn_smp_kara_thresh = ");
   printf (x == SIZE_MAX ? "SIZE_MAX;\n" : "%lu;\n", x);

So someone potentially set themselves up for trouble in the test. However, after inserting a few printf calls in the code, the mystery deepens:

mpn_smp_basecase()... ok
mpn_smp_kara()... test: src/mpn_mulmid.c:241: ZNP_mpn_smp_kara: Assertion `n >= 2' failed.
maxtrial= 98 SIZE_MAX= 18446744073709551615
maxtrial= 98
n=31
n=24
n=38
n=40
n=74
n=24
n=28
n=32
n=77
n=67
n=76
n=64
n=13
n=17
n=90
n=42
n=47
n=79
n=21
n=82
n=32
n=10
n=67
n=25
n=26
n=39
n=77
n=90
n=97
n=7
n=74
n=59
n=70
n=87
n=23
n=6
n=70
n=97
n=78
n=74
n=57
n=53
n=28
n=21
n=51
n=33
n=41
n=2
n=88
n=57
n=56
n=96
n=46
n=38
n=69
n=93
n=11
n=61
n=24
n=25
n=45
n=46
n=6
n=44
n=32
n=93
n=59
n=45
n=46
n=31
n=91
n=32
n=45
n=45
n=90
n=61
n=78
n=47
n=33
n=75
n=71
n=37
n=92
n=94
n=50
n=84
n=8
n=43
n=15
n=31
n=31
make: *** [check] Aborted (core dumped)
Error running zn_poly's quick test suite ('make check').

I didn't get the assertion before; after putting these in, we abort rather than segfault.
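
For reference, the way the tuning code writes a threshold (symbolically when it equals SIZE_MAX, numerically otherwise) can be sketched as below. This is an illustrative rewrite, not the upstream code: `format_threshold` is a hypothetical helper, and it uses a fixed format string with `%zu`, the C99 conversion for size_t, instead of selecting the format string with a conditional expression.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write one threshold definition in the style of the
   generated tuning.c, printing the SIZE_MAX sentinel symbolically.
   Returns the value of snprintf (), i.e. the length of the full line. */
static int
format_threshold (char *buf, size_t size, const char *name, size_t x)
{
   assert (buf != NULL && name != NULL);

   if (x == SIZE_MAX)
      return snprintf (buf, size, "size_t %s = SIZE_MAX;\n", name);

   return snprintf (buf, size, "size_t %s = %zu;\n", name, x);
}
```

Keeping the sentinel symbolic means the generated file stays correct on both 32-bit and 64-bit builds, but it also means every consumer of the threshold has to be prepared for SIZE_MAX.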

comment:9 Changed 8 years ago by zimmerma

I guess there is a bug in the tuning code, which should not give ZNP_mpn_smp_kara_thresh a huge value.

Paul

comment:10 Changed 8 years ago by dmharvey

I am the author.... thanks Paul for drawing my attention to this.

I haven't looked at this code for years so it's almost as mysterious to me as to everyone else here!

My guess is that the bug is in the test code rather than in the tuning code. I suspect that the threshold is allowed to be SIZE_MAX, but that the line

n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2; 

should be replaced by e.g.

if (ZNP_mpn_smp_kara_thresh == SIZE_MAX)
   n = random_ulong (100) + 2;
else
   n = random_ulong (3 * ZNP_mpn_smp_kara_thresh) + 2; 

It could also be a bug in the tuning code, but that would be much harder to fix. If I remember correctly what this threshold means, it is very surprising to me that its optimal value is SIZE_MAX on any real system.
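
The suggested guard can be tried in isolation with the sketch below. It is a stand-alone illustration under stated assumptions, not the patch that was merged: `pick_large_n` and `rand_below_worst_case` are hypothetical names, and the random source is passed in as a function pointer so that the example does not depend on GMP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for random_ulong (): any function returning a value in
   [0, max - 1]. The signature is an assumption made for this sketch. */
typedef unsigned long (*rand_below_fn) (unsigned long max);

/* Deterministic stand-in used only to exercise the sketch: always returns
   the largest permitted value, max - 1. */
static unsigned long
rand_below_worst_case (unsigned long max)
{
   return max - 1;
}

/* Picks the size of a "larger" test problem. If the tuned threshold is the
   SIZE_MAX sentinel, fall back to a fixed bound of 100, as suggested in
   the comment above. */
static size_t
pick_large_n (size_t thresh, rand_below_fn rand_below)
{
   size_t n;

   if (thresh == SIZE_MAX)
      n = rand_below (100) + 2;
   else
      n = rand_below (3 * thresh) + 2;

   assert (n >= 2);
   return n;
}
```

With the worst-case stand-in, `pick_large_n (SIZE_MAX, rand_below_worst_case)` returns 101 and `pick_large_n (133, rand_below_worst_case)` returns 400. Note that the guard only covers the exact sentinel: a threshold larger than SIZE_MAX / 3 but different from SIZE_MAX would still wrap.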

comment:11 Changed 8 years ago by fbissey

Thanks for the code. My last error was due to me trying to do something similar and failing to read the original code properly (putting the +2 inside the bracket). power7 is a strange beast, but it is unlikely that SIZE_MAX is the optimal value. The tuning probably assumes something that is wrong on this platform, and that would indeed be difficult to find.
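
Why moving the +2 inside the bracket matters can be seen from the ranges involved. The sketch below assumes the random source returns a value in [0, bound - 1], which is the documented contract of gmp_urandomm_ui; `range_t` and the two helper functions are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Inclusive range of values a draw can take. */
typedef struct
{
   unsigned long lo;
   unsigned long hi;
} range_t;

/* Range of rand_below (bound) + 2, with rand_below (m) in [0, m - 1]. */
static range_t
range_offset_outside (unsigned long bound)
{
   assert (bound >= 1);
   return (range_t) { 2, bound + 1 };
}

/* Range of rand_below (bound + 2): the offset was moved inside the call. */
static range_t
range_offset_inside (unsigned long bound)
{
   assert (bound >= 1);
   return (range_t) { 0, bound + 1 };
}
```

For a bound of 100 the first form yields values in [2, 101], while the second yields values in [0, 101]. The second form can therefore produce n = 0 or n = 1, which would be consistent with the `n >= 2` assertion failure reported earlier.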

comment:12 Changed 8 years ago by fbissey

Not sure what happened. I wanted to do another run to post tuning.c, but the value of ZNP_mpn_smp_kara_thresh is now 133. I swear it was SIZE_MAX before. There are still plenty of SIZE_MAX values in the file:

#include "zn_poly_internal.h"

size_t ZNP_mpn_smp_kara_thresh = 133;
size_t ZNP_mpn_mulmid_fallback_thresh = 4868;

tuning_info_t tuning_info[] = 
{
   {  // bits = 0
   },
   {  // bits = 1
   },
   {  // bits = 2
         94,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        270,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        206,   // KS1 -> KS2 middle product threshold
   SIZE_MAX,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold
       1000,   // nussbaumer multiplication threshold
       1000    // nussbaumer squaring threshold
   },
   {  // bits = 3
        105,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        270,   // KS1 -> KS2 squaring threshold
       9634,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        120,   // KS1 -> KS2 middle product threshold
   SIZE_MAX,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold
       1000,   // nussbaumer multiplication threshold
       1000    // nussbaumer squaring threshold
   },
   {  // bits = 4
        123,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        154,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        132,   // KS1 -> KS2 middle product threshold
   SIZE_MAX,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold

comment:13 Changed 8 years ago by zimmerma

Francois, in any case it does not hurt to implement what David suggests in comment 10. This should fix this ticket once and for all.

Paul

comment:14 Changed 8 years ago by fbissey

I can say it worked nicely, so I'll prepare a new spkg with it so this kind of thing cannot happen again. I think I found out what happened and what made things different. In the original build I used gcc-4.7.1; in this build the compiler was the gcc shipped with the distro, gcc-4.3.4. There could be some subtle bugs lurking in gcc itself or in the standard used to compile the tuning code. Here is the tuning.c with SIZE_MAX:

#include "zn_poly_internal.h"

size_t ZNP_mpn_smp_kara_thresh = SIZE_MAX;
size_t ZNP_mpn_mulmid_fallback_thresh = SIZE_MAX;

tuning_info_t tuning_info[] = 
{
   {  // bits = 0
   },
   {  // bits = 1
   },
   {  // bits = 2
         94,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        218,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        216,   // KS1 -> KS2 middle product threshold
   SIZE_MAX,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold
       1000,   // nussbaumer multiplication threshold
       1000    // nussbaumer squaring threshold
   },
   {  // bits = 3
        107,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        167,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        146,   // KS1 -> KS2 middle product threshold
       6889,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold
       1000,   // nussbaumer multiplication threshold
       1000    // nussbaumer squaring threshold
   },
   {  // bits = 4
         68,   // KS1 -> KS2 multiplication threshold
   SIZE_MAX,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        187,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
         95,   // KS1 -> KS2 middle product threshold
       7367,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold
       1000,   // nussbaumer multiplication threshold
       1000    // nussbaumer squaring threshold
   },
   {  // bits = 5
         60,   // KS1 -> KS2 multiplication threshold
      18841,   // KS2 -> KS4 multiplication threshold
   SIZE_MAX,   // KS4 -> FFT multiplication threshold
        192,   // KS1 -> KS2 squaring threshold
   SIZE_MAX,   // KS2 -> KS4 squaring threshold
   SIZE_MAX,   // KS4 -> FFT squaring threshold
        128,   // KS1 -> KS2 middle product threshold
       5037,   // KS2 -> KS4 middle product threshold
   SIZE_MAX,   // KS4 -> FFT middle product threshold

Changed 8 years ago by fbissey

patch added to zn_poly for review purposes

comment:15 Changed 8 years ago by fbissey

  • Authors set to Francois Bissey, David Harvey
  • Description modified (diff)
  • Milestone changed from sage-5.7 to sage-5.8
  • Status changed from new to needs_review

OK, the new spkg is ready for review. I also attached the patch for review, but it is just David's code.

comment:16 Changed 8 years ago by zimmerma

  • Cc jdemeyer added

The patch looks fine to me. However, since I have no access to a power7, I can only check the patch and the new package on another computer. Jeroen, how should we proceed in that case, assuming Francois (the author of the patch and new package) is the only person with access to a power7?

Paul

comment:17 Changed 8 years ago by jdemeyer

  • Reviewers set to Paul Zimmermann
  • Status changed from needs_review to positive_review

I don't mind giving positive_review in this case. We can reasonably expect that the author has tested the package on the failing machine.

comment:18 Changed 8 years ago by zimmerma

I don't mind giving positive_review in this case.

However, I'd first like to check that the new spkg works on my machine.

Paul

comment:19 Changed 8 years ago by jdemeyer

This doesn't even build:

Applying patches to upstream sources...
makemakefile.py.patch
patching file makemakefile.py
mpn_mulmid-test.c.patch
patching file test/mpn_mulmid-test.c
Hunk #1 FAILED at 121.
1 out of 1 hunk FAILED -- saving rejects to file test/mpn_mulmid-test.c.rej
Error: '../patches/mpn_mulmid-test.c.patch' failed to apply.

real    0m0.011s
user    0m0.010s
sys     0m0.000s
************************************************************************
Error installing package zn_poly-0.9.p10
************************************************************************

comment:20 Changed 8 years ago by jdemeyer

  • Status changed from positive_review to needs_work

comment:21 Changed 8 years ago by fbissey

  • Status changed from needs_work to needs_review

Sorry, I made a big mistake when preparing the final spkg (the source was not pristine, and that is rather unforgiving). It should be OK now (I double-checked).

comment:22 Changed 8 years ago by zimmerma

  • Reviewers changed from Paul Zimmermann to Paul Zimmermann, Jeroen Demeyer
  • Status changed from needs_review to positive_review

all tests now pass on my computer (on top of Sage 5.6).

Paul

comment:23 Changed 8 years ago by jdemeyer

  • Merged in set to sage-5.8.beta1
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:24 Changed 8 years ago by leif

zn_poly's tuning (and apparently due to that its test suite, too) is flaky on other systems as well: #13947

It would be nice if some of you could also take a look at that... :P

comment:25 Changed 8 years ago by fbissey

Yes it looks similar. Turning debugging on was somewhat helpful here.

comment:26 Changed 5 years ago by chapoton

  • Authors changed from Francois Bissey, David Harvey to François Bissey, David Harvey