#27676 closed defect (wontfix)

sage miscompiles with gcc9

Reported by: Konrad127123 Owned by:
Priority: major Milestone: sage-duplicate/invalid/wontfix
Component: packages: standard Keywords: gcc, openblas, brial
Cc: fbissey Merged in:
Authors: Reviewers: Steven Trogdon
Report Upstream: N/A Work issues:
Branch: u/Konrad127123/gcc-9-test2 (Commits) Commit: 6a54ee185d83bdf29877d6fc6ec6a54236812bc6
Dependencies: Stopgaps:

Description (last modified by Konrad127123)

gcc-9 is due out in late April/early May 2019. With the current gcc-9 snapshot, sage has issues where it miscompiles.

There seem to be two sets of issues:

  1. openblas-0.3.5 doesn't compile correctly with gcc-9. This makes lots of errors occur if SAGE_CHECK="yes" (in openblas and other packages that depend on it) and makes make testlong output lots of errors and eventually just hang. As of writing (2019/04/16), this seems to be fixed in upstream git.
  1. Something seems to go wrong with brial or something related. make testlong has errors on the following files:
    sage -t --long --warn-long 31.0 src/sage/crypto/mq/sr.py  # Killed due to abort
    sage -t --long --warn-long 31.0 src/sage/rings/polynomial/multi_polynomial_sequence.py  # Killed due to abort
    sage -t --long --warn-long 31.0 src/sage/rings/polynomial/pbori.pyx  # Killed due to abort
    

As a minor issue, with SAGE_CHECK="yes", make doesn't seem to stop after compiling openblas, despite many tests failing.

Latest gcc-9 snapshot as of writing: ftp://gcc.gnu.org/pub/gcc/snapshots/9-20190414/gcc-9-20190414.tar.xz

Current openblas upstream as of writing: https://community.dur.ac.uk/konrad.dabrowski/sage/openblas-0.3.5b.tar.gz

Attachments (5)

openblas-0.3.5.p1.log.gz (187.4 KB) - added by Konrad127123 15 months ago.
Test suite output of openblas with gcc-9
testlong.log.gz (98.9 KB) - added by Konrad127123 15 months ago.
make testlong output with gcc-9 and openblas from upstream git.
sr.txt.gz (20.9 KB) - added by Konrad127123 15 months ago.
Output of sage -t --long --warn-long 31.0 src/sage/crypto/mq/sr.py run standalone
multi_polynomial_sequence.txt.gz (13.5 KB) - added by Konrad127123 15 months ago.
Output of sage -t --long --warn-long 31.0 src/sage/rings/polynomial/multi_polynomial_sequence.py run standalone
pbori.txt.gz (20.4 KB) - added by Konrad127123 15 months ago.
Output of sage -t --long --warn-long 31.0 src/sage/rings/polynomial/pbori.pyx run standalone

Download all attachments as: .zip

Change History (46)

Changed 15 months ago by Konrad127123

Test suite output of openblas with gcc-9

Changed 15 months ago by Konrad127123

make testlong output with gcc-9 and openblas from upstream git.

comment:1 Changed 15 months ago by Konrad127123

  • Branch set to u/Konrad127123/gcc-9-test2

comment:2 Changed 15 months ago by fbissey

  • Cc fbissey added
  • Commit set to 6a54ee185d83bdf29877d6fc6ec6a54236812bc6

New commits:

884e9daisl patch is in gcc-9
a11398eDisable isl when building gcc to avoid inconsistently building gcc with sage's optional isl package. See also #26735.
2d6dc98Update gcc to the 9-20190414 snapshot
6a54ee1Update OpenBLAS to 2019/04/14

comment:3 Changed 15 months ago by Konrad127123

  • Description modified (diff)

comment:4 Changed 15 months ago by Konrad127123

  • Description modified (diff)

comment:5 Changed 15 months ago by Konrad127123

FYI: The branch currently attached to this ticket is to help with testing/investigating. It's not an attempt to fix the bugs.

Changed 15 months ago by Konrad127123

Output of sage -t --long --warn-long 31.0 src/sage/crypto/mq/sr.py run standalone

Changed 15 months ago by Konrad127123

Output of sage -t --long --warn-long 31.0 src/sage/rings/polynomial/multi_polynomial_sequence.py run standalone

Changed 15 months ago by Konrad127123

Output of sage -t --long --warn-long 31.0 src/sage/rings/polynomial/pbori.pyx run standalone

comment:6 Changed 14 months ago by dimpase

With ModuleNotFoundError: No module named 'Cython', as seen in the logs, all bets are off.

I tried gcc 9.1 (from Gentoo) and got a lot of "Bad exit" in tests from memory overflows.

comment:7 Changed 14 months ago by fbissey

Have you run any tests on individual packages? I would expect a few things to come from individual packages.

comment:8 Changed 14 months ago by vbraun

Maybe we can upgrade to the released openblas 0.3.6 on a separate ticket to get that out of the way

comment:9 Changed 14 months ago by vbraun

This is now 27847

comment:10 Changed 14 months ago by vbraun

With #27847 only the following failures remain

sage -t --long src/sage/crypto/mq/sr.py  # Killed due to abort
sage -t --long src/sage/rings/polynomial/multi_polynomial_sequence.py  # Killed due to abort
sage -t --long src/sage/rings/polynomial/pbori.pyx  # Killed due to abort

comment:11 Changed 14 months ago by fbissey

That's less than I have in sage-on-gentoo but I haven't rebuilt everything yet. But there is really a track leading to pbori not sure if it is just the sage interface or brial itself. brial passes its own testsuite with gcc-9.1 that's already something but not necessarilly conclusive that it is all on sage-side.

From what Steve Trogdon posted to me

Cython backtrace
----------------
#0  0x00007fe418a5fd80 in __waitpid () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-libs/glibc-2.29-r2/work/glibc-2.29/nptl/../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x00007fe4104374eb in print_enhanced_backtrace () at /storage/strogdon/gentoo-rap/var/tmp/portage/dev-python/cysignals-1.10.2/work/cysignals-1.10.2/build/src/cysignals/implementation.c:563
#2  0x00007fe410437687 in sigdie () at /storage/strogdon/gentoo-rap/var/tmp/portage/dev-python/cysignals-1.10.2/work/cysignals-1.10.2/build/src/cysignals/implementation.c:589
#3  0x00007fe410436a0e in sigdie_for_sig () at /storage/strogdon/gentoo-rap/var/tmp/portage/dev-python/cysignals-1.10.2/work/cysignals-1.10.2/build/src/cysignals/implementation.c:160
#4  0x00007fe410436ba6 in cysigs_signal_handler () at /storage/strogdon/gentoo-rap/var/tmp/portage/dev-python/cysignals-1.10.2/work/cysignals-1.10.2/build/src/cysignals/implementation.c:262
#5  0x00007fe4190cea30 in __restore_rt ()
#6  0x00007fe4186b6330 in __GI_raise () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-libs/glibc-2.29-r2/work/glibc-2.29/signal/../sysdeps/unix/sysv/linux/raise.c:50
#7  0x00007fe41869f7a4 in __GI_abort () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-libs/glibc-2.29-r2/work/glibc-2.29/stdlib/abort.c:79
#8  0x00007fe401744d20 in __gnu_cxx::__verbose_terminate_handler () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-devel/gcc-9.1.0/work/gcc-9.1.0/libstdc++-v3/libsupc++/vterminate.cc:95
#9  0x00007fe401742dc0 in __cxxabiv1::__terminate () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-devel/gcc-9.1.0/work/gcc-9.1.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
#10 0x00007fe401742e00 in std::terminate () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-devel/gcc-9.1.0/work/gcc-9.1.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
#11 0x00007fe401743000 in __cxxabiv1::__cxa_throw () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-devel/gcc-9.1.0/work/gcc-9.1.0/libstdc++-v3/libsupc++/eh_throw.cc:95
#12 0x00007fe401715ac6 in std::__throw_length_error () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-devel/gcc-9.1.0/work/gcc-9.1.0/libstdc++-v3/src/c++11/functexcept.cc:78
#13 0x00007fe3aa2e6ea2 in std::deque<polybori::CCuddNavigator, std::allocator<polybori::CCuddNavigator> >::_M_new_elements_at_back () at /storage/strogdon/gentoo-rap/usr/lib64/gcc/x86_64-pc-linux-gnu/9.1.0/include/g++-v9/bits/stl_deque.h:2191
  2186           *  Makes sure the _M_map has space for new nodes.  Does not
  2187           *  actually add the nodes.  Can invalidate _M_map pointers.
  2188           *  (And consequently, %deque iterators.)
  2189           */
  2190          void
> 2191          _M_reserve_map_at_back(size_type __nodes_to_add = 1)
  2192          {
  2193          if (__nodes_to_add + 1 > this->_M_impl._M_map_size
  2194              - (this->_M_impl._M_finish._M_node - this->_M_impl._M_map))
  2195            _M_reallocate_map(__nodes_to_add, false);

I'd say there is something to fix in Cudd inside brial.

comment:12 Changed 14 months ago by strogdon

I should have posted this elsewhere.

Last edited 14 months ago by strogdon (previous) (diff)

comment:13 Changed 14 months ago by vbraun

Brial testsuite passes for me, too (gcc 9.1.1 / Fedora 30)

comment:14 Changed 14 months ago by vbraun

I think the culprit is in a higher stack frame, presumably std::deque runs out of memory because an invalid size is passed in. In particular, this looks fishy:

#10 pbori_cuddInitCache (unique=unique@entry=0x7f8185b293f0, 
    cacheSize=cacheSize@entry=262144, maxCacheSize=699050, 
    maxCacheSize@entry=<error reading variable: DW_OP_const_type has different sizes for type and data>) at cuddCache.c:152
        i = <optimized out>
        logSize = 18
        mem = <optimized out>
        offset = <optimized out>
        __PRETTY_FUNCTION__ = "pbori_cuddInitCache"

comment:15 Changed 14 months ago by vbraun

There is quite a fat memory leak, e.g.

sage: for i in range(300):
....:     B = BooleanPolynomialRing(5,'x{}'.format(i))

eats 5GB. Changing the memory limit, e.g.

./sage -t --memlimit=2000 src/sage/rings/polynomial/pbori.pyx

changes where the error occurs

comment:16 Changed 14 months ago by vbraun

Disabling the memlimit lets me complete

sage -t --memlimit=0 src/sage/rings/polynomial/pbori.pyx

but does eat much more memory than before. Doing it with the other tests eats all available memory

comment:17 Changed 14 months ago by dimpase

Did you try going from -std=gnu++11 to -std=c++11 ? This can be done by applying

  • build/pkgs/gcc/spkg-configure.m4

    a b SAGE_SPKG_CONFIGURE([gcc], [ 
    9494        IS_REALLY_GCC=yes
    9595    fi
    9696
    97     AX_CXX_COMPILE_STDCXX_11([], optional)
     97    AX_CXX_COMPILE_STDCXX_11([noext], optional)
    9898    if test $HAVE_CXX11 != 1; then
    9999        SAGE_MUST_INSTALL_GCC([your C++ compiler does not support C++11])
    100100    fi

comment:18 Changed 14 months ago by dimpase

as well, our boost (and pbori uses it) is 2.5 y.o. version 1.66, while the current one is 1.70.

Last edited 14 months ago by dimpase (previous) (diff)

comment:19 Changed 14 months ago by vbraun

Brial's reference counting implementation miscompiles with gcc9, and the branch where the refcount falls back to zero is never executed. Hence the ram usage ballooning. I've opened an upstream bug:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90535

comment:20 Changed 14 months ago by dimpase

it is insane that this kind of address arithmetic is invalid C, as they imply on the ticket. Where is it said in the standard?

comment:21 follow-up: Changed 14 months ago by fbissey

I don't think they imply this is invalid C. But this is fairly rare occurrence and, if I am reading correctly, accounting for it everywhere means some optimisation opportunities are lost. As you'll note, Volker said the example works at -O0. This is from GNU gcc devs from whom I learned that pessimized was a word (at least they use it).

I should be able to do a micro release of brial to include that option this week. PR welcome if you feel faster. I may also check if I can safely update the cudd code in brial to a more recent version. It may be that newer cudd doesn't exhibit the problem.

@Volker in which brial file does the incriminated code occur?

comment:22 Changed 14 months ago by dimpase

OK, if they willingly mis-compile valid code for the sake of optimisation, they should at least print fat warnings about code branches they discard...

comment:23 in reply to: ↑ 21 Changed 14 months ago by strogdon

Replying to fbissey:

...

I should be able to do a micro release of brial to include that option this week. PR welcome if you feel faster. I may also check if I can safely update the cudd code in brial to a more recent version. It may be that newer cudd doesn't exhibit the problem. ...

Perhaps I misunderstand but my s-o-g brial build uses -O0; in fact, the system is built that way. And the tests still fail.

comment:24 follow-up: Changed 14 months ago by fbissey

OK, Volker's example works at -O0 may be our real life code is a bit more complicated. Could you compile brial with -fno-delete-null-pointer-checks in the CFLAGS (and CXXFLAGS too for good measure).

comment:25 in reply to: ↑ 24 Changed 14 months ago by strogdon

Replying to fbissey:

OK, Volker's example works at -O0 may be our real life code is a bit more complicated. Could you compile brial with -fno-delete-null-pointer-checks in the CFLAGS (and CXXFLAGS too for good measure).

I still get the 3 failures, brial built with

CFLAGS="-march=native -O0 -fno-delete-null-pointer-checks -pipe -g -ggdb"
CXXFLAGS="${CFLAGS}"

comment:26 Changed 14 months ago by vbraun

Reading more into this, C++11 Standard 5.7.5 says that pointer arithmetic on a pointer that is not pointing into an element of an array is undefined behavior. So in particular the compiler is free to assume that --ptr cannot be NULL, since you can't have allocated an array that starts at 0x1.

The offending code is

/// Release current pointer by decrementing reference counting
inline void 
intrusive_ptr_release(PBORI_PREFIX(DdManager)* ptr) {
   if (!(--(ptr->hooks))) {
 
    PBORI_ASSERT(PBORI_PREFIX(Cudd_CheckZeroRef)(ptr) == 0);
    PBORI_PREFIX(Cudd_Quit)(ptr);
  }
}

Really we should just change DdManager.hooks to be an std::size_t instead of char*, why on earth are you using pointer arithmetic to count in the first place. I think thats going to be a simple change since its only used in one or two places.

These are inlined headers so you also need to rebuild the sage library....

comment:27 follow-up: Changed 14 months ago by vbraun

Indeed, this succeeds for me:

export CFLAGS='-fno-delete-null-pointer-checks'
export CXXFLAGS='-fno-delete-null-pointer-checks'
./sage -p brial
touch src/sage/rings/polynomial/pbori.pyx 
./sage -b
./sage -t --long src/sage/rings/polynomial/pbori.pyx
./sage -t --long src/sage/crypto/mq/sr.py
./sage -t --long src/sage/rings/polynomial/multi_polynomial_sequence.py

comment:28 Changed 14 months ago by fbissey

OK I thought it was in the C code from Cudd. But this is the C++ interface to Cudd from brial. That code is in libbrial/include/polybori/rings/CCuddInterface.h. There is some more suspicious code in libbrial/include/polybori/rings/CCuddCore.h

/// Release this by decrementing reference counting
  refcount_type release() {
    return (--ref);
  }

comment:29 Changed 14 months ago by vbraun

refcount_type is just a typedef for std::size_t so thats perfect.

Since we don't want to change the DdManager struct I guess we have to allocate a value for DdManager.hooks.

Last edited 14 months ago by vbraun (previous) (diff)

comment:30 Changed 14 months ago by fbissey

If you send a pull request I can make a quick release of brial.

comment:31 in reply to: ↑ 27 Changed 14 months ago by strogdon

Replying to vbraun:

Indeed, this succeeds for me:

export CFLAGS='-fno-delete-null-pointer-checks'
export CXXFLAGS='-fno-delete-null-pointer-checks'
./sage -p brial
touch src/sage/rings/polynomial/pbori.pyx 
./sage -b
./sage -t --long src/sage/rings/polynomial/pbori.pyx
./sage -t --long src/sage/crypto/mq/sr.py
./sage -t --long src/sage/rings/polynomial/multi_polynomial_sequence.py

OK, rebuilding Sage also with the new CFLAGS does work here. I never thought about the headers.

comment:32 Changed 14 months ago by dimpase

By the way, perhaps it's time to drop gcc 4. Indeed, 4.9.4 was the last 4.* release, and it happened in 2016.

comment:33 Changed 14 months ago by fbissey

I am now using the following if gcc-9.1 is present in sage-on-gentoo

  • sage/libs/polybori/decl.pxd

    diff --git a/sage/libs/polybori/decl.pxd b/sage/libs/polybori/decl.pxd
    index dd6a3aa..edb7bb7 100644
    a b  
    11# distutils: language = c++
    2 # distutils: extra_compile_args = -std=c++11
     2# distutils: extra_compile_args = -std=c++11 -fno-delete-null-pointer-checks
    33
    44from libcpp.string cimport string as std_string
    55from libcpp.vector cimport vector

It is possibly over-broad.

comment:35 Changed 13 months ago by fbissey

New brial release https://github.com/BRiAl/BRiAl/releases/download/1.2.5/brial-1.2.5.tar.bz2 with the PR included. Do we create a separate ticket for the upgrade?

comment:36 Changed 13 months ago by vbraun

Please do make a ticket for that!

comment:37 Changed 13 months ago by fbissey

#27942 is ready for review for brial.

comment:38 Changed 13 months ago by embray

  • Milestone sage-8.8 deleted

As the Sage-8.8 release milestone is pending, we should delete the sage-8.8 milestone for tickets that are not actively being worked on or that still require significant work to move forward. If you feel that this ticket should be included in the next Sage release at the soonest please set its milestone to the next release milestone (sage-8.9).

comment:39 Changed 11 months ago by fbissey

  • Milestone set to sage-duplicate/invalid/wontfix
  • Status changed from new to needs_review

I think we pretty much fixed this by opening separate tickets for each separate problems. So now, this meta ticket should be closed.

comment:40 Changed 11 months ago by strogdon

  • Reviewers set to Steven Trogdon
  • Status changed from needs_review to positive_review

I agree.

comment:41 Changed 11 months ago by embray

  • Resolution set to wontfix
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.