Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#14460 closed defect (fixed)

GCC-4.8.0 miscompiles some sig_on() statements

Reported by: Jeroen Demeyer Owned by: Georg S. Weber
Priority: critical Milestone: sage-5.9
Component: build Keywords:
Cc: Merged in: sage-5.9.rc0
Authors: Jeroen Demeyer Reviewers: Jean-Pierre Flori
Report Upstream: Fixed upstream, but not in a stable release. Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Status badges

Description (last modified by Jeroen Demeyer)

With #14453, the are crashes in libGAP due to a GCC-4.8.0 bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56982. If the Sage library is compiled with -Og, the following doctest from sage/libs/gap/element.pyx crashes with an unhandled SIGABRT:

sage: F1 = libgap.FreeGroup(['a']) ## line 427 ## 
sage: F2 = libgap.FreeGroup(['a']) ## line 428 ## 
sage: F1 == F2 ## line 429 ## 

libGAP is one of the few places in Sage where non-trivial use of the sig_on() signal handling is actually tested, so there might be many more places where the problem could occur.

Compiling with -fno-tree-dominator-opts seems to fix the issue, we should add this flag to compile Cython code with GCC >= 4.8.0.

Attachments (1)

14460_gcc_bug.patch (1.1 KB) - added by Jeroen Demeyer 9 years ago.

Download all attachments as: .zip

Change History (33)

comment:1 Changed 9 years ago by Jeroen Demeyer

Description: modified (diff)

comment:2 Changed 9 years ago by Jeroen Demeyer

Report Upstream: Reported upstream. No feedback yet.Reported upstream. Developers acknowledge bug.

comment:3 Changed 9 years ago by Jeroen Demeyer

Authors: Jeroen Demeyer
Milestone: sage-5.10sage-5.9
Priority: majorcritical
Status: newneeds_review

comment:4 Changed 9 years ago by Jeroen Demeyer

Description: modified (diff)

comment:5 Changed 9 years ago by Jean-Pierre Flori

I guess the plan is to patch Sage again when GCC fixes the problem (hopefully in 4.8.1 and 4.9.?)?

comment:6 Changed 9 years ago by Volker Braun

Are we going to remember this, or will that just linger around forever until some setup-archaeologist finds it again? I would prefer just to shitlist a particular version, so if it still crashes the next time then we'll know that its not fixed ;-)

In fact, the imho best way is to stick it into the compiler wrapper so that everything (and not just the sage library) gets no-tree-dominator-opts enforced if and only if the gcc version is affected. Though thats a bit of a longer project...

So, if you want to go on with this bandaid then thats good enough for now.

comment:7 in reply to:  6 ; Changed 9 years ago by Jeroen Demeyer

Replying to vbraun:

Are we going to remember this, or will that just linger around forever until some setup-archaeologist finds it again? I would prefer just to shitlist a particular version, so if it still crashes the next time then we'll know that its not fixed ;-)

Well, at least it has the necessary pointers to the Sage and GCC bug report, so the setup-archaeologist should be able to determine whether or not the fix is still needed. I find undocumented work-arounds far worse than obsolete work-arounds.

About this particular fix: it doesn't currently give doctest failures. If we only apply this fix for GCC-4.8.0 and GCC-4.8.1 comes out without the bug fix, who will remember to check for this bug?

comment:8 in reply to:  7 ; Changed 9 years ago by Jean-Pierre Flori

Replying to jdemeyer:

Replying to vbraun:

Are we going to remember this, or will that just linger around forever until some setup-archaeologist finds it again? I would prefer just to shitlist a particular version, so if it still crashes the next time then we'll know that its not fixed ;-)

Well, at least it has the necessary pointers to the Sage and GCC bug report, so the setup-archaeologist should be able to determine whether or not the fix is still needed. I find undocumented work-arounds far worse than obsolete work-arounds.

About this particular fix: it doesn't currently give doctest failures. If we only apply this fix for GCC-4.8.0 and GCC-4.8.1 comes out without the bug fix, who will remember to check for this bug?

I agree its a saner approach to hope archeologists will limit the GCC versions for which the fix is applied. At worst, it won't be applied when GCC 4.10 gets stable.

Great you mentioned both the upstream bug and this track ticket in the patch. Maybe a direct indication of what the bug is could help, e.g. adding "which let sig_on() miscompile" after "bug". Even without it I'll give it positive_review, when I can reproduce the problem and check its indeed solved with the patch. (I assume the space between 'null' and '"""' is there for ease of reading.) From #14460, I thought the crash appeared when doctesting sage/structure/element.pyx? But after recompiling libgap, this file, and everything in sage/libs/gap, using gcc-4.8 from debian running "./sage -t" on this file is fine. You mention that there are no failing doctests, so I may have missed something.

Using the code you posted at gcc bugtracker I get

[jp@jp-x220]% cd ../bla                            ~/boulot/sage/sage-5.9.beta5
[jp@jp-x220]% gcc-4.8 -Og jmp.c                               ~/boulot/sage/bla
[jp@jp-x220]% ./a.out                                         ~/boulot/sage/bla
Returning 1
x = 0, n = 1
Returning 0
x = 42, n = 1
zsh: abort      ./a.out

So I guess debian's gcc does not include a magical fix.

comment:9 in reply to:  7 ; Changed 9 years ago by Nils Bruin

Replying to jdemeyer:

About this particular fix: it doesn't currently give doctest failures. If we only apply this fix for GCC-4.8.0 and GCC-4.8.1 comes out without the bug fix, who will remember to check for this bug?

Would it be easily doable to include a doctest that DOES fail in the presence of this bug?

comment:10 in reply to:  9 ; Changed 9 years ago by Jeroen Demeyer

Replying to nbruin:

Would it be easily doable to include a doctest that DOES fail in the presence of this bug?

Yes, as I mentioned #14453 does that.

comment:11 Changed 9 years ago by Jeroen Demeyer

Description: modified (diff)

comment:12 in reply to:  8 Changed 9 years ago by Jeroen Demeyer

Replying to jpflori:

Maybe a direct indication of what the bug is could help, e.g. adding "which let sig_on() miscompile" after "bug".

Done.

comment:13 in reply to:  10 Changed 9 years ago by Jean-Pierre Flori

Replying to jdemeyer:

Replying to nbruin:

Would it be easily doable to include a doctest that DOES fail in the presence of this bug?

Yes, as I mentioned #14453 does that.

I meant #14453 in my previous post. I'll rebuild the Sage library and retry doctesting sage/libs/gap/element.pyx

comment:14 Changed 9 years ago by Jean-Pierre Flori

I've rebuilt Sage (5.9.beta5 + #10508) completely using Debian gcc/g++/gfortran 4.8.0 without any env variables except for MAKE, CC, GXX, FC and SAGE_ATLAS_ARCH, then applied #14453, rerun "./sage -b", but "./sage -t devel/sage/sage/libs/gap/element.pyx" still passes its doctests...

I don't doubt GCC is broken as I can reproduce its buggy behavior on the piece of code you provided, but it would have been nice to confirm the fix indeed fixes potential problems in Sage.

Last minute info: ok I just needed to set CFLAGS="-Og" to reproduce the bug, I guess this should be more explicitely documented. It's stated on the GCC ticket but nowhere on Sage's Trac.

comment:15 Changed 9 years ago by Jean-Pierre Flori

Status: needs_reviewneeds_work

And when "-Og" is used, then the fix indeed fixes the problem.

So let's mention "-Og" that in the patch as well (let's say in setup.py where the fix is triggered, not sure it will be easy to find near the problematic doctest itself, for that the archeologist will have to come to the trac ticket).

comment:16 Changed 9 years ago by Jeroen Demeyer

Description: modified (diff)

Changed 9 years ago by Jeroen Demeyer

Attachment: 14460_gcc_bug.patch added

comment:17 Changed 9 years ago by Jeroen Demeyer

Status: needs_workneeds_review

comment:18 Changed 9 years ago by Jean-Pierre Flori

Reviewers: Jean-Pierre Flori
Status: needs_reviewpositive_review

comment:19 Changed 9 years ago by Jeroen Demeyer

By the way, there is lots of activity on the GCC ticket. It seems some variation of this bug is reproducible back to GCC 4.1.

comment:20 in reply to:  19 ; Changed 9 years ago by Leif Leonhardy

Replying to jdemeyer:

By the way, there is lots of activity on the GCC ticket. It seems some variation of this bug is reproducible back to GCC 4.1.

Well, if that's true, are you sure same or similar can't happen within the Sage library with GCC versions != 4.![89]*?

On the other hand, if the bug only shows up in conjunction with -Og (or maybe -Os as well), does it make sense to add -fno-tree-dominator-opts regardless of the setting of CFLAGS etc., and for every extension module?


IMHO the scenario is specific enough that it's not worth fixing it within Sage at all (and there's an easy "work-around": just add -fno-tree-dominator-opts to CFLAGS in case it also contains -- not to say "you added" -- -Og [and GCC is 4.8.0]).

We do not add a potentially missing -fno-strict-aliasing for compiling Cython-generated code (which is a much more common or likely case, and btw. completely independent of the compiler version), nor do we add -fno-ipa-cp-clone when building R (with GCC 4.7.x and -O3).

comment:21 in reply to:  20 Changed 9 years ago by Jean-Pierre Flori

Replying to leif:

On the other hand, if the bug only shows up in conjunction with -Og (or maybe -Os as well), does it make sense to add -fno-tree-dominator-opts regardless of the setting of CFLAGS etc., and for every extension module?

Good point.

nor do we add -fno-ipa-cp-clone when building R (with GCC 4.7.x and -O3).

Maybe we should :) especially that we ship 4.7.x in Sage.

comment:22 in reply to:  20 ; Changed 9 years ago by Jeroen Demeyer

Replying to leif:

Well, if that's true, are you sure same or similar can't happen within the Sage library with GCC versions != 4.![89]*?

Perhaps...

the bug only shows up in conjunction with -Og

This is unfounded speculation.

does it make sense to add -fno-tree-dominator-opts regardless of the setting of CFLAGS etc., and for every extension module?

Absolutely, since we don't know the precise conditions which trigger the bug. As I said, sig_on() error handling is rarely doctested, so we cannot rely on doctests to check for this bug.

comment:23 Changed 9 years ago by Jeroen Demeyer

So, what is your opinion on adding -fno-dominator-opts for all GCC versions >= 4.1?

comment:24 in reply to:  22 Changed 9 years ago by Leif Leonhardy

Replying to jdemeyer:

Replying to leif:

the bug only shows up in conjunction with -Og

This is unfounded speculation.

Besides that I wrote "if ...", it's as "founded" as the conclusion that (solely) adding -fno-tree-dominator-opts solves the issue.

comment:25 Changed 9 years ago by Jeroen Demeyer

Merged in: sage-5.9.rc0
Report Upstream: Reported upstream. Developers acknowledge bug.Fixed upstream, but not in a stable release.
Resolution: fixed
Status: positive_reviewclosed

comment:26 Changed 9 years ago by Jeroen Demeyer

Bug still exists in GCC-4.8.1.

comment:27 Changed 9 years ago by Jean-Pierre Flori

Too bad.

I was thinking about this and must admit Volker had a point. We unconditionnally pass "-fno-..." whereas without further tweaking gcc 4.8.0 (and so 4.8.1 as well) correctly compiles sig_on statements. Of course according to Jeroen we don't really know (or do we now?) what options trigger the miscompile, so it is the safest choice to always pass "-fno-...".

But then why not treat cases where -O3 is passed to CFLAGS and R miscompiles and so on...

(By the way I've remarked that Python has the bad habit to override CFLAGS by adding the content of an OPT variable on the compile command line...)

comment:28 in reply to:  27 ; Changed 9 years ago by Jeroen Demeyer

Replying to jpflori:

gcc 4.8.0 (and so 4.8.1 as well) correctly compiles sig_on statements.

Same answer as I made before: what makes you think that this is true?

But then why not treat cases where -O3 is passed to CFLAGS and R miscompiles and so on...

I never said we shouldn't fix that. But this is a less serious issue, since -fipa-cp-clone is not added by the default R flags of -O2. Assuming the user does not use custom CFLAGS, adding -fno-ipa-cp-clone will not make a difference.

(By the way I've remarked that Python has the bad habit to override CFLAGS by adding the content of an OPT variable on the compile command line...)

At least, the flags added by Sage come at the very end of the command line.

comment:29 in reply to:  28 ; Changed 9 years ago by Jean-Pierre Flori

Replying to jdemeyer:

Replying to jpflori:

gcc 4.8.0 (and so 4.8.1 as well) correctly compiles sig_on statements.

Same answer as I made before: what makes you think that this is true?

Or that it is not true? From your report it seemed you got segfaults only when passing

But then why not treat cases where -O3 is passed to CFLAGS and R miscompiles and so on...

I never said we shouldn't fix that. But this is a less serious issue, since -fipa-cp-clone is not added by the default R flags of -O2. Assuming the user does not use custom CFLAGS, adding -fno-ipa-cp-clone will not make a difference.

I agree, I only posted all of this here so that I don't lose it in my mind.

(By the way I've remarked that Python has the bad habit to override CFLAGS by adding the content of an OPT variable on the compile command line...)

At least, the flags added by Sage come at the very end of the command line.

Not really sure, I had a really painful experience lately building Python on a Raspberry Pi. Doing something like "CFLAGS="-O -g" ./sage -i python.spkg" does not work. Doing "OPT= CFLAG="-O -g" ..." would work. Look at configure.ac and Makefile.pre.in, the only thing that comes in $CFLAGS after $OPT is $EXTRA_CFLAGS (which is surely meant to be mostly user defined). Anyway it's not really the place to discuss this.

comment:30 in reply to:  29 Changed 9 years ago by Jeroen Demeyer

Replying to jpflori:

Or that it is not true?

The principle of being on the safe side means that, in the case of doubt, assume something can go wrong.

From your report it seemed you got segfaults only when passing

...the -Og flag. (I guess this is what you wanted to write).

And that's even true: the particular segfault in libGAP only occurs with -Og. But there could very well be other segfaults, not found by doctests, in other parts of Sage, with the default flags.

comment:31 Changed 9 years ago by Jeroen Demeyer

Let me copy this sentence from the ticket description:

libGAP is one of the few places in Sage where non-trivial use of the sig_on() signal handling is actually tested, so there might be many more places where the problem could occur.

comment:32 Changed 9 years ago by Jean-Pierre Flori

Yeah I agree, no harm meant at all.

The best solution would be to live in a world where GCC never exposes bugs. And as I've said before, I'm ok with the solution which was merged, I even gave positive_review, I just hope the situation would be better. At some point we'll have to rule out all versions of GCC :)

Note: See TracTickets for help on using tickets.