Opened 4 years ago

Last modified 3 weeks ago

#22191 closed enhancement

update ECL to 16.1.3 — at Version 45

Reported by: dimpase Owned by:
Priority: critical Milestone: sage-9.2
Component: packages: standard Keywords: upgrade, ecl
Cc: rws, fbissey, jdemeyer, charpent, jpflori, arojas, gh-timokau, saraedum, infinity0, slelievre, thansen, embray, mjo Merged in:
Authors: Dima Pasechnik Reviewers:
Report Upstream: Fixed upstream, but not in a stable release. Work issues:
Branch: u/nbruin/ecl16.1.3 (Commits) Commit: 35e9d3d1610bade7ae7d5fac735244be25e009eb
Dependencies: Stopgaps:

Description (last modified by thansen)

ECL 16.1.3 has been released.

We should update, as it will resolve some Maxima bugs, and is essential for resolving other Maxima bugs.

Tarball: https://common-lisp.net/project/ecl/static/files/release/ecl-16.1.3.tgz

Change History (46)

comment:1 Changed 4 years ago by dimpase

  • Branch set to u/dimpase/ecl16.1.3
  • Commit set to 33e59cb6e6d5b6b16f0b23eacc566d613ffccd07
  • Description modified (diff)

New commits:

3ed32fbUpgrade Jupyter packages, add enum34
72993f2Merge branch 'u/jdemeyer/upgrade_jupyter_packages' of trac.sagemath.org:sage into develop
2a291abMerge branch 'develop' of trac.sagemath.org:sage into develop
bd5a04fUpdate source to 2.11.0, fix checksum.
8356446Merge branch 'u/charpent/upgrade_git_to_2_10_2__or_2_11___' of trac.sagemath.org:sage into newgit
33e59cbupdate ECL to 16.1.3 - most of our patches are obsolete

comment:2 Changed 4 years ago by dimpase

oops, somewhat wrong branch... testing anyway

comment:3 Changed 4 years ago by git

  • Commit changed from 33e59cb6e6d5b6b16f0b23eacc566d613ffccd07 to eff60815383ca3b3120a7a26e98e1712072c70ac

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

eff6081update ECL to 16.1.3 - most of our patches are obsolete

comment:4 Changed 4 years ago by dimpase

  • Authors set to Dima Pasechnik
  • Status changed from new to needs_review

comment:5 Changed 4 years ago by fbissey

  • Cc fbissey added

comment:6 Changed 4 years ago by dimpase

The FPE problem on #18920 in comment 99 might be due to ECL 16.1.3 throwing an FPE exception at

> (/ 1.0 0.0)

Condition of type: DIVISION-BY-ZERO
#<a DIVISION-BY-ZERO>

while 16.1.2 happily returns

#.ext::single-float-positive-infinity

comment:7 Changed 4 years ago by dimpase

with 6.1.3 there is apparently new --no-trap-fpe which emulates the old behaviour. So we somehow have to figure out where to put this option.

comment:8 Changed 4 years ago by fbissey

If it is an option for ecl then somewhere in sage/libs/ecl.pyx is the place, probably in ecl_init but I have to dig a bit.

comment:9 follow-up: Changed 4 years ago by fbissey

There is already an option ECL_OPT_TRAP_SIGFPE is it equivalent?

Changed 4 years ago by dimpase

crash gdb log

comment:10 in reply to: ↑ 9 Changed 4 years ago by dimpase

Replying to fbissey:

There is already an option ECL_OPT_TRAP_SIGFPE is it equivalent?

I guess it should be. Perhaps it's something more subtle... I attach the crash dump.

comment:11 Changed 4 years ago by git

  • Commit changed from eff60815383ca3b3120a7a26e98e1712072c70ac to ac8348c462b11a67ee22f77e265afc0ce0572620

Branch pushed to git repo; I updated commit sha1. Last 10 new commits:

62b5825some more easy doctest fixes
3c28022more easy doctest fixes
5d86413Merge branch 'public/t18920' of trac.sagemath.org:sage into max5381
100d9d6workaround for #<a ARITHMETIC-ERROR> (18920#comment:60
79eb82ffix for maxima bug #3236
814e8edreflect patch's dependence on an earlier maxima tree commit
af7e5ffnew scaling of dynatomic polynomial
68da1e4Merge branch 'public/t18920' of trac.sagemath.org:sage into ecl1613
6bae4b1bumped up for 5.39.0. Sage builds (with ecl 16.1.3)
ac8348cdisable SIGFPE traps

comment:12 Changed 4 years ago by git

  • Commit changed from ac8348c462b11a67ee22f77e265afc0ce0572620 to a06f00245bfb652270eeea78248348777d006538

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

3ed32fbUpgrade Jupyter packages, add enum34
72993f2Merge branch 'u/jdemeyer/upgrade_jupyter_packages' of trac.sagemath.org:sage into develop
2a291abMerge branch 'develop' of trac.sagemath.org:sage into develop
6027df9Merge branch 'develop' of trac.sagemath.org:sage into develop
b9ee72dupdate ECL to 16.1.3 - most of our patches are obsolete
a06f002disable SIGFPE traps

comment:13 Changed 4 years ago by git

  • Commit changed from a06f00245bfb652270eeea78248348777d006538 to ad25254d3445265211081b50501af670daf4dade

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

377e0edupdate ECL to 16.1.3 - most of our patches are obsolete
ad25254disable SIGFPE traps

comment:14 Changed 4 years ago by dimpase

  • Status changed from needs_review to needs_work
  • Work issues set to sort out SIGFPE

this is a clean update branch for ECL 16.1.3 - needs work to sort out SIGFPE.

comment:15 follow-up: Changed 4 years ago by nbruin

I'm pretty sure the change in behaviour comes from this check-in on ECL: https://gitlab.com/embeddable-common-lisp/ecl/commit/c2b2941768f39544f45f24e19a30081d316eeb71

Clearly, this code will be setting FPE generation flags in more places than before. Probably ECL doesn't go out of its way to reset FPE flags when ECL is exiting. So probably, after executing ECL code, it is now the case that (often?) the FPE flags are set the way ECL wants; not the way we want. So in addition to restoring the SIG handlers, we should probably sprinkle an appropriate amount of fesetenv calls around the ECL entry and exit wrappers.

Last edited 4 years ago by nbruin (previous) (diff)

comment:16 Changed 4 years ago by dimpase

Upstream tells me that https://gitlab.com/embeddable-common-lisp/ecl/blob/develop/src/lsp/top.lsp#L1456 makes write-error patch unnecessary. So we should remove it too.

Last edited 4 years ago by dimpase (previous) (diff)

comment:17 in reply to: ↑ 15 Changed 4 years ago by dimpase

  • Cc jdemeyer added

Replying to nbruin:

I'm pretty sure the change in behaviour comes from this check-in on ECL: https://gitlab.com/embeddable-common-lisp/ecl/commit/c2b2941768f39544f45f24e19a30081d316eeb71

Clearly, this code will be setting FPE generation flags in more places than before.

but this ECL macro is not about the case at hand, as I believe we do have fenv.h available. Or you mean that Sage does not use fenv.h?

Probably ECL doesn't go out of its way to reset FPE flags when ECL is exiting. So probably, after executing ECL code, it is now the case that (often?) the FPE flags are set the way ECL wants; not the way we want. So in addition to restoring the SIG handlers, we should probably sprinkle an appropriate amount of fesetenv calls around the ECL entry and exit wrappers.

this does not explain why simply setting ECL_OPT_TRAP_SIGFPE to 0 solves the issue for division by 0, but not for FP overflow.

However, I don't understand why doing anything with ECL_OPT_TRAP_SIGFPE has any effect whatsoever (assuming we do not preserve it among other ECL's interrupts such as SIGINT). Shouldn't

    #and put the Sage signal handlers back
    for i in range(1,32):
        sigaction(i, &sage_action[i], NULL)

in ecl.pyx be overriding whatever ECL did with SIGFPE?

comment:18 Changed 4 years ago by dimpase

The experimental branch using ECLs SIGFPE handler is on girt trac: u/dimpase/ecl16.1.3exp So you can have a look --- it also doesn't work.

comment:19 follow-up: Changed 4 years ago by nbruin

I am not very familiar with this kind of coding issues and/or with SIG handling in general.

If we're doing our sig handler setting properly then it shouldn't matter what sig handlers ECL installs when we run non-ECL code, because we switch them out. As far as I can see this is the case. I see no evidence that an ECL sig handler is interfering with our work. The trace-back doesn't mention ECL code at all. Just an uncaught signal.

It seems that the sage action for SIGFPE is to coredump. That's what we're seeing.

There are flags that govern when SIGFPE is raised (rather than propagating a NaN or something similar). The ECL macro I pointed at will be inserted in ECL generated code quite a bit, so I expect that macro to be executed ALL THE TIME. Plus, the patch was made exactly to fix an issue of unraised SIGFPEs (CL apparently prefers the exceptions over NaNs? by default).

We are not taking measures to set fenv to our liking after executing ECL, and one change in the new ECL version is certainly to change fenv more often. So I think it's worth a try to save & restore's sage's own fenv across ECL calls.

It could be that ECL changes its FPE raise mask depending on the value of ECL_OPT_TRAP_SIGFPE. I haven't checked. That would be consistent with observed behaviour.

comment:20 in reply to: ↑ 19 ; follow-ups: Changed 4 years ago by dimpase

Replying to nbruin:

I am not very familiar with this kind of coding issues and/or with SIG handling in general.

If we're doing our sig handler setting properly then it shouldn't matter what sig handlers ECL installs when we run non-ECL code, because we switch them out. As far as I can see this is the case. I see no evidence that an ECL sig handler is interfering with our work. The trace-back doesn't mention ECL code at all. Just an uncaught signal.

But why do we get an uncaught FP overflow signal? That's because booting ECL up destroys Sages' sig handler, even if we switch ECLs SIGFPE handler off. How exactly this is possible is not clear to me.

It seems that the sage action for SIGFPE is to coredump. That's what we're seeing.

no, it is not. Sage's action on FP overflow is to return infinity. Why does ECL continue to mess around with it, even if told not to? In fact, ECL 16.1.3 has a new command line switch, -no-trap-fp, and it works. (i.e. you get ECLs infinity if you try to convert a too big number to float if this switch is on, and you get an interrupt if that switch is off)

There are flags that govern when SIGFPE is raised (rather than propagating a NaN or something similar). The ECL macro I pointed at will be inserted in ECL generated code quite a bit, so I expect that macro to be executed ALL THE TIME. Plus, the patch was made exactly to fix an issue of unraised SIGFPEs (CL apparently prefers the exceptions over NaNs? by default).

We are not taking measures to set fenv to our liking after executing ECL, and one change in the new ECL version is certainly to change fenv more often. So I think it's worth a try to save & restore's sage's own fenv across ECL calls.

It could be that ECL changes its FPE raise mask depending on the value of ECL_OPT_TRAP_SIGFPE. I haven't checked. That would be consistent with observed behaviour.

comment:21 Changed 4 years ago by nbruin

  • Branch changed from u/dimpase/ecl16.1.3 to u/nbruin/ecl16.1.3

comment:22 in reply to: ↑ 20 Changed 4 years ago by nbruin

  • Commit changed from ad25254d3445265211081b50501af670daf4dade to a2cadc04eb91409ad1f88b5f4d6ad80d5dfc3451

Replying to dimpase:

In fact, ECL 16.1.3 has a new command line switch, -no-trap-fp, and it works.

The corresponding ECL command is (si::trap-fpe t nil), so we could just pass that to ECL just after cl_boot. That might be easier/more efficient than passing it in the argv.

This should put ECL in agreement about what the fenv FPE raise flags are supposed to be (and then it also doesn't matter what SIGFPE handler ECL uses, because the signals will not occur, so we don't need to restore ECL's SIGFPE handler for ECL code either.

Last edited 4 years ago by nbruin (previous) (diff)

comment:23 follow-up: Changed 4 years ago by git

  • Commit changed from a2cadc04eb91409ad1f88b5f4d6ad80d5dfc3451 to ea11dd53248462bf2859fb3cbe34bfd47f006682

Branch pushed to git repo; I updated commit sha1. New commits:

ea11dd5Restore "fenv" upon entry and exit of ECL code

comment:24 in reply to: ↑ 23 Changed 4 years ago by nbruin

Replying to git:

Branch pushed to git repo; I updated commit sha1. New commits:

ea11dd5Restore "fenv" upon entry and exit of ECL code

OK, committing the change before pushing is probably helpful.

comment:25 Changed 4 years ago by git

  • Commit changed from ea11dd53248462bf2859fb3cbe34bfd47f006682 to 35e9d3d1610bade7ae7d5fac735244be25e009eb

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

377e0edupdate ECL to 16.1.3 - most of our patches are obsolete
35e9d3dset ECL to not trap floating point exceptions, so that SIGFPE will not occur

comment:26 Changed 4 years ago by nbruin

OK, that should do the trick! I guess ECL_OPT_TRAP_SIGFPE doesn't have much to do with what we need because it's supposed to control what happens with SIGFPE if ECL is catching signals, whereas we're interested in preventing FP exceptions from generating signals at all. the routine si:trap-fpe is exactly the one to set that behaviour, in a way that is maintained in ECL's environment tracking as well.

comment:27 follow-up: Changed 4 years ago by charpent

  • Cc charpent added

Just as a memento for future testing :

With the current ECL Maxima 5.39.0 gives :

(%i3) taylor(integrate(f(x),x),x,x_0,2); 

taylor: unable to expand at a point specified in:
'integrate(f(x),x)
 -- an error. To debug this try: debugmode(true);

whereas :

%i2) taylor(integrate(f(x),x),x,x_0,2);

(%o2) 'at('integrate(f(x),x),x = x_0)+f(x_0)*(x-x_0)
                                     +(('at('diff(f(x),x,1),x = x_0))
                                      *(x-x_0)^2)
                                      /2

is expected since Maxima 5.38.1 at the latest...

comment:28 Changed 4 years ago by nbruin

DISCLAIMER: I just found out that all my testing was on 16.1.2. While I expect that my fix should work, I haven't actually successfully tested this. Other people who do have easy access to 16.1.3 can verify this easily.

comment:29 in reply to: ↑ 20 Changed 4 years ago by dimpase

OK, this appears to work. make ptest with this ECL 16.1.3 branch and Maxima from #18920 passes without any errors.

comment:30 Changed 4 years ago by dimpase

  • Report Upstream changed from N/A to Reported upstream. No feedback yet.

Daniel Kochmański (ECL lead dev) invited me to open an issue #347 on this.

comment:31 in reply to: ↑ 27 Changed 4 years ago by dimpase

Replying to charpent:

Just as a memento for future testing :

With the current ECL Maxima 5.39.0 gives :

(%i3) taylor(integrate(f(x),x),x,x_0,2); 

taylor: unable to expand at a point specified in:
'integrate(f(x),x)
 -- an error. To debug this try: debugmode(true);

this unfortunately stays the same with ECL 16.1.3. This is for Maxima/ECL people to fix, not for us.

comment:32 follow-ups: Changed 4 years ago by jdemeyer

It would be really good to explain on this ticket what the SIGFPE issue is all about.

comment:33 Changed 4 years ago by jdemeyer

  • Description modified (diff)

comment:34 in reply to: ↑ 32 Changed 4 years ago by dimpase

Replying to jdemeyer:

It would be really good to explain on this ticket what the SIGFPE issue is all about.

examples can be found on #18920. More specifically, in Sage Maxima built with ECL 16.1.3 and unpatched libs/ecl.pyx gives crash dumps like

sage: g(x)=taylor(log(x),x,1,6) # this initialises ECL/Maxima
sage: q = (2^53) * 2^971/1
sage: float(q) # boom, instead of "inf" as the answer.
...
Unhandled SIGFPE: An unhandled floating point exception occurred.
...

and

sage: g(x)=taylor(log(x),x,1,6); g
x |--> -1/6*(x - 1)^6 + 1/5*(x - 1)^5 - 1/4*(x - 1)^4 + 1/3*(x - 1)^3 - 1/2*(x - 1)^2 + x - 1
sage: plot(log(x),(x,0,2))
...
Unhandled SIGFPE: An unhandled floating point exception occurred.
...

comment:35 in reply to: ↑ 32 ; follow-up: Changed 4 years ago by nbruin

The mechanics of why SIGFPE is a problem with ECL 6.1.3 by default:

  • Apparently it is desirable in CL to raise floating point exceptions rather than propagate Inf and NaN by default (CL is a formally specified language, so it probably states that FP exceptions should be handled that way)
  • This is most easily done by using fenv.h functionality to configure floating point arithmetic to raise SIGFPE when these things arise. So, ECL does that, and has a SIGFPE handler in place.
  • We switch the ECL sighandlers in/out whenever we're entering/exiting significant ECL code. Sage is not expecting floating point exceptions to raise SIGFPE, so sage does not have a SIGFPE handler.
  • In order to fix a bug elsewhere in ECL, https://gitlab.com/embeddable-common-lisp/ecl/commit/c2b2941768f39544f45f24e19a30081d316eeb71 changed ECL to much more often update the fenv to reflect its desired state (previously, they were not doing this at all). ECL is not actually written to be called as a library, so it doesn't take any precautions to set/restore global state such as fenv upon entry/exit. Hence, after running ECL in its default configuration, you're liable to end up with an fenv in which SIGFPE can be raised. This is what was happening.

For solutions:

  • it's possible to tell ECL to NOT let floating point exceptions raise SIGFPE. Then ECL's SIGFPE handler is irrelevant, and the fenv state ECL wants is basically consistent with what we want. This is the solution currently used.
  • alternatively, we could let ECL do what it wants and set/restore fenv upon entry/exit of ECL code. fesetenv and fegetenv are the relevant routines from fenv.h. In addition, though, we should restore ECL's SIGFPE handler during ECL code execution, if we go that route (currently we don't).

This was one of these issues that's really easy to resolve, once you understand the problem.

The main confusion is that there are two components: whether floating point exceptions should raise SIGFPE, and how SIGFPE gets handled once it's raised. You might think these two components are controlled via the same mechanism, but they are not. It looks like ECL might be investigating if its configuration can be streamlined on these two issues: https://gitlab.com/embeddable-common-lisp/ecl/issues/347

Currently there are two options that seem relevant: ECL_OPT_TRAP_SIGFPE and --no-trap-fpe that feed into different places and control different aspects.

comment:36 in reply to: ↑ 35 ; follow-up: Changed 4 years ago by jdemeyer

Replying to nbruin:

This was one of these issues that's really easy to resolve, once you understand the problem.

So the issue is solved now?

comment:37 in reply to: ↑ 36 ; follow-up: Changed 4 years ago by dimpase

Replying to jdemeyer:

Replying to nbruin:

This was one of these issues that's really easy to resolve, once you understand the problem.

So the issue is solved now?

it is desirable to tell ECL to NOT let floating point exceptions raise SIGFPE before booting it up, not during the bootup, for this leaves a time window where SIGFPE can be raised. How can this be done?

comment:38 in reply to: ↑ 37 Changed 4 years ago by nbruin

Replying to dimpase:

it is desirable to tell ECL to NOT let floating point exceptions raise SIGFPE before booting it up, not during the bootup, for this leaves a time window where SIGFPE can be raised. How can this be done?

I think this is an academic issue, because in our current setting, no fp exceptions will occur in the window, and hence the way SIGFPE is handled is irrelevant. Anyway:

I assume ECL installs its SIGFPE handler before it executes any FP instructions and/or sets the fenv. The ECL's SIGFPE handler greatly eases the effect of SIGFPE. So if (si::trap-fpe T NIL) is executed right after cl_boot; before switching the SIGhandlers, I think the risk is smaller (I don't think there is a risk at all because no FP work occurs in this process, so it's hard to make an accurate estimate of the relative risks here).

Alternatively, you could trust that ECL knows what it's doing and execute cl_boot with an argv that contains --no-trap-fpe. With any luck, that should just get processed in the same way as command line options upon startup. It is then ECL's responsibility to do this before a SIGFPE actually occurs. In reality, this is essentially the same as just feeding ECL (si::trap-fpe T NIL) just after boot.

fenv is supposed to be thread-specific, so concurrency issues should not come into play, even if sage were ever to grow multiple threads.

comment:39 Changed 4 years ago by nbruin

  • Status changed from needs_work to needs_review
  • Work issues sort out SIGFPE deleted

Needs review?

comment:40 follow-up: Changed 4 years ago by dimpase

there is now https://gitlab.com/embeddable-common-lisp/ecl/issues/349 and also still write_error patch to remove.

comment:41 follow-up: Changed 4 years ago by jdemeyer

  • Status changed from needs_review to needs_work

You should document the change to src/sage/libs/ecl.pyx. At a minimum, there should be a link to this ticket.

comment:42 in reply to: ↑ 41 Changed 4 years ago by dimpase

Replying to jdemeyer:

You should document the change to src/sage/libs/ecl.pyx. At a minimum, there should be a link to this ticket.

Sure. I didn't set it up for review, Nils did. IMHO upstream is still working on this.

comment:43 Changed 3 years ago by jpflori

  • Cc jpflori added

comment:44 in reply to: ↑ 40 Changed 3 years ago by nbruin

Replying to dimpase:

there is now https://gitlab.com/embeddable-common-lisp/ecl/issues/349 and also still write_error patch to remove.

That ticket has now been closed and its fix was to change a test.

No movement on https://gitlab.com/embeddable-common-lisp/ecl/issues/347, so if updating ECL is relevant then doing so with the (si::trap-fpe T NIL) given here is probably a good way to go.

comment:45 Changed 2 years ago by thansen

  • Description modified (diff)
  • Summary changed from update ECL to 6.1.3 to update ECL to 16.1.3
Note: See TracTickets for help on using tickets.