Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#23011 closed defect (fixed)

ECL sometimes doesn't handle signals in a timely manner

Reported by: embray Owned by:
Priority: minor Milestone: sage-8.0
Component: packages: standard Keywords: ecl cygwin
Cc: Merged in:
Authors: Erik Bray Reviewers: Volker Braun
Report Upstream: N/A Work issues:
Branch: baff40c (Commits, GitHub, GitLab) Commit:
Dependencies: Stopgaps:

Status badges

Description (last modified by embray)

The following test from sage/libs/ecl.pyx sometimes blocks / fails due to timeout:

from sage.libs.ecl import *
from cysignals.tests import interrupt_after_delay
ecl_eval("(setf i 0)")
inf_loop = ecl_eval("(defun infinite() (loop (incf i)))")
interrupt_after_delay(1000)
print("starting loop")
inf_loop()

This will sometimes block indefinitely, at least on Cygwin--I have not tested on other platforms yet.

The issue isn't with interrupt_after_delay itself, since replacing inf_loop with a long sleep() doesn't exhibit the problem.

The problem appears to be fixed upstream in the development version of ECL. I traced the fix to this commit which is pretty bizarre, as it doesn't have anything to do with signal handling. Yet I'm certain this is the one. What I suspect is that something in the (loop) macro had slow behavior due to the bug this commit fixed, and that that slow behavior was taking place while signal handling is disabled (ecl_disable_interrupts()).

After applying the fix it doesn't appear to block signals anymore, or if it does it's only for a very small amount of time, so the failure should be much more rare.

Change History (15)

comment:1 Changed 4 years ago by embray

  • Description modified (diff)

comment:2 Changed 4 years ago by embray

  • Authors set to Erik Bray
  • Branch set to u/embray/ticket-23011
  • Commit set to baff40c20ce89086b21633e27c05211cd28e868e
  • Status changed from new to needs_review

I've added the fix from upstream as a patch to Sage's ECL.

The patch should also apply cleanly to ECL 6.1.3 (#22191) and the issue is still present in that version.


New commits:

baff40cAdds fix to ECL from upstream that works around https://trac.sagemath.org/ticket/23011

comment:3 follow-up: Changed 4 years ago by tscrim

Does this depend on #22191? Do we want to try and get #22191 progressing again?

comment:4 Changed 4 years ago by jdemeyer

Does Cygwin have something like strace? It would be interesting to see the output of that.

On Linux:

write(1, "starting loop\n", 14)         = 14
rt_sigaction(SIGINT, {0x7fb763cb1f50, ~[KILL STOP RTMIN RT_1], SA_RESTORER|SA_SIGINFO, 0x7fb8242a6860}, {0x7fb8179bcac0, [HUP INT ALRM], SA_RESTORER, 0x7fb8242a6860}, 8
rt_sigaction(SIGBUS, {0x7fb763cb1fc0, ~[KILL STOP RTMIN RT_1], SA_RESTORER|SA_SIGINFO, 0x7fb8242a6860}, {0x7fb8179bc8c0, [HUP INT ALRM], SA_RESTORER|SA_NODEFER, 0x7fb82
rt_sigaction(SIGSEGV, {0x7fb763cb1fc0, ~[KILL STOP RTMIN RT_1], SA_RESTORER|SA_SIGINFO, 0x7fb8242a6860}, {0x7fb8179bc8c0, [HUP INT ALRM], SA_RESTORER|SA_NODEFER, 0x7fb8
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
brk(0x7e76000)                          = 0x7e76000
brk(0x7eb6000)                          = 0x7eb6000
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
--- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=12670, si_uid=1038} ---
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigaction(SIGINT, {0x7fb8179bcac0, [HUP INT ALRM], SA_RESTORER, 0x7fb8242a6860}, NULL, 8) = 0
rt_sigaction(SIGBUS, {0x7fb8179bc8c0, [HUP INT ALRM], SA_RESTORER|SA_NODEFER, 0x7fb8242a6860}, NULL, 8) = 0
rt_sigaction(SIGSEGV, {0x7fb8179bc8c0, [HUP INT ALRM], SA_RESTORER|SA_NODEFER, 0x7fb8242a6860}, NULL, 8) = 0
futex(0x1ec4400, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1ec4400, FUTEX_WAKE_PRIVATE, 1) = 1
ioctl(1, TIOCGWINSZ, {ws_row=44, ws_col=168, ws_xpixel=0, ws_ypixel=0}) = 0
futex(0x1ec4400, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = 0
futex(0x1ec4400, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1ec4400, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1ec4400, FUTEX_WAKE_PRIVATE, 1) = 1
stat("/usr/local/src/sage-git/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", {st_mode=S_IFREG|0644, st_size=132017, ...}) = 0

comment:5 in reply to: ↑ 3 Changed 4 years ago by embray

Replying to tscrim:

Does this depend on #22191? Do we want to try and get #22191 progressing again?

As I wrote in the comment just above this, it should be independent of that :)

comment:6 follow-up: Changed 4 years ago by embray

Yes, there's an strace. Sometimes it's helpful. Actually when I first started working on this issue I was focusing on the interrupt_after_delay implementation, which I thought might have a race condition, but that wasn't it. I haven't looked at what ECL is doing in that loop, but maybe I'll take a look now just for comparison's sake. Is this problem not reproducible in Linux?

comment:7 in reply to: ↑ 6 ; follow-up: Changed 4 years ago by jdemeyer

Replying to embray:

Is this problem not reproducible in Linux?

You mention that it's a doctest. So if the problem would occur on Linux, surely somebody would have mentioned it and I cannot remember ever seeing this problem.

It could be that there is a genuine race condition on all systems, but that it only manifests itself on Cygwin due to different timing of things.

Looking at the trace, the many lines with rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 do look strange. Just by itself, it's nothing problematic though.

comment:8 in reply to: ↑ 7 Changed 4 years ago by embray

Replying to jdemeyer:

Replying to embray:

Is this problem not reproducible in Linux?

You mention that it's a doctest. So if the problem would occur on Linux, surely somebody would have mentioned it and I cannot remember ever seeing this problem.

It happens semi-randomly though.

It could be that there is a genuine race condition on all systems, but that it only manifests itself on Cygwin due to different timing of things.

Right, it wouldn't be the first time. I think this is most likely too.

Looking at the trace, the many lines with rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 do look strange. Just by itself, it's nothing problematic though.

With the patch applied I don't see this at all--when it gets to the loop, strace output goes silent. I'm about to try rebuilding again without the patch.

comment:9 Changed 4 years ago by embray

Yeah, this is still a bit of a mystery. All I can see is that the loop calls longjmp once per loop, and this can interfere with signal handling, but not for very long so that's not likely a problem. And the longjmp doesn't go away with this patch either, nor do I see why it would make a difference.

comment:10 Changed 4 years ago by vbraun

  • Reviewers set to Volker Braun
  • Status changed from needs_review to positive_review

Since its an upstream patch anyways we should probably merge it even if we don't 100% understand where the timing issue comes from...

comment:11 Changed 4 years ago by vbraun

  • Branch changed from u/embray/ticket-23011 to baff40c20ce89086b21633e27c05211cd28e868e
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:12 Changed 4 years ago by embray

  • Commit baff40c20ce89086b21633e27c05211cd28e868e deleted

Thanks, and I agree. I just wish I understood better; it's quite mysterious.

comment:13 follow-up: Changed 4 years ago by dimpase

Erik, I reckon you have a branch for using a development version of ECL. If so, would you mind sharing it, I think I need to try it on #22679.

comment:14 in reply to: ↑ 13 ; follow-up: Changed 4 years ago by embray

Replying to dimpase:

Erik, I reckon you have a branch for using a development version of ECL. If so, would you mind sharing it, I think I need to try it on #22679.

I don't think I have anything exactly like that. Right now, when I want to test dev versions of packages, I just copy the build/pkgs/<pkgname> directory somewhere, and then check out the git repo into that under src/, then run ./spkg-install manually (in a sage -sh).

comment:15 in reply to: ↑ 14 Changed 4 years ago by dimpase

Replying to embray:

Replying to dimpase:

Erik, I reckon you have a branch for using a development version of ECL. If so, would you mind sharing it, I think I need to try it on #22679.

I don't think I have anything exactly like that. Right now, when I want to test dev versions of packages, I just copy the build/pkgs/<pkgname> directory somewhere, and then check out the git repo into that under src/, then run ./spkg-install manually (in a sage -sh).

Doing this with the latest snapshot I then get an error building ecl extension:

src/build/cythonized/sage/libs/ecl.c:3701:57: error: use of undeclared identifier 'ECL_OPT_TRAP_SIGCHLD'; did you mean 'ECL_OPT_TRAP_SIGILL'?

and indeed ECL_OPT_TRAP_SIGCHLD is gone, so one needs to rework the interface. Probably it was a recent change and you didn't hit it then...

Note: See TracTickets for help on using tickets.