Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#27385 closed defect (fixed)

Random hangs in giac interface since upgrade to 1.5 (at least on Cygwin)

Reported by: embray Owned by:
Priority: critical Milestone: sage-8.8
Component: interfaces Keywords: giac
Cc: frederichan, parisse, slelievre Merged in:
Authors: Erik Bray Reviewers: Dima Pasechnik
Report Upstream: N/A Work issues:
Branch: 8d0a9df (Commits, GitHub, GitLab) Commit:
Dependencies: Stopgaps:

Status badges

Description

Since the upgrade to giac 1.5 (#26315) the test suite for sage.interfaces.giac will sometimes randomly hang.

I don't have a reliable way to reproduce the problem: I just run

./sage -t --long src/sage/interfaces/giac.py

in an infinitely loop, and once out of roughly every 5 runs the test process will just hang without any CPU usage, usually on completely arbitrary lines in the test. Though the test process will at least respond to being killed with a SIGTERM so the issue probably isn't a threading deadlock in openblas, at last (those tend to be unresponsive to signals in Cygwin since it's using Windows native threads directly).

Change History (25)

comment:1 Changed 3 years ago by embray

Definitely seeing something in strace where the python process is waiting for giac to exit, and it's off in some loop calling

921 23821490 [main] giac 21056 clock_nanosleep: clock_nanosleep (0.001000000)

until I send kill -SIGTERM to the python process. In the process of shutting down, python tries a few times to kill giac with SIGHUP, SIGINT, and SIGQUIT and it doesn't respond. It keeps going in a loop. Finally a few moments later the sage-cleaner comes along and SIGKILLs giac.

So at the very least that points us in the right direction: There's some miscommunication happening, somewhere, with how sage tries to kill giac processes.

comment:2 Changed 3 years ago by slelievre

  • Cc frederichan parisse slelievre added

comment:3 Changed 3 years ago by gh-timokau

Since you seem to be able to reproduce it relatively reliably (5 runs of a single test should be cheap), couldn't you bisect between giac 1.4.9.45 and 1.5.0.37?

comment:4 Changed 3 years ago by embray

That's exactly what I'm working on doing now. It's just taking a long time to download the tarballs because there's something wrong with the webserver they're hosted on.

I'm still at a bit of a loss about the overall issue. It's getting stuck in this while loop in the icas_eval function while pexpect is waiting for a prompt:

3348       while (1){
3349         // look at other threads
3350         int cs=context_list().size(),ci=0;
3351         for (;ci<cs;++ci){
3352           context * cptr=context_list()[ci];
3353           if (cptr!=contextptr)
3354             status=check_thread(cptr);
3355         }
3356         // 0 finished, 2/3 debug/wait click
3357         status=giac::check_thread(contextptr);
3358 #ifdef HAVE_LIBFLTK
3359         if (xcas::Xcas_Debug_Window && status<2)
3360           xcas::Xcas_Debug_Window->hide();
3361 #endif
3362         if (status<=0){
3363 #ifdef HAVE_LIBFLTK
3364           Fl::flush();
3365 #endif
3366           break;
3367         }
3368 #ifdef HAVE_LIBFLTK
3369         if (status!=1)
3370           xcas::Xcas_debugguer(status,contextptr);
3371 #else
3372         // FIXME Debugguer without FLTK
3373         giac::thread_eval_status(1,contextptr);
3374 #endif
3375         if (ctrl_c){
3376           if (giac::is_context_busy(contextptr))
3377             giac::kill_thread(true,contextptr);
3378           ctrl_c=false; interrupted=false;
3379         }
3380         else
3381           usleep(1000);
3382       }

Still not sure if there's an easy way to reproduce this outside of Sage.

comment:5 Changed 3 years ago by embray

After painstakingly manually bisecting between the various giac source tarballs, I'm reasonably certain that the issue started between giac-1.5.0-21 and giac-1.5.0-25. I'm going to start diffing between those versions to see if any specific changes stand out.

The issue is hard to reproduce exactly reliably, but can be done simply by slamming the icas command line with a lot of input in rapid succession. For example by doing, via the pexpect interface in Sage:

idx = 0
while True:
    print(idx)
    giac.eval('1 + 1')
    idx += 1

eventually, usually after a few thousand iterations, or as few as a few dozen it will freeze in the while(1) loop I posted in the comment above.

I suspect a subtle race condition, possibly involving a slight error in clock arithmetic or something like that; I've seen that sort of thing before. E.g. https://bugs.python.org/issue26939

comment:6 Changed 3 years ago by embray

The relevant change that caused this problem to appear is, quite subtly,

  • src/icas.cc

    diff -ruN giac-1.5.0-21/src/icas.cc giac-1.5.0-25/src/icas.cc
    old new  
    160160using namespace giac;
    161161#define STDIN_FILENO 0
    162162namespace xcas {
     163  /*
    163164  void icas_eval(giac::gen & g,giac::gen & gg,int & reading_file,std::string &filename,giac::context * contextptr){
    164165    gg=protecteval(g,10,contextptr);
    165166  }
     
    167168  bool fltk_view(const giac::gen & g,giac::gen & ge,const std::string & filename,std::string & figure_filename,int file_type,const giac::context *contextptr){
    168169    return false;
    169170  }
     171  */
    170172}
    171173
    172174#else

Somewhere along the line there was some experimentation with integrating the xcas::icas_eval REPL loop into the command-line icas interface, probably so that it can have optional graphics support if compiled with FLTK (e.g. opening windows to view plots), and so it uses this threaded eval loop in order to allow non-blocking I/O while also interacting with a GUI window.

Prior to 1.5.0-25, there was some hackery in icas.cc to override xcas::icas_eval with a dummy, non-threaded version, at least when compiled without FLTK. But I guess at some point it was decided that the threaded version worked well-enough in icas as well(??) and the dummy override was not even removed but just...commented out.

Since the threaded eval loop is apparently working well-enough on other platforms I think what I'll do for now, so I can move on to other things, is provide a patch to re-disable the threaded eval just on Cygwin. It would be good to then investigate what the bug is in the threaded eval, but that is lower priority for the purposes of fixing Sage.

Last edited 3 years ago by embray (previous) (diff)

comment:7 Changed 3 years ago by gh-timokau

Good work! Did you try reproducing it on non-cygwin and/or plain giac without sage's interface?

comment:8 follow-up: Changed 3 years ago by embray

I haven't tried reproducing the problem outside of Sage, but using giac.eval is pretty close to just using pexpect directly, and in this case it's useful just to slam the interface with input much faster than a human could, making the deadloop more probable.

I've tried to reproduce the problem in Linux but haven't been successful. I bet it might be possible to reproduce by reducing the delay in the usleep() call quoted above but I'm not sure.

comment:9 in reply to: ↑ 8 ; follow-up: Changed 3 years ago by gh-timokau

Replying to embray:

I haven't tried reproducing the problem outside of Sage, but using giac.eval is pretty close to just using pexpect directly, and in this case it's useful just to slam the interface with input much faster than a human could, making the deadloop more probable.

You should be able to do the same thing in a shell: while :; do echo '1+1;'; done | giac

comment:10 Changed 3 years ago by embray

  • Authors set to Erik Bray
  • Branch set to u/embray/ticket-27385
  • Commit set to 8d0a9df87cfe93082085f2805e47ef75026588a3
  • Status changed from new to needs_review

Temporary fix for now. This patch only impacts Cygwin so I think it should be no problem, but I bumped the patch version for the buildbots anyways.

I'd like to continue to investigate the problem when I can, but for the purposes of getting 8.7 stable on Cygwin again I think we should go ahead and use this patch since it's an easy fix.


New commits:

8d0a9dfTrac #27385: Patch for Cygwin to re-disable threaded eval until/unless

comment:11 in reply to: ↑ 9 Changed 3 years ago by embray

Replying to gh-timokau:

Replying to embray:

I haven't tried reproducing the problem outside of Sage, but using giac.eval is pretty close to just using pexpect directly, and in this case it's useful just to slam the interface with input much faster than a human could, making the deadloop more probable.

You should be able to do the same thing in a shell: while :; do echo '1+1;'; done | giac

Yep, I was able to reproduce the same problem like that too.

comment:12 follow-up: Changed 3 years ago by frederichan

Sorry I can't test on cygwin, but are you able to build giacpy_sage on cygwin? because this cython interface should avoid these problems and provides the samething.

comment:13 in reply to: ↑ 12 Changed 3 years ago by embray

Replying to frederichan:

Sorry I can't test on cygwin, but are you able to build giacpy_sage on cygwin? because this cython interface should avoid these problems and provides the samething.

I'll give giacpy a try; I believe I have in the past, and it worked fine. Nevertheless as long as the pexpect interface exists in sage, and giacpy_sage remains an optional package, it needs to work reliably.

As for giacpy_sage, what is the plan for that? Are we going to make it a standard package and perhaps deprecate, and eventually remove, the pexpect interface?

comment:14 Changed 3 years ago by embray

Yeah, there's definitely a race condition in this loop. I haven't confirmed for sure that this is the problem but it probably is. A cleaned up version of it (with some ifdefs removed) looks like:

      giac::make_thread(g1,eval_level(contextptr),icas_eval_callback,&result,contextptr);
      int status;
      while (1){
        // look at other threads
        int cs=context_list().size(),ci=0;
        for (;ci<cs;++ci){
          context * cptr=context_list()[ci];
          if (cptr!=contextptr)
            status=check_thread(cptr);
        }

        // 0 finished, 2/3 debug/wait click
        status=giac::check_thread(contextptr);
        if (status<=0){
          break;
        }

        giac::thread_eval_status(1,contextptr);
        if (ctrl_c){
          if (giac::is_context_busy(contextptr))
            giac::kill_thread(true,contextptr);
          ctrl_c=false; interrupted=false;
        }
        else
          usleep(1000);
      }
      gg=result;
    }

First of all, the "check other threads" loop at the beginning appears to be irrelevant in this case, as there is only one thread running anyways. There is a variable called thread_eval_status (which is protected by a mutex) which contains a "status" flag of the currently executing "eval thread" if any. As the comment in the code states, a value of 0 means the eval thread is finished and has exited. The function giac::check_thread first checks the value of this flag. As we can see, the loop is only broken out of if this flag is <=0 indicating that the thread finished, or possibly an error occurred.

After checking this flag it then immediately, in fact forcibly sets the value of the flag back to 1 with status=giac::check_thread(contextptr);. I'm not sure why it does this but in any case there's an obvious race condition here where the check_thread() call could return non-zero so the loop is not broken out of, then the eval thread completes and sets thread_eval_status=0, then the loop immediately sets thread_eval_status=1 again. The thread_eval_status is never set back to 0 and the loop continues forever (even though the thread has actually exited). If I'm correct that this is the problem it should be possible to reproduce on Linux as well, but it's probably more rare.

I think this code has a few problems. There are many ways to skin this cat, but in the present implementation it would still be more correct to use a condition variable to signal a change in the thread_eval_status flag. It probably still needs to loop every now and then so that it can check the ctrl_c flag set by its SIGINT handler, but it could do that with pthread_cond_timedwait, just for example.

comment:15 Changed 3 years ago by parisse

Actually, I never run this code because I always compile with FLTK. The thread_eval_status is set to 2 (or 3) if a user program is stopped waiting an interaction from the user (2 program is debugged, 3 waits for a user input in the GUI). The FLTK code branch checks that the status is != from 1 then calls the user interaction code, this code will set the status back to 1. The non-FLTK branch should do something similar, but nothing is implemented for that currently, as explained by the comment ( FIXME Debugguer without FLTK) then it sets back the status to 1, but it does not check that status!=1, and that's indeed a potential race condition. I will add this condition and that should fix the issue.

comment:16 Changed 3 years ago by dimpase

  • Reviewers set to Dima Pasechnik
  • Status changed from needs_review to positive_review

Looks good to me, modulo its working on Cygwin - something I did't test, even though I have access to a Windows 10 VM on Azure, so I potentially could.

comment:17 Changed 3 years ago by embray

The best way to test this is to just run something like:

idx = 0
while True:
    print(idx)
    giac.eval('1 + 1')
    idx += 1

(where the 'print' is just to see that it's doing something). Before long it will stop printing--it can take anywhere from a few dozen to a few thousand iterations--and the icas process will sit in a busy loop.

comment:18 Changed 3 years ago by embray

  • Milestone changed from sage-8.7 to sage-8.8

Moving all blocker/critical issues from 8.7 to 8.8.

comment:19 Changed 3 years ago by vbraun

  • Branch changed from u/embray/ticket-27385 to 8d0a9df87cfe93082085f2805e47ef75026588a3
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:20 follow-up: Changed 3 years ago by gh-mwageringel

  • Commit 8d0a9df87cfe93082085f2805e47ef75026588a3 deleted

The tests for sage.interfaces.giac also hang for me occasionally on CentOS Linux release 7.6.1810 (Core), with Giac using about 3% CPU. I have observed this 4-5 times with Sage 8.8. However, I cannot reliably reproduce the issue – the infinite loop suggested in comment:17 does not seem to work for me.

comment:21 in reply to: ↑ 20 ; follow-up: Changed 3 years ago by embray

Replying to gh-mwageringel:

The tests for sage.interfaces.giac also hang for me occasionally on CentOS Linux release 7.6.1810 (Core), with Giac using about 3% CPU. I have observed this 4-5 times with Sage 8.8. However, I cannot reliably reproduce the issue – the infinite loop suggested in comment:17 does not seem to work for me.

The workaround/patch in this ticket explicitly only targeted Cygwin, but as I found through subsequent investigation, the issue could theoretically impact any platform.

You could try building giac using the patch from this ticket, but instead of #ifdef __CYGWIN__ try just #if 1.

comment:22 in reply to: ↑ 21 Changed 3 years ago by gh-mwageringel

Replying to embray:

You could try building giac using the patch from this ticket, but instead of #ifdef __CYGWIN__ try just #if 1.

Yes, this seems to solve the problem on my machine. I tested this quite extensively and have not been able to replicate the problem since I applied the patch.

comment:23 follow-up: Changed 3 years ago by parisse

Did you upgrade to the latest version of giac (1.5.0-61)? I thought I fixed that 4 months ago. I'd prefer not to waste time checking again...

comment:24 in reply to: ↑ 23 ; follow-up: Changed 3 years ago by gh-mwageringel

Replying to parisse:

Did you upgrade to the latest version of giac (1.5.0-61)?

No.

comment:25 in reply to: ↑ 24 Changed 3 years ago by dimpase

Replying to gh-mwageringel:

Replying to parisse:

Did you upgrade to the latest version of giac (1.5.0-61)?

No.

I've opened #28101 to do an update.

Note: See TracTickets for help on using tickets.