Opened 9 years ago

Closed 9 years ago

#14371 closed defect (fixed)

Race condition in singular doctest

Reported by: jdemeyer Owned by: mvngu
Priority: blocker Milestone: sage-5.9
Component: doctest coverage Keywords:
Cc: jpflori Merged in: sage-5.9.rc0
Authors: Jeroen Demeyer, Volker Braun Reviewers: Volker Braun, Jeroen Demeyer
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Status badges

Description (last modified by jdemeyer)

On hawk (OpenSolaris), after applying 14371_traceback_dont_catch.patch to get a useful traceback:

sage -t devel/sage/sage/interfaces/expect.py
**********************************************************************
File "devel/sage/sage/interfaces/expect.py", line 1089, in sage.interfaces.expect.Expect._crash_msg
Failed example:
    singular('2+3')
Exception raised:
    Traceback (most recent call last):
      File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.9.beta4/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 466, in _run
        self.execute(example, compiled, test.globs)
      File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.9.beta4/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 825, in execute
        exec compiled in globs
      File "<doctest sage.interfaces.expect.Expect._crash_msg[4]>", line 1, in <module>
        singular('2+3')
      File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.9.beta4/local/lib/python2.7/site-packages/sage/interfaces/singular.py", line 724, in __call__
        return SingularElement(self, type, x, False)
      File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.9.beta4/local/lib/python2.7/site-packages/sage/interfaces/singular.py", line 1184, in __init__
        self._name = parent._create( value, type)
      File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.9.beta4/local/lib/python2.7/site-packages/sage/interfaces/singular.py", line 685, in _create
        self.set(type, name, value)
      File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.9.beta4/local/lib/python2.7/site-packages/sage/interfaces/singular.py", line 626, in set
        out = self.eval(cmd)
      File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.9.beta4/local/lib/python2.7/site-packages/sage/interfaces/singular.py", line 580, in eval
        s = Expect.eval(self, x, **kwds)
      File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.9.beta4/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 1197, in eval
        for L in code.split('\n') if L != ''])
      File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.9.beta4/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 818, in _eval_line
        E.sendline(line)
      File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.9.beta4/local/lib/python2.7/site-packages/pexpect.py", line 677, in sendline
        n = n + self.send (os.linesep)
      File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.9.beta4/local/lib/python2.7/site-packages/pexpect.py", line 669, in send
        c = os.write(self.child_fd, str)
    TypeError: [Errno 22] Invalid argument
    Error evaluating def sage51=2+3; in Singular
**********************************************************************

Apply 14371_traceback_dont_catch.patch and 14371_sleep_isalive_alt.patch

Attachments (6)

trac_14371_dont_catch_runtimeerror.patch (3.6 KB) - added by vbraun 9 years ago.
Initial patch
14371_traceback.patch (2.9 KB) - added by jdemeyer 9 years ago.
14371_singular_race.patch (725 bytes) - added by jdemeyer 9 years ago.
14371_traceback_dont_catch.patch (6.1 KB) - added by jdemeyer 9 years ago.
14371_sleep_isalive.patch (2.4 KB) - added by jdemeyer 9 years ago.
14371_sleep_isalive_alt.patch (2.5 KB) - added by jdemeyer 9 years ago.
Alternative patch

Download all attachments as: .zip

Change History (47)

comment:1 Changed 9 years ago by jdemeyer

  • Status changed from new to needs_review

It's not the best solution possible, but after spending more than 1 hour trying stuff and reading the pexpect source code, there is nothing better I could come up with.

comment:2 follow-up: Changed 9 years ago by vbraun

Which syscall is failing with invalid argument? Can we get the full backtrace? Why do we catch the RuntimeError at all in set()?

comment:3 in reply to: ↑ 2 Changed 9 years ago by jdemeyer

Replying to vbraun:

Which syscall is failing with invalid argument?

os.write() to the terminal, from the send() function in site-packages/pexpect.py

comment:4 follow-up: Changed 9 years ago by vbraun

Does that only happen in parallel testing but not in serial tests? The only thing I can think of is that we are running out of file descriptors. Solaris defaults to a really low ulimit -n, I think.

comment:5 in reply to: ↑ 4 Changed 9 years ago by jdemeyer

Replying to vbraun:

The only thing I can think of is that we are running out of file descriptors.

I don't that has anything to do with it. I think it has to do with the fact that the Singular process dies and that therefore, writing to the pty fails. But unfortunately, it is more complicated than that.

comment:6 follow-up: Changed 9 years ago by vbraun

This doesn't make sense, if something is wrong with the file descriptor then write would cause EBADF not EINVAL. Are you sure that the error comes from writing to the old, dead subprocess and not to the newly-forked one? And why are we even catching RuntimeError? in sage.interface.singular.set?

I would guess that the issue is the absence of forkpty() on Solaris? I.e. no os.forkpty in Python, so pty.fork falls back to an alternate implementation which presumably triggers a race in Solaris. Contact upstream ;-)

It seems ridiculous to change the doctest here instead of adding a workaround to sage.interface. If this doesn't work then your platform just can't run Sage.

Changed 9 years ago by vbraun

Initial patch

comment:7 Changed 9 years ago by vbraun

I've added a patch that weans off the singular interface from RuntimeError abuse. Maybe you can try with that patch, then we ought to get a actually useful backtrace. All tests pass with the patch on my machine...

comment:8 in reply to: ↑ 6 Changed 9 years ago by jdemeyer

Replying to vbraun:

It seems ridiculous to change the doctest here instead of adding a workaround to sage.interface.

I agree, but my workaround (adding the sleep()) worked and everything else which I tried didn't work. But I'll certainly try your patch.

Changed 9 years ago by jdemeyer

comment:9 Changed 9 years ago by jdemeyer

  • Authors changed from Jeroen Demeyer to Jeroen Demeyer, Volker Braun
  • Description modified (diff)

comment:10 Changed 9 years ago by jdemeyer

  • Description modified (diff)

comment:11 Changed 9 years ago by jdemeyer

  • Description modified (diff)

comment:12 Changed 9 years ago by jdemeyer

14371_traceback_dont_catch.patch = trac_14371_dont_catch_runtimeerror.patch + 14371_traceback.patch with 1 change: inherit SingularError from RuntimeError for backwards compatibility.

comment:13 Changed 9 years ago by jdemeyer

OK, useful traceback now, any ideas?

Changed 9 years ago by jdemeyer

comment:14 Changed 9 years ago by jdemeyer

  • Description modified (diff)

comment:15 follow-up: Changed 9 years ago by vbraun

os.write can never raise a TypeError, only OSError. It is the point where the exception originates, but the exception type and message come from the except OSError: branch in _eval_line.

comment:16 follow-up: Changed 9 years ago by vbraun

Maybe run the doctest under dtrace to figure out where the EINVAL comes from? Somebody who is actually using OpenSolaris should make an effort.

comment:17 in reply to: ↑ 15 Changed 9 years ago by jdemeyer

Replying to vbraun:

os.write can never raise a TypeError, only OSError. It is the point where the exception originates, but the exception type and message come from the except OSError: branch in _eval_line.

These are all true statements, but what's the point you're trying to make?

Perhaps you are confused by the use of sys.exc_info()[2], which allows to raise a new exception with the old traceback. So it really is the os.write() which is failing.

comment:18 in reply to: ↑ 16 Changed 9 years ago by jdemeyer

Replying to vbraun:

Somebody who is actually using OpenSolaris should make an effort.

I did, but I really couldn't figure out what is going wrong.

comment:19 follow-up: Changed 9 years ago by vbraun

That os.write fails is ok, we catch the OSError and restart the process. But your hack prints the traceback starting from os.write, which is not the relevant one. The except: branch fails somewhere and that is the issue.

Changed 9 years ago by jdemeyer

comment:20 Changed 9 years ago by jdemeyer

  • Description modified (diff)

comment:21 in reply to: ↑ 19 Changed 9 years ago by jdemeyer

Replying to vbraun:

That os.write fails is ok, we catch the OSError and restart the process.

No, we don't because pexpect hasn't realized that the process is dead (E.isalive() is still true).

comment:22 Changed 9 years ago by vbraun

Known bug? This is from pexpect.py:

    def isalive(self):
        """This tests if the child process is running or not.
        This is non-blocking. If the child was terminated then this
        will read the exitstatus or signalstatus of the child.
        This returns 1 if the child process appears to be running or 0 if not.
        It can take literally SECONDS for Solaris to return the right status.
        """

How about we wait for 5 seconds after an os.write error on Slowaris?

comment:23 Changed 9 years ago by jdemeyer

  • Description modified (diff)

Something like this (needs review)?

Changed 9 years ago by jdemeyer

comment:24 follow-up: Changed 9 years ago by vbraun

I was more thingking about that:

    except OSError, msg: 
        if sys.platform.startswith('sunos'):
            # OpenSolaris bug workaround
            time.sleep(5)
        if not E.isalive():

There is no point in waiting if your OS works correctly. But more importantly, does it fix the issue?

comment:25 in reply to: ↑ 24 ; follow-up: Changed 9 years ago by jdemeyer

Replying to vbraun:

There is no point in waiting if your OS works correctly.

True, but I don't like special-casing operating systems. How can we be certain that it only affects (Open)Solaris? Since this failure (OSError when the process is truly alive) essentially never happens, I don't think it is a problem. On sane systems, this doesn't affect the doctest since those time.sleep() calls will not be done as E.isalive() is false.

But more importantly, does it fix the issue?

I haven't checked it yet, but it should.

Last edited 9 years ago by jdemeyer (previous) (diff)

comment:26 in reply to: ↑ 25 ; follow-up: Changed 9 years ago by vbraun

Replying to jdemeyer:

How can we be certain that it only affects (Open)Solaris?

We can't, thats why I would like to special case only Solaris so we get doctest failures if other (better) OS'es have a bug that causes os.write to fail while isalive().

Last edited 9 years ago by vbraun (previous) (diff)

comment:27 in reply to: ↑ 26 ; follow-up: Changed 9 years ago by jdemeyer

Replying to vbraun:

Replying to jdemeyer:

How can we be certain that it only affects (Open)Solaris?

We can't, thats why I would like to special case only Solaris so we get doctest failures

And then what? Add a fix for that other operating system? Better just fix all hypothetical errors in one go.

comment:28 Changed 9 years ago by jdemeyer

Anyway, the OpenSolaris issue seems to be fixed by this patch.

comment:29 in reply to: ↑ 27 Changed 9 years ago by vbraun

Replying to jdemeyer:

And then what?

And then hopefully not make an OS with such a glaring kernel bug a first-class supported platform.

comment:30 Changed 9 years ago by jpflori

  • Cc jpflori added

comment:31 Changed 9 years ago by jpflori

FYI, this is also happening on Solaris 10, so not only OpenSolaris? (which is dead isn't it? we should maybe rather try OpenIndiana??).

comment:32 Changed 9 years ago by jpflori

(Although OpenIndiana? does not support SPARC yet...)

comment:33 Changed 9 years ago by jdemeyer

  • Milestone changed from sage-5.10 to sage-5.9

comment:34 Changed 9 years ago by vbraun

Fwiw, here is an alternative patch.

comment:35 Changed 9 years ago by jdemeyer

Volker, for the record: do you give positive review to 14371_traceback_dont_catch.patch?

Last edited 9 years ago by jdemeyer (previous) (diff)

comment:36 Changed 9 years ago by vbraun

Yes, to the extend that I can give positive review to a patch that I wrote for the most part ;-)

comment:37 Changed 9 years ago by jdemeyer

Volker, as release manager I need this bug to be fixed, so I'm going to apply your version, even though I still prefer to fix the potential problem on all systems.

Unfortunately, David Kirkby's OpenSolaris machine is down for the moment, so I cannot test your patch (although it should work).

comment:38 Changed 9 years ago by vbraun

Sorry for being a PITA sometimes ;-)

Changed 9 years ago by jdemeyer

Alternative patch

comment:39 Changed 9 years ago by jdemeyer

  • Description modified (diff)
  • Reviewers set to Volker Braun, Jeroen Demeyer

Comment-only change to your patch. Volker, if you agree, then this one is good to go.

comment:40 Changed 9 years ago by vbraun

  • Status changed from needs_review to positive_review

Sounds good to me.

comment:41 Changed 9 years ago by jdemeyer

  • Merged in set to sage-5.9.rc0
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.