Opened 4 years ago

Closed 2 years ago

#17864 closed defect (wontfix)

Popen.terminate not working on cygwin

Reported by: gouezel Owned by:
Priority: major Milestone: sage-duplicate/invalid/wontfix
Component: porting: Cygwin Keywords:
Cc: jpflori Merged in:
Authors: Sebastien Gouezel Reviewers:
Report Upstream: N/A Work issues:
Branch: u/gouezel/terminate_robustly (Commits) Commit: f76d23b54581b8a4c1091d96980a174bd7ec2af9
Dependencies: Stopgaps:

Description

The implementation of signals on cygwin is so deficient that Popened processes with an stdout pipe can not be terminated using signals if they still have something to output. This is a known problem. For instance:

>>> from subprocess import Popen, PIPE
>>> p = Popen(['cat', 'setup.py'], stdout=PIPE)
>>> p.stdout.readline()
'#!/usr/bin/env python\n'
>>> p.terminate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/subprocess.py", line 1551, in terminate
    self.send_signal(signal.SIGTERM)
  File "/usr/lib/python2.7/subprocess.py", line 1546, in send_signal
    os.kill(self.pid, sig)
OSError: [Errno 3] No such process

The palp library is called in sage precisely in this way. There are already workarounds in sage/geometry/lattice_polytope.py:

# We program around an issue with subprocess (this so far seems to
# only be an issue on Cygwin).
try:
    p.terminate()
except OSError:
    pass

but the result is that the process is left dangling forever. And there is also one place without the workaround, resulting in the exception (and a failing doctest).

Change History (21)

comment:1 Changed 4 years ago by gouezel

  • Authors set to Sebastien Gouezel
  • Branch set to u/gouezel/terminate_robustly
  • Commit set to 61d077d391ecdbae80e26616eb4c84b264cd2e52
  • Status changed from new to needs_review

Adding and using a generic function to terminate Popened childs that also works on cygwin. I put it in sage.misc.misc, which is maybe not very good, any better idea welcome.


New commits:

61d077d #17864: let Popened processes terminate even on cygwin

comment:2 Changed 4 years ago by jdemeyer

  • Status changed from needs_review to needs_work

I don't like

p.terminate()
if p.poll() is not None:
    return p.returncode

p.kill()
if p.poll() is not None:
    return p.returncode

raise RuntimeError('Process could not be terminated')

You need to give the process some time to die.

Also: why do you return the returncode? It seems none of the current applications need that.

Therefore, I think the block I quoted above can be replaced by

p.terminate()

comment:3 Changed 4 years ago by git

  • Commit changed from 61d077d391ecdbae80e26616eb4c84b264cd2e52 to a1bb6d43cb5648aeb54ac31f7d4cc0c64978660d

Branch pushed to git repo; I updated commit sha1. New commits:

a1bb6d4 #17864: simplify terminate_robustly

comment:4 Changed 4 years ago by gouezel

  • Status changed from needs_work to needs_review

comment:5 Changed 4 years ago by jdemeyer

  • Status changed from needs_review to needs_work

It seems the process is no longer terminated on Cygwin. Also, can this be implemented in a platform-independent way please? I don't like explicit platform tests if it can be avoided.

comment:6 Changed 4 years ago by gouezel

The python function communicate does the following: read everything that the process has to output, then terminate the process and wait until it is properly terminated (see https://docs.python.org/2/library/subprocess.html). In particular, communicate always terminates the process. This works on cygwin (contrary to the simple signal sent by p.terminate()) but it has two drawbacks:

  • if the process still has a lot to output, it means that a lot of useless work is done, and a lot of useless data is transferred back to sage (and dumped right away).
  • if the process takes time to properly terminate, we will be waiting for it, for no reason.

To me, it means that the simple method using p.terminate() is preferable whenever it is available.

I can make it platform-independent by first trying p.terminate() and using communicate if an exception is raised.

comment:7 Changed 4 years ago by jdemeyer

I understand your explanation, but the problem is that there is really no guarantee that the process will actually be terminated. Waiting for the process to finish by itself is very different from terminating the process. Therefore, I think that communicate is not the right solution to the problem (I don't know Cygwin, so I don't know what the right solution is).

comment:8 Changed 4 years ago by jdemeyer

I have no idea, but would using a different signal work?

comment:9 Changed 4 years ago by git

  • Commit changed from a1bb6d43cb5648aeb54ac31f7d4cc0c64978660d to f76d23b54581b8a4c1091d96980a174bd7ec2af9

Branch pushed to git repo; I updated commit sha1. New commits:

f76d23b #17864: try/except in terminate_robustly

comment:10 follow-up: Changed 4 years ago by gouezel

  • Status changed from needs_work to needs_review

No signal can work, it is impossible to send any signal to the child process while it is busy waiting to output something... I don't see any other approach than communicate. For the use case we have currently (calls to the palp database), it works perfectly, and in general it should work if the program we call is not buggy.

Modified patch with try/except block, to make it more platform-agnostic. If cygwin's signals improve in the future, this means the exception will not be raised and terminate will be used as it should be also on cygwin.

The patch should not change anything on sane platforms where signals work properly. It solves the issue on cygwin for our current use case, and if there are further issues they will get noticed with the patch, contrary to the current code with except: pass. So, I'd rather get it merged, and open a followup ticket if necessary.

comment:11 Changed 4 years ago by gouezel

  • Cc jpflori added

comment:12 Changed 4 years ago by jdemeyer

Here is a different idea: is it really important to terminate the child process? What about just not terminating the process if that fails? Or something like: close stdin, try to terminate, wait 1 second, try again, give up.

comment:13 in reply to: ↑ 10 Changed 4 years ago by jdemeyer

Replying to gouezel:

For the use case we have currently (calls to the palp database), it works perfectly

Then why not simply use communicate() always on every platform in that use case?

comment:14 Changed 4 years ago by jdemeyer

  • Status changed from needs_review to needs_info

comment:15 follow-up: Changed 4 years ago by jpflori

I guess it's still better to try to properly terminate the process first. IMHO it's just cleaner.

comment:16 in reply to: ↑ 15 Changed 4 years ago by jdemeyer

Replying to jpflori:

I guess it's still better to try to properly terminate the process first.

If there are two ways A and B to achieve X, where A works on some platforms but B always works, we should just do B.

In this case: X = end the PALP process, A = call terminate(), B = use communicate.

comment:17 follow-up: Changed 4 years ago by jpflori

Note that Python doc says:

communicate
...
Note

The data read is buffered in memory, so do not use this method if the data size is large or unlimited. 

comment:18 in reply to: ↑ 17 Changed 4 years ago by jdemeyer

Replying to jpflori:

Note that Python doc says:

communicate
...
Note

The data read is buffered in memory, so do not use this method if the data size is large or unlimited. 

If the output is too large (no idea if this is the case for PALP), you can read() in a loop instead.

comment:19 follow-up: Changed 3 years ago by embray

With apologies, I haven't read through all the comments on this ticket yet, but I feel like the explanation in the ticket's original description is off. It doesn't seem to have anything to do with if the process still has something to input. For example:

>>> p = Popen(['echo'], stdout=PIPE)
>>> p.stdout.read()
'\n'
>>> p.stdout.close()
>>> p.terminate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/subprocess.py", line 1559, in terminate
    self.send_signal(signal.SIGTERM)
  File "/usr/lib/python2.7/subprocess.py", line 1554, in send_signal
    os.kill(self.pid, sig)
OSError: [Errno 3] No such process

I even made a test program called noop with an empty main function and:

>>> p = Popen(['./noop'])
>>> p.terminate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/subprocess.py", line 1559, in terminate
    self.send_signal(signal.SIGTERM)
  File "/usr/lib/python2.7/subprocess.py", line 1554, in send_signal
    os.kill(self.pid, sig)
OSError: [Errno 3] No such process

On Linux the difference is subtle:

>>> p = Popen(['./noop'])
>>> p.terminate()
>>> p.wait()
0
>>> p.terminate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/subprocess.py", line 1551, in terminate
    self.send_signal(signal.SIGTERM)
  File "/usr/lib/python2.7/subprocess.py", line 1546, in send_signal
    os.kill(self.pid, sig)
OSError: [Errno 3] No such process

This is because before the wait() call the process still exists in a zombie state and it's not an error to send it a signal (I think this is an implementation detail though--on Linux the struct representing the relevant task_struct for the process gets an awaiting signal set on it, but of course it's never acted on).

On Windows of course the semantics are totally different. The signal emulation is handled by a thread that's spun up when the process starts. When the process ends and [ExitProcess](https://msdn.microsoft.com/en-us/library/windows/desktop/ms682658(v=vs.85).aspx) is called, all threads are stopped, so there's nothing to "handle" signals sent to that process, so any attempt to do so results in an error. Cygwin could probably be more robust here--it also holds on to a process in its internal process list until wait() is called, and the process has exited. It could check before attempting to signal a process if it's already exited and waiting to be reaped. I don't know if there's an POSIX requirement of this though--it may just be considered an implementation detail.

Long story short, it's not portable to called terminate() or kill() at least before a poll() call if the process has already exited.

comment:20 in reply to: ↑ 19 Changed 3 years ago by embray

Replying to embray:

Cygwin could probably be more robust here--it also holds on to a process in its internal process list until wait() is called, and the process has exited. It could check before attempting to signal a process if it's already exited and waiting to be reaped. I don't know if there's an POSIX requirement of this though--it may just be considered an implementation detail.

To follow up on this, according to http://pubs.opengroup.org/onlinepubs/9699919799/functions/kill.html :

Existing implementations vary on the result of a kill() with pid indicating an inactive process (a terminated process that has not been waited for by its parent). Some indicate success on such a call (subject to permission checking), while others give an error of [ESRCH]. Since the definition of process lifetime in this volume of POSIX.1-2008 covers inactive processes, the [ESRCH] error as described is inappropriate in this case. In particular, this means that an application cannot have a parent process check for termination of a particular child with kill(). (Usually this is done with the null signal; this can be done reliably with waitpid().)

So Cygwin is definitely wrong here. I'll bring this up with them. Looking at the Cygwin implementation I see no obvious reason why it couldn't have better behavior here. Nonetheless it's acknowledged above that existing implementations have varying behavior, and Cygwin may not be the only culprit. So best is to not send signals to exited processes (and also be prepared to handle ESRCH if the process exits between the last poll and the kill() call).

comment:21 Changed 2 years ago by embray

  • Milestone changed from sage-6.6 to sage-duplicate/invalid/wontfix
  • Resolution set to wontfix
  • Status changed from needs_info to closed

Going ahead and closing this since it's been superseded by #21206 (which is a duplicate of this issue, but closer to being resolved).

Note: See TracTickets for help on using tickets.