Opened 7 years ago

Closed 7 years ago

#14055 closed defect (fixed)

Fix sage-cleaner

Reported by: jdemeyer Owned by: leif
Priority: critical Milestone: sage-5.10
Component: scripts Keywords:
Cc: ppurka, nbruin Merged in: sage-5.10.beta2
Authors: Jeroen Demeyer, Volker Braun Reviewers: John Palmieri, Leif Leonhardy, Volker Braun
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Description (last modified by vbraun)

sage-cleaner got completely broken due to incompatible filenames between sage-cleaner and the Sage library.

Apply:

  1. 14055_cleaner_sagelib.patch to the Sage library.
  2. 14055_simplify_env.patch to the Sage library.
  3. trac_14055_maxima_debug.patch to the Sage library.
  4. 14055_cleaner.patch to local/bin.
  5. 14055_cleaner_hostname.patch to local/bin.

Update ECL spkg with http://boxen.math.washington.edu/home/vbraun/spkg/ecl-12.12.1.p3.spkg

Attachments (7)

14055_cleaner_sagelib.patch (3.4 KB) - added by jdemeyer 7 years ago.
14055_cleaner.patch (6.9 KB) - added by jdemeyer 7 years ago.
14055_simplify_env.patch (7.0 KB) - added by vbraun 7 years ago.
Updated patch
14055_cleaner_hostname.patch (1.8 KB) - added by jdemeyer 7 years ago.
Updated patch
trac_14055_maxima_debug.patch (745 bytes) - added by vbraun 7 years ago.
Initial patch
log.10693 (432.6 KB) - added by vbraun 7 years ago.
strace of failed ecl startup (from taurus)
ecl-p3.diff (1.1 KB) - added by vbraun 7 years ago.
diff for review only: ecl-12.12.1.p2 -> ecl-12.12.1.p3

Download all attachments as: .zip

Change History (115)

comment:1 Changed 7 years ago by jdemeyer

I am working on a patch.

comment:2 Changed 7 years ago by jdemeyer

  • Authors set to Jeroen Demeyer
  • Status changed from new to needs_review

comment:3 Changed 7 years ago by ppurka

  • Cc ppurka added

comment:4 follow-up: Changed 7 years ago by jhpalmieri

In the line

pidfile = os.path.join(DOT_SAGE, 'tmp', 'cleaner-%s.pid'%HOSTNAME)

would it make sense to use 'temp' instead of 'tmp', so all of the sage-cleaner information is in the same directory?

(Is there any good reason for having both directories tmp and temp? At some point, we should clean this up.)

comment:5 follow-up: Changed 7 years ago by leif

  • Keywords orphans added
  • Reviewers set to Punarbasu Purkayastha, Leif Leonhardy

THERE ARE STILL SOME INSTANCES OF UPPERCASE SAGE.

I don't like the mixture of tmp, temp, TMP and TEMP or whatever.


temp and tmp shouldn't be hardcoded.

"Deleting .../spawned_processes" gets printed even if .../spawned_processes is not a directory (not e), and also rmtree() is attempted, ignoring any OSError.


I don't like "%s" % i where i is (or should be) an int.


Command line option handling is certainly suboptimal. (argv[1] could be a float as well, there's no usage, and additional parameters simply get ignored.)

Likewise, if DOT_SAGE isn't set, an ugly KeyError is raised, instead of printing a meaningful message.

comment:6 Changed 7 years ago by leif

  • Reviewers changed from Punarbasu Purkayastha, Leif Leonhardy to John Palmieri, Leif Leonhardy

Ooops.

comment:7 in reply to: ↑ 5 Changed 7 years ago by jdemeyer

Leif: the subject of this ticket is not "Fix all possible issues in sage-cleaner", the goal is to just make it work.

comment:8 in reply to: ↑ 4 ; follow-up: Changed 7 years ago by jdemeyer

Replying to jhpalmieri:

Is there any good reason for having both directories tmp and temp?

Absolutely not, but at least the directory used in sage-cleaner should match the directory used in the Sage library.

comment:9 Changed 7 years ago by jhpalmieri

I am having a problem with this patch, but I don't know why. I am repeatedly running sage -tp 2 devel/sage/sage/homology/s*. Without the patch applied, everything is fine. With the patch, after the first successful run, I get this (I also removed the &>/dev/null after the call to sage-cleaner in spkg/bin/sage):

$ sage -tp 2 devel/sage/sage/homology/s*
SAGE_TMP_ROOT = /Users/palmieri/.sage/temp/jpalmieri538
Starting sage-cleaner
Checking PIDs []
sage-cleaner is finished
Running doctests with ID 2013-03-20-11-16-30-613ad8e5.
Sorting sources by runtime so that slower doctests are run first....
Doctesting 3 files using 2 threads.
sage -t devel/sage/sage/homology/simplicial_complex_morphism.py
    [187 tests, 0.3 s]
sage -t devel/sage/sage/homology/simplicial_complex_homset.py
    [49 tests, 0.1 s]
sage -t devel/sage/sage/homology/simplicial_complex.py
    Time out after testing finished
**********************************************************************
Tests run before process timed out:

...

**********************************************************************
----------------------------------------------------------------------
sage -t devel/sage/sage/homology/simplicial_complex.py  # Time out after testing finished
----------------------------------------------------------------------
Total time for all tests: 26.9 seconds
    cpu time: 0.4 seconds
    cumulative wall time: 0.5 seconds

Without the patch, the file simplicial_complex.py passes tests in under 7 seconds; cpu time and cumulative wall time are each around 6 or 7 seconds. With the patch, the time out is pretty consistent for each run after the first. I'm seeing this on two different OS X 10.8 machines.

Also, if I delete the directory DOT_SAGE/tmp, then the pidfile won't get written at all during doctesting or when running sage. What script is responsible for creating that directory? If this directory is missing, I don't get the time-out any more, so the problem is really with sage-cleaner.

On the bright side, it seems to be cleaning out the appropriate files, directories, and processes.

comment:10 Changed 7 years ago by jhpalmieri

Maybe the problem is that time.sleep(wait + 3) used to be called before the first call to cleanup(), and now it's not. So now cleanup() can be called before any doctesting processes have started, so it returns zero, so sage-cleaner thinks there is nothing to be cleaned up and exits right away. I'm not sure why that results in a timeout in the doctesting, but anyway, this change fixes the problem for me:

  • sage-cleaner

    diff --git a/sage-cleaner b/sage-cleaner
    a b  
    115115        else:
    116116            wait = 10
    117117   
     118        time.sleep(wait)
    118119        # Initial cleanup, ignore time
    119120        running_sages = cleanup()
    120121        cleanup_time = 0.0

I don't know if the initial sleep should be for wait or a smaller number. Does it matter?

comment:11 in reply to: ↑ 8 Changed 7 years ago by jhpalmieri

Replying to jdemeyer:

Replying to jhpalmieri:

Is there any good reason for having both directories tmp and temp?

Absolutely not, but at least the directory used in sage-cleaner should match the directory used in the Sage library.

Where is DOTSAGE/tmp used? Certainly the temp version is used in the definition of SAGE_TMP, so that is ubiquitous. For the file cleaner-HOSTNAME.pid, is that referenced anywhere in the Sage library? It looks like we could put that file in temp instead of tmp safely. Indeed, I can't find any other references to DOTSAGE/tmp...

comment:12 follow-up: Changed 7 years ago by novoselt

I've applied this patch to SAGE_ROOT/local/bin, ran "make ptestlong" and still got two ecl processes eating RAM when tests were finished without errors.

comment:13 Changed 7 years ago by jdemeyer

jhpalmieri: I can totally reproduce problems with devel/sage/sage/homology/simplicial_complex.py, but they are different from yours and unrelated to sage-cleaner: #14323.

comment:14 follow-up: Changed 7 years ago by jhpalmieri

If you look at the output from the doctest command, I find it suspicious that the message "sage-cleaner is finished" appears before any doctests are run.

comment:15 in reply to: ↑ 14 Changed 7 years ago by jdemeyer

Replying to jhpalmieri:

I find it suspicious that the message "sage-cleaner is finished" appears before any doctests are run.

Sure, but I don't see how that could lead to doctest failures. It could be that your problem is also an instance of #14323, let's first fix that.

comment:16 Changed 7 years ago by leif

Replying to jhpalmieri:

Maybe the problem is that time.sleep(wait + 3) used to be called before the first call to cleanup(), and now it's not. So now cleanup() can be called before any doctesting processes have started, so it returns zero, so sage-cleaner thinks there is nothing to be cleaned up and exits right away. [...] I don't know if the initial sleep should be for wait or a smaller number. Does it matter?

I wouldn't rely on some wall time.

Unfortunately the cleaner cannot wait() for its parent, but it could poll whether it is still running (or -- much better I think -- wait for another signal), and do nothing until its parent has "vanished" (terminated, and no other process with the same PID is meanwhile running -- that's also a potential, although probably unlikely, problem with left-over Sage cleaner and spawned_processes PID files; in the latter case, even unrelated processes of the same user could get killed.)

If there's already another Sage cleaner instance running, it could of course exit immediately (modulo the problem mentioned before).

comment:17 Changed 7 years ago by jhpalmieri

I also get the same problem with devel/sage/sage/homology/examples.py, but that file uses libGAP, too. If I doctest both of these, I only get a failure in examples.py, while the other passes doctests. Adding my patch from my comment above makes the failures go away. I'm still not sure if these problems are related to #14323 or not. If they are, then at least I can reliably produce the failures, and maybe that will help to track them down.

comment:18 follow-up: Changed 7 years ago by jhpalmieri

In case it helps, I can duplicate this problem on bsd.math.washington.edu.

comment:19 Changed 7 years ago by jdemeyer

John: please check whether the new spkg at #14323 fixes your problem.

comment:20 in reply to: ↑ 18 Changed 7 years ago by jdemeyer

Replying to jhpalmieri:

In case it helps, I can duplicate this problem on bsd.math.washington.edu.

I can indeed reproduce the problem and #14323 doesn't help.

comment:21 Changed 7 years ago by jhpalmieri

On boxen, with the spkg from #14323, I see that the issue from that ticket seems to be fixed, but then I can reproduce the issue here.

comment:22 Changed 7 years ago by jdemeyer

I found the problem with expect.py: the doctest process starts the sage-cleaner, somehow leading to trouble.

Changed 7 years ago by jdemeyer

Changed 7 years ago by jdemeyer

comment:23 Changed 7 years ago by jdemeyer

  • Description modified (diff)
  • Keywords orphans removed

Needs review.

comment:24 Changed 7 years ago by jhpalmieri

Once or twice, after running Sage, quitting, and starting it again, or doing doctests and then running Sage, I've seen this (again, I have allowed sage-cleaner to print its output to the screen):

----------------------------------------------------------------------
| Sage Version 5.9.beta1, Release Date: 2013-03-22                   |
| Type "notebook()" for the browser-based notebook interface.        |
| Type "help()" for help.                                            |
----------------------------------------------------------------------
**********************************************************************
*                                                                    *
* Warning: this is a prerelease version, and it may be unstable.     *
*                                                                    *
**********************************************************************
SAGE_TMP_ROOT = /Users/palmieri/.sage/temp/Macintosh_001b639d44a1.local
sage-cleaner is already running with PID 78006, exiting
Checking PIDs []
cleanup() #13 took 0.00s
sage-cleaner is finished
sage: 

Sure enough, in this case sage-cleaner isn't running while Sage is. Is this a problem? (It does start up if asked to by interfaces/cleaner.py.)

I can, not very reliably, reproduce this by starting Sage, letting the count go above 10, quit Sage, wait a second or two, and then restart it.

Otherwise, I think I'm happy with this.

comment:25 Changed 7 years ago by jhpalmieri

See #14342 for a sort of follow-up (an attempt to stop using DOT_SAGE/tmp altogether).

comment:26 in reply to: ↑ 12 Changed 7 years ago by leif

Replying to novoselt:

I've applied this patch to SAGE_ROOT/local/bin, ran "make ptestlong" and still got two ecl processes eating RAM when tests were finished without errors.

Did you meanwhile also try the new versions? (There's now a patch to the Sage library as well.)

comment:27 follow-up: Changed 7 years ago by novoselt

Same story: compile 5.9.beta1, apply both patches here, run "make ptestlong" and there are two ecl processes. I also got failures

sage -t --long devel/sage/sage/homology/simplicial_complex.py  # 5 doctests failed
sage -t --long devel/sage/sage/libs/gap/libgap.pyx  # 5 doctests failed
sage -t --long devel/sage/sage/libs/gap/element.pyx  # 3 doctests failed

comment:28 in reply to: ↑ 27 Changed 7 years ago by jdemeyer

Replying to novoselt:

Same story: compile 5.9.beta1, apply both patches here, run "make ptestlong" and there are two ecl processes. I also got failures

sage -t --long devel/sage/sage/homology/simplicial_complex.py  # 5 doctests failed
sage -t --long devel/sage/sage/libs/gap/libgap.pyx  # 5 doctests failed
sage -t --long devel/sage/sage/libs/gap/element.pyx  # 3 doctests failed

Please show which tests are failing.

comment:29 Changed 7 years ago by novoselt

novoselt@sage:~/sage-5.9.beta1$ ./sage -t --long devel/sage/sage/homology/simplicial_complex.py  # 5 doctests failed
Running doctests with ID 2013-03-28-13-53-51-e8efe0bf.
Doctesting 1 file.
sage -t --long devel/sage/sage/homology/simplicial_complex.py
**********************************************************************
File "devel/sage/sage/homology/simplicial_complex.py", line 3134, in sage.homology.simplicial_complex.SimplicialComplex.automorphism_group
Failed example:
    S.automorphism_group().is_isomorphic(SymmetricGroup(4))
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.homology.simplicial_complex.SimplicialComplex.automorphism_group[1]>", line 1, in <module>
        S.automorphism_group().is_isomorphic(SymmetricGroup(Integer(4)))
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/homology/simplicial_complex.py", line 3155, in automorphism_group
        [f.tuple() for f in self.facets()]])
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/graphs/generic_graph.py", line 16543, in automorphism_group
        output.append(PermutationGroup([perm_group_elt(aa) for aa in a]))
      File "refinement_graphs.pyx", line 941, in sage.groups.perm_gps.partn_ref.refinement_graphs.perm_group_elt (sage/groups/perm_gps/partn_ref/refinement_graphs.c:32796)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/groups/perm_gps/permgroup.py", line 653, in __call__
        return self._element_class()(x, self, check=check)
      File "permgroup_element.pyx", line 452, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__init__ (sage/groups/perm_gps/permgroup_element.c:4490)
      File "sage_object.pyx", line 474, in sage.structure.sage_object.SageObject._gap_ (sage/structure/sage_object.c:4544)
      File "sage_object.pyx", line 450, in sage.structure.sage_object.SageObject._interface_ (sage/structure/sage_object.c:4144)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 199, in __call__
        return cls(self, x, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 1285, in __init__
        raise TypeError, x
    TypeError: list.remove(x): x not in list
**********************************************************************
File "devel/sage/sage/homology/simplicial_complex.py", line 3138, in sage.homology.simplicial_complex.SimplicialComplex.automorphism_group
Failed example:
    P.automorphism_group().is_isomorphic(AlternatingGroup(5))
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.homology.simplicial_complex.SimplicialComplex.automorphism_group[3]>", line 1, in <module>
        P.automorphism_group().is_isomorphic(AlternatingGroup(Integer(5)))
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/homology/simplicial_complex.py", line 3155, in automorphism_group
        [f.tuple() for f in self.facets()]])
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/graphs/generic_graph.py", line 16543, in automorphism_group
        output.append(PermutationGroup([perm_group_elt(aa) for aa in a]))
      File "refinement_graphs.pyx", line 941, in sage.groups.perm_gps.partn_ref.refinement_graphs.perm_group_elt (sage/groups/perm_gps/partn_ref/refinement_graphs.c:32796)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/groups/perm_gps/permgroup.py", line 653, in __call__
        return self._element_class()(x, self, check=check)
      File "permgroup_element.pyx", line 452, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__init__ (sage/groups/perm_gps/permgroup_element.c:4490)
      File "sage_object.pyx", line 474, in sage.structure.sage_object.SageObject._gap_ (sage/structure/sage_object.c:4544)
      File "sage_object.pyx", line 450, in sage.structure.sage_object.SageObject._interface_ (sage/structure/sage_object.c:4144)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 199, in __call__
        return cls(self, x, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 1280, in __init__
        self._name = parent._create(value, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 389, in _create
        self.set(name, value)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 1322, in set
        out = self._eval_line(cmd, allow_use_file=True)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 730, in _eval_line
        self._start()
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 1175, in _start
        Expect._start(self, "Failed to start GAP.")
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 443, in _start
        self._expect.expect(self._prompt)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/pexpect.py", line 916, in expect
        return self.expect_list(compiled_pattern_list, timeout, searchwindowsize)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/pexpect.py", line 967, in expect_list
        c = self.read_nonblocking (self.maxread, timeout)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/pexpect.py", line 548, in read_nonblocking
        r, w, e = select.select([self.child_fd], [], [], timeout)
    error: (4, 'Interrupted system call')
**********************************************************************
File "devel/sage/sage/homology/simplicial_complex.py", line 3142, in sage.homology.simplicial_complex.SimplicialComplex.automorphism_group
Failed example:
    Z.automorphism_group().is_isomorphic(CyclicPermutationGroup(2))
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.homology.simplicial_complex.SimplicialComplex.automorphism_group[5]>", line 1, in <module>
        Z.automorphism_group().is_isomorphic(CyclicPermutationGroup(Integer(2)))
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/homology/simplicial_complex.py", line 3155, in automorphism_group
        [f.tuple() for f in self.facets()]])
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/graphs/generic_graph.py", line 16543, in automorphism_group
        output.append(PermutationGroup([perm_group_elt(aa) for aa in a]))
      File "refinement_graphs.pyx", line 941, in sage.groups.perm_gps.partn_ref.refinement_graphs.perm_group_elt (sage/groups/perm_gps/partn_ref/refinement_graphs.c:32796)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/groups/perm_gps/permgroup.py", line 653, in __call__
        return self._element_class()(x, self, check=check)
      File "permgroup_element.pyx", line 452, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__init__ (sage/groups/perm_gps/permgroup_element.c:4490)
      File "sage_object.pyx", line 474, in sage.structure.sage_object.SageObject._gap_ (sage/structure/sage_object.c:4544)
      File "sage_object.pyx", line 450, in sage.structure.sage_object.SageObject._interface_ (sage/structure/sage_object.c:4144)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 199, in __call__
        return cls(self, x, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 1280, in __init__
        self._name = parent._create(value, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 389, in _create
        self.set(name, value)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 1322, in set
        out = self._eval_line(cmd, allow_use_file=True)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 736, in _eval_line
        expect_eof= (self._quit_string() in line))
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 597, in _execute_line
        x = E.expect_list(self._compiled_full_pattern)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 516, in __getattr__
        return ParentWithBase.__getattribute__(self, attrname)
      File "parent.pyx", line 675, in sage.structure.parent.Parent.__getattr__ (sage/structure/parent.c:6215)
    AttributeError: 'Gap' object has no attribute '_compiled_full_pattern'
**********************************************************************
File "devel/sage/sage/homology/simplicial_complex.py", line 3144, in sage.homology.simplicial_complex.SimplicialComplex.automorphism_group
Failed example:
    group, dict = Z.automorphism_group(translation=True)
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.homology.simplicial_complex.SimplicialComplex.automorphism_group[6]>", line 1, in <module>
        group, dict = Z.automorphism_group(translation=True)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/homology/simplicial_complex.py", line 3155, in automorphism_group
        [f.tuple() for f in self.facets()]])
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/graphs/generic_graph.py", line 16543, in automorphism_group
        output.append(PermutationGroup([perm_group_elt(aa) for aa in a]))
      File "refinement_graphs.pyx", line 941, in sage.groups.perm_gps.partn_ref.refinement_graphs.perm_group_elt (sage/groups/perm_gps/partn_ref/refinement_graphs.c:32796)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/groups/perm_gps/permgroup.py", line 653, in __call__
        return self._element_class()(x, self, check=check)
      File "permgroup_element.pyx", line 452, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__init__ (sage/groups/perm_gps/permgroup_element.c:4490)
      File "sage_object.pyx", line 474, in sage.structure.sage_object.SageObject._gap_ (sage/structure/sage_object.c:4544)
      File "sage_object.pyx", line 450, in sage.structure.sage_object.SageObject._interface_ (sage/structure/sage_object.c:4144)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 199, in __call__
        return cls(self, x, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 1280, in __init__
        self._name = parent._create(value, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 389, in _create
        self.set(name, value)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 1322, in set
        out = self._eval_line(cmd, allow_use_file=True)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 736, in _eval_line
        expect_eof= (self._quit_string() in line))
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 597, in _execute_line
        x = E.expect_list(self._compiled_full_pattern)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 516, in __getattr__
        return ParentWithBase.__getattribute__(self, attrname)
      File "parent.pyx", line 675, in sage.structure.parent.Parent.__getattr__ (sage/structure/parent.c:6215)
    AttributeError: 'Gap' object has no attribute '_compiled_full_pattern'
**********************************************************************
File "devel/sage/sage/homology/simplicial_complex.py", line 3145, in sage.homology.simplicial_complex.SimplicialComplex.automorphism_group
Failed example:
    Set([dict[s] for s in Z.vertices()])
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.homology.simplicial_complex.SimplicialComplex.automorphism_group[7]>", line 1, in <module>
        Set([dict[s] for s in Z.vertices()])
    TypeError: 'type' object has no attribute '__getitem__'
**********************************************************************
1 item had failures:
   5 of   9 in sage.homology.simplicial_complex.SimplicialComplex.automorphism_group
    [409 tests, 5 failures, 10.0 s]
----------------------------------------------------------------------
sage -t --long devel/sage/sage/homology/simplicial_complex.py  # 5 doctests failed
----------------------------------------------------------------------
Total time for all tests: 10.3 seconds
    cpu time: 7.9 seconds
    cumulative wall time: 10.0 seconds
novoselt@sage:~/sage-5.9.beta1$ ./sage -t --long devel/sage/sage/libs/gap/libgap.pyx  # 5 doctests failed
Running doctests with ID 2013-03-28-13-55-04-e18989af.
Doctesting 1 file.
sage -t --long devel/sage/sage/libs/gap/libgap.pyx
**********************************************************************
File "devel/sage/sage/libs/gap/libgap.pyx", line 23, in sage.libs.gap.libgap
Failed example:
    b = gap('10')
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.libs.gap.libgap[5]>", line 1, in <module>
        b = gap('10')
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 199, in __call__
        return cls(self, x, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 1285, in __init__
        raise TypeError, x
    TypeError: list.remove(x): x not in list
**********************************************************************
File "devel/sage/sage/libs/gap/libgap.pyx", line 24, in sage.libs.gap.libgap
Failed example:
    timeit('b*b')   # random output
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.libs.gap.libgap[6]>", line 1, in <module>
        timeit('b*b')   # random output
      File "sage_timeit_class.pyx", line 114, in sage.misc.sage_timeit_class.SageTimeit.__call__ (sage/misc/sage_timeit_class.c:931)
      File "sage_timeit_class.pyx", line 78, in sage.misc.sage_timeit_class.SageTimeit.eval (sage/misc/sage_timeit_class.c:731)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/misc/sage_timeit.py", line 236, in sage_timeit
        if timer.timeit(number) >= 0.2:
      File "/home/novoselt/sage-5.9.beta1/local/lib/python/timeit.py", line 195, in timeit
        timing = self.inner(it, self.timer)
      File "<magic-timeit>", line 6, in inner
    NameError: global name 'b' is not defined
**********************************************************************
File "devel/sage/sage/libs/gap/libgap.pyx", line 52, in sage.libs.gap.libgap
Failed example:
    generators = libgap.AlternatingGroup(4).GeneratorsOfGroup().sage()
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.libs.gap.libgap[13]>", line 1, in <module>
        generators = libgap.AlternatingGroup(Integer(4)).GeneratorsOfGroup().sage()
      File "element.pyx", line 1617, in sage.libs.gap.element.GapElement_List.sage (sage/libs/gap/element.c:9500)
      File "element.pyx", line 1673, in sage.libs.gap.element.GapElement_Permutation.sage (sage/libs/gap/element.c:9699)
      File "permgroup_element.pyx", line 452, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__init__ (sage/groups/perm_gps/permgroup_element.c:4490)
      File "sage_object.pyx", line 474, in sage.structure.sage_object.SageObject._gap_ (sage/structure/sage_object.c:4544)
      File "sage_object.pyx", line 450, in sage.structure.sage_object.SageObject._interface_ (sage/structure/sage_object.c:4144)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 199, in __call__
        return cls(self, x, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 1280, in __init__
        self._name = parent._create(value, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 389, in _create
        self.set(name, value)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 1322, in set
        out = self._eval_line(cmd, allow_use_file=True)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 730, in _eval_line
        self._start()
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 1175, in _start
        Expect._start(self, "Failed to start GAP.")
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 443, in _start
        self._expect.expect(self._prompt)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/pexpect.py", line 916, in expect
        return self.expect_list(compiled_pattern_list, timeout, searchwindowsize)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/pexpect.py", line 967, in expect_list
        c = self.read_nonblocking (self.maxread, timeout)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/pexpect.py", line 548, in read_nonblocking
        r, w, e = select.select([self.child_fd], [], [], timeout)
    error: (4, 'Interrupted system call')
**********************************************************************
File "devel/sage/sage/libs/gap/libgap.pyx", line 53, in sage.libs.gap.libgap
Failed example:
    generators   # a Sage list of Sage permutations!
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.libs.gap.libgap[14]>", line 1, in <module>
        generators   # a Sage list of Sage permutations!
    NameError: name 'generators' is not defined
**********************************************************************
File "devel/sage/sage/libs/gap/libgap.pyx", line 55, in sage.libs.gap.libgap
Failed example:
    PermutationGroup(generators).cardinality()   # computed in Sage
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.libs.gap.libgap[15]>", line 1, in <module>
        PermutationGroup(generators).cardinality()   # computed in Sage
    NameError: name 'generators' is not defined
**********************************************************************
1 item had failures:
   5 of  36 in sage.libs.gap.libgap
    [63 tests, 5 failures, 5.3 s]
----------------------------------------------------------------------
sage -t --long devel/sage/sage/libs/gap/libgap.pyx  # 5 doctests failed
----------------------------------------------------------------------
Total time for all tests: 5.4 seconds
    cpu time: 5.1 seconds
    cumulative wall time: 5.3 seconds
novoselt@sage:~/sage-5.9.beta1$ ./sage -t --long devel/sage/sage/libs/gap/element.pyx  # 3 doctests failed
Running doctests with ID 2013-03-28-13-55-47-94cfc7ed.
Doctesting 1 file.
sage -t --long devel/sage/sage/libs/gap/element.pyx
**********************************************************************
File "devel/sage/sage/libs/gap/element.pyx", line 1400, in sage.libs.gap.element.GapElement_Function._sage_doc_
Failed example:
    'constructs  the  cyclic  group' in f._sage_doc_()
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.libs.gap.element.GapElement_Function._sage_doc_[1]>", line 1, in <module>
        'constructs  the  cyclic  group' in f._sage_doc_()
      File "element.pyx", line 1407, in sage.libs.gap.element.GapElement_Function._sage_doc_ (sage/libs/gap/element.c:8928)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 1287, in help
        tmp_to_use = self._local_tmpfile()
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 581, in _local_tmpfile
        self.__local_tmpfile = os.path.join(SAGE_TMP_INTERFACE, 'tmp' + str(self.pid()))
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 298, in pid
        self._start()
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 1180, in _start
        expect.failed_to_start.remove(self.name())
    ValueError: list.remove(x): x not in list
**********************************************************************
File "devel/sage/sage/libs/gap/element.pyx", line 1666, in sage.libs.gap.element.GapElement_Permutation.sage
Failed example:
    perm_gap.sage()
Exception raised:
    Traceback (most recent call last):
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 459, in _run
        self.execute(example, compiled, test.globs)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 815, in execute
        exec compiled in globs
      File "<doctest sage.libs.gap.element.GapElement_Permutation.sage[1]>", line 1, in <module>
        perm_gap.sage()
      File "element.pyx", line 1673, in sage.libs.gap.element.GapElement_Permutation.sage (sage/libs/gap/element.c:9699)
      File "permgroup_element.pyx", line 452, in sage.groups.perm_gps.permgroup_element.PermutationGroupElement.__init__ (sage/groups/perm_gps/permgroup_element.c:4490)
      File "sage_object.pyx", line 474, in sage.structure.sage_object.SageObject._gap_ (sage/structure/sage_object.c:4544)
      File "sage_object.pyx", line 450, in sage.structure.sage_object.SageObject._interface_ (sage/structure/sage_object.c:4144)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 199, in __call__
        return cls(self, x, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 1280, in __init__
        self._name = parent._create(value, name=name)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/interface.py", line 389, in _create
        self.set(name, value)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 1322, in set
        out = self._eval_line(cmd, allow_use_file=True)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 730, in _eval_line
        self._start()
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/gap.py", line 1175, in _start
        Expect._start(self, "Failed to start GAP.")
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 443, in _start
        self._expect.expect(self._prompt)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/pexpect.py", line 916, in expect
        return self.expect_list(compiled_pattern_list, timeout, searchwindowsize)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/pexpect.py", line 967, in expect_list
        c = self.read_nonblocking (self.maxread, timeout)
      File "/home/novoselt/sage-5.9.beta1/local/lib/python2.7/site-packages/pexpect.py", line 548, in read_nonblocking
        r, w, e = select.select([self.child_fd], [], [], timeout)
    error: (4, 'Interrupted system call')
**********************************************************************
File "devel/sage/sage/libs/gap/element.pyx", line 1668, in sage.libs.gap.element.GapElement_Permutation.sage
Failed example:
    type(_)
Expected:
    <type 'sage.groups.perm_gps.permgroup_element.PermutationGroupElement'>
Got:
    <type 'sage.libs.gap.element.GapElement_Permutation'>
**********************************************************************
2 items had failures:
   1 of   3 in sage.libs.gap.element.GapElement_Function._sage_doc_
   2 of   4 in sage.libs.gap.element.GapElement_Permutation.sage
    [256 tests, 3 failures, 5.5 s]
----------------------------------------------------------------------
sage -t --long devel/sage/sage/libs/gap/element.pyx  # 3 doctests failed
----------------------------------------------------------------------
Total time for all tests: 5.6 seconds
    cpu time: 5.3 seconds
    cumulative wall time: 5.5 seconds

comment:30 Changed 7 years ago by jdemeyer

Please apply #14323 (if you haven't) and try again.

comment:31 follow-up: Changed 7 years ago by novoselt

#14323 fixes the doctest failures, but not ecl processes...

comment:32 in reply to: ↑ 31 ; follow-up: Changed 7 years ago by leif

Replying to novoselt:

#14323 fixes the doctest failures, but not ecl processes...

I'm now among the lucky few who get these greedy, immortal ECL processes:

With Sage 5.9.beta2 (which includes #14323), GCC 4.8.0, on Ubuntu 10.04.4 x86_64, during make testlong (as opposed to make ptestlong -- others previously reported only the latter caused problems).

Two ECL processes show up while testing sage/interfaces/lisp.py (as I had guessed), quickly start eating up memory, and keep running even after the file has been tested without any errors.

comment:33 Changed 7 years ago by jhpalmieri

What needs to be done here. I asked a question earlier, but otherwise, are there issues with this?

comment:34 Changed 7 years ago by jdemeyer

That question is really a race condition between exiting one Sage cleaner and starting another. I'll think about it.

comment:35 in reply to: ↑ 32 Changed 7 years ago by leif

Replying to leif:

Replying to novoselt:

#14323 fixes the doctest failures, but not ecl processes...

I'm now among the lucky few who get these greedy, immortal ECL processes:

With Sage 5.9.beta2 (which includes #14323), GCC 4.8.0, on Ubuntu 10.04.4 x86_64, during make testlong (as opposed to make ptestlong -- others previously reported only the latter caused problems).

Two ECL processes show up while testing sage/interfaces/lisp.py (as I had guessed), quickly start eating up memory, and keep running even after the file has been tested without any errors.

Same on Ubuntu 10.04.4 x86, GCC 4.7.2, and ptestlong. I also got an orphaned Maxima process (running at 100%) in the first run. (Sage 5.9.beta2, no patches from here applied.)

comment:36 follow-up: Changed 7 years ago by leif

So I've continued experimenting with Sage 5.8 vs. 5.9.beta2 (same system, same compiler and flags), and it turns out that the rogue ECL processes are solely caused by the new doctesting framework, so we should probably open another ticket for that issue. (I did the tests with the Sage cleaner "disabled" in both releases, too, with the same results: 5.8, with the old doctesting framework: No problems at all. 5.9.beta2: ECL orphans every time I run doctests from the Makefile, no problems when run from the command line, i.e., without using make.)

Minimal receipt to reproduce the orphans here:

testlisp:
        ./sage -t devel/sage/sage/interfaces/lisp.py

Running make testlisp.

The (un)setting of MAKE, using make -j1, ./sage -t vs. ./sage -tp etc. doesn't matter, nor does using $(PIPE) (or spkg/bin/pipestatus), or teeing the output with a "normal" pipe; it just works when running ./sage -t ... from the command line, while I get the busy orphans as soon as I run the exact same command(s) from the Makefile.

(Note that also ECL didn't change between 5.8 and 5.9.beta2.)

So fixing the Sage cleaner alone may alleviate the problem, but won't really solve it.

comment:37 in reply to: ↑ 36 ; follow-up: Changed 7 years ago by leif

Replying to leif:

[...] Minimal receipt to reproduce the orphans here:

testlisp:
        ./sage -t devel/sage/sage/interfaces/lisp.py

Running make testlisp.

The (un)setting of MAKE, using make -j1, ./sage -t vs. ./sage -tp etc. doesn't matter, nor does using $(PIPE) (or spkg/bin/pipestatus), or teeing the output with a "normal" pipe; it just works when running ./sage -t ... from the command line, while I get the busy orphans as soon as I run the exact same command(s) from the Makefile.

Did a little more testing ...

With (Ubuntu 10.04.4's) GNU Make 3.81, I do get the orphans, while I don't get them with (vanilla) GNU Make 3.82. (I haven't found any particularly interesting change in 3.82's NEWS.)

Still, at least John Cremona reported getting the orphans on Ubuntu 12.04 as well, and I strongly doubt Precise still ships GNU Make 3.81 (released 2006, while 3.82 was released in 2010, although Ubuntu might [still] patch its version, haven't checked that). And we currently don't require the latest version of GNU Make either.

comment:38 in reply to: ↑ 37 Changed 7 years ago by leif

Replying to leif:

With (Ubuntu 10.04.4's) GNU Make 3.81, I do get the orphans, while I don't get them with (vanilla) GNU Make 3.82. (I haven't found any particularly interesting change in 3.82's NEWS.)

Still, at least John Cremona reported getting the orphans on Ubuntu 12.04 as well, and I strongly doubt Precise still ships GNU Make 3.81 (released 2006, while 3.82 was released in 2010, although Ubuntu might [still] patch its version, haven't checked that). And we currently don't require the latest version of GNU Make either.

Oh, indeed at least Ubuntu 12.04.1 LTS still has (some?) GNU Make 3.81 ... m)

comment:39 Changed 7 years ago by jdemeyer

The ECL problem is fixed and needs review at #14426.

comment:40 follow-up: Changed 7 years ago by jhpalmieri

Here is my view on this ticket:

  • sage-cleaner used to be completely broken.
  • this ticket mostly fixes it.
  • the only flaw I see is a race condition, but that looks pretty rare. Fixing it can wait.
  • there are undoubtedly other ways in which sage-cleaner could be fixed up, but they can wait.

So I'm willing to give this a positive review. Any dissenting views?

comment:41 in reply to: ↑ 40 ; follow-ups: Changed 7 years ago by leif

Replying to jhpalmieri:

Here is my view on this ticket:

  • sage-cleaner used to be completely broken.

I actually didn't have problems with it recently (more precisely, for meanwhile a few years), until 5.9.x.

  • this ticket mostly fixes it.
  • the only flaw I see is a race condition, but that looks pretty rare. Fixing it can wait.

Agreed, although I haven't closely looked at that yet. (It's pretty clear that there is a race condition, but I'm not sure how likely it is in practice.)

  • there are undoubtedly other ways in which sage-cleaner could be fixed up, but they can wait.

So I'm willing to give this a positive review. Any dissenting views?

Hmmm, in principleTM the Sage cleaner (with the patches from here) should have killed the ECL processes, but Andrey reported it did not, so (regardless of #12426) something seems to be still broken.

comment:42 in reply to: ↑ 41 Changed 7 years ago by leif

Replying to leif:

Hmmm, in principleTM the Sage cleaner (with the patches from here) should have killed the ECL processes, but Andrey reported it did not, so (regardless of #12426) something seems to be still broken.

Ooops, of course regardless of #12426, but I meant #14426.

comment:43 in reply to: ↑ 41 ; follow-up: Changed 7 years ago by jhpalmieri

Replying to leif:

Replying to jhpalmieri:

Here is my view on this ticket:

  • sage-cleaner used to be completely broken.

I actually didn't have problems with it recently (more precisely, for meanwhile a few years), until 5.9.x.

I haven't seen problems with processes, but the subdirectories of DOT_SAGE/temp/ have been polluted with many directories dating back to 5.8 at least, because sage-cleaner was looking in the wrong directory.

Hmmm, in principleTM the Sage cleaner (with the patches from here) should have killed the ECL processes, but Andrey reported it did not, so (regardless of #12426) something seems to be still broken.

In your testing, could you tell whether the correct PID (for the runaway process) was written to the correct file? How did the runaway process escape the cleaner?

comment:44 in reply to: ↑ 43 ; follow-up: Changed 7 years ago by leif

Replying to jhpalmieri:

Replying to leif: In your testing, could you tell whether the correct PID (for the runaway process) was written to the correct file? How did the runaway process escape the cleaner?

That's a question to Andrey (just in case that wasn't clear).

comment:45 in reply to: ↑ 44 ; follow-up: Changed 7 years ago by novoselt

Replying to leif:

Replying to jhpalmieri:

Replying to leif: In your testing, could you tell whether the correct PID (for the runaway process) was written to the correct file? How did the runaway process escape the cleaner?

That's a question to Andrey (just in case that wasn't clear).

It was not ;-) I have no idea how to check it, but if you provide me with instructions on what files to look at, I'll be happy to do it. ECL processes appear consistently for me until #14426, so reproduction is trivial.

comment:46 in reply to: ↑ 45 Changed 7 years ago by leif

Replying to novoselt:

Replying to leif:

Replying to jhpalmieri:

Replying to leif: In your testing, could you tell whether the correct PID (for the runaway process) was written to the correct file? How did the runaway process escape the cleaner?

That's a question to Andrey (just in case that wasn't clear).

It was not ;-) I have no idea how to check it, but if you provide me with instructions on what files to look at, I'll be happy to do it. ECL processes appear consistently for me until #14426, so reproduction is trivial.

The ECL processes' PIDs should be in the spawned_processes text file, with the patches from here now located in the directory sage.misc.SAGE_TMP (usually some $DOT_SAGE/temp/<hostname>-<pid>/... or $DOT_SAGE/temp/<hostname>/<pid>/?).

Last edited 7 years ago by leif (previous) (diff)

comment:47 Changed 7 years ago by leif

(Edited last comment.)

comment:48 follow-up: Changed 7 years ago by jhpalmieri

Leif, I actually intended the question for you, because of the testing you mentioned above, but I'm happy for answers from anyone.

comment:49 Changed 7 years ago by leif

Third attempt:

spawned_processes should be located in $DOT_SAGE/temp/<hostname>/<pid of [parent] Sage process>/.

The whole naming is (still) a mess, sage.misc.misc.SAGE_TMP is totally unrelated to the environment variable SAGE_TMP.

comment:50 in reply to: ↑ 48 Changed 7 years ago by leif

Replying to jhpalmieri:

Leif, I actually intended the question for you, because of the testing you mentioned above, but I'm happy for answers from anyone.

Well, I thought it was not because it didn't apply; as mentioned, I completely "disabled" the Sage cleaner to track down the ECL issue, as the new doctesting framework was causing it. (The Sage cleaner might just have hidden it in previous releases.)

comment:51 follow-up: Changed 7 years ago by jhpalmieri

Oh, I didn't realize that you had disabled the creation of the spawned-processes file.

By the way, where is the environment variable SAGE_TMP used? I think we should only use sage.misc.misc.SAGE_TMP these days.

comment:52 in reply to: ↑ 51 ; follow-up: Changed 7 years ago by leif

Replying to jhpalmieri:

Oh, I didn't realize that you had disabled the creation of the spawned-processes file.

Nope, I just replaced local/bin/sage-cleaner (to just exit 0).


By the way, where is the environment variable SAGE_TMP used? I think we should only use sage.misc.misc.SAGE_TMP these days.

Good question -- no idea. I thought it was more or less just overriding TMPDIR (and used as such), but I now see it is no longer(?) documented at all anyway.

comment:53 in reply to: ↑ 52 Changed 7 years ago by leif

Replying to leif:

By the way, where is the environment variable SAGE_TMP used? I think we should only use sage.misc.misc.SAGE_TMP these days.

Good question -- no idea. I thought it was more or less just overriding TMPDIR (and used as such), but I now see it is no longer(?) documented at all anyway.

(... and I thought SAGE_TESTDIR would default to it as well.)

comment:54 Changed 7 years ago by leif

Is sage.misc.misc.SAGE_TMP used for anything but the cleaner files? (IMHO the name doesn't suggest it isn't, and a "generic" temporary directory certainly shouldn't by default be located below $HOME, i.e., ~/.sage/.)

The whole cleaner stuff should IMHO be moved to some "var" directory, and the names used (including the Python / shell variable names) should reflect the use.

Last edited 7 years ago by leif (previous) (diff)

comment:55 Changed 7 years ago by novoselt

This is probably not very useful, but:

  • I ran make ptestlong in sage-5.9.beta4 without any patches, got ecl processes in the end of tests and .sage/temp/<host> contained only two folders for a running notebook server (Can it ever be a problem if I am running multiple sages/servers under my account? I didn't notice any issues so far, so I assumed it is OK.)
  • Then I ran make ptest for some reason. I got ecls, no test errors, but there is now 441 folder left under .sage/temp/<host>. I tried grepping through them with PIDs of ecl processes, but it was useless as there were too many hits - looks like these folders were accumulated during testing and PIDs were reused multiple times.

comment:56 Changed 7 years ago by vbraun

This ticket has nothing to do with runaway ecl, which is fixed in #14426.

comment:57 Changed 7 years ago by vbraun

  • Reviewers changed from John Palmieri, Leif Leonhardy to John Palmieri, Leif Leonhardy, Volker Braun
  • Status changed from needs_review to positive_review

I made #14431 for further work on the sage cleaner. But this is better than nothing.

comment:58 Changed 7 years ago by vbraun

  • Description modified (diff)
  • Status changed from positive_review to needs_work

I just noticed that the SAGE_TMP variables are still not in sync if the hostname contains a hyphen (as mine does), and this prevents the sage cleaner from starting on my machine. Please review the trivial 14055_cleaner_hostname.patch.

comment:59 Changed 7 years ago by vbraun

  • Status changed from needs_work to needs_review

comment:60 Changed 7 years ago by jhpalmieri

Volker, I don't understand your patch. In sage.misc.misc, HOSTNAME is imported from sage.env, where its definition is

        'HOSTNAME'         : socket.gethostname().replace('-','_').replace('/','_').replace('\\','_')

What does sage.env.HOSTNAME return on your machine? What about sage.misc.misc.SAGE_TMP?

comment:61 Changed 7 years ago by vbraun

Interesting, I have on a virgin sage-5.9.beta4 install:

sage: sage.env.HOSTNAME
'volker-desktop.stp.dias.ie'
sage: sage.misc.misc.HOSTNAME
'volker-desktop.stp.dias.ie'
sage: sage.misc.misc.SAGE_TMP
l'/home/vbraun/.sage/temp/volker-desktop.stp.dias.ie/17284'

Fun ;-)

comment:62 Changed 7 years ago by vbraun

Also, underscore, slash, and backslash are not valid in hostnames. And dashes (especially without spaces) are totally harmless in file names. Why?? ;-)

comment:63 Changed 7 years ago by vbraun

Oh there is this gem in sage.env:

# set any variables that are already in os.environ
for var in SAGE_ENV:
    try:
        exec(var + ' = os.environ[var]')
    except KeyError:
        pass

So HOSTNAME is again the actual hostname, and you can't grep for it.

comment:64 Changed 7 years ago by jhpalmieri

On a few systems that I tried, os.environ['HOSTNAME'] returns a KeyError, even though echo $HOSTNAME returns something valid from the command line.

Anyway, I'm guessing that we replace hyphens to get potentially valid Python module names. This may be an artifact of the old doctesting code. Should we patch env.py?

  • sage/env.py

    diff --git a/sage/env.py b/sage/env.py
    a b  
    3030SAGE_ENV = {
    3131        # system info
    3232        'UNAME'            : os.uname()[0],
    33         'HOSTNAME'         : socket.gethostname().replace('-','_').replace('/','_').replace('\\','_'),
     33        'HOSTNAME'         : socket.gethostname()
    3434        'LOCAL_IDENTIFIER' : '$HOSTNAME.%s'%os.getpid(),
    3535
    3636        # bunch of sage directories and files

Or is that going to break things?

comment:65 Changed 7 years ago by vbraun

  • Description modified (diff)

comment:66 Changed 7 years ago by vbraun

I agree with John's suggestion. I used the opportunity to get rid of all the eval()s and potential infinite loop in sage.env ;-)

comment:67 Changed 7 years ago by jhpalmieri

This change might be safer than my last proposal:

  • sage/env.py

    diff --git a/sage/env.py b/sage/env.py
    a b  
    5858# set any variables that are already in os.environ
    5959for var in SAGE_ENV:
    6060    try:
    61         exec(var + ' = os.environ[var]')
     61        if var == 'HOSTNAME':
     62            exec(var + ' = os.environ[var].replace("-","_").replace("/","_").replace("\\","_")')
     63        else:
     64            exec(var + ' = os.environ[var]')
    6265    except KeyError:
    6366        pass
    6467

I'll look at your patch, too.

comment:68 Changed 7 years ago by jhpalmieri

Line 113:

_add_variable_or_fallback('DOT_SAGE', "/home/.sage")

doesn't look right: if DOT_SAGE has a space in it, it will remain unchanged in this situation. Perhaps os.environ.pop('DOT_SAGE',None) first?

Changed 7 years ago by vbraun

Updated patch

comment:69 Changed 7 years ago by vbraun

Thanks, fixed.

I don't think we should muck with $HOSTNAME, if it exists. You can break Sage pretty badly by setting any of the supported environment variables to a weird value, so why should be special-case the hostname?

comment:70 Changed 7 years ago by jhpalmieri

Right now we already special case the hostname, so the path of least resistance is to continue to do that. Your changes might be better, but they require more thought. I'll try to test them out, although I would appreciate other sets of eyes looking at them, too.

comment:71 Changed 7 years ago by jhpalmieri

Volker's patch seems to work for me.

A slight problem with changing how Sage processes the hostname: for me, at least, the directory DOT_SAGE/temp/HOSTNAME is filled with old directories which were never properly cleaned up. On my machines, 'HOSTNAME' is not a valid key in os.environ, so the hyphens have been replaced by underscores. If we change sage-cleaner (etc.) to not do the replacement, we will start to use a new HOSTNAME directory, so the old one won't get cleaned up.

On one hand, this is minor, and perhaps people should clean up their own DOT_SAGE/temp directories. On the other hand, the directory is polluted because of a bug in Sage, so maybe we should clean it up for them. So is it worth applying a patch like my suggestion above for Sage 5.9, and then in Sage 5.10, applying Volker's changes?

I'm curious: why is 'HOSTNAME' sometimes in os.environ and sometimes not?

comment:72 follow-ups: Changed 7 years ago by vbraun

Bash set HOSTNAME, so most people have it.

How about we make the sage-cleaner fuzzy match the hostname, ignoring hyphen/underscores. I'll add a patch soon.

comment:73 in reply to: ↑ 72 Changed 7 years ago by ppurka

Replying to vbraun:

Bash set HOSTNAME, so most people have it.

It must be specific to bash because I don't see it in zsh here. Surprisingly it gets HOST, but not when I grep for it in env.

~» echo $HOST
ub2
~» echo $SHELL
/bin/zsh
~» env | grep -i host
~ [1] » echo $HOSTNAME

~» 
Last edited 7 years ago by ppurka (previous) (diff)

comment:74 Changed 7 years ago by vbraun

I've added a bit to sage-cleaner that will delete outdated temp directories.

comment:75 in reply to: ↑ 72 Changed 7 years ago by jhpalmieri

Replying to vbraun:

Bash set HOSTNAME, so most people have it.

I wonder if this is specific to Bash 4. I have version 3.2 (on OS X 10.8.3 and on boxen) and HOSTNAME is not set. echo $HOSTNAME returns the correct thing, but HOSTNAME is not listed by export or env.

Last edited 7 years ago by jhpalmieri (previous) (diff)

comment:76 follow-up: Changed 7 years ago by jhpalmieri

What about defining HOSTNAME in the sage-cleaner script by using from sage.env import HOSTNAME?

comment:77 Changed 7 years ago by jhpalmieri

On second thought, cleaning out the old temp/HOSTNAME directories might be a bad idea. If someone is running an old version of Sage (say, as a notebook server), and while it's running, they try out a new version, it would be bad to delete files and directories possibly in use by the old one.

comment:78 Changed 7 years ago by vbraun

There are only temporary files in there, so deleting them is not that much of an issue. Much worse is that pids get reused, the temp directory contains an old spawned_processes, and then on shutdown random old pids are SIGKILLed.

Bash sets HOSTNAME, but does not export it. The bash profile often exports HOSTNAME among other environment variables. Any further discussion on how and when bash sets HOSTNAME please on a separate ticket.

comment:79 in reply to: ↑ 76 Changed 7 years ago by vbraun

Replying to jhpalmieri:

What about defining HOSTNAME in the sage-cleaner script by using from sage.env import HOSTNAME?

That feature request is already on #14431.

comment:80 Changed 7 years ago by jhpalmieri

  • Authors changed from Jeroen Demeyer to Jeroen Demeyer, Volker Braun

These patches seem to be working for me on several different platforms. Anyone else have an opinion?

comment:81 Changed 7 years ago by jhpalmieri

I take that back. On mark, I get sporadic failures with

./sage -tp devel/sage/doc/en/constructions/

(with 2 threads). I get one or both of

sage -t devel/sage/doc/en/constructions/linear_algebra.rst  # Killed due to bus error
sage -t devel/sage/doc/en/constructions/polynomials.rst  # Killed due to bus error

pretty frequently, especially if I wait a while before running the tests, to make sure there are no running instances of sage-cleaner.

comment:82 Changed 7 years ago by jhpalmieri

I've also seen similar problems on taurus. With taurus, and sometimes with mark, one part of the doctest failure for linear_algebra.rst says

age: var('a,b,c') ## line 416 ##
(a, b, c)
sage: eqn = [a+b*c==1, b-a*c==0, a+b==5] ## line 418 ##
sage: s = solve(eqn, a,b,c); s ## line 419 ##

;;; Unhandled lisp initialization error
;;; Message:
UNBOUND-VARIABLE
;;; Arguments:

Internal or unrecoverable error in:

Lisp initialization error.

  [2: No such file or directory]

Is the sage-cleaner removing some directory too early?

comment:83 Changed 7 years ago by vbraun

I don't see how the cleaner would cause the bus error (usually thats memory alignment or segfault).

The "Lisp initialization error" is #14426

comment:84 Changed 7 years ago by jhpalmieri

On taurus, I have two copies of Sage: one vanilla 5.9.beta4, and one with the patches from here and (now) from #14426 applied: I built the ecl spkg from #14426, reinstalled maxima, ran ./sage -ba

With the vanilla 5.9.beta4, ./sage -tp devel/sage/doc/en/constructions/ passed doctests 10 times in a row.

On the patched 5.9.beta4, the same command had failures 5 of the 10 times. I was seeing similar behavior before applying the patches from #14426.

comment:85 Changed 7 years ago by vbraun

I ran strace -ff -o /tmp/log/trace -s 128 sage -tp doc/en/constructions/linear_algebra.rst on my desktop and there isn't much going on in the file system:

$ grep ^unlink *
trace.13793:unlink("/tmp/ffiHfdBLe")                = 0
trace.13793:unlink("/tmp/QMFB1f")                   = 0
trace.13793:unlink("/tmp/tmp7WBqmx")                = 0
trace.13793:unlink("/dev/shm/sem.opP4VE")           = 0
trace.13793:unlink("/dev/shm/sem.mp13793-0")        = 0
trace.13793:unlink("/dev/shm/sem.yHmC64")           = 0
trace.13793:unlink("/dev/shm/sem.mp13793-1")        = 0
trace.13793:unlink("/dev/shm/sem.8cqchv")           = 0
trace.13793:unlink("/dev/shm/sem.mp13793-2")        = 0
trace.13793:unlink("/tmp/tmpfadSPrV")               = 0
trace.13834:unlink("/home/vbraun/opt/sage-5.9.beta5/local/lib/sage-force-relocate.txt") = -1 ENOENT (No such file or directory)
trace.13844:unlink("/home/vbraun/.sage/temp/volker-desktop.stp.dias.ie/13859/spawned_processes") = 0

$ grep ^rmdir *
trace.13793:rmdir("/home/vbraun/.sage/temp/volker-desktop.stp.dias.ie/13793") = 0
trace.13844:rmdir("/home/vbraun/.sage/temp/volker-desktop.stp.dias.ie/13859") = 0

$ grep ^rename *
trace.13793:rename("/tmp/tmp7WBqmx", "/home/vbraun/.sage/cache/_home_vbraun_opt_sage-5.9.beta5_devel_sage-main-lazy_import_cache.pickle") = -1 EXDEV (Invalid cross-device link)
trace.13793:rename("/home/vbraun/.sage/tmpQCpKCj", "/home/vbraun/.sage/timings2.json") = 0

The only obvious race is #13826, though I don't think that could cause the lisp failure. All communication with Maxima goes through a pty.

comment:86 follow-up: Changed 7 years ago by vbraun

Can we get this reviewed? Especially if you run the patchbot then you rather quickly generate directories for all pids and soon random processes are being killed.

comment:87 in reply to: ↑ 86 Changed 7 years ago by leif

Replying to vbraun:

Can we get this reviewed? Especially if you run the patchbot then you rather quickly generate directories for all pids and soon random processes are being killed.

Go ahead... ;-)

It's not even a 5.9 blocker right now... (milestone 5.10!)

comment:88 Changed 7 years ago by vbraun

Well I'm obviously giving positive review to everything that I didn't write on this ticket. We should run it on the buildbot to see if it exposes any other bugs.

Last edited 7 years ago by vbraun (previous) (diff)

comment:89 Changed 7 years ago by jhpalmieri

  • Status changed from needs_review to positive_review

Well, the code looks okay, and it is an improvement over what was there before. I don't understand the failures I mentioned earlier, but let's give it a positive review so it gets merged and gets some wider testing.

comment:90 Changed 7 years ago by jdemeyer

Could you remove TAB characters from 14055_cleaner_hostname.patch

Changed 7 years ago by jdemeyer

Updated patch

comment:91 Changed 7 years ago by jdemeyer

  • Merged in set to sage-5.10.beta1
  • Resolution set to fixed
  • Status changed from positive_review to closed

Fixed trailing whitespace (including a TAB!)

comment:92 Changed 7 years ago by jdemeyer

  • Merged in sage-5.10.beta1 deleted
  • Resolution fixed deleted
  • Status changed from closed to new

I am still seeing the linear_algebra.rst doctest failures. I don't understand how this ticket could cause it, though.

comment:93 Changed 7 years ago by vbraun

  • Status changed from new to needs_info

I've been running repeated doctests and the only thing that I see on taurus is occasional (about 5% of doctests runs) failure with the maxima pty interface. The trouble always starts with the assertion failure where input != echo:

File "devel/sage/doc/en/constructions/plotting.rst", line 52, in doc.en.constructions.plotting
Failed example:
    L = [(i/100.0, maxima.eval('jacobi_sn (%s/100.0,2.0)'%i))for i in range(-300,300)]
Exception raised:
    Traceback (most recent call last):
      File "/home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 466, in _run
        self.execute(example, compiled, test.globs)
      File "/home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/python2.7/site-packages/sage/doctest/forker.py", line 825, in execute
        exec compiled in globs
      File "<doctest doc.en.constructions.plotting[0]>", line 1, in <module>
        L = [(i/RealNumber('100.0'), maxima.eval('jacobi_sn (%s/100.0,2.0)'%i))for i in range(-Integer(300),Integer(300))]
      File "/home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/python2.7/site-packages/sage/interfaces/expect.py", line 1208, in eval
        for L in code.split('\n') if L != ''])
      File "/home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/python2.7/site-packages/sage/interfaces/maxima.py", line 754, in _eval_line
        assert line_echo.strip() == line.strip()
    AssertionError

And that has nothing to do with this ticket...

comment:94 Changed 7 years ago by vbraun

For the record, this is the comparison that fails:

        assert line_echo.strip() == line.strip(), 'mismatch:\n' + line_echo + line
    AssertionError: mismatch:
    jacobi_sn (-228/100 .0,2.0);
    jacobi_sn (-228/100.0,2.0);

Changed 7 years ago by vbraun

Initial patch

comment:95 Changed 7 years ago by vbraun

  • Description modified (diff)
  • Status changed from needs_info to needs_review

I don't know why there are random spaces sprinkled in, but it certainly doesn't have anything to do with this ticket. I haven't found any other failures on taurus.

comment:96 Changed 7 years ago by jdemeyer

Volker, I don't see how your patch would affect the error reported in 82.

comment:97 Changed 7 years ago by vbraun

What I'm saying is, I could not reproduce (on taurus) the error reported in comment:82. It might be a miscompile or a transient hardware error as far as I can tell.

comment:99 Changed 7 years ago by vbraun

Running with nice -19 and background load seems to trigger it, I now get some maxima segfaults somewhat reliably. They look like a race in the multi-threaded ecl to me, for example the following failure:

sage: s = solve(eqn, a,b,c); s ## line 419 ##

;;; Unhandled lisp initialization error
;;; Message:
unbound-variable
;;; Arguments:

Internal or unrecoverable error in:

Lisp initialization error.

  [2: No such file or directory]

;;; ECL C Backtrace
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(si_dump_c_backtrace+0x28) [0x7f01f330e1d8]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_internal_error+0x3f) [0x7f01f32f91af]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x1242f4) [0x7f01f32f92f4]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(cl_funcall+0x70) [0x7f01f32dc3f0]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(cl_error+0xdb) [0x7f01f32fa13b]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x125482) [0x7f01f32fa482]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(FEwrong_type_argument+0x1e) [0x7f01f32fa4ae]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(stream_dispatch_table+0x17) [0x7f01f32ece17]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_write_char+0x1b) [0x7f01f32ed6ab]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x13766b) [0x7f01f330c66b]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(_ecl_write_symbol+0x156) [0x7f01f330cbc6]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(si_write_ugly_object+0x26) [0x7f01f330bcc6]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x1242db) [0x7f01f32f92db]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(cl_funcall+0x70) [0x7f01f32dc3f0]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(cl_error+0xdb) [0x7f01f32fa13b]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x1252d8) [0x7f01f32fa2d8]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_interpret+0x19cd) [0x7f01f32de67d]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x10e34f) [0x7f01f32e334f]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(si_eval_with_env+0x2eb) [0x7f01f32e4f0b]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(si_signal_simple_error+0x26d) [0x7f01f32a9e4d]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(FEwrong_type_nth_arg+0x109) [0x7f01f32f9cf9]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(_ecl_sethash+0) [0x7f01f3326170]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x14df28) [0x7f01f3322f28]
;;; /lib64/libpthread.so.0() [0x324440f500]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x1503e0) [0x7f01f33253e0]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x15172c) [0x7f01f332672c]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_extend_hashtable+0x185) [0x7f01f3326335]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x151716) [0x7f01f3326716]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(+0x11c4dc) [0x7f01f32f14dc]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_init_module+0x4e8) [0x7f01f32f7298]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(init_lib_LSP+0x4c9) [0x7f01f3218089]
;;; /home/vbraun/opt/taurus/sage-5.10.beta1/local/lib/libecl.so.12.12(ecl_init_module+0x399) [0x7f01f32f7149]

Changed 7 years ago by vbraun

strace of failed ecl startup (from taurus)

comment:100 Changed 7 years ago by vbraun

  • Cc nbruin added

I've attached an strace from a failed ECL startup on taurus. It seems that during dlopen of ECL an unrelated thread (id 16046) happens to finish which causes a SIGCHLD. If the ECL signal handler is not turned off during init then that would explain things going south. Nils, since you wrote it maybe you have an opinion on that?

comment:101 Changed 7 years ago by nbruin

The last time I looked at it, ECL as a library with multithread support is rather fundamentally incompatible with the way we use it. As far as I know, if thread support is enabled then ECL expects one (dedicated?) thread to handle asynchronous signals and I think that thread may even be split off when there's otherwise only one thread running. I don't see how to consolidate that with our own signal management. The solution up to know has been to only use ECL in single threaded mode, which may mean building it without thread support. I know ECL has been moving towards multithreadedness. However, as an "embeddable" lisp there's a good argument they should also keep a strictly single-threaded version as an option and I hope they're continuing with that (perhaps sending a message that we really appreciate single threadedness might help).

Our signal switching code has been a little behind the times for a while already. Some improvements were suggested here:

http://comments.gmane.org/gmane.lisp.ecl.general/8211

That code might be outdated already as well.

What is the hypothesis here, by the way? That an event triggered by sage-cleaner produces a signal meant for our own signal handler that happens to end up in with the ecl signal handler? If that's the problem I see no way around it. Then we just need to write our own signal handler to always be in control, keeping flags on whether some signals might have to be dispatched to ECL and do so for signals we deem appropriate for that. That's major surgery (I've actually been surprised we were able to avoid that for so long and I hope we can continue)

I guess normally, ECL expects SIGCHLD to be ignored. We apparently set a different mask? Changing to ECL's signal mask would include ignoring a SIGCHLD if it happens to be in control when the signal arrives. Can we afford to? Do we already try to do that? In that case is the problem that the signal arrives in a small window where we've switched handlers but not masks? Then we should be a little more careful about the switch, i.e., perhaps disable signals during the switch completely.

comment:102 follow-up: Changed 7 years ago by vbraun

For the record, we build ECL with --disable-threads so it shouldn't have any multithread support. It does link with pthreads, though perhaps boehmgc needs that.

I don't think the sage-cleaner has anything to do with the problem. It just slightly changes timings so that we happen to trigger a rather unlikely race on taurus/skynet. It certainly should not send any signals except for term/kill.

The new doctest framework seems to be the only thing that uses SIGCHLD in the Sage library.

comment:103 in reply to: ↑ 102 Changed 7 years ago by nbruin

Replying to vbraun:

For the record, we build ECL with --disable-threads so it shouldn't have any multithread support. It does link with pthreads, though perhaps boehmgc needs that.

OK, good.

I don't think the sage-cleaner has anything to do with the problem.

So perhaps this should go on a different ticket then?

The new doctest framework seems to be the only thing that uses SIGCHLD in the Sage library.

I took a quick look at unixint.d and it seems ECL does install a handler for SIGCHLD. Perhaps you're just unlucky that the SIGCHLD gets delivered at a very inopportune moment when ECL hasn't quite finished setting up its handlers? Or are we just sloppy in how we change handlers? The handler they use is defined in unixsys.d as si::wait-for-all-processes.

comment:104 Changed 7 years ago by vbraun

  • Description modified (diff)

Switching off ECL_OPT_TRAP_SIGCHLD seems to fix it (knock on wood). 50x testing linear_algebra.rst on taurus works with the new spkg (you need to recompile maxima after ecl).

Changed 7 years ago by vbraun

diff for review only: ecl-12.12.1.p2 -> ecl-12.12.1.p3

comment:105 Changed 7 years ago by jdemeyer

Volker, if this works, it's a very nice piece of debugging. Testing right away...

comment:106 Changed 7 years ago by jdemeyer

  • Status changed from needs_review to positive_review

comment:107 Changed 7 years ago by vbraun

I ran 500 iterations for linear_algebra.rst and didn't get any failures. The "random space in output" presumably has also been caused by the SIGCHLD handler.

comment:108 Changed 7 years ago by jdemeyer

  • Merged in set to sage-5.10.beta2
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.