Opened 3 years ago
Closed 3 years ago
#27681 closed defect (fixed)
GAP: Apparent problem with workspace initialization
Reported by:  embray  Owned by:  

Priority:  critical  Milestone:  sage8.8 
Component:  packages: standard  Keywords:  gap libgap 
Cc:  jdemeyer  Merged in:  
Authors:  Erik Bray  Reviewers:  Dima Pasechnik 
Report Upstream:  N/A  Work issues:  
Branch:  ecacac8 (Commits, GitHub, GitLab)  Commit:  
Dependencies:  #27680  Stopgaps: 
Description (last modified by )
I don't fully understand the problem yet but there seems to be an issue in 8.8.beta2 (if not earlier) related to GAP workspaces.
The only way I've been able to reproduce it reliably is by running
$ rm rf ~/.sage/gap/*
and then
./sage t long src/sage/combinat/tableau.py
at some point during this test (possibly others but this is the only where I've reliably reproduced the issue) things start to go horribly wrong:
sage t long src/sage/combinat/tableau.py ********************************************************************** File "src/sage/combinat/tableau.py", line 7751, in sage.combinat.tableau.StandardTableaux_size.cardinality Failed example: StandardTableaux(50).cardinality() # long time Exception raised: Traceback (most recent call last): File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 671, in _run self.compile_and_execute(example, compiler, test.globs) File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 1095, in compile_and_execute exec(compiled, globs) File "<doctest sage.combinat.tableau.StandardTableaux_size.cardinality[4]>", line 1, in <module> StandardTableaux(Integer(50)).cardinality() # long time File "sage/misc/lazy_import.pyx", line 354, in sage.misc.lazy_import.LazyImport.__call__ (build/cythonized/sage/misc/lazy_import.c:3691) return self.get_object()(*args, **kwds) File "sage/misc/classcall_metaclass.pyx", line 330, in sage.misc.classcall_metaclass.ClasscallMetaclass.__call__ (build/cythonized/sage/misc/classcall_metaclass.c:1735) return cls.classcall(cls, *args, **kwds) File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/combinat/tableau.py", line 7584, in __classcall_private__ return StandardTableaux_size(n) File "sage/misc/classcall_metaclass.pyx", line 330, in sage.misc.classcall_metaclass.ClasscallMetaclass.__call__ (build/cythonized/sage/misc/classcall_metaclass.c:1735) return cls.classcall(cls, *args, **kwds) File "sage/misc/cachefunc.pyx", line 1003, in sage.misc.cachefunc.CachedFunction.__call__ (build/cythonized/sage/misc/cachefunc.c:6011) w = self.f(*args, **kwds) File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/structure/unique_representation.py", line 1027, in __classcall__ instance = typecall(cls, *args, **options) File "sage/misc/classcall_metaclass.pyx", line 497, in sage.misc.classcall_metaclass.typecall (build/cythonized/sage/misc/classcall_metaclass.c:2185) return (<PyTypeObject*>type).tp_call(cls, args, kwds) File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/combinat/tableau.py", line 7693, in __init__ facade=True, keepkey=False) File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/sets/disjoint_union_enumerated_sets.py", line 288, in __init__ self._facade_for = tuple(family) File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/sets/family.py", line 1078, in __getitem__ return self.function(i) File "sage/misc/classcall_metaclass.pyx", line 330, in sage.misc.classcall_metaclass.ClasscallMetaclass.__call__ (build/cythonized/sage/misc/classcall_metaclass.c:1735) return cls.classcall(cls, *args, **kwds) File "sage/misc/cachefunc.pyx", line 1003, in sage.misc.cachefunc.CachedFunction.__call__ (build/cythonized/sage/misc/cachefunc.c:6011) w = self.f(*args, **kwds) File "sage/structure/parent.pyx", line 303, in sage.structure.parent.Parent.__init__ (build/cythonized/sage/structure/parent.c:4756) for cls in self.__class__.mro(): MemoryError ********************************************************************** File "src/sage/combinat/tableau.py", line 8572, in sage.combinat.tableau.IncreasingTableaux.__init__ Failed example: TestSuite(S).run() # long time Exception raised: Traceback (most recent call last): File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 671, in _run self.compile_and_execute(example, compiler, test.globs) File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 1108, in compile_and_execute self.update_digests(example) File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 999, in update_digests digest.update(str_to_bytes(reduce_hex(gen), 'ascii')) File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/doctest/parsing.py", line 442, in reduce_hex from operator import xor MemoryError Process DocTestWorker1: Traceback (most recent call last): File "/home/embray/src/sagemath/sage/local/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap self.run() File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 2139, in run task(self.options, self.outtmpfile, msgpipe, self.result_queue) File "/home/embray/src/sagemath/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 2498, in __call__ result_queue.put(result, False) File "/home/embray/src/sagemath/sage/local/lib/python2.7/multiprocessing/queues.py", line 107, in put self._start_thread() File "/home/embray/src/sagemath/sage/local/lib/python2.7/multiprocessing/queues.py", line 195, in _start_thread self._thread.start() File "/home/embray/src/sagemath/sage/local/lib/python2.7/threading.py", line 736, in start _start_new_thread(self.__bootstrap, ()) error: can't start new thread Bad exit: 1 **********************************************************************
ISTM like maybe some memory corruption occurs as a result of trying to load a corrupt or partial GAP workspace. It's really bad, whatever it is.
I should add, if I subsequently run the test it passes. But then I can reproduce the problem again by deleting workspaces and rererunning the test.
Change History (41)
comment:1 Changed 3 years ago by
 Description modified (diff)
comment:2 Changed 3 years ago by
comment:3 Changed 3 years ago by
 Branch set to u/embray/ticket27681
 Commit set to 0b6d94ee889e35b167b616fc4b9488a29cedc6d1
 Dependencies set to #27680
 Status changed from new to needs_review
This seems to fit it, with the aid of #27680.
Ironically, it was while testing #27680 that I discovered this issue, and at first I thought it was caused somehow by my changes in that ticket. But I can confirm that it's not related at all.
The problem stems (for reasons I still do not understand and are disturbing) from the fact that libgap was actually being used in the process of initializing a new workspace by pexpect GAP. Somehow this was causing problems; perhaps related to package loading.
It's still disturbing that this could cause such a problem, and I can't seem to reproduce the issue in other examples of using both GAP interfaces simultaneously. It must come down to some subtle resource contention issue within GAP itself, and we can make some progress toward avoiding it by taking more care not to use both GAPs simultaneously during workspace caching.
New commits:
de4332b  Change libgap initialization to work lazily. Don't initialize() in Gap.__init__ but only just before libgap actually needs to be initialized.

e687ee6  This test no longer makes sense in light of lazy initialization of libgap

0b6d94e  Trac #27681: Update all_installed_packages to allow passing a specific GAP

comment:4 Changed 3 years ago by
 Status changed from needs_review to needs_work
The attached fix worked around the issue superficially for one use case, but it still seems to be a problem after all as I am still able to reproduce it in other cases even with this fix.
comment:5 Changed 3 years ago by
I am not able to reproduce the problem in the ticket description. Is it platformspecific?
comment:6 Changed 3 years ago by
OK, I see it on Fedora 26 now (but not on Gentoo or Debian!).
I don't even need to remove the workspaces. During the run GAP (but not libgap) workspace is changed (well, it's creation time is changed). So this seems to be unrelated to libgap, it's a GAP interface problem.
comment:7 Changed 3 years ago by
sage/combinat/tableau.py # Bad exit: 1
is mentioned here
https://groups.google.com/d/msg/sagedevel/UJW7bDFdrhQ/EoC_WdOOFgAJ
already in 8.7beta  I think I saw this earlier. I'm sure it's a doctestrelated race condition  I have no idea why GAP workspace is updated midtest, this alone is fishy.
comment:8 Changed 3 years ago by
 Cc jdmeyer added
comment:9 Changed 3 years ago by
 Cc jdemeyer added; jdmeyer removed
comment:10 Changed 3 years ago by
I'm not sure it has nothing to do with libgap. In 0b6d94ee I fixed a problem where libgap was being used during pexpectgap initialization in the all_installed_packages
function.
This seemed to alleviate the problem somewhat, but it seems to be more of a bandaid.
What's really weird is that the way the problem manifests itself is in the form of some pretty severe memory corruption affecting just random things in Sage. And I would think that only the libgap interface would be capable of breaking things that badly.
I haven't looked at the issue in a few weeks though so I don't have any good ideas at the moment. I need to start looking at it again...
comment:11 Changed 3 years ago by
As usual with race conditions, it's extremely fluid, changes somewhere unrelated might "fix" it. First of all, there is no reason for GAP workspace to get updated during this test, yet is updated (not only on systems where these errors are seen, it seems it happens everywhere).
comment:12 Changed 3 years ago by
After rebasing this branch on 8.8.beta4 I can't reproduce the issue anymore, at least not in the same way. I'll keep trying things though.
comment:13 Changed 3 years ago by
 Commit changed from 0b6d94ee889e35b167b616fc4b9488a29cedc6d1 to ecacac8b9440cb8ac1950a3d0619a50cbb19ede6
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
19d5733  Change libgap initialization to work lazily. Don't initialize() in Gap.__init__ but only just before libgap actually needs to be initialized.

1312024  This test no longer makes sense in light of lazy initialization of libgap

ecacac8  Trac #27681: Update all_installed_packages to allow passing a specific GAP

comment:14 Changed 3 years ago by
 Status changed from needs_work to needs_review
Rebased the branch on 8.8.beta4. Did a full make ptestlong
and it passed, no problem. This is on my Ubuntu machine; which IIRC is where I originally noted the problem.
Can you try this updated version of the branch and see if it's fixed on Fedora as well? Anyways, try as I might and I can't get this to come back, at least not in the way I originally reproduced it.
I'll try again on Cygwin too...
comment:15 Changed 3 years ago by
The error is still there, slightly different:
$ ./sage t long src/sage/combinat/tableau.py Running doctests with ID 20190506164327d70e975c. Git branch: develop Using optional=build,dochtml,memlimit,python2,sage Doctesting 1 file. sage t long warnlong 54.8 src/sage/combinat/tableau.py Process DocTestWorker1: Traceback (most recent call last): File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap self.run() File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 2139, in run task(self.options, self.outtmpfile, msgpipe, self.result_queue) File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/sitepackages/sage/doctest/forker.py", line 2498, in __call__ result_queue.put(result, False) File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/multiprocessing/queues.py", line 107, in put self._start_thread() File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/multiprocessing/queues.py", line 195, in _start_thread self._thread.start() File "/home/scratch2/dimpase/sage/sage/local/lib/python2.7/threading.py", line 736, in start _start_new_thread(self.__bootstrap, ()) error: can't start new thread Bad exit: 1 ********************************************************************** Tests run before process (pid=23005) failed: sage: t = Tableau([[1,2,3],[4,5]]); t ## line 144 ## ... sage: IT = IncreasingTableaux(4, (1,0,1,1)) ## line 9389 ## sage: IT[0].parent() is IT ## line 9390 ## True sage: sig_on_count() # check sig_on/off pairings (virtual doctest) ## line 9392 ## 0 sage: IT = IncreasingTableaux(4, (1,0,1,1)) ## line 9402 ## sage: all(it in IT for it in IT) ## line 9403 ## True sage: all(it in IT for it in IncreasingTableaux([2,2], (1,0,1,1))) ## line 9405 ## True sage: sig_on_count() # check sig_on/off pairings (virtual doctest) ## line 9407 ## 0 **********************************************************************  sage t long warnlong 54.8 src/sage/combinat/tableau.py # Bad exit: 1
GAP workspace still changes, as before.
comment:16 Changed 3 years ago by
Note that if I wipe out ~/.sage/ then after that the 1st run the test passes, but on the 2nd, etc. runs it does not.
comment:17 Changed 3 years ago by
Right, that's the behavior I was getting before too, but now I'm not getting it anymore, even on the exact same system as before :(
comment:18 Changed 3 years ago by
How does one check the caller stack while running tests? I would like to see where gap_reset_workspace()
is triggered during this test run, but the usual inspect
thing does not print anything:

src/sage/interfaces/gap.py
a b def gap_reset_workspace(max_workspace_size=None, verbose=False): 1575 1575 sage: sage.interfaces.gap.WORKSPACE = ORIGINAL_WORKSPACE 1576 1576 """ 1577 1577 # Create new workspace with filename WORKSPACE 1578 import inspect 1579 curframe = inspect.currentframe() 1580 calframe = inspect.getouterframes(curframe, 2) 1581 print('caller name:', calframe[1][3]) 1582 1578 1583 g = Gap(use_workspace_cache=False, max_workspace_size=None) 1579 1584 g.eval('SetUserPreference("HistoryMaxLines", 30)') 1580 1585 from sage.tests.gap_packages import all_installed_packages
yes the GAP workspace is changed during the test run!
comment:19 Changed 3 years ago by
more precisely, the workspace stays the same (seems to, according to md5), but here is what I see
$ stat .sage/gap/gapworkspace0x2bfc9c9be8a26a98 File: .sage/gap/gapworkspace0x2bfc9c9be8a26a98 Size: 143113403 Blocks: 279528 IO Block: 4096 regular file Device: 10304h/66308d Inode: 3328612 Links: 1 Access: (0644/rwrr) Uid: ( 1000/ dima) Gid: ( 1000/ dima) Access: 20190507 11:05:02.276613187 +0100 Modify: 20190507 11:05:02.276613187 +0100 Change: 20190507 11:05:02.276613187 +0100 Birth: 20190506 12:50:31.880862790 +0100
comment:20 followup: ↓ 21 Changed 3 years ago by
One way would be to run the test with serial
.
comment:21 in reply to: ↑ 20 Changed 3 years ago by
Replying to embray:
One way would be to run the test with
serial
.
I don't see how serial
matters  unless it can be used to run a script after each test...
By the way, on the Fedora system I see these errors, running with serial
allows the tests to pass. (The GAP workspace is still being changed).
comment:22 Changed 3 years ago by
You can put import pdb; pdb.set_trace()
in a function, and then run the tests with serial
and break where the function is first called. This is how I found that it was originally being called by all_installed_packages
.
comment:23 Changed 3 years ago by
OK, I was able to find where the workspace is accessed (more precisely, os.utime()
is called on it, and commented it out, and still get the same crashes.
So it is a race condition, as serial
allows the tests to pass, but it appears that it has nothing to do with GAP or libgap workspaces.
comment:24 Changed 3 years ago by
But what would be racing? If you run a single test module then there's still only one test running concurrently, so there's nothing for it to race against.
comment:25 Changed 3 years ago by
whatever serial
is doing, it removes some parallelism somewhere, I gather.
comment:26 Changed 3 years ago by
That's exactly what it's doing. Normally the doctests for each module are run in a separate process. serial
makes it run in the same process that started the test runner.
IIRC I was not able to reproduce the problem with serial
but now I can't remember for sure. My point is that it's just an easier way to run the test interactively and stop at a breakpoint.
comment:27 followup: ↓ 28 Changed 3 years ago by
 Reviewers set to Dima Pasechnik
 Status changed from needs_review to positive_review
If there is a problem here, it's with doctest framework. Otherwise, looks good.
comment:28 in reply to: ↑ 27 Changed 3 years ago by
Replying to dimpase:
If there is a problem here, it's with doctest framework. Otherwise, looks good.
I'm not sure how you concluded that, but it is a possibility I suppose. I'd be more concerned that it's a problem with forking, which does happen when running the doctests, but doesn't normally happen when running interactively. So it can make things difficult to reproduce. Nevertheless, I can't reliably reproduce the issue anymore either, and the one thing I did fix here seems to help a bit...
comment:29 Changed 3 years ago by
 Branch changed from u/embray/ticket27681 to ecacac8b9440cb8ac1950a3d0619a50cbb19ede6
 Resolution set to fixed
 Status changed from positive_review to closed
comment:30 Changed 3 years ago by
 Commit ecacac8b9440cb8ac1950a3d0619a50cbb19ede6 deleted
I have just noticed this ticket. And it gave me a little bit of fun chasing it (I don't know how Volker merged it, it doesn't show where I expect).
So I got that interesting failure in sageongentoo
File "/usr/lib64/python2.7/sitepackages/sage/tests/gap_packages.py", line 137, in sage.tests.gap_packages.all_installed_packages Failed example: all_installed_packages() == all_installed_packages(gap=gap) Expected: True Got: False
After some inspection
sage: from sage.tests.gap_packages import all_installed_packages sage: all_installed_packages() ('AtlasRep', 'AutoDoc', 'FactInt', 'GAPDoc', 'Hap', 'MapClass', 'SmallGrp', 'Toric', 'aclib', 'alnuth', 'atlasrep', 'autpgrp', 'cohomolo', 'corelg', 'crime', 'crisp', 'crypting', 'cryst', 'crystcat', 'ctbllib', 'design', 'digraphs', 'fga', 'gbnp', 'genss', 'grape', 'guava', 'hecke', 'io', 'irredsol', 'laguna', 'liealgdb', 'liepring', 'liering', 'loops', 'nq', 'orb', 'polenta', 'polycyclic', 'primgrp', 'qpa', 'quagroup', 'radiroot', 'resclasses', 'semigroups', 'sla', 'sonata', 'sophus', 'tomlib', 'transgrp', 'utils') sage: all_installed_packages(gap=gap) ('AutoDoc', 'FactInt', 'GAPDoc', 'Hap', 'MapClass', 'SmallGrp', 'Toric', 'aclib', 'alnuth', 'atlasrep', 'autpgrp', 'cohomolo', 'corelg', 'crime', 'crisp', 'crypting', 'cryst', 'crystcat', 'ctbllib', 'design', 'digraphs', 'fga', 'gbnp', 'genss', 'grape', 'guava', 'hecke', 'io', 'irredsol', 'laguna', 'liealgdb', 'liepring', 'liering', 'loops', 'nq', 'orb', 'polenta', 'polycyclic', 'primgrp', 'qpa', 'quagroup', 'radiroot', 'resclasses', 'semigroups', 'sla', 'sonata', 'sophus', 'tomlib', 'transgrp', 'utils')
The leading AtlasRep
is in fact something I had forgotten in my ~/.gap
and that will soon be cleaned. But it points to the fact that the way we start gap
's pexpect interface makes it ignore .gap
. You could say we start gap
in a mode that also happens to be a "safe mode" (all your personal settings ignored) and that really should be a different option upstream.
In the meantime, should we amend the test so that ignore_dot_gap=True
in both case in a follow up ticket.
comment:31 Changed 3 years ago by
Just to document so I can point to it next time. In gap4.10.1 you have the file src/system.c
which controls many OS dependent things including how gap starts. Currently line 1297 onwards
#ifdef HAVE_DOTGAPRC /* the users home directory */ if ( getenv("HOME") != 0 ) { strxcpy(DotGapPath, getenv("HOME"), sizeof(DotGapPath)); # if defined(SYS_IS_DARWIN) && SYS_IS_DARWIN /* On Darwin, add .gap to the sys roots, but leave */ /* DotGapPath at $HOME/Library/Preferences/GAP */ strxcat(DotGapPath, "/.gap;", sizeof(DotGapPath)); if (!IgnoreGapRC) { SySetGapRootPath(DotGapPath); } strxcpy(DotGapPath, getenv("HOME"), sizeof(DotGapPath)); strxcat(DotGapPath, "/Library/Preferences/GAP;", sizeof(DotGapPath)); # elif defined(__CYGWIN__) strxcat(DotGapPath, "/_gap;", sizeof(DotGapPath)); # else strxcat(DotGapPath, "/.gap;", sizeof(DotGapPath)); # endif if (!IgnoreGapRC) { SySetGapRootPath(DotGapPath); }
So if IgnoreGapRC
is "false" you get .gap
. Where is that variable coming from? It is documented in the same file at line 150
/**************************************************************************** ** *V IgnoreGapRC . . . . . . . . . . . . . . . . . . . r option for kernel ** */ Int IgnoreGapRC;
and finally more interestingly we have this at line 1077
{ 'r', "", toggle, &IgnoreGapRC, 0 }, /* kernel */
and in sage we have in sage/interfaces/gap.py
at line 205
gap_cmd = "gap r"
At this stage I don't know if that option is a remnant of gap4.8 and lower or if it is really needed.
comment:32 Changed 3 years ago by
Follow up at #27878
comment:33 followup: ↓ 35 Changed 3 years ago by
Using 8.8.rc2 with Python 2, I still consistently get the same error as in comment:15. This is on CentOS Linux release 7.6.1810 (Core). Using serial
makes the test pass.
comment:34 Changed 3 years ago by
I was able to reproduce this again in #18267 specifically.
comment:35 in reply to: ↑ 33 Changed 3 years ago by
Replying to ghmwageringel:
Using 8.8.rc2 with Python 2, I still consistently get the same error as in comment:15. This is on CentOS Linux release 7.6.1810 (Core). Using
serial
makes the test pass.
Thanks for the info; that's interesting (and strange). I'm still struggling to find a way to reliably reproduce this, though it's good to know that serial
made it pass for you. It makes me wonder if there is some subtle bug with libgap postfork.
comment:36 Changed 3 years ago by
I've opened #28106 to track this issue further.
comment:37 Changed 3 years ago by
 Branch changed from ecacac8b9440cb8ac1950a3d0619a50cbb19ede6 to u/embray/ticket27681
 Commit set to ecacac8b9440cb8ac1950a3d0619a50cbb19ede6
 Dependencies changed from #27680 to #27680, #28106
 Resolution fixed deleted
 Status changed from closed to new
Reopening as it breaks the buildbot
comment:38 Changed 3 years ago by
 Status changed from new to needs_review
comment:39 Changed 3 years ago by
 Status changed from needs_review to positive_review
Setting this back to positive review assuming that it'll be fixed in #28106
comment:40 Changed 3 years ago by
 Dependencies changed from #27680, #28106 to 27680, 28106
comment:41 Changed 3 years ago by
 Branch changed from u/embray/ticket27681 to ecacac8b9440cb8ac1950a3d0619a50cbb19ede6
 Commit ecacac8b9440cb8ac1950a3d0619a50cbb19ede6 deleted
 Dependencies changed from 27680, 28106 to #27680
 Resolution set to fixed
 Status changed from positive_review to closed
Wrong ticket, sorry for the noise
Some additional observations:
rm rf ~/.sage/gap/gapworkspace*
; i.e. deleting just the pexpect gap workspace and not the libgap workspace.rm rf ~/.sage/gap/libgapworkspace*
the test passes every time, but no new libgap workspace is created (as if the test does not use libgap at all; I haven't confirmed this but it seems to be the case).gapworkspace
and alibgapworkspace
are created.