Opened 4 years ago

Closed 3 years ago

#27213 closed defect (fixed)

OpenBLAS 0.3.5 causes hangs on Cygwin

Reported by: embray
Owned by:
Priority: blocker
Milestone: sage-8.7
Component: packages: standard
Keywords:
Cc: gh-timokau
Merged in:
Authors:
Reviewers:
Report Upstream: Fixed upstream, but not in a stable release.
Work issues:
Branch:
Commit:
Dependencies:
Stopgaps:


Description (last modified by embray)

Ever since the upgrade to v0.3.5 of OpenBLAS (#27020) it has been impossible to get through the test suite on Cygwin without all processes eventually coming to a halt. In particular this can be reproduced by running the test

$ ./sage -t src/sage/geometry/

where for me it always stops exactly after line 3900:

sage: o = lattice_polytope.cross_polytope(3)
sage: o.skeleton_show([1,2,4])

though I suspect the problem has nothing specifically to do with that code. When I run it by itself (even in a forked process) it does not hang.

It should be noted that the issue does not occur when I run that test with the --serial flag, so it is most likely yet another issue related to fork()-safety of OpenBLAS's multi-threading code (though I've yet to determine exactly what OpenBLAS functions are being run in this case).
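The general shape of this fork()-safety hazard can be sketched in pure Python. This is an illustration of the pattern, not OpenBLAS's actual code: a mutex held by a worker thread at fork() time is inherited by the child in its locked state, but the owning thread does not exist in the child, so any blocking acquire there hangs forever (assumes a Unix platform where os.fork() is available).

```python
import os
import threading
import time

lock = threading.Lock()  # stand-in for an internal OpenBLAS mutex

def worker():
    # Simulate a BLAS worker thread holding an internal lock.
    with lock:
        time.sleep(2)

threading.Thread(target=worker, daemon=True).start()
time.sleep(0.2)  # make sure the worker holds the lock before we fork

pid = os.fork()
if pid == 0:
    # Child: the lock was copied in its locked state, but its owner
    # does not exist here. A blocking acquire would hang forever;
    # a non-blocking attempt shows the lock is stuck.
    stuck = not lock.acquire(blocking=False)
    os._exit(0 if stuck else 1)

_, status = os.waitpid(pid, 0)
print("child inherited a stuck lock:", os.WEXITSTATUS(status) == 0)
# → child inherited a stuck lock: True
```

Running the doctests with --serial avoids the fork-after-threads window entirely, which is consistent with the hang disappearing in that mode.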

I initially suspected that the problem might have been introduced by PR #1875, in part because any change introducing new concurrency primitives seems like a likely source of problems. However, removing just that one change seemed to have no effect on the issue.

So now I am just bisecting between OpenBLAS 0.3.3 and 0.3.5 to see if I can narrow it down.

Upstream bug report:
Upstream fix:

Change History (16)

comment:1 Changed 4 years ago by embray

Possibly related:

Another bizarre issue I'm seeing in this module's tests is an apparent memory leak that gets introduced, deterministically, at line 3293 with

    sage: o = lattice_polytope.cross_polytope(3)
    sage: o.npoints()

(specifically, it occurs after the o.npoints()). Suddenly the process's memory usage balloons from a couple hundred MB to ~3GB and never goes back down.

It's really bizarre because there's no reason to suspect that code would cause such a ballooning in memory, and I can't reproduce the problem (even in a forked process) outside that very narrow occurrence in the test suite. I'll have to see if I can set a breakpoint there and dig a little deeper into exactly where it's occurring...

comment:2 Changed 4 years ago by embray

Here's a straightforward way to reproduce the memory blow-up. It happens immediately when the process is forked (the parent process itself blows up, and the child has the same large memory usage), but only after running the following code:

sage: P2 = ReflexivePolytope(2, 0)
sage: PM_max, permutations = P2._palp_PM_max(check=True)
sage: PM_max.automorphisms_of_rows_and_columns()

then do anything that forks the process (e.g. subprocess.Popen() with any executable).

It seems that the last line, PM_max.automorphisms_of_rows_and_columns(), causes ~3GB worth of memory to be committed. In Windows terms this means the memory is guaranteed to be reserved for that process, though the pages are not actually created in RAM until/unless accessed. The fork() necessarily reads all of these pages (even though they're clean and never used), thus forcing them to be paged in.
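The reserve-vs-touch distinction can be illustrated on Linux, where an anonymous mapping stays out of the resident set until its pages are actually written (on Cygwin/Windows, as described above, a fork() forces everything to be paged in). A minimal sketch using only the standard library; the ru_maxrss units assume Linux semantics (KiB):

```python
import mmap
import resource

def peak_rss_kib():
    # Peak resident set size; ru_maxrss is reported in KiB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

base = peak_rss_kib()
m = mmap.mmap(-1, 256 * 1024 * 1024)  # reserve 256 MiB, anonymous
reserved = peak_rss_kib()             # address space only, pages untouched

for off in range(0, len(m), 4096):    # write one byte per 4 KiB page
    m[off] = 1
touched = peak_rss_kib()

print("growth after reserving: ~%d KiB" % (reserved - base))
print("growth after touching:  ~%d KiB" % (touched - base))
```

The first delta is essentially zero while the second is roughly the full 256 MiB, which is the difference between a 3GB reservation being harmless and it becoming 3GB of paged-in memory at fork().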

In any case, this issue seems to occur regardless of which OpenBLAS is in use, so I think it's not directly related to this ticket's issue, though it's possible it exacerbates it.

comment:3 Changed 4 years ago by embray

I see. It's GAP/libgap that's making that allocation. Something to look into later...

comment:4 Changed 4 years ago by gh-timokau

  • Cc gh-timokau added

comment:5 Changed 4 years ago by embray

  • Cc gh-timokau removed

To finish off the memory issue I opened #27214. As one final point on that, I temporarily tried setting the initial memory allocation in GAP to 0. I'm not exactly clear on what that does, but it did seem to limit GAP to allocating only as much memory as it needed at a minimum (possibly at some performance cost).

This also had no effect on the original issue that this ticket was opened for. The test still hangs at exactly the same point.

comment:6 Changed 4 years ago by embray

  • Cc gh-timokau added

Oops, gotta love that early 2000s web.

comment:7 Changed 4 years ago by embray

Bizarrely, if I run the test with ./sage -t --debug it does not hang.

comment:8 Changed 4 years ago by embray

Per strace it appears to be hanging at, or near, an mmap call. I wonder if this is something similar to #23973.

  111 32811159 [main] python2 9608 mmap64: addr 0x0, len 262144, prot 0x3, flags 0x22, fd -1, off 0x0
 4245 32815404 [main] python2 9608 mmap64: 0x6F6C7F90000 = mmap()

I'll need to decode exactly what prot and flags those are.
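Decoding those values by hand (the bit values below are the usual Linux-style <sys/mman.h> constants, which as far as I know Cygwin follows) suggests a perfectly ordinary private anonymous read/write mapping:

```python
# Bit values follow the common Linux/Cygwin <sys/mman.h> conventions.
PROT_BITS = {0x1: "PROT_READ", 0x2: "PROT_WRITE", 0x4: "PROT_EXEC"}
FLAG_BITS = {0x01: "MAP_SHARED", 0x02: "MAP_PRIVATE",
             0x10: "MAP_FIXED", 0x20: "MAP_ANONYMOUS"}

def decode(bits, table):
    # Collect the name of every flag bit set in `bits`.
    names = [name for bit, name in sorted(table.items()) if bits & bit]
    return " | ".join(names)

print(decode(0x3, PROT_BITS))   # → PROT_READ | PROT_WRITE
print(decode(0x22, FLAG_BITS))  # → MAP_PRIVATE | MAP_ANONYMOUS
```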


comment:9 Changed 4 years ago by embray

Okay, 262144 = 256 KB is the default size of the "arenas" used for object pools in the Python memory allocator (something I've studied before, but not in many years). Regardless, that means there's nothing really special about this mmap either. The hang might actually be happening somewhere else just after this...
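The arithmetic checks out against pymalloc's arena size (ARENA_SIZE is 256 KiB in Python 2 and, as far as I recall, arenas are obtained with exactly such an anonymous private mmap when ARENAS_USE_MMAP is in effect):

```python
ARENA_SIZE = 256 * 1024  # pymalloc arena size in Python 2

print(ARENA_SIZE)        # → 262144, the len seen in the strace output
print(hex(ARENA_SIZE))   # → 0x40000
```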

comment:10 Changed 4 years ago by jdemeyer

So what's the relation between this ticket and #27214? Is it the same issue?

comment:11 follow-up: Changed 4 years ago by embray

Git bisect narrowed it down to this commit:

I think this may be partly coincidental, but it does suggest that the problem has something to do with the "new TLS (thread-local storage) code". I haven't looked closely at that yet: there is some code using TLS for something, though I'm not sure exactly what, and it looks like it might be a bit of a mess.

comment:12 in reply to: ↑ 11 Changed 4 years ago by embray

Replying to embray:

Git bisect narrowed it down to this commit:

I think this may be partly coincidental

Indeed, there were definitely some shifting configurations between git bisect iterations, such that I don't think everything that should have been rebuilt from change to change actually was. Normally just re-running make should have been good enough, but a few make cleans may have been in order as well. Regardless, I think something definitely went wrong between v0.3.3 and v0.3.5 due to the shuffling around that occurred while introducing the USE_TLS feature.

comment:13 Changed 4 years ago by embray

  • Report Upstream changed from N/A to Reported upstream. No feedback yet.

A more "thorough" git bisect, with a make clean between each step, actually led to this commit:

This is making more sense now. The feature that is actually suspect is the new level3 threaded implementation. This was the code I felt was suspect in the first place, albeit for reasons I'm still not clear on.

The new multithreaded implementation of the level 3 routines predates v0.3.3, but it was disabled by default: indeed, when I build v0.3.3, -DUSE_SIMPLE_THREADED_LEVEL3 is passed. This change flipped it to enabled by default, but alas the code is still buggy. See, for example, the bug fix that I originally thought was suspect.

I'll be curious to chase down what the actual bug is here, but in the meantime I'm happy to build with USE_SIMPLE_THREADED_LEVEL3=1. I will also report this upstream.

comment:14 Changed 4 years ago by embray

Well this is disconcerting. A clean build of v0.3.5 with USE_SIMPLE_THREADED_LEVEL3=1 does not appear to have fixed the problem.

comment:15 Changed 3 years ago by embray

  • Description modified (diff)
  • Report Upstream changed from Reported upstream. No feedback yet. to Fixed upstream, but not in a stable release.

comment:16 Changed 3 years ago by embray

  • Resolution set to fixed
  • Status changed from new to closed

I believe this is fixed by #27256. There's a clear and repeatable pattern: terrible hangs without that fix, which go away with it.

Note: See TracTickets for help on using tickets.