Opened 5 years ago

Closed 5 years ago

#17320 closed defect (fixed)

Memory leaks with LP Solvers are back

Reported by: pmueller Owned by:
Priority: major Milestone: sage-6.4
Component: linear programming Keywords:
Cc: SimonKing, ncohen Merged in:
Authors: Nils Bruin, Nathann Cohen Reviewers: Nils Bruin, Nathann Cohen
Report Upstream: N/A Work issues:
Branch: 5480f4e (Commits) Commit: 5480f4e85c1a0ff96bd1502c78992073ccb854e5
Dependencies: Stopgaps:

Description

Three years ago a memory leak in the LP solvers was fixed in ticket #11917. It seems that the problem is back in Sage 6.3. The code

while True:
    P = MixedIntegerLinearProgram(solver="Coin")

consumes several GB of memory within a few seconds! The same happens with solver="gurobi", but not with solver="glpk".

Simon King confirmed the leak at sage-support, including some analysis and details.

Change History (29)

comment:1 Changed 5 years ago by SimonKing

  • Cc SimonKing added

comment:2 Changed 5 years ago by tmonteil

  • Cc ncohen added

comment:3 Changed 5 years ago by nbruin

When I try to replicate Simon's memory object analysis, I do not find a significant leak:

sage: import gc
sage: from collections import Counter
sage: gc.collect()
0
sage: gc.set_debug(gc.DEBUG_STATS)
sage: pre={id(a) for a in gc.get_objects()}
sage: for i in [1..100000]:
....:         P = MixedIntegerLinearProgram(solver="Coin")
....:         del P
....:     
sage: gc.collect()
gc: collecting generation 2...
gc: objects in each generation: 117 0 89955
gc: done, 0.0464s elapsed.
0
sage: gc.collect()
gc: collecting generation 2...
gc: objects in each generation: 48 0 90013
gc: done, 0.0397s elapsed.
0
sage: post=Counter(str(type(a)) for a in gc.get_objects() if id(a) not in pre)
sage: post
Counter({"<type 'method_descriptor'>": 32, "<type 'list'>": 23, "<type 'dict'>": 13, "<type 'tuple'>": 9, "<class '_ast.Name'>": 7, "<type 'weakref'>": 5, "<class '_ast.Call'>": 4, "<type 'frame'>": 2, "<type 'builtin_function_or_method'>": 2, "<type 'listiterator'>": 1, "<type 'set'>": 1, "<type 'instancemethod'>": 1, "<class '_ast.comprehension'>": 1, "<class '_ast.Compare'>": 1, "<class '_ast.Attribute'>": 1, "<class '_ast.Interactive'>": 1, "<type 'module'>": 1})

As you can see, the garbage collector does not get triggered during the loop (indicating that reference counting is sufficient to keep the number of python objects bounded in memory during the loop). Also, after the loop there are no alarming objects in "post".

I can confirm that memory usage explodes, but I think this indicates the leak is outside of Python-managed memory.

[The problem with Simon's code might be that he doesn't update the set S when counting new objects, so objects created on the first pass may be counted again on every subsequent pass.]
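The set-diff technique Nils uses above can be packaged as a small reusable leak check. A minimal Python sketch, avoiding the pitfall just mentioned by snapshotting object ids once, before the loop (the names `leaked_types` and `make_obj` are illustrative, not part of Sage):

```python
import gc
from collections import Counter

def leaked_types(make_obj, n=1000):
    """Call make_obj() n times and count GC-tracked objects that survive.

    The id snapshot is taken *before* the loop, so only objects that are
    new relative to that snapshot get counted -- this avoids re-counting
    pre-existing objects on every pass. (id-based diffing is a heuristic:
    ids can be reused after deallocation.)
    """
    gc.collect()
    pre = {id(a) for a in gc.get_objects()}
    for _ in range(n):
        obj = make_obj()
        del obj
    gc.collect()
    return Counter(type(a).__name__
                   for a in gc.get_objects() if id(a) not in pre)

# A leak-free constructor should leave essentially nothing behind:
counts = leaked_types(lambda: [0] * 100)
```

Note that, as in the session above, this only sees objects the Python garbage collector tracks; memory allocated on the C/C++ side (the actual culprit here) is invisible to it.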

comment:4 Changed 5 years ago by nbruin

The following is fishy in coin_backend.pyx:

    def __cinit__(self, maximization = True):
...
        self.si = new OsiClpSolverInterface()
        self.model =  new CbcModel(self.si[0])
...
    def __dealloc__(self):
        del self.si

If deleting self.si is required in __dealloc__, shouldn't self.model be deleted as well? (This indicates to me that Cython has grown enough understanding of C++ to know how to translate a "del" on a C++ pointer into "delete".)

Indeed, I've tried it, and simply inserting

        del self.model

makes memory consumption constant with the given test. I haven't tested if it affects operations otherwise. Responsible ticket seems to be #12220, where new CbcModel was added. In that same file we have

    cpdef int solve(self) except -1:
...
        model = new CbcModel(si[0])
...(*)
        del self.model
        self.model = model

which seems to suggest that unceremoniously deleting a model is just fine.

Incidentally, that routine can leak too, because there are a whole bunch of "raise" statements in (*) that can lead to self.model not being replaced by model, but model itself not being deleted.

This suggests that the whole statement here should be guarded by a try/finally to ensure that "del model" is executed whenever "self.model = model" is not.

Last edited 5 years ago by nbruin (previous) (diff)

comment:5 Changed 5 years ago by nbruin

  • Branch set to u/nbruin/memory_leaks_with_lp_solvers_are_back

comment:6 Changed 5 years ago by nbruin

  • Commit set to bc20667184ca8209776d8c363df2aa3e65994287

Something like this should do the trick. Feel free to amend as needed.


New commits:

bc20667  trac #17320: fix some memory management for coin_backend

comment:7 follow-up: Changed 5 years ago by ncohen

Hello !

I do not understand why you put all this in a "try" block. Wouldn't it be easier to just set self.si and self.model to NULL at the beginning, and add "if self.model != NULL" in the dealloc ?

Nathann

comment:8 in reply to: ↑ 7 Changed 5 years ago by nbruin

Replying to ncohen:

I do not understand why you put all this in a "try" block. Wouldn't it be easier to just set self.si and self.model to NULL at the beginning, and add "if self.model != NULL" in the dealloc ?

Quite possibly, but that's not how that code was written and I wasn't planning on delving into the program logic. There's a "new ..." there and there are uncontrolled exits via explicit raises and quite possibly by other code that can generate errors. If the new model gets inserted into self.model there's no leak because the old value of self.model gets properly deleted, but all the error condition exits would lead to an allocated "model" that simply falls out of scope, i.e., a memory leak (unless cython generates an implicit try/finally to cleanly delete any local variables that point to C++ objects--it doesn't, otherwise the original code wouldn't have worked properly).

I just expressed it as a try/finally to ensure we always have a chance to see if "model" needs to be deleted. I'm sure there are other solutions.

Last edited 5 years ago by nbruin (previous) (diff)

comment:9 follow-up: Changed 5 years ago by ncohen

Hello !

It seems that only moving the "del" above was sufficient. I created a branch at public/17320 that does that, tell me what you think !

Nathann

comment:10 follow-up: Changed 5 years ago by pmueller

Dear Nathann,

ok, seems to fix the leak issue. Is there a similar easy fix for the gurobi backend?

Best wishes, Peter

comment:11 in reply to: ↑ 10 Changed 5 years ago by ncohen

Helloooooooo !

ok, seems to fix the leak issue. Is there a similar easy fix for the gurobi backend?

No idea, I don't have a gurobi license anymore. The fact that they make licenses computer-dependent complicates stuff a lot.

Nathann

comment:12 follow-up: Changed 5 years ago by pmueller

Nathann,

what do you mean? On whichever machine I use gurobi, it is a matter of a few seconds to receive the free academic license.

-- Peter

comment:13 in reply to: ↑ 12 Changed 5 years ago by ncohen

Hello !

what do you mean? On whichever machine I use gurobi, it is a matter of a few seconds to receive the free academic license.

Oh ? Well, last time I tried I had to setup a weird proxy and start the authentication software in that proxy jail. Because my lab was not recognised as 'research institution' and stuff. And right now I am in India for two months.

Well, I will give it a try tomorrow but I am sure that I will lose at least one hour on that :-P

Nathann

comment:14 in reply to: ↑ 9 Changed 5 years ago by nbruin

Replying to ncohen:

It seems that only moving the "del" above was sufficient. I created a branch at public/17320 that does that, tell me what you think !

It seems to me that only committing to the new model by assigning it to self.model at the very end is a design feature: it means that if the computation fails for any reason, the object self is unaffected and hence still in a consistent state. It's also better for future development, because as long as model is private, there is no lock required to modify it. Once it's exported to self.model, you'd need to lock self before self.model is changed. Python currently doesn't support such multiprocessing programming anyway, but having changes in global state as atomic as possible is generally a good feature, because it reduces the possibility for unexpected interactions between multiple threads (the GIL does get released every now and again ...)

With your change, the model that just experienced an error is now stored in self. It could be convenient to have it available for further analysis, so a priori it's not clear which design is more useful. It depends on the details of what the code does and how it does it. Changing the design just to avoid a try/finally is probably not a good idea, however.

Furthermore, there is still this code between model allocation and committing to self.model=model.

        model.setLogLevel(self.model.logLevel())

        # multithreading
        import multiprocessing
        model.setNumberThreads(multiprocessing.cpu_count())

        model.branchAndBound()

which looks like it's quite nontrivial and hence could raise errors by itself. You'd still have a leak then. So either guard that with a try/finally, in which case you might as well stick with the guard all the way, or move the self.model=model even higher.

comment:15 follow-up: Changed 5 years ago by ncohen

Hello,

It seems to me that only committing to the new model by assigning it to self.model at the very end is a design feature:

I wrote this file myself, and I do not think that this had the slightest importance to me at that time.

it means that if the computation fails for any reason,

When an exception is raised in that code, it is almost always because the problem is infeasible. This is not a "failure", it is the actual answer the user is looking for.

the object self is unaffected and hence still in a consistent state. It's also better for future development, because as long as model is private, there is no lock required to modify it. Once it's exported to self.model, you'd need to lock self before self.model is changed. Python currently doesn't support such multiprocessing programming anyway, but having changes in global state as atomic as possible is generally a good feature, because it reduces the possibility for unexpected interactions between multiple threads (the GIL does get released every now and again ...)

If your worry is that in a multithreaded Python program with a MixedIntegerLinearProgram instance p the function p.solve() may be called simultaneously from different processors, I believe that you can forget it. Even if that made the slightest sense, you would have many other places in the code to re-read and adapt.

With your change, the model that just experienced an error is now stored in self. It could be convenient to have it available for further analysis, so a priori it's not clear which design is more useful. It depends on the details of what the code does and how it does it.

Well, again, what you call an 'error' is just a certificate of infeasibility, and this is definitely good to have around. But the fact that all doctests pass (even in the graph/ folder) indicates that this change brings no real problem.

Furthermore, there is still this code between model allocation and committing to self.model=model.

        model.setLogLevel(self.model.logLevel())

        # multithreading
        import multiprocessing
        model.setNumberThreads(multiprocessing.cpu_count())

        model.branchAndBound()

which looks like it's quite nontrivial and hence could raise errors by itself. You'd still have a leak then. So either guard that with a try/finally, in which case you might as well stick with the guard all the way, or move the self.model=model even higher.

Please move that line higher if you care. I believe that you are being worried unnecessarily.

Nathann

comment:16 Changed 5 years ago by ncohen

Well, I will give it a try tomorrow but I am sure that I will lose at least one hour on that :-P

I just tried it, but here we have no direct connection to the internet: everything has to go through an HTTP proxy that obviously will not handle their grbgetkey utility.

I will try again tonight from another connection, but then I will not be on a university network anymore.... >_<

Nathann

comment:17 Changed 5 years ago by ncohen

  • Branch changed from u/nbruin/memory_leaks_with_lp_solvers_are_back to public/17320
  • Commit changed from bc20667184ca8209776d8c363df2aa3e65994287 to 351f3e1448f8b18251296d93056736ded8517f19
  • Status changed from new to needs_review

Hello again !

Turns out that their license thing was greatly simplified since (or that I was lucky). Either way it went smoothly after I got a real internet connection.

I did the fix for Gurobi, though for some reason I don't quite get why it works... I actually *removed* a call to GRBfreeenv and the memory leak seems to have disappeared. While debugging I noticed that GRBmodel returned an error code.

Well, here it is :-)

Nathann


New commits:

4da9aa3  trac #17320: Broken doctests
4fcbc89  trac #17320: Memory leak
520c6f1  trac #17320: Merged with 6.5.beta0
351f3e1  trac #17320: Same fix for Gurobi

comment:18 Changed 5 years ago by git

  • Commit changed from 351f3e1448f8b18251296d93056736ded8517f19 to 5480f4e85c1a0ff96bd1502c78992073ccb854e5

Branch pushed to git repo; I updated commit sha1. New commits:

5480f4e  trac 17320: tighten window for memory leak in CoinBackend.solve and ensure that function declared "int" actually has a return statement

comment:19 in reply to: ↑ 15 Changed 5 years ago by nbruin

Replying to ncohen:

Please move that line higher if you care. I believe that you are being worried unnecessarily.

OK, done. If it's possible to avoid memory leaks at virtually no cost then we obviously should, particularly because Sage library code is often used as a template for new contributions.

Also, ensured that CoinBackend.solve has a return statement: in C, falling out of a function with a defined return type without executing an explicit return causes undefined behaviour, which could include returning -1 by accident (the one value that indicates erroneous behaviour). It seems Cython does set the return value to 0 in the absence of a return statement, but I wasn't able to find that stated explicitly in Cython's documentation. Being explicit about it is safer. (Certainly, if you say in the function signature that returning -1 means an exception has occurred, then it is instructive to explicitly *not* do that when no exception has occurred.)
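The `except -1` convention being discussed can be illustrated in plain Python (not Cython; the names `solve_ok`, `solve_bad` and `call_c_style` are made up for this sketch):

```python
# The generated C caller treats a return value of -1 as "a Python
# exception is pending", so the success path must explicitly return
# something else -- in C, falling off the end of a non-void function
# leaves the return value indeterminate, which could be -1 by accident.

def solve_ok():
    # ... do the work ...
    return 0   # explicit success value, never left implicit

def solve_bad():
    return -1  # error sentinel: tells the caller an exception occurred

def call_c_style(fn):
    """Caller-side check, mirroring what Cython-generated C code does."""
    if fn() == -1:
        raise RuntimeError("backend signalled an error")
    return "solved"

call_c_style(solve_ok)   # succeeds: explicit 0 is not mistaken for an error
```

A `call_c_style(solve_bad)` call would raise, which is exactly why an accidental -1 on the success path would be misreported as an exception.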

comment:20 Changed 5 years ago by ncohen

Hellooooooo !!

OK, done. If it's possible to avoid memory leaks at virtually no cost then we obviously should, particularly because Sage library code is often used as a template for new contributions.

Okayokay. Well, there is obviously nothing wrong with what you did !

Right now I am on a very bad internet connection. It goes through a weird proxy and I can't use git nor ssh and I get disconnected from trac so I have to re-send every comment 10 times. I will be able to test and review your commit tonight. Do you have anything against the previous changes ? If not I will set this ticket to positive_review later when I will be able to test your commit.

Thanks,

Nathann

comment:21 follow-up: Changed 5 years ago by ncohen

  • Authors set to Nils Bruin, Nathann Cohen
  • Reviewers set to Nils Bruin, Nathann Cohen

Hello ! I was finally able to test your latest commit and Sage reported nothing wrong. You can set this ticket to positive_review if you agree with my commits.

Thanks,

Nathann

comment:22 in reply to: ↑ 21 ; follow-up: Changed 5 years ago by nbruin

Replying to ncohen:

You can set this ticket to positive_review if you agree with my commits.

No objections here. However, someone who uses gurobi should probably check & sign off on that change: especially because it's optional, the buildbot won't have anything useful to say about it either.

comment:23 in reply to: ↑ 22 ; follow-up: Changed 5 years ago by ncohen

No objections here. However, someone who uses gurobi should probably check & sign off on that change: especially because it's optional, the buildbot won't have anything useful to say about it either.

Okay... Perhaps Peter then, since he raised the problem ? :-P

Nathann

comment:24 in reply to: ↑ 23 ; follow-up: Changed 5 years ago by pmueller

Replying to ncohen:

Okay... Perhaps Peter then, since he raised the problem ? :-P

Well, I tried the patch in version 6.4.rc2. My original program now runs fine. There, however, all MIPs are feasible. So I tried a silly test code where every other MIP is infeasible. Again, memory usage is constant.

So the Coin and gurobi backends now seem to work fine.

(By the way, MIPSolverException cannot be used in a try/except environment. Not sure if this is a bug. Anyway, this was the same in the unpatched version.)

comment:25 Changed 5 years ago by pmueller

  • Status changed from needs_review to positive_review

comment:26 in reply to: ↑ 24 ; follow-up: Changed 5 years ago by SimonKing

Replying to pmueller:

(By the way, MIPSolverException cannot be used in a try/except environment. Not sure if this is a bug. Anyway, this was the same in the unpatched version.)

Why not? Did you perhaps not import the error first?

sage: from sage.numerical.mip import MIPSolverException
sage: try:
....:     raise MIPSolverException("Problem")
....: except MIPSolverException:
....:     print "problem caught"
....: 
problem caught

comment:27 in reply to: ↑ 26 ; follow-up: Changed 5 years ago by pmueller

Replying to SimonKing:

Replying to pmueller:

(By the way, MIPSolverException cannot be used in a try/except environment. Not sure if this is a bug. Anyway, this was the same in the unpatched version.)

Why not? Did you perhaps not import the error first?

Indeed, I didn't know that one has to import this error. Hmm, still doesn't look user-friendly to me that the class MixedIntegerLinearProgram is known to Sage without an import, while the exception belonging to it is not. -- Peter Mueller

Last edited 5 years ago by pmueller (previous) (diff)

comment:28 in reply to: ↑ 27 Changed 5 years ago by SimonKing

Replying to pmueller:

Why not? Did you perhaps not import the error first?

Indeed, I didn't know that one has to import this error. Hmm, still doesn't look user-friendly to me that the class MixedIntegerLinearProgram is known to Sage without an import, while the exception belonging to it is not. -- Peter Mueller

I'd say that having MixedIntegerLinearProgram in the global name space is not good, since most Sage users wouldn't use it. So, it is friendly to *some* users that there is no need to import it before using it.

And MIPSolverException is a technicality that is only relevant to developers or expert users, and thus clearly shouldn't pollute the global name space.

Last edited 5 years ago by SimonKing (previous) (diff)

comment:29 Changed 5 years ago by vbraun

  • Branch changed from public/17320 to 5480f4e85c1a0ff96bd1502c78992073ccb854e5
  • Resolution set to fixed
  • Status changed from positive_review to closed