Opened 7 years ago

Last modified 6 months ago

#13447 needs_work defect

Make libsingular multivariate polynomial rings collectable

Reported by: nbruin Owned by:
Priority: major Milestone: sage-8.9
Component: memleak Keywords:
Cc: SimonKing, malb, vbraun, gagern, robertwb, ylchapuy, jpflori, burcin Merged in:
Authors: Simon King Reviewers: Travis Scrimshaw
Report Upstream: N/A Work issues:
Branch: u/SimonKing/make_libsingular_multivariate_polynomial_rings_collectable (Commits) Commit: c2b0bf61272c2062b2a4e47232f699fe306d1b21
Dependencies: #11521 Stopgaps:

Description (last modified by SimonKing)

Presently, #715 + #11521 help to avoid keeping parents in memory permanently. In the process we uncovered a hard-to-debug but consistently triggerable problem with the collection of MPolynomialRing_libsingular. We have only observed the problem on bsd.math.washington.edu, MacOSX 10.6 on x86_64.

The present work-around is to permanently store references to these rings upon creation, thus preventing their collection. It would be nice if we could properly solve the problem (or at least establish that the problem is specific to bsd.math).
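
The work-around amounts to something like the following (an illustrative sketch only, not the actual code in #715/#11521):

    # keep every such ring alive forever by stashing a strong reference
    # at creation time; this trades a memory leak for safety
    _permanent_ring_references = []

    def _remember(R):
        _permanent_ring_references.append(R)
        return R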

Attachments (7)

trac_13447-double_refcount.patch (1.8 KB) - added by nbruin 7 years ago.
take into account both refcount_dict and ring*.ref fields on deletion.
trac_13447-consolidated_refcount.patch (1.5 KB) - added by nbruin 7 years ago.
Consolidate two refcount systems (cruft not yet removed from patch)
trac_13447-modulus_fix.patch (771 bytes) - added by nbruin 7 years ago.
return NULL instead of raising AttributeError?
trac_13447-rely_on_singular_refcount.patch (17.1 KB) - added by SimonKing 7 years ago.
Use Singular's refcounter for refcounting
trac_13447-refcount_cleanup.patch (4.6 KB) - added by nbruin 7 years ago.
clean up final bits in refcounting of singular rings
trac_13447-sanitise_ring_refcount.patch (28.1 KB) - added by SimonKing 7 years ago.
Combined patch
trac_13447-attempted_improvement.patch (4.1 KB) - added by SimonKing 7 years ago.
Experiments towards fixing some problems

Download all attachments as: .zip

Change History (168)

comment:1 Changed 7 years ago by nbruin

On 5.4-beta0 + #715 + #11521, there is a doctest failure on bsd.math.washington.edu, an x86_64 machine running MacOSX 10.6:

bash-3.2$ ../../sage -t sage/misc/cachefunc.pyx
sage -t  "devel/sage-main/sage/misc/cachefunc.pyx"          
The doctested process was killed by signal 11
         [12.7 s]
 
----------------------------------------------------------------------
The following tests failed:


        sage -t  "devel/sage-main/sage/misc/cachefunc.pyx" # Killed/crashed
Total time for all tests: 12.7 seconds

The segmentation fault happens reliably, but is hard to study because

  • running the same examples in an interactive session does not trigger the problem
  • running with sage -t -gdb (yes, that's possible!) or sage -t --verbose does not trigger the problem
  • the order of tests in the file seems fairly important: some tests can be changed or deleted without losing the crash, but not others

The likely explanation is that a garbage collection has to be triggered under the right conditions, so that the memory corruption (which likely happens during a deallocation somewhere) happens in the right spot.
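
A speculative aid for making such GC-dependent crashes more deterministic (not something used on this ticket) is to make the cyclic collector run as often as possible while paring down the test file:

    import gc
    gc.set_threshold(1, 1, 1)     # run a collection after almost every allocation
    gc.set_debug(gc.DEBUG_STATS)  # log every collection to stderr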

The segfault happens in the doctest for CachedMethodCaller._instance_call (line 1038 in the sage source; example_27 in the file ~/.sage/tmp/cachefunc_*.py left after doctesting), in the line

            sage: P.<a,b,c,d> = QQ[]

Further instrumentation showed that the segfault happens in sage/libs/singular/ring.pyx, in singular_ring_new, in the part that copies the strings over.

+    sys.stderr.write("before _names allocation\n")
     _names = <char**>omAlloc0(sizeof(char*)*(len(names)))
+    sys.stderr.write("after _names allocation\n")

     for i from 0 <= i < n:
         _name = names[i]
+        sys.stderr.write("calling omStrDup for i=%s with name=%s\n"%(i,names[i]))
         _names[i] = omStrDup(_name)
+        sys.stderr.write("after omStrDup\n")

The call to omStrDup segfaults for i=1. Unwinding the omStrDup call:

     for i from 0 <= i < n:
         _name = names[i]
+        sys.stderr.write("calling omStrDup for i=%s with name=%s\n"%(i,names[i]))
-        _names[i] = omStrDup(_name)
+        j = 0
+        while <bint> _name[j]:
+            j+=1
+        j+=1     #increment to include the 0
+        sys.stderr.write("string length (including 0) seems to be %s\n"%j)
+        copiedname =  <char*>omAlloc(sizeof(char)*(j+perturb))
+        sys.stderr.write("Done reserving memory buffer; got address %x\n"%(<long>copiedname))
+        for 0 <= offset < j:
+            sys.stderr.write("copying character nr %s\n"%offset)
+            copiedname[offset] = _name[offset]
+        _names[i] = copiedname
+        sys.stderr.write("after omStrDup\n")

shows that it is actually the omAlloc call segfaulting. For perturb=7 or higher, the segfault does not happen; for lower values of perturb it does. Given that the omAlloc addresses returned on earlier calls do not seem close to a page boundary, the only way omAlloc can fail is basically a corrupted freelist in an 8-byte bin. Likely culprits:

  • a double free (although I'd expect that would trigger problems on more architectures)
  • someone writing out-of-bounds in omAlloc-managed memory.

Perhaps someone claiming an <int>,<int> structure and storing a <void *> in the second one?
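
Purely as an illustration of that guess (hypothetical code, not anything from Singular): on x86_64 a pointer is 8 bytes wide, so writing it through a field laid out as a 32-bit int spills into the 4 bytes that follow, which in a small omalloc bin may be exactly where a neighbouring block's freelist data lives.

    # hypothetical illustration of the suspected corruption pattern
    cdef struct two_ints:
        int a
        int b

    cdef void clobber(two_ints *s, void *p):
        # s.b is 4 bytes, p is 8 bytes: the extra 4 bytes overwrite
        # whatever the allocator placed directly behind *s
        (<void **> &s.b)[0] = p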

Note the <char*> to <long> cast in the print statement. With an <int>, the compiler complains about loss of precision, but not with <long>. I haven't checked whether <long> is really 64 bits on this machine, though. I have tried and the problem seems to persist with the old singular (5.4b0 has a recently upgraded singular).

It would help a lot if someone could build singular to use plain malloc throughout and then use valgrind or a similar tool, which should be able to immediately catch a double free or out-of-bounds error. If the root of the problem is not OSX-specific, this would even show up on other architectures.

See also [http://trac.sagemath.org/sage_trac/ticket/715#comment:295 #715,comment 295] and below for some more details on how the diagnosis above was obtained.

comment:2 Changed 7 years ago by nbruin

  • Summary changed from Make libsindular multivariate polynomial rings collectable to Make libsingular multivariate polynomial rings collectable

comment:3 Changed 7 years ago by nbruin

OK, I did a little experiment and tried to build singular with plain malloc rather than omalloc. In principle, omalloc has an --with-emulate... flag, but the API offered in that mode is woefully incomplete for singular use. I tried to extend it a little. Very rough result:

http://sage.math.washington.edu/home/nbruin/singular-3-1-5.malloc.spkg

One problem is supplying a memdup, which needs to know the size of an allocated block from its pointer. On BSD, you can use malloc_size for that. On linux one could use malloc_usable_size. The rest is a whole swath of routines that need to be supplied.
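
For illustration, a rough sketch (in Cython syntax, hypothetical, not the contents of the spkg) of how a memdup replacement can be built on top of that:

    cdef extern from "stdlib.h":
        void *malloc(size_t n)
    cdef extern from "string.h":
        void *memcpy(void *dest, void *src, size_t n)
    cdef extern from "malloc.h":
        size_t malloc_usable_size(void *ptr)   # glibc; use malloc_size on BSD/OS X

    cdef void *memdup_emulated(void *p):
        # omMemDup knows a block's size from its bin; plain malloc does not,
        # so ask the allocator how large the block actually is
        cdef size_t n = malloc_usable_size(p)
        cdef void *q = malloc(n)
        if q != NULL:
            memcpy(q, p, n)
        return q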

The package above is very dirty, but on bsd.math it did provide me with an apparently working libsingular. The singular executable produced didn't seem usable, so keep your old one!

The doctest passes! Not exactly what we were hoping for, but it does make a double free unlikely: that would have been detected. Corruption after freeing is still possible, since malloc allocates much bigger blocks, so freelist data would likely be missed.

There is of course also the possibility of some routine writing out of bounds, which is less likely to trigger problems with malloc too.

Singular people might be interested in incorporating the changes to omalloc (and preferably extending them a little bit) so that --with-emulate... becomes a viable option for Singular debugging. Then you can valgrind Singular code.

comment:4 Changed 7 years ago by SimonKing

  • Cc SimonKing added

comment:5 Changed 7 years ago by SimonKing

  • Report Upstream changed from N/A to Reported upstream. No feedback yet.

I have contacted Hans Schönemann.

comment:6 follow-up: Changed 7 years ago by SimonKing

  • Cc malb added

I think Martin should be Cc for this as well.

I am not sure if changing to malloc is an acceptable option for Singular. If I understand correctly, omalloc is very important for good performance in Gröbner basis computations.

comment:7 in reply to: ↑ 6 Changed 7 years ago by nbruin

Replying to SimonKing:

I am not sure if changing to malloc is an acceptable option for Singular.

I am sure it is not acceptable for production, but being able to swap out omalloc for debugging can be very useful. That's why I tried. I understand that there are great tools to do memory management debugging and omalloc puts them all out of play because it hides all memory alloc/free activity.

It seems omalloc has its own tools, but I wasn't able to get them working; I've seen indications that they don't work on 64 bits, and there's a good chance they're not as good as the general-purpose ones because they serve a smaller market.

I'm sure someone more familiar with the Singular and omalloc code bases can make a more informed decision on whether having the option of straight malloc memory management for debugging is worthwhile. My initial finding is that it quite likely can be done with relatively small modifications. I got it to more or less work in an evening, while being unfamiliar with the code.

Last edited 7 years ago by nbruin (previous) (diff)

comment:8 follow-up: Changed 7 years ago by nbruin

At least the problem is a real one. I've found a similar iMac:

    Hardware Overview:

      Model Name: iMac
      Model Identifier: iMac10,1
      Processor Name: Intel Core 2 Duo
      Processor Speed: 3.06 GHz
      Number Of Processors: 1
      Total Number Of Cores: 2
      L2 Cache: 3 MB
      Memory: 4 GB
      Bus Speed: 1.07 GHz
      Boot ROM Version: IM101.00CC.B00
      SMC Version (system): 1.52f9

    System Software Overview:

      System Version: Mac OS X 10.6.8 (10K549)
      Kernel Version: Darwin 10.8.0
      64-bit Kernel and Extensions: No
      Time since boot: 5 days 8:46

and bsd.math.washington.edu:

    Hardware Overview:

      Model Name: Mac Pro
      Model Identifier: MacPro5,1
      Processor Name: Quad-Core Intel Xeon
      Processor Speed: 2.4 GHz
      Number Of Processors: 2
      Total Number Of Cores: 8
      L2 Cache (per core): 256 KB
      L3 Cache (per processor): 12 MB
      Memory: 32 GB
      Processor Interconnect Speed: 5.86 GT/s
      Boot ROM Version: MP51.007F.B03
      SMC Version (system): 1.39f11
      SMC Version (processor tray): 1.39f11

    System Software Overview:

      System Version: Mac OS X 10.6.8 (10K549)
      Kernel Version: Darwin 10.8.0
      64-bit Kernel and Extensions: Yes
      Time since boot: 16 days 6:14

Both machines exhibit the same problem: on 5.4b0 + #715 + #11521, the doctest for cachefunc.pyx segfaults in the same spot. Note that the iMac claims not to have a 64-bit kernel. Sage is compiled to be 64-bit on that machine, though (and seems to work).

Have we actually established that this bug does not occur on newer OSX versions?

comment:9 in reply to: ↑ 8 ; follow-up: Changed 7 years ago by SimonKing

Replying to nbruin:

Have we actually established that this bug does not occur on newer OSX versions?

And have we actually established that this problem does not occur with older Singular versions? I am not totally sure, but I think the problem with #715+#11521 first emerged in sage-5.4.beta0, when Singular-3-1-5 was merged.

comment:10 in reply to: ↑ 9 Changed 7 years ago by nbruin

Replying to SimonKing:

And have we actually established that this problem does not occur with older Singular versions?

Quoting from comment:1

I have tried and the problem seems to persist with the old singular (5.4b0 has a recently upgraded singular).

In the meantime, a bit of googling led me to OSX's "GuardMalloc". While sage+singular-malloc does not crash on the doctest, it does crash when run with

export DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib

Since gmalloc is a memory manager that places each allocation on its own page, with protected/unmapped memory as close as possible around the block, and that unmaps the block as soon as it is freed (I'm just parroting the manpage), a segfault is likely due to an access-after-free or access-out-of-bounds -- the one that would normally cause the corruption and then the segfault much later. (That's the whole idea of replacing omalloc -- I don't think it's doable to get omalloc to segfault on an access-after-free.) This all comes at a significant speed penalty of course, so experiments are painful, and I wouldn't even be able to interpret the backtrace/coredump if I got it (I'd hope that the gmalloc-induced segfault would be reproducible in gdb). It would really be useful if the test file were pared down to an absolute minimum. That's basically just a backtracking search on which elements can be removed while still triggering a segfault.

However, I think this is a strong indication that there is a real memory violation at the base of this and that it is traceable.

comment:11 follow-up: Changed 7 years ago by SimonKing

I tried to track the problem as follows:

  • sage/libs/singular/ring.pyx

    diff --git a/sage/libs/singular/ring.pyx b/sage/libs/singular/ring.pyx
    --- a/sage/libs/singular/ring.pyx
    +++ b/sage/libs/singular/ring.pyx
    @@ -16,6 +16,8 @@
     
     include "../../ext/stdsage.pxi"
     
    +import sys
    +
     from sage.libs.gmp.types cimport __mpz_struct
     from sage.libs.gmp.mpz cimport mpz_init_set_ui, mpz_init_set
     
    @@ -495,6 +497,8 @@
         cdef object r = wrap_ring(existing_ring)
         refcount = ring_refcount_dict.pop(r)
         ring_refcount_dict[r] = refcount+1
    +    sys.stderr.write("reference %d to %d, wrapper %d\n"%(refcount+1,<size_t>existing_ring, id(r)))
    +    sys.stderr.flush()
         return existing_ring
     
     
    @@ -536,6 +540,8 @@
     
         cdef ring_wrapper_Py r = wrap_ring(doomed)
         refcount = ring_refcount_dict.pop(r)
    +    sys.stderr.write("dereference level %d of %d, wrapper %d\n"%(refcount-1,<size_t>doomed, id(r)))
    +    sys.stderr.flush()
         if refcount > 1:
             ring_refcount_dict[r] = refcount-1
             return

  • sage/rings/polynomial/multi_polynomial_libsingular.pyx

    diff --git a/sage/rings/polynomial/multi_polynomial_libsingular.pyx b/sage/rings/polynomial/multi_polynomial_libsingular.pyx
    --- a/sage/rings/polynomial/multi_polynomial_libsingular.pyx
    +++ b/sage/rings/polynomial/multi_polynomial_libsingular.pyx
    @@ -151,6 +151,7 @@
         sage: b-j*c
         b - 1728*c
     """
    +import sys
     
     # The Singular API is as follows:
     #
    @@ -242,7 +243,7 @@
     
     import sage.libs.pari.gen
     import polynomial_element
    -
    +from sage.libs.singular.ring cimport wrap_ring
     cdef class MPolynomialRing_libsingular(MPolynomialRing_generic):
     
         def __cinit__(self):
    @@ -364,6 +365,8 @@
             from sage.rings.polynomial.polynomial_element import PolynomialBaseringInjection
             base_inject = PolynomialBaseringInjection(base_ring, self)
             self.register_coercion(base_inject)
    +        sys.stderr.write("At %d, creating %s\n"%(<size_t>self._ring, self))
    +        sys.stderr.flush()
     
         def __dealloc__(self):
             r"""
    @@ -390,6 +393,16 @@
                 sage: _ = gc.collect()
             """
             if self._ring != NULL:  # the constructor did not raise an exception
    +            from sage.libs.singular.ring import ring_refcount_dict
    +            try:
    +                level = ring_refcount_dict[wrap_ring(self._ring)]
    +            except KeyError:
    +                level = -1
    +            if level > 1:
    +                sys.stderr.write("WARNING: %d\n"%(<size_t>self._ring))
    +            else:
    +                sys.stderr.write("__dealloc__: %s\n"%(<size_t>self._ring))
    +            sys.stderr.flush()
                 singular_ring_delete(self._ring)
     
         def __copy__(self):

Then, I ran python -t on the segfaulting test. Observation: It happens precisely twice that "WARNING" is printed, i.e., the __dealloc__ method is called even though there remain multiple references to the underlying libsingular ring.

In both cases it is QQ['a','b','c','d']. Here is a snippet from the output:

reference 2 to 4409548912, wrapper 4302568952
reference 3 to 4409548912, wrapper 4302569000
reference 4 to 4409548912, wrapper 4302568952
At 4409548912, creating Multivariate Polynomial Ring in a, b, c, d over Rational Field
reference 5 to 4409548912, wrapper 4302569000
reference 6 to 4409548912, wrapper 4302568952
reference 7 to 4409548912, wrapper 4302569000
reference 8 to 4409548912, wrapper 4302568952
reference 9 to 4409548912, wrapper 4302569000
reference 10 to 4409548912, wrapper 4302568952
reference 2 to 4409549416, wrapper 4302568928
reference 3 to 4409549416, wrapper 4302569000
dereference level 9 of 4409548912, wrapper 4302568928
dereference level 8 of 4409548912, wrapper 4302568952
dereference level 7 of 4409548912, wrapper 4302568928
dereference level 6 of 4409548912, wrapper 4302568952
dereference level 5 of 4409548912, wrapper 4302568928
dereference level 4 of 4409548912, wrapper 4302568952
dereference level 3 of 4409548912, wrapper 4302568928
WARNING: 4409548912
dereference level 2 of 4409548912, wrapper 4302568952
dereference level 1 of 4409548912, wrapper 4302568928
dereference level 0 of 4409548912, wrapper 4302568952

However, I am not totally sure whether this indicates a problem, because in both cases the remaining references are immediately removed. Also, it is always the case that 4 references are set to the libsingular ring before actually creating the polynomial ring in Sage.

One last observation: You may notice a libsingular ring at address 4409549416 that is referenced here as well, apparently in the middle of the construction of QQ['a','b','c','d']. It is later used for QQ['x','y','z']. The last report before the segfault is

reference 32 to 4409549416, wrapper 4302568952

Seems like a wild-goose chase to me, though.

comment:12 in reply to: ↑ 11 Changed 7 years ago by SimonKing

Replying to SimonKing:

One last observation: You may notice a libsingular ring at address 4409549416 that is referenced here as well, apparently in the middle of the construction of QQ['a','b','c','d']. It is later used for QQ['x','y','z']. The last report before the segfault is

reference 32 to 4409549416, wrapper 4302568952

And this ring is in fact currRing when it crashes.

comment:13 Changed 7 years ago by nbruin

OK! good progress. Instrumenting sagedoc.py a little bit we can indeed see the order in which the doctests are executed:

__main__
__main__.change_warning_output
__main__.check_with_tolerance
__main__.example_0
__main__.example_1
__main__.example_10
__main__.example_11
__main__.example_12
__main__.example_13
__main__.example_14
__main__.example_15
__main__.example_16
__main__.example_17
__main__.example_18
__main__.example_19
__main__.example_2
__main__.example_20
__main__.example_21
__main__.example_22
__main__.example_23
__main__.example_24
__main__.example_25
__main__.example_26
__main__.example_27
Unhandled SIGSEGV

so that indeed seems to be alphabetical order.

Now let's run the doctests with singular-using-malloc. Result: No segfault. OSX comes with gmalloc, which is a guarded malloc for debugging purposes. It places every allocation on a separate page and unmaps that page upon freeing. So, any access-after-free leads to a segfault. Now we do get a segfault and it happens a lot sooner than example_27. In fact, now the segfault survives in gdb. The error happens when executing

G = I.groebner_basis()###line 921:_sage_    >>> G = I.groebner_basis()

Here's a session with gdb once the segfault has happened. I think I have been able to extract enough data to point at the probable problem.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x00000001850dbf44
__pyx_f_4sage_4libs_8singular_8function_call_function (__pyx_v_self=0x190ab8960, __pyx_v_args=0x190a8e810, __pyx_v_R=0x19c39be70, __pyx_optional_args=<value temporarily unavailable, due to optimizations>) at sage/libs/singular/function.cpp:13253
13253       currRingHdl->data.uring->ref = (currRingHdl->data.uring->ref - 1);
####NB: This is line 1410 in sage/libs/singular/function.pyx
(gdb) print currRingHdl
$1 = (idhdl) 0x17c2b5fd0
(gdb) print currRingHdl->data
$2 = {
  i = -2062696816,
  uring = 0x1850dbe90,
  p = 0x1850dbe90,
  n = 0x1850dbe90,
  uideal = 0x1850dbe90,
  umap = 0x1850dbe90,
  umatrix = 0x1850dbe90,
  ustring = 0x1850dbe90 <Address 0x1850dbe90 out of bounds>,
  iv = 0x1850dbe90,
  bim = 0x1850dbe90,
  l = 0x1850dbe90,
  li = 0x1850dbe90,
  pack = 0x1850dbe90,
  pinf = 0x1850dbe90
}
(gdb) print currRingHdl->data.uring
$3 = (ring) 0x1850dbe90
(gdb) print currRingHdl->data.uring->ref
Cannot access memory at address 0x1850dbf44
(gdb) print  *__pyx_v_si_ring
$10 = {
  idroot = 0x0, 
  order = 0x19c3cbff0, 
  block0 = 0x19c3cdff0, 
  block1 = 0x19c3cfff0, 
  parameter = 0x0, 
  minpoly = 0x0, 
  minideal = 0x0, 
  wvhdl = 0x19c3c9fe0, 
  names = 0x19c3bdfe0, 
  ordsgn = 0x19c3ddfe0, 
  typ = 0x19c3dffd0, 
  NegWeightL_Offset = 0x0, 
  VarOffset = 0x19c3d9ff0, 
  qideal = 0x0, 
  firstwv = 0x0, 
  PolyBin = 0x104ee8440, 
  ringtype = 0, 
  ringflaga = 0x0, 
  ringflagb = 0, 
  nr2mModul = 0, 
  nrnModul = 0x0, 
  options = 100663424, 
  ch = 0, 
  ref = 0, 
  float_len = 0, 
  float_len2 = 0, 
  N = 3, 
  P = 0, 
  OrdSgn = 1, 
  firstBlockEnds = 3, 
  real_var_start = 0, 
  real_var_end = 0, 
  isLPring = 0, 
  VectorOut = 0, 
  ShortOut = 0, 
  CanShortOut = 1, 
  LexOrder = 0, 
  MixedOrder = 0, 
  ComponentOrder = -1, 
  ExpL_Size = 3, 
  CmpL_Size = 3, 
  VarL_Size = 1, 
  BitsPerExp = 20, 
  ExpPerLong = 3, 
  pCompIndex = 2, 
  pOrdIndex = 0, 
  OrdSize = 1, 
  VarL_LowIndex = 1, 
  MinExpPerLong = 3, 
  NegWeightL_Size = 0, 
  VarL_Offset = 0x19c3e3ff0, 
  bitmask = 1048575, 
  divmask = 1152922604119523329, 
  p_Procs = 0x19c3e7f80, 
  pFDeg = 0x104a80150 <pDeg(spolyrec*, sip_sring*)>, 
  pLDeg = 0x104a80920 <pLDegb(spolyrec*, int*, sip_sring*)>, 
  pFDegOrig = 0x104a80150 <pDeg(spolyrec*, sip_sring*)>, 
  pLDegOrig = 0x104a80920 <pLDegb(spolyrec*, int*, sip_sring*)>, 
  p_Setm = 0x104a7ff40 <p_Setm_TotalDegree(spolyrec*, sip_sring*)>, 
  cf = 0x11e487e70, 
  algring = 0x0, 
  _nc = 0x0
}
(gdb) print __pyx_v_si_ring
$11 = (ip_sring *) 0x19c3c5e90
(gdb) print ((struct __pyx_obj_4sage_5rings_10polynomial_28multi_polynomial_libsingular_MPolynomialRing_libsingular *)__pyx_v_R)->_ring
$12 = (ip_sring *) 0x19c3c5e90
(gdb) print ((struct __pyx_obj_4sage_5rings_10polynomial_6plural_NCPolynomialRing_plural *)__pyx_v_R)->_ring
$13 = (ip_sring *) 0x10019ff30
####NB: so PY_TYPE_CHECK(R, MPolynomialRing_libsingular) is true
(gdb) print (__pyx_v_si_ring != currRing)
$15 = false
####NB: does this mean that rChangeCurrRing(si_ring) got executed or that si_ring already equalled currRing?
(gdb) print (currRingHdl->data.uring != currRing)
$16 = true
####NB: of course, that's why we segfault on the statement that follows:
####NB:       currRingHdl.data.uring.ref -= 1
(gdb) print *(currRingHdl->data.uring)
Cannot access memory at address 0x1850dbe90
####NB: It looks like currRingHdl.data.uring has been unbound.
####NB: naturally, changing a field on that pointer will corrupt memory (or in this case
####NB: because gmalloc has unmapped the page, cause a segfault)
####NB: Could it be that the code here should really test for uring being still valid?
####NB: (if it can do that at all)?

So I think the issue is in sage.libs.singular.function.call_function:

...
    if currRingHdl.data.uring!= currRing:
        currRingHdl.data.uring.ref -= 1
        currRingHdl.data.uring = currRing # ref counting?
        currRingHdl.data.uring.ref += 1
...

The evidence points absolutely to currRingHdl.data.uring pointing to unallocated (probably freed) memory. The access then of course can have all kinds of effects. At this point it is probably doable for a LibSingular expert to reason from the code about whether uring should always be valid here (I suspect it is not).

It looks suspicious to me that sage.libs.singular.ring.singular_ring_delete does do a careful dance to zero out the currRing variable but doesn't seem to care about currRingHdl. I also find it worrying that there apparently is a refcount system right on the ring structures (as you can see above) and yet in singular_ring_delete a separate refcounting dict is used. One would think the same refcounting system should be borrowed by singular_ring_new and singular_ring_delete. It looks to me like the code above thinks that by increasing ...uring.ref the reference is protected, but singular_ring_delete doesn't seem to take this refcount into account. It could well be that I'm misinterpreting the code and that this is all perfectly safe, though.
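
If the two systems were consolidated, it would amount to something like this (a rough sketch of the idea only, not the actual contents of trac_13447-consolidated_refcount.patch):

    cdef ring *singular_ring_reference(ring *existing_ring):
        # use Singular's own counter instead of a Python-side refcount dict
        existing_ring.ref += 1
        return existing_ring

    cdef void singular_ring_delete(ring *doomed):
        doomed.ref -= 1
        if doomed.ref > 0:
            return
        # last reference gone: do the existing careful dance around the
        # currRing global, then actually free the Singular ring
        rDelete(doomed)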

Libsingular specialists: keep in mind that, in principle, Singular code can get executed at rather awkward moments, possibly as part of clean-ups of circular garbage and callbacks on weakref cleanup, where equality might be tested on objects that are themselves about to be deallocated.

The fickleness of the bug is consistent with this condition arising during a cyclic garbage collection with just the right amount of junk around. That would make the occurrence of the bug depend on just about everything in memory. Or at least if you depend on the corruption leading to a segfault, it depends on which location exactly gets corrupted.

I think we might be getting close to a badge for debugging excellence here!

Last edited 7 years ago by nbruin (previous) (diff)

Changed 7 years ago by nbruin

take into account both refcount_dict and ring*.ref fields on deletion.

Changed 7 years ago by nbruin

Consolidate two refcount systems (cruft not yet removed from patch)

comment:14 Changed 7 years ago by nbruin

OK, two independent patches. Either prevents the segfault. I may just have removed the symptom, but not the cause.

If I'm correctly understanding the problem, trac_13447-consolidated_refcount.patch should be the preferred solution. However, my unfamiliarity with (lib)singular's intricacies might have caused an oversight somewhere. I think my interpretation is consistent with the use in sage.libs.singular.function.call_function, which is my main source of inspiration.

If people agree, we can clean out the cruft remaining from the refcounting method implemented locally.

comment:15 Changed 7 years ago by nbruin

  • Status changed from new to needs_info
  • Work issues set to Input from libsingular experts

comment:16 Changed 7 years ago by SimonKing

  • Cc vbraun gagern added
  • Description modified (diff)

Replying to nbruin:

If I'm correctly understanding the problem, trac_13447-consolidated_refcount.patch should be the preferred solution.

I didn't test the patch yet. However, it seems very straightforward to me: there already is a refcounter, and thus one should use it. I am Cc'ing Volker Braun and Martin von Gagern, the authors of #11339. Does trac_13447-consolidated_refcount.patch make sense to you as well?

Keeping a double refcount (as with trac_13447-double_refcount.patch) seems suspicious to me.

Perhaps one should let the patchbots test it? Thus, I'll add this as dependency for #715, and for the patchbot:

Apply trac_13447-consolidated_refcount.patch

PS: You really deserve a badge for debugging excellence! Do I understand correctly that the bug is not on the side of Singular? I'll inform Hans accordingly.

comment:17 Changed 7 years ago by nbruin

With the new refcounting, I think it could be that currRingHdl.data.uring holds the last reference to a ring. In fact, it seems that was the source of the segfaults. If that reference is removed in call_function, shouldn't we delete the ring? The naive solution

...
        currRingHdl.data.uring.ref -= 1
        if currRingHdl.data.uring.ref == 0:
            rDelete(currRingHdl.data.uring)
        currRingHdl.data.uring = currRing # ref counting?
        currRingHdl.data.uring.ref += 1
...

seems to have no ill effect (I put a print statement there that did produce some output, so it does happen), but perhaps I'm overlooking something. Are there other places where references are liable to be lost?

comment:18 Changed 7 years ago by SimonKing

For the record, the following tests fail:

        sage -t  -force_lib devel/sage/sage/libs/singular/ring.pyx # 6 doctests failed
        sage -t  -force_lib devel/sage/sage/modular/modsym/ambient.py # 1 doctests failed
        sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_libsingular.pyx # 1 doctests failed

with

$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
trac_13447-consolidated_refcount.patch

So, not all is good, but almost...

comment:19 Changed 7 years ago by vbraun

I don't know if ring.ref has any meaning to Singular. If we are indeed free to use that field for reference counting in Sage then I'm fine with trac_13447-consolidated_refcount.patch.

Upstream plans to get rid of the whole currRing global variable eventually, for the record.

comment:20 follow-up: Changed 7 years ago by SimonKing

Two errors mentioned in comment:18 look (again) difficult.

The first one:

sage -t  -force_lib devel/sage/sage/modular/modsym/ambient.py
**********************************************************************
File "/scratch/sking/sage-5.4.beta0/devel/sage-main/sage/modular/modsym/ambient.py", line 1351:
    sage: ModularSymbols(20,2).boundary_space().dimension()
Expected:
    6
Got:
    0

Hence, the way one refcounts libsingular rings influences the dimension of Hecke modules. Strange, to say the least...

Note, however, that the value returned by the "dimension()" method above is not constant, because it only returns a lower bound (if I recall correctly) that is increased when one learns more about the Hecke module. Hence, it could very well be that ModularSymbols(20,2).boundary_space() used to be cached but is now garbage collected, so that information on the dimension is lost.

The second error is apparently ignored and only printed to stderr:

sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_ring.py
         [2.1 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial.pyx
         [4.3 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/groebner_fan.py
         [7.8 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_libsingular.pyx
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
**********************************************************************
File "/scratch/sking/sage-5.4.beta0/devel/sage-main/sage/rings/polynomial/multi_polynomial_libsingular.pyx", line 409:
    sage: len(ring_refcount_dict) == n + 1
Expected:
    True
Got:
    False
**********************************************************************
1 items had failures:
   1 of  19 in __main__.example_4
***Test Failed*** 1 failures.
For whitespace errors, see the file /Users/SimonKing/.sage/tmp/bsd.math.washington.edu-96119/multi_polynomial_libsingular_8098.py
         [4.2 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_ring_generic.pyx
         [2.3 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/polydict.pyx
         [2.0 s]

Since these were parallel tests, I can't tell where the ignored attribute errors actually came from.

comment:21 Changed 7 years ago by SimonKing

PS: When consolidating refcounters, we must not forget #13145, which only got merged in sage-5.4.beta1.

comment:22 in reply to: ↑ 20 ; follow-up: Changed 7 years ago by nbruin

Replying to SimonKing:

sage -t  -force_lib devel/sage/sage/modular/modsym/ambient.py
**********************************************************************
File "/scratch/sking/sage-5.4.beta0/devel/sage-main/sage/modular/modsym/ambient.py", line 1351:
    sage: ModularSymbols(20,2).boundary_space().dimension()
Expected:
    6
Got:
    0

I have seen that error before, with other work-arounds (and I think also with singular-malloc), so if it's indeed only a lower bound, then sage has merely changed. It's not an error. If you're worried you can see where that dimension is computed and put a hard ref in the creation of the relevant object. If garbage collection is the cause of the observed amnesia, a hard ref should "solve" it. In that case you can just change the doctest answer.

The second error is apparently ignored and only printed to stderr:

Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored

This is a worrisome error because it's fickle. On a linux x86_64 box, I get this reliably in sage/rings/polynomial/multi_polynomial_libsingular.pyx. When I let it print the lines it's doctesting I get:

set_random_seed(0L)
change_warning_output(sys.stdout)
F = GF(Integer(7)**Integer(2), names=('a',)); (a,) = F._first_ngens(1)###line 1913:_sage_    >>> F.<a> = GF(7^2)
R = F['x, y']; (x, y,) = R._first_ngens(2)###line 1914:_sage_    >>> R.<x,y> = F[]
p = a*x**Integer(2) + y + a**Integer(3); p###line 1915:_sage_    >>> p = a*x^2 + y + a^3; p
q = copy(p)###line 1917:_sage_    >>> q = copy(p)
p == q###line 1918:_sage_    >>> p == q
p is q###line 1920:_sage_    >>> p is q
lst = [p,q];###line 1922:_sage_    >>> lst = [p,q];
matrix(ZZ, Integer(2), Integer(2), lambda i,j: bool(lst[i]==lst[j]))###line 1923:_sage_    >>> matrix(ZZ, 2, 2, lambda i,j: bool(lst[i]==lst[j]))
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
matrix(ZZ, Integer(2), Integer(2), lambda i,j: bool(lst[i] is lst[j]))###line 1926:_sage_    >>> matrix(ZZ, 2, 2, lambda i,j: bool(lst[i] is lst[j]))
sig_on_count()

so it happens when doctesting line 1923. These are probably errors encountered during a dealloc, so it might be happening in a garbage collection. It could also be a WeakValueDict deletion callback that's trying to do a comparison that fails. Googling shows that you've asked about that exact error message on cython-users on 27 January, 2012, so if you solved the bug that led to that question then, perhaps you can also solve this one. It could also be a straight memory corruption. [edit:] OK that was on #11521. You didn't really find that error. You just made it go away by inserting a garbage collection. The good news is that this makes it not so likely that the patch here is causing a new memory corruption. It's more likely a lingering issue that once again gets triggered.

Last edited 7 years ago by nbruin (previous) (diff)

comment:23 in reply to: ↑ 22 Changed 7 years ago by SimonKing

Replying to nbruin:

Replying to SimonKing:

sage -t  -force_lib devel/sage/sage/modular/modsym/ambient.py
**********************************************************************
File "/scratch/sking/sage-5.4.beta0/devel/sage-main/sage/modular/modsym/ambient.py", line 1351:
    sage: ModularSymbols(20,2).boundary_space().dimension()
Expected:
    6
Got:
    0

I have seen that error before, with other work-arounds (and I think also with singular-malloc), so if it's indeed only a lower bound, then sage has merely changed. It's not an error.

I am not a number theorist, but I have learnt from the code that the dimension is computed from the number of "cusps". Hence, if one adds the computation of cusps to that test and assigns the involved Hecke modules to variables, then the tests pass:

            sage: M = ModularSymbols(20, 2)
            sage: B = M.boundary_space(); B
            Space of Boundary Modular Symbols for Congruence Subgroup Gamma0(20) of weight 2 and over Rational Field
            sage: M.cusps()
            [Infinity, 0, -1/4, 1/5, -1/2, 1/10]            
            sage: M.dimension()
            7
            sage: B.dimension()
            6

I think this would be a good solution.

The second error is apparently ignored and only printed to stderr:

Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored

This is a worrisome error because it's fickle. ... so it happens when doctesting line 1923. These are probably errors encountered during a dealloc, so it might be happening in a garbage collection. It could also be a WeakValueDict deletion callback that's trying to do a comparison that fails.

Agreed.

Googling shows that you've asked about that exact error message on cython-users on 27 January, 2012, so if you solved the bug that led to that question then, perhaps you can also solve this one.

Yes, but that question was a pure Cython question, namely like: "Wouldn't it be a good idea to print the function name in which an error was ignored, rather than printing an empty string? That would help debugging."

It could also be a straight memory corruption. [edit:] OK that was on #11521. You didn't really find that error.

Yes. But if it surfaces again, we should now solve it for good. I guess deletion from a weak dictionary is a likely candidate.

comment:24 Changed 7 years ago by SimonKing

I only find two files in sage/rings/ where the string "._modulus" occurs: polynomial_ring.py and polynomial_zz_pex.pyx:

devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:    c = parent._modulus
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:                d = parent._modulus.ZZ_pE(list(x.polynomial()))
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:                d = parent._modulus.ZZ_pE(list(e_polynomial))
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        d = self._parent._modulus.ZZ_pE(list(left.polynomial()))
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        _a = self._parent._modulus.ZZ_pE(list(a.polynomial()))
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        self._parent._modulus.restore()
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        self._parent._modulus.restore()
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        left._parent._modulus.restore()
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        self._parent._modulus.restore()
devel/sage/sage/rings/polynomial/polynomial_ring.py:            self._modulus = ntl_ZZ_pEContext(ntl_ZZ_pX(list(base_ring.polynomial()), p))

Hence, it should be easy to find out which of the few locations is actually involved.

comment:25 Changed 7 years ago by SimonKing

I found that the attribute error occurs in the cdef function get_cparent in sage/rings/polynomial/polynomial_zz_pex.pyx. Next question is then, of course: At what point is get_cparent called?

comment:26 follow-up: Changed 7 years ago by SimonKing

Printing messages to stderr, it seems to me that the error occurs during deallocation of a polynomial template, namely in sage/rings/polynomial/polynomial_template.pxi:

    def __dealloc__(self):
        """
        EXAMPLE::

            sage: P.<x> = GF(2)[]
            sage: del x
        """
        celement_destruct(&self.x, get_cparent((<Polynomial_template>self)._parent))

Is the cparent of self deallocated too early (perhaps because the refcounting is still not accurate)?

Or is it a nasty race condition? Namely:

  • A polynomial p in a polynomial ring R is about to be garbage collected.
  • All python stuff is deleted first. In particular, p's reference to its parent R is gone.
  • Incidentally, because the reference from p to R is gone, R can now be collected as well.
  • When R gets deleted, its underlying libsingular ring is deallocated.
  • Now, p.__dealloc__ is finally called, and tries to access the underlying libsingular ring - but it is too late.

Question: If a polynomial is created, will the reference counter to the underlying libsingular ring be incremented?

comment:27 in reply to: ↑ 26 ; follow-up: Changed 7 years ago by nbruin

Or is it a nasty race condition? Namely:

  • A polynomial p in a polynomial ring R is about to be garbage collected.
  • All python stuff is deleted first. In particular, p's reference to its parent R is gone.
  • Incidentally, because the reference from p to R is gone, R can now be collected as well.
  • When R gets deleted, its underlying libsingular ring is deallocated.
  • Now, p.__dealloc__ is finally called, and tries to access the underlying libsingular ring - but it is too late.

Question: If a polynomial is created, will the reference counter to the underlying libsingular ring be incremented?

From what I understand, __dealloc__ methods cannot assume that python attributes are still valid. They fundamentally cannot, because otherwise it wouldn't be possible to clean up cyclic garbage (__del__ methods, by contrast, are run while all attributes are still valid, and hence objects with a __del__ that are part of cyclic garbage are not cleared).

So, I think that if the libsingular ring pointer is necessary during deallocation of a polynomial, then it should store it in a c-variable. Then it would indeed need to increase the reference counter.
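
Schematically, that pattern would look like this (a hypothetical sketch; the class name and helpers are only meant to illustrate the idea):

    cdef class SomeElement:
        cdef ring *_parent_ring   # C-level copy, still usable inside __dealloc__
        cdef poly *_poly

        def __cinit__(self, parent, *args):
            # take our own reference so the libsingular ring outlives the element
            self._parent_ring = singular_ring_reference((<MPolynomialRing_libsingular>parent)._ring)

        def __dealloc__(self):
            if self._poly != NULL:
                p_Delete(&self._poly, self._parent_ring)
            if self._parent_ring != NULL:
                singular_ring_delete(self._parent_ring)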

You'd initially think that you could store a "c level" pointer to the python polynomial ring and manually increase the refcount. That would ensure that the python polynomial ring is still alive when __dealloc__ gets called. However, it would also mean that there is an extra refcount that the cyclic garbage detector wouldn't be able to explain, so polynomial rings would always appear to have an "external" reference and hence never be eligible for garbage collection. Since rings tend to cache 0 and 1, such references would always be present and all your work would be for naught: polynomial rings would exist forever.

So I think you have to bite the bullet and ensure that get_cparent doesn't access any python attributes or that you can avoid calling it in a __dealloc__.

EDIT: Or perhaps not. While looking at the code a bit I concluded I don't understand a bit of it, due to the templating. I think what I wrote above has some truth to it, but I honestly cannot say whether it has any relevance to the problem at hand. It seems to explain what you're experiencing.

We have:

    def __dealloc__(self):
        celement_destruct(&self.x, get_cparent((<Polynomial_template>self)._parent))

and for us:

get_cparent(parent) == <ntl_ZZ_pEContext_class>(parent._modulus)

The _parent attribute is a cython slot. However, it holds a reference to a python-managed object, so I think cython ensures it's properly taken into account in GC cycle counting. But that would suggest to me python could clear this slot to break cycles! So in that case, Polynomial_template is never safe. It could be I'm wrong, however.

I haven't been able to locate what parent._modulus is in this case. However,

sage: K.<a>=GF(next_prime(2**60)**3)
sage: R.<x> = PolynomialRing(K,implementation='NTL')
sage: '_modulus' in  R.__dict__.keys()
True

suggests this attribute is stored in a dictionary. It's set in sage.rings.polynomial.polynomial_ring.PolynomialRing_field.__init__:1367

        if implementation == "NTL" and is_FiniteField(base_ring) and not(sparse):
            from sage.libs.ntl.ntl_ZZ_pEContext import ntl_ZZ_pEContext
            from sage.libs.ntl.ntl_ZZ_pX import ntl_ZZ_pX
            from sage.rings.polynomial.polynomial_zz_pex import Polynomial_ZZ_pEX

            p=base_ring.characteristic()
            self._modulus = ntl_ZZ_pEContext(ntl_ZZ_pX(list(base_ring.polynomial()), p))
            element_class = Polynomial_ZZ_pEX

I guess we've just found that this is not a very good place to store _modulus. Where else, though? Would it be enough to have a cythonized version of PolynomialRing_field so that _modulus can be tied a little tighter to the parent? It seems to me the parent is the right place to store this information. We just need to convince the parent to hold on to its information for a bit longer.

Last edited 7 years ago by nbruin (previous) (diff)

comment:28 Changed 7 years ago by nbruin

  • Cc robertwb ylchapuy added

comment:29 in reply to: ↑ 27 ; follow-ups: Changed 7 years ago by SimonKing

Replying to nbruin:

We have:

    def __dealloc__(self):
        celement_destruct(&self.x, get_cparent((<Polynomial_template>self)._parent))

and for us:

get_cparent(parent) == <ntl_ZZ_pEContext_class>(parent._modulus)

The _parent attribute is a cython slot.

Interestingly, there is no complaint about a missing attribute _parent. It is _modulus that is missing.

However, it holds a reference to a python-managed object, so I think cython ensures it's properly taken into account in GC cycle counting. But that would suggest to me python could clear this slot to break cycles! So in that case, Polynomial_template is never safe. It could be I'm wrong, however.

I think you are right. The __dealloc__ of Polynomial_template is unsafe, unless polynomial rings will stay in memory forever. But I'd love to hear that we are wrong, because otherwise each polynomial would need a pointer to the c-data expected to be returned by get_cparent((<Polynomial_template>self)._parent), and we'd need to take into account reference counting for the c-parent during creation and deletion of polynomials.

Or perhaps there is a way out. We have a polynomial ring R and we have some elements a,b,c,... Each element points to R, and R points to some of its elements, namely to its generators. The problem is that deallocation of the elements is only possible as long as R is alive.

If we'd manually incref R upon creation of an element x, decrefing R when x gets deallocated, then we would ensure that R will survive until the last of its elements is deleted. Or would that mean that the elements will survive as well, because of the reference from R to its generators? Edit: Yes it would.

Last edited 7 years ago by SimonKing (previous) (diff)

comment:30 in reply to: ↑ 29 Changed 7 years ago by nbruin

Why don't you do something along the lines of:

cdef cparent get_cparent(parent) except? NULL:
    if parent is None:
        return NULL
    cdef ntl_ZZ_pEContext_class c 
    try:
        c = parent._modulus
    except AttributeError:
        c = <some vaguely appropriate proxy value>
    return &(c.x)

This should really only be happening upon deletion anyway, and I'd be surprised if having the correct _modulus is very important at that point. It's a dirty hack, but the alternatives probably mean a full reimplementation of these polynomial rings.

Looking in ntl_ZZ_pEX_linkage seems to indicate it only leads to

if parent != NULL: parent[0].restore()

so perhaps you can just use NULL as a proxy value!

Last edited 7 years ago by nbruin (previous) (diff)

Changed 7 years ago by nbruin

return NULL instead of raising AttributeError?

comment:31 Changed 7 years ago by nbruin

YAY! indeed, returning NULL seems to solve the problem. I don't know whether there are any other ill effects, but since NULL was already returned upon missing parent, I think that with an incomplete parent it's an appropriate value too. It seems the NTL wrapper was already written with the possibility in mind of not having a valid parent around.

comment:32 Changed 7 years ago by SimonKing

  • Report Upstream changed from Reported upstream. No feedback yet. to None of the above - read trac for reasoning.

Great! I didn't expect that NULL would work here because, after all, the c-data of a polynomial is supposed to be deallocated with the help of the c-data of the polynomial ring, via celement_destruct(&self.x, get_cparent((<Polynomial_template>self)._parent)) - but if someone has already thought of the possibility that the parent is invalid, then doing the same with an invalid parent._modulus seems the right thing to do.

While we are at it, I changed the "Reported Upstream" field, because it isn't an upstream bug, after all.

A bit later today, I will also provide a patch fixing the Hecke module dimensions, as in comment:23. Do you want me to ask a number theorist whether the fix ("Compute the cusps, which implies that the dimension is computed as well") is mathematically correct? Or is it enough for you that the same number as before (dimension 6) is obtained?

comment:33 Changed 7 years ago by SimonKing

  • Authors set to Nils Bruin, Simon King
  • Description modified (diff)
  • Status changed from needs_info to needs_review
  • Work issues changed from Input from libsingular experts to Input from a libsingular expert

I have provided a new patch that removes the custom refcounter and uses Singular's refcounter (ring.ref) instead.

As announced, I also fixed the failing modular symbols test by computing the dimension before displaying it: the test previously worked only because a computation happened in a different test that happened to be executed early enough, a side effect that was only possible because Hecke modules would stay in memory permanently.

I did not run the full test suite yet. But sage/rings/polynomial/plural.pyx and sage/rings/polynomial/multi_polynomial_libsingular.pyx and sage/modular/modsym/ambient.py all work.

Problems for the release manager and the reviewer:

  • I removed the custom refcounting. But there were tests using the custom refcounters. The original tests demonstrated that the underlying c-data (the libsingular ring) is properly deleted. I replaced them by tests showing that MPolynomialRing_libsingular instances get garbage collected. Is that OK from your point of view?
  • The mentioned tests will only work with #715 and #11521, because they are responsible for making polynomial rings garbage collectable. Hence, #13447 and #715 and #11521 need to be merged together; just having #715 and #11521 would result in the OS X problem we encountered, and #13447 alone would have two failing tests.
  • #13145 has already been merged in sage-5.4.beta1. I suggest to unmerge it, because it uses the old unreliable "double refcount" approach. My new patch also takes care of refcounting of plural rings.

Apply trac_13447-consolidated_refcount.patch trac_13447-modulus_fix.patch trac_13447-rely_on_singular_refcount.patch

Last edited 7 years ago by SimonKing (previous) (diff)

comment:34 Changed 7 years ago by SimonKing

  • Dependencies set to #11521
  • Description modified (diff)

comment:35 Changed 7 years ago by SimonKing

  • Status changed from needs_review to needs_work

Good news! With the new patches, i.e.

$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
trac_13447-consolidated_refcount.patch
trac_13447-modulus_fix.patch
trac_13447-rely_on_singular_refcount.patch

there is only one crash with make ptest, namely

sage -t  -force_lib devel/sage/sage/libs/singular/groebner_strategy.pyx # Killed/crashed

The crash seems harmless, this time: It occurs at strat = GroebnerStrategy(None), and I suspect that the attempt to incref "None" is a bad idea...

While we are at it: Perhaps it would be better to not unmerge #13145, but to use it as a dependency.

comment:36 Changed 7 years ago by SimonKing

In order to get nice tests, I think one should introduce a function that returns the refcount of a commutative or noncommutative libsingular ring.
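
Such a helper could be as small as the following method on the ring classes (a sketch only; it assumes the cdef class stores its libsingular ring as self._ring):

    def _get_refcount(self):
        """
        Return Singular's reference count of the underlying libsingular ring.
        """
        if self._ring == NULL:
            return 0
        return self._ring.ref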

comment:37 Changed 7 years ago by SimonKing

Having a function that shows the refcount really is a good idea! I already found that elements of a non-commutative ring do not increment the refcount to the underlying "libplural" ring.

Anyway, a new test in the commutative setting shall be:

            sage: import gc
            sage: gc.collect()  # random
            sage: R.<x,y,z> = GF(5)[]
            sage: R._get_refcount()
            7
            sage: p = x*y+z
            sage: R._get_refcount()
            8
            sage: del p
            sage: gc.collect()  # random
            sage: R._get_refcount()
            7

Of course, the question is whether we really need to incref the ring if we create an element. I think, in the commutative case, it is needed, because deallocation of an element refers to the cparent.

It could be that in the non-commutative case we have already a work-around:

    def __dealloc__(self):
        # TODO: Warn otherwise!
        # for some mysterious reason, various things may be NULL in some cases
        if self._parent is not <ParentWithBase>None and (<NCPolynomialRing_plural>self._parent)._ring != NULL and self._poly != NULL:
            p_Delete(&self._poly, (<NCPolynomialRing_plural>self._parent)._ring)

I think we could leave it like that, for now. If someone feels it is needed, then he/she may change NCPolynomial_plural to use templates.

comment:38 follow-up: Changed 7 years ago by nbruin

I think all the wrap_ring and ring_wrapper stuff can go from polynomial_libsingular. I think this was only there to provide a dictionary key for the ring_refcount_dictionary. Any code that uses it is liable to require change anyway, so deleting it is probably a good thing.

comment:39 in reply to: ↑ 38 Changed 7 years ago by SimonKing

Replying to nbruin:

I think all the wrap_ring and ring_wrapper stuff can go from polynomial_libsingular. I think this was only there to provide a dictionary key for the ring_refcount_dictionary. Any code that uses it is liable to require change anyway, so deleting it is probably a good thing.

Sure. I am about to prepare a new patch version that uses singular_ring_reference and singular_ring_delete consistently (and not a manual ..._ring.ref += 1, as in sage.libs.singular.function).

comment:40 Changed 7 years ago by SimonKing

There is one nasty detail with singular_function. If one sets ring = singular_function('ring') and then uses the singular_function to create a ring, then its reference counter is not incremented, even though the following function is called in this case:

cdef inline RingWrap new_RingWrap(ring* r):
    cdef RingWrap ring_wrap_result = PY_NEW(RingWrap)
    ring_wrap_result._ring = r
    ring_wrap_result._ring.ref += 1
    
    return ring_wrap_result

I do not understand it, yet. But anyway, that's the problem I am currently dealing with.

Changed 7 years ago by SimonKing

Use Singular's refcounter for refcounting

comment:41 Changed 7 years ago by SimonKing

  • Description modified (diff)
  • Status changed from needs_work to needs_review

The problems that I mentioned are now solved with the new patch version.

Note that the new patch is relative to #13145, which I made a new dependency for #715.

So, for the record:

$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
13145.patch
trac_13447-consolidated_refcount.patch
trac_13447-modulus_fix.patch
trac_13447-rely_on_singular_refcount.patch

Let's keep the fingers crossed that the tests pass this time.

Apply trac_13447-consolidated_refcount.patch trac_13447-modulus_fix.patch trac_13447-rely_on_singular_refcount.patch

comment:42 Changed 7 years ago by SimonKing

  • Reviewers set to Simon King

Hooray!

All tests passed!
Total time for all tests: 1128.9 seconds

And I did not notice any "ignored" errors.

I think I can give Nils' part of the patches a positive review, and suggest a cross-review.

Changed 7 years ago by nbruin

clean up final bits in refcounting of singular rings

comment:43 Changed 7 years ago by nbruin

  • Description modified (diff)

Patches look good to me, so a positive review from me for Simon's patches.

I've attached a further patch to clean up some more details. Doctests pass for me, but Simon should look at it and decide if he's happy with it. Please go ahead and amend as you see necessary. I also have no problem with the patches being unified (we have some very small patches here now)

Apply trac_13447-consolidated_refcount.patch trac_13447-modulus_fix.patch trac_13447-rely_on_singular_refcount.patch trac_13447-refcount_cleanup.patch

comment:44 Changed 7 years ago by SimonKing

Concerning the last patch: I think one should rather use (as it is "officially" recommended in the docs) singular_ring_reference and singular_ring_delete, and not

    if currRingHdl.data.uring!= currRing:  
        currRingHdl.data.uring.ref -= 1  
        if currRingHdl.data.uring.ref == 0:  
            rDelete(currRingHdl.data.uring)  
        currRingHdl.data.uring = currRing # ref counting?  
        currRingHdl.data.uring.ref += 1  

Apart from that, I agree that the ring wrappers can be binned.

A bit later today, I will produce a combined patch, and provided that all tests pass I will set it to positive review.

comment:45 Changed 7 years ago by SimonKing

For the record: Apparently it is impossible to use the "proper" singular_ring_delete function in the code snippet mentioned in comment:44. I guess the problem is that singular_ring_delete will set currRing=NULL in certain situations - hence, the next line currRingHdl.data.uring = singular_ring_reference(currRing) will fail with a segfault.

So, I'll leave this code snippet as it is.

Last edited 7 years ago by SimonKing (previous) (diff)

comment:46 Changed 7 years ago by SimonKing

On the other hand: I'd really like to understand why it doesn't work with singular_ring_delete in this case.

I found that the segfault occurs inside singular_ring_delete(currRingHdl.data.uring) when attempting to change to currRingHdl.data.uring. So, apparently, currRingHdl.data.uring is invalid at that point. And this is hardly surprising, because it is created by the following lines, which look like a hack to me:

        if currRingHdl == NULL:
            import sys
            currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1)
            currRingHdl.data.uring.ref += 1

comment:47 Changed 7 years ago by SimonKing

Now I think I understand:

currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1) serves to create currRingHdl (it is only called if currRingHdl==NULL), but the value assigned to currRingHdl.data.uring is invalid. In particular, the attempt to change to currRingHdl.data.uring results in a segfault. Hence, singular_ring_delete(currRingHdl.data.uring) would not work.

Solution: We create currRingHdl, but we immediately delete the invalid currRingHdl.data.uring and replace it by a new reference to currRing. If there is no currRing then I print a warning, but I hope that we will never see that warning...

comment:48 Changed 7 years ago by SimonKing

  • Description modified (diff)

I have attached a combined patch. Nils, are you happy with what I did with currRingHdl in sage/libs/singular/function.pyx? If you are, and if the tests pass, then please set it to "positive review"

For the record: I have

$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
13145.patch
trac_13447-sanitise_ring_refcount.patch

on top of sage-5.4.beta0, and the tests in both sage/rings/polynomial and sage/libs/singular pass. Let's hope for the full test suite!

Apply trac_13447-sanitise_ring_refcount.patch

Last edited 7 years ago by SimonKing (previous) (diff)

comment:49 Changed 7 years ago by SimonKing

  • Owner changed from rlm to (none)

make ptest succeeded on bsd.math! To be on the safe side, I am now testing make ptestlong, but if you (Nils) are happy with my changes to currRingHdl then please finish the review!

comment:50 Changed 7 years ago by SimonKing

And make ptestlong succeeded as well. Looks good, I think.

comment:51 follow-up: Changed 7 years ago by nbruin

I'm not so sure that

    currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1) 
    currRingHdl.data.uring.ref += 1

is entirely safe. In 'libsing-test2.cc' (in the Singular package), the following similar code is used:

    // prepare the arguments
    // create a ring Q[x,y]
      // the variable names
      char **n=(char**)omAlloc(2*sizeof(char*));
      n[0]=(char*)"x";
      n[1]=(char*)"y";
      // create the ring
      ring R=rDefault(0,2,n);
      // n is not needed any more:
      omFreeSize(n,2*sizeof(char*));
    // make it the default ring, also for the interpreter
    idhdl newRingHdl=enterid("R" /* ring name*/,
                           0, /*nesting level, 0=global*/
                           RING_CMD,
                           &IDROOT,
                           FALSE);
    IDRING(newRingHdl)=R;
    rSetHdl(newRingHdl);

I'd assume IDRING(newRingHdl) is newRingHdl.data.uring, so the assignment above does not assume anything about that field being initialized. So who knows where newRingHdl.data.uring.ref points!

I get the impression that indeed, currRing is nearly always set to something valid, so your assumption that it is valid during SingularFunction.__init__ (that's earlier than calling!) might be OK. The singular code does contain an awful lot of currRing!=NULL checks though, so I'm not so sure that's really guaranteed.

OK, a little searching of the Singular source shows that enterid (which takes 6 arguments but we call it with 5, and so does the above example) apparently calls idrec::set, in our case with init=1 and in the example above with init=0. The init=1 causes a call to idrecDataInit, and for a ring command that does omAlloc0Bin(sip_sring_bin). The resulting pointer indeed gets assigned to what I think is ...data.uring (although not in a very type-respecting way, but I guess that's par for C code).

So my guess is that by calling with init=1 we indeed get a block that we can delete. Since that's all we do with it I think the cleaner way for us would be

            currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 0)
            currRingHdl.data.uring = NULL

(or whatever the appropriate value for uring is, as per the libsing example above)

    if currRingHdl.data.uring!= currRing: 
        if currRingHdl.data.uring != NULL:
            singular_ring_delete(currRingHdl.data.uring)
        currRingHdl.data.uring = currRing
        singular_ring_reference(currRing) 

(or indeed just leave init=1 in place so that currRingHdl.data.uring is never NULL). In any case, this happens at most once in any sage session, so the leaked memory (if any) would be minuscule.

Concerning refcounts: I found the following in dyn_modules/python/ring_wrap.h

#include <boost/intrusive_ptr.hpp>
using namespace boost;
//typedef intrusive_ptr<ip_sring> Ring;
// inline void intrusive_ptr_add_ref(ring r){
//     r->ref++;
// }
// inline void intrusive_ptr_release(ring r){
//     r->ref--;
//     if (r->ref<=0) rDelete(r);
// }

which is in step with our use of the ref field. But the code is commented out! There is such a thing as an intrusive_ptr in Boost and I think it needs exactly those two functions declared to function. If this is what is intended then we're OK. If this is commented out because now a different scheme is used, we may be in trouble.

Version 0, edited 7 years ago by nbruin (next)

comment:52 Changed 7 years ago by nbruin

  • Status changed from needs_review to needs_info

libSingular experts: Does ring->ref == 0 mean there is exactly one reference to the ring still active?

comment:53 in reply to: ↑ 51 ; follow-ups: Changed 7 years ago by SimonKing

Replying to nbruin:

I'm not so sure that

    currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1) 
    currRingHdl.data.uring.ref += 1

is entirely safe.

I am sure that it is not safe: One can't even call singular_ring_delete on it. That's why I removed it, unless currRing==NULL.

EDIT: kernel/Number.h contains uncommented definitions:

using namespace boost;
inline void intrusive_ptr_add_ref(ring r){
    r->ref++;
    //Print("ref count after add: %d", r->ref);
}
inline void intrusive_ptr_release(ring r){
    if (r->ref<=0) rDelete(r);
    else {
    r->ref--;

    }
    //Print("ref count after release: %d", r->ref);
}

I find the ptr_release a little worrisome: Do the singular people believe that r->ref ==0 means that there is still a reference?

According to the code snippet, the ring is deleted if r->ref==0. So, apparently r->ref==0 means that there is no reference.

this code all indicates that special action is required only if ref<=0 'before' decreasing.

If ref<=0 then no reference is left, hence, the ring is deleted. Otherwise, the reference counter is decremented. I really don't understand the problem.

In that case we should probably ensure that singular_ring_ref and singular_delete_ring are somehow aliased to boost::intrusive_ptr_add_ref and intrusive_ptr_release from kernel/Number.h.

Singular tests whether ref<=0; if it is, then the ring is deleted, but if it isn't then the counter is decremented.

We first decrement the counter, and then test whether ref<0; if it is, then the ring is deleted, but if it isn't then no further action is taken.

Isn't that logically the same?

comment:54 in reply to: ↑ 53 Changed 7 years ago by nbruin

Replying to SimonKing:

According to the code snipped, the ring is deleted if r->ref==0. So, apparently r->ref==0 means that there is no reference.

Indeed, but this is tested before decrement. So for this to work, you'd need

   R=create_ring_with_appropriate_refcount
   // here R.ref should be 0
   //for the following to cause deletion.
   intrusive_ptr_release(R)

We initialize R.ref=1 in singular_ring_new.

If ref<=0 then no reference is left, hence, the ring is deleted. Otherwise, the reference counter is decremented. I really don't understand the problem.

If ref<=0 BEFORE the reference count is decreased. The difference is that intrusive_ptr_release happily leaves a ring in memory that AFTER the call has R.ref==0. Our singular_ring_delete will only abstain from calling rDelete(doomed) if after the call we have doomed.ref > 0.

    doomed.ref = doomed.ref-1
    if doomed.ref > 0:
        return

The two behaviours are off-by-one.

Singular tests whether ref<=0; if it is, then the ring is deleted, but if it isn't then the counter is decremented.

We first decrement the counter, and then test whether ref<0; if it is, then the ring is deleted, but if it isn't then no further action is taken.

We don't. We test whether ref<=0 (or rather whether ref>0 to see if we should do an early exit).

As our doctests show, we apparently never call singular_ring_delete on a ring whose ref field we did not initialize ourselves and that Singular expects to survive after we do, since that would probably lead to an error (Singular would try to access a deallocated ring).

A mismatch in the other direction would be milder: Singular takes a reference on a ring we initialized the ref field on, we lose our last reference via a singular_ring_delete (no rDelete gets called because Singular correctly increased the refcount). However, Singular will never delete the ring because we originally initialized the ref field to 1 whereas Singular apparently expects it to be initialized to 0.
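
To make the off-by-one explicit, here is a small Python toy model of the two deletion conventions (the names only mimic the real routines; this is not actual Singular or Sage code, just the arithmetic):

    def intrusive_ptr_release(ref):
        # Singular-style: ref == number of references minus one.
        # Test before decrementing; delete when ref <= 0.
        if ref <= 0:
            return None, True          # ring would be rDelete'd
        return ref - 1, False

    def singular_ring_delete(ref):
        # Current Sage-style: ref == number of references.
        # Decrement first; delete when the result is no longer positive.
        ref -= 1
        if ref > 0:
            return ref, False
        return None, True              # ring would be rDelete'd

    # One ring, two owners, both eventually give up their reference.
    assert intrusive_ptr_release(1) == (0, False)    # Singular: ref starts at 0, bumped to 1
    assert intrusive_ptr_release(0) == (None, True)
    assert singular_ring_delete(2) == (1, False)     # Sage: ref starts at 1, bumped to 2
    assert singular_ring_delete(1) == (None, True)
    # Mixing the conventions: a ring handed over with ref == 0 (one owner, in
    # Singular's counting) is deleted too early by the Sage-style routine.
    assert singular_ring_delete(0) == (None, True)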

comment:55 in reply to: ↑ 53 Changed 7 years ago by nbruin

Replying to SimonKing:

I am sure that it is not safe: One can't even call singular_ring_delete on it. That's why I removed it, unless currRing==NULL.

I think that's because singular_ring_delete does a changeCurrentRing on it. Calling rDelete seems to work just fine. I think that thanks to the init=1 the memory for the ring is allocated, but initialized to 0 thanks to the omAlloc0. My failure to locate a ref = 1 anywhere in Singular's code also makes me believe that they're happy giving back a ring where ref = 0.

I'm not so sure the rChangeCurrRing dance is necessary in singular_ring_delete. The code in intrusive_ptr_release doesn't need it. I suspect Volker put it in as an extra precaution while he was trying to debug other issues. A cursory reading of the code of rDelete doesn't seem to indicate its operation is affected in any way by whether the current ring is equal to the one being deleted.

comment:56 Changed 7 years ago by SimonKing

OK, I asked Hans Schönemann. I hope he'll answer soon.

If r->ref==0 really means in Singular that there is exactly one reference left, then we should act accordingly.

comment:57 follow-ups: Changed 7 years ago by SimonKing

  • Status changed from needs_info to needs_review
  • Work issues Input from a libsingular expert deleted

Hans Schönemann gave me a couple of answers - many thanks! I hope I am translating and summarising correctly:

r->ref counts the number of interpreter variables referencing a ring minus one. Hence:

ring r=.....; // ref is 0
def rr=r;     // ref is 1
kill r;       // ref is 0
kill rr;      // ref is 0 when calling rKill -> delete it.

There are locally generated rings (in std, for example), unbeknownst to the interpreter - they have ref zero.

Singular uses r->ref only in rKill - it makes me wonder why we use rDelete and not rKill, by the way. In fact, rKill deletes local data, but rDelete doesn't.

Generally, one should have currRingHdl.ring == currRing, or currRing==NULL and currRingHdl==NULL. I just notice that he wrote currRingHdl.ring, not currRingHdl.data.uring - is that a difference?

Concerning debugging: If one builds Singular so that OM_NDEBUG is not defined, then a debug version of omalloc is used. In that way, we would have more easily detected the original problem with omStrDup.

Conclusions

I am not sure if you agree with my conclusions, but here we go:

Since r->ref does not play a rôle in libsingular, we are free to use r->ref to count the number of pointers to r (not "number of pointers minus one"). However, when calling singular interpreter functions, we must make sure that r->ref>0. With our patch, we already do so - hence, that's fine.

Since we apply singular_ring_delete to non-commutative (quotient) rings, and we do not call rKill but only rDelete, we currently have a memory leak for non-commutative rings. rKill will first test that r->ref==0, then kills local data, then kills the ring, and sets currRing and friends to NULL if the to-be-deleted ring is currRing. Hence, we should use rKill in singular_ring_delete, but probably without the rChangeCurrRing dance.

If currRing==NULL, we should not create currRingHdl. If currRing!=NULL then we should let uring point to it. the latter is already done in my patch, the former should be done.

So far I forgot to ask Hans about enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1).

comment:58 Changed 7 years ago by SimonKing

  • Status changed from needs_review to needs_work
  • Work issues set to Replace rDelete by rKill. Only create currRingHdl if currRing is not null

comment:59 in reply to: ↑ 57 Changed 7 years ago by SimonKing

Replying to SimonKing:

Since we apply singular_ring_delete to non-commutative (quotient) rings, and we do not call rKill but only rDelete, we currently have a memory leak for non-commutative rings.

Or perhaps we do not have a leak. Namely, the internal data are also referenced via python, and thus will be deleted when the ring is deleted. Well, let's see if rKill results in segfaults (due to the attempt to deallocate the internal data twice)...

comment:60 Changed 7 years ago by SimonKing

  • Status changed from needs_work to needs_review
  • Work issues Replace rDelete by rKill. Only create currRingHdl if currRing is not null deleted

I have updated trac_13447-sanitise_ring_refcount.patch.

Changes:

  1. currRingHdl is only created if currRing!=NULL, and currRingHdl.data.uring is subsequently set to currRing.
  2. singular_ring_delete is now using rKill in lieu of rDelete, so that local data are guaranteed to be deallocated.

In my previous comment, I was a bit sceptical about the second point. However, all doctests pass on bsd.math, hence, using rKill does not result in a segfault. In other words, the local data are not double-freed, and thus rKill is better than rDelete.

Apply trac_13447-sanitise_ring_refcount.patch

comment:61 Changed 7 years ago by SimonKing

Here is one comment of Hans on the following lines of the patch:

          global currRingHdl
          if currRingHdl == NULL and currRing!=NULL:
              # Create an invalid mock ring - it would not be possible
              # to make that ring currRing!
              # The only aim is to create currRingHdl
              currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1)
              # Now we assign proper data to currRingHdl
              rDelete(currRingHdl.data.uring)
              currRingHdl.data.uring = singular_ring_reference(currRing)

He says that it is ok like that. However, it would be better to do

          global currRingHdl
          if currRingHdl == NULL and currRing!=NULL:
              # Create an invalid mock ring - it would not be possible
              # to make that ring currRing!
              # The only aim is to create currRingHdl
              currRingHdl = enterid("my_awesome_sage_ring", 0,  RING_CMD, &IDROOT, 0)
              # Now we assign proper data to currRingHdl
              currRingHdl.data.uring = singular_ring_reference(currRing)

Note the difference in the last argument of enterid.

I'll try it...

Last edited 7 years ago by SimonKing (previous) (diff)

comment:62 Changed 7 years ago by SimonKing

I have updated the patch again. Tests pass on bsd.math.

The difference between enterid(...,0) and enterid(...,1) is that the former does not allocate memory for currRingHdl.data.uring, while the latter does. Hence, using enterid(..., 0), one avoids the useless call to rDelete.

Apply trac_13447-sanitise_ring_refcount.patch

comment:63 Changed 7 years ago by nbruin

I'm afraid you (at least theoretically) have recreated the possibility for our original error to occur. We essentially have:

SingularFunction.__init__:
	if currRingHdl == NULL and currRing!=NULL: 
            currRingHdl = <create ring handle>
            currRingHdl.data.uring = currRing

SingularFunction.__call__:
	if currRingHdl.data.uring!= currRing: 
	    singular_ring_delete(currRingHdl.data.uring) 
	    currRingHdl.data.uring = singular_ring_reference(currRing)

If we init with currRing == NULL we may be leaving currRingHdl == NULL. If a subsequent invocation of call happens after currRing has a new value and currRingHdl is not updated (if that would never happen we wouldn't need this code) we'd be calling

	if NULL.data.uring!= currRing: 
	    singular_ring_delete(NULL.data.uring) 
	    NULL.data.uring = singular_ring_reference(currRing)

which will segfault (OK, this error is much better than the original).

From what you describe, rKill can set currRingHdl=NULL as well, in which case we definitely get an error in the above code. Thus it seems to me it's theoretically possible that __call__ happens when currRingHdl=NULL and currRing!=NULL. In that case you should recreate a handle:

    if currRingHdl != NULL:
        if currRing != NULL:
            if currRingHdl.data.uring != currRing:
                if currRingHdl.data.uring != NULL:
                    singular_ring_delete(currRingHdl.data.uring)
                currRingHdl.data.uring = currRing #I don't think this should up refcounts
        else: #currRing == NULL and currRingHdl != NULL
            rKill(currRingHdl) #delete the handle as well. You could also save it for reuse
                               #I think this looks at the refcount
            currRingHdl = NULL            
    else:
        if currRing != NULL:
            currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 0) #note the 0
            currRingHdl.data.uring = currRing
# NOW the following invariant holds:
    assert (currRingHdl == NULL and currRing == NULL) or currRingHdl.data.uring == currRing

I think that without extra assumptions you must be prepared to deal with all those conditions. Also note that you might be creating a new ringhandle an arbitrary number of times, so knowing whether you leak becomes important. The sequence I use above is what I think the libsingular examples that are shipped with singular do.

You can already see here that the use of rKill (which does look at reference counts) forces us to follow Singular's conventions of what the ref field means. At this point we're assuming that if currRingHdl holds the last reference to a ring, we should probably delete it.

I think the convention is that currRingHdl (if it's relevant) is pointing to the same ring as currRing, so they are treated as one reference: currRingHdl.data.uring = currRing does not increase a refcount. That obviously leads to problems when currRing diverges from currRingHdl.

Either you attach memory management ONLY to currRingHdl and trust that when currRing is changed, someone else will still have a reference to the original value, who can throw away the ring (so, currRing is only changed by straight assignments -- no increfs or decrefs. It's always a borrowed reference) or you always ensure that currRingHdl gets updated (or at least cleared) when currRing changes. My impression is that refcounting in singular is only an interpreter thing, so that the former is the model used by Singular itself.

Double free implies segfault: Have you verified this is true with omAlloc?

Refcounting base: I agree that as long as our created rings and rings created internally live separate lives, it doesn't matter what semantics we attach to the ref field. I just think it's dangerous to implicitly assume these rings do live separate lives. They do now, but who's to guarantee that they do in the future? And even then it may take a long time for a visible error to occur, which by then might be difficult to track down. I think we should adjust to the semantics Singular uses in other places (and document the slightly unusual refcounting semantics of Singular in our code!) to make the libsingular interface as robust as possible. You know how extensions/modifications are going to happen: people will just copy/paste what's already there.

comment:64 in reply to: ↑ 57 ; follow-up: Changed 7 years ago by nbruin

Replying to SimonKing:

There are locally generated rings (in std, for example), unbeknownst to the interpreter - they have ref zero.

Singular uses r->ref only in rKill - it makes me wonder why we use rDelete and not rKill, by the way. In fact, rKill deletes local data, but rDelete doesn't.

If we do that, we should ensure that rKill has an accurate view of the number of references. That means adjusting to Singular's conventions. If we rKill with our current code, we'd leak because we'd call with a positive refcount. I guess you could manually decref before calling rKill but that's silly and misleading to future maintainers of the code.

Generally, one should have currRingHdl.ring == currRing, or currRing==NULL and currRingHdl==NULL. I just notice that he wrote currRingHdl.ring, not currRingHdl.data.uring - is that a difference?

I don't think the former actually exists. A macro is used in the singular code base, so Hans would not usually spell it out in full.

Concerning debugging: If one builds Singular so that OM_NDEBUG is not defined, then a debug version of omalloc is used. In that way, we would have more easily detected the original problem with omStrDup.

Didn't work for me but I might have just failed to properly (de)activate those flags. Anyway, I doubt an access-after-free would be immediately detected. That really involves unmapping or protecting memory in order to force a segfault. I didn't see evidence that omAlloc reaches that deep into the system. It may be that omAlloc in debugging mode does consistency checks on every use. In that case we might have found the corruption a little earlier. I doubt we would have had gdb point exactly at the offending instruction. I think it would be nice for singular to update the omAlloc headers with my changes so that (at least on linux and OSX) they can build it with the malloc underneath. I found it very useful for this particular issue.

Conclusions: Since r->ref does not play a rôle in libsingular, we are free to use r->ref to count the number of pointers to r (not "number of pointers minus one"). However, when calling singular interpreter functions, we must make sure that r->ref>0. With our patch, we already do so - hence, that's fine.

In principle yes, but not if we use rKill and I think it's a bad idea in general because it might cause unforeseen problems in the future. Better to stick with one interpretation of what the ref field means.

If currRing==NULL, we should not create currRingHdl. If currRing!=NULL then we should let uring point to it. the latter is already done in my patch, the former should be done.

As you indicate, currRingHdl can be NULLed again by rKill. We need to correct that if we need currRingHdl later.

In fact, without extra infrastructure to clean up after it, it seems that rKill leaks ringhandles. Is it the case that ring handles are registered in some global data structure that allows them to be deleted? This loop might be the candidate:

    while (r->idroot!=NULL)
    {
      killhdl2(r->idroot,&(r->idroot),r);
    }

that might do this. In any case, rKill just does a straight currRingHdl=NULL if (r==currRing), so currRingHdl is definitely gone.

Is it really necessary to use rKill? It seems it is more part of the interpreter layer and that the libsingular interface itself is largely one level lower. Have we initialized enough of the interpreter to let rKill function as advertised?

It is by now clear that up to now, the Sage-libsingular interface was largely based on empirically verified guesses. Now that Simon has good contact with Hans, perhaps we can ensure the interface follows official protocol?

Last edited 7 years ago by nbruin (previous) (diff)

comment:65 in reply to: ↑ 64 Changed 7 years ago by SimonKing

Replying to nbruin:

Replying to SimonKing:

Singular uses r->ref only in rKill - it makes me wonder why we use rDelete and not rKill, by the way. In fact, rKill deletes local data, but rDelete doesn't.

If we do that, we should ensure that rKill has an accurate view of the number of references. That means adjusting to Singular's conventions. If we rKill with our current code, we'd leak because we'd call with a positive refcount.

No, we don't. With the current patch, rKill is only called if ref==0.

Conclusions Since r->ref does not play a rôle in libsingular, we are free to use r->ref to count the number of pointers to r (not "number of pointers minus one"). However, when calling singular interpreter functions, we must make sure that r->ref>0. With our patch, we already do so - hence, that's fine.

In principle yes, but not if we use rKill

I disagree, since we only call rKill with r->ref==0.

and I think it's a bad idea in general because it might cause unforeseen problems in the future. Better to stick with one interpretation of what the ref field means.

In a way, we do. Namely, one could argue that a Sage polynomial ring MPolynomialRing_libsingular is an interpreter variable (roughly, it is an object that does not live in the Singular kernel but points to a ring in the Singular kernel). Hence, when creating a polynomial ring, it agrees with Singular's conventions to increase the refcount.

As you indicate, currRingHdl can be NULLed again by rKill. We need to correct that if we need currRingHdl later.

currRingHdl is only NULLed if currRing is deleted. So, that should be fine.

In any case, rKill just does a straight currRingHdl=NULL if (r==currRing), so currRingHdl is definitely gone.

You mean, it doesn't free it before assigning NULL? Yes, that would seem to be a leak.

Is it really necessary to use rKill? It seems it is more part of the interpreter layer and that the libsingular interface itself is largely one level lower.

I'll try rKill versus rDelete in singular_ring_delete and will try to see if there is a leak when creating and deleting super commutative rings (SCA).

comment:66 Changed 7 years ago by SimonKing

Too bad. Apparently an SCA cannot be garbage collected, in spite of all the weak caches from #715 and #11521. So, singular_ring_delete will never be called on an SCA.

comment:67 Changed 7 years ago by SimonKing

With #12313 and its dependencies, SCA can be collected, but its underlying ring is not deleted. Wrong refcount, I guess.

comment:68 Changed 7 years ago by SimonKing

  • Status changed from needs_review to needs_work
  • Work issues set to Understand why sometimes `new_RingWrap` needs an incref and sometimes not

The wrong refcount appears to be in sage/libs/singular/function.pyx, where in the current patch we have

            # We need to incref the to-be-converted data,
            # since apparently removing to_convert.data would
            # decref it.
            return new_RingWrap( singular_ring_reference(<ring*> to_convert.data) )

There was some example where the refcount was wrong without calling singular_ring_reference, but for SCA it is wrong with that call.

So, that is something that we need to understand, and needs work.

comment:69 Changed 7 years ago by SimonKing

Some suggestions:

The singular function rKill seems to do what we want in singular_ring_delete. Hence, to keep it simple, singular_ring_delete should call rKill, and nothing more (unless we want a sanity test, such as printing a warning if singular_ring_delete is called on NULL).

rKill only takes action if r->ref is zero when it is called. Indeed, according to Hans, r->ref is the number of interpreter variables pointing to a ring, minus one. In order to avoid confusion, we should follow the same idea.

So, what does "interpreter variable" mean for us? I think this should be "any Sage object whose deallocation needs the libsingular ring to be alive". This is every polynomial ring, every Gröbner strategy, and any element of a polynomial ring.

Since ref is -1 based, when a Sage ring R is created then ref should be zero, so that the underlying libsingular ring can be properly deleted if R is deallocated. And if any element or Gröbner strategy on that ring is created, ref needs to be incremented. The deallocation of an element or Gröbner strategy must involve a call to singular_ring_delete.

A complication arises when calling a singular_function. When we call a Singular interpreter function, then we should have ref>0, again according to Hans. Hence, at the beginning of a call, ref must be incremented, and decremented when leaving it.

Special care needs to be taken on currRingHdl. According to Hans, rKill can make it NULL, and it should be NULL if and only if currRing is NULL. Currently, we create currRingHdl when we create a Singular function, but we only use it when we call a Singular function. That's asking for trouble. Hence, we should move the creation of currRingHdl to SingularFunction.__call__, where we also set the correct currRing. SingularFunction.__init__ is not the right place.
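
As a sanity check of the bookkeeping proposed above, here is a toy Python model (invented names; the counter is -1 based as described by Hans; this is not the actual sage.libs.singular code):

    class ToyRing:
        # Stands in for a libsingular ring*; ref is -1 based, as in Singular.
        def __init__(self):
            self.ref = 0          # freshly wrapped by one Sage ring: ref == 0
            self.alive = True

    def reference(r):
        # Another Sage object (element, Groebner strategy, ...) starts using r.
        r.ref += 1
        return r

    def delete(r):
        # A Sage owner goes away; kill the ring when the last one does (ref == 0).
        if r.ref == 0:
            r.alive = False       # this is where rKill would be called
        else:
            r.ref -= 1

    def call_interpreter_function(r):
        # During an interpreter call we must have ref > 0, so bump temporarily.
        r.ref += 1
        try:
            pass                  # the actual Singular call would happen here
        finally:
            r.ref -= 1

    R = ToyRing()                 # the Sage polynomial ring itself
    x = reference(R)              # an element of R
    call_interpreter_function(R)
    delete(R)                     # the element is deallocated: ref drops to 0
    delete(R)                     # the ring is deallocated: the libsingular ring dies
    assert not R.alive and R.ref == 0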

comment:70 Changed 7 years ago by SimonKing

I think one potential problem with our current use of currRingHdl in combination with rKill is that we would redefine my_awesome_sage_ring, hence, we would see the following message on stderr:

// ** redefining my_awesome_sage_ring **

I'll ask Hans how one can create currRingHdl without creating a mock ring.

comment:71 Changed 7 years ago by SimonKing

Here is one complication for singular_function: If you do singular_function("std"), then a SingularKernelFunction is created. In its __init__, it creates a call handler, and for that purpose it needs to find out the arity of the Singular kernel function.

But apparently the arity of std cannot be correctly determined if currRing==0: one obtains arity 0, not 1, for std.

Therefore, when I move the creation of currRingHdl from SingularFunction.__init__ to where the function is called, an error occurs.

A question arises: Is there a singular function that changes its arity depending on currRing? If that is the case, then we must determine SingularKernelFunction.call_handler each time the function is called, rather than during initialisation.
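
For what it's worth, here is a minimal plain-Python sketch of the two options (invented names; not the actual sage.libs.singular code): either build the call handler lazily at the first call, or rebuild it on every call in case the arity depends on the current ring.

    class KernelFunction:
        def __init__(self, name, exists_in_kernel):
            # Existence can be checked without a current ring ...
            if not exists_in_kernel(name):
                raise NameError("no Singular kernel function %r" % name)
            self.name = name
            self._handler = None      # ... but the call handler cannot be built yet

        def __call__(self, ring, build_handler, arity_may_depend_on_ring=False):
            # Option 1: build lazily on the first call and cache the handler.
            # Option 2: if the arity could depend on the ring, rebuild each time.
            if self._handler is None or arity_may_depend_on_ring:
                self._handler = build_handler(self.name, ring)
            return self._handler(ring)

    # Hypothetical usage with stand-in callables:
    std = KernelFunction("std", exists_in_kernel=lambda n: n in {"std", "ringlist"})
    print(std("QQ[x,y,z]", build_handler=lambda name, ring: lambda r: (name, r)))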

comment:72 follow-up: Changed 7 years ago by nbruin

Replying to SimonKing:

I think one potential problem with our current use of currRingHdl in combination with rKill is that we would redefine my_awesome_sage_ring, hence, we would see the following message on stderr:

// ** redefining my_awesome_sage_ring **

I suspect the records do get cleared, because otherwise Singular would leave a dangling pointer around and one would have what is essentially a singular ring variable with a NULL pointer to a ring or a pointer to unallocated memory. If those variables do get properly removed, by the time we have to redefine the thing there should be no trace of the original and hence it would not be a redefinition. If the skeleton somehow does remain (with a nulled out data.uring field then, I suppose?), we could keep it around:

    saved_skeleton=enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 0)
    ....
    if currRingHdl == NULL:
        currRingHdl = saved_skeleton
        currRingHdl.data.uring = singular_ring_reference(currRing)

Incidentally, doesn't currRingHdl count as a ring reference? The code in SingularFunction that started this whole thing seems to indicate this. In that case the only singular_ring_delete(R) that can trigger nulling of currRingHdl and currRing is when the explicit intention is singular_ring_delete(currRingHdl.data.uring). When decreffing any other pointer that just happens to be pointing to the same ring structure, there would still be a refcount left, so no actual deletion would take place.

Note that after singular_ring_delete(currRingHdl.data.uring) we should absolutely NULL currRingHdl.data.uring (probably safer/better to do currRingHdl=NULL) even if deallocation didn't occur, because we should remove the reference. Therefore, I find it surprising that rKill does do a currRingHdl=NULL conditionally internally. With correct usage, rKill should never execute that code usefully, because it needs to happen unconditionally outside of rKill if it ever happens. Is it there as a safety net against sloppy programming?

Note that doing singular_ring_delete(currRing) is nearly always wrong, because currRing is a borrowed reference as far as I understand.

Are we the only ones ever setting currRingHdl or can one write/call singular functions that internally will change currRingHdl? If the former holds, the code becomes a little easier to reason about and then we can make our own provisions and ensure they are followed properly (such as saving a skeleton).

In the latter case we really have to play by Singular's rules, because Singular might be doing the deallocating outside of our control.

comment:73 in reply to: ↑ 72 Changed 7 years ago by SimonKing

Replying to nbruin:

Replying to SimonKing:

I think one potential problem with our current use of currRingHdl in combination with rKill is that we would redefine my_awesome_sage_ring, hence, we would see the following message on stderr:

// ** redefining my_awesome_sage_ring **

I suspect the records do get cleared, because otherwise Singular would leave a dangling pointer around and one would have what is essentially a singular ring variable with NULL pointer to a ring or a pointer to unallocated memory . If those variables do get properly removed, by the time we have to redefine the thing there should be no trace of the original and hence it would not be a redefinition.

I am afraid Singular knows that the mock ring has been defined. OK, that's relative to #12313. But it happened when I did

sage: from sage.rings.polynomial.plural import SCA
sage: E = SCA(QQ, ['x', 'y', 'z'], [0, 1], order = 'degrevlex')
sage: import gc
sage: del E
sage: _ = gc.collect()
sage: E = SCA(QQ, ['x', 'y', 'z'], [0, 1], order = 'degrevlex')

#12313 is needed to make E collectable.

Hans didn't answer, but I found out myself: currRingHdl=rFindHdl(currRing,NULL,NULL), which could still be NULL. If it is NULL, then we may create the mock ring.

Are we the only ones ever setting currRingHdl or can one write/call singular functions that internally will change currRingHdl?

No idea.

Changed 7 years ago by SimonKing

Combined patch

comment:74 follow-up: Changed 7 years ago by SimonKing

I have attached a new version of the patch. Here are a few comments:

For easier debugging, I created methods returning the refcount for RingWrap and for polynomial rings, and for g-algebras.

Concerning singular_function

currRingHdl is now taken care of when the function is called, and not when it is created. In that way, one prevents nasty situations like currRingHdl being NULL because of a previous deallocation. When it is created, currRing is guaranteed to exist. I call rFindHdl(currRing,NULL,NULL), which may find a previously defined handle. If none can be found, a mock ring is created by

currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 0)

Since the last argument is 0, we can then do

currRingHdl.data.uring = currRing

without the need to call rDelete(currRingHdl.data.uring). Problem: See below.

Since currRing is now not necessarily available when initialising SingularKernelFunction, it is impossible to determine its arity during initialisation, but fortunately it is still possible to determine whether the function exists in the Singular kernel. Therefore, during __init__, I merely test for existence, while the creation of self.call_handler is postponed to a point where currRing is guaranteed to be around.

Since Hans said that we need to make sure that currRing->ref is positive when calling an interpreter function, I temporarily increment the counter during the call.

Concerning ref counting

I think we should simply rely on Singular's rKill for deletion of rings, so that internal data are taken care of.

currRingHdl.data.uring and currRing do not constitute a reference in the sense of "interpreter variable". Therefore I am not incrementing or decrementing the refcount on them (unless I missed something).

Because of ref counting based at -1, a naked MPolynomialRing_libsingular and NCPolynomialRing_plural should have self._ring.ref == 0, so that self._ring can be properly deleted. All other ring-related data, such as a Gröbner strategy or elements of the ring, constitute a new reference.

The same should hold for a RingWrap: If one has a naked RingWrap then the underlying ring should be deletable when deleting RingWrap. A complication arises if both a RingWrap and a polynomial ring refer to the same underlying libsingular ring.

This occurs in NCPolynomialRing_plural, which is created from a ring wrap. I do

        cdef RingWrap rw = ncalgebra(self._c, self._d, ring = P)

        #       rw._output()
        self._ring = singular_ring_reference(rw._ring)

hence incref, because at the end of the initialisation of self, rw will be deleted, so that the refcount for self._ring will then be fine. The same holds for new_CRing and new_NRing in sage.rings.polynomial.plural.

Some experiments with the new patch

Worst of all: I did not run the tests yet.

Here is a test of whether there is a memory leak when creating a super-commutative algebra repeatedly. Note that one needs

$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
13145.patch
trac12215_weak_cached_function-sk.patch
trac12215_segfault_fixes.patch
trac_12313-mono_dict-combined-random-sk.patch
trac_12313_quit_sage.patch
trac13370_deprecate_is_field.patch
trac_13447-sanitise_ring_refcount.patch

because otherwise super commutative algebras could not be garbage collected.

sage: from sage.rings.polynomial.plural import SCA
sage: import gc
sage: while 1:
....:     E = SCA(QQ, ['x', 'y', 'z'], [0, 1], order = 'degrevlex')
....:     print get_memory_usage(),id(E)
....:     del E
....:     _ = gc.collect()
....:     
223.98828125 4520110400
223.98828125 4520102992
223.98828125 4520145584
223.98828125 4520100800
223.98828125 4520190656
223.98828125 4520116128
223.98828125 4516130448
223.98828125 4520183424
223.98828125 4338291168
223.98828125 4520148816
223.98828125 4520173408
223.98828125 4510376384
223.98828125 4520106672
...

Conclusion: There is no leak, even though it really is a new ring each time.

However, not all is good. Namely, we interrupt the computation above, and continue as follows:

sage: from sage.libs.singular.function import singular_function
sage: P.<x,y,z> = PolynomialRing(QQ)
sage: ringlist = singular_function("ringlist")
sage: l = ringlist(P)
// ** redefining my_awesome_sage_ring **

So, here you see the problem with creating the mock ring. I can only hope that Hans will tell us how one can create a currRingHdl for a given currRing, rather than only finding a previously created currRingHdl.

Because of that last problem, I think it doesn't make much sense to test. However, comments are welcome.

comment:75 Changed 7 years ago by SimonKing

Unfortunately one gets a segfault in the tests of sage/rings/polynomial/multi_polynomial_ideal.py, apparently related to a singular_function call (namely in the method variety).

Too bad. I really thought that the new approach towards refcounting was safe.

comment:76 in reply to: ↑ 74 Changed 7 years ago by nbruin

However, not all is good. Namely, we interrupt the computation above, and continue as follows:

sage: from sage.libs.singular.function import singular_function
sage: P.<x,y,z> = PolynomialRing(QQ)
sage: ringlist = singular_function("ringlist")
sage: l = ringlist(P)
// ** redefining my_awesome_sage_ring **

I think what your code does is look up whether there is a Singular variable that points to currRing. If so, you use that one and leave the Singular variable my_awesome_sage_ring dangling in the system. If you don't find a Singular variable pointing to currRing, then you (re)create the variable my_awesome_sage_ring to refer to currRing. But it could be that that variable is alive and well, happily pointing at some other ring (that once was currRing). I think the system probably has good reason to complain.

What you should probably do is create my_awesome_sage_ring, keep a pointer to it, and refcount rings that you let it point to. Now you're simply holding a singular variable reference to a ring. When you need currRingHdl to point somewhere, you can do that via my_awesome_sage_ring. If you need to NULL currRingHdl, you can do so without consequence. You still keep your own pointer to my_awesome_sage_ring.

It means that the ring pointed to by my_awesome_sage_ring is protected against garbage collection, but that's correct: We're holding a reference to it! Such rings will become eligible for deletion once my_awesome_sage_ring is made to point elsewhere. If you don't like rings that were currRing to survive until currRingHdl is actually needed to point elsewhere, you could NULL the data.uring field of my_awesome_sage_ring as soon as you get a chance, but I wouldn't bother.

I haven't traced through your precise usage, but grabbing and dropping Singular variables without adapting reference counts sounds fishy to me and liable to lead to segfault.

I'm pretty sure currRingHdl is just a Singular interpreter-level entry to currRing, and probably in the singular interpreter the current ring is always a ring that actually has a name in the interpreter, so currRingHdl is simply a borrowed reference to that variable. That's why currRingHdl manipulations in singular do not involve refcounting (I haven't checked, but it seems to me you claim they don't): the rings themselves are refcounted, but that has already happened when the handle struct that currRingHdl is pointing at was instantiated. That struct survives independently of currRingHdl because it represents a variable binding in the singular interpreter and therefore is probably indexed in some tree, linked list or hash table (i.e., the index that rFindHdl goes digging in).

Last edited 7 years ago by nbruin (previous) (diff)

comment:77 follow-up: Changed 7 years ago by SimonKing

I guess the recreation of my_awesome_sage_ring can be avoided, because Singular certainly has methods to return a previously defined ring when you know the name of that ring. So, there is no need for Sage to keep a pointer to it, because Singular already does.

The segfault in sage/rings/polynomial/multi_polynomial_ideal.pyx can be avoided by setting ref=1 rather than ref=0 in the function singular_ring_new. However, I wouldn't call it a "fix", because I think this would mean the ring could never be rKilled.

So, I suspect that the reference count goes wrong in a different location.

comment:78 in reply to: ↑ 77 ; follow-up: Changed 7 years ago by SimonKing

Replying to SimonKing:

I guess the recreation of my_awesome_sage_ring can be avoided, because Singular certainly has methods to return a previously defined ring when you know the name of that ring.

It is ggetid.

comment:79 in reply to: ↑ 78 Changed 7 years ago by SimonKing

Replying to SimonKing:

It is ggetid.

Using it does solve the problem with the warning message in comment:74.

In addition, it makes the segfault in sage -t devel/sage/sage/rings/polynomial/multi_polynomial_ideal.pyx disappear - which is bad news, because instead the test hangs during some call to a singular_function...

Changed 7 years ago by SimonKing

Experiments towards fixing some problems

comment:80 Changed 7 years ago by SimonKing

I have posted a new patch trac_13447-attempted_improvement.patch. According to Hans, it may happen that Singular functions return NULL, which means that there has been an error. Therefore, I suggest actually raising a RuntimeError if the return value is NULL. Apart from that, I postpone another ggetid call to a location where currRing is guaranteed to exist; however, according to Hans, that call to ggetid (namely in the case of a library function) should also be fine without currRing.

Here is the problem:

sage -t --verbose devel/sage/sage/rings/polynomial/multi_polynomial_ideal.py
...
Trying:
    R = QQ['a, b']; (a, b,) = R._first_ngens(2); I = R.ideal(a**Integer(2)+b**Integer(2)-Integer(1))###line 3581:_sage_    >>> R.<a,b> = QQ[]; I = R.ideal(a^2+b^2-1)
Expecting nothing
ok
Trying:
    Q = QuotientRing(R,I); K = Frac(Q)###line 3582:_sage_    >>> Q = QuotientRing(R,I); K = Frac(Q)
Expecting nothing
// ** char_series returns 0 x 0 matrix from 2 input polys (0)
I[1,1]=b2+a2-1*** *** Error: TIMED OUT! PROCESS KILLED! *** ***

         [360.3 s]

I have absolutely no idea why that happens.

The error occurs in the Singular library function primdecSY. I did check that currRing is definitely not NULL when calling the function, currRingHdl.data.uring==currRing, and there are plenty of references to currRing. Moreover, the error does not occur if one copies the whole example (which is from the groebner_basis method) into an interactive session.

primdecSY hangs, there is no return value (not even NULL).

Since there is no segfault, I guess gdb would not help here. Since currRing is referenced several times, I guess that it has not been double-freed - but who knows? Can it be that a ring was double-freed, then a new ring was created in the same location, and then an error occurs when working with that new ring?

For now:

Apply trac_13447-sanitise_ring_refcount.patch trac_13447-attempted_improvement.patch

comment:81 Changed 7 years ago by SimonKing

By the way, has anyone preserved an old version of trac_13447-sanitise_ring_refcount.patch? It used to work well, although Nils had some objections.

Here is the result of the test suite:

        sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_ideal.py # Time out
        sage -t  -force_lib devel/sage/doc/de/tutorial/tour_advanced.rst # Time out
        sage -t  -force_lib devel/sage/doc/en/tutorial/tour_advanced.rst # Time out
        sage -t  -force_lib devel/sage/doc/fr/tutorial/tour_advanced.rst # Time out
        sage -t  -force_lib devel/sage/doc/ru/tutorial/tour_advanced.rst # Time out
        sage -t  -force_lib devel/sage/sage/libs/singular/function.pyx # 5 doctests failed
        sage -t  -force_lib devel/sage/sage/schemes/generic/algebraic_scheme.py # Time out

Looks like a lot of work. Fortunately, the errors in function.pyx seem to be a bit more "expressive" than the timeouts.

Last edited 7 years ago by SimonKing (previous) (diff)

comment:82 Changed 7 years ago by SimonKing

I found something fishy with the refcount during sage.libs.singular.function.call_function.

I changed the code as follows:

    # In the Singular interpreter, we must ensure that currRing->ref > 0.
    si_ring.ref += 1  
    cdef int orig_ref = si_ring.ref   # CHANGE: store the reference count
    try:
        with opt_ctx: # we are preserving the global options state here
            if signal_handler:
                sig_on()
                _res = self.call_handler.handle_call(argument_list, si_ring)
                sig_off()
            else:
                _res = self.call_handler.handle_call(argument_list, si_ring)

        if myynest:
            myynest = 0

        if currentVoice:
            currentVoice = NULL

        if errorreported:
            errorreported = 0
            raise RuntimeError("Error in Singular function call '%s':\n %s"%
                (self._name, "\n ".join(error_messages)))
        if si_ring.ref!=orig_ref:  # CHANGE: Test if Singular has changed the refcount
            ff = file("/Users/SimonKing/blubber","a")
            ff.write( "Here is something fishy!\n")
            ff.close()

        res = argument_list.to_python(_res)
        ...

When running the failing sage -t devel/sage/sage/rings/polynomial/multi_polynomial_ideal.py, the refcount changes exactly once before the test hangs.

It is definitely not supposed to happen that Singular changes the refcount of currRing. Changes in the ref counter should only happen in the next line, when the results are converted to python (any polynomial will increase the refcount by one).

I don't know yet what function is causing the trouble - but I hope I'll find out soon...

comment:83 Changed 7 years ago by SimonKing

I found: Applying the function ringlist decreases the ref counter of currRing, while gwalk increases the counter.

Since the number of references by Sage objects does not change, this is asking for trouble, I suppose.

comment:84 Changed 7 years ago by SimonKing

Anyway, I'd really appreciate it if someone could find a copy of the old patch version. I think the new one is worse, and I simply do not have the time to work on it now.

comment:85 follow-up: Changed 7 years ago by jdemeyer

I haven't looked at this ticket, but could you also check that #12188 is resolved, it looks related.

comment:86 in reply to: ↑ 85 Changed 7 years ago by SimonKing

Replying to jdemeyer:

I haven't looked at this ticket, but could you also check that #12188 is resolved, it looks related.

It certainly looks related. But it isn't resolved with the patches from here.

comment:87 Changed 7 years ago by jpflori

  • Cc jpflori added

comment:88 in reply to: ↑ 29 ; follow-up: Changed 7 years ago by jpflori

Replying to SimonKing:

Replying to nbruin:

We have:

    def __dealloc__(self):
        celement_destruct(&self.x, get_cparent((<Polynomial_template>self)._parent))

and for us:

get_cparent(parent) == <ntl_ZZ_pEContext_class>(parent._modulus)

The _parent attribute is a cython slot.

Interestingly, there is no complaint about a missing attribute _parent. It is _modulus that is missing.

However, it holds a reference to a python-managed object, so I think cython ensures it's properly taken into account in GC cycle counting. But that would suggest to me python could clear this slot to break cycles! So in that case, Polynomial_template is never safe. It could be I'm wrong, however.

I think you are right. The __dealloc__ of Polynomial_template is unsafe, unless polynomial rings will stay in memory forever. But I'd love to hear that we are wrong, because otherwise each polynomial would need a pointer to the c-data expected to be returned by get_cparent((<Polynomial_template>self)._parent), and we'd need to take into account reference counting for the c-parent during creation and deletion of polynomials.

Or perhaps there is a way out. We have a polynomial ring R and we have some elements a,b,c,... Each element points to R, and R points to some of its elements, namely to its generators. The problem is that deallocation of the elements is only possible as long as R is alive.

If we'd manually incref R upon creation of an element x, decrefing R when x gets deallocated, then we would ensure that R will survive until the last of its elements is deleted. Or would that mean that the elements will survive as well, because of the reference from R to its generators? Edit: Yes it would.

FYI, the cdefed pointer _parent stuff was added in #12313. Have a look at http://trac.sagemath.org/sage_trac/ticket/12313#comment:13

comment:89 in reply to: ↑ 88 ; follow-up: Changed 7 years ago by SimonKing

Replying to jpflori:

FYI, the cdefed pointer _parent stuff was added in #12313. Have a look at http://trac.sagemath.org/sage_trac/ticket/12313#comment:13

Oh boy. So, do we have another circular dependency, meaning that this ticket depends on #12313? Perhaps we should apply Python's cyclic garbage collector on my trac tickets :(

Anyway, I am currently unable to do any programming, since I very urgently have a manuscript to write.

Last edited 7 years ago by SimonKing (previous) (diff)

comment:90 in reply to: ↑ 89 Changed 7 years ago by jpflori

Replying to SimonKing:

Replying to jpflori:

FYI, the cdefed pointer _parent stuff was added in #12313. Have a look at http://trac.sagemath.org/sage_trac/ticket/12313#comment:13

Oh boy. So, do we have another circular dependency, meaning that this ticket depends on #12313? Perhaps we should apply Python's cyclic garbage collector on my trac tickets :(

Does #12313 depend on this ticket? I see Jeroen recently put it as a dependency there but with no explanation. I've now read more or less everything here, but it still looks mysterious to me why this was done. Any explanation welcomed :)

comment:91 Changed 7 years ago by SimonKing

Trying an explanation:

#715 plus #11521 in their original versions resulted in trouble. A solution (that is not acceptable for my own work, but might be OK for "usual" applications of Sage) is to re-instate the strong cache of polynomial rings - see the last patch on #715.

If I am not mistaken, #715 plus #11521 in these versions have a positive review, without #13447.

Now, why did I put #13447 as an explicit reference at #12313? I guess that is because (1) I thought that #13447 is close to being fixed (I was mistaken...) and (2) the patchbot got confused by the circular dependency #715 <-> #11521.

But who knows. I am currently not really in the condition to do coding.

comment:92 Changed 7 years ago by jdemeyer

  • Description modified (diff)

comment:93 follow-up: Changed 7 years ago by nthiery

Hi!

What's the status of this patch? It's at the bottom of my dependency list for category patches :-)

Cheers,

Nicolas

comment:94 in reply to: ↑ 93 ; follow-up: Changed 7 years ago by SimonKing

Replying to nthiery:

What's the status of this patch? It's at the bottom of my dependency list for category patches :-)

IIRC, it was found that part of the problem was an upstream bug, which has since been fixed. And we meanwhile have a weak cache for polynomial rings. Hence, the original problem seems to be solved.

However, we may see whether it makes sense to use the existing reference counter from Singular, rather than relying on our custom reference counter. It might be conceptually better, though not necessarily more stable.

What do the other participants of this ticket think?

comment:95 follow-up: Changed 7 years ago by jpflori

If the leak is now fixed, we could or even should just add some doctest proving it here, and open another ticket for using the singular ref system rather than ours.

comment:96 in reply to: ↑ 94 ; follow-up: Changed 7 years ago by nbruin

Replying to SimonKing:

However, we may see whether it makes sense to use the existing reference counter from Singular, rather than relying on our custom reference counter. It might be conceptually better, though not necessarily more stable.

I think it's a little more than just conceptual. Some rings may be created by direct library calls from sage. Other rings may be generated internally in libsingular. These rings can get mixed up, meaning that a sage-created ring might end up being referenced only by some internal libsingular data structure, and rings created internally by libsingular might end up being referenced by sage.

For reliable, non-leaking memory management you'll have to use the same refcounter in such a situation.

In libecl a different approach is taken: Every lisp object referenced from sage is referenced via a python proxy object that creates a binding to the referenced object in lisp, to prevent garbage collection. The deallocation routine for the proxy includes removal of the lisp binding. We could do that too, but last time it seemed we were relatively close to deciphering libsingular's conventions for handling its refcount.
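
As a purely illustrative sketch of that proxy pattern (plain Python, not the actual sage.libs.ecl code; LispProxy and LISP_BINDINGS are made-up names standing in for the proxy class and for the binding table kept on the Lisp side):

# Illustration only: a proxy whose lifetime controls a binding that keeps
# the wrapped object alive on the other side of the interface.
LISP_BINDINGS = {}

class LispProxy(object):
    def __init__(self, handle):
        # creating the proxy installs a binding, so the wrapped object
        # cannot be collected on the Lisp side
        self._handle = handle
        LISP_BINDINGS[self._handle] = True

    def __del__(self):
        # deallocating the proxy removes the binding again
        LISP_BINDINGS.pop(self._handle, None)

p = LispProxy("some-lisp-object")
del p                      # drops the only Python reference
assert "some-lisp-object" not in LISP_BINDINGS

The nice property is that ownership is decided entirely on the Python side: whoever holds the proxy holds the binding, and no second refcounting convention is involved.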

comment:97 in reply to: ↑ 96 ; follow-up: Changed 7 years ago by SimonKing

Replying to nbruin:

Replying to SimonKing:

However, we may see whether it makes sense to use the existing reference counter from Singular, rather than relying on our custom reference counter. It might be conceptually better, though not necessarily more stable.

I think it's a little more than just conceptual. Some rings may be created by direct library calls from sage. Other rings may be generated internally in libsingular. These rings can get mixed up,

I don't think so.

According to Hans Schönemann, the internal refcount is only relevant in the Singular user interface. One could argue that libsingular replaces Singular's user interface and should thus use the same refcounting as the user interface. However, internal calls in Singular will not change the refcount.

For reliable, non-leaking memory management you'll have to use the same refcounter in such a situation.

See above: The refcounter is not used internally in Singular (only for the user interface), I think.

comment:98 in reply to: ↑ 97 ; follow-up: Changed 7 years ago by SimonKing

Replying to SimonKing:

One could argue that libsingular replaces Singular's user interface...

To be precise: Our use of libsingular for constructing polynomial rings in Sage could be considered as something like the Singular user interface.

comment:99 in reply to: ↑ 98 Changed 7 years ago by nbruin

Replying to SimonKing:

To be precise: Our use of libsingular for constructing polynomial rings in Sage could be considered as something like the Singular user interface.

I haven't looked at the code in a long time, but I know that at the time, I noticed that we DO call into the interpreter, at the very least in the doctests that test something about "functions" in singular.

I also recall that in some of the noncommutative stuff, singular goes off and creates a bunch of polynomial rings, in ways that seemed to me equivalent to what happens in the singular interpreter.

Unsurprisingly, those were the areas where we got problems when we started messing with the code.

comment:100 in reply to: ↑ 95 Changed 7 years ago by nbruin

Replying to jpflori:

If the leak is now fixed, we could or even should just add some doctest proving it here, and open another ticket for using the singular ref system rather than ours.

It's not fixed. See sage/rings/polynomial/multi_polynomial_libsingular.pyx:369: libsingular multivariate polynomial rings are still nailed in memory.
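
As a quick diagnostic (not a doctest from the ticket), one can watch a weak reference to such a ring in a Sage session:

sage: import gc, weakref
sage: P = PolynomialRing(QQ, 'a,b,c,d')
sage: wr = weakref.ref(P)
sage: del P
sage: _ = gc.collect()
sage: wr() is None   # stays False as long as the ring is nailed in memory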

comment:101 Changed 7 years ago by jpflori

  • Cc burcin added

comment:102 Changed 7 years ago by nthiery

So what shall we do?

#12876 has been basically sitting there for one year with a positive review, just waiting for dependencies. Only this one remains. And I now have to rebase it once more because #13184 jumped in the way, reimplementing a little piece of #12876.

Would it be reasonable, or not, to consider that #12876 does not depend on this guy?

Cheers,

Nicolas

comment:103 follow-up: Changed 7 years ago by jpflori

From what I read, #13447 was added because the ticket depended on #11521, which was thought to depend on #13447. But in the end that was not the case (#11521 is merged now), so just remove it and get your ticket merged!

comment:104 in reply to: ↑ 103 Changed 7 years ago by nthiery

Replying to jpflori:

From what I read, #13447 was added because the ticket depended on #11521, which was thought to depend on #13447. But in the end that was not the case (#11521 is merged now), so just remove it and get your ticket merged!

Hey, that's excellent news! Thanks for the quick overview. I'll post a rebased and retested patch for #12876 shortly.

comment:105 Changed 6 years ago by jdemeyer

  • Milestone changed from sage-5.11 to sage-5.12

comment:106 Changed 6 years ago by vbraun_spam

  • Milestone changed from sage-6.1 to sage-6.2

comment:107 Changed 6 years ago by vbraun_spam

  • Milestone changed from sage-6.2 to sage-6.3

comment:108 Changed 5 years ago by vbraun_spam

  • Milestone changed from sage-6.3 to sage-6.4

comment:109 Changed 5 years ago by nbruin

I just stumbled into to this in sage/libs/singular/function.pyx:949 in to_python:

        elif rtyp == POLY_CMD:
            #FIXME
            res_poly = MPolynomial_libsingular(self._sage_ring)
            res_poly._poly = <poly*>to_convert.data
            to_convert.data = NULL
            #prevent it getting free, when cleaning the leftv
            return res_poly

This looks like a simple transfer of ownership of a reference here. But if some of the comments on this ticket quoting Schoenemann are correct, and Singular indeed has a mix of refcounted and non-refcounted uses (depending on whether objects are actively interfaced in the interpreter or not), then this step might need attention: the leftv arriving here almost certainly comes from the interpreter (via SingularFunction), while the res_poly object possibly shouldn't be considered as such.

Any new attempt at resolving this ticket should probably look into this bit of code too.

comment:110 follow-up: Changed 4 years ago by SimonKing

Should we have a look how much of the issues tracked here are now fixed by #18905?

I still think we should switch to using Singular's c-slot for refcounting. What we currently do: create a Python object from a libsingular ring* and use it as a key in a Python dictionary in which the number of references is stored. Wouldn't it be a lot faster to store the number of references in a c-slot?

If we use it, then I still think it makes sense to adopt Singular's convention on that c-slot: It counts the number of references minus one.

My suggestion: For testing, I create a branch based on #18905 that uses both ways of refcounting in parallel, asserting consistency. If that passes doctests, then we can safely remove the old slow refcounting system.

comment:111 in reply to: ↑ 110 Changed 4 years ago by SimonKing

Replying to SimonKing:

If we use it, then I still think it makes sense to adopt Singular's convention on that c-slot: It counts the number of references minus one.

OTOH, it really seems awkward: A fresh ring that we get out of libsingular will have .ref==0. Thus, it is supposed to have a reference already. But there is no reference yet.

comment:112 Changed 4 years ago by SimonKing

Something else: Currently, each polynomial constitutes a reference to the libsingular ring*. But I think it is not the job of polynomials to care about managing ring references. Only Sage's polynomial rings should care.

comment:113 follow-up: Changed 4 years ago by vbraun

If you have a better way of keeping track of the libsingular rings, then that would be great.

Afaik the reason for keeping a reference to the ring in MPolynomial_libsingular is efficiency: basically every libsingular call also needs the ring as an argument, so you want to avoid the extra indirection through MPolynomialRing_libsingular.

comment:114 in reply to: ↑ 113 ; follow-up: Changed 4 years ago by SimonKing

Replying to vbraun:

If you have a better way of keeping track of the libsingular rings, then that would be great.

Afaik the reason for keeping a reference to the ring in MPolynomial_libsingular is efficiency: basically every libsingular call also needs the ring as an argument, so you want to avoid the extra indirection through MPolynomialRing_libsingular.

Yes, that shortcut still is reasonable and shouldn't be changed. However, there should be no need to take the time and increase the reference count.

comment:115 in reply to: ↑ 114 Changed 4 years ago by SimonKing

Replying to SimonKing:

Yes, that shortcut still is reasonable and shouldn't be changed. However, there should be no need to take the time and increase the reference count.

... because we also have the polynomial ring that keeps a (counted) reference to the libsingular ring.

comment:116 Changed 4 years ago by SimonKing

If I understood correctly, singular functions are supposed to leave the .ref field alone unless a ring in the interpreter is involved. However, I found that using the singular_function "ringlist" does decrease the reference count of the current ring. That's annoying.

Could it in fact be an artefact of how singular_function is implemented? If I recall correctly, ringlist(R) is a bit tricky: it should return a list defined in the *current* ring, but with data defining R. So perhaps we are somehow messing up currRing versus the ring that is given as an argument.
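
For concreteness, the kind of call in question looks roughly like this (standard singular_function usage; the exact effect on the ring's refcount is precisely what is being discussed here):

sage: from sage.libs.singular.function import singular_function
sage: ringlist = singular_function('ringlist')
sage: R.<x,y,z> = PolynomialRing(GF(7))
sage: L = ringlist(R)   # a list describing R, built while R is the current ring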

comment:117 Changed 4 years ago by SimonKing

And then of course the question is under what circumstances it occurs. In the doctests of sage.libs.singular, ringlist was the only function exhibiting such behaviour. Is it because its argument is a ring? Are there other kernel functions that may change the reference count? Are there library functions that can change the reference count? I think that's highly likely, since there are functions returning a ring that was created inside of the function.

comment:118 Changed 11 months ago by SimonKing

I am trying to revive this ticket.

In a branch that I am currently testing, I change the .ref field of Singular's ring struct whenever Sage's ring refcount changes, and assert that both refcounts are always consistent.

However, in the following example

sage: R.<x,y,z> = GF(3)[]
sage: I = R*R.gens()
sage: I.groebner_basis()

I find that Sage counts 8 references for some ring, but Singular counts 9 references for the same ring. Perhaps Singular is increasing the refcount internally and isn't resetting it, but I need to investigate it.

comment:119 Changed 11 months ago by SimonKing

I found two spots in which the refcounting is currently not done consistently:

  • SingularFunction.__init__: In some cases it artificially creates a ring and sets .ref = 1, but ignores it in Sage's own refcounting.
  • call_function: When it changes currRingHdl from the previously used ring to the current ring, it modifies .ref but ignores Sage's own refcounting.

So, that's where Singular's and Sage's ring refcounts get out of sync.

Last edited 11 months ago by SimonKing (previous) (diff)

comment:120 follow-up: Changed 11 months ago by SimonKing

It turns out that my_awesome_sage_ring created in SingularFunction.__init__ gives a core dump when being collected. So, there is something fundamentally wrong here.

comment:121 in reply to: ↑ 120 ; follow-up: Changed 11 months ago by SimonKing

Replying to SimonKing:

It turns out that my_awesome_sage_ring created in SingularFunction.__init__ gives a core dump when being collected. So, there is something fundamentally wrong here.

If I understand correctly, my_awesome_sage_ring is just some ring that is there to make Singular happy, although it is defunct and cannot be deleted. So, in the current code, its ref count in Singular is increased in order to prevent it from being deleted.

Three potential solutions: Either create a fully fledged ring that can be collected. Or reference it twice, so that it will stay in memory till the Sage session ends (that's what is currently happening). Or find a cleaner solution.

comment:122 in reply to: ↑ 121 Changed 11 months ago by SimonKing

Replying to SimonKing:

Or find a cleaner solution.

Instead of creating a defunct ring, one could use currRing. That is possible because even directly after starting sage currRing is available.

comment:123 Changed 11 months ago by SimonKing

By now I believe that the approach of using libsingular's ring->ref counter to count ring references will not work. As it turns out, some Singular functions do change the value stored in ->ref and some don't. Therefore, Sage shouldn't mess with it.

OTOH, the old implementation of refcounting is quite inefficient: given a ring*, create a RingWrap, use it as a dictionary key in a defaultdict(int) that counts the references --- and the creation (and subsequent deletion) of a RingWrap occurs each time a ring* gets referenced or dereferenced!

If r is a ring* for which we want to count references, it would be an improvement to simply use <long>r as key of the above-mentioned defaultdict(int). After all, comparison of the RingWrap of r would also just rely on <long>r. So, that approach would be better than the current implementation, because it avoids the creation and deletion of a python wrapper.

But somehow I would find it better to equip MPolynomialRing_libsingular and all other libsingular wrappers with a new slot cdef int *_ring_ref. When a new ring is created, the corresponding _ring_ref would get allocated with a single int. If two different Sage objects wrap the same ring* then they would also share the same _ring_ref pointer. And the pointer would be passed as an argument to singular_ring_ref and singular_ring_delete, where the int being pointed at would be incremented or decremented, thus counting the number of references.

So, instead of a central defaultdict(int) storing all reference counts, there would for each ring be a pointer to an int. I implemented it. For debugging, I currently have BOTH the old and the new refcount in parallel and test whether they are consistent. It turns out that they are!

Nonetheless I get segfaults and errors in the doctests, but actually not very many:

sage -t src/sage/rings/polynomial/plural.pyx  # Killed due to segmentation fault
sage -t src/sage/rings/polynomial/multi_polynomial_ideal.py  # Killed due to segmentation fault
sage -t src/sage/libs/singular/groebner_strategy.pyx  # Killed due to segmentation fault
sage -t src/sage/libs/singular/function.pyx  # 4 doctests failed
sage -t src/sage/tests/cmdline.py  # 1 doctest failed
sage -t src/sage/schemes/curves/curve.py  # 1 doctest failed
sage -t src/sage/schemes/curves/affine_curve.py  # 1 doctest failed

So, I am confident that the new approach will soon work, and I am also confident that it will be faster than the old approach. But how could it be measured? Ring refcounting also happens in MPolynomial_libsingular, not only in MPolynomialRing_libsingular. Hence, a benchmark involving the creation and deletion of many polynomials would probably reveal a speedup.

comment:124 Changed 11 months ago by SimonKing

  • Branch set to u/SimonKing/make_libsingular_multivariate_polynomial_rings_collectable

comment:125 Changed 11 months ago by SimonKing

  • Commit set to d6a3d48d2a3705e52de6b6ec6b4cacd7bbd1ffa0

The commit that I just posted implements the new approach to refcounting:

  • Each singular object that relies on a libsingular ring should have a pointer to that ring (ring *_ring), and in addition an int-pointer for the refcount of that ring (int *_ring_ref)
  • If self is created using a new ring r, then we would have
        self._ring_ref = <int*>sig_calloc(1, sizeof(int))
        self._ring = singular_ring_reference(r, self._ring_ref)
    
  • If self is created using a ring that is referenced by other, then we would have
        self._ring_ref = other._ring_ref
        self._ring = singular_ring_reference(other._ring, self._ring_ref)
    
  • For deallocation, one does
        singular_ring_delete(self._ring, self._ring_ref)
    
    Note that it is not needed to do sig_free(self._ring_ref), because this is done inside singular_ring_delete(self._ring, self._ring_ref) once no references to self._ring remain. (A sketch of how these two helpers could look follows after this list.)
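
To make the list above concrete, here is a minimal sketch of how the two helpers could look with such an int* counter. This is an illustration only, not the code from the branch (which certainly has to do more, e.g. never deleting currRing); it assumes the usual cimports from sage.libs.singular.decl and cysignals.

# illustration only -- not the actual commit
from cysignals.memory cimport sig_free
from sage.libs.singular.decl cimport ring, rDelete

cdef ring *singular_ring_reference(ring *existing_ring, int *refcount) except NULL:
    # record one more owner of existing_ring
    refcount[0] += 1
    return existing_ring

cdef void singular_ring_delete(ring *doomed, int *refcount):
    # drop one owner; free the ring and its counter once nobody is left
    refcount[0] -= 1
    if refcount[0] <= 0:
        rDelete(doomed)
        sig_free(refcount)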

In the commit, I am not removing the old refcounting approach. This is for testing purposes: I assert that the old and the new refcount coincide, to demonstrate that the new approach works.

Next step will be to remove the old refcount, and to find a way to replace the doctests for the old refcount.

Question: Would it make sense to add a function that would automatically test that all rings created after "from sage import all" are properly deleted when Sage is shut down?


New commits:

d6a3d48  trac13447: new refcount implementation for libsingular rings

comment:126 Changed 11 months ago by SimonKing

  • Authors changed from Nils Bruin, Simon King to Simon King
  • Report Upstream changed from None of the above - read trac for reasoning. to N/A
  • Reviewers Simon King deleted
  • Work issues Understand why sometimes `new_RingWrap` needs an incref and sometimes not deleted

comment:127 Changed 11 months ago by SimonKing

PS: So far the ultimate goal to ensure that more rings are collected is not addressed.

Without the branch:

sage: from sage.libs.singular.groebner_strategy import NCGroebnerStrategy ## line 333 ##
....: A.<x,y,z> = FreeAlgebra(QQ, 3) ## line 334 ##
....: H.<x,y,z> = A.g_algebra({y*x:x*y-z, z*x:x*z+2*x, z*y:y*z-2*y}) ## line 335 ##
....: I = H.ideal([y^2, x^2, z^2-H.one()]) ## line 336 ##
....: strat = NCGroebnerStrategy(I) #random ## line 337 ##
....: del strat, I, x,y,z,H, A
....: import gc
....: gc.collect()
....: 
155
sage: sum(sage.libs.singular.ring.ring_refcount_dict.values())
17

With the branch:

sage: from sage.libs.singular.groebner_strategy import NCGroebnerStrategy ## line 333 ##
....: A.<x,y,z> = FreeAlgebra(QQ, 3) ## line 334 ##
....: H.<x,y,z> = A.g_algebra({y*x:x*y-z, z*x:x*z+2*x, z*y:y*z-2*y}) ## line 335 ##
....: I = H.ideal([y^2, x^2, z^2-H.one()]) ## line 336 ##
....: strat = NCGroebnerStrategy(I) #random ## line 337 ##
....: del strat, I, x,y,z,H, A
....: import gc
....: gc.collect()
....: 
175
sage: sum(sage.libs.singular.ring.ring_refcount_dict.values())
17

But that's not a surprise since the first commit changes the mechanism of the refcount, but not the number of references.

comment:128 Changed 11 months ago by SimonKing

I inserted print statements whenever certain rings/algebras and elements are created or deallocated, and found that in the following code

sage: from sage.libs.singular.groebner_strategy import NCGroebnerStrategy
....: A.<x,y,z> = FreeAlgebra(QQ, 3)
....: H.<x,y,z> = A.g_algebra({y*x:x*y-z, z*x:x*z+2*x, z*y:y*z-2*y})
....: I = H.ideal([y^2, x^2, z^2-H.one()])
....: strat = NCGroebnerStrategy(I)
....: del strat, I, x,y,z,H, A
....: import gc
....: gc.collect()

the G-Algebra H is garbage collected and its underlying libsingular ring is properly deleted. However, the free algebra is not garbage collected, and the polynomial ring involved in the creation of the free algebra isn't garbage collected either.

So, there is a memory leak, but I suppose it is because of improper strong references in Sage's coercion system, not because of references to libsingular rings.

comment:129 Changed 11 months ago by SimonKing

Here is an assessment for the time needed to increment or decrement the reference count in the various approaches. I am using

from cpython.object cimport Py_EQ, Py_NE
from cysignals.memory cimport sig_calloc, sig_free
from collections import defaultdict

refcount_dict = defaultdict(int)

cdef int total_refcount = 0

cdef class Wrap(object):
    cdef void *p
    def __hash__(self):
        return <long>(self.p)
    def __richcmp__(Wrap self, other, int op):
        if not (op == Py_EQ or op == Py_NE):
            return NotImplemented
        if type(other) is not Wrap:
            return op != Py_EQ
        return (self.p == (<Wrap>other).p) == (op == Py_EQ)

cdef inline wrap(void *p):
    cdef Wrap W = Wrap.__new__(Wrap)
    W.p = p
    return W

cdef inline void *incref1(void *p):
    refcount_dict[wrap(p)] += 1
    return p

cdef inline void decref1(void *p):
    cdef Wrap W = wrap(p)
    cdef int c = refcount_dict[W] - 1
    if c == 0:
        del refcount_dict[W]
    else:
        refcount_dict[W] = c

cdef inline void *incref2(void *p):
    refcount_dict[<long>p] += 1
    return p

cdef inline void decref2(void *p):
    cdef int c = refcount_dict[<long>p] -1
    if c == 0:
        del refcount_dict[<long>p]
    else:
        refcount_dict[<long>p] = c

cdef inline void *incref3(void *p, int *count):
    count[0] += 1
    return p

cdef inline void decref3(void *p, int *count):
    count[0] -= 1

cdef inline void *incref4(void *p, int *count):
    global total_refcount
    total_refcount += 1
    count[0] += 1
    return p
cdef inline void decref4(void *p, int *count):
    global total_refcount
    total_refcount -= 1
    count[0] -= 1

cpdef test1(long x):
    cdef int i
    cdef void *X = <void *>x
    for i in range(1000):
        X = incref1(X)
        decref1(X)

cpdef test2(long x):
    cdef int i
    cdef void *X = <void *>x
    for i in range(1000):
        X = incref2(X)
        decref2(X)

cpdef test3(long x):
    cdef int i
    cdef int *Count = <int*>sig_calloc(1, sizeof(int))
    cdef void *X = <void *>x
    for i in range(1000):
        X = incref3(X, Count)
        decref3(X, Count)
    sig_free(Count)

cpdef test4(long x):
    cdef int i
    cdef int *Count = <int*>sig_calloc(1, sizeof(int))
    cdef void *X = <void *>x
    for i in range(1000):
        X = incref4(X, Count)
        decref4(X, Count)
    sig_free(Count)

and with that I get

sage: %timeit test1(123456)
1000 loops, best of 3: 362 µs per loop
sage: %timeit test2(123456)
1000 loops, best of 3: 366 µs per loop
sage: %timeit test3(123456)
The slowest run took 111.51 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 126 ns per loop
sage: %timeit test4(123456)
The slowest run took 46.28 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 129 ns per loop

Conclusion

  • test1 is how reference count in sage.libs.singular.ring currently works.
  • test2 would be a mild modification of test1, and would apparently not be an improvement.
  • test3 is the simplest form of my suggestion.
  • test4 is a mild modification of my suggestion, making it possible to preserve the doctests that currently involve sage.libs.singular.ring.ring_refcount_dict

I suppose we can afford a 3 ns loss per 1000 increments and decrements, and thus I will implement the fourth version of incref/decref. Then we can see whether it has an effect on the speed of polynomial arithmetic.

Last edited 11 months ago by SimonKing (previous) (diff)

comment:130 Changed 11 months ago by git

  • Commit changed from d6a3d48d2a3705e52de6b6ec6b4cacd7bbd1ffa0 to 281fcf9085fe16b8e2e7b0e37a214f44f0e1ebd7

Branch pushed to git repo; I updated commit sha1. New commits:

9a671d7  13477: Switch from old to new reference count
281fcf9  trac13447: Remove the old ring refcounting

comment:131 Changed 11 months ago by SimonKing

Good news!

  1. With plain Sage, one gets
    sage: R.<x,y,z> = QQ[]
    sage: %timeit p = 2*x^3+x*y*z-z^2*x^2
    The slowest run took 83.44 times longer than the fastest. This could mean that an intermediate result is being cached.
    100000 loops, best of 3: 10 µs per loop
    
    whereas with the two commits of the current branch one gets
    sage: R.<x,y,z> = QQ[]
    sage: %timeit p = 2*x^3+x*y*z-z^2*x^2
    The slowest run took 73.46 times longer than the fastest. This could mean that an intermediate result is being cached.
    100000 loops, best of 3: 5.37 µs per loop
    
  2. Doctests with the current branch pass.

So, it would already make sense to use this. OTOH, the purpose of this ticket is not to speed up polynomial arithmetic but to make multivariate polynomial rings collectable. So, I will next try to remove the strong cache for multivariate polynomial rings and see what happens.

Since I have added ring refcounting not only for the ring itself but also for its elements, it could be that the new code is stable enough to survive the deletion of polynomial rings without segfaulting. Keep your fingers crossed!

comment:132 follow-up: Changed 11 months ago by tscrim

Perhaps it would be better to handle the libsingular and the other multivariate polynomial rings on separate tickets, to get more granular changes, as they are largely (completely?) independent. This already seems like a big ticket, and the other change would likely be a big changeset as well.

From a quick look, your proposal seems good, but I would have to read it in more detail to confirm. It might also be good for someone who is more experienced with memory management to look at this, but I am willing to set this to a positive review once I do a more thorough code review.

comment:133 Changed 11 months ago by git

  • Commit changed from 281fcf9085fe16b8e2e7b0e37a214f44f0e1ebd7 to 70c08b09315f9e897fb2dbc6ba748ed6cc681b06

Branch pushed to git repo; I updated commit sha1. New commits:

70c08b0  trac13447: Remove the strong cache for multivariate polynomials

comment:134 Changed 11 months ago by SimonKing

  • Description modified (diff)
  • Status changed from needs_work to needs_review

Even better news: With the last commit, multivariate polynomial rings can to some extent be garbage collected:

        sage: from sage.libs.singular.ring import total_ring_reference_count
        sage: n = total_ring_reference_count()
        sage: P.<x,y,z> = GF(19)[]
        sage: del P,x,y,z
        sage: import gc
        sage: _ = gc.collect()
        sage: n == total_ring_reference_count()
        True

Hence, the purpose of this ticket is fulfilled. I consider it therefore as "needs review".

Caveat: Arithmetic operations are still preventing a garbage collection. But that seems to be a totally different topic.

sage: P.<x,y,z> = GF(19)[]
sage: p = 2*x^3+x*y*z-z^2*x^2
sage: del P,x,y,z,p
sage: gc.collect()
21
sage: n == total_ring_reference_count()
False

comment:135 in reply to: ↑ 132 Changed 11 months ago by SimonKing

Replying to tscrim:

Perhaps it would be better to handle the libsingular and the other multivariate polynomial rings on separate tickets, to get more granular changes, as they are largely (completely?) independent. This already seems like a big ticket, and the other change would likely be a big changeset as well.

Actually it isn't that big. The change of refcount does change a couple of .pxd files, but I consider the changes small. And the removal of the strong cache (commit:70c08b0) really is small.

Therefore, splitting the ticket is not needed. But careful testing is: in old attempts to fix this ticket, we saw random segfaults in doctests. I don't see them on my machine, but it should nonetheless be tested with care on various machines. I am curious what the patchbots have to say...

However, because polynomial rings still cannot be collected as soon as an algebraic operation has been performed on their elements, I suppose there still is a leak coming from strong references in the coercion system. I believe that THIS really should be on a different ticket. Actually it would make sense to test it together with the ticket that introduces "optional weak references" (or whatever it was called) to store actions. But what was the ticket number?

Anyway, I think the ticket is now ready for review -- changes are relatively small, and tests pass, at least for me...

EDIT: Sorry for not being able to count. I originally thought it was three commits, but there are four.

Last edited 11 months ago by SimonKing (previous) (diff)

comment:136 Changed 11 months ago by SimonKing

Looking at the diff, I see that some changes that I meant to revert are still in (they are leftovers from my failed attempt to use ring->ref for refcounting).

So, I'll remove these leftovers and see if things still work.

comment:137 Changed 11 months ago by git

  • Commit changed from 70c08b09315f9e897fb2dbc6ba748ed6cc681b06 to ed736bd8651888944e4bd56149b696c54607e290

Branch pushed to git repo; I updated commit sha1. New commits:

ed736bd  Remove some leftovers of failed attempts to use ring->ref for refcount

comment:138 Changed 11 months ago by SimonKing

The latest commit makes the diff with respect to the develop branch a bit smaller and easier to read. Need to test it, though.

comment:139 Changed 11 months ago by git

  • Commit changed from ed736bd8651888944e4bd56149b696c54607e290 to 17ff3afc62baf1f149e3338dcf4897df8d7ad1f1

Branch pushed to git repo; I updated commit sha1. New commits:

17ff3af  Remove some function that has temporarily been used for debugging

comment:140 Changed 11 months ago by SimonKing

... and another commit that makes the diff smaller: I forgot to delete a function that I temporarily used for debugging. It wasn't documented and used old-style print, which made the patchbot plugins complain. Now that aspect should be fine. Hope that the tests work, too...

comment:141 Changed 11 months ago by SimonKing

The pyflakes plugin complains, but this is in fact in a line that I didn't change. Indeed, some variable n is redefined, but it isn't used in the rest of the function.

However, another bot complains rightfully: The documentation doesn't cleanly build, because of some .. todo::. Need to fix it.

comment:142 Changed 11 months ago by git

  • Commit changed from 17ff3afc62baf1f149e3338dcf4897df8d7ad1f1 to 6e2487dbf8df2c82f2ca1282625f9f1b29a02529

Branch pushed to git repo; I updated commit sha1. New commits:

6e2487d  Fix formatting of a TODO

comment:143 Changed 11 months ago by SimonKing

With the latest commit, the documentation builds. But I am not very happy about it: the .. TODO:: block is part of the TESTS: block. However, the TESTS block does not appear in the reference manual, whereas the TODO does.

As a consequence, the TODO refers to material that isn't visible.

Can the reviewer perhaps have a look and decide how to deal with the formatting of the TODO?

comment:144 Changed 11 months ago by git

  • Commit changed from 6e2487dbf8df2c82f2ca1282625f9f1b29a02529 to 019ca385039f8628a5ed5f16de341efeaf02aa7a

Branch pushed to git repo; I updated commit sha1. New commits:

019ca38  13447: Do not use a TODO block in a TESTS block

comment:145 Changed 11 months ago by SimonKing

In the new commit, I am removing the TODO block. Recall: the TODO block is visible in the documentation, but it refers to material written in a TESTS block, which is invisible in the documentation.

Of course it is still "needs review". Next, I plan to create a wrapper for Singular's leftv*, so that it becomes possible to interact with libsingular objects on a very basic (thus, fast) level - of course on a different ticket.

Also it would be nice to understand why polynomial rings can only be collected when no arithmetic operation took place on their elements. But this shall be on yet another ticket.

comment:146 follow-up: Changed 11 months ago by tscrim

Two nitpicks first while I go through things in more detail. Can you change the #~ comments to the usual # with the correct indentation? Also, this change:

-    polynomial rings. Thus, we obtain
-    ::
+    polynomial rings. Thus, we obtain::

comment:147 Changed 11 months ago by git

  • Commit changed from 019ca385039f8628a5ed5f16de341efeaf02aa7a to c978ef1d0d5f1b3695eef827c2f6c1983bd3059a

Branch pushed to git repo; I updated commit sha1. New commits:

c978ef1  Remove some commented out code

comment:148 in reply to: ↑ 146 Changed 11 months ago by SimonKing

Replying to tscrim:

Two nitpicks first while I go through things in more detail. Can you change the #~ comments to the usual # with the correct indentation? Also, this change:

-    polynomial rings. Thus, we obtain
-    ::
+    polynomial rings. Thus, we obtain::

Done. #~ is what geany (the development environment I'm using) inserts when one comments out some lines.

comment:149 follow-up: Changed 11 months ago by tscrim

Thanks, one more little nitpick:

-        cached. Therefore we have
-        ::
+        cached. Therefore we have::

However, I do have a broader question: why do the polynomials hold a reference to the singular ring? If the parent just holds the reference, then it will survive as long as there are polynomials referencing that parent. So it seems like unnecessary overhead, both in maintenance and in extra function calls.

comment:150 in reply to: ↑ 149 ; follow-up: Changed 11 months ago by SimonKing

Replying to tscrim:

However, I do have a broader question: why do the polynomials hold a reference to the singular ring? If the parent just holds the reference, then it will survive as long as there are polynomials referencing that parent.

That's what I originally thought. However, it seems that this is why previous attempts to fix this ticket failed, with segmentation faults in doctests, typically after the actual tests.

If a reference cycle is deleted during cyclic garbage collection, the objects are deallocated in arbitrary order. Hence, a polynomial could be deleted after its parent, and thus it needs a pointer to a valid libsingular ring.
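
A small plain-Python illustration of that point (nothing Sage-specific; it relies on Python 3 collecting cycles whose members have finalizers, PEP 442): when a cycle is collected, the order in which its members are torn down is unspecified, so an element may outlive its parent for a moment.

import gc

class Ring(object):
    def __del__(self):
        print("ring deallocated")

class Element(object):
    def __del__(self):
        print("element deallocated")

R = Ring()
e = Element()
R.element = e      # build a reference cycle ring <-> element
e.parent = R
del R, e
gc.collect()       # both finalizers run, but in no guaranteed order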

comment:151 Changed 11 months ago by git

  • Commit changed from c978ef1d0d5f1b3695eef827c2f6c1983bd3059a to 7feae99bbda7b0eb54c9397c866119f9ccfc320e

Branch pushed to git repo; I updated commit sha1. New commits:

7feae99  Add a colon to the docs.

comment:152 Changed 11 months ago by SimonKing

Example: This change

  • src/sage/rings/polynomial/multi_polynomial_libsingular.pyx

    diff --git a/src/sage/rings/polynomial/multi_polynomial_libsingular.pyx b/src/sage/rings/polynomial/multi_polynomial_libsingular.pyx
    index 1acd983..e69b9bd 100644
    @@ ... @@ cdef class MPolynomial_libsingular(MPolynomial):
             """
             self._poly = NULL
             self._parent = parent
    -        self._parent_ring_ref = (<MPolynomialRing_libsingular>parent)._ring_ref
    -        self._parent_ring = singular_ring_reference(parent._ring, self._parent_ring_ref)
    +#~         self._parent_ring_ref = (<MPolynomialRing_libsingular>parent)._ring_ref
    +#~         self._parent_ring = singular_ring_reference(parent._ring, self._parent_ring_ref)
    +        self._parent_ring = parent._ring

         def __dealloc__(self):
             # WARNING: the Cython class self._parent is now no longer accessible!
             if self._poly==NULL:
                 # e.g. MPolynomialRing_libsingular._zero_element
    -            singular_ring_delete(self._parent_ring, self._parent_ring_ref)
    +#~             singular_ring_delete(self._parent_ring, self._parent_ring_ref)
                 return
             assert self._parent_ring != NULL # the constructor has no way to raise an exception
             p_Delete(&self._poly, self._parent_ring)
    -        singular_ring_delete(self._parent_ring, self._parent_ring_ref)
    +#~         singular_ring_delete(self._parent_ring, self._parent_ring_ref)

         def __copy__(self):
             """
    @@ ... @@ cdef inline MPolynomial_libsingular new_MP(MPolynomialRing_libsingular parent, p
         """
         cdef MPolynomial_libsingular p = MPolynomial_libsingular.__new__(MPolynomial_libsingular)
         p._parent = parent
    -    p._parent_ring_ref = parent._ring_ref
    -    p._parent_ring = singular_ring_reference(parent._ring, p._parent_ring_ref)
    +#~     p._parent_ring_ref = parent._ring_ref
    +#~     p._parent_ring = singular_ring_reference(parent._ring, p._parent_ring_ref)
    +    p._parent_ring = parent._ring
         p._poly = juice
         p_Normalize(p._poly, p._parent_ring)
         return p

leads to a segfault in

sage -t --warn-long 32.8 src/sage/rings/polynomial/multi_polynomial_ideal.py

Hence, it is definitely not enough to count references only from a polynomial ring to the underlying libsingular ring and let the elements just hold a bare pointer to that ring. The elements need to hold a counted reference, too.

comment:153 in reply to: ↑ 150 ; follow-ups: Changed 11 months ago by tscrim

  • Milestone changed from sage-6.4 to sage-8.7
  • Reviewers set to Travis Scrimshaw

Replying to SimonKing:

If a reference cycle is deleted during cyclic garbage collection, the objects are deallocated in arbitrary order. Hence, a polynomial could be deleted after its parent, and thus it needs a pointer to a valid libsingular ring.

I see, thank you for the explanation. Could you add this explanation as a comment in the code? Once done, you can set a positive review on my behalf.

comment:154 in reply to: ↑ 153 Changed 11 months ago by SimonKing

Replying to tscrim:

I see, thank you for the explanation. Could you add this explanation as a comment in the code? Once done, you can set a positive review on my behalf.

What would be the best place? In the pyx file, in the TWO locations in which elements are created (__init__ and new_MP)? In the pyx file where the polynomial is deallocated (just one location)? In the pxd file, as an explanation of _parent_ring and _parent_ring_ref?

I'd prefer the pxd file.

comment:155 follow-up: Changed 11 months ago by tscrim

I would do it in the pyx because that is where the code is actually being used. I would prefer in the __init__ if you wanted to choose one, but I am not opposed to putting it in both.

comment:156 Changed 11 months ago by git

  • Commit changed from 7feae99bbda7b0eb54c9397c866119f9ccfc320e to c2b0bf61272c2062b2a4e47232f699fe306d1b21

Branch pushed to git repo; I updated commit sha1. New commits:

c2b0bf6  Explain why libsingular elements have to reference the parent's libsingular ring

comment:157 in reply to: ↑ 155 Changed 11 months ago by SimonKing

Replying to tscrim:

I would do it in the pyx because that is where the code is actually being used. I would prefer in the __init__ if you wanted to choose one, but I am not opposed to putting it in both.

OK, I've put it in both (and in both the commutative and the non-commutative code).

comment:158 in reply to: ↑ 153 Changed 11 months ago by SimonKing

  • Status changed from needs_review to positive_review

Replying to tscrim:

Replying to SimonKing: I see, thank you for the explanation. Could you add this explanation as a comment in the code? Once done, you can set a positive review on my behalf.

Thank you!

comment:159 Changed 10 months ago by tscrim

  • Status changed from positive_review to needs_work

Doctest failure reported by the patchbot on 8.7.beta3:

sage -t --long src/sage/rings/polynomial/polynomial_ring_constructor.py
**********************************************************************
File "src/sage/rings/polynomial/polynomial_ring_constructor.py", line 562, in sage.rings.polynomial.polynomial_ring_constructor.PolynomialRing
Failed example:
    n == total_ring_reference_count()
Expected:
    True
Got:
    False

comment:160 Changed 9 months ago by embray

  • Milestone changed from sage-8.7 to sage-8.8

Ticket retargeted after milestone closed (if you don't believe this ticket is appropriate for the Sage 8.8 release please retarget manually)

comment:161 Changed 6 months ago by embray

  • Milestone changed from sage-8.8 to sage-8.9

Tickets still needing working or clarification should be moved to the next release milestone at the soonest (please feel free to revert if you think the ticket is close to being resolved).
