Opened 9 years ago

Closed 8 years ago

Last modified 8 years ago

#11358 closed defect (fixed)

matrix multiplication over ZZ sometimes gives incorrect results

Reported by: tomc Owned by:
Priority: critical Milestone: sage-5.0
Component: linear algebra Keywords: matrix multiplication, multi-modular, integers, ZZ
Cc: burcin Merged in: sage-5.0.beta12
Authors: William Stein Reviewers: Douglas McNeil
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps: #12710

Description (last modified by was)

Something is wrong with the multi-modular matrix multiplication code for matrices over ZZ. At random, and infrequently, it gives incorrect results. For example, the following code chooses random 3x2 and 2x10 integer matrices and multiplies them together using the multi-modular algorithm. It then does the same multiplication 100 more times, checks that the answer is always the same, and if not raises an exception:

sage: for n in range(2000):
....:     A = MatrixSpace(ZZ,3,2).random_element()    
....:     B = MatrixSpace(ZZ,2,10).random_element()
....:     try_once = A._multiply_multi_modular(B)
....:     for k in range(100):
....:             try_again = A._multiply_multi_modular(B)
....:         if try_once != try_again:
....:                 print "="*60
....:             print "n = %s, k = %s"%(n,k)
....:             print "A = "
....:             print A
....:             print "B ="
....:             print B
....:             print "first attempt = "
....:             print try_once
....:             print "k-th retry = "
....:             print try_again
....:             raise RuntimeError
....: 

This fails with very high probability (on Sage 4.6.2 under OS X, built from source) with output such as:

============================================================
n = 27, k = 43
A = 
[-1  0]
[ 1  0]
[ 0  1]
B =
[  -2    1   -1    0    0   -1   -1    3   -1   -1]
[   1    1 -116    3    0   -1   -1    0   -1    0]
first attempt = 
[   2   -1    1    0    0    1    1   -3    1    1]
[  -2    1   -1    0    0   -1   -1    3   -1   -1]
[   1    1 -116    3    0   -1   -1    0   -1    0]
k-th retry = 
[   2 1102    1    0    0    1    1 1100    1    1]
[1101    1 1102    0    0 1102 1102    3 1102 1102]
[   1    1  987    3    0 1102 1102    0 1102    0]
---------------------------------------------------------------------------
RuntimeError [...]

Note that the two candidates for the matrix product here are congruent modulo 1103, which is prime. If you rerun the code with verbose logging, using set_verbose(2), then every time it fails the two candidates for the matrix product are congruent modulo the prime being used in the multi-modular algorithm. Thus I suspect that the Chinese Remainder Theorem code in sage/ext/multi_modular.pyx is not handling a corner case properly.

SEE ALSO: #10281, which touches the sage/ext/multi_modular.pyx code in nontrivial ways.

Attachments (2)

11358.sage (631 bytes) - added by was 8 years ago.
this is a usable version of the test program in the ticket description
trac_11358-wstein.patch (5.4 KB) - added by jdemeyer 8 years ago.
apply ONLY this.

Download all attachments as: .zip

Change History (28)

comment:1 Changed 9 years ago by tomc

This also fails on Sage 4.6.2 under Red Hat Enterprise Linux (binary install, x86_64 GNU/Linux)

comment:2 Changed 9 years ago by tomc

Looking at the function mpz_crt_vec_tail() in sage/ext/multi_modular.pyx, it appears that three of the last five lines:

cdef Integer zz
zz = PY_NEW(Integer)
mpz_set(zz.value, self.half_product)

are redundant, because the variable zz is never used. (NB I am unfamiliar with both Cython and GMP, so I may be wrong here.) But why is this code there? Is it supposed to be doing something else?

comment:3 follow-up: Changed 9 years ago by rbeezer

From #sagemath on IRC:

[07:11] <sagebot> New news from tractimeline: Ticket #11358 (matrix multiplication over ZZ sometimes gives incorrect results) created
[07:32] <logix> hm, i just tried to reproduce that bug (which i can), and it failed both times when B had a full zero column
[07:32] <logix> (which is the case in the bug report too)
[07:33] <logix> perhaps that's the condition triggering the bug?
[07:34] <logix> i mean one of the conditions triggering the bug (there must be something else, as the inner loop runs more than once and it fails only sometimes

comment:4 in reply to: ↑ 3 ; follow-up: Changed 9 years ago by tomc

Replying to rbeezer:

I don't think that this is correct. By running the code repeatedly, I found examples without zero columns.

comment:5 in reply to: ↑ 4 Changed 9 years ago by rbeezer

Replying to tomc:

I don't think that this is correct. By running the code repeatedly, I found examples without zero columns.

OK, just saw that in IRC and thought I'd capture it.

I'm curious to see what the root cause is here!

Rob

comment:6 follow-up: Changed 9 years ago by dsm

I think I've got it.

This seems to happen iff a random prime happens to be chosen twice. For example, if you modify _extend_moduli_to_height_c to always reuse a prime if more than one is needed, this always happens. The same bug seems to exist in _extend_moduli_to_count as well.

The problem seems to be that in _new_random_prime, there's a test which doesn't do what it's trying to do:

    cdef mod_int _new_random_prime(self):
        # choose a new random prime
        cdef Py_ssize_t i
        cdef mod_int p
        while True:
            p = random_prime(self._u_bound, lbound =self._l_bound)
            for i in range(self.n):
                if p == self.moduli[i]:
                    break
            else:
                return p

IIUC, the "for i in range(self.n)" loop is attempting to avoid the problem of repeated primes, but this only works if p is added to self.moduli immediately after it's returned. In the case of _extend_moduli_to_height_c, this isn't true; the addition is delayed until a later extend_with_primes call, so the above check decides that the prime isn't reused, we get repeated moduli, and the multiplication breaks.

Or even more obviously:

sage: from sage.ext.multi_modular import MultiModularBasis_base
sage: mm = MultiModularBasis_base(1)
sage: mm._extend_moduli_to_count(1000)
1000
sage: len(mm), len(set(mm))
(1000, 843)

ISTM either we can remove the check-for-duplicate code from _new_random_prime and do it externally, or we can add an argument and pass it the "accumulated" primes as we build them elsewhere so it can avoid duplicates there too. I prefer the first. Should I work up a patch?

comment:7 Changed 9 years ago by burcin

  • Cc burcin added

comment:8 in reply to: ↑ 6 Changed 9 years ago by tomc

Replying to dsm:

Good catch! I agree.

ISTM either we can remove the check-for-duplicate code from _new_random_prime and do it externally, or we can add an argument and pass it the "accumulated" primes as we build them elsewhere so it can avoid duplicates there too. I prefer the first. Should I work up a patch?

I prefer the first too. Please write a patch and then I will review it.

comment:9 Changed 9 years ago by dsm

  • Status changed from new to needs_review

Okay, here's my first attempt. Because another function calls _new_random_prime assuming that it correctly avoids duplication I had to change my intended approach a bit, but it should work.

Two other minor changes:

(1) as mentioned above, mpz_crt_vec_tail() has what looks to me like vestigial code, and I've removed it.

(2) the documentation for replace_prime_c claimed that it would replace a prime with a greater prime, but made no efforts to ensure the "greater than" part. I've changed the doc accordingly.

comment:10 Changed 9 years ago by tomc

Thanks. I'll review this.

comment:11 follow-ups: Changed 8 years ago by tomc

  • Status changed from needs_review to needs_work

I can't get the patch to apply. When I apply the patch and then do

sage -b

I get

Error converting Pyrex file to C:
------------------------------------------------------------
...
        """
        cdef Py_ssize_t i
        cdef mod_int p
        while True:
            p = random_prime(self._u_bound, lbound =self._l_bound)
            if p not in self.moduli[:self.n]: return p
                                  ^
------------------------------------------------------------

/Users/tomc/sage-4.6.2/devel/sage-test-crt-fix/sage/ext/multi_modular.pyx:191:35: Cannot convert 'mod_int *' to Python object

Also, in the new implementation of _new_random_prime() there is the possibility of an infinite loop if self.moduli already contains all the primes between self._lbound and self._ubound. I suggest modifying _new_random_prime() so that it:

  • makes some number of tries (10? 100?) to find a new random prime p in the range self._lbound < p < self._ubound; and
  • if this fails then it increases self_ubound and tries again.

comment:12 in reply to: ↑ 11 ; follow-up: Changed 8 years ago by burcin

  • Owner changed from jason, was to (none)

Replying to tomc:

Also, in the new implementation of _new_random_prime() there is the possibility of an infinite loop if self.moduli already contains all the primes between self._lbound and self._ubound. I suggest modifying _new_random_prime() so that it:

  • makes some number of tries (10? 100?) to find a new random prime p in the range self._lbound < p < self._ubound; and
  • if this fails then it increases self_ubound and tries again.

I suggest raising an error instead of increasing _ubound. The upper bound is usually set so that the primes fit in a single machine word. A lot of things would break if this changed without notice.

comment:13 in reply to: ↑ 12 ; follow-up: Changed 8 years ago by tomc

Replying to burcin:

I suggest raising an error instead of increasing _ubound. The upper bound is usually set so that the primes fit in a single machine word. A lot of things would break if this changed without notice.

If we do raise an error then it should be caught before it propagates up to the user. For instance, it should not be that if A and B are integer matrices then evaluating A*B can sometimes cause an exception because the multi-modular matrix multiplication algorithm ran out of primes: in this case we should catch the exception and evaluate A*B using a different algorithm.

Is it really the case that raising self._ubound would cause things to break?  I thought from looking at the multi-modular code that it all used mpz_t types from GMP, and that these would automatically increase their allocated memory as necessary.

The multi-modular code is used in five places: dense matrix multiplication over ZZ; dense matrix multiplication over cyclotomics; dense matrix multiplication over QQ (although only in the rational reconstruction routines _lift_crt_rr() and _lift_crt_rr_with_lcm(), which I think are dead code); echelon form calculations over cyclotomics; and characteristic polynomial calculations over cyclotomics.  If we decide to raise an error if the multi-modular code runs out of primes then we will need to catch this error and handle it appropriately in all five places.

comment:14 in reply to: ↑ 13 ; follow-up: Changed 8 years ago by burcin

Replying to tomc:

Replying to burcin:

I suggest raising an error instead of increasing _ubound. The upper bound is usually set so that the primes fit in a single machine word. A lot of things would break if this changed without notice.

If we do raise an error then it should be caught before it propagates up to the user. For instance, it should not be that if A and B are integer matrices then evaluating A*B can sometimes cause an exception because the multi-modular matrix multiplication algorithm ran out of primes: in this case we should catch the exception and evaluate A*B using a different algorithm.

This is better than trying to handle it magically in the MultiModularBasis class.

Is it really the case that raising self._ubound would cause things to break?  I thought from looking at the multi-modular code that it all used mpz_t types from GMP, and that these would automatically increase their allocated memory as necessary.

Multi precision integers are needed for the lifting step. One of the reasons to use a multi-modular algorithm is to take advantage of fast arithmetic implemented in hardware. This requires that the modulus is small enough to fit in a word size. For linear algebra, the dimensions of the matrix also come into play. Ideally you should reduce only once for each row*column multiplication.

The multi-modular code is used in five places: dense matrix multiplication over ZZ; dense matrix multiplication over cyclotomics; dense matrix multiplication over QQ (although only in the rational reconstruction routines _lift_crt_rr() and _lift_crt_rr_with_lcm(), which I think are dead code); echelon form calculations over cyclotomics; and characteristic polynomial calculations over cyclotomics.  If we decide to raise an error if the multi-modular code runs out of primes then we will need to catch this error and handle it appropriately in all five places.

You cannot assume that the only users of the code are in the Sage library. This class is designed to be an abstraction that makes it easy to implement multi modular algorithms.

Perhaps we can verify that this error will never occur in these places through some theoretical argument. It is highly unlikely that we will always keep choosing the same random prime.

comment:15 in reply to: ↑ 14 ; follow-up: Changed 8 years ago by tomc

Replying to burcin:

Multi precision integers are needed for the lifting step. One of the reasons to use a multi-modular algorithm is to take advantage of fast arithmetic implemented in hardware. This requires that the modulus is small enough to fit in a word size.

I agree that things will slow down if we ever raise self._ubound to be larger than a machine word: my question was whether anything would actually break. But in any case, since the MultiModularBasis? class is explicitly documented as:

"""
    This class stores a list of machine-sized prime numbers...
"""

we should not allow the code to raise self._ubound. Thus we should raise an error if we repeatedly fail to choose a new random prime. As discussed, we will need to handle this error in several other places in Sage.

Note that the situation we are discussing is very unlikely to ever occur in practice, as the default values of _lbound and _ubound are repectively 210 and 215 and there are many primes in between 210 and 215. But in theory someone could do:

sage: mm = MultiModularBasis(lbound=4,ubound=6,...)

or something equally silly, and then the multi-modular code would quickly run out of primes.

comment:16 in reply to: ↑ 11 Changed 8 years ago by dsm

Replying to tomc:

I can't get the patch to apply.

Try it against one of the 4.7 release candidates. I should've mentioned what I was basing it on.

As for the case of running out of primes -- which I considered the caller's fault, and so I simply warned about it in the docstring rather than handle it -- I'm fine with whatever. (Once cython 0.15 gets integrated and we have cythonic generators it may be worth collecting random prime functions into a fast module of their own.)

Given that the moduli have to be below MAX_MODULUS, it wouldn't cost much to check whether there really are any primes left and raise an exception only if we really are done and not merely unlucky. If we're doing that, though, then we should probably raise the same exception if you ask for more primes than there are available in extend_moduli_to_count as well.

comment:17 in reply to: ↑ 15 Changed 8 years ago by was

*ping* why is this ticket dead? It seems very, very important, but everybody forgot about it a year ago?!

Replying to tomc:

Note that the situation we are discussing is very unlikely to ever occur in practice, as the default values of _lbound and _ubound are repectively 210 and 215 and there are many primes in between 210 and 215.

This is completely false. There are hardly any primes in that range:

sage: len(prime_range(2^10,2^15))
3340

The linear algebra in Sage is currently very broken in practice, perhaps due to various optimizations and re-implementations that have fundamentally broken things. For example, I'm constantly hitting a problem of "not enough primes" (see #10281), even for matrices over QQ.

comment:18 Changed 8 years ago by was

  • Description modified (diff)

comment:19 Changed 8 years ago by jen

  • Stopgaps set to #12710

Changed 8 years ago by was

this is a usable version of the test program in the ticket description

comment:20 Changed 8 years ago by was

  • Status changed from needs_work to needs_review

I've posted a patch trac_11358-wstein.patch that fixes the bug. The solution is definitely much better than what was proposed 10 years ago. You can run the attached 11358.sage to check that the exact bug cited here doesn't happen. But better is to see the new doctests which get to the heart of the matter.

comment:21 Changed 8 years ago by dsm

  • Status changed from needs_review to needs_work

The patch does seem to eliminate the original bug for me, but there's still some strangeness for different _extend methods:

sage: set_random_seed(0)
sage: from sage.ext.multi_modular import MultiModularBasis_base
sage: m = MultiModularBasis_base(0);
sage: m._extend_moduli(100)
sage: len(m), len(set(m))
(101, 99)
sage: [k for k in m if list(m).count(k) > 1]
[30859, 3217, 30859, 3217]

sage: set_random_seed(1)
sage: from sage.ext.multi_modular import MultiModularBasis_base
sage: m = MultiModularBasis_base(0);
sage: m._extend_moduli_to_count(100)
100
sage: len(m), len(set(m))
(100, 97)
sage: [k for k in m if list(m).count(k) > 1]
[4001, 5309, 23293, 4001, 23293, 5309]

So I think that two of the three _extend methods still don't give results that I'd expect.

comment:22 Changed 8 years ago by was

  • Status changed from needs_work to needs_review

So I think that two of the three _extend methods still don't give results that I'd expect.

I have fixed the patch. There was one case where I didn't update use of _new_random_prime correctly.

comment:23 Changed 8 years ago by dsm

  • Status changed from needs_review to positive_review

The new patch:

(1) Applies against 5.0.beta8 with minor fuzz. (2) Seems to eliminate the original bug. (3) Matches the original diagnosis about where the problem was (use of repeated primes in a multimodular basis), so I get why it *should* fix the bug. (4) Has doctests which verify that .._height now behaves. (5) The new version passes my separate tests confirming that the other _extend methods behave too, while the first version didn't.

Positive review.

comment:24 Changed 8 years ago by davidloeffler

  • Authors set to William Stein
  • Reviewers set to Douglas McNeil

comment:25 Changed 8 years ago by jdemeyer

  • Merged in set to sage-5.0.beta12
  • Resolution set to fixed
  • Status changed from positive_review to closed

Changed 8 years ago by jdemeyer

apply ONLY this.

comment:26 Changed 8 years ago by jdemeyer

Rebased patch by removing a hunk which just added a newline.

Note: See TracTickets for help on using tickets.