Opened 8 years ago

Closed 8 years ago

#11589 closed enhancement (fixed)

faster zero matrix creation

Reported by: malb Owned by: jason, was
Priority: major Milestone: sage-4.7.2
Component: linear algebra Keywords:
Cc: SimonKing Merged in: sage-4.7.2.alpha1
Authors: Martin Albrecht, Simon King Reviewers: Simon King, Martin Albrecht
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Description (last modified by jdemeyer)

Currently, we perform a matrix copy whenever we create a new zero matrix. This is undesirable for two reasons:

  • it is slower than creating a new matrix from scratch for some matrices
  • it eats up RAM (we create two instead of one matrix)

See also the discussion on [sage-devel]: http://groups.google.com/group/sage-devel/browse_thread/thread/478fb19ecb724be0

Apply all three patches.

Attachments (3)

trac_11589.patch (4.6 KB) - added by malb 8 years ago.
trac_11589_copy_vs_creation_of_zero.patch (4.8 KB) - added by SimonKing 8 years ago.
Some fine tuning for the creation of zero matrices
trac_11589_apply_zero_creation_to_strassen.patch (1.8 KB) - added by SimonKing 8 years ago.
Use the mecanisms from the preceding patches in Strassen multiplication

Download all attachments as: .zip

Change History (25)

comment:1 Changed 8 years ago by malb

  • Authors set to Martin Albrecht, Simon King

Timings on my machine:

with patch

sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 104 µs per loop
625 loops, best of 3: 108 µs per loop
625 loops, best of 3: 115 µs per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 100 µs per loop
625 loops, best of 3: 448 µs per loop
125 loops, best of 3: 1.65 ms per loop

without patch

sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 97.6 µs per loop
625 loops, best of 3: 108 µs per loop
625 loops, best of 3: 128 µs per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 94.1 µs per loop
625 loops, best of 3: 502 µs per loop
125 loops, best of 3: 2.48 ms per loop
sage: 

comment:2 follow-up: Changed 8 years ago by SimonKing

Hi Martin,

Thank you for creating the patch!

I don't like that the timings became a little slower for small matrices. We should try to find out why that happens.

I think we should use calloc only when the argument entries is not a list. Namely, if it is a list then the matrix will be initialised by direct assignments anyway. Hence, in that case there is no need to initialise with calloc.

Did you carefully determine the threshold for which calloc becomes faster than copying? Or is "nrows, ncols > 100" just a guess?

Cheers,

Simon

comment:3 in reply to: ↑ 2 Changed 8 years ago by malb

Replying to SimonKing:

I don't like that the timings became a little slower for small matrices. We should try to find out why that happens.

We'll also have to do more careful timings to establish there really was a regression. I'm not 100% sure it's not a fluke.

I think we should use calloc only when the argument entries is not a list. Namely, if it is a list then the matrix will be initialised by direct assignments anyway. Hence, in that case there is no need to initialise with calloc.

Good point.

Did you carefully determine the threshold for which calloc becomes faster than copying? Or is "nrows, ncols > 100" just a guess?

An educated guess, i.e. I did look at some timings but it's not carefully tuned (and probably machine dependent).

Changed 8 years ago by malb

comment:4 follow-up: Changed 8 years ago by malb

I updated the patch to use malloc() if entries is no list.

comment:5 in reply to: ↑ 4 Changed 8 years ago by SimonKing

Replying to malb:

I updated the patch to use malloc() if entries is no list.

You mean: Use malloc if (and only if) it is a list. At least that is what you do in the patch (and coincides with what I had done as well).

comment:6 Changed 8 years ago by malb

Sorry, yes ... where's my head

comment:7 Changed 8 years ago by SimonKing

Hi Martin,

I have a guess why the patch isn't as quick as it should be: It is the line

    return self.__matrix_class(self, 0, coerce=coerce, copy=copy) 

If you pass the value 0, then the matrix is initialised as a diagonal matrix with 0 on the diagonal -- which means that time is wasted by initialising the diagonal twice.

Better would be to pass self, None, coerce=coerce, copy=copy. I will change that and then do some tests, concerning the threshold and also concerning other base rings.

comment:8 Changed 8 years ago by SimonKing

I was wrong in assuming that entries=0 would provoke a second initialisation of the diagonal.

However, using None as default argument does have a tiny advantage: The test if entries is None or entries==0 is a tad faster if it is actually None.

With entries=None as default in the __call__ method of matrix spaces, timings do improve, and I obtain a threshold nrows==ncols==40: For larger matrices, over GF(p) with p>2, creation from scratch is faster, while for smaller matrices copying is faster.

However, with the original patch, I find a big surprise: Over GF(2), copying is always faster! For GF(p), p>2, I can confirm your statement that the threshold seems to be around 100.

There is one problem with default entries=None, though: Some matrix classes (used for matrices over the reals or symbolics) don't accept entries=None. So, their init methods should be adapted as well.

Question: Is it worth it?

Using None, I obtain for example:

sage: timeit("M = matrix(GF(2),30,30)", number=10000)
10000 loops, best of 3: 105 µs per loop
sage: timeit("M = matrix(GF(2),300,300)", number=10000)
10000 loops, best of 3: 107 µs per loop
sage: timeit("M = matrix(GF(3),30,30)", number=10000)
10000 loops, best of 3: 102 µs per loop
sage: timeit("M = matrix(GF(3),300,300)", number=10000)
10000 loops, best of 3: 210 µs per loop

Using 0, I obtain:

sage: timeit("M = matrix(GF(2),30,30)", number=10000)
10000 loops, best of 3: 113 µs per loop
sage: timeit("M = matrix(GF(2),300,300)", number=10000)
10000 loops, best of 3: 123 µs per loop
sage: timeit("M = matrix(GF(3),30,30)", number=10000)
10000 loops, best of 3: 102 µs per loop
sage: timeit("M = matrix(GF(3),300,300)", number=10000)
10000 loops, best of 3: 215 µs per loop

The difference is small, but it seems quite reproducible (not flaky) for me. So, my suggestion is to go through the matrix classes and allow entries=None.

comment:9 follow-ups: Changed 8 years ago by SimonKing

  • Status changed from new to needs_review

I made all matrices accept "None" as an argument.

I tried to determine the threshold for copying versus creating a zero matrix, using the following test function:

sage: def tests(R):
....:     BetterCopy = []
....:     for d in srange(10,511,20):
....:         MS = MatrixSpace(R,d)
....:         MS._copy_zero = True
....:         globals()['d'] = d
....:         globals()['R'] = R
....:         TCopy = timeit.eval("M = matrix(R,d,d)",number=1000)
....:         MS._copy_zero = False
....:         TNoCopy = timeit.eval("M = matrix(R,d,d)",number=1000)
....:         if TCopy.stats[3] < TNoCopy.stats[3]:
....:             BetterCopy.append((d,TCopy,TNoCopy))
....:     return BetterCopy
....: 

It turns out that with both patches applied, copying is not a good idea over GF(2) and over the rationals:

sage: tests(GF(2))
[]
sage: tests(QQ)
[]

Note that I observed that the memory consumption for the test over the rationals constantly increased: Perhaps there is a memory leak.

However, in all other cases I studied, copying was better, for matrices of up to 30 rows and columns:

sage: tests(GF(3))
[(10, 1000 loops, best of 3: 48.2 µs per loop, 1000 loops, best of 3: 49 µs per loop), (30, 1000 loops, best of 3: 50 µs per loop, 1000 loops, best of 3: 50.9 µs per loop)]
sage: tests(GF(1019))
[(10, 1000 loops, best of 3: 48.3 µs per loop, 1000 loops, best of 3: 49.2 µs per loop)]
sage: tests(GF(625,'a'))
[(10, 1000 loops, best of 3: 45.1 µs per loop, 1000 loops, best of 3: 63.1 µs per loop), (30, 1000 loops, best of 3: 62.9 µs per loop, 1000 loops, best of 3: 68 µs per loop), (230, 1000 loops, best of 3: 1.15 ms per loop, 1000 loops, best of 3: 367 µs per loop), (250, 1000 loops, best of 3: 1.35 ms per loop, 1000 loops, best of 3: 423 µs per loop), (270, 1000 loops, best of 3: 1.71 ms per loop, 1000 loops, best of 3: 618 µs per loop), (290, 1000 loops, best of 3: 1.95 ms per loop, 1000 loops, best of 3: 701 µs per loop), (310, 1000 loops, best of 3: 2.23 ms per loop, 1000 loops, best of 3: 791 µs per loop), (330, 1000 loops, best of 3: 2.52 ms per loop, 1000 loops, best of 3: 883 µs per loop), (350, 1000 loops, best of 3: 2.82 ms per loop, 1000 loops, best of 3: 988 µs per loop)]
sage: tests(ZZ)
[(10, 625 loops, best of 3: 48.6 µs per loop, 625 loops, best of 3: 56.5 µs per loop), (30, 625 loops, best of 3: 111 µs per loop, 625 loops, best of 3: 118 µs per loop), (50, 625 loops, best of 3: 230 µs per loop, 625 loops, best of 3: 235 µs per loop)]

Therefore, I modified the threshold in _copy_zero a bit. There is no special case for Matrix_modn_dense, but for Matrix_rational_dense.

Here are timings similar to the ones you did above (adding another test case).

With both patches:

sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 105 µs per loop
625 loops, best of 3: 108 µs per loop
625 loops, best of 3: 116 µs per loop
sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 103 µs per loop
625 loops, best of 3: 403 µs per loop
625 loops, best of 3: 1.3 ms per loop
sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 198 µs per loop
125 loops, best of 3: 1.97 ms per loop
125 loops, best of 3: 7.29 ms per loop

Without patches:

sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 108 µs per loop
625 loops, best of 3: 116 µs per loop
625 loops, best of 3: 137 µs per loop
sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 103 µs per loop
625 loops, best of 3: 566 µs per loop
125 loops, best of 3: 2.46 ms per loop
sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 225 µs per loop
125 loops, best of 3: 5.95 ms per loop
25 loops, best of 3: 22.9 ms per loop

So, with both patches applied, the timings consistently improve.

Reviewing

I am not sure how the reviewing should be done, as we are both authors. Do you think it is ok if I review the first patch and you review the second patch? I need to run all long doctests, though; so far, I only did the tests in sage/matrix.

TODO

On a different ticket, the potential memory leak for rational dense matrices should be studied.

I recall that in some arithmetic routines of matrices, zero matrices are created at the beginning. I am not sure whether that is done "the official way" or by always copying a zero matrix, but I think it was the latter. That might be worth while to fix as well.

comment:10 Changed 8 years ago by SimonKing

PS: It seems to me that several doc tests from sage/matrix are considerably faster with the patches. I hope that can be confirmed.

comment:11 in reply to: ↑ 9 Changed 8 years ago by malb

Replying to SimonKing:

I made all matrices accept "None" as an argument.

Cool, you are fast!

Reviewing

I am not sure how the reviewing should be done, as we are both authors. Do you think it is ok if I review the first patch and you review the second patch? I need to run all long doctests, though; so far, I only did the tests in sage/matrix.

Yes, as far as I know it's okay if we cross-review. So I'll take a look at your patch.

comment:12 Changed 8 years ago by malb

Here are timings on my machine (Macbook Pro, i7):

no patch

sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 98.1 µs per loop
625 loops, best of 3: 115 µs per loop
625 loops, best of 3: 130 µs per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 93.3 µs per loop
625 loops, best of 3: 538 µs per loop
125 loops, best of 3: 2.41 ms per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 189 µs per loop
125 loops, best of 3: 3.7 ms per loop
25 loops, best of 3: 14.1 ms per loop

first only

sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 104 µs per loop
625 loops, best of 3: 108 µs per loop
625 loops, best of 3: 112 µs per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 95.5 µs per loop
625 loops, best of 3: 442 µs per loop
625 loops, best of 3: 1.48 ms per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 184 µs per loop
125 loops, best of 3: 3.7 ms per loop
25 loops, best of 3: 13.8 ms per loop

with both

sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 94.5 µs per loop
625 loops, best of 3: 98.9 µs per loop
625 loops, best of 3: 108 µs per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 95 µs per loop
625 loops, best of 3: 438 µs per loop
625 loops, best of 3: 1.49 ms per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n))
....: 
625 loops, best of 3: 176 µs per loop
125 loops, best of 3: 1.74 ms per loop
125 loops, best of 3: 6.24 ms per loop

My take: it's definitely faster for large-ish matrices and undecided for small matrices, i.e. my timings for 50x50 are rather unstable.

Changed 8 years ago by SimonKing

Some fine tuning for the creation of zero matrices

comment:13 Changed 8 years ago by SimonKing

Sorry, I had to replace my patch, since apparently I forgot to change the doctests of _copy_zero. Actually, the new patch version adds some tests, as there is now a special case for matrices over the rationals.

FWIW, with the old patch version all long tests except matrix_space.py pass, and with the new patch version (that only fixes the failing doctest) matrix_space.py passes as well.

comment:14 in reply to: ↑ 9 Changed 8 years ago by SimonKing

Replying to SimonKing:

I recall that in some arithmetic routines of matrices, zero matrices are created at the beginning. I am not sure whether that is done "the official way" or by always copying a zero matrix, but I think it was the latter. That might be worth while to fix as well.

I found it. It is the _multiply_strassen method in matrix2.pyx, which says:

        if self.is_sparse():
            output = self.matrix_space(self._nrows, right._ncols, sparse = True)(0)
        else:
            output = self.matrix_space(self._nrows, right._ncols, sparse = False).zero_matrix().__copy__()

I think that ought to be improved, but that would better be on another ticket.

comment:15 follow-up: Changed 8 years ago by SimonKing

I added a third patch, that applies the faster way of creating a zero matrix to _multiply_strassen. With the third patch, all tests from sage/matrix still pass. I have no benchmark that shows a noticeable difference to the old behaviour. But I think it won't hurt to use an optimised procedure for the creation of a zero matrix, rather than to hard-wire the copying of a zero matrix in the current version of _multiply_strassen.

I think that the third patch is small enough to be discussed here. However, note that there already is an open ticket on improving matrix multiplication: #8096. There is no activity for 16 months now! I think it would be worth while to invest some work there. It seems to me that a part of the issues of that old ticket are already resolved.

Concerning reviews: I believe that the first patch is fine, and I give it a positive review. I have already mentioned that the tests pass, and the improvement of the timings (in particular when the second patch is added) is clear.

comment:16 in reply to: ↑ 15 Changed 8 years ago by malb

  • Reviewers set to Simon King, Martin Albrecht
  • Status changed from needs_review to positive_review

Concerning reviews: I believe that the first patch is fine, and I give it a positive review. I have already mentioned that the tests pass, and the improvement of the timings (in particular when the second patch is added) is clear.

I give your two patches a positive review as well.

comment:17 Changed 8 years ago by SimonKing

Note that the patch for Strassen-Winograd multiplication at #11610 benefits from the patches here as well, because many matrix spaces are created during the iteration.

comment:18 follow-up: Changed 8 years ago by jdemeyer

  • Status changed from positive_review to needs_info

Please state which patches have to be applied...

Changed 8 years ago by SimonKing

Use the mecanisms from the preceding patches in Strassen multiplication

comment:19 in reply to: ↑ 18 ; follow-up: Changed 8 years ago by SimonKing

  • Status changed from needs_info to needs_review

Replying to jdemeyer:

Please state which patches have to be applied...

If nothing is stated then, I thought, all patches are to be applied.

I used the occasion to replace the third patch: A commit message was missing.

comment:20 Changed 8 years ago by SimonKing

  • Status changed from needs_review to positive_review

... and back to positive review.

comment:21 in reply to: ↑ 19 Changed 8 years ago by jdemeyer

  • Description modified (diff)

Replying to SimonKing:

If nothing is stated then, I thought, all patches are to be applied.

For safety, I prefer not to make any such assumptions. In the few cases where it is unclear (like here), I prefer simply asking.

comment:22 Changed 8 years ago by jdemeyer

  • Merged in set to sage-4.7.2.alpha1
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.