Opened 11 years ago

Closed 10 years ago

# faster zero matrix creation

Reported by: Owned by: malb jason, was major sage-4.7.2 linear algebra SimonKing sage-4.7.2.alpha1 Martin Albrecht, Simon King Simon King, Martin Albrecht N/A

Currently, we perform a matrix copy whenever we create a new zero matrix. This is undesirable for two reasons:

• it is slower than creating a new matrix from scratch for some matrices
• it eats up RAM (we create two instead of one matrix)

Apply all three patches.

### comment:1 Changed 11 years ago by malb

• Authors set to Martin Albrecht, Simon King

Timings on my machine:

with patch

```sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....:
625 loops, best of 3: 104 µs per loop
625 loops, best of 3: 108 µs per loop
625 loops, best of 3: 115 µs per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....:
625 loops, best of 3: 100 µs per loop
625 loops, best of 3: 448 µs per loop
125 loops, best of 3: 1.65 ms per loop
```

without patch

```sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....:
625 loops, best of 3: 97.6 µs per loop
625 loops, best of 3: 108 µs per loop
625 loops, best of 3: 128 µs per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....:
625 loops, best of 3: 94.1 µs per loop
625 loops, best of 3: 502 µs per loop
125 loops, best of 3: 2.48 ms per loop
sage:

```

### comment:2 follow-up: ↓ 3 Changed 11 years ago by SimonKing

Hi Martin,

Thank you for creating the patch!

I don't like that the timings became a little slower for small matrices. We should try to find out why that happens.

I think we should use calloc only when the argument `entries` is not a list. Namely, if it is a list then the matrix will be initialised by direct assignments anyway. Hence, in that case there is no need to initialise with calloc.

Did you carefully determine the threshold for which calloc becomes faster than copying? Or is "nrows, ncols > 100" just a guess?

Cheers,

Simon

### comment:3 in reply to: ↑ 2 Changed 11 years ago by malb

I don't like that the timings became a little slower for small matrices. We should try to find out why that happens.

We'll also have to do more careful timings to establish there really was a regression. I'm not 100% sure it's not a fluke.

I think we should use calloc only when the argument `entries` is not a list. Namely, if it is a list then the matrix will be initialised by direct assignments anyway. Hence, in that case there is no need to initialise with calloc.

Good point.

Did you carefully determine the threshold for which calloc becomes faster than copying? Or is "nrows, ncols > 100" just a guess?

An educated guess, i.e. I did look at some timings but it's not carefully tuned (and probably machine dependent).

### comment:4 follow-up: ↓ 5 Changed 11 years ago by malb

I updated the patch to use malloc() if entries is no list.

### comment:5 in reply to: ↑ 4 Changed 11 years ago by SimonKing

I updated the patch to use malloc() if entries is no list.

You mean: Use malloc if (and only if) it is a list. At least that is what you do in the patch (and coincides with what I had done as well).

### comment:6 Changed 11 years ago by malb

Sorry, yes ... where's my head

### comment:7 Changed 11 years ago by SimonKing

Hi Martin,

I have a guess why the patch isn't as quick as it should be: It is the line

```    return self.__matrix_class(self, 0, coerce=coerce, copy=copy)
```

If you pass the value 0, then the matrix is initialised as a diagonal matrix with 0 on the diagonal -- which means that time is wasted by initialising the diagonal twice.

Better would be to pass `self, None, coerce=coerce, copy=copy`. I will change that and then do some tests, concerning the threshold and also concerning other base rings.

### comment:8 Changed 11 years ago by SimonKing

I was wrong in assuming that `entries=0` would provoke a second initialisation of the diagonal.

However, using None as default argument does have a tiny advantage: The test `if entries is None or entries==0` is a tad faster if it is actually None.

With `entries=None` as default in the `__call__` method of matrix spaces, timings do improve, and I obtain a threshold `nrows==ncols==40`: For larger matrices, over GF(p) with p>2, creation from scratch is faster, while for smaller matrices copying is faster.

However, with the original patch, I find a big surprise: Over GF(2), copying is always faster! For GF(p), p>2, I can confirm your statement that the threshold seems to be around 100.

There is one problem with default `entries=None`, though: Some matrix classes (used for matrices over the reals or symbolics) don't accept `entries=None`. So, their init methods should be adapted as well.

Question: Is it worth it?

Using None, I obtain for example:

```sage: timeit("M = matrix(GF(2),30,30)", number=10000)
10000 loops, best of 3: 105 µs per loop
sage: timeit("M = matrix(GF(2),300,300)", number=10000)
10000 loops, best of 3: 107 µs per loop
sage: timeit("M = matrix(GF(3),30,30)", number=10000)
10000 loops, best of 3: 102 µs per loop
sage: timeit("M = matrix(GF(3),300,300)", number=10000)
10000 loops, best of 3: 210 µs per loop
```

Using 0, I obtain:

```sage: timeit("M = matrix(GF(2),30,30)", number=10000)
10000 loops, best of 3: 113 µs per loop
sage: timeit("M = matrix(GF(2),300,300)", number=10000)
10000 loops, best of 3: 123 µs per loop
sage: timeit("M = matrix(GF(3),30,30)", number=10000)
10000 loops, best of 3: 102 µs per loop
sage: timeit("M = matrix(GF(3),300,300)", number=10000)
10000 loops, best of 3: 215 µs per loop
```

The difference is small, but it seems quite reproducible (not flaky) for me. So, my suggestion is to go through the matrix classes and allow `entries=None`.

### comment:9 follow-ups: ↓ 11 ↓ 14 Changed 11 years ago by SimonKing

• Status changed from new to needs_review

I made all matrices accept "None" as an argument.

I tried to determine the threshold for copying versus creating a zero matrix, using the following test function:

```sage: def tests(R):
....:     BetterCopy = []
....:     for d in srange(10,511,20):
....:         MS = MatrixSpace(R,d)
....:         MS._copy_zero = True
....:         globals()['d'] = d
....:         globals()['R'] = R
....:         TCopy = timeit.eval("M = matrix(R,d,d)",number=1000)
....:         MS._copy_zero = False
....:         TNoCopy = timeit.eval("M = matrix(R,d,d)",number=1000)
....:         if TCopy.stats[3] < TNoCopy.stats[3]:
....:             BetterCopy.append((d,TCopy,TNoCopy))
....:     return BetterCopy
....:
```

It turns out that with both patches applied, copying is not a good idea over GF(2) and over the rationals:

```sage: tests(GF(2))
[]
sage: tests(QQ)
[]
```

Note that I observed that the memory consumption for the test over the rationals constantly increased: Perhaps there is a memory leak.

However, in all other cases I studied, copying was better, for matrices of up to 30 rows and columns:

```sage: tests(GF(3))
[(10, 1000 loops, best of 3: 48.2 µs per loop, 1000 loops, best of 3: 49 µs per loop), (30, 1000 loops, best of 3: 50 µs per loop, 1000 loops, best of 3: 50.9 µs per loop)]
sage: tests(GF(1019))
[(10, 1000 loops, best of 3: 48.3 µs per loop, 1000 loops, best of 3: 49.2 µs per loop)]
sage: tests(GF(625,'a'))
[(10, 1000 loops, best of 3: 45.1 µs per loop, 1000 loops, best of 3: 63.1 µs per loop), (30, 1000 loops, best of 3: 62.9 µs per loop, 1000 loops, best of 3: 68 µs per loop), (230, 1000 loops, best of 3: 1.15 ms per loop, 1000 loops, best of 3: 367 µs per loop), (250, 1000 loops, best of 3: 1.35 ms per loop, 1000 loops, best of 3: 423 µs per loop), (270, 1000 loops, best of 3: 1.71 ms per loop, 1000 loops, best of 3: 618 µs per loop), (290, 1000 loops, best of 3: 1.95 ms per loop, 1000 loops, best of 3: 701 µs per loop), (310, 1000 loops, best of 3: 2.23 ms per loop, 1000 loops, best of 3: 791 µs per loop), (330, 1000 loops, best of 3: 2.52 ms per loop, 1000 loops, best of 3: 883 µs per loop), (350, 1000 loops, best of 3: 2.82 ms per loop, 1000 loops, best of 3: 988 µs per loop)]
sage: tests(ZZ)
[(10, 625 loops, best of 3: 48.6 µs per loop, 625 loops, best of 3: 56.5 µs per loop), (30, 625 loops, best of 3: 111 µs per loop, 625 loops, best of 3: 118 µs per loop), (50, 625 loops, best of 3: 230 µs per loop, 625 loops, best of 3: 235 µs per loop)]
```

Therefore, I modified the threshold in `_copy_zero` a bit. There is no special case for `Matrix_modn_dense`, but for `Matrix_rational_dense`.

Here are timings similar to the ones you did above (adding another test case).

With both patches:

```sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....:
625 loops, best of 3: 105 µs per loop
625 loops, best of 3: 108 µs per loop
625 loops, best of 3: 116 µs per loop
sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....:
625 loops, best of 3: 103 µs per loop
625 loops, best of 3: 403 µs per loop
625 loops, best of 3: 1.3 ms per loop
sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n))
....:
625 loops, best of 3: 198 µs per loop
125 loops, best of 3: 1.97 ms per loop
125 loops, best of 3: 7.29 ms per loop
```

Without patches:

```sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....:
625 loops, best of 3: 108 µs per loop
625 loops, best of 3: 116 µs per loop
625 loops, best of 3: 137 µs per loop
sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....:
625 loops, best of 3: 103 µs per loop
625 loops, best of 3: 566 µs per loop
125 loops, best of 3: 2.46 ms per loop
sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n))
....:
625 loops, best of 3: 225 µs per loop
125 loops, best of 3: 5.95 ms per loop
25 loops, best of 3: 22.9 ms per loop
```

So, with both patches applied, the timings consistently improve.

Reviewing

I am not sure how the reviewing should be done, as we are both authors. Do you think it is ok if I review the first patch and you review the second patch? I need to run all long doctests, though; so far, I only did the tests in sage/matrix.

TODO

On a different ticket, the potential memory leak for rational dense matrices should be studied.

I recall that in some arithmetic routines of matrices, zero matrices are created at the beginning. I am not sure whether that is done "the official way" or by always copying a zero matrix, but I think it was the latter. That might be worth while to fix as well.

### comment:10 Changed 11 years ago by SimonKing

PS: It seems to me that several doc tests from sage/matrix are considerably faster with the patches. I hope that can be confirmed.

### comment:11 in reply to: ↑ 9 Changed 11 years ago by malb

I made all matrices accept "None" as an argument.

Cool, you are fast!

Reviewing

I am not sure how the reviewing should be done, as we are both authors. Do you think it is ok if I review the first patch and you review the second patch? I need to run all long doctests, though; so far, I only did the tests in sage/matrix.

Yes, as far as I know it's okay if we cross-review. So I'll take a look at your patch.

### comment:12 Changed 11 years ago by malb

Here are timings on my machine (Macbook Pro, i7):

no patch

```sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....:
625 loops, best of 3: 98.1 µs per loop
625 loops, best of 3: 115 µs per loop
625 loops, best of 3: 130 µs per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....:
625 loops, best of 3: 93.3 µs per loop
625 loops, best of 3: 538 µs per loop
125 loops, best of 3: 2.41 ms per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n))
....:
625 loops, best of 3: 189 µs per loop
125 loops, best of 3: 3.7 ms per loop
25 loops, best of 3: 14.1 ms per loop
```

first only

```sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....:
625 loops, best of 3: 104 µs per loop
625 loops, best of 3: 108 µs per loop
625 loops, best of 3: 112 µs per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....:
625 loops, best of 3: 95.5 µs per loop
625 loops, best of 3: 442 µs per loop
625 loops, best of 3: 1.48 ms per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n))
....:
625 loops, best of 3: 184 µs per loop
125 loops, best of 3: 3.7 ms per loop
25 loops, best of 3: 13.8 ms per loop
```

with both

```sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n))
....:
625 loops, best of 3: 94.5 µs per loop
625 loops, best of 3: 98.9 µs per loop
625 loops, best of 3: 108 µs per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n))
....:
625 loops, best of 3: 95 µs per loop
625 loops, best of 3: 438 µs per loop
625 loops, best of 3: 1.49 ms per loop

sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n))
....:
625 loops, best of 3: 176 µs per loop
125 loops, best of 3: 1.74 ms per loop
125 loops, best of 3: 6.24 ms per loop
```

My take: it's definitely faster for large-ish matrices and undecided for small matrices, i.e. my timings for 50x50 are rather unstable.

### Changed 11 years ago by SimonKing

Some fine tuning for the creation of zero matrices

### comment:13 Changed 11 years ago by SimonKing

Sorry, I had to replace my patch, since apparently I forgot to change the doctests of _copy_zero. Actually, the new patch version adds some tests, as there is now a special case for matrices over the rationals.

FWIW, with the old patch version all long tests except matrix_space.py pass, and with the new patch version (that only fixes the failing doctest) matrix_space.py passes as well.

### comment:14 in reply to: ↑ 9 Changed 11 years ago by SimonKing

I recall that in some arithmetic routines of matrices, zero matrices are created at the beginning. I am not sure whether that is done "the official way" or by always copying a zero matrix, but I think it was the latter. That might be worth while to fix as well.

I found it. It is the `_multiply_strassen` method in `matrix2.pyx`, which says:

```        if self.is_sparse():
output = self.matrix_space(self._nrows, right._ncols, sparse = True)(0)
else:
output = self.matrix_space(self._nrows, right._ncols, sparse = False).zero_matrix().__copy__()
```

I think that ought to be improved, but that would better be on another ticket.

### comment:15 follow-up: ↓ 16 Changed 11 years ago by SimonKing

I added a third patch, that applies the faster way of creating a zero matrix to `_multiply_strassen`. With the third patch, all tests from sage/matrix still pass. I have no benchmark that shows a noticeable difference to the old behaviour. But I think it won't hurt to use an optimised procedure for the creation of a zero matrix, rather than to hard-wire the copying of a zero matrix in the current version of _multiply_strassen.

I think that the third patch is small enough to be discussed here. However, note that there already is an open ticket on improving matrix multiplication: #8096. There is no activity for 16 months now! I think it would be worth while to invest some work there. It seems to me that a part of the issues of that old ticket are already resolved.

Concerning reviews: I believe that the first patch is fine, and I give it a positive review. I have already mentioned that the tests pass, and the improvement of the timings (in particular when the second patch is added) is clear.

### comment:16 in reply to: ↑ 15 Changed 11 years ago by malb

• Reviewers set to Simon King, Martin Albrecht
• Status changed from needs_review to positive_review

Concerning reviews: I believe that the first patch is fine, and I give it a positive review. I have already mentioned that the tests pass, and the improvement of the timings (in particular when the second patch is added) is clear.

I give your two patches a positive review as well.

### comment:17 Changed 11 years ago by SimonKing

Note that the patch for Strassen-Winograd multiplication at #11610 benefits from the patches here as well, because many matrix spaces are created during the iteration.

### comment:18 follow-up: ↓ 19 Changed 10 years ago by jdemeyer

• Status changed from positive_review to needs_info

Please state which patches have to be applied...

### Changed 10 years ago by SimonKing

Use the mecanisms from the preceding patches in Strassen multiplication

### comment:19 in reply to: ↑ 18 ; follow-up: ↓ 21 Changed 10 years ago by SimonKing

• Status changed from needs_info to needs_review

Please state which patches have to be applied...

If nothing is stated then, I thought, all patches are to be applied.

I used the occasion to replace the third patch: A commit message was missing.

### comment:20 Changed 10 years ago by SimonKing

• Status changed from needs_review to positive_review

... and back to positive review.

### comment:21 in reply to: ↑ 19 Changed 10 years ago by jdemeyer

• Description modified (diff)