Opened 8 years ago
Closed 8 years ago
#11589 closed enhancement (fixed)
faster zero matrix creation
Reported by: | malb | Owned by: | jason, was |
---|---|---|---|
Priority: | major | Milestone: | sage-4.7.2 |
Component: | linear algebra | Keywords: | |
Cc: | SimonKing | Merged in: | sage-4.7.2.alpha1 |
Authors: | Martin Albrecht, Simon King | Reviewers: | Simon King, Martin Albrecht |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | Stopgaps: |
Description (last modified by )
Currently, we perform a matrix copy whenever we create a new zero matrix. This is undesirable for two reasons:
- it is slower than creating a new matrix from scratch for some matrices
- it eats up RAM (we create two instead of one matrix)
See also the discussion on [sage-devel]: http://groups.google.com/group/sage-devel/browse_thread/thread/478fb19ecb724be0
Apply all three patches.
Attachments (3)
Change History (25)
comment:1 Changed 8 years ago by
comment:2 follow-up: ↓ 3 Changed 8 years ago by
Hi Martin,
Thank you for creating the patch!
I don't like that the timings became a little slower for small matrices. We should try to find out why that happens.
I think we should use calloc only when the argument entries
is not a list. Namely, if it is a list then the matrix will be initialised by direct assignments anyway. Hence, in that case there is no need to initialise with calloc.
Did you carefully determine the threshold for which calloc becomes faster than copying? Or is "nrows, ncols > 100" just a guess?
Cheers,
Simon
comment:3 in reply to: ↑ 2 Changed 8 years ago by
Replying to SimonKing:
I don't like that the timings became a little slower for small matrices. We should try to find out why that happens.
We'll also have to do more careful timings to establish there really was a regression. I'm not 100% sure it's not a fluke.
I think we should use calloc only when the argument
entries
is not a list. Namely, if it is a list then the matrix will be initialised by direct assignments anyway. Hence, in that case there is no need to initialise with calloc.
Good point.
Did you carefully determine the threshold for which calloc becomes faster than copying? Or is "nrows, ncols > 100" just a guess?
An educated guess, i.e. I did look at some timings but it's not carefully tuned (and probably machine dependent).
Changed 8 years ago by
comment:4 follow-up: ↓ 5 Changed 8 years ago by
I updated the patch to use malloc() if entries is no list.
comment:5 in reply to: ↑ 4 Changed 8 years ago by
Replying to malb:
I updated the patch to use malloc() if entries is no list.
You mean: Use malloc if (and only if) it is a list. At least that is what you do in the patch (and coincides with what I had done as well).
comment:6 Changed 8 years ago by
Sorry, yes ... where's my head
comment:7 Changed 8 years ago by
Hi Martin,
I have a guess why the patch isn't as quick as it should be: It is the line
return self.__matrix_class(self, 0, coerce=coerce, copy=copy)
If you pass the value 0, then the matrix is initialised as a diagonal matrix with 0 on the diagonal -- which means that time is wasted by initialising the diagonal twice.
Better would be to pass self, None, coerce=coerce, copy=copy
. I will change that and then do some tests, concerning the threshold and also concerning other base rings.
comment:8 Changed 8 years ago by
I was wrong in assuming that entries=0
would provoke a second initialisation of the diagonal.
However, using None as default argument does have a tiny advantage: The test if entries is None or entries==0
is a tad faster if it is actually None.
With entries=None
as default in the __call__
method of matrix spaces, timings do improve, and I obtain a threshold nrows==ncols==40
: For larger matrices, over GF(p) with p>2, creation from scratch is faster, while for smaller matrices copying is faster.
However, with the original patch, I find a big surprise: Over GF(2), copying is always faster! For GF(p), p>2, I can confirm your statement that the threshold seems to be around 100.
There is one problem with default entries=None
, though: Some matrix classes (used for matrices over the reals or symbolics) don't accept entries=None
. So, their init methods should be adapted as well.
Question: Is it worth it?
Using None, I obtain for example:
sage: timeit("M = matrix(GF(2),30,30)", number=10000) 10000 loops, best of 3: 105 µs per loop sage: timeit("M = matrix(GF(2),300,300)", number=10000) 10000 loops, best of 3: 107 µs per loop sage: timeit("M = matrix(GF(3),30,30)", number=10000) 10000 loops, best of 3: 102 µs per loop sage: timeit("M = matrix(GF(3),300,300)", number=10000) 10000 loops, best of 3: 210 µs per loop
Using 0, I obtain:
sage: timeit("M = matrix(GF(2),30,30)", number=10000) 10000 loops, best of 3: 113 µs per loop sage: timeit("M = matrix(GF(2),300,300)", number=10000) 10000 loops, best of 3: 123 µs per loop sage: timeit("M = matrix(GF(3),30,30)", number=10000) 10000 loops, best of 3: 102 µs per loop sage: timeit("M = matrix(GF(3),300,300)", number=10000) 10000 loops, best of 3: 215 µs per loop
The difference is small, but it seems quite reproducible (not flaky) for me. So, my suggestion is to go through the matrix classes and allow entries=None
.
comment:9 follow-ups: ↓ 11 ↓ 14 Changed 8 years ago by
- Status changed from new to needs_review
I made all matrices accept "None" as an argument.
I tried to determine the threshold for copying versus creating a zero matrix, using the following test function:
sage: def tests(R): ....: BetterCopy = [] ....: for d in srange(10,511,20): ....: MS = MatrixSpace(R,d) ....: MS._copy_zero = True ....: globals()['d'] = d ....: globals()['R'] = R ....: TCopy = timeit.eval("M = matrix(R,d,d)",number=1000) ....: MS._copy_zero = False ....: TNoCopy = timeit.eval("M = matrix(R,d,d)",number=1000) ....: if TCopy.stats[3] < TNoCopy.stats[3]: ....: BetterCopy.append((d,TCopy,TNoCopy)) ....: return BetterCopy ....:
It turns out that with both patches applied, copying is not a good idea over GF(2) and over the rationals:
sage: tests(GF(2)) [] sage: tests(QQ) []
Note that I observed that the memory consumption for the test over the rationals constantly increased: Perhaps there is a memory leak.
However, in all other cases I studied, copying was better, for matrices of up to 30 rows and columns:
sage: tests(GF(3)) [(10, 1000 loops, best of 3: 48.2 µs per loop, 1000 loops, best of 3: 49 µs per loop), (30, 1000 loops, best of 3: 50 µs per loop, 1000 loops, best of 3: 50.9 µs per loop)] sage: tests(GF(1019)) [(10, 1000 loops, best of 3: 48.3 µs per loop, 1000 loops, best of 3: 49.2 µs per loop)] sage: tests(GF(625,'a')) [(10, 1000 loops, best of 3: 45.1 µs per loop, 1000 loops, best of 3: 63.1 µs per loop), (30, 1000 loops, best of 3: 62.9 µs per loop, 1000 loops, best of 3: 68 µs per loop), (230, 1000 loops, best of 3: 1.15 ms per loop, 1000 loops, best of 3: 367 µs per loop), (250, 1000 loops, best of 3: 1.35 ms per loop, 1000 loops, best of 3: 423 µs per loop), (270, 1000 loops, best of 3: 1.71 ms per loop, 1000 loops, best of 3: 618 µs per loop), (290, 1000 loops, best of 3: 1.95 ms per loop, 1000 loops, best of 3: 701 µs per loop), (310, 1000 loops, best of 3: 2.23 ms per loop, 1000 loops, best of 3: 791 µs per loop), (330, 1000 loops, best of 3: 2.52 ms per loop, 1000 loops, best of 3: 883 µs per loop), (350, 1000 loops, best of 3: 2.82 ms per loop, 1000 loops, best of 3: 988 µs per loop)] sage: tests(ZZ) [(10, 625 loops, best of 3: 48.6 µs per loop, 625 loops, best of 3: 56.5 µs per loop), (30, 625 loops, best of 3: 111 µs per loop, 625 loops, best of 3: 118 µs per loop), (50, 625 loops, best of 3: 230 µs per loop, 625 loops, best of 3: 235 µs per loop)]
Therefore, I modified the threshold in _copy_zero
a bit. There is no special case for Matrix_modn_dense
, but for Matrix_rational_dense
.
Here are timings similar to the ones you did above (adding another test case).
With both patches:
sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 105 µs per loop 625 loops, best of 3: 108 µs per loop 625 loops, best of 3: 116 µs per loop sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 103 µs per loop 625 loops, best of 3: 403 µs per loop 625 loops, best of 3: 1.3 ms per loop sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 198 µs per loop 125 loops, best of 3: 1.97 ms per loop 125 loops, best of 3: 7.29 ms per loop
Without patches:
sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 108 µs per loop 625 loops, best of 3: 116 µs per loop 625 loops, best of 3: 137 µs per loop sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 103 µs per loop 625 loops, best of 3: 566 µs per loop 125 loops, best of 3: 2.46 ms per loop sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 225 µs per loop 125 loops, best of 3: 5.95 ms per loop 25 loops, best of 3: 22.9 ms per loop
So, with both patches applied, the timings consistently improve.
Reviewing
I am not sure how the reviewing should be done, as we are both authors. Do you think it is ok if I review the first patch and you review the second patch? I need to run all long doctests, though; so far, I only did the tests in sage/matrix.
TODO
On a different ticket, the potential memory leak for rational dense matrices should be studied.
I recall that in some arithmetic routines of matrices, zero matrices are created at the beginning. I am not sure whether that is done "the official way" or by always copying a zero matrix, but I think it was the latter. That might be worth while to fix as well.
comment:10 Changed 8 years ago by
PS: It seems to me that several doc tests from sage/matrix are considerably faster with the patches. I hope that can be confirmed.
comment:11 in reply to: ↑ 9 Changed 8 years ago by
Replying to SimonKing:
I made all matrices accept "None" as an argument.
Cool, you are fast!
Reviewing
I am not sure how the reviewing should be done, as we are both authors. Do you think it is ok if I review the first patch and you review the second patch? I need to run all long doctests, though; so far, I only did the tests in sage/matrix.
Yes, as far as I know it's okay if we cross-review. So I'll take a look at your patch.
comment:12 Changed 8 years ago by
Here are timings on my machine (Macbook Pro, i7):
no patch
sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 98.1 µs per loop 625 loops, best of 3: 115 µs per loop 625 loops, best of 3: 130 µs per loop sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 93.3 µs per loop 625 loops, best of 3: 538 µs per loop 125 loops, best of 3: 2.41 ms per loop sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 189 µs per loop 125 loops, best of 3: 3.7 ms per loop 25 loops, best of 3: 14.1 ms per loop
first only
sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 104 µs per loop 625 loops, best of 3: 108 µs per loop 625 loops, best of 3: 112 µs per loop sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 95.5 µs per loop 625 loops, best of 3: 442 µs per loop 625 loops, best of 3: 1.48 ms per loop sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 184 µs per loop 125 loops, best of 3: 3.7 ms per loop 25 loops, best of 3: 13.8 ms per loop
with both
sage: for n in (50,500,1000): timeit('A=matrix(GF(2),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 94.5 µs per loop 625 loops, best of 3: 98.9 µs per loop 625 loops, best of 3: 108 µs per loop sage: for n in (50,500,1000): timeit('A=matrix(GF(3),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 95 µs per loop 625 loops, best of 3: 438 µs per loop 625 loops, best of 3: 1.49 ms per loop sage: for n in (50,500,1000): timeit('A=matrix(GF(625,"a"),%d,%d)'%(n,n)) ....: 625 loops, best of 3: 176 µs per loop 125 loops, best of 3: 1.74 ms per loop 125 loops, best of 3: 6.24 ms per loop
My take: it's definitely faster for large-ish matrices and undecided for small matrices, i.e. my timings for 50x50 are rather unstable.
comment:13 Changed 8 years ago by
Sorry, I had to replace my patch, since apparently I forgot to change the doctests of _copy_zero. Actually, the new patch version adds some tests, as there is now a special case for matrices over the rationals.
FWIW, with the old patch version all long tests except matrix_space.py pass, and with the new patch version (that only fixes the failing doctest) matrix_space.py passes as well.
comment:14 in reply to: ↑ 9 Changed 8 years ago by
Replying to SimonKing:
I recall that in some arithmetic routines of matrices, zero matrices are created at the beginning. I am not sure whether that is done "the official way" or by always copying a zero matrix, but I think it was the latter. That might be worth while to fix as well.
I found it. It is the _multiply_strassen
method in matrix2.pyx
, which says:
if self.is_sparse(): output = self.matrix_space(self._nrows, right._ncols, sparse = True)(0) else: output = self.matrix_space(self._nrows, right._ncols, sparse = False).zero_matrix().__copy__()
I think that ought to be improved, but that would better be on another ticket.
comment:15 follow-up: ↓ 16 Changed 8 years ago by
I added a third patch, that applies the faster way of creating a zero matrix to _multiply_strassen
. With the third patch, all tests from sage/matrix still pass. I have no benchmark that shows a noticeable difference to the old behaviour. But I think it won't hurt to use an optimised procedure for the creation of a zero matrix, rather than to hard-wire the copying of a zero matrix in the current version of _multiply_strassen.
I think that the third patch is small enough to be discussed here. However, note that there already is an open ticket on improving matrix multiplication: #8096. There is no activity for 16 months now! I think it would be worth while to invest some work there. It seems to me that a part of the issues of that old ticket are already resolved.
Concerning reviews: I believe that the first patch is fine, and I give it a positive review. I have already mentioned that the tests pass, and the improvement of the timings (in particular when the second patch is added) is clear.
comment:16 in reply to: ↑ 15 Changed 8 years ago by
- Reviewers set to Simon King, Martin Albrecht
- Status changed from needs_review to positive_review
Concerning reviews: I believe that the first patch is fine, and I give it a positive review. I have already mentioned that the tests pass, and the improvement of the timings (in particular when the second patch is added) is clear.
I give your two patches a positive review as well.
comment:17 Changed 8 years ago by
Note that the patch for Strassen-Winograd multiplication at #11610 benefits from the patches here as well, because many matrix spaces are created during the iteration.
comment:18 follow-up: ↓ 19 Changed 8 years ago by
- Status changed from positive_review to needs_info
Please state which patches have to be applied...
comment:19 in reply to: ↑ 18 ; follow-up: ↓ 21 Changed 8 years ago by
- Status changed from needs_info to needs_review
Replying to jdemeyer:
Please state which patches have to be applied...
If nothing is stated then, I thought, all patches are to be applied.
I used the occasion to replace the third patch: A commit message was missing.
comment:20 Changed 8 years ago by
- Status changed from needs_review to positive_review
... and back to positive review.
comment:21 in reply to: ↑ 19 Changed 8 years ago by
- Description modified (diff)
Replying to SimonKing:
If nothing is stated then, I thought, all patches are to be applied.
For safety, I prefer not to make any such assumptions. In the few cases where it is unclear (like here), I prefer simply asking.
comment:22 Changed 8 years ago by
- Merged in set to sage-4.7.2.alpha1
- Resolution set to fixed
- Status changed from positive_review to closed
Timings on my machine:
with patch
without patch