Opened 7 years ago
Last modified 7 years ago
#13400 new enhancement
Use strong caches diligently
Reported by: | nbruin | Owned by: | robertwb |
---|---|---|---|
Priority: | major | Milestone: | sage-wishlist |
Component: | coercion | Keywords: | |
Cc: | nthiery, jpflori | Merged in: | |
Authors: | Reviewers: | ||
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | Stopgaps: |
Description
With tickets #715, #11521, #12215, #12313, parent structures don't have eternal life anymore the way they used to have. That affects run-times, because chances are much higher now that the parent has to be rebuilt, whereas before, it was just looked up in a cache. It would be nice if we can mitigate much of this.
(filed under coercion because that seems to be one of the main components in this)
Attachments (3)
Change History (36)
comment:1 Changed 7 years ago by
comment:2 Changed 7 years ago by
- Summary changed from Investigate ways to gain speed in basic constructions to Use strong caches diligently
I think the ticket description is a bit too broad. I hope you don't mind that I change the title? If you do, please change back.
The reason for the name change I'm suggesting is the observation that quite some objects are created during startup of Sage, and are then garbage collected before even the Sage prompt appears. I think that this indicates potential trouble. I find the following with sage-5.2.beta3+#12313 and its dependencies:
When Sage starts, 37 different objects appearing as values in a weak value dictionary get removed. Some of the deleted values occur 3 or 8 times. In particular, the "Groupoid with underlying set Symbolic Ring" is deleted 3 times, and the "Groupoid with underlying set Rational Field" is deleted 8 times, before sage even starts. In addition, both CIF['x']
and CC['x']
are created and deleted twice.
Perhaps one should emphasize that UniqueRepresentation
and UniqueFactory
actually were supposed to have a weak cache, according to the documentation. Nevertheless, some code relies on the fact (the bug) that the documentation was misleading and the cache has been strong.
Would it be useful to have a switch for TripleDict
and MonoDict
, such as strong_cache_on()
, strong_cache_off()
? The idea is that after calling strong_cache_on()
, every item put into a TripleDict
or MonoDict
and every item created by weak_cached_function
(such as UniqueRepresentation
) and every item created with UniqueFactory
is prevented from deletion by adding a strong reference, and strong_cache_off()
deletes the strong references created since last calling strong_cache_on()
, so that only the weak references remain and stuff becomes collectable?
Perhaps that could even become a context, so that the computation of the echelon form could be like
with strong_cache: <the multimodular code>
Concerning startup, I think 37 objects are easy to take care of. Perhaps one should add a strong reference to them in the modules creating them.
comment:3 follow-up: ↓ 4 Changed 7 years ago by
Since the profile above seems that an insane number of calls to is_subcategory
is to blame for at least half a second, I took a look at it. As expected, it's recursive! And if you put a little print statement
print "is_subcategory(self=%s,C=%s)"%(self,C)
it's pretty scary to see how often the system goes off on a investigation of completely trivial questions. For instance, just the line
N=20; matrix(QQ,N,N,srange(N*N)).echelon_form()
triggers 140 lines like
is_subcategory(self=Category of subquotients of monoids,C=Category of commutative rings) is_subcategory(self=Category of subquotients of semigroups,C=Category of commutative rings) is_subcategory(self=Category of subquotients of magmas,C=Category of commutative rings) is_subcategory(self=Category of subquotients of sets,C=Category of commutative rings) ... is_subcategory(self=Category of subquotients of semigroups,C=Category of fields) is_subcategory(self=Category of subquotients of magmas,C=Category of fields) is_subcategory(self=Category of subquotients of sets,C=Category of fields) is_subcategory(self=Category of quotients of sets,C=Category of fields) is_subcategory(self=Category of subquotients of sets,C=Category of fields)
I'm assuming you'll have to try *really* hard to get the category system to produce an arbitrarily long list of categories and that the is_subcategory
relation doesn't change for the lifetime of a category. Then making this routine caching would help a lot and only cost a fixed amount of memory. Indeed, when you do that, you quickly get NO PRINT STATEMENT AT ALL anymore, so it seems that at least my first assumption is about correct. And it DOES make a big difference. For the code above, we now get:
5.3b2 + tickets + caching is_subcategory:
1236719 function calls (1202726 primitive calls) in 4.832 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1000 2.036 0.002 4.431 0.004 {method 'echelon_form' of 'sage.matrix.matrix_rational_dense.Matrix_rational_dense' objects} 4993 0.327 0.000 0.572 0.000 matrix_space.py:230(__init__) 1981 0.242 0.000 0.308 0.000 quotient_ring.py:322(__init__) 5000 0.184 0.000 0.903 0.000 matrix_space.py:1180(matrix) 6000 0.180 0.000 0.182 0.000 dynamic_class.py:259(dynamic_class_internal) 1981 0.103 0.000 0.120 0.000 {method 'ideal' of 'sage.rings.ring.Ring' objects} 16143 0.090 0.000 0.098 0.000 {method 'is_prime' of 'sage.rings.integer.Integer' objects} 1000 0.085 0.000 0.085 0.000 {method 'range' of 'sage.rings.integer_ring.IntegerRing_class' objects} 5943 0.082 0.000 0.251 0.000 category.py:112(_join) 17960/5967 0.074 0.000 0.558 0.000 lazy_attribute.py:506(__get__) 1 0.069 0.069 4.834 4.834 <ipython console>:1(test1) 14162 0.062 0.000 0.189 0.000 arith.py:401(is_prime) 10000 0.061 0.000 0.695 0.000 matrix_space.py:154(__classcall__) 127261 0.054 0.000 0.054 0.000 {isinstance} 42822 0.050 0.000 0.064 0.000 weakref.py:55(__getitem__) 2974 0.048 0.000 0.245 0.000 {sage.rings.ring.is_Field} 2000 0.045 0.000 0.215 0.000 arith.py:1051(previous_prime) 1993 0.041 0.000 0.148 0.000 matrix_space.py:1133(zero_matrix) 1981 0.040 0.000 0.614 0.000 integer_mod_ring.py:191(__init__) 6000/1000 0.035 0.000 0.489 0.000 category.py:1023(parent_class) 7000 0.035 0.000 0.045 0.000 category_types.py:261(__init__) 3974 0.034 0.000 0.067 0.000 sageinspect.py:917(sage_getargspec) 4993 0.033 0.000 0.059 0.000 matrix_space.py:904(_get_matrix_class) 18936 0.032 0.000 0.074 0.000 weakref.py:79(__setitem__) 2000 0.032 0.000 0.052 0.000 sequence.py:86(Sequence) 12000 0.028 0.000 0.057 0.000 misc.py:179(cputime) 53487 0.027 0.000 0.053 0.000 category.py:148(<genexpr>) 12993/9993 0.026 0.000 0.635 0.000 unique_representation.py:452(__classcall__) 993 0.025 0.000 0.079 0.000 homset.py:80(Hom) 14162 0.025 0.000 0.030 0.000 all.py:1(arithmetic) 24894 0.025 0.000 0.025 0.000 weakref.py:228(__init__) 47544 0.024 0.000 0.046 0.000 category.py:150(<genexpr>) 993 0.022 0.000 0.046 0.000 homset.py:353(__init__) 12000 0.021 0.000 0.021 0.000 {resource.getrusage} 1000 0.019 0.000 0.156 0.000 misc.py:1051(srange) 12993/9993 0.018 0.000 0.614 0.000 {sage.misc.classcall_metaclass.typecall} 37639 0.018 0.000 0.043 0.000 category.py:1102(is_subcategory) 26875 0.017 0.000 0.027 0.000 weakref.py:223(__new__) 1000 0.017 0.000 0.177 0.000 constructor.py:34(matrix)
That shaves about 0.7
seconds off!
The next big one seems to be matrix_space.py:230(__init__)
which is costing us 0.572
cumulative and seems absent on plain 5.3b2. That's correct, because all these matrix spaces that are used for multimodular computation are now destroyed an reconstructed everytime, whereas before they remained in permanent cache. Setting gc.disable()
helps a bit:
1186440 function calls (1152569 primitive calls) in 4.464 seconds ... 3996 0.132 0.000 0.367 0.000 matrix_space.py:230(__init__) ...
so apparently some spaces get stuck in cycles, but not all.
comment:4 in reply to: ↑ 3 Changed 7 years ago by
Replying to nbruin:
Since the profile above seems that an insane number of calls to
is_subcategory
is to blame for at least half a second, I took a look at it. As expected, it's recursive!
That shouldn't be the problem, because ...
I'm assuming you'll have to try *really* hard to get the category system to produce an arbitrarily long list of categories and that the
is_subcategory
relation doesn't change for the lifetime of a category. Then making this routine caching would help a lot and only cost a fixed amount of memory.
... it is cached, mainly.
Do we talk about the same version of sage? Namely, with #11943, the set of super categories is created once and for all, stored as an attribute, and then testing subcategories just amounts to testing set containment. And before merging #11943, I think is_subcategory
has even been a cached_method, by #11900 (but I could be mistaken here, I didn't verify it), so that the work also just amounted to a dictionary lookup.
Before looking into the set of supercategories, self.is_subcategory(c)
calls c._subcategory_hook_(self)
. The idea is that c._subcategory_hook_
is a quick (!!) test that allows to avoid the creation of the whole set of supercategories (which can be a bottle neck) in many cases. See #11943 for examples where that helps.
Since _subcategory_hook_
is not mentioned in your profiling, I assume that it isn't problematic. And then, self.is_subcategory
can hardly be problematic either.
Actually I am pretty much sure that the categories involved in that computation are all different (namely, algebras over different base rings or so). If they are all different, then what could be done about it? One could improve the _subcategory_hook_
, or (if creation of parent_class is the problem) use #11935 (whose patch currently needs to be rebased).
Indeed, when you do that, you quickly get NO PRINT STATEMENT AT ALL anymore, so it seems that at least my first assumption is about correct.
The question is: Where did you put the print statement? Namely, the actual work is done in _subcategory_hook_
and in the lazy attribute _set_of_supercategories
. Can you include the print statement there, rather than in is_subcategory?
The next big one seems to be
matrix_space.py:230(__init__)
which is costing us0.572
cumulative and seems absent on plain 5.3b2. That's correct, because all these matrix spaces that are used for multimodular computation are now destroyed an reconstructed everytime, whereas before they remained in permanent cache.
My impression is that this is more of a problem then is_subcategory
. And it illustrates what I meant when I changed the title of this ticket: The matrix spaces should be strongly cached during echelon computation, because they will be used repeatedly.
The question is whether we look into the code and do that manually. Say, by creating a list of all matrix spaces occurring in the computation, where the list either resides on module level (if the matrix spaces are used in different functions/methods) or locally (if there is just a single function creating these matrix spaces).
Or perhaps it is easier to create a context that prevents matrix spaces (or all weak values) to be temporarily strongly cached.
comment:5 follow-up: ↓ 6 Changed 7 years ago by
Putting a print statement into the lazy attribute _set_of_super_categories
, the command
N=20; matrix(QQ,N,N,srange(N*N)).echelon_form()
actually prints nothing. Hence, all the work for the subcategory test seems to be done in the _subcategory_hook_
methods. So, we should focus on them.
comment:6 in reply to: ↑ 5 Changed 7 years ago by
Replying to SimonKing:
all the work for the subcategory test seems to be done in the
_subcategory_hook_
methods. So, we should focus on them.
That said: There are precisely two _subcategory_hook_
methods involved in that computation: The default one of Category, and the one of JoinCategory
. I don't see how they could be improved, except by cythonisation.
comment:7 Changed 7 years ago by
Here is the effect of cythonising the subcategory hook:
# This is the cython version of the default _subcategory_hook_: sage: cython(""" ....: include "../ext/python_type.pxi" ....: def subcat_hook(self,c): ....: return PyType_IsSubtype(c.parent_class, self.parent_class) ....: """) sage: C = Algebras(ZZ) sage: R = Rings() sage: C._subcategory_hook_(R) False sage: R._subcategory_hook_(C) True sage: subcat_hook(C,R) False sage: subcat_hook(R,C) True sage: %timeit C._subcategory_hook_(R) 625 loops, best of 3: 1.94 µs per loop sage: %timeit R._subcategory_hook_(C) 625 loops, best of 3: 2.09 µs per loop sage: %timeit subcat_hook(C,R) 625 loops, best of 3: 846 ns per loop sage: %timeit subcat_hook(R,C) 625 loops, best of 3: 816 ns per loop
comment:8 Changed 7 years ago by
Here is how one can cythonise the default _subcategory_hook_
. The time for testing subcategories drops clearly. Without patch:
sage: C = Algebras(ZZ) sage: R = Rings() sage: %timeit R.is_subcategory(C) 625 loops, best of 3: 3.14 µs per loop sage: %timeit C.is_subcategory(R) 625 loops, best of 3: 3.4 µs per loop
With patch:
sage: C = Algebras(ZZ) sage: R = Rings() sage: %timeit R.is_subcategory(C) 625 loops, best of 3: 1.97 µs per loop sage: %timeit C.is_subcategory(R) 625 loops, best of 3: 2.1 µs per loop
However, that doesn't much improve the performance of
sage: def test1(N,B,t): ....: for i in range(t): ....: m=matrix(QQ,N,N,srange(N*N)).echelon_form() ....:
which (together with the fact that _subcategory_hook_ does all the work in this example) shows that the problem is not in the subcategory test. Hence, my patch is off-topic.
comment:9 follow-up: ↓ 13 Changed 7 years ago by
- Cc nthiery added
I just notice: Apparently in your example, the is_subcategory function is not the usual one (sage/categories/category.py), but the one in sage/categories/covariant_functorial_construction.py.
It says
def is_subcategory(self, C): """ .. todo:: doctests + explain why this method is needed """ if C is self: return True return any(X.is_subcategory(C) for X in self._super_categories)
Actually I wonder why is it needed, myself.
Nicolas, is it not possible to use the default is_subcategory, by #11943?
Anyway, I think the real speed-up will not come from is_subcategory, but from a more intelligent caching.
comment:10 Changed 7 years ago by
The following is on top of sage-5.2.beta3 plus #12313 and its dependencies.
When I remove the custom is_subcategory method in covariant_functorial_constructions.pyx, and apply trac_13400_subclass_hook_cython.patch, I get
sage: def test1(N,B,t): ....: for i in range(t): ....: m=matrix(QQ,N,N,srange(N*N)).echelon_form() ....: sage: %time test1(20,40,10^3) CPU times: user 7.04 s, sys: 0.06 s, total: 7.09 s Wall time: 7.11 s
When I apply my patch but keep the covariant functorial constructions as they are, I get
sage: def test1(N,B,t): ....: for i in range(t): ....: m=matrix(QQ,N,N,srange(N*N)).echelon_form() ....: sage: %time test1(20,40,10^3) CPU times: user 7.38 s, sys: 0.06 s, total: 7.44 s Wall time: 7.45 s
When I do nothing on top of #12313, I get
sage: def test1(N,B,t): ....: for i in range(t): ....: m=matrix(QQ,N,N,srange(N*N)).echelon_form() ....: sage: %time test1(20,40,10^3) CPU times: user 7.58 s, sys: 0.03 s, total: 7.60 s Wall time: 7.63 s
Again, it isn't exactly a break-through. I didn't time it pre-#12313, yet.
comment:11 follow-up: ↓ 15 Changed 7 years ago by
Replying to SimonKing:
... it is cached, mainly.
Do we talk about the same version of sage? Namely, with #11943,
In that case you might want to check if that ticket was properly merged in the
5.3b2 source, as available in sage.math.washington.edu:/home/release
.
What I see is
sage/categories/covariant_functorial_construction.py:359
: (EDIT: removed @cached_method
line which was my addition that I accidentally copied here as well)
def is_subcategory(self, C): """ .. todo:: doctests + explain why this method is needed """ if C is self: return True return any(X.is_subcategory(C) for X in self._super_categories)
if I add print("is_subcategory(self=%s,C=%s)"%(self,C)")
as a first statement
I get the lines as listed above.
[EDIT: removed, because I gave a sample above]
a lot of them, and as you can see, all "static" categories. None of those funny
dynamically generated ones. If I decorate with @cached_method
I get these
lines at startup but not afterwards anymore.
make ptest
improves (5.3b2 + tickets + cached is_subcategory):
Total time for all tests: 989.4 seconds Total time for all tests: 994.4 seconds
multimodular echelonization loop (5.3b2 + tickets + cached is_subcategory):
1263618 function calls (1229625 primitive calls) in 5.191 seconds
The profile looks as above. I'm puzzled as to why I'm no consistently getting 2-3 seconds worse than before. Did some recompilation put things in unfortunate spots?
Strongly caching matrix spaces: Wouldn't help much with the current design. The primes used in some of the multimodular steps are randomized, so even when doing the same computation, you'll end up with other rings.
It might be an idea to keep a cache of good fields for multimodular stuff, that get used before trying random primes. That at least saves creating the fields. Caching the matrix spaces might be painful, since you don't know which dimensions you'll need...
comment:12 Changed 7 years ago by
Hm, simply deleting is_subclass
from CovariantCategory
leads to a similar speedup:
multimodular echelonization loop 5.3b2 + tickets + removed CovariantCategory.is_subcategory
:
1341369 function calls (1307374 primitive calls) in 4.993 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1000 2.070 0.002 4.590 0.005 {method 'echelon_form' of 'sage.matrix.matrix_rational_dense.Matrix_rational_dense' objects} 2320 0.427 0.000 0.510 0.000 quotient_ring.py:322(__init__) 6089 0.187 0.000 0.188 0.000 dynamic_class.py:259(dynamic_class_internal) 5000 0.185 0.000 0.919 0.000 matrix_space.py:1180(matrix) 4996 0.177 0.000 0.424 0.000 matrix_space.py:230(__init__) 2320 0.121 0.000 0.140 0.000 {method 'ideal' of 'sage.rings.ring.Ring' objects} 16466 0.091 0.000 0.099 0.000 {method 'is_prime' of 'sage.rings.integer.Integer' objects} 6618 0.087 0.000 0.275 0.000 category.py:112(_join) 1000 0.084 0.000 0.084 0.000 {method 'range' of 'sage.rings.integer_ring.IntegerRing_class' objects} 18305/6310 0.079 0.000 0.576 0.000 lazy_attribute.py:506(__get__) 1 0.071 0.071 4.995 4.995 <ipython console>:1(test1) 14488 0.063 0.000 0.192 0.000 arith.py:401(is_prime) 44850 0.057 0.000 0.073 0.000 weakref.py:55(__getitem__) 129931 0.053 0.000 0.053 0.000 {isinstance} 10000 0.051 0.000 0.542 0.000 matrix_space.py:154(__classcall__) 2973 0.048 0.000 0.249 0.000 {sage.rings.ring.is_Field} 2320 0.047 0.000 0.870 0.000 integer_mod_ring.py:191(__init__) 2000 0.046 0.000 0.219 0.000 arith.py:1051(previous_prime) 1995 0.041 0.000 0.154 0.000 matrix_space.py:1133(zero_matrix) 6000/1000 0.036 0.000 0.496 0.000 category.py:1023(parent_class) 7000 0.035 0.000 0.046 0.000 category_types.py:261(__init__) 3973 0.035 0.000 0.068 0.000 sageinspect.py:917(sage_getargspec) 19703 0.035 0.000 0.079 0.000 weakref.py:79(__setitem__) 4996 0.034 0.000 0.059 0.000 matrix_space.py:904(_get_matrix_class) 2000 0.032 0.000 0.052 0.000 sequence.py:86(Sequence) 69720 0.032 0.000 0.073 0.000 category.py:1102(is_subcategory) 12000 0.029 0.000 0.057 0.000 misc.py:179(cputime) 67400 0.028 0.000 0.041 0.000 category.py:622(_subcategory_hook_) 12996/9996 0.026 0.000 0.489 0.000 unique_representation.py:452(__classcall__) 995 0.026 0.000 0.084 0.000 homset.py:80(Hom) 995 0.026 0.000 0.051 0.000 homset.py:353(__init__) 14488 0.026 0.000 0.031 0.000 all.py:1(arithmetic) 25673 0.025 0.000 0.025 0.000 weakref.py:228(__init__) 12000 0.021 0.000 0.021 0.000 {resource.getrusage} 57510 0.019 0.000 0.057 0.000 category.py:148(<genexpr>) 12996/9996 0.019 0.000 0.467 0.000 {sage.misc.classcall_metaclass.typecall} 27993 0.019 0.000 0.029 0.000 weakref.py:223(__new__) 1000 0.018 0.000 0.155 0.000 misc.py:1051(srange) 50892 0.018 0.000 0.053 0.000 category.py:150(<genexpr>)
As you can see, it's really the construction of the quotient rings now that's expensive (i.e., the finite fields). That would get better with having a multimodular fields cache.
comment:13 in reply to: ↑ 9 Changed 7 years ago by
Replying to SimonKing:
I just notice: Apparently in your example, the is_subcategory function is not the usual one (sage/categories/category.py), but the one in sage/categories/covariant_functorial_construction.py.
It says
def is_subcategory(self, C): """ .. todo:: doctests + explain why this method is needed """ if C is self: return True return any(X.is_subcategory(C) for X in self._super_categories)Actually I wonder why is it needed, myself.
Nicolas, is it not possible to use the default is_subcategory, by #11943?
Oops, honestly I don't remember at all; I should have done this todo! I am actually surprised that it went into Sage with this comment in.
In any cases: if all tests pass after removing this method, please go ahead.
Cheers,
Nicolas
comment:14 follow-up: ↓ 16 Changed 7 years ago by
ptest with CovariantFunctor?.is_subcategory removed:
All tests passed! Total time for all tests: 992.9 seconds
Main thing now would be to see if the next one on the list,
2320 0.427 0.000 0.510 0.000 quotient_ring.py:322(__init__)
needs to cost as much as it does. And if the general one has to, whether the path should be shortcut for creating finite fields.
comment:15 in reply to: ↑ 11 Changed 7 years ago by
Replying to nbruin:
What I see is
sage/categories/covariant_functorial_construction.py:359
:
Yes, I was talking about the is_subcategory method in sage/categories/category.py. I didn't know that the covariant functorial constructions have their own method (which may be deletable, according to Nicolas, provided tests pass)
@cached_method def is_subcategory(self, C): """ .. todo:: doctests + explain why this method is needed """ if C is self: return True return any(X.is_subcategory(C) for X in self._super_categories)
Just for clarification: The @cached_method is your addition, isn't it?
Anyway. If deleting the method is an option, then we should try it. After all, it isn't cached at all, in contrast to what is done in the default is_subcategory method. Plus, if it turns out that my experimental patch gives more speed, I may consider to make it a proper patch.
comment:16 in reply to: ↑ 14 ; follow-up: ↓ 17 Changed 7 years ago by
Replying to nbruin:
ptest with CovariantFunctor?.is_subcategory removed:
All tests passed! Total time for all tests: 992.9 seconds
OK, then I think we should remove it.
Main thing now would be to see if the next one on the list,
2320 0.427 0.000 0.510 0.000 quotient_ring.py:322(__init__)needs to cost as much as it does. And if the general one has to, whether the path should be shortcut for creating finite fields.
If I remember correctly, improving the initialisation of finite fields was part of #11900.
We could consider to put stuff from sage.rings.quotient_rings.QuotientRing_generic.__init__
or QuotientRing_nc.__init__
directly into sage.rings.finite_rings.integer_mod_ring.IntegerModRing_generic.__init__
and sage.rings.finite_rings.finite_field_prime_modn.FiniteField_prime_modn.__init__
.
And we should test whether "small" finite fields shouldn't better be cached. I see that the elements of an IntegerModRing
of order less than 500 are cached. But the rings themselves are not cached:
sage: id(IntegerModRing(123)) 79926512 sage: import gc sage: gc.collect() 673 sage: id(IntegerModRing(123)) 78943376
versus
sage: R = IntegerModRing(123) sage: R(12) is R(12) True
Hence, I suggest that IntegerModRing.__init__
puts the ring into a strong cache, if it would also cache the elements.
comment:17 in reply to: ↑ 16 Changed 7 years ago by
Replying to SimonKing:
We could consider to put stuff from
sage.rings.quotient_rings.QuotientRing_generic.__init__
orQuotientRing_nc.__init__
directly intosage.rings.finite_rings.integer_mod_ring.IntegerModRing_generic.__init__
andsage.rings.finite_rings.finite_field_prime_modn.FiniteField_prime_modn.__init__
.
Yes, if that speeds things up that might benefit quite a bit of sage. Almost any considerable computation over ZZ (and, by first clearing denominators, over QQ) benefits enormously from a multimodular approach, and I expect a lot of computations in sage to already make use of that. It would be great if these can be computed using the general framework, so construction of finite fields and conversion between ZZ and GF(p) should be blindingly fast. Basically anything goes. This case is different from all other rings *because* it is so close to the metal.
Hence, I suggest that
IntegerModRing.__init__
puts the ring into a strong cache, if it would also cache the elements.
With a prudently chosen bound, this would probably not lead to excessive memory use. I'm wondering whether there's any benefit, though. For elliptic curve computations and other arithmetic-geometric things there might be: One often has to look at the reduction modulo *all* primes below a certain bound, so if one does so repeatedly, the gain can be substantial. This might explain why the elliptic curves code is relatively sensitive to these things. Although I think the most heavy duty applications of this get seconded to GP/Pari wholesale anyway, so would not involve sage finite fields at all.
It would not help multimodular stuff at all, though, because there one usually choses primes that are as large as possible while still allowing for the fastest arithmetic on the processor.
Usually one would randomize the choice of moduli, to amortize the cost of choosing "wrong" moduli. However, that's only theoretically wise. For an actual application, when you choose a prime a little below 2^{31} (or whatever you need to still be able to do a full mult and do a mod computation afterwards. REALLY worth delving into your CPU's instruction set to see if a "mult into two 64-bit registers" combined with a "remainder of 128 mod 64 bit number" exist, because it gives you twice the number of bits for almost the same price -- hopefully we are already using libraries that do that for us), it's NEVER going to be a bad one, so using it all the time's not bad (OK, on a 4GHz machine in a computation that takes 10^{6} cycles, it would be happening about 1.25 times a month)
What might be a good idea is to have a
GoodPrimesForMultiModularComputationsIterator()
which could always start with giving primes from a fixed set (in fixed order or not) and then continue with generating new ones. If we strongly cache the fields belonging to those first ones, we'd get most of the benefit at limited memory cost.
Of course it does mean changing all prime selection loops for multimodular stuff to using this iterator.
The next bit is tuning the crossover points between naive/multimodular/padic methods for most of these.
The main thing this does is mitigate the effects if finite field construction cannot be made lightning fast in general. We still need conversion between ZZ and GF to be lightning fast, even if that means special casing this in relevant code paths. It's nice if derived conversions between matrix algebras and polynomial rings would be similarly fast.
comment:18 Changed 7 years ago by
OOPS! Turns out the default algorithm for these matrices is "padic", which is a bad choice as these examples show. It's particularly bad for low rank examples such as this one. If we take a vandermonde matrix, it's a little better
def test1(N,t): L=[[i^j for i in [1..N]] for j in [1..N]] for i in range(t): m=matrix(QQ,N,N,L).echelon_form()
but multimodular is a clear winner. There the fields DO get generated in a deterministic order:
sage.matrix.matrix_modn_dense.MAX_MODULUS p=previous_prime(sage.matrix.matrix_modn_dense.MAX_MODULUS) while ...: ... p = previous_prime(p)
so if it were an issue we should store finite fields for some of those p (and also for sage.matrix.matrix_modn_sparse.MAX_MODULUS
for good measure).
Indeed:
sage: import gc sage: def test1(N,t): ....: L=[[i^j for i in [1..N]] for j in [1..N]] ....: for i in range(t): ....: m=matrix(QQ,N,N,srange(N*N)).echelon_form(algorithm="multimodular") ....: sage: %time test1(20,10^3) CPU times: user 1.43 s, sys: 0.02 s, total: 1.46 s Wall time: 1.47 s sage: gc.collect() 523 sage: %time test1(20,10^3) CPU times: user 1.40 s, sys: 0.02 s, total: 1.42 s Wall time: 1.43 s sage: gc.collect() 523 sage: %time test1(20,10^3) CPU times: user 1.41 s, sys: 0.02 s, total: 1.43 s Wall time: 1.43 s sage: gc.collect() 523 sage: FieldCache=[] sage: N=sage.matrix.matrix_modn_dense.MAX_MODULUS sage: p=previous_prime(N+1) sage: for i in range(20): ....: FieldCache.append(Integers(p)) ....: p=previous_prime(p) ....: sage: %time test1(20,10^3) CPU times: user 1.41 s, sys: 0.01 s, total: 1.43 s Wall time: 1.44 s sage: gc.collect() 523 sage: %time test1(20,10^3) CPU times: user 1.41 s, sys: 0.01 s, total: 1.42 s Wall time: 1.43 s sage: gc.collect() 523 sage: MatrixCache=[MatrixSpace(F,20,20) for F in FieldCache] sage: sage: sage: %time test1(20,10^3) CPU times: user 1.25 s, sys: 0.01 s, total: 1.26 s Wall time: 1.27 s sage: gc.collect() 0 sage: %time test1(20,10^3) CPU times: user 1.25 s, sys: 0.01 s, total: 1.26 s Wall time: 1.26 s sage: gc.collect() 0
so it seems that with Thierry's code fixed, constructing the matrix rings is the only measurable overhead in this. And I don't think we can intelligently cache those.
The REAL surprise is that in the padic variant the primes get selected via this little pearl:
p = previous_prime(rstate.c_random() % (MAX_MODULUS-15000) + 10000)
Random numbers are very expensive! So just changing that (in two places) to a
simple p = previous_prime(p)
reduces the time SIGNIFICANTLY and in practice of
course behaves just as well, unless you have a particularly nasty adversary
(perhaps draw from a pool a bit larger than just the primes in sequence?).
In short, after picking off the is_subcategory
problem, I don't think this
example is telling us much more than that the choice of algorithm in
echelon_form
can use some tuning. Its preference for padic
over
multimodular
might be premature and was probably induced by William's euphoria
about discovering a really neat trick. As he documents, it should be used on
matrices with large entries.
Futhermore, perhaps the implementation of
padic
is placing a little too much emphasis on properly emulating randomness.
That's probably a little less of an issue for matrices for which the algorithm
is suited, but I don't think one gets real benefit out of using such high
quality randomization.
I suspect there are some places we can benefit from more tuned/intelligent coding and caching, this piece of code is not giving huge pointers where ... Perhaps one should profile some examples in elliptic curves and see how much time is spent in coercion/creation of parents.
comment:19 Changed 7 years ago by
OK, if the echelon form thingy could simply be optimized by a better heuristics for choosing between padic and multimodular, and if it has not so much to do with coercion and cache, the question is whether we split the ticket into one for coercion and cache and another one for the echelon form.
Just for information: With #12313, make ptest took 2155.0 seconds on my machine, with an update of the patch from here took 2152.6 seconds, but vanilla sage-5.2.beta3 took only 1991.3 seconds. Hence, there is indeed more work to do, and I think one should start by finding out what tests showed the worst regression.
comment:20 Changed 7 years ago by
In these tests, vanilla is at least one seconds faster than #12313:
sage -t -force_lib devel/sage/sage/plot/graphics.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/cm.py sage -t -force_lib devel/sage/sage/functions/other.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/ell_generic.py sage -t -force_lib devel/sage/sage/rings/polynomial/polynomial_quotient_ring.py sage -t -force_lib devel/sage/sage/categories/coxeter_groups.py sage -t -force_lib devel/sage/doc/en/thematic_tutorials/sandpile.rst sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/sha_tate.py sage -t -force_lib devel/sage/sage/combinat/root_system/weyl_group.py sage -t -force_lib devel/sage/sage/symbolic/integration/integral.py sage -t -force_lib devel/sage/sage/modular/overconvergent/genus0.py sage -t -force_lib devel/sage/sage/modules/free_module.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/ell_egros.py sage -t -force_lib devel/sage/sage/interfaces/maxima_abstract.py sage -t -force_lib devel/sage/sage/tests/french_book/numbertheory.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/gal_reps.py sage -t -force_lib devel/sage/sage/plot/plot3d/plot_field3d.py sage -t -force_lib devel/sage/sage/modular/abvar/abvar.py sage -t -force_lib devel/sage/doc/en/constructions/plotting.rst sage -t -force_lib devel/sage/sage/plot/plot.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/BSD.py sage -t -force_lib devel/sage/sage/symbolic/benchmark.py sage -t -force_lib devel/sage/sage/modular/modsym/subspace.py sage -t -force_lib devel/sage/sage/rings/number_field/number_field_element.pyx sage -t -force_lib devel/sage/sage/plot/plot3d/base.pyx sage -t -force_lib devel/sage/sage/rings/qqbar.py sage -t -force_lib devel/sage/sage/modular/modform/space.py sage -t -force_lib devel/sage/sage/plot/plot3d/parametric_plot3d.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/ell_rational_field.py sage -t -force_lib devel/sage/sage/modular/modform/constructor.py sage -t -force_lib devel/sage/sage/combinat/root_system/pieri_factors.py sage -t -force_lib devel/sage/sage/modular/local_comp/local_comp.py sage -t -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_ideal.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/ell_point.py sage -t -force_lib devel/sage/sage/categories/homset.py sage -t -force_lib devel/sage/sage/rings/number_field/number_field_rel.py sage -t -force_lib devel/sage/sage/geometry/fan.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/padics.py sage -t -force_lib devel/sage/sage/misc/cachefunc.pyx sage -t -force_lib devel/sage/sage/schemes/toric/library.py sage -t -force_lib devel/sage/sage/sandpiles/sandpile.py sage -t -force_lib devel/sage/sage/libs/mwrank/interface.py sage -t -force_lib devel/sage/sage/combinat/sf/hall_littlewood.py sage -t -force_lib devel/sage/sage/matrix/matrix_cyclo_dense.pyx sage -t -force_lib devel/sage/sage/symbolic/expression.pyx sage -t -force_lib devel/sage/sage/algebras/quatalg/quaternion_algebra.py sage -t -force_lib devel/sage/sage/matrix/benchmark.py sage -t -force_lib devel/sage/sage/modular/modsym/ambient.py sage -t -force_lib devel/sage/sage/combinat/tiling.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/ell_curve_isogeny.py sage -t -force_lib devel/sage/sage/modular/modsym/space.py sage -t -force_lib devel/sage/sage/combinat/sf/sfa.py sage -t -force_lib devel/sage/sage/crypto/classical.py sage -t -force_lib devel/sage/sage/gsl/ode.pyx sage -t -force_lib devel/sage/sage/plot/contour_plot.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/period_lattice.py sage -t -force_lib devel/sage/sage/combinat/root_system/weyl_characters.py sage -t -force_lib devel/sage/sage/graphs/generic_graph.py sage -t -force_lib devel/sage/sage/rings/polynomial/polynomial_element.pyx sage -t -force_lib devel/sage/sage/plot/circle.py sage -t -force_lib devel/sage/sage/structure/coerce_dict.pyx sage -t -force_lib devel/sage/sage/modular/quatalg/brandt.py sage -t -force_lib devel/sage/sage/combinat/root_system/weight_space.py sage -t -force_lib devel/sage/sage/modular/modform/element.py sage -t -force_lib devel/sage/sage/modular/local_comp/smoothchar.py sage -t -force_lib devel/sage/sage/plot/polygon.py
In these tests, #12313+#13400 is at least one seconds faster than vanilla:
sage -t -force_lib devel/sage/sage/plot/line.py sage -t -force_lib devel/sage/sage/modular/hecke/submodule.py sage -t -force_lib devel/sage/sage/schemes/toric/variety.py sage -t -force_lib devel/sage/sage/combinat/root_system/weyl_group.py sage -t -force_lib devel/sage/sage/calculus/interpolators.pyx sage -t -force_lib devel/sage/sage/combinat/sf/jack.py sage -t -force_lib devel/sage/sage/combinat/sf/llt.py sage -t -force_lib devel/sage/sage/modular/abvar/abvar.py sage -t -force_lib devel/sage/sage/combinat/sf/macdonald.py sage -t -force_lib devel/sage/doc/en/thematic_tutorials/lie/weyl_character_ring.rst sage -t -force_lib devel/sage/sage/modular/local_comp/type_space.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/ell_modular_symbols.py sage -t -force_lib devel/sage/sage/interfaces/expect.py sage -t -force_lib devel/sage/sage/finance/time_series.pyx sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/ell_number_field.py sage -t -force_lib devel/sage/sage/schemes/hyperelliptic_curves/hyperelliptic_padic_field.py sage -t -force_lib devel/sage/sage/modular/local_comp/local_comp.py sage -t -force_lib devel/sage/sage/tests/cmdline.py sage -t -force_lib devel/sage/sage/modular/overconvergent/hecke_series.py sage -t -force_lib devel/sage/sage/matrix/matrix2.pyx sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/padic_lseries.py sage -t -force_lib devel/sage/sage/combinat/crystals/kirillov_reshetikhin.py sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/heegner.py sage -t -force_lib devel/sage/sage/combinat/words/finite_word.py sage -t -force_lib devel/sage/sage/combinat/partition_algebra.py sage -t -force_lib devel/sage/doc/en/bordeaux_2008/elliptic_curves.rst sage -t -force_lib devel/sage/sage/quadratic_forms/quadratic_form__local_representation_conditions.py sage -t -force_lib devel/sage/sage/rings/number_field/number_field.py sage -t -force_lib devel/sage/sage/graphs/graph_generators.py sage -t -force_lib devel/sage/sage/graphs/graph_list.py
comment:21 Changed 7 years ago by
Perhaps the routine I wrote for extracting the execution times of the individual tests was wrong. I tried again. In the following, I give the time t1 in sage-5.2.beta3 versus the time t2 with #12313 and the patch from here, followed by the name of the test.
In the following tests tests, patched sage is slower by at least 5 seconds with a regression of at least 15% (t2>=1.15*t1
):
18.40 vs. 24.80: sage -t -force_lib devel/sage/sage/modular/hecke/submodule.py 15.10 vs. 24.50: sage -t -force_lib devel/sage/sage/modular/abvar/homspace.py 2.40 vs. 7.90: sage -t -force_lib devel/sage/sage/combinat/combinatorial_algebra.py 22.60 vs. 28.10: sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/gal_reps.py 27.80 vs. 38.30: sage -t -force_lib devel/sage/sage/schemes/toric/chow_group.py 23.80 vs. 30.00: sage -t -force_lib devel/sage/sage/modular/abvar/abvar.py 111.30 vs. 141.60: sage -t -force_lib devel/sage/sage/plot/plot.py 196.30 vs. 247.40: sage -t -force_lib devel/sage/sage/combinat/sf/macdonald.py 22.00 vs. 27.50: sage -t -force_lib devel/sage/sage/combinat/root_system/pieri_factors.py 24.10 vs. 30.10: sage -t -force_lib devel/sage/sage/modular/local_comp/local_comp.py 2.70 vs. 14.80: sage -t -force_lib devel/sage/sage/categories/homset.py 58.00 vs. 71.80: sage -t -force_lib devel/sage/sage/modular/overconvergent/hecke_series.py 20.60 vs. 25.80: sage -t -force_lib devel/sage/sage/matrix/benchmark.py 22.50 vs. 27.50: sage -t -force_lib devel/sage/sage/modular/modsym/ambient.py 38.70 vs. 45.70: sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/padic_lseries.py 23.00 vs. 28.00: sage -t -force_lib devel/sage/sage/modular/modsym/space.py 9.00 vs. 19.00: sage -t -force_lib devel/sage/sage/combinat/sf/sfa.py 29.50 vs. 50.20: sage -t -force_lib devel/sage/sage/combinat/partition_algebra.py 35.50 vs. 44.30: sage -t -force_lib devel/sage/sage/quadratic_forms/quadratic_form__local_representation_conditions.py 2.00 vs. 12.00: sage -t -force_lib devel/sage/sage/structure/coerce_dict.pyx 20.40 vs. 26.20: sage -t -force_lib devel/sage/sage/modular/modform/element.py
Here are the tests that became faster in the patched version by at least 3 seconds, improving by at least 7% (t1>=1.07*t2
):
101.20 vs. 93.40: sage -t -force_lib devel/sage/sage/calculus/riemann.pyx 48.90 vs. 43.80: sage -t -force_lib devel/sage/doc/en/thematic_tutorials/lie/weyl_character_ring.rst 68.30 vs. 41.10: sage -t -force_lib devel/sage/sage/interfaces/expect.py 33.30 vs. 28.70: sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/ell_curve_isogeny.py 67.30 vs. 54.90: sage -t -force_lib devel/sage/sage/schemes/elliptic_curves/heegner.py
The worst regression seems to be in devel/sage/sage/combinat/partition_algebra.py, a regression of more than 70%: (50.2-29.5)/29.5
. So, perhaps one should start to see what is happening there.
comment:22 Changed 7 years ago by
I've extracted the partition_algebra doctests into a file (this one's easy: no error, all one-line, so select all lines with "sage:", move the import up to and stuff it all in a function)
vanilla 5.3b2:
6990672 function calls (6690605 primitive calls) in 11.670 seconds 237883 2.732 0.000 7.992 0.000 set.py:197(__init__) 238208 0.719 0.000 3.054 0.000 {hasattr} 255896 0.440 0.000 0.440 0.000 dynamic_class.py:122(dynamic_class) 957865 0.374 0.000 0.374 0.000 {getattr} 152230 0.335 0.000 5.312 0.000 set.py:42(Set) 912924 0.247 0.000 0.247 0.000 {isinstance} 350344/344228 0.218 0.000 0.310 0.000 set.py:700(set) 88799/56520 0.201 0.000 7.396 0.000 set_partition.py:344(_listbloc) 237883 0.193 0.000 8.185 0.000 set.py:603(__init__) 17034 0.185 0.000 0.259 0.000 combinat.py:962(__init__) 85769 0.185 0.000 3.341 0.000 set.py:827(union) 237891 0.108 0.000 3.161 0.000 sets_cat.py:255(_element_constructor_) 409319 0.103 0.000 0.103 0.000 set.py:566(object) 14600 0.097 0.000 7.554 0.001 cartesian_product.py:138(__iter__) 26068 0.089 0.000 0.773 0.000 subset.py:464(__iter__) 9431 0.079 0.000 10.116 0.001 set_partition.py:221(_iterator_part) 35639/35611 0.067 0.000 0.212 0.000 {map} 227219 0.062 0.000 0.119 0.000 set.py:634(__iter__) 17043 0.060 0.000 0.173 0.000 permutation.py:123(Permutation) 6169 0.060 0.000 1.287 0.000 set_partition_ordered.py:388(__iter__) 249245 0.059 0.000 0.059 0.000 {setattr} 17700 0.059 0.000 2.755 0.000 set_partition.py:395(_set_union)
5.32b + ticket + Thierry's fixed code
7381196 function calls (7077132 primitive calls) in 19.272 seconds 488462/247161 3.742 0.000 5.013 0.000 lazy_attribute.py:506(__get__) 241678/241677 0.712 0.000 3.074 0.000 {hasattr} 259646 0.572 0.000 0.883 0.000 dynamic_class.py:122(dynamic_class) 17349 0.531 0.000 0.665 0.000 combinat.py:962(__init__) 154600 0.371 0.000 9.548 0.000 set.py:42(Set) 971690 0.364 0.000 0.364 0.000 {getattr} 90486/57843 0.275 0.000 12.905 0.000 set_partition.py:344(_listbloc) 929509 0.261 0.000 0.261 0.000 {isinstance} 271183 0.256 0.000 0.256 0.000 weakref.py:55(__getitem__) 354289/348145 0.227 0.000 0.320 0.000 set.py:700(set) 241276 0.208 0.000 14.688 0.000 set.py:603(__init__) 86757 0.201 0.000 5.423 0.000 set.py:827(union) 14889 0.156 0.000 13.078 0.001 cartesian_product.py:138(__iter__) 9597 0.146 0.000 17.137 0.002 set_partition.py:221(_iterator_part) 17881 0.131 0.000 4.838 0.000 set_partition.py:395(_set_union) 241301 0.107 0.000 3.180 0.000 sets_cat.py:255(_element_constructor_) 414348 0.105 0.000 0.105 0.000 set.py:566(object) 26411 0.096 0.000 0.915 0.000 subset.py:464(__iter__) 36450/36422 0.070 0.000 0.185 0.000 {map} 229899 0.067 0.000 0.125 0.000 set.py:634(__iter__) 6334 0.067 0.000 2.223 0.000 set_partition_ordered.py:388(__iter__) 17881 0.067 0.000 6.441 0.000 set_partition.py:374(<lambda>) 170189 0.064 0.000 0.132 0.000 set.py:158(is_Set)
The worst change in cumulative time (it would probably be better to sort the profile data on that to see which part of the code is responsible - after that you can see which callees from that code use much time) seem to be in
set_partition.py:221(_iterator_part) set.py:603(__init__)
but not that the NUMBER of calls here isn't much different, so it doesn't look like we're creating parents that much more often here (that would always show up in python profiles, right?) On the other hand, lazy_attribute
has shown up, costing 5 seconds in total (and getting called a LOT), and those do tend to live in the category system, right?
Ah, and I finally found what the slash means and what primitive calls
means. Those are nonrecursive calls, i.e., for recursive functions the entry into the recursion. In particular
488462/247161 3.742 0.000 5.013 0.000 lazy_attribute.py:506(__get__)
means 488462
total calls, 247161
of which are primitive. Glancing over the code gives me the impression this code might indeed be looking up further attributes (there's an mro loop)
I can imagine that these attributes take more time if they get called the first time on a parent, so perhaps this showing up IS an indication new parents are being made.
comment:23 Changed 7 years ago by
I think this illustrates quite nicely what the problems are here:
sage: import collections sage: collections.Counter(type(a) for a in SetPartitionsAk(3)) Counter({<class 'sage.sets.set.Set_object_enumerated_with_category'>: 203})
One creates a parent SetPartitionsAk(3)
and its elements (a lot of them) are parents themselves! Previously, this code was essentially working on a dealloc=no-op
memory model. Now it's a bit slower than it was before.
Conceptually, one would probably want an "element-like" object first, with delayed "parent" properties. Only when the user insists on treating the thing as a parent do the parent-like properties get initialized. I don't know whether it's worth having something like this. I suspect that if you want to access elements of SetPartitionsAk(3)
for speed, you should probably just have a way of getting representatives as plain lists (or python sets).
comment:24 Changed 7 years ago by
I experimented a bit further, towards an improved initialisation of finite fields.
I have already mentioned the idea of caching small fields: The elements of a small field are cached, hence, why should one not cache the field itself? That's in trac_13400_cache_small_rings.patch
And I created a short-cut for building ideals in ZZ. That's useful, because one needs an ideal in ZZ while initialising a finite field. It has a noticeable effect. With only the first two patches:
sage: %time L = [GF(p) for p in prime_range(100000)] CPU times: user 5.25 s, sys: 0.06 s, total: 5.30 s Wall time: 5.32 s sage: timeit("ZZ.ideal(5)", number=1000) 1000 loops, best of 3: 84 µs per loop sage: timeit("ZZ.ideal(5)", number=5000) 5000 loops, best of 3: 90.5 µs per loop sage: timeit("ZZ.ideal(5)", number=5000) 5000 loops, best of 3: 99.6 µs per loop sage: timeit("ZZ.ideal(5)", number=5000) 5000 loops, best of 3: 122 µs per loop sage: timeit("ZZ.ideal(5)", number=5000) 5000 loops, best of 3: 134 µs per loop
Note that the time constantly drops - why is that?
With all three patches:
sage: %time L = [GF(p) for p in prime_range(100000)] CPU times: user 4.42 s, sys: 0.04 s, total: 4.46 s Wall time: 4.47 s sage: timeit("ZZ.ideal(5)", number=1000) 1000 loops, best of 3: 68.7 µs per loop sage: timeit("ZZ.ideal(5)", number=5000) 5000 loops, best of 3: 76.1 µs per loop sage: timeit("ZZ.ideal(5)", number=5000) 5000 loops, best of 3: 86.8 µs per loop sage: timeit("ZZ.ideal(5)", number=5000) 5000 loops, best of 3: 98.1 µs per loop sage: from sage.rings.finite_rings.integer_mod_ring import quick_ZZ_ideal sage: timeit("quick_ZZ_ideal(5)", number=1000) 1000 loops, best of 3: 2.89 µs per loop sage: timeit("quick_ZZ_ideal(5)", number=5000) 5000 loops, best of 3: 2.87 µs per loop sage: timeit("quick_ZZ_ideal(5)", number=5000) 5000 loops, best of 3: 2.89 µs per loop sage: timeit("quick_ZZ_ideal(5)", number=5000) 5000 loops, best of 3: 2.86 µs per loop sage: timeit("quick_ZZ_ideal(5)", number=5000) 5000 loops, best of 3: 2.82 µs per loop
Hence, the quick way of creating an ideal is much faster, and when using it in the creation of finite fields, it yields a speed-up of (5.25-4.42)/5.25
, which is about 16%.
In a next step, one could try to unravel the QuotientRing.__init__
in the finite field initialisation.
Apply trac_13400_subclass_hook_cython.patch trac_13400_cache_small_rings.patch trac_13400_quick_ZZ_ideal.patch
comment:25 follow-up: ↓ 26 Changed 7 years ago by
Are you taking the following into account?
sage: GF(13) is Integers(13) False
The multimodular code I have seen uses Integers(p)
, not GF(p)
.
I did not quite understand the following:
sage: R=Integers(13); k=GF(13) sage: a=R(1); b=k(1) sage: R is k False sage: type(a); type(b) <type 'sage.rings.finite_rings.integer_mod.IntegerMod_int'> <type 'sage.rings.finite_rings.integer_mod.IntegerMod_int'> sage: parent(a+b) is k True sage: R.has_coerce_map_from(k), k.has_coerce_map_from(R) (False, True)
So k
seems to be the more canonical in the eyes of sage. But why the difference between k
and R
at all?
comment:26 in reply to: ↑ 25 Changed 7 years ago by
Replying to nbruin:
Are you taking the following into account?
sage: GF(13) is Integers(13) False
Apparently I do, since the timings improve in the same way: Without trac_13400_quick_ZZ_ideal.patch one has
sage: %time L = [Integers(n) for n in srange(10000)] CPU times: user 4.36 s, sys: 0.05 s, total: 4.40 s Wall time: 4.42 s
With the patch, one has
sage: %time L = [Integers(n) for n in srange(10000)] CPU times: user 3.38 s, sys: 0.08 s, total: 3.46 s Wall time: 3.47 s
Hence the speedup is (4.42-3.47)/4.42
: about 21%
I did not quite understand the following:
sage: R=Integers(13); k=GF(13) sage: a=R(1); b=k(1) sage: R is k False sage: type(a); type(b) <type 'sage.rings.finite_rings.integer_mod.IntegerMod_int'> <type 'sage.rings.finite_rings.integer_mod.IntegerMod_int'> sage: parent(a+b) is k True sage: R.has_coerce_map_from(k), k.has_coerce_map_from(R) (False, True)So
k
seems to be the more canonical in the eyes of sage. But why the difference betweenk
andR
at all?
Actually I have wondered about that, myself. I guess it is just a tribute to very old code. IIRC, someone (William Stein?) argued as follows: It is actually typical that coercion goes from "less structure / simpler" (e.g. from a base ring) to "more structure / more complicated" (e.g., to a polynomial ring or to a matrix space over the base ring).
And: If we have an object O in some category (say, GF(5) in the category of fields), and an object X that is obtained from O by a forgetful functor (say, Integers(5) in the category of rings), and we have a in O and b in X, then we typically want that a+b belongs to the "richer" structure - otherwise, we would have worked in X right away.
comment:27 follow-up: ↓ 28 Changed 7 years ago by
That looks like quite an impressive gain. Given that you already have a good way of analysing doctest timing per-file, would you see what the effect is?
"Premature optimization is the root of evil" (Knuth): We shouldn't actually be optimizing doctests blindly. We can use them as a guide to find gross inefficiencies and repair those. My general feeling is that reducing the overhead on producing finite fields and related structures should be a big benefit generally, but we should only go down that path if we can corroborate that.
I was a bit surprised to see that caching "popular multimodular" fields didn't make much of a difference after fixing Thierry's code for an example where I though it would. It may still be a good idea but I don't have evidence that it matters.
comment:28 in reply to: ↑ 27 Changed 7 years ago by
Replying to nbruin:
I was a bit surprised to see that caching "popular multimodular" fields didn't make much of a difference after fixing Thierry's code for an example where I though it would. It may still be a good idea but I don't have evidence that it matters.
That's why I didn't pack everything in one patch. Keep it modular...
comment:29 Changed 7 years ago by
OK, this is more about tuning rational echelon form code, but since the issue came up here, I'll post some preliminary timings here:
from sage.misc.sage_timeit import sage_timeit def timemat(n,m,r,B): """ does some matrix timings on a single random matrix described by the parameters given: n : Number of distinct rows m : Number of colums r : Number of times the rows should be repeated B : Height bound on rational numbers put in the matrix The matrix is essentially generated by constructing an n x m matrix with random entries and then stacking r copies. This is meant as a cheap way of making non-maximal rank matrices. """ M = MatrixSpace(QQ,n*r,m) L = r*list(randint(-B,B)/randint(1,B) for j in range(n*m)) D = globals().copy() D['M'] = M D['L'] = L t1=min(sage_timeit('M(L).echelon_form("classical")',D).series) t2=min(sage_timeit('M(L).echelon_form("multimodular")',D).series)/t1 t3=min(sage_timeit('M(L).echelon_form("padic")',D).series)/t1 return (1,t2,t3)
Timings are reported as (classical,multimodular,padic)
with classical
scaled to 1. These tests are all in the range where currently padic
is chosen as a default. These timings are all done on 5.3b2 + #12313 + fix of Thierry's code + un-randomization of prime choice in padic
. Without #12313 one would run out of memory on padic
and without the other ones, the timings of padic would be much worse:
sage: timemat(10,10,1,10^3) (1, 1.4593604789072667, 2.25968735560318) sage: timemat(10,10,1,10^800) (1, 1.4550198262904093, 2.2320730206016584) sage: timemat(20,20,1,10^800) (1, 0.26421709436394647, 0.39672901423049933) sage: timemat(50,50,1,10^20) (1, 0.028462514595311343, 0.04085242952549667) sage: timemat(20,30,2,10^20) (1, 1.311503148833886, 1.0495737479473444) sage: timemat(20,30,1,10^20) (1, 1.1408739793541882, 0.8243377328388548) sage: timemat(20,30,2,10^800) (1, 1.0566089264644132, 1.2234341427870548) sage: timemat(50,60,2,10^800) (1, 0.5444370157383747, 0.18547612309046932)
so it seems that the cross-overs could use some tuning. In particular, for non-maximal rank matrices it seems to take a while longer to overtake classical. It would be great if someone would be able to tune this code properly. It is very useful if sage would echelonize small matrices quickly over arbitrary base fields quickly in most most cases (and not make silly choices), because it allows linear algebra intensive algorithms in a base-field agnostic way.
comment:30 follow-up: ↓ 31 Changed 7 years ago by
Above, I mentioned ZZ.ideal(5)
becoming slower and slower with each iteration. I can narrow it down a bit more
With sage-5.3.beta2, #12313 and the patches from here, I get
sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 46.2 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 49 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 50.4 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 52.4 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 54.7 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 56 µs per loop
In unpatched sage-5.3.beta2, one gets
sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 15.9 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 15.8 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 38.5 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 38.5 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 25.4 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 15.8 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 15.8 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 29.8 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 15.9 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 35.3 µs per loop
I thought this was supposed to be fixed with #13370. but sage-5.3.beta2 plus #12313 plus the patches from here plus #13370 yields:
sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 47.1 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 49.1 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 51.1 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 53.4 µs per loop sage: %timeit ZZ.has_coerce_map_from(5) 625 loops, best of 3: 54.7 µs per loop
My impression is that, again, caching is to blame. Namely, since the weak caches from #715 and #12313 compare by identity, not equality, repeating ZZ.has_coerce_map_from(5)
does put many copies of 5 into the cache (integers are not unique).
But have we not recently dealt with the same problem on another ticket? I currently can't find it.
comment:31 in reply to: ↑ 30 Changed 7 years ago by
comment:32 Changed 7 years ago by
I have updated the last two of my patches. The doctests should all pass with them.
comment:33 Changed 7 years ago by
- Cc jpflori added
This example seems fairly typical:
In 5.3b2:
In 5.3b2+tickets
You can see where a lot of that time is going: In category construction stuff. The reason is simple: It's trying some multimodular approach. So there are all kinds of matrix spaces over various finite fields involved. Previously, these structures remained in memory, cached and ready for reuse. Ideal here, but given the peculiar choices of primes, rarely all that useful in practice.
With the stricter collection, these spaces need to be created time and again. However, the number of times that covariant_functorial_construction gets called is insane. That almost looks like something needs to be constructed for every element.
On the plus side, if we run instead
we get with 5.3b2
and 5.3b2+tickets
so it's actually faster! (I've tried, these numbers seem representative).
I have no explanation for what is faster here. Perhaps just some dictionary lookups that happen faster now thanks to 'MonoDict?' and 'TripleDict?'.
So apart from
echelon_form
being really badly tuned for its choice of algorithm, what do we do about this parent construction? Should sage keep a cache of useful finite fields and be a little more restrained in its choice of primes (promoting reuse of these finite fields)?