#13352 closed enhancement (fixed)
Running time improvement of the bitset_len method
Reported by: | dcoudert | Owned by: | jason |
---|---|---|---|
Priority: | major | Milestone: | sage-5.13 |
Component: | misc | Keywords: | |
Cc: | ncohen, leif | Merged in: | sage-5.13.rc0 |
Authors: | David Coudert, Jeroen Demeyer | Reviewers: | David Coudert |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | Stopgaps: |
Description (last modified by )
This patch improves the running time of the bitset_len
method by using mpn_popcount()
.
Before:
sage: B = Bitset('10'*10000) sage: len(B) 10000 sage: %timeit len(B) 1000 loops, best of 3: 245 us per loop
After:
sage: B = Bitset('10'*10000) sage: len(B) 10000 sage: %timeit len(B) 100000 loops, best of 3: 2.52 us per loop
This patch also reorganizes the bitset files, including deleting the sage/misc/bitset_pxd.pxi
file.
APPLY:
Attachments (6)
Change History (57)
comment:1 Changed 9 years ago by
- Status changed from new to needs_review
comment:2 follow-up: ↓ 3 Changed 9 years ago by
comment:3 in reply to: ↑ 2 ; follow-up: ↓ 5 Changed 9 years ago by
- Description modified (diff)
Replying to leif:
I'd rather "implement" the functions in C (where you can use the C preprocessor and
__builtin_popcount()
if available) and write a thin Cython wrapper.GCC e.g. chooses efficient machine implementations (probably a single instruction), depending on the (configured or specified) target and perhaps also further compiler flags.
As reported in https://groups.google.com/forum/#!topic/sage-devel/GnudTUwbGG4, I tried to implement the function in C with __builtin_popcount()
. I was able to compile successfully, but then sage was not working anymore.
See attached patch trac_13352-bitcount-in-c.patch.
Note that GMP has functions like
mpz_popcount()
formpz_t
s as well.
It could be another solution, but I have not been able yet to find how to access the functions used for int behind?
Not sure what the
parallel
refers to ... ;-)
In fact, the 4 lines n = __COUNT32__(n, 0)
can be executed in parallel and gcc is sometimes able to optimize the sequence of instructions. For instance, all methods have similar running times on my macbook air, but parallel methods are slower on my old PC (system still in 32-bits...).
comment:4 Changed 9 years ago by
- Description modified (diff)
comment:5 in reply to: ↑ 3 Changed 9 years ago by
Replying to dcoudert:
Replying to leif:
Not sure what the
parallel
refers to ... ;-)In fact, the 4 lines
n = __COUNT32__(n, 0)
can be executed in parallel
Well, obviously not, since you have the data dependence on n
. ;-)
Btw., you should use UINT32_C(1)
instead of 0x1u
etc., and e.g. UINT32_MAX
instead of (<uint32_t>-1)
(Cython code) etc.
The types of the shift parameters (of __TWO??__()
) are slightly overdim'ed...
comment:6 Changed 9 years ago by
I'm able to import UINT32_MAX, but not UINT32_C.
- with
from libc.stdint cimport UINT32_C
: it is apparently not imported - with
from libc.stdint import UINT32_C
: it compiles, but sage crashes with weird messages
Any special trick?
Changed 9 years ago by
comment:7 Changed 8 years ago by
- Milestone changed from sage-5.11 to sage-5.12
Changed 8 years ago by
comment:8 Changed 8 years ago by
- Cc ncohen added
- Description modified (diff)
Hello,
Since I have now understood how to use __builtin_popcountl
(not obvious when you don't know), I have uploaded a new version of the patch.
before:
sage: B = Bitset('00'*10000+'1') sage: len(B) 1 sage: %timeit len(B) 1000000 loops, best of 3: 268 ns per loop sage: B = Bitset('10'*10000) sage: len(B) 10000 sage: %timeit len(B) 1000 loops, best of 3: 245 us per loop sage: B = Bitset('10'*1000000) sage: len(B) 1000000 sage: %timeit len(B) 10 loops, best of 3: 24.1 ms per loop
after:
sage: B = Bitset('00'*10000+'1') sage: len(B) 1 sage: %timeit len(B) 1000000 loops, best of 3: 275 ns per loop sage: B = Bitset('10'*10000) sage: len(B) 10000 sage: %timeit len(B) 100000 loops, best of 3: 2.52 us per loop sage: B = Bitset('10'*1000000) sage: len(B) 1000000 sage: %timeit len(B) 1000 loops, best of 3: 236 us per loop
I'm wondering if I should add a wrapper in case the __builtin_popcountl
method is not available. This is not difficult, but I don't know which environnement variables should be tested.
Thanks.
comment:9 Changed 8 years ago by
- Cc leif added
- Description modified (diff)
I have now implemented the function in C in file popcount.h
with a small wrapper. The result is a bit slower than without wrapper, but still way faster than previous method.
sage: B = Bitset('00'*10000+'1') sage: len(B) 1 sage: %timeit len(B) 1000000 loops, best of 3: 361 ns per loop sage: B = Bitset('10'*10000) sage: %timeit len(B) 100000 loops, best of 3: 3.03 us per loop sage: B = Bitset('10'*1000000) sage: len(B) 1000000 sage: %timeit len(B) 1000 loops, best of 3: 287 us per loop
I have used the _WIN32
flag, but I'm not sure it's the good one. So if one knows the best flag to use to detect when the __builtin_popcount
method is not available, please let me know.
comment:10 Changed 7 years ago by
Hello,
after browsing the web, I came up with this new version. The current version tests is the cpu has builtin support. If yes, we use builtin version. Else, we use hakmem methods.
- http://gcc.gnu.org/onlinedocs/gcc/X86-Built_002din-Functions.html
- http://gcc.gnu.org/gcc-4.8/changes.html
- http://www.spinics.net/lists/linux-ext4/msg35827.html
We don't need all the methods for bitset_len
but it could be useful for other usage.
Let me know if it is OK.
comment:11 Changed 7 years ago by
HMmmmmmmmm.... I'm trying to understand the code and it's much more complicated than what I actually read. Funny, looks like you can have as much fun with GCC than with Python, and decide how stuff is to be run at run time instead of compile time :-P
This being said, if I understand well what "ifunc" does I do not see how the popcount instruction will be used when available. To me it looks like you should make sure that the file is compiled with the -mpopcnt
GCC flag (so that __builtin_popcnt
is always replaced by the popcount instruction even if it does not exist), and that this ifunc trick means to avoid using this instruction if it is detected that, at runtime, it is not available.
This being said, it is a very nice trick O_o
Nathann
comment:12 Changed 7 years ago by
I would also say that those 2 blocks
__attribute__ ((__target__ ("popcnt"))) static unsigned int builtin_popcount(unsigned int w){ return (sizeof(unsigned int)==4? __builtin_popcount(w) : __builtin_popcountl(w)); }
are mistakes. Whatever happens, regardless of the architecture, __builtin_popcount
works on int type, and __builtin_popcountl
works on long types.
Also, perhaps you should credit Cristian Rodríguez somewhere :-P
Nathann
Changed 7 years ago by
comment:13 Changed 7 years ago by
Right. I have simplified the code and add credits to Cristian Rodriguez.
comment:14 Changed 7 years ago by
Here's a slightly cleaner version which does not use ifunc but only G++'s multiversionning. C's looking very much like Python these days :-PPPP
Nice links :
- http://gcc.gnu.org/wiki/FunctionMultiVersioning
- http://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning.html
Nathann
Changed 7 years ago by
comment:15 Changed 7 years ago by
Unfortunately, I'm unable to use this for the patch because the "default" keyword is not recognized.
comment:16 Changed 7 years ago by
Hmmmmmm... Well, it seems that this feature is only available in gcc 4.8+. I asked Volker for his advice on this ticket (or rather on the .c file I uploaded), and he told me that we should check for both gcc 4.8+ and x86 (ifdef __i386__
).
Besides, he saw on that link (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041, though I can't find out where on the page) that GCC was about to patch that, and improve the generic version of __builtin_gcc
. If that's the case, perhaps we should just use theirs, waiting for the patch to be merged with a future gcc release. Otherwise we can use the MIT HAKMEM version indeed.
I'll try to send a version with both codes today. And we'll remove the old one in a couple of years :-P
Nathann
comment:17 Changed 7 years ago by
Hey I don't get it... In your patch you also use a attribute/target with popcount. You can't compile my code yet you can compile yours ? And that means that they added this attribute/target feature before adding the "default" keyword ? O_o
Nathann
comment:18 Changed 7 years ago by
Apparently yes. Have you tried to modify the patch and compile on your own computer?
comment:19 Changed 7 years ago by
I'm trying, but it takes forever to compile. What happens if you replace "default" with "" in the .c file ?
Nathann
comment:20 Changed 7 years ago by
Arg. This "" fails :-/
Nathann
comment:21 Changed 7 years ago by
HMmmmmmm O_o
What I get when running tests is :
Traceback (most recent call last): File "/home/ncohen/.Sage/local/bin/sage-runtests", line 79, in <module> from sage.doctest.control import DocTestController File "/home/ncohen/.Sage/local/lib/python2.7/site-packages/sage/doctest/control.py", line 27, in <module> from sources import FileDocTestSource, DictAsObject File "/home/ncohen/.Sage/local/lib/python2.7/site-packages/sage/doctest/sources.py", line 27, in <module> from util import NestedName File "/home/ncohen/.Sage/local/lib/python2.7/site-packages/sage/doctest/util.py", line 23, in <module> from sage.misc.misc import walltime, cputime File "/home/ncohen/.Sage/local/lib/python2.7/site-packages/sage/misc/misc.py", line 565, in <module> from sage.misc.misc_c import prod, running_total, balanced_sum, is_64_bit, is_32_bit ImportError: /home/ncohen/.Sage/local/lib/python2.7/site-packages/sage/misc/misc_c.so: undefined symbol: __builtin_popcount
Nathann
comment:22 Changed 7 years ago by
(that was with your patch applied)
comment:23 Changed 7 years ago by
so you mean with patch http://trac.sagemath.org/raw-attachment/ticket/13352/trac_13352_bitset_len_v3.patch as it is or a modified version ?
When I try to replace my patch with what you have put in a.c file, I get:
In file included from sage/coding/binary_code.c:319:0: ./sage/misc/popcount.h:107:3: error: attribute(target("default")) is unknown ./sage/misc/popcount.h:123:3: error: attribute(target("default")) is unknown ./sage/misc/popcount.h:128:14: error: redefinition of ‘popcount’ ./sage/misc/popcount.h:123:14: note: previous definition of ‘popcount’ was here ./sage/misc/popcount.h:132:14: error: redefinition of ‘popcountl’ ./sage/misc/popcount.h:107:14: note: previous definition of ‘popcountl’ was here error: command 'gcc' failed with exit status 1 Error installing modified sage library code.
comment:24 Changed 7 years ago by
I meant your patch, without modifications. As for the errors you get, I think it's because the file gets compiled with gcc instead of g++. Volker said that it should be sufficient to make it a .cc file, and that GCC would take the hint and use the right compiler instead. It does work when outside of Sage :
~$ cp a.cc a.c ~$ gcc a.c -o a a.c:31:12: error: redefinition of ‘popcount’ inline int popcount(unsigned int i){ ^ a.c:26:12: note: previous definition of ‘popcount’ was here inline int popcount(unsigned int i){ ^ a.c:35:12: error: redefinition of ‘popcountl’ inline int popcountl(unsigned long i){ ^ a.c:10:12: note: previous definition of ‘popcountl’ was here inline int popcountl(unsigned long i){ ^ ~$ gcc a.cc -o a ~$
Though there does not seem to be anything wrong with "default" on my computer. Which version of gcc are you using ? It's probably < 4.8.
Nathann
comment:25 Changed 7 years ago by
gcc (GCC) 4.6.3 20120306.
comment:26 Changed 7 years ago by
Helloooooooo ! Just to mention that I have been fighting with multiversionning in Sage+GCC for two days now, and that I can't seem to make it work. It works outside of Sage, it segfaults inside of Sage and I have no idea why.
Oh, and there is a #DEFINE NO_INLINE
among the default preprocessor variables for anything that gets compiled by Sage it seems.
Nathann
comment:27 Changed 7 years ago by
So unless someone has a good solution, we can make a patch containing only the hakmem methods, and if one is later able to make __builtin_popcount
running, then we may switch to it.
Do you agree?
comment:28 Changed 7 years ago by
Yep, it makes sense but it is a pity :-/
I wrote to sage-devel, just in case ...
https://groups.google.com/d/topic/sage-devel/FIxzZv0L_lo/discussion
Nathann
comment:29 Changed 7 years ago by
I think a great suggestion implied by #14704 is to use mpn_popcount()
for this. That's essentially one line of code to call that function :-)
comment:30 Changed 7 years ago by
I'm working on a patch which does this...
comment:31 follow-up: ↓ 36 Changed 7 years ago by
That would be great if you have the good trick to use that method. I tried some time ago to use mpz_popcount
and failed.
comment:32 Changed 7 years ago by
I'm also merging bitset_pxd.pxi
and bitset.pxd
(I see no reason why these would be two separate files).
comment:33 follow-up: ↓ 34 Changed 7 years ago by
Hmmmmmmmmmm O_O
You are on a dangerous road, Jeroen. The road to many broken dependencies in the graph/ folder :-PPPPPP
Nathann
comment:34 in reply to: ↑ 33 Changed 7 years ago by
Replying to ncohen:
Hmmmmmmmmmm
O_O
You are on a dangerous road, Jeroen. The road to many broken dependencies in the graph/ folder
:-PPPPPP
You mean because of bitset_pxd.pxi
? I could simply leave that file with a one-liner
from sage.misc.bitset cimport *
if you prefer.
comment:35 Changed 7 years ago by
- Description modified (diff)
comment:36 in reply to: ↑ 31 Changed 7 years ago by
Replying to dcoudert:
That would be great if you have the good trick to use that method. I tried some time ago to use
mpz_popcount
and failed.
The trick is to use mpn_popcount
instead of mpz_popcount
.
comment:37 Changed 7 years ago by
Speed is much better too.
With trac_13352_bitset_len_v3.patch:
sage: B = Bitset('10'*10000) sage: timeit('len(B)',repeat=100,number=10000) 10000 loops, best of 100: 1.24 µs per loop sage: B = Bitset('00'*10000) sage: timeit('len(B)',repeat=100,number=10000) 10000 loops, best of 100: 254 ns per loop
With trac_13352_bitset_len_v4.patch:
sage: B = Bitset('10'*10000) sage: timeit('len(B)',repeat=100,number=10000) 10000 loops, best of 100: 168 ns per loop sage: B = Bitset('00'*10000) sage: timeit('len(B)',repeat=100,number=10000) 10000 loops, best of 100: 168 ns per loop
comment:38 Changed 7 years ago by
Similar improvements could be made to other bitset functions (in particular to bitset_next()
and the like). But that would be for a new ticket...
comment:39 Changed 7 years ago by
Strangely, many places were importing bitset
without actually using bitsets... so I removed those redundant imports.
Even stranger were two files sage/combinat/debruijn_sequence.pxd
and sage/matroids/unpickling.pxd
containing just one line
include 'sage/misc/bitset_pxd.pxi'
which was in both cases unused.
Changed 7 years ago by
comment:40 follow-up: ↓ 42 Changed 7 years ago by
Excellent! The patch is working perfectly (install, doctests on sage/ , etc.). The simplification of the hamming weight method is also very good.
Thank you very much.
Should we let Nathann conclude on the status of this ticket?
comment:41 Changed 7 years ago by
See also #10093 for another bitset-related ticket ready for review.
comment:42 in reply to: ↑ 40 Changed 7 years ago by
Should we let Nathann conclude on the status of this ticket?
No sorry, please deal with this without me, I'm already juggling with 1000 patches right now O_o
Nathann
comment:43 Changed 7 years ago by
- Status changed from needs_review to positive_review
For me the patch proposed by Jeroen is OK (and I can install #10093 on top of it), so I set its status to positive.
comment:44 Changed 7 years ago by
- Reviewers set to David Coudert
comment:45 Changed 7 years ago by
This is ridiculous and shows how fast mpn_popcount()
is:
sage: B = FrozenBitset(capacity=10^5) sage: timeit('B.isempty()', repeat=100, number=10000) 10000 loops, best of 100: 980 ns per loop sage: timeit('len(B)', repeat=100, number=10000) 10000 loops, best of 100: 573 ns per loop
comment:46 follow-up: ↓ 47 Changed 7 years ago by
Hmmmmmm... Would you happen to have the popcount instruction on the computer on which you try this ?
Nathann
comment:47 in reply to: ↑ 46 Changed 7 years ago by
Replying to ncohen:
Hmmmmmm... Would you happen to have the popcount instruction on the computer on which you try this ?
I think so, yes.
comment:48 Changed 7 years ago by
- Merged in set to sage-5.13.rc0
- Resolution set to fixed
- Status changed from positive_review to closed
comment:49 Changed 7 years ago by
- Description modified (diff)
comment:50 Changed 7 years ago by
- Description modified (diff)
I changed the description since we use mpn_popcount
and not ___builtin_popcount
comment:51 Changed 6 years ago by
Follow-up: #16937
I'd rather "implement" the functions in C (where you can use the C preprocessor and
__builtin_popcount()
if available) and write a thin Cython wrapper.GCC e.g. chooses efficient machine implementations (probably a single instruction), depending on the (configured or specified) target and perhaps also further compiler flags.
Note that GMP has functions like
mpz_popcount()
formpz_t
s as well.Not sure what the
parallel
refers to ... ;-)