Opened 4 years ago
Closed 22 months ago
#27103 closed enhancement (fixed)
Enable SIMDinstructions for Bitsets
Reported by:  ghkliem  Owned by:  

Priority:  major  Milestone:  sage9.3 
Component:  geometry  Keywords:  bitsets, CombinatorialPolyhedron, SIMD, intrinsics 
Cc:  Sophia Elia, Christian Stump, Florent Hivert, Samuel Lelièvre, Travis Scrimshaw  Merged in:  
Authors:  Jonathan Kliem  Reviewers:  Travis Scrimshaw 
Report Upstream:  N/A  Work issues:  
Branch:  8bd0b1f (Commits, GitHub, GitLab)  Commit:  8bd0b1fe087a802603b426bd22d51458b951d776 
Dependencies:  #31208  Stopgaps: 
Description (last modified by )
#27122 enabled march=native
by default. This allows using intrinsics for Intel and AMD processors (unless the user specified different CFLAGS
or CXXFLAGS
differently).
Bitsets allocated using MemoryAllocator
are overaligned allocated. This is not the case otherwise and there are several reasons not to do so for now:
 More memory usage especially for small bitsets (
bitset_t
itself is rather small).  It makes
realloc
a lot harder.  There seems to be only about 10 percent loss when data is not overaligned.
 There seems to be only about 1 percent performance loss, when the data is overaligned but the code is not customized for it (
_mm256_loadu_si256
instead of_mm256_load_si256
etc).
See https://software.intel.com/sites/landingpage/IntrinsicsGuide for details about the instructions.
The emptyness check as it is seems be really fast and it appears pointless to beat it using intrinsics.
A few minor improvements in the bitset frontend of bitsets were also made.
Timings without intrinsics (should be exactly the same as without this ticket as the code almost did not change):
sage: B = Bitset((randint(0,2**20) for _ in range(2**19))) sage: %timeit B == B 7.93 µs ± 5.71 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: Bop = ~B sage: %timeit B.isdisjoint(Bop) 8.11 µs ± 4.43 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B.issubset(B) 7.98 µs ± 7.64 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: B2 = ~Bop sage: %timeit B2.intersection_update(B) 7.22 µs ± 20 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B2.union(B) 14.6 µs ± 5.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B.symmetric_difference(Bop) 14 µs ± 9.34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B.difference_update(Bop) 7.2 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: P = polytopes.hypercube(14) sage: P1 = P.stack(next(P.face_generator(1))) sage: C = CombinatorialPolyhedron(P1) sage: %time C.f_vector() CPU times: user 12.8 s, sys: 7.98 ms, total: 12.8 s Wall time: 12.8 s (1, 16385, 131069, 487383, 1117948, 1769482, 2047331, 1788501, 1200342, 622908, 248963, 75361, 16640, 2470, 197, 1) sage: P = polytopes.Birkhoff_polytope(5) sage: C = CombinatorialPolyhedron(P) sage: %time C.f_vector() CPU times: user 581 ms, sys: 0 ns, total: 581 ms Wall time: 580 ms (1, 120, 5040, 50250, 233400, 631700, 1113700, 1367040, 1220550, 817150, 419225, 167200, 52120, 12600, 2300, 300, 25, 1) sage: P = polytopes.associahedron(['A', 11], backend='normaliz') sage: C = CombinatorialPolyhedron(P) sage: %time C.f_vector() CPU times: user 38 s, sys: 20.2 ms, total: 38 s Wall time: 38 s (1, 208012, 1144066, 2735810, 3730650, 3197700, 1790712, 659736, 157080, 23100, 1925, 77, 1)
With SSE4.1 (some of them would also run just with SSE2, but already SSE4.1 is 2008):
sage: B = Bitset((randint(0,2**20) for _ in range(2**19))) sage: %timeit B == B 7.91 µs ± 3.03 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: Bop = ~B sage: %timeit B.isdisjoint(Bop) 5.1 µs ± 6.09 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B.issubset(B) 4.71 µs ± 5.07 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: B2 = ~Bop sage: %timeit B2.intersection_update(B) 5.57 µs ± 9.94 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B2.union(B) 11.9 µs ± 5.64 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B.symmetric_difference(Bop) 12.4 µs ± 7.06 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B.difference_update(Bop) 4.66 µs ± 38.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: P = polytopes.hypercube(14) sage: P1 = P.stack(next(P.face_generator(1))) sage: C = CombinatorialPolyhedron(P1) sage: %time C.f_vector() CPU times: user 7.98 s, sys: 4 ms, total: 7.99 s Wall time: 7.98 s (1, 16385, 131069, 487383, 1117948, 1769482, 2047331, 1788501, 1200342, 622908, 248963, 75361, 16640, 2470, 197, 1) sage: P = polytopes.Birkhoff_polytope(5) sage: C = CombinatorialPolyhedron(P) sage: %time C.f_vector() CPU times: user 466 ms, sys: 0 ns, total: 466 ms Wall time: 466 ms (1, 120, 5040, 50250, 233400, 631700, 1113700, 1367040, 1220550, 817150, 419225, 167200, 52120, 12600, 2300, 300, 25, 1) sage: P = polytopes.associahedron(['A', 11], backend='normaliz') sage: C = CombinatorialPolyhedron(P) sage: %time C.f_vector() CPU times: user 26.8 s, sys: 8.09 ms, total: 26.8 s Wall time: 26.8 s (1, 208012, 1144066, 2735810, 3730650, 3197700, 1790712, 659736, 157080, 23100, 1925, 77, 1)
With AVX:
sage: B = Bitset((randint(0,2**20) for _ in range(2**19))) sage: %timeit B == B 3.99 µs ± 2.46 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: Bop = ~B sage: %timeit B.isdisjoint(Bop) 3.38 µs ± 12.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B.issubset(B) 3.05 µs ± 2.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: P = polytopes.hypercube(14) sage: P1 = P.stack(next(P.face_generator(1))) sage: C = CombinatorialPolyhedron(P1) sage: %time C.f_vector() CPU times: user 5.27 s, sys: 7.97 ms, total: 5.28 s Wall time: 5.28 s (1, 16385, 131069, 487383, 1117948, 1769482, 2047331, 1788501, 1200342, 622908, 248963, 75361, 16640, 2470, 197, 1) sage: P = polytopes.Birkhoff_polytope(5) sage: C = CombinatorialPolyhedron(P) sage: %time C.f_vector() CPU times: user 449 ms, sys: 6 µs, total: 449 ms Wall time: 449 ms (1, 120, 5040, 50250, 233400, 631700, 1113700, 1367040, 1220550, 817150, 419225, 167200, 52120, 12600, 2300, 300, 25, 1) # It does not use subset checks as it is a simple polytope. sage: P = polytopes.associahedron(['A', 11], backend='normaliz') sage: C = CombinatorialPolyhedron(P) sage: %time C.f_vector() CPU times: user 30.4 s, sys: 23.6 ms, total: 30.4 s Wall time: 30.4 s (1, 208012, 1144066, 2735810, 3730650, 3197700, 1790712, 659736, 157080, 23100, 1925, 77, 1)
With AVX2:
sage: B = Bitset((randint(0,2**20) for _ in range(2**19))) sage: Bop = ~B sage: B2 = ~Bop sage: %timeit B2.intersection_update(B) 4.2 µs ± 3.28 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B2.union(B) 12 µs ± 5.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B.symmetric_difference(Bop) 12 µs ± 5.98 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: %timeit B.difference_update(Bop) 4.03 µs ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) sage: P = polytopes.hypercube(14) sage: P1 = P.stack(next(P.face_generator(1))) sage: C = CombinatorialPolyhedron(P1) sage: %time C.f_vector() CPU times: user 5.18 s, sys: 7 µs, total: 5.18 s Wall time: 5.18 s (1, 16385, 131069, 487383, 1117948, 1769482, 2047331, 1788501, 1200342, 622908, 248963, 75361, 16640, 2470, 197, 1) sage: P = polytopes.Birkhoff_polytope(5) sage: C = CombinatorialPolyhedron(P) sage: %time C.f_vector() CPU times: user 436 ms, sys: 3 µs, total: 436 ms Wall time: 435 ms (1, 120, 5040, 50250, 233400, 631700, 1113700, 1367040, 1220550, 817150, 419225, 167200, 52120, 12600, 2300, 300, 25, 1) sage: P = polytopes.associahedron(['A', 11], backend='normaliz') sage: C = CombinatorialPolyhedron(P) sage: %time C.f_vector() CPU times: user 24.3 s, sys: 19.7 ms, total: 24.3 s Wall time: 24.3 s (1, 208012, 1144066, 2735810, 3730650, 3197700, 1790712, 659736, 157080, 23100, 1925, 77, 1)
Change History (46)
comment:1 Changed 4 years ago by
Dependencies:  #26887 → #26887, #27122 

Description:  modified (diff) 
Keywords:  SIMD intrinsics added; polytopes fvector edge graph flag vector removed 
Priority:  minor → major 
Summary:  Farther improve Combinatorial Polyhedron for AVX but not AVX 2. → Farther improve CombinatorialPolyhedron: Enable SIMDinstructions. 
comment:2 Changed 4 years ago by
Summary:  Farther improve CombinatorialPolyhedron: Enable SIMDinstructions. → Farther improve CombinatorialPolyhedron: Enable SIMDinstructions 

comment:3 Changed 4 years ago by
comment:4 Changed 4 years ago by
Branch:  → public/27103 

Commit:  → 52cb026a0fbddc1984701f7054bd71e97e157ea1 
Last 10 new commits:
6052b61  FaceIterator is an actual iterator now

e8309e4  fixed allocation of array

6a55fa5  it.next() > next(it), also fixed face_lattice_facet_repr for trivial cases

9f42182  improved documentation mostly

687d003  replaced calloc by new[]

f8a201e  Improved comments all through base.pyx

df43a2d  improved pseudocode Algorithm in ``FaceIterator``

b73e53c  Corrected a mathematical error in explonation.

f791cce  removed notes and old files again

52cb026  replaced methods by intrinsic alternatives

comment:5 Changed 4 years ago by
Milestone:  sage8.7 → sage8.8 

Ticket retargeted after milestone closed (if you don't believe this ticket is appropriate for the Sage 8.8 release please retarget manually)
comment:6 Changed 3 years ago by
Milestone:  sage8.8 

As the Sage8.8 release milestone is pending, we should delete the sage8.8 milestone for tickets that are not actively being worked on or that still require significant work to move forward. If you feel that this ticket should be included in the next Sage release at the soonest please set its milestone to the next release milestone (sage8.9).
comment:7 Changed 3 years ago by
Milestone:  → sagepending 

comment:8 Changed 3 years ago by
Summary:  Farther improve CombinatorialPolyhedron: Enable SIMDinstructions → Further improve CombinatorialPolyhedron: Enable SIMDinstructions 

comment:9 Changed 3 years ago by
Commit:  52cb026a0fbddc1984701f7054bd71e97e157ea1 → 10a61963ba0bb2547e44934bfa197933d1395d96 

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
10a6196  enabled intrinsics in bit_vector_operation.cc

comment:10 followup: 12 Changed 3 years ago by
for popcount, you can use something like this when builtin popcount is not available:
cdef inline int popcount64(uint64_t i): """ Return the number of '1' bits in a 64bits integer. """ i = i  ((i >> 1) & 0x5555555555555555ULL) i = (i & 0x3333333333333333ULL) + ((i >> 2) & 0x3333333333333333ULL) return ( ((i + (i >> 4)) & 0x0f0f0f0f0f0f0f0fULL) * 0x0101010101010101ULL ) >> 56
EDIT: removed inappropriate comment (bad copy/paste)
comment:11 Changed 3 years ago by
Commit:  10a61963ba0bb2547e44934bfa197933d1395d96 → d069567e351f0dcecd5f790171cf30a5a8089fb3 

Branch pushed to git repo; I updated commit sha1. New commits:
d069567  improved popcount without intrinsics

comment:12 Changed 3 years ago by
Thanks. Was a nice exercise to go through it.
Replying to dcoudert:
for popcount, you can use something like this when builtin popcount is not available:
cdef inline int popcount64(uint64_t i): """ Return the number of '1' bits in a 64bits integer. """ i = i  ((i >> 1) & 0x5555555555555555ULL) i = (i & 0x3333333333333333ULL) + ((i >> 2) & 0x3333333333333333ULL) return ( ((i + (i >> 4)) & 0x0f0f0f0f0f0f0f0fULL) * 0x0101010101010101ULL ) >> 56EDIT: removed inappropriate comment (bad copy/paste)
comment:13 Changed 3 years ago by
Description:  modified (diff) 

Status:  new → needs_review 
comment:14 Changed 3 years ago by
Authors:  → David Coudert, Jonathan Kliem 

Description:  modified (diff) 
Summary:  Further improve CombinatorialPolyhedron: Enable SIMDinstructions → Enable SIMDinstructions for Combinatorial Polyhedron 
comment:15 Changed 3 years ago by
You could add an how to help reviewing this ticket. I can test it on OSX, but I assume it's far from enough.
PS: no need to add me as an author. I just gave a small trick from hakmem 169. See e.g. http://www.dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html
comment:16 Changed 3 years ago by
Description:  modified (diff) 

It works for me.
 Operating System: Debian GNU/Linux 10 (buster)
 Kernel: Linux 4.19.06amd64
 Architecture: x8664
 CPU Model name: Intel(R) Core(TM) i77700 CPU @ 3.60GHz
 sse4.1, avx, avx2, popcnt
comment:17 Changed 3 years ago by
Authors:  David Coudert, Jonathan Kliem → Jonathan Kliem 

Cc:  Sophia Elia added 
comment:18 Changed 3 years ago by
System:
 Operating System: OS X 10.14.6
 Kernel: Darwin Kernel Version 18.7.0
 Architecture: x8664
 CPU Model name: Intel(R) Core(TM) i74650U CPU @ 1.70GHz
 sse4.1, avx2, popcnt
Timing using py3:
 0.08155608177185059
With this ticket (git trac checkout 27103
) and CFLAGS set:
 0.024885177612304688
and all tests pass.
comment:19 followup: 21 Changed 3 years ago by
If you want reasonably fast implementation of bitsets, you should consider using sage/data_structures/bitset.pxi
. That uses GMP which has a lot of asembly optimization. If it is not good enough, then you should rather make bitset.pxi
faster than rewriting code for a very specialized application.
comment:20 Changed 3 years ago by
Works on
 Operating System: Ubuntu 18.04.3 LTS
 Kernel: Linux 4.15.065generic
 Architecture: x8664
 Model name: Intel(R) Core(TM) i57200U CPU @ 2.50GHz
 sse4_1 avx avx2 popcnt
Before:
 0.05162477493286133
After
 0.02220010757446289
comment:21 Changed 3 years ago by
Replying to vdelecroix:
If you want reasonably fast implementation of bitsets, you should consider using
sage/data_structures/bitset.pxi
. That uses GMP which has a lot of asembly optimization. If it is not good enough, then you should rather makebitset.pxi
faster than rewriting code for a very specialized application.
Yes...
That is going to be some project, but it definitely makes sense.
One problem with bitset (to my understanding) so far is, that it is not overaligned. If it is overaligned, loads and stores are supposedly much faster (when using intrinsics).
Probably the best way to find out, is to see how the performance of my code actually changes, when I use bitsets instead of my own implementation.
Btw, if you have a specific example in mind, where calculations with bitsets are the bottle neck, I would be interested.
comment:22 Changed 3 years ago by
For examples, have a look in the graph code sage/graphs/
(e.g. iteration over vertex covers, etc). Apparently there are stuff in sage/combinat/
as well. More generally
$ git grep l bitset
comment:23 Changed 3 years ago by
Works on top of #27122, itrself on top of Python 3based 9.0.beta2 on :
 Operating System: Debian GNU/Linux bullseye/sid
 Kernel: Linux 5.2.03amd64
 Architecture: x8664
 Nom de modèle : Intel(R) Core(TM) i78550U CPU @ 1.80GHz
Before: 0.046399831771850586
After: 0.016237735748291016
All polyhedrarelated tests pass.
HTH,
comment:24 Changed 3 years ago by
I'm almost certain that this is applicable to sage/data_structures/bitset.pxi
. I ran the following test:
sage: def fill_bitset(): ....: for _ in range(1000000): ....: yield randint(0,10000000000) ....: yield 10000000000 ....: sage: a = FrozenBitset(fill_bitset()) sage: cython(''' ....: from sage.data_structures.bitset cimport bitset_t, FrozenBitset ....: def intersect(FrozenBitset a, FrozenBitset b): ....: a.intersection(b) ....: ''') sage: %timeit intersect(a,a)
With the current implementation via GMP this takes about 781 ms (best of 3).
mpn_and_n(r.bits, a.bits, b.bits, b.limbs)
I replaced this by:
cdef mp_size_t i cdef mp_limb_t *rb = r.bits, *ab = a.bits, *bb = b.bits for i from 0 <= i < a.limbs: rb[i] = ab[i] & bb[i]
and it seems to be about the same.
Of course this is a very brief naive test, but there seems to be no reason to use GMP for this specific function. Notice, that this doesn't even use intrinsics.
comment:25 Changed 3 years ago by
If you can make data_structures/bitset.pxi
better this is wonderful. What I want to avoid is two different piece of code that achieve the same thing. Please put your code in data_structures/
rather than in polyhedron
and make it as generic as possible. That way it will benefit to others.
Also, note that not all functions in bitset.pxi
use GMP/MPIR. Where it does it should be fast. That is comparison (mpn_cmp
), intersection (mpn_and_n
), union (mpn_ior_n
), etc. You might want to have a look at how these are implemented GMP/MPIR.
comment:26 Changed 3 years ago by
Things are complicated and a lot more complicated than I thought.
 To my knowledge, I cannot do something as
#if __AVX__
in cython, can I? This would mean, we would have to rewrite bitset in C or C++ to use intrinsics.  So far I'm only reimplementing very basic functions.
 I cannot write simple tests, where intrinsics are much faster. For intersection I don't see an improvement, for subset checks I'm maybe 10% faster. Here is, why I think this might be the case:
Loading registers takes a lot of time. In many cases this might be the bottle neck. Sure intrinsics need only one processor step instead of four to do bitwise
AND
of two registers. However, when I'm mostly waiting on read/write this does not help. Without intrinsics the processor can still process registers as fast as they can be loaded and stored again (don't forget that the result blocks another register while it is being stored).
Now in my scenario with subset checks things are a bit different. Nothings needs to be stored after the computation (this would explain why subset check with intrinsics is 10% faster than without). But more importantly seems to be the exact scenario I'm in:
I'm checking whether
A
is a subset of any ofB_1,...,B_n
. In most cases only the first 256 vertices are needed to discover thatA
is not a subset ofB_i
. So the first register ofA
can be reused a lot of times.
So, intrinsics might be useful in other scenarios for bitsets in sage, but it's definitly not a nobrainer. One won't lose. That I'm pretty sure of.
comment:27 Changed 3 years ago by
Commit:  d069567e351f0dcecd5f790171cf30a5a8089fb3 → 5dc41db60cc3754b7022f153ba52f6140f85bf32 

Branch pushed to git repo; I updated commit sha1. New commits:
5dc41db  fixed copy/paste typo

comment:28 Changed 3 years ago by
Cc:  Christian Stump added 

comment:29 Changed 3 years ago by
Cc:  Florent Hivert Samuel Lelièvre added 

Description:  modified (diff) 
comment:30 Changed 3 years ago by
Dependencies:  #26887, #27122 → #27122 

Status:  needs_review → needs_info 
I will probably move those improvements to data_structures/bitset.pxi
, once #27122 is done.
In the mean time, one can of course still pull the attached branch, if one wants to improve the fvector computation.
comment:31 Changed 2 years ago by
Branch:  public/27103 → public/27103reb 

Commit:  5dc41db60cc3754b7022f153ba52f6140f85bf32 → be5c59038e7914610f4a734971deca6cc47c9646 
Dependencies:  #27122 → #27122, #30429, #30435 
New commits:
39f3a38  directly check is_simple/is_simplicial

bf91483  improved count atoms

fe880a4  input of `intersection` in standard order

1826202  Merge branch 'public/30429' of git://trac.sagemath.org/sage into u/ghkliem/improved_count_atoms

44631b8  enabled intrinsics in bit_vector_operation.cc

be5c590  consistent namings in comments

comment:32 Changed 2 years ago by
Branch:  public/27103reb → public/27103reb2 

Commit:  be5c59038e7914610f4a734971deca6cc47c9646 → cd0198e09e3baeb8ae6e135a1e7e9c504958312a 
Dependencies:  #27122, #30429, #30435 → #27122, #30040, #30429, #30435 
Last 10 new commits:
131c065  add a specialized `get_next_level_simple`

bc00d42  prepare slightly modified algorithm for simple/simplicial polytopes

4fdc787  faster algorithm for simple/simplicial polytopes

4724d3c  small fixes

0297853  improvements in documentation

7ffac00  changes in 30429

01b4154  typo

24350b9  Revert "unite and is_zero for bit_vectors"

49eacaf  Merge branch 'public/27103reb' of git://trac.sagemath.org/sage into public/27103reb2

cd0198e  unite and is_zero for bit_vectors with intrinsics

comment:33 Changed 2 years ago by
Commit:  cd0198e09e3baeb8ae6e135a1e7e9c504958312a → d3d1385996810fa322133e0f8490b46feee02196 

comment:34 Changed 2 years ago by
Branch:  public/27103reb2 → public/27103reb3 

Commit:  d3d1385996810fa322133e0f8490b46feee02196 → 67c88cfb1f22e4f886834e3b6c53487e62ae7f6d 
Changes in #30040.
Last 10 new commits:
7f5b19f  unite and is_zero for bit_vectors

1496bfd  add a specialized `get_next_level_simple`

fc4784a  changes in 30429

81ba2e0  simplify `get_next_level_simple by 30458

9cece74  prepare slightly modified algorithm for simple/simplicial polytopes

fe5852a  typo

1510386  faster algorithm for simple/simplicial polytopes

86c564a  small fixes

ed6e966  improvements in documentation

67c88cf  merged in 30040

comment:35 Changed 2 years ago by
Commit:  67c88cfb1f22e4f886834e3b6c53487e62ae7f6d → cfcdeb677073fc32f8dd57263056b640a7f6a518 

Branch pushed to git repo; I updated commit sha1. New commits:
cfcdeb6  Merge branch 'develop' of git://trac.sagemath.org/sage into public/27103reb3

comment:36 Changed 23 months ago by
Dependencies:  #27122, #30040, #30429, #30435 → #31208 

comment:37 Changed 23 months ago by
Branch:  public/27103reb3 → public/27103inbitsets 

Cc:  Travis Scrimshaw added 
Commit:  cfcdeb677073fc32f8dd57263056b640a7f6a518 → bd78875e2aad36e4631d3c329df411c5afbac39a 
Keywords:  bitsets added 
Status:  needs_info → needs_review 
I will update the description yet, but this is basically ready for review.
I got three failing doctests, but I don't think they are related:
sage t long randomseed=0 /srv/public/kliem/sage/src/sage/doctest/sources.py # 1 doctest failed sage t long randomseed=0 /srv/public/kliem/sage/src/sage/doctest/forker.py # 1 doctest failed sage t long randomseed=0 /srv/public/kliem/sage/src/sage/interfaces/singular.py # Killed due to segmentation fault
Otherwise on my machine everything passes (with AVX2).
(in the above two each time FDS.in_lib
fails, so that doesn't seem to be related, also the singular timout doesn't seem to be related).
New commits:
348ed67  use popcnt and tzcnt

2713d89  easier instructions

9c6edc1  better documentation

bd78875  improvements to the frontend

comment:38 Changed 23 months ago by
Milestone:  sagepending → sage9.3 

comment:39 followup: 40 Changed 23 months ago by
Commit:  bd78875e2aad36e4631d3c329df411c5afbac39a → db11a450cfb5d88d7c6e8c8e0837bfb0e506f09a 

Branch pushed to git repo; I updated commit sha1. New commits:
db11a45  enable intrinsics for bitsets

comment:40 Changed 23 months ago by
comment:41 Changed 23 months ago by
Commit:  db11a450cfb5d88d7c6e8c8e0837bfb0e506f09a → 65dba93023b83874be5b18d0e3f58ba5484b7db9 

Branch pushed to git repo; I updated commit sha1. New commits:
65dba93  mistakes in the sse4.1 code

comment:42 Changed 23 months ago by
Description:  modified (diff) 

Summary:  Enable SIMDinstructions for Combinatorial Polyhedron → Enable SIMDinstructions for Bitsets 
comment:43 Changed 23 months ago by
Commit:  65dba93023b83874be5b18d0e3f58ba5484b7db9 → 8bd0b1fe087a802603b426bd22d51458b951d776 

comment:44 Changed 23 months ago by
Reviewers:  → Travis Scrimshaw 

Status:  needs_review → positive_review 
Also works for me with
 Operating System: Ubuntu 18.04.5 LTS
 Kernel: Linux 4.15.0112generic
 Architecture: x8664
 Model: Intel(R) Core(TM) i78700 CPU @ 3.20GHz
 sse4_1, sse4_2, avx, avx2 in cpu flags
Based on this, the patchbots, and the previous comments, I am setting this to a positive review.
comment:46 Changed 22 months ago by
Branch:  public/27103inbitsets → 8bd0b1fe087a802603b426bd22d51458b951d776 

Resolution:  → fixed 
Status:  positive_review → closed 
This is a way, one could boost CombinatorialPolyhedron? with intrinsics. Replace
is_subset
,intersection
andCountFaceBits
by those functions: