Opened 3 years ago

Closed 8 months ago

#27103 closed enhancement (fixed)

Enable SIMD-instructions for Bitsets

Reported by: gh-kliem Owned by:
Priority: major Milestone: sage-9.3
Component: geometry Keywords: bitsets, CombinatorialPolyhedron, SIMD, intrinsics
Cc: selia, stumpc5, hivert, slelievre, tscrim Merged in:
Authors: Jonathan Kliem Reviewers: Travis Scrimshaw
Report Upstream: N/A Work issues:
Branch: 8bd0b1f (Commits, GitHub, GitLab) Commit: 8bd0b1fe087a802603b426bd22d51458b951d776
Dependencies: #31208 Stopgaps:

Status badges

Description (last modified by gh-kliem)

#27122 enabled -march=native by default. This allows using intrinsics for Intel and AMD processors (unless the user specified different CFLAGS or CXXFLAGS differently).

Bitsets allocated using MemoryAllocator are overaligned allocated. This is not the case otherwise and there are several reasons not to do so for now:

  • More memory usage especially for small bitsets (bitset_t itself is rather small).
  • It makes realloc a lot harder.
  • There seems to be only about 10 percent loss when data is not overaligned.
  • There seems to be only about 1 percent performance loss, when the data is overaligned but the code is not customized for it (_mm256_loadu_si256 instead of _mm256_load_si256 etc).

See https://software.intel.com/sites/landingpage/IntrinsicsGuide for details about the instructions.

The emptyness check as it is seems be really fast and it appears pointless to beat it using intrinsics.

A few minor improvements in the bitset frontend of bitsets were also made.

Timings without intrinsics (should be exactly the same as without this ticket as the code almost did not change):

sage: B = Bitset((randint(0,2**20) for _ in range(2**19)))                                                                                                                          
sage: %timeit B == B                                                                                                                                                                
7.93 µs ± 5.71 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: Bop = ~B                                                                                                                                                                      
sage: %timeit B.isdisjoint(Bop)                                                                                                                                                     
8.11 µs ± 4.43 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B.issubset(B)                                                                                                                                                         
7.98 µs ± 7.64 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

sage: B2 = ~Bop                                    
sage: %timeit B2.intersection_update(B)                                                                                                                                             
7.22 µs ± 20 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B2.union(B)                                                                                                                                                           
14.6 µs ± 5.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B.symmetric_difference(Bop)                                                                                                                                           
14 µs ± 9.34 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B.difference_update(Bop)                                                                                                                                              
7.2 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

sage: P = polytopes.hypercube(14)                                                                                                                                                   
sage: P1 = P.stack(next(P.face_generator(1)))                                                                                                                                       
sage: C = CombinatorialPolyhedron(P1)                                                                                                                                               
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 12.8 s, sys: 7.98 ms, total: 12.8 s
Wall time: 12.8 s
(1, 16385, 131069, 487383, 1117948, 1769482, 2047331, 1788501, 1200342, 622908, 248963, 75361, 16640, 2470, 197, 1)

sage: P = polytopes.Birkhoff_polytope(5)                                                                                                                                            
sage: C = CombinatorialPolyhedron(P)                                                                                                                                                
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 581 ms, sys: 0 ns, total: 581 ms
Wall time: 580 ms
(1, 120, 5040, 50250, 233400, 631700, 1113700, 1367040, 1220550, 817150, 419225, 167200, 52120, 12600, 2300, 300, 25, 1)

sage: P = polytopes.associahedron(['A', 11], backend='normaliz')                                                                                                                    
sage: C = CombinatorialPolyhedron(P)                                                                                                                                                
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 38 s, sys: 20.2 ms, total: 38 s
Wall time: 38 s
(1, 208012, 1144066, 2735810, 3730650, 3197700, 1790712, 659736, 157080, 23100, 1925, 77, 1)

With SSE4.1 (some of them would also run just with SSE2, but already SSE4.1 is 2008):

sage: B = Bitset((randint(0,2**20) for _ in range(2**19)))                                                                                                                          
sage: %timeit B == B                                                                                                                                                                
7.91 µs ± 3.03 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: Bop = ~B                                                                                                                                                                      
sage: %timeit B.isdisjoint(Bop)                                                                                                                                                     
5.1 µs ± 6.09 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B.issubset(B)                                                                                                                                                         
4.71 µs ± 5.07 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

sage: B2 = ~Bop                                                                                                                                                                     
sage: %timeit B2.intersection_update(B)                                                                                                                                             
5.57 µs ± 9.94 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B2.union(B)                                                                                                                                                           
11.9 µs ± 5.64 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B.symmetric_difference(Bop)                                                                                                                                           
12.4 µs ± 7.06 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B.difference_update(Bop)                                                                                                                                              
4.66 µs ± 38.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

sage: P = polytopes.hypercube(14)                                                                                                                                                   
sage: P1 = P.stack(next(P.face_generator(1)))                                                                                                                                       
sage: C = CombinatorialPolyhedron(P1)                                                                                                                                               
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 7.98 s, sys: 4 ms, total: 7.99 s
Wall time: 7.98 s
(1, 16385, 131069, 487383, 1117948, 1769482, 2047331, 1788501, 1200342, 622908, 248963, 75361, 16640, 2470, 197, 1)

sage: P = polytopes.Birkhoff_polytope(5)                                                                                                                                            
sage: C = CombinatorialPolyhedron(P)                                                                                                                                                
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 466 ms, sys: 0 ns, total: 466 ms
Wall time: 466 ms
(1, 120, 5040, 50250, 233400, 631700, 1113700, 1367040, 1220550, 817150, 419225, 167200, 52120, 12600, 2300, 300, 25, 1)
                                                                                                                                                                              
sage: P = polytopes.associahedron(['A', 11], backend='normaliz')                                                                                                                    
sage: C = CombinatorialPolyhedron(P)                                                                                                                                                
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 26.8 s, sys: 8.09 ms, total: 26.8 s
Wall time: 26.8 s
(1, 208012, 1144066, 2735810, 3730650, 3197700, 1790712, 659736, 157080, 23100, 1925, 77, 1)

With AVX:

sage: B = Bitset((randint(0,2**20) for _ in range(2**19)))                                                                                                                          
sage: %timeit B == B                                                                                                                                                                
3.99 µs ± 2.46 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: Bop = ~B                                                                                                                                                                      
sage: %timeit B.isdisjoint(Bop)                                                                                                                                                     
3.38 µs ± 12.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B.issubset(B)                                                                                                                                                         
3.05 µs ± 2.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

sage: P = polytopes.hypercube(14)                                                                                                                                                   
sage: P1 = P.stack(next(P.face_generator(1)))                                                                                                                                       
sage: C = CombinatorialPolyhedron(P1)                                                                                                                                               
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 5.27 s, sys: 7.97 ms, total: 5.28 s
Wall time: 5.28 s
(1, 16385, 131069, 487383, 1117948, 1769482, 2047331, 1788501, 1200342, 622908, 248963, 75361, 16640, 2470, 197, 1)

sage: P = polytopes.Birkhoff_polytope(5)                                                                                                                                            
sage: C = CombinatorialPolyhedron(P)                                                                                                                                                
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 449 ms, sys: 6 µs, total: 449 ms
Wall time: 449 ms
(1, 120, 5040, 50250, 233400, 631700, 1113700, 1367040, 1220550, 817150, 419225, 167200, 52120, 12600, 2300, 300, 25, 1)

# It does not use subset checks as it is a simple polytope.
sage: P = polytopes.associahedron(['A', 11], backend='normaliz')                                                                                                                    
sage: C = CombinatorialPolyhedron(P)                                                                                                                                                
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 30.4 s, sys: 23.6 ms, total: 30.4 s
Wall time: 30.4 s
(1, 208012, 1144066, 2735810, 3730650, 3197700, 1790712, 659736, 157080, 23100, 1925, 77, 1)

With AVX2:

sage: B = Bitset((randint(0,2**20) for _ in range(2**19)))                                                                                                                          
sage: Bop = ~B                                                                                                                                                                      
sage: B2 = ~Bop                                                                                                                                                                     
sage: %timeit B2.intersection_update(B)                                                                                                                                             
4.2 µs ± 3.28 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B2.union(B)                                                                                                                                                           
12 µs ± 5.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B.symmetric_difference(Bop)                                                                                                                                           
12 µs ± 5.98 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage: %timeit B.difference_update(Bop)                                                                                                                                              
4.03 µs ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

sage: P = polytopes.hypercube(14)                                                                                                                                                   
sage: P1 = P.stack(next(P.face_generator(1)))                                                                                                                                       
sage: C = CombinatorialPolyhedron(P1)                                                                                                                                               
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 5.18 s, sys: 7 µs, total: 5.18 s
Wall time: 5.18 s
(1, 16385, 131069, 487383, 1117948, 1769482, 2047331, 1788501, 1200342, 622908, 248963, 75361, 16640, 2470, 197, 1)

sage: P = polytopes.Birkhoff_polytope(5)                                                                                                                                            
sage: C = CombinatorialPolyhedron(P)                                                                                                                                                
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 436 ms, sys: 3 µs, total: 436 ms
Wall time: 435 ms
(1, 120, 5040, 50250, 233400, 631700, 1113700, 1367040, 1220550, 817150, 419225, 167200, 52120, 12600, 2300, 300, 25, 1)

sage: P = polytopes.associahedron(['A', 11], backend='normaliz')                                                                                                                    
sage: C = CombinatorialPolyhedron(P)                                                                                                                                                
sage: %time C.f_vector()                                                                                                                                                            
CPU times: user 24.3 s, sys: 19.7 ms, total: 24.3 s
Wall time: 24.3 s
(1, 208012, 1144066, 2735810, 3730650, 3197700, 1790712, 659736, 157080, 23100, 1925, 77, 1)

Change History (46)

comment:1 Changed 3 years ago by gh-kliem

  • Dependencies changed from #26887 to #26887, #27122
  • Description modified (diff)
  • Keywords SIMD intrinsics added; polytopes f-vector edge graph flag vector removed
  • Priority changed from minor to major
  • Summary changed from Farther improve Combinatorial Polyhedron for AVX but not AVX 2. to Farther improve CombinatorialPolyhedron: Enable SIMD-instructions.

comment:2 Changed 3 years ago by gh-kliem

  • Summary changed from Farther improve CombinatorialPolyhedron: Enable SIMD-instructions. to Farther improve CombinatorialPolyhedron: Enable SIMD-instructions

comment:3 Changed 3 years ago by gh-kliem

This is a way, one could boost CombinatorialPolyhedron? with intrinsics. Replace is_subset, intersection and CountFaceBits by those functions:

#if __AVX__
    #include <immintrin.h>
#elif __SSE4_1__
    #include <emmintrin.h>
    #include <smmintrin.h>
#endif

#if __POPCNT__
    #include <immintrin.h>
#endif


// as of now, 512bit does not have something like _mm256_testc_si256,
// which is the bottle neck of this function,
// so it does not make sense to implement it

// inline int is_subset(uint64_t *A, uint64_t *B, size_t face_length)
// the bottlen-neck is checking for subsets, which requires something as
// _mm256_testc_si256, trying to determine, what is the best way of doing it:
#if __AVX__
    // 256-bit commands, those operations are equivalent to the operations
    // defined in `#else`
    // intrics defined in immintrin.h
    const size_t chunksize = 256;
    inline int is_subset(uint64_t *A, uint64_t *B, size_t face_length){
        // A & ~B == 0
        // returns 1 if A is a subset of B, otherwise returns 0
        // this is done by checking if there is an element in A,
        // which is not in B
        // `face_length` is the length of A and B in terms of uint64_t
        // note that A,B need to be 32-byte-aligned
        size_t i;
        for (i = 0; i < face_length; i += 4){
            __m256i a = _mm256_load_si256((const __m256i*)&A[i]);
            __m256i b = _mm256_load_si256((const __m256i*)&B[i]);
            if (!_mm256_testc_si256(b, a)){ //need to be opposite order !!
                return 0;
            }
        }
        return 1;
    }

#elif __SSE4_1__
    // 128-bit commands, those operations are equivalent to the operations
    // defined in `#else`
    // intrics defined in smmintrin.h and emmintrin.h
    const size_t chunksize = 128;
    inline int is_subset(uint64_t *A, uint64_t *B, size_t face_length){
        // A & ~B == 0
        // returns 1 if A is a subset of B, otherwise returns 0
        // this is done by checking if there is an element in A,
        // which is not in B
        // `face_length` is the length of A and B in terms of uint64_t
        // note that A,B need to be 16-byte-aligned
        size_t i;
        for (i = 0; i < face_length; i += 2){
            __m128i a = _mm_load_si128((const __m128i*)&A[i]);
            __m128i b = _mm_load_si128((const __m128i*)&B[i]);
            if (!_mm_testc_si128(b, a)){ //need to be opposite order !!
                return 0;
            }
        }
        return 1;
    }

#else
    // no intrinsics
    const size_t chunksize = 64;
    inline int is_subset(uint64_t *A, uint64_t *B, size_t face_length){
        // A & ~B == 0
        // returns 1 if A is a subset of B, otherwise returns 0
        // this is done by checking if there is an element in A,
        // which is not in B
        // `face_length` is the length of A and B in terms of uint64_t
        size_t i;
        for (i = 0; i < face_length; i++){
            if (A[i] & ~B[i]){
                return 0;
            }
        }
        return 1;
    }

#endif

// inline void intersection(uint64_t *A, uint64_t *B, uint64_t *C,
//                          size_t face_length)
// now determining, how to do insersection
#if __AVX2__
    // 256-bit commands, those operations are equivalent to the operations
    // defined in `#else`
    // intrics defined in immintrin.h
    inline void intersection(uint64_t *A, uint64_t *B, uint64_t *C, \
                             size_t face_length){
        // C = A & B
        // will set C to be the intersection of A and B
        // `face_length` is the length of A, B and C in terms of uint64_t
        // note that A,B,C need to be 32-byte-aligned
        size_t i;
        for (i = 0; i < face_length; i += 4){
            __m256i a = _mm256_load_si256((const __m256i*)&A[i]);
            __m256i b = _mm256_load_si256((const __m256i*)&B[i]);
            __m256i c = _mm256_and_si256(a, b);
            _mm256_store_si256((__m256i*)&C[i],c);
        }
    }

#elif __SSE4_1__
    // actually SSE2 would be fine, but we don't want to force greater chunks,
    // because of intersection, which is not the bottleneck
    // 128-bit commands, those operations are equivalent to the operations
    // defined in `#else`
    // intrinsics defined in emmintrin.h
    inline void intersection(uint64_t *A, uint64_t *B, uint64_t *C, \
                             size_t face_length){
        // C = A & B
        // will set C to be the intersection of A and B
        // `face_length` is the length of A, B and C in terms of uint64_t
        // note that A,B,C need to be 16-byte-aligned
        size_t i;
        for (i = 0; i < face_length; i += 2){
            __m128i a = _mm_load_si128((const __m128i*)&A[i]);
            __m128i b = _mm_load_si128((const __m128i*)&B[i]);
            __m128i c = _mm_and_si128(a, b);
            _mm_store_si128((__m128i*)&C[i],c);
        }
    }

#else
    // commands, without intrinsics
    inline void intersection(uint64_t *A, uint64_t *B, uint64_t *C, \
                             size_t face_length){
        // C = A & B
        // will set C to be the intersection of A and B
        // `face_length` is the length of A, B and C in terms of uint64_t
        size_t i;
        for (i = 0; i < face_length; i++){
            C[i] = A[i] & B[i];
        }
    }

#endif

// inline size_t CountFaceBits(uint64_t* A, size_t face_length)
// determine the best way to count the set bits in uint64_t *
#if (__POPCNT__) && (INTPTR_MAX == INT64_MAX) // 64-bit and popcnt
    inline size_t CountFaceBits(uint64_t* A, size_t face_length) {
        // counts the number of vertices in a face by counting bits set to one
        // `face_length` is the length of A in terms of uint64_t
        size_t i;
        unsigned int count = 0;
        for (i=0; i<face_length; i++){
            count += (size_t) _mm_popcnt_u64(A[i]);
        }
        return count;
    }

#else // popcount without intrinsics
    inline size_t CountFaceBits(uint64_t* A, size_t face_length) {
        // counts the number of vertices in a face by counting bits set to one
        // `face_length` is the length of A in terms of uint64_t
        size_t i;
        unsigned int count = 0;
        for (i=0; i<face_length; i++){
            uint64_t a = A[i];
            while (a){
                count += a & 1;
                a >>= 1;
            }
        }
        return count;
    }

#endif

comment:4 Changed 3 years ago by gh-kliem

  • Branch set to public/27103
  • Commit set to 52cb026a0fbddc1984701f7054bd71e97e157ea1

Last 10 new commits:

6052b61FaceIterator is an actual iterator now
e8309e4fixed allocation of array
6a55fa5it.next() -> next(it), also fixed face_lattice_facet_repr for trivial cases
9f42182improved documentation mostly
687d003replaced calloc by new[]
f8a201eImproved comments all through base.pyx
df43a2dimproved pseudo-code Algorithm in ``FaceIterator``
b73e53cCorrected a mathematical error in explonation.
f791cceremoved notes and old files again
52cb026replaced methods by intrinsic alternatives

comment:5 Changed 2 years ago by embray

  • Milestone changed from sage-8.7 to sage-8.8

Ticket retargeted after milestone closed (if you don't believe this ticket is appropriate for the Sage 8.8 release please retarget manually)

comment:6 Changed 2 years ago by embray

  • Milestone sage-8.8 deleted

As the Sage-8.8 release milestone is pending, we should delete the sage-8.8 milestone for tickets that are not actively being worked on or that still require significant work to move forward. If you feel that this ticket should be included in the next Sage release at the soonest please set its milestone to the next release milestone (sage-8.9).

comment:7 Changed 2 years ago by gh-kliem

  • Milestone set to sage-pending

comment:8 Changed 2 years ago by embray

  • Summary changed from Farther improve CombinatorialPolyhedron: Enable SIMD-instructions to Further improve CombinatorialPolyhedron: Enable SIMD-instructions

comment:9 Changed 2 years ago by git

  • Commit changed from 52cb026a0fbddc1984701f7054bd71e97e157ea1 to 10a61963ba0bb2547e44934bfa197933d1395d96

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

10a6196enabled intrinsics in bit_vector_operation.cc

comment:10 follow-up: Changed 2 years ago by dcoudert

for popcount, you can use something like this when builtin popcount is not available:

cdef inline int popcount64(uint64_t i):
   """
   Return the number of '1' bits in a 64-bits integer.
   """
   i = i - ((i >> 1) & 0x5555555555555555ULL)
   i = (i & 0x3333333333333333ULL) + ((i >> 2) & 0x3333333333333333ULL)
   return ( ((i + (i >> 4)) & 0x0f0f0f0f0f0f0f0fULL) * 0x0101010101010101ULL ) >> 56

EDIT: removed inappropriate comment (bad copy/paste)

Last edited 2 years ago by dcoudert (previous) (diff)

comment:11 Changed 2 years ago by git

  • Commit changed from 10a61963ba0bb2547e44934bfa197933d1395d96 to d069567e351f0dcecd5f790171cf30a5a8089fb3

Branch pushed to git repo; I updated commit sha1. New commits:

d069567improved popcount without intrinsics

comment:12 in reply to: ↑ 10 Changed 2 years ago by gh-kliem

Thanks. Was a nice exercise to go through it.

Replying to dcoudert:

for popcount, you can use something like this when builtin popcount is not available:

cdef inline int popcount64(uint64_t i):
   """
   Return the number of '1' bits in a 64-bits integer.
   """
   i = i - ((i >> 1) & 0x5555555555555555ULL)
   i = (i & 0x3333333333333333ULL) + ((i >> 2) & 0x3333333333333333ULL)
   return ( ((i + (i >> 4)) & 0x0f0f0f0f0f0f0f0fULL) * 0x0101010101010101ULL ) >> 56

EDIT: removed inappropriate comment (bad copy/paste)

comment:13 Changed 2 years ago by gh-kliem

  • Description modified (diff)
  • Status changed from new to needs_review

comment:14 Changed 2 years ago by gh-kliem

  • Authors set to David Coudert, Jonathan Kliem
  • Description modified (diff)
  • Summary changed from Further improve CombinatorialPolyhedron: Enable SIMD-instructions to Enable SIMD-instructions for Combinatorial Polyhedron

comment:15 Changed 2 years ago by dcoudert

You could add an how to help reviewing this ticket. I can test it on OSX, but I assume it's far from enough.

PS: no need to add me as an author. I just gave a small trick from hakmem 169. See e.g. http://www.dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html

comment:16 Changed 2 years ago by gh-kliem

  • Description modified (diff)

It works for me.

  • Operating System: Debian GNU/Linux 10 (buster)
  • Kernel: Linux 4.19.0-6-amd64
  • Architecture: x86-64
  • CPU Model name: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
  • sse4.1, avx, avx2, popcnt

comment:17 Changed 2 years ago by gh-kliem

  • Authors changed from David Coudert, Jonathan Kliem to Jonathan Kliem
  • Cc selia added

comment:18 Changed 2 years ago by dcoudert

System:

  • Operating System: OS X 10.14.6
  • Kernel: Darwin Kernel Version 18.7.0
  • Architecture: x86-64
  • CPU Model name: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
  • sse4.1, avx2, popcnt

Timing using py3:

  • 0.08155608177185059

With this ticket (git trac checkout 27103) and CFLAGS set:

  • 0.024885177612304688

and all tests pass.

comment:19 follow-up: Changed 2 years ago by vdelecroix

If you want reasonably fast implementation of bitsets, you should consider using sage/data_structures/bitset.pxi. That uses GMP which has a lot of asembly optimization. If it is not good enough, then you should rather make bitset.pxi faster than rewriting code for a very specialized application.

comment:20 Changed 2 years ago by gh-kliem

Works on

  • Operating System: Ubuntu 18.04.3 LTS
  • Kernel: Linux 4.15.0-65-generic
  • Architecture: x86-64
  • Model name: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
  • sse4_1 avx avx2 popcnt

Before:

  • 0.05162477493286133

After

  • 0.02220010757446289

comment:21 in reply to: ↑ 19 Changed 2 years ago by gh-kliem

Replying to vdelecroix:

If you want reasonably fast implementation of bitsets, you should consider using sage/data_structures/bitset.pxi. That uses GMP which has a lot of asembly optimization. If it is not good enough, then you should rather make bitset.pxi faster than rewriting code for a very specialized application.

Yes...

That is going to be some project, but it definitely makes sense.

One problem with bitset (to my understanding) so far is, that it is not overaligned. If it is overaligned, loads and stores are supposedly much faster (when using intrinsics).

Probably the best way to find out, is to see how the performance of my code actually changes, when I use bitsets instead of my own implementation.

Btw, if you have a specific example in mind, where calculations with bitsets are the bottle neck, I would be interested.

comment:22 Changed 2 years ago by vdelecroix

For examples, have a look in the graph code sage/graphs/ (e.g. iteration over vertex covers, etc). Apparently there are stuff in sage/combinat/ as well. More generally

$ git grep -l bitset

comment:23 Changed 2 years ago by charpent

Works on top of #27122, itrself on top of Python 3-based 9.0.beta2 on :

  • Operating System: Debian GNU/Linux bullseye/sid
  • Kernel: Linux 5.2.0-3-amd64
  • Architecture: x86-64
  • Nom de modèle : Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

Before: 0.046399831771850586

After: 0.016237735748291016

All polyhedra-related tests pass.

HTH,

comment:24 Changed 2 years ago by gh-kliem

I'm almost certain that this is applicable to sage/data_structures/bitset.pxi. I ran the following test:

sage: def fill_bitset():
....:     for _ in range(1000000):
....:         yield randint(0,10000000000)
....:     yield 10000000000
....:     
sage: a = FrozenBitset(fill_bitset())
sage: cython('''
....: from sage.data_structures.bitset cimport bitset_t, FrozenBitset
....: def intersect(FrozenBitset a, FrozenBitset b):
....:     a.intersection(b)
....: ''')
sage: %timeit intersect(a,a)

With the current implementation via GMP this takes about 781 ms (best of 3).

mpn_and_n(r.bits, a.bits, b.bits, b.limbs)

I replaced this by:

cdef mp_size_t i
cdef mp_limb_t *rb = r.bits, *ab = a.bits, *bb = b.bits
for i from 0 <= i < a.limbs:
    rb[i] = ab[i] & bb[i]

and it seems to be about the same.

Of course this is a very brief naive test, but there seems to be no reason to use GMP for this specific function. Notice, that this doesn't even use intrinsics.

comment:25 Changed 2 years ago by vdelecroix

If you can make data_structures/bitset.pxi better this is wonderful. What I want to avoid is two different piece of code that achieve the same thing. Please put your code in data_structures/ rather than in polyhedron and make it as generic as possible. That way it will benefit to others.

Also, note that not all functions in bitset.pxi use GMP/MPIR. Where it does it should be fast. That is comparison (mpn_cmp), intersection (mpn_and_n), union (mpn_ior_n), etc. You might want to have a look at how these are implemented GMP/MPIR.

comment:26 Changed 23 months ago by gh-kliem

Things are complicated and a lot more complicated than I thought.

  • To my knowledge, I cannot do something as #if __AVX__ in cython, can I? This would mean, we would have to rewrite bitset in C or C++ to use intrinsics.
  • So far I'm only reimplementing very basic functions.
  • I cannot write simple tests, where intrinsics are much faster. For intersection I don't see an improvement, for subset checks I'm maybe 10% faster. Here is, why I think this might be the case:

Loading registers takes a lot of time. In many cases this might be the bottle neck. Sure intrinsics need only one processor step instead of four to do bitwise AND of two registers. However, when I'm mostly waiting on read/write this does not help. Without intrinsics the processor can still process registers as fast as they can be loaded and stored again (don't forget that the result blocks another register while it is being stored).

Now in my scenario with subset checks things are a bit different. Nothings needs to be stored after the computation (this would explain why subset check with intrinsics is 10% faster than without). But more importantly seems to be the exact scenario I'm in:

I'm checking whether A is a subset of any of B_1,...,B_n. In most cases only the first 256 vertices are needed to discover that A is not a subset of B_i. So the first register of A can be reused a lot of times.

So, intrinsics might be useful in other scenarios for bitsets in sage, but it's definitly not a no-brainer. One won't lose. That I'm pretty sure of.

comment:27 Changed 23 months ago by git

  • Commit changed from d069567e351f0dcecd5f790171cf30a5a8089fb3 to 5dc41db60cc3754b7022f153ba52f6140f85bf32

Branch pushed to git repo; I updated commit sha1. New commits:

5dc41dbfixed copy/paste typo

comment:28 Changed 21 months ago by gh-kliem

  • Cc stumpc5 added

comment:29 Changed 17 months ago by slelievre

  • Cc hivert slelievre added
  • Description modified (diff)

comment:30 Changed 17 months ago by gh-kliem

  • Dependencies changed from #26887, #27122 to #27122
  • Status changed from needs_review to needs_info

I will probably move those improvements to data_structures/bitset.pxi, once #27122 is done.

In the mean time, one can of course still pull the attached branch, if one wants to improve the f-vector computation.

comment:31 Changed 13 months ago by gh-kliem

  • Branch changed from public/27103 to public/27103-reb
  • Commit changed from 5dc41db60cc3754b7022f153ba52f6140f85bf32 to be5c59038e7914610f4a734971deca6cc47c9646
  • Dependencies changed from #27122 to #27122, #30429, #30435

New commits:

39f3a38directly check is_simple/is_simplicial
bf91483improved count atoms
fe880a4input of `intersection` in standard order
1826202Merge branch 'public/30429' of git://trac.sagemath.org/sage into u/gh-kliem/improved_count_atoms
44631b8enabled intrinsics in bit_vector_operation.cc
be5c590consistent namings in comments

comment:32 Changed 13 months ago by gh-kliem

  • Branch changed from public/27103-reb to public/27103-reb2
  • Commit changed from be5c59038e7914610f4a734971deca6cc47c9646 to cd0198e09e3baeb8ae6e135a1e7e9c504958312a
  • Dependencies changed from #27122, #30429, #30435 to #27122, #30040, #30429, #30435

Last 10 new commits:

131c065add a specialized `get_next_level_simple`
bc00d42prepare slightly modified algorithm for simple/simplicial polytopes
4fdc787faster algorithm for simple/simplicial polytopes
4724d3csmall fixes
0297853improvements in documentation
7ffac00changes in 30429
01b4154typo
24350b9Revert "unite and is_zero for bit_vectors"
49eacafMerge branch 'public/27103-reb' of git://trac.sagemath.org/sage into public/27103-reb2
cd0198eunite and is_zero for bit_vectors with intrinsics

comment:33 Changed 13 months ago by git

  • Commit changed from cd0198e09e3baeb8ae6e135a1e7e9c504958312a to d3d1385996810fa322133e0f8490b46feee02196

Branch pushed to git repo; I updated commit sha1. New commits:

20a39b6outsource inclusion maximal
b672fcaremoved redundant function
f62e770merge in #30458
accea52missing }
d3d1385Merge branch 'public/30040-reb' of git://trac.sagemath.org/sage into public/27103-reb2

comment:34 Changed 13 months ago by gh-kliem

  • Branch changed from public/27103-reb2 to public/27103-reb3
  • Commit changed from d3d1385996810fa322133e0f8490b46feee02196 to 67c88cfb1f22e4f886834e3b6c53487e62ae7f6d

Changes in #30040.


Last 10 new commits:

7f5b19funite and is_zero for bit_vectors
1496bfdadd a specialized `get_next_level_simple`
fc4784achanges in 30429
81ba2e0simplify `get_next_level_simple by 30458
9cece74prepare slightly modified algorithm for simple/simplicial polytopes
fe5852atypo
1510386faster algorithm for simple/simplicial polytopes
86c564asmall fixes
ed6e966improvements in documentation
67c88cfmerged in 30040

comment:35 Changed 13 months ago by git

  • Commit changed from 67c88cfb1f22e4f886834e3b6c53487e62ae7f6d to cfcdeb677073fc32f8dd57263056b640a7f6a518

Branch pushed to git repo; I updated commit sha1. New commits:

cfcdeb6Merge branch 'develop' of git://trac.sagemath.org/sage into public/27103-reb3

comment:36 Changed 8 months ago by gh-kliem

  • Dependencies changed from #27122, #30040, #30429, #30435 to #31208

comment:37 Changed 8 months ago by gh-kliem

  • Branch changed from public/27103-reb3 to public/27103-in-bitsets
  • Cc tscrim added
  • Commit changed from cfcdeb677073fc32f8dd57263056b640a7f6a518 to bd78875e2aad36e4631d3c329df411c5afbac39a
  • Keywords bitsets added
  • Status changed from needs_info to needs_review

I will update the description yet, but this is basically ready for review.

I got three failing doctests, but I don't think they are related:

sage -t --long --random-seed=0 /srv/public/kliem/sage/src/sage/doctest/sources.py  # 1 doctest failed                                                                              
sage -t --long --random-seed=0 /srv/public/kliem/sage/src/sage/doctest/forker.py  # 1 doctest failed                                                                               
sage -t --long --random-seed=0 /srv/public/kliem/sage/src/sage/interfaces/singular.py  # Killed due to segmentation fault  

Otherwise on my machine everything passes (with AVX2).

(in the above two each time FDS.in_lib fails, so that doesn't seem to be related, also the singular timout doesn't seem to be related).


New commits:

348ed67use popcnt and tzcnt
2713d89easier instructions
9c6edc1better documentation
bd78875improvements to the frontend

comment:38 Changed 8 months ago by gh-kliem

  • Milestone changed from sage-pending to sage-9.3

comment:39 follow-up: Changed 8 months ago by git

  • Commit changed from bd78875e2aad36e4631d3c329df411c5afbac39a to db11a450cfb5d88d7c6e8c8e0837bfb0e506f09a

Branch pushed to git repo; I updated commit sha1. New commits:

db11a45enable intrinsics for bitsets

comment:40 in reply to: ↑ 39 Changed 8 months ago by gh-kliem

Replying to git:

Branch pushed to git repo; I updated commit sha1. New commits:

db11a45enable intrinsics for bitsets

Forgot to commit.

comment:41 Changed 8 months ago by git

  • Commit changed from db11a450cfb5d88d7c6e8c8e0837bfb0e506f09a to 65dba93023b83874be5b18d0e3f58ba5484b7db9

Branch pushed to git repo; I updated commit sha1. New commits:

65dba93mistakes in the sse4.1 code

comment:42 Changed 8 months ago by gh-kliem

  • Description modified (diff)
  • Summary changed from Enable SIMD-instructions for Combinatorial Polyhedron to Enable SIMD-instructions for Bitsets

comment:43 Changed 8 months ago by git

  • Commit changed from 65dba93023b83874be5b18d0e3f58ba5484b7db9 to 8bd0b1fe087a802603b426bd22d51458b951d776

Branch pushed to git repo; I updated commit sha1. New commits:

a4f2e44remove sse4.1 for equality check, as it does not help
8bd0b1fremove assumption that LIMBS contain 64 bits

comment:44 Changed 8 months ago by tscrim

  • Reviewers set to Travis Scrimshaw
  • Status changed from needs_review to positive_review

Also works for me with

  • Operating System: Ubuntu 18.04.5 LTS
  • Kernel: Linux 4.15.0-112-generic
  • Architecture: x86-64
  • Model: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
  • sse4_1, sse4_2, avx, avx2 in cpu flags

Based on this, the patchbots, and the previous comments, I am setting this to a positive review.

comment:45 Changed 8 months ago by gh-kliem

Thank you.

comment:46 Changed 8 months ago by vbraun

  • Branch changed from public/27103-in-bitsets to 8bd0b1fe087a802603b426bd22d51458b951d776
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.