#17013 closed enhancement (fixed)
WordDatatype_char
Reported by:  vdelecroix  Owned by:  

Priority:  major  Milestone:  sage6.4 
Component:  combinatorics  Keywords:  words 
Cc:  slabbe  Merged in:  
Authors:  Vincent Delecroix  Reviewers:  Sébastien Labbé, Jeroen Demeyer 
Report Upstream:  N/A  Work issues:  
Branch:  c00964d (Commits)  Commit:  
Dependencies:  Stopgaps: 
Description
Creation of a new class WordDatatype_char that has in backend an 'unsigned char *' and use it wherever possible. The implementation is much faster than anything else but is only available for alphabet contained in [0,255].
Change History (24)
comment:1 Changed 5 years ago by
 Branch set to u/vdelecroix/17013
 Commit set to 398b33c0a6f6a96f5c301e1867edb47158703776
 Status changed from new to needs_review
comment:2 Changed 5 years ago by
 Branch changed from u/vdelecroix/17013 to u/slabbe/17013
 Commit changed from 398b33c0a6f6a96f5c301e1867edb47158703776 to a563343267f1690d74275ad3835cdebef8512a19
New commits:
a563343  trac #17013: doctests improvements + bug fix

comment:3 followup: ↓ 6 Changed 5 years ago by
 Reviewers set to Sébastien Labbé
If Vincent agrees with my changes, then he can set this ticket to positive review.
comment:4 followup: ↓ 5 Changed 5 years ago by
Here is one benchmark.
BEFORE::
sage: w = Word([0,1,2]*100, alphabet=[0,1,2]) sage: type(w) <class 'sage.combinat.words.word.FiniteWord_list'> sage: %timeit w.is_square_free() 1000 loops, best of 3: 417 µs per loop
AFTER::
sage: w = Word([0,1,2]*100, alphabet=[0,1,2]) sage: type(w) <class 'sage.combinat.words.word.FiniteWord_char'> sage: %timeit w.is_square_free() 100000 loops, best of 3: 6.29 µs per loop
We should also add here some benchmark about equality test and also __getitem__
.
comment:5 in reply to: ↑ 4 ; followup: ↓ 7 Changed 5 years ago by
Replying to slabbe:
We should also add here some benchmark about equality test and also
__getitem__
.
EDIT: the timings were edited because there were not about FiniteWord_str
but FiniteWord_list
...
Competitions between:
 Python string
FiniteWord_str
FiniteWord_char
sage: W1 = Words('abc') sage: W2 = Words([0,1,2]) sage: w1 = W1(''.join(choice('abc') for _ in range(1000))) sage: w2 = W2([choice([0,1,2]) for _ in range(1000]) sage: s = str(w1)
then on slices
sage: timeit("for i in range(1000): u = w1[:i]") 25 loops, best of 3: 25.3 ms per loop sage: timeit("for i in range(1000): u = w2[:i]") 625 loops, best of 3: 237 µs per loop sage: timeit("for i in range(1000): u = s[:i]") 625 loops, best of 3: 129 µs per loop
and on equality
sage: timeit("w1 == w1") 625 loops, best of 3: 365 ns per loop sage: timeit("w2 == w2") 625 loops, best of 3: 187 ns per loop sage: timeit("s == s") 625 loops, best of 3: 86.6 ns per loop
So Python strings are still ahead, but its much less ridiculous.
Vincent
comment:6 in reply to: ↑ 3 Changed 5 years ago by
 Status changed from needs_review to positive_review
Replying to slabbe:
If Vincent agrees with my changes, then he can set this ticket to positive review.
I am! Thanks for the review.
Vincent
comment:7 in reply to: ↑ 5 ; followup: ↓ 10 Changed 5 years ago by
Competitions between:
 Python string
FiniteWord_str
FiniteWord_char
I believe you do not have FiniteWord_str
in the competition...
sage: type(w1) <class 'sage.combinat.words.word.FiniteWord_list'>
comment:8 followup: ↓ 9 Changed 5 years ago by
Instead of (note that you already have defined i as a ssize_t
):
cdef ssize_t i for i in range(w._length1, 1, 1):
Maybe the following is better and would allow to use size_t
type?
cdef size_t i for i from w._length > i >= 0:
comment:9 in reply to: ↑ 8 Changed 5 years ago by
Replying to slabbe:
Instead of (note that you already have defined i as a
ssize_t
):cdef ssize_t i for i in range(w._length1, 1, 1):Maybe the following is better and would allow to use
size_t
type?cdef size_t i for i from w._length > i >= 0:
Nope, it does note change anything. And be careful, your second loop is infinite (a size_t is always >= 0).
comment:10 in reply to: ↑ 7 Changed 5 years ago by
Replying to slabbe:
Competitions between:
 Python string
FiniteWord_str
FiniteWord_char
I believe you do not have
FiniteWord_str
in the competition...sage: type(w1) <class 'sage.combinat.words.word.FiniteWord_list'>
Oups... corrected. Does not change anything for slicing, but huge difference in equality.
comment:11 Changed 5 years ago by
Great, so I am ok with this ticket. Positive review as already set!
Sébastien
comment:12 Changed 5 years ago by
 Reviewers changed from Sébastien Labbé to Sébastien Labbé, Jeroen Demeyer
 Status changed from positive_review to needs_work
This is absolutely not the right way to handle exceptions:
if PySlice_GetIndicesEx(...) < 0: return
The fact that it "works" is because IPython picks up the exception much later than you intended. Note the botched traceback:
sage: Words([0,1,2,3])([0,1,0,2,0,3,1,2,3])[slice("foo")]  TypeError Traceback (most recent call last) TypeError: slice indices must be integers or None or have an __index__ method
The right way is to declare your function as except 1
and let Cython do the right thing. See http://docs.cython.org/src/userguide/language_basics.html#errorreturnvalues
comment:13 Changed 5 years ago by
See https://github.com/cython/cython/commit/543c33c4a1fdf105c3173a0950297ce4f7550ece for how such a declaration would look like.
If you feel like it, add that patch to the Cython package and then you can do
from cpython.slice cimport *
instead of manually declaring the PySlice_
functions.
comment:14 Changed 5 years ago by
Hi Jeroen,
As you noticed, the commit in cython is 3 days old... and the reason is because I asked on the cythonusers list. Patching is much more work than the declaration (see the last commit):
cdef extern from "Python.h": int PySlice_GetIndicesEx(...) except 1
Vincent
comment:15 Changed 5 years ago by
 Branch changed from u/slabbe/17013 to u/vdelecroix/17013
 Commit changed from a563343267f1690d74275ad3835cdebef8512a19 to c00964da4655d7580277c3d4aba1e0a51b06b78e
 Status changed from needs_work to needs_review
New commits:
c00964d  trac #17013: fix Python C API declaration

comment:16 Changed 5 years ago by
 Status changed from needs_review to positive_review
comment:17 Changed 5 years ago by
Thanks!
comment:18 Changed 5 years ago by
 Branch changed from u/vdelecroix/17013 to c00964da4655d7580277c3d4aba1e0a51b06b78e
 Resolution set to fixed
 Status changed from positive_review to closed
comment:19 Changed 5 years ago by
 Commit c00964da4655d7580277c3d4aba1e0a51b06b78e deleted
Note that #15820 seems to address similar applications, but more general (without restriction to alphabets of length 255). In preliminary versions of bounded integer sequences, I experimented with char*
as underlying data structure, but found that using GMP long integers with the shift operations provided by GMP was faster.
comment:20 Changed 5 years ago by
You should really compare the speed of the operations here with those from #15820. What I dislike about this ticket here is that it just talks about WordDatatype_char
which doesn't mean anything to me: I never noticed that you were implementing sequences of integers in the interval [0,255].
On #15820, I insisted that the approach would be as general as possible. Of course, Simon King (the author) had a particular application in mind, but the code in #15820 never refers to that application, it is very general.
comment:21 followup: ↓ 22 Changed 5 years ago by
Hello,
The thing is that having a complicated datatype with letters not aligned with computer words also has disadvantages. Namely:
 code less readable
 might be slower if not carefully done (you can not use
memcmp
,memcpy
fromstrings.h
)
And implementing more involved algorithms (pattern matchings, counting subsequences, etc) might be painful.
Nevertheless, I would be happy to give it a try and do proper benchmarkings. There will be no problem to remove the data structure here if #15820 is just better.
Vincent
comment:22 in reply to: ↑ 21 Changed 5 years ago by
Replying to vdelecroix:
The thing is that having a complicated datatype with letters not aligned with computer words also has disadvantages. Namely:
 code less readable
True for the underlying boilerplate functions. Not necessarily true for higher level code.
 might be slower if not carefully done (you can not use
memcmp
,memcpy
fromstrings.h
)
Why can one not use memcmp
and memcpy
? Respectively, what's wrong with using mpn_cmp
and similar functions?
And implementing more involved algorithms (pattern matchings, counting subsequences, etc) might be painful.
Indeed, detection of subsequences (which is also implemented in #15820) is a bit more complicated than it would be with char*
. Detection of subsequences is one of the operations I care about. Admittedly, with my application to right modules over path algebras in mind, I care more about testing whether a sequence starts with a given subsequence than whether it just contains the subsequence.
comment:23 Changed 5 years ago by
Of course, bounded integer sequences have one disadvantage for your applications: you can not prescribe an alphabet. Each bounded integer sequences is formed by integers between 0 and a prescribed bound.
Anyway, here are some timings:
sage: from sage.data_structures.bounded_integer_sequences import BoundedIntegerSequence sage: w_s = Word([9,2,0,1,5,3,10,15,1,14,8,3]*100, alphabet=range(16)) sage: type(w_s) <class 'sage.combinat.words.word.FiniteWord_char'> sage: w_l = Word([9,2,0,1,5,3,10,15,1,14,8,3]*100) sage: type(w_l) <class 'sage.combinat.words.word.FiniteWord_list'> sage: w_t = Word((9,2,0,1,5,3,10,15,1,14,8,3)*100) sage: type(w_t) <class 'sage.combinat.words.word.FiniteWord_tuple'> sage: w_b = BoundedIntegerSequence(16, [9,2,0,1,5,3,10,15,1,14,8,3]*100) sage: w2_s = Word([9,2,0,1,5,3,10,15,1,14,8,3]*99+[9,2,0,1,5,3,10,15,1,14,3,8], alphabet=range(16)) sage: w2_l = Word([9,2,0,1,5,3,10,15,1,14,8,3]*99+[9,2,0,1,5,3,10,15,1,14,3,8]) sage: w2_t = Word((9,2,0,1,5,3,10,15,1,14,8,3)*99+(9,2,0,1,5,3,10,15,1,14,3,8)) sage: w2_b = BoundedIntegerSequence(16, [9,2,0,1,5,3,10,15,1,14,8,3]*99+[9,2,0,1,5,3,10,15,1,14,3,8]) sage: list(w_s)==list(w_l)==list(w_t)==list(w_b) True sage: list(w2_s)==list(w2_l)==list(w2_t)==list(w2_b) True
Comparison:
sage: %timeit w_s==w2_s 100000 loops, best of 3: 2.57 µs per loop sage: %timeit w_l==w2_l 10000 loops, best of 3: 57.8 µs per loop sage: %timeit w_t==w2_t 10000 loops, best of 3: 56 µs per loop sage: %timeit w_b==w2_b 1000000 loops, best of 3: 537 ns per loop
Hash, which is not (yet?) cached for bounded integer sequences:
sage: %timeit hash(w_s) 10000000 loops, best of 3: 137 ns per loop sage: %timeit hash(w_l) 10000000 loops, best of 3: 138 ns per loop sage: %timeit hash(w_t) 10000000 loops, best of 3: 137 ns per loop sage: %timeit hash(w_b) 1000000 loops, best of 3: 246 ns per loop
Concatenation:
sage: list(w_s+w2_s)==list(w_l+w2_l)==list(w_t+w2_t)==list(w_b+w2_b) True sage: %timeit w_s+w2_s 1000 loops, best of 3: 336 µs per loop sage: %timeit w_l+w2_l 10000 loops, best of 3: 31.1 µs per loop sage: %timeit w_t+w2_t 10000 loops, best of 3: 31.7 µs per loop sage: %timeit w_b+w2_b 1000000 loops, best of 3: 1.05 µs per loop
Subsequence containment:
sage: sub_s = Word([9,2,0,1,5,3,10,15,1,14,3,8], alphabet=range(16)) sage: sub_l = Word([9,2,0,1,5,3,10,15,1,14,3,8]) sage: sub_t = Word((9,2,0,1,5,3,10,15,1,14,3,8)) sage: sub_b = BoundedIntegerSequence(16, [9,2,0,1,5,3,10,15,1,14,3,8]) sage: W_s = w2_s+w_s sage: W_l = w2_l+w_l sage: W_t = w2_t+w_t sage: W_b = w2_b+w_b sage: sub_s.is_subword_of(W_s) True sage: sub_b in W_b True sage: %timeit sub_s.is_subword_of(W_s) 10000 loops, best of 3: 21.3 µs per loop sage: %timeit sub_l.is_subword_of(W_l) 100000 loops, best of 3: 10.9 µs per loop sage: %timeit sub_t.is_subword_of(W_t) 100000 loops, best of 3: 10.7 µs per loop sage: %timeit sub_b in W_b 100000 loops, best of 3: 10.8 µs per loop
So, using a compressed data format (where many letters are squeezed into one machine word) is not bad for the operations above.
comment:24 Changed 5 years ago by
I forgot slicing:
sage: list(w_s[30:70]) == list(w_b[30:70]) True sage: %timeit x=w_s[30:70] 1000000 loops, best of 3: 1.02 µs per loop sage: %timeit x=w_l[30:70] 10000 loops, best of 3: 20.6 µs per loop sage: %timeit x=w_t[30:70] 10000 loops, best of 3: 21 µs per loop sage: %timeit x=w_b[30:70] 1000000 loops, best of 3: 1.14 µs per loop sage: list(w_s[30:70:2]) == list(w_b[30:70:2]) True sage: %timeit x=w_s[30:70:2] 100000 loops, best of 3: 2.65 µs per loop sage: %timeit x=w_l[30:70:2] 100000 loops, best of 3: 13.3 µs per loop sage: %timeit x=w_t[30:70:2] 100000 loops, best of 3: 14.4 µs per loop sage: %timeit x=w_b[30:70:2] 1000000 loops, best of 3: 1.63 µs per loop
New commits:
trac #17013: new class WordDatatype_char
trac #17013: added is_square in WordDatatype_char