Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#17013 closed enhancement (fixed)

WordDatatype_char

Reported by: Vincent Delecroix Owned by:
Priority: major Milestone: sage-6.4
Component: combinatorics Keywords: words
Cc: Sébastien Labbé Merged in:
Authors: Vincent Delecroix Reviewers: Sébastien Labbé, Jeroen Demeyer
Report Upstream: N/A Work issues:
Branch: c00964d (Commits, GitHub, GitLab) Commit:
Dependencies: Stopgaps:

Status badges

Description

Creation of a new class WordDatatype_char that has in backend an 'unsigned char *' and use it wherever possible. The implementation is much faster than anything else but is only available for alphabet contained in [0,255].

Change History (24)

comment:1 Changed 8 years ago by Vincent Delecroix

Branch: u/vdelecroix/17013
Commit: 398b33c0a6f6a96f5c301e1867edb47158703776
Status: newneeds_review

New commits:

a25e286trac #17013: new class WordDatatype_char
398b33ctrac #17013: added is_square in WordDatatype_char

comment:2 Changed 8 years ago by Sébastien Labbé

Branch: u/vdelecroix/17013u/slabbe/17013
Commit: 398b33c0a6f6a96f5c301e1867edb47158703776a563343267f1690d74275ad3835cdebef8512a19

New commits:

a563343trac #17013: doctests improvements + bug fix

comment:3 Changed 8 years ago by Sébastien Labbé

Reviewers: Sébastien Labbé

If Vincent agrees with my changes, then he can set this ticket to positive review.

comment:4 Changed 8 years ago by Sébastien Labbé

Here is one benchmark.

BEFORE::

    sage: w = Word([0,-1,2]*100, alphabet=[0,-1,2])
    sage: type(w)
    <class 'sage.combinat.words.word.FiniteWord_list'>
    sage: %timeit w.is_square_free()
    1000 loops, best of 3: 417 µs per loop

AFTER::

    sage: w = Word([0,1,2]*100, alphabet=[0,1,2])
    sage: type(w)
    <class 'sage.combinat.words.word.FiniteWord_char'>
    sage: %timeit w.is_square_free()
    100000 loops, best of 3: 6.29 µs per loop

We should also add here some benchmark about equality test and also __getitem__.

comment:5 in reply to:  4 ; Changed 8 years ago by Vincent Delecroix

Replying to slabbe:

We should also add here some benchmark about equality test and also __getitem__.

EDIT: the timings were edited because there were not about FiniteWord_str but FiniteWord_list...

Competitions between:

  • Python string
  • FiniteWord_str
  • FiniteWord_char
sage: W1 = Words('abc')
sage: W2 = Words([0,1,2])
sage: w1 = W1(''.join(choice('abc') for _ in range(1000)))
sage: w2 = W2([choice([0,1,2]) for _ in range(1000])
sage: s = str(w1)

then on slices

sage: timeit("for i in range(1000): u = w1[:i]")
25 loops, best of 3: 25.3 ms per loop
sage: timeit("for i in range(1000): u = w2[:i]")
625 loops, best of 3: 237 µs per loop
sage: timeit("for i in range(1000): u = s[:i]")
625 loops, best of 3: 129 µs per loop

and on equality

sage: timeit("w1 == w1")
625 loops, best of 3: 365 ns per loop
sage: timeit("w2 == w2")
625 loops, best of 3: 187 ns per loop
sage: timeit("s == s")
625 loops, best of 3: 86.6 ns per loop

So Python strings are still ahead, but its much less ridiculous.

Vincent

Last edited 8 years ago by Vincent Delecroix (previous) (diff)

comment:6 in reply to:  3 Changed 8 years ago by Vincent Delecroix

Status: needs_reviewpositive_review

Replying to slabbe:

If Vincent agrees with my changes, then he can set this ticket to positive review.

I am! Thanks for the review.

Vincent

comment:7 in reply to:  5 ; Changed 8 years ago by Sébastien Labbé

Competitions between:

  • Python string
  • FiniteWord_str
  • FiniteWord_char

I believe you do not have FiniteWord_str in the competition...

sage: type(w1)                                     
<class 'sage.combinat.words.word.FiniteWord_list'> 

comment:8 Changed 8 years ago by Sébastien Labbé

Instead of (note that you already have defined i as a ssize_t):

cdef ssize_t i
for i in range(w._length-1, -1, -1):

Maybe the following is better and would allow to use size_t type?

cdef size_t i
for i from w._length > i >= 0:

comment:9 in reply to:  8 Changed 8 years ago by Vincent Delecroix

Replying to slabbe:

Instead of (note that you already have defined i as a ssize_t):

cdef ssize_t i
for i in range(w._length-1, -1, -1):

Maybe the following is better and would allow to use size_t type?

cdef size_t i
for i from w._length > i >= 0:

Nope, it does note change anything. And be careful, your second loop is infinite (a size_t is always >= 0).

comment:10 in reply to:  7 Changed 8 years ago by Vincent Delecroix

Replying to slabbe:

Competitions between:

  • Python string
  • FiniteWord_str
  • FiniteWord_char

I believe you do not have FiniteWord_str in the competition...

sage: type(w1)                                     
<class 'sage.combinat.words.word.FiniteWord_list'> 

Oups... corrected. Does not change anything for slicing, but huge difference in equality.

comment:11 Changed 8 years ago by Sébastien Labbé

Great, so I am ok with this ticket. Positive review as already set!

Sébastien

comment:12 Changed 8 years ago by Jeroen Demeyer

Reviewers: Sébastien LabbéSébastien Labbé, Jeroen Demeyer
Status: positive_reviewneeds_work

This is absolutely not the right way to handle exceptions:

if PySlice_GetIndicesEx(...) < 0:
    return

The fact that it "works" is because IPython picks up the exception much later than you intended. Note the botched traceback:

sage: Words([0,1,2,3])([0,1,0,2,0,3,1,2,3])[slice("foo")]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: slice indices must be integers or None or have an __index__ method

The right way is to declare your function as except -1 and let Cython do the right thing. See http://docs.cython.org/src/userguide/language_basics.html#error-return-values

comment:13 Changed 8 years ago by Jeroen Demeyer

See https://github.com/cython/cython/commit/543c33c4a1fdf105c3173a0950297ce4f7550ece for how such a declaration would look like.

If you feel like it, add that patch to the Cython package and then you can do

from cpython.slice cimport *

instead of manually declaring the PySlice_ functions.

comment:14 Changed 8 years ago by Vincent Delecroix

Hi Jeroen,

As you noticed, the commit in cython is 3 days old... and the reason is because I asked on the cython-users list. Patching is much more work than the declaration (see the last commit):

cdef extern from "Python.h":
    int PySlice_GetIndicesEx(...) except -1

Vincent

Last edited 8 years ago by Vincent Delecroix (previous) (diff)

comment:15 Changed 8 years ago by Vincent Delecroix

Branch: u/slabbe/17013u/vdelecroix/17013
Commit: a563343267f1690d74275ad3835cdebef8512a19c00964da4655d7580277c3d4aba1e0a51b06b78e
Status: needs_workneeds_review

New commits:

c00964dtrac #17013: fix Python C API declaration

comment:16 Changed 8 years ago by Jeroen Demeyer

Status: needs_reviewpositive_review

comment:17 Changed 8 years ago by Vincent Delecroix

Thanks!

comment:18 Changed 8 years ago by Volker Braun

Branch: u/vdelecroix/17013c00964da4655d7580277c3d4aba1e0a51b06b78e
Resolution: fixed
Status: positive_reviewclosed

comment:19 Changed 8 years ago by Simon King

Commit: c00964da4655d7580277c3d4aba1e0a51b06b78e

Note that #15820 seems to address similar applications, but more general (without restriction to alphabets of length 255). In preliminary versions of bounded integer sequences, I experimented with char* as underlying data structure, but found that using GMP long integers with the shift operations provided by GMP was faster.

comment:20 Changed 8 years ago by Jeroen Demeyer

You should really compare the speed of the operations here with those from #15820. What I dislike about this ticket here is that it just talks about WordDatatype_char which doesn't mean anything to me: I never noticed that you were implementing sequences of integers in the interval [0,255].

On #15820, I insisted that the approach would be as general as possible. Of course, Simon King (the author) had a particular application in mind, but the code in #15820 never refers to that application, it is very general.

comment:21 Changed 8 years ago by Vincent Delecroix

Hello,

The thing is that having a complicated datatype with letters not aligned with computer words also has disadvantages. Namely:

  1. code less readable
  2. might be slower if not carefully done (you can not use memcmp, memcpy from strings.h)

And implementing more involved algorithms (pattern matchings, counting subsequences, etc) might be painful.

Nevertheless, I would be happy to give it a try and do proper benchmarkings. There will be no problem to remove the data structure here if #15820 is just better.

Vincent

comment:22 in reply to:  21 Changed 8 years ago by Simon King

Replying to vdelecroix:

The thing is that having a complicated datatype with letters not aligned with computer words also has disadvantages. Namely:

  1. code less readable

True for the underlying boilerplate functions. Not necessarily true for higher level code.

  1. might be slower if not carefully done (you can not use memcmp, memcpy from strings.h)

Why can one not use memcmp and memcpy? Respectively, what's wrong with using mpn_cmp and similar functions?

And implementing more involved algorithms (pattern matchings, counting subsequences, etc) might be painful.

Indeed, detection of subsequences (which is also implemented in #15820) is a bit more complicated than it would be with char*. Detection of subsequences is one of the operations I care about. Admittedly, with my application to right modules over path algebras in mind, I care more about testing whether a sequence starts with a given subsequence than whether it just contains the subsequence.

comment:23 Changed 8 years ago by Simon King

Of course, bounded integer sequences have one disadvantage for your applications: you can not prescribe an alphabet. Each bounded integer sequences is formed by integers between 0 and a prescribed bound.

Anyway, here are some timings:

sage: from sage.data_structures.bounded_integer_sequences import BoundedIntegerSequence
sage: w_s = Word([9,2,0,1,5,3,10,15,1,14,8,3]*100, alphabet=range(16))
sage: type(w_s)
<class 'sage.combinat.words.word.FiniteWord_char'>
sage: w_l = Word([9,2,0,1,5,3,10,15,1,14,8,3]*100)
sage: type(w_l)
<class 'sage.combinat.words.word.FiniteWord_list'>
sage: w_t = Word((9,2,0,1,5,3,10,15,1,14,8,3)*100)
sage: type(w_t)
<class 'sage.combinat.words.word.FiniteWord_tuple'>
sage: w_b = BoundedIntegerSequence(16, [9,2,0,1,5,3,10,15,1,14,8,3]*100)
sage: w2_s = Word([9,2,0,1,5,3,10,15,1,14,8,3]*99+[9,2,0,1,5,3,10,15,1,14,3,8], alphabet=range(16))
sage: w2_l = Word([9,2,0,1,5,3,10,15,1,14,8,3]*99+[9,2,0,1,5,3,10,15,1,14,3,8])
sage: w2_t = Word((9,2,0,1,5,3,10,15,1,14,8,3)*99+(9,2,0,1,5,3,10,15,1,14,3,8))
sage: w2_b = BoundedIntegerSequence(16, [9,2,0,1,5,3,10,15,1,14,8,3]*99+[9,2,0,1,5,3,10,15,1,14,3,8])
sage: list(w_s)==list(w_l)==list(w_t)==list(w_b)
True
sage: list(w2_s)==list(w2_l)==list(w2_t)==list(w2_b)
True

Comparison:

sage: %timeit w_s==w2_s
100000 loops, best of 3: 2.57 µs per loop
sage: %timeit w_l==w2_l
10000 loops, best of 3: 57.8 µs per loop
sage: %timeit w_t==w2_t
10000 loops, best of 3: 56 µs per loop
sage: %timeit w_b==w2_b
1000000 loops, best of 3: 537 ns per loop

Hash, which is not (yet?) cached for bounded integer sequences:

sage: %timeit hash(w_s)
10000000 loops, best of 3: 137 ns per loop
sage: %timeit hash(w_l)
10000000 loops, best of 3: 138 ns per loop
sage: %timeit hash(w_t)
10000000 loops, best of 3: 137 ns per loop
sage: %timeit hash(w_b)
1000000 loops, best of 3: 246 ns per loop

Concatenation:

sage: list(w_s+w2_s)==list(w_l+w2_l)==list(w_t+w2_t)==list(w_b+w2_b)
True
sage: %timeit w_s+w2_s
1000 loops, best of 3: 336 µs per loop
sage: %timeit w_l+w2_l
10000 loops, best of 3: 31.1 µs per loop
sage: %timeit w_t+w2_t
10000 loops, best of 3: 31.7 µs per loop
sage: %timeit w_b+w2_b
1000000 loops, best of 3: 1.05 µs per loop

Subsequence containment:

sage: sub_s = Word([9,2,0,1,5,3,10,15,1,14,3,8], alphabet=range(16))
sage: sub_l = Word([9,2,0,1,5,3,10,15,1,14,3,8])
sage: sub_t = Word((9,2,0,1,5,3,10,15,1,14,3,8))
sage: sub_b = BoundedIntegerSequence(16, [9,2,0,1,5,3,10,15,1,14,3,8])
sage: W_s = w2_s+w_s
sage: W_l = w2_l+w_l
sage: W_t = w2_t+w_t
sage: W_b = w2_b+w_b
sage: sub_s.is_subword_of(W_s)
True
sage: sub_b in W_b
True
sage: %timeit sub_s.is_subword_of(W_s)
10000 loops, best of 3: 21.3 µs per loop
sage: %timeit sub_l.is_subword_of(W_l)
100000 loops, best of 3: 10.9 µs per loop
sage: %timeit sub_t.is_subword_of(W_t)
100000 loops, best of 3: 10.7 µs per loop
sage: %timeit sub_b in W_b
100000 loops, best of 3: 10.8 µs per loop

So, using a compressed data format (where many letters are squeezed into one machine word) is not bad for the operations above.

comment:24 Changed 8 years ago by Simon King

I forgot slicing:

sage: list(w_s[30:70]) == list(w_b[30:70])
True
sage: %timeit x=w_s[30:70]
1000000 loops, best of 3: 1.02 µs per loop
sage: %timeit x=w_l[30:70]
10000 loops, best of 3: 20.6 µs per loop
sage: %timeit x=w_t[30:70]
10000 loops, best of 3: 21 µs per loop
sage: %timeit x=w_b[30:70]
1000000 loops, best of 3: 1.14 µs per loop
sage: list(w_s[30:70:2]) == list(w_b[30:70:2])
True
sage: %timeit x=w_s[30:70:2]
100000 loops, best of 3: 2.65 µs per loop
sage: %timeit x=w_l[30:70:2]
100000 loops, best of 3: 13.3 µs per loop
sage: %timeit x=w_t[30:70:2]
100000 loops, best of 3: 14.4 µs per loop
sage: %timeit x=w_b[30:70:2]
1000000 loops, best of 3: 1.63 µs per loop
Note: See TracTickets for help on using tickets.