Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#17013 closed enhancement (fixed)

WordDatatype_char

Reported by: vdelecroix Owned by:
Priority: major Milestone: sage-6.4
Component: combinatorics Keywords: words
Cc: slabbe Merged in:
Authors: Vincent Delecroix Reviewers: Sébastien Labbé, Jeroen Demeyer
Report Upstream: N/A Work issues:
Branch: c00964d (Commits) Commit:
Dependencies: Stopgaps:

Description

Creation of a new class WordDatatype_char that has in backend an 'unsigned char *' and use it wherever possible. The implementation is much faster than anything else but is only available for alphabet contained in [0,255].

Change History (24)

comment:1 Changed 5 years ago by vdelecroix

  • Branch set to u/vdelecroix/17013
  • Commit set to 398b33c0a6f6a96f5c301e1867edb47158703776
  • Status changed from new to needs_review

New commits:

a25e286trac #17013: new class WordDatatype_char
398b33ctrac #17013: added is_square in WordDatatype_char

comment:2 Changed 5 years ago by slabbe

  • Branch changed from u/vdelecroix/17013 to u/slabbe/17013
  • Commit changed from 398b33c0a6f6a96f5c301e1867edb47158703776 to a563343267f1690d74275ad3835cdebef8512a19

New commits:

a563343trac #17013: doctests improvements + bug fix

comment:3 follow-up: Changed 5 years ago by slabbe

  • Reviewers set to Sébastien Labbé

If Vincent agrees with my changes, then he can set this ticket to positive review.

comment:4 follow-up: Changed 5 years ago by slabbe

Here is one benchmark.

BEFORE::

    sage: w = Word([0,-1,2]*100, alphabet=[0,-1,2])
    sage: type(w)
    <class 'sage.combinat.words.word.FiniteWord_list'>
    sage: %timeit w.is_square_free()
    1000 loops, best of 3: 417 µs per loop

AFTER::

    sage: w = Word([0,1,2]*100, alphabet=[0,1,2])
    sage: type(w)
    <class 'sage.combinat.words.word.FiniteWord_char'>
    sage: %timeit w.is_square_free()
    100000 loops, best of 3: 6.29 µs per loop

We should also add here some benchmark about equality test and also __getitem__.

comment:5 in reply to: ↑ 4 ; follow-up: Changed 5 years ago by vdelecroix

Replying to slabbe:

We should also add here some benchmark about equality test and also __getitem__.

EDIT: the timings were edited because there were not about FiniteWord_str but FiniteWord_list...

Competitions between:

  • Python string
  • FiniteWord_str
  • FiniteWord_char
sage: W1 = Words('abc')
sage: W2 = Words([0,1,2])
sage: w1 = W1(''.join(choice('abc') for _ in range(1000)))
sage: w2 = W2([choice([0,1,2]) for _ in range(1000])
sage: s = str(w1)

then on slices

sage: timeit("for i in range(1000): u = w1[:i]")
25 loops, best of 3: 25.3 ms per loop
sage: timeit("for i in range(1000): u = w2[:i]")
625 loops, best of 3: 237 µs per loop
sage: timeit("for i in range(1000): u = s[:i]")
625 loops, best of 3: 129 µs per loop

and on equality

sage: timeit("w1 == w1")
625 loops, best of 3: 365 ns per loop
sage: timeit("w2 == w2")
625 loops, best of 3: 187 ns per loop
sage: timeit("s == s")
625 loops, best of 3: 86.6 ns per loop

So Python strings are still ahead, but its much less ridiculous.

Vincent

Last edited 5 years ago by vdelecroix (previous) (diff)

comment:6 in reply to: ↑ 3 Changed 5 years ago by vdelecroix

  • Status changed from needs_review to positive_review

Replying to slabbe:

If Vincent agrees with my changes, then he can set this ticket to positive review.

I am! Thanks for the review.

Vincent

comment:7 in reply to: ↑ 5 ; follow-up: Changed 5 years ago by slabbe

Competitions between:

  • Python string
  • FiniteWord_str
  • FiniteWord_char

I believe you do not have FiniteWord_str in the competition...

sage: type(w1)                                     
<class 'sage.combinat.words.word.FiniteWord_list'> 

comment:8 follow-up: Changed 5 years ago by slabbe

Instead of (note that you already have defined i as a ssize_t):

cdef ssize_t i
for i in range(w._length-1, -1, -1):

Maybe the following is better and would allow to use size_t type?

cdef size_t i
for i from w._length > i >= 0:

comment:9 in reply to: ↑ 8 Changed 5 years ago by vdelecroix

Replying to slabbe:

Instead of (note that you already have defined i as a ssize_t):

cdef ssize_t i
for i in range(w._length-1, -1, -1):

Maybe the following is better and would allow to use size_t type?

cdef size_t i
for i from w._length > i >= 0:

Nope, it does note change anything. And be careful, your second loop is infinite (a size_t is always >= 0).

comment:10 in reply to: ↑ 7 Changed 5 years ago by vdelecroix

Replying to slabbe:

Competitions between:

  • Python string
  • FiniteWord_str
  • FiniteWord_char

I believe you do not have FiniteWord_str in the competition...

sage: type(w1)                                     
<class 'sage.combinat.words.word.FiniteWord_list'> 

Oups... corrected. Does not change anything for slicing, but huge difference in equality.

comment:11 Changed 5 years ago by slabbe

Great, so I am ok with this ticket. Positive review as already set!

Sébastien

comment:12 Changed 5 years ago by jdemeyer

  • Reviewers changed from Sébastien Labbé to Sébastien Labbé, Jeroen Demeyer
  • Status changed from positive_review to needs_work

This is absolutely not the right way to handle exceptions:

if PySlice_GetIndicesEx(...) < 0:
    return

The fact that it "works" is because IPython picks up the exception much later than you intended. Note the botched traceback:

sage: Words([0,1,2,3])([0,1,0,2,0,3,1,2,3])[slice("foo")]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: slice indices must be integers or None or have an __index__ method

The right way is to declare your function as except -1 and let Cython do the right thing. See http://docs.cython.org/src/userguide/language_basics.html#error-return-values

comment:13 Changed 5 years ago by jdemeyer

See https://github.com/cython/cython/commit/543c33c4a1fdf105c3173a0950297ce4f7550ece for how such a declaration would look like.

If you feel like it, add that patch to the Cython package and then you can do

from cpython.slice cimport *

instead of manually declaring the PySlice_ functions.

comment:14 Changed 5 years ago by vdelecroix

Hi Jeroen,

As you noticed, the commit in cython is 3 days old... and the reason is because I asked on the cython-users list. Patching is much more work than the declaration (see the last commit):

cdef extern from "Python.h":
    int PySlice_GetIndicesEx(...) except -1

Vincent

Last edited 5 years ago by vdelecroix (previous) (diff)

comment:15 Changed 5 years ago by vdelecroix

  • Branch changed from u/slabbe/17013 to u/vdelecroix/17013
  • Commit changed from a563343267f1690d74275ad3835cdebef8512a19 to c00964da4655d7580277c3d4aba1e0a51b06b78e
  • Status changed from needs_work to needs_review

New commits:

c00964dtrac #17013: fix Python C API declaration

comment:16 Changed 5 years ago by jdemeyer

  • Status changed from needs_review to positive_review

comment:17 Changed 5 years ago by vdelecroix

Thanks!

comment:18 Changed 5 years ago by vbraun

  • Branch changed from u/vdelecroix/17013 to c00964da4655d7580277c3d4aba1e0a51b06b78e
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:19 Changed 5 years ago by SimonKing

  • Commit c00964da4655d7580277c3d4aba1e0a51b06b78e deleted

Note that #15820 seems to address similar applications, but more general (without restriction to alphabets of length 255). In preliminary versions of bounded integer sequences, I experimented with char* as underlying data structure, but found that using GMP long integers with the shift operations provided by GMP was faster.

comment:20 Changed 5 years ago by jdemeyer

You should really compare the speed of the operations here with those from #15820. What I dislike about this ticket here is that it just talks about WordDatatype_char which doesn't mean anything to me: I never noticed that you were implementing sequences of integers in the interval [0,255].

On #15820, I insisted that the approach would be as general as possible. Of course, Simon King (the author) had a particular application in mind, but the code in #15820 never refers to that application, it is very general.

comment:21 follow-up: Changed 5 years ago by vdelecroix

Hello,

The thing is that having a complicated datatype with letters not aligned with computer words also has disadvantages. Namely:

  1. code less readable
  2. might be slower if not carefully done (you can not use memcmp, memcpy from strings.h)

And implementing more involved algorithms (pattern matchings, counting subsequences, etc) might be painful.

Nevertheless, I would be happy to give it a try and do proper benchmarkings. There will be no problem to remove the data structure here if #15820 is just better.

Vincent

comment:22 in reply to: ↑ 21 Changed 5 years ago by SimonKing

Replying to vdelecroix:

The thing is that having a complicated datatype with letters not aligned with computer words also has disadvantages. Namely:

  1. code less readable

True for the underlying boilerplate functions. Not necessarily true for higher level code.

  1. might be slower if not carefully done (you can not use memcmp, memcpy from strings.h)

Why can one not use memcmp and memcpy? Respectively, what's wrong with using mpn_cmp and similar functions?

And implementing more involved algorithms (pattern matchings, counting subsequences, etc) might be painful.

Indeed, detection of subsequences (which is also implemented in #15820) is a bit more complicated than it would be with char*. Detection of subsequences is one of the operations I care about. Admittedly, with my application to right modules over path algebras in mind, I care more about testing whether a sequence starts with a given subsequence than whether it just contains the subsequence.

comment:23 Changed 5 years ago by SimonKing

Of course, bounded integer sequences have one disadvantage for your applications: you can not prescribe an alphabet. Each bounded integer sequences is formed by integers between 0 and a prescribed bound.

Anyway, here are some timings:

sage: from sage.data_structures.bounded_integer_sequences import BoundedIntegerSequence
sage: w_s = Word([9,2,0,1,5,3,10,15,1,14,8,3]*100, alphabet=range(16))
sage: type(w_s)
<class 'sage.combinat.words.word.FiniteWord_char'>
sage: w_l = Word([9,2,0,1,5,3,10,15,1,14,8,3]*100)
sage: type(w_l)
<class 'sage.combinat.words.word.FiniteWord_list'>
sage: w_t = Word((9,2,0,1,5,3,10,15,1,14,8,3)*100)
sage: type(w_t)
<class 'sage.combinat.words.word.FiniteWord_tuple'>
sage: w_b = BoundedIntegerSequence(16, [9,2,0,1,5,3,10,15,1,14,8,3]*100)
sage: w2_s = Word([9,2,0,1,5,3,10,15,1,14,8,3]*99+[9,2,0,1,5,3,10,15,1,14,3,8], alphabet=range(16))
sage: w2_l = Word([9,2,0,1,5,3,10,15,1,14,8,3]*99+[9,2,0,1,5,3,10,15,1,14,3,8])
sage: w2_t = Word((9,2,0,1,5,3,10,15,1,14,8,3)*99+(9,2,0,1,5,3,10,15,1,14,3,8))
sage: w2_b = BoundedIntegerSequence(16, [9,2,0,1,5,3,10,15,1,14,8,3]*99+[9,2,0,1,5,3,10,15,1,14,3,8])
sage: list(w_s)==list(w_l)==list(w_t)==list(w_b)
True
sage: list(w2_s)==list(w2_l)==list(w2_t)==list(w2_b)
True

Comparison:

sage: %timeit w_s==w2_s
100000 loops, best of 3: 2.57 µs per loop
sage: %timeit w_l==w2_l
10000 loops, best of 3: 57.8 µs per loop
sage: %timeit w_t==w2_t
10000 loops, best of 3: 56 µs per loop
sage: %timeit w_b==w2_b
1000000 loops, best of 3: 537 ns per loop

Hash, which is not (yet?) cached for bounded integer sequences:

sage: %timeit hash(w_s)
10000000 loops, best of 3: 137 ns per loop
sage: %timeit hash(w_l)
10000000 loops, best of 3: 138 ns per loop
sage: %timeit hash(w_t)
10000000 loops, best of 3: 137 ns per loop
sage: %timeit hash(w_b)
1000000 loops, best of 3: 246 ns per loop

Concatenation:

sage: list(w_s+w2_s)==list(w_l+w2_l)==list(w_t+w2_t)==list(w_b+w2_b)
True
sage: %timeit w_s+w2_s
1000 loops, best of 3: 336 µs per loop
sage: %timeit w_l+w2_l
10000 loops, best of 3: 31.1 µs per loop
sage: %timeit w_t+w2_t
10000 loops, best of 3: 31.7 µs per loop
sage: %timeit w_b+w2_b
1000000 loops, best of 3: 1.05 µs per loop

Subsequence containment:

sage: sub_s = Word([9,2,0,1,5,3,10,15,1,14,3,8], alphabet=range(16))
sage: sub_l = Word([9,2,0,1,5,3,10,15,1,14,3,8])
sage: sub_t = Word((9,2,0,1,5,3,10,15,1,14,3,8))
sage: sub_b = BoundedIntegerSequence(16, [9,2,0,1,5,3,10,15,1,14,3,8])
sage: W_s = w2_s+w_s
sage: W_l = w2_l+w_l
sage: W_t = w2_t+w_t
sage: W_b = w2_b+w_b
sage: sub_s.is_subword_of(W_s)
True
sage: sub_b in W_b
True
sage: %timeit sub_s.is_subword_of(W_s)
10000 loops, best of 3: 21.3 µs per loop
sage: %timeit sub_l.is_subword_of(W_l)
100000 loops, best of 3: 10.9 µs per loop
sage: %timeit sub_t.is_subword_of(W_t)
100000 loops, best of 3: 10.7 µs per loop
sage: %timeit sub_b in W_b
100000 loops, best of 3: 10.8 µs per loop

So, using a compressed data format (where many letters are squeezed into one machine word) is not bad for the operations above.

comment:24 Changed 5 years ago by SimonKing

I forgot slicing:

sage: list(w_s[30:70]) == list(w_b[30:70])
True
sage: %timeit x=w_s[30:70]
1000000 loops, best of 3: 1.02 µs per loop
sage: %timeit x=w_l[30:70]
10000 loops, best of 3: 20.6 µs per loop
sage: %timeit x=w_t[30:70]
10000 loops, best of 3: 21 µs per loop
sage: %timeit x=w_b[30:70]
1000000 loops, best of 3: 1.14 µs per loop
sage: list(w_s[30:70:2]) == list(w_b[30:70:2])
True
sage: %timeit x=w_s[30:70:2]
100000 loops, best of 3: 2.65 µs per loop
sage: %timeit x=w_l[30:70:2]
100000 loops, best of 3: 13.3 µs per loop
sage: %timeit x=w_t[30:70:2]
100000 loops, best of 3: 14.4 µs per loop
sage: %timeit x=w_b[30:70:2]
1000000 loops, best of 3: 1.63 µs per loop
Note: See TracTickets for help on using tickets.