Opened 8 years ago
Closed 8 years ago
#16935 closed enhancement (fixed)
Faster palindromes function for the Words library
Reported by:  nadialafreniere  Owned by:  

Priority:  major  Milestone:  sage6.4 
Component:  combinatorics  Keywords:  words, finite words, palindromes 
Cc:  slabbe, sstarosta  Merged in:  
Authors:  Nadia Lafrenière  Reviewers:  Sébastien Labbé 
Report Upstream:  N/A  Work issues:  
Branch:  fcdd2f8 (Commits, GitHub, GitLab)  Commit:  fcdd2f830906015ddf3c1623bd7a29dde8e1e58e 
Dependencies:  Stopgaps: 
Description (last modified by )
I've written a faster algorithm to get the set of all the distinct palindromic factors of a given word. There's also a new function called long_lpc (for long method to get the longest palindome centered at a given position) that is added.
You can see here that the algorithm is factor, especially for long words.
Here is the old version :
sage: f=WordMorphism({'a':['a','b'], 'b':['a']}).fixed_point('a')[:1000] sage: %time M=f.palindromes() CPU times: user 1.83 s, sys: 22.8 ms, total: 1.85 s Wall time: 1.83 s sage: t=WordMorphism({'a':['a','b'], 'b':['b','a']}).fixed_point('a')[:1000] sage: %time N=t.palindromes() CPU times: user 1.02 s, sys: 8.4 ms, total: 1.03 s Wall time: 1.02 s sage: tm=WordMorphism({'a':['a','b'], 'b':['b','a']}).fixed_point('a')[:5000] sage: %time N=tm.palindromes() CPU times: user 28.5 s, sys: 117 ms, total: 28.7 s Wall time: 28.6 s
And the new version :
sage: f=WordMorphism({'a':['a','b'], 'b':['a']}).fixed_point('a')[:1000] sage: %time M=f.palindromes() CPU times: user 436 ms, sys: 20 ms, total: 456 ms Wall time: 450 ms sage: t=WordMorphism({'a':['a','b'], 'b':['b','a']}).fixed_point('a')[:1000] sage: %time N=t.palindromes() CPU times: user 362 ms, sys: 19.7 ms, total: 382 ms Wall time: 371 ms sage: tm=WordMorphism({'a':['a','b'], 'b':['b','a']}).fixed_point('a')[:5000] sage: %time N=tm.palindromes() CPU times: user 2.94 s, sys: 45.3 ms, total: 2.99 s Wall time: 2.96 s
The principle of the algorithm is to calculate, for every letter and every space between two letters, the longest palindrome centered in that position. Once a longer palindrome is found, the information leftwise to the center is copied to the right because the proper subpalindromes are present at least twice.
At the same time, there is a list created recording the length of the longest palindromic suffix of each prefix of the word. Since the number of prefixes is bounded by one plus the length of the word, the set of palindromes to be compared is not too big (the comparison is necessary to get the unicity).
There is still a problem with the palindromes() function already in SAGE that has not been fixed with the new algorithm : For f
palindromes (or pseudopalindromes), the word morphism f
cannot apply to words generated with the WordGenerator? class on the default alphabet (0,1). The main problem is that the type of the letters of these words are int
(not words, nor any type that can be changed into words). I did not find any way to fix it, except to change the alphabet when generating these words.
sage: fib=words.FibonacciWord()[:1000] sage: type(fib) <class 'sage.combinat.words.word.FiniteWord_iter_with_caching'> sage: type(fib[5]) <type 'int'> sage: type(words.KolakoskiWord()[2]) <type 'int'> sage: type(words.ThueMorseWord()[2]) <type 'int'>
Change History (39)
comment:1 Changed 8 years ago by
Branch:  → u/nadialafreniere/faster_palindromes_function_for_the_words_library 

comment:2 Changed 8 years ago by
Commit:  → a86cf9334a58fe016e06577e915a5ccf33cce7c7 

comment:3 Changed 8 years ago by
Branch:  u/nadialafreniere/faster_palindromes_function_for_the_words_library → u/nadialafreniere/16935 

comment:4 Changed 8 years ago by
Commit:  a86cf9334a58fe016e06577e915a5ccf33cce7c7 → ca9a46b527309d80a785b8d0090f66b7691e1c9d 

Description:  modified (diff) 
comment:5 Changed 8 years ago by
I did a first quick reading of the code. Here are some comments about it. I still need to actually read the code itself, which I will do in another comment not today.
 There are two failing doctests in
finite_word.py
. Also, some tabulation were introduced in that file. According to the Python code style, the convention is to use 4 spaces for indentation levels.
sage t src/sage/combinat/words/finite_word.py ********************************************************************** File "src/sage/combinat/words/finite_word.py", line 2736, in sage.combinat.words.finite_word.FiniteWord_class.defect Failed example: Word('a').defect(f) Exception raised: Traceback (most recent call last): ... IndexError: list index out of range ********************************************************************** File "src/sage/combinat/words/finite_word.py", line 2740, in sage.combinat.words.finite_word.FiniteWord_class.defect Failed example: Word('aa').defect(f) Exception raised: Traceback (most recent call last): ... IndexError: list index out of range ********************************************************************** 1 item had failures: 2 of 26 in sage.combinat.words.finite_word.FiniteWord_class.defect Error: TAB character found at lines 2463,2488,2509,2554 [1116 tests, 2 failures, 13.93 s]  sage t src/sage/combinat/words/finite_word.py # Tab character found sage t src/sage/combinat/words/finite_word.py # 2 doctests failed 
 The name of the method
long_lpc
is not communicative. I would suggest to uselength_longest_palindrome
instead.
 There are missing commas in input docstrings:
  ``j``  integer a position that is the symmetry axis of the palindrome. +  ``j``  integer, position that is the symmetry axis of the palindrome.   ``m``  integer (default: 0) the minimal length of the palindrome, if known. +  ``m``  integer (default: 0), minimal length of the palindrome, if known.
 Missing spaces in the argument of a function:
 sage: Word('01010').long_lpc(j=3,f='0>1,1>0') + sage: Word('01010').long_lpc(j=3, f='0>1,1>0')
 White spaces are wrong here:
#Initialize m if set to 0 if m==0: m=(j%2)
and many other places.
 Why? No space was OK.
 INPUT: + INPUT :
 Why do you erase and modify doctests in the
palindromes
method?
 There is *a lot* of comments in the code. Are all of them really necessary? I have never seen Sage code with that many comments. If the algorithm is really complicated, the code can be simplified by moving parts into another method. Also, the pseudo code of it can be put in the doctrings. But, I would leave the code with less comments. See also : http://legacy.python.org/dev/peps/pep0008/#comments
comment:6 Changed 8 years ago by
Status:  new → needs_review 

comment:7 Changed 8 years ago by
Status:  needs_review → needs_work 

comment:8 Changed 8 years ago by
Commit:  ca9a46b527309d80a785b8d0090f66b7691e1c9d → def64a1a59757069be2840256a41dbc5951d5e07 

Branch pushed to git repo; I updated commit sha1. New commits:
def64a1  included suggestions from comment 5 (slabbe)

comment:9 Changed 8 years ago by
Status:  needs_work → needs_review 

If you make any changes to the ticket and you want me to review it again. Then change the status of the ticket to "needs_review". This sends me an email and I will come and look. Otherwise, I think that no update was done and I wait. I just learned that you made changes six week ago.
comment:10 Changed 8 years ago by
It is difficult for me to take a look at the code itself when the syntax convention are not followed.
 One empty line is nice to separate block of code. Two consecutive empty lines should be avoided.
 Never put an empty line after a
for
statement.
 Try to keep comment on one line. Rephrase if necessary.
 One line should be less than 80 characters.
 If a comment takes more than one line, then avoid any indentation for the second line.
 A comment should always start at the actual indentation level.
 Why do you write
k = Integer(0)
instead of justk=0
?
long_lpc
should not appear anymore (code + doctest).
 Slices of words should not be changed into tuple to make the equality test. The code in that class is mathematical. It should not play with the internal representation. Suppose your word is already represented by a tuple. Why change it into a tuple?
 Remove useless comment like
# Return the length of the palindrome return p
 Add comments where it is necessary:
except IndexError: pass
Why such an error is not an error? It seems strange to me to choose such an implementation...
comment:11 Changed 8 years ago by
Status:  needs_review → needs_work 

comment:12 Changed 8 years ago by
 The code of
length_longest_palindrome
method contains the same line four times:
while j//2  i >= 0 and tuple(self[j//2i : j//2i+1]) == tuple(f(self[j//2+i : j//2+i+1])): i = i + 1
Something should be done to factorize the code a little bit. For instance if j%2 == 1
, then store the result of j//2
into some variable K
and else store the variable j/2
in K
. Then, use K
and factorize the code.
 Anyhow, in the line of code above, avoid the creation of a tuple. Also, I suggest to avoid the slice and simply use a for loop for the indexes you want to compare.
I am thinking of a code that would look like that (it is most probably wrong as I did not think about the limit cases and even vs odd palindrome length, etc.)
possible = max(j, self.length()j) i = m / 2 while i < possible and self[ji] == self[j+i]: i += 1 return i
But you get what I mean: short and clean code do not need any comment.
comment:13 Changed 8 years ago by
 I do not get why there is an odd and even case. What is the definition of
the length of the longest palindrome centered at a given position j
?
I think that
j
should be a rational number0<= j < w.length()
such that2j
is an integer. Likej=2.5
could be valid. This idea comes from the fact that a word w is symmetric ifw[i]=w[2ai]
for all indices i and for some value a where 2a is an integer.
comment:14 Changed 8 years ago by
 In the code of the
palindromes
method, you compute two tables:LPS
andLPC
. If I understand correctly, the computation ofLPS
depends on the computation ofLPC
but not conversely. Therefore, why are they computed in parallel?
I would rather suggest if it is possible to write a method that computes and return just
LPC
. Then, another method that computes and return justLPS
if it is possible to compute it fromLPC
. And finally the methodpalindromes
that returns all of the factors from theLPS
method.
Also, the palindromes method should have a
algorithm
argument allowing to choose between the old and new way of doing things. This facilitates the comparison of efficiency. Otherwise, it is more difficult to see how faster is your new algorithm.
comment:15 Changed 8 years ago by
 Be more brief. For instance replace the 7 lines
# The list LPS records the lengths of the longest # `f`palindromic suffix for each prefix of self. LPS = [] # The list LPC records the lengths of the maximal # `f`palindromes centered at each position j in self. LPC = []
by the 2 lines:
LPS = [] #lengths of longest palindrome suffix of prefixes LPC = [] #lengths of maximal palindromes centered at a position
and so on...
comment:16 Changed 8 years ago by
Commit:  def64a1a59757069be2840256a41dbc5951d5e07 → 05cbe2844498c4c1f3d325e4ac27e637188b8e71 

Branch pushed to git repo; I updated commit sha1. New commits:
05cbe28  Separation of lps_lengths and lengths_maximal_palindromes functions, change of indexes, correction of errors, upgrade with version 6.4.1

comment:17 Changed 8 years ago by
I think I've given an answer to almost all the comments in the list above.
However, for number 11., I haven't found a better way to do it. The idea is to avoid to count outside of the string. For example, I don't want to test if w[1]==w[n] for a word w, since I don't want the palindromes in the circular word. Any suggestion would be very welcome!
For number 15., the reason I pretend the new algorithm is faster is that the list of the lengths of longest palindromic suffixes is computed in linear time for any type of words. With the actual version of the SAGE library, for words having large palindromes, it can be much more than that. For example, let r be is a prefix of w with lps(r) long enough. Then, if the length of lps(ra) is smaller than lps(r), where a is the letter after r in w, it will call p.is_palindrome() on each suffix p (starting with the greatest) of lps(r)a. Just to compute this entry of the list, it could take a time up to (lps(r))^{2. }
comment:18 Changed 8 years ago by
Status:  needs_work → needs_review 

comment:19 Changed 8 years ago by
The ticket does not apply on a recent Sage version.
Also, the branch contains unrelated modifs...
comment:20 followups: 21 22 Changed 8 years ago by
...you can keep the same ticket instead of creating #18060. Just update the branch of this ticket.
comment:21 followup: 24 Changed 8 years ago by
comment:22 followup: 25 Changed 8 years ago by
Branch:  u/nadialafreniere/16935 → u/nadialafreniere/palindromes__list_of_lps_lengths_and_list_of_maximal_palindromes_length 

Commit:  05cbe2844498c4c1f3d325e4ac27e637188b8e71 
comment:23 Changed 8 years ago by
Commit:  → 160ccb0ccfb7580fe5d497b3b049fbf2b3c25523 

Branch pushed to git repo; I updated commit sha1. New commits:
160ccb0  Initial commit

comment:24 Changed 8 years ago by
Replying to vdelecroix:
Replying to slabbe:
...you can keep the same ticket instead of creating #18060. Just update the branch of this ticket.
could we just close #18060?
I don't know how to do it, but it would be a good idea. I added my branch to this ticket.
comment:25 Changed 8 years ago by
I really need to learn how to use trac in a proper way!
It is simpler to learn if you are beside us, but your learning will be more persistent if you learn it on your side with our help from internet.
Just don't stop to do what is natural for you to do. If you think you need to create a new ticket, do it:) We will tell you if there is something wrong and that is how you will learn.
comment:26 Changed 8 years ago by
Cc:  sstarosta added 

comment:27 followup: 29 Changed 8 years ago by
Replying to nadialafreniere:
You can see here that the algorithm is factor, especially for long words.
Indeed, I can see that it is currently twice as fast:
sage: w = words.ThueMorseWord()[:1000] sage: %time P = w.palindromic_lacunas_study()[2] CPU times: user 903 ms, sys: 11.4 ms, total: 914 ms Wall time: 895 ms sage: %time Q = w.palindromes() CPU times: user 516 ms, sys: 34.9 ms, total: 551 ms Wall time: 472 ms sage: P == Q True sage: w = words.ThueMorseWord()[:10000] sage: %time P = w.palindromic_lacunas_study()[2] CPU times: user 40.5 s, sys: 200 ms, total: 40.7 s Wall time: 40.6 s sage: %time Q = w.palindromes() CPU times: user 26 s, sys: 256 ms, total: 26.3 s Wall time: 26.2 s sage: P == Q True
I did not find any way to fix it, except to change the alphabet when generating these words.
I am not sure I understand the problem. But, if w
is a word of length 1 in sage, then w[0]
will give the the letter it contains. Using this, you can compare the equality of letters after the application of the function f
.
I am very happy with the way the code is now separated. Most of the methods are written clearly. I am still not convinced by the optimality and clarity of the length_maximal_palindrome
method. I started a new branch where I changed the way the indices are worked out. My branch is here. Tell me what you think.
I added a raise Value Error when the parity of m is the same as the parity of 2j which is an impossible case. I was surprise to realize that it breaks the other methods:
sage: Word('01101001').lengths_maximal_palindromes() ... ValueError: j(m+1)/2(=5/2) must be an integer, i.e., 2*j(=6) and m(=0) can't have the same parity
This means that the impossible case is in fact used. Do you understand why?
Finally, I think the palindrome method can now be written as a one liner using a list comprehension...
comment:28 Changed 8 years ago by
Commit:  160ccb0ccfb7580fe5d497b3b049fbf2b3c25523 → d535b5f2f292d83d1dc952023dffe5506dd8f7f9 

comment:29 Changed 8 years ago by
Replying to slabbe:
I prefer the way you present the length_maximal_palindrome. I think it is clearer. I integrated it to the code.
I added a raise Value Error when the parity of m is the same as the parity of 2j which is an impossible case. I was surprise to realize that it breaks the other methods: [...] This means that the impossible case is in fact used. Do you understand why?
I don't know why, but I used a very complicated way to assing a value for m
in the second case of lengths_maximal_palindromes
. I changed it so it is easier to see what it does, and it does not use the impossible case.
Finally, I think the palindrome method can now be written as a one liner using a list comprehension...
I changed it. Tell me what you think.
comment:30 Changed 8 years ago by
Great. I think we are almost done now. I just looked at the code. Here are some more fixes to do. Tomorrow (or the day after) after you do the below changes, I will review everything again (doctests, documentation building fine, etc).
 The call to Integer is not necessary in the first method (sorry my bad):
 return Integer(jj  2*i  1) + return jj  2*i  1
 (bis) There should be an empty line after OUTPUT. The reason is that without the empty line, the documentation will not build correctly. You may try to check how it looks on a method of an object by doing this (I do not known if this still works) :
sage: m = matrix(2, range(4)) sage: browse_sage_doc(m.inverse)
 spaces:
 if LPC[k]+kj != LPC[2*k  j]: + if LPC[k] + k  j != LPC[2*k  j]:
 remove empty lines:
  k = 0  + k = 0
 The creation of the list is not necessary:
 return set([self[iLPS[i] : i] for i in range(len(self)+1)]) + return set(self[iLPS[i] : i] for i in range(len(self)+1))
comment:31 Changed 8 years ago by
Commit:  d535b5f2f292d83d1dc952023dffe5506dd8f7f9 → f4cf2f7fa9af3bea0b71583f946851cf4c041582 

Branch pushed to git repo; I updated commit sha1. New commits:
f4cf2f7  Modified spacing, use of Integer() and palindromes()

comment:32 Changed 8 years ago by
Excellent. Since we are dealing with possible float in entry (like 2.5) and since w[i]
needs i
to be an integer, I think that we have to keep the following two lines (do the doctests pass for you without them?) :
jj = Integer(jj)
i = Integer(i)
I wanted to remove only the last call to Integer because that one was not necessary.
comment:33 followup: 36 Changed 8 years ago by
The doctests passed without them, but I can put them back just in case it is necessary.
comment:34 Changed 8 years ago by
Commit:  f4cf2f7fa9af3bea0b71583f946851cf4c041582 → d0ee49ef26ff6277c5ed847e3b03a4c95dd857d6 

Branch pushed to git repo; I updated commit sha1. New commits:
d0ee49e  Added Integers() and changed an incorrect error message

comment:35 Changed 8 years ago by
The user does not know what is jj
: this is an internal variable. It is better to use 2*j
for the message error given to the user.
Last small fix for the doc of lps_lengths
(remove one space in the second line). Lines must align properly for the doc to render correctly (see the file SAGE_ROOT/src/doc/output/html/en/reference/combinat/sage/combinat/words/finite_word.html
after you run sage b && sage docbuild reference html
)
  ``f``  involution (default: None) on the alphabet of self. It must  be callable on letters as well as words (e.g. WordMorphism). +  ``f``  involution (default: None) on the alphabet of self. It must + be callable on letters as well as words (e.g. WordMorphism).
comment:36 Changed 8 years ago by
Replying to nadialafreniere:
The doctests passed without them, but I can put them back just in case it is necessary.
That is strange because they did not pass for me... Anyway, now, it works.
comment:37 Changed 8 years ago by
Commit:  d0ee49ef26ff6277c5ed847e3b03a4c95dd857d6 → fcdd2f830906015ddf3c1623bd7a29dde8e1e58e 

Branch pushed to git repo; I updated commit sha1. New commits:
fcdd2f8  modified the error message and spacing in documentation

comment:38 Changed 8 years ago by
Reviewers:  → Sébastien Labbé 

Status:  needs_review → positive_review 
Code looks good. Doctests pass. Documentation builds fine. Félicitations pour ton premier ticket Nadia! Vivement d'autres!
comment:39 Changed 8 years ago by
Branch:  u/nadialafreniere/palindromes__list_of_lps_lengths_and_list_of_maximal_palindromes_length → fcdd2f830906015ddf3c1623bd7a29dde8e1e58e 

Resolution:  → fixed 
Status:  positive_review → closed 
The branch seems empty (it says "already merged").