Opened 8 years ago

Closed 8 years ago

#16935 closed enhancement (fixed)

Faster palindromes function for the Words library

Reported by: nadialafreniere Owned by:
Priority: major Milestone: sage-6.4
Component: combinatorics Keywords: words, finite words, palindromes
Cc: slabbe, sstarosta Merged in:
Authors: Nadia Lafrenière Reviewers: Sébastien Labbé
Report Upstream: N/A Work issues:
Branch: fcdd2f8 (Commits, GitHub, GitLab) Commit: fcdd2f830906015ddf3c1623bd7a29dde8e1e58e
Dependencies: Stopgaps:

GitHub link to the corresponding issue

Description (last modified by nadialafreniere)

I've written a faster algorithm to get the set of all the distinct palindromic factors of a given word. There's also a new function called long_lpc (for long method to get the longest palindome centered at a given position) that is added.

You can see here that the algorithm is factor, especially for long words.

Here is the old version :

sage: f=WordMorphism({'a':['a','b'], 'b':['a']}).fixed_point('a')[:1000]
sage: %time M=f.palindromes()
CPU times: user 1.83 s, sys: 22.8 ms, total: 1.85 s
Wall time: 1.83 s

sage: t=WordMorphism({'a':['a','b'], 'b':['b','a']}).fixed_point('a')[:1000]
sage: %time N=t.palindromes()
CPU times: user 1.02 s, sys: 8.4 ms, total: 1.03 s
Wall time: 1.02 s

sage: tm=WordMorphism({'a':['a','b'], 'b':['b','a']}).fixed_point('a')[:5000]
sage: %time N=tm.palindromes()
CPU times: user 28.5 s, sys: 117 ms, total: 28.7 s
Wall time: 28.6 s

And the new version :

sage: f=WordMorphism({'a':['a','b'], 'b':['a']}).fixed_point('a')[:1000]
sage: %time M=f.palindromes()
CPU times: user 436 ms, sys: 20 ms, total: 456 ms
Wall time: 450 ms

sage: t=WordMorphism({'a':['a','b'], 'b':['b','a']}).fixed_point('a')[:1000]
sage: %time N=t.palindromes()
CPU times: user 362 ms, sys: 19.7 ms, total: 382 ms
Wall time: 371 ms

sage: tm=WordMorphism({'a':['a','b'], 'b':['b','a']}).fixed_point('a')[:5000]                  
sage: %time N=tm.palindromes()
CPU times: user 2.94 s, sys: 45.3 ms, total: 2.99 s
Wall time: 2.96 s

The principle of the algorithm is to calculate, for every letter and every space between two letters, the longest palindrome centered in that position. Once a longer palindrome is found, the information leftwise to the center is copied to the right because the proper subpalindromes are present at least twice.

At the same time, there is a list created recording the length of the longest palindromic suffix of each prefix of the word. Since the number of prefixes is bounded by one plus the length of the word, the set of palindromes to be compared is not too big (the comparison is necessary to get the unicity).

There is still a problem with the palindromes() function already in SAGE that has not been fixed with the new algorithm : For f-palindromes (or pseudo-palindromes), the word morphism f cannot apply to words generated with the WordGenerator? class on the default alphabet (0,1). The main problem is that the type of the letters of these words are int (not words, nor any type that can be changed into words). I did not find any way to fix it, except to change the alphabet when generating these words.

sage: fib=words.FibonacciWord()[:1000]
sage: type(fib)
<class 'sage.combinat.words.word.FiniteWord_iter_with_caching'>

sage: type(fib[5])
<type 'int'>

sage: type(words.KolakoskiWord()[2])
<type 'int'>

sage: type(words.ThueMorseWord()[2])
<type 'int'>

Change History (39)

comment:1 Changed 8 years ago by nadialafreniere

Branch: u/nadialafreniere/faster_palindromes_function_for_the_words_library

comment:2 Changed 8 years ago by slabbe

Commit: a86cf9334a58fe016e06577e915a5ccf33cce7c7

The branch seems empty (it says "already merged").

comment:3 Changed 8 years ago by nadialafreniere

Branch: u/nadialafreniere/faster_palindromes_function_for_the_words_libraryu/nadialafreniere/16935

comment:4 Changed 8 years ago by nadialafreniere

Commit: a86cf9334a58fe016e06577e915a5ccf33cce7c7ca9a46b527309d80a785b8d0090f66b7691e1c9d
Description: modified (diff)

comment:5 Changed 8 years ago by slabbe

I did a first quick reading of the code. Here are some comments about it. I still need to actually read the code itself, which I will do in another comment not today.

  1. There are two failing doctests in finite_word.py. Also, some tabulation were introduced in that file. According to the Python code style, the convention is to use 4 spaces for indentation levels.
sage -t src/sage/combinat/words/finite_word.py
**********************************************************************
File "src/sage/combinat/words/finite_word.py", line 2736, in sage.combinat.words.finite_word.FiniteWord_class.defect
Failed example:
    Word('a').defect(f)
Exception raised:
    Traceback (most recent call last):
    ...
    IndexError: list index out of range
**********************************************************************
File "src/sage/combinat/words/finite_word.py", line 2740, in sage.combinat.words.finite_word.FiniteWord_class.defect
Failed example:
    Word('aa').defect(f)
Exception raised:
    Traceback (most recent call last):
    ...
    IndexError: list index out of range
**********************************************************************
1 item had failures:
   2 of  26 in sage.combinat.words.finite_word.FiniteWord_class.defect
    Error: TAB character found at lines 2463,2488,2509,2554
    [1116 tests, 2 failures, 13.93 s]
----------------------------------------------------------------------
sage -t src/sage/combinat/words/finite_word.py  # Tab character found
sage -t src/sage/combinat/words/finite_word.py  # 2 doctests failed
----------------------------------------------------------------------
  1. The name of the method long_lpc is not communicative. I would suggest to use length_longest_palindrome instead.
  1. There are missing commas in input docstrings:
-        - ``j`` -- integer a position that is the symmetry axis of the palindrome.  
+        - ``j`` -- integer, position that is the symmetry axis of the palindrome.  
-        - ``m`` -- integer (default: 0) the minimal length of the palindrome, if known.
+        - ``m`` -- integer (default: 0), minimal length of the palindrome, if known.
  1. Missing spaces in the argument of a function:
-            sage: Word('01010').long_lpc(j=3,f='0->1,1->0')
+            sage: Word('01010').long_lpc(j=3, f='0->1,1->0')
  1. White spaces are wrong here:
#Initialize m if set to 0
if m==0:                 
    m=-(j%2)             

and many other places.

  1. Why? No space was OK.
-        INPUT:  
+        INPUT : 
  1. Why do you erase and modify doctests in the palindromes method?
  1. There is *a lot* of comments in the code. Are all of them really necessary? I have never seen Sage code with that many comments. If the algorithm is really complicated, the code can be simplified by moving parts into another method. Also, the pseudo code of it can be put in the doctrings. But, I would leave the code with less comments. See also : http://legacy.python.org/dev/peps/pep-0008/#comments

comment:6 Changed 8 years ago by slabbe

Status: newneeds_review

comment:7 Changed 8 years ago by slabbe

Status: needs_reviewneeds_work

comment:8 Changed 8 years ago by git

Commit: ca9a46b527309d80a785b8d0090f66b7691e1c9ddef64a1a59757069be2840256a41dbc5951d5e07

Branch pushed to git repo; I updated commit sha1. New commits:

def64a1included suggestions from comment 5 (slabbe)

comment:9 Changed 8 years ago by slabbe

Status: needs_workneeds_review

If you make any changes to the ticket and you want me to review it again. Then change the status of the ticket to "needs_review". This sends me an email and I will come and look. Otherwise, I think that no update was done and I wait. I just learned that you made changes six week ago.

Last edited 8 years ago by slabbe (previous) (diff)

comment:10 Changed 8 years ago by slabbe

It is difficult for me to take a look at the code itself when the syntax convention are not followed.

  1. One empty line is nice to separate block of code. Two consecutive empty lines should be avoided.
  1. Never put an empty line after a for statement.
  1. Try to keep comment on one line. Rephrase if necessary.
  1. One line should be less than 80 characters.
  1. If a comment takes more than one line, then avoid any indentation for the second line.
  1. A comment should always start at the actual indentation level.
  1. Why do you write k = Integer(0) instead of just k=0?
  1. long_lpc should not appear anymore (code + doctest).
  1. Slices of words should not be changed into tuple to make the equality test. The code in that class is mathematical. It should not play with the internal representation. Suppose your word is already represented by a tuple. Why change it into a tuple?
  1. Remove useless comment like
    # Return the length of the palindrome
    return p
    
  1. Add comments where it is necessary:
except IndexError:
    pass          

Why such an error is not an error? It seems strange to me to choose such an implementation...

comment:11 Changed 8 years ago by slabbe

Status: needs_reviewneeds_work

comment:12 Changed 8 years ago by slabbe

  1. The code of length_longest_palindrome method contains the same line four times:
while j//2 - i >= 0 and tuple(self[j//2-i : j//2-i+1]) == tuple(f(self[j//2+i : j//2+i+1])):
    i = i + 1

Something should be done to factorize the code a little bit. For instance if j%2 == 1, then store the result of j//2 into some variable K and else store the variable j/2 in K. Then, use K and factorize the code.

  1. Anyhow, in the line of code above, avoid the creation of a tuple. Also, I suggest to avoid the slice and simply use a for loop for the indexes you want to compare.

I am thinking of a code that would look like that (it is most probably wrong as I did not think about the limit cases and even vs odd palindrome length, etc.)

possible = max(j, self.length()-j)
i = m / 2
while i < possible and self[j-i] == self[j+i]:
    i += 1
return i

But you get what I mean: short and clean code do not need any comment.

Last edited 8 years ago by slabbe (previous) (diff)

comment:13 Changed 8 years ago by slabbe

  1. I do not get why there is an odd and even case. What is the definition of the length of the longest palindrome centered at a given position j?

I think that j should be a rational number 0<= j < w.length() such that 2j is an integer. Like j=2.5 could be valid. This idea comes from the fact that a word w is symmetric if w[i]=w[2a-i] for all indices i and for some value a where 2a is an integer.

comment:14 Changed 8 years ago by slabbe

  1. In the code of the palindromes method, you compute two tables: LPS and LPC. If I understand correctly, the computation of LPS depends on the computation of LPC but not conversely. Therefore, why are they computed in parallel?

I would rather suggest if it is possible to write a method that computes and return just LPC. Then, another method that computes and return just LPS if it is possible to compute it from LPC. And finally the method palindromes that returns all of the factors from the LPS method.

Also, the palindromes method should have a algorithm argument allowing to choose between the old and new way of doing things. This facilitates the comparison of efficiency. Otherwise, it is more difficult to see how faster is your new algorithm.

comment:15 Changed 8 years ago by slabbe

  1. Be more brief. For instance replace the 7 lines
# The list LPS records the lengths of the longest
# `f`-palindromic suffix for each prefix of self.
LPS = [] 
 
# The list LPC records the lengths of the maximal
# `f`-palindromes centered at each position j in self.
LPC = []     

by the 2 lines:

LPS = [] #lengths of longest palindrome suffix of prefixes
LPC = [] #lengths of maximal palindromes centered at a position

and so on...

comment:16 Changed 8 years ago by git

Commit: def64a1a59757069be2840256a41dbc5951d5e0705cbe2844498c4c1f3d325e4ac27e637188b8e71

Branch pushed to git repo; I updated commit sha1. New commits:

05cbe28Separation of lps_lengths and lengths_maximal_palindromes functions, change of indexes, correction of errors, upgrade with version 6.4.1

comment:17 Changed 8 years ago by nadialafreniere

I think I've given an answer to almost all the comments in the list above.

However, for number 11., I haven't found a better way to do it. The idea is to avoid to count outside of the string. For example, I don't want to test if w[-1]==w[n] for a word w, since I don't want the palindromes in the circular word. Any suggestion would be very welcome!

For number 15., the reason I pretend the new algorithm is faster is that the list of the lengths of longest palindromic suffixes is computed in linear time for any type of words. With the actual version of the SAGE library, for words having large palindromes, it can be much more than that. For example, let r be is a prefix of w with lps(r) long enough. Then, if the length of lps(ra) is smaller than |lps(r)|, where a is the letter after r in w, it will call p.is_palindrome() on each suffix p (starting with the greatest) of lps(r)a. Just to compute this entry of the list, it could take a time up to (|lps(r)|)2.

Last edited 8 years ago by nadialafreniere (previous) (diff)

comment:18 Changed 8 years ago by nadialafreniere

Status: needs_workneeds_review

comment:19 Changed 8 years ago by slabbe

The ticket does not apply on a recent Sage version.

Also, the branch contains unrelated modifs...

comment:20 Changed 8 years ago by slabbe

...you can keep the same ticket instead of creating #18060. Just update the branch of this ticket.

comment:21 in reply to:  20 ; Changed 8 years ago by vdelecroix

Replying to slabbe:

...you can keep the same ticket instead of creating #18060. Just update the branch of this ticket.

could we just close #18060?

comment:22 in reply to:  20 ; Changed 8 years ago by nadialafreniere

Branch: u/nadialafreniere/16935u/nadialafreniere/palindromes__list_of_lps_lengths_and_list_of_maximal_palindromes_length
Commit: 05cbe2844498c4c1f3d325e4ac27e637188b8e71

Replying to slabbe:

...you can keep the same ticket instead of creating #18060. Just update the branch of this ticket.

That was my first idea, but I probably understood you wrong. I thought you told me to avoid it... I really need to learn how to use trac in a proper way!

comment:23 Changed 8 years ago by git

Commit: 160ccb0ccfb7580fe5d497b3b049fbf2b3c25523

Branch pushed to git repo; I updated commit sha1. New commits:

160ccb0Initial commit

comment:24 in reply to:  21 Changed 8 years ago by nadialafreniere

Replying to vdelecroix:

Replying to slabbe:

...you can keep the same ticket instead of creating #18060. Just update the branch of this ticket.

could we just close #18060?

I don't know how to do it, but it would be a good idea. I added my branch to this ticket.

comment:25 in reply to:  22 Changed 8 years ago by slabbe

I really need to learn how to use trac in a proper way!

It is simpler to learn if you are beside us, but your learning will be more persistent if you learn it on your side with our help from internet.

Just don't stop to do what is natural for you to do. If you think you need to create a new ticket, do it:) We will tell you if there is something wrong and that is how you will learn.

comment:26 Changed 8 years ago by sstarosta

Cc: sstarosta added

comment:27 in reply to:  description ; Changed 8 years ago by slabbe

Replying to nadialafreniere:

You can see here that the algorithm is factor, especially for long words.

Indeed, I can see that it is currently twice as fast:

sage: w = words.ThueMorseWord()[:1000]
sage: %time P = w.palindromic_lacunas_study()[2]
CPU times: user 903 ms, sys: 11.4 ms, total: 914 ms
Wall time: 895 ms
sage: %time Q = w.palindromes()
CPU times: user 516 ms, sys: 34.9 ms, total: 551 ms
Wall time: 472 ms
sage: P == Q
True

sage: w = words.ThueMorseWord()[:10000]
sage: %time P = w.palindromic_lacunas_study()[2]
CPU times: user 40.5 s, sys: 200 ms, total: 40.7 s
Wall time: 40.6 s
sage: %time Q = w.palindromes()
CPU times: user 26 s, sys: 256 ms, total: 26.3 s
Wall time: 26.2 s
sage: P == Q
True

I did not find any way to fix it, except to change the alphabet when generating these words.

I am not sure I understand the problem. But, if w is a word of length 1 in sage, then w[0] will give the the letter it contains. Using this, you can compare the equality of letters after the application of the function f.

I am very happy with the way the code is now separated. Most of the methods are written clearly. I am still not convinced by the optimality and clarity of the length_maximal_palindrome method. I started a new branch where I changed the way the indices are worked out. My branch is here. Tell me what you think.

I added a raise Value Error when the parity of m is the same as the parity of 2j which is an impossible case. I was surprise to realize that it breaks the other methods:

sage: Word('01101001').lengths_maximal_palindromes() 
...
ValueError: j-(m+1)/2(=5/2) must be an integer, i.e., 2*j(=6) and m(=0) can't have the same parity

This means that the impossible case is in fact used. Do you understand why?

Finally, I think the palindrome method can now be written as a one liner using a list comprehension...

comment:28 Changed 8 years ago by git

Commit: 160ccb0ccfb7580fe5d497b3b049fbf2b3c25523d535b5f2f292d83d1dc952023dffe5506dd8f7f9

Branch pushed to git repo; I updated commit sha1. New commits:

67ac06fMerge branch 'u/nadialafreniere/palindromes...` into t/16935
ef438b5Improving length_maximal_palindrome method
d535b5fModified of lengths_maximal_palindromes to answer previous question, shortened palindromes function

comment:29 in reply to:  27 Changed 8 years ago by nadialafreniere

Replying to slabbe:

I prefer the way you present the length_maximal_palindrome. I think it is clearer. I integrated it to the code.

I added a raise Value Error when the parity of m is the same as the parity of 2j which is an impossible case. I was surprise to realize that it breaks the other methods: [...] This means that the impossible case is in fact used. Do you understand why?

I don't know why, but I used a very complicated way to assing a value for m in the second case of lengths_maximal_palindromes. I changed it so it is easier to see what it does, and it does not use the impossible case.

Finally, I think the palindrome method can now be written as a one liner using a list comprehension...

I changed it. Tell me what you think.

comment:30 Changed 8 years ago by slabbe

Great. I think we are almost done now. I just looked at the code. Here are some more fixes to do. Tomorrow (or the day after) after you do the below changes, I will review everything again (doctests, documentation building fine, etc).

  1. The call to Integer is not necessary in the first method (sorry my bad):
    -   return Integer(jj - 2*i - 1)
    +  return jj - 2*i - 1
    
  1. (bis) There should be an empty line after OUTPUT. The reason is that without the empty line, the documentation will not build correctly. You may try to check how it looks on a method of an object by doing this (I do not known if this still works) :
sage: m = matrix(2, range(4))
sage: browse_sage_doc(m.inverse)
  1. spaces:
-                if LPC[k]+k-j != LPC[2*k - j]:
+                if LPC[k] + k - j != LPC[2*k - j]:
  1. remove empty lines:
    -
    -        k = 0
    -
    +        k = 0
    
  1. The creation of the list is not necessary:
-        return set([self[i-LPS[i] : i] for i in range(len(self)+1)])
+        return set(self[i-LPS[i] : i] for i in range(len(self)+1))

comment:31 Changed 8 years ago by git

Commit: d535b5f2f292d83d1dc952023dffe5506dd8f7f9f4cf2f7fa9af3bea0b71583f946851cf4c041582

Branch pushed to git repo; I updated commit sha1. New commits:

f4cf2f7Modified spacing, use of Integer() and palindromes()

comment:32 Changed 8 years ago by slabbe

Excellent. Since we are dealing with possible float in entry (like 2.5) and since w[i] needs i to be an integer, I think that we have to keep the following two lines (do the doctests pass for you without them?) :

 jj = Integer(jj)
 i = Integer(i)

I wanted to remove only the last call to Integer because that one was not necessary.

comment:33 Changed 8 years ago by nadialafreniere

The doctests passed without them, but I can put them back just in case it is necessary.

comment:34 Changed 8 years ago by git

Commit: f4cf2f7fa9af3bea0b71583f946851cf4c041582d0ee49ef26ff6277c5ed847e3b03a4c95dd857d6

Branch pushed to git repo; I updated commit sha1. New commits:

d0ee49eAdded Integers() and changed an incorrect error message

comment:35 Changed 8 years ago by slabbe

The user does not know what is jj : this is an internal variable. It is better to use 2*j for the message error given to the user.

Last small fix for the doc of lps_lengths (remove one space in the second line). Lines must align properly for the doc to render correctly (see the file SAGE_ROOT/src/doc/output/html/en/reference/combinat/sage/combinat/words/finite_word.html after you run sage -b && sage -docbuild reference html)

-        - ``f`` - involution (default: None) on the alphabet of self. It must
-           be callable on letters as well as words (e.g. WordMorphism).
+        - ``f`` - involution (default: None) on the alphabet of self. It must
+          be callable on letters as well as words (e.g. WordMorphism).
Last edited 8 years ago by slabbe (previous) (diff)

comment:36 in reply to:  33 Changed 8 years ago by slabbe

Replying to nadialafreniere:

The doctests passed without them, but I can put them back just in case it is necessary.

That is strange because they did not pass for me... Anyway, now, it works.

comment:37 Changed 8 years ago by git

Commit: d0ee49ef26ff6277c5ed847e3b03a4c95dd857d6fcdd2f830906015ddf3c1623bd7a29dde8e1e58e

Branch pushed to git repo; I updated commit sha1. New commits:

fcdd2f8modified the error message and spacing in documentation

comment:38 Changed 8 years ago by slabbe

Reviewers: Sébastien Labbé
Status: needs_reviewpositive_review

Code looks good. Doctests pass. Documentation builds fine. Félicitations pour ton premier ticket Nadia! Vivement d'autres!

Last edited 8 years ago by slabbe (previous) (diff)

comment:39 Changed 8 years ago by vbraun

Branch: u/nadialafreniere/palindromes__list_of_lps_lengths_and_list_of_maximal_palindromes_lengthfcdd2f830906015ddf3c1623bd7a29dde8e1e58e
Resolution: fixed
Status: positive_reviewclosed
Note: See TracTickets for help on using tickets.