Opened 21 months ago
Closed 4 months ago
#31451 closed enhancement (fixed)
Faster version of longest_increasing_subsequences
Reported by:  ghpetermoraw  Owned by:  

Priority:  major  Milestone:  sage9.7 
Component:  combinatorics  Keywords:  permutation, subsequences 
Cc:  Nadia Lafrenière, Emile Nadeau, Travis Scrimshaw  Merged in:  
Authors:  Peter Morawitz, Finn Hulse, Nadia Lafrenière  Reviewers:  David Coudert, Vincent Delecroix 
Report Upstream:  N/A  Work issues:  
Branch:  5969eda (Commits, GitHub, GitLab)  Commit:  5969edaa2e395e683feaba38db8ea94c6290e996 
Dependencies:  Stopgaps: 
Description (last modified by )
The algorithm for the longest_increasing_subsequences function for permutations is currently very inefficient. I worked on a new version that vastly improves the run time of the function. New version:
sage: %timeit Permutations(30).random_element().longest_increasing_subsequences() 19.6 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Old version:
sage: %timeit Permutations(30).random_element().longest_increasing_subsequences() The slowest run took 10.40 times longer than the fastest. This could mean that an intermediate result is being cached. 3min 34s ± 2min 47s per loop (mean ± std. dev. of 7 runs, 1 loop each)
followup: #34218
Change History (41)
comment:1 Changed 21 months ago by
Branch:  → u/ghpetermoraw/LongestIncreasingSubsequences 

comment:2 Changed 21 months ago by
Commit:  → 0bb729e48b1669180599854ac8f2f4ee5741a094 

Status:  new → needs_review 
comment:3 Changed 21 months ago by
 when citing a reference, you should append an underscore, like this
[Sch1961]_
 Do not use "Returns" in the first line of doc, but "Return"
comment:4 Changed 21 months ago by
and patchbot complains about
src/doc/en/reference/references/index.rst # Tab character found
comment:5 Changed 21 months ago by
Cc:  Travis Scrimshaw added 

comment:6 Changed 21 months ago by
I think seq == sorted(seq)
would be better replaced by
all(seq[i] <= seq[i+1] for i in range(len(seq)1))
Also, is there a way you can avoid all of the calls to index
here:
self.index(seq[j]) < self.index(seq[j+1]):
I feel like there should be a better way to track things, otherwise this could run in O(len(seq)^2)
(maybe O(log(len(seq)*len(seq))
) time.
You can remove this useless list
 potential_sequences = list(itertools.product(*s))  for seq in potential_sequences: + for seq in itertools.product(*s):
I think it is cleaner to write j += 1
instead of j = j+1
. It might also be marginally faster.
comment:7 Changed 21 months ago by
Commit:  0bb729e48b1669180599854ac8f2f4ee5741a094 → 86204b5f3a3bf3f4c116e35890c94a4fb3087575 

Branch pushed to git repo; I updated commit sha1. New commits:
86204b5  tscrim edits

comment:9 Changed 21 months ago by
Nor have you commented on why you did not implement all of the suggestions from comment:6.
comment:10 Changed 21 months ago by
Status:  needs_review → needs_work 

comment:11 Changed 21 months ago by
Commit:  86204b5f3a3bf3f4c116e35890c94a4fb3087575 → 1f7a29ae89f3b1e5b3144451c4860d7b9c19e0fe 

comment:12 Changed 21 months ago by
About comment:6, you actually don't need to use index
at all. You can do a single pass on your permutation to check if the element of seq appear in the right order in your permutation.
comment:13 Changed 21 months ago by
Milestone:  sage9.3 → sage9.4 

Sage development has entered the release candidate phase for 9.3. Setting a new milestone for this ticket based on a cursory review of ticket status, priority, and last modification date.
comment:14 Changed 17 months ago by
There's a clever Python way to check if one sequence is a subsequence of another with a single pass ( https://stackoverflow.com/questions/24017363/howtotestifonestringisasubsequenceofanother ):
def is_subseq(x, y): it = iter(y) return all(c in it for c in x)
Short enough that you could eliminate the helper function and go with
it = iter(self) if seq == sorted(seq) and all(c in it for c in seq):
though including the helper function with the name may make code more transparent.
Were there any other lingering concerns besides fixing that helper function? If no other concerns and no preference between inline vs helper function, I'll pick one, push the change and set to 'needs review'.
comment:15 Changed 17 months ago by
I tried to implement Kevin's solution, but it runs slower than Peter's proposed solution on my computer. I agree that the code would look better, though. I think that Peter is working on a clever and much faster solution that would involve a directed graph to store the potential increasing subsequences. That would reduce the number of sequences to check, and they would automatically be subsequences of the permutation (so there would be no need to check that they indeed are; getting rid of the problem pointed out here).
Maybe Peter can confirm my thoughts and update us with where he is in the process.
comment:16 Changed 17 months ago by
Milestone:  sage9.4 → sage9.5 

Setting a new milestone for this ticket based on a cursory review.
comment:17 Changed 12 months ago by
Milestone:  sage9.5 → sage9.6 

comment:18 Changed 8 months ago by
Milestone:  sage9.6 → sage9.7 

comment:19 Changed 5 months ago by
Branch:  u/ghpetermoraw/LongestIncreasingSubsequences → u/nadialafreniere/LongestIncreasingSubsequences 

comment:20 Changed 5 months ago by
Authors:  Peter Morawitz → Peter Morawitz. Finn Hulse, Nadia Lafrenière 

Commit:  1f7a29ae89f3b1e5b3144451c4860d7b9c19e0fe → f5d40fa41d7cdee746b26a98a6a64f10eba47579 
Status:  needs_work → needs_review 
Peter decided to leave the project. I took his ideas as well as some ideas from Finn Hulse, another student I had in my class, to write the patch that should solve the problem. Instead of listing all potential longest increasing subsequences and checking for their presence in the sequence, it uses a directed graph in which all the paths from a source vertex to a sink are longest increasing subsequences.
As for speeding up the solution, I get the following results:
sage: %timeit Permutations(30).random_element().longest_increasing_subsequences() 416 µs ± 6.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) sage: %timeit Permutations(100).random_element().longest_increasing_subsequences() 3.34 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In comparison, with the code currently in Sage 9.7.beta5, I get
sage: %timeit Permutations(30).random_element().longest_increasing_subsequences() The slowest run took 34.93 times longer than the fastest. This could mean that an intermediate result is being cached. 6min 27s ± 6min 38s per loop (mean ± std. dev. of 7 runs, 1 loop each)
New commits:
0466305  Merge branch 'u/ghpetermoraw/LongestIncreasingSubsequences' of git://trac.sagemath.org/sage into t/31451/LongestIncreasingSubsequences

f5d40fa  Implemented the digraph solution for enumerating longest increasing subsequences

comment:21 Changed 5 months ago by
Branch:  u/nadialafreniere/LongestIncreasingSubsequences 

Commit:  f5d40fa41d7cdee746b26a98a6a64f10eba47579 
Sorry, I realized I push the wrong branch. Will push the correct one soon.
comment:22 Changed 5 months ago by
Branch:  → u/nadialafreniere/LongestIncreasingSubsequences2 

Commit:  → 8d1c1d6d8e39875c22299ec1912752db0e9332b6 
Fixed the branch issue.
comment:23 Changed 4 months ago by
I think the end would be clearer as
 increasing_subsequences = [] for i in columns[0]: D.add_edge(0, i) # 0 is source for i in columns[1]: D.add_edge(i, n+1) # n+1 is sink  for p in D.all_paths(0, n+1):  increasing_subsequences.append(p[1:1]) + increasing_subsequences = [p[1:1] for p in D.all_paths(0, n+1)]
comment:24 Changed 4 months ago by
Another detail,
increasing_subsequences.sort() increasing_subsequences.reverse()
would better be
increasing_subsequences.sort(reverse=True)
comment:25 followup: 27 Changed 4 months ago by
Wouldn't it be more efficient to keep the lists in columns
sorted so that one can do early stop
for k in columns[j1]: if k >= self[i]: break D.add_edge(k, self[i])
Of course keeping them sorted imposes a log(size)
for insertion + something smarter than a list as a data structure.
comment:26 Changed 4 months ago by
Commit:  8d1c1d6d8e39875c22299ec1912752db0e9332b6 → 62b9a399268375ef45311c9d773f40227a7dcfd6 

Branch pushed to git repo; I updated commit sha1. New commits:
62b9a39  Simplified end of code for longest increasing subsequences

comment:27 Changed 4 months ago by
Thanks, Vincent, for your comments. I implemented the changes to the end of the code. I believe it is more readable now.
Replying to vdelecroix:
Wouldn't it be more efficient to keep the lists in
columns
sorted so that one can do early stopfor k in columns[j1]: if k >= self[i]: break D.add_edge(k, self[i])Of course keeping them sorted imposes a
log(size)
for insertion + something smarter than a list as a data structure.
That would indeed be more efficient. I'm trying to think of the appropriate data structure for it, but I'm not sure how to handle this in order to keep it efficient. Did you have anything in mind?
comment:28 Changed 4 months ago by
You can maintain the elements in a column sorted using
sage: from bisect import insort sage: L = [] sage: for e in Permutations(30).random_element(): ....: insort(L, e) sage: print(L) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
You can also avoid the call to method self.longest_increasing_subsequence_length
and ensure that both use
first_row_p_tableau and
columns[j]` are sorted. Overall this is faster than your current code.
def longest_increasing_subsequences(self): r""" """ n = self.size() if not n: return([[]]) from bisect import insort # getting the column in which each element is inserted first_row_p_tableau = [] columns = [] D = DiGraph(n+2) for x in self: j = bisect(first_row_p_tableau, x) if j == len(first_row_p_tableau): if columns: for k in columns[1]: if k >= x: break D.add_edge(k, x) first_row_p_tableau.append(x) columns.append([x]) else: first_row_p_tableau[j] = x insort(columns[j], x) if j: for k in columns[j1]: if k >= x: break D.add_edge(k, x) for i in columns[0]: D.add_edge(0, i) # 0 is source for i in columns[1]: D.add_edge(i, n+1) # n+1 is sink return sorted([p[1:1] for p in D.all_paths(0, n+1)], reverse=True)
comment:29 Changed 4 months ago by
Description:  modified (diff) 

comment:30 Changed 4 months ago by
I did not know about bisect.insort
. The Python implementation of list uses contiguous array so that insertion is O(n)
and not O(log n)
. A priori, using balanced search tree would be more efficient here : iterating through them is O(1)
per element and insertion is O(log n)
. Browsing quickly, I found https://pypi.org/project/blist/. But since we are dealing with integers, I would go with the bstree
from boost
in C++. Not sure it is worth implementing it, but it could be mentioned as a remark in the code.
comment:31 Changed 4 months ago by
Right, insort is O(n)
. What we need here is a data structure with quick insert and enabling to iterate over the elements in increasing order without extra cost...
comment:32 Changed 4 months ago by
Commit:  62b9a399268375ef45311c9d773f40227a7dcfd6 → 4a7f5c8cf4097ff903f625d1e4b97b5416308fba 

Branch pushed to git repo; I updated commit sha1. New commits:
4a7f5c8  Use of bisect for columns in longest_increasing_subsequences

comment:33 Changed 4 months ago by
Thanks, David. I did not know about bisect either. That's a great improvement! I left the columns into sorted lists and did not implement the BST mostly because I think that the code is easier to read this way and because the code is still pretty fast. Given how slow it used to be, I doubt anyone ever used this code on large permutations. As suggested by Vincent, I added a note to the docstring about it. If others think a more efficient data structure is worth implementing, I would also be fine with it.
comment:34 Changed 4 months ago by
Nadia, you need a line break after .. NOTE::
as in
.. NOTE:: This algorithm could be made faster using a balanced search tree for each column instead of sorted lists. See discussion on :trac:`31451`.
comment:36 Changed 4 months ago by
and also 2 spaces before comments. Just a coding style.
 D.add_edge(i, n+1) # n+1 is sink + D.add_edge(i, n+1) # n+1 is sink
I also think that this code is good enough for now.
comment:37 Changed 4 months ago by
Commit:  4a7f5c8cf4097ff903f625d1e4b97b5416308fba → 5969edaa2e395e683feaba38db8ea94c6290e996 

Branch pushed to git repo; I updated commit sha1. New commits:
5969eda  proper formatting for longest_increasing_subsequences

comment:38 Changed 4 months ago by
Replying to vdelecroix:
The rest looks perfectly good to me. Thanks.
Thanks, Vincent. I just fixed the formatting.
comment:39 Changed 4 months ago by
Reviewers:  → David Coudert, Vincent Delecroix 

Status:  needs_review → positive_review 
comment:40 Changed 4 months ago by
Authors:  Peter Morawitz. Finn Hulse, Nadia Lafrenière → Peter Morawitz, Finn Hulse, Nadia Lafrenière 

comment:41 Changed 4 months ago by
Branch:  u/nadialafreniere/LongestIncreasingSubsequences2 → 5969edaa2e395e683feaba38db8ea94c6290e996 

Resolution:  → fixed 
Status:  positive_review → closed 
New commits:
improved longest increasing subsequences function
Added reference