Opened 2 years ago
Closed 6 months ago
#31451 closed enhancement (fixed)
Faster version of longest_increasing_subsequences
Reported by: | gh-petermoraw | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | sage-9.7 |
Component: | combinatorics | Keywords: | permutation, subsequences |
Cc: | nadialafreniere, enadeau, tscrim | Merged in: | |
Authors: | Peter Morawitz, Finn Hulse, Nadia Lafrenière | Reviewers: | David Coudert, Vincent Delecroix |
Report Upstream: | N/A | Work issues: | |
Branch: | 5969eda (Commits, GitHub, GitLab) | Commit: | 5969edaa2e395e683feaba38db8ea94c6290e996 |
Dependencies: | Stopgaps: |
Description (last modified by )
The algorithm for the longest_increasing_subsequences function for permutations is currently very inefficient. I worked on a new version that vastly improves the run time of the function. New version:
sage: %timeit Permutations(30).random_element().longest_increasing_subsequences() 19.6 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Old version:
sage: %timeit Permutations(30).random_element().longest_increasing_subsequences() The slowest run took 10.40 times longer than the fastest. This could mean that an intermediate result is being cached. 3min 34s ± 2min 47s per loop (mean ± std. dev. of 7 runs, 1 loop each)
follow-up: #34218
Change History (41)
comment:1 Changed 2 years ago by
Branch: | → u/gh-petermoraw/LongestIncreasingSubsequences |
---|
comment:2 Changed 2 years ago by
Commit: | → 0bb729e48b1669180599854ac8f2f4ee5741a094 |
---|---|
Status: | new → needs_review |
comment:3 Changed 2 years ago by
- when citing a reference, you should append an underscore, like this
[Sch1961]_
- Do not use "Returns" in the first line of doc, but "Return"
comment:4 Changed 2 years ago by
and patchbot complains about
src/doc/en/reference/references/index.rst # Tab character found
comment:5 Changed 2 years ago by
Cc: | tscrim added |
---|
comment:6 Changed 2 years ago by
I think seq == sorted(seq)
would be better replaced by
all(seq[i] <= seq[i+1] for i in range(len(seq)-1))
Also, is there a way you can avoid all of the calls to index
here:
self.index(seq[j]) < self.index(seq[j+1]):
I feel like there should be a better way to track things, otherwise this could run in O(len(seq)^2)
(maybe O(log(len(seq)*len(seq))
) time.
You can remove this useless list
- potential_sequences = list(itertools.product(*s)) - for seq in potential_sequences: + for seq in itertools.product(*s):
I think it is cleaner to write j += 1
instead of j = j+1
. It might also be marginally faster.
comment:7 Changed 23 months ago by
Commit: | 0bb729e48b1669180599854ac8f2f4ee5741a094 → 86204b5f3a3bf3f4c116e35890c94a4fb3087575 |
---|
Branch pushed to git repo; I updated commit sha1. New commits:
86204b5 | tscrim edits
|
comment:9 Changed 23 months ago by
Nor have you commented on why you did not implement all of the suggestions from comment:6.
comment:10 Changed 23 months ago by
Status: | needs_review → needs_work |
---|
comment:11 Changed 23 months ago by
Commit: | 86204b5f3a3bf3f4c116e35890c94a4fb3087575 → 1f7a29ae89f3b1e5b3144451c4860d7b9c19e0fe |
---|
comment:12 Changed 23 months ago by
About comment:6, you actually don't need to use index
at all. You can do a single pass on your permutation to check if the element of seq appear in the right order in your permutation.
comment:13 Changed 23 months ago by
Milestone: | sage-9.3 → sage-9.4 |
---|
Sage development has entered the release candidate phase for 9.3. Setting a new milestone for this ticket based on a cursory review of ticket status, priority, and last modification date.
comment:14 Changed 19 months ago by
There's a clever Python way to check if one sequence is a subsequence of another with a single pass ( https://stackoverflow.com/questions/24017363/how-to-test-if-one-string-is-a-subsequence-of-another ):
def is_subseq(x, y): it = iter(y) return all(c in it for c in x)
Short enough that you could eliminate the helper function and go with
it = iter(self) if seq == sorted(seq) and all(c in it for c in seq):
though including the helper function with the name may make code more transparent.
Were there any other lingering concerns besides fixing that helper function? If no other concerns and no preference between in-line vs helper function, I'll pick one, push the change and set to 'needs review'.
comment:15 Changed 19 months ago by
I tried to implement Kevin's solution, but it runs slower than Peter's proposed solution on my computer. I agree that the code would look better, though. I think that Peter is working on a clever and much faster solution that would involve a directed graph to store the potential increasing subsequences. That would reduce the number of sequences to check, and they would automatically be subsequences of the permutation (so there would be no need to check that they indeed are; getting rid of the problem pointed out here).
Maybe Peter can confirm my thoughts and update us with where he is in the process.
comment:16 Changed 19 months ago by
Milestone: | sage-9.4 → sage-9.5 |
---|
Setting a new milestone for this ticket based on a cursory review.
comment:17 Changed 14 months ago by
Milestone: | sage-9.5 → sage-9.6 |
---|
comment:18 Changed 10 months ago by
Milestone: | sage-9.6 → sage-9.7 |
---|
comment:19 Changed 7 months ago by
Branch: | u/gh-petermoraw/LongestIncreasingSubsequences → u/nadialafreniere/LongestIncreasingSubsequences |
---|
comment:20 Changed 7 months ago by
Authors: | Peter Morawitz → Peter Morawitz. Finn Hulse, Nadia Lafrenière |
---|---|
Commit: | 1f7a29ae89f3b1e5b3144451c4860d7b9c19e0fe → f5d40fa41d7cdee746b26a98a6a64f10eba47579 |
Status: | needs_work → needs_review |
Peter decided to leave the project. I took his ideas as well as some ideas from Finn Hulse, another student I had in my class, to write the patch that should solve the problem. Instead of listing all potential longest increasing subsequences and checking for their presence in the sequence, it uses a directed graph in which all the paths from a source vertex to a sink are longest increasing subsequences.
As for speeding up the solution, I get the following results:
sage: %timeit Permutations(30).random_element().longest_increasing_subsequences() 416 µs ± 6.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) sage: %timeit Permutations(100).random_element().longest_increasing_subsequences() 3.34 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In comparison, with the code currently in Sage 9.7.beta5, I get
sage: %timeit Permutations(30).random_element().longest_increasing_subsequences() The slowest run took 34.93 times longer than the fastest. This could mean that an intermediate result is being cached. 6min 27s ± 6min 38s per loop (mean ± std. dev. of 7 runs, 1 loop each)
New commits:
0466305 | Merge branch 'u/gh-petermoraw/LongestIncreasingSubsequences' of git://trac.sagemath.org/sage into t/31451/LongestIncreasingSubsequences
|
f5d40fa | Implemented the digraph solution for enumerating longest increasing subsequences
|
comment:21 Changed 7 months ago by
Branch: | u/nadialafreniere/LongestIncreasingSubsequences |
---|---|
Commit: | f5d40fa41d7cdee746b26a98a6a64f10eba47579 |
Sorry, I realized I push the wrong branch. Will push the correct one soon.
comment:22 Changed 7 months ago by
Branch: | → u/nadialafreniere/LongestIncreasingSubsequences2 |
---|---|
Commit: | → 8d1c1d6d8e39875c22299ec1912752db0e9332b6 |
Fixed the branch issue.
comment:23 Changed 7 months ago by
I think the end would be clearer as
- increasing_subsequences = [] for i in columns[0]: D.add_edge(0, i) # 0 is source for i in columns[-1]: D.add_edge(i, n+1) # n+1 is sink - for p in D.all_paths(0, n+1): - increasing_subsequences.append(p[1:-1]) + increasing_subsequences = [p[1:-1] for p in D.all_paths(0, n+1)]
comment:24 Changed 7 months ago by
Another detail,
increasing_subsequences.sort() increasing_subsequences.reverse()
would better be
increasing_subsequences.sort(reverse=True)
comment:25 follow-up: 27 Changed 7 months ago by
Wouldn't it be more efficient to keep the lists in columns
sorted so that one can do early stop
for k in columns[j-1]: if k >= self[i]: break D.add_edge(k, self[i])
Of course keeping them sorted imposes a log(|size|)
for insertion + something smarter than a list as a data structure.
comment:26 Changed 7 months ago by
Commit: | 8d1c1d6d8e39875c22299ec1912752db0e9332b6 → 62b9a399268375ef45311c9d773f40227a7dcfd6 |
---|
Branch pushed to git repo; I updated commit sha1. New commits:
62b9a39 | Simplified end of code for longest increasing subsequences
|
comment:27 Changed 7 months ago by
Thanks, Vincent, for your comments. I implemented the changes to the end of the code. I believe it is more readable now.
Replying to vdelecroix:
Wouldn't it be more efficient to keep the lists in
columns
sorted so that one can do early stopfor k in columns[j-1]: if k >= self[i]: break D.add_edge(k, self[i])Of course keeping them sorted imposes a
log(|size|)
for insertion + something smarter than a list as a data structure.
That would indeed be more efficient. I'm trying to think of the appropriate data structure for it, but I'm not sure how to handle this in order to keep it efficient. Did you have anything in mind?
comment:28 Changed 7 months ago by
You can maintain the elements in a column sorted using
sage: from bisect import insort sage: L = [] sage: for e in Permutations(30).random_element(): ....: insort(L, e) sage: print(L) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
You can also avoid the call to method self.longest_increasing_subsequence_length
and use enumerate
in some places.
Overall this is faster than your current code.
def longest_increasing_subsequences(self): r""" """ n = self.size() if not n: return([[]]) from bisect import insort # getting the column in which each element is inserted first_row_p_tableau = [] columns = [] D = DiGraph(n+2) for i, x in enumerate(self): j = bisect(first_row_p_tableau, x) if j == len(first_row_p_tableau): if columns: for k in columns[-1]: if k >= x: break D.add_edge(k, x) first_row_p_tableau.append(x) columns.append([x]) else: first_row_p_tableau[j] = x insort(columns[j], x) if j: for k in columns[j-1]: if k >= x: break D.add_edge(k, x) for i in columns[0]: D.add_edge(0, i) # 0 is source for i in columns[-1]: D.add_edge(i, n+1) # n+1 is sink return sorted([p[1:-1] for p in D.all_paths(0, n+1)], reverse=True)
comment:29 Changed 6 months ago by
Description: | modified (diff) |
---|
comment:30 Changed 6 months ago by
I did not know about bisect.insort
. The Python implementation of list uses contiguous array so that insertion is O(n)
and not O(log n)
. A priori, using balanced search tree would be more efficient here : iterating through them is O(1)
per element and insertion is O(log n)
. Browsing quickly, I found https://pypi.org/project/blist/. But since we are dealing with integers, I would go with the bstree
from boost
in C++. Not sure it is worth implementing it, but it could be mentioned as a remark in the code.
comment:31 Changed 6 months ago by
Right, insort is O(n)
. What we need here is a data structure with quick insert and enabling to iterate over the elements in increasing order without extra cost...
comment:32 Changed 6 months ago by
Commit: | 62b9a399268375ef45311c9d773f40227a7dcfd6 → 4a7f5c8cf4097ff903f625d1e4b97b5416308fba |
---|
Branch pushed to git repo; I updated commit sha1. New commits:
4a7f5c8 | Use of bisect for columns in longest_increasing_subsequences
|
comment:33 Changed 6 months ago by
Thanks, David. I did not know about bisect either. That's a great improvement! I left the columns into sorted lists and did not implement the BST mostly because I think that the code is easier to read this way and because the code is still pretty fast. Given how slow it used to be, I doubt anyone ever used this code on large permutations. As suggested by Vincent, I added a note to the docstring about it. If others think a more efficient data structure is worth implementing, I would also be fine with it.
comment:34 Changed 6 months ago by
Nadia, you need a line break after .. NOTE::
as in
.. NOTE:: This algorithm could be made faster using a balanced search tree for each column instead of sorted lists. See discussion on :trac:`31451`.
comment:36 Changed 6 months ago by
and also 2 spaces before comments. Just a coding style.
- D.add_edge(i, n+1) # n+1 is sink + D.add_edge(i, n+1) # n+1 is sink
I also think that this code is good enough for now.
comment:37 Changed 6 months ago by
Commit: | 4a7f5c8cf4097ff903f625d1e4b97b5416308fba → 5969edaa2e395e683feaba38db8ea94c6290e996 |
---|
Branch pushed to git repo; I updated commit sha1. New commits:
5969eda | proper formatting for longest_increasing_subsequences
|
comment:38 Changed 6 months ago by
Replying to vdelecroix:
The rest looks perfectly good to me. Thanks.
Thanks, Vincent. I just fixed the formatting.
comment:39 Changed 6 months ago by
Reviewers: | → David Coudert, Vincent Delecroix |
---|---|
Status: | needs_review → positive_review |
comment:40 Changed 6 months ago by
Authors: | Peter Morawitz. Finn Hulse, Nadia Lafrenière → Peter Morawitz, Finn Hulse, Nadia Lafrenière |
---|
comment:41 Changed 6 months ago by
Branch: | u/nadialafreniere/LongestIncreasingSubsequences2 → 5969edaa2e395e683feaba38db8ea94c6290e996 |
---|---|
Resolution: | → fixed |
Status: | positive_review → closed |
New commits:
improved longest increasing subsequences function
Added reference