Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#10433 closed enhancement (fixed)

clean up the code for Kruskal's algorithm

Reported by: mvngu Owned by: jason, ncohen, rlm
Priority: major Milestone: sage-4.6.2
Component: graph theory Keywords:
Cc: ncohen Merged in: sage-4.6.2.alpha1
Authors: Minh Van Nguyen Reviewers: Nathann Cohen
Report Upstream: N/A Work issues:
Branch: Commit:
Dependencies: Stopgaps:

Description (last modified by mvngu)

From sage-support. See also this sage-devel thread for design discussion.

Attachments (2)

trac-10433_clean-up-kruskal.patch (31.0 KB) - added by mvngu 9 years ago.
trac-10433-rebased-4.6.2.alpha0.patch (30.8 KB) - added by ncohen 9 years ago.

Download all attachments as: .zip

Change History (19)

comment:1 Changed 9 years ago by ncohen

  • Cc ncohen added

comment:2 Changed 9 years ago by mvngu

  • Description modified (diff)

comment:3 Changed 9 years ago by mvngu

  • Description modified (diff)

comment:4 Changed 9 years ago by mvngu

  • Summary changed from rewrite or clean up the minimum spanning tree code to clean up the code for Kruskal's algorithm

comment:5 Changed 9 years ago by mvngu

  • Authors set to Minh Van Nguyen
  • Status changed from new to needs_review

comment:6 Changed 9 years ago by mvngu

  • Description modified (diff)

comment:7 Changed 9 years ago by ncohen

  • Status changed from needs_review to needs_work

Hello Minh !!!!

I just looked at your code... Here are several questions/remarks about it :

  • it is not possible to turn the sanity checks off from the min_spanning_tree method. If a users wants to do so, he has to import the kruskal method himself
  • the min_spanning_tree algororithm is one of the most basic, and is meant to be *fast*. That's also why it is very good to have it in Cython. But when the algorithm is run, here is what happens : test of connectedness --> BFS/DFS --> the whole graph is read. When this is done, you are testing whether the graph is a tree : first is_connected is called, then the edges are counted. 2 traversals at this moment. You could save one by just counting the edges of the graph, as you just checked it is connected ( that's exactly how the is_tree method is written ). Anyway, the best would really be to run kruskal's algorithm, to theck afterwards whether the result is connected.
  • I was not able to produce a counterexample, but I think there's something wrong in how you remove multiple edges by keeping the minimum. You list all the multiple edges (and all the edges of a complete graph may be doubled or tripled). You then remember for each vertex *one of the neighbors* with which it has many edges. You are then selecting a neighbor for each vertex in the worst case, which mean a linear number of information. Then, for each vertex and the neighbor you selected, you remember the edge of smallest weight, which is ok... But why should you have removed all the double edges ? What happens if a vertex has multiple edges in common with many of its neighbors ? O_o
  • Personnal question : why is itemgetter(2) better than lambda x:x[1] ? Or is it just because there are no lambda functions available in Cython ?
  • Whatever happens, you seem to need a copy of your graph... What about dealing with multiple edges on the way ? Sorting all the edges according to their weight, even with multiple ones. When you then try to add the heavier copy of an edge you already tried, it is bound to be refused anyway... Do you think it would be slower like that ?
  • Why do you write "for startv in iter(e[0:2]):" instead of "for startv in e[0:2]:" ? O_o
  • Your algorithm returns a list of edges. Many of the methods I first wrote were doing the same, and returned a list of edges instead of returning a graph, to save the time required to build the graph structure... After using them for a while, I noticed I was ALWAYS casting the result to a graph, as it is, in the end, the best way to "represent a set of pairs"... That's what it is made to be in the first place :-D

That's really up to you on that one. I wouldn't say returning a list of edges has any special fault, but I am now writing my methods so that they return graphs directly... Probably just a matter of taste.

  • I had never seen this union-find method... It took me some time, but that's a pretty nice trick :-)

Nathann

comment:8 Changed 9 years ago by mvngu

  • Status changed from needs_work to needs_review

Replying to ncohen:

  • it is not possible to turn the sanity checks off from the min_spanning_tree method. If a users wants to do so, he has to import the kruskal method himself

I've added a new parameter "check" to the method min_spanning_tree(). Wherever appropriate, this parameter will be passed on to any function used in computing a minimum spanning tree.

  • the min_spanning_tree algororithm is one of the most basic, and is meant to be *fast*. That's also why it is very good to have it in Cython. But when the algorithm is run, here is what happens : test of connectedness --> BFS/DFS --> the whole graph is read. When this is done, you are testing whether the graph is a tree : first is_connected is called, then the edges are counted. 2 traversals at this moment. You could save one by just counting the edges of the graph, as you just checked it is connected ( that's exactly how the is_tree method is written ). Anyway, the best would really be to run kruskal's algorithm, to theck afterwards whether the result is connected.

You're right. That's two computation of connectedness. This is now fixed to only do one computation of connectedness and afterward to check for tree.

  • I was not able to produce a counterexample, but I think there's something wrong in how you remove multiple edges by keeping the minimum. You list all the multiple edges (and all the edges of a complete graph may be doubled or tripled). You then remember for each vertex *one of the neighbors* with which it has many edges. You are then selecting a neighbor for each vertex in the worst case, which mean a linear number of information. Then, for each vertex and the neighbor you selected, you remember the edge of smallest weight, which is ok... But why should you have removed all the double edges ? What happens if a vertex has multiple edges in common with many of its neighbors ? O_o

Here is an explanation of the code block on retaining only the minimum weighted edge among all weighted multiple edges with the same start and edge vertices. Let G be an undirected, weighted multigraph. If there are multiple edges from u to v, retain only the start and end vertices of such edges; at this point we don't care about the edge weights. Let a and b be the start and end vertices, respectively, of a weighted edge (a, b, w) having weight w. Then there are multiple weighted edges from a to b if and only if the dictionary uniqE has the key/value pair (a, b). Let (u, v) be a key/value pair in uniqE. Then there are multiple weighted edges from u to v. Let W be a list of all edge weights of multiple edges from u to v, sorted in nondecreasing order. If w is the first element in W, then (u, v, w) is an edge of minimum weight (there may be several edges of minimum weight) among all weighted edges from u to v. If i >= 2 is the i-th element in W, delete the multiple weighted edge (u, v, i). The only remaining weighted edge is (u, v, w) with w being the first element of W.

  • Personnal question : why is itemgetter(2) better than lambda x:x[1] ? Or is it just because there are no lambda functions available in Cython ?

Specifically, I'm using the idiom:

from operator import itemgetter
sorted(iterable, key=itemgetter(n))

It is more or less equivalent to using a lambda, but there are clear documentation on the above sorting idiom. See this page in the Python library. And this other page from the Python wiki, especially the section "Operator Module Functions". Also note that what you're suggesting is using a closure, which is not supported in the version of Cython currently distributed with Sage 4.6.1.alpha3.

  • Whatever happens, you seem to need a copy of your graph... What about dealing with multiple edges on the way ? Sorting all the edges according to their weight, even with multiple ones. When you then try to add the heavier copy of an edge you already tried, it is bound to be refused anyway... Do you think it would be slower like that ?

I work with the assumption that we can only compute minimum spanning trees for undirected simple graphs. Let g be a copy of the input graph and suppose g is an unweighted multigraph. By the working assumption, we merely convert g to be a simple graph and then run the resulting simple graph through Kruskal's algorithm. I think working directly with the input graph, and not a copy of this graph, would result in more complicated code. If you want speed, make sure your input graph is a simple unweighted graph that is connected and not a tree. The keyword check was designed for this purpose

  • Why do you write "for startv in iter(e[0:2]):" instead of "for startv in e[0:2]:" ? O_o

Because I want an iterator over the list returned by "e[0:2]". If I do

for startv in e[0:2]:

then I would first produce a list as requested by the slice operation "e[0:2]" and I would need to wait for this production process to end. After I have finished producing the list, I would then begin iterating over items in the list. However, if I do

for startv in iter(e[0:2]):

then I don't need to first wait for the list "e[0:2]" to be produced before I can start iterating over its elements. The function iter() returns an iterator over an iterable such that the next item in the iterable is only produced when necessary. Using iterators with iter() or the yield statement can save some computation time.

  • Your algorithm returns a list of edges. Many of the methods I first wrote were doing the same, and returned a list of edges instead of returning a graph, to save the time required to build the graph structure... After using them for a while, I noticed I was ALWAYS casting the result to a graph, as it is, in the end, the best way to "represent a set of pairs"... That's what it is made to be in the first place :-D

I've added the keyword as_edges. If the value of this boolean is true, then output a list of weighted edges comprising a minimum spanning tree. If as_edges=False, then output an actual tree that is a minimum spanning tree.

comment:9 Changed 9 years ago by mvngu

Patch updated with more documentation: more comments within code, a new doctest, more explanation about what's expected of the input graph to kruskal().

comment:10 Changed 9 years ago by mvngu

Update patch to use the set data structure during the process of removing multiple edges.

comment:11 Changed 9 years ago by ncohen

  • Status changed from needs_review to needs_info

Hello Minh !!!

I'm sorry to come back with another thing to fix after having made so many remarks, but I really hadn't thought this problem would occur.. I just did some profiling, and even though your code is faster when no checkings are required, there is a large slowdown by default :

  • With checkings :
for i in range(10):
    g = graphs.RandomGNP(50,.3)
    timeit("g.min_spanning_tree()")

25 loops, best of 3: 13.1 ms per loop
25 loops, best of 3: 12.7 ms per loop
25 loops, best of 3: 12.5 ms per loop
25 loops, best of 3: 12.5 ms per loop
25 loops, best of 3: 13.1 ms per loop
25 loops, best of 3: 12.6 ms per loop
25 loops, best of 3: 12 ms per loop
25 loops, best of 3: 11.9 ms per loop
25 loops, best of 3: 12.8 ms per loop
25 loops, best of 3: 13.2 ms per loop
  • Without Checkings :
for i in range(10):
    g = graphs.RandomGNP(50,.3)
    timeit("g.min_spanning_tree(check=False)")

125 loops, best of 3: 2.36 ms per loop
125 loops, best of 3: 2.1 ms per loop
125 loops, best of 3: 2.34 ms per loop
125 loops, best of 3: 2.49 ms per loop
125 loops, best of 3: 2.42 ms per loop
125 loops, best of 3: 2.36 ms per loop
125 loops, best of 3: 2.37 ms per loop
125 loops, best of 3: 2.23 ms per loop
125 loops, best of 3: 2.34 ms per loop
125 loops, best of 3: 2.2 ms per loop
  • With no patch applied ::
for i in range(10):
    g = graphs.RandomGNP(50,.3)
    timeit("g.min_spanning_tree()")

125 loops, best of 3: 3.33 ms per loop
125 loops, best of 3: 2.96 ms per loop
125 loops, best of 3: 3.21 ms per loop
125 loops, best of 3: 3.32 ms per loop
125 loops, best of 3: 2.8 ms per loop
125 loops, best of 3: 3.06 ms per loop
125 loops, best of 3: 3.03 ms per loop
125 loops, best of 3: 3.13 ms per loop
125 loops, best of 3: 3.4 ms per loop
125 loops, best of 3: 3.5 ms per loop

Calling to_undirected() ::

for i in range(10):
    g = graphs.RandomGNP(50,.3)
    timeit("g.to_undirected()")

125 loops, best of 3: 4.43 ms per loop
125 loops, best of 3: 4.56 ms per loop
125 loops, best of 3: 4.88 ms per loop
125 loops, best of 3: 4.59 ms per loop
125 loops, best of 3: 4.92 ms per loop
125 loops, best of 3: 4.54 ms per loop
125 loops, best of 3: 4.57 ms per loop
125 loops, best of 3: 4.25 ms per loop
125 loops, best of 3: 4.69 ms per loop
125 loops, best of 3: 4.63 ms per loop

Calling is_connected()::

for i in range(10):
    g = graphs.RandomGNP(50,.3)
    timeit("g.is_connected()")

625 loops, best of 3: 153 µs per loop
625 loops, best of 3: 154 µs per loop
625 loops, best of 3: 157 µs per loop
625 loops, best of 3: 152 µs per loop
625 loops, best of 3: 150 µs per loop
625 loops, best of 3: 149 µs per loop
625 loops, best of 3: 149 µs per loop
625 loops, best of 3: 153 µs per loop
625 loops, best of 3: 153 µs per loop
625 loops, best of 3: 155 µs per loop

So in general the default behaviour leads to a much slower performance, even though disabling the checkings makes it better than the current implementation. I think most of it is because of the time required to copy the graph (checking the connectivity is comparatively *much* faster). I tried to find some cheap trick to avoid the checking, at least when the graph is simple and unweighted. Actually, I rewrote a computation of spanning trees based on depth-first search, but I wasn't able to do better than what your code does when the checkings are disabled, so it's not useful by itself. I think the best way is to find a way around graph copies, but no cheap tricks so far ^^;

Nathann

Changed 9 years ago by mvngu

comment:12 Changed 9 years ago by mvngu

  • Status changed from needs_info to needs_review

Replying to ncohen:

Hello Minh !!!

I'm sorry to come back with another thing to fix after having made so many remarks, but I really hadn't thought this problem would occur.. I just did some profiling, and even though your code is faster when no checkings are required, there is a large slowdown by default :

What can be done is to turn off the sanity checks by default. I have done this in the updated patch. But note that the function now assumes that by default the input graph is simple, connected, not a tree, and has at least one vertex. If you can't be certain that the input graph meet all of the latter conditions, then toggle check=True. With the updated patch, here are some timing results. First, I generate the reference graph to be used for all profiling:

sage: g = graphs.RandomGNP(2000, 0.3)
sage: save(g, "graphs")

I get some timing for the case where we don't use the patch:

sage: g = load("graphs")
sage: %time g.min_spanning_tree();
CPU times: user 15.57 s, sys: 0.00 s, total: 15.57 s
Wall time: 15.57 s
sage: %time g.min_spanning_tree();
CPU times: user 16.29 s, sys: 0.00 s, total: 16.29 s
Wall time: 16.29 s
sage: %time g.min_spanning_tree();
CPU times: user 16.02 s, sys: 0.00 s, total: 16.02 s
Wall time: 16.02 s
sage: %time g.min_spanning_tree();
CPU times: user 16.50 s, sys: 0.00 s, total: 16.50 s
Wall time: 16.50 s

Now with the patch, but use kruskal() indirectly via min_spanning_tree():

sage: g = load("graphs")
sage: %time g.min_spanning_tree();
CPU times: user 15.22 s, sys: 0.00 s, total: 15.22 s
Wall time: 15.22 s
sage: %time g.min_spanning_tree();
CPU times: user 14.86 s, sys: 0.00 s, total: 14.86 s
Wall time: 14.86 s
sage: %time g.min_spanning_tree();
CPU times: user 14.72 s, sys: 0.00 s, total: 14.72 s
Wall time: 14.71 s
sage: %time g.min_spanning_tree();
CPU times: user 15.10 s, sys: 0.00 s, total: 15.10 s
Wall time: 15.09 s

Finally, with the patch and directly import and use kruskal():

sage: g = load("graphs")
sage: from sage.graphs.spanning_tree import kruskal
sage: %time kruskal(g);
CPU times: user 15.11 s, sys: 0.00 s, total: 15.11 s
Wall time: 15.11 s
sage: %time kruskal(g);
CPU times: user 14.79 s, sys: 0.00 s, total: 14.79 s
Wall time: 14.79 s
sage: %time kruskal(g);
CPU times: user 14.66 s, sys: 0.00 s, total: 14.66 s
Wall time: 14.66 s
sage: %time kruskal(g);
CPU times: user 14.99 s, sys: 0.00 s, total: 14.99 s
Wall time: 14.99 s


Calling to_undirected() ::

for i in range(10):
    g = graphs.RandomGNP(50,.3)
    timeit("g.to_undirected()")

125 loops, best of 3: 4.43 ms per loop
125 loops, best of 3: 4.56 ms per loop
125 loops, best of 3: 4.88 ms per loop
125 loops, best of 3: 4.59 ms per loop
125 loops, best of 3: 4.92 ms per loop
125 loops, best of 3: 4.54 ms per loop
125 loops, best of 3: 4.57 ms per loop
125 loops, best of 3: 4.25 ms per loop
125 loops, best of 3: 4.69 ms per loop
125 loops, best of 3: 4.63 ms per loop

The call to to_undirected(), is_connected(), and to_simple() are only performed when you do sanity checks. There are no other way around them. If sanity checking is toggled on, we need to somehow get the input graph to be simple, connected, undirected, not a tree, and has at least one vertex. Note that the current implementation of the function kruskal() is meant to be a generic implementation, so it needs to handle various types of graphs.

So in general the default behaviour leads to a much slower performance, even though disabling the checkings makes it better than the current implementation. I think most of it is because of the time required to copy the graph (checking the connectivity is comparatively *much* faster). I tried to find some cheap trick to avoid the checking, at least when the graph is simple and unweighted. Actually, I rewrote a computation of spanning trees based on depth-first search, but I wasn't able to do better than what your code does when the checkings are disabled, so it's not useful by itself. I think the best way is to find a way around graph copies, but no cheap tricks so far ^^;

Here's one way to improve performance: get rid of the keyword "as_edges" and return only the edges of a minimum spanning tree. You might want kruskal() to return an actual tree object that represents a minimum spanning tree, but the behaviour of returning the edges can later on be tweaked to use generators and the yield statement to increase performance. Currently, the latest version of Cython (version 0.14) doesn't support generators and the yield statement, so there's no way to use them yet. But I have updated the patch to get rid of the keyword "as_edges". Later on when Cython supports the yield statement, we could use that. The advantage then would be that once an edge of a minimum spanning tree is found, it is returned via a yield statement and you could immediately use that edge to build your tree while the function kruskal() is running. This has the effect of returning an iterator over the edges of a minimum spanning tree, instead of first waiting for all such edges to be found and returned before you could start using them.

What could also increase performance slightly is to use a priority queue to iterate over edges of the input graph, instead of using a sorting function. But a Cython implementation of priority queue is still not yet available.

comment:13 Changed 9 years ago by ncohen

  • Reviewers set to Nathann Cohen
  • Status changed from needs_review to positive_review

Positive review to this patch ! It passes "testall -long", and is a very nice piece of work (plus the beginning of a new module) :-)

Thank you very much for your patience, Minh ! :-)

Nathann

comment:14 Changed 9 years ago by jdemeyer

  • Status changed from positive_review to needs_work
  • Work issues set to rebase

Needs to be rebased to sage-4.6.2.alpha0, I get the following patch output::

patching file doc/en/reference/graphs.rst
Hunk #1 succeeded at 44 with fuzz 2 (offset 1 line).
patching file module_list.py
Hunk #1 succeeded at 380 (offset 12 lines).
patching file sage/graphs/generic_graph.py
Hunk #1 succeeded at 2029 (offset -17 lines).
patching file sage/graphs/graph.py
Hunk #1 succeeded at 3412 (offset 69 lines).
patching file sage/graphs/spanning_tree.pyx
patching file sage/misc/sagedoc.py
Hunk #1 FAILED at 890.
1 out of 1 hunk FAILED -- saving rejects to file sage/misc/sagedoc.py.rej

comment:15 Changed 9 years ago by ncohen

  • Status changed from needs_work to positive_review

The fixing about the results of a search for "tree" in the doc has been fixed elsewhere !

Here is a rebased version of the patch !

Nathann

Changed 9 years ago by ncohen

comment:16 Changed 9 years ago by jdemeyer

  • Merged in set to sage-4.6.2.alpha1
  • Resolution set to fixed
  • Status changed from positive_review to closed

comment:17 Changed 9 years ago by jdemeyer

  • Work issues rebase deleted
Note: See TracTickets for help on using tickets.