#12362 closed enhancement (fixed)
Improvement of GNP generators for graphs and digraphs
Reported by: | David Coudert | Owned by: | jason, ncohen, rlm |
---|---|---|---|
Priority: | minor | Milestone: | sage-5.0 |
Component: | graph theory | Keywords: | directed graphs, generators |
Cc: | Nathann Cohen | Merged in: | sage-5.0.beta9 |
Authors: | David Coudert | Reviewers: | Nathann Cohen |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | Stopgaps: |
Description (last modified by )
The RandomDirectedGNP generator is quit slow because it is written in python. Also I propose to rewrite it in Cython.
With this ticket, I propose to create a file graph_generators_pyx.pyx containing graphs and digraphs generators in cython and to provide a faster implementation of the GNP generator for both graphs and digraphs.
APPLY:
Attachments (4)
Change History (48)
comment:1 Changed 11 years ago by
comment:2 Changed 11 years ago by
Description: | modified (diff) |
---|---|
Summary: | Modification of the digraph generator to cython and improvement of the GNP genrator → Improvement of GNP generators for graphs and digraphs |
comment:3 Changed 11 years ago by
Description: | modified (diff) |
---|
comment:4 Changed 11 years ago by
Another proposal for RandomDirectedGNP.
The function is generic and also works for RandomGNP, but I haven't modified the graph_generators.py file because further benchmark are required: networkx implementation versus this implementation.
How can I remove useless attachments of this ticket ??
Thanks,
D.
comment:5 Changed 11 years ago by
Status: | new → needs_review |
---|
I have changed the status from new to need_review.
Without this patch
sage: %timeit digraphs.RandomDirectedGNP(1000,0.001) 5 loops, best of 3: 2.58 s per loop sage: %timeit networkx.random_graphs.directed_gnp_random_graph(1000,0.001) 5 loops, best of 3: 2.08 s per loop
With this patch
sage: %timeit digraphs.RandomDirectedGNP(1000,0.001) 5 loops, best of 3: 26.6 ms per loop
This patch could be used for undirected graphs, but the networkx implementation is competitive.
Waiting for your feedback.
D.
comment:6 Changed 11 years ago by
Status: | needs_review → needs_work |
---|
Helloooooo !!!
Nice speedup :-D
During the Sage week we also found a similar speedup in a Poset code, by replacing a m[i][j]
by a
m[i,j]
:-P
About the length of the code : it can be divided by two at no cost by removing the "else" on the "loops" argument. If you try to build an edge between i and i in a graph that that has been built with loops = False the edge is silently *not* created. And there should be no noticeable difference in the runtime if the code with loops is applied to the case where loops are forbidden.
It is also possible to write the two codes
for 0 <= i < n: for 0 <= j < n: if random() < pp: G.add_edge(i,j)
and
# Standard GNP generator for graphs for 0 <= i < n-1: for i+1 <= j <n: if random() < pp: G.add_edge(i,j)
as
for 0 <= i < n-1: for (i+1 if directed else 0) <= j <n: if random() < pp: G.add_edge(i,j)
Same thing here, a "if" in a loop (with branch prediction !) should not weigh much :-)
Nathann
comment:7 follow-up: 9 Changed 11 years ago by
Thank you Nathann for the good advices.
I have changed accordingly the standard GNP algorithm. I have also contracted the fast algorithm but it is a bit more difficult to read now. The extra cost for N = 10000 and p=0.0001 is 5ms which is acceptable ;-)
Some running time. The fast algorithm is better for large sparse graphs. Otherwise, the standard GNP algorithm is now fast enough.
sage: %timeit G = digraphs.RandomDirectedGNP(1000,0.001,loops=True,fast=True) 5 loops, best of 3: 55.8 ms per loop sage: %timeit G = digraphs.RandomDirectedGNP(1000,0.001,loops=True) 25 loops, best of 3: 15.8 ms per loop sage: %timeit G = digraphs.RandomDirectedGNP(10000,0.0001,loops=True,fast=True) 5 loops, best of 3: 558 ms per loop sage: %timeit G = digraphs.RandomDirectedGNP(10000,0.0001,loops=True) 5 loops, best of 3: 1.07 s per loop sage: %timeit G = digraphs.RandomDirectedGNP(10000,0.001,loops=True,fast=True) 5 loops, best of 3: 5.31 s per loop sage: %timeit G = digraphs.RandomDirectedGNP(10000,0.001,loops=True) 5 loops, best of 3: 1.5 s per loop
D.
PS: concerning m[i][j] versus m[i,j], is it for python or cython ? which kind of data structure ?
PS2: in general in C, m[i][j] is faster than m[i*N+j], so if you have a NxN matrix encoded in a 1D array A, it is interesting to create a B array of size N and to assign B[i]=A+i*N. Then, we have B[i,j]==A[i*N+j]. OK, I had real C course.
comment:8 Changed 11 years ago by
Status: | needs_work → needs_review |
---|
comment:9 Changed 11 years ago by
Status: | needs_review → needs_work |
---|
Hellooooooo !!!
Some running time. The fast algorithm is better for large sparse graphs.
Nice ! Could you advertise it in the doc ? Something like "the fast algorithm is only faster for *LARGE* instances (try it to know whether it is useful for you)" ?
I tried to improve it a bit because I still find weird that networkx could beat us at this game, but found nothing yet... I'll try again later :-p
This being said, I was thinking of two things. The first of the two you reported yourself some time ago : the means seem to be below what is expected...
sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = True).density()+.0 for i in range(50)]) 0.0990303030303030 sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = True).density()+.0 for i in range(50)]) 0.0985575757575757 sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = True).density()+.0 for i in range(50)]) 0.0989050505050506 sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = True).density()+.0 for i in range(50)]) 0.0993535353535353
And there is also this "detail" :-p
sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = False).density()+.0 for i in range(50)]) 0.0498989898989899 sage: mean([digraphs.RandomDirectedGNP(100,.1, fast = True).density()+.0 for i in range(50)]) 0.0987434343434344
Could you also add the file to the documentation (the files in devel/sage-yourbranch/doc/en/reference/) and make the method available (through an optional argument) in graphs.RandomGNP ? Even though it can not be the default method, as networkx is faster (why the hell should it be ? O_o
), it would be nice to have around.
And about the matrix : it was not about C arrays but about Sage matrices, and this is why the speedup was so huge (#12476). It is because a field was accessed in Python as m[x][y]
which translates as "build a copy of the x^{th row of m, and return its y}th element". A bit more than what was necessary
:-P
Nathann
comment:10 Changed 11 years ago by
Status: | needs_work → needs_review |
---|
I have addressed all your remarks. The "detail" was because I have stupidly copy/paste your (i+1 if directed else 0)
. It should be
(0 if directed else i+1)
;)
For graphs, I have decided to let the networkx method as default.
sage: %timeit networkx.gnp_random_graph(1000,0.01) 5 loops, best of 3: 1.03 s per loop sage: %timeit graphs.RandomGNP(1000,0.01,method='sage') 5 loops, best of 3: 233 ms per loop sage: %timeit graphs.RandomGNP(1000,0.01,method='networkx') 5 loops, best of 3: 66.5 ms per loop sage: %timeit networkx.gnp_random_graph(100,0.05,method='sage') 25 loops, best of 3: 10.9 ms per loop sage: %timeit graphs.RandomGNP(100,0.05,method='sage') 25 loops, best of 3: 11.4 ms per loop sage: %timeit graphs.RandomGNP(100,0.05,method='networkx') 125 loops, best of 3: 3.45 ms per loop
Concerning the average density, it is seems correct with larger sample.
sage: mean([digraphs.RandomDirectedGNP(100,.1, loops = True).density()+.0 for i in range(500)]) 0.100094400000000 sage: mean([digraphs.RandomDirectedGNP(100,.1, loops = False).density()+.0 for i in range(500)]) 0.100155555555556 sage: mean([digraphs.RandomDirectedGNP(100,.1, loops = False, fast = True).density()+.0 for i in range(500)]) 0.0989579797979798 sage: mean([digraphs.RandomDirectedGNP(100,.1, loops = True, fast = True).density()+.0 for i in range(500)]) 0.0988469999999999 sage: mean([graphs.RandomGNP(100,.1, fast = False, method='sage').density()+.0 for i in range(500)]) 0.100131313131313 sage: mean([graphs.RandomGNP(100,.1, fast = False, method='networkx').density()+.0 for i in range(500)]) 0.100190707070707 sage: mean([graphs.RandomGNP(100,.1, fast = True, method='networkx').density()+.0 for i in range(500)]) 0.100002424242424 sage: mean([graphs.RandomGNP(100,.1, fast = True, method='sage').density()+.0 for i in range(500)]) 0.104045165676905
A possible improvement for RandomDirectedGNP is to return a complete digraph (with or without loops) when p=1. However, no such function exists and it is not relevant in this patch. I let it for another patch.
D.
comment:11 Changed 11 years ago by
I have added some set_random_seed(0)
before the tests to ensure that the results are always the same. I have also removed annoying TABs and spaces.
The patch should now pass all tests (at least it is the case on my computer).
However, I don't know how to set the seed inside a cython function. This could be a nice improvement to this patch. I have checked the sage.misc.randstate
documentation, but I don't understand how to proceed. Any help is more than welcome.
D.
Changed 11 years ago by
Attachment: | trac_12362_random_gnp.patch added |
---|
comment:12 Changed 11 years ago by
Hellooooooo David !
Well, I don't think not being able to define the seed *as a parameter* of the function is so bad. As it works fine when you set it globally for the doctests anyway....
Now, here is a small patch with some modifications. It adds a few tests, check that the given method is "networkx" or "Sage" but nothing else (that's how we do it usually). I also added a "(Cython)" to the title of the new file you create. It makes more sense this way when you look at the documentation. Otherwise, you would have:
- Digraph generators
- Graph Generators
- Digraph and Graph generators
Now there is a "Cython" at the end of the third one, and there's some logic to it :-)
This being said, I added one test which does *not* pass... Here it is :
sage: from sage.graphs.graph_generators_pyx import RandomGNP sage: abs(mean([RandomGNP(150,.2, fast = True).density() for i in range(30)])-.2) < .001 False
It looks like there is some deviation with the "fast" algorithm, which does not occur with the usual one :
sage: sage: abs(mean([RandomGNP(200,.2).density() for i in range(30)])-.2) 0.000443886097152429 sage: sage: abs(mean([RandomGNP(200,.2, fast = True).density() for i in range(30)])-.2) 0.0163283739008691
Well, that's already 1% of edges too much. I don't think the deviation with the first algorithm is very bad (though for some reason it is always positive on my tests, and I do not like that), but I'm no big fan of the deviation for the second implementation The "fast" algorithm is also.... *MUCH SLOWER* on most cases... Is it better than the other one in some interesting situations ?
Oh, and I also noticed that the graphs returned did not necessarily have n vertices when p was too low. I fixed that in the patch too :-)
Nathann
Changed 11 years ago by
Attachment: | trac_12362-rev.patch added |
---|
comment:13 follow-up: 14 Changed 11 years ago by
Hello Nathann,
The fast
method is expected to be faster for *large* and *sparse* graphs, e.g.,
N=10000
and
p=0.0001
. But even for such values, the standard method is already fast. Otherwise, the standard method is sufficient.
Since I don't know how to solve the deviation problem you have pointed out (I'm not the designer of the algorithm) and that it is not so useful, I propose to simply remove it from the Cython RandomGNP function and from the RandomDirectedGNP function. For the
graphs.RandomGNP
function, the fast method of
networkx
has no deviation so we can let it.
We should also remove the tests with cputime. They are quite old and will differ from one computer to another.
If you agree with these proposition, I will prepare a rev-rev patch ;-)
D.
comment:14 Changed 11 years ago by
If you agree with these proposition, I will prepare a rev-rev patch ;-)
+1 :-)
Nathann
comment:15 follow-up: 16 Changed 11 years ago by
I have incorporated all proposed modification and updated some tests (mainly adding set_random_seed(0)).
Let me know if it is better now.
Best,
D.
comment:16 Changed 11 years ago by
Let me know if it is better now.
Well.. I see nothing wrong with it, now :-)
If you are fine with all that, you can set the ticket to "positive review". It looks good to me ! :-)
Nathann
comment:17 Changed 11 years ago by
Description: | modified (diff) |
---|
comment:18 Changed 11 years ago by
Status: | needs_review → positive_review |
---|
Thank you for your help Nathann. The gnp generator for digraphs is now reasonably fast.
Other contributors can propose a suitable implementation of the fast algorithm for digraphs.
One more patch for sage-5 ;-)
D.
comment:19 Changed 11 years ago by
Authors: | → David Coudert |
---|---|
Reviewers: | → Nathann Cohen |
comment:20 Changed 11 years ago by
Status: | positive_review → needs_work |
---|
On sage.math, this causes a doctest timeout:
The following tests failed: sage -t --verbose --long -force_lib "devel/sage/sage/graphs/generic_graph.py" # Time out Total time for all tests: 1800.5 seconds
Changed 11 years ago by
Attachment: | trac_12362-rev-rev.patch added |
---|
comment:21 Changed 11 years ago by
Status: | needs_work → needs_review |
---|
The problem was in test results in function is_isomorphic
on random digraphs.
The problem is fixed and the patch pass the tests successfully on my computer (with 5.0 beta6).
Let's hope this patch will not cause further problems.
D.
comment:22 Changed 11 years ago by
Hellooooooooooo !!!!
On my machine I was also getting a timeout when running the doctests with the -long flag. It was caused by the edge_disjoint_spanning_trees method, for which the "random digraph" returned by the new random graph generator was a bad instance.
There actually was not any problem, short of the fact that GLPK was having a hard time. The problem did not occur when CPLEX was installed which is probably why it did not happen on your computer.
The small patch I add does the following : 1) changes the value of epsilon from 1/10*n to 1/3*n 2) adds a verification at the end of edge_disjoint_spanning_tree so that we aresure the answers computed by the LP solver are correct.
Theoretically speaking, setting epsilon to 1/2*n is sufficient, but I often set the value to something smaller "just to be sure". 1/10*n seems to be a bit too low, and for this reason GLPK had a hard time finding the solutions. It would have found it eventually, but with a low value of epsilon it has more branches to explore.
So, with this small patch I am adding the value of epsilon goes to 1/3, which is till more than necessary and should not cause any problem. And "just to be sure", I also added a test at the end of the function, to ensure that the results returned are indeed correct (that each graph is connected).
It passes all long tests on my machine, is faster than before, and "if you agree with those new modifications [...]" :-)
Nathann
comment:23 Changed 11 years ago by
Description: | modified (diff) |
---|
comment:24 Changed 11 years ago by
Status: | needs_review → positive_review |
---|
I would not have been able to find this problem alone. Thanks Nathann.
With your extra patch: installation OK, short/long tests OK, docbuild OK.
I think it's now good to go...
comment:25 follow-up: 26 Changed 11 years ago by
Status: | positive_review → needs_work |
---|
oups sorry, I see some "trailing spaces" in the file :(
comment:26 Changed 11 years ago by
Status: | needs_work → needs_review |
---|
oups sorry, I see some "trailing spaces" in the file :(
Now with *less* trailing spaces in the file. Actually, without the trailing spaces from the function I touched.
The point with trailing spaces is that there shold not be any, but that whenever a file like generic_graph is massively modified (for instance if all trailing spaces were to be removed at once) the guys from Sage combinat would have to rebase they monstruous patch queue immediately. They already threatened several times to kill me if I ever did that :-P
Nathann
comment:27 Changed 11 years ago by
Status: | needs_review → positive_review |
---|
That's much better ! For me the patch is ready to do :))
D.
comment:28 Changed 11 years ago by
Status: | positive_review → needs_work |
---|
On hawk, this yields
sage -t --long -force_lib devel/sage/sage/graphs/generic_graph.py ********************************************************************** File "/export/home/buildbot/build/sage/hawk-1/hawk_full/build/sage-5.0.beta8/devel/sage-main/sage/graphs/generic_graph.py", line 5830: sage: g.flow(0,1, method="FF") == g.flow(0,1,method="LP") Expected: True Got: False **********************************************************************
This is probably a bug uncovered by this patch, rather than caused by this patch. But anyway, it should be looked at.
comment:29 Changed 11 years ago by
Status: | needs_work → needs_review |
---|
Updated ! It must be because of some numerical noise on that machine. Actually, this doctest breaks when the CBC package is installed too, for the same reason. Let it be fixed with the updated version of trac_12362-edge_disjoint_spanning_tree.patch
.
Nathann
Changed 11 years ago by
Attachment: | trac_12362-edge_disjoint_spanning_tree.patch added |
---|
comment:30 Changed 11 years ago by
Status: | needs_review → positive_review |
---|
The patch works perfectly with sage-5.0.beta7 on my computer. Tests are also OK.
compote:~/Soft/sage-5.0.beta7> ./sage -t --long -force_lib generic_graph.py sage -t --long -force_lib "devel/sage-myclone/sage/graphs/generic_graph.py" [74.6 s] ---------------------------------------------------------------------- All tests passed! Total time for all tests: 74.6 seconds
The long tests are also OK for all .py and .pyx files of sage/graphs/
.
Thanks Nathann for last update.
D.
comment:31 Changed 11 years ago by
Long tests are also OK on a fresh new install of sage-5.0.beta8.
compote:~/sage-5.0.beta8> ./sage -t --long -force_lib devel/sage-myclone/sage/graphs/generic_graph.py sage -t --long -force_lib "devel/sage-myclone/sage/graphs/generic_graph.py" [75.5 s] ---------------------------------------------------------------------- All tests passed! Total time for all tests: 75.5 seconds
comment:32 follow-up: 33 Changed 11 years ago by
Nathann,
I don't know if it is a requirement, but there is no ticket number inside files trac_12362-rev.patch and trac_12362-edge_disjoint_spanning_tree.patch
# HG changeset patch # User Nathann Cohen <nathann.cohen@gmail.com> # Date 1329999818 -3600 # Node ID 283e6da5d8af82c7d8b8ae19c16572d55ca06e08 # Parent 34d93e382b071f0f17bb633c6168c1950d901fde Improvement of GNP generators for graphs and digraphs -- Cython diff --git a/module_list.py b/module_list.py
Best, D.
comment:33 follow-up: 34 Changed 11 years ago by
I don't know if it is a requirement, but there is no ticket number inside files trac_12362-rev.patch and trac_12362-edge_disjoint_spanning_tree.patch
That's a very controversial point in Sage development :-D
If I did not miss the latest updates, the moral is : it used to be a requirement, and it is now done automatically with scripts. If you put it anyway Jeroen will remove it himself, but you spare him that if you do not add it. But these things changes faster than lightning :-)
By the way, have you ever played the role of mentor in a "Google Summer of Code" project ? Looks like two people asked to participate in one for Sage : the first about Graphs, the other with LP... I should really work on my application right now, but I still must think f that too O_o;;;
Nathann
comment:34 Changed 11 years ago by
Replying to ncohen:
it used to be a requirement, and it is now done automatically with scripts.
Exactly.
If you put it anyway Jeroen will remove it himself
That's false. If you put it anyway, you end up with a commit message like
Trac #12362: #12362: foo bar
comment:36 Changed 11 years ago by
Merged in: | → sage-5.0.beta9 |
---|---|
Resolution: | → fixed |
Status: | positive_review → closed |
comment:37 Changed 11 years ago by
This patch is bad.
Before:
sage: timeit('graphs.RandomGNP(2000, .01)') 5 loops, best of 3: 263 ms per loop
After:
sage: timeit('graphs.RandomGNP(2000, .01)') 5 loops, best of 3: 3.73 s per loop
comment:38 Changed 11 years ago by
OOch O_O;;;;
David ? You know where that comes from ? O_o;;;;
Nathann
comment:39 Changed 11 years ago by
Well, actually our own code is still faster....
sage: %timeit graphs.RandomGNP(2000,.01, method="Sage") 5 loops, best of 3: 157 ms per loop
But did something change with networkx since ? O_o
comment:40 Changed 11 years ago by
Hello,
the default method is networkx (as before this patch) and so when used with default parameters this patch has no impact on the running time. When I wrote this patch, we decided to let networkx as default algorithm because the running time was faster with new method only for large graphs.
Something has changed with networkx. It is now surprisingly slow...
sage: %timeit networkx.gnp_random_graph(2000,0.01) 5 loops, best of 3: 4.31 s per loop sage: %timeit graphs.RandomGNP(2000, .01, method='networkx') 5 loops, best of 3: 4.47 s per loop sage: %timeit Graph(networkx.gnp_random_graph(2000,0.01)) 5 loops, best of 3: 4.49 s per loop sage: %timeit graphs.RandomGNP(2000, .01, method='Sage') 5 loops, best of 3: 173 ms per loop sage: sage: %timeit graphs.RandomGNP(100, .01, method='networkx') 25 loops, best of 3: 11.2 ms per loop sage: %timeit graphs.RandomGNP(100, .01, method='Sage') 625 loops, best of 3: 659 µs per loop
If we are unable to find why networkx is now slow, we can open a ticket to put Sage as default algorithm.
For directed graphs, our implementation was always faster than networkx, and so we removed the networkx one.
comment:41 Changed 11 years ago by
Yeah, something must have changed in networkx, but the last update seems to be 1.2 a veeeeeeery long time ago. Where the hell does this come from ? O_o
comment:42 Changed 11 years ago by
Here's the code of RandomGNP in beta8
if seed is None: seed = current_randstate().long_seed() if p == 1: return graphs.CompleteGraph(n) import networkx if fast: G = networkx.fast_gnp_random_graph(n, p, seed=seed) else: G = networkx.gnp_random_graph(n, p, seed=seed) return graph.Graph(G)
and in beta14
if n < 0: raise ValueError("The number of nodes must be positive or null.") if 0.0 > p or 1.0 < p: raise ValueError("The probability p must be in [0..1].") if seed is None: seed = current_randstate().long_seed() if p == 1: return graphs.CompleteGraph(n) if method == 'networkx': import networkx if fast: G = networkx.fast_gnp_random_graph(n, p, seed=seed) else: G = networkx.gnp_random_graph(n, p, seed=seed) return graph.Graph(G) elif method == "Sage": # We use the Sage generator from sage.graphs.graph_generators_pyx import RandomGNP as sageGNP return sageGNP(n, p) else: raise ValueError("'method' must be equal to 'networkx' or to 'Sage'.")
comment:43 Changed 11 years ago by
GOT IT !!!!
The flag "fast" used to be "True", it is now "False". We set it to False because our "fast" algorithm is slower than the usual one, so for the Sage algorithm it is better to set it to False. For networkx, however, it is much better to use the "Fast" version, and so the default value has to be set to "True" again, as the default method is networkx !
Nathann
comment:44 Changed 11 years ago by
Bravo !
sage: %timeit graphs.RandomGNP(2000, .01, method='networkx', fast = True) 5 loops, best of 3: 234 ms per loop sage: %timeit graphs.RandomGNP(2000, .01, method='Sage') 5 loops, best of 3: 163 ms per loop sage: %timeit graphs.RandomGNP(2000, .001, method='networkx', fast = True) 25 loops, best of 3: 27.7 ms per loop sage: %timeit graphs.RandomGNP(2000, .001, method='Sage') 5 loops, best of 3: 50.2 ms per loop sage: %timeit graphs.RandomGNP(2000, .1, method='networkx', fast = True) 5 loops, best of 3: 2.38 s per loop sage: %timeit graphs.RandomGNP(2000, .1, method='Sage') 5 loops, best of 3: 1.33 s per loop
I will open a new ticket to correct this. Sorry.
A remaining question is which should be the default algorithm. Apparently, the Sage implementation is often faster than the networkx one, except for large sparse graphs
sage: %timeit graphs.RandomGNP(10000, 0.001, method='Sage') 5 loops, best of 3: 1.21 s per loop sage: %timeit graphs.RandomGNP(10000, 0.001, method='networkx', fast = True) 5 loops, best of 3: 612 ms per loop sage: %timeit graphs.RandomGNP(10000, 0.01, method='Sage') 5 loops, best of 3: 4.17 s per loop sage: %timeit graphs.RandomGNP(10000, 0.01, method='networkx', fast = True) 5 loops, best of 3: 6.09 s per loop
The patchs are not working yet :(
Here the modifications are not taken into account by sage -b. Something is missing but I don't know what. I'm still a beginner with sage...
Any help, advise,... is more than welcome !
D.