Opened 8 years ago
Closed 8 years ago
#13808 closed enhancement (fixed)
Gromov hyperbolicity of graphs
Reported by: | dcoudert | Owned by: | jason, ncohen, rlm |
---|---|---|---|
Priority: | major | Milestone: | sage-5.6 |
Component: | graph theory | Keywords: | graph, hyperbolicity |
Cc: | ncohen | Merged in: | sage-5.6.rc0 |
Authors: | David Coudert | Reviewers: | Nathann Cohen |
Report Upstream: | N/A | Work issues: | |
Branch: | Commit: | ||
Dependencies: | #13821 | Stopgaps: |
Description (last modified by )
This patch implements algorithms for computing the Gromov hyperbolicity of a graph. Although the worst case time complexity is in O(n^4)
, the proposed implementation performs quite well in practice. It has been used on graphs with more than 10.000 vertices.
APPLY:
Attachments (5)
Change History (33)
comment:1 Changed 8 years ago by
- Cc ncohen added
- Keywords graph hyperbolicity added
- Status changed from new to needs_review
comment:2 follow-up: ↓ 3 Changed 8 years ago by
- Status changed from needs_review to needs_work
Here is my current patch, and two comments :
- It would be nice if
sage -coverage hyperbolicity.pyx
returned "100% coverage". I hate to say it, because it often means adding useless tests just to fit to the metrics
-_-
- What is the use of _reduce_biconnected_graph compared to goodcores ? The docstring says that this function, by removing degree 2 vertices whose neighborhood is an edge, would reduce a complete bipartite graph to an edge... But no vertex of a complete bipartite graph is adjacent to two adjacent vertices (a triangle). I do not see how (as the docstring says) this function does anything different from the good core decomposition method above.
It would also be better to set this ticket to need_work
until your have found out where the bug you found comes from.
Tell me what you think of the changes I made ! They should make the html doc nicer !
Nathann
comment:3 in reply to: ↑ 2 ; follow-up: ↓ 4 Changed 8 years ago by
Replying to ncohen:
Here is my current patch, and two comments :
- It would be nice if
sage -coverage hyperbolicity.pyx
returned "100% coverage". I hate to say it, because it often means adding useless tests just to fit to the metrics
-_-
OK, I will work on this.
- What is the use of _reduce_biconnected_graph compared to goodcores ?
Useless, but since it is very fast (linear) it can be used for all methods, while goodcores is only for cuts+.
The docstring says that this function, by removing degree 2 vertices whose neighborhood is an edge, would reduce a complete bipartite graph to an edge... But no vertex of a complete bipartite graph is adjacent to two adjacent vertices (a triangle). I do not see how (as the docstring says) this function does anything different from the good core decomposition method above.
Oups. We must add an edge to get many triangles.
It would also be better to set this ticket to
need_work
until your have found out where the bug you found comes from.
Sure.
Tell me what you think of the changes I made ! They should make the html doc nicer !
That's perfect. Thanks.
Nathann
comment:4 in reply to: ↑ 3 Changed 8 years ago by
Useless, but since it is very fast (linear) it can be used for all methods, while goodcores is only for cuts+.
Oh. But what is the difference between this function and the other one, what is it that _reduce_biconnected_graph
can do and that the core decomposition can't ? Is their running time very different for the same purposes ? Why do you need two functions ?
Nathann
comment:5 Changed 8 years ago by
- Description modified (diff)
- Status changed from needs_work to needs_review
Hello,
I have fully revised the patch. I have merged both patches into a new one to simplify the process. I have removed the _reduce_biconnected_graph
and good_core_decomposition
methods and added a new one: elimination_buckets
, where a bucket is understood as a set of vertices that can be eliminated as soon as the lower equals the index of the bucket. So far, this new function only identifies vertices that can be eliminated as soon as we know that the lower bound is 1. The code contains all the machinery to handle larger values, but I don't have good heuristic/algorithm ready to identify vertices that can be eliminated for 2 (only slow and not efficient methods). I propose to let this part for a future patch. This one is already huge...
The doc coverage is now 100%.
Should be better now.
comment:6 Changed 8 years ago by
Apply trac_13808-v2.patch
comment:7 Changed 8 years ago by
- Status changed from needs_review to needs_work
Helloooooooo David !!
Well, I have been working of this patch for the last hour (at least), and I was not able to produce a patch for some of the changes I think would be good for I still do not understand the code well enough, and I only end up producing segfaults...
I think that it would be nice to define the following C struct, for we are in the wonderful world of C where creating structs is totally free (unlike Java :-P
)
# Defining a pair of vertices as a C struct ctypedef struct pair: int i int j
This way, you do not have wird +2
increments anymore, and you can call the invert_cells
function only once (to exchange pair structs, not integers).
I wanted to add the following documentation
# We organize the pairs of vertices by length in an array of pair structs # # - path_of_length[i] point to the list of nb_path_of_length[i] pairs of # vertices at distance i # # - cpt_path[i] counts the number of pairs have already been added to # path_of_length[i] # # - all_pairs is a memory block of binomial(n,2) pairs of # vertices. path_of_length[i] points toward a memory area contained in - # all_pairs. #
to replace your
# We organize the path by length in an array using 2 cells per path.
though it only makes sense in my segfaulting version of hyperbolicity.pyx. If you agree with this change from integers toward pairs, then this doc could be useful.
Other comments :
- I do not see the point of using uint64_t instead of integers. I mean, if we all agree that the code is not supposed to run on 16-bits machines
:-P
The integers you need are -- in the worst case -- of ordern^2
, which may reach the 32bits limit only if the graph has size around 40k vertices. The rest of the time, requiring the integers to be 64bits just means moving around blocks of 0, and will be slower than int on 32 bits machines. Except if I missed something somewhereO_o
Plus the code would be muuuuuuuuch easier to read without that:-P
- You say several times that you compute the number of paths of each length. This scared me for you did this with only +1 increments, and the number of paths of each length is possibly... HUGE (of order
n!
in a complete graph). What you compute is actually the number of pairs of distance l for each l. Some names of variables need to be updated. - elimination_bucket : hmmm... Well, this method is written so that it is able to deal with code that.... it does not contain :-P
- Then, we take a vertex ``u``. The diameter in ``G`` of the subgraph of its neighbors is at most two. To ensure a valid tree-decomposition, we select ``u`` if and only if the bag formed by its neighbors is either included in another bag, or includes another bag. We repeat until no more vertex can be selected. This step is heuristic and the result depends on the order in which vertices are considered. NOT IMPLEMENTED YET.
And it returns a list of things, in which only the first element will have any use (the one corresponding to vertices whose neighborhood is a clique) for the other ones will be empty. And themax_diameter
is completely useless, ... - Oh, and you can use
sage_calloc
to allocate and set to zero at the same time.
I attach a small patch that deals with unimportant things.
Nathann
comment:8 Changed 8 years ago by
Oh, and I forgot something ! You define in your graph a tmp variable which is totally useless. You can remove it, and define it only inside of _invert_cells
!
Nathann
comment:9 follow-up: ↓ 10 Changed 8 years ago by
Hello,
- Yes we can define a structure to store extremities of paths, but I'm not so sure it is totally free. I agree it could be easier to read.
- I agree with proposed improvement of documentation
- Concerning the uint64_t. I did some tests with very large graphs, so it was useful at least for me (I also used that type to count number of visited 4-tuples although it is not part of this patch). Furthermore, I'm not so sure we can use int instead, and clearly unsigned int or unsigned long int would be harder to read that uint64_t.
- Number of paths: you are right. The proposed modification clarifies the doc.
- elimination_bucket: The patch is ready to host any heuristic/algo able to identify vertices that could be eliminated as soon as the lower bound is 2 or 3 or more. However, as discussed by mail, I don't have efficient ways to identify such vertices yet. Since the extra cost is negligeable, I suggest to let it as it (including comments).
- sage_calloc: was not aware of it. is it safe? stable? can we safely use it if we use structures instead of arrays of numbers?
- extra tmp variable: since the function _invert_cells is inlined, it means each execution would have to instantiate a new variable. I'm not sure it is better/faster than current implementation.
- the a.patch is OK.
D.
comment:10 in reply to: ↑ 9 ; follow-up: ↓ 11 Changed 8 years ago by
Hellooooooooooooooooooo !!
A better day begins, with music. Music solves everything :-P
- Yes we can define a structure to store extremities of paths, but I'm not so sure it is totally free. I agree it could be easier to read.
Oh. C struct ? Yeah, they are free. Especially when their size is 2*sizeof(int) :-P
- I agree with proposed improvement of documentation
- Concerning the uint64_t. I did some tests with very large graphs, so it was useful at least for me (I also used that type to count number of visited 4-tuples although it is not part of this patch).
What ? O_o
Do you mean that in some situation regular int were not enough ? I really wonder where O_O;;;
Furthermore, I'm not so sure we can use int instead, and clearly unsigned int or unsigned long int would be harder to read that uint64_t.
Well, it seems to me that you can easily use regular 32 bits int without any proble, so that coding this with int should work. You count vertices (less than 2^16=65536
, and even less than 2^15=32768
. Anyway the distance_all_pairs code is meant to work with unished short, if I make no mistake, so it would not work if you have more than 2^16-1
vertices. Then, you count pair of vertices, and this is not larger than (2^16)^2 = 2^32
, so what's the problem ? At some point in your code there is a weird "if" with a comment reading "we work on unisnged int", and I did not understand it so I cannot tell right now that this would not be a problem either if 64-bits int are replaced by 32bits int.
- Number of paths: you are right. The proposed modification clarifies the doc.
- elimination_bucket: The patch is ready to host any heuristic/algo able to identify vertices that could be eliminated as soon as the lower bound is 2 or 3 or more. However, as discussed by mail, I don't have efficient ways to identify such vertices yet. Since the extra cost is negligeable, I suggest to let it as it (including comments).
Well, the extra computing time is negligible, but it makes it much harder to understand when you try to read the code. And. Well, it does nothing ^^;
Could you keep a copy of this method instead, to be used when you will have heuristics to plug there, and in the meantime keep Sage's code optimal with what is implemented and not what we plan to implement eventually ? :-P
- sage_calloc: was not aware of it. is it safe? stable? can we safely use it if we use structures instead of arrays of numbers?
Oh. Well, calloc is a C function, so I guess that it is just wrapped by sage, the same way malloc and free are wrapped. http://linux.die.net/man/3/calloc
- extra tmp variable: since the function _invert_cells is inlined, it means each execution would have to instantiate a new variable. I'm not sure it is better/faster than current implementation.
Here is how the function is writted right now :
cdef inline _invert_cells(uint64_t * tab, uint64_t idxa, uint64_t idxb, uint64_t tmp): tmp = tab[idxa] tab[idxa] = tab[idxb] tab[idxb] = tmp
Let us assume for a start that the function is *NOT* inlined. In this case, the value of "tmp" that the function received as input is totally useless : the function creates a new local variable named tmp, and assigns to it the value of tmp given as input. Well, that is useless for a new local variable is created.
And if the function is inlined, a variable will be created, well then there is no problem for a variable will be created inside of the function that call it to do exactly what you are doing.
But. Well, how much do you want to bet that it is totally impossible to detect the difference between the two codes, even if _invert_cells called malloc(sizeof(int)) manually each time this function is called ? :-P
- the a.patch is OK.
Nathann
comment:11 in reply to: ↑ 10 ; follow-up: ↓ 12 Changed 8 years ago by
- Concerning the uint64_t. I did some tests with very large graphs, so it was useful at least for me (I also used that type to count number of visited 4-tuples although it is not part of this patch).
What ?
O_o
Do you mean that in some situation regular int were not enough ? I really wonder whereO_O;;;
On some computers, unsigned long int are on 32 bits. When you count the number of visited 4-tuples, you can exceed 2^32-1
. It happens when working on biconnected graphs with more than 20000 vertices (kind of graphs I'm currently considering).
For the rest of the code, I will (try to) simplify it asap.
D.
comment:12 in reply to: ↑ 11 Changed 8 years ago by
On some computers, unsigned long int are on 32 bits. When you count the number of visited 4-tuples, you can exceed
2^32-1
. It happens when working on biconnected graphs with more than 20000 vertices (kind of graphs I'm currently considering).
Oh, right. n^4
can go far beyond 32bits integers indeed !
Nathann
comment:13 Changed 8 years ago by
- Description modified (diff)
- Reviewers set to Nathann Cohen
- Status changed from needs_work to needs_review
I have implemented all proposed changes:
- use of a struct for pairs
- use of sage_calloc when appropriate
- use uint32_t instead of uint64_t when appropriate. These types are shorter to write/read. Also uint64_t is needed for distributions since number could potentially be huge (more theoretical than practical due to huge computation time)
- improve ordering for visiting pairs
- simplify pre-processing and corresponding data structure. It now only considers simplicial vertices.
I have merged the a.patch into the new version of the patch: trac_13808-v3.patch. The patch is much better now, the doc is nicer, and it is a bit faster.
comment:14 Changed 8 years ago by
don't know why the patchbot applies v2+a.patch+v3. How to change this ?
comment:15 Changed 8 years ago by
Apply trac_13808-v3.patch
comment:16 Changed 8 years ago by
Helloooooo again !!
It took quiiiiiiiiite a while, but I think it's done ! Here is a small patch to attach on top of yours if you agree with it. One mistake (1->2), some comments, and memory checks Btw, if you find a way to make memory checks more compact, I'd be glad to learn it. Thought about it for a while, didn't find anything.
You can set this patch to positive review if everything is fine for you !
Nathann
Changed 8 years ago by
comment:17 Changed 8 years ago by
- Description modified (diff)
Thanks for the revision patch. I did some more improvements in the rev2 patch to prevent others bugs. In particular, I found that the c_distances_all_pairs
function returns NULL when the memory space is too small. I have added dedicated test, but may be we should patch the distance all pairs function.
Q: should we add a message to MemoryError? like "Not enough memory to execute the function.". We cannot doctest this because this is machine dependent.
Let me know.
comment:18 Changed 8 years ago by
- Status changed from needs_review to needs_info
Helloooooooooo !!
Why isn't your hyp==2
a hyp >= 2
? Once this is clear, you can set the ticket to positive_review
. I hope that your next patches will be a tad smaller though :-P
Nathann
comment:19 Changed 8 years ago by
About c_distances_all_pairs
: well, it's up to you actually. Exceptions could be nice too, as I do not expect this to cost much time where c_distances_all_pairs
is called. But in another patch ! ;-)
Nathann
Changed 8 years ago by
comment:20 Changed 8 years ago by
- Status changed from needs_info to needs_review
Hello Nathann,
I have replaced the hyp==2 with hyp>=2 and clarified the text.
Thanks for the review. The patch has been greatly improved.
David.
comment:21 Changed 8 years ago by
- Status changed from needs_review to positive_review
I set the patch to positive_review as you have suggested.
Thanks.
comment:22 Changed 8 years ago by
- Dependencies set to #13821
- Status changed from positive_review to needs_work
This needs to be fixed w.r.t. #13821:
sage -t -force_lib devel/sage/sage/graphs/hyperbolicity.pyx ********************************************************************** File "/release/merger/sage-5.6.beta3/devel/sage-main/sage/graphs/hyperbolicity.pyx", line 351: sage: elimination_ordering_of_simplicial_vertices(G) Expected: [2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 11] Got: doctest:1: DeprecationWarning: Use Combinations(mset,k) instead. See http://trac.sagemath.org/13821 for details. [2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 11] ********************************************************************** File "/release/merger/sage-5.6.beta3/devel/sage-main/sage/graphs/hyperbolicity.pyx", line 806: sage: for i in xrange(10): # long test G = graphs.RandomBarabasiAlbert(Integer(100),Integer(2)) d1,_,_ = hyperbolicity(G,algorithm='basic') d2,_,_ = hyperbolicity(G,algorithm='cuts') d3,_,_ = hyperbolicity(G,algorithm='cuts+') l3,_,u3 = hyperbolicity(G,approximation_factor=Integer(2)) if d1!=d2 or d1!=d3 or l3>d1 or u3<d1: print "That's not good!" Expected nothing Got: doctest:5: DeprecationWarning: Use Combinations(mset,k) instead. See http://trac.sagemath.org/13821 for details. **********************************************************************
comment:23 Changed 8 years ago by
Also,
# long test
should be
# long time
Changed 8 years ago by
comment:24 Changed 8 years ago by
- Description modified (diff)
- Status changed from needs_work to needs_review
I was not aware of this patch.
The dependency is fixed, as well as the #long time. I hope it is OK now.
patchbot apply trac_13808-v3.patch trac_13808-rev.patch trac_13808-rev2.patch trac_13808-combinations.patch
Changed 8 years ago by
comment:25 Changed 8 years ago by
- Description modified (diff)
Minor edit to get rid of trailing white space. Should be better for the patchbot.
patchbot apply trac_13808-v3.patch trac_13808-revb.patch trac_13808-rev2.patch trac_13808-combinations.patch
comment:26 Changed 8 years ago by
- Status changed from needs_review to positive_review
Passed all long tests modulo the doctest problem in beta2. Good to go again !
Nathann
comment:27 Changed 8 years ago by
Thanks (again) !
comment:28 Changed 8 years ago by
- Merged in set to sage-5.6.rc0
- Resolution set to fixed
- Status changed from positive_review to closed
I did my best to put comments inside the patch and sufficient documentation. I hope one will be able to review this patch ;-)
Thanks.