Opened 6 years ago

Closed 6 years ago

#18811 closed enhancement (fixed)

Boost Clustering Coefficient

Reported by: borassi Owned by:
Priority: major Milestone: sage-6.8
Component: graph theory Keywords: Local clustering coefficient, Boost
Cc: ncohen, dcoudert Merged in:
Authors: Michele Borassi Reviewers: Nathann Cohen
Report Upstream: N/A Work issues:
Branch: 582ce34 (Commits) Commit: 582ce3466d9dff303532a84f9738cdd4f581d5a9
Dependencies: 18564 Stopgaps:

Description (last modified by borassi)

Apply Boost algorithm for computing the clustering coefficient, using the interface of Ticket #18564.

Change History (25)

comment:1 Changed 6 years ago by ncohen

  • Cc ncohen added

comment:2 Changed 6 years ago by borassi

  • Branch set to u/borassi/boost_clustering_coefficient

comment:3 follow-up: Changed 6 years ago by borassi

  • Authors set to Michele Borassi
  • Cc dcoudert added
  • Commit set to 283c61be9357902ff555843b27c0ffed6fed6ea4
  • Component changed from PLEASE CHANGE to graph theory
  • Dependencies set to 18564
  • Description modified (diff)
  • Keywords Local clustering coefficient Boost added
  • Status changed from new to needs_review
  • Type changed from PLEASE CHANGE to enhancement

Hello! I have implemented the computation of the local clustering coefficient through Boost. The improvement is not as striking as the edge connectivity, but it is still a 10x improvement, more or less. Some benchmarks:

sage: g = graphs.RandomGNM(20000,100000)
sage: %timeit g.clustering_coeff(implementation='boost')
10 loops, best of 3: 258 ms per loop
sage: %timeit g.clustering_coeff(implementation='networkx')
1 loops, best of 3: 3.99 s per loop
sage: g = graphs.CompleteGraph(300)
sage: %timeit g.clustering_coeff(implementation='boost')
1 loops, best of 3: 6.14 s per loop
sage: %timeit g.clustering_coeff(implementation='networkx')
1 loops, best of 3: 1min 3s per loop
sage: g = graphs.RandomGNM(10000,1000000)
sage: %timeit g.clustering_coeff(implementation='networkx',nodes=range(30))
1 loops, best of 3: 13.1 s per loop
sage: %timeit g.clustering_coeff(implementation='boost',nodes=range(30))
1 loops, best of 3: 1.55 s per loop

I hope you like it!

comment:4 in reply to: ↑ 3 ; follow-up: Changed 6 years ago by ncohen

I hope you like it!

Ahahahah. I do! But the first thing I will do once this thing is reviewed is to see if I can beat it :-P

Nathann

comment:5 in reply to: ↑ 4 ; follow-up: Changed 6 years ago by borassi

Ahahahah. I do! But the first thing I will do once this thing is reviewed is to see if I can beat it :-P

I'm pretty sure you can: to intersect two neighbors, Boost algorithm iterates over both adjacencies in time O(d_1d_2), where d_1 and d_2 are the degrees. Using binary trees instead of vectors, you could do it in O(d_1 log d_2), or viceversa.

comment:6 in reply to: ↑ 5 Changed 6 years ago by ncohen

I'm pretty sure you can: to intersect two neighbors, Boost algorithm iterates over both adjacencies in time O(d_1d_2), where d_1 and d_2 are the degrees. Using binary trees instead of vectors, you could do it in O(d_1 log d_2), or viceversa.

Sage's timings should be approximately those of triangle_count.

comment:7 follow-up: Changed 6 years ago by ncohen

Hellooooooo Michele,

  • About boost_clustering_coeff -- to me it makes more sense if this function, meant to apply to a boost graph, does not take a Sage graph as input. You only need it to get the vertices' correct label, but you only need this at a higher level. Returning a dictionary associating the coeffcient to each (integer) vertex is not very expensive considering the computations you do, and so you can afford to relabel it higher.
  • Could you define "local clustering" somewhere? Is it any different from the 'clustering coefficient of a vertex'?
  • Could you add an INPUT block to cpdef clustering_coeff?
  • Same function -- by using a sig_check instead of sig_on/off it is impossible to interrupt the function while boost_resulting_coeff is running. It will only be interrupted when the function returns.
  • Could you rename 'cc' to 'average_clustering_coefficient'? When you need a comment to say what a variable is, it often means that you should rename the variable.
  • in clustering_average -- could you document the default behaviour of implementation=None? (you did it in clustering_coeff already).
  • You should not indent the *text* after a 'TESTS:' block (like it is done for 'INPUT:'). Usually, you only need to indent text in the documentation after a '::'.
  • Instead of weight==False, it is better to use 'not weight'. Handles better something like weight=0 (faster to type :-P)

Looks very good!

Thanks,

Nathann

comment:8 Changed 6 years ago by ncohen

  • Status changed from needs_review to needs_work

comment:9 follow-up: Changed 6 years ago by dcoudert

Hello,

I have not checked the code yet.

Nathann, I have an old code using bitsets for speeding up intersection of neighborhoods. I can update it using static dense graph and see if it is a good competitor ;) The idea is to compute the number of triangles incident to each edge, and then to compute the number of triangles incident to a given vertex.

David.

comment:10 Changed 6 years ago by git

  • Commit changed from 283c61be9357902ff555843b27c0ffed6fed6ea4 to 18bd01ab82cf0b9d064d4fe680f17df908bed0da

Branch pushed to git repo; I updated commit sha1. New commits:

18bd01aApplied Nathann's suggestions

comment:11 Changed 6 years ago by borassi

I do not change the status to "needs_review", yet, because I'm a bit tired and I'd like to check my work tomorrow morning.

comment:12 Changed 6 years ago by git

  • Commit changed from 18bd01ab82cf0b9d064d4fe680f17df908bed0da to 99bd994335f70455d898c83f1a13b3d38e8f5064

Branch pushed to git repo; I updated commit sha1. New commits:

99bd994Few corrections in the reference manual

comment:13 in reply to: ↑ 7 ; follow-up: Changed 6 years ago by borassi

  • Status changed from needs_work to needs_review

Helloooooooo!

  • About boost_clustering_coeff -- to me it makes more sense if this function,

meant to apply to a boost graph, does not take a Sage graph as input. You only need it to get the vertices' correct label, but you only need this at a higher level. Returning a dictionary associating the coeffcient to each (integer) vertex is not very expensive considering the computations you do, and so you can afford to relabel it higher.

Oops, I wrote it like this to test if it worked, and I forgot to change it. By solving this issue, I also have found problems if the vertex label is not an integer, I corrected it, and I added a doctest.

  • Could you define "local clustering" somewhere? Is it any different from the

'clustering coefficient of a vertex'?

It is exactly the same (https://en.wikipedia.org/wiki/Clustering_coefficient#Local_clustering_coefficient): I decided to remove "local" from all comments.

  • Could you add an INPUT block to cpdef clustering_coeff?

Done!

  • Same function -- by using a sig_check instead of sig_on/off it is

impossible to interrupt the function while boost_resulting_coeff is running. It will only be interrupted when the function returns.

Done! Now I use sig_on/off.

  • Could you rename 'cc' to 'average_clustering_coefficient'? When you need a

comment to say what a variable is, it often means that you should rename the variable.

Done!

  • in clustering_average -- could you document the default behaviour of

implementation=None? (you did it in clustering_coeff already).

Done

  • You should not indent the *text* after a 'TESTS:' block (like it is done for

'INPUT:'). Usually, you only need to indent text in the documentation after a '::'.

Done!

  • Instead of weight==False, it is better to use 'not weight'. Handles better

something like weight=0 (faster to type :-P)

Done!

comment:14 in reply to: ↑ 9 ; follow-up: Changed 6 years ago by ncohen

Nathann, I have an old code using bitsets for speeding up intersection of neighborhoods. I can update it using static dense graph and see if it is a good competitor ;)

Sage does have something like that, too. It would only require minor changes to what was added in #18250.

Nathann

comment:15 in reply to: ↑ 13 Changed 6 years ago by ncohen

  • Reviewers set to Nathann Cohen

Hello Michele,

Thanks for your updates. I added a commit at public/18811, and if you agree with it then you can add it to the branch and this ticket can go.

Here are the modifications I made:

  • remove trailing whitespaces
  • The function boost_clustering_coeff claimed to accept 'None' as a list of vertices, but did not
  • len(clust_of_v.values()) is equal to len(clust_of_v)
  • {g.vertices()[v]: clust_v_int[v] for v in vertices_boost} this line calls g.vertices() for each [v].
  • You can use 'tolerance' for tests like sage: abs(G.clustering_average(implementation='boost') - G.clustering_average(implementation='networkx')) < 1E-12

Have fuuuuuuuuun,

Nathann

comment:16 Changed 6 years ago by git

  • Commit changed from 99bd994335f70455d898c83f1a13b3d38e8f5064 to 6c8d49bc6d0101f897c489124d52c5e4a252a959

Branch pushed to git repo; I updated commit sha1. New commits:

40df844trac #18811: Review
6c8d49bMore reviewer comments

comment:17 follow-up: Changed 6 years ago by ncohen

Hey nononon! I had changed the code to handle the 'None' case! Do whatever you want, but please keep the code and the doc consistent.

Nathann

comment:18 in reply to: ↑ 17 ; follow-up: Changed 6 years ago by borassi

Sorry, I didn't realize you modified the code... That's why I modified the doc, I thought you forgot a commit, or something like that. In my opinion, it is better to modify the doc, for simplicity!

BTW, how can I use 'tolerance' in the other test?

sage: G = graphs.RandomGNM(10,20)
sage: clust_boost = G.clustering_coeff(implementation='boost')
sage: clust_networkx = G.clustering_coeff(implementation='networkx')
sage: for v in G.vertices():
....:     if abs(clust_boost[v] - clust_networkx[v]) > 1E-12:
....:         print "Error:"
....:         print "   Boost clustering of", v, "is", clust_boost[v]
....:         print "NetworkX clustering of", v, "is", clust_networkx[v]

Replying to ncohen:

Hey nononon! I had changed the code to handle the 'None' case! Do whatever you want, but please keep the code and the doc consistent.

Nathann

comment:19 in reply to: ↑ 18 Changed 6 years ago by ncohen

Helloooooo,

In my opinion, it is better to modify the doc, for simplicity!

No problem for me. But please update the code too, in this case.

BTW, how can I use 'tolerance' in the other test?

I think that it could work by adding the same 'tol' flag on all lnes of the loop. Not sure, you have to try ^^;

http://doc.sagemath.org/html/en/developer/coding_basics.html#special-markup-to-influence-tests

Nathann

comment:20 Changed 6 years ago by ncohen

OOOppss nononono sorry. Given that your doctest does not print anything when it works, I'd say that it is easier to understand it this way. Otherwise you would have to add lines and lines of useless '0'.

By the way, you can write 'for v in G' instead of 'for v in G.vertices()'.

Have fuuuuun,

Nathann

comment:21 in reply to: ↑ 14 Changed 6 years ago by ncohen

Sage does have something like that, too. It would only require minor changes to what was added in #18250.

Done at #18834. I will rebase it on top of this patch when it will be in positive review.

Nathann

comment:22 Changed 6 years ago by git

  • Commit changed from 6c8d49bc6d0101f897c489124d52c5e4a252a959 to 582ce3466d9dff303532a84f9738cdd4f581d5a9

Branch pushed to git repo; I updated commit sha1. New commits:

c365d02Removed "vertices is None" from boost_clustering_coeff
582ce34Changed for v in G.vertices() => for v in G

comment:23 Changed 6 years ago by borassi

For me, now the patch is good to go!

comment:24 Changed 6 years ago by ncohen

  • Status changed from needs_review to positive_review

Okayyyy, then let it go.

Nathann

comment:25 Changed 6 years ago by vbraun

  • Branch changed from u/borassi/boost_clustering_coefficient to 582ce3466d9dff303532a84f9738cdd4f581d5a9
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.