Opened 4 years ago
Closed 4 years ago
#19031 closed enhancement (fixed)
New Algorithm for TopK Closeness Centralities
Reported by:  borassi  Owned by:  

Priority:  major  Milestone:  sage-6.9 
Component:  graph theory  Keywords:  Closeness Centrality, topk 
Cc:  ncohen, dcoudert  Merged in:  
Authors:  Michele Borassi  Reviewers:  David Coudert 
Report Upstream:  N/A  Work issues:  
Branch:  1367f00 (Commits)  Commit:  1367f0076288c84403c7b962dc2b5e53fa0c170a 
Dependencies:  #19007,#19014  Stopgaps: 
Description
Implement the algorithm in http://arxiv.org/abs/1507.01490 for computing the k most central vertices according to closeness centrality.
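For orientation, the quantity being ranked can be sketched with a naive baseline (this is not the ticket's pruned algorithm): one full BFS per vertex, which costs O(n*m); the point of the paper is to cut most of these BFSes short. A minimal Python sketch, assuming a connected undirected graph given as an adjacency dict:

```python
from collections import deque

def closeness_top_k_naive(adj, k):
    """Top-k closeness on a connected undirected graph (adjacency dict).
    Runs one full BFS per vertex, which the ticket's algorithm avoids by
    pruning BFSes that cannot beat the current k-th best value."""
    n = len(adj)
    scores = []
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:                     # plain BFS from s
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        # closeness(s) = (n - 1) / (sum of distances from s)
        scores.append(((n - 1) / sum(dist.values()), s))
    return sorted(scores, reverse=True)[:k]

# Path graph 0-1-2-3: the two inner vertices are the most central
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(closeness_top_k_naive(path4, 2))  # [(0.75, 2), (0.75, 1)]
```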
Change History (27)
comment:1 Changed 4 years ago by
 Cc ncohen dcoudert added
 Component changed from PLEASE CHANGE to graph theory
 Dependencies set to #19007,#19014
 Description modified (diff)
 Keywords Closeness Centrality topk added
 Type changed from PLEASE CHANGE to enhancement
comment:2 Changed 4 years ago by
 Branch set to u/borassi/new_algorithm_for_top_k_closeness_centralities
comment:3 Changed 4 years ago by
 Commit set to 09d67a9d858c8891c20bf628c539b9c8250e424d
comment:4 Changed 4 years ago by
 Commit changed from 09d67a9d858c8891c20bf628c539b9c8250e424d to a83cdf0e44dfc33e2b0215da7a926bd13dd1bc5c
comment:5 Changed 4 years ago by
 Status changed from new to needs_review
Hello!
This is the first draft of this new closeness centrality algorithm. It finishes in reasonable time even on very big graphs, with millions of nodes and billions of edges. The new algorithm is in the last commit (all previous commits simply merge with dependent tickets).
See you,
Michele
comment:6 followup: ↓ 7 Changed 4 years ago by
A first round of comments.
In method _estimate_reachable_vertices_dir:
- "The lower bound is stored in reachableL[v]" -> "The lower bound is stored in reachL[v]"
- Although the method will not be tested, could you please add an INPUT section specifying the type and expected size of arrays. You could also briefly document the algorithm implemented in the method.
- Please improve the presentation of the block with all variable declarations. It is hard to read since it contains lots of important calls to various methods. For instance, avoid cdef int i, nscc = tarjan_strongly_connected_components_C and prefer cdef int nscc = tarjan_strongly_connected_components_C
- What's the need for using type uint64_t instead of int? The corresponding variables are later cast to int.
- You could use maxscc = max(scc_sizes[i] for i in range(nscc))
For method _compute_reachable_vertices_undir, you should add some documentation at the beginning of the method.
Method counting sort: no comment, but in the future we should certainly create a method somewhere.
In method def centrality_closeness_top_k(G, int k=1, int verb=0):
- verb -> verbose
- The tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing *one* randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1)/2) edges is sufficient.
- We can do that for x in sorted_vert[:n]: when sorted_vert is an array of ints ???
- if directed: gamma = 0 else: gamma = 1 -> gamma = 0 if directed else 1
- When stopped == True you don't quit the loop for j in range(layer_current_beginning, layer_current_end):. Maybe you should replace this loop with a while.
- You set gamma = 0. So you need gamma==1 only once for undirected graphs. Correct?
David.
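For reference, the counting-sort idea mentioned above can be sketched in plain Python (a generic sketch, not the ticket's exact routine — the variable names here are illustrative):

```python
def counting_sort(values, max_value):
    """Sort non-negative ints bounded by max_value in O(n + max_value)
    time, without comparisons: histogram the input, then emit each value
    as many times as it was seen."""
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1              # histogram of the input
    out = []
    for v, c in enumerate(counts):
        out.extend([v] * c)         # emit v exactly c times
    return out

print(counting_sort([3, 1, 4, 1, 5], 5))  # [1, 1, 3, 4, 5]
```

This only beats a comparison sort when max_value is of the same order as the number of items, e.g. when keys are vertex degrees or BFS layer indices bounded by n.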
comment:7 in reply to: ↑ 6 ; followup: ↓ 9 Changed 4 years ago by
Replying to dcoudert:
A first round of comments.
In method _estimate_reachable_vertices_dir:
"The lower bound is stored in reachableL[v]" -> "The lower bound is stored in reachL[v]"
Done!
Although the method will not be tested, could you please add an INPUT section specifying the type and expected size of arrays. You could also briefly document the algorithm implemented in the method.
Done!
Please improve the presentation of the block with all variable declarations. It is hard to read since it contains lots of important calls to various methods. For instance, avoid cdef int i, nscc = tarjan_strongly_connected_components_C and prefer cdef int nscc = tarjan_strongly_connected_components_C
Done!
What's the need for using type uint64_t instead of int? The corresponding variables are later cast to int.
Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step whether this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that the values no longer overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right about reachL, and I modified it.
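The overflow argument is easy to check with concrete numbers (an illustration with a hypothetical graph size, not code from the ticket):

```python
# Why uint64_t: intermediate values in reachU_scc are bounded by g.n^2,
# which exceeds a 32-bit int long before g.n itself does.
INT32_MAX = 2**31 - 1
n = 10**6                    # hypothetical g.n: "millions of nodes"
assert n <= INT32_MAX        # g.n itself fits in a 32-bit int
assert n * n > INT32_MAX     # but values up to g.n^2 do not...
assert n * n <= 2**64 - 1    # ...while they always fit in uint64_t
# Only the clamped value is cast back to int, mirroring
# reachU[i] = <int> min(reachU_scc[scc[i]], g.n):
reach_u = min(n * n, n)
assert reach_u == n          # after min(), the value is at most g.n
```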
You could use maxscc = max(scc_sizes[i] for i in range(nscc))
I don't think so: maxscc is not the maximum size of an SCC, but the index of the SCC of maximum size. In other words, your instruction outputs scc_sizes[maxscc], not maxscc. Am I missing something?
For method
_compute_reachable_vertices_undir
, you should add some documentation at the beginning of the method.
Done
Method counting sort: no comment, but in the future we should certainly create a method somewhere.
Agree!
In method def centrality_closeness_top_k(G, int k=1, int verb=0):
verb -> verbose
Done!
The tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing *one* randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1)/2) edges is sufficient.
Done! I left it because I had to test it 100,000 times to find the last bug...
We can do that for x in sorted_vert[:n]: when sorted_vert is an array of ints ???
Yep: http://docs.cython.org/src/reference/language_basics.html
if directed: gamma = 0 else: gamma = 1 -> gamma = 0 if directed else 1
Done!
When stopped == True you don't quit the loop for j in range(layer_current_beginning, layer_current_end):. Maybe you should replace this loop with a while.
I added an 'if' at the end of the loop.
You set gamma = 0. So you need gamma==1 only once for undirected graphs. Correct?
Hmm, correct, but I realized that I could do a little better by slightly changing the implementation. So this does not apply anymore.
comment:8 Changed 4 years ago by
 Commit changed from a83cdf0e44dfc33e2b0215da7a926bd13dd1bc5c to 2c738015b029f7c93017607a9a26bb5617ab683f
comment:9 in reply to: ↑ 7 ; followup: ↓ 11 Changed 4 years ago by
What's the need for using type uint64_t instead of int? The corresponding variables are later cast to int.
Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step if this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that now the values do not overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right with reachL, and I modified it.
Is it mentioned somewhere?
You could use maxscc = max(scc_sizes[i] for i in range(nscc))
I don't think so, maxscc is not the maximum size of an SCC, but the index of the SCC of maximum size. In other words, your instruction outputs scc_sizes[maxscc], not maxscc. Am I missing something?
Right, it should be maxscc = max((scc_sizes[i], i) for i in range(nscc))[1], but this is ugly.
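A tidier argmax than the (value, index) tuple trick, for what it's worth (plain-Python sketch; scc_sizes here is a hypothetical list, not the ticket's C array):

```python
# Argmax over SCC sizes: we want the index of the largest component,
# not its size.
scc_sizes = [3, 7, 2, 7]       # hypothetical sizes of four SCCs
maxscc = max(range(len(scc_sizes)), key=scc_sizes.__getitem__)
assert maxscc == 1             # first index attaining the maximum
assert scc_sizes[maxscc] == 7  # the maximum size itself, if needed
```

In Cython over a raw C array a plain for loop is likely clearer still, which may be why the explicit loop was kept.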
The tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing *one* randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1)/2) edges is sufficient.
Done! I left it because I had to test it 100,000 times to find the last bug...
That's a lot!
However, I don't like the n = random.randint(2,20). You could safely let n=20, no?
We can do that for x in sorted_vert[:n]: when sorted_vert is an array of ints ???
Yep: http://docs.cython.org/src/reference/language_basics.html
That's funny.
Can we always assume that farness[v] > 0? This is for the return sorted([(1.0/farness[v], ....
Instead of G.vertices()[v], you should add a line like cdef list V = G.vertices() before the return statement and then use V[v].
About the behavior when k>=n. What you propose is in fact to return the centrality closeness of every vertex (as a list of couples (centrality, vertex)).
- Is your method faster than centrality_closeness in this case?
- This behavior should be documented for the parameter k.
That's all for now ;)
David.
comment:10 Changed 4 years ago by
 Commit changed from 2c738015b029f7c93017607a9a26bb5617ab683f to 06d2c00fbfdf6f0166834b68e5e9b0929c5abb89
Branch pushed to git repo; I updated commit sha1. New commits:
06d2c00  Other reviewer's suggestions

comment:11 in reply to: ↑ 9 ; followup: ↓ 13 Changed 4 years ago by
What's the need for using type uint64_t instead of int? The corresponding variables are later cast to int.
Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step if this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that now the values do not overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right with reachL, and I modified it.
Is it mentioned somewhere?
I added two inline comments to explain it.
The tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing *one* randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1)/2) edges is sufficient.
Done! I left it because I had to test it 100,000 times to find the last bug...
That's a lot! However, I don't like the n = random.randint(2,20). You could safely let n=20, no?
Done!
Can we always assume that farness[v]>0? This is for the return sorted([(1.0/farness[v],....
Yep.
The only instructions that modify the farness are farness[x] = n (if a visit is stopped) and farness[x] = ((<double> f) * (n-1)) / (<double>(nd-1) * (nd-1)), where f is a (positive) sum of distances, and nd-1 is positive because nd is the number of vertices reachable from x. If nd==1, the outdegree of x is 0 and we skip the analysis of x, so this case never occurs. Then n >= nd, and hence n-1 is also positive.
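The positivity argument can be checked on a toy case. Below is a plain-Python restatement of the formula above (not the Cython code; the argument names are illustrative):

```python
def farness(dists_from_x, n):
    """farness[x] = f * (n-1) / ((nd-1)^2), where f is the sum of
    distances from x and nd the number of vertices reachable from x
    (including x itself). dists_from_x lists the positive distances to
    the other reachable vertices; n is the total number of vertices."""
    nd = len(dists_from_x) + 1
    if nd == 1:                 # outdegree 0: the algorithm skips x
        return None
    f = sum(dists_from_x)       # positive sum of distances
    return (f * (n - 1)) / ((nd - 1) * (nd - 1))

# Endpoint of the path graph on 4 vertices: distances 1, 2, 3
assert farness([1, 2, 3], 4) == 2.0   # positive, so 1.0/farness is safe
```

With f > 0, n-1 > 0 and nd-1 > 0, the result is strictly positive, so the return sorted([(1.0/farness[v], ... never divides by zero.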
Instead of G.vertices()[v], you should add a line like cdef list V = G.vertices() before the return statement and then use V[v].
Done!
About the behavior when
k>=n
. What you propose is in fact to return the centrality closeness (as a list of couples (centrality,vertex) ).
 Is your method faster than
centrality_closeness
in this case?
No, it should be the same (actually, a bit slower because we make some computations during the BFSes).
 This behavior should be documented for the parameter
k
Done!
comment:12 Changed 4 years ago by
 Commit changed from 06d2c00fbfdf6f0166834b68e5e9b0929c5abb89 to 84031c3875f0ad14db2d984c43373af32bdc704f
Branch pushed to git repo; I updated commit sha1. New commits:
84031c3  Added documentation for k

comment:13 in reply to: ↑ 11 Changed 4 years ago by
About the behavior when
k>=n
. What you propose is in fact to return the centrality closeness (as a list of couples (centrality,vertex) ).
 Is your method faster than
centrality_closeness
in this case?No, it should be the same (actually, a bit slower because we make some computations during the BFSes).
So you should add a test if k>=n at the beginning of the method and, if so, call the other method and format the result. You could also add a test for n==0 and/or n==1, where the result is obvious.
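The suggested guard could take roughly this shape (an illustrative sketch with hypothetical names; the real method works on a Sage graph and its exact centrality_closeness method):

```python
def centrality_closeness_top_k_sketch(closeness_exact, n, k=1):
    """closeness_exact: callable returning {vertex: closeness} for the
    whole graph; n: number of vertices. Only the guard is sketched."""
    if n == 0:
        return []                          # obvious case
    if k >= n:
        # Top-k machinery gains nothing here: fall back to the exact
        # method and format its output as (centrality, vertex) couples.
        c = closeness_exact()
        return sorted(((c[v], v) for v in c), reverse=True)
    raise NotImplementedError("run the pruned top-k algorithm here")

# Toy check: a 3-vertex "graph" with known closeness values
exact = lambda: {0: 0.5, 1: 1.0, 2: 0.5}
result = centrality_closeness_top_k_sketch(exact, 3, k=5)
assert result == [(1.0, 1), (0.5, 2), (0.5, 0)]
```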
comment:14 Changed 4 years ago by
 Commit changed from 84031c3875f0ad14db2d984c43373af32bdc704f to 5c1905f8e3affd515bf52b45162afb0a3c5d6cac
Branch pushed to git repo; I updated commit sha1. This was a forced push. Last 10 new commits:
0f855c6  Reviewer's suggestions

1ec3f44  Merged with beta2

cbe780c  Reviewer's suggestions and small improvements

4174764  Reviewer's suggestions

0c19024  Use check_allocarray

9167071  Used limits instead of stdint to find maximum int value

83b8057  Small corrections

1551c6f  Undo commit moving closeness_centrality to centrality.pyx. Corrected links.

fd0580a  Merged with #19007

5c1905f  New algorithm for closeness centrality

comment:15 Changed 4 years ago by
 Commit changed from 5c1905f8e3affd515bf52b45162afb0a3c5d6cac to ced875f0b1fba123a0921d5937f894cc8fe41b1e
Branch pushed to git repo; I updated commit sha1. New commits:
ced875f  Reviewer's suggestions / added a reference

comment:16 Changed 4 years ago by
Helloooooooo!
After we moved centrality_closeness back to generic_graph, I had to rebase this ticket. The new algorithm is still in file centrality.pyx, but it is not available from generic_graph (that is, g.centrality_closeness_top_k() does not work anymore). However, I don't think this is a serious issue, because this algorithm is used in very specialized cases, and the standard Sage user does not need it, in my opinion.
Cheers,
Michele
comment:17 Changed 4 years ago by
 Reviewers set to David Coudert
 Status changed from needs_review to positive_review
It is not such a big deal to do
sage: from sage.graphs.centrality import centrality_closeness_top_k
sage: centrality_closeness_top_k(graphs.PetersenGraph(), 3)
[(0.6, 2), (0.6, 1), (0.6, 0)]
and it's documented. Other methods such as hyperbolicity, pathwidth, etc. must be imported for similar reasons. As soon as we will find a reasonable solution for lazy import, we may add it in another ticket.
I set this patch to positive review.
David.
comment:18 Changed 4 years ago by
 Status changed from positive_review to needs_work
On 32-bit Linux:
sage -t --long src/sage/graphs/centrality.pyx
**********************************************************************
File "src/sage/graphs/centrality.pyx", line 588, in sage.graphs.centrality.centrality_closeness_top_k
Failed example:
    centrality_closeness_top_k(g, 5, 1)
Expected:
    Final performance ratio: 0.766666666667
    [(0.36, 5), (0.36, 4), (0.3333333333333333, 6), (0.3333333333333333, 3), (0.29032258064516125, 2)]
Got:
    Final performance ratio: 0.805555555556
    [(0.36, 5), (0.36, 4), (0.3333333333333333, 6), (0.3333333333333333, 3), (0.29032258064516125, 7)]
**********************************************************************
1 item had failures:
    1 of 40 in sage.graphs.centrality.centrality_closeness_top_k
    [51 tests, 1 failure, 0.16 s]
comment:19 Changed 4 years ago by
Michele,
in this test, nodes 2 and 7 are both correct answers since we have:
- on a 64-bit machine (i.e. my laptop):
sage: from sage.graphs.centrality import centrality_closeness_top_k
sage: g = graphs.PathGraph(10)
sage: centrality_closeness_top_k(g, 5, 1)
Final performance ratio: 0.766666666667
[(0.36, 5), (0.36, 4), (0.3333333333333333, 6), (0.3333333333333333, 3), (0.29032258064516125, 2)]
sage: g.centrality_closeness()
{0: 0.2, 1: 0.24324324324324326, 2: 0.2903225806451613, 3: 0.3333333333333333, 4: 0.36, 5: 0.36, 6: 0.3333333333333333, 7: 0.2903225806451613, 8: 0.24324324324324326, 9: 0.2}
- on a 32-bit computer (old desktop):
sage: from sage.graphs.centrality import centrality_closeness_top_k
sage: g = graphs.PathGraph(10)
sage: centrality_closeness_top_k(g, 5, 1)
Final performance ratio: 0.805555555556
[(0.36, 5), (0.36, 4), (0.3333333333333333, 6), (0.3333333333333333, 3), (0.29032258064516125, 7)]
sage: g.centrality_closeness()
{0: 0.2, 1: 0.24324324324324326, 2: 0.2903225806451613, 3: 0.3333333333333333, 4: 0.36, 5: 0.36, 6: 0.3333333333333333, 7: 0.2903225806451613, 8: 0.24324324324324326, 9: 0.2}
The good point is that the numbers are the same on both machines ;)
One part of the solution is to call centrality_closeness_top_k(g, 4, 1)
.
However, for the performance ratio it is more difficult. In fact, we have visited=138 on my mac, and visited=145 on the 32-bit computer. So we have to understand what could change the number of visited nodes depending on 32- vs 64-bit architectures.
David.
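The tie behind the two answers is easy to verify by hand: in PathGraph(10), vertices 2 and 7 are mirror images of each other, so their closeness values are exactly equal and either one is a valid 5th entry. A quick check of the arithmetic:

```python
# Closeness of vertex 2 in the path 0-1-...-9 (n = 10):
# distances from vertex 2 are 2,1,1,2,3,4,5,6,7, summing to 31.
# Vertex 7 is the mirror image, so it has the same sum.
n = 10
c2 = (n - 1) / 31
c7 = (n - 1) / 31
assert c2 == c7                  # an exact tie between two doubles
# The two slightly different printed values in the ticket come from two
# code paths rounding differently; both agree with 9/31 to ~1 ulp:
assert abs(c2 - 0.2903225806451613) < 1e-15    # g.centrality_closeness()
assert abs(c2 - 0.29032258064516125) < 1e-15   # centrality_closeness_top_k
```

With an exact tie, which vertex lands in the last top-k slot legitimately depends on floating-point details of the platform, hence the 32- vs 64-bit discrepancy; asking for the top 4 stops the list before the tie.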
comment:20 Changed 4 years ago by
 Commit changed from ced875f0b1fba123a0921d5937f894cc8fe41b1e to 5941f86b4846fa454b384ada4ec2088af5cb2609
Branch pushed to git repo; I updated commit sha1. New commits:
5941f86  Removed memset with -1 (32-bit problem)

comment:21 Changed 4 years ago by
 Commit changed from 5941f86b4846fa454b384ada4ec2088af5cb2609 to 681d482b053fa261d0d94aeeeec471194a083302
Branch pushed to git repo; I updated commit sha1. New commits:
681d482  Corrected doctest on 32-bit

comment:22 Changed 4 years ago by
Hello!
I have installed Sage on a 32-bit virtual machine: the problem was a comparison between (equal) doubles, and probably due to rounding the error popped out. Indeed, both results were correct.
By asking only for the top-4 in the doctest, we avoid such comparisons, and everything works.
Thank you very much!
Michele
comment:23 Changed 4 years ago by
 Status changed from needs_work to needs_review
comment:24 Changed 4 years ago by
The patch passes all tests on both my mac (64-bit) and a 32-bit Linux PC. For me the patch is ok, but it is certainly better to rebase on 9.3. David.
comment:25 Changed 4 years ago by
 Commit changed from 681d482b053fa261d0d94aeeeec471194a083302 to 1367f0076288c84403c7b962dc2b5e53fa0c170a
Branch pushed to git repo; I updated commit sha1. New commits:
1367f00  Merged with 6.9.beta3

comment:27 Changed 4 years ago by
 Branch changed from u/borassi/new_algorithm_for_top_k_closeness_centralities to 1367f0076288c84403c7b962dc2b5e53fa0c170a
 Resolution set to fixed
 Status changed from positive_review to closed
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
Moved centrality_closeness to centrality.pyx
Trailing whitespaces
Reviewer's suggestions
Merged with #19007 and #19014
Temp
First draft