Opened 4 years ago

Closed 4 years ago

#19031 closed enhancement (fixed)

New Algorithm for Top-K Closeness Centralities

Reported by: borassi
Owned by:
Priority: major
Milestone: sage-6.9
Component: graph theory
Keywords: Closeness Centrality, top-k
Cc: ncohen, dcoudert
Merged in:
Authors: Michele Borassi
Reviewers: David Coudert
Report Upstream: N/A
Work issues:
Branch: 1367f00 (Commits)
Commit: 1367f0076288c84403c7b962dc2b5e53fa0c170a
Dependencies: #19007, #19014
Stopgaps:

Description (last modified by borassi)

Implement the algorithm in http://arxiv.org/abs/1507.01490 for computing the k most central vertices according to closeness centrality.
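For orientation, here is a minimal plain-Python reference of the quantity being computed: the closeness of every vertex via BFS, keeping the k largest. This is only the naive baseline — the ticket's algorithm (from the paper above) additionally prunes BFSes early using farness lower bounds, which this sketch deliberately omits. All names are illustrative, not the ticket's API.

```python
from collections import deque
import heapq

def closeness(adj, s):
    """Closeness of s: (vertices reached - 1) / (sum of BFS distances from s)."""
    dist = {s: 0}
    queue = deque([s])
    total = 0
    while queue:
        u = queue.popleft()
        total += dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return (len(dist) - 1) / total if total else 0.0

def top_k_closeness(adj, k):
    """Naive top-k: one full BFS per vertex, then keep the k largest values."""
    return heapq.nlargest(k, ((closeness(adj, v), v) for v in adj))
```

On the 10-vertex path graph this returns the two central vertices with closeness 0.36, matching the doctest values discussed later in this ticket.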

Change History (27)

comment:1 Changed 4 years ago by borassi

  • Authors set to Michele Borassi
  • Cc ncohen dcoudert added
  • Component changed from PLEASE CHANGE to graph theory
  • Dependencies set to #19007,#19014
  • Description modified (diff)
  • Keywords Closeness Centrality top-k added
  • Type changed from PLEASE CHANGE to enhancement

comment:2 Changed 4 years ago by borassi

  • Branch set to u/borassi/new_algorithm_for_top_k_closeness_centralities

comment:3 Changed 4 years ago by git

  • Commit set to 09d67a9d858c8891c20bf628c539b9c8250e424d

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

48514ad Moved centrality_closeness to centrality.pyx
cf9cf87 Trailing whitespaces
4174764 Reviewer's suggestions
e9c1c60 Merged with #19007 and #19014
148d26c Temp
09d67a9 First draft

comment:4 Changed 4 years ago by git

  • Commit changed from 09d67a9d858c8891c20bf628c539b9c8250e424d to a83cdf0e44dfc33e2b0215da7a926bd13dd1bc5c

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

131c421 Temp
2693f52 First draft of Tarjan's SCC algorithm
3238039 Reviewer's suggestions
61bca8a Reviewer's suggestions and small improvements
af07cee Reviewer's suggestions
a83cdf0 First draft

comment:5 Changed 4 years ago by borassi

  • Status changed from new to needs_review

Hello!

This is the first draft of this new closeness centrality algorithm. It terminates in reasonable time even on very big graphs, with millions of nodes and billions of edges. The new algorithm is in the last commit (all the previous commits simply merge with the dependent tickets).

See you,

Michele

comment:6 follow-up: Changed 4 years ago by dcoudert

A first round of comments.

In method _estimate_reachable_vertices_dir

  • The lower bound is stored in reachableL[v] -> The lower bound is stored in reachL[v]
  • Although the method will not be tested, could you please add an INPUT section specifying the type and expected size of arrays. You could also briefly document the algorithm implemented in the method.
  • please improve the presentation of the block with all variable declarations. It is hard to read since it contains lots of important calls to various methods. For instance, avoid cdef int i, nscc = tarjan_strongly_connected_components_C and prefer cdef int nscc = tarjan_strongly_connected_components_C
  • What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.
  • You could use maxscc = max(scc_sizes[i] for i in range(nscc))

For method _compute_reachable_vertices_undir, you should add some documentation at the beginning of the method.

Method counting sort: no comment, but in the future we should certainly create a method somewhere.
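For reference, the counting-sort pattern under discussion, as a plain-Python sketch of the reusable method being suggested (the ticket's version is Cython over C arrays; this only shows the shape):

```python
def counting_sort(values, max_value):
    """Sort small non-negative integers in O(len(values) + max_value) time."""
    counts = [0] * (max_value + 1)
    for v in values:                      # histogram pass
        counts[v] += 1
    out = []
    for key, c in enumerate(counts):      # emit each key as many times as seen
        out.extend([key] * c)
    return out
```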

In method def centrality_closeness_top_k(G, int k=1, int verb=0)

  • verb -> verbose
  • the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for *one* randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.
  • can we really do for x in sorted_vert[:n]: when sorted_vert is an array of ints?
  • if directed: gamma = 0 else: gamma = 1 -> gamma = 0 if directed else 1
  • when stopped == True you don't quit the for j in range(layer_current_beginning, layer_current_end): loop. Maybe you should replace this loop with a while.
  • you set gamma = 0. So you need gamma==1 only once for undirected graphs. Correct?

David.

comment:7 in reply to: ↑ 6 ; follow-up: Changed 4 years ago by borassi

Replying to dcoudert:

A first round of comments.

In method _estimate_reachable_vertices_dir

  • The lower bound is stored in reachableL[v] -> The lower bound is stored in reachL[v]

Done!

  • Although the method will not be tested, could you please add an INPUT section specifying the type and expected size of arrays. You could also briefly document the algorithm implemented in the method.

Done!

  • please improve the presentation of the block with all variable declarations. It is hard to read since it contains lots of important calls to various methods. For instance, avoid cdef int i, nscc = tarjan_strongly_connected_components_C and prefer cdef int nscc = tarjan_strongly_connected_components_C

Done!

  • What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.

Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step if this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that now the values do not overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right with reachL, and I modified it.
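A quick plain-Python illustration (with a made-up graph size) of the overflow being avoided: intermediate values bounded by g.n^2 can exceed the 32-bit int range long before g.n itself does, while uint64_t holds them comfortably.

```python
import ctypes

n = 200_000                               # hypothetical g.n; fits easily in a 32-bit int
exact = n * n                             # bound on the intermediate values: 4 * 10^10
as_int32 = ctypes.c_int32(exact).value    # what a 32-bit signed int would store
as_uint64 = ctypes.c_uint64(exact).value  # what uint64_t stores

assert as_int32 != exact                  # the 32-bit value has silently wrapped
assert as_uint64 == exact                 # 64 bits suffice: n^2 < 2^64 for any 32-bit n
```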

  • You could use maxscc = max(scc_sizes[i] for i in range(nscc))

I don't think so, maxscc is not the maximum size of a SCC, but it is the SCC with maximum size. In other words, your instruction outputs scc_sizes[maxscc], not maxscc. Am I missing something?

For method _compute_reachable_vertices_undir, you should add some documentation at the beginning of the method.

Done

Method counting sort: no comment, but in the future we should certainly create a method somewhere.

Agree!

In method def centrality_closeness_top_k(G, int k=1, int verb=0)

  • verb -> verbose

Done!

  • the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for *one* randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.

Done! I left it because I had to test it 100,000 times to find the last bug...

  • can we really do for x in sorted_vert[:n]: when sorted_vert is an array of ints?

Yep: http://docs.cython.org/src/reference/language_basics.html

  • if directed: gamma = 0 else: gamma = 1 -> gamma = 0 if directed else 1

Done!

  • when stopped == True you don't quit the for j in range(layer_current_beginning, layer_current_end): loop. Maybe you should replace this loop with a while.

I added an 'if' at the end of the loop.

  • you set gamma = 0. So you need gamma==1 only once for undirected graphs. Correct?

Hmm, correct, but I realized that I could do a little better by slightly changing the implementation. So this does not apply anymore.

comment:8 Changed 4 years ago by git

  • Commit changed from a83cdf0e44dfc33e2b0215da7a926bd13dd1bc5c to 2c738015b029f7c93017607a9a26bb5617ab683f

Branch pushed to git repo; I updated commit sha1. New commits:

a0c1b14 Reviewer's suggestions
2c73801 Small improvements

comment:9 in reply to: ↑ 7 ; follow-up: Changed 4 years ago by dcoudert

  • What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.

Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step if this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that now the values do not overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right with reachL, and I modified it.

Is it mentioned somewhere?

  • You could use maxscc = max(scc_sizes[i] for i in range(nscc))

I don't think so, maxscc is not the maximum size of a SCC, but it is the SCC with maximum size. In other words, your instruction outputs scc_sizes[maxscc], not maxscc. Am I missing something?

Right, it should be maxscc = max((scc_sizes[i],i) for i in range(nscc))[1], but this is ugly.
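A tidier argmax idiom, for what it's worth (plain-Python sketch with made-up sizes; over C arrays in Cython, an explicit loop may remain the pragmatic choice):

```python
scc_sizes = [3, 7, 2, 7]  # illustrative SCC sizes
# index of a largest SCC; ties resolve to the smallest index
maxscc = max(range(len(scc_sizes)), key=scc_sizes.__getitem__)
```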

  • the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for *one* randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.

Done! I left it because I had to test it 100,000 times to find the last bug...

That's a lot! However, I don't like the n = random.randint(2,20). You could safely let n=20, no?

  • can we really do for x in sorted_vert[:n]: when sorted_vert is an array of ints?

Yep: http://docs.cython.org/src/reference/language_basics.html

That's funny. Can we always assume that farness[v]>0? This is for the return sorted([(1.0/farness[v],....

Instead of G.vertices()[v], you should add a line like cdef list V = G.vertices() before the return statement and then use V[v].

About the behavior when k>=n. What you propose is in fact to return the closeness centrality of every vertex (as a list of pairs (centrality, vertex)).

  • Is your method faster than centrality_closeness in this case?
  • This behavior should be documented for the parameter k

That's all for now ;)

David.

comment:10 Changed 4 years ago by git

  • Commit changed from 2c738015b029f7c93017607a9a26bb5617ab683f to 06d2c00fbfdf6f0166834b68e5e9b0929c5abb89

Branch pushed to git repo; I updated commit sha1. New commits:

06d2c00 Other reviewer's suggestions

comment:11 in reply to: ↑ 9 ; follow-up: Changed 4 years ago by borassi

  • What's the need for using type uint64_t instead of int? the corresponding variables are later cast to int.

Well, it's a bit tricky. The point is that in the recursion for reachU_scc I might find values that are much bigger than g.n. Since I did not want to test at each step if this occurs, I used uint64_t in order to avoid overflow (these values are smaller than g.n^2). Then, only at the end, I use min (so that now the values do not overflow), and I set reachU[i] = <int> min(reachU_scc[scc[i]], g.n). However, you are right with reachL, and I modified it.

Is it mentioned somewhere?

I added two inline comments to explain it.

  • the tests for the validity of the results are excessive. There is no need for the loop range(1000). Testing for *one* randomly chosen graph with say n=20 nodes and random.randint(1, n*(n-1) / 2) edges is sufficient.

Done! I left it because I had to test it 100,000 times to find the last bug...

That's a lot! However, I don't like the n = random.randint(2,20). You could safely let n=20, no?

Done!

Can we always assume that farness[v]>0? This is for the return sorted([(1.0/farness[v],....

Yep.

The only instructions that modify the farness are farness[x] = n (if a visit is stopped) and farness[x] = ((<double> f) * (n-1)) / (<double>(nd-1) * (nd-1)), where f is a (positive) sum of distances and nd is the number of vertices reachable from x. The factor nd-1 is positive: if nd==1, the out-degree of x is 0 and we skip the analysis of x, so this case never occurs. Finally, n>=nd, hence n-1 is positive as well.
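The second assignment, written out as a plain-Python sketch (names f, n, nd as in the paragraph above; this mirrors the Cython expression, it is not the ticket's code):

```python
def generalized_farness(f, n, nd):
    """Farness of a vertex x in a (possibly) disconnected graph:
    f  = sum of distances from x (positive),
    nd = number of vertices reachable from x, nd >= 2 (nd == 1 is skipped),
    n  = total number of vertices, n >= nd.
    All factors are positive, so the result is positive and 1/farness is safe."""
    return (float(f) * (n - 1)) / ((nd - 1) * (nd - 1))
```

On a connected graph nd == n and the expression reduces to the usual farness f / (n - 1): the central vertices of the 10-vertex path graph have f = 25, giving closeness 1 / generalized_farness(25, 10, 10) = 0.36.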

Instead of G.vertices()[v], you should add a line like cdef list V = G.vertices() before the return statement and then use V[v].

Done!

About the behavior when k>=n. What you propose is in fact to return the closeness centrality of every vertex (as a list of pairs (centrality, vertex)).

  • Is your method faster than centrality_closeness in this case?

No, it should be the same (actually, a bit slower because we make some computations during the BFSes).

  • This behavior should be documented for the parameter k

Done!

comment:12 Changed 4 years ago by git

  • Commit changed from 06d2c00fbfdf6f0166834b68e5e9b0929c5abb89 to 84031c3875f0ad14db2d984c43373af32bdc704f

Branch pushed to git repo; I updated commit sha1. New commits:

84031c3 Added documentation for k

comment:13 in reply to: ↑ 11 Changed 4 years ago by dcoudert

About the behavior when k>=n. What you propose is in fact to return the closeness centrality of every vertex (as a list of pairs (centrality, vertex)).

  • Is your method faster than centrality_closeness in this case?

No, it should be the same (actually, a bit slower because we make some computations during the BFSes).

So you should add a test if k>=n at the beginning of the method and, if so, call the other method and format the result. You could also add tests for n==0 and/or n==1, where the result is obvious.

comment:14 Changed 4 years ago by git

  • Commit changed from 84031c3875f0ad14db2d984c43373af32bdc704f to 5c1905f8e3affd515bf52b45162afb0a3c5d6cac

Branch pushed to git repo; I updated commit sha1. This was a forced push. Last 10 new commits:

0f855c6 Reviewer's suggestions
1ec3f44 Merged with beta2
cbe780c Reviewer's suggestions and small improvements
4174764 Reviewer's suggestions
0c19024 Use check_allocarray
9167071 Used limits instead of stdint to find maximum int value
83b8057 Small corrections
1551c6f Undo commit moving closeness_centrality to centrality.pyx. Corrected links.
fd0580a Merged with #19007
5c1905f New algorithm for closeness centrality

comment:15 Changed 4 years ago by git

  • Commit changed from 5c1905f8e3affd515bf52b45162afb0a3c5d6cac to ced875f0b1fba123a0921d5937f894cc8fe41b1e

Branch pushed to git repo; I updated commit sha1. New commits:

ced875f Reviewer's suggestions / added a reference

comment:16 Changed 4 years ago by borassi

Helloooooooo!

After we moved centrality_closeness back to generic_graph, I had to rebase this ticket. The new algorithm is still in file centrality.pyx, but it is not available in generic_graph (that is, g.centrality_closeness_top_k() does not work anymore). However, I don't think this is a serious issue, because this algorithm is used in very specialized cases, and the standard Sage user does not need it, in my opinion.

Cheers,

Michele

comment:17 Changed 4 years ago by dcoudert

  • Reviewers set to David Coudert
  • Status changed from needs_review to positive_review

It is not such a big deal to do

sage: from sage.graphs.centrality import centrality_closeness_top_k
sage: centrality_closeness_top_k(graphs.PetersenGraph(), 3)
[(0.6, 2), (0.6, 1), (0.6, 0)]

and it's documented. Other methods such as hyperbolicity, pathwidth, etc. must be imported for similar reasons. As soon as we find a reasonable solution for lazy import, we may add it in another ticket.

I set this patch to positive review.

David.

comment:18 Changed 4 years ago by vbraun

  • Status changed from positive_review to needs_work

On 32-bit Linux:

sage -t --long src/sage/graphs/centrality.pyx
**********************************************************************
File "src/sage/graphs/centrality.pyx", line 588, in sage.graphs.centrality.centrality_closeness_top_k
Failed example:
    centrality_closeness_top_k(g, 5, 1)
Expected:
    Final performance ratio: 0.766666666667
    [(0.36, 5),
     (0.36, 4),
     (0.3333333333333333, 6),
     (0.3333333333333333, 3),
     (0.29032258064516125, 2)]
Got:
    Final performance ratio: 0.805555555556
    [(0.36, 5),
     (0.36, 4),
     (0.3333333333333333, 6),
     (0.3333333333333333, 3),
     (0.29032258064516125, 7)]
**********************************************************************
1 item had failures:
   1 of  40 in sage.graphs.centrality.centrality_closeness_top_k
    [51 tests, 1 failure, 0.16 s]

comment:19 Changed 4 years ago by dcoudert

Michele,

in this test, nodes 2 and 7 are both correct answers, since we have:

  • on a 64-bit machine (i.e. my laptop)
    sage: from sage.graphs.centrality import centrality_closeness_top_k
    sage: g = graphs.PathGraph(10)
    sage: centrality_closeness_top_k(g, 5, 1)
    Final performance ratio: 0.766666666667
    [(0.36, 5),
     (0.36, 4),
     (0.3333333333333333, 6),
     (0.3333333333333333, 3),
     (0.29032258064516125, 2)]
    sage: g.centrality_closeness()
    {0: 0.2,
     1: 0.24324324324324326,
     2: 0.2903225806451613,
     3: 0.3333333333333333,
     4: 0.36,
     5: 0.36,
     6: 0.3333333333333333,
     7: 0.2903225806451613,
     8: 0.24324324324324326,
     9: 0.2}
    
  • on a 32-bit computer (old desktop)
    sage: from sage.graphs.centrality import centrality_closeness_top_k
    sage: g = graphs.PathGraph(10)
    sage: centrality_closeness_top_k(g, 5, 1)
    Final performance ratio: 0.805555555556
    [(0.36, 5),
     (0.36, 4),
     (0.3333333333333333, 6),
     (0.3333333333333333, 3),
     (0.29032258064516125, 7)]
    sage: g.centrality_closeness()
    {0: 0.2,
     1: 0.24324324324324326,
     2: 0.2903225806451613,
     3: 0.3333333333333333,
     4: 0.36,
     5: 0.36,
     6: 0.3333333333333333,
     7: 0.2903225806451613,
     8: 0.24324324324324326,
     9: 0.2}
    

The good point is that the numbers are the same on both machines ;)

One part of the solution is to call centrality_closeness_top_k(g, 4, 1).

However, for the performance ratio it is more difficult. In fact, we have visited=138 on my mac and visited=145 on the 32-bit computer. So we have to understand what could change the number of visited nodes between 32-bit and 64-bit builds.

David.

comment:20 Changed 4 years ago by git

  • Commit changed from ced875f0b1fba123a0921d5937f894cc8fe41b1e to 5941f86b4846fa454b384ada4ec2088af5cb2609

Branch pushed to git repo; I updated commit sha1. New commits:

5941f86 Removed memset with -1 (32bit problem)

comment:21 Changed 4 years ago by git

  • Commit changed from 5941f86b4846fa454b384ada4ec2088af5cb2609 to 681d482b053fa261d0d94aeeeec471194a083302

Branch pushed to git repo; I updated commit sha1. New commits:

681d482 Corrected doctest with 32bit

comment:22 Changed 4 years ago by borassi

Hello!

I have installed Sage on a 32-bit virtual machine: the problem was a comparison between (equal) doubles, and probably due to rounding the error popped up. Indeed, both results were correct.

By asking only top-4 in the doctest, we avoid such comparisons, and everything works.
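The tie can be checked by hand (plain Python, independent of Sage): in the path graph P_10 vertices 2 and 7 are mirror images, so their distance sums — and hence their closeness values — are exactly equal, and a top-5 query may legitimately end with either one.

```python
def dist_sum(s, n=10):
    """Sum of distances from vertex s in the path graph P_n (d(s, t) = |s - t|)."""
    return sum(abs(s - t) for t in range(n))

assert dist_sum(2) == dist_sum(7) == 31                   # identical farness by symmetry
assert abs(9 / dist_sum(2) - 0.2903225806451613) < 1e-12  # the closeness value in the doctest
```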

Thank you very much!

Michele

comment:23 Changed 4 years ago by borassi

  • Status changed from needs_work to needs_review

comment:24 Changed 4 years ago by dcoudert

The patch passes all tests on both my mac (64-bit) and a 32-bit Linux PC. For me the patch is ok, but it is certainly better to rebase on 6.9.beta3. David.

comment:25 Changed 4 years ago by git

  • Commit changed from 681d482b053fa261d0d94aeeeec471194a083302 to 1367f0076288c84403c7b962dc2b5e53fa0c170a

Branch pushed to git repo; I updated commit sha1. New commits:

1367f00 Merged with 6.9.beta3

comment:26 Changed 4 years ago by dcoudert

  • Status changed from needs_review to positive_review

Thanks.

comment:27 Changed 4 years ago by vbraun

  • Branch changed from u/borassi/new_algorithm_for_top_k_closeness_centralities to 1367f0076288c84403c7b962dc2b5e53fa0c170a
  • Resolution set to fixed
  • Status changed from positive_review to closed