Opened 4 years ago

Closed 4 years ago

#18250 closed enhancement (fixed)

G.triangles_count speedup

Reported by: ncohen Owned by:
Priority: major Milestone: sage-6.7
Component: graph theory Keywords:
Cc: azi, vdelecroix, dcoudert Merged in:
Authors: Nathann Cohen Reviewers: Vincent Delecroix
Report Upstream: N/A Work issues:
Branch: fd88c98 (Commits) Commit: fd88c98cfc2fee8d0750cd9b6f88d1d7b8177ed4
Dependencies: Stopgaps:

Description (last modified by ncohen)

I thought that we had a decent implementation of G.triangles_count. Turns out that we did not.

Here is what the branch does:

  • Exception on non-simple graphs. The digraph version can handle loops (it cannot handle multiple edges) and the graph version cannot handle any of the two.
  • (obvious) speedup in the 'iter' algorithm. Before:
    sage: %timeit _=g.triangles_count(algorithm='iter')
    1 loops, best of 3: 10.2 s per loop
    
    After:
    sage: g=graphs.RandomGNP(700,.3)
    sage: %timeit _=g.triangles_count(algorithm='iter')
    1 loops, best of 3: 6.37 s per loop
    
  • Added triangles_count in static_sparse_graph and another one in static_dense_graph. They first copy the graph in a proper data structure and count the triangles on the copy. They can take require more memory (though they are 100000x cheaper than the default data structure)
  • The default algorithm is 'iter' in the code but 'matrix' in the doc. Fixed: sparse_copy is now the default.

Overall, comparing default implementations:

Before

sage: g = graphs.RandomGNP(500,.5)
sage: %timeit g.triangles_count()
1 loops, best of 3: 9.78 s per loop

After

sage: g = graphs.RandomGNP(500,.5)
sage: %timeit g.triangles_count()
10 loops, best of 3: 162 ms per loop

Change History (33)

comment:1 Changed 4 years ago by ncohen

  • Branch set to public/18250
  • Commit set to e780f711afba40b34c0d11b65adaf3cc4429fac2
  • Description modified (diff)
  • Status changed from new to needs_review

New commits:

e780f71trac #18250: G.triangles_count speedup

comment:2 Changed 4 years ago by git

  • Commit changed from e780f711afba40b34c0d11b65adaf3cc4429fac2 to ed8bb3c374929b2a32b3e9310124990eb5b2350c

Branch pushed to git repo; I updated commit sha1. New commits:

ed8bb3cTrac 18250: documentation

comment:3 follow-up: Changed 4 years ago by vdelecroix

  • Status changed from needs_review to needs_info

Hello,

Small commit for documentation at public/18250.

The algorithm 'iter' returns int while dense_copy returns Integer. This would better be uniform.

You always perform a copy?! Isn't it possible to have a graph whose backend is already a static_sparse_graph or a static_dense_graph?

Vincent

comment:4 Changed 4 years ago by git

  • Commit changed from ed8bb3c374929b2a32b3e9310124990eb5b2350c to fa6115dc6cdea907dad9c2cd69668f5e71660024

Branch pushed to git repo; I updated commit sha1. New commits:

fa6115dtrac #18250: Integer return type and avoid useless copies

comment:5 in reply to: ↑ 3 ; follow-up: Changed 4 years ago by ncohen

  • Status changed from needs_info to needs_review

Yooooooooooooooo,

Small commit for documentation at public/18250.

Oh, right. Thanks.

The algorithm 'iter' returns int while dense_copy returns Integer. This would better be uniform.

Done.

You always perform a copy?! Isn't it possible to have a graph whose backend is already a static_sparse_graph or a static_dense_graph?

Well, in this case copies are eally cheap, so I personally prefer a simpler code to one that avoids that copy.

Still, I implemented it. There is a 'trick' somewhere:

1) I first build a copy of the graph with G.copy(immutable=True) (which makes no copy if the graph is already immutable), then access its internal data structure

2) If I do g[0] = G.copy(immutable=True)._backend._cg.g[0] I get a segfault because G.copy() is immediately deallocated. So I first do G = G.copy(immutable=True)._backend, and *then* access its internal data.

Fortunately the copy G is not deleted until the end of the function. If Cython ever notices that (as far as it is concerned) it is never used again in the function, then it will be deallocated and we will get a segfault again.

Nathann

comment:6 in reply to: ↑ 5 ; follow-up: Changed 4 years ago by vdelecroix

Replying to ncohen:

You always perform a copy?! Isn't it possible to have a graph whose backend is already a static_sparse_graph or a static_dense_graph?

Fortunately the copy G is not deleted until the end of the function. If Cython ever notices that (as far as it is concerned) it is never used again in the function, then it will be deallocated and we will get a segfault again.

Cython does not perform deallocation, it is Python job. Though Cython does increment/decrement the reference counter. And precisely, doing G = G.copy(immutable=True) precisely increment the reference counter!

Looks good! I am running the tests...

Vincent

comment:7 in reply to: ↑ 6 ; follow-up: Changed 4 years ago by ncohen

Cython does not perform deallocation, it is Python job.

Well if you prefer, but this code is translated into C, and gcc may notice that the object is totally useless. They must have done something to keep those Python variables until the end.

Nathann

comment:8 in reply to: ↑ 7 Changed 4 years ago by vdelecroix

Replying to ncohen:

Cython does not perform deallocation, it is Python job.

Well if you prefer, but this code is translated into C, and gcc may notice that the object is totally useless. They must have done something to keep those Python variables until the end.

Nope. gcc will not do that. An affectation G = H is not one instruction if H is a Python object. Cython translates it into something like

cdef PyObject * G = H;
Py_INCREF(G);

whick prevents deallocation of H until G is garbage collected (typically at the end of the function or until a call of del G).

Vincent

comment:9 Changed 4 years ago by vdelecroix

  • Reviewers set to Vincent Delecroix

I do not like the default choice being sparse copy. Especially if G is implemented as a static dense graph! Wouldn't it be better to have a default algorithm=None and be more clever about choosing the right algorithm?

comment:10 follow-up: Changed 4 years ago by ncohen

I don't think that anybody in Sage plays with data structure except me, in Sage's subfunctions. I can do it, but I am rather convinced that it is useless. It may be useful in one or two years if I can rewrite and clean all this stuff.

Will do.

Nathann

comment:11 in reply to: ↑ 10 ; follow-up: Changed 4 years ago by vdelecroix

Replying to ncohen:

I don't think that anybody in Sage plays with data structure except me, in Sage's subfunctions. I can do it, but I am rather convinced that it is useless. It may be useful in one or two years if I can rewrite and clean all this stuff.

Ok. Let say something based on the density.

comment:12 in reply to: ↑ 11 Changed 4 years ago by ncohen

Ok. Let say something based on the density.

O_o

Nononono. That's for the user to decide. I don't want ugly density>0.3 which may only make sense on a specific architecture, and on the kind of graphs that were used for the tests.

comment:13 Changed 4 years ago by git

  • Commit changed from fa6115dc6cdea907dad9c2cd69668f5e71660024 to fd88c98cfc2fee8d0750cd9b6f88d1d7b8177ed4

Branch pushed to git repo; I updated commit sha1. New commits:

fd88c98trac #18250: algorithm=None by default and better 'wrong values' check

comment:14 follow-up: Changed 4 years ago by vdelecroix

Hello,

The method matrix does work for non simple graphs, right? (after removing the loops)

Vincent

comment:15 in reply to: ↑ 14 ; follow-up: Changed 4 years ago by ncohen

The method matrix does work for non simple graphs, right? (after removing the loops)

After removing loops and multiple edges, yes.

Nathann

comment:16 in reply to: ↑ 15 Changed 4 years ago by vdelecroix

Replying to ncohen:

The method matrix does work for non simple graphs, right? (after removing the loops)

After removing loops and multiple edges, yes.

Please, consider the following right answers

sage: G = Graph([(0,1),(0,1),(1,2),(2,0)], multiedges=True)
sage: (G.adjacency_matrix()**3).trace() / 6
2
sage: sage: G = Graph([(0,1),(0,1),(1,2), (1,2), (2,0)], multiedges=True)
sage: (G.adjacency_matrix()**3).trace()//6
4

comment:17 follow-up: Changed 4 years ago by ncohen

What do you mean?

comment:18 in reply to: ↑ 17 Changed 4 years ago by vdelecroix

Replying to ncohen:

What do you mean?

That the method algorithm=matrix is valid for non simple graphs (once you removed the loops).

comment:19 Changed 4 years ago by ncohen

Depends what you call a triangle. For me it is a set of three points adjacent to each other.

comment:20 follow-up: Changed 4 years ago by ncohen

If it is a 'closed walk of length 3' then you should keep the loops inside.

comment:21 in reply to: ↑ 20 Changed 4 years ago by vdelecroix

Replying to ncohen:

If it is a 'closed walk of length 3' then you should keep the loops inside.

It is not. It is C3: three vertices and three edges! The edges are part of the data.

comment:22 follow-up: Changed 4 years ago by ncohen

Ahahah. So if you have three vertices and 4 edges in between, then it is not a triangle? No problem, yet another definition

Nathann

comment:23 in reply to: ↑ 22 Changed 4 years ago by vdelecroix

Replying to ncohen:

Ahahah. So if you have three vertices and 4 edges in between, then it is not a triangle? No problem, yet another definition

No. It is not a triangle. But it does contain many triangles. It is very much like on the sphere. For each (non degenerate) triple of points you have 4 triangles whose vertices are exactly these points.

comment:24 follow-up: Changed 4 years ago by ncohen

See, I am extremely aware of this kind of problems. There is one thousand different ways of defining stuff, and everybody has his own notion of it, depending on where you come from. I'm pretty sure that guys from word theory/automata would think that a triangle is just "a path of length three that leads you back to where you started" (that corresponds to the 'closed walk' I mentionned above).

I don't want people to believe stuff and end up with something else than what they had in mind. If two different persons can expect different result from the same function, I want either none of them or a way to force them to tell me what exactly they expect.

comment:25 follow-up: Changed 4 years ago by ncohen

Oh, and then you will also have the guys who have a weight on their edges and want to count the number of weighted triangles.

comment:26 in reply to: ↑ 25 Changed 4 years ago by vdelecroix

Replying to ncohen:

Oh, and then you will also have the guys who have a weight on their edges and want to count the number of weighted triangles.

This is what I am thinking about when I consider a multi graph, it is just a graph with (positive) integer labels. The weight of a triangle is the product of the labels on the three edges. And to count triangles, you just sum the weights. It is what you get from the matrix trace (when you removed the loops).

But I agree that there is one ambiguity: you might want to count induced C3.

Vincent

comment:27 in reply to: ↑ 24 ; follow-up: Changed 4 years ago by vdelecroix

Replying to ncohen:

I'm pretty sure that guys from word theory/automata would think that a triangle is just "a path of length three that leads you back to where you started" (that corresponds to the 'closed walk' I mentionned above).

Some people have wrong ideas. I know.

comment:28 in reply to: ↑ 27 Changed 4 years ago by ncohen

Some people have wrong ideas. I know.

And in my infinite Christ-like generosity I still care for them, and they shouldn't get wrong results just because they are sinners.

Nathann

comment:29 follow-up: Changed 4 years ago by vdelecroix

  • Status changed from needs_review to positive_review

More seriously, I would prefer if the triangles_count in static_dense_graph would only accept backend being static_dense_graph (and move the copy step inside the generic_graph). Would make more sense. But that's very minor. I just want to know your feelings about how you would handle specificities of the backends in the future.

comment:30 in reply to: ↑ 29 ; follow-up: Changed 4 years ago by ncohen

More seriously, I would prefer if the triangles_count in static_dense_graph would only accept backend being static_dense_graph (and move the copy step inside the generic_graph)

Well.. generic_graph.py cannot handle C types so it would have required to move the function to generic_graph_pyx.pyx (what a mess >_<), but I think that I was also influenced by the function is_strongly_regular immediately above.

Also, at first this function was building a copy of the graph every time, because I really did not think that it mattered. There are many functions like that in distance_all_pairs too. It contains every function which "relies heavily on the distance_all_pairs functions", and most of them accept Sage graphs directly (though some of them have pure C couterparts).

I just want to know your feelings about how you would handle specificities of the backends in the future.

Can you tell me what you mean by "specificities of the backends"?

Nathann

comment:31 Changed 4 years ago by ncohen

(Thanks for the review btw!)

comment:32 in reply to: ↑ 30 Changed 4 years ago by ncohen

Can you tell me what you mean by "specificities of the backends"?

If you actually mean "specificities of the backens" (and not about the Multi,looped,edge-labelled) graph), then what I want to do is merely to make it simple to switch from one to the other, for the user or for the coders. I am trying to clean the constructor first (#18185 in needs_review), then I will try to make it a bit faster (creating immutable copies is unnecessarily complicated), to distribute them a bit (into subfunctions).

The main problem I haven't solved is this: what should the *default* backend for Sage graphs be?

For small graphs I expect that the current one is right. But being able to run "has_edge" in log(n) time, or being able to remove edges has a huge cost. If we just want to add edges and vertices (i.e. to build graphs, from an algorithm or from a file) then it can be MUCH cheaper. For analysis of large networks it is something that you cannot afford to pay, and then this backend must be different.

So if we want to make no choice at all, it really has to become much simpler and natural to pick a backend for a graph.

Also, we may want to have a 'virtual' backend like in Gap for graphs defined from a group action. No graph would actually be stored, yet all algorithms should be available.

Well, that's what I think of these days.

Nathann

comment:33 Changed 4 years ago by vbraun

  • Branch changed from public/18250 to fd88c98cfc2fee8d0750cd9b6f88d1d7b8177ed4
  • Resolution set to fixed
  • Status changed from positive_review to closed
Note: See TracTickets for help on using tickets.