Opened 15 months ago

Last modified 5 months ago

#30641 new enhancement

equality of graphs is 60 times slower than equality of their list of edges

Reported by: vdelecroix
Owned by:
Priority: major
Milestone: sage-9.5
Component: graph theory
Keywords: thursdaysbdx
Cc: gh-kliem, dimpase, dcoudert
Merged in:
Authors:
Reviewers:
Report Upstream: N/A
Work issues:
Branch:
Commit:
Dependencies: #30645, #30665, #30776
Stopgaps:


Description (last modified by vdelecroix)

On sage 9.1

sage: G = Graph(loops=False, multiedges=False)
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: G.add_edges([(i, (i+37)%100) for i in range(100)])
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: H = G.copy()
sage: %timeit G == H
196 µs ± 708 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
sage: E = list(G.edges())
sage: F = list(H.edges())
sage: %timeit E == F
3.3 µs ± 5.33 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using immutable graphs is better by a factor of about 2 (and hence "only" about 30x slower than the list comparison):

sage: E = [(i, (i+85)%100) for i in range(100)] + \
....:     [(i, (i+37)%100) for i in range(100)] + \
....:     [(i, (i+85)%100) for i in range(100)]
sage: G = Graph(E, loops=False, multiedges=False, immutable=True)
sage: H = G.copy()
sage: %timeit G == H
111 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Change History (25)

comment:1 Changed 15 months ago by vdelecroix

  • Keywords thursdaysbdx added

comment:2 Changed 15 months ago by dcoudert

So far, G.edges() returns a sorted list of edges, so the test E == F is simply a lexicographic comparison of lists and omits the time needed to extract and sort the list of edges. The reported timing comparison is therefore not completely fair.

Furthermore, since the switch to Python 3, we can no longer rely on sorted lists of edges (vertices and edge labels may be of different types, leading to an error when trying to sort the list). One objective is to deprecate this behavior.

I think that the only way to speed up the test G == H is to speed up the test other.has_edge(...).
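
For reference, here is a rough Python-level sketch (not Sage's actual implementation) of the kind of work hidden behind G == H for simple unweighted graphs: a few cheap checks on the settings and vertex sets, then one has_edge query per edge, which is where the time goes. graphs_look_equal is a hypothetical helper name.

def graphs_look_equal(G, H):
    # cheap checks first: settings, sizes, vertex sets
    if G.allows_loops() != H.allows_loops():
        return False
    if G.allows_multiple_edges() != H.allows_multiple_edges():
        return False
    if G.order() != H.order() or G.size() != H.size():
        return False
    if set(G) != set(H):
        return False
    # the expensive part: one membership query per edge
    return all(H.has_edge(u, v) for u, v in G.edge_iterator(labels=False))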

comment:3 Changed 15 months ago by vdelecroix

The reason why I created this ticket is that I have code that creates a big list of graphs from which I want to remove duplicates. Currently, the program spends more than 50% of its time in graph equality, which is ridiculous.
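
A possible Python-level workaround in the meantime (only a sketch: it ignores edge labels, assumes simple graphs, and remove_duplicates is a hypothetical helper name) is to deduplicate via a hashable key per graph, so that no pairwise graph comparison is ever performed.

def remove_duplicates(graphs):
    seen = set()
    unique = []
    for g in graphs:
        # key a graph by its vertex set and its set of (unordered) edges
        key = (frozenset(g),
               frozenset(frozenset(e) for e in g.edge_iterator(labels=False)))
        if key not in seen:
            seen.add(key)
            unique.append(g)
    return unique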

comment:4 Changed 15 months ago by dcoudert

I agree. I recently opened #30510 to speed up the method subgraph, which was ridiculously slow. It's better now.

Here, I don't know how to speed up the method without going deep into the backends...

comment:5 Changed 15 months ago by vdelecroix

I am fine with "going deep into the backends". My graphs have vertices {0, 1, ..., n-1} and there are no multiple edges. The weights are integers (but I don't think this is taken into account in the backend). The comparison of such graphs should be *faster* than the generic Python list comparison.

comment:6 Changed 15 months ago by gh-kliem

Already the edge iterator is pretty slow. It takes 76 µs out of the 175 for me.

comment:7 Changed 15 months ago by dcoudert

For your code, it could be better to work with immutable graphs.

sage: E = [(i, (i+85)%100) for i in range(100)] + [(i, (i+37)%100) for i in range(100)] + [(i, (i+85)%100) for i in range(100)]
sage: G = Graph([range(100), E], format='vertices_and_edges', loops=False, multiedges=False, immutable=True)
sage: H = G.copy()
sage: %timeit G == H
131 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
sage: G = Graph(loops=False, multiedges=False)
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: G.add_edges([(i, (i+37)%100) for i in range(100)])
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: H = G.copy()
sage: %timeit G == H
241 µs ± 8.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But of course it would be better to improve the backends to get faster edge iterator, edge membership tests, equality test, etc.

comment:8 Changed 15 months ago by vdelecroix

  • Description modified (diff)

comment:9 Changed 15 months ago by gh-kliem

I wrote a quick and dirty solution that compares two vertices at a time and counts how much the out arcs differ. Using out_arc_unsafe, this takes about 34 µs instead of 175.
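
For the record, a Python-level sketch of that idea (the actual experiment works at the CGraph level with out_arc_unsafe; count_neighborhood_differences is a hypothetical name): walk both graphs vertex by vertex and count in how many neighbours they differ.

def count_neighborhood_differences(G, H):
    diff = 0
    for v in set(G).union(H):
        NG = set(G.neighbor_iterator(v)) if G.has_vertex(v) else set()
        NH = set(H.neighbor_iterator(v)) if H.has_vertex(v) else set()
        diff += len(NG.symmetric_difference(NH))
    # note: isolated vertices present in only one graph are not detected here,
    # so vertex sets still have to be compared separately
    return diff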

comment:10 Changed 15 months ago by gh-kliem

I will start with a tiny ticket that does some trivial optimizations for has_edge.

comment:11 Changed 15 months ago by gh-kliem

  • Dependencies set to #30645, #30665

comment:12 Changed 15 months ago by gh-kliem

#30645 and #30665 are a good start, I would say.

Things are set up in a way that we can implement an optimized containment check in CGraph.

comment:13 Changed 15 months ago by vdelecroix

Nice. Thanks Jonathan for your efforts!

I also wonder if we could implement a comparison that would bypass the translation between "vertex labels" and "integers in {0, ..., n-1}", namely a comparison of the internal representations (assuming the backend is the same).

comment:14 Changed 15 months ago by dcoudert

Does it make sense? Currently, we check that the graphs have the same settings and the same sets of vertices and edges. However, the internal representations might differ a lot. Suppose one graph G is the result of many additions/deletions of vertices or edges: at some point it had 100000 vertices, but at the end of the sequence of operations only 2 remain. Its internal bitsets/data structures might be very large. Now, if H is a copy of G, it has a very compact internal representation. The two graphs are equal, but have different internal representations.

comment:15 Changed 15 months ago by vdelecroix

That is one use case. Here is another. I am considering all trivalent graphs on n vertices up to isomorphism. I want to construct the flip graph where I put an edge between the trivalent graph G1 and the trivalent graph G2 if they differ by a Whitehead move:

 A          C                      A          C
  \        /                        \        /
   \      /                          \      /
    u -- v          ----------->      u -- v
   /      \                          /      \
  /        \                        /        \
 B          D                      D          B

In order to generate this flip graph, one has to be able to identify the graph obtained after a move. Since canonical labels are computed, we need to compare graphs on {0, ..., n-1}, with no surprises... and there are a lot of comparisons!
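
To make the use case concrete, the inner loop looks roughly like this (trivalent_graphs and whitehead_moves are hypothetical names; the dictionary look-ups on immutable graphs are exactly the equality tests this ticket is about):

index_of = {}                                    # canonical form -> index
for i, g in enumerate(trivalent_graphs):
    index_of[g.canonical_label().copy(immutable=True)] = i
flip_edges = []
for i, g in enumerate(trivalent_graphs):
    for h in whitehead_moves(g):                 # hypothetical move generator
        j = index_of[h.canonical_label().copy(immutable=True)]
        flip_edges.append((i, j))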

comment:16 Changed 15 months ago by dcoudert

I see, not easy to do it efficiently.

comment:17 Changed 15 months ago by gh-kliem

To check whether one CGraph is contained in another, one can store a translation array at the beginning. If you can guarantee that the vertices match, you can of course skip this step.

After this initial step, it is pretty much the same as #30665, except that you check has_arc_unsafe of other instead of yielding. Things are a bit more complicated for multiple edges.

If you can guarantee the vertices to match, things should be really fast for dense graphs, but of course only if the graph is somewhat dense (e.g. if n <= 64). Here you can just compare two vertices directly.
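
A toy illustration of that dense case with vertices 0, ..., n-1 and n <= 64, using Python integers as stand-ins for machine-word bitsets (adjacency_rows and dense_equal are illustrative names; the real backend would compare its bitset rows directly instead of rebuilding them):

def adjacency_rows(G, n):
    # one word per vertex, bit v of rows[u] set iff the edge (u, v) exists
    rows = [0] * n
    for u, v in G.edge_iterator(labels=False):
        rows[u] |= 1 << v
        rows[v] |= 1 << u
    return rows

def dense_equal(G, H, n):
    # equality becomes n word comparisons, no per-edge has_edge queries
    return adjacency_rows(G, n) == adjacency_rows(H, n)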

comment:18 follow-up: Changed 14 months ago by gh-kliem

  • Dependencies changed from #30645, #30665 to #30645, #30665, #30776

Things are improving, I think.

IMO what would still improve things here:

  • better heuristics for vertex translation to int,
  • split the backends into standard backends and backends with vertices in range(n) (we could still have active vertices or not, but all vertices need to be in range(n); if you add a new vertex that is too large, you will get a memory problem, but that is your problem),
  • implement specialized versions of is_subgraph and subgraph_given_vertices for dense graphs over range(n) (we still need a good name for this graph backend); this should be much faster than the current methods

So instead of just CGraphBackend we would have CGraphBackend, CGraphVertices and CGraphRangeVertices (not good names yet). DenseGraphBackend as we know it would have to inherit from CGraphBackend and CGraphVertices and DenseGraphBackendRangeVertices would have to inherit from CGraphBackend and CGraphRangeVertices. Likewise for sparse and static sparse.

Does this sound plausible?

Does anyone have good suggestions for a naming scheme?
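
To make the proposal a bit more concrete, a pure skeleton of the suggested split (names taken from the comment above, purely illustrative; nothing like this exists in Sage yet):

class CGraphBackend:                  # code shared by all C-level backends
    pass

class CGraphVertices:                 # mixin: arbitrary hashable vertices,
    pass                              # translated to integers internally

class CGraphRangeVertices:            # mixin: vertices are exactly range(n)
    pass

class DenseGraphBackend(CGraphBackend, CGraphVertices):
    pass

class DenseGraphBackendRangeVertices(CGraphBackend, CGraphRangeVertices):
    pass

# likewise for the sparse and static sparse backends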

comment:19 Changed 14 months ago by gh-kliem

Btw, dense graphs should really use bitsets. I don't understand why they don't. Eventually this might be sped up with intrinsics, e.g. an iterator over a bitset should be much faster with an improved leading zero count (_lzcnt_u64).

comment:20 Changed 14 months ago by dcoudert

I don't know if it's better to use bitsets for dense graphs. If so, we can use src/sage/data_structures/binary_matrix.<pxi|pxd>.

comment:21 in reply to: ↑ 18 Changed 14 months ago by slabbe

Replying to gh-kliem:

So instead of just CGraphBackend we would have CGraphBackend, CGraphVertices and CGraphRangeVertices (not good names yet). DenseGraphBackend as we know it would have to inherit from CGraphBackend and CGraphVertices and DenseGraphBackendRangeVertices would have to inherit from CGraphBackend and CGraphRangeVertices. Likewise for sparse and static sparse.

Does this sound plausible?

Does anyone have good suggestions for a naming scheme?

This is how DisjointSet (the union-find data structure) is implemented in Sage:

def DisjointSet(arg):
cdef class DisjointSet_class(SageObject):
cdef class DisjointSet_of_integers(DisjointSet_class):
cdef class DisjointSet_of_hashables(DisjointSet_class):

in the sense that DisjointSet_of_integers directly uses the internal representation, while DisjointSet_of_hashables uses maps between integers in range(n) and the hashable objects, in both directions. See https://github.com/sagemath/sage/blob/develop/src/sage/sets/disjoint_set.pyx

I don't know whether the naming scheme _of_integers and _of_hashables I chose at the time is good or not.
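
For illustration, the factory dispatches on the type of its argument (a session sketch; the exact output may differ slightly between versions):

sage: DisjointSet(5)                      # -> DisjointSet_of_integers
{{0}, {1}, {2}, {3}, {4}}
sage: DisjointSet(['a', 'b', 'c'])        # -> DisjointSet_of_hashables
{{'a'}, {'b'}, {'c'}}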

comment:22 Changed 14 months ago by mkoeppe

  • Milestone changed from sage-9.2 to sage-9.3

comment:23 Changed 11 months ago by dcoudert

Timings with 9.3.beta5 on a MacBook Air. It seems we are much better now!

sage: G = Graph(loops=False, multiedges=False)
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: G.add_edges([(i, (i+37)%100) for i in range(100)])
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: H = G.copy()
sage: %timeit G == H
43.1 µs ± 821 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

sage: E = list(G.edges())
sage: F = list(H.edges())
sage: %timeit E == F
12.4 µs ± 357 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

sage: E = [(i, (i+85)%100) for i in range(100)] + \
....:     [(i, (i+37)%100) for i in range(100)] + \
....:     [(i, (i+85)%100) for i in range(100)]
sage: G = Graph(E, loops=False, multiedges=False, immutable=True)
sage: H = G.copy()
sage: %timeit G == H
31 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

comment:24 Changed 10 months ago by mkoeppe

  • Milestone changed from sage-9.3 to sage-9.4

Setting new milestone based on a cursory review of ticket status, priority, and last modification date.

comment:25 Changed 5 months ago by mkoeppe

  • Milestone changed from sage-9.4 to sage-9.5