Apr 22, 2021

Golang profiling

This is a walk-through on how to debug and fix performance issues in golang.

The story

Once, I had a task to implement a graph and to find the shortest path between two node. I found this two libraries for graph implementation

yourbasic/graph
gonum

While gonum is a very well renowned library, the implementation from yourbasic is more like a demo or prove of concept. Let’s see how these two performed for my tasks.

To get started I created a very simple program to test which one can get me the shortest path in shortest time for 1000 nodes.

...
func main() {
	fmt.Print("Begin graph test\n")
	rand.Seed(time.Now().UnixNano())
	noOfNodes := 1000
	fmt.Print("Creating the ping times table...")
	pingTimes := createPingTimes(noOfNodes)
	fmt.Print("Done.\n")

	from, to := int64(rand.Intn(noOfNodes)), int64(rand.Intn(noOfNodes))

	fmt.Print("=== basic graph ===\nBuilding graph...")
	sg := yourBasicGraphBuild(pingTimes)
	fmt.Print("Done.\nTesting shortest path...\n")
	yourBasicGraphPath(sg, from, to)
	fmt.Print("Done.\n")
	fmt.Print("=== gonum graph ===\nBuilding graph...")
	gg := gonumGraphBuild(pingTimes)
	fmt.Print("Done.\nTesting shortest path...\n")
	gonumGraphPath(gg, from, to)
	fmt.Print("Done.\n")
}
...

(download the full source code from here)

build, run, and the result was:

Begin graph test
Creating the ping times table...Done.
=== basic graph ===
Building graph...Done.
Testing shortest path...
time: 15.433212 ms,from 193, to 226, path [193 739 226], distance: 12
Done.
=== gonum graph ===
Building graph...Done.
Testing shortest path...
time: 141.207139 ms,from 193, to 226, path [193 739 226], distance: 12
Done.

Both implementations are using Dijkstra algorithm but the implementation from yourbasic is ten times faster than the gonum implementation.

In the next section we will find out why is such a big difference.

The fun part, debugging

I started by creating a benchmark test for each of the implementations

file: graph_test.go

...
func BenchmarkYourBasicGraph1000(b *testing.B) {
	g := createYourBasicGraph(1000)
	var from, to int64
	from = int64(rand.Intn(195) + 5)
	for {
		to := int64(rand.Intn(195) + 5)
		if from != to {
			break
		}
	}
	for n := 0; n < b.N; n++ {
		_, _ = basicgraph.ShortestPath(g, int(from), int(to))
	}
}
...
func BenchmarkGonumGraph1000(b *testing.B) {
	g := createGonumGraph(1000)
	var from, to int64
	from = int64(rand.Intn(195) + 5)
	for {
		to := int64(rand.Intn(195) + 5)
		if from != to {
			break
		}
	}
	for n := 0; n < b.N; n++ {
		pth := path.DijkstraFrom(simple.Node(from), g)
		_, _ = pth.To(to)
	}
}
...

(download the full source code from here)

and run the tests

go test -cpu 8 -benchmem -run=^$ -bench . graph_test.go

and the result is

goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkYourBasicGraph1000-8                 72          15834949 ns/op         1293871 B/op       1936 allocs/op
BenchmarkGonumGraph1000-8                      3         358852355 ns/op        169655304 B/op    868938 allocs/op
PASS
ok      command-line-arguments  5.532s

Now I can see that gonum implementation is worst on all 3 indicators: ns/op , B/op and allocs/op

This benchmark is only reading the nodes, but what about the memory consumption to the whole graph

file: graph_test.go

...
func TestYourBasicGraphMemory(t *testing.T) {
	var m, m2 runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&m)
	createYourBasicGraph(1000)
	runtime.ReadMemStats(&m2)
	totalBytes := m2.Alloc - m.Alloc
	t.Logf("Memory usage %s, %s, %s", byteCountIEC(m.Alloc), byteCountIEC(m2.Alloc), byteCountIEC(totalBytes))
}
...
func TestGonumGraphMemory(t *testing.T) {
	var m, m2 runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&m)
	createGonumGraph(1000)
	runtime.ReadMemStats(&m2)
	totalBytes := m2.Alloc - m.Alloc
	t.Logf("Memory usage %s, %s, %s", byteCountIEC(m.Alloc), byteCountIEC(m2.Alloc), byteCountIEC(totalBytes))
}
...

(download the full source code from here)

run the tests again

go test -v .

and get the results

=== RUN   TestYourBasicGraphMemory
    graph_test.go:84: Memory usage 231.5 KiB, 40.8 MiB, 40.5 MiB
--- PASS: TestYourBasicGraphMemory (0.11s)
=== RUN   TestGonumGraphMemory
    graph_test.go:111: Memory usage 239.2 KiB, 246.6 MiB, 246.3 MiB
--- PASS: TestGonumGraphMemory (0.76s)
PASS
ok      graphtest    1.142s

Interesting! The gonum implementation is using 246Mb of memory for 1000 nodes, while the yourbasic implementation is using only 40 Mb.

Next part is to profile the code

go test -run=^$ -bench=^BenchmarkGonumGraph1000$ -cpuprofile=cpu-gonum.p
go tool pprof -http=:8080 cpu-gonum.p

and if you select only where we search for the shortest path path.DijkstraFrom you will see that the runtime spends most of time by reading from the maps

Gonum CPU profiler

Just a side note: “if you don’t understand the implementation and if you have no idea what the code does, you will never have a chance to understand the profile graph”

Let’s look at the source code to see how the WeightedDirectedGraph is implemented

type WeightedDirectedGraph struct {
	nodes map[int64]graph.Node
	from  map[int64]map[int64]graph.WeightedEdge
	to    map[int64]map[int64]graph.WeightedEdge

	self, absent float64

	nodeIDs *uid.Set
}

There are 5 maps in this struct and this why is spending so much time on map lookup, after all we have 999.000 edges to search for.

Also this maps takes quite a lot of memory. At least 16 bytes for keys and another 24 bytes for WeightedEdge all this multiplied by the number of edges.

Let’s use the memory profiles to see what we can find

go test -run=^TestGonumGraphMemory$ -memprofile=mem-gonum.p
go tool pprof -http=:8080 mem-gonum.p

Gonum Memory profiler

You can see here there are huge allocations where edges are added to the from and to map, and this is the cause for the memory issue that we had.

On the other hand the implementation provided by yourbasic/graph is using only one slice of maps, the ID() of the node will be the index of the node, and this is what we will use next.

The only way to improve this is to provide our own implementation of the directed graph. Let’s name it LightWeightedGraph

type LightWeightedGraph struct {
	edges []map[int64]float64
}

Now, in order to be able to use this struct with gonum we need to implement some interfaces

file: graph.go

...
func (wg LightWeightedGraph) Nodes() graph.Nodes {
	return NewNodeIterator(len(wg.edges))
}

func (wg LightWeightedGraph) Node(id int64) graph.Node {
	return simple.Node(id)
}

func (wg LightWeightedGraph) From(id int64) graph.Nodes {
	from := wg.edges[id]
	return NewNodes(from)
}

func (wg LightWeightedGraph) Weight(xid, yid int64) (w float64, ok bool) {
	if xid == yid {
		return 0, true
	}
	if e, ok := wg.edges[xid][yid]; ok {
		return e, true
	}
	return 0, false
}

func (g LightWeightedGraph) SetWeightedEdge(e graph.WeightedEdge) {
	var (
		from = e.From()
		fid  = from.ID()
		to   = e.To()
		tid  = to.ID()
	)
	if fid == tid {
		panic("simple: adding self edge")
	}

	if g.edges[fid] == nil {
		g.edges[fid] = make(map[int64]float64, 1)
	}
	g.edges[fid][tid] = e.Weight()
}
...

(download the full source code from here)

The challenging part is that the library is using a node iterator to retrieve the nodes and the neighbors, we need to implement this too.

file: graph.go

...
type NodeIterator struct {
	len     int
	current int
}

func NewNodeIterator(len int) *NodeIterator {
	return &NodeIterator{
		len:     len,
		current: -1,
	}
}

func (i *NodeIterator) Next() bool {
	if i.current >= i.len-1 {
		return false
	}
	i.current++
	return true
}

func (i *NodeIterator) Len() int {
	return i.len - i.current
}

func (i *NodeIterator) Reset() {
	i.current = -1
}
...

(download the full source code from here)

This is a very simple node iterator that does nothing but to return the index of the node as id of the node.

The iterator for the From(id int64) function is a little more complicated because we need to provide an iterator over a map keys. Luckily in the gonum/graph source code I found a very nice implementation of a map iterator. The bad news is that we cannot use it, so I just copied.

file: map.go

type Nodes struct {
	nodes int
	iter  *mapIter
	pos   int
	curr  graph.Node
}

func NewNodes(nodes map[int64]float64) *Nodes {
	return &Nodes{nodes: len(nodes), iter: newMapIterNodes(nodes)}
}
...

(download the full source code from here)

Let’s update our main file and the gonumGraphBuild function

file: main_after.go

func gonumGraphBuild(pingTimes [][]int64) graph.Graph {
	g := NewLightWeightedGraph(len(pingTimes))
	for i := 0; i < len(pingTimes); i++ {
		for j := 0; j < len(pingTimes[i]); j++ {
			if j != i {
				e := simple.WeightedEdge{F: simple.Node(i), T: simple.Node(j), W: float64(pingTimes[i][j])}
				g.SetWeightedEdge(&e)
			}
		}
	}
	return g
}

(download the full source code from here)

and we run the test again

go run .

and the result is

Begin graph test
Creating the ping times table...Done.
=== basic graph ===
Building graph...Done.
Testing shortest path...
time: 15.727268 ms,from 725, to 172, path [725 172], distance: 6
Done.
=== gonum graph ===
Building graph...Done.
Testing shortest path...
time: 79.618552 ms,from 725, to 172, path [725 172], distance: 6
Done.

Not bad. By providing a custom implementation we managed reduce the running time by half. But still this is not as fast as the yourbasic implementation even if we did used exactly the same storage for the graph. It look like this is the overhead of using the iterator and a bigger data types int64 and float64

I had a dispute once about whether is the best to use a custom implementation vs to use a library.

This is the good example that is not always the best to use a library, and this is why:

the library has a different goal than yours. For gonum library, the graph part, it easy to see that the goal is to cover as much as possible of the all known algorithms. A side effect of this is that the interfaces contains more functions that you need and the implementations contains more fields than is required.

For example the type WeightedDirected interface contains To(id int64) Nodes function which is not used by the Dijkstra implementations but is used by the Louvain implementation for community detection and the field to map[int64]map[int64]graph.WeightedEdgewas added to the typeWeightedDirectedGraph just to fulfill this requirement.
```
type WeightedDirectedGraph struct {
    nodes map[int64]graph.Node
    from  map[int64]map[int64]graph.WeightedEdge
    to    map[int64]map[int64]graph.WeightedEdge // I don't need this

    self, absent float64

    nodeIDs *uid.Set // I don't need this
}
```
this extra fields adds an overhead of additional 128Mb of ram memory in my case.
you will spend a huge amount of time to understand how the library was structured, and why was structured, and in the end you might end up at the conclusion that the library will only solve half of the problem.
you will be forced to implement functions that you don’t need and you don’t use, and for some of them you will have to use
```
panic("not implemented")
```
which makes you code very ugly

In the end I chose to copy the implementation from yourbasic into my project and to adjust it to my needs.

Conclusion

The pprof tool for go is very powerful and it can help you troubleshoot performance and memory issues.

This are some articles that I used as inspiration for this post.

High Performance Go Workshop
Profiling Go Programs