Heka Log Parsing Performance

I’ve been experimenting with Heka throughput recently, and have discovered a few log parsing tips and tricks that I thought were worth sharing.

I ran tests on my Lenovo Thinkpad W530 laptop, which has four 2.8GHz cores, showing up as 8 cores to my Linux OS thanks to hyperthreading. From our ops team I got a couple of Nginx log files, each with over 6 million entries. I made three copies of each log file and put each copy in a separate directory, emulating the common use case of having to parse output from multiple hosts and/or domains. I configured Heka with a single LogstreamerInput set up to concurrently load all six files, using our Nginx access log decoder, which ultimately uses Lua’s LPEG engine to handle the actual parsing. This wasn’t a scientifically rigorous test; I was just trying to get some ballpark throughput numbers.

For my first run I set Go’s maxprocs value to 1, which makes only one core available for Heka to use. With that setting, I was processing in the neighborhood of 31K events per second (eps); so far so good. Then I cranked maxprocs up to 4. I ran the tests again and my throughput jumped all the way up to about 33K eps. Wait… what!?! I know that sometimes lock contention and other issues can cause poor performance across multiple cores, but we’re processing six distinct streams here; this should parallelize fairly well. I was seeing almost no improvement. What the heck was going on?
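A quick aside for anyone who doesn’t live in Go land: Heka’s maxprocs setting is, as far as I know, handed straight to the Go runtime’s GOMAXPROCS call, which bounds how many cores can run Go code at once. Here’s a minimal standalone sketch of that knob (not Heka code; the values just mirror my test runs):

```go
// Minimal standalone sketch (not Heka code) of what the maxprocs knob does:
// it bounds how many OS threads can execute Go code simultaneously.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// How many logical cores the OS reports (8 on my hyperthreaded W530).
	fmt.Println("logical cores:", runtime.NumCPU())

	// Limit Go to a single core, as in my first test run.
	// GOMAXPROCS returns the previous setting.
	prev := runtime.GOMAXPROCS(1)
	fmt.Println("previous GOMAXPROCS:", prev)

	// ...and later, allow four cores for the second run.
	runtime.GOMAXPROCS(4)
}
```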

Nothing to do but roll up my sleeves and dig in. I busted out the Go profiling tools, but nothing obvious jumped out at me. So I took a look at the source code, and was temporarily stunned by the force of my palm slamming into my forehead at high velocity. While the LogstreamerInput was correctly spinning up a separate goroutine for each unique log stream, it was only spinning up a single decoder to be shared by all of them. That meant that no matter how many cores were available to Heka as a whole, all of the parsing was happening in a single goroutine, which would only be running on a single core.

Luckily this was an easy fix. In about 30 minutes I had a working Heka that would spin up a separate decoder instance for each log stream; these changes have since landed in the v0.8.3 release. With maxprocs set to 1, this ran slightly slower than before, at about 30K eps. That wasn’t surprising, since the single core was now context switching through six different decoder goroutines whereas before there was just one. But cranking maxprocs up to 4 and trying again gave me around 85K eps. Now we’re talking! Not a linear improvement, but still a much better number. I got my best results with maxprocs set to 7, where I was consistently processing approximately 118K log lines per second.
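To make the shape of the change concrete, here’s a simplified sketch of the fix, written from scratch for this post rather than lifted from Heka’s source; the type and function names are mine:

```go
// Simplified illustration (not Heka's actual source) of the fix: each log
// stream goroutine gets its own decoder instance instead of all of them
// sharing one, so parsing work can spread across however many cores
// maxprocs allows.
package main

import (
	"fmt"
	"sync"
)

// decoder stands in for Heka's Lua-sandbox-backed decoder.
type decoder struct{ stream string }

func (d *decoder) decode(line string) {
	// The real decoder runs an LPEG grammar here; we just pretend.
	_ = line
}

func runStreams(streams map[string]<-chan string) {
	var wg sync.WaitGroup
	for name, lines := range streams {
		wg.Add(1)
		d := &decoder{stream: name} // the fix: one decoder per stream
		go func(lines <-chan string, d *decoder) {
			defer wg.Done()
			for line := range lines {
				d.decode(line) // each stream's parsing runs in its own goroutine
			}
		}(lines, d)
	}
	wg.Wait()
	fmt.Println("all streams drained")
}

func main() {
	lines := make(chan string, 1)
	lines <- `127.0.0.1 - - [10/Feb/2015:13:45:30 -0800] "GET / HTTP/1.1" 200 612`
	close(lines)
	runStreams(map[string]<-chan string{"hostA/access.log": lines})
}
```

The part that matters is that the decoder is constructed inside the per-stream loop; everything else is incidental scaffolding.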

I was actually able to do even better than that. In prior experimentation with log parsing performance, we’ve found that parsing timestamps is a very processor-intensive part of the job. We can lighten this load by configuring our Nginx servers to log timestamps using `$msec` instead of `$time_local` or `$time_iso8601` in the `log_format` setting. This means our log files contain Unix timestamps (seconds-since-epoch format) rather than more complicated date strings. The difference is significant. When I tested versions of the log files using the `$msec` timestamp format, my single-core throughput jumped to around 36K eps. Four cores gave me 105K eps, and 7 cores had me consistently parsing about 143K lines per second. Not too shabby!
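Heka’s decoder does the actual parsing in Lua with LPEG rather than in Go, but the cost difference between the two formats is easy to see in any language. Here’s a rough Go sketch, with made-up example values, contrasting a full date/time parse against a single float parse:

```go
// Rough illustration (in Go, though Heka's decoder uses Lua/LPEG) of why
// $msec is cheaper to handle than $time_local: one needs a full date parse
// with month names and a timezone offset, the other is a single float parse.
package main

import (
	"fmt"
	"strconv"
	"time"
)

func main() {
	// $time_local produces something like this, which needs a real
	// date parser to interpret.
	local := "10/Feb/2015:13:45:30 -0800"
	t1, err := time.Parse("02/Jan/2006:15:04:05 -0700", local)
	if err != nil {
		panic(err)
	}

	// $msec produces seconds-since-epoch with millisecond resolution;
	// one float parse and a multiply get us a timestamp.
	msec := "1423604730.123"
	secs, err := strconv.ParseFloat(msec, 64)
	if err != nil {
		panic(err)
	}
	t2 := time.Unix(0, int64(secs*1e9))

	fmt.Println(t1.UTC(), t2.UTC())
}
```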

It’s worth noting that there is still a lot of room for improvement. For instance, due to details of the runtime layers, pushing text from Go through C and into Lua (and then back again) involves more memory copies than it probably should. The Lua sandbox that Heka uses is implemented as a separate project, in pure C. We could get much higher parsing throughput by writing a thinner wrapper around the sandbox in a language that doesn’t require copies to talk to C, such as C++, Rust, or C itself. That would remove us from the rest of the Heka pipeline, of course, but since a considerable amount of the data processing that we do in Heka also happens in a sandbox, we’d still be able to do a lot of crunching; we’d just have to set up a bespoke pipeline with Lua code instead of using Heka’s routing infrastructure.

Even so, the full Heka stack is fast enough for most use cases, and is ready to be used today. I hope you’ll try it out, and come find us in the #heka channel on irc.mozilla.org, or on the Heka mailing list to ask questions and/or tell us about your experience.