Heka is a general purpose data processing tool, so it supports a variety of ways to get data into its processing pipeline. But, loading and parsing files from a filesystem is the primary use case for many users. We’ll talk about parsing in the future; this post is going to explore some of the challenges involved with loading.
At first blush it might not seem like there is much of a challenge. It’s a file on a disk somewhere, how hard can it be? You open a file handle and read in the data. Real world cases are rarely so simple, however. Log files don’t grow indefinitely. They’re usually subject to rotation and eventual deletion, and rotation schemes vary widely. Sometimes files are renamed with every rotation tick (e.g. `access.log`, `access.log.0`, `access.log.1`, etc.). Other times new files with new names are periodically created (e.g. `access-20140321.log`, `access-20140322.log`, `access-20140323.log`, etc).
Then there’s the issue of tracking the current location in a file. In some cases we’re loading up a significant backlog of historical log data. In others we’re tracking the tail of a file as it’s being generated in real time. Things get tricky if the process is stopped and restarted. Do we have to start our long import all over again, manually dealing with duplicates? Do we lose records that were generated while the process was down, however long that was? We’d rather be able to pick up where we left off. That’s not too hard in the single file case, but it gets complicated if the files may have rotated while the process was down.
Finally, sometimes different sets of files are actually of the same type. A single web server might be serving dozens of domains, each with its own set of access log files and error log files. All of the access log files use the same format, as do all of the error logs. We’d like to be able to express this elegantly, without having to copy and paste nearly identical configuration settings for each of the domains. We’d also like our log file loader to notice if a new domain is added, without the need to explicitly reconfigure and restart the loader every time.
With version 0.5, Heka introduces the LogstreamerInput to try and address these complexities. As the name implies, the LogstreamerInput’s basic logical unit isn’t a log file but a log stream. A log stream is a single linear data stream made up of one or more non-overlapping files with a clearly defined order. In our web server example, the full set of access log files for a single domain would be one log stream and the error log files would be another. Files for the other domains would be separate streams, though all access logs would be of the same type (ditto for all the error logs).
A single LogstreamerInput can manage and load many different log streams of a single type. You point a LogstreamerInput at the root of a directory tree and provide a regular expression that (relative to that root) matches the files that comprise the streams you’d like to track. The expression’s match groups are used to define a “differentiator” that distinguishes between the separate streams, and a “priority” that defines the ordering within the streams. You can also define one or more translation maps, which allow you to map from string values in your regex match groups to numeric values that specify the file order. Full details about how to set this up can be found in our documentation.
If this sounds like it might be a bit fiddly, well, it is. To simplify things, we’ve also included a standalone `heka-logstreamer` command line utility. You point this utility at your Heka configuration. It will extract any LogstreamerInput config settings and output all of the files that your config will match and the order in which they will be loaded. This will let you verify that your data will be processed correctly before you spin Heka up to start the real crunching.
When Heka is started, LogstreamerInput plugins that you’ve set up will scan their directories, looking for files that match the specified regular expressions and converting them to streams of data to be injected into the Heka pipeline. The folders will be periodically rescanned for file rotations and to see if there have been any new folders and/or files added. As data is being pulled in from the files, Logstreamer will keep track of how far it has advanced in each file, maintaining a ring buffer of the last 500 bytes read. The file location and a hash of the ring buffer contents will be flushed out to disk periodically (because crashes happen) and at shutdown, enabling seamless continuation when the Heka process restarts.
We’ve tested a wide variety of situations and are confident that the Logstreamer performs as expected in common scenarios. There will always be edge cases, of course. There are many fewer edge cases when your rotation scheme creates new files without renaming existing ones, so if losing even a single log line is unacceptable we recommend this approach. We’d love to have your help with testing. Try it out, find creative ways to break it, and let us know when you do.