Heka: Loading log files with Logstreamer
March 24, 2014

[Heka](https://github.com/mozilla-services/heka/) is a general purpose data processing tool, so it supports a variety of ways to get data into its processing pipeline. But loading and parsing files from a filesystem is the primary use case for many users. We'll talk about parsing in the future; this post is going to explore some of the challenges involved with loading.

At first blush it might not seem like there's much of a challenge. It's a file on a disk somewhere; how hard can it be? You open a file handle and read in the data. Real-world cases are rarely so simple, however. Log files don't grow indefinitely. They're usually subject to rotation and eventual deletion, and rotation schemes vary widely.
Sometimes files are renamed with every rotation tick (e.g. `access.log`, `access.log.0`, `access.log.1`, etc.). Other times new files with new names are periodically created (e.g. `access-20140321.log`, `access-20140322.log`, `access-20140323.log`, etc.).

Then there's the issue of tracking the current location in a file. In some cases we're loading a significant backlog of historical log data. In others we're tracking the tail of a file as it's being generated in real time. Things get tricky if the process is stopped and restarted. Do we have to start our long import all over again, manually dealing with duplicates? Do we lose records that were generated while the process was down, however long that was? We'd rather be able to pick up where we left off. That's not too hard in the single-file case, but it gets complicated if the files may have rotated while the process was down.

Finally, sometimes different sets of files are actually of the same type. A single web server might be serving dozens of domains, each with its own set of access log files and error log files. All of the access log files use the same format, as do all of the error logs.
We'd like to be able to express this elegantly, without having to copy and paste nearly identical configuration settings for each of the domains. We'd also like our log file loader to notice if a new domain is added, without the need to explicitly reconfigure and restart the loader every time.

With version 0.5, Heka introduces the LogstreamerInput to try to address these complexities. As the name implies, the LogstreamerInput's basic logical unit isn't a log *file* but a log *stream*. A log stream is a single linear data stream made up of one or more non-overlapping files with a clearly defined order. In our web server example, the full set of access log files for a single domain would be one log stream, and the error log files would be another. Files for the other domains would be separate streams, though all access logs would be of the same type (ditto for all the error logs).

A single LogstreamerInput can manage and load many different log streams of a single type. You point a LogstreamerInput at the root of a directory tree and provide a regular expression that (relative to that root) matches the files that comprise the streams you'd like to track.
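The stream-grouping idea can be sketched in Python: a regular expression with named match groups splits matching files into one stream per domain and orders the files within each stream. The directory layout, pattern, and function names below are illustrative assumptions, not Heka's actual code.

```python
import re
from collections import defaultdict

# Hypothetical layout: files like "example.com/access-20140321.log",
# where each domain's access logs form one stream.
FILE_MATCH = re.compile(r"(?P<Domain>[^/]+)/access-(?P<Date>\d{8})\.log$")

def build_streams(paths):
    """Group matching paths into streams keyed by domain, ordered by date."""
    streams = defaultdict(list)
    for path in paths:
        m = FILE_MATCH.search(path)
        if m:
            streams[m.group("Domain")].append((m.group("Date"), path))
    # Within each stream, older files sort first so they are read in order.
    return {d: [p for _, p in sorted(files)] for d, files in streams.items()}

paths = [
    "example.com/access-20140322.log",
    "example.com/access-20140321.log",
    "example.org/access-20140321.log",
]
print(build_streams(paths))
```

Everything that matches the expression is tracked; a newly added domain simply shows up as a new key the next time the directory is scanned.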
The expression's match groups are used to define a "differentiator" that distinguishes between the separate streams, and a "priority" that defines the ordering within the streams. You can also define one or more translation maps, which allow you to map from string values in your regex match groups to numeric values that specify the file order. Full details about how to set this up can be found in our [documentation](http://hekad.readthedocs.org/en/latest/pluginconfig/logstreamer.html).

If this sounds like it might be a bit fiddly, well, it is. To simplify things, we've also included a standalone `heka-logstreamer` [command line utility](http://hekad.readthedocs.org/en/latest/pluginconfig/logstreamer.html#verifying-settings). You point this utility at your Heka configuration, and it will extract any LogstreamerInput config settings and output all of the files that your config will match and the order in which they will be loaded.
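For the multi-domain web server example, a LogstreamerInput configuration using a differentiator and priority might look roughly like this sketch. The section name, directory, and regex here are illustrative; consult the documentation linked above for the authoritative option reference.

```toml
[domain-access-logs]
type = "LogstreamerInput"
log_directory = "/var/log/www"
# One stream per domain, with files ordered by the date in the name.
file_match = '(?P<Domain>[^/]+)/access-(?P<Date>\d+)\.log'
differentiator = ["Domain", "-access"]
priority = ["Date"]
```

Here `Domain` feeds the differentiator (so `example.com` and `example.org` become separate streams), while `Date` supplies the priority that orders the files within each stream.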
This will let you verify that your data will be processed correctly before you spin Heka up to start the real crunching.

When Heka is started, any LogstreamerInput plugins that you've set up will scan their directories, looking for files that match the specified regular expressions and converting them to streams of data to be injected into the Heka pipeline. The folders will be periodically rescanned for file rotations and for any new folders and/or files that have been added. As data is pulled in from the files, the Logstreamer keeps track of how far it has advanced in each file, maintaining a ring buffer of the last 500 bytes read. The file location and a hash of the ring buffer contents are flushed out to disk periodically (because crashes happen) and at shutdown, enabling seamless continuation when the Heka process restarts.

We've tested a wide variety of situations and are confident that the Logstreamer performs as expected in common scenarios. There will always be edge cases, of course. There are many *fewer* edge cases when your rotation scheme creates new files without renaming existing ones, so if losing even a single log line is unacceptable we recommend that approach.
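The restart bookkeeping described above, a saved file offset plus a hash of the trailing bytes, can be sketched in Python. This is a simplified illustration of the idea only; the hash choice, file format, and function names are assumptions, not Heka's actual implementation.

```python
import hashlib
import json

RING_SIZE = 500  # track a hash of the last 500 bytes read

def save_checkpoint(path, offset, ring):
    """Persist the read position and a hash of the trailing bytes."""
    state = {"offset": offset, "hash": hashlib.sha1(ring).hexdigest()}
    with open(path, "w") as f:
        json.dump(state, f)

def resume_offset(path, logfile):
    """Return the saved offset if the trailing bytes still match, else 0."""
    with open(path) as f:
        state = json.load(f)
    start = max(0, state["offset"] - RING_SIZE)
    with open(logfile, "rb") as f:
        f.seek(start)
        ring = f.read(state["offset"] - start)
    # A hash mismatch means the file was rotated or rewritten underneath us,
    # so the safe fallback is to start reading from the beginning.
    if hashlib.sha1(ring).hexdigest() == state["hash"]:
        return state["offset"]
    return 0
```

Re-hashing the bytes just before the saved offset is what lets a loader distinguish "same file, more data appended" from "a rotation replaced this file while we were down."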
We'd love to have your help with testing. Try it out, find creative ways to break it, and [let us know](https://github.com/mozilla-services/heka/issues) when you do.