In the 0.9.2 release of MDN, we are going to fix some integration issues between Django, Deki, and phpBB using Apache mod_rewrite.
A horrible thought occurred to me… What if one of our RewriteRule directives was written in a way that would cause catastrophic backtracking? Ouch. This could lead to slow performance or no page rendering. Symptoms may include hours of headdesking and painful mod_rewrite debugging.
What is Catastrophic Backtracking?
Regular expression engines work by walking through a string, trying to match a pattern. When the next character doesn't fit the current path but other permutations could still form a valid match, the engine backtracks: it backs up and retries the next possible path toward a match.
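Here is a small sketch of benign backtracking in action (my own illustrative example, not one from the post): a greedy quantifier races to the end of the string and then the engine backs up until the rest of the pattern can match.

```python
import re

text = '"foo" and "bar"'

# The greedy .* first consumes to the end of the string, then the
# engine backtracks character by character until the closing " matches.
greedy = re.match(r'".*"', text).group()
print(greedy)  # "foo" and "bar"

# The lazy version consumes as little as possible instead, so it stops
# at the first closing quote.
lazy = re.match(r'".*?"', text).group()
print(lazy)  # "foo"
```

A handful of backtracking steps like this is normal and harmless; the trouble starts when the number of paths to retry explodes.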
A few extra steps are always an acceptable price for the magic that is regular expressions, but pattern matching becomes catastrophic when we're talking about thousands of steps for inputs that are obviously invalid. To avoid catastrophic backtracking, we want our regular expressions to fail to match as quickly as possible.
Making this mistake is easier than one would think. The screenshot above is a simple contrived example: we try to find sentences that end with the letter "d", so "hello world" should match and "hello worldz" should not. We can see several backtracking steps in the output on the right. The best thing to do is to test your regexp against negative cases with long input. Maybe you are Mike Shaver and can spot catastrophic backtracking by timing your code, code-reviewing your regexp, or writing a unit test that exhibits it… but for us mere mortals this voodoo approach to finding backtracking issues is error-prone. Problems may not surface until your system is under heavy load.
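To make the failure mode concrete, here is a classic contrived pattern with nested quantifiers (my own sketch, not the rule from the screenshot). The group `(a+)+` can split a run of "a"s into exponentially many combinations, and a non-matching suffix forces the engine to try every one before giving up:

```python
import re
import time

# Nested quantifiers: (a+)+ can partition a run of "a"s in exponentially
# many ways, and the anchored "$" makes the engine try them all before
# it concedes that the input can never match.
evil = re.compile(r"(a+)+$")

for n in (14, 18, 22):
    s = "a" * n + "!"  # can never match: does not end in "a"
    t0 = time.perf_counter()
    assert evil.match(s) is None
    print(f"n={n}: failed after {time.perf_counter() - t0:.4f}s")
```

Each extra "a" roughly doubles the time to fail, which is exactly the kind of behavior a quick positive-case test will never reveal.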
I tried a different tack, which is much more empirical. I installed RegexBuddy, a Windows-only tool (don't worry, I have Windows safely caged in a VM on Ubuntu). It is marketed as a great resource for learning regexps, but it is also perfect for profiling your regular expressions: it will execute a pattern against a sample input and show you the total number of steps it took to match or fail.
I put my candidate Apache mod_rewrite patterns into the UI and fed it good and bad URLs. I made sure that good URLs matched in a reasonable number of steps, but more importantly I tested how many steps a "bad" URL took. Fake input like /en-US/some/wacky/page/should/fail/fast?foo=bar&baz=buz should stop within a dozen steps or fewer. Don't just test for successful matches and call it a day. How does a small amount of non-matching input do? How about a large amount of non-matching input? If the engine takes thousands of steps before calling it quits, you know you're gonna have issues – a slow web server… or worse.
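The same good-URL/bad-URL drill can be scripted. This sketch uses a hypothetical rewrite-style pattern (not the actual MDN rule) and simply times each match; a real harness could compare timings across input lengths to flag patterns that slow down on longer non-matching input:

```python
import re
import time

# Hypothetical candidate pattern in the spirit of a RewriteRule regex
# (illustrative only, not the real MDN rewrite rule).
pattern = re.compile(r"^/en-US/docs/[\w-]+$")

good = ["/en-US/docs/HTML", "/en-US/docs/CSS-basics"]
bad = ["/en-US/some/wacky/page/should/fail/fast?foo=bar&baz=buz"]

for url in good + bad:
    t0 = time.perf_counter()
    m = pattern.match(url)
    dt = time.perf_counter() - t0
    print(f"{url!r}: match={bool(m)} in {dt * 1e6:.1f}µs")
```

Python's `re` engine is not Apache's PCRE, so the step counts differ, but a pattern that backtracks catastrophically in one backtracking engine will generally misbehave in the other too.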
RegexBuddy is closed source and commercial, so I'd love to hear about FOSS tools that profile regexps.