Investigating a performance issue with a regex
This post is part of the series 'Crash investigations and code reviews'. Be sure to check out the rest of the blog posts of the series!
- Investigating a performance issue with a regex (this post)
- Investigating an infinite loop in Release configuration
- Investigating a crash in Enumerable.LastOrDefault with a custom collection
Regexes are very useful to extract information from a string. In the following example, the regex extracts a name and a version from a string such as WhatEverReference("abc", "1.0.0"). The string can contain multiple references anywhere in it, and we need to get all the name-version pairs it contains.
using System;
using System.Text.RegularExpressions;

class Program
{
    // Extracts the name and the version of each reference found in the input
    private static readonly Regex regex = new Regex(
        @".*Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)"".*",
        RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture,
        TimeSpan.FromSeconds(2));

    static void Main(string[] args)
    {
        var str = @"
This is a reference WhatEverReference(""abc"", ""1.0.0"")
This is another one Reference ( ""def"", ""2.0.0"" )";
        foreach (Match match in regex.Matches(str))
        {
            Console.WriteLine($"{match.Groups["NAME"].Value}@{match.Groups["VERSION"].Value}");
        }
    }
}
The regex is valid and captures the expected values. The problem is that evaluating it is slow: a few milliseconds on a 20kB string. In our case, we may need to scan a few hundred files while the user waits for the result in a GUI application, and we don't want them to wait for a few seconds.
In the regex, we only need to extract the data from the named groups. The full captured string (i.e. Match.Value) is not useful. However, the previous pattern captures the whole line because of the leading and trailing .*, which serves no purpose here. Here's a demo:
var str = @"
This is a reference WhatEverReference(""abc"", ""1.0.0"")
This is another one Reference ( ""def"", ""2.0.0"" )";
foreach (Match match in regex.Matches(str))
{
    Console.WriteLine(match.Value);
}
// Output:
// This is a reference WhatEverReference("abc", "1.0.0")
// This is another one Reference ( "def", "2.0.0" )
The solution is to remove the leading and trailing .* from the regex, so the evaluator only captures what is interesting for our use case:
// Before: with the leading and trailing ".*"
new Regex(@".*Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)"".*", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture, TimeSpan.FromSeconds(2));
// After: without the leading and trailing ".*"
new Regex(@"Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)""", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture, TimeSpan.FromSeconds(2));
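To check that the trimmed pattern still extracts the same data, here is a minimal sketch that reuses the sample string from the beginning of the post. The class name TrimmedPatternDemo and the Describe helper are hypothetical names chosen for this illustration. Only the pattern comes from the post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; the pattern is the one without the leading and trailing ".*".
public static class TrimmedPatternDemo
{
    private static readonly Regex TrimmedRegex = new Regex(
        @"Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)""",
        RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture,
        TimeSpan.FromSeconds(2));

    // Returns one "matched text => name@version" entry per reference found in the input.
    public static List<string> Describe(string input)
    {
        var result = new List<string>();
        foreach (Match match in TrimmedRegex.Matches(input))
        {
            result.Add($"{match.Value} => {match.Groups["NAME"].Value}@{match.Groups["VERSION"].Value}");
        }
        return result;
    }

    public static void Main()
    {
        var str = @"
This is a reference WhatEverReference(""abc"", ""1.0.0"")
This is another one Reference ( ""def"", ""2.0.0"" )";

        foreach (var line in Describe(str))
        {
            Console.WriteLine(line);
        }
        // Output:
        // Reference("abc", "1.0.0" => abc@1.0.0
        // Reference ( "def", "2.0.0" => def@2.0.0
    }
}
```

The named groups contain the same values as before, while Match.Value now starts at "Reference" and stops at the closing quote of the version instead of covering the whole line.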
On a 20kB file, the difference is huge. In a .NET Core 3.1 application, removing the 4 useless characters improves the performance by a factor of 1000! The recent .NET 5 regex performance improvements also help mitigate the cost of the bad regex.
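If you want to compare the two patterns on your own machine, the following is a rough sketch based on Stopwatch rather than a rigorous benchmark. The RegexTimingSketch class, its helpers, the synthetic input, and the iteration count are all assumptions made for this illustration, so the timings will not match the figures quoted above. The input has at most one reference per line because the original pattern matches whole lines and therefore yields at most one match per line.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical timing harness: compares the original and the trimmed pattern.
public static class RegexTimingSketch
{
    private const RegexOptions Options =
        RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture;

    // Part of the pattern shared by both variants.
    private const string Core =
        @"Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)""";

    public static readonly Regex Original = new Regex(".*" + Core + ".*", Options, TimeSpan.FromSeconds(2));
    public static readonly Regex Trimmed = new Regex(Core, Options, TimeSpan.FromSeconds(2));

    // Builds made-up input: filler lines with one reference every 50th line.
    public static string BuildSample(int lineCount)
    {
        var sb = new StringBuilder();
        for (var i = 0; i < lineCount; i++)
        {
            sb.Append(i % 50 == 0
                ? $"WhatEverReference(\"pkg{i}\", \"1.0.{i}\")\n"
                : "some unrelated text that does not contain any match at all\n");
        }
        return sb.ToString();
    }

    // Returns the "name@version" pairs found by the given regex.
    public static List<string> Extract(Regex regex, string input)
    {
        var pairs = new List<string>();
        foreach (Match match in regex.Matches(input))
        {
            pairs.Add($"{match.Groups["NAME"].Value}@{match.Groups["VERSION"].Value}");
        }
        return pairs;
    }

    // Runs the extraction several times and returns the total elapsed time.
    public static TimeSpan Measure(Regex regex, string input, int iterations)
    {
        Extract(regex, input); // warm-up run, not measured
        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++)
        {
            Extract(regex, input);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Main()
    {
        var input = BuildSample(340);
        Console.WriteLine($"Input length: {input.Length} characters");
        Console.WriteLine($"Original: {Measure(Original, input, 20).TotalMilliseconds} ms");
        Console.WriteLine($"Trimmed:  {Measure(Trimmed, input, 20).TotalMilliseconds} ms");
    }
}
```

For numbers you can rely on, a dedicated tool such as BenchmarkDotNet is a better choice than Stopwatch, since it handles warm-up and statistical noise for you.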
In conclusion, be sure to capture only what you really need to parse in the regex!