Investigating a performance issue with a regex

 
 
  • Gérald Barré

This post is part of the series 'Crash investigations and code reviews'. Be sure to check out the rest of the blog posts of the series!

Regexes are very useful for extracting information from a string. In the following example, the regex extracts a name and a version from a string such as WhatEverReference("abc", "1.0.0"). The string can contain multiple references anywhere in it, and we need to get all the name-version pairs it contains.

C#
private static readonly Regex regex = new Regex(
    @".*Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)"".*",
    RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture,
    TimeSpan.FromSeconds(2));

static void Main(string[] args)
{
    var str = @"
This is a reference WhatEverReference(""abc"", ""1.0.0"")
This is another one Reference ( ""def"", ""2.0.0"" )";

    foreach (Match match in regex.Matches(str))
    {
        Console.WriteLine($"{match.Groups["NAME"].Value}@{match.Groups["VERSION"].Value}");
    }
}

The regex is valid and captures the expected values. The problem is that evaluating it is slow: a few milliseconds on a 20kB string. In our case, we may need to scan a few hundred files while the user is waiting for the result in a GUI application, and we don't want them to wait for a few seconds.

In the regex, we only need to extract the data from the named groups. The full matched string (i.e. Match.Value) is not useful. However, the previous pattern matches the whole line because of the leading and trailing .*, which is useless here. Here's a demo:

C#
var str = @"
This is a reference WhatEverReference(""abc"", ""1.0.0"")
This is another one Reference ( ""def"", ""2.0.0"" )";

foreach (Match match in regex.Matches(str))
{
    Console.WriteLine(match.Value);
}

// Output:
// This is a reference WhatEverReference("abc", "1.0.0")
// This is another one Reference ( "def", "2.0.0" )

The solution is to remove the leading and trailing .* from the regex, so the evaluator only captures what is interesting for our use case:

C#
// Before: with the leading and trailing ".*"
new Regex(@".*Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)"".*", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture, TimeSpan.FromSeconds(2));

// After: without the leading and trailing ".*"
new Regex(@"Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)""", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture, TimeSpan.FromSeconds(2));
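
Before swapping the patterns, it can be worth checking that both extract the same name-version pairs. The following is a minimal sketch, not code from the original application: the class name ReferenceExtractor and the Extract helper are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for this sketch
public static class ReferenceExtractor
{
    private const RegexOptions Options =
        RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture;

    // Pattern with the leading and trailing ".*"
    public static readonly Regex Original = new Regex(
        @".*Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)"".*",
        Options, TimeSpan.FromSeconds(2));

    // Same pattern without the leading and trailing ".*"
    public static readonly Regex Trimmed = new Regex(
        @"Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)""",
        Options, TimeSpan.FromSeconds(2));

    // Returns every "name@version" pair found by the given regex
    public static List<string> Extract(Regex regex, string input)
    {
        var result = new List<string>();
        foreach (Match match in regex.Matches(input))
        {
            result.Add($"{match.Groups["NAME"].Value}@{match.Groups["VERSION"].Value}");
        }

        return result;
    }
}
```

On the sample string used in this post, both patterns should return abc@1.0.0 and def@2.0.0. Note that they are not strictly equivalent: when a single line contains several references, the greedy leading .* makes the original pattern report only the last one on that line, whereas the trimmed pattern reports all of them.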

On a 20kB file, the difference is huge. In a .NET Core 3.1 application, removing the 4 useless characters improves the performance by a factor of 1000! The recent .NET 5 regex performance improvements also help to mitigate the cost of the bad regex.
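
If you want a rough feel for the difference on your own machine, here is a minimal timing sketch based on Stopwatch. It is not the benchmark used for the numbers above: the class RegexTiming, the helpers BuildInput and Measure, and the synthetic input are all assumptions made for illustration, so the ratio you observe will depend on the input shape and the runtime. For reliable numbers, a dedicated tool such as BenchmarkDotNet is a better fit.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for this sketch
public static class RegexTiming
{
    private const RegexOptions Options =
        RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture;

    public static readonly Regex WithDotStar = new Regex(
        @".*Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)"".*",
        Options, TimeSpan.FromSeconds(2));

    public static readonly Regex WithoutDotStar = new Regex(
        @"Reference\s*\(\s*[$@]*?""(?<NAME>[\w\.-]+?)""\s*,\s*[$@]*""(?<VERSION>[\w\.-]+?)""",
        Options, TimeSpan.FromSeconds(2));

    // Builds synthetic text of at least minLength characters, one reference per line.
    // This is an assumption: real files will look different.
    public static string BuildInput(int minLength)
    {
        var builder = new StringBuilder();
        for (var i = 0; builder.Length < minLength; i++)
        {
            builder.Append("Some filler text before Reference(\"package")
                   .Append(i)
                   .Append("\", \"1.0.")
                   .Append(i)
                   .Append("\") and some text after\n");
        }

        return builder.ToString();
    }

    // Runs the regex `iterations` times over the input and returns the match count and total time
    public static (int MatchCount, TimeSpan Elapsed) Measure(Regex regex, string input, int iterations)
    {
        // Warm-up run so that one-time costs are not part of the measurement
        var matchCount = regex.Matches(input).Count;

        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++)
        {
            // Reading Count forces the lazy MatchCollection to evaluate every match
            matchCount = regex.Matches(input).Count;
        }

        stopwatch.Stop();
        return (matchCount, stopwatch.Elapsed);
    }

    public static void Run()
    {
        var input = BuildInput(20_000);
        var slow = Measure(WithDotStar, input, 100);
        var fast = Measure(WithoutDotStar, input, 100);
        Console.WriteLine($"With .*:    {slow.MatchCount} matches in {slow.Elapsed.TotalMilliseconds} ms");
        Console.WriteLine($"Without .*: {fast.MatchCount} matches in {fast.Elapsed.TotalMilliseconds} ms");
    }
}
```

Calling RegexTiming.Run() from Main prints both timings. The synthetic input has exactly one reference per line, so both patterns should report the same number of matches; only the elapsed time should differ.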

In conclusion, be sure to capture only what you really need to parse in the regex!
