Tuesday, January 12, 2016

Let's say no to stackoverflow exception of Java/Scala Regex Library

Earlier this week, my friend complained about the annoying Java Stock Regex library. A simple program checking for valid Base64 encoded message as follows,
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.io.BufferedReader;
import java.io.FileReader;
public class Base64J 
{
 private static final String REGEX = "^\\s*(?:(?:[a-z0-9+/]\\s*){4})*(?:(?:[a-z0-9+/]\\s*){2}\\s*[a-z0-9+/=]\\s*=)?\\s*$";
 public static void main(String[] args) 
 {
  Pattern p = Pattern.compile(REGEX,Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
  try (BufferedReader br = new BufferedReader(new FileReader("/tmp/base64.txt"))) 
  {
   String line;
   while ((line = br.readLine()) != null) 
   {
    Matcher m = p.matcher(line);
    if (m.matches())
    {
     for (int i = 0; i < m.groupCount() ; i++)
     {
      System.out.print(m.group(i));
      System.out.print(",");
     }
     System.out.println();
    } else 
    {
     System.out.println("not matched.");
    }
   }
  }
  catch (Exception e) 
  {
   System.out.println(e);
  }
 }
}
throws a StackOverflow Exception when given a input file with size 23KB.
With a google search I found the following subset of complaints surfacing on the first page


  1. http://bugs.java.com/view_bug.do?bug_id=5050507 
  2. http://www.coderanch.com/t/449271/java/java/java-regular-expression-giving-stack
  3. https://bugs.openjdk.java.net/browse/JDK-7081350 

 Here is the quote the response made by Java maintainer in the first link,
It will always be possible to create patterns that use excessive memory or take 10,000 years to complete. We cannot prevent all these problems without destroying the power of regular expressions. See workaround.
###@###.### 2004-05-20
2004-05-20 WORK AROUND
Use "^([0-9a-fA-F])+$"
or "^((?i)[0-9a-f])+$"
or even "^([a-fA-F]|\\d)++$"

Avoid alternation whenever possible, alternations are inefficient, causing slow match times as well. So the source referenced in the Pattern doc for more information about writing efficient patterns.
###@###.### 2004-05-20 2004-05-20
This problem remains unresolved and is now spread to Scala.

Another search on Google revealed almost no alternative to this library. I ported my Haskell implementation regex-pderiv to Scala. The newly ported library seems to be a good fix to this problem.

Please refer to the following github project pages for detailed examples of using the new library in Java and Scala.

  1. https://github.com/luzhuomi/scala-pderiv-java-ex
  2. https://github.com/luzhuomi/scala-pderiv