Bug 2663 - Regular Expressions bug
Summary: Regular Expressions bug
Status: RESOLVED FIXED
Alias: None
Product: Class Libraries
Classification: Mono
Component: System ()
Version: unspecified
Hardware: PC Mac OS
: Normal normal
Target Milestone: Untriaged
Assignee: Martin Baulig
URL:
Depends on:
Blocks:
 
Reported: 2011-12-28 09:47 UTC by Miguel de Icaza [MSFT]
Modified: 2014-11-23 22:57 UTC (History)
5 users (show)

Tags:
Is this bug a regression?: ---
Last known good build:

Notice (2018-05-24): bugzilla.xamarin.com is now in read-only mode.

Please join us on Visual Studio Developer Community and in the Xamarin and Mono organizations on GitHub to continue tracking issues. Bugzilla will remain available for reference in read-only mode. We will continue to work on open Bugzilla bugs, copy them to the new locations as needed for follow-up, and add the new items under Related Links.

Our sincere thanks to everyone who has contributed on this bug tracker over the years. Thanks also for your understanding as we make these adjustments and improvements for the future.


Please create a new report on GitHub or Developer Community with your current version information, steps to reproduce, and relevant error messages or log files if you are hitting an issue that looks similar to this resolved bug and you do not yet see a matching new report.

Related Links:
Status:
RESOLVED FIXED

Description Miguel de Icaza [MSFT] 2011-12-28 09:47:04 UTC
Run the following code in "csharp":

using System.Text.RegularExpressions;
var r = new Regex ("^(S|SW)?$"); r.Match ("SW");

This displays an empty match, if you remove either the "S|" or the "?" the match works.

This was reported on #monotouch by divil.
Comment 1 Miguel de Icaza [MSFT] 2011-12-28 09:47:24 UTC
Paolo, could you please look at this bug?
Comment 2 Tim Dawson 2012-01-14 06:18:58 UTC
Any chance this will be addressed by the next release? Unfortunately it's stopping one of our most basic parsers from operating correctly.
Comment 3 Robert Wilkens 2012-05-26 17:39:21 UTC
Incidentally, I noticed if you swap the "S|SW" to be "SW|S" it works as far as I can tell.....

Is it broken?

Well, it matches the "S" first, and that's a short circuit of the "|" (or) ... and it falls out of the parenthesis test... but the next thing it would match is end-of-line ($) because of the short circuit.... But it finds a "W" instead of end of line, and fails....  IF you put the SW before the S it would check that first... and it would succeed.

I've debugged this a bit and that's how i came to this conclusions (of what it's doing)..  

Please let me know if you want me to look into this further.
Comment 4 Robert Wilkens 2012-05-26 17:42:49 UTC
p.s. This might be a more appropriate way to test that particular regex:
var r = new Regex ("^(SW?)?$");
Comment 5 Tim Dawson 2012-05-28 13:18:30 UTC
This is the original regex that gave rise to this bug that I reported on IRC:

^([0-9]{4})(N|E|W|S|NE|NW|SE|SW|NDV)?$

It's such a trivial one, this really does need to be fixed.
Comment 6 Robert Wilkens 2012-05-28 18:13:27 UTC
Well, it is pattern matching and not string matching.  To format above as a regular expression, I would (as i was trained long ago in an undergraduate computer science class in the 90s) do something like this:

^([0-9]{4})(N([EW]|DV)?|E|W|S[EW]?)?$

Which seems to work fine for the same pattern you seem to be trying to match.

Please understand, i think this problem is assigned to someone else - who probably works with Xamarin, i am just a community volunteer with time on my hands.

p.s. If you think it's easy to change it, you are welcome to do it yourself - mono is open source.  I had just noticed the problem report had been sitting here close to 6 months and figured me chiming in wouldn't hurt anything.
Comment 7 Robert Wilkens 2012-06-06 07:53:48 UTC
Without looking at the code, i have an idea on how to make it work the way the user wants, and if i'm right this shouldn't be that complicated:
Whenever there is an "or" condition, if there is a match in one "or" path, push the entire state of the matching machine onto the stack, and continue as normal. but after a failure, pop the machine off the stack and continue with the next or condition if there is one -- by popping the machine off the stack we will remembered where we were (what we matched so far at that point) and what to try next..
Comment 8 Robert Wilkens 2012-06-07 08:44:01 UTC
Ok, believe it or not, I have a fix for this.  It didn't involve using a stack either because that turned out to be too complicated when we're dealing with recursion as it was written.

I have to do some major code clean up because i put a lot of debug code in here as i was working, and i'll want to try to break some of my code out into separate functions to make it more readable (at least in one spot where nesting got way too deep).  it'll be a little while before i commit the changes in other words.  

However, i might have a fix for this committed either today or tomorrow.  Still will have to run unit tests too to make sure this doesn't break anything else before i do that.  I still have to generate some unit tests too, and i will likely base that at least in part on your sample regular expressions.  But mid day today i have a dentist appointment, so i might not get it done today.
Comment 9 Robert Wilkens 2012-06-07 10:17:42 UTC
Well, while my fix seemed to work (it passed when it should have passed, and failed when it should have failed) for your extended pattern, it caused some existing unit test failures.. Have to investigate further.  

This is going to take some time to investigate.
Comment 10 Robert Wilkens 2012-06-07 17:12:59 UTC
Update: I've now got it passing all the previous unit tests that were passing, and passing the test associated with this problem report, I believe.

I still want to clean up my code a little (though i'm afraid to break anything that's working by doing such) because some lines (due to nesting) in one method go way beyond 80 characters, and the suggestion in the code style guide says its a good idea to keep lines under 80 characters, so i have to make sub-method(s) for this to call to which bring this back in line.  None of this should break anything, but i'll retest everything when i'm done for good measure.

And finally, i still have to add unit tests for what was failing before, as seen in above problem report.
Comment 11 Robert Wilkens 2012-06-07 20:24:25 UTC
Ok, I have submitted a patch for this.  Whether or not it is accepted is out of my hands.  Also, it may be significantly delayed because it is queued behind some of my earlier patches which require, I'm guessing, a more intense review.
Comment 12 Tim Dawson 2012-06-08 03:52:40 UTC
Fantastic, thank you!
Comment 13 Robert Wilkens 2012-06-08 09:46:24 UTC
This appears to also fix a related bug, i'll mark it there :

https://bugzilla.novell.com/show_bug.cgi?id=541823
Comment 14 Robert Wilkens 2012-06-18 09:21:21 UTC
For reference, i got some guidance on how to clean up my branch so that it only contained this fix, submitted it this morning, and it looks like miguel d
Comment 15 Robert Wilkens 2012-06-18 09:21:43 UTC
(continued)
de icaza put it through.  Sorry for half messages.
Comment 16 Miguel de Icaza [MSFT] 2012-06-18 21:46:56 UTC
Robert, thanks for the bug fix!

Assigned the bug to me to backport this to the relevant branches.
Comment 17 Weeble 2012-07-25 07:09:38 UTC
I think bug 6198 is related to this patch. It demonstrates extremely poor performance (appearing as an application hang) for certain regexes.
Comment 18 Robert Wilkens 2012-07-25 08:45:03 UTC
I'll try to look into it in about 10 days or so, if that's OK.  I have a corrupt Linux install and won't be able to reinstall until around August 1st.  This is probably an easy fix for 6198, given my pre-existing knowledge of the fix for this bug, if it really is as simple as a regex as you describe in 6198.  Unfortunately, I have no way of marking this when I get back, so I just hope I remember to get back to that one.  I'll add myself to cc: list of that one.
Comment 19 Robert Wilkens 2012-07-25 10:40:27 UTC
If someone wants to back out this fix until I get a chance to look at this, that is OK -- but I will fix it with the other bug fix again after august 1st sometime.

The fix for this bug works for this bug, and seemed to work for all the 'make check' checks (unit tests), but apparently I missed some condition or check, or am looping on something incorrectly.  It should be an easy fix.  Please someone try to remember to contact me about this after march 1st sometime and i'll get back to this.
Comment 20 Robert Wilkens 2012-07-25 11:00:33 UTC
Oops replace march 1st with August 1st in last comment.
Comment 21 Robert Wilkens 2012-07-26 17:54:02 UTC
Ok, I have a fix for the reported problem in 6198 mentioned here.  I will submit it eventually, but need some time.  Essentially, every time it looped through in Eval() around !OutOfOptions, it now needs to check to make sure (on return from EvalReal) that we haven't already come back with the same JumpTestList which we have previously come back with - because if we have we're repeating ourselves which we shouldn't do more than once.  I ran the revision against both the test in 6198 and all the unit tests and it worked fine/quick.
Comment 22 Martin Potter 2012-08-19 01:44:08 UTC
We have had to revert the commit associated with this, https://github.com/mono/mono/commit/60a463c6f3164784939cd9535acf865f6b33ef20, on our fork due to:

1. Performance issues: This commit introduced a huge performance problem, as noted in bug 6198. One of our test fixtures went from completing in 0.22 seconds to 555.41 seconds
2. Regressions: Several of out unit tests now fail with the changes in the aforementioned commit. The tests pass on Microsoft's implementation and on Mono if the commit is reverted.

I am trying to simplify the test cases that are now failing so that they can hopefully be added to the existing unit tests.
Comment 23 Robert Wilkens 2012-08-19 07:07:07 UTC
Just remember to revert it from 2.10 as well.  Thanks for taking it out - when i saw the performance hit, I was not happy with it.
Comment 25 Weeble 2012-08-21 16:53:34 UTC
Just to clarify - Martin is saying he reverted the commit on *his* fork, not on any of the official mono repositories, yes?

I've also found failing unit tests in our code that I believe are due to this. I reduced them to this simple example:

Regex.IsMatch("AB","^(A|B)$")

This should return False (the only strings that pattern should ever match are "A" and "B") but in mono 2.11.2 and 2.11.3 it returns True.
Comment 26 Martin Potter 2012-08-21 17:08:28 UTC
That is correct, I reverted this on *my* fork, not the official mono repositories. This commit introduced a bug where it will incorrectly match any combination of the alternations as long as they are encountered in the same order as in the regex. For example, Regex.IsMatch("0000NWNESENDV", "^([0-9]{4})(N|E|W|S|NE|NW|SE|SW|NDV)?$") returns True.
Comment 27 Pavlos Touboulidis 2012-11-06 07:49:57 UTC
Here's a few different testcases where the bug is triggered by the "{ n }" quantifier.

http://pastebin.com/ru48SbJs
Comment 28 Miguel de Icaza [MSFT] 2014-11-23 22:57:29 UTC
Fixed