Regular Expressions bug


I ran into a nasty bug with find-and-replace and regular expressions recently, and wanted to make sure folks were aware of it. The issues are...

  • What's shown in the find-and-replace studio panel does not necessarily line up with what is actually matched by the expression. (You can tell by clicking on the matches and seeing what highlights and what doesn't.)
  • If you run the replacement with group replacement and any "missing" matches, what you get out is unpredictable.

I've uploaded a test file that demonstrates the issue.

My machine is running macOS 10.15.7. I'm on Publisher 1.9.1.

As a separate issue, some matching characters don't work as expected. For instance, the non-greedy qualifier "?" seems to pick out individual letters rather than grabbing whole chunks of text.

RegEx_find_replace_bug.afpub


12 minutes ago, Colin_Fredericks said:

What's shown in the find-and-replace studio panel does not necessarily line up with what is actually matched by the expression. (You can tell by clicking on the matches and seeing what highlights and what doesn't.)

I can confirm that with your test file, and using Publisher 1.9.2.1009 beta on Windows.

 

12 minutes ago, Colin_Fredericks said:

If you run the replacement with group replacement and any "missing" matches, what you get out is unpredictable.

I don't understand what you mean there.

13 minutes ago, Colin_Fredericks said:

As a separate issue, some matching characters don't work as expected. For instance, the non-greedy qualifier "?" seems to pick out individual letters rather than grabbing whole chunks of text.

Example, please.

-- Walt
Designer, Photo, and Publisher V1 and V2 at latest retail and beta releases
PC:
    Desktop:  Windows 11 Pro, version 23H2, 64GB memory, AMD Ryzen 9 5900 12-Core @ 3.00 GHz, NVIDIA GeForce RTX 3090 

    Laptop:  Windows 11 Pro, version 23H2, 32GB memory, Intel Core i7-10750H @ 2.60GHz, Intel UHD Graphics Comet Lake GT2 and NVIDIA GeForce RTX 3070 Laptop GPU.
iPad:  iPad Pro M1, 12.9": iPadOS 17.4.1, Apple Pencil 2, Magic Keyboard 
Mac:  2023 M2 MacBook Air 15", 16GB memory, macOS Sonoma 14.4.1


38 minutes ago, walt.farrell said:

I don't understand what you mean there.

Explaining group matching (feel free to skip if you know this):

"Group matching" is when you surround something with parentheses in a regular expression. It captures the matched group so you can put it back in later ("group replacement").

For example, if I wanted to change <strong class="example">emphasis</strong> to <em class="example">emphasis</em>, and also change <strong style="font-size:large;">emphasis</strong> to <em style="font-size:large;">emphasis</em> at the same time, I could do this:

Find: <strong(.*?)>(.*?)</strong>
Replace: <em$1>$2</em>

and it'll replace the <strong> tags with <em> tags and keep any of the attributes. The first group is $1, the second group is $2, etc. The (.*?) group says "any character, zero or more of them, don't get greedy."
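For anyone who wants to try the same find/replace outside Publisher, here is a minimal sketch in Python. It is only an illustration: Python's `re` module writes the group references as `\1` and `\2` rather than `$1` and `$2`, the sample string is made up from the two cases above, and Publisher's regex engine may differ in other details.

```python
import re

# Sample text built from the two cases described above (illustrative only).
html = ('<strong class="example">emphasis</strong> and '
        '<strong style="font-size:large;">emphasis</strong>')

# Group 1 captures the attributes, group 2 captures the tag contents.
pattern = r'<strong(.*?)>(.*?)</strong>'

# Python spells the backreferences \1 and \2 instead of $1 and $2.
result = re.sub(pattern, r'<em\1>\2</em>', html)
print(result)
```

Both tags are rewritten in one pass and each keeps its own attributes, because each match carries its own captured groups.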

(And yes, I know, using regex with HTML is contraindicated; it's just the first example that came to mind.)

In this particular case:

If you run that kind of replacement when there are lines in the "Find and Replace" studio panel that do not highlight their corresponding places in the text, you get the sorts of results that you see in the sample file. The groups that are replaced do not match up with the groups that are found, and I haven't found a reliable pattern to it yet (though there probably is one).

The non-greedy qualifier issue:

Now, in the example I gave above, the question mark works just fine. Try adding it to the case in the sample file, however, and the match becomes individual letters instead of words. Try it with .*? instead to see it match the spaces between letters (which I'm not sure should ever be matched by any regular expression). Not space characters " ", but the boundaries between one character and the next.

I only found this because I routinely type (.+?) instead of (.+) by muscle memory to avoid matching, say, everything between two different sets of <strong> tags instead of just the contents of the tags.
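To show why that habit exists, here is a small Python sketch of the greedy and non-greedy forms on a line with two sets of `<strong>` tags. The sample text is invented for the illustration, and this runs in Python's `re` module, not in Publisher.

```python
import re

# Invented sample line with two separate <strong> elements.
text = '<strong>one</strong> and <strong>two</strong>'

# Greedy: .+ runs to the LAST </strong>, swallowing everything in between.
greedy = re.findall(r'<strong>(.+)</strong>', text)

# Non-greedy: .+? stops at the FIRST </strong> it can reach.
lazy = re.findall(r'<strong>(.+?)</strong>', text)

print(greedy)  # ['one</strong> and <strong>two']
print(lazy)    # ['one', 'two']
```

The `?` only helps here because something follows the group (the closing tag) that tells the engine where to stop.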


Thanks. I was mostly wanting to confirm that you weren't trying to report two issues with the group match problem. 

I'll check on the ? problem when I'm back at my computer.

-- Walt


First: it's not related to group matching. In terms of what the results list shows and what gets highlighted in the text, your test case has exactly the same issues whether you search for (.+) or simply .+

Next, the non-greedy case:

  • If you use (.+) that says to match (and capture) as many characters as possible.
  • If you use (.+?) that says to match (and capture) as few characters as possible. That's 1, in your sample file. So in that case Publisher is working correctly, and just as other regex processors do.

For example, here's RegExBuddy matching .+ against a character string:
image.png.d119c8b7db01a456dd058c3810d5bd73.png

It matched the full string.

Here it is with .+? instead, which gives individual characters:

image.png.dfc2a1827e0b183a759b9ebe0e0570ff.png
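The same comparison can be reproduced in Python's `re` module, as a rough sketch. The string `some text` is just a stand-in, not the contents of the sample file, and the third pattern is included only to show the zero-length matches mentioned earlier in the thread.

```python
import re

# Stand-in string; not the contents of the sample .afpub file.
text = 'some text'

# Greedy: one match covering the whole string.
greedy = re.findall(r'.+', text)

# Non-greedy with nothing after it: the shortest possible match is 1 character.
lazy = re.findall(r'.+?', text)

# .*? is allowed to match zero characters, so it also produces empty matches
# at the boundaries between characters.
lazy_star = re.findall(r'.*?', text)

print(greedy)            # ['some text']
print(lazy)              # one single-character string per character
print('' in lazy_star)   # True: zero-length matches are present
```

With nothing after the quantifier to force it onward, a non-greedy match stops as soon as it legally can, which is why it returns single characters (or nothing at all for `.*?`).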

Using ? would be appropriate if there were some larger pattern you were trying to avoid matching. The example you gave just above shows the appropriate usage of ?:
 

Find: <strong(.*?)>(.*?)</strong>
Replace: <em$1>$2</em> 

But in your sample file, the pattern and the text are not complex enough to need the ? and so using it does something that was correct, but that you didn't expect.

-- Walt
