Regex bug

MikeW · January 30, 2020

Using the following text:

john.doe@company.com
george.dufus@company.com
ringo@company.com
paul.henry.whoever@company.com

The regex needs to capitalize only the first letter of the name parts (left of the @ symbol) and leave the right side alone.

The regex is:

(\b\w|\b\w\.\K\w)(?=.+@)

Replace is:

\u$1

As can be seen below, it seems the \K is ignored and the regex matches to the right of the @ as well:

Here's what should happen:

Pauls · January 31, 2020

Hi @MikeW, which regex tester are you using?

MikeW · January 31, 2020

49 minutes ago, Pauls said:

Hi @MikeW, which regex tester are you using?

Hello Pauls,

RegexBuddy.

It's the same expression I wrote for ID (InDesign), where it works fine. It works as written in UltraEdit. It works on a couple of website regex checkers...

And it should work in APub (Affinity Publisher).

Thank you, Mike

walt.farrell · January 31, 2020

14 hours ago, MikeW said:

The regex is:

(\b\w|\b\w\.\K\w)(?=.+@)

I don't understand why your regular expression is so complex, Mike. (But the important part, for solving your problem, is at the end of this post.)

Given as input john.doe@company.com the matching will work as follows:

\b should match before the j, and \w should match the j, and the lookahead will succeed, so the replace will be done for the j. All is good so far, and the second alternative (\b\w\.\K\w) was not used.
\b will match after the n, but \w will fail on the . and all is still good.
\b will match before the d, and \w will match the d, and the lookahead will succeed, so the replace will be done for the d.
All is good, but note that this was done using the \b\w part of the expression. The second alternative still was not used.
\b will match after the e, but \w will not match the @. All is good.
\b will match before the c, and if the lookahead fails as you intend the matching is done for that line.
etc.

The second alternative (\b\w\.\K\w) was never used. When would it be needed?
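The walkthrough above can be sketched in Python. This is only an illustration: Python's `re` module stands in for Publisher's regex engine, and the helper name `trace` is made up for this example. It lists every position where \b\w can match and whether the lookahead .+@ would succeed from there on a single line.

```python
import re

LINE = "john.doe@company.com"

def trace(line):
    """Return (character, index, lookahead_ok) for every \\b\\w candidate."""
    steps = []
    for m in re.finditer(r"\b\w", line):
        # Emulate the lookahead (?=.+@) by matching against the rest of the line.
        lookahead_ok = re.match(r".+@", line[m.end():]) is not None
        steps.append((m.group(0), m.start(), lookahead_ok))
    return steps

for char, index, ok in trace(LINE):
    print(char, index, "replace" if ok else "skip")
```

On a single line, only the j and the d pass the lookahead; both c characters to the right of the @ are skipped, which is the intended behaviour.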

But, in particular for your question, the real problem is in your lookahead. (?=.+@) is, in a sense, too greedy, because Publisher regular expression processing operates in multi-line mode by default, and with . matching newline characters, and therefore that lookahead will find an @ anywhere later in the text.

So, for the lookahead, you need (?=[^\n]+@) in order to accomplish what you want.

Or, you could use this: (?-s:(\b\w)(?=.+@)) which nests your regular expression inside the options string (?-s: ... ) which has the effect of turning off the ability of . to match a newline within the nested expression.
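A rough way to reproduce the problem and both fixes outside Publisher is sketched below. It assumes Python's `re` module with the DOTALL flag is a fair stand-in for Publisher's default of . matching newlines; the real engine may differ in other respects. Python has no \u replacement escape, so a small hypothetical helper, `capitalize`, upper-cases each match instead.

```python
import re

TEXT = (
    "john.doe@company.com\n"
    "george.dufus@company.com\n"
    "ringo@company.com\n"
    "paul.henry.whoever@company.com"
)

def capitalize(pattern, text):
    # re.DOTALL imitates an engine where . matches newlines by default.
    return re.sub(pattern, lambda m: m.group(0).upper(), text, flags=re.DOTALL)

# Original lookahead: .+ runs across line breaks and finds an @ on a later line.
too_greedy = capitalize(r"\b\w(?=.+@)", TEXT)

# Fix 1: forbid newlines inside the lookahead.
no_newline = capitalize(r"\b\w(?=[^\n]+@)", TEXT)

# Fix 2: switch off "dot matches newline" for the nested expression only.
scoped = capitalize(r"(?-s:\b\w(?=.+@))", TEXT)

print(too_greedy)
print(no_newline)
```

With the original lookahead, the domain parts of every line except the last are capitalized as well, because an @ still follows somewhere later in the text. Both fixes leave the right-hand side alone.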

MikeW · January 31, 2020

25 minutes ago, walt.farrell said:

... Or, you could use this: (?-s:(\b\w)(?=.+@)) which nests your regular expression inside the options string (?-s: ... ) which has the effect of turning off the ability of . to match a newline within the nested expression.

Did you try your expression?

MikeW · January 31, 2020

52 minutes ago, walt.farrell said:

I don't understand why your regular expression is so complex...

I forgot to address this. This was originally for use in an ID grep style.

Simply because the complexity grew to handle all the variants. I began with:

(\b\w)(?=.+@) <--which does work for the simplistic samples in this thread.

I then altered it to catch more as the above didn't. It became:

(\b\w|\b\w\.\w)(?=.+@)

It eventually became the one in the opening post. There were 200 or so email addresses already in the ID file and the one I used was actually needed as part of the grep style (and is a valid regex) so when new entries were added either via typing new ones, correcting existing ones, or importing text, ID would auto-correct them.

walt.farrell · January 31, 2020

29 minutes ago, MikeW said:

Did you try your expression?

Yes. And it works in both the beta and stable versions of Publisher.

It could be simplified slightly, as there's an unneeded set of (), but it works either way: (?-s:\b\w(?=.+@))

walt.farrell · January 31, 2020

4 minutes ago, MikeW said:

I forgot to address this. This was originally for use in an ID grep style.

Simply because the complexity grew to handle all the variants. I began with:

(\b\w)(?=.+@) <--which does work for the simplistic samples in this thread.

I then altered it to catch more as the above didn't. It became:

(\b\w|\b\w\.\w)(?=.+@)

It eventually became the one in the opening post. There were 200 or so email addresses already in the ID file and the one I used was actually needed as part of the grep style (and is a valid regex) so when new entries were added either via typing new ones, correcting existing ones, or importing text, ID would auto-correct them.

Thanks. I just don't understand why the added complexity of the regular expression would ever help. I guess I need to see one of the complex examples that the simpler version doesn't catch.

My real issue with the complex one is that as written, if the second alternate would ever match, the first one would match, too. So the second one should never be attempted. It would only be attempted if the first alternate failed. But in that case, it would fail, too, because it starts the same way as the first alternate.
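The point about the second alternative can be checked with a small sketch. Python's `re` module does not support \K, so this uses the intermediate two-alternative version quoted above, (\b\w|\b\w\.\w)(?=.+@), and compares it with the simple one on a single sample address. It is an illustration only, not a test of Publisher's engine.

```python
import re

SAMPLE = "john.doe@company.com"

simple = re.compile(r"(\b\w)(?=.+@)")
two_alternatives = re.compile(r"(\b\w|\b\w\.\w)(?=.+@)")

simple_matches = [(m.start(), m.group(1)) for m in simple.finditer(SAMPLE)]
two_alt_matches = [(m.start(), m.group(1)) for m in two_alternatives.finditer(SAMPLE)]

# If the second alternative were ever used, a match would be 3 characters long.
longest = max(len(text) for _, text in two_alt_matches)

print(simple_matches)
print(two_alt_matches)
print(longest)
```

Both patterns yield the same single-character matches here, so the second alternative contributes nothing for this input.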

MikeW · January 31, 2020

7 minutes ago, walt.farrell said:

Yes. And it works in both the beta and stable versions of Publisher.

It could be simplified slightly, as there's an unneeded set of (), but it works either way: (?-s:\b\w(?=.+@))

Thanks, Walt.

Doesn't work here even with restarting APub and trying afresh. I'm only trying the release version.

And Serif really needs to fix the . = newline thing (add a switch, default off). I had forgotten that, so thanks...

walt.farrell · January 31, 2020

4 minutes ago, MikeW said:

And Serif really needs to fix the . = newline thing (add a switch, default off). I had forgotten that, so thanks...

You're welcome. And I agree that another switch for that in the Find options would be useful, and would greatly improve the usability.

Kind of weird that it's not working for you, though. You sure that you had Regular Expression ticked in the options? (That would be an odd thing to forget, but I've done that before myself.)

MikeW · January 31, 2020

17 minutes ago, walt.farrell said:

You're welcome. And I agree that another switch for that in the Find options would be useful, and would greatly improve the usability.

Kind of weird that it's not working for you, though. You sure that you had Regular Expression ticked in the options? (That would be an odd thing to forget, but I've done that before myself.)

OK. Your expression works in the beta, not in my release. I might try resetting the release.

And yes, the regex option is ticked.

walt.farrell · January 31, 2020

Thanks, Mike. If it doesn't work in the release after resetting (saving stuff first, of course), you should be able to use the first version I showed, with [^\n]+ instead of .+, I think.

MikeW · January 31, 2020

15 minutes ago, walt.farrell said:

Thanks, Mike. If it doesn't work in the release after resetting (saving stuff first, of course), you should be able to use the first version I showed, with [^\n]+ instead of .+, I think.

Thanks, Walt...

I don't yet need to do more than "play" with APub. The only reason I tried was because of testing in various applications of what I did in ID. So it was more "play" than needed for something real-use.

Have you looked at the scant regex info in help? That's the first thing I did when the original expression failed.

walt.farrell · January 31, 2020

18 minutes ago, MikeW said:

Have you looked at the scant regex info in help? That's the first thing I did when the original expression failed.

I may have looked at it once, a long time ago. Mostly I've used my regex knowledge, some experimenting, and online resources including https://www.boost.org/doc/libs/1_72_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html which is the implementation Publisher uses, if I remember correctly and correctly interpreted some statements by Serif staff.

MikeW · January 31, 2020

Regular expressions

Regular expressions extend the capabilities and power of the Find and Replace function beyond searching for simple text strings. They are widely used across the word-processing and DTP community, with a multitude of expressions available. As a result, listing regular expressions and their syntax is beyond the scope of Affinity Publisher Help. Please use Internet resources to research and develop your own regular expressions.

Affinity Publisher supports Perl and ECMAScript (with perl extensions) expressions. Regular expressions use the "C" or "POSIX" locale, while Locale Aware Regular Expressions use the locale inferred from the text being searched and locale aware collation is implied.

*********************

Serif should include a hyperlink to an Internet site...and then conform to the expression syntax documented there. They could also use some example expressions in Help, along with explanations of what those examples do and why they work. It's not like they have to cut down trees to do so.

There are a billion websites. Some are great, some not so much. Some that use Javascript syntax unless you change what language is being used, etc.

walt.farrell · January 31, 2020

Good points, Mike. And you're right; that's pretty scant info.

MikeW · January 31, 2020

Walt, the following was copied out of RegexBuddy. Output like this would make it easy for Serif to add samples with explanations, and would show users where to look when building their own.

(\b\w)(?=.+@)

Options: Case insensitive; Exact spacing; Dot doesn’t match line breaks; ^$ match at line breaks; Numbered capture

* Match the regex below and capture its match into backreference number 1 `(\b\w)`
  * Assert position at a word boundary (position preceded or followed, but not both, by a Unicode letter, digit, or underscore) `\b`
  * Match a single character that is a “word character” (Unicode; any letter or ideograph, any mark, digit, letter number, connector punctuation) `\w`
* Assert that the regex below can be matched starting at this position (positive lookahead) `(?=.+@)`
  * Match any single character that is NOT a line break character (line feed) `.+`
    * Between one and unlimited times, as many times as possible, giving back as needed (greedy) `+`
  * Match the character “@” literally `@`

\u$1

* Convert the next character to uppercase `\u`
* Insert the text that was last matched by capturing group number 1 `$1`

Created with [RegexBuddy](https://www.regexbuddy.com/)
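
For comparison, a similar token-by-token explanation can be kept next to the pattern itself. The sketch below uses Python's `re.VERBOSE` mode, which is an assumption made for illustration; neither Publisher nor RegexBuddy is involved, and the names `ANNOTATED` and `upper_first` are invented. DOTALL is not set, so . does not match a line break, mirroring the "Dot doesn't match line breaks" option listed above.

```python
import re

ANNOTATED = re.compile(
    r"""
    (           # capture the match into group 1
      \b        # word boundary
      \w        # a single word character
    )
    (?=         # positive lookahead: what follows must match
      .+        # one or more characters, greedy, never a line break here
      @         # a literal @
    )
    """,
    re.VERBOSE,
)

def upper_first(text):
    # Python has no \u replacement escape; upper-case group 1 in a callback.
    return ANNOTATED.sub(lambda m: m.group(1).upper(), text)

print(upper_first("ringo@company.com\npaul.henry.whoever@company.com"))
```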

walt.farrell · January 31, 2020

37 minutes ago, MikeW said:

the following was copied out of RegexBuddy. Output like this would make it easy for Serif to add samples with explanations, and would show users where to look when building their own.

Unfortunately, copyright restrictions may make it impossible for Serif to copy explanations or examples from most online sources. All of the information on www.regular-expressions.info is copyrighted, for example.

However, if Serif is using the Boost libraries for their regular expression processing (as I think they are), they would be able to point to or use the Boost regular expression documentation at https://www.boost.org/doc/libs/1_72_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html because the license terms for that code and its documentation would permit that.

MikeW · January 31, 2020

I think an email to Jan (the owner of the site and maker of RegexBuddy) may well obtain permission to include explanations of a half dozen or so examples as occurring in RegexBuddy and giving attribution. Especially as it "hypes" both his site, his software and his books (and the books by other authors he recommends, mainly revolving around the programming aspect in nearly every language).

AdamW · February 6, 2020

Thanks for the discussion, we'll add an option for 'Dot matches Paragraph Break' in the next 1.8 beta.

MikeW · February 6, 2020

1 hour ago, AdamW said:

Thanks for the discussion, we'll add an option for 'Dot matches Paragraph Break' in the next 1.8 beta.

Thanks, Adam.

I reread your post. Currently the regex does match paragraph breaks. What is needed is an option to not match paragraph breaks.

Please consider either making the default not to match a paragraph break, or making the choice persistent (sticky).

AdamW · February 7, 2020

Hi Mike,

Yes - defaulted to 'off' (and sticky). This will change existing default behaviour but I think it's preferable in the long run.
