Regex bug


MikeW

Using the following text:

john.doe@company.com
george.dufus@company.com
ringo@company.com
paul.henry.whoever@company.com

The regex needs to capitalize only the first letter of the name parts (left of the @ symbol) and leave the right side alone.

The regex is:

(\b\w|\b\w\.\K\w)(?=.+@)

Replace is:

\u$1

As the screenshot below shows, the \K seems to be ignored, and the regex also matches to the right of the @:

[Screenshot: Capture_000456.png]

Here's what should happen:

[Screenshot: Capture_000455.png]


49 minutes ago, Pauls said:

Hi @MikeW, which regex tester are you using?

Hello Pauls,

RegexBuddy.

It's the same expression I wrote for ID (InDesign), where it works fine. It works as written in UltraEdit. It works on a couple of website regex checkers...

And it should work in APub.

Thank you, Mike


14 hours ago, MikeW said:

The regex is:

(\b\w|\b\w\.\K\w)(?=.+@)

I don't understand why your regular expression is so complex, Mike. (But the important part, for solving your problem, is at the end of this post.)

Given as input john.doe@company.com the matching will work as follows:

  1. \b should match before the j, and \w should match the j, and the lookahead will succeed, so the replace will be done for the j. All is good so far, and the second alternative (\b\w\.\K\w) was not used.
  2. \b will match after the n, but \w will fail on the . and all is still good.
  3. \b will match before the d, and \w will match the d, and the lookahead will succeed, so the replace will be done for the d. All is good, but note that this was done using the \b\w part of the expression. The second alternative still was not used.
  4. \b will match after the e, but \w will not match the @. All is good.
  5. \b will match before the c, and if the lookahead fails, as you intend, the matching is done for that line.
  6. etc.

The second alternative (\b\w\.\K\w) was never used. When would it be needed?
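
The steps above can be checked with a small Python sketch. It assumes Python's `re` module as a stand-in for Publisher's engine, so it is only an approximation: `re` does not support \K, and by default its . does not match newlines. Only the first alternative, \b\w, is traced here.

```python
import re

# First alternative plus the lookahead from the original expression.
# Assumption: Python's `re` stands in for Publisher's regex engine.
pattern = re.compile(r"\b\w(?=.+@)")

address = "john.doe@company.com"

# Record where each match starts and which character it matched.
hits = [(m.start(), m.group()) for m in pattern.finditer(address)]
print(hits)  # [(0, 'j'), (5, 'd')]
```

Only the j and the d are matched, and neither needed the second alternative.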

But, in particular for your question, the real problem is in your lookahead. (?=.+@) is, in a sense, too greedy: Publisher's regular expression processing operates in multi-line mode by default, with . matching newline characters, so that lookahead will find an @ anywhere later in the text, not just on the current line.

So, for the lookahead, you need (?=[^\n]+@) in order to accomplish what you want.

Or, you could use this: (?-s:(\b\w)(?=.+@)), which nests your regular expression inside the options string (?-s: ... ). That turns off the ability of . to match a newline within the nested expression.
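
A minimal Python sketch of the difference between these forms. It assumes Python's `re` module with re.DOTALL as a stand-in for Publisher's default behaviour, so it approximates the problem rather than reproducing Publisher's engine. The \u$1 replacement is imitated with a callback, because `re` has no \u in replacement strings; the helper name `capitalize` is made up for this illustration.

```python
import re

text = (
    "john.doe@company.com\n"
    "george.dufus@company.com\n"
    "ringo@company.com\n"
    "paul.henry.whoever@company.com"
)


def capitalize(pattern: str, flags: int = 0) -> str:
    """Uppercase every match of `pattern` in the sample text."""
    return re.sub(pattern, lambda m: m.group(0).upper(), text, flags=flags)


# re.DOTALL lets '.' cross newlines, imitating the behaviour described above:
# the lookahead finds an @ on a later line, so domain parts get capitalized.
greedy = capitalize(r"\b\w(?=.+@)", re.DOTALL)

# Fix 1: keep the lookahead on the current line with [^\n]+.
negated = capitalize(r"\b\w(?=[^\n]+@)", re.DOTALL)

# Fix 2: turn dot-matches-newline off for the nested expression only.
scoped = capitalize(r"(?-s:\b\w(?=.+@))", re.DOTALL)

print(greedy.splitlines()[0])   # John.Doe@Company.Com
print(negated.splitlines()[0])  # John.Doe@company.com
print(scoped == negated)        # True
```

In this sketch the last address comes out right even in the greedy case, because no @ follows it anywhere in the text.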

-- Walt
Designer, Photo, and Publisher V1 and V2 at latest retail and beta releases
PC:
    Desktop:  Windows 11 Pro, version 23H2, 64GB memory, AMD Ryzen 9 5900 12-Core @ 3.00 GHz, NVIDIA GeForce RTX 3090 

    Laptop:  Windows 11 Pro, version 23H2, 32GB memory, Intel Core i7-10750H @ 2.60GHz, Intel UHD Graphics Comet Lake GT2 and NVIDIA GeForce RTX 3070 Laptop GPU.
iPad:  iPad Pro M1, 12.9": iPadOS 17.4.1, Apple Pencil 2, Magic Keyboard 
Mac:  2023 M2 MacBook Air 15", 16GB memory, macOS Sonoma 14.4.1


25 minutes ago, walt.farrell said:

... Or, you could use this: (?-s:(\b\w)(?=.+@)) which nests your regular expression inside the options string (?-s: ... ) which has the effect of turning off the ability of . to match a newline within the nested expression.

Did you try your expression?

[Screenshot: Capture_000460.png]


52 minutes ago, walt.farrell said:

I don't understand why your regular expression is so complex...

I forgot to address this. This was originally for use in an ID GREP style.

The complexity simply grew to handle all the variants. I began with:

(\b\w)(?=.+@) <-- which does work for the simplistic samples in this thread.

I then altered it to catch more cases that the above didn't. It became:

(\b\w|\b\w\.\w)(?=.+@)

It eventually became the one in the opening post. There were 200 or so email addresses already in the ID file and the one I used was actually needed as part of the grep style (and is a valid regex) so when new entries were added either via typing new ones, correcting existing ones,  or importing text, ID would auto-correct them.


29 minutes ago, MikeW said:

Did you try your expression?

Yes. And it works in both the beta and stable versions of Publisher.

It could be simplified slightly, as there's an unneeded set of (), but it works either way: (?-s:\b\w(?=.+@))

[Screenshot: image.png]

-- Walt

4 minutes ago, MikeW said:

I forgot to address this. This was originally for use in an ID GREP style.

The complexity simply grew to handle all the variants. I began with:

(\b\w)(?=.+@) <-- which does work for the simplistic samples in this thread.

I then altered it to catch more cases that the above didn't. It became:

(\b\w|\b\w\.\w)(?=.+@)

It eventually became the one in the opening post. There were 200 or so email addresses already in the ID file and the one I used was actually needed as part of the grep style (and is a valid regex) so when new entries were added either via typing new ones, correcting existing ones,  or importing text, ID would auto-correct them.

Thanks. I just don't understand why the added complexity of the regular expression would ever help. I guess I need to see one of the complex examples that the simpler version doesn't catch.

My real issue with the complex one is that, as written, if the second alternative could ever match, the first one would match, too. So the second one should never be attempted. It would only be attempted if the first alternative failed. But in that case, it would fail, too, because it starts the same way as the first alternative.
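
One way to test this claim is a short Python sketch that labels the two alternatives and reports which one takes part in each match. It uses the intermediate form (\b\w|\b\w\.\w)(?=.+@) quoted above, because Python's `re` module does not support \K. The group names `first` and `second` are made up for this illustration, and `re` is only a stand-in for Publisher's engine.

```python
import re

addresses = """john.doe@company.com
george.dufus@company.com
ringo@company.com
paul.henry.whoever@company.com"""

# Hypothetical group names label the two alternatives.
# Without re.DOTALL, '.' stays on one line, so the lookahead is per address.
pattern = re.compile(r"(?:(?P<first>\b\w)|(?P<second>\b\w\.\w))(?=.+@)")

# lastgroup names the alternative that took part in each match.
used = [m.lastgroup for m in pattern.finditer(addresses)]
print(set(used))  # {'first'}
```

On these four sample addresses, every match comes from the first alternative.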

-- Walt

7 minutes ago, walt.farrell said:

Yes. And it works in both the beta and stable versions of Publisher.

It could be simplified slightly, as there's an unneeded set of (), but it works either way: (?-s:\b\w(?=.+@))

Thanks, Walt.

It doesn't work here, even after restarting APub and trying afresh. I'm only trying the release version.

And Serif really needs to fix the . = newline thing (add a switch, default off). I had forgotten that, so thanks...


4 minutes ago, MikeW said:

And Serif really needs to fix the . = newline thing (add a switch, default off). I had forgotten that, so thanks...

You're welcome. And I agree that another switch for that in the Find options would be useful and would greatly improve usability.

Kind of weird that it's not working for you, though. You sure that you had Regular Expression ticked in the options? (That would be an odd thing to forget, but I've done that before myself :D )

 

-- Walt

17 minutes ago, walt.farrell said:

You're welcome. And I agree that another switch for that in the Find options would be useful and would greatly improve usability.

Kind of weird that it's not working for you, though. You sure that you had Regular Expression ticked in the options? (That would be an odd thing to forget, but I've done that before myself :D )

OK. Your expression works in the beta, not in my release. I might try resetting the release.

And yes, the regex option is ticked.


Thanks, Mike. If it doesn't work in the release after resetting (saving stuff first, of course), you should be able to use the first version I showed, with [^\n]+ instead of .+, I think.

-- Walt

15 minutes ago, walt.farrell said:

Thanks, Mike. If it doesn't work in the release after resetting (saving stuff first, of course), you should be able to use the first version I showed, with [^\n]+ instead of .+, I think.

Thanks, Walt...

I don't yet need to do more than "play" with APub. The only reason I tried was that I was testing, in various applications, what I did in ID. So it was more "play" than something needed for real use.

Have you looked at the scant regex info in help? That's the first thing I did when the original expression failed.


18 minutes ago, MikeW said:

Have you looked at the scant regex info in help? That's the first thing I did when the original expression failed.

I may have looked at it once, a long time ago. Mostly I've used my regex knowledge, some experimenting, and online resources, including https://www.boost.org/doc/libs/1_72_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html, which documents the implementation Publisher uses, if I remember correctly and have correctly interpreted some statements by Serif staff.

-- Walt

Regular expressions

Regular expressions extend the capabilities and power of the Find and Replace function beyond searching for simple text strings. They are widely used across the word-processing and DTP community, with a multitude of expressions available. As a result, listing regular expressions and their syntax is beyond the scope of Affinity Publisher Help. Please use Internet resources to research and develop your own regular expressions.

Affinity Publisher supports Perl and ECMAScript (with perl extensions) expressions. Regular expressions use the "C" or "POSIX" locale, while Locale Aware Regular Expressions use the locale inferred from the text being searched and locale aware collation is implied.

*********************

Serif should include a hyperlink to an Internet site...and then comply with its use of expressions. But they could also include some example expressions in Help, along with explanations of how and why those examples work. It's not like they have to cut down trees to do so.

There are a billion websites. Some are great, some not so much. Some use JavaScript syntax unless you change which language is selected, etc.


Good points, Mike. And you're right; that's pretty scant info.

 

-- Walt

Walt, the below was copied out of RegexBuddy. Output like this would make it easy for Serif to add samples with explanations, and would show users where to look when writing their own.

 

(\b\w)(?=.+@)

Options: Case insensitive; Exact spacing; Dot doesn’t match line breaks; ^$ match at line breaks; Numbered capture

* [Match the regex below and capture its match into backreference number 1][1] `(\b\w)`
    * [Assert position at a word boundary (position preceded or followed—but not both—by a Unicode letter, digit, or underscore)][2] `\b`
    * [Match a single character that is a “word character” (Unicode; any letter or ideograph, any mark, digit, letter number, connector punctuation)][3] `\w`
* [Assert that the regex below can be matched starting at this position (positive lookahead)][4] `(?=.+@)`
    * [Match any single character that is NOT a line break character (line feed)][5] `.+`
        * [Between one and unlimited times, as many times as possible, giving back as needed (greedy)][6] `+`
    * [Match the character “@” literally][7] `@`

\u$1

* [Convert the next character to uppercase][8] `\u`
* [Insert the text that was last matched by capturing group number 1][9] `$1`

Created with [RegexBuddy](https://www.regexbuddy.com/)

[1]: https://www.regular-expressions.info/modifiers.html
[2]: https://www.regular-expressions.info/wordboundaries.html
[3]: https://www.regular-expressions.info/shorthand.html
[4]: https://www.regular-expressions.info/lookaround.html
[5]: https://www.regular-expressions.info/dot.html
[6]: https://www.regular-expressions.info/repeat.html
[7]: https://www.regular-expressions.info/characters.html
[8]: https://www.regular-expressions.info/replacecase.html#perl
[9]: https://www.regular-expressions.info/replacebackref.html
 


37 minutes ago, MikeW said:

the below was copied out of RegexBuddy. Output like this would make it easy for Serif to add samples with explanations, and would show users where to look when writing their own.

Unfortunately, copyright restrictions may make it impossible for Serif to copy explanations or examples from most online sources. All of the information on www.regular-expressions.info is copyrighted, for example.

However, if Serif is using the Boost libraries for their regular expression processing (as I think they are), they would be able to point to or use the Boost regular expression documentation at https://www.boost.org/doc/libs/1_72_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html because the license terms for that code and its documentation would permit that.

-- Walt

I think an email to Jan (the owner of the site and maker of RegexBuddy) may well obtain permission to include explanations of a half dozen or so examples as they appear in RegexBuddy, with attribution. Especially as it "hypes" his site, his software, and his books (and the books by other authors he recommends, mainly revolving around the programming aspect in nearly every language).


1 hour ago, AdamW said:

Thanks for the discussion, we'll add an option for 'Dot matches Paragraph Break' in the next 1.8 beta.

Thanks, Adam. 

I reread your post. Currently the regex does match paragraph breaks. What is needed is an option to not match paragraph breaks. 

Please consider either making the default not to match a paragraph break or making the choice persistent (sticky).
