Publisher: Using Find&Replace to apply styles to tagged import text


I had some difficulty using Find&Replace to apply styles to tagged text on import, but found a solution that works.  It is somewhat more involved than the Find&Replace recipe that has been posted in several threads here, and may help someone with a similar difficulty.  This was with Publisher 1.7.3.481.

My import text is a very simplistic HTML-like structure with free line breaks, which Publisher imports as paragraph breaks.

<h1>This is a sample header</h1>
<p>
This is a line in a
paragraph, which continues
on a 2nd and 3rd line.
</p>

When I place such a tagged text file (.txt) in a Publisher text frame, the first thing I do is find and replace all the paragraph breaks with single spaces.
Find: para-break
Replace: space
Obviously the actual paragraph break and space characters are used in the RE patterns, but I'm spelling them out here to be clear.

Then I use an RE that captures end tags together with any surrounding spaces and replaces them with the end tag, spaces stripped off, and adds a paragraph break.
Find: space*(</.*?>)space*
Replace: \1 para-break

Then, for each tag, I use an RE that strips off the start and end tags and any superfluous white space, and applies the appropriate paragraph style.
Find: <p>\s*(.*?)\s*</p>
Replace: \1 (Format with Body)
Similarly for <h1> and Heading 1 and the other tags.
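
For anyone who wants to see the three steps outside Publisher, here is a minimal Python sketch of the same sequence. It is only an approximation: "\n" stands in for Publisher's paragraph break, the STYLES mapping and the style_paragraphs helper are names I made up for illustration, and "applying a style" is modeled as simply recording a (style, text) pair.

```python
import re

raw = """<h1>This is a sample header</h1>
<p>
This is a line in a
paragraph, which continues
on a 2nd and 3rd line.
</p>
"""

# Step 1: every imported line end became a paragraph break; turn each into a space.
flat = raw.replace("\n", " ")

# Step 2: strip the spaces around each end tag and put a paragraph break after it.
broken = re.sub(r" *(</.*?>) *", r"\1\n", flat)

# Step 3: per tag, drop the tags and padding and record which style to apply.
STYLES = {"h1": "Heading 1", "p": "Body"}  # hypothetical tag-to-style mapping


def style_paragraphs(text):
    """Return (style, text) pairs for each tagged paragraph in `text`."""
    styled = []
    for para in text.split("\n"):
        for tag, style in STYLES.items():
            m = re.fullmatch(rf"<{tag}>\s*(.*?)\s*</{tag}>", para)
            if m:
                styled.append((style, m.group(1)))
    return styled


result = style_paragraphs(broken)
```

The point of the ordering is that by step 3 every tagged paragraph already sits between its own paragraph breaks, so each match stays inside one paragraph.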

So, why did I do it this way?  Why didn't I just strip off the tags and superfluous spaces, apply the styling and add a paragraph break in one step, combining my 2nd and 3rd Find&Replace steps?  The answer is simple: It doesn't work!  When I apply the 2nd tag style, the style is also applied to the adjacent paragraphs previously converted to the 1st tag style.  I tried both "para-break \1" and "\1 para-break" as replacement patterns, but neither leaves the previously converted text alone.  The solution is to introduce the paragraph breaks first to separate all the paragraphs cleanly, before styling each paragraph.  Trying to chop out paragraphs and style them at the same time using Find&Replace is not "stable", at least in this version of Publisher.

3 hours ago, sfriedberg said:

When I place such a tagged text file (.txt) in a Publisher text frame, the first thing I do is find and replace all the paragraph breaks with single spaces.
Find: para-break
Replace: space
Obviously the actual paragraph break and space characters are used in the RE patterns, but I'm spelling them out here to be clear.

I don't understand the purpose of that. Wouldn't that cause the problems that make you use the more complex approach for the rest of the process?

(Of course, without a sample input file it's hard to be sure.)

 

-- Walt
Designer, Photo, and Publisher V1 and V2 at latest retail and beta releases
PC:
    Desktop:  Windows 11 Pro, version 23H2, 64GB memory, AMD Ryzen 9 5900 12-Core @ 3.00 GHz, NVIDIA GeForce RTX 3090 

    Laptop:  Windows 11 Pro, version 23H2, 32GB memory, Intel Core i7-10750H @ 2.60GHz, Intel UHD Graphics Comet Lake GT2 and NVIDIA GeForce RTX 3070 Laptop GPU.
iPad:  iPad Pro M1, 12.9": iPadOS 17.4.1, Apple Pencil 2, Magic Keyboard 
Mac:  2023 M2 MacBook Air 15", 16GB memory, macOS Sonoma 14.4.1

4 hours ago, sfriedberg said:

Find: <p>\s*(.*?)\s*</p>
Replace: \1 (Format with Body)

I would do two passes

find <p>(.+?)</p>

Replace \1 adding whichever Paragraph Style

Then go and strip out the spaces and returns.

find "\r " in whichever Paragraph Style.

replace with nothing.

Preferably I would get rid of the extraneous spaces and returns in the text before importing it.
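
A rough sketch of what that pre-import cleanup could look like, assuming the text is preprocessed with a small Python script before it is placed in Publisher. The unwrap function is a hypothetical helper, not anything Publisher provides, and it assumes "<" only ever appears as part of a tag.

```python
import re


def unwrap(text):
    """Put each tagged element on a single line with no stray whitespace."""
    flat = " ".join(text.split())                       # collapse all whitespace runs
    tight = re.sub(r"\s*(</?[^>]+>)\s*", r"\1", flat)   # drop whitespace around tags
    return re.sub(r"(</[^>]+>)", r"\1\n", tight)        # one line per end tag


sample = """<h1>This is a sample header</h1>
<p>
This is a line in a
paragraph, which continues
on a 2nd and 3rd line.
</p>
"""
cleaned = unwrap(sample)
```

With the file in that shape, each line that Publisher turns into a paragraph is already a complete tagged paragraph.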

Mac Pro (Late 2013) Mac OS 12.7.4 
Affinity Designer 2.4.1 | Affinity Photo 2.4.1 | Affinity Publisher 2.4.1 | Beta versions as they appear.

I have never mastered color management, period, so I cannot help with that.

@walt.farrell  The purpose is to remove all the superfluous paragraph breaks that Publisher introduces for each end-of-line in the imported text file.  The sample text I provided in my original posting is entirely adequate to demonstrate the issue as a sample input file.  If you want a slightly richer example, just concatenate two copies of the sample text.  Convert <h1> first, then <p>, and it will be clear that one or both occurrences of the previously converted <h1> text (depending on whether your replacement string puts the paragraph break before or after the \1) will be re-styled as <p>.  And no, the problem seems to be attempting to use Find&Replace to introduce a paragraph break and assign a paragraph style simultaneously.  This was the main point of my post.  When those two operations are separated, everything proceeds exactly as expected.  When those two operations are combined into a single replace, the new styling "bleeds" into adjacent, previously styled and paragraph-marked text.

@Old Bruce The imported text file has no extraneous spaces, and has returns in the text to make it practical to work in my preferred text editor (vi) with both hands on the keyboard and little or no use of the mouse.  If I could import a .txt file into Publisher without conversion of returns to paragraph breaks, I would.  If Publisher would leave returns as line-breaks, I would let it.  I have tried working with return-less paragraphs and the experience is just horrible, endlessly keystroking to move forward or back or constantly taking my hands off the keyboard to reposition the cursor with the mouse.  Furthermore, import of HTML-like or XML-like content is going to have returns as returns are generally not significant in XML-derived formats anywhere except specially designated stretches of preformatted text.

I want to reiterate that the problem I encountered is not with converting one type of tagged content to one Publisher paragraph style.  The problem is that successive conversions of a series of tags corrupt/undo/invalidate the styling of previously converted text.  The question of whether/when/how to deal with returns in the original imported text file is not the main point, and could profitably be ignored as not relevant to the problem at hand.  I posted not to complain, but to enable others in similar situations to get things working, using a recipe slightly more complex than those posted in previous threads on the subject of importing tagged text.

11 hours ago, sfriedberg said:

The purpose is to remove all the superfluous paragraph breaks that Publisher introduces for each end-of-line in the imported text file.  The sample text I provided in my original posting is entirely adequate to demonstrate the issue as a sample input file.  If you want a slightly richer example, just concatenate two copies of the sample text.  Convert <h1> first, then <p>, and it will be clear that one or both occurrences of the previously converted <h1> text (depending on whether your replacement string puts the paragraph break before or after the \1) will be re-styled as <p>.  And no, the problem seems to be attempting to use Find&Replace to introduce a paragraph break and assign a paragraph style simultaneously.  This was the main point of my post.  When those two operations are separated, everything proceeds exactly as expected.  When those two operations are combined into a single replace, the new styling "bleeds" into adjacent, previously styled and paragraph-marked text.

But in your sample text in the original post (as compared with an actual file) I can't tell (for example) if that break after <p> is a soft-return or a hard-return. If you've used a hard-return, then you added the paragraph break. In my opinion the <p> should be on the same line as the paragraph text (same for the </p>) or you should be using soft-returns (usually shift+Enter or ctrl+Enter, depending on your application, to get a line-break rather than a paragraph break) so that no additional paragraph breaks are introduced.

A hard return ("Enter") is always an actual paragraph break in modern text editors and word processors, so it is likely that you are introducing the "extraneous" paragraph breaks that are causing your problems.

-- Walt
2 hours ago, walt.farrell said:

A hard return ("Enter") is always an actual paragraph break in modern text editors and word processors, so it is likely that you are introducing the "extraneous" paragraph breaks that are causing your problems.

I wonder if this has anything to do with the early days of telecommunications, before there were word processors, when non-printing control characters like <CR> and <LF> (carriage return and line feed, respectively) were used to control printers like Teletype machines?

Because <CR> just sent the carriage back to the start of the line, starting a new line required the <CR><LF> sequence. That did not mark the start of a new paragraph, but when text editors started using one or both for that purpose, the newline sequence might be interpreted as two paragraph breaks instead of just one.

All 3 1.10.8, & all 3 V2.4.1 Mac apps; 2020 iMac 27"; 3.8GHz i7, Radeon Pro 5700, 32GB RAM; macOS 10.15.7
Affinity Photo 1.10.8; Affinity Designer 1.108; & all 3 V2 apps for iPad; 6th Generation iPad 32 GB; Apple Pencil; iPadOS 15.7

We seem to be rat-holing on the less useful aspect of this thread.  But I'll play along.  Vi was developed in the mid-1970's, and I've been using it since the 1980's.  The command set is burned into my muscle memory.  It is not a word processor.  It is a text editor.  An end-of-line in a file edited by vi is whatever your native OS uses for end-of-line.  On Unix/Linux and similar systems, it is a linefeed ('\n').  On Windows, it is a carriage-return+linefeed combo ("\r\n").
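
To make the end-of-line point concrete, here is a small Python sketch that maps both conventions to a single linefeed before any further processing. The function name normalize_newlines is mine, and whether Publisher does anything like this on import is not something I know.

```python
def normalize_newlines(text):
    # Windows "\r\n" pairs first, then any leftover bare "\r", all become "\n".
    return text.replace("\r\n", "\n").replace("\r", "\n")


unix_file = "line one\nline two\n"
windows_file = "line one\r\nline two\r\n"
same = normalize_newlines(unix_file) == normalize_newlines(windows_file)
```

Handling "\r\n" as one unit is what keeps a Windows line end from being counted as two breaks.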

Having used Model 13 and Model 28 Teletypes back in the day, and having spent thousands of hours sitting at Underwood, Remington and other mechanical typewriters, culminating in the luxurious experience of using an IBM Correcting Selectric II with Lift-Off correction tape, the distinction between linefeed and carriage-return is not one I am likely to forget.  And since we are being distracted from the main point, allow me to take my tangent further.  When Model 40 Teletypes came into play, they replaced (at least as an option) the paper tape punch/reader with a cassette tape deck.  In the Mod13/Mod28 era, there was a standard practice known as "torn tape relay", wherein Station A sends a message to Station C via directly-connected intermediate Station B.  Operators at Station B turned on the paper tape punch on the receiving TTY to record the message arriving from Station A, then physically tore the message off the TTY, walked over to the sending TTY connected to Station C, alerted the Station C operators with some variation on ZUJ ZUJ MSG FOR YOU MATE, then fed the torn tape into the paper tape reader for transmission.  When the Model 40's arrived, this practice immediately became known as "torn cassette relay", although no cassette tapes were shredded.  Rather, the cassettes were popped out of one machine and walked over to the other.  Station managers hated this, because sloppy operators forgot to put a new cassette in the deck, leaving no record of further incoming messages.

[Added in edit] A further irrelevancy. Model 13 and Model 28 Teletypes were highly complex, fully mechanical machines.  When properly maintained, they were quite durable (and also capable of giving the operator a very nasty bone-jarring finger jam as a penalty for sloppy typing.)  In contrast, the Model 40 was based on computer keyboard and printer technologies, and there were entire dumpsters filled with broken Model 40 keyboards.  The space bar, in particular, when hammered by operators accustomed to driving an older mechanical TTY, often had a lifespan measured in months.

On 2/1/2020 at 3:52 AM, walt.farrell said:

it is likely that you are introducing the "extraneous" paragraph breaks that are causing your problems.

There is no question that importing my tagged text files is introducing unwanted paragraph breaks.  That is why my first step is to remove them.  What is interesting is neither the source of the unwanted paragraph breaks nor the method of eliminating them.  What is interesting is the difficulty in taking tagged text, without paragraph breaks, and introducing both styles and appropriately located paragraph breaks.  Trying to do both in one operation does not work as expected, as successive tag conversions "break" the previous conversions of adjacent text.  The solution is to introduce paragraph breaks before applying styles.

I thought others might find this relevant when bringing tagged text from non-word-processor sources, such as XML files, in which there is no distinction between hard-return and soft-return, and indeed the very concept of "return" is irrelevant in such file formats except in very specific character data contexts.
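
As an illustration of how little the returns mean in such formats, here is a Python sketch that parses the sample with the standard-library XML parser and normalizes the whitespace inside each element. The <doc> wrapper element is my own addition (XML needs a single root), and none of this involves Publisher.

```python
import xml.etree.ElementTree as ET

source = """<doc><h1>This is a sample header</h1>
<p>
This is a line in a
paragraph, which continues
on a 2nd and 3rd line.
</p></doc>"""

root = ET.fromstring(source)

# Line ends inside an element are just whitespace; collapse them to single spaces.
paragraphs = [(el.tag, " ".join((el.text or "").split())) for el in root]
```

The line breaks inside <p> vanish without changing the content, which is why I treat them as noise to be removed on import.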

  • 2 months later...
On 1/30/2020 at 9:26 AM, sfriedberg said:

Why didn't I just strip off the tags and superfluous spaces, apply the styling and add a paragraph break in one step, combining my 2nd and 3rd Find&Replace steps?

APub or ID needs to store the information about the paragraph style somewhere… and it's usually in the "¶". If you delete it while applying another style, the 2 paragraphs will both use this style.

It can be tricky with RE. When we're able to use scripts, it'll be easier: for example, running a list of specific REs, or even modifying/applying a style depending on the style of the previous paragraph, etc.
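
A toy Python model of that idea, purely illustrative: I don't know how APub stores styles internally, and styled_replace is a made-up function. Each paragraph is a (style, text) pair, and a styled replace whose match crosses a paragraph mark merges everything it touches under the new style.

```python
def styled_replace(paragraphs, first, last, style):
    """Toy model: a styled replace whose match touches paragraphs first..last
    (inclusive) deletes the marks between them, so they merge under `style`."""
    merged_text = " ".join(text for _, text in paragraphs[first:last + 1])
    return paragraphs[:first] + [(style, merged_text)] + paragraphs[last + 1:]


doc = [
    ("Heading 1", "This is a sample header"),
    ("No Style", "This is a line in a paragraph."),
]

# Match confined to the second paragraph: the heading keeps its style.
safe = styled_replace(doc, 1, 1, "Body")

# Match that also swallows the mark ending the heading: the heading is restyled too.
bled = styled_replace(doc, 0, 1, "Body")
```

Under this model, putting the paragraph breaks in first means every later match stays inside a single paragraph, so nothing previously styled gets touched.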

 

I'm not fond of the way APub plays with the various paragraph/line breaks, using I'm-not-sure-what, but nothing compatible with a lot of apps (and adding paragraph breaks or converting everything to them).
It's impossible to copy-paste from APub: notice how the text "this is a line in a" (in a style other than "No Style", but with no special flow options) ends up on a second page in LibreOffice, and how the paragraph and line breaks aren't recognized…

They need to correct this so we can paste correctly, one way or the other.

2020-04-28_212405.thumb.png.2aaf824e5b15d907ea59c64c98ea1d35.png

 

 

20 minutes ago, Wosven said:

notice how the text "this is a line in a" (in a style other than "No Style", but with no special flow options) ends up on a second page in LibreOffice, and how the paragraph and line breaks aren't recognized…

I'm curious how you pasted in LibreOffice:

  • Paste
  • Paste Special > Unformatted Text
  • Paste Special... > RTF

 

-- Walt
1 minute ago, walt.farrell said:

pasted in LibreOffice:

Simple paste, since it was to compare the results.

 

[edit] And pasting without formatting gives the same result as in ID and the text editors: no paragraph break is recognized. [/edit]

1 minute ago, Wosven said:

Simple paste, since it was to compare the results.

Thanks. But since you're claiming that copy/paste from Publisher is not possible, it might be nice to try the other available formats. Maybe pasting is possible but you're just doing it wrong :)

(I don't really think that you are; I think it's a Publisher problem. I believe there's a bug in the formats that Publisher puts on the clipboard, because things like the funky paragraph break characters should be converted to something more standard, especially when pasting in plain or unicode text formats.)
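
In the meantime, a possible workaround is to clean the pasted text yourself. This is only a sketch under an assumption I have not verified: that the plain text Publisher puts on the clipboard uses the Unicode paragraph separator (U+2029) and line separator (U+2028). The function name to_plain_newlines is hypothetical.

```python
PARAGRAPH_SEPARATOR = "\u2029"  # assumed, not confirmed for Publisher
LINE_SEPARATOR = "\u2028"       # assumed, not confirmed for Publisher


def to_plain_newlines(clipboard_text):
    """Replace Unicode paragraph and line separators with ordinary linefeeds."""
    return (
        clipboard_text
        .replace(PARAGRAPH_SEPARATOR, "\n")
        .replace(LINE_SEPARATOR, "\n")
    )


pasted = "This is a sample header\u2029This is a line in a\u2028paragraph."
fixed = to_plain_newlines(pasted)
```

If the clipboard turns out to use different characters, only the two constants would need to change.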

-- Walt
Just now, walt.farrell said:

Maybe pasting is possible but you're just doing it wrong

I'm not doing it wrong; any other app you copy from gives basic text with paragraph breaks.

It's just that APub only permits pasting as RTF, and not all the apps we use are able to recognize it. Their basic text is bugged and should contain paragraph breaks.

10 minutes ago, Wosven said:

Their basic text is bugged and should contain paragraph breaks.

Their basic text can contain whatever they want it to contain. But when Copying to the clipboard, the basic text formats they supply to the clipboard should be standard text.

-- Walt
