Formatting during text import

MikeA · July 3, 2019

New here — not much luck yet with forum search. If there's a discussion about this, apologies for not having found it.

Back in the Neolithic I used QuarkXPress, which supported a tagging method for text import: Simple codes embedded in plain text were transformed into complex formatting during import. The competition didn't have such a feature at the time. It was among several reasons for QXP's becoming the program of record for book pagination (until InDesign came along).

Even years before microcomputers took over the world, the typesetting systems I used had tagging and translation-table features. Same purpose: Prepare text containing simple codes, then get complex formatting during text import. The machines' CPUs ran at glacial speeds compared with what we have now. But the text-import systems were fast and efficient.

It's orders of magnitude faster than importing plain text into a design/pagination program and then hand-formatting it. Search/replace is not efficient unless a program supports complex search/replace enabling it to find starting and ending tags and formatting text located between those tags. Even at that, having to do it repetitively is tedious and time-consuming. (If search/replace can be controlled via scripting, that certainly helps.)

Manipulating text outside the pagination program is inherently more efficient. It can be done with powerful and fast tools ideal for that purpose (Python, Perl, Ruby, and so forth).

Affinity Publisher looks like an excellent contender. It too needs this kind of feature. If the company has no such plans for the near future, I hope the program has a plug-in architecture enabling a third party to add this functionality. To anyone importing a lot of text, that kind of automation is worth paying for.

MikeW · July 3, 2019

I too have requested tagged-text support. If Serif decides to do this, I would prefer QXP's style of tagged text, as it isn't as verbose as ID's. I prepare tagged text for every book to be laid out in Q (and ID) to this day. Its absence is reason #1 why I won't use APub for books. The tagged text is generated from sources ranging from Word files to CMS systems to raw data out of SQL, which I import into Access and generate the tagged text from.

I don't think Serif has committed to plug-ins, but there has been a lot of interest shown by users and at least one plug-in developer (Astute).

Mike

MikeA · July 3, 2019

If the underlying architecture doesn't now support plug-ins, I wonder how much difficulty they'll have adding that later. I never used InDesign — I was out of the book-pagination business long before it became "a thing" — and haven't seen its tagging system. Is it ghastly, along the lines of SGML?

My recollection of XPress Tags is a bit hazy. If memory serves, XPress Tags permitted you to specify not just character styles, but named paragraph styles during text import — yes? I actually worked on a book about QuarkXPress way back then. I suppose I could just go look it all up. :-)

Even rudimentary tagging would be better than nothing.

{pstyle:"some-style-name"}Some text {cstyle:"italics"}with formatting{/cstyle} via tagging.{/pstyle}

I suppose that to someone who's never used systems like that, it looks decidedly user-hungry. But to people accustomed to scripting it's a walk in the park. I did this in creating a simple e-book once. Devised my own system of text codes (far simpler than the above) and then wrote scripts to transform the simple codes into XHTML in which the style names matched what was already set up in the ePub editor. Not worth taking the time for a small job, but absolutely worth taking the time to set up if it's a long document. In the end you save a lot of time and headache...
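A minimal sketch of the kind of script described above, written in Python rather than the original tooling. The `{pstyle:...}` / `{cstyle:...}` grammar is taken from the example line in this post; the mapping to `<p class>` and `<span class>` and the function name `tagged_to_xhtml` are assumptions made for illustration, not anyone's actual code.

```python
import html
import re

# Hypothetical tag grammar, modelled on the example line in this thread:
#   {pstyle:"name"} ... {/pstyle}  ->  <p class="name"> ... </p>
#   {cstyle:"name"} ... {/cstyle}  ->  <span class="name"> ... </span>
TAG = re.compile(r'\{(/?)(pstyle|cstyle)(?::"([^"]*)")?\}')
ELEMENT = {"pstyle": "p", "cstyle": "span"}


def tagged_to_xhtml(text: str) -> str:
    """Translate the brace tags into XHTML and escape all other text."""
    out = []
    pos = 0
    for m in TAG.finditer(text):
        # Plain text between tags is escaped so stray '<' or '&' stay valid.
        out.append(html.escape(text[pos:m.start()], quote=False))
        closing, kind, name = m.group(1), m.group(2), m.group(3)
        element = ELEMENT[kind]
        if closing:
            out.append(f"</{element}>")
        else:
            out.append(f'<{element} class="{html.escape(name or "")}">')
        pos = m.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


sample = ('{pstyle:"some-style-name"}Some text {cstyle:"italics"}with '
          'formatting{/cstyle} via tagging.{/pstyle}')
result = tagged_to_xhtml(sample)
print(result)
```

The class names in the output only do anything useful if styles of the same names already exist in the target editor, which is the point made above about matching style names.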

MikeW · July 3, 2019

I generally do not specify the actual p.style or c.style definitions themselves. But that is certainly possible. Instead I will generate the tagged text like you show. Then once inside of Q, I'll actually modify the styles to suit or import the tagged text into a template with styles already set up to the same names.

Here's an example text block from a tagged text file...

<v13.21><e9>
@01 Date:11
@02 Day:TUE<\c>@03 Title:Nosferatu <@FilmRating>(PG)
@04 Time:7<\a>9pm
@05 Location:Minghella Building
@06 Description:Germany 1922. Directed by F.W. Murnau <\a> with Max Schreck, Greta Schroder, Ruth Landshoff
@07 Length:81 min
@08 Spacing:

The first line is the QXP version plus the encoding (<e9> = UTF-8). Each following line starts with a style name, ended by a colon, then the paragraph's text. The <@FilmRating> in this case is how character styles are used. Because the paragraph ends with the c.style, there is no need for a closing tag or for the tag that resets the style back to the p.style presently in use (which is simply <$>). The <\a> tag is for an en-dash.
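For readers wondering how such a file gets generated from data, here is a rough Python sketch. It only mirrors the layout of the sample above (header line, then one `@Style Name:text` line per paragraph). The record fields and the function name `to_tagged_text` are hypothetical, and no escaping of special characters is attempted.

```python
from typing import Iterable, Tuple

# Header copied verbatim from the sample above (version tag + encoding tag).
HEADER = "<v13.21><e9>"


def to_tagged_text(paragraphs: Iterable[Tuple[str, str]]) -> str:
    """Emit the header, then one '@Style Name:text' line per paragraph.

    Assumption: the text needs no escaping. Real data containing the
    characters used by the tag syntax would need extra handling.
    """
    lines = [HEADER]
    for style, text in paragraphs:
        lines.append(f"@{style}:{text}")
    return "\n".join(lines) + "\n"


# Hypothetical listing record, e.g. one row from a database export.
record = {"date": "11", "time_from": "7", "time_to": "9pm",
          "location": "Minghella Building", "length": "81 min"}

output = to_tagged_text([
    ("01 Date", record["date"]),
    # <\a> is the dash tag used in the sample above.
    ("04 Time", f"{record['time_from']}<\\a>{record['time_to']}"),
    ("05 Location", record["location"]),
    ("07 Length", record["length"]),
])
print(output)
```

The style names ("01 Date" and so on) have to match the paragraph styles already defined in the template for the import to pick them up.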

Importing into my template results in this section's text frames already formatted, across a few pages of listings.

I use tagged text on far more than books, such as the newspaper listings above. Each section is processed in the same way, whether an article, page ads, listings, classifieds, etc. It takes me less than an hour with all the sections to produce their respective tagged text files, minutes to import and about 2 hours to paginate and produce the first draft of the newspaper. Nearly every job/job type uses tagged text.

With a decent Word manuscript, a novel takes me between 2 and 4 hours beginning to end, depending upon images. I couldn't do the same in something that doesn't support tagged text.

MikeA · July 3, 2019

Thanks for taking the time to post that. It's an excellent illustration of how critical such a system is for efficient work. It's triggering a memory of once having done a catalogue job in much the same way. The tagging saved a huge amount of time. (I take it from what you're writing that QXP has some life in it yet.)

At the shop where I worked we had many problems with a thousand little MS-Word "gotchas." I began to dread receiving customer source material in Word format. Because it contained formatting, working with it was faster than importing plain text. Still, there was always a lot of manual "massaging" afterward — sometimes, line-by-line searches for weird problems. Word would sometimes insert strange zero-width characters — I never did learn what they are — that had to be rooted out. It's a mixed blessing.
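A small Python sketch for hunting down invisible characters of this sort in exported text. It rests on an assumption: that the culprits are Unicode "format" characters (category Cf), such as U+200B ZERO WIDTH SPACE or U+FEFF. The post does not say what they actually were, and the helper names are made up for illustration.

```python
import unicodedata


def find_invisible(text):
    """Return (index, code point, name) for every Unicode format character."""
    return [(i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Cf"]


def strip_invisible(text):
    """Drop all format characters. Note this also drops soft hyphens (U+00AD)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


sample = "Nos\u200bferatu\ufeff"
found = find_invisible(sample)
cleaned = strip_invisible(sample)
print(found)
print(cleaned)
```

Listing the characters before deleting them is the safer order, since some of them (soft hyphens, joiners in certain scripts) may be there on purpose.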

I hope Affinity Publisher's authors take this kind of thing seriously, and sooner rather than later. (Or, again, that there's a plug-in architecture making a third-party tool possible.)

Does AP at least accept plain text with HTML tagging — simple stuff like or even just — as input? (I haven't bought it yet. I probably will. It really does look excellent in so many ways.)

MikeW · July 3, 2019

Do download a trial version to test.

No to HTML formatting. At this time, only plain text and RTF file formats are allowed for importing text.

Yep, Q is alive and well.

If you end up with a layout application that imports tagged text and you receive Word files, there is an inexpensive add-on that will both clean the file and export tagged text in minutes. I use it every day.

MikeA · July 3, 2019

1 minute ago, MikeW said:

If you end up with a layout application that imports tagged text and you receive Word files, there is an inexpensive add-on that will both clean the file and export tagged text in minutes. I use it every day.

Sorry to hear about the no-HTML. These are significant gaps, as it were.

The cleanup program sounds intriguing. It might be useful for other purposes, too. What program is it?

MikeW · July 3, 2019

15 minutes ago, MikeA said:

...

The cleanup program sounds intriguing. It might be useful for other purposes, too. What program is it?

Editor's ToolKit Plus 2018, from The Editorium.

http://www.editorium.com

I wouldn't be without it.

MikeA · July 3, 2019

Thanks. Those tools sound extremely useful, all right. Makes me wonder if there's a tool that can take a plain text file containing tagging of some kind and convert it into an MS-Word document, creating named paragraph and character styles as it goes (not just the usual "Heading 1", "Heading 2", etc. styles). That could make the job somewhat less painful.

AP supports RTF, eh? I used to try parsing that stuff in scripts. Painful.

MikeW · July 4, 2019

32 minutes ago, MikeA said:

...Makes me wonder if there's a tool that can take a plain text file containing tagging of some kind and convert it into an MS-Word document, creating named paragraph and character styles as it goes (not just the usual "Heading 1", "Heading 2", etc. styles). That could make the job somewhat less painful...

That's a "sort of." The answer is yes, but not for c.styles. Italic, bold and bold italic do carry over, though. From there one can import into Word and do a search/replace to apply character styles. It can also import simple tables and images.

There is another add-in I use to export from Word as Markdown, and it can paste Markdown as well.

http://www.writage.com

Mike

MikeA · July 4, 2019

I've been hunting about on the web and have run across a few more tools, including something called Pandoc, which its author describes as a Swiss Army Knife of conversion tools. It scores the usual 11 on the 1-to-10 geekiness scale and has the usual minimalist reference material. Whether it can convert from, say, Markdown to a .docx file containing named (user-defined) paragraph and character styles, I can't tell yet.

Fixx · July 4, 2019

Formatting during import is something that needs to happen one way or another. We need style automation on import, whatever the method.

Usually creating the tags is the problematic part. If it is just as time-consuming as styling in the layout app, it is quite unnecessary.

I would think tagging is usually an unneeded extra step, as tagging/styling is something the layout app should be able to do itself. (In Ventura, tagging and styling were much the same concept.) Mapping Word styles to Publisher styles on import would be a good start here.

I understand there are workflows where tagged text gives more control with difficult materials, though.

lacerto · July 4, 2019

(...)

MikeA · July 4, 2019

54 minutes ago, Lagarto said:

Otherwise InDesign's and Publisher's inherent support for applying complex formatting for coherently structured highlighted catalog-kind of paragraph separated text using the (ideally looped) "Apply P1Style, then Next Styles" commands are quite effective as they work without any kind of tagging.

Hmm. Hadn't even thought of whether AP's "next style" command would be effective for this purpose. Yes, if that worked during text import it would be handy (assuming that's what you meant).

MikeA · July 4, 2019

I tried a couple of Word "clone" programs. LibreOffice was buggy...won't take the time to troubleshoot. Uninstalled it. Next was a demo version of an inexpensive program called SoftMaker Office TextMaker. No bugs so far. (I know, I know — give it time...) Next step: Small HTML file containing CSS definitions such as:

p.test {}

Right — just "{ }". Absent the braces, this doesn't work. The test document contains such text as:

<p class="test">something here</p>

This does create a paragraph style "test" in the TextMaker file. Don't know yet about character styles. Having already devised tagging schemes for plain text that became XHTML files, I can see that it wouldn't be difficult to do the same again. The scripting (Perl) is a bit tedious, but once it's done it's done and then you have your HTML file. Open it in the Word (clone) program, import the styles of a previously set-up .DOCX file, and this might be workable after all. Well, for self-authored text. Tediously re-tagging someone else's manuscript would be...ugh.
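The HTML-file step could be scripted along these lines. This is a Python sketch rather than the Perl mentioned above, and the `(style, text)` input format and the function name `build_html` are hypothetical. It emits one empty `p.name {}` rule per style plus a `<p class="name">` per paragraph, as in the test described in this post. Whether a given word processor turns those into named styles is something to verify by hand.

```python
import html


def build_html(paragraphs):
    """Build a small HTML document from a list of (style_name, text) pairs."""
    # Collect style names in first-seen order, without duplicates.
    styles = []
    for style, _ in paragraphs:
        if style not in styles:
            styles.append(style)

    # One empty CSS rule per style; the braces are required, per the test above.
    css = "\n".join(f"p.{name} {{}}" for name in styles)
    body = "\n".join(
        f'<p class="{html.escape(style)}">{html.escape(text, quote=False)}</p>'
        for style, text in paragraphs
    )
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n"
        f"<style>\n{css}\n</style>\n</head>\n<body>\n{body}\n</body>\n</html>\n"
    )


doc = build_html([("test", "something here"), ("test", "more of the same")])
print(doc)
```

Style names are assumed to be valid CSS class identifiers (no spaces, not starting with a digit); names that are not would need escaping or renaming first.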

How AP treats incoming Word styles is another matter. No clue on that yet. And no clue yet whether AP will accept — or choke on — .docx files created in TextMaker.

lacerto · July 4, 2019

(...)

lacerto · July 4, 2019

(...)

walt.farrell · July 4, 2019

1 hour ago, Lagarto said:

(There might be a bug in Publisher as it currently does NOT remove the tag even if the Replace field is left empty, if the replacement contains a style criterion -- InDesign does remove the searched content but applies replacement formatting for the found spot, so you do not need to do separate search and replace without formatting simply to remove the tags, as you currently need in Publisher). EDIT: Oops, it does not, but my script does as it knows the search criterion to be a tag! So one extra round is needed to remove the tags.

A regex find/replace can do it in one pass:

Find: (.*)

Replace: \1
and specify the formatting, such as Emphasis or a style name.

lacerto · July 4, 2019

(...)

MikeW · July 4, 2019

Just now, Lagarto said:

Find ()(.*)()

Replace \2

...I think. It would be nice to be able to save regex expressions. Because I script all recurrent tasks that benefit from parameters, or use saved expressions, I need to check the regex syntax over and over again and never get it right in one shot.

Walt's means is shorter, easier. The open/closing tags will be removed automatically without being captured. Yours will also, but they don't need capturing in the first place.
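To make the comparison concrete outside any layout program, here is a Python sketch of both forms. The tag names did not survive in the expressions quoted above, so `<em>` here is purely a stand-in, and Python's `re` module is used only to show the capture-group logic; applying a style during the replacement is something the layout application's Find/Replace does, not this code.

```python
import re

text = "Some <em>tagged</em> words and <em>more</em> of them."

# Shorter form: one capture group around the content only.
short_form = re.sub(r"<em>(.*?)</em>", r"\1", text)

# Longer form: three groups, of which only the second is kept.
long_form = re.sub(r"(<em>)(.*?)(</em>)", r"\2", text)

# With a greedy ".*" the match runs from the first opening tag to the
# last closing tag on the line, so the inner tags are left behind.
greedy = re.sub(r"<em>(.*)</em>", r"\1", text)

print(short_form)
print(long_form)
print(greedy)
```

Both forms give the same result; the difference between `.*` and `.*?` only matters when a line holds more than one tagged run.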

lacerto · July 4, 2019

(...)

dominik · July 4, 2019

15 minutes ago, Lagarto said:

I hope Serif adds saved regex expressions already in some 1.x update

I know it is a rather lame workaround, but what you already can do is save your expressions as small text snippets as assets. You could create a separate category just for these. Others reported that they keep a separate *.txt file with their collection of expressions.

This really is just a poor workaround but perhaps helpful.

Apart from that this thread is very educational to me

d.

MikeW · July 4, 2019

12 minutes ago, dominik said:

I know it is a rather lame workaround, but what you already can do is save your expressions as small text snippets as assets. You could create a separate category just for these. Others reported that they keep a separate *.txt file with their collection of expressions.

This really is just a poor workaround but perhaps helpful.

Apart from that this thread is very educational to me

d.

While I save certain expressions in my Favorites of my text editor, I also save a text file. I do the same with scripts & macros for the text editor. In all cases I save a text snippet in that text file that led to me writing the expression, script or macro as well. These are modified for the task at hand and saved to a new text file. Been doing this for eons.

MikeA · July 4, 2019

»» you can still import your custom tagged plain text (or even html tags) and then use simply Search and Replace to remove the tags and apply equivalent paragraph and character formatting to text

That would become a lot of work if we're talking about a long document with complex formatting — the kind of work I'd hope to avoid. I've done that kind of thing in the past when I had to. It was tedious and time-consuming. In such cases it's as if the computer isn't helping so much as hindering — creating more work for you rather than less. When you can run scripting within a program to help automate such a task, that helps ease some of the pain.

»» There might be a bug in Publisher as it currently does NOT remove the tag even if the Replace field is left empty, if the replacement contains a style criterion

Sounds like a bug, all right.

If you want to be rid of the tags entirely, when formatting during the replacements is not an issue: Assuming the program supports character classes and "lazy" (non-greedy) matches, this should work (not tested on any real-world document in AP, but I've done this kind of thing many times in the past):

Search for:

<[^>]+?>

That is: find "<", then 1+ of anything that isn't ">", up to — but no further than — the next occurrence of ">" ... and for "replace" use: nothing at all. This would kill ALL of the elements at once — , , <h1>, <ul>, et al. The expression could become more complicated if you also want to remove all closing tags and/or those like " " in a single pass.
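A quick way to try the expression outside AP is Python's `re` module, as in the sketch below. The sample string is made up. Two observations apply to Python's regex engine specifically and may not hold for AP's: the pattern as written already matches closing tags such as `</p>`, and the `?` after `+` is harmless but redundant, because `[^>]` can never run past the next `>` anyway.

```python
import re

# The search expression from the post, unchanged.
TAG = re.compile(r"<[^>]+?>")

sample = '<h1>Title</h1><p class="test">Some <i>text</i> here.</p>'

# Replace every match with nothing at all.
stripped = TAG.sub("", sample)
print(stripped)

# The same pattern without the lazy modifier gives an identical result.
same = re.sub(r"<[^>]+>", "", sample)
print(same == stripped)
```

Note that removing block-level tags this way also removes the only separation between "Title" and the paragraph text, so a real cleanup pass would probably replace some tags with a line break instead of nothing.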

Edited July 4, 2019 by MikeA
typo

lacerto · July 5, 2019

(...)
