William Overington

Copying text out of a PDF document when the font has OpenType ligature capability


I have found a fantastic feature in Affinity Publisher. This feature is really top class quality.

I have been experimenting. Here is the latest result.

Suppose that one produces a PDF document (for publication on the web) where the text of the document is displayed using a font that has OpenType ligature capability.

Suppose that one now copies the text from the PDF document and pastes it into WordPad.

The underlying text is displayed. Not in the same font, but that is not the issue. The point is that the underlying text is displayed, rather than blanks where some or all of the ligatures appear.

For this to work, though, the ligature glyph itself must not also be mapped to a code point.

This is a great facility. It means that one can publish a PDF document of a poem and have ligatures in the display, yet the poem can be copied from the PDF document and the underlying text pasted into another document, such as in, say, WordPad.

Gold star for that.

Something that I have not yet tested is what happens if one has an alternate glyph for a single letter and one produces a PDF document.

For example, a swash e at the end of a line of the text of a poem.

I made two fonts for the tests. The first one tried three possibilities, namely the st ligature mapped to its regular Unicode code point, the ct ligature unmapped and the et ligature mapped into the Private Use Area. When I observed the result, I made the second font with the ct ligature and the et ligature unmapped.
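As a rough illustration of why those three mappings can behave differently when text is copied out, the sketch below uses Python's standard `unicodedata` module. It only shows what Unicode itself says about the code points; it is not a description of what Affinity Publisher does internally. U+FB06 is the regular Unicode st ligature, while `PUA_EXAMPLE` (U+E000) is a hypothetical Private Use Area code point chosen for illustration, since the post does not say which one was used for the et ligature.

```python
import unicodedata

ST_LIGATURE = "\uFB06"  # LATIN SMALL LIGATURE ST, the regular Unicode st ligature
PUA_EXAMPLE = "\uE000"  # hypothetical Private Use Area code point (assumption)


def plain_text(ch: str) -> str:
    """Return the compatibility-normalized (NFKC) form of a character."""
    return unicodedata.normalize("NFKC", ch)


def is_private_use(ch: str) -> bool:
    """True if the character lies in a Private Use Area (category 'Co')."""
    return unicodedata.category(ch) == "Co"


# The Unicode st ligature has a compatibility decomposition to "s" + "t".
print(plain_text(ST_LIGATURE))

# A Private Use Area character has no decomposition, so software cannot
# recover "et" from it without extra information.
print(is_private_use(PUA_EXAMPLE), plain_text(PUA_EXAMPLE) == PUA_EXAMPLE)
```

An unmapped ligature glyph has no code point of its own at all, so the only text a PDF producer can attach to it is the sequence of letters that the glyph replaced.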

Actually, I have not turned OpenType on in Affinity Publisher. I just started typing using the font and the ligature glyphs appeared automatically. I had intended entering the text and then trying to find the OpenType facility.

Does anyone have any information about the use of alternate glyphs in Affinity Publisher, please?

I need to relearn how to make a font with an alternate swash e glyph where the glyph is unmapped and in an OpenType font. I have such a glyph available but it is not in an OpenType font as an OpenType alternate glyph.

Please find two fonts and two PDF documents attached.

William Overington

Wednesday 26 December 2018

 

gold_star.png

ligatest.otf

ligatst2.otf

ligature_test_affinity.pdf

ligature_test_2_affinity.pdf


Using a Lenovo laptop running Windows 10 in England


I have made good progress.

Please find attached a font and a PDF document.

I have produced both of them myself, respectively using High-Logic FontCreator 8 and Serif Affinity Publisher Beta. The PDF document uses the font.

The PDF document contains a poem that I have written today, written to show five features of the font, namely three ligatures and two stylistic alternates. The stylistic alternates are each for lowercase e.

The ligatures just appeared as I keyed the poem into the computer. For the alternates, I needed to highlight each particular letter e in the text (one at a time), then use Text > Show Typography and choose the desired alternate glyph to replace the ordinary letter e at that location.

The really great thing about the PDF document is that if one copies the text from it and pastes the text into WordPad, one gets the underlying original text.

Not all desktop publishing programs do that. Some just have a blank for the two letters of the ligature glyph (except sometimes for st, which is a special case) or a blank for the stylistic alternate.

The Serif Affinity team have done really well to provide this facility. This facility makes Affinity Publisher a top class product.

William

Thursday 27 December 2018

 

 

ligatst3.otf

white.pdf



Having had great success with this feature, I tried something yesterday that was, in fact, really pushing the envelope.

Things did not work out entirely well, so I thought about it, tried again and got a better result: certainly useful, but not quite perfect.

Bearing in mind the extreme envelope pushing involved I decided to just keep it all to myself.

Yet, thinking about it, I am posting details of what has happened, just in case it might highlight some bug that might be worth fixing.

In the three test fonts thus far posted, there are three ligature glyphs, one for a ct ligature, one for an et ligature, one for an st ligature.

The font Ligature test 4 in ligatst4.otf added two more ligatures to the liga table of the font.

I have for some years, since 2009, been carrying out a research project from time to time on communication through the language barrier.

Since 2016 I have been writing a novel based around some of the ideas and how they may be applied.

This test involves a part of the research that is in the novel yet not in a scientific research document, so the links here are to the novel, but just enough so as to give the necessary background to the experiment.

The novel, which is not at the present time complete, is linked, chapter by chapter, from the following web page. Most of the chapters are not very long, so there is not a lot of reading involved for this topic.

http://www.users.globalnet.co.uk/~ngo/novel.htm

For the present purpose,

please read Chapter 46 from the second section of page 1, and page 2;

the second section of Chapter 50 version 2, just the first page for this purpose;

and the fourth and fifth blue glyphs on page 3 of Chapter 72.

What it comes down to for this test is that there is a sequence !123 that is to be regarded as a ligature that will produce the symbol designed to represent 'Good day.' in a language-independent manner and that there is a sequence !987 that is to be regarded as a ligature that will produce the symbol designed to represent 'Best regards,' in a language-independent manner.

The first test is will Affinity Publisher substitute the symbol for !123 automatically? Yes it does.

The second test is can it be copied out of the PDF into plain text? No, it cannot.

I wondered whether the names I had given the glyphs might have something to do with it. I had named one glyph good_day and the other best_regards, whereas the ct ligature had been named c_t, the et ligature e_t and the st ligature s_t, so maybe the name provides a clue for decoding in some way.

So on to the font Ligature test 4a in ligatst4a.otf, which is as far as I have got at present.

So I looked up the glyph name for the exclamation mark, which is exclam, and renamed the glyphs as follows.

exclam_one_two_three

exclam_nine_eight_seven

The first test is will Affinity Publisher substitute the symbol for !123 automatically? Yes it does.

The second test is can it be copied out of the PDF into plain text? Yes, it can, but there seems to be an issue of missing out one or more space characters near the glyph that is decoded.

So it seems to be trying to work but it is not quite right.

By the way, the glyphs shown displayed in Chapter 72 were done using a font where the special glyphs are in an ordinary TrueType font and mapped into the Private Use Area.

If any reader wants to have a look at some more glyphs that have been produced as part of the project, Chapter 5 and Chapter 42 have a number shown. Chapter 34, whilst not showing any glyphs as such might give an insight into the ideas of the project.

William

 

 

ligatst4.otf

ligatst4a.otf

Edited by William Overington
Adding attachments

9 minutes ago, William Overington said:

> The first test is will Affinity Publisher substitute the symbol for !123 automatically? Yes it does.
>
> The second test is can it be copied out of the PDF into plain text? No, it cannot.

What were your export settings, William? Does the PDF file include an embedded subset of your font?

After you copied the symbol from the PDF, where did you try to paste it?


Alfred
Affinity Designer/Photo/Publisher 1.7.3.481 • Windows 10 Home (4th gen Core i3 CPU)
Affinity Photo for iPad 1.7.3.155 • Designer for iPad 1.7.3.1 • iOS 12.4.1 (iPad Air 2)


Hello Alfred

> What were your export settings, William?

PDF for web.

> Does the PDF file include an embedded subset of your font?

Yes. I remember seeing a setting for that the other day with another project, but it is the default and I did not touch it.

> After you copied the symbol from the PDF, where did you try to paste it?

WordPad

Since posting I have been producing some carefully composed source files and PDF documents that show the issue; those I made yesterday were just rough, and I had lost them anyway.

Here are the attachments now, in chronological order.

The second .afpub file is just a Save As from the first one and then a change of font.

Please note how the spaces around the first glyph and the space in front of the second glyph do not get through to WordPad.

William

 

 

 

 

test004.afpub

test004.pdf

test004a.afpub

test004a.pdf



Just in case it was a more general issue if the ligature has a space in front of it or after it, I produced the following .afpub files and PDF documents. The copying from the PDFs to WordPad works fine. So it is not general.

 

test002a.afpub

test002a.pdf

test004b.afpub

test004b.pdf



BTW, this works properly for applications using the Adobe PDF engine. I tried a couple other applications that, like Affinity applications, use a different PDF engine and all failed. As such, I don't know if it is a failing of these other PDF engines as used or if they are incapable of using the decomposed code points for non-standard ligatures.


Thank you.

Well, et is not a standard ligature, yet that has worked fine.

> BTW, this works properly for applications using the Adobe PDF engine.

Which particular test, with which font, does "this" in the above line refer to, please?

William

 


15 minutes ago, William Overington said:

> Thank you.
>
> Well, et is not a standard ligature, yet that has worked fine.
>
> > BTW, this works properly for applications using the Adobe PDF engine.
>
> Which particular test, with which font, does "this" in the above line refer to, please?
>
> William

But the et ligature is one that APub's PDF engine understands. Dave Harris can likely make Affinity applications understand the sequence exclam one two three (etc.), but they do not do so currently.

I made one of my fonts use a ligature named good_day. The glyph is a simple pair of rectangles with inner ovals. I also used the word "start", which uses a discretionary ligature for the st combination, and I included the same string of text below the text that uses the dlig feature. In APub and the resulting PDF, it appears as such:

capture-002401.png

If I look into a PDF using the above sequence (the bold words), then I see this in the PDF:

capture-002400.png

As you can see, the st lig has the decomposed Unicode code points underlying it. But for the good_day lig, the underlying code point is U+FFFD, the Unicode code point used for any character that cannot be determined. From Wikipedia:

> U+FFFD  REPLACEMENT CHARACTER used to replace an unknown, unrecognized or unrepresentable character

If I use the same text in InDesign, though, the Unicode code point for the good_day lig is represented properly in the PDF:

capture-002402.png

Because of this decomp in the PDF, the text from the PDF pastes as such into WordPad:

capture-002403.png
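The behaviour described above can be pictured with a small sketch. It assumes, purely for illustration, that text extraction amounts to looking each glyph up in a glyph-to-text table of the kind a PDF carries, and emitting U+FFFD when a glyph has no usable entry. The table contents, the glyph run and the text given to good_day are all hypothetical; this is not the actual data from either PDF.

```python
REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER

# Hypothetical glyph-to-text table, loosely analogous to what a PDF stores
# so that viewers can turn glyphs back into characters.
to_unicode = {
    "s_t": "st",  # ligature glyph carrying its decomposed letters
    "a": "a",
    "r": "r",
    "t": "t",
    # no entry for "good_day": the producer could not work out its text
}


def extract_text(glyphs, mapping):
    """Map each glyph name to text, falling back to U+FFFD when unknown."""
    return "".join(mapping.get(glyph, REPLACEMENT) for glyph in glyphs)


glyph_run = ["s_t", "a", "r", "t", "good_day"]

# The st ligature comes back as "st"; good_day comes back as U+FFFD.
print(extract_text(glyph_run, to_unicode))

# If the producer records the source characters for the ligature instead
# (here an assumed "!123"), the same run extracts cleanly.
fixed = dict(to_unicode, good_day="!123")
print(extract_text(glyph_run, fixed))
```

On this reading, the difference between the two PDFs lies in what the producing application writes into that table for a ligature it does not recognise, not in the font itself.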

Mike

 

 


If the glyph for 'Good day.', which is accessed by !123, has a PostScript name within the font of exclam_one_two_three, then the !123 can be pasted from the PDF to WordPad. If the same glyph instead has a PostScript name of good_day, then the !123 cannot be pasted from the PDF to WordPad. It is as if the correct code points for !123 are decoded from the glyph name, which may or may not be the case.

However, the spaces around the !123 do not get pasted. I am wondering if this is because the width of the glyph for 'Good day.' is a lot wider than the combined width of the !123 characters upon which a glyph substitution takes place.
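The glyph-name hypothesis above matches a well-known convention: the Adobe Glyph List rules for deriving text from a glyph name split the name at underscores and look each component up by its standard name, with unknown components contributing nothing. Whether Affinity Publisher actually does this is not confirmed anywhere in this thread, so the following is only a sketch of the idea. The lookup table is a tiny hand-written excerpt, and the rules for `uniXXXX` names and for suffixes after a full stop are left out.

```python
# Tiny hand-written excerpt of standard glyph names; a real table is far larger.
GLYPH_NAME_TO_CHAR = {
    "exclam": "!",
    "one": "1", "two": "2", "three": "3",
    "seven": "7", "eight": "8", "nine": "9",
}


def component_to_text(name: str) -> str:
    """Map one name component to text, or to '' if it is not recognised."""
    if name in GLYPH_NAME_TO_CHAR:
        return GLYPH_NAME_TO_CHAR[name]
    # The standard names of the basic Latin letters are the letters themselves.
    if len(name) == 1 and name.isascii() and name.isalpha():
        return name
    return ""


def decode_glyph_name(glyph_name: str) -> str:
    """Split a ligature glyph name at underscores and decode each component."""
    return "".join(component_to_text(part) for part in glyph_name.split("_"))


for name in ("c_t", "e_t", "exclam_one_two_three", "good_day"):
    print(name, "->", repr(decode_glyph_name(name)))
```

Under this scheme c_t and exclam_one_two_three decode to the typed text, while good_day decodes to nothing because neither "good" nor "day" is a standard glyph name, which is consistent with the results reported so far.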

William

 



So I made a test font, ligatst5.otf (Ligature test 5), in which I changed the glyph name for the 'Good day.' glyph to asterisk_a_h_h and changed the OpenType code accordingly. That is, the glyph name has nothing to do with the !123 sequence but is nevertheless made up of standard PostScript glyph names used in fonts.

The line of text in the OpenType code is as follows.

  sub exclam one two three -> asterisk_a_h_h;

I installed the font and made a copy of test004a.afpub as test005.afpub and simply formatted the text with the Ligature test 5 font.

The PDF document test005.pdf displayed the two special glyphs well.

Copying from the PDF and pasting to WordPad gave *ahh for the plain text version.
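To make the substitution step concrete, here is a minimal sketch of what a ligature rule such as the one quoted above does to a run of glyphs. It is a toy model, not how any real shaping engine is implemented: `CHAR_TO_GLYPH` is a hand-written excerpt, letters are assumed to use their own letter as glyph name, and rules are simply tried longest first.

```python
# Hand-written excerpt mapping typed characters to standard glyph names.
CHAR_TO_GLYPH = {"!": "exclam", "1": "one", "2": "two", "3": "three", " ": "space"}

# The rule from the post: sub exclam one two three -> asterisk_a_h_h;
LIGATURES = {
    ("exclam", "one", "two", "three"): "asterisk_a_h_h",
}


def to_glyphs(text):
    """Turn typed text into glyph names (letters keep their own name here)."""
    return [CHAR_TO_GLYPH.get(ch, ch) for ch in text]


def apply_ligatures(glyphs, ligatures):
    """Replace each matching glyph sequence with its ligature glyph."""
    rules = sorted(ligatures.items(), key=lambda kv: len(kv[0]), reverse=True)
    out = []
    i = 0
    while i < len(glyphs):
        for sequence, ligature in rules:
            n = len(sequence)
            if tuple(glyphs[i:i + n]) == sequence:
                out.append(ligature)  # one glyph now stands for n characters
                i += n
                break
        else:
            out.append(glyphs[i])  # no rule matched at this position
            i += 1
    return out


print(apply_ligatures(to_glyphs("a !123 b"), LIGATURES))
```

After substitution the glyph run contains only the name asterisk_a_h_h, with no trace of the typed !123. That is consistent with the pasted text being *ahh, as would be expected if the text were rebuilt from the glyph name rather than from the original characters.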

William

 

ligatst5.otf

test005.pdf

