Text Not Rendering Properly with Copy/Paste From PDF

BrianUni · August 19, 2019

When I copy and paste text from a PDF file I exported from Affinity Publisher, I get weird characters.

Wosven · August 19, 2019

� (a black diamond with a white question mark): this character usually stands for a character that the program cannot map or display, for example when UTF-8 text is opened in an application or web page that uses a simpler encoding.
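To see the encoding case in isolation, here is a minimal Python sketch. It assumes nothing about the PDF in this thread; it only shows how a decoder substitutes U+FFFD when it meets a byte it cannot interpret.

```python
# "é" stored as a single Latin-1 byte is not a valid UTF-8 sequence on its own.
raw = "é".encode("latin-1")

# With errors="replace", the decoder substitutes U+FFFD for the bad byte
# instead of raising UnicodeDecodeError.
decoded = raw.decode("utf-8", errors="replace")

print(repr(decoded))
```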

Perhaps there are ligatures or other complex characters in your PDF, or special spaces, etc.?
If you provide at least a sample PDF with those characters, someone will be able to explain it better.

BrianUni · August 19, 2019

Thanks for your reply!

Attached is the PDF in question.

8-19-19 Quote Universal Healthcare - Fuquay-Varina .pdf

Wosven · August 19, 2019

Thanks, I hope @LibreTraining or some other font gurus will help with this one.

Oops, I clicked too fast. It's the ligatures (tt and ti) that aren't recognized when copied:

Des�ni Sco�

Wosven · August 19, 2019

According to the answers on this page:

Quote

It depends on the name used in the font for the "ffi" etc. glyph. If the name is the standard one (f_f_i), then intelligent PDF readers like Acrobat are able to separate the glyph into its parts. If a non-standard name is used (as, e.g., in old versions of the cm-fonts), then the reader cannot identify the glyph components. So how copy & paste works depends 1. on the font and 2. on the PDF reader.
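As a rough illustration of the naming convention described in that quote, here is a simplified Python sketch. The function `glyph_name_to_text` is hypothetical and handles only single-letter components and `uniXXXX` names; a real reader follows the full glyph-naming rules and a much larger name list.

```python
import re
from typing import Optional

def glyph_name_to_text(name: str) -> Optional[str]:
    """Simplified sketch: drop any ".suffix", split on "_",
    and map each component to one character."""
    base = name.split(".", 1)[0]
    chars = []
    for part in base.split("_"):
        if len(part) == 1 and part.isascii() and part.isalpha():
            chars.append(part)                    # "t" -> "t"
        elif re.fullmatch(r"uni[0-9A-F]{4}", part):
            chars.append(chr(int(part[3:], 16)))  # "uni0074" -> "t"
        else:
            return None  # non-standard name: components cannot be recovered
    return "".join(chars)

print(glyph_name_to_text("f_f_i"))
print(glyph_name_to_text("t_t"))
print(glyph_name_to_text("glyph123"))
```

A name like `glyph123` carries no information about its parts, which is the situation the quote describes for non-standard names.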

MikeW · August 19, 2019

Those ligatures have no code points in Calibri and are composed of a single glyph. Affinity products are not decomposing the parts and are just stuffing the glyph as is into an unused Unicode code point.

They are named properly in the font file, so ID, for instance, properly encodes the underlying tt glyph (for example) as U+0074 0074, which are two /t characters.
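For reference, the text a ToUnicode entry maps to is written as UTF-16BE hex, so "tt" comes out as 0074 0074. A small Python sketch (the helper name `to_unicode_hex` is just for illustration):

```python
def to_unicode_hex(text: str) -> str:
    """Return the UTF-16BE bytes of `text` as upper-case hex,
    the form used for the target string in a ToUnicode mapping."""
    return text.encode("utf-16-be").hex().upper()

print(to_unicode_hex("tt"))
print(to_unicode_hex("ti"))
```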

BrianUni · August 19, 2019

Do you think this is an issue with Affinity Publisher, or with the PDF reader?

I just downloaded Adobe Acrobat Reader DC, then used the copy/paste function, and it gave me squares, not question marks.

MikeW · August 19, 2019

It's an issue with how Affinity applications are writing the PDF.

The issue of the boxed question marks versus the rectangle isn't important per se.

BrianUni · August 19, 2019

Sorry, Mike. Hadn't refreshed my page.

So it's a Calibri issue!

I used Arial and it solved the issue.

Thanks for your help!

MikeW · August 19, 2019

We are cross posting...

It's not really a font issue, in that both fonts are properly made. Arial has no ligatures that would affect this issue. Calibri does have those ligatures.

BrianUni · August 19, 2019

Ok, I see that connection with the tt in the Calibri font. Thanks for your help, Mike!

kenmcd · August 22, 2019

@BrianUni

Did you test this with the same document exported to PDF using InDesign?
If yes I would like to see it.

Been playing with this since you first posted.
I have been looking at the actual ToUnicode table in the PDFs for each test and would like to see one from ID too.

MikeW · August 22, 2019

On 8/19/2019 at 12:55 PM, MikeW said:

...They are named properly in the font file, so ID, for instance, properly encodes the underlying tt glyph (for example) as U+0074 0074, which are two /t characters.

I did and the result is in the post of mine I quoted.

BrianUni · August 23, 2019

Pauls · September 16, 2019

Can we get the original afpub file please. It can be uploaded here

BrianUni · September 16, 2019

Done! thanks, Paul!

lacerto · September 16, 2019

(...)

lacerto · September 16, 2019

(...)

kenmcd · September 17, 2019

Thanks for the ID PDF.

These PDFs kinda confirmed what I was thinking based on reviewing the actual ToUnicode tables in the PDFs.

When APub exports a font subset it does not put the correct glyph number in the ToUnicode table. It appears to just have increasing/incrementing numbers in that field. That is why we see glyph numbers like #03 which is just wrong. And the character it gets mapped to is also wrong where it is often simply [20] which is the Unicode space code point.

When APub prints to a PDF printer and embeds the entire font it does a bit better. I saw in the ToUnicode table that there were now what looked like actual glyph numbers. It does actually have the correct glyph number for the "ti" ligature (#415), but it still does not connect that to correct Unicode code points.
In my test PDF with the full font embedded it maps to: LATIN CAPITAL LETTER O WITH MIDDLE TILDE [19F] - which is obviously wrong.
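One thing worth noting about that wrong mapping: 415 in hexadecimal is 19F. The quick Python check below shows that the code point with the same number as the glyph is exactly the character reported above. This only suggests, and does not prove, that the glyph number is being written where the Unicode value should go.

```python
import unicodedata

glyph_id = 415  # glyph number of the "ti" ligature, as reported above

# Treat the glyph number as if it were a Unicode code point.
as_code_point = chr(glyph_id)
name = unicodedata.name(as_code_point)

print(hex(glyph_id), name)
```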

In your ID PDF, which only has a subset embedded, it correctly identifies it as glyph #415 in the font, and maps it to two Unicode code points: LATIN SMALL LETTER T [74] + LATIN SMALL LETTER I [74]. (Note: this could be an error in my PDF tool, as small letter i is actually [69], or ID messed up.) The small letter t is correct as [74].
So it has the correct glyph number if you have the font installed.
And it maps to the correct multiple Unicode code points if not.
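For anyone who wants to inspect these tables themselves, here is a minimal Python sketch that reads `bfchar` entries from a ToUnicode CMap. The `SAMPLE_CMAP` text is hand-written for illustration using the glyph numbers mentioned in this thread (415 = 019F, 302 = 012E); it is not copied from any of the attached PDFs. It assumes the source code in each entry equals the glyph number and ignores `bfrange` entries.

```python
import re

# Hypothetical ToUnicode fragment, written by hand for illustration only.
SAMPLE_CMAP = """\
2 beginbfchar
<019F> <00740069>
<012E> <FB01>
endbfchar
"""

# One bfchar entry: <source code> <UTF-16BE target string>
BFCHAR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text: str) -> dict:
    """Map each source code (as an int) to its decoded Unicode text."""
    mapping = {}
    for src, dst in BFCHAR.findall(cmap_text):
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

mapping = parse_bfchar(SAMPLE_CMAP)
for code, text in mapping.items():
    print(code, [f"U+{ord(ch):04X}" for ch in text])
```

An entry that maps one code to two code points, as in the first line of the sample, is what lets a reader paste "ti" for a single ligature glyph.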

The "fi" ligature is different in that it actually has a Unicode code point.
In your ligatures_apub2 PDF APub incorrectly sets the glyph number as 11 (should be #302), but it does map it to the correct Unicode code point: LATIN SMALL LIGATURE FI [FB01].
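When a ligature does come through as its own code point like this, the pasted text can be expanded afterwards with Unicode compatibility normalization. A short Python sketch, as a possible clean-up step and not something Affinity or the PDF reader does for you:

```python
import unicodedata

ligature = "\ufb01"  # the single-character "fi" ligature
pasted = "de" + ligature + "ne"

# NFKC replaces compatibility characters such as U+FB01 with plain letters.
cleaned = unicodedata.normalize("NFKC", pasted)

print(unicodedata.name(ligature))
print(cleaned)
```

This does not help with "tt" or "ti", since those ligatures have no code point of their own to normalize.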

So APub is inserting both wrong glyph numbers, and wrong Unicode code points.
Sometimes it does get one or the other correct, but not at the same time.

Since we have not heard from any Affinity folks about this I assume that they know this is not working properly and currently have the fire hose aimed elsewhere.

MikeW · September 17, 2019

1 hour ago, LibreTraining said:

...In your ID PDF, which only has a subset embedded, it correctly identifies it as glyph #415 in the font, and maps it to two Unicode code points: LATIN SMALL LETTER T [74] + LATIN SMALL LETTER I [74] - (Note this could be an error in my PDF tool as small letter i is actually [69] - or ID messed-up). The small letter t is correct as [74]...

The tool is in error. ID, etc., does this correctly.

lacerto · September 17, 2019

(...)

kenmcd · September 17, 2019

I did realize this morning an error in my thinking.
When I print to PDF to force embed the whole font it is the PDF printer rendering engine that is doing the glyph/Unicode mapping, not the APub PDF rendering engine.
So one of the many non-Adobe PDF printers may work correctly (as a work-around for now).
Have to test some others.
