BrianUni

Text Not Rendering Properly with Copy/Paste From PDF

(a black diamond with a white question mark) This character usually represents a character that is missing from the font used by a program (for example, when UTF-8 text is written in an application or web page that uses a simpler encoding).

Perhaps there are ligatures or other complex characters in your PDF, or special spaces, etc.?
If you provided at least a sample PDF with those characters, someone would be able to explain it better.
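
As a rough illustration of the encoding case mentioned above, here is a minimal Python sketch. It assumes the text was stored as UTF-8 and is then read by something that only understands ASCII; the function name `decode_lossy` is made up for this example. Each byte the decoder cannot interpret comes back as U+FFFD, the replacement character that is typically drawn as the black diamond with a question mark.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def decode_lossy(data: bytes, encoding: str = "ascii") -> str:
    """Decode bytes, substituting U+FFFD for every byte the codec rejects."""
    return data.decode(encoding, errors="replace")

# 'é' (U+00E9) takes two bytes in UTF-8; ASCII cannot represent either of them.
utf8_bytes = "caf\u00e9".encode("utf-8")
garbled = decode_lossy(utf8_bytes)
print(garbled.count(REPLACEMENT), "replacement characters")
```

This only covers the wrong-encoding case; a glyph that is simply missing from a font is a separate situation and may show up as an empty box instead.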

 


Thanks, I hope @LibreTraining or some other font gurus will help with this one :)

Oops, I clicked too fast. It's the ligatures (tt and ti) that aren't recognized when copied:

Des�ni Sco�

 


According to the answers on this page:

Quote

It depends on the name used in the font for the "ffi" etc. glyph. If the name is the standard one (f_f_i), then intelligent PDF readers like Acrobat are able to separate the glyph into its parts. If a non-standard name is used (as, e.g., in old versions of the cm-fonts), then the reader cannot identify the glyph components. So how copy & paste works depends 1. on the font and 2. on the PDF reader.

 


Those ligatures have no code points in Calibri and are composed of a single glyph. Affinity products are not decomposing the parts and are just stuffing the glyph as is into an unused Unicode code point.

They are named properly in the font file, so ID, for instance, properly encodes the underlying tt glyph (for example) as U+0074 0074, which are two /t characters.
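
For anyone who wants to check pasted text for this kind of damage, here is a minimal Python sketch. It is only an assumption that the stray characters land in private-use or unassigned ranges (the thread does not say which code points Affinity actually writes), and `suspicious_chars` is a hypothetical helper name. It flags the replacement character as well, since that is what showed up in the sample above.

```python
import unicodedata

def suspicious_chars(text):
    """Return (code point, category) pairs for characters unlikely to be real text."""
    flagged = []
    for ch in text:
        cat = unicodedata.category(ch)
        # Co = private use, Cn = unassigned; U+FFFD is the replacement character.
        if cat in ("Co", "Cn") or ch == "\ufffd":
            flagged.append((f"U+{ord(ch):04X}", cat))
    return flagged

pasted = "Des\ufffdni Sco\ufffd"  # the sample pasted earlier in the thread
for code_point, category in suspicious_chars(pasted):
    print(code_point, category)
```

A ligature that was properly decomposed into two U+0074 characters would not be flagged at all, because it arrives as ordinary letters.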


Do you think this is an issue with Affinity Publisher, or with the PDF reader?

I just downloaded Adobe Acrobat Reader DC, then used the copy/paste function, and it gave me squares, not question marks. 

Just now, BrianUni said:

Do you think this is an issue with Affinity Publisher, or with the PDF reader?

I just downloaded Adobe Acrobat Reader DC, then used the copy/paste function, and it gave me squares, not question marks. 

It's an issue with how Affinity applications are writing the PDF.

The issue of the boxed question marks versus the rectangle isn't important per se.


Sorry, Mike.  Hadn't refreshed my page. 

So it's a Calibri issue!

I used Arial and it solved the issue.

Thanks for your help!

Just now, BrianUni said:

Sorry, Mike.  Hadn't refreshed my page. 

So it's a Calibri issue!

I used Arial and it solved the issue.

Thanks for your help!

We are cross posting...

It's not really a font issue, in that both fonts are properly made. Arial has no ligatures that would affect this issue. Calibri does have those ligatures.


@BrianUni

Did you test this with the same document exported to PDF using InDesign?
If yes I would like to see it.

Been playing with this since you first posted.
I have been looking at the actual ToUnicode table in the PDFs for each test and would like to see one from ID too.

30 minutes ago, LibreTraining said:

@BrianUni

Did you test this with the same document exported to PDF using InDesign?
If yes I would like to see it.

Been playing with this since you first posted.
I have been looking at the actual ToUnicode table in the PDFs for each test and would like to see one from ID too.

 

On 8/19/2019 at 12:55 PM, MikeW said:

...They are named properly in the font file, so ID, for instance, properly encodes the underlying tt glyph (for example) as U+0074 0074, which are two /t characters.

I did and the result is in the post of mine I quoted.

On 8/22/2019 at 5:12 AM, LibreTraining said:

Did you test this with the same document exported to PDF using InDesign?
If yes I would like to see it.

Here's one created in InDesign and one in Affinity Publisher. I think it is as MikeW mentioned above: Affinity does not use the font's code point for the ligature but just uses an unused Unicode code point, which shows when you copy and paste the text. If you copy and paste the text from the PDF created by InDesign, you get those characters decomposed (so that the destination software decides whether a ligature is used or not).

ligatures_id.pdf

ligatures_apub.pdf


Actually, it could also be that Affinity Publisher specifically uses the actual ligature code points, while InDesign uses ligature attributes and writes plain "t" + "i" and "t" + "t" as individual characters. In that case, whether a ligature is shown depends on the viewer: a hard-coded glyph can be shown by all viewers, while an attribute requires the viewer to support ligatures as a feature. Whether a ligature code point survives copy and paste as a glyph depends on the destination software, that is, on whether it decomposes the hard-coded glyph into separate characters or supports the hard-coded ligature glyph itself, as a glyph rather than as an attribute.

See below how PDF-XChange can edit and copy the selected ligature in the PDF created with Affinity Publisher ("ti" as a hard-coded glyph), which PDF-XChange can copy paste to duplicate it:

ligatures_apub_xchange.jpg.d3c450a61a8a47a1874fafffb120e83d.jpg

...and below, editing the PDF created by InDesign, where "t" and "i" are separate glyphs that are rendered as the ligature glyph "ti" because it has been coded as a formatting attribute:

ligatures_id_xchange.jpg.f1d8c61a3d4f92e3fea299f5156d3721.jpg

Just guessing. I do not know the internals of PDF encoding, but this kind of makes sense.

EDIT: More guessing... The "attribute" status seems to be achieved by encoding the component character codes one after another, e.g. "t" + "i", which the rendering application maps to the actual ligature glyph if it supports the feature and the font has the corresponding ligature glyph, and renders as separate glyphs "t" and "i" if there is no support. Affinity Publisher seems to refer directly to the ligature glyph code, so that when it is copied and pasted it does not necessarily get rendered correctly in the destination app.

See ligatures_apub2.pdf below, which has the ligature "fi" added to the text. When this text is copied and pasted back into Affinity Publisher (or Illustrator, InDesign, etc.), the "fi" ligature is rendered correctly (either as the ligature "fi" or as separate "f" and "i", depending on whether the standard ligature attribute is turned on in the destination app), but "ti" is not (it is not supported in most fonts, and it is not rendered even if the Calibri font, which supports it, is retained in the destination app).

I don't know whether these two ligatures have been encoded differently, or whether both are "hard coded" (referring directly to the ligature glyphs) and only "fi" gets mapped to the ligature "attribute" by encoding it as "f" + "i" (U+0066 + U+0069). Anyway, concatenating the component characters seems to be the safer method, as it does not depend on the font's capabilities or on the rendering application's ability to use ligature glyphs.
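
One possible reason "fi" behaves better than "ti" can be sketched in Python. The "fi" ligature has its own Unicode code point (U+FB01) with a compatibility decomposition to "f" + "i", so software can expand it, for example through NFKC normalization. A "ti" glyph has no standard code point, so there is nothing to decompose. Whether any of the apps mentioned here actually use normalization is an assumption, and `expand_ligatures` is a hypothetical helper name.

```python
import unicodedata

def expand_ligatures(text):
    """Expand compatibility characters such as U+FB01 into their components.

    Note: NFKC also rewrites other compatibility characters (for example,
    a no-break space becomes a plain space), so it is not ligature-specific.
    """
    return unicodedata.normalize("NFKC", text)

print(expand_ligatures("\ufb01nal"))        # U+FB01 expands to "f" + "i"
print(repr(expand_ligatures("\ue000")))     # a private-use code point is left as is
```

A glyph stuffed into an arbitrary code point carries no such decomposition, which fits the observation that concatenating the component characters is the safer route.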

 

ligatures_apub2.pdf


Thanks for the ID PDF.

These PDFs kinda confirmed what I was thinking based on reviewing the actual ToUnicode tables in the PDFs.

When APub exports a font subset it does not put the correct glyph number in the ToUnicode table. It appears to just have increasing/incrementing numbers in that field. That is why we see glyph numbers like #03 which is just wrong. And the character it gets mapped to is also wrong where it is often simply [20] which is the Unicode space code point.

When APub prints to a PDF printer and embeds the entire font, it does a bit better. I saw in the ToUnicode table that there were now what looked like actual glyph numbers. It does have the correct glyph number for the "ti" ligature (#415), but it still does not connect that to the correct Unicode code points.
In my test PDF with the full font embedded, it maps to LATIN CAPITAL LETTER O WITH MIDDLE TILDE [19F], which is obviously wrong.

In your ID PDF, which only has a subset embedded, it correctly identifies it as glyph #415 in the font, and maps it to two Unicode code points: LATIN SMALL LETTER T [74] + LATIN SMALL LETTER I [74] - (Note this could be an error in my PDF tool as small letter i is actually [69] - or ID messed-up). The small letter t is correct as [74].
So it has the correct glyph number if you have the font installed.
And it maps to the correct multiple Unicode code points if not.

The "fi" ligature is different in that it actually has a Unicode code point.
In your ligatures_apub2 PDF APub incorrectly sets the glyph number as 11 (should be #302), but it does map it to the correct Unicode code point: LATIN SMALL LIGATURE FI [FB01].

So APub is inserting both wrong glyph numbers, and wrong Unicode code points.
Sometimes it does get one or the other correct, but not at the same time.

Since we have not heard from any Affinity folks about this I assume that they know this is not working properly and currently have the fire hose aimed elsewhere. :D

1 hour ago, LibreTraining said:

...In your ID PDF, which only has a subset embedded, it correctly identifies it as glyph #415 in the font, and maps it to two Unicode code points: LATIN SMALL LETTER T [74] + LATIN SMALL LETTER I [74] - (Note this could be an error in my PDF tool as small letter i is actually [69] - or ID messed-up). The small letter t is correct as [74]...

The tool is in error. ID, etc., does this correctly.
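
The expected values are easy to verify with a short Python sketch. It prints the UTF-16BE hex string for a piece of text, which is the form a ToUnicode destination takes; `tounicode_hex` is a hypothetical helper name.

```python
def tounicode_hex(text):
    """Return the UTF-16BE bytes of text as an uppercase hex string."""
    return text.encode("utf-16-be").hex().upper()

print(tounicode_hex("ti"))  # "t" is 0x74, "i" is 0x69
print(tounicode_hex("tt"))  # two "t" characters
```

A tool that reports [74] + [74] for "ti" is therefore describing "tt", which supports the conclusion that the display was at fault rather than the mapping.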

Capture_000219.png.9341e8ea4b74604c1ce076ccf5a25f89.png


Thanks for the explanations. Did I understand correctly that the safe method to encode ligatures is the "concatenation" method, e.g. (U+0066 + U+0069) for the ligature "fi" and (U+0074 + U+0069) for the ligature "ti", rather than referring to the existing code point for the glyph itself, "fi" (FB01) or "ti" (wherever that is placed in the font, whether subsetted or not)?

Anyway, it was good to see that PDF-XChange (Editor Plus) can edit the text as if in a text editor no matter how the ligatures are encoded. That is, I can copy and paste the actual "ti" ligatures created by Affinity Publisher without problems, which I cannot do in Adobe Acrobat or any other tool: when the PDF is opened, the "ti" is not rendered, even in Affinity Publisher, and copying and pasting from one app to another is guaranteed to fail when custom encoding is used.

I had to purchase this tool recently because InDesign could not produce an acceptable PDF/A archive format for a thesis (!). I have now used the tool to automatically generate hierarchical bookmarks from a TOC (quite useful with Affinity Publisher), so it has proven to be quite a capable PDF tool!


This morning I realized an error in my thinking.
When I print to PDF to force embedding of the whole font, it is the PDF printer's rendering engine that does the glyph/Unicode mapping, not the APub PDF rendering engine.
So one of the many non-Adobe PDF printers may work correctly (as a workaround for now).
I have to test some others.
