
[1.9.0.902] Merge - Character encoding problem?



Hello!

I am working on reports and merging data from a UTF-8 CSV file. As you can see in the screenshot, the data contains the string "Côte d'Ivoire", but the result in Publisher is incorrect.

I tried the following, without success:

  • Writing the string manually: it works perfectly well with the same font (Fira Sans), but then of course it is no longer a merge.
  • Opening the data in other software: Numbers, Excel, EasyCSV Editor, and CotEditor all display the string correctly.
  • Double-checking the data for invalid characters.
  • Changing the encoding (UTF-16, ISO-8859-1, etc.).

Does anyone have the same issue? Is there a workaround? Is it a bug?

Also, while we're on the subject of the merge operation: how can I export each record as a separate file instead of one huge file?

I cannot provide the CSV file as it contains sensitive information. But I can follow instructions to provide more details.

Thanks a lot!

Publisher bug.jpg


26 minutes ago, Delden said:

I cannot provide the CSV file as it contains sensitive information.

It is understandable that you cannot provide such a file, yet providing one is of great value in solving the problem. In such a case, you would make a copy of the file and strip it down to the minimum needed to illustrate the bug. For example, if this is a list of addresses, you would keep just a small handful of rows and change every value except the ones needed to reproduce the problem. In this case, that would mean leaving the "Côte d'Ivoire" portion as is and making up values for the other fields.


Looks fine here on macOS 10.14.6.

Perhaps it is a font-specific issue, although it works with all the fonts I scrolled through.

Mac Pro (Late 2013) Mac OS 12.7.4 
Affinity Designer 2.4.1 | Affinity Photo 2.4.1 | Affinity Publisher 2.4.1 | Beta versions as they appear.

I have never mastered color management, period, so I cannot help with that.


Thanks Bruce!

I didn't mention that I am currently on macOS 11.1 Big Sur.

The font used is Fira Sans, and I've already made a number of multilingual documents with it without any problems.

The problem appears only during the merge preview/process (while importing the data from the CSV file), not when the text is entered manually.


Are you using Fira Sans in the program which made the CSV file? Could be different fonts having different glyphs for the same codes.



The CSV file is built using R and RStudio (data-science stuff) on raw data; it's purely computational. I have also produced thousands of plots, exported as PNG files, and they all display the strings correctly.

Other applications read the CSV file without any problem.


Could you toggle the Unicode for the mess that is the merged result? Select the word Côte and hit Control + U, or use Text > Toggle Unicode. It should give U+0043U+00F4U+0074U+0065; it is the replacement for U+00F4 that I am interested in seeing.
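
For reference, the same notation can be produced outside Publisher with a couple of lines of Python (purely for illustration, nothing Publisher-specific):

    # Print the code points of a string in the same U+XXXX notation
    # that Publisher's Toggle Unicode display uses.
    word = "Côte"
    print("".join(f"U+{ord(ch):04X}" for ch in word))  # U+0043U+00F4U+0074U+0065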



@Delden

Which format of Fira Sans are you using? OTF or TTF?

The reason I ask is that they store compound glyphs, such as the ô in Côte, differently, and it looks like the glyph was decomposed (which is essentially what Old Bruce is getting at by looking at the resulting Unicode codes). The import program may be having an issue with compound glyphs, as some printers do.

Can you try the other version of the fonts? If you are using the TTFs, try the OTFs, and so on.
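
If you want to check how the glyph is actually stored, the fontTools Python library can report it (a sketch; the font file name here is hypothetical):

    from fontTools.ttLib import TTFont  # third-party: pip install fonttools

    font = TTFont("FiraSans-Regular.ttf")  # hypothetical local file name
    glyph_name = font.getBestCmap()[ord("ô")]
    # TTFs keep outlines in the 'glyf' table, where a glyph may be a
    # composite (o + circumflex); OTFs store CFF outlines instead.
    if "glyf" in font:
        print(glyph_name, "is composite:", font["glyf"][glyph_name].isComposite())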


Whoa, this is a ridiculously great beta for data merge. I have used your apps since the Affinity Photo beta, and I'm still pretty shocked that by now you have completely toppled Adobe. A look at Google Trends shows Photoshop in massive decline and all the Affinity products catching up. Great work!

The Data Merge in this beta is beautiful! I've been wanting this for years. Publisher may be the best piece of practical software I've ever used; I use it to build out materials at scale. Thank you!


@Delden this is just a shot in the dark, but is it possible that your data set contains text in more than one encoding?

For example, UTF-8 encoded strings as well as ISO-8859-1 ones in the same file?

I'm asking because CSV has no metadata for declaring the character encoding. You obviously have to select one when producing the CSV, but there's no way to record that choice in the CSV itself; it has no header or metadata of any kind. Therefore, software reading CSV often has to guess the encoding that was used (charset sniffing).

Now, if Affinity Publisher does this (and it almost has to, since it currently doesn't even let you specify the encoding explicitly on import), such a heuristic could be thrown way off by mixed encodings in the same file. That could explain why your full data set shows the problem but your extract doesn't (for me, at least, it works fine here). I would quickly check whether you can still reproduce the issue with your own extract; if not, that could indicate that some other data in your full data set is confusing Publisher's character-set detection.

(For what it's worth, the ô in your extract is correctly encoded as a UTF-8 multi-byte sequence (0xC3 0xB4). Your merged result looks exactly like UTF-8 accidentally decoded as ISO-8859-1 (Latin-1).)
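
The effect is easy to reproduce with two lines of Python (a minimal sketch, purely for illustration):

    # Encode a string as UTF-8, then decode the bytes as ISO-8859-1:
    # this is exactly the mojibake seen in the merged result.
    s = "Côte d'Ivoire"
    print(s.encode("utf-8").decode("iso-8859-1"))  # CÃ´te d'Ivoire

Both bytes of the two-byte UTF-8 sequence for ô (0xC3 0xB4) get interpreted as separate Latin-1 characters, which is why every accented character turns into a two-character combination.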


1 hour ago, lukasg said:

(For what it's worth, the ô in your extract is correctly encoded as a UTF-8 multi-byte sequence (0xC3 0xB4). Your merged result looks exactly like UTF-8 accidentally decoded as ISO-8859-1 (Latin-1).)

So if it is correctly encoded, then the import program is somehow misreading or mishandling the characters. Yes?


18 minutes ago, LibreTraining said:

So if it is correctly encoded, then the import program is somehow misreading or mishandling the characters. Yes?

That character is correctly encoded. Other characters earlier in the file may not be, and may be confusing the decoding process.

-- Walt
Designer, Photo, and Publisher V1 and V2 at latest retail and beta releases
PC:
    Desktop:  Windows 11 Pro, version 23H2, 64GB memory, AMD Ryzen 9 5900 12-Core @ 3.00 GHz, NVIDIA GeForce RTX 3090 

    Laptop:  Windows 11 Pro, version 23H2, 32GB memory, Intel Core i7-10750H @ 2.60GHz, Intel UHD Graphics Comet Lake GT2 and NVIDIA GeForce RTX 3070 Laptop GPU.
iPad:  iPad Pro M1, 12.9": iPadOS 17.4.1, Apple Pencil 2, Magic Keyboard 
Mac:  2023 M2 MacBook Air 15", 16GB memory, macOS Sonoma 14.4.1


Exactly.

And those characters earlier in the file might not have been enough to bias the other software (which apparently reads the file properly) towards "this looks like ISO-8859-1", perhaps because the majority of characters are indeed UTF-8 encoded. Or some of those programs simply default to UTF-8, which is not an unreasonable assumption these days.

And if you're really dealing with mixed encodings in the same file, there is no right or wrong in terms of how encoding detection is implemented. Some detectors use sophisticated strategies such as lookup tables of character frequencies in different languages; others are happy with the first encoding that more or less works and doesn't produce unprintable characters.

The theory about mixed encodings in the data is still just speculation on my part, but I've seen stranger things in real-world data.
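
As an illustration of library-based sniffing, here is the third-party chardet module in Python (just an example detector; there is no suggestion Publisher uses it):

    import chardet  # third-party: pip install chardet

    # Detectors like this guess from byte statistics, so short or
    # mostly-ASCII samples are easy to misclassify.
    data = ("Côte d'Ivoire, " * 200).encode("utf-8")
    result = chardet.detect(data)
    print(result["encoding"], result["confidence"])  # e.g. utf-8 0.99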


1 hour ago, Lukas G. said:

The theory about mixed encodings in the data is still just speculation on my part, but I've seen stranger things in real-world data.

But it makes sense.
As soon as you posted it, I thought of cleaning up old "code page" documents that had been imported into a Unicode document.
When doing that clean-up, you soon start to recognize the repeated patterns of weird characters and what they should be, and then do a search-and-replace for each of them.
So if this is a pattern you recognize, it makes perfect sense.


7 months later...

Was there ever a solution to this? I have exactly the same problem: French accented characters showing up as weird character combinations. It's not a font issue, because I've tried more than 50 fonts (OTF and TTF) and it happens with all of them.

My CSV file is generated by Numbers.app as UTF-8.

Example: 

1 février comes out as 1 fÃ©vrier (the Unicode in Publisher is U+0031U+0020U+0066U+00C3U+00A9U+0076U+0072U+0069U+0065U+0072, if that helps?)


1 hour ago, Mooxo said:

My CSV file is generated by Numbers.app as UTF-8.

Can you share the .csv file, or another file showing the problem, with us?

-- Walt


Here you go. In further testing, it's something in the last column (the pathname) that's causing it. If I remove that column, it works fine.

Everything else in the merge works perfectly; only the accented characters are a problem.

It's not hard to work around with a quick find-and-replace in the merged document. Odd, though.

test-merge-screenshot.png

test.csv


The é is being read as the two separate characters U+00C3 and U+00A9 instead of the single code point U+00E9. Why? I don't know.

The second problem is more complex in that you have your pictures in iCloud. I removed the "/com~apple~CloudDocs/1. test" portion from the lines in the last column and everything worked. I guess it is the ~ or the dot.

test bruce.csv



20 minutes ago, Old Bruce said:

The é is being read as the two separate characters U+00C3 and U+00A9 instead of the single code point U+00E9. Why? I don't know.

Because, from experimentation, Publisher uses the first 4096 bytes of the file to determine the encoding. If nothing in the first 4096 bytes requires UTF-8, Publisher assumes the file uses ANSI encoding rather than UTF-8 and interprets the rest of the file in ANSI mode.

The first é, which is the first character requiring UTF-8 encoding, is at position 4523 in that file. Therefore Publisher decided the file was ANSI-encoded.

Edit: Any change that shortens the file and brings that first é into the first 4096 bytes will resolve the problem. So would adding a UTF-8 character somewhere earlier, such as in the header line at the top of the file.
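
If that is what's happening, the behaviour can be modelled in a few lines (a sketch based on my own assumptions, not Publisher's actual code):

    def sniff_encoding(path, sample_size=4096):
        """Guess a file's encoding from its first sample_size bytes."""
        with open(path, "rb") as f:
            sample = f.read(sample_size)
        # A pure-ASCII sample decodes identically in UTF-8 and Latin-1,
        # so there is no evidence for UTF-8 and the guess falls back to ANSI.
        if all(b < 0x80 for b in sample):
            return "iso-8859-1"
        try:
            # A real sniffer would also tolerate a multi-byte sequence
            # cut off at the sample boundary.
            sample.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "iso-8859-1"

With test.csv, the first 4096 bytes are pure ASCII, so this guesses ISO-8859-1; shorten the file or add an é near the top and it guesses UTF-8.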

-- Walt


@walt.farrell that makes a lot of sense.
I've also had a look at the test data and inspected it with a quick Python script. It is a valid UTF-8 encoded CSV file. The only non-ASCII characters that occur are é and û, and both are properly UTF-8 encoded, everywhere.

So my hypothesis from above, that mixed encodings in the same file throw off Publisher's character-set sniffing, was wrong. But a sniffing heuristic that only looks at the first 4 KB of data sounds very plausible.

In my opinion that is clearly a bug, at least as long as there is no option for the user to simply select the encoding to be used, removing the need for character-set sniffing altogether.
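
The check amounts to something like this (a minimal sketch of that kind of script, not the exact one; the counts shown in the comment are made up):

    from collections import Counter

    # Decode the whole file as UTF-8 (raises UnicodeDecodeError if any
    # byte sequence is invalid), then tally every non-ASCII character.
    with open("test.csv", "rb") as f:
        text = f.read().decode("utf-8")

    print(Counter(ch for ch in text if ord(ch) > 0x7F))
    # e.g. Counter({'é': 42, 'û': 3})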


7 minutes ago, walt.farrell said:

Therefore Publisher decided the file was ANSI-encoded.

Well that is just stupid and ignorant.

Definitely needs to be fixed.



24 minutes ago, walt.farrell said:

Because, from experimentation, Publisher uses the first 4096 bytes of the file to determine the encoding. If nothing in the first 4096 bytes requires UTF-8, Publisher assumes the file uses ANSI encoding rather than UTF-8 and interprets the rest of the file in ANSI mode.

The first é, which is the first character requiring UTF-8 encoding, is at position 4523 in that file. Therefore Publisher decided the file was ANSI-encoded.

Edit: Any change that shortens the file and brings that first é into the first 4096 bytes will resolve the problem. So would adding a UTF-8 character somewhere earlier, such as in the header line at the top of the file.

Okay, that makes sense. Putting an accented character in the header line is an easy enough workaround, and a better solution than relying on find-and-replace after the merge. Thanks!


@Mooxo a different workaround could be to save your CSV from Numbers.app with Latin-1 (ISO-8859-1) encoding instead of UTF-8. That seems to be the fallback encoding Publisher uses, so this should work as well, and it would not even require adding an accented character at the top.

Edit: "Western (ISO Latin 1)" is what Numbers.app calls ISO-8859-1 I believe.

