
[1.9.0.902] Merge - Character encoding problem?



Hello!

I am working on reports and merging data from a UTF-8 CSV file. As you can see in the screenshot, the data contains the string "Côte d'Ivoire", but the result in Publisher is incorrect.

I tried the following, without success:

  • Writing the string manually: it works perfectly well with the same font (Fira Sans), but then of course it is no longer a merge.
  • Opening the data in other software: Numbers, Excel, EasyCSV Editor, and CotEditor all display the string correctly.
  • Double-checking the data for invalid characters.
  • Changing the encoding (UTF-16, ISO-8859-1, etc.).

Does anyone have the same issue? Is there a workaround? Is it a bug?

Also, while we're on the subject of the merge operation: how can I export each record as a separate file instead of one huge file?

I cannot provide the CSV file as it contains sensitive information. But I can follow instructions to provide more details.

Thanks a lot!

Publisher bug.jpg


26 minutes ago, Delden said:

I cannot provide the CSV file as it contains sensitive information.

It is understandable that you cannot provide such a file, yet providing one is of great value in solving the problem. In such a case, you would make a copy of the file and strip it down to the minimum needed to illustrate the bug. For example, if this is a list of addresses, you would keep just a small handful of rows and change every value except the ones needed to reproduce the problem. In this case, that would mean leaving the "Côte d'Ivoire" portion as is and making up values for the other fields.


Looks fine here on macOS 10.14.6.

Perhaps it is a font-specific issue, although it works with all the fonts I scrolled through.

Mac Pro (Late 2013) Mac OS 12.7.4 
Affinity Designer 2.4.1 | Affinity Photo 2.4.1 | Affinity Publisher 2.4.1 | Beta versions as they appear.

I have never mastered color management, period, so I cannot help with that.


Thanks Bruce!

I didn't mention that I am currently on macOS 11.1 Big Sur.

The font used is Fira Sans, and I've already made a number of multilingual documents with it without any problems.

The problem appears only during the merge preview/process (while importing the data from the CSV file), not when the text is entered manually.


Are you using Fira Sans in the program which made the CSV file? Could be different fonts having different glyphs for the same codes.



The CSV file is built using R and RStudio (data-science stuff) on raw data; it's purely computational. I have also produced thousands of plots, exported as PNG files, and they all display the strings correctly.

Other applications read the CSV file without any problem.


Could you toggle the Unicode for the mess that is the merged result? Select the word Côte and hit Control + U, or use Text > Toggle Unicode. It should give U+0043U+00F4U+0074U+0065; it is the replacement for U+00F4 that I am interested in seeing.
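
For reference, the same notation can be produced outside Publisher with a couple of lines of Python (purely for illustration, nothing Publisher-specific):

    # Print the code points of a string in the same U+XXXX notation
    # that Publisher's Toggle Unicode display uses.
    word = "Côte"
    print("".join(f"U+{ord(ch):04X}" for ch in word))  # U+0043U+00F4U+0074U+0065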



@Delden

Which format of Fira Sans are you using? OTF or TTF?

The reason I ask is that they store compound glyphs, such as the ô in Côte, differently, and it looks like the glyph was decomposed (which is essentially what Old Bruce is getting at by looking at the resulting Unicode codes). The import program may be having an issue with compound glyphs, as some printers do.

Can you try the other version of the fonts? If you are using the TTFs, try the OTFs, and so on.
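
If you want to check how the glyph is actually stored, the fontTools Python library can report it (a sketch; the font file name here is hypothetical):

    from fontTools.ttLib import TTFont  # third-party: pip install fonttools

    font = TTFont("FiraSans-Regular.ttf")  # hypothetical local file name
    glyph_name = font.getBestCmap()[ord("ô")]
    # TTFs keep outlines in the 'glyf' table, where a glyph may be a
    # composite (o + circumflex); OTFs store CFF outlines instead.
    if "glyf" in font:
        print(glyph_name, "is composite:", font["glyf"][glyph_name].isComposite())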


Whoa, this is a ridiculously great beta for data merge. I have used your apps since the Affinity Photo beta, and I'm still pretty shocked that by now you have completely toppled Adobe. A look at Google Trends shows Photoshop in massive decline and all the Affinity products catching up. Great work!

The Data Merge in this beta is beautiful! I've been wanting this for years. Publisher may be the best piece of practical software I've ever used; I use it to build out materials at scale. Thank you!


@Delden this is just a shot in the dark, but is it possible that your data set contains text in more than one encoding?

For example, UTF-8 encoded strings as well as ISO-8859-1 ones in the same file?

I'm asking because CSV has no metadata for declaring the character encoding. You obviously have to select one when producing the CSV, but there's no way to record that choice in the CSV itself; it has no header or metadata of any kind. Therefore, software reading CSV often has to guess the encoding that was used (charset sniffing).

Now, if Affinity Publisher does this (and it almost has to, since it currently doesn't even let you specify the encoding explicitly on import), such a heuristic could be thrown way off by mixed encodings in the same file. That could explain why your full data set shows the problem but your extract doesn't (for me, at least, it works fine here). I would quickly check whether you can still reproduce the issue with your own extract; if not, that could indicate that some other data in your full data set is confusing Publisher's character-set detection.

(For what it's worth, the ô in your extract is correctly encoded as a UTF-8 multi-byte sequence (0xC3 0xB4). Your merged result looks exactly like UTF-8 accidentally decoded as ISO-8859-1 (Latin-1).)
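
The effect is easy to reproduce with two lines of Python (a minimal sketch, purely for illustration):

    # Encode a string as UTF-8, then decode the bytes as ISO-8859-1:
    # this is exactly the mojibake seen in the merged result.
    s = "Côte d'Ivoire"
    print(s.encode("utf-8").decode("iso-8859-1"))  # CÃ´te d'Ivoire

Both bytes of the two-byte UTF-8 sequence for ô (0xC3 0xB4) get interpreted as separate Latin-1 characters, which is why every accented character turns into a two-character combination.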


1 hour ago, lukasg said:

(For what it's worth, the ô in your extract is correctly encoded as a UTF-8 multi-byte sequence (0xC3 0xB4). Your merged result looks exactly like UTF-8 accidentally decoded as ISO-8859-1 (Latin-1).)

So if it is correctly encoded, then the import program is somehow misreading or mishandling the characters. Yes?


18 minutes ago, LibreTraining said:

So if it is correctly encoded, then the import program is somehow misreading or mishandling the characters. Yes?

That character is correctly encoded. Other characters earlier in the file may not be, and may be confusing the decoding process.

-- Walt
Designer, Photo, and Publisher V1 and V2 at latest retail and beta releases
PC:
    Desktop:  Windows 11 Pro, version 23H2, 64GB memory, AMD Ryzen 9 5900 12-Core @ 3.00 GHz, NVIDIA GeForce RTX 3090 

    Laptop:  Windows 11 Pro, version 23H2, 32GB memory, Intel Core i7-10750H @ 2.60GHz, Intel UHD Graphics Comet Lake GT2 and NVIDIA GeForce RTX 3070 Laptop GPU.
iPad:  iPad Pro M1, 12.9": iPadOS 17.4.1, Apple Pencil 2, Magic Keyboard 
Mac:  2023 M2 MacBook Air 15", 16GB memory, macOS Sonoma 14.4.1


Exactly.

And those characters earlier in the file might not have been enough to bias the other software (which apparently reads the file properly) towards "this looks like ISO-8859-1", perhaps because the majority of characters are indeed UTF-8 encoded. Or some of those programs simply default to UTF-8, which is not an unreasonable assumption these days.

And if you're really dealing with mixed encodings in the same file, there is no right or wrong in terms of how encoding detection is implemented. Some detectors use sophisticated strategies such as lookup tables of character frequencies in different languages; others are happy with the first encoding that more or less works and doesn't produce unprintable characters.

The theory about mixed encodings in the data is still just speculation on my part, but I've seen stranger things in real-world data.
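
As an illustration of library-based sniffing, here is the third-party chardet module in Python (just an example detector; there is no suggestion Publisher uses it):

    import chardet  # third-party: pip install chardet

    # Detectors like this guess from byte statistics, so short or
    # mostly-ASCII samples are easy to misclassify.
    data = ("Côte d'Ivoire, " * 200).encode("utf-8")
    result = chardet.detect(data)
    print(result["encoding"], result["confidence"])  # e.g. utf-8 0.99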


1 hour ago, Lukas G. said:

The theory about mixed encodings in the data is still just speculation on my part, but I've seen stranger things in real-world data.

But it makes sense.
As soon as you posted it, I thought of cleaning up old "code page" documents that had been imported into a Unicode document.
When doing that clean-up, you soon start to recognize the repeated patterns of weird characters and what they should be, and then do a search-and-replace for each of them.
So if this is a pattern you recognize, it makes perfect sense.


7 months later...

Was there ever a solution to this? I have exactly the same problem: French accented characters showing up as weird character combinations. It's not a font issue, because I've tried more than 50 fonts (OTF and TTF) and it happens with all of them.

My CSV file is generated by Numbers.app as UTF-8.

Example: 

1 février comes out as 1 fÃ©vrier (the Unicode in Publisher is U+0031U+0020U+0066U+00C3U+00A9U+0076U+0072U+0069U+0065U+0072, if that helps?)


1 hour ago, Mooxo said:

My CSV file is generated by Numbers.app as UTF-8.

Can you share the .csv file, or another file showing the problem, with us?

-- Walt


Here you go. In further testing, it's something in the last column (the pathname) that's causing it. If I remove that column, it works fine.

Everything else in the merge works perfectly; only the accented characters are a problem.

It's not hard to work around with a quick find-and-replace in the merged document. Odd, though.

test-merge-screenshot.png

test.csv


The é is being read as the two separate characters U+00C3 and U+00A9 instead of the single code point U+00E9. Why? I don't know.

The second problem is more complex in that you have your pictures in iCloud. I removed the "/com~apple~CloudDocs/1. test" portion from the lines in the last column and everything worked. I guess it is the ~ or the dot.

test bruce.csv



20 minutes ago, Old Bruce said:

The é is being read as the two separate characters U+00C3 and U+00A9 instead of the single code point U+00E9. Why? I don't know.

Because, from experimentation, Publisher uses the first 4096 bytes of the file to determine the encoding. If nothing in the first 4096 bytes requires UTF-8, Publisher assumes the file uses ANSI encoding rather than UTF-8 and interprets the rest of the file in ANSI mode.

The first é, which is the first character requiring UTF-8 encoding, is at position 4523 in that file. Therefore Publisher decided the file was ANSI-encoded.

Edit: Any change that shortens the file and brings that first é into the first 4096 bytes will resolve the problem. So would adding a UTF-8 character somewhere earlier, such as in the header line at the top of the file.
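
If that is what's happening, the behaviour can be modelled in a few lines (a sketch based on my own assumptions, not Publisher's actual code):

    def sniff_encoding(path, sample_size=4096):
        """Guess a file's encoding from its first sample_size bytes."""
        with open(path, "rb") as f:
            sample = f.read(sample_size)
        # A pure-ASCII sample decodes identically in UTF-8 and Latin-1,
        # so there is no evidence for UTF-8 and the guess falls back to ANSI.
        if all(b < 0x80 for b in sample):
            return "iso-8859-1"
        try:
            # A real sniffer would also tolerate a multi-byte sequence
            # cut off at the sample boundary.
            sample.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "iso-8859-1"

With test.csv, the first 4096 bytes are pure ASCII, so this guesses ISO-8859-1; shorten the file or add an é near the top and it guesses UTF-8.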

-- Walt


@walt.farrell that makes a lot of sense.
I've also had a look at the test data and inspected it with a quick Python script. It is a valid UTF-8 encoded CSV file. The only non-ASCII characters that occur are é and û, and both are properly UTF-8 encoded, everywhere.

So my hypothesis from above, that mixed encodings in the same file throw off Publisher's character-set sniffing, was wrong. But a sniffing heuristic that only looks at the first 4 KB of data sounds very plausible.

In my opinion that is clearly a bug, at least as long as there is no option for the user to simply select the encoding to be used, removing the need for character-set sniffing altogether.
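
The check amounts to something like this (a minimal sketch of that kind of script, not the exact one; the counts shown in the comment are made up):

    from collections import Counter

    # Decode the whole file as UTF-8 (raises UnicodeDecodeError if any
    # byte sequence is invalid), then tally every non-ASCII character.
    with open("test.csv", "rb") as f:
        text = f.read().decode("utf-8")

    print(Counter(ch for ch in text if ord(ch) > 0x7F))
    # e.g. Counter({'é': 42, 'û': 3})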


7 minutes ago, walt.farrell said:

Therefore Publisher decided the file was ANSI-encoded.

Well that is just stupid and ignorant.

Definitely needs to be fixed.



24 minutes ago, walt.farrell said:

Because, from experimentation, Publisher uses the first 4096 bytes of the file to determine the encoding. If nothing in the first 4096 bytes requires UTF-8, Publisher assumes the file uses ANSI encoding rather than UTF-8 and interprets the rest of the file in ANSI mode.

The first é, which is the first character requiring UTF-8 encoding, is at position 4523 in that file. Therefore Publisher decided the file was ANSI-encoded.

Edit: Any change that shortens the file and brings that first é into the first 4096 bytes will resolve the problem. So would adding a UTF-8 character somewhere earlier, such as in the header line at the top of the file.

Okay, that makes sense. Putting an accented character in the header line is an easy enough workaround, and a better solution than relying on find-and-replace after the merge. Thanks!


@Mooxo a different workaround could be to save your CSV from Numbers.app with Latin-1 (ISO-8859-1) encoding instead of UTF-8. That seems to be the fallback encoding Publisher uses, so this should work as well, and it would not even require adding an accented character at the top.

Edit: "Western (ISO Latin 1)" is what Numbers.app calls ISO-8859-1 I believe.

