Affinity Publisher - meaning of each special character indicator

w_yne_t_ylor · April 15, 2021

Hello! I keep Text > Show special characters enabled. Most of it makes sense and is extremely useful, for example carriage returns, tabs, spaces etc.

There have been a couple, over time, which I have not recognised. I can't seem to find a reference as to what they all mean? For example, there is a character at the start of the following text frame which is showing as an up/down arrow (it's not a frame decoration!) and I have no clue what it is?

lacerto · April 15, 2021

(...)

w_yne_t_ylor · April 15, 2021

Ooh that's interesting, and it worked! It converted the character into

U+FEFF

Which I can now look up! Thanks so much for that tip, awesome!

It would still be good to have a simple lookup table for these from Serif though rather than having to convert them :0)
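Until such a table exists, the lookup step can be done locally. The following is only a sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration and is not anything Affinity provides.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    # Some characters (e.g. control codes like tab) have no name,
    # so a fallback string is supplied instead of raising ValueError.
    name = unicodedata.name(ch, "<no name>")
    return f"U+{ord(ch):04X} {name}"

# The character from this thread, plus two related ones for comparison.
for ch in ["\ufeff", "\u00a0", "\u2060"]:
    print(describe(ch))
```

Pasting the mystery character between the quotes (or using its `\uXXXX` escape, as here) gives the name to search for.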

lacerto · April 15, 2021

(...)

w_yne_t_ylor · April 15, 2021

It seems to be a character used to indicate (to software) how to treat the text.

It also seems to be described as a zero-width non-breaking space. I'm not sure what value that has as it seems to be a contradiction by definition, haha.

What's really useful is that Affinity shows us it's there. In most cases that's enough. If we were to have some sort of lookup in the documentation they could use embedded images I guess.

lacerto · April 15, 2021

(...)

walt.farrell · April 15, 2021

Interestingly, though, a "zero-width non-breaking space" is not one of the spaces that Affinity allows you to insert. Edit: at least not by name via the menus. You can type a regular space, then use Alt+U to enable editing it to U+FEFF.

23 minutes ago, w_yne_t_ylor said:

It also seems to be described as a zero-width non-breaking space. I'm not sure what value that has as it seems to be a contradiction by definition, haha.

It separates two words (or two strings of any text, really) but does not allow them to split over lines.

I'm not sure it has much use, but there are other forms of non-breaking space, and other widths of breaking spaces, so someone must have a good use case for them.

kenmcd · April 15, 2021

3 hours ago, w_yne_t_ylor said:

There have been a couple, over time, which I have not recognised. I can't seem to find a reference as to what they all mean? For example, there is a character at the start of the following text frame which is showing as an up/down arrow (it's not a frame decoration!) and I have no clue what it is?

Based on its location at the beginning of the text, and assuming you did not enter it, it may be a BOM (byte order mark).
My guess is you placed this text from another source, not entered directly.

BOM has the same code as the old zero width non-breaking space (FEFF).
In some applications a BOM character is placed at the beginning of text to signal certain things.
So because it is at the beginning of your text my guess is you are bringing it in with the text.
Where is the text coming from?

Note: zero width non-breaking space is deprecated in Unicode; word joiner is now preferred.
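If the text can be pre-processed before it is placed, something along these lines might tidy it up. This is a sketch only: the function name `clean_feff` is hypothetical, and treating every interior U+FEFF as an intended word joiner (U+2060) is an assumption that may not suit every document.

```python
BOM = "\ufeff"          # ZERO WIDTH NO-BREAK SPACE / byte order mark
WORD_JOINER = "\u2060"  # the preferred replacement for the joining role

def clean_feff(text: str) -> str:
    """Drop a leading BOM and swap any remaining U+FEFF for a word joiner."""
    if text.startswith(BOM):
        # At the very start it is almost certainly a byte order mark,
        # not meaningful content, so remove it.
        text = text[len(BOM):]
    # Assumption: any U+FEFF left inside the text was meant as a joiner.
    return text.replace(BOM, WORD_JOINER)

print(repr(clean_feff("\ufeffHello\ufeffWorld")))
```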

w_yne_t_ylor · April 15, 2021

13 minutes ago, LibreTraining said:

In some applications a BOM character is placed at the beginning of text to signal certain things.

Yes, this is what I thought. Like some form of inline meta data.

13 minutes ago, LibreTraining said:

Where is the text coming from?

I inherited this (large) file as an IDML exported from, *cough*, InDesign. So I can't really say. But you assume correctly that it is not original, typed, content.

37 minutes ago, walt.farrell said:

so someone must have a good use case for them

Absolutely!

walt.farrell · April 15, 2021

42 minutes ago, LibreTraining said:

Based on its location at the beginning of the text, and assuming you did not enter it, it may be a BOM (byte order mark).
My guess is you placed this text from another source, not entered directly.

BOM has the same code as the old zero width non-breaking space (FEFF).
In some applications a BOM character is placed at the beginning of text to signal certain things.

Thanks. That makes sense.

One common use for the BOM, for those who may not know, is to mark a file as being UTF-8 rather than Latin-1 (ISO-8859-1).
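For anyone curious what that looks like in practice, here is a small Python sketch. The codec names `utf-8` and `utf-8-sig` and the constant `codecs.BOM_UTF8` are standard library; the sample text is just an illustration.

```python
import codecs

text = "café"

# "utf-8-sig" writes the three-byte UTF-8 form of U+FEFF first.
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")
print(with_bom == codecs.BOM_UTF8 + without_bom)

# Decoding as plain UTF-8 keeps the BOM as a U+FEFF character,
# which is how it can end up visible at the start of a text frame.
kept = with_bom.decode("utf-8")
print(repr(kept))

# Decoding as "utf-8-sig" removes a leading BOM if one is present,
# and also accepts input that has none.
stripped = with_bom.decode("utf-8-sig")
print(repr(stripped))
```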

lacerto · April 15, 2021

(...)

w_yne_t_ylor · April 15, 2021

That's the fella.

sfriedberg · April 16, 2021

On 4/15/2021 at 9:52 AM, walt.farrell said:

One common use for the BOM, for those who may not know, is to mark a file as being UTF-8 rather than Latin-1 (ISO-8859-1).

Well, it's really useful to distinguish UTF-16/UCS-2, which are streams of 16-bit/2-byte code units, from streams of 8-bit/1-byte units such as UTF-8 and ASCII or any Windows code page. Additionally, it tells the reading software what the byte order of the source material is. If the first two bytes, read individually, are 0xFF 0xFE, the data is littleendian; if they are 0xFE 0xFF, it is bigendian. When that differs from the reader's own native order, the software knows it has to byte-swap every 16-bit/2-byte word to get the proper Unicode encoding; when it matches, the software doesn't have to do that.

If you have only used Intel-compatible processors, this is probably not familiar to you. Look up "bigendian" and "littleendian" and ignore references to Jonathan Swift and Gulliver's Travels. Software that's producing UTF-16/UCS-2 output could run on either a bigendian or a littleendian platform, and will naturally want to write 16-bit/2-byte quantities in the native byte ordering for that platform. To allow the files to be read correctly on either type of platform, the writing software will start the file with a 2-byte BOM.

If the software is producing UTF-8 output, littleendian or bigendian doesn't matter. And there's not really any need for the BOM. Character encoding is usually identified through some other means for single-byte streams.
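The byte-order check described above can be sketched roughly as follows. The function names `utf16_byte_order` and `decode_utf16` are invented for this example; `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` are standard Python constants. It deliberately handles only UTF-16 and does not try to recognise other encodings.

```python
import codecs
from typing import Optional

def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return the codec name implied by a leading UTF-16 BOM, if any."""
    # Caveat: a UTF-32 little-endian BOM (FF FE 00 00) also starts with
    # FF FE; this sketch ignores UTF-32 entirely.
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    return None

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using the BOM to choose the byte order."""
    codec = utf16_byte_order(data)
    if codec is None:
        raise ValueError("no UTF-16 BOM found")
    # Skip the 2-byte BOM; the codec does any byte swapping needed.
    return data[2:].decode(codec)

print(decode_utf16(b"\xff\xfeH\x00i\x00"))   # little-endian input
print(decode_utf16(b"\xfe\xff\x00H\x00i"))   # big-endian input
```

Both calls should produce the same text, which is the whole point of the mark: the reader never has to know which kind of machine wrote the file.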

walt.farrell · April 16, 2021

2 minutes ago, sfriedberg said:

If the software is producing UTF-8 output, littleendian or bigendian doesn't matter. And there's not really any need for the BOM. Character encoding is usually identified through some other means for single-byte streams.

Without the BOM, a program that can read/understand both Latin-1 and UTF-8 sometimes has to guess what encoding the input file uses. And sometimes the guess will be wrong, and you'll end up with bad characters in the file.

On the other hand, a program that is not expecting a BOM character will end up with a garbage character (or two) at the start of the file.
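Both failure modes are easy to reproduce. A quick Python illustration, using Latin-1 as the wrong guess (the sample strings are arbitrary):

```python
import codecs

# Case 1: UTF-8 without a BOM, wrongly read as Latin-1.
utf8_no_bom = "café".encode("utf-8")
wrong_guess = utf8_no_bom.decode("latin-1")
print(wrong_guess)   # the accented letter turns into two wrong characters

# Case 2: UTF-8 with a BOM, read by software that assumes Latin-1.
utf8_with_bom = codecs.BOM_UTF8 + b"Hello"
garbage = utf8_with_bom.decode("latin-1")
print(garbage)       # the three BOM bytes show up as stray characters
```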

sfriedberg · April 16, 2021

@walt.farrell I guess my point is that the BOM is not adequate to identify the encoding for single byte streams. Latin-1 is not the only alternative to UTF-8. The real value of the Byte Order Mark is to distinguish bigendian from littleendian byte order, which is a critical issue in UTF-16/UCS-2 encoding. That is its primary purpose. Use of BOM in UTF-8 streams is optional and allowed (it is a Unicode code point, after all), but it has only heuristic value in identifying a single byte stream as UTF-8.
