UTF-8 Support (ne’s manual)


3.11 UTF-8 Support

ne can manipulate UTF-8 files and supports UTF-8 when communicating with the user. At startup, ne fetches the system locale description and checks whether it contains the string ‘utf8’ or ‘utf-8’. If it does, ne starts communicating with the user using UTF-8. This behaviour can be modified either with a suitable command line option (see Arguments) or using UTF8IO. UTF-8 input/output makes it possible to display and read from the keyboard a wide range of characters.
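The startup check described above can be sketched as follows. This is only an illustration in Python, not ne's actual code: the function name `locale_implies_utf8` is hypothetical, and matching case-insensitively is an assumption (the manual only says the locale description must contain ‘utf8’ or ‘utf-8’).

```python
def locale_implies_utf8(locale_name: str) -> bool:
    """Return True if a locale description mentions UTF-8.

    Assumption: the comparison ignores case, so that both
    'en_US.UTF-8' and 'it_IT.utf8' are recognized.
    """
    lowered = locale_name.lower()
    return "utf8" in lowered or "utf-8" in lowered
```

With this sketch, `locale_implies_utf8("en_US.UTF-8")` is true, while a locale such as `"en_US.ISO-8859-1"` leaves input/output in 8-bit mode.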

Independently of the input/output encoding, ne keeps track of the encoding of each document. ne does not try to select a particular encoding for a document unless it is forced to do so, for instance because a certain character is inserted. Once a document has a definite encoding, however, it keeps it forever.

More precisely, every document may be in one of three encoding modes: US-ASCII, when it is composed entirely of US-ASCII characters; 8-bit, if it also contains other characters but is not UTF-8 encoded; and finally UTF-8, if it is UTF-8 encoded.
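The three modes can be illustrated by classifying a byte string. The sketch below is an assumption about how such a classification might look, not a description of ne's internal detection; the function name `encoding_mode` is hypothetical, and ne itself only commits a document to an encoding when it is forced to.

```python
def encoding_mode(data: bytes) -> str:
    """Classify bytes as 'US-ASCII', 'UTF-8' or '8-bit'."""
    # Every byte below 0x80 is a US-ASCII character.
    if all(b < 0x80 for b in data):
        return "US-ASCII"
    # Non-ASCII bytes are present: see whether they form valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "8-bit"
    return "UTF-8"
```

For example, `encoding_mode(b"caff\xe8")` yields `"8-bit"`, because the lone byte 0xE8 is not a valid UTF-8 sequence.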

The behaviour of ne in US-ASCII and 8-bit mode is similar to previous versions: each byte in the document is considered a separate character.

There are, however, two important differences. First, if I/O is not UTF-8 encoded, any encoding of the ISO-8859 family will work flawlessly, as ne merely reads bytes from the keyboard and displays bytes on the screen. By contrast, with UTF-8 input/output ne must decide which encoding is used for non-UTF-8 documents, and presently this is hardwired to ISO-8859-1. Second, 8-bit documents use localized casing and character type functions. This means that case-insensitive searches or case foldings will work with, say, Cyrillic characters, provided that your locale is set correctly.
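To see what the hardwired ISO-8859-1 choice implies, the following sketch converts the bytes of an 8-bit document into what a UTF-8 terminal would need to receive. It is an illustration only; the function name `bytes_for_utf8_terminal` is hypothetical and ne's display code is not shown in this manual.

```python
def bytes_for_utf8_terminal(raw: bytes) -> bytes:
    """Interpret raw bytes as ISO-8859-1 and re-encode them as UTF-8."""
    # ISO-8859-1 maps each byte value 0..255 to the code point with
    # the same number, so decoding never fails.
    return raw.decode("iso-8859-1").encode("utf-8")
```

The byte 0xE9 (‘é’ in ISO-8859-1) becomes the two bytes 0xC3 0xA9. Under this interpretation, a byte from a document written in another ISO-8859 encoding would be shown as the ISO-8859-1 character with the same byte value.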

In UTF-8 mode, instead, ne interprets the bytes in the document in a different way: several bytes may encode a single character. The whole process is completely transparent to the user, but if you really want to look at the document content, you can switch to 8-bit mode (see UTF8).
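The difference between the two views of the same content can be shown with a short Python example. The sample string is arbitrary and chosen only for illustration.

```python
# "naïve €" written with escapes so the code points are unambiguous.
text = "na\u00efve \u20ac"
encoded = text.encode("utf-8")

# Character view: one unit per character.
char_count = len(text)

# Byte view: 'ï' takes two bytes and '€' takes three.
byte_count = len(encoded)
```

Here the text has 7 characters but occupies 10 bytes; looking at the document byte by byte would show each of those 10 bytes as a separate character.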

For most operations, UTF-8 support should be transparent. However, in some cases, in particular when mixing documents with different encodings, ne will refuse to perform certain operations because of incompatible encodings.

The main limitation of UTF-8 documents is that when searching for a regular expression in a UTF-8 text, character sets may only contain US-ASCII characters (see Regular Expressions). You can, of course, partially emulate a full UTF-8 character set implementation by specifying the possible alternatives using ‘|’ (but you have no ranges).
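The alternation workaround can be illustrated by building the pattern mechanically. The sketch uses Python's `re` module purely to demonstrate the idea; it is not ne's regular expression engine, whose syntax is described in the Regular Expressions section, and the helper name `alternation_for` is hypothetical.

```python
import re

def alternation_for(chars: str) -> str:
    """Build an alternation that matches any one of the given characters."""
    # Each character becomes one alternative; re.escape protects
    # characters that are special in a pattern.
    return "(" + "|".join(re.escape(c) for c in chars) + ")"

# Stands in for a character set containing the five accented vowels.
pattern = alternation_for("àèìòù")
```

Every character must be listed explicitly: unlike a character set, an alternation has no way to express a range.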