

3.11 UTF-8 Support

Since version 1.30, ne can manipulate UTF-8 files and supports UTF-8 when communicating with the user. At startup, ne fetches the system locale description and checks whether it contains the string ‘utf8’ or ‘utf-8’. If it does, ne communicates with the user using UTF-8, which makes it possible to display and read from the keyboard a wide range of characters. This behaviour can be modified either using a suitable command line option (see Arguments), or using UTF8IO.
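The startup check described above can be sketched as follows. This is only an illustration, not ne's actual code: the function names are hypothetical, and both the case-insensitive match and the order in which the environment variables are consulted (LC_ALL, then LC_CTYPE, then LANG) are assumptions.

```python
import os


def locale_requests_utf8(description: str) -> bool:
    """Return True if a locale description mentions UTF-8."""
    # Assumption: the match is case-insensitive, so 'en_US.UTF-8' qualifies.
    lowered = description.lower()
    return "utf8" in lowered or "utf-8" in lowered


def current_locale_description() -> str:
    """Return the first non-empty locale variable, or an empty string."""
    # Assumption: the usual POSIX precedence LC_ALL > LC_CTYPE > LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = os.environ.get(name)
        if value:
            return value
    return ""


if __name__ == "__main__":
    print(locale_requests_utf8(current_locale_description()))
```

With such a check, a locale like ‘it_IT.utf8’ would select UTF-8 input/output, while ‘en_US.ISO-8859-1’ would not.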

Independently of the input/output encoding, ne keeps track of the encoding of each buffer. ne does not try to select a particular encoding for a buffer unless it is forced to do so, for instance because a certain character is inserted. Once a buffer has a definite encoding, however, it keeps it forever.

More precisely, every buffer may be in one of three encoding modes: US-ASCII, when it is entirely composed of US-ASCII characters; 8-bit, if it also contains other characters but is not UTF-8 encoded; and finally UTF-8, if it is UTF-8 encoded.
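The three modes can be illustrated by classifying a sequence of bytes. The sketch below is an assumption about how such a classification could be done, with a hypothetical function name. It is not a description of how ne decides: as noted above, ne commits to an encoding only when it is forced to.

```python
def classify_buffer(data: bytes) -> str:
    """Classify bytes as 'US-ASCII', 'UTF-8' or '8-bit'."""
    # Every byte below 0x80 means pure US-ASCII (an empty buffer qualifies).
    if all(byte < 0x80 for byte in data):
        return "US-ASCII"
    # Otherwise the content is UTF-8 only if it decodes without errors.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "8-bit"
    return "UTF-8"
```

For instance, the bytes of ‘hello’ are US-ASCII, the UTF-8 encoding of ‘caffè’ is classified as UTF-8, and the same word encoded in ISO-8859-1 (ending with the single byte 0xE8) is classified as 8-bit.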

The behaviour of ne in US-ASCII and 8-bit mode is similar to previous versions: each byte in the buffer is considered a separate character.

There are, however, two important differences. First, if I/O is not UTF-8 encoded, any encoding of the ISO-8859 family will work flawlessly, as ne merely reads bytes from the keyboard and displays bytes on the screen. By contrast, in the case of UTF-8 input/output ne must decide which encoding is used for non-UTF-8 buffers, and presently this is hardwired to ISO-8859-1. Second, since version 1.34, 8-bit buffers use localized casing and character type functions. This means that case-insensitive searches or case foldings will work with, say, Cyrillic characters, provided that your locale is set correctly.
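The effect of the hardwired ISO-8859-1 choice can be shown with a small sketch. It assumes that displaying an 8-bit buffer on a UTF-8 terminal amounts to reading each byte as an ISO-8859-1 character and writing that character in UTF-8; the function name is hypothetical and the code is not taken from ne.

```python
def bytes_for_utf8_terminal(raw: bytes) -> bytes:
    """Re-encode 8-bit buffer content for a UTF-8 terminal."""
    # ISO-8859-1 maps every byte 0x00..0xFF to the code point of equal value.
    text = raw.decode("iso-8859-1")
    return text.encode("utf-8")


# The byte 0xD6 is the Cyrillic letter 'ж' in ISO-8859-5, but under the
# ISO-8859-1 assumption it is shown as 'Ö'.
shown = bytes_for_utf8_terminal(b"\xd6").decode("utf-8")
intended = b"\xd6".decode("iso-8859-5")
```

Under this assumption, a buffer written in another member of the ISO-8859 family is still stored byte for byte, but on a UTF-8 terminal it is displayed as if it were ISO-8859-1.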

In UTF-8 mode, instead, ne interprets the bytes in the buffer in a different way: several bytes may encode a single character. The whole process is completely transparent to the user, but if you really want to look at the buffer content, you can switch to 8-bit mode (see UTF8).
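The difference between the two interpretations can be seen by counting characters in the same bytes. This is a minimal sketch with a hypothetical helper; the mode names are only labels for the illustration.

```python
def count_characters(data: bytes, mode: str) -> int:
    """Count characters in data under the given interpretation."""
    if mode == "UTF-8":
        # Several bytes may encode a single character.
        return len(data.decode("utf-8"))
    # In US-ASCII and 8-bit mode each byte is a separate character.
    return len(data)


encoded = "è".encode("utf-8")               # two bytes: 0xC3 0xA8
as_utf8 = count_characters(encoded, "UTF-8")
as_8bit = count_characters(encoded, "8-bit")
raw_view = encoded.decode("iso-8859-1")     # the two bytes read as ISO-8859-1
```

Here the single character ‘è’ counts as one character in UTF-8 mode and as two in 8-bit mode, where it would appear as the two ISO-8859-1 characters ‘Ã¨’.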

For most operations, UTF-8 support should be transparent. However, in some cases, in particular when mixing buffers with different encodings, ne will refuse to perform certain operations because of incompatible encodings.

The main limitation of UTF-8 buffers is that when searching for a regular expression in a UTF-8 text, character sets may only contain US-ASCII characters (see Regular Expressions). You can, of course, partially emulate a full UTF-8 character set implementation by specifying the possible alternatives using ‘|’ (but you have no ranges).
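The alternation workaround can be sketched by building the pattern mechanically. The example uses Python's re module only to demonstrate the idea; the helper name is hypothetical, and the details of ne's own regular expression syntax may differ.

```python
import re


def alternation(characters: str) -> str:
    """Build '(a|b|c)' as a stand-in for the character set '[abc]'."""
    return "(" + "|".join(re.escape(char) for char in characters) + ")"


# Stand-in for a set of grave-accented vowels.
pattern = alternation("àèìòù")
matches = re.findall(pattern, "perché è così")
```

The resulting pattern is ‘(à|è|ì|ò|ù)’. Every member has to be listed explicitly, since there is no equivalent of a range such as ‘[à-ù]’.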

