ne can manipulate UTF-8 files and supports
UTF-8 when communicating with the user. At startup,
ne examines the system locale description and checks whether it contains
the string ‘utf8’ or ‘utf-8’. If it does, it starts communicating with
the user using UTF-8. This behaviour can be modified either using a
suitable command line option (see Arguments), or using
UTF8IO. This makes it possible to display and read from the
keyboard a wide range of characters.
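The locale test described above can be sketched as a simple substring check; `locale_is_utf8` is a hypothetical helper for illustration, not ne's actual code:

```python
def locale_is_utf8(loc: str) -> bool:
    """ne-style check: does the locale description mention UTF-8?

    The comparison is case-insensitive, so 'en_US.UTF-8' and
    'it_IT.utf8' both qualify.
    """
    loc = loc.lower()
    return "utf8" in loc or "utf-8" in loc

print(locale_is_utf8("en_US.UTF-8"))  # True
print(locale_is_utf8("C"))            # False
```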
Independently of the input/output encoding,
ne keeps track of the
encoding of each document.
ne does not try to select a particular
encoding for a document unless it is forced to do so, for instance because a
certain character is inserted. Once a document has a definite encoding,
however, it keeps it forever.
More precisely, every document may be in one of three encoding modes: US-ASCII, when it is entirely composed of US-ASCII characters; 8-bit, if it also contains other characters but is not UTF-8 encoded; and finally UTF-8, if it is UTF-8 encoded.
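This three-way classification can be sketched as follows; this is a minimal illustration of the idea, not ne's actual detection logic:

```python
def encoding_mode(data: bytes) -> str:
    """Classify a byte buffer into the three encoding modes."""
    # All bytes below 0x80: the document is pure US-ASCII.
    if all(b < 0x80 for b in data):
        return "US-ASCII"
    # Non-ASCII bytes present: check whether they form valid UTF-8.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "8-bit"

print(encoding_mode(b"hello"))                  # US-ASCII
print(encoding_mode("héllo".encode("utf-8")))   # UTF-8
print(encoding_mode(b"h\xe9llo"))               # 8-bit (0xE9 is not valid UTF-8 here)
```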
The behaviour of
ne in US-ASCII and 8-bit mode is similar to
previous versions: each byte in the document is considered a separate
character. There are, however, two important differences: first, if I/O is not
UTF-8 encoded, any encoding of the ISO-8859 family will work, as
ne merely reads bytes from the keyboard and
displays bytes on the screen. In the case of UTF-8 I/O, on the contrary,
ne must take a decision as to which encoding is used
for non-UTF-8 documents, and presently this is hardwired to ISO-8859-1.
Second, 8-bit documents use localized casing and
character type functions. This means that case-insensitive searches or
case foldings will work with, say, Cyrillic characters, provided that
your locale is set correctly.
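Both differences can be illustrated briefly. Note that Python uses Unicode tables for case folding, whereas ne relies on the C library's localized functions; the effect shown is the same:

```python
# First difference: with UTF-8 I/O, the bytes of a non-UTF-8
# document are interpreted as ISO-8859-1 (Latin-1) for display.
raw = b"caf\xe9"                  # 0xE9 is 'é' in ISO-8859-1
print(raw.decode("latin-1"))      # café

# Second difference: case folding that covers non-ASCII letters,
# here applied to Cyrillic text.
print("Привет".casefold() == "привет".casefold())  # True
```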
In UTF-8 mode, instead,
ne interprets the bytes in the document in
a different way—several bytes may encode a single character. The whole
process is completely transparent to the user, but if you really want to
look at the document content, you can switch to 8-bit mode (see UTF8).
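The multi-byte behaviour is easy to see from outside the editor: a single non-ASCII character occupies several bytes in a UTF-8 document, which is exactly what 8-bit mode would expose.

```python
s = "è"                        # one character...
b = s.encode("utf-8")          # ...encoded as two bytes in UTF-8
print(len(s), len(b))          # 1 2
print([hex(x) for x in b])     # ['0xc3', '0xa8']
```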
For most operations, UTF-8 support should be transparent. However, in
some cases, in particular when mixing documents with different encodings,
ne will refuse to perform certain operations because of
encoding incompatibilities.
The main limitation of UTF-8 documents is that when searching for a regular expression in a UTF-8 text, character sets may only contain US-ASCII characters (see Regular Expressions). You can, of course, partially emulate a full UTF-8 character-set implementation by specifying the possible alternatives using ‘|’ (but you have no ranges).
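The alternation workaround can be sketched with Python's `re` module (which, unlike ne's UTF-8 search, does accept non-ASCII character sets; the point here is only the ‘|’ technique):

```python
import re

# A character set like [àá] is ruled out in ne's UTF-8 regex
# search; alternation with '|' expresses the same choice.
pattern = re.compile("c(à|á)ra")
print(bool(pattern.search("cára")))  # True
print(bool(pattern.search("cora")))  # False
```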