ne can manipulate UTF-8 files and supports
UTF-8 when communicating with the user. At startup,
ne examines the system locale description and checks whether it contains
the string ‘utf8’ or ‘utf-8’. If it does, it starts communicating with
the user using UTF-8. This behaviour can be modified either using a
suitable command line option (see Arguments), or using
UTF8IO. This makes it possible to display and read from the
keyboard a wide range of characters.
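The locale test described above can be sketched as a simple substring check; `locale_is_utf8` is a hypothetical helper for illustration, not ne's actual code:

```python
def locale_is_utf8(loc: str) -> bool:
    """ne-style check: does the locale description mention UTF-8?

    The comparison is case-insensitive, so 'en_US.UTF-8' and
    'it_IT.utf8' both qualify.
    """
    loc = loc.lower()
    return "utf8" in loc or "utf-8" in loc

print(locale_is_utf8("en_US.UTF-8"))  # True
print(locale_is_utf8("C"))            # False
```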
Independently of the input/output encoding,
ne keeps track of the
encoding of each document.
ne does not try to select a particular
encoding for a document unless it is forced to do so, for instance because a
certain character is inserted. Once a document has a definite encoding,
however, it keeps it forever.
More precisely, every document may be in one of three encoding modes: US-ASCII, when it is entirely composed of US-ASCII characters; 8-bit, if it also contains other characters but is not UTF-8 encoded; and finally UTF-8, if it is UTF-8 encoded.
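This three-way classification can be sketched as follows; this is a minimal illustration of the idea, not ne's actual detection logic:

```python
def encoding_mode(data: bytes) -> str:
    """Classify a byte buffer into the three encoding modes."""
    # All bytes below 0x80: the document is pure US-ASCII.
    if all(b < 0x80 for b in data):
        return "US-ASCII"
    # Non-ASCII bytes present: check whether they form valid UTF-8.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "8-bit"

print(encoding_mode(b"hello"))                  # US-ASCII
print(encoding_mode("héllo".encode("utf-8")))   # UTF-8
print(encoding_mode(b"h\xe9llo"))               # 8-bit (0xE9 is not valid UTF-8 here)
```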
The behaviour of
ne in US-ASCII and 8-bit mode is similar to
previous versions: each byte in the document is considered a separate
character. There are, however, two important differences: first, if I/O is not
UTF-8 encoded, any encoding of the ISO-8859 family will work, as
ne merely reads bytes from the keyboard and
displays bytes on the screen. In the case of UTF-8 I/O, on the contrary,
ne must take a decision as to which encoding is used
for non-UTF-8 documents, and presently this is hardwired to ISO-8859-1.
Second, 8-bit documents use localized casing and
character type functions. This means that case-insensitive searches or
case foldings will work with, say, Cyrillic characters, provided that
your locale is set correctly.
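Both differences can be illustrated briefly. Note that Python uses Unicode tables for case folding, whereas ne relies on the C library's localized functions; the effect shown is the same:

```python
# First difference: with UTF-8 I/O, the bytes of a non-UTF-8
# document are interpreted as ISO-8859-1 (Latin-1) for display.
raw = b"caf\xe9"                  # 0xE9 is 'é' in ISO-8859-1
print(raw.decode("latin-1"))      # café

# Second difference: case folding that covers non-ASCII letters,
# here applied to Cyrillic text.
print("Привет".casefold() == "привет".casefold())  # True
```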
In UTF-8 mode, instead,
ne interprets the bytes in the document in
a different way—several bytes may encode a single character. The whole
process is completely transparent to the user, but if you really want to
look at the document content, you can switch to 8-bit mode (see UTF8).
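The multi-byte behaviour is easy to see from outside the editor: a single non-ASCII character occupies several bytes in a UTF-8 document, which is exactly what 8-bit mode would expose.

```python
s = "è"                        # one character...
b = s.encode("utf-8")          # ...encoded as two bytes in UTF-8
print(len(s), len(b))          # 1 2
print([hex(x) for x in b])     # ['0xc3', '0xa8']
```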
For most operations, UTF-8 support should be transparent. However, in
some cases, in particular when mixing documents with different encodings,
ne will refuse to perform certain operations because of
encoding incompatibilities.
The main limitation of UTF-8 documents is that when searching for a regular expression in a UTF-8 text, character sets may only contain US-ASCII characters (see Regular Expressions). You can, of course, partially emulate a full UTF-8 character-set implementation by specifying the possible alternatives using ‘|’ (but you have no ranges).
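The alternation workaround can be sketched with Python's `re` module (which, unlike ne's UTF-8 search, does accept non-ASCII character sets; the point here is only the ‘|’ technique):

```python
import re

# A character set like [àá] is ruled out in ne's UTF-8 regex
# search; alternation with '|' expresses the same choice.
pattern = re.compile("c(à|á)ra")
print(bool(pattern.search("cára")))  # True
print(bool(pattern.search("cora")))  # False
```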