

8 The Encoding Mess

The original ne handled 8-bit text files, and assumed that every byte coming from the keyboard could be output to the terminal. No other assumption was made—for instance, the up/down casing functions did not assume a particular encoding for non-US-ASCII characters. This choice had a significant advantage: ne could easily handle several different encodings, with only minor nuisances for the end user.

Since version 1.30, ne supports UTF-8. It can use UTF-8 for its input/output, and it can also interpret one or more buffers as containing UTF-8-encoded text, acting accordingly. Note that the buffer content is actual UTF-8 text—ne does not use wide characters. As a positive side effect, ne can fully support the ISO-10646 standard, while non-UTF-8 texts still occupy exactly one byte per character.

More precisely, any piece of text in ne is classified as US-ASCII, 8-bit or UTF-8. A US-ASCII text contains only US-ASCII characters. An 8-bit text sports a one-to-one correspondence between characters and bytes, whereas a UTF-8 text is interpreted in UTF-8. Of course, this raises a difficult question: when should a buffer be classified as UTF-8?

Character encodings are a mess. There is nothing we can do to change this fact, as character encodings are metadata that modify data semantics. The same file may represent different texts of different lengths when interpreted with different encodings. Thus, there is no safe way of guessing the encoding of a file.

ne stays on the safe side: it will never try to convert a file from one encoding to another. It can, however, interpret the data contained in a buffer depending on an encoding: in other words, encodings are truly treated as metadata. You can switch off UTF-8 at any time, and see the same buffer as a standard 8-bit file.

Moreover, ne uses a lazy approach to the problem: first of all, unless the UTF-8 automatic detection flag is set (see UTF8Auto), no attempt is ever made to consider a file as UTF-8 encoded. Every file, clip, command line, etc., is first scanned for non-US-ASCII characters: if it is entirely made of US-ASCII characters, it is classified as US-ASCII. A US-ASCII piece of text is compatible with anything else—it may be pasted into any buffer, or, if it is a buffer, it may accept any form of text. Buffers classified as US-ASCII are distinguished by an ‘A’ on the status bar.
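The classification just described can be sketched as follows (an illustrative model, not ne's actual source; `classify` and `valid_utf8` are hypothetical names, and the validity check is deliberately simplified—for instance, it does not reject overlong sequences):

```c
enum encoding { ENC_ASCII, ENC_8BIT, ENC_UTF8 };

/* Simplified structural check: every lead byte announces a
   sequence length, and the right number of continuation bytes
   (10xxxxxx) must follow. */
int valid_utf8(const unsigned char *s, long len) {
	long i = 0;
	while (i < len) {
		int n;
		if (s[i] < 0x80) n = 1;
		else if ((s[i] & 0xE0) == 0xC0) n = 2;
		else if ((s[i] & 0xF0) == 0xE0) n = 3;
		else if ((s[i] & 0xF8) == 0xF0) n = 4;
		else return 0;
		if (i + n > len) return 0;
		for (int j = 1; j < n; j++)
			if ((s[i + j] & 0xC0) != 0x80) return 0;
		i += n;
	}
	return 1;
}

/* All bytes below 128: US-ASCII. Otherwise, if automatic
   detection is on and the bytes form valid UTF-8: UTF-8.
   Otherwise: plain 8-bit. */
enum encoding classify(const unsigned char *s, long len, int utf8_auto) {
	long i;
	for (i = 0; i < len; i++)
		if (s[i] >= 0x80) break;
	if (i == len) return ENC_ASCII;
	if (utf8_auto && valid_utf8(s, len)) return ENC_UTF8;
	return ENC_8BIT;
}
```

Note how the UTF8Auto flag only matters once a non-US-ASCII byte has been seen: purely US-ASCII text is classified the same way regardless.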

As soon as a user action forces a choice of encoding (e.g., an accented character is typed, or a UTF-8-encoded clip is pasted), ne fixes the mode to 8-bit or UTF-8 (when there is a choice, this depends on the value of the UTF8Auto flag). Of course, in some cases this may be impossible, and an error will then be reported.

All this happens behind the scenes, and it is designed so that in 99% of cases there is no need to think about encodings. In any case, should ne’s behaviour not match your needs, you can always change the level of UTF-8 support at run time.

