This is about BOMs – Byte Order Marks. It is also a follow-up to my previous post. I discussed Unicode implementations in MySQL, but I omitted considerations about BOMs last time, as I decided to reserve this for a follow-up post (this one).
Wikipedia: has this article. It explains very well what BOMs are, but I also find it biased. This expresses a “Unix-point-of -view”. And in particular if it is true that the resistance to using BOMs in the Unix world is partly because various Unix console programs don’t handle them, I don’t think it is a valid reason against BOMs. I think then that it is rather a valid reason for having those console applications fixed to comply with Unicode standards!
On the opposite Microsoft claims: “Always prefix a Unicode plain text file with a byte order marks ..”. This is also biased, some would say (and I agree). It is ‘best practises’ on Windows, but not necessarily on other platforms. Most Windows programs (from both Microsoft and others vendors) will write BOMs to files stored with a Unicode-encoding and expect BOMs at the beginning of a file with any Unicode-encoding. Actually one of the few Windows programs that don’t, is our own SQLyog. Because we know that some MySQL versions will have problems with them in various contexts.
The latest example I found of this is posted to this bug report. Of course the world does not collapse because of a bug in software (the Universe runs neither on MySQL, Unix nor Windows!) , but it is an effective show-stopper for importing some data files generated by Microsoft Excel for instance into MySQL with LOAD DATA. I actually thought it was fixed years ago with the fix for this old bug, but not quite yet.
To focus the discussion I will provide a few more quotes from both of the above resources.
Wikipedia: “Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.”. Yes, exactly. But that is also what we are doing with MySQL: running clients and servers on different platforms as well as having replication or clustering setups spanning multiple platforms – and we may copy files around between systems. Probably this is most a concern for Windows users connected to MySQL running on Linux/Unix. But I think Mac users connected to Linux servers are (more or less) affected as well. And in the ‘age of the Cloud’ the server platform may be completely transparent for the client.
Microsoft: “The preferred place to specify byte order is in a file header, but text files do not have headers.” Yes, obviously. The discussion here is about plain text files and nothing else. Not(most) protocolled transfers, not any kind of document/file that has any kind of header where the encoding can be specified.
But if above references are biased, what is then ‘non-biased’? The best (and not too much technical) reference I could find is the unicode.org FAQ. To the question “How I should deal with BOMs?” the answer is;
1) A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
2) Some protocols allow optional BOMs in the case of untagged text. In those cases,
* Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
‘ Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
3) Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
4) Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
1) Sometimes you must use BOMs (if systems/protocols/applications require them).
2) Sometimes you should not use BOMs (if there is a better option – any kind of ‘document header’ (such as a <meta> tag of a HTML page for instance) – or if systems/protocols/applications don’t handle them.
3) Sometimes you may use BOMs. This is the ‘grey area’. And the problem with computers is that they don’t handle ‘greys areas’ very well. For computers the world is basically 0|1, YES|NO, black|white. There is no ½, MAYBE or grey in a computer architecture. It has to be handled by a human – when connected to a computer system or when writing an application.
But unfortunately MySQL provides no functionality for users to ‘walk in grey area’. It can neither write BOMs to files nor USE them when they are there. So in addition to my previous post I think complete support for Unicode in SELECT INTO OUTFILE and LOAD DATA statements should be something like:
LOAD DATA [LOCAL] INFILE [CHARACTER SET ..| BOM_USE ] …
SELECT .. INTO [LOCAL] OUTFILE .. [CHARACTER SET ..] [BOM_WRITE] …
The alternative to using BOMs for encoding detection is to investigate the byte stream itself. There are algorithms and libraries available for this. Some of them perform remarkable well actually, but they never can be made 100% safe – and the safer they are, the more performance overhead they will typically have.
Users should be allowed to use BOMs if they want.