When I wrote this blog about “bash on Windows” a few days ago I omitted one issue, that I already knew about. This is because it needs some elaboration that did not ‘fit in’ the previous blog. So I will do it here.
It is about Unicode. Unicode always was a pain in “cmd” and with the arrival of “bash” in Windows, this has become more significant and important. Actually on any recent *nix platform user will not do anything to make Unicode work ‘out of the box’ in the console and display all or almost all scripts (though I have noticed that the completeness of the ‘monospace’ font mostly used in the Linux console varies between Linux distros – with some distros you will not get all scripts shown here in the console).
But not so in “cmd” and thus also not in bash on Windows. I will illustrate this using the ‘mysql’ command-line client in “bash” with a result set that requires Unicode for display. You may download an SQL-dump of the simple MySQL table I use here.
In a Linux console it works fine – and without any special setting other than “SET NAMES UTF8;” in the ‘mysql’ client in case the MySQL/MariaDB server runs with another default character set. This is the XFCE terminal in OpenSuSE 42.1 (LEAP) with MariaDB’s ‘mysql’ flavor connecting to the same server. Only one imperfection is seen here: Arabic is not written from right to left as it should (you may check with Wikipedia or Google Translate if you are in doubt who is right here: SQLyog or the Linux console). But OK – we are talking about a console and not a word processor.
In “cmd” (whether using “bash” or not) it does not:
Now, this is expected, actually. You will need to specify “cp 65001” for “cmd” to make it use UTF8. You will also need to replace the default “Consolas” font with a Unicode font (from the settings of the “cmd” window). “Lucida console” is normally recommended. But it does not work with non-latin/cyrillic scripts. The font is incomplete and/or “cmd” does not understand how to use it with non-latin/cyrillic scripts:
Now, you may actually use a TrueType font in “cmd”. Monospace TrueType fonts installed on the system are available from the console settings. Let’s try with “Courier New” (what I used with SQLyog and where it worked perfectly). It makes little difference – only Arabic now comes to the console (and also here characters are also printed incorrectly from left to right).
There is no solution because no font will display non-latin/cyrillc scripts properly even when “cmd” uses “cp 65001” and understands (should at least) the characters as UTF8-encoded Unicode characters. Actually, it surprises me that accented latin strings from different ANSI codepages (Portuguese, Latvian, Czech and Turkish all belong to different ANSI codepages) and also Russian are printed correctly without specifying “cp 65001”. I think this may be a recent improvement in “cmd”. Now, “cmd” understands the characters correctly (in ‘UTF8-mode’). It just does not display them. This becomes clear if you copy from the console into any program/interface that handles UTF8 properly – a text editor, a browser form or whatever. See below in Notepad. And here I actually used the “Lucida console” font too, so the font is basically OK, it seems.
It very much looks like “cmd” was designed for ANSI and “Windows Unicode”/UTF16LE (where a single character max. is 2 bytes long) only and fails with UTF8 characters 3 (or more) bytes long, because of some internal truncation taking place. But even if so, this still does not explain that Arabic does not display with “Lucida console” and does with “Courier New”. So there is more to it with right-to-left writing systems.
I think it would be nice if Microsoft:
1) made UTF8 work in “cmd” with all (or almost all) scripts.
2) made “cmd” switch to “Lucida console” font and also switch to “cp 65001” automatically when “bash” is invoked. You may easily forget to specify “cp 65001” (because you don’t need on other environments where bash runs) and you will have to ‘exit’ from bash and start what you were doing all over again.