Thursday, 14 November 2013

Internet Explorer 11: Two Steps Forward and Two Steps Back

For years I have bemoaned the incomplete and broken implementation of script-specific font configuration in Internet Explorer. The ability to manually configure what font to use for what Unicode script is a killer feature for me, and something that in my opinion should make Internet Explorer vastly superior to Chrome, which does not allow the user to choose what font to use by default for particular Unicode scripts (in the absence of a font being explicitly specified by the page being read). For multilingual users, especially those who work with more obscure scripts and languages, I find that Internet Explorer generally provides a much better experience, with fewer annoying little boxes for unsupported characters. (I have had bad experiences with Firefox in the past, but reinstalled it for this blog post and was pleasantly surprised by its multiscript support, which is much better than I remember.)

Tag Cloud for the BabelStone Blog as viewed with Internet Explorer 10

Tag Cloud for the BabelStone Blog as viewed with Chrome 30

Tag Cloud for the BabelStone Blog as viewed with Firefox 25

IE6 through IE10 support font configuration for 37 languages or scripts. (This is a little better than Firefox 25, which allows font configuration for 32 languages or regions.)

Configurable Languages in IE10 under Windows 7
Language/Script | Scripts | Unicode | Fonts listed
Armenian | | 1.0 | Arial Unicode MS
Bengali | | 1.0 | Arial Unicode MS, Shonar Bangla
Braille | | 3.0 | Segoe UI Symbol
Canadian Syllabic | | 3.0 | Euphemia
Cherokee | | 3.0 | Plantagenet Cherokee
Chinese Simplified | Han | 1.0 | (various)
Chinese Traditional | Han and Bopomofo | 1.0 | (various)
 | | | Arial Unicode MS
Georgian | | 1.0 | Arial Unicode MS
Gujarati | | 1.0 | Arial Unicode MS
Gurmukhi | | 1.0 | Arial Unicode MS
Japanese | Han and Hiragana/Katakana | 1.0 | (various)
Kannada | | 1.0 | Arial Unicode MS
 | | | Khmer UI
Korean | Han and Hangul | 1.0 | (various)
Lao | | 1.0 | Arial Unicode MS, Lao UI
Latin based | | 1.0 | (various)
Malayalam | | 1.0 | Arial Unicode MS
Ogham | | 3.0 | Segoe UI Symbol
Oriya | | 1.0 | Arial Unicode MS
Runic | | 3.0 | Segoe UI Symbol
Sinhala | | 3.0 | Iskoola Pota
Syriac | | 3.0 | Estrangelo Edessa
Tamil | | 1.0 | Arial Unicode MS
Telugu | | 1.0 | Arial Unicode MS
Thaana | | 3.0 | MV Boli
Tibetan | | 2.0 | Arial Unicode MS, Microsoft Himalaya
User Defined | | | (various)
Yi | | 3.0 | Microsoft Yi Baiti

As you can see, this list does not include any languages with Unicode scripts introduced later than Unicode version 3.0, which was released in September 1999, but it does include all Unicode scripts available in Unicode 3.0 (Bopomofo is presumably subsumed within Chinese Traditional). When IE6 was released in August 2001 this list was pretty much up to date, and only lacked three scripts added in Unicode 3.1 (Deseret, Gothic and Old Italic), which had been released in March 2001, after IE6 had gone beta.

This was a great start, and suggested that IE was going to provide cutting-edge support for Unicode scripts as they were encoded. However, it seems that no-one took ownership of this feature, and it was left to languish for the next twelve years. When IE10 was released in August 2012, thirteen years after Unicode 3.0, it still only allowed font configuration for the original list of 37 languages.

At the same time as no-one was updating the font configuration feature for the 62 new scripts that were added to Unicode between 3.1 and 6.1 (released in January 2012), no-one was fixing any bugs with the font configuration feature. As discussed in Michael Kaplan's blog post, The importance of Tagalog to Burmese, aka "Of course I'd lie to you, I'm a font!" (18 April 2008), the main bugs in the feature are due to the way that IE populates the list of fonts for each language. It lists those fonts that: a) have the appropriate Unicode Subset Bitfield bit set; and b) also have a mapping to a sample Unicode character for the script. Unfortunately, in the case of Myanmar (Burmese), the sample character used is U+1700 ᜀ TAGALOG LETTER A, which is a character from the historic Philippine script Tagalog (Baybayin) which was encoded in Unicode 3.2. The reason for this mistake is that the list of Unicode 3.0 sample characters used by IE was based on draft code charts, and the Myanmar script was relocated from its original proposed location starting at U+1700 to a new location starting at U+1000 when it was actually encoded. This means that no Myanmar font will show up on the list of Myanmar fonts unless it redundantly includes a mapping to the Tagalog character at U+1700. In the case of Mongolian, no sample character is listed at all, which is even worse than the situation for Myanmar as no font ever passes the test for supporting Mongolian, and so although Microsoft has shipped a Mongolian font ("Mongolian Baiti") since Windows Vista, this font does not show up on the list of Mongolian fonts for IE10 and earlier.
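The enumeration rule described above can be modelled in a few lines of Python. This is only an illustrative sketch, not IE's actual code: the `Font` class, the font names, the `fonts_listed` function and the layout of the sample-character tables are all hypothetical, and the "fixed" table simply assumes that U+1000 and U+1820 would be reasonable sample characters for Myanmar and Mongolian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Font:
    name: str
    script_bits: frozenset  # scripts whose Unicode Subset Bitfield bit is set
    cmap: frozenset         # code points the font maps to a glyph


# Sample characters as the post describes them for IE10 and earlier.
BUGGY_SAMPLES = {
    "Myanmar": 0x1700,   # TAGALOG LETTER A, left over from the draft code charts
    "Mongolian": None,   # no sample character at all
}

# Assumed corrections, for comparison only.
FIXED_SAMPLES = {"Myanmar": 0x1000, "Mongolian": 0x1820}


def fonts_listed(script, fonts, samples):
    """List fonts that (a) set the script's bit and (b) map the sample character."""
    sample = samples.get(script)
    if sample is None:
        return []  # without a sample character, no font can ever pass test (b)
    return [f.name for f in fonts
            if script in f.script_bits and sample in f.cmap]


myanmar_block = frozenset(range(0x1000, 0x10A0))
fonts = [
    Font("Hypothetical Myanmar Plain", frozenset({"Myanmar"}), myanmar_block),
    Font("Hypothetical Myanmar Plus", frozenset({"Myanmar"}),
         myanmar_block | {0x1700}),  # redundantly maps U+1700
    Font("Hypothetical Mongolian", frozenset({"Mongolian"}),
         frozenset(range(0x1800, 0x18AB))),
]

print(fonts_listed("Myanmar", fonts, BUGGY_SAMPLES))
print(fonts_listed("Mongolian", fonts, BUGGY_SAMPLES))
print(fonts_listed("Myanmar", fonts, FIXED_SAMPLES))
print(fonts_listed("Mongolian", fonts, FIXED_SAMPLES))
```

With the buggy table only the font that redundantly maps U+1700 is listed for Myanmar and nothing is listed for Mongolian; with the assumed corrections every font that genuinely covers the script is listed.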

Mongolian Font Configuration Dialog in IE10 under Windows 7

No fonts listed even though Windows 7 ships with the "Mongolian Baiti" font.

Myanmar Font Configuration Dialog in IE10 under Windows 7

"Noto Sans Tagalog" is listed although it does not cover Myanmar; and Martin Hosken's Padauk fonts for Myanmar are listed only because they deliberately include a dotted circle glyph mapped to U+1700.

When Internet Explorer 11 installed itself on my laptop recently, the first thing I did was check the font configuration setting, as I did with IE7 and IE8 and IE9 and IE10 when they first appeared, but as no changes had been made for IE7 through IE10 I was not expecting anything new from IE11. Imagine my surprise then, when I opened the font configuration dialog and discovered that the list of languages had been expanded from 37 to 55. That seems like one big step forward!

Configurable Languages in IE11 under Windows 7
Language/Script | Unicode | Fonts listed
Armenian | 1.0 | Arial Unicode MS
Bengali | 1.0 | Arial Unicode MS, Shonar Bangla
Bopomofo | 1.0 | Microsoft JhengHei
Braille | 3.0 | Segoe UI Symbol
Buginese | 4.1 | Leelawadee UI
Canadian Syllabic | 3.0 | Euphemia
Cherokee | 3.0 | Plantagenet Cherokee
Chinese Simplified | 1.0 | (various)
Chinese Traditional | 1.0 | (various)
Coptic | 4.1 | Segoe UI Symbol
Deseret | 3.1 | Segoe UI Symbol
 | | Arial Unicode MS
Georgian | 1.0 | Arial Unicode MS
Glagolitic | 4.1 | Segoe UI Symbol
Gothic | 3.1 | Segoe UI Symbol
Gujarati | 1.0 | Arial Unicode MS
Gurmukhi | 1.0 | Arial Unicode MS
Javanese | 5.2 | Javanese Text*
Kannada | 1.0 | Arial Unicode MS
 | | Khmer UI
Lao | 1.0 | Arial Unicode MS, Lao UI
Latin based | 1.0 | (various)
Malayalam | 1.0 | Arial Unicode MS
Mongolian | 3.0 | Mongolian Baiti
Myanmar | 3.0 | Myanmar Text*
New Tai Lue | 4.1 | Microsoft New Tai Lue
Ogham | 3.0 | Segoe UI Symbol
Ol Chiki | 5.1 | Nirmala UI
Old Italic | 3.1 | Segoe UI Symbol
Old Turkic | 5.2 | Segoe UI Symbol
Oriya | 1.0 | Arial Unicode MS
Phags-pa | 5.0 | Microsoft PhagsPa
Runic | 3.0 | Segoe UI Symbol
Sinhala | 3.0 | Iskoola Pota
Sora Sompeng | 6.1 | Nirmala UI*
Syriac | 3.0 | Estrangelo Edessa
Tai Le | 4.0 | Microsoft Tai Le
Tamil | 1.0 | Arial Unicode MS
Telugu | 1.0 | Arial Unicode MS
Thaana | 3.0 | MV Boli
Tibetan | 2.0 | Arial Unicode MS, Microsoft Himalaya
User Defined | | (various)
Yi | 3.0 | Microsoft Yi Baiti

* Listed in IE11 under Windows 7 although not actually installed on Windows 7.

This is an impressive list, but a little odd. The list does not include all scripts added since Unicode 3.0, but only a selection of scripts added in Unicode versions 3.1 (March 2001), 4.0 (April 2003), 4.1 (March 2005), 5.0 (July 2006), 5.1 (April 2008), 5.2 (October 2009), and 6.1 (January 2012). In fact the list excludes some 47 scripts added to Unicode between 3.2 and 6.1:

  • Avestan (Unicode 5.2)
  • Balinese (Unicode 5.0)
  • Bamum (Unicode 5.2)
  • Batak (Unicode 6.0)
  • Brahmi (Unicode 6.0)
  • Buhid (Unicode 3.2)
  • Carian (Unicode 5.1)
  • Chakma (Unicode 6.1)
  • Cham (Unicode 5.1)
  • Cuneiform (Unicode 5.0)
  • Cypriot (Unicode 4.0)
  • Egyptian Hieroglyphs (Unicode 5.2)
  • Hanunoo (Unicode 3.2)
  • Imperial Aramaic (Unicode 5.2)
  • Inscriptional Pahlavi (Unicode 5.2)
  • Inscriptional Parthian (Unicode 5.2)
  • Kaithi (Unicode 5.2)
  • Kayah Li (Unicode 5.1)
  • Kharoshthi (Unicode 4.1)
  • Lepcha (Unicode 5.1)
  • Limbu (Unicode 4.0)
  • Linear B (Unicode 4.0)
  • Lisu (Unicode 5.2)
  • Lycian (Unicode 5.1)
  • Lydian (Unicode 5.1)
  • Mandaic (Unicode 6.0)
  • Meetei Mayek (Unicode 5.2)
  • Meroitic Cursive (Unicode 6.1)
  • Meroitic Hieroglyphs (Unicode 6.1)
  • Miao (Unicode 6.1)
  • Old Persian (Unicode 4.1)
  • Old South Arabian (Unicode 5.2)
  • Old Turkic (Unicode 5.2)
  • Phoenician (Unicode 5.0)
  • Rejang (Unicode 5.1)
  • Samaritan (Unicode 5.2)
  • Saurashtra (Unicode 5.1)
  • Sharada (Unicode 6.1)
  • Shavian (Unicode 4.0)
  • Sundanese (Unicode 5.1)
  • Syloti Nagri (Unicode 4.1)
  • Tagalog (Unicode 3.2)
  • Tagbanwa (Unicode 3.2)
  • Tai Tham (Unicode 5.2)
  • Tai Viet (Unicode 5.2)
  • Takri (Unicode 6.1)
  • Ugaritic (Unicode 4.0)

Why exclude these 47 scripts? Well, the answer is that they are all scripts which Microsoft does not currently support at the font level. So it seems that the Microsoft thinking is that users should only be allowed to configure what font to use for what script if Microsoft provides a font for that script. If Microsoft does not currently provide a font for a particular script, but you have third party fonts installed that cover that script, then hard luck. I have to say that this is a very disappointing attitude, and makes it very frustrating for users like myself who are immensely grateful to Microsoft for supporting minor scripts such as Mongolian, Phags-pa, Tibetan, Yi, etc. but who also wish to use scripts for which Microsoft does not yet provide support.

What about the Myanmar and Mongolian bugs? Finally fixed (or at least, so it seems) – another step forward!

Mongolian Font Configuration Dialog in IE11 under Windows 7

"Mongolian Baiti" font is finally listed.

Myanmar Font Configuration Dialog in IE11 under Windows 7

Microsoft's "Myanmar Text" font is listed, but so is "Noto Sans Tagalog"!

Hmm, something's not right here.

Firstly, the Myanmar configuration lists the "Myanmar Text" font, but the sample just shows boxes. Wait a minute, I don't have the "Myanmar Text" font installed on my Windows 7 laptop, because that font only ships with Windows 8 and later. And for that matter, I don't have the "Nirmala UI" font listed for Sora Sompeng or the "Javanese Text" font listed for Javanese either.

Secondly, the Myanmar configuration still lists the "Noto Sans Tagalog" font even though that font has not a single Myanmar character in it. A little experiment shows that when U+1700 is removed from the Padauk font it is no longer listed under Myanmar in IE11. So it seems like the Myanmar bug has not been fixed at all, but the dialog has simply been hard-coded to statically include the "Myanmar Text" font in addition to fonts that are dynamically (but still incorrectly) enumerated.

Thirdly, although the Mongolian dialog now lists Microsoft's "Mongolian Baiti" font, it does not list any of the several other third-party Unicode Mongolian fonts installed on my system. I suspect that the Mongolian bug has not been fixed at all, but the dialog has simply been hard-coded to show the "Mongolian Baiti" font. I have a sinking feeling about this. Let's take a look at Phags-pa, as I recently and belatedly updated my Phags-pa fonts to work under Windows 7+. Will they be listed?

Phags-pa Font Configuration Dialog in IE11 under Windows 7

Microsoft's "Microsoft PhagsPa" font is listed, but not my "BabelStone Phags-pa Book" or "BabelStone Phags-pa Tibetan" fonts.

As I thought, only Microsoft's Phags-pa font is listed. My Phags-pa fonts are not listed even though they set the appropriate Unicode Subset Bitfield bit and cover all Phags-pa characters. However, my "BabelStone Phags-pa Book" font is listed under Latin based and User Defined, so it is not getting entirely ignored by IE11, only ignored for the specific script that it is designed for use with.

After a little investigation, it becomes clear that none of the eighteen new IE11 font configuration dialogs (for Bopomofo, Buginese, Coptic, Deseret, Glagolitic, Gothic, Javanese, New Tai Lue, N'Ko, Ol Chiki, Old Italic, Old Turkic, Osmanya, Phags-pa, Sora Sompeng, Tai Le, Tifinagh, Vai) list any installed third-party fonts that cover the particular script. Furthermore, all eighteen dialogs only list a single font, even in the case of Bopomofo which is covered by more than ten Microsoft fonts in Windows 7, so no choice of font is possible. The inescapable conclusion is that the eighteen new font configuration dialogs in IE11 (and also the dialog for Mongolian) simply list a single hard-coded Microsoft font for each script (even if the listed font is not installed on the system), giving the user absolutely no choice whatsoever over font configuration for these scripts. In other words, the IE11 changes to font configuration are a facade thinly disguising a fake implementation. Who in Microsoft, I wonder, decided that a fake implementation that gives the user no choice (not even Hobson's choice as you cannot not select the proffered Microsoft font) was in any way better than not having the font configuration dialogs for these scripts?

So what initially looked like two steps forward turns out to have been an illusion, a cheap conjurer's trick, and in fact IE11 is not one iota better than IE6 was at allowing the user to configure what fonts to use for what scripts. Twelve years on and zero progress.

Postscript A

Does font configuration even work for scripts that have more than one font listed? Not always, at least not for Tibetan. The Tibetan configuration dialog allows you to choose between the "Arial Unicode MS" font (which has glyphs for Tibetan characters but has no shaping behaviour so combining vowel signs are rendered as spacing marks) and the "Microsoft Himalaya" font (which fully supports Tibetan shaping behaviour), but if you choose "Arial Unicode MS" (not a good choice, but if you offer the user a choice they should be free to make a bad choice) then a web page with unstyled Tibetan text will still be rendered with "Microsoft Himalaya".

Tibetan Font Configuration Dialog in IE11 under Windows 7

"Arial Unicode MS" font is listed, but the font in the preview can't be "Arial Unicode MS" as it does not do joined-up Tibetan.

In fact, if you install a good third-party Unicode Tibetan font such as Chris Fynn's Jomolhari, it will be listed in the Tibetan font configuration dialog, but if you select it you will still only ever see "Microsoft Himalaya" used to render unstyled Tibetan text on web pages. So what's the point?

Postscript B

In the Phags-pa font configuration dialog shown above the sample Phags-pa text is ꡏꡡꡋꡂꡡꡙ ꡢꡠꡙꡠ mongol qele, which is a brave but flawed attempt to render Mongolian ᠮᠣᠨᠭᠭᠣᠯ ᠬᠡᠯᠡ mongɣol kele "Mongolian language" in the Phags-pa script. It is wrong on several counts:

  1. In the Phags-pa script a space is always used to separate syllables not words, so there should be three spaces not one;
  2. Mongolian ng should be represented by the single Phags-pa letter nga;
  3. Mongolian ɣ is normally represented using the Phags-pa letter qa;
  4. Mongolian k is normally represented using the Phags-pa letter kha;
  5. Mongolian e would probably be represented using the Phags-pa letter ee here (Phags-pa script has two flavours of e, and although the Phags-pa spelling of Mongolian kele is not attested, by analogy with other Phags-pa Mongolian words ee would be expected).

It is unfortunate that this flawed spelling was chosen as someone at Microsoft asked me for the autonym for Mongolian in Phags-pa script in 2011, and I suggested ꡏꡡꡃ ꡢꡡꡙ ꡁꡦ ꡙꡦ mong qol khė lė which I believe to be much more authentic. Mind you, as the Phags-pa script was specifically devised to be used for writing multiple languages, and during the Yuan dynasty was used for writing Chinese at least as much as for writing Mongolian, as well as for Sanskrit, Tibetan and Uyghur, in my opinion choosing "Mongolian language" as the sample text for Phags-pa is not quite right anyway.
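The difference between the two spellings is easier to see with the letters named. The following is a small sketch using Python's standard `unicodedata` module; the constant names and the `letter_names` helper are made up for this illustration, and the strings are written as escapes on the assumption that they match the code points of the two Phags-pa samples quoted above.

```python
import unicodedata

# The sample text in the IE11 dialog and the spelling suggested in the post,
# written with escapes so that the code points are unambiguous.
SHIPPED = "\uA84F\uA861\uA84B\uA842\uA861\uA859 \uA862\uA860\uA859\uA860"
SUGGESTED = "\uA84F\uA861\uA843 \uA862\uA861\uA859 \uA841\uA866 \uA859\uA866"


def letter_names(text):
    """Return one list of short letter names per space-separated group."""
    prefix = "PHAGS-PA LETTER "
    return [
        [unicodedata.name(ch).replace(prefix, "") for ch in group]
        for group in text.split(" ")
    ]


for label, text in (("shipped", SHIPPED), ("suggested", SUGGESTED)):
    print(label, text.count(" "), "space(s)", letter_names(text))
```

The shipped sample has a single space and spells ng with two letters, whereas the suggested spelling has three spaces (one between each pair of syllables) and uses the single letter nga, together with kha and ee.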

Sunday, 20 October 2013

What's new in Unicode 7.0 ?


[Update: Unicode 7.0 was released on 16 June 2014]

The two previous releases of Unicode (6.2 and 6.3) have been rather disappointing with regard to the number of new characters introduced into the standard (one in 6.2 and five in 6.3), so Unicode 7.0 should be much more exciting to those of us who think that 110,000 characters in Unicode are not nearly enough. In summary, 2,834* new characters are going to be added to Unicode 7.0 when it is released in the summer of 2014 (official beta information page for Unicode 7.0.0). Of these, 1,849 characters belong to 23 newly added scripts, which is a greater number of new scripts than for any previous version since Unicode 1.0 (which started life with 24 scripts).

* When I wrote this blog post there were going to be 2,833 new characters, but since then the newly invented Ruble sign has been fast-tracked for encoding in Unicode 7.0 at U+20BD.

23 new scripts in Unicode 7.0

Although all of these new scripts are either historical or have limited modern usage, and most people will be unfamiliar with most of them, there are several important additions, notably Grantha and Siddham, as well as Linear A, which may be the first undeciphered writing system to be encoded in Unicode (depending upon whether the symbols on the Phaistos Disc, encoded in Unicode 5.1, represent writing or not).

Apart from the new scripts, the highlight of Unicode 7.0 for most people on the internet will be the addition of 643 wingdings, webdings and other pictographic symbols, which will supplement the emoticons, emoji and many other symbols added to Unicode 6.0. I predict that characters such as "Reversed Hand with Middle Finger Extended", "Reversed Victory Hand" (British equivalent of the finger), and "Raised Hand with Part Between Middle and Ring Fingers" (live long and prosper) will become even more popular on Twitter than the infamous "Pile of Poo" 💩 character*.

* Pile of Poo was encoded in the Unicode standard for compatibility with Japanese telecoms companies (KDDI & Softbank) which included it as part of the Emoji repertoire on their cell phones (see the original Emoji proposal where the character is provisionally named "Dung", later changed to "Pile of Poo" at the suggestion of Michael Everson).

FDAM2 code chart images of characters 1F594 through 1F596

However, the character that seems to be causing the most stir amongst the twitterati is U+1F574 "MAN IN BUSINESS SUIT LEVITATING". People are asking why Unicode has seen fit to encode this particular character. The answer is that in 2011 my good friend Michel Suignard (and project editor of ISO/IEC 10646) proposed to encode the set of symbols used in the widely-used Wingdings and Webdings fonts that were not already in Unicode or unifiable with an existing character. The Webdings font that ships with Microsoft Windows includes a glyph for a man in a business suit apparently levitating at U+F06D (also accessible as "m" unless you are using Firefox), and it is being encoded in Unicode 7.0 simply because the glyph is in the Webdings font and it is not unifiable with any existing Unicode character. So if you still want to know why Unicode 7.0 will include a character for MAN IN BUSINESS SUIT LEVITATING you had better ask Vincent Connare et al. why they included the glyph in Webdings in 1997 in the first place.*
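The link between "m" and U+F06D can be shown with a short sketch. It relies on a convention that is not spelled out above, namely that Windows symbol-encoded fonts such as Webdings place their glyphs in the Private Use Area at U+F000 plus the corresponding 8-bit character code; the constant and function names are invented for the illustration.

```python
import unicodedata

SYMBOL_PUA_BASE = 0xF000  # assumed offset used by Windows symbol-encoded fonts


def symbol_font_code_point(ch):
    """Map an 8-bit character to its Private Use Area slot in a symbol font."""
    return SYMBOL_PUA_BASE + ord(ch)


webdings_slot = symbol_font_code_point("m")
print(f"U+{webdings_slot:04X}")

# The name of the new character is only known to Python builds whose
# Unicode data is version 7.0 or later, hence the fallback string.
print(unicodedata.name("\U0001F574", "(not in this Python's Unicode data)"))
```

A Private Use Area code point has no standard meaning, which is why the glyph needed a code point of its own before it could be interchanged as plain text.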

* According to Microsoft's Webdings page: Our team of iconographers traveled the world asking site designers and users which symbols, icons and pictograms they thought would be most appropriate for a font of this kind. From thousands of suggestions we had to pick just two hundred and thirty for inclusion in Webdings.

** According to Jen Sorenson, in this blog post from 2009, the Man in Business Suit Levitating glyph in the Webdings font was intended to be an exclamation mark in the style of the rude boy logo found on records by The Specials published under the 2 Tone Records label. So perhaps the Unicode character would have been better named Rude Boy Exclamation Mark. Thanks to Ted Mielczarek for pointing this out to me.

BabelMap showing Webdings character F06D

Unicode and ISO/IEC 10646

Many people seem to think that characters are randomly added to the Unicode standard at a whim, and I can understand why it sometimes seems like that to an outside observer, but in fact the process of adding characters is far from simple. The Unicode standard is synchronized with the international standard, ISO/IEC 10646 ("Information technology—Universal Multiple-Octet Coded Character Set (UCS)"), and the contents of each version of the Unicode standard are largely determined by the committee work and balloting process for ISO/IEC 10646 by national standardization organizations (such as ANSI, BSI, DIN), although as the Unicode Consortium is represented on the committee responsible for ISO/IEC 10646 directly as a liaison member and indirectly via the US national body, it plays a very important role in this process (for more information on the relationship between the Unicode and ISO/IEC 10646 standards, see my blog post on Unicode and ISO/IEC 10646).

Unicode 6.1, released in January 2012, corresponds to ISO/IEC 10646:2012, which was published in June 2012 (freely available from the ISO web site as a set of PDF files and a set of electronic inserts). Amendment 1 to ISO/IEC 10646:2012 was published earlier this year, and one character only from Amd.1 (the Turkish Lira Sign) was added to the Unicode standard in version 6.2 released in September 2012. Amendment 2 to ISO/IEC 10646:2012 is currently in its final stage of balloting, and will be published late this year or early next year. Five characters only from Amd.2 (Arabic Letter Mark, Left-To-Right Isolate, Right-To-Left Isolate, First Strong Isolate, Pop Directional Isolate) were added to the Unicode standard in version 6.3 released at the end of September 2013. The repertoire of Unicode 7.0 will correspond to ISO/IEC 10646:2012 plus Amendments 1 and 2, and so the new characters encoded in 7.0 will correspond to those added to Amendment 1 (1,769 characters) and Amendment 2 (1,070 characters), minus the six characters already added in 6.2 and 6.3 (1,769 + 1,070 - 6 = 2,833 new characters in Unicode 7.0).
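The totals quoted here can be checked against the section sizes given in the tables for each amendment. This is just a worked tally in Python using figures that appear in this post; the variable names are arbitrary.

```python
# Section sizes as given in the table headings for each amendment.
amd1 = {"additions to existing blocks": 339, "new blocks": 1430}
amd2 = {"additions to existing blocks": 248, "new blocks": 822}

# Characters from the amendments that were already added in earlier releases.
already_encoded = {"Unicode 6.2": 1, "Unicode 6.3": 5}

amd1_total = sum(amd1.values())
amd2_total = sum(amd2.values())
new_in_7_0 = amd1_total + amd2_total - sum(already_encoded.values())

print(amd1_total, amd2_total, new_in_7_0)

# The fast-tracked Ruble sign at U+20BD brings the final figure to one more.
print(new_in_7_0 + 1)
```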

Amendment 1

Amendment 1 ("Linear A, Palmyrene, Manichaean, Khojki, Khudawadi, Bassa Vah, Duployan, and other characters") has already been published, so no changes to character allocations or character names in Unicode can be made. This amendment includes 1,769 new characters, as detailed in the tables below. You can download code charts covering the new characters from here or here.

Additions to Existing Blocks (339 characters)
Block Characters Documents
Greek and Coptic
037F: Capital letter yot N3997
Armenian
058D..058E: 2 Armenian eternity signs N3923
Arabic
0605: Mark used with Coptic numbers N3843
Arabic Extended-A
08A1: 1 letter used for Fulfulde N3882
08AD..08B1: 5 letters used for Bashkir, Belarusian, Crimean Tatar, and Tatar languages N4072
08FF: 1 letter used for Palula and Shina N4072
Devanagari
0978: 1 letter used for Marwari N3970
Telugu
0C00: Candrabindu N3964
Kannada
0C81: Candrabindu N3964
Malayalam
0D01: Candrabindu N3964
Sinhala
0DE6..0DEF: 10 digits for astrological use N3888
Limbu
191D..191E: 2 consonant conjuncts N3975
Combining Diacritical Marks Supplement
1DE7..1DF4: 14 combining letters used for Teuthonista phonetic transcription N4081
Currency Symbols
20BA: Turkish Lira sign (Unicode 6.2) N4273
Miscellaneous Technical
23F4..23FA: 7 wingdings and webdings symbols N4022
Dingbats
2700: 1 wingdings and webdings symbol N4022
Miscellaneous Symbols and Arrows
2B4D..2B4F, 2B5A..2B73, 2B76..2B95, 2B98..2BB9, 2BBD..2BC8, 2BCA..2BD1: 115 wingdings and webdings symbols N4022
Supplemental Punctuation
2E3C: Stenographic full stop N3895
2E3D..2E3E: 2 marks for Lithuanian dialectology N4070
2E3F: Capitulum N4022
2E40: Double hyphen N3983
2E41..2E42: 2 marks for Old Hungarian N3664
Cyrillic Extended-B
A698..A69B: 4 early Cyrillic letters N3974
A69C..A69D: 2 modifier letters used for Lithuanian dialectology N4070
Latin Extended-D
A794..A795: 2 letters used for Lithuanian dialectology N4070
A798..A79F: 8 letters used for Teuthonista phonetic transcription N4081
Combining Half Marks
FE27..FE2D: 7 combining half marks N4078
Old Italic
1031F: 1 letter used in a South Picene inscription N4046
Enclosed Alphanumeric Supplement
1F10B..1F10C: 2 wingdings and webdings symbols N4022
Miscellaneous Symbols and Pictographs
1F321..1F32C, 1F336, 1F394..1F395, 1F397, 1F39C..1F39D, 1F3F1..1F3F6, 1F441, 1F53E..1F53F, 1F544..1F54A, 1F568..1F56A, 1F56D..1F56F, 1F571, 1F573, 1F577..1F578, 1F57B, 1F57D..1F57F, 1F582..1F587, 1F589..1F593, 1F597..1F5A3, 1F5A5..1F5BB, 1F5BF..1F5C1, 1F5C4..1F5D1, 1F5D4..1F5DB, 1F5F4..1F5FA: 133 wingdings and webdings symbols N4022
Emoticons
1F641..1F642: 2 wingdings and webdings symbols N4022
Transport and Map Symbols
1F6C6..1F6CA, 1F6E0: 6 wingdings and webdings symbols N4022

Linear A tablet at the Chania Archaeological Museum

{CC BY-SA 3.0 by Ursus}

New Blocks (1,430 characters)
Block Characters Documents
Combining Diacritical Marks Extended
1AB0..1ABE: 15 marks for Teuthonista phonetic transcription N4081
Myanmar Extended-B
A9E0..A9E6: 7 letters used for Shan Pali N3906
Latin Extended-E
AB30..AB5F: 48 letters used for Teuthonista phonetic transcription N4081
Coptic Epact Numbers
102E0..102FB: 28 numbers used in Coptic-Arabic manuscripts N3843
Elbasan
10500..10527: 40 letters used for the Elbasan script N3985
Linear A
10600..10736, 10740..10755, 10760..10767: 341 Linear A signs N3973
Palmyrene
10860..1087F: 32 letters used for the Palmyrene script N3867
Nabataean
10880..1089E, 108A7..108AF: 40 letters and numbers used for the Nabataean script N3969
Old North Arabian
10A80..10A9F: 32 letters and numbers used for the Old North Arabian script N3937
Manichaean
10AC0..10AE6, 10AEB..10AF6: 51 letters, numbers and punctuation marks used for the Manichaean script N4029
Sinhala Archaic Numbers
111E1..111F4: 20 archaic numbers N3876
Khojki
11200..11211, 11213..1123D: 61 letters, signs and punctuation marks used for the Khojki script N3978
Khudawadi
112B0..112EA, 112F0..112F9: 69 letters, signs and numbers used for the Khudawadi script N3979
Tirhuta
11480..114C7, 114D0..114D9: 82 letters, signs and numbers used for the Tirhuta script N4035
Pau Cin Hau
11AC0..11AF8: 57 letters and other characters used for the Pau Cin Hau script N4017
Mro
16A40..16A5E, 16A60..16A6F: 43 letters, numbers and punctuation marks used for the Mro script N3589
Bassa Vah
16AD0..16AED, 16AF0..16AF5: 36 letters and other characters used for the Bassa Vah script N3941
Duployan
1BC00..1BC6A, 1BC70..1BC7C, 1BC80..1BC88, 1BC90..1BC99, 1BC9C..1BC9F: 143 letters and other characters for Duployan shorthand N3895
Shorthand Format Controls
1BCA0..1BCA3: 4 shorthand format characters N3895
Ornamental Dingbats
1F650..1F67F: 48 wingdings and webdings symbols N4022
Geometric Shapes Extended
1F780..1F7D4: 85 wingdings and webdings symbols N4022
Supplemental Arrows-C
1F800..1F80B, 1F810..1F847, 1F850..1F859, 1F860..1F887, 1F890..1F8AD: 148 wingdings and webdings symbols N4022
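The character counts in these tables follow directly from the code point ranges, so they are easy to verify mechanically. Below is a minimal sketch of such a check; the function name `count_code_points` is invented for the example, and it assumes ranges are written exactly as in the tables (hexadecimal, with `..` between the first and last code point and commas between ranges).

```python
def count_code_points(ranges):
    """Count the code points in a list such as '10600..10736, 10740..10755'."""
    total = 0
    for part in ranges.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # a lone code point counts as 1
        if end < start:
            raise ValueError(f"descending range: {part}")
        total += end - start + 1
    return total


linear_a = "10600..10736, 10740..10755, 10760..10767"
arrows_c = "1F800..1F80B, 1F810..1F847, 1F850..1F859, 1F860..1F887, 1F890..1F8AD"

print(count_code_points(linear_a))
print(count_code_points(arrows_c))
```

A check like this also catches mistyped ranges, since a range whose end is lower than its start raises an error instead of silently producing a wrong count.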

Amendment 2

Amendment 2 ("Caucasian Albanian, Psalter Pahlavi, Mahajani, Grantha, Modi, Pahawh Hmong, Mende Kikakui, and other characters") is currently undergoing its final round of balloting, but at this stage no changes to character allocations or character names in Unicode can be made. This amendment includes 1,070 new characters, as detailed in the tables below. You can download code charts covering the new characters from here or here.

Medieval Celtic stone inscribed SABIN{I} FIL{I} MACCODECHET{I}

{CC BY-SA 3.0 by BabelStone}

Additions to Existing Blocks (248 characters)
Block Characters Documents
Cyrillic Supplement
0528..0529: 2 letters used for Orok N4137
052A..052D: 4 letters used for Ossetian and Komi N4199
052E..052F: 2 letters used for Northern Khanty, Eastern Khanty and Forest Nenets N4219
Arabic
061C: Arabic letter mark (Unicode 6.3) N4180
Arabic Extended-A
08B2: 1 letter for Berber N4271
Bengali
0980: Anji sign N4157
Telugu
0C34: Letter llla N4214
Runic
16F1..16F3: 3 letters used by J. R. R. Tolkien
16F4..16F8: 5 letters used on the Franks Casket
Vedic Extensions
1CF8..1CF9: 2 svara markers for the Jaiminiya Sama Veda Archika N4134
Combining Diacritical Marks Supplement
1DF5: 1 character used in American lexicography N4279
General Punctuation
2066..2069: 4 bidirectional format characters (Unicode 6.3) N4279
Currency Symbols
20BB: Nordic mark sign N4308
20BC: Azerbaijani Manat sign N4168
Latin Extended-D
A796..A797: 2 letters used for Middle Vietnamese
A7AB..A7AC: 2 letters required for casing
A7F7: 1 letter used in Celtic inscriptions
A7B0..A7B1: 2 letters used in Americanist orthographies N4297
A7AD: 1 letter used for Alabama N4228
Myanmar Extended-B
A9E7..A9FE: 24 letters and numbers used for Tai Laing N3976
Myanmar Extended-A
AA7C..AA7D: 2 signs used for Tai Laing
AA7E..AA7F: 2 letters used for Shwe Palaung
Latin Extended-E
AB64..AB65: 2 letters used for phonetic transcription N4307
Ancient Greek Numbers
1018B..1018C, 101A0: 3 papyrological characters N4194
Brahmi
1107F: Number joiner N4166
Sharada
111CD: Sutra mark N4269
111DA: Ekam sign N4158
Cuneiform, Cuneiform Numbers and Punctuation
1236F..12398, 12463..1246E, 12474: 55 signs and numeric signs N4277
Playing Cards
1F0BF, 1F0E0..1F0F5: 23 playing card symbols N4089
Miscellaneous Symbols and Pictographs
1F37D, 1F396, 1F398..1F39B, 1F39E..1F39F, 1F3C5, 1F3CB..1F3CE, 1F3D4..1F3DF, 1F3F7, 1F43F, 1F4F8, 1F4FD..1F4FE, 1F56B..1F56C, 1F570, 1F572, 1F574..1F576, 1F579, 1F57C, 1F580..1F581, 1F588, 1F594..1F596, 1F5BC..1F5BE, 1F5C2..1F5C3, 1F5D2..1F5D3, 1F5DC..1F5F3: 76 wingdings and webdings symbols N4022
Transport and Map Symbols
1F6CB..1F6CF, 1F6E1..1F6EC, 1F6F0..1F6F3: 21 wingdings and webdings symbols N4022

Sanskrit Dhāraṇī in Chinese and Siddham scripts from Yarkhoto

IDP: Berlin-Brandenburgische Akademie der Wissenschaften: SHT 7175

New Blocks (822 characters)
Block Characters Documents
Old Permic
10350..1037A: 43 letters used for the Old Permic script N4263
Caucasian Albanian
10530..10563, 1056F: 53 letters and marks used for the Caucasian Albanian script N4131
Psalter Pahlavi
10B80..10B91, 10B99..10B9C, 10BA9..10BAF: 29 letters, marks and numbers used for the Psalter Pahlavi script N4040
Mahajani
11150..11176: 39 letters and signs used for the Mahajani script N4126
Grantha
11301..11303, 11305..1130C, 1130F..11310, 11313..11328, 1132A..11330, 11332..11333, 11335..11339, 1133C..11344, 11347..11348, 1134B..1134D, 11357, 1135D..11363, 11366..1136C, 11370..11374: 83 letters, numbers and signs used for the Grantha script N4135
Siddham
11580..115B5, 115B8..115C9: 72 letters, signs and marks used for the Siddham script N4294
Modi
11600..11644, 11650..11659: 79 letters, signs and numbers used for the Modi script N4034
Warang Citi
118A0..118F2, 118FF: 84 letters and numbers used for the Warang Citi script N4259
Pahawh Hmong
16B00..16B45, 16B50..16B59, 16B5B..16B61, 16B63..16B77, 16B7D..16B8F: 127 letters and signs used for the Pahawh Hmong script N4175
Mende Kikakui
1E800..1E8C4, 1E8C7..1E8D6: 213 syllables and numbers used for the Mende Kikakui script N4167
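The figure of 1,849 characters in 23 new scripts quoted at the start of this post can be reproduced by tallying the script rows of the two "New Blocks" tables (leaving out the blocks that extend existing scripts or hold symbols). The sketch below just copies the counts from those rows; the dictionary name is arbitrary.

```python
# Characters per newly encoded script, copied from the "New Blocks" tables.
new_scripts = {
    # Amendment 1
    "Elbasan": 40, "Linear A": 341, "Palmyrene": 32, "Nabataean": 40,
    "Old North Arabian": 32, "Manichaean": 51, "Khojki": 61,
    "Khudawadi": 69, "Tirhuta": 82, "Pau Cin Hau": 57, "Mro": 43,
    "Bassa Vah": 36, "Duployan": 143,
    # Amendment 2
    "Old Permic": 43, "Caucasian Albanian": 53, "Psalter Pahlavi": 29,
    "Mahajani": 39, "Grantha": 83, "Siddham": 72, "Modi": 79,
    "Warang Citi": 84, "Pahawh Hmong": 127, "Mende Kikakui": 213,
}

print(len(new_scripts), "scripts,", sum(new_scripts.values()), "characters")
```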

On beyond 7.0

A new (4th) edition of ISO/IEC 10646 will be published next year, and Amendment 1 to this new edition is already in progress. ISO/IEC 10646:2014 (draft code charts) will include Hatran, Old Hungarian (assuming that the Hungarian national body's ballot response is positive), Sharada, Multani, Ahom, Early Dynastic Cuneiform, Anatolian Hieroglyphs, and Sutton Signwriting, as well as 5,762 Han ideographs in a new CJK-E block. Amendment 1 (draft code charts) currently adds Nüshu (Nushu) and Tamil supplement, but more scripts may be added to it as it progresses. The character repertoire, code point allocations, and character names are not yet fixed, and the draft code charts linked to above should be treated with caution.

For the first time, in what I think is a very good move, the Unicode Consortium has publicized the ISO ballots in advance of announcing a beta version of Unicode (at which point it is too late to make changes to character allocation and character names), and requested feedback from the public on the proposed repertoires. See PRI #256 for ISO/IEC 10646:2014 and PRI #255 for ISO/IEC 10646:2014 Amd.1. New scripts and characters added to ISO/IEC 10646:2014 and its amendments will feed into Unicode 7.1 and 7.2 (these are probable version numbers, but are currently unconfirmed) during the next two or three years.

For those of you who have been following the yo-yoing progress of the middle dot letter used for Sinological transcription and 'Phags-pa transliteration (originally proposed for encoding by myself in January 2009, and subsequently put on and then taken off virtually every ballot since then), an agreement was finally reached at the last WG2 meeting in Vilnius during the summer of this year to encode the character at U+A78F under the compromise name of LATIN LETTER SINOLOGICAL DOT, and I hope to see it encoded in the version of Unicode corresponding to ISO/IEC 10646:2014 Amd.1 (it's not currently on Amd.1, but maybe it will get added there).

Tangut is a major historic script that I know that many people want to see encoded in Unicode, and as the main author of a series of proposals to encode Tangut characters and Tangut components I am at the top of this list. However, although the first proposal to encode Tangut characters (by Richard Cook) was made in 2008, it has proved very hard to reach an agreement on character repertoire, and Tangut encoding has floundered. A conference on encoding Tangut, supported by a grant from the Henry Luce Foundation, will be held in Beijing in December of this year (I will be there), and if all goes well it is possible that Tangut could be put on the ballot for ISO/IEC 10646:2014 Amd. 2, and find its way on into Unicode 7.2 or 8.0.

Fonts Supporting Unicode 7.0

Saturday, 5 October 2013

BabelPad Version 6.3.0

Unicode 6.3 was released at the beginning of this week, and so I have released updated versions of BabelPad and BabelMap. There are no significant changes to BabelMap this time (although I am planning a makeover for BabelMap for version 7.0 next year), but I have spent a considerable amount of time working on a number of significant enhancements to BabelPad. As yet again I have not had time to implement the most requested feature (a working help system), I thought it best to describe these new features in a blog post. If you have any comments, questions or suggestions about BabelPad you may either comment on this post or post a question to the BabelStone forum.

Open Lines

You can now open part of a file by selecting the "Open Lines..." command from the File menu.

Open Lines...

When you select "Open Lines..." the standard file dialog will be opened, but after you choose the file to open a new dialog will be opened that allows you to specify which lines of the file to open. For very large files it may take a few seconds for this dialog to appear as it has to first parse the file to determine how many lines long the document is.

Open Lines Dialog

Manipulation of Tabular Columns

I work a lot with tabular data, and I frequently have to swap my data out of BabelPad and into Excel in order to sort or reorder the columns, and then swap it back into BabelPad to do other editing. In order to reduce my reliance on Excel I have now implemented support in BabelPad for manipulating tabular columns of text, delimited by tabs, commas or any user-specified character or string. For all operations described below, you need to first select one or more whole lines of text (i.e. the start and end points of the selection both have to be at the start of a line), then select the appropriate operation from the "Columns" submenu of the Edit menu. BabelPad will automatically detect if your columns are tab-delimited or comma-delimited (based on the first line of the selected block of text), but if you want to override the detected delimiter or specify a different column delimiter (either a single character or a text string) you may do so by checking the appropriate radio button (if you change the custom delimiter you must check the "other" radio button again for the delimiter to be applied).

Columns Submenu on the Edit Menu

Ordering Columns

This operation enables you to order any number of columns in the selected block of text, for example changing the order of columns A, B, C, D, E to D, A, E, C, B. Simply use the up and down buttons to order a selected column or a contiguous range of selected columns. When you are happy with the new order, press the "Order" button; or else press the "Cancel" button to cancel the operation.

Order Columns Dialog
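
For readers who want to do the same thing in a script, the reordering described above can be sketched in a few lines of Python. This is only an illustration and not BabelPad's own code: the function names are made up, and the rule for detecting the delimiter (tab if the first line contains one, otherwise comma) is an assumption, since all that is stated here is that detection is based on the first line.

```python
def detect_delimiter(first_line):
    """Assumed rule: prefer tab if the first line contains one, else comma."""
    return "\t" if "\t" in first_line else ","


def reorder_columns(lines, order, delimiter=None):
    """Return the lines with their columns rearranged to the given 0-based order."""
    if not lines:
        return []
    if delimiter is None:
        delimiter = detect_delimiter(lines[0])
    result = []
    for line in lines:
        cells = line.split(delimiter)
        # Rows that are too short contribute empty cells instead of raising an error.
        result.append(delimiter.join(cells[i] if i < len(cells) else "" for i in order))
    return result


rows = ["A\tB\tC\tD\tE", "1\t2\t3\t4\t5"]
# Change the order A, B, C, D, E to D, A, E, C, B.
print(reorder_columns(rows, [3, 0, 4, 2, 1]))
```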

Cutting, Copying and Deleting Columns

These operations enable you to cut, copy or delete any number of columns in the selected block of text. Simply select one or more columns (they do not need to be contiguous), then press the "Cut", "Copy" or "Delete" button as appropriate. When cutting or copying multiple columns, each column will be separated by the delimiter character or string (if you are cutting or copying discontiguous columns there will only be a single delimiter character or string between the columns, regardless of how many columns apart they are).

Delete Columns Dialog
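
The behaviour for discontiguous columns can be illustrated with a short Python sketch (the function names are invented for this example, and it is not BabelPad's implementation). Copying columns 1 and 5 of a five-column table yields two columns separated by a single delimiter; a cut is simply a copy followed by a delete.

```python
def copy_columns(lines, columns, delimiter="\t"):
    """Return only the chosen 0-based columns of each line, joined by one delimiter."""
    picked = sorted(set(columns))
    result = []
    for line in lines:
        cells = line.split(delimiter)
        result.append(delimiter.join(cells[i] for i in picked if i < len(cells)))
    return result


def delete_columns(lines, columns, delimiter="\t"):
    """Return the lines with the chosen 0-based columns removed."""
    dropped = set(columns)
    return [
        delimiter.join(cell for i, cell in enumerate(line.split(delimiter)) if i not in dropped)
        for line in lines
    ]


table = ["a,b,c,d,e", "1,2,3,4,5"]
print(copy_columns(table, [0, 4], ","))    # first and last columns only
print(delete_columns(table, [0, 4], ","))  # what remains after a cut
```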

Pasting Columns

This operation enables you to insert any number of columns into the selected block of text at a particular column position. The text to be inserted does not need to have been cut or copied from another table, but may be any block of multiple lines of text. If the columns to be inserted are shorter (i.e. have fewer lines) than the selected block of text, the remaining lines will be filled with empty cells. You can choose to insert the column or columns before, over (i.e. replacing) or after any particular column.

Paste Columns Dialog
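
The three paste modes, and the padding of short input with empty cells, can be sketched in Python as follows. This is a simplified illustration under my own assumptions (hypothetical function name, 0-based column positions), not a description of how BabelPad does it internally.

```python
def paste_columns(lines, new_lines, position, mode="before", delimiter="\t"):
    """Insert new_lines as column(s) before, over, or after the 0-based column `position`."""
    # Work out how many columns the pasted text holds, so that lines past
    # its end can be padded with the same number of empty cells.
    width = max((text.count(delimiter) + 1 for text in new_lines), default=1)
    filler = delimiter * (width - 1)
    result = []
    for i, line in enumerate(lines):
        cells = line.split(delimiter)
        new_cell = new_lines[i] if i < len(new_lines) else filler
        if mode == "before":
            cells[position:position] = [new_cell]
        elif mode == "over":
            cells[position:position + 1] = [new_cell]
        elif mode == "after":
            cells[position + 1:position + 1] = [new_cell]
        else:
            raise ValueError("unknown mode: %r" % mode)
        result.append(delimiter.join(cells))
    return result


table = ["a\tb", "c\td", "e\tf"]
# The pasted text has only two lines, so the third row gets an empty cell.
print(paste_columns(table, ["X", "Y"], 1, "before"))
```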

Sorting Columns

This operation enables you to sort the selected block of text by the values of one or more columns. See below for the various types of sort that are currently supported by BabelPad. You may specify any number of sort levels, with each sort level using any type of sort (however, be aware that using multiple sort levels may significantly slow down the sort, depending on the data and types of sort involved). To specify the columns to sort on, move the column or columns from the left box to the right box by double-clicking or by clicking on the ">" button. To change the default sort type and/or select sort options double-click on the column in the right box. If the selected block of text includes column headers, check the "Do not sort first line" box.

Sort Columns Dialog
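
Sorting on several levels amounts to comparing a tuple of per-column keys. The Python sketch below shows the idea with a hypothetical sort_by_columns function, where each level pairs a 0-based column index with a key function (standing in for the sort type chosen in the dialog), and has_header plays the part of the "Do not sort first line" box.

```python
def sort_by_columns(lines, levels, delimiter="\t", has_header=False):
    """Sort lines by several columns; `levels` is a list of (column_index, key_function)."""
    if has_header:
        header, body = lines[:1], lines[1:]
    else:
        header, body = [], lines

    def sort_key(line):
        cells = line.split(delimiter)
        # One key per sort level; earlier levels take precedence.
        return tuple(key(cells[col]) for col, key in levels)

    return header + sorted(body, key=sort_key)


table = ["name\tcount", "pig\t10", "cow\t9", "cow\t10"]
# Level 1: column 0 as text; level 2: column 1 as a number.
print(sort_by_columns(table, [(0, str), (1, int)], has_header=True))
```

Sorting column 1 with int rather than str matters here: as text, "10" would sort before "9".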

Contextual Conversion

I have modified the Contextual Conversion dialog ("Contextual Conversion..." from the Edit menu; or Ctrl+Shift+X) to allow you to restrict the scope of any conversion operation to a specific column (at present only a single tab-delimited column can be selected). I have also added a Find and Replace conversion so that you can now, for example, replace all occurrences of "pig" with "cow" in column 3 of the selected block of text.

Contextual Conversion Dialog
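
The effect of a column-restricted Find and Replace can be shown with a small Python sketch (a hypothetical helper, not BabelPad's code). Note that the column index here is 0-based, so column 3 in the dialog corresponds to index 2.

```python
def replace_in_column(lines, column, old, new, delimiter="\t"):
    """Replace `old` with `new` only inside the given 0-based column of each line."""
    result = []
    for line in lines:
        cells = line.split(delimiter)
        if column < len(cells):
            cells[column] = cells[column].replace(old, new)
        result.append(delimiter.join(cells))
    return result


rows = ["pig\tpig\tpig", "a\tb\tguinea pig"]
# Only the third column changes; "pig" in columns 1 and 2 is left alone.
print(replace_in_column(rows, 2, "pig", "cow"))
```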


I have long resisted requests to add sorting functionality to BabelPad, even after one kind correspondent offered me the free use of their implementation of the Unicode Collation Algorithm (UCA). However, support for tabular columns would not be complete without sorting, so reluctantly (given the amount of time and effort required) I have now added the ability to sort lines and to sort tabular columns. I have also added commands to randomize lines and remove duplicate lines.

Sort Lines...

To sort whole lines of text, select the lines to sort (these must be a whole number of lines) and hit the "Sort Lines..." command from the Edit menu, which will open the "Sort Options" dialog. If you want to sort by column, select the lines to sort, open the "Sort Columns" dialog (see above), select the column or columns to sort by, and double-click on a selected column to open the Sort Options dialog.

Sort Options Dialog

The "Sort Options" dialog allows you to specify what type of sort you want to use. I have implemented various types of sort, and may add other types of sort in the future (e.g. CJK radical/stroke sort, and sorting CJK characters by pinyin reading):

  • Unicode Collation Algorithm: Implements the Unicode Collation Algorithm (UCA). The UCA collation is based on the Default Unicode Collation Element Table (DUCET), and in BabelPad you can either use the untailored DUCET for language-neutral collation or use the DUCET tailored for certain languages (as discussed below).
  • CLDR Collation Algorithm: Implements the CLDR Collation Algorithm, which is an extension of the Unicode Collation Algorithm. The CLDR collation is based on the CLDR Root Collation which is a modification of the DUCET that puts script-common characters (whitespace, punctuation, general symbols, some numbers, currency symbols) before script-specific characters. In BabelPad you can either use the untailored root collation for language-neutral collation or use the root collation tailored for certain languages (as discussed below).
  • Windows Default Collation: Simply calls the Windows function CStringT::Collate (or CStringT::CollateNoCase if a case-insensitive sort is requested) for each sort comparison.
  • Unicode Code Point Sort: Sorts by the scalar value of the Unicode characters in the sort string.
  • Hexadecimal: Sorts by the hexadecimal value of the sort string.
  • Numeric: Sorts by the numeric value of the sort string. This should work well for decimal numbers in any script, and for Chinese ideographic numbers, but may not yet work correctly for complex non-decimal numbers (such as used in Cuneiform). And, of course, if the string to be sorted is not a number or combines numbers and text you may get unexpected results.
  • Length: Sorts by the length of the sort string in characters (that is characters, not bytes or code units).
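
The simpler sort types in the list above are easy to imitate in Python, which may help to show how they differ. This is only a sketch: it does not cover the UCA, CLDR or Windows collations, and the numeric example handles decimal digits only (Python's int() accepts decimal digits from any script, but not Chinese ideographic numbers). The example also contrasts code point order with UTF-16 code unit order, where supplementary characters sort before U+E000..U+FFFF.

```python
def utf16_key(s):
    """Big-endian UTF-16 bytes compare in the same order as UTF-16 code units."""
    return s.encode("utf-16-be")


samples = ["\uFF21", "\U00010082", "z"]  # U+FF21, U+10082, U+007A

# Python strings compare by code point, so a plain sort is a code point sort.
by_code_point = sorted(samples)

# In UTF-16, U+10082 is D800 DC82, which sorts before FF21.
by_utf16_unit = sorted(samples, key=utf16_key)

# Hexadecimal: compare the values the strings denote, not their spelling.
by_hex = sorted(["FF", "1A0", "9"], key=lambda s: int(s, 16))

# Numeric: the last item is 12 written with Devanagari digits.
by_number = sorted(["10", "9", "\u0967\u0968"], key=int)

# Length in characters: two supplementary characters count as 2, not 4.
by_length = sorted(["abc", "\U00010082\U00010082", "z"], key=len)

for result in (by_code_point, by_utf16_unit, by_hex, by_number, by_length):
    print(result)
```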

The Unicode Collation Algorithm and CLDR Collation Algorithm both have a default root collation table that is language-neutral, but their root collation tables can be tailored to support the specific collation requirements of any particular locale and/or language. At present BabelPad only supports tailored collation for a few languages, but I will consider supporting other language tailorings on request:

  • Old English (Runes)
  • Old English (Latin)
  • Spanish
  • Tibetan (Tibetan collation is quite complex, and any feedback on my implementation of Tibetan sorting is very welcome)
  • Welsh

The "Sort Options" section of the dialog shows the options that are available for the selected sort type. Most of the options are only applicable to the Unicode and CLDR collation algorithms, and some of them are rather esoteric and may not be comprehensible unless you have studied the specifications for the collation algorithms. One additional option that I have added is to limit the sort string to a specified number of characters ("Maximum number of characters to compare"). If you are sorting long lines of text you may only need to check the first few characters of each line to sort correctly, so limiting the number of characters to compare may improve the speed of the sort in some cases.

The check box at the bottom left corner of the "Sort Options" dialog allows you to define the currently selected sort options as the default for text sorts (not applicable to numeric, hexadecimal and text length sorts).

Conversion to/from UTF Code Units

I have added new functions to convert between Unicode characters and ASCII representations of UTF-8, UTF-16 and UTF-32 code units (e.g. convert U+10082 𐂂 to or from "F0 90 82 82" (UTF-8), "D800 DC82" (UTF-16) or "10082" (UTF-32)). When converting from code units to characters, any characters in curly braces will not be converted (and the braces will be dropped), so, for example, converting "D800 DC82 { = U+10082}" from UTF-16 code units will result in "𐂂 = U+10082". These conversions are also available from the right-click menu under the "Convert" submenu.

UTF Code Units Submenu on the Edit Menu
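
For anyone curious about what these conversions involve, here is a minimal Python sketch that reproduces the examples given above. The function names are invented for illustration, only the UTF-16 direction is shown for converting back to characters, and the handling of curly braces is my reading of the behaviour described in this post rather than BabelPad's actual code.

```python
import re


def to_code_units(text, form="utf-8"):
    """Render text as space-separated hexadecimal code units in the given UTF."""
    if form == "utf-8":
        return " ".join("%02X" % b for b in text.encode("utf-8"))
    if form == "utf-16":
        data = text.encode("utf-16-be")
        units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
        return " ".join("%04X" % u for u in units)
    if form == "utf-32":
        return " ".join("%04X" % ord(ch) for ch in text)
    raise ValueError("unknown form: %r" % form)


def from_utf16_units(text):
    """Convert hex UTF-16 code units to characters, passing {braced} text through."""
    out = []
    # Split the input into braced literals and runs of code units.
    for piece in re.split(r"(\{[^{}]*\})", text):
        if piece.startswith("{") and piece.endswith("}"):
            out.append(piece[1:-1])  # keep the contents, drop the braces
        else:
            units = [int(u, 16) for u in piece.split()]
            data = b"".join(u.to_bytes(2, "big") for u in units)
            out.append(data.decode("utf-16-be"))
    return "".join(out)


print(to_code_units("\U00010082", "utf-8"))   # F0 90 82 82
print(to_code_units("\U00010082", "utf-16"))  # D800 DC82
print(to_code_units("\U00010082", "utf-32"))  # 10082
print(from_utf16_units("D800 DC82 { = U+10082}"))
```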

Insertion of new Bidirectional Control Characters

BabelPad supports Unicode 6.3 by allowing you to easily insert any of the five bidirectional control characters newly encoded in 6.3. From the Insert menu, click on the "Bidirectional Control Characters" submenu, and the five new characters are listed at the bottom of the submenu.

Bidirectional Control Characters Submenu on the Insert Menu

I have also improved the Variation Selectors submenu to allow insertion of all currently-used variation selectors for Ideographic Variation Sequences (VS17 through VS47).

Test Utilities

For this release of BabelPad I have done a lot of work on improving the testability of BabelPad's Unicode functionality and data. Prior to releasing a new version of BabelPad I run various tests, but in the past these have been mostly carried out manually, and can be quite tiresome to perform. I have now automated several key tests, and although they are intended for my personal use it is not inconceivable that some users might find them helpful, so I have exposed them publicly under the "Test Utilities" submenu in the Tools menu.

Test Utilities Submenu on the Tools Menu

[The Generate UCD Data and Generate Full UCD XML Data functions are also available in BabelMap.]

Generate Core UCD Data

The utility to generate the core UCD data produced for each version of Unicode (UnicodeData.txt) has been available in BabelMap for some years, but I have now added it to BabelPad, under the Test Utilities submenu. It generates an on-screen listing of all rows of the core UCD data for any given version of Unicode.

Generate Core UCD Data Utility

I used to run this for each major version of Unicode, and copy the on-screen listing to BabelPad (copying automatically inserts semi-colon field separators), then save to file (with LF instead of CR/LF) and use WinDiff to compare the actual UnicodeData.txt file for that version of Unicode. That was rather tiresome, so I have now added the ability to automatically generate the data for all versions of Unicode and save as individual files in a specified directory. Now all I need to do is press the "Save All..." button, go away and make a cup of tea, then come back and use WinDiff to do a directory compare between the directory where the generated files have been saved to and the directory where the original UnicodeData.txt files are stored.

Comparison of Original and Generated Unicode Data

As you can see from the above screenshot, version 1.1.5 fails the comparison, but this is expected as the original file includes some prefatory blurb before the data rows, has an unexpected blank line after U+FD74, and has some unexpected spaces in the decomposition description of five characters.
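
If you do not have WinDiff to hand, the same kind of comparison can be approximated with Python's standard difflib module. The sketch below is an assumption about how one might do it, not part of my own workflow: it compares two pieces of text line by line after normalizing CR/LF line endings to LF, so that only genuine differences in the data rows are reported.

```python
import difflib


def diff_ignoring_line_endings(original, generated):
    """Return unified-diff lines for two texts, treating CR/LF and LF as equal."""
    a = original.replace("\r\n", "\n").splitlines()
    b = generated.replace("\r\n", "\n").splitlines()
    return list(difflib.unified_diff(a, b, "original", "generated", lineterm=""))


original = "0041;LATIN CAPITAL LETTER A;Lu\r\n0042;LATIN CAPITAL LETTER B;Lu\r\n"
generated = "0041;LATIN CAPITAL LETTER A;Lu\n0042;LATIN CAPITAL LETTER B;Lu\n"
# An empty list means the two versions match apart from their line endings.
print(diff_ignoring_line_endings(original, generated))
```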

Generate Full UCD XML Data

The Generate Core UCD Data tool allows me to validate BabelPad's current and historical core Unicode character properties, but it does not cover many of the other character properties that are used in BabelPad and BabelMap. To ensure that all Unicode data used in BabelPad is correct, I have added a new tool to generate full Unicode data (excluding Unihan properties) in XML format, exactly matching the data provided in the non-Unihan, flat-format XML version of UCD data (ucd.nounihan.flat.xml, available as a zip file). The official XML/UCD data includes 100 properties for each character, many of which are not currently needed for BabelPad, and so writing the tool required the addition of quite a few new functions to produce all the properties, which in the end took a lot longer than I anticipated.

UCD XML Data Generated by BabelPad

Running WinDiff between BabelPad's generated XML/UCD data and the official XML/UCD file shows that the only difference between the two files is the comment that BabelPad adds to the top of the generated document. Originally, after I had completed the utility and ironed out all of the bugs in my code, there were still a number of discrepancies between my generated file and the 6.3 beta version of ucd.nounihan.flat.xml, so I reported the various unexpected idiosyncrasies and apparently incorrect data, and I am pleased to say that the XML/UCD data files were quickly updated to fix the reported defects before 6.3 was released.

Run Normalization Tests

BabelPad supports conversion of text into any of the four standard normalization forms (NFD, NFC, NFKD, and NFKC), and before each new release of BabelPad I use the normalization test file produced by Unicode (NormalizationTest.txt) to validate BabelPad's normalization functionality. This used to be quite time-consuming and troublesome as I would have to manually extract the five columns of data from the test file, run each of the four normalization functions on each of the five columns of data, then run WinDiff on the twenty output files. I finally became fed up with this approach, and have now added a single function that will read the input file, then perform the required normalizations and comparisons on each line of the file, and report the result at the end. Much better!

Normalization Test Output
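
To give an idea of what such a test run involves, here is a minimal Python sketch of the same kind of check. It is not BabelPad's code: it uses Python's built-in unicodedata.normalize as a stand-in for the normalization functions under test (so it reflects whatever version of Unicode the Python interpreter bundles), and the function names are invented. Each data line of NormalizationTest.txt holds five semicolon-separated columns of hexadecimal code points (c1 to c5), and the conformance rules stated in that file say which column each normalization form must produce.

```python
import unicodedata


def parse_columns(line):
    """Turn 'c1;c2;c3;c4;c5; # comment' into five strings."""
    fields = line.split("#", 1)[0].split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in field.split()) for field in fields]


def check_line(line):
    """Return the names of the normalization forms that fail for this line."""
    c1, c2, c3, c4, c5 = parse_columns(line)
    # For each form: (expected result, sources that must normalize to it).
    rules = {
        "NFC": [(c2, (c1, c2, c3)), (c4, (c4, c5))],
        "NFD": [(c3, (c1, c2, c3)), (c5, (c4, c5))],
        "NFKC": [(c4, (c1, c2, c3, c4, c5))],
        "NFKD": [(c5, (c1, c2, c3, c4, c5))],
    }
    failed = set()
    for form, checks in rules.items():
        for expected, sources in checks:
            if any(unicodedata.normalize(form, s) != expected for s in sources):
                failed.add(form)
    return sorted(failed)


def run_tests(lines):
    """Return (lines tested, lines failed), skipping comments and @Part headers."""
    tested = failed = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        tested += 1
        if check_line(line):
            failed += 1
    return tested, failed


sample = [
    "@Part1 # Character by character test",
    "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE",
    "FB01;FB01;FB01;0066 0069;0066 0069; # LATIN SMALL LIGATURE FI",
]
print(run_tests(sample))
```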

Run Unicode Collation Algorithm Tests

As I discuss elsewhere in this post, I have now implemented the Unicode Collation Algorithm and the CLDR Collation Algorithm for sorting in BabelPad. As part of my implementation I added a utility that runs either the UCA test files or the CLDR test files (found under common\uca), using either the "non-ignorable" or "shifted" test file in both cases, and reports the results.

Unicode Collation Algorithm Test Outputs for DUCET

Unicode Collation Algorithm Test Outputs for CLDR

As can be seen, the CLDR tests pass, but the DUCET test for shifted fails for two lines (in fact the DUCET test for non-ignorable only accidentally passes, as BabelPad produces different sort keys than the test expects in the places that the shifted test fails). I believe that the DUCET test files are faulty, and give incorrect sort keys for eight lines of the test relating to Tibetan characters (I reported this before 6.3 was released).