Sunday, 2 December 2007

CJK-B Case Study #1 : U+272F0

The CJK Unified Ideographs Extension B [13MB] block that was added to Unicode/10646 in 2001 comprises 42,711 characters, and it is no secret that there are many problems with this huge collection of mostly quite rare characters, including hundreds of cases of unifiable characters that have been erroneously encoded separately and even a handful of completely duplicate characters. There is enough material to keep a dedicated CJK-B blogger busy for years to come, but I certainly don't want to go down that particular path. However, a recent post by Michael Kaplan, How bad does it need to be in order to be not good enough, anyway?, about discrepancies in Han character stroke counts provided by China on the one hand and Taiwan on the other set me investigating one particular character, and its history is convoluted enough to be worth writing up as a case history.

Michael notes that the character with the greatest stroke count differential is U+272F0 𧋰, which is given 13 strokes in the PRC stroke count data but 19 in the Taiwan stroke count data. This is what the character looks like in both the Unicode Standard and in ISO/IEC 10646 :

Clearly, as all commentators to Michael's post agree, 13 must be the correct count, and there is absolutely no way to get 19 strokes out of it. In short, "19" must be a mistake. However, I did suggest in my first comment, as a rather wild guess (not actually being familiar with the character in question), that perhaps the stroke count of "19" represented a variant form of the character with two "insect" 虫 radicals at the bottom rather than the single radical shown in the Unicode and 10646 code charts. Then yesterday (which will be several days ago by the time I hit the "Publish" button), I obtained a copy of IRG N1381 (no ambiguity about the document number I trust ;-) which is a draft of the "Extension B multi-column code charts". It may come as somewhat of a surprise to my readers to know that although multi-column source glyph charts for the other CJK Unified Ideograph blocks are published in ISO/IEC 10646, there is no published multi-column chart for CJK-B, and that nearly seven years after CJK-B went live a draft multi-column chart for CJK-B (2,670 pages weighing 40MB) has only just been produced (2007-11-13). But anyhow, if you do look at this document you will find that U+272F0 has a single source glyph from Taiwan (T7-496B), and it does indeed have two insect radicals at the bottom instead of one, making it exactly 19 strokes :

So that would seem to explain where the Taiwan count of "19" came from, but the real question is what this glyph form is doing here, when forms of the same abstract Han character with double radicals are not in Unicode terms unifiable with forms of the same character with a single radical ?
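Incidentally, the two competing counts are easy to compare mechanically. In recent versions of the Unihan database the kTotalStrokes field may carry two space-separated values, the first following PRC conventions and the second Taiwan conventions; here is a minimal sketch using a hypothetical data line for U+272F0 (the actual values in current data files may differ) :

```python
# Hypothetical Unihan-style record, for illustration only; the real
# kTotalStrokes entry for U+272F0 may differ in current data files.
line = "U+272F0\tkTotalStrokes\t13 19"

codepoint, field, value = line.split("\t")
counts = value.split()
prc_count = int(counts[0])       # first value: PRC stroke count
taiwan_count = int(counts[-1])   # second value (if present): Taiwan

print(codepoint, prc_count, taiwan_count, taiwan_count - prc_count)
```

A differential of six is suggestive in itself, as six strokes is exactly what a second 虫 radical would add.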

As with all investigations of this sort, the first place to look for some help is the Kangxi Dictionary, where we find that this character exists in quite a few variant forms, although the character we are interested in (U+272F0 with a single radical) does not seem to be mentioned :

Note that although there are only six entries for variant forms of this character, the dictionary does actually use seven variant forms, as the fourth entry above is defined as a corruption of a variant form that does not itself have an entry (⿱𠂤⿰虫虫)—despite the fact that that form is also referred to in the sixth entry.

The character itself, in all its guises, is pronounced fù, and is the first character in the compound word fùzhōng ~螽, which is either a general name for grasshoppers, locusts, etc. or a specific kind of one such insect (also known as 蠜 fán). In early sources the character is written simply as 阜 without an insect radical, as in the 14th song of the Book of Songs (Shi Jing 詩經) :



Yao-yao went the grass-insects,
And the hoppers sprang about.
While I do not see my lord,
My sorrowful heart is agitated.
Let me have seen him,
Let me have met him,
And my heart will then be stilled.

The odes of Shao and the South (No.14)

Incidentally, the earliest Chinese dictionary (written circa 100 AD), Shuowen Jiezi 說文解字, does not have fùzhōng 阜螽, but it does define 蠜 fán as meaning fùfán 𨸏蠜, which is most probably a later transcription error for fùzhōng 𨸏螽 (𨸏 is the Shuowen way of writing 阜). This is supported by the early 11th century edition of the Yu Pian 玉篇 dictionary (originally compiled during the 6th century), where 蠜 fán is indeed defined as fùzhōng 阜螽 :

So now that we know what the variant ways of writing this character are, we can check and see which of them have been encoded in Unicode, and by my reckoning there are eight of them :

Unicode Encodings of U+86D7 and its Variants
[Table columns : Codepoint · Character · Ideographic Description Sequence (IDS) · Radical/Stroke Index · Source References · Kangxi Index]

Of these eight encoded characters, U+86D7 蛗, U+27313 𧌓, U+2731B 𧌛, U+27449 𧑉, U+27499 𧒙 and U+4600 䘀 correspond to the Kangxi Dictionary entries shown above, and U+27482 𧒂 is the character quoted in the Kangxi entry for U+27449. Only U+272F0 𧋰 (as shown in the Unicode code charts) is not in the Kangxi Dictionary. From this it looks as if the Taiwan source glyph for U+272F0 should be the glyph for U+27499 𧒙. So at this point I think we should double-check the source glyphs for all eight characters :

Hmm, everything seems in order, except for the Taiwan source glyph for U+272F0 which does indeed look as if it should be the source glyph for U+27499, but if that is the case what is the correct source glyph for U+272F0 ?
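As a quick sanity check on the inventory itself, the eight code points can be listed and classified with a few lines of Python (the block range used is that of the standard CJK-B allocation, U+20000..U+2A6DF) :

```python
# The eight encoded variants of U+86D7 discussed above.
variants = [0x86D7, 0x4600, 0x272F0, 0x27313, 0x2731B, 0x27449, 0x27482, 0x27499]

for cp in sorted(variants):
    # U+20000..U+2A6DF is the CJK Unified Ideographs Extension B block;
    # the other two characters live in the Basic Multilingual Plane.
    block = "CJK-B" if 0x20000 <= cp <= 0x2A6DF else "BMP"
    print(f"U+{cp:04X}\t{chr(cp)}\t{block}")
```

Only U+4600 and U+86D7 fall in the BMP; all six of the rarer forms were first encoded in CJK-B.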

Going back a step to look at the situation immediately prior to encoding in Unicode/10646, we find that U+272F0 (i.e. ⿳𠂤一虫) does not appear in either the Committee Draft (CD) or the Final Committee Draft (FCD) of ISO/IEC 10646-2:2001 :

ISO/IEC 10646-2:2001 CD and FCD
[Table columns : Final Codepoint · Character · CD Codepoint [SC2 N3393] · FCD Codepoint [SC2 N3442] · CD/FCD Source References]

And nor does the Taiwan source reference T7-496B occur in either the CD or FCD, so it looks as if U+272F0 was only added after the FCD had been voted on (the FCD is the last chance to make any technical changes before the standard is published). According to the Disposition of Comments for the FCD, the IRG extensively reviewed and revised the repertoire of CJK-B, and it was this revised repertoire that was accepted for inclusion in the FDIS ballot (the final ballot, where technical changes can no longer be made). So turning to the IRG document registry, we find that the Review Summary On CJK_Extension B FCD (N744) notes that four ideographs were found to be missing from the FCD, including T7-496B, as per an Errata Report for SuperCJK 10.0 (N738) by Taiwan :

What this is saying is that T7-496B had been proposed for unification with the character that would eventually become U+27499 𧒙, but as the characters are not unifiable T7-496B should be added as a separate character (and in case anyone is wondering what IRG N699 has to say about the character, so am I, but unfortunately this document does not seem to be listed in the IRG document registry). So on the basis of this document (IRG N738) T7-496B was added to the FDIS repertoire at U+272F0, and that's how it managed to sneak into CJK-B at the very last minute. Unfortunately, presumably because T7-496B was originally intended to be unified with U+27499, Taiwan somehow got the glyph for T7-496B wrong in their reference font, with the result that the Taiwan stroke count for U+272F0 is out by six, and the Taiwan source glyph for U+272F0 in the new draft multi-column charts (IRG N1381) is the same as the PRC source glyph for U+27499 (incidentally the correct source glyph for U+272F0 is shown in Super CJK Version 10.2 [32MB] page 675 and Super CJK Version 14.0 [64MB] page 1473).

I guess that once the Taiwan source glyph is corrected and the Taiwan stroke count data is amended it should be the end of the story, but the one thing that nags at me (as is the case with so many characters which only have a single Taiwan source reference) is what is the ultimate source of this character and which texts is it used in ? So if anyone can show me an example of U+272F0 in running text (before it was actually encoded) please let me know.

Addendum 1 [2007-12-10]

In the comments to this post Eric Rasmussen has pointed out that the code charts for the CNS 11643-1992 standard which was the source for U+272F0 are available to download as Appendix G to Ken Lunde's book CJKV Information Processing. The character in question is at Plane 7 Row 41 Column 75 (subtract hex 2020 from the source reference, T7-496B, and convert to decimal to get the row/col position) :

CNS 11643 Plane 7 Row 41 Columns 71-79 (from Ken Lunde's CJKV Information Processing)
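The row/column arithmetic described above (subtract hex 2020 from the two-byte code and read each resulting byte in decimal) can be sketched as a small helper function; the name cns_position is my own :

```python
def cns_position(source_ref):
    """Convert a CNS 11643 source reference such as 'T7-496B' into
    (plane, row, column) by subtracting hex 2020 from the two-byte
    code and reading each byte as a decimal row/column number."""
    plane_part, code_part = source_ref.lstrip("T").split("-")
    plane = int(plane_part)
    code = int(code_part, 16) - 0x2020
    row = code >> 8      # high byte gives the row
    col = code & 0xFF    # low byte gives the column
    return plane, row, col

print(cns_position("T7-496B"))  # (7, 41, 75)
```

0x49 − 0x20 = 41 and 0x6B − 0x20 = 75, which is exactly the Plane 7 Row 41 Column 75 position given above.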

What Eric noticed was that the characters to the left of it and the characters to the right of it all have 13 residual strokes, which strongly suggests that T7-496B should also have 13 residual strokes, and not 7 residual strokes as shown here. That is to say, it is looking more and more likely that T7-496B should in fact be the Taiwan source reference for U+27499 𧒙, and that U+272F0 𧋰 is a phantom character.

But it is probably too late to do anything about it now—if Taiwan were to change the source glyph and request a source reference correction for T7-496B (remapping to U+27499) that would leave U+272F0 an orphan, which is not a happy ending.

Addendum 2 [2007-12-16]

In the comments Matthew Fischer linked to a scanned copy of the official CNS 11643-1992 standard which does indeed show that the glyph for the character at Plane 7 Row 41 Col 75 (i.e. T7-496B) is identical to the glyph for U+27499:

CNS 11643 Plane 7 Rows 32-41 Columns 72-76 (the official version)

This pretty much clinches the argument I think.

Addendum 3 [2011-05-01]

In the Unicode 6.0 code charts, the glyph for T7-496B at U+272F0 has been changed to 19 strokes with a double 虫 radical, which is essentially the same as the glyph for GKX-1100.13 at U+27499, which in my opinion is not an ideal change :


However, in the code charts for the Final Committee Draft of ISO/IEC 10646:2012, which show what the Unicode 6.1 code charts will look like, a new Unicode source reference has been added to both U+272F0 and U+27499, one with a single 虫 radical and one with a double 虫 radical :


I'm not sure that this is a better solution, and I still think that the T7-496B source reference should be moved to U+27499.

Friday, 9 November 2007

Marco Polo and the Universal Script

It is hard to blog in my current circumstances, but as a kind friend sent me a copy of Marco Polo: From Venice to Xanadu by Laurence Bergreen I think I might write something about it, if only to keep the blog alive as its second anniversary approaches (this will also be the sixtieth post and the end of the first cycle). Hopefully I will be able to get back to serious blogging again sometime soon, and I have some interesting Christmas and New Year specials lined up, so do keep tuned in.

Bergreen's book was only published a couple of weeks ago, and as I only received it in this morning's post (three days ago as I hit the 'Publish' button) I have barely had time to page through it. But casually leafing through the book during my lunchtime respite I stopped immediately at page 137, where I was confronted with the very familiar (and entirely uncredited) image of the table of Phags-pa letters that is to be found on Simon Ager's Omniglot page. Simon created the image using my BabelStone Phags-pa Book font so it is of course quite elegant, although it is a shame that he omitted the subjoined letters YA and WA, which are considered to be integral members of the original set of forty-one Phags-pa letters—I leave it as an exercise for the reader to work out which four letters in the table are not part of the original set of letters devised by the Phags-pa Lama [2008-01-12 : Omniglot has now been updated to fix this omission].

Below (at least they were below this paragraph before I started opening the book at random and finding all sorts of niggling errors) are the relevant extracts from the book, with Bergreen's peculiar misuse of the word language where he means script highlighted in bold (it's not as if he does not know what a "script" is, as he uses that word several times as well). The statement that Phags-pa is a constructed language rather than a constructed script is particularly infelicitous, and I suspect that a general reader of these pages would come away with a very confused idea of what the 'Phags-pa language was. For those of my readers who don't already know, it was not a language but a set of letters based on the Tibetan script that could be used to phonetically transcribe various languages, principally Mongolian and Chinese—and as such it has been called the IPA of its time (in retrospect I remember that I argued against calling Phags-pa a "universal script" in my comments to Abecedaria's post on Genghis Khan).

Obviously linguistics and such is not Bergreen's strong point—elsewhere in the book (page 222) he states that "the Chinese written language of the time contained about seven thousand characters", which as everyone knows is a vast underestimate of the true figure. Even though 6,000-7,000 characters seems to be a fairly popular estimate of how many characters a modern educated Chinese speaker recognises, this has little bearing on the number of characters in use 700 years ago. In fact Yuan dynasty vernacular Chinese was particularly rich, and included many characters not used in classical Chinese, some of which are not even to be found amongst the 70,000+ CJK ideographs already encoded in Unicode (just this weekend, whilst researching for my New Year's blog post I came across an unencoded character <⿰土引> meaning "obstruction", which is used frequently in a not very obscure Yuan dynasty treatise on a certain popular pastime).

And whilst I'm in a critical mood, does anyone else think it odd that Bergreen refers to Kublai Khan's capital (modern Beijing) as Cambulac (307 google hits) instead of the more usual Cambaluc (16,700 google hits) throughout the book (see especially page 141) ? The etymology, from Turkic Khanbaliq "city of the Khan", would suggest that Cambaluc is more correct than Cambulac, and the sources I have to hand give Cambaluc (Yule translation), Chanbalu (The Travels of Friar Odoric), Cambalec (Recollections of Travel in the East by John de' Marignolli and The Book of the Estate of the Great Caan by the Archbishop of Soltania) and Cambaliech (Letters of John of Montecorvino and Letter of Andrew of Perugia), so either Bergreen is following some other source or this is just a typo that nobody noticed. But if it is a typo it is indicative of the editorial care that this book is missing.

Perhaps I should refrain from opening the book anymore, as I have just noticed on page 224 that according to Bergreen "Quinsai's work “week” lasted ten days, followed by a single day of rest" (my emphasis), which seems to be saying that there was an eleven-day cycle (ten days work followed by one day rest), when in fact it was a ten day cycle (Chinese xún 旬), apparently with one of the ten days as a rest day. But enough of this petty fault-finding; here then is Bergreen's account of the Phags-pa Lama and his script :

In keeping with his aspiration to become the "universal emperor," Kublai sought to encourage a common written language for all the peoples of his empire. To bring order to the chaos of Mongol communication, he commissioned an influential Tibetan monk named Matidhvaja Sribhadra to devise an entirely new language: an alphabet capable of transcribing all known tongues. Endowed with prodigious intellectual gifts, the monk was said to have taught himself to read and write soon after birth, and could recite a dense Buddhist text known as the Hevajra Tantra from memory by the age of three. As a result of these accomplishments, he was called 'Phags-pa, Tibetan for "Exceptional One." Having arrived at the Mongol court in 1253 as an eighteen-year-old prodigy, 'Phags-pa later found special favor with Kublai Khan's principal wife, Chabi, and came to exert a profound influence over the court.


In 1269, 'Phags-pa, in fulfilment of his commission, presented Kublai Khan with a syllabic alphabet—that is, one in which symbols represent consonants and vowels—consisting of forty-one letters, based on traditional Tibetan. The new written language became known as "square script," owing to the letters' form. It was written vertically, from top to bottom, and from left to right, using these symbols:

The system transcribed the spoken Mongolian tongue with more accuracy than its impoverished predecessors, and even recorded the sounds of other languages, notably Chinese. Kublai Khan proudly designated this linguistic innovation as the language of the Mongol officialdom, and he founded academies to promote its use. The Mongolian Language School opened the same year, and two years later, the National University. 'Phags-pa script appeared on paper money, on porcelain, and in official edicts of the Yuan empire, but scholars and scribes, devoted by sentiment and training to Chinese, Persian, or other established languages, resisted adopting it. Nor did Marco demonstrate familiarity with the new Mongolian idiom.

In 1274, about the time the Polo company arrived in Mongolia, 'Phags-pa retired to the Sa-skya-pa monastery in Tibet, where he died in 1280. By that time, his version of Buddhism was falling into disfavor with the Mongols, and his clever script had failed to catch on, except among a small number of adherents who employed it on ceremonial occasions. It remained a worthy but failed experiment in artificial or constructed language.

Laurence Bergreen, Marco Polo: From Venice to Xanadu pp.136-137.

Aside from the linguistic terminology used by Bergreen, one other possible cause for confusion here is the name of the protagonist, who Bergreen calls Matidhvaja Sribhadra, probably following the lead of Omniglot (which is top of a list of less than two dozen pages on the entire internet that use this name for the Phags-pa Lama).

The Phags-pa Lama's given name in Tibetan is in fact blo-gros rgyal-mtshan བློ་གྲོས་རྒྱལ་མཚན (pronounced lodro gyaltsen), which means "Wisdom Banner of Victory" (blo-gros means "intellect, intelligence, wisdom, understanding", and rgyal-mtshan means "banner of victory"—one of the eight auspicious symbols). In the Mongolian script his name is written phonetically as lodoi ǰaltsan ᠯᠣᠳᠣᠢ ᠵᠠᠯᠼᠠᠨ, and in the History of the Yuan Dynasty it is transcribed using Chinese characters as luógǔluósī jiāncáng 羅古羅思監藏. Matidhvaja is the direct Sanskrit equivalent of this name (mati = blo-gros, and dhvaja = rgyal-mtshan), but Sribhadra is a title meaning "the glorious and excellent", equivalent to the Tibetan term dpal bzang po དཔལ་བཟང་པོ (transcribed in Chinese as bānzàngbo 班藏卜), so calling him "Matidhvaja Sribhadra", as if it were his given and family name, is perhaps not ideal. Later on he became known simply by the Tibetan title 'phags-pa འཕགས་པ "noble or exalted one", which was represented in Chinese characters as bāsībā 八思巴 (and several other variants), and is variously romanized as 'Phags-pa, Phags-pa, ḥP'ags-pa, hPhags-pa, vPhags-pa, Phagspa, Pagspa, Phagsba, Pagsba, Passepa, Phagpa, Phakpa, Pakpa, etc. etc. (I prefer Phags-pa). In contemporary Chinese sources he is more often referred to by the title guó shī 國師 "Imperial Preceptor".

Before moving on I feel unable to avoid ranting some more about sources. It seems to me that in this day and age it is virtually impossible to write a book such as this without referencing the internet, yet when I read the Acknowledgments (pp.363-365), Notes on Sources (pp.367-381) and Select Bibliography (pp.383-391) there does not appear to be a single mention of the internet in general or any particular site that Bergreen found useful. In his notes for the section on Phags-pa (pp.371-372) he provides us with some further reading :

In Khubilai Khan, pages 40-41 and 155-160, Rossabi ably tells the story of 'Phags-pa and his script. See also Jack Weatherford's Genghis Khan and the Making of the Modern World pages 205-206. And Zhijiu Yang's [楊志玖] Yuan shi san lun [元史三論] has an interesting discussion of the languages Marco Polo may have known or used, with reference to M. G. Pauthier's thoughts on the subject. Marco Polo's Asia, by Olschki, also discusses 'Phags-pa script and contains interesting assessments, pro and con, of the Mongol impact on Chinese culture (pages 124-128).

All reputable print sources, I am sure, yet it is clear that Bergreen has learnt much about the Phags-pa script from Simon Ager's Omniglot site (the top site if you google for 'Phags-pa), so why does he not provide a link to Omniglot or to other sites dedicated to the Phags-pa script ? Surely pointers to the internet are far more useful to most readers than dry references to stale books that probably require a visit to the library to access or the expenditure of a fair amount of money. Maybe it's all just authorial snobbishness—Bergreen, like all of us, finds the internet to be an indispensable research tool but would not want any of his readers to know that (preserving the myth that knowledge is something that can only be attained from dusty libraries and by actually travelling to China and Mongolia, and is not something that ordinary folk can simply pick up for free from the internet).

Although it is good to see that Mr. Bergreen considered the Phags-pa script important enough to dedicate a couple of pages to it, I think that he missed an opportunity to discuss in a little more depth than he did how the script was a unifying force across the vast Mongol empire—examples of the script have been found on artefacts from the banks of the Volga (a fragment of birchbark manuscript dating from the time of the Golden Horde Khanate, 1240-1502) to the shores of Japan (relics of Kublai Khan's failed invasions of Japan). Instead Bergreen repeats the tired clichés about the script being a failed experiment that was hardly used despite being mandated as the Mongol empire's official writing system. I, on the other hand, would have emphasized the wide range of uses that the script was put to in both public and private life, for writing languages as diverse as Mongolian, Chinese, Uyghur, Tibetan and Sanskrit.

Monumental Inscriptions

Printed Texts and Manuscripts

Inscriptions on Artefacts

Seals and Seal Impressions

  • Official seals in Phags-pa written in Chinese seal script style (this is just one of dozens of surviving examples)
  • Personal signets with Chinese family name written in Phags-pa script (a few of the very many examples can be seen towards the bottom of this page)
  • Phags-pa seals on Uyghur script documents, such as this quth luq example

But despite the wealth of material that is written in the Phags-pa script and the abundance of artefacts that have Phags-pa inscriptions on them, I still feel very dissatisfied with what we have, and what is now lost gnaws at me constantly. What for instance of Persian, the second language of the Mongol empire ? Even though there is a special Phags-pa letter (U+A865 PHAGS-PA LETTER GGA) that is unused in Chinese, Mongolian, Uyghur, Tibetan or Sanskrit, and appears to be intended to represent a glottal stop for use in writing Persian (i.e. the letter 'ayn ع), not a single example of Persian written in the Phags-pa script is known. I am sure that there must once have been Persian Phags-pa texts, just that none of them have survived.

And what about printed texts ? We have a few fragments of pages from the Mongolian translation of the Subhāṣitaratnanidhi by Phags-pa's uncle Sakya Pandita (translated in English as A Treasury of Aphoristic Jewels or Sakya Pandita's Treasury of Good Advice, and known in Chinese as sàjiā géyán 萨迦格言), including one fragment discovered at Dunhuang as recently as 2001. But Chinese bibliographic sources also refer to a number of Mongolian translations of Chinese Confucian classics and works of history that were printed in the Phags-pa script, including :

But not a single page of any Phags-pa edition of any of these books is known to have survived.

And then when I read about the Franciscans who were spreading their faith in China at about the same time that Marco Polo was there, I feel sure that they would have used the phonetic Phags-pa script in preference to the native ideographic script to simplify written communication in Chinese and to aid in the dissemination of the Christian scriptures. In his first letter from Cathay, dated 1305, Friar John of Montecorvino (soon to be consecrated Archbishop of China) notes how important the translation of the Catholic liturgy into the local language is for him :

I have got a competent knowledge of the language and character which is most generally used by the Tartars. And I have already translated into that language and character the New Testament and the Psalter, and have caused them to be written out in the fairest penmanship they have ; and so by writing, reading, and preaching, I bear open and public testimony to the Law of Christ. And I had been in treaty with the late King George [leader of the Öngüt Turks], if he had lived, to translate the whole Latin ritual, that it might be sung throughout the whole extent of his territory ; and whilst he was alive I used to celebrate mass in his church according to the Latin ritual, reading in the before mentioned language and character the words of both the preface and the Canon.

And in his second (and last surviving) letter, dated 1306, he adds that :

I have now had six pictures made, illustrating the Old and New Testaments for the instruction of the ignorant ; and the explanations engraved in Latin, Tarsic, and Persian characters, that all may be able to read them in one tongue or another.

Here is direct evidence that Franciscan missionaries in Cambaluc (modern Beijing) were printing Christian tracts with the text written out in different languages and scripts. Here the word Tarsic has been generally taken to mean the Old Uyghur script, so it is not quite evidence of the use of the Phags-pa script, but John remained in Cambaluc until he died in about 1328, and the Franciscan presence in China continued until the end of the Yuan dynasty in 1368, so there was a period of over sixty years during which the Franciscans would undoubtedly have been printing books and pamphlets in order to spread the word of God to all the different nationalities of the country. But not a trace of any of these missionary tracts has survived.

But if we turn away from the capital, Cambaluc, to the great haven of Zayton (modern Quanzhou) on the east China coast we do find solid evidence of the use of the Phags-pa script by Christians. A large number of Christian tombstones have been found in Quanzhou, mostly with inscriptions written in the Uyghur language using either the Syriac script (example) or less commonly the Uyghur script (example), and these are without doubt memorials to members of the Assyrian church ("Nestorians") who originally came from the western parts of the empire. Many similar tombstones with Syriac inscriptions have also been found in the west of China and elsewhere in central Asia.

However the inscriptions on four tombstones found at Quanzhou are in the Chinese language written in the Phags-pa script, and the names of the deceased are clearly Chinese. Furthermore, whereas the Syriac and Uyghur tombstone inscriptions generally include overt Christian text, typically starting with "In the name of the Father, the Son, and the Holy Spirit..." in Uyghur, these four tombstones have no explicit Christian content other than the image of the cross and angels, but use exactly the same forms of wording that are found on ordinary non-Christian Chinese tombstones of the time, just with the main inscription spelled out phonetically in the Phags-pa script rather than being written in Chinese characters (it is perhaps worth pointing out that as far as I am aware no non-Christian Phags-pa tombstone inscriptions are known).

Tombstone of Zhu Yanke (1311)

It has always been assumed that these four Phags-pa tombstones are also for members of the Assyrian ("Nestorian") church, but I wonder if it is not more likely that they are associated with the vibrant Franciscan mission in Zayton, which was the only Catholic bishopric outside of Cambaluc. We know of a succession of bishops of Zayton, including Gerard (c.1308-1313), Peregrine (1313-1322), Andrew of Perugia (1322-1332), Nicholas of Paris (c.1332-?) and James of Florence (?-c.1362), and the Phags-pa tombstones are dated 1311, 1314 and 1324 (and one undated), which makes them contemporary with the early Franciscan mission in Zayton. It seems to me that the most plausible explanation for using the Phags-pa script on the tombstones of Chinese Christians is that they were converts of the Franciscan mission, and that as the European friars could not read or write Chinese characters they promoted the Phags-pa script as the medium for reading and writing Chinese amongst their congregation.

And what about Marco Polo ? Did he also know and use the Phags-pa script ? Bergreen says that there is no evidence that he did, and that is true enough, but to my mind it would seem quite probable that in fact he did, as it would have been the easiest and most useful script to learn. So, if European merchants and missionaries in China were using the Phags-pa script, one tantalising question is, did knowledge of this script filter its way back to medieval Europe ? The answer, according to some scholars, notably Hidemichi Tanaka, is yes, there was some knowledge of the Phags-pa script in medieval Europe.

The evidence that has been put forward is the frescoes of Giotto, where some of the figures in the Scenes from the Life of Christ at the Arena Chapel in Padua have robes on which are embroidered Oriental script-like motifs :

Detail from the Birth of Christ (Scenes from the Life of Christ No.1)

Detail from the Adoration of the Magi (Scenes from the Life of Christ No.2)

Detail from the Resurrection of Christ (Scenes from the Life of Christ No.21)

Tibetologist and my mentor on all things relating to Zhang Zhung culture and language, Dan Martin has provided a useful bibliography of publications by Professor Tanaka in the comments to this post by abecedarian Suzanne McCarthy, but as I have not yet had the opportunity to read any of them I cannot say exactly how he interprets these motifs. Jack Weatherford, however, evidently sees actual letters of the Phags-pa script :

In a 1306 illustration of the Robe of Christ in Padua, the robe not only was made in the style and fabric of the Mongols, but the golden trim was painted in Mongol letters from the square Phagspa script commissioned by Khubilai Khan. ... Old Testament prophets were depicted holding scrolls open to long, but undecipherable, texts in Mongol script. The direct allusion to the writing and clothing from the court of Khubilai Khan showed an undeniable connection between Italian Renaissance art and the Mongol Empire.

Jack Weatherford, Genghis Khan and the Making of the Modern World

The National Gallery of Art more cautiously states that these are "pseudo-inscriptions [which] blend letter shapes derived from Arabic and the Mongol Pags-Pa script". I would concur with this description. Although some of the "letters" could be construed as being Phags-pa (there is something which looks quite like Phags-pa letter YA in the picture of the Madonna), they do not follow a vertical flow along a stemline as is the case with both the Uyghur-Mongolian script and the Phags-pa script. Instead they appear to have no fixed orientation, with "letters" jumbled horizontally and vertically into square packets. My impression is that the decorative motif shown here is a simulacrum of oriental writing, and that any resemblance of "letters" in the pattern to individual Phags-pa letters is probably coincidental. Which is not to say that Giotto did not model these motifs on Phags-pa and other scripts that he saw on souvenirs such as bank notes brought back from China, but I do not believe that the motif is sufficiently faithful to say that Giotto incorporated the Phags-pa script into his paintings.

Incidentally, on the script-like motif in the Resurrection of Christ there is something which looks remarkably like the Chinese character 西 "west", which leads me to my readers' challenge for the week : what is the earliest depiction of a Chinese character in a European book or manuscript or painting ? I have no idea what the answer is, so please let me know what you think. For printed books, the earliest example I can find on my shelves (only a copy unfortunately) is the magisterial De Ludis Orientalibus (Oxford, 1694) by Thomas Hyde (1636-1703) which includes the earliest detailed descriptions in Europe of both Shahiludio Chinensium (xiangqi 象棋 or Chinese Chess) and Circumveniendi Ludo Chinensium (weiqi 圍棋 or Go, but written as hoy ki 囬碁 by Hyde). I am sure my readers can go a lot further back than this though.

Sunday, 8 July 2007

Old Hanzi

As I have intimated on more than one occasion, one of the challenges facing Unicode and WG2 is how to successfully encode historic scripts which mostly do not have a standard, well-defined repertoire and which frequently exhibit great variation in character repertoire and glyph forms geographically and/or chronologically. The problems are often exacerbated by the fact that different scholars may have very different opinions on how to encode the script and what names to use for the characters (people often get very hung up on names), and it can be exceedingly difficult to reconcile these differences.

When Unicode was first devised it was intended to accommodate all the scripts of the world in common modern usage, but as can be seen from Joe Becker's 1988 outline of the proposed Unicode standard, it was not envisaged that "obsolete or rare" scripts would be allowed into the Unicode repertoire :

Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2¹⁴ = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally-useful Unicodes.

Joe Becker, Proposal for the Unicode Standard (29th August 1988) page 5

Ten years later, when Unicode had been around for nearly six years, there was still an antipathy in some quarters towards the encoding of rare and historic scripts, as can be seen from this position statement to SC2 by the Netherlands National Body (I just love the line about standardization bodies subsidizing academic research !) :

Market-relevance should guide selection of projects. This does not mean that academic preferences should be ignored, only that standards institutes, depending on industry contributions, cannot be expected to subsidize academic research. If Learned Societies want to raise their agreed conventions to the status of an International Standard, they should take the way of a Fast Track procedure, after having done the development themselves.

SC2 N2881 "Position of the Netherlands National Body (NNI) Regarding Further Development in JTC 1/SC 2" [1997-06-02]

Since the opening up of the supplementary planes this sort of attitude has thankfully become less prevalent, and most people involved in Unicode and 10646 have come to appreciate the importance to the scholarly community of being able to represent historical scripts (or even enigmatic script-like symbols) in electronic form. In many cases the encoding of an historic script is an important step towards greater understanding of the corpus of texts or even the decipherment of the script. As of Unicode 5.1 the following primarily historic scripts will have been encoded, in a large part due to the single-handed dedication and hard work of Michael Everson :

And under consideration for encoding are a number of other historic scripts, including :

Scripts that were devised by a single person at a single point in time, such as Gothic and Phags-pa, generally have a clearcut character repertoire, but it is often difficult to define the character repertoire of scripts that evolved over a long period of time, especially when they developed geographically distinct variants as in the case of Runic. In many cases it is difficult to even clearly define the limits of the script, and there may be arguments amongst experts as to whether different assemblages of inscriptions represent the same or different scripts, or whether a script that evolves over a long period of time should be treated as a single script or a number of distinct scripts in the same lineage. When this is the case, reaching a consensus on how best to encode a script (or even whether a script should be encoded separately) can be quite difficult. Matters are only made more difficult when a proposed script is an historic form of a living script, and users of the living script insist that the characters of the proposed script should be treated as glyph variants of the corresponding characters in the modern script. This was the case when Phoenician was proposed for encoding, and subscribers to the Unicode public mailing list will remember the endless vitriolic arguments between pro-encoders (Phoenician is a separate script in its own right and should be encoded separately from Hebrew) and anti-encoders (Phoenician is just an historical variant of Hebrew that should be dealt with at the font level not the character encoding level).

Which brings me in a roundabout way to "Old Hanzi" 古漢字 (hànzì 漢字 being the Chinese word for a Chinese character or "ideograph", equivalent to the Japanese word kanji). Like other long-lived scripts, the Chinese script is best viewed as a script continuum which evolved by stages to the modern form. Up until a few years ago I think that it was generally assumed within Unicode circles that ancient forms of the Chinese script should be dealt with at the font level rather than at the encoding level, but there was pressure from within China to encode at least the most important early forms of the Chinese script, resulting in an agreement in 2003 to initially encode three important nodes in the Chinese script continuum (the links are to encoding samples for each script prepared by the Chinese National Body) :

Oracle Bone Script

No-one knows for sure when or where the Chinese script was devised, although a number of neolithic sites dating from as early as about 6600 BC up to about 2000 BC have yielded examples of individual symbols carved in isolation on tortoise shells or pottery shards that may or may not be early forms of Chinese characters (personally, I am quite sceptical that any of these marks are directly related to the Chinese script). However, the earliest undisputed stage in the Chinese script continuum that we have evidence for is the Oracle Bone Script (jiǎgǔwén 甲骨文), which was used for divination inscriptions in the royal court of the Shang 商 dynasty at the capital Yin 殷 (near modern Anyang in Henan province) during the period 1300-1050 BC (a few examples of inscribed oracle bones dating to the early Western Zhou period have also been found at a number of other sites).

A question, or more frequently a series of parallel questions, is asked by a specialist diviner, and the answer divined by applying intense localised heat to the shell or bone and observing the pattern of the resultant cracks and/or the sound that the cracks make (the character *pŏk 卜 "to divine" both graphically represents a crack, and onomatopoeically represents the sound of a crack being made). The question (usually prefixed by the cyclic day on which the divination took place and the name of the diviner) as well as the resultant prognostication are then inscribed on the shell or bone, and the object archived, so that thousands of years later archaeologists can unearth them and learn all about the daily ritual of court life in the Shang dynasty. Many thousands of inscribed oracle bones from the ancient capital of the Shang dynasty have been preserved, and they indicate that every aspect of royal life, from toothache to warfare, was governed by a complex cycle of divination and ritual.

An Oracle Bone Inscription on an Ox Scapula

Historical Relics Unearthed in New China 新中國出土文物 (Foreign Languages Press, 1972) plate 37.

The above oracle bone was discovered in 1955 southeast of the site of the ancient capital of Yin, and dates to the third of the five periods into which oracle bone inscriptions are conventionally classified. The inscription itself comprises a compound question inscribed in a single column :


On the cyclic days ding mao [Day 4] and gui hai [Day 60] it was divined: "Should the King enter the city of Shang, and on the cyclic day yi chou [Day 2] should the king not perform the hui rite ?"

The resultant prognostication, 弘吉 "very auspicious", is incised by the crack marks to the left of the question.

Bronze Inscription Script

The next stage in the history of the Chinese script is the Bronze Inscription Script (jīnwén 金文), which is a form of the Chinese script that was used for inscriptions on bronze bells and vessels. A few very short inscriptions on Shang dynasty bronze vessels (mostly little more than the name of the vessel's owner) have been found, but the vast majority of bronze inscriptions date to the succeeding Zhou dynasty (circa 1050 to 256 BC). Because of the long period during which these bronze inscriptions were made there is quite a large variation in the style of characters used. The characters found on the earliest bronze inscriptions from the Shang and early Zhou dynasties are very similar in form to those found on oracle bones (although as would be expected, oracle bone characters are generally more angular and often simpler than the corresponding bronze inscription characters due to the difficulty of inscribing characters on a hard medium such as bone and shell). Bronze inscriptions from the later period are much less closely related to the oracle bone script and are more closely related to the Small Seal script.

The Xing Hou gui 邢侯簋 ...

Chinese Bronzes: Art and Ritual (British Museum, 1987) plate 25.

... and its Inscription

Chinese Bronzes: Art and Ritual (British Museum, 1987) rubbing 10.

This is a very famous example of a ritual vessel for offering food known as a gui 簋 that was unearthed at Luoyang 洛陽 in 1921, and is now at the British Museum. The vessel dates to the early or middle Western Zhou period, and has a quite long and rather difficult to read inscription that seems to record the grant of men to the Marquis of Xing (Xing Hou 邢侯), and is dedicated to his famous ancestor, the Duke of Zhou (Zhou Gong 周公), brother of the first ruler of the Zhou dynasty ("〇" represents an undeciphered or unencoded character) :


Small Seal Script

The Small Seal Script (xiǎo zhuàn 小篆) was adopted by the First Emperor (Qin Shi Huang 秦始皇) as the standard script of the Qin dynasty (221-206 BC). It developed from the characters used for inscriptions during the latter part of the Zhou dynasty, and so many late Zhou bronze inscriptions are written with characters that are much closer in style to the small seal script than to the early Zhou bronze inscription script. By the time that the small seal script had developed the Chinese writing system had adopted the radical/phonetic method of character composition, and so the vast majority of small seal characters correspond directly to a modern character.

The main source for the Small Seal script repertoire will be editions of the Shuowen 說文 dictionary that was compiled by Xu Shen 許慎 in about the year 100. The illustration below shows a page from the table of 540 radicals at the beginning of a modern edition of Xu Shen's dictionary :

Table of Radicals in the Shuowen Dictionary

Shuowen Jiezi 說文解字 (Zhonghua Shuju, 1963) page 3.


You might have thought that the decision to encode these three historic script forms of Chinese would have led to the same level of complex debate and bitter argument that we saw for Phoenician, especially as the result of this decision will be to add many thousands more characters to Unicode, but there hasn't been a squeak. So here are my thoughts about some of the issues involved.

The first thing to realise is that the oracle bone script is quite different from the modern Chinese script in several respects, and that a large percentage of oracle bone characters remain undeciphered or do not correspond directly to any modern character. One of the reasons for this is that the method of composing characters by combining radical and phonetic elements, which is used for the majority of modern Chinese characters, is little used in the oracle bone script, with the result that a character that in the later script is written as a radical/phonetic compound may have been written in the earlier script as a completely different unitary character, which is unrecognisable to modern eyes.

The oracle bone script also makes use of compound characters, in which two separate characters are combined into a single glyph. For example the character jiǎ 甲 in oracle bone script is written as a cross (like 十), but the titles of the royal ancestors Shang Jia 上甲 "Upper Jia" and Xiao Jia 小甲 "Little Jia" are not written as a sequence of two characters shàng plus jiǎ and xiǎo plus jiǎ respectively, as would be expected according to the principles of the modern Chinese script; instead Shang Jia is written as a cross (= 甲) in a square box, and Xiao Jia is written as a cross (= 甲) with a dot in each of the four corners (however Da Jia 大甲 "Big Jia" is written as a sequence of the two characters dà plus jiǎ). Likewise, the titles of the royal ancestors Bao Yi 報乙, Bao Bing 報丙 and Bao Ding 報丁 are each written as a sideways bowl shape (similar to a reversed "C") representing the character bào 報 with the second character of the title (yǐ 乙, bǐng 丙 or dīng 丁) enclosed within. All these compound characters are shown in the oracle bone inscription below, which is also a good example of how it is currently impossible to represent many oracle bone inscriptions accurately, and why most authors working with oracle bone texts write the characters out by hand (those characters that are currently unencoded are represented by "〇", although a couple of them are in the pipeline for CJK-D, including ⿰酉彡 for the character which looks like but isn't 酒) :


Complex numbers may also be written as compound characters, so that for example the numbers "50", "60", "70", "80" and "90" are represented by the characters for "5" (五), "6" (六), "7" (七), "8" (八) and "9" (九) with the character for "10" (十, written as a vertical line in oracle bone script) joined from above.

In a few cases there is a complete disjuncture between the character used in the oracle bone script and the corresponding modern Chinese character. For example in the oracle bone script the character used to represent the first of the twelve earthly branches does not correspond to the modern character for the first earthly branch (zǐ 子) but is written with a completely unrelated glyph of unknown meaning; whereas the oracle bone character used to represent the sixth earthly branch (which is 巳 in modern Chinese) is actually written with the character for 子 "son". Thus 子 is the first earthly branch in the modern Chinese script but the sixth earthly branch in the oracle bone script.

These sorts of issues are the reason why I think that it is not practical to treat the oracle bone script simply as a stylistic variant of the modern Han script. The fact that a majority of oracle bone characters either have no known counterpart in the modern Chinese script or are significantly different from the corresponding modern Chinese character with respect to their glyph composition also makes it very difficult to represent oracle bone script text using CJK Unified Ideographs and a suitable oracle bone style font that maps oracle bone glyphs to the corresponding modern characters (in many cases the mapping just does not exist). However, it has to be said that many artificial modernised versions of oracle bone and bronze inscription characters have been encoded already or are proposed for encoding (there are 367 characters in CJK-C and 1,481 characters in CJK-D that are derived from Yinzhou Jinwen Jicheng Yinde 殷周金文集成引得 [Concordance of Shang and Zhou Dynasty Bronze Inscriptions]). And it could be argued that if the encoding of artificial modernised forms of ancient characters is extended so that all ancient characters can be mapped to an encoded CJK Unified Ideograph then there would be no need to encode the oracle bone script separately. But the counterargument is that artificial modernised forms of ancient characters can only be encoded if they are attested, and not all oracle script or bronze inscription script characters have been or probably ever will be represented with artificial modernised forms (and often it is almost impossible to devise a modern form of an oracle bone script character). Another argument against this approach is that different scholars may modernise a character differently, so that there may be multiple artificial modern forms for the same oracle bone character.

A further problem that scholars of ancient Chinese inscriptions face is that most oracle bone and bronze inscription characters occur in a variety of different glyph forms, often composed using different combinations of component elements, and scholars want to be able to represent these significant glyph differences at the encoding level. Just picking at random the character for "spring" (chūn 春), it occurs in at least five distinct glyph forms :

Each of these five forms of the character is written with a different set of components, and are thus not unifiable according to the rules of CJK unification. It is to be expected that when the oracle bone script repertoire is eventually submitted for encoding it will contain separate characters for each of these forms (and probably also for other less common forms of the character).

I personally think that encoding the oracle bone script separately from the ordinary Han script is the only way for scholars to be able to work with oracle bone script texts, and I am looking forward to seeing it encoded as soon as possible. The same arguments that I have used to support the encoding of the oracle bone script may also be used for the bronze inscription script, although it could be argued that due to the similarity between the characters on early bronze inscriptions and oracle bone inscriptions it would have been better (or at least more economical) to combine the two scripts at the encoding level so as to avoid encoding duplicate versions of characters that are used in both oracle bone and bronze inscriptions.

I haven't said much about the small seal script, mainly because the issues of character identity that affect the oracle bone and bronze inscription scripts largely do not apply to the seal script. There is a high level of correspondence between small seal characters and modern Chinese characters, so I think that it is quite possible to deal with small seal script satisfactorily at the font level. Nevertheless, I don't have any strong objections to seeing the small seal script encoded as a separate script if that is what the user community wants.

Monday, 2 July 2007

CJK Unified Ideographs : To Infinity and Beyond

It has been remarked now and then that Unicode basically consists of an innumerable number of Han thingies to which assorted non-Han detritus has attached itself. And this does seem to be borne out from the figures :

Percentage of Han Characters within the Unicode Repertoire
[Table giving the number of Han script characters (CJK Unified, CJK Compatibility, CJK Radicals and other), the size of the entire Unicode repertoire, and the resulting percentage of Han characters.]

Looking 10 years or so into the future, after the encoding of CJK-C, CJK-D, CJK-E and CJK-F, as well as Old Hanzi, even after taking into account large non-Han scripts such as Egyptian Hieroglyphs (~1,000), Tangut (~6,000) and Jurchen (~1,000), it is likely that the Han percentage will still be around 75% of the entire Unicode repertoire (this is assuming that Old Hanzi are classified as belonging to the Han script, which is not entirely certain).

It could also be said that Han ideographs are the driving force behind Unicode. Without them it is unlikely that there would have been the impetus to develop a 16-bit universal character set in the first place, and now that all the major modern scripts have been encoded the unfinished work on CJKV is the main reason why Unicode and 10646 are still continuing to expand. Once China and the other countries that use Han ideographs have encoded all the characters they need, then I expect that WG2 will cease to function and the ISO/IEC 10646 and Unicode standards will stabilize. This means that there is a limited window of opportunity to get as many as possible of the remaining unencoded scripts encoded.

The Han Script

In Unicode terms the Han script comprises unified ideographs, compatibility ideographs (duplicate versions of unified ideographs encoded for round-tripping compatibility with pre-existing standards) and radicals (Kangxi Radicals and CJK Radicals Supplement), as well as Suzhou numbers ("Hangzhou numbers" as they are called in Unicode), ideographic iteration marks and the ideographic zero (all in the CJK Symbols and Punctuation block).

Not included within the Han script are CJK Strokes and Ideographic Description Characters, which are both classified as "common" by Unicode. This makes sense as other (not yet encoded) scripts such as Tangut, Jurchen and Greater Khitan can all be analysed using ideographic description sequences. The characters of these scripts are also composed from the same or similar stroke elements as Han ideographs, and so "CJK" strokes may be used for these scripts when they are encoded (e.g. character indexes for Tangut and Jurchen dictionaries are often subdivided by stroke type). Indeed, I don't see any reason why those strokes that are peculiar to Tangut characters may not be encoded in the "CJK Strokes" block.
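The structure of an ideographic description sequence can be checked mechanically, since each Ideographic Description Character has a fixed arity: U+2FF2 and U+2FF3 take three operands, the other ten (U+2FF0..U+2FF1, U+2FF4..U+2FFB) take two. A minimal sketch in Python (the function name is my own invention):

```python
# Minimal sketch of an Ideographic Description Sequence (IDS) checker,
# using the Ideographic Description Characters U+2FF0..U+2FFB.
# An IDS is a prefix expression: each binary IDC is followed by two
# operand expressions, each ternary IDC by three.

TERNARY = {0x2FF2, 0x2FF3}            # left-middle-right, above-middle-below
BINARY = set(range(0x2FF0, 0x2FFC)) - TERNARY

def ids_is_wellformed(s: str) -> bool:
    """Return True if s parses as exactly one IDS expression."""
    def parse(i: int) -> int:         # returns index just past one expression
        if i >= len(s):
            raise ValueError("truncated IDS")
        cp = ord(s[i])
        arity = 3 if cp in TERNARY else 2 if cp in BINARY else 0
        i += 1
        for _ in range(arity):
            i = parse(i)
        return i
    try:
        return parse(0) == len(s)
    except ValueError:
        return False

print(ids_is_wellformed("\u2FF0\u9149\u5F61"))   # ⿰酉彡 → True
print(ids_is_wellformed("\u2FF0\u9149"))         # missing operand → False
```

This is only a syntax check, of course; it says nothing about whether the described character actually exists.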

Breakdown of the Han Script by Block (as of Unicode 8.0)

Block Name                               Range         Han Characters  Unicode Versions
CJK Unified Ideographs                   4E00..9FFF    20,950          1.0, 4.1, 5.1, 5.2, 6.1, 8.0
CJK Unified Ideographs Extension A       3400..4DBF    6,582           3.0
CJK Unified Ideographs Extension B       20000..2A6DF  42,711          3.1
CJK Unified Ideographs Extension C       2A700..2B73F  4,149           5.2
CJK Unified Ideographs Extension D       2B740..2B81F  222             6.0
CJK Unified Ideographs Extension E       2B820..2CEAF  5,762           8.0
CJK Compatibility Ideographs             F900..FAFF    472             1.0, 3.2, 4.1, 5.2, 6.1
CJK Compatibility Ideographs Supplement  2F800..2FA1F  542             3.1
Kangxi Radicals                          2F00..2FDF    214             3.0
CJK Radicals Supplement                  2E80..2EFF    115             3.0
CJK Symbols and Punctuation              3000..303F    15              1.0, 3.0, 3.2

Note that the total number of Unified Ideographs (80,388) is twelve more than the sum of the six CJK Unified Ideograph blocks, as twelve characters in the CJK Compatibility Ideographs block are actually unified ideographs.
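The arithmetic behind that figure can be cross-checked in a couple of lines of Python:

```python
# Cross-check of the Unicode 8.0 block totals quoted in the table above.
unified_blocks = {
    "CJK Unified Ideographs":             20950,
    "CJK Unified Ideographs Extension A":  6582,
    "CJK Unified Ideographs Extension B": 42711,
    "CJK Unified Ideographs Extension C":  4149,
    "CJK Unified Ideographs Extension D":   222,
    "CJK Unified Ideographs Extension E":  5762,
}
# Twelve characters in the CJK Compatibility Ideographs block are in
# fact unified ideographs, so the true total is twelve more than the
# sum of the six unified ideograph blocks.
total_unified = sum(unified_blocks.values()) + 12
print(total_unified)   # 80388
```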

There seems to be no end to the growth in numbers of unified ideographs, and if anyone could have imagined when Unicode was first instigated that over 100,000 Chinese, Japanese, Korean, Vietnamese and Zhuang ideographs would eventually be encoded, then perhaps a compositional model of Han ideograph encoding would have been considered; as it is we are stuck, for better or for worse, with a unitary ideograph encoding model (see the Comments to A Brief History of CJK-C for some discussion of this issue), so the only way to represent unencoded Han characters is to add yet more and more unified ideographs to the standard.

But, however many ideographs are encoded, it always seems possible to find yet more to encode. And if you have had many dealings with modern, informal Chinese usage such as letter-writing and sign-writing, you will doubtless have encountered a whole class of Han characters which is largely unencoded, namely the Second Stage Simplifications :

In the above Chinese postage stamp from 1978 you can see (with a strong magnifying glass!) the word "lacquerware" qīqì 漆器 written with ultrasimplified characters (㲺 for 漆, and a rectangle with a vertical stroke for 器). The ultrasimplified form of 器 (a rectangle with a vertical stroke) is scheduled for encoding in CJK-D [what was going to be CJK-D when this post was originally written, but which is now rescheduled as CJK-E because CJK-D has been taken by a couple of hundred "urgent need characters"], together with some other ultrasimplified forms (e.g. hollow 面 and the right-hand side of 能), but no systematic proposal to encode all of the second stage simplifications has yet been made.


CJK-D was originally intended to comprise some 16,000+ ideographs that had not made it into CJK-C (see pages 1-100, 101-200, 201-300 and 301-396). However, just a month ago Taiwan withdrew 6,545 personal name usage characters from CJK-D that were no longer in use (see IRG N1306), so CJK-D has now been reduced in size to about 10,000 characters, plus about fifty more that will be taken out of CJK-C.

The proposed CJK-D collection includes a few characters that I have been patiently waiting to be encoded for many years now, including this one that I had to hack a glyph for when I was compiling and typesetting the Catalogue of the Morrison Collection nearly ten years ago (spot the deliberate error !) :

The character in question (⿰冫玉) is identifiable from context as being a variant form of jué 珏, where the "two dots of water" act as a component iteration mark (i.e. jade doubled), as they also do in U+3560 㕠 (a variant form of shuāng 雙). My great delight in seeing this old friend encoded at last is only matched by my utter dejection when I realise that it is one of the withdrawn Taiwan characters, and with no other source reference it will not be in the proposed CJK-D set after all. [It was eventually encoded in Unicode 8.0 at U+9FD1.]


The CJK-D collection is now closed for business, and new submissions (such as 1,277 Vietnamese characters, 24 Taiwan characters for Minnan and Hakka usage and 2 PRC placename characters) are queuing for inclusion in CJK-E. Work on CJK-E has not yet officially started, so I'm not going to guess at how many characters it may comprise eventually.

[Update: CJK-E was included in Unicode 8.0, with a total of 5,762 characters.]

Zhuang Usage Ideographs

One very large set of ideographs that remains largely unencoded is that of the "Zhuang square characters" fangkuai Zhuangzi 方塊壯字 (known as saw ndip in the Zhuang language) that have (mostly in the past) been used to write the Zhuang language. These Zhuang ideographs comprise a mixture of existing Chinese ideographs borrowed for their meaning or pronunciation, together with many idiosyncratic creations modelled on Chinese ideographs (mostly on the same principles of radical and phonetic that are used for Chinese, but with some more interesting methods of forming characters as well). As Zhuang usage of Chinese and Chinese-style ideographs was never standardized the actual choice of character used to represent any particular syllable varies from manuscript to manuscript, and as can be seen from the first page of the Gu Zhuangzi Zidian 古壯字字典 [Dictionary of Old Zhuang Characters] (Guangxi Minzu Chubanshe, 1989) there are usually multiple ways of writing any given syllable :

[Image courtesy of John Knightley]

Work on a comprehensive encoding proposal for Zhuang usage ideographs has just started at Guangxi University, but there is a huge amount of material to cover, and it will probably take 3-5 years before the complete set of unencoded ideographs has been identified and analysed. The end result may be another 5,000-10,000 characters to be encoded after CJK-E.

Later in the year (or more probably next year) I want to analyse in detail an actual example of a Zhuang poetic text written in sawndip characters, but for my final post of the current blogging season I will be taking a look at Old Hanzi.

[Last updated : 2015-06-17]

Thursday, 28 June 2007

The Secret Life of Variation Selectors

One of the most controversial encoding mechanisms provided by Unicode is that of variation selectors. Some people revile them as "pseudo-coding" whilst others are eager to embrace them as a solution for almost every new encoding issue that arises. Personally I think that they provide an essential mechanism for selecting contextual glyph forms in isolation or overriding the default contextual glyph selection in some complex scripts such as Mongolian and Phags-pa, but I am not keen on their use to select simple glyph variants for aesthetic or epigraphic purposes, and I definitely oppose their use as private glyph identifiers.

Recently, with more and more historic scripts being encoded in Unicode, there have been frequent suggestions that variation selectors should be used to standardize the multitude of stylistic letterforms that are often recognised by scholars of ancient scripts, usually with the rationale that epigraphers and palaeographers need to be able to distinguish variations in glyph forms at the encoding level in order to accurately represent ancient texts. As a textual scholar by training I appreciate how important distinctions at the glyph level can be to the dating and analysis of a text, but I really doubt the need to represent stylistic glyph variants at the encoding level. This is usually more usefully achieved with higher level markup or at the font level. Time and again when discussing the encoding of some ancient script with Dr. X or Professor Y I hear the assertion that the encoded text must be an exact facsimile of the written or inscribed original, to which my response is that encoded text is not intended as a replacement for facsimile drawings and photographs of manuscripts and inscriptions, and that scholars of ancient texts need to work with both photographic images and electronic text, which serve very different purposes. Thus far we have managed to stave off the demand for glyph level encoding of historic scripts using variation selectors, but I predict that before long there will be a proliferation of variation sequences for newly encoded historic scripts.

Fundamental Principles

Variation Selectors are a set of 256 characters, FE00..FE0F (VS1..VS16) and E0100..E01EF (VS17..VS256), that can be used to define specific variant glyph forms of Unicode characters. There are also three Mongolian Free Variation Selectors, 180B..180D (FVS1..FVS3), that behave the same as the generic variation selectors but are specific to the Mongolian script. See The Unicode Standard Section 16.4 for more details.
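The numbering of the two ranges can be captured in a trivial helper (a sketch; the function name is my own invention):

```python
# Sketch: map a variation selector number (1..256) to its code point,
# following the two ranges given above: VS1..VS16 at U+FE00..U+FE0F,
# and VS17..VS256 at U+E0100..U+E01EF.

def vs_codepoint(n: int) -> int:
    if 1 <= n <= 16:
        return 0xFE00 + (n - 1)
    if 17 <= n <= 256:
        return 0xE0100 + (n - 17)
    raise ValueError("variation selectors are numbered 1..256")

print(hex(vs_codepoint(1)))    # 0xfe00
print(hex(vs_codepoint(16)))   # 0xfe0f
print(hex(vs_codepoint(17)))   # 0xe0100
print(hex(vs_codepoint(256)))  # 0xe01ef
```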

A variation selector may be used to define a variation sequence, which comprises a single base character followed by a single variation selector. The base character must not be either a decomposable character or a combining character otherwise normalization could change the character to which the variation selector is appended (as we shall see below this rule was not followed when mathematical variation sequences were first defined).
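The reason for the restriction on decomposable base characters can be demonstrated with Python's unicodedata module: normalization can separate a decomposable base from its variation selector. Here Å (U+00C5) serves purely as an illustrative invalid base; <00C5, FE00> is not a sequence defined by Unicode.

```python
import unicodedata

good = "\u2268\uFE00"   # U+2268 + VS1 : a defined mathematical sequence
bad = "\u00C5\uFE00"    # U+00C5 + VS1 : hypothetical, decomposable base

# U+2268 has no canonical decomposition, so the sequence survives NFD:
print([hex(ord(c)) for c in unicodedata.normalize("NFD", good)])
# ['0x2268', '0xfe00']

# U+00C5 decomposes to A + combining ring above, so after NFD the
# variation selector follows the combining mark, not the base letter:
print([hex(ord(c)) for c in unicodedata.normalize("NFD", bad)])
# ['0x41', '0x30a', '0xfe00']
```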

The most important thing to realise about variation selectors is that they are not intended to provide a generic method for defining glyph variants by all and sundry, but that only those variation sequences specifically defined by Unicode (aka standardized variants) are valid. To put it another way, no conformant Unicode process is allowed to recognise any variation sequence not defined by Unicode (i.e. a conformant Unicode process may not render the base character to which a variation selector is appended any differently to the base character by itself, if the variation sequence is not defined by Unicode).

Of course there is nothing to stop me from defining my own variation sequence, say <0041 FE0F> (A + VS16) to indicate the Barred A that I use to write the "A" of "Andrew", but I should not expect Microsoft or anyone else to support my variation sequence. Although, having said that, Microsoft Vista does support some variation sequences that are undefined by Unicode (as we shall see below), and so I hope no-one is advertising Vista as being Unicode-conformant.

At present (Unicode 5.0) Unicode defines variation sequences for various mathematical characters, as well as for the Mongolian and Phags-pa scripts. These are specified in the file StandardizedVariants.txt (also as HTML with glyph images). It is to be expected that the first Han ideographic variants will be defined in Unicode 5.1.
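As a sketch of how StandardizedVariants.txt can be consumed programmatically, the following assumes the file's semicolon-delimited layout (code point sequence; description; optional context field, with "#" introducing a trailing comment); the sample record should be treated as representative rather than authoritative:

```python
def parse_variant_line(line: str):
    """Parse one record of StandardizedVariants.txt.
    Assumed layout: '<base> <selector>; <description>; <context> # <comment>'
    (the context field is empty except for entries such as Mongolian,
    where it names the shaping environment, e.g. 'isolate')."""
    line = line.split("#", 1)[0]            # discard any trailing comment
    if not line.strip():
        return None                         # blank or comment-only line
    fields = [f.strip() for f in line.split(";")]
    base, selector = fields[0].split()
    return {
        "sequence": (int(base, 16), int(selector, 16)),
        "description": fields[1],
        "context": fields[2] if len(fields) > 2 else "",
    }
```

Applied to a representative record such as "2229 FE00; with serifs; # INTERSECTION", this yields the sequence (0x2229, 0xFE00) with the description "with serifs".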

Mathematical Variation Sequences

Unicode defines variation sequences for 15 characters in the Mathematical Operators block [2200..22FF] and 8 characters in the Supplemental Mathematical Operators block [2A00..2AFF]. In all of these cases the variation selector used is U+FE00 (VS1).

The defined sequences are as follows (the glyph at the end of each entry shows the appearance with VS1 applied *) :

  • <2229 FE00> INTERSECTION : with serifs ∩︀
  • <222A FE00> UNION : with serifs ∪︀
  • <2268 FE00> LESS-THAN BUT NOT EQUAL TO : with vertical stroke ≨︀
  • <2269 FE00> GREATER-THAN AND NOT DOUBLE EQUAL : with vertical stroke ≩︀
  • <2272 FE00> LESS-THAN OR EQUIVALENT TO : following the slant of the lower leg ≲︀
  • <2273 FE00> GREATER-THAN OR EQUIVALENT TO : following the slant of the lower leg ≳︀
  • <228A FE00> SUBSET OF WITH NOT EQUAL TO : with stroke through bottom members ⊊︀
  • <228B FE00> SUPERSET OF WITH NOT EQUAL TO : with stroke through bottom members ⊋︀
  • <2293 FE00> SQUARE CAP : with serifs ⊓︀
  • <2294 FE00> SQUARE CUP : with serifs ⊔︀
  • <2295 FE00> CIRCLED PLUS : with white rim ⊕︀
  • <2297 FE00> CIRCLED TIMES : with white rim ⊗︀
  • <229C FE00> CIRCLED EQUALS : equal sign touching the circle ⊜︀
  • <22DA FE00> LESS-THAN EQUAL TO OR GREATER-THAN : with slanted equal ⋚︀
  • <22DB FE00> GREATER-THAN EQUAL TO OR LESS-THAN : with slanted equal ⋛︀
  • <2A3C FE00> INTERIOR PRODUCT : tall variant with narrow foot ⨼︀
  • <2A3D FE00> RIGHTHAND INTERIOR PRODUCT : tall variant with narrow foot ⨽︀
  • <2A9D FE00> SIMILAR OR LESS-THAN : with similar following the slant of the upper leg ⪝︀
  • <2A9E FE00> SIMILAR OR GREATER-THAN : with similar following the slant of the upper leg ⪞︀
  • <2AAC FE00> SMALLER THAN OR EQUAL TO : with slanted equal ⪬︀
  • <2AAD FE00> LARGER THAN OR EQUAL TO : with slanted equal ⪭︀
  • <2ACB FE00> SUBSET OF ABOVE NOT EQUAL TO : with stroke through bottom members ⫋︀
  • <2ACC FE00> SUPERSET OF ABOVE NOT EQUAL TO : with stroke through bottom members ⫌︀

* If you have a recent version of James Kass's Code2000 installed on your system you should see the difference in appearance between the base character with and without VS1 applied to it (at least it works for me with IE6 or IE7).

Originally, when the set of mathematical variation sequences was encoded in Unicode 3.2, there were two additional variation sequences :

  • <2278 FE00> NEITHER LESS-THAN NOR GREATER-THAN with vertical stroke
  • <2279 FE00> NEITHER GREATER-THAN NOR LESS-THAN with vertical stroke

However, as U+2278 and U+2279 are both decomposable characters, if the variation sequences <2278 FE00> and <2279 FE00> are subjected to decomposition (NFD or NFKD) they will change to <2276 0338 FE00> and <2277 0338 FE00> respectively. When this happens VS1 is now appended to U+0338 COMBINING LONG SOLIDUS OVERLAY, and <0338 FE00> is not a defined variation sequence. Therefore these two variation sequences were undefined in Unicode 4.0 (which I guess answers the question of whether once defined a variation sequence can be undefined or not). However, due to an unfortunate oversight, the last paragraph of Section 15.4 of The Unicode Standard still suggests that VS1 can be applied to U+2278 and U+2279 (although an erratum for this has now been issued).
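The normalization problem is easy to demonstrate with Python's unicodedata module: U+2278 decomposes under NFD, separating the variation selector from the intended base character, whereas a non-decomposable base such as U+2268 passes through unchanged:

```python
import unicodedata

# U+2278 has the canonical decomposition <2276 0338>, so NFD leaves
# VS1 appended to U+0338 COMBINING LONG SOLIDUS OVERLAY instead of
# to the intended base character
assert unicodedata.normalize("NFD", "\u2278\ufe00") == "\u2276\u0338\ufe00"

# U+2268 has no canonical decomposition, so its variation sequence
# survives normalization intact
assert unicodedata.normalize("NFD", "\u2268\ufe00") == "\u2268\ufe00"
```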

Turning to the general reason for defining these variation sequences in the first place, we find almost no explanation for them in The Unicode Standard (section 15.4). We are asked to "see Section 16.4, Variation Selectors, for more information on some particular variants", but turning to Section 16.4 we find no mention of mathematical variation sequences, much less any information on particular variation sequences. It has been explained to me that mathematical variation sequences have been defined because nobody is quite sure whether there is any semantic difference between the variant glyphs or not; if it were certain that there was a semantic difference between the variant glyphs then the variant forms would have been encoded as separate characters, and conversely, if it were certain that there was no semantic difference then variation sequences would not have been defined for them.

A final important point to note is that whilst the glyph form of a variation sequence is fixed, that of the base character when not part of a variation sequence is not fixed, so that the range of acceptable glyph forms for a particular base character may encompass the glyph form of its standardized variant. For example, although the glyph for <2229 FE00> "INTERSECTION with serifs" must have serifs, this does not mean that the character U+2229 must not have serifs, and depending on the font it may or may not have serifs. In fact, there is no way of selecting "INTERSECTION without serifs" at the encoding level.

Mongolian Variation Sequences

Mongolian variation sequences are formed using the special Mongolian Free Variation Selectors 180B..180D (FVS1..FVS3) rather than the generic variation selectors. Unlike mathematical variation selectors, which seem like a kludge, variation selectors are an essential aspect of the Mongolian encoding model. To understand why they are required you need to understand a little bit about the nature of the Mongolian script, in which most letters have a variety of positional, contextual and semantic glyph forms (see The Unicode Standard Section 13.2 for further details). The glyph form that a particular letter assumes depends upon various factors such as :

  • its position in a word (initial, medial, final or isolate)
  • the gender of the word that it occurs in (masculine or feminine depending upon the vowels in the word, so that, for example, completely different glyph forms of U+182D GA are found in the masculine word jarlig "order" and the feminine word chirig "soldier")
  • what letters it is adjoining to (e.g. U+1822 I is written with a single tooth after a consonant but with a double tooth after a vowel; U+1828 NA in medial position has a dot before a vowel but no dot before a consonant; U+1832 TA and U+1833 DA both take the reclining form before a vowel and the upright form before a consonant)
  • whether the word is a native word or a foreign borrowing (e.g. the glyph form of U+1832 TA and U+1833 DA in medial position in a native word depends upon whether the letter is followed by a vowel or a consonant, but in foreign words U+1832 TA is always written with the upright glyph form, whereas U+1833 DA is always written with the reclining glyph form)
  • whether traditional or modern orthographic rules are being followed (e.g. U+182D GA in the word gal "fire" is written with two dots in modern orthography but with no dots in traditional orthography)

The rendering system should select the correct positional or contextual form of a letter without any need for user intervention (i.e. variation selectors are not normally needed in running text to select glyph forms that the rendering system can predict from context), but for foreign words and words written in traditional orthography the user needs to apply the appropriate variation selector to select the correct glyph form where appropriate.
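As a toy illustration of the positional part of this model only (real Mongolian shaping also depends on gender, adjacent letters and native versus foreign orthography, as listed above), a renderer might first classify each letter's position in a word along these lines:

```python
def positional_form(word: str, index: int) -> str:
    """Classify which positional form the letter at `index` takes.
    A toy model: real Mongolian shaping must additionally consult
    vowel harmony (gender), neighbouring letters, and whether the
    word is native or foreign before choosing the final glyph."""
    if len(word) == 1:
        return "isolate"
    if index == 0:
        return "initial"
    if index == len(word) - 1:
        return "final"
    return "medial"
```

Only where these automatic rules cannot predict the correct glyph (foreign borrowings, traditional orthography) does the user need to insert an FVS by hand.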

Variation selectors may also be used to select a particular contextual glyph form of a letter out of context, for example in discussions of the script, where there is a need to display a particular glyph form in isolation.

Not all Mongolian, Todo, Manchu and Sibe letters have glyph forms that need distinguishing by means of variation sequences, but variation sequences are still defined for as many as thirty-eight of the 128 letters in the Mongolian block. In addition to these variation sequences which define contextual glyph forms of letters, there are two variation sequences defined by Unicode where variation selectors are used to select stylistic variants :


With regard to the first of these, I would suggest that U+1880 by itself corresponds to a "candrabindu" (e.g. Devanagari U+0901 and Tibetan U+0F83), whereas the variation sequence <1880 180B> ᢀ᠋ corresponds to an "anusvara" (e.g. Devanagari U+0902 and Tibetan U+0F7E); thus I believe that they are semantically distinct and should have been encoded as separate characters rather than as one character plus a standardized variant. I am not sure about the two forms of the visarga (U+1881 and <1881 180B> ᢁ᠋).

As an aside, one very curious feature about the two characters U+1880 and U+1881 is their names, which both include the unexpected and (in this context) meaningless word "one". My only explanation for this is that at some early stage of the Mongolian character repertoire four characters had been proposed :


But then the "two" characters were redefined as variation sequences of the corresponding "one" character. However, the original names must have been inadvertently left unchanged, with "one" remaining in the name as a fossil reminder of the time when there were two such characters. But this is pure conjecture; I have not been able to find any support for this theory yet.

The problem with the system of Mongolian variation sequences is that nearly eight years after Mongolian was added to Unicode (3.0 in September 1999) the exact shaping behaviour of Mongolian remains undefined. Although Unicode defines a number of standardized variants for Mongolian, a simple list such as this is not sufficient to implement Mongolian correctly. So when Microsoft decided to support Mongolian in its Vista operating system it had to rely on information on shaping behaviour outside of the Unicode Standard, specifically unpublished draft specifications for Mongolian shaping behaviour from China which in places contradict both themselves and the Unicode Standard with regard to the use of variation selectors.

I have to sympathise with Microsoft, which is in a very difficult position in trying to support a script for which the necessary shaping behaviour specification has long been promised but never delivered, but nevertheless it is very unfortunate that Microsoft did not work with Unicode to write the promised Unicode Technical Report on Mongolian at the same time as it developed its Mongolian implementation. As it stands the Vista implementation of Mongolian is essentially an undocumented and private interpretation of Mongolian shaping behaviour. In particular the Vista implementation (Uniscribe and the Mongolian Baiti font) support a number of variation sequences that are not defined by Unicode.

The table below lists those variation sequences supported in the Mongolian Baiti font that are undefined by Unicode but which have the same glyph appearance as another defined variation sequence. The seven undefined isolate variants are identical to another positional form of the letter, and can be selected using the appropriate combination of ZWJ and FVS; I do not believe any of them are true isolate forms which require special variation sequences other than the already defined sequences for when they occur in a non-isolate position. The two undefined initial variants are identical to the medial forms of the same letter that are selected after NNBSP, and the undefined final variant is identical to the medial form of the same letter that is selected before MVS. I do not think that these are true initial or final forms, and any usage in initial or final position (e.g. when discussing a stem or suffix in isolation) can be dealt with using the existing, defined variation sequences and ZWJ where appropriate (e.g. the suffix ACA that occurs after NNBSP can be represented in isolation as <200D 1820 180C 1834 1820>, without requiring a special initial variant). In summary, not only are none of the variation sequences in the table below sanctioned by Unicode, but in my opinion none of them are required anyway.

Undefined Variation Sequences in Mongolian Baiti (the glyph in each entry shows the appearance of the sequence *) :

  • U+1820 FVS2 (isolate) <1820 180C> ᠠ᠌ : same as the defined second final form
  • U+1821 FVS1 (isolate) <1821 180B> ᠡ᠋ : same as the defined second final form
  • U+1822 FVS1 (isolate) <1822 180B> ᠢ᠋ : same as the defined final form
  • U+1824 FVS1 (isolate) <1824 180B> ᠤ᠋ : same as the defined final form
  • U+1826 FVS2 (isolate) <1826 180C> ᠦ᠌ : same as the defined first final form
  • U+182D FVS1 (isolate) <182D 180B> ᠭ᠋ : same as the defined feminine medial form
  • U+1835 FVS1 (isolate) <1835 180B> ᠵ᠋ : same as the defined second medial form
  • U+1820 FVS1 (initial) <1820 180B> ᠠ᠋‍ : same as the defined second medial form (used after NNBSP)
  • U+1826 FVS1 (initial) <1826 180B> ᠦ᠋‍ : same as the defined first medial form (used after NNBSP)
  • U+1828 FVS2 (final) <200D 1828 180C> ‍ᠨ᠌ : same as the defined third medial form (used before MVS)

* You will need to be running under Vista to see what I intend to be seen.

In addition to the undefined variation sequences in the above table, Mongolian Baiti supports several other undefined variation sequences which are even more problematic.

Firstly, the undefined variation sequence <1840 180B> ᡀ᠋ (Mongolian LHA plus FVS1) produces a glyph which is the same as the letter LA with a circle diacritic. This is not a variant glyph form of Mongolian LHA (in origin a ligature of the letters LA and HA) at all, but is a completely separate letter used in Manchu to transliterate Tibetan LHA (discussed in more detail here). Although this letter was inadvertently omitted from the original set of Mongolian/Todo/Manchu/Sibe letters, it is to be encoded as U+18AA MONGOLIAN LETTER MANCHU ALI GALI LHA in Unicode 5.1. All I can say is that trying to represent an unencoded letter by means of an undefined and unsanctioned variation sequence is a shameful hack that should never have been countenanced by a major vendor and founder member of the Unicode Consortium.

Then there are these four variant forms of U+1800 MONGOLIAN BIRGA :

  • <1800 180B> (FVS1) ᠀᠋ "1st variant"
  • <1800 180C> (FVS2) ᠀᠌ "2nd variant"
  • <1800 180D> (FVS3) ᠀᠍ "3rd variant"
  • <1800 200D> (ZWJ) ᠀‍ "4th variant"

And for those without Vista, these are what I am talking about (1st to 4th variants from left to right) :

Although none of these four birga variants are defined in Unicode, they are defined both in Traditional Mongolian Script in the ISO/IEC 10646 and Unicode Standards (UNU/IIST Report No. 170, August 1999) and in Mengguwen Bianma 蒙古文编码 (2000), a book on Mongolian character encoding by Professor Quejingzhabu which closely follows the UNU/IIST report.

I suspect that the main reason why Unicode did not accept these four variation sequences when it accepted all the other variation sequences defined in UNU/IIST Report No. 170 is that the fourth variation sequence uses U+200D ZERO WIDTH JOINER as a pseudo-variation selector because there are not enough Mongolian Free Variation Selectors for more than three variants of the same positional form of a letter. This abuse of ZWJ was no doubt unacceptable to Unicode, and I imagine that as they couldn't accept three of the variants and reject one of them, they rejected them all until a better solution could be found. Unfortunately, instead of working with Unicode to define an acceptable solution Microsoft uncritically implemented something Unicode had already rejected.

Let us just consider for a moment the wisdom of using ZWJ as a pseudo-variation selector in a script that already uses ZWJ to select positional forms of letters (X-ZWJ, ZWJ-X-ZWJ and ZWJ-X select the initial, medial and final forms of the letter X respectively). As the Mongolian birga is a head mark that occurs at the start of text, it is quite likely to be followed by a Mongolian letter (maybe with whitespace between them, maybe not). Is it not just possible that if a letter with positional forms occurs immediately after the fourth birga variant <1800 200D> the ZWJ will have an adverse effect on the following letter ?

Well yes, it is just possible, under Vista at least. In IE7 the ZWJ acts upon both the preceding birga (U+1800) and following letter A (U+1820), producing the 4th birga variant followed by the final form of the letter A; whereas in simpler applications such as Notepad the ZWJ only acts upon the following letter, producing the standard birga glyph followed by the final form of the letter A (Birga 4th variant plus letter A separated by space is on the left and Birga 4th variant plus letter A not separated by space is on the right) :

And in Word 2007 you get weird behaviour, as seen below where exactly the same three sequences <1800 200D 1820> may end up being rendered differently from each other :

This sort of unpredictable rendering behaviour is no doubt why Unicode rejected <1800 200D> as a variation sequence in the first place, and why Microsoft should never have implemented it. Unfortunately there is a lot more that I could say about the rendering behaviour of Mongolian Baiti, but that would be beyond the scope of this post.

Phags-pa Variation Sequences

As with the Mongolian model, variation selectors (always VS1) are used in the Phags-pa script in order to select a particular contextual glyph form. This mechanism is only actually required in order to represent the Sanskrit Buddhist texts that are engraved in Phags-pa script on the walls of the "Cloud Platform" 雲台 at Juyong Guan 居庸關 Pass at the Great Wall north-west of Beijing, in commemoration of the construction of a Buddhist edifice in 1345. On these very important inscriptions (and nowhere else in the extant Phags-pa corpus) the Sanskrit retroflex letters ṭa, ṭha, ḍa and ṇa are represented by reversed forms of the Phags-pa letters TA, THA, DA and NA (following the example of Tibetan), and as such these four reversed letters are encoded separately from their unreversed counterparts (A869..A86C : TTA, TTHA, DDA and NNA). However, as the stem on these four reversed letters is on the opposite side compared with normal, when other letters follow them they also normally take a reversed glyph form to facilitate joining along the stem. These reversed glyph forms are not phonetically or semantically any different from the corresponding unreversed glyph forms, and so are not encoded separately, but are treated as contextual glyph variants. This contextual reversing affects the following six letters :


These letters exhibit the following reversing behaviour :

  • The letter HA reverses after the letter DDA
  • The letter Subjoined YA reverses after the letter NNA
  • The letters I, U and E reverse after the letters TTA, TTHA, DDA or NNA (or after a reversed Subjoined YA or HA), although the letter I does not always reverse after the letter TTHA
  • The letter Small A normally does not reverse after the letters TTA or TTHA, presumably because a reversed Small A is identical to the letter SHA, but may sometimes be reversed after the letter TTHA

The rendering system should automatically reverse the glyph form of the letters Small A, HA, I, U, E and Subjoined YA when they occur immediately after one of the letters TTA, TTHA, DDA or NNA (or a reversed Small A, HA, I, U, E or Subjoined YA), but variation selectors are needed to display the reversed glyph forms of the letters Small A, HA, I, U, E and Subjoined YA in isolation (for example when discussing the letters of the script) and when the default reversing behaviour needs to be overridden, for example in order to represent those occurrences where the letters Small A and I do not reverse after the letters TTA or TTHA in the Juyong Guan inscriptions.

The six variation sequences defined for these purposes are different from any other variation sequence defined thus far, in that they do not define an absolute glyph form but a relative glyph form :

  • <A856 FE00> phags-pa letter reversed shaping small a
  • <A85C FE00> phags-pa letter reversed shaping ha
  • <A85E FE00> phags-pa letter reversed shaping i
  • <A85F FE00> phags-pa letter reversed shaping u
  • <A860 FE00> phags-pa letter reversed shaping e
  • <A868 FE00> phags-pa letter reversed shaping subjoined ya

By "reversed shaping" it means that where the rendering system would normally display an unreversed form of the letter, applying VS1 will cause the glyph to be reversed; and conversely, where the rendering system would normally display a reversed form of the letter (e.g. after the letters TTA, TTHA, DDA and NNA), applying VS1 will cause the glyph to be unreversed. By this means the same variation sequence can be used to display a reversed glyph form of a letter in isolation and to inhibit glyph reversal in running text.
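This relative, toggling behaviour can be modelled in a few lines. The sketch below is my own toy approximation: it covers only the default-from-context rule and the VS1 toggle, not the per-letter exceptions noted above (such as Small A usually not reversing after TTA or TTHA):

```python
# Relevant code points from the Phags-pa block
TTA, TTHA, DDA, NNA = 0xA869, 0xA86A, 0xA86B, 0xA86C   # inherently reversed letters
REVERSIBLE = {0xA856, 0xA85C, 0xA85E, 0xA85F, 0xA860, 0xA868}  # SMALL A, HA, I, U, E, SUBJOINED YA
VS1 = 0xFE00

def is_reversed(text, i):
    """Toy model of 'reversed shaping': a reversible letter defaults to
    the reversed glyph when the preceding element is TTA/TTHA/DDA/NNA
    or is itself reversed, and a following VS1 toggles that default."""
    cp = text[i]
    if cp in (TTA, TTHA, DDA, NNA):
        return True
    if cp not in REVERSIBLE:
        return False
    default = i > 0 and is_reversed(text, i - 1)
    toggled = i + 1 < len(text) and text[i + 1] == VS1
    return default != toggled          # XOR: VS1 flips the contextual default

# TTHI : the letter I reverses after TTHA by default ...
assert is_reversed([TTHA, 0xA85E], 1)
# ... and a following VS1 inhibits the contextual reversal
assert not is_reversed([TTHA, 0xA85E, VS1], 1)
```

Note how the same sequence <A85E FE00> that inhibits reversal here would, in isolation, select the reversed glyph, which is exactly the relative behaviour described above.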

As an example, the Sanskrit word dhiṣṭhite is transliterated as DHISH TTHI TE in the Phags-pa inscriptions at Juyong Guan, but in some cases the letter I of TTHI is reversed and in some cases it is not. These two versions of the word may be represented as :

  • <A84A A85C A85E A85A 0020 A86A A85E 0020 A848 A860> ꡊꡜꡞꡚ ꡪꡞ ꡈꡠ (letter I contextually reversed by the rendering system)
  • <A84A A85C A85E A85A 0020 A86A A85E FE00 0020 A848 A860> ꡊꡜꡞꡚ ꡪꡞ︀ ꡈꡠ (VS1 inhibits contextual reversing of letter I)

Whereas in this context VS1 inhibits contextual reversing of letter I, we can use the same variation sequence <A85E FE00> in isolation to produce the reversed glyph form of the letter I : ‍ꡞ︀ (preceded by ZWJ to get the final reversed glyph form).

[See Phags-pa Shaping Behaviour for more examples]

Han Ideographic Variation Sequences

For a long time there has been a demand from some quarters for a mechanism to allow vendors and CJK users to register glyph variants of Han ideographs, and in order to accommodate this demand Unicode has recently established an Ideographic Variation Database (IVD). Unlike variation sequences for other scripts, which are individually defined by Unicode, the IVD provides a registration mechanism so that sets of Ideographic Variation Sequences (IVS) can be registered by the "user community" on demand. As long as certain rules are followed and a fee is paid (which Unicode may waive if it so desires) then Unicode (as the registration authority) will accept any set of glyph variants that anybody wants to register, without any scrutiny of the appropriateness of the proposed glyph variants -- there is a 90 day public review period, but in my opinion that's just an excuse to move responsibility away from the UTC.

The Variation Selectors Supplement, comprising 240 variation selectors (VS17-VS256), was specially encoded in anticipation of a large number of Han ideographic variants being defined, and ideographic variation sequences are intended to only use these 240 supplementary variation selectors. The door has been left open to define even more variation selectors if 240 variation sequences for a single CJK unified ideograph prove too few.

The first, and so far only, IVD registration application has come from Adobe, who have requested the registration of the entire set of kanji glyphs in their Adobe-Japan1 collection. This is a set of glyphs used by Adobe for fonts for the Japanese market, and includes 14,664 kanji glyphs. Adobe wants to be able to uniquely refer to each of these glyphs at the encoding level (don't ask me why), but as many of the glyphs are from a Unicode perspective unifiable variants it can only do so by means of variation sequences.

Seven of the glyphs in the Adobe-Japan1 collection do not correspond to encoded ideographs, and so have been fast-tracked (by-passing IRG) for encoding in Unicode 5.1 at 9FBC..9FC2. The remaining 14,657 glyphs have been analysed as mapping to a total of 13,262 encoded ideographs (one glyph, CID+19071, maps to both U+29FCE and U+29FD7 !) :

  • 12,040 characters mapped to 1 glyph
  • 1,084 characters mapped to 2 glyphs
  • 120 characters mapped to 3 glyphs
  • 14 characters mapped to 4 glyphs
  • 1 character mapped to 5 glyphs (U+97FF 響)
  • 1 character mapped to 6 glyphs (U+6168 慨)
  • 1 character mapped to 8 glyphs (U+908A 邊)
  • 1 character mapped to 15 glyphs (U+9089 邉)

From this one would have thought that variation sequences would only be required for those 1,222 ideographs that map to more than one glyph in the Adobe set, and even then perhaps only for those glyphs that differ from the standard form of the ideograph, yielding at most 2,618 variation sequences. However, for purposes of forward compatibility (if additional Adobe glyphs are mapped to characters that currently only map to a single Adobe glyph), and in order to be able to reference all glyphs in the set as a variation sequence (don't ask me why), a total of 14,658 variation sequences are being put forward for registration (i.e. a unique variation sequence for every glyph in the Adobe-Japan1 collection, other than the seven unencoded characters, although I presume redundant variation sequences for those seven characters will be added once they are encoded). For the vast majority of the 12,040 ideographs for which only a single ideographic variation sequence is specified, the glyph for the IVS has the same appearance as the standard glyph form of the character, i.e. they are variation sequences that define a glyph that is not a variant of the base character and for which there is no need to distinguish it from any other variant glyph forms.
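The arithmetic behind these figures can be checked directly from the distribution listed above (the total of 14,658 sequences exceeds the 14,657 glyphs because the one glyph that maps to two characters gives rise to two sequences):

```python
# Distribution of Adobe-Japan1 kanji glyphs per encoded ideograph,
# taken from the list above: {glyphs mapped: number of characters}
distribution = {1: 12040, 2: 1084, 3: 120, 4: 14, 5: 1, 6: 1, 8: 1, 15: 1}

characters        = sum(distribution.values())                    # ideographs mapped to
sequences         = sum(n * c for n, c in distribution.items())   # one IVS per character-glyph pair
multi_glyph_chars = sum(c for n, c in distribution.items() if n > 1)
multi_glyph_seqs  = sum(n * c for n, c in distribution.items() if n > 1)

assert characters == 13262
assert sequences == 14658          # one more than the 14,657 glyphs (CID+19071 counts twice)
assert multi_glyph_chars == 1222   # ideographs mapping to more than one glyph
assert multi_glyph_seqs == 2618    # the most that would actually need distinguishing
```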

At this point I start seriously worrying about the implications of the Adobe approach to ideographic variation sequences. What if there is an "Adobe-Japan2" collection or an "Adobe-China" collection or an "Adobe-Korean" collection ? Would these collections also require the definition of many thousands of ideographic variation sequences that are not distinguishable from the standard glyph form of the base ideograph ? What if other vendors such as Microsoft or Apple decide to follow Adobe's lead, and define unique ideographic variation sequences for tens of thousands of font glyphs ? As the whole point of the IVD is to ensure that a given variation sequence is used in at most one collection, the same variant (or not-so-variant) glyph in multiple collections will inevitably be defined with different variation sequences, once for every collection it occurs in. It seems to me that the end result of all this will be that many thousands of ideographs will have multiple variation sequences associated with them (one per collection) and that the glyphs for each variation sequence will be practically indistinguishable from each other and from the standard glyph form of the base ideograph.

Looking at the glyphs of the Adobe-Japan1 collection it is evident that in very many cases where a single base ideograph has multiple variation sequences defined, the difference between glyphs is very slight (often just minor differences in stroke formation), and it is hard to see how there could be any practical need to distinguish them at the text level. In some cases the differences between "variant" glyphs is microscopic; for instance, can you differentiate the VS17 and VS18 forms of U+55A9 ?

On the other hand, sometimes the glyph variation is too extreme. One major problem with the collection that was identified during the review period is that the variation sequences for a single ideograph sometimes represent glyph forms that are not unifiable according to the Annex S rules, in particular there are quite a few cases where a Japanese simplified form which has not been encoded is defined by means of a variation sequence as a variant of the encoded non-simplified form. A single example from page 4 should suffice :

  • <56C0 E0100> (VS17) 囀󠄀 [4454] = ⿰口轉
  • <56C0 E0101> (VS18) 囀󠄁 [14116] = ⿰口轉
  • <56C0 E0102> (VS19) 囀󠄂 [20096] = ⿰口転

It has now been clarified that the glyph for any ideographic variation sequence must be within the range of unifiable glyph variation for the base ideograph, and glyphs that would not be unified according to the unification rules may not be treated as variants of the same base ideograph. The text of UTS 37 will be amended accordingly, and a revised list of variation sequences for the Adobe-Japan1 collection will be issued. This means that there will probably be about fifty more characters from the Adobe-Japan1 set that will need encoding, and I have no doubt that, as with the previous seven, they will be fast-tracked (bypassing IRG), and tagged on to the end of the CJK and CJK-A blocks (probably just enough room for them in the BMP).

When I first read UTS 37 I thought that the purpose of the IVD was to provide CJKV users with a mechanism to define glyph variants that, although unifiable from a character-encoding perspective, were required to be distinguished at the text level in certain circumstances, most obviously when used as personal or place names. But having reviewed the Adobe-Japan1 submission it seems that I must have been mistaken. It is evident that this collection of 14,658 ideographic variation sequences has no practical benefit for anyone other than Adobe, will never be supported by anyone other than Adobe, and will never be used in text by the general CJKV user community. In my opinion the collection is required purely to enable Adobe to uniquely identify their font glyphs internally, and not for information interchange, which I personally think is an abuse of ideographic variation sequences. But more than that, this very first IVD registration is going to be seen as a model for what the IVD is intended for, and I am afraid that it will only serve to put people off registering sensible and useful ideographic variation sequences, for example for the many thousands of Taiwan personal name usage characters, as well as dictionary usage variants. We shall just have to wait and see ...