
How can I recognize from the 32-bit value of a Unicode character whether it is a Chinese, Korean, or Japanese character?

April 26, 2017
Tags: bit, character, Chinese, Japanese, Korean, recognize, Unicode, value



1 Answer


It’s basically impossible and largely meaningless. It’s the equivalent of asking if “a” is an English letter or a French one. There are some characters where one can guess based on the source information in the Unihan Database that it’s traditional Chinese, simplified Chinese, Japanese, Korean, or Vietnamese, but there are too many exceptions to make this really reliable.

The phonetic data in the Unihan Database should not be used for this purpose. A blank in the phonetic data means that nobody’s supplied a reading, not that a reading doesn’t exist. Because updating the Unihan Database is an ongoing process, these fields will be increasingly filled out as time goes on, but they should never be taken as absolutely complete. In particular, there are obscure characters where it is known that there is a reading, but since the character does not occur in standard dictionaries, we are unable to supply it (e.g., 䃟 U+40DF in Cantonese).

A better solution is to look at the text as a whole: if t