How can I recognize from the 32-bit value of a Unicode character whether it is a Chinese, Korean, or Japanese character?
It’s basically impossible and largely meaningless. It’s the equivalent of asking if “a” is an English letter or a French one. There are some characters where one can guess based on the source information in the Unihan Database that it’s traditional Chinese, simplified Chinese, Japanese, Korean, or Vietnamese, but there are too many exceptions to make this really reliable. The phonetic data in the Unihan Database should not be used for this purpose. A blank in the phonetic data means that nobody’s supplied a reading, not that a reading doesn’t exist. Because updating the Unihan Database is an ongoing process, these fields will be increasingly filled out as time goes on, but they should never be taken as absolutely complete. In particular, there are obscure characters where it is known that there is a reading, but since the character does not occur in standard dictionaries, we are unable to supply it (e.g., 䃟 U+40DF in Cantonese). A better solution is to look at the text as a whole: if t
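As a rough illustration of judging the text as a whole rather than a single code point, the sketch below counts how many characters fall into the kana, hangul, and Han blocks and guesses from the proportions. This is an assumption about one workable heuristic, not a method taken from the answer above: the function names, the block ranges chosen (Basic Multilingual Plane only), and the 10% threshold are all hypothetical and would need tuning on real data.

```python
from collections import Counter

# Code point ranges chosen for this sketch (BMP blocks only; an assumption,
# not an exhaustive list of every CJK-related block in Unicode).
RANGES = {
    "kana": [(0x3040, 0x309F), (0x30A0, 0x30FF)],    # Hiragana, Katakana
    "hangul": [(0xAC00, 0xD7A3), (0x1100, 0x11FF)],  # Hangul syllables, Hangul Jamo
    "han": [(0x4E00, 0x9FFF), (0x3400, 0x4DBF)],     # CJK Unified Ideographs, Extension A
}


def classify_char(ch):
    """Return 'kana', 'hangul', 'han', or None for a single character."""
    cp = ord(ch)
    for name, spans in RANGES.items():
        for lo, hi in spans:
            if lo <= cp <= hi:
                return name
    return None


def guess_language(text, threshold=0.1):
    """Guess from script proportions; the threshold is an arbitrary assumption."""
    counts = Counter(filter(None, map(classify_char, text)))
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    if counts["kana"] / total >= threshold:
        return "probably Japanese"
    if counts["hangul"] / total >= threshold:
        return "probably Korean"
    # Han characters alone do not identify a language.
    return "Han only (language undetermined)"


if __name__ == "__main__":
    for sample in ["これは日本語のテキストです", "한국어 텍스트", "中文文本"]:
        print(sample, "->", guess_language(sample))
```

Note that text consisting only of Han characters is deliberately left undetermined: as the answer explains, a single ideograph (or a run of them) does not by itself belong to one language.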