
What are the porting issues I need to watch out for with UTF-8?

April 26, 2017 | porting, utf-8, watch


1 Answer


If you port to UTF-8, all code that does not try to interpret byte values greater than 0x7F will continue to work, because ASCII and UTF-8 are identical for byte values up to 0x7F. However, watch for anything that truncates strings or buffers at places other than '\n' or '\0', or at a space or a syntax character from the ASCII range. Truncation based on character counting is inherently dangerous, because UTF-8 is a multi-byte encoding: a cut at a fixed byte offset can land in the middle of a character. For the same reason, watch out for jumps into the middle of a string.

Many kinds of inner-loop code do not need to be aware of the multi-byte nature of UTF-8. For example, a simple copy operation like: while (--len && (*d++ = *s++)); if (!len) *d = 0; behaves the same way for ASCII and for UTF-8. Note that if len runs out before the terminating '\0' is copied, the cut can still fall inside a multi-byte character.
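To illustrate the truncation problem, here is a minimal sketch of a byte-limited cut that backs up to a character boundary. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx. The function names utf8_safe_cut and utf8_copy_truncated are hypothetical helpers made up for this example, and the code assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the largest length <= max_bytes at which
 * the string s (len bytes long) can be cut without splitting a multi-byte
 * sequence. Assumes s holds valid UTF-8. */
static size_t utf8_safe_cut(const char *s, size_t len, size_t max_bytes)
{
    assert(s != NULL);
    size_t cut = len < max_bytes ? len : max_bytes;
    if (cut == len)
        return cut;               /* nothing is cut off, so nothing is split */

    /* If the first byte that would be dropped is a continuation byte
     * (10xxxxxx), the cut falls inside a character: back up to its lead byte. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}

/* Hypothetical helper: copies src into dst (dst_size bytes, including the
 * terminator), truncating only at a character boundary. */
static void utf8_copy_truncated(char *dst, size_t dst_size, const char *src)
{
    if (dst_size == 0)
        return;
    size_t n = utf8_safe_cut(src, strlen(src), dst_size - 1);
    memcpy(dst, src, n);
    dst[n] = '\0';
}
```

For example, with the six-byte string "h\xc3\xa9llo" ("héllo"), a limit of 2 bytes would split the two-byte "é", so the sketch cuts after 1 byte instead. The same boundary test can be used to resynchronize after a jump into the middle of a string. This only protects code point boundaries; it does not keep combining sequences together.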