Standard Library Functions - C Interview Questions VIII


Q. How can I manipulate strings of multibyte characters?
Say your program sometimes deals with English text (which fits comfortably into 8-bit chars with a bit to spare) and sometimes Japanese text (which needs 16 bits to cover all the possibilities). If you use the same code to manipulate either country's text, will you need to set aside 16 bits for every character, even your English text? Maybe not. Some (but not all) ways of encoding multibyte characters can store information about whether more than one byte is necessary.
mbstowcs ("multibyte string to wide character string") and wcstombs ("wide character string to multibyte string"), both declared in <stdlib.h>, convert between arrays of wchar_t (in which every character takes the same fixed number of bytes, commonly two or four depending on the platform) and multibyte strings (in which individual characters are stored in one byte if possible).
There's no guarantee your compiler can store multibyte strings compactly. (There's no single agreed-upon way of doing this; the encoding actually used depends on the current locale's LC_CTYPE setting.) If your compiler can help you with multibyte strings, mbstowcs and wcstombs are the functions it provides for that.
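As a minimal sketch of how the two functions fit together, the helper below converts a multibyte string to wide characters and back again. The name round_trip and the 64-element buffer are illustrative assumptions, not part of the standard library. The caller is also assumed to have selected a locale already (for example with setlocale(LC_ALL, "")). In the default "C" locale only single-byte characters are guaranteed to convert.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): converts the multibyte
 * string `src` to wide characters and back, writing the result into
 * `out`, which holds `out_size` bytes. Returns the number of wide
 * characters converted, or (size_t)-1 on failure. */
size_t round_trip(const char *src, char *out, size_t out_size)
{
    wchar_t wide[64];                      /* illustrative fixed capacity */
    size_t capacity = sizeof wide / sizeof wide[0];

    /* mbstowcs returns the number of wide characters written, not
     * counting the terminator, or (size_t)-1 on an invalid sequence. */
    size_t n = mbstowcs(wide, src, capacity);
    if (n == (size_t)-1 || n >= capacity)
        return (size_t)-1;                 /* invalid, or no room for L'\0' */

    /* wcstombs returns the number of bytes written, not counting the
     * terminator, or (size_t)-1 if a character cannot be represented. */
    size_t bytes = wcstombs(out, wide, out_size);
    if (bytes == (size_t)-1 || bytes >= out_size)
        return (size_t)-1;                 /* failed, or output truncated */

    return n;
}
```

Checking the return value against the buffer capacity matters because both functions stop silently when the destination fills up, leaving the result without a terminating null.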

Q. What are multibyte characters?
Multibyte characters are another way to make internationalized programs easier to write. Specifically, they help support languages such as Chinese and Japanese that could never fit into eight-bit characters. If your programs will never need to deal with any language but English, you don't need to know about multibyte characters.
Inconsiderate as it might seem, in a world full of people who might want to use your software, not everybody reads English. The good news is that there are standards for fitting the various special characters of European languages into an eight-bit character set. (The bad news is that there are several such standards, and they don't agree.)
Go to Asia, and the problem gets more complicated. Some languages, such as Japanese and Chinese, have more than 256 characters. Those will never fit into any eight-bit character set. (An eight-bit character can store a number between 0 and 255, so it can have only 256 different values.)
The good news is that the standard library has the beginnings of a solution to this problem. <stddef.h> defines a type, wchar_t, that is guaranteed to be wide enough to store any character in the largest character set the implementation's locales support. Sixteen bits was long considered enough, and wchar_t is 16 bits on some platforms (often the same size as a short), but others make it 32 bits so that it can hold any Unicode code point. Either way, it's better to trust that the compiler vendor got wchar_t right than to use short and get in trouble when the sizes differ.
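Because the width of wchar_t varies between implementations, it is safer to ask the compiler than to assume. A small sketch follows, where wchar_bits is a hypothetical helper name:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: reports how many bits this implementation
 * uses for wchar_t. The answer is implementation-defined. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}
```

Printing the result with printf("%zu\n", wchar_bits()) shows the value for the platform at hand.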
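The mblen, mbtowc, and wctomb functions, all declared in <stdlib.h>, work on one character at a time: mblen reports how many bytes the next multibyte character occupies, mbtowc converts one multibyte character to a wchar_t, and wctomb converts one wchar_t back to its multibyte form. See your compiler manuals for more information on these functions.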
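One common use of mblen is counting characters rather than bytes. The sketch below walks a multibyte string one character at a time. The name mb_char_count is a hypothetical helper, and the result depends on the locale in effect when it is called.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the characters (not bytes) in the
 * multibyte string `s` under the current locale. Returns (size_t)-1
 * if the string contains an invalid multibyte sequence. */
size_t mb_char_count(const char *s)
{
    size_t count = 0;
    size_t remaining = strlen(s);          /* bytes left to examine */

    (void)mblen(NULL, 0);                  /* reset any shift state */

    while (remaining > 0) {
        /* mblen returns the byte length of the next character,
         * 0 at a null character, or -1 for an invalid sequence. */
        int len = mblen(s, remaining);
        if (len <= 0)
            return (size_t)-1;
        s += len;
        remaining -= (size_t)len;
        count++;
    }
    return count;
}
```

For plain ASCII text the count equals strlen(s). The two diverge only when the locale's encoding uses more than one byte for some characters.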
