Saturday, April 15, 2023

Unicode and Windows: A Toxic Love Story

 


 

Intro

Unicode string support in Windows is a topic that has intrigued many developers over the years. When Microsoft was tasked with developing a new operating system, they initially planned to create a new version of OS/2 in collaboration with IBM. However, the success of Windows 3 caused a last-minute change of plan. The team behind the new OS decided to shift their API from the extended OS/2 API to an extended version of the Windows API. This change resulted in a strange API, particularly when it came to string handling in the new OS known as Windows NT 3.1.

Windows NT uses Unicode internally: the operating system's own private functions all work on 16-bit Unicode strings (UCS-2 at the time, later UTF-16). The Windows 3.0 API, on the other hand, used ANSI strings, where each character was an 8-bit char. To ensure that programs written for Windows 3.0 could also run on Windows NT, the team behind the new OS tacked the old API as-is onto NT. However, they had to write code to convert the ANSI strings used in Windows 3.0 into Unicode strings for API calls into NT, and then convert them back into ANSI when returning results to the program.
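Conceptually, each ANSI entry point is a thin thunk around its Unicode twin. The sketch below is not the actual Windows source; MyFunctionA and MyFunctionW are hypothetical stand-ins for any such A/W pair, but MultiByteToWideChar is the real conversion API (error handling omitted for brevity):

#include <windows.h>
#include <stdlib.h>

int MyFunctionW(const wchar_t* text); // the "real" Unicode implementation

int MyFunctionA(const char* ansiText)
{
    // The first call asks how many wide characters the converted string
    // needs, including the terminating NUL; the second call converts.
    int len = MultiByteToWideChar(CP_ACP, 0, ansiText, -1, NULL, 0);
    wchar_t* wide = (wchar_t*)malloc(len * sizeof(wchar_t));
    MultiByteToWideChar(CP_ACP, 0, ansiText, -1, wide, len);

    int result = MyFunctionW(wide); // forward to the Unicode version
    free(wide);
    return result;
}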

For new programs that would support Unicode, duplicate functions were written that took Unicode strings and returned Unicode strings. As a result, there are two versions of the same function in the Windows API. One accepts ANSI strings, and the other accepts Unicode strings. However, you cannot use both encodings as your native string encoding in the same executable. You must decide at compile time whether to store your characters as 8-bit ANSI or 16-bit Unicode.

Some Mystic Wizardry

There are two atomic character types in Unicode-supporting Windows:

char - Character type. Each character is 8 bits.
wchar_t - Wide character type. Each character is 16 bits.
char ansiString[] = "ANSI string";
wchar_t UnicodeString[] = L"UNICODE string óé";

This code declares and initializes two character arrays, ansiString and UnicodeString, that hold strings with different character encodings.

The first string, ansiString, is a regular character array that holds a string of characters encoded in the ANSI format. The characters in this string are represented using one byte per character, with each character being 8 bits in size. This encoding is commonly used for text in the English language and other Western European languages that use the Latin alphabet.

The second string, UnicodeString, is a wide character array that holds a string of characters encoded in the Unicode format. The characters in this string are represented using two bytes per character, with each character being 16 bits in size. This encoding is capable of representing characters from all languages in the world, including non-Latin alphabets like Arabic, Chinese, and Hindi.

In addition to the different encodings, there is a difference in how the strings are declared. The first string is declared with the regular char type, while the second uses the wchar_t data type and an L prefix before the string literal. Wide character strings in C and C++ are declared using wchar_t instead of the char type used for ordinary strings, and the L prefix tells the compiler to treat the string literal as a wide character string.
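A quick way to make the size difference concrete is to compare the two declaration styles with sizeof. This is a minimal standalone sketch assuming a Windows toolchain, where wchar_t is 16 bits:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    char narrow[] = "abc";   // 4 bytes: three 8-bit characters plus the NUL terminator
    wchar_t wide[] = L"abc"; // 8 bytes: three 16-bit characters plus the NUL terminator
    printf("%u %u\n", (unsigned)sizeof(narrow), (unsigned)sizeof(wide));
    return 0;
}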

Finally, the second string includes a special character, ó, which has an accent mark. This character is not part of the 7-bit ASCII set, and whether an 8-bit ANSI string can hold it depends on the active code page. Unicode can always represent it, which is why it appears in the second string.

As previously noted, the Windows API contains duplicate versions of each function, with one implementation supporting ANSI strings and the other supporting Unicode strings. Let's take a closer look at an example of this in action with the DrawText function.

There are two versions of the DrawText function available in the Windows API: DrawTextA and DrawTextW. Both of these functions perform the same task, but one takes in ANSI strings (A) while the other takes in wide strings (W). The difference in encoding is indicated by the final character in the function name.

To illustrate the usage of these functions, let's take a look at some sample code. In this code, we use the DrawTextA function to draw text encoded in the ANSI format, and the DrawTextW function to draw text encoded in the Unicode format.

DrawTextA(hdc, ansiString, -1, &rect1, DT_SINGLELINE | DT_NOCLIP);    // ANSI (A) version takes char strings
DrawTextW(hdc, UnicodeString, -1, &rect2, DT_SINGLELINE | DT_NOCLIP); // wide (W) version takes wchar_t strings

 

In this example, we pass the device context handle (hdc) and the text to be drawn to each of the functions, along with other parameters that control the text layout and formatting. Note that the ansiString variable contains text encoded in the ANSI format, while the UnicodeString variable contains text encoded in the Unicode format.
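For context, here is roughly how those calls might sit inside a window procedure's WM_PAINT handler. The rectangle coordinates are made up for illustration, and hwnd is assumed to be the window handle:

case WM_PAINT:
{
    PAINTSTRUCT ps;
    HDC hdc = BeginPaint(hwnd, &ps);

    RECT rect1 = { 10, 10, 400, 30 };  // arbitrary target rectangles
    RECT rect2 = { 10, 40, 400, 60 };

    DrawTextA(hdc, ansiString, -1, &rect1, DT_SINGLELINE | DT_NOCLIP);
    DrawTextW(hdc, UnicodeString, -1, &rect2, DT_SINGLELINE | DT_NOCLIP);

    EndPaint(hwnd, &ps);
}
break;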

By using the appropriate function for the encoding of our text, we ensure that the text is correctly displayed on the screen. This is just one example of the many ways in which the Windows API handles different character encodings to support a wide range of languages and text formats.

Attempting to use an ANSI string with a Unicode function, or vice versa, results in a compilation error due to the type mismatch between the char and wchar_t data types.
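For example, swapping the two strings from above produces diagnostics along these lines (the exact wording varies by compiler):

DrawTextW(hdc, ansiString, -1, &rect1, DT_SINGLELINE | DT_NOCLIP);    // error: cannot convert 'char *' to 'LPCWSTR'
DrawTextA(hdc, UnicodeString, -1, &rect2, DT_SINGLELINE | DT_NOCLIP); // error: cannot convert 'wchar_t *' to 'LPCSTR'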

In order to avoid the tedium of checking which function to call in each scenario, the developers behind Windows NT came up with an ingenious solution. In most cases, a single application uses one consistent string representation, either ANSI or Unicode, and the Windows API headers were designed to let programmers mostly ignore the existence of the two different string types. Instead, a new "generic" character type called TCHAR is defined in the headers.

The definition of TCHAR is implemented using preprocessor directives, as shown below:

#ifdef UNICODE
#define TCHAR wchar_t
#else
#define TCHAR char
#endif
 

This means that if the UNICODE preprocessor directive is defined, TCHAR will be defined as wchar_t. If not, TCHAR will be defined as char. By using the TCHAR type instead of char or wchar_t, programmers can write code that is agnostic to the underlying string representation. This helps to simplify code and make it more maintainable, as it does not need to be updated if the underlying string representation is changed.
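The headers pair TCHAR with a TEXT macro that does the same trick for string literals. Slightly simplified (the real definition goes through a helper macro, but the effect is the same):

#ifdef UNICODE
#define TEXT(s) L##s  // UNICODE defined: the literal becomes a wide string
#else
#define TEXT(s) s     // otherwise: a plain ANSI literal
#endif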

Overall, the use of TCHAR and the clever design of the Windows API headers make it easier for programmers to work with strings in a consistent and seamless manner, regardless of the underlying representation.
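Putting the pieces together, code written against the generic type compiles cleanly either way. Note that the bare name DrawText is itself a macro that the headers map onto DrawTextA or DrawTextW, so everything stays in sync; hdc and rect1 are assumed from the earlier example:

TCHAR message[] = TEXT("Hello, world");                        // char[] or wchar_t[] depending on UNICODE
DrawText(hdc, message, -1, &rect1, DT_SINGLELINE | DT_NOCLIP); // resolves to DrawTextA or DrawTextW to match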

NOTE:

The UNICODE preprocessor symbol is defined by default in new Visual Studio projects, which means that TCHAR is replaced by wchar_t by default. However, if you are working with an older codebase or need to target ANSI for some reason, you can control the symbol yourself: remove the definition from your project settings, or undefine it in code before including windows.h. For example:

More C++ Sample Dample

#undef UNICODE  // make sure UNICODE is NOT defined, so the ANSI variants are used
#include <windows.h>

TCHAR anyString[] = TEXT("This can be either Unicode or ANSI. Olé!");

In this example, UNICODE is explicitly undefined before including the windows.h header file, which means that TCHAR will be replaced by char instead of wchar_t, and the TEXT macro will pass the literal through unchanged, resulting in an ANSI string literal. (Note that #define UNICODE 0 would not work: the headers only check whether the symbol is defined at all, not what its value is.)

When it comes to programming on Windows, understanding the differences between the Unicode and ANSI character sets is crucial for proper text rendering. If you compile a program using the Unicode character set, all text should be displayed correctly. However, if you compile using the Multi-Byte Character Set (ANSI), characters that require Unicode encoding will not render properly, and you will see the infamous "?" characters instead.

It is important to note that while it is possible to convert a well-written Unicode program into an ANSI one, it is essential to ensure that any text used in the program fits into the character set it is being compiled into. Otherwise, text will not be displayed as intended.

Fin

In addition to the Unicode and ANSI character sets, Windows headers also define a variety of other string types such as LPSTR, LPWSTR, LPTSTR, LPCTSTR, and more. These types are made up of combinations of pointers to char, wchar_t, or TCHAR. While the names of these types may seem intimidating at first, a closer look at their definitions should make their meanings relatively simple.
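Slightly simplified (the real headers spell these out via their own CHAR and WCHAR typedefs, but the effect is equivalent), they boil down to:

typedef char*           LPSTR;   // Long Pointer to a STRing (ANSI)
typedef const char*     LPCSTR;  // same, but Constant
typedef wchar_t*        LPWSTR;  // Long Pointer to a Wide STRing (Unicode)
typedef const wchar_t*  LPCWSTR;
typedef TCHAR*          LPTSTR;  // Long Pointer to a TCHAR STRing (follows UNICODE)
typedef const TCHAR*    LPCTSTR;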

To illustrate the importance of character set compatibility, consider the example of a program that receives user input. If the user enters text that contains characters outside of the character set that the program is compiled for, the program will not be able to display or process that text correctly. Therefore, it is important to carefully consider the character set used in your program to ensure proper text rendering and processing.

 

Til next time! -intro

 

 

 

 
