Metagraphics Coding Techniques - Writing Language-Portable C/C++ Code

Programming
Techniques

The following article is excerpted from the Metagraphics C/C++ Programming Guidelines manual. A full copy of this manual can be downloaded from the Metagraphics web site at: http://www.metagraphics.com/pubs/MetagraphicsCodingGuide.pdf (213Kb, Adobe Acrobat PDF file)

Writing Language Portable Code

As the need to broaden applications onto new platforms and into new markets expands, designing code for language portability is of growing importance. Just as familiarity with the int, short and long integer types is important for designing processor-portable code, familiarity with the different language character types is key for designing language-portable code.

While many compilers and operating systems today still use 8-bit ASCII, a growing number of platforms now also use 16-bit Unicode as a language standard. To be platform independent, your C/C++ application code needs to be capable of compiling and running in both ASCII- and Unicode-based environments. In addition, there will be cases on an ASCII-based platform where you need to handle Unicode-specific text, and cases on a Unicode-based platform where you need to handle ASCII-specific text. The goal is to maintain a single source code base that is portable to any platform, and that supports both ASCII-specific and Unicode-specific needs when required.

Similar to size-specific INT16(short), INT32(long) and generic INT(int) types for integer uses, the basis for language portability for text starts with the definition of three basic character types: size-specific CHAR (8-bit), WCHAR ("wide" char 16-bit, wchar_t), and generic TCHAR (conditional 8- or 16-bit).

Data Type	Win32 Type	Description
CHAR	char	ASCII character (8-bit)
WCHAR	wchar_t	16-bit Unicode character
«» TCHAR	CHAR or WCHAR	8- or 16-bit character, depending if "`_UNICODE`" is defined

«» Indicates variable size platform-dependent conditional data type.

CHAR, WCHAR and TCHAR types
CHAR is the 8-bit ASCII-specific character type, and WCHAR ("wide char") is the 16-bit Unicode-specific character type. TCHAR (generic "text char") is a platform-dependent character type that is equal to CHAR on ASCII platforms and to WCHAR on Unicode platforms. The identifier "_UNICODE" indicates whether the native environment is a Unicode-based platform. If _UNICODE is undefined, TCHAR is defined as ASCII CHAR; if _UNICODE is defined, TCHAR is defined as Unicode WCHAR.

 
      #define  CHAR   char
      #define  WCHAR  wchar_t /* (unsigned short)    */
 
      #ifndef _UNICODE
      #define  TCHAR  CHAR    /* platform is ASCII   */
      #else /*ifdef _UNICODE*/
      #define  TCHAR  WCHAR   /* platform is Unicode */
      #endif /*_UNICODE*/
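As a minimal sketch of how these definitions behave in practice, the fragment below repeats them and adds two small helpers. TcharIsWide() and TextBufferBytes() are illustrative names that do not come from the guidelines. The 16-bit size of wchar_t is an assumption that holds for Win32 compilers; on some other platforms wchar_t is 32 bits, so the helpers use sizeof rather than a fixed size.

```c
#include <assert.h>
#include <stddef.h>     /* size_t, wchar_t */

#define  CHAR   char
#define  WCHAR  wchar_t

#ifndef _UNICODE
#define  TCHAR  CHAR    /* platform is ASCII   */
#else
#define  TCHAR  WCHAR   /* platform is Unicode */
#endif

/* Hypothetical helper: 1 if TCHAR was built as WCHAR, 0 if built as CHAR. */
int TcharIsWide(void)
{
#ifdef _UNICODE
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: bytes needed to hold nChars generic text characters. */
size_t TextBufferBytes(size_t nChars)
{
    assert(nChars <= (size_t)-1 / sizeof(TCHAR));   /* guard against overflow */
    return nChars * sizeof(TCHAR);
}
```

Sizing buffers through a helper like this avoids the common mistake of treating a character count as a byte count, which only happens to work when TCHAR is CHAR.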

Literal Characters
The standard C/C++ single-quote (') method for specifying a single literal character works for all three character data types:

 
   CHAR   charASCII   = 'A';   /* this is an 8-bit ASCII character       */
   WCHAR  charUnicode = 'B';   /* this is a 16-bit Unicode character     */
   TCHAR  charSystem  = 'C';   /* ASCII or Unicode depending on platform */

The variable charUnicode will be a 16-bit value 0x0042, which is the Unicode representation for the letter B. (Keep in mind that Intel processors store multibyte values with the least significant bytes first, so the bytes are actually stored in memory in the sequence 0x42, 0x00 - remember this when examining a hex dump of Unicode text in memory.)
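The byte order described above can be checked with a short sketch. FirstByteInMemory() and IsLittleEndian() are illustrative names, not part of the guidelines, and the code only assumes that wchar_t is at least two bytes wide.

```c
#include <assert.h>
#include <stddef.h>     /* wchar_t */
#include <string.h>     /* memcpy  */

#define  WCHAR  wchar_t

/* Hypothetical helper: the byte stored at the lowest address of a WCHAR. */
unsigned char FirstByteInMemory(WCHAR ch)
{
    unsigned char bytes[sizeof(WCHAR)];

    assert(sizeof(WCHAR) >= 2);         /* a wide character spans several bytes */
    memcpy(bytes, &ch, sizeof bytes);
    return bytes[0];
}

/* Hypothetical helper: 1 on little-endian processors (such as Intel), else 0. */
int IsLittleEndian(void)
{
    unsigned short probe = 1;
    unsigned char  first;

    memcpy(&first, &probe, 1);          /* look at the lowest-addressed byte */
    return first == 1;
}
```

On an Intel processor, FirstByteInMemory('B') yields 0x42, the least significant byte, which is what a hex dump shows first.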

Literal Strings

Literal ASCII CHAR Strings
Using the standard C/C++ double-quote (") method for specifying literal strings works for ASCII, but will not work for Unicode character strings.

 
   CHAR  strASCII[] = "this is an ASCII string of 8-bit characters";

Literal Unicode WCHAR Strings
The ANSI C method for defining literal Unicode strings is to precede the first double-quote with the capital letter L (as in "Long"). The L is required, and there cannot be any spaces between the L and the first double-quote. The L tells the compiler to store the string as 16-bit WCHAR (wchar_t) characters.

 
   WCHAR strUnicode[] = L"this is a Unicode string of 16-bit characters";
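The difference between the two literal forms can be seen by comparing element counts with storage sizes. This is a sketch only; the helper function names are illustrative and not part of the guidelines, and the wide byte size is computed with sizeof because wchar_t is not 16 bits on every compiler.

```c
#include <assert.h>
#include <stddef.h>     /* size_t, wchar_t */

#define  CHAR   char
#define  WCHAR  wchar_t

static const CHAR  strASCII[]   =  "text";
static const WCHAR strUnicode[] = L"text";

/* Elements in each array, including the terminating null character. */
size_t AsciiElementCount(void)   { return sizeof strASCII   / sizeof strASCII[0];   }
size_t UnicodeElementCount(void) { return sizeof strUnicode / sizeof strUnicode[0]; }

/* Storage used by each array, in bytes. */
size_t AsciiByteSize(void)   { return sizeof strASCII;   }
size_t UnicodeByteSize(void) { return sizeof strUnicode; }

/* Hypothetical self-check: same number of characters, different storage. */
void CheckLiteralSizes(void)
{
    assert(AsciiElementCount() == UnicodeElementCount());
    assert(UnicodeByteSize() == UnicodeElementCount() * sizeof(WCHAR));
}
```

Both arrays hold the same number of characters, but the Unicode array occupies sizeof(WCHAR) times as many bytes.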

Literal TCHAR Strings
For the conditional TCHAR character type, we need a method to conditionally define strings either as an 8-bit ASCII CHAR string or as a 16-bit Unicode WCHAR string. One way to handle this is to define a special TEXT() macro that selects the string form based on whether the identifier _UNICODE is defined.

 
     #ifndef  _UNICODE
     #define  __T(s)  s       /* platform is ASCII   */
     #else  /* ifdef _UNICODE */
     #define  __T(s)  L##s    /* platform is Unicode */
     #endif /*_UNICODE*/
 
     #define  TEXT(s)  __T(s)

L##s is a somewhat obscure C syntax, but it's an ANSI C specification that uses the ## "token paste" operator to have the C preprocessor concatenate the letter L with the token quoted string s. With the above #define TEXT() macro we can now specify TCHAR strings that are conditionally either ASCII or Unicode based on the target platform:

 
    TCHAR strGeneric[] = TEXT("ASCII or Unicode string depending on platform");
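One reason for routing TEXT() through the second macro __T() is that the argument is macro-expanded before the L is pasted on. The sketch below illustrates this with APP_NAME, a hypothetical application constant; AppNameLength() is likewise an illustrative name that is not part of the guidelines.

```c
#include <assert.h>
#include <stddef.h>     /* size_t, wchar_t */

#define  CHAR   char
#define  WCHAR  wchar_t

#ifndef  _UNICODE
#define  TCHAR   CHAR
#define  __T(s)  s       /* platform is ASCII   */
#else
#define  TCHAR   WCHAR
#define  __T(s)  L##s    /* platform is Unicode */
#endif

#define  TEXT(s)  __T(s)

/* Hypothetical application constant, written once as an ordinary string. */
#define  APP_NAME  "demo"

/* TEXT() first expands APP_NAME to "demo"; __T() then adds the L if needed. */
static const TCHAR strAppName[] = TEXT(APP_NAME);

/* Hypothetical helper: characters in strAppName, excluding the terminator. */
size_t AppNameLength(void)
{
    size_t count = sizeof strAppName / sizeof strAppName[0];

    assert(count > 0);   /* the array always holds at least the terminator */
    return count - 1;
}
```

If TEXT(s) were defined directly as L##s, then TEXT(APP_NAME) on a Unicode platform would paste to the single token LAPP_NAME instead of L"demo".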

Library Functions
In addition to the basic character types and literal specifications, we also need support for the common library string manipulation functions. Fortunately, the ANSI C library now includes functions for both 8-bit ASCII and 16-bit Unicode strings. (Standard C declares the wide-character versions in WCHAR.H and WCTYPE.H; some compilers also declare them in STRING.H.) Similar to the strlen() function that returns the number of characters in an ASCII string, ANSI C also provides a wcslen() function that returns the number of characters in a Unicode string. This is very important: the ASCII strlen() function will not return the proper length of a Unicode string! There is a matching Unicode WCHAR function for most of the standard ASCII CHAR string functions. For use with our generic TCHAR type, we define a third set of functions based on our "_UNICODE" identifier. The following table summarizes the data types and function names associated with each of the type-specific CHAR and WCHAR types, and also with our generic TCHAR type.

	ASCII	Unicode	Generic-8/16
character size	8-bit	16-bit	8- or 16-bit
type	CHAR	WCHAR	TCHAR
literal character	'.'	'.'	'.'
literal string	"..."	L"..."	TEXT("...")
get character string length	strlen()	wcslen()	tcslen()
find character in string	strchr()	wcschr()	tcschr()
find character, ignore case	strichr()	wcsichr()	tcsichr()
reverse-find character	strrchr()	wcsrchr()	tcsrchr()
reverse-find char, ignore case	strrichr()	wcsrichr()	tcsrichr()
find substring	strstr()	wcsstr()	tcsstr()
find substring, ignore case	stristr()	wcsistr()	tcsistr()
copy string	strcpy()	wcscpy()	tcscpy()
copy string, w/max	strncpy()	wcsncpy()	tcsncpy()
concatenate string	strcat()	wcscat()	tcscat()
concatenate string (w/max)	strncat()	wcsncat()	tcsncat()
compare string	strcmp()	wcscmp()	tcscmp()
compare string, max	strncmp()	wcsncmp()	tcsncmp()
compare string, ignore case	stricmp()	wcsicmp()	tcsicmp()
compare string, nocase, max	strnicmp()	wcsnicmp()	tcsnicmp()
get non-matching char index	strspn()	wcsspn()	tcsspn()
get matching char index	strcspn()	wcscspn()	tcscspn()
find next token	strtok()	wcstok()	tcstok()
locate matching character	strpbrk()	wcspbrk()	tcspbrk()
is alphanumeric character?	isalnum()	iswalnum()	istalnum()
is alpha character?	isalpha()	iswalpha()	istalpha()
is decimal digit (0-9)?	isdigit()	iswdigit()	istdigit()
is lowercase character?	islower()	iswlower()	istlower()
is uppercase character?	isupper()	iswupper()	istupper()
is hex digit (0-9, A-F, a-f)?	isxdigit()	iswxdigit()	istxdigit()
is white-space character?	isspace()	iswspace()	istspace()
is printable character?	isprint()	iswprint()	istprint()
is punctuation character?	ispunct()	iswpunct()	istpunct()
convert char to lowercase	tolower()	towlower()	totlower()
convert char to uppercase	toupper()	towupper()	totupper()
convert integer to string	itoa()	itow()	itot()
convert long to string	ltoa()	ltow()	ltot()
convert string to integer	atoi()	wtoi()	ttoi()
convert string to long	atol()	wtol()	ttol()
format data to stdout	printf()	wprintf()	tprintf()
format data to file	fprintf()	fwprintf()	ftprintf()
format data to string	sprintf()	swprintf()	stprintf()
format data to string, w/max	snprintf()	snwprintf()	sntprintf()
format arglist to stdout	vprintf()	vwprintf()	vtprintf()
format arglist to file	vfprintf()	vfwprintf()	vftprintf()
format arglist to string	vsprintf()	vswprintf()	vstprintf()
format args to string, w/max	vsnprintf()	vsnwprintf()	vsntprintf()
open file	fopen()	wfopen()	tfopen()
read formatted data	scanf()	wscanf()	tscanf()
read formatted from string	sscanf()	swscanf()	stscanf()
read formatted from file	fscanf()	fwscanf()	ftscanf()
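A minimal sketch of how the generic column of the table might be defined is shown below for two of the functions. The #define mappings are an assumption about how such a header could look, not the actual Metagraphics header, and IsEmptyText() and SameText() are illustrative helper names.

```c
#include <assert.h>
#include <stddef.h>     /* NULL, wchar_t  */
#include <string.h>     /* strlen, strcmp */
#include <wchar.h>      /* wcslen, wcscmp */

#define  CHAR   char
#define  WCHAR  wchar_t

#ifndef  _UNICODE
#define  TCHAR   CHAR
#define  __T(s)  s
#define  tcslen  strlen     /* generic names map to the ASCII functions   */
#define  tcscmp  strcmp
#else
#define  TCHAR   WCHAR
#define  __T(s)  L##s
#define  tcslen  wcslen     /* generic names map to the Unicode functions */
#define  tcscmp  wcscmp
#endif

#define  TEXT(s)  __T(s)

/* Hypothetical helper: 1 if the generic string is empty, else 0. */
int IsEmptyText(const TCHAR *str)
{
    assert(str != NULL);
    return tcslen(str) == 0;
}

/* Hypothetical helper: 1 if two generic strings are identical, else 0. */
int SameText(const TCHAR *a, const TCHAR *b)
{
    assert(a != NULL && b != NULL);
    return tcscmp(a, b) == 0;
}
```

Application code written this way calls only the generic names, while the type-specific strlen() and wcslen() remain available for text that must be ASCII or Unicode regardless of platform.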

To underscore or not to underscore - that is the question...
To be blunt, the use of underscores in function names for standard C library string routines is a mess. While later C/C++ standards proposed using leading underscores (_) to delineate C library and system functions from user application function names, most of the original ASCII library function names (strlen, printf, sscanf, strncmp, etc.) don't begin with a leading underscore. As new ASCII and Unicode library functions were added, some included leading underscores while others didn't. With over three hundred (300!) ASCII and Unicode standard C library string function names, there's no practical way to remember which ones use underscores (or double underscores), and which ones don't.

To mitigate the string function naming mess, we've adopted an in-house standard that drops all leading underscores from C library string function names. This is accomplished by including an additional header file that simply #define's underscored string function names to a matching name without an underscore. Since our in-house standard is to use mixed-case descriptive words for application function names (the first letter of each word is capitalized), we should never conflict with the standard C library string function names which are always all lowercase shortened acronyms.
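The kind of header described above might look like the following sketch for two case-insensitive compare functions. The Microsoft names _stricmp and _strnicmp are real; mapping the same plain names to the POSIX strcasecmp and strncasecmp functions on other compilers is an assumption added here for illustration, and IsSameNameIgnoreCase() is a hypothetical application function.

```c
#ifndef _MSC_VER
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request the POSIX declarations */
#endif
#endif

#include <assert.h>
#include <stddef.h>               /* NULL */
#include <string.h>

#ifdef _MSC_VER
#define  stricmp   _stricmp       /* Microsoft names carry a leading underscore */
#define  strnicmp  _strnicmp
#else
#include <strings.h>
#define  stricmp   strcasecmp     /* assumption: POSIX equivalents elsewhere */
#define  strnicmp  strncasecmp
#endif

/* Hypothetical application function, named in mixed case per the convention. */
int IsSameNameIgnoreCase(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return stricmp(a, b) == 0;
}
```

Application code then uses stricmp() and strnicmp() everywhere, and only this one header needs to know how each compiler spells them.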

With our CHAR, WCHAR and TCHAR types, along with our character and literal string specifiers, and the string library functions outlined above, we have a cohesive and portable system for handling characters and text on any ASCII or Unicode platform. Also where needed, we still have our type-specific functions for handling ASCII-specific strings on Unicode platforms, and vice-versa handling Unicode-specific strings on ASCII platforms.

The preceding was excerpted from Metagraphics C/C++ Programming Guidelines manual, developed internally as part of our on-going efforts to improve the quality of our software products. Implementing a written corporate programming standards document is one of the first steps in the Capability Maturity Model for Software process (CMM-SW). For additional information on CMM-SW visit the Carnegie Mellon University, Software Engineering Institute web site at: http://www.sei.cmu.edu/cmm/cmm.html
