The following article is excerpted from the Metagraphics C/C++ Programming Guidelines manual. A full copy of this manual can be downloaded from the Metagraphics web site at: http://www.metagraphics.com/pubs/MetagraphicsCodingGuide.pdf (213Kb, Adobe Acrobat PDF file)
Writing Language Portable Code
As the need to broaden applications onto new platforms and into new markets expands, designing code for language portability is of growing importance. Just as familiarity with the int, short and long integer types is important for designing processor-portable code, familiarity with the different language character types is key for designing language-portable code.
While many compilers and operating systems today still use 8-bit ASCII, a growing number of platforms now also use 16-bit Unicode as a language standard. To be platform independent, your C/C++ application code needs to be capable of compiling and running in both ASCII-based and Unicode-based environments. In addition, there will be cases on an ASCII-based platform where you need to handle Unicode-specific text, and vice versa on a Unicode-based platform where you need to handle ASCII-specific text. The goal is to maintain a single source code base that is portable to any platform, and that supports both specific ASCII and Unicode needs when required.
Similar to the size-specific INT16 (short) and INT32 (long) types and the generic INT (int) type for integers, the basis for language portability for text starts with the definition of three basic character types: the size-specific CHAR (8-bit) and WCHAR ("wide" char, 16-bit, wchar_t), and the generic TCHAR (conditionally 8- or 16-bit).
Data Type | Win32 Type | Description |
CHAR | char | ASCII character (8-bit) |
WCHAR | wchar_t | 16-bit Unicode character |
«» TCHAR | CHAR or WCHAR | 8- or 16-bit character, depending on whether "_UNICODE" is defined |
«» Indicates a variable-size, platform-dependent conditional data type.
CHAR, WCHAR and TCHAR types
CHAR is the 8-bit ASCII-specific character type, and WCHAR ("wide char") is a 16-bit Unicode-specific character type. TCHAR (generic "text char") is a platform-dependent character type that is conditionally equal to CHAR on ASCII platforms and to WCHAR on Unicode platforms.
The defined identifier "_UNICODE" is used to identify whether the native environment is a Unicode-based platform. If _UNICODE is undefined, TCHAR is equated to ASCII CHAR; if _UNICODE is defined, TCHAR is equated to Unicode WCHAR.
    #define CHAR   char
    #define WCHAR  wchar_t            /* (unsigned short) */

    #ifndef _UNICODE
    #define TCHAR  CHAR               /* platform is ASCII */
    #else /*ifdef _UNICODE*/
    #define TCHAR  WCHAR              /* platform is Unicode */
    #endif /*_UNICODE*/
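As a quick way to confirm which type a given build selected, the definitions can be paired with a small self-check. The following is only a sketch: the helper names TcharIsWide() and CheckCharTypes() are hypothetical and not part of the guidelines. Note also that the width of wchar_t is implementation-defined; it is 16 bits on Win32 compilers but commonly 32 bits elsewhere, so the sketch checks sizes relative to wchar_t rather than assuming two bytes.

```c
#include <assert.h>
#include <stddef.h>   /* wchar_t, size_t */

#define CHAR  char
#define WCHAR wchar_t

#ifndef _UNICODE
#define TCHAR CHAR    /* platform is ASCII */
#else
#define TCHAR WCHAR   /* platform is Unicode */
#endif

/* Hypothetical helper: returns 1 when TCHAR is the wide type, else 0. */
int TcharIsWide(void)
{
#ifdef _UNICODE
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical sanity check: CHAR must be exactly one byte, and TCHAR
   must match whichever type the build selected. */
void CheckCharTypes(void)
{
    assert(sizeof(CHAR) == 1);
    assert(sizeof(TCHAR) == (TcharIsWide() ? sizeof(WCHAR) : sizeof(CHAR)));
}
```

Calling CheckCharTypes() once at program start-up would catch a build whose _UNICODE setting does not match the types actually in use.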
Literal Characters
The standard C/C++ single-quote (') method for specifying a single literal character works
for all three character data types:
    CHAR  charASCII   = 'A';   /* this is an 8-bit ASCII character */
    WCHAR charUnicode = 'B';   /* this is a 16-bit Unicode character */
    TCHAR charSystem  = 'C';   /* ASCII or Unicode depending on platform */
The variable charUnicode will be a 16-bit value 0x0042, which is the Unicode representation for the letter B. (Keep in mind that Intel processors store multibyte values with the least significant byte first, so the bytes are actually stored in memory in the sequence 0x42, 0x00. Remember this when examining a hex dump of Unicode text in memory.)
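The byte order described above can be observed directly. The sketch below is illustrative only: CharCodeBytes() and IsLittleEndian() are hypothetical helper names, and it assumes unsigned short is 16 bits wide, as it is on Win32 compilers.

```c
#include <assert.h>
#include <string.h>   /* memcpy */

/* Copies the in-memory bytes of a 16-bit character code into out[0..1].
   Assumption: unsigned short is 16 bits wide on this compiler. */
void CharCodeBytes(unsigned short code, unsigned char out[2])
{
    assert(sizeof(code) == 2);
    memcpy(out, &code, 2);
}

/* Returns 1 on a least-significant-byte-first (Intel style) processor. */
int IsLittleEndian(void)
{
    unsigned char bytes[2];
    CharCodeBytes(0x0001, bytes);
    return bytes[0] == 0x01;
}
```

For the letter B (0x0042), CharCodeBytes() yields 0x42, 0x00 on Intel processors and 0x00, 0x42 on processors that store the most significant byte first.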
Literal Strings
Literal ASCII CHAR Strings
Using the standard C/C++ double-quote (") method for specifying literal strings
works for ASCII, but will not work for Unicode character strings.
    CHAR strASCII[] = "this is an ASCII string of 8-bit characters";
Literal Unicode WCHAR Strings
The ANSI C notation for defining literal Unicode strings is to precede the first double-quote with the capital letter L (as in "Long"). The L is required, and there cannot be any spaces between the L and the first double-quote. The L tells the compiler that you want the string to be stored as 16-bit WCHAR characters.
    WCHAR strUnicode[] = L"this is a Unicode string of 16-bit characters";
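The effect of the L prefix on storage can be seen by comparing the sizes of two arrays that hold the same text. This is a minimal sketch with hypothetical helper names; it measures the wide array in units of sizeof(WCHAR) because the width of wchar_t varies between compilers (16 bits on Win32, commonly 32 bits elsewhere).

```c
#include <assert.h>
#include <stddef.h>   /* wchar_t, size_t */

#define CHAR  char
#define WCHAR wchar_t

static const CHAR  strASCII[]   = "text";    /* 4 characters + terminator */
static const WCHAR strUnicode[] = L"text";   /* 4 characters + terminator */

/* Bytes of storage used by the ASCII literal. */
size_t AsciiStorageBytes(void)
{
    return sizeof(strASCII);
}

/* Bytes of storage used by the Unicode literal. */
size_t UnicodeStorageBytes(void)
{
    return sizeof(strUnicode);
}

/* Both arrays hold the same number of elements; only the element size
   differs. */
void CheckLiteralStorage(void)
{
    assert(sizeof(strASCII) / sizeof(strASCII[0]) ==
           sizeof(strUnicode) / sizeof(strUnicode[0]));
}
```

One practical consequence is that buffer sizes for WCHAR text should be computed as a character count multiplied by sizeof(WCHAR), never as a bare byte count.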
Literal TCHAR Strings
For the conditional TCHAR character type, we need a method to conditionally define strings either as an 8-bit ASCII CHAR string or as a 16-bit Unicode WCHAR string. A method to handle this is to define a special TEXT() macro that performs this function based on whether the keyword "_UNICODE" is defined or not.
    #ifndef _UNICODE
    #define __T(s)   s                /* platform is ASCII */
    #else /* ifdef _UNICODE */
    #define __T(s)   L##s             /* platform is Unicode */
    #endif /*_UNICODE*/

    #define TEXT(s)  __T(s)
L##s is a somewhat obscure C syntax, but it's an ANSI C specification that uses the ## "token paste" operator to have the C preprocessor concatenate the letter L with the quoted string token s.
With the TEXT() macro defined above, we can now specify TCHAR strings that are conditionally either ASCII or Unicode based on the target platform:
    TCHAR strGeneric[] = TEXT("ASCII or Unicode string depending on platform");
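One reason for routing TEXT() through a second macro is that the argument is then fully expanded before the L is pasted on, so TEXT() also works when the string comes from another macro. The sketch below illustrates this under a few assumptions: the inner helper is named TEXT_IMPL here rather than __T, since identifiers beginning with a double underscore are reserved for the compiler and library, and APP_NAME, strAppName and AppNameLength() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>   /* wchar_t, size_t */

#define CHAR  char
#define WCHAR wchar_t

#ifndef _UNICODE
#define TCHAR CHAR
#define TEXT_IMPL(s) s        /* platform is ASCII */
#else
#define TCHAR WCHAR
#define TEXT_IMPL(s) L##s     /* platform is Unicode */
#endif
#define TEXT(s) TEXT_IMPL(s)

/* A macro holding a string literal: the extra macro level lets TEXT()
   expand APP_NAME to "Demo" before the L prefix is pasted on. */
#define APP_NAME "Demo"

static const TCHAR strAppName[] = TEXT(APP_NAME);

/* Number of characters in strAppName, not counting the terminator. */
size_t AppNameLength(void)
{
    size_t n = sizeof(strAppName) / sizeof(strAppName[0]) - 1;
    assert(strAppName[n] == 0);   /* last element is the terminator */
    return n;
}
```

With a single-level definition such as `#define TEXT(s) L##s`, the Unicode build would paste L onto the unexpanded name and produce the undefined token LAPP_NAME.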
Library Functions
In addition to the basic character types and literal specifications, we also need support for common library string manipulation functions. The latest ANSI C library fortunately includes functions for both 8-bit ASCII strings (declared in STRING.H) and 16-bit Unicode strings (declared in WCHAR.H). Similar to the strlen() function that returns the number of characters in an ASCII string, ANSI C now also provides a wcslen() function that returns the number of characters in a Unicode string (very important: the ASCII strlen() function will not return the proper length of a Unicode string!). There is a matching Unicode WCHAR function for most of the standard ASCII CHAR string functions. For use with our generic TCHAR type, we define a third set of functions based on our "_UNICODE" keyword.
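To make the strlen() warning concrete, the sketch below measures the same Unicode string both ways and shows one possible way to map a generic tcslen name. The function names UnicodeLength(), UnicodeLengthWrong() and GenericLength() are hypothetical, and the single #define for tcslen stands in for the full set of generic mappings.

```c
#include <assert.h>
#include <stddef.h>   /* size_t */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wcslen */

#define CHAR  char
#define WCHAR wchar_t

#ifndef _UNICODE
#define TCHAR  CHAR
#define tcslen strlen     /* platform is ASCII */
#else
#define TCHAR  WCHAR
#define tcslen wcslen     /* platform is Unicode */
#endif

/* Correct: count the characters of a Unicode string with wcslen(). */
size_t UnicodeLength(const WCHAR *s)
{
    assert(s != NULL);
    return wcslen(s);
}

/* Incorrect on purpose: strlen() scans bytes, so it stops at the first
   zero byte inside a wide character and reports too short a length. */
size_t UnicodeLengthWrong(const WCHAR *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}

/* Generic version: works with whichever type TCHAR resolves to. */
size_t GenericLength(const TCHAR *s)
{
    assert(s != NULL);
    return tcslen(s);
}
```

For L"AB", UnicodeLength() returns 2, while UnicodeLengthWrong() returns a smaller value because the high-order byte of each plain ASCII letter is zero.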
The following table summarizes the data types and function names associated with each of the type-specific CHAR and WCHAR types, and also with our generic TCHAR type.
Item | ASCII | Unicode | Generic-8/16 |
character size | 8-bit | 16-bit | 8- or 16-bit |
type | CHAR | WCHAR | TCHAR |
literal character | '.' | '.' | '.' |
literal string | "..." | L"..." | TEXT("...") |
get character string length | strlen() | wcslen() | tcslen() |
find character in string | strchr() | wcschr() | tcschr() |
find character, ignore case | strichr() | wcsichr() | tcsichr() |
reverse-find character | strrchr() | wcsrchr() | tcsrchr() |
reverse-find char, ignore case | strrichr() | wcsrichr() | tcsrichr() |
find substring | strstr() | wcsstr() | tcsstr() |
find substring, ignore case | stristr() | wcsistr() | tcsistr() |
copy string | strcpy() | wcscpy() | tcscpy() |
copy string, w/max | strncpy() | wcsncpy() | tcsncpy() |
concatenate string | strcat() | wcscat() | tcscat() |
concatenate string (w/max) | strncat() | wcsncat() | tcsncat() |
compare string | strcmp() | wcscmp() | tcscmp() |
compare string, max | strncmp() | wcsncmp() | tcsncmp() |
compare string, ignore case | stricmp() | wcsicmp() | tcsicmp() |
compare string, nocase, max | strnicmp() | wcsnicmp() | tcsnicmp() |
get non-matching char index | strspn() | wcsspn() | tcsspn() |
get matching char index | strcspn() | wcscspn() | tcscspn() |
find next token | strtok() | wcstok() | tcstok() |
locate matching character | strpbrk() | wcspbrk() | tcspbrk() |
is alphanumeric character? | isalnum() | iswalnum() | istalnum() |
is alpha character? | isalpha() | iswalpha() | istalpha() |
is decimal digit (0-9)? | isdigit() | iswdigit() | istdigit() |
is lowercase character? | islower() | iswlower() | istlower() |
is uppercase character? | isupper() | iswupper() | istupper() |
is hex digit (0-9, A-F, a-f)? | isxdigit() | iswxdigit() | istxdigit() |
is white-space character? | isspace() | iswspace() | istspace() |
is printable character? | isprint() | iswprint() | istprint() |
is punctuation character? | ispunct() | iswpunct() | istpunct() |
convert char to lowercase | tolower() | towlower() | totlower() |
convert char to uppercase | toupper() | towupper() | totupper() |
convert integer to string | itoa() | itow() | itot() |
convert long to string | ltoa() | ltow() | ltot() |
convert string to integer | atoi() | wtoi() | ttoi() |
convert string to long | atol() | wtol() | ttol() |
format data to stdout | printf() | wprintf() | tprintf() |
format data to file | fprintf() | fwprintf() | ftprintf() |
format data to string | sprintf() | swprintf() | stprintf() |
format data to string, w/max | snprintf() | snwprintf() | sntprintf() |
format arglist to stdout | vprintf() | vwprintf() | vtprintf() |
format arglist to file | vfprintf() | vfwprintf() | vftprintf() |
format arglist to string | vsprintf() | vswprintf() | vstprintf() |
format args to string, w/max | vsnprintf() | vsnwprintf() | vsntprintf() |
open file | fopen() | wfopen() | tfopen() |
read formatted data | scanf() | wscanf() | tscanf() |
read formatted from string | sscanf() | swscanf() | stscanf() |
read formatted from file | fscanf() | fwscanf() | ftscanf() |
To underscore or not to underscore - that is the question...
To be blunt, the use of underscores in function names for standard C library string routines is a mess. While later C/C++ standards proposed using leading underscores (_) to delineate C library and system functions from user application function names, most of the original ASCII library function names (strlen, printf, sscanf, strncmp, etc.) don't begin with a leading underscore. As new ASCII and Unicode library functions were added, some included leading underscores while others didn't. With over three hundred (300!) ASCII and Unicode standard C library string function names, there's no practical way to remember which ones use underscores (or double underscores), and which ones don't.
To mitigate the string function naming mess, we've adopted an in-house standard that drops all leading underscores from C library string function names. This is accomplished by including an additional header file that simply #define's underscored string function names to a matching name without an underscore. Since our in-house standard is to use mixed-case descriptive words for application function names (the first letter of each word is capitalized), we should never conflict with the standard C library string function names, which are always all-lowercase shortened acronyms.
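A fragment of such a mapping header might look like the sketch below. It is not the actual Metagraphics header. It covers only the two case-insensitive compare functions, and the non-Microsoft branch is an added assumption: on POSIX systems the equivalent functions are named strcasecmp() and strncasecmp() and are declared in <strings.h>. SameTextNoCase() is a hypothetical application function, named in the mixed-case style described above.

```c
#include <assert.h>
#include <string.h>

#if defined(_MSC_VER)
/* Microsoft's library spells the case-insensitive compares with a
   leading underscore; map the plain names onto them. */
#define stricmp  _stricmp
#define strnicmp _strnicmp
#else
/* Assumption: a POSIX system, where <strings.h> provides the
   equivalent functions under different names. */
#include <strings.h>
#define stricmp  strcasecmp
#define strnicmp strncasecmp
#endif

/* Returns 1 if the two ASCII strings match, ignoring case. */
int SameTextNoCase(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return stricmp(a, b) == 0;
}
```

Application code then calls stricmp() and strnicmp() everywhere, and only this one header needs to know how each compiler spells them.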
With our CHAR, WCHAR and TCHAR types, along with our character and literal string specifiers and the string library functions outlined above, we have a cohesive and portable system for handling characters and text on any ASCII or Unicode platform. Where needed, we also still have our type-specific functions for handling ASCII-specific strings on Unicode platforms and, vice versa, Unicode-specific strings on ASCII platforms.
The preceding was excerpted from the Metagraphics C/C++ Programming Guidelines manual, developed internally as part of our ongoing efforts to improve the quality of our software products. Implementing a written corporate programming standards document is one of the first steps in the Capability Maturity Model for Software process (CMM-SW). For additional information on CMM-SW visit the Carnegie Mellon University, Software Engineering Institute web site at: http://www.sei.cmu.edu/cmm/cmm.html
Copyright © 2001 - Metagraphics Software Corporation.