Making the Transition to ANSI C
Basic Modes
The ANSI C compiler accepts both old-style and new-style C code. The following -X (note case) options provide varying degrees of compliance with the ANSI C standard. -Xa is the default mode.
An ANSI C-conforming compiler must issue a diagnostic whenever two incompatible declarations for the same object or function are in the same scope. If all functions are declared and defined with prototypes, and the appropriate headers are included by the correct source files, all calls should agree with the definition of the functions. This protocol eliminates one of the most common C programming mistakes.
Updating Existing Code
If you have an existing application and want the benefits of function prototypes, there are a number of possibilities for updating, depending on how much of the code you would like to change:
For functions with varying arguments, there can be no mixing of ANSI C's ellipsis notation and the old-style varargs() function definition. For functions with a fixed number of parameters, the situation is fairly straightforward: just specify the types of the parameters as they were passed in previous implementations.
In K&R C, each argument was converted just before it was passed to the called function according to the default argument promotions. These promotions specified that all integral types narrower than int were promoted to int size, and any float argument was promoted to double, hence simplifying both the compiler and libraries. Function prototypes are more expressive--the specified parameter type is what is passed to the function.
Thus, if a function prototype is written for an existing (old-style) function definition, there should be no parameters in the function prototype with any of the following types:
char
signed char
unsigned char
float
short
signed short
unsigned short
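The default argument promotions still apply in ANSI C to arguments that match an ellipsis, so their effect can be observed there. The following is a minimal sketch, not taken from the original text; the function name char_plus_float is hypothetical. It shows that a char argument has to be fetched as int and a float argument as double.

```c
#include <stdarg.h>

/* Hypothetical helper: expects one char and one float after the fixed
 * parameter, both passed through the "..." part of the parameter list. */
double char_plus_float(int unused, ...)
{
    va_list ap;
    int c;
    double f;

    va_start(ap, unused);
    c = va_arg(ap, int);      /* a char argument arrives promoted to int     */
    f = va_arg(ap, double);   /* a float argument arrives promoted to double */
    va_end(ap);
    return c + f;
}
```

A call such as char_plus_float(0, c, f), with c declared as char and f as float, passes an int and a double. This is why a prototype written for an old-style definition should name int and double rather than the narrow types.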
There still remain two complications with writing prototypes: typedef names and the promotion rules for narrow unsigned types.
Watch out for identifiers in prototypes that may be affected by preprocessing. Consider the following example:
#define status 23
void my_exit(int status);   /* Normally, scope begins */
                            /* and ends with prototype */
Do not mix function prototypes with old-style function declarations that contain narrow types.
void foo(unsigned char, unsigned short);
void foo(i, j) unsigned char i; unsigned short j; {...}
header.h:
struct s { /* . . . */ };
#ifdef __STDC__
void errmsg(int, ...);
struct s *f(const char *);
int g(void);
#else
void errmsg();
struct s *f();
int g();
#endif
The following function uses prototypes and can still be compiled on an older system:
struct s *
#ifdef __STDC__
f(const char *p)
#else
f(p) char *p;
#endif
{
    /* . . . */
}
Here is an updated source file (as with choice 3 above). The local function still uses an old-style definition, but a prototype is included for newer compilers:
source.c:
#include "header.h"
typedef /* . . . */ MyType;
#ifdef __STDC__
static void del(MyType *);
/* . . . */
#endif
static void del(p)
MyType *p;
{
    /* . . . */
}
/* . . . */
Functions with Varying Arguments
In previous implementations, you could not specify the parameter types that a function expected, but ANSI C encourages you to use prototypes to do just that. To support functions such as printf(), the syntax for prototypes includes a special ellipsis (...) terminator. Because an implementation might need to do unusual things to handle a varying number of arguments, ANSI C requires that all declarations and the definition of such a function include the ellipsis terminator.
#ifdef __STDC__
void errmsg(int code, ...);
#else
void errmsg();
#endif
#ifdef __STDC__
#include <stdarg.h>
#else
#include <varargs.h>
#endif
#include <stdio.h>
stdio.h is included because we call fprintf() and vfprintf() later.
void
#ifdef __STDC__
errmsg(int code, ...)
#else
errmsg(va_alist) va_dcl /* Note: no semicolon! */
#endif
{
    /* more detail below */
}
Since the old-style variable argument mechanism did not allow us to specify any fixed parameters, we must arrange for them to be accessed before the varying portion. Also, due to the lack of a name for the "..." part of the parameters, the new va_start() macro has a second argument--the name of the parameter that comes just before the "..." terminator.
int f(...);
For such functions, va_start() should be invoked with an empty second argument, as in: va_start(ap,)
The following is the body of the function:
(void)vfprintf(stderr, va_arg(ap, char *), ap);
The definitions for the macros FILENAME, LINENUMBER, and WARNING are presumably contained in the same header as the declaration of errmsg().
errmsg(FILENAME, "<command line>", "cannot open: %s\n", argv[optind]);
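The pieces above can be put together in a compact ANSI-only sketch. It is an illustration under stated assumptions, not the original function body: the name format_errmsg is hypothetical, the numeric values of FILENAME, LINENUMBER, and WARNING are invented here, the meanings given to LINENUMBER and WARNING are guesses, and the text is formatted into a caller-supplied buffer (using snprintf() and vsnprintf(), which appeared after the original ANSI C standard) rather than written to stderr, so that the result can be inspected.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical flag values; the source names these macros but does not
 * show their definitions. */
#define FILENAME   0x1   /* next argument is a file name (char *)          */
#define LINENUMBER 0x2   /* assumed: next argument is a line number (int)  */
#define WARNING    0x4   /* assumed: prefix the message with "warning: "   */

/* Builds the message in buf instead of printing it. */
int format_errmsg(char *buf, size_t size, int code, ...)
{
    va_list ap;
    char file_part[128] = "";
    char line_part[32] = "";
    char text[256];
    const char *fmt;

    va_start(ap, code);   /* second argument: the parameter before "..." */
    if (code & FILENAME)
        snprintf(file_part, sizeof file_part, "\"%s\": ", va_arg(ap, char *));
    if (code & LINENUMBER)
        snprintf(line_part, sizeof line_part, "%d: ", va_arg(ap, int));
    fmt = va_arg(ap, char *);   /* fetch the format before passing ap on */
    vsnprintf(text, sizeof text, fmt, ap);
    va_end(ap);

    return snprintf(buf, size, "%s%s%s%s", file_part, line_part,
                    (code & WARNING) ? "warning: " : "", text);
}
```

The format string is fetched into fmt before ap is handed to vsnprintf() because the order in which function arguments are evaluated is unspecified.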
Promotions: Unsigned Versus Value Preserving
The following information appears in the Rationale section that accompanies the draft C Standard:
In most C compilers, the simpler rule, "unsigned preserving," is used: when an unsigned type needs to be widened, it is widened to an unsigned type; when an unsigned type mixes with a signed type, the result is an unsigned type.
The other rule, specified by ANSI C, is known as "value preserving," in which the result type depends on the relative sizes of the operand types. When an unsigned char or unsigned short is widened, the result type is int if an int is large enough to represent all the values of the smaller type. Otherwise, the result type is unsigned int. The value preserving rule produces the least surprise arithmetic result for most expressions.
int f(void)
{
    int i = -2;
    unsigned char uc = 1;

    return (i + uc) < 17;
}
The code above causes the compiler to issue the following warning when you use the -xtransition option:
line 6: warning: semantics of "<" change in ANSI C; use explicit cast
The result of the addition has type int (value preserving) or unsigned int (unsigned preserving), but the bit pattern does not change between these two. On a two's-complement machine, we have:
   i: 111...110 (-2)
+ uc: 000...001 ( 1)
===================
      111...111 (-1 or UINT_MAX)
This bit representation corresponds to -1 for int and UINT_MAX for unsigned int. Thus, if the result has type int, a signed comparison is used and the less-than test is true; if the result has type unsigned int, an unsigned comparison is used and the less-than test is false.
value preserving: (i + (int)uc) < 17
unsigned preserving: (i + (unsigned int)uc) < 17
Since differing compilers chose different meanings for the same code, this expression can be ambiguous. The addition of a cast is as much to help the reader as it is to eliminate the warning message.
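The two explicit casts can be checked directly. The following sketch wraps each form in a hypothetical function (the names are not from the original text); the results follow from the bit patterns shown earlier and do not depend on the width of int.

```c
/* Signed comparison: -1 < 17 is true. */
int value_preserving_form(void)
{
    int i = -2;
    unsigned char uc = 1;

    return (i + (int)uc) < 17;
}

/* Unsigned comparison: i is converted to unsigned int, the sum is
 * UINT_MAX, and UINT_MAX < 17 is false. */
int unsigned_preserving_form(void)
{
    int i = -2;
    unsigned char uc = 1;

    return (i + (unsigned int)uc) < 17;
}
```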
Bit-fields
The same situation applies to the promotion of bit-field values. In ANSI C, if the number of bits in an int or unsigned int bit-field is less than the number of bits in an int, the promoted type is int; otherwise, the promoted type is unsigned int. In most older C compilers, the promoted type is unsigned int for explicitly unsigned bit-fields, and int otherwise.
Second Example: Same Result
In the following code, assume that both unsigned short and unsigned char are narrower than int.
int f(void)
{
    unsigned short us;
    unsigned char uc;

    return uc < us;
}
In this example, both automatics are either promoted to int or to unsigned int, so the comparison is sometimes unsigned and sometimes signed. However, the C compiler does not warn you because the result is the same for the two choices.
Integral Constants
As with expressions, the rules for the types of certain integral constants have changed. In K&R C, an unsuffixed decimal constant had type int only if its value fit in an int; an unsuffixed octal or hexadecimal constant had type int only if its value fit in an unsigned int. Otherwise, an integral constant had type long. At times, the value did not fit in the resulting type. In ANSI C, the constant type is the first type encountered in the following list that corresponds to the value:
int f(void)
{
    int i = 0;

    return i > 0xffff;
}
Assuming int is 16 bits wide, the hexadecimal constant's type is either int (with a value of -1 on a two's-complement machine) or unsigned int (with a value of 65535). The comparison is therefore true in -Xs and -Xt modes, and false in -Xa and -Xc modes.
-Xt, -Xs modes: i > (int)0xffff
-Xa, -Xc modes: i > (unsigned int)0xffff or i > 0xffffU
The U suffix character is a new feature of ANSI C and probably produces an error message with older compilers.
Tokenization and Preprocessing
Probably the least specified part of previous versions of C concerned the operations that transformed each source file from a bunch of characters into a sequence of tokens, ready to parse. These operations included recognition of white space (including comments), bundling consecutive characters into tokens, handling preprocessing directive lines, and macro replacement. However, their respective ordering was never guaranteed.
ANSI C Translation Phases
The order of these translation phases is specified by ANSI C:
Trigraph Sequence    Converts to
??=                  #
??<                  {
??-                  ~
??>                  }
??(                  [
??/                  \
??)                  ]
??'                  ^
??!                  |
/* comment *??/
/* still comment? */
/* comment */* still comment? */
Because the tokenization process within the preprocessor was a moment-by-moment operation and macro replacement was done as a character-based, not token-based, operation, the tokens and white space could have a great deal of variation during preprocessing.
There are a number of differences that arise from these two approaches. The rest of this section discusses how code behavior may change due to line splicing, macro replacement, stringizing, and token pasting, which occur during macro replacement.
#define name (*name)
causes any use of name to be replaced with an indirect reference through name. The old C preprocessor would produce a huge number of parentheses and stars and eventually produce an error about macro recursion.
Using Strings
Note - In ANSI C, the K&R-style examples in this section produce a warning about use of old features when you use the -xtransition option. Only in the transition modes (-Xt and -Xs) is the result the same as in previous versions of C.
In K&R C, the following code produced the string literal "x y!":
#define str(a) "a!"
str(x y)
#define str(a) #a "!"
str(x y)
The above code produces the two string literals "x y" and "!" which, after concatenation, produces the identical "x y!".
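As a quick check of the ANSI C form, the following sketch returns the literal built by the macro so that it can be compared against the expected text. The function name stringized is hypothetical.

```c
#define str(a) #a "!"

/* str(x y) expands to "x y" "!"; adjacent string literals are then
 * concatenated into a single literal. */
const char *stringized(void)
{
    return str(x y);
}
```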
#define CNTL(ch) (037 & 'ch')
CNTL(L)
(037 & 'L')
#define CNTL(ch) (037 & (ch))
CNTL('L')
This code is more readable and more useful, as it can also be applied to expressions.
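A short sketch of both uses of the ANSI C form of the macro follows. The function names are hypothetical, and the numeric results mentioned afterward assume an ASCII execution character set.

```c
#define CNTL(ch) (037 & (ch))

/* The macro applied to a character constant. */
int control_l(void)
{
    return CNTL('L');
}

/* The macro applied to an arbitrary expression, which the old
 * quote-substituting form could not handle. */
int control_of_sum(int base, int offset)
{
    return CNTL(base + offset);
}
```

With ASCII, control_l() yields 12 (form feed), and control_of_sum('A', 1) yields 2, the control code for 'B'.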
Token Pasting
In K&R C, there were at least two ways to combine two tokens. Both invocations in the following produced a single identifier x1 out of the two tokens x and 1.
#define self(a) a
#define glue(a,b) a/**/b
self(x)1
glue(x,1)
#define glue(a,b) a ## b
glue(x, 1)
# and ## should be used as macro substitution operators only when __STDC__ is defined. Since ## is an actual operator, the invocation can be much freer with respect to white space in both the definition and invocation.
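The following sketch shows the ## operator forming the identifier x1 in both a declaration and an expression, with extra white space in the second invocation. The function name read_x1 is hypothetical.

```c
#define glue(a,b) a ## b

int glue(x, 1) = 42;        /* declares and initializes x1 */

int read_x1(void)
{
    return glue( x , 1 );   /* white space around the arguments is ignored */
}
```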
const and volatile
The keyword const was one of the C++ features that found its way into ANSI C. When an analogous keyword, volatile, was invented by the ANSI C Committee, the "type qualifier" category was created. This category still remains one of the more nebulous parts of ANSI C.
Types, Only for lvalues
const and volatile are part of an identifier's type, not its storage class. However, they are often removed from the topmost part of the type when an object's value is fetched in the evaluation of an expression--exactly at the point when an lvalue becomes an rvalue. These terms arise from the prototypical assignment "L=R", in which the left side must still refer directly to an object (an lvalue) and the right side need only be a value (an rvalue). Thus, only expressions that are lvalues can be qualified by const or volatile or both.
Type Qualifiers in Derived Types
The type qualifiers may modify type names and derived types. Derived types are those parts of C's declarations that can be applied over and over to build more and more complex types: pointers, arrays, functions, structures, and unions. Except for functions, one or both type qualifiers can be used to change the behavior of a derived type.
const int five = 5;
int const five = 5;
const five = 5;
are identical to the above declaration in its effect.
const int *pci = &five;
*(int *)pci = 17;
If pci actually points to a const object, the behavior of this code is undefined.
extern int *const cpi;
typedef int *INT_PTR;
extern const INT_PTR cpi;
const int *const cpci;
const Means readonly
In hindsight, readonly would have been a better choice for a keyword than const. If one reads const in this manner, declarations such as
char *strcpy(char *, const char *);
are easily understood to mean that the second parameter is only used to read character values, while the first parameter overwrites the characters to which it points. Furthermore, despite the fact that in the above example, the type of pci is a pointer to a const int, you can still change the value of the object to which it points through some other means, unless it actually points to an object declared with const int type.
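The "readonly" reading can be illustrated with a small sketch. The names counter and bump_through_alias are hypothetical. The object itself is not declared const, so changing it through a second pointer is well defined even though pci only permits reads.

```c
static int counter = 5;   /* not const: only the view through pci is read-only */

int bump_through_alias(void)
{
    const int *pci = &counter;   /* may read *pci, may not assign to it      */
    int *const cpi = &counter;   /* may assign to *cpi, may not reseat cpi   */

    *cpi = *pci + 1;             /* the object changes "through other means" */
    return *pci;                 /* the new value is seen through pci        */
}
```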
Examples of const Usage
The two main uses for const are to declare large compile-time initialized tables of information as unchanging, and to specify that pointer parameters do not modify the objects to which they point.
volatile Means Exact Semantics
So far, the examples have all used const because it's conceptually simpler. But what does volatile really mean? To a compiler writer, it has one meaning: take no code generation shortcuts when accessing such an object. In ANSI C, it is a programmer's responsibility to declare every object that has the appropriate special properties with a volatile qualified type.
Examples of volatile Usage
The usual four examples of volatile objects are:
flag = 1;
while (flag)
    ;
is valid as long as flag has a volatile qualified type. Presumably, some asynchronous event sets flag to zero in the future. Otherwise, because the value of flag is unchanged within the body of the loop, the compilation system is free to change the above loop into a truly infinite loop that completely ignores the value of flag.
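A self-contained sketch of such a loop follows. It is an illustration only: the names clear_flag and spin_until_cleared are hypothetical, and the "asynchronous" event is simulated by calling raise() from inside the loop, so the handler actually runs synchronously. The point is the declaration of flag as volatile sig_atomic_t, which obliges the compiler to reread it on every iteration.

```c
#include <signal.h>

static volatile sig_atomic_t flag;

/* Signal handler: the only place where flag is cleared. */
static void clear_flag(int sig)
{
    (void)sig;
    flag = 0;
}

/* Spins until the handler clears flag; returns the number of iterations. */
int spin_until_cleared(void)
{
    int spins = 0;

    signal(SIGINT, clear_flag);
    flag = 1;
    while (flag) {
        if (++spins == 3)
            raise(SIGINT);   /* the handler runs before raise() returns */
    }
    signal(SIGINT, SIG_DFL);
    return spins;
}
```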
Multibyte Characters and Wide Characters
At first, the internationalization of ANSI C affected only library functions. However, the final stage of internationalization--multibyte characters and wide characters--also affected the language proper.
Asian Languages Require Multibyte Characters
The basic difficulty in an Asian-language computer environment is the huge number of ideograms needed for I/O. To work within the constraints of usual computer architectures, these ideograms are encoded as sequences of bytes. The associated operating systems, application programs, and terminals understand these byte sequences as individual ideograms. Moreover, all of these encodings allow intermixing of regular single-byte characters with the ideogram byte sequences. Just how difficult it is to recognize distinct ideograms depends on the encoding scheme used.
Encoding Variations
The encoding schemes fall into two camps. The first is one in which each multibyte character is self-identifying, that is, any multibyte character can simply be inserted between any pair of multibyte characters.
Wide Characters
Some of the inconvenience of handling multibyte characters would be eliminated if all characters were of a uniform number of bytes or bits. Since there can be thousands or tens of thousands of ideograms in such a character set, a 16-bit or 32-bit sized integral value should be used to hold all members. (The full Chinese alphabet includes more than 65,000 ideograms!) ANSI C includes the typedef name wchar_t as the implementation-defined integral type large enough to hold all members of the extended character set.
Conversion Functions
ANSI C provides five library functions that manage multibyte characters and wide characters:
The behavior of all of these functions depends on the current locale. (See "The setlocale() Function".)
It is expected that vendors providing compilation systems targeted to this market supply many more string-like functions to simplify the handling of wide character strings. However, for most application programs, there is no need to convert any multibyte characters to or from wide characters. Programs such as diff, for example, read in and write out multibyte characters, needing only to check for an exact byte-for-byte match. More complicated programs, such as grep, that use regular expression pattern matching, may need to understand multibyte characters, but only the common set of functions that manages the regular expression needs this knowledge. The program grep itself requires no other special multibyte character handling.
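For a program that does need the conversion, a minimal sketch using mbstowcs() follows. The wrapper name widen is hypothetical. In the initial "C" locale every character of the basic character set is a single byte, so the example behaves the same everywhere; with a multibyte locale selected, the same call would combine several bytes into one wchar_t.

```c
#include <stdlib.h>

/* Converts the multibyte string src into at most max wide characters in
 * dst. Returns the number of wide characters stored, not counting the
 * terminating null, or (size_t)-1 if src contains an invalid sequence. */
size_t widen(wchar_t *dst, const char *src, size_t max)
{
    return mbstowcs(dst, src, max);
}
```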
C Language Features
To give even more flexibility to the programmer in an Asian-language environment, ANSI C provides wide character constants and wide string literals. These have the same form as their non-wide versions, except that they are immediately prefixed by the letter L:
wchar_t *wp = L"a¥z";
wchar_t x[] = L"a¥z";
wchar_t y[] = {L'a', L'¥', L'z', 0};
wchar_t z[] = {'a', L'¥', 'z', '\0'};
In the above example, the three arrays x, y, and z, and the array pointed to by wp, have the same length. All are initialized with identical values.
Standard Headers and Reserved Names
Early in the standardization process, the ANSI Standards Committee chose to include library functions, macros, and header files as part of ANSI C. While this decision was necessary for the writing of truly portable C programs, a side effect is the basis of some of the most negative comments about ANSI C from the public--a large set of reserved names.
Balancing Process
To match existing implementations, the ANSI C committee chose names like printf and NULL. However, each such name reduced the set of names available for free use in C programs.
Standard Headers
The standard headers are:
assert.h    locale.h    stddef.h
ctype.h     math.h      stdio.h
errno.h     setjmp.h    stdlib.h
float.h     signal.h    string.h
limits.h    stdarg.h    time.h
Most implementations provide more headers, but a strictly conforming ANSI C program can only use these.
Names Reserved for Implementation Use
The Standard places further restrictions on implementations regarding their libraries. In the past, most programmers learned not to use names like read and write for their own functions on UNIX Systems. ANSI C requires that only names reserved by the Standard be introduced by references within the implementation.
_[_A-Z][0-9_a-zA-Z]*
Strictly speaking, if your program uses such an identifier, its behavior is undefined. Thus, programs using _POSIX_SOURCE (or _XOPEN_SOURCE) have undefined behavior.
Names Reserved for Expansion
In addition to all the names explicitly reserved, the Standard also reserves (for implementations and future standards) names matching certain patterns:
In the above lists, names that begin with a capital letter are macros and are reserved only when the associated header is included. The rest of the names designate functions and cannot be used to name any global objects or functions.
Names Safe to Use
There are four simple rules you can follow to keep from colliding with any ANSI C reserved names:
Locales
At any time, a C program has a current locale--a collection of information that describes the conventions appropriate to some nationality, culture, and language. Locales have names that are strings. The only two standardized locale names are "C" and "". Each program begins in the "C" locale, which causes all library functions to behave just like they have historically. The "" locale is the implementation's best guess at the correct set of conventions appropriate to the program's invocation. "C" and "" can cause identical behavior. Other locales may be provided by implementations.
The setlocale() Function
The setlocale() function is the interface to the program's locale. In general, any program that uses the invocation country's conventions should place a call such as:
#include <locale.h>
/*...*/
setlocale(LC_ALL, "");
Any of these macros can be passed as the first argument to setlocale() to specify that category.
Most programs do not need this capability.
Changed Functions
Wherever possible and appropriate, existing library functions were extended to include locale-dependent behavior. These functions came in two groups:
Those functions that write or interpret printable floating values can change to use a decimal-point character other than period (.) when the LC_NUMERIC category of the current locale is other than "C". There is no provision for converting any numeric values to printable form with thousands separator-type characters. When converting from a printable form to an internal form, implementations are allowed to accept such additional forms, again in other than the "C" locale. Those functions that make use of the decimal-point character are the printf() and scanf() families, atof(), and strtod(). Those functions that are allowed implementation-defined extensions are atof(), atoi(), atol(), strtod(), strtol(), strtoul(), and the scanf() family.
localeconv()    numeric/monetary conventions
strcoll()       collation order of two strings
strxfrm()       translate string for collation
strftime()      formatted date/time conversion
In addition, there are the multibyte functions mblen(), mbtowc(), mbstowcs(), wctomb(), and wcstombs().
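As an illustration of how the locale-dependent decimal-point character can be inspected, here is a minimal sketch built on setlocale() and localeconv(). The wrapper name decimal_point_for is hypothetical. Only the "C" and "" locale names are standardized, so any other name passed in may be rejected by a given implementation.

```c
#include <locale.h>
#include <stddef.h>

/* Selects the named locale for LC_NUMERIC and returns its decimal-point
 * string, or NULL if the implementation does not support that locale. */
const char *decimal_point_for(const char *locale_name)
{
    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return NULL;
    return localeconv()->decimal_point;
}
```

In the "C" locale the returned string is ".". Note that the call changes the program's LC_NUMERIC category as a side effect.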
Grouping and Evaluation in Expressions
One of the choices made by Dennis Ritchie in the design of C was to give compilers a license to rearrange expressions involving adjacent operators that are mathematically commutative and associative, even in the presence of parentheses. This is explicitly noted in the appendix of The C Programming Language by Kernighan and Ritchie. However, ANSI C does not grant compilers this same freedom.
int i, *p, f(void), g(void);
/*...*/
i = *++p + f() + g();
Definitions
The side effects of an expression are its modifications to memory and its accesses to volatile qualified objects. The side effects in the above expression are the updating of i and p and any side effects contained within the functions f() and g().
The K&R C Rearrangement License
The K&R C rearrangement license applies to the above expression because addition is mathematically commutative and associative. To distinguish between regular parentheses and the actual grouping of an expression, the left and right curly braces designate grouping. The three possible groupings for the expression are:
i = { {*++p + f()} + g() };
i = { *++p + {f() + g()} };
i = { {*++p + g()} + f() };
i = *++p + (f() + g());
i = (g() + *++p) + f();
If this expression is evaluated on an architecture for which either overflows cause an exception, or addition and subtraction are not inverses across an overflow, these three groupings behave differently if one of the additions overflows.
i = *++p; i += f(); i += g();
i = f(); i += g(); i += *++p;
i = *++p; i += g(); i += f();
The ANSI C Rules
ANSI C does not allow operations to be rearranged that are mathematically commutative and associative, but that are not actually so on the target architecture. Thus, the precedence and associativity of the ANSI C grammar completely describes the grouping for all expressions; all expressions must be grouped as they are parsed. The expression under consideration is grouped in this manner:
i = { {*++p + f()} + g() };
This code still does not mean that f() must be called before g(), or that p must be incremented before g() is called.
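Floating-point addition is a common case in which an operation is mathematically associative but not actually so on the target. The following sketch, which assumes IEEE 754 double arithmetic and uses hypothetical function names, shows two groupings of the same three operands producing different results. This is why ANSI C requires the grouping given by the grammar to be respected.

```c
/* Grouped as the ANSI C grammar parses a + b + c. */
double left_grouped(double a, double b, double c)
{
    return (a + b) + c;
}

/* The regrouping that the K&R rearrangement license would have allowed. */
double right_grouped(double a, double b, double c)
{
    return a + (b + c);
}
```

With a = 1e20, b = -1e20, and c = 1.0, the first grouping yields 1.0. The second yields 0.0, because 1.0 is too small to change -1e20 when the two are added first.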
The Parentheses
ANSI C is often erroneously described as honoring parentheses or evaluating according to parentheses due to an incomplete understanding or an inaccurate presentation.
i = (((*(++p)) + f()) + g());
with no different effect on its grouping or evaluation.
The As If Rule
There were several reasons for the K&R C rearrangement rules:
Thus, all the binary bitwise operators (other than shifting) are allowed to be rearranged on any machine because there is no way to notice such regroupings. On typical two's-complement machines in which overflow wraps around, integer expressions involving multiplication or addition can be rearranged for the same reason.
Therefore, this change in C does not have a significant impact on most C programmers.
Incomplete Types
There are only three variations of incomplete types: void, arrays of unspecified length, and structures and unions with unspecified content. The type void differs from the other two in that it is an incomplete type that cannot be completed, and it serves as a special function return and parameter type.
An incomplete structure or union type is completed by specifying the content in a following declaration in the same scope for the same tag.
Since array and function parameter types are rewritten to be pointer types, a seemingly incomplete array parameter type is not actually incomplete. The typical declaration of main's argv, namely, char *argv[], as an unspecified length array of character pointers, is rewritten to be a pointer to character pointers.
void *p
&*p is a valid subexpression that makes use of this.
Justification
Why are incomplete types necessary? Ignoring void, there is only one feature provided by incomplete types that C has no other way to handle, and that has to do with forward references to structures and unions. If one has two structures that need pointers to each other, the only way to do so is with incomplete types:
struct a { struct b *bp; };
struct b { struct a *ap; };
All strongly typed programming languages that have some form of pointer and heterogeneous data types provide some method of handling this case.
Examples
Defining typedef names for incomplete structure and union types is frequently useful. If you have a complicated bunch of data structures that contain many pointers to each other, having a list of typedefs to the structures up front, possibly in a central header, can simplify the declarations.
typedef struct item_tag Item;
typedef union note_tag Note;
typedef struct list_tag List;
. . .
struct item_tag { . . . };
. . .
struct list_tag {
    List *next;
    . . .
};
Moreover, for those structures and unions whose contents should not be available to the rest of the program, a header can declare the tag without the content. Other parts of the program can use pointers to the incomplete structure or union without any problems, unless they attempt to use any of its members.
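A minimal sketch of this opaque-type technique follows, with the "header" and "implementation" parts shown in one file for brevity. All names (Counter, counter_get, counter_next) are hypothetical.

```c
/* What a header would expose: the tag is declared, the content is not. */
typedef struct counter Counter;
Counter *counter_get(void);
int counter_next(Counter *c);

/* What only the implementation file would contain: the type is
 * completed here, so members can be used from this point on. */
struct counter {
    int value;
};

static Counter the_counter;

Counter *counter_get(void)
{
    return &the_counter;
}

int counter_next(Counter *c)
{
    return ++c->value;
}
```

Code that sees only the first three declarations can store and pass Counter pointers freely, but an attempt to write c->value there would be rejected because the type is still incomplete at that point.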
Compatible and Composite Types
With K&R C, and even more so with ANSI C, it is possible for two declarations that refer to the same entity to be other than identical. The term "compatible type" is used in ANSI C to denote those types that are "close enough". This section describes compatible types as well as "composite types"--the result of combining two compatible types.
Multiple Declarations
If a C program were only allowed to declare each object or function once, there would be no need for compatible types. Linkage, which allows two or more declarations to refer to the same entity, function prototypes, and separate compilation all need such a capability. Separate translation units (source files) have different rules for type compatibility from within a single translation unit.
Separate Compilation Compatibility
Since each compilation probably looks at different source files, most of the rules for compatible types across separate compiles are structural in nature:
The compatible types are defined recursively. At the bottom are type specifier keywords. These are the rules that say that unsigned short is the same as unsigned short int, and that a type without type specifiers is the same as one with int. All other types are compatible only if the types from which they are derived are compatible. For example, two qualified types are compatible if the qualifiers, const and volatile, are identical, and the unqualified base types are compatible.
int *const cpi;
int *volatile vpi;
declare two differently qualified pointers to the same type, int.
Compatible Array Types
For two array types to be compatible, their element types must be compatible. If both array types have a specified size, the sizes must match; that is, an incomplete array type (see "Incomplete Types") is compatible both with another incomplete array type and with an array type that has a specified size.
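The array rule can be seen in a short sketch in which an incomplete array declaration is followed by a compatible declaration that supplies the size. The names table and table_length are hypothetical.

```c
#include <stddef.h>

extern int table[];              /* incomplete array type: size unspecified */
int table[4] = { 1, 2, 3, 4 };   /* compatible declaration; completes it    */

/* sizeof is permitted here because the type is now complete. */
size_t table_length(void)
{
    return sizeof table / sizeof table[0];
}
```

A later declaration such as extern int table[5]; would be diagnosed, because two specified sizes must match.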
Compatible Function Types
To make functions compatible, follow these rules:
Another interesting note is that each enumeration type must be compatible with some integral type. For portable programs, this means that enumeration types are separate types. In general, the ANSI C standard views them in that manner.