See Conversions
I recently[1] stumbled across an interesting C function. The function was quite simple (just a few lines long), so I was sure I understood it. And yet, the function produced unexpected results, and I could not understand why. Debugging this tiny function led me down the dark hole that is implicit type conversions in C. I have reproduced the function in its entirety below (with a 2-line modification to make the output more readable for this post).
```c
// requires bytes+0 to bytes+sz are valid memory addresses
void print_hex(char *bytes, size_t sz) {
    printf("{"); // added for readability
    for (unsigned int i = 0; i < sz; i++) {
        printf("%02x", bytes[i]);
    }
    printf("}\n"); // added for readability
}
```
To avoid making this post too long,[2] I assume the reader is familiar with hexadecimal notation and C language string escape sequences.
At first glance, we can see that this function will misbehave if the memory region argument is invalid. But if we pass an invalid region, that's on us: we failed to respect the contract of this function.[3]
Here is a complete example program containing this function that applies it to a few test vectors.
```c
#include <stdio.h>
#include <string.h>

// requires bytes+0 to bytes+sz are valid memory addresses
void print_hex(char *bytes, size_t sz) {
    printf("{"); // added for readability
    for (unsigned int i = 0; i < sz; i++) {
        printf("%02x", bytes[i]);
    }
    printf("}\n"); // added for readability
}

int main(void) {
    print_hex("", 0);     // empty string
    print_hex("\x00", 1); // singleton; smallest possible byte value
    print_hex("\x01", 1); // singleton; smallest + 1
    print_hex("\xff", 1); // singleton; largest possible byte value
    print_hex("\xfe", 1); // singleton; largest - 1
}
```
What text do you expect this function will print? To avoid confusion, let’s assume we’re running this program on a standard x86-64 Linux box.
Take your time and think about it for a second.
If you need a hint, you can consult my mini-explainer of format strings by clicking here.
An altogether-too-brief introduction to format strings:
- The `printf` family of functions is designed to print formatted text to a buffer or file.
- Their first argument is always a format string that contains zero or more conversion specifications.
- A conversion specification tells `printf` how to process a particular argument.
- If the format string has N conversion specifications, the `printf` call should have N arguments (excluding the format string itself).
Example:
```c
char *player = "Bob";
unsigned int pts = 101;
printf("%s scored %u point(s)!\n", player, pts);
// prints: "Bob scored 101 point(s)!"
```
Our example had two conversion specifications:

- `%s`, which prints its corresponding argument as a string;
- `%u`, which prints its corresponding argument as an unsigned decimal integer.
In general, conversion specifications have the following structure:

```
%[flags][width][precision][typesize]<conversion>
```

where:

- a literal `%` symbol appears first;
- arguments in angle (resp. square) brackets are required (resp. optional).
In our example program above, the conversion specification `%02x` can be parsed as follows:

```
%02x
|||`- converts argument to an unsigned int and prints it as hexadecimal
||`-- minimum width 2 characters
|`--- optional flag 0 --- pad result with leading zeroes
`---- this symbol introduces a conversion specification
```
So, to summarize, this conversion specification will:

- `x`: convert its argument to an `unsigned int` and print it as hexadecimal;
- `2`: using a minimum of 2 characters;
- `0`: where the padding character is a `0` instead of a space.
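To make the padding behavior concrete, here is a minimal sketch (the expected output is noted in the comments) showing `%02x` on small, non-negative values:

```c
#include <stdio.h>

int main(void) {
    printf("%02x\n", 0x5u);  // prints "05": padded to width 2 with '0'
    printf("%02x\n", 0xffu); // prints "ff": already 2 characters wide
    printf("%04x\n", 0xffu); // prints "00ff": minimum width 4
    return 0;
}
```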
When you're ready, click here to see what text this program prints.
```
{}
{00}
{01}
{ffffffff}
{fffffffe}
```
Finally, if you want an explanation of this behavior, read on!
So, first things first: while the first three printed lines matched my expectations, the last two took me completely by surprise. At first glance, it looks like the last two function calls, each intended to print out a single byte, somehow printed out four bytes. But how could that be?
The ultimate explanation for this (unexpected to me) behavior boils down to the fact that the behavior of this C program is implementation-defined![4]
Why does this happen? This unusual behavior ultimately arises from four interrelated factors:

- **Two's complement encoding**

  All mainstream, modern, general-purpose computer architectures have binary memory (i.e., the smallest unit of memory is either 0 or 1) and represent numbers in binary using the two's complement encoding.

  Essentially, in this encoding, the most significant bit of a signed numeric type (called the sign bit) carries a negated weight, while all other bits carry positive weights (as in unsigned integers).

- **The signedness of type `char`**

  While modern programmers may not think this way, the C language standard actually defines multiple, distinct `char` types with different signedness!

- **Default argument promotions**

  This mouthful of a phrase refers to a somewhat arcane corner of the C standard: when calling a variadic function, what type should its variadic parameters have in the function's body?

  For example, given the variadic function `int printf(const char*, ...)`, suppose we make the function call `printf("%s scored %u point(s)!\n", player, pts)`. In the body of `printf`, what type should the formal parameter binding argument `player` take on? What about `pts`?

- **Signed-to-unsigned integer conversion**

  Given a signed integer value `i1` of type `t1` to be converted into unsigned type `t2` with bit-width `b > 0`, the process works as follows:

  - If `i1` is within the range `[0, 2^b)`, then the result is `i2 = i1`;
  - Otherwise, the result is `i2 = i1 + x*2^b` for the unique[5] integer `x` such that `0 <= i2 < 2^b`.
We’ll discuss each of these factors in detail below.
Two’s Complement Encoding
There isn’t much more to say about this encoding that hasn’t already been said.
Thus, in this section, we'll focus on a small example: all 3-bit sequences, written in big-endian notation (i.e., with the most significant bit written first), and their two's complement values:
| Binary | Value |
|---|---|
| `011` | 3 |
| `010` | 2 |
| `001` | 1 |
| `000` | 0 |
| `111` | -1 |
| `110` | -2 |
| `101` | -3 |
| `100` | -4 |
If this is all gibberish to you, you can read more here.
To understand these values, there are a few important things to note:

- As I mentioned above, we negate the value of the first (leftmost) bit.
- Since this is a big-endian binary string, we label each bit position, starting from the right, as position 0, 1, 2, and so forth.
- Then the value of a bit `b` in position `i` is equal to `b * 2^i`.
- Finally, the value of an entire string is equal to the sum of the values of each of its bits.
Let's stick with 3-bit strings like the ones we used in the table above.
Given a 3-bit string b2 b1 b0, its value will be equal to:
-(b2 * 2^2) + (b1 * 2^1) + (b0 * 2^0)
Again, note that we negated the value of the most significant bit.
Using this template, we’ll work through one simple example.
By applying this template to concrete bit string 100, we obtain the expression:
-(1 * 2^2) + (0 * 2^1) + (0 * 2^0)
Simplifying the exponents, we get:
-(1 * 4) + (0 * 2) + (0 * 1)
Simplifying the multiplications gives:
-(4) + (0) + (0)
Finally, simplifying the additions gives -4, as we claimed in our table above.
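If you'd like to verify the whole table mechanically, here is a small C sketch that applies the same template to every 3-bit string:

```c
#include <stdio.h>

int main(void) {
    // enumerate all 3-bit strings b2 b1 b0
    for (unsigned b = 0; b < 8; b++) {
        unsigned b2 = (b >> 2) & 1;
        unsigned b1 = (b >> 1) & 1;
        unsigned b0 = b & 1;
        // two's complement: the most significant bit's weight is negated
        int value = -(int)(b2 * 4) + (int)(b1 * 2) + (int)(b0 * 1);
        printf("%u%u%u -> %2d\n", b2, b1, b0, value);
    }
    return 0;
}
```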
Signedness of char
How many versions of char are there?
In fact, there are three:
| Type | Numeric Range[6] |
|---|---|
| `unsigned char` | [0, 255] |
| `signed char` | [-128, 127] |
| `char` | either [0, 255] or [-128, 127] |
However, on common platforms, `char` and `signed char` have the same range.
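You can check which behavior your own platform picked with a minimal sketch using the `<limits.h>` macros:

```c
#include <stdio.h>
#include <limits.h>

int main(void) {
    // CHAR_MIN is negative when char is signed and 0 when it is unsigned
    printf("char range: [%d, %d]\n", CHAR_MIN, CHAR_MAX);
    printf("char is %s\n", CHAR_MIN < 0 ? "signed" : "unsigned");
    return 0;
}
```

On a standard x86-64 Linux box, this prints `[-128, 127]` and `signed`.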
But why three separate char types?
These various types map to three distinct ways that `char`s are used in C:

| Designated `char` variants | Usage |
|---|---|
| `signed char`, `unsigned char` | signed/unsigned number (arithmetic) |
| `unsigned char` | smallest uninterpreted unit of addressable memory[7] |
| `char` | character in an encoded string |
Essentially, this extra type variant `char` is intended to represent string data.
Modern systems languages may have a fourth variant explicitly for uninterpreted memory access to help keep things clear.
For example, in Golang, we have the following types:
| Type | Usage |
|---|---|
| `uint8`, `int8` | signed/unsigned number (arithmetic) |
| `byte` | alias for `uint8` and smallest uninterpreted unit of addressable memory |
| `rune` | character in an encoded string |
Coming back to C, note that other[8] integer types can only be used for arithmetic purposes; they cannot be used directly for representing strings[9] or memory access[10].
| Signed Variant | Unsigned Variant |
|---|---|
| `short` | `unsigned short` |
| `int` | `unsigned int` |
| `long` | `unsigned long` |
| `long long` | `unsigned long long` |
Default Argument Promotions
Before we can define default argument promotions, we have to dive into the details of variadic functions. For brevity, we'll skip some of the gory details; see this cppreference page to get the full story.
Variadic Functions
According to the C standard, variadic functions are declared by adding an ellipsis to the end of the formal parameter set. For example, the declaration:
```c
int printf(const char*, ...);
```

describes a function with:

- exactly one required parameter of type `const char*`;
- zero or more variadic parameters of unspecified type.
To work with variadic parameters, there are four macros that are generally used:

- `va_list` is the type of a variadic parameter set;
- `va_start(va_list, identifier)` initializes the variadic parameter set;
- `va_arg(va_list, T)` returns the value of the next variadic parameter (which must be compatible[11] with `T`, or else the result is undefined);
- `va_end(va_list)` finalizes the variadic parameter set.
Let’s see an example program below:
```c
#include <stdio.h>
#include <stdarg.h>

int variadic_sum(unsigned int cnt, ...) {
    int sum = 0;
    // declare the variable that stores the variadic parameter set
    va_list args;
    // initialize the variadic parameter set;
    // the second argument should be the name of the last
    // non-variadic formal parameter; in this case, cnt
    va_start(args, cnt);
    for (unsigned int i = 0; i < cnt; i++) {
        // fetch the value of the next variadic parameter
        // (in this case, it must be compatible with type int)
        int elt = va_arg(args, int);
        sum += elt;
    }
    // finalize the variadic parameter set
    va_end(args);
    return sum;
}

// should print:
// Sums were 15, 7, 0, 0
int main(void) {
    int sum1 = variadic_sum(5, 1, 2, 3, 4, 5);
    int sum2 = variadic_sum(2, -10, 17);
    int sum3 = variadic_sum(0);
    int sum4 = variadic_sum(0, -10, 17);
    printf("Sums were %d, %d, %d, %d\n", sum1, sum2, sum3, sum4);
    return 0;
}
```
There are some restrictions on the use of these macros; otherwise, the behavior of the program is undefined:

- the first argument to `va_start`, `va_arg`, and `va_end` must have type `va_list`;
- `va_arg` can only be called after (resp. before) a corresponding call to `va_start` (resp. `va_end`);
- `va_arg` may not be called more than *n* times, where *n* is the number of variadic arguments;
- if the *i*th call to `va_arg` has the form `va_arg(args, T)`, then `T` must be compatible with the promoted type of the *i*th variadic argument.
This last point is the one we are interested in.
Variadic Parameters and Default Argument Promotions
Consider the following call to `variadic_sum`:

```c
int first_arg = 1;
float second_arg = 2.0;
int sum = variadic_sum(2, first_arg, second_arg);
```
Is it well-defined? The answer is no. Since the first argument is `cnt == 2`, the `va_arg` macro will be called exactly twice, but on the second call to `va_arg(args, int)`, the second variadic parameter has the value `2.0`, which, as a floating-point value, is not compatible with type `int`.
But there is a missing detail here.
When a programmer declares some function `f`, for each normal function parameter `p`, the programmer must specify its type `T`.

When compiling `f`, for each such parameter `p`, the compiler must allocate a chunk of memory `M` (which may be a hardware register) that is large enough to store the value of `p`.

How does the compiler know if the chunk `M` is large enough? It simply requires that `sizeof(M) >= sizeof(T)`; that is to say, it consults the type `T` of `p` to determine its maximum size.

But when we declare a variadic function such as `int variadic_sum(unsigned int cnt, ...)`, we aren't told what the possible types of each variadic parameter might be.
Essentially, we have two options:

- Allocate a fixed-size chunk `M` that is large enough for most values, and either report a compiler error or permit undefined behavior if `sizeof(T) > sizeof(M)`;
- Dynamically allocate a memory chunk `M` large enough to fit the parameter type `T`.
The C standard adopts the latter rule, but with a twist: it applies default argument promotions to variadic arguments.

Essentially, default argument promotions are a way to reduce the number of possible cases to consider when dynamically allocating memory chunks `M`.[12]
The rule breaks down into four cases, which are applied in order:

- if a variadic argument has an arithmetic type and its range fits inside the range of type `int`, it is upcast to `int`;
- otherwise, if it has an arithmetic type and its range fits inside the range of type `unsigned int`, it is upcast to `unsigned int`;
- otherwise, if it has type `float`, it is upcast to type `double`;
- otherwise, its type is unchanged.
So, coming back to our example:

```c
int first_arg = 1;
float second_arg = 2.0;
int sum = variadic_sum(2, first_arg, second_arg);
```
Due to the default argument promotion rule, on the second call to `va_arg(args, int)`, the variadic parameter which binds the value `second_arg` actually has type `double` (even though `second_arg` has type `float`).

Since the type `double` is not compatible with type `int`, the behavior of this program is undefined.
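To observe the promotion directly (in a well-defined way this time), here is a sketch with a hypothetical helper `show_promoted` that passes a `char` and a `float` and reads them back with their promoted types, `int` and `double`:

```c
#include <stdio.h>
#include <stdarg.h>

// reads one promoted int and one promoted double from the variadic list
void show_promoted(int unused, ...) {
    va_list args;
    va_start(args, unused);
    int c = va_arg(args, int);       // the char argument arrives as an int
    double f = va_arg(args, double); // the float argument arrives as a double
    va_end(args);
    printf("char -> int: %d, float -> double: %f\n", c, f);
}

int main(void) {
    char ch = 'A';   // promoted to int at the call site
    float fl = 2.5f; // promoted to double at the call site
    show_promoted(0, ch, fl);
    return 0;
}
```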
Signed-to-Unsigned Integer Conversion
This rule may be the simplest among those discussed above (especially for those familiar with modular arithmetic). Let’s recall the rule we saw above.
Given a signed integer value `i1` of type `t1` to be converted into unsigned type `t2` with bit-width `b > 0`, the process works as follows:

- If `i1` is within the range `[0, 2^b)`, then the result is `i2 = i1`;
- Otherwise, the result is `i2 = i1 + x*2^b` for the unique integer `x` such that `0 <= i2 < 2^b`.
Since this rule is fairly mechanical, we’ll just show some examples here.
For our examples, suppose we have a standard modern C platform where `CHAR_BIT == 8` and `sizeof(int) == 4` (i.e., each `int` is 32 bits wide).
Then the types below would have the following ranges:
| Type | Range |
|---|---|
| `signed char` | [-2^7, 2^7) or [-128, 128) |
| `unsigned char` | [0, 2^8) or [0, 256) |
| `int` | [-2^31, 2^31) or [-2147483648, 2147483648) |
| `unsigned int` | [0, 2^32) or [0, 4294967296) |
Recall that `signed int` is just another name for `int`; `int` is always a signed type, so it needs no separate row.
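You can confirm these ranges on your own machine via the `<limits.h>` macros (a quick sketch; the values will match the table above on the 32-bit `int` platform we assumed):

```c
#include <stdio.h>
#include <limits.h>

int main(void) {
    printf("signed char:   [%d, %d]\n", SCHAR_MIN, SCHAR_MAX);
    printf("unsigned char: [0, %u]\n", (unsigned)UCHAR_MAX);
    printf("int:           [%d, %d]\n", INT_MIN, INT_MAX);
    printf("unsigned int:  [0, %u]\n", UINT_MAX);
    return 0;
}
```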
Now let's list a few `signed char`/`int` to `unsigned char` conversions:

| Input Type | Input | Output |
|---|---|---|
| `signed char` | 127 | 127 |
| `signed char` | 2 | 2 |
| `signed char` | 1 | 1 |
| `signed char` | 0 | 0 |
| `signed char` | -1 | 255 |
| `signed char` | -2 | 254 |
| `signed char` | -127 | 129 |
| `signed char` | -128 | 128 |
| `int` | 258 | 2 |
| `int` | 257 | 1 |
| `int` | 256 | 0 |
| `int` | 255 | 255 |
| `int` | 128 | 128 |
| `int` | -127 | 129 |
| `int` | -128 | 128 |
| `int` | -129 | 127 |
| `int` | -2147483648 | 0 |
And now consider a few `signed char`/`int` to `unsigned int` conversions:

| Input Type | Input | Output |
|---|---|---|
| `signed char` | 127 | 127 |
| `signed char` | 2 | 2 |
| `signed char` | 1 | 1 |
| `signed char` | 0 | 0 |
| `signed char` | -1 | 4294967295 |
| `signed char` | -2 | 4294967294 |
| `signed char` | -127 | 4294967169 |
| `signed char` | -128 | 4294967168 |
| `int` | 2147483647 | 2147483647 |
| `int` | 128 | 128 |
| `int` | 127 | 127 |
| `int` | 2 | 2 |
| `int` | 1 | 1 |
| `int` | 0 | 0 |
| `int` | -1 | 4294967295 |
| `int` | -2 | 4294967294 |
| `int` | -127 | 4294967169 |
| `int` | -128 | 4294967168 |
| `int` | -129 | 4294967167 |
| `int` | -2147483648 | 2147483648 |
I leave double-checking the results in this table as an exercise for the reader.
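If you would rather let the machine do the double-checking, here is a small sketch that spot-checks a few rows from each table with `assert` (again assuming a 32-bit `int`):

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

int main(void) {
    // signed char -> unsigned char
    assert((unsigned char)(signed char)-1 == 255);
    assert((unsigned char)(signed char)-128 == 128);
    // int -> unsigned char: add/subtract 2^8 until the result is in [0, 256)
    assert((unsigned char)257 == 1);
    assert((unsigned char)-129 == 127);
    // int -> unsigned int: add 2^32 to negative inputs
    assert((unsigned int)-1 == 4294967295u);
    assert((unsigned int)INT_MIN == 2147483648u);
    printf("all conversions match the tables\n");
    return 0;
}
```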
One important point to note from these rules is that the result of the conversion only depends on the input value and the output type; the input type is irrelevant.
(To the interested reader) I found the following questions useful to clarify how a machine might implement the conversions above:
- Which conversions above (if any) preserve the bit-level representation of the input value (i.e., the input and output values are the exact same bit string)?
- Which conversions above (if any) only modify the bit-level representation of the input value by adding/removing leading zeroes?
Putting It All Together
We now have enough details to piece together why our original program (copied below for your convenience) behaves the way it does. As above, let’s assume that we’re running this program on a standard x86-64 Linux box.
```c
// requires bytes+0 to bytes+sz are valid memory addresses
void print_hex(char *bytes, size_t sz) {
    printf("{"); // added for readability
    for (unsigned int i = 0; i < sz; i++) {
        printf("%02x", bytes[i]);
    }
    printf("}\n"); // added for readability
}
```
For each call to `printf` in the loop, the following type conversions are performed on the argument `bytes[i]`:

- **Conversion Rule:** N/A
  **Result Type:** `char`
  **Value Changed?** No

  As we haven't done any work at step 0, the type of `bytes[i]` is still `char`. But recall that on x86-64 Linux, `char` is a signed type!

- **Conversion Rule:** Default argument promotion
  **Result Type:** `int`
  **Value Changed?** No

  By default argument promotion, the type is widened from `char` to `int`. Note that, by definition of default argument promotion, the argument value never actually changes; only its type does.

- **Conversion Rule:** Signed-to-unsigned integer conversion
  **Result Type:** `unsigned int`
  **Value Changed?** Yes, but only when `bytes[i] < 0`

  According to the definition of `printf`, the conversion specification `%02x` contains the base conversion `x`, which takes an `unsigned int` argument and prints it as hexadecimal. This means `printf` will take our promoted argument `(int) bytes[i]` and convert it again to `unsigned int` before printing in hexadecimal. Thus, we must apply the signed-to-unsigned integer conversion rule. However, recall that this rule depends only on the input value and the output type. Since default argument promotion never changes the input value, when `bytes[i] < 0`, the result of the conversion is given by the equation below:

  (†) `bytes[i] + 1 * (2^32)`

  Of course, if `bytes[i] >= 0`, then the result is unchanged, i.e., just `bytes[i]`.
So, from this analysis, we observe that only the final conversion rule will change the value of the argument.
But, to understand the precise output value obtained from equation (†) on our test vectors, we need to determine how the C platform will interpret the input `char*` literal.
To do this for our test vectors, we need to:
- convert escape sequences into bit strings;
- decode bit strings into decimal values using the two’s complement scheme.
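For example, on a platform where `char` is signed (like our x86-64 Linux box), the test vector `"\xff"` holds the bit string `11111111`, which decodes to -1 under two's complement; a tiny sketch to confirm:

```c
#include <stdio.h>

int main(void) {
    char c = "\xff"[0]; // the byte 11111111
    // char is signed on x86-64 Linux, so two's complement decoding gives -1
    printf("%d\n", c); // prints: -1
    return 0;
}
```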
To speed things along, I’ve gone ahead and interpreted the input values and then applied the conversion rules in the table below.
| Test Vector | Input Value | Converted Value | Output String |
|---|---|---|---|
| `""` | N/A | N/A | `{}` |
| `"\x00"` | 0 | 0 | `{00}` |
| `"\x01"` | 1 | 1 | `{01}` |
| `"\xff"` | -1 | -1 + 2^32 | `{ffffffff}` |
| `"\xfe"` | -2 | -2 + 2^32 | `{fffffffe}` |
Now, we can plainly see why the final two values printed such large output strings—the converted value expressions contain an additional 2^32 term!
So What Now?
We’ve finally identified the problem. How do we fix it? There are two basic solutions:
- **Change the input type**

  In our function `print_hex`, we can change the type of `bytes` from `char*` to `unsigned char*`. If we do that, firstly, the interpretation of `"\xff"` and `"\xfe"` will change to 255 and 254, respectively. Then, default argument promotion will change the type from `unsigned char` to `int` but leave the value unchanged. Finally, the `x` conversion specification will convert that value to `unsigned int`, which, since the value is non-negative, also leaves it unchanged. Thus, the final value will be identical to the initial value.
Change The Conversion Specification Typesize
In this approach, we leave the type of input
bytesunchanged fromchar*and instead rely on more advancedprintffeatures. As it turns out, theprintffunction includes a wide variety of type size modifiers. If we change ourprintfinvocation to:printf("%02hhx", bytes[i]);This will force
printfto cast input argumentbyte[i]to typeunsigned char. And, if you manually apply the signed-to-unsigned conversion rules above, you will see that this gives us the intended result.
After making either change, our program will behave as expected. That is, our test vectors will produce the following table:
| Test Vector | Input Value | Converted Value | Output String |
|---|---|---|---|
| `""` | N/A | N/A | `{}` |
| `"\x00"` | 0 | 0 | `{00}` |
| `"\x01"` | 1 | 1 | `{01}` |
| `"\xff"` | 255 (or) -1 | 255 | `{ff}` |
| `"\xfe"` | 254 (or) -2 | 254 | `{fe}` |
So which of these solutions is best?
In my opinion, solution (1) is the clear winner, since it more properly communicates intent.
As I see it, the crux of this whole mess was a conflation of the `char` and `unsigned char` types and how they are used to represent strings and uninterpreted memory regions.

In that light, assigning type `unsigned char*` to `bytes` clearly communicates that this value should be understood as a sequence of uninterpreted bytes (at least on platforms where `CHAR_BIT == 8`), as opposed to a string containing human-language text.
Epilogue
The convenience of C’s widely cloned printf function belies the language’s truly low-level nature.
Since at least the 1980s, the C language has been referred to as a "portable assembly language," and for good reason.
When programming in C, to avoid unexpected pitfalls, one must be acutely aware of a whole slew of implicit behaviors (some of which are implementation-defined or even undefined).
Otherwise, you too may find yourself falling into such a pit (as I did), dazed and confused, wondering how it ever got so dark.
---

1. Well, it was a fairly recent event when I started writing this blog post. However, my initial attempts to write this post fizzled out, and it sat dormant for the better part of a year before I finally mustered up the willpower to finish it.
2. Unfortunately, by the time this post is published, I fear it will, in fact, be too long.
3. To the typed-language enthusiast: even though the C language is incapable of enforcing the usage contract for this function, that doesn't mean the contract doesn't exist; at some point, when interfacing with real systems, we will encounter usage contracts which cannot be enforced (or even checked) in our language.
4. However, for all common operating systems (Linux, BSD/macOS, Windows) on all common CPU architectures, the printed text will match what was shown above.
5. To see that this value is unique, a first hint is to observe that for any value in the range `[0, 2^b)`, adding or subtracting `2^b` will produce a value outside of this range. If you want to go deeper, you can read about modular arithmetic.
6. Actually, I fudged the truth slightly; the C language doesn't require the `char` variants to occupy exactly 8 bits. Their bit width is defined as the value of the macro `CHAR_BIT`, where `CHAR_BIT >= 8`. In fact, the `sizeof()` operator in C does not return the number of bytes that comprise an object; it returns the number of `char`s, so that `sizeof(char) == 1` regardless of architecture. In other words, on a platform where `CHAR_BIT != 8`, manipulating values narrower than `CHAR_BIT` may require bit-shifting, etc. (just as we must do for 4-bit nibbles when printing bytes as hexadecimal).
7. One may wonder why `unsigned char` should be preferred as a unit of uninterpreted memory over the other character types. The main reasons are that (a) right-shifting (`>>`) a signed value has implementation-defined behavior; and (b) when widening a signed arithmetic type, C performs sign extension, which differs from the normal semantics of memory accesses. Similarly, one may wonder why the C standard sometimes prefers a pair of `void*` and `size_t` for representing a blob of memory instead of `unsigned char*` and `size_t`. Actually, both choices are sensible, but they signal different intent. While an `unsigned char*` represents a sequence of uninterpreted memory units (which are usually 8-bit bytes), a `void*` represents a value of unknown type. As an example, we may use `malloc` to allocate memory for a `struct s { int a; int b; };` by writing `malloc(sizeof(struct s))`, and it will return a `void*`. We then think of that `void*` as pointing to a single object of type `struct s`. If we cast that `void*` to `unsigned char*`, that is like viewing the `s` object as a raw sequence of uninterpreted memory units.
8. Again, I'm fudging the truth here a bit because there are other integer types, but we won't discuss them here.
9. The fact that non-`char` types cannot represent strings in C is a convention informed by the fact that the original character sets used by computers in the United States had just 7 or 8 bits; this meant that each representable character on the machine could be stored in a single `char` value. However, as computers began to be developed and used globally, the number of bits required to encode a single language character expanded beyond just 7 or 8. Unfortunately, at that point, it was not possible to change the meaning of `char` without breaking a large number of C programs. To remedy this issue, a wider `char`-like type (appropriately called `wchar_t`) was introduced to represent characters in languages that required more bits.
10. For the full story on why these integer types cannot be used for memory access, an interested reader can consult the C language aliasing rules. The key point is that, by restricting the use of these types to just arithmetic purposes, the compiler can apply certain optimizations that would otherwise not be possible.
11. The C language says types `T1` and `T2` are compatible if a value of one type can be used where the other type is expected. While there is some subtlety in the actual definition, for our purposes, it's enough to assume that `T1` and `T2` are the same type.
12. As far as I can tell, this rule was put in place to simplify the job of compiler writers.