See Conversions

I recently1 stumbled across an interesting C function. The function was quite simple (just a few lines long), so I was sure I understood it. And yet, the function produced unexpected results, and I could not understand why. Debugging this tiny function led me down the dark hole that is implicit type conversions in C. I have reproduced the function in its entirety below (with a 2-line modification to make the output more readable for this post).

// requires bytes+0 to bytes+sz are valid memory addresses
void print_hex(char *bytes, size_t sz) {
  printf("{"); // added for readability
  for (unsigned int i = 0; i < sz; i++) {
    printf("%02x", bytes[i]);
  }
  printf("}\n"); // added for readability
}

To avoid making this post too long2, I assume the reader is familiar with hexadecimal notation and C language string escape sequences.

At first glance, we can see that this function will misbehave if the memory region argument is invalid. But if we pass an invalid region, that’s on us: we failed to respect the contract of this function.3

Here is a complete example program containing this function that applies it to a few test vectors.

#include <stdio.h>
#include <string.h>

// requires bytes+0 to bytes+sz are valid memory addresses
void print_hex(char *bytes, size_t sz) {
  printf("{"); // added for readability
  for (unsigned int i = 0; i < sz; i++) {
    printf("%02x", bytes[i]);
  }
  printf("}\n"); // added for readability
}

int main(void) {
   print_hex("",     0); // empty string
   print_hex("\x00", 1); // singleton; smallest possible byte value
   print_hex("\x01", 1); // singleton; smallest + 1
   print_hex("\xff", 1); // singleton; largest possible byte value
   print_hex("\xfe", 1); // singleton; largest - 1
}

What text do you expect this function will print? To avoid confusion, let’s assume we’re running this program on a standard x86-64 Linux box.

Take your time and think about it for a second.

If you need a hint, you can consult my mini-explainer of format strings below.

An altogether-too-brief introduction to format strings:

  • The functions in the printf family are all designed to print formatted text to a buffer or file.
  • Their first argument is always a format string that contains 0 or more conversion specifications.
  • A conversion specification tells printf how to process a particular argument.
  • If the format string has N conversion specifications, the printf call should have N arguments (excluding the format string itself).

Example:

char *player = "Bob";
unsigned int pts = 101;
printf("%s scored %u point(s)!\n", player, pts);
// prints: "Bob scored 101 point(s)!"

Our example had two conversion specifications:

  • %s - which prints its corresponding argument as a string;
  • %u - which prints its corresponding argument as an unsigned decimal integer.

In general, conversion specifications have the following structure:

%[flags][width][precision][typesize]<conversion>

where:

  • a literal % symbol appears first;
  • arguments in angle (resp. square) brackets are required (resp. optional).

In our example program above, the conversion specification %02x can be parsed as follows:

%02x
|||`- converts argument to an `unsigned int` and prints as a hexadecimal
||`-- minimum width 2 characters
|`--- optional flag 0 --- pad result with leading zeroes
`---- this symbol introduces a conversion specification

So, to summarize, this conversion specification will:

  1. x - convert its argument to an unsigned int and print as hexadecimal;
  2. 2 - using a minimum of 2 characters;
  3. 0 - where the padding character is a 0 instead of a space ( ).
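
To make the padding behavior concrete, here is a tiny standalone example (the values and variable names are mine, not from the original program):

#include <stdio.h>

int main(void) {
  unsigned int small = 0x5;    // one hex digit; needs padding to reach width 2
  unsigned int exact = 0xab;   // exactly two hex digits
  unsigned int large = 0xdead; // more than two hex digits; the width is only a minimum

  printf("%02x\n", small); // prints "05"
  printf("%02x\n", exact); // prints "ab"
  printf("%02x\n", large); // prints "dead"
  return 0;
}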

When you're ready, here is the text this program prints:

{}
{00}
{01}
{ffffffff}
{fffffffe}

Finally, if you want an explanation of this behavior, read on!




So, first things first: while the first three printed lines matched up with my expectations, the last two lines took me completely by surprise. At first glance, it looks like the last two function calls, intended to print out a single byte, somehow printed out four bytes. But how could that be?

The ultimate explanation for this (unexpected to me) behavior boils down to the fact that the behavior of this C program is implementation-defined!4

Why does this happen? This unusual behavior ultimately arises due to four interrelated factors:

  1. Two’s complement encoding

    All mainstream, modern, general-purpose computer architectures have binary memory (i.e., the smallest unit of memory stores either a 0 or a 1) and represent signed integers using the two’s complement encoding.

    Essentially, in this encoding, the most significant bit of a signed numeric type (called the sign bit) contributes a negated place value, while all other bits contribute positive place values (just as in unsigned integers).

  2. The signedness of type char

    While modern programmers may not think this way, the C language standard actually defines multiple, distinct char types with different signedness!

  3. Default argument promotions

    This mouthful of a phrase refers to a somewhat arcane corner of the C standard. When considering a variadic function, what type should its variadic parameters have in the function’s body?

    For example, given variadic function int printf(const char*, ...), suppose we make a function call printf("%s scored %u point(s)!\n", player, pts). In the body of printf, what type should the parameter that binds the argument player take on? What about pts?

  4. Signed-to-Unsigned Integer Conversion

    Given a signed integer value i1 of type t1 to be converted into unsigned type t2 with bit-width b > 0, the process works as follows:

    1. If i1 is within the range [0,2^b), then the result is i2 = i1;
    2. Otherwise, the result is i2 = i1 + x*2^b for the unique5 integer x such that 0 <= i2 < 2^b.

We’ll discuss each of these factors in detail below.

Two’s Complement Encoding

There isn’t much more to say about this encoding that hasn’t already been said.

Thus, in this section, we’ll focus on a small example: all 3-bit sequences written in big-endian notation (i.e., with the most significant bit written first) and their two’s complement values:

Binary   Value
011      3
010      2
001      1
000      0
111      -1
110      -2
101      -3
100      -4

If this is all gibberish to you, you can read more here.

To understand these values, there are a few important things to note:

  1. Like I mentioned above, we negate the value of the first (leftmost) bit.
  2. Since this is a big-endian binary string, we label each bit position, starting from the right, as position 0, 1, 2, and so forth.
  3. Then the value of the bit b in position i is equal to b * 2^i.
  4. Finally, the value of an entire string is equal to the sum of values of each of its bits.

Let’s stick with 3-bit strings like we used in the table above. Given a 3-bit string b2 b1 b0, its value will be equal to:

-(b2 * 2^2) + (b1 * 2^1) + (b0 * 2^0)

Again, note that we negated the value of the most significant bit.

Using this template, we’ll work through one simple example. By applying this template to concrete bit string 100, we obtain the expression:

-(1 * 2^2) + (0 * 2^1) + (0 * 2^0)

Simplifying the exponents, we get:

-(1 * 4) + (0 * 2) + (0 * 1)

Simplifying the multiplications gives:

-(4) + (0) + (0)

Finally, simplifying the additions gives -4, as we claimed in our table above.
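
If you want to experiment with other bit strings, here is a minimal sketch of the same computation in C (the function name and bit variables are ones I made up for this post):

#include <stdio.h>

// Value of the 3-bit two's complement string b2 b1 b0:
// the sign bit b2 contributes a negated place value; the other bits are positive.
int twos_complement_3bit(int b2, int b1, int b0) {
  return -(b2 * 4) + (b1 * 2) + (b0 * 1);
}

int main(void) {
  printf("%d\n", twos_complement_3bit(1, 0, 0)); // prints -4
  printf("%d\n", twos_complement_3bit(1, 1, 1)); // prints -1
  printf("%d\n", twos_complement_3bit(0, 1, 1)); // prints 3
  return 0;
}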

Signedness of char

How many versions of char are there? In fact, there are three:

Type            Numeric Range6
unsigned char   [0, 255]
signed char     [-128, 127]
char            either [0, 255] or [-128, 127]

However, on common platforms, char and signed char have equal ranges.

But why three separate char types? These various types map to three distinct ways that chars are used in C:

Designated char-variants     Usage
signed char, unsigned char   signed/unsigned number (arithmetic)
unsigned char                smallest uninterpreted unit of addressable memory7
char                         character in an encoded string

Essentially, this extra type variant char is intended to represent string data. Modern systems languages may have a fourth variant explicitly for uninterpreted memory access to help keep things clear. For example, in Golang, we have the following types:

Type          Usage
uint8, int8   signed/unsigned number (arithmetic)
byte          alias for uint8 and smallest uninterpreted unit of addressable memory
rune          character in an encoded string

Coming back to C, note that other8 integer types can only be used for arithmetic purposes; they cannot be used directly for representing strings9 or memory access10.

Signed Variant   Unsigned Variant
short            unsigned short
int              unsigned int
long             unsigned long
long long        unsigned long long
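
If you are curious which flavor of char your own platform uses, the macros in <limits.h> will tell you; here is a minimal sketch:

#include <stdio.h>
#include <limits.h>

int main(void) {
  // On x86-64 Linux this prints CHAR_MIN = -128, i.e., char is signed.
  printf("CHAR_BIT  = %d\n", CHAR_BIT);
  printf("CHAR_MIN  = %d\n", CHAR_MIN);
  printf("CHAR_MAX  = %d\n", CHAR_MAX);
  printf("SCHAR_MIN = %d\n", SCHAR_MIN);
  printf("SCHAR_MAX = %d\n", SCHAR_MAX);
  printf("UCHAR_MAX = %d\n", UCHAR_MAX);
  return 0;
}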

Default Argument Promotions

Before we can define default argument promotions, we have to dive into the details of variadic functions. For brevity, we’ll skip some of the gory details; see this cppreference page to get the full story.

Variadic Functions

According to the C standard, variadic functions are declared by adding an ellipsis to the end of the formal parameter set. For example, the declaration:

int printf(const char*, ...)

describes a function with:

  1. exactly one required parameter of type const char*;
  2. zero or more variadic parameters of unspecified type.

To work with variadic parameters, there are four names, all provided by the <stdarg.h> header, that are generally used:

  • va_list - the type of the object that stores the variadic parameter set;
  • va_start - initializes the variadic parameter set;
  • va_arg - returns the value of the next variadic parameter;
  • va_end - finalizes the variadic parameter set.

Let’s see an example program below:

#include <stdio.h>
#include <stdarg.h>

int variadic_sum(unsigned int cnt, ...) {
  int sum = 0;

  // declare the variable that stores the variadic parameter set
  va_list args;
  // initialize the variadic parameter set
  // the second argument should be the name of the last
  // non-variadic formal parameter; in this case, cnt
  va_start(args, cnt);
  for (unsigned int i = 0; i < cnt; i++) {
    // return the value of the next variadic parameter
    // (in this case, it must be compatible with type int)
    int elt = va_arg(args, int);

    sum += elt;
  }
  // finalize the variadic parameter set
  va_end(args);

  return sum;
}

// should print:
// Sums were 15, 7, 0, 0
int main(void) {
  int sum1 = variadic_sum(5, 1, 2, 3, 4, 5);
  int sum2 = variadic_sum(2, -10, 17);
  int sum3 = variadic_sum(0);
  int sum4 = variadic_sum(0, -10, 17);
  printf("Sums were %d, %d, %d, %d\n", sum1, sum2, sum3, sum4);
  return 0;
}

There are some restrictions on the use of these macros; violating any of them makes the behavior of the program undefined:

  1. the first argument to va_start, va_arg, and va_end must have type va_list;
  2. va_arg can only be called after (resp. before) a corresponding call to va_start (resp. va_end);
  3. va_arg may not be called more than n times, where n is the number of variadic arguments;
  4. if the i-th call to va_arg has the form va_arg(args, T), then T must be compatible11 with the promoted type of the i-th variadic argument.

This last point is the one we are interested in.

Variadic Parameters and Default Argument Promotions

Consider the following call to variadic_sum:

int first_arg = 1;
float second_arg = 2.0;
int sum = variadic_sum(2, first_arg, second_arg);

Is it well-defined? The answer is no, because:

Since the first parameter cnt == 2, the va_arg macro will be called exactly twice, but… on the second call to va_arg(args, int), the second variadic parameter has the value 2.0, which, as a floating point value, is not compatible with type int.

But there is a missing detail here. When a programmer declares some function f, for each normal function parameter p, the programmer must specify its type T. When compiling f, for each such parameter p, the compiler must allocate a chunk of memory M (which may be a hardware register) that is large enough to store the value of p. How does the compiler know if the chunk M is large enough? It simply requires that sizeof(M) >= sizeof(T), that is to say, it consults the type T of p to determine its maximum size.

But, when we declare a variadic function such as int variadic_sum(unsigned int cnt, ...), we aren’t told what the possible types of each variadic parameter might be. Essentially, we have two options:

  1. Allocate a fixed-size chunk M that is large enough for most values, and either report a compiler error or permit undefined behavior if sizeof(T) > sizeof(M);
  2. Dynamically allocate a memory chunk M large enough for the argument’s type T.

The C standard adopts the latter rule, but with a twist, i.e., it applies default argument promotions to variadic arguments. Essentially, default argument promotions are a way to reduce the number of possible cases to consider when dynamically allocating memory chunks M.12 The rule breaks down into four cases, which are applied in order:

  1. if a variadic argument has an arithmetic type and its range fits inside the range of type int, it is upcasted to int;
  2. otherwise, if it has arithmetic type and its range fits inside the range of type unsigned int, it is upcasted to unsigned int;
  3. otherwise, if it has type float, it is upcasted to type double;
  4. otherwise, its type is unchanged.
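
As a quick illustration of rule 1 (sum_ints is just a helper I made up for this post, mirroring variadic_sum above): arguments narrower than int, such as char and short, are retrieved with va_arg(args, int), because they have already been promoted to int by the time they reach the function body.

#include <stdio.h>
#include <stdarg.h>

// Sums cnt variadic arguments, all retrieved as int.
static int sum_ints(unsigned int cnt, ...) {
  va_list args;
  va_start(args, cnt);
  int sum = 0;
  for (unsigned int i = 0; i < cnt; i++) {
    sum += va_arg(args, int); // int is the promoted type of char and short
  }
  va_end(args);
  return sum;
}

int main(void) {
  char small = 7;       // narrower than int
  short also_small = 8; // narrower than int
  // Both arguments are promoted to int at the call site, so reading them
  // back with va_arg(args, int) is well-defined.
  printf("%d\n", sum_ints(2, small, also_small)); // prints 15
  return 0;
}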

So, coming back to our example:

int first_arg = 1;
float second_arg = 2.0;
int sum = variadic_sum(2, first_arg, second_arg);

Due to the default argument promotion rule, on the second call to va_arg(args, int), the variadic parameter which binds value second_arg actually has type double (even though second_arg has type float). Since the type double is not compatible with type int, the behavior of this program is undefined.
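
One way to make this particular call well-defined (keeping variadic_sum exactly as written above) is to convert the floating point argument ourselves before the call:

int first_arg = 1;
float second_arg = 2.0;
// Explicitly convert the float to int, so the second variadic argument
// really does have (promoted) type int.
int sum = variadic_sum(2, first_arg, (int)second_arg); // sum == 3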

Signed-to-Unsigned Integer Conversion

This rule may be the simplest among those discussed above (especially for those familiar with modular arithmetic). Let’s recall the rule we saw above.

Given a signed integer value i1 of type t1 to be converted into unsigned type t2 with bit-width b > 0, the process works as follows:

  1. If i1 is within the range [0,2^b), then the result is i2 = i1;
  2. Otherwise, the result is i2 = i1 + x*2^b for the unique integer x such that 0 <= i2 < 2^b.

Since this rule is fairly mechanical, we’ll just show some examples here.

For our examples, suppose we have a standard modern C platform where CHAR_BIT == 8 and sizeof(int) == 4 (i.e., each int is 32 bits wide). Then the types below would have the following ranges:

Type            Range
signed char     [-2^7, 2^7) or [-128, 128)
unsigned char   [0, 2^8) or [0, 256)
int             [-2^31, 2^31) or [-2147483648, 2147483648)
unsigned int    [0, 2^32) or [0, 4294967296)

Note that, unlike char, there is no signedness ambiguity for int: the type name signed int denotes the same type as int, which is always signed.
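
If you want to double-check these ranges on your platform, <limits.h> again has the answers; a minimal sketch:

#include <stdio.h>
#include <limits.h>

int main(void) {
  // These are closed-interval bounds, e.g., INT_MAX == 2^31 - 1.
  printf("signed char:   [%d, %d]\n", SCHAR_MIN, SCHAR_MAX);
  printf("unsigned char: [0, %d]\n", UCHAR_MAX);
  printf("int:           [%d, %d]\n", INT_MIN, INT_MAX);
  printf("unsigned int:  [0, %u]\n", UINT_MAX);
  return 0;
}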

Now let’s list a few signed char/int to unsigned char conversions:

Input Type    Input         Output
signed char   127           127
signed char   2             2
signed char   1             1
signed char   0             0
signed char   -1            255
signed char   -2            254
signed char   -127          129
signed char   -128          128
int           258           2
int           257           1
int           256           0
int           255           255
int           -127          129
int           -128          128
int           -129          127
int           -2147483648   0

And here are a few signed char/int to unsigned int conversions:

Input Type    Input         Output
signed char   127           127
signed char   2             2
signed char   1             1
signed char   0             0
signed char   -1            4294967295
signed char   -2            4294967294
signed char   -127          4294967169
signed char   -128          4294967168
int           2147483647    2147483647
int           128           128
int           127           127
int           2             2
int           1             1
int           0             0
int           -1            4294967295
int           -2            4294967294
int           -127          4294967169
int           -128          4294967168
int           -129          4294967167
int           -2147483648   2147483648

I leave double-checking the results in this table as an exercise for the reader.
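
If you would rather let the machine check a few rows, here is a minimal sketch; it simply asks C to perform the conversions via casts and prints the results (the rows checked here are an arbitrary sample):

#include <stdio.h>

int main(void) {
  signed char sc = -2;
  int i = -129;

  // signed char / int to unsigned char (results promoted to int for printing)
  printf("%d\n", (unsigned char)sc);  // prints 254
  printf("%d\n", (unsigned char)i);   // prints 127
  printf("%d\n", (unsigned char)258); // prints 2

  // signed char / int to unsigned int
  printf("%u\n", (unsigned int)sc); // prints 4294967294
  printf("%u\n", (unsigned int)i);  // prints 4294967167
  // Note: 2147483648 is not a valid int value, so the most negative int
  // is spelled -2147483647 - 1 here.
  printf("%u\n", (unsigned int)(-2147483647 - 1)); // prints 2147483648
  return 0;
}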

One important point to note from these rules is that the result of the conversion only depends on the input value and the output type; the input type is irrelevant.

(To the interested reader) I found the following questions useful to clarify how a machine might implement the conversions above:

  1. Which conversions above (if any) preserve the bit-level representation of the input value (i.e., the input and output values are the exact same bit string)?
  2. Which conversions above (if any) only modify the bit-level representation of the input value by adding/removing leading zeroes?

Putting It All Together

We now have enough details to piece together why our original program (copied below for your convenience) behaves the way it does. As above, let’s assume that we’re running this program on a standard x86-64 Linux box.

// requires bytes+0 to bytes+sz are valid memory addresses
void print_hex(char *bytes, size_t sz) {
  printf("{"); // added for readability
  for (unsigned int i = 0; i < sz; i++) {
    printf("%02x", bytes[i]);
  }
  printf("}\n"); // added for readability
}

For each call to printf in the loop, the following type conversions are performed on the argument bytes[i]:

  1. Conversion Rule: N/A
    Result Type: char
    Value Changed?: No

    As we haven’t performed any conversions yet, the type of bytes[i] is still char. But recall that on x86-64 Linux, char is a signed type!

  2. Conversion Rule: Default argument promotion
    Result Type: int
    Value Changed?: No

    By default argument promotion, we widen the type of char to int. Note that, by definition of default argument promotion, the argument value never actually changes—only its type does.

  3. Conversion Rule: Signed-to-unsigned integer conversion
    Result Type: unsigned int
    Value Changed?: Yes, but only when bytes[i] < 0

    According to the definition of printf, the conversion specification %02x contains the base conversion x which takes an unsigned int argument and prints it as hexadecimal. This means printf will take our promoted argument (int) bytes[i] and convert it again to unsigned int before printing in hexadecimal. Thus, we must apply the signed-to-unsigned integer conversion rule. However, recall that this rule only depends on the input value and the output type. Since default argument promotion never changes the input value, when bytes[i] < 0, the result of the conversion will be defined by the equation below:

    (†) bytes[i] + 1 * (2^32)

    Of course, if bytes[i] >= 0, then the result will be unchanged, i.e., just bytes[i].

So, from this analysis, we observe that only the final conversion rule will change the value of the argument. But, to understand the precise output value obtained from equation (†) on our test vectors, we need to determine how the C platform will interpret the input char* literal. To do this for our test vectors, we need to:

  1. convert escape sequences into bit strings;
  2. decode bit strings into decimal values using the two’s complement scheme.
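
You can also let the platform do this interpretation for you; on x86-64 Linux (where plain char is signed), printing the test bytes as signed decimal values shows exactly the values we are about to use:

#include <stdio.h>

int main(void) {
  char bytes[] = {'\x00', '\x01', '\xff', '\xfe'};
  for (size_t i = 0; i < sizeof bytes; i++) {
    // Prints 0, 1, -1, -2 on a platform where char is signed.
    printf("%d\n", bytes[i]);
  }
  return 0;
}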

To speed things along, I’ve gone ahead and interpreted the input values and then applied the conversion rules in the table below.

Test Vector   Input Value   Converted Value   Output String
""            N/A           N/A               {}
"\x00"        0             0                 {00}
"\x01"        1             1                 {01}
"\xff"        -1            -1 + 2^32         {ffffffff}
"\xfe"        -2            -2 + 2^32         {fffffffe}

Now, we can plainly see why the final two values printed such large output strings—the converted value expressions contain an additional 2^32 term!

So What Now?

We’ve finally identified the problem. How do we fix it? There are two basic solutions:

  1. Change The Input Type

    In our function print_hex, we can change the type of bytes from char* to unsigned char*. If we do that, firstly, the interpretation of "\xff" and "\xfe" will change to 255 and 254 respectively. Then, default argument promotion will change the input type from unsigned char to int, but leave the value unchanged. Finally, the x conversion specification will cast that value to unsigned int, which, since the value is positive, also leaves it unchanged. Thus, the final value will be identical to the initial value.

  2. Change The Conversion Specification Typesize

    In this approach, we leave the type of input bytes unchanged from char* and instead rely on more advanced printf features. As it turns out, the printf function includes a wide variety of type size modifiers. If we change our printf invocation to:

    printf("%02hhx", bytes[i]);
    

    This will force printf to convert the input argument bytes[i] to type unsigned char. And, if you manually apply the signed-to-unsigned conversion rules above, you will see that this gives us the intended result.
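
For concreteness, here is a minimal sketch of both fixes side by side (print_hex_v1 and print_hex_v2 are names I made up for this post):

#include <stdio.h>

// Fix 1: change the input type to unsigned char*.
// requires bytes+0 to bytes+sz are valid memory addresses
void print_hex_v1(unsigned char *bytes, size_t sz) {
  printf("{");
  for (unsigned int i = 0; i < sz; i++) {
    printf("%02x", bytes[i]); // bytes[i] is never negative here
  }
  printf("}\n");
}

// Fix 2: keep char* and use the hh typesize modifier instead.
// requires bytes+0 to bytes+sz are valid memory addresses
void print_hex_v2(char *bytes, size_t sz) {
  printf("{");
  for (unsigned int i = 0; i < sz; i++) {
    printf("%02hhx", bytes[i]); // printf converts the argument to unsigned char
  }
  printf("}\n");
}

int main(void) {
  print_hex_v1((unsigned char *)"\xff", 1); // prints {ff}
  print_hex_v2("\xff", 1);                  // prints {ff}
  return 0;
}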

After making either change, our program will behave as expected. That is, our test vectors will produce the following table:

Test Vector   Input Value   Converted Value   Output String
""            N/A           N/A               {}
"\x00"        0             0                 {00}
"\x01"        1             1                 {01}
"\xff"        255 (or) -1   255               {ff}
"\xfe"        254 (or) -2   254               {fe}

So which of these solutions is best?

In my opinion, solution (1) is the clear winner, since it more properly communicates intent.

As I see it, the crux of this whole mess was a conflation of the char and unsigned char types and how they are used to represent strings and uninterpreted memory regions.

In that light, assigning type unsigned char* to bytes clearly communicates that this value should be understood as a sequence of uninterpreted bytes (at least, on platforms where CHAR_BIT == 8) as opposed to a string containing human language text.

Epilogue

The convenience of C’s widely cloned printf function belies the language’s truly low-level nature. Since at least the 1980s, the C language has been referred to as a “portable assembly language,” and for good reason. When programming in C, to avoid unexpected pitfalls, one must be acutely aware of a whole slew of implicit behaviors (some of which are implementation-defined or even undefined). Otherwise, you too may find yourself falling into such a pit (as I did), dazed and confused, wondering how it ever got so dark.


  1. Well, it was a fairly recent event when I started writing this blog post. However, my initial attempts to write this post fizzled out, and it sat dormant for the better part of a year before I finally mustered up the willpower to finish it. ↩︎

  2. Unfortunately, by the time this post is published, I fear it will, in fact, be too long. ↩︎

  3. To the typed language enthusiast: even though the C language is incapable of enforcing the usage contract for this function, that doesn’t mean the contract doesn’t exist; at some point, when interfacing with real systems, we will encounter usage contracts which cannot be enforced (or even checked) in our language. ↩︎

  4. However, for all common operating systems (Linux, BSD/Mac, Windows) on all common CPU architectures, the printed text will match what was shown above. ↩︎

  5. To see that this is unique, a first hint is to observe that for any value in the range [0,2^b), adding or subtracting 2^b will produce a value outside of this range. If you want to go deeper, you can read about modular arithmetic. ↩︎

  6. Actually, I fudged the truth slightly; the C language doesn’t require char variants to occupy exactly 8-bits; their bit width is defined as the value of the macro CHAR_BIT where CHAR_BIT >= 8. In fact, the sizeof() operator in C does not return the number of bytes that comprise an object, it returns the number of chars, so that sizeof(char) == 1 regardless of architecture. In other words, on a platform where CHAR_BIT != 8, manipulating values narrower than CHAR_BIT may require bit-shifting, etc… (just as we must do for 4-bit nibbles when printing bytes as hexadecimal). ↩︎

  7. One may wonder why unsigned char should be preferred as a unit of uninterpreted memory as opposed to the other character types. The main reasons are that (a) signed values have implementation-defined behavior when right-shifting (>>); (b) when widening a signed arithmetic type, C will perform sign extension, which differs from the normal semantics of memory accesses.
     
    Similarly, one may wonder why the C standard sometimes prefers a pair of void* and size_t for representing a blob of memory instead of unsigned char* and size_t. Actually, both choices are sensible, but they signal a different intent. While an unsigned char* represents a sequence of uninterpreted memory units (which are usually 8-bit bytes), a void* represents a sequence of unknown type. As an example, we may use malloc to allocate memory for a struct s { int a; int b; }; by writing malloc(sizeof(struct s)) and it will return a void*. We then think of that void* as pointing to a single object of type s. If we cast that void* to unsigned char*, then that is like viewing an s object as a raw sequence of uninterpreted memory units. ↩︎

  8. Again, I’m fudging the truth here a bit because there are other integer types, but we won’t discuss them here. ↩︎

  9. The fact that non-char types cannot represent strings in C is a convention informed by the fact that the original character sets used by computers in the United States had just 7 or 8 bits; this meant that each representable character on the machine could be stored in just one char value. However, as computers began to be developed and used globally, the number of bits required to encode a single language character expanded beyond just 7 or 8 bits. Unfortunately, at that point, it was not possible to change the meaning of char without breaking a large number of C programs. To remedy this issue, a wider char-like type (appropriately called wchar_t) was introduced to represent characters in languages that required more bits. ↩︎

  10. For the full story on why these integer types cannot be used for memory access, an interested reader can consult the C language aliasing rules. The key point is that, by restricting the use of these types to just arithmetic purposes, the compiler can apply certain optimizations that would otherwise not be possible. ↩︎

  11. The C language says types T1 and T2 are compatible if a value of one type can be used where the other type is expected. While there is some subtlety here in the actual definition, for our purposes, it’s enough to assume that T1 and T2 are the same type. ↩︎

  12. As far as I can tell, this rule was put in place to simplify the job of compiler writers. ↩︎