By Holberton School July 22, 2020

Code review: string concatenation in C

Today, we are going to go through the following code.

The very interesting thing about this code is that when compiled and run, it seems to work just fine. But it is actually not working properly, and there is a big problem with it 🙂 By going through it, we are going to learn a ton of things.

At Holberton, we have a strict coding style for each programming language we are using. Let’s start by applying our coding style for C.
– main() should be written main(void). Is that a huge mistake? No. But this forces us to be more structured and to always write everything explicitly.
– We don’t want to initialize our variables at the same time as we declare them (exceptions can apply for arrays).
– Also we want to group our variable declarations together.
– No empty lines within the code. Only one empty line between declaration and code.
So the code should look like this:

It’s cleaner, more professional, and follows the style of the school (remember that in any company we have to follow the company’s coding style, so it’s important to get into the habit of strictly following one particular style).

We are using the function printf without including its prototype. So when we compile, we will get a warning from the compiler, even without additional flags.

In order to know the prototype of a function, we can always look at its man page. In this case, man 3 printf.

The man page gives us the prototype and the header to include. We can use either one to tell the compiler what the prototype of the function printf is (and make the warning go away). We don’t strictly have to include the header; we can simply write the prototype itself. Like so:

But it’s a good habit to include the header (which includes the prototype).

Now let’s see what is happening in the program and why it is working but not really. We will go step by step through the code and look at what happens in memory. Let’s start with the declarations:

At this point, this is what the virtual memory looks like (we are going to assume we are working on a 64-bit Linux machine):

*Note: In the value line, I show the letter for each byte of the arrays `aa` and `bb`, but what is actually stored in the virtual memory is the ASCII code of that letter (`man ascii`).*
I also added some colors to make sure we can see the limit of each space reserved in the memory by each variable.

– The string literals are copied into the addresses of the arrays. The arrays have been automatically sized (the compiler can do that because it knows the size of the string literals to copy).
– a and b are pointers so on a 64-bit machine they take 8 bytes in memory.
– The variable aa is an array of chars that takes 14 bytes (14 chars, so 14 * sizeof(char) = 14 bytes).
– At this point the variables a and b hold a value, but we do not know what it is, because they have not been initialized yet. The next two lines of code will initialize them.

After these two lines of code, a points to the first letter of the array aa (so it contains the address of the first letter of the array aa, which is also the address of the array aa) and b points to the first letter of the array bb. This is what the virtual memory looks like:

So far so good. With the next lines of code we move to the end of the “string” (remember, there is no string type in C). This code is correct. So at the end of the while loop, a points to the \0 of the array aa.

At the end of this while loop, the virtual memory looks like this:

The next lines of code are the following:

The above loop copies the content of the array bb (remember, b points to the first char contained in bb) to the end of the array aa (because the variable a, at the beginning of the loop, points to the last char (\0) of the array aa). And that is both what we wanted the code to do, AND the problem 🙂

The content of bb is copied, one char at a time, starting from the memory address 19 (in our example). But our variable aa ENDS at 19 too. That means that we are writing the content of bb AFTER the variable aa, not inside it. After 12 iterations, the virtual memory (in our example) looks like this:

In red, we have written 11 bytes outside of the memory reserved for aa, and the loop will continue to do so for another 10 bytes. The problem, of course, is that we are probably overwriting the values of other variables, or writing to a memory address to which we do not have write access (and we will get a beautiful Segmentation Fault). In this particular case, the program still runs “properly” and without warning (because we are unlucky), and as a result, we don’t realize that we are making a mistake.

In fact, in this example, we are actually “destroying” our array bb. Let’s modify the program a bit in order to check that:

It seems like we changed bb by concatenating it to aa. But bb is not 1 char “shorter”: it still takes the same size in memory, but its content has changed. This happens because, in the actual virtual memory of our running process, the two arrays are next to each other, like so:

I removed the vars `a` and `b` for clarity.

So when we are concatenating bb to aa, we are doing this (concatenated letters in pink):

After this concatenation, the size of bb doesn’t change, but its content has, and it “seems” to have been shifted to the left by 1 char. That’s because the - from the beginning of bb is now part of aa, as the last letter in the memory reserved for aa. Note that bb now has two \0: the one that was copied, and the initial one.

THE END 🙂 If you would like to learn more about the virtual memory, you can read these articles:

Chapter 0: Hack The Virtual Memory: C strings & /proc
Chapter 1: Hack The Virtual Memory: Python bytes
Chapter 2: Hack The Virtual Memory: Drawing the VM diagram
Chapter 3: Hack the Virtual Memory: malloc, the heap & the program break

To finish with, I would like to thank the author of this code, because thanks to them we learn a ton of things!

“Experience is simply the name we give our mistakes.” Oscar Wilde

Happy coding!