Hack the Virtual Memory: malloc, the heap & the program break

Hack the VM!

This is the fourth chapter in a series around virtual memory. The goal is to learn some CS basics, but in a different and more practical way.

If you missed the previous chapters, you should probably start there:

The heap

In this chapter we will look at the heap and malloc in order to answer some of the questions we ended with at the end of the previous chapter:

  • Why doesn’t our allocated memory start at the very beginning of the heap (0x2050010 vs 02050000)? What are those first 16 bytes used for?
  • Is the heap actually growing upwards?

Prerequisites

In order to fully understand this article, you will need to know:

Environment

All scripts and programs have been tested on the following system:

  • Ubuntu
    • Linux ubuntu 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Tools used:

  • gcc
    • gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
  • glibc 2.19 (see version.c if you need to check your glibc version)
  • strace
    • strace — version 4.8

Everything we will write will be true for this system/environment, but may be different on another system

We will also go through the Linux source code. If you are on Ubuntu, you can download the sources of your current kernel by running this command:

apt-get source linux-image-$(uname -r)

malloc

malloc is the common function used to dynamically allocate memory. This memory is allocated on the “heap”.
Note: malloc is not a system call.

From man malloc:

[...] allocate dynamic memory[...]
void *malloc(size_t size);
[...]
The malloc() function allocates size bytes and returns a pointer to the allocated memory.

No malloc, no [heap]

Let’s look at memory regions of a process that does not call malloc (0-main.c).

#include <stdlib.h>
#include <stdio.h>

/**
 * main - do nothing
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    getchar();
    return (EXIT_SUCCESS);
}

julien@holberton:~/holberton/w/hackthevm3$ gcc -Wall -Wextra -pedantic -Werror 0-main.c -o 0
julien@holberton:~/holberton/w/hackthevm3$ ./0 

Quick reminder (1/3): the memory regions of a process are listed in the /proc/[pid]/maps file. As a result, we first need to know the PID of the process. That is done using the ps command; the second column of ps aux output will give us the PID of the process. Please read chapter 0 to learn more.

julien@holberton:/tmp$ ps aux | grep \ \./0$
julien     3638  0.0  0.0   4200   648 pts/9    S+   12:01   0:00 ./0

Quick reminder (2/3): from the above output, we can see that the PID of the process we want to look at is 3638. As a result, the maps file will be found in the directory /proc/3638.

julien@holberton:/tmp$ cd /proc/3638

Quick reminder (3/3): The maps file contains the memory regions of the process. The format of each line in this file is:
address perms offset dev inode pathname

julien@holberton:/proc/3638$ cat maps
00400000-00401000 r-xp 00000000 08:01 174583                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/0
00600000-00601000 r--p 00000000 08:01 174583                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/0
00601000-00602000 rw-p 00001000 08:01 174583                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/0
7f38f87d7000-7f38f8991000 r-xp 00000000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f38f8991000-7f38f8b91000 ---p 001ba000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f38f8b91000-7f38f8b95000 r--p 001ba000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f38f8b95000-7f38f8b97000 rw-p 001be000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f38f8b97000-7f38f8b9c000 rw-p 00000000 00:00 0 
7f38f8b9c000-7f38f8bbf000 r-xp 00000000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f38f8da3000-7f38f8da6000 rw-p 00000000 00:00 0 
7f38f8dbb000-7f38f8dbe000 rw-p 00000000 00:00 0 
7f38f8dbe000-7f38f8dbf000 r--p 00022000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f38f8dbf000-7f38f8dc0000 rw-p 00023000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f38f8dc0000-7f38f8dc1000 rw-p 00000000 00:00 0 
7ffdd85c5000-7ffdd85e6000 rw-p 00000000 00:00 0                          [stack]
7ffdd85f2000-7ffdd85f4000 r--p 00000000 00:00 0                          [vvar]
7ffdd85f4000-7ffdd85f6000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
julien@holberton:/proc/3638$ 

Note: hackthevm3 is a symbolic link to hack_the_virtual_memory/03. The Heap/

-> As we can see from the above maps file, there’s no [heap] region allocated.

malloc(x)

Let’s do the same but with a program that calls malloc (1-main.c):

#include <stdio.h>
#include <stdlib.h>

/**
 * main - 1 call to malloc
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    malloc(1);
    getchar();
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm3$ gcc -Wall -Wextra -pedantic -Werror 1-main.c -o 1
julien@holberton:~/holberton/w/hackthevm3$ ./1 

julien@holberton:/proc/3638$ ps aux | grep \ \./1$
julien     3718  0.0  0.0   4332   660 pts/9    S+   12:09   0:00 ./1
julien@holberton:/proc/3638$ cd /proc/3718
julien@holberton:/proc/3718$ cat maps 
00400000-00401000 r-xp 00000000 08:01 176964                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/1
00600000-00601000 r--p 00000000 08:01 176964                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/1
00601000-00602000 rw-p 00001000 08:01 176964                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/1
01195000-011b6000 rw-p 00000000 00:00 0                                  [heap]
...
julien@holberton:/proc/3718$ 

-> the [heap] is here.

Let’s check the return value of malloc to make sure the returned address is in the heap region (2-main.c):

#include <stdio.h>
#include <stdlib.h>

/**
 * main - prints the malloc returned address
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    void *p;

    p = malloc(1);
    printf("%p\n", p);
    getchar();
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm3$ gcc -Wall -Wextra -pedantic -Werror 2-main.c -o 2
julien@holberton:~/holberton/w/hackthevm3$ ./2 
0x24d6010

julien@holberton:/proc/3718$ ps aux | grep \ \./2$
julien     3834  0.0  0.0   4336   676 pts/9    S+   12:48   0:00 ./2
julien@holberton:/proc/3718$ cd /proc/3834
julien@holberton:/proc/3834$ cat maps
00400000-00401000 r-xp 00000000 08:01 176966                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/2
00600000-00601000 r--p 00000000 08:01 176966                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/2
00601000-00602000 rw-p 00001000 08:01 176966                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/2
024d6000-024f7000 rw-p 00000000 00:00 0                                  [heap]
...
julien@holberton:/proc/3834$ 

-> 024d6000 <0x24d6010 < 024f7000

The returned address is inside the heap region. And as we have seen in the previous chapter, the returned address does not start exactly at the beginning of the region; we’ll see why later.

strace, brk and sbrk

malloc is a “regular” function (as opposed to a system call), so it must call some kind of syscall in order to manipulate the heap. Let’s use strace to find out.

strace is a program used to trace system calls and signals. Any program will always use a few syscalls before your main function is executed. In order to know which syscalls are used by malloc, we will add a write syscall before and after the call to malloc(3-main.c).

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/**
 * main - let's find out which syscall malloc is using
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    void *p;

    write(1, "BEFORE MALLOC\n", 14);
    p = malloc(1);
    write(1, "AFTER MALLOC\n", 13);
    printf("%p\n", p);
    getchar();
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm3$ gcc -Wall -Wextra -pedantic -Werror 3-main.c -o 3
julien@holberton:~/holberton/w/hackthevm3$ strace ./3 
execve("./3", ["./3"], [/* 61 vars */]) = 0
...
write(1, "BEFORE MALLOC\n", 14BEFORE MALLOC
)         = 14
brk(0)                                  = 0xe70000
brk(0xe91000)                           = 0xe91000
write(1, "AFTER MALLOC\n", 13AFTER MALLOC
)          = 13
...
read(0, 

From the above listing we can focus on this:

brk(0)                                  = 0xe70000
brk(0xe91000)                           = 0xe91000

-> malloc is using the brk system call in order to manipulate the heap. From brk man page (man brk), we can see what this system call is doing:

...
       int brk(void *addr);
       void *sbrk(intptr_t increment);
...
DESCRIPTION
       brk() and sbrk() change the location of the program  break,  which  defines
       the end of the process's data segment (i.e., the program break is the first
       location after the end of the uninitialized data segment).  Increasing  the
       program  break has the effect of allocating memory to the process; decreas‐
       ing the break deallocates memory.

       brk() sets the end of the data segment to the value specified by addr, when
       that  value  is  reasonable,  the system has enough memory, and the process
       does not exceed its maximum data size (see setrlimit(2)).

       sbrk() increments the program's data space  by  increment  bytes.   Calling
       sbrk()  with  an increment of 0 can be used to find the current location of
       the program break.

The program break is the address of the first location beyond the current end of the data region of the program in the virual memory.

program break before the call to malloc / brk

By increasing the value of the program break, via brk or sbrk, the function malloc creates a new space that can then be used by the process to dynamically allocate memory (using malloc).

program break after the malloc / brk call

So the heap is actually an extension of the data segment of the program.

The first call to brk (brk(0)) returns the current address of the program break to malloc. And the second call is the one that actually creates new memory (since 0xe91000 > 0xe70000) by increasing the value of the program break. In the above example, the heap is now starting at 0xe70000 and ends at 0xe91000. Let’s double check with the /proc/[PID]/maps file:

julien@holberton:/proc/3855$ ps aux | grep \ \./3$
julien     4011  0.0  0.0   4748   708 pts/9    S+   13:04   0:00 strace ./3
julien     4014  0.0  0.0   4336   644 pts/9    S+   13:04   0:00 ./3
julien@holberton:/proc/3855$ cd /proc/4014
julien@holberton:/proc/4014$ cat maps 
00400000-00401000 r-xp 00000000 08:01 176967                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/3
00600000-00601000 r--p 00000000 08:01 176967                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/3
00601000-00602000 rw-p 00001000 08:01 176967                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/3
00e70000-00e91000 rw-p 00000000 00:00 0                                  [heap]
...
julien@holberton:/proc/4014$ 

-> 00e70000-00e91000 rw-p 00000000 00:00 0 [heap] matches the pointers returned back to malloc by brk.

That’s great, but wait, why didmalloc increment the heap by 00e9100000e70000 = 0x21000 or 135168 bytes, when we only asked for only 1 byte?

Many mallocs

What will happen if we call malloc several times? (4-main.c)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/**
 * main - many calls to malloc
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    void *p;

    write(1, "BEFORE MALLOC #0\n", 17);
    p = malloc(1024);
    write(1, "AFTER MALLOC #0\n", 16);
    printf("%p\n", p);

    write(1, "BEFORE MALLOC #1\n", 17);
    p = malloc(1024);
    write(1, "AFTER MALLOC #1\n", 16);
    printf("%p\n", p);

    write(1, "BEFORE MALLOC #2\n", 17);
    p = malloc(1024);
    write(1, "AFTER MALLOC #2\n", 16);
    printf("%p\n", p);

    write(1, "BEFORE MALLOC #3\n", 17);
    p = malloc(1024);
    write(1, "AFTER MALLOC #3\n", 16);
    printf("%p\n", p);

    getchar();
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm3$ gcc -Wall -Wextra -pedantic -Werror 4-main.c -o 4
julien@holberton:~/holberton/w/hackthevm3$ strace ./4 
execve("./4", ["./4"], [/* 61 vars */]) = 0
...
write(1, "BEFORE MALLOC #0\n", 17BEFORE MALLOC #0
)      = 17
brk(0)                                  = 0x1314000
brk(0x1335000)                          = 0x1335000
write(1, "AFTER MALLOC #0\n", 16AFTER MALLOC #0
)       = 16
...
write(1, "0x1314010\n", 100x1314010
)             = 10
write(1, "BEFORE MALLOC #1\n", 17BEFORE MALLOC #1
)      = 17
write(1, "AFTER MALLOC #1\n", 16AFTER MALLOC #1
)       = 16
write(1, "0x1314420\n", 100x1314420
)             = 10
write(1, "BEFORE MALLOC #2\n", 17BEFORE MALLOC #2
)      = 17
write(1, "AFTER MALLOC #2\n", 16AFTER MALLOC #2
)       = 16
write(1, "0x1314830\n", 100x1314830
)             = 10
write(1, "BEFORE MALLOC #3\n", 17BEFORE MALLOC #3
)      = 17
write(1, "AFTER MALLOC #3\n", 16AFTER MALLOC #3
)       = 16
write(1, "0x1314c40\n", 100x1314c40
)             = 10
...
read(0, 

-> malloc is NOT calling brk each time we call it.

The first time, malloc creates a new space (the heap) for the program (by increasing the program break location). The following times, malloc uses the same space to give our program “new” chunks of memory. Those “new” chunks of memory are part of the memory previously allocated using brk. This way, malloc doesn’t have to use syscalls (brk) every time we call it, and thus it makes malloc – and our programs using malloc – faster. It also allows malloc and free to optimize the usage of the memory.

Let’s double check that we have only one heap, allocated by the first call to brk:

julien@holberton:/proc/4014$ ps aux | grep \ \./4$
julien     4169  0.0  0.0   4748   688 pts/9    S+   13:33   0:00 strace ./4
julien     4172  0.0  0.0   4336   656 pts/9    S+   13:33   0:00 ./4
julien@holberton:/proc/4014$ cd /proc/4172
julien@holberton:/proc/4172$ cat maps
00400000-00401000 r-xp 00000000 08:01 176973                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/4
00600000-00601000 r--p 00000000 08:01 176973                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/4
00601000-00602000 rw-p 00001000 08:01 176973                             /home/julien/holberton/w/hack_the_virtual_memory/03. The Heap/4
01314000-01335000 rw-p 00000000 00:00 0                                  [heap]
7f4a3f2c4000-7f4a3f47e000 r-xp 00000000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f4a3f47e000-7f4a3f67e000 ---p 001ba000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f4a3f67e000-7f4a3f682000 r--p 001ba000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f4a3f682000-7f4a3f684000 rw-p 001be000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f4a3f684000-7f4a3f689000 rw-p 00000000 00:00 0 
7f4a3f689000-7f4a3f6ac000 r-xp 00000000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f4a3f890000-7f4a3f893000 rw-p 00000000 00:00 0 
7f4a3f8a7000-7f4a3f8ab000 rw-p 00000000 00:00 0 
7f4a3f8ab000-7f4a3f8ac000 r--p 00022000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f4a3f8ac000-7f4a3f8ad000 rw-p 00023000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f4a3f8ad000-7f4a3f8ae000 rw-p 00000000 00:00 0 
7ffd1ba73000-7ffd1ba94000 rw-p 00000000 00:00 0                          [stack]
7ffd1bbed000-7ffd1bbef000 r--p 00000000 00:00 0                          [vvar]
7ffd1bbef000-7ffd1bbf1000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
julien@holberton:/proc/4172$ 

-> We have only one [heap] and the addresses match those returned by sbrk: 0x1314000 & 0x1335000

Naive malloc

Based on the above, and assuming we won’t ever need to free anything, we can now write our own (naive) version of malloc, that would move the program break each time it is called.

#include <stdlib.h>
#include <unistd.h>

/**                                                                                            
 * malloc - naive version of malloc: dynamically allocates memory on the heap using sbrk                         
 * @size: number of bytes to allocate                                                          
 *                                                                                             
 * Return: the memory address newly allocated, or NULL on error                                
 *                                                                                             
 * Note: don't do this at home :)                                                              
 */
void *malloc(size_t size)
{
    void *previous_break;

    previous_break = sbrk(size);
    /* check for error */
    if (previous_break == (void *) -1)
    {
        /* on error malloc returns NULL */
        return (NULL);
    }
    return (previous_break);
}

The 0x10 lost bytes

If we look at the output of the previous program (4-main.c), we can see that the first memory address returned by malloc doesn’t start at the beginning of the heap, but 0x10 bytes after: 0x1314010 vs 0x1314000. Also, when we call malloc(1024) a second time, the address should be 0x1314010 (the returned value of the first call to malloc) + 1024 (or 0x400 in hexadecimal, since the first call to malloc was asking for 1024 bytes) = 0x1318010. But the return value of the second call to malloc is 0x1314420. We have lost 0x10 bytes again! Same goes for the subsequent calls.

Let’s look at what we can find inside those “lost” 0x10-byte memory spaces (5-main.c) and whether the memory loss stays constant:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/**                                                                                            
 * pmem - print mem                                                                            
 * @p: memory address to start printing from                                                   
 * @bytes: number of bytes to print                                                            
 *                                                                                             
 * Return: nothing                                                                             
 */
void pmem(void *p, unsigned int bytes)
{
    unsigned char *ptr;
    unsigned int i;

    ptr = (unsigned char *)p;
    for (i = 0; i < bytes; i++)
    {
        if (i != 0)
        {
            printf(" ");
        }
        printf("%02x", *(ptr + i));
    }
    printf("\n");
}

/**
 * main - the 0x10 lost bytes
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    void *p;
    int i;

    for (i = 0; i < 10; i++)
    {
        p = malloc(1024 * (i + 1));
        printf("%p\n", p);
        printf("bytes at %p:\n", (void *)((char *)p - 0x10));
        pmem((char *)p - 0x10, 0x10);
    }
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm3$ gcc -Wall -Wextra -pedantic -Werror 5-main.c -o 5
julien@holberton:~/holberton/w/hackthevm3$ ./5
0x1fa8010
bytes at 0x1fa8000:
00 00 00 00 00 00 00 00 11 04 00 00 00 00 00 00
0x1fa8420
bytes at 0x1fa8410:
00 00 00 00 00 00 00 00 11 08 00 00 00 00 00 00
0x1fa8c30
bytes at 0x1fa8c20:
00 00 00 00 00 00 00 00 11 0c 00 00 00 00 00 00
0x1fa9840
bytes at 0x1fa9830:
00 00 00 00 00 00 00 00 11 10 00 00 00 00 00 00
0x1faa850
bytes at 0x1faa840:
00 00 00 00 00 00 00 00 11 14 00 00 00 00 00 00
0x1fabc60
bytes at 0x1fabc50:
00 00 00 00 00 00 00 00 11 18 00 00 00 00 00 00
0x1fad470
bytes at 0x1fad460:
00 00 00 00 00 00 00 00 11 1c 00 00 00 00 00 00
0x1faf080
bytes at 0x1faf070:
00 00 00 00 00 00 00 00 11 20 00 00 00 00 00 00
0x1fb1090
bytes at 0x1fb1080:
00 00 00 00 00 00 00 00 11 24 00 00 00 00 00 00
0x1fb34a0
bytes at 0x1fb3490:
00 00 00 00 00 00 00 00 11 28 00 00 00 00 00 00
julien@holberton:~/holberton/w/hackthevm3$ 

There is one clear pattern: the size of the malloc’ed memory chunk is always found in the preceding 0x10 bytes. For instance, the first malloc call is malloc’ing 1024 (0x0400) bytes and we can find 11 04 00 00 00 00 00 00 in the preceding 0x10 bytes. Those last bytes represent the number 0x 00 00 00 00 00 00 04 11 = 0x400 (1024) + 0x10 (the block size preceding those 1024 bytes + 1 (we’ll talk about this “+1” later in this chapter). If we look at each 0x10 bytes preceding the addresses returned by malloc, they all contain the size of the chunk of memory asked to malloc + 0x10 + 1.

At this point, given what we said and saw earlier, we can probably guess that those 0x10 bytes are a sort of data structure used by malloc (and free) to deal with the heap. And indeed, even though we don’t understand everything yet, we can already use this data structure to go from one malloc’ed chunk of memory to the other (6-main.c) as long as we have the address of the beginning of the heap (and as long as we have never called free):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/**                                                                                            
 * pmem - print mem                                                                            
 * @p: memory address to start printing from                                                   
 * @bytes: number of bytes to print                                                            
 *                                                                                             
 * Return: nothing                                                                             
 */
void pmem(void *p, unsigned int bytes)
{
    unsigned char *ptr;
    unsigned int i;

    ptr = (unsigned char *)p;
    for (i = 0; i < bytes; i++)
    {
        if (i != 0)
        {
            printf(" ");
        }
        printf("%02x", *(ptr + i));
    }
    printf("\n");
}

/**
 * main - using the 0x10 bytes to jump to next malloc'ed chunks
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    void *p;
    int i;
    void *heap_start;
    size_t size_of_the_block;

    heap_start = sbrk(0);
    write(1, "START\n", 6);
    for (i = 0; i < 10; i++)
    {
        p = malloc(1024 * (i + 1)); 
        *((int *)p) = i;
        printf("%p: [%i]\n", p, i);
    }
    p = heap_start;
    for (i = 0; i < 10; i++)
    {
        pmem(p, 0x10);
        size_of_the_block = *((size_t *)((char *)p + 8)) - 1;
        printf("%p: [%i] - size = %lu\n",
              (void *)((char *)p + 0x10),
              *((int *)((char *)p + 0x10)),
              size_of_the_block);
        p = (void *)((char *)p + size_of_the_block);
    }
    write(1, "END\n", 4);
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm3$ gcc -Wall -Wextra -pedantic -Werror 6-main.c -o 6
julien@holberton:~/holberton/w/hackthevm3$ ./6 
START
0x9e6010: [0]
0x9e6420: [1]
0x9e6c30: [2]
0x9e7840: [3]
0x9e8850: [4]
0x9e9c60: [5]
0x9eb470: [6]
0x9ed080: [7]
0x9ef090: [8]
0x9f14a0: [9]
00 00 00 00 00 00 00 00 11 04 00 00 00 00 00 00
0x9e6010: [0] - size = 1040
00 00 00 00 00 00 00 00 11 08 00 00 00 00 00 00
0x9e6420: [1] - size = 2064
00 00 00 00 00 00 00 00 11 0c 00 00 00 00 00 00
0x9e6c30: [2] - size = 3088
00 00 00 00 00 00 00 00 11 10 00 00 00 00 00 00
0x9e7840: [3] - size = 4112
00 00 00 00 00 00 00 00 11 14 00 00 00 00 00 00
0x9e8850: [4] - size = 5136
00 00 00 00 00 00 00 00 11 18 00 00 00 00 00 00
0x9e9c60: [5] - size = 6160
00 00 00 00 00 00 00 00 11 1c 00 00 00 00 00 00
0x9eb470: [6] - size = 7184
00 00 00 00 00 00 00 00 11 20 00 00 00 00 00 00
0x9ed080: [7] - size = 8208
00 00 00 00 00 00 00 00 11 24 00 00 00 00 00 00
0x9ef090: [8] - size = 9232
00 00 00 00 00 00 00 00 11 28 00 00 00 00 00 00
0x9f14a0: [9] - size = 10256
END
julien@holberton:~/holberton/w/hackthevm3$ 

One of our open questions from the previous chapter is now answered: malloc is using 0x10 additional bytes for each malloc’ed memory block to store the size of the block.

0x10 bytes preceeding malloc

This data will actually be used by free to save it to a list of available blocks for future calls to malloc.

But our study also raises a new question: what are the first 8 bytes of the 16 (0x10 in hexadecimal) bytes used for? It seems to always be zero. Is it just padding?

RTFSC

At this stage, we probably want to check the source code of malloc to confirm what we just found (malloc.c from the glibc).

1055 /*
1056       malloc_chunk details:
1057    
1058        (The following includes lightly edited explanations by Colin Plumb.)
1059    
1060        Chunks of memory are maintained using a `boundary tag' method as
1061        described in e.g., Knuth or Standish.  (See the paper by Paul
1062        Wilson ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps for a
1063        survey of such techniques.)  Sizes of free chunks are stored both
1064        in the front of each chunk and at the end.  This makes
1065        consolidating fragmented chunks into bigger chunks very fast.  The
1066        size fields also hold bits representing whether chunks are free or
1067        in use.
1068    
1069        An allocated chunk looks like this:
1070    
1071    
1072        chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1073                |             Size of previous chunk, if unallocated (P clear)  |
1074                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1075                |             Size of chunk, in bytes                     |A|M|P|
1076          mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1077                |             User data starts here...                          .
1078                .                                                               .
1079                .             (malloc_usable_size() bytes)                      .
1080                .                                                               |
1081    nextchunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1082                |             (size of chunk, but used for application data)    |
1083                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1084                |             Size of next chunk, in bytes                |A|0|1|
1085                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1086    
1087        Where "chunk" is the front of the chunk for the purpose of most of
1088        the malloc code, but "mem" is the pointer that is returned to the
1089        user.  "Nextchunk" is the beginning of the next contiguous chunk.

-> We were correct \o/. Right before the address returned by malloc to the user, we have two variables:

  • Size of previous chunk, if unallocated: we never free’d any chunks so that is why it was always 0
  • Size of chunk, in bytes

Let’s free some chunks to confirm that the first 8 bytes are used the way the source code describes it (7-main.c):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/**                                                                                            
 * pmem - print mem                                                                            
 * @p: memory address to start printing from                                                   
 * @bytes: number of bytes to print                                                            
 *                                                                                             
 * Return: nothing                                                                             
 */
void pmem(void *p, unsigned int bytes)
{
    unsigned char *ptr;
    unsigned int i;

    ptr = (unsigned char *)p;
    for (i = 0; i < bytes; i++)
    {
        if (i != 0)
        {
            printf(" ");
        }
        printf("%02x", *(ptr + i));
    }
    printf("\n");
}

/**
 * main - confirm the source code
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    void *p;
    int i;
    size_t size_of_the_chunk;
    size_t size_of_the_previous_chunk;
    void *chunks[10];

    for (i = 0; i < 10; i++)
    {
        p = malloc(1024 * (i + 1));
        chunks[i] = (void *)((char *)p - 0x10);
        printf("%p\n", p);
    }
    free((char *)(chunks[3]) + 0x10);
    free((char *)(chunks[7]) + 0x10);
    for (i = 0; i < 10; i++)
    {
        p = chunks[i];
        printf("chunks[%d]: ", i);
        pmem(p, 0x10);
        size_of_the_chunk = *((size_t *)((char *)p + 8)) - 1;
        size_of_the_previous_chunk = *((size_t *)((char *)p));
        printf("chunks[%d]: %p, size = %li, prev = %li\n",
              i, p, size_of_the_chunk, size_of_the_previous_chunk);
    }
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm3$ gcc -Wall -Wextra -pedantic -Werror 7-main.c -o 7
julien@holberton:~/holberton/w/hackthevm3$ ./7
0x1536010
0x1536420
0x1536c30
0x1537840
0x1538850
0x1539c60
0x153b470
0x153d080
0x153f090
0x15414a0
chunks[0]: 00 00 00 00 00 00 00 00 11 04 00 00 00 00 00 00
chunks[0]: 0x1536000, size = 1040, prev = 0
chunks[1]: 00 00 00 00 00 00 00 00 11 08 00 00 00 00 00 00
chunks[1]: 0x1536410, size = 2064, prev = 0
chunks[2]: 00 00 00 00 00 00 00 00 11 0c 00 00 00 00 00 00
chunks[2]: 0x1536c20, size = 3088, prev = 0
chunks[3]: 00 00 00 00 00 00 00 00 11 10 00 00 00 00 00 00
chunks[3]: 0x1537830, size = 4112, prev = 0
chunks[4]: 10 10 00 00 00 00 00 00 10 14 00 00 00 00 00 00
chunks[4]: 0x1538840, size = 5135, prev = 4112
chunks[5]: 00 00 00 00 00 00 00 00 11 18 00 00 00 00 00 00
chunks[5]: 0x1539c50, size = 6160, prev = 0
chunks[6]: 00 00 00 00 00 00 00 00 11 1c 00 00 00 00 00 00
chunks[6]: 0x153b460, size = 7184, prev = 0
chunks[7]: 00 00 00 00 00 00 00 00 11 20 00 00 00 00 00 00
chunks[7]: 0x153d070, size = 8208, prev = 0
chunks[8]: 10 20 00 00 00 00 00 00 10 24 00 00 00 00 00 00
chunks[8]: 0x153f080, size = 9231, prev = 8208
chunks[9]: 00 00 00 00 00 00 00 00 11 28 00 00 00 00 00 00
chunks[9]: 0x1541490, size = 10256, prev = 0
julien@holberton:~/holberton/w/hackthevm3$ 

As we can see from the above listing, when the previous chunk has been free’d, the malloc chunk’s first 8 bytes contain the size of the previous unallocated chunk. So the correct representation of a malloc chunk is the following:

malloc chunk

Also, it seems that the first bit of the next 8 bytes (containing the size of the current chunk) serves as a flag to check if the previous chunk is used (1) or not (0). So the correct updated version of our program should be written this way (8-main.c):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/**                                                                                            
 * pmem - print mem                                                                            
 * @p: memory address to start printing from                                                   
 * @bytes: number of bytes to print                                                            
 *                                                                                             
 * Return: nothing                                                                             
 */
void pmem(void *p, unsigned int bytes)
{
    unsigned char *ptr;
    unsigned int i;

    ptr = (unsigned char *)p;
    for (i = 0; i < bytes; i++)
    {
        if (i != 0)
        {
            printf(" ");
        }
        printf("%02x", *(ptr + i));
    }
    printf("\n");
}

/**
 * main - updating with correct checks
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    void *p;
    int i;
    size_t size_of_the_chunk;
    size_t size_of_the_previous_chunk;
    void *chunks[10];
    char prev_used;

    for (i = 0; i < 10; i++)
    {
        p = malloc(1024 * (i + 1));
        chunks[i] = (void *)((char *)p - 0x10);
    }
    free((char *)(chunks[3]) + 0x10);
    free((char *)(chunks[7]) + 0x10);
    for (i = 0; i < 10; i++)
    {
        p = chunks[i];
        printf("chunks[%d]: ", i);
        pmem(p, 0x10);
        size_of_the_chunk = *((size_t *)((char *)p + 8));
        prev_used = size_of_the_chunk & 1;
        size_of_the_chunk -= prev_used;
        size_of_the_previous_chunk = *((size_t *)((char *)p));
        printf("chunks[%d]: %p, size = %li, prev (%s) = %li\n",
              i, p, size_of_the_chunk,
              (prev_used? "allocated": "unallocated"), size_of_the_previous_chunk);
    }
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm3$ gcc -Wall -Wextra -pedantic -Werror 8-main.c -o 8
julien@holberton:~/holberton/w/hackthevm3$ ./8 
chunks[0]: 00 00 00 00 00 00 00 00 11 04 00 00 00 00 00 00
chunks[0]: 0x1031000, size = 1040, prev (allocated) = 0
chunks[1]: 00 00 00 00 00 00 00 00 11 08 00 00 00 00 00 00
chunks[1]: 0x1031410, size = 2064, prev (allocated) = 0
chunks[2]: 00 00 00 00 00 00 00 00 11 0c 00 00 00 00 00 00
chunks[2]: 0x1031c20, size = 3088, prev (allocated) = 0
chunks[3]: 00 00 00 00 00 00 00 00 11 10 00 00 00 00 00 00
chunks[3]: 0x1032830, size = 4112, prev (allocated) = 0
chunks[4]: 10 10 00 00 00 00 00 00 10 14 00 00 00 00 00 00
chunks[4]: 0x1033840, size = 5136, prev (unallocated) = 4112
chunks[5]: 00 00 00 00 00 00 00 00 11 18 00 00 00 00 00 00
chunks[5]: 0x1034c50, size = 6160, prev (allocated) = 0
chunks[6]: 00 00 00 00 00 00 00 00 11 1c 00 00 00 00 00 00
chunks[6]: 0x1036460, size = 7184, prev (allocated) = 0
chunks[7]: 00 00 00 00 00 00 00 00 11 20 00 00 00 00 00 00
chunks[7]: 0x1038070, size = 8208, prev (allocated) = 0
chunks[8]: 10 20 00 00 00 00 00 00 10 24 00 00 00 00 00 00
chunks[8]: 0x103a080, size = 9232, prev (unallocated) = 8208
chunks[9]: 00 00 00 00 00 00 00 00 11 28 00 00 00 00 00 00
chunks[9]: 0x103c490, size = 10256, prev (allocated) = 0
julien@holberton:~/holberton/w/hackthevm3$ 

Is the heap actually growing upwards?

The last question left unanswered is: “Is the heap actually growing upwards?”. From the brk man page, it seems so:

DESCRIPTION
       brk() and sbrk() change the location of the program break, which defines the end  of  the
       process's  data  segment  (i.e., the program break is the first location after the end of
       the uninitialized data segment).  Increasing the program break has the effect of allocat‐
       ing memory to the process; decreasing the break deallocates memory.

Let’s check! (9-main.c)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/**
 * main - moving the program break
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    int i;

    write(1, "START\n", 6);
    malloc(1);
    getchar();
    write(1, "LOOP\n", 5);
    for (i = 0; i < 0x25000 / 1024; i++)
    {
        malloc(1024);
    }
    write(1, "END\n", 4);
    getchar();
    return (EXIT_SUCCESS);
}

Now let’s confirm this assumption with strace:

julien@holberton:~/holberton/w/hackthevm3$ strace ./9 
execve("./9", ["./9"], [/* 61 vars */]) = 0
...
write(1, "START\n", 6START
)                  = 6
brk(0)                                  = 0x1fd8000
brk(0x1ff9000)                          = 0x1ff9000
...
write(1, "LOOP\n", 5LOOP
)                   = 5
brk(0x201a000)                          = 0x201a000
write(1, "END\n", 4END
)                    = 4
...
julien@holberton:~/holberton/w/hackthevm3$ 

clearly, malloc made only two calls to brk to increase the allocated space on the heap. And the second call is using a higher memory address argument (0x201a000 > 0x1ff9000). The second syscall was triggered when the space on the heap was too small to host all the malloc calls.

Let’s double check with /proc.

julien@holberton:~/holberton/w/hackthevm3$ gcc -Wall -Wextra -pedantic -Werror 9-main.c -o 9
julien@holberton:~/holberton/w/hackthevm3$ ./9
START

julien@holberton:/proc/7855$ ps aux | grep \ \./9$
julien     7972  0.0  0.0   4332   684 pts/9    S+   19:08   0:00 ./9
julien@holberton:/proc/7855$ cd /proc/7972
julien@holberton:/proc/7972$ cat maps
...
00901000-00922000 rw-p 00000000 00:00 0                                  [heap]
...
julien@holberton:/proc/7972$ 

-> 00901000-00922000 rw-p 00000000 00:00 0 [heap]
Let’s hit Enter and look at the [heap] again:

LOOP
END

julien@holberton:/proc/7972$ cat maps
...
00901000-00943000 rw-p 00000000 00:00 0                                  [heap]
...
julien@holberton:/proc/7972$ 

-> 00901000-00943000 rw-p 00000000 00:00 0 [heap]
The beginning of the heap is still the same, but the size has increased upwards from 00922000 to 00943000.

The Address Space Layout Randomisation (ASLR)

You may have noticed something “strange” in the /proc/pid/maps listing above, that we want to study:

The program break is the address of the first location beyond the current end of the data region – so the address of the first location beyond the executable in the virtual memory. As a consequence, the heap should start right after the end of the executable in memory. As you can see in all above listing, it is NOT the case. The only thing that is true is that the heap is always the next memory region after the executable, which makes sense since the heap is actually part of the data segment of the executable itself. Also, if we look even closer, the memory gap size between the executable and the heap is never the same:

Format of the following lines: [PID of the above maps listings]: address of the beginning of the [heap] – address of the end of the executable = memory gap size

  • [3718]: 01195000 – 00602000 = b93000
  • [3834]: 024d6000 – 00602000 = 1ed4000
  • [4014]: 00e70000 – 00602000 = 86e000
  • [4172]: 01314000 – 00602000 = d12000
  • [7972]: 00901000 – 00602000 = 2ff000

It seems that this gap size is random, and indeed, it is. If we look at the ELF binary loader source code (fs/binfmt_elf.c) we can find this:

        if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) {
                current->mm->brk = current->mm->start_brk =
                        arch_randomize_brk(current->mm);
#ifdef compat_brk_randomized
                current->brk_randomized = 1;
#endif
        }

where current->mm->brk is the address of the program break. The arch_randomize_brk function can be found in the arch/x86/kernel/process.c file:

unsigned long arch_randomize_brk(struct mm_struct *mm)
{
        unsigned long range_end = mm->brk + 0x02000000;
        return randomize_range(mm->brk, range_end, 0) ? : mm->brk;
}

The randomize_range returns a start address such that:

    [...... <range> .....]
  start                  end

Source code of the randomize_range function (drivers/char/random.c):

/*
 * randomize_range() returns a start address such that
 *
 *    [...... <range> .....]
 *  start                  end
 *
 * a <range> with size "len" starting at the return value is inside in the
 * area defined by [start, end], but is otherwise randomized.
 */
unsigned long
randomize_range(unsigned long start, unsigned long end, unsigned long len)
{
        unsigned long range = end - len - start;

        if (end <= start + len)
                return 0;
        return PAGE_ALIGN(get_random_int() % range + start);
}

As a result, the offset between the data section of the executable and the program break initial position when the process runs can have a size of anywhere between 0 and 0x02000000. This randomization is known as Address Space Layout Randomisation (ASLR). ASLR is a computer security technique involved in preventing exploitation of memory corruption vulnerabilities. In order to prevent an attacker from jumping to, for example, a particular exploited function in memory, ASLR randomly arranges the address space positions of key data areas of a process, including the positions of the heap and the stack.

The updated VM diagram

With all the above in mind, we can now update our VM diagram:

Virtual memory diagram

malloc(0)

Did you ever wonder what was happening when we call malloc with a size of 0? Let’s check! (10-main.c)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/**                                                                                            
 * pmem - print mem                                                                            
 * @p: memory address to start printing from                                                   
 * @bytes: number of bytes to print                                                            
 *                                                                                             
 * Return: nothing                                                                             
 */
void pmem(void *p, unsigned int bytes)
{
    unsigned char *ptr;
    unsigned int i;

    ptr = (unsigned char *)p;
    for (i = 0; i < bytes; i++)
    {
        if (i != 0)
        {
            printf(" ");
        }
        printf("%02x", *(ptr + i));
    }
    printf("\n");
}

/**
 * main - moving the program break
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    void *p;
    size_t size_of_the_chunk;
    char prev_used;

    p = malloc(0);
    printf("%p\n", p);
    pmem((char *)p - 0x10, 0x10);
    size_of_the_chunk = *((size_t *)((char *)p - 8));
    prev_used = size_of_the_chunk & 1;
    size_of_the_chunk -= prev_used;
    printf("chunk size = %li bytes\n", size_of_the_chunk);
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm3$ gcc -Wall -Wextra -pedantic -Werror 10-main.c -o 10
julien@holberton:~/holberton/w/hackthevm3$ ./10
0xd08010
00 00 00 00 00 00 00 00 21 00 00 00 00 00 00 00
chunk size = 32 bytes
julien@holberton:~/holberton/w/hackthevm3$ 

-> malloc(0) is actually using 32 bytes, including the first 0x10 bytes.

Again, note that this will not always be the case. From the man page (man malloc):

NULL may also be returned by a successful call to malloc() with a size of zero

Outro

We have learned a couple of things about malloc and the heap. But there is actually more than brk and sbrk. You can try malloc’ing a big chunk of memory, strace it, and look at /proc to learn more before we cover it in a next chapter 🙂

Also, studying how free works in coordination with malloc is something we haven’t covered yet. If you want to look at it, you will find part of the answer to why the minimum chunk size is 32 (when we ask malloc for 0 bytes) vs 16 (0x10 in hexadecimal) or 0.

As usual, to be continued! Let me know if you have something you would like me to cover in the next chapter.

Questions? Feedback?

If you have questions or feedback don’t hesitate to ping us on Twitter at @holbertonschool or @julienbarbier42.
Haters, please send your comments to /dev/null.

Happy Hacking!

Thank you for reading!

As always, no-one is perfect (except Chuck of course), so don’t hesitate to contribute or send me your comments if you find anything I missed.

Files

This repo contains the source code (naive_malloc.c, version.c & “X-main.c` files) for programs created in this tutorial.

Read more about the virtual memory

Follow @holbertonschool or @julienbarbier42 on Twitter to get the next chapters!

Many thanks to Tim, Anne and Ian for proof-reading! 🙂

Hack the Virtual Memory: drawing the VM diagram

hack the virtual memory

Hack The Virtual Memory, chapter 2: Drawing the VM diagram

We previously talked about what you could find in the virtual memory of a process, and where you could find it. Today, we will try to “reconstruct” (part of) the following diagram by making our process print addresses of various elements of the program.

the virtual memory

Prerequisites

In order to fully understand this article, you will need to know:

  • The basics of the C programming language
  • A little bit of assembly (but not required)
  • The very basics of the Linux filesystem and the shell
  • We will also use the /proc/[pid]/maps file (see man proc or read our first article Hack The Virtual Memory, chapter 0: C strings & /proc)

Environment

All scripts and programs have been tested on the following system:

  • Ubuntu
    • Linux ubuntu 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
    • Everything we will write will be true for this system, but may be different on another system

Tools used:

  • gcc
    • gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
  • objdump
    • GNU objdump (GNU Binutils for Ubuntu) 2.24
  • udcli
    • udis86 1.7.2
  • bc
    • bc 1.06.95

The stack

The first thing we want to locate in our diagram is the stack. We know that in C, local variables are located on the stack. So if we print the address of a local variable, it should give us an idea on where we would find the stack in the virtual memory. Let’s use this program (main-1.c) to find out:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**
 * main - print locations of various elements
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    int a;

    printf("Address of a: %p\n", (void *)&a);
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm2$ gcc -Wall -Wextra -pedantic -Werror main-0.c -o 0
julien@holberton:~/holberton/w/hackthevm2$ ./0
Address of a: 0x7ffd14b8bd9c
julien@holberton:~/holberton/w/hackthevm2$ 

This will be our first point of reference when we will compare other elements’ addresses.

The heap

The heap is used when you malloc space for your variables. Let’s add a line to use malloc and see where the memory address returned by malloc is located (main-1.c):

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**
 * main - print locations of various elements
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    int a;
    void *p;

    printf("Address of a: %p\n", (void *)&a);
    p = malloc(98);
    if (p == NULL)
    {
        fprintf(stderr, "Can't malloc\n");
        return (EXIT_FAILURE);
    }
    printf("Allocated space in the heap: %p\n", p);
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm2$ gcc -Wall -Wextra -pedantic -Werror main-1.c -o 1
julien@holberton:~/holberton/w/hackthevm2$ ./1 
Address of a: 0x7ffd4204c554
Allocated space in the heap: 0x901010
julien@holberton:~/holberton/w/hackthevm2$ 

It’s now clear that the heap (0x901010) is way below the stack (0x7ffd4204c554). At this point we can already draw this diagram:

heap and stack

The executable

Your program is also in the virtual memory. If we print the address of the main function, we should have an idea of where the program is located compared to the stack and the heap. Let’s see if we find it below the heap as expected (main-2.c):

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**
 * main - print locations of various elements
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    int a;
    void *p;

    printf("Address of a: %p\n", (void *)&a);
    p = malloc(98);
    if (p == NULL)
    {
        fprintf(stderr, "Can't malloc\n");
        return (EXIT_FAILURE);
    }
    printf("Allocated space in the heap: %p\n", p);
    printf("Address of function main: %p\n", (void *)main);
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm2$ gcc -Wall -Wextra -Werror main-2.c -o 2
julien@holberton:~/holberton/w/hackthevm2$ ./2 
Address of a: 0x7ffdced37d74
Allocated space in the heap: 0x2199010
Address of function main: 0x40060d
julien@holberton:~/holberton/w/hackthevm2$ 

It seems that our program (0x40060d) is located below the heap (0x2199010), just as expected.
But let’s make sure that this is the actual code of our program, and not some sort of pointer to another location. Let’s disassemble our program 2 with objdump to look at the “memory address” of the main function:

julien@holberton:~/holberton/w/hackthevm2$ objdump -M intel -j .text -d 2 | grep '<main>:' -A 5
000000000040060d <main>:
  40060d:   55                      push   rbp
  40060e:   48 89 e5                mov    rbp,rsp
  400611:   48 83 ec 10             sub    rsp,0x10
  400615:   48 8d 45 f4             lea    rax,[rbp-0xc]
  400619:   48 89 c6                mov    rsi,rax

000000000040060d <main> -> we find the exact same address (0x40060d). If you still have any doubts, you can print the first bytes located at this address, to make sure they match the output of objdump (main-3.c):

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**
 * main - print locations of various elements
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    int a;
    void *p;
    unsigned int i;

    printf("Address of a: %p\n", (void *)&a);
    p = malloc(98);
    if (p == NULL)
    {
        fprintf(stderr, "Can't malloc\n");
        return (EXIT_FAILURE);
    }
    printf("Allocated space in the heap: %p\n", p);
    printf("Address of function main: %p\n", (void *)main);
    printf("First bytes of the main function:\n\t");
    for (i = 0; i < 15; i++)
    {
        printf("%02x ", ((unsigned char *)main)[i]);
    }
    printf("\n");
    return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm2$ gcc -Wall -Wextra -Werror main-3.c -o 3
julien@holberton:~/holberton/w/hackthevm2$ objdump -M intel -j .text -d 3 | grep '<main>:' -A 5
000000000040064d <main>:
  40064d:   55                      push   rbp
  40064e:   48 89 e5                mov    rbp,rsp
  400651:   48 83 ec 10             sub    rsp,0x10
  400655:   48 8d 45 f0             lea    rax,[rbp-0x10]
  400659:   48 89 c6                mov    rsi,rax
julien@holberton:~/holberton/w/hackthevm2$ ./3 
Address of a: 0x7ffeff0f13b0
Allocated space in the heap: 0x8b3010
Address of function main: 0x40064d
First bytes of the main function:
    55 48 89 e5 48 83 ec 10 48 8d 45 f0 48 89 c6 
julien@holberton:~/holberton/w/hackthevm2$ echo "55 48 89 e5 48 83 ec 10 48 8d 45 f0 48 89 c6" | udcli -64 -x -o 40064d
000000000040064d 55               push rbp                
000000000040064e 4889e5           mov rbp, rsp            
0000000000400651 4883ec10         sub rsp, 0x10           
0000000000400655 488d45f0         lea rax, [rbp-0x10]     
0000000000400659 4889c6           mov rsi, rax            
julien@holberton:~/holberton/w/hackthevm2$

-> We can see that we print the same address and the same content. We are now triple sure this is our main function.

You can download the Udis86 Disassembler Library here.

Here is the updated diagram, based on what we have learned:

stack, heap and executable

Command line arguments and environment variables

The main function can take arguments:

  • The command line arguments
    • the first argument of the main function (usually named argc or ac) is the number of command line arguments
    • the second argument of the main function (usually named argv or av) is an array of pointers to the arguments (C strings)
  • The environment variables
    • the third argument of the main function (usally named env or envp) is an array of pointers to the environment variables (C strings)

Let’s see where those elements stand in the virtual memory of our process (main-4.c):

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**
 * main - print locations of various elements
 *
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS
 */
int main(int ac, char **av, char **env)
{
        int a;
        void *p;
        int i;

        printf("Address of a: %p\n", (void *)&a);
        p = malloc(98);
        if (p == NULL)
        {
                fprintf(stderr, "Can't malloc\n");
                return (EXIT_FAILURE);
        }
        printf("Allocated space in the heap: %p\n", p);
        printf("Address of function main: %p\n", (void *)main);
        printf("First bytes of the main function:\n\t");
        for (i = 0; i < 15; i++)
        {
                printf("%02x ", ((unsigned char *)main)[i]);
        }
        printf("\n");
        printf("Address of the array of arguments: %p\n", (void *)av);
        printf("Addresses of the arguments:\n\t");
        for (i = 0; i < ac; i++)
        {
                printf("[%s]:%p ", av[i], av[i]);
        }
        printf("\n");
        printf("Address of the array of environment variables: %p\n", (void *)env);
    printf("Address of the first environment variable: %p\n", (void *)(env[0]));
        return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm2$ gcc -Wall -Wextra -Werror main-4.c -o 4
julien@holberton:~/holberton/w/hackthevm2$ ./4 Hello Holberton School!
Address of a: 0x7ffe7d6d8da0
Allocated space in the heap: 0xc8c010
Address of function main: 0x40069d
First bytes of the main function:
    55 48 89 e5 48 83 ec 30 89 7d ec 48 89 75 e0 
Address of the array of arguments: 0x7ffe7d6d8e98
Addresses of the arguments:
    [./4]:0x7ffe7d6da373 [Hello]:0x7ffe7d6da377 [Holberton]:0x7ffe7d6da37d [School!]:0x7ffe7d6da387 
Address of the array of environment variables: 0x7ffe7d6d8ec0
Address of the first environment variables:
    [0x7ffe7d6da38f]:"XDG_VTNR=7"
    [0x7ffe7d6da39a]:"XDG_SESSION_ID=c2"
    [0x7ffe7d6da3ac]:"CLUTTER_IM_MODULE=xim"
julien@holberton:~/holberton/w/hackthevm2$ 

These elements are above the stack as expected, but now we know the exact order: stack (0x7ffe7d6d8da0) < argv (0x7ffe7d6d8e98) < env (0x7ffe7d6d8ec0) < arguments (from 0x7ffe7d6da373 to 0x7ffe7d6da387 + 8 (8 = size of the string "school" + 1 for the '\0' char)) < environment variables (starting at 0x7ffe7d6da38f).

Actually, we can also see that all the command line arguments are next to each other in the memory, and also right next to the environment variables.

Are the argv and env arrays next to each other?

The array argv is 5 elements long (there were 4 arguments from the command line + 1 NULL element at the end (argv always ends with NULL to mark the end of the array)). Each element is a pointer to a char and since we are on a 64-bit machine, a pointer is 8 bytes (if you want to make sure, you can use the C operator sizeof() to get the size of a pointer). As a result our argv array is of size 5 * 8 = 40. 40 in base 10 is 0x28 in base 16. If we add this value to the address of the beginning of the array 0x7ffe7d6d8e98, we get… 0x7ffe7d6d8ec0 (The address of the env array)! So the two arrays are next to each other in memory.

Is the first command line argument stored right after the env array?

In order to check this we need to know the size of the env array. We know that it ends with a NULL pointer, so in order to get the number of its elements we simply need to loop through it, checking if the “current” element is NULL. Here’s the updated C code (main-5.c):

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**                                                                                                      
 * main - print locations of various elements                                                            
 *                                                                                                       
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS                                      
 */
int main(int ac, char **av, char **env)
{
     int a;
     void *p;
     int i;
     int size;

     printf("Address of a: %p\n", (void *)&a);
     p = malloc(98);
     if (p == NULL)
     {
          fprintf(stderr, "Can't malloc\n");
          return (EXIT_FAILURE);
     }
     printf("Allocated space in the heap: %p\n", p);
     printf("Address of function main: %p\n", (void *)main);
     printf("First bytes of the main function:\n\t");
     for (i = 0; i < 15; i++)
     {
          printf("%02x ", ((unsigned char *)main)[i]);
     }
     printf("\n");
     printf("Address of the array of arguments: %p\n", (void *)av);
     printf("Addresses of the arguments:\n\t");
     for (i = 0; i < ac; i++)
     {
          printf("[%s]:%p ", av[i], av[i]);
     }
     printf("\n");
     printf("Address of the array of environment variables: %p\n", (void *)env);
     printf("Address of the first environment variables:\n");
     for (i = 0; i < 3; i++)
     {
          printf("\t[%p]:\"%s\"\n", env[i], env[i]);
     }
     /* size of the env array */
     i = 0;
     while (env[i] != NULL)
     {
          i++;
     }
     i++; /* the NULL pointer */
     size = i * sizeof(char *);
     printf("Size of the array env: %d elements -> %d bytes (0x%x)\n", i, size, size);
     return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm2$ ./5 Hello Betty Holberton!
Address of a: 0x7ffc77598acc
Allocated space in the heap: 0x2216010
Address of function main: 0x40069d
First bytes of the main function:
    55 48 89 e5 48 83 ec 40 89 7d dc 48 89 75 d0 
Address of the array of arguments: 0x7ffc77598bc8
Addresses of the arguments:
    [./5]:0x7ffc7759a374 [Hello]:0x7ffc7759a378 [Betty]:0x7ffc7759a37e [Holberton!]:0x7ffc7759a384 
Address of the array of environment variables: 0x7ffc77598bf0
Address of the first environment variables:
    [0x7ffc7759a38f]:"XDG_VTNR=7"
    [0x7ffc7759a39a]:"XDG_SESSION_ID=c2"
    [0x7ffc7759a3ac]:"CLUTTER_IM_MODULE=xim"
Size of the array env: 62 elements -> 496 bytes (0x1f0)
julien@holberton:~/holberton/w/hackthevm2$ bc
bc 1.06.95
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
obase=16
ibase=16
1F0+7FFC77598BF0
7FFC77598DE0
quit
julien@holberton:~/holberton/w/hackthevm2$ 

-> 7FFC77598DE0 != (but still <) 0x7ffc7759a374. So the answer is no 🙂

Wrapping up

Let’s update our diagram with what we learned.

virtual memory with command line arguments and environment variables

Is the stack really growing downwards?

Let’s call a function and figure this out! If this is true, then the variables of the calling function will be higher in memory than those from the called function (main-6.c).

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**                                                                                                      
 * f - print locations of various elements                                                               
 *                                                                                                       
 * Returns: nothing                                                                                      
 */
void f(void)
{
     int a;
     int b;
     int c;

     a = 98;
     b = 1024;
     c = a * b;
     printf("[f] a = %d, b = %d, c = a * b = %d\n", a, b, c);
     printf("[f] Adresses of a: %p, b = %p, c = %p\n", (void *)&a, (void *)&b, (void *)&c);
}

/**                                                                                                      
 * main - print locations of various elements                                                            
 *                                                                                                       
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS                                      
 */
int main(int ac, char **av, char **env)
{
     int a;
     void *p;
     int i;
     int size;

     printf("Address of a: %p\n", (void *)&a);
     p = malloc(98);
     if (p == NULL)
     {
          fprintf(stderr, "Can't malloc\n");
          return (EXIT_FAILURE);
     }
     printf("Allocated space in the heap: %p\n", p);
     printf("Address of function main: %p\n", (void *)main);
     printf("First bytes of the main function:\n\t");
     for (i = 0; i < 15; i++)
     {
          printf("%02x ", ((unsigned char *)main)[i]);
     }
     printf("\n");
     printf("Address of the array of arguments: %p\n", (void *)av);
     printf("Addresses of the arguments:\n\t");
     for (i = 0; i < ac; i++)
     {
          printf("[%s]:%p ", av[i], av[i]);
     }
     printf("\n");
     printf("Address of the array of environment variables: %p\n", (void *)env);
     printf("Address of the first environment variables:\n");
     for (i = 0; i < 3; i++)
     {
          printf("\t[%p]:\"%s\"\n", env[i], env[i]);
     }
     /* size of the env array */
     i = 0;
     while (env[i] != NULL)
     {
          i++;
     }
     i++; /* the NULL pointer */
     size = i * sizeof(char *);
     printf("Size of the array env: %d elements -> %d bytes (0x%x)\n", i, size, size);
     f();
     return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm2$ gcc -Wall -Wextra -Werror main-6.c -o 6
julien@holberton:~/holberton/w/hackthevm2$ ./6
Address of a: 0x7ffdae53ea4c
Allocated space in the heap: 0xf32010
Address of function main: 0x4006f9
First bytes of the main function:
    55 48 89 e5 48 83 ec 40 89 7d dc 48 89 75 d0 
Address of the array of arguments: 0x7ffdae53eb48
Addresses of the arguments:
    [./6]:0x7ffdae54038b 
Address of the array of environment variables: 0x7ffdae53eb58
Address of the first environment variables:
    [0x7ffdae54038f]:"XDG_VTNR=7"
    [0x7ffdae54039a]:"XDG_SESSION_ID=c2"
    [0x7ffdae5403ac]:"CLUTTER_IM_MODULE=xim"
Size of the array env: 62 elements -> 496 bytes (0x1f0)
[f] a = 98, b = 1024, c = a * b = 100352
[f] Adresses of a: 0x7ffdae53ea04, b = 0x7ffdae53ea08, c = 0x7ffdae53ea0c
julien@holberton:~/holberton/w/hackthevm2$ 

-> True! (address of var a in function f) 0x7ffdae53ea04 < 0x7ffdae53ea4c (address of var a in function main)

We now update our diagram:

stack is growing downwards

/proc

Let’s double check everything we found so far with /proc/[pid]/maps (man proc or refer to the first article in this series to learn about the proc filesystem if you don’t know what it is).

Let’s add a getchar() to our program so that we can look at its “/proc” (main-7.c):

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**                                                                                                      
 * f - print locations of various elements                                                               
 *                                                                                                       
 * Returns: nothing                                                                                      
 */
void f(void)
{
     int a;
     int b;
     int c;

     a = 98;
     b = 1024;
     c = a * b;
     printf("[f] a = %d, b = %d, c = a * b = %d\n", a, b, c);
     printf("[f] Adresses of a: %p, b = %p, c = %p\n", (void *)&a, (void *)&b, (void *)&c);
}

/**                                                                                                      
 * main - print locations of various elements                                                            
 *                                                                                                       
 * Return: EXIT_FAILURE if something failed. Otherwise EXIT_SUCCESS                                      
 */
int main(int ac, char **av, char **env)
{
     int a;
     void *p;
     int i;
     int size;

     printf("Address of a: %p\n", (void *)&a);
     p = malloc(98);
     if (p == NULL)
     {
          fprintf(stderr, "Can't malloc\n");
          return (EXIT_FAILURE);
     }
     printf("Allocated space in the heap: %p\n", p);
     printf("Address of function main: %p\n", (void *)main);
     printf("First bytes of the main function:\n\t");
     for (i = 0; i < 15; i++)
     {
          printf("%02x ", ((unsigned char *)main)[i]);
     }
     printf("\n");
     printf("Address of the array of arguments: %p\n", (void *)av);
     printf("Addresses of the arguments:\n\t");
     for (i = 0; i < ac; i++)
     {
          printf("[%s]:%p ", av[i], av[i]);
     }
     printf("\n");
     printf("Address of the array of environment variables: %p\n", (void *)env);
     printf("Address of the first environment variables:\n");
     for (i = 0; i < 3; i++)
     {
          printf("\t[%p]:\"%s\"\n", env[i], env[i]);
     }
     /* size of the env array */
     i = 0;
     while (env[i] != NULL)
     {
          i++;
     }
     i++; /* the NULL pointer */
     size = i * sizeof(char *);
     printf("Size of the array env: %d elements -> %d bytes (0x%x)\n", i, size, size);
     f();
     getchar();
     return (EXIT_SUCCESS);
}
julien@holberton:~/holberton/w/hackthevm2$ gcc -Wall -Wextra -Werror main-7.c -o 7
julien@holberton:~/holberton/w/hackthevm2$ ./7 Rona is a Legend SRE
Address of a: 0x7fff16c8146c
Allocated space in the heap: 0x2050010
Address of function main: 0x400739
First bytes of the main function:
    55 48 89 e5 48 83 ec 40 89 7d dc 48 89 75 d0 
Address of the array of arguments: 0x7fff16c81568
Addresses of the arguments:
    [./7]:0x7fff16c82376 [Rona]:0x7fff16c8237a [is]:0x7fff16c8237f [a]:0x7fff16c82382 [Legend]:0x7fff16c82384 [SRE]:0x7fff16c8238b 
Address of the array of environment variables: 0x7fff16c815a0
Address of the first environment variables:
    [0x7fff16c8238f]:"XDG_VTNR=7"
    [0x7fff16c8239a]:"XDG_SESSION_ID=c2"
    [0x7fff16c823ac]:"CLUTTER_IM_MODULE=xim"
Size of the array env: 62 elements -> 496 bytes (0x1f0)
[f] a = 98, b = 1024, c = a * b = 100352
[f] Adresses of a: 0x7fff16c81424, b = 0x7fff16c81428, c = 0x7fff16c8142c

julien@holberton:~$ ps aux | grep "./7" | grep -v grep
julien     5788  0.0  0.0   4336   628 pts/8    S+   18:04   0:00 ./7 Rona is a Legend SRE
julien@holberton:~$ cat /proc/5788/maps
00400000-00401000 r-xp 00000000 08:01 171828                             /home/julien/holberton/w/hackthevm2/7
00600000-00601000 r--p 00000000 08:01 171828                             /home/julien/holberton/w/hackthevm2/7
00601000-00602000 rw-p 00001000 08:01 171828                             /home/julien/holberton/w/hackthevm2/7
02050000-02071000 rw-p 00000000 00:00 0                                  [heap]
7f68caa1c000-7f68cabd6000 r-xp 00000000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f68cabd6000-7f68cadd6000 ---p 001ba000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f68cadd6000-7f68cadda000 r--p 001ba000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f68cadda000-7f68caddc000 rw-p 001be000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f68caddc000-7f68cade1000 rw-p 00000000 00:00 0 
7f68cade1000-7f68cae04000 r-xp 00000000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f68cafe8000-7f68cafeb000 rw-p 00000000 00:00 0 
7f68cafff000-7f68cb003000 rw-p 00000000 00:00 0 
7f68cb003000-7f68cb004000 r--p 00022000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f68cb004000-7f68cb005000 rw-p 00023000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f68cb005000-7f68cb006000 rw-p 00000000 00:00 0 
7fff16c62000-7fff16c83000 rw-p 00000000 00:00 0                          [stack]
7fff16d07000-7fff16d09000 r--p 00000000 00:00 0                          [vvar]
7fff16d09000-7fff16d0b000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
julien@holberton:~$ 

Let’s check a few things:

  • The stack starts at 7fff16c62000 and ends at 7fff16c83000. Our variables are all inside this region (0x7fff16c8146c, 0x7fff16c81424, 0x7fff16c81428, 0x7fff16c8142c)
  • The heap starts at 02050000 and ends at 02071000. Our allocated memory is in there (0x2050010)
  • Our code (the main function) is located at address 0x400739, so in the following region:
    00400000-00401000 r-xp 00000000 08:01 171828 /home/julien/holberton/w/hackthevm2/7
    It comes from the file /home/julien/holberton/w/hackthevm2/7 (our executable) and this region has execution permissions, which also makes sense.
  • The arguments and environment variables (from 0x7fff16c81568 to 0x7fff16c8238f + 0x1f0) are located in the region starting at 7fff16c62000 and ending at 7fff16c83000… the stack! 🙂 So they are IN the stack, not outside the stack.

This also brings up more questions:

  • Why is our executable “divided” into three different regions with different permissions? What is inside these two regions?
    • 00600000-00601000 r--p 00000000 08:01 171828 /home/julien/holberton/w/hackthevm2/7
    • 00601000-00602000 rw-p 00001000 08:01 171828 /home/julien/holberton/w/hackthevm2/7
  • What are all those other regions?
  • Why our allocated memory does not start at the very beginning of the heap (0x2050010 vs 02050000)? What are those first 16 bytes used for?

There is also another fact that we haven’t checked: Is the heap actually growing upwards?

We’ll find out another day! But before we end this chapter, let’s update our diagram with everything we’ve learned:

the virtual memory

Outro

We have learned a ton of things by simply printing informations from our executables! But we still have open questions that we will explore in a future chapter to complete our diagram of the virtual memory. In the meantime, you should try to find out yourself.

Questions? Feedback?

If you have questions or feedback don’t hesitate to ping us on Twitter at @holbertonschool or @julienbarbier42.
Haters, please send your comments to /dev/null.

Happy Hacking!

Thank you for reading!

As always, no-one is perfect (except Chuck of course), so don’t hesitate to contribute or send me your comments if you find anything I missed.

Files

This repo contains the source code (main-X.c files) for programs created in this tutorial.

Read more about the virtual memory

Follow @holbertonschool or @julienbarbier42 on Twitter to get the next chapters! This was the third chapter in our series on the virtual memory.

Many thanks to Tim and SantaCruzDad for English proof-reading! 🙂

Hack The Virtual Memory: Python bytes

hack the vm!

Hack The Virtual Memory, Chapter 1: Python bytes

For this second chapter, we’ll do almost the same thing as for chapter 0: C strings & /proc, but instead we’ll access the virtual memory of a running Python 3 script. It won’t be as straightfoward.
Let’s take this as an excuse to look at some Python 3 internals!

Prerequisites

This article is based on everything we learned in the previous chapter. Please read (and understand) chapter 0: C strings & /proc before reading this one.

In order to fully understand this article, you need to know:

  • The basics of the C programming language
  • Some Python
  • The very basics of the Linux filesystem and the shell
  • The very basics of the /proc filesystem (see chapter 0: C strings & /proc for an intro on this topic)

Environment

All scripts and programs have been tested on the following system:

  • Ubuntu
    • Linux ubuntu 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • gcc
    • gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
  • Python 3:
    • Python 3.4.3 (default, Nov 17 2016, 01:08:31)
    • [GCC 4.8.4] on linux

Python script

We’ll first use this script (main.py) and try to modify the “string” Holberton in the virtual memory of the process running it.

#!/usr/bin/env python3
'''
Prints a b"string" (bytes object), reads a char from stdin
and prints the same (or not :)) string again
'''

import sys

s = b"Holberton"
print(s)
sys.stdin.read(1)
print(s)

About the bytes object

bytes vs str

As you can see, we are using a bytes object (we use a b in front of our string literal) to store our string. This type will store the characters of the string as bytes (vs potentially multibytes – you can read the unicodeobject.h to learn more about how Python 3 encodes strings). This ensures that the string will be a succession of ASCII-values in the virtual memory of the process running the script.

Technically s is not a Python string (but it doesn’t matter in our context):

julien@holberton:~/holberton/w/hackthevm1$ python3
Python 3.4.3 (default, Nov 17 2016, 01:08:31) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "Betty"
>>> type(s)
<class 'str'>
>>> s = b"Betty"
>>> type(s)
<class 'bytes'>
>>> quit()

Everything is an object

Everything in Python is an object: integers, strings, bytes, functions, everything. So the line s = b"Holberton" should create an object of type bytes, and store the string b"Holberton somewhere in memory. Probably in the heap since it has to reserve space for the object and the bytes referenced by or stored in the object (at this point we don’t know about the exact implementation).

Running read_write_heap.py against the Python script

Note: read_write_heap.py is a script we wrote in the previous chapter chapter 0: C strings & /proc

Let’s run the above script and then run our read_write_heap.py script:

julien@holberton:~/holberton/w/hackthevm1$ ./main.py 
b'Holberton'

At this point main.py is waiting for the user to hit Enter. That corresponds to the line sys.stdin.read(1) in our code.
Let’s run read_write_heap.py:

julien@holberton:~/holberton/w/hackthevm1$ ps aux | grep main.py | grep -v grep
julien     3929  0.0  0.7  31412  7848 pts/0    S+   15:10   0:00 python3 ./main.py
julien@holberton:~/holberton/w/hackthevm1$ sudo ./read_write_heap.py 3929 Holberton "~ Betty ~"
[*] maps: /proc/3929/maps
[*] mem: /proc/3929/mem
[*] Found [heap]:
    pathname = [heap]
    addresses = 022dc000-023c6000
    permisions = rw-p
    offset = 00000000
    inode = 0
    Addr start [22dc000] | end [23c6000]
[*] Found 'Holberton' at 8e192
[*] Writing '~ Betty ~' at 236a192
julien@holberton:~/holberton/w/hackthevm1$ 

Easy! As expected, we found the string on the heap and replaced it. Now when we will hit Enter in the main.py script, it will print b'~ Betty ~':


b'Holberton' julien@holberton:~/holberton/w/hackthevm1$

Wait, WAT?!

WAT!

We found the string “Holberton” and replaced it, but it was not the correct string?
Before we go down the rabbit hole, we have one more thing to check. Our script stops when it finds the first occurence of the string. Let’s run it several times to see if there are more occurences of the same string in the heap.

julien@holberton:~/holberton/w/hackthevm1$ ./main.py 
b'Holberton'


julien@holberton:~/holberton/w/hackthevm1$ ps aux | grep main.py | grep -v grep julien 4051 0.1 0.7 31412 7832 pts/0 S+ 15:53 0:00 python3 ./main.py julien@holberton:~/holberton/w/hackthevm1$ sudo ./read_write_heap.py 4051 Holberton "~ Betty ~" [*] maps: /proc/4051/maps [*] mem: /proc/4051/mem [*] Found [heap]: pathname = [heap] addresses = 00bf4000-00cde000 permisions = rw-p offset = 00000000 inode = 0 Addr start [bf4000] | end [cde000] [*] Found 'Holberton' at 8e162 [*] Writing '~ Betty ~' at c82162 julien@holberton:~/holberton/w/hackthevm1$ sudo ./read_write_heap.py 4051 Holberton "~ Betty ~" [*] maps: /proc/4051/maps [*] mem: /proc/4051/mem [*] Found [heap]: pathname = [heap] addresses = 00bf4000-00cde000 permisions = rw-p offset = 00000000 inode = 0 Addr start [bf4000] | end [cde000] Can't find 'Holberton' julien@holberton:~/holberton/w/hackthevm1$

Only one occurence. So where is the string “Holberton” that is used by the script? Where is our Python bytes object in memory? Could it be in the stack? Let’s replace “[heap]” by “[stack]”* in our read_write_heap.py script to create the read_write_stack.py:

(*) _see previous article, the stack region is called “[stack]” on the /proc/[pid]/maps file_

#!/usr/bin/env python3
'''
Locates and replaces the first occurrence of a string in the stack
of a process

Usage: ./read_write_stack.py PID search_string replace_by_string
Where:
- PID is the pid of the target process
- search_string is the ASCII string you are looking to overwrite
- replace_by_string is the ASCII string you want to replace
search_string with
'''

import sys

def print_usage_and_exit():
    print('Usage: {} pid search write'.format(sys.argv[0]))
    sys.exit(1)

# check usage
if len(sys.argv) != 4:
    print_usage_and_exit()

# get the pid from args
pid = int(sys.argv[1])
if pid <= 0:
    print_usage_and_exit()
search_string = str(sys.argv[2])
if search_string  == "":
    print_usage_and_exit()
write_string = str(sys.argv[3])
if search_string  == "":
    print_usage_and_exit()

# open the maps and mem files of the process
maps_filename = "/proc/{}/maps".format(pid)
print("[*] maps: {}".format(maps_filename))
mem_filename = "/proc/{}/mem".format(pid)
print("[*] mem: {}".format(mem_filename))

# try opening the maps file
try:
    maps_file = open('/proc/{}/maps'.format(pid), 'r')
except IOError as e:
    print("[ERROR] Can not open file {}:".format(maps_filename))
    print("        I/O error({}): {}".format(e.errno, e.strerror))
    sys.exit(1)

for line in maps_file:
    sline = line.split(' ')
    # check if we found the stack
    if sline[-1][:-1] != "[stack]":
        continue
    print("[*] Found [stack]:")

    # parse line
    addr = sline[0]
    perm = sline[1]
    offset = sline[2]
    device = sline[3]
    inode = sline[4]
    pathname = sline[-1][:-1]
    print("\tpathname = {}".format(pathname))
    print("\taddresses = {}".format(addr))
    print("\tpermisions = {}".format(perm))
    print("\toffset = {}".format(offset))
    print("\tinode = {}".format(inode))

    # check if there is read and write permission
    if perm[0] != 'r' or perm[1] != 'w':
        print("[*] {} does not have read/write permission".format(pathname))
        maps_file.close()
        exit(0)

    # get start and end of the stack in the virtual memory
    addr = addr.split("-")
    if len(addr) != 2: # never trust anyone, not even your OS :)
        print("[*] Wrong addr format")
        maps_file.close()
        exit(1)
    addr_start = int(addr[0], 16)
    addr_end = int(addr[1], 16)
    print("\tAddr start [{:x}] | end [{:x}]".format(addr_start, addr_end))

    # open and read mem
    try:
        mem_file = open(mem_filename, 'rb+')
    except IOError as e:
        print("[ERROR] Can not open file {}:".format(mem_filename))
        print("        I/O error({}): {}".format(e.errno, e.strerror))
        maps_file.close()
        exit(1)

    # read stack
    mem_file.seek(addr_start)
    stack = mem_file.read(addr_end - addr_start)

    # find string
    try:
        i = stack.index(bytes(search_string, "ASCII"))
    except Exception:
        print("Can't find '{}'".format(search_string))
        maps_file.close()
        mem_file.close()
        exit(0)
    print("[*] Found '{}' at {:x}".format(search_string, i))

    # write the new stringprint("[*] Writing '{}' at {:x}".format(write_string, addr_start + i))
    mem_file.seek(addr_start + i)
    mem_file.write(bytes(write_string, "ASCII"))

    # close filesmaps_file.close()
    mem_file.close()

    # there is only one stack in our example
    break

The above script (read_write_heap.py) does exactly the same thing than the previous one (read_write_stack.py), the same way. Except we’re looking at the stack, instead of the heap. Let’s try to find our string in the stack:

julien@holberton:~/holberton/w/hackthevm1$ ./main.py
b'Holberton'

julien@holberton:~/holberton/w/hackthevm1$ ps aux | grep main.py | grep -v grep
julien     4124  0.2  0.7  31412  7848 pts/0    S+   16:10   0:00 python3 ./main.py
julien@holberton:~/holberton/w/hackthevm1$ sudo ./read_write_stack.py 4124 Holberton "~ Betty ~"
[sudo] password for julien: 
[*] maps: /proc/4124/maps
[*] mem: /proc/4124/mem
[*] Found [stack]:
    pathname = [stack]
    addresses = 7fff2997e000-7fff2999f000
    permisions = rw-p
    offset = 00000000
    inode = 0
    Addr start [7fff2997e000] | end [7fff2999f000]
Can't find 'Holberton'
julien@holberton:~/holberton/w/hackthevm1$ 

So our string is not in the heap and not in the stack: Where is it? It’s time to dig into Python 3 internals and locate the string using what we will learn. Brace yourselves, the fun will begin now 🙂

Locating the string in the virtual memory

Note: It is important to note that there are many implementations of Python 3. In this article, we are using the original and most commonly used: CPython (coded in C). What we are about to say about Python 3 will only be true for this implementation.

id

There is a simple way to know where the object (be careful, the object is not the string) is in the virtual memory. CPython has a specific implementation of the id() builtin: id() will return the address of the object in memory.

If we add a line to the Python script to print the id of our object, we should get its address (main_id.py):

#!/usr/bin/env python3
'''
Prints:
- the address of the bytes object
- a b"string" (bytes object)
reads a char from stdin
and prints the same (or not :)) string again
'''

import sys

s = b"Holberton"
print(hex(id(s)))
print(s)
sys.stdin.read(1)
print(s)
julien@holberton:~/holberton/w/hackthevm1$ ./main_id.py
0x7f343f010210
b'Holberton'

-> 0x7f343f010210. Let’s look at /proc/ to understand where exactly our object is located.

julien@holberton:/usr/include/python3.4$ ps aux | grep main_id.py | grep -v grep
julien     4344  0.0  0.7  31412  7856 pts/0    S+   16:53   0:00 python3 ./main_id.py
julien@holberton:/usr/include/python3.4$ cat /proc/4344/maps
00400000-006fa000 r-xp 00000000 08:01 655561                             /usr/bin/python3.4
008f9000-008fa000 r--p 002f9000 08:01 655561                             /usr/bin/python3.4
008fa000-00986000 rw-p 002fa000 08:01 655561                             /usr/bin/python3.4
00986000-009a2000 rw-p 00000000 00:00 0 
021ba000-022a4000 rw-p 00000000 00:00 0                                  [heap]
7f343d797000-7f343de79000 r--p 00000000 08:01 663747                     /usr/lib/locale/locale-archive
7f343de79000-7f343df7e000 r-xp 00000000 08:01 136303                     /lib/x86_64-linux-gnu/libm-2.19.so
7f343df7e000-7f343e17d000 ---p 00105000 08:01 136303                     /lib/x86_64-linux-gnu/libm-2.19.so
7f343e17d000-7f343e17e000 r--p 00104000 08:01 136303                     /lib/x86_64-linux-gnu/libm-2.19.so
7f343e17e000-7f343e17f000 rw-p 00105000 08:01 136303                     /lib/x86_64-linux-gnu/libm-2.19.so
7f343e17f000-7f343e197000 r-xp 00000000 08:01 136416                     /lib/x86_64-linux-gnu/libz.so.1.2.8
7f343e197000-7f343e396000 ---p 00018000 08:01 136416                     /lib/x86_64-linux-gnu/libz.so.1.2.8
7f343e396000-7f343e397000 r--p 00017000 08:01 136416                     /lib/x86_64-linux-gnu/libz.so.1.2.8
7f343e397000-7f343e398000 rw-p 00018000 08:01 136416                     /lib/x86_64-linux-gnu/libz.so.1.2.8
7f343e398000-7f343e3bf000 r-xp 00000000 08:01 136275                     /lib/x86_64-linux-gnu/libexpat.so.1.6.0
7f343e3bf000-7f343e5bf000 ---p 00027000 08:01 136275                     /lib/x86_64-linux-gnu/libexpat.so.1.6.0
7f343e5bf000-7f343e5c1000 r--p 00027000 08:01 136275                     /lib/x86_64-linux-gnu/libexpat.so.1.6.0
7f343e5c1000-7f343e5c2000 rw-p 00029000 08:01 136275                     /lib/x86_64-linux-gnu/libexpat.so.1.6.0
7f343e5c2000-7f343e5c4000 r-xp 00000000 08:01 136408                     /lib/x86_64-linux-gnu/libutil-2.19.so
7f343e5c4000-7f343e7c3000 ---p 00002000 08:01 136408                     /lib/x86_64-linux-gnu/libutil-2.19.so
7f343e7c3000-7f343e7c4000 r--p 00001000 08:01 136408                     /lib/x86_64-linux-gnu/libutil-2.19.so
7f343e7c4000-7f343e7c5000 rw-p 00002000 08:01 136408                     /lib/x86_64-linux-gnu/libutil-2.19.so
7f343e7c5000-7f343e7c8000 r-xp 00000000 08:01 136270                     /lib/x86_64-linux-gnu/libdl-2.19.so
7f343e7c8000-7f343e9c7000 ---p 00003000 08:01 136270                     /lib/x86_64-linux-gnu/libdl-2.19.so
7f343e9c7000-7f343e9c8000 r--p 00002000 08:01 136270                     /lib/x86_64-linux-gnu/libdl-2.19.so
7f343e9c8000-7f343e9c9000 rw-p 00003000 08:01 136270                     /lib/x86_64-linux-gnu/libdl-2.19.so
7f343e9c9000-7f343eb83000 r-xp 00000000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f343eb83000-7f343ed83000 ---p 001ba000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f343ed83000-7f343ed87000 r--p 001ba000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f343ed87000-7f343ed89000 rw-p 001be000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f343ed89000-7f343ed8e000 rw-p 00000000 00:00 0 
7f343ed8e000-7f343eda7000 r-xp 00000000 08:01 136373                     /lib/x86_64-linux-gnu/libpthread-2.19.so
7f343eda7000-7f343efa6000 ---p 00019000 08:01 136373                     /lib/x86_64-linux-gnu/libpthread-2.19.so
7f343efa6000-7f343efa7000 r--p 00018000 08:01 136373                     /lib/x86_64-linux-gnu/libpthread-2.19.so
7f343efa7000-7f343efa8000 rw-p 00019000 08:01 136373                     /lib/x86_64-linux-gnu/libpthread-2.19.so
7f343efa8000-7f343efac000 rw-p 00000000 00:00 0 
7f343efac000-7f343efcf000 r-xp 00000000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f343f000000-7f343f1b6000 rw-p 00000000 00:00 0 
7f343f1c5000-7f343f1cc000 r--s 00000000 08:01 918462                     /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
7f343f1cc000-7f343f1ce000 rw-p 00000000 00:00 0 
7f343f1ce000-7f343f1cf000 r--p 00022000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f343f1cf000-7f343f1d0000 rw-p 00023000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f343f1d0000-7f343f1d1000 rw-p 00000000 00:00 0 
7ffccf1fd000-7ffccf21e000 rw-p 00000000 00:00 0                          [stack]
7ffccf23c000-7ffccf23e000 r--p 00000000 00:00 0                          [vvar]
7ffccf23e000-7ffccf240000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
julien@holberton:/usr/include/python3.4$ 

-> Our object is stored in the following memory region: 7f343f000000-7f343f1b6000 rw-p 00000000 00:00 0, which is not the heap, and not the stack. That confirms what we saw earlier. But that doesn’t mean that the string itself is stored in the same memory region. For instance, the bytes object could store a pointer to the string, and not a copy of a string. Of course, at this point we could search for our string in this memory region, but we want to understand and be positive we’re looking in the right area, and not use “brute force” to find the solution. It’s time to learn more about bytes objects.

bytesobject.h

We are using the C implementation of Python (CPython), so let’s look at the header file for bytes objects.

Note: If you don’t have the Python 3 header files, you can use this command on Ubuntu: sudo apt-get install python3-dev to downloadd them on your system. If you are using the exact same environment as me (see the “Environment” section above), you should then be able to see the Python 3 header files in the /usr/include/python3.4/ directory.

From bytesobject.h:

typedef struct {
    PyObject_VAR_HEAD
    Py_hash_t ob_shash;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     */
} PyBytesObject;

What does that tell us?

  • A Python 3 bytes object is represented internally using a variable of type PyBytesObject
  • ob_sval holds the entire string
  • The string ends with 0
  • ob_size stores the length of the string (to find ob_size, look at the definition of the macro PyObject_VAR_HEAD in objects.h. We’ll look at it later)

So in our example, if we were able to print the bytes object, we should see this:

  • ob_sval: “Holberton” -> Bytes values: 48 6f 6c 62 65 72 74 6f 6e 00
  • ob_size: 9

Based on what we did learn previously, this means that the string is “inside” the bytes object. So inside the same memory region \o/

What if we didn’t know about the way id was implemented in CPython? There is actually another way that we can use to find where the string is: looking at the actual object in memory.

Looking at the bytes object in memory

If we want to look directly at the PyBytesObject variable, we will need to create a C function, and call this C function from Python. There are different ways to call a C function from Python. We will use the simplest one: using a dynamic library.

Creating our C function

So the idea is to create a C function that is called from Python with the object as a parameter, and then “explore” this object to get the exact address of the string (as well as other information about the object).

The function prototype should be: void print_python_bytes(PyObject *p);, where p is a pointer to our object (so p stores the address of our object in the virtual memory). It doesn’t have to return anything.

object.h

You probably have noticed that we don’t use a parameter of type PyBytesObject. To understand why, let’s have a look at the object.h header file and see what we can learn from it:

/* Object and type object interface */

/*
Objects are structures allocated on the heap.  Special rules apply to
the use of objects to ensure they are properly garbage-collected.
Objects are never allocated statically or on the stack; they must be
...
*/
  • “Objects are never allocated statically or on the stack” -> ok, now we know why it was not on the stack.
  • “Objects are structures allocated on the heap” -> wait… WAT? We searched for the string in the heap and it was NOT there… I’m confused! We’ll discuss this later, in another article 🙂

What else can we read in this file:

/*
...
Objects do not float around in memory; once allocated an object keeps
the same size and address.  Objects that must hold variable-size data
...
*/
  • “Objects do not float around in memory; once allocated an object keeps the same size and address”. Good to know. That means that if we modify the correct string, it will always be modfied, and the addresses will never change
  • “once allocated” -> allocation? but not using the heap? I’m confused! We’ll discuss this later in another article 🙂
/*
...
Objects are always accessed through pointers of the type 'PyObject *'.
The type 'PyObject' is a structure that only contains the reference count
and the type pointer.  The actual memory allocated for an object
contains other data that can only be accessed after casting the pointer
to a pointer to a longer structure type.  This longer type must start
with the reference count and type fields; the macro PyObject_HEAD should be
used for this (to accommodate for future changes).  The implementation
of a particular object type can cast the object pointer to the proper
type and back.
...
*/
  • “Objects are always accessed through pointers of the type ‘PyObject *'” -> that is why we have to have to take a pointer of type PyObject (vs PyBytesObject) as the parameter of our function
  • “The actual memory allocated for an object contains other data that can only be accessed after casting the pointer to a pointer to a longer structure type.” -> So we will have to cast our function parameter to PyBytesObject * in order to access all its information. This is possible because the beginning of the PyBytesObject starts with a PyVarObject which itself starts with a PyObject:
/* PyObject_VAR_HEAD defines the initial segment of all variable-size
 * container objects.  These end with a declaration of an array with 1
 * element, but enough space is malloc'ed so that the array actually
 * has room for ob_size elements.  Note that ob_size is an element count,
 * not necessarily a byte count.
 */
#define PyObject_VAR_HEAD      PyVarObject ob_base;
#define Py_INVALID_SIZE (Py_ssize_t)-1

/* Nothing is actually declared to be a PyObject, but every pointer to
 * a Python object can be cast to a PyObject*.  This is inheritance built
 * by hand.  Similarly every pointer to a variable-size Python object can,
 * in addition, be cast to PyVarObject*.
 */
typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;

-> Here is the ob_size that bytesobject.h was mentioning.

The C function

Based on everything we just learned, the C code is pretty straightforward (bytes.c):

#include "Python.h"

/**
 * print_python_bytes - prints info about a Python 3 bytes object
 * @p: a pointer to a Python 3 bytes object
 * 
 * Return: Nothing
 */
void print_python_bytes(PyObject *p)
{
     /* The pointer with the correct type.*/
     PyBytesObject *s;
     unsigned int i;

     printf("[.] bytes object info\n");
     /* casting the PyObject pointer to a PyBytesObject pointer */
     s = (PyBytesObject *)p;
     /* never trust anyone, check that this is actually
        a PyBytesObject object. */
     if (s && PyBytes_Check(s))
     {
          /* a pointer holds the memory address of the first byte
         of the data it points to */
          printf("  address of the object: %p\n", (void *)s);
          /* op_size is in the ob_base structure, of type PyVarObject. */
          printf("  size: %ld\n", s->ob_base.ob_size);
          /* ob_sval is the array of bytes, ending with the value 0:
         ob_sval[ob_size] == 0 */
          printf("  trying string: %s\n", s->ob_sval);
          printf("  address of the data: %p\n", (void *)(s->ob_sval));
          printf("  bytes:");
          /* printing each byte at a time, in case this is not
         a "string". bytes doesn't have to be strings.
         ob_sval contains space for 'ob_size+1' elements.
         ob_sval[ob_size] == 0. */
          for (i = 0; i < s->ob_base.ob_size + 1; i++)
          {
               printf(" %02x", s->ob_sval[i] & 0xff);
          }
          printf("\n");
     }
     /* if this is not a PyBytesObject print an error message */
     else
     {
          fprintf(stderr, "  [ERROR] Invalid Bytes Object\n");
     }
}

Calling the C function from the python script

Creating the dynamic library

As we said earlier, we will use the “dynamic library method” to call our C function from Python 3. So we just need to compile our C file with this command:

gcc -Wall -Wextra -pedantic -Werror -std=c99 -shared -Wl,-soname,libPython.so -o libPython.so -fPIC -I/usr/include/python3.4 bytes.c

Don’t forget to include the Python 3 header files directory: -I/usr/include/python3.4

Hopefully, this should have created a dynamic library called libPython.so.

Using the dynamic library from Python 3

In order to use our function we simply need to add these lines in the Python script:

import ctypes

lib = ctypes.CDLL('./libPython.so')
lib.print_python_bytes.argtypes = [ctypes.py_object]

and call our function this way:

lib.print_python_bytes(s)

The new Python script

Here is the complete source code of the new Python 3 script (main_bytes.py):

#!/usr/bin/env python3
'''
Prints:
- the address of the bytes object
- a b"string" (bytes object)
- information about the bytes object
And then:
- reads a char from stdin
- prints the same (or not :)) information again
'''

import sys
import ctypes

lib = ctypes.CDLL('./libPython.so')
lib.print_python_bytes.argtypes = [ctypes.py_object]

s = b"Holberton"
print(hex(id(s)))
print(s)
lib.print_python_bytes(s)

sys.stdin.read(1)

print(hex(id(s)))
print(s)
lib.print_python_bytes(s)

Let’s run it!

julien@holberton:~/holberton/w/hackthevm1$ ./main_bytes.py 
0x7f04d721b210
b'Holberton'
[.] bytes object info
  address of the object: 0x7f04d721b210
  size: 9
  trying string: Holberton
  address of the data: 0x7f04d721b230
  bytes: 48 6f 6c 62 65 72 74 6f 6e 00

As expected:

  • id() returns the address of the object itself (0x7f04d721b210)
  • the size of the data of our object (ob_size) is 9
  • the data of our object is “Holberton”, 48 6f 6c 62 65 72 74 6f 6e 00 (and it ends with 00 as specified on the header file bytesobject.h)

Annnnnnd, we have found the exact address of our string: 0x7f04d721b230 \o/

monty python

Sorry I had to add at least one Monty Python reference 🙂 (why)

rw_all.py

Now that we undertand a little bit more about what’s happening, it’s ok to “brute-force” the mapped memory regions. Let’s update the script that replaces the string. Instead of looking only in the stack or the heap, let’s look in all readable and writeable memory regions of the process. Here’s the source code:

#!/usr/bin/env python3
'''
Locates and replaces (if we have permission) all occurrences of
an ASCII string in the entire virtual memory of a process.

Usage: ./rw_all.py PID search_string replace_by_string
Where:
- PID is the pid of the target process
- search_string is the ASCII string you are looking to overwrite
- replace_by_string is the ASCII string you want to replace
search_string with
'''

import sys

def print_usage_and_exit():
    print('Usage: {} pid search write'.format(sys.argv[0]))
    exit(1)

# check usage
if len(sys.argv) != 4:
    print_usage_and_exit()

# get the pid from args
pid = int(sys.argv[1])
if pid <= 0:
    print_usage_and_exit()
search_string = str(sys.argv[2])
if search_string  == "":
    print_usage_and_exit()
write_string = str(sys.argv[3])
if search_string  == "":
    print_usage_and_exit()

# open the maps and mem files of the process
maps_filename = "/proc/{}/maps".format(pid)
print("[*] maps: {}".format(maps_filename))
mem_filename = "/proc/{}/mem".format(pid)
print("[*] mem: {}".format(mem_filename))

# try opening the file
try:
    maps_file = open('/proc/{}/maps'.format(pid), 'r')
except IOError as e:
    print("[ERROR] Can not open file {}:".format(maps_filename))
    print("        I/O error({}): {}".format(e.errno, e.strerror))
    exit(1)

for line in maps_file:
    # print the name of the memory region
    sline = line.split(' ')
    name = sline[-1][:-1];
    print("[*] Searching in {}:".format(name))

    # parse line
    addr = sline[0]
    perm = sline[1]
    offset = sline[2]
    device = sline[3]
    inode = sline[4]
    pathname = sline[-1][:-1]

    # check if there are read and write permissions
    if perm[0] != 'r' or perm[1] != 'w':
        print("\t[\x1B[31m!\x1B[m] {} does not have read/write permissions ({})".format(pathname, perm))
        continue

    print("\tpathname = {}".format(pathname))
    print("\taddresses = {}".format(addr))
    print("\tpermisions = {}".format(perm))
    print("\toffset = {}".format(offset))
    print("\tinode = {}".format(inode))

    # get start and end of the memoy region
    addr = addr.split("-")
    if len(addr) != 2: # never trust anyone
        print("[*] Wrong addr format")
        maps_file.close()
        exit(1)
    addr_start = int(addr[0], 16)
    addr_end = int(addr[1], 16)
    print("\tAddr start [{:x}] | end [{:x}]".format(addr_start, addr_end))

    # open and read the memory region
    try:
        mem_file = open(mem_filename, 'rb+')
    except IOError as e:
        print("[ERROR] Can not open file {}:".format(mem_filename))
        print("        I/O error({}): {}".format(e.errno, e.strerror))
        maps_file.close()

    # read the memory region
    mem_file.seek(addr_start)
    region = mem_file.read(addr_end - addr_start)

    # find string
    nb_found = 0;
    try:
        i = region.index(bytes(search_string, "ASCII"))
        while (i):
            print("\t[\x1B[32m:)\x1B[m] Found '{}' at {:x}".format(search_string, i))
            nb_found = nb_found + 1
            # write the new string
        print("\t[:)] Writing '{}' at {:x}".format(write_string, addr_start + i))
            mem_file.seek(addr_start + i)
            mem_file.write(bytes(write_string, "ASCII"))
            mem_file.flush()

            # update our buffer
        region.write(bytes(write_string, "ASCII"), i)

            i = region.index(bytes(search_string, "ASCII"))
    except Exception:
        if nb_found == 0:
            print("\t[\x1B[31m:(\x1B[m] Can't find '{}'".format(search_string))
    mem_file.close()

# close files
maps_file.close()

Let’s run it!

julien@holberton:~/holberton/w/hackthevm1$ ./main_bytes.py 
0x7f37f1e01210
b'Holberton'
[.] bytes object info
  address of the object: 0x7f37f1e01210
  size: 9
  trying string: Holberton
  address of the data: 0x7f37f1e01230
  bytes: 48 6f 6c 62 65 72 74 6f 6e 00

julien@holberton:~/holberton/w/hackthevm1$ ps aux | grep main_bytes.py | grep -v grep
julien     4713  0.0  0.8  37720  8208 pts/0    S+   18:48   0:00 python3 ./main_bytes.py
julien@holberton:~/holberton/w/hackthevm1$ sudo ./rw_all.py 4713 Holberton "~ Betty ~"
[*] maps: /proc/4713/maps
[*] mem: /proc/4713/mem
[*] Searching in /usr/bin/python3.4:
    [!] /usr/bin/python3.4 does not have read/write permissions (r-xp)
...
[*] Searching in [heap]:
    pathname = [heap]
    addresses = 00e26000-00f11000
    permisions = rw-p
    offset = 00000000
    inode = 0
    Addr start [e26000] | end [f11000]
    [:)] Found 'Holberton' at 8e422
    [:)] Writing '~ Betty ~' at eb4422
...
[*] Searching in :
    pathname = 
    addresses = 7f37f1df1000-7f37f1fa7000
    permisions = rw-p
    offset = 00000000
    inode = 0
    Addr start [7f37f1df1000] | end [7f37f1fa7000]
    [:)] Found 'Holberton' at 10230
    [:)] Writing '~ Betty ~' at 7f37f1e01230
...
[*] Searching in [stack]:
    pathname = [stack]
    addresses = 7ffdc3d0c000-7ffdc3d2d000
    permisions = rw-p
    offset = 00000000
    inode = 0
    Addr start [7ffdc3d0c000] | end [7ffdc3d2d000]
    [:(] Can't find 'Holberton'
...
julien@holberton:~/holberton/w/hackthevm1$ 

And if we hit enter in the running main_bytes.py

julien@holberton:~/holberton/w/hackthevm1$ ./main_bytes.py 
0x7f37f1e01210
b'Holberton'
[.] bytes object info
  address of the object: 0x7f37f1e01210
  size: 9
  trying string: Holberton
  address of the data: 0x7f37f1e01230
  bytes: 48 6f 6c 62 65 72 74 6f 6e 00

0x7f37f1e01210
b'~ Betty ~'
[.] bytes object info
  address of the object: 0x7f37f1e01210
  size: 9
  trying string: ~ Betty ~
  address of the data: 0x7f37f1e01230
  bytes: 7e 20 42 65 74 74 79 20 7e 00
julien@holberton:~/holberton/w/hackthevm1$ 

BOOM!

yeah!

Outro

We managed to modify the string used by our Python 3 script. Awesome! But we still have questions to answer:

  • What is the “Holberton” string that is in the [heap] memory region?
  • How does Python 3 allocate memory outside of the heap?
  • If Python 3 is not using the heap, what does it refer to when it says “Objects are structures allocated on the heap” in object.h?

That will be for another time 🙂

In the meantime, if you are too curious to wait for the next article, you can try to find out yourself.

Questions? Feedback?

If you have questions or feedback don’t hesitate to ping us on Twitter at @holbertonschool or @julienbarbier42.
Haters, please send your comments to /dev/null.

Happy Hacking!

Thank you for reading!

As always, no-one is perfect (except Chuck of course), so don’t hesitate to contribute or send me your comments.

Files

This repo contains the source code for all scripts and dynamic libraries created in this tutorial:

  • main.py: the first target
  • main_id.py: the second target, printing the id of the bytes object
  • main_bytes.py: the final target, printing also the information about the bytes object, using our dynamic library
  • read_write_heap.py: the “original” script to find and replace strings in the heap of a process
  • read_write_stack.py: same but, searches and replaces in the stack instead of the heap
  • rw_all.py: same but in every memory regions that are readable and writable
  • bytes.c: the C function to print info about a Python 3 bytes object

Read on!

Many thanks to Tim for English proof-reading & Guillaume for PEP8 proof-reading 🙂

Hack The Virtual Memory: C strings & /proc

hack the vm!

Intro

Hack The Virtual Memory, Chapter 0: Play with C strings & /proc

This is the first in a series of small articles / tutorials based around virtual memory. The goal is to learn some CS basics, but in a different and more practical way.

For this first piece, we’ll use /proc to find and modify variables (in this example, an ASCII string) contained inside the virtual memory of a running process, and learn some cool things along the way.

Environment

All scripts and programs have been tested on the following system:

  • Ubuntu 14.04 LTS
    • Linux ubuntu 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • gcc
    • gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
  • Python 3:
    • Python 3.4.3 (default, Nov 17 2016, 01:08:31)
    • [GCC 4.8.4] on linux

Prerequisites

In order to fully understand this article, you need to know:

  • The basics of the C programming language
  • Some Python
  • The very basics of the Linux filesystem and the shell

Virtual Memory

In computing, virtual memory is a memory management technique that is implemented using both hardware and software. It maps memory addresses used by a program, called virtual addresses, into physical addresses in computer memory. Main storage (as seen by a process or task) appears as a contiguous address space, or collection of contiguous segments. The operating system manages virtual address spaces and the assignment of real memory to virtual memory. Address translation hardware in the CPU, often referred to as a memory management unit or MMU, automatically translates virtual addresses to physical addresses. Software within the operating system may extend these capabilities to provide a virtual address space that can exceed the capacity of real memory and thus reference more memory than is physically present in the computer.

The primary benefits of virtual memory include freeing applications from having to manage a shared memory space, increased security due to memory isolation, and being able to conceptually use more memory than might be physically available, using the technique of paging.

You can read more about the virtual memory on Wikipedia.

In chapter 2, we’ll go into more details and do some fact checking on what lies inside the virtual memory and where. For now, here are some key points you should know before you read on:

  • Each process has its own virtual memory
  • The amount of virtual memory depends on your system’s architecture
  • Each OS handles virtual memory differently, but for most modern operating systems, the virtual memory of a process looks like this:

virtual memory

In the high memory addresses you can find (this is a non exhaustive list, there’s much more to be found, but that’s not today’s topic):

  • The command line arguments and environment variables
  • The stack, growing “downwards”. This may seem counter-intuitive, but this is the way the stack is implemented in virtual memory

In the low memory addresses you can find:

  • Your executable (it’s a little more complicated than that, but this is enough to understand the rest of this article)
  • The heap, growing “upwards”

The heap is a portion of memory that is dynamically allocated (i.e. containing memory allocated using malloc).

Also, keep in mind that virtual memory is not the same as RAM.

C program

Let’s start with this simple C program:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**
 * main - uses strdup to create a new string, and prints the
 * address of the new duplcated string
 *
 * Return: EXIT_FAILURE if malloc failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    char *s;

    s = strdup("Holberton");
    if (s == NULL)
    {
        fprintf(stderr, "Can't allocate mem with malloc\n");
        return (EXIT_FAILURE);
    }
    printf("%p\n", (void *)s);
    return (EXIT_SUCCESS);
}

strdup

Take a moment to think before going further. How do you think strdup creates a copy of the string “Holberton”? How can you confirm that?

.

.

.

strdup has to create a new string, so it first has to reserve space for it. The function strdup is probably using malloc. A quick look at its man page can confirm:

DESCRIPTION
       The  strdup()  function returns a pointer to a new string which is a duplicate of the string s.
       Memory for the new string is obtained with malloc(3), and can be freed with free(3).

Take a moment to think before going further. Based on what we said earlier about virtual memory, where do you think the duplicate string will be located? At a high or low memory address?

.

.

.

Probably in the lower addresses (in the heap). Let’s compile and run our small C program to test our hypothesis:

julien@holberton:~/holberton/w/hackthevm0$ gcc -Wall -Wextra -pedantic -Werror main.c -o holberton
julien@holberton:~/holberton/w/hackthevm0$ ./holberton 
0x1822010
julien@holberton:~/holberton/w/hackthevm0$ 

Our duplicated string is located at the address 0x1822010. Great. But is this a low or a high memory address?

How big is the virtual memory of a process

The size of the virtual memory of a process depends on your system architecture. In this example I am using a 64-bit machine, so theoretically the size of each process’ virtual memory is 2^64 bytes. In theory, the highest memory address possible is 0xffffffffffffffff (1.8446744e+19), and the lowest is 0x0.

0x1822010 is small compared to 0xffffffffffffffff, so the duplicated string is probably located at a lower memory address. We will be able to confirm this when we will be looking at the proc filesystem).

The proc filesystem

From man proc:

The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures.  It is commonly mounted at `/proc`.  Most of it is read-only, but some files allow kernel variables to be changed.

If you list the contents of your /proc directory, you will probably see a lot of files. We will focus on two of them:

  • /proc/[pid]/mem
  • /proc/[pid]/maps

mem

From man proc:

      /proc/[pid]/mem
              This file can be used to access the pages of a process's memory
          through open(2), read(2), and lseek(2).

Awesome! So, can we access and modify the entire virtual memory of any process?

maps

From man proc:

      /proc/[pid]/maps
              A  file containing the currently mapped memory regions and their access permissions.
          See mmap(2) for some further information about memory mappings.

              The format of the file is:

       address           perms offset  dev   inode       pathname
       00400000-00452000 r-xp 00000000 08:02 173521      /usr/bin/dbus-daemon
       00651000-00652000 r--p 00051000 08:02 173521      /usr/bin/dbus-daemon
       00652000-00655000 rw-p 00052000 08:02 173521      /usr/bin/dbus-daemon
       00e03000-00e24000 rw-p 00000000 00:00 0           [heap]
       00e24000-011f7000 rw-p 00000000 00:00 0           [heap]
       ...
       35b1800000-35b1820000 r-xp 00000000 08:02 135522  /usr/lib64/ld-2.15.so
       35b1a1f000-35b1a20000 r--p 0001f000 08:02 135522  /usr/lib64/ld-2.15.so
       35b1a20000-35b1a21000 rw-p 00020000 08:02 135522  /usr/lib64/ld-2.15.so
       35b1a21000-35b1a22000 rw-p 00000000 00:00 0
       35b1c00000-35b1dac000 r-xp 00000000 08:02 135870  /usr/lib64/libc-2.15.so
       35b1dac000-35b1fac000 ---p 001ac000 08:02 135870  /usr/lib64/libc-2.15.so
       35b1fac000-35b1fb0000 r--p 001ac000 08:02 135870  /usr/lib64/libc-2.15.so
       35b1fb0000-35b1fb2000 rw-p 001b0000 08:02 135870  /usr/lib64/libc-2.15.so
       ...
       f2c6ff8c000-7f2c7078c000 rw-p 00000000 00:00 0    [stack:986]
       ...
       7fffb2c0d000-7fffb2c2e000 rw-p 00000000 00:00 0   [stack]
       7fffb2d48000-7fffb2d49000 r-xp 00000000 00:00 0   [vdso]

              The address field is the address space in the process that the mapping occupies.
          The perms field is a set of permissions:

                   r = read
                   w = write
                   x = execute
                   s = shared
                   p = private (copy on write)

              The offset field is the offset into the file/whatever;
          dev is the device (major:minor); inode is the inode on that device.   0  indicates
              that no inode is associated with the memory region,
          as would be the case with BSS (uninitialized data).

              The  pathname field will usually be the file that is backing the mapping.
          For ELF files, you can easily coordinate with the offset field
              by looking at the Offset field in the ELF program headers (readelf -l).

              There are additional helpful pseudo-paths:

                   [stack]
                          The initial process's (also known as the main thread's) stack.

                   [stack:<tid>] (since Linux 3.4)
                          A thread's stack (where the <tid> is a thread ID).
              It corresponds to the /proc/[pid]/task/[tid]/ path.

                   [vdso] The virtual dynamically linked shared object.

                   [heap] The process's heap.

              If the pathname field is blank, this is an anonymous mapping as obtained via the mmap(2) function.
          There is no easy  way  to  coordinate
              this back to a process's source, short of running it through gdb(1), strace(1), or similar.

              Under Linux 2.0 there is no field giving pathname.

This means that we can look at the /proc/[pid]/mem file to locate the heap of a running process. If we can read from the heap, we can locate the string we want to modify. And if we can write to the heap, we can replace this string with whatever we want.

pid

A process is an instance of a program, with a unique process ID. This process ID (PID) is used by many functions and system calls to interact with and manipulate processes.

We can use the program ps to get the PID of a running process (man ps).

C program

We now have everything we need to write a script or program that finds a string in the heap of a running process and then replaces it with another string (of the same length or shorter). We will work with the following simple program that infinitely loops and prints a “strduplicated” string.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/**              
 * main - uses strdup to create a new string, loops forever-ever
 *                
 * Return: EXIT_FAILURE if malloc failed. Other never returns
 */
int main(void)
{
     char *s;
     unsigned long int i;

     s = strdup("Holberton");
     if (s == NULL)
     {
          fprintf(stderr, "Can't allocate mem with malloc\n");
          return (EXIT_FAILURE);
     }
     i = 0;
     while (s)
     {
          printf("[%lu] %s (%p)\n", i, s, (void *)s);
          sleep(1);
          i++;
     }
     return (EXIT_SUCCESS);
}

Compiling and running the above source code should give you this output, and loop indefinitely until you kill the process.

julien@holberton:~/holberton/w/hackthevm0$ gcc -Wall -Wextra -pedantic -Werror loop.c -o loop
julien@holberton:~/holberton/w/hackthevm0$ ./loop 
[0] Holberton (0xfbd010)
[1] Holberton (0xfbd010)
[2] Holberton (0xfbd010)
[3] Holberton (0xfbd010)
[4] Holberton (0xfbd010)
[5] Holberton (0xfbd010)
[6] Holberton (0xfbd010)
[7] Holberton (0xfbd010)
...

If you would like, pause the reading now and try to write a script or program that finds a string in the heap of a running process before reading further.

.

.

.

looking at /proc

Let’s run our loop program.

julien@holberton:~/holberton/w/hackthevm0$ ./loop 
[0] Holberton (0x10ff010)
[1] Holberton (0x10ff010)
[2] Holberton (0x10ff010)
[3] Holberton (0x10ff010)
...

The first thing we need to find is the PID of the process.

julien@holberton:~/holberton/w/hackthevm0$ ps aux | grep ./loop | grep -v grep
julien     4618  0.0  0.0   4332   732 pts/14   S+   17:06   0:00 ./loop

In the above example, the PID is 4618 (it will be different each time we run it, and it is probably a different number if you are trying this on your own computer). As a result, the maps and mem files we want to look at are located in the /proc/4618 directory:

  • /proc/4618/maps
  • /proc/4618/mem

A quick ls -la in the directory should give you something like this:

julien@ubuntu:/proc/4618$ ls -la
total 0
dr-xr-xr-x   9 julien julien 0 Mar 15 17:07 .
dr-xr-xr-x 257 root   root   0 Mar 15 10:20 ..
dr-xr-xr-x   2 julien julien 0 Mar 15 17:11 attr
-rw-r--r--   1 julien julien 0 Mar 15 17:11 autogroup
-r--------   1 julien julien 0 Mar 15 17:11 auxv
-r--r--r--   1 julien julien 0 Mar 15 17:11 cgroup
--w-------   1 julien julien 0 Mar 15 17:11 clear_refs
-r--r--r--   1 julien julien 0 Mar 15 17:07 cmdline
-rw-r--r--   1 julien julien 0 Mar 15 17:11 comm
-rw-r--r--   1 julien julien 0 Mar 15 17:11 coredump_filter
-r--r--r--   1 julien julien 0 Mar 15 17:11 cpuset
lrwxrwxrwx   1 julien julien 0 Mar 15 17:11 cwd -> /home/julien/holberton/w/funwthevm
-r--------   1 julien julien 0 Mar 15 17:11 environ
lrwxrwxrwx   1 julien julien 0 Mar 15 17:11 exe -> /home/julien/holberton/w/funwthevm/loop
dr-x------   2 julien julien 0 Mar 15 17:07 fd
dr-x------   2 julien julien 0 Mar 15 17:11 fdinfo
-rw-r--r--   1 julien julien 0 Mar 15 17:11 gid_map
-r--------   1 julien julien 0 Mar 15 17:11 io
-r--r--r--   1 julien julien 0 Mar 15 17:11 limits
-rw-r--r--   1 julien julien 0 Mar 15 17:11 loginuid
dr-x------   2 julien julien 0 Mar 15 17:11 map_files
-r--r--r--   1 julien julien 0 Mar 15 17:11 maps
-rw-------   1 julien julien 0 Mar 15 17:11 mem
-r--r--r--   1 julien julien 0 Mar 15 17:11 mountinfo
-r--r--r--   1 julien julien 0 Mar 15 17:11 mounts
-r--------   1 julien julien 0 Mar 15 17:11 mountstats
dr-xr-xr-x   5 julien julien 0 Mar 15 17:11 net
dr-x--x--x   2 julien julien 0 Mar 15 17:11 ns
-r--r--r--   1 julien julien 0 Mar 15 17:11 numa_maps
-rw-r--r--   1 julien julien 0 Mar 15 17:11 oom_adj
-r--r--r--   1 julien julien 0 Mar 15 17:11 oom_score
-rw-r--r--   1 julien julien 0 Mar 15 17:11 oom_score_adj
-r--------   1 julien julien 0 Mar 15 17:11 pagemap
-r--------   1 julien julien 0 Mar 15 17:11 personality
-rw-r--r--   1 julien julien 0 Mar 15 17:11 projid_map
lrwxrwxrwx   1 julien julien 0 Mar 15 17:11 root -> /
-rw-r--r--   1 julien julien 0 Mar 15 17:11 sched
-r--r--r--   1 julien julien 0 Mar 15 17:11 schedstat
-r--r--r--   1 julien julien 0 Mar 15 17:11 sessionid
-rw-r--r--   1 julien julien 0 Mar 15 17:11 setgroups
-r--r--r--   1 julien julien 0 Mar 15 17:11 smaps
-r--------   1 julien julien 0 Mar 15 17:11 stack
-r--r--r--   1 julien julien 0 Mar 15 17:07 stat
-r--r--r--   1 julien julien 0 Mar 15 17:11 statm
-r--r--r--   1 julien julien 0 Mar 15 17:07 status
-r--------   1 julien julien 0 Mar 15 17:11 syscall
dr-xr-xr-x   3 julien julien 0 Mar 15 17:11 task
-r--r--r--   1 julien julien 0 Mar 15 17:11 timers
-rw-r--r--   1 julien julien 0 Mar 15 17:11 uid_map
-r--r--r--   1 julien julien 0 Mar 15 17:11 wchan

/proc/pid/maps

As we have seen earlier, the /proc/pid/maps file is a text file, so we can directly read it. The content of the maps file of our process looks like this:

julien@ubuntu:/proc/4618$ cat maps
00400000-00401000 r-xp 00000000 08:01 1070052                            /home/julien/holberton/w/funwthevm/loop
00600000-00601000 r--p 00000000 08:01 1070052                            /home/julien/holberton/w/funwthevm/loop
00601000-00602000 rw-p 00001000 08:01 1070052                            /home/julien/holberton/w/funwthevm/loop
010ff000-01120000 rw-p 00000000 00:00 0                                  [heap]
7f144c052000-7f144c20c000 r-xp 00000000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f144c20c000-7f144c40c000 ---p 001ba000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f144c40c000-7f144c410000 r--p 001ba000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f144c410000-7f144c412000 rw-p 001be000 08:01 136253                     /lib/x86_64-linux-gnu/libc-2.19.so
7f144c412000-7f144c417000 rw-p 00000000 00:00 0 
7f144c417000-7f144c43a000 r-xp 00000000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f144c61e000-7f144c621000 rw-p 00000000 00:00 0 
7f144c636000-7f144c639000 rw-p 00000000 00:00 0 
7f144c639000-7f144c63a000 r--p 00022000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f144c63a000-7f144c63b000 rw-p 00023000 08:01 136229                     /lib/x86_64-linux-gnu/ld-2.19.so
7f144c63b000-7f144c63c000 rw-p 00000000 00:00 0 
7ffc94272000-7ffc94293000 rw-p 00000000 00:00 0                          [stack]
7ffc9435e000-7ffc94360000 r--p 00000000 00:00 0                          [vvar]
7ffc94360000-7ffc94362000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Circling back to what we said earlier, we can see that the stack ([stack]) is located in high memory addresses and the heap ([heap]) in the lower memory addresses.

[heap]

Using the maps file, we can find all the information we need to locate our string:

010ff000-01120000 rw-p 00000000 00:00 0                                  [heap]

The heap:

  • Starts at address 0x010ff000 in the virtual memory of the process
  • Ends at memory address: 0x01120000
  • Is readable and writable (rw)

A quick look back to our (still running) loop program:

...
[1024] Holberton (0x10ff010)
...

-> 0x010ff000 < 0x10ff010 < 0x01120000. This confirms that our string is located in the heap. More precisely, it is located at index 0x10 of the heap. If we open the /proc/pid/mem/ file (in this example /proc/4618/mem) and seek to the memory address 0x10ff010, we can write to the heap of the running process, overwriting the “Holberton” string!

Let’s write a script or program that does just that. Choose your favorite language and let’s do it!

If you would like, stop reading now and try to write a script or program that finds a string in the heap of a running process, before reading further. The next paragraph will give away the source code of the answer!

.

.

.

Overwriting the string in the virtual memory

We’ll be using Python 3 for writing the script, but you could write this in any language. Here is the code:

#!/usr/bin/env python3
'''             
Locates and replaces the first occurrence of a string in the heap
of a process    

Usage: ./read_write_heap.py PID search_string replace_by_string
Where:           
- PID is the pid of the target process
- search_string is the ASCII string you are looking to overwrite
- replace_by_string is the ASCII string you want to replace
  search_string with
'''

import sys

def print_usage_and_exit():
    print('Usage: {} pid search write'.format(sys.argv[0]))
    sys.exit(1)

# check usage  
if len(sys.argv) != 4:
    print_usage_and_exit()

# get the pid from args
pid = int(sys.argv[1])
if pid <= 0:
    print_usage_and_exit()
search_string = str(sys.argv[2])
if search_string  == "":
    print_usage_and_exit()
write_string = str(sys.argv[3])
if search_string  == "":
    print_usage_and_exit()

# open the maps and mem files of the process
maps_filename = "/proc/{}/maps".format(pid)
print("[*] maps: {}".format(maps_filename))
mem_filename = "/proc/{}/mem".format(pid)
print("[*] mem: {}".format(mem_filename))

# try opening the maps file
try:
    maps_file = open('/proc/{}/maps'.format(pid), 'r')
except IOError as e:
    print("[ERROR] Can not open file {}:".format(maps_filename))
    print("        I/O error({}): {}".format(e.errno, e.strerror))
    sys.exit(1)

for line in maps_file:
    sline = line.split(' ')
    # check if we found the heap
    if sline[-1][:-1] != "[heap]":
        continue
    print("[*] Found [heap]:")

    # parse line
    addr = sline[0]
    perm = sline[1]
    offset = sline[2]
    device = sline[3]
    inode = sline[4]
    pathname = sline[-1][:-1]
    print("\tpathname = {}".format(pathname))
    print("\taddresses = {}".format(addr))
    print("\tpermisions = {}".format(perm))
    print("\toffset = {}".format(offset))
    print("\tinode = {}".format(inode))

    # check if there is read and write permission
    if perm[0] != 'r' or perm[1] != 'w':
        print("[*] {} does not have read/write permission".format(pathname))
        maps_file.close()
        exit(0)

    # get start and end of the heap in the virtual memory
    addr = addr.split("-")
    if len(addr) != 2: # never trust anyone, not even your OS :)
        print("[*] Wrong addr format")
        maps_file.close()
        exit(1)
    addr_start = int(addr[0], 16)
    addr_end = int(addr[1], 16)
    print("\tAddr start [{:x}] | end [{:x}]".format(addr_start, addr_end))

    # open and read mem
    try:
        mem_file = open(mem_filename, 'rb+')
    except IOError as e:
        print("[ERROR] Can not open file {}:".format(mem_filename))
        print("        I/O error({}): {}".format(e.errno, e.strerror))
        maps_file.close()
        exit(1)

    # read heap  
    mem_file.seek(addr_start)
    heap = mem_file.read(addr_end - addr_start)

    # find string
    try:
        i = heap.index(bytes(search_string, "ASCII"))
    except Exception:
        print("Can't find '{}'".format(search_string))
        maps_file.close()
        mem_file.close()
        exit(0)
    print("[*] Found '{}' at {:x}".format(search_string, i))

    # write the new string
    print("[*] Writing '{}' at {:x}".format(write_string, addr_start + i))
    mem_file.seek(addr_start + i)
    mem_file.write(bytes(write_string, "ASCII"))

    # close files
    maps_file.close()
    mem_file.close()

    # there is only one heap in our example
    break

Note: You will need to run this script as root, otherwise you won’t be able to read or write to the /proc/pid/mem file, even if you are the owner of the process.

Running the script

julien@holberton:~/holberton/w/hackthevm0$ sudo ./read_write_heap.py 4618 Holberton "Fun w vm!"
[*] maps: /proc/4618/maps
[*] mem: /proc/4618/mem
[*] Found [heap]:
    pathname = [heap]
    addresses = 010ff000-01120000
    permisions = rw-p
    offset = 00000000
    inode = 0
    Addr start [10ff000] | end [1120000]
[*] Found 'Holberton' at 10
[*] Writing 'Fun w vm!' at 10ff010
julien@holberton:~/holberton/w/hackthevm0$ 

Note that this address corresponds to the one we found manually:

  • The heap lies from addresses 0x010ff000 to 0x01120000 in the virtual memory of the running process
  • Our string is at index 0x10 in the heap, so at the memory address 0x10ff010

If we go back to our loop program, it should now print “fun w vm!”

...
[2676] Holberton (0x10ff010)
[2677] Holberton (0x10ff010)
[2678] Holberton (0x10ff010)
[2679] Holberton (0x10ff010)
[2680] Holberton (0x10ff010)
[2681] Holberton (0x10ff010)
[2682] Fun w vm! (0x10ff010)
[2683] Fun w vm! (0x10ff010)
[2684] Fun w vm! (0x10ff010)
[2685] Fun w vm! (0x10ff010)
...

hack the virtual memory of a process: mind blowing    !

Outro

Questions? Feedback?

If you have questions or feedback don’t hesitate to ping us on Twitter at @holbertonschool or @julienbarbier42.
Haters, please send your comments to /dev/null.

Happy Hacking!

Thank you for reading!

As always, no-one is perfect (except Chuck of course), so don’t hesitate to contribute or send me your comments.

Files

This repo contains the source code for all programs shown in this tutorial:

  • main.c: the first C program that prints the location of the string and exits
  • loop.c: the second C program that loops indefinitely
  • read_write_heap.py: the script used to modify the string in the running C program

What’s next?

In the next chapter we will do almost the same thing, but instead we’ll access the memory of a running Python 3 script. It won’t be that straightfoward. We’ll take this as an excuse to look at some Python 3 internals. If you are curious, try to do it yourself, and find out why the above read_write_heap.py script won’t work to modify a Python 3 ASCII string.

See you next time and Happy Hacking!

Read on!

Many thanks to Kristine, Tim for English proof-reading & Guillaume for PEP8 proof-reading 🙂

Welcome, Hippokampoiers!

Today, we are very proud to announce that we have selected the 33 students for the inaugural class of January 2016. Yes, we announced that we would select only 32 students, but there were just too many great applicants. Please join us in welcoming team Hippokampoi!

hippocampoiers

The selection process

To become students at Holberton, candidates had to go through our four-step selection process, based only on talent and motivation, and not on the basis of educational degree, or programming experience. We designed the selection process to be the beginning of the curriculum so that applicants start learning through it. No programming experience is required at all. It consists of four different levels:

  • Level 0 – Fill out a short online form about yourself (~2 to 10 minutes)
  • Level 1 – Small online projects and tests that applicants can do at their own pace (~2 to 10 hours of work)
  • Level 2 – A step by step challenge during which you will create your first website, with a specific deadline (~50 hours of work)
  • Level 3 – On-site or Skype interview

zee_quote

Level 2 is a hands-on project, designed for beginners, during which applicants create their first website, from A to Z. Not only do they build a website with HTML, CSS and JavaScript but they also install and run their web server (Apache) on a Linux machine (Ubuntu) using the Linux command line. Here is an overview of what applicants did during Level 2:

  • Access a distant server using ssh
  • Learn the very basics of the Linux command line
  • Install software on Linux
  • Use the Emacs text editor
  • Install a web server on Linux
  • Read a configuration file
  • Use HTML, CSS and javascript to build a website
  • And most importantly: search for information and help each other

From 1,338 applicants to 33 students

Here are some stats about our funnel. Since the announcement on Sept 29th, 1,338 applicants started the admission process. Overall, the 33 students of the January class represent the top 2.5% of all applicants. Congratulations to them!

admissions-stats

 

Women performed better than men

Women performed better than men on average during our selection process. At the very top of the funnel, women represented 20% of our visitors.

sessions-male-female

And at the end of the funnel, 40% of our students are women. Note that Level 0, Level 1 and most of Level 2 are completely automated. It’s only at the end of Level 2 that we are hands-on because we need to check the projects of the applicants. Level 3 is an interview, during which there is also a technical test on Linux. Here is our funnel in terms of gender percentage:

women-stats

Connect with Holberton students

In the past weeks, they have already set up their servers and installed different applications, including a blog where you will be able to read more about their lives and what they learned at the school. Also, you can follow them on Twitter and say congrats by tweeting at @hippokampoiers!

Interested in becoming a full-stack software engineer?

Applications are now open for the next two classes, starting in May and September 2016.

U2VlIHlvdSBhbGwgaW4gSmFuIDIy
bmQgZm9yIHRoZSBmaXJzdCBkYX
kgYXQgSG9sYmVydG9uIFNjaG9vbCE=

Meetup: Don’t code the weakest link

TL;DR;

Title:  Don’t code the weakest link, introduction to security
Speaker: Nicolas Bacca
When: Friday, January 29, 2016 – 9:00 AM to 12:00 PM
RSVP here.

More info

The security of a system is equivalent to the security of its weakest link – to build a secure system you need to understand the general interactions between its building blocks.

* What connected systems look like in 2016
* From computer hacker movies to real world security exploits – the future is now
* Network security basics
* Understanding attacks and threats (from malware to code injection to sophisticated phishing)
* Gentle introduction to cryptography
* Death of the password

Exercises / Hands-on

* Find my PIN (Level: beginner)
* Decrypt me (Level: beginner)
* Decrypt me 2 (Level: medium)
* The lost machine (Level: advanced)

Bring your laptop and an ID.

Important: We will check IDs at the entrance. You will not be able to enter the school if you are not on the list.

RSVP here.

About Nicolas Bacca

 

600_445887227

Nicolas is the co-founder and CTO of  Ledger, a startup designing Bitcoin hardware wallets and other auditable, open and secure Personal Security Devices. He has been involved in the embedded security industry for more than 15 years as R&D engineer, and independent consultant for major industry players; CEO and CTO of several startups. He built the most cost efficient FIDO U2F Security Key implementation with his team, is passionate about optimized code and well implemented security protocols, tries to push for more open standards at every possible opportunity. Also, he is not afraid of nudging closed objects into being more open when diplomacy fails.

Announcing 14 new mentors, including IBM Fellow & CTO and the Father of Symfony PHP

Today we are proud to announce 14 new Holberton School mentors, including Jerry Cuomo, Fellow and CTO at IBM, and Fabien Potencier, founder of Symfony and currently CEO of SensioLabs and Blackfire.io. They join our already impressive list of 81 mentors.

Jerry_Cuomo_and_Fab

Mentors are the backbone of our school, giving students a great way to network, gain insight into the “real world” and get some great projects. They create many of the projects for our students, interact with them as well as with other mentors to make the school reflect the creativity, innovation and working demands of building great software in the real world.

We feel very proud and honored to have such amazing people joining us today.

Please join us to welcome them to the Holberton School family!

Thank you!

Interested in becoming a mentor at Holberton School? Join us!

Level Up!

Today, we are very excited to open Level 2, the last technical Level of the Holberton School application process. If you have completed Level 1, you should receive an email from us very soon (expect an email before Nov 1st).

level2-blogpost

During Level 2, you will build your first website, from A to Z. That means that not only will you build a website with HTML, CSS and JavaScript but you will also install and run your web server (Apache) on your own Linux machine (Ubuntu). Note that most web servers out there are also running Linux as an operating system and Apache as a web server software; so you’ll be already using what industry professionals are using!

Here is an overview of what you will do and learn during Level 2:

  • Access a distant server using ssh
  • Learn the very basics of the Linux command line
  • Install a software on Linux
  • Use the Emacs text editor
  • Install a web server on Linux
  • Read a configuration file
  • Use HTML, CSS and javascript to build a website
  • And most importantly: search for information and help each other

R29vZCBsdWNrIHRvIGV2ZXJ5b25lLCB3ZSBob3BlIHRv
IHNlZSB5b3Ugb24gTGV2ZWwgMyB2ZXJ5IHNvb24h

Many thanks to our reviewers and beta testers who spent countless hours with us, improving Levels 1 & 2: Ashlynn Polin, Lamia Himdach, Vanessa Rigot, Katya KalacheSophie Rigault Barbier.