yakubin’s notes

Race conditions in procfs

OpenBSD deleted procfs in the 5.7 release (2015), citing race conditions as one of the main issues. FreeBSD deprecated it in 2011 and doesn’t mount it by default. Instead, users are expected to rely on tools like the procstat command-line tool and the libprocstat C library, which underneath use the sysctl(3) interface to communicate with the kernel. In contrast, Linux supports procfs with no deprecation in sight, and lots of basic things on Linux can only be done through procfs: tools like ps, pmap, pgrep, pkill and lsof all rely on it.

Personally, I dislike the idea of /proc itself: exposing system information that’s available in the kernel in binary form by converting it to text only to be converted back to binary in user processes, probably with some bugs along the way. Or, putting it differently: stringly-typed programming. This has consequences, such as most files in /proc reporting their size through fstat(2) as 0, even though they’re not empty when you actually read them. In order for the kernel to report the actual size of such a file, it would need to serialise its data into text every time you ask it for the size, since the relation between the size of binary data represented by the text and the size of the text is non-obvious.

Another issue is that this API passes such files off as regular files, making them available through the same methods as any other file, even though treating them as regular files makes bugs inevitable. There is no warning that they need special treatment for the user to receive correct data. So using tools like the cat command or generic file-reading functions like base::ReadFileToString() in Chromium can lead to incorrect data. And that’s assuming that receiving correct data can even somehow be guaranteed. It can be for some files in /proc, but not for others, as I’ll show here.

Some files in /proc guarantee that their content is consistent when read in a single read(2) syscall. Others don’t. The obvious issue here is that in order to be able to read a whole file in a single read(2) syscall, we need to know its size first. And most files in /proc report that their size is 0, even though they’re not empty:

$ ls -l /proc | egrep -v '^d' | head
total 0
-r--r--r--  1 root  root  0 Jun 11 03:47 buddyinfo
-r--r--r--  1 root  root  0 Jun 11 03:42 cgroups
-r--r--r--  1 root  root  0 Jun 11 03:42 cmdline
-r--r--r--  1 root  root  0 Jun 11 03:47 consoles
-r--r--r--  1 root  root  0 Jun 11 03:42 cpuinfo
-r--r--r--  1 root  root  0 Jun 11 03:47 crypto
-r--r--r--  1 root  root  0 Jun 11 03:42 devices
-r--r--r--  1 root  root  0 Jun 11 03:47 diskstats
-r--r--r--  1 root  root  0 Jun 11 03:47 dma
$ wc -c /proc/cpuinfo
5508 /proc/cpuinfo
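
For the files that do promise consistency within one read(2), a common way around the missing size is to guess a buffer size and grow it until a single call returns everything. Below is a minimal sketch of that idea, not code taken from any of the tools mentioned here; read_in_one_syscall() and its max_cap parameter are names made up for illustration. It accepts a result only if a follow-up one-byte read reports end of file, and it reopens the file for every attempt so that each attempt is a fresh single read. EINTR handling is left out for brevity.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: fetch the whole file with exactly one read(2) call.
 * On success returns a malloc'd buffer and stores its length in *out_len.
 * Returns NULL if the file cannot be opened, memory runs out, or the
 * content cannot be obtained in a single read of at most max_cap bytes.
 */
char *read_in_one_syscall(const char *path, size_t max_cap, size_t *out_len) {
    assert(path != NULL && out_len != NULL);
    assert(max_cap <= SIZE_MAX / 2);  /* keeps cap *= 2 from overflowing */

    for (size_t cap = 4096; cap <= max_cap; cap *= 2) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;

        char *buf = malloc(cap);
        if (buf == NULL) {
            close(fd);
            return NULL;
        }

        /* The one read whose result is handed to the caller. */
        ssize_t n = read(fd, buf, cap);

        /* Probe: if anything is left, the first read was not the whole file. */
        char probe;
        ssize_t rest = n < 0 ? -1 : read(fd, &probe, 1);
        close(fd);

        if (n >= 0 && rest == 0) {
            *out_len = (size_t) n;
            return buf;
        }

        free(buf);
        if (n < 0 || rest < 0)
            return NULL;
        /* Data remained after the first read: start over with a bigger buffer. */
    }
    return NULL;
}
```

If a file never returns everything in one call, whatever the buffer size, this keeps failing until cap exceeds max_cap and then returns NULL. That at least makes the missing guarantee visible instead of silently stitching pieces together.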

Demo: Linux

So is it a real problem? Everything seems to work. Can we get incorrect data in practice? For demonstration purposes I’ll focus on the /proc/[pid]/maps file, since I can easily programmatically control its contents and show how mutating it (which happens e.g. when a program allocates or deallocates memory) and reading it concurrently leads to incorrect data.

How should we read it? I’ll go with simulating the behaviour of the base::ReadFileToString() function from Chromium, which, through a sequence of indirections, ultimately calls base::ReadStreamToSpanWithMaxSize(). This function in turn uses the fread(3) C standard library function to read the file in chunks. If the file has a non-zero size reported by fstat(2), then it uses that as the chunk size. If the reported size is zero, on the other hand, it uses a 4 KiB chunk for the first read and 64 KiB chunks after that if the first one is not enough. One might ask: what does pmap do? Its behaviour is actually more bug-prone. It uses fgets(3), which, according to strace, on my system reduces to reading the file in 1024-byte chunks.

This is the function I’m going to use for reading /proc/[pid]/maps:

void linux_copy_maps_to_file(const char* dst, pid_t pid) {
    const size_t big_chunk_size = 1 << 16;
    const size_t small_chunk_size = 1 << 12;

    char maps_path[26] = {0};
    snprintf(maps_path, (sizeof maps_path) - 1, "/proc/%u/maps", (unsigned) pid);

    FILE* src_fp = fopen(maps_path, "r");
    if (src_fp == NULL) {
        perror("fopen");
        return;
    }

    FILE* dst_fp = fopen(dst, "w");
    if (dst_fp == NULL) {
        perror("fopen");
        fclose(src_fp);
        return;
    }

    // big_chunk_size is not a constant expression in C, so this is a VLA.
    char chunk[big_chunk_size];
    for (size_t chunk_size = small_chunk_size; ; chunk_size = big_chunk_size) {
        size_t n = fread(chunk, 1, chunk_size, src_fp);
        if (0 < n) {
            size_t m = fwrite(chunk, 1, n, dst_fp);
            if (n != m) {
                fprintf(stderr, "fwrite failed\n");
                break;
            }
        }
        if (n < chunk_size) {
            // fread(3) only returns a short count on error or end of file.
            if (ferror(src_fp)) {
                fprintf(stderr, "fread failed\n");
            }
            break;
        }
    }

    // Close both streams on every path; fclose(3) also flushes dst_fp.
    fclose(src_fp);
    if (fclose(dst_fp) != 0) {
        perror("fclose");
    }
}

#ifdef __linux__
#define copy_maps_to_file linux_copy_maps_to_file
#endif

void copy_maps_n_times(pid_t pid, int n) {
    assert(n < 100);
    for (int i = 0; i < n; i++) {
        char dstPathBuf[30] = {0};
        snprintf(dstPathBuf, sizeof dstPathBuf - 1, "./saved-maps-%02d.txt", i);
        copy_maps_to_file(dstPathBuf, pid);
    }
}

Now the mutating part of my program. I’m going to allocate a region of memory and start by marking every other page of it as read-only, initially leaving the other ones with no permissions. This ensures that /proc/[pid]/maps doesn’t collapse the whole allocation into one line, but instead has a separate line for each page, making the file big enough, exceeding the size of the chunks used to read it.

After that, I’m going to mark the first page as read-write and copy /proc/[pid]/maps to a local file named saved-maps.txt. Then I’m going to start another process which in the background is going to copy /proc/[pid]/maps in a loop to local files named using the pattern saved-maps-%02d.txt. While this process is doing that, in the original process I’m going to iterate through the mapped region of memory, in each iteration first marking another page as read-write and then marking the previous one (from the previous iteration) as having no permissions. This way the mapped region always has exactly 1 or 2 pages of memory with read-write permissions. Other pages are either read-only or have no permissions.

In order to make the pattern more erratic, before the loop I’m going to shuffle an array of pointers to pages that I’m going to iterate over. In order to verify the result, the program is also going to print the start and end addresses of the mapped region of memory. The code:

int main() {
    struct sigaction act = {0};
    act.sa_handler = do_nothing;
    if (sigaction(SIGCHLD, &act, NULL) != 0) {
        perror("sigaction");
        return 1;
    }

    pid_t pid = getpid();

    size_t pagesize = sysconf(_SC_PAGESIZE);
    size_t total = 65000;

    void** pages = mmap(NULL, total * sizeof *pages, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pages == MAP_FAILED) {
        perror("mmap");
        return 2;
    }

#ifdef MAP_NORESERVE
    int noreserve = MAP_NORESERVE;
#else
    int noreserve = 0;
#endif

    void* arena = mmap(NULL, total * pagesize, PROT_NONE,
            MAP_PRIVATE | MAP_ANONYMOUS | noreserve, -1, 0);
    if (arena == MAP_FAILED) {
        perror("mmap");
        return 3;
    }

    printf("%p %p\n", arena, arena + total * pagesize);

    for (size_t i = 0; i < total; i++) {
        pages[i] = arena + i * pagesize;
        if (i % 2 == 0 && mprotect(pages[i], pagesize, PROT_READ) != 0) {
            perror("mprotect read");
            return 4;
        }
    }

    if (shuffle(pages, total)) {
        return 5;
    }

    if (mprotect(pages[0], pagesize, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect prot_read_write");
        return 6;
    }

    copy_maps_to_file("./saved-maps.txt", pid);

    if (fork() == 0) {
        copy_maps_n_times(pid, 20);
        return 0;
    }

    int ret = 0;

    for (size_t i = 1; i < total; i++) {
        if (mprotect(pages[i], pagesize, PROT_READ | PROT_WRITE) != 0) {
            perror("mprotect prot_read_write");
            ret = 7;
            break;
        }
        if (mprotect(pages[i - 1], pagesize, PROT_NONE) != 0) {
            perror("mprotect prot_none");
            ret = 8;
            break;
        }
    }

    int child_pid;
    do {
        int stat_loc;
        child_pid = wait(&stat_loc);
    } while (child_pid < 0 && errno == EINTR);

    if (child_pid < 0) {
        perror("wait");
        ret = 9;
    }

    return ret;
}

The whole file can be downloaded from here: rolling-mmap.c
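
The main() listing calls two helpers that are not shown in this post, do_nothing() and shuffle(). Their real definitions live in the downloadable file; the versions below are only hypothetical stand-ins, written so the listing can be compiled on its own. I am assuming that shuffle() is a plain Fisher-Yates shuffle returning non-zero on failure (which is how main() treats it) and that do_nothing() is an empty signal handler. rand(3) with a modulo is slightly biased, which doesn’t matter for making the access pattern erratic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in: a signal handler that intentionally does nothing. */
void do_nothing(int signo) {
    (void) signo;
}

/*
 * Hypothetical stand-in: Fisher-Yates shuffle of an array of pointers.
 * Returns 0 on success and -1 if no seed could be obtained.
 */
int shuffle(void **items, size_t n) {
    assert(items != NULL || n == 0);

    time_t now = time(NULL);
    if (now == (time_t) -1)
        return -1;
    srand((unsigned) now);

    /* Walk from the end, swapping each slot with a random earlier-or-equal one. */
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t) rand() % i;  /* 0 <= j < i */
        void *tmp = items[i - 1];
        items[i - 1] = items[j];
        items[j] = tmp;
    }
    return 0;
}
```

Besides these, the listing needs the usual headers: assert.h, errno.h, signal.h, stdio.h, unistd.h, sys/mman.h and sys/wait.h.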

Build it:

cc -std=gnu11 -O2 -Wall -Wextra -o rolling-mmap rolling-mmap.c

In order to verify the result I wrote a little helper script. Given 2 arguments representing the start and end addresses of the mapped region of memory, it will print the number of read-write pages in that region:

linux-count-rw-maps.sh

#!/usr/bin/env sh

[ $# -ne 2 ] && { echo "usage: $0 <start> <end>" >&2; exit 1; }

sed -E -e 's/^([^-]+)-([^ ]+) /0x\1 0x\2 /' | gawk -v start="$1" -v end="$2" 'BEGIN{c=0} strtonum($1) >= strtonum(start) && strtonum($2) <= strtonum(end) && $3 ~ /^rw/{c+=(strtonum($2) - strtonum($1)) / 4096} END{print c}'
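
For readers who would rather not depend on gawk, roughly the same count can be done in C. This is a sketch and not part of rolling-mmap.c; count_rw_pages() is a name made up for illustration. It assumes the usual /proc/[pid]/maps line layout ("start-end perms offset dev inode path"), a 64-bit system, and it takes the page size as a parameter instead of hardcoding 4096.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: count the pages of all read-write mappings that lie
 * entirely inside [start, end], reading text in /proc/[pid]/maps format.
 */
size_t count_rw_pages(FILE *maps, uintptr_t start, uintptr_t end, size_t pagesize) {
    assert(maps != NULL && pagesize > 0);

    char line[8192];
    size_t pages = 0;

    while (fgets(line, sizeof line, maps) != NULL) {
        unsigned long lo, hi;
        char perms[5];

        /* Each mapping line starts with "lo-hi perms", addresses in hex. */
        if (sscanf(line, "%lx-%lx %4s", &lo, &hi, perms) != 3)
            continue;  /* not a mapping line, skip it */

        if (lo >= start && hi <= end && perms[0] == 'r' && perms[1] == 'w')
            pages += (hi - lo) / pagesize;
    }
    return pages;
}
```

Called as count_rw_pages(stdin, start, end, (size_t) sysconf(_SC_PAGESIZE)) with a saved copy on standard input, it should print the same numbers as the script for well-formed input.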

The result:

$ ./rolling-mmap
0x7fa44f494000 0x7fa45f27c000
$ <saved-maps.txt ./linux-count-rw-maps.sh 0x7fa44f494000 0x7fa45f27c000
1
$ for f in saved-maps-*.txt; do <"$f" ./linux-count-rw-maps.sh 0x7fa44f494000 0x7fa45f27c000; done
1
1
4
2
2
1
1
2
2
4
0
1
2
3
1
2
2
2
1
3

Well, that’s odd. Remember, the mapped region always has exactly 1 or 2 pages of memory with read-write permissions. And yet according to those copies of /proc/[pid]/maps sometimes there were 4, sometimes 3, sometimes 0.

Alternative: FreeBSD

Now let’s look at what result we may get on a system that uses a different API for fetching the same information. The FreeBSD libprocstat library provides procstat_getvmmap(), which for running processes calls kinfo_getvmmap():

struct kinfo_vmentry *
procstat_getvmmap(struct procstat *procstat, struct kinfo_proc *kp,
    unsigned int *cntp)
{

    switch(procstat->type) {
    case PROCSTAT_KVM:
        warnx("kvm method is not supported");
        return (NULL);
    case PROCSTAT_SYSCTL:
        return (kinfo_getvmmap(kp->ki_pid, cntp));
    case PROCSTAT_CORE:
        return (kinfo_getvmmap_core(procstat->core, cntp));
    default:
        warnx("unknown access method: %d", procstat->type);
        return (NULL);
    }
}

kinfo_getvmmap() in turn calls sysctl twice:

struct kinfo_vmentry *
kinfo_getvmmap(pid_t pid, int *cntp)
{
    int mib[4];
    int error;
    int cnt;
    size_t len;
    char *buf, *bp, *eb;
    struct kinfo_vmentry *kiv, *kp, *kv;

    *cntp = 0;
    len = 0;
    mib[0] = CTL_KERN;
    mib[1] = KERN_PROC;
    mib[2] = KERN_PROC_VMMAP;
    mib[3] = pid;

    error = sysctl(mib, nitems(mib), NULL, &len, NULL, 0);
    if (error)
        return (NULL);
    len = len * 4 / 3;
    buf = malloc(len);
    if (buf == NULL)
        return (NULL);
    error = sysctl(mib, nitems(mib), buf, &len, NULL, 0);
    if (error) {
        free(buf);
        return (NULL);
    }
    // code snipped -- this is followed by unpacking
    // the data into structs expected by callers

And now the consistency guarantees of this code. From the sysctl(3) manpage:

Unless explicitly noted below, sysctl() returns a consistent snapshot of the data requested. Consistency is obtained by locking the destination buffer into memory so that the data may be copied out without blocking. Calls to sysctl() are serialized to avoid deadlock.

[…]

The information is copied into the buffer specified by oldp. The size of the buffer is given by the location specified by oldlenp before the call, and that location gives the amount of data copied after a successful call and after a call that returns with the error code ENOMEM. If the amount of data available is greater than the size of the buffer supplied, the call supplies as much data as fits in the buffer provided and returns with the error code ENOMEM. If the old value is not desired, oldp and oldlenp should be set to NULL.

So the first call to sysctl(3) is to find out the size of the data to be fetched. After that, kinfo_getvmmap() allocates a buffer that’s a bit bigger (4/3 times) just in case the size changes in the meantime. The second call fetches the actual data. If the data is truncated, sysctl(3) will return an error and set errno to ENOMEM. In that situation, kinfo_getvmmap() is going to return NULL, which is going to be forwarded by procstat_getvmmap() to users of libprocstat. So procstat_getvmmap() returns either NULL or consistent, accurate data. No other option.
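
The size-then-fetch pattern is easy to get subtly wrong, so here is a minimal sketch of it in isolation. To keep it runnable on systems without sysctl(3), it is written against a hypothetical callback type, fetch_fn, that follows the same oldp/oldlenp contract as quoted above; fetch_snapshot(), fake_source and fake_fetch() are likewise names invented for this illustration. Unlike kinfo_getvmmap(), which gives up after one ENOMEM, this variant asks for the size again and retries a bounded number of times. Either way the caller gets a complete snapshot or NULL, never a truncated one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical callback with the oldp/oldlenp contract of sysctl(3):
 *  - buf == NULL: store the current data size in *len and return 0;
 *  - otherwise: copy at most *len bytes, store the copied amount in *len,
 *    and return -1 with errno == ENOMEM if the data did not fit.
 */
typedef int (*fetch_fn)(void *ctx, void *buf, size_t *len);

/* Returns a malloc'd complete snapshot, or NULL; never truncated data. */
void *fetch_snapshot(fetch_fn fetch, void *ctx, size_t *out_len, int max_tries) {
    assert(fetch != NULL && out_len != NULL && max_tries > 0);

    for (int attempt = 0; attempt < max_tries; attempt++) {
        size_t len = 0;
        if (fetch(ctx, NULL, &len) != 0)      /* first call: size only */
            return NULL;

        len = len * 4 / 3 + 1;                /* headroom; +1 avoids malloc(0) */
        void *buf = malloc(len);
        if (buf == NULL)
            return NULL;

        if (fetch(ctx, buf, &len) == 0) {     /* second call: the data */
            *out_len = len;
            return buf;
        }

        int saved = errno;
        free(buf);
        if (saved != ENOMEM) {
            errno = saved;
            return NULL;
        }
        /* The data outgrew the headroom between the two calls: ask again. */
    }
    errno = ENOMEM;
    return NULL;
}

/* Fake data source for trying the function out: grows after size queries. */
struct fake_source {
    size_t size;      /* current size of the data */
    size_t growth;    /* bytes added after a size query */
    int grow_times;   /* how many more size queries trigger growth */
};

int fake_fetch(void *ctx, void *buf, size_t *len) {
    struct fake_source *src = ctx;
    if (buf == NULL) {
        *len = src->size;
        if (src->grow_times > 0) {
            src->size += src->growth;
            src->grow_times--;
        }
        return 0;
    }
    size_t n = *len < src->size ? *len : src->size;
    memset(buf, 'x', n);
    *len = n;
    if (n < src->size) {
        errno = ENOMEM;
        return -1;
    }
    return 0;
}
```

On FreeBSD the callback would be a thin wrapper around sysctl(3) with a fixed mib; that wrapper is left out here because it cannot be compiled elsewhere.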

Now time for a demonstration. The most natural way to write the FreeBSD equivalent of what my little demo program does would be to use procstat_getvmmap() directly and check the permission bits in C. However, in order to make the comparison more direct, I also want this data to be in text at the end and the easiest way to do it is to shell out to procstat:

void freebsd_copy_maps_to_file(const char* dst, pid_t pid) {
    char pid_str[11] = {0};
    snprintf(pid_str, (sizeof pid_str) - 1, "%u", (unsigned) pid);

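    pid_t child = fork();
    if (child < 0) {
        perror("fork");
        return;
    }

    if (child == 0) {
        // In the child, never return: that would let the child carry on
        // running the caller's code. Leave with _exit(2) on every error path.
        int fd;
        do {
            fd = open(dst, O_CREAT | O_TRUNC | O_WRONLY, 0644);
        } while (fd < 0 && errno == EINTR);

        if (fd < 0) {
            perror("open");
            _exit(1);
        }

        // Redirect stdout to the output file.
        if (dup2(fd, 1) < 0) {
            perror("dup2");
            _exit(1);
        }
        close(fd);

        execl("/usr/bin/procstat", "procstat", "vm", pid_str, (char *) NULL);
        perror("exec procstat");
        _exit(1);
    }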

    int child_pid;
    do {
        int stat_loc;
        child_pid = wait(&stat_loc);
    } while (child_pid < 0 && errno == EINTR);

    if (child_pid < 0) {
        perror("wait");
    }
}

#ifdef __FreeBSD__
#define copy_maps_to_file freebsd_copy_maps_to_file
#endif

The FreeBSD procstat output is printed in a format which is a bit different from /proc/[pid]/maps on Linux, so the script to parse it is also a bit different:

freebsd-count-rw-maps.sh

#!/usr/bin/env sh

[ $# -ne 2 ] && { echo "usage: $0 <start> <end>" >&2; exit 1; }

gawk -v start="$1" -v end="$2" 'BEGIN{c=0} strtonum($2) >= strtonum(start) && strtonum($2) < strtonum(end) && $4 ~ /^rw/{c+=(strtonum($3) - strtonum($2)) / 4096} END{print c}'

I have my FreeBSD box configured in such a way that one unprivileged process can’t read the memory maps of another process, so here I’m going to need to run the program with elevated privileges. The result:

$ doas ./rolling-mmap
0x825f28000 0x835d10000
$ <saved-maps.txt ./freebsd-count-rw-maps.sh 0x825f28000 0x835d10000
1
$ for f in saved-maps-*.txt; do <"$f" ./freebsd-count-rw-maps.sh 0x825f28000 0x835d10000; done
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1

Speaks for itself.