Race conditions in procfs
OpenBSD deleted procfs in the 5.7 release (2015), citing race conditions
as one of the main issues. FreeBSD deprecated it in 2011 and doesn’t
mount it by default. Instead, users are expected to rely on tools like
the procstat command-line tool and the libprocstat C library, which
underneath use the sysctl(3) interface to communicate with the kernel.
In contrast, Linux supports procfs with no deprecation in sight, and
lots of basic things on Linux can only be done through procfs. On
Linux, tools like ps, pmap, pgrep, pkill and lsof all rely on procfs.
Personally, I dislike the idea of /proc itself: exposing
system information that’s available in the kernel in binary form by
converting it to text only to be converted back to binary in user
processes, probably with some bugs along the way. Or, putting it
differently: stringly-typed programming. This has consequences, such as
most files in /proc reporting their size through
fstat(2) as 0, even though they’re not empty when you
actually read them. In order for the kernel to report the actual size of
such a file, it would need to serialise its data into text every time
you ask it for the size, since the relation between the size of binary
data represented by the text and the size of the text is
non-obvious.
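The mismatch between the reported and the actual size is easy to observe. Below is a minimal sketch, assuming a POSIX system; `reported_vs_actual_size` is a hypothetical helper name, not something from an existing library. It returns the size reported by fstat(2) next to the number of bytes that read(2) actually delivers.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Stores the size reported by fstat(2) in *reported and the number of bytes
 * actually delivered by read(2) up to end of file in *actual.
 * Returns 0 on success, -1 on error.
 * For brevity, EINTR is treated as an error rather than retried.
 */
int reported_vs_actual_size(const char *path, off_t *reported, off_t *actual) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    *reported = st.st_size;

    char buf[4096];
    off_t total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    close(fd);
    if (n < 0)
        return -1;

    *actual = total;
    return 0;
}
```

On Linux, calling it on a file such as /proc/cpuinfo is expected to yield a reported size of 0 and a non-zero actual size, while for a regular file that nobody is writing to the two values match.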
Another issue is that this API presents such files as regular files,
making them available through the same interface as any other file,
even though treating them as regular files inevitably leads to bugs.
There is no warning that they need special treatment in order for
the user to receive correct data. So using tools like the
cat command or a generic file-reading function like
base::ReadFileToString() in Chromium can lead to incorrect
data. And that’s assuming that receiving correct data can even somehow
be guaranteed. It can be for some files in /proc, but not
for others, as I’ll show here.
Some files in /proc guarantee that their content is
consistent when read in a single read(2) syscall. Others
don’t. The obvious issue here is that in order to be able to read a
whole file in a single read(2) syscall, we need to know its
size first. And most files in /proc report that their size
is 0, even though they’re not empty:
$ ls -l /proc | egrep -v '^d' | head
total 0
-r--r--r-- 1 root root 0 Jun 11 03:47 buddyinfo
-r--r--r-- 1 root root 0 Jun 11 03:42 cgroups
-r--r--r-- 1 root root 0 Jun 11 03:42 cmdline
-r--r--r-- 1 root root 0 Jun 11 03:47 consoles
-r--r--r-- 1 root root 0 Jun 11 03:42 cpuinfo
-r--r--r-- 1 root root 0 Jun 11 03:47 crypto
-r--r--r-- 1 root root 0 Jun 11 03:42 devices
-r--r--r-- 1 root root 0 Jun 11 03:47 diskstats
-r--r--r-- 1 root root 0 Jun 11 03:47 dma
$ wc -c /proc/cpuinfo
5508 /proc/cpuinfo
Demo: Linux
So is it a real problem? Everything seems to work. Can we get
incorrect data in practice? For demonstration purposes I’ll focus on the
/proc/[pid]/maps file, since I can easily programmatically
control its contents and show how mutating it (which happens e.g. when a
program allocates or deallocates memory) and reading it concurrently
leads to incorrect data.
How should we read it? I’ll go with simulating the behaviour of the
base::ReadFileToString()
function from Chromium, which, through a sequence of indirections,
ultimately calls base::ReadStreamToSpanWithMaxSize().
This function in turn uses the fread(3) C standard library
function to read the file in chunks. If the file has a non-zero size
reported by fstat(2), then it uses that as the chunk size.
If the reported size is zero, on the other hand, it uses a 4 KiB chunk for
the first read and 64 KiB chunks after that if that’s not enough. One might ask:
what does pmap do? Its behaviour is actually more
bug-prone. It uses fgets(3), which, according to
strace, on my system reduces to reading the file in
1024-byte chunks.
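To get a feel for how a chunk size translates into syscalls, one can bypass stdio and count the read(2) calls directly. This is only a sketch under stated assumptions: `count_reads` is a hypothetical helper, not part of Chromium or procps, and it expects a chunk size greater than zero.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Reads the file at path with read(2) using the given chunk size and returns
 * the number of calls that delivered data, or -1 on error. A result greater
 * than 1 means the content was assembled from several separate syscalls.
 */
long count_reads(const char *path, size_t chunk_size) {
    char *chunk = malloc(chunk_size);
    if (chunk == NULL)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(chunk);
        return -1;
    }

    long calls = 0;
    for (;;) {
        ssize_t n = read(fd, chunk, chunk_size);
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted by a signal, try again */
        if (n < 0) {
            calls = -1;         /* real read error */
            break;
        }
        if (n == 0)
            break;              /* end of file */
        calls++;
    }

    close(fd);
    free(chunk);
    return calls;
}
```

Comparing `count_reads("/proc/self/maps", 1024)` with the result for a larger chunk shows how many separate syscalls a reader ends up making. For procfs files the kernel may deliver less than the requested chunk per call, so the count is not necessarily the file size divided by the chunk size.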
This is the function I’m going to use for reading
/proc/[pid]/maps:
void linux_copy_maps_to_file(const char* dst, pid_t pid) {
    const size_t big_chunk_size = 1 << 16;
    const size_t small_chunk_size = 1 << 12;
    char maps_path[26] = {0};
    snprintf(maps_path, (sizeof maps_path) - 1, "/proc/%u/maps", (unsigned) pid);
    FILE* src_fp = fopen(maps_path, "r");
    if (src_fp == NULL) {
        perror("fopen");
        return;
    }
    FILE* dst_fp = fopen(dst, "w");
    if (dst_fp == NULL) {
        perror("fopen");
        return;
    }
    char chunk[big_chunk_size];
    for (size_t chunk_size = small_chunk_size; ; chunk_size = big_chunk_size) {
        size_t n = fread(chunk, 1, chunk_size, src_fp);
        if (0 < n) {
            size_t m = fwrite(chunk, 1, n, dst_fp);
            if (n != m) {
                fprintf(stderr, "fwrite failed\n");
                return;
            }
        }
        if (n < chunk_size) {
            if (ferror(src_fp)) {
                fprintf(stderr, "fread failed\n");
                return;
            }
            if (feof(src_fp)) {
                break;
            }
        }
    }
    if (fflush(dst_fp)) {
        perror("fflush");
        return;
    }
    fclose(src_fp);
    fclose(dst_fp);
}
#ifdef __linux__
#define copy_maps_to_file linux_copy_maps_to_file
#endif
void copy_maps_n_times(pid_t pid, int n) {
    assert(n < 100);
    for (int i = 0; i < n; i++) {
        char dstPathBuf[30] = {0};
        snprintf(dstPathBuf, sizeof dstPathBuf - 1, "./saved-maps-%02d.txt", i);
        copy_maps_to_file(dstPathBuf, pid);
    }
}
Now the mutating part of my program. I’m going to allocate a region
of memory and start by marking every other page of it as read-only,
initially leaving the other ones with no permissions. This ensures that
/proc/[pid]/maps doesn’t collapse the whole allocation into
one line, but instead has a separate line for each page, making the file
big enough, exceeding the size of the chunks used to read it. After
that, I’m going to mark the first page as read-write and copy
/proc/[pid]/maps to a local file named
saved-maps.txt. Then I’m going to start another process
which in the background is going to copy /proc/[pid]/maps
in a loop to local files named using the pattern
saved-maps-%02d.txt. While this process is doing that, in
the original process I’m going to iterate through the mapped region of
memory, in each iteration first marking another page as read-write and
then marking the previous one (from the previous iteration) as having no
permissions. This way the mapped region always has exactly 1 or
2 pages of memory with read-write permissions. Other pages are
either read-only or have no permissions. In order to make the pattern
more erratic, before the loop I’m going to shuffle an array of pointers
to pages that I’m going to iterate over. In order to verify the result,
the program is also going to print the start and end addresses of the
mapped region of memory. The code:
int main() {
    struct sigaction act = {0};
    act.sa_handler = do_nothing;
    if (sigaction(SIGCHLD, &act, NULL) != 0) {
        perror("sigaction");
        return 1;
    }
    pid_t pid = getpid();
    size_t pagesize = sysconf(_SC_PAGESIZE);
    size_t total = 65000;
    void** pages = mmap(NULL, total * sizeof *pages, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pages == MAP_FAILED) {
        perror("mmap");
        return 2;
    }
#ifdef MAP_NORESERVE
    int noreserve = MAP_NORESERVE;
#else
    int noreserve = 0;
#endif
    void* arena = mmap(NULL, total * pagesize, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | noreserve, -1, 0);
    if (arena == MAP_FAILED) {
        perror("mmap");
        return 3;
    }
    printf("%p %p\n", arena, arena + total * pagesize);
    for (size_t i = 0; i < total; i++) {
        pages[i] = arena + i * pagesize;
        if (i % 2 == 0 && mprotect(pages[i], pagesize, PROT_READ) != 0) {
            perror("mprotect read");
            return 4;
        }
    }
    if (shuffle(pages, total)) {
        return 5;
    }
    if (mprotect(pages[0], pagesize, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect prot_read_write");
        return 6;
    }
    copy_maps_to_file("./saved-maps.txt", pid);
    if (fork() == 0) {
        copy_maps_n_times(pid, 20);
        return 0;
    }
    int ret = 0;
    for (size_t i = 1; i < total; i++) {
        if (mprotect(pages[i], pagesize, PROT_READ | PROT_WRITE) != 0) {
            perror("mprotect prot_read_write");
            ret = 7;
            break;
        }
        if (mprotect(pages[i - 1], pagesize, PROT_NONE) != 0) {
            perror("mprotect prot_none");
            ret = 8;
            break;
        }
    }
    int child_pid;
    do {
        int stat_loc;
        child_pid = wait(&stat_loc);
    } while (child_pid < 0 && errno == EINTR);
    if (child_pid < 0) {
        perror("wait");
        ret = 9;
    }
    return ret;
}
The whole file can be downloaded from here: rolling-mmap.c
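The listing of main() calls two helpers that are not shown in this text, do_nothing() and shuffle(). Their real definitions are in the downloadable file; what follows is only a guess at what they could look like, with signatures inferred from how main() uses them. The shuffle is assumed to be a plain Fisher-Yates shuffle, and rand() is used for brevity.

```c
#include <stddef.h>
#include <stdlib.h>

/* Signal handler that intentionally does nothing (assumed shape). */
void do_nothing(int signo) {
    (void) signo;
}

/*
 * Fisher-Yates shuffle of an array of n pointers (assumed implementation).
 * Returns 0 on success and non-zero on failure, which is how main() checks it.
 *
 * Caveat: rand() % i is slightly biased, and on platforms where RAND_MAX is
 * smaller than n some permutations can't be produced at all. That is
 * acceptable for making an access pattern erratic, but not for anything
 * that needs a uniform shuffle.
 */
int shuffle(void **items, size_t n) {
    if (items == NULL)
        return 1;
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t) rand() % i;   /* index in [0, i) */
        void *tmp = items[i - 1];
        items[i - 1] = items[j];
        items[j] = tmp;
    }
    return 0;
}
```

Since rand() is never seeded here, every run produces the same order. For this demo that is harmless, because the point is only that neighbouring loop iterations touch non-neighbouring pages.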
Build it:
cc -std=gnu11 -O2 -Wall -Wextra -o rolling-mmap rolling-mmap.c
In order to verify the result I wrote a little helper script. Given 2 arguments representing the start and end addresses of the mapped region of memory, it will print the number of read-write pages in that region:
linux-count-rw-maps.sh
#!/usr/bin/env sh
[ $# -ne 2 ] && { echo "usage: $0 <start> <end>"; exit 1; }
sed -E -e 's/^([^-]+)-([^ ]+) /0x\1 0x\2 /' |
    gawk -v start="$1" -v end="$2" '
        BEGIN { c = 0 }
        strtonum($1) >= strtonum(start) && strtonum($2) < strtonum(end) && $3 ~ /^rw/ {
            c += (strtonum($2) - strtonum($1)) / 4096
        }
        END { print c }'
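For readers who would rather do the counting in C than in awk, here is a sketch of the same idea. It is not a port of the script: `count_rw_pages` is a hypothetical function, the page size is a parameter instead of a hard-coded 4096, and it counts every mapping that lies entirely inside the closed range from start to end.

```c
#include <stdio.h>
#include <string.h>

/*
 * Reads /proc/[pid]/maps-style lines ("start-end perms ...") from fp and
 * returns the number of pages with both read and write permission among the
 * mappings that lie entirely inside [start, end].
 */
unsigned long count_rw_pages(FILE *fp, unsigned long start, unsigned long end,
                             unsigned long pagesize) {
    char line[4096];
    unsigned long count = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        unsigned long lo, hi;
        char perms[5];

        /* Addresses are hexadecimal without a 0x prefix. */
        if (sscanf(line, "%lx-%lx %4s", &lo, &hi, perms) != 3)
            continue;   /* skip anything that doesn't look like a mapping */

        if (lo >= start && hi <= end && strncmp(perms, "rw", 2) == 0)
            count += (hi - lo) / pagesize;
    }
    return count;
}
```

It can be fed one of the saved copies opened with fopen(3), together with the two addresses printed by the demo program and the value of sysconf(_SC_PAGESIZE).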
The result:
$ ./rolling-mmap
0x7fa44f494000 0x7fa45f27c000
$ <saved-maps.txt ./linux-count-rw-maps.sh 0x7fa44f494000 0x7fa45f27c000
1
$ for f in saved-maps-*.txt; do <"$f" ./linux-count-rw-maps.sh 0x7fa44f494000 0x7fa45f27c000; done
1
1
4
2
2
1
1
2
2
4
0
1
2
3
1
2
2
2
1
3
Well, that’s odd. Remember, the mapped region always has
exactly 1 or 2 pages of memory with read-write permissions. And
yet according to those copies of /proc/[pid]/maps sometimes
there were 4, sometimes 3, sometimes 0.
Alternative: FreeBSD
Now let’s look at what result we may get on a system that uses a
different API for fetching the same information. The FreeBSD
libprocstat library provides procstat_getvmmap(),
which for running processes calls kinfo_getvmmap():
struct kinfo_vmentry *
procstat_getvmmap(struct procstat *procstat, struct kinfo_proc *kp,
    unsigned int *cntp)
{
    switch(procstat->type) {
    case PROCSTAT_KVM:
        warnx("kvm method is not supported");
        return (NULL);
    case PROCSTAT_SYSCTL:
        return (kinfo_getvmmap(kp->ki_pid, cntp));
    case PROCSTAT_CORE:
        return (kinfo_getvmmap_core(procstat->core, cntp));
    default:
        warnx("unknown access method: %d", procstat->type);
        return (NULL);
    }
}
kinfo_getvmmap() in turn calls sysctl twice:
struct kinfo_vmentry *
kinfo_getvmmap(pid_t pid, int *cntp)
{
    int mib[4];
    int error;
    int cnt;
    size_t len;
    char *buf, *bp, *eb;
    struct kinfo_vmentry *kiv, *kp, *kv;
    *cntp = 0;
    len = 0;
    mib[0] = CTL_KERN;
    mib[1] = KERN_PROC;
    mib[2] = KERN_PROC_VMMAP;
    mib[3] = pid;
    error = sysctl(mib, nitems(mib), NULL, &len, NULL, 0);
    if (error)
        return (NULL);
    len = len * 4 / 3;
    buf = malloc(len);
    if (buf == NULL)
        return (NULL);
    error = sysctl(mib, nitems(mib), buf, &len, NULL, 0);
    if (error) {
        free(buf);
        return (NULL);
    }
    // code snipped -- this is followed by unpacking
    // the data into structs expected by callers
And now the consistency guarantees of this code. From the sysctl(3)
manpage:
Unless explicitly noted below, sysctl() returns a consistent snapshot of the data requested. Consistency is obtained by locking the destination buffer into memory so that the data may be copied out without blocking. Calls to sysctl() are serialized to avoid deadlock.
[…]
The information is copied into the buffer specified by oldp. The size of the buffer is given by the location specified by oldlenp before the call, and that location gives the amount of data copied after a successful call and after a call that returns with the error code ENOMEM. If the amount of data available is greater than the size of the buffer supplied, the call supplies as much data as fits in the buffer provided and returns with the error code ENOMEM. If the old value is not desired, oldp and oldlenp should be set to NULL.
So the first call to sysctl(3) is to find out the size
of the data to be fetched. After that, kinfo_getvmmap()
allocates a buffer that’s a bit bigger (4/3 times) just in case the size
changes in the meantime. The second call fetches the actual data. If
the data is truncated, sysctl(3) will return an error and
set errno to ENOMEM. In that situation,
kinfo_getvmmap() is going to return NULL,
which is going to be forwarded by procstat_getvmmap() to
users of libprocstat. So procstat_getvmmap()
either returns NULL or consistent accurate data. No other
option.
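Since a NULL result may simply mean that the map outgrew the buffer between the two calls, a caller could start over instead of giving up. The sketch below illustrates that size-query, over-allocate, fetch, retry pattern in a portable form. Everything in it is hypothetical: `fetch_fn` stands in for a wrapper around the sysctl(3) call with the KERN_PROC_VMMAP MIB, and `fake_fetch` is a fake data source that exists only so that the sketch can run on any system.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical callback with the contract of sysctl(3)'s "old" buffer:
 * with buf == NULL it stores the required size in *len and returns 0;
 * otherwise it fills buf, stores the amount copied in *len and returns 0,
 * or returns -1 with errno == ENOMEM when the data does not fit.
 */
typedef int (*fetch_fn)(void *ctx, void *buf, size_t *len);

/*
 * Size query, over-allocate by 4/3, fetch. Starts over if the data outgrew
 * the buffer in between. Returns a malloc'd buffer (length in *out_len),
 * or NULL on error or after max_attempts failed attempts.
 */
void *fetch_snapshot(fetch_fn fetch, void *ctx, int max_attempts,
                     size_t *out_len) {
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        size_t len = 0;
        if (fetch(ctx, NULL, &len) != 0)
            return NULL;
        len = len * 4 / 3 + 1;      /* headroom; +1 avoids malloc(0) */
        void *buf = malloc(len);
        if (buf == NULL)
            return NULL;
        if (fetch(ctx, buf, &len) == 0) {
            *out_len = len;
            return buf;
        }
        int saved_errno = errno;    /* free() may clobber errno */
        free(buf);
        if (saved_errno != ENOMEM)
            return NULL;            /* a real error, not just growth */
    }
    return NULL;
}

/* Fake source: grows once, right after the first size query. */
struct fake_source {
    size_t size;      /* current size of the data */
    size_t grow_by;   /* growth applied after the first size query */
    int queries;      /* number of size queries seen so far */
};

int fake_fetch(void *ctx, void *buf, size_t *len) {
    struct fake_source *src = ctx;
    if (buf == NULL) {
        *len = src->size;
        if (src->queries++ == 0)
            src->size += src->grow_by;   /* simulate a concurrent change */
        return 0;
    }
    if (*len < src->size) {
        errno = ENOMEM;
        return -1;
    }
    memset(buf, 'x', src->size);
    *len = src->size;
    return 0;
}
```

With a source that starts at 300 bytes and grows by 200, the first attempt allocates 401 bytes and fails with ENOMEM, and the second attempt succeeds with 500 bytes. Whether retrying is worthwhile on a real system depends on how quickly the target process changes its mappings.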
Now time for a demonstration. The most natural way to write the
FreeBSD equivalent of what my little demo program does would be to use
procstat_getvmmap() directly and check the permission bits
in C. However, in order to make the comparison more direct, I also want
this data to be in text at the end and the easiest way to do it is to
shell out to procstat:
void freebsd_copy_maps_to_file(const char* dst, pid_t pid) {
    char pid_str[11] = {0};
    snprintf(pid_str, (sizeof pid_str) - 1, "%u", (unsigned) pid);
    if (fork() == 0) {
        int fd;
        do {
            fd = open(dst, O_CREAT | O_TRUNC | O_WRONLY, 0644);
        } while (fd < 0 && errno == EINTR);
        if (fd < 0) {
            perror("open");
            _exit(1);
        }
        // Redirect stdout to the output file.
        if (dup2(fd, 1) < 0) {
            perror("dup2");
            _exit(1);
        }
        close(fd);
        execl("/usr/bin/procstat", "procstat", "vm", pid_str, (char*) NULL);
        // Only reached if exec failed. Exit here so that the child
        // doesn't return into the caller's code.
        perror("exec procstat");
        _exit(1);
    }
    int child_pid;
    do {
        int stat_loc;
        child_pid = wait(&stat_loc);
    } while (child_pid < 0 && errno == EINTR);
    if (child_pid < 0) {
        perror("wait");
    }
}
#ifdef __FreeBSD__
#define copy_maps_to_file freebsd_copy_maps_to_file
#endif
The FreeBSD procstat output is printed in a format which
is a bit different than /proc/[pid]/maps on Linux so the
script to parse it is also a bit different:
freebsd-count-rw-maps.sh
#!/usr/bin/env sh
[ $# -ne 2 ] && { echo "usage: $0 <start> <end>"; exit 1; }
gawk -v start="$1" -v end="$2" '
    BEGIN { c = 0 }
    strtonum($2) >= strtonum(start) && strtonum($2) < strtonum(end) && $4 ~ /^rw/ {
        c += (strtonum($3) - strtonum($2)) / 4096
    }
    END { print c }'
I have my FreeBSD box configured in such a way that one unprivileged process can’t read the memory maps of another process, so here I’m going to need to run the program with elevated privileges. The result:
$ doas ./rolling-mmap
0x825f28000 0x835d10000
$ <saved-maps.txt ./freebsd-count-rw-maps.sh 0x825f28000 0x835d10000
1
$ for f in saved-maps-*.txt; do <"$f" ./freebsd-count-rw-maps.sh 0x825f28000 0x835d10000; done
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Speaks for itself.