
# Reserve and commit

This page contains a bit of maths. To view it correctly you need a web browser which supports MathML. That means either Firefox, Safari or Chrome. Chrome should be in version 109 or later.

I’ve been working on a yet-unpublished side project in Rust and decided to use a bump allocator provided by bumpalo to simplify lifetimes and improve performance of my code. Generally, making lots of micro-allocations and then freeing them one by one is far from optimal, when you can instead group them into larger allocations and free all at once. In the same way, a large number of little objects with their own separate lifetimes is a lot harder to keep track of conceptually than when their lifetimes are grouped into just a few longer ones. Fewer, longer lifetimes also make working with the borrow-checker easier.

In broad strokes, bumpalo provides memory by allocating chunks of memory from the OS and keeping track of how many bytes in the last chunk are already used. When you request $n$ bytes from bumpalo, it gives you memory from the free portion of the last chunk and increments its counter of used bytes. When there isn’t enough space in the last chunk, it allocates another one and does the same.

Works well enough. However:

1. It involves much more memory management code than I’d like.
2. It fragments memory over disparate chunks, where there’s likely always going to be unused space left at the end of each chunk.
3. Freeing Bump takes linear time relative to the number of chunks.
4. Reserving memory upfront by setting capacity avoids fragmentation into chunks, but it takes that amount of memory from the OS, regardless of whether all that memory is actually going to be used. That means the user needs to have very precise upfront expectations of memory use if they want to avoid fragmentation by reserving. Whether that's merely difficult and fragile, or outright impossible, varies case by case.

There is a way to implement a bump allocator without using chunks, and without taking unnecessary memory from the OS. I’m going to describe it later down this page, but first I want to establish a data structure which could serve as a benchmark of whether that allocator is any good.

In this text I’m going to use C++, because it’s the official language of Windows APIs, it makes it easy to access C APIs on other platforms as well, and it’s more established than Rust, so talking about data structures provided by the C++ standard library is going to meet with less suspicion than doing the same with the Rust standard library.

## Dynamic arrays

Dynamic arrays are arrays which can change their length. A typical implementation of this data structure, such as std::vector, when it isn’t being modified, behaves like a normal array, meaning that it holds its elements in a contiguous memory buffer. The $N$th element (counting from $0$) starts $N \times M$ bytes after the start of the $0$th element, when each element is $M$ bytes long. Length (reported by size()) and capacity (capacity()) are stored on the side. Length is the number of elements currently stored, while capacity is the number of elements std::vector is capable of storing in the current allocation. $\mathit{Length} \le \mathit{Capacity}$.

To store its elements, std::vector asks the allocator for a buffer big enough to store all elements. When you append an element to the end using push_back(), std::vector checks whether the last allocation has enough extra space to hold another element between the current last element and the end of allocation. In other words, it checks whether $\mathit{Length} + 1 \le \mathit{Capacity}$. If yes, then it puts the new element right after the current last element and updates the length by adding $1$ to it.

If however $\mathit{Length} = \mathit{Capacity}$, then there is no more space in the current allocation. In that case, std::vector asks the allocator for another buffer, this time of size $G\times\mathit{Capacity}$. $G$ is called the growth factor. libstdc++ and libc++ use $G=2$. MSVC uses $G=\frac{3}{2}$.

After allocating the new bigger buffer, elements from the old buffer are copied to the new one and the new element is put after them. Old buffer is freed, length is incremented by $1$, and capacity is set to the size of the newly-allocated buffer.
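To make that concrete, here is a minimal sketch of the growth path for a vector of ints with $G=2$. It is not the actual libstdc++/libc++/MSVC code, just the shape of it (error handling omitted):

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Toy dynamic array with growth factor G = 2.
struct IntVec {
    int* data = nullptr;
    size_t len = 0;  // elements currently stored
    size_t cap = 0;  // elements the current buffer can hold

    void push_back(int v) {
        if (len == cap) {
            // No space left: allocate a buffer twice as big,
            // copy the old elements over, free the old buffer.
            size_t new_cap = cap == 0 ? 1 : 2 * cap;
            int* new_data = static_cast<int*>(std::malloc(new_cap * sizeof(int)));
            std::memcpy(new_data, data, len * sizeof(int));
            std::free(data);
            data = new_data;
            cap = new_cap;
        }
        data[len++] = v;  // fast path: just write the element and bump the length
    }
};
```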

### Performance Cost

#### Time

If the std::vector had $N$ elements before the whole operation, all of that required $O(N)$ steps (due to copying). However, this copying doesn’t happen at every push_back(). It happens only when length has reached capacity, i.e. roughly at the $G^k$th push_back() for successive integers $k$. In all other cases this method takes constant time.

If we consider a sequence of $n$ push_back() calls, then they are going to include $\lceil\log_{G}n\rceil$ calls which copy the underlying buffer, and $n - \lceil\log_{G}n\rceil$ calls which do not. All the copies taken together are going to take $1 + G + G^2 + \cdots + G^{\lceil\log_{G}n\rceil} = \frac{G^{\lceil\log_{G}n\rceil + 1} - 1}{G - 1} \approx \frac{Gn - 1}{G - 1}$ steps. The copy-less push_back() calls taken together are going to take $n - \lceil\log_{G}n\rceil$ steps. Adding it together we get that all push_back() calls together cost us $O(\frac{Gn - 1}{G - 1} + n - \lceil\log_{G}n\rceil)$.

There are $n$ push_back() calls, so on average push_back() takes constant time:

$O(\frac{\frac{Gn - 1}{G - 1} + n - \lceil\log_{G}n\rceil}{n}) = O(\frac{\frac{Gn - 1}{G - 1} + n}{n}) = O(\frac{G}{G - 1} + 1) = O(1)$, because $G$ is a constant.

That’s why it’s sometimes said that std::vector::push_back() takes constant time amortised – the cost of every expensive push_back(), which copies $n$ elements, is amortised over $n$ previous push_back()s, each of which took constant time.
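For a concrete feel for the numbers: with $G=2$, the capacity starting at $1$, and $n=1024$, the buffer gets copied when it holds $1, 2, 4, \ldots, 512$ elements, so the copies sum to $1 + 2 + 4 + \cdots + 512 = 1023 \approx n$ element copies in total, i.e. roughly one extra copy per push_back() on average.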

However, the latency of push_back() is pretty bad, as demonstrated by the measurements later in this note.

#### Space

Between allocating the new buffer and discarding the old one after copying it, the two buffers occupied $(N + G \times N) \times M = (G+1) \times N \times M$ bytes of memory, which means that in order to get an $N+1$-element std::vector, we need almost $G+1$ times as much memory as is needed for our elements. That’s $3$ times for libstdc++ and libc++, and $2.5$ times for MSVC. It’s a short time window, so most of the time it isn’t terrible, but sometimes we may not have this much memory available to us.

After push_back() finishes, we’re using $G \times N \times M$ bytes to store our $(N+1) \times M$ bytes of elements, which gives us the lower bound for space utilisation. For $G=\frac{3}{2}$, space utilisation of std::vector is in the range $(\frac{2}{3}, 1]$. For $G=2$, it is in the range $(\frac{1}{2}, 1]$.

Not terrible, not great.

### Complexity cost

Due to all the copying, our program cannot depend on any elements staying in one place for too long. If we want to refer to a string stored in a std::vector, we cannot simply grab a reference/pointer/string_view to it, as that could result in a dangling reference/pointer after a push_back() call.
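A minimal example of the failure mode (whether the pointer is actually invalidated depends on whether any of the push_back() calls reallocates, which is exactly the problem):

```cpp
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> strings;
    strings.push_back("hello");

    const std::string* first = &strings[0];  // points into the vector's buffer

    for (int i = 0; i < 1000; i++) {
        strings.push_back("filler");  // may reallocate and move every element
    }

    std::cout << *first << "\n";  // may read through a dangling pointer: undefined behaviour
}
```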

We could work around this problem by grabbing the index of the string in that std::vector. However, that would mean that functions operating on that string would need to operate on such indices instead of references/pointers to strings, which in turn would mean that they would either need to take the std::vector as another argument, or that they would work only with strings from a selected std::vector. That would make using those functions in different contexts difficult. E.g. we wouldn’t be able to just pass a string directly to those functions in tests (normal, or transient).

We also wouldn’t be able to easily refer to substrings without copying them or creating a substring struct of the form $(\mathit{IndexOfString}, \mathit{StartPos}, \mathit{EndPos}, \mathit{Vec})$, which is a lot more things to keep track of than just a plain std::string_view, i.e. a $(\mathit{PointerToStart}, \mathit{Length})$. And again, the functions operating on those structs would be confined to operating only on those special objects instead of arbitrary substrings.

Surely, there must be a way to make it all simpler and more efficient?

## Virtual memory

Let’s take a step back and ponder what it means to allocate memory on a typical operating system these days. Essentially, we want to get access to a region of RAM. That access is controlled by the OS and the CPU using a memory management unit (MMU).

These components logically divide memory into pages. A page usually spans $4$kiB of memory. To address every byte in such a page we need $\log_{2}4096 = 12$ bits, so we can treat the remaining upper $64 - 12 = 52$ bits as the index of the page itself. If we have an address and want to know the index of the page holding the byte under that address, we only need to perform a logical shift right on the address by 12 bits.
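In code, assuming $4$kiB pages (the constant and helper names here are mine):

```cpp
#include <cstdint>

constexpr uint64_t kPageShift = 12;  // log2(4096)

// Index of the virtual page containing the byte at `addr`.
constexpr uint64_t PageIndex(uint64_t addr) {
    return addr >> kPageShift;
}

static_assert(PageIndex(0x0fff) == 0);  // last byte of page 0
static_assert(PageIndex(0x1000) == 1);  // first byte of page 1
```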

The operating system manages a page table, which conceptually is a hash table from page indices to page indices and some minimal metadata, such as permissions. (Due to space constraints on $64$-bit systems, on most systems the hash table is actually nested.)

When a process reads from or writes to memory at an address, the CPU fetches the hash table entry for that address, checks whether the process has appropriate permissions and accesses the memory with the page index replaced with the one stored in the value fetched from the page table. If the process doesn’t have the appropriate permissions however, the CPU will trigger a hardware page fault, which is going to be caught by the OS and redirected to the process as a segfault (SIGSEGV on POSIX systems1).

This arrangement creates two address spaces: virtual and physical. The virtual address space consists of addresses the way processes see them. The physical address space consists of addresses directly referring to physical memory (RAM). Each of them is divided into pages. The virtual address space is divided into virtual pages, while the physical address space into physical pages. The page table maintains a mapping from virtual to physical pages. Each process lives in a separate virtual address space.

## Reserving vs committing memory

When we write a program and want to be able to access a piece of memory, we need to first ask the OS to update the page table. To create a new mapping, we need to call VirtualAlloc() on Windows, or mmap() on POSIX. Both of them accept an argument, hinting which virtual address the new mapping should start at. If this argument is null, the OS picks an address of its own choosing. Those functions are what’s used underneath by malloc() from the C standard library.

Looking at all that, maybe our dynamic array could allocate memory page after page in contiguity? Each time we want to append a new element, if we used up all the previously-allocated space, we’d just call VirtualAlloc()/mmap() to give us another virtual page directly after our existing allocation and continue normally.

That sounds almost like a good plan, but there is no guarantee that the virtual page right after our allocation isn’t already mapped. It could be mapped before our process even started execution, or we could have mapped it ourselves in a different place in code.

What we need is a way to tell the OS that a certain region of virtual address space shouldn’t be mapped, unless we specifically ask for it (using the address and length arguments of VirtualAlloc()/mmap()). This way we can ensure that when we need another virtual page mapped, it is available to us. And when we don’t need it, we don’t use up scarce physical memory.

VirtualAlloc() allows us to do just that. It separates reserving memory from committing it. Reserving memory is exactly what I described in the last paragraph: telling the OS not to map virtual pages in a given address range, unless we specifically ask for it. Committing memory is the act of mapping virtual pages to physical pages. On Windows committed memory is counted towards the Working Set Size (WSS) of a process, which is what’s presented to the user in the Task Manager as the amount of memory a process consumes.

While physical memory is scarce, virtual memory on $64$-bit systems is abundant. On amd64 systems we typically have about $48$ bits of virtual address space available, which amounts to $\frac{1}{4}$PiB. So while my PC has only $16$GiB of RAM, I can reserve about $16384$ times as much virtual memory.

### Lazy Overcommit

On POSIX systems, such as Mac, FreeBSD and Linux which I’m going to be using here, there is no explicit way to differentiate reserving memory from committing it. However, those systems in a normal configuration use lazy overcommit, i.e. when a process asks for memory to be mapped, the operating system doesn’t update the page table right away. Instead it tells the process that the mapping succeeded, and when the process tries to access one of the virtual pages, which it was led to believe is mapped, CPU generates a page fault, the OS catches it, updates the page table and resumes execution of the process from the instruction which triggered the page fault, as if nothing happened. That’s why it’s called lazy.

In that setup, memory allocations done with mmap() almost never fail (which would allow processes to handle the error condition), even when the sum of memory requested with mmap() is more than is available in the system (that’s why it’s called overcommit). Instead the operating system waits until after all allocations are made and processes actually try to use the allocated memory.

When it turns out that the processes weren’t joking and they really attempt to use all that memory, which is more than the physical memory available, the operating system starts an out-of-memory (OOM) killer which picks a process according to its own fancy and terminates it, thereby sacrificing it on the altar of overcommit. It may not be the process responsible for excessive memory consumption. It may be a process critical to the security of your system, like your screen locker. This is repeated, until the amount of memory that surviving processes try to access doesn’t exceed the amount of physical memory available. This is stupid.

However, in the absence of a proper reserve and commit API, it lets us emulate reserve and commit using mprotect() calls, preventing us from accessing pages of memory we didn’t explicitly “commit” by disabling all permissions on those pages. When we want to “commit” pages, we change their permissions to allow us to access them and let lazy overcommit take care of the fact that the initial mmap() call requested more memory than we may need and possibly than is even available in the system.

As long as the system administrator didn’t disable overcommit. But given that overcommit is expected on POSIX, applications in the POSIX world in general aren’t written with expectation that memory allocations can fail. So when overcommit is disabled, things are going to break anyway.

## The code

I’ll go over key parts of the code here, but omit other parts. Complete code is available in its own git repository. If you find bugs, I do accept patches. README describes how to send them.

I’m going to first present the platform-specific parts of code, and then follow them with platform-independent code. All platform-specific code is in namespace pal (for platform abstraction layer).

### Windows

Apparently that’s how you declare the entry-point function on Windows:

#define DECL_MAIN int __cdecl wmain()

For benchmarks I’ll want to prevent inlining in a couple places to prevent the optimiser from knowing the value of a couple arguments passed to benchmarking functions, which could influence the results.

#define NOINLINE __declspec(noinline)

Theoretically we should fetch page size with GetSystemInfo(), but I want to keep the example code simple, and on Windows it’s almost always going to be 4096 anyway:

static constexpr size_t kPageSize = 0x1000;

Reserving memory:

inline void* MemReserve(size_t len) {
    void* buf = VirtualAlloc(nullptr, len, MEM_RESERVE, PAGE_NOACCESS);
    if (buf == nullptr) {
        DWORD last_error = GetLastError();
        std::cerr << "Reserving memory with VirtualAlloc failed. Error code: " << last_error << std::endl;
        return nullptr;
    }

    return buf;
}

Committing memory:

inline bool MemCommit(void* p, size_t len) {
    void* committed = VirtualAlloc(p, len, MEM_COMMIT, PAGE_READWRITE);
    if (committed == nullptr) {
        DWORD last_error = GetLastError();
        std::cerr << "Committing memory with VirtualAlloc failed. Error code: " << last_error << std::endl;
        return false;
    }

    return true;
}

Freeing memory:

inline void MemFree(void* p, size_t /*len*/) {
    VirtualFree(p, 0, MEM_RELEASE);
}
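Before moving on, here is a minimal sketch of how these three functions are meant to be used together: reserve a large range up front, commit only what you touch, and free the whole range at the end.

```cpp
// Reserve 1 GiB of address space; nothing is committed yet,
// so this barely affects the working set.
void* base = pal::MemReserve(1ULL << 30);
assert(base != nullptr);

// Commit just the first page and use it.
bool ok = pal::MemCommit(base, pal::kPageSize);
assert(ok);
static_cast<char*>(base)[0] = 'x';

// Release the whole reservation, committed or not.
pal::MemFree(base, 1ULL << 30);
```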

That’s all we need for memory allocation. Now, a couple functions for the measurements done in benchmarks. First, peak memory use (working set size):

inline size_t GetPeakMemoryUse() {
    PROCESS_MEMORY_COUNTERS mem;
    bool res = GetProcessMemoryInfo(GetCurrentProcess(), &mem, sizeof mem);
    if (!res) {
        DWORD last_error = GetLastError();
        std::cerr << "GetProcessMemoryInfo failed. Error code: " << last_error << std::endl;
        return 0;
    }

    return mem.PeakWorkingSetSize;
}

And the time the process spent on-CPU, adding time spent in user-mode and in kernel-mode, in microseconds:

inline uint64_t ConvertFiletimeToMicroseconds(FILETIME t) {
    ULARGE_INTEGER li;
    li.LowPart = t.dwLowDateTime;
    li.HighPart = t.dwHighDateTime;
    // FILETIME is expressed in 100-nanosecond ticks; dividing by 10 gives microseconds.
    return static_cast<uint64_t>(li.QuadPart) / 10;
}

inline uint64_t GetCpuTime() {
    FILETIME a, b, c, d;
    if (GetProcessTimes(GetCurrentProcess(), &a, &b, &c, &d) != 0) {
        auto kernel_time = ConvertFiletimeToMicroseconds(c);
        auto user_time = ConvertFiletimeToMicroseconds(d);
        return user_time + kernel_time;
    } else {
        std::cerr << "failed to get CPU time\n";
        assert(false);
        return 0;
    }
}

### POSIX

Most of the code is common to all POSIX platforms with a couple tiny differences handled with the preprocessor. Normally, I’d split them into separate files, but that felt like more ceremony than is necessary for this little post.

main:

#define DECL_MAIN int main()

NOINLINE:

#define NOINLINE __attribute__((noinline))

I’m going to be working with FreeBSD and Linux on amd64, which use 4kiB pages, and Mac M1 which uses 16kiB pages (as all ARM-based platforms from Apple do):

static constexpr size_t kPageSize =
#if defined(__APPLE__) && defined(__aarch64__)
0x4000
#else
0x1000
#endif
;

Reserving memory. We call mmap() with PROT_NONE to disable all permissions and add the MAP_NORESERVE flag where it’s available (in practice that means Linux) to opt into overcommit even when it’s in heuristic mode:

inline void* MemReserve(size_t len) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;

#ifdef MAP_NORESERVE
    // On Linux, guarantee lazy commit even if /proc/sys/vm/overcommit_memory is set to heuristic (0).
    flags |= MAP_NORESERVE;
#endif

    void* buf = mmap(nullptr, len, PROT_NONE, flags, -1, 0);
    if (buf == MAP_FAILED) {
        // mmap signals failure with MAP_FAILED, not nullptr.
        int saved_errno = errno;
        std::cerr << "Reserving memory with mmap failed. Error code: " << saved_errno << std::endl;
        return nullptr;
    }

    return buf;
}

Commit by adding read and write permissions:

inline bool MemCommit(void* p, size_t len) {
    if (mprotect(p, len, PROT_READ | PROT_WRITE) != 0) {
        int saved_errno = errno;
        std::cerr << "Committing memory with mprotect failed. Error code: " << saved_errno << std::endl;
        return false;
    }

    return true;
}

Free:

inline void MemFree(void* p, size_t len) {
    if (munmap(p, len) != 0) {
        int saved_errno = errno;
        std::cerr << "Failed to unmap memory. Error code: " << saved_errno << std::endl;
    }
}

On POSIX platforms Resident Set Size (RSS) is what’s interpreted as “memory use”. To measure that, we’re going to use getrusage() and here is where POSIX lets us down. Everyone has this function, but everyone interprets the meaning of the returned ru_maxrss differently. On FreeBSD and Linux it’s in kilobytes. On Mac it’s in bytes. And curiously on illumos it’s in pages. The similarities are sufficient to lull you into a sense that everything is working fine when porting: everything has the same names, everything compiles, but then the interpretation of results is completely different. The only reason I noticed it is because suddenly results differed by several orders of magnitude between Mac and FreeBSD/Linux. Others may not be this fortunate. Reading the Linux and FreeBSD man pages, it wasn’t clear to me whether by kilobytes they meant $1024$ bytes or $1000$ bytes. I settled for $1000$:

inline size_t GetPeakMemoryUse() {
    rusage r_usage;
    if (getrusage(RUSAGE_SELF, &r_usage) != 0) {
        int saved_errno = errno;
        std::cerr << "Failed to fetch memory use. Error code: " << saved_errno << std::endl;
        return 0;
    }

    // On Mac, ru_maxrss is in bytes. On Linux and FreeBSD it's in kilobytes. So much for POSIX.
#ifdef __APPLE__
    size_t multiplier = 1;
#else
    size_t multiplier = 1000;
#endif

    return r_usage.ru_maxrss * multiplier;
}

Measuring CPU time is done the same way everywhere:

inline uint64_t ConvertTimevalToMicroseconds(timeval t) {
    return static_cast<uint64_t>(t.tv_sec) * 1000000ULL + static_cast<uint64_t>(t.tv_usec);
}

inline uint64_t GetCpuTime() {
    rusage r_usage;
    if (getrusage(RUSAGE_SELF, &r_usage) != 0) {
        int saved_errno = errno;
        std::cerr << "Failed to fetch CPU time. Error code: " << saved_errno << std::endl;
        return 0;
    }

    auto user_time = ConvertTimevalToMicroseconds(r_usage.ru_utime);
    auto kernel_time = ConvertTimevalToMicroseconds(r_usage.ru_stime);
    return user_time + kernel_time;
}

### Platform-independent part

For simplicity, let’s first write a basic arena of bytes, so we don’t need to trouble ourselves with templates for now.

Unlike std::vector, we’re going to grow the allocation linearly, not exponentially, which should give us better space efficiency. Let’s settle on the amount of pages we’re going to add each time we run out of space in our allocation. I say $1$ is a pretty good start:

static constexpr size_t kBytesPerCommit = pal::kPageSize;

We also need a way to calculate the number of pages needed to hold the bytes in our arena, multiplied by the number of bytes in a page, since VirtualAlloc() accepts a length in bytes, not pages. We calculate it by rounding the number of bytes we need to store up to the nearest page boundary. To do that, we perform ceiled division by kBytesPerCommit, followed by multiplication by kBytesPerCommit:

size_t GetCommittedBytes(size_t used_bytes) {
    return ((used_bytes + kBytesPerCommit - 1) / kBytesPerCommit) * kBytesPerCommit;
}
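A few concrete values, assuming kBytesPerCommit is $4096$:

```cpp
assert(GetCommittedBytes(1) == 4096);     // anything on the first page rounds up to one page
assert(GetCommittedBytes(4096) == 4096);  // exactly one page stays one page
assert(GetCommittedBytes(4097) == 8192);  // one byte more spills onto a second page
```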

Now the struct for arena bookkeeping, which should be pretty self-explanatory at this point:

struct Arena {
    char* base = nullptr;
    size_t cap = 0;
    size_t len = 0;

    // ... methods ...
};

RAII boilerplate:

    Arena() = delete;
    Arena(const Arena&) = delete;
    Arena(Arena&& other) {
        base = other.base;
        cap = other.cap;
        len = other.len;

        other.base = nullptr;
        other.cap = 0;
        other.len = 0;
    }

Main constructor, followed by the destructor:

    explicit Arena(size_t max_len) {
        assert(max_len != 0);

        cap = GetCommittedBytes(max_len);
        len = 0;

        void* buf = pal::MemReserve(cap);
        if (buf == nullptr) {
            throw std::bad_alloc();
        }

        base = static_cast<char*>(buf);
    }

    ~Arena() {
        if (base != nullptr) {
            pal::MemFree(base, cap);
        }
    }

I apologise for exceptions. However, I didn’t want to use extra templates like std::optional in this POC, which would get in the way of demonstration.

Our constructor reserves the amount of memory requested in its argument. It doesn’t commit anything. Destructor frees the whole arena.

Growing logic:

    bool Grow(size_t additional_len) {
        assert(len + additional_len <= cap);
        size_t old_committed = GetCommittedBytes(len);
        size_t new_committed = GetCommittedBytes(len + additional_len);
        size_t delta = new_committed - old_committed;
        if (delta == 0) {
            len += additional_len;
            return true;
        }

        if (!pal::MemCommit(base + old_committed, delta)) {
            return false;
        }

        len += additional_len;
        return true;
    }

If we can fit the additional bytes in the already-committed pages, we only update the length and report success. Otherwise, we attempt to commit more pages (the minimal amount needed to fit the additional bytes).
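Here is a short walk-through of where the commits actually happen, assuming $4096$-byte pages:

```cpp
Arena a(10'000);  // reserves 3 pages (12288 bytes) of address space, commits nothing

a.Grow(100);      // crosses from 0 to 1 committed pages: MemCommit(base, 4096)
a.Grow(3'000);    // len = 3100, still fits in the first page: no syscall
a.Grow(2'000);    // len = 5100, spills onto the second page: MemCommit(base + 4096, 4096)

assert(GetCommittedBytes(a.len) == 2 * 4096);
```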

Our POC arena is complete. Now let’s move on to the dynamic array of elements of type T:

template <typename T>
class DynArray {
    Arena arena;

public:
    // ... methods ...
};

No surprises here. RAII boilerplate:

    // Redundant, but useful as documentation.
    DynArray() = delete;
    DynArray(const DynArray<T>& other) = delete;
    DynArray(DynArray&& other) = default;

Copy constructor is left as an exercise for the reader.

Main constructor just forwards a scaled argument to the arena. For more robust code, we should add a check for arithmetic overflow, but for simplicity of examples let’s leave it out:

    explicit DynArray(size_t max_len)
        : arena(max_len * sizeof(T)) {}
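For reference, a sketch of what such an overflow check could look like; this helper isn't in the repo and needs <stdexcept> and <cstdint>:

```cpp
    static size_t CheckedByteLen(size_t max_len) {
        // Reject requests whose length in bytes doesn't fit in size_t.
        if (max_len > SIZE_MAX / sizeof(T)) {
            throw std::length_error("DynArray is too long");
        }
        return max_len * sizeof(T);
    }

    explicit DynArray(size_t max_len) : arena(CheckedByteLen(max_len)) {}
```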

Support for range loops:

    T* begin() {
        return reinterpret_cast<T*>(arena.base);
    }

    T* end() {
        return reinterpret_cast<T*>(arena.base + arena.len);
    }

Methods for querying the length and indexing with square brackets would also be nice:

    size_t Len() const {
        return arena.len / sizeof(T);
    }

    T& operator[](size_t i) {
        assert(i < Len());
        return begin()[i];
    }

And the meat of the structure:

    bool Append(T v) {
        auto* dst = end();

        if (!arena.Grow(sizeof(T))) {
            return false;
        }

        *dst = std::move(v);
        return true;
    }

Should be pretty obvious.

Theoretically, we should make sure that we put T at addresses aligned according to the alignment requirements of T. However, the base address of the arena, which is returned by VirtualAlloc(), will always be aligned to a page boundary. Alignment needs to be a power of 2. Additionally, in practice I don’t think there are any types with alignment bigger than page size2, which is also a power of 2. It follows that alignment always divides page size equally, therefore we satisfy alignment requirements with no additional effort.
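If we wanted to be defensive about it anyway, a compile-time check inside DynArray is enough (not present in the repo):

```cpp
    // Over-aligned element types would need extra handling, so reject them outright.
    static_assert(alignof(T) <= pal::kPageSize,
                  "DynArray only supports alignments up to the page size");
```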

#### Demonstrative tests

Now for some rudimentary tests. First, the arena:

    // Reserving 16GiB.
    auto a = Arena(16ULL * 1024ULL * 1024ULL * 1024ULL);

    // Committing one page to make space for 4 bytes (with plenty of headroom).
    bool committed = a.Grow(4);
    assert(committed);

    a.base[0] = 'a';
    a.base[1] = 'b';
    a.base[2] = 'c';
    a.base[3] = '\0';

    std::cout << a.base << std::endl;

    // Past the declared length, but still in committed memory.
    // No segfault, but in real code it's probably a bug.
    a.base[5] = 'x';
    a.base[kBytesPerCommit-1] = 'y';

    // Reserved, but not committed memory. Segfault.
    // a.base[kBytesPerCommit] = 'x';
That code should print “abc” and exit without crashing. We can see that unfortunately we can access elements beyond the stated length of the arena, in this case elements with indices $5$ and $\mathtt{kBytesPerCommit}-1$, because they’re placed on the same page of memory which is used to store valid elements. However, we cannot reach beyond this page, which is demonstrated with the last line. After uncommenting it and running the program, it’s going to crash due to a segfault. The uncommitted pages at the end serve as natural guard pages, providing a margin of safety, even in situations where for some reason it’s been decided that bounds-checking should be disabled (which is most C++ shops).

And the dynamic array:

    static constexpr size_t N = 100000;
    auto a = DynArray<int>(N);

    for (size_t i = 0; i < 16; i++) {
        assert(a.Append(static_cast<int>(i)));
    }

    for (size_t i = 0; i < 12; i++) {
        std::cout << a[i] << "\n";
    }

    for (size_t i = 16; i < N; i++) {
        a.Append(static_cast<int>(i));
    }

    size_t real_cap = GetCommittedBytes(N * sizeof(int)) / sizeof(int);

    for (size_t i = N; i < real_cap; i++) {
        a.Append(static_cast<int>(i));
    }

    // Exceeded capacity. Assertion failure.
    // a.Append(0);

That should print:

0
1
2
3
4
5
6
7
8
9
10
11

Uncommenting the last line would cause an assertion failure, this time because we run out of reserved space (due to deliberately smaller capacity declared in the beginning).

#### Performance cost

When DynArray has $n$ elements, Append() takes $O(1)$ time and doesn’t have any temporary spikes in space use, meaning that the amount of space used during Append() never exceeds the amount of space used after it.

$n$-element DynArray uses $\lceil\frac{n\times M}{4096}\rceil\times 4096$ bytes for the buffer plus the constant amount of space for length, capacity and pointer to buffer. This amounts to $\frac{n\times M}{\lceil\frac{n\times M}{4096}\rceil\times 4096} \approx \frac{n\times M}{\frac{n\times M}{4096}\times 4096} = \frac{n\times M}{n\times M} = 1$ space utilisation.

Pretty good. As $n$ grows, we’re approaching perfect space utilisation, and non-utilised space never exceeds $4095$ bytes, whereas in the case of std::vector it grows exponentially.
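For example, for $n = 10^6$ ints ($M = 4$), the buffer occupies $\lceil\frac{4 \times 10^6}{4096}\rceil \times 4096 = 4\,001\,792$ bytes to store $4\,000\,000$ bytes of elements, which is about $99.96\%$ space utilisation.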

#### Benchmarks

I’ve run all Windows, FreeBSD and Linux tests on the exact same machine. It’s a PC with an AMD Ryzen 7 2700X CPU and 16 GB of DDR4 RAM. For Mac tests I used my 2020 Mac Mini M1 with 16GB RAM. It wasn’t my objective to compare which compiler is better than another compiler, so to keep things shorter I used only one compiler on each platform, the one that was closest to the official one. Software versions:

1. Windows 10 Pro with MSVC 19.32.31332 running from x64 Native Tools Command Prompt for VS 2022. Compilation script: build.bat.
2. MacOS Monterey 12.6.2 with Apple clang version 13.0.0 (clang-1300.0.29.30), the version shipped with Xcode 13.2.1. Compilation script: build-clang.sh.
3. FreeBSD 13.1-p3 (with 13.1-p5 userland) with Clang 13.0.0 (FreeBSD clang version 13.0.0), which is shipped with the base system. Compilation script: build-clang.sh.
4. Debian 11.6 (5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux) with GCC (g++ (Debian 10.2.1-6) 10.2.1 20210110) installed with APT through the g++ package. Compilation script: build-gcc.sh.
##### Windows
###### Memory

First we’ll measure peak memory use, when constructing arrays of sizes $10^3$, $10^6$, $10^9$, and $10^{10}$. For details, see mem_dynarray.cpp and mem_stdvec.cpp in the repo.

Reminder: I don’t call std::vector::reserve(), because that method doesn’t reserve memory according to the Windows (and our) nomenclature; it actually commits memory. And the whole motivation for this page is finding a memory management strategy for situations when I don’t have accurate and precise predictions of how much memory I’m going to need. Besides, if I had them, I’d use, instead of std::vector, a structure which allocates only once and doesn’t have all that resizing logic.

| Array size | DynArray peak WSS (bytes) | std::vector peak WSS (bytes) |
|---|---|---|
| $10^3$ | $3\,088\,384$ | $2\,551\,808$ |
| $10^6$ | $7\,086\,080$ | $10\,190\,848$ |
| $10^9$ | $4\,002\,926\,592$ | $5\,521\,395\,712$ |
| $10^{10}$ | $12\,859\,588\,608$ | std::bad_alloc |

DynArray uses more memory when the whole array can fit in space smaller than a single memory page, but for bigger arrays it uses less space, because it grows linearly, not exponentially as std::vector does. In fact, as the last row demonstrates, std::vector failed to scale to $10^{10}$ elements on my machine, reporting a memory allocation error, whereas DynArray continued working.

###### Throughput

Now let’s see how many elements we’re able to append in a given time period. We take the total time it takes to construct the whole array and then divide it by the number of elements. Each benchmark is repeated for different sizes of the resulting dynamic array. Results are included for builds with different levels of optimisation: no optimisation, /O1 and /O2.

Aside from ints, we also use std::unique_ptrs to demonstrate how those dynamic arrays scale when their elements have non-trivial (but still almost-trivial) move constructors. For details, see bench.cpp in the repo.
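The measurement itself boils down to a loop like the following sketch (the function name is mine; the real code in bench.cpp also handles std::vector and std::unique_ptr elements):

```cpp
NOINLINE double MeasureDynArrayThroughput(size_t n) {
    DynArray<int> a(n);  // reserves address space for n ints, commits nothing yet

    uint64_t start = pal::GetCpuTime();  // CPU time in microseconds
    for (size_t i = 0; i < n; i++) {
        a.Append(static_cast<int>(i));
    }
    uint64_t elapsed_us = pal::GetCpuTime() - start;

    // If the whole loop fits within the timer's resolution, elapsed_us is 0
    // and this division yields infinity.
    return n / (elapsed_us / 1e6);
}
```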

**No optimisation**

| Length | Element | Array | Total time (seconds) | Throughput (Hz) |
|---|---|---|---|---|
| $10^3$ | int | DynArray | $0$ | $\infty$ |
| $10^3$ | int | std::vector | $0$ | $\infty$ |
| $10^3$ | std::unique_ptr | DynArray | $0$ | $\infty$ |
| $10^3$ | std::unique_ptr | std::vector | $0$ | $\infty$ |
| $10^6$ | int | DynArray | $0$ | $\infty$ |
| $10^6$ | int | std::vector | $0.031\,250$ | $32\,000\,000$ |
| $10^6$ | std::unique_ptr | DynArray | $0.078\,125$ | $12\,800\,000$ |
| $10^6$ | std::unique_ptr | std::vector | $0.140\,625$ | $7\,111\,111$ |
| $10^9$ | int | DynArray | $14.828\,125$ | $67\,439\,409$ |
| $10^9$ | int | std::vector | $23.078\,125$ | $43\,331\,076$ |
| $10^9$ | std::unique_ptr | DynArray | $84.656\,250$ | $11\,812\,476$ |
| $10^9$ | std::unique_ptr | std::vector | $297.937\,500$ | $3\,356\,408$ |
| $10^{10}$ | int | DynArray | $174.093\,750$ | $57\,440\,315$ |
| $10^{10}$ | int | std::vector | std::bad_alloc | |
| $10^{10}$ | std::unique_ptr | DynArray | std::bad_alloc | |
| $10^{10}$ | std::unique_ptr | std::vector | std::bad_alloc | |

**Optimisation /O1**

| Length | Element | Array | Total time (seconds) | Throughput (Hz) |
|---|---|---|---|---|
| $10^3$ | int | DynArray | $0$ | $\infty$ |
| $10^3$ | int | std::vector | $0$ | $\infty$ |
| $10^3$ | std::unique_ptr | DynArray | $0$ | $\infty$ |
| $10^3$ | std::unique_ptr | std::vector | $0$ | $\infty$ |
| $10^6$ | int | DynArray | $0$ | $\infty$ |
| $10^6$ | int | std::vector | $0$ | $\infty$ |
| $10^6$ | std::unique_ptr | DynArray | $0.046\,875$ | $21\,333\,333$ |
| $10^6$ | std::unique_ptr | std::vector | $0.031\,250$ | $32\,000\,000$ |
| $10^9$ | int | DynArray | $4.328\,125$ | $231\,046\,931$ |
| $10^9$ | int | std::vector | $2.828\,125$ | $353\,591\,160$ |
| $10^9$ | std::unique_ptr | DynArray | $62.359\,375$ | $16\,036\,081$ |
| $10^9$ | std::unique_ptr | std::vector | $213.609\,375$ | $4\,681\,442$ |
| $10^{10}$ | int | DynArray | $59.812\,500$ | $167\,189\,132$ |
| $10^{10}$ | int | std::vector | std::bad_alloc | |
| $10^{10}$ | std::unique_ptr | DynArray | std::bad_alloc | |
| $10^{10}$ | std::unique_ptr | std::vector | std::bad_alloc | |

**Optimisation /O2**

| Length | Element | Array | Total time (seconds) | Throughput (Hz) |
|---|---|---|---|---|
| $10^3$ | int | DynArray | $0$ | $\infty$ |
| $10^3$ | int | std::vector | $0$ | $\infty$ |
| $10^3$ | std::unique_ptr | DynArray | $0.015\,625$ | $64\,000$ |
| $10^3$ | std::unique_ptr | std::vector | $0$ | $\infty$ |
| $10^6$ | int | DynArray | $0$ | $\infty$ |
| $10^6$ | int | std::vector | $0$ | $\infty$ |
| $10^6$ | std::unique_ptr | DynArray | $0.046\,875$ | $21\,333\,333$ |
| $10^6$ | std::unique_ptr | std::vector | $0.046\,875$ | $21\,333\,333$ |
| $10^9$ | int | DynArray | $4.406\,250$ | $226\,950\,354$ |
| $10^9$ | int | std::vector | $2.765\,625$ | $361\,581\,920$ |
| $10^9$ | std::unique_ptr | DynArray | $49.187\,500$ | $20\,330\,368$ |
| $10^9$ | std::unique_ptr | std::vector | $203.906\,250$ | $4\,904\,214$ |
| $10^{10}$ | int | DynArray | $56.687\,500$ | $176\,405\,733$ |
| $10^{10}$ | int | std::vector | std::bad_alloc | |
| $10^{10}$ | std::unique_ptr | DynArray | std::bad_alloc | |
| $10^{10}$ | std::unique_ptr | std::vector | std::bad_alloc | |

With no optimisations enabled, DynArray has consistently higher throughput than std::vector, across all controlled parameters. With /O1 and /O2, std::vector has higher throughput for smaller arrays and for big arrays with trivial move constructors; but for large arrays with non-trivial move constructors DynArray has the upper hand. Extrapolating, the time it takes to construct the whole array, relative to its end length, grows much slower in the case of DynArray than in the case of std::vector.

###### Latency

For latency, we measure the time taken by each append call, save the minimum and maximum values, and aggregate all times into a histogram organised into buckets of $[0, 10^{-6})$, $[10^{-6}, 10^{-5})$, $[10^{-5}, 10^{-4})$, $[10^{-4}, 10^{-3})$, $[10^{-3}, 10^{-2})$, $[10^{-2}, 10^{-1})$, $[10^{-1}, \infty)$ seconds. Bucket columns are labeled with the upper bounds. For details, see bench.cpp in the repo.
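A sketch of the per-append measurement and bucketing, assuming a DynArray<int> a of capacity n is already constructed (variable names are mine; it also needs <algorithm> and <limits>):

```cpp
// Bucket upper bounds in seconds; anything >= 0.1 s lands in the last bucket.
constexpr double kBucketBounds[] = {1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1};
uint64_t histogram[7] = {};
double min_s = std::numeric_limits<double>::infinity();
double max_s = 0.0;

for (size_t i = 0; i < n; i++) {
    uint64_t before = pal::GetCpuTime();
    a.Append(static_cast<int>(i));
    double elapsed_s = (pal::GetCpuTime() - before) / 1e6;

    min_s = std::min(min_s, elapsed_s);
    max_s = std::max(max_s, elapsed_s);

    size_t bucket = 0;
    while (bucket < 6 && elapsed_s >= kBucketBounds[bucket]) {
        bucket++;
    }
    histogram[bucket]++;
}
```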

No optimisation
Length Element Array Min Max $10^{-6}$ $10^{-5}$ $10^{-4}$ $10^{-3}$ $10^{-2}$ $10^{-1}$ $\infty$
$10^3$ int DynArray $0$ $0$ $1\,000$ $0$ $0$ $0$ $0$ $0$ $0$
std::vector $0$ $0$ $1\,000$ $0$ $0$ $0$ $0$ $0$ $0$
std::unique_ptr DynArray $0$ $0.015\,625$ $999$ $0$ $0$ $0$ $0$ $1$ $0$
std::vector $0$ $0$ $1\,000$ $0$ $0$ $0$ $0$ $0$ $0$
$10^6$ int DynArray $0$ $0.015\,625$ $999\,983$ $0$ $0$ $0$ $0$ $17$ $0$
std::vector $0$ $0.062\,500$ $999\,979$ $0$ $0$ $0$ $0$ $21$ $0$
std::unique_ptr DynArray $0$ $0.015\,625$ $999\,983$ $0$ $0$ $0$ $0$ $17$ $0$
std::vector $0$ $0.031\,250$ $999\,977$ $0$ $0$ $0$ $0$ $23$ $0$
$10^9$ int DynArray $0$ $0.031\,250$ $999\,987\,307$ $0$ $0$ $0$ $0$ $12\,693$ $0$
std::vector $0$ $0.468\,750$ $999\,987\,325$ $0$ $0$ $0$ $0$ $12\,671$ $4$
std::unique_ptr DynArray $0$ $1.796\,875$ $999\,965\,127$ $0$ $0$ $0$ $0$ $34\,870$ $3$
std::vector $0$ $29.484\,375$ $999\,982\,339$ $0$ $0$ $0$ $0$ $17\,646$ $15$
$10^{10}$ int DynArray $0$ $0.515\,625$ $9\,999\,869\,443$ $0$ $0$ $0$ $0$ $130\,556$ $1$
std::vector std::bad_alloc
std::unique_ptr DynArray std::bad_alloc
std::vector std::bad_alloc
Optimisation /O1
Length Element Array Min Max $10^{-6}$ $10^{-5}$ $10^{-4}$ $10^{-3}$ $10^{-2}$ $10^{-1}$ $\infty$
$10^3$ int DynArray $0$ $0$ $1\,000$ $0$ $0$ $0$ $0$ $0$ $0$
std::vector $0$ $0$ $1\,000$ $0$ $0$ $0$ $0$ $0$ $0$
std::unique_ptr DynArray $0$ $0$ $1\,000$ $0$ $0$ $0$ $0$ $0$ $0$
std::vector $0$ $0$ $1\,000$ $0$ $0$ $0$ $0$ $0$ $0$
$10^6$ int DynArray $0$ $0.015\,625$ $999\,993$ $0$ $0$ $0$ $0$ $7$ $0$
std::vector $0$ $0.093\,750$ $999\,983$ $0$ $0$ $0$ $0$ $17$ $0$
std::unique_ptr DynArray $0$ $0.015\,625$ $999\,990$ $0$ $0$ $0$ $0$ $10$ $0$
std::vector $0$ $0.015\,625$ $999\,988$ $0$ $0$ $0$ $0$ $12$ $0$
$10^9$ int DynArray $0$ $0.031\,250$ $999\,988\,885$ $0$ $0$ $0$ $0$ $11\,115$ $0$
std::vector $0$ $0.484\,375$ $999\,989\,172$ $0$ $0$ $0$ $0$ $10\,824$ $0$
std::unique_ptr DynArray $0$ $0.046\,875$ $999\,975\,686$ $0$ $0$ $0$ $0$ $24\,314$ $0$
std::vector $0$ $14.109\,375$ $999\,986\,756$ $0$ $0$ $0$ $0$ $13\,236$ $8$
$10^{10}$ int DynArray $0$ $0.468\,750$ $9\,999\,887\,755$ $0$ $0$ $0$ $0$ $112\,244$ $1$
std::vector std::bad_alloc
std::unique_ptr DynArray std::bad_alloc
std::vector std::bad_alloc
Optimisation /O2
Length Element Array Min Max $10^{-6}$ $10^{-5}$ $10^{-4}$ $10^{-3}$ $10^{-2}$ $10^{-1}$ $\infty$
$10^3$ int DynArray $0$ $0$ $1\,000$ $0$ $0$ $0$ $0$ $0$ $0$
std::vector $0$ $0$ $1\,000$ $0$ $0$ $0$ $0$ $0$ $0$
std::unique_ptr DynArray $0$ $0$ $1\,000$ $0$ $0$ $0$ $0$ $0$ $0$
std::vector $0$ $0$ $1\,000$ $0$ $0$ $0$ $0$ $0$ $0$
$10^6$ int DynArray $0$ $0.015\,625$ $999\,987$ $0$ $0$ $0$ $0$ $13$ $0$
std::vector $0$ $0.093\,750$ $999\,985$ $0$ $0$ $0$ $0$ $15$ $0$
std::unique_ptr DynArray $0$ $0.015\,625$ $999\,986$ $0$ $0$ $0$ $0$ $14$ $0$
std::vector $0$ $0.015\,625$ $999\,988$ $0$ $0$ $0$ $0$ $12$ $0$
$10^9$ int DynArray $0$ $0.031\,250$ $999\,988\,917$ $0$ $0$ $0$ $0$ $11\,083$ $0$
std::vector $0$ $0.500\,000$ $999\,989\,054$ $0$ $0$ $0$ $0$ $10\,942$ $4$
std::unique_ptr DynArray $0$ $0.390\,625$ $999\,973\,866$ $0$ $0$ $0$ $0$ $26\,133$ $1$
std::vector $0$ $13.109\,375$ $999\,984\,610$ $0$ $0$ $0$ $0$ $15\,381$ $9$
$10^{10}$ int DynArray $0$ $0.093\,750$ $9\,999\,888\,781$ $0$ $0$ $0$ $0$ $111\,219$ $0$
std::vector std::bad_alloc
std::unique_ptr DynArray std::bad_alloc
std::vector std::bad_alloc

Except for the one flake in the unoptimised benchmark for arrays of $10^3$ elements, the maximum latency of DynArray always stayed below the latency of std::vector. In fact, usually it was a couple of times lower, and for big arrays it was orders of magnitude lower, where, if you care about latency at all, the performance of std::vector wasn’t acceptable by any standard.

All latency measurements demonstrate a bimodal distribution. In the case of std::vector it’s no surprise – that’s what we should expect when we have two paths, one the common case of no resize, one the odd case of resize. However, in the case of DynArray, the bimodal distribution surprised me, as I didn’t see any cause for it in any of the code that I wrote: on the one hand, VirtualAlloc() isn’t called in every Append(), but on the other hand the number of calls to one is linearly proportional to the other, and the numbers do not reflect that, so I decided to dig into it.

I tried to collect hardware performance counters (PMC), but Intel VTune failed to gather any for my AMD CPU. AMD μProf didn’t seem to even have any place to display them. Microsoft’s tracelog.exe also failed to collect anything. The ETL file contained no PMC data. That’s quite a let-down compared to the ease of collecting this data on Linux with perf.

Luckily, DTrace was ported to Windows, so I decided to use that. First I tried to measure how much time VirtualAlloc() takes in my tests, so I wrote the following script, using the syscall DTrace provider:

syscall::NtAllocateVirtualMemory:entry
/execname == "bench2.exe"/
{
    self->ts = timestamp;
}

syscall::NtAllocateVirtualMemory:return
/execname == "bench2.exe"/
{
    @alloc[execname] = quantize((timestamp - self->ts) / 1000);
}

That collects running times (in microseconds) of the NtAllocateVirtualMemory() syscall into a histogram. I ran bench2.exe in one terminal, while in an Administrator cmd.exe in another window I typed:

"C:\Program Files\DTrace\dtrace.exe" -s virtualallochist.d

That printed:

  bench2.exe
value  ------------- Distribution ------------- count
0 |                                         0
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  25497083
2 |@                                        765607
4 |                                         42939
8 |                                         7796
16 |                                         2619
32 |                                         233
64 |                                         20
128 |                                         41
256 |                                         89
512 |                                         90
1024 |                                         46
2048 |                                         4
4096 |                                         12
8192 |                                         6
16384 |                                         4
32768 |                                         0

$16\,384$ microseconds ($0.016\,384$ seconds) isn’t near anything explaining the times from my benchmarks, where append once took as much as $0.390$ seconds. I had to dig deeper.

At that point I thought: I’m not calling NtAllocateVirtualMemory() directly, I’m calling VirtualAlloc(). I should measure that. That’s a bit harder, but still doable. VirtualAlloc() is not a syscall of the NT kernel, but a normal function defined in one of DLLs distributed with Windows, so I can’t trace it with the syscall DTrace provider.

However, I can use the pid provider. For tracing with that provider, I needed to know the PID of the executable to trace, not just the execname. One way to get it is to use the -c <cmd> option to dtrace.exe, which starts the <cmd> command and passes its PID to the DTrace script as $target. However, that approach caused DTrace to run out of memory very fast, so I needed to find another way. I decided to fetch the PID with Powershell and pass it as a positional argument to my DTrace script, which measured VirtualAlloc() and a couple of my own functions to compare:

pid$1::VirtualAlloc:entry
{
    self->ts[probefunc] = timestamp;
}

pid$1::VirtualAlloc:return
{
    @[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}

pid$1::GetProcessTimes:entry
{
    self->ts[probefunc] = timestamp;
}

pid$1::GetProcessTimes:return
{
    @[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}

pid$1::"DynArray<std::unique_ptr<int,std::default_delete<int> > >::Append":entry
{
    self->ts[probefunc] = timestamp;
}

pid$1::"DynArray<std::unique_ptr<int,std::default_delete<int> > >::Append":return
{
    @[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}

pid$1::"DynArray<int>::Append":entry
{
    self->ts[probefunc] = timestamp;
}

pid$1::"DynArray<int>::Append":return
{
    @[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}

When optimisations are enabled, some of those DynArray methods are inlined and eliminated from the executable. So I needed to use bench0.exe, the unoptimised version of the benchmark. After starting bench0.exe in one terminal, I typed the following in Administrator’s Powershell:

& "C:\Program Files\DTrace\dtrace.exe" -s virtualallochist.d (Get-Process -Name "bench0").Id

That made the benchmark run several times longer than normal (~2.5 days compared to ~2 hours) in wall clock time, but CPU time seemed to be completely unaffected. (Actually, during the first attempt Windows Update rebooted my computer after 1.5 days without asking me for permission.)

The results were as follows:

  bench0.exe                                          VirtualAlloc
value  ------------- Distribution ------------- count
1 |                                         0
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 25247233
4 |                                         100680
8 |                                         7843
16 |                                         509
32 |                                         16
64 |                                         5
128 |                                         21
256 |                                         71
512 |                                         4
1024 |                                         2
2048 |                                         0
4096 |                                         0
8192 |                                         1
16384 |                                         0
32768 |                                         1
65536 |                                         0
131072 |                                         1
262144 |                                         1
524288 |                                         0

bench0.exe                                          DynArray<std::unique_ptr<int,std::default_delete<int> > >::Append
value  ------------- Distribution ------------- count
1 |                                         0
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1995992981
4 |                                         641329
8 |                                         4288545
16 |                                         65842
32 |                                         380
64 |                                         361
128 |                                         3862
256 |                                         6963
512 |                                         670
1024 |                                         31
2048 |                                         8
4096 |                                         27
8192 |                                         0
16384 |                                         0
32768 |                                         1
65536 |                                         0

bench0.exe                                          DynArray<int>::Append
value  ------------- Distribution ------------- count
0 |                                         0
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 21782293396
2 |                                         155237404
4 |                                         15543570
8 |                                         9630471
16 |                                         117290
32 |                                         3117
64 |                                         1737
128 |                                         26619
256 |                                         38741
512 |                                         12741
1024 |                                         15708
2048 |                                         6991
4096 |                                         77
8192 |                                         24
16384 |                                         40
32768 |                                         3
65536 |                                         0
131072 |                                         1
262144 |                                         1
524288 |                                         0

bench0.exe                                          GetProcessTimes
value  ------------- Distribution ------------- count
0 |                                         0
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 27825184719
2 |                                         177795208
4 |                                         1094430
8 |                                         3752032
16 |                                         127442
32 |                                         3074
64 |                                         1284
128 |                                         13774
256 |                                         35437
512 |                                         606
1024 |                                         2
2048 |                                         1
4096 |                                         0

Here we can see that the outliers of VirtualAlloc() correspond almost perfectly to the outliers of DynArray<int>::Append(): one sample in the $262\,144$-microsecond bucket for each, and one in the $131\,072$-microsecond bucket for each. The rest of the numbers don’t correspond as perfectly, but are generally in the right ballpark once you account for the counts differing by about 3 orders of magnitude (VirtualAlloc() isn’t called in every Append()). That to me was sufficient evidence that DynArray::Append() latency really was determined by the latency of VirtualAlloc() and not by something extra that I did.

For the other platforms, I’m only going to comment on the numbers when I see something unexpected in them.

##### Mac
###### Memory

| Array size | DynArray peak RSS (bytes) | std::vector peak RSS (bytes) |
|---|---|---|
| $10^3$ | $1\,490\,944$ | $1\,474\,560$ |
| $10^6$ | $5\,521\,408$ | $9\,715\,712$ |
| $10^9$ | $4\,001\,513\,472$ | $4\,564\,942\,848$ |
| $10^{10}$ | $10\,247\,536\,640$ | $11\,684\,904\,960$ |

###### Throughput
**No optimisation (-O0)**

| Length | Element | Array | Total time (seconds) | Throughput (Hz) |
|---|---|---|---|---|
| $10^3$ | int | DynArray | $0.000\,260$ | $3\,846\,153$ |
| $10^3$ | int | std::vector | $0.000\,089$ | $11\,235\,955$ |
| $10^3$ | std::unique_ptr | DynArray | $0.000\,462$ | $2\,164\,502$ |
| $10^3$ | std::unique_ptr | std::vector | $0.000\,372$ | $2\,688\,172$ |
| $10^6$ | int | DynArray | $0.023\,056$ | $43\,372\,657$ |
| $10^6$ | int | std::vector | $0.036\,736$ | $27\,221\,254$ |
| $10^6$ | std::unique_ptr | DynArray | $0.051\,803$ | $19\,303\,901$ |
| $10^6$ | std::unique_ptr | std::vector | $0.093\,763$ | $10\,665\,187$ |
| $10^9$ | int | DynArray | $7.423\,329$ | $134\,710\,451$ |
| $10^9$ | int | std::vector | $19.339\,444$ | $51\,707\,794$ |
| $10^9$ | std::unique_ptr | DynArray | $77.600\,856$ | $12\,886\,455$ |
| $10^9$ | std::unique_ptr | std::vector | $98.452\,377$ | $10\,157\,195$ |
| $10^{10}$ | int | DynArray | $76.536\,928$ | $130\,655\,884$ |
| $10^{10}$ | int | std::vector | $245.702\,541$ | $40\,699\,619$ |
| $10^{10}$ | std::unique_ptr | DynArray | killed by OOM killer | |
| $10^{10}$ | std::unique_ptr | std::vector | killed by OOM killer | |

**Optimisation -O1**

| Length | Element | Array | Total time (seconds) | Throughput (Hz) |
|---|---|---|---|---|
| $10^3$ | int | DynArray | $0.000\,317$ | $3\,154\,574$ |
| $10^3$ | int | std::vector | $0.000\,021$ | $47\,619\,047$ |
| $10^3$ | std::unique_ptr | DynArray | $0.000\,388$ | $2\,577\,319$ |
| $10^3$ | std::unique_ptr | std::vector | $0.000\,100$ | $10\,000\,000$ |
| $10^6$ | int | DynArray | $0.013\,400$ | $74\,626\,865$ |
| $10^6$ | int | std::vector | $0.020\,791$ | $48\,097\,734$ |
| $10^6$ | std::unique_ptr | DynArray | $0.046\,920$ | $21\,312\,872$ |
| $10^6$ | std::unique_ptr | std::vector | $0.045\,331$ | $22\,059\,958$ |
| $10^9$ | int | DynArray | $4.411\,640$ | $226\,673\,073$ |
| $10^9$ | int | std::vector | $8.280\,048$ | $120\,772\,246$ |
| $10^9$ | std::unique_ptr | DynArray | $61.713\,235$ | $16\,203\,979$ |
| $10^9$ | std::unique_ptr | std::vector | $49.259\,080$ | $20\,300\,825$ |
| $10^{10}$ | int | DynArray | $45.927\,810$ | $217\,733\,003$ |
| $10^{10}$ | int | std::vector | $107.315\,495$ | $93\,183\,188$ |
| $10^{10}$ | std::unique_ptr | DynArray | killed by OOM killer | |
| $10^{10}$ | std::unique_ptr | std::vector | killed by OOM killer | |

**Optimisation -O2**

| Length | Element | Array | Total time (seconds) | Throughput (Hz) |
|---|---|---|---|---|
| $10^3$ | int | DynArray | $0.000\,313$ | $3\,194\,888$ |
| $10^3$ | int | std::vector | $0.000\,006$ | $166\,666\,666$ |
| $10^3$ | std::unique_ptr | DynArray | $0.000\,374$ | $2\,673\,796$ |
| $10^3$ | std::unique_ptr | std::vector | $0.000\,049$ | $20\,408\,163$ |
| $10^6$ | int | DynArray | $0.006\,542$ | $152\,858\,453$ |
| $10^6$ | int | std::vector | $0.003\,067$ | $326\,051\,516$ |
| $10^6$ | std::unique_ptr | DynArray | $0.033\,893$ | $29\,504\,617$ |
| $10^6$ | std::unique_ptr | std::vector | $0.018\,518$ | $54\,001\,512$ |
| $10^9$ | int | DynArray | $2.581\,145$ | $387\,424\,960$ |
| $10^9$ | int | std::vector | $1.024\,011$ | $976\,552\,009$ |
| $10^9$ | std::unique_ptr | DynArray | $47.618\,644$ | $21\,000\,177$ |
| $10^9$ | std::unique_ptr | std::vector | $19.480\,989$ | $51\,332\,096$ |
| $10^{10}$ | int | DynArray | $28.125\,049$ | $355\,554\,936$ |
| $10^{10}$ | int | std::vector | $34.616\,087$ | $288\,883\,027$ |
| $10^{10}$ | std::unique_ptr | DynArray | killed by OOM killer | |
| $10^{10}$ | std::unique_ptr | std::vector | killed by OOM killer | |

###### Latency
No optimisation (-O0)
Length Element Array Min Max $10^{-6}$ $10^{-5}$ $10^{-4}$ $10^{-3}$ $10^{-2}$ $10^{-1}$ $\infty$
$10^3$ int DynArray $0$ $0.000\,023$ $736$ $263$ $1$ $0$ $0$ $0$ $0$
std::vector $0$ $0.000\,042$ $726$ $271$ $3$ $0$ $0$ $0$ $0$
std::unique_ptr DynArray $0$ $0.000\,004$ $735$ $265$ $0$ $0$ $0$ $0$ $0$
std::vector $0$ $0.000\,018$ $700$ $299$ $1$ $0$ $0$ $0$ $0$
$10^6$ int DynArray $0$ $0.000\,011$ $766\,875$ $233\,123$ $2$ $0$ $0$ $0$ $0$
std::vector $0$ $0.002\,437$ $769\,699$ $230\,290$ $6$ $3$ $2$ $0$ $0$
std::unique_ptr DynArray $0$ $0.000\,077$ $746\,335$ $253\,649$ $16$ $0$ $0$ $0$ $0$
std::vector $0$ $0.016\,903$ $739\,203$ $260\,770$ $19$ $3$ $4$ $1$ $0$
$10^9$ int DynArray $0$ $0.000\,119$ $778\,041\,048$ $221\,953\,956$ $4\,995$ $1$ $0$ $0$ $0$
std::vector $0$ $2.372\,019$ $773\,599\,499$ $226\,394\,128$ $6\,358$ $3$ $3$ $4$ $5$
std::unique_ptr DynArray $0$ $0.004\,569$ $746\,486\,003$ $253\,494\,790$ $19\,085$ $106$ $16$ $0$ $0$
std::vector $0$ $18.939\,011$ $741\,265\,650$ $258\,717\,702$ $16\,547$ $85$ $5$ $3$ $8$
$10^{10}$ int DynArray $0$ $0.000\,128$ $7\,781\,041\,097$ $2\,218\,900\,231$ $58\,622$ $50$ $0$ $0$ $0$
std::vector $0$ $51.349\,387$ $7\,738\,549\,626$ $2\,261\,396\,549$ $53\,792$ $17$ $4$ $3$ $9$
std::unique_ptr DynArray killed by OOM killer
std::vector killed by OOM killer
Optimisation -O1
Length Element Array Min Max $10^{-6}$ $10^{-5}$ $10^{-4}$ $10^{-3}$ $10^{-2}$ $10^{-1}$ $\infty$
$10^3$ int DynArray $0$ $0.000\,012$ $779$ $220$ $1$ $0$ $0$ $0$ $0$
std::vector $0$ $0.000\,012$ $737$ $260$ $3$ $0$ $0$ $0$ $0$
std::unique_ptr DynArray $0$ $0.000\,002$ $742$ $258$ $0$ $0$ $0$ $0$ $0$
std::vector $0$ $0.000\,006$ $737$ $263$ $0$ $0$ $0$ $0$ $0$
$10^6$ int DynArray $0$ $0.000\,011$ $758\,870$ $241\,126$ $4$ $0$ $0$ $0$ $0$
std::vector $0$ $0.000\,177$ $758\,529$ $241\,453$ $17$ $1$ $0$ $0$ $0$
std::unique_ptr DynArray $0$ $0.000\,052$ $740\,348$ $259\,644$ $8$ $0$ $0$ $0$ $0$
std::vector $0$ $0.006\,409$ $745\,641$ $254\,334$ $19$ $3$ $3$ $0$ $0$
$10^9$ int DynArray $0$ $0.000\,042$ $772\,114\,172$ $227\,881\,241$ $4\,587$ $0$ $0$ $0$ $0$
std::vector $0$ $0.204\,094$ $767\,987\,634$ $232\,007\,898$ $4\,457$ $2$ $4$ $4$ $1$
std::unique_ptr DynArray $0$ $0.002\,871$ $748\,034\,858$ $251\,951\,964$ $13\,154$ $16$ $8$ $0$ $0$
std::vector $0$ $8.329\,990$ $745\,391\,293$ $254\,599\,112$ $9\,572$ $6$ $7$ $3$ $7$
$10^{10}$ int DynArray $0$ $0.000\,072$ $7\,720\,702\,598$ $2\,279\,248\,850$ $48\,552$ $0$ $0$ $0$ $0$
std::vector $0$ $16.388\,545$ $7\,679\,523\,460$ $2\,320\,428\,793$ $47\,726$ $8$ $4$ $3$ $6$
std::unique_ptr DynArray killed by OOM killer
std::vector killed by OOM killer
Optimisation -O2
Length Element Array Min Max $10^{-6}$ $10^{-5}$ $10^{-4}$ $10^{-3}$ $10^{-2}$ $10^{-1}$ $\infty$
$10^3$ int DynArray $0$ $0.000\,008$ $768$ $232$ $0$ $0$ $0$ $0$ $0$
std::vector $0$ $0.000\,054$ $758$ $237$ $5$ $0$ $0$ $0$ $0$
std::unique_ptr DynArray $0$ $0.000\,003$ $748$ $252$ $0$ $0$ $0$ $0$ $0$
std::vector $0$ $0.000\,006$ $741$ $259$ $0$ $0$ $0$ $0$ $0$
$10^6$ int DynArray $0$ $0.000\,012$ $763\,508$ $236\,491$ $1$ $0$ $0$ $0$ $0$
std::vector $0$ $0.000\,185$ $765\,899$ $234\,092$ $8$ $1$ $0$ $0$ $0$
std::unique_ptr DynArray $0$ $0.000\,074$ $745\,131$ $254\,864$ $5$ $0$ $0$ $0$ $0$
std::vector $0$ $0.000\,895$ $750\,934$ $249\,047$ $15$ $4$ $0$ $0$ $0$
$10^9$ int DynArray $0$ $0.000\,054$ $769\,831\,147$ $230\,164\,413$ $4\,440$ $0$ $0$ $0$ $0$
std::vector $0$ $0.210\,026$ $770\,968\,775$ $229\,027\,009$ $4\,204$ $3$ $4$ $4$ $1$
std::unique_ptr DynArray $0$ $0.006\,377$ $753\,974\,964$ $246\,013\,774$ $11\,250$ $9$ $3$ $0$ $0$
std::vector $0$ $2.935\,007$ $752\,346\,656$ $247\,643\,704$ $9\,616$ $8$ $9$ $3$ $4$
$10^{10}$ int DynArray $0$ $0.000\,132$ $7\,699\,417\,275$ $2\,300\,534\,425$ $48\,299$ $1$ $0$ $0$ $0$
std::vector $0$ $16.338\,763$ $7\,689\,689\,639$ $2\,310\,264\,815$ $45\,530$ $4$ $3$ $3$ $6$
std::unique_ptr DynArray killed by OOM killer
std::vector killed by OOM killer
##### FreeBSD
###### Memory

| Array size | DynArray peak RSS (bytes) | std::vector peak RSS (bytes) |
|---|---|---|
| $10^3$ | $0$ | $0$ |
| $10^6$ | $3\,792\,000$ | $4\,548\,000$ |
| $10^9$ | $3\,913\,836\,000$ | $8\,093\,936\,000$ |
| $10^{10}$ | killed by OOM killer | killed by OOM killer |

The first row is a bit surprising. It turns out that if I execute mem_dynarray1000 in an infinite loop, $90\%$ of the time it prints $0$, and other times it prints numbers which differ by several orders of magnitude between each other. It looks like a bug. Results on Mac and Linux are stable between runs. It’s safe to assume numbers in this table are underreported.

When it comes to the OOM killer, the first time I ran the test it killed not only the process running the benchmark, but also my SSH session. Case in point for my rant about overcommit and the OOM killer earlier.

###### Throughput
No optimisation (-O0)
Length Element Array Total time (seconds) Throughput (Hz)
$10^3$ int DynArray $0.000\,017$ $58\,823\,529$
std::vector $0.000\,061$ $16\,393\,442$
std::unique_ptr DynArray $0.000\,069$ $14\,492\,753$
std::vector $0.000\,145$ $6\,896\,551$
$10^6$ int DynArray $0.014\,053$ $71\,159\,183$
std::vector $0.030\,769$ $32\,500\,243$
std::unique_ptr DynArray $0.064\,249$ $15\,564\,444$
std::vector $0.127\,867$ $7\,820\,626$
$10^9$ int DynArray $12.720\,856$ $78\,611\,062$
std::vector $33.253\,775$ $30\,071\,773$
std::unique_ptr DynArray $66.044\,360$ $15\,141\,338$
std::vector $134.103\,236$ $7\,456\,941$
$10^{10}$ int DynArray killed by OOM killer
std::vector killed by OOM killer
std::unique_ptr DynArray killed by OOM killer
std::vector killed by OOM killer
Optimisation -O1

| Length | Element | Array | Total time (seconds) | Throughput (Hz) |
|---|---|---|---|---|
| $10^3$ | int | DynArray | $0.000\,012$ | $83\,333\,333$ |
| | | std::vector | $0.000\,043$ | $23\,255\,813$ |
| | std::unique_ptr | DynArray | $0.000\,029$ | $34\,482\,758$ |
| | | std::vector | $0.000\,030$ | $33\,333\,333$ |
| $10^6$ | int | DynArray | $0.008\,916$ | $112\,157\,918$ |
| | | std::vector | $0.005\,711$ | $175\,100\,682$ |
| | std::unique_ptr | DynArray | $0.024\,446$ | $40\,906\,487$ |
| | | std::vector | $0.020\,466$ | $48\,861\,526$ |
| $10^9$ | int | DynArray | $3.729\,038$ | $268\,165\,677$ |
| | | std::vector | $3.950\,968$ | $253\,102\,530$ |
| | std::unique_ptr | DynArray | $16.577\,674$ | $60\,322\,093$ |
| | | std::vector | $19.598\,912$ | $51\,023\,240$ |
| $10^{10}$ | int | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |
| | std::unique_ptr | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |

Optimisation -O2

| Length | Element | Array | Total time (seconds) | Throughput (Hz) |
|---|---|---|---|---|
| $10^3$ | int | DynArray | $0.000\,007$ | $142\,857\,142$ |
| | | std::vector | $0.000\,022$ | $45\,454\,545$ |
| | std::unique_ptr | DynArray | $0.000\,015$ | $66\,666\,666$ |
| | | std::vector | $0.000\,015$ | $66\,666\,666$ |
| $10^6$ | int | DynArray | $0.003\,832$ | $260\,960\,334$ |
| | | std::vector | $0.002\,374$ | $421\,229\,991$ |
| | std::unique_ptr | DynArray | $0.014\,070$ | $71\,073\,205$ |
| | | std::vector | $0.011\,030$ | $90\,661\,831$ |
| $10^9$ | int | DynArray | $3.948\,549$ | $253\,257\,589$ |
| | | std::vector | $2.587\,411$ | $386\,486\,723$ |
| | std::unique_ptr | DynArray | $16.367\,930$ | $61\,095\,080$ |
| | | std::vector | $19.173\,182$ | $52\,156\,183$ |
| $10^{10}$ | int | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |
| | std::unique_ptr | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |

###### Latency
No optimisation (-O0)

| Length | Element | Array | Min | Max | $10^{-6}$ | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ | $10^{-2}$ | $10^{-1}$ | $\infty$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $10^3$ | int | DynArray | $0$ | $0.000\,015$ | $840$ | $159$ | $1$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,229$ | $794$ | $205$ | $0$ | $1$ | $0$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.000\,003$ | $727$ | $273$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.048\,135$ | $721$ | $277$ | $1$ | $0$ | $0$ | $1$ | $0$ |
| $10^6$ | int | DynArray | $0$ | $0.007\,504$ | $957\,349$ | $42\,619$ | $1$ | $10$ | $21$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.007\,498$ | $954\,260$ | $45\,710$ | $5$ | $8$ | $17$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.007\,491$ | $952\,252$ | $47\,717$ | $3$ | $3$ | $25$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.025\,308$ | $910\,396$ | $89\,553$ | $9$ | $14$ | $26$ | $2$ | $0$ |
| $10^9$ | int | DynArray | $0$ | $0.007\,476$ | $935\,647\,184$ | $64\,329\,993$ | $64$ | $495$ | $22\,264$ | $0$ | $0$ |
| | | std::vector | $0$ | $4.924\,694$ | $904\,947\,744$ | $95\,027\,736$ | $120$ | $3$ | $24\,388$ | $3$ | $6$ |
| | std::unique_ptr | DynArray | $0$ | $0.004\,695$ | $863\,930\,190$ | $136\,034\,658$ | $3\,684$ | $361$ | $31\,107$ | $0$ | $0$ |
| | | std::vector | $0$ | $28.367\,182$ | $869\,894\,094$ | $130\,069\,513$ | $3\,530$ | $226$ | $32\,623$ | $5$ | $9$ |
| $10^{10}$ | int | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |
| | std::unique_ptr | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |

Optimisation -O1

| Length | Element | Array | Min | Max | $10^{-6}$ | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ | $10^{-2}$ | $10^{-1}$ | $\infty$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $10^3$ | int | DynArray | $0$ | $0.000\,009$ | $817$ | $183$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,597$ | $790$ | $209$ | $0$ | $1$ | $0$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.000\,002$ | $799$ | $201$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.054\,681$ | $857$ | $142$ | $0$ | $0$ | $0$ | $1$ | $0$ |
| $10^6$ | int | DynArray | $0$ | $0.006\,055$ | $943\,746$ | $56\,238$ | $0$ | $0$ | $16$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.006\,036$ | $952\,068$ | $47\,904$ | $3$ | $1$ | $24$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.006\,013$ | $934\,911$ | $65\,061$ | $0$ | $0$ | $28$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.005\,994$ | $937\,884$ | $62\,084$ | $6$ | $1$ | $25$ | $0$ | $0$ |
| $10^9$ | int | DynArray | $0$ | $0.006\,043$ | $892\,789\,630$ | $107\,189\,456$ | $92$ | $2$ | $20\,820$ | $0$ | $0$ |
| | | std::vector | $0$ | $1.079\,716$ | $868\,199\,213$ | $131\,779\,846$ | $92$ | $6$ | $20\,837$ | $3$ | $3$ |
| | std::unique_ptr | DynArray | $0$ | $0.007\,322$ | $867\,898\,304$ | $132\,073\,111$ | $4\,900$ | $886$ | $22\,799$ | $0$ | $0$ |
| | | std::vector | $0$ | $2.992\,284$ | $851\,345\,064$ | $148\,627\,732$ | $4\,090$ | $680$ | $22\,424$ | $5$ | $5$ |
| $10^{10}$ | int | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |
| | std::unique_ptr | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |

Optimisation -O2

| Length | Element | Array | Min | Max | $10^{-6}$ | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ | $10^{-2}$ | $10^{-1}$ | $\infty$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $10^3$ | int | DynArray | $0$ | $0.000\,008$ | $818$ | $182$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,358$ | $804$ | $195$ | $0$ | $1$ | $0$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.000\,003$ | $793$ | $207$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.057\,792$ | $847$ | $152$ | $0$ | $0$ | $0$ | $1$ | $0$ |
| $10^6$ | int | DynArray | $0$ | $0.005\,989$ | $952\,351$ | $47\,624$ | $0$ | $0$ | $25$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.005\,962$ | $944\,598$ | $55\,372$ | $5$ | $2$ | $23$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.005\,943$ | $940\,700$ | $59\,272$ | $3$ | $0$ | $25$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.006\,367$ | $950\,479$ | $49\,492$ | $5$ | $2$ | $22$ | $0$ | $0$ |
| $10^9$ | int | DynArray | $0$ | $0.006\,070$ | $889\,075\,210$ | $110\,903\,827$ | $94$ | $2$ | $20\,867$ | $0$ | $0$ |
| | | std::vector | $0$ | $1.102\,612$ | $867\,475\,739$ | $132\,503\,564$ | $103$ | $5$ | $20\,583$ | $3$ | $3$ |
| | std::unique_ptr | DynArray | $0$ | $0.007\,244$ | $876\,357\,385$ | $123\,614\,252$ | $4\,383$ | $1\,138$ | $22\,842$ | $0$ | $0$ |
| | | std::vector | $0$ | $3.004\,393$ | $843\,681\,442$ | $156\,290\,107$ | $5\,391$ | $635$ | $22\,415$ | $5$ | $5$ |
| $10^{10}$ | int | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |
| | std::unique_ptr | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |

##### Linux
###### Memory
| Array Size | DynArray Peak RSS (bytes) | std::vector Peak RSS (bytes) |
|---|---|---|
| $10^3$ | $2\,912\,000$ | $2\,856\,000$ |
| $10^6$ | $6\,700\,000$ | $7\,124\,000$ |
| $10^9$ | $3\,909\,288\,000$ | $4\,198\,016\,000$ |
| $10^{10}$ | killed by OOM killer | std::bad_alloc |

Interestingly, here std::vector threw std::bad_alloc and the benchmark wasn’t killed by the OOM killer. In the later throughput and latency benchmarks, however, that didn’t happen: the OOM killer killed the process as usual.

###### Throughput
No optimisation (-O0)

| Length | Element | Array | Total time (seconds) | Throughput (Hz) |
|---|---|---|---|---|
| $10^3$ | int | DynArray | $0.000\,029$ | $34\,482\,758$ |
| | | std::vector | $0.000\,046$ | $21\,739\,130$ |
| | std::unique_ptr | DynArray | $0.000\,151$ | $6\,622\,516$ |
| | | std::vector | $0.000\,277$ | $3\,610\,108$ |
| $10^6$ | int | DynArray | $0.016\,614$ | $60\,190\,201$ |
| | | std::vector | $0.029\,401$ | $34\,012\,448$ |
| | std::unique_ptr | DynArray | $0.111\,757$ | $8\,947\,985$ |
| | | std::vector | $0.204\,039$ | $4\,901\,023$ |
| $10^9$ | int | DynArray | $16.356\,449$ | $61\,137\,964$ |
| | | std::vector | $31.854\,182$ | $31\,393\,052$ |
| | std::unique_ptr | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |
| $10^{10}$ | int | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |
| | std::unique_ptr | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |

Optimisation -O1

| Length | Element | Array | Total time (seconds) | Throughput (Hz) |
|---|---|---|---|---|
| $10^3$ | int | DynArray | $0.000\,014$ | $71\,428\,571$ |
| | | std::vector | $0.000\,013$ | $76\,923\,076$ |
| | std::unique_ptr | DynArray | $0.000\,049$ | $20\,408\,163$ |
| | | std::vector | $0.000\,051$ | $19\,607\,843$ |
| $10^6$ | int | DynArray | $0.005\,999$ | $166\,694\,449$ |
| | | std::vector | $0.006\,483$ | $154\,249\,575$ |
| | std::unique_ptr | DynArray | $0.041\,418$ | $24\,144\,091$ |
| | | std::vector | $0.024\,917$ | $40\,133\,242$ |
| $10^9$ | int | DynArray | $3.298\,444$ | $303\,173\,253$ |
| | | std::vector | $4.674\,902$ | $213\,908\,227$ |
| | std::unique_ptr | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |
| $10^{10}$ | int | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |
| | std::unique_ptr | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |

Optimisation -O2

| Length | Element | Array | Total time (seconds) | Throughput (Hz) |
|---|---|---|---|---|
| $10^3$ | int | DynArray | $0.000\,013$ | $76\,923\,076$ |
| | | std::vector | $0.000\,013$ | $76\,923\,076$ |
| | std::unique_ptr | DynArray | $0.000\,049$ | $20\,408\,163$ |
| | | std::vector | $0.000\,041$ | $24\,390\,243$ |
| $10^6$ | int | DynArray | $0.005\,905$ | $169\,348\,010$ |
| | | std::vector | $0.006\,156$ | $162\,443\,144$ |
| | std::unique_ptr | DynArray | $0.040\,347$ | $24\,784\,990$ |
| | | std::vector | $0.022\,915$ | $43\,639\,537$ |
| $10^9$ | int | DynArray | $3.295\,792$ | $303\,417\,205$ |
| | | std::vector | $2.972\,950$ | $336\,366\,235$ |
| | std::unique_ptr | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |
| $10^{10}$ | int | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |
| | std::unique_ptr | DynArray | killed by OOM killer | |
| | | std::vector | killed by OOM killer | |

###### Latency
No optimisation (-O0)

| Length | Element | Array | Min | Max | $10^{-6}$ | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ | $10^{-2}$ | $10^{-1}$ | $\infty$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $10^3$ | int | DynArray | $0$ | $0.000\,016$ | $736$ | $263$ | $1$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,002$ | $712$ | $288$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.000\,003$ | $656$ | $344$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,040$ | $648$ | $349$ | $3$ | $0$ | $0$ | $0$ | $0$ |
| $10^6$ | int | DynArray | $0$ | $0.000\,006$ | $802\,250$ | $197\,750$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,559$ | $752\,663$ | $247\,331$ | $4$ | $2$ | $0$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.000\,009$ | $640\,887$ | $359\,113$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.040\,269$ | $629\,382$ | $370\,605$ | $3$ | $3$ | $4$ | $3$ | $0$ |
| $10^9$ | int | DynArray | $0$ | $0.000\,375$ | $828\,756\,776$ | $171\,243\,207$ | $12$ | $5$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.322\,032$ | $744\,429\,501$ | $255\,568\,581$ | $9$ | $1\,899$ | $4$ | $4$ | $2$ |
| | std::unique_ptr | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |
| $10^{10}$ | int | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |
| | std::unique_ptr | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |

Optimisation -O1

| Length | Element | Array | Min | Max | $10^{-6}$ | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ | $10^{-2}$ | $10^{-1}$ | $\infty$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $10^3$ | int | DynArray | $0$ | $0.000\,018$ | $853$ | $146$ | $1$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,016$ | $871$ | $128$ | $1$ | $0$ | $0$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.000\,005$ | $780$ | $220$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,006$ | $763$ | $237$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| $10^6$ | int | DynArray | $0$ | $0.000\,007$ | $758\,437$ | $241\,563$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,414$ | $768\,969$ | $231\,025$ | $4$ | $2$ | $0$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.000\,016$ | $719\,386$ | $280\,613$ | $1$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.006\,132$ | $718\,193$ | $281\,800$ | $4$ | $2$ | $1$ | $0$ | $0$ |
| $10^9$ | int | DynArray | $0$ | $0.000\,351$ | $766\,618\,933$ | $233\,381\,026$ | $36$ | $5$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.321\,729$ | $760\,193\,989$ | $239\,804\,071$ | $31$ | $1\,900$ | $3$ | $4$ | $2$ |
| | std::unique_ptr | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |
| $10^{10}$ | int | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |
| | std::unique_ptr | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |

Optimisation -O2

| Length | Element | Array | Min | Max | $10^{-6}$ | $10^{-5}$ | $10^{-4}$ | $10^{-3}$ | $10^{-2}$ | $10^{-1}$ | $\infty$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $10^3$ | int | DynArray | $0$ | $0.000\,019$ | $767$ | $230$ | $3$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,023$ | $769$ | $230$ | $1$ | $0$ | $0$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.000\,004$ | $735$ | $265$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,005$ | $732$ | $268$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| $10^6$ | int | DynArray | $0$ | $0.000\,009$ | $764\,199$ | $235\,801$ | $0$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.000\,611$ | $761\,404$ | $238\,590$ | $4$ | $2$ | $0$ | $0$ | $0$ |
| | std::unique_ptr | DynArray | $0$ | $0.000\,013$ | $702\,667$ | $297\,332$ | $1$ | $0$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.006\,420$ | $715\,711$ | $284\,282$ | $4$ | $2$ | $1$ | $0$ | $0$ |
| $10^9$ | int | DynArray | $0$ | $0.000\,416$ | $759\,662\,742$ | $240\,337\,214$ | $38$ | $6$ | $0$ | $0$ | $0$ |
| | | std::vector | $0$ | $0.322\,445$ | $760\,329\,293$ | $239\,668\,772$ | $26$ | $1\,900$ | $3$ | $4$ | $2$ |
| | std::unique_ptr | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |
| $10^{10}$ | int | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |
| | std::unique_ptr | DynArray | killed by OOM killer | | | | | | | | |
| | | std::vector | killed by OOM killer | | | | | | | | |

## What about mremap()?

1. As far as I know, there is no equivalent to it on Windows.
2. It may relocate the mapping in virtual address space, so you can’t use it as the basis for an arena, which needs to keep pointers valid without modifying them. (See the sketch below.)
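
To make the second point concrete, here is a minimal sketch of that failure mode. It is Linux-only (mremap() is a Linux extension), it assumes a glibc toolchain where mremap() and MREMAP_MAYMOVE are visible, and error handling is reduced to early returns:

```cpp
#include <cstddef>
#include <cstdio>
#include <sys/mman.h>

int main() {
    const std::size_t old_size = std::size_t{1} << 20; // 1 MiB
    const std::size_t new_size = std::size_t{1} << 30; // 1 GiB

    void* p = mmap(nullptr, old_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    int* first = static_cast<int*>(p); // pointer into the old mapping
    *first = 42;

    // Ask the kernel to grow the mapping, allowing it to move the mapping
    // elsewhere if it cannot be grown in place.
    void* q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
    if (q == MAP_FAILED) return 1;

    if (q != p) {
        // The mapping was relocated: `first` now dangles, even though its
        // value still lives at the same offset inside the new mapping.
        std::printf("mapping moved from %p to %p\n", p, q);
    }
    return munmap(q, new_size);
}
```

Whether the mapping actually moves depends on what happens to sit after it in the address space, but an arena cannot afford that possibility at all: handing out stable addresses is the whole point.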

## What about shrinking?

I’ve only shown how to increase the length of DynArray. If you need shrinking as well: on Windows, you call VirtualFree() with MEM_DECOMMIT instead of MEM_RELEASE; on POSIX, you call mprotect() with PROT_NONE and madvise() with MADV_FREE or MADV_DONTNEED (whichever constant is available).

Personally, I’ve never needed to do that, so I’m not going to bother implementing it. It seems that in my style of programming I either throw out whole data structures or keep them, but I never delete single elements. Judging from the Swiss Tables CppCon talk, I’m not alone in that.
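
Still, for completeness, here is a rough sketch of what that decommit step could look like. The helper below is hypothetical (it isn’t part of the DynArray shown earlier), it assumes `ptr` is page-aligned and `len` is a multiple of the page size, and it ignores errors for brevity:

```cpp
#include <cstddef>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif

// Hypothetical helper: return the committed pages in [ptr, ptr + len) to the
// OS while keeping the reserved address range intact, so pointers below `ptr`
// stay valid and the range can be committed again later.
void decommit(void* ptr, std::size_t len) {
#ifdef _WIN32
    // MEM_DECOMMIT frees the physical pages but keeps the reservation,
    // unlike MEM_RELEASE, which would give the whole range back.
    VirtualFree(ptr, len, MEM_DECOMMIT);
#else
    // Make accidental use of the decommitted range trap...
    mprotect(ptr, len, PROT_NONE);
    // ...and tell the kernel it may reclaim the backing pages.
#ifdef MADV_FREE
    madvise(ptr, len, MADV_FREE);
#else
    madvise(ptr, len, MADV_DONTNEED);
#endif
#endif
}
```

Growing again afterwards would go through the same commit path as before: VirtualAlloc() with MEM_COMMIT on Windows, mprotect() back to PROT_READ | PROT_WRITE on POSIX.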

## Closing words

So that’s what I’m doing now in my side project. If you have anything to add, you may write me an email at .

I dedicate this post to Windows Update.

1. Weirdly, it’s called EXC_BAD_ACCESS on Mac.↩︎

2. Unless someone manually sets the alignment of a structure with special compiler or linker directives to something bigger. But then it’s their problem.↩︎