yakubin’s notes

Reserve and commit

This page contains a bit of maths. To view it correctly you need a web browser which supports MathML: Firefox, Safari, or Chrome (version 109 or later for Chrome).

I’ve been working on an as-yet-unpublished side project in Rust and decided to use the bump allocator provided by bumpalo to simplify lifetimes and improve the performance of my code. Generally, making lots of micro-allocations and then freeing them one by one is far from optimal when you can instead group them into larger allocations and free them all at once. In the same way, a large number of little objects, each with its own separate lifetime, is a lot harder to keep track of conceptually than objects whose lifetimes are grouped into just a few longer ones. Fewer, longer lifetimes also make working with the borrow checker easier.

In broad strokes, bumpalo provides memory by allocating chunks of memory from the OS and keeping track of how many bytes in the last chunk are already used. When you request $n$ bytes from bumpalo, it gives you memory from the free portion of the last chunk and increments its counter of used bytes. When there isn’t enough space in the last chunk, it allocates another one and does the same.
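
To make that concrete, here is a minimal sketch (in C++, since the rest of this page uses it) of the idea behind a chunked bump allocator. It is not bumpalo’s actual implementation, and it ignores alignment and allocation failure:

#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimal sketch of a chunked bump allocator. Not bumpalo's real code:
// alignment and malloc failure are ignored for brevity.
class BumpSketch {
    struct Chunk {
        char* base;
        size_t cap;
        size_t used;
    };
    std::vector<Chunk> chunks;
    size_t chunk_size;

public:
    explicit BumpSketch(size_t chunk_size) : chunk_size(chunk_size) {}

    void* Alloc(size_t n) {
        // If the last chunk can't fit n more bytes, grab a fresh chunk.
        if (chunks.empty() || chunks.back().used + n > chunks.back().cap) {
            size_t cap = n > chunk_size ? n : chunk_size;
            chunks.push_back({static_cast<char*>(std::malloc(cap)), cap, 0});
        }
        Chunk& c = chunks.back();
        void* p = c.base + c.used;
        c.used += n;  // "bump" the counter of used bytes
        return p;
    }

    // Freeing everything at once takes time linear in the number of chunks.
    ~BumpSketch() {
        for (Chunk& c : chunks) {
            std::free(c.base);
        }
    }
};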

Works well enough. However:

  1. It involves much more memory management code than I’d like.
  2. It fragments memory over disparate chunks, where there’s likely always going to be unused space left at the end of each chunk.
  3. Freeing Bump takes linear time relative to the number of chunks.
  4. Reserving memory upfront by setting the capacity avoids fragmentation into chunks, but it takes that amount of memory from the OS regardless of whether all of it is actually going to be used. That means the user needs very precise upfront expectations of memory use if they want to avoid fragmentation by reserving. Whether that’s merely difficult and fragile, or outright impossible, varies on a case-by-case basis.

There is a way to implement a bump allocator without using chunks, and without taking unnecessary memory from the OS. I’m going to describe it later down this page, but first I want to establish a data structure which could serve as a benchmark of whether that allocator is any good.

In this text I’m going to use C++, because it’s the official language of Windows APIs, it makes it easy to access C APIs on other platforms as well, and it’s more established than Rust, so talking about data structures provided by the C++ standard library is going to meet with less suspicion than doing the same with the Rust standard library.

Dynamic arrays

Dynamic arrays are arrays which can change their length. A typical implementation of this data structure, such as std::vector, behaves like a normal array when it isn’t being modified, meaning that it holds its elements in a contiguous memory buffer. The $N$th element (starting from 0) starts $N \times M$ bytes after the start of the 0th element, where each element is $M$ bytes long. The length (reported by size()) and capacity (capacity()) are stored on the side. The length is the number of elements currently stored, while the capacity is the number of elements std::vector is capable of storing in the current allocation. $\mathit{Length} \le \mathit{Capacity}$.

To store its elements, std::vector asks the allocator for a buffer big enough to store all elements. When you append an element to the end using push_back(), std::vector checks whether the last allocation has enough extra space to hold another element between the current last element and the end of the allocation. In other words, it checks whether $\mathit{Length} + 1 \le \mathit{Capacity}$. If yes, then it puts the new element right after the current last element and updates the length by adding 1 to it.

If however $\mathit{Length} = \mathit{Capacity}$, then there is no more space in the current allocation. In that case, std::vector asks the allocator for another buffer, this time of size $G \times \mathit{Capacity}$. $G$ is called the growth factor. libstdc++ and libc++ use $G = 2$. MSVC uses $G = \frac{3}{2}$.

After allocating the new, bigger buffer, the elements from the old buffer are copied to the new one and the new element is put after them. The old buffer is freed, the length is incremented by 1, and the capacity is set to the size of the newly-allocated buffer.
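
In code, the growth path looks roughly like this. A simplified sketch for a vector of ints with $G = 2$, not the actual libstdc++/libc++/MSVC implementation:

#include <cstddef>
#include <cstdlib>
#include <cstring>

// Simplified sketch of push_back()'s growth logic. Real implementations
// handle allocation failure, custom allocators and non-trivial element types.
struct IntVec {
    int* data = nullptr;
    size_t len = 0;
    size_t cap = 0;

    void push_back(int v) {
        if (len == cap) {
            // Out of space: allocate a G-times bigger buffer (G = 2 here),
            // copy the old elements over and free the old buffer.
            size_t new_cap = cap == 0 ? 1 : 2 * cap;
            int* new_data = static_cast<int*>(std::malloc(new_cap * sizeof(int)));
            std::memcpy(new_data, data, len * sizeof(int));
            std::free(data);
            data = new_data;
            cap = new_cap;
        }
        data[len] = v;  // put the new element right after the current last one
        len += 1;
    }
};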

Performance Cost

Time

If the std::vector had $N$ elements before the whole operation, all of that required $O(N)$ steps (due to copying). However, this copying doesn’t happen at every push_back(); it happens only at the push_back() calls where the length has reached the capacity, i.e. roughly every time the length crosses a power of $G$. In all other cases the method takes constant time.

If we consider a sequence of $n$ push_back() calls, then they are going to include $\lceil\log_{G}n\rceil$ calls which copy the underlying buffer, and $n - \lceil\log_{G}n\rceil$ calls which do not. All the copies taken together are going to take $1 + G + G^2 + \cdots + G^{\lceil\log_{G}n\rceil - 1} = \frac{G^{\lceil\log_{G}n\rceil} - 1}{G - 1} \approx \frac{n - 1}{G - 1}$ steps. The copy-less push_back() calls taken together are going to take $n - \lceil\log_{G}n\rceil$ steps. Adding it all up, we get that all the push_back() calls together cost us $O(\frac{n - 1}{G - 1} + n - \lceil\log_{G}n\rceil)$.

There are $n$ push_back() calls, so on average push_back() takes constant time:

$$O\left(\frac{\frac{n - 1}{G - 1} + n - \lceil\log_{G}n\rceil}{n}\right) = O\left(\frac{\frac{n - 1}{G - 1} + n}{n}\right) = O\left(\frac{n - 1 + n}{n}\right) = O\left(\frac{2n}{n}\right) = O(2) = O(1)$$

That’s why it’s sometimes said that std::vector::push_back() takes constant time amortised – the cost of every expensive push_back(), which copies $n$ elements, is amortised over the $n$ previous push_back()s, each of which took constant time.
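
As a concrete example, take $G = 2$ and $n = 1024$ appends starting from capacity 1: copies happen when the buffer grows past capacities $1, 2, 4, \ldots, 512$, so the total number of elements copied is $1 + 2 + 4 + \cdots + 512 = 1023 \approx n$, i.e. about one extra copy per element on average, which is exactly the constant amortised overhead.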

However, the latency of push_back() is pretty bad, as demonstrated by the measurements later in this note.

Space

Between allocating the new buffer and discarding the old one after copying it, the two buffers together occupied $(N + G \times N) \times M = (G+1) \times N \times M$ bytes of memory, which means that in order to get an $(N+1)$-element std::vector, we need almost $G+1$ times as much memory as is needed for our elements. That’s 3 times for libstdc++ and libc++, and 2.5 times for MSVC. It’s a short time window, so most of the time it isn’t terrible, but sometimes we may not have this much memory available to us.

After push_back() finishes, we’re using $G \times N \times M$ bytes to store our $(N+1) \times M$ bytes of elements, which gives us the lower bound for space utilisation. For $G=\frac{3}{2}$, the space utilisation of std::vector is in the range $(\frac{2}{3}, 1]$. For $G=2$, it is in the range $(\frac{1}{2}, 1]$.
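
As a concrete example with $G = 2$: appending the 1025th element to a full 1024-element std::vector grows the capacity to 2048, so right after that call the utilisation is $\frac{1025}{2048} \approx 51\%$, and it climbs back towards 100% as subsequent elements fill the new buffer.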

Not terrible, not great.

Complexity cost

Due to all the copying, our program cannot depend on any elements staying in one place for too long. If we want to refer to a string stored in a std::vector, we cannot simply grab a reference/pointer/string_view to it, as that could result in a dangling reference/pointer after a push_back() call.
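
A minimal illustration of the problem (a sketch, not code from the repository):

#include <string>
#include <vector>

int main() {
    std::vector<std::string> strings;
    strings.push_back("hello");

    // Reference to an element of the vector...
    std::string& first = strings[0];

    // ...which is left dangling as soon as a push_back() reallocates the
    // buffer and moves the elements into it.
    for (int i = 0; i < 1000; i++) {
        strings.push_back("more");
    }

    // Undefined behaviour: `first` now refers to freed memory.
    // std::cout << first << "\n";
    (void)first;
}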

We could work around this problem by grabbing the index of the string in that std::vector. However, that would mean that functions operating on that string would need to operate on such indices instead of references/pointers to strings, which in turn would mean that they would either need to take the std::vector as another argument, or that they would work only with strings from a selected std::vector. That would make using those functions in different contexts difficult. E.g. we wouldn’t be able to just pass a string directly to those functions in tests (normal, or transient).

We also wouldn’t be able to easily refer to substrings without copying them or creating a substring struct of the form $(\mathit{IndexOfString}, \mathit{StartPos}, \mathit{EndPos}, \mathit{Vec})$, which is a lot more things to keep track of than just a plain std::string_view, i.e. a $(\mathit{PointerToStart}, \mathit{Length})$. And again, the functions operating on those structs would be confined to operating only on those special objects instead of arbitrary substrings.

Surely, there must be a way to make it all simpler and more efficient?

Virtual memory

Let’s take a step back and ponder what it means to allocate memory on a typical operating system these days. Essentially, we want to get access to a region of RAM. That access is controlled by the OS and the CPU using a memory management unit (MMU).

These components logically divide memory into pages. A page usually spans 4 kiB of memory. To address every byte in such a page we need $\log_{2}4096 = 12$ bits, so we can treat the upper $64 - 12 = 52$ bits as the index of the page itself. If we have an address and want to know the index of the page holding the byte under that address, we only need to perform a logical shift right on the address by 12 bits.
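
For instance, assuming 4 kiB pages:

#include <cstdint>

// Index of the page holding the byte at `addr`, assuming 4 kiB (2^12 B) pages:
// drop the low 12 bits with a logical shift right.
constexpr uint64_t PageIndex(uint64_t addr) {
    return addr >> 12;
}

static_assert(PageIndex(0x0000) == 0);
static_assert(PageIndex(0x0FFF) == 0);  // last byte of page 0
static_assert(PageIndex(0x1000) == 1);  // first byte of page 1
static_assert(PageIndex(0x7FFFF123) == 0x7FFFF);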

The operating system manages a page table, which conceptually is a hash table from page indices to page indices and some minimal metadata, such as permissions. (Due to space constraints on 64-bit systems, on most systems the hash table is actually nested.)

When a process reads from or writes to memory at an address, the CPU fetches the hash table entry for that address, checks whether the process has appropriate permissions and accesses the memory with the page index replaced with the one stored in the value fetched from the page table. If the process doesn’t have the appropriate permissions however, the CPU will trigger a hardware page fault, which is going to be caught by the OS and redirected to the process as a segfault (SIGSEGV on POSIX systems1).

This arrangement creates two address spaces: virtual and physical. The virtual address space consists of addresses the way processes see them. The physical address space consists of addresses directly referring to physical memory (RAM). Each of them is divided into pages. The virtual address space is divided into virtual pages, while the physical address space into physical pages. The page table maintains a mapping from virtual to physical pages. Each process lives in a separate virtual address space.

Reserving vs committing memory

When we write a program and want to be able to access a piece of memory, we need to first ask the OS to update the page table. To create a new mapping, we need to call VirtualAlloc() on Windows, or mmap() on POSIX. Both of them accept an argument, hinting which virtual address the new mapping should start at. If this argument is null, the OS picks an address of its own choosing. Those functions are what’s used underneath by malloc() from the C standard library.

Looking at all that, maybe our dynamic array could allocate memory page after page, contiguously? Each time we want to append a new element, if we used up all the previously-allocated space, we’d just call VirtualAlloc()/mmap() to give us another virtual page directly after our existing allocation and continue normally.

That sounds almost like a good plan, but there is no guarantee that the virtual page right after our allocation isn’t already mapped. It could be mapped before our process even started execution, or we could have mapped it ourselves in a different place in code.

What we need is a way to tell the OS that a certain region of virtual address space shouldn’t be mapped, unless we specifically ask for it (using the address and length arguments of VirtualAlloc()/mmap()). This way we can ensure that when we need another virtual page mapped, it is available to us. And when we don’t need it, we don’t use up scarce physical memory.

VirtualAlloc() allows us to do just that. It separates reserving memory from committing it. Reserving memory is exactly what I described in the last paragraph: telling the OS not to map virtual pages in a given address range, unless we specifically ask for it. Committing memory is the act of mapping virtual pages to physical pages. On Windows committed memory is counted towards the Working Set Size (WSS) of a process, which is what’s presented to the user in the Task Manager as the amount of memory a process consumes.

While physical memory is scarce, virtual memory on 64-bit systems is abundant. On amd64 systems we typically have about 48 bits of virtual address space available, which amounts to $\frac{1}{4}$ PiB. So while my PC has only 16 GiB of RAM, I can reserve about 16,384 times as much virtual memory.

Lazy Overcommit

On POSIX systems, such as Mac, FreeBSD and Linux, which I’m going to be using here, there is no explicit way to differentiate reserving memory from committing it. However, those systems in a normal configuration use lazy overcommit, i.e. when a process asks for memory to be mapped, the operating system doesn’t update the page table right away. Instead it tells the process that the mapping succeeded, and when the process tries to access one of the virtual pages which it was led to believe is mapped, the CPU generates a page fault, the OS catches it, updates the page table and resumes execution of the process from the instruction which triggered the page fault, as if nothing happened. That’s why it’s called lazy.

In that setup, memory allocations done with mmap() almost never fail (which would allow processes to handle the error condition), even when the sum of memory requested with mmap() is more than is available in the system (that’s why it’s called overcommit). Instead the operating system waits until after all allocations are made and processes actually try to use the allocated memory.
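
A small experiment that makes the laziness visible (a sketch assuming a typical overcommit configuration; it mirrors the flags used later in the POSIX platform layer):

#include <cstddef>
#include <cstdio>
#include <sys/mman.h>

int main() {
    // Ask for 1 TiB: far more than the 16 GiB of RAM in my machines.
    size_t len = 1ULL << 40;

    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_NORESERVE
    flags |= MAP_NORESERVE;  // opt into lazy commit on Linux even in heuristic mode
#endif

    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED) {
        std::perror("mmap");
        return 1;
    }

    // Touch only the first 16 pages: only those get backed by physical
    // memory; the rest of the terabyte costs (almost) nothing.
    char* bytes = static_cast<char*>(p);
    for (size_t i = 0; i < 16 * 4096; i += 4096) {
        bytes[i] = 1;
    }

    std::puts("mapped 1 TiB, touched 16 pages, still alive");
    munmap(p, len);
}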

When it turns out that the processes weren’t joking and they really attempt to use all that memory, which is more than the physical memory available, the operating system starts an out-of-memory (OOM) killer which picks a process according to its own fancy and terminates it, thereby sacrificing it on the altar of overcommit. It may not be the process responsible for excessive memory consumption. It may be a process critical to the security of your system, like your screen locker. This is repeated, until the amount of memory that surviving processes try to access doesn’t exceed the amount of physical memory available. This is stupid.

However, in the absence of a proper reserve and commit API, it lets us emulate reserve and commit using mprotect() calls, preventing us from accessing pages of memory we didn’t explicitly “commit” by disabling all permissions on those pages. When we want to “commit” pages, we change their permissions to allow us to access them and let lazy overcommit take care of the fact that the initial mmap() call requested more memory than we may need and possibly than is even available in the system.

As long as the system administrator didn’t disable overcommit, that is. But given that overcommit is expected on POSIX, applications in the POSIX world in general aren’t written with the expectation that memory allocations can fail. So when overcommit is disabled, things are going to break anyway.

The code

I’ll go over key parts of the code here, but omit other parts. Complete code is available in its own git repository. If you find bugs, I do accept patches. README describes how to send them.

I’m going to first present the platform-specific parts of code, and then follow them with platform-independent code. All platform-specific code is in namespace pal (for platform abstraction layer).

Windows

Apparently that’s how you declare the entry-point function on Windows:

#define DECL_MAIN int __cdecl wmain()

For benchmarks I’ll want to prevent inlining in a couple places to prevent the optimiser from knowing the value of a couple arguments passed to benchmarking functions, which could influence the results.

#define NOINLINE __declspec(noinline)

Theoretically we should fetch page size with GetSystemInfo(), but I want to keep the example code simple, and on Windows it’s almost always going to be 4096 anyway:

static constexpr size_t kPageSize = 0x1000;

Reserving memory:

inline void* MemReserve(size_t len) {
    void* buf = VirtualAlloc(nullptr, len, MEM_RESERVE, PAGE_NOACCESS);
    if (buf == nullptr) {
        DWORD last_error = GetLastError();
        std::cerr << "Reserving memory with VirtualAlloc failed. Error code: " << last_error << std::endl;
        return nullptr;
    }

    return buf;
}

Committing memory:

inline bool MemCommit(void* p, size_t len) {
    void* committed = VirtualAlloc(p, len, MEM_COMMIT, PAGE_READWRITE);
    if (committed == nullptr) {
        DWORD last_error = GetLastError();
        std::cerr << "Committing memory with VirtualAlloc failed. Error code: " << last_error << std::endl;
        return false;
    }

    return true;
}

Freeing memory:

inline void MemFree(void* p, size_t /*len*/) {
    VirtualFree(p, 0, MEM_RELEASE);
}

That’s all we need for memory allocation. Now, a couple functions for the measurements done in benchmarks. First, peak memory use (working set size):

inline size_t GetPeakMemoryUse() {
    PROCESS_MEMORY_COUNTERS mem;
    bool res = GetProcessMemoryInfo(GetCurrentProcess(), &mem, sizeof mem);
    if (!res) {
        DWORD last_error = GetLastError();
        std::cerr << "GetProcessMemoryInfo failed. Error code: " << last_error << std::endl;
        return 0;
    }

    return mem.PeakWorkingSetSize;
}

And the time the process spent on-CPU, adding time spent in user-mode and in kernel-mode, in microseconds:

inline uint64_t ConvertFiletimeToMicroseconds(FILETIME t) {
    ULARGE_INTEGER li;
    li.LowPart = t.dwLowDateTime;
    li.HighPart = t.dwHighDateTime;
    // FILETIME counts 100-nanosecond intervals, so dividing by 10 yields microseconds.
    return static_cast<uint64_t>(li.QuadPart) / 10;
}

inline uint64_t GetCpuTime() {
    FILETIME a, b, c, d;
    if (GetProcessTimes(GetCurrentProcess(), &a, &b, &c, &d) != 0) {
        auto kernel_time = ConvertFiletimeToMicroseconds(c);
        auto user_time = ConvertFiletimeToMicroseconds(d);
        return user_time + kernel_time;
    } else {
        std::cerr << "failed to get CPU time\n";
        assert(false);
        return 0;
    }
}

POSIX

Most of the code is common to all POSIX platforms, with a couple of tiny differences handled with the preprocessor. Normally I’d split them into separate files, but that felt like more ceremony than is necessary for this little post.

main:

#define DECL_MAIN int main()

NOINLINE:

#define NOINLINE __attribute__((noinline))

I’m going to be working with FreeBSD and Linux on amd64, which use 4kiB pages, and Mac M1 which uses 16kiB pages (as all ARM-based platforms from Apple do):

static constexpr size_t kPageSize =
    #if defined(__APPLE__) && defined(__aarch64__)
        0x4000
    #else
        0x1000
    #endif
    ;

Reserving memory. We call mmap() with PROT_NONE to disable all permissions, and add the MAP_NORESERVE flag when it’s available (in practice that means Linux) to opt into overcommit even when it’s in heuristic mode:

inline void* MemReserve(size_t len) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;

    #ifdef MAP_NORESERVE
        // On Linux, guarantee lazy commit even if /proc/sys/vm/overcommit_memory is set to `heuristic` (0).
        flags |= MAP_NORESERVE;
    #endif

    void* buf = mmap(nullptr, len, PROT_NONE, flags, -1, 0);
    if (buf == MAP_FAILED) {
        int saved_errno = errno;
        std::cerr << "Reserving memory with mmap failed. errno: " << saved_errno << std::endl;
        return nullptr;
    }

    return buf;
}

Commit by adding read and write permissions:

inline bool MemCommit(void* p, size_t len) {
    if (mprotect(p, len, PROT_READ | PROT_WRITE) != 0) {
        int saved_errno = errno;
        std::cerr << "Committing memory with mprotect failed. Exit code: " << saved_errno << std::endl;
        return false;
    }

    return true;
}

Free:

inline void MemFree(void* p, size_t len) {
    if (munmap(p, len) != 0) {
        int saved_errno = errno;
        std::cerr << "Failed to unmap memory. Exit code: " << saved_errno << std::endl;
    }
}

On POSIX platforms Resident Set Size (RSS) is what’s interpreted as “memory use”. To measure that, we’re going to use getrusage() and here is where POSIX lets us down. Everyone has this function, but everyone interprets the meaning of the returned ru_maxrss differently. On FreeBSD and Linux it’s in kilobytes. On Mac it’s in bytes. And curiously, on illumos it’s in pages. The similarities are sufficient to lull you into a sense that everything is working fine when porting: everything has the same names, everything compiles, but then the interpretation of results is completely different. The only reason I noticed it is because suddenly results differed by several orders of magnitude between Mac and FreeBSD/Linux. Others may not be this fortunate. Reading the Linux and FreeBSD man pages, it wasn’t clear to me whether by kilobytes they meant 1024 bytes or 1000 bytes. I settled for 1000:

inline size_t GetPeakMemoryUse() {
    rusage r_usage;
    if (getrusage(RUSAGE_SELF, &r_usage) != 0) {
        int saved_errno = errno;
        std::cerr << "Failed to fetch memory use. Exit code: " << saved_errno << std::endl;
        return 0;
    }

    // On Mac, ru_maxrss is in bytes. On Linux and FreeBSD it's in kilobytes. So much for POSIX.
    #ifdef __APPLE__
    size_t multiplier = 1;
    #else
    size_t multiplier = 1000;
    #endif

    return r_usage.ru_maxrss * multiplier;
}

Measuring CPU time is done the same way everywhere:

inline uint64_t ConvertTimevalToMicroseconds(timeval t) {
    return static_cast<uint64_t>(t.tv_sec) * 1000000ULL + static_cast<uint64_t>(t.tv_usec);
}

inline uint64_t GetCpuTime() {
    rusage r_usage;
    if (getrusage(RUSAGE_SELF, &r_usage) != 0) {
        int saved_errno = errno;
        std::cerr << "Failed to fetch memory use. Exit code: " << saved_errno << std::endl;
        return 0;
    }

    auto user_time = ConvertTimevalToMicroseconds(r_usage.ru_utime);
    auto kernel_time = ConvertTimevalToMicroseconds(r_usage.ru_stime);
    return user_time + kernel_time;
}

Platform-independent part

For simplicity, let’s first write a basic arena of bytes, so we don’t need to trouble ourselves with templates for now.

Unlike std::vector, we’re going to grow the allocation linearly, not exponentially, which should give us better space efficiency. Let’s settle on the number of pages we’re going to add each time we run out of space in our allocation. I say 1 is a pretty good start:

static constexpr size_t kBytesPerCommit = pal::kPageSize;

We also need a way to calculate the number of pages needed to hold the bytes in our arena, multiplied by the number of bytes in a page, since VirtualAlloc() accepts a length in bytes, not pages. We calculate it by rounding the number of bytes we need to store up to the nearest page boundary. To do that, we perform a ceiling division by kBytesPerCommit, followed by a multiplication by kBytesPerCommit:

size_t GetCommittedBytes(size_t used_bytes) {
    return ((used_bytes + kBytesPerCommit - 1) / kBytesPerCommit) * kBytesPerCommit;
}
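
For example, with 4 kiB pages, GetCommittedBytes(1) and GetCommittedBytes(4096) both return 4096, while GetCommittedBytes(4097) returns 8192.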

Now the struct for arena bookkeeping, which should be pretty self-explanatory at this point:

struct Arena {
    char* base = nullptr;
    size_t cap = 0;
    size_t len = 0;
    
    // ... methods ...
};

RAII boilerplate:

    Arena() = delete;
    Arena(const Arena&) = delete;
    Arena(Arena&& other) {
        base = other.base;
        cap = other.cap;
        len = other.len;

        other.base = nullptr;
        other.cap = 0;
        other.len = 0;
    }

Main constructor, followed by the destructor:

    explicit Arena(size_t max_len) {
        assert(max_len != 0);

        cap = GetCommittedBytes(max_len);
        len = 0;

        void* buf = pal::MemReserve(cap);
        if (buf == nullptr) {
            throw std::bad_alloc();
        }

        base = static_cast<char*>(buf);
    }

    ~Arena() {
        if (base != nullptr) {
            pal::MemFree(base, cap);
        }
    }

I apologise for exceptions. However, I didn’t want to use extra templates like std::optional in this POC, which would get in the way of demonstration.

Our constructor reserves the amount of memory requested in its argument. It doesn’t commit anything. Destructor frees the whole arena.

Growing logic:

    bool Grow(size_t additional_len) {
        assert(len+additional_len <= cap);
        size_t old_committed = GetCommittedBytes(len);
        size_t new_committed = GetCommittedBytes(len+additional_len);
        size_t delta = new_committed - old_committed;
        if (delta == 0) {
            len += additional_len;
            return true;
        }

        if (!pal::MemCommit(base + old_committed, delta)) {
            return false;
        }

        len += additional_len;
        return true;
    }

If the additional bytes fit in the already-committed pages, we only update the length and report success. Otherwise, we attempt to commit more pages (the minimal number needed to fit the additional bytes).

Our POC arena is complete. Now let’s move on to the dynamic array of elements of type T:

template <typename T>
class DynArray {
    Arena arena;
    
public:
    // ... methods ...
};

No surprises here. RAII boilerplate:

    // Redundant, but useful as documentation.
    DynArray() = delete;
    DynArray(const DynArray<T>& other) = delete;
    DynArray(DynArray&& other) = default;

Copy constructor is left as an exercise for the reader.

Main constructor just forwards a scaled argument to the arena. For more robust code, we should add a check for arithmetic overflow, but for simplicity of examples let’s leave it out:

    explicit DynArray(size_t max_len)
    : arena(max_len * sizeof(T)) {}
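
For reference, such a check could look roughly like this. CheckedScale is a hypothetical helper, not something from the repository:

    // Hypothetical overflow-checked variant of the constructor above.
    // Needs SIZE_MAX from <cstdint>.
    static size_t CheckedScale(size_t max_len) {
        if (max_len > SIZE_MAX / sizeof(T)) {
            throw std::bad_alloc();
        }
        return max_len * sizeof(T);
    }

    explicit DynArray(size_t max_len)
    : arena(CheckedScale(max_len)) {}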

Support for range loops:

    T* begin() {
        return reinterpret_cast<T*>(arena.base);
    }

    T* end() {
        return reinterpret_cast<T*>(arena.base + arena.len);
    }

Methods for querying the length and indexing with square brackets would also be nice:

    size_t Len() const {
        return arena.len / sizeof(T);
    }

    T& operator[](size_t i) {
        assert(i < Len());
        return begin()[i];
    }

And the meat of the structure:

    bool Append(T v) {
        auto* dst = end();

        if (!arena.Grow(sizeof(T))) {
            return false;
        }

        // Construct the element in place in the freshly-grown space (placement
        // new, from <new>), rather than assigning to memory in which no T has
        // been constructed yet.
        new (dst) T(std::move(v));
        return true;
    }

Should be pretty obvious.

Theoretically, we should make sure that we put T at addresses aligned according to the alignment requirements of T. However, the base address of the arena, which is returned by VirtualAlloc(), will always be aligned to a page boundary. Alignment needs to be a power of 2. Additionally, in practice I don’t think there are any types with alignment bigger than page size2, which is also a power of 2. It follows that alignment always divides page size equally, therefore we satisfy alignment requirements with no additional effort.
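
If the arena were ever used for heterogeneous objects rather than a homogeneous array of T, the usual fix is to round the current offset up to the required alignment before placing each object. A sketch of such a helper, not needed by DynArray and not part of the repository:

#include <cstddef>

// Round `offset` up to the next multiple of `align` (align must be a power of 2).
constexpr size_t AlignUp(size_t offset, size_t align) {
    return (offset + align - 1) & ~(align - 1);
}

static_assert(AlignUp(0, 8) == 0);
static_assert(AlignUp(1, 8) == 8);
static_assert(AlignUp(13, 4) == 16);
static_assert(AlignUp(4096, 16) == 4096);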

Demonstrative tests

Now for some rudimentary tests. First, the arena:

    // Reserving 16GiB.
    auto a = Arena(16ULL * 1024ULL * 1024ULL * 1024ULL);

    // Committing one page to make space for 4 bytes (with some headroom).
    bool committed = a.Grow(4);
    assert(committed);

    a.base[0] = 'a';
    a.base[1] = 'b';
    a.base[2] = 'c';
    a.base[3] = '\0';

    std::cout << a.base << std::endl;

    // Past the declared length, but still in committed memory.
    // No segfault, but in real code it's probably a bug.
    a.base[5] = 'x';
    a.base[kBytesPerCommit-1] = 'y';

    // Reserved, but not committed memory. Segfault.
    // a.base[kBytesPerCommit] = 'x';

That code should print “abc” and exit without crashing. We can see that unfortunately we can access elements beyond the stated length of the arena, in this case elements with indices 5 and kBytesPerCommit-1, because they’re placed on the same page of memory which is used to store valid elements. However, we cannot reach beyond this page, which is demonstrated by the last line. After uncommenting it and running the program, it’s going to crash due to a segfault. The uncommitted pages at the end serve as natural guard pages, providing a margin of safety, even in situations where for some reason it’s been decided that bounds-checking should be disabled (as is the case in most C++ shops).

And the dynamic array:

    static constexpr size_t N = 100000;
    auto a = DynArray<int>(N);

    for (size_t i = 0; i < 16; i++) {
        assert(a.Append(static_cast<int>(i)));
    }

    for (size_t i = 0; i < 12; i++) {
        std::cout << a[i] << "\n";
    }

    for (size_t i = 16; i < N; i++) {
        a.Append(static_cast<int>(i));
    }

    size_t real_cap = GetCommittedBytes(N * sizeof(int)) / sizeof(int);

    for (size_t i = N; i < real_cap; i++) {
        a.Append(static_cast<int>(i));
    }

    // Exceeded capacity. Assertion failure.
    // a.Append(0);

That should print:

0
1
2
3
4
5
6
7
8
9
10
11

Uncommenting the last line would cause an assertion failure, this time because we’ve run out of reserved space (due to the deliberately smaller capacity declared at the beginning).

Performance cost

When DynArray has $n$ elements, Append() takes $O(1)$ time and doesn’t have any temporary spikes in space use, meaning that the amount of space used during Append() never exceeds the amount of space used after it.

An $n$-element DynArray uses $\lceil\frac{n\times M}{4096}\rceil\times 4096$ bytes for the buffer plus a constant amount of space for the length, capacity and pointer to the buffer. This amounts to $\frac{n\times M}{\lceil\frac{n\times M}{4096}\rceil\times 4096} \approx \frac{n\times M}{\frac{n\times M}{4096}\times 4096} = \frac{n\times M}{n\times M} = 1$ space utilisation.

Pretty good. As $n$ grows, we’re approaching perfect space utilisation, and the non-utilised space never exceeds 4095 bytes, whereas in the case of std::vector it grows exponentially.

Benchmarks

I’ve run all Windows, FreeBSD and Linux tests on the exact same machine: a PC with an AMD Ryzen 7 2700X CPU and 16 GiB of DDR4 RAM. For the Mac tests I used my 2020 Mac Mini M1 with 16 GiB of RAM. It wasn’t my objective to compare compilers with one another, so to keep things shorter I used only one compiler on each platform, the one that was closest to the official one. Software versions:

  1. Windows 10 Pro with MSVC 19.32.31332 running from x64 Native Tools Command Prompt for VS 2022. Compilation script: build.bat.
  2. MacOS Monterey 12.6.2 with Apple clang version 13.0.0 (clang-1300.0.29.30), the version shipped with Xcode 13.2.1. Compilation script: build-clang.sh.
  3. FreeBSD 13.1-p3 (with 13.1-p5 userland) with Clang 13.0.0 (FreeBSD clang version 13.0.0), which is shipped with the base system. Compilation script: build-clang.sh.
  4. Debian 11.6 (5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux) with GCC (g++ (Debian 10.2.1-6) 10.2.1 20210110) installed with APT through the g++ package. Compilation script: build-gcc.sh.
Windows
Memory

First we’ll measure peak memory use when constructing arrays of sizes $10^3$, $10^6$, $10^9$, and $10^{10}$. For details, see mem_dynarray.cpp and mem_stdvec.cpp in the repo.

Reminder: I don’t call std::vector::reserve(), because that method doesn’t reserve memory according to the Windows (and our) nomenclature; it actually commits memory. And the whole motivation for this page is finding a memory management strategy for situations where I don’t have accurate and precise predictions of how much memory I’m going to need. Besides, if I had them, instead of std::vector I’d use a structure which allocates only once and doesn’t have all that resizing logic.

Array size   Peak WSS, DynArray (bytes)   Peak WSS, std::vector (bytes)
10^3         3,088,384                    2,551,808
10^6         7,086,080                    10,190,848
10^9         4,002,926,592                5,521,395,712
10^10        12,859,588,608               std::bad_alloc

DynArray uses more memory when the whole array can fit in a space smaller than a single memory page, but for bigger arrays it uses less space, because it grows linearly, not exponentially as std::vector does. In fact, as the last row demonstrates, std::vector failed to scale to $10^{10}$ elements on my machine, reporting a memory allocation error, whereas DynArray continued working.

Throughput

Now let’s see how many elements we’re able to append in a given time period. We take the total time it takes to construct the whole array and divide the number of elements by it. Each benchmark is repeated for different sizes of the resulting dynamic array. Results are included for builds with different levels of optimisation: no optimisation, /O1 and /O2.

Aside from ints, we also use std::unique_ptrs to demonstrate how those dynamic arrays scale when their elements have non-trivial (but still almost-trivial) move constructors. For details, see bench.cpp in the repo.

No optimisation
Length  Element          Array        Total time (seconds)  Throughput (Hz)
10^3    int              DynArray     0                     ∞
10^3    int              std::vector  0                     ∞
10^3    std::unique_ptr  DynArray     0                     ∞
10^3    std::unique_ptr  std::vector  0                     ∞
10^6    int              DynArray     0                     ∞
10^6    int              std::vector  0.031250              32,000,000
10^6    std::unique_ptr  DynArray     0.078125              12,800,000
10^6    std::unique_ptr  std::vector  0.140625              7,111,111
10^9    int              DynArray     14.828125             67,439,409
10^9    int              std::vector  23.078125             43,331,076
10^9    std::unique_ptr  DynArray     84.656250             11,812,476
10^9    std::unique_ptr  std::vector  297.937500            3,356,408
10^10   int              DynArray     174.093750            57,440,315
10^10   int              std::vector  std::bad_alloc
10^10   std::unique_ptr  DynArray     std::bad_alloc
10^10   std::unique_ptr  std::vector  std::bad_alloc
Optimisation /O1
Length  Element          Array        Total time (seconds)  Throughput (Hz)
10^3    int              DynArray     0                     ∞
10^3    int              std::vector  0                     ∞
10^3    std::unique_ptr  DynArray     0                     ∞
10^3    std::unique_ptr  std::vector  0                     ∞
10^6    int              DynArray     0                     ∞
10^6    int              std::vector  0                     ∞
10^6    std::unique_ptr  DynArray     0.046875              21,333,333
10^6    std::unique_ptr  std::vector  0.031250              32,000,000
10^9    int              DynArray     4.328125              231,046,931
10^9    int              std::vector  2.828125              353,591,160
10^9    std::unique_ptr  DynArray     62.359375             16,036,081
10^9    std::unique_ptr  std::vector  213.609375            4,681,442
10^10   int              DynArray     59.812500             167,189,132
10^10   int              std::vector  std::bad_alloc
10^10   std::unique_ptr  DynArray     std::bad_alloc
10^10   std::unique_ptr  std::vector  std::bad_alloc
Optimisation /O2
Length  Element          Array        Total time (seconds)  Throughput (Hz)
10^3    int              DynArray     0                     ∞
10^3    int              std::vector  0                     ∞
10^3    std::unique_ptr  DynArray     0.015625              64,000
10^3    std::unique_ptr  std::vector  0                     ∞
10^6    int              DynArray     0                     ∞
10^6    int              std::vector  0                     ∞
10^6    std::unique_ptr  DynArray     0.046875              21,333,333
10^6    std::unique_ptr  std::vector  0.046875              21,333,333
10^9    int              DynArray     4.406250              226,950,354
10^9    int              std::vector  2.765625              361,581,920
10^9    std::unique_ptr  DynArray     49.187500             20,330,368
10^9    std::unique_ptr  std::vector  203.906250            4,904,214
10^10   int              DynArray     56.687500             176,405,733
10^10   int              std::vector  std::bad_alloc
10^10   std::unique_ptr  DynArray     std::bad_alloc
10^10   std::unique_ptr  std::vector  std::bad_alloc

With no optimisations enabled, DynArray has consistently higher throughput than std::vector, across all controlled parameters. With /O1 and /O2, std::vector has higher throughput for smaller arrays and for big arrays with trivial move constructors; but for large arrays with non-trivial move constructors DynArray has the upper hand. Extrapolating, the time it takes to construct the whole array, relative to its end length, grows much slower in the case of DynArray than in the case of std::vector.

Latency

In the case of latency, we measure the time taken by each append call, save the minimum and maximum values, and aggregate all times into a linear histogram organised into buckets of $[0, 10^{-6})$, $[10^{-6}, 10^{-5})$, $[10^{-5}, 10^{-4})$, $[10^{-4}, 10^{-3})$, $[10^{-3}, 10^{-2})$, $[10^{-2}, 10^{-1})$ and $[10^{-1}, \infty)$ seconds. Bucket columns are labelled with their upper bounds. For details, see bench.cpp in the repo.
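
The bucketing itself can be as simple as comparing each sample against the buckets’ upper bounds. A rough sketch of the idea, not the exact code from bench.cpp:

#include <array>
#include <cstddef>
#include <cstdint>

// Buckets: [0, 1e-6), [1e-6, 1e-5), ..., [1e-2, 1e-1), [1e-1, inf) seconds.
struct LatencyHistogram {
    std::array<uint64_t, 7> buckets{};
    double min = 1e300;
    double max = 0;

    void Record(double seconds) {
        min = seconds < min ? seconds : min;
        max = seconds > max ? seconds : max;

        double upper_bound = 1e-6;
        size_t i = 0;
        while (i < buckets.size() - 1 && seconds >= upper_bound) {
            upper_bound *= 10;
            i++;
        }
        buckets[i]++;  // the last bucket catches everything >= 0.1 s
    }
};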

No optimisation
Length  Element          Array        Min (s)  Max (s)    10^-6          10^-5  10^-4  10^-3  10^-2  10^-1    ∞
10^3    int              DynArray     0        0          1,000          0      0      0      0      0        0
10^3    int              std::vector  0        0          1,000          0      0      0      0      0        0
10^3    std::unique_ptr  DynArray     0        0.015625   999            0      0      0      0      1        0
10^3    std::unique_ptr  std::vector  0        0          1,000          0      0      0      0      0        0
10^6    int              DynArray     0        0.015625   999,983        0      0      0      0      17       0
10^6    int              std::vector  0        0.062500   999,979        0      0      0      0      21       0
10^6    std::unique_ptr  DynArray     0        0.015625   999,983        0      0      0      0      17       0
10^6    std::unique_ptr  std::vector  0        0.031250   999,977        0      0      0      0      23       0
10^9    int              DynArray     0        0.031250   999,987,307    0      0      0      0      12,693   0
10^9    int              std::vector  0        0.468750   999,987,325    0      0      0      0      12,671   4
10^9    std::unique_ptr  DynArray     0        1.796875   999,965,127    0      0      0      0      34,870   3
10^9    std::unique_ptr  std::vector  0        29.484375  999,982,339    0      0      0      0      17,646   15
10^10   int              DynArray     0        0.515625   9,999,869,443  0      0      0      0      130,556  1
10^10   int              std::vector  std::bad_alloc
10^10   std::unique_ptr  DynArray     std::bad_alloc
10^10   std::unique_ptr  std::vector  std::bad_alloc
Optimisation /O1
Length  Element          Array        Min (s)  Max (s)    10^-6          10^-5  10^-4  10^-3  10^-2  10^-1    ∞
10^3    int              DynArray     0        0          1,000          0      0      0      0      0        0
10^3    int              std::vector  0        0          1,000          0      0      0      0      0        0
10^3    std::unique_ptr  DynArray     0        0          1,000          0      0      0      0      0        0
10^3    std::unique_ptr  std::vector  0        0          1,000          0      0      0      0      0        0
10^6    int              DynArray     0        0.015625   999,993        0      0      0      0      7        0
10^6    int              std::vector  0        0.093750   999,983        0      0      0      0      17       0
10^6    std::unique_ptr  DynArray     0        0.015625   999,990        0      0      0      0      10       0
10^6    std::unique_ptr  std::vector  0        0.015625   999,988        0      0      0      0      12       0
10^9    int              DynArray     0        0.031250   999,988,885    0      0      0      0      11,115   0
10^9    int              std::vector  0        0.484375   999,989,172    0      0      0      0      10,824   0
10^9    std::unique_ptr  DynArray     0        0.046875   999,975,686    0      0      0      0      24,314   0
10^9    std::unique_ptr  std::vector  0        14.109375  999,986,756    0      0      0      0      13,236   8
10^10   int              DynArray     0        0.468750   9,999,887,755  0      0      0      0      112,244  1
10^10   int              std::vector  std::bad_alloc
10^10   std::unique_ptr  DynArray     std::bad_alloc
10^10   std::unique_ptr  std::vector  std::bad_alloc
Optimisation /O2
Length  Element          Array        Min (s)  Max (s)    10^-6          10^-5  10^-4  10^-3  10^-2  10^-1    ∞
10^3    int              DynArray     0        0          1,000          0      0      0      0      0        0
10^3    int              std::vector  0        0          1,000          0      0      0      0      0        0
10^3    std::unique_ptr  DynArray     0        0          1,000          0      0      0      0      0        0
10^3    std::unique_ptr  std::vector  0        0          1,000          0      0      0      0      0        0
10^6    int              DynArray     0        0.015625   999,987        0      0      0      0      13       0
10^6    int              std::vector  0        0.093750   999,985        0      0      0      0      15       0
10^6    std::unique_ptr  DynArray     0        0.015625   999,986        0      0      0      0      14       0
10^6    std::unique_ptr  std::vector  0        0.015625   999,988        0      0      0      0      12       0
10^9    int              DynArray     0        0.031250   999,988,917    0      0      0      0      11,083   0
10^9    int              std::vector  0        0.500000   999,989,054    0      0      0      0      10,942   4
10^9    std::unique_ptr  DynArray     0        0.390625   999,973,866    0      0      0      0      26,133   1
10^9    std::unique_ptr  std::vector  0        13.109375  999,984,610    0      0      0      0      15,381   9
10^10   int              DynArray     0        0.093750   9,999,888,781  0      0      0      0      111,219  0
10^10   int              std::vector  std::bad_alloc
10^10   std::unique_ptr  DynArray     std::bad_alloc
10^10   std::unique_ptr  std::vector  std::bad_alloc

Except for the one flake in the unoptimised benchmark for arrays of $10^3$ elements, the maximum latency of DynArray always stayed below the latency of std::vector. In fact, usually it was a couple of times lower, and for big arrays it was orders of magnitude lower; there, if you care about latency at all, the performance of std::vector wasn’t acceptable by any standard.

All latency measurements demonstrate a bimodal distribution. In the case of std::vector it’s no surprise – that’s what we should expect when we have two paths: the common case of no resize and the odd case of a resize. However, in the case of DynArray the bimodal distribution surprised me, as I didn’t see any cause for it in any of the code that I wrote: on the one hand, VirtualAlloc() isn’t called in every Append(), but on the other hand the number of calls to one is linearly proportional to the other, and the numbers do not reflect that, so I decided to dig into it.

I tried to collect hardware performance counters (PMC), but Intel VTune failed to gather any for my AMD CPU. AMD μProf didn’t seem to even have any place to display them. Microsoft’s tracelog.exe also failed to collect anything. The ETL file contained no PMC data. That’s quite a let-down compared to the ease of collecting this data on Linux with perf.

Luckily, DTrace was ported to Windows, so I decided to use that. First I tried to measure how much time VirtualAlloc() takes in my tests, so I wrote the following script, using the syscall DTrace provider:

syscall::NtAllocateVirtualMemory:entry
/execname == "bench2.exe"/
{
    self->ts = timestamp;
}

syscall::NtAllocateVirtualMemory:return
/execname == "bench2.exe"/
{
    @alloc[execname] = quantize((timestamp - self->ts) / 1000);
}

That collects the running times (in microseconds) of the NtAllocateVirtualMemory() syscall into a histogram. I ran bench2.exe in one terminal, while in an Administrator cmd.exe in another window I typed:

"C:\Program Files\DTrace\dtrace.exe" -s virtualallochist.d

That printed:

  bench2.exe
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  25497083
               2 |@                                        765607
               4 |                                         42939
               8 |                                         7796
              16 |                                         2619
              32 |                                         233
              64 |                                         20
             128 |                                         41
             256 |                                         89
             512 |                                         90
            1024 |                                         46
            2048 |                                         4
            4096 |                                         12
            8192 |                                         6
           16384 |                                         4
           32768 |                                         0

16,384 microseconds (0.016384 seconds) isn’t anywhere near explaining the times from my benchmarks, where an append once took as much as 0.390 seconds. I had to dig deeper.

At that point I thought: I’m not calling NtAllocateVirtualMemory() directly, I’m calling VirtualAlloc(). I should measure that. That’s a bit harder, but still doable. VirtualAlloc() is not a syscall of the NT kernel, but a normal function defined in one of DLLs distributed with Windows, so I can’t trace it with the syscall DTrace provider.

However, I can use the pid provider. For tracing with that provider, I needed to know the PID of the executable to trace, not just the execname. One way to get it is to use the -c <cmd> option to dtrace.exe, which starts the <cmd> command and passes its PID to the DTrace script as $target. However, that approach caused DTrace to run out of memory very fast, so I needed to find another way.

I decided to fetch the PID with Powershell and pass it as a positional argument to my DTrace script, which measured VirtualAlloc() and a couple of my own functions to compare:

pid$1::VirtualAlloc:entry
{
    self->ts[probefunc] = timestamp;
}

pid$1::VirtualAlloc:return
{
    @[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}

pid$1::GetProcessTimes:entry
{
    self->ts[probefunc] = timestamp;
}

pid$1::GetProcessTimes:return
{
    @[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}

pid$1::"DynArray<std::unique_ptr<int,std::default_delete<int> > >::Append":entry
{
    self->ts[probefunc] = timestamp;
}

pid$1::"DynArray<std::unique_ptr<int,std::default_delete<int> > >::Append":return
{
    @[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}

pid$1::"DynArray<int>::Append":entry
{
    self->ts[probefunc] = timestamp;
}

pid$1::"DynArray<int>::Append":return
{
    @[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}

When optimisations are enabled, some of those DynArray methods are inlined and eliminated from the executable, so I needed to use bench0.exe, the unoptimised version of the benchmark. After starting bench0.exe in one terminal, I typed the following in an Administrator PowerShell:

& "C:\Program Files\DTrace\dtrace.exe" -s virtualallochist.d (Get-Process -Name "bench0").Id

That made the benchmark run several times longer than normal (~2.5 days compared to ~2 hours) in wall clock time, but CPU time seemed to be completely unaffected. (Actually, during the first attempt Windows Update rebooted my computer after 1.5 days without asking me for permission.)

The results were as follows:

  bench0.exe                                          VirtualAlloc
           value  ------------- Distribution ------------- count
               1 |                                         0
               2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 25247233
               4 |                                         100680
               8 |                                         7843
              16 |                                         509
              32 |                                         16
              64 |                                         5
             128 |                                         21
             256 |                                         71
             512 |                                         4
            1024 |                                         2
            2048 |                                         0
            4096 |                                         0
            8192 |                                         1
           16384 |                                         0
           32768 |                                         1
           65536 |                                         0
          131072 |                                         1
          262144 |                                         1
          524288 |                                         0

  bench0.exe                                          DynArray<std::unique_ptr<int,std::default_delete<int> > >::Append
           value  ------------- Distribution ------------- count
               1 |                                         0
               2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1995992981
               4 |                                         641329
               8 |                                         4288545
              16 |                                         65842
              32 |                                         380
              64 |                                         361
             128 |                                         3862
             256 |                                         6963
             512 |                                         670
            1024 |                                         31
            2048 |                                         8
            4096 |                                         27
            8192 |                                         0
           16384 |                                         0
           32768 |                                         1
           65536 |                                         0

  bench0.exe                                          DynArray<int>::Append
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 21782293396
               2 |                                         155237404
               4 |                                         15543570
               8 |                                         9630471
              16 |                                         117290
              32 |                                         3117
              64 |                                         1737
             128 |                                         26619
             256 |                                         38741
             512 |                                         12741
            1024 |                                         15708
            2048 |                                         6991
            4096 |                                         77
            8192 |                                         24
           16384 |                                         40
           32768 |                                         3
           65536 |                                         0
          131072 |                                         1
          262144 |                                         1
          524288 |                                         0

  bench0.exe                                          GetProcessTimes
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 27825184719
               2 |                                         177795208
               4 |                                         1094430
               8 |                                         3752032
              16 |                                         127442
              32 |                                         3074
              64 |                                         1284
             128 |                                         13774
             256 |                                         35437
             512 |                                         606
            1024 |                                         2
            2048 |                                         1
            4096 |                                         0

Here we can see that the outliers of VirtualAlloc() correspond practically perfectly to the outliers of DynArray<int>::Append(). 262,144 microseconds for 262,144 microseconds. 131,072 microseconds for 131,072 microseconds. The rest of the numbers don’t correspond as perfectly, but they are generally in the right ballpark after dividing the counts by 3 orders of magnitude (VirtualAlloc() isn’t called in every Append()). That to me was sufficient evidence that DynArray::Append() latency really was determined by the latency of VirtualAlloc() and not by something extra that I did.

For the other platforms, I’m only going to comment on the numbers when I see something unexpected in them.

Mac
Memory
Array size   Peak RSS, DynArray (bytes)   Peak RSS, std::vector (bytes)
10^3         1,490,944                    1,474,560
10^6         5,521,408                    9,715,712
10^9         4,001,513,472                4,564,942,848
10^10        10,247,536,640               11,684,904,960
Throughput
No optimisation (-O0)
Length  Element          Array        Total time (seconds)  Throughput (Hz)
10^3    int              DynArray     0.000260              3,846,153
10^3    int              std::vector  0.000089              11,235,955
10^3    std::unique_ptr  DynArray     0.000462              2,164,502
10^3    std::unique_ptr  std::vector  0.000372              2,688,172
10^6    int              DynArray     0.023056              43,372,657
10^6    int              std::vector  0.036736              27,221,254
10^6    std::unique_ptr  DynArray     0.051803              19,303,901
10^6    std::unique_ptr  std::vector  0.093763              10,665,187
10^9    int              DynArray     7.423329              134,710,451
10^9    int              std::vector  19.339444             51,707,794
10^9    std::unique_ptr  DynArray     77.600856             12,886,455
10^9    std::unique_ptr  std::vector  98.452377             10,157,195
10^10   int              DynArray     76.536928             130,655,884
10^10   int              std::vector  245.702541            40,699,619
10^10   std::unique_ptr  DynArray     killed by OOM killer
10^10   std::unique_ptr  std::vector  killed by OOM killer
Optimisation -O1
Length  Element          Array        Total time (seconds)  Throughput (Hz)
10^3    int              DynArray     0.000317              3,154,574
10^3    int              std::vector  0.000021              47,619,047
10^3    std::unique_ptr  DynArray     0.000388              2,577,319
10^3    std::unique_ptr  std::vector  0.000100              10,000,000
10^6    int              DynArray     0.013400              74,626,865
10^6    int              std::vector  0.020791              48,097,734
10^6    std::unique_ptr  DynArray     0.046920              21,312,872
10^6    std::unique_ptr  std::vector  0.045331              22,059,958
10^9    int              DynArray     4.411640              226,673,073
10^9    int              std::vector  8.280048              120,772,246
10^9    std::unique_ptr  DynArray     61.713235             16,203,979
10^9    std::unique_ptr  std::vector  49.259080             20,300,825
10^10   int              DynArray     45.927810             217,733,003
10^10   int              std::vector  107.315495            93,183,188
10^10   std::unique_ptr  DynArray     killed by OOM killer
10^10   std::unique_ptr  std::vector  killed by OOM killer
Optimisation -O2
Length  Element          Array        Total time (seconds)  Throughput (Hz)
10^3    int              DynArray     0.000313              3,194,888
10^3    int              std::vector  0.000006              166,666,666
10^3    std::unique_ptr  DynArray     0.000374              2,673,796
10^3    std::unique_ptr  std::vector  0.000049              20,408,163
10^6    int              DynArray     0.006542              152,858,453
10^6    int              std::vector  0.003067              326,051,516
10^6    std::unique_ptr  DynArray     0.033893              29,504,617
10^6    std::unique_ptr  std::vector  0.018518              54,001,512
10^9    int              DynArray     2.581145              387,424,960
10^9    int              std::vector  1.024011              976,552,009
10^9    std::unique_ptr  DynArray     47.618644             21,000,177
10^9    std::unique_ptr  std::vector  19.480989             51,332,096
10^10   int              DynArray     28.125049             355,554,936
10^10   int              std::vector  34.616087             288,883,027
10^10   std::unique_ptr  DynArray     killed by OOM killer
10^10   std::unique_ptr  std::vector  killed by OOM killer
Latency
No optimisation (-O0)
Length  Element          Array        Min (s)  Max (s)    10^-6          10^-5          10^-4   10^-3  10^-2  10^-1  ∞
10^3    int              DynArray     0        0.000023   736            263            1       0      0      0      0
10^3    int              std::vector  0        0.000042   726            271            3       0      0      0      0
10^3    std::unique_ptr  DynArray     0        0.000004   735            265            0       0      0      0      0
10^3    std::unique_ptr  std::vector  0        0.000018   700            299            1       0      0      0      0
10^6    int              DynArray     0        0.000011   766,875        233,123        2       0      0      0      0
10^6    int              std::vector  0        0.002437   769,699        230,290        6       3      2      0      0
10^6    std::unique_ptr  DynArray     0        0.000077   746,335        253,649        16      0      0      0      0
10^6    std::unique_ptr  std::vector  0        0.016903   739,203        260,770        19      3      4      1      0
10^9    int              DynArray     0        0.000119   778,041,048    221,953,956    4,995   1      0      0      0
10^9    int              std::vector  0        2.372019   773,599,499    226,394,128    6,358   3      3      4      5
10^9    std::unique_ptr  DynArray     0        0.004569   746,486,003    253,494,790    19,085  106    16     0      0
10^9    std::unique_ptr  std::vector  0        18.939011  741,265,650    258,717,702    16,547  85     5      3      8
10^10   int              DynArray     0        0.000128   7,781,041,097  2,218,900,231  58,622  50     0      0      0
10^10   int              std::vector  0        51.349387  7,738,549,626  2,261,396,549  53,792  17     4      3      9
10^10   std::unique_ptr  DynArray     killed by OOM killer
10^10   std::unique_ptr  std::vector  killed by OOM killer
Optimisation -O1
Length  Element          Array        Min (s)  Max (s)    10^-6          10^-5          10^-4   10^-3  10^-2  10^-1  ∞
10^3    int              DynArray     0        0.000012   779            220            1       0      0      0      0
10^3    int              std::vector  0        0.000012   737            260            3       0      0      0      0
10^3    std::unique_ptr  DynArray     0        0.000002   742            258            0       0      0      0      0
10^3    std::unique_ptr  std::vector  0        0.000006   737            263            0       0      0      0      0
10^6    int              DynArray     0        0.000011   758,870        241,126        4       0      0      0      0
10^6    int              std::vector  0        0.000177   758,529        241,453        17      1      0      0      0
10^6    std::unique_ptr  DynArray     0        0.000052   740,348        259,644        8       0      0      0      0
10^6    std::unique_ptr  std::vector  0        0.006409   745,641        254,334        19      3      3      0      0
10^9    int              DynArray     0        0.000042   772,114,172    227,881,241    4,587   0      0      0      0
10^9    int              std::vector  0        0.204094   767,987,634    232,007,898    4,457   2      4      4      1
10^9    std::unique_ptr  DynArray     0        0.002871   748,034,858    251,951,964    13,154  16     8      0      0
10^9    std::unique_ptr  std::vector  0        8.329990   745,391,293    254,599,112    9,572   6      7      3      7
10^10   int              DynArray     0        0.000072   7,720,702,598  2,279,248,850  48,552  0      0      0      0
10^10   int              std::vector  0        16.388545  7,679,523,460  2,320,428,793  47,726  8      4      3      6
10^10   std::unique_ptr  DynArray     killed by OOM killer
10^10   std::unique_ptr  std::vector  killed by OOM killer
Optimisation -O2
Length | Element | Array | Min | Max | 10^-6 | 10^-5 | 10^-4 | 10^-3 | 10^-2 | 10^-1 | ∞
10^3 | int | DynArray | 0 | 0.000 008 | 768 | 232 | 0 | 0 | 0 | 0 | 0
10^3 | int | std::vector | 0 | 0.000 054 | 758 | 237 | 5 | 0 | 0 | 0 | 0
10^3 | std::unique_ptr | DynArray | 0 | 0.000 003 | 748 | 252 | 0 | 0 | 0 | 0 | 0
10^3 | std::unique_ptr | std::vector | 0 | 0.000 006 | 741 | 259 | 0 | 0 | 0 | 0 | 0
10^6 | int | DynArray | 0 | 0.000 012 | 763 508 | 236 491 | 1 | 0 | 0 | 0 | 0
10^6 | int | std::vector | 0 | 0.000 185 | 765 899 | 234 092 | 8 | 1 | 0 | 0 | 0
10^6 | std::unique_ptr | DynArray | 0 | 0.000 074 | 745 131 | 254 864 | 5 | 0 | 0 | 0 | 0
10^6 | std::unique_ptr | std::vector | 0 | 0.000 895 | 750 934 | 249 047 | 15 | 4 | 0 | 0 | 0
10^9 | int | DynArray | 0 | 0.000 054 | 769 831 147 | 230 164 413 | 4 440 | 0 | 0 | 0 | 0
10^9 | int | std::vector | 0 | 0.210 026 | 770 968 775 | 229 027 009 | 4 204 | 3 | 4 | 4 | 1
10^9 | std::unique_ptr | DynArray | 0 | 0.006 377 | 753 974 964 | 246 013 774 | 11 250 | 9 | 3 | 0 | 0
10^9 | std::unique_ptr | std::vector | 0 | 2.935 007 | 752 346 656 | 247 643 704 | 9 616 | 8 | 9 | 3 | 4
10^10 | int | DynArray | 0 | 0.000 132 | 7 699 417 275 | 2 300 534 425 | 48 299 | 1 | 0 | 0 | 0
10^10 | int | std::vector | 0 | 16.338 763 | 7 689 689 639 | 2 310 264 815 | 45 530 | 4 | 3 | 3 | 6
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
FreeBSD
Memory
Array Size | DynArray Peak RSS (bytes) | std::vector Peak RSS (bytes)
10^3 | 0 | 0
10^6 | 3 792 000 | 4 548 000
10^9 | 3 913 836 000 | 8 093 936 000
10^10 | killed by OOM killer | killed by OOM killer

The first row is a bit surprising. It turns out that if I execute mem_dynarray1000 in an infinite loop, 90% of the time it prints 0, and the rest of the time it prints numbers which differ from each other by several orders of magnitude. It looks like a bug. Results on Mac and Linux are stable between runs. It’s safe to assume the numbers in this table are underreported.

When it comes to the OOM killer, the first time I ran the test it killed not only the process running the benchmark, but also my SSH session. Case in point for my earlier rant about overcommit and the OOM killer.

Throughput
No optimisation (-O0)
Length | Element | Array | Total time (seconds) | Throughput (Hz)
10^3 | int | DynArray | 0.000 017 | 58 823 529
10^3 | int | std::vector | 0.000 061 | 16 393 442
10^3 | std::unique_ptr | DynArray | 0.000 069 | 14 492 753
10^3 | std::unique_ptr | std::vector | 0.000 145 | 6 896 551
10^6 | int | DynArray | 0.014 053 | 71 159 183
10^6 | int | std::vector | 0.030 769 | 32 500 243
10^6 | std::unique_ptr | DynArray | 0.064 249 | 15 564 444
10^6 | std::unique_ptr | std::vector | 0.127 867 | 7 820 626
10^9 | int | DynArray | 12.720 856 | 78 611 062
10^9 | int | std::vector | 33.253 775 | 30 071 773
10^9 | std::unique_ptr | DynArray | 66.044 360 | 15 141 338
10^9 | std::unique_ptr | std::vector | 134.103 236 | 7 456 941
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
Optimisation -O1
Length | Element | Array | Total time (seconds) | Throughput (Hz)
10^3 | int | DynArray | 0.000 012 | 83 333 333
10^3 | int | std::vector | 0.000 043 | 23 255 813
10^3 | std::unique_ptr | DynArray | 0.000 029 | 34 482 758
10^3 | std::unique_ptr | std::vector | 0.000 030 | 33 333 333
10^6 | int | DynArray | 0.008 916 | 112 157 918
10^6 | int | std::vector | 0.005 711 | 175 100 682
10^6 | std::unique_ptr | DynArray | 0.024 446 | 40 906 487
10^6 | std::unique_ptr | std::vector | 0.020 466 | 48 861 526
10^9 | int | DynArray | 3.729 038 | 268 165 677
10^9 | int | std::vector | 3.950 968 | 253 102 530
10^9 | std::unique_ptr | DynArray | 16.577 674 | 60 322 093
10^9 | std::unique_ptr | std::vector | 19.598 912 | 51 023 240
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
Optimisation -O2
Length | Element | Array | Total time (seconds) | Throughput (Hz)
10^3 | int | DynArray | 0.000 007 | 142 857 142
10^3 | int | std::vector | 0.000 022 | 45 454 545
10^3 | std::unique_ptr | DynArray | 0.000 015 | 66 666 666
10^3 | std::unique_ptr | std::vector | 0.000 015 | 66 666 666
10^6 | int | DynArray | 0.003 832 | 260 960 334
10^6 | int | std::vector | 0.002 374 | 421 229 991
10^6 | std::unique_ptr | DynArray | 0.014 070 | 71 073 205
10^6 | std::unique_ptr | std::vector | 0.011 030 | 90 661 831
10^9 | int | DynArray | 3.948 549 | 253 257 589
10^9 | int | std::vector | 2.587 411 | 386 486 723
10^9 | std::unique_ptr | DynArray | 16.367 930 | 61 095 080
10^9 | std::unique_ptr | std::vector | 19.173 182 | 52 156 183
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
Latency
No optimisation (-O0)
Length | Element | Array | Min | Max | 10^-6 | 10^-5 | 10^-4 | 10^-3 | 10^-2 | 10^-1 | ∞
10^3 | int | DynArray | 0 | 0.000 015 | 840 | 159 | 1 | 0 | 0 | 0 | 0
10^3 | int | std::vector | 0 | 0.000 229 | 794 | 205 | 0 | 1 | 0 | 0 | 0
10^3 | std::unique_ptr | DynArray | 0 | 0.000 003 | 727 | 273 | 0 | 0 | 0 | 0 | 0
10^3 | std::unique_ptr | std::vector | 0 | 0.048 135 | 721 | 277 | 1 | 0 | 0 | 1 | 0
10^6 | int | DynArray | 0 | 0.007 504 | 957 349 | 42 619 | 1 | 10 | 21 | 0 | 0
10^6 | int | std::vector | 0 | 0.007 498 | 954 260 | 45 710 | 5 | 8 | 17 | 0 | 0
10^6 | std::unique_ptr | DynArray | 0 | 0.007 491 | 952 252 | 47 717 | 3 | 3 | 25 | 0 | 0
10^6 | std::unique_ptr | std::vector | 0 | 0.025 308 | 910 396 | 89 553 | 9 | 14 | 26 | 2 | 0
10^9 | int | DynArray | 0 | 0.007 476 | 935 647 184 | 64 329 993 | 64 | 495 | 22 264 | 0 | 0
10^9 | int | std::vector | 0 | 4.924 694 | 904 947 744 | 95 027 736 | 120 | 3 | 24 388 | 3 | 6
10^9 | std::unique_ptr | DynArray | 0 | 0.004 695 | 863 930 190 | 136 034 658 | 3 684 | 361 | 31 107 | 0 | 0
10^9 | std::unique_ptr | std::vector | 0 | 28.367 182 | 869 894 094 | 130 069 513 | 3 530 | 226 | 32 623 | 5 | 9
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
Optimisation -O1
Length | Element | Array | Min | Max | 10^-6 | 10^-5 | 10^-4 | 10^-3 | 10^-2 | 10^-1 | ∞
10^3 | int | DynArray | 0 | 0.000 009 | 817 | 183 | 0 | 0 | 0 | 0 | 0
10^3 | int | std::vector | 0 | 0.000 597 | 790 | 209 | 0 | 1 | 0 | 0 | 0
10^3 | std::unique_ptr | DynArray | 0 | 0.000 002 | 799 | 201 | 0 | 0 | 0 | 0 | 0
10^3 | std::unique_ptr | std::vector | 0 | 0.054 681 | 857 | 142 | 0 | 0 | 0 | 1 | 0
10^6 | int | DynArray | 0 | 0.006 055 | 943 746 | 56 238 | 0 | 0 | 16 | 0 | 0
10^6 | int | std::vector | 0 | 0.006 036 | 952 068 | 47 904 | 3 | 1 | 24 | 0 | 0
10^6 | std::unique_ptr | DynArray | 0 | 0.006 013 | 934 911 | 65 061 | 0 | 0 | 28 | 0 | 0
10^6 | std::unique_ptr | std::vector | 0 | 0.005 994 | 937 884 | 62 084 | 6 | 1 | 25 | 0 | 0
10^9 | int | DynArray | 0 | 0.006 043 | 892 789 630 | 107 189 456 | 92 | 2 | 20 820 | 0 | 0
10^9 | int | std::vector | 0 | 1.079 716 | 868 199 213 | 131 779 846 | 92 | 6 | 20 837 | 3 | 3
10^9 | std::unique_ptr | DynArray | 0 | 0.007 322 | 867 898 304 | 132 073 111 | 4 900 | 886 | 22 799 | 0 | 0
10^9 | std::unique_ptr | std::vector | 0 | 2.992 284 | 851 345 064 | 148 627 732 | 4 090 | 680 | 22 424 | 5 | 5
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
Optimisation -O2
Length | Element | Array | Min | Max | 10^-6 | 10^-5 | 10^-4 | 10^-3 | 10^-2 | 10^-1 | ∞
10^3 | int | DynArray | 0 | 0.000 008 | 818 | 182 | 0 | 0 | 0 | 0 | 0
10^3 | int | std::vector | 0 | 0.000 358 | 804 | 195 | 0 | 1 | 0 | 0 | 0
10^3 | std::unique_ptr | DynArray | 0 | 0.000 003 | 793 | 207 | 0 | 0 | 0 | 0 | 0
10^3 | std::unique_ptr | std::vector | 0 | 0.057 792 | 847 | 152 | 0 | 0 | 0 | 1 | 0
10^6 | int | DynArray | 0 | 0.005 989 | 952 351 | 47 624 | 0 | 0 | 25 | 0 | 0
10^6 | int | std::vector | 0 | 0.005 962 | 944 598 | 55 372 | 5 | 2 | 23 | 0 | 0
10^6 | std::unique_ptr | DynArray | 0 | 0.005 943 | 940 700 | 59 272 | 3 | 0 | 25 | 0 | 0
10^6 | std::unique_ptr | std::vector | 0 | 0.006 367 | 950 479 | 49 492 | 5 | 2 | 22 | 0 | 0
10^9 | int | DynArray | 0 | 0.006 070 | 889 075 210 | 110 903 827 | 94 | 2 | 20 867 | 0 | 0
10^9 | int | std::vector | 0 | 1.102 612 | 867 475 739 | 132 503 564 | 103 | 5 | 20 583 | 3 | 3
10^9 | std::unique_ptr | DynArray | 0 | 0.007 244 | 876 357 385 | 123 614 252 | 4 383 | 1 138 | 22 842 | 0 | 0
10^9 | std::unique_ptr | std::vector | 0 | 3.004 393 | 843 681 442 | 156 290 107 | 5 391 | 635 | 22 415 | 5 | 5
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
Linux
Memory
Array Size | DynArray Peak RSS (bytes) | std::vector Peak RSS (bytes)
10^3 | 2 912 000 | 2 856 000
10^6 | 6 700 000 | 7 124 000
10^9 | 3 909 288 000 | 4 198 016 000
10^10 | killed by OOM killer | std::bad_alloc

Interestingly, here std::vector threw std::bad_alloc and the benchmark wasn’t killed by the OOM killer. However, in the later throughput and latency benchmarks that didn’t happen: the OOM killer killed the process as usual.
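
For illustration only, here’s a minimal sketch (not the benchmark code itself) of how that difference looks from inside the process: an up-front refusal from the allocator surfaces as std::bad_alloc, while under overcommit the allocation may appear to succeed and the process only dies later, when the pages are actually touched.

    #include <cstdio>
    #include <new>
    #include <vector>

    int main() {
        try {
            std::vector<int> v;
            for (long long i = 0; i < 10'000'000'000; i++)
                v.push_back(1);
        } catch (const std::bad_alloc&) {
            // Reached only when operator new reports failure immediately.
            // With overcommit the growth can instead "succeed", and the
            // OOM killer ends the process later, once the freshly mapped
            // pages are written to.
            std::puts("std::bad_alloc");
            return 1;
        }
    }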

Throughput
No optimisation (-O0)
Length | Element | Array | Total time (seconds) | Throughput (Hz)
10^3 | int | DynArray | 0.000 029 | 34 482 758
10^3 | int | std::vector | 0.000 046 | 21 739 130
10^3 | std::unique_ptr | DynArray | 0.000 151 | 6 622 516
10^3 | std::unique_ptr | std::vector | 0.000 277 | 3 610 108
10^6 | int | DynArray | 0.016 614 | 60 190 201
10^6 | int | std::vector | 0.029 401 | 34 012 448
10^6 | std::unique_ptr | DynArray | 0.111 757 | 8 947 985
10^6 | std::unique_ptr | std::vector | 0.204 039 | 4 901 023
10^9 | int | DynArray | 16.356 449 | 61 137 964
10^9 | int | std::vector | 31.854 182 | 31 393 052
10^9 | std::unique_ptr | DynArray | killed by OOM killer
10^9 | std::unique_ptr | std::vector | killed by OOM killer
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
Optimisation -O1
Length | Element | Array | Total time (seconds) | Throughput (Hz)
10^3 | int | DynArray | 0.000 014 | 71 428 571
10^3 | int | std::vector | 0.000 013 | 76 923 076
10^3 | std::unique_ptr | DynArray | 0.000 049 | 20 408 163
10^3 | std::unique_ptr | std::vector | 0.000 051 | 19 607 843
10^6 | int | DynArray | 0.005 999 | 166 694 449
10^6 | int | std::vector | 0.006 483 | 154 249 575
10^6 | std::unique_ptr | DynArray | 0.041 418 | 24 144 091
10^6 | std::unique_ptr | std::vector | 0.024 917 | 40 133 242
10^9 | int | DynArray | 3.298 444 | 303 173 253
10^9 | int | std::vector | 4.674 902 | 213 908 227
10^9 | std::unique_ptr | DynArray | killed by OOM killer
10^9 | std::unique_ptr | std::vector | killed by OOM killer
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
Optimisation -O2
Length | Element | Array | Total time (seconds) | Throughput (Hz)
10^3 | int | DynArray | 0.000 013 | 76 923 076
10^3 | int | std::vector | 0.000 013 | 76 923 076
10^3 | std::unique_ptr | DynArray | 0.000 049 | 20 408 163
10^3 | std::unique_ptr | std::vector | 0.000 041 | 24 390 243
10^6 | int | DynArray | 0.005 905 | 169 348 010
10^6 | int | std::vector | 0.006 156 | 162 443 144
10^6 | std::unique_ptr | DynArray | 0.040 347 | 24 784 990
10^6 | std::unique_ptr | std::vector | 0.022 915 | 43 639 537
10^9 | int | DynArray | 3.295 792 | 303 417 205
10^9 | int | std::vector | 2.972 950 | 336 366 235
10^9 | std::unique_ptr | DynArray | killed by OOM killer
10^9 | std::unique_ptr | std::vector | killed by OOM killer
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
Latency
No optimisation (-O0)
Length | Element | Array | Min | Max | 10^-6 | 10^-5 | 10^-4 | 10^-3 | 10^-2 | 10^-1 | ∞
10^3 | int | DynArray | 0 | 0.000 016 | 736 | 263 | 1 | 0 | 0 | 0 | 0
10^3 | int | std::vector | 0 | 0.000 002 | 712 | 288 | 0 | 0 | 0 | 0 | 0
10^3 | std::unique_ptr | DynArray | 0 | 0.000 003 | 656 | 344 | 0 | 0 | 0 | 0 | 0
10^3 | std::unique_ptr | std::vector | 0 | 0.000 040 | 648 | 349 | 3 | 0 | 0 | 0 | 0
10^6 | int | DynArray | 0 | 0.000 006 | 802 250 | 197 750 | 0 | 0 | 0 | 0 | 0
10^6 | int | std::vector | 0 | 0.000 559 | 752 663 | 247 331 | 4 | 2 | 0 | 0 | 0
10^6 | std::unique_ptr | DynArray | 0 | 0.000 009 | 640 887 | 359 113 | 0 | 0 | 0 | 0 | 0
10^6 | std::unique_ptr | std::vector | 0 | 0.040 269 | 629 382 | 370 605 | 3 | 3 | 4 | 3 | 0
10^9 | int | DynArray | 0 | 0.000 375 | 828 756 776 | 171 243 207 | 12 | 5 | 0 | 0 | 0
10^9 | int | std::vector | 0 | 0.322 032 | 744 429 501 | 255 568 581 | 9 | 1 899 | 4 | 4 | 2
10^9 | std::unique_ptr | DynArray | killed by OOM killer
10^9 | std::unique_ptr | std::vector | killed by OOM killer
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
Optimisation -O1
Length | Element | Array | Min | Max | 10^-6 | 10^-5 | 10^-4 | 10^-3 | 10^-2 | 10^-1 | ∞
10^3 | int | DynArray | 0 | 0.000 018 | 853 | 146 | 1 | 0 | 0 | 0 | 0
10^3 | int | std::vector | 0 | 0.000 016 | 871 | 128 | 1 | 0 | 0 | 0 | 0
10^3 | std::unique_ptr | DynArray | 0 | 0.000 005 | 780 | 220 | 0 | 0 | 0 | 0 | 0
10^3 | std::unique_ptr | std::vector | 0 | 0.000 006 | 763 | 237 | 0 | 0 | 0 | 0 | 0
10^6 | int | DynArray | 0 | 0.000 007 | 758 437 | 241 563 | 0 | 0 | 0 | 0 | 0
10^6 | int | std::vector | 0 | 0.000 414 | 768 969 | 231 025 | 4 | 2 | 0 | 0 | 0
10^6 | std::unique_ptr | DynArray | 0 | 0.000 016 | 719 386 | 280 613 | 1 | 0 | 0 | 0 | 0
10^6 | std::unique_ptr | std::vector | 0 | 0.006 132 | 718 193 | 281 800 | 4 | 2 | 1 | 0 | 0
10^9 | int | DynArray | 0 | 0.000 351 | 766 618 933 | 233 381 026 | 36 | 5 | 0 | 0 | 0
10^9 | int | std::vector | 0 | 0.321 729 | 760 193 989 | 239 804 071 | 31 | 1 900 | 3 | 4 | 2
10^9 | std::unique_ptr | DynArray | killed by OOM killer
10^9 | std::unique_ptr | std::vector | killed by OOM killer
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer
Optimisation -O2
Length | Element | Array | Min | Max | 10^-6 | 10^-5 | 10^-4 | 10^-3 | 10^-2 | 10^-1 | ∞
10^3 | int | DynArray | 0 | 0.000 019 | 767 | 230 | 3 | 0 | 0 | 0 | 0
10^3 | int | std::vector | 0 | 0.000 023 | 769 | 230 | 1 | 0 | 0 | 0 | 0
10^3 | std::unique_ptr | DynArray | 0 | 0.000 004 | 735 | 265 | 0 | 0 | 0 | 0 | 0
10^3 | std::unique_ptr | std::vector | 0 | 0.000 005 | 732 | 268 | 0 | 0 | 0 | 0 | 0
10^6 | int | DynArray | 0 | 0.000 009 | 764 199 | 235 801 | 0 | 0 | 0 | 0 | 0
10^6 | int | std::vector | 0 | 0.000 611 | 761 404 | 238 590 | 4 | 2 | 0 | 0 | 0
10^6 | std::unique_ptr | DynArray | 0 | 0.000 013 | 702 667 | 297 332 | 1 | 0 | 0 | 0 | 0
10^6 | std::unique_ptr | std::vector | 0 | 0.006 420 | 715 711 | 284 282 | 4 | 2 | 1 | 0 | 0
10^9 | int | DynArray | 0 | 0.000 416 | 759 662 742 | 240 337 214 | 38 | 6 | 0 | 0 | 0
10^9 | int | std::vector | 0 | 0.322 445 | 760 329 293 | 239 668 772 | 26 | 1 900 | 3 | 4 | 2
10^9 | std::unique_ptr | DynArray | killed by OOM killer
10^9 | std::unique_ptr | std::vector | killed by OOM killer
10^10 | int | DynArray | killed by OOM killer
10^10 | int | std::vector | killed by OOM killer
10^10 | std::unique_ptr | DynArray | killed by OOM killer
10^10 | std::unique_ptr | std::vector | killed by OOM killer

What about mremap()?

  1. As far as I know, there is no equivalent to it on Windows.
  2. It may relocate the mapping in virtual address space, so you can’t use it as the basis for an arena, which needs to keep existing pointers valid without moving them (see the sketch below).
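
A minimal Linux-only sketch of that second point, assuming glibc (which exposes mremap() when _GNU_SOURCE is defined): growing a mapping in place can fail, and the MREMAP_MAYMOVE fallback succeeds precisely by moving the mapping, which is what invalidates every pointer into it.

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <cstddef>
    #include <cstdio>
    #include <sys/mman.h>

    int main() {
        const std::size_t old_len = 1 << 20, new_len = 1 << 21;
        void* p = mmap(nullptr, old_len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;

        // Without MREMAP_MAYMOVE the kernel may only grow the mapping in
        // place, which fails whenever the adjacent address range is taken.
        void* q = mremap(p, old_len, new_len, 0);
        if (q == MAP_FAILED) {
            // The fallback is allowed to move the mapping; if it does, every
            // pointer into the old range is now dangling.
            q = mremap(p, old_len, new_len, MREMAP_MAYMOVE);
        }
        std::printf("old: %p, new: %p\n", p, q);
    }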

What about shrinking?

I’ve only shown how to increase the length of DynArray. If you need shrinking as well, on Windows you call VirtualFree() with MEM_DECOMMIT instead of MEM_RELEASE, and on POSIX you call mprotect() with PROT_NONE followed by madvise() with MADV_FREE or MADV_DONTNEED (whichever constant is available).
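
A minimal sketch of what that could look like, using a hypothetical free function decommit() (not part of the DynArray shown earlier) and assuming ptr and len are page-aligned:

    #include <cstddef>

    #ifdef _WIN32
    #include <windows.h>
    #else
    #include <sys/mman.h>
    #endif

    // Give the physical pages backing [ptr, ptr + len) back to the OS while
    // keeping the address range reserved, so it can be committed again later.
    bool decommit(void* ptr, std::size_t len) {
    #ifdef _WIN32
        return VirtualFree(ptr, len, MEM_DECOMMIT) != 0;
    #else
        if (mprotect(ptr, len, PROT_NONE) != 0)
            return false;
    #if defined(MADV_FREE)
        return madvise(ptr, len, MADV_FREE) == 0;
    #else
        return madvise(ptr, len, MADV_DONTNEED) == 0;
    #endif
    #endif
    }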

Personally, I’ve never needed to do that, so I’m not going to bother implementing it. It seems that in my style of programming I either throw out whole data structures or keep them, but I never delete single elements. Judging from the Swiss Tables CppCon talk, I’m not alone in that.

Closing words

So that’s what I’m doing now in my side project. If you have anything to add, you may write me an email at .

I dedicate this post to Windows Update.


  1. Weirdly, it’s called EXC_BAD_ACCESS on Mac.↩︎

  2. Unless someone manually sets the alignment of a structure with special compiler or linker directives to something bigger. But then it’s their problem.↩︎