Reserve and commit
This page contains a bit of maths. To view it correctly you need a web browser which supports MathML. That means either Firefox, Safari or Chrome. Chrome should be in version 109 or later.
I’ve been working on a yet-unpublished side project in Rust and
decided to use a bump allocator provided by bumpalo
to
simplify lifetimes and improve performance of my code. Generally, making
lots of micro-allocations and then freeing them one by one is far from
optimal, when you can instead group them into larger allocations and
free all at once. In the same way, a large number of little objects with
their own separate lifetimes is a lot harder to keep track of
conceptually than when their lifetimes are grouped into just a few
longer ones. Fewer, longer lifetimes also make working with the
borrow-checker easier.
In broad strokes, bumpalo
provides memory by allocating
chunks of memory from the OS and keeping track of how many bytes in the
last chunk are already used. When you request
bytes from bumpalo
, it gives you memory from the free
portion of the last chunk and increments its counter of used bytes. When
there isn’t enough space in the last chunk, it allocates another one and
does the same.
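To make the scheme concrete, here is a rough C++ sketch of such a chunked bump allocator (my own illustration of the general idea, not bumpalo's actual code; alignment and allocation-failure handling are omitted):
#include <cstddef>
#include <cstdlib>
#include <vector>

// Sketch of a chunked bump allocator: grab big chunks from the system
// allocator, hand out slices of the last chunk, free everything at once.
class ChunkedBump {
  struct Chunk { char* data; size_t cap; size_t used; };
  std::vector<Chunk> chunks_;
  static constexpr size_t kChunkSize = 64 * 1024;

 public:
  void* Alloc(size_t n) {
    if (chunks_.empty() || chunks_.back().used + n > chunks_.back().cap) {
      size_t cap = n > kChunkSize ? n : kChunkSize;
      chunks_.push_back({static_cast<char*>(std::malloc(cap)), cap, 0});
    }
    Chunk& c = chunks_.back();
    void* p = c.data + c.used;
    c.used += n;  // "bump" the used-bytes counter
    return p;
  }

  // Freeing is linear in the number of chunks.
  ~ChunkedBump() {
    for (Chunk& c : chunks_) std::free(c.data);
  }
};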
Works well enough. However:
- It involves much more memory management code than I’d like.
- It fragments memory over disparate chunks, where there’s likely always going to be unused space left at the end of each chunk.
- Freeing Bump takes linear time relative to the number of chunks.
- Reserving memory upfront by setting capacity avoids fragmentation into chunks, but it takes that amount of memory from the OS, regardless of whether all that memory is actually going to be used. That means the user needs to have very precise upfront expectations of memory use if they want to avoid fragmentation by reserving. Whether it's difficult and fragile, or impossible, varies on a case-by-case basis.
There is a way to implement a bump allocator without using chunks, and without taking unnecessary memory from the OS. I’m going to describe it later down this page, but first I want to establish a data structure which could serve as a benchmark of whether that allocator is any good.
In this text I’m going to use C++, because it’s the official language of Windows APIs, it makes it easy to access C APIs on other platforms as well, and it’s more established than Rust, so talking about data structures provided by the C++ standard library is going to meet with less suspicion than doing the same with the Rust standard library.
Dynamic arrays
Dynamic arrays are arrays which can change their length. A typical implementation of this data structure, such as std::vector, when it isn't being modified, behaves like a normal array, meaning that it holds its elements in a contiguous memory buffer: the k-th element (counting from 0) starts k · s bytes after the start of the 0-th element, where s is the size of a single element in bytes. Length (reported by size()) and capacity (capacity()) are stored on the side. Length is the number of elements currently stored, while capacity is the number of elements std::vector is capable of storing in the current allocation.
To store its elements, std::vector asks the allocator for a buffer big enough to store all elements. When you append an element to the end using push_back(), std::vector checks whether the current allocation has enough extra space to hold another element between the current last element and the end of the allocation. In other words, it checks whether size() < capacity(). If yes, then it puts the new element right after the current last element and updates the length by adding 1 to it.
If however size() == capacity(), then there is no more space in the current allocation. In that case, std::vector asks the allocator for another buffer, this time of size f · capacity(), where f is called the growth factor. libstdc++ and libc++ use f = 2. MSVC uses f = 1.5.
After allocating the new bigger buffer, elements from the old buffer are copied to the new one and the new element is put after them. The old buffer is freed, length is incremented by 1, and capacity is set to the size of the newly-allocated buffer.
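You can observe this policy directly by printing capacity() whenever it changes (on libstdc++ and libc++ you should see it doubling; with MSVC's STL it grows by 1.5 each time):
#include <iostream>
#include <vector>

int main() {
  std::vector<int> v;
  size_t last_cap = v.capacity();
  for (int i = 0; i < 1000; i++) {
    v.push_back(i);
    if (v.capacity() != last_cap) {
      // Every line printed here corresponds to a reallocation plus a copy.
      std::cout << "size " << v.size() << " -> capacity " << v.capacity() << "\n";
      last_cap = v.capacity();
    }
  }
}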
Performance cost
Time
If the std::vector had n elements before the whole operation, all of that required Θ(n) steps (due to copying). However, this copying doesn't happen at every push_back(). It happens only at those push_back() calls which find the vector full, i.e. roughly at every f^k-th push_back(). In other cases this method takes constant time.
If we consider a sequence of n push_back() calls, then they are going to include Θ(log n) calls which copy the underlying buffer, and n − Θ(log n) calls which do not. All the copies taken together are going to take Θ(n) steps. The copy-less push_back() calls taken together are going to take Θ(n) steps. Adding it together we get that all push_back() calls together cost us Θ(n). There are n push_back() calls, so on average push_back() takes constant time: Θ(n) / n = Θ(1).
That's why it's sometimes said that std::vector::push_back() takes constant time amortised – the cost of every expensive push_back(), which copies n elements, is amortised over the Θ(n) previous push_back()s, each of which took constant time.
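For the sceptical, here is the geometric-series bound behind the claim that all the copies together take Θ(n) steps. If the final capacity is c ≤ f · n and capacities grew by a factor of f each time (ignoring rounding), the total number of elements copied over all reallocations is

\[
\sum_{k \ge 1} \frac{c}{f^{k}} \;<\; \frac{c}{f - 1} \;\le\; \frac{f}{f - 1}\, n \;=\; O(n).
\]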
However, the latency of push_back()
is pretty bad, as
demonstrated by the measurements later in this note.
Space
Between allocating the new buffer and discarding the old one after copying it, the two buffers together occupied (1 + f) · n · s bytes of memory, which means that in order to get an n-element std::vector, we briefly need almost 1 + f times as much memory as is needed for our elements. That's 3 times for libstdc++ and libc++, and 2.5 times for MSVC. It's a short time window, so most of the time it isn't terrible, but sometimes we may not have this much memory available to us.
After push_back() finishes, we're using capacity() · s bytes to store our n · s bytes of elements, which gives us 1/f as the lower bound for space utilisation. For f = 2, space utilisation of std::vector is in the range (1/2, 1]. For f = 1.5, it is in the range (2/3, 1].
Not terrible, not great.
Complexity cost
Due to all the copying, our program cannot depend on any elements
staying in one place for too long. If we want to refer to a string
stored in a std::vector
, we cannot simply grab a
reference/pointer/string_view
to it, as that could result
in a dangling reference/pointer after a push_back()
call.
We could work around this problem by grabbing the index of the string in
that std::vector
. However, that would mean that functions
operating on that string would need to operate on such indices instead
of references/pointers to strings, which in turn would mean that they
would either need to take the std::vector
as another
argument, or that they would work only with strings from a selected
std::vector
. That would make using those functions in
different contexts difficult. E.g. we wouldn’t be able to just pass a
string directly to those functions in tests (normal, or transient).
We also wouldn't be able to easily refer to substrings without copying them or creating a substring struct of the form {array index, offset, length}, which is a lot more things to keep track of than just a plain std::string_view, i.e. a {pointer, length}. And again, the functions operating on those structs would be confined to operating only on those special objects instead of arbitrary substrings.
Surely, there must be a way to make it all simpler and more efficient?
Virtual memory
Let’s take a step back and ponder what it means to allocate memory on a typical operating system these days. Essentially, we want to get access to a region of RAM. That access is controlled by the OS and the CPU using a memory management unit (MMU).
These components logically divide memory into pages. A page usually spans 4 kiB of memory. To address every byte in such a page we need 12 bits, so we can treat the upper 52 bits of a 64-bit address as the index of the page itself. If we have an address and want to know the index of the page holding the byte under that address, we only need to perform a logical shift right on the address by 12 bits.
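In code, splitting an address into its page index and in-page offset is just a shift and a mask (a tiny illustration of mine, assuming 4 kiB pages):
#include <cstdint>
#include <cstdio>

constexpr uint64_t kPageShift = 12;                      // 4 kiB pages
constexpr uint64_t kPageBytes = uint64_t{1} << kPageShift;

constexpr uint64_t PageIndex(uint64_t addr)  { return addr >> kPageShift; }
constexpr uint64_t PageOffset(uint64_t addr) { return addr & (kPageBytes - 1); }

int main() {
  uint64_t addr = 0x7ffd12345678;  // some virtual address
  std::printf("page index: %#llx, offset: %#llx\n",
              static_cast<unsigned long long>(PageIndex(addr)),
              static_cast<unsigned long long>(PageOffset(addr)));
}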
The operating system manages a page table, which conceptually is a hash table from virtual page indices to physical page indices and some minimal metadata, such as permissions. (Due to space constraints on 64-bit systems, on most systems the hash table is actually nested.)
When a process reads from or writes to memory at an address, the CPU
fetches the hash table entry for that address, checks whether the
process has appropriate permissions and accesses the memory with the
page index replaced with the one stored in the value fetched from the
page table. If the process doesn’t have the appropriate permissions
however, the CPU will trigger a hardware page fault, which is going to
be caught by the OS and redirected to the process as a segfault
(SIGSEGV
on POSIX systems1).
This arrangement creates two address spaces: virtual and physical. The virtual address space consists of addresses the way processes see them. The physical address space consists of addresses directly referring to physical memory (RAM). Each of them is divided into pages. The virtual address space is divided into virtual pages, while the physical address space into physical pages. The page table maintains a mapping from virtual to physical pages. Each process lives in a separate virtual address space.
Reserving vs committing memory
When we write a program and want to be able to access a piece of
memory, we need to first ask the OS to update the page table. To create
a new mapping, we need to call VirtualAlloc()
on Windows,
or mmap()
on POSIX. Both of them accept an argument,
hinting which virtual address the new mapping should start at. If this
argument is null, the OS picks an address of its own choosing. Those
functions are what’s used underneath by malloc()
from the C
standard library.
Looking at all that, maybe our dynamic array could allocate memory
page after page in contiguity? Each time we want to append a new
element, if we used up all the previously-allocated space, we’d just
call VirtualAlloc()
/mmap()
to give us another
virtual page directly after our existing allocation and continue
normally.
That sounds almost like a good plan, but there is no guarantee that the virtual page right after our allocation isn’t already mapped. It could be mapped before our process even started execution, or we could have mapped it ourselves in a different place in code.
What we need is a way to tell the OS that a certain region of virtual
address space shouldn’t be mapped, unless we specifically ask for it
(using the address and length arguments of
VirtualAlloc()
/mmap()
). This way we can ensure
that when we need another virtual page mapped, it is available to us.
And when we don’t need it, we don’t use up scarce physical memory.
VirtualAlloc()
allows us to do just that. It separates
reserving memory from committing it. Reserving memory
is exactly what I described in the last paragraph: telling the OS not to
map virtual pages in a given address range, unless we specifically ask
for it. Committing memory is the act of mapping virtual pages to
physical pages. On Windows committed memory is counted towards the
Working Set Size (WSS) of a process, which is what’s presented
to the user in the Task Manager as the amount of memory a process
consumes.
While physical memory is scarce, virtual memory on 64-bit systems is abundant. On amd64 systems we typically have about 48 bits of virtual address space available, which amounts to a quarter of a PiB. So while my PC has only 16 GiB of RAM, I can reserve about sixteen thousand times as much virtual memory.
Lazy Overcommit
On POSIX systems, such as Mac, FreeBSD and Linux which I'm going to be using here, there is no explicit way to differentiate reserving memory from committing it. However, in their normal configuration those systems use lazy overcommit: when a process asks for memory to be mapped, the operating system doesn't update the page table right away. Instead it tells the process that the mapping succeeded. When the process later tries to access one of the virtual pages it was led to believe is mapped, the CPU generates a page fault, the OS catches it, updates the page table and resumes execution of the process from the instruction which triggered the fault, as if nothing happened. That's why it's called lazy.
In that setup, memory allocations done with mmap()
almost never fail (which would allow processes to handle the error
condition), even when the sum of memory requested with
mmap()
is more than is available in the system (that’s why
it’s called overcommit). Instead the operating system waits
until after all allocations are made and processes actually try to use
the allocated memory.
When it turns out that the processes weren’t joking and they really attempt to use all that memory, which is more than the physical memory available, the operating system starts an out-of-memory (OOM) killer which picks a process according to its own fancy and terminates it, thereby sacrificing it on the altar of overcommit. It may not be the process responsible for excessive memory consumption. It may be a process critical to the security of your system, like your screen locker. This is repeated, until the amount of memory that surviving processes try to access doesn’t exceed the amount of physical memory available. This is stupid.
However, in the absence of a proper reserve and commit API, it lets
us emulate reserve and commit using mprotect()
calls,
preventing us from accessing pages of memory we didn’t explicitly
“commit” by disabling all permissions on those pages. When we want to
“commit” pages, we change their permissions to allow us to access them
and let lazy overcommit take care of the fact that the initial
mmap()
call requested more memory than we may need and
possibly than is even available in the system.
As long as the system administrator didn't disable overcommit. But given that overcommit is expected on POSIX, applications in the POSIX world in general aren't written with the expectation that memory allocations can fail. So when overcommit is disabled, things are going to break anyway.
The code
I’ll go over key parts of the code here, but omit other parts. Complete code is available in its own git repository. If you find bugs, I do accept patches. README describes how to send them.
I’m going to first present the platform-specific parts of code, and
then follow them with platform-independent code. All platform-specific
code is in namespace pal
(for platform abstraction
layer).
Windows
Apparently that’s how you declare the entry-point function on Windows:
#define DECL_MAIN int __cdecl wmain()
For benchmarks I’ll want to prevent inlining in a couple places to prevent the optimiser from knowing the value of a couple arguments passed to benchmarking functions, which could influence the results.
#define NOINLINE __declspec(noinline)
Theoretically we should fetch page size with
GetSystemInfo()
, but I want to keep the example code
simple, and on Windows it’s almost always going to be 4096 anyway:
static constexpr size_t kPageSize = 0x1000;
Reserving memory:
inline void* MemReserve(size_t len) {
void* buf = VirtualAlloc(nullptr, len, MEM_RESERVE, PAGE_NOACCESS);
if (buf == nullptr) {
DWORD last_error = GetLastError();
std::cerr << "Reserving memory with VirtualAlloc failed. Error code: " << last_error << std::endl;
return nullptr;
}
return buf;
}
Committing memory:
inline bool MemCommit(void* p, size_t len) {
void* committed = VirtualAlloc(p, len, MEM_COMMIT, PAGE_READWRITE);
if (committed == nullptr) {
DWORD last_error = GetLastError();
std::cerr << "Committing memory with VirtualAlloc failed. Error code: " << last_error << std::endl;
return false;
}
return true;
}
Freeing memory:
inline void MemFree(void* p, size_t /*len*/) {
VirtualFree(p, 0, MEM_RELEASE);
}
That’s all we need for memory allocation. Now, a couple functions for the measurements done in benchmarks. First, peak memory use (working set size):
inline size_t GetPeakMemoryUse() {
PROCESS_MEMORY_COUNTERS mem;
bool res = GetProcessMemoryInfo(GetCurrentProcess(), &mem, sizeof mem);
if (!res) {
DWORD last_error = GetLastError();
std::cerr << "GetProcessMemoryInfo failed. Error code: " << last_error << std::endl;
return 0;
}
return mem.PeakWorkingSetSize;
}
And the time the process spent on-CPU, adding time spent in user-mode and in kernel-mode, in microseconds:
inline uint64_t ConvertFiletimeToMicroseconds(FILETIME t) {
ULARGE_INTEGER li;
li.LowPart = t.dwLowDateTime;
li.HighPart = t.dwHighDateTime;
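// FILETIME counts time in 100-nanosecond intervals, so dividing by 10 yields microseconds.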
return static_cast<uint64_t>(li.QuadPart) / 10;
}
inline uint64_t GetCpuTime() {
FILETIME a, b, c, d;
if (GetProcessTimes(GetCurrentProcess(), &a, &b, &c, &d) != 0) {
auto kernel_time = ConvertFiletimeToMicroseconds(c);
auto user_time = ConvertFiletimeToMicroseconds(d);
return user_time + kernel_time;
} else {
std::cerr << "failed to get CPU time\n";
assert(false);
return 0;
}
}
POSIX
Most of the code is common to all POSIX platforms, with a couple of tiny differences handled with the preprocessor. Normally I'd split them into separate files, but that felt like more ceremony than this little post needs.
main
:
#define DECL_MAIN int main()
NOINLINE
:
#define NOINLINE __attribute__((noinline))
I’m going to be working with FreeBSD and Linux on amd64, which use 4kiB pages, and Mac M1 which uses 16kiB pages (as all ARM-based platforms from Apple do):
static constexpr size_t kPageSize =
#if defined(__APPLE__) && defined(__aarch64__)
0x4000
#else
0x1000
#endif
;
Reserving memory. We call mmap()
with
PROT_NONE
to disable all permissions and add
MAP_NORESERVE
flag, when it’s available (in practice it
means Linux), to opt into overcommit, when it’s in heuristic mode:
inline void* MemReserve(size_t len) {
int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_NORESERVE
// On Linux, guarantee lazy commit even if /proc/sys/vm/overcommit_memory is set to `heuristic` (0).
flags |= MAP_NORESERVE;
#endif
void* buf = mmap(nullptr, len, PROT_NONE, flags, -1, 0);
if (buf == MAP_FAILED) {
int saved_errno = errno;
std::cerr << "Reserving memory with mmap failed. errno: " << saved_errno << std::endl;
return nullptr;
}
return buf;
}
Commit by adding read and write permissions:
inline bool MemCommit(void* p, size_t len) {
if (mprotect(p, len, PROT_READ | PROT_WRITE) != 0) {
int saved_errno = errno;
std::cerr << "Committing memory with mprotect failed. Exit code: " << saved_errno << std::endl;
return false;
}
return true;
}
Free:
inline void MemFree(void* p, size_t len) {
if (munmap(p, len) != 0) {
int saved_errno = errno;
std::cerr << "Failed to unmap memory. Exit code: " << saved_errno << std::endl;
}
}
On POSIX platforms Resident Set Size (RSS) is what’s
interpreted as “memory use”. To measure that, we’re going to use
getrusage()
and here is where POSIX lets us down. Everyone
has this function, but everyone interprets the meaning of the returned
ru_maxrss
differently. On FreeBSD and Linux it’s in
kilobytes. On Mac it’s in bytes. And curiously on illumos it’s in pages.
The similarities are sufficient to lull you into a sense that everything
is working fine when porting: everything has the same names, everything
compiles, but then the interpretation of results is completely
different. The only reason I noticed it is because suddenly results
differed by several orders of magnitude between Mac and FreeBSD/Linux.
Others may not be this fortunate. Reading the Linux and FreeBSD man
pages, it wasn't clear to me whether by kilobytes they meant 1000 bytes or 1024 bytes. I settled for 1000:
inline size_t GetPeakMemoryUse() {
rusage r_usage;
if (getrusage(RUSAGE_SELF, &r_usage) != 0) {
int saved_errno = errno;
std::cerr << "Failed to fetch memory use. Exit code: " << saved_errno << std::endl;
return 0;
}
// On Mac, ru_maxrss is in bytes. On Linux and FreeBSD it's in kilobytes. So much for POSIX.
#ifdef __APPLE__
size_t multiplier = 1;
#else
size_t multiplier = 1000;
#endif
return r_usage.ru_maxrss * multiplier;
}
Measuring CPU time is done the same way everywhere:
inline uint64_t ConvertTimevalToMicroseconds(timeval t) {
return static_cast<uint64_t>(t.tv_sec) * 1000000ULL + static_cast<uint64_t>(t.tv_usec);
}
inline uint64_t GetCpuTime() {
rusage r_usage;
if (getrusage(RUSAGE_SELF, &r_usage) != 0) {
int saved_errno = errno;
std::cerr << "Failed to fetch memory use. Exit code: " << saved_errno << std::endl;
return 0;
}
auto user_time = ConvertTimevalToMicroseconds(r_usage.ru_utime);
auto kernel_time = ConvertTimevalToMicroseconds(r_usage.ru_stime);
return user_time + kernel_time;
}
Platform-independent part
For simplicity, let’s first write a basic arena of bytes, so we don’t need to trouble ourselves with templates for now.
Unlike std::vector
, we’re going to grow the allocation
linearly, not exponentially, which should give us better space
efficiency. Let’s settle on the amount of pages we’re going to add each
time we run out of space in our allocation. I say one page is a pretty good start:
static constexpr size_t kBytesPerCommit = pal::kPageSize;
We also need a way to calculate the number of pages needed to hold
the bytes in our arena, multiplied by number of bytes in a page, since
VirtualAlloc
accepts length in bytes, not pages. We
calculate it by rounding the number of bytes we need to store up to the
nearest page boundary. To do that, we perform ceiled division by
kBytesPerCommit
, followed by multiplication by
kBytesPerCommit
:
size_t GetCommittedBytes(size_t used_bytes) {
return ((used_bytes + kBytesPerCommit - 1) / kBytesPerCommit) * kBytesPerCommit;
}
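A quick sanity check of the rounding, assuming 4 kiB pages (so kBytesPerCommit == 4096):
assert(GetCommittedBytes(1) == 4096);
assert(GetCommittedBytes(4096) == 4096);
assert(GetCommittedBytes(4097) == 8192);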
Now the struct for arena bookkeeping, which should be pretty self-explanatory at this point:
struct Arena {
char* base = nullptr;
size_t cap = 0;
size_t len = 0;
// ... methods ...
};
RAII boilerplate:
Arena() = delete;
Arena(const Arena&) = delete;
Arena(Arena&& other) {
base = other.base;
cap = other.cap;
len = other.len;
other.base = nullptr;
other.cap = 0;
other.len = 0;
}
Main constructor, followed by the destructor:
explicit Arena(size_t max_len) {
assert(max_len != 0);
cap = GetCommittedBytes(max_len);
len = 0;
void* buf = pal::MemReserve(cap);
if (buf == nullptr) {
throw std::bad_alloc();
}
base = static_cast<char*>(buf);
}
~Arena() {
if (base != nullptr) {
pal::MemFree(base, cap);
}
}
I apologise for exceptions. However, I didn’t want to use extra
templates like std::optional
in this POC, which would get
in the way of demonstration.
Our constructor reserves the amount of memory requested in its argument. It doesn’t commit anything. Destructor frees the whole arena.
Growing logic:
bool Grow(size_t additional_len) {
assert(len+additional_len <= cap);
size_t old_committed = GetCommittedBytes(len);
size_t new_committed = GetCommittedBytes(len+additional_len);
size_t delta = new_committed - old_committed;
if (delta == 0) {
len += additional_len;
return true;
}
if (!pal::MemCommit(base + old_committed, delta)) {
return false;
}
len += additional_len;
return true;
}
If we can fit the additional bytes in current allocation, we only update the length and report success. Otherwise, we attempt to commit more pages (the minimal amount that’s needed to fit the additional bytes).
Our POC arena is complete. Now let’s move on to the dynamic array of
elements of type T
:
template <typename T>
class DynArray {
Arena arena;
public:
// ... methods ...
};
No surprises here. RAII boilerplate:
// Redundant, but useful as documentation.
DynArray() = delete;
DynArray(const DynArray<T>& other) = delete;
DynArray(DynArray&& other) = default;
Copy constructor is left as an exercise for the reader.
Main constructor just forwards a scaled argument to the arena. For more robust code, we should add a check for arithmetic overflow, but for simplicity of examples let’s leave it out:
explicit DynArray(size_t max_len)
: arena(max_len * sizeof(T)) {}
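If you did want that check, a small helper along these lines would do (my sketch, not code from the repository):
// Hypothetical overflow-checked variant of the constructor.
static size_t CheckedByteCount(size_t max_len) {
  // Reject lengths for which max_len * sizeof(T) would overflow size_t.
  if (max_len > SIZE_MAX / sizeof(T)) {
    throw std::bad_alloc();
  }
  return max_len * sizeof(T);
}

explicit DynArray(size_t max_len)
    : arena(CheckedByteCount(max_len)) {}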
Support for range loops:
T* begin() {
return reinterpret_cast<T*>(arena.base);
}
T* end() {
return reinterpret_cast<T*>(arena.base + arena.len);
}
Methods for querying the length and indexing with square brackets would also be nice:
size_t Len() const {
return arena.len / sizeof(T);
}
T& operator[](size_t i) {
assert(i < Len());
return begin()[i];
}
And the meat of the structure:
bool Append(T v) {
auto* dst = end();
if (!arena.Grow(sizeof(T))) {
return false;
}
new (dst) T(std::move(v)); // construct in place: the slot is raw, freshly committed memory
return true;
}
Should be pretty obvious.
Theoretically, we should make sure that we put T
at
addresses aligned according to the alignment requirements of
T
. However, the base address of the arena, which is
returned by VirtualAlloc()
, will always be aligned to a
page boundary. Alignment needs to be a power of 2. Additionally, in
practice I don’t think there are any types with alignment bigger than
page size2, which is also a power of 2. It
follows that alignment always divides page size equally, therefore we
satisfy alignment requirements with no additional effort.
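If you want to make that assumption explicit, a static_assert inside DynArray documents it at compile time (my addition, not in the repository):
static_assert(alignof(T) <= pal::kPageSize,
              "Arena base is only guaranteed to be page-aligned");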
Demonstrative tests
Now for some rudimentary tests. First, the arena:
// Reserving 16GiB.
auto a = Arena(16ULL * 1024ULL * 1024ULL * 1024ULL);
// Committing 1MiB to make space for 4 bytes (with some headroom).
bool committed = a.Grow(4);
assert(committed);
a.base[0] = 'a';
a.base[1] = 'b';
a.base[2] = 'c';
a.base[4] = '\0';
std::cout << a.base << std::endl;
// Past the declared length, but still in committed memory.
// No segfault, but in real code it's probably a bug.
a.base[5] = 'x';
a.base[kBytesPerCommit-1] = 'y';
// Reserved, but not committed memory. Segfault.
// a.base[kBytesPerCommit] = 'x';
That code should print “abc” and exit without crashing. We can see that unfortunately we can access elements beyond the stated length of the arena, in this case the elements at indices 5 and kBytesPerCommit − 1, because they're placed on the same page of memory which is used to store valid elements. However, we cannot reach beyond this page, which is demonstrated with the last line. After uncommenting it and running the program, it's going to crash due to a segfault. The uncommitted pages at the end serve as natural guard pages, providing a margin of safety, even in situations where for some reason it's been decided that bounds-checking should be disabled (which is most C++ shops).
And the dynamic array:
static constexpr size_t N = 100000;
auto a = DynArray<int>(N);
for (size_t i = 0; i < 16; i++) {
assert(a.Append(static_cast<int>(i)));
}
for (size_t i = 0; i < 12; i++) {
std::cout << a[i] << "\n";
}
for (size_t i = 16; i < N; i++) {
a.Append(static_cast<int>(i));
}
size_t real_cap = GetCommittedBytes(N * sizeof(int)) / sizeof(int);
for (size_t i = N; i < real_cap; i++) {
a.Append(static_cast<int>(i));
}
// Exceeded capacity. Assertion failure.
// a.Append(0);
That should print:
0
1
2
3
4
5
6
7
8
9
10
11
Uncommenting the last line would cause an assertion failure, this time because we run out of reserved space (due to deliberately smaller capacity declared in the beginning).
Performance cost
When DynArray has n elements, Append() takes constant time (there is never any copying) and doesn't have any temporary spikes in space use, meaning that the amount of space used during Append() never exceeds the amount of space used after it. An n-element DynArray uses GetCommittedBytes(n · sizeof(T)) bytes for the buffer plus the constant amount of space for length, capacity and pointer to buffer. This amounts to n · sizeof(T) / GetCommittedBytes(n · sizeof(T)) space utilisation. Pretty good. As n grows, we're approaching perfect space utilisation, and non-utilised space never exceeds kBytesPerCommit bytes, whereas in the case of std::vector it grows exponentially.
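In symbols, writing s for sizeof(T) and P for kBytesPerCommit, the space utilisation after n appends is

\[
\frac{n s}{\lceil n s / P \rceil \cdot P} \;>\; \frac{n s}{n s + P} \;\longrightarrow\; 1 \quad \text{as } n \to \infty.
\]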
Benchmarks
I’ve run all Windows, FreeBSD and Linux tests on the exact same machine. It’s a PC with an AMD Ryzen 7 2700X CPU and 16 GB of DDR4 RAM. For Mac tests I used my 2020 Mac Mini M1 with 16GB RAM. It wasn’t my objective to compare which compiler is better than another compiler, so to keep things shorter I used only one compiler on each platform, the one that was closest to the official one. Software versions:
- Windows 10 Pro with MSVC 19.32.31332 running from x64 Native Tools Command Prompt for VS 2022. Compilation script: build.bat.
- MacOS Monterey 12.6.2 with Apple clang version 13.0.0 (clang-1300.0.29.30), the version shipped with Xcode 13.2.1. Compilation script: build-clang.sh.
- FreeBSD 13.1-p3 (with 13.1-p5 userland) with Clang 13.0.0 (FreeBSD clang version 13.0.0), which is shipped with the base system. Compilation script: build-clang.sh.
- Debian 11.6 (5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux) with GCC (g++ (Debian 10.2.1-6) 10.2.1 20210110) installed with APT through the g++ package. Compilation script: build-gcc.sh.
Windows
Memory
First we'll measure peak memory use when constructing arrays of several increasing sizes. For details, see mem_dynarray.cpp and mem_stdvec.cpp in the repo.
Reminder: I don’t call std::vector::reserve()
, because
that method doesn’t reserve memory according to the Windows
(and our) nomenclature, it actually commits memory, and the
whole motivation for this page is finding a memory management strategy
for situations when I don’t have accurate and precise
predictions for how much memory I’m going to need. Besides, if I had
them, instead of std::vector
, I’d use a structure which
allocates only once, and doesn’t have all that resizing logic.
(Table: peak working set size (WSS) in bytes for DynArray and std::vector at each tested array size. At the largest size, std::vector hit std::bad_alloc.)
DynArray uses more memory when the whole array can fit in space smaller than a single memory page, but for bigger arrays it uses less space, because it grows linearly, not exponentially as std::vector does. In fact, at the largest tested size std::vector failed with a memory allocation error on my machine, whereas DynArray continued working.
Throughput
Now let’s see how many elements we’re able to append in a given time
period. We take the total time it takes to construct the whole array and
then divide it by the number of elements. Each benchmark is repeated for
different sizes of the resulting dynamic array. Results are included for
builds with different levels of optimisation: no optimisation,
/O1
and /O2
.
Aside from ints, we also use std::unique_ptr
s to
demonstrate how those dynamic arrays scale when their elements have
non-trivial (but still almost-trivial) move constructors. For details,
see bench.cpp
in the repo.
(Table: total construction time in seconds and throughput in Hz for DynArray and std::vector, with int and std::unique_ptr elements, at each tested length, for builds with no optimisation, /O1 and /O2. At the largest length, std::vector hit std::bad_alloc for both element types, and DynArray hit it with std::unique_ptr elements; DynArray with int elements completed.)
With no optimisations enabled, DynArray
has consistently
higher throughput than std::vector
, across all controlled
parameters. With /O1
and /O2
,
std::vector
has higher throughput for smaller arrays and
for big arrays with trivial move constructors; but for large arrays with
non-trivial move constructors DynArray
has the upper hand.
Extrapolating, the time it takes to construct the whole array, relative
to its end length, grows much slower in the case of
DynArray
than in the case of std::vector
.
Latency
In case of latency, we measure the time taken by each append call, save the minimum and maximum values, and aggregate all times into a linear histogram. Bucket columns are labeled with their upper bounds, in seconds. For details, see bench.cpp in the repo.
(Table: per-append latency for DynArray and std::vector, with int and std::unique_ptr elements, at each tested length, for builds with no optimisation, /O1 and /O2: minimum, maximum and a linear histogram of bucket counts. The same std::bad_alloc failures as in the throughput benchmarks occurred at the largest length.)
Except for one flake in the unoptimised benchmark at one of the array lengths, the maximum latency of DynArray always stayed
below the latency of std::vector
. In fact, usually it was a
couple times lower, and for big arrays it was orders of magnitude lower,
where, if you care about latency at all, performance of
std::vector
wasn’t acceptable by any standards.
All latency measurements demonstrate a bimodal distribution. In the
case of std::vector
it’s no surprise – that’s what we
should expect when we have two paths, one the common case of no resize,
one the odd case of resize. However, in the case of
DynArray
, the bimodal distribution surprised me, as I
didn’t see any cause for it in any of the code that I wrote: on the one
hand, VirtualAlloc()
isn’t called in every
Append()
, but on the other hand the number of calls to one
is linearly proportional to the other, and the numbers do not reflect
that, so I decided to dig into it.
I tried to collect hardware performance counters (PMC), but Intel
VTune failed to gather any for my AMD CPU. AMD μProf didn’t seem to even
have any place to display them. Microsoft’s tracelog.exe
also failed to collect anything. The ETL file contained no PMC data.
That’s quite a let-down compared to the ease of collecting this data on
Linux with perf
.
Luckily, DTrace was ported to Windows, so I decided to use that.
First I tried to measure how much time VirtualAlloc()
takes
in my tests, so I wrote the following script, using the
syscall
DTrace provider:
syscall::NtAllocateVirtualMemory:entry
/execname == "bench2.exe"/
{
self->ts = timestamp;
}
syscall::NtAllocateVirtualMemory:return
/execname == "bench2.exe"/
{
@alloc[execname] = quantize((timestamp - self->ts) / 1000);
}
That collects running times (in microseconds) of the
NtAllocateVirtualMemory()
syscall into a histogram. I run
bench2.exe
in one terminal, while in the Administrator’s
cmd.exe
in another window I typed:
"C:\Program Files\DTrace\dtrace.exe" -s virtualallochist.d
That printed:
bench2.exe
value ------------- Distribution ------------- count
0 | 0
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 25497083
2 |@ 765607
4 | 42939
8 | 7796
16 | 2619
32 | 233
64 | 20
128 | 41
256 | 89
512 | 90
1024 | 46
2048 | 4
4096 | 12
8192 | 6
16384 | 4
32768 | 0
A few tens of thousands of microseconds (a few hundredths of a second) at the very most isn't anywhere near the times from my benchmarks, where append once took as much as whole seconds. I had to dig deeper.
At that point I thought: I’m not calling
NtAllocateVirtualMemory()
directly, I’m calling
VirtualAlloc()
. I should measure that. That’s a bit harder,
but still doable. VirtualAlloc()
is not a syscall of the NT
kernel, but a normal function defined in one of DLLs distributed with
Windows, so I can’t trace it with the syscall
DTrace
provider.
However, I can use the pid
provider. For tracing with
that provider, I needed to know the PID of the executable to trace, not
just the execname
. One way to get it is to use the
-c <cmd>
option to dtrace.exe
, which
starts the <cmd>
command and passes its PID to the
DTrace script as $target
. However, that approach caused
DTrace to run out of memory very fast, so I needed to find another
way.
I decided to fetch the PID with Powershell and pass it as a
positional argument to my DTrace script, which measured
VirtualAlloc()
and a couple of my own functions to
compare:
pid$1::VirtualAlloc:entry
{
self->ts[probefunc] = timestamp;
}
pid$1::VirtualAlloc:return
{
@[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}
pid$1::GetProcessTimes:entry
{
self->ts[probefunc] = timestamp;
}
pid$1::GetProcessTimes:return
{
@[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}
pid$1::"DynArray<std::unique_ptr<int,std::default_delete<int> > >::Append":entry
{
self->ts[probefunc] = timestamp;
}
pid$1::"DynArray<std::unique_ptr<int,std::default_delete<int> > >::Append":return
{
@[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}
pid$1::"DynArray<int>::Append":entry
{
self->ts[probefunc] = timestamp;
}
pid$1::"DynArray<int>::Append":return
{
@[execname,probefunc] = quantize((timestamp - self->ts[probefunc]) / 1000);
}
When optimisations are enabled, some of those DynArray
methods are inlined and eliminated from the executable. So I needed to
use bench0.exe
, the unoptimised version of the benchmark.
After starting bench0.exe
in one terminal, I typed the
following in Administrator’s Powershell:
& "C:\Program Files\DTrace\dtrace.exe" -s virtualallochist.d (Get-Process -Name "bench0").Id
That made the benchmark run several times longer than normally (~2.5 days compared to ~2 hours) in wall clock time, but CPU time seemed to be completely unaffected. (Actually, during first attempt Windows Update rebooted my computer after 1.5 days without asking me for permission.)
The results were as follows:
bench0.exe VirtualAlloc
value ------------- Distribution ------------- count
1 | 0
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 25247233
4 | 100680
8 | 7843
16 | 509
32 | 16
64 | 5
128 | 21
256 | 71
512 | 4
1024 | 2
2048 | 0
4096 | 0
8192 | 1
16384 | 0
32768 | 1
65536 | 0
131072 | 1
262144 | 1
524288 | 0
bench0.exe DynArray<std::unique_ptr<int,std::default_delete<int> > >::Append
value ------------- Distribution ------------- count
1 | 0
2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1995992981
4 | 641329
8 | 4288545
16 | 65842
32 | 380
64 | 361
128 | 3862
256 | 6963
512 | 670
1024 | 31
2048 | 8
4096 | 27
8192 | 0
16384 | 0
32768 | 1
65536 | 0
bench0.exe DynArray<int>::Append
value ------------- Distribution ------------- count
0 | 0
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 21782293396
2 | 155237404
4 | 15543570
8 | 9630471
16 | 117290
32 | 3117
64 | 1737
128 | 26619
256 | 38741
512 | 12741
1024 | 15708
2048 | 6991
4096 | 77
8192 | 24
16384 | 40
32768 | 3
65536 | 0
131072 | 1
262144 | 1
524288 | 0
bench0.exe GetProcessTimes
value ------------- Distribution ------------- count
0 | 0
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 27825184719
2 | 177795208
4 | 1094430
8 | 3752032
16 | 127442
32 | 3074
64 | 1284
128 | 13774
256 | 35437
512 | 606
1024 | 2
2048 | 1
4096 | 0
Here we can see that the outliers of VirtualAlloc() practically perfectly correspond to the outliers of DynArray<int>::Append(): 131072 microseconds for 131072 microseconds, 262144 microseconds for 262144 microseconds. The rest of the numbers don't correspond as perfectly, but generally are in the right ballpark, after dividing the amounts by 3
orders of magnitude (VirtualAlloc()
isn’t called in every
Append()
). That to me was sufficient evidence that
DynArray::Append()
latency really was determined by the
latency of VirtualAlloc()
and not by something extra that I
did.
For the other platforms, I’m only going to comment the numbers, when I see something unexpected in them.
Mac
Memory
(Table: peak RSS in bytes for DynArray and std::vector at each tested array size.)
Throughput
(Table: total construction time in seconds and throughput in Hz for DynArray and std::vector, with int and std::unique_ptr elements, at each tested length, for builds with -O0, -O1 and -O2. At the largest length, both containers were killed by the OOM killer in the std::unique_ptr benchmarks; the int benchmarks completed.)
Latency
(Table: per-append latency for the same configurations: minimum, maximum and a linear histogram of bucket counts. As in the throughput benchmarks, the std::unique_ptr runs at the largest length were killed by the OOM killer.)
FreeBSD
Memory
(Table: peak RSS in bytes for DynArray and std::vector at each tested array size. At the largest size, both benchmarks were killed by the OOM killer.)
The result for the smallest size is a bit surprising. It turns out that if I execute mem_dynarray1000 in an infinite loop, most of the time it prints one value, and other times it prints numbers which differ by several orders of magnitude between each other. It looks like a bug. Results on Mac and
Linux are stable between runs. It’s safe to assume numbers in this table
are underreported.
When it comes to the OOM killer, the first time I ran the test it killed not only the process running the benchmark, but also my SSH session. Case in point for my rant about overcommit and the OOM killer earlier.
Throughput
(Table: total construction time in seconds and throughput in Hz for DynArray and std::vector, with int and std::unique_ptr elements, at each tested length, for builds with -O0, -O1 and -O2. At the largest length, every configuration was killed by the OOM killer.)
Latency
(Table: per-append latency for the same configurations: minimum, maximum and a linear histogram of bucket counts. At the largest length, every configuration was killed by the OOM killer.)
Linux
Memory
(Table: peak RSS in bytes for DynArray and std::vector at each tested array size. At the largest size, DynArray was killed by the OOM killer, while std::vector threw std::bad_alloc.)
Interestingly, here std::vector
threw
std::bad_alloc
and the benchmark wasn’t killed by the OOM
killer. However, in later benchmarks of throughput and latency it didn’t
happen and instead the OOM killer killed the process as usual.
Throughput
(Table: total construction time in seconds and throughput in Hz for DynArray and std::vector, with int and std::unique_ptr elements, at each tested length, for builds with -O0, -O1 and -O2. At the second-largest length the std::unique_ptr benchmarks were killed by the OOM killer; at the largest length every configuration was.)
Latency
(Table: per-append latency for the same configurations: minimum, maximum and a linear histogram of bucket counts, with the same OOM-killer casualties at the two largest lengths.)
What about mremap()?
- As far as I know, there is no equivalent to it on Windows.
- It relocates objects in virtual space, so you can’t use that as a basis for an arena, which needs to keep pointers valid without modifying them.
What about shrinking?
I’ve only shown how to increase the length of DynArray
.
If you need shrinking as well, on Windows you call
VirtualFree()
with MEM_DECOMMIT
instead of
MEM_RELEASE
, on POSIX you call mprotect()
with
PROT_NONE
and madvise()
with
MADV_FREE
or MADV_DONTNEED
(whichever constant
is available).
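For completeness, a hedged sketch of what a pal::MemDecommit() built from those calls could look like (p and len are assumed to be multiples of the page size; this is my sketch, not code from the repository):
inline bool MemDecommit(void* p, size_t len) {
#ifdef _WIN32
  // Return the physical pages, but keep the address range reserved.
  return VirtualFree(p, len, MEM_DECOMMIT) != 0;
#else
  // Drop access first, then let the kernel reclaim the pages.
  if (mprotect(p, len, PROT_NONE) != 0) {
    return false;
  }
#if defined(MADV_FREE)
  return madvise(p, len, MADV_FREE) == 0;
#else
  return madvise(p, len, MADV_DONTNEED) == 0;
#endif
#endif
}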
Personally, I’ve never needed to do that, so I’m not going to bother implementing it. It seems that in my style of programming I either throw out whole data structures or keep them, but I never delete single elements. Judging from the Swiss Tables CppCon talk, I’m not alone in that.
Closing words
So that’s what I’m doing now in my side project. If you have anything to add, you may write me an email at jakub@yakubin.com.
I dedicate this post to Windows Update.