Diffstat (limited to 'doc/mimalloc-doc.h')
-rw-r--r-- | doc/mimalloc-doc.h | 1096 |
1 files changed, 1096 insertions, 0 deletions
diff --git a/doc/mimalloc-doc.h b/doc/mimalloc-doc.h new file mode 100644 index 0000000..c3e4137 --- /dev/null +++ b/doc/mimalloc-doc.h @@ -0,0 +1,1096 @@ +/* ---------------------------------------------------------------------------- +Copyright (c) 2018, Microsoft Research, Daan Leijen +This is free software; you can redistribute it and/or modify it under the +terms of the MIT license. A copy of the license can be found in the file +"license.txt" at the root of this distribution. +-----------------------------------------------------------------------------*/ + +#error "documentation file only!" + + +/*! \mainpage + +This is the API documentation of the +[mimalloc](https://github.com/koka-lang/mimalloc) allocator +(pronounced "me-malloc") -- a +general purpose allocator with excellent [performance](bench.html) +characteristics. Initially +developed by Daan Leijen for the run-time systems of the +[Koka](https://github.com/koka-lang/koka) and [Lean](https://github.com/leanprover/lean) languages. + +Notable aspects of the design include: + +- __small and consistent__: the library is less than 3500 LOC using simple and + consistent data structures. This makes it very suitable + to integrate and adapt in other projects. For runtime systems it + provides hooks for a monotonic _heartbeat_ and deferred freeing (for + bounded worst-case times with reference counting). +- __free list sharding__: "the big idea": instead of one big free list (per size class) we have + many smaller lists per memory "page" which both reduces fragmentation + and increases locality -- + things that are allocated close in time get allocated close in memory. + (A memory "page" in mimalloc contains blocks of one size class and is + usually 64KB on a 64-bit system). +- __eager page reset__: when a "page" becomes empty (with increased chance + due to free list sharding) the memory is marked to the OS as unused ("reset" or "purged") + reducing (real) memory pressure and fragmentation, especially in long running + programs. +- __lazy initialization__: pages in a segment are lazily initialized so + no memory is touched until it becomes allocated, reducing the resident + memory and potential page faults. +- __bounded__: it does not suffer from _blowup_ \[1\], has bounded worst-case allocation + times (_wcat_), bounded space overhead (~0.2% meta-data, with at most 16.7% waste in allocation sizes), + and has no internal points of contention using atomic operations almost + everywhere. + +Further information: + +- \ref build +- \ref using +- \ref overrides +- \ref bench +- \ref malloc +- \ref extended +- \ref aligned +- \ref heap +- \ref typed +- \ref analysis +- \ref options + +*/ + + +/// \defgroup malloc Basic Allocation +/// The basic allocation interface. +/// \{ + + +/// Free previously allocated memory. +/// The pointer `p` must have been allocated before (or be \a NULL). +/// @param p pointer to free, or \a NULL. +void mi_free(void* p); + +/// Allocate \a size bytes. +/// @param size number of bytes to allocate. +/// @returns pointer to the allocated memory or \a NULL if out of memory. +/// Returns a unique pointer if called with \a size 0. +void* mi_malloc(size_t size); + +/// Allocate zero-initialized `size` bytes. +/// @param size The size in bytes. +/// @returns Pointer to newly allocated zero initialized memory, +/// or \a NULL if out of memory. +void* mi_zalloc(size_t size); + +/// Allocate zero-initialized \a count elements of \a size bytes. +/// @param count number of elements. +/// @param size size of each element. 
+/// @returns pointer to the allocated memory +/// of \a size*\a count bytes, or \a NULL if either out of memory +/// or when `count*size` overflows. +/// +/// Returns a unique pointer if called with either \a size or \a count of 0. +/// @see mi_zalloc() +void* mi_calloc(size_t count, size_t size); + +/// Re-allocate memory to \a newsize bytes. +/// @param p pointer to previously allocated memory (or \a NULL). +/// @param newsize the new required size in bytes. +/// @returns pointer to the re-allocated memory +/// of \a newsize bytes, or \a NULL if out of memory. +/// If \a NULL is returned, the pointer \a p is not freed. +/// Otherwise the original pointer is either freed or returned +/// as the reallocated result (in case it fits in-place with the +/// new size). If the pointer \a p is \a NULL, it behaves as +/// \a mi_malloc(\a newsize). If \a newsize is larger than the +/// original \a size allocated for \a p, the bytes after \a size +/// are uninitialized. +void* mi_realloc(void* p, size_t newsize); + +/// Reallocate memory to \a newsize bytes, with extra memory initialized to zero. +/// @param p Pointer to a previously allocated block (or \a NULL). +/// @param newsize The new required size in bytes. +/// @returns A pointer to a re-allocated block of \a newsize bytes, or \a NULL +/// if out of memory. +/// +/// If the \a newsize is larger than the original allocated size of \a p, +/// the extra bytes are initialized to zero. +void* mi_rezalloc(void* p, size_t newsize); + +/// Re-allocate memory to \a count elements of \a size bytes, with extra memory initialized to zero. +/// @param p Pointer to a previously allocated block (or \a NULL). +/// @param count The number of elements. +/// @param size The size of each element. +/// @returns A pointer to a re-allocated block of \a count * \a size bytes, or \a NULL +/// if out of memory or if \a count * \a size overflows. +/// +/// If there is no overflow, it behaves exactly like `mi_rezalloc(p,count*size)`. +/// @see mi_reallocn() +/// @see mi_rezalloc() +/// @see [recallocarray()](http://man.openbsd.org/reallocarray) (on BSD). +void* mi_recalloc(void* p, size_t count, size_t size); + + +/// Try to re-allocate memory to \a newsize bytes _in place_. +/// @param p pointer to previously allocated memory (or \a NULL). +/// @param newsize the new required size in bytes. +/// @returns pointer to the re-allocated memory +/// of \a newsize bytes (always equal to \a p), +/// or \a NULL if either out of memory or if +/// the memory could not be expanded in place. +/// If \a NULL is returned, the pointer \a p is not freed. +/// Otherwise the original pointer is returned +/// as the reallocated result since it fits in-place with the +/// new size. If \a newsize is larger than the +/// original \a size allocated for \a p, the bytes after \a size +/// are uninitialized. +void* mi_expand(void* p, size_t newsize); + +/// Allocate \a count elements of \a size bytes. +/// @param count The number of elements. +/// @param size The size of each element. +/// @returns A pointer to a block of \a count * \a size bytes, or \a NULL +/// if out of memory or if \a count * \a size overflows. +/// +/// If there is no overflow, it behaves exactly like `mi_malloc(p,count*size)`. +/// @see mi_calloc() +/// @see mi_zallocn() +void* mi_mallocn(size_t count, size_t size); + +/// Re-allocate memory to \a count elements of \a size bytes. +/// @param p Pointer to a previously allocated block (or \a NULL). +/// @param count The number of elements. 
+/// @param size The size of each element. +/// @returns A pointer to a re-allocated block of \a count * \a size bytes, or \a NULL +/// if out of memory or if \a count * \a size overflows. +/// +/// If there is no overflow, it behaves exactly like `mi_realloc(p,count*size)`. +/// @see mi_recalloc() +/// @see [reallocarray()](<http://man.openbsd.org/reallocarray>) (on BSD) +void* mi_reallocn(void* p, size_t count, size_t size); + +/// Re-allocate memory to \a newsize bytes, +/// @param p pointer to previously allocated memory (or \a NULL). +/// @param newsize the new required size in bytes. +/// @returns pointer to the re-allocated memory +/// of \a newsize bytes, or \a NULL if out of memory. +/// +/// In contrast to mi_realloc(), if \a NULL is returned, the original pointer +/// \a p is freed (if it was not \a NULL itself). +/// Otherwise the original pointer is either freed or returned +/// as the reallocated result (in case it fits in-place with the +/// new size). If the pointer \a p is \a NULL, it behaves as +/// \a mi_malloc(\a newsize). If \a newsize is larger than the +/// original \a size allocated for \a p, the bytes after \a size +/// are uninitialized. +/// +/// @see [reallocf](https://www.freebsd.org/cgi/man.cgi?query=reallocf) (on BSD) +void* mi_reallocf(void* p, size_t newsize); + + +/// Allocate and duplicate a string. +/// @param s string to duplicate (or \a NULL). +/// @returns a pointer to newly allocated memory initialized +/// to string \a s, or \a NULL if either out of memory or if +/// \a s is \a NULL. +/// +/// Replacement for the standard [strdup()](http://pubs.opengroup.org/onlinepubs/9699919799/functions/strdup.html) +/// such that mi_free() can be used on the returned result. +char* mi_strdup(const char* s); + +/// Allocate and duplicate a string up to \a n bytes. +/// @param s string to duplicate (or \a NULL). +/// @param n maximum number of bytes to copy (excluding the terminating zero). +/// @returns a pointer to newly allocated memory initialized +/// to string \a s up to the first \a n bytes (and always zero terminated), +/// or \a NULL if either out of memory or if \a s is \a NULL. +/// +/// Replacement for the standard [strndup()](http://pubs.opengroup.org/onlinepubs/9699919799/functions/strndup.html) +/// such that mi_free() can be used on the returned result. +char* mi_strndup(const char* s, size_t n); + +/// Resolve a file path name. +/// @param fname File name. +/// @param resolved_name Should be \a NULL (but can also point to a buffer +/// of at least \a PATH_MAX bytes). +/// @returns If successful a pointer to the resolved absolute file name, or +/// \a NULL on failure (with \a errno set to the error code). +/// +/// If \a resolved_name was \a NULL, the returned result should be freed with +/// mi_free(). +/// +/// Replacement for the standard [realpath()](http://pubs.opengroup.org/onlinepubs/9699919799/functions/realpath.html) +/// such that mi_free() can be used on the returned result (if \a resolved_name was \a NULL). +char* mi_realpath(const char* fname, char* resolved_name); + +/// \} + +// ------------------------------------------------------ +// Extended functionality +// ------------------------------------------------------ + +/// \defgroup extended Extended Functions +/// Extended functionality. +/// \{ + +/// Maximum size allowed for small allocations in +/// #mi_malloc_small and #mi_zalloc_small (usually `128*sizeof(void*)` (= 1KB on 64-bit systems)) +#define MI_SMALL_SIZE_MAX (128*sizeof(void*)) + +/// Allocate a small object. 
+/// @param size The size in bytes, can be at most #MI_SMALL_SIZE_MAX. +/// @returns a pointer to newly allocated memory of at least \a size +/// bytes, or \a NULL if out of memory. +/// This function is meant for use in run-time systems for best +/// performance and does not check if \a size was indeed small -- use +/// with care! +void* mi_malloc_small(size_t size); + +/// Allocate a zero initialized small object. +/// @param size The size in bytes, can be at most #MI_SMALL_SIZE_MAX. +/// @returns a pointer to newly allocated zero-initialized memory of at +/// least \a size bytes, or \a NULL if out of memory. +/// This function is meant for use in run-time systems for best +/// performance and does not check if \a size was indeed small -- use +/// with care! +void* mi_zalloc_small(size_t size); + +/// Return the available bytes in a memory block. +/// @param p Pointer to previously allocated memory (or \a NULL) +/// @returns Returns the available bytes in the memory block, or +/// 0 if \a p was \a NULL. +/// +/// The returned size can be +/// used to call \a mi_expand successfully. +/// The returned size is always at least equal to the +/// allocated size of \a p, and, in the current design, +/// should be less than 16.7% more. +/// +/// @see [_msize](https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/msize?view=vs-2017) (Windows) +/// @see [malloc_usable_size](http://man7.org/linux/man-pages/man3/malloc_usable_size.3.html) (Linux) +/// @see mi_good_size() +size_t mi_usable_size(void* p); + +/// Return the used allocation size. +/// @param size The minimal required size in bytes. +/// @returns the size `n` that will be allocated, where `n >= size`. +/// +/// Generally, `mi_usable_size(mi_malloc(size)) == mi_good_size(size)`. +/// This can be used to reduce internal wasted space when +/// allocating buffers for example. +/// +/// @see mi_usable_size() +size_t mi_good_size(size_t size); + +/// Eagerly free memory. +/// @param force If \a true, aggressively return memory to the OS (can be expensive!) +/// +/// Regular code should not have to call this function. It can be beneficial +/// in very narrow circumstances; in particular, when a long running thread +/// allocates a lot of blocks that are freed by other threads it may improve +/// resource usage by calling this every once in a while. +void mi_collect(bool force); + +/// Print statistics. +/// @param out Output file. Use \a NULL for \a stderr. +/// +/// Most detailed when using a debug build. +void mi_stats_print(FILE* out); + +/// Reset statistics. +void mi_stats_reset(); + +/// Initialize mimalloc on a thread. +/// Should not be used as on most systems (pthreads, windows) this is done +/// automatically. +void mi_thread_init(); + +/// Uninitialize mimalloc on a thread. +/// Should not be used as on most systems (pthreads, windows) this is done +/// automatically. Ensures that any memory that is not freed yet (but will +/// be freed by other threads in the future) is properly handled. +void mi_thread_done(); + +/// Print out heap statistics for this thread. +/// @param out Output file. Use \a NULL for \a stderr. +/// +/// Most detailed when using a debug build. +void mi_thread_stats_print(FILE* out); + +/// Type of deferred free functions. +/// @param force If \a true all outstanding items should be freed. +/// @param heartbeat A monotonically increasing count. 
+/// +/// @see mi_register_deferred_free +typedef void (mi_deferred_free_fun)(bool force, unsigned long long heartbeat); + +/// Register a deferred free function. +/// @param deferred_free Address of a deferred free-ing function or \a NULL to unregister. +/// +/// Some runtime systems use deferred free-ing, for example when using +/// reference counting to limit the worst case free time. +/// Such systems can register (re-entrant) deferred free function +/// to free more memory on demand. When the \a force parameter is +/// \a true all possible memory should be freed. +/// The per-thread \a heartbeat parameter is monotonically increasing +/// and guaranteed to be deterministic if the program allocates +/// deterministically. The \a deferred_free function is guaranteed +/// to be called deterministically after some number of allocations +/// (regardless of freeing or available free memory). +/// At most one \a deferred_free function can be active. +void mi_register_deferred_free(mi_deferred_free_fun* deferred_free); + +/// \} + +// ------------------------------------------------------ +// Aligned allocation +// ------------------------------------------------------ + +/// \defgroup aligned Aligned Allocation +/// +/// Allocating aligned memory blocks. +/// +/// \{ + +/// Allocate \a size bytes aligned by \a alignment. +/// @param size number of bytes to allocate. +/// @param alignment the minimal alignment of the allocated memory. +/// @returns pointer to the allocated memory or \a NULL if out of memory. +/// The returned pointer is aligned by \a alignment, i.e. +/// `(uintptr_t)p % alignment == 0`. +/// +/// Returns a unique pointer if called with \a size 0. +/// @see [_aligned_malloc](https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/aligned-malloc?view=vs-2017) (on Windows) +/// @see [aligned_alloc](http://man.openbsd.org/reallocarray) (on BSD, with switched arguments!) +/// @see [posix_memalign](https://linux.die.net/man/3/posix_memalign) (on Posix, with switched arguments!) +/// @see [memalign](https://linux.die.net/man/3/posix_memalign) (on Linux, with switched arguments!) +void* mi_malloc_aligned(size_t size, size_t alignment); +void* mi_zalloc_aligned(size_t size, size_t alignment); +void* mi_calloc_aligned(size_t count, size_t size, size_t alignment); +void* mi_realloc_aligned(void* p, size_t newsize, size_t alignment); +void* mi_rezalloc_aligned(void* p, size_t newsize, size_t alignment); +void* mi_recalloc_aligned(void* p, size_t count, size_t size, size_t alignment); + +/// Allocate \a size bytes aligned by \a alignment at a specified \a offset. +/// @param size number of bytes to allocate. +/// @param alignment the minimal alignment of the allocated memory at \a offset. +/// @param offset the offset that should be aligned. +/// @returns pointer to the allocated memory or \a NULL if out of memory. +/// The returned pointer is aligned by \a alignment at \a offset, i.e. +/// `((uintptr_t)p + offset) % alignment == 0`. +/// +/// Returns a unique pointer if called with \a size 0. 
+/// @see [_aligned_offset_malloc](https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/aligned-offset-malloc?view=vs-2017) (on Windows) +void* mi_malloc_aligned_at(size_t size, size_t alignment, size_t offset); +void* mi_zalloc_aligned_at(size_t size, size_t alignment, size_t offset); +void* mi_calloc_aligned_at(size_t count, size_t size, size_t alignment, size_t offset); +void* mi_realloc_aligned_at(void* p, size_t newsize, size_t alignment, size_t offset); +void* mi_rezalloc_aligned_at(void* p, size_t newsize, size_t alignment, size_t offset); +void* mi_recalloc_aligned_at(void* p, size_t count, size_t size, size_t alignment, size_t offset); + +/// \} + +/// \defgroup heap Heap Allocation +/// +/// First-class heaps that can be destroyed in one go. +/// +/// \{ + +/// Type of first-class heaps. +/// A heap can only be used for allocation in +/// the thread that created this heap! Any allocated +/// blocks can be freed or reallocated by any other thread though. +struct mi_heap_s; + +/// Type of first-class heaps. +/// A heap can only be used for (re)allocation in +/// the thread that created this heap! Any allocated +/// blocks can be freed by any other thread though. +typedef struct mi_heap_s mi_heap_t; + +/// Create a new heap that can be used for allocation. +mi_heap_t* mi_heap_new(); + +/// Delete a previously allocated heap. +/// This will release resources and migrate any +/// still allocated blocks in this heap (efficienty) +/// to the default heap. +/// +/// If \a heap is the default heap, the default +/// heap is set to the backing heap. +void mi_heap_delete(mi_heap_t* heap); + +/// Destroy a heap, freeing all its still allocated blocks. +/// Use with care as this will free all blocks still +/// allocated in the heap. However, this can be a very +/// efficient way to free all heap memory in one go. +/// +/// If \a heap is the default heap, the default +/// heap is set to the backing heap. +void mi_heap_destroy(mi_heap_t* heap); + +/// Set the default heap to use for mi_malloc() et al. +/// @param heap The new default heap. +/// @returns The previous default heap. +mi_heap_t* mi_heap_set_default(mi_heap_t* heap); + +/// Get the default heap that is used for mi_malloc() et al. +/// @returns The current default heap. +mi_heap_t* mi_heap_get_default(); + +/// Get the backing heap. +/// The _backing_ heap is the initial default heap for +/// a thread and always available for allocations. +/// It cannot be destroyed or deleted +/// except by exiting the thread. +mi_heap_t* mi_heap_get_backing(); + +/// Allocate in a specific heap. +/// @see mi_malloc() +void* mi_heap_malloc(mi_heap_t* heap, size_t size); + +/// Allocate zero-initialized in a specific heap. +/// @see mi_zalloc() +void* mi_heap_zalloc(mi_heap_t* heap, size_t size); + +/// Allocate \a count zero-initialized elements in a specific heap. +/// @see mi_calloc() +void* mi_heap_calloc(mi_heap_t* heap, size_t count, size_t size); + +/// Allocate \a count elements in a specific heap. +/// @see mi_mallocn() +void* mi_heap_mallocn(mi_heap_t* heap, size_t count, size_t size); + +/// Duplicate a string in a specific heap. +/// @see mi_strdup() +char* mi_heap_strdup(mi_heap_t* heap, const char* s); + +/// Duplicate a string of at most length \a n in a specific heap. +/// @see mi_strndup() +char* mi_heap_strndup(mi_heap_t* heap, const char* s, size_t n); + +/// Resolve a file path name using a specific \a heap to allocate the result. 
+/// @see mi_realpath()
+char* mi_heap_realpath(mi_heap_t* heap, const char* fname, char* resolved_name);
+
+/// \}
+
+/// \defgroup typed Typed Macros
+///
+/// Typed allocation macros.
+///
+/// \{
+
+/// Allocate a block of type \a tp.
+/// @param tp The type of the block to allocate.
+/// @returns A pointer to an object of type \a tp, or
+/// \a NULL if out of memory.
+///
+/// **Example:**
+/// ```
+/// int* p = mi_malloc_tp(int);
+/// ```
+///
+/// @see mi_malloc()
+#define mi_malloc_tp(tp) ((tp*)mi_malloc(sizeof(tp)))
+
+/// Allocate a zero-initialized block of type \a tp.
+#define mi_zalloc_tp(tp) ((tp*)mi_zalloc(sizeof(tp)))
+
+/// Allocate \a count zero-initialized blocks of type \a tp.
+#define mi_calloc_tp(tp,count) ((tp*)mi_calloc(count,sizeof(tp)))
+
+/// Allocate \a count blocks of type \a tp.
+#define mi_mallocn_tp(tp,count) ((tp*)mi_mallocn(count,sizeof(tp)))
+
+/// Re-allocate to \a count blocks of type \a tp.
+#define mi_reallocn_tp(p,tp,count) ((tp*)mi_reallocn(p,count,sizeof(tp)))
+
+/// Re-allocate to \a count zero-initialized blocks of type \a tp.
+#define mi_recalloc_tp(p,tp,count) ((tp*)mi_recalloc(p,count,sizeof(tp)))
+
+/// Allocate a block of type \a tp in a heap \a hp.
+#define mi_heap_malloc_tp(hp,tp) ((tp*)mi_heap_malloc(hp,sizeof(tp)))
+
+/// Allocate a zero-initialized block of type \a tp in a heap \a hp.
+#define mi_heap_zalloc_tp(hp,tp) ((tp*)mi_heap_zalloc(hp,sizeof(tp)))
+
+/// Allocate \a count zero-initialized blocks of type \a tp in a heap \a hp.
+#define mi_heap_calloc_tp(hp,tp,count) ((tp*)mi_heap_calloc(hp,count,sizeof(tp)))
+
+/// Allocate \a count blocks of type \a tp in a heap \a hp.
+#define mi_heap_mallocn_tp(hp,tp,count) ((tp*)mi_heap_mallocn(hp,count,sizeof(tp)))
+
+/// \}
+
+/// \defgroup analysis Heap Introspection
+///
+/// Inspect the heap at runtime.
+///
+/// \{
+
+/// Does a heap contain a pointer to a previously allocated block?
+/// @param heap The heap.
+/// @param p Pointer to a previously allocated block (in any heap) -- cannot be some
+/// random pointer!
+/// @returns \a true if the block pointed to by \a p is in the \a heap.
+/// @see mi_heap_check_owned()
+bool mi_heap_contains_block(mi_heap_t* heap, const void* p);
+
+/// Check safely if any pointer is part of a heap.
+/// @param heap The heap.
+/// @param p Any pointer -- not required to be previously allocated by us.
+/// @returns \a true if \a p points to a block in \a heap.
+///
+/// Note: expensive function, linear in the pages in the heap.
+/// @see mi_heap_contains_block()
+/// @see mi_heap_get_default()
+bool mi_heap_check_owned(mi_heap_t* heap, const void* p);
+
+/// Check safely if any pointer is part of the default heap of this thread.
+/// @param p Any pointer -- not required to be previously allocated by us.
+/// @returns \a true if \a p points to a block in the default heap of this thread.
+///
+/// Note: expensive function, linear in the pages in the heap.
+/// @see mi_heap_contains_block()
+/// @see mi_heap_get_default()
+bool mi_check_owned(const void* p);
+
+/// An area of heap space contains blocks of a single size.
+/// The bytes in freed blocks are `committed - used`.
+typedef struct mi_heap_area_s { + void* blocks; ///< start of the area containing heap blocks + size_t reserved; ///< bytes reserved for this area + size_t committed; ///< current committed bytes of this area + size_t used; ///< bytes in use by allocated blocks + size_t block_size; ///< size in bytes of one block +} mi_heap_area_t; + +/// Visitor function passed to mi_heap_visit_blocks() +/// @returns \a true if ok, \a false to stop visiting (i.e. break) +/// +/// This function is always first called for every \a area +/// with \a block as a \a NULL pointer. If \a visit_all_blocks +/// was \a true, the function is then called for every allocated +/// block in that area. +typedef bool (mi_block_visit_fun)(const mi_heap_t* heap, const mi_heap_area_t* area, void* block, size_t block_size, void* arg); + +/// Visit all areas and blocks in a heap. +/// @param heap The heap to visit. +/// @param visit_all_blocks If \a true visits all allocated blocks, otherwise +/// \a visitor is only called for every heap area. +/// @param visitor This function is called for every area in the heap +/// (with \a block as \a NULL). If \a visit_all_blocks is +/// \a true, \a visitor is also called for every allocated +/// block in every area (with `block!=NULL`). +/// return \a false from this function to stop visiting early. +/// @param arg Extra argument passed to \a visitor. +/// @returns \a true if all areas and blocks were visited. +bool mi_heap_visit_blocks(const mi_heap_t* heap, bool visit_all_blocks, mi_block_visit_fun* visitor, void* arg); + +/// \} + +/// \defgroup options Runtime Options +/// +/// Set runtime behavior. +/// +/// \{ + +/// Runtime options. +typedef enum mi_option_e { + mi_option_page_reset, ///< Reset page memory when it becomes free. + mi_option_cache_reset, ///< Reset segment memory when a segment is cached. + mi_option_pool_commit, ///< Commit segments in large pools. + mi_option_show_stats, ///< Print statistics to `stderr` when the program is done. + mi_option_show_errors, ///< Print error messages to `stderr`. + mi_option_verbose, ///< Print verbose messages to `stderr`. + _mi_option_last +} mi_option_t; + +bool mi_option_enabled(mi_option_t option); +void mi_option_enable(mi_option_t option, bool enable); +void mi_option_enable_default(mi_option_t option, bool enable); + +long mi_option_get(mi_option_t option); +void mi_option_set(mi_option_t option, long value); +void mi_option_set_default(mi_option_t option, long value); + + +/// \} + +/*! \page build Building + +Checkout the sources from Github: +``` +git clone https://github.com/koka-lang/mimalloc.git +``` + +## Windows + +Open `ide/vs2017/mimalloc.sln` in Visual Studio 2017 and build. +The `mimalloc` project builds a static library, while the +`mimalloc-override` project builds a DLL for overriding malloc +in the entire program. + +## MacOSX, Linux, BSD, etc. + +We use [`cmake`](https://cmake.org)<sup>1</sup> as the build system: + +- `mkdir -p out/release` (create a build directory) +- `cd out/release` (go to it) +- `cmake ../..` (generate the make file) +- `make` (and build) + + This will build the library as a shared (dynamic) + library (`.so` or `.dylib`), a static library (`.a`), and + as a single object file (`.o`). + +- `sudo make install` (install the library and header files in `/usr/lib` and `/usr/include`) + + Use the option `-DCMAKE_INSTALL_PREFIX=../local` (for example) to the `ccmake` + command to install to a local directory to see what gets installed. 
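+
+Putting the release steps together, an out-of-source build that installs into a
+local prefix might look as follows (a sketch only -- the `out/local` prefix is
+just an illustration of the `-DCMAKE_INSTALL_PREFIX` option mentioned above):
+
+```
+git clone https://github.com/koka-lang/mimalloc.git
+cd mimalloc
+mkdir -p out/release
+cd out/release
+cmake -DCMAKE_INSTALL_PREFIX=../local ../..   # configure a release build
+make                                          # build the shared, static, and object-file targets
+make install                                  # installs under out/local instead of /usr
+```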
+ +You can build the debug version which does many internal checks and +maintains detailed statistics as: + +- `mkdir -p out/debug` +- `cd out/debug` +- `cmake -DCMAKE_BUILD_TYPE=Debug ../..` +- `make` + + This will name the shared library as `libmimalloc-debug.so`. + +Or build with `clang`: + +- `CC=clang cmake ../..` + +Use `ccmake`<sup>2</sup> instead of `cmake` +to see and customize all the available build options. + +Notes: +1. Install CMake: `sudo apt-get install cmake` +2. Install CCMake: `sudo apt-get install cmake-curses-gui` +*/ + +/*! \page using Using the library + +The preferred usage is including `<mimalloc.h>`, linking with +the shared- or static library, and using the `mi_malloc` API exclusively for allocation. For example, +``` +gcc -o myprogram -lmimalloc myfile.c +``` + +mimalloc uses only safe OS calls (`mmap` and `VirtualAlloc`) and can co-exist +with other allocators linked to the same program. +If you use `cmake`, you can simply use: +``` +find_package(mimalloc 1.0 REQUIRED) +``` +in your `CMakeLists.txt` to find a locally installed mimalloc. Then use either: +``` +target_link_libraries(myapp PUBLIC mimalloc) +``` +to link with the shared (dynamic) library, or: +``` +target_link_libraries(myapp PUBLIC mimalloc-static) +``` +to link with the static library. See `test\CMakeLists.txt` for an example. + + +You can pass environment variables to print verbose messages (`MIMALLOC_VERBOSE=1`) +and statistics (`MIMALLOC_STATS=1`) (in the debug version): +``` +> env MIMALLOC_STATS=1 ./cfrac 175451865205073170563711388363 + +175451865205073170563711388363 = 374456281610909315237213 * 468551 + +heap stats: peak total freed unit +normal 2: 16.4 kb 17.5 mb 17.5 mb 16 b ok +normal 3: 16.3 kb 15.2 mb 15.2 mb 24 b ok +normal 4: 64 b 4.6 kb 4.6 kb 32 b ok +normal 5: 80 b 118.4 kb 118.4 kb 40 b ok +normal 6: 48 b 48 b 48 b 48 b ok +normal 17: 960 b 960 b 960 b 320 b ok + +heap stats: peak total freed unit + normal: 33.9 kb 32.8 mb 32.8 mb 1 b ok + huge: 0 b 0 b 0 b 1 b ok + total: 33.9 kb 32.8 mb 32.8 mb 1 b ok +malloc requested: 32.8 mb + + committed: 58.2 kb 58.2 kb 58.2 kb 1 b ok + reserved: 2.0 mb 2.0 mb 2.0 mb 1 b ok + reset: 0 b 0 b 0 b 1 b ok + segments: 1 1 1 +-abandoned: 0 + pages: 6 6 6 +-abandoned: 0 + mmaps: 3 + mmap fast: 0 + mmap slow: 1 + threads: 0 + elapsed: 2.022s + process: user: 1.781s, system: 0.016s, faults: 756, reclaims: 0, rss: 2.7 mb +``` + +The above model of using the `mi_` prefixed API is not always possible +though in existing programs that already use the standard malloc interface, +and another option is to override the standard malloc interface +completely and redirect all calls to the _mimalloc_ library instead. + +See \ref overrides for more info. + +*/ + +/*! \page overrides Overriding Malloc + +Overriding standard malloc can be done either _dynamically_ or _statically_. + +## Dynamic override + +This is the recommended way to override the standard malloc interface. + +### Unix, BSD, MacOSX + +On these system we preload the mimalloc shared +library so all calls to the standard `malloc` interface are +resolved to the _mimalloc_ library. + +- `env LD_PRELOAD=/usr/lib/libmimalloc.so myprogram` (on Linux, BSD, etc.) +- `env DYLD_INSERT_LIBRARIES=usr/lib/libmimalloc.dylib myprogram` (On MacOSX) + + Note certain security restrictions may apply when doing this from + the [shell](https://stackoverflow.com/questions/43941322/dyld-insert-libraries-ignored-when-calling-application-through-bash). 
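+
+For repeated use, the preload can also be wrapped in a tiny launcher script
+(a sketch only -- the script name and library path are illustrative and assume
+the library was installed to `/usr/lib`):
+
+```
+#!/bin/sh
+# with-mimalloc: run any command with the mimalloc shared library preloaded (Linux)
+export LD_PRELOAD=/usr/lib/libmimalloc.so
+exec "$@"
+```
+
+Invoking `./with-mimalloc myprogram` then resolves the standard `malloc` calls of
+`myprogram` to the preloaded mimalloc library.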
+ +You can set extra environment variables to check that mimalloc is running, +like: +``` +env MIMALLOC_VERBOSE=1 LD_PRELOAD=/usr/lib/libmimalloc.so myprogram +``` +or run with the debug version to get detailed statistics: +``` +env MIMALLOC_STATS=1 LD_PRELOAD=/usr/lib/libmimallocd.so myprogram +``` + +### Windows + +On Windows you need to link your program explicitly with the mimalloc +DLL, and use the C-runtime library as a DLL (the `/MD` or `/MDd` switch). +To ensure the mimalloc DLL gets loaded it is easiest to insert some +call to the mimalloc API in the `main` function, like `mi_version()`. + +Due to the way mimalloc overrides the standard malloc at runtime, it is best +to link to the mimalloc import library first on the command line so it gets +loaded right after the universal C runtime DLL (`ucrtbase`). See +the `mimalloc-override-test` project for an example. + + +## Static override + +You can also statically link with _mimalloc_ to override the standard +malloc interface. The recommended way is to link the final program with the +_mimalloc_ single object file (`mimalloc-override.o` (or `.obj`)). We use +an object file instead of a library file as linkers give preference to +that over archives to resolve symbols. To ensure that the standard +malloc interface resolves to the _mimalloc_ library, link it as the first +object file. For example: + +``` +gcc -o myprogram mimalloc-override.o myfile1.c ... +``` + + + +## List of Overrides: + +The specific functions that get redirected to the _mimalloc_ library are: + +``` +// C +void* malloc(size_t size); +void* calloc(size_t size, size_t n); +void* realloc(void* p, size_t newsize); +void free(void* p); + +// C++ +void operator delete(void* p); +void operator delete[](void* p); + +void* operator new(std::size_t n) noexcept(false); +void* operator new[](std::size_t n) noexcept(false); +void* operator new( std::size_t n, std::align_val_t align) noexcept(false); +void* operator new[]( std::size_t n, std::align_val_t align) noexcept(false); + +void* operator new ( std::size_t count, const std::nothrow_t& tag); +void* operator new[]( std::size_t count, const std::nothrow_t& tag); +void* operator new ( std::size_t count, std::align_val_t al, const std::nothrow_t&); +void* operator new[]( std::size_t count, std::align_val_t al, const std::nothrow_t&); + +// Posix +int posix_memalign(void** p, size_t alignment, size_t size); + +// Linux +void* memalign(size_t alignment, size_t size); +void* aligned_alloc(size_t alignment, size_t size); +void* valloc(size_t size); +void* pvalloc(size_t size); +size_t malloc_usable_size(void *p); + +// BSD +void* reallocarray( void* p, size_t count, size_t size ); +void* reallocf(void* p, size_t newsize); +void cfree(void* p); + +// Windows +void* _expand(void* p, size_t newsize); +size_t _msize(void* p); + +void* _malloc_dbg(size_t size, int block_type, const char* fname, int line); +void* _realloc_dbg(void* p, size_t newsize, int block_type, const char* fname, int line); +void* _calloc_dbg(size_t count, size_t size, int block_type, const char* fname, int line); +void* _expand_dbg(void* p, size_t size, int block_type, const char* fname, int line); +size_t _msize_dbg(void* p, int block_type); +void _free_dbg(void* p, int block_type); +``` + +*/ + +/*! \page bench Performance + + +tldr: In our benchmarks, mimalloc always outperforms +all other leading allocators (jemalloc, tcmalloc, hoard, and glibc), and usually +uses less memory (with less then 25% more in the worst case) (as of Jan 2019). 
+A nice property is that it does consistently well over a wide range of benchmarks. + +Disclaimer: allocators are interesting as there is no optimal algorithm -- for +a given allocator one can always construct a workload where it does not do so well. +The goal is thus to find an allocation strategy that performs well over a wide +range of benchmarks without suffering from underperformance in less +common situations (which is what our second benchmark set tests for). + + +## Benchmarking + +We tested _mimalloc_ with 5 other allocators over 11 benchmarks. +The tested allocators are: + +- **mi**: The mimalloc allocator (version tag `v1.0.0`). +- **je**: [jemalloc](https://github.com/jemalloc/jemalloc), by [Jason Evans](https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919) (Facebook); + currently (2018) one of the leading allocators and is widely used, for example + in BSD, Firefox, and at Facebook. Installed as package `libjemalloc-dev:amd64/bionic 3.6.0-11`. +- **tc**: [tcmalloc](https://github.com/gperftools/gperftools), by Google as part of the performance tools. + Highly performant and used in the Chrome browser. Installed as package `libgoogle-perftools-dev:amd64/bionic 2.5-2.2ubuntu3`. +- **jx**: A compiled version of a more recent instance of [jemalloc](https://github.com/jemalloc/jemalloc). + Using commit ` 7a815c1b` ([dev](https://github.com/jemalloc/jemalloc/tree/dev), 2019-01-15). +- **hd**: [Hoard](https://github.com/emeryberger/Hoard), by Emery Berger \[1]. + One of the first multi-thread scalable allocators. + ([master](https://github.com/emeryberger/Hoard), 2019-01-01, version tag `3.13`) +- **mc**: The system allocator. Here we use the LibC allocator (which is originally based on + PtMalloc). Using version 2.27. (Note that version 2.26 significantly improved scalability over + earlier versions). + +All allocators run exactly the same benchmark programs and use `LD_PRELOAD` to override the system allocator. +The wall-clock elapsed time and peak resident memory (_rss_) are +measured with the `time` program. The best scores over 5 runs are used. +Performance is reported relative to mimalloc, e.g. a time of 66% means that +mimalloc ran 1.5× faster (i.e. that mimalloc finished in 66% of the time +that the other allocator needed). + +## On a 16-core AMD EPYC running Linux + +Testing on a big Amazon EC2 instance ([r5a.4xlarge](https://aws.amazon.com/ec2/instance-types/)) +consisting of a 16-core AMD EPYC 7000 at 2.5GHz +with 128GB ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0. + + +The first benchmark set consists of programs that allocate a lot. Relative +elapsed time: + + + +and memory usage: + + + +The benchmarks above are (with N=16 in our case): + +- __cfrac__: by Dave Barrett, implementation of continued fraction factorization: + uses many small short-lived allocations. Factorizes as `./cfrac 175451865205073170563711388363274837927895`. +- __espresso__: a programmable logic array analyzer \[3]. +- __barnes__: a hierarchical n-body particle solver \[4]. Simulates 163840 particles. +- __leanN__: by Leonardo de Moura _et al_, the [lean](https://github.com/leanprover/lean) + compiler, version 3.4.1, compiling its own standard library concurrently using N cores (`./lean --make -j N`). + Big real-world workload with intensive allocation, takes about 1:40s when running on a + single high-end core. 
+- __redis__: running the [redis](https://redis.io/) 5.0.3 server on
+  1 million requests pushing 10 new list elements and then requesting the
+  head 10 elements. Measures the requests handled per second.
+- __alloc-test__: a modern [allocator test](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/)
+  developed by OLogN Technologies AG at [ITHare.com](http://ithare.com). Simulates intensive allocation workloads with a Pareto
+  size distribution. The `alloc-testN` benchmark runs on N cores doing 100×10<sup>6</sup>
+  allocations per thread with objects up to 1KB in size.
+  Using commit `94f6cb` ([master](https://github.com/node-dot-cpp/alloc-test), 2018-07-04).
+
+We can see that mimalloc outperforms the other allocators moderately, but all
+these modern allocators perform well.
+In `cfrac`, mimalloc is about 13%
+faster than jemalloc for many small and short-lived allocations.
+The `cfrac` and `espresso` programs do not use much
+memory (~1.5MB) so it does not matter too much, but mimalloc still uses about half the resident
+memory of tcmalloc (and almost 5× less than Hoard on `espresso`).
+
+_The `leanN` program is the most interesting as a large, realistic, and concurrent
+workload, and there is a 6% speedup over both tcmalloc and jemalloc. This is
+quite significant: if Lean spends (optimistically) 20% of its time in the allocator,
+that means that mimalloc is 1.5× faster than the others._
+
+The `alloc-test` benchmark is very allocation intensive and we see the larger
+differences here. Since all allocators perform almost identically on `alloc-test1`
+and `alloc-testN`, we can see that they are all excellent and scale (almost) linearly.
+
+The second benchmark set tests specific aspects of the allocators and
+shows more extreme differences between them:
+
+The benchmarks in the second set are (again with N=16):
+
+- __larson__: by Larson and Krishnan \[2]. Simulates a server workload using 100
+  separate threads which allocate and free many objects but leave some objects to
+  be freed by other threads. Larson and Krishnan observe this behavior
+  (which they call _bleeding_) in actual server applications, and the
+  benchmark simulates this.
+- __sh6bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
+  single-threaded allocation where some of the objects are freed
+  in the usual last-allocated, first-freed (LIFO) order, but others
+  are freed in reverse order. Using the public [source](http://www.microquill.com/smartheap/shbench/bench.zip) (retrieved 2019-01-02).
+- __sh8bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
+  multithreaded allocation (with N threads) where, just as in `larson`, some objects are freed
+  by other threads, and some objects are freed in reverse order (as in `sh6bench`).
+  Using the public [source](http://www.microquill.com/smartheap/SH8BENCH.zip) (retrieved 2019-01-02).
+- __cache-scratch__: by Emery Berger _et al_ \[1]. Introduced with the Hoard
+  allocator to test for _passive-false_ sharing of cache lines: first some
+  small objects are allocated and given to each thread; the threads free that
+  object and allocate another one and access that repeatedly. If an allocator
+  allocates objects from different threads close to each other this will
+  lead to cache-line contention.
+
+In the `larson` server workload mimalloc is 2.5× faster than
+tcmalloc and jemalloc, which is quite surprising -- probably due to the object
+migration between different threads. Also in `sh6bench` mimalloc does much
+better than the others (more than 4× faster than jemalloc).
+We cannot explain this well but believe it may be
+caused in part by the "reverse" freeing in `sh6bench`. Again in `sh8bench`
+the mimalloc allocator handles object migration between threads much better.
+
+The `cache-scratch` benchmark also demonstrates the different architectures
+of the allocators nicely. With a single thread they all perform the same, but when
+running with multiple threads the allocator-induced false sharing of
+cache lines causes large run-time differences, where mimalloc is up to
+20× faster than tcmalloc. Only the original jemalloc does almost
+as well (but the most recent version, jxmalloc, regresses). The
+Hoard allocator is specifically designed to avoid this false sharing and we
+are not sure why it is not doing well here (although it still runs 5× as
+fast as tcmalloc and jxmalloc).
+
+
+## On an 8-core Intel Xeon running Linux
+
+Testing on a compute-optimized Amazon EC2 instance ([c5d.2xlarge](https://aws.amazon.com/ec2/instance-types/))
+consisting of an 8-core Intel Xeon Platinum at 3GHz (up to 3.5GHz turbo boost)
+with 16GB ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
+
+First the regular workload benchmarks (with N=8):
+
+Most results are quite similar to the 16-core AMD machine except that the
+differences are less pronounced, with all allocators a bit closer to mimalloc's performance.
+
+This is shown too in the second set of benchmarks:
+
+On the `larson` server workload everyone does a bit better on the 8 cores
+than on 16. On the other benchmarks the performance does not improve though.
+
+
+## On Windows (4-core Intel Xeon)
+
+Testing on an HP Z4 G4 Workstation with a 4-core Intel® Xeon® W2123 at 3.6 GHz
+with 16GB ECC memory, running Windows 10 Pro (version 10.0.17134 Build 17134)
+with Visual Studio 2017 (version 15.8.9).
+
+Since we cannot use `LD_PRELOAD` on Windows we compiled a subset of our
+allocators and benchmarks and linked them statically. The **je** benchmark
+is therefore equivalent to the **jx** benchmark in the previous graphs.
+The **mc** allocator now refers to the standard Microsoft allocator.
+Unfortunately we could not get Hoard to work on Windows at this time.
+
+We used the Windows call `QueryPerformanceCounter` to measure elapsed wall-clock
+times, and `GetProcessMemoryInfo` to measure the peak working set (rss).
+
+First the regular workload benchmarks:
+
+Here mimalloc and tcmalloc perform very similarly, and outperform the system
+allocator by a significant margin. Somehow jemalloc does much worse than when
+running on Linux. It is not clear why yet, but it might be a compilation issue:
+when running through the profiler the `__chkstk` routine takes
+quite some time. This is a compiler-inserted runtime function that checks for enough
+stack space if there are many local variables or when the compiler cannot make
+a static estimate. Perhaps this is the culprit but it needs more investigation.
+
+The second set of benchmarks shows again more pronounced differences:
+
+In the `larson` server workload mimalloc is 25% faster than
+tcmalloc, and both significantly outperform the system allocator
+(again probably due to the object migration between different threads).
+
+Also in `sh6bench` and `sh8bench`, mimalloc scales much
+better than the others.
+
+## References
+
+- \[1] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson.
+  _Hoard: A Scalable Memory Allocator for Multithreaded Applications_.
+  In the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). Cambridge, MA, November 2000.
+  [pdf](http://www.cs.utexas.edu/users/mckinley/papers/asplos-2000.pdf)
+
+- \[2] P. Larson and M. Krishnan. _Memory allocation for long-running server applications_. In ISMM, Vancouver, B.C., Canada, 1998.
+  [pdf](http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=5F0BFB4F57832AEB6C11BF8257271088?doi=10.1.1.45.1947&rep=rep1&type=pdf)
+
+- \[3] D. Grunwald, B. Zorn, and R. Henderson.
+  _Improving the cache locality of memory allocation_. In R. Cartwright, editor,
+  Proceedings of the Conference on Programming Language Design and Implementation, pages 177–186, New York, NY, USA, June 1993.
+  [pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.6621&rep=rep1&type=pdf)
+
+- \[4] J. Barnes and P. Hut. _A hierarchical O(N log N) force-calculation algorithm_. Nature, 324:446-449, 1986.
+
+*/