diff options
Diffstat (limited to 'docs/native_allocator.md')
-rw-r--r-- | docs/native_allocator.md | 348 |
1 files changed, 348 insertions, 0 deletions
diff --git a/docs/native_allocator.md b/docs/native_allocator.md new file mode 100644 index 000000000..82a98fe48 --- /dev/null +++ b/docs/native_allocator.md @@ -0,0 +1,348 @@ +# Native Memory Allocator Verification +This document describes how to verify the native memory allocator on Android. +This procedure should be followed when upgrading or moving to a new allocator. +A small minor upgrade might not need to run all of the benchmarks, however, +at least the +[SQL Allocation Trace Benchmark](#sql-allocation-trace-benchmark), +[Memory Replay Benchmarks](#memory-replay-benchmarks) and +[Performance Trace Benchmarks](#performance-trace-benchmarks) should be run. + +It is important to note that there are two modes for a native allocator +to run in on Android. The first is the normal allocator, the second is +called the svelte config, which is designed to run on memory constrained +systems and be a bit slower, but take less RSS. To enable the svelte config, +add this line to the `BoardConfig.mk` for the given target: + + MALLOC_SVELTE := true + +The `BoardConfig.mk` file is usually found in the directory +`device/<DEVICE_NAME>/` or in a sub directory. + +When evaluating a native allocator, make sure that you benchmark both +versions. + +## Android Extensions +Android supports a few non-standard functions and mallopt controls that +a native allocator needs to implement. + +### Iterator Functions +These are functions that are used to implement a memory leak detector +called `libmemunreachable`. + +#### malloc\_disable +This function, when called, should pause all threads that are making a +call to an allocation function (malloc/free/etc). When a call +is made to `malloc_enable`, the paused threads should start running again. + +#### malloc\_enable +This function, when called, does nothing unless there was a previous call +to `malloc_disable`. This call will unpause any thread which is making +a call to an allocation function (malloc/free/etc) when `malloc_disable` +was called previously. + +#### malloc\_iterate +This function enumerates all of the allocations currently live in the +system. It is meant to be called after a call to `malloc_disable` to +prevent further allocations while this call is being executed. To +see what is expected for this function, the best description is the +tests for this funcion in `bionic/tests/malloc_itearte_test.cpp`. + +### Mallopt Extensions +These are mallopt options that Android requires for a native allocator +to work efficiently. + +#### M\_DECAY\_TIME +When set to zero, `mallopt(M_DECAY_TIME, 0)`, it is expected that an +allocator will attempt to purge and release any unused memory back to the +kernel on free calls. This is important in Android to avoid consuming extra +RSS. + +When set to non-zero, `mallopt(M_DECAY_TIME, 1)`, an allocator can delay the +purge and release action. The amount of delay is up to the allocator +implementation, but it should be a reasonable amount of time. The jemalloc +allocator was implemented to have a one second delay. + +The drawback to this option is that most allocators do not have a separate +thread to handle the purge, so the decay is only handled when an +allocation operation occurs. For server processes, this can mean that +RSS is slightly higher when the server is waiting for the next connection +and no other allocation calls are made. The `M_PURGE` option is used to +force a purge in this case. + +For all applications on Android, the call `mallopt(M_DECAY_TIME, 1)` is +made by default. The idea is that it allows application frees to run a +bit faster, while only increasing RSS a bit. + +#### M\_PURGE +When called, `mallopt(M_PURGE, 0)`, an allocator should purge and release +any unused memory immediately. The argument for this call is ignored. If +possible, this call should clear thread cached memory if it exists. The +idea is that this can be called to purge memory that has not been +purged when `M_DECAY_TIME` is set to one. This is useful if you have a +server application that does a lot of native allocations and the +application wants to purge that memory before waiting for the next connection. + +## Correctness Tests +These are the tests that should be run to verify an allocator is +working properly according to Android. + +### Bionic Unit Tests +The bionic unit tests contain a small number of allocator tests. These +tests are primarily verifying Android extensions and non-standard behavior +of allocation routines such as what happens when a non-power of two alignment +is passed to memalign. + +To run all of the compliance tests: + + adb shell /data/nativetest64/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*" + adb shell /data/nativetest/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*" + +The allocation tests are not meant to be complete, so it is expected +that a native allocator will have its own set of tests that can be run. + +### CTS Entropy Test +In addition to the bionic tests, there is also a CTS test that is designed +to verify that the addresses returned by malloc are sufficiently randomized +to help defeat potential security bugs. + +Run this test thusly: + + atest AslrMallocTest + +If there are multiple devices connected to the system, use `-s <SERIAL>` +to specify a device. + +## Performance +There are multiple different ways to evaluate the performance of a native +allocator on Android. One is allocation speed in various different scenarios, +another is total RSS taken by the allocator. + +The last is virtual address space consumed in 32 bit applications. There is +a limited amount of address space available in 32 bit apps, and there have +been allocator bugs that cause memory failures when too much virtual +address space is consumed. For 64 bit executables, this can be ignored. + +### Bionic Benchmarks +These are the microbenchmarks that are part of the bionic benchmarks suite of +benchmarks. These benchmarks can be built using this command: + + mmma -j bionic/benchmarks + +These benchmarks are only used to verify the speed of the allocator and +ignore anything related to RSS and virtual address space consumed. + +#### Allocate/Free Benchmarks +These are the benchmarks to verify the allocation speed of a loop doing a +single allocation, touching every page in the allocation to make it resident +and then freeing the allocation. + +To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands: + + adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_default + adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_free_default + +To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands: + + adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_decay1 + adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_free_decay1 + +The last value in the output is the size of the allocation in bytes. It is +useful to look at these kinds of benchmarks to make sure that there are +no outliers, but these numbers should not be used to make a final decision. +If these numbers are slightly worse than the current allocator, the +single thread numbers from trace data is a better representative of +real world situations. + +#### Multiple Allocations Retained Benchmarks +These are the benchmarks that examine how the allocator handles multiple +allocations of the same size at the same time. + +The first set of these benchmarks does a set number of 8192 byte allocations +in one loop, and then frees all of the allocations at the end of the loop. +Only the time it takes to do the allocations is recorded, the frees are not +counted. The value of 8192 was chosen since the jemalloc native allocator +had issues with this size. It is possible other sizes might show different +results, but, as mentioned before, these microbenchmark numbers should +not be used as absolutes for determining if an allocator is worth using. + +This benchmark is designed to verify that there is no performance issue +related to having multiple allocations alive at the same time. + +To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands: + + adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default + adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default + +To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands: + + adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1 + adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1 + +For these benchmarks, the last parameter is the total number of allocations to +do in each loop. + +The other variation of this benchmark is to always do forty allocations in +each loop, but vary the size of the forty allocations. As with the other +benchmark, only the time it takes to do the allocations is tracked, the +frees are not counted. Forty allocations is an arbitrary number that could +be modified in the future. It was chosen because a version of the native +allocator, jemalloc, showed a problem at forty allocations. + +To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands: + + adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default + adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default + +To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these command: + + adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1 + adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1 + +For these benchmarks, the last parameter in the output is the size of the +allocation in bytes. + +As with the other microbenchmarks, an allocator with numbers in the same +proximity of the current values is usually sufficient to consider making +a switch. The trace benchmarks are more important than these benchmarks +since they simulate real world allocation profiles. + +#### SQL Allocation Trace Benchmark +This benchmark is a trace of the allocations performed when running +the SQLite BenchMark app. + +This benchmark is designed to verify that the allocator will be performant +in a real world allocation scenario. SQL operations were chosen as a +benchmark because these operations tend to do lots of malloc/realloc/free +calls, and they tend to be on the critical path of applications. + +To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands: + + adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default + adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default + +To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands: + + adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1 + adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1 + +These numbers should be as performant as the current allocator. + +### Memory Trace Benchmarks +These benchmarks measure all three axes of a native allocator, RSS, virtual +address space consumed, speed of allocation. They are designed to +run on a trace of the allocations from a real world application or system +process. + +To build this benchmark: + + mmma -j system/extras/memory_replay + +This will build two executables: + + /system/bin/memory_replay32 + /system/bin/memory_replay64 + +And these two benchmark executables: + + /data/benchmarktest64/trace_benchmark/trace_benchmark + /data/benchmarktest/trace_benchmark/trace_benchmark + +#### Memory Replay Benchmarks +These benchmarks display RSS, virtual memory consumed (VA space), and do a +bit of performance testing on actual traces taken from running applications. + +The trace data includes what thread does each operation, so the replay +mechanism will simulate this by creating threads and replaying the operations +on a thread as if it was rerunning the real trace. The only issue is that +this is a worst case scenario for allocations happening at the same time +in all threads since it collapses all of the allocation operations to occur +one after another. This will cause a lot of threads allocating at the same +time. The trace data does not include timestamps, +so it is not possible to create a completely accurate replay. + +To generate these traces, see the [Malloc Debug documentation](https://android.googlesource.com/platform/bionic/+/master/libc/malloc_debug/README.md), +the option [record\_allocs](https://android.googlesource.com/platform/bionic/+/master/libc/malloc_debug/README.md#record_allocs_total_entries). + +To run these benchmarks, first copy the trace files to the target and +unzip them using these commands: + + adb shell push system/extras/traces /data/local/tmp + adb shell 'cd /data/local/tmp/traces && for name in *.zip; do unzip $name; done' + +Since all of the traces come from applications, the `memory_replay` program +will always call `mallopt(M_DECAY_TIME, 1)' before running the trace. + +Run the benchmark thusly: + + adb shell memory_replay64 /data/local/tmp/traces/XXX.txt + adb shell memory_replay32 /data/local/tmp/traces/XXX.txt + +Where XXX.txt is the name of a trace file. + +Every 100000 allocation operations, a dump of the RSS and VA space will be +performed. At the end, a final RSS and VA space number will be printed. +For the most part, the intermediate data can be ignored, but it is always +a good idea to look over the data to verify that no strange spikes are +occurring. + +The performance number is a measure of the time it takes to perform all of +the allocation calls (malloc/memalign/posix_memalign/realloc/free/etc). +For any call that allocates a pointer, the time for the call and the time +it takes to make the pointer completely resident in memory is included. + +The performance numbers for these runs tend to have a wide variability so +they should not be used as absolute value for comparison against the +current allocator. But, they should be in the same range as the current +values. + +When evaluating an allocator, one of the most important traces is the +camera.txt trace. The camera application does very large allocations, +and some allocators might leave large virtual address maps around +rather than delete them. When that happens, it can lead to allocation +failures and would cause the camera app to abort/crash. It is +important to verify that when running this trace using the 32 bit replay +executable, the virtual address space consumed is not much larger than the +current allocator. A small increase (on the order of a few MBs) would be okay. + +There is no specific benchmark for memory fragmentation, instead, the RSS +when running the memory traces acts as a proxy for this. An allocator that +is fragmenting badly will show an increase in RSS. The best trace for +tracking fragmentation is system\_server.txt which is an extremely long +trace (~13 million operations). The total number of live allocations goes +up and down a bit, but stays mostly the same so an allocator that fragments +badly would likely show an abnormal increase in RSS on this trace. + +NOTE: When a native allocator calls mmap, it is expected that the allocator +will name the map using the call: + + prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, <PTR>, <SIZE>, "libc_malloc"); + +If the native allocator creates a different name, then it necessary to +modify the file: + + system/extras/memory_replay/NativeInfo.cpp + +The `GetNativeInfo` function needs to be modified to include the name +of the maps that this allocator includes. + +In addition, in order for the frameworks code to keep track of the memory +of a process, any named maps must be added to the file: + + frameworks/base/core/jni/android_os_Debug.cpp + +Modify the `load_maps` function and add a check of the new expected name. + +#### Performance Trace Benchmarks +This is a benchmark that treats the trace data as if all allocations +occurred in a single thread. This is the scenario that could +happen if all of the allocations are spaced out in time so no thread +every does an allocation at the same time as another thread. + +Run these benchmarks thusly: + + adb shell /data/benchmarktest64/trace_benchmark/trace_benchmark + adb shell /data/benchmarktest/trace_benchmark/trace_benchmark + +When run without any arguments, the benchmark will run over all of the +traces and display data. It takes many minutes to complete these runs in +order to get as accurate a number as possible. |