path: root/compiler/optimizing/loop_optimization.h
Age | Commit message | Author
2021-02-05 | ARM64: Support SVE VL other than 128-bit. | Artem Serov
Arm SVE register size is not fixed and can be a multiple of 128 bits. To support that the patch removes explicit assumptions on the SIMD register size to be 128 bit from the vectorizer and code generators and enables configurable SVE vector length autovectorization, e.g. extends SIMD register save/restore routines. Test: art SIMD tests on VIXL simulator. Test: art tests on FVP (steps in test/README.arm_fvp.md) with FVP arg: -C SVE.ScalableVectorExtension.veclen=[2,4] (SVE vector [128,256] bits wide) Change-Id: Icb46e7eb17f21d3bd38b16dd50f735c29b316427
2021-02-04 | ART: Implement predicated SIMD vectorization. | Artem Serov
This CL brings support for predicated execution for auto-vectorizer and implements arm64 SVE vector backend. This version passes all the VIXL simulator-runnable tests in SVE mode with checker off (as all VecOp CHECKs need to be adjusted for an extra input) and all tests in NEON mode. Test: art SIMD tests on VIXL simulator. Test: art tests on FVP (steps in test/README.arm_fvp.md) Change-Id: Ib78bde31a15e6713d875d6668ad4458f5519605f
2020-04-17 | ART: Refactor SIMD slots and regs size processing. | Artem Serov
The ART vectorizer assumes that there is a single size of SIMD register used for the whole program. Make this assumption explicit and refactor the code. Note: This is a base for the future introduction of SIMD slots of size other than 8 or 16 bytes. Test: test-art-target, test-art-host. Change-Id: Id699d5e3590ca8c655ecd9f9ed4e63f49e3c4f9c
2019-10-14 | Revert "Make compiler/optimizing/ symbols hidden." | Vladimir Marko
This reverts commit e2727154f25e0db9a5bb92af494d8e47b181dfcf. Reason for revert: Breaks ASAN tests (ODR violation). Bug: 142365358 Change-Id: I38103d74a1297256c81d90872b6902ff1e9ef7a4
2019-10-14 | Make compiler/optimizing/ symbols hidden. | Vladimir Marko
Make symbols in compiler/optimizing hidden by a namespace attribute. The unit intrinsic_objects.{h,cc} is excluded as it is needed by dex2oat. As the symbols are no longer exported, gtests are now linked with the static version of the libartd-compiler library. libart-compiler.so size: - before: arm: 2396152 arm64: 3345280 - after: arm: 2016176 (-371KiB, -15.9%) arm64: 2874480 (-460KiB, -14.1%) Test: m test-art-host-gtest Test: testrunner.py --host --optimizing --jit Bug: 142365358 Change-Id: I1fb04a33351f53f00b389a1642e81a68e40912a8
2018-09-25 | ART: ARM64: Support DotProd SIMD idiom. | Artem Serov
Implement support for a vectorization idiom which performs the dot product of two vectors and adds the result to wider precision components in the accumulator, viz. DOT_PRODUCT([ a1, .. , am], [ x1, .. , xn ], [ y1, .. , yn ]) = [ a1 + sum(xi * yi), .. , am + sum(xj * yj) ], for m <= n, non-overlapping sums, for either both signed or both unsigned operands x, y. The patch shows up to 7x performance improvement on a micro benchmark on Cortex-A57. Test: 684-checker-simd-dotprod. Test: test-art-host, test-art-target. Change-Id: Ibab0d51f537fdecd1d84033197be3ebf5ec4e455
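As an illustration only, a scalar Java loop of the shape this idiom describes might look like the sketch below: narrow (byte) operands are multiplied and accumulated into a wider (int) accumulator. The class and method names are hypothetical, and whether a particular loop is actually vectorized depends on checks in the compiler that are not shown here.

```java
// Hypothetical example of a loop shaped like the DotProd idiom:
// byte operands, int accumulator.
final class DotProdSketch {
    static int dotProd(byte[] x, byte[] y) {
        int acc = 0;
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            // Both operands are sign-extended to int before the multiply.
            acc += x[i] * y[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3, 4};
        byte[] y = {5, 6, 7, 8};
        System.out.println(dotProd(x, y));  // 5 + 12 + 21 + 32 = 70
    }
}
```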
2018-08-28 | Use 'final' and 'override' specifiers directly in ART. | Roland Levillain
Remove all uses of macros 'FINAL' and 'OVERRIDE' and replace them with 'final' and 'override' specifiers. Remove all definitions of these macros as well, which were located in these files: - libartbase/base/macros.h - test/913-heaps/heaps.cc - test/ti-agent/ti_macros.h ART is now using C++14; the 'final' and 'override' specifiers have been introduced in C++11. Test: mmma art Change-Id: I256c7758155a71a2940ef2574925a44076feeebf
2018-07-04 | ART: Implement loop full unrolling. | Artem Serov
Performs whole loop unrolling for small loops with small trip count to eliminate the loop check overhead and to expose more opportunities for inter-iteration optimizations. caffeinemark/FloatAtom: 1.2x performance on arm64 Cortex-A57. Test: 530-checker-peel-unroll. Test: test-art-host, test-art-target. Change-Id: Idf3fe3cb611376935d176c60db8c49907222e28a
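To make the transformation concrete, here is a minimal source-level sketch, assuming a loop with a constant trip count of four. The names are hypothetical, and the real pass works on the compiler's intermediate representation rather than on Java source; the second method only shows what the unrolled result is equivalent to.

```java
// Hypothetical before/after view of full unrolling for a small loop.
final class FullUnrollSketch {
    // Original form: small body, small constant trip count.
    static int sumLoop(int[] a) {
        int s = 0;
        for (int i = 0; i < 4; i++) {
            s += a[i];
        }
        return s;
    }

    // Conceptual result: no loop check, no back edge, and the four
    // additions are visible to later optimizations at once.
    static int sumUnrolled(int[] a) {
        int s = 0;
        s += a[0];
        s += a[1];
        s += a[2];
        s += a[3];
        return s;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        System.out.println(sumLoop(a) == sumUnrolled(a));
    }
}
```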
2018-07-04 | ART: Refactor scalar loop optimizations. | Artem Serov
Refactor scalar loop peeling and unrolling to eliminate repeated checks and graph traversals, to make the code more readable and to make it easier to add new scalar loop opts. This is a prerequisite for full unrolling patch. Test: 530-checker-peel-unroll. Test: test-art-target, test-art-host. Change-Id: If824a95f304033555085eefac7524e59ed540322
2018-06-25 | Move instruction_set_ to CompilerOptions. | Vladimir Marko
Removes CompilerDriver dependency from ImageWriter and several other classes. Test: m test-art-host-gtest Test: testrunner.py --host --optimizing Test: Pixel 2 XL boots. Test: m test-art-target-gtest Test: testrunner.py --target --optimizing Change-Id: I3c5b8ff73732128b9c4fad9405231a216ea72465
2018-05-15 | ART: Enable scalar loop peeling and unrolling. | Artem Serov
Turn on scalar loop peeling and unrolling by default. Test: 482-checker-loop-back-edge-use, 530-checker-peel-unroll Test: test-art-host, test-art-target, boot-to-gui Change-Id: Ibfe1b54f790a97b281e85396da2985e0f22c2834
2018-05-01 | Remove some SIMD recognition code. | Aart Bik
Test: test-art-host,target Change-Id: I7f00315c61ed99723236283bc39a4c7fb279df47
2018-04-26 | Step 1 of 2: conditional passes. | Aart Bik
Rationale: The change adds a return value to Run() in preparation for conditional pass execution. The value returned by Run() is best effort; returning false means no optimizations were applied or no useful information was obtained. I filled in a few cases with more exact information, others still just return true. In addition, it integrates inlining as a regular pass, avoiding the ugly "break" into optimizations1 and optimizations2. Bug: b/78171933, b/74026074 Test: test-art-host,target Change-Id: Ia39c5c83c01dcd79841e4b623917d61c754cf075
2018-04-17 | ART: Implement scalar loop peeling. | Artem Serov
Implement scalar loop peeling for invariant exits elimination (on arm64). If the loop exit condition is loop invariant then loop peeling + GVN + DCE can eliminate this exit in the loop body. Note: GVN and DCE aren't applied during loop optimizations. Note: this functionality is turned off by default now. Test: test-art-host, test-art-target, boot-to-gui. Change-Id: I98d20054a431838b452dc06bd25c075eb445960c
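The effect of peeling on a loop-invariant exit can be sketched at source level as follows. This is a hand-written illustration with hypothetical names, not output of the compiler: the first iteration is peeled out, and because the invariant condition was already tested there, the remaining loop no longer needs the check (the step the commit attributes to GVN and DCE).

```java
// Hypothetical before/after view of peeling a loop with an invariant exit.
final class PeelingSketch {
    // Original form: 'stop' never changes inside the loop.
    static int sumUntilStop(int[] a, boolean stop) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            if (stop) break;  // loop-invariant exit condition
            s += a[i];
        }
        return s;
    }

    // Conceptual result after peeling one iteration and removing the
    // now-redundant check from the remaining loop.
    static int sumUntilStopPeeled(int[] a, boolean stop) {
        int s = 0;
        int i = 0;
        if (i < a.length) {
            if (stop) return s;  // the peeled iteration keeps the check
            s += a[i];
            i++;
            for (; i < a.length; i++) {
                s += a[i];  // 'stop' is known to be false here
            }
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        System.out.println(sumUntilStop(a, false) == sumUntilStopPeeled(a, false));
        System.out.println(sumUntilStop(a, true) == sumUntilStopPeeled(a, true));
    }
}
```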
2018-03-26 | ART: Implement scalar loop unrolling. | Artem Serov
Implement scalar loop unrolling for small loops (on arm64) with known trip count to reduce loop check and branch penalty and to provide more opportunities for instruction scheduling. Note: this functionality is turned off by default now. Test: cloner_test.cc Test: test-art-target, test-art-host Change-Id: Ic27fd8fb0bc0d7b69251252da37b8b510bc30acc
2018-03-15 | Vectorization of saturation arithmetic. | Aart Bik
Rationale: Because faster is better. Bug: b/74026074 Test: test-art-host,target Change-Id: Ifa970a62cef1c0b8bb1c593f629d8c724f1ffe0e
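The commit message does not show the source pattern, so the following is only a guess at the kind of loop meant: a clamp written with Math.min and Math.max around a narrow addition, which is the usual way to express saturation arithmetic in Java. All names here are hypothetical.

```java
// Hypothetical example of saturating 16-bit addition written as a clamp.
final class SaturationSketch {
    static short satAdd(short a, short b) {
        int sum = a + b;  // computed in int, so it cannot overflow
        return (short) Math.min(Math.max(sum, Short.MIN_VALUE), Short.MAX_VALUE);
    }

    // Element-wise form, the kind of loop a vectorizer would look at.
    static void satAddArrays(short[] a, short[] b, short[] out) {
        int n = Math.min(out.length, Math.min(a.length, b.length));
        for (int i = 0; i < n; i++) {
            out[i] = satAdd(a[i], b[i]);
        }
    }

    public static void main(String[] args) {
        System.out.println(satAdd((short) 32000, (short) 1000));    // clamps to 32767
        System.out.println(satAdd((short) -32000, (short) -1000));  // clamps to -32768
    }
}
```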
2017-11-20 | Refactored optimization passes setup. | Aart Bik
Rationale: Refactors the way we set up optimization passes in the compiler into a more centralized approach. The refactoring also found some "holes" in the existing mechanism (missing string lookup in the debugging mechanism, or inability to set an alternative name for optimizations that may repeat). Bug: 64538565 Test: test-art-host test-art-target Change-Id: Ie5e0b70f67ac5acc706db91f64612dff0e561f83
2017-10-27 | Alignment optimizations in vectorizer. | Aart Bik
Rationale: Since aligned data access is generally better (enables more efficient aligned moves and prevents nasty cache line splits), computing and/or enforcing alignment has been added to the vectorizer: (1) If the initial alignment is known completely and suffices, then a static peeling factor enforces proper alignment. (2) If (1) fails, but the base alignment allows, dynamically peeling until the total offset is aligned forces proper aligned access patterns. By using ART conventions only, any forced alignment is preserved over suspend checks where data may move. Note 1: Current allocation convention is just 8 byte alignment on arrays/strings, so only ARM32 benefits. However, all optimizations are implemented in a general way, so moving to a 16 byte alignment will immediately take advantage of any new convention!! Note 2: This CL also exposes how bad the choice of 12 byte offset of arrays really is. Even though the new optimizations fix the misalignment, they require peeling for the most common case: 0 indexed loops. Therefore, we may even consider moving to a 16 byte offset. Again the optimizations in this CL will immediately take advantage of that new convention!! Test: test-art-host test-art-target Change-Id: Ib6cc0fb68c9433d3771bee573603e64a3a9423ee
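The arithmetic behind a static peeling factor can be sketched as below. This is not the vectorizer's code: peelCount is a hypothetical helper, and it assumes the array object itself starts at an address that is a multiple of the vector size, so only the data offset and the element size matter. With the 12 byte data offset mentioned in Note 2, 4-byte elements and an 8-byte vector, element 0 sits at offset 12 (misaligned) and element 1 at offset 16 (aligned), so one iteration has to be peeled even for a 0-indexed loop.

```java
// Hypothetical helper illustrating how a static peeling factor could be derived.
final class AlignmentPeelSketch {
    // Smallest i >= 0 such that (dataOffset + i * elemSize) is a multiple of
    // vectorBytes, or -1 if no element ever lands on an aligned address.
    // Assumes the object base address is itself a multiple of vectorBytes.
    static int peelCount(int dataOffset, int elemSize, int vectorBytes) {
        for (int i = 0; i < vectorBytes; i++) {
            if ((dataOffset + i * elemSize) % vectorBytes == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(peelCount(12, 4, 8));   // 12-byte offset, int elements, 8-byte vector
        System.out.println(peelCount(16, 4, 16));  // hypothetical 16-byte offset, 16-byte vector
    }
}
```

Under these assumptions the first call yields a peel of one iteration and the second yields zero, which is the benefit of a 16 byte offset that the commit message argues for.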
2017-10-12 | ARM: Support SIMD reduction for 32-bit backend. | Artem Serov
Support SIMD reduction (add, min, max) and SAD (for int->int only) idioms for arm (32-bit) backend. Test: test-art-target, test-art-host Test: 661-checker-simd-reduc, 660-checker-simd-sad-int Change-Id: Ic6121f5d781a9bcedc33041b6c4ecafad9b0420a
2017-10-06 | ART: Use ScopedArenaAllocator for pass-local data. | Vladimir Marko
Passes using local ArenaAllocator were hiding their memory usage from the allocation counting, making it difficult to track down where memory was used. Using ScopedArenaAllocator reveals the memory usage. This changes the HGraph constructor which requires a lot of changes in tests. Refactor these tests to limit the amount of work needed the next time we change that constructor. Test: m test-art-host-gtest Test: testrunner.py --host Test: Build with kArenaAllocatorCountAllocations = true. Bug: 64312607 Change-Id: I34939e4086b500d6e827ff3ef2211d1a421ac91a
2017-09-25 | ART: Introduce compiler data type. | Vladimir Marko
Replace most uses of the runtime's Primitive in compiler with a new class DataType. This prepares for introducing new types, such as Uint8, that the runtime does not need to know about. Test: m test-art-host-gtest Test: testrunner.py --host Bug: 23964345 Change-Id: Iec2ad82454eec678fffcd8279a9746b90feb9b0c
2017-09-21 | Implement Sum-of-Abs-Differences idiom recognition. | Aart Bik
Rationale: Currently just on ARM64 (x86 lacks proper support), using the SAD idiom yields great speedup on loops that compute the sum-of-abs-difference operation. Also includes some refinements around type conversions. Speedup ExoPlayerAudio (golem run): 1.3x on ARM64 1.1x on x86 Test: test-art-host test-art-target Bug: 64091002 Change-Id: Ia2b711d2bc23609a2ed50493dfe6719eedfe0130
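For reference, a loop computing the sum-of-abs-difference operation can be written as in this sketch. The names are hypothetical, and the exact operand types and conversions the recognizer accepts are not spelled out in the commit message.

```java
// Hypothetical example of a sum-of-absolute-differences (SAD) loop.
final class SadSketch {
    static int sad(int[] a, int[] b) {
        int sum = 0;
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sad(new int[]{1, 5, 3}, new int[]{4, 2, 3}));  // 3 + 3 + 0 = 6
    }
}
```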
2017-09-06 | Pass stats into the loop optimization phase. | Aart Bik
Test: market scan. Change-Id: I58b23b8d254883f30619ea3602d34bf93618d432
2017-09-05 | Basic SIMD reduction support. | Aart Bik
Rationale: Enables vectorization of x += .... for very basic (simple, same-type) constructs. Paves the way for more complex (narrower and/or mixed-type) constructs, which will be handled by the next CL. This is a revert of Icb5d6c805516db0a1d911c3ede9a246ccef89a22 and thus a revert^2 of I2454778dd0ef1da915c178c7274e1cf33e271d0f and thus a revert^3 of I1c1c87b6323e01442e8fbd94869ddc9e760ea1fc and thus a revert^4 of I7880c135aee3ed0a39da9ae5b468cbf80e613766 PS1-2 shows what needed to change Test: test-art-host test-art-target Bug: 64091002 Change-Id: I647889e0da0959ca405b70081b79c7d3c9bcb2e9
2017-09-02 | Revert "Basic SIMD reduction support." | Nicolas Geoffray
Fails 530-checker-lse on arm64. Bug: 64091002, 65212948 This reverts commit cfa59b49cde265dc5329a7e6956445f9f7a75f15. Change-Id: Icb5d6c805516db0a1d911c3ede9a246ccef89a22
2017-09-01 | Basic SIMD reduction support. | Aart Bik
Rationale: Enables vectorization of x += .... for very basic (simple, same-type) constructs. Paves the way for more complex (narrower and/or mixed-type) constructs, which will be handled by the next CL. This is a revert^2 of I7880c135aee3ed0a39da9ae5b468cbf80e613766 and thus a revert of I1c1c87b6323e01442e8fbd94869ddc9e760ea1fc PS1-2 shows what needed to change, with regression tests Test: test-art-host test-art-target Bug: 64091002, 65212948 Change-Id: I2454778dd0ef1da915c178c7274e1cf33e271d0f
2017-08-30 | Revert "Basic SIMD reduction support." | Aart Bik
This reverts commit 9879d0eac8fe2aae19ca6a4a2a83222d6383afc2. Getting these type check failures in some builds. Need time to look at this better, so reverting for now :-( dex2oatd F 08-30 21:14:29 210122 226218 code_generator.cc:115] Check failed: CheckType(instruction->GetType(), locations->InAt(0)) PrimDouble C Change-Id: I1c1c87b6323e01442e8fbd94869ddc9e760ea1fc
2017-08-30 | Basic SIMD reduction support. | Aart Bik
Rationale: Enables vectorization of x += .... for very basic (simple, same-type) constructs. Paves the way for more complex (narrower and/or mixed-type) constructs, which will be handled by the next CL. Test: test-art-host test-art-target Bug: 64091002 Change-Id: I7880c135aee3ed0a39da9ae5b468cbf80e613766
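A simple same-type reduction of the form x += ..., and one way to picture its vectorized counterpart, are sketched below. The four-lane version is a scalar emulation written for illustration (the lane count and all names are assumptions): each lane keeps a partial sum, the lanes are added together after the main loop, and a cleanup loop handles leftover elements. Because int addition wraps and is associative, both methods return the same value.

```java
// Hypothetical scalar emulation of a vectorized sum reduction.
final class ReductionSketch {
    // The basic construct: x += a[i].
    static int sumScalar(int[] a) {
        int x = 0;
        for (int i = 0; i < a.length; i++) {
            x += a[i];
        }
        return x;
    }

    // Four independent partial sums stand in for the lanes of one vector
    // accumulator; they are combined once after the main loop.
    static int sumFourLanes(int[] a) {
        int l0 = 0, l1 = 0, l2 = 0, l3 = 0;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            l0 += a[i];
            l1 += a[i + 1];
            l2 += a[i + 2];
            l3 += a[i + 3];
        }
        int x = l0 + l1 + l2 + l3;  // horizontal add of the lanes
        for (; i < a.length; i++) {
            x += a[i];              // scalar cleanup for the remainder
        }
        return x;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6};
        System.out.println(sumScalar(a) == sumFourLanes(a));
    }
}
```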
2017-08-08 | Set basic framework for detecting reductions. | Aart Bik
Rationale: Recognize reductions in loops. Note that reductions are *not* optimized yet (we would proceed with e.g. unrolling and vectorization). This CL merely sets up the basic detection framework. Also does a bit of cleanup on loop optimization code. Bug: 64091002 Test: test-art-host Change-Id: I0f52bd7ca69936315b03d02e83da743b8ad0ae72
2017-06-27 | Unrolling and dynamic loop peeling framework in vectorizer. | Aart Bik
Rationale: This CL introduces the basic framework for dynamically peeling (to obtain aligned access) and unrolling the vector loop (to reduce looping overhead and allow more target specific optimizations on e.g. SIMD loads and stores). NOTE: The current heuristics are "bogus" and merely meant to exercise the new framework. This CL focuses on introducing correct code for the vectorizer. Heuristics and the memory computations for alignment are to be implemented later. Test: test-art-target, test-art-host Change-Id: I010af1475f42f92fd1daa6a967d7a85922beace8
2017-06-22 | Fix loop optimization in the presence of environment uses. | Nicolas Geoffray
We should not remove instructions that have deoptimize as users, or that have environment uses in a debuggable setup. bug: 62536525 bug: 33775412 Test: 656-loop-deopt Change-Id: Iaec1a0b6e90c6a0169f18c6985f00fd8baf2dece
2017-05-29 | MIPS64: ART Vectorizer | Goran Jakovljevic
MIPS64 implementation which uses MSA extension. Also extended all relevant checker tests to test MIPS64 implementation. Test: booted MIPS64R6 in QEMU Test: ./testrunner.py --target --optimizing -j1 in QEMU Change-Id: I8b8a2f601076bca1925e21213db8ed1d41d79b52
2017-05-24 | Support for narrow operands in "dangerous" operations. | Aart Bik
This is a revert^2 of commit 636e870d55c1739e2318c2180fac349683dbfa97. Rationale: Under strict conditions, even operations that are sensitive to higher order bits can vectorize by inspecting the operands carefully. This enables more vectorization, as demonstrated by the removal of quite a few TODOs. Test: test-art-target, test-art-host Change-Id: Ic2684f771d2e36df10432286198533284acaf472
2017-05-23 | Revert "Support for narrow operands in "dangerous" operations." | Nicolas Geoffray
Fails on armv8 / speed-profile This reverts commit 636e870d55c1739e2318c2180fac349683dbfa97. Change-Id: Ib2a09b3adeba994c6b095672a1e08b32d3871872
2017-05-18 | Support for narrow operands in "dangerous" operations. | Aart Bik
Rationale: Under strict conditions, even operations that are sensitive to higher order bits can vectorize by inspecting the operands carefully. This enables more vectorization, as demonstrated by the removal of quite a few TODOs. Test: test-art-target, test-art-host Change-Id: I2b0fda6a182da9aed9ce1708a53eaf0b7e1c9146
2017-05-15 | Min/max SIMDization support. | Aart Bik
Rationale: The more vectorized, the better! Test: test-art-target, test-art-host Change-Id: I758becca5beaa5b97fab2ab70f2e00cb53458703
2017-04-19 | Implement halving add idiom (with checker tests). | Aart Bik
Rationale: First of several idioms that map to very efficient SIMD instructions. Note that the is-zero-ext and is-sign-ext are general-purpose utilities that will be widely used in the vectorizer to detect low precision idioms, so expect that code to be shared with many CLs to come. Test: test-art-host, test-art-target Change-Id: If7dc2926c72a2e4b5cea15c44ef68cf5503e9be9
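The commit message does not spell out the source pattern, so the sketch below assumes the common form of a halving add: two narrow operands are widened, added, optionally incremented by one for rounding, and shifted right by one. Names are hypothetical. The widening step is where sign or zero extension comes in: here the byte operands are sign-extended to int by Java's numeric promotion.

```java
// Hypothetical examples of truncating and rounding halving adds on bytes.
final class HalvingAddSketch {
    // (a + b) >> 1, computed in int so the sum cannot overflow.
    static byte[] halvingAdd(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            out[i] = (byte) ((a[i] + b[i]) >> 1);
        }
        return out;
    }

    // (a + b + 1) >> 1, the rounding variant.
    static byte[] roundingHalvingAdd(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            out[i] = (byte) ((a[i] + b[i] + 1) >> 1);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] a = {10, -3};
        byte[] b = {5, -4};
        System.out.println(java.util.Arrays.toString(halvingAdd(a, b)));
        System.out.println(java.util.Arrays.toString(roundingHalvingAdd(a, b)));
    }
}
```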
2017-04-05 | Implemented ABS vectorization. | Aart Bik
Rationale: This CL adds the concept of vectorizing intrinsics to the ART vectorizer. More can follow (MIN, MAX, etc). Test: test-art-host, test-art-target (angler) Change-Id: Ieed8aa83ec64c1250ac0578570249cce338b5d36
2017-03-31 | ART vectorizer. | Aart Bik
Rationale: Make SIMD great again with a retargetable and easily extendable vectorizer. Provides a full x86/x86_64 and a proof-of-concept ARM implementation. Sample improvement (without any perf tuning yet) for Linpack on x86 is about 20% to 50%. Test: test-art-host, test-art-target (angler) Bug: 34083438, 30933338 Change-Id: Ifb77a0f25f690a87cd65bf3d5e9f6be7ea71d6c1
2017-03-06 | Pass driver to loop opt. Add new side_effects phase. | Aart Bik
Rationale: Break-out CL of ART Vectorizer: number 3. The purpose is making the original CL smaller and easier to review. Bug: 34083438 Test: test-art-host Change-Id: I7cece807ee4f5fcaeae41f1deed33ac263447b77
2017-01-13 | Complete unrolling of loops with small body and trip count one. | Aart Bik
Rationale: Avoids the unnecessary loop control overhead, suspend check, and exposes more opportunities for constant folding in the resulting loop body. Fully unrolls loop in execute() of the Dhrystone benchmark (3% to 8% improvements). Test: test-art-host Change-Id: If30f38caea9e9f87a929df041dfb7ed1c227aba3
2016-12-09 | Added polynomial induction variables analysis. With tests. | Aart Bik
Rationale: Information on polynomial sequences is nice to further enhance BCE and last-value assignment. In this case, this CL enables more loop optimizations for benchpress' Sum (80 x speedup). Also changed rem-based geometric induction to wrap-around induction. Test: test-art-host Change-Id: Ie4d2659edefb814edda2c971c1f70ba400c31111
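As a worked example of why polynomial sequences help last-value assignment, consider a loop that adds a linear induction variable into an accumulator. The accumulator follows a second-order polynomial in the iteration count, so its final value has a closed form and the loop itself becomes unnecessary. The sketch is illustrative: the names are hypothetical and the benchmark's actual loop is not shown in the commit message.

```java
// Hypothetical example: a polynomial induction and its closed-form last value.
final class PolynomialInductionSketch {
    // sum takes the values 0, 0, 1, 3, 6, 10, ... : a polynomial in i.
    static int sumLoop(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    // Closed form n * (n - 1) / 2, evaluated in long and narrowed to int so
    // that it matches the wrap-around behaviour of the int loop above.
    static int sumClosedForm(int n) {
        if (n <= 0) {
            return 0;
        }
        return (int) ((long) n * (n - 1) / 2);
    }

    public static void main(String[] args) {
        System.out.println(sumLoop(10));        // 0 + 1 + ... + 9 = 45
        System.out.println(sumClosedForm(10));  // 10 * 9 / 2 = 45
    }
}
```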
2016-11-04 | Account for early exit loop. | Aart Bik
Rationale: last value computation is obviously only right if the loop does not have early exits; only needed if cycle leaks to outside loop in any way. Bug:32633772 Test: 623-checker-loop-regressions Change-Id: Id60beca4704491cff611ad12a24bfc63c09d32c3
2016-10-24 | Improved induction variable analysis and loop optimizations. | Aart Bik
Rationale: Rather than half-baked reconstructing cycles during loop optimizations, this CL passes the SCC computed during induction variable analysis to the loop optimizer (trading some memory for more optimizations). This further improves CaffeineLogic from 6000us down to 4200us (dx) and 2200us to 1690us (jack). Note that this is on top of prior improvements in previous CLs. Also, some narrowing type concerns are taken care of during transfer operations. Test: test-art-host Change-Id: Ice2764811a70073c5014b3a05fb51f39fd2f4c3c
2016-10-18 | Enable last value generation of periodic sequence. | Aart Bik
Rationale: This helps to eliminate more dead induction. For example, CaffeineLogic when compiled with latest Jack improves with a 1.3 speedup (2900us -> 2200us) due to eliminating first loop (second loop can be removed also, but for a later case). The current benchmarks.dex, however, has a different construct for the periodics that is still to be recognized. Test: test-art-host Change-Id: Ia81649a207a2b1f03ead0855436862ed4e4f45e0
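A small illustration of a periodic sequence and its last value, assuming the simplest period-two case (the construct in CaffeineLogic is not shown in the commit message, and the names below are hypothetical): the variable flips between 0 and 1 on every iteration, so its value after the loop depends only on the parity of the trip count.

```java
// Hypothetical example: last value of a period-2 sequence without the loop.
final class PeriodicLastValueSketch {
    // x takes the values 0, 1, 0, 1, ... across iterations.
    static int flipLoop(int n) {
        int x = 0;
        for (int i = 0; i < n; i++) {
            x = 1 - x;
        }
        return x;
    }

    // Last value derived from the trip count alone: 1 for an odd positive
    // trip count, 0 otherwise (including loops that never execute).
    static int flipLastValue(int n) {
        return (n > 0 && (n & 1) == 1) ? 1 : 0;
    }

    public static void main(String[] args) {
        for (int n = 0; n < 5; n++) {
            System.out.println(flipLoop(n) == flipLastValue(n));
        }
    }
}
```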
2016-10-11 | Improved and simplified loop optimizations. | Aart Bik
Rationale: Empty preheader simplification has been simplified to a much more general empty block removal optimization step. Incremental updating of induction variable analysis enables repeated elimination or simplification of induction cycles. This enabled an extra layer of optimization for e.g. Benchpress Loop (17.5us. -> 0.24us. -> 0.08us). So the original 73x speedup is now multiplied by another 3x, for a total of about 218x. Test: 618-checker-induction et al. Change-Id: I394699981481cdd5357e0531bce88cd48bd32879
2016-10-07 | Improved and simplified loop optimizations. | Aart Bik
Rationale: This CL merges some common cases into one, thereby simplifying the code quite a bit. It also prepares for more general induction cycles (rather than the simple phi-add currently used). Finally, it generalizes the closed form elimination with empty loops. As a result of the latter, elaborate but weird code like:

    private static int waterFall() {
      int i = 0;
      for (; i < 10; i++);
      for (; i < 20; i++);
      for (; i < 30; i++);
      for (; i < 40; i++);
      for (; i < 50; i++);
      return i;
    }

now becomes just this (on x86)!

    mov eax, 50
    ret

Change-Id: I8d22ce63ce9696918f57bb90f64d9a9303a4791d Test: m test-art-host
2016-10-05 | Refactoring of graph linearization and linear order. | Aart Bik
Rationale: Ownership of graph's linear order and iterators was a bit unclear now that other phases are using it. New approach allows phases to compute their own order, while ssa_liveness is sole owner for graph (since it is not mutated afterwards). Also shortens lifetime of loop's arena. Test: test-art-host Change-Id: Ib7137d1203a1e0a12db49868f4117d48a4277f30
2016-10-05 | Make it possible to pass an arena allocator to HLoopOptimization. | Nicolas Geoffray
loop_optimization_test uses memory from HLoopOptimization's allocator, which is scoped by the Run method. Fix is to pass custom allocator. test: m test-art-host-gtest Change-Id: I359330e22202519f400a26da5403eeb00f0b2db4
2016-10-05 | Properly scope HLoopOptimization's allocator. | Nicolas Geoffray
HOptimization classes do not get their destructor called, as they are arena objects. So the scope for the optimization allocator needs to be the Run method. Also anticipate bisection search breakage by adding HLoopOptimization to the list of recognized optimizations. Change-Id: I7770989c39d5700a3b6b0a20af5d4b874dfde111