path: root/compiler/optimizing/loop_optimization.cc
Age | Commit message | Author
2017-04-10 | ARM64: Support vectorization for double and long. | Artem Serov
Test: test-art-host, test-art-target
Change-Id: I1d4db1763b64737766f9756e5d0f85c5736e3522
2017-04-10 | ARM64: Support 128-bit registers for SIMD. | Artem Serov
Test: test-art-host, test-art-target
Change-Id: Ifb931a99d34ea77602a0e0781040ed092de9faaa
2017-04-07 | Fixed missing context while detecting unit strides. | Aart Bik
With regression test (found by fuzz testing).
Bug: 37033123
Test: test-art-target
Change-Id: Id738b2a3a353985c3d0bf3beeb581a31f1fcbc3f
2017-04-06 | Fix a few comments in vectorization code that were incorrect or incomplete. | Aart Bik
Test: test-art-target
Change-Id: I7c6a5a2d29edd0b2782abc303d8d8cb09c1c2f91
2017-04-06 | Merge "Ensure environment is ready when populating loop." | Aart Bik
2017-04-05 | Ensure environment is ready when populating loop. | Aart Bik
Rationale: OSR requires the suspend check to already have an environment, albeit just for testing irreducible loops. This CL fixes the omission. Note, the error is spurious on OSR and writing a unit or regression test for this is hard.
Test: test-art-host
Bug: 36950873
Change-Id: Ica89e18e10deb438dead79e2cc40dd00a60b529f
2017-04-05 | Implemented ABS vectorization. | Aart Bik
Rationale: This CL adds the concept of vectorizing intrinsics to the ART vectorizer. More can follow (MIN, MAX, etc).
Test: test-art-host, test-art-target (angler)
Change-Id: Ieed8aa83ec64c1250ac0578570249cce338b5d36
2017-03-31 | ART vectorizer. | Aart Bik
Rationale: Make SIMD great again with a retargetable and easily extendable vectorizer. Provides a full x86/x86_64 and a proof-of-concept ARM implementation. Sample improvement (without any perf tuning yet) for Linpack on x86 is about 20% to 50%.
Test: test-art-host, test-art-target (angler)
Bug: 34083438, 30933338
Change-Id: Ifb77a0f25f690a87cd65bf3d5e9f6be7ea71d6c1
2017-03-08 | Merge "Inlining a few small methods based on profiling dex2oat with perf." | Mingyao Yang
2017-03-08 | Inlining a few small methods based on profiling dex2oat with perf. | Mingyao Yang
Test: m test-art-host
Change-Id: I6313158e59592d8d132154523be9c82dda3c7eb8
2017-03-06 | Pass driver to loop opt. Add new side_effects phase. | Aart Bik
Rationale: Break-out CL of ART Vectorizer: number 3. The purpose is making the original CL smaller and easier to review.
Bug: 34083438
Test: test-art-host
Change-Id: I7cece807ee4f5fcaeae41f1deed33ac263447b77
2017-02-17 | Skip loop optimization if there is no loop in the graph. | Mingyao Yang
LinearizeGraph() performs quite a few allocations. Also add some comments on the possible false positives of some flags.
Test: m test-art-host
Change-Id: I80ef89a2dc031d601e7621d0b22060cd8c17fae3
2017-01-13 | Complete unrolling of loops with small body and trip count one. | Aart Bik
Rationale: Avoids the unnecessary loop control overhead, suspend check, and exposes more opportunities for constant folding in the resulting loop body. Fully unrolls loop in execute() of the Dhrystone benchmark (3% to 8% improvements).
Test: test-art-host
Change-Id: If30f38caea9e9f87a929df041dfb7ed1c227aba3
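A source-level illustration of what complete unrolling of a trip-count-one loop amounts to. This is a minimal sketch, not ART's implementation (which works on the compiler's intermediate representation); the class and method names (TripCountOneSketch, withLoop, unrolled) are hypothetical.

```java
final class TripCountOneSketch {
    // A loop whose trip count is exactly one: the body runs once, with i == 0.
    static int withLoop(int[] a) {
        int sum = 0;
        for (int i = 0; i < 1; i++) {
            sum += a[i] * 2;
        }
        return sum;
    }

    // Hand-unrolled equivalent: no loop control and no back edge,
    // and i is replaced by the constant 0, which enables constant folding.
    static int unrolled(int[] a) {
        return a[0] * 2;
    }

    public static void main(String[] args) {
        int[] a = {21};
        System.out.println(withLoop(a) + " " + unrolled(a));
    }
}
```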
2016-12-09 | Added polynomial induction variables analysis. With tests. | Aart Bik
Rationale: Information on polynomial sequences is nice to further enhance BCE and last-value assignment. In this case, this CL enables more loop optimizations for benchpress' Sum (80 x speedup). Also changed rem-based geometric induction to wrap-around induction.
Test: test-art-host
Change-Id: Ie4d2659edefb814edda2c971c1f70ba400c31111
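To illustrate why recognizing a polynomial sequence helps last-value assignment, here is a minimal sketch. It assumes a loop of the form `sum += i`, which is a guess at the general shape and not taken from the benchpress source; the names PolynomialSketch, sumLoop, and sumClosedForm are hypothetical.

```java
final class PolynomialSketch {
    // sum follows a second-order polynomial sequence: 0, 0, 1, 3, 6, 10, ...
    static int sumLoop(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    // Closed-form last value n * (n - 1) / 2, evaluated in long and then
    // truncated so that it matches Java's wrap-around int arithmetic.
    static int sumClosedForm(int n) {
        if (n <= 0) {
            return 0; // the loop body never executes
        }
        return (int) ((long) n * (n - 1) / 2);
    }

    public static void main(String[] args) {
        System.out.println(sumLoop(10) + " " + sumClosedForm(10));
    }
}
```

Once the last value is available in closed form and nothing else in the loop is live, the loop itself becomes dead code.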
2016-11-04 | Account for early exit loop. | Aart Bik
Rationale: last value computation is obviously only right if the loop does not have early exits; only needed if cycle leaks to outside loop in any way.
Bug: 32633772
Test: 623-checker-loop-regressions
Change-Id: Id60beca4704491cff611ad12a24bfc63c09d32c3
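The following minimal sketch shows the kind of loop where a last-value computation would be wrong. The names (EarlyExitSketch, indexAfterSearch, naiveLastValue) are hypothetical and the example is not taken from the regression test.

```java
final class EarlyExitSketch {
    // The induction variable i leaks out of the loop, and the loop may exit early.
    static int indexAfterSearch(int[] a, int key) {
        int i = 0;
        for (; i < a.length; i++) {
            if (a[i] == key) {
                break; // early exit: i stops before reaching a.length
            }
        }
        return i;
    }

    // The last value a closed form would assign if the early exit were ignored.
    static int naiveLastValue(int[] a) {
        return a.length;
    }

    public static void main(String[] args) {
        int[] a = {4, 7, 9};
        System.out.println(indexAfterSearch(a, 7) + " vs " + naiveLastValue(a));
    }
}
```

The two results agree only when the key is absent, so the closed form is valid only for loops without early exits.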
2016-11-03 | More loop-body simplifications. | Aart Bik
Rationale: This removes all dead induction from the CaffeineLogic loop, giving yet the next performance boost (2700us->1700us). Also, the runtime is now the same between a DX compiled and JACK compiled version, giving confidence that all recently introduced optimizations are generally useful and something expected from any optimizing compiler. Last, less realistic improvement will pale anything seen so far, since it removes the full loop (still TBD).
Test: test-art-host
Change-Id: Id6b89f74b7d009616821dca195200933cc0eaaf2
2016-10-24 | Improved induction variable analysis and loop optimizations. | Aart Bik
Rationale: Rather than reconstructing cycles in a half-baked way during loop optimizations, this CL passes the SCC computed during induction variable analysis to the loop optimizer (trading some memory for more optimizations). This further improves CaffeineLogic from 6000us down to 4200us (dx) and 2200us to 1690us (jack). Note that this is on top of prior improvements in previous CLs. Also, some narrowing type concerns are taken care of during transfer operations.
Test: test-art-host
Change-Id: Ice2764811a70073c5014b3a05fb51f39fd2f4c3c
2016-10-20 | Improve recognition of select-based period induction. | Aart Bik
Rationale: Similar to the previous CL, this helps to eliminate more dead induction. Now, CaffeineLogic, when compiled with dx (rather than jack), improves by a 1.5x speedup (9000us -> 6000us). Note: We need to run the simplifier before induction analysis to trigger the select simplification first. Although a bit of a compile-time hit, it seems a good idea to run a simplifier here again anyway.
Test: test-art-host
Change-Id: I93b91ca40a4d64385c64393028e8d213f0c904a8
2016-10-18 | Enable last value generation of periodic sequence. | Aart Bik
Rationale: This helps to eliminate more dead induction. For example, CaffeineLogic when compiled with latest Jack improves with a 1.3x speedup (2900us -> 2200us) due to eliminating the first loop (the second loop can be removed also, but for a later case). The current benchmarks.dex, however, has a different construct for the periodics, still to be recognized.
Test: test-art-host
Change-Id: Ia81649a207a2b1f03ead0855436862ed4e4f45e0
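A minimal sketch of last-value generation for a periodic sequence, assuming a simple two-value flip (`k = 1 - k`); the actual CaffeineLogic loop is not shown in this log, and the names PeriodicSketch, flipLoop, and flipLastValue are hypothetical.

```java
final class PeriodicSketch {
    // k alternates 0, 1, 0, 1, ... with period two.
    static int flipLoop(int n) {
        int k = 0;
        for (int i = 0; i < n; i++) {
            k = 1 - k;
        }
        return k;
    }

    // Last value derived from the trip count alone; no loop required.
    static int flipLastValue(int n) {
        if (n <= 0) {
            return 0; // the loop body never executes
        }
        return (n % 2 == 0) ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(flipLoop(7) + " " + flipLastValue(7));
    }
}
```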
2016-10-11 | Improved and simplified loop optimizations. | Aart Bik
Rationale: Empty preheader simplification has been simplified to a much more general empty block removal optimization step. Incremental updating of induction variable analysis enables repeated elimination or simplification of induction cycles. This enabled an extra layer of optimization for e.g. Benchpress Loop (17.5us. -> 0.24us. -> 0.08us). So the original 73x speedup is now multiplied by another 3x, for a total of about 218x.
Test: 618-checker-induction et al.
Change-Id: I394699981481cdd5357e0531bce88cd48bd32879
2016-10-07 | Improved and simplified loop optimizations. | Aart Bik
Rationale: This CL merges some common cases into one, thereby simplifying the code quite a bit. It also prepares for more general induction cycles (rather than the simple phi-add currently used). Finally, it generalizes the closed form elimination with empty loops. As a result of the latter, elaborate but weird code like:

  private static int waterFall() {
    int i = 0;
    for (; i < 10; i++);
    for (; i < 20; i++);
    for (; i < 30; i++);
    for (; i < 40; i++);
    for (; i < 50; i++);
    return i;
  }

now becomes just this (on x86)!

  mov eax, 50
  ret

Change-Id: I8d22ce63ce9696918f57bb90f64d9a9303a4791d
Test: m test-art-host
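The effect described in this entry can be checked at the source level. The sketch below is only an illustration, not the compiler's transformation: waterFall is copied from the commit message, while the class name ClosedFormSketch and the method waterFallClosedForm are hypothetical.

```java
final class ClosedFormSketch {
    // Five empty loops sharing one counter, as in the commit message.
    static int waterFall() {
        int i = 0;
        for (; i < 10; i++);
        for (; i < 20; i++);
        for (; i < 30; i++);
        for (; i < 40; i++);
        for (; i < 50; i++);
        return i;
    }

    // Each empty loop only advances i to its upper bound, so substituting the
    // last value loop by loop (10, 20, 30, 40, 50) leaves a single constant.
    static int waterFallClosedForm() {
        return 50;
    }

    public static void main(String[] args) {
        System.out.println(waterFall() + " " + waterFallClosedForm());
    }
}
```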
2016-10-05 | Refactoring of graph linearization and linear order. | Aart Bik
Rationale: Ownership of graph's linear order and iterators was a bit unclear now that other phases are using it. New approach allows phases to compute their own order, while ssa_liveness is sole owner for graph (since it is not mutated afterwards). Also shortens lifetime of loop's arena.
Test: test-art-host
Change-Id: Ib7137d1203a1e0a12db49868f4117d48a4277f30
2016-10-05 | Make it possible to pass an arena allocator to HLoopOptimization. | Nicolas Geoffray
loop_optimization_test uses memory from HLoopOptimization's allocator, which is scoped by the Run method. Fix is to pass custom allocator.
Test: m test-art-host-gtest
Change-Id: I359330e22202519f400a26da5403eeb00f0b2db4
2016-10-05 | Properly scope HLoopOptimization's allocator. | Nicolas Geoffray
HOptimization classes do not get their destructor called, as they are arena objects. So the scope for the optimization allocator needs to be the Run method. Also anticipate bisection search breakage by adding HLoopOptimization to the list of recognized optimizations.
Change-Id: I7770989c39d5700a3b6b0a20af5d4b874dfde111
2016-10-03 | A first implementation of a loop optimization framework. | Aart Bik
Rationale: We are planning to add more and more loop related optimizations and this framework provides the basis to do so. For starters, the framework optimizes dead induction, induction that can be replaced with a simpler closed-form, and eliminates dead loops completely (either pre-existing or as a result of induction removal). Speedup on e.g. Benchpress Loop is 73x (17.5us. -> 0.24us.) [with the potential for more exploiting outer loop too]
Test: 618-checker-induction et al.
Change-Id: If80a809acf943539bf6726b0030dcabd50c9babc