Age | Commit message (Collapse) | Author |
|
Test: test-art-host, test-art-target
Change-Id: I1d4db1763b64737766f9756e5d0f85c5736e3522
|
|
Test: test-art-host, test-art-target
Change-Id: Ifb931a99d34ea77602a0e0781040ed092de9faaa
|
|
With regression test (found by fuzz testing).
Bug: 37033123
Test: test-art-target
Change-Id: Id738b2a3a353985c3d0bf3beeb581a31f1fcbc3f
|
|
Test: test-art-target
Change-Id: I7c6a5a2d29edd0b2782abc303d8d8cb09c1c2f91
|
|
|
|
Rationale:
OSR requires the suspend check to already have an environment,
albeit just for testing irreducible loops. This CL fixes the
omission. Note, the error is spurious on OSR and writing a
unit or regression test for this is hard.
Test: test-art-host
Bug: 36950873
Change-Id: Ica89e18e10deb438dead79e2cc40dd00a60b529f
|
|
Rationale:
This CL adds the concept of vectorizing intrinsics
to the ART vectorizer. More can follow (MIN, MAX, etc).
Test: test-art-host, test-art-target (angler)
Change-Id: Ieed8aa83ec64c1250ac0578570249cce338b5d36
|
|
Rationale:
Make SIMD great again with a retargetable and easily extendable vectorizer.
Provides a full x86/x86_64 and a proof-of-concept ARM implementation. Sample
improvement (without any perf tuning yet) for Linpack on x86 is about 20% to 50%.
Test: test-art-host, test-art-target (angler)
Bug: 34083438, 30933338
Change-Id: Ifb77a0f25f690a87cd65bf3d5e9f6be7ea71d6c1
|
|
|
|
Test: m test-art-host
Change-Id: I6313158e59592d8d132154523be9c82dda3c7eb8
|
|
Rationale:
Break-out CL of ART Vectorizer: number 3.
The purpose is making the original CL smaller
and easier to review.
Bug: 34083438
Test: test-art-host
Change-Id: I7cece807ee4f5fcaeae41f1deed33ac263447b77
|
|
LinearizeGraph() does quite some allocations.
Also add some comments on the possible false positives of
some flags.
Test: m test-art-host
Change-Id: I80ef89a2dc031d601e7621d0b22060cd8c17fae3
|
|
Rationale:
Avoids the unnecessary loop control overhead, suspend check,
and exposes more opportunities for constant folding in the
resulting loop body. Fully unrolls loop in execute() of
the Dhrystone benchmark (3% to 8% improvements).
Test: test-art-host
Change-Id: If30f38caea9e9f87a929df041dfb7ed1c227aba3
|
|
Rationale:
Information on polynomial sequences is nice to further enhance
BCE and last-value assignment. In this case, this CL enables more
loop optimizations for benchpress' Sum (80 x speedup). Also
changed rem-based geometric induction to wrap-around induction.
Test: test-art-host
Change-Id: Ie4d2659edefb814edda2c971c1f70ba400c31111
|
|
Rationale:
last value computation is obviously only right if
the loop does not have early exits; only needed
if cycle leaks to outside loop in any way.
Bug:32633772
Test: 623-checker-loop-regressions
Change-Id: Id60beca4704491cff611ad12a24bfc63c09d32c3
|
|
Rationale:
This removes all dead induction from the CaffeineLogic loop,
giving yet the next performance boost (2700us->1700us).
Also, the runtime is now the same between a DX compiled
and JACK compiled version, giving confidence that all
recent introduced optimizations are generally useful
and something expected from any optimizing compiler.
Last, less realistic improvement will pale anything
seen so far, since it removes the full loop (still TBD).
Test: test-art-host
Change-Id: Id6b89f74b7d009616821dca195200933cc0eaaf2
|
|
Rationale:
Rather than half-baked reconstructing cycles during loop optimizations,
this CL passes the SCC computed during induction variable analysis
to the loop optimizer (trading some memory for more optimizations).
This further improves CaffeineLogic from 6000us down to 4200us (dx)
and 2200us to 1690us (jack). Note that this is on top of prior
improvements in previous CLs. Also, some narrowing type concerns
are taken care of during transfer operations.
Test: test-art-host
Change-Id: Ice2764811a70073c5014b3a05fb51f39fd2f4c3c
|
|
Rationale:
Similar to the previous CL, this helps to eliminate more dead induction.
Now, CaffeineLogic, when compiled with dx (rather than jack) improves
by a 1.5 speedup (9000us -> 6000us).
Note:
We need to run the simplifier before induction analysis to trigger
the select simplification first. Although a bit of a compile-time hit,
it seems a good idea to run a simplifier here again anyway.
Test: test-art-host
Change-Id: I93b91ca40a4d64385c64393028e8d213f0c904a8
|
|
Rationale:
This helps to eliminate more dead induction. For example,
CaffeineLogic when compiled with latest Jack improves with
a 1.3 speedup (2900us -> 2200us) due to eliminating first
loop (second loop can be removed also, but for a later
case). The currently benchmarks.dex has a different construct
for the periodics, however, still to be recognized.
Test: test-art-host
Change-Id: Ia81649a207a2b1f03ead0855436862ed4e4f45e0
|
|
Rationale:
Empty preheader simplification has been simplified
to a much more general empty block removal optimization
step. Incremental updating of induction variable
analysis enables repeated elimination or simplification
of induction cycles.
This enabled an extra layer of optimization for
e.g. Benchpress Loop (17.5us. -> 0.24us. -> 0.08us).
So the original 73x speedup is now multiplied
by another 3x, for a total of about 218x.
Test: 618-checker-induction et al.
Change-Id: I394699981481cdd5357e0531bce88cd48bd32879
|
|
Rationale:
This CL merges some common cases into one, thereby simplifying
the code quite a bit. It also prepares for more general induction
cycles (rather than the simple phi-add currently used). Finally,
it generalizes the closed form elimination with empty loops.
As a result of the latter, elaborate but weird code like:
private static int waterFall() {
int i = 0;
for (; i < 10; i++);
for (; i < 20; i++);
for (; i < 30; i++);
for (; i < 40; i++);
for (; i < 50; i++);
return i;
}
now becomes just this (on x86)!
mov eax, 50
ret
Change-Id: I8d22ce63ce9696918f57bb90f64d9a9303a4791d
Test: m test-art-host
|
|
Rationale:
Ownership of graph's linear order and iterators was
a bit unclear now that other phases are using it.
New approach allows phases to compute their own
order, while ssa_liveness is sole owner for graph
(since it is not mutated afterwards).
Also shortens lifetime of loop's arena.
Test: test-art-host
Change-Id: Ib7137d1203a1e0a12db49868f4117d48a4277f30
|
|
loop_optimization_test uses memory from HLoopOptimization's
allocator, which is scoped by the Run method.
Fix is to pass custom allocator.
test: m test-art-host-gtest
Change-Id: I359330e22202519f400a26da5403eeb00f0b2db4
|
|
HOptimization classes do not get their destructor called,
as they are arena objects. So the scope for the optimization
allocator needs to be the Run method.
Also anticipate bisection search breakage by adding
HLoopOptimization to the list of recognized optimizations.
Change-Id: I7770989c39d5700a3b6b0a20af5d4b874dfde111
|
|
Rationale:
We are planning to add more and more loop related optimizations
and this framework provides the basis to do so. For starters,
the framework optimizes dead induction, induction that can be
replaced with a simpler closed-form, and eliminates dead loops
completely (either pre-existing or as a result of induction
removal).
Speedup on e.g. Benchpress Loop is 73x (17.5us. -> 0.24us.)
[with the potential for more exploiting outer loop too]
Test: 618-checker-induction et al.
Change-Id: If80a809acf943539bf6726b0030dcabd50c9babc
|