Arm SVE register size is not fixed and can be a
multiple of 128 bits. To support that, the patch
removes the explicit assumption that the SIMD register
size is 128 bits from the vectorizer and code
generators and enables autovectorization for a
configurable SVE vector length, e.g. by extending the
SIMD register save/restore routines.
Test: art SIMD tests on VIXL simulator.
Test: art tests on FVP (steps in test/README.arm_fvp.md)
with FVP arg:
-C SVE.ScalableVectorExtension.veclen=[2,4]
(SVE vector [128,256] bits wide)
Change-Id: Icb46e7eb17f21d3bd38b16dd50f735c29b316427
|
|
This CL brings support for predicated execution to the
auto-vectorizer and implements the arm64 SVE vector backend.
This version passes all the VIXL simulator-runnable tests in
SVE mode with the checker off (as all VecOp CHECKs need to be
adjusted for an extra input) and all tests in NEON mode.
Test: art SIMD tests on VIXL simulator.
Test: art tests on FVP (steps in test/README.arm_fvp.md)
Change-Id: Ib78bde31a15e6713d875d6668ad4458f5519605f
|
|
The ART vectorizer assumes that a single SIMD register size
is used for the whole program. Make this assumption explicit
and refactor the code.
Note: This is a base for the future introduction of SIMD slots of
size other than 8 or 16 bytes.
Test: test-art-target, test-art-host.
Change-Id: Id699d5e3590ca8c655ecd9f9ed4e63f49e3c4f9c
|
|
This reverts commit e2727154f25e0db9a5bb92af494d8e47b181dfcf.
Reason for revert: Breaks ASAN tests (ODR violation).
Bug: 142365358
Change-Id: I38103d74a1297256c81d90872b6902ff1e9ef7a4
|
|
Make symbols in compiler/optimizing hidden by a namespace
attribute. The unit intrinsic_objects.{h,cc} is excluded as
it is needed by dex2oat.
As the symbols are no longer exported, gtests are now linked
with the static version of the libartd-compiler library.
libart-compiler.so size:
- before:
arm: 2396152
arm64: 3345280
- after:
arm: 2016176 (-371KiB, -15.9%)
arm64: 2874480 (-460KiB, -14.1%)
Test: m test-art-host-gtest
Test: testrunner.py --host --optimizing --jit
Bug: 142365358
Change-Id: I1fb04a33351f53f00b389a1642e81a68e40912a8
|
|
Implement support for a vectorization idiom which performs the
dot product of two vectors and adds the result to wider-precision
components in the accumulator.
viz. DOT_PRODUCT([ a1, .. , am], [ x1, .. , xn ], [ y1, .. , yn ]) =
[ a1 + sum(xi * yi), .. , am + sum(xj * yj) ],
for m <= n, non-overlapping sums,
for either both signed or both unsigned operands x, y.
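A minimal Java sketch of the kind of loop the idiom targets
(assuming byte inputs accumulated into an int; the method name
is illustrative):
  private static int dotProduct(byte[] x, byte[] y, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
      // Narrow (byte) multiplies accumulated into a wider (int) total.
      acc += x[i] * y[i];
    }
    return acc;
  }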
The patch shows up to 7x performance improvement on a micro
benchmark on Cortex-A57.
Test: 684-checker-simd-dotprod.
Test: test-art-host, test-art-target.
Change-Id: Ibab0d51f537fdecd1d84033197be3ebf5ec4e455
|
|
Remove all uses of macros 'FINAL' and 'OVERRIDE' and replace them with
'final' and 'override' specifiers. Remove all definitions of these
macros as well, which were located in these files:
- libartbase/base/macros.h
- test/913-heaps/heaps.cc
- test/ti-agent/ti_macros.h
ART is now using C++14; the 'final' and 'override' specifiers
were introduced in C++11.
Test: mmma art
Change-Id: I256c7758155a71a2940ef2574925a44076feeebf
|
|
Performs whole-loop unrolling for small loops with a small
trip count to eliminate the loop-check overhead and to create
more opportunities for inter-iteration optimizations.
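A minimal sketch of the transformation (illustrative Java; the
method names and the trip count of 4 are assumptions):
  // Before: loop with a small, known trip count.
  private static int sum4(int[] a) {
    int acc = 0;
    for (int i = 0; i < 4; i++) {
      acc += a[i];
    }
    return acc;
  }

  // After whole-loop unrolling (conceptually): no loop check or back
  // edge left, and the straight-line body is open to inter-iteration
  // optimizations.
  private static int sum4Unrolled(int[] a) {
    return a[0] + a[1] + a[2] + a[3];
  }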
caffeinemark/FloatAtom: 1.2x performance on arm64 Cortex-A57.
Test: 530-checker-peel-unroll.
Test: test-art-host, test-art-target.
Change-Id: Idf3fe3cb611376935d176c60db8c49907222e28a
|
|
Refactor scalar loop peeling and unrolling to eliminate repeated
checks and graph traversals, to make the code more readable and
to make it easier to add new scalar loop opts.
This is a prerequisite for the full unrolling patch.
Test: 530-checker-peel-unroll.
Test: test-art-target, test-art-host.
Change-Id: If824a95f304033555085eefac7524e59ed540322
|
|
Removes CompilerDriver dependency from ImageWriter and
several other classes.
Test: m test-art-host-gtest
Test: testrunner.py --host --optimizing
Test: Pixel 2 XL boots.
Test: m test-art-target-gtest
Test: testrunner.py --target --optimizing
Change-Id: I3c5b8ff73732128b9c4fad9405231a216ea72465
|
|
Turn on scalar loop peeling and unrolling by default.
Test: 482-checker-loop-back-edge-use, 530-checker-peel-unroll
Test: test-art-host, test-art-target, boot-to-gui
Change-Id: Ibfe1b54f790a97b281e85396da2985e0f22c2834
|
|
Test: test-art-host,target
Change-Id: I7f00315c61ed99723236283bc39a4c7fb279df47
|
|
Rationale:
The change adds a return value to Run() in preparation for
conditional pass execution. The value returned by Run() is
best effort: returning false means no optimizations were
applied or no useful information was obtained. I filled
in a few cases with more exact information; others
still just return true. In addition, it integrates inlining
as a regular pass, avoiding the ugly "break" into
optimizations1 and optimizations2.
Bug: b/78171933, b/74026074
Test: test-art-host,target
Change-Id: Ia39c5c83c01dcd79841e4b623917d61c754cf075
|
|
Implement scalar loop peeling for the elimination of
loop-invariant exits (on arm64). If the loop exit condition is
loop-invariant, then loop peeling + GVN + DCE can eliminate this
exit in the loop body. Note: GVN and DCE aren't applied during
loop optimizations.
Note: this functionality is turned off by default for now.
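An illustrative Java sketch of a loop this targets (the names
are assumptions; 'flag' is loop-invariant):
  private static void zeroUnlessFlagged(int[] a, boolean flag) {
    for (int i = 0; i < a.length; i++) {
      if (flag) {
        break;  // exit on a loop-invariant condition
      }
      a[i] = 0;
    }
    // After peeling the first iteration, the remaining loop is only
    // entered when 'flag' is false, so GVN + DCE can remove the
    // in-loop check.
  }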
Test: test-art-host, test-art-target, boot-to-gui.
Change-Id: I98d20054a431838b452dc06bd25c075eb445960c
|
|
Implement scalar loop unrolling for small loops with a known
trip count (on arm64) to reduce the loop-check and branch
penalty and to provide more opportunities for instruction
scheduling.
Note: this functionality is turned off by default for now.
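A minimal sketch, assuming an unrolling factor of 2 and a known
trip count of 128 (both illustrative):
  private static void addOne(int[] a, int[] b) {
    // Assumes a.length >= 128 and b.length >= 128.
    // Unrolled by 2: the loop check and back edge are taken half as
    // often, and the two independent statements give the scheduler
    // more freedom.
    for (int i = 0; i < 128; i += 2) {
      a[i] = b[i] + 1;
      a[i + 1] = b[i + 1] + 1;
    }
  }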
Test: cloner_test.cc
Test: test-art-target, test-art-host
Change-Id: Ic27fd8fb0bc0d7b69251252da37b8b510bc30acc
|
|
Rationale:
Because faster is better.
Bug: b/74026074
Test: test-art-host,target
Change-Id: Ifa970a62cef1c0b8bb1c593f629d8c724f1ffe0e
|
|
Rationale:
Refactors the way we set up optimization passes
in the compiler into a more centralized approach.
The refactoring also found some "holes" in the
existing mechanism (missing string lookup in
the debugging mechanism, or the inability to set
an alternative name for optimizations that may repeat).
Bug: 64538565
Test: test-art-host test-art-target
Change-Id: Ie5e0b70f67ac5acc706db91f64612dff0e561f83
|
|
Rationale:
Since aligned data access is generally better (enables more efficient
aligned moves and prevents nasty cache line splits), computing and/or
enforcing alignment has been added to the vectorizer:
(1) If the initial alignment is known completely and suffices,
then a static peeling factor enforces proper alignment.
(2) If (1) fails but the base alignment allows it, dynamic peeling
until the total offset is aligned forces properly aligned access
patterns.
By using ART conventions only, any forced alignment is preserved
over suspend checks where data may move.
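A sketch of the code shape produced for case (2) (illustrative
Java; the peel count is in reality computed by the compiler from
the base alignment and offset):
  private static void scale(int[] a, int peel) {
    int i = 0;
    // Dynamic peel: a few scalar iterations until the access is aligned.
    for (; i < peel && i < a.length; i++) {
      a[i] *= 2;
    }
    // Main SIMD loop: accesses starting at a[i] are now aligned.
    for (; i + 3 < a.length; i += 4) {
      a[i] *= 2;
      a[i + 1] *= 2;
      a[i + 2] *= 2;
      a[i + 3] *= 2;
    }
    // Scalar cleanup for the remaining elements.
    for (; i < a.length; i++) {
      a[i] *= 2;
    }
  }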
Note 1:
The current allocation convention is just 8-byte alignment for
arrays/strings, so only ARM32 benefits. However, all optimizations
are implemented in a general way, so moving to a 16-byte alignment
will immediately take advantage of the new convention!
Note 2:
This CL also exposes how bad the choice of a 12-byte offset for
arrays really is. Even though the new optimizations fix the
misalignment, they require peeling for the most common case:
0-indexed loops. Therefore, we may even consider moving to a
16-byte offset. Again, the optimizations in this CL will
immediately take advantage of that new convention!
Test: test-art-host test-art-target
Change-Id: Ib6cc0fb68c9433d3771bee573603e64a3a9423ee
|
|
Support the SIMD reduction (add, min, max) and SAD (for int->int
only) idioms for the arm (32-bit) backend.
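For example, a min reduction of the following shape (an
illustrative Java sketch) is now also handled on arm:
  private static int minValue(int[] a) {
    int m = Integer.MAX_VALUE;
    for (int i = 0; i < a.length; i++) {
      m = Math.min(m, a[i]);  // min reduction
    }
    return m;
  }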
Test: test-art-target, test-art-host
Test: 661-checker-simd-reduc, 660-checker-simd-sad-int
Change-Id: Ic6121f5d781a9bcedc33041b6c4ecafad9b0420a
|
|
Passes using local ArenaAllocator were hiding their memory
usage from the allocation counting, making it difficult to
track down where memory was used. Using ScopedArenaAllocator
reveals the memory usage.
This changes the HGraph constructor which requires a lot of
changes in tests. Refactor these tests to limit the amount
of work needed the next time we change that constructor.
Test: m test-art-host-gtest
Test: testrunner.py --host
Test: Build with kArenaAllocatorCountAllocations = true.
Bug: 64312607
Change-Id: I34939e4086b500d6e827ff3ef2211d1a421ac91a
|
|
Replace most uses of the runtime's Primitive in compiler
with a new class DataType. This prepares for introducing
new types, such as Uint8, that the runtime does not need
to know about.
Test: m test-art-host-gtest
Test: testrunner.py --host
Bug: 23964345
Change-Id: Iec2ad82454eec678fffcd8279a9746b90feb9b0c
|
|
Rationale:
Currently just on ARM64 (x86 lacks proper support), using
the SAD idiom yields a great speedup on loops that compute a
sum of absolute differences.
Also includes some refinements around type conversions.
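An illustrative Java sketch of a loop that matches the idiom
(names assumed):
  private static int sad(int[] x, int[] y, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
      // Sum of absolute differences, accumulated in int precision.
      s += Math.abs(x[i] - y[i]);
    }
    return s;
  }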
Speedup ExoPlayerAudio (golem run):
1.3x on ARM64
1.1x on x86
Test: test-art-host test-art-target
Bug: 64091002
Change-Id: Ia2b711d2bc23609a2ed50493dfe6719eedfe0130
|
|
Test: market scan.
Change-Id: I58b23b8d254883f30619ea3602d34bf93618d432
|
|
Rationale:
Enables vectorization of x += .... for very basic (simple, same-type)
constructs. Paves the way for more complex (narrower and/or mixed-type)
constructs, which will be handled by the next CL.
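The simplest qualifying shape, as an illustrative Java sketch:
  private static int sum(int[] a) {
    int x = 0;
    for (int i = 0; i < a.length; i++) {
      x += a[i];  // simple, same-type reduction
    }
    return x;
  }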
This is a revert of Icb5d6c805516db0a1d911c3ede9a246ccef89a22
and thus a revert^2 of I2454778dd0ef1da915c178c7274e1cf33e271d0f
and thus a revert^3 of I1c1c87b6323e01442e8fbd94869ddc9e760ea1fc
and thus a revert^4 of I7880c135aee3ed0a39da9ae5b468cbf80e613766
PS1-2 shows what needed to change
Test: test-art-host test-art-target
Bug: 64091002
Change-Id: I647889e0da0959ca405b70081b79c7d3c9bcb2e9
|
|
Fails 530-checker-lse on arm64.
Bug: 64091002, 65212948
This reverts commit cfa59b49cde265dc5329a7e6956445f9f7a75f15.
Change-Id: Icb5d6c805516db0a1d911c3ede9a246ccef89a22
|
|
Rationale:
Enables vectorization of x += .... for very basic (simple, same-type)
constructs. Paves the way for more complex (narrower and/or mixed-type)
constructs, which will be handled by the next CL.
This is a revert^2 of I7880c135aee3ed0a39da9ae5b468cbf80e613766
and thus a revert of I1c1c87b6323e01442e8fbd94869ddc9e760ea1fc
PS1-2 shows what needed to change, with regression tests
Test: test-art-host test-art-target
Bug: 64091002, 65212948
Change-Id: I2454778dd0ef1da915c178c7274e1cf33e271d0f
|
|
This reverts commit 9879d0eac8fe2aae19ca6a4a2a83222d6383afc2.
Getting these type check failures in some builds. Need time to look at this better, so reverting for now :-(
dex2oatd F 08-30 21:14:29 210122 226218
code_generator.cc:115] Check failed: CheckType(instruction->GetType(), locations->InAt(0)) PrimDouble C
Change-Id: I1c1c87b6323e01442e8fbd94869ddc9e760ea1fc
|
|
Rationale:
Enables vectorization of x += .... for very basic (simple, same-type)
constructs. Paves the way for more complex (narrower and/or mixed-type)
constructs, which will be handled by the next CL.
Test: test-art-host test-art-target
Bug: 64091002
Change-Id: I7880c135aee3ed0a39da9ae5b468cbf80e613766
|
|
Rationale:
Recognize reductions in loops. Note that reductions are *not*
optimized yet (we would proceed with e.g. unrolling and vectorization).
This CL merely sets up the basic detection framework. Also does
a bit of cleanup on loop optimization code.
Bug: 64091002
Test: test-art-host
Change-Id: I0f52bd7ca69936315b03d02e83da743b8ad0ae72
|
|
Rationale:
This CL introduces the basic framework for dynamically peeling
(to obtain aligned access) and unrolling the vector loop (to reduce
looping overhead and allow more target specific optimizations
on e.g. SIMD loads and stores).
NOTE:
The current heuristics are "bogus" and merely meant to exercise
the new framework. This CL focuses on introducing correct code for
the vectorizer. Heuristics and the memory computations for alignment
are to be implemented later.
Test: test-art-target, test-art-host
Change-Id: I010af1475f42f92fd1daa6a967d7a85922beace8
|
|
We should not remove instructions that have deoptimize as
users, or that have environment uses in a debuggable setup.
Bug: 62536525
Bug: 33775412
Test: 656-loop-deopt
Change-Id: Iaec1a0b6e90c6a0169f18c6985f00fd8baf2dece
|
|
MIPS64 implementation which uses the MSA extension. Also extended
all relevant checker tests to cover the MIPS64 implementation.
Test: booted MIPS64R6 in QEMU
Test: ./testrunner.py --target --optimizing -j1 in QEMU
Change-Id: I8b8a2f601076bca1925e21213db8ed1d41d79b52
|
|
This is a revert^2 of commit 636e870d55c1739e2318c2180fac349683dbfa97.
Rationale:
Under strict conditions, even operations that are sensitive
to higher-order bits can be vectorized by inspecting the operands
carefully. This enables more vectorization, as demonstrated
by the removal of quite a few TODOs.
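For instance (an illustrative sketch, not taken from the CL), a
logical shift on byte data is sensitive to the bits above the
byte, but it is safe in 8-bit lanes once the operand is known to
be zero-extended:
  private static void halve(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      // (a[i] & 0xff) is known zero-extended from 8 bits, so the shift
      // can be performed in the narrow lanes without changing the result.
      b[i] = (byte) ((a[i] & 0xff) >>> 1);
    }
  }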
Test: test-art-target, test-art-host
Change-Id: Ic2684f771d2e36df10432286198533284acaf472
|
|
Fails on armv8 / speed-profile
This reverts commit 636e870d55c1739e2318c2180fac349683dbfa97.
Change-Id: Ib2a09b3adeba994c6b095672a1e08b32d3871872
|
|
Rationale:
Under strict conditions, even operations that are sensitive
to higher-order bits can be vectorized by inspecting the operands
carefully. This enables more vectorization, as demonstrated
by the removal of quite a few TODOs.
Test: test-art-target, test-art-host
Change-Id: I2b0fda6a182da9aed9ce1708a53eaf0b7e1c9146
|
|
Rationale:
The more vectorized, the better!
Test: test-art-target, test-art-host
Change-Id: I758becca5beaa5b97fab2ab70f2e00cb53458703
|
|
Rationale:
First of several idioms that map to very efficient SIMD instructions.
Note that the is-zero-ext and is-sign-ext checks are general-purpose
utilities that will be widely used in the vectorizer to detect
low-precision idioms, so expect that code to be shared with many
CLs to come.
Test: test-art-host, test-art-target
Change-Id: If7dc2926c72a2e4b5cea15c44ef68cf5503e9be9
|
|
Rationale:
This CL adds the concept of vectorizing intrinsics
to the ART vectorizer. More can follow (MIN, MAX, etc).
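As an illustration (assuming ABS is among the recognized
intrinsics; the method name is made up):
  private static void absAll(int[] a) {
    for (int i = 0; i < a.length; i++) {
      a[i] = Math.abs(a[i]);  // intrinsic call inside a vectorizable loop
    }
  }
Such a loop can now be vectorized even though its body contains
a method call.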
Test: test-art-host, test-art-target (angler)
Change-Id: Ieed8aa83ec64c1250ac0578570249cce338b5d36
|
|
Rationale:
Make SIMD great again with a retargetable and easily extensible
vectorizer. Provides a full x86/x86_64 and a proof-of-concept ARM
implementation. Sample improvement (without any perf tuning yet)
for Linpack on x86 is about 20% to 50%.
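A minimal example of the kind of loop it handles (an
illustrative Java sketch, assuming arrays of length >= n):
  private static void add(int[] a, int[] b, int[] c, int n) {
    for (int i = 0; i < n; i++) {
      c[i] = a[i] + b[i];  // element-wise add, mapped onto SIMD lanes
    }
  }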
Test: test-art-host, test-art-target (angler)
Bug: 34083438, 30933338
Change-Id: Ifb77a0f25f690a87cd65bf3d5e9f6be7ea71d6c1
|
|
Rationale:
Break-out CL of ART Vectorizer: number 3.
The purpose is making the original CL smaller
and easier to review.
Bug: 34083438
Test: test-art-host
Change-Id: I7cece807ee4f5fcaeae41f1deed33ac263447b77
|
|
Rationale:
Avoids unnecessary loop control overhead and suspend checks,
and exposes more opportunities for constant folding in the
resulting loop body. Fully unrolls the loop in execute() of
the Dhrystone benchmark (3% to 8% improvement).
Test: test-art-host
Change-Id: If30f38caea9e9f87a929df041dfb7ed1c227aba3
|
|
Rationale:
Information on polynomial sequences is nice to further enhance
BCE and last-value assignment. In this case, this CL enables more
loop optimizations for benchpress' Sum (80x speedup). Also
changed rem-based geometric induction to wrap-around induction.
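For instance, the last value of the running sum below is a
polynomial in n and can now be computed in closed form
(illustrative Java):
  private static int triangular(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
      sum += i;  // polynomial induction; closed form is n*(n-1)/2 for n >= 0
    }
    return sum;
  }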
Test: test-art-host
Change-Id: Ie4d2659edefb814edda2c971c1f70ba400c31111
|
|
Rationale:
Last-value computation is obviously only correct if
the loop does not have early exits; it is only needed
if the cycle leaks outside the loop in any way.
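A sketch of the early-exit case this guards against
(illustrative Java):
  private static int firstZero(int[] a) {
    int i = 0;
    for (; i < a.length; i++) {
      if (a[i] == 0) {
        break;  // early exit: the last value of i is not necessarily a.length
      }
    }
    // i leaks outside the loop, so replacing it with a computed last
    // value is only valid when no early exit can be taken.
    return i;
  }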
Bug: 32633772
Test: 623-checker-loop-regressions
Change-Id: Id60beca4704491cff611ad12a24bfc63c09d32c3
|
|
Rationale:
Rather than half-baked reconstruction of cycles during loop
optimizations, this CL passes the SCC computed during induction
variable analysis to the loop optimizer (trading some memory for
more optimizations). This further improves CaffeineLogic from
6000us down to 4200us (dx) and from 2200us to 1690us (jack).
Note that this is on top of prior improvements in previous CLs.
Also, some narrowing-type concerns are taken care of during
transfer operations.
Test: test-art-host
Change-Id: Ice2764811a70073c5014b3a05fb51f39fd2f4c3c
|
|
Rationale:
This helps to eliminate more dead induction. For example,
CaffeineLogic, when compiled with the latest Jack, improves with
a 1.3x speedup (2900us -> 2200us) due to eliminating the first
loop (the second loop can be removed also, but that is for a
later case). The current benchmarks.dex has a different construct
for the periodics, however, which is still to be recognized.
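An illustrative example of a dead induction cycle (names and
values are made up):
  private static void deadInduction(int n) {
    int x = 0;
    for (int i = 0; i < n; i++) {
      x += 2;  // x is never used after the loop: a dead induction cycle
    }
    // Once the dead cycle is removed the loop body is empty, and the
    // (side-effect free) loop itself can be eliminated as well.
  }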
Test: test-art-host
Change-Id: Ia81649a207a2b1f03ead0855436862ed4e4f45e0
|
|
Rationale:
Empty preheader simplification has been generalized into
a much broader empty-block removal optimization
step. Incremental updating of induction variable
analysis enables repeated elimination or simplification
of induction cycles.
This enabled an extra layer of optimization for
e.g. Benchpress Loop (17.5us -> 0.24us -> 0.08us).
So the original 73x speedup is now multiplied
by another 3x, for a total of about 218x.
Test: 618-checker-induction et al.
Change-Id: I394699981481cdd5357e0531bce88cd48bd32879
|
|
Rationale:
This CL merges some common cases into one, thereby simplifying
the code quite a bit. It also prepares for more general induction
cycles (rather than the simple phi-add currently used). Finally,
it generalizes the closed form elimination with empty loops.
As a result of the latter, elaborate but weird code like:
  private static int waterFall() {
    int i = 0;
    for (; i < 10; i++);
    for (; i < 20; i++);
    for (; i < 30; i++);
    for (; i < 40; i++);
    for (; i < 50; i++);
    return i;
  }
now becomes just this (on x86)!
    mov eax, 50
    ret
Change-Id: I8d22ce63ce9696918f57bb90f64d9a9303a4791d
Test: m test-art-host
|
|
Rationale:
Ownership of the graph's linear order and iterators was
a bit unclear now that other phases are using it.
The new approach allows phases to compute their own
order, while ssa_liveness is the sole owner for the graph
(since it is not mutated afterwards).
Also shortens the lifetime of the loop optimizer's arena.
Test: test-art-host
Change-Id: Ib7137d1203a1e0a12db49868f4117d48a4277f30
|
|
loop_optimization_test uses memory from HLoopOptimization's
allocator, which is scoped to the Run method.
The fix is to pass a custom allocator.
Test: m test-art-host-gtest
Change-Id: I359330e22202519f400a26da5403eeb00f0b2db4
|
|
HOptimization classes do not get their destructors called,
as they are arena objects, so the scope of the optimization
allocator needs to be the Run method.
Also anticipate bisection-search breakage by adding
HLoopOptimization to the list of recognized optimizations.
Change-Id: I7770989c39d5700a3b6b0a20af5d4b874dfde111
|