Age | Commit message | Author |
|
Also make the internal_state struct have a fixed size regardless of which features have been activated.
internal_state is now always 6040 bytes on Linux/x86-64 and 5952 bytes on Linux/x86-32.
|
- Change window_size from unsigned long to unsigned int
- Change block_start from long to int
- Change high_water from unsigned long to unsigned int
- Reorder to promote cache locality in hot code and decrease holes.
On x86_64 this means the struct goes from:
/* size: 6008, cachelines: 94, members: 57 */
/* sum members: 5984, holes: 6, sum holes: 24 */
/* last cacheline: 56 bytes */
To:
/* size: 5984, cachelines: 94, members: 57 */
/* sum members: 5972, holes: 3, sum holes: 8 */
/* padding: 4 */
/* last cacheline: 32 bytes */
|
This gives a good performance increase and usually also improves compression.
Add a separate HASH_SLIDE define for the fallback version of UPDATE_HASH.
|
searching for matches.
Move TRIGGER_LEVEL to match_tpl.h since it is only used in longest match.
Use early return inside match loops instead of cont variable.
Added back the two-variable check for platforms that don't support unaligned access.
|
By implementing a (UNALIGNED_OK && !UNALIGNED64_OK) codepath.
|
trees.c and deflate_quick.c so that their functions can be statically linked for performance reasons.
|
static tables, allowing for 32k window.
|
ARM.
|
unaligned conditionally compiled code for insert_string and quick_insert_string. Unify sse42 crc32 assembly between insert_string and quick_insert_string. Modified quick_insert_string to work across architectures.
|
new bits valid in the bit buffer.
|
Put the most likely condition first in send_bits.
|
Renamed putShortMSB to put_short_msb and moved to deflate.h with the rest of the put functions.
Added put_uint32 and put_uint32_msb and replaced instances of put_byte with put_short or put_uint32.
|
indent, initial function brace on the same line as definition, removed extraneous spaces and new lines.
|
Re-introduce private temp variables for val and len in send_bits macro.
|
Also simplify the debug tracing by folding it into the define instead
of using a separate static function.
x86_64 shows a small performance improvement.
|
Add necessary code to cmake and configure.
Fix slide_hash_sse2 to compile with zlib-ng.
|
library code.
|
(#363)
|
IBM Z DEFLATE CONVERSION CALL may produce different (but valid)
compressed data for the same uncompressed data. This behavior might be
unacceptable for certain use cases (e.g. reproducible builds). This
patch introduces the Z_DEFLATE_REPRODUCIBLE parameter, which can be
used to indicate that reproducibility is required, and to turn off the
IBM Z DEFLATE CONVERSION CALL.
|
This pull request attempts to fix some compiler warnings on Windows when compiled in Release mode.
```
"zlib-ng\ALL_BUILD.vcxproj" (default target) (1) ->
"zlib-ng\zlibstatic.vcxproj" (default target) (6) ->
zlib-ng\deflate.c(1626): warning C4244: '=': conversion from 'uint16_t' to 'unsigned char', possible loss of data [zlib-ng\zlibstatic.vcxproj]
zlib-ng\deflate_fast.c(61): warning C4244: '=': conversion from 'uint16_t' to 'unsigned char', possible loss of data [zlib-ng\zlibstatic.vcxproj]
zlib-ng\deflate_slow.c(89): warning C4244: '=': conversion from 'uint16_t' to 'unsigned char', possible loss of data [zlib-ng\zlibstatic.vcxproj]
```
|
define names, resulting in improved compiler support.
Based on the overviews from several sites, such as:
http://nadeausoftware.com/articles/2012/02/c_c_tip_how_detect_processor_type_using_compiler_predefined_macros
|
This patch addresses several warnings from `make test` when
zlib-ng was configured with -with-fuzzers -with-sanitizers:
zlib-ng/trees.c:798:5: runtime error: store to misaligned address 0x63100125c801 for type 'uint16_t' (aka 'unsigned short'), which requires 2 byte alignment
0x63100125c801: note: pointer points here
00 80 76 01 8b 08 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior zlib-ng/trees.c:798:5 in
zlib-ng/trees.c:799:5: runtime error: store to misaligned address 0x63100125c803 for type 'uint16_t' (aka 'unsigned short'), which requires 2 byte alignment
0x63100125c803: note: pointer points here
76 01 f5 08 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
Instead of using `*(uint16_t*) foo = bar` to write a uint16_t, call
__builtin_memcpy, which is safe even when the store crosses a memory
page boundary.
Without the patch:
Performance counter stats for './minigzip -9 llvm.tar':
13173.840115 task-clock (msec) # 1.000 CPUs utilized
27 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
129 page-faults # 0.010 K/sec
57,801,072,298 cycles # 4.388 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
75,270,723,557 instructions # 1.30 insns per cycle
17,797,368,302 branches # 1350.963 M/sec
196,795,107 branch-misses # 1.11% of all branches
13.177897531 seconds time elapsed
45408 -rw-rw-r-- 1 spop spop 46493896 Dec 11 14:45 llvm.tar.gz
With remove-unaligned-stores patch:
13184.736536 task-clock (msec) # 1.000 CPUs utilized
44 context-switches # 0.003 K/sec
1 cpu-migrations # 0.000 K/sec
129 page-faults # 0.010 K/sec
57,882,724,316 cycles # 4.390 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
75,235,920,853 instructions # 1.30 insns per cycle
17,826,873,999 branches # 1352.084 M/sec
196,050,096 branch-misses # 1.10% of all branches
13.185868238 seconds time elapsed
45408 -rw-rw-r-- 1 spop spop 46493896 Dec 11 14:46 llvm.tar.gz
|
We noticed recently on the Skia tree that if we build Chromium's zlib
with GCC, -O3, -m32, and -msse2, deflateInit2_() crashes. Might also
need -fPIC... not sure.
I tracked this down to a `movaps` (16-byte aligned store) to an address
that was only 8-byte aligned. This address was somewhere in the middle
of the deflate_state struct that deflateInit2_()'s job is to initialize.
That deflate_state struct `s` is allocated using ZALLOC, which calls any
user supplied zalloc if set, or the default if not. Neither one of
these has any special alignment contract, so generally they'll tend to
be 2*sizeof(void*) aligned. On 32-bit builds, that's 8-byte aligned.
But because we've annotated crc0 as zalign(16), the natural alignment of
the whole struct is 16-byte, and a compiler like GCC can feel free to
use 16-byte aligned stores to parts of the struct that are 16-byte
aligned, like the beginning, crc0, or any other part before or after
crc0 that happens to fall on a 16-byte boundary. With -O3 and -msse2,
GCC does exactly that, writing a few of the fields with one 16-byte
store.
The fix is simply to remove zalign(16). All the code that manipulates
this field was actually already using unaligned loads and stores. You
can see it all right at the top of crc_folding.c, CRC_LOAD and CRC_SAVE.
This bug comes from the Intel performance patches we landed a few years
ago, and isn't present in upstream zlib, Android's zlib, or Google's
internal zlib.
It doesn't seem to be tickled by Clang, and won't happen on 64-bit GCC
builds: zalloc is likely 16-byte aligned there. I _think_ it's possible
for it to trigger on non-x86 32-bit builds with GCC, but haven't tested
that. I also have not tested MSVC.
Reviewed-on: https://chromium-review.googlesource.com/1236613
|
This bug was reported by Danilo Ramos of Eideticom, Inc. It has
lain in wait 13 years before being found! The bug was introduced
in zlib 1.2.2.2, with the addition of the Z_FIXED option. That
option forces the use of fixed Huffman codes. For rare inputs with
a large number of distant matches, the pending buffer into which
the compressed data is written can overwrite the distance symbol
table which it overlays. That results in corrupted output due to
invalid distances, and can result in out-of-bound accesses,
crashing the application.
The fix here combines the distance buffer and literal/length
buffers into a single symbol buffer. Now three bytes of pending
buffer space are opened up for each literal or length/distance
pair consumed, instead of the previous two bytes. This assures
that the pending buffer cannot overwrite the symbol table, since
the maximum fixed code compressed length/distance is 31 bits, and
since there are four bytes of pending space for every three bytes
of symbol space.
|
The struct contains pointers to select functions to be used by the
rest of zlib, and the init function selects what functions will be
used depending on what optimizations have been compiled in and what
instruction-sets are available at runtime.
Tests done on a Haswell CPU running minigzip -6 compression of a
40M file show a 2.5% decrease in branches and a 25-30% reduction
in iTLB-loads. The reduction in iTLB-loads is likely mostly due to
the inability to inline functions. This also causes a slight
performance regression of around 1%, but it might still be worth it
to make it much easier to implement new optimized functions for
various architectures and instruction sets.
The performance penalty will get smaller for functions that get more
alternative implementations to choose from, since there is no need
to add more branches to every call of the function.
Today insert_string has one branch to choose insert_string_sse
or insert_string_c, but if we also added, for example,
insert_string_sse4, that would need another branch, and at some
point it would probably hinder effective inlining too.
|
implementation. Also change from pre-increment to post-increment to
prevent a double-store on non-x86 platforms.
|
this improves compression performance and can often provide a
slightly better compression ratio.
|