Red Hat Bugzilla – Bug 465243
rsync-friendly deflate in zlib
Last modified: 2009-05-04 05:17:09 EDT
Description of problem:
zlib, as installed on Fedora 8, Fedora 9 or Fedora 10beta, does not provide
rsync-friendly compression (like that given by gzip --rsyncable)
I have applied slightly modified patches from UHU-linux to add the possibility
of rsync-friendly deflate compression to zlib (and minizip).
The proposed patches also cause minizip and miniunzip binaries
to be included to minizip binary RPM package
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Please note that rsync-friendly means also xdelta-friendly.
If zlib with rsync patches is used, code modifications in deflate
to generate rsync-friendly output are used if environment variable
ZLIB_RSYNC is defined to an non-empty string different from "0"
This is needed for example:
- for backup software using zlib, if we need results to be transferred
with rsync+ssh over (relatively) slow link
- if we need to generate ZIP format files and be able to get
smaller deltas (rsync or xdelta)
Please note that while minizip is fairly simplistic compared to zip,
it can generate rsync-friendly ZIP archived, while current version
of zip cannot.
Some limitations of minizip can be worked around using xargs, for example
find find_spec -type f -print0 | env LANG=C sort -z | env ZLIB_RSYNC=1 xargs -r0 minizip -a destfile.zip
Please consider including the zlib patches I will attach for all fedora and enterprise linux versions.
Created attachment 319225 [details]
patches to add 1. rsyncable support for zlib 2. minizip binaries 3. -q for minizip
I have rebuilt successfully, installed and tested resulting
zlib-1.2.3-21.fc8 and minizip-1.2.3-21.fc8
on fully-patched Fedora8 system (Pentium 4 machine).
The base for patches was zlib-1.2.3-18.fc9.src.rpm
(which is identical to Fedora10 beta zlib src.rpm)
Files included in zlib_123_20_patch_spec.tgz :
zlib.spec is spec file
zlib.spec.diff is diff from zlib.spec in zlib-1.2.3-18.fc9.src.rpm
zlib-1.2.3-rsyncable.patch - adds rsyncable support to zlib - details in zlib.h
zlib-1.2.3-quietminizip.patch - adds -q to minizip and miniunzip
rsyncable support is turned on if ZLIB_RSYNC=1
environment variable is set when deflateInit (or eqv.) is called for the first time.
zlib-1.2.3-rsyncable.patch is adapted from
Getting option from environment allows that public interface of zlib
is not changed and that the feature can be used without changing
software using zlib.
I have modified the 03* patch so that it does not rely on gcc specific feature,
in the hope that it would be easier to be accepted to zlib upstream.
Also, please note that the added benefit is that you can call
setenv("ZLIB_RSYNC", "1", 1)
is program before first deflate* call
I think it will be best to add rsync-friendly support to upstream version (as you mention at the end of comment #1), so please have you sent the patches there too?
I have not sent them the patches yet.
I assumed that it would be easier to them accepted upstream once there is some use cases in the wild (and even that might be hard; gzip has --rsyncable in several Linux distribution for years, and it is not in GNU/Savannah gzip alpha or CVS ...)
So I thought first to try to push the patches to Fedora and friends.
Do you suggest me to send the patches to the zlib team right now?
I've taken another look at the old 02-rsync.patch (the one we've found somewhere on the net, created by Rusty Russell IIRC) and 03-rsync-from-env.patch (created by me). I also looked at Wojtek's patch.
Some technical notes:
The gcc-specific hack initialized the library when it was loaded; Wojtek's solution initializes it when it is first used.
The original patch introduced the external symbol zlib_rsync which enables applications to control the behavior of the library. Wojtek's patch introduces zlib_rsync_set as another external symbol. (External means that it pollutes the global namespace, no other library or the main application itself can use this symbol for other purposes, but they can always easily read or write this variable - "nm libz.so" prints an uppercase letter for this symbol.) If an application wants to explicitly control the behavior, now it has to set zlib_rsync to the desired value as well as set zlib_rsync_set to 1 - to make sure that it works no matter if the library was already initialized or not. Also it's not possible to query the value if the library wasn't initialized. This is not a very nice design.
Probably it would be way better to have these variables internal to the library ("static" in .c and not mentioned in .h) and expose a getter and setter, e.g. int get_zlib_rsync() and void set_zlib_rsync(int). The getter would initialize the value based on the env var if it wasn't initialized, and the setter would also set the initialized flag. This way it would be completely hidden from the application when initialization from the env var happens. These functions probably wouldn't be thread-safe, though.
zlib_rsync_init() is currently a non-static function, but not mentioned in the header file. If you'd like to make it public, you should have its signature in the header. If not, however, then the function should be static so that its name is not exported as a symbol. (The gcc-specific init trick was a special case in this regard - it needs to be externally available but you don't want developers to write code that calls this function.)
I'm not familiar with zlib internals at all, not even with its interface. In an ideal world, you would probably have instances ("objects") of a zipper, multiple instances could have different rsync configuration, etc... I also don't know if zlib is thread-safe at all (but I would be surprised if it wasn't).
So if you aim at mainstream inclusion, it would be nice to probably think on a cleaner design that allows more widespread usage of the feature. I think the current solution (both the external int to control the behavior, and the env var to initialize it) is quite hackish, I just don't know how to do this absolutely right.
I have asked zlib team whether they would consider adding the patches (modified as needed) to official zlib.
Regarding previous comment from Egmont, I wanted to avoid any change
in public interface of zlib, so that no program would require
recompiling, and to be able to 'implant' the rsync-friendly
feature 'externally' into a program using zlib, without modifying
such a program.
My Fedora 8 system works with no problem after I have installed modified
zlib rpm, so I hope it is not a complete disaster.
I will think more about points raised here this week.
To address some of the issues raised by Egmont
I would propose the following:
- zlib_rsync_init() changed to static
- zlib_rsync_set changed to static
- zlib_rsync left as external int
So we would have
a) one external name added zlib_rsync
b) ability to set zlib_rsync through ZLIB_RSYNC environment variable
c) ability to set zlib_rsync through putenv before 1st call to deflate
d) ability to directly manipulate zlib_rsync after 1st call to deflate
e) binary compatibility with programs using zlib
d) should perhaps only be used in statically linked applications;
What is most important for me is e) and b)
Is that acceptable?
Storing rsyncable flag in z_stream_s would be cleaner solution, because
1. it would allow setting to be specified for each stream independently
2. it would let us eliminate globals, namespace pollution while being thread-safe
it would require changing both public and binary interface to zlib, which
is not worth the trouble.
Perhaps if upstream agrees to have that in some future version of zlib...
Unfortunately there is no response from zlib team ...
thanks for your work, I will try to connect with upstream guys myself, I hope it will help.
Created attachment 320405 [details]
Slightly modified patch adding 1. rsyncable support to zlib; 2. minizip binaries; 3. -q for minizip
I have added a slightly modified patch adding rsyncable support
Changes are as follows:
- changed two external names to local (static)
now zlib_rsync_set and zlib_rsync_init() are static
- changed spelling of help messages describing -q in minizip and miniunzip
to be consistent with descriptions of other options
I have looked more closely at zlib source code and will try
to create a cleaner patch, with rsyncable flag set per stream
Created attachment 321016 [details]
new patch adding rsyncable support to zlib
This is a new patchset for adding rsyncable support to zlib:
Inside the tar archive:
org/ contains original files for reference, which I did not modify
org/*patch, *patch should go to SOURCES/ (and zlib-1.2.3.tar.gz which I do not include)
zlib.spec should go to SPECS/
zlib.spec.rs23.diff is for reference only.
In this new patch I have addressed issues raised by Egmont
as far as I was able to.
More details in a comment,
The rs23 patchset for zlib v1.2.3 adds rsync-friendly deflate support
for zlib, as well as:
- adds minizip and miniunzip binaries to minizip RPM
- adds -q (quiet) option to minizip and miniunzip
- adds -s option to minizip and minigzip (rsync-friendly deflate)
- adds --rsyncable option to minigzip (same function as -s, for gzip compatibility)
Thus rsync-friendly deflate output from minigzip and minizip can be requested either
via -s option or via exported environment variable ZLIB_RSYNC=1
- adds -- option to minizip (end of options indicator)
- adds --help option to minigzip (-h has already been used)
- allows for longer pathnames on unix and win32 in minizip and minigzip
- fixes some buffer overflows in minigzip
The current version should be almost thread-safe;
The default state of requested rsync-friendly deflate output is kept
in a static variable (shared between threads), but is it being
used only for deflate stream initialization;
So the only caveat I am aware of is that if one calls
deflateSetRsyncDflt() in one thread, deflate stream initialiazed
later in all threads use the new value.
Please look at zlib.h file for details.
The basic idea is that we have
zlib_rsync_dflt static integer, initially set to -1;
When a deflate stream it being initialized, deflateGetRsyncDflt() gets called
from lm_init; If zlib_rsync_dflt is not set at that moment to a value >= 0,
ZLIB_RSYNC environment is looked at, and if it exists and is non-empty string
different from "0", then zlib_rsync_dflt is set to 1, otherwise to 0.
The value returned by deflateGetRsyncDflt() is stored in strm->state->zlib_rsync;
That value is used during deflate operation.
It is also possible to directly set and get state->zlib_rsync of a stream using
deflateGetStrRsync() and deflateSetStrRsync().
The requirement here (not enforced by the code) is that deflateGetStrRsync()
is called directly after deflateInit() or deflateInit2() or deflateReset()
(actually deflateReset() is called at the end of deflateInit() or deflateInit2())
I believe the design is reasonably thread safe and makes it possible
to use the rsyncable feature of zlib even with code that is not modified in any way
(just linked with new zlib).
zlib binary interace has not been changed by these patches, so programs using
zlib shared library continue to work.
I am aware there are some software using zlib (e.g. raqbackup) people are using, which
would be much more useful when rsync-friendly feature of zlib is implemented.
I would kindly request that these patches be considered for inclusion in Fedora.
Please send comment, criticism, etc.
- rsync-friendly deflate is added to C code only; I did not touch (or even look at) assembler variants;
- it might deserve some explanation why I have patches minigzip, when gzip on Fedora already has --rsyncable;
This is because there are slight differences between gzip output and zlib output;
If one needs to create compressed data with, say perl zlib modules and gzip (from similar source data)
and wants small rsync delta, minigzip -s is needed for that.
- minigzip is not packaged, but is easy to build:
gcc -o minigzip /usr/share/doc/zlib-devel-1.2.3/minigzip.c -lz
Any comments / news ?
I am running Fedora 8 on i386 (P4 machine with HT) with the zlib verseion posted here for 10 days, and so far no problems I am aware of.
Mark Adler from zlib upstream reply to my question regarding this bug.
He will look to the patch you sent here but he planed to invent his own rsyncable breakpoint generation algorithm, not to use some based on gzip - and if you are willing to help him with testing of this feature then it would be helpful.
If you like to participate to the zlib testing Mark Adler just announce pigz version on zlib mailing list and there are some comments.
At the monent rsyncable support for pigz (Parallel Implementation of GZip) is not implemented (planned as future development).
pigz features will possibly be moved into zlib at some point if future.
There seems to be no progress on upstream zlib or pigz.
Could you possibly reconsider adding the patches to fedora
zlib package (just as rsyncable patches have been added to fedora gzip
please what is the reason why upstream does not use this patches?
Upstream seemed to plan a different implementation.
I am not aware of any progress, though.
I personally use the patches on Fedora10 to create
rsync-friendly zip archives (minizip) and have found
no problems with them.
I don't like to put it to fedora if upstream will have another solution.
I ask upstream maintainer again about the situation - for now I will close this bug.
No progress on upstream side :(.