Bug 1582555 - regression in zlib-1.2.11-8: ARM optimizations broke git log on aarch64
Summary: regression in zlib-1.2.11-8: ARM optimizations broke git log on aarch64
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: git
Version: 28
Hardware: aarch64
OS: Linux
urgent
urgent
Target Milestone: ---
Assignee: Petr Stodulka
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-05-25 14:49 UTC by Pavel Cahyna
Modified: 2018-08-23 09:19 UTC (History)
17 users (show)

Fixed In Version: git-2.17.1-3.fc28
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-22 15:04:44 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1602275 0 unspecified CLOSED readChar on gzfile returns garbage on aarch64 2021-02-22 00:41:40 UTC

Internal Links: 1602275

Description Pavel Cahyna 2018-05-25 14:49:27 UTC
Description of problem:

git build started failing on tests with zlib-1.2.11-8 on f28: https://koji.fedoraproject.org/koji/taskinfo?taskID=27187823

It can be seen from build.log that the tests fail b/c "git log" prints garbage after the correct log entries. It turns out that the same git binary package works with zlib-1.2.11-7.

Version-Release number of selected component (if applicable):
zlib-1.2.11-8.fc28.aarch64

How reproducible:
by building git, which executes tests in the %check phase, or using the simplified steps below, where
commit() {         echo "$1" >tracked &&         git add tracked &&         git commit -m "$1"; }

First step of the reproducer can be done with the old version of zlib to show that the problematic code is executed only in the last step.

Steps to Reproduce:
1. mkdir t5538-push-shallow; cd t5538-push-shallow; git init .
2. commit 1
3. git repack
4. dnf upgrade zlib-1.2.11-8
5. git log --format="%s"

Actual results:

1 U

Expected results:

1

Additional info:

The U in the log output is garbage and the git test output show more examples of (mostly binary) garbage. Moreover, I built git without optimizations (-O0); I had suspected a compiler bug. It has the same problem, but the binary garbage printed by "git log" is sometimes different from an optimized build. This strongly suggests there is a bad memory access somewhere.

Here is a ltrace output showing what does git call in zlib.

$ ltrace -e @libz.so.1 git log --format="%s" 
libz.so.1->inflateInit2_(0xffffffc7b7a8, 15, 0xaaaab34461a8, 112 <unfinished ...>
libz.so.1->malloc(7160)                          = 0xaaaadf0d6740
libz.so.1->inflateReset2(0xffffffc7b7a8, 15, 7169, 0xaaaadf0d8330 <unfinished ...>
libz.so.1->inflateReset(0xffffffc7b7a8, 0, 0xffffbe580108, 0xffffc0cc <unfinished ...>
libz.so.1->inflateResetKeep(0xffffffc7b7a8, 0, 0xaaaadf0d6740, 0xffffc0cc) = 0
<... inflateReset resumed> )                     = 0
<... inflateReset2 resumed> )                    = 0
<... inflateInit2_ resumed> )                    = 0
libz.so.1->adler32(0, 0, 0, 0 <unfinished ...>
libz.so.1->adler32_z(0, 0, 0, 0)                 = 1
<... adler32 resumed> )                          = 1
libz.so.1->adler32(1, 0xaaaadf0d40a0, 218, 0 <unfinished ...>
libz.so.1->adler32_z(1, 0xaaaadf0d40a0, 218, 0)  = 0x709d4367
<... adler32 resumed> )                          = 0x709d4367
libz.so.1->memset(0xaaaadf0d417a, 'U', 1)        = 0xaaaadf0d417a
libz.so.1->free(0xaaaadf0d6740)                  = <void>
1 U
--- SIGCHLD (Child exited) ---
libz.so.1->__cxa_finalize(0xffffbe59fac0, 0, 0xffffbe572510, 0) = 0xffffbe4f4fa8
+++ exited (status 0) +++

Comment 1 Todd Zullinger 2018-05-25 15:21:34 UTC
Thanks for tracking this down Pavel.  I hit this last night while trying to build a minor git bugfix release, as it breaks the test suite badly.  I would imagine this causes breakage in many other packages too, many which may not have test suites that catch the issue before it reaches users.

Comment 2 Peter Robinson 2018-05-25 16:14:48 UTC
Jeremy can you take a look at this?

Comment 3 Jeremy Linton 2018-05-25 16:29:03 UTC
Yes, I will get the patch authors to take a look.

Thanks,

Comment 4 Pavel Raiskup 2018-05-25 16:31:55 UTC
Jeremy, can we please also have the links to the upstream pull requests?
Or all the links pointing to (attempts for) upstream discussions?

Comment 5 Peter Robinson 2018-05-25 17:50:00 UTC
(In reply to Pavel Raiskup from comment #4)
> Jeremy, can we please also have the links to the upstream pull requests?
> Or all the links pointing to (attempts for) upstream discussions?

They're all AFAIK at the upstream maintainers githhub (along with the 100 odd others):
https://github.com/madler/zlib/pulls

Comment 6 adenilson 2018-05-25 18:14:25 UTC
Joining the discussion, I will start looking into it ASAP.

Comment 7 Jeremy Linton 2018-05-25 20:01:16 UTC
I thought this sounded like a zlib bug, but I think I've tracked it down to what appears to be a bug in git, which i've fixed and now all the unit tests have passed. I'm going to get someone else to take a quick peek and doublecheck the patch.

Comment 8 Todd Zullinger 2018-05-25 20:57:15 UTC
Interesting.  Thanks for poking at this Jeremy and adenilson!

Comment 9 Jeremy Linton 2018-05-25 22:13:56 UTC
So, I think I fully understand what is happening now. "Bug" is a bit strong, more a misinterpretation of whats allowed.

So git, in unpack_compressed_entry() is allocating a buffer of 'size' using xmallocz_gently() which actually allocates a slightly larger buffer and null terminates it. The code in unpack_compressed_entry() is utilizing this feature to go ahead and pass size+1 to zlib as the size of the buffer. To zlib it considers that it owns a buffer of size+1 and does a correct decompression (sets consumed/remaining sizes correctlt) but poisons the remainder of the buffer. 

The zlib code then gets the return it expects and continues on its merry way not considering that the null may have been poisoned. Initially a quck hack to set the buffer passed to zlib as 'size' cleans up all the unit test failures, despite the fact that zlib is triggering an error case (which is ignored due to a successful return) due to the extra byte being consumed (AKA git's overconsumption check fails if the overconsumption is exactly 1 byte).

Comment 10 Jeremy Linton 2018-05-25 23:22:14 UTC
Posted a patch here:

https://public-inbox.org/git/20180525231713.23047-1-lintonrjeremy@gmail.com/T/#u

I'm changing the component.

Comment 11 Jeremy Linton 2018-05-25 23:27:10 UTC
Comment #9 above should read "The git code.."

Comment 12 Todd Zullinger 2018-05-25 23:41:09 UTC
Thanks for the quick diagnosis and patch Jeremy!

I have a scratch build running now:

    https://koji.fedoraproject.org/koji/taskinfo?taskID=27194025

I'll check back on that shortly and follow up on your patch on the git list.

Comment 13 Todd Zullinger 2018-05-27 20:38:47 UTC
Pavel, can you give test the scratch build on aarch64 and confirm it fixes things?  The test suite passes so I'm confident it will, but it's always good to have extra confirmation.  Plus this has been broken on f28 aarch64 for about a month.  Another day or two won't hurt at this point.

I pushed this to the master branch of my fork: https://src.fedoraproject.org/fork/tmz/rpms/git

Comment 14 Pavel Raiskup 2018-05-28 07:56:40 UTC
Please, consider waiting for upstream git fix till we break the
upstream-first rule again.

Comment 15 Todd Zullinger 2018-05-28 15:04:58 UTC
Yeah, I don't intend to push anything until upstream has a consensus.  I'm following that discussion closely.  I built those scratch packages just for testing to help confirm that it resolved the issue.  I don't have access to any aarch64 hardware, so the only testing I'm able to do is running the test suite in koji.  That's why I asked about testing the case where you noticed the issue. :)

Comment 16 Todd Zullinger 2018-05-29 18:45:00 UTC
In unfortunate timing, a git released an update today which fixes 2 security issues (CVE-2018-11233 & CVE-2018-11235)¹.  This forces our hand a bit with regard to fixing the aarch64 builds.  We could disable the tests on aarch64 to allow a successful git build (which we know to be broken).  That's certainly not ideal.

Instead, I've applied Jeremy's packfile patch to aarch64 only.  Upstream has some questions about the patch, which can hopefully be resolved soon and the patch (or some version of it) can be accepted.

(I wish I had seen the broken git builds in koschei when they showed up weeks ago.  I don't know if those failed builds didn't generate any notification or if I've disabled the notifications because they too frequently were due to transient issues.)

¹ I've alerted the Red Hat security team so they can create tracking bugs for the issues

Comment 17 Fedora Update System 2018-05-29 20:19:08 UTC
git-2.17.1-2.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-75f7624a9f

Comment 18 Pavel Raiskup 2018-05-29 22:41:06 UTC
Still there's possibility that our zlib is broken; could we fix the
optimization patches so the input/output API isn't changed at all?

Comment 19 Todd Zullinger 2018-05-29 23:34:53 UTC
I hope that the conversation on the git list continues and a consensus is reached on the proper changes and in which components, of course.

Since there is a relatively high risk vulnerability requiring a git build (#1583862), I didn't feel that we had much choice but to apply the git patch on aarch64.  Seeing that git was fairly well broken on aarch64 already, it's unlikely that it's going to be worse with the patch. :)

So far as I can tell, upstream is (rightly) concerned that the patch undoes a previous change to the deflate handling in packfile.c and wants to be sure that the fix isn't going to cause any regressions.  The tests suite still passes, including the test which was added at that time (in 39eea7bdd9 ("Fix incorrect error check while reading deflated pack data", 2009-10-21)).  That makes me think that even if the patch isn't the ideal fix, it's not going to cause problems.

I only applied the current git patches on aarch64 and I don't have the update set to close this bug when the update goes stable.

Comment 20 Jeremy Linton 2018-05-30 15:47:52 UTC
I'm a bit split about the "correct" fix. The zlib changes are as I understand it intentionally poisioning the buffer, but I don't really think the git code is correct either (despite the attempt at defensive coding, it needs to protect that null and its not doing it). I would have a tendency to fix both of them...

Comment 21 Fedora Update System 2018-05-30 16:55:53 UTC
git-2.17.1-2.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-75f7624a9f

Comment 22 Fedora Update System 2018-06-01 12:04:03 UTC
git-2.17.1-2.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.

Comment 23 Todd Zullinger 2018-06-11 00:48:10 UTC
It may well be that changes are needed in both git and zlib.  I don't think anyone upstream at git is opposed to that idea, they just want to ensure that the changes don't lose a fix for a previous issue.

Without the zlib patches upstream, the only people affected by this are us, and only on aarch64.  That makes it harder to get much traction or interest and leaves us in the less-than-ideal position of carrying non-upstream patches in both git and zlib.

It would be great if the folks who best understand these changes could work with git and zlib upstream to get any necessary patches accepted.

Comment 24 Peter Robinson 2018-06-11 01:20:29 UTC
> It would be great if the folks who best understand these changes could work
> with git and zlib upstream to get any necessary patches accepted.

They are, jlinton is the one that debugged and sent the git patch, the upstream zlib maintainer is non responsive, these same patches are in chromeOS/chromium and other places too.

Comment 25 Todd Zullinger 2018-06-11 01:25:20 UTC
Alright.  I asked because upstream git had some questions and I haven't seen any responses there.

Comment 26 Pavel Cahyna 2018-06-11 16:06:48 UTC
yes, last questions in the thread were in the message from Junio C Hamano on the 28th and there was no reply.

Comment 27 Dave Rodgman 2018-06-11 20:15:06 UTC
Hi Pavel, do you have a link to the upstream thread you mention? I'll check that we're following this up with upstream.

Comment 28 Todd Zullinger 2018-06-11 23:48:36 UTC
I Dave, the thread is at https://public-inbox.org/git/20180525231713.23047-1-lintonrjeremy@gmail.com/T/#u (from comment 10). :)

Thanks for looking into this further!

Comment 29 Jeremy Linton (ARM) 2018-06-12 22:00:23 UTC
Yes, I'm going to probably just post the null terminator version RSN.. AFAIK, I posted a response last week to that affect although its not showing up in that public-inbox thread for whatever reason.

Comment 30 Todd Zullinger 2018-06-12 22:07:35 UTC
Thanks Jeremy.  FWIW, the message never arrived in my mailbox either, so I imagine it didn't reach the git list (or was dropped by the list server itself).

Comment 31 Fedora Update System 2018-06-14 22:14:15 UTC
git-2.17.1-3.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-972fa3841b

Comment 32 Todd Zullinger 2018-06-14 22:21:53 UTC
I replaced the initial patch with what's now queued upstream (https://github.com/git/git/commit/b611396e97), and applied it to all architectures.

I set the update to close this ticket when it reaches stable.   If there's any reason to keep it open, let me know and I'll adjust the update settings.

Thanks to Jeremy and everyone else that helped get this fixed.

Comment 33 Fedora Update System 2018-06-15 16:35:01 UTC
git-2.17.1-3.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-972fa3841b

Comment 34 Fedora Update System 2018-06-22 15:04:44 UTC
git-2.17.1-3.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.