Bug 2267598

Summary:	Invalid writes regression in liblzma.so
Product:	[Fedora] Fedora	Reporter:	Nathan Scott <nathans>
Component:	xz	Assignee:	Matej Mužila <mmuzila>
Status:	CLOSED ERRATA	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	rawhide	CC:	awilliam, eblake, ehila, jnovy, mjw, mmuzila, pkubat, praiskup, qguo, rjones, sam, thomas.barbier
Target Milestone:	---	Keywords:	Regression
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	xz-5.6.1-1.fc41	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2024-03-09 14:26:18 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Nathan Scott 2024-03-04 03:17:21 UTC

We have observed ~130 (of ~1000) new test failures over the weekend in the PCP project CI on rawhide.  I noticed that a new build/version of liblzma recently arrived, so let's start here.  This is the output from valgrind (apologies, no symbols and no binary listed, but it's across many different binaries and libraries within PCP, so hopefully it's something you can easily reproduce by trying valgrind on xz's own tests).  If it's not reproducible, let me know & I'll extract a test case.

Here's one sample case from the PCP testsuite ... (output is from valgrind).

output mismatch (see 1101.out.bad)
205a206,240
> Invalid write of size 8
> at HEX: ??? (in /usr/lib64/liblzma.so.5.6.0)
> by HEX: ??? (in /usr/lib64/liblzma.so.5.6.0)
> by HEX: ???
> by HEX: ???
> by HEX: ???
> by HEX: ???
> by HEX: elf_machine_rela (dl-machine.h:314)
> by HEX: elf_dynamic_do_Rela (do-rel.h:147)
> by HEX: _dl_relocate_object (dl-reloc.c:301)
> by HEX: ???
> by HEX: ???
> by HEX: ???
> by HEX: ???
> by HEX: ??? (in /usr/lib64/libgcc_s-14-20240228.so.1)
> Address HEX is on thread 1's stack
> 136 bytes below stack pointer
> {
>    <insert_a_suppression_name_here>
>    Memcheck:Addr8
>    obj:/usr/lib64/liblzma.so.5.6.0
>    obj:/usr/lib64/liblzma.so.5.6.0
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    fun:elf_machine_rela
>    fun:elf_dynamic_do_Rela
>    fun:_dl_relocate_object
>    obj:*
>    obj:*
>    obj:*
>    obj:*
>    obj:/usr/lib64/libgcc_s-14-20240228.so.1
> }
209c244
< ERROR SUMMARY: 0 errors from 0 contexts ...
---
> ERROR SUMMARY: 112 errors from 1 contexts ...
Check local PMCD is still alive ...

Reproducible: Always

Steps to Reproduce:
1. Run PCP regression tests, which involve running programs under valgrind
Actual Results:  
'Invalid write' failure messages from valgrind as shown in the report.

Expected Results:  
No 'Invalid write' failure messages from valgrind.

Comment 1 Richard W.M. Jones 2024-03-04 10:11:28 UTC

I ran valgrind over some of the tests in the xz test suite and was not able
to reproduce any error.  There are also no commits upstream since 5.6.0 which
would indicate any fix.  So I think we'll need a reproducer.

Comment 2 Richard W.M. Jones 2024-03-04 13:33:32 UTC

> 1. Run PCP regression tests, which involve running programs under valgrind

Where are these tests, and how do I run them?  The spec file itself doesn't mention valgrind.

Comment 3 Richard W.M. Jones 2024-03-04 16:14:27 UTC

I seem to have reproduced this in another project.  My stack trace has some more
symbols:

==746855== Invalid write of size 8
==746855==    at 0x52E8645: ??? (in /usr/lib64/liblzma.so.5.6.0)
==746855==    by 0x52CA83B: _get_cpuid (in /usr/lib64/liblzma.so.5.6.0)
==746855==    by 0x6: ???
==746855==    by 0x1FFEFFF4AF: ???
==746855==    by 0x77AD31E59B84CFFF: ???
==746855==    by 0x1FFEFFF4AF: ???
==746855==    by 0x400F253: elf_machine_rela (dl-machine.h:314)
==746855==    by 0x400F253: elf_dynamic_do_Rela (do-rel.h:147)
==746855==    by 0x400F253: _dl_relocate_object (dl-reloc.c:301)
==746855==    by 0x52015AF: ???
==746855==    by 0x5200B0F: ???
==746855==    by 0x1FFEFFF43F: ???
==746855==    by 0x1FFEFFF42F: ???
==746855==    by 0x53E6D17: ??? (in /usr/lib64/libffi.so.8.1.2)
==746855==  Address 0x1ffeffe538 is on thread 1's stack
==746855==  136 bytes below stack pointer

Comment 4 Richard W.M. Jones 2024-03-04 16:17:26 UTC

Even a trivial program that links with lzma reproduces this:

$ echo 'int main(){return 0;}' > test.c
$ gcc test.c -llzma -o test
$ valgrind ./test
==749691== Memcheck, a memory error detector
==749691== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==749691== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==749691== Command: ./test
==749691== 
==749691== Invalid write of size 8
==749691==    at 0x4897645: ??? (in /usr/lib64/liblzma.so.5.6.0)
==749691==    by 0x487983B: _get_cpuid (in /usr/lib64/liblzma.so.5.6.0)
==749691==    by 0x6: ???
==749691==    by 0x1FFEFFF8DF: ???
==749691==    by 0xDD2A8041A0E922FF: ???
==749691==    by 0x1FFEFFF8DF: ???
==749691==    by 0x400F253: elf_machine_rela (dl-machine.h:314)
==749691==    by 0x400F253: elf_dynamic_do_Rela (do-rel.h:147)
==749691==    by 0x400F253: _dl_relocate_object (dl-reloc.c:301)
==749691==    by 0x483BA7F: ???
==749691==    by 0x483B56F: ???
==749691==    by 0x1FFEFFF86F: ???
==749691==    by 0x1FFEFFF85F: ???
==749691==    by 0x48CDD9F: ??? (in /usr/lib64/libc.so.6)
==749691==  Address 0x1ffeffe968 is on thread 1's stack
==749691==  136 bytes below stack pointer
==749691== 
==749691== 
==749691== HEAP SUMMARY:
==749691==     in use at exit: 0 bytes in 0 blocks
==749691==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==749691== 
==749691== All heap blocks were freed -- no leaks are possible
==749691== 
==749691== For lists of detected and suppressed errors, rerun with: -s
==749691== ERROR SUMMARY: 112 errors from 1 contexts (suppressed: 0 from 0)

(xz-libs-5.6.0-2.fc41.x86_64)
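
For anyone scripting this check, a minimal sketch follows. It assumes the ERROR SUMMARY line format shown above; the helper name valgrind_error_count is hypothetical and not part of valgrind or xz.

```shell
# Hypothetical helper: read a valgrind log on stdin and print the error
# count from its last "ERROR SUMMARY" line (prints nothing if none is found).
valgrind_error_count() {
    sed -n 's/^==[0-9]*== ERROR SUMMARY: \([0-9][0-9]*\) error.*/\1/p' | tail -n 1
}

# Usage, assuming ./test is the trivial program built above
# (valgrind writes its report to stderr by default):
#   valgrind ./test 2>&1 | valgrind_error_count
```

Any non-zero count from the trivial program would indicate the regression is present in the installed liblzma.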

Comment 5 Richard W.M. Jones 2024-03-04 17:52:33 UTC

I changed xz to use ./configure --disable-ifunc which
works around the problem:

https://src.fedoraproject.org/rpms/xz/c/6db19f2231927b4d93e9c021d32cb7433708e26f?branch=rawhide

Build for F40:
https://koji.fedoraproject.org/koji/taskinfo?taskID=114461506

Build for Rawhide:
https://koji.fedoraproject.org/koji/taskinfo?taskID=114461456

As this is only a workaround, let's keep this bug open.
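
For context, as I understand it --disable-ifunc stops liblzma from using GNU indirect functions (ifuncs). Their resolvers run while the dynamic linker relocates the library, which would fit the _dl_relocate_object frames in the traces above. A rough sketch for checking whether a given build still exports ifunc symbols, assuming binutils readelf is installed; the helper name count_ifunc_symbols is hypothetical.

```shell
# Hypothetical helper: read `readelf --dyn-syms -W` output on stdin and print
# how many symbols have type IFUNC (the fourth column of the symbol table).
count_ifunc_symbols() {
    awk '$4 == "IFUNC" { n++ } END { print n + 0 }'
}

# Usage (library path as reported by valgrind above):
#   readelf --dyn-syms -W /usr/lib64/liblzma.so.5.6.0 | count_ifunc_symbols
```

A result of 0 would suggest the rebuilt library no longer exports ifuncs, though this is only a quick check and not a substitute for rerunning valgrind.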

Comment 6 Adam Williamson 2024-03-04 20:49:14 UTC

Is the broken version in F40 stable currently? If so, should we propose this as an FE to ensure we don't ship the broken one in Beta?

Comment 7 Richard W.M. Jones 2024-03-04 21:09:25 UTC

These are the fixed packages:
F40: https://bodhi.fedoraproject.org/updates/FEDORA-2024-f5033032b8
F41: https://bodhi.fedoraproject.org/updates/FEDORA-2024-f0381d82b3

This is the broken update:
https://bodhi.fedoraproject.org/updates/FEDORA-2024-4417db3376

(Do we actually need to rebuild perl-Compress-Raw-Lzma on every *release*?  Spec
file seems to say only on version changes.)

Not sure what FE is, but it looks like the broken update is not in Fedora 40
right now, so we just need to make sure it doesn't get out.

Comment 8 Adam Williamson 2024-03-04 22:36:14 UTC

Yes. We do. You can tell, because the tests failed. :D Those are the tests that always fail when the package can't be updated because the perl-Compress-Raw-Lzma dependency is broken.

It would have made more sense to just edit the new xz build into the existing F40 update, but now that ship has sailed :( We will need to bump and build perl-Compress-Raw-Lzma again and edit it into the new update.

Comment 9 Adam Williamson 2024-03-04 22:37:56 UTC

Actually, let me qualify that - no, we don't need to rebuild it every *release*, only on version changes - but because https://bodhi.fedoraproject.org/updates/FEDORA-2024-4417db3376 never made it to stable, stable still has the old perl-Compress-Raw-Lzma that requires version 5.4.6 of xz-libs.

Comment 10 Nathan Scott 2024-03-04 22:46:00 UTC

Hi Rich,

Sorry for the lack of detail - the failure rate is so high for us here that I wondered if every program using liblzma would exhibit the same problem - that's certainly the case across all our tools.

This is last night's run, which shows the extent of the issue.  I expect the list of ~130 rawhide failures is every test we have that uses valgrind.
https://performancecopilot.github.io/qa-reports/reports/20240304_212241-a8847c35/

You can install the pcp-testsuite package to get at the PCP tests locally.  They're shell scripts so pretty easy to follow (obviously, many of them invoke other tools like valgrind).

The individual tests can also be perused here:
https://github.com/performancecopilot/pcp/tree/main/qa

cheers.

Comment 11 Adam Williamson 2024-03-04 22:47:00 UTC

I'll do the perl-Compress-Raw-Lzma builds and clean up the update.

Comment 12 Richard W.M. Jones 2024-03-05 08:49:43 UTC

(In reply to Adam Williamson from comment #9)
> Actually, let me qualify that - no, we don't need to rebuild it every
> *release*, only on version changes - but because
> https://bodhi.fedoraproject.org/updates/FEDORA-2024-4417db3376 never made it
> to stable, stable still has the old perl-Compress-Raw-Lzma that requires
> version 5.4.6 of xz-libs.

Right, yes, the above is the reason.

perl-Compress-Raw-Lzma depends only on the liblzma version:

Requires:       xz-libs%{?_isa} = %((pkg-config --modversion liblzma 2>/dev/null || echo 0) | tr -dc '[0-9.]')

where 'pkg-config --modversion liblzma' expands to '5.6.0'.  So the comment
is correct.
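
To illustrate how that Requires line behaves, here is a small sketch of the same pipeline. The function name version_or_zero is hypothetical, and the version-printing command is passed as arguments so the pipeline can be tried without pkg-config installed.

```shell
# Hypothetical wrapper around the spec file's pipeline: run the given command,
# fall back to "0" if it fails, and delete every character that is not a
# digit, a dot, or a literal bracket (this also drops the trailing newline).
version_or_zero() {
    ( "$@" 2>/dev/null || echo 0 ) | tr -dc '[0-9.]'
}

# Usage:
#   version_or_zero pkg-config --modversion liblzma
#   version_or_zero echo 5.6.0    # stand-in when pkg-config is unavailable
```

Because the result is used with "=", the built package requires exactly the xz-libs version it was built against, so a new xz version means a rebuild.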

Comment 13 Nathan Scott 2024-03-05 22:32:08 UTC

Confirming we're seeing goodness across the PCP tests running on rawhide once more.  Thanks Rich!

Comment 14 Richard W.M. Jones 2024-03-09 12:32:10 UTC

This is fixed by
https://github.com/tukaani-project/xz/commit/82ecc538193b380a21622aea02b0ba078e7ade92
included in xz 5.6.1.

Comment 15 Fedora Update System 2024-03-09 13:03:19 UTC

FEDORA-2024-7e9c14633a (perl-Compress-Raw-Lzma-2.209-5.fc41 and xz-5.6.1-1.fc41) has been submitted as an update to Fedora 41.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-7e9c14633a

Comment 16 Fedora Update System 2024-03-09 14:26:18 UTC

FEDORA-2024-7e9c14633a (perl-Compress-Raw-Lzma-2.209-5.fc41 and xz-5.6.1-1.fc41) has been pushed to the Fedora 41 stable repository.
If the problem still persists, please make a note of it in this bug report.

Comment 17 Eric Blake 2024-03-29 18:26:37 UTC

Yikes - the author of the upstream patch used this bug to justify making his xz backdoor (CVE-2024-3094) even bigger.  :(
https://github.com/tukaani-project/xz/commit/82ecc538193b380a21622aea02b0ba078e7ade92
https://www.openwall.com/lists/oss-security/2024/03/29/4

Comment 18 Mark Wielaard 2024-03-30 11:15:09 UTC

(In reply to Richard W.M. Jones from comment #14)
> This is fixed by
> https://github.com/tukaani-project/xz/commit/
> 82ecc538193b380a21622aea02b0ba078e7ade92
> included in xz 5.6.1.

Unfortunately (or luckily?) github has disabled the project.
Is this commit available somewhere? I wonder how that "fix" worked around the valgrind memcheck errors.

Comment 19 Ehila 2024-03-30 15:30:57 UTC

(In reply to Mark Wielaard from comment #18)
> (In reply to Richard W.M. Jones from comment #14)
> > This is fixed by
> > https://github.com/tukaani-project/xz/commit/
> > 82ecc538193b380a21622aea02b0ba078e7ade92
> > included in xz 5.6.1.
> 
> Unfortunately (or luckily?) github has disabled the project.
> Is this commit available somewhere? I wonder how that "fix" worked around
> the valgrind memcheck errors.

I was curious as well, the project website is still hosting the source. The commit seems to be here https://git.tukaani.org/?p=xz.git;a=commit;h=82ecc538193b380a21622aea02b0ba078e7ade92

Comment 20 Mark Wielaard 2024-03-30 17:10:29 UTC

(In reply to Ehila from comment #19)
> (In reply to Mark Wielaard from comment #18)
> > (In reply to Richard W.M. Jones from comment #14)
> > > This is fixed by
> > > https://github.com/tukaani-project/xz/commit/
> > > 82ecc538193b380a21622aea02b0ba078e7ade92
> > > included in xz 5.6.1.
> > 
> > Unfortunately (or luckily?) github has disabled the project.
> > Is this commit available somewhere? I wonder how that "fix" worked around
> > the valgrind memcheck errors.
> 
> I was curious as well, the project website is still hosting the source. The
> commit seems to be here
> https://git.tukaani.org/?p=xz.git;a=commit;
> h=82ecc538193b380a21622aea02b0ba078e7ade92

Wow, thanks. That commit does sound somewhat plausible, and I doubt I would have recognized all this as suspicious. It should have looked suspicious, though, because there is no real reason this would only show up under valgrind (valgrind does, however, have an issue where interception of an ifunc can misfire, so it isn't completely unreasonable to suspect a valgrind bug here). But the "real" fix for this "valgrind issue" seems to come an hour later, when some test files are updated; those are then included in the next xz release.