We have observed ~130 (of ~1000) new test failures over the weekend in the PCP project CI on rawhide. I notice a new build/version of liblzma recently arrived, so let's start here.

This is the output from valgrind. Apologies, no symbols and no binary listed, but it's across many different binaries and libraries within PCP, so hopefully it's something you can easily reproduce by trying valgrind on xz's own tests. If it's not reproducible let me know & I'll extract a test case.

Here's one sample case from the PCP testsuite ... (output is from valgrind).

output mismatch (see 1101.out.bad)
205a206,240
> Invalid write of size 8
> at HEX: ??? (in /usr/lib64/liblzma.so.5.6.0)
> by HEX: ??? (in /usr/lib64/liblzma.so.5.6.0)
> by HEX: ???
> by HEX: ???
> by HEX: ???
> by HEX: ???
> by HEX: elf_machine_rela (dl-machine.h:314)
> by HEX: elf_dynamic_do_Rela (do-rel.h:147)
> by HEX: _dl_relocate_object (dl-reloc.c:301)
> by HEX: ???
> by HEX: ???
> by HEX: ???
> by HEX: ???
> by HEX: ??? (in /usr/lib64/libgcc_s-14-20240228.so.1)
> Address HEX is on thread 1's stack
> 136 bytes below stack pointer
> {
> <insert_a_suppression_name_here>
> Memcheck:Addr8
> obj:/usr/lib64/liblzma.so.5.6.0
> obj:/usr/lib64/liblzma.so.5.6.0
> obj:*
> obj:*
> obj:*
> obj:*
> fun:elf_machine_rela
> fun:elf_dynamic_do_Rela
> fun:_dl_relocate_object
> obj:*
> obj:*
> obj:*
> obj:*
> obj:/usr/lib64/libgcc_s-14-20240228.so.1
> }
209c244
< ERROR SUMMARY: 0 errors from 0 contexts ...
---
> ERROR SUMMARY: 112 errors from 1 contexts ...
Check local PMCD is still alive
...

Reproducible: Always

Steps to Reproduce:
1. Run PCP regression tests, which involve running programs under valgrind

Actual Results:
'Invalid write' failure messages from valgrind as shown in the report.

Expected Results:
No 'Invalid write' failure messages from valgrind.
I ran valgrind over some of the tests in the xz test suite and was not able to reproduce any error. There are also no commits upstream since 5.6.0 which would indicate any fix. So I think we'll need a reproducer.
> 1. Run PCP regression tests, which involve running programs under valgrind

Where are these tests, and how do I run them? The spec file itself doesn't mention valgrind.
I seem to have reproduced this in another project. My stack trace has some more symbols:

==746855== Invalid write of size 8
==746855== at 0x52E8645: ??? (in /usr/lib64/liblzma.so.5.6.0)
==746855== by 0x52CA83B: _get_cpuid (in /usr/lib64/liblzma.so.5.6.0)
==746855== by 0x6: ???
==746855== by 0x1FFEFFF4AF: ???
==746855== by 0x77AD31E59B84CFFF: ???
==746855== by 0x1FFEFFF4AF: ???
==746855== by 0x400F253: elf_machine_rela (dl-machine.h:314)
==746855== by 0x400F253: elf_dynamic_do_Rela (do-rel.h:147)
==746855== by 0x400F253: _dl_relocate_object (dl-reloc.c:301)
==746855== by 0x52015AF: ???
==746855== by 0x5200B0F: ???
==746855== by 0x1FFEFFF43F: ???
==746855== by 0x1FFEFFF42F: ???
==746855== by 0x53E6D17: ??? (in /usr/lib64/libffi.so.8.1.2)
==746855== Address 0x1ffeffe538 is on thread 1's stack
==746855== 136 bytes below stack pointer
Even a trivial program that links with lzma reproduces this:

$ echo 'int main(){return 0;}' > test.c
$ gcc test.c -llzma -o test
$ valgrind ./test
==749691== Memcheck, a memory error detector
==749691== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==749691== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==749691== Command: ./test
==749691==
==749691== Invalid write of size 8
==749691== at 0x4897645: ??? (in /usr/lib64/liblzma.so.5.6.0)
==749691== by 0x487983B: _get_cpuid (in /usr/lib64/liblzma.so.5.6.0)
==749691== by 0x6: ???
==749691== by 0x1FFEFFF8DF: ???
==749691== by 0xDD2A8041A0E922FF: ???
==749691== by 0x1FFEFFF8DF: ???
==749691== by 0x400F253: elf_machine_rela (dl-machine.h:314)
==749691== by 0x400F253: elf_dynamic_do_Rela (do-rel.h:147)
==749691== by 0x400F253: _dl_relocate_object (dl-reloc.c:301)
==749691== by 0x483BA7F: ???
==749691== by 0x483B56F: ???
==749691== by 0x1FFEFFF86F: ???
==749691== by 0x1FFEFFF85F: ???
==749691== by 0x48CDD9F: ??? (in /usr/lib64/libc.so.6)
==749691== Address 0x1ffeffe968 is on thread 1's stack
==749691== 136 bytes below stack pointer
==749691==
==749691==
==749691== HEAP SUMMARY:
==749691== in use at exit: 0 bytes in 0 blocks
==749691== total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==749691==
==749691== All heap blocks were freed -- no leaks are possible
==749691==
==749691== For lists of detected and suppressed errors, rerun with: -s
==749691== ERROR SUMMARY: 112 errors from 1 contexts (suppressed: 0 from 0)

(xz-libs-5.6.0-2.fc41.x86_64)
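For anyone triaging a pile of valgrind logs for this symptom, the check can be scripted. This is only a minimal sketch: the helper names valgrind_error_count and blames_liblzma are made up for illustration, and the sample log is a few lines copied from the report above rather than a fresh run.

```shell
#!/bin/sh
# valgrind_error_count (hypothetical helper): reads a valgrind log on stdin
# and prints the error count from the last "ERROR SUMMARY" line.
valgrind_error_count() {
    sed -n 's/.*ERROR SUMMARY: \([0-9][0-9]*\) errors.*/\1/p' | tail -n 1
}

# blames_liblzma (hypothetical helper): succeeds if any stack frame in the
# log on stdin is attributed to a liblzma shared object.
blames_liblzma() {
    grep -q '(in [^)]*liblzma\.so[^)]*)'
}

# Sample lines copied from the report above.
log='==749691== Invalid write of size 8
==749691== at 0x4897645: ??? (in /usr/lib64/liblzma.so.5.6.0)
==749691== by 0x487983B: _get_cpuid (in /usr/lib64/liblzma.so.5.6.0)
==749691== ERROR SUMMARY: 112 errors from 1 contexts (suppressed: 0 from 0)'

count=$(printf '%s\n' "$log" | valgrind_error_count)
if [ "$count" -gt 0 ] && printf '%s\n' "$log" | blames_liblzma; then
    echo "valgrind reported $count errors with frames in liblzma"
fi
```

To use it on a real run, pipe the saved valgrind log (for example the 1101.out.bad file mentioned in the original report) into the two helpers instead of the inline sample.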
I changed xz to use ./configure --disable-ifunc, which works around the problem:
https://src.fedoraproject.org/rpms/xz/c/6db19f2231927b4d93e9c021d32cb7433708e26f?branch=rawhide

Build for F40: https://koji.fedoraproject.org/koji/taskinfo?taskID=114461506
Build for Rawhide: https://koji.fedoraproject.org/koji/taskinfo?taskID=114461456

As this is only a workaround, let's keep this bug open.
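One way to see whether an installed liblzma still exports ifunc-resolved symbols is to look at the Type column of its dynamic symbol table. The following is a rough sketch, not something taken from the Fedora packaging: count_ifunc_symbols is a hypothetical helper, the library path is the one from the valgrind output and will differ on other systems, and it assumes binutils readelf is installed.

```shell
#!/bin/sh
# count_ifunc_symbols (hypothetical helper): reads readelf symbol-table
# text on stdin and counts rows whose Type column (4th field) is IFUNC.
count_ifunc_symbols() {
    awk '$4 == "IFUNC" { n++ } END { print n + 0 }'
}

# Path taken from the valgrind report; adjust for your system.
lib=/usr/lib64/liblzma.so.5.6.0

if command -v readelf >/dev/null 2>&1 && [ -r "$lib" ]; then
    # -W avoids truncating wide rows; --dyn-syms prints the dynamic symbols.
    n=$(readelf -W --dyn-syms "$lib" | count_ifunc_symbols)
    echo "IFUNC symbols in $lib: $n"
else
    echo "readelf or $lib not available; skipping check"
fi
```

A build configured with --disable-ifunc would be expected to report zero here, assuming nothing else in the library uses ifunc.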
Is the broken version in F40 stable currently? If so, should we propose this as an FE to ensure we don't ship the broken one in Beta?
These are the fixed packages:

F40: https://bodhi.fedoraproject.org/updates/FEDORA-2024-f5033032b8
F41: https://bodhi.fedoraproject.org/updates/FEDORA-2024-f0381d82b3

This is the broken update:

https://bodhi.fedoraproject.org/updates/FEDORA-2024-4417db3376

(Do we actually need to rebuild perl-Compress-Raw-Lzma on every *release*? Spec file seems to say only on version changes.)

Not sure what FE is, but it looks like the broken update is not in Fedora 40 right now, so we just need to make sure it doesn't get out.
Yes. We do. You can tell, because the tests failed. :D Those are the tests that always fail when the package can't be updated because the perl-Compress-Raw-Lzma dependency is broken. It would have made more sense to just edit the new xz build into the existing F40 update, but now that ship has sailed :( We will need to bump and build perl-Compress-Raw-Lzma again and edit it into the new update.
Actually, let me qualify that - no, we don't need to rebuild it every *release*, only on version changes - but because https://bodhi.fedoraproject.org/updates/FEDORA-2024-4417db3376 never made it to stable, stable still has the old perl-Compress-Raw-Lzma that requires version 5.4.6 of xz-libs.
Hi Rich,

Sorry for the lack of detail - the failure rate is so high for us here that I wondered if every program using liblzma would exhibit the same problem - that's certainly the case across all our tools.

This is last night's run and shows the extent of the issue. I expect that the list of ~130 rawhide failures is every test we have that uses valgrind.

https://performancecopilot.github.io/qa-reports/reports/20240304_212241-a8847c35/

You can install the pcp-testsuite package to get at the PCP tests locally. They're shell scripts, so pretty easy to follow (obviously, many of them invoke other tools like valgrind). The individual tests can also be perused here: https://github.com/performancecopilot/pcp/tree/main/qa

cheers.
I'll do the perl-Compress-Raw-Lzma builds and clean up the update.
(In reply to Adam Williamson from comment #9)
> Actually, let me qualify that - no, we don't need to rebuild it every
> *release*, only on version changes - but because
> https://bodhi.fedoraproject.org/updates/FEDORA-2024-4417db3376 never made it
> to stable, stable still has the old perl-Compress-Raw-Lzma that requires
> version 5.4.6 of xz-libs.

Right, yes, the above is the reason. perl-Compress-Raw-Lzma depends only on the liblzma version:

Requires: xz-libs%{?_isa} = %((pkg-config --modversion liblzma 2>/dev/null || echo 0) | tr -dc '[0-9.]')

where 'pkg-config --modversion liblzma' expands to '5.6.0'. So the spec file comment is correct.
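The pipeline inside that Requires line can be tried on its own to see what version string ends up pinned. A minimal sketch follows; sanitize_version and liblzma_version_or_zero are hypothetical names used only here, and the '5.6.0beta' input is an invented example to show the filtering, not a real release.

```shell
#!/bin/sh
# sanitize_version (hypothetical helper): applies the same tr filter as the
# spec file to a version string passed as the first argument.
sanitize_version() {
    # tr -dc deletes every character NOT in the set, so digits and dots
    # survive and the trailing newline is dropped. The brackets in the set
    # are literal characters to tr, which is harmless for version strings.
    printf '%s\n' "$1" | tr -dc '[0-9.]'
}

# liblzma_version_or_zero (hypothetical helper): same shape as the spec
# pipeline, falling back to "0" when pkg-config cannot report a version.
liblzma_version_or_zero() {
    (pkg-config --modversion liblzma 2>/dev/null || echo 0) | tr -dc '[0-9.]'
}

echo "from a fixed string:  xz-libs = $(sanitize_version '5.6.0')"
echo "invented suffix case: xz-libs = $(sanitize_version '5.6.0beta')"
echo "from this system:     xz-libs = $(liblzma_version_or_zero)"
```

Because the dependency is an exact match on this string, the Perl module has to be rebuilt whenever the liblzma version it was built against changes, which is the situation discussed in the comments above.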
Confirming we're seeing goodness across the PCP tests running on rawhide once more. Thanks Rich!
This is fixed by https://github.com/tukaani-project/xz/commit/82ecc538193b380a21622aea02b0ba078e7ade92 included in xz 5.6.1.
FEDORA-2024-7e9c14633a (perl-Compress-Raw-Lzma-2.209-5.fc41 and xz-5.6.1-1.fc41) has been submitted as an update to Fedora 41. https://bodhi.fedoraproject.org/updates/FEDORA-2024-7e9c14633a
FEDORA-2024-7e9c14633a (perl-Compress-Raw-Lzma-2.209-5.fc41 and xz-5.6.1-1.fc41) has been pushed to the Fedora 41 stable repository. If problem still persists, please make note of it in this bug report.
Yikes - the author of the upstream patch used this bug to justify making his xz backdoor (CVE-2024-3094) even bigger. :(

https://github.com/tukaani-project/xz/commit/82ecc538193b380a21622aea02b0ba078e7ade92
https://www.openwall.com/lists/oss-security/2024/03/29/4
(In reply to Richard W.M. Jones from comment #14)
> This is fixed by
> https://github.com/tukaani-project/xz/commit/82ecc538193b380a21622aea02b0ba078e7ade92
> included in xz 5.6.1.

Unfortunately (or luckily?) github has disabled the project. Is this commit available somewhere? I wonder how that "fix" worked around the valgrind memcheck errors.
(In reply to Mark Wielaard from comment #18)
> (In reply to Richard W.M. Jones from comment #14)
> > This is fixed by
> > https://github.com/tukaani-project/xz/commit/82ecc538193b380a21622aea02b0ba078e7ade92
> > included in xz 5.6.1.
>
> Unfortunately (or luckily?) github has disabled the project.
> Is this commit available somewhere? I wonder how that "fix" worked around
> the valgrind memcheck errors.

I was curious as well. The project website is still hosting the source, and the commit seems to be here:
https://git.tukaani.org/?p=xz.git;a=commit;h=82ecc538193b380a21622aea02b0ba078e7ade92
(In reply to Ehila from comment #19)
> (In reply to Mark Wielaard from comment #18)
> > (In reply to Richard W.M. Jones from comment #14)
> > > This is fixed by
> > > https://github.com/tukaani-project/xz/commit/82ecc538193b380a21622aea02b0ba078e7ade92
> > > included in xz 5.6.1.
> >
> > Unfortunately (or luckily?) github has disabled the project.
> > Is this commit available somewhere? I wonder how that "fix" worked around
> > the valgrind memcheck errors.
>
> I was curious as well, the project website is still hosting the source. The
> commit seems to be here
> https://git.tukaani.org/?p=xz.git;a=commit;h=82ecc538193b380a21622aea02b0ba078e7ade92

Wow, thanks. That commit does sound somewhat plausible, and I doubt I would have recognized all this as suspicious. Although it should have looked suspicious, because there is no real reason this would only show up under valgrind (valgrind does, however, have an issue where interception of an ifunc can misfire, so it isn't completely unreasonable to suspect a valgrind bug here).

But the "real" fix for this "valgrind issue" seems to come an hour later, when some test files are updated. Those files are then included in the next xz release.