Bug 2001549
| Summary: | udisksd sometimes crashes during shutdown with "malloc_consolidate(): unaligned fastbin chunk detected" | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Marius Vollmer <mvollmer> | ||||
| Component: | udisks2 | Assignee: | Tomáš Bžatek <tbzatek> | ||||
| Status: | CLOSED ERRATA | QA Contact: | guazhang <guazhang> | ||||
| Severity: | unspecified | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 9.0 | CC: | guazhang | ||||
| Target Milestone: | rc | Keywords: | Reopened, Triaged | ||||
| Target Release: | --- | Flags: | pm-rhel:
mirror+
|
||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | udisks2-2.9.4-1.el9 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2022-05-17 12:52:43 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 2010690 | ||||||
| Bug Blocks: | |||||||
| Attachments: |
|
||||||
This will reproduce it, if you know how to run the Cockpit CI suite: $ export TEST_OS=rhel-9-0 $ make bots $ ./test/image-prepare -q $ ./test/common/run-tests -n -b 1 --test-dir test/verify This takes quite a bit of time. I will try to cut this down a bit more, get a backtrace and maybe even try to run udisksd under valgrind. I got this from valgrind, but I am not sure if it is relevant yet: ==5484== Thread 8 pool-udisksd: ==5484== Source and destination overlap in memcpy_chk(0xb126055, 0xb126050, 7) ==5484== at 0x484C232: __memcpy_chk (vg_replace_strmem.c:1617) ==5484== by 0x517FDED: blkid_devno_to_devname (in /usr/lib64/libblkid.so.1.1.0) ==5484== by 0x518ACE1: ??? (in /usr/lib64/libblkid.so.1.1.0) ==5484== by 0x5181DED: blkid_do_safeprobe (in /usr/lib64/libblkid.so.1.1.0) ==5484== by 0x48C7DEF: mnt_cache_read_tags (in /usr/lib64/libmount.so.1.1.0) ==5484== by 0x48C8180: mnt_get_fstype (in /usr/lib64/libmount.so.1.1.0) ==5484== by 0x48D81E2: ??? (in /usr/lib64/libmount.so.1.1.0) ==5484== by 0x48E5F99: mnt_context_prepare_mount (in /usr/lib64/libmount.so.1.1.0) ==5484== by 0x48E6BB9: mnt_context_mount (in /usr/lib64/libmount.so.1.1.0) ==5484== by 0x7C770AB: ??? (in /usr/lib64/libbd_fs.so.2.0.0) ==5484== by 0x7C7887E: bd_fs_mount (in /usr/lib64/libbd_fs.so.2.0.0) ==5484== by 0x13D7FA: ??? (in /usr/libexec/udisks2/udisksd) ==5484== And this at exit: ==58962== Invalid read of size 4 ==58962== at 0x4B4CD24: g_resource_unref (in /usr/lib64/libgio-2.0.so.0.6800.4) ==58962== by 0x4B4EAD0: g_static_resource_fini (in /usr/lib64/libgio-2.0.so.0.6800.4) ==58962== by 0x4010ADB: _dl_fini (dl-fini.c:139) ==58962== by 0x4E8BDC4: __run_exit_handlers (in /usr/lib64/libc.so.6) ==58962== by 0x4E8BF3F: exit (in /usr/lib64/libc.so.6) ==58962== by 0x4E74566: __libc_start_call_main (in /usr/lib64/libc.so.6) ==58962== by 0x4E7460B: __libc_start_main@@GLIBC_2.34 (in /usr/lib64/libc.so.6) ==58962== by 0x1286D4: (below main) (in /usr/libexec/udisks2/udisksd) ==58962== Address 0x5da98d0 is 0 bytes inside a block of size 16 free'd ==58962== at 0x48440E4: free (vg_replace_malloc.c:755) ==58962== by 0x4D4BD1C: g_free (in /usr/lib64/libglib-2.0.so.0.6800.4) ==58962== by 0x4B4EAC8: g_static_resource_fini (in /usr/lib64/libgio-2.0.so.0.6800.4) ==58962== by 0x4010ADB: _dl_fini (dl-fini.c:139) ==58962== by 0x4E8BDC4: __run_exit_handlers (in /usr/lib64/libc.so.6) ==58962== by 0x4E8BF3F: exit (in /usr/lib64/libc.so.6) ==58962== by 0x4E74566: __libc_start_call_main (in /usr/lib64/libc.so.6) ==58962== by 0x4E7460B: __libc_start_main@@GLIBC_2.34 (in /usr/lib64/libc.so.6) ==58962== by 0x1286D4: (below main) (in /usr/libexec/udisks2/udisksd) ==58962== Block was alloc'd at ==58962== at 0x484186F: malloc (vg_replace_malloc.c:380) ==58962== by 0x4D4F408: g_malloc (in /usr/lib64/libglib-2.0.so.0.6800.4) ==58962== by 0x4B4E347: g_resource_new_from_data (in /usr/lib64/libgio-2.0.so.0.6800.4) ==58962== by 0x4B4E408: ??? (in /usr/lib64/libgio-2.0.so.0.6800.4) ==58962== by 0x4B4EB0C: g_static_resource_get_resource (in /usr/lib64/libgio-2.0.so.0.6800.4) ==58962== by 0x14F1F3: udisks_linux_mount_options_get_builtin (in /usr/libexec/udisks2/udisksd) ==58962== by 0x12CAEE: ??? (in /usr/libexec/udisks2/udisksd) ==58962== by 0x4CB9C86: ??? (in /usr/lib64/libgobject-2.0.so.0.6800.4) ==58962== by 0x4CBB137: g_object_new_valist (in /usr/lib64/libgobject-2.0.so.0.6800.4) ==58962== by 0x4CBB66C: g_object_new (in /usr/lib64/libgobject-2.0.so.0.6800.4) ==58962== by 0x12998F: udisks_daemon_new (in /usr/libexec/udisks2/udisksd) ==58962== by 0x129A05: ??? (in /usr/libexec/udisks2/udisksd) Interesting! Might be caused by the fact RHEL 9 is compiled with much strict CFLAGS than Fedora. This looks like a stack corruption from a first look. From time to time we run udisks under Valgrind, though I have never tried that on RHEL 9. I guess a good reproducer would be a key here. Please keep trying! Got it! (The reproducer, not the fix...)
This script causes udisksd to abort for me:
# Make a fresh block device with some LUKS on it
dd if=/dev/zero of=/disk.img bs=1M count=50
DEV=$(losetup -f --show /disk.img)
echo einszweidrei | cryptsetup luksFormat ${DEV}
udevadm settle
# Start udisksd, just so that a second copy refuses to start
systemctl start udisks2
# Now run udisksd. It will immediately exit (because there is already
# one running), and will crash during exit.
/usr/libexec/udisks2/udisksd
# Clean up for the next run
losetup -d $DEV
Basically, if there is a LUKS device (doesn't need to be unlocked), just starting udisksd does something that will cause it to abort during exit.
Running udisksd under valgrind does not provide any hints, the invalid read in comment 3 happens always during exit.
And here is the backtrace of the abort: #0 0x00007ffff7852763 in pthread_kill.5 () from /lib64/libc.so.6 #1 0x00007ffff7805686 in raise () from /lib64/libc.so.6 #2 0x00007ffff77ef7d3 in abort () from /lib64/libc.so.6 #3 0x00007ffff7846a07 in __libc_message () from /lib64/libc.so.6 #4 0x00007ffff785c72c in malloc_printerr () from /lib64/libc.so.6 #5 0x00007ffff785d49c in malloc_consolidate () from /lib64/libc.so.6 #6 0x00007ffff785e4a0 in _int_free () from /lib64/libc.so.6 #7 0x00007ffff7860ac5 in free () from /lib64/libc.so.6 #8 0x00007ffff709fa6d in sa_doall.lto_priv () from /lib64/libcrypto.so.3 #9 0x00007ffff70c457d in ossl_method_store_free.part () from /lib64/libcrypto.so.3 #10 0x00007ffff709a825 in CRYPTO_free_ex_data () from /lib64/libcrypto.so.3 #11 0x00007ffff709359f in context_deinit.part.0.lto_priv () from /lib64/libcrypto.so.3 #12 0x00007ffff709cfba in OPENSSL_cleanup () from /lib64/libcrypto.so.3 #13 0x00007ffff5592303 in libcryptsetup_exit () from /lib64/libcryptsetup.so.12 #14 0x00007ffff7fdaadc in _dl_fini () at dl-fini.c:139 #15 0x00007ffff7807dc5 in __run_exit_handlers () from /lib64/libc.so.6 #16 0x00007ffff7807f40 in exit () from /lib64/libc.so.6 #17 0x00007ffff77f0567 in __libc_start_call_main () from /lib64/libc.so.6 #18 0x00007ffff77f060c in __libc_start_main_impl () from /lib64/libc.so.6 #19 0x00005555555746d5 in _start () Thanks, this looks like cryptsetup bug 1998921 With bug 1998921 fixed and verified and the backtrace looking suspiciously similar, let's just close this one as a duplicate. No fix needed on udisks2 side. *** This bug has been marked as a duplicate of bug 1998921 *** cryptsetup-libs-2.4.0-2.el9_b.x86_64 unfortunately does not fix this crash. The reproducer in comment 5 still results in a the same output: udisks-Message: 03:26:22.116: udisks daemon version 2.9.2 starting udisks-Message: 03:26:22.220: udisks daemon version 2.9.2 exiting malloc_consolidate(): unaligned fastbin chunk detected Aborted (core dumped) (and udiskd doesn't actually link to libcryptsetup.so as far as ldd tells me, or is there some dlopening going on?) You're right! Sorry for not noticing the backtrace difference, both are however related to at_exit handlers and library deinitialization. I've reproduced this on my local development system with slightly different udisks setup. I've opened bug 2010690 to keep the history clean. (In reply to Marius Vollmer from comment #11) > (and udiskd doesn't actually link to libcryptsetup.so as far as ldd tells > me, or is there some dlopening going on?) Yes, libcryptsetup.so gets dlopened in the process through bd_try_init() along with libbd_crypto.so.2 that actually links against libcryptsetup.so. There's no explicit dlclose() as far as I can tell, only exit handlers attached. (In reply to Tomáš Bžatek from comment #13) > (In reply to Marius Vollmer from comment #11) > > (and udiskd doesn't actually link to libcryptsetup.so as far as ldd tells > > me, or is there some dlopening going on?) > > Yes, libcryptsetup.so gets dlopened in the process through bd_try_init() > along with libbd_crypto.so.2 that actually links against libcryptsetup.so. > There's no explicit dlclose() as far as I can tell, only exit handlers > attached. I see, thanks! Turns out you were right again, Marius! The warning in comment 3 has actually caused memory corruption that caused a crash later in the cryptsetup code. This upstream change should fix the issue: https://github.com/storaged-project/udisks/pull/926 I'll include it along with the 2.9.4 release rebase (bug 2010363). Hi reproduce the bug with udisks2-2.9.2-6.el9.x86_64. # sh test.sh 50+0 records in 50+0 records out 52428800 bytes (52 MB, 50 MiB) copied, 0.0198364 s, 2.6 GB/s udisks-Message: 03:15:32.426: udisks daemon version 2.9.2 starting udisks-Message: 03:15:32.730: udisks daemon version 2.9.2 exiting malloc_consolidate(): unaligned fastbin chunk detected test.sh: line 15: 5723 Aborted (core dumped) /usr/libexec/udisks2/udisksd Do not reproduce the bug with udisks2-2.9.4-1.el9. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: udisks2), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:2413 |
Created attachment 1820822 [details] Journal messages with the crash Description of problem: We see a udisksd crash sometime during shutdown of a CI machine: Sep 06 05:35:27 rhel-9-0-127-0-0-2-2201 udisksd[16264]: udisks daemon version 2.9.2 exiting Sep 06 05:35:27 rhel-9-0-127-0-0-2-2201 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=gssproxy comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' Sep 06 05:35:27 rhel-9-0-127-0-0-2-2201 systemd[1]: gssproxy.service: Deactivated successfully. Sep 06 05:35:27 rhel-9-0-127-0-0-2-2201 systemd[1]: Stopped GSSAPI Proxy Daemon. Sep 06 05:35:27 rhel-9-0-127-0-0-2-2201 udisksd[16264]: malloc_consolidate(): unaligned fastbin chunk detected Sep 06 05:35:27 rhel-9-0-127-0-0-2-2201 audit[16264]: ANOM_ABEND auid=4294967295 uid=0 gid=0 ses=4294967295 subj=system_u:system_r:devicekit_disk_t:s0 pid=16264 comm="udisksd" exe="/usr/libexec/udisks2/udisksd" sig=6 res=1 Sep 06 05:35:27 rhel-9-0-127-0-0-2-2201 systemd-coredump[166138]: Failed to connect to coredump service: Connection refused The VM has run a good number of tests at that time, and it is not clear what exactly has gottin udisksd into the crashy state. I will try to find something reproducible. Also, systemd-coredump has already exited by the time udisksd crashes, so there is no backtrace either. 2350-20210906-090323-4a6c7db3-rhel-9-0-cockpit-project-cockpit/TestUpdates-testTracer-rhel-9-0-127.0.0.2-2301-FAIL.log.gz Version-Release number of selected component (if applicable): udisks2-2.9.2-6.el9.x86_64 How reproducible: Not at all easy, but maybe I can find something Steps to Reproduce: 1. Run the full Cockpit CI suite for rhel-9-0.