We have been seeing this for a while; finally filing a bug on it to get it better tracked. :(

Sporadically we are seeing livemedia composes in both f39 and rawhide fail. anaconda reports that it couldn't connect to dbus. On the builders there's a SIGILL from python3.12:

           PID: 1535555 (python3)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 4 (ILL)
     Timestamp: Tue 2023-10-31 08:00:49 UTC (12h ago)
  Command Line: python3 -m pyanaconda.modules.storage
    Executable: /usr/bin/python3.12
 Control Group: /machine.slice/machine-810eb7856fe6435d848985a92b99d9b9.scope/payload
          Unit: machine-810eb7856fe6435d848985a92b99d9b9.scope
         Slice: machine.slice
       Boot ID: 4a08a17dadbd421d9b3a274841c5ad95
    Machine ID: 56794a87462f49be8a21501b922b4dbd
      Hostname: 810eb7856fe6435d848985a92b99d9b9
       Storage: none
       Message: Process 1535555 (python3) of user 0 dumped core.

https://koji.fedoraproject.org/koji/taskinfo?taskID=108358818 is an example of a failed task.

This could be an anaconda bug, or a problem with our builders, but I thought I would start with python3.12.

It's only on aarch64 and only livemedia tasks. It's sporadic: not the same builder or image all the time. It's also not the same command line all the time:

     15 Command Line: python3 -m pyanaconda.modules.localization
     15 Command Line: python3 -m pyanaconda.modules.network
     13 Command Line: python3 -m pyanaconda.modules.payloads
      3 Command Line: python3 -m pyanaconda.modules.runtime
      5 Command Line: python3 -m pyanaconda.modules.security
      6 Command Line: python3 -m pyanaconda.modules.services
     12 Command Line: python3 -m pyanaconda.modules.storage
      4 Command Line: python3 -m pyanaconda.modules.subscription
      6 Command Line: python3 -m pyanaconda.modules.timezone
      4 Command Line: python3 -m pyanaconda.modules.users

Happy to try and gather more info or track this down. Thanks!

Reproducible: Sometimes

Steps to Reproduce:
1. compose fedora
2. see some livemedia failures
I have no idea where I would start investigating this, or how. Is it possible to download the filesystem of the builder and boot it up in some VM?
Yeah, I am really not sure how to debug this either. :( Perhaps we could first cc the anaconda folks and see if they see anything like it in their aarch64 testing/CI? All of these are on bare-metal aarch64 builders, with mock / systemd-nspawn and livemedia-creator running in that. I could also try to duplicate it in staging, but probably not this week...
Oh, I forgot: another thing I'd like to do is update/reboot those builders. We are in the f39 freeze and not making any changes there right now, but hopefully we can soon...
can we get the coredump and backtrace it at least?
There isn't one. At least nothing in coredumpctl. ;( I am not sure why.
It's ulimit, I think:

Oct 31 09:49:01 buildhw-a64-02.iad2.fedoraproject.org systemd-coredump[1773132]: Resource limits disable core dumping for process 1773084 (python3).

We should be able to change the limits and get a coredump.
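For reference, one way the limit could be raised is a systemd drop-in on the builder host, so that everything started under systemd (including the nspawn payload) inherits an unlimited core limit and systemd-coredump can actually store the dump. This is a hypothetical sketch, not necessarily the exact change we'll make:

# /etc/systemd/system.conf.d/10-coredump.conf (hypothetical drop-in path)
[Manager]
DefaultLimitCORE=infinity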
If you can get a coredump, I may be able to help you debug it.

"python3 -m pyanaconda.modules.storage": is it easy to reproduce the issue? Do you think that it would be possible to create a special image containing python3-debug and run the anaconda programs (such as "python3 -m pyanaconda.modules.storage") with that binary instead? I.e., run "python3-debug -X dev -m pyanaconda.modules.storage". Adding "-X dev" enables more debug checks:

* -X dev: https://docs.python.org/dev/library/devmode.html
* python3-debug is a Python debug build: https://docs.python.org/dev/using/configure.html#python-debug-build
we're working on getting a coredump - see https://pagure.io/fedora-infra/ansible/pull-request/1618 . we might be able to do a custom image also, I can look into that later.
Any luck getting a coredump since the PR has been merged?
Sadly no. It didn't work. ;( I'll try and figure out another way...
ok. Got a core finally...

           PID: 3558463 (python3)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 4 (ILL)
     Timestamp: Tue 2023-11-14 19:12:16 UTC (4min 15s ago)
  Command Line: python3 -m pyanaconda.modules.payloads
    Executable: /usr/bin/python3.12
 Control Group: /machine.slice/machine-a419a95f068a4ded988be5153ba67666.scope/payload
          Unit: machine-a419a95f068a4ded988be5153ba67666.scope
         Slice: machine.slice
       Boot ID: 45eb1ecb91fe45fc9c7bda79e918763c
    Machine ID: 5d8add178f344e378788cb2550e6a521
      Hostname: a419a95f068a4ded988be5153ba67666
       Storage: /var/lib/systemd/coredump/core.python3.0.45eb1ecb91fe45fc9c7bda79e918763c.3558463.1699989136000000.zst (present)
  Size on Disk: 2.7M
       Package: python3.12/3.12.0-2.fc40
      build-id: b9bd80861442b62dc97210c2bd0334084f6c734c
       Message: Process 3558463 (python3) of user 0 dumped core.

Module /usr/bin/python3.12 from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/libdevmapper.so.1.02 from rpm lvm2-2.03.22-1.fc39.aarch64
Module /usr/lib64/libbd_lvm-dbus.so.3.0.0 from rpm libblockdev-3.0.4-1.fc40.aarch64
Module /usr/lib64/python3.12/site-packages/selinux/_selinux.cpython-312-aarch64-linux-gnu.so from rpm libselinux-3.5-5.fc39.aarch64
Module /usr/lib64/libzstd.so.1.5.5 from rpm zstd-1.5.5-4.fc39.aarch64
Module /usr/lib64/libcap.so.2.48 from rpm libcap-2.48-7.fc39.aarch64
Module /usr/lib64/libkmod.so.2.4.1 from rpm kmod-31-2.fc40.aarch64
Module /usr/lib64/libudev.so.1.7.7 from rpm systemd-254.5-2.fc40.aarch64
Module /usr/lib64/libbd_utils.so.3.0.0 from rpm libblockdev-3.0.4-1.fc40.aarch64
Module /usr/lib64/libblockdev.so.3.0.0 from rpm libblockdev-3.0.4-1.fc40.aarch64
Module /usr/lib64/libgmp.so.10.4.1 from rpm gmp-6.2.1-5.fc39.aarch64
Module /usr/lib64/libmpfr.so.6.2.1 from rpm mpfr-4.2.1-1.fc40.aarch64
Module /usr/lib64/libbytesize.so.1.0.0 from rpm libbytesize-2.10-1.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/_ctypes.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/_opcode.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/array.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/_socket.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/libblkid.so.1.1.0 from rpm util-linux-2.39.2-1.fc40.aarch64
Module /usr/lib64/libselinux.so.1 from rpm libselinux-3.5-5.fc39.aarch64
Module /usr/lib64/libmount.so.1.1.0 from rpm util-linux-2.39.2-1.fc40.aarch64
Module /usr/lib64/libgio-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64
Module /usr/lib64/libgmodule-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64
Module /usr/lib64/libpcre2-8.so.0.11.2 from rpm pcre2-10.42-2.fc40.aarch64
Module /usr/lib64/libffi.so.8.1.2 from rpm libffi-3.4.4-4.fc39.aarch64
Module /usr/lib64/libgobject-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64
Module /usr/lib64/libgirepository-1.0.so.1.0.0 from rpm gobject-introspection-1.78.1-1.fc40.aarch64
Module /usr/lib64/libglib-2.0.so.0.7800.1 from rpm glib2-2.78.1-1.fc40.aarch64
Module /usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so from rpm pygobject3-3.46.0-1.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/_struct.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/libmpdec.so.2.5.1 from rpm mpdecimal-2.5.1-7.fc39.aarch64
Module /usr/lib64/python3.12/lib-dynload/_decimal.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/libuuid.so.1.3.0 from rpm util-linux-2.39.2-1.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/_uuid.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/libb2.so.1.0.4 from rpm libb2-0.98.1-9.fc39.aarch64
Module /usr/lib64/python3.12/lib-dynload/_blake2.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/libcrypto.so.3.1.4 from rpm openssl-3.1.4-1.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/_hashlib.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/_random.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/_bisect.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/liblzma.so.5.4.5 from rpm xz-5.4.5-1.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/_lzma.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/libbz2.so.1.0.8 from rpm bzip2-1.0.8-16.fc39.aarch64
Module /usr/lib64/python3.12/lib-dynload/_bz2.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/libz.so.1.2.13 from rpm zlib-1.2.13-5.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/zlib.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/math.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/select.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/_posixsubprocess.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/fcntl.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/python3.12/lib-dynload/syslog.cpython-312-aarch64-linux-gnu.so from rpm python3.12-3.12.0-2.fc40.aarch64
Module /usr/lib64/libpython3.12.so.1.0 from rpm python3.12-3.12.0-2.fc40.aarch64

Stack trace of thread 53:
#0  0x0000ffffa2df1c80 n/a (/usr/lib64/libc.so.6 + 0x91c80)
#1  0x0000ffffa2df1c6c n/a (/usr/lib64/libc.so.6 + 0x91c6c)
#2  0x0000ffffa2da5800 n/a (/usr/lib64/libc.so.6 + 0x45800)
#3  0x0000ffffa35de7f0 n/a (ld-linux-aarch64.so.1 + 0x4d7f0)
#4  0x0000ffff93372cd8 n/a (/usr/lib64/libbd_utils.so.3.0.0 + 0x12cd8)
#5  0x0000ffff930b4438 n/a (/usr/lib64/libbd_lvm-dbus.so.3.0.0 + 0x14438)
#6  0x0000ffff933c1680 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x11680)
#7  0x0000ffff933ccba0 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1cba0)
#8  0x0000ffff933cf958 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1f958)
#9  0x0000ffff933d6ee8 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x26ee8)
#10 0x0000ffff93ecc050 n/a (/usr/lib64/libffi.so.8.1.2 + 0x1c050)
#11 0x0000ffff93ec63ac n/a (/usr/lib64/libffi.so.8.1.2 + 0x163ac)
#12 0x0000ffff941c1344 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x31344)
#13 0x0000ffff941bfe4c n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x2fe4c)
#14 0x0000ffffa30aa008 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x17a008)
#15 0x0000ffffa30b6518 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x186518)
#16 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48)
#17 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04)
#18 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0)
#19 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868)
#20 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0)
#21 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98)
#22 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c)
#23 0x0000ffffa3113fc4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1e3fc4)
#24 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0)
#25 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868)
#26 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0)
#27 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98)
#28 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c)
#29 0x0000ffffa30ba964 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18a964)
#30 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48)
#31 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04)
#32 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0)
#33 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868)
#34 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0)
#35 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98)
#36 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c)
#37 0x0000ffffa30ba964 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18a964)
#38 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48)
#39 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04)
#40 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0)
#41 0x0000ffffa30b7868 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187868)
#42 0x0000ffffa30d9ed0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a9ed0)
#43 0x0000ffffa3105b98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d5b98)
#44 0x0000ffffa310516c n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d516c)
#45 0x0000ffffa30ba964 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18a964)
#46 0x0000ffffa316bf48 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x23bf48)
#47 0x0000ffffa3193c04 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x263c04)
#48 0x0000ffffa30d0bd0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0bd0)
#49 0x0000ffffa30d0678 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a0678)
#50 0x0000ffffa30b6518 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x186518)
#51 0x0000ffffa31acc40 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x27cc40)
#52 0x0000ffffa31abefc n/a (/usr/lib64/libpython3.12.so.1.0 + 0x27befc)
#53 0x0000ffffa3151c34 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x221c34)
#54 0x0000ffffa2d909dc n/a (/usr/lib64/libc.so.6 + 0x309dc)
#55 0x0000ffffa2d90ab0 n/a (/usr/lib64/libc.so.6 + 0x30ab0)
#56 0x0000aaaad11f00f0 n/a (/usr/bin/python3.12 + 0x100f0)
ELF object binary architecture: AARCH64

The file is at https://people.fedoraproject.org/~kevin/core.python3.0.45eb1ecb91fe45fc9c7bda79e918763c.3558463.1699989136000000.zst

Let me know if I can add any more.
Kevin, could you get the output of "coredumpctl gdb"?
ok. Here's what I got... Also, FYI, we are doing an update/reboot cycle tomorrow, so we can see if this even persists after updates...

Core was generated by `python3 -m pyanaconda.modules.payloads'.
Program terminated with signal SIGILL, Illegal instruction.
#0  __pthread_kill_implementation (threadid=281473422540832, signo=signo@entry=4, no_tid=no_tid@entry=0) at pthread_kill.c:44
Downloading source file /usr/src/debug/glibc-2.38.9000-19.fc40.aarch64/nptl/pthread_kill.c
44        return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
[Current thread is 1 (LWP 53)]
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-13.fc38.aarch64 device-mapper-libs-1.02.189-2.fc38.aarch64 glib2-2.76.5-2.fc38.aarch64 glibc-2.37-10.fc38.aarch64 gmp-6.2.1-4.fc38.aarch64 gobject-introspection-1.76.1-1.fc38.aarch64 kmod-libs-30-4.fc38.aarch64 libb2-0.98.1-8.fc38.aarch64 libblkid-2.38.1-4.fc38.aarch64 libbytesize-2.9-1.fc38.aarch64 libcap-2.48-6.fc38.aarch64 libffi-3.4.4-2.fc38.aarch64 libgcc-13.2.1-1.fc38.aarch64 libgomp-13.2.1-1.fc38.aarch64 libmount-2.38.1-4.fc38.aarch64 libselinux-3.5-1.fc38.aarch64 libuuid-2.38.1-4.fc38.aarch64 libzstd-1.5.5-1.fc38.aarch64 mpdecimal-2.5.1-6.fc38.aarch64 mpfr-4.1.1-3.fc38.aarch64 openssl-libs-3.0.9-2.fc38.aarch64 pcre2-10.42-1.fc38.1.aarch64 systemd-libs-253.10-1.fc38.aarch64 xz-libs-5.4.1-1.fc38.aarch64 zlib-1.2.13-3.fc38.aarch64

(gdb) thread apply all bt full

Thread 3 (LWP 69):
#0  0x0000ffffa2e4d6e4 in __GI___poll (fds=0xaaaaf7fdf790, nfds=1, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
        _x3tmp = 0
        _x0tmp = 187651281778576
        _x0 = 187651281778576
        _x3 = 0
        _x4tmp = 0
        _x1tmp = 1
        _x1 = 1
        _x4 = 0
        _x2tmp = <optimized out>
        _x2 = 0
        _x8 = 73
        _sys_result = <optimized out>
        sc_ret = <optimized out>
        sc_cancel_oldtype = 0
        sc_ret = <optimized out>
        timeout_ts = {tv_sec = 187651281778064, tv_nsec = 187651281778064}
        timeout_ts_p = <optimized out>
#1  0x0000ffff940be5b0 in g_main_context_poll_unlocked (priority=2147483647, n_fds=1, fds=0xaaaaf7fdf790, timeout=<optimized out>, context=0xaaaaf7fdf590) at ../glib/gmain.c:4653
        ret = <optimized out>
        errsv = <optimized out>
        poll_func = 0xffff9406e2c0 <g_poll>
        poll_func = <optimized out>
        ret = <optimized out>
        errsv = <optimized out>
#2  g_main_context_iterate_unlocked.isra.0 (context=context@entry=0xaaaaf7fdf590, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4344
        max_priority = 2147483647
        timeout = -1
        some_ready = <optimized out>
        nfds = 1
        allocated_nfds = 1
        fds = 0xaaaaf7fdf790
        begin_time_nsec = <optimized out>
#3  0x0000ffff9405db18 in g_main_context_iteration (context=0xaaaaf7fdf590, may_block=may_block@entry=1) at ../glib/gmain.c:4414
        retval = <optimized out>
#4  0x0000ffff9405db74 in glib_worker_main (data=<optimized out>) at ../glib/gmain.c:6574
No locals.
#5  0x0000ffff940944c4 in g_thread_proxy (data=0xaaaaf7fdf730) at ../glib/gthread.c:831
        thread = 0xaaaaf7fdf730
        __func__ = "g_thread_proxy"
#6  0x0000ffffa2deff84 in start_thread (arg=0xffffa35d4760) at pthread_create.c:444
        ret = <optimized out>
        pd = 0xffffa35d4760
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {281473139601792, 281473139601792, 48, 281473422542688, 281473414266048, 0, 281473131151360, 281473415512064, 281473422540832, 281473131151360, 281473139599888, 11276355813591266238, 0, 11276355813333378930, 0, 0, 0, 0, 0, 0, 0, 0}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = 0
#7  0x0000ffffa2e58dcc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone3.S:76
No locals.

Thread 2 (LWP 66):
#0  syscall () at ../sysdeps/unix/sysv/linux/aarch64/syscall.S:38
No locals.
#1  0x0000ffff940badcc in g_cond_wait (cond=<optimized out>, mutex=0xaaaaf7fd54d0) at ../glib/gthread-posix.c:1552
        saved_errno = 0
        res = <optimized out>
        sampled = 0
#2  0x0000ffff940293e8 in g_async_queue_pop_intern_unlocked (queue=0xaaaaf7fd54d0, wait=1, end_time=-1) at ../glib/gasyncqueue.c:425
        retval = <optimized out>
        __func__ = "g_async_queue_pop_intern_unlocked"
#3  0x0000ffff9409669c in g_thread_pool_spawn_thread (data=<optimized out>) at ../glib/gthreadpool.c:311
        spawn_thread_data = <optimized out>
        thread = 0x0
        error = 0x0
        prgname = <optimized out>
        name = "pool", '\000' <repeats 11 times>
#4  0x0000ffff940944c4 in g_thread_proxy (data=0xaaaaf7e34c20) at ../glib/gthread.c:831
        thread = 0xaaaaf7e34c20
        __func__ = "g_thread_proxy"
#5  0x0000ffffa2deff84 in start_thread (arg=0xffffa35d4760) at pthread_create.c:444
        ret = <optimized out>
        pd = 0xffffa35d4760
        out = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {281473148055936, 281473148055936, 48, 281473422542688, 281473414266048, 0, 281473139605504, 281473415512064, 281473422540832, 281473139605504, 281473148054032, 11276355813591266238, 0, 11276355813353629554, 0, 0, 0, 0, 0, 0, 0, 0}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = 0
#6  0x0000ffffa2e58dcc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone3.S:76
No locals.

Thread 1 (LWP 53):
#0  __pthread_kill_implementation (threadid=281473422540832, signo=signo@entry=4, no_tid=no_tid@entry=0) at pthread_kill.c:44
        tid = 53
        ret = 0
        pd = 0xffffa35d4020
        old_mask = {__val = {0}}
        ret = <optimized out>
#1  0x0000ffffa2df1cf8 in __pthread_kill_internal (signo=4, threadid=<optimized out>) at pthread_kill.c:78
No locals.
#2  0x0000ffffa2da5800 in __GI_raise (sig=4) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  <signal handler called>
No symbol table info available.
#4  0x0000ffffa359cfe0 in ?? ()
No symbol table info available.
#5  0x0000aaaaf7e28e80 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
> Missing separate debuginfos, use: dnf debuginfo-install (...)

Oh, the debuginfod server was not enabled? https://fedoraproject.org/wiki/Debuginfod

You may install these debug symbols to get a more useful backtrace. Sadly, the most interesting thread has an unusable backtrace:

Thread 1 (LWP 53):
...
#3  <signal handler called>
No symbol table info available.
#4  0x0000ffffa359cfe0 in ?? ()
No symbol table info available.
...
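For reference, enabling debuginfod is just a matter of pointing the client tools at the Fedora server; Fedora normally ships this via a profile script from elfutils-debuginfod-client, IIRC:

# sketch: environment consumed by gdb and friends
export DEBUGINFOD_URLS="https://debuginfod.fedoraproject.org/"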
Apparently, the crash is related to libblockdev, libbd_lvm-dbus and libbd_utils:

Stack trace of thread 53:
#0  0x0000ffffa2df1c80 n/a (/usr/lib64/libc.so.6 + 0x91c80)
#1  0x0000ffffa2df1c6c n/a (/usr/lib64/libc.so.6 + 0x91c6c)
#2  0x0000ffffa2da5800 n/a (/usr/lib64/libc.so.6 + 0x45800)
#3  0x0000ffffa35de7f0 n/a (ld-linux-aarch64.so.1 + 0x4d7f0)
#4  0x0000ffff93372cd8 n/a (/usr/lib64/libbd_utils.so.3.0.0 + 0x12cd8)
#5  0x0000ffff930b4438 n/a (/usr/lib64/libbd_lvm-dbus.so.3.0.0 + 0x14438)
#6  0x0000ffff933c1680 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x11680)
#7  0x0000ffff933ccba0 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1cba0)
#8  0x0000ffff933cf958 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x1f958)
#9  0x0000ffff933d6ee8 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x26ee8)
#10 0x0000ffff93ecc050 n/a (/usr/lib64/libffi.so.8.1.2 + 0x1c050)
#11 0x0000ffff93ec63ac n/a (/usr/lib64/libffi.so.8.1.2 + 0x163ac)
#12 0x0000ffff941c1344 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x31344)
#13 0x0000ffff941bfe4c n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x2fe4c)
From what I see, the problem more likely comes from libblockdev than from Python. I'm reassigning the issue to libblockdev.
> debuginfod server was not enabled?

It was. This builder hasn't had updates applied in a while (because we were in freeze before the f39 release), so likely all those things have been updated and the debuginfo for the older packages it has is no longer around? In any case, it did download a bunch of things, just not everything. ;(
If someone needs access to an aarch64 system with rawhide, let us know. Below is a gdb session showing the disassembly of the function containing 0x0000ffff93372cd8, where the crash likely happens. The missing backtrace points to some memory/stack corruption (e.g. data treated as an instruction)...

[root@hpe-apollo-cn99xx-15-vm-30 ~]# gdb --quiet /usr/bin/python3.12 core.python3.0.45eb1ecb91fe45fc9c7bda79e918763c.3558463.1699989136000000
Reading symbols from /usr/bin/python3.12...

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) y
Debuginfod has been enabled.
To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit.
Reading symbols from /root/.cache/debuginfod_client/b9bd80861442b62dc97210c2bd0334084f6c734c/debuginfo...
warning: Can't open file /usr/lib/locale/locale-archive during file-backed mapping note processing
[New LWP 53]
[New LWP 66]
[New LWP 69]
warning: Could not load shared library symbols for /usr/lib64/python3.12/site-packages/selinux/_selinux.cpython-312-aarch64-linux-gnu.so.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `python3 -m pyanaconda.modules.payloads'.
Program terminated with signal SIGILL, Illegal instruction.
#0  __pthread_kill_implementation (threadid=281473422540832, signo=signo@entry=4, no_tid=no_tid@entry=0) at pthread_kill.c:44
44        return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
[Current thread is 1 (Thread 0xffffa35d4020 (LWP 53))]
(gdb) disass 0x0000ffff93372cd8
Dump of assembler code for function bd_utils_log_format:
   0x0000ffff93372be0 <+0>:     paciasp
   0x0000ffff93372be4 <+4>:     sub     sp, sp, #0x120
   0x0000ffff93372be8 <+8>:     adrp    x8, 0xffff9339f000
   0x0000ffff93372bec <+12>:    ldr     x8, [x8, #4056]
   0x0000ffff93372bf0 <+16>:    stp     x29, x30, [sp, #80]
   0x0000ffff93372bf4 <+20>:    add     x29, sp, #0x50
   0x0000ffff93372bf8 <+24>:    stp     x19, x20, [sp, #96]
   0x0000ffff93372bfc <+28>:    adrp    x20, 0xffff933a0000 <log_level>
   0x0000ffff93372c00 <+32>:    mov     w19, w0
   0x0000ffff93372c04 <+36>:    str     q0, [sp, #112]
   0x0000ffff93372c08 <+40>:    str     q1, [sp, #128]
   0x0000ffff93372c0c <+44>:    str     q2, [sp, #144]
   0x0000ffff93372c10 <+48>:    str     q3, [sp, #160]
   0x0000ffff93372c14 <+52>:    str     q4, [sp, #176]
   0x0000ffff93372c18 <+56>:    str     q5, [sp, #192]
   0x0000ffff93372c1c <+60>:    str     q6, [sp, #208]
   0x0000ffff93372c20 <+64>:    str     q7, [sp, #224]
   0x0000ffff93372c24 <+68>:    stp     x2, x3, [sp, #240]
   0x0000ffff93372c28 <+72>:    stp     x4, x5, [sp, #256]
   0x0000ffff93372c2c <+76>:    stp     x6, x7, [sp, #272]
   0x0000ffff93372c30 <+80>:    ldr     x0, [x20, #8]
   0x0000ffff93372c34 <+84>:    ldr     x2, [x8]
   0x0000ffff93372c38 <+88>:    str     x2, [sp, #72]
   0x0000ffff93372c3c <+92>:    mov     x2, #0x0     // #0
   0x0000ffff93372c40 <+96>:    str     xzr, [sp, #32]
   0x0000ffff93372c44 <+100>:   cbz     x0, 0xffff93372ce0 <bd_utils_log_format+256>
   0x0000ffff93372c48 <+104>:   adrp    x2, 0xffff933a0000 <log_level>
   0x0000ffff93372c4c <+108>:   mov     x0, #0x0     // #0
   0x0000ffff93372c50 <+112>:   ldr     w2, [x2]
   0x0000ffff93372c54 <+116>:   cmp     w2, w19
   0x0000ffff93372c58 <+120>:   b.ge    0xffff93372c90 <bd_utils_log_format+176>  // b.tcont
   0x0000ffff93372c5c <+124>:   bl      0xffff93370060 <g_free@plt>
   0x0000ffff93372c60 <+128>:   adrp    x0, 0xffff9339f000
   0x0000ffff93372c64 <+132>:   ldr     x0, [x0, #4056]
   0x0000ffff93372c68 <+136>:   ldr     x2, [sp, #72]
   0x0000ffff93372c6c <+140>:   ldr     x1, [x0]
   0x0000ffff93372c70 <+144>:   subs    x2, x2, x1
   0x0000ffff93372c74 <+148>:   mov     x1, #0x0     // #0
   0x0000ffff93372c78 <+152>:   b.ne    0xffff93372cf4 <bd_utils_log_format+276>  // b.any
   0x0000ffff93372c7c <+156>:   ldp     x29, x30, [sp, #80]
   0x0000ffff93372c80 <+160>:   ldp     x19, x20, [sp, #96]
   0x0000ffff93372c84 <+164>:   add     sp, sp, #0x120
   0x0000ffff93372c88 <+168>:   autiasp
   0x0000ffff93372c8c <+172>:   ret
   0x0000ffff93372c90 <+176>:   mov     w3, #0xffffff80    // #-128
   0x0000ffff93372c94 <+180>:   mov     w0, #0xffffffd0    // #-48
   0x0000ffff93372c98 <+184>:   stp     w0, w3, [sp, #64]
   0x0000ffff93372c9c <+188>:   add     x3, sp, #0x28
   0x0000ffff93372ca0 <+192>:   add     x2, sp, #0x120
   0x0000ffff93372ca4 <+196>:   add     x4, sp, #0xf0
   0x0000ffff93372ca8 <+200>:   stp     x2, x2, [sp, #40]
   0x0000ffff93372cac <+204>:   mov     x2, sp
   0x0000ffff93372cb0 <+208>:   add     x0, sp, #0x20
   0x0000ffff93372cb4 <+212>:   str     x4, [sp, #56]
   0x0000ffff93372cb8 <+216>:   ldp     q0, q1, [x3]
   0x0000ffff93372cbc <+220>:   stp     q0, q1, [x2]
   0x0000ffff93372cc0 <+224>:   bl      0xffff93370760 <g_vasprintf@plt>
   0x0000ffff93372cc4 <+228>:   tbnz    w0, #31, 0xffff93372ce8 <bd_utils_log_format+264>
   0x0000ffff93372cc8 <+232>:   ldr     x1, [sp, #32]
   0x0000ffff93372ccc <+236>:   mov     w0, w19
   0x0000ffff93372cd0 <+240>:   ldr     x2, [x20, #8]
   0x0000ffff93372cd4 <+244>:   blr     x2
   0x0000ffff93372cd8 <+248>:   ldr     x0, [sp, #32]
   0x0000ffff93372cdc <+252>:   b       0xffff93372c5c <bd_utils_log_format+124>
   0x0000ffff93372ce0 <+256>:   mov     x0, #0x0     // #0
   0x0000ffff93372ce4 <+260>:   b       0xffff93372c5c <bd_utils_log_format+124>
   0x0000ffff93372ce8 <+264>:   ldr     x0, [sp, #32]
   0x0000ffff93372cec <+268>:   bl      0xffff93370060 <g_free@plt>
   0x0000ffff93372cf0 <+272>:   b       0xffff93372c60 <bd_utils_log_format+128>
   0x0000ffff93372cf4 <+276>:   bl      0xffff93370470 <__stack_chk_fail@plt>
End of assembler dump.
Is bd_utils_log_format called with an invalid format string and/or args? It is a wrapper around vasprintf, so with a bad format string it could scribble over memory. Does anyone know how to figure out where bd_utils_log_format was called?
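To make the failure mode concrete, here is a hypothetical toy program (not libblockdev code) with the same vasprintf-wrapper shape; the second call passes "%s" with no matching argument, so vasprintf dereferences whatever junk happens to be in the next variadic slot as a char pointer:

#define _GNU_SOURCE         /* vasprintf is a GNU extension */
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* toy wrapper mimicking the bd_utils_log_format pattern: format the
   message with vasprintf, then hand it to a logger */
static void toy_log_format (const char *format, ...)
{
    char *msg = NULL;
    va_list args;

    va_start (args, format);
    if (vasprintf (&msg, format, args) >= 0) {
        fputs (msg, stderr);
        free (msg);
    }
    va_end (args);
}

int main (void)
{
    toy_log_format ("fine: %s\n", "a real string");

    /* undefined behaviour: "%s" with no argument reads a junk pointer
       out of the varargs area -- typically a crash, sometimes worse */
    toy_log_format ("boom: %s\n");
    return 0;
}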
There are only 33 places in libblockdev that call bd_utils_log_format with a format specifier (always either %s or %d). Well, on one line, anyway. So... that might not be too many to just go through them and look for suspicious ones:

[adamw@xps13a libblockdev (master)]$ grep -R bd_utils_log_format | grep %
scripts/boilerplate_generator.py: ret += ' bd_utils_log_format (BD_UTILS_LOG_WARNING, "failed to load module {0}: %s", dlerror());\n'.format(module_name)
scripts/boilerplate_generator.py: ret += ' bd_utils_log_format (BD_UTILS_LOG_DEBUG, "failed to load the init() function for {0}: %s", error);\n'.format(module_name)
scripts/boilerplate_generator.py: ret += ' bd_utils_log_format (BD_UTILS_LOG_WARNING, "failed to load {0.name}: %s", error);\n\n'.format(info)
scripts/boilerplate_generator.py: ret += ' bd_utils_log_format (BD_UTILS_LOG_DEBUG, "failed to load the close_plugin() function for {0}: %s", error);\n'.format(module_name)
src/lib/blockdev.c.in: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Cannot process the config file '%s': %s. Skipping.",
src/lib/blockdev.c.in: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Failed to load config files: %s. Using the built-in config",
src/lib/blockdev.c.in: bd_utils_log_format (BD_UTILS_LOG_INFO, "Failed to load config files: %s. Using the built-in config",
src/plugins/btrfs.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "%s", error->msg);
src/plugins/btrfs.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "%s", error->msg);
src/plugins/btrfs.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "%s", error->msg);
src/plugins/crypto.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Unknown cryptsetup log level %d.", level);
src/plugins/crypto.c: bd_utils_log_format (BD_UTILS_LOG_ERR, "Failed to deactivate temporary device %s", tmp_name);
src/plugins/crypto.c: bd_utils_log_format (BD_UTILS_LOG_ERR, "Failed to deactivate temporary device %s", tmp_name);
src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to get system bus address: %s\n", (*error)->message);
src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);
src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to setup DBus connection: %s", error->message);
src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to flush DBus connection: %s", error->message);
src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to close DBus connection: %s", error->message);
src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Got error when getting progress: %s", l_error->message);
src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Failed to get VG properties for PV %s: %s",
src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO operating mode: %s", value);
src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO compression state: %s", value);
src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO index state: %s", value);
src/plugins/lvm-dbus.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO write policy: %s", value);
src/plugins/lvm.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO operating mode: %s", value);
src/plugins/lvm.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO compression state: %s", value);
src/plugins/lvm.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO index state: %s", value);
src/plugins/lvm.c: bd_utils_log_format (BD_UTILS_LOG_DEBUG, "Unknown VDO write policy: %s", value);
src/plugins/nvme/nvme-info.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Unhandled self-test log entry result code: %d", dsts);
src/plugins/nvme/nvme-info.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Unhandled self-test log entry action code: %d", code);
src/plugins/part.c: bd_utils_log_format (BD_UTILS_LOG_INFO, "Not resizing, partition '%s' is already at its maximum size.", part);
src/plugins/s390.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Device %s status is %s, needs dasdfmt.", dasd, status);
src/plugins/s390.c: bd_utils_log_format (BD_UTILS_LOG_WARNING, "Device %s is already online", devno);
another dumb idea: do a custom libblockdev build that just adds a silly "XXX I AM LINE 123 OF FILE.C ABOUT TO LOG SOMETHING OH BOY!" printf before each of those, then run some scratch koji tasks with that build of libblockdev, and hope that tells us which bd_utils_log_format call is the problematic one...
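Concretely, the instrumentation could be as dumb as a wrapper macro like this (hypothetical sketch; it assumes the bd_utils_log_format declaration from the libblockdev utils headers is in scope):

#include <stdio.h>

/* hypothetical diagnostic wrapper: print a breadcrumb to stderr (which
   ends up in the compose logs) before the real call, so the last
   breadcrumb without a matching log line points at the call site that
   never returned */
#define TRACED_LOG_FORMAT(level, ...)                                      \
    do {                                                                   \
        fprintf (stderr, "XXX %s:%d about to log\n", __FILE__, __LINE__);  \
        bd_utils_log_format ((level), __VA_ARGS__);                        \
    } while (0)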
Hmm, isn't the compiler supposed to warn on invalid types supplied to printf-like funcs? Perhaps the

> void bd_utils_log_format (gint level, const gchar *format, ...) {

definition is too vague and should be transformed to something like

> void bd_utils_log_format (gint level, const gchar *format) G_GNUC_FORMAT (2) {

(which expands to '__attribute__ ((format_arg (2)))') to get proper annotations. However, that requires a string return value, which is not the case here.

Anyway, looking at lvm-dbus.c, the worst case of a bad value being passed in is NULL, which for "%s" should be gracefully translated to "(null)". Except... https://github.com/storaged-project/libblockdev/pull/990
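FWIW, for a varargs logger the matching GLib annotation would be G_GNUC_PRINTF rather than G_GNUC_FORMAT: it expands to __attribute__ ((format (printf, 2, 3))), does not need a string return value, and is exactly what makes gcc/clang check the arguments against the format string. A sketch of the annotated declaration (assuming the current one lacks it):

#include <glib.h>

/* sketch: with this annotation the compiler warns at every call site
   where the arguments don't match the format string, e.g. "%s" with a
   missing or non-string argument */
void bd_utils_log_format (gint level, const gchar *format, ...) G_GNUC_PRINTF (2, 3);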
(In reply to Tomáš Bžatek from comment #22) > Hmm, isn't the compiler supposed to warn on invalid types supplied to > printf-like funcs? Perhaps the it should and I haven't seen any such warnings > > void bd_utils_log_format (gint level, const gchar *format, ...) { > definition is too vague and should be transformed to something like > > void bd_utils_log_format (gint level, const gchar *format) G_GNUC_FORMAT (2) { > (which expands to '__attribute__ ((format_arg (2)))') to get proper > annotations. However that requires a string return value which is not the > case here. > > Anyway, looking at lvm-dbus.c, the worst case of a bad value is passed in is > NULL which for "%s" should be gracefully translated to "(null)". > > Except... https://github.com/storaged-project/libblockdev/pull/990
This is currently happening on almost every build; it's been over 10 days since it last *didn't* happen.
*** Bug 2265238 has been marked as a duplicate of this bug. ***
We signed off Beta today, so there's no point in Beta FE status any more. Clearing that nomination.
The fixes from comment #22 were in 3.1.0, which is stable for F39, F40 and Rawhide. So I guess those did not fix this.
So it's a bit of a shot in the dark, but some research into when this seems to have started combined with the history of libblockdev made me want to try this: https://src.fedoraproject.org/rpms/libblockdev/c/ffea9acbba2af77ed6b7eaa500b6388addf05eea?branch=rawhide . I've sent that build to Rawhide. I figure we can give it a day or two and see if the images clear up, or not. If they do, we know it's one of those few that causes the problem.
Unfortunately, that didn't work. We're stratergerizing new dumb ways to try and pin this down now. :D
Is there a way to reproduce the crash outside koji? Then we could experiment with instrumented libblockdev builds and similar ...
That's one of the things I'm intending to look into. Though given the bug is a SIGILL, it's *possible* it's specific to the type of hardware these jobs run on. I'm going to fiddle around on a Jetson Nano I have here and see if I can reproduce it on that. Even if we can't, we're thinking it might be possible to write a little test script that runs enough of anaconda on the koji builders to hit the bug, or maybe use scratch builds.
So I spent quite a lot of time on this today, but with inconclusive results :/

It's not that hard to reproduce what pungi/koji do - ultimately, run livemedia-creator in a mock chroot. So I've been doing that all day: tried it on my local aarch64 box first, then an openQA worker host I control, then finally on one of the actual Koji builders that hits the bug. I managed to reproduce the failure twice on the Koji builder, but that was twice out of like 50 attempts. Didn't reproduce it at all on the other two hosts.

It seems like it happens much less often when I'm trying to reproduce it manually than it does when we run the real composes, for some reason. That low level of reproducibility makes it hard to test anything with confidence :|
One thing I did notice: when this bug happens, it seems to happen right after the "anaconda 41.2-2.fc41 for Fedora-i3-Live Rawhide (pre-release) started." message is printed. The thing anaconda does right after that is... a bunch of fiddling with signal handling. The 'print_startup_note' here prints that message, then see what happens next:

    startup_utils.print_startup_note(options=opts)

    from pyanaconda.anaconda import Anaconda
    anaconda = Anaconda()

    # reset python's default SIGINT handler
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    signal.signal(signal.SIGTERM, lambda num, frame: sys.exit(1))

    # synchronously-delivered signals such as SIGSEGV and SIGILL cannot be handled properly from
    # Python, so install signal handlers from the faulthandler stdlib module.
    import faulthandler
    faulthandler.enable()

not sure if it's really relevant, but it seems like an interesting coincidence...
Just to save anyone duplicating work while we confirm this - my last dumb diagnosis patch may have hit paydirt. I added a pre-log log before every bd_utils_log_format in lvm-dbus.c, and comparing the counts of pre-log messages and actual log messages in the resulting jobs, we found a discrepancy:

[adamw@xps13a tmp]$ grep "FAILED TO CREATE" anaconda-dbus.log | wc -l
11
[adamw@xps13a tmp]$ grep "Failed to create" anaconda-dbus.log | wc -l
9

In one case we got a very nice smoking gun in the livemedia-out.log:

2024-03-23 09:04:49,305 INFO pylorax: Running anaconda.
2024-03-23 09:04:49,813 INFO pylorax:
2024-03-23 09:04:49,814 INFO pylorax: ** (process:1): CRITICAL **: 09:04:49.668: XXX ABOUT TO LOG 'FAILED TO CREATE A NEW CONNECTION'
2024-03-23 09:04:50,815 ERROR pylorax: Running anaconda failed: process '['unshare', '--pid', '--kill-child', '--mount', '--propagation', 'unchanged', 'anaconda', '--kickstart', '/chroot_tmpdir/koji-image-f41-build-115326908.ks', '--cmdline', '--dirinstall', '--remotelog', '127.0.0.1:39053']' was killed by signal 11

Note that there we see my 'pre-log' message but *not* the 'real' log message that should have come immediately after it, and then anaconda dies. So, we think the offending log line is likely this one:

bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", (*error)->message);

I suspect the issue is that the 'error' there is not always populated. There are several cases of this pattern in the file: we initialize an error variable, pass it to some function or other, and expect that if that function call fails, the error will be populated when it gets back to us. So here we're doing pretty much this, condensed and simplified (in fact the error creation is in one function and the rest is in another which is passed 'error'):

GError *error = NULL;
bus = g_dbus_connection_new_for_address_sync (addr,
                                              G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT|G_DBUS_CONNECTION_FLAGS_MESSAGE_BUS_CONNECTION,
                                              NULL, NULL, &error);
if (!bus || g_dbus_connection_is_closed (bus)) {
    bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", error->message);
}

So what I suspect is happening is that it's not safe to assume 'error' will always have been populated by the g_dbus_connection_new_for_address_sync call (perhaps especially if we *have* a 'bus', but the connection to it is closed for some reason). The fix is probably something like this:

if (error) {
    bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", error->message);
} else {
    bd_utils_log (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: unknown reason");
}

but for now we're testing a build which takes the log message out entirely, to see if that 'fixes' it. That build is going through tests currently; once it's stable we'll run a new compose and see what happens.
`bd_lvm_init`, the function that calls `setup_dbus_connection` where the above logic is, similarly tries to log using the same error variable, so that call also needs guarding. I've sent a build with that changed also and will see how the next nightly compose fares.
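For reference, the guarded version of that caller might look something like this (a rough sketch of the idea, not the exact patch I built):

if (G_UNLIKELY (!bus) && !setup_dbus_connection (&error)) {
    /* sketch: only dereference the error if the callee actually set it */
    if (error)
        bd_utils_log_format (BD_UTILS_LOG_CRIT,
                             "Failed to setup DBus connection: %s", error->message);
    else
        bd_utils_log (BD_UTILS_LOG_CRIT, "Failed to setup DBus connection (no error set)");
    return FALSE;
}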
(In reply to Adam Williamson from comment #34)
> GError *error = NULL;
> bus = g_dbus_connection_new_for_address_sync (addr,
>                                               G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT|G_DBUS_CONNECTION_FLAGS_MESSAGE_BUS_CONNECTION,
>                                               NULL, NULL, &error);
> if (!bus || g_dbus_connection_is_closed (bus)) {
>     bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", error->message);
> }

This is definitely suspect; on the one hand, it's checking that `bus` is NULL, and on the other it's checking if it's closed. If `bus` is NULL, then the `g_dbus_connection_new_for_address_sync` call must have failed and set `error`, but if it's not NULL, then there's no reason for `error` to be set.

It might be easier to handle both this function and its caller(s?) by setting an error if necessary:

bus = g_dbus_connection_new_for_address_sync (addr,
                                              G_DBUS_CONNECTION_FLAGS_AUTHENTICATION_CLIENT|G_DBUS_CONNECTION_FLAGS_MESSAGE_BUS_CONNECTION,
                                              NULL, NULL, &error);
if (!bus || g_dbus_connection_is_closed (bus)) {
    if (error == NULL) {
        g_set_error_literal (&error, G_IO_ERROR, G_IO_ERROR_CLOSED,
                             "connection was closed immediately");
    }
    bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to create a new connection for the system bus: %s\n", error->message);
    return;
}
"This is definitely suspect; on the one hand, it's checking that `bus` is NULL, and on the other it's checking if it's closed. If `bus` is NULL, then the `g_dbus_connection_new_for_address_sync` call must have failed and set `error`, but if it's not NULL, then there's no reason for `error` to be set." Yep, that's pretty much what I thought. For now I just want to check the theory, I didn't want to bother too merge about writing an upstreamable patch immediately, so I've converted both log messages to `bd_utils_log` calls with no templating. We'll see if that helps in the next compose.
Unfortunately that didn't fix it :( And it was such a good-looking theory, too. I am at a bit of a loss now. I do wonder if we can get some *more* coredumps and compare them; maybe that would shed some light?
I suppose one thing that would help with trying to figure out what's going on is if each anaconda dbus module could prefix all its log messages in anaconda-dbus.log with its name. AIUI, we have 10 different modules starting up there, and one or more of them is crashing on this illegal instruction. But all the log messages from all of them are just thrown in that log together, which makes it difficult or impossible to distinguish them. If the messages were all prefixed we might be able to at least figure out how far each one that crashes gets before it crashes...
Thanks, Adam, for your debugging efforts. I can only hope I haven't made a mistake in the initial analysis.
So far I haven't been successful in reproducing the issue (on an Ampere Mt. Snow machine). My steps to reproduce it are:

- grab the mock config from koji (regular rawhide might work too)
- init the chroot with "mock -r ./f41-build.cfg init"
- mock -r ./f41-build.cfg install bash coreutils glibc-all-langpacks lorax-lmc-novirt selinux-policy-targeted shadow-utils util-linux
- go into the mock shell
- grab the kickstart and run e.g. "/sbin/livemedia-creator --ks ./koji-image-f41-build-115361236.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-Workstation-Live --make-iso --volid Fedora-WS-Live-rawh-20240324.n.0 --iso-only --iso-name Fedora-Workstation-Live-aarch64-Rawhide-20240324.n.0.iso --releasever Rawhide --nomacboot" (taken from mock_output.log)
- repeat the previous step

FWIW, the last i3 Live images fail on some dependency error, I think.
yeah, it was very difficult to reproduce for me too. I only managed it twice, trying to do it on the actual builder hardware (never reproduced it anywhere else), retrying over and over and over in a mock chroot. There's clearly something about the environment of an actual compose that makes it happen way more often. Kevin and I suspect it might be related to having multiple tasks, including multiple live image builds, running on the same host at once. That's a bit awkward to try and reproduce manually, of course. It definitely happens very frequently in real composes; we rarely get a compose without at least one image affected by the bug.
So, hum. Looking at the two failures on this in today's Rawhide compose - https://koji.fedoraproject.org/koji/taskinfo?taskID=115400251 and https://koji.fedoraproject.org/koji/taskinfo?taskID=115400246 - the anaconda-dbus.log for each logs *ten* instances of "XXX Failed to create a new connection for the system bus" and of "XXX Failed to setup DBus connection" (which are the two messages we should see when hitting the path I modified). That's one short. When we don't see this problem, we log *eleven* instances of each of those messages. I think that's one pair for the main process and one pair each for the ten modules that get started.

The interesting thing is, if I'm following it right, the codepath we actually follow through lvm-dbus.c is very short. `bd_lvm_init` is, as its comment says, "called automatically by the library's initialization functions." It calls `setup_dbus_connection`, which I believe, in the weird environment we are running in, is always expected to fail. It should then log "XXX Failed to setup DBus connection" and return FALSE. That's the whole codepath.

That makes it look like the only other place it could really be failing is the other templated log call in setup_dbus_connection:

bd_utils_log_format (BD_UTILS_LOG_CRIT, "Failed to get system bus address: %s\n", (*error)->message);

but... the fly in the ointment there is that, when I did the build with "pre-log" lines, we never hit the "pre-log" message for that log message, on any of the logs I checked. We always hit the "Failed to create a new connection for the system bus" one, not that earlier one. I'm really baffled as to how or where this is still blowing up. It's pretty weird.
So, I poked at a recent coredump with gdb; I tried to follow what Danny did. The stack trace for this crash is:

Stack trace of thread 38:
#0  0x0000ffffb8f91fe0 n/a (ld-linux-aarch64.so.1 + 0xbfe0)
#1  0x0000ffffa85e443c n/a (/usr/lib64/libbd_lvm-dbus.so.3.0.0 + 0x1443c)
#2  0x0000ffffa8971648 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x11648)
#3  0x0000ffffa89809ec n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x209ec)
#4  0x0000ffffa89884a0 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x284a0)
#5  0x0000ffffa8985d70 n/a (/usr/lib64/libblockdev.so.3.0.0 + 0x25d70)
#6  0x0000ffffa925c054 n/a (/usr/lib64/libffi.so.8.1.4 + 0x1c054)
#7  0x0000ffffa9256404 n/a (/usr/lib64/libffi.so.8.1.4 + 0x16404)
#8  0x0000ffffa95512d4 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x312d4)
#9  0x0000ffffa95508b8 n/a (/usr/lib64/python3.12/site-packages/gi/_gi.cpython-312-aarch64-linux-gnu.so + 0x308b8)
#10 0x0000ffffb897b1f8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x17b1f8)
#11 0x0000ffffb8987340 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187340)
#12 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
#13 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
#14 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
#15 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
#16 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
#17 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
#18 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
#19 0x0000ffffb89e6be0 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1e6be0)
#20 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
#21 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
#22 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
#23 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
#24 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
#25 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154)
#26 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
#27 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
#28 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
#29 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
#30 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
#31 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
#32 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
#33 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154)
#34 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
#35 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
#36 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
#37 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
#38 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
#39 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
#40 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
#41 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154)
#42 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
#43 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
#44 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
#45 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
#46 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
#47 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
#48 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
#49 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154)
#50 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
#51 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
#52 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
#53 0x0000ffffb898afd4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18afd4)
#54 0x0000ffffb89abaf8 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1abaf8)
#55 0x0000ffffb89d8298 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d8298)
#56 0x0000ffffb89d7838 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1d7838)
#57 0x0000ffffb898c154 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x18c154)
#58 0x0000ffffb8a40f98 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x240f98)
#59 0x0000ffffb8a67f60 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x267f60)
#60 0x0000ffffb89a24ac n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a24ac)
#61 0x0000ffffb89a1f38 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x1a1f38)
#62 0x0000ffffb8987340 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x187340)
#63 0x0000ffffb8a816e4 n/a (/usr/lib64/libpython3.12.so.1.0 + 0x2816e4)

So in gdb I did this (note gdb is having issues with the trace, but never mind that I guess?):

(gdb) bt
#0  0x0000ffffb8f91fe0 in ?? ()
#1  0x0000ffffa85e4360 in ?? () at /usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c:252 from /lib64/libbd_lvm-dbus.so.3
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) disass 0x0000ffffa85e443c

and this is what I got:

Dump of assembler code for function bd_lvm_init:
   0x0000ffffa85e4360 <+0>:     paciasp
   0x0000ffffa85e4364 <+4>:     sub     sp, sp, #0x40
   0x0000ffffa85e4368 <+8>:     adrp    x0, 0xffffa860f000
   0x0000ffffa85e436c <+12>:    ldr     x0, [x0, #4056]
   0x0000ffffa85e4370 <+16>:    stp     x29, x30, [sp, #16]
   0x0000ffffa85e4374 <+20>:    add     x29, sp, #0x10
   0x0000ffffa85e4378 <+24>:    stp     x19, x20, [sp, #32]
   0x0000ffffa85e437c <+28>:    adrp    x19, 0xffffa8610000
   0x0000ffffa85e4380 <+32>:    ldr     x1, [x19, #64]
   0x0000ffffa85e4384 <+36>:    ldr     x2, [x0]
   0x0000ffffa85e4388 <+40>:    str     x2, [sp, #8]
   0x0000ffffa85e438c <+44>:    mov     x2, #0x0     // #0
   0x0000ffffa85e4390 <+48>:    str     xzr, [sp]
   0x0000ffffa85e4394 <+52>:    cbz     x1, 0xffffa85e43e0 <bd_lvm_init+128>
   0x0000ffffa85e4398 <+56>:    adrp    x0, 0xffffa85e0000 <_init>
   0x0000ffffa85e439c <+60>:    add     x0, x0, #0xac0
   0x0000ffffa85e43a0 <+64>:    bl      0xffffa85e0840 <dm_log_with_errno_init@plt>
   0x0000ffffa85e43a4 <+68>:    mov     w0, #0x6     // #6
   0x0000ffffa85e43a8 <+72>:    bl      0xffffa85e0060 <dm_log_init_verbose@plt>
   0x0000ffffa85e43ac <+76>:    mov     w0, #0x1     // #1
   0x0000ffffa85e43b0 <+80>:    adrp    x1, 0xffffa860f000
   0x0000ffffa85e43b4 <+84>:    ldr     x1, [x1, #4056]
   0x0000ffffa85e43b8 <+88>:    ldr     x3, [sp, #8]
   0x0000ffffa85e43bc <+92>:    ldr     x2, [x1]
   0x0000ffffa85e43c0 <+96>:    subs    x3, x3, x2
   0x0000ffffa85e43c4 <+100>:   mov     x2, #0x0     // #0
   0x0000ffffa85e43c8 <+104>:   b.ne    0xffffa85e4488 <bd_lvm_init+296>  // b.any
   0x0000ffffa85e43cc <+108>:   ldp     x29, x30, [sp, #16]
   0x0000ffffa85e43d0 <+112>:   ldp     x19, x20, [sp, #32]
   0x0000ffffa85e43d4 <+116>:   add     sp, sp, #0x40
   0x0000ffffa85e43d8 <+120>:   autiasp
   0x0000ffffa85e43dc <+124>:   ret
   0x0000ffffa85e43e0 <+128>:   mov     x2, sp
   0x0000ffffa85e43e4 <+132>:   mov     w0, #0x1     // #1
   0x0000ffffa85e43e8 <+136>:   str     x21, [sp, #48]
   0x0000ffffa85e43ec <+140>:   bl      0xffffa85e06b0 <g_dbus_address_get_for_bus_sync@plt>
   0x0000ffffa85e43f0 <+144>:   mov     x20, x0
   0x0000ffffa85e43f4 <+148>:   cbz     x0, 0xffffa85e446c <bd_lvm_init+268>
   0x0000ffffa85e43f8 <+152>:   mov     w1, #0x9     // #9
   0x0000ffffa85e43fc <+156>:   mov     x4, sp
   0x0000ffffa85e4400 <+160>:   mov     x3, #0x0     // #0
   0x0000ffffa85e4404 <+164>:   mov     x2, #0x0     // #0
   0x0000ffffa85e4408 <+168>:   bl      0xffffa85e0490 <g_dbus_connection_new_for_address_sync@plt>
   0x0000ffffa85e440c <+172>:   mov     x1, x0
   0x0000ffffa85e4410 <+176>:   mov     x0, x20
   0x0000ffffa85e4414 <+180>:   str     x1, [x19, #64]
   0x0000ffffa85e4418 <+184>:   bl      0xffffa85e0080 <g_free@plt>
   0x0000ffffa85e441c <+188>:   ldr     x0, [x19, #64]
   0x0000ffffa85e4420 <+192>:   cbz     x0, 0xffffa85e442c <bd_lvm_init+204>
   0x0000ffffa85e4424 <+196>:   bl      0xffffa85e0900 <g_dbus_connection_is_closed@plt>
   0x0000ffffa85e4428 <+200>:   cbz     w0, 0xffffa85e4458 <bd_lvm_init+248>
   0x0000ffffa85e442c <+204>:   adrp    x1, 0xffffa85f0000
   0x0000ffffa85e4430 <+208>:   mov     w0, #0x2     // #2
   0x0000ffffa85e4434 <+212>:   add     x1, x1, #0xa88
   0x0000ffffa85e4438 <+216>:   bl      0xffffa85e04a0 <bd_utils_log@plt>
   0x0000ffffa85e443c <+220>:   mov     w0, #0x2     // #2
   0x0000ffffa85e4440 <+224>:   adrp    x1, 0xffffa85f0000
   0x0000ffffa85e4444 <+228>:   add     x1, x1, #0xac8
   0x0000ffffa85e4448 <+232>:   bl      0xffffa85e04a0 <bd_utils_log@plt>
   0x0000ffffa85e444c <+236>:   ldr     x21, [sp, #48]
   0x0000ffffa85e4450 <+240>:   mov     w0, #0x0     // #0
   0x0000ffffa85e4454 <+244>:   b       0xffffa85e43b0 <bd_lvm_init+80>
   0x0000ffffa85e4458 <+248>:   ldr     x0, [x19, #64]
   0x0000ffffa85e445c <+252>:   mov     w1, #0x0     // #0
   0x0000ffffa85e4460 <+256>:   bl      0xffffa85e0990 <g_dbus_connection_set_exit_on_close@plt>
   0x0000ffffa85e4464 <+260>:   ldr     x21, [sp, #48]
   0x0000ffffa85e4468 <+264>:   b       0xffffa85e4398 <bd_lvm_init+56>
   0x0000ffffa85e446c <+268>:   ldr     x2, [sp]
   0x0000ffffa85e4470 <+272>:   adrp    x1, 0xffffa85f0000
   0x0000ffffa85e4474 <+276>:   mov     w0, #0x2     // #2
   0x0000ffffa85e4478 <+280>:   add     x1, x1, #0xa60
   0x0000ffffa85e447c <+284>:   ldr     x2, [x2, #8]
   0x0000ffffa85e4480 <+288>:   bl      0xffffa85e0260 <bd_utils_log_format@plt>
   0x0000ffffa85e4484 <+292>:   b       0xffffa85e443c <bd_lvm_init+220>
   0x0000ffffa85e4488 <+296>:   str     x21, [sp, #48]
   0x0000ffffa85e448c <+300>:   bl      0xffffa85e0510 <__stack_chk_fail@plt>
End of assembler dump.

So... the top part there looks different from before, but the bottom part looks much the same... we've still got bd_lvm_init, g_dbus_connection_set_exit_on_close, and bd_utils_log_format... I'm really not sure where we'd still be getting a bd_utils_log_format, though, aside from the other one in setup_dbus_connection that it didn't look like we were hitting... I guess we can *try* modifying that one too, but it doesn't feel right...
Oh, or does the fact that it now says "Dump of assembler code for function bd_lvm_init:" mean we're crashing in bd_lvm_init rather than in bd_utils_log_format now?
Created attachment 2023553 [details]
updated coredump with libblockdev-3.1.0-6.fc41

Here is the updated coredump, in case it tells anyone else anything more than it told me.
What hw platform are the compose builders? I could try reserving the same in beaker for further debugging ...
(In reply to Adam Williamson from comment #45)
> Oh, or does the fact that it now says "Dump of assembler code for function
> bd_lvm_init:" mean we're crashing in bd_lvm_init rather than in
> bd_utils_log_format now?

Indeed we are (from your coredump):

(gdb) list *0x0000ffffa85e443c
0xffffa85e443c is in bd_lvm_init (/usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c:339).
Downloading source file /usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c
334         GError *error = NULL;
335
336         /* the check() call should create the DBus connection for us, but let's not
337            completely rely on it */
338         if (G_UNLIKELY (!bus) && !setup_dbus_connection (&error)) {
339             bd_utils_log (BD_UTILS_LOG_CRIT, "XXX Failed to setup DBus connection");
340             return FALSE;
341         }
342
343         dm_log_with_errno_init ((dm_log_with_errno_fn) redirect_dm_log);

However, I think `setup_dbus_connection` was inlined, as if you do the previous line from the disassembly, you get the inner debug line:

(gdb) list *0x0000ffffa85e4438
0xffffa85e4438 is in bd_lvm_init (/usr/src/debug/libblockdev-3.1.0-6.fc41.aarch64/src/plugins/lvm-dbus.c:272).
267                                                    NULL, NULL, error);
268
269         g_free (addr);
270
271         if (!bus || g_dbus_connection_is_closed (bus)) {
272             bd_utils_log (BD_UTILS_LOG_CRIT, "XXX Failed to create a new connection for the system bus");
273             return FALSE;
274         }
275
276         g_dbus_connection_set_exit_on_close (bus, FALSE);
I am thinking about building libblockdev with -O0 to avoid "over-optimization" and similar issues, but the library doesn't want to give up :-) But

diff --git a/libblockdev.spec b/libblockdev.spec
index c5c4b71..dcfca81 100644
--- a/libblockdev.spec
+++ b/libblockdev.spec
@@ -576,7 +576,13 @@ A meta-package that pulls all the libblockdev plugins as dependencies.
 %autosetup -n %{name}-%{version} -p1
 
 %build
+find . -name Makefile.am | xargs sed -i -e 's/-Werror//g'
+
 autoreconf -ivf
+
+%global _fortify_level 0
+%global optflags $(echo %optflags | sed -e 's/-O2/-O0/g') -Wp,-D_FORTIFY_SOURCE=0
+
 %configure %{?configure_opts}
 %{__make} %{?_smp_mflags}

seems to do it.
Big thanks for investigating this! I've created this PR for upstream based on your findings: https://github.com/storaged-project/libblockdev/pull/1012
(In reply to Dan Horák from comment #47)
> What hw platform are the compose builders? I could try reserving the same in
> beaker for further debugging ...

The builders in the aarch64 livemedia channel are mostly eMAGs, plus two Cavium ThunderX2s. I have some Mt. Snow machines I want to add, but haven't yet. This issue appears on both the eMAGs and the ThunderX2s.
So...apologies for dumb questions, but what does the info from Elliott's comment tell us exactly? Are the line numbers (339 and 272) significant - are we really somehow still failing on the log lines, even though there's no templating any more? Or is it more complicated than that? Sorry, I am used to Python traces, where at least I pretty much always know exactly where it's breaking. :P Other than that, I guess I'll do a build with Dan's patch and we can see how the coredumps look after the next compose?
OK, I'm running a build now that both replaces my dumb XXX patch with Vojtech's upstream PR, and implements Dan's de-optimization changes: https://koji.fedoraproject.org/koji/taskinfo?taskID=115471730 (libblockdev-3.1.1-2.fc41). We'll see how things go with the next compose.
Another hint: try clang?
(In reply to Tomáš Bžatek from comment #54)
> Another hint: try clang?

yup, and also an instrumented build with ubsan and/or asan
Ooo. It looks like the latest build might have fixed it (or we got really lucky). Workstation and KDE both built OK this time. lxqt and i3 failed, but they did not fail on the SIGILL - they failed much later, due to lack of disk space. I'm not sure if Vojtech's better fix for the logging issues did the trick, or if somehow the bug goes away with the de-optimization changes, or we just got lucky. I'm going to do another build *with* Vojtech's patch but *without* the de-optimization changes and we can watch the next compose. I also sent a couple of PRs to bump the space for i3 and lxqt, so those two should pass if they don't hit the SIGILL (I hope).
I suspect it passed with -2 due to the de-optimization, as it fails again with -3, where the de-op has been removed ...
Ah, foo, indeed. So, how do you want to proceed? Try with -O1?
good question. I would keep Vojtech's patch (looks right to me) and add the gcc sanitizer options, starting with "undefined behaviour", then trying "address", and just for aarch64, to be on the safe side with regard to the other arches:

--- a/libblockdev.spec
+++ b/libblockdev.spec
@@ -576,7 +576,14 @@ A meta-package that pulls all the libblockdev plugins as dependencies.
 %autosetup -n %{name}-%{version} -p1
 
 %build
 autoreconf -ivf
+
+%ifarch aarch64
+%global optflags %optflags -fsanitize=undefined
+%endif
+
 %configure %{?configure_opts}
 %{__make} %{?_smp_mflags}

worst case we might be hitting some CPU design issue (cache not invalidated or such) ... and thus it might be useful to know if moving the jobs to the much newer Mt. Snows would resolve it. But let's start with the sanitizers and maybe better-targeted de-optimizing.
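For context on what -fsanitize=undefined buys us: a UBSan-instrumented build prints a runtime diagnostic at the point of undefined behaviour instead of letting the compiler assume it away. A minimal sketch (hypothetical example code, not libblockdev):

#include <limits.h>

int main (void)
{
    volatile int x = INT_MAX;
    /* signed overflow is undefined behaviour in C; built with
       gcc -fsanitize=undefined, this line reports something like
       "runtime error: signed integer overflow: ..." at runtime
       (exact wording varies by GCC version) */
    x = x + 1;
    return 0;
}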
OK, can do. Is the goal here to find a minimal level of de-optimization that makes the bug go away, or to debug it further, or both? Anyway, will apply that change and run another build.
The de-optimization can be done at the source-file or even the function level (and we have some suspects), so it would serve both, I think. But the goal should be to get to the root cause; this "remote" debugging doesn't make it easy :-)
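For the function-level variant, GCC's optimize attribute (or the matching pragma) can drop just the suspect functions to -O0 while the rest of the file stays at -O2. A minimal sketch; suspect_function is a hypothetical stand-in, not a real libblockdev symbol:

/* compile the file normally with -O2; only this function is de-optimized */
__attribute__ ((optimize ("O0")))
static int suspect_function (int x)
{
    return x + 1;
}

/* the same can be done for a whole region of a file:
 *   #pragma GCC push_options
 *   #pragma GCC optimize ("O0")
 *   ...functions...
 *   #pragma GCC pop_options
 */

int main (void)
{
    return suspect_function (41) == 42 ? 0 : 1;
}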
yeah, unfortunately I can't find an easy way to tighten the loop :/ it's not really the remoteness so much as the cycle; it takes about five hours for us to get from start of a compose to success/failure of the aarch64 lives, I think.
Created attachment 2024016 [details] another updated coredump with libblockdev 3.1.1-4 (with -fsanitize=undefined)
Dan: ping? I know we all got distracted by xz :D but did that last coredump help any?
(In reply to Adam Williamson from comment #64)
> Dan: ping? I know we all got distracted by xz :D but did that last coredump
> help any?

not really :-( I was hoping for some output from the "undefined behaviour" sanitizer. My next step is the address sanitizer, in libblockdev-3.1.1-5.fc41.
So... that didn't work at all. ;)

https://koji.fedoraproject.org/koji/taskinfo?taskID=115944106

and the screenshot has a dead anaconda with:

"==2499==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD"

I have untagged it for now so we can get a compose...
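For reference, that ASan error shows up when an instrumented shared library is loaded into an uninstrumented executable (here python3): the ASan runtime has to be the first thing in the process. A sketch of the constraint, with illustrative commands in the comments (not our exact build):

/* lib.c, built as an instrumented shared library:
 *   gcc -fsanitize=address -shared -fPIC lib.c -o libdemo.so
 *
 * A host binary that is NOT built with -fsanitize=address but loads
 * libdemo.so aborts at startup with exactly the "ASan runtime does not
 * come first in initial library list" error, unless the runtime is
 * preloaded, e.g.:
 *   LD_PRELOAD=$(gcc -print-file-name=libasan.so) ./host
 * or unless the host itself is also linked with -fsanitize=address.
 */
int demo_add (int a, int b)
{
    return a + b;
}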
Hmm, that's unfortunate :-( I will revert the spec back to "normal" build before any next steps.
I'm thinking that for the next Fedora 40 Final RC, at least, we might want to put in a 'detuned' (per comment #49) libblockdev build so we get all the images. It's not the correct fix, of course, but it *does* avoid the bug in practice. Doing this would mean we'd have to ship the detuned libblockdev build as part of the release, though, and of course it would be in *all* images in the compose, not just used to build the affected aarch64 live images. How bad is the performance impact of losing those optimizations? Do you think it's bad enough that this would be a bad plan, Dan?
proposing as FinalFE for that purpose.
+3 in https://pagure.io/fedora-qa/blocker-review/issue/1588 , marking accepted FE. this means we *can* include a detuned build in the next compose, but we'll still wait for Dan's opinion on whether we *should*.
I believe libblockdev isn't performance-critical, so it makes sense to do a de-optimized (aarch64-only) build. Let me prepare a rawhide build ASAP. Adam, feel free to issue an F-40 build too; it seems the branch can't be fast-forwarded, but a cherry-pick into the f40 branch should work. Unfortunately I am mostly out of ideas about the next steps to get to the root cause :-(
oh, yeah, I forgot we can apply the deoptimization only on aarch64, which limits the impact. cool. Yeah, go ahead and do a Rawhide build and I'll take care of F40.
rawhide build is https://koji.fedoraproject.org/koji/taskinfo?taskID=116239319 And for the next step I will try to de-optimize just the "utils" internal library ...
So, 3 failed aarch64 lives with that build. ;(

https://koji.fedoraproject.org/koji/taskinfo?taskID=116257542
https://koji.fedoraproject.org/koji/taskinfo?taskID=116257533
https://koji.fedoraproject.org/koji/taskinfo?taskID=116257552

Adding Florian here in case he has ideas or knows anyone in tools space who might be able to see the problem more clearly than we can...
Most of the disassembly here is for pthread_kill, which is explicitly called for signal 4, which is SIGILL. This must come from the (Anaconda?) crash handler. These in-process crash handlers are best avoided because they may obfuscate what's going on. A coredump without a crash handler (such as captured by systemd-coredump) may have more information. The real crash is in the frame that comes in a GDB backtrace right after the line with <signal handler called>. The fault address 0x0000ffffa85e443c seems bogus because that instruction cannot fault. So unless something rewrites the file in place, I don't see how this can crash.
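As a rough illustration of the handler-on-the-stack effect (a standalone sketch, not Anaconda's actual handler): whatever an in-process handler does, including re-raising, adds its own frames above the faulting one, so the core's backtrace leads with the handler and raise/pthread_kill, and the real crash site only appears below <signal handler called>.

#include <signal.h>
#include <stdio.h>

static void handler (int sig)
{
    /* frames created here (fprintf, raise, ...) sit on top of the
       faulting frame in the core file; note fprintf is not
       async-signal-safe, this is illustration only */
    fprintf (stderr, "caught signal %d\n", sig);
    signal (sig, SIG_DFL);
    raise (sig);   /* re-raise so the default action still dumps core */
}

int main (void)
{
    signal (SIGILL, handler);
    raise (SIGILL);   /* stand-in for a genuine faulting instruction */
    return 0;
}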
(In reply to Kevin Fenzi from comment #74)
> So, 3 failed aarch64 lives with that build. ;(
>
> https://koji.fedoraproject.org/koji/taskinfo?taskID=116257542
> https://koji.fedoraproject.org/koji/taskinfo?taskID=116257533
> https://koji.fedoraproject.org/koji/taskinfo?taskID=116257552
>
> Adding Florian here in case he has ideas or knows anyone in tools space who
> might be able to see the problem more clearly than we can...

FWIW, only the latter 2 failed due to the SIGILL, if I see right ...
Florian: I'm pretty sure I got the coredumps I attached by using `coredumpctl dump`, which is from systemd-coredump, isn't it?
(In reply to Adam Williamson from comment #77)
> Florian: I'm pretty sure I got the coredumps I attached by using
> `coredumpctl dump`, which is from systemd-coredump, isn't it?

Yes, but some of the backtraces seem to have the crash handler on the stack, rather than the actual stack contents at the crash site.
So, the branched compose failed last night... the workstation live failed, but this time it was not only the aarch64 one, but also the x86_64 one. ;(

https://koji.fedoraproject.org/koji/taskinfo?taskID=116308658

Sadly, we don't have the core there. We never enabled coredumps for the x86_64 builders, only the aarch64 buildhw's.
This is really odd. The circumstances for SIGILL are not exactly the same between AArch64 and x86-64. For example, a GCC-generated trap results in SIGILL on x86-64, but on AArch64 it's SIGTRAP.
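That difference is easy to see with GCC's trap builtin: the same source compiles to ud2 on x86-64 (delivered as SIGILL) but to a brk instruction on AArch64 (delivered as SIGTRAP), so identical code paths need not raise identical signals across the two architectures. A minimal sketch, worth compiling with gcc -O2 -S on each architecture to compare:

int main (void)
{
    /* x86-64: ud2 (SIGILL); aarch64: brk (SIGTRAP) */
    __builtin_trap ();
}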
Confirming this issue in fedora40 on aarch64:

Service org.fedoraproject.Anaconda.Modules.Localization has failed to start: Process org.fedoraproject.Anaconda.Modules.Localization exited with status 132

Log:

2024-09-11 06:22:22,793: livemedia-creator v40.7-1
2024-09-11 06:22:22,794: selinux is Disabled
2024-09-11 06:22:23,053: disk_img = /build/results/lmc-disk-ifmw0_xn.img
2024-09-11 06:22:23,054: Using disk size of 7002MiB
2024-09-11 06:22:23,629: Running anaconda.
2024-09-11 06:22:24,645: Failed to create a new connection for the system bus: Could not connect: No such file or directory
2024-09-11 06:22:24,646:
2024-09-11 06:22:24,647: Failed to setup DBus connection: Could not connect: No such file or directory
2024-09-11 06:22:27,649: Starting installer, one moment...
2024-09-11 06:22:27,650: terminal size detection failed, using default width
2024-09-11 06:22:27,650: 06:22:27 Kernel boot argument 'selinux' detected. Did you want to use 'inst.selinux' for the installer instead?
2024-09-11 06:22:27,651: 06:22:27 All Anaconda kernel boot arguments are now required to use 'inst.' prefix!
2024-09-11 06:22:27,651: 06:22:27 Enforcing the non-interactive text mode for dir and image installations.
2024-09-11 06:22:31,164: anaconda 40.22.3-1.fc40 for Linux 40 (pre-release) started.
2024-09-11 06:22:31,165: 06:22:30 Service org.fedoraproject.Anaconda.Modules.Localization has failed to start: Process org.fedoraproject.Anaconda.Modules.Localization exited with status 132
2024-09-11 06:22:41,681: Running anaconda failed: process '['unshare', '--pid', '--kill-child', '--mount', '--propagation', 'unchanged', 'anaconda', '--kickstart', '/build/live-arm.ks', '--cmdline', '--dirinstall', '--remotelog', '127.0.0.1:46965']' exited with status 1
2024-09-11 06:23:23,772: Install failed: novirt_install failed
2024-09-11 06:23:23,773: Removing bad disk image
2024-09-11 06:23:23,776: ERROR: Image creation failed: novirt_install failed
2024-09-11 06:23:23,777: Image creation failed: novirt_install failed

stdout_lines: <omitted>

# rpm -qa | grep libblockdev
libblockdev-utils-3.1.1-1.fc40.aarch64
libblockdev-3.1.1-1.fc40.aarch64
libblockdev-btrfs-3.1.1-1.fc40.aarch64
libblockdev-dm-3.1.1-1.fc40.aarch64
libblockdev-fs-3.1.1-1.fc40.aarch64
libblockdev-loop-3.1.1-1.fc40.aarch64
libblockdev-swap-3.1.1-1.fc40.aarch64
libblockdev-lvm-3.1.1-1.fc40.aarch64
libblockdev-nvdimm-3.1.1-1.fc40.aarch64
libblockdev-part-3.1.1-1.fc40.aarch64
libblockdev-mdraid-3.1.1-1.fc40.aarch64
libblockdev-nvme-3.1.1-1.fc40.aarch64
libblockdev-crypto-3.1.1-1.fc40.aarch64
libblockdev-mpath-3.1.1-1.fc40.aarch64
libblockdev-plugins-all-3.1.1-1.fc40.aarch64
libblockdev-lvm-dbus-3.1.1-1.fc40.aarch64

I see libblockdev v3.2.0 was released a few hours ago; I will keep an eye on fedora/linux/development/rawhide/Everything/aarch64/os/Packages/l/, run another build using 3.2.0, and report back.
I tried running the live creator on f42 (rawhide) in mock using libblockdev v3.2.0, and got the usual error:

2024-09-12 10:29:42,189 INFO pylorax: Starting automated install....
2024-09-12 10:29:42,189 INFO pylorax:
2024-09-12 10:29:42,190 INFO pylorax: ================================================================================
2024-09-12 10:29:42,190 INFO pylorax: ================================================================================
2024-09-12 10:29:42,190 INFO pylorax: Installation
2024-09-12 10:29:42,191 INFO pylorax:
2024-09-12 10:29:42,191 INFO pylorax: 1) [x] Language settings                 2) [x] Time settings
2024-09-12 10:29:42,191 INFO pylorax:        (English (United States))                (UTC timezone)
2024-09-12 10:29:42,195 INFO pylorax: 3) [x] Installation source               4) [x] Software selection
2024-09-12 10:29:42,195 INFO pylorax:        (http://192.168.122.1:8000/fedor         (Custom software selected)
2024-09-12 10:29:42,195 INFO pylorax:        a/42/aarch64/)
2024-09-12 10:29:42,196 INFO pylorax:
2024-09-12 10:29:42,196 INFO pylorax: ================================================================================
2024-09-12 10:29:42,196 INFO pylorax: ================================================================================
2024-09-12 10:29:42,197 INFO pylorax: Progress
2024-09-12 10:29:42,197 INFO pylorax:
2024-09-12 10:29:42,197 INFO pylorax: Setting up the installation environment
2024-09-12 10:29:42,201 INFO pylorax: .
2024-09-12 10:29:42,201 INFO pylorax: Configuring storage
2024-09-12 10:29:46,709 INFO pylorax: .
2024-09-12 10:29:46,711 INFO pylorax: Running pre-installation scripts
2024-09-12 10:29:46,711 INFO pylorax: .
2024-09-12 10:29:46,711 INFO pylorax: Running pre-installation tasks
2024-09-12 10:29:47,212 INFO pylorax:
2024-09-12 10:29:47,212 INFO pylorax: The installation was stopped due to an error which occurred while running in non-interactive cmdline mode. Since there cannot be any questions in cmdline mode, edit your kickstart file and retry installation.
2024-09-12 10:29:47,212 INFO pylorax: The exact error message is:
2024-09-12 10:29:47,213 INFO pylorax:
2024-09-12 10:29:47,213 INFO pylorax: Non interactive installation failed: Message recipient disconnected from message bus without replying.
2024-09-12 10:29:47,213 INFO pylorax:
2024-09-12 10:29:47,213 INFO pylorax: The installer will now terminate.
2024-09-12 10:32:48,859 ERROR pylorax: Running anaconda failed: process '['unshare', '--pid', '--kill-child', '--mount', '--propagation', 'unchanged', 'anaconda', '--kickstart', '/build/live-arm.ks', '--cmdline', '--dirinstall', '--remotelog', '127.0.0.1:40605']' exited with status 1
2024-09-12 10:32:48,860 INFO livemedia-creator: Shutting down log processing
2024-09-12 10:32:48,945 ERROR pylorax: Install failed: novirt_install failed
2024-09-12 10:32:48,946 INFO pylorax: Removing bad disk image
2024-09-12 10:32:48,950 ERROR pylorax: ERROR: Image creation failed: novirt_install failed
2024-09-12 10:32:48,952 ERROR livemedia-creator: Image creation failed: novirt_install failed

logs attached in logs.tar.gz
yeah, the current understanding of the issue is that the cause is something subtle and deep somewhere in the stack :-(
Wait - has anybody mentioned bug 2305877#c1?
nope, it looks promising, but it doesn't explain non-aarch64 crashes :-(
What's "funny" is that I can't seem to be able to reproduce this using the same kickstart, but on F38:

$ diff -u /volumes/development/mock/mock-live-g35f6t38/build/live-arm.ks /volumes/development/mock/mock-live-twtu0lta/build/live-arm.ks
--- /volumes/development/mock/mock-live-g35f6t38/build/live-arm.ks	2024-09-12 10:50:04.593713541 +0000
+++ /volumes/development/mock/mock-live-twtu0lta/build/live-arm.ks	2024-09-12 10:24:48.975854932 +0000
@@ -45,9 +45,9 @@
 # Repo
 #
 # use net install
-url --url=http://192.168.122.1:8000/fedora/38/aarch64/
+url --url=http://192.168.122.1:8000/fedora/42/aarch64/
 
-repo --name=create_live_default --baseurl=http://192.168.122.1:8000/fedora/38/aarch64/
+repo --name=create_live_default --baseurl=http://192.168.122.1:8000/fedora/42/aarch64/
 
 #
 # Packages

The regression must have happened somewhere in f39; I'll run a few extra tests, like building in f42 using anaconda from f38, etc.
Sure, that's why this bug dates from the f39 cycle and the description says "Sporadically we are seeing livemedia composes in both f39 and rawhide fail." Before that it wasn't happening.

This really is aarch64-specific, I think. That *one* time we got a SIGILL on x86_64 was, I think, just a blip. We see this happen multiple times per compose on aarch64, but I think only that one time on any other arch.

The other bug report does look interesting, but... it looks like it is supposed to be fixed in Rawhide and F41 already, yet we are still seeing this failure multiple times per compose, e.g. four aarch64 lives failed in https://pagure.io/releng/failed-composes/issue/6962 , the most recent Rawhide nightly. Still, let's CC Jeremy and Carlos and see if they have any ideas.
I was able to successfully run the live image creator for 41 on aarch64 in an f41 mock using anaconda from f38. I think it's not an issue with the live image creator, but rather with anaconda.
I don't think we ever thought it was an issue in livemedia-creator.
This message is a reminder that Fedora Linux 39 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 39 on 2024-11-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '39'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 39 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
This is still happening. We're hoping ultimately to get around it by building the images with Kiwi, I think.
I'm unable to reproduce this error using livemedia-creator from lorax-41.6-1.fc41.aarch64 and anaconda 42.13-1.fc42 (and all deps that come with it):

2024-11-14 16:38:57,027 INFO livemedia-creator: livemedia-creator v41.6-1
2024-11-14 16:38:57,028 INFO pylorax: selinux is Disabled
2024-11-14 16:38:57,714 INFO pylorax: disk_img = /build/results/lmc-disk-a1fzjl9h.img
2024-11-14 16:38:57,714 INFO pylorax: Using disk size of 7502MiB
2024-11-14 16:38:58,090 DEBUG pylorax.imgutils: make tmp mountdir /var/tmp/lorax.imgutils.yv8px59k
2024-11-14 16:38:58,339 DEBUG pylorax.imgutils: remove tmp mountdir /var/tmp/lorax.imgutils.yv8px59k
2024-11-14 16:38:58,727 INFO pylorax: Running anaconda.
2024-11-14 16:39:00,260 INFO pylorax: Failed to setup DBus connection: Failed to create a new connection for the system bus: Could not connect: No such file or directory
2024-11-14 16:39:06,771 INFO pylorax: Starting installer, one moment...
2024-11-14 16:39:06,773 INFO pylorax: terminal size detection failed, using default width
2024-11-14 16:39:06,773 INFO pylorax: 16:39:06 Enforcing the non-interactive text mode for dir and image installations.
2024-11-14 16:39:39,506 INFO livemedia-creator: Processing logs from ('127.0.0.1', 32836)
2024-11-14 16:39:40,414 INFO pylorax: anaconda 42.13-1.fc42 for Linux 41 (pre-release) started.
2024-11-14 16:39:48,932 INFO pylorax: Starting automated install......

I believe this got fixed in f42.
Nope, we're still getting multiple failed images daily. See https://pagure.io/releng/failed-composes/issue/7373 for e.g. https://koji.fedoraproject.org/koji/taskinfo?taskID=125843293 , the Workstation live failure on aarch64. https://kojipkgs.fedoraproject.org//work/tasks/3293/125843293/anaconda-dbus.log has the illegal instruction error:

/usr/libexec/anaconda/start-module: line 41:    49 Illegal instruction     (core dumped) python3 -m $1
dbus-daemon[6]: Activated service 'org.fedoraproject.Anaconda.Modules.Timezone' failed: Process org.fedoraproject.Anaconda.Modules.Timezone exited with status 132
This bug appears to have been reported against 'rawhide' during the Fedora Linux 42 development cycle. Changing version to 42.