In https://issues.redhat.com/browse/OCPBUGS-9685?focusedCommentId=21882015&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-21882015 we discovered that sometimes rpm-ostree was segfaulting when trying to override with the RT kernel. Narrowing things down a bit, it's not related to the rt kernel; reproduction steps so far seem to be: - boot rhel8 host - queue a rebase to rhel9 - layer a package Stack trace looks like this: ``` PID: 263873 (rpm-ostree) UID: 0 (root) GID: 0 (root) Signal: 11 (SEGV) Timestamp: Thu 2023-03-09 23:03:03 UTC (2min 46s ago) Command Line: /usr/bin/rpm-ostree start-daemon Executable: /usr/bin/rpm-ostree Control Group: / Slice: -.slice Boot ID: 3eb1772866404a42abb9abd231bbb0a1 Machine ID: 17248ec74f0145ef8e3115c54d37bc2a Hostname: cosa-devsh Storage: /var/lib/systemd/coredump/core.rpm-ostree.0.3eb1772866404a42abb9abd231bbb0a1.263873.1678402983000000.lz4 Message: Process 263873 (rpm-ostree) of user 0 dumped core. Stack trace of thread 443865: #0 0x00007ffad342c0bc hex2bin (librpm.so.8) #1 0x00007ffad342fdd7 rpmfilesNew (librpm.so.8) #2 0x00007ffad34307bd rpmfiNewPool (librpm.so.8) #3 0x0000559c271d391a rpmostree_context_assemble (rpm-ostree) #4 0x0000559c27202878 rpmostree_sysroot_upgrader_deploy (rpm-ostree) #5 0x0000559c271c072c _ZL26deploy_transaction_executeP22_RpmostreedTransactionP13_GCancellablePP7_GError (rpm-ostree) #6 0x0000559c271b8453 _ZL26transaction_execute_threadP6_GTaskPvS1_P13_GCancellable (rpm-ostree) #7 0x00007ffad3c69357 g_task_thread_pool_thread (libgio-2.0.so.0) #8 0x00007ffad36df9d3 g_thread_pool_thread_proxy (libglib-2.0.so.0) #9 0x00007ffad36defca g_thread_proxy (libglib-2.0.so.0) #10 0x00007ffad0cb71cf start_thread (libpthread.so.0) #11 0x00007ffad0923dd3 __clone (libc.so.6) Stack trace of thread 263873: #0 0x00007ffad0a19a07 epoll_wait (libc.so.6) #1 0x0000559c27166d26 _ZN5tokio2io6driver6Driver4turn17h9a20fe9d16a70024E.llvm.9154779605146092572 (rpm-ostree) #2 0x0000559c271636f2 _ZN5tokio4time6driver15Driver$LT$P$GT$13park_internal17he11018871cd3c611E.llvm.7853487799161909216 (rpm-ostree) #3 0x0000559c2716bc3d _ZN78_$LT$tokio..park..either..Either$LT$A$C$B$GT$$u20$as$u20$tokio..park..Park$GT$4park17h8d360b7d05da848aE (rpm-ostree) #4 0x0000559c2717a3b9 _ZN5tokio7runtime15basic_scheduler7Context4park17h6d96f78c085bf685E (rpm-ostree) #5 0x0000559c26e6d7fc _ZN5tokio6macros10scoped_tls18ScopedKey$LT$T$GT$3set17h9304ae0e2f3349d3E (rpm-ostree) #6 0x0000559c26e47954 _ZN5tokio7runtime15basic_scheduler14BasicScheduler8block_on17h334695d20fab50d0E (rpm-ostree) #7 0x0000559c26e6e72c _ZN5tokio7runtime7Runtime8block_on17h6bfe3f9a1752a5d1E (rpm-ostree) #8 0x0000559c26edb073 _ZN10rpm_ostree4main17h410db1bc1923be7aE (rpm-ostree) #9 0x0000559c26ed4b03 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17ha07cd638c2ad92c5E (rpm-ostree) #10 0x0000559c26e3d4d9 _ZN3std2rt10lang_start28_$u7b$$u7b$closure$u7d$$u7d$17hf0d73452748f8331E.llvm.1177766809197949552 (rpm-ostree) #11 0x0000559c2713301e _ZN3std2rt19lang_start_internal17ha9d47c9583ae6f7eE (rpm-ostree) #12 0x0000559c26edb2e8 main (rpm-ostree) #13 0x00007ffad0924cf3 __libc_start_main (libc.so.6) #14 0x0000559c26c3a34e _start (rpm-ostree) Stack trace of thread 263876: #0 0x00007ffad0a0eac1 __poll (libc.so.6) #1 0x00007ffad36b6c86 g_main_context_iterate.isra.20 (libglib-2.0.so.0) #2 0x00007ffad36b7042 g_main_loop_run (libglib-2.0.so.0) #3 0x00007ffad3cac56a gdbus_shared_thread_func (libgio-2.0.so.0) #4 0x00007ffad36defca g_thread_proxy (libglib-2.0.so.0) #5 0x00007ffad0cb71cf start_thread (libpthread.so.0) #6 0x00007ffad0923dd3 __clone (libc.so.6) Stack trace of thread 443866: #0 0x00007ffad092391d syscall (libc.so.6) #1 0x00007ffad36fde3e g_cond_wait_until (libglib-2.0.so.0) #2 0x00007ffad3689191 g_async_queue_pop_intern_unlocked (libglib-2.0.so.0) #3 0x00007ffad36dfb12 g_thread_pool_thread_proxy (libglib-2.0.so.0) #4 0x00007ffad36defca g_thread_proxy (libglib-2.0.so.0) #5 0x00007ffad0cb71cf start_thread (libpthread.so.0) #6 0x00007ffad0923dd3 __clone (libc.so.6) Stack trace of thread 263875: #0 0x00007ffad0a0eac1 __poll (libc.so.6) #1 0x00007ffad36b6c86 g_main_context_iterate.isra.20 (libglib-2.0.so.0) #2 0x00007ffad36b6db0 g_main_context_iteration (libglib-2.0.so.0) #3 0x00007ffad36b6e01 glib_worker_main (libglib-2.0.so.0) #4 0x00007ffad36defca g_thread_proxy (libglib-2.0.so.0) #5 0x00007ffad0cb71cf start_thread (libpthread.so.0) #6 0x00007ffad0923dd3 __clone (libc.so.6) Stack trace of thread 263874: #0 0x00007ffad0a0eac1 __poll (libc.so.6) #1 0x00007ffad36b6c86 g_main_context_iterate.isra.20 (libglib-2.0.so.0) #2 0x00007ffad36b6db0 g_main_context_iteration (libglib-2.0.so.0) #3 0x0000559c271b16bc rpmostreed_daemon_run_until_idle_exit (rpm-ostree) #4 0x0000559c271a81b8 rpmostree_builtin_start_daemon (rpm-ostree) #5 0x0000559c2719d889 _ZN12rpmostreecxx14rpmostree_mainEN4rust10cxxbridge15SliceIKNS1_3StrEEE (rpm-ostree) #6 0x0000559c271976c7 rpmostreecxx$cxxbridge1$rpmostree_main (rpm-ostree) #7 0x0000559c26ead73e _ZN102_$LT$tokio..runtime..blocking..task..BlockingTask$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$4poll17hf624d895f71b5ee7E (rpm-ostree) #8 0x0000559c26eb8c03 _ZN5tokio7runtime4task7harness20Harness$LT$T$C$S$GT$4poll17ha6c1caddcd64e06fE (rpm-ostree) #9 0x0000559c27173182 _ZN5tokio7runtime8blocking4pool5Inner3run17h3e8406aa9e4c45e2E (rpm-ostree) #10 0x0000559c27166000 _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17ha83e312e4406ac5dE.llvm.9154779605146092572 (rpm-ostree) #11 0x0000559c27168617 _ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17h2e98583ffc825ff9E (rpm-ostree) #12 0x0000559c27146545 _ZN3std3sys4unix6thread6Thread3new12thread_start17h30bda5701f7abd82E (rpm-ostree) #13 0x00007ffad0cb71cf start_thread (libpthread.so.0) #14 0x00007ffad0923dd3 __clone (libc.so.6) ```
OK I got a build with debuginfo. The stack goes pretty deep into librpm: Starts here https://github.com/coreos/rpm-ostree/blob/4bdcefabaaf3e51729150f03bfd0ad6732334af1/src/libpriv/rpmostree-core.cxx#L2741 Here's the new backtrace with a bit more local variables: ``` (gdb) bt #0 0x00007ffad342c0bc in rnibble (c=0 '\000') at ../include/rpm/rpmstring.h:99 #1 0x00007ffad342c0bc in hex2bin (h=h@entry=0x7ffaaf0709e0, tag=tag@entry=5090, num=num@entry=14, len=226) at rpmfi.c:1515 #2 0x00007ffad342fdd7 in rpmfilesPopulate (flags=229374, h=0x7ffaaf0709e0, fi=0x7ffaad702d90) at rpmfi.c:1590 #3 0x00007ffad342fdd7 in rpmfilesNew (pool=pool@entry=0x0, h=h@entry=0x7ffaaf0709e0, tagN=tagN@entry=1117, flags=flags@entry=229374) at rpmfi.c:1652 #4 0x00007ffad34307bd in rpmfiNewPool (pool=pool@entry=0x0, h=h@entry=0x7ffaaf0709e0, tagN=tagN@entry=1117, flags=flags@entry=229374) at rpmfi.c:1760 #5 0x00007ffad34307eb in rpmfiNew (ts=ts@entry=0x7ffaac357a00, h=h@entry=0x7ffaaf0709e0, tagN=tagN@entry=1117, flags=flags@entry=229374) at rpmfi.c:1767 #6 0x0000559c271d391a in handle_file_dispositions (error=0x7ffac27286e0, cancellable=0x7ffaaee35010 [_IS_CANCELLABLE (cancellable)], out_files_skip_delete=<synthetic pointer>, out_files_skip_add=<synthetic pointer>, ts=0x7ffaac357a00, tmprootfs_dfd=557, self=0x7ffaac44e190) at src/libpriv/rpmostree-core.cxx:2738 #7 0x0000559c271d391a in rpmostree_context_assemble(RpmOstreeContext*, GCancellable*, GError**) (self=<optimized out>, cancellable=cancellable@entry=0x7ffaaee35010 [_IS_CANCELLABLE (cancellable)], error=error@entry=0x7ffac27286e0) at src/libpriv/rpmostree-core.cxx:4184 #8 0x0000559c27202878 in perform_local_assembly (error=<optimized out>, cancellable=<optimized out>, self=<optimized out>) at src/daemon/rpmostree-sysroot-upgrader.cxx:1025 #9 0x0000559c27202878 in rpmostree_sysroot_upgrader_deploy(RpmOstreeSysrootUpgrader*, OstreeDeployment**, GCancellable*, GError**) (self=self@entry=0x7ffab000c4d0, out_deployment=out_deployment@entry=0x7ffac27285d0, cancellable=cancellable@entry=0x7ffaaee35010 [_IS_CANCELLABLE (cancellable)], error=error@entry=0x7ffac27286e0) at src/daemon/rpmostree-sysroot-upgrader.cxx:1335 #10 0x0000559c271c072c in deploy_transaction_execute(RpmostreedTransaction*, GCancellable*, GError**) (transaction=0x7ffac40cde00, cancellable=<optimized out>, error=<optimized out>) at src/daemon/rpmostreed-transaction-types.cxx:1596 #11 0x0000559c271b8453 in transaction_execute_thread(GTask*, gpointer, gpointer, GCancellable*) (task=0x7ffac40e3920 [RUE], source_object=<optimized out>, task_data=<optimized out>, cancellable=0x7ffaaee35010 [_IS_CANCELLABLE (cancellable)]) at src/daemon/rpmostreed-transaction.cxx:338 #12 0x00007ffad3c69357 in g_task_thread_cleanup () at gtask.c:1318 #13 0x00007ffad3c69357 in g_task_thread_pool_thread (thread_data=0x7ffac40e3920, pool_data=<optimized out>) at gtask.c:1336 #14 0x00007ffad36df9d3 in g_thread_pool_thread_proxy (data=<optimized out>) at gthreadpool.c:307 #15 0x00007ffad36defca in g_thread_proxy (data=0x7ffa9c1a2a30) at gthread.c:784 #16 0x00007ffad0cb71cf in start_thread () at /lib64/libpthread.so.0 #17 0x00007ffad0923dd3 in __libc_start_main () at /lib64/libc.so.6 #18 0x0000000000000000 in () (gdb) ``` ``` (gdb) list 94 * @return binary nibble 95 */ 96 RPM_GNUC_CONST 97 static inline unsigned char rnibble(char c) 98 { 99 if (c >= '0' && c <= '9') 100 return (c - '0'); 101 if (c >= 'a' && c <= 'f') 102 return (c - 'a') + 10; 103 if (c >= 'A' && c <= 'F') (gdb) p c $4 = 0 '\000' ``` OK, so we seem to have walked off the end into a NUL character? ``` (gdb) up #1 hex2bin (h=h@entry=0x7ffaaf0709e0, tag=tag@entry=5090, num=num@entry=14, len=226) at rpmfi.c:1515 1515 *t = (rnibble(s[0]) << 4) | rnibble(s[1]); (gdb) p s $5 = 0x7ffaafffefff "" ``` Right, s[1] seems invalid here? ``` (gdb) up #2 0x00007ffad342fdd7 in rpmfilesPopulate (flags=229374, h=0x7ffaaf0709e0, fi=0x7ffaad702d90) at rpmfi.c:1590 1590 fi->signatures = hex2bin(h, RPMTAG_FILESIGNATURES, (gdb) list 1585 1586 fi->signatures = NULL; 1587 /* grab hex signatures from header and store in binary format */ 1588 if (!(flags & RPMFI_NOFILESIGNATURES)) { 1589 fi->signaturelength = headerGetNumber(h, RPMTAG_FILESIGNATURELENGTH); 1590 fi->signatures = hex2bin(h, RPMTAG_FILESIGNATURES, 1591 totalfc, fi->signaturelength); 1592 } 1593 1594 /* XXX TR_REMOVED doesn;t need fmtimes, frdevs, finodes */ (gdb) ``` This relates to the file signatures. ``` (gdb) up #3 rpmfilesNew (pool=pool@entry=0x0, h=h@entry=0x7ffaaf0709e0, tagN=tagN@entry=1117, flags=flags@entry=229374) at rpmfi.c:1652 1652 if (rpmfilesPopulate(fi, h, flags)) (gdb) up #4 0x00007ffad34307bd in rpmfiNewPool (pool=pool@entry=0x0, h=h@entry=0x7ffaaf0709e0, tagN=tagN@entry=1117, flags=flags@entry=229374) at rpmfi.c:1760 1760 rpmfiles files = rpmfilesNew(pool, h, tagN, flags); (gdb) up #5 0x00007ffad34307eb in rpmfiNew (ts=ts@entry=0x7ffaac357a00, h=h@entry=0x7ffaaf0709e0, tagN=tagN@entry=1117, flags=flags@entry=229374) at rpmfi.c:1767 1767 return rpmfiNewPool(NULL, h, tagN, flags); (gdb) up ``` Didn't see anything obviously interesting in these frames. ``` #6 0x0000559c271d391a in handle_file_dispositions (error=0x7ffac27286e0, cancellable=0x7ffaaee35010 [_IS_CANCELLABLE (cancellable)], out_files_skip_delete=<synthetic pointer>, out_files_skip_add=<synthetic pointer>, ts=0x7ffaac357a00, tmprootfs_dfd=557, self=0x7ffaac44e190) at src/libpriv/rpmostree-core.cxx:2738 2738 g_auto (rpmfi) fi = rpmfiNew (ts, h, RPMTAG_BASENAMES, flags); (gdb) ``` And now we're at the faulting line.
Versions here are $ rpm -q rpm rpm-ostree rpm-4.14.3-24.el8_6.x86_64 rpm-ostree-2022.10.112.g3d0ac35b-3.el8.x86_64 $
And to be clear here, in this case it's rhel8 rpm-ostree installing rhel9 RPMs. I wonder if that's in play here... https://github.com/coreos/rpm-ostree/pull/3610 may be related too in that rpm-ostree now tries to parse the IMA bits, but IIRC there were issues with that in the rhel8 era?
Panu, any of this look familiar?
Currently failing to reproduce this with -fsanitize=address enabled or glibc MALLOC_CHECK_=1 unfortunately... Going to look at pulling out either valgrind or https://rr-project.org/ next.
Oh yup, that'd be bug 2018937. It got fixed on RHEL 9 but not 8 because IMA signatures don't exist there yet, but of course segfaulting on any package is not an option. We'll need to fix this in RHEL 8 rpm too then...
> We'll need to fix this in RHEL 8 rpm too then... Yeah seems this way. OK so with my OCP hat on...I think this is *probably* not strictly speaking a blocker. Because while it causes the OS updater (rpm-ostree) to segfault...that's not fatal because it's all transactional (hooray), and actually the upgrade process gets retried and succeeds. It would however be *embarrassing* and I'm sure some customers would find this too. We have a CI test that verifies nothing segfaulted on the node, and that started intermittently failing...but the failure rate is low enough that it still counts as a pass. If you're curious, click on https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-ovn-rt-upgrade-4.14-minor-release-openshift-release-analysis-aggregator/1634080465902112768 and scroll down a bit and you'll see > Passed Tests > Passed: suite=[Symptom Detection], Node process segfaulted > Passed: Passed 8 times, failed 2 times. The historical pass rate is 96%. The required number of passes is 7. > Failure - periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1634080460835393536 > Failure - periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1634080463372947456 If we can get a fix queued for 8.6.z relatively soon, that's better. Then we can tag that right away into rhaos-4.12-rhel8.
A tricky thing here is that we actually have two different places which are invoking rpmfiNew, and while for this caller we could ignore them, there's actually a distinct code path that *is* trying to (optionally) honor the IMA signatures. In the PR linked above https://github.com/coreos/rpm-ostree/pull/3610 it's supported for rpm-ostree to properly propagate RPM IMA signatures into the resulting ostree commit. (More on ostree and IMA: https://ostreedev.github.io/ostree/ima/ ) But...actually out of conservatism we didn't enable that by default, so I think I can conditionally pass the RPMFI_NOFILESIGNATURES in that case, and unconditionally in the first. Yeah, this seems viable to do in rpm-ostree; working on it.
Moving that over to https://bugzilla.redhat.com/show_bug.cgi?id=2177225
Wouldn't this affect in-place traditional dnf RHEL8 -> RHEL9 upgrades too? Or hm, does LEAP always start an upgrade from the initramfs or so?
This is a peculiar thing: from a few random samples of RHEL 9 packages I tried, none had IMA signatures that RHEL-8 would recognize at all, and so this bug doesn't trigger. It's not obvious to me from the logs which packages and from where exactly are being installed when this occurs. Colin? Note that the existence of the flaw is not in doubt, just trying to find an easy reproducer package. Alternatives would be setting up an IMA signing environment on RHEL-8 to try and create a reproducer, or backport support for the RHEL-9 style IMA signatures. Which would seem a weird thing to do just for a reproducers sake...