Bug 1993505
Summary: | Intermittent failure in kernel-core %posttrans when creating live images | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> |
Component: | kexec-tools | Assignee: | Kairui Song <kasong> |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | rawhide | CC: | acaringi, adscvr, airlied, alciregi, bhe, bskeggs, hdegoede, jarodwilson, jeremy, jforbes, jglisse, jonathan, josef, kasong, kernel-maint, kevin, lgoncalv, linville, masami256, mchehab, mtasaka, ptalbert, ruyang, steved, zbyszek |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Unspecified | ||
Whiteboard: | AcceptedFreezeException | ||
Fixed In Version: | kexec-tools-2.0.22-7.fc35 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-09-07 19:07:06 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1891954 |
Description
Adam Williamson
2021-08-13 16:00:04 UTC
Shouldn't this bug be promoted to f35-beta blocker? This issue still happens on Today's compose (randomly). Did you get any meaningful output last weekend from the builds with -v? Whether it's a blocker is kinda philosophical, because it doesn't exactly *prevent* us from composing - if we really need a compose we can just keep firing till one gets through. Regardless, we're treating it as an important issue, it's just not easy to figure out what the heck is going on :( (In reply to Justin M. Forbes from comment #2) > Did you get any meaningful output last weekend from the builds with -v? Composes have been failing for other reasons now (broken deps) so no new info on this yet. I did find the f35 lxqt live failed for this issue, but I am not seeing the -v output. ;( https://koji.fedoraproject.org/koji/taskinfo?taskID=74131298 I've been running that LXQt build locally over and over again to try and reproduce the failure in a context where I can maybe find more info, but it's worked every time so far :/ (In reply to Kevin Fenzi from comment #4) > (In reply to Justin M. Forbes from comment #2) > > Did you get any meaningful output last weekend from the builds with -v? > > Composes have been failing for other reasons now (broken deps) so no new > info on this yet. > > I did find the f35 lxqt live failed for this issue, but I am not seeing the > -v output. ;( > https://koji.fedoraproject.org/koji/taskinfo?taskID=74131298 Yeah, the -v was only put in for the single build last weekend. I will add it again to Monday's rc7 build for f35. From todays failed f35 with -v: ... 10:52:23,531 INF dnf.rpm: +/usr/lib/kernel/install.d/00-entry-directory.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz +/usr/lib/kernel/install.d/10-devicetree.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz +/usr/lib/kernel/install.d/20-grub.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz +/usr/lib/kernel/install.d/20-grubby.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz +/usr/lib/kernel/install.d/50-depmod.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz Running depmod -a 5.14.0-0.rc7.54.fc35.x86_64 +/usr/lib/kernel/install.d/50-dracut.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz +/usr/lib/kernel/install.d/51-dracut-rescue.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz +/usr/lib/kernel/install.d/60-kdump.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz +/usr/lib/kernel/install.d/90-loaderentry.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz +/usr/lib/kernel/install.d/92-crashkernel.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz +/usr/lib/kernel/install.d/95-kernel-hooks.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz +/usr/lib/kernel/install.d/99-grub-mkconfig.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz warning: %posttrans(kernel-core-5.14.0-0.rc7.54.fc35.x86_64) scriptlet failed, exit status 1 10:52:23,532 ERR dnf.rpm: Error in POSTTRANS scriptlet in rpm package kernel-core 10:52:23,991 INF dnf.rpm: Running in chroot, ignoring command 'set-property' .. so, sadly it doesnt tell us what exited 1. ;( Perhaps we could get systemd folks to improve the -v output (include exit codes from scriptlets?) So, we added a strace to it and it generated a gigantic pile of output. A few months ago, kexec-tools added a 92-crashkernel.install scriptlet in https://src.fedoraproject.org/rpms/kexec-tools/c/646364193597401a55eb2485ef7840bb5eef7d88?branch=rawhide [pid 33381] execve("/sbin/grubby", ["grubby", "--args", "crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M", "--update-kernel", "/boot/vmlinuz-5.14.0-61.fc36.x86_64"], 0x563d11877dc0 /* 22 vars */) = 0 ... [pid 33381] read(255, "\nif [[ -n $set_default ]]; then\n set_default_bls \"${set_default}\"\nfi\n\ nexit 0\n", 8192) = 80 [pid 33381] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 [pid 33381] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [pid 33381] exit_group(0) = ? [pid 33381] +++ exited with 0 +++ ... [pid 33289] exit_group(1) = ? [pid 33289] +++ exited with 1 +++ So it looks like the 92-crashkernel.install runs grubby (successfully) then exits 1? That function is: set_kernel_ck() { kernel=$1 ck_cmdline=$2 entry=$(grubby --info ALL | grep "^kernel=.*$kernel") entry=${entry#kernel=} entry=${entry#\"} entry=${entry%\"} if [[ -z "$entry" ]]; then echo "$0: failed to find boot entry for kernel $kernel" return 1 fi [[ -f /etc/zipl.conf ]] && zipl_arg="--zipl" grubby --args "$ck_cmdline" --update-kernel "$entry" $zipl_arg [[ $zipl_arg ]] && zipl > /dev/null } So, if $zipl_arg is not defined, shouldn't this always exit 1? If so, I am not sure at all this is the problem, since it's been sporadic, but it's _a_ problem. Moving to kexec-tools for comment. Justin: Much less invasive than the strace, and possibly more helpfull, could you do a 'sh -x /bin/kernel-install ...' in the next build? That shouldn't cause problems, and should hopefully tell us if others are failing here. (In reply to Kevin Fenzi from comment #8) > So, we added a strace to it and it generated a gigantic pile of output. > > A few months ago, kexec-tools added a 92-crashkernel.install scriptlet in > https://src.fedoraproject.org/rpms/kexec-tools/c/ > 646364193597401a55eb2485ef7840bb5eef7d88?branch=rawhide > > > [pid 33381] execve("/sbin/grubby", ["grubby", "--args", > "crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M", "--update-kernel", > "/boot/vmlinuz-5.14.0-61.fc36.x86_64"], 0x563d11877dc0 /* 22 vars */) = 0 > ... > [pid 33381] read(255, "\nif [[ -n $set_default ]]; then\n set_default_bls > \"${set_default}\"\nfi\n\ > nexit 0\n", 8192) = 80 > [pid 33381] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 > [pid 33381] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 > [pid 33381] exit_group(0) = ? > [pid 33381] +++ exited with 0 +++ > ... > > [pid 33289] exit_group(1) = ? > [pid 33289] +++ exited with 1 +++ > > So it looks like the 92-crashkernel.install runs grubby (successfully) then > exits 1? > > That function is: > > set_kernel_ck() { > kernel=$1 > ck_cmdline=$2 > > entry=$(grubby --info ALL | grep "^kernel=.*$kernel") > entry=${entry#kernel=} > entry=${entry#\"} > entry=${entry%\"} > > if [[ -z "$entry" ]]; then > echo "$0: failed to find boot entry for kernel $kernel" > return 1 > fi > > [[ -f /etc/zipl.conf ]] && zipl_arg="--zipl" > grubby --args "$ck_cmdline" --update-kernel "$entry" $zipl_arg > [[ $zipl_arg ]] && zipl > /dev/null > } > > So, if $zipl_arg is not defined, shouldn't this always exit 1? If so, I am > not sure at all this is the problem, since it's been sporadic, but it's _a_ > problem. > > Moving to kexec-tools for comment. > > Justin: Much less invasive than the strace, and possibly more helpfull, > could you do a 'sh -x /bin/kernel-install ...' in the next build? That > shouldn't cause problems, and should hopefully tell us if others are failing > here. kexec-tools definitely need a fix, I'll fix it and make a new release. Sorry, forgot to mention here I actually patched the zipl line to always return 0 last night. We're not sure that's the whole story since the failure is intermittent, but the change at least seems correct and safe so I figured it couldn't hurt to make it: https://src.fedoraproject.org/rpms/kexec-tools/c/5270d40dd0b690d4be628f937e508cdc80dab1d9?branch=rawhide we figured we'd watch Rawhide for a day or two and if the bug doesn't happen to any of those lives, we can land it in F35 and hope the problem is solved. So, no Rawhide lives in today's compose hit the bug, which is I guess a good sign. I'll send out an update for F35 and mark it as fixing this bug, if nothing in Rawhide hits it tomorrow either, maybe we can push it. Proposing as a Beta FE since we're in freeze now - this is borderline a blocker as it affects our ability to compose reliably, but it seems a no-brainer to be at least an FE. FEDORA-2021-54084a7818 has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2021-54084a7818 FEDORA-2021-54084a7818 has been pushed to the Fedora 35 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-54084a7818` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-54084a7818 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates. as kevin pointed out, we haven't hit this bug in Rawhide composes for the last few days...but we also haven't hit it in F35 composes, and the fix isn't stable for F35 yet. So, uh, we're not sure what's going on. (In reply to Adam Williamson from comment #14) > as kevin pointed out, we haven't hit this bug in Rawhide composes for the > last few days...but we also haven't hit it in F35 composes, and the fix > isn't stable for F35 yet. So, uh, we're not sure what's going on. At least Fedora-Astronomy_KDE-Live-35-20210904.n.0 saw this bug again. +3 in https://pagure.io/fedora-qa/blocker-review/issue/425 , marking accepted. I'll check the results again shortly and if we still have no Rawhide failures we can land this... FEDORA-2021-54084a7818 has been pushed to the Fedora 35 stable repository. If problem still persists, please make note of it in this bug report. |