Bug 1993505

Summary: Intermittent failure in kernel-core %posttrans when creating live images
Product: [Fedora] Fedora Reporter: Adam Williamson <awilliam>
Component: kexec-toolsAssignee: Kairui Song <kasong>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rawhideCC: acaringi, adscvr, airlied, alciregi, bhe, bskeggs, hdegoede, jarodwilson, jeremy, jforbes, jglisse, jonathan, josef, kasong, kernel-maint, kevin, lgoncalv, linville, masami256, mchehab, mtasaka, ptalbert, ruyang, steved, zbyszek
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard: AcceptedFreezeException
Fixed In Version: kexec-tools-2.0.22-7.fc35 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-09-07 19:07:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1891954    

Description Adam Williamson 2021-08-13 16:00:04 UTC
For the last couple of weeks, we've noticed an intermittent issue where live image builds fail because of a failure in the kernel-core %posttrans scriptlet:

DEBUG util.py:444:  2021-08-13 10:43:03,707: The installation was stopped due to an error which occurred while running in non-interactive cmdline mode. Since there cannot be any questions in cmdline mode, edit your kickstart file and retry installation. 
DEBUG util.py:444:  2021-08-13 10:43:03,707: The exact error message is: 
DEBUG util.py:444:  2021-08-13 10:43:03,707: 
DEBUG util.py:444:  2021-08-13 10:43:03,707: Non interactive installation failed: An error occurred during the transaction: Error in POSTTRANS scriptlet in rpm package kernel-core. 
DEBUG util.py:444:  2021-08-13 10:43:03,707: 
DEBUG util.py:444:  2021-08-13 10:43:03,708: The installer will now terminate.

This seems to hit roughly two or three live image builds per compose. If it happens in one of the critical images (KDE or GNOME), that causes the whole compose to fail. It's affecting both Rawhide and Branched (F35) composes.

We first noticed this on the Fedora-Rawhide-20210730.n.0 compose - https://pagure.io/releng/failed-composes/issue/2637 - but it's possible it appeared before then but that was just the first time it caused a compose to fail. Unfortunately the logs for live image build tasks from before that time have been garbage-collected now, so we can't be sure.

It doesn't seem like the kernel itself or systemd (where most of the stuff the %posttrans script runs lives) changed around that time, but we're filing against kernel for now anyway just because we don't know where the issue lies yet. Just wanted to have somewhere to track the problem.

Kevin asked Justin to add a -v to the scriptlet so we might get a bit more information at least when it fails.

Comment 1 Mamoru TASAKA 2021-08-19 13:39:13 UTC
Shouldn't this bug be promoted to f35-beta blocker? This issue still happens on Today's compose (randomly).

Comment 2 Justin M. Forbes 2021-08-19 15:25:47 UTC
Did you get any meaningful output last weekend from the builds with -v?

Comment 3 Adam Williamson 2021-08-19 15:39:51 UTC
Whether it's a blocker is kinda philosophical, because it doesn't exactly *prevent* us from composing - if we really need a compose we can just keep firing till one gets through. Regardless, we're treating it as an important issue, it's just not easy to figure out what the heck is going on :(

Comment 4 Kevin Fenzi 2021-08-19 18:01:07 UTC
(In reply to Justin M. Forbes from comment #2)
> Did you get any meaningful output last weekend from the builds with -v?

Composes have been failing for other reasons now (broken deps) so no new info on this yet. 

I did find the f35 lxqt live failed for this issue, but I am not seeing the -v output. ;( 
https://koji.fedoraproject.org/koji/taskinfo?taskID=74131298

Comment 5 Adam Williamson 2021-08-20 21:15:15 UTC
I've been running that LXQt build locally over and over again to try and reproduce the failure in a context where I can maybe find more info, but it's worked every time so far :/

Comment 6 Justin M. Forbes 2021-08-21 14:08:39 UTC
(In reply to Kevin Fenzi from comment #4)
> (In reply to Justin M. Forbes from comment #2)
> > Did you get any meaningful output last weekend from the builds with -v?
> 
> Composes have been failing for other reasons now (broken deps) so no new
> info on this yet. 
> 
> I did find the f35 lxqt live failed for this issue, but I am not seeing the
> -v output. ;( 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=74131298

Yeah, the -v was only put in for the single build last weekend.  I will add it again to Monday's rc7 build for f35.

Comment 7 Kevin Fenzi 2021-08-24 15:39:39 UTC
From todays failed f35 with -v: 

...
10:52:23,531 INF dnf.rpm: +/usr/lib/kernel/install.d/00-entry-directory.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/10-devicetree.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/20-grub.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/20-grubby.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/50-depmod.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
Running depmod -a 5.14.0-0.rc7.54.fc35.x86_64
+/usr/lib/kernel/install.d/50-dracut.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/51-dracut-rescue.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/60-kdump.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/90-loaderentry.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/92-crashkernel.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/95-kernel-hooks.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/99-grub-mkconfig.install add 5.14.0-0.rc7.54.fc35.x86_64 /boot/3819039cc9ba49e6b6d95db91bc0b033/5.14.0-0.rc7.54.fc35.x86_64 /lib/modules/5.14.0-0.rc7.54.fc35.x86_64/vmlinuz 
warning: %posttrans(kernel-core-5.14.0-0.rc7.54.fc35.x86_64) scriptlet failed, exit status 1

10:52:23,532 ERR dnf.rpm: Error in POSTTRANS scriptlet in rpm package kernel-core
10:52:23,991 INF dnf.rpm: Running in chroot, ignoring command 'set-property'
..

so, sadly it doesnt tell us what exited 1. ;( Perhaps we could get systemd folks to improve the -v output (include exit codes from scriptlets?)

Comment 8 Kevin Fenzi 2021-08-31 20:05:46 UTC
So, we added a strace to it and it generated a gigantic pile of output. 

A few months ago, kexec-tools added a 92-crashkernel.install scriptlet in https://src.fedoraproject.org/rpms/kexec-tools/c/646364193597401a55eb2485ef7840bb5eef7d88?branch=rawhide


[pid 33381] execve("/sbin/grubby", ["grubby", "--args", "crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M", "--update-kernel", "/boot/vmlinuz-5.14.0-61.fc36.x86_64"], 0x563d11877dc0 /* 22 vars */) = 0
...
[pid 33381] read(255, "\nif [[ -n $set_default ]]; then\n    set_default_bls \"${set_default}\"\nfi\n\
nexit 0\n", 8192) = 80
[pid 33381] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid 33381] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 33381] exit_group(0)               = ?
[pid 33381] +++ exited with 0 +++
...

[pid 33289] exit_group(1)               = ?
[pid 33289] +++ exited with 1 +++

So it looks like the 92-crashkernel.install runs grubby (successfully) then exits 1?

That function is: 

set_kernel_ck() {
        kernel=$1
        ck_cmdline=$2

        entry=$(grubby --info ALL | grep "^kernel=.*$kernel")
        entry=${entry#kernel=}
        entry=${entry#\"}
        entry=${entry%\"}

        if [[ -z "$entry" ]]; then
                echo "$0: failed to find boot entry for kernel $kernel"
                return 1
        fi

        [[ -f /etc/zipl.conf ]] && zipl_arg="--zipl"
        grubby --args "$ck_cmdline" --update-kernel "$entry" $zipl_arg
        [[ $zipl_arg ]] && zipl > /dev/null
}

So, if $zipl_arg is not defined, shouldn't this always exit 1? If so, I am not sure at all this is the problem, since it's been sporadic, but it's _a_ problem.

Moving to kexec-tools for comment. 

Justin: Much less invasive than the strace, and possibly more helpfull, could you do a 'sh -x /bin/kernel-install ...' in the next build? That shouldn't cause problems, and should hopefully tell us if others are failing here.

Comment 9 Kairui Song 2021-09-01 09:36:54 UTC
(In reply to Kevin Fenzi from comment #8)
> So, we added a strace to it and it generated a gigantic pile of output. 
> 
> A few months ago, kexec-tools added a 92-crashkernel.install scriptlet in
> https://src.fedoraproject.org/rpms/kexec-tools/c/
> 646364193597401a55eb2485ef7840bb5eef7d88?branch=rawhide
> 
> 
> [pid 33381] execve("/sbin/grubby", ["grubby", "--args",
> "crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M", "--update-kernel",
> "/boot/vmlinuz-5.14.0-61.fc36.x86_64"], 0x563d11877dc0 /* 22 vars */) = 0
> ...
> [pid 33381] read(255, "\nif [[ -n $set_default ]]; then\n    set_default_bls
> \"${set_default}\"\nfi\n\
> nexit 0\n", 8192) = 80
> [pid 33381] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> [pid 33381] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> [pid 33381] exit_group(0)               = ?
> [pid 33381] +++ exited with 0 +++
> ...
> 
> [pid 33289] exit_group(1)               = ?
> [pid 33289] +++ exited with 1 +++
> 
> So it looks like the 92-crashkernel.install runs grubby (successfully) then
> exits 1?
> 
> That function is: 
> 
> set_kernel_ck() {
>         kernel=$1
>         ck_cmdline=$2
> 
>         entry=$(grubby --info ALL | grep "^kernel=.*$kernel")
>         entry=${entry#kernel=}
>         entry=${entry#\"}
>         entry=${entry%\"}
> 
>         if [[ -z "$entry" ]]; then
>                 echo "$0: failed to find boot entry for kernel $kernel"
>                 return 1
>         fi
> 
>         [[ -f /etc/zipl.conf ]] && zipl_arg="--zipl"
>         grubby --args "$ck_cmdline" --update-kernel "$entry" $zipl_arg
>         [[ $zipl_arg ]] && zipl > /dev/null
> }
> 
> So, if $zipl_arg is not defined, shouldn't this always exit 1? If so, I am
> not sure at all this is the problem, since it's been sporadic, but it's _a_
> problem.
> 
> Moving to kexec-tools for comment. 
> 
> Justin: Much less invasive than the strace, and possibly more helpfull,
> could you do a 'sh -x /bin/kernel-install ...' in the next build? That
> shouldn't cause problems, and should hopefully tell us if others are failing
> here.

kexec-tools definitely need a fix, I'll fix it and make a new release.

Comment 10 Adam Williamson 2021-09-01 15:16:57 UTC
Sorry, forgot to mention here I actually patched the zipl line to always return 0 last night. We're not sure that's the whole story since the failure is intermittent, but the change at least seems correct and safe so I figured it couldn't hurt to make it:

https://src.fedoraproject.org/rpms/kexec-tools/c/5270d40dd0b690d4be628f937e508cdc80dab1d9?branch=rawhide

we figured we'd watch Rawhide for a day or two and if the bug doesn't happen to any of those lives, we can land it in F35 and hope the problem is solved.

Comment 11 Adam Williamson 2021-09-01 20:43:31 UTC
So, no Rawhide lives in today's compose hit the bug, which is I guess a good sign. I'll send out an update for F35 and mark it as fixing this bug, if nothing in Rawhide hits it tomorrow either, maybe we can push it. Proposing as a Beta FE since we're in freeze now - this is borderline a blocker as it affects our ability to compose reliably, but it seems a no-brainer to be at least an FE.

Comment 12 Fedora Update System 2021-09-01 20:45:17 UTC
FEDORA-2021-54084a7818 has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2021-54084a7818

Comment 13 Fedora Update System 2021-09-02 15:55:53 UTC
FEDORA-2021-54084a7818 has been pushed to the Fedora 35 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-54084a7818`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-54084a7818

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 14 Adam Williamson 2021-09-03 17:23:25 UTC
as kevin pointed out, we haven't hit this bug in Rawhide composes for the last few days...but we also haven't hit it in F35 composes, and the fix isn't stable for F35 yet. So, uh, we're not sure what's going on.

Comment 15 Mamoru TASAKA 2021-09-04 13:58:33 UTC
(In reply to Adam Williamson from comment #14)
> as kevin pointed out, we haven't hit this bug in Rawhide composes for the
> last few days...but we also haven't hit it in F35 composes, and the fix
> isn't stable for F35 yet. So, uh, we're not sure what's going on.

At least Fedora-Astronomy_KDE-Live-35-20210904.n.0 saw this bug again.

Comment 16 Adam Williamson 2021-09-06 15:06:56 UTC
+3 in https://pagure.io/fedora-qa/blocker-review/issue/425 , marking accepted. I'll check the results again shortly and if we still have no Rawhide failures we can land this...

Comment 17 Fedora Update System 2021-09-07 19:07:06 UTC
FEDORA-2021-54084a7818 has been pushed to the Fedora 35 stable repository.
If problem still persists, please make note of it in this bug report.