Bug 2166233

Summary: grubby fails to add the kernel entry when upgrading from RHEL6 using redhat-upgrade-tool
Product: Red Hat Enterprise Linux 7 Reporter: Renaud Métrich <rmetrich>
Component: kernel Assignee: Denys Vlasenko <dvlasenk>
kernel sub component: Packaging QA Contact: zhijwang <zhijwang>
Status: POST --- Docs Contact:
Severity: high    
Priority: high CC: bmader, bwelterl, dvlasenk, hkrzesin, jstancek, kernel-qe, lilu, mkluson, mreznik, nmurray, ppaddhar, prjagtap, pstodulk, ptalbert, rhandlin, tmeszaro, zhijwang
Version: 7.9 Keywords: Triaged
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2108243    

Description Renaud Métrich 2023-02-01 08:37:47 UTC
Description of problem:

Since the fix for BZ #1893756, the bootloader entry is created in the %posttrans scriptlet only.
In the past (e.g. up to 3.10.0-1160.59.1.el7 included), it was done in 2 phases:
- postinstall to create the entry without the initrd (because initrd is not created yet)
- posttrans to update the entry with the initrd

As a result of this change, the upgrade from RHEL6 fails because grubby exits with the following error when adding the kernel:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
grubby fatal error: unable to find a suitable template
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

The system is then rebooted with no kernel entry available at all, which makes it completely unbootable.

The reason for this is detailed below:
1. Initially, after running redhat-upgrade-tool and before rebooting, there are at least 2 bootloader entries:

  title System Upgrade (redhat-upgrade-tool)
  --> the RHEL7 kernel used to upgrade
  title Red Hat Enterprise Linux Server (2.6.32-754.35.1.el6.x86_64)
  --> the RHEL6 kernel

2. Upon rebooting to perform the upgrade, the "System Upgrade" entry is deleted (this is to avoid breakage if the system upgrade fails)
3. The upgrade happens, which deletes the RHEL6 kernel and associated entry
4. %posttrans of the RHEL7 kernel executes, and grubby fails because no bootloader entry is left from which it could copy the kernel arguments
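
The sequence above can be illustrated with a small sketch that counts the boot entries left in a legacy GRUB configuration. This is not grubby's actual logic: the function name count_boot_entries and the idea of passing the configuration path as an argument are assumptions made for illustration only. A count of 0 corresponds to the situation in step 4, where nothing is left to serve as a template.

```shell
# Hypothetical helper (not part of grubby): count the "title" stanzas in
# the legacy GRUB configuration file given as $1.
count_boot_entries() {
    conf=$1
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so the exit status is deliberately ignored.
    grep -c '^[[:space:]]*title[[:space:]]' "$conf" || true
}
```

Called on a configuration that still holds the "System Upgrade" and RHEL6 entries it would print 2; after steps 2 and 3 it would print 0.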


Version-Release number of selected component (if applicable):

kernel-3.10.0-1160.62.1.el7 and later

How reproducible:

Always

Steps to Reproduce:
1. Setup a RHEL6 system and update it to latest
2. Install redhat-upgrade-tool

  # yum -y install preupgrade-assistant preupgrade-assistant-el6toel7 redhat-upgrade-tool
  # preupg

3. Prepare the upgrade with latest RHEL7 bits

  # redhat-upgrade-tool --nogpgcheck --network 7.9 --instrepo http://192.168.122.1/rhel79 --addrepo=latest='http://rhsm-pulp.corp.redhat.com/content/dist/rhel/server/7/7Server/x86_64/os' --cleanup-post

  In the command above, the RHEL 7.9 DVD is served over HTTP at the "/rhel79" location, and Pulp is used to fetch the latest packages (including the kernel).

4. Reboot to perform the system upgrade

Actual results:

No RHEL7 entry in the Grub configuration, and the following messages are displayed during the upgrade:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
[  416.029243] upgrade[2955]: grubby fatal error: unable to find a suitable template^M
[  416.064057] upgrade[2955]: [127/658] (80%) cleaning kernel-2.6.32-754.el6...^M
 :
[  443.656832] upgrade[2955]: running %posttrans script for kernel-3.10.0-1160.83.1.el7^M
[  473.216297] upgrade[2955]: grubby fatal error: unable to find a suitable template^M
[  512.360489] upgrade[2955]: grubby fatal error: unable to find a suitable template^M
 :
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Expected results:

RHEL7 entry in Grub configuration

Additional info:

I don't know whether the kernel package should fix this issue or whether it needs to be fixed in redhat-upgrade-tool. The latter would require an update of the tool in RHEL6, and I'm not sure there are still developers who know its internals.

At the time of the upgrade, we have a few facts available to "detect" that we are running as part of an upgrade:
- running kernel is always 3.10.0-1160.el7
- UPGRADE=1 is set in the environment
- DRACUT_SYSTEMD=1 is set in the environment
- UDEVVERSION=219 is set in the environment
- NEWROOT=/sysroot is set in the environment
- action=Boot is set in the environment

Maybe we could restore the old scriptlet behavior (2-phase entry creation) when all of these conditions are met.
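
A minimal sketch of such a detection, assuming it would live in a shell scriptlet. The function name in_upgrade_environment is hypothetical and is not taken from the kernel spec file. The running-kernel check (3.10.0-1160.el7) is left out so that the function depends only on the environment variables listed above.

```shell
# Hypothetical check: succeed only when every environment marker listed
# in this report is present with the expected value.
in_upgrade_environment() {
    [ "${UPGRADE:-}" = "1" ] &&
    [ "${DRACUT_SYSTEMD:-}" = "1" ] &&
    [ "${UDEVVERSION:-}" = "219" ] &&
    [ "${NEWROOT:-}" = "/sysroot" ] &&
    [ "${action:-}" = "Boot" ]
}
```

A scriptlet could branch on the exit status of this function to choose between the 2-phase entry creation and the current %posttrans-only path. Requiring all markers at once is a conservative choice that keeps the normal "yum update" path unchanged.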

Comment 3 Renaud Métrich 2023-02-01 08:51:51 UTC
I'm setting the Priority/Severity as HIGH because it's preventing customers from upgrading their RHEL6 systems.

Using the RHEL 7.9 DVD level for the upgrade instead of the latest bits is usually not possible when additional repositories (optional, supplementary, etc.) are enabled, because many newer packages in these repositories require more recent components than the RHEL 7.9 DVD level provides.

Comment 5 Petr Stodulka 2023-03-09 10:07:14 UTC
Hi, just confirming that Renaud is right. I've investigated the issue (https://bugzilla.redhat.com/show_bug.cgi?id=2108243#c10) and I see that the valid fix - and the best way to fix the issue - is to fix the scriptlet - either by providing more robust script or just reverting the change. As far as I know, we have customers who are currently upgrading, or preparing to upgrade, from RHEL 6 to RHEL 7. If they use up-to-date packages, as is officially required, they will hit this critical issue.

Comment 6 Petr Stodulka 2023-03-09 10:09:34 UTC
The bug has been introduced by the fix for the following BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1893756

Comment 7 Denys Vlasenko 2023-03-13 08:29:50 UTC
Testing the original fix proposed in mr 313. Interrupted install still works:

# yum install kernel-3.10.0-1160.89.1.el7.kpq.test.x86_64.rpm strace tcpdump mc gimp bzip2 traceroute gdb gcc firefox
...
  Installing : kernel-3.10.0-1160.89.1.el7.kpq.test.x86_64                       39/40 
  Installing : 1:mc-4.8.7-11.el7.x86_64                                          40/40 
^Z
[1]+  Stopped                 yum install kernel-3.10.0-1160.89.1.el7.kpq.test.x86_64.rpm strace tcpdump mc gimp bzip2 traceroute gdb gcc firefox
# reboot
...
# uname -sr
Linux 3.10.0-1160.89.1.el7.kpq.test.x86_64



Testing rhel6->rhel7 upgrade:

Install latest rhel6 (server, not client).

Get rhel7 install ISO image: rhel-server-7.9-x86_64-dvd.iso
(sha256sum:2cb36122a74be084c551bc7173d2d38a1cfb75c8ffbc1489c630c916d1b31b25 size:4526702592)

Get these packages:
preupgrade-assistant-2.6.2-1.el6.noarch.rpm
preupgrade-assistant-el6toel7-0.8.0-3.el6.noarch.rpm
preupgrade-assistant-el6toel7-data-0.20200704-1.el6.noarch.rpm
redhat-upgrade-tool-0.8.0-9.el6.noarch.rpm
(for example from https://access.redhat.com/downloads/content/69/ver=/rhel---6/6.10/x86_64/packages)
yum -y install *.rpm createrepo

Get the test kernel, in my case kernel-3.10.0-1160.89.1.el7.kpq.test.x86_64.rpm.

createrepo /path/to/test_kernel  # a dir with kernel-3.10.0-1160.89.1.el7.kpq.test.x86_64.rpm

Run a local http server which exports /path/to/test_kernel on http://127.0.0.1/

Run "preupg", it should finish with no errors precluding rhel6->rhel7 migration

The final step is to run "redhat-upgrade-tool", then reboot when prompted, and watch
the boot process to see whether the grub menu is broken.
(Note that a failed test makes the machine unbootable.)

redhat-upgrade-tool --nogpgcheck --iso rhel-server-7.9-x86_64-dvd.iso --cleanup-post
# ^^^ this should work - old kernel with no %posttrans changes is used, from ISO image

redhat-upgrade-tool --nogpgcheck --iso rhel-server-7.9-x86_64-dvd.iso --addrepo=latest='http://rhsm-pulp.corp.redhat.com/content/dist/rhel/server/7/7Server/x86_64/os' --cleanup-post
# ^^^ this should FAIL - kernel with buggy %posttrans change used, from rhsm-pulp

redhat-upgrade-tool --nogpgcheck --iso rhel-server-7.9-x86_64-dvd.iso --addrepo=latest='http://127.0.0.1/' --cleanup-post
# ^^^ this works in my testing (and I verified that the kernel used is indeed the test one)

Comment 9 Petr Stodulka 2023-03-13 16:37:32 UTC
*** Bug 2108243 has been marked as a duplicate of this bug. ***

Comment 10 Jan Stancek 2023-03-16 09:31:07 UTC
(In reply to Petr Stodulka from comment #5)
> Hi, just confirming that Renaud is right. I've investigated the issue
> (https://bugzilla.redhat.com/show_bug.cgi?id=2108243#c10) and I see that the
> valid fix - and the best way to fix the issue - is to fix the scriptlet -
> either by providing more robust script or just reverting the change.

Creating boot entries before there is initrd is guaranteed and over time proven to cause issues for customers.
I'd rather see systemd (new-kernel-pkg) or grubby be made more robust - for example by storing kernel parameters somewhere, if it is last kernel being uninstalled.
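
To make the "storing kernel parameters somewhere" idea concrete, here is a rough sketch under stated assumptions: a legacy GRUB configuration in which the "kernel" line carries the image path followed by the arguments. Both function names and the choice of output file are hypothetical; nothing like this exists in grubby or new-kernel-pkg as far as this report shows.

```shell
# Hypothetical helper: print the arguments of the first "kernel" line in
# the legacy GRUB configuration $1, without the kernel image path.
extract_kernel_args() {
    sed -n 's/^[[:space:]]*kernel[[:space:]][[:space:]]*[^[:space:]][^[:space:]]*[[:space:]]*//p' "$1" | head -n 1
}

# Hypothetical helper: save those arguments to file $2 so that a later
# "add kernel" step could reuse them when no template entry is left.
# Returns non-zero (and writes nothing) when no arguments were found.
save_kernel_args() {
    args=$(extract_kernel_args "$1")
    [ -n "$args" ] || return 1
    printf '%s\n' "$args" > "$2"
}
```

The saved string could, for example, be handed to grubby's --args option when the new entry is added. Whether that is the right integration point is left open here.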

Comment 11 Petr Stodulka 2023-03-20 12:00:06 UTC
> Creating boot entries before there is initrd is guaranteed and over time proven to cause issues for customers.
> I'd rather see systemd (new-kernel-pkg) or grubby be made more robust - for example by storing kernel parameters somewhere, if it is last kernel being uninstalled.

Has there been any situation in the original bug, when the kernel posttrans scriptlet has not been executed? If the scriptlet has always been executed, nothing should prevent the kernel package from dealing with the situation. Fixing the issue anywhere other than in the kernel scriptlet seems like too much work to me for RHEL 7.9, especially since this is a corner case of a corner case that we know people could hit:
* if they do an in-place upgrade 6 -> 7 (in 100% of cases on Intel)
* if they boot into a rescue kernel / live OS, manually remove all installed kernel packages from there, and then install a kernel again (which I would consider an unsupported action)

Comment 12 Denys Vlasenko 2023-04-09 18:38:02 UTC
(In reply to Petr Stodulka from comment #11)
> > Creating boot entries before there is initrd is guaranteed and over time proven to cause issues for customers.
> > I'd rather see systemd (new-kernel-pkg) or grubby be made more robust - for example by storing kernel parameters somewhere, if it is last kernel being uninstalled.
> 
> Has there been any situation in the original bug, when the kernel posttrans
> scriptlet has not been executed?

Yes, indeed.

The typical scenario in which this happens in the real world is when the admin simply runs
"yum update".

This tries to update many packages, and if any package's update script
is buggy in a way that makes "yum update" hang, the admin has little choice but to kill it.

In this case, if a newer kernel was already installed, there will be a new
grub entry for it, but no initramfs.

On next reboot, grub will not be able to find initramfs, and boot will fail.

I think we had about 15 user complaints about this happening.
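
The failure mode described above (a boot entry that points at an initramfs which was never generated) can be checked for with a sketch like the following. The function name find_missing_initrds is hypothetical, and the sketch assumes that initrd paths in the bootloader configuration are relative to the boot directory, as on systems with a separate /boot partition.

```shell
# Hypothetical check: print every initrd referenced in bootloader config
# $1 that does not exist under boot directory $2, and return non-zero if
# at least one is missing. Assumes paths in the config are relative to $2.
find_missing_initrds() {
    conf=$1
    bootdir=$2
    missing=0
    # Matches "initrd", "initrd16", "initrdefi" lines and keeps the path.
    for path in $(sed -n 's/^[[:space:]]*initrd[^[:space:]]*[[:space:]][[:space:]]*\([^[:space:]]*\).*/\1/p' "$conf"); do
        if [ ! -e "$bootdir/${path#/}" ]; then
            echo "$path"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Running such a check before a reboot would flag an entry left behind by an interrupted transaction. It does not repair anything by itself.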

Comment 13 Petr Stodulka 2023-04-14 12:21:26 UTC
Hi Denys, thanks for the info. This is the first time I've heard about such issues on RHEL, but it's true that real systems also contain a lot of custom and 3rd-party content that could trigger them, not to mention all the possible configurations of real systems.

Comment 15 Linqing Lu 2023-05-26 13:52:50 UTC
(In reply to Petr Stodulka from comment #6)
> The bug has been introduced by the fix for the following BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=1893756

@zhijwang 
Hi Zhijun,

Can you also take this bug as it's a follow up for 1893756?
Let us know if you need a hand.

Thanks!

Comment 16 zhijwang 2023-05-29 08:47:18 UTC
(In reply to Linqing Lu from comment #15)
> Hi Zhijun,
> 
> Can you also take this bug as it's a follow up for 1893756?
> Let us know if you need a hand.
> 
> Thanks!

Sure, I will take it. Thanks Linqing!