Bug 1925078

Summary: RHOSP13-16.1 FFU: Overcloud upgrade hangs in controller after failed attempt with reference to wrong ceph image.
Product: Red Hat OpenStack Reporter: Shravan Kumar Tiwari <shtiwari>
Component: openstack-tripleo-heat-templates Assignee: Lukas Bezdicka <lbezdick>
Status: CLOSED ERRATA QA Contact: Jason Grosso <jgrosso>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 16.1 (Train) CC: apetrich, astupnik, fj-lsoft-ofuku, gfidente, igallagh, jjoyce, jkreger, jpretori, jschluet, kthakre, lbezdick, mburns, msufiyan, slinaber, spower, tvignaud, vgrosu
Target Milestone: z4 Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: All   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20210104205662.el8ost.2 Doc Type: Known Issue
Doc Text:
Systems that use UEFI boot and a UEFI bootloader in OSP13 might run into a UEFI issue that results in:
* /etc/fstab not being updated
* grub-install used incorrectly on EFI system
If your systems use UEFI, contact Red Hat Technical Support. For more information, see the Red Hat Knowledgebase solution https://access.redhat.com/solutions/5861031[FFU 13 to 16.1: Leapp fails to update the kernel on UEFI based systems and /etc/fstab does not contain the EFI partition]
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-03-17 15:36:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1768952    

Description Shravan Kumar Tiwari 2021-02-04 11:20:09 UTC
Description of problem:

The overcloud upgrade run failed because container-prepare-image.yaml had the wrong reference to ceph3_image. The customer corrected it later and ran upgrade prepare again to get the new values into the plan.

However, the overcloud upgrade run failed again and was still trying to pull the wrong ceph image.

Later, the systemd files on the controllers were manually updated to reference the correct ceph image. The logs then showed that the image pull happened for the correct image, but the overcloud upgrade run still hangs and the customer is not able to proceed.



Version-Release number of selected component (if applicable):
FFU from RHOSP13z14 to RHOSP16.1

undercloud upgrade and overcloud upgrade for first controller is in progress.


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
overcloud upgrade run hangs for controller1

Expected results:
The upgrade run should proceed and succeed.

Additional info:

Comment 3 Lukas Bezdicka 2021-02-05 16:29:45 UTC
Breakdown of the problem:

1) The customer used the wrong ceph image and needed to update it in the systemd service file for ceph to stop the recurring log message about podman failing to pull the image. This was irrelevant to the stuck upgrade.
2) The stuck upgrade came from the mysql upgrade container getting stuck, with podman reporting it as running but nothing happening.
3) The container was stuck due to the wrong kernel: post Leapp, the system booted into the old 3.10.0-1160 instead of 4...
4) This was due to Leapp failing to update the kernel because the system itself was EFI based but /etc/fstab does not contain the EFI partition.
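
A quick way to spot point 3 on a node is to look at the major version of the running kernel. The snippet below is only a minimal sketch: kernel_major and booted_upgraded_kernel are hypothetical helper names, not part of Leapp or TripleO, and the check assumes that a RHEL 8 kernel has major version 4 or later while the pre-upgrade kernel is 3.10.0.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the major version of a kernel release
# string such as "3.10.0-1160" (everything before the first dot).
kernel_major() {
    local release="$1"
    printf '%s\n' "${release%%.*}"
}

# Hypothetical check: succeed only if the release has major version 4
# or later, which is assumed here to mean "post-Leapp kernel".
booted_upgraded_kernel() {
    local major
    major="$(kernel_major "$1")"
    [ "$major" -ge 4 ]
}

# Example: inspect the kernel the node is currently running.
if booted_upgraded_kernel "$(uname -r)"; then
    echo "kernel $(uname -r): major version is 4 or later"
else
    echo "kernel $(uname -r): node may still be on the pre-upgrade kernel"
fi
```

If the second message appears on a controller after the Leapp reboot, the node is likely in the state described in point 3.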

To break this down: during deployment, Ironic runs:

...
Jan 11 05:22:25 host-192-168-0-200 ironic-python-agent[2108]: 2021-01-11 05:22:25.019 2108 DEBUG oslo_concurrency.processutils [-] CMD "mount /dev/sda1 /tmp/tmpFpRF6S/boot/efi" returned: 0 in 0.076s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409
Jan 11 05:22:25 host-192-168-0-200 ironic-python-agent[2108]: 2021-01-11 05:22:25.281 2108 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): chroot /tmp/tmpFpRF6S /bin/sh -c "grub2-install /dev/sda" execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
Jan 11 05:22:27 host-192-168-0-200 ironic-python-agent[2108]: 2021-01-11 05:22:27.290 2108 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): chroot /tmp/tmpFpRF6S /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg" execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
...

On the controller we found signs of grub-install having been used on an EFI system, which creates setups that are not Secure Boot compatible:

Boot0018* red   HD(1,GPT,be5dd387-fc63-4d37-b5a8-68ccca72b172,0x800,0x64000)/File(\EFI\red\grubx64.efi)
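
To check other nodes for the same kind of entry, the loader path can be pulled out of each `efibootmgr -v` line and compared with the expected directory. This is a rough sketch only: efi_loader_path and loader_in_redhat_dir are hypothetical helper names, and treating `\EFI\redhat\` as the expected directory is an assumption taken from the paths used elsewhere in this bug.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the loader path from one line of
# `efibootmgr -v` output, i.e. the text inside File(...).
efi_loader_path() {
    printf '%s\n' "$1" | sed -n 's/.*File(\([^)]*\)).*/\1/p'
}

# Hypothetical check: succeed if the loader lives under \EFI\redhat\.
# The single quotes keep the backslashes literal in the case pattern.
loader_in_redhat_dir() {
    case "$(efi_loader_path "$1")" in
        '\EFI\redhat\'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (run as root on a UEFI node): list entries outside \EFI\redhat\.
# efibootmgr -v | grep 'File(' | while IFS= read -r entry; do
#     loader_in_redhat_dir "$entry" || printf 'unexpected: %s\n' "$entry"
# done
```

The example loop is left commented out because efibootmgr needs root and a UEFI system. The entry shown above would be reported as unexpected, since its loader sits under `\EFI\red\`.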


Here we can see that the partitions are present on the disk, so if the system boots via EFI, it does so through an unmounted and not updated partition:

0080-sosreport-oscar02ctr001-2021-01-14-keewgqn.tar.xz/sosreport-oscar02ctr001-2021-01-14-keewgqn/sos_commands/block/fdisk_-l_.dev.sda 
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sda: 300.0 GB, 299966136320 bytes, 585871360 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disk label type: gpt
Disk identifier: 7954D191-CD2D-4DC3-A3A2-2696AB9E3634


#         Start          End    Size  Type            Name
 1         2048       411647    200M  EFI System      primary
 2       411648       413695      1M  Microsoft basic primary
 3       413696    585871325  279.2G  Microsoft basic primary


0080-sosreport-oscar02ctr001-2021-01-14-keewgqn.tar.xz/sosreport-oscar02ctr001-2021-01-14-keewgqn/etc/fstab 
LABEL=img-rootfs / xfs defaults 0 1




Issues:
1) /etc/fstab was not updated
2) grub-install was incorrectly used on EFI system
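
Issue 1 can be detected before starting Leapp with a small check along these lines. It is a minimal sketch under stated assumptions: fstab_has_efi_mount is a hypothetical helper name, and the presence of /sys/firmware/efi is used as the indicator that the node was booted through UEFI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given fstab file has an active
# (non-comment) entry whose mount point is /boot/efi.
fstab_has_efi_mount() {
    [ -r "$1" ] || return 1
    awk '$1 !~ /^#/ && $2 == "/boot/efi" { found = 1 } END { exit !found }' "$1"
}

# Report the mismatch from this bug: the node booted via UEFI, but the
# EFI partition is not listed in /etc/fstab.
if [ -d /sys/firmware/efi ] && ! fstab_has_efi_mount /etc/fstab; then
    echo "UEFI system without a /boot/efi entry in /etc/fstab"
fi
```

With the single-line fstab shown in the sosreport above, fstab_has_efi_mount fails, which is the condition that caused Leapp to skip the kernel update here.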

Comment 5 Lukas Bezdicka 2021-02-11 16:00:46 UTC
blkid output:
/dev/sda1: SEC_TYPE="msdos" LABEL="efi-part" UUID="1930-AFD0" TYPE="vfat" PARTLABEL="primary" PARTUUID="c5f32f78-0c85-469c-8649-1bfb1f56d116"

add /etc/fstab record:
UUID="1930-AFD0" /boot/efi vfat umask=0077 0 1

mount /boot/efi

dnf/yum reinstall grub2-efi-x64 shim-x64

efibootmgr -c --disk /dev/sda -p 1 -w -L RHEL -l "\\EFI\\redhat\\grubx64.efi" 

grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
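
When the fstab step has to be repeated on several nodes, the record can be derived from the blkid line instead of being typed by hand. The following is a minimal sketch, not a supported procedure: efi_fstab_record is a hypothetical helper name, it only prints the record in the same form as above, and it does not modify /etc/fstab.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the /etc/fstab record for the EFI
# partition from one line of blkid output. The leading space in the
# sed pattern keeps PARTUUID="..." from being matched by mistake.
efi_fstab_record() {
    local uuid
    uuid="$(printf '%s\n' "$1" | sed -n 's/.* UUID="\([^"]*\)".*/\1/p')"
    [ -n "$uuid" ] || return 1
    printf 'UUID="%s" /boot/efi vfat umask=0077 0 1\n' "$uuid"
}

# Example (run as root; /dev/sda1 is the EFI partition in this bug):
# efi_fstab_record "$(blkid /dev/sda1)"
```

Review the printed line before appending it to /etc/fstab, then continue with the mount, reinstall, efibootmgr and grub2-mkconfig steps listed above.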


https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/ch-working_with_the_grub_2_boot_loader

Comment 6 Lukas Bezdicka 2021-02-24 12:37:54 UTC
*** Bug 1906681 has been marked as a duplicate of this bug. ***

Comment 23 errata-xmlrpc 2021-03-17 15:36:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0817

Comment 24 Steve Baker 2021-03-18 19:14:29 UTC
*** Bug 1936523 has been marked as a duplicate of this bug. ***