Description of problem:
Overcloud upgrade run failed because container-prepare-image.yaml had the wrong reference to ceph3_image. The customer corrected it later and ran upgrade prepare to get the new values into the plan.
However, the overcloud upgrade run failed again and was still trying to pull the wrong ceph image.
The systemd files on the controllers were then manually updated to reference the correct ceph image, and the logs showed the correct image being pulled, but the overcloud upgrade run still hangs and the customer cannot proceed.
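A quick way to confirm which image a controller's systemd units still reference (a sketch; the unit file glob assumes a typical RHOSP 13 ceph-ansible deployment):

# Show the image each ceph systemd unit starts
grep -h 'ceph' /etc/systemd/system/ceph-*.service
# After editing a unit file, reload systemd so the change takes effect
systemctl daemon-reload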
Version-Release number of selected component (if applicable):
FFU from RHOSP13z14 to RHOSP16.1
The undercloud upgrade is done and the overcloud upgrade of the first controller is in progress.
Steps to Reproduce:
Actual results:
overcloud upgrade run hangs for controller1

Expected results:
upgrade run should proceed and succeed
Breakdown of the problem:
1) The customer used the wrong ceph image and needed to update it in the systemd service file for ceph to stop the recurring log messages about podman failing to pull the image. This was unrelated to the stuck upgrade.
2) The stuck upgrade was caused by the mysql upgrade container getting stuck: podman reported it as running, but nothing was happening.
3) The container was stuck due to the wrong kernel - after Leapp the system booted into the old 3.10.0-1160 kernel instead of 4...
4) This was due to Leapp failing to update the kernel, because the system itself was EFI based but /etc/fstab did not contain the EFI partition (see the checks sketched after this list).
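The breakdown above can be confirmed with standard commands (an illustrative sketch, not taken from the case logs; <container-id> is a placeholder):

# Is the stuck mysql upgrade container actually doing anything?
podman ps
podman logs <container-id>
# Which kernel did the node boot into after Leapp?
uname -r
rpm -q kernel
# Is the EFI system partition listed in fstab and mounted?
grep /boot/efi /etc/fstab
findmnt /boot/efi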
To break this down: during deployment, Ironic runs:
Jan 11 05:22:25 host-192-168-0-200 ironic-python-agent: 2021-01-11 05:22:25.019 2108 DEBUG oslo_concurrency.processutils [-] CMD "mount /dev/sda1 /tmp/tmpFpRF6S/boot/efi" returned: 0 in 0.076s execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:409
Jan 11 05:22:25 host-192-168-0-200 ironic-python-agent: 2021-01-11 05:22:25.281 2108 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): chroot /tmp/tmpFpRF6S /bin/sh -c "grub2-install /dev/sda" execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
Jan 11 05:22:27 host-192-168-0-200 ironic-python-agent: 2021-01-11 05:22:27.290 2108 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): chroot /tmp/tmpFpRF6S /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg" execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:372
On the controller we found signs that grub2-install was used on this EFI system, which creates a setup that is not Secure Boot compatible:
Boot0018* red HD(1,GPT,be5dd387-fc63-4d37-b5a8-68ccca72b172,0x800,0x64000)/File(\EFI\red\grubx64.efi)
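The entry above comes from the firmware boot manager and can be listed on any node (sketch):

# List EFI boot entries, including the loader path for each
efibootmgr -v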
Here we can see that the partitions are present on the disk, so if the system boots via EFI, it does so through a partition that is never mounted and therefore never updated:
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
Disk /dev/sda: 300.0 GB, 299966136320 bytes, 585871360 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes
Disk label type: gpt
Disk identifier: 7954D191-CD2D-4DC3-A3A2-2696AB9E3634
# Start End Size Type Name
1 2048 411647 200M EFI System primary
2 411648 413695 1M Microsoft basic primary
3 413696 585871325 279.2G Microsoft basic primary
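The listing above is plain fdisk output and can be reproduced with the command below (sketch; the GPT warning is expected from the RHEL 7 util-linux fdisk):

# Print the partition table of the system disk
fdisk -l /dev/sda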
Meanwhile /etc/fstab contains only the root filesystem and no entry for the EFI partition:
LABEL=img-rootfs / xfs defaults 0 1
So there are two problems:
1) /etc/fstab was not updated to include the EFI partition
2) grub2-install was incorrectly used on an EFI system
The fix is as follows. First find the UUID of the EFI partition with blkid:
/dev/sda1: SEC_TYPE="msdos" LABEL="efi-part" UUID="1930-AFD0" TYPE="vfat" PARTLABEL="primary" PARTUUID="c5f32f78-0c85-469c-8649-1bfb1f56d116"
Add an /etc/fstab record:
UUID="1930-AFD0" /boot/efi vfat umask=0077 0 1
dnf/yum reinstall grub2-efi-x64 shim-x64
efibootmgr -c --disk /dev/sda -p 1 -w -L RHEL -l "\\EFI\\redhat\\grubx64.efi"
grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
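Finally, a sanity check that everything landed where it should (sketch; grubby is available on the upgraded RHEL 8 node):

# The signed shim/GRUB binaries should now be on the EFI partition
ls /boot/efi/EFI/redhat/
# The default boot kernel should now be the new one installed by Leapp
grubby --default-kernel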
*** Bug 1906681 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 1936523 has been marked as a duplicate of this bug. ***