Bug 1755841

Summary: After Leapp upgrade 7.6->8.0, the machine boots with 7.6 kernel
Product: Red Hat Enterprise Linux 7 Reporter: Jiri Stransky <jstransk>
Component: leapp-repository Assignee: Leapp team <leapp-notifications>
Status: CLOSED ERRATA QA Contact: Alois Mahdal <amahdal>
Severity: high Docs Contact:
Priority: unspecified    
Version: 7.6 CC: fmartine, jfrancoa, mbocek, michele, mreznik, msekleta, pbabbar, pstodulk
Target Milestone: rc Keywords: Upgrades
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: leapp-repository-0.10.0-2.el7_8 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-29 01:45:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1727807    
Attachments:
Description Flags
successful upgrade 7.7->8.0 none
broken upgrade 7.7->8.0 none
sosreport broken 7.6->8.0 upgrade none

Description Jiri Stransky 2019-09-26 10:35:25 UTC
Description of problem:

Sometimes our Leapp upgrades from 7.7 to 8.0 succeed, but the system is in an incorrect state afterwards. I've noticed issues with syslog and iptables. The clearest sign of the problem is an incorrect /dev/log: it should be a symlink to the journald socket, but it is a plain socket instead.

Here is the output from two machines, both upgraded from 7.7 to 8.0 with Leapp:

Successful upgrade and correct state:

[root@controller-0 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.0 (Ootpa)
[root@controller-0 ~]# ll /dev/log
lrwxrwxrwx. 1 root root 28 Sep 25 11:47 /dev/log -> /run/systemd/journal/dev-log
[root@controller-0 ~]# logger -t test test
[root@controller-0 ~]#

Successful upgrade but incorrect state:

[root@controller-2 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.0 (Ootpa)
[root@controller-2 ~]# ll /dev/log
srw-rw-rw-. 1 root root 0 Sep 25 13:30 /dev/log
[root@controller-2 ~]# logger -t test test
logger: socket /dev/log: Connection refused
[root@controller-2 ~]#

I realize 7.7 -> 8.0 is not a supported upgrade path, but I think this issue could be hit in a supported upgrade too.
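
The /dev/log check shown in the two transcripts above can be scripted. The following is a minimal sketch, not part of Leapp: the function name and its return labels are hypothetical, and the expected symlink target is simply the one shown on the healthy machine.

```python
import os
import stat

# Target copied from the healthy machine's `ll /dev/log` output in this report.
EXPECTED_TARGET = "/run/systemd/journal/dev-log"


def classify_dev_log(path="/dev/log", expected=EXPECTED_TARGET):
    """Return 'ok', 'socket', 'missing', or 'unexpected' for the given path.

    Hypothetical helper: the labels are illustrative only.
    """
    try:
        st = os.lstat(path)  # lstat, so the symlink itself is inspected
    except FileNotFoundError:
        return "missing"
    if stat.S_ISLNK(st.st_mode):
        # Healthy state: a symlink pointing at the journald socket.
        return "ok" if os.readlink(path) == expected else "unexpected"
    if stat.S_ISSOCK(st.st_mode):
        # Broken state seen on controller-2: a plain socket, not a symlink.
        return "socket"
    return "unexpected"
```

Calling `classify_dev_log()` on the healthy machine would be expected to give 'ok', and 'socket' on the broken one.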


How reproducible:

Intermittent; it might be some sort of race condition. It occurs quite rarely in our testing (we have hit it just twice so far, but with different engineers and in different environments).


Steps to Reproduce:

We run the Leapp upgrade in testing without RHSM and skip the OS release check so that we can upgrade from 7.7. Essentially:

LEAPP_SKIP_CHECK_OS_RELEASE=1 LEAPP_DEVEL_SKIP_RHSM=1 sudo -E leapp upgrade --debug

I will upload logs from both working and broken upgrades.

Comment 2 Jiri Stransky 2019-09-26 10:41:44 UTC
Created attachment 1619445 [details]
successful upgrade 7.7->8.0

Comment 3 Jiri Stransky 2019-09-26 10:42:23 UTC
Created attachment 1619446 [details]
broken upgrade 7.7->8.0

Comment 5 Jiri Stransky 2019-10-02 08:38:43 UTC
Created attachment 1621762 [details]
sosreport broken 7.6->8.0 upgrade

Comment 6 Jiri Stransky 2019-10-02 08:42:41 UTC
An interesting thing I just noticed: the machine is upgraded to RHEL 8 and a RHEL 8 kernel is installed, but the machine is running a RHEL 7 kernel.

[root@controller-0 ~]# ll /dev/log 
srw-rw-rw-. 1 root root 0 říj  1 13:01 /dev/log

[root@controller-0 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.0 (Ootpa)

[root@controller-0 ~]# uname -a
Linux controller-0 3.10.0-957.21.3.el7.x86_64 #1 SMP Fri Jun 14 02:54:29 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

[root@controller-0 ~]# rpm -qa | grep kernel | sort
kernel-3.10.0-957.21.3.el7.x86_64
kernel-4.18.0-80.11.2.el8_0.x86_64
kernel-4.18.0-80.4.2.el8_0.x86_64
kernel-core-4.18.0-80.11.2.el8_0.x86_64
kernel-core-4.18.0-80.4.2.el8_0.x86_64
kernel-headers-4.18.0-80.11.2.el8_0.x86_64
kernel-modules-4.18.0-80.11.2.el8_0.x86_64
kernel-modules-4.18.0-80.4.2.el8_0.x86_64
kernel-modules-extra-4.18.0-80.11.2.el8_0.x86_64
kernel-modules-extra-4.18.0-80.4.2.el8_0.x86_64
kernel-rpm-macros-116-1.el8.noarch
kernel-tools-4.18.0-80.11.2.el8_0.x86_64
kernel-tools-libs-4.18.0-80.11.2.el8_0.x86_64
kernel-workaround-0.1-1.el8.noarch
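
The mismatch above (an el7 kernel running while el8 kernels are installed) can be detected mechanically from the `uname -r` and `rpm -qa` output. This is a rough sketch under stated assumptions: the helper names are hypothetical, and it only compares the `.elN` tag rather than doing a real RPM version comparison.

```python
import re


def el_major(release):
    """Extract the RHEL major version from a kernel release or package name.

    Returns None when no '.elN' tag is present.
    """
    match = re.search(r"\.el(\d+)", release)
    return int(match.group(1)) if match else None


def booted_old_kernel(running_release, installed_packages):
    """True if an installed kernel package has a higher .elN tag than the running kernel."""
    running = el_major(running_release)
    if running is None:
        return False
    for name in installed_packages:
        # Only the main 'kernel-<version>' packages, not kernel-core, kernel-tools, etc.
        if not re.match(r"kernel-\d", name):
            continue
        major = el_major(name)
        if major is not None and major > running:
            return True
    return False
```

Fed with the values from this comment (running release `3.10.0-957.21.3.el7.x86_64` and the package list above), `booted_old_kernel` would flag the machine.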

Comment 7 Petr Stodulka 2019-10-02 09:04:31 UTC
jstransk: It happens from time to time. We have no clear information on why the boot entry with the old kernel is used as the default, since we run a command to set the entry with the new kernel as the default. This could be resolved once we start removing all RHEL 7 RPM leftovers (unfortunately, there is no way to calculate the upgrade transaction with removal of the original kernel).
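
For illustration, setting and then verifying the default boot entry could look roughly like the sketch below. The comment does not say which command Leapp actually runs; `grubby --set-default` and `grubby --default-kernel` are used here as an assumption, as are the `/boot/vmlinuz-<release>` path convention and all helper names.

```python
import subprocess


def vmlinuz_path(release):
    """Assumed kernel image location for a given kernel release string."""
    return "/boot/vmlinuz-" + release


def set_default_cmd(release):
    """Build (but do not run) the grubby command that sets the default entry."""
    return ["grubby", "--set-default", vmlinuz_path(release)]


def default_matches(grubby_output, release):
    """Compare the output of `grubby --default-kernel` with the expected kernel."""
    return grubby_output.strip() == vmlinuz_path(release)


def verify_default(release, run=subprocess.run):
    """Ask grubby for the default kernel and check that it is the expected one."""
    result = run(
        ["grubby", "--default-kernel"],
        capture_output=True, text=True, check=True,
    )
    return default_matches(result.stdout, release)
```

Note that a check like this only confirms what the boot loader configuration says; it cannot show which entry GRUB will actually pick at boot.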

Comment 8 Michal Reznik 2019-10-03 14:05:57 UTC
controller-0 systemd[1]: systemd-journald-dev-log.socket: Failed to create symlink /run/systemd/journal/dev-log → /dev/log, ignoring: File exists  

This does indeed look like some kind of race condition. Unfortunately, I do not know what is actually creating "/dev/log". Maybe "systemd" can help to check?

Comment 9 Jiri Stransky 2019-10-03 15:26:34 UTC
I think the cause of the failure in systemd-journald-dev-log.socket could be that the service is meant to run with a RHEL 8 kernel, but the system is in fact running a RHEL 7 kernel. I don't know what the root cause of running the RHEL 7 kernel could be, though...

Comment 10 Jiri Stransky 2019-10-07 14:37:03 UTC
I'll amend the title to what we presently think is the root cause (or not entirely root but closer to root than the /dev/log issue).

Comment 11 Jiri Stransky 2019-11-12 16:24:13 UTC
Is there any chance this can be looked into? I think this is presently the nastiest bug for OpenStack: we don't have a workaround, and it intermittently breaks our upgrade testing.

Comment 12 Petr Stodulka 2019-11-13 13:57:52 UTC
The bug has higher priority now (I am setting the priority in the BZ as well to reflect it).

Comment 13 Petr Stodulka 2019-11-13 13:58:23 UTC
That means that, in the worst case, I will look at it again next week.

Comment 14 Michal Reznik 2019-11-13 14:33:44 UTC
@Jiri, would you be able to prepare a reproducer? Can we expect to hit the issue at least once in, let's say, 10 runs?

Comment 15 Michal Reznik 2019-11-13 15:03:23 UTC
@Jiri, were any actions performed right after the upgrade? Do you have any custom actors?

I see there are two RHEL 8 kernels after the upgrade:

kernel.x86_64    4.18.0-80.4.2.el8_0     @System
kernel.x86_64    4.18.0-80.11.2.el8_0    @rhosp-rhel-8.0-baseos

We would need logs from a system right after the upgrade, without any tuning. A reproducer would be preferable, though. Thanks...

Comment 16 Jiri Stransky 2019-11-14 15:35:34 UTC
I don't think we did anything extra after the RHEL upgrade besides investigating. I'll try to provide a reproducer. Unfortunately, the issue is intermittent and I haven't hit it recently (even in a single environment with all 3 OpenStack controller VMs configured the same, I only hit it on some of them). I'll keep a machine on standby in case I hit it again.

Comment 17 Javier Martinez Canillas 2019-12-16 13:47:03 UTC
Hello,

This looks very much like Bug #1640979, which was caused by GRUB not being able to sort entries correctly if they started with a number.

The bug was fixed by the following commit: https://github.com/rhboot/grub2/commit/291907f1cf6f51ec3929e20af3a99d00d9bc9e34. However, the fix is in the GRUB core, which is not updated when the grub2 package is upgraded. The GRUB core is installed in the gap that exists between the end of the Master Boot Record (MBR) and the start of the first partition.

To update the GRUB core, the grub2-install command has to be executed. So I think that Leapp should execute grub2-install after installing the RHEL 8 packages, to make sure that the GRUB in use is the latest one from RHEL 8 and not the one from RHEL 7.
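
The suggested step could be sketched as follows. This is only an illustration of running grub2-install against the boot disk, not the code that shipped in the fix: the helper names are hypothetical, and the boot device has to be supplied by the caller because nothing in this report says how it would be detected.

```python
import subprocess


def grub2_install_cmd(boot_device):
    """Build the grub2-install command for the given boot disk (e.g. /dev/sda)."""
    if not boot_device.startswith("/dev/"):
        raise ValueError("expected a block device path, got %r" % boot_device)
    return ["grub2-install", boot_device]


def reinstall_grub_core(boot_device, run=subprocess.run):
    """Rewrite the GRUB core on boot_device using the currently installed grub2.

    Intended to run after the RHEL 8 packages are in place, so that the core
    comes from the RHEL 8 grub2 build rather than the RHEL 7 one.
    """
    cmd = grub2_install_cmd(boot_device)
    run(cmd, check=True)  # raises CalledProcessError if grub2-install fails
    return cmd
```

The `run` parameter exists only so the command can be exercised without touching a real disk.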

Comment 24 Alois Mahdal 2020-04-24 15:22:38 UTC
The vast majority of tests have already passed, so with that, and with the nature of the original issue in mind:

VERIFIED on all platforms with:

      - leapp-0.10.0-2.el7_8
      - leapp-repository-0.10.0-2.el7_8

Comment 26 errata-xmlrpc 2020-04-29 01:45:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1959