Bug 1827143 - kernel of a rt node from an upgrade between 4.4 and 4.5 lacks kvm module
Summary: kernel of a rt node from an upgrade between 4.4 and 4.5 lacks kvm module
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Sinny Kumari
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1771572
Reported: 2020-04-23 10:53 UTC by Karim Boumedhel
Modified: 2020-07-13 17:31 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:30:50 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 1691 | 0 | None | closed | Bug 1827143: daemon: consider addition of an RT kernel package as update | 2020-07-21 14:33:13 UTC
Red Hat Product Errata | RHBA-2020:2409 | 0 | None | None | None | 2020-07-13 17:31:03 UTC

Description Karim Boumedhel 2020-04-23 10:53:01 UTC
Description of problem:
When updating a cluster from 4.4 to 4.5, a worker RT node gets a new kernel where the kvm module is missing, although it is present in a fresh 4.5 install.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. deploy a 4.4 cluster
2. mark node as realtime
3. upgrade cluster to 4.5
4. check whether kvm module is present in the upgraded node
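
Step 4 can be checked from the node's package list. The following is only a minimal sketch, assuming the module is delivered by the `kernel-rt-kvm` package (as comment 3 states) and that an `rpm -qa` listing is piped in; `has_rt_kvm` is a hypothetical helper name, not part of any tool.

```shell
# has_rt_kvm: hypothetical helper. Reads an `rpm -qa` listing on stdin and
# succeeds only if a kernel-rt-kvm package appears in it.
has_rt_kvm() {
    grep -q '^kernel-rt-kvm-[0-9]'
}

# Possible use on a live cluster (<node> is a placeholder):
#   oc debug node/<node> -- chroot /host rpm -qa | has_rt_kvm && echo present || echo missing
```

The helper only inspects installed packages; it does not verify that the module can actually be loaded.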

Actual results:
kvm module is not there

Expected results:
kvm should be there


Additional info:
It can be argued that kvm was not present in the 4.4 RT worker, but since it is present when deploying 4.5 from the beginning, an upgrade to 4.5 should bring the node to the same state as a new install.

Comment 1 Antonio Murdaca 2020-04-24 10:35:50 UTC
Uhm, I'm moving this to RHCOS to better assess why it's missing and what we can do. I'll move it back to MCO if we need to install that ourselves (but that would be very weird, I think).

Comment 2 Marc Sluiter 2020-04-24 11:21:06 UTC
Note: the realtime kvm module was added recently and is available in OCP 4.5 only. And when installing OCP 4.5 directly it works as expected.
But somehow that module is missed during upgrade from 4.4 to 4.5.

Comment 3 Sinny Kumari 2020-04-24 11:48:58 UTC
The way the RT kernel is implemented currently, we only upgrade RT kernel related packages that are already installed on the host. Since the kernel-rt-kvm package was not shipped in 4.4, it was never installed on 4.4-based RHCOS nodes and hence it is not getting updated/installed.
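
The effect of "update only what is already installed" can be sketched as below. This is an illustration, not the MCO code: `rt_packages_to_update` is a hypothetical name, the installed list mirrors the package names seen on the 4.4 node in comment 12, and the 4.5 list is an assumed example.

```shell
# rt_packages_to_update: hypothetical sketch of the selection described above.
# $1 = package names installed on the host, $2 = RT package names shipped in
# the new release (both space-separated). Prints the names found in both.
rt_packages_to_update() {
    for pkg in $2; do
        case " $1 " in
            *" $pkg "*) echo "$pkg" ;;
        esac
    done
}

installed_44="kernel-rt-core kernel-rt-modules kernel-rt-modules-extra"
shipped_45="kernel-rt-core kernel-rt-modules kernel-rt-modules-extra kernel-rt-kvm"  # assumed list
rt_packages_to_update "$installed_44" "$shipped_45"
```

Because `kernel-rt-kvm` is absent from the installed list, it never appears in the result, which matches the behaviour reported in this bug.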

One hacky way to deal with a 4.4-based cluster is to install kernel-rt-kvm manually on the 4.4-based nodes by pulling the latest 4.5 machine-os-content image the RHCOS node is using and running `rpm-ostree install <local kernel-rt-kvm package location>`. This is a one-time task; during the next upgrade MCO should handle it since kernel-rt-kvm has already been installed.

Comment 4 Colin Walters 2020-04-24 18:25:29 UTC
I think we should change the MCO to try installing kernel-rt-kvm if it exists.

Comment 5 Colin Walters 2020-04-24 18:47:36 UTC
> I think we should change the MCO to try installing kernel-rt-kvm if it exists.

To elaborate, when doing an upgrade, if we have `kernel-rt`, also install `kernel-rt-kvm` if we don't already have it.
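
A rough sketch of that rule follows. It is not the actual MCO change: `extra_rt_packages` is a hypothetical name, and detecting the RT kernel through the `kernel-rt-core` package is an assumption based on the package names seen on RT nodes.

```shell
# extra_rt_packages: hypothetical helper illustrating the proposal.
# $1 = installed package names, $2 = package names available in the new OS
# content (both space-separated). Prints kernel-rt-kvm when it should be added.
extra_rt_packages() {
    installed=" $1 "
    available=" $2 "
    case "$installed" in
        *" kernel-rt-kvm "*) return 0 ;;   # already installed: nothing to add
    esac
    case "$installed" in
        *" kernel-rt-core "*) ;;           # RT kernel present (assumed marker)
        *) return 0 ;;                     # not an RT node: nothing to add
    esac
    case "$available" in
        *" kernel-rt-kvm "*) echo kernel-rt-kvm ;;  # install only if it exists
    esac
}

extra_rt_packages "kernel-rt-core kernel-rt-modules" "kernel-rt-core kernel-rt-modules kernel-rt-kvm"
```

The "if it exists" check from comment 4 is the last `case`: when the new content does not ship the package, nothing is requested.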

Comment 6 Antonio Murdaca 2020-04-24 18:50:14 UTC
(In reply to Colin Walters from comment #5)
> > I think we should change the MCO to try installing kernel-rt-kvm if it exists.
> 
> To elaborate, when doing an upgrade, if we have `kernel-rt`, also install
> `kernel-rt-kvm` if we don't already have it.

I think this is what we may want to do +1

(In reply to Sinny Kumari from comment #3)
> The way RT kernel is implemented currently, we only upgrade RT kernel
> related packages which are already installed on the host. Since
> kernel-rt-kvm package was not shipped in 4.4, it was never installed on 4.4
> based RHCOS node and hence they are not getting updated/installed.
> 
> One hacky way to deal with 4.4 based cluster is to install kernel-rt-kvm
> manually on 4.4 based nodes by pulling latest 4.5 machine-os-content image
> RHCOS node is using and running `rpm-ostree install <local kernel-rt-kvm
> package location>`. This is a one time task, during next upgrade MCO should
> handle it since kernel-rt-kvm is already been installed.

If I got Colin right, we'd just need to try to install that package once we boot into 4.5 RHCOS and a sync in MCD happens: we would look for that package and, if it's missing, install it.

Comment 7 Sinny Kumari 2020-04-27 07:41:55 UTC
(In reply to Colin Walters from comment #4)
> I think we should change the MCO to try installing kernel-rt-kvm if it
> exists.

Yes, we can. I am a bit concerned that every additional condition increases the chance of error if the list grows with time ...

Comment 8 Sinny Kumari 2020-04-27 07:44:04 UTC
(In reply to Sinny Kumari from comment #7)
> (In reply to Colin Walters from comment #4)
> > I think we should change the MCO to try installing kernel-rt-kvm if it
> > exists.
> 
> yes, we can. I am a bit concerned that every additional conditions increases
> chance of error if the list grows with time ...

s/error/bugs/

Comment 11 Sinny Kumari 2020-04-29 08:05:51 UTC
@Karim The fix should be available in the latest 4.5 nightly. Let us know if the cluster upgrade with the RT kernel works as expected for you. Thanks.

Comment 12 Michael Nguyen 2020-04-30 23:01:38 UTC
verified on 4.5.0-0.nightly-2020-04-30-112808

$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-134-199.us-west-2.compute.internal   Ready    worker   103m   v1.17.1
ip-10-0-143-138.us-west-2.compute.internal   Ready    master   116m   v1.17.1
ip-10-0-144-170.us-west-2.compute.internal   Ready    worker   104m   v1.17.1
ip-10-0-156-222.us-west-2.compute.internal   Ready    master   117m   v1.17.1
ip-10-0-163-220.us-west-2.compute.internal   Ready    worker   104m   v1.17.1
ip-10-0-165-195.us-west-2.compute.internal   Ready    master   116m   v1.17.1
$ oc debug node/ip-10-0-134-199.us-west-2.compute.internal 
Starting pod/ip-10-0-134-199us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# uname -a
Linux ip-10-0-134-199 4.18.0-147.8.1.rt24.101.el8_1.x86_64 #1 SMP PREEMPT RT Wed Feb 26 16:43:18 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# rpm -qa | grep kernel
kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64
kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64
kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
$ oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-30-112808 --force
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-30-112808


$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-30-051505   True        True          45m     Working towards 4.5.0-0.nightly-2020-04-30-112808: 84% complete
$ watch oc get clusterversion

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-30-112808   True        False         5m42s   Cluster version is 4.5.0-0.nightly-2020-04-30-112808

$ oc debug node/ip-10-0-134-199.us-west-2.compute.internal -- chroot /host uname -a
Starting pod/ip-10-0-134-199us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Linux ip-10-0-134-199 4.18.0-147.8.1.rt24.101.el8_1.x86_64 #1 SMP PREEMPT RT Wed Feb 26 16:43:18 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Removing debug pod ...
$ oc debug node/ip-10-0-134-199.us-west-2.compute.internal -- chroot /host rpm -qa | grep kernel
Starting pod/ip-10-0-134-199us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
kernel-rt-kvm-4.18.0-147.8.1.rt24.101.el8_1.x86_64
kernel-headers-4.18.0-147.8.1.el8_1.x86_64
kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64
kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64
kernel-devel-4.18.0-147.8.1.el8_1.x86_64
kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64
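
The before/after package lists from this verification can be compared mechanically. A minimal sketch, assuming both listings were saved as text; `new_packages` is a hypothetical helper, and the two lists are copied from the transcript above.

```shell
# new_packages: hypothetical helper. $1 = package list before the upgrade,
# $2 = package list after it (one package per line). Prints lines that occur
# only in $2. -F: fixed strings, -x: whole-line match, -v: invert the match.
# grep treats the newline-separated $1 as a list of patterns.
# Note: grep exits non-zero when nothing new is found.
new_packages() {
    printf '%s\n' "$2" | grep -Fxv -e "$1"
}

before='kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64
kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64
kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64'
after='kernel-rt-kvm-4.18.0-147.8.1.rt24.101.el8_1.x86_64
kernel-headers-4.18.0-147.8.1.el8_1.x86_64
kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64
kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64
kernel-devel-4.18.0-147.8.1.el8_1.x86_64
kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64'
new_packages "$before" "$after"
```

For these two lists the additions are the kernel-rt-kvm, kernel-headers and kernel-devel packages; the RT kernel version itself is unchanged.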

Comment 13 errata-xmlrpc 2020-07-13 17:30:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

