Bug 1922263 - [4.4.z] real-time kernel in RHCOS is not synchronized
Summary: [4.4.z] real-time kernel in RHCOS is not synchronized
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.4
Hardware: All
OS: All
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.4.z
Assignee: Steve Milner
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On: 1914469 1914988 1922262
Blocks:
Reported: 2021-01-29 14:43 UTC by Micah Abbott
Modified: 2021-03-08 18:04 UTC (History)
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: RHCOS build config was using a stage repo location for the kernel-rt package instead of the production repo location.
Consequence: The kernel-rt package would not be synchronized to the vanilla kernel package.
Fix: Change the RHCOS build config to use the production repo location.
Result: kernel-rt package is synchronized with vanilla kernel package.
Clone Of: 1922262
Environment:
Last Closed: 2021-03-08 18:04:47 UTC
Target Upstream Version:
Embargoed:



Description Micah Abbott 2021-01-29 14:43:21 UTC
+++ This bug was initially created as a clone of Bug #1922262 +++

+++ This bug was initially created as a clone of Bug #1914988 +++

+++ This bug was initially created as a clone of Bug #1914469 +++

Description of problem:
The Realtime (RT) variant of the RHEL kernel shipped in downstream RHCOS appears to not be synchronized with the standard kernel.

For example, the latest stable version shipping at the time this BZ was written is 4.6.9.  That corresponds to RHCOS 46.82.202012151054-0, which has kernel version 4.18.0-193.37.1.el8_2.x86_64.

When switching to the RT variant of the kernel via the Performance AddOn Operator, one gets booted into RT kernel version 4.18.0-193.28.1.rt13.77.el8_2.x86_64.  This should be a more recent kernel.

Additional info:
The 28.1 kernel is from October; the RT kernel should be on a later release, such as 37.1, which is from December.
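
The mismatch described above can be checked mechanically. The following is a minimal sketch, assuming that the part of the release string before `.rtNN` or `.elN` identifies the matching build. The helper names `kernel_base` and `kernels_in_sync` are hypothetical and not part of any RHCOS tooling.

```shell
#!/bin/sh
# Hypothetical helpers (assumption: not part of any RHCOS tooling).

# Print the release up to, but not including, the ".rtNN" or ".elN" suffix.
kernel_base() {
    release=$1
    release=${release%%.rt[0-9]*}   # drops ".rt13.77.el8_2.x86_64" if present
    release=${release%%.el[0-9]*}   # drops ".el8_2.x86_64" otherwise
    printf '%s\n' "$release"
}

# Succeed when both release strings share the same base release.
kernels_in_sync() {
    [ "$(kernel_base "$1")" = "$(kernel_base "$2")" ]
}

# Versions quoted in this report: standard kernel vs. RT kernel on 4.6.9.
if kernels_in_sync "4.18.0-193.37.1.el8_2.x86_64" \
                   "4.18.0-193.28.1.rt13.77.el8_2.x86_64"; then
    echo "in sync"
else
    echo "out of sync"
fi
```

With the two versions from this report the sketch prints `out of sync`, because the bases are 4.18.0-193.37.1 and 4.18.0-193.28.1.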

The nightly builds for 4.6 still carry the kernel version from October.  For example, mounting the machine-os-content of 4.6.0-0.nightly-2021-01-08-200800 and looking in the extensions folder shows:

./extensions/kernel-rt/kernel-headers-4.18.0-193.28.1.el8_2.x86_64.rpm
./extensions/kernel-rt/kernel-rt-core-4.18.0-193.28.1.rt13.77.el8_2.x86_64.rpm
./extensions/kernel-rt/kernel-rt-devel-4.18.0-193.28.1.rt13.77.el8_2.x86_64.rpm
./extensions/kernel-rt/kernel-rt-kvm-4.18.0-193.28.1.rt13.77.el8_2.x86_64.rpm
./extensions/kernel-rt/kernel-rt-modules-4.18.0-193.28.1.rt13.77.el8_2.x86_64.rpm
./extensions/kernel-rt/kernel-rt-modules-extra-4.18.0-193.28.1.rt13.77.el8_2.x86_64.rpm
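
A listing like the one above can be reduced to the RT base release with a short filter. This is only a sketch: the function name `rt_bases_in_listing` is hypothetical, and it assumes the file names follow the `kernel-rt-<subpackage>-<version>-<release>.rtNN...rpm` pattern seen here.

```shell
#!/bin/sh
# Hypothetical helper: read RPM paths on stdin and print the distinct base
# releases of the kernel-rt-* packages (the part before ".rtNN").
# Paths that are not kernel-rt-* packages, such as kernel-headers, are skipped.
rt_bases_in_listing() {
    sed -n 's|^.*/kernel-rt-[a-z-]*-\([0-9][0-9.]*-[0-9][0-9.]*\)\.rt[0-9].*\.rpm$|\1|p' |
        sort -u
}

# Sample input taken from the listing in this report.
rt_bases_in_listing <<'EOF'
./extensions/kernel-rt/kernel-headers-4.18.0-193.28.1.el8_2.x86_64.rpm
./extensions/kernel-rt/kernel-rt-core-4.18.0-193.28.1.rt13.77.el8_2.x86_64.rpm
./extensions/kernel-rt/kernel-rt-kvm-4.18.0-193.28.1.rt13.77.el8_2.x86_64.rpm
EOF
```

For this input the filter prints the single line `4.18.0-193.28.1`. More than one line of output would mean the RT packages in the extensions folder disagree with each other.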

There are numerous fixes in more recent RT kernel versions that are absolutely critical for low latency applications running on OpenShift 4.6.

--- Additional comment from Micah Abbott on 2021-01-11 14:44:34 UTC ---

RHCOS 4.6 is billed as an EUS release and uses the RHEL 8.2 EUS sources.  The kernel-rt package does not have an EUS release; instead, it uses the moniker "Telecommunications Update Service" (TUS).  See the most recent advisory for `kernel-rt` - https://access.redhat.com/errata/RHSA-2020:5428

The RHCOS build process is using the wrong location for TUS updates to `kernel-rt`, so we'll have to update our build process/configuration to use the proper location.

--- Additional comment from Micah Abbott on 2021-01-11 17:02:40 UTC ---

The RHCOS 4.6 build config is updated here - https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/1207

We will need to force a build of RHCOS 4.6 via https://gitlab.cee.redhat.com/openshift-art/rhcos-upshift/-/merge_requests/217

--- Additional comment from Micah Abbott on 2021-01-11 19:54:03 UTC ---

The fix landed in RHCOS 46.82.202101111741-0

```
(1/5): kernel-rt-kvm-4.18.0-193.37.1.rt13.87.el 3.5 MB/s | 3.2 MB     00:00    
(2/5): kernel-rt-modules-extra-4.18.0-193.37.1. 2.7 MB/s | 3.4 MB     00:01    
(3/5): kernel-rt-modules-4.18.0-193.37.1.rt13.8 7.1 MB/s |  24 MB     00:03    
(4/5): kernel-rt-core-4.18.0-193.37.1.rt13.87.e 7.2 MB/s |  27 MB     00:03    
(5/5): kernel-rt-devel-4.18.0-193.37.1.rt13.87. 6.4 MB/s |  15 MB     00:02    
```

--- Additional comment from OpenShift Automated Release Tooling on 2021-01-12 08:40:30 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.

--- Additional comment from Dave Cain on 2021-01-12 11:55:04 UTC ---

This looks to be addressed in 4.6.0-0.nightly-2021-01-12-084037.  Do we know what minor (.z) version this will appear in?

--- Additional comment from Micah Abbott on 2021-01-12 20:08:32 UTC ---

(In reply to Dave Cain from comment #4)
> This looks to be addressed in 4.6.0-0.nightly-2021-01-12-084037.  Do we know
> what minor (.z) version this will appear in?

4.6.10 has already been selected, so the first z-stream release it could be included in would be 4.6.11

--- Additional comment from errata-xmlrpc on 2021-01-13 05:30:21 UTC ---

This bug has been added to advisory RHSA-2021:0037 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com)

--- Additional comment from Michael Nguyen on 2021-01-13 16:02:02 UTC ---

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-01-12-084037   True        False         37m     Cluster version is 4.6.0-0.nightly-2021-01-12-084037
$ cat 99-worker-realtime.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: 99-worker-realtime
spec:
  config:
    ignition:
      version: 3.1.0
  kernelType: realtime

$ oc create -f 99-worker-realtime.yaml 
machineconfig.machineconfiguration.openshift.io/99-worker-realtime created
$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          eab9c35dfbeb0d21be6e1db3887acbbb93592d34   3.1.0             31m
00-worker                                          eab9c35dfbeb0d21be6e1db3887acbbb93592d34   3.1.0             31m
01-master-container-runtime                        eab9c35dfbeb0d21be6e1db3887acbbb93592d34   3.1.0             31m
01-master-kubelet                                  eab9c35dfbeb0d21be6e1db3887acbbb93592d34   3.1.0             31m
01-worker-container-runtime                        eab9c35dfbeb0d21be6e1db3887acbbb93592d34   3.1.0             31m
01-worker-kubelet                                  eab9c35dfbeb0d21be6e1db3887acbbb93592d34   3.1.0             31m
99-master-generated-registries                     eab9c35dfbeb0d21be6e1db3887acbbb93592d34   3.1.0             31m
99-master-ssh                                                                                 3.1.0             38m
99-worker-generated-registries                     eab9c35dfbeb0d21be6e1db3887acbbb93592d34   3.1.0             31m
99-worker-realtime                                                                            3.1.0             3s
99-worker-ssh                                                                                 3.1.0             38m
rendered-master-5d7bfa47cbeec95df59f71f33b975eb1   eab9c35dfbeb0d21be6e1db3887acbbb93592d34   3.1.0             31m
rendered-worker-3913f58d5e37f4cbf0cce63c0b49a0a3   eab9c35dfbeb0d21be6e1db3887acbbb93592d34   3.1.0             31m
rendered-worker-48aa8698ff39eae0d76d83d06b9f6978   eab9c35dfbeb0d21be6e1db3887acbbb93592d34   3.1.0             72s
$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-3913f58d5e37f4cbf0cce63c0b49a0a3   False     True       False      3              2                   2                     0                      33m
$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-3913f58d5e37f4cbf0cce63c0b49a0a3   False     True       False      3              2                   2                     0                      33m
$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-y3lkdbb-f76d1-hsvx6-master-0         Ready    master   54m   v1.19.0+9c69bdc
ci-ln-y3lkdbb-f76d1-hsvx6-master-1         Ready    master   54m   v1.19.0+9c69bdc
ci-ln-y3lkdbb-f76d1-hsvx6-master-2         Ready    master   54m   v1.19.0+9c69bdc
ci-ln-y3lkdbb-f76d1-hsvx6-worker-b-dd88k   Ready    worker   43m   v1.19.0+9c69bdc
ci-ln-y3lkdbb-f76d1-hsvx6-worker-c-wrkgp   Ready    worker   43m   v1.19.0+9c69bdc
ci-ln-y3lkdbb-f76d1-hsvx6-worker-d-nrsfm   Ready    worker   43m   v1.19.0+9c69bdc
$ oc debug node/ci-ln-y3lkdbb-f76d1-hsvx6-master-0
Starting pod/ci-ln-y3lkdbb-f76d1-hsvx6-master-0-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# uname -a
Linux ci-ln-y3lkdbb-f76d1-hsvx6-master-0 4.18.0-193.37.1.el8_2.x86_64 #1 SMP Sun Dec 6 19:59:00 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
$ oc debug node/ci-ln-y3lkdbb-f76d1-hsvx6-worker-b-dd88k
Starting pod/ci-ln-y3lkdbb-f76d1-hsvx6-worker-b-dd88k-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# uname -a
Linux ci-ln-y3lkdbb-f76d1-hsvx6-worker-b-dd88k 4.18.0-193.37.1.rt13.87.el8_2.x86_64 #1 SMP PREEMPT RT Mon Dec 7 13:13:06 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# rpm -qa | grep kernel
kernel-rt-modules-extra-4.18.0-193.37.1.rt13.87.el8_2.x86_64
kernel-rt-modules-4.18.0-193.37.1.rt13.87.el8_2.x86_64
kernel-rt-kvm-4.18.0-193.37.1.rt13.87.el8_2.x86_64
kernel-rt-core-4.18.0-193.37.1.rt13.87.el8_2.x86_64
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...

$ oc debug node/ci-ln-y3lkdbb-f76d1-hsvx6-worker-b-dd88k -- chroot /host rpm-ostree status
Starting pod/ci-ln-y3lkdbb-f76d1-hsvx6-worker-b-dd88k-debug ...
To use host binaries, run `chroot /host`
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eb3a8ce181e114db2b7f859b349b840721d109c89261623d87fac96b2499b1b9
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202101111741-0 (2021-01-11T17:44:58Z)
       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-193.37.1.el8_2
           LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules
                            kernel-rt-modules-extra

  ostree://cb0327325553e6922ff25822ea7eb1a2ec213e70c7cf6880965e7e2bb5ee7dea
                   Version: 46.82.202011260640-0 (2020-11-26T06:44:15Z)

--- Additional comment from errata-xmlrpc on 2021-01-18 01:35:55 UTC ---

Bug report changed to RELEASE_PENDING status by Errata System.
Advisory RHSA-2021:0037-08 has been changed to PUSH_READY status.
https://errata.devel.redhat.com/advisory/67864

--- Additional comment from errata-xmlrpc on 2021-01-18 18:00:27 UTC ---

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.12 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0037

Comment 4 Micah Abbott 2021-02-01 16:47:40 UTC
Moving back to MODIFIED so it can be added to the correct errata.

Comment 7 Micah Abbott 2021-02-10 15:40:55 UTC
Verified with 4.4.0-0.nightly-2021-02-09-013322

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2021-02-09-013322   True        False         21m     Cluster version is 4.4.0-0.nightly-2021-02-09-013322

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-rss714k-f76d1-25nsl-master-0         Ready    master   42m   v1.17.1+5ef953f
ci-ln-rss714k-f76d1-25nsl-master-1         Ready    master   42m   v1.17.1+5ef953f
ci-ln-rss714k-f76d1-25nsl-master-2         Ready    master   42m   v1.17.1+5ef953f
ci-ln-rss714k-f76d1-25nsl-worker-b-wct8h   Ready    worker   30m   v1.17.1+5ef953f
ci-ln-rss714k-f76d1-25nsl-worker-c-cmz2k   Ready    worker   30m   v1.17.1+5ef953f
ci-ln-rss714k-f76d1-25nsl-worker-d-mfqkk   Ready    worker   30m   v1.17.1+5ef953f

$ oc debug node/ci-ln-rss714k-f76d1-25nsl-worker-b-wct8h -- chroot /host uname -a
Starting pod/ci-ln-rss714k-f76d1-25nsl-worker-b-wct8h-debug ...
To use host binaries, run `chroot /host`
Linux ci-ln-rss714k-f76d1-25nsl-worker-b-wct8h 4.18.0-193.41.1.el8_2.x86_64 #1 SMP Wed Jan 13 11:33:33 EST 2021 x86_64 x86_64 x86_64 GNU/Linux

Removing debug pod ...

$ cat machineConfigs/worker-realtime.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: 99-worker-kerneltype
spec:
  kernelType: realtime
$ oc apply -f machineConfigs/worker-realtime.yaml 
machineconfig.machineconfiguration.openshift.io/99-worker-kerneltype created

$ oc debug node/ci-ln-rss714k-f76d1-25nsl-worker-b-wct8h -- chroot /host uname -a
Starting pod/ci-ln-rss714k-f76d1-25nsl-worker-b-wct8h-debug ...
To use host binaries, run `chroot /host`
Linux ci-ln-rss714k-f76d1-25nsl-worker-b-wct8h 4.18.0-193.41.1.rt13.91.el8_2.x86_64 #1 SMP PREEMPT RT Wed Jan 13 15:16:38 EST 2021 x86_64 x86_64 x86_64 GNU/Linux

Removing debug pod ...
```
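
The before and after `uname -a` checks in the verification above can be interpreted by a small script. This is a minimal sketch with hypothetical helper names (`is_rt_kernel`, `uname_release`). It assumes the RT build identifies itself with the string `PREEMPT RT`, as in the output shown in this report.

```shell
#!/bin/sh
# Hypothetical helpers for interpreting one line of `uname -a` output.

# Succeed if the line reports a PREEMPT RT build.
is_rt_kernel() {
    case $1 in
        *"PREEMPT RT"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the kernel release, the third whitespace-separated field.
uname_release() {
    printf '%s\n' "$1" | awk '{ print $3 }'
}

# Sample line taken from the verification output in this report.
line='Linux ci-ln-rss714k-f76d1-25nsl-worker-b-wct8h 4.18.0-193.41.1.rt13.91.el8_2.x86_64 #1 SMP PREEMPT RT Wed Jan 13 15:16:38 EST 2021 x86_64 x86_64 x86_64 GNU/Linux'
if is_rt_kernel "$line"; then
    echo "RT kernel: $(uname_release "$line")"
fi
```

For the sample line this prints `RT kernel: 4.18.0-193.41.1.rt13.91.el8_2.x86_64`.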

