Bug 1894910 - Update the node to use the real-time kernel fails
Summary: Update the node to use the real-time kernel fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Steve Milner
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1894972
Reported: 2020-11-05 12:13 UTC by Artyom
Modified: 2021-09-06 07:11 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1894972
Environment:
Last Closed: 2021-02-24 15:31:24 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHSA-2020:5633  0        None      None    None     2021-02-24 15:31:54 UTC

Description Artyom 2020-11-05 12:13:40 UTC
Description of problem:
Our CI recently started to fail because the node drops into the degraded state when we try to update it with a MachineConfig that has the real-time kernel option enabled.

I saw two different errors:

1. https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-kni_performance-addon-operators/433/pull-ci-openshift-kni-performance-addon-operators-master-e2e-gcp/1324108099925053440/artifacts/e2e-gcp/gather-extra/

{
                        "lastTransitionTime": "2020-11-04T23:10:07Z",
                        "message": "Node ci-op-lx8l2lsg-24cc7-fg9bn-worker-b-8bvw7 is reporting: \"error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: error: System transaction in progress\\n: exit status 1\"",
                        "reason": "1 nodes are reporting degraded status on sync",
                        "status": "True",
                        "type": "NodeDegraded"
                    },

2. https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-kni_performance-addon-operators/434/pull-ci-openshift-kni-performance-addon-operators-master-e2e-gcp/1323998850603552768/artifacts/e2e-gcp/gather-extra/

"message": "Node ci-op-4pgtrg3b-24cc7-zz7c8-worker-b-58l2r is reporting: \"error removing staged deployment: error running rpm-ostree cleanup -p: error: System transaction in progress\\n: exit status 1: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: Checking out tree 30e9764...done\\nEnabled rpm-md repositories: coreos-extensions\\nrpm-md repo 'coreos-extensions' (cached); generated: 2020-11-04T00:35:32Z\\nImporting rpm-md...done\\nResolving dependencies...done\\nerror: Could not depsolve transaction; 4 problems detected:\\n Problem 1: conflicting requests\\n  - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n Problem 2: package kernel-rt-modules-extra-4.18.0-240.rt7.54.el8.x86_64 requires kernel-rt-uname-r = 4.18.0-240.rt7.54.el8.x86_64, but none of the providers can be installed\\n  - conflicting requests\\n  - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n Problem 3: package kernel-rt-modules-4.18.0-240.rt7.54.el8.x86_64 requires kernel-rt-uname-r = 4.18.0-240.rt7.54.el8.x86_64, but none of the providers can be installed\\n  - conflicting requests\\n  - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n Problem 4: package kernel-rt-kvm-4.18.0-240.rt7.54.el8.x86_64 requires kernel-rt = 4.18.0-240.rt7.54.el8, but none of the providers can be installed\\n  - conflicting requests\\n  - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n: exit status 1\"",
                        "reason": "1 nodes are reporting degraded status on sync"
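
The depsolve output in the second error repeats one root cause across all four problems. As a rough illustration only, the following Python sketch pulls the "nothing provides ... needed by ..." pairs out of such a condition message. The function name `missing_dependencies` and the shortened `sample` string are hypothetical and are not part of the MCO or rpm-ostree; the sketch assumes the message still carries the JSON escapes `\n` and `\u003e` as shown above.

```python
import re


def missing_dependencies(message: str) -> dict[str, set[str]]:
    """Map each unsatisfied requirement to the packages that need it."""
    # Undo the two JSON escapes that appear in the condition message.
    text = message.replace("\\u003e", ">").replace("\\n", "\n")
    # The depsolve error reports each gap as
    # "nothing provides <requirement> needed by <package>".
    pattern = re.compile(r"nothing provides (.+?) needed by (\S+)")
    result: dict[str, set[str]] = {}
    for requirement, package in pattern.findall(text):
        result.setdefault(requirement, set()).add(package)
    return result


# Shortened excerpt of the message above (hypothetical sample for illustration).
sample = (
    "error: Could not depsolve transaction; 4 problems detected:\\n"
    " Problem 1: conflicting requests\\n"
    "  - nothing provides linux-firmware \\u003e= 20200619-99.git3890db36 "
    "needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n"
    " Problem 4: package kernel-rt-kvm-4.18.0-240.rt7.54.el8.x86_64 requires "
    "kernel-rt = 4.18.0-240.rt7.54.el8, but none of the providers can be installed\\n"
    "  - conflicting requests\\n"
    "  - nothing provides linux-firmware \\u003e= 20200619-99.git3890db36 "
    "needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n"
)

for requirement, packages in missing_dependencies(sample).items():
    print(requirement, "<-", sorted(packages))
```

Deduplicating by requirement makes it easier to see that every reported problem traces back to the single missing linux-firmware version.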

Version-Release number of selected component (if applicable):
master

How reproducible:
Always under the CI

Steps to Reproduce:
1.
2.
3.

Actual results:
The update of the node to work with the RT kernel fails

Expected results:
The update of the node to work with the RT kernel should succeed

Additional info:
You can find all relevant information (MCP, MC, must-gather, ...) under the CI links provided above.

Comment 1 Sinny Kumari 2020-11-05 13:33:30 UTC
The 4.6 and 4.7 issues are most likely related. We see a trimmed error message in 4.6 because it does not have verbose logging from rpm-ostree enabled - https://github.com/openshift/machine-config-operator/pull/2097.

It seems RHCOS is shipping linux-firmware-20200512-98.gitb2cad6a2.el8, but the latest machine-os-content ships the kernel-rt 4.18.0-240.rt7.54.el8 package, which needs linux-firmware >= 20200619-99.git3890db36. machine-os-content needs to be updated so that the correct linux-firmware dependency is available and the kernel-rt install can succeed.
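
To make the version gap above concrete, here is a minimal Python sketch that compares the shipped and required linux-firmware version-release strings. It is a deliberate simplification and not RPM's real version comparison algorithm: it only looks at the numeric version and the leading number of the release, which happens to be enough for these date-based strings. The helper names `parse_evr` and `satisfies_minimum` are hypothetical.

```python
import re


def parse_evr(evr: str) -> tuple[int, int]:
    """Return (version, leading release number) from e.g. '20200512-98.gitb2cad6a2.el8'.

    Simplified on purpose: this is NOT rpm's full version comparison.
    """
    match = re.match(r"(\d+)-(\d+)", evr)
    if match is None:
        raise ValueError(f"unexpected version-release format: {evr!r}")
    return int(match.group(1)), int(match.group(2))


def satisfies_minimum(installed: str, required: str) -> bool:
    """True if the installed version-release is at least the required one."""
    return parse_evr(installed) >= parse_evr(required)


shipped = "20200512-98.gitb2cad6a2.el8"   # linux-firmware in RHCOS, per this comment
required = "20200619-99.git3890db36"      # minimum needed by kernel-rt-core

print(satisfies_minimum(shipped, required))  # False
```

On a real host, the package manager itself is the authority on whether a dependency is satisfied; the sketch only restates the arithmetic behind the depsolve failure.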

Marking this bug as urgent because it also affects the MCO 4.7 and 4.6 CI:
4.6 - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2193/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op/1324193039505166336
4.7 - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2035/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1324274818467500032

Comment 2 Colin Walters 2020-11-05 13:56:19 UTC
This is on track to be fixed by https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/1162

Comment 3 Micah Abbott 2020-11-05 14:00:25 UTC
Targeting 4.7; will need a clone for 4.6.z

Comment 7 Michael Nguyen 2020-11-10 22:41:22 UTC
Verified on RHCOS 47.82.202011100542-0

$ cat << EOF > rt.yaml
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
>   labels:
>     machineconfiguration.openshift.io/role: "worker"
>   name: worker-kerneltype
> spec:
>   kernelType: realtime
> EOF
$ oc create -f rt.yaml 
machineconfig.machineconfiguration.openshift.io/worker-kerneltype created
$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
00-worker                                          da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
01-master-container-runtime                        da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
01-master-kubelet                                  da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
01-worker-container-runtime                        da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
01-worker-kubelet                                  da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
99-master-generated-registries                     da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
99-master-ssh                                                                                 3.1.0             5h37m
99-worker-generated-registries                     da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
99-worker-ssh                                                                                 3.1.0             5h37m
rendered-master-8d25b9ae487bc5e7ffb021bd93bfff7d   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
rendered-worker-344e86d98ae75cde6fb5a5e2997bf82c   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             48m
rendered-worker-69dac79db33505219af92d594dbbc383   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
rendered-worker-903310a06a3daf6543a338b18daeee4f   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             16m
rendered-worker-e6858708d022f5e2ad4b50ef033be75a   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h9m
test-file                                                                                     3.1.0             48m
worker-kerneltype                                                                                               4s
$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-903310a06a3daf6543a338b18daeee4f   False     True       False      3              0                   0                     0                      5h33m
$ watch oc get node
$ oc debug node/ip-10-0-194-240.us-west-2.compute.internal
Starting pod/ip-10-0-194-240us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`

If you don't see a command prompt, try pressing enter.
sh-4.2# 
sh-4.2# chroot /host
sh-4.4# uname -a    
Linux ip-10-0-194-240 4.18.0-193.28.1.rt13.77.el8_2.x86_64 #1 SMP PREEMPT RT Fri Oct 16 14:11:07 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# rpm -qa | grep kernel
kernel-rt-modules-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-core-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-modules-extra-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-kvm-4.18.0-193.28.1.rt13.77.el8_2.x86_64
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-10-093436   True        False         5h14m   Cluster version is 4.7.0-0.nightly-2020-11-10-093436
$ oc debug node/ip-10-0-194-240.us-west-2.compute.internal -- chroot /host rpm-ostree status
Starting pod/ip-10-0-194-240us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b22ac1787cafdd263f4fb2bb80dbdb1ec702d383d0eed13e4954a012d5d80dd6
              CustomOrigin: Managed by machine-config-operator
                   Version: 47.82.202011100542-0 (2020-11-10T05:46:41Z)
       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-193.29.1.el8_2
           LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules
                            kernel-rt-modules-extra

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b22ac1787cafdd263f4fb2bb80dbdb1ec702d383d0eed13e4954a012d5d80dd6
              CustomOrigin: Managed by machine-config-operator
                   Version: 47.82.202011100542-0 (2020-11-10T05:46:41Z)

Removing debug pod ...
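
The manual check of `rpm-ostree status` above could be scripted along these lines. This is only a sketch: it assumes the plain-text layout shown in this comment (booted deployment marked with `* `, deployments separated by a blank line, wrapped package lists on continuation lines without a colon), and the function name `booted_layered_packages` is hypothetical.

```python
def booted_layered_packages(status_text: str) -> list[str]:
    """Collect LayeredPackages of the booted ('* ') deployment from status text."""
    packages: list[str] = []
    in_booted = False
    in_layered = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("* "):
            # The booted deployment is the one prefixed with an asterisk.
            in_booted, in_layered = True, False
            continue
        if in_booted and not stripped:
            break  # a blank line ends the booted deployment block
        if not in_booted:
            continue
        if stripped.startswith("LayeredPackages:"):
            in_layered = True
            packages += stripped.split(":", 1)[1].split()
        elif in_layered and ":" not in stripped:
            packages += stripped.split()  # wrapped continuation line
        else:
            in_layered = False
    return packages


# Status output copied from the verification above.
sample_status = "\n".join([
    "State: idle",
    "Deployments:",
    "* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b22ac1787cafdd263f4fb2bb80dbdb1ec702d383d0eed13e4954a012d5d80dd6",
    "              CustomOrigin: Managed by machine-config-operator",
    "                   Version: 47.82.202011100542-0 (2020-11-10T05:46:41Z)",
    "       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-193.29.1.el8_2",
    "           LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules",
    "                            kernel-rt-modules-extra",
    "",
    "  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b22ac1787cafdd263f4fb2bb80dbdb1ec702d383d0eed13e4954a012d5d80dd6",
    "              CustomOrigin: Managed by machine-config-operator",
    "                   Version: 47.82.202011100542-0 (2020-11-10T05:46:41Z)",
])

# Packages the MCO installs for the real-time kernel, per the failing command in the description.
expected = {"kernel-rt-core", "kernel-rt-kvm", "kernel-rt-modules", "kernel-rt-modules-extra"}
print(expected.issubset(booted_layered_packages(sample_status)))  # True
```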

Comment 10 errata-xmlrpc 2021-02-24 15:31:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

