Description of problem:
Our CI recently started failing because a node goes into the Degraded state when we try to update it to a MachineConfig that has the real-time option enabled. I saw two different errors:

1. https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-kni_performance-addon-operators/433/pull-ci-openshift-kni-performance-addon-operators-master-e2e-gcp/1324108099925053440/artifacts/e2e-gcp/gather-extra/

{
  "lastTransitionTime": "2020-11-04T23:10:07Z",
  "message": "Node ci-op-lx8l2lsg-24cc7-fg9bn-worker-b-8bvw7 is reporting: \"error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: error: System transaction in progress\\n: exit status 1\"",
  "reason": "1 nodes are reporting degraded status on sync",
  "status": "True",
  "type": "NodeDegraded"
},

2. https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-kni_performance-addon-operators/434/pull-ci-openshift-kni-performance-addon-operators-master-e2e-gcp/1323998850603552768/artifacts/e2e-gcp/gather-extra/

"message": "Node ci-op-4pgtrg3b-24cc7-zz7c8-worker-b-58l2r is reporting: \"error removing staged deployment: error running rpm-ostree cleanup -p: error: System transaction in progress\\n: exit status 1: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: Checking out tree 30e9764...done\\nEnabled rpm-md repositories: coreos-extensions\\nrpm-md repo 'coreos-extensions' (cached); generated: 2020-11-04T00:35:32Z\\nImporting rpm-md...done\\nResolving dependencies...done\\nerror: Could not depsolve transaction; 4 problems detected:\\n Problem 1: conflicting requests\\n - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n Problem 2: package kernel-rt-modules-extra-4.18.0-240.rt7.54.el8.x86_64 requires kernel-rt-uname-r = 4.18.0-240.rt7.54.el8.x86_64, but none of the providers can be installed\\n - conflicting requests\\n - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n Problem 3: package kernel-rt-modules-4.18.0-240.rt7.54.el8.x86_64 requires kernel-rt-uname-r = 4.18.0-240.rt7.54.el8.x86_64, but none of the providers can be installed\\n - conflicting requests\\n - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n Problem 4: package kernel-rt-kvm-4.18.0-240.rt7.54.el8.x86_64 requires kernel-rt = 4.18.0-240.rt7.54.el8, but none of the providers can be installed\\n - conflicting requests\\n - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n: exit status 1\"",
"reason": "1 nodes are reporting degraded status on sync"

Version-Release number of selected component (if applicable):
master

How reproducible:
Always in CI

Steps to Reproduce:
1.
2.
3.

Actual results:
Updating the node to the RT kernel fails.

Expected results:
Updating the node to the RT kernel should succeed.

Additional info:
All relevant information (MCP, MC, must-gather...) is available under the CI links provided above.
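The two NodeDegraded messages quoted above carry different signatures, which helps when triaging further CI failures. A minimal sketch of telling them apart is shown here; the function name and the category labels are hypothetical, and the matched substrings are copied from the quoted messages.

```shell
#!/bin/sh
# classify_degraded_message MESSAGE
# Hypothetical helper: prints a coarse label for a NodeDegraded message.
# The depsolve check runs first because the second message quoted above
# contains both substrings.
classify_degraded_message() {
    case "$1" in
        *"Could not depsolve transaction"*) echo "depsolve-failure" ;;
        *"System transaction in progress"*) echo "transaction-in-progress" ;;
        *) echo "other" ;;
    esac
}

# Example call with a shortened form of the first message.
classify_degraded_message "error: System transaction in progress"
```

On a live cluster the message could be read with something like `oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}'`, assuming the pool is named worker.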
The 4.6 and 4.7 issues are most likely related. We see a trimmed error message in 4.6 because it does not have verbose rpm-ostree logging enabled (https://github.com/openshift/machine-config-operator/pull/2097).

It seems RHCOS is shipping linux-firmware-20200512-98.gitb2cad6a2.el8, but the latest machine-os-content ships the kernel-rt 4.18.0-240.rt7.54.el8 package, which needs linux-firmware >= 20200619-99.git3890db36. machine-os-content needs an update that makes the correct linux-firmware dependency available so that the kernel-rt install can succeed.

Marking this bug as urgent because it also affects MCO 4.7 and 4.6 CI:
4.6 - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2193/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op/1324193039505166336
4.7 - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2035/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1324274818467500032
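To make the mismatch concrete, here is a rough sketch that compares the shipped linux-firmware version-release against the one kernel-rt-core requires. It assumes GNU `sort -V` is available and uses it only as an approximation of RPM's own version comparison; the function name `version_at_least` is hypothetical.

```shell
#!/bin/sh
# version_at_least HAVE WANT
# Succeeds if HAVE sorts at or after WANT under `sort -V` ordering.
# Note: this approximates, but is not identical to, RPM version comparison.
version_at_least() {
    have=$1
    want=$2
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Version-release strings taken from the comment above.
shipped="20200512-98.gitb2cad6a2"
required="20200619-99.git3890db36"

if version_at_least "$shipped" "$required"; then
    echo "linux-firmware requirement satisfied"
else
    echo "linux-firmware too old for kernel-rt-core"
fi
```

With the two values above the check takes the second branch, which matches the "nothing provides linux-firmware" depsolve errors in the description.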
This is on track to be fixed by https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/1162
Targeting 4.7; will need a clone for 4.6.z
Verified on RHCOS 47.82.202011100542-0

$ cat << EOF > rt.yaml
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
>   labels:
>     machineconfiguration.openshift.io/role: "worker"
>   name: worker-kerneltype
> spec:
>   kernelType: realtime
> EOF

$ oc create -f rt.yaml
machineconfig.machineconfiguration.openshift.io/worker-kerneltype created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
00-worker                                          da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
01-master-container-runtime                        da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
01-master-kubelet                                  da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
01-worker-container-runtime                        da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
01-worker-kubelet                                  da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
99-master-generated-registries                     da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
99-master-ssh                                                                                 3.1.0             5h37m
99-worker-generated-registries                     da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
99-worker-ssh                                                                                 3.1.0             5h37m
rendered-master-8d25b9ae487bc5e7ffb021bd93bfff7d   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
rendered-worker-344e86d98ae75cde6fb5a5e2997bf82c   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             48m
rendered-worker-69dac79db33505219af92d594dbbc383   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             5h31m
rendered-worker-903310a06a3daf6543a338b18daeee4f   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             16m
rendered-worker-e6858708d022f5e2ad4b50ef033be75a   da75bdfb74bbb30568b58b1526ba369b6441d281   3.1.0             4h9m
test-file                                                                                     3.1.0             48m
worker-kerneltype                                                                                               4s

$ oc get mcp/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-903310a06a3daf6543a338b18daeee4f   False     True       False      3              0                   0                     0                      5h33m

$ watch oc get node

$ oc debug node/ip-10-0-194-240.us-west-2.compute.internal
Starting pod/ip-10-0-194-240us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2#
sh-4.2# chroot /host
sh-4.4# uname -a
Linux ip-10-0-194-240 4.18.0-193.28.1.rt13.77.el8_2.x86_64 #1 SMP PREEMPT RT Fri Oct 16 14:11:07 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# rpm -qa | grep kernel
kernel-rt-modules-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-core-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-modules-extra-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-kvm-4.18.0-193.28.1.rt13.77.el8_2.x86_64
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-10-093436   True        False         5h14m   Cluster version is 4.7.0-0.nightly-2020-11-10-093436

$ oc debug node/ip-10-0-194-240.us-west-2.compute.internal -- chroot /host rpm-ostree status
Starting pod/ip-10-0-194-240us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b22ac1787cafdd263f4fb2bb80dbdb1ec702d383d0eed13e4954a012d5d80dd6
              CustomOrigin: Managed by machine-config-operator
                   Version: 47.82.202011100542-0 (2020-11-10T05:46:41Z)
       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-193.29.1.el8_2
           LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules kernel-rt-modules-extra

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b22ac1787cafdd263f4fb2bb80dbdb1ec702d383d0eed13e4954a012d5d80dd6
              CustomOrigin: Managed by machine-config-operator
                   Version: 47.82.202011100542-0 (2020-11-10T05:46:41Z)

Removing debug pod ...
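The manual `uname -a` check in the verification above could be scripted roughly as follows. This is a heuristic sketch: it assumes real-time kernel releases carry an `.rtN` tag in the release string, as seen in the uname output above, and the function name `is_rt_kernel` is hypothetical.

```shell
#!/bin/sh
# is_rt_kernel RELEASE
# Heuristic: succeeds if the kernel release string carries an ".rtN" tag.
# The naming pattern is an assumption based on the uname output above,
# not a documented interface.
is_rt_kernel() {
    case "$1" in
        *.rt[0-9]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the kernel of the machine this script runs on.
if is_rt_kernel "$(uname -r)"; then
    echo "real-time kernel"
else
    echo "stock kernel"
fi
```

On a node it would need to run against the host, for example from an `oc debug node/...` session after `chroot /host`.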
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633