Bug 1894972 - [4.6.z] Update the node to use the real-time kernel fails
Summary: [4.6.z] Update the node to use the real-time kernel fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.z
Assignee: Steve Milner
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1894910
Blocks:
 
Reported: 2020-11-05 14:02 UTC by Micah Abbott
Modified: 2020-11-16 14:38 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1894910
Environment:
Last Closed: 2020-11-16 14:37:43 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4987 0 None None None 2020-11-16 14:38:00 UTC

Description Micah Abbott 2020-11-05 14:02:50 UTC
+++ This bug was initially created as a clone of Bug #1894910 +++

Description of problem:
Our CI recently started failing because the node drops into the degraded state when we try to update it with a MachineConfig that has the real-time kernel option enabled.

I saw two different errors:

1. https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-kni_performance-addon-operators/433/pull-ci-openshift-kni-performance-addon-operators-master-e2e-gcp/1324108099925053440/artifacts/e2e-gcp/gather-extra/

{
                        "lastTransitionTime": "2020-11-04T23:10:07Z",
                        "message": "Node ci-op-lx8l2lsg-24cc7-fg9bn-worker-b-8bvw7 is reporting: \"error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: error: System transaction in progress\\n: exit status 1\"",
                        "reason": "1 nodes are reporting degraded status on sync",
                        "status": "True",
                        "type": "NodeDegraded"
                    },

2. https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift-kni_performance-addon-operators/434/pull-ci-openshift-kni-performance-addon-operators-master-e2e-gcp/1323998850603552768/artifacts/e2e-gcp/gather-extra/

"message": "Node ci-op-4pgtrg3b-24cc7-zz7c8-worker-b-58l2r is reporting: \"error removing staged deployment: error running rpm-ostree cleanup -p: error: System transaction in progress\\n: exit status 1: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: Checking out tree 30e9764...done\\nEnabled rpm-md repositories: coreos-extensions\\nrpm-md repo 'coreos-extensions' (cached); generated: 2020-11-04T00:35:32Z\\nImporting rpm-md...done\\nResolving dependencies...done\\nerror: Could not depsolve transaction; 4 problems detected:\\n Problem 1: conflicting requests\\n  - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n Problem 2: package kernel-rt-modules-extra-4.18.0-240.rt7.54.el8.x86_64 requires kernel-rt-uname-r = 4.18.0-240.rt7.54.el8.x86_64, but none of the providers can be installed\\n  - conflicting requests\\n  - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n Problem 3: package kernel-rt-modules-4.18.0-240.rt7.54.el8.x86_64 requires kernel-rt-uname-r = 4.18.0-240.rt7.54.el8.x86_64, but none of the providers can be installed\\n  - conflicting requests\\n  - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n Problem 4: package kernel-rt-kvm-4.18.0-240.rt7.54.el8.x86_64 requires kernel-rt = 4.18.0-240.rt7.54.el8, but none of the providers can be installed\\n  - conflicting requests\\n  - nothing provides linux-firmware \u003e= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.rt7.54.el8.x86_64\\n: exit status 1\"",
                        "reason": "1 nodes are reporting degraded status on sync"

Version-Release number of selected component (if applicable):
master

How reproducible:
Always, in CI

Steps to Reproduce:
1.
2.
3.

Actual results:
The update of the node to work with the RT kernel fails

Expected results:
The update of the node to work with the RT kernel should succeed

Additional info:
You can find all the relevant information (MCP, MC, must-gather, ...) under the CI links provided above.
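
When going through the gathered artifacts, the degraded messages quoted in the description can be pulled out of a pool's status with a few lines of Python. This is only a sketch: it assumes the pool status was saved as JSON with a `conditions` list shaped like the snippet above. The function name `degraded_messages`, the shortened message text, and the extra `Updated` entry are made up for illustration.

```python
import json
from typing import List


def degraded_messages(pool_status: dict) -> List[str]:
    """Return the messages of NodeDegraded conditions whose status is "True"."""
    return [
        cond.get("message", "")
        for cond in pool_status.get("conditions", [])
        if cond.get("type") == "NodeDegraded" and cond.get("status") == "True"
    ]


# Sample data shaped like the first condition in the description.
# The message is shortened and the "Updated" entry is hypothetical.
sample = json.loads("""
{
  "conditions": [
    {
      "lastTransitionTime": "2020-11-04T23:10:07Z",
      "message": "Node ci-op-lx8l2lsg-24cc7-fg9bn-worker-b-8bvw7 is reporting: error: System transaction in progress",
      "reason": "1 nodes are reporting degraded status on sync",
      "status": "True",
      "type": "NodeDegraded"
    },
    {"type": "Updated", "status": "False", "message": ""}
  ]
}
""")

for message in degraded_messages(sample):
    print(message)
```

Filtering on both `type` and `status` skips conditions that are present but not currently true.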

--- Additional comment from Sinny Kumari on 2020-11-05 13:33:30 UTC ---

The 4.6 and 4.7 issues are most likely related. We are seeing a trimmed error message in 4.6 because it doesn't have verbose rpm-ostree logging enabled - https://github.com/openshift/machine-config-operator/pull/2097.

It seems RHCOS is shipping linux-firmware-20200512-98.gitb2cad6a2.el8, but the latest machine-os-content ships the kernel-rt 4.18.0-240.rt7.54.el8 package, which needs linux-firmware >= 20200619-99.git3890db36. machine-os-content needs an update so that the correct linux-firmware dependency is available and the kernel-rt install can succeed.

Marking this bug as urgent because it also affects MCO 4.7 and 4.6 CI:
4.6 - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2193/pull-ci-openshift-machine-config-operator-release-4.6-e2e-gcp-op/1324193039505166336
4.7 - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2035/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1324274818467500032
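
The dependency mismatch described in this comment can be illustrated with a rough version check. This is a simplified sketch, not RPM's real comparison algorithm (`rpmvercmp`). It assumes the version is a plain date-like integer and that only the leading digits of the release matter, which happens to hold for the two linux-firmware builds quoted above. The helper name `parse_version_release` is hypothetical.

```python
import re


def parse_version_release(vr: str) -> tuple:
    """Split 'VERSION-RELEASE' into (int version, int leading release digits)."""
    version, _, release = vr.partition("-")
    # Keep only the leading digits of the release ("99.git3890db36" -> 99).
    match = re.match(r"\d+", release)
    return (int(version), int(match.group()) if match else 0)


shipped = "20200512-98.gitb2cad6a2.el8"   # linux-firmware in RHCOS
required = "20200619-99.git3890db36"      # minimum needed by kernel-rt-core

satisfied = parse_version_release(shipped) >= parse_version_release(required)
print(f"linux-firmware {shipped} satisfies >= {required}: {satisfied}")
```

For the two versions quoted, the check reports that the requirement is not met, which matches the "nothing provides linux-firmware" depsolve error.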

--- Additional comment from Colin Walters on 2020-11-05 13:56:19 UTC ---

This is on track to be fixed by https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/1162

--- Additional comment from Micah Abbott on 2020-11-05 14:00:25 UTC ---

Targeting 4.7; will need a clone for 4.6.z

Comment 6 Micah Abbott 2020-11-09 19:10:54 UTC
Verified with 4.6.0-0.nightly-2020-11-07-035509 on GCP

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-11-07-035509   True        False         102s    Cluster version is 4.6.0-0.nightly-2020-11-07-035509

$ cat machineConfigs/worker-realtime.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: 99-worker-kerneltype
spec:
  kernelType: realtime

$ oc apply -f machineConfigs/worker-realtime.yaml
machineconfig.machineconfiguration.openshift.io/99-worker-kerneltype created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          054f6197a19ceffff44f361674bd24644d1a2bcb   3.1.0             106m
00-worker                                          054f6197a19ceffff44f361674bd24644d1a2bcb   3.1.0             106m
01-master-container-runtime                        054f6197a19ceffff44f361674bd24644d1a2bcb   3.1.0             106m
01-master-kubelet                                  054f6197a19ceffff44f361674bd24644d1a2bcb   3.1.0             106m
01-worker-container-runtime                        054f6197a19ceffff44f361674bd24644d1a2bcb   3.1.0             106m
01-worker-kubelet                                  054f6197a19ceffff44f361674bd24644d1a2bcb   3.1.0             106m
99-master-generated-registries                     054f6197a19ceffff44f361674bd24644d1a2bcb   3.1.0             106m
99-master-ssh                                                                                 3.1.0             113m
99-worker-generated-registries                     054f6197a19ceffff44f361674bd24644d1a2bcb   3.1.0             106m
99-worker-kerneltype                                                                                            73m
99-worker-ssh                                                                                 3.1.0             113m
rendered-master-26289f039b78077aab0d57f41e7c83fc   054f6197a19ceffff44f361674bd24644d1a2bcb   3.1.0             106m
rendered-worker-07945f23ee9807f5fe10a2cca7d94019   054f6197a19ceffff44f361674bd24644d1a2bcb   3.1.0             72m
rendered-worker-1c3093c409badcfbe46fb3060728b1ef   054f6197a19ceffff44f361674bd24644d1a2bcb   3.1.0             106m

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-26289f039b78077aab0d57f41e7c83fc   True      False      False      3              3                   3                     0                      105m
worker   rendered-worker-07945f23ee9807f5fe10a2cca7d94019   True      False      False      3              3                   3                     0                      105m

$ oc get nodes
NAME                                       STATUS   ROLES    AGE    VERSION
ci-ln-5j7t03k-f76d1-mg4d4-master-0         Ready    master   106m   v1.19.0+9f84db3
ci-ln-5j7t03k-f76d1-mg4d4-master-1         Ready    master   107m   v1.19.0+9f84db3
ci-ln-5j7t03k-f76d1-mg4d4-master-2         Ready    master   106m   v1.19.0+9f84db3
ci-ln-5j7t03k-f76d1-mg4d4-worker-b-kvjlh   Ready    worker   98m    v1.19.0+9f84db3
ci-ln-5j7t03k-f76d1-mg4d4-worker-c-2jmgg   Ready    worker   98m    v1.19.0+9f84db3
ci-ln-5j7t03k-f76d1-mg4d4-worker-d-wg5tx   Ready    worker   98m    v1.19.0+9f84db3

$ oc debug node/ci-ln-5j7t03k-f76d1-mg4d4-worker-b-kvjlh
Starting pod/ci-ln-5j7t03k-f76d1-mg4d4-worker-b-kvjlh-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.32.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47e1213c98063dfd7f5ccae41e611a25446c8ac493cfdd05d8f1c46b61ab13d4
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202011061621-0 (2020-11-06T16:25:16Z)
       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-193.29.1.el8_2
           LayeredPackages: kernel-rt-core kernel-rt-kvm kernel-rt-modules kernel-rt-modules-extra

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47e1213c98063dfd7f5ccae41e611a25446c8ac493cfdd05d8f1c46b61ab13d4
              CustomOrigin: Managed by machine-config-operator
                   Version: 46.82.202011061621-0 (2020-11-06T16:25:16Z)
sh-4.4# uname -a
Linux ci-ln-5j7t03k-f76d1-mg4d4-worker-b-kvjlh 4.18.0-193.28.1.rt13.77.el8_2.x86_64 #1 SMP PREEMPT RT Fri Oct 16 14:11:07 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
```
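
The last step of the verification above, reading the `uname -a` output, could be scripted along these lines. This is a heuristic sketch only: it assumes a real-time kernel identifies itself with `PREEMPT RT` in its build string, as in the output above. Accepting the `PREEMPT_RT` spelling as well is an assumption about other kernel versions, and the function name `is_realtime_kernel` is hypothetical.

```python
def is_realtime_kernel(uname_output: str) -> bool:
    """Heuristic: the build string contains 'PREEMPT RT' (or 'PREEMPT_RT')."""
    # Normalize underscores so both spellings match the same substring.
    normalized = uname_output.replace("_", " ")
    return "PREEMPT RT" in normalized


# The uname -a line captured during verification.
uname_line = (
    "Linux ci-ln-5j7t03k-f76d1-mg4d4-worker-b-kvjlh "
    "4.18.0-193.28.1.rt13.77.el8_2.x86_64 #1 SMP PREEMPT RT "
    "Fri Oct 16 14:11:07 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux"
)
print(is_realtime_kernel(uname_line))
```

A string check like this only confirms which kernel is running. The `rpm-ostree status` output is still needed to confirm that the kernel-rt packages were layered.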

Comment 8 errata-xmlrpc 2020-11-16 14:37:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.4 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4987

