Bug 1861026 - [4.5] Need to upgrade host and kernel-rt layer atomically
Summary: [4.5] Need to upgrade host and kernel-rt layer atomically
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Sinny Kumari
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On: 1873249
Blocks: 1803965 1873383
Reported: 2020-07-27 16:39 UTC by Artyom
Modified: 2020-09-14 14:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1873249 1873383
Environment:
Last Closed: 2020-09-14 14:54:26 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2029 None closed Bug 1861026: daemon: perform other rpm-ostree operations after OS rebase 2020-11-13 16:23:39 UTC
Red Hat Product Errata RHBA-2020:3618 None None None 2020-09-14 14:55:00 UTC

Description Artyom 2020-07-27 16:39:06 UTC
Description of problem:
Updating the realtime kernel fails with an error about a missing package dependency.

Version-Release number of selected component (if applicable):
oc version
Client Version: 4.6.0-0.nightly-2020-07-25-091217
Server Version: 4.5.3
Kubernetes Version: v1.18.3+3107688

rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da17e52f45616b71ad173da3db2cb7e94cd5b3ca60b9bed764f5ec2cfa475e4a
              CustomOrigin: Managed by machine-config-operator
                   Version: 44.81.202007010318-0 (2020-07-01T03:23:35Z)
       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-147.20.1.el8_1
             LocalPackages: kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64 kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                 Initramfs: -I '/etc/systemd/system.conf /etc/systemd/system.conf.d/setAffinity.conf'

How reproducible:
Always

Steps to Reproduce:
1. On the node, run:
# podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:62eeb6da08efd1a7722cce7ab709366066464f97e74d14773818abb07ce3f7a7
# podman create --net=none --annotation=org.openshift.machineconfigoperator.pivot=true --name mcd-0d4dbcdb-ac83-4ed9-80de-9ccb1b2cbcdc quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:62eeb6da08efd1a7722cce7ab709366066464f97e74d14773818abb07ce3f7a7
# podman mount <container_id>
2. rpm-ostree uninstall kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64 kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64 kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64 --install /var/lib/containers/storage/overlay/<mount_id>/merged/kernel-rt-core-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm --install /var/lib/containers/storage/overlay/<mount_id>/merged/kernel-rt-kvm-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm --install /var/lib/containers/storage/overlay/<mount_id>/merged/kernel-rt-modules-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm --install /var/lib/containers/storage/overlay/<mount_id>/merged/kernel-rt-modules-extra-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm

Actual results:
The command fails with the following error:
Checking out tree 7624994... done
Enabled rpm-md repositories:
Importing rpm-md... done
Resolving dependencies... done
error: Could not depsolve transaction; 4 problems detected:
 Problem 1: conflicting requests
  - nothing provides linux-firmware >= 20191202-97.gite8a0f4c9 needed by kernel-rt-core-4.18.0-193.13.2.rt13.65.el8_2.x86_64
 Problem 2: package kernel-rt-modules-4.18.0-193.13.2.rt13.65.el8_2.x86_64 requires kernel-rt-uname-r = 4.18.0-193.13.2.rt13.65.el8_2.x86_64, but none of the providers can be installed
  - conflicting requests
  - nothing provides linux-firmware >= 20191202-97.gite8a0f4c9 needed by kernel-rt-core-4.18.0-193.13.2.rt13.65.el8_2.x86_64
 Problem 3: package kernel-rt-modules-extra-4.18.0-193.13.2.rt13.65.el8_2.x86_64 requires kernel-rt-uname-r = 4.18.0-193.13.2.rt13.65.el8_2.x86_64, but none of the providers can be installed
  - conflicting requests
  - nothing provides linux-firmware >= 20191202-97.gite8a0f4c9 needed by kernel-rt-core-4.18.0-193.13.2.rt13.65.el8_2.x86_64
 Problem 4: package kernel-rt-kvm-4.18.0-193.13.2.rt13.65.el8_2.x86_64 requires kernel-rt = 4.18.0-193.13.2.rt13.65.el8_2, but none of the providers can be installed
  - conflicting requests
  - nothing provides linux-firmware >= 20191202-97.gite8a0f4c9 needed by kernel-rt-core-4.18.0-193.13.2.rt13.65.el8_2.x86_64

Expected results:
The upgrade should succeed

Additional info:

I provided the manual steps to reproduce the bug, but it originally happened for us under the machine-config-daemon.

Comment 1 Micah Abbott 2020-07-27 18:21:36 UTC
This looks like an order of operation problem.  The 4.5.3 machine-os-content has `linux-firmware-20191202-97.gite8a0f4c9.el8` included as part of the update.  But based on this reproducer it seems like an upgrade of the RT kernel is attempted before the underlying RHCOS is updated and the RT kernel dependencies can't be fulfilled.

I'm going to tag in Sinny and Jonathan for more triage.  I *think* this might be an issue in how the MCO orchestrates the update, but it could be an RHCOS/rpm-ostree problem.
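
The ordering problem can be illustrated with a toy model. This is a minimal sketch, not how rpm-ostree actually resolves dependencies: versions are reduced to integer tuples, and the old firmware version `(20190516, 94)` is a hypothetical placeholder, since the bug does not state which linux-firmware the old host carried. Only the required version `20191202-97` comes from the error in comment #0.

```python
# Toy model of the ordering problem; not rpm-ostree's real depsolver.
# Versions are simplified to (date, release) tuples.

REQUIRED_FIRMWARE = (20191202, 97)   # from the depsolve error in comment #0
OLD_BASE_FIRMWARE = (20190516, 94)   # hypothetical: version on the old base
NEW_BASE_FIRMWARE = (20191202, 97)   # shipped in the new machine-os-content


def can_layer_new_rt_kernel(base_firmware):
    """Return True if the base image satisfies the RT kernel's requirement."""
    return base_firmware >= REQUIRED_FIRMWARE


def upgrade(order):
    """Apply operations in the given order, starting from the old base.

    Returns "ok" or a string naming the step that failed.
    """
    firmware = OLD_BASE_FIRMWARE
    for step in order:
        if step == "rebase":
            firmware = NEW_BASE_FIRMWARE
        elif step == "install-kernel-rt":
            if not can_layer_new_rt_kernel(firmware):
                return "failed: install-kernel-rt (linux-firmware too old)"
    return "ok"


print(upgrade(["install-kernel-rt", "rebase"]))  # fails in this model
print(upgrade(["rebase", "install-kernel-rt"]))  # succeeds in this model
```

In this model, the same two operations succeed or fail depending only on their order, which matches the diagnosis above.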

Comment 2 Jonathan Lebon 2020-07-27 19:12:47 UTC
> This looks like an order of operation problem.  The 4.5.3 machine-os-content has `linux-firmware-20191202-97.gite8a0f4c9.el8` included as part of the update.  But based on this reproducer it seems like an upgrade of the RT kernel is attempted before the underlying RHCOS is updated and the RT kernel dependencies can't be fulfilled.

Yup, I agree with your diagnosis. See https://bugzilla.redhat.com/show_bug.cgi?id=1859269#c7. This is not technically a new bug, but it's made easier to trigger by the 8.1 to 8.2 update (it's kind of the RHCOS equivalent of https://github.com/coreos/fedora-coreos-tracker/issues/400, except here it's totally solvable by doing the upgrade first :) ).

@Sinny, IIUC that should be solved by the extensions PR for 4.6, right? For 4.5, I think we'll need a fix where instead of `install` then `rebase`, we unify them into `rebase --install ... --uninstall ...`. That way it happens atomically.

This should be how it's done day 1 too, except that `rebase` doesn't support changing overrides, so you can't do e.g. `rpm-ostree rebase ... --override-remove kernel --install kernel-rt`. So it'll have to remain a two-step operation there, but the `override remove ... --install ...` should still happen after the `rebase`. (We can enhance the `rebase` CLI, though long-term I think it'd be cleaner to use the D-Bus UpdateDeployment() API directly?)

Let's use this bug to track the 4.5 fix.
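
A minimal sketch of the two forms discussed in this comment, expressed as argument lists rather than executed commands. The `rebase --install ... --uninstall ...` and `override remove ... --install ...` shapes are taken from the text above; the function names, the `target` parameter, and the placeholder package names are illustrative assumptions, not the machine-config-daemon's actual code.

```python
def unified_rebase_argv(target, install, uninstall):
    """Build one `rpm-ostree rebase` invocation that also carries package changes.

    Mirrors the `rebase --install ... --uninstall ...` form suggested above,
    so the base OS and the layered packages change in a single transaction.
    """
    argv = ["rpm-ostree", "rebase", target]
    for pkg in uninstall:
        argv += ["--uninstall", pkg]
    for pkg in install:
        argv += ["--install", pkg]
    return argv


def day1_argvs(target, remove_base, install):
    """Two-step day-1 form: rebase first, then `override remove ... --install ...`."""
    rebase = ["rpm-ostree", "rebase", target]
    switch = ["rpm-ostree", "override", "remove", *remove_base]
    for pkg in install:
        switch += ["--install", pkg]
    return [rebase, switch]


# Placeholder values for illustration only.
print(unified_rebase_argv("<target-ref>", ["<new-kernel-rt-core.rpm>"], ["<old-kernel-rt-core>"]))
print(day1_argvs("<target-ref>", ["kernel"], ["kernel-rt"]))
```

Returning argument lists keeps the ordering explicit and easy to inspect before anything is run on a node.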

Comment 3 Sinny Kumari 2020-07-28 05:32:13 UTC
(In reply to Jonathan Lebon from comment #2)
> > This looks like an order of operation problem.  The 4.5.3 machine-os-content has `linux-firmware-20191202-97.gite8a0f4c9.el8` included as part of the update.  But based on this reproducer it seems like an upgrade of the RT kernel is attempted before the underlying RHCOS is updated and the RT kernel dependencies can't be fulfilled.
> 
> Yup, I agree with your diagnosis. See
> https://bugzilla.redhat.com/show_bug.cgi?id=1859269#c7. This is not
> technically a new bug, but it's made easier to trigger by the 8.1 to 8.2
> update (it's kind of the RHCOS equivalent of
> https://github.com/coreos/fedora-coreos-tracker/issues/400, except here it's
> totally solvable by doing the upgrade first :) ).
> 
> @Sinny, IIUC that should be solved by the extensions PR for 4.6, right? For
> 4.5, I think we'll need a fix where instead of `install` then `rebase`, we
> unify them into `rebase --install ... --uninstall ...`. That way it happens
> atomically.

Yeah, this should be fixed with extensions PR https://github.com/openshift/machine-config-operator/pull/1941
Also, with PR https://github.com/openshift/machine-config-operator/pull/1766, which has already landed in 4.6, we will always pull the m-c-d binary from the image, so once PR#1941 lands, the upgrade from 4.5->4.6 should work as expected.

> This should be how it's done day 1 too, except that `rebase` doesn't support
> changing overrides, so you can't do e.g. `rpm-ostree rebase ...
> --override-remove kernel --install kernel-rt`. So it'll have to remain a
> two-step operation there, but the `override remove ... --install ...` should
> still happen after the `rebase`. (We can enhance the `rebase` CLI, though
> long-term I think it'd be cleaner to use the D-Bus UpdateDeployment() API
> directly?)

Fixing the m-c-d behavior in 4.5 is going to be tricky with the current design but should be doable with some time investment. I see upgrading to 4.6 as one solution.

> Let's use this bug to track the 4.5 fix.

Comment 4 Colin Walters 2020-07-28 18:59:58 UTC
> Fixing the m-c-d behavior in 4.5 is going to be tricky with the current design but should be doable with some time investment. I see upgrading to 4.6 as one solution.

Right =/  It will be messy to re-do this just for 4.5.  But we may have to.

Comment 6 Martin Sivák 2020-08-18 13:52:57 UTC
Just a stupid question, might this be fixed too by https://bugzilla.redhat.com/show_bug.cgi?id=1827712#c24 ? Or is that a different issue?

Comment 7 Sinny Kumari 2020-08-18 17:21:49 UTC
Yeah, this is a different issue. This issue won't happen in OCP 4.6 or later versions.
We need to find a way to fix it in 4.5.

Comment 8 Denys Shchedrivyi 2020-08-20 19:38:05 UTC
@Sinny, any plans to backport it into 4.4? I see the same issue during minor upgrades of OCP 4.4.5 -> 4.4.17 (node with RT kernel becomes degraded)

Comment 13 Micah Abbott 2020-08-31 14:54:45 UTC
This was fixed in https://github.com/openshift/machine-config-operator/pull/2029

Comment 15 Micah Abbott 2020-09-01 19:45:18 UTC
QE was unable to verify this BZ in time for release, so it has been dropped from the current advisory.

Comment 16 Sinny Kumari 2020-09-02 12:16:01 UTC
Verifying this issue is tricky because the RHCOS node must have a machine-config-daemon package that contains the patch.

1. Colin has mentioned some steps at https://github.com/openshift/machine-config-operator/pull/2029#issuecomment-682495058 which we can use to verify the bug. Copying the content here as well:

Since 4.4.17 has already shipped this requires manual intervention:

    Create a custom release image with new MCO from this patch based on e.g. 4.4.5
    Upgrade to custom
    Upgrade to 4.4.18 or a new release that still has this patch

2. Get machine-config-daemon-4.5.0-202008280032.p0.git.2558.a93c8dc.el8 (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1299991) or a later version that contains the patch.

One can override the installed machine-config-daemon package and then try the `Steps to Reproduce` section in comment #0.

Comment 17 Sinny Kumari 2020-09-02 14:32:37 UTC
Moving it to Assigned to include the fixes in 4.5 boot images as well https://github.com/openshift/installer/pull/4125

Comment 18 Sinny Kumari 2020-09-02 14:39:12 UTC
See the corresponding 4.4 bug https://bugzilla.redhat.com/show_bug.cgi?id=1873383#c1, where we saw another instance of the issue happening at cluster install time as well.

Comment 19 Sinny Kumari 2020-09-04 05:20:52 UTC
Moving this bug back to Modified since we are no longer doing the bootimage bump now, see https://github.com/openshift/installer/pull/4125#issuecomment-686751300

Comment 20 Micah Abbott 2020-09-06 16:38:00 UTC
I spent some time trying to create an environment where this could be verified, but encountered a few issues:

1.  There's no pure 4.5 environment that allows us to test the upgrade of 4.5 with RT kernel where this issue can be reproduced, since RHCOS 4.5 has always used RHEL 8.2 and the issue is produced when upgrading from an RHCOS using RHEL 8.1 to an RHCOS using RHEL 8.2
2.  Trying to create a custom 4.4 environment as a starting point, with RHCOS using RHEL 8.1 and the MCO fix included, caused me to encounter BZ#1859269 when upgrading to an OCP 4.5 build.

I think the best we can hope for here, in terms of verifying the BZ, is to create a cluster using 4.5 with the MCO fixed, deploy the RT kernel on the worker nodes, and perform an upgrade to a newer 4.5.  We can take steps to verify the MCO fix is included as expected and the upgrade was successful.  However, I don't think it is a good use of resources to try to create a Frankenstein environment which would allow us to fully prove out this issue.
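
Whether an upgrade path crosses the RHEL 8.1 to 8.2 boundary described in point 1 can be checked from the RHCOS version strings. This is a minimal sketch, assuming the second dotted field of the version encodes the RHEL major digit followed by the minor version, as the strings `44.81.202007010318-0` and `45.82.202009081029-0` in this bug suggest. The function names are hypothetical.

```python
def rhel_stream_from_rhcos_version(version):
    """Return the RHEL stream (e.g. "8.1") encoded in an RHCOS version string.

    Assumes the second dotted field is the RHEL major digit followed by the
    minor version, as in "44.81.202007010318-0".
    """
    fields = version.split(".")
    if len(fields) < 3 or not fields[1].isdigit() or len(fields[1]) < 2:
        raise ValueError(f"unexpected RHCOS version format: {version!r}")
    rhel = fields[1]
    return f"{rhel[0]}.{rhel[1:]}"


def crosses_rhel_minor(old_version, new_version):
    """Return True if the two RHCOS versions are built on different RHEL streams."""
    return (rhel_stream_from_rhcos_version(old_version)
            != rhel_stream_from_rhcos_version(new_version))


# Version strings taken from comment #0 and from the verification in this bug.
print(rhel_stream_from_rhcos_version("44.81.202007010318-0"))
print(crosses_rhel_minor("44.81.202007010318-0", "45.82.202009081029-0"))
```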

Comment 21 Sinny Kumari 2020-09-08 04:22:07 UTC
Right.
Getting these fixes in will still avoid any future upgrade issues, if applicable, and also unblocks getting the fixes into 4.4.z (where we have RHEL 8.1 content).

Comment 24 Micah Abbott 2020-09-09 15:17:11 UTC
Verified using 4.5.8

Per the discussion in comments #20 + #21, I booted a 4.5.8 cluster in GCP, applied an MC to switch to the RT kernel on the worker nodes, and then upgraded to the latest 4.5 nightly.

All operations were successful.

```
$ oc get clusterversion                                                     
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS                                                                                                                                                                                                                                
version   4.5.8     True        False         5m24s   Cluster version is 4.5.8                                                                                                                                                                                                              

$ oc get nodes                                                                                                                                                                                                            
NAME                                                           STATUS   ROLES    AGE   VERSION                                                                                                                                                                                              
miabbott-4-5-8-mg4zb-master-0.c.openshift-gce-devel.internal   Ready    master   28m   v1.18.3+6c42de8                                                                                                                                                                                      
miabbott-4-5-8-mg4zb-master-1.c.openshift-gce-devel.internal   Ready    master   28m   v1.18.3+6c42de8                                                                                                                                                                                      
miabbott-4-5-8-mg4zb-master-2.c.openshift-gce-devel.internal   Ready    master   28m   v1.18.3+6c42de8                                                                                                                                                                                      
miabbott-4-5-8-mg4zb-worker-a-8h79w                            Ready    worker   16m   v1.18.3+6c42de8                                        
miabbott-4-5-8-mg4zb-worker-b-jg22w                            Ready    worker   16m   v1.18.3+6c42de8
miabbott-4-5-8-mg4zb-worker-c-rtv8n                            Ready    worker   16m   v1.18.3+6c42de8

$ oc debug node/miabbott-4-5-8-mg4zb-worker-a-8h79w -- chroot /host uname -a
Starting pod/miabbott-4-5-8-mg4zb-worker-a-8h79w-debug ...
To use host binaries, run `chroot /host`
Linux miabbott-4-5-8-mg4zb-worker-a-8h79w 4.18.0-193.14.3.el8_2.x86_64 #1 SMP Mon Jul 20 15:02:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Removing debug pod ...

$ cat ../machineConfigs/worker-realtime.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: 99-worker-kerneltype
spec:
  kernelType: realtime

$ oc apply -f ../machineConfigs/worker-realtime.yaml 
machineconfig.machineconfiguration.openshift.io/99-worker-kerneltype created

$ oc debug node/miabbott-4-5-8-mg4zb-worker-a-8h79w -- chroot /host uname -a 
Starting pod/miabbott-4-5-8-mg4zb-worker-a-8h79w-debug ...
To use host binaries, run `chroot /host`
Linux miabbott-4-5-8-mg4zb-worker-a-8h79w 4.18.0-193.14.3.rt13.67.el8_2.x86_64 #1 SMP PREEMPT RT Mon Jul 20 16:41:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Removing debug pod ...

$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched

$ oc adm upgrade --allow-explicit-upgrade=true --allow-upgrade-with-warnings=true --force=true --to-image=registry.svc.ci.openshift.org/ocp/release@sha256:bf05358f3eba0d0135ddb46e710e5715c39d5d6a51283eaa4cae20751e74435e                     
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to preceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.svc.ci.openshift.org/ocp/release@sha256:bf05358f3eba0d0135ddb46e710e5715c39d5d6a51283eaa4cae20751e74435e

...

$ oc get clusterversion                                                      
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS                                                        
version   4.5.0-0.nightly-2020-09-08-123650   True        False         17m     Cluster version is 4.5.0-0.nightly-2020-09-08-123650          

$ oc debug node/miabbott-4-5-8-mg4zb-worker-a-8h79w -- chroot /host uname -a 
Starting pod/miabbott-4-5-8-mg4zb-worker-a-8h79w-debug ...                                                                                    
To use host binaries, run `chroot /host`                                                                                                      
Linux miabbott-4-5-8-mg4zb-worker-a-8h79w 4.18.0-193.19.1.rt13.70.el8_2.x86_64 #1 SMP PREEMPT RT Wed Aug 26 17:57:22 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
                                                                                                                                              
Removing debug pod ...                                                                                                                        
$ oc get nodes -o wide                                                      
NAME                                                           STATUS   ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                         CONTAINER-RUNTIME
miabbott-4-5-8-mg4zb-master-0.c.openshift-gce-devel.internal   Ready    master   96m   v1.18.3+6c42de8   10.0.0.6                    Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.el8_2.x86_64           cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8
miabbott-4-5-8-mg4zb-master-1.c.openshift-gce-devel.internal   Ready    master   96m   v1.18.3+6c42de8   10.0.0.4                    Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.el8_2.x86_64           cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8
miabbott-4-5-8-mg4zb-master-2.c.openshift-gce-devel.internal   Ready    master   96m   v1.18.3+6c42de8   10.0.0.5                    Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.el8_2.x86_64           cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8
miabbott-4-5-8-mg4zb-worker-a-8h79w                            Ready    worker   85m   v1.18.3+6c42de8   10.0.32.2                   Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.rt13.70.el8_2.x86_64   cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8
miabbott-4-5-8-mg4zb-worker-b-jg22w                            Ready    worker   85m   v1.18.3+6c42de8   10.0.32.3                   Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.rt13.70.el8_2.x86_64   cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8        
miabbott-4-5-8-mg4zb-worker-c-rtv8n                            Ready    worker   85m   v1.18.3+6c42de8   10.0.32.4                   Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.rt13.70.el8_2.x86_64   cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8
```
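
The kernel check done by eye in the log above can be scripted. This is a minimal sketch, assuming RT kernel releases always carry an `.rtNN.` component as they do in the `uname -a` lines captured here. The helper names are hypothetical.

```python
import re

# Matches the ".rtNN." component seen in RT kernel releases in this bug,
# e.g. "4.18.0-193.19.1.rt13.70.el8_2.x86_64".
_RT_COMPONENT = re.compile(r"\.rt\d+\.")


def kernel_release_from_uname(uname_a_line):
    """Extract the kernel release (third field) from a `uname -a` output line."""
    return uname_a_line.split()[2]


def is_rt_kernel(release):
    """Return True if a kernel release string looks like a kernel-rt build."""
    return bool(_RT_COMPONENT.search(release))


line = ("Linux miabbott-4-5-8-mg4zb-worker-a-8h79w "
        "4.18.0-193.19.1.rt13.70.el8_2.x86_64 #1 SMP PREEMPT RT "
        "Wed Aug 26 17:57:22 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux")
print(is_rt_kernel(kernel_release_from_uname(line)))
```

Running the same check before and after the upgrade on each worker confirms that the node stayed on the RT kernel while the kernel version advanced.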

Comment 26 errata-xmlrpc 2020-09-14 14:54:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.9 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3618

