Description of problem:
Update of the realtime kernel fails with an error about a missing package.

Version-Release number of selected component (if applicable):

oc version
Client Version: 4.6.0-0.nightly-2020-07-25-091217
Server Version: 4.5.3
Kubernetes Version: v1.18.3+3107688

rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da17e52f45616b71ad173da3db2cb7e94cd5b3ca60b9bed764f5ec2cfa475e4a
              CustomOrigin: Managed by machine-config-operator
                   Version: 44.81.202007010318-0 (2020-07-01T03:23:35Z)
       RemovedBasePackages: kernel-core kernel-modules kernel kernel-modules-extra 4.18.0-147.20.1.el8_1
             LocalPackages: kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                            kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64
                 Initramfs: -I '/etc/systemd/system.conf /etc/systemd/system.conf.d/setAffinity.conf'

How reproducible:
Always

Steps to Reproduce:
1. On the node, run:
   # podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:62eeb6da08efd1a7722cce7ab709366066464f97e74d14773818abb07ce3f7a7
   # podman create --net=none --annotation=org.openshift.machineconfigoperator.pivot=true --name mcd-0d4dbcdb-ac83-4ed9-80de-9ccb1b2cbcdc quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:62eeb6da08efd1a7722cce7ab709366066464f97e74d14773818abb07ce3f7a7
   # podman mount <container_id>
2. rpm-ostree uninstall kernel-rt-core-4.18.0-147.8.1.rt24.101.el8_1.x86_64 kernel-rt-modules-4.18.0-147.8.1.rt24.101.el8_1.x86_64 kernel-rt-modules-extra-4.18.0-147.8.1.rt24.101.el8_1.x86_64 \
     --install /var/lib/containers/storage/overlay/<mount_id>/merged/kernel-rt-core-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm \
     --install /var/lib/containers/storage/overlay/<mount_id>/merged/kernel-rt-kvm-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm \
     --install /var/lib/containers/storage/overlay/<mount_id>/merged/kernel-rt-modules-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm \
     --install /var/lib/containers/storage/overlay/<mount_id>/merged/kernel-rt-modules-extra-4.18.0-193.13.2.rt13.65.el8_2.x86_64.rpm

Actual results:
The command fails with the error:

Checking out tree 7624994... done
Enabled rpm-md repositories:
Importing rpm-md... done
Resolving dependencies... done
error: Could not depsolve transaction; 4 problems detected:
 Problem 1: conflicting requests
  - nothing provides linux-firmware >= 20191202-97.gite8a0f4c9 needed by kernel-rt-core-4.18.0-193.13.2.rt13.65.el8_2.x86_64
 Problem 2: package kernel-rt-modules-4.18.0-193.13.2.rt13.65.el8_2.x86_64 requires kernel-rt-uname-r = 4.18.0-193.13.2.rt13.65.el8_2.x86_64, but none of the providers can be installed
  - conflicting requests
  - nothing provides linux-firmware >= 20191202-97.gite8a0f4c9 needed by kernel-rt-core-4.18.0-193.13.2.rt13.65.el8_2.x86_64
 Problem 3: package kernel-rt-modules-extra-4.18.0-193.13.2.rt13.65.el8_2.x86_64 requires kernel-rt-uname-r = 4.18.0-193.13.2.rt13.65.el8_2.x86_64, but none of the providers can be installed
  - conflicting requests
  - nothing provides linux-firmware >= 20191202-97.gite8a0f4c9 needed by kernel-rt-core-4.18.0-193.13.2.rt13.65.el8_2.x86_64
 Problem 4: package kernel-rt-kvm-4.18.0-193.13.2.rt13.65.el8_2.x86_64 requires kernel-rt = 4.18.0-193.13.2.rt13.65.el8_2, but none of the providers can be installed
  - conflicting requests
  - nothing provides linux-firmware >= 20191202-97.gite8a0f4c9 needed by kernel-rt-core-4.18.0-193.13.2.rt13.65.el8_2.x86_64

Expected results:
The upgrade should succeed.

Additional info:
I provided the manual steps to reproduce the bug, but it happened for us under the machine-config-daemon.
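All four depsolve problems in the report above trace back to the same unmet requirement. As a rough sketch (the `missing_deps` helper is hypothetical, not part of rpm-ostree, and it assumes the usual layout where each "nothing provides ... needed by ..." message sits on its own line), the distinct missing dependencies can be pulled out of such output like this:

```shell
#!/bin/sh
# Hypothetical helper: read depsolver output on stdin and print each
# distinct unmet dependency once.
# Assumes one "nothing provides X needed by Y" message per input line.
missing_deps() {
    sed -n 's/.*nothing provides \(.*\) needed by .*/\1/p' | sort -u
}

# Example use (not run here): capture the failing command's stderr.
#   rpm-ostree uninstall ... --install ... 2>&1 | missing_deps
```

For the error in this report, the only line this should print is the `linux-firmware >= 20191202-97.gite8a0f4c9` requirement.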
This looks like an order of operation problem. The 4.5.3 machine-os-content has `linux-firmware-20191202-97.gite8a0f4c9.el8` included as part of the update. But based on this reproducer it seems like an upgrade of the RT kernel is attempted before the underlying RHCOS is updated and the RT kernel dependencies can't be fulfilled. I'm going to tag in Sinny and Jonathan for more triage. I *think* this might be an issue in how the MCO orchestrates the update, but it could be an RHCOS/rpm-ostree problem.
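To sanity-check this diagnosis on a node, one could compare the `linux-firmware` build in the booted deployment with the one the new RT kernel requires. The sketch below is only an approximation: `version_ge` is a hypothetical helper, it relies on GNU `sort -V` rather than RPM's own version comparison, and the "installed" value in the usage comment is a made-up placeholder, not the real RHEL 8.1 build.

```shell
#!/bin/sh
# Hypothetical helper: succeed when version $1 is greater than or equal to
# version $2 under GNU `sort -V` ordering.
# This only approximates RPM's version comparison rules.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Requirement taken from the depsolve error in comment #0.
required="20191202-97.gite8a0f4c9"

# Example use with a placeholder "installed" value (made up for illustration):
#   if version_ge "20190101-1.gitdeadbeef" "$required"; then
#       echo "dependency satisfied"
#   else
#       echo "base OS must be updated first"
#   fi
```

If the booted deployment's version fails this check while the target machine-os-content passes it, that matches the ordering problem described above.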
> This looks like an order of operation problem. The 4.5.3 machine-os-content has `linux-firmware-20191202-97.gite8a0f4c9.el8` included as part of the update. But based on this reproducer it seems like an upgrade of the RT kernel is attempted before the underlying RHCOS is updated and the RT kernel dependencies can't be fulfilled.

Yup, I agree with your diagnosis. See https://bugzilla.redhat.com/show_bug.cgi?id=1859269#c7. This is not technically a new bug, but it's made easier to trigger by the 8.1 to 8.2 update (it's kind of the RHCOS equivalent of https://github.com/coreos/fedora-coreos-tracker/issues/400, except here it's totally solvable by doing the upgrade first :) ).

@Sinny, IIUC that should be solved by the extensions PR for 4.6, right? For 4.5, I think we'll need a fix where instead of `install` then `rebase`, we unify them into `rebase --install ... --uninstall ...`. That way it happens atomically.

This should be how it's done day 1 too, except that `rebase` doesn't support changing overrides, so you can't do e.g. `rpm-ostree rebase ... --override-remove kernel --install kernel-rt`. So it'll have to remain a two-step operation there, but the `override remove ... --install ...` should still happen after the `rebase`. (We can enhance the `rebase` CLI, though long-term I think it'd be cleaner to use the D-Bus UpdateDeployment() API directly?)

Let's use this bug to track the 4.5 fix.
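For illustration only, the unified invocation proposed above could be assembled roughly as follows. This is a sketch, not what the MCO actually does: `unified_rebase_cmd` and its arguments are hypothetical, the target refspec is a placeholder, and the helper only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a single rpm-ostree command that
# rebases to the new OS content and swaps the RT kernel packages in the
# same transaction.
# Usage: unified_rebase_cmd TARGET "OLD_PKG ..." "NEW_RPM ..."
unified_rebase_cmd() {
    target=$1
    old_pkgs=$2
    new_rpms=$3
    cmd="rpm-ostree rebase $target"
    # Word splitting is intentional: each list is space-separated.
    for pkg in $old_pkgs; do
        cmd="$cmd --uninstall $pkg"
    done
    for new in $new_rpms; do
        cmd="$cmd --install $new"
    done
    printf '%s\n' "$cmd"
}

# Example use (placeholders, not real values):
#   unified_rebase_cmd "<new-os-refspec>" \
#       "kernel-rt-core-<old> kernel-rt-modules-<old>" \
#       "/path/to/kernel-rt-core-<new>.rpm /path/to/kernel-rt-modules-<new>.rpm"
```

Printing the command first makes it easy to review the package lists before anything touches the deployment.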
(In reply to Jonathan Lebon from comment #2)
> > This looks like an order of operation problem. The 4.5.3 machine-os-content has `linux-firmware-20191202-97.gite8a0f4c9.el8` included as part of the update. But based on this reproducer it seems like an upgrade of the RT kernel is attempted before the underlying RHCOS is updated and the RT kernel dependencies can't be fulfilled.
>
> Yup, I agree with your diagnosis. See
> https://bugzilla.redhat.com/show_bug.cgi?id=1859269#c7. This is not
> technically a new bug, but it's made easier to trigger by the 8.1 to 8.2
> update (it's kind of the RHCOS equivalent of
> https://github.com/coreos/fedora-coreos-tracker/issues/400, except here it's
> totally solvable by doing the upgrade first :) ).
>
> @Sinny, IIUC that should be solved by the extensions PR for 4.6, right? For
> 4.5, I think we'll need a fix where instead of `install` then `rebase`, we
> unify them into `rebase --install ... --uninstall ...`. That way it happens
> atomically.

Yeah, this should be fixed by the extensions PR https://github.com/openshift/machine-config-operator/pull/1941. Also, with PR https://github.com/openshift/machine-config-operator/pull/1766, which has already landed in 4.6, we will always pull the m-c-d binary from the image, so once PR#1941 lands, an upgrade from 4.5 to 4.6 should work as expected.

> This should be how it's done day 1 too, except that `rebase` doesn't support
> changing overrides, so you can't do e.g. `rpm-ostree rebase ...
> --override-remove kernel --install kernel-rt`. So it'll have to remain a
> two-step operation there, but the `override remove ... --install ...` should
> still happen after the `rebase`. (We can enhance the `rebase` CLI, though
> long-term I think it'd be cleaner to use the D-Bus UpdateDeployment() API
> directly?)

Fixing the m-c-d behavior in 4.5 is going to be tricky with the current design but should be doable with some time investment. I see upgrading to 4.6 as one solution.

> Let's use this bug to track the 4.5 fix.
> Fixing the m-c-d behavior in 4.5 is going to be tricky with the current design but should be doable with some time investment. I see upgrading to 4.6 as one solution.

Right =/ It will be messy to re-do this just for 4.5. But we may have to.
Just a stupid question, might this be fixed too by https://bugzilla.redhat.com/show_bug.cgi?id=1827712#c24 ? Or is that a different issue?
Yeah, this is a different issue. This issue won't happen in OCP 4.6 or later versions. We need to find a way to fix it in 4.5.
@Sinny, any plans to backport it into 4.4? I see the same issue during minor upgrades of OCP 4.4.5 -> 4.4.17 (node with RT kernel becomes degraded)
This was fixed in https://github.com/openshift/machine-config-operator/pull/2029
QE was unable to verify this BZ in time for release, so it has been dropped from the current advisory.
Verifying this issue is tricky because the RHCOS node must have a machine-config-daemon package that contains the patch.

1. Colin has mentioned some steps at https://github.com/openshift/machine-config-operator/pull/2029#issuecomment-682495058 which we can use to verify the bug. Copying the content here as well:

   Since 4.4.17 has already shipped this requires manual intervention:
   - Create a custom release image with new MCO from this patch based on e.g. 4.4.5
   - Upgrade to custom
   - Upgrade to 4.4.18 or a new release that still has this patch

2. Get machine-config-daemon-4.5.0-202008280032.p0.git.2558.a93c8dc.el8 (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1299991) or a later version that contains the patch. One can override the installed machine-config-daemon package and then try the `Steps to Reproduce` section in comment #0.
Moving it to Assigned to include the fixes in 4.5 boot images as well https://github.com/openshift/installer/pull/4125
See the corresponding 4.4 bug https://bugzilla.redhat.com/show_bug.cgi?id=1873383#c1, where we saw another instance of the issue, this time happening at cluster install time.
Moving this BZ back to Modified since we are no longer doing the bootimage bump; see https://github.com/openshift/installer/pull/4125#issuecomment-686751300
I spent some time trying to create an environment where this could be verified, but encountered a few issues:

1. There's no pure 4.5 environment that allows us to test the upgrade of 4.5 with the RT kernel where this issue can be reproduced, since RHCOS 4.5 has always used RHEL 8.2 and the issue occurs when upgrading from an RHCOS using RHEL 8.1 to an RHCOS using RHEL 8.2.

2. Trying to create a custom 4.4 environment as a starting point, with RHCOS using RHEL 8.1 and the MCO fix included, caused me to encounter BZ#1859269 when upgrading to an OCP 4.5 build.

I think the best we can hope for here, in terms of verifying the BZ, is to create a cluster using 4.5 with the MCO fixed, deploy the RT kernel on the worker nodes, and perform an upgrade to a newer 4.5. We can take steps to verify the MCO fix is included as expected and the upgrade was successful.

However, I don't think it is a good use of resources to try to create a frankenstein environment which would allow us to fully prove out this issue.
Right. Getting this fix in will still avoid any future upgrade issues, if applicable, and it also unblocks getting fixes into 4.4.z (where we have RHEL 8.1 content).
Verified using 4.5.8

Per the discussion in comments #20 + #21, I booted a 4.5.8 cluster in GCP, applied an MC to switch to the RT kernel on the worker nodes, and then upgraded to the latest 4.5 nightly. All operations were successful.

```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.8     True        False         5m24s   Cluster version is 4.5.8

$ oc get nodes
NAME                                                           STATUS   ROLES    AGE   VERSION
miabbott-4-5-8-mg4zb-master-0.c.openshift-gce-devel.internal   Ready    master   28m   v1.18.3+6c42de8
miabbott-4-5-8-mg4zb-master-1.c.openshift-gce-devel.internal   Ready    master   28m   v1.18.3+6c42de8
miabbott-4-5-8-mg4zb-master-2.c.openshift-gce-devel.internal   Ready    master   28m   v1.18.3+6c42de8
miabbott-4-5-8-mg4zb-worker-a-8h79w                            Ready    worker   16m   v1.18.3+6c42de8
miabbott-4-5-8-mg4zb-worker-b-jg22w                            Ready    worker   16m   v1.18.3+6c42de8
miabbott-4-5-8-mg4zb-worker-c-rtv8n                            Ready    worker   16m   v1.18.3+6c42de8

$ oc debug node/miabbott-4-5-8-mg4zb-worker-a-8h79w -- chroot /host uname -a
Starting pod/miabbott-4-5-8-mg4zb-worker-a-8h79w-debug ...
To use host binaries, run `chroot /host`
Linux miabbott-4-5-8-mg4zb-worker-a-8h79w 4.18.0-193.14.3.el8_2.x86_64 #1 SMP Mon Jul 20 15:02:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Removing debug pod ...

$ cat ../machineConfigs/worker-realtime.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: 99-worker-kerneltype
spec:
  kernelType: realtime

$ oc apply -f ../machineConfigs/worker-realtime.yaml
machineconfig.machineconfiguration.openshift.io/99-worker-kerneltype created

$ oc debug node/miabbott-4-5-8-mg4zb-worker-a-8h79w -- chroot /host uname -a
Starting pod/miabbott-4-5-8-mg4zb-worker-a-8h79w-debug ...
To use host binaries, run `chroot /host`
Linux miabbott-4-5-8-mg4zb-worker-a-8h79w 4.18.0-193.14.3.rt13.67.el8_2.x86_64 #1 SMP PREEMPT RT Mon Jul 20 16:41:14 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Removing debug pod ...

$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched

$ oc adm upgrade --allow-explicit-upgrade=true --allow-upgrade-with-warnings=true --force=true --to-image=registry.svc.ci.openshift.org/ocp/release@sha256:bf05358f3eba0d0135ddb46e710e5715c39d5d6a51283eaa4cae20751e74435e
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to preceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.svc.ci.openshift.org/ocp/release@sha256:bf05358f3eba0d0135ddb46e710e5715c39d5d6a51283eaa4cae20751e74435e

...

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-08-123650   True        False         17m     Cluster version is 4.5.0-0.nightly-2020-09-08-123650

$ oc debug node/miabbott-4-5-8-mg4zb-worker-a-8h79w -- chroot /host uname -a
Starting pod/miabbott-4-5-8-mg4zb-worker-a-8h79w-debug ...
To use host binaries, run `chroot /host`
Linux miabbott-4-5-8-mg4zb-worker-a-8h79w 4.18.0-193.19.1.rt13.70.el8_2.x86_64 #1 SMP PREEMPT RT Wed Aug 26 17:57:22 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Removing debug pod ...

$ oc get nodes -o wide
NAME                                                           STATUS   ROLES    AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                         CONTAINER-RUNTIME
miabbott-4-5-8-mg4zb-master-0.c.openshift-gce-devel.internal   Ready    master   96m   v1.18.3+6c42de8   10.0.0.6                    Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.el8_2.x86_64           cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8
miabbott-4-5-8-mg4zb-master-1.c.openshift-gce-devel.internal   Ready    master   96m   v1.18.3+6c42de8   10.0.0.4                    Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.el8_2.x86_64           cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8
miabbott-4-5-8-mg4zb-master-2.c.openshift-gce-devel.internal   Ready    master   96m   v1.18.3+6c42de8   10.0.0.5                    Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.el8_2.x86_64           cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8
miabbott-4-5-8-mg4zb-worker-a-8h79w                            Ready    worker   85m   v1.18.3+6c42de8   10.0.32.2                   Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.rt13.70.el8_2.x86_64   cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8
miabbott-4-5-8-mg4zb-worker-b-jg22w                            Ready    worker   85m   v1.18.3+6c42de8   10.0.32.3                   Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.rt13.70.el8_2.x86_64   cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8
miabbott-4-5-8-mg4zb-worker-c-rtv8n                            Ready    worker   85m   v1.18.3+6c42de8   10.0.32.4                   Red Hat Enterprise Linux CoreOS 45.82.202009081029-0 (Ootpa)   4.18.0-193.19.1.rt13.70.el8_2.x86_64   cri-o://1.18.3-12.rhaos4.5.git99f5d4a.el8
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.9 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3618