Description of problem:

1. When performing a disconnected upgrade of a KVM-hosted OCP Z cluster from 4.7.24 to 4.7.29, the upgrade process does not complete.

2. All operators show that they have upgraded to the target 4.7.29 release, except for the machine-config operator:

   machine-config   4.7.24   False   True   True   137m

3. The cluster status stays stuck at this point:

   [root@bastion ~]# oc get clusterversion
   NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
   version   4.7.24    True        True          51m     Working towards 4.7.29: 560 of 669 done (83% complete)

4. One of the master/control nodes and one of the worker nodes stay stuck at Ready,SchedulingDisabled:

   [root@bastion ~]# oc get nodes
   NAME                                                     STATUS                     ROLES    AGE    VERSION
   master-0.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   4h9m   v1.20.0+558d959
   master-1.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   master   4h9m   v1.20.0+558d959
   master-2.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   4h9m   v1.20.0+558d959
   worker-0.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   worker   4h1m   v1.20.0+558d959
   worker-1.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      worker   4h1m   v1.20.0+558d959

5. Taking a closer look at one of these nodes (master-1, for example) shows this "No such file or directory" error:

   [root@bastion ~]# oc describe node master-1.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com
   ...
   Annotations:  k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_master-0.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com","mac-address":"52:54:00:da:70:41...
                 k8s.ovn.org/node-chassis-id: 63a4adbf-120d-4767-ac63-c1a27c9ebead
                 k8s.ovn.org/node-local-nat-ip: {"default":["169.254.5.142"]}
                 k8s.ovn.org/node-mgmt-port-mac-address: 16:6e:ee:3b:d0:3f
                 k8s.ovn.org/node-primary-ifaddr: {"ipv4":"192.168.79.21/24"}
                 k8s.ovn.org/node-subnets: {"default":"10.130.0.0/23"}
                 machineconfiguration.openshift.io/currentConfig: rendered-master-d20c7a4b400256a687718be1d2c2e2a8
                 machineconfiguration.openshift.io/desiredConfig: rendered-master-e3807e131147bdb9afe722b453ec015e
                 machineconfiguration.openshift.io/reason: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:285b4e56265d28a5e46bb54714933348f8517d63de4d237568e457a995e3a... error: opendir(/run/mco-machine-os-content/os-content-838794730/srv/repo): No such file or directory
                 machineconfiguration.openshift.io/state: Degraded
                 volumes.kubernetes.io/controller-managed-attach-detach: true

Version-Release number of selected component (if applicable):
OCP 4.7.29

How reproducible:
Consistently reproducible

Steps to Reproduce:
1. Start with an OCP cluster at version 4.7.24, disconnected/restricted install type.
2. Perform a disconnected/restricted upgrade to target release 4.7.29 from the mirror site (see the command sketch at the end of this comment).
3. Observe that the upgrade does not complete: the machine-config operator remains at 4.7.24 and a few master/worker nodes remain in Ready,SchedulingDisabled state.

Actual results:
The OCP cluster upgrade process fails to complete.

Expected results:
The OCP cluster and all of its operators upgrade to the targeted release, OCP 4.7.29.

Additional info:
We are also experiencing upgrade problems from 4.7.24 to 4.7.29 on zVM. Additional details to follow from Kyle Moser.
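For reference, the disconnected upgrade in step 2 is driven as a by-digest update against the mirror registry, roughly along these lines (the registry host, repository path, and digest below are placeholders, not the exact values used in this environment):

```
# Point the cluster at the mirrored 4.7.29 release payload by digest
oc adm upgrade --allow-explicit-upgrade \
  --to-image bastion:5000/ocp4/openshift4@sha256:<4.7.29-release-digest>
```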
We also performed tests with connected and disconnected upgrades for the following releases:

4.7.24 --> 4.7.30 (KVM and zVM)
4.7.29 --> 4.7.30 (KVM and zVM)

These upgrades succeeded. We will continue to look at the 4.7.29 upgrade failure and post any new information to this bugzilla.
I don't think this is a blocker, as a functional upgrade path exists to a later release version (4.7.30). I am also lowering the severity to medium for the same reason: a functional upgrade path exists.

There is insufficient information here to make any determination as to why this is happening. I am going to assign this to the MCO team to help ask the right questions and understand what is going on.
Kyle/Phil,

Can we get the machine-config-daemon logs from the node with the issue? Also:
- What temp directories are present inside /run/mco-machine-os-content inside the daemon?
- Does the directory /run/mco-machine-os-content/os-content-838794730 exist?
- Did the image extraction succeed?

Thanks,
Prashanth
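A minimal sketch of how that data could be collected, assuming the daemon pod and node names from the cluster above (the pod name suffix is a placeholder):

```
# machine-config-daemon logs from the affected node (pod name is a placeholder)
oc -n openshift-machine-config-operator logs machine-config-daemon-xxxxx \
  -c machine-config-daemon > mcd-master-1.log

# list the temp extraction directories on the node itself
oc debug node/master-1.pok-243-sept-qemu.ocptest.pok.stglabs.ibm.com -- \
  chroot /host ls -l /run/mco-machine-os-content
```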
> failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:285b4e56265d28a5e46bb54714933348f8517d63de4d237568e457a995e3a...

Odd, that sha256 checksum looks truncated. The digest for 4.7.29 is:

```
$ oc adm release info --image-for=machine-os-content quay.io/openshift-release-dev/ocp-release:4.7.29-s390x
$
```

Mounting that container, I do see /srv/repo in there.

This might be related to https://bugzilla.redhat.com/show_bug.cgi?id=2000195 somehow?

May be worth retrying with 4.7.30.
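For reference, one way to inspect that image locally is via podman; a minimal sketch, assuming the full digest and a local pull-secret path (both placeholders here):

```
# pull the machine-os-content image and mount it to check for /srv/repo
# (run as root, or wrap the mount/unmount in 'podman unshare' for rootless)
podman pull --authfile /path/to/pull-secret.json \
  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<full-digest>
mnt=$(podman image mount quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<full-digest>)
ls "$mnt/srv/repo"
podman image unmount quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<full-digest>
```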
We will probably need a bit more in the way of logs: at a minimum, the machine-config-daemon logs from a failing node, though a full must-gather would be best. Like Colin said, since this is a disconnected setup, it is possible that the update is actually failing on podman actions and that the error bubbling up does not contain the full picture.
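For completeness, the full must-gather can be collected with something along these lines (the destination directory name is arbitrary):

```
oc adm must-gather --dest-dir=./must-gather-$(date +%Y%m%d)
```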
I re-ran a new disconnected upgrade from 4.7.24 to 4.7.29 to gather additional logs for all of the above inquiries.

As a baseline, here are the nodes that are currently stuck during the MCO update (master-0 and worker-0):

# oc get nodes
NAME                                                    STATUS                     ROLES    AGE   VERSION
master-0.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   master   12h   v1.20.0+558d959
master-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   12h   v1.20.0+558d959
master-2.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   12h   v1.20.0+558d959
worker-0.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   worker   12h   v1.20.0+558d959
worker-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   Ready                      worker   12h   v1.20.0+558d959

And these are the new machine-config daemons:

# oc get pods -o wide -n openshift-machine-config-operator
NAME                                        READY   STATUS    RESTARTS   AGE   IP              NODE                                                    NOMINATED NODE   READINESS GATES
machine-config-controller-8fdd8d8f9-cxqrm   1/1     Running   0          10h   10.128.0.77     master-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-daemon-bz456                 2/2     Running   0          10h   192.168.79.24   worker-0.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-daemon-dr4bm                 2/2     Running   0          10h   192.168.79.22   master-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-daemon-hxl4d                 2/2     Running   0          10h   192.168.79.21   master-0.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-daemon-klbjr                 2/2     Running   0          10h   192.168.79.25   worker-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-daemon-mmddq                 2/2     Running   0          10h   192.168.79.23   master-2.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-operator-7f489c49d5-b2gp8    1/1     Running   0          10h   10.130.0.78     master-2.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-server-hb7sb                 1/1     Running   0          10h   192.168.79.23   master-2.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-server-r776l                 1/1     Running   0          10h   192.168.79.21   master-0.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>
machine-config-server-tws9p                 1/1     Running   0          10h   192.168.79.22   master-1.pok-93-sept-qemu.ocptest.pok.stglabs.ibm.com   <none>           <none>

@Yu - I will post the must-gather and machine-config-daemon logs for further review. This is just to note what the must-gather summary ended with:

When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: edb5478c-8670-4722-81b1-1a8e9814ebbf
ClusterVersion: Updating to "4.7.29" from "4.7.24" for 12 hours: Unable to apply 4.7.29: wait has exceeded 40 minutes for these operators: openshift-apiserver
ClusterOperators:
    clusteroperator/authentication is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
    clusteroperator/ingress is degraded because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-7c76795c5d-6tmcp" cannot be scheduled: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
    clusteroperator/machine-config is not available (Cluster not available for 4.7.29) because Unable to apply 4.7.29: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)
    clusteroperator/openshift-apiserver is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()

@Prashanth and @Colin - Regarding the missing directory, if we look at master-0:

Annotations:  machineconfiguration.openshift.io/currentConfig: rendered-master-61ba39de1d60c9d70ee2d1337e1a7ad4
              machineconfiguration.openshift.io/desiredConfig: rendered-master-240d001f2dc7cdd91891266f11a87798
              machineconfiguration.openshift.io/reason: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:285b4e56265d28a5e46bb54714933348f8517d63de4d237568e457a995e3a... error: opendir(/run/mco-machine-os-content/os-content-152081691/srv/repo): No such file or directory
              machineconfiguration.openshift.io/state: Degraded
              volumes.kubernetes.io/controller-managed-attach-detach: true

That specific directory does not exist. However, the upgrade keeps trying to extract the contents into a new directory. From what I can tell, it only gets as far as the extensions directory before it stops, deletes the directory, and retries:

[core@master-0 ~]$ ls /run/mco-machine-os-content/os-content-389481125/
bin  boot  etc  extensions
[core@master-0 ~]$ ls /run/mco-machine-os-content/os-content-170249399/
bin  boot  etc  extensions

The sha256 checksum may look weird, but I can download the image directly from quay.io or from our mirror registry and extract the contents without any issue:

# oc image extract -a /root/.ocp4_pull_secret quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:285b4e56265d28a5e46bb54714933348f8517d63de4d237568e457a995e3abed
# ls
bin  boot  dev  etc  extensions  home  lib  lib64  lost+found  media  mnt  opt  pkglist.txt  proc  root  run  sbin  srv  sys  tmp  usr  var

# oc image extract -a /root/disconnectedinstall/pull-secret.json bastion:5000/ocp4/openshift4:4.7.29-s390x-machine-os-content
# ls
bin  boot  dev  etc  extensions  home  lib  lib64  lost+found  media  mnt  opt  pkglist.txt  proc  root  run  sbin  srv  sys  tmp  usr  var

Also, I want to confirm that we are able to perform a disconnected upgrade from 4.7.24 -> 4.7.30 successfully, which is good: the problem is resolved with a later release build. The main concern is that 4.7.29 continues to be available on the public mirror site with this potential problem, so a customer may still try the upgrade to 4.7.29. If their cluster gets into this state, is there a solution to upgrade them to 4.7.30 or above as a resolution?
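(For a cluster already stuck on the 4.7.29 update, one possible, unverified recovery sketch would be to retarget the update at a mirrored 4.7.30 payload that contains the fix. The registry host and digest below are placeholders, and depending on the oc client version and cluster state an --allow-upgrade-with-warnings flag may also be required since an update is already in progress.)

```
# mirrored 4.7.30 release payload, by digest (placeholder values)
RELEASE_IMAGE="bastion:5000/ocp4/openshift4@sha256:<4.7.30-release-digest>"

# retarget the in-progress update at the mirrored 4.7.30 release
oc adm upgrade --allow-explicit-upgrade --to-image "$RELEASE_IMAGE"
```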
Created attachment 1825363 [details] master-0 daemon log
Created attachment 1825364 [details] worker-0 daemon log
Created attachment 1825365 [details] must-gather.Sept22 partaa
Created attachment 1825366 [details] must-gather.Sept22 partab
Created attachment 1825367 [details] must-gather.Sept22 partac
Ah, ok, I think this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2000195, which has https://bugzilla.redhat.com/show_bug.cgi?id=2000746 tracking the fix for 4.7.z; as you noted, that is fixed in 4.7.30 (https://bugzilla.redhat.com/show_bug.cgi?id=2000746#c6).

As for pulling edges, we may need to evaluate the impact. It might affect more than just a few 4.7 z-stream releases.
@Trevor, some time has passed now. Do you know whether any action has been taken on this, or whether we still need to do anything? Essentially all disconnected upgrades in a window of ~6 (?) 4.7 releases are broken. I am not sure whether anyone else ran into this, though, and we do have much newer 4.7 edges. Is it maybe fine to close this?
Sorry for the slow response; I'd lost track of this one. Sounds like restricted-network updates from... something into 4.7.29 and maybe some earlier 4.7.z will hang on this. I'm not clear on mitigations or anything.

If there were any attached cases, or back when the fix from bug 2000746's 4.7.30 was newer, I'd probably attach UpgradeBlocker and an impact statement request per [1] to fill in those missing pieces. Even if we said "4.7.30 has been out for a while, so there are many alternative paths, and we aren't blocking edges on this", having clarity on affected updates, symptoms, mitigations, etc. can help folks who do happen to be bitten.

However, the lack of attached cases and the age of the 4.7.30 fix suggest that nobody's out there at the moment, or likely to be out there in the future, and in need of that clarity. So I'm going to close this as a dup, and we'll revisit mitigation notes and such if anyone actually trips over this in the wild.

[1]: https://github.com/openshift/enhancements/tree/2911c46bf7d2f22eb1ab81739b4f9c2603fd0c07/enhancements/update/update-blocker-lifecycle#impact-statement-request

*** This bug has been marked as a duplicate of bug 2000746 ***