Description of problem:

During the upgrade chain 4.2.24 -> 4.3.8 -> 4.4.0-rc.6 on a loaded cluster, the upgrade got stuck while moving to 4.4.0-rc.6 because the openshift-apiserver cluster operator went Degraded.

Version-Release number of selected component (if applicable):
4.4.0-rc.6

Cluster provider: AWS

How reproducible:
So far 1 of 1 attempts

Steps to Reproduce:
1. Create a 4.2.24 cluster and scale up to 10 x m5.xlarge worker nodes
2. Load it with 250 projects (each project: 1 pod, 10 build configs, 10 build templates, 10 image streams, 10 secrets, 10 routes)
3. Upgrade to 4.3.8
4. Upgrade to 4.4.0-rc.6

Actual results:

$ oc get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.8     True        True          70m     Unable to apply 4.4.0-rc.6: the cluster operator openshift-apiserver is degraded

$ oc get clusteroperators.config.openshift.io
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-rc.6   True        False         False      4h29m
cloud-credential                           4.4.0-rc.6   True        False         False      4h41m
cluster-autoscaler                         4.4.0-rc.6   True        False         False      4h37m
console                                    4.4.0-rc.6   True        False         False      33m
csi-snapshot-controller                    4.4.0-rc.6   True        False         False      25m
dns                                        4.4.0-rc.6   True        False         False      4h41m
etcd                                       4.4.0-rc.6   True        False         False      68m
image-registry                             4.4.0-rc.6   True        False         False      11m
ingress                                    4.4.0-rc.6   True        False         False      14m
insights                                   4.4.0-rc.6   True        False         False      4h41m
kube-apiserver                             4.4.0-rc.6   True        False         False      4h40m
kube-controller-manager                    4.4.0-rc.6   True        False         False      65m
kube-scheduler                             4.4.0-rc.6   True        False         False      65m
kube-storage-version-migrator              4.4.0-rc.6   True        False         False      17m
machine-api                                4.4.0-rc.6   True        False         False      4h41m
machine-config                             4.3.8        False       True          True       22m
marketplace                                4.4.0-rc.6   True        False         False      59m
monitoring                                 4.4.0-rc.6   True        False         False      55m
network                                    4.4.0-rc.6   True        False         False      4h40m
node-tuning                                4.4.0-rc.6   True        False         False      60m
openshift-apiserver                        4.4.0-rc.6   True        False         True       51m
openshift-controller-manager               4.4.0-rc.6   True        False         False      4h39m
openshift-samples                          4.4.0-rc.6   True        False         False      60m
operator-lifecycle-manager                 4.4.0-rc.6   True        False         False      4h40m
operator-lifecycle-manager-catalog         4.4.0-rc.6   True        False         False      4h40m
operator-lifecycle-manager-packageserver   4.4.0-rc.6   True        False         False      32m
service-ca                                 4.4.0-rc.6   True        False         False      4h41m
service-catalog-apiserver                  4.4.0-rc.6   True        False         False      4h38m
service-catalog-controller-manager         4.4.0-rc.6   True        False         False      4h38m
storage                                    4.4.0-rc.6   True        False         False      60m

Expected results:

Successful upgrade to 4.4.0-rc.6

Additional info:
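For triage, the reason an operator reports Degraded can be read straight off its ClusterOperator conditions. A minimal sketch (not from the original report; assumes a reachable API server, oc logged in with cluster-admin, and jq installed):

# list operators whose Degraded condition is currently True
$ oc get clusteroperators -o json \
    | jq -r '.items[] | select(any(.status.conditions[]?; .type == "Degraded" and .status == "True")) | .metadata.name'
# dump the Degraded message for openshift-apiserver
$ oc get clusteroperator openshift-apiserver \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'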
From the private must-gather, cluster-scoped-resources/config.openshift.io/clusteroperators/openshift-apiserver.yaml:

  - lastTransitionTime: "2020-04-03T20:36:36Z"
    message: 'APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable'
    reason: APIServerDeployment_UnavailablePod
    status: "True"
    type: Degraded

so degraded because a control-plane node is down? From comment 0's ClusterOperator dump, you can see that machine-config is stuck on 4.3.8 and unavailable. From cluster-scoped-resources/config.openshift.io/clusteroperators/machine-config.yaml:

  - lastTransitionTime: "2020-04-03T20:44:54Z"
    message: 'Unable to apply 4.4.0-rc.6: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-05ae789544906d872f30dd4d3304b6ca expected a7b13759061f645a76f03c04d385d275bbbd0c02 has ab4d62a3bf3774b77b6f9b04a2028faec1568aca, retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded

But:

$ for X in cluster-scoped-resources/core/nodes/*.yaml; do yaml2json <"${X}" | jq -r '.ready = (.status.conditions[] | select(.type == "Ready")) | .metadata.name + " " + .metadata.annotations["machineconfiguration.openshift.io/currentConfig"] + " " + .ready.status + " " + .ready.lastTransitionTime + " " + .ready.message'; done | grep -v worker
ip-10-0-142-202.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:30:06Z kubelet is posting ready status
ip-10-0-157-62.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:25:12Z kubelet is posting ready status
ip-10-0-173-119.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:34:32Z kubelet is posting ready status

So the master nodes all seem to be ready, even if they happen to be stuck on the 4.3.8 rendered config. Finding the stuck pod:

$ grep nodeName namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-*/*.yaml
namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-6drpj/apiserver-85b7c7855d-6drpj.yaml: nodeName: ip-10-0-173-119.us-east-2.compute.internal
namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-6g86z/apiserver-85b7c7855d-6g86z.yaml: nodeName: ip-10-0-142-202.us-east-2.compute.internal

From the pod with no node, namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-mnkcn/apiserver-85b7c7855d-mnkcn.yaml:

  - lastProbeTime: null
    lastTransitionTime: "2020-04-03T20:34:31Z"
    message: '0/13 nodes are available: 1 node(s) were unschedulable, 10 node(s) didn''t match node selector, 2 node(s) didn''t match pod affinity/anti-affinity.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled

Back to the node YAML, cluster-scoped-resources/core/nodes/ip-10-0-157-62.us-east-2.compute.internal.yaml:

  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2020-04-03T20:34:31Z"
  unschedulable: true

so possibly this node is still cordoned while the machine-config operator attempts to drain it. Moving this to the machine-config component for them to weigh in, but it might be a dup of bug 1814241 / bug 1814282.
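As an aside (not from the must-gather), the same cordon/drain state is quick to confirm on a live cluster; a sketch, assuming oc access to the cluster:

# cordoned nodes show up with SchedulingDisabled in their STATUS column
$ oc get nodes -l node-role.kubernetes.io/master=
# find the machine-config daemon pod on the cordoned node, then read its recent log for the drain/update attempt
$ oc -n openshift-machine-config-operator get pods -o wide --field-selector spec.nodeName=ip-10-0-157-62.us-east-2.compute.internal
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> -c machine-config-daemon --tail=20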
Checking for any csi-* stuff to see if this is a dup of bug 1814282:

$ grep -r csi- host_service_logs
...no hits...

And I don't expect an e2e run to have leaked storage resources at any point in this cluster's history, so this is probably a distinct issue. Looking for drain details:

$ grep 'nodeName: ip-10-0-157-62\.' namespaces/openshift-machine-config-operator/pods/machine-config-daemon-*/*.yaml
namespaces/openshift-machine-config-operator/pods/machine-config-daemon-xzzsq/machine-config-daemon-xzzsq.yaml: nodeName: ip-10-0-157-62.us-east-2.compute.internal
$ tail -n2 namespaces/openshift-machine-config-operator/pods/machine-config-daemon-xzzsq/machine-config-daemon/machine-config-daemon/logs/current.log
2020-04-03T21:10:27.309942395Z I0403 21:10:27.309939  159222 update.go:811] Removed stale file "/etc/kubernetes/manifests/etcd-member.yaml"
2020-04-03T21:10:27.310397419Z E0403 21:10:27.309986  159222 writer.go:135] Marking Degraded due to: rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh: invalid cross-device link

Ah, so maybe this one is a dup of bug 1814397, which should have been fixed for 4.4 via mco#1609 / bug 1817455, and which is still open for 4.3 as bug 1817458. Checking rc.6:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.6-x86_64 | grep machine-config
  machine-config-operator    https://github.com/openshift/machine-config-operator    a7b13759061f645a76f03c04d385d275bbbd0c02
$ git --no-pager log --format='%h %s' --first-parent -10 origin/release-4.4
b3578068 Merge pull request #1609 from runcom/upbug-4.4
a7b13759 Merge pull request #1583 from openshift-cherrypick-robot/cherry-pick-1580-to-release-4.4

So yeah, rc.6 pins the machine-config operator at a7b13759, the commit immediately before the #1609 fix merged in b3578068, so it just missed picking up that fix. Closing this one as a dup.

*** This bug has been marked as a duplicate of bug 1817455 ***
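Follow-up note, not part of the original triage: a generic way to double-check whether a release payload contains a given fix is to compare the commit pinned by 'oc adm release info --commits' against the fix's merge commit in a clone of the repo. A sketch using the SHAs from the log above:

$ git clone https://github.com/openshift/machine-config-operator && cd machine-config-operator
# exits 0 only if the mco#1609 merge (b3578068) is an ancestor of the commit pinned
# in the release image (a7b13759...); for 4.4.0-rc.6 it is not, hence the dup
$ git merge-base --is-ancestor b3578068 a7b13759061f645a76f03c04d385d275bbbd0c02 && echo contains-fix || echo missing-fix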