Bug 1821369
Summary: | Can't upgrade to 4.4.0-rc.6 because the openshift-apiserver operator is degraded | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Simon <skordas>
Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca>
Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4.4 | CC: | aos-bugs, mfojtik, wking
Target Milestone: | --- | Keywords: | Upgrades
Target Release: | 4.4.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-04-07 20:14:54 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Simon
2020-04-06 17:18:42 UTC
From the private must-gather, cluster-scoped-resources/config.openshift.io/clusteroperators/openshift-apiserver.yaml:

```
- lastTransitionTime: "2020-04-03T20:36:36Z"
  message: 'APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable'
  reason: APIServerDeployment_UnavailablePod
  status: "True"
  type: Degraded
```

So degraded because a control-plane node is down?

From comment 0's ClusterOperator dump, you can see that machine-config is stuck on 4.3.8 and unavailable. From cluster-scoped-resources/config.openshift.io/clusteroperators/machine-config.yaml:

```
- lastTransitionTime: "2020-04-03T20:44:54Z"
  message: 'Unable to apply 4.4.0-rc.6: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-05ae789544906d872f30dd4d3304b6ca expected a7b13759061f645a76f03c04d385d275bbbd0c02 has ab4d62a3bf3774b77b6f9b04a2028faec1568aca, retrying'
  reason: RequiredPoolsFailed
  status: "True"
  type: Degraded
```

But:

```
$ for X in cluster-scoped-resources/core/nodes/*.yaml; do yaml2json <"${X}" | jq -r '.ready = (.status.conditions[] | select(.type == "Ready")) | .metadata.name + " " + .metadata.annotations["machineconfiguration.openshift.io/currentConfig"] + " " + .ready.status + " " + .ready.lastTransitionTime + " " + .ready.message'; done | grep -v worker
ip-10-0-142-202.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:30:06Z kubelet is posting ready status
ip-10-0-157-62.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:25:12Z kubelet is posting ready status
ip-10-0-173-119.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:34:32Z kubelet is posting ready status
```

So the nodes all seem to be ready, even if they happen to be stuck on 4.3.8.

Finding the stuck pod:

```
$ grep nodeName namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-*/*.yaml
namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-6drpj/apiserver-85b7c7855d-6drpj.yaml:   nodeName: ip-10-0-173-119.us-east-2.compute.internal
namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-6g86z/apiserver-85b7c7855d-6g86z.yaml:   nodeName: ip-10-0-142-202.us-east-2.compute.internal
```

From the pod with no node, namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-mnkcn/apiserver-85b7c7855d-mnkcn.yaml:

```
- lastProbeTime: null
  lastTransitionTime: "2020-04-03T20:34:31Z"
  message: '0/13 nodes are available: 1 node(s) were unschedulable, 10 node(s) didn''t match node selector, 2 node(s) didn''t match pod affinity/anti-affinity.'
  reason: Unschedulable
  status: "False"
  type: PodScheduled
```

Back to the node YAML, cluster-scoped-resources/core/nodes/ip-10-0-157-62.us-east-2.compute.internal.yaml:

```
taints:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
- effect: NoSchedule
  key: node.kubernetes.io/unschedulable
  timeAdded: "2020-04-03T20:34:31Z"
unschedulable: true
```

So possibly this is still cordoned while the machine-config operator attempts to drain it. Moving to them to weigh in, but might be a dup of bug 1814241 / bug 1814282.

Checking for any csi-* stuff to see if this is a dup of bug 1814282:

```
$ grep -r csi- host_service_logs
...no hits...
```

And I don't expect there to have been an e2e run to leak storage resources in the history of this cluster. So this is probably a distinct issue.
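For reference, the same cordon/config spot check can be run against a live cluster instead of a must-gather. The sketch below is not part of the original triage: the label selector, output columns, and the assumption that `oc` and `jq` are available are mine.

```
# Hedged sketch: for each control-plane node, print its name, its current
# rendered MachineConfig, whether it is cordoned, and its taint keys.
# Assumes a reachable cluster, a logged-in `oc`, and jq installed.
$ oc get nodes -l node-role.kubernetes.io/master -o json | jq -r '
    .items[]
    | [ .metadata.name,
        .metadata.annotations["machineconfiguration.openshift.io/currentConfig"],
        (.spec.unschedulable // false | tostring),
        ([ .spec.taints[]?.key ] | join(",")) ]
    | @tsv'
```

A node that stays cordoned (unschedulable: true plus the node.kubernetes.io/unschedulable taint) long after the machine-config daemon started draining it is the same symptom seen in the must-gather above.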
Looking for drain details:

```
$ grep 'nodeName: ip-10-0-157-62\.' namespaces/openshift-machine-config-operator/pods/machine-config-daemon-*/*.yaml
namespaces/openshift-machine-config-operator/pods/machine-config-daemon-xzzsq/machine-config-daemon-xzzsq.yaml:   nodeName: ip-10-0-157-62.us-east-2.compute.internal
$ tail -n2 namespaces/openshift-machine-config-operator/pods/machine-config-daemon-xzzsq/machine-config-daemon/machine-config-daemon/logs/current.log
2020-04-03T21:10:27.309942395Z I0403 21:10:27.309939  159222 update.go:811] Removed stale file "/etc/kubernetes/manifests/etcd-member.yaml"
2020-04-03T21:10:27.310397419Z E0403 21:10:27.309986  159222 writer.go:135] Marking Degraded due to: rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh: invalid cross-device link
```

Ah, so maybe this one is a dup of bug 1814397, which should have been fixed for 4.4 via mco#1609 and bug 1817455, and which is still open for 4.3 with bug 1817458.

Checking rc.6:

```
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.6-x86_64 | grep machine-config
  machine-config-operator   https://github.com/openshift/machine-config-operator   a7b13759061f645a76f03c04d385d275bbbd0c02
$ git --no-pager log --format='%h %s' --first-parent -10 origin/release-4.4
b3578068 Merge pull request #1609 from runcom/upbug-4.4
a7b13759 Merge pull request #1583 from openshift-cherrypick-robot/cherry-pick-1580-to-release-4.4
```

So yeah, just missed picking up that fix. Closing this one as a dup.

*** This bug has been marked as a duplicate of bug 1817455 ***
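Background on the failing step above: "invalid cross-device link" is rename(2) returning EXDEV, which happens whenever the source and destination of a rename sit on different filesystems, since a plain rename cannot cross mount points. The sketch below only illustrates that errno, not the MCO code path; it assumes /dev/shm (a tmpfs) and /tmp are separate mounts on the host running it, and the output is trimmed to the final traceback line.

```
# Hedged sketch: trigger EXDEV ("invalid cross-device link") by asking
# rename(2) to move a file across mount points. Assumes /dev/shm is a tmpfs
# and /tmp lives on a different filesystem, which may not hold on every host.
$ touch /dev/shm/exdev-demo
$ python3 -c 'import os; os.rename("/dev/shm/exdev-demo", "/tmp/exdev-demo")'
OSError: [Errno 18] Invalid cross-device link: '/dev/shm/exdev-demo' -> '/tmp/exdev-demo'
```

Tools like `mv` avoid this by falling back to copy-and-unlink when rename(2) fails with EXDEV; a bare rename call, as in the MCD log above, gets no such fallback.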