Bug 1821369
| Summary: | Can't upgrade to 4.4.0-rc.6 because the openshift-apiserver operator is degraded | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon <skordas> |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.4 | CC: | aos-bugs, mfojtik, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-07 20:14:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Simon, 2020-04-06 17:18:42 UTC)
From the private must-gather, cluster-scoped-resources/config.openshift.io/clusteroperators/openshift-apiserver.yaml:

```yaml
- lastTransitionTime: "2020-04-03T20:36:36Z"
  message: 'APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable'
  reason: APIServerDeployment_UnavailablePod
  status: "True"
  type: Degraded
```
So it is degraded because a control-plane node is down? From comment 0's ClusterOperator dump, you can see that machine-config is stuck on 4.3.8 and unavailable. From cluster-scoped-resources/config.openshift.io/clusteroperators/machine-config.yaml:

```yaml
- lastTransitionTime: "2020-04-03T20:44:54Z"
  message: 'Unable to apply 4.4.0-rc.6: timed out waiting for the condition during
    syncRequiredMachineConfigPools: pool master has not progressed to latest configuration:
    controller version mismatch for rendered-master-05ae789544906d872f30dd4d3304b6ca
    expected a7b13759061f645a76f03c04d385d275bbbd0c02 has ab4d62a3bf3774b77b6f9b04a2028faec1568aca,
    retrying'
  reason: RequiredPoolsFailed
  status: "True"
  type: Degraded
```
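Roughly speaking, the "controller version mismatch" in that message is a string comparison: the controller commit the new release expects against the controller commit that generated the rendered config. Reduced to a sketch using the two hashes from the condition above:

```shell
# Hashes copied from the Degraded message above: "expected" is the MCO
# controller commit shipped in 4.4.0-rc.6, "has" is the commit recorded on
# the rendered-master config by the old 4.3.8-era controller.
expected=a7b13759061f645a76f03c04d385d275bbbd0c02
has=ab4d62a3bf3774b77b6f9b04a2028faec1568aca
if [ "$expected" != "$has" ]; then
  echo "controller version mismatch: expected $expected has $has, retrying"
fi
```

The pool stays Degraded, retrying, until the rendered config catches up and the two hashes converge.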
But:

```shell
$ for X in cluster-scoped-resources/core/nodes/*.yaml; do yaml2json <"${X}" | jq -r '.ready = (.status.conditions[] | select(.type == "Ready")) | .metadata.name + " " + .metadata.annotations["machineconfiguration.openshift.io/currentConfig"] + " " + .ready.status + " " + .ready.lastTransitionTime + " " + .ready.message'; done | grep -v worker
ip-10-0-142-202.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:30:06Z kubelet is posting ready status
ip-10-0-157-62.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:25:12Z kubelet is posting ready status
ip-10-0-173-119.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:34:32Z kubelet is posting ready status
```
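That one-liner can be exercised offline against a synthetic node object (the node name, config hash, and timestamp below are invented) to confirm what the jq filter extracts:

```shell
# Synthetic stand-in for `yaml2json < node.yaml` output; all values invented.
cat <<'EOF' > /tmp/node.json
{
  "metadata": {
    "name": "ip-10-0-0-1.example.internal",
    "annotations": {
      "machineconfiguration.openshift.io/currentConfig": "rendered-master-0123456789abcdef"
    }
  },
  "status": {
    "conditions": [
      {
        "type": "Ready",
        "status": "True",
        "lastTransitionTime": "2020-04-03T19:30:06Z",
        "message": "kubelet is posting ready status"
      }
    ]
  }
}
EOF
# Same filter as above: stash the Ready condition under .ready, then print
# node name, current MachineConfig, and readiness details on one line.
jq -r '.ready = (.status.conditions[] | select(.type == "Ready"))
  | .metadata.name + " "
    + .metadata.annotations["machineconfiguration.openshift.io/currentConfig"]
    + " " + .ready.status + " " + .ready.lastTransitionTime + " " + .ready.message' \
  /tmp/node.json
```

On the synthetic input this prints `ip-10-0-0-1.example.internal rendered-master-0123456789abcdef True 2020-04-03T19:30:06Z kubelet is posting ready status`, matching the shape of the real output above.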
So the nodes all seem to be ready, even if they happen to be stuck on 4.3.8. Finding the stuck pod:
```shell
$ grep nodeName namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-*/*.yaml
namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-6drpj/apiserver-85b7c7855d-6drpj.yaml: nodeName: ip-10-0-173-119.us-east-2.compute.internal
namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-6g86z/apiserver-85b7c7855d-6g86z.yaml: nodeName: ip-10-0-142-202.us-east-2.compute.internal
```
From the pod with no node, namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-mnkcn/apiserver-85b7c7855d-mnkcn.yaml:
```yaml
- lastProbeTime: null
  lastTransitionTime: "2020-04-03T20:34:31Z"
  message: '0/13 nodes are available: 1 node(s) were unschedulable, 10 node(s) didn''t
    match node selector, 2 node(s) didn''t match pod affinity/anti-affinity.'
  reason: Unschedulable
  status: "False"
  type: PodScheduled
```
Back to the node YAML cluster-scoped-resources/core/nodes/ip-10-0-157-62.us-east-2.compute.internal.yaml:
```yaml
taints:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
- effect: NoSchedule
  key: node.kubernetes.io/unschedulable
  timeAdded: "2020-04-03T20:34:31Z"
unschedulable: true
```
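Cordoned nodes in a must-gather can be picked out by that same spec.unschedulable field and the node.kubernetes.io/unschedulable taint; a sketch against a synthetic node (name and timestamp invented):

```shell
# Synthetic JSON for a cordoned node; all values invented for illustration.
cat <<'EOF' > /tmp/cordoned-node.json
{
  "metadata": {"name": "ip-10-0-0-2.example.internal"},
  "spec": {
    "unschedulable": true,
    "taints": [
      {"effect": "NoSchedule", "key": "node-role.kubernetes.io/master"},
      {
        "effect": "NoSchedule",
        "key": "node.kubernetes.io/unschedulable",
        "timeAdded": "2020-04-03T20:34:31Z"
      }
    ]
  }
}
EOF
# Print "<name> cordoned since <timeAdded>" for cordoned nodes only;
# nodes without spec.unschedulable produce no output.
jq -r 'select(.spec.unschedulable == true)
  | .metadata.name + " cordoned since "
    + (.spec.taints[] | select(.key == "node.kubernetes.io/unschedulable") | .timeAdded)' \
  /tmp/cordoned-node.json
```

Run over every node file (as in the earlier for-loop), this narrows a stuck drain down to the one cordoned machine.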
So possibly this node is still cordoned while the machine-config operator attempts to drain it. Moving to the Machine Config Operator component to weigh in, but this might be a dup of bug 1814241 / bug 1814282.
Checking for any csi-* stuff to see if this is a dup of bug 1814282:

```shell
$ grep -r csi- host_service_logs
```

...no hits... And I don't expect there to have been an e2e run to leak storage resources in the history of this cluster. So this is probably a distinct issue. Looking for drain details:

```shell
$ grep 'nodeName: ip-10-0-157-62\.' namespaces/openshift-machine-config-operator/pods/machine-config-daemon-*/*.yaml
namespaces/openshift-machine-config-operator/pods/machine-config-daemon-xzzsq/machine-config-daemon-xzzsq.yaml: nodeName: ip-10-0-157-62.us-east-2.compute.internal
$ tail -n2 namespaces/openshift-machine-config-operator/pods/machine-config-daemon-xzzsq/machine-config-daemon/machine-config-daemon/logs/current.log
2020-04-03T21:10:27.309942395Z I0403 21:10:27.309939  159222 update.go:811] Removed stale file "/etc/kubernetes/manifests/etcd-member.yaml"
2020-04-03T21:10:27.310397419Z E0403 21:10:27.309986  159222 writer.go:135] Marking Degraded due to: rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh: invalid cross-device link
```

Ah, so maybe this one is a dup of bug 1814397, which should have been fixed for 4.4 via mco#1609 and bug 1817455, and which is still open for 4.3 with bug 1817458. Checking rc.6:

```shell
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.6-x86_64 | grep machine-config
  machine-config-operator  https://github.com/openshift/machine-config-operator  a7b13759061f645a76f03c04d385d275bbbd0c02
$ git --no-pager log --format='%h %s' --first-parent -10 origin/release-4.4
b3578068 Merge pull request #1609 from runcom/upbug-4.4
a7b13759 Merge pull request #1583 from openshift-cherrypick-robot/cherry-pick-1580-to-release-4.4
```

So yeah, the release just missed picking up that fix. Closing this one as a dup.

*** This bug has been marked as a duplicate of bug 1817455 ***
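That last comparison (rc.6 ships a7b13759, and the #1609 fix merged as b3578068 on top of it) is an ancestry question git can answer directly with merge-base. A sketch against a throwaway repo, where two empty commits stand in for the shipped commit and the later fix merge (all names and messages here are invented):

```shell
set -e
rm -rf /tmp/mco-demo
git init -q /tmp/mco-demo
cd /tmp/mco-demo
# First commit stands in for the MCO commit shipped in the release
# (a7b13759 in the real case); hashes are whatever git assigns.
git -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m 'shipped in release'
shipped=$(git rev-parse HEAD)
# Second commit stands in for the fix merge (b3578068 / mco#1609), which
# landed after the shipped commit.
git -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m 'fix merge'
fix=$(git rev-parse HEAD)
# Exit status 0 means the fix is an ancestor of (i.e. contained in) the
# shipped commit; here it is not, so the release predates the fix.
if git merge-base --is-ancestor "$fix" "$shipped"; then
  echo "release contains the fix"
else
  echo "release predates the fix"
fi
```

The same check against the real repo (is b3578068 an ancestor of a7b13759?) gives the answer reached above: it is not, so rc.6 just missed the fix.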