Bug 1830409
| Summary: | Cluster upgrade stuck on etcd operator | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Eric Matysek <ematysek> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | ccoleman, erich, lmohanty, mgahagan, mifiedle, prubenda, skolicha, vlaad, wking, xxia, zyu |
| Target Milestone: | --- | Keywords: | UpcomingSprint, Upgrades |
| Target Release: | 4.5.0 | Flags: | erich: needinfo- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1830789 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-22 19:55:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1830789 | | |
Description
Eric Matysek
2020-05-01 20:42:37 UTC
```
status:
  conditions:
  - lastTransitionTime: "2020-05-01T18:38:46Z"
    message: |-
      NodeInstallerDegraded: 1 nodes are failing on revision 3:
      NodeInstallerDegraded: pods "installer-3-mffaz1-mt6g4-master-2" not found
    reason: NodeInstaller_InstallerPodFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-05-01T18:31:38Z"
    message: 'NodeInstallerProgressing: 1 nodes are at revision 2; 2 nodes are at revision 3'
    reason: NodeInstaller
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-05-01T18:37:20Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 2; 2 nodes are at revision 3
      EtcdMembersAvailable: mffaz1-mt6g4-master-2,mffaz1-mt6g4-master-1,mffaz1-mt6g4-master-0 members are available, have not started, are unhealthy, are unknown
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-05-01T17:51:58Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: etcds
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-etcd-operator
    resource: namespaces
  - group: ""
    name: openshift-etcd
    resource: namespaces
```

Checking on the Available=False node-tuning operator too:

```
$ oc get clusteroperators -o json node-tuning | jq -r '.status.conditions[] | select(.type == "Available")'
{
  "lastTransitionTime": "2020-05-01T18:29:00Z",
  "message": "DaemonSet \"tuned\" has no available Pod(s).",
  "reason": "TunedUnavailable",
  "status": "False",
  "type": "Available"
}
```

In the (private, sorry) must-gather:

```
$ grep 'revision\|phase' namespaces/openshift-etcd/pods/etcd-*/*.yaml
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-0/etcd-mffaz1-mt6g4-master-0.yaml: revision: "3"
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-0/etcd-mffaz1-mt6g4-master-0.yaml: phase: Running
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-1/etcd-mffaz1-mt6g4-master-1.yaml: revision: "3"
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-1/etcd-mffaz1-mt6g4-master-1.yaml: phase: Running
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-2/etcd-mffaz1-mt6g4-master-2.yaml: revision: "3"
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-2/etcd-mffaz1-mt6g4-master-2.yaml: phase: Running
```

So the operator's "1 nodes are at revision 2; 2 nodes are at revision 3" seems stale.

Back to tuning:

```
$ yaml2json <namespaces/openshift-cluster-node-tuning-operator/apps/daemonsets/tuned.yaml | jq -r .status
{
  "numberReady": "0",
  "numberMisscheduled": "0",
  "currentNumberScheduled": "0",
  "desiredNumberScheduled": "0"
}
```

So... maybe node-tuned needs to be more relaxed about not being asked for? Anyhow, still not sure why the etcd operator is confused. cluster-scoped-resources/operator.openshift.io/etcds/cluster.yaml has:

```
nodeStatuses:
- currentRevision: 3
  nodeName: mffaz1-mt6g4-master-0
- currentRevision: 3
  nodeName: mffaz1-mt6g4-master-1
- currentRevision: 2
  lastFailedRevision: 3
  lastFailedRevisionErrors:
  - pods "installer-3-mffaz1-mt6g4-master-2" not found
  nodeName: mffaz1-mt6g4-master-2
  targetRevision: 2
```

and also:

```
$ ls namespaces/openshift-etcd/pods | grep installer
installer-3-mffaz1-mt6g4-master-0
installer-3-mffaz1-mt6g4-master-1
```

Not sure why master-2 doesn't have an installer.
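If you need to make the same comparison on a live cluster rather than from a must-gather, a rough sketch is below. The `app=etcd` label selector and the `revision` pod label are assumptions about how the static pods are labeled, not something confirmed in this bug:

```
# Revisions the etcd operator believes each node is running
oc get etcd cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{"\t"}{.currentRevision}{"\n"}{end}'

# Revision label actually carried by each static etcd pod (label names assumed)
oc -n openshift-etcd get pods -l app=etcd -L revision

# Installer pods that exist for the current rollout
oc -n openshift-etcd get pods | grep installer
```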
Seems like there was an installer-3 on master-2 at some point, but it was terminated:

```
$ grep 'mffaz1-mt6g4-master-2.*installer-3' host_service_logs/masters/kubelet_service.log | grep -B1 terminated | head -n2
May 01 18:35:59.934424 mffaz1-mt6g4-master-2 hyperkube[1377]: I0501 18:35:59.934295 1377 status_manager.go:575] Status for pod "installer-3-mffaz1-mt6g4-master-2_openshift-etcd(242ebc5c-c12b-4b7c-8f98-7183660460b3)" updated successfully: (2, {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.0.4 PodIP: PodIPs:[] StartTime:2020-05-01 18:35:53 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:installer State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fa5c557cc88d76148e963f914258323f7d3ec23ca9fb59c8516a50bbb90e858 ImageID: ContainerID: Started:0xc001367940}] QOSClass:Guaranteed EphemeralContainerStatuses:[]})
May 01 18:35:59.934587 mffaz1-mt6g4-master-2 hyperkube[1377]: I0501 18:35:59.934514 1377 kubelet_pods.go:940] Pod "installer-3-mffaz1-mt6g4-master-2_openshift-etcd(242ebc5c-c12b-4b7c-8f98-7183660460b3)" is terminated, but some volumes have not been cleaned up
```

Not sure why, but maybe the issue is that etcd-operator isn't reconciling that and launching a replacement after a surprise termination?

Forcing a new revision on etcd served as a work-around:

```
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
```

As soon as the new revision was forced, master-2's installer completed successfully, and then the other two masters caught up to the new revision 4.

```
$ oc get etcd '-o=jsonpath={range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
AllNodesAtLatestRevision
3 nodes are at revision 4
```

We still don't understand the root cause of the original issue, but this can serve as a workaround if we hit this rare condition again.
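If you do use the forced-redeployment workaround, a small sketch like the one below can wait for the rollout to settle before you move on. The jsonpath mirrors the NodeInstallerProgressing condition shown above; the polling loop itself is only an illustration, not part of any documented procedure:

```
# Force a new etcd revision (same workaround as above)
oc patch etcd cluster --type=merge \
  -p '{"spec": {"forceRedeploymentReason": "recovery-'"$(date --rfc-3339=ns)"'"}}'

# Poll until the NodeInstallerProgressing condition reports all nodes at the latest revision
until oc get etcd cluster \
    -o jsonpath='{.status.conditions[?(@.type=="NodeInstallerProgressing")].reason}' \
    | grep -q AllNodesAtLatestRevision; do
  echo "waiting for etcd rollout to settle..."
  sleep 15
done
```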
And also, I'm still not clear why the tuned DaemonSet wanted no replicas. But after the etcd recovery above, it seems to have eventually worked itself out, and now we have:

```
$ oc -n openshift-cluster-node-tuning-operator get -o yaml daemonset tuned
...
status:
  currentNumberScheduled: 6
  desiredNumberScheduled: 6
  numberAvailable: 6
  numberMisscheduled: 0
  numberReady: 6
  observedGeneration: 1
  updatedNumberScheduled: 6
```

Happy operators, and a completed update:

```
$ oc get clusterversion version
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-05-01-142756   True        False         7m40s   Cluster version is 4.4.0-0.nightly-2020-05-01-142756
```

https://search.apps.build01.ci.devcluster.openshift.com/?search=etcd.%2BNodeInstaller_InstallerPodFailed&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job

Looks like 3-4 hits in the last seven days. I can't filter down by version, but it also failed in a similar fashion in a 4.4 upgrade job. That's probably a 1-5% failure rate at most.

FYI, since comment 2 shows 'pods "installer-3-...-master-2" not found' while comment 0 shows the nodes are ready, and comment 7 shows the same workaround, this bug appears to have the same symptom as bug 1817419. Etcd was observed hitting it once, per the record in bug 1817419#c3. Although the root cause may lie in another component, the components that report the issue should be improved to retry the rollout.

My latest attempt to upgrade an ARO cluster from 4.3.18 to 4.4.4 appears to have been successful; I will destroy the cluster and try another upgrade to see whether I can reproduce the issue.

```
[m@localhost ARO-RP]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         False      163m
cloud-credential                           4.4.4     True        False         False      3h
cluster-autoscaler                         4.4.4     True        False         False      169m
console                                    4.4.4     True        False         False      103m
csi-snapshot-controller                    4.4.4     True        False         False      108m
dns                                        4.4.4     True        False         False      175m
etcd                                       4.4.4     True        False         False      110m
image-registry                             4.4.4     True        False         False      112m
ingress                                    4.4.4     True        False         False      108m
insights                                   4.4.4     True        False         False      170m
kube-apiserver                             4.4.4     True        False         False      173m
kube-controller-manager                    4.4.4     True        False         False      140m
kube-scheduler                             4.4.4     True        False         False      172m
kube-storage-version-migrator              4.4.4     True        False         False      108m
machine-api                                4.4.4     True        False         False      175m
machine-config                             4.4.4     True        False         False      175m
marketplace                                4.4.4     True        False         False      106m
monitoring                                 4.4.4     True        False         False      100m
network                                    4.4.4     True        False         False      175m
node-tuning                                4.4.4     True        False         False      100m
openshift-apiserver                        4.4.4     True        False         False      111m
openshift-controller-manager               4.4.4     True        False         False      173m
openshift-samples                          4.4.4     True        True          False      2m23s
operator-lifecycle-manager                 4.4.4     True        False         False      171m
operator-lifecycle-manager-catalog         4.4.4     True        False         False      171m
operator-lifecycle-manager-packageserver   4.4.4     True        False         False      103m
service-ca                                 4.4.4     True        False         False      176m
service-catalog-apiserver                  4.4.4     True        False         False      171m
service-catalog-controller-manager         4.4.4     True        False         False      171m
storage                                    4.4.4     True        False         False      133m
```

I believe the openshift-samples operator restarting itself is a known issue that is being addressed in a separate BZ.

I have completed 3 successful upgrades of an ARO 4.3.18 cluster to 4.4.4, and the etcd operator finished each time. I did, however, see 2 instances of https://bugzilla.redhat.com/show_bug.cgi?id=1832986.
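When re-checking clusters like this, rather than eyeballing the full `oc get co` table, a quick filter can list only the operators that are unavailable, progressing, or degraded. This jq expression is just an illustrative sketch over the standard clusteroperator conditions:

```
# List clusteroperators with Available=False, Progressing=True, or Degraded=True
oc get clusteroperators -o json | jq -r '
  .items[]
  | select(
      ([.status.conditions[] | select(.type=="Available" and .status=="False")] | length > 0)
      or ([.status.conditions[] | select(.type=="Progressing" and .status=="True")] | length > 0)
      or ([.status.conditions[] | select(.type=="Degraded" and .status=="True")] | length > 0)
    )
  | .metadata.name'
```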
Reviewed the must-gather[1] a bit more and found the signature of the TLS bug we fixed for 4.4.4[2]:

```
2020-05-01T18:32:01.3610404Z 2020-05-01 18:32:01.360941 E | rafthttp: failed to dial 977ad1bb70f6178d on stream MsgApp v2 (x509: certificate is valid for localhost, mffaz1.qe.azure.devcluster.openshift.com, 10.0.0.5, not etcd-1.mffaz1.qe.azure.devcluster.openshift.com)
2020-05-01T18:32:01.3610404Z 2020-05-01 18:32:01.360961 I | rafthttp: peer 977ad1bb70f6178d became inactive (message send to peer failed)
2020-05-01T18:32:01.36189Z 2020-05-01 18:32:01.361788 E | rafthttp: failed to dial 72dbcc9dee90ce8 on stream MsgApp v2 (x509: certificate is valid for localhost, mffaz1.qe.azure.devcluster.openshift.com, 10.0.0.4, not etcd-2.mffaz1.qe.azure.devcluster.openshift.com)
2020-05-01T18:32:01.36189Z 2020-05-01 18:32:01.361804 I | rafthttp: peer 72dbcc9dee90ce8 became inactive (message send to peer failed)
```

Speaking with the OTA team, we have not seen any additional occurrences of this bug in upgrades to 4.4.4. Based on this data, I am marking this as a duplicate of bug 1830510.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1830409#c1
[2] https://github.com/openshift/cluster-etcd-operator/pull/340

*** This bug has been marked as a duplicate of bug 1830510 ***
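For anyone triaging a similar hang on a pre-4.4.4 cluster, a rough way to check running etcd members for the same certificate-SAN signature is sketched below; the `app=etcd` label selector and the `etcd` container name are assumptions, not details confirmed in this bug:

```
# Grep each etcd member's logs for the x509 SAN mismatch seen above
for pod in $(oc -n openshift-etcd get pods -l app=etcd -o name); do
  echo "== ${pod}"
  oc -n openshift-etcd logs "${pod}" -c etcd | grep 'rafthttp.*certificate is valid for' || true
done
```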