+++ This bug was initially created as a clone of Bug #1830409 +++

Upgrade stuck on etcd operator failing to upgrade. etcd in degraded state.

Steps to Reproduce:
1. deploy 4.3.18
2. upgrade to 4.4.0-0.nightly-2020-05-01-142756

Actual results:
etcd stuck for 2+ hours

Expected results:
Successful upgrade to 4.4.0-0.nightly-2020-05-01-142756

Additional info:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.18    True        True          167m    Working towards 4.4.0-0.nightly-2020-05-01-142756: 11% complete

etcd ClusterOperator:

  status:
    conditions:
    - lastTransitionTime: "2020-05-01T18:38:46Z"
      message: |-
        NodeInstallerDegraded: 1 nodes are failing on revision 3:
        NodeInstallerDegraded: pods "installer-3-mffaz1-mt6g4-master-2" not found
      reason: NodeInstaller_InstallerPodFailed
      status: "True"
      type: Degraded
    - lastTransitionTime: "2020-05-01T18:31:38Z"
      message: 'NodeInstallerProgressing: 1 nodes are at revision 2; 2 nodes are at revision 3'
      reason: NodeInstaller
      status: "True"
      type: Progressing
    - lastTransitionTime: "2020-05-01T18:37:20Z"
      message: |-
        StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 2; 2 nodes are at revision 3
        EtcdMembersAvailable: mffaz1-mt6g4-master-2,mffaz1-mt6g4-master-1,mffaz1-mt6g4-master-0 members are available, have not started, are unhealthy, are unknown
      reason: AsExpected
      status: "True"
      type: Available
    - lastTransitionTime: "2020-05-01T17:51:58Z"
      reason: AsExpected
      status: "True"
      type: Upgradeable
    ...

Seems like there was an installer-3 on master-2 at some point, but it was terminated:

$ grep 'mffaz1-mt6g4-master-2.*installer-3' host_service_logs/masters/kubelet_service.log | grep -B1 terminated | head -n2
May 01 18:35:59.934424 mffaz1-mt6g4-master-2 hyperkube[1377]: I0501 18:35:59.934295 1377 status_manager.go:575] Status for pod "installer-3-mffaz1-mt6g4-master-2_openshift-etcd(242ebc5c-c12b-4b7c-8f98-7183660460b3)" updated successfully: (2, {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.0.4 PodIP: PodIPs:[] StartTime:2020-05-01 18:35:53 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:installer State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fa5c557cc88d76148e963f914258323f7d3ec23ca9fb59c8516a50bbb90e858 ImageID: ContainerID: Started:0xc001367940}] QOSClass:Guaranteed EphemeralContainerStatuses:[]})
May 01 18:35:59.934587 mffaz1-mt6g4-master-2 hyperkube[1377]: I0501 18:35:59.934514 1377 kubelet_pods.go:940] Pod "installer-3-mffaz1-mt6g4-master-2_openshift-etcd(242ebc5c-c12b-4b7c-8f98-7183660460b3)" is terminated, but some volumes have not been cleaned up

Forcing a new revision on etcd served as a work-around:
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

As soon as the new revision was forced, the installer completed successfully on master-2, and the other two masters then caught up with the new revision 4.

$ oc get etcd '-o=jsonpath={range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
AllNodesAtLatestRevision
3 nodes are at revision 4

We still don't understand the reason for the original issue, but this can serve as a workaround if we hit this rare state again.

https://search.apps.build01.ci.devcluster.openshift.com/?search=etcd.%2BNodeInstaller_InstallerPodFailed&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job

Looks like 3-4 hits in the last seven days - it can't be filtered down by version, but it also failed in a similar fashion in a 4.4 upgrade job. That's probably between 1-5% at most.
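For reference, a minimal sketch of the workaround plus verification, assuming cluster-admin access with the oc CLI; the pod-watch step and the named-resource jsonpath are illustrative additions, not taken from the report above:

# Force the etcd operator to roll out a new static-pod revision.
$ oc patch etcd cluster --type=merge \
    -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}'

# Watch the installer pods get created and complete on each master in the openshift-etcd namespace.
$ oc get pods -n openshift-etcd -w | grep installer

# Confirm all nodes converge on the new revision.
$ oc get etcd cluster -o=jsonpath='{range .status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}{end}'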
apiVersion: operator.openshift.io/v1
kind: Etcd
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: "2020-05-16T04:20:06Z"
  generation: 3
  name: cluster
  resourceVersion: "76833320"
  selfLink: /apis/operator.openshift.io/v1/etcds/cluster
  uid: 7990f6e2-3f87-4783-96be-7d476f138fea
spec:
  forceRedeploymentReason: recovery-2020-05-21 20:53:41.618139268-05:00
  managementState: Managed
status:
  conditions:
  - lastTransitionTime: "2020-05-22T01:56:21Z"
    reason: MembersReported
    status: "False"
    type: EtcdMembersControllerDegraded
  - lastTransitionTime: "2020-05-21T23:50:24Z"
    reason: AsExpected
    status: "False"
    type: BootstrapTeardownDegraded
  - lastTransitionTime: "2020-05-16T04:20:49Z"
    message: |-
      pods "etcd-shmamtr002.hma.am.honda.com" not found
      pods "etcd-shmamtr003.hma.am.honda.com" not found
      pods "etcd-shmamtr001.hma.am.honda.com" not found
    reason: Error
    status: "True"
    type: StaticPodsDegraded
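Since the StaticPodsDegraded condition above reports the etcd static pods themselves as not found, one way to compare what the operator recorded per node against the pods the API server actually sees; a rough sketch, where the app=etcd label selector and the nodeStatuses field names are assumptions based on the static-pod operator layout, not taken from this report:

# Which etcd static pods actually exist, and on which nodes (assumed label: app=etcd).
$ oc get pods -n openshift-etcd -l app=etcd -o wide

# Per-node revisions as recorded by the operator (field names assumed).
$ oc get etcd cluster -o=jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'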
I thought I had a similar issue, but when I tried

$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

it stayed in the same state. Maybe the preceding case did not have the "pod not found" error.
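If the forced redeployment does not clear the degraded condition, it may help to capture the installer-pod and operator state first; a rough sketch, with placeholder pod names and the usual operator namespace assumed rather than confirmed here:

# List installer pods and their phases (the failing revision shows up in the pod name).
$ oc get pods -n openshift-etcd | grep installer

# Events for the installer pod named in the Degraded condition, if it still exists.
$ oc describe pod -n openshift-etcd installer-<revision>-<node>

# Operator logs can show why an installer pod was removed or never recreated (assumed namespace/deployment names).
$ oc logs -n openshift-etcd-operator deployment/etcd-operator | grep -i installer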
*** This bug has been marked as a duplicate of bug 1830510 ***