Bug 1830789

Summary: [4.4] Cluster upgrade stuck on etcd operator
Product: OpenShift Container Platform
Component: Etcd
Version: 4.4
Target Release: 4.4.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
Reporter: Sam Batschelet <sbatsche>
Assignee: Sam Batschelet <sbatsche>
QA Contact: ge liu <geliu>
Docs Contact:
CC: ccoleman, ematysek, erich, geliu, john_schiwitz, mfojtik, mgahagan, mifiedle, rgregory, rkshirsa, scuppett, sdodson, skolicha, vlaad, wking, xxia, zyu
Keywords: UpcomingSprint, Upgrades
Target Milestone: ---
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1830409
Environment:
Last Closed: 2020-05-27 00:12:01 UTC
Type: ---
Bug Depends On: 1830409, 1830510, 1953291    
Bug Blocks:    

Description Sam Batschelet 2020-05-03 21:13:11 UTC
+++ This bug was initially created as a clone of Bug #1830409 +++

Upgrade is stuck on the etcd operator, which is failing to upgrade; etcd is in a degraded state.

Steps to Reproduce:
1. deploy 4.3.18
2. upgrade to 4.4.0-0.nightly-2020-05-01-142756
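
For reference, step 2 was run against an unsigned nightly, so the upgrade would have been triggered with something along these lines (the release image pullspec is a placeholder here, and --allow-explicit-upgrade/--force are what oc 4.3/4.4 typically needs for a nightly; exact flags may differ):

$ oc adm upgrade --to-image=<pullspec of 4.4.0-0.nightly-2020-05-01-142756> --allow-explicit-upgrade --force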

Actual results:
etcd stuck for 2+ hours

Expected results:
Successful upgrade to 4.4.0-0.nightly-2020-05-01-142756

Additional info:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.18    True        True          167m    Working towards 4.4.0-0.nightly-2020-05-01-142756: 11% complete

etcd ClusterOperator:

status:
  conditions:
  - lastTransitionTime: "2020-05-01T18:38:46Z"
    message: |-
      NodeInstallerDegraded: 1 nodes are failing on revision 3:
      NodeInstallerDegraded: pods "installer-3-mffaz1-mt6g4-master-2" not found
    reason: NodeInstaller_InstallerPodFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-05-01T18:31:38Z"
    message: 'NodeInstallerProgressing: 1 nodes are at revision 2; 2 nodes are at
      revision 3'
    reason: NodeInstaller
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-05-01T18:37:20Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 2; 2 nodes are at revision 3
      EtcdMembersAvailable: mffaz1-mt6g4-master-2,mffaz1-mt6g4-master-1,mffaz1-mt6g4-master-0 members are available,  have not started,  are unhealthy,  are unknown
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-05-01T17:51:58Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  ...
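
The condition list above can be re-pulled at any time with something like (assuming cluster-admin access):

$ oc get clusteroperator etcd -o yaml
$ oc describe clusteroperator etcd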

Seems like there was an installer-3 on master-2 at some point, but it was terminated:

$ grep 'mffaz1-mt6g4-master-2.*installer-3' host_service_logs/masters/kubelet_service.log | grep -B1 terminated | head -n2
May 01 18:35:59.934424 mffaz1-mt6g4-master-2 hyperkube[1377]: I0501 18:35:59.934295    1377 status_manager.go:575] Status for pod "installer-3-mffaz1-mt6g4-master-2_openshift-etcd(242ebc5c-c12b-4b7c-8f98-7183660460b3)" updated successfully: (2, {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.0.4 PodIP: PodIPs:[] StartTime:2020-05-01 18:35:53 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:installer State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fa5c557cc88d76148e963f914258323f7d3ec23ca9fb59c8516a50bbb90e858 ImageID: ContainerID: Started:0xc001367940}] QOSClass:Guaranteed EphemeralContainerStatuses:[]})
May 01 18:35:59.934587 mffaz1-mt6g4-master-2 hyperkube[1377]: I0501 18:35:59.934514    1377 kubelet_pods.go:940] Pod "installer-3-mffaz1-mt6g4-master-2_openshift-etcd(242ebc5c-c12b-4b7c-8f98-7183660460b3)" is terminated, but some volumes have not been cleaned up
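
A quick way to double-check whether the installer pod still exists at all (namespace taken from the pod name in the log above) is something like:

$ oc get pods -n openshift-etcd | grep installer
$ oc describe pod installer-3-mffaz1-mt6g4-master-2 -n openshift-etcd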

Forcing a new revision on etcd served as a work-around:

$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

As soon as the new revision was forced, the installer pod on master-2 completed successfully, and the other two masters then caught up to the new revision 4.

$ oc get etcd '-o=jsonpath={range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}{end}'
AllNodesAtLatestRevision
3 nodes are at revision 4

We still don't understand the root cause of the original issue, but forcing a new revision can be used as a workaround if we get into this rare state.
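
If the EtcdMembersAvailable message keeps listing unhealthy or unknown members after the forced redeployment, member health can be double-checked from one of the etcd static pods; the etcdctl container name below is an assumption based on the 4.4 pod layout, and the node name is one of the masters from this cluster:

$ oc rsh -n openshift-etcd -c etcdctl etcd-mffaz1-mt6g4-master-0 etcdctl member list -w table
$ oc rsh -n openshift-etcd -c etcdctl etcd-mffaz1-mt6g4-master-0 etcdctl endpoint health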

https://search.apps.build01.ci.devcluster.openshift.com/?search=etcd.%2BNodeInstaller_InstallerPodFailed&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job

Looks like 3-4 occurrences in the last seven days. The search can't be filtered by version, but it also failed in a similar fashion in a 4.4 upgrade job. That's probably a 1-5% failure rate at most.

Comment 9 john schiwitz 2020-05-22 02:12:24 UTC
apiVersion: operator.openshift.io/v1
kind: Etcd
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: "2020-05-16T04:20:06Z"
  generation: 3
  name: cluster
  resourceVersion: "76833320"
  selfLink: /apis/operator.openshift.io/v1/etcds/cluster
  uid: 7990f6e2-3f87-4783-96be-7d476f138fea
spec:
  forceRedeploymentReason: recovery-2020-05-21 20:53:41.618139268-05:00
  managementState: Managed
status:
  conditions:
  - lastTransitionTime: "2020-05-22T01:56:21Z"
    reason: MembersReported
    status: "False"
    type: EtcdMembersControllerDegraded
  - lastTransitionTime: "2020-05-21T23:50:24Z"
    reason: AsExpected
    status: "False"
    type: BootstrapTeardownDegraded
  - lastTransitionTime: "2020-05-16T04:20:49Z"
    message: |-
      pods "etcd-shmamtr002.hma.am.honda.com" not found
      pods "etcd-shmamtr003.hma.am.honda.com" not found
      pods "etcd-shmamtr001.hma.am.honda.com" not found
    reason: Error
    status: "True"
    type: StaticPodsDegraded

Comment 10 john schiwitz 2020-05-22 02:14:51 UTC
I thought I had a similar issue, but when I tried
  oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
it stayed in the same state. Maybe the preceding case did not have the "pods not found" errors.
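
For the "pods ... not found" state in comment 9, a first pass at narrowing things down might look like the following (the operator namespace and deployment name are assumed from the 4.4 layout):

$ oc get nodes
$ oc get pods -n openshift-etcd -o wide
$ oc logs -n openshift-etcd-operator deployment/etcd-operator | tail -n 50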

Comment 11 Sam Batschelet 2020-05-27 00:12:01 UTC
1830510

*** This bug has been marked as a duplicate of bug 1830510 ***