Bug 1830409
| Summary: | Cluster upgrade stuck on etcd operator | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Eric Matysek <ematysek> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | ccoleman, erich, lmohanty, mgahagan, mifiedle, prubenda, skolicha, vlaad, wking, xxia, zyu |
| Target Milestone: | --- | Keywords: | UpcomingSprint, Upgrades |
| Target Release: | 4.5.0 | Flags: | erich: needinfo- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1830789 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-22 19:55:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1830789 | | |
Description
Eric Matysek
2020-05-01 20:42:37 UTC
```
status:
  conditions:
  - lastTransitionTime: "2020-05-01T18:38:46Z"
    message: |-
      NodeInstallerDegraded: 1 nodes are failing on revision 3:
      NodeInstallerDegraded: pods "installer-3-mffaz1-mt6g4-master-2" not found
    reason: NodeInstaller_InstallerPodFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-05-01T18:31:38Z"
    message: 'NodeInstallerProgressing: 1 nodes are at revision 2; 2 nodes are at revision 3'
    reason: NodeInstaller
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-05-01T18:37:20Z"
    message: |-
      StaticPodsAvailable: 3 nodes are active; 1 nodes are at revision 2; 2 nodes are at revision 3
      EtcdMembersAvailable: mffaz1-mt6g4-master-2,mffaz1-mt6g4-master-1,mffaz1-mt6g4-master-0 members are available, have not started, are unhealthy, are unknown
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2020-05-01T17:51:58Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension: null
  relatedObjects:
  - group: operator.openshift.io
    name: cluster
    resource: etcds
  - group: ""
    name: openshift-config
    resource: namespaces
  - group: ""
    name: openshift-config-managed
    resource: namespaces
  - group: ""
    name: openshift-etcd-operator
    resource: namespaces
  - group: ""
    name: openshift-etcd
    resource: namespaces
```

Checking on the Available=False node-tuning operator too:

```
$ oc get clusteroperators -o json node-tuning | jq -r '.status.conditions[] | select(.type == "Available")'
{
  "lastTransitionTime": "2020-05-01T18:29:00Z",
  "message": "DaemonSet \"tuned\" has no available Pod(s).",
  "reason": "TunedUnavailable",
  "status": "False",
  "type": "Available"
}
```

In the (private, sorry) must-gather:

```
$ grep 'revision\|phase' namespaces/openshift-etcd/pods/etcd-*/*.yaml
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-0/etcd-mffaz1-mt6g4-master-0.yaml: revision: "3"
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-0/etcd-mffaz1-mt6g4-master-0.yaml: phase: Running
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-1/etcd-mffaz1-mt6g4-master-1.yaml: revision: "3"
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-1/etcd-mffaz1-mt6g4-master-1.yaml: phase: Running
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-2/etcd-mffaz1-mt6g4-master-2.yaml: revision: "3"
namespaces/openshift-etcd/pods/etcd-mffaz1-mt6g4-master-2/etcd-mffaz1-mt6g4-master-2.yaml: phase: Running
```

So the operator's "1 nodes are at revision 2; 2 nodes are at revision 3" seems stale.

Back to tuning:

```
$ yaml2json <namespaces/openshift-cluster-node-tuning-operator/apps/daemonsets/tuned.yaml | jq -r .status
{
  "numberReady": "0",
  "numberMisscheduled": "0",
  "currentNumberScheduled": "0",
  "desiredNumberScheduled": "0"
}
```

So... maybe node-tuned needs to be more relaxed about not being asked for? Anyhow, still not sure why the etcd operator is confused. cluster-scoped-resources/operator.openshift.io/etcds/cluster.yaml has:

```
nodeStatuses:
- currentRevision: 3
  nodeName: mffaz1-mt6g4-master-0
- currentRevision: 3
  nodeName: mffaz1-mt6g4-master-1
- currentRevision: 2
  lastFailedRevision: 3
  lastFailedRevisionErrors:
  - pods "installer-3-mffaz1-mt6g4-master-2" not found
  nodeName: mffaz1-mt6g4-master-2
  targetRevision: 2
```

and also:

```
$ ls namespaces/openshift-etcd/pods | grep installer
installer-3-mffaz1-mt6g4-master-0
installer-3-mffaz1-mt6g4-master-1
```

Not sure why master-2 doesn't have an installer.
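If you need to make the same comparison on a live cluster rather than from a must-gather, a rough sketch is below. The `app=etcd` label selector and the `revision` pod label are assumptions about how the static pods are labeled, not something confirmed in this bug:

```
# Revisions the etcd operator believes each node is running
oc get etcd cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{"\t"}{.currentRevision}{"\n"}{end}'

# Revision label actually carried by each static etcd pod (label names assumed)
oc -n openshift-etcd get pods -l app=etcd -L revision

# Installer pods that exist for the current rollout
oc -n openshift-etcd get pods | grep installer
```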
Seems like there was an installer-3 on master-2 at some point, but it was terminated:

```
$ grep 'mffaz1-mt6g4-master-2.*installer-3' host_service_logs/masters/kubelet_service.log | grep -B1 terminated | head -n2
May 01 18:35:59.934424 mffaz1-mt6g4-master-2 hyperkube[1377]: I0501 18:35:59.934295 1377 status_manager.go:575] Status for pod "installer-3-mffaz1-mt6g4-master-2_openshift-etcd(242ebc5c-c12b-4b7c-8f98-7183660460b3)" updated successfully: (2, {Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [installer]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-05-01 18:35:53 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.0.4 PodIP: PodIPs:[] StartTime:2020-05-01 18:35:53 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:installer State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fa5c557cc88d76148e963f914258323f7d3ec23ca9fb59c8516a50bbb90e858 ImageID: ContainerID: Started:0xc001367940}] QOSClass:Guaranteed EphemeralContainerStatuses:[]})
May 01 18:35:59.934587 mffaz1-mt6g4-master-2 hyperkube[1377]: I0501 18:35:59.934514 1377 kubelet_pods.go:940] Pod "installer-3-mffaz1-mt6g4-master-2_openshift-etcd(242ebc5c-c12b-4b7c-8f98-7183660460b3)" is terminated, but some volumes have not been cleaned up
```

Not sure why, but maybe the issue is that etcd-operator isn't reconciling that and launching a replacement after a surprise termination?

Forcing a new revision on etcd served as a work-around:

```
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
```

As soon as the new revision was forced, master-2's installer completed successfully, and then the other two masters caught up to the new revision 4.

```
$ oc get etcd '-o=jsonpath={range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
AllNodesAtLatestRevision
3 nodes are at revision 4
```

We still don't understand the root cause of the original issue, but this can serve as a workaround if we hit this rare condition again.
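If you do use the forced-redeployment workaround, a small sketch like the one below can wait for the rollout to settle before you move on. The jsonpath mirrors the NodeInstallerProgressing condition shown above; the polling loop itself is only an illustration, not part of any documented procedure:

```
# Force a new etcd revision (same workaround as above)
oc patch etcd cluster --type=merge \
  -p '{"spec": {"forceRedeploymentReason": "recovery-'"$(date --rfc-3339=ns)"'"}}'

# Poll until the NodeInstallerProgressing condition reports all nodes at the latest revision
until oc get etcd cluster \
    -o jsonpath='{.status.conditions[?(@.type=="NodeInstallerProgressing")].reason}' \
    | grep -q AllNodesAtLatestRevision; do
  echo "waiting for etcd rollout to settle..."
  sleep 15
done
```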
And also, I'm still not clear why the tuned DaemonSet wanted no replicas. But after the etcd recovery above, it seems to have eventually worked itself out, and now we have:

```
$ oc -n openshift-cluster-node-tuning-operator get -o yaml daemonset tuned
...
status:
  currentNumberScheduled: 6
  desiredNumberScheduled: 6
  numberAvailable: 6
  numberMisscheduled: 0
  numberReady: 6
  observedGeneration: 1
  updatedNumberScheduled: 6
```

Happy operators, and a completed update:

```
$ oc get clusterversion version
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-05-01-142756   True        False         7m40s   Cluster version is 4.4.0-0.nightly-2020-05-01-142756
```

https://search.apps.build01.ci.devcluster.openshift.com/?search=etcd.%2BNodeInstaller_InstallerPodFailed&maxAge=168h&context=1&type=bug%2Bjunit&name=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job

Looks like 3-4 hits in the last seven days. I can't filter down by version, but it also failed in a similar fashion in a 4.4 upgrade job. That's probably a 1-5% failure rate at most.

FYI, since comment 2 shows 'pods "installer-3-...-master-2" not found' while comment 0 shows the nodes are ready, and comment 7 shows the same workaround, this bug appears to have the same symptom as bug 1817419. Etcd was observed hitting it once, per the record in bug 1817419#c3. Although the root cause may lie in another component, the components that report the issue should be improved to retry the rollout.

My latest attempt to upgrade an ARO cluster from 4.3.18 to 4.4.4 appears to have been successful; I will destroy the cluster and try another upgrade to see whether I can reproduce the issue.

```
[m@localhost ARO-RP]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.4     True        False         False      163m
cloud-credential                           4.4.4     True        False         False      3h
cluster-autoscaler                         4.4.4     True        False         False      169m
console                                    4.4.4     True        False         False      103m
csi-snapshot-controller                    4.4.4     True        False         False      108m
dns                                        4.4.4     True        False         False      175m
etcd                                       4.4.4     True        False         False      110m
image-registry                             4.4.4     True        False         False      112m
ingress                                    4.4.4     True        False         False      108m
insights                                   4.4.4     True        False         False      170m
kube-apiserver                             4.4.4     True        False         False      173m
kube-controller-manager                    4.4.4     True        False         False      140m
kube-scheduler                             4.4.4     True        False         False      172m
kube-storage-version-migrator              4.4.4     True        False         False      108m
machine-api                                4.4.4     True        False         False      175m
machine-config                             4.4.4     True        False         False      175m
marketplace                                4.4.4     True        False         False      106m
monitoring                                 4.4.4     True        False         False      100m
network                                    4.4.4     True        False         False      175m
node-tuning                                4.4.4     True        False         False      100m
openshift-apiserver                        4.4.4     True        False         False      111m
openshift-controller-manager               4.4.4     True        False         False      173m
openshift-samples                          4.4.4     True        True          False      2m23s
operator-lifecycle-manager                 4.4.4     True        False         False      171m
operator-lifecycle-manager-catalog         4.4.4     True        False         False      171m
operator-lifecycle-manager-packageserver   4.4.4     True        False         False      103m
service-ca                                 4.4.4     True        False         False      176m
service-catalog-apiserver                  4.4.4     True        False         False      171m
service-catalog-controller-manager         4.4.4     True        False         False      171m
storage                                    4.4.4     True        False         False      133m
```

I believe the openshift-samples operator restarting itself is a known issue that is being addressed in a separate BZ.

I have completed 3 successful upgrades of an ARO 4.3.18 cluster to 4.4.4, and the etcd operator finished each time. I did, however, see 2 instances of https://bugzilla.redhat.com/show_bug.cgi?id=1832986.
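When re-checking clusters like this, rather than eyeballing the full `oc get co` table, a quick filter can list only the operators that are unavailable, progressing, or degraded. This jq expression is just an illustrative sketch over the standard clusteroperator conditions:

```
# List clusteroperators with Available=False, Progressing=True, or Degraded=True
oc get clusteroperators -o json | jq -r '
  .items[]
  | select(
      ([.status.conditions[] | select(.type=="Available" and .status=="False")] | length > 0)
      or ([.status.conditions[] | select(.type=="Progressing" and .status=="True")] | length > 0)
      or ([.status.conditions[] | select(.type=="Degraded" and .status=="True")] | length > 0)
    )
  | .metadata.name'
```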
Reviewed the must-gather[1] a bit more and found the signature of the TLS bug we fixed for 4.4.4[2]:

```
2020-05-01T18:32:01.3610404Z 2020-05-01 18:32:01.360941 E | rafthttp: failed to dial 977ad1bb70f6178d on stream MsgApp v2 (x509: certificate is valid for localhost, mffaz1.qe.azure.devcluster.openshift.com, 10.0.0.5, not etcd-1.mffaz1.qe.azure.devcluster.openshift.com)
2020-05-01T18:32:01.3610404Z 2020-05-01 18:32:01.360961 I | rafthttp: peer 977ad1bb70f6178d became inactive (message send to peer failed)
2020-05-01T18:32:01.36189Z 2020-05-01 18:32:01.361788 E | rafthttp: failed to dial 72dbcc9dee90ce8 on stream MsgApp v2 (x509: certificate is valid for localhost, mffaz1.qe.azure.devcluster.openshift.com, 10.0.0.4, not etcd-2.mffaz1.qe.azure.devcluster.openshift.com)
2020-05-01T18:32:01.36189Z 2020-05-01 18:32:01.361804 I | rafthttp: peer 72dbcc9dee90ce8 became inactive (message send to peer failed)
```

Speaking with the OTA team, we have not seen any additional occurrences of this bug in upgrades to 4.4.4. Based on this data, I am marking this as a duplicate of bug 1830510.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1830409#c1
[2] https://github.com/openshift/cluster-etcd-operator/pull/340

*** This bug has been marked as a duplicate of bug 1830510 ***
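For anyone triaging a similar hang on a pre-4.4.4 cluster, a rough way to check running etcd members for the same certificate-SAN signature is sketched below; the `app=etcd` label selector and the `etcd` container name are assumptions, not details confirmed in this bug:

```
# Grep each etcd member's logs for the x509 SAN mismatch seen above
for pod in $(oc -n openshift-etcd get pods -l app=etcd -o name); do
  echo "== ${pod}"
  oc -n openshift-etcd logs "${pod}" -c etcd | grep 'rafthttp.*certificate is valid for' || true
done
```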