Bug 1761557

Summary: 4.1 to 4.2 upgrade fails, second master is caught in a drain loop wedge and upgrade never completes
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Machine Config Operator
Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: urgent
Priority: unspecified
Version: 4.2.0
CC: alpatel, amurdaca, asadawar, baptiste.millemathias, benl, borazem, cblecker, ChetRHosey, christoph.obexer, clichybi, dahernan, dcain, dmoessne, dornelas, erich, gblomqui, jcoscia, jeder, jkaur, jswensso, kgarriso, mfuruta, mgugino, mharri, mifiedle, mpatel, ndordet, palshure, rbost, rkrawitz, rsandu, schoudha, scuppett, sgarciam, syangsao, trees, vlaad, wking
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Clones: 1762005, 1762565 (view as bug list)
Last Closed: 2020-01-23 11:07:31 UTC
Type: Bug
Bug Blocks: 1762005, 1762356, 1762565, 1763107

Description Clayton Coleman 2019-10-14 17:00:37 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/478

The machine config rollout starts on the masters, but the second master never gets completely rolled out. We see it attempting to drain a second time in the event log:

first drains

Oct 10 02:05:33.798 I node/ip-10-0-129-141.ec2.internal Written pending config rendered-worker-31944a31cfafcf0761c9241e3fe4eb05
Oct 10 02:05:33.995 I node/ip-10-0-129-141.ec2.internal Draining node to update config.
Oct 10 02:05:33.995 I node/ip-10-0-140-114.ec2.internal Written pending config rendered-master-1f24ee3a54a8096baebdebd8d853094b
Oct 10 02:05:34.057 I node/ip-10-0-140-114.ec2.internal Draining node to update config.

second drain

Oct 10 02:09:32.844 I node/ip-10-0-155-63.ec2.internal Booting node ip-10-0-155-63.ec2.internal, currentConfig rendered-master-09e0d8d727ce88d82598027dec078ffc, desiredConfig rendered-master-1f24ee3a54a8096baebdebd8d853094b
Oct 10 02:09:32.965 I node/ip-10-0-155-63.ec2.internal Written pending config rendered-master-1f24ee3a54a8096baebdebd8d853094b
Oct 10 02:09:32.986 I node/ip-10-0-155-63.ec2.internal Draining node to update config.

tries to drain again

Oct 10 02:17:46.244 I node/ip-10-0-155-63.ec2.internal Booting node ip-10-0-155-63.ec2.internal, currentConfig rendered-master-09e0d8d727ce88d82598027dec078ffc, desiredConfig rendered-master-1f24ee3a54a8096baebdebd8d853094b (3 times)
Oct 10 02:17:46.330 I node/ip-10-0-155-63.ec2.internal Written pending config rendered-master-1f24ee3a54a8096baebdebd8d853094b (2 times)
Oct 10 02:17:46.338 I node/ip-10-0-155-63.ec2.internal Draining node to update config. (2 times)

We don't believe this is a network problem, but are not sure.

Blocks upgrades from 4.1 to 4.2 and possibly blocks 4.2.0 z-stream upgrades.

Comment 1 Michael Gugino 2019-10-14 17:57:28 UTC
Oct 10 02:09:32.986 I node/ip-10-0-155-63.ec2.internal Draining node to update config.
...
Oct 10 02:09:34.602 I node/ip-10-0-155-63.ec2.internal Node ip-10-0-155-63.ec2.internal status is now: NodeNotSchedulable
...
Oct 10 02:55:01.976 I node/ip-10-0-155-63.ec2.internal Draining node to update config. (3 times)
...
Oct 10 02:56:56.353 I node/ip-10-0-155-63.ec2.internal Rolling back pending config rendered-master-1f24ee3a54a8096baebdebd8d853094b: failed to drain node (5 tries): timed out waiting for the condition: Get https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%!D(MISSING)ip-10-0-155-63.ec2.internal: unexpected EOF (3 times)

Something blocked the drain, but the node did not report unready. It reported 'NodeNotSchedulable' immediately after the first drain attempt (which is correct; that event represents the cordon operation).

Since the node was not ready, the most likely cause is a PDB issue.
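For reference, a quick way to confirm the PDB hypothesis on a live cluster would be something like the following (a sketch; the namespace is the one the guard pods run in per the events below, and a 0 in ALLOWED DISRUPTIONS means drain cannot evict the guard pod):

$ oc get pdb -n openshift-machine-config-operator
$ oc get pods -n openshift-machine-config-operator -o wide | grep etcd-quorum-guard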

etcd-quorum guard:

ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-4jbbg node/ip-10-0-140-114.ec2.internal graceful deletion within 600s
Oct 10 02:05:39.987 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 node/ created
Oct 10 02:05:39.991 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-4jbbg Stopping container guard
Oct 10 02:05:39.996 I ns/openshift-machine-config-operator replicaset/etcd-quorum-guard-85478598fc Created pod: etcd-quorum-guard-85478598fc-7pm69
Oct 10 02:05:40.002 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable.
...
Oct 10 02:09:26.846 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-85478598fc-7pm69 to ip-10-0-140-114.ec2.internal
Oct 10 02:09:27.347 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:528f2ead3d1605bdf818579976d97df5dd86df0a2a5d80df9aa8209c82333a86" already present on machine
Oct 10 02:09:27.511 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Created container guard
Oct 10 02:09:27.537 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Started container guard
Oct 10 02:09:29.002 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed: 
Oct 10 02:09:38.972 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed:  (2 times)
Oct 10 02:16:48.900 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-vnjvb node/ip-10-0-137-114.ec2.internal container=guard container stopped being ready
Oct 10 02:16:48.996 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp node/ip-10-0-155-63.ec2.internal container=guard container stopped being ready
Oct 10 02:16:48.998 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp Readiness probe failed:  (4 times)
Oct 10 02:16:48.998 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed:  (4 times)
Oct 10 02:16:49.374 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp Readiness probe failed:  (5 times)
Oct 10 02:16:49.894 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-vnjvb Readiness probe failed:  (2 times)

Comment 2 Michael Gugino 2019-10-14 17:58:24 UTC
(In reply to Michael Gugino from comment #1)
> Since the node was not ready, the most likely cause is a PDB issue.

*Correction: the node was not 'not ready'; it was still listed as Ready. I wrote it incorrectly above.

Comment 3 Michael Gugino 2019-10-14 18:04:45 UTC
Summarizing the etcd-quorum-guard events above:

ip-10-0-140-114.ec2.internal was drained successfully.
After ip-10-0-140-114.ec2.internal is back, pod/etcd-quorum-guard-85478598fc-7pm69 is successfully scheduled to it.
At 02:09:29.002, pod/etcd-quorum-guard-85478598fc-7pm69 fails its readiness check.
At 02:09:32.986, node/ip-10-0-155-63.ec2.internal is cordoned and drained.
Drain is blocked because the etcd-quorum-guard pod is unready on the master that was previously rebooted; this is expected.

Then the other two etcd-quorum-guard pods go unready for an unknown reason:
Oct 10 02:16:48.900 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-vnjvb node/ip-10-0-137-114.ec2.internal container=guard container stopped being ready
Oct 10 02:16:48.996 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp node/ip-10-0-155-63.ec2.internal container=guard container stopped being ready
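For anyone hitting this on a live cluster, a rough way to see where the rollout is stuck (the node name here is just the one from this run; the annotations are the ones written by the machine-config-daemon):

$ oc get machineconfigpool
$ oc describe node ip-10-0-155-63.ec2.internal | grep machineconfiguration.openshift.io/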

Comment 4 Robert Krawitz 2019-10-14 18:16:20 UTC
What does oc describe pod on the etcd-quorum-guard show?

Comment 5 Michael Gugino 2019-10-14 18:31:42 UTC
(In reply to Robert Krawitz from comment #4)
> What does oc describe pod on the etcd-quorum-guard show?

Robert,

  Unfortunately, must-gather didn't complete due to the API outage. The build log is all we have to go on at this point: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/478

Comment 6 Alay Patel 2019-10-14 20:39:39 UTC
Looking at the build-log.txt, it looks like `pod/etcd-member-ip-10-0-140-114.ec2.internal` was stopped on one node at 02:05 and `openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal` was stopped on another node at 02:16, so both pods were down at the same time. This explains why the etcd-quorum-guard on both of those nodes is failing.

Unfortunately, I am not able to figure out why two etcd pods were down at the same time. Is this expected?


logs:
---------

$ cat build-log.txt | grep etcd-member
Oct 10 02:05:43.197 I ns/openshift-etcd pod/etcd-member-ip-10-0-140-114.ec2.internal Stopping container etcd-member
Oct 10 02:05:43.204 I ns/openshift-etcd pod/etcd-member-ip-10-0-140-114.ec2.internal Stopping container etcd-metrics
Oct 10 02:05:43.905 W ns/openshift-etcd pod/etcd-member-ip-10-0-140-114.ec2.internal node/ip-10-0-140-114.ec2.internal graceful deletion within 0s
Oct 10 02:05:43.908 W ns/openshift-etcd pod/etcd-member-ip-10-0-140-114.ec2.internal node/ip-10-0-140-114.ec2.internal deleted
Oct 10 02:07:35.812 E ns/openshift-cluster-node-tuning-operator pod/tuned-w9shd node/ip-10-0-140-114.ec2.internal container=tuned container exited with code 255 (Error): /installer-8-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:37.729251   73175 openshift-tuned.go:539] Pod (openshift-kube-controller-manager/installer-5-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:37.933435   73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/installer-5-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:38.379044   73175 openshift-tuned.go:539] Pod (openshift-kube-controller-manager/installer-8-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:38.528874   73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/revision-pruner-6-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:38.728008   73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/installer-6-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:38.926691   73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/revision-pruner-2-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:39.124843   73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/revision-pruner-7-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:39.325879   73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/revision-pruner-4-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:39.730779   73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/revision-pruner-5-ip-10-0-140-114.ec2.internal) labels changed node wide: true\nI1010 02:05:43.874925   73175 openshift-tuned.go:397] Dumping labels to /var/lib/tuned/ocp-pod-labels.cfg\nI1010 02:05:43.876748   73175 openshift-tuned.go:430] Getting recommended profile...\nI1010 02:05:44.026187   73175 openshift-tuned.go:627] Active and recommended profile (openshift-control-plane) match.  Label changes will not trigger profile reload.\nI1010 02:05:44.027188   73175 openshift-tuned.go:539] Pod (openshift-etcd/etcd-member-ip-10-0-140-114.ec2.internal) labels changed node wide: true\n
Oct 10 02:09:44.177 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal Stopping container etcd-member
Oct 10 02:09:44.183 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal Stopping container etcd-metrics
Oct 10 02:16:44.676 W ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal node/ip-10-0-155-63.ec2.internal graceful deletion within 0s
Oct 10 02:16:48.057 W ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal node/ip-10-0-155-63.ec2.internal deleted
Oct 10 02:16:48.175 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal node/ip-10-0-155-63.ec2.internal created
Oct 10 02:16:57.767 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:32b5d2dec67f1f5099a3559c1b27e532ee8589e0efefc6d7268297b5b8802703" already present on machine
Oct 10 02:16:58.173 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal Created container discovery (2 times)
Oct 10 02:16:58.568 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal Started container discovery (2 times)

...
...
...
...

Comment 7 Alay Patel 2019-10-15 00:41:57 UTC
Some further observations:

$ cat build-log.txt | grep 'Booting.*ip-10-0-155-63.ec2.internal\|Setting node\|etcd.*failed'
...
...
Oct 10 02:08:25.242 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable. (18 times)
Oct 10 02:09:10.420 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable. (19 times)
Oct 10 02:09:21.075 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable. (20 times)
Oct 10 02:09:21.657 I node/ip-10-0-129-141.ec2.internal Setting node ip-10-0-129-141.ec2.internal, currentConfig rendered-worker-31944a31cfafcf0761c9241e3fe4eb05 to Done
Oct 10 02:09:22.265 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector.
Oct 10 02:09:26.792 I node/ip-10-0-140-114.ec2.internal Setting node ip-10-0-140-114.ec2.internal, currentConfig rendered-master-1f24ee3a54a8096baebdebd8d853094b to Done
Oct 10 02:09:26.828 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector.
Oct 10 02:09:26.846 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-85478598fc-7pm69 to ip-10-0-140-114.ec2.internal
Oct 10 02:09:27.347 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:528f2ead3d1605bdf818579976d97df5dd86df0a2a5d80df9aa8209c82333a86" already present on machine
Oct 10 02:09:27.511 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Created container guard
Oct 10 02:09:27.537 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Started container guard
Oct 10 02:09:29.002 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed:
Oct 10 02:09:32.844 I node/ip-10-0-155-63.ec2.internal Booting node ip-10-0-155-63.ec2.internal, currentConfig rendered-master-09e0d8d727ce88d82598027dec078ffc, desiredConfig rendered-master-1f24ee3a54a8096baebdebd8d853094b
Oct 10 02:09:38.972 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed:  (2 times)
Oct 10 02:09:44.820 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp Readiness probe failed:
Oct 10 02:16:48.154 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp Readiness probe failed:  (2 times)
Oct 10 02:16:48.470 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp Readiness probe failed:  (3 times)
Oct 10 02:16:48.823 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed:  (3 times)
...
...
...
...



It looks like node ip-10-0-140-114.ec2.internal was marked `Done` with the upgrade at 02:09:26.792 even though the etcd-quorum-guard-85478598fc-7pm69 readiness probe had not yet succeeded (it failed at 02:09:29.002). node/ip-10-0-155-63.ec2.internal was then picked up for reboot at 02:09:32.844, leading to two etcd pods being down at the same time. Is this expected?

It seems a node should only be marked Done after the readiness probe passes, so we are sure etcd on that node is back up, and only then should another node be rebooted.

Please let me know if I am missing something!

Thanks,
Alay

Comment 8 Michael Gugino 2019-10-15 12:54:43 UTC
> Looks like the node ip-10-0-140-114.ec2.internal was marked `Done` with the upgrade at 02:09:26.792 even if the etcd-quorum-guard-85478598fc-7pm69 readiness probe was not successful at `02:09:29.002`. Seeing this node/ip-10-0-155-63.ec2.internal was picked up at 02:09:32.844 for reboot, leading to 2 etcd-pods being down at the same time. Is this something expected? 

I don't believe node/ip-10-0-155-63.ec2.internal was ever rebooted, because it was stuck in drain due to the etcd-quorum-guard PDB. The PDB keeps the node from completing the drain and rebooting; this is expected. Also, the MCO doesn't need to wait for etcd-quorum-guard to become ready before moving on to the next node, because the PDB will block anything unsafe from happening.

It appears that something else caused etcd to fail on one or more masters before ip-10-0-140-114.ec2.internal came back online.
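As an illustration of that PDB protection (a sketch, not output from this run): a manual drain of the stuck master would be refused the eviction while the quorum guard is unready elsewhere, e.g.

$ oc adm drain ip-10-0-155-63.ec2.internal --ignore-daemonsets --delete-local-data
...
error when evicting pod "etcd-quorum-guard-..." : Cannot evict pod as it would violate the pod's disruption budget.

and the MCD just keeps retrying the drain, which matches the repeated "Draining node to update config" events above.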

Comment 11 Michael Gugino 2019-10-15 15:52:46 UTC
Sending this over to the etcd team. The action item is to figure out a way to always collect etcd logs (remote logging, permanent storage of some kind, etc.) so we can diagnose this issue further if we encounter it again.
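Until that exists, a manual way to grab the etcd logs from a still-reachable cluster (a sketch; the container name comes from the etcd-member static pod seen in the events above):

$ for p in $(oc get pods -n openshift-etcd -o name | grep etcd-member); do
>   oc logs -n openshift-etcd "$p" -c etcd-member --timestamps > "$(basename "$p").log"
> done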

Comment 13 Michael Gugino 2019-10-15 23:05:13 UTC
Another failed CI run:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/511


The following appears in the logs a few times; it looks like the etcd-quorum-guard pod is flapping. Some taint is coming and going (a quick way to watch it is sketched after the log excerpt).


Oct 15 14:49:14.329 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 1 node(s) were unschedulable, 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules.
Oct 15 14:49:14.329 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-69bdd668fc-m6l76 to ip-10-0-155-58.ec2.internal
Oct 15 14:49:14.332 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c98d37506c6469ecc3fec1f89413b60800a7818653bf2ac7a70739d2767d08c0" already present on machine
Oct 15 14:49:14.333 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Created container guard
Oct 15 14:49:14.334 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Started container guard
Oct 15 14:49:14.334 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Readiness probe failed:
...

Oct 15 14:51:20.883 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 1 node(s) were unschedulable, 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules.
Oct 15 14:51:20.883 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-69bdd668fc-m6l76 to ip-10-0-155-58.ec2.internal
Oct 15 14:51:20.886 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c98d37506c6469ecc3fec1f89413b60800a7818653bf2ac7a70739d2767d08c0" already present on machine

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/511/build-log.txt
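A crude way to watch the taint that is coming and going on that master (sketch; the node name is just the one from this run):

$ oc describe node ip-10-0-155-58.ec2.internal | grep -A3 Taints
$ while true; do oc get node ip-10-0-155-58.ec2.internal -o jsonpath='{.spec.taints}'; echo; sleep 5; done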

Comment 14 Michael Gugino 2019-10-15 23:37:56 UTC
Interesting note about the above log, and a few others I found in the failure logs: before the first master is drained, a worker is drained.

Interesting log line about the worker: 'Oct 15 14:33:09.602 W ns/openshift-monitoring pod/telemeter-client-9b678948b-tf2w7 node/ip-10-0-132-229.ec2.internal invariant violation (bug): pod should not transition Running->Pending even when terminated'

Comment 15 Alay Patel 2019-10-16 14:26:02 UTC
I tried to upgrade a cluster from 4.1.20 to 4.2.0 and encountered a similar problem. Observations: master node 1 drained; master node 2 is failing to drain while waiting for master 1 to come back up, but quorum is lost.

The etcd pod on master node 2 (the node failing to drain) was crash-looping because of the following error seen in the kubelet logs:

----
Oct 16 03:06:25 ip-10-0-147-106 hyperkube[1034]: E1016 03:06:25.854043    1034 file.go:187] Can't process manifest file "/etc/kubernetes/manifests/etcd-member.yaml": invalid pod: [spec.initContainers[0].image: Required value]

----
# cat /etc/kubernetes/manifests/etcd-member.yaml | grep image -C 2
  initContainers:
  - name: discovery
    image: ""
    args:
    - "run"

Because two etcd pods are out at the same time and the etcd pod on the 2nd master node is crash-looping, I believe this is the same bug as the one seen in the various CI failures.

Upon extracting the etcd db file and starting a single-member cluster on the same db, the configmap used by the MCO to populate the image URLs is as follows:

------
$ etcdctl get /kubernetes.io/configmaps/openshift-machine-config-operator/machine-config-operator-images
/kubernetes.io/configmaps/openshift-machine-config-operator/machine-config-operator-images
k8s

v1      ConfigMap

}
machine-config-operator-images!openshift-machine-config-operator"*$05af1c6f-efba-11e9-9195-0a0dfc43804a2zB


images.json
{
  "machineConfigOperator": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c8866339544dd34389652561e952f16034ccf55f45776c8b99099b4540a404e1",
  "etcd": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1f41ea9d99f6ec0dfff6e67b6fee0360a37114b3a533b680e3a1e107bd0da70d",
  "infraImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411564166ecb166e7846cd47cb6bcbce9ea174fb39c2bb14c6553ad0ec8bab38",
  "kubeClientAgentImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c465733d2c2031c55c0dd2717b94e69281d403c5a61cfc2da545dbdb7bbbe16e",
  "keepalivedImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a8a2f5c39f33e1d812f5e9f0fef9a1a6a002f36e7400eb6404024aac96209af9",
  "corednsImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ad83e438e33f0561b4b90a5574ad1b48f2003318ef1e93cbb119c87f36a2f15",
  "mdnsPublisherImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:56e2770c5c5f3bc0dae7ad99264ca8342b8086517d07c3c8f9543957954f43c2",
  "haproxyImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:04f2e7bcf5cb3e2aecc22e83870d966b2ad3660accab465cd13ba1b68e22afd5",
  "baremetalRuntimeCfgImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:50f432196c9ce96072806b6ec342458756973c5979266a4ab98772f1a490aa41"
}
"

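For reference, roughly how the inspection above can be reproduced from a copied db file (a sketch; the paths and the local defaults are assumptions, and --skip-hash-check is needed because the db comes straight out of a member's data dir rather than from `etcdctl snapshot save`):

$ ETCDCTL_API=3 etcdctl snapshot restore ./db --skip-hash-check --data-dir ./restored
$ etcd --data-dir ./restored &
$ ETCDCTL_API=3 etcdctl get /kubernetes.io/configmaps/openshift-machine-config-operator/machine-config-operator-images
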
Comment 16 Michael Gugino 2019-10-16 16:32:20 UTC
RCA:

There are two issues.

1) Changes to the image-name key for etcd in the MCO config resulted in the etcd-member static pod manifest being rendered with an empty image string.
2) Due to 1, an invalid etcd-member static pod was synced to the first master; that master drained as normal, and etcd-quorum-guard never went ready because etcd was dead (expected). However, because these manifests are applied *before* a machine is properly drained, we immediately applied the broken manifest to the next host once the first master finished booting, despite not having drained it first.

Recommended Fixes:
1) Ensure we don't push empty images in static pod manifests (a quick check for this is sketched below).
2) Drain first, then apply static pod manifests. This way etcd-quorum-guard will protect us in the future if we somehow get a bad etcd image.
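
A quick sanity check for fix 1 on a master host (sketch; any hit is a manifest the kubelet would reject, as the kubelet error below shows):

$ grep -n 'image: ""' /etc/kubernetes/manifests/*.yaml || echo "no empty images in static pod manifests"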


More Info
https://github.com/openshift/machine-config-operator/pull/1177

Oct 16 03:06:25 ip-10-0-147-106 hyperkube[1034]: E1016 03:06:25.854043    1034 file.go:187] Can't process manifest file "/etc/kubernetes/manifests/etcd-member.yaml": invalid pod: [spec.initContainers[0].image: Required value]

[root@ip-10-0-147-106 ~]# cat /etc/kubernetes/manifests/etcd-member.yaml | grep image -C 2
  initContainers:
  - name: discovery
    image: ""
    args:
    - "run"

Comment 18 ge liu 2019-10-17 10:41:23 UTC
@Alay Patel, QE tried upgrading from 4.1.z to 4.2 but has not triggered this problem. Are there any special steps to recreate it?

Comment 19 Sam Batschelet 2019-10-17 13:31:58 UTC
Ge,
We have found this to have about a 10% reproduction rate, so you will need to upgrade a few clusters to hit the bug. We upgraded 4.1.20 -> 4.2.0; no special actions were required.

Comment 20 ge liu 2019-10-18 10:20:50 UTC
Tried 4.1.20 to 4.2.0 more than 10 times, covering different hypervisor platforms, and have not triggered this problem. Also ran some upgrades from 4.1.20 to the latest 4.2 version and they passed. We will continue to watch this issue.

Comment 21 ge liu 2019-10-18 10:22:17 UTC
Please ignore comment 20; it is an update for 4.1 to 4.2.

Comment 22 ge liu 2019-11-18 08:10:09 UTC
Tried many times and have not triggered this problem with OCP 4.3.0-0.nightly-2019-11-17-224250.

Comment 28 errata-xmlrpc 2020-01-23 11:07:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062