Bug 1761557
Summary: 4.1 to 4.2 upgrade fails, second master is caught in a drain loop wedge and upgrade never completes

Product: OpenShift Container Platform
Component: Machine Config Operator
Version: 4.2.0
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Type: Bug
Reporter: Clayton Coleman <ccoleman>
Assignee: Antonio Murdaca <amurdaca>
QA Contact: Michael Nguyen <mnguyen>
CC: alpatel, amurdaca, asadawar, baptiste.millemathias, benl, borazem, cblecker, ChetRHosey, christoph.obexer, clichybi, dahernan, dcain, dmoessne, dornelas, erich, gblomqui, jcoscia, jeder, jkaur, jswensso, kgarriso, mfuruta, mgugino, mharri, mifiedle, mpatel, ndordet, palshure, rbost, rkrawitz, rsandu, schoudha, scuppett, sgarciam, syangsao, trees, vlaad, wking
Doc Type: If docs needed, set a value
Last Closed: 2020-01-23 11:07:31 UTC
Clones: 1762005, 1762565 (view as bug list)
Bug Blocks: 1762005, 1762356, 1762565, 1763107
Description
Clayton Coleman
2019-10-14 17:00:37 UTC
Oct 10 02:09:32.986 I node/ip-10-0-155-63.ec2.internal Draining node to update config.
...
Oct 10 02:09:34.602 I node/ip-10-0-155-63.ec2.internal Node ip-10-0-155-63.ec2.internal status is now: NodeNotSchedulable
...
Oct 10 02:55:01.976 I node/ip-10-0-155-63.ec2.internal Draining node to update config. (3 times)
...
Oct 10 02:56:56.353 I node/ip-10-0-155-63.ec2.internal Rolling back pending config rendered-master-1f24ee3a54a8096baebdebd8d853094b: failed to drain node (5 tries): timed out waiting for the condition: Get https://172.30.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%!D(MISSING)ip-10-0-155-63.ec2.internal: unexpected EOF (3 times)

Something blocked drain, but the node did not report unready. It reported 'NodeNotSchedulable' immediately after the first drain attempt (which is correct; the event represents the cordon operation).

Since node was not ready, most likely cause is PDB issues.

etcd-quorum guard:

ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-4jbbg node/ip-10-0-140-114.ec2.internal graceful deletion within 600s
Oct 10 02:05:39.987 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 node/ created
Oct 10 02:05:39.991 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-4jbbg Stopping container guard
Oct 10 02:05:39.996 I ns/openshift-machine-config-operator replicaset/etcd-quorum-guard-85478598fc Created pod: etcd-quorum-guard-85478598fc-7pm69
Oct 10 02:05:40.002 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable.
...
Oct 10 02:09:26.846 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-85478598fc-7pm69 to ip-10-0-140-114.ec2.internal
Oct 10 02:09:27.347 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:528f2ead3d1605bdf818579976d97df5dd86df0a2a5d80df9aa8209c82333a86" already present on machine
Oct 10 02:09:27.511 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Created container guard
Oct 10 02:09:27.537 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Started container guard
Oct 10 02:09:29.002 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed:
Oct 10 02:09:38.972 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed: (2 times)
Oct 10 02:16:48.900 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-vnjvb node/ip-10-0-137-114.ec2.internal container=guard container stopped being ready
Oct 10 02:16:48.996 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp node/ip-10-0-155-63.ec2.internal container=guard container stopped being ready
Oct 10 02:16:48.998 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp Readiness probe failed: (4 times)
Oct 10 02:16:48.998 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed: (4 times)
Oct 10 02:16:49.374 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp Readiness probe failed: (5 times)
Oct 10 02:16:49.894 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-vnjvb Readiness probe failed: (2 times)

(In reply to Michael Gugino from comment #1)
> Since node was not ready, most likely cause is PDB issues.

*since node was not 'not ready'.
AKA, the node was listed as ready; I wrote it wrong above.

Summarizing the etcd-quorum-guard events above: ip-10-0-140-114.ec2.internal was drained successfully. After ip-10-0-140-114.ec2.internal is back, pod/etcd-quorum-guard-85478598fc-7pm69 is successfully scheduled to it. At 02:09:29.002, pod/etcd-quorum-guard-85478598fc-7pm69 fails its readiness check. At 02:09:32.986, node/ip-10-0-155-63.ec2.internal is cordoned and drained. Drain is blocked because the etcd-quorum pod is unready on the master that was previously rebooted; this is expected.

Then the other two etcd-quorum pods go unready for an unknown reason:

Oct 10 02:16:48.900 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-vnjvb node/ip-10-0-137-114.ec2.internal container=guard container stopped being ready
Oct 10 02:16:48.996 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp node/ip-10-0-155-63.ec2.internal container=guard container stopped being ready

What does oc describe pod on the etcd-quorum-guard show?

(In reply to Robert Krawitz from comment #4)
> What does oc describe pod on the etcd-quorum-guard show?

Robert, unfortunately must-gather didn't complete due to the API outage. The build log is all we have to go on at this point: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/478

Looking at build-log.txt, it looks like `pod/etcd-member-ip-10-0-140-114.ec2.internal` was stopped on one node at 02:05 and `openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal` was stopped on another node at 02:16. After that, both pods were down, which explains why the etcd-quorum guard on both of them is failing. Unfortunately, I am not able to figure out why two pods were down at the same time. Is this expected?
logs:
---------
$ cat build-log.txt | grep etcd-member

Oct 10 02:05:43.197 I ns/openshift-etcd pod/etcd-member-ip-10-0-140-114.ec2.internal Stopping container etcd-member
Oct 10 02:05:43.204 I ns/openshift-etcd pod/etcd-member-ip-10-0-140-114.ec2.internal Stopping container etcd-metrics
Oct 10 02:05:43.905 W ns/openshift-etcd pod/etcd-member-ip-10-0-140-114.ec2.internal node/ip-10-0-140-114.ec2.internal graceful deletion within 0s
Oct 10 02:05:43.908 W ns/openshift-etcd pod/etcd-member-ip-10-0-140-114.ec2.internal node/ip-10-0-140-114.ec2.internal deleted
Oct 10 02:07:35.812 E ns/openshift-cluster-node-tuning-operator pod/tuned-w9shd node/ip-10-0-140-114.ec2.internal container=tuned container exited with code 255 (Error): /installer-8-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:37.729251 73175 openshift-tuned.go:539] Pod (openshift-kube-controller-manager/installer-5-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:37.933435 73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/installer-5-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:38.379044 73175 openshift-tuned.go:539] Pod (openshift-kube-controller-manager/installer-8-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:38.528874 73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/revision-pruner-6-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:38.728008 73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/installer-6-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:38.926691 73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/revision-pruner-2-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:39.124843 73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/revision-pruner-7-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:39.325879 73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/revision-pruner-4-ip-10-0-140-114.ec2.internal) labels changed node wide: false\nI1010 02:05:39.730779 73175 openshift-tuned.go:539] Pod (openshift-kube-scheduler/revision-pruner-5-ip-10-0-140-114.ec2.internal) labels changed node wide: true\nI1010 02:05:43.874925 73175 openshift-tuned.go:397] Dumping labels to /var/lib/tuned/ocp-pod-labels.cfg\nI1010 02:05:43.876748 73175 openshift-tuned.go:430] Getting recommended profile...\nI1010 02:05:44.026187 73175 openshift-tuned.go:627] Active and recommended profile (openshift-control-plane) match. Label changes will not trigger profile reload.\nI1010 02:05:44.027188 73175 openshift-tuned.go:539] Pod (openshift-etcd/etcd-member-ip-10-0-140-114.ec2.internal) labels changed node wide: true\n
Oct 10 02:09:44.177 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal Stopping container etcd-member
Oct 10 02:09:44.183 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal Stopping container etcd-metrics
Oct 10 02:16:44.676 W ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal node/ip-10-0-155-63.ec2.internal graceful deletion within 0s
Oct 10 02:16:48.057 W ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal node/ip-10-0-155-63.ec2.internal deleted
Oct 10 02:16:48.175 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal node/ip-10-0-155-63.ec2.internal created
Oct 10 02:16:57.767 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:32b5d2dec67f1f5099a3559c1b27e532ee8589e0efefc6d7268297b5b8802703" already present on machine
Oct 10 02:16:58.173 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal Created container discovery (2 times)
Oct 10 02:16:58.568 I ns/openshift-etcd pod/etcd-member-ip-10-0-155-63.ec2.internal Started container discovery (2 times)
...
Some further observations:

$ cat build-log.txt | grep 'Booting.*ip-10-0-155-63.ec2.internal\|Setting node\|etcd.*failed'

...
Oct 10 02:08:25.242 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable. (18 times)
Oct 10 02:09:10.420 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable. (19 times)
Oct 10 02:09:21.075 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 2 node(s) were unschedulable. (20 times)
Oct 10 02:09:21.657 I node/ip-10-0-129-141.ec2.internal Setting node ip-10-0-129-141.ec2.internal, currentConfig rendered-worker-31944a31cfafcf0761c9241e3fe4eb05 to Done
Oct 10 02:09:22.265 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector.
Oct 10 02:09:26.792 I node/ip-10-0-140-114.ec2.internal Setting node ip-10-0-140-114.ec2.internal, currentConfig rendered-master-1f24ee3a54a8096baebdebd8d853094b to Done
Oct 10 02:09:26.828 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector.
Oct 10 02:09:26.846 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-85478598fc-7pm69 to ip-10-0-140-114.ec2.internal
Oct 10 02:09:27.347 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:528f2ead3d1605bdf818579976d97df5dd86df0a2a5d80df9aa8209c82333a86" already present on machine
Oct 10 02:09:27.511 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Created container guard
Oct 10 02:09:27.537 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Started container guard
Oct 10 02:09:29.002 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed:
Oct 10 02:09:32.844 I node/ip-10-0-155-63.ec2.internal Booting node ip-10-0-155-63.ec2.internal, currentConfig rendered-master-09e0d8d727ce88d82598027dec078ffc, desiredConfig rendered-master-1f24ee3a54a8096baebdebd8d853094b
Oct 10 02:09:38.972 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed: (2 times)
Oct 10 02:09:44.820 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp Readiness probe failed:
Oct 10 02:16:48.154 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp Readiness probe failed: (2 times)
Oct 10 02:16:48.470 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-p79wp Readiness probe failed: (3 times)
Oct 10 02:16:48.823 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-85478598fc-7pm69 Readiness probe failed: (3 times)
...

Looks like the node ip-10-0-140-114.ec2.internal was marked `Done` with the upgrade at 02:09:26.792 even though the etcd-quorum-guard-85478598fc-7pm69 readiness probe had not succeeded as of `02:09:29.002`. node/ip-10-0-155-63.ec2.internal was then picked up at 02:09:32.844 for reboot, leading to two etcd pods being down at the same time. Is this expected? It seems that a node should only be marked Done after the readiness probe passes, so we are sure that etcd on that node is successfully up, and only then should another node be rebooted. Please let me know if I am missing something! Thanks, Alay

> Looks like the node ip-10-0-140-114.ec2.internal was marked `Done` with the upgrade at 02:09:26.792 even if the etcd-quorum-guard-85478598fc-7pm69 readiness probe was not successful at `02:09:29.002`. [...] Is this something expected?
I don't believe node/ip-10-0-155-63.ec2.internal was ever rebooted, because it was stuck in drain due to the PDB of etcd-quorum-guard. The PDB will keep the node from completing drain and rebooting; this is expected. Also, the MCO doesn't need to wait for etcd-quorum-guard to become ready before moving to the next node, because the PDB will block bad stuff from happening.
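The PDB behavior described above can be sketched as a small simulation. This is illustrative only: the real check is performed by the Kubernetes eviction API against the etcd-quorum-guard PodDisruptionBudget, and the replica and minAvailable values here are assumptions, not taken from this cluster's objects.

```shell
# Toy model of why drain wedges: evicting a ready guard pod would drop the
# ready count below the budget, so the eviction is denied and drain retries.
replicas=3        # assumed: one etcd-quorum-guard pod per master
min_available=2   # assumed budget value (illustrative)
ready_pods=2      # one guard pod is already unready on the rebooted master

if [ $((ready_pods - 1)) -lt "$min_available" ]; then
  echo "eviction denied: PDB would be violated"
else
  echo "eviction allowed"
fi
```

Under these assumed numbers the eviction is denied, which matches the repeated "Draining node to update config." retries seen on the second master.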
It appears that something else caused etcd to fail on one or more masters before ip-10-0-140-114.ec2.internal came back online.
Sending this over to the etcd team. The action item is to figure out a way to always collect etcd logs (remote logging, permanent storage of some type, etc.) so we can diagnose this issue further if we encounter it again.

Another failed CI run: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/511

The following appears in the logs a few times; it looks like the etcd-quorum pod is flapping, and some taint is coming and going.

Oct 15 14:49:14.329 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 1 node(s) were unschedulable, 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules.
Oct 15 14:49:14.329 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-69bdd668fc-m6l76 to ip-10-0-155-58.ec2.internal
Oct 15 14:49:14.332 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c98d37506c6469ecc3fec1f89413b60800a7818653bf2ac7a70739d2767d08c0" already present on machine
Oct 15 14:49:14.333 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Created container guard
Oct 15 14:49:14.334 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Started container guard
Oct 15 14:49:14.334 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Readiness probe failed:
...
Oct 15 14:51:20.883 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 1 node(s) were unschedulable, 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules.
Oct 15 14:51:20.883 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-69bdd668fc-m6l76 to ip-10-0-155-58.ec2.internal
Oct 15 14:51:20.886 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-69bdd668fc-m6l76 Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c98d37506c6469ecc3fec1f89413b60800a7818653bf2ac7a70739d2767d08c0" already present on machine

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/511/build-log.txt

Interesting note about the above log, and a few others I found in the failure logs: before the first master is drained, a worker is drained. Interesting log about the worker:

'Oct 15 14:33:09.602 W ns/openshift-monitoring pod/telemeter-client-9b678948b-tf2w7 node/ip-10-0-132-229.ec2.internal invariant violation (bug): pod should not transition Running->Pending even when terminated'

I tried to update a cluster from 4.1.20 to 4.2.0 and encountered a similar problem. Observations: master node 1 drained; master 2 is failing drain, waiting for master 1 to come up, but quorum is lost.
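The quorum loss follows directly from etcd's Raft majority rule: a cluster of N members needs floor(N/2) + 1 members up to commit writes. A quick sketch of the arithmetic for the 3-master case described above:

```shell
# Raft majority arithmetic for a 3-member etcd cluster.
members=3
quorum=$(( members / 2 + 1 ))   # floor(3/2) + 1 = 2
healthy=1                       # two etcd members down at once, as in this upgrade

if [ "$healthy" -lt "$quorum" ]; then
  echo "quorum lost: the cluster cannot serve writes"
fi
```

With only one of three members healthy, the cluster is below quorum, which is also why the API server calls in the drain loop start failing with "unexpected EOF".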
The etcd pod on master node 2 (the node failing drain) was crash-looping because of the following error seen in the kubelet logs:

----
Oct 16 03:06:25 ip-10-0-147-106 hyperkube[1034]: E1016 03:06:25.854043 1034 file.go:187] Can't process manifest file "/etc/kubernetes/manifests/etcd-member.yaml": invalid pod: [spec.initContainers[0].image: Required value]
----

# cat /etc/kubernetes/manifests/etcd-member.yaml | grep image -C 2
initContainers:
- name: discovery
  image: ""
  args:
  - "run"

Because two etcd pods were out at the same time and the etcd pod on the second master node was crash-looping, I believe this is the same bug as the one seen in the various CI failures.

Upon extracting the etcd db file and starting a single-member cluster on the same db, the configmap used by the MCO to populate the image URL is as follows:

------
$ etcdctl get /kubernetes.io/configmaps/openshift-machine-config-operator/machine-config-operator-images
/kubernetes.io/configmaps/openshift-machine-config-operator/machine-config-operator-images
k8s v1 ConfigMap } machine-config-operator-images!openshift-machine-config-operator"*$05af1c6f-efba-11e9-9195-0a0dfc43804a2zB images.json
{
  "machineConfigOperator": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c8866339544dd34389652561e952f16034ccf55f45776c8b99099b4540a404e1",
  "etcd": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1f41ea9d99f6ec0dfff6e67b6fee0360a37114b3a533b680e3a1e107bd0da70d",
  "infraImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:411564166ecb166e7846cd47cb6bcbce9ea174fb39c2bb14c6553ad0ec8bab38",
  "kubeClientAgentImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c465733d2c2031c55c0dd2717b94e69281d403c5a61cfc2da545dbdb7bbbe16e",
  "keepalivedImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a8a2f5c39f33e1d812f5e9f0fef9a1a6a002f36e7400eb6404024aac96209af9",
  "corednsImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ad83e438e33f0561b4b90a5574ad1b48f2003318ef1e93cbb119c87f36a2f15",
  "mdnsPublisherImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:56e2770c5c5f3bc0dae7ad99264ca8342b8086517d07c3c8f9543957954f43c2",
  "haproxyImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:04f2e7bcf5cb3e2aecc22e83870d966b2ad3660accab465cd13ba1b68e22afd5",
  "baremetalRuntimeCfgImage": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:50f432196c9ce96072806b6ec342458756973c5979266a4ab98772f1a490aa41"
}

RCA: There are two issues.

1) A change to the image-name key for etcd in the MCO config resulted in the etcd-member static pod manifest being rendered with an empty image string.

2) Due to 1, an invalid etcd-member static pod was synced to the first master. That master drained as normal, and etcd-quorum-guard never went ready because etcd was dead (expected). However, because these manifests are applied *before* a machine is properly drained, after the first master finished booting we immediately applied the broken manifest to the next host, despite not draining it first.

Recommended fixes:

1) Ensure we don't push empty images in static pod manifests.
2) Drain first, then apply static pod manifests. This will result in etcd-quorum-guard protecting us in the future in case we get a bad etcd image somehow.

More info: https://github.com/openshift/machine-config-operator/pull/1177

@Alay Patel, QE tried the upgrade from 4.1.z to 4.2 but has not triggered this problem; are there any special steps to recreate it?

Ge, we have found this to have about a 10% reproduction rate, so you will need to upgrade a few clusters to hit the bug. We upgraded 4.1.20 -> 4.2.0; no special actions were required.
Tried 4.1.20 to 4.2.0 more than 10 times, covering different hypervisor platforms, and have not triggered this problem. Also ran some upgrades from 4.1.20 to the latest 4.2 version, and they passed; we will continue to watch this issue.

Please ignore comment 20; it's an update for 4.1 to 4.2.

Tried many times and have not triggered this problem with ocp: 4.3.0-0.nightly-2019-11-17-224250.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062