Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1742744

Summary: [ci][upgrade] Cluster upgrade fails because of machine config wedging
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Machine Config Operator
Assignee: Kirsten Garrison <kgarriso>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.2.0
CC: alegrand, anpicker, erooth, ffranz, geliu, hongli, juzhao, kgarriso, lcosic, mloibl, pkrupa, pmuller, sbatsche, surbania
Target Milestone: ---
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1740372
Environment:
Last Closed: 2019-10-16 06:36:19 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2019-08-16 20:11:37 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/98

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/98/artifacts/e2e-aws-upgrade/e2e.log

The job is trying to upgrade from 4.1 to 4.2, after which it will roll back. The job never reaches success, stalling at 88% while waiting for machine config.

Aug 15 06:33:53.499 W clusterversion/version changed Progressing to True: DownloadingUpdate: Working towards registry.svc.ci.openshift.org/ocp/release:4.2.0-0.ci-2019-08-14-232724: downloading update
Aug 15 06:33:55.215 - 15s   W clusterversion/version cluster is updating to
Aug 15 06:34:25.215 - 7140s W clusterversion/version cluster is updating to 4.2.0-0.ci-2019-08-14-232724
Aug 15 07:13:08.600 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator machine-config is still updating

Urgent because 4.1 to 4.2 upgrades should never break.

Comment 1 Kirsten Garrison 2019-08-19 17:55:02 UTC
Seeing the following in https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/98/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-txcfh_machine-config-daemon.log

```
I0815 07:01:21.402817  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0815 07:01:26.409812  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
..
..
I0815 08:34:14.949610  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0815 08:34:19.956818  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```

This repeats for approximately 1.5 hours?
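For context, that message comes from the PodDisruptionBudget protecting etcd-quorum-guard. Below is a minimal sketch (not the actual machine-config-daemon or disruption-controller code) of the arithmetic involved, assuming the quorum guard runs 3 replicas and its budget tolerates only one unavailable pod: with one etcd member missing, disruptionsAllowed stays at 0 and every eviction attempt is rejected and retried.

```
// Minimal sketch of the arithmetic behind
// "Cannot evict pod as it would violate the pod's disruption budget".
// Assumes an etcd-quorum-guard deployment of 3 replicas whose budget
// effectively requires 2 pods to stay available.
package main

import "fmt"

// disruptionsAllowed mirrors how a budget with a minimum-available
// requirement computes how many pods may be evicted right now.
func disruptionsAllowed(currentHealthy, minAvailable int) int {
	allowed := currentHealthy - minAvailable
	if allowed < 0 {
		return 0
	}
	return allowed
}

func main() {
	// Normal case: all 3 quorum-guard pods healthy, one eviction allowed.
	fmt.Println(disruptionsAllowed(3, 2)) // 1

	// This bug: one etcd member never came up, so only 2 guard pods are
	// healthy, nothing may be disrupted, and the MCD retries forever.
	fmt.Println(disruptionsAllowed(2, 2)) // 0 -> eviction denied, retry after 5s
}
```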

Comment 2 Sam Batschelet 2019-08-19 18:10:53 UTC
*** Bug 1733305 has been marked as a duplicate of this bug. ***

Comment 3 Stephen Greene 2019-08-20 20:10:49 UTC
*** Bug 1737678 has been marked as a duplicate of this bug. ***

Comment 4 Sam Batschelet 2019-08-21 02:08:58 UTC
>I0815 08:34:14.949610  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
>I0815 08:34:19.956818  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

This is actually etcd-quorum-guard working correctly. If you look at the pods[1] for this run, you will notice that only 2 etcd pods exist. That is because the manifest deployed for that member does not contain an image for setup-etcd-environment. Recent changes in the MCO made this image part of the MCO image rather than a separate one, which might be to blame. Because one etcd pod was already missing, we could not lose another, per the message.

Now, we have a few other bugs attached to this, and I want to make sure they are all the same issue before we close. But I think it would be interesting to know exactly why the image was not populated in the spec.


```
Aug 15 08:36:54 ip-10-0-136-216 hyperkube[1515]: E0815 08:36:54.063849    1515 file.go:187] Can't process manifest file "/etc/kubernetes/manifests/etcd-member.yaml": invalid pod: [spec.initContainers[0].image: Required value]
```

[1] https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/98/artifacts/e2e-aws-upgrade/pods/
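For reference, here is a minimal sketch of the validation that produces the error above. The types and the validate function are hypothetical stand-ins, not the real kubelet code, but they mirror why a static pod manifest with an empty init container image is rejected, so the third etcd member never starts and the quorum guard keeps blocking eviction.

```
// Minimal sketch (hypothetical types, not real kubelet code) of the check
// behind "spec.initContainers[0].image: Required value".
package main

import "fmt"

type container struct {
	Name  string
	Image string
}

type podSpec struct {
	InitContainers []container
}

// validateInitContainerImages rejects any spec whose init container has an
// empty image field, the same condition reported in the kubelet log above.
func validateInitContainerImages(spec podSpec) error {
	for i, c := range spec.InitContainers {
		if c.Image == "" {
			return fmt.Errorf("spec.initContainers[%d].image: Required value", i)
		}
	}
	return nil
}

func main() {
	// In this run, setup-etcd-environment was rendered without an image,
	// so the manifest is rejected and the etcd member pod never appears.
	bad := podSpec{InitContainers: []container{{Name: "setup-etcd-environment", Image: ""}}}
	if err := validateInitContainerImages(bad); err != nil {
		fmt.Println("invalid pod:", err)
	}
}
```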

Comment 5 Sam Batschelet 2019-08-21 11:39:53 UTC
*** Bug 1737799 has been marked as a duplicate of this bug. ***

Comment 6 Kirsten Garrison 2019-08-21 16:21:24 UTC
Sam and I would like to try to get https://github.com/openshift/machine-config-operator/pull/1057 in, which updates the etcd DR images, to see if this helps.

Comment 9 errata-xmlrpc 2019-10-16 06:36:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922