Bug 1742744 - [ci][upgrade] Cluster upgrade fails because of machine config wedging
Summary: [ci][upgrade] Cluster upgrade fails because of machine config wedging
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Kirsten Garrison
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1737678 1737799
Depends On:
Blocks:
 
Reported: 2019-08-16 20:11 UTC by Clayton Coleman
Modified: 2019-10-16 06:36 UTC (History)
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1740372
Environment:
Last Closed: 2019-10-16 06:36:19 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/machine-config-operator pull 1057 (closed): BUG 1742744: DR: set command for setup-etcd-environment (last updated 2021-02-19 11:50:48 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:36:30 UTC)

Description Clayton Coleman 2019-08-16 20:11:37 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/98

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/98/artifacts/e2e-aws-upgrade/e2e.log

The job is trying to upgrade from 4.1 to 4.2, after which it will roll back. The job never reaches success, stalling at 88% while waiting for the machine-config operator.

Aug 15 06:33:53.499 W clusterversion/version changed Progressing to True: DownloadingUpdate: Working towards registry.svc.ci.openshift.org/ocp/release:4.2.0-0.ci-2019-08-14-232724: downloading update
Aug 15 06:33:55.215 - 15s   W clusterversion/version cluster is updating to
Aug 15 06:34:25.215 - 7140s W clusterversion/version cluster is updating to 4.2.0-0.ci-2019-08-14-232724
Aug 15 07:13:08.600 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator machine-config is still updating

Urgent because 4.1 to 4.2 upgrades should never break.
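
For anyone triaging a similar hang at the machine-config step, a minimal sketch of how to inspect where the rollout is stuck (standard oc commands; output and pool state will of course differ per run):

```
# Overall upgrade progress and the Progressing/Failing conditions
oc get clusterversion version -o yaml

# The cluster operator the update is waiting on
oc get clusteroperator machine-config

# Which pools still have nodes that have not picked up the new rendered config
oc get machineconfigpools
oc get nodes
```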

Comment 1 Kirsten Garrison 2019-08-19 17:55:02 UTC
Seeing the following in https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/98/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-txcfh_machine-config-daemon.log:

```
I0815 07:01:21.402817  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0815 07:01:26.409812  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
..
..
I0815 08:34:14.949610  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0815 08:34:19.956818  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```

for approx 1.5 hours?
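
For context, the message comes from the PodDisruptionBudget protecting the quorum guard: once the budget's allowed disruptions reach 0, every eviction request from the node drain is rejected and retried, which produces exactly this loop. A rough sketch of what such a PDB looks like (the API version, namespace, and values here are assumptions for illustration, not copied from this cluster):

```
# Illustrative only; names and values are assumptions.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: etcd-quorum-guard
  namespace: openshift-machine-config-operator
spec:
  maxUnavailable: 1            # with 3 quorum-guard replicas, at most one may be down
  selector:
    matchLabels:
      name: etcd-quorum-guard
```

With one guard pod already missing (see comment 4), allowed disruptions is 0, so the drain can never evict the remaining pods and the node update wedges.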

Comment 2 Sam Batschelet 2019-08-19 18:10:53 UTC
*** Bug 1733305 has been marked as a duplicate of this bug. ***

Comment 3 Stephen Greene 2019-08-20 20:10:49 UTC
*** Bug 1737678 has been marked as a duplicate of this bug. ***

Comment 4 Sam Batschelet 2019-08-21 02:08:58 UTC
>I0815 08:34:14.949610  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
>I0815 08:34:19.956818  130669 update.go:89] error when evicting pod "etcd-quorum-guard-85c9bf4f89-tczjg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

This is actually etcd-quorum-guard working correctly. If we look at the pods[1] for this run, you will notice that only 2 exist for etcd. That is because the manifest deployed for that member does not contain an image for setup-etcd-environment. Recent changes made this image part of the MCO image rather than a separate one, and that might be to blame. Because one etcd pod was missing, we could not lose another, per the message.

Now that we have a few other bugs attached to this, I want to make sure they are all the same before we close. But I think it would be interesting to know exactly why the image was not populated in the spec.


```
Aug 15 08:36:54 ip-10-0-136-216 hyperkube[1515]: E0815 08:36:54.063849    1515 file.go:187] Can't process manifest file "/etc/kubernetes/manifests/etcd-member.yaml": invalid pod: [spec.initContainers[0].image: Required value]
```

[1] https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/98/artifacts/e2e-aws-upgrade/pods/
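
In other words, the kubelet refuses to run the static pod at all because the rendered manifest never got an image for the setup-etcd-environment init container. A hedged sketch of the failing shape (the manifest layout below is an assumption for illustration; the real file is /etc/kubernetes/manifests/etcd-member.yaml on the node):

```
# Illustrative only; everything except the failing field path is assumed.
apiVersion: v1
kind: Pod
metadata:
  name: etcd-member
spec:
  initContainers:
  - name: setup-etcd-environment
    image: ""                        # never rendered -> kubelet rejects the pod:
                                     # spec.initContainers[0].image: Required value
  containers:
  - name: etcd-member
    image: registry.example/etcd:tag # placeholder
```

With that member's pod rejected, only two etcd pods are running, which is why the quorum-guard PDB blocks every eviction during the drain.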

Comment 5 Sam Batschelet 2019-08-21 11:39:53 UTC
*** Bug 1737799 has been marked as a duplicate of this bug. ***

Comment 6 Kirsten Garrison 2019-08-21 16:21:24 UTC
Sam and I would like to try to get https://github.com/openshift/machine-config-operator/pull/1057 in, which updates the etcd DR images, to see if this helps.

Comment 9 errata-xmlrpc 2019-10-16 06:36:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

