https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1199

May 04 11:28:45.935 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator machine-config is still updating
May 04 11:34:10.388 E clusteroperator/machine-config changed Degraded to True: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1): Unable to apply 4.1.0-0.ci-2019-05-04-094326: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)
May 04 11:36:00.932 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator machine-config is still updating
May 04 11:43:30.934 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator machine-config is still updating

Then the upgrade times out. We need to determine why this happened (Derek reported a similar failure). The upgrade tests need to check for this condition even on passing builds.
Also:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1208
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1212

Looks like 3/5 runs have failed so far, likely introduced between 5am EDT 05/04 and 9am EDT 05/04.
Correction: those times are wrong. The window is 14 UTC to 21 UTC.
Reasonable ranges for when the regression was introduced:
https://openshift-release.svc.ci.openshift.org/releasestream/4.1.0-0.nightly/release/4.1.0-0.nightly-2019-05-04-210601?from=4.1.0-0.nightly-2019-05-04-070249
or
https://openshift-release.svc.ci.openshift.org/releasestream/4.1.0-0.ci/release/4.1.0-0.ci-2019-05-04-221522?from=4.1.0-0.ci-2019-05-04-132829

I see a few MCD changes and a machine-os-content change.
As reported in Derek's BZ, I've spotted that we report writing a file, but after reboot the file isn't there anymore, and I'm not sure whether it's a race in runtimes/Kubernetes/chroot: https://bugzilla.redhat.com/show_bug.cgi?id=1706606#c6

The failure, tl;dr, is:
- MCD writes files and upgrades the OS, then writes a "pending config" file
- the node reboots
- MCD starts again and uses the pending config to finalize the upgrade

What I'm witnessing is that the file isn't on disk anymore after the reboot (and I'm not sure why that is...).
This PR removes all the mounts we don't need (everything except the rootfs, which we chroot into anyway): https://github.com/openshift/machine-config-operator/pull/704 My theory is that since the pending config lives under /etc/machine-config-daemon, and that path is a mount in the current code, we may be hitting a race (this failure isn't consistent).
This error and the error in Derek's BZ predate that range; see https://bugzilla.redhat.com/show_bug.cgi?id=1706632#c1 I can see jobs failing with it on 05/03/2019: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/697/pull-ci-openshift-machine-config-operator-master-e2e-aws-upgrade/148/
An OS update is indeed at play here. One unfortunate thing is that we've lost the original MCD logs, since they apparently scrolled off. (Maybe we should back off retries in this case?) Looking into this more.
In this BZ we're definitely booted into the osImageURL for the *desired* config, not the current one. As for why... I think Antonio is on the right track that this has something to do with the "pending config" file going wrong. But we need more logging.
I guess, though, philosophically: since the MCD is now a controller, if we're booted into what should be the desired state, we could just carry on. (My worry is reboot loops; see also https://github.com/openshift/machine-config-operator/pull/245 )
Pretty sure this is a dup of #1706606 *** This bug has been marked as a duplicate of bug 1706606 ***