Description of problem:

$ oc adm upgrade
error: Unable to apply 4.0.0-0.10: the cluster operator machine-config is failing: Reason: ClusterOperatorFailing
Message: Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.10 because: error pool master is not ready, retrying. Status: (total: 3, updated: 0, unavailable: 1)

Version-Release number of selected component (if applicable):

Going from 4.0.0-0.alpha-2019-04-02-152430 to 4.0.0-0.10, and apparently in a number of other situations too.

How reproducible:

5% of recent CI failures (SVG to follow), and that's not even upgrade-specific.

Steps to Reproduce:
1. Do what CI is doing.
2. Sometimes hit this error ;).

Expected results:

A successful upgrade.
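For anyone debugging this: the counts in that error message should correspond to the master MachineConfigPool's status fields. A quick way to pull them out of a live cluster (the jsonpath here is just an illustrative sketch):

$ oc get machineconfigpool master -o jsonpath='total: {.status.machineCount}, updated: {.status.updatedMachineCount}, unavailable: {.status.unavailableMachineCount}{"\n"}'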
From one of the CI runs [1]:

Apr 03 04:56:26.887 E clusteroperator/machine-config changed Failing to True: error pool master is not ready, retrying. Status: (total: 3, updated: 1, unavailable: 1): Failed to resync 4.0.0-0.ci-2019-04-03-035327 because: error pool master is not ready, retrying. Status: (total: 3, updated: 1, unavailable: 1)

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/857
Created attachment 1551442 [details]
Occurrences of this error in CI from 2019-04-02T17:31 to 2019-04-03T14:41 UTC

This occurred in 20 of our 27 failures (74%) in *-e2e-aws* jobs across the whole CI system over the past 21 hours. Generated with [1]:

$ deck-build-log-plot 'error pool master is not ready'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
Oops, that^ is actually "all *-e2e-aws-upgrade* jobs".
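For a rough single-job version of the same check, without the plotting script, a plain grep over a locally downloaded copy of a job's build-log.txt works (the filename is illustrative):

$ grep -c 'error pool master is not ready' build-log.txt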
In-progress PR: https://github.com/openshift/machine-config-operator/pull/601
(In reply to W. Trevor King from comment #1)
> From one of the CI runs [1]:
>
> Apr 03 04:56:26.887 E clusteroperator/machine-config changed Failing to True: error pool master is not ready, retrying. Status: (total: 3, updated: 1, unavailable: 1): Failed to resync 4.0.0-0.ci-2019-04-03-035327 because: error pool master is not ready, retrying. Status: (total: 3, updated: 1, unavailable: 1)
>
> [1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/857

That is a transient error, though, so beyond that, is machine-config actually failing the upgrade? I can see in the test logs:

Apr 03 05:05:53.327 W clusteroperator/machine-config changed Failing to False
Apr 03 05:05:53.329 I kube-apiserver received an error while watching events: The resourceVersion for the provided watch is too old.
Apr 03 05:05:53.341 W clusteroperator/machine-config changed Available to True: Cluster has deployed 4.0.0-0.ci-2019-04-03-035327

So machine-config reconciles, and the logs look fine as well. My question is: besides the PR we put in place, is the machine-config clusteroperator actually making the upgrade fail?
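For reference, those condition flips can also be confirmed straight from the cluster; something like the following (the jsonpath is illustrative) dumps the machine-config clusteroperator's conditions, and adding -w watches them change during the upgrade:

$ oc get clusteroperator machine-config -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'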
Further clarifications about this bug:

- the first comment (c0) is reporting an error scenario which is NOT what the CI logs in comment https://bugzilla.redhat.com/show_bug.cgi?id=1695721#c1 show
- the error from the first comment is a different issue from the CI issue, and it's now tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1695789
- the CI error reported by Trevor is just an event at level Error; it's a transient error that we (MCO) report but that eventually reconciles, so it has NOT caused that CI job run to fail
- action for MCO: I'm not sure. We can't really lower the level at which we report the event, otherwise we won't be able to distinguish it from a real error, and since we're just acting like a controller, we fail->we retry indefinitely (as the kubelet does)
Also, from the CI job link in #c1:

```59 error level events were detected during this test run:```

The reason they're all reported (including the MCO one, which was believed to be the cause of the upgrade failure) is that something failed and made the upgrade fail, so the test shows us _all_ the error level events. If the upgrade had gone through, we would still have had error level events, but they wouldn't have been reported.

I don't think this BZ is valid for the MCO at this point (unless I'm missing where the MCO is causing the failure). All operators can transiently fail during upgrade, and if that's the case, I'm leaning toward closing this BZ.
Debugged this with Trevor. The MCO upgrades just fine and, like other operators, we report Failing=True for transient errors. We'll follow up if that shouldn't be the case, but as far as this bug is concerned... there's no MCO bug.