Bug 1695721 - clusteroperator/machine-config changed Failing to True: error pool master is not ready, retrying. Status: (total: 3, updated: 1, unavailable: 1): Failed to resync 4.0.0-0.ci-2019-04-03-035327 because: error pool master is not ready, retrying. Status: (tot
Summary: clusteroperator/machine-config changed Failing to True: error pool master is ...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Antonio Murdaca
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-03 16:50 UTC by W. Trevor King
Modified: 2019-04-04 14:11 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-04 14:11:57 UTC
Target Upstream Version:
Embargoed:


Attachments
Occurrences of this error in CI from 2019-04-02T17:31 to 2019-04-03T14:41 UTC (51.71 KB, image/svg+xml)
2019-04-03 16:55 UTC, W. Trevor King


Links
GitHub openshift/machine-config-operator pull 601 (closed): pkg/controller: resync on unavailable (last updated 2020-05-05 17:33:53 UTC)

Description W. Trevor King 2019-04-03 16:50:54 UTC
Description of problem:

$ oc adm upgrade
error: Unable to apply 4.0.0-0.10: the cluster operator machine-config is failing:

 Reason: ClusterOperatorFailing
 Message: Cluster operator machine-config is reporting a failure: Failed to resync 4.0.0-0.10 because: error pool master is not ready, retrying. Status: (total: 3, updated: 0, unavailable: 1)
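
For anyone hitting this on a live cluster, the counts in that message come from the master pool's status, so the usual oc commands should show which machine is holding things up (a sketch from memory, nothing specific to this cluster; the exact status field names may differ):

  $ oc get clusteroperator machine-config
  $ oc get machineconfigpool master -o yaml   # status should carry machineCount / updatedMachineCount / unavailableMachineCount
  $ oc get nodes -l node-role.kubernetes.io/master=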

Version-Release number of selected component (if applicable):

Going from 4.0.0-0.alpha-2019-04-02-152430 to 4.0.0-0.10, and apparently in a number of other situations too.

How reproducible:

5% of recent CI failures (SVG to follow), and that's not even upgrade-specific.

Steps to Reproduce:

1. Do what CI is doing.
2. Sometimes hit this error ;).

Expected results:

A successful upgrade.

Comment 1 W. Trevor King 2019-04-03 16:52:42 UTC
From one of the CI runs [1]:

Apr 03 04:56:26.887 E clusteroperator/machine-config changed Failing to True: error pool master is not ready, retrying. Status: (total: 3, updated: 1, unavailable: 1): Failed to resync 4.0.0-0.ci-2019-04-03-035327 because: error pool master is not ready, retrying. Status: (total: 3, updated: 1, unavailable: 1)

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/857

Comment 2 W. Trevor King 2019-04-03 16:55:11 UTC
Created attachment 1551442 [details]
Occurrences of this error in CI from 2019-04-02T17:31 to 2019-04-03T14:41 UTC

This occurred in 20 of our 27 failures (74%) in *-e2e-aws* jobs across the whole CI system over the past 21 hours.  Generated with [1]:

  $ deck-build-log-plot 'error pool master is not ready'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 3 W. Trevor King 2019-04-03 16:55:54 UTC
Oops, that^ is actually "all *-e2e-aws-upgrade* jobs".

Comment 8 Kirsten Garrison 2019-04-03 20:35:19 UTC
In progress PR: https://github.com/openshift/machine-config-operator/pull/601

Comment 9 Antonio Murdaca 2019-04-04 07:22:49 UTC
(In reply to W. Trevor King from comment #1)
> From one of the CI runs [1]:
> 
> Apr 03 04:56:26.887 E clusteroperator/machine-config changed Failing to
> True: error pool master is not ready, retrying. Status: (total: 3, updated:
> 1, unavailable: 1): Failed to resync 4.0.0-0.ci-2019-04-03-035327 because:
> error pool master is not ready, retrying. Status: (total: 3, updated: 1,
> unavailable: 1)
> 
> [1]:
> https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-
> openshift-origin-installer-e2e-aws-upgrade-4.0/857

That is a transient error though, so setting that aside, is machine-config actually failing the upgrade? I can see in the test logs:

Apr 03 05:05:53.327 W clusteroperator/machine-config changed Failing to False
Apr 03 05:05:53.329 I kube-apiserver received an error while watching events: The resourceVersion for the provided watch is too old.
Apr 03 05:05:53.341 W clusteroperator/machine-config changed Available to True: Cluster has deployed 4.0.0-0.ci-2019-04-03-035327

So machine-config reconciles, and the logs look fine as well.
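
The same thing can be double-checked on a live cluster with plain oc, watching the Failing condition flip back to False (standard commands, nothing MCO-specific assumed):

  $ oc get clusteroperator machine-config -w
  $ oc get clusteroperator machine-config -o jsonpath='{.status.conditions[?(@.type=="Failing")].status}'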

I guess my question is, besides the PR that we put in place, is the machine-config clusteroperator actually making the upgrade fail?

Comment 10 Antonio Murdaca 2019-04-04 12:03:12 UTC
Further clarifications about this bug:

- the first comment (c0) reports an error scenario which is NOT what the CI logs in comment https://bugzilla.redhat.com/show_bug.cgi?id=1695721#c1 show
- the error from the first comment is a different issue from the CI issue and is now tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1695789
- the CI error reported by Trevor is just an Error-level event for a transient condition that we (the MCO) report but that eventually reconciles, so it has NOT caused that CI job run to fail
- action for the MCO: I'm not sure; we can't really lower the level at which we report the event, otherwise we won't be able to distinguish it from a real error, and since we act like a controller, we fail and then retry indefinitely (as the kubelet does)

Comment 11 Antonio Murdaca 2019-04-04 12:09:08 UTC
Also, from the CI job linked in #c1:

```59 error level events were detected during this test run:```

the reason they're all reported (including the MCO one, which was believed to be the cause of the upgrade failure) is that something failed and caused the upgrade to fail, so the test shows us _all_ the error-level events.

If the upgrade had gone through, we would still have had error-level events, but they wouldn't have been reported.
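
To pull out just the MCO transitions from a run without wading through all the error-level events, grepping the job's build log should work; assuming the usual layout where build-log.txt sits at the top of the job's artifacts (JOB_URL below is a placeholder for the storage URL of the run in #c1):

  $ curl -s "${JOB_URL}/build-log.txt" | grep 'clusteroperator/machine-config changed'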

I don't think this BZ is valid for the MCO at this point (unless I'm missing where the MCO is causing the failure). All operators can transiently fail during an upgrade, and if that's the case here, I'm leaning toward closing this BZ.

Comment 12 Antonio Murdaca 2019-04-04 14:11:57 UTC
Debugged this with Trevor. The MCO upgrades just fine and, like other operators, we report Failing=True for transient errors. We'll follow up if that shouldn't be the case, but as far as this bug is concerned...there's no MCO bug.

