Bug 1691513 - msg="failed to initialize the cluster: timed out waiting for the condition"
Summary: msg="failed to initialize the cluster: timed out waiting for the condition"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Abhinav Dahiya
QA Contact: Johnny Liu
URL:
Whiteboard:
Duplicates: 1690366 (view as bug list)
Depends On:
Blocks:
 
Reported: 2019-03-21 19:05 UTC by bpeterse
Modified: 2019-06-04 10:46 UTC (History)
10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:46:21 UTC
Target Upstream Version:


Attachments (Terms of Use)
CVO waterfall showing the hung RoleBinding (54.22 KB, image/svg+xml)
2019-03-21 23:50 UTC, W. Trevor King
no flags Details
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T14:53Z (321.23 KB, image/svg+xml)
2019-03-21 23:53 UTC, W. Trevor King
no flags Details
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (337.16 KB, image/svg+xml)
2019-03-22 06:30 UTC, W. Trevor King
no flags Details
Occurrences of this error in CI from 2019-03-25T22:54 to 2019-03-27T22:34 UTC (604.34 KB, image/svg+xml)
2019-03-27 22:54 UTC, W. Trevor King
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:46:28 UTC
Github openshift cluster-version-operator pull 141 'None' 'closed' 'Bug 1691513: CVO should retry initialization until cancelled to avoid wedging because of failing dependencies' 2019-11-26 10:07:31 UTC

Description bpeterse 2019-03-21 19:05:59 UTC
Description of problem:

The installer may fail with a generic message such as "timed out waiting for the condition". This message lacks the specificity needed to identify the underlying error and how it could be resolved (and may mask a series of errors).

A sample of the build log output:
```````````````````````````````````````````````
Container setup exited with code 1, reason Error
---
Installing from release registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-21-122755
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-zclynzj2-5a633.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.12.4+1970440 up"
level=info msg="Waiting up to 30m0s for the bootstrap-complete event..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-zclynzj2-5a633.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"
---
Container test exited with code 1, reason Error
---
Another process exited
```````````````````````````````````````````````
As seen in:
https://prow.k8s.io/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5997

Version-Release number of the following components:
4.1

How reproducible:
Test flake

Comment 1 W. Trevor King 2019-03-21 23:50:14 UTC
Created attachment 1546730 [details]
CVO waterfall showing the hung RoleBinding

More-specific message in flight with [1].  In this case and a number of similar ones that I've dug into recently, the issue is a failed RoleBinding due to the CVO making the request before the API server is ready to receive it, and then failing to retry until the next sync cycle ~25 minutes later.  Sometimes this is soon enough to get in under the 'cluster create' timeouts, sometimes not.  Here's the CVO waterfall generated with [2]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5997/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-699d7cb8c4-2zdkt_cluster-version-operator.log.gz | gunzip | ./cvo-waterfall.py >cvo.svg

[1]: https://github.com/openshift/installer/pull/1447
[2]: https://github.com/wking/openshift-release/tree/debug-scripts/waterfall

Comment 2 W. Trevor King 2019-03-21 23:53:21 UTC
Created attachment 1546731 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T14:53Z

This has caused 122 of our 816 failures in *-e2e-aws* jobs across the whole CI system over the past ~48 hours.  Generated with [1]:

  $ deck-build-log-plot 'failed to initialize the cluster: timed out waiting for the condition'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 3 Clayton Coleman 2019-03-22 05:04:21 UTC
Wait, who has a 25 minute sync cycle?

Comment 4 W. Trevor King 2019-03-22 06:15:40 UTC
(In reply to Clayton Coleman from comment #3)
> Wait, who has a 25 minute sync cycle?

Looks like the CVO:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5997/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-699d7cb8c4-2zdkt_cluster-version-operator.log.gz | gunzip | grep '191 of 310'
I0321 12:54:10.272506       1 sync_worker.go:462] Running sync for rolebinding "openshift-ingress-operator/ingress-operator" (191 of 310)
E0321 12:54:10.420582       1 task.go:58] error running apply for rolebinding "openshift-ingress-operator/ingress-operator" (191 of 310): rolebindings.rbac.authorization.k8s.io "ingress-operator" is forbidden: the server could not find the requested resource (get rolebindingrestrictions.authorization.openshift.io)
E0321 12:54:11.519781       1 task.go:58] error running apply for rolebinding "openshift-ingress-operator/ingress-operator" (191 of 310): rolebindings.rbac.authorization.k8s.io "ingress-operator" is forbidden: the server could not find the requested resource (get rolebindingrestrictions.authorization.openshift.io)
E0321 12:54:13.619732       1 task.go:58] error running apply for rolebinding "openshift-ingress-operator/ingress-operator" (191 of 310): rolebindings.rbac.authorization.k8s.io "ingress-operator" is forbidden: the server could not find the requested resource (get rolebindingrestrictions.authorization.openshift.io)
E0321 12:54:17.719990       1 task.go:58] error running apply for rolebinding "openshift-ingress-operator/ingress-operator" (191 of 310): rolebindings.rbac.authorization.k8s.io "ingress-operator" is forbidden: the server could not find the requested resource (get rolebindingrestrictions.authorization.openshift.io)
I0321 13:18:49.294778       1 task_graph.go:566] Result of work: [Could not update rolebinding "openshift-cluster-storage-operator/cluster-storage-operator" (231 of 310): the server has forbidden updates to this resource Could not update rolebinding "openshift-ingress-operator/ingress-operator" (191 of 310): the server has forbidden updates to this resource Could not update oauthclient "console" (254 of 310): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-controller-manager-operator/openshift-controller-manager-operator" (309 of 310): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (303 of 310): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (300 of 310): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (306 of 310): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (297 of 310): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (294 of 310): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (291 of 310): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-image-registry/image-registry" (288 of 310): the server does not recognize this resource, check extension API servers Cluster operator authentication is still updating: missing version information for integrated-oauth-server Cluster operator monitoring is still updating]
I0321 13:19:20.418838       1 sync_worker.go:462] Running sync for rolebinding "openshift-ingress-operator/ingress-operator" (191 of 310)
I0321 13:19:21.552678       1 sync_worker.go:475] Done syncing for rolebinding "openshift-ingress-operator/ingress-operator" (191 of 310)

although I haven't tracked that down to backing code.  When that RoleBinding attempt happens after the OpenShift API server reports success, it goes through super-fast in the first round, instead of waiting 25 minutes for the second round.
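The wedging behavior described above (one early failure, then a ~25-minute wait for the next sync cycle) is what cvo#141 addresses by retrying until cancelled. A minimal sketch of that retry-until-cancelled pattern, in Python with invented names (the real fix is in the CVO's Go sync worker, not this code):

```python
import time

def retry_until_cancelled(apply_fn, deadline, base_delay=1.0, max_delay=60.0):
    """Keep retrying apply_fn with capped exponential backoff until it
    succeeds or the deadline passes, instead of giving up after a few
    attempts and waiting for the next full sync cycle."""
    delay = base_delay
    while time.monotonic() < deadline:
        try:
            return apply_fn()
        except Exception:
            # Transient failure (e.g. the API server is not yet serving
            # the resource): back off and try again rather than wedging.
            time.sleep(min(delay, max(0.0, deadline - time.monotonic())))
            delay = min(delay * 2, max_delay)
    raise TimeoutError("timed out waiting for the condition")

# Illustration: an apply that fails twice before the server is ready,
# mimicking the RoleBinding case above.
attempts = {"n": 0}
def flaky_apply():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("the server could not find the requested resource")
    return "done"

print(retry_until_cancelled(flaky_apply, time.monotonic() + 30, base_delay=0.01))
# → done
```

With this pattern the third attempt succeeds within milliseconds of the server becoming ready, rather than after the next 25-minute sync round.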

Comment 5 W. Trevor King 2019-03-22 06:30:11 UTC
Created attachment 1546784 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 137 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'failed to initialize the cluster: timed out waiting for the condition'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 6 Clayton Coleman 2019-03-25 18:04:12 UTC
Going to fix this.

Comment 8 W. Trevor King 2019-03-26 14:11:08 UTC
*** Bug 1690366 has been marked as a duplicate of this bug. ***

Comment 9 W. Trevor King 2019-03-27 20:35:51 UTC
cvo#141 landed [1].

[1]: https://github.com/openshift/cluster-version-operator/pull/141#event-2234398871

Comment 10 bpeterse 2019-03-27 20:57:28 UTC
So I have seen a few failures today that look more or less like this:

````````````````````````````
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-tv42hgbz-4e531.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.12.4+918f340 up"
level=info msg="Waiting up to 30m0s for the bootstrap-complete event..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-tv42hgbz-4e531.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Working towards 0.0.1-2019-03-27-174637: 98% complete: timed out waiting for the condition"
````````````````````````````

The last message has two new pieces of information:
- Working towards 0.0.1-2019-03-27-174637
- 98% complete

Definitely steps in the right direction! Is it clear enough to consider this bug resolved?

Comment 11 W. Trevor King 2019-03-27 22:53:22 UTC
That more-specific error message is from installer#1447 (see comment 1).  Have you seen any of those 90+% failures in clusters launched since cvo#141 landed?  I don't see any (attachment to follow) except for the "22% complete" error in [1].  And that's a low-enough percentage that I suspect it's a broken release image and not a CVO bug (although I haven't dug in).

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22344/pull-ci-openshift-origin-master-e2e-aws/6016

Comment 12 W. Trevor King 2019-03-27 22:54:31 UTC
Created attachment 1548751 [details]
Occurrences of this error in CI from 2019-03-25T22:54 to 2019-03-27T22:34 UTC

This occurred in 101 of our 1150 failures (8%) in *-e2e-aws* jobs across the whole CI system over the past 47 hours.  Generated with [1]:

  $ deck-build-log-plot 'failed to initialize the cluster'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 13 W. Trevor King 2019-03-27 22:58:25 UTC
Sigh, this is what I get for regenerating the graph before posting: a new "98% complete" job which started 2019-03-27T21:12Z [1] (cvo#141 landed 2019-03-27T20:23).  Looking...

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-kube-scheduler-operator/79/pull-ci-openshift-cluster-kube-scheduler-operator-master-e2e-aws/344

Comment 14 W. Trevor King 2019-03-27 23:03:09 UTC
Turns out that job was somehow still running the old CVO code:

cluster-version-operator$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-kube-scheduler-operator/79/pull-ci-openshift-cluster-kube-scheduler-operator-master-e2e-aws/344/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-d4784fb6-dhct9_cluster-version-operator.log.gz | gunzip | head -n1
I0327 21:57:14.331570       1 start.go:23] ClusterVersionOperator v3.11.0-238-g7c051fb7-dirty
cluster-version-operator$ git log --first-parent --format='%ad %h %d %s' --date=iso -2
2019-03-27 13:23:59 -0700 2b66cef  (HEAD -> master, wking/master, origin/release-4.0, origin/master, origin/HEAD) Merge pull request #141 from smarterclayton/context_cancel
2019-03-26 08:50:11 -0700 7c051fb  Merge pull request #143 from smarterclayton/more_time

Not using the current CVO tip might be a test-platform issue, but I'm back to being confident that cvo#141 fixed the operator issue.

Comment 16 Johnny Liu 2019-03-28 03:27:13 UTC
Verified this bug with 4.0.0-0.nightly-2019-03-27-213938, and PASS.

Comment 17 Ben Parees 2019-03-29 19:52:23 UTC
Seen again in https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/735 which ran on March 29th.

At a minimum it would be nice to see the installer dump the clusteroperators it is waiting for.

Comment 18 W. Trevor King 2019-03-30 05:16:07 UTC
> At a minimum it would be nice to see the installer dump the clusteroperators it is waiting for.

The installer is not waiting for ClusterOperators, it's waiting for ClusterVersion.  So if we want more information here, I think it's up to the cluster-version operator to surface a more-detailed progress message.  More on that in bug 1692353.  The current installer code will already pass that message along to our logs and error messages, so I don't see a need for further installer changes here.
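Since the installer just relays what ClusterVersion reports, the detailed state lives in the ClusterVersion conditions. A hypothetical snippet showing how the interesting condition messages can be pulled out of a ClusterVersion-style status (the sample values below are invented for illustration, not taken from a real cluster):

```python
import json

# Sample ClusterVersion status, shaped like the real object but with
# invented values.
clusterversion = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Available", "status": "False", "message": ""},
      {"type": "Progressing", "status": "True",
       "message": "Working towards 4.0.0-0.nightly: 98% complete"},
      {"type": "Failing", "status": "True",
       "message": "Cluster operator authentication is still updating"}
    ]
  }
}
""")

def condition_messages(cv, types=("Progressing", "Failing")):
    """Return the messages of the interesting conditions that are True."""
    return {
        c["type"]: c["message"]
        for c in cv["status"]["conditions"]
        if c["type"] in types and c["status"] == "True" and c.get("message")
    }

print(condition_messages(clusterversion))
```

The same filtering can be done against a live cluster by feeding it the JSON from the ClusterVersion object; the installer already surfaces these messages in its "Still waiting for the cluster to initialize" log lines.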

Comment 19 W. Trevor King 2019-03-30 11:47:03 UTC
One way to get more-detailed CVO progress reports: https://github.com/openshift/cluster-version-operator/pull/152

Comment 20 Ben Parees 2019-03-30 19:30:19 UTC
> The installer is not waiting for ClusterOperators, it's waiting for ClusterVersion.  So if we want more information here, I think it's up to the cluster-version operator to surface a more-detailed progress message.

Yes, sorry, I need to break myself of the bad habit of using installer+CVO somewhat interchangeably. Thanks for the clarification.

Comment 22 Scott Dodson 2019-04-01 14:17:07 UTC
As of April 1st 10AM EDT this happened in 2 out of the 10 failures I looked at.

Comment 23 Scott Dodson 2019-04-01 15:21:30 UTC
In the first, it's openshift-apiserver that's crash-looping:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22434/pull-ci-openshift-origin-master-e2e-aws/6130

In the second, several operators are unavailable: ingress isn't available, and as a result authentication cannot get a route.
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/264/pull-ci-openshift-machine-api-operator-master-e2e-aws/791
"name": "authentication",
                    {
                        "lastTransitionTime": "2019-04-01T08:31:04Z",
                        "message": "Failing: error checking payload readiness: unable to check route health: failed to GET route: EOF",
                        "reason": "Failing",
                        "status": "True",
                        "type": "Failing"
                    },
                    {
                        "lastTransitionTime": "2019-04-01T08:32:20Z",
                        "reason": "AsExpected",
                        "status": "False",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2019-04-01T08:30:58Z",
                        "reason": "Available",
                        "status": "False",
                        "type": "Available"
                    }
"name": "console",
                    {
                        "lastTransitionTime": "2019-04-01T08:31:02Z",
                        "reason": "AsExpected",
                        "status": "False",
                        "type": "Failing"
                    },
                    {
                        "lastTransitionTime": "2019-04-01T08:31:02Z",
                        "message": "Progressing: Moving to version 0.0.1",
                        "reason": "Progressing",
                        "status": "True",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2019-04-01T08:31:02Z",
                        "message": "Available: No pods available for console deployment.",
                        "reason": "Available",
                        "status": "False",
                        "type": "Available"
                    },
                    {
                        "lastTransitionTime": "2019-04-01T08:31:00Z",
                        "reason": "AsExpected",
                        "status": "True",
                        "type": "Upgradeable"
                    }

"name": "image-registry",
                    {
                        "lastTransitionTime": "2019-04-01T08:30:56Z",
                        "message": "The deployment does not have available replicas",
                        "reason": "NoReplicasAvailable",
                        "status": "False",
                        "type": "Available"
                    },
                    {
                        "lastTransitionTime": "2019-04-01T08:30:57Z",
                        "message": "The deployment has not completed",
                        "reason": "DeploymentNotCompleted",
                        "status": "True",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2019-04-01T08:30:57Z",
                        "status": "False",
                        "type": "Failing"
                    }
"name": "ingress",
                    {
                        "lastTransitionTime": "2019-04-01T08:30:58Z",
                        "status": "False",
                        "type": "Failing"
                    },
                    {
                        "lastTransitionTime": "2019-04-01T08:30:58Z",
                        "message": "0 ingress controllers available, want 1",
                        "reason": "Reconciling",
                        "status": "True",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2019-04-01T08:30:58Z",
                        "message": "0 ingress controllers available, want 1",
                        "reason": "IngressUnavailable",
                        "status": "False",
                        "type": "Available"
                    }
"name": "monitoring",
                    {
                        "lastTransitionTime": "2019-04-01T08:56:28Z",
                        "message": "Rolling out the stack.",
                        "status": "True",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2019-04-01T08:41:11Z",
                        "message": "Failed to rollout the stack. Error: running task Updating Cluster Monitoring Operator failed: reconciling Cluster Monitoring Operator ServiceMonitor failed: creating ServiceMonitor object failed: the server could not find the requested resource (post servicemonitors.monitoring.coreos.com)",
                        "status": "True",
                        "type": "Failing"
                    },
                    {
                        "lastTransitionTime": "2019-04-01T08:36:05Z",
                        "status": "False",
                        "type": "Available"
                    }

Comment 24 W. Trevor King 2019-04-11 00:46:47 UTC
https://github.com/openshift/cluster-version-operator/pull/158 should get us better messages.

Comment 25 Scott Dodson 2019-04-12 13:03:30 UTC
Better error messages should now be presented to the user; the user will need to investigate the failing operators.

Comment 27 Johnny Liu 2019-04-16 07:59:26 UTC
Verified this bug with 4.0.0-0.nightly-2019-04-10-182914, and PASS.

Ensured the AWS resources were tagged incorrectly (as in BZ#1697236), then triggered a deployment.

The following log is seen; the messages have become clearer.

time="2019-04-14T23:23:56-04:00" level=info msg="Waiting up to 30m0s for the cluster at https://api.jialiu-upi1.qe1.devcluster.openshift.com:6443 to initialize..."
time="2019-04-14T23:23:56-04:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-10-182914: 94% complete"
time="2019-04-14T23:26:43-04:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-10-182914: 98% complete"
time="2019-04-14T23:27:54-04:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-10-182914: 98% complete"
time="2019-04-14T23:28:28-04:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-10-182914: 99% complete"
time="2019-04-14T23:37:28-04:00" level=debug msg="Still waiting for the cluster to initialize: Cluster operator authentication is still updating"

Comment 29 errata-xmlrpc 2019-06-04 10:46:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

