Bug 1877116
Summary: | e2e aws calico tests fail with `rpc error: code = ResourceExhausted` | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ankita Thomas <ankithom> | |
Component: | Installer | Assignee: | Abhinav Dahiya <adahiya> | |
Installer sub component: | openshift-installer | QA Contact: | Yunfei Jiang <yunjiang> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | medium | |||
Priority: | high | CC: | adahiya, ally, aos-bugs, eparis, jokerman, lwan, rgregory, yanyang | |
Version: | 4.5 | |||
Target Milestone: | --- | |||
Target Release: | 4.7.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause:
The installer's internal Terraform backend does not support very large inputs from Terraform core to a Terraform provider such as aws. When bootstrap.ign was passed to the aws provider as a string, it exceeded this limit and caused a failure.
Consequence:
The installer would fail when trying to create the bootstrap ignition S3 bucket.
Fix:
The Terraform backend was modified to pass bootstrap.ign as a path on disk, so that the aws provider can read the large file itself, circumventing the input size limit.
Result:
No error occurs when performing a Calico installation that creates a bootstrap ignition file larger than the input limit.
|
Story Points: | --- | |
Clone Of: | ||||
: | 1877117 (view as bug list) | Environment: |
test: operator
|
|
Last Closed: | 2021-02-24 15:17:43 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1877117, 1877118 |
Description
Ankita Thomas
2020-09-08 20:41:30 UTC
As far as I know hive has no relation to openshift CI and e2e. Given this says calico I will reassign to networking.

*** Bug 1877117 has been marked as a duplicate of this bug. ***

*** Bug 1877118 has been marked as a duplicate of this bug. ***

This is failing early in the cluster bootstrap... I can't tell who is making the grpc call... it may be Calico, but it looks like it's earlier than when Calico can even run. However their installer may have changed something that is causing our installer to send an oversized message. Can someone on the installer team see if they can work out what is making that call (and why it might be oversized).

level=info msg="Consuming Master Machines from target directory"
level=info msg="Consuming OpenShift Install (Manifests) from target directory"
level=info msg="Consuming Worker Machines from target directory"
level=info msg="Credentials loaded from the \"default\" profile in file \"/etc/openshift-installer/.awscred\""
level=info msg="Creating infrastructure resources..."
level=error
level=error msg="Error: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4968901 vs. 4194304)"
level=error
level=error
level=error msg="Failed to read tfstate: open /tmp/openshift-install-770414117/terraform.tfstate: no such file or directory"
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

There is an upstream bug on terraform where the grpc plugin doesn't support more than 4 MB of data transfer... https://github.com/hashicorp/terraform/issues/21709 might be related to that.
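
To make the numbers in the error concrete, here is a minimal sketch in Python (not taken from the installer, whose code is not shown in this report). It assumes the limit is the 4194304-byte (4 MiB) value printed in the log; the function names are hypothetical and exist only for illustration. Note that the size in the error is the size of the whole gRPC message, so a file slightly under the limit could still fail once it is wrapped in a request.

```python
import os

# Limit reported in the installer log: "received message larger than max (... vs. 4194304)".
# 4194304 bytes is 4 MiB.
GRPC_MAX_RECV_BYTES = 4 * 1024 * 1024


def exceeds_grpc_limit(size_bytes: int, limit: int = GRPC_MAX_RECV_BYTES) -> bool:
    """Return True if a payload of size_bytes is larger than the limit.

    Hypothetical helper: it only compares sizes and ignores any
    per-message overhead added when the payload is wrapped in a request.
    """
    return size_bytes > limit


def file_exceeds_grpc_limit(path: str, limit: int = GRPC_MAX_RECV_BYTES) -> bool:
    """Return True if the file at path is larger than the limit.

    Hypothetical helper for checking a generated file such as bootstrap.ign
    before it is handed to a provider as an inline string.
    """
    return exceeds_grpc_limit(os.path.getsize(path), limit)


if __name__ == "__main__":
    # Message sizes taken from the two failing runs quoted in this bug.
    for size in (4968901, 5055165):
        over = size - GRPC_MAX_RECV_BYTES
        print(f"{size} bytes exceeds the limit by {over} bytes: {exceeds_grpc_limit(size)}")
```

Calling `file_exceeds_grpc_limit("bootstrap.ign")` against a generated ignition file would give a rough early indication of whether passing its contents inline is likely to hit this error.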
This bug is happening only on e2e aws calico, see https://search.ci.openshift.org/?search=Error%3A+rpc+error%3A+code+%3D+ResourceExhausted+desc+%3D+grpc%3A+received+message+larger+than+max&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

What is special about this e2e job that is causing this failure?

My guess is that bootstrap.ign is becoming so large that https://github.com/openshift/installer/blob/0d5c871ce7d03f3d03ab4371dc39916a5415cf5c/data/data/aws/bootstrap/main.tf#L27 is failing to create an S3 bucket. That looks like the only large asset.

Instead of using content, which passes the bytes over grpc, we can maybe use source https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_object#source, which reads the ign file from disk. Maybe that helps with reducing the transfer size.

Working with customer ServiceNow on a vSphere IPI install with Calico EE, and we are running into the same grpc issue on an internal vSphere lab cluster, a Tigera lab cluster, and SNOW vSphere deployments.

verified. FAILED.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-calico-4.5/1320824972217683968

level=error
level=error msg="Error: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5055165 vs. 4194304)"
level=error

(In reply to Yunfei Jiang from comment #12)
> verified. FAILED.
>
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-calico-4.5/1320824972217683968
>
> level=error
> level=error msg="Error: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5055165 vs. 4194304)"
> level=error

The fix was merged for master (4.7) and not 4.5. So please test the latest code for changes.

verified. PASS.

@Abhinav, sorry for the mistake.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633