Bug 1877116 - e2e aws calico tests fail with `rpc error: code = ResourceExhausted`
Summary: e2e aws calico tests fail with `rpc error: code = ResourceExhausted`
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Abhinav Dahiya
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Duplicates: 1877117 1877118
Depends On:
Blocks: 1877117 1877118
Reported: 2020-09-08 20:41 UTC by Ankita Thomas
Modified: 2023-12-15 19:13 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The installer's internal Terraform backend does not support very large inputs passed from Terraform core to a Terraform provider such as aws. When bootstrap.ign was passed to the aws provider as a string, it exceeded this limit and caused a failure.
Consequence: The installer failed when trying to create the bootstrap ignition S3 bucket.
Fix: The Terraform backend was modified to pass bootstrap.ign as a path on disk, so that the aws provider reads the large file itself and the input size limit is circumvented.
Result: No error occurs when performing a Calico installation that creates a bootstrap ignition file larger than the input limit.
Clone Of:
Clones: 1877117
Environment:
test: operator
Last Closed: 2021-02-24 15:17:43 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 4281 | 0 | None | closed | Bug 1877116: aws: use file for bootstrap ign when uploading to s3 | 2021-02-15 19:28:46 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:18:17 UTC

Description Ankita Thomas 2020-09-08 20:41:30 UTC
test: operator.Run template e2e-aws - e2e-aws-calico container setup fails frequently, see job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-calico-4.5/1302701517358239744

The test fails frequently with
`
Error: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4966185 vs. 4194304)
`
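
For scale: the limit in the message, 4194304 bytes, is exactly 4 MiB, so the failing message was roughly 0.74 MiB over. A minimal Python sketch that pulls the two numbers out of the error text and computes the overage. The helper name `parse_grpc_overage` is hypothetical, and the regex assumes the message format quoted above.

```python
import re

# Assumed message format, taken from the error text quoted in this report.
_PATTERN = re.compile(r"received message larger than max \((\d+) vs\. (\d+)\)")


def parse_grpc_overage(message: str):
    """Return (received, limit, overage) in bytes, or None if the text does not match."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    received, limit = int(match.group(1)), int(match.group(2))
    return received, limit, received - limit


error = (
    "Error: rpc error: code = ResourceExhausted desc = grpc: "
    "received message larger than max (4966185 vs. 4194304)"
)
received, limit, overage = parse_grpc_overage(error)
print(f"limit is {limit / 2**20:.0f} MiB, message is {overage} bytes over")
```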

Comment 1 Devan Goodwin 2020-09-09 12:29:26 UTC
As far as I know hive has no relation to openshift CI and e2e. Given this says calico I will reassign to networking.

Comment 2 Ben Bennett 2020-09-09 13:55:37 UTC
*** Bug 1877117 has been marked as a duplicate of this bug. ***

Comment 3 Ben Bennett 2020-09-09 13:55:43 UTC
*** Bug 1877118 has been marked as a duplicate of this bug. ***

Comment 4 Ben Bennett 2020-09-09 14:10:24 UTC
This is failing early in the cluster bootstrap. I can't tell who is making the gRPC call. It may be Calico, but it looks like it happens earlier than Calico can even run. However, their installer may have changed something that is causing our installer to send an oversized message. Can someone on the installer team see if they can work out what is making that call (and why it might be oversized)?

level=info msg="Consuming Master Machines from target directory"
level=info msg="Consuming OpenShift Install (Manifests) from target directory"
level=info msg="Consuming Worker Machines from target directory"
level=info msg="Credentials loaded from the \"default\" profile in file \"/etc/openshift-installer/.awscred\""
level=info msg="Creating infrastructure resources..."
level=error
level=error msg="Error: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4968901 vs. 4194304)"
level=error
level=error
level=error msg="Failed to read tfstate: open /tmp/openshift-install-770414117/terraform.tfstate: no such file or directory"
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

Comment 5 Abhinav Dahiya 2020-09-10 18:09:07 UTC
There is an upstream Terraform bug where the gRPC plugin doesn't support more than 4 MB of data transfer: https://github.com/hashicorp/terraform/issues/21709

might be related to that.

Comment 6 Abhinav Dahiya 2020-09-10 18:24:52 UTC
This bug is happening only on e2e aws calico; see https://search.ci.openshift.org/?search=Error%3A+rpc+error%3A+code+%3D+ResourceExhausted+desc+%3D+grpc%3A+received+message+larger+than+max&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

What is special about this e2e job that is causing this failure?

Comment 7 Abhinav Dahiya 2020-09-10 18:30:47 UTC
My guess is that bootstrap.ign is becoming so large that https://github.com/openshift/installer/blob/0d5c871ce7d03f3d03ab4371dc39916a5415cf5c/data/data/aws/bootstrap/main.tf#L27 is failing to create an S3 bucket. That looks like the only large asset.

Instead of using content, which passes the bytes over gRPC, we could maybe use source https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_object#source, which reads the ign file from disk. Maybe that helps reduce the transfer size.
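
The size argument behind this suggestion can be sketched in Python, outside Terraform: when the file body is embedded, the whole of it travels in the message, whereas when only a path is passed, the message stays tiny regardless of file size. This illustrates the reasoning only and is not the installer's or Terraform's actual code; the JSON encoding, the `build_request` helper, and the 5 MB stand-in file are all assumptions.

```python
import json
import os
import tempfile

GRPC_MAX_BYTES = 4 * 1024 * 1024  # 4194304, the limit reported in the error


def build_request(path: str, inline: bool) -> bytes:
    """Hypothetical stand-in for a plugin request: embed the file body or only its path."""
    if inline:
        with open(path, "r", encoding="utf-8") as handle:
            payload = {"content": handle.read()}
    else:
        payload = {"source": path}
    return json.dumps(payload).encode("utf-8")


# Create a stand-in bootstrap.ign of about 5 MB (size chosen for illustration).
with tempfile.TemporaryDirectory() as workdir:
    ign_path = os.path.join(workdir, "bootstrap.ign")
    with open(ign_path, "w", encoding="utf-8") as handle:
        handle.write("x" * 5_000_000)

    inline_size = len(build_request(ign_path, inline=True))
    by_path_size = len(build_request(ign_path, inline=False))

print("inline request exceeds limit:", inline_size > GRPC_MAX_BYTES)
print("path-only request exceeds limit:", by_path_size > GRPC_MAX_BYTES)
```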

Comment 10 Alex 2020-10-19 15:11:05 UTC
We are working with customer ServiceNow on a vSphere IPI install with Calico EE and are running into the same gRPC issue on the internal vSphere lab cluster, the Tigera lab cluster, and SNOW vSphere deployments.

Comment 12 Yunfei Jiang 2020-10-27 02:39:42 UTC
verified. FAILED.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-calico-4.5/1320824972217683968

level=error
level=error msg="Error: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5055165 vs. 4194304)"
level=error

Comment 13 Abhinav Dahiya 2020-11-09 18:07:02 UTC
(In reply to Yunfei Jiang from comment #12)
> verified. FAILED.
> 
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-
> origin-installer-e2e-aws-calico-4.5/1320824972217683968
> 
> level=error
> level=error msg="Error: rpc error: code = ResourceExhausted desc = grpc:
> received message larger than max (5055165 vs. 4194304)"
> level=error

The fix was merged for master (4.7), not 4.5, so please test the latest code for changes.

Comment 14 Yunfei Jiang 2020-11-11 07:57:26 UTC
verified. PASS.

@Abhinav, sorry for the mistake.

Comment 17 errata-xmlrpc 2021-02-24 15:17:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

