Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1877116

Summary:	e2e aws calico tests fail with `rpc error: code = ResourceExhausted`
Product:	OpenShift Container Platform	Reporter:	Ankita Thomas <ankithom>
Component:	Installer	Assignee:	Abhinav Dahiya <adahiya>
Installer sub component:	openshift-installer	QA Contact:	Yunfei Jiang <yunjiang>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	high	CC:	adahiya, ally, aos-bugs, eparis, jokerman, lwan, rgregory, yanyang
Version:	4.5
Target Milestone:	---
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The installer's internal Terraform backend does not support very large inputs from Terraform core to a Terraform provider such as aws. Passing bootstrap.ign to the aws provider as a string could therefore exceed this limit and cause a failure. Consequence: The installer failed when trying to create the bootstrap ignition S3 bucket. Fix: The Terraform backend was modified to pass bootstrap.ign as a path on disk so that the aws provider can read the large file itself, circumventing the input size limit. Result: No error occurs when performing a Calico installation that creates a bootstrap ignition file larger than the input limit.	Story Points:	---
Clone Of:
Clones:	1877117		Environment:	test: operator
Last Closed:	2021-02-24 15:17:43 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1877117, 1877118

Description Ankita Thomas 2020-09-08 20:41:30 UTC

test: operator.Run template e2e-aws - e2e-aws-calico container setup fails frequently, see job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-calico-4.5/1302701517358239744

The test fails frequently with
`
Error: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4966185 vs. 4194304)
`

Comment 1 Devan Goodwin 2020-09-09 12:29:26 UTC

As far as I know, Hive has no relation to OpenShift CI and e2e. Given that this mentions Calico, I will reassign it to networking.

Comment 2 Ben Bennett 2020-09-09 13:55:37 UTC

*** Bug 1877117 has been marked as a duplicate of this bug. ***

Comment 3 Ben Bennett 2020-09-09 13:55:43 UTC

*** Bug 1877118 has been marked as a duplicate of this bug. ***

Comment 4 Ben Bennett 2020-09-09 14:10:24 UTC

This is failing early in the cluster bootstrap. I can't tell who is making the gRPC call. It may be Calico, but it looks like it happens earlier than Calico can even run. However, their installer may have changed something that is causing our installer to send an oversized message. Can someone on the installer team work out what is making that call (and why it might be oversized)?

level=info msg="Consuming Master Machines from target directory"
level=info msg="Consuming OpenShift Install (Manifests) from target directory"
level=info msg="Consuming Worker Machines from target directory"
level=info msg="Credentials loaded from the \"default\" profile in file \"/etc/openshift-installer/.awscred\""
level=info msg="Creating infrastructure resources..."
level=error
level=error msg="Error: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4968901 vs. 4194304)"
level=error
level=error
level=error msg="Failed to read tfstate: open /tmp/openshift-install-770414117/terraform.tfstate: no such file or directory"
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply Terraform: failed to complete the change"

Comment 5 Abhinav Dahiya 2020-09-10 18:09:07 UTC

There is an upstream Terraform bug where the gRPC plugin protocol doesn't support transferring more than 4 MB of data: https://github.com/hashicorp/terraform/issues/21709

This failure might be related to that.
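
As a rough illustration of the numbers in the error text, the following minimal Python sketch parses the two sizes out of the message and reports how far the payload exceeds the limit. The 4194304 figure is 4 * 1024 * 1024 bytes (4 MiB), which matches gRPC's default maximum receive message size. The function name `parse_resource_exhausted` is hypothetical and is not part of the installer or Terraform.

```python
import re

# 4 MiB, the limit reported in the error text (4194304 bytes).
GRPC_DEFAULT_MAX_BYTES = 4 * 1024 * 1024

# Matches the "(<received> vs. <max>)" part of the gRPC error message.
_SIZE_RE = re.compile(r"received message larger than max \((\d+) vs\. (\d+)\)")


def parse_resource_exhausted(message):
    """Return (received, limit, overage) in bytes, or None if no match.

    Hypothetical helper for illustration only.
    """
    match = _SIZE_RE.search(message)
    if match is None:
        return None
    received, limit = int(match.group(1)), int(match.group(2))
    return received, limit, received - limit


if __name__ == "__main__":
    line = ("Error: rpc error: code = ResourceExhausted desc = grpc: "
            "received message larger than max (4968901 vs. 4194304)")
    received, limit, overage = parse_resource_exhausted(line)
    print(f"limit is 4 MiB: {limit == GRPC_DEFAULT_MAX_BYTES}")
    print(f"received {received} bytes, over the limit by {overage} bytes")
```
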

Comment 6 Abhinav Dahiya 2020-09-10 18:24:52 UTC

This bug is happening only on e2e aws calico, see https://search.ci.openshift.org/?search=Error%3A+rpc+error%3A+code+%3D+ResourceExhausted+desc+%3D+grpc%3A+received+message+larger+than+max&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

What is special about this e2e job that is causing this failure?

Comment 7 Abhinav Dahiya 2020-09-10 18:30:47 UTC

My guess is that bootstrap.ign is becoming so large that https://github.com/openshift/installer/blob/0d5c871ce7d03f3d03ab4371dc39916a5415cf5c/data/data/aws/bootstrap/main.tf#L27 is failing to create an S3 bucket. That looks like the only large asset.

Instead of using content, which passes the bytes over gRPC, we could maybe use source (https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_object#source), which reads the ign file from disk. Maybe that helps reduce the transfer size.
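
To make the size difference between the two approaches concrete, here is a minimal Python sketch, not the installer's actual code. It compares the approximate number of bytes handed to the provider when a file is passed by content versus by path. The helper `transfer_size` and the file name used in the demo are hypothetical, and the real gRPC message carries additional framing and other attributes, so this is only a lower bound on the message size.

```python
import os
import tempfile

# 4 MiB, the limit reported in the error text (4194304 bytes).
GRPC_DEFAULT_MAX_BYTES = 4 * 1024 * 1024


def transfer_size(path, by_path):
    """Approximate bytes sent to the provider for one file argument.

    by_path=False models passing the file's contents as a string;
    by_path=True models passing only the path, so the provider reads
    the file from disk itself. Hypothetical helper for illustration.
    """
    if by_path:
        return len(os.fsencode(path))
    return os.path.getsize(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for an oversized bootstrap.ign (5 MiB of filler bytes).
        ign = os.path.join(tmp, "bootstrap.ign")
        with open(ign, "wb") as f:
            f.write(b"x" * (5 * 1024 * 1024))

        for by_path in (False, True):
            size = transfer_size(ign, by_path)
            mode = "path" if by_path else "content"
            print(mode, size, "fits" if size < GRPC_DEFAULT_MAX_BYTES else "too large")
```

Passing the path keeps the argument at a few dozen bytes regardless of how large the ignition file grows, which is the property the suggestion above relies on.
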

Comment 10 Alex 2020-10-19 15:11:05 UTC

We are working with a customer, ServiceNow, on a vSphere IPI install with Calico EE and are running into the same gRPC issue on an internal vSphere lab cluster, a Tigera lab cluster, and SNOW vSphere deployments.

Comment 12 Yunfei Jiang 2020-10-27 02:39:42 UTC

verified. FAILED.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-calico-4.5/1320824972217683968

level=error
level=error msg="Error: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5055165 vs. 4194304)"
level=error

Comment 13 Abhinav Dahiya 2020-11-09 18:07:02 UTC

(In reply to Yunfei Jiang from comment #12)
> verified. FAILED.
> 
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-calico-4.5/1320824972217683968
> 
> level=error
> level=error msg="Error: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5055165 vs. 4194304)"
> level=error

The fix was merged for master (4.7), not 4.5. Please test the latest code for the changes.

Comment 14 Yunfei Jiang 2020-11-11 07:57:26 UTC

verified. PASS.

@Abhinav, sorry for the mistake.

Comment 17 errata-xmlrpc 2021-02-24 15:17:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633