Bug 2010560

Summary: Gateway timeout with error occurred (504) when calling the CreateBucket operation
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Tiffany Nguyen <tunguyen>
Component: Multi-Cloud Object Gateway
Assignee: Nimrod Becker <nbecker>
Status: CLOSED NOTABUG
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.9
CC: aindenba, dzaken, ebenahar, etamir, nbecker, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-06-09 08:15:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Tiffany Nguyen 2021-10-05 02:41:46 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
The error "An error occurred (504) when calling the CreateBucket operation (reached max retries: 4): Gateway Timeout" is seen when running multiple OBCs with I/O.

Error log snippet from the console:
if http.status_code >= 300:
            error_code = parsed_response.get("Error", {}).get("Code")
            error_class = self.exceptions.from_code(error_code)
>           raise error_class(parsed_response, operation_name)
E           botocore.exceptions.ClientError: An error occurred (504) when calling the CreateBucket operation (reached max retries: 4): Gateway Timeout
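For context, the failing call is a boto3 CreateBucket request against the MCG/NooBaa S3 endpoint. The following is a minimal sketch of the client-side setup, not the exact ocs-ci code: the endpoint URL, credentials, bucket-name pattern, and retry/timeout values are placeholder assumptions for illustration only.

# Hedged sketch (assumed setup, not the ocs-ci test code): create many
# buckets against the MCG S3 endpoint with explicit retry/timeout settings.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Placeholder endpoint and credentials for the NooBaa/MCG S3 service.
S3_ENDPOINT = "https://s3-openshift-storage.apps.example.com"

s3 = boto3.client(
    "s3",
    endpoint_url=S3_ENDPOINT,
    aws_access_key_id="<ACCESS_KEY>",
    aws_secret_access_key="<SECRET_KEY>",
    verify=False,
    config=Config(
        retries={"max_attempts": 5, "mode": "standard"},  # a few retries, as in the log above
        connect_timeout=60,
        read_timeout=60,
    ),
)

for i in range(100):
    try:
        s3.create_bucket(Bucket=f"scale-test-bucket-{i}")
    except ClientError as err:
        # Under load the gateway returns HTTP 504 and botocore surfaces it as a
        # ClientError after exhausting its retries, matching the traceback above.
        print(f"CreateBucket failed for bucket {i}: {err}")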


Version of all relevant components (if applicable):
$ oc get csv -n openshift-storage 
NAME                            DISPLAY                       VERSION        REPLACES   PHASE
noobaa-operator.v4.9.0-164.ci   NooBaa Operator               4.9.0-164.ci              Succeeded
ocs-operator.v4.9.0-164.ci      OpenShift Container Storage   4.9.0-164.ci              Succeeded
odf-operator.v4.9.0-164.ci      OpenShift Data Foundation     4.9.0-164.ci              Succeeded


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
This limits the number of OBCs that can run I/O concurrently.


Is there any workaround available to the best of your knowledge?
None

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
100% reproducible.


Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create OBCs.
2. Run I/O against them using mcg_job_factory() (a hedged sketch of OBC creation follows below).
3. Observe the error in the console log.
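For readers without the ocs-ci framework, OBCs can also be created directly as ObjectBucketClaim custom resources. The sketch below uses the kubernetes Python client as a stand-in for mcg_job_factory(); the namespace, storage class name, and claim-name pattern are illustrative assumptions, not values taken from this environment.

# Hedged sketch: create N ObjectBucketClaims with the kubernetes Python client.
# Namespace, storage class, and naming are assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
api = client.CustomObjectsApi()

NAMESPACE = "openshift-storage"                 # assumed namespace
STORAGE_CLASS = "openshift-storage.noobaa.io"   # assumed OBC storage class

for i in range(100):
    obc = {
        "apiVersion": "objectbucket.io/v1alpha1",
        "kind": "ObjectBucketClaim",
        "metadata": {"name": f"scale-obc-{i}", "namespace": NAMESPACE},
        "spec": {
            "generateBucketName": f"scale-obc-{i}",
            "storageClassName": STORAGE_CLASS,
        },
    }
    # Each claim asks the MCG provisioner to create a backing bucket.
    api.create_namespaced_custom_object(
        group="objectbucket.io",
        version="v1alpha1",
        namespace=NAMESPACE,
        plural="objectbucketclaims",
        body=obc,
    )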


Actual results:
Failed to run I/O on more than 20 OBCs.

Expected results:
I/O should run without any issue.

Additional info:
More errors can be seen at:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-test-pr/1584/testReport/tests.e2e.scale.noobaa.test_scale_obc_creation_repsin_noobaa_pods/TestScaleOCBCreation/test_scale_obc_creation_noobaa_pod_respin_noobaa_core_openshift_storage_noobaa_io_/

Comment 1 Alexander Indenbaum 2021-10-10 13:48:45 UTC
Hello @tunguyen,

Could you please provide must-gather logs of the NooBaa components: endpoint, core, etc for this test? Those logs would provide better insight into the root cause of the HTTP 504 error for the `CreateBucket` operation.

Thank you!

Comment 2 Tiffany Nguyen 2021-10-12 03:14:35 UTC
@ainden

Comment 5 Elad 2021-10-19 08:56:03 UTC
Proposing as a blocker so it won't be pushed out of 4.9.0

Comment 9 Tiffany Nguyen 2021-11-18 17:16:18 UTC
The issue is seen on build ODF 4.9.0-241.ci when creating 100 OBCs with I/O using mcg_job_factory():

 if http.status_code >= 300:
            error_code = parsed_response.get("Error", {}).get("Code")
            error_class = self.exceptions.from_code(error_code)
>           raise error_class(parsed_response, operation_name)
E           botocore.exceptions.ClientError: An error occurred (504) when calling the CreateBucket operation (reached max retries: 4): Gateway Timeout

Must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2010560/
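Not a fix for the gateway timeout itself, but as a test-side mitigation the CreateBucket call could be wrapped in an explicit backoff loop so a transient 504 under load does not immediately fail the scale run. This is a sketch under the assumption that an s3 boto3 client like the one in the earlier snippet is already configured; the error-code list and delays are illustrative.

# Hedged sketch: retry CreateBucket with exponential backoff on gateway errors.
import time
from botocore.exceptions import ClientError

def create_bucket_with_backoff(s3, name, attempts=6, base_delay=2.0):
    for attempt in range(attempts):
        try:
            return s3.create_bucket(Bucket=name)
        except ClientError as err:
            code = err.response.get("Error", {}).get("Code")
            # A 504 surfaces here only after botocore's own retries are exhausted.
            if code not in ("504", "503", "SlowDown") or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff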

Comment 15 Nimrod Becker 2022-03-03 13:00:10 UTC
At the 4.10 dev freeze milestone checkpoint meeting, it was decided to move this out of 4.11.