Bug 2090968

Summary: S3 PUT requests failing with Internal Error.
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Anant Malhotra <anamalho>
Component: ceph Assignee: Matt Benjamin (redhat) <mbenjamin>
ceph sub component: RGW QA Contact: Elad <ebenahar>
Status: CLOSED NOTABUG Docs Contact:
Severity: high    
Priority: unspecified CC: belimele, bniver, etamir, hnallurv, madam, mbenjamin, mkasturi, mmuench, muagarwa, ocs-bugs, odf-bz-bot, pnataraj, rayalon, tdesala, vashastr
Version: 4.10 Flags: hnallurv: needinfo? (mbenjamin)
muagarwa: needinfo? (hnallurv)
sheggodu: needinfo? (mbenjamin)
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-11-03 02:38:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Anant Malhotra 2022-05-27 06:45:09 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

S3 PUT requests frequently fail with "We encountered an internal error. Please try again." on an OBC backed by an RGW namespace store.

The issue arose while running the script 'test_longevity_stage2.py' from the PR: https://github.com/red-hat-storage/ocs-ci/pull/5540/
Until a couple of days ago, the script uploaded the objects to the bucket (backed by an RGW namespace store) without any issues, but in the later runs (4/4) the upload fails with the error below:

```
E           ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-storage rsh session-awscli-relay-pod-1b40c3df84164d4 sh -c "AWS_CA_BUNDLE=/cert/service-ca.crt AWS_ACCESS_KEY_ID=***** AWS_SECRET_ACCESS_KEY=***** AWS_DEFAULT_REGION=us-east-2 aws s3 --endpoint=***** sync test_longevity_stage2/origin s3://oc-bucket-da750ffd1edb47c78c3495b813faaa".
E           Error is upload failed: test_longevity_stage2/origin/test58 to s3://oc-bucket-da750ffd1edb47c78c3495b813faaa/test58 An error occurred (InternalError) when calling the PutObject operation (reached max retries: 2): We encountered an internal error. Please try again.
```


https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/12962/consoleFull -> this is the successful run


The exception occurred inside the 'write_empty_files_to_bucket' function, which is run after the OBCs are created inside the '_multi_obc_lifecycle_factory' function:
https://github.com/red-hat-storage/ocs-ci/pull/5540/files#diff-008aeb103a5a9ae662ae2e86cf3a0c9335d41b6047180ff64584d9b2243d2ed8R46


logs
----
May-25 16:57:57.143 [Endpoint/13] [ERROR] core.rpc.rpc_schema:: INVALID_SCHEMA_PARAMS CLIENT pool_api#/methods/update_issues_report ERRORS: [ { instancePath: '/error_code', schemaPath: 'pool_api#/methods/update_issues_report/params/properties/error_code/type', keyword: 'type', params: { type: 'string' }, message: 'must be string', schema: 'string', parentSchema: { type: 'string' }, data: 502 }, [length]: 1 ] PARAMS: { namespace_resource_id: '628e5ff0029cdc0029f4f9ea', error_code: 502, time: 1653497877143 }
May-25 16:57:57.143 [Endpoint/13] [ERROR] core.rpc.rpc:: RPC._request: response ERROR srv pool_api.update_issues_report reqid <no-reqid-yet> connid <no-connection-yet> params { namespace_resource_id: '628e5ff0029cdc0029f4f9ea', error_code: 502, time: 1653497877143 }  Error: INVALID_SCHEMA_PARAMS CLIENT
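
The INVALID_SCHEMA_PARAMS entries above say that 'error_code' was sent as the number 502 while the schema for pool_api.update_issues_report expects a string. A minimal Python sketch of that kind of type check follows. It is not the endpoint's validator: the helper name and the one-field schema fragment are hypothetical, and only the 'error_code' type and the parameter values are taken from the log. The log alone does not establish whether this rejection is related to the failing PUT requests.

```python
def check_param_types(params, schema):
    """Return (field, expected_type, actual_value) for every type mismatch."""
    type_map = {"string": str, "number": (int, float)}
    errors = []
    for field, expected in schema.items():
        value = params.get(field)
        if not isinstance(value, type_map[expected]):
            errors.append((field, expected, value))
    return errors


# Hypothetical schema fragment: only the type of error_code is confirmed by the log.
schema = {"error_code": "string"}

# Parameter values copied from the PARAMS section of the log line.
params = {
    "namespace_resource_id": "628e5ff0029cdc0029f4f9ea",
    "error_code": 502,
    "time": 1653497877143,
}

errors = check_param_types(params, schema)

# Assumption for illustration: sending the code as a string would satisfy the check.
fixed = check_param_types({**params, "error_code": str(params["error_code"])}, schema)

print(errors)
print(fixed)
```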

----
May-25 16:57:59.858 [Endpoint/13] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource>/oc-bucket-e6dfcb221a464422836c07f1891e24/test520</Resource><RequestId>l3lty19u-g3u9an-1931</RequestId></Error> PUT /oc-bucket-e6dfcb221a464422836c07f1891e24/test520 {"host":"s3.openshift-storage.svc","accept-encoding":"identity","user-agent":"aws-cli/2.0.13 Python/3.7.3 Linux/4.18.0-305.45.1.el8_4.x86_64 botocore/2.0.0dev17","expect":"100-continue","x-amz-date":"20220525T165759Z","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","authorization":"AWS4-HMAC-SHA256 Credential=mwLpNOKmzu5z3yx5f69J/20220525/us-east-2/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=b172a99c20ad62858c44bd5f64bd13759d602aaeea0735c66972ecf9698702a2","content-length":"0"} 502: null
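
When triaging many such entries, it can help to pull the fields out of the S3 error body instead of reading the raw XML. The sketch below parses the error document from the log line above with Python's standard library; the function name is hypothetical and the XML string is copied from the log.

```python
import xml.etree.ElementTree as ET

# Error body copied from the s3_rest log entry.
S3_ERROR_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<Error><Code>InternalError</Code>"
    "<Message>We encountered an internal error. Please try again.</Message>"
    "<Resource>/oc-bucket-e6dfcb221a464422836c07f1891e24/test520</Resource>"
    "<RequestId>l3lty19u-g3u9an-1931</RequestId></Error>"
)


def parse_s3_error(xml_text):
    """Map each child element of <Error> to its text content."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {child.tag: child.text for child in root}


info = parse_s3_error(S3_ERROR_XML)
print(info["Code"], info["Resource"], info["RequestId"])
```

The RequestId value is the handle to search for in the endpoint logs when correlating a client-side failure with the server-side entry.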


> The script passed when uploading a single object to the bucket (backed by rgw namespace store) without any issues.


Version of all relevant components (if applicable):
ODF 4.10.2
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Unable to upload objects.


Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes (4/4)

Can this issue reproduce from the UI?
NA

If this is a regression, please provide more details to justify this:
Not Sure

Steps to Reproduce:
1. Run the script 'test_longevity_stage2.py' from the PR: https://github.com/red-hat-storage/ocs-ci/pull/5540/
(The exception occurred inside the 'write_empty_files_to_bucket' function, which is run after the OBCs are created inside the '_multi_obc_lifecycle_factory' function:
https://github.com/red-hat-storage/ocs-ci/pull/5540/files#diff-008aeb103a5a9ae662ae2e86cf3a0c9335d41b6047180ff64584d9b2243d2ed8R46)



Actual results: Exception occurred
```
E           ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-storage rsh session-awscli-relay-pod-1b40c3df84164d4 sh -c "AWS_CA_BUNDLE=/cert/service-ca.crt AWS_ACCESS_KEY_ID=***** AWS_SECRET_ACCESS_KEY=***** AWS_DEFAULT_REGION=us-east-2 aws s3 --endpoint=***** sync test_longevity_stage2/origin s3://oc-bucket-da750ffd1edb47c78c3495b813faaa".
E           Error is upload failed: test_longevity_stage2/origin/test58 to s3://oc-bucket-da750ffd1edb47c78c3495b813faaa/test58 An error occurred (InternalError) when calling the PutObject operation (reached max retries: 2): We encountered an internal error. Please try again.
```


Expected results:
S3 PUT requests should execute successfully.
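
To narrow the failure down outside the test framework, the upload step can be approximated by issuing the zero-byte PUTs one at a time and recording which keys fail. The sketch below is an assumption-laden stand-in, not the ocs-ci code: the original test runs 'aws s3 sync' from a relay pod, whereas this helper (hypothetical name) takes any client object exposing put_object(Bucket=..., Key=..., Body=...). The FlakyClient class and the bucket name exist only so the sketch runs without a cluster.

```python
def put_empty_objects(client, bucket, count, prefix="test"):
    """PUT `count` zero-byte objects; return {key: error text} for the failures."""
    failures = {}
    for i in range(count):
        key = f"{prefix}{i}"
        try:
            client.put_object(Bucket=bucket, Key=key, Body=b"")
        except Exception as exc:  # a real run would narrow this to the client's error type
            failures[key] = str(exc)
    return failures


class FlakyClient:
    """Stand-in client for a dry run: every third PUT raises."""

    def __init__(self):
        self.calls = 0

    def put_object(self, Bucket, Key, Body):
        self.calls += 1
        if self.calls % 3 == 0:
            raise RuntimeError("InternalError")


failures = put_empty_objects(FlakyClient(), "example-bucket", 6)
print(failures)
```

With boto3 installed, a client created by boto3.client("s3", endpoint_url=...) accepts the same put_object keywords, so it could be passed in place of the stand-in; the endpoint and credentials would have to come from the OBC under test.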

Additional info: 
> Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/tdesala-long-testd/tdesala-long-testd_20220525T080711/logs/failed_testcase_ocs_logs_1653499324/test_longevity_stage2_ocs_logs/ocs_must_gather/
> Same error occurred on 4.11.0 as well.

Comment 10 Mudit Agarwal 2022-10-26 03:30:39 UTC
Harish/Ben, is this still an issue?

Comment 13 Mudit Agarwal 2022-11-03 02:38:39 UTC
Thanks Ben, please reopen if this is seen after fixing the ci issue.