Description of problem (please be as detailed as possible and provide log snippets):

S3 PUT requests frequently fail with "We encountered an internal error. Please try again." on an OBC backed by an RGW namespace store.

The issue arose while running the script 'test_longevity_stage2.py' from this PR: https://github.com/red-hat-storage/ocs-ci/pull/5540/

The script used to upload the objects to the bucket (backed by the RGW namespace store) without any issues until a couple of days ago, but in the later runs (4/4) the upload fails with the error below:

```E ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-storage rsh session-awscli-relay-pod-1b40c3df84164d4 sh -c "AWS_CA_BUNDLE=/cert/service-ca.crt AWS_ACCESS_KEY_ID=***** AWS_SECRET_ACCESS_KEY=***** AWS_DEFAULT_REGION=us-east-2 aws s3 --endpoint=***** sync test_longevity_stage2/origin s3://oc-bucket-da750ffd1edb47c78c3495b813faaa".
E Error is upload failed: test_longevity_stage2/origin/test58 to s3://oc-bucket-da750ffd1edb47c78c3495b813faaa/test58 An error occurred (InternalError) when calling the PutObject operation (reached max retries: 2): We encountered an internal error. Please try again.```

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/12962/consoleFull -> this is the successful run

The exception occurred inside the 'write_empty_files_to_bucket' function, which is run after the OBCs are created inside the '_multi_obc_lifecycle_factory' function: https://github.com/red-hat-storage/ocs-ci/pull/5540/files#diff-008aeb103a5a9ae662ae2e86cf3a0c9335d41b6047180ff64584d9b2243d2ed8R46

Logs from the NooBaa endpoint pod:
----
May-25 16:57:57.143 [Endpoint/13] [ERROR] core.rpc.rpc_schema:: INVALID_SCHEMA_PARAMS CLIENT pool_api#/methods/update_issues_report ERRORS: [ { instancePath: '/error_code', schemaPath: 'pool_api#/methods/update_issues_report/params/properties/error_code/type', keyword: 'type', params: { type: 'string' }, message: 'must be string', schema: 'string', parentSchema: { type: 'string' }, data: 502 }, [length]: 1 ] PARAMS: { namespace_resource_id: '628e5ff0029cdc0029f4f9ea', error_code: 502, time: 1653497877143 }

May-25 16:57:57.143 [Endpoint/13] [ERROR] core.rpc.rpc:: RPC._request: response ERROR srv pool_api.update_issues_report reqid <no-reqid-yet> connid <no-connection-yet> params { namespace_resource_id: '628e5ff0029cdc0029f4f9ea', error_code: 502, time: 1653497877143 } Error: INVALID_SCHEMA_PARAMS CLIENT
----
May-25 16:57:59.858 [Endpoint/13] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource>/oc-bucket-e6dfcb221a464422836c07f1891e24/test520</Resource><RequestId>l3lty19u-g3u9an-1931</RequestId></Error> PUT /oc-bucket-e6dfcb221a464422836c07f1891e24/test520 {"host":"s3.openshift-storage.svc","accept-encoding":"identity","user-agent":"aws-cli/2.0.13 Python/3.7.3 Linux/4.18.0-305.45.1.el8_4.x86_64 botocore/2.0.0dev17","expect":"100-continue","x-amz-date":"20220525T165759Z","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","authorization":"AWS4-HMAC-SHA256 Credential=mwLpNOKmzu5z3yx5f69J/20220525/us-east-2/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=b172a99c20ad62858c44bd5f64bd13759d602aaeea0735c66972ecf9698702a2","content-length":"0"} 502: null
----

> The script passed when uploading a single object to the bucket (backed by the RGW namespace store) without any issues.

Version of all relevant components (if applicable):
ODF 4.10.2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Unable to upload objects.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes (4/4)

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
Not sure

Steps to Reproduce:
1. Run the script 'test_longevity_stage2.py' from the PR: https://github.com/red-hat-storage/ocs-ci/pull/5540/
   (The exception occurred inside the 'write_empty_files_to_bucket' function, which is run after the OBCs are created inside the '_multi_obc_lifecycle_factory' function: https://github.com/red-hat-storage/ocs-ci/pull/5540/files#diff-008aeb103a5a9ae662ae2e86cf3a0c9335d41b6047180ff64584d9b2243d2ed8R46)

Actual results:
The following exception occurred:

```E ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-storage rsh session-awscli-relay-pod-1b40c3df84164d4 sh -c "AWS_CA_BUNDLE=/cert/service-ca.crt AWS_ACCESS_KEY_ID=***** AWS_SECRET_ACCESS_KEY=***** AWS_DEFAULT_REGION=us-east-2 aws s3 --endpoint=***** sync test_longevity_stage2/origin s3://oc-bucket-da750ffd1edb47c78c3495b813faaa".
E Error is upload failed: test_longevity_stage2/origin/test58 to s3://oc-bucket-da750ffd1edb47c78c3495b813faaa/test58 An error occurred (InternalError) when calling the PutObject operation (reached max retries: 2): We encountered an internal error. Please try again.```

Expected results:
S3 PUT requests should execute successfully.

Additional info:
> Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/tdesala-long-testd/tdesala-long-testd_20220525T080711/logs/failed_testcase_ocs_logs_1653499324/test_longevity_stage2_ocs_logs/ocs_must_gather/
> The same error occurred on 4.11.0 as well.
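> For anyone who wants to reproduce this outside of ocs-ci, below is a minimal standalone sketch of what the failing step effectively does: PUT a batch of zero-byte objects to the OBC bucket over the internal S3 endpoint. It is an approximation of the 'write_empty_files_to_bucket' helper, not the helper itself; the endpoint URL, credentials, CA-bundle path, bucket name, and object count are placeholders taken from this report and must be replaced with the values from the OBC's ConfigMap/Secret on the affected cluster.

```python
#!/usr/bin/env python3
"""Reproduction sketch: upload many empty objects to an OBC bucket
backed by an RGW namespace store. All connection details below are
placeholders copied from this bug report, not hard requirements."""
import boto3
from botocore.config import Config

S3_ENDPOINT = "https://s3.openshift-storage.svc"          # internal NooBaa S3 endpoint (placeholder)
ACCESS_KEY = "<AWS_ACCESS_KEY_ID from the OBC secret>"     # placeholder
SECRET_KEY = "<AWS_SECRET_ACCESS_KEY from the OBC secret>" # placeholder
CA_BUNDLE = "/cert/service-ca.crt"                         # CA bundle used in the failing run
BUCKET = "oc-bucket-da750ffd1edb47c78c3495b813faaa"        # example OBC bucket name from this report

s3 = boto3.client(
    "s3",
    endpoint_url=S3_ENDPOINT,
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    region_name="us-east-2",
    verify=CA_BUNDLE,
    config=Config(retries={"max_attempts": 2}),  # match the "reached max retries: 2" behaviour
)

# Upload zero-byte objects; with this bug, PutObject fails with
# "InternalError: We encountered an internal error. Please try again."
for i in range(1000):
    s3.put_object(Bucket=BUCKET, Key=f"test{i}", Body=b"")
    print(f"uploaded test{i}")
```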
Harish/Ben, is this still an issue?
Thanks Ben. Please reopen if this is seen again after the CI issue is fixed.