Bug 2005417 - [IBM Z/P]: Bad Gateway error with multiple S3 requests while syncing objects to rgw bucket
Summary: [IBM Z/P]: Bad Gateway error with multiple S3 requests while syncing objects to rgw bucket
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Scott Ostapovicz
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-17 15:12 UTC by Sravika
Modified: 2023-08-09 16:37 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-17 09:22:48 UTC
Embargoed:


Attachments
ocs-ci-test_case log (16.42 KB, application/zip), 2021-09-17 15:12 UTC, Sravika
nbcore_postgres.gz (336.90 KB, application/gzip), 2021-11-09 21:31 UTC, Sravika

Description Sravika 2021-09-17 15:12:01 UTC
Created attachment 1823937 [details]
ocs-ci-test_case log

Description of problem (please be as detailed as possible and provide log
snippets):

This test scenario is part of the ocs-ci test tests/manage/rgw/test_bucket_deletion.py::TestBucketDeletion::test_bucket_delete_with_objects[RGW-OC], which creates an RGW OBC and syncs all the objects and directories in a folder to it.

I verified this test manually by creating the RGW OBC and syncing the files to it; the upload of a few files failed after a certain number of S3 requests. However, when I copied the failed files to the same bucket individually, the uploads did not fail.

# oc get obc -n openshift-storage
NAME                                       STORAGE-CLASS                 PHASE   AGE
rgw-oc-bucket-76db54f20b3e40ccb8a6798913   ocs-storagecluster-ceph-rgw   Bound   142m


# aws s3 --no-verify-ssl --endpoint <> ls
2021-09-17 14:06:17 rgw-oc-bucket-76db54f20b3e40ccb8a6798913

# oc -n openshift-storage rsh session-awscli-relay-pod-9e55c3f9e24d4fb sh -c "AWS_CA_BUNDLE=/cert/service-ca.crt AWS_ACCESS_KEY_ID="<>" AWS_SECRET_ACCESS_KEY="<>" AWS_DEFAULT_REGION=us-east-1 aws s3 --endpoint=<> sync /test_objects/ s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913"

upload: ../test_objects/book.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/book.txt
upload: ../test_objects/bolder.jpg to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/bolder.jpg
upload: ../test_objects/apple.mp4 to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/apple.mp4
upload failed: ../test_objects/airbus.jpg to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/airbus.jpg Connection was closed before we received a valid response from endpoint URL: "http://ocs-storagecluster-cephobjectstore-openshift-storage.apps.ocsm4205001.lnxne.boe/rgw-oc-bucket-76db54f20b3e40ccb8a6798913/airbus.jpg?uploads".
upload: ../test_objects/canada.jpg to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/canada.jpg
upload: ../test_objects/random1.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random1.txt
upload: ../test_objects/random2.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random2.txt
upload: ../test_objects/random10.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random10.txt
upload: ../test_objects/random4.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random4.txt
upload: ../test_objects/random5.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random5.txt
upload: ../test_objects/random3.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random3.txt
upload: ../test_objects/random7.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random7.txt
upload: ../test_objects/random6.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random6.txt
upload: ../test_objects/random9.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random9.txt
upload: ../test_objects/rome.jpg to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/rome.jpg
upload failed: ../test_objects/goldman.webm to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/goldman.webm An error occurred (502) when calling the CreateMultipartUpload operation (reached max retries: 4): Bad Gateway
upload failed: ../test_objects/random8.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random8.txt Connection was closed before we received a valid response from endpoint URL: "http://ocs-storagecluster-cephobjectstore-openshift-storage.apps.ocsm4205001.lnxne.boe/rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random8.txt".
upload: ../test_objects/danny.webm to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/danny.webm
upload failed: ../test_objects/enwik8 to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/enwik8 An error occurred (502) when calling the UploadPart operation (reached max retries: 4): Bad Gateway
upload: ../test_objects/steve.webm to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/steve.webm
command terminated with exit code 1


# oc -n openshift-storage rsh session-awscli-relay-pod-9e55c3f9e24d4fb sh -c "AWS_CA_BUNDLE=/cert/service-ca.crt AWS_ACCESS_KEY_ID="<>" AWS_SECRET_ACCESS_KEY="<>" AWS_DEFAULT_REGION=us-east-1 aws s3 --endpoint=<> cp /test_objects/goldman.webm s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913"
upload: ../test_objects/goldman.webm to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/goldman.webm


# oc -n openshift-storage rsh session-awscli-relay-pod-9e55c3f9e24d4fb sh -c "AWS_CA_BUNDLE=/cert/service-ca.crt AWS_ACCESS_KEY_ID="<>" AWS_SECRET_ACCESS_KEY="<>" AWS_DEFAULT_REGION=us-east-1 aws s3 --endpoint=<> cp /test_objects/random8.txt s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913"
upload: ../test_objects/random8.txt to s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913/random8.txt



Version of all relevant components (if applicable):

OCP: 4.9.0-0.nightly-s390x-2021-09-09-135631
OCS-Operator: 4.9.0-142.ci
LSO : 4.9.0-202109071344
Noobaa: 4.9.0-139.ci
 
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

The ocs-ci test fails.

Is there any workaround available to the best of your knowledge?

Yes: uploading the failed files individually succeeds (see also the throttling sketch below).
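
If the failures correlate with request concurrency, throttling the AWS CLI may also help. A minimal sketch (untested here; the values are illustrative, and the settings persist in the pod's ~/.aws/config):

# aws configure set default.s3.max_concurrent_requests 1   # illustrative: one request at a time (default is 10)
# aws configure set default.s3.multipart_threshold 64MB    # illustrative: raise the multipart cutoff

A subsequent aws s3 sync would then issue requests serially, at the cost of a slower transfer.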

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCP and OCS
2. Create an RGW OBC with the following YAML:

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: rgw-oc-bucket-76db54f20b3e40ccb8a6798913
  namespace: openshift-storage
spec:
  bucketName: rgw-oc-bucket-76db54f20b3e40ccb8a6798913
  storageClassName: ocs-storagecluster-ceph-rgw

3. Sync more than 20 objects or directories to the RGW OBC (the redacted credentials and endpoint come from the Secret/ConfigMap generated for the claim; see the sketch after these steps):

# oc -n openshift-storage rsh session-awscli-relay-pod-9e55c3f9e24d4fb sh -c "AWS_CA_BUNDLE=/cert/service-ca.crt AWS_ACCESS_KEY_ID="<>" AWS_SECRET_ACCESS_KEY="<>" AWS_DEFAULT_REGION=us-east-1 aws s3 --endpoint=<> sync /test_objects/ s3://rgw-oc-bucket-76db54f20b3e40ccb8a6798913"

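For reference, the credentials and endpoint used in step 3 come from the Secret and ConfigMap that the OBC provisioner generates under the claim's name (standard ObjectBucketClaim behavior). A minimal sketch to read them back:

# oc -n openshift-storage get secret rgw-oc-bucket-76db54f20b3e40ccb8a6798913 -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d
# oc -n openshift-storage get secret rgw-oc-bucket-76db54f20b3e40ccb8a6798913 -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d
# oc -n openshift-storage get configmap rgw-oc-bucket-76db54f20b3e40ccb8a6798913 -o jsonpath='{.data.BUCKET_HOST}'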

Actual results:

Sync of objects to the OBC fails when multiple S3 requests are issued. The sync command and its failure output are identical to those shown in the description above.



Expected results:

The sync should complete without errors and all objects should be uploaded successfully.

Additional info:

https://drive.google.com/file/d/1UyN-_XiC2xlFm1tL_5dgAqg6oV2JMM4q/view?usp=sharing

Comment 2 Abdul Kandathil (IBM) 2021-09-23 16:38:01 UTC
I am able to reproduce it with the ocs-ci tier2 test
tests/manage/rgw/test_object_integrity.py::TestObjectIntegrity::test_empty_file_integrity


E           ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n openshift-storage rsh session-awscli-relay-pod-20562f6b72ec44a sh -c "AWS_CA_BUNDLE=/cert/service-ca.crt AWS_ACCESS_KEY_ID=***** AWS_SECRET_ACCESS_KEY=***** AWS_DEFAULT_REGION=us-east-1 aws s3 --endpoint=***** sync test_empty_file_integrity/origin s3://rgw-oc-bucket-1f0ae58edf5b4ae9bc1425f152".
E           Error is fatal error: Connection was closed before we received a valid response from endpoint URL: "*****/rgw-oc-bucket-1f0ae58edf5b4ae9bc1425f152?list-type=2&prefix=&encoding-type=url".
E           command terminated with exit code 1

Comment 4 Mudit Agarwal 2021-10-06 08:06:06 UTC
Nimrod, can someone please take a look? This is blocking the IBM team.

Comment 5 Romy Ayalon 2021-10-28 15:46:06 UTC
Hi,

Bad Gateway and 'Connection was closed before we received a valid response from endpoint URL' errors can imply a networking issue.
Also, from Sep-16 22:20:42.060 onward I see many RPC disconnection errors and NO_SUCH_NODE errors in the NooBaa core and NooBaa endpoint logs.

A few questions:
1. Do you experience other networking issues on that cluster?
2. Did you reproduce the issue on the same cluster? If not, can you try to reproduce it on another cluster?
3. Can you please provide a DB dump? From inside the noobaa-db-pg-0 pod, run: pg_dump nbcore | gzip > nbcore_postgres.gz (one way to collect this with oc is sketched below).
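
One way to collect and fetch the dump with oc (a sketch, assuming pg_dump, gzip, and tar are available in the pod, which should be the case for the stock image):

# oc -n openshift-storage exec noobaa-db-pg-0 -- bash -c "pg_dump nbcore | gzip > /tmp/nbcore_postgres.gz"
# oc -n openshift-storage cp noobaa-db-pg-0:/tmp/nbcore_postgres.gz ./nbcore_postgres.gz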

Thanks

Comment 6 Sravika 2021-11-09 21:29:15 UTC
Hi @rayalon ,

1. No, there isn't any network issue on the cluster.
2. This error has been reproduced on multiple clusters and has occurred every time during test case execution.
3. db-dump collected and attached to the BZ (nbcore_postgres.gz).

Comment 7 Sravika 2021-11-09 21:31:11 UTC
Created attachment 1840949 [details]
nbcore_postgres.gz

Comment 8 Romy Ayalon 2021-11-15 16:02:24 UTC
Hi Sravika,

This is not an MCG issue. These tests exercise the RGW OBC, not the NooBaa OBC; the bucket is created in Rook-Ceph, not in NooBaa.
You can also see that from the test path tests/manage/rgw/test_bucket_deletion.py::TestBucketDeletion::test_bucket_delete_with_objects[RGW-OC]. A quick provisioner check is sketched below.
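
A quick way to confirm which provisioner backs the bucket (a sketch; the provisioner strings below are the usual ODF defaults):

# oc get sc ocs-storagecluster-ceph-rgw -o jsonpath='{.provisioner}'
openshift-storage.ceph.rook.io/bucket

A NooBaa-backed claim would use the openshift-storage.noobaa.io/obc provisioner instead.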

Also, I had a short call with Ben from the OCS-CI team, and he says this was an OCS-CI issue that was fixed by this PR: https://github.com/red-hat-storage/ocs-ci/pull/5011/files

Please check that; I think you can close the bug afterward.

Thanks,
Romy

Comment 9 Mudit Agarwal 2021-11-17 09:22:48 UTC
Confirmed with Sravika; this issue is not seen anymore.

