Bug 1838621 - [NooBaa] S3 commands fail (InternalError)
Summary: [NooBaa] S3 commands fail (InternalError)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Ohad
QA Contact: Ben Eli
URL:
Whiteboard:
Duplicates: 1856459
Depends On:
Blocks: 1979301
 
Reported: 2020-05-21 12:58 UTC by Ben Eli
Modified: 2024-03-25 15:57 UTC
CC: 12 users

Fixed In Version: v4.5.0-487.ci
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1904965 1979301
Environment:
Last Closed: 2020-12-07 09:26:13 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github noobaa noobaa-core pull 6078 0 None closed Fixing issues with default pool and with missing bucket for chunk 2021-01-21 16:56:23 UTC
Github noobaa noobaa-core pull 6087 0 None closed Backport to 5.5 2021-01-21 16:56:23 UTC
Github noobaa noobaa-core pull 6088 0 None closed Fixing default_pool as mongo on s3 buckets 2021-01-21 16:56:24 UTC
Red Hat Product Errata RHBA-2020:3754 0 None None None 2020-09-15 10:17:34 UTC

Description Ben Eli 2020-05-21 12:58:45 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
In one of our OCS-CI test automation runs, we ran into an issue in three of the tests.
Whenever we would try to sync objects to a NooBaa bucket, the sync would fail with the following message:
> An error occurred (InternalError) when calling the ListObjectsV2 operation (reached max retries: 4): We encountered an internal error. Please try again.

Then, upon teardown, those three tests ran into a similar issue when trying to remove the buckets created for the test:
> An error occurred (InternalError) when calling the ListObjectVersions operation

Version of all relevant components (if applicable):
> OCP 4.4.0-0.nightly-2020-05-18-164758
> OCS 4.4.0-428.ci (RC6)
> noobaa-operator:
> mcg-operator@sha256:c2fb84a40850fbf8cbcb95509804e3a8ed9f188273a5e605478db6eb7ad00bda
> noobaa_core:
> mcg-core@sha256:cf4135edaf75556a5c1d308345bec321ed1d7f60f16380cadfd5947a5c45b2c0
> noobaa_db:
> mongodb-36-rhel7@sha256:e2460bd732c38b281c0b8f7ca9ca0c1f8a131935db1b69fb2f840b252c494847

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes - it prevents me from performing S3 operations on NooBaa

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
The issue seems to be rare and hard to reproduce. I ran the same test suite three times on the same version of OCS (with a newer nightly OCP - 4.4.0-0.nightly-2020-05-21-042450) and all the tests passed.

Can this issue be reproduced from the UI?
The issue was discovered via the AWS CLI and Boto3. The UI was not tested, and objects cannot be written via the UI, so N/A.

Steps to Reproduce:
1. Create a NooBaa bucket
2. Try to use `aws s3 sync` to sync objects to the bucket
3. Try to delete the bucket using `boto3` (a rough sketch of these steps follows below)
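
A minimal shell sketch of the steps above, for illustration only; the endpoint URL, credentials, bucket name and local directory are placeholders, and the actual test teardown uses boto3 rather than the AWS CLI:

  # Point the AWS CLI at the NooBaa S3 endpoint (placeholder values)
  export AWS_ACCESS_KEY_ID=<noobaa-access-key>
  export AWS_SECRET_ACCESS_KEY=<noobaa-secret-key>
  S3_ENDPOINT=https://<noobaa-s3-route>

  # 1. Create a NooBaa bucket
  aws --endpoint-url "$S3_ENDPOINT" s3 mb s3://test-bucket

  # 2. Sync a local directory of objects into the bucket
  aws --endpoint-url "$S3_ENDPOINT" s3 sync ./testdata s3://test-bucket

  # 3. Delete the bucket and its contents (the automation does this via boto3;
  #    'rb --force' exercises a comparable list-and-delete path)
  aws --endpoint-url "$S3_ENDPOINT" s3 rb s3://test-bucket --force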

Actual results:
The operations fail, and an error message is returned

Expected results:
The operations execute successfully

Additional info:

Comment 3 Nimrod Becker 2020-06-09 08:28:43 UTC
Waiting for a reproduction.
Please take a DB dump and logs when it happens.

Comment 4 Ben Eli 2020-06-16 09:47:12 UTC
We have not yet encountered the issue on a live cluster, despite efforts to reproduce it.
We will need a custom must-gather that dumps the database in order to inspect failed runs, since those seem to be the main source of reproductions.
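
For reference, a manual dump can be taken along the following lines; this is only a sketch, assuming the DB pod is noobaa-db-0 in the openshift-storage namespace and the database name is nbcore (as the mongodump output later in this bug shows):

  # Stream a gzip'ed archive of the nbcore database out of the DB pod
  oc -n openshift-storage exec noobaa-db-0 -- \
      mongodump --db nbcore --archive --gzip > nbcore.gz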

Comment 12 Elad 2020-07-09 10:15:14 UTC
It looks like the DB dump failed to be collected; this is seen in all occurrences of this bug:

12:58:47 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: 2020-07-08T12:58:45.611+0000 writing nbcore.objectparts to archive 'nbcore.gz'
2020-07-08T12:58:45.613+0000 writing nbcore.datablocks to archive 'nbcore.gz'
2020-07-08T12:58:45.622+0000 writing nbcore.datachunks to archive 'nbcore.gz'
2020-07-08T12:58:45.622+0000 writing nbcore.activitylogs to archive 'nbcore.gz'
2020-07-08T12:58:45.886+0000 done dumping nbcore.activitylogs (234 documents)
2020-07-08T12:58:45.886+0000 writing nbcore.objectmultiparts to archive 'nbcore.gz'
2020-07-08T12:58:45.886+0000 done dumping nbcore.objectparts (562 documents)
2020-07-08T12:58:45.886+0000 writing nbcore.objectmds to archive 'nbcore.gz'
2020-07-08T12:58:45.886+0000 done dumping nbcore.datachunks (503 documents)
2020-07-08T12:58:45.886+0000 writing nbcore.mongo_internal_agent.files to archive 'nbcore.gz'
2020-07-08T12:58:46.025+0000 done dumping nbcore.datablocks (503 documents)
2020-07-08T12:58:46.025+0000 writing nbcore.mongo_internal_agent.chunks to archive 'nbcore.gz'
2020-07-08T12:58:46.230+0000 done dumping nbcore.objectmds (76 documents)
2020-07-08T12:58:46.230+0000 writing nbcore.tiers to archive 'nbcore.gz'
2020-07-08T12:58:46.233+0000 done dumping nbcore.mongo_internal_agent.files (78 documents)
2020-07-08T12:58:46.233+0000 writing nbcore.tieringpolicies to archive 'nbcore.gz'
2020-07-08T12:58:46.234+0000 done dumping nbcore.objectmultiparts (186 documents)
2020-07-08T12:58:46.234+0000 writing nbcore.buckets to archive 'nbcore.gz'
command terminated with exit code 137

As this issue is seen quite frequently with the latest OCS 4.5 build, retargeting to 4.5 and proposing as a blocker

Comment 13 Evgeniy Belyi 2020-07-09 12:29:15 UTC
Exit code 137 means the container was killed (SIGKILL, 128+9), typically because the system terminated it due to memory/CPU limits.
Please note that collecting a MongoDB dump requires resources (CPU, memory); I would advise increasing the CPU and memory resources for the pod/worker.
This way CI builds won't constantly fail to collect MongoDB dumps.
From the NooBaa logs, I can see that we are trying to write to the MongoDB resource and the resource crashes (due to a MongoDB connection drop).
Please attempt to reproduce with sufficient memory/CPU resources so the reproduction will include a MongoDB dump.
Thank you.
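
One possible way to raise the DB pod's resources is through the NooBaa custom resource; this is only a sketch, assuming the CR is named noobaa in the openshift-storage namespace and exposes a spec.dbResources field, with illustrative (not recommended) values:

  # Bump CPU/memory requests and limits for the NooBaa DB pod via the operator
  oc -n openshift-storage patch noobaa noobaa --type merge -p '
  {"spec":{"dbResources":{
      "requests":{"cpu":"2","memory":"8Gi"},
      "limits":{"cpu":"2","memory":"8Gi"}}}}'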

Comment 14 Elad 2020-07-09 13:06:26 UTC
These are test executions over AWS with the minimum required instance type, m5.4xlarge, for the worker nodes.
You can see here [1] that none of the nodes had significant CPU or memory utilization.



Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         6654m (42%)    6 (38%)
  memory                      19003Mi (30%)  15872Mi (25%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0


Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1530m (43%)   0 (0%)
  memory                      4549Mi (30%)  512Mi (3%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0            

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         5875m (37%)    5500m (35%)
  memory                      15544Mi (25%)  14324Mi (23%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1704m (48%)   0 (0%)
  memory                      5149Mi (34%)  512Mi (3%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1825m (52%)   0 (0%)
  memory                      5749Mi (38%)  512Mi (3%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         6654m (42%)    6 (38%)
  memory                      19003Mi (30%)  15872Mi (25%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0



-------------------------
I'm now trying to find a way to collect the DB dump using a workaround, so hopefully we will have it in the next few hours.




[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr2458-b2259/jnk-pr2458-b2259_20200708T171603/logs/failed_testcase_ocs_logs_1594232005/test_write_file_to_bucket_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-47f2a6b4417e188455d2c169b0baea5077fd4ff8d008eccdffc48af173421229/oc_output/describe_nodes

Comment 18 Ben Eli 2020-07-13 16:59:31 UTC
It seems like something similar just happened on run 1594643589:
InternalErrors on the ListObjectVersions and CreateBucket operations.
The default backingstore is in a TemporaryError state with a "Topology was destroyed" error message.
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9777/
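
The backingstore state can be inspected directly when this reproduces; a hedged sketch, assuming the default OCS namespace and default backingstore name:

  # Check the phase and conditions of the default backingstore
  oc -n openshift-storage get backingstore
  oc -n openshift-storage describe backingstore noobaa-default-backing-store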

Comment 20 Nimrod Becker 2020-07-13 18:02:36 UTC
*** Bug 1856459 has been marked as a duplicate of this bug. ***

Comment 22 Jacky Albo 2020-07-16 16:49:21 UTC
After a short chat with Ben, I understand that some restarts were involved in this test. We want to understand better what was really going on, and whether noobaa-db-0 or its PV was down as part of this test, which may have caused some temporary topology issues for the DB.
Ben will continue his investigation on Sunday and will update this issue. Waiting for his feedback.

Comment 23 Ben Eli 2020-07-19 14:46:39 UTC
Reverting to ON_QA, since this might have been caused by a restart performed in the specific test that led to the latest reproduction.
Awaiting further testing by the QE team.

Comment 27 errata-xmlrpc 2020-09-15 10:17:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

Comment 30 Petr Balogh 2020-12-03 09:00:48 UTC
Reproduced on a vSphere 4.6 execution with the RC4 build from stage, here: https://ocs4-jenkins-csb-ocsqe.cloud.paas.psi.redhat.com/job/qe-deploy-ocs-cluster/52/

Comment 31 Petr Balogh 2020-12-03 22:06:34 UTC
Maybe another occurrence: https://ocs4-jenkins-csb-ocsqe.cloud.paas.psi.redhat.com/job/qe-deploy-ocs-cluster/57/

If so, it happened twice in a row: once during stage testing here, and again with the RC5 internal build.

@nbecker can someone from the NooBaa team take a look?

Comment 38 Elad 2020-12-07 09:29:08 UTC
New bug 1904965

