Description of problem (please be as detailed as possible and provide log snippets):

In one of our OCS-CI test automation runs, we ran into an issue in three of the tests. Whenever we would try to sync objects to a NooBaa bucket, the sync would fail with the following message:

> An error occurred (InternalError) when calling the ListObjectsV2 operation (reached max retries: 4): We encountered an internal error. Please try again.

Then, upon teardown, those three tests ran into a similar issue when trying to remove the buckets created for the test:

> An error occurred (InternalError) when calling the ListObjectVersions operation

Version of all relevant components (if applicable):
> OCP 4.4.0-0.nightly-2020-05-18-164758
> OCS 4.4.0-428.ci (RC6)
> noobaa-operator: mcg-operator@sha256:c2fb84a40850fbf8cbcb95509804e3a8ed9f188273a5e605478db6eb7ad00bda
> noobaa_core: mcg-core@sha256:cf4135edaf75556a5c1d308345bec321ed1d7f60f16380cadfd5947a5c45b2c0
> noobaa_db: mongodb-36-rhel7@sha256:e2460bd732c38b281c0b8f7ca9ca0c1f8a131935db1b69fb2f840b252c494847

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes - it prevents me from performing S3 operations on NooBaa.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
The issue seems to be rare and hard to reproduce. I ran the same test suite three times on the same version of OCS (with a newer nightly OCP - 4.4.0-0.nightly-2020-05-21-042450) and all the tests passed.

Can this issue be reproduced from the UI?
The issue was discovered via AWSCLI and Boto3. The UI was not tested, and objects cannot be written via the UI, so N/A.

Steps to Reproduce:
1. Create a NooBaa bucket
2. Try to use `aws s3 sync` to sync objects to the bucket
3. Try to delete the bucket using `boto3`

Actual results:
The operations fail, and an error message is returned.

Expected results:
The operations execute successfully.

Additional info:
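For reference, a minimal repro sketch with boto3 that mirrors the steps above. The endpoint URL, credentials, and bucket name are placeholders; on a real cluster they would come from the NooBaa S3 route and the bucket's ObjectBucketClaim secret.

```python
# Minimal repro sketch, assuming a NooBaa S3 endpoint and credentials taken
# from the bucket's ObjectBucketClaim secret; names/URLs below are placeholders.
import boto3

s3 = boto3.resource(
    "s3",
    endpoint_url="https://s3-openshift-storage.apps.example.com",  # placeholder route
    aws_access_key_id="<NOOBAA_ACCESS_KEY>",
    aws_secret_access_key="<NOOBAA_SECRET_KEY>",
    verify=False,  # test clusters often use self-signed certificates
)

bucket = s3.Bucket("repro-bucket")
bucket.create()

# Upload a few objects, then list them - the failing call in the tests was the
# ListObjectsV2 request issued internally by `aws s3 sync`.
for i in range(3):
    bucket.put_object(Key=f"obj-{i}", Body=b"x" * 1024)
print([o.key for o in bucket.objects.all()])

# Teardown mirrors what the tests do: delete all object versions, then the
# bucket itself - this is where ListObjectVersions returned InternalError.
bucket.object_versions.delete()
bucket.delete()
```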
Waiting for a repro; we will take a DB dump and logs when it happens.
We have not encountered the issue on a live cluster yet, despite efforts to reproduce it. We will need a custom must-gather that dumps the database in order to inspect failed runs, since those seem like the main source of reproductions.
Looks like the DB dump failed to be collected, and this is seen in all the occurrences of this bug:

12:58:47 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr:
2020-07-08T12:58:45.611+0000 writing nbcore.objectparts to archive 'nbcore.gz'
2020-07-08T12:58:45.613+0000 writing nbcore.datablocks to archive 'nbcore.gz'
2020-07-08T12:58:45.622+0000 writing nbcore.datachunks to archive 'nbcore.gz'
2020-07-08T12:58:45.622+0000 writing nbcore.activitylogs to archive 'nbcore.gz'
2020-07-08T12:58:45.886+0000 done dumping nbcore.activitylogs (234 documents)
2020-07-08T12:58:45.886+0000 writing nbcore.objectmultiparts to archive 'nbcore.gz'
2020-07-08T12:58:45.886+0000 done dumping nbcore.objectparts (562 documents)
2020-07-08T12:58:45.886+0000 writing nbcore.objectmds to archive 'nbcore.gz'
2020-07-08T12:58:45.886+0000 done dumping nbcore.datachunks (503 documents)
2020-07-08T12:58:45.886+0000 writing nbcore.mongo_internal_agent.files to archive 'nbcore.gz'
2020-07-08T12:58:46.025+0000 done dumping nbcore.datablocks (503 documents)
2020-07-08T12:58:46.025+0000 writing nbcore.mongo_internal_agent.chunks to archive 'nbcore.gz'
2020-07-08T12:58:46.230+0000 done dumping nbcore.objectmds (76 documents)
2020-07-08T12:58:46.230+0000 writing nbcore.tiers to archive 'nbcore.gz'
2020-07-08T12:58:46.233+0000 done dumping nbcore.mongo_internal_agent.files (78 documents)
2020-07-08T12:58:46.233+0000 writing nbcore.tieringpolicies to archive 'nbcore.gz'
2020-07-08T12:58:46.234+0000 done dumping nbcore.objectmultiparts (186 documents)
2020-07-08T12:58:46.234+0000 writing nbcore.buckets to archive 'nbcore.gz'
command terminated with exit code 137

As this issue is seen quite frequently with the latest OCS 4.5 build, retargeting to 4.5 and proposing as a blocker.
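For context, a sketch of how the dump collection can be repeated manually to check the exit code. The pod name, namespace, and mongodump flags are assumptions inferred from the log above (an archive dump of the nbcore database taken on noobaa-db-0); they are not copied from the actual must-gather script.

```python
# Sketch of the DB dump collection that is failing with exit code 137; pod,
# namespace, and database names are assumptions based on the log above.
import subprocess

cmd = [
    "oc", "-n", "openshift-storage", "exec", "noobaa-db-0", "--",
    "mongodump", "--db", "nbcore", "--archive=/tmp/nbcore.gz", "--gzip",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stderr)  # mongodump reports its progress ("writing nbcore.* ...") on stderr

if result.returncode == 137:
    # 137 = 128 + SIGKILL: the dump process was killed, typically by the
    # kernel OOM killer or a cgroup memory limit.
    print("mongodump was SIGKILLed - likely an OOM kill, see the comment below")
```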
Exit code 137 means the container was terminated with SIGKILL (128 + 9), which typically indicates the process was killed due to memory/CPU limits (for example, by the kernel OOM killer). Please note that collecting a MongoDB dump requires additional CPU and memory; I would advise increasing the CPU and memory resources for the pod/worker so that CI builds won't constantly fail to collect MongoDB dumps. From the NooBaa logs, I can see that we are trying to write to a MongoDB resource and the resource crashes (due to the MongoDB connection dropping). Please attempt to reproduce with sufficient memory/CPU resources so the reproduction will include a MongoDB dump. Thank you.
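As a starting point, a hedged sketch of raising the DB pod's resources. Whether the NooBaa CR honors a `spec.dbResources` override on this version is an assumption here; if it does not, the same values can be set directly on the noobaa-db StatefulSet. The request/limit values are examples only.

```python
# Hedged sketch: bump the DB pod's resource requests/limits via the NooBaa CR.
# The `dbResources` field is an assumption; adjust to whatever knob the
# operator actually exposes on this OCS version.
import json
import subprocess

db_resources = {
    "spec": {
        "dbResources": {
            "requests": {"cpu": "500m", "memory": "1Gi"},
            "limits": {"cpu": "1", "memory": "2Gi"},
        }
    }
}
subprocess.run(
    ["oc", "-n", "openshift-storage", "patch", "noobaa", "noobaa",
     "--type", "merge", "-p", json.dumps(db_resources)],
    check=True,
)
```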
These are test executions over AWS with the minimum required instance type (m5.4xlarge) for the worker nodes. You can see here [1] that none of the nodes had significant CPU or memory utilization.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         6654m (42%)    6 (38%)
  memory                      19003Mi (30%)  15872Mi (25%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         1530m (43%)    0 (0%)
  memory                      4549Mi (30%)   512Mi (3%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         5875m (37%)    5500m (35%)
  memory                      15544Mi (25%)  14324Mi (23%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         1704m (48%)    0 (0%)
  memory                      5149Mi (34%)   512Mi (3%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         1825m (52%)    0 (0%)
  memory                      5749Mi (38%)   512Mi (3%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         6654m (42%)    6 (38%)
  memory                      19003Mi (30%)  15872Mi (25%)
  ephemeral-storage           0 (0%)         0 (0%)
  hugepages-1Gi               0 (0%)         0 (0%)
  hugepages-2Mi               0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0

-------------------------

I'm now trying to find a way to collect the DB dump with a workaround, so hopefully we will have it in the next few hours.

[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr2458-b2259/jnk-pr2458-b2259_20200708T171603/logs/failed_testcase_ocs_logs_1594232005/test_write_file_to_bucket_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-47f2a6b4417e188455d2c169b0baea5077fd4ff8d008eccdffc48af173421229/oc_output/describe_nodes
Reproduced once more; this time we managed to collect the DB dump.

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-pr2469-b2280/jnk-pr2469-b2280_20200710T090721/logs/failed_testcase_ocs_logs_1594375004/test_write_file_to_bucket_ocs_logs/
Seems like something similar just happened on run 1594643589: InternalErrors on the ListObjectVersions and CreateBucket operations, and the default backingstore is in a TemporaryError state with a "Topology was destroyed" error message.

https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9777/
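For anyone triaging the linked run, a quick way to check the backingstore state from the CLI. The namespace and the default backingstore name below are the usual OCS defaults and may differ on a given cluster.

```python
# Quick check of the default backingstore phase; resource/object names are the
# common OCS defaults (assumptions), adjust as needed for the cluster at hand.
import subprocess

phase = subprocess.run(
    ["oc", "-n", "openshift-storage", "get", "backingstore",
     "noobaa-default-backing-store", "-o", "jsonpath={.status.phase}"],
    capture_output=True, text=True, check=True,
).stdout
print(phase)  # "Ready" on a healthy cluster; the failed run reported TemporaryError
```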
*** Bug 1856459 has been marked as a duplicate of this bug. ***
After a short chat with Ben, I understand that some restarts were involved in this test, so we want to understand better what was really going on and whether noobaa-db-0 or its PV was down as part of this test, which may have caused some temporary topology issues for the DB. Ben will continue his investigation on Sunday and will update this issue. Waiting for his feedback.
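A small sketch of the check described above (did noobaa-db-0 restart, and is its PVC still bound?). The namespace and the PVC name pattern are assumptions based on the usual noobaa-db StatefulSet layout.

```python
# Check whether noobaa-db-0 restarted and whether its PVC is still Bound.
# Namespace and PVC name ("db-noobaa-db-0") are assumptions.
import subprocess

def oc_get(*args):
    # thin wrapper around `oc get` in the openshift-storage namespace
    return subprocess.run(
        ["oc", "-n", "openshift-storage", "get", *args],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

restarts = oc_get("pod", "noobaa-db-0",
                  "-o", "jsonpath={.status.containerStatuses[0].restartCount}")
pvc_phase = oc_get("pvc", "db-noobaa-db-0", "-o", "jsonpath={.status.phase}")
print(f"noobaa-db-0 restarts: {restarts}, PVC phase: {pvc_phase}")
```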
Reverting to ON_QA, since this might be caused by a restart performed in the specific test that led to the latest reproduction. Awaiting further testing by the QE team.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754
Reproduced on a vSphere 4.6 execution with the RC4 build from stage, here: https://ocs4-jenkins-csb-ocsqe.cloud.paas.psi.redhat.com/job/qe-deploy-ocs-cluster/52/
Possibly another occurrence: https://ocs4-jenkins-csb-ocsqe.cloud.paas.psi.redhat.com/job/qe-deploy-ocs-cluster/57/ If so, it happened twice in a row - once in the stage testing above and once with the RC5 internal build. @nbecker, can someone from the NooBaa team take a look?
New bug 1904965