Created attachment 1667175 [details]
Segfault from pod log

Description of problem (please be detailed as possible and provide log snippets):
When the noobaa backing store fills to 100%, the automatic resharding process starts and tries to increase from old_num_shards: 1 to new_num_shards: 2. This causes the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-xyz pod to segfault (see attached log) and restart.

This error message is shown in the pod log output:

debug 2020-02-27 14:01:17.406 7fe100037700  0 RGWReshardLock::lock failed to acquire lock on noobaa-backing-store-49f85f8e-31fa-4823-ad3c-41938ac9a882:ff08d26e-7085-4cb6-85e8-d09de035a592.5628.1 ret=-16

The restarted rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-xyz pod then waits a few minutes, attempts the resharding again, and crashes again.

Version of all relevant components (if applicable):
OCS 4.2.2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
This is a test environment and we are evaluating OCS. The impact is that object storage is down for a short while each time the pod restarts (every few minutes), and noobaa is unavailable during that time.

Is there any workaround available to the best of your knowledge?
No, not without deleting data. Manual resharding doesn't work either.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Only by searching kibana logs

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Fill a bucket with 100,000+ objects (approx 1 month of data from openshift-metering); see the sketch after these steps.
2. Check the logs from the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-xyz pod; it will segfault.
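For reference, a minimal sketch of how step 1 can be driven against the RGW S3 endpoint. This is an illustration, not the exact method used in the report (the original data came from openshift-metering); the $ENDPOINT and $BUCKET values are placeholders that would come from the object store user's S3 credentials and route in the cluster:

# Assumes the aws CLI is configured with the bucket owner's S3 access/secret keys;
# ENDPOINT and BUCKET are hypothetical placeholders, not values from this report.
$ dd if=/dev/urandom of=payload bs=1K count=1
$ for i in $(seq 1 110000); do aws --endpoint-url "$ENDPOINT" s3 cp payload "s3://$BUCKET/obj-$i"; done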
Commands used for troubleshooting

Enter the rook-ceph-operator pod.

$ radosgw-admin bucket limit check --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
[
    {
        "user_id": "noobaa-ceph-objectstore-user",
        "buckets": [
            {
                "bucket": "noobaa-backing-store-49f85f8e-31fa-4823-ad3c-41938ac9a882",
                "tenant": "",
                "num_objects": 106177,
                "num_shards": 0,
                "objects_per_shard": 106177,
                "fill_status": "OVER 100.000000%"
            }
        ]
    },
    {
        "user_id": "ocs-storagecluster-cephobjectstoreuser",
        "buckets": []
    }
]

$ radosgw-admin reshard list --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
[
    {
        "time": "2020-03-02 06:49:15.521656Z",
        "tenant": "",
        "bucket_name": "noobaa-backing-store-49f85f8e-31fa-4823-ad3c-41938ac9a882",
        "bucket_id": "ff08d26e-7085-4cb6-85e8-d09de035a592.5628.1",
        "new_instance_id": "",
        "old_num_shards": 1,
        "new_num_shards": 2
    }
]

$ radosgw-admin reshard status --bucket noobaa-backing-store-49f85f8e-31fa-4823-ad3c-41938ac9a882 --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
[
    {
        "reshard_status": "in-progress",
        "new_bucket_instance_id": "ff08d26e-7085-4cb6-85e8-d09de035a592.2151282.1",
        "num_shards": 2
    }
]

$ radosgw-admin reshard process --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
ERROR: failed to process reshard logs, error=2020-03-02 06:50:12.162 7fa2ab71d8c0  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000014 ret=-16 (16) Device or resource busy

# or sometimes this happens:
$ radosgw-admin reshard process --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
Segmentation fault (core dumped)

Actual results:
The rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-xyz pod segfaults.

Expected results:
Resharding should work and increase the number of shards to 2.

Additional info:
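For completeness, a hedged example of how a shell in the rook-ceph-operator pod can be reached to run the radosgw-admin commands above; the deployment name is assumed to be the default used by OCS in the openshift-storage namespace:

# Deployment name assumed; adjust if the operator deployment is named differently.
$ oc -n openshift-storage rsh deploy/rook-ceph-operator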
Note: we are running OCP 4.3.1 on VMware UPI
Picking up Ceph patches is no longer possible for 4.3 ==> moving to 4.4
So it's not in OCS 4.4 - let's move it to OCS 4.5 or 4.6?
Correct. Should be ON_QA now?
(In reply to Yaniv Kaul from comment #23)
> Correct. Should be ON_QA now?

Right. Needs ACKs first. Since the fix is in RHCS 4.1, we could also verify in 4.4.1 or a later 4.4.z. Generally, for BZs on the ceph component of OCS, we should depend on a BZ on the RHCS product for tracking; otherwise the status will remain unclear.
Verified with OCS 4.5 build:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-10-150345   True        False         14h     Cluster version is 4.5.0-0.nightly-2020-08-10-150345

# oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.5.0-508.ci   OpenShift Container Storage   4.5.0-508.ci              Succeeded

# ceph version
ceph version 14.2.8-79.el7cp (2d4542a7b3632dd9a7b09b5700f711e8016a94fd) nautilus (stable)

# rook version
rook: v1.3.8
go: go1.13.8

Wrote 150K+ and 250K+ objects to a single bucket and ran the "radosgw-admin reshard" command to look for a segfault. No segfault was seen during this testing. Testing was done on both VMware and AWS clusters.
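As a reference for reproducing this verification, a hedged sketch of how the reshard can be exercised manually; the exact command invocations used by QE are not quoted above. It assumes the same keyring/conf paths as in the original description, and <bucket> is a placeholder for the bucket that was filled:

# Run from inside the rook-ceph-operator pod; <bucket> is illustrative.
# radosgw-admin reshard add/process/status are standard Ceph sub-commands.
# radosgw-admin reshard add --bucket <bucket> --num-shards 2 --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
# radosgw-admin reshard process --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
# radosgw-admin reshard status --bucket <bucket> --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config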
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754