Bug 1809545

Summary: [GSS] rook-ceph-rgw-ocs-storagecluster-cephobjectstore pod segfaults when resharding starts
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Göran Törnqvist <goran.tornqvist>
Component: ceph
Assignee: Matt Benjamin (redhat) <mbenjamin>
Status: CLOSED ERRATA
QA Contact: Tiffany Nguyen <tunguyen>
Severity: high
Docs Contact:
Priority: high
Version: 4.2
CC: assingh, bkunal, ebenahar, edonnell, hnallurv, jonte.regnell, jthottan, kramdoss, madam, mbenjamin, mkogan, ocs-bugs, owasserm, sostapov, tdesala
Target Milestone: ---
Keywords: AutomationBackLog
Target Release: OCS 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.RGW server no longer crashes or leaks memory
Previously, an incorrect code construction in the final "bucket link" step of RADOS Gateway (RGW) bucket creation led to undefined behavior in some instances: the RGW server could crash or occasionally leak memory. This bug has been fixed, and the RGW server now behaves as expected.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-15 10:16:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1841422, 1859307
Attachments:
Segfault from pod log (flags: none)

Description Göran Törnqvist 2020-03-03 11:45:24 UTC
Created attachment 1667175 [details]
Segfault from pod log

Description of problem (please be as detailed as possible and provide log
snippets):

When the noobaa backing store fills to 100%, the automatic resharding process starts and tries to increase the bucket from old_num_shards: 1 to new_num_shards: 2.

This causes the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-xyz pod to segfault (see the attached log) and restart.
The following error message is shown in the pod log output:

debug 2020-02-27 14:01:17.406 7fe100037700  0 RGWReshardLock::lock failed to acquire lock on noobaa-backing-store-49f85f8e-31fa-4823-ad3c-41938ac9a882:ff08d26e-7085-4cb6-85e8-d09de035a592.5628.1 ret=-16

The restarted rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-xyz pod then waits a few minutes, tries the resharding again, and crashes again.
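
For reference, the crash loop and the log of the crashed container can be inspected with standard oc commands (a minimal sketch, assuming the default openshift-storage namespace; the pod name suffix varies per environment):

# list the RGW pod and its restart count
$ oc -n openshift-storage get pods | grep rgw

# show the log of the previous (crashed) container instance, which contains the segfault output
$ oc -n openshift-storage logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-xyz --previous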


Version of all relevant components (if applicable):
OCS 4.2.2

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
This is a test environment and we are evaluating OCS.
The impact is that object storage is down for a short time while the pod restarts every few minutes, and NooBaa is unavailable during that time.

Is there any workaround available to the best of your knowledge?
No, not without deleting data. Manual resharding doesn't work either.
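
For context, a manual resharding attempt of the kind mentioned above would typically look like the following when run from the rook-ceph-operator pod (a sketch only; the exact invocation used in this environment is not shown in the report):

$ radosgw-admin bucket reshard --bucket=noobaa-backing-store-49f85f8e-31fa-4823-ad3c-41938ac9a882 --num-shards=2 --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config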

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 3


Is this issue reproducible? Yes


Can this issue be reproduced from the UI? Only by searching the Kibana logs


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Fill a bucket with 100,000+ objects (approx. 1 month of data from openshift-metering); see the load-generation sketch after the troubleshooting commands below.
2. Check the logs from the rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-xyz pod; it will segfault.
3. Commands used for troubleshooting:

Enter the rook-ceph-operator pod

$ radosgw-admin bucket limit check --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
[
    {
        "user_id": "noobaa-ceph-objectstore-user",
        "buckets": [
            {
                "bucket": "noobaa-backing-store-49f85f8e-31fa-4823-ad3c-41938ac9a882",
                "tenant": "",
                "num_objects": 106177,
                "num_shards": 0,
                "objects_per_shard": 106177,
                "fill_status": "OVER 100.000000%"
            }
        ]
    },
    {
        "user_id": "ocs-storagecluster-cephobjectstoreuser",
        "buckets": []
    }
]

$ radosgw-admin reshard list --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
[
    {
        "time": "2020-03-02 06:49:15.521656Z",
        "tenant": "",
        "bucket_name": "noobaa-backing-store-49f85f8e-31fa-4823-ad3c-41938ac9a882",
        "bucket_id": "ff08d26e-7085-4cb6-85e8-d09de035a592.5628.1",
        "new_instance_id": "",
        "old_num_shards": 1,
        "new_num_shards": 2
    }
]

$ radosgw-admin reshard status --bucket noobaa-backing-store-49f85f8e-31fa-4823-ad3c-41938ac9a882 --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
[
    {
        "reshard_status": "in-progress",
        "new_bucket_instance_id": "ff08d26e-7085-4cb6-85e8-d09de035a592.2151282.1",
        "num_shards": 2
    }
]

$ radosgw-admin reshard process --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
ERROR: failed to process reshard logs, error=2020-03-02 06:50:12.162 7fa2ab71d8c0  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000014 ret=-16
(16) Device or resource busy

# or sometimes this happens:

$ radosgw-admin reshard process --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config
Segmentation fault (core dumped)
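
As referenced in step 1 of the reproduction steps, one way to fill a bucket with enough objects to trigger dynamic resharding is a simple loop against the RGW S3 endpoint (a sketch only; <rgw-endpoint>, the credentials, and the bucket name are placeholders, and the data in this report actually came from openshift-metering):

$ for i in $(seq 1 110000); do
    aws s3api put-object --endpoint-url http://<rgw-endpoint> \
        --bucket <bucket-name> --key "load-test/object-$i" --body /etc/hostname
  done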



Actual results:
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-xyz pod segfaults

Expected results:
Resharding should work and increase shards to 2.
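
Once resharding completes, one way to confirm the new layout is to rerun the limit check used above and verify that the affected bucket reports "num_shards": 2 and a fill_status below 100% (illustrative expectation, not output from this environment):

$ radosgw-admin bucket limit check --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --conf=/var/lib/rook/openshift-storage/openshift-storage.config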

Additional info:

Comment 2 Göran Törnqvist 2020-03-03 11:48:05 UTC
Note: we are running OCP 4.3.1 on VMware UPI

Comment 3 Michael Adam 2020-03-05 15:52:23 UTC
picking up ceph patches is not possible for 4.3 any more ==> moving to 4.4

Comment 21 Yaniv Kaul 2020-05-17 08:14:53 UTC
So it's not in OCS 4.4 - let's move it to OCS 4.5 or 4.6?

Comment 23 Yaniv Kaul 2020-06-24 15:26:07 UTC
Correct. Should be ON_QA now?

Comment 24 Michael Adam 2020-06-25 10:12:35 UTC
(In reply to Yaniv Kaul from comment #23)
> Correct. Should be ON_QA now?

Right. Needs ACKs first. 

Since the fix is in RHCS 4.1, we could also verify in 4.4.1 or a later 4.4.z.

Generally, for BZs on the ceph component of OCS, we should depend on a BZ on the RHCS product for tracking.
Otherwise, the status will remain unclear.

Comment 29 Tiffany Nguyen 2020-08-12 01:24:31 UTC
Verified with OCS 4.5 build:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-10-150345   True        False         14h     Cluster version is 4.5.0-0.nightly-2020-08-10-150345

# oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.5.0-508.ci   OpenShift Container Storage   4.5.0-508.ci              Succeeded

# ceph version
ceph version 14.2.8-79.el7cp (2d4542a7b3632dd9a7b09b5700f711e8016a94fd) nautilus (stable)

# rook version
rook: v1.3.8
go: go1.13.8

Wrote 150K+ and 250K+ objects to a single bucket and ran the "radosgw-admin reshard" command to look for the segfault.
No segfault was seen during this testing. Testing was done on both VMware and AWS clusters.
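
For reproducibility, the verification above corresponds roughly to rerunning the commands from the original report inside the operator/toolbox pod after loading the bucket (a sketch; the exact invocations and any --keyring/--conf arguments were not recorded in this comment):

$ radosgw-admin bucket limit check
$ radosgw-admin reshard list
$ radosgw-admin reshard process

With the fix, "reshard process" completes without a segmentation fault and the bucket's shard count increases as scheduled.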

Comment 31 errata-xmlrpc 2020-09-15 10:16:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754