Bug 2167333 - Possible data damage, reduced availability and degraded data redundancy on a longevity cluster after running the multi-clones RBD test case
Summary: Possible data damage, reduced availability and degraded data redundancy on a...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Neha Ojha
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-06 10:25 UTC by Yuli Persky
Modified: 2023-08-09 16:37 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-08 13:03:06 UTC
Embargoed:



Description Yuli Persky 2023-02-06 10:25:59 UTC

Comment 2 Yuli Persky 2023-02-06 10:26:26 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Possible data damage, reduced availability and degraded data redundancy on a longevity cluster after running the multi-clones RBD test case.

The test calculated the size of the storage (without taking the already used capacity into consideration) and was supposed to create 512 clones of a PVC with data, consuming up to 70% of the total storage capacity.
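
For illustration, a minimal sketch of the sizing logic described above (not the actual ocs-ci code; the function and variable names are assumptions, only the 512-clone count and the 70% target come from the test description):

    # Hypothetical sketch of the clone sizing the test used: it budgets
    # against TOTAL capacity and ignores what is already used.
    CLONES = 512            # number of clones the test creates
    CAPACITY_TARGET = 0.70  # fill up to 70% of capacity

    def clone_size_gib(total_capacity_gib: float) -> float:
        """Per-clone size when budgeting against total capacity.

        On a long-running (longevity) cluster with significant used space,
        this can plan more new data than is actually free.
        """
        return (total_capacity_gib * CAPACITY_TARGET) / CLONES

    # With the figures from the "ceph status" output below
    # (2.0 TiB total, 1.3 TiB used, 708 GiB free):
    total_gib, used_gib = 2.0 * 1024, 1.3 * 1024
    planned_gib = CLONES * clone_size_gib(total_gib)  # ~1434 GiB of new data
    print(planned_gib, "GiB planned vs", total_gib - used_gib, "GiB actually free")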

Ceph Status: 

(yulienv) [ypersky@ypersky ocs-ci]$ oc rsh rook-ceph-tools-6c57b89b-wnz7r
sh-4.4$ ceph status
  cluster:
    id:     825763bb-b9d4-43df-b96a-3d14c4be0fd2
    health: HEALTH_ERR
            2/172180 objects unfound (0.001%)
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 289 pgs inactive
            Possible data damage: 2 pgs recovery_unfound
            Degraded data redundancy: 344362/516540 objects degraded (66.667%), 147 pgs degraded, 289 pgs undersized
            2 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,b,d (age 35h)
    mgr: a(active, since 3d)
    osd: 3 osds: 1 up (since 2d), 3 in (since 3M)
 
  data:
    pools:   10 pools, 289 pgs
    objects: 172.18k objects, 670 GiB
    usage:   1.3 TiB used, 708 GiB / 2.0 TiB avail
    pgs:     100.000% pgs not active
             344362/516540 objects degraded (66.667%)
             2/172180 objects unfound (0.001%)
             145 undersized+degraded+peered
             142 undersized+peered
             2   recovery_unfound+undersized+degraded+peered
 
sh-4.4$
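
For reference, the degraded-object numbers above are consistent with the default replica-3 pools (an assumption) spread across the 3 OSDs: 172,180 objects x 3 replicas = 516,540 copies, and with 2 of the 3 OSDs down roughly two thirds of those copies (344,362) are degraded - the 66.667% Ceph reports.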



Version of all relevant components (if applicable):

 OCP versions
        ==============

                clientVersion:
                  buildDate: "2023-01-31T16:04:37Z"
                  compiler: gc
                  gitCommit: b05f7d40f9a2dac30771be620e9e9148d26ffd07
                  gitTreeState: clean
                  gitVersion: 4.12.0-202301311516.p0.gb05f7d4.assembly.stream-b05f7d4
                  goVersion: go1.19.4
                  major: ""
                  minor: ""
                  platform: linux/amd64
                kustomizeVersion: v4.5.7
                openshiftVersion: 4.11.7
                releaseClientVersion: 4.12.0-0.nightly-2023-02-02-180827
                serverVersion:
                  buildDate: "2022-09-13T15:03:52Z"
                  compiler: gc
                  gitCommit: 0a57f1f59bda75ea2cf13d9f3b4ac5d202134f2d
                  gitTreeState: clean
                  gitVersion: v1.24.0+3882f8f
                  goVersion: go1.18.4
                  major: "1"
                  minor: "24"
                  platform: linux/amd64
                
                
                Cluster version:

                NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
                version   4.11.7    True        False         111d    Error while reconciling 4.11.7: the cluster operator monitoring has not yet successfully rolled out
                
        OCS versions
        ==============
 NAME                              DISPLAY                       VERSION   REPLACES                          PHASE
                mcg-operator.v4.11.4              NooBaa Operator               4.11.4    mcg-operator.v4.11.3              Succeeded
                ocs-operator.v4.11.4              OpenShift Container Storage   4.11.4    ocs-operator.v4.11.3              Succeeded
                odf-csi-addons-operator.v4.11.4   CSI Addons                    4.11.4    odf-csi-addons-operator.v4.11.3   Succeeded
                odf-operator.v4.11.4              OpenShift Data Foundation     4.11.4    odf-operator.v4.11.3              Succeeded
                
                ODF (OCS) build :                     full_version: 4.11.4-4
                
        Rook versions
        ===============

                rook: v4.11.4-0.96e324244ec878d70194179a2892ec7193f6b591
                go: go1.17.12
                
        Ceph versions
        ===============

                ceph version 16.2.8-84.el8cp (c2980f2fd700e979d41b4bad2939bb90f0fe435c) pacific (stable)
                


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Yes - it is not possible to work with the cluster.


Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

4


Is this issue reproducible?


Can this issue be reproduced from the UI?
Not relevant 


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Run the CSI tests of the performance suite on the longevity cluster (tests/e2e/performance/csi_tests/)
2. Run test_pvc_multi_clones_test (RBD)


Actual results:

(yulienv) [ypersky@ypersky ocs-ci]$ oc get pods
NAME                                                              READY   STATUS             RESTARTS           AGE
csi-addons-controller-manager-649c659fdb-bf7td                    2/2     Running            0                  60d
csi-cephfsplugin-2bcpj                                            2/2     Running            0                  60d
csi-cephfsplugin-bk4fc                                            2/2     Running            0                  60d
csi-cephfsplugin-cv7gz                                            2/2     Running            0                  60d
csi-cephfsplugin-nqwdv                                            2/2     Running            0                  60d
csi-cephfsplugin-provisioner-5655f4c8c-svrx6                      5/5     Running            0                  60d
csi-cephfsplugin-provisioner-5655f4c8c-vk8mc                      5/5     Running            0                  60d
csi-cephfsplugin-rsh98                                            2/2     Running            0                  60d
csi-cephfsplugin-sqltv                                            2/2     Running            0                  60d
csi-rbdplugin-7pft9                                               3/3     Running            0                  60d
csi-rbdplugin-dt8bz                                               3/3     Running            0                  60d
csi-rbdplugin-lmd9g                                               3/3     Running            0                  60d
csi-rbdplugin-pg6fc                                               3/3     Running            0                  60d
csi-rbdplugin-provisioner-85bd975ccb-cp4mb                        6/6     Running            0                  60d
csi-rbdplugin-provisioner-85bd975ccb-fc5tk                        6/6     Running            0                  60d
csi-rbdplugin-tgxpr                                               3/3     Running            0                  60d
csi-rbdplugin-wc2p5                                               3/3     Running            0                  60d
noobaa-core-0                                                     1/1     Running            0                  3d5h
noobaa-db-pg-0                                                    0/1     Init:0/2           0                  3d5h
noobaa-endpoint-675bbf58cc-rnz9l                                  1/1     Running            0                  3d5h
noobaa-operator-6b65db7dff-wn6g8                                  1/1     Running            0                  60d
ocs-metrics-exporter-5dffd4f586-p2dsl                             1/1     Running            0                  60d
ocs-operator-7c9544bdb7-5zkv5                                     1/1     Running            0                  60d
odf-console-5b49c87b6-hc84d                                       1/1     Running            0                  60d
odf-operator-controller-manager-9cc79c599-xp2c2                   2/2     Running            0                  60d
rook-ceph-crashcollector-compute-0-645bb8f676-lchdd               1/1     Running            0                  40h
rook-ceph-crashcollector-compute-1-d84f4b9bc-6lj88                1/1     Running            0                  3d8h
rook-ceph-crashcollector-compute-2-8658488d94-wq4vb               1/1     Running            0                  36h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-585575c594mkp   1/2     CrashLoopBackOff   487 (117s ago)     39h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-56877876kqq74   1/2     CrashLoopBackOff   1063 (3m41s ago)   3d8h
rook-ceph-mgr-a-55fdcbcb6c-rqp5m                                  2/2     Running            34 (3d3h ago)      3d5h
rook-ceph-mon-a-6546d5f4d6-hrhsz                                  2/2     Running            40 (41h ago)       3d8h
rook-ceph-mon-b-7cb7874d-6d5t2                                    2/2     Running            0                  40h
rook-ceph-mon-d-6487754c78-sm4sw                                  2/2     Running            0                  35h
rook-ceph-operator-749b7f5645-qktvf                               1/1     Running            0                  60d
rook-ceph-osd-0-9fc7b7645-tbwxx                                   0/2     Init:0/9           0                  36h
rook-ceph-osd-1-85b88788c-7fwbr                                   2/2     Running            0                  3d8h
rook-ceph-osd-2-7486ccb6f8-559np                                  0/2     Init:0/9           0                  40h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-b96bd5dhdzdk   1/2     Running            817 (3m3s ago)     3d5h
rook-ceph-tools-6c57b89b-wnz7r                                    1/1     Running            0                  60d
(yulienv) [ypersky@ypersky ocs-ci]$ 


Expected results:

Ceph health should be OK; no data damage or data redundancy problems should occur.

Additional info:

This is a longevity cluster, which is probably related to the problem.
When the test_pvc_multi_clone_performance test is run separately, it does not cause the above issue on other environments.

The test during which the Ceph health became HEALTH_ERR is test_pvc_multi_clone_performance.py; logs: a/ocsci-jenkins/openshift-clusters/tdesala-long/tdesala-long_20221018T075322/logs/ocs-ci-logs-1675340061/by_outcome/failed/tests/e2e/performance/csi_tests/test_pvc_multi_clone_performance.py/TestPvcMultiClonePerformance/test_pvc_multiple_clone_performance-RBD

Test cases that failed prior to the multi-clones execution are:

drwxrwxr-x 3 zack zack 1 Feb  3 00:04 test_pvc_clone_performance.py
drwxrwxr-x 3 zack zack 1 Feb  3 09:57 test_pvc_creation_deletion_performance.py
drwxrwxr-x 5 zack zack 3 Feb  4 12:35 .
drwxrwxr-x 3 zack zack 1 Feb  4 12:35 test_pvc_multi_clone_performance.py

All the CSI tests that ran on this cluster in this execution (the timestamp order is the test run order):
 
drwxrwxr-x  3 zack zack  1 Feb  2 07:14 test_bulk_pod_attachtime_performance.py
drwxrwxr-x  3 zack zack  1 Feb  2 08:01 test_pod_attachtime.py
drwxrwxr-x  3 zack zack  1 Feb  2 08:15 test_pod_reattachtime.py
drwxrwxr-x  3 zack zack  1 Feb  2 13:49 test_pvc_bulk_clone_performance.py
drwxrwxr-x  3 zack zack  1 Feb  2 14:27 test_pvc_bulk_creation_deletion_performance.py
drwxrwxr-x  3 zack zack  1 Feb  2 15:44 test_pvc_clone_performance.py
drwxrwxr-x  3 zack zack  1 Feb  3 09:46 test_pvc_creation_deletion_performance.py
drwxrwxr-x  3 zack zack  1 Feb  4 12:25 test_pvc_multi_clone_performance.py
drwxrwxr-x  3 zack zack  1 Feb  4 14:49 test_pvc_multi_snapshot_performance.py
drwxrwxr-x 12 zack zack 10 Feb  4 14:50 .
drwxrwxr-x  3 zack zack  1 Feb  4 14:50 test_pvc_snapshot_performance.py
drwxrwxr-x  5 zack zack  3 Feb  4 14:53 ..


The logs for this run are available on magna002, path: /a/ocsci-jenkins/openshift-clusters/tdesala-long/tdesala-long_20221018T075322/logs/ocs-ci-logs-1675340061/by_outcome/failed/tests/e2e/performance/csi_tests/test_pvc_multi_clone_performance.py/TestPvcMultiClonePerformance/test_pvc_multiple_clone_performance-RBD

Comment 3 Yuli Persky 2023-02-06 10:27:59 UTC
Please note that just must-gather logs are available.

Comment 5 Yuli Persky 2023-02-06 10:53:11 UTC
Correction to comment #3: must-gather logs are available here: rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2167333/

Comment 6 Yuli Persky 2023-02-06 13:21:18 UTC
The cluster had been alive for around 104 days until the latest runs of the performance tests started and caused the Ceph problems.

Comment 10 Mudit Agarwal 2023-03-16 16:24:59 UTC
Yuli, is this reproducible? It looks like it's a configuration issue; please check Vikhyat's comment.

Comment 12 Yuli Persky 2023-03-23 11:51:27 UTC
@Mudit Agarwal - "Yuli, is this reproducible?" - Yes! I've run the performance tests on various 4.11 and 4.12 builds, and in all of them we get the same results (the numbers, of course, are not identical, but the degree of the regression (degradation) is the same).

Comment 13 Yuli Persky 2023-03-23 11:55:47 UTC
@Everyone, please disregard comment #12 - I was referring to another issue. I will provide my feedback on the issue described here in the next comment.

Comment 14 Yuli Persky 2023-03-24 20:43:10 UTC
@Vikhyat Umrao - Unfortunately we do not have OSD logs either, as the OSDs were not running when the logs were collected.

As for the reproduction - no, we did not have a chance to reproduce the issue, since the longevity cluster on which we saw the problem is a shared resource. After this crash it was decided to modify the multi-clones test and calculate the clone sizes so that the maximal number of clones will consume up to the free capacity of the cluster and not up to the total cluster capacity (which filled the OSDs).

However, this BZ was filed since we thought that after the clones are created and reach the 85% threshold, the cluster should go into a read-only state, not into crashes.
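
A minimal sketch of the revised sizing approach mentioned above (illustrative only; function names, variable names and the nearfull value are assumptions, not the actual ocs-ci change):

    # Hypothetical sketch: budget the clones against FREE capacity and stay
    # below Ceph's nearfull threshold instead of using total capacity.
    CLONES = 512
    CAPACITY_TARGET = 0.70
    NEARFULL_RATIO = 0.85   # assumption: default Ceph nearfull ratio

    def clone_size_gib(total_gib: float, used_gib: float) -> float:
        free_gib = total_gib - used_gib
        # Never plan to push usage past the nearfull threshold.
        headroom_gib = min(free_gib * CAPACITY_TARGET,
                           total_gib * NEARFULL_RATIO - used_gib)
        return max(headroom_gib, 0.0) / CLONES

    # With this cluster (2.0 TiB total, 1.3 TiB used) each clone is sized from
    # the ~700 GiB that is actually free (~0.8 GiB per clone), so the test no
    # longer drives the OSDs toward the full ratio.
    print(clone_size_gib(2.0 * 1024, 1.3 * 1024), "GiB per clone")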

Comment 15 Yuli Persky 2023-03-24 20:46:38 UTC
@Mudit Agarwal
I realize that the attached logs do not contain info regarding the crashes.
How would you like me to proceed?
Should I try to schedule time on the longevity cluster in order to try to reproduce?
Is there anything special we can configure (debug level?) prior to the execution, so that the next time a crash occurs it will be tracked?

Comment 16 Yuli Persky 2023-03-27 22:41:52 UTC
BTW, there is an open BZ for: Must-gather doesn't collect coredump logs crucial for OSD crash events
https://bugzilla.redhat.com/show_bug.cgi?id=2168849

I think that the reproduction trials should be postponed until after this BZ is fixed. Otherwise, even if we simulate a crash, it will not be collected in the logs.

Comment 18 Yuli Persky 2023-03-28 12:10:16 UTC
@Brad Hubbard - I'll try to reproduce this issue, and when I have it I'll share the setup with you to analyze.

Comment 20 Yuli Persky 2023-03-29 21:58:25 UTC
@Brad Hubbard - After https://bugzilla.redhat.com/show_bug.cgi?id=2168849 is fixed I'll try to reproduce the problem and will post the results and my analysis here.
I assumed that if a valid load causes the cluster to go to HEALTH_ERR with degraded data redundancy, reduced data availability and possible data damage, then there is a bug in Ceph.
Please let me know in case there are any specific logs you'd like me to check after/during the reproduction. 
Thank you!

