Bug 2182820

Summary: VMware - rook-ceph-osd-1 pod experiences multiple CrashLoopBackOff; rook-ceph-mds-pas-testing-cephfs pod fails to get an IP and a node
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Yuli Persky <ypersky>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED NOTABUG
QA Contact: Neha Berry <nberry>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.12
CC: jopinto, jpeimer, kramdoss, ocs-bugs, odf-bz-bot, rperiyas, srai
Target Milestone: ---
Keywords: Automation, Performance
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-05-30 15:20:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Yuli Persky 2023-03-29 18:51:11 UTC
Created attachment 1954509 [details]
full versions of all the components of the cluster

Description of problem (please be as detailed as possible and provide log
snippets):

During a run of the whole Performance test suite on VMware, the rook-ceph-osd-1 pod experiences multiple CrashLoopBackOff restarts, and the rook-ceph-mds-pas-testing-cephfs pod fails to get an IP and a node.

sh-4.4$ ceph status
  cluster:
    id:     74cb797c-bf88-4008-a3fb-7df929d6692a
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            1 osds down
            1 host (1 osds) down
            1 rack (1 osds) down
            Degraded data redundancy: 22238/66714 objects degraded (33.333%), 300 pgs degraded, 513 pgs undersized
 
  services:
    mon: 3 daemons, quorum a,b,c (age 3d)
    mgr: a(active, since 3d)
    mds: 2/2 daemons up, 1 hot standby
    osd: 3 osds: 2 up (since 3d), 3 in (since 7d)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 2/2 healthy
    pools:   14 pools, 513 pgs
    objects: 22.24k objects, 84 GiB
    usage:   171 GiB used, 3.8 TiB / 4 TiB avail
    pgs:     22238/66714 objects degraded (33.333%)
             300 active+undersized+degraded
             213 active+undersized
 
  io:
    client:   2.6 KiB/s rd, 174 KiB/s wr, 3 op/s rd, 3 op/s wr
 
sh-4.4$ 
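For context, the status above is typically collected from the rook-ceph-tools pod; a minimal sketch, assuming the default openshift-storage namespace and the standard app=rook-ceph-tools toolbox label:

    # Find the toolbox pod and run ceph status inside it
    TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
    oc -n openshift-storage exec "$TOOLS" -- ceph status

Note that the 33.333% degraded figure matches one of three OSDs being down under 3x replication: 22238 objects x 3 replicas = 66714 object copies, of which exactly one third lost a replica.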

Tests that ran before Ceph health became not OK:

tests.e2e.performance.csi_tests.test_bulk_pod_attachtime_performance (Passed)
tests.e2e.performance.csi_tests.test_pod_attachtime (Passed)
tests.e2e.performance.csi_tests.test_pod_reattachtime (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_bulk_creation_deletion_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_creation_deletion_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_multi_clone_performance (Passed)
tests.e2e.performance.csi_tests.test_pvc_multi_snapshot_performance (Failed - bug in the test)
tests.e2e.performance.csi_tests.test_pvc_snapshot_performance (Passed)

The next test, the FIO test, failed to run the FIO client.

All subsequent tests failed to run because they failed the Ceph health check (Ceph not OK).


Version of all relevant components (if applicable):

Driver versions
================

        OCP versions
        ==============

                clientVersion:
                  buildDate: "2023-03-18T02:15:12Z"
                  compiler: gc
                  gitCommit: eed143055ede731029931ad204b19cd2f565ef1a
                  gitTreeState: clean
                  gitVersion: 4.13.0-202303180002.p0.geed1430.assembly.stream-eed1430
                  goVersion: go1.19.6
                  major: ""
                  minor: ""
                  platform: linux/amd64
                kustomizeVersion: v4.5.7
                openshiftVersion: 4.12.7
                releaseClientVersion: 4.13.0-0.nightly-2023-03-19-052243
                serverVersion:
                  buildDate: "2023-02-21T11:03:12Z"
                  compiler: gc
                  gitCommit: b6d1f054747e9886f61dd85316deac3415e2726f
                  gitTreeState: clean
                  gitVersion: v1.25.4+18eadca
                  goVersion: go1.19.4
                  major: "1"
                  minor: "25"
                  platform: linux/amd64
                
                
                Cluster version:

                NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
                version   4.12.7    True        False         7d5h    Cluster version is 4.12.7
                
        OCS versions
        ==============
              NAME                              DISPLAY                       VERSION   REPLACES                          PHASE
                mcg-operator.v4.12.1              NooBaa Operator               4.12.1    mcg-operator.v4.12.0              Succeeded
                ocs-operator.v4.12.1              OpenShift Container Storage   4.12.1    ocs-operator.v4.12.0              Succeeded
                odf-csi-addons-operator.v4.12.1   CSI Addons                    4.12.1    odf-csi-addons-operator.v4.12.0   Succeeded
                odf-operator.v4.12.1              OpenShift Data Foundation     4.12.1    odf-operator.v4.12.0              Succeeded
                
                ODF (OCS) build :                     full_version: 4.12.1-19
                
        Rook versions
        ===============

                rook: v4.12.1-0.f4e99907f9b9f05a190303465f61d12d5d24cace
                go: go1.18.9
                
        Ceph versions
        ===============

                ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)
                
        RHCOS versions
        ================


                NAME              STATUS   ROLES                  AGE    VERSION           INTERNAL-IP    EXTERNAL-IP    OS-IMAGE                                                KERNEL-VERSION                 CONTAINER-RUNTIME
                compute-0         Ready    worker                 7d5h   v1.25.4+18eadca   10.1.160.219   10.1.160.219   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
                compute-1         Ready    worker                 7d5h   v1.25.4+18eadca   10.1.160.211   10.1.160.211   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
                compute-2         Ready    worker                 7d5h   v1.25.4+18eadca   10.1.160.209   10.1.160.209   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
                control-plane-0   Ready    control-plane,master   7d5h   v1.25.4+18eadca   10.1.160.222   10.1.160.222   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
                control-plane-1   Ready    control-plane,master   7d5h   v1.25.4+18eadca   10.1.160.217   10.1.160.217   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8
                control-plane-2   Ready    control-plane,master   7d5h   v1.25.4+18eadca   10.1.160.220   10.1.160.220   Red Hat Enterprise Linux CoreOS 412.86.202303011010-0 (Ootpa)   4.18.0-372.46.1.el8_6.x86_64   cri-o://1.25.2-10.rhaos4.12.git0a083f9.el8


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Yes, since the cluster is not usable.


Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3


Is this issue reproducible?

I am not sure. 

Can this issue be reproduced from the UI?

Not relevant.

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a 4.12 VMware cluster.
2. Run the Performance suite (marker: performance).
3. Check Ceph health and pod status: all pods should be up and running, and Ceph health should be OK (see the command sketch below).
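A minimal sketch of the checks in step 3, assuming the default openshift-storage namespace and the standard app=rook-ceph-tools toolbox label:

    # All pods in the storage namespace should be Running or Completed
    oc -n openshift-storage get pods -o wide

    # Ceph should report HEALTH_OK
    TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
    oc -n openshift-storage exec "$TOOLS" -- ceph health detail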

Actual results:
Ceph health is not OK after the run of the whole Performance test suite:
on VMware, the rook-ceph-osd-1 pod experiences multiple CrashLoopBackOff restarts, and the rook-ceph-mds-pas-testing-cephfs pod fails to get an IP and a node.


Expected results:

All pods should be up and running, and Ceph health should be OK.


Additional info:
Relevant Jenkins job : 

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/22470/

The tests that ran before Ceph health degraded, the failure of the FIO client, and the subsequent Ceph health check failures are the same as listed in the Description above.

A link to the must-gather logs will be posted in the next comment, along with a link to the ocs-operator logs.

Comment 2 Yuli Persky 2023-03-29 19:10:02 UTC
Please note that the must-gather logs and ocs-operator logs are located at:
rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182820/

Comment 3 Yuli Persky 2023-03-29 19:16:28 UTC
Please note the pod status:

(yulienv38) [ypersky@ypersky ocs-ci]$ oc get pods -o wide
NAME                                                              READY   STATUS                  RESTARTS         AGE    IP             NODE        NOMINATED NODE   READINESS GATES
csi-addons-controller-manager-c9c696595-q7tg4                     2/2     Running                 0                7d6h   10.131.0.14    compute-2   <none>           <none>
csi-cephfsplugin-4d4kr                                            2/2     Running                 0                7d6h   10.1.160.211   compute-1   <none>           <none>
csi-cephfsplugin-h2x5s                                            2/2     Running                 0                7d6h   10.1.160.219   compute-0   <none>           <none>
csi-cephfsplugin-provisioner-7874d85cb6-4csm2                     5/5     Running                 0                3d3h   10.131.1.248   compute-2   <none>           <none>
csi-cephfsplugin-provisioner-7874d85cb6-w5tht                     5/5     Running                 0                7d6h   10.129.2.26    compute-1   <none>           <none>
csi-cephfsplugin-rgd7h                                            2/2     Running                 0                7d6h   10.1.160.209   compute-2   <none>           <none>
csi-rbdplugin-8rbfk                                               3/3     Running                 0                7d6h   10.1.160.211   compute-1   <none>           <none>
csi-rbdplugin-g8s6g                                               3/3     Running                 0                7d6h   10.1.160.219   compute-0   <none>           <none>
csi-rbdplugin-klb9b                                               3/3     Running                 0                7d6h   10.1.160.209   compute-2   <none>           <none>
csi-rbdplugin-provisioner-58d9d7fd6c-dmfrp                        6/6     Running                 0                7d6h   10.131.0.16    compute-2   <none>           <none>
csi-rbdplugin-provisioner-58d9d7fd6c-vc7mc                        6/6     Running                 0                7d6h   10.129.2.25    compute-1   <none>           <none>
noobaa-core-0                                                     1/1     Running                 0                3d3h   10.131.1.250   compute-2   <none>           <none>
noobaa-db-pg-0                                                    1/1     Running                 0                7d6h   10.131.0.28    compute-2   <none>           <none>
noobaa-endpoint-6f767c5799-kkms6                                  1/1     Running                 0                3d3h   10.131.1.247   compute-2   <none>           <none>
noobaa-operator-5667fcb59c-rc924                                  1/1     Running                 1 (24h ago)      3d3h   10.129.3.60    compute-1   <none>           <none>
ocs-metrics-exporter-696b8bdc7c-kjpqx                             1/1     Running                 0                7d6h   10.131.0.12    compute-2   <none>           <none>
ocs-operator-768bd84df7-hckqt                                     1/1     Running                 0                7d6h   10.129.2.23    compute-1   <none>           <none>
odf-console-54978b6774-x9k2q                                      1/1     Running                 0                7d6h   10.129.2.22    compute-1   <none>           <none>
odf-operator-controller-manager-8d6549758-dbpzt                   2/2     Running                 0                7d6h   10.129.2.21    compute-1   <none>           <none>
rook-ceph-crashcollector-compute-0-b64cb7fc9-lbnpb                1/1     Running                 0                3d2h   10.128.3.117   compute-0   <none>           <none>
rook-ceph-crashcollector-compute-1-66c4d45546-nbgn4               1/1     Running                 0                7d6h   10.129.2.32    compute-1   <none>           <none>
rook-ceph-crashcollector-compute-2-68c465997-45zgm                1/1     Running                 0                7d6h   10.131.0.22    compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-79d8f4bb8sfbc   2/2     Running                 0                7d6h   10.131.0.25    compute-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-79577447vlcfw   2/2     Running                 0                3d3h   10.128.3.119   compute-0   <none>           <none>
rook-ceph-mds-pas-testing-cephfs-a-5cd65db484-xpbb4               2/2     Running                 0                7d2h   10.129.2.71    compute-1   <none>           <none>
rook-ceph-mds-pas-testing-cephfs-b-694cb8cc4f-9sj9v               0/2     Pending                 0                7d2h   <none>         <none>      <none>           <none>
rook-ceph-mgr-a-7474cd5645-d48l7                                  2/2     Running                 0                3d3h   10.131.1.243   compute-2   <none>           <none>
rook-ceph-mon-a-58fd5cf765-gm2kg                                  2/2     Running                 0                7d6h   10.131.0.18    compute-2   <none>           <none>
rook-ceph-mon-b-66677dfc69-dvxrp                                  2/2     Running                 0                7d6h   10.129.2.28    compute-1   <none>           <none>
rook-ceph-mon-c-747bf9df65-wkqwq                                  2/2     Running                 0                3d3h   10.128.3.118   compute-0   <none>           <none>
rook-ceph-operator-fbb9bcddb-ztq4w                                1/1     Running                 0                7d6h   10.129.2.24    compute-1   <none>           <none>
rook-ceph-osd-0-7f4987fc57-vkjkk                                  2/2     Running                 0                7d6h   10.131.0.21    compute-2   <none>           <none>
rook-ceph-osd-1-86895f9f75-tsw74                                  0/2     Init:CrashLoopBackOff   883 (112s ago)   3d3h   10.128.3.116   compute-0   <none>           <none>
rook-ceph-osd-2-58dbf9b77-kxnmm                                   2/2     Running                 0                7d6h   10.129.2.31    compute-1   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-1-data-0btgcv-4m55s           0/1     Completed               0                7d6h   10.129.2.30    compute-1   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-2-data-0v9wcv-ds8bp           0/1     Completed               0                7d6h   10.131.0.20    compute-2   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6f67fcb2gp2q   2/2     Running                 0                7d6h   10.131.0.26    compute-2   <none>           <none>
rook-ceph-tools-56c5c9c865-qwdnd                                  1/1     Running                 0                7d6h   10.131.0.27    compute-2   <none>           <none>
(yulienv38) [ypersky@ypersky ocs-ci]$
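For triage, a hedged way to dig into the two unhealthy pods above (pod names are taken from the listing; the Events section at the end of the oc describe output shows scheduling failures and init-container crash details):

    # Why does the Pending MDS pod have no IP and no node?
    oc -n openshift-storage describe pod rook-ceph-mds-pas-testing-cephfs-b-694cb8cc4f-9sj9v

    # What is the crashing OSD init container reporting?
    oc -n openshift-storage describe pod rook-ceph-osd-1-86895f9f75-tsw74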

Comment 5 Subham Rai 2023-03-30 09:31:50 UTC
Could you share a must-gather? It seems like the cluster is not running now.

Comment 6 Travis Nielsen 2023-04-04 20:48:44 UTC
In the must-gather at [1], the OSD pod is not starting because the disk is no longer found [2].

2023-03-29T15:23:57.218966410Z failed to read label for /var/lib/ceph/osd/ceph-1/block: (6) No such device or address
2023-03-29T15:23:57.219088647Z 2023-03-29T15:23:57.217+0000 7f4ce8d9b540 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (6) No such device or address

I suspect the VMware disk simply became unavailable, and the cluster is now in a degraded state.
The cluster is still keeping the data safe and would allow reads and writes. In a production cluster, the OSD would need to be replaced
to regain full data protection.
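For anyone triaging a similar case, a minimal sketch of how the missing backing device could be confirmed (compute-0 is the node hosting osd-1 per the pod listing in comment 3; ceph-osd-id is the standard Rook pod label):

    # Confirm which node hosts the failing OSD pod
    oc -n openshift-storage get pod -l ceph-osd-id=1 -o wide

    # From that node, check whether the backing block device is still visible
    oc debug node/compute-0 -- chroot /host lsblk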

1. Do you see any other side effects besides the OSD being down? 
2. Are you able to repro this? 

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2182820/
[2] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2182820/logs-20230329-182520/ocs-must-gather-us/quay-io-ocs-dev-ocs-must-gather-sha256-3d57a983f6c2b53ef2dd4ee3a5a4df7bc86daf53abb703e9a7872d07c00ed3c7/namespaces/openshift-storage/pods/rook-ceph-osd-1-86895f9f75-tsw74/activate/activate/logs/current.log

Comment 7 Yuli Persky 2023-04-10 14:14:34 UTC
@Subham Rai - for the must-gather logs, please refer to comment #2:
Please note that the must-gather logs and ocs-operator logs are located at:
rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/OCS/ocs-qe-bugs/bz-2182820/

Comment 8 Travis Nielsen 2023-04-11 15:14:43 UTC
Moving out of 4.13 while under investigation; it appears to be an environmental issue.

Comment 9 Yuli Persky 2023-04-11 22:43:14 UTC
@Travis Nielsen , 

Per your questions:

1. Do you see any other side effects besides the OSD being down?
 - I described all the behaviour I was able to analyze.
2. Are you able to repro this?
 - I tried to reproduce it on a VMware non-LSO cluster and did not see this behaviour.

Therefore, what I am going to do is recreate a VMware LSO 4.12 cluster on the same DC and run the Perf suite again.
I will update on the result - whether the issue was reproduced or not.
I'm keeping the "needinfo" on myself to update with the results of the reproduction attempt.

Comment 10 Yuli Persky 2023-05-29 20:51:34 UTC
Unfortunately, I was not able to reproduce this issue in subsequent runs.
I have therefore removed the needinfo, since it looks like I cannot contribute more.
However, if I encounter this behaviour in the future, I'll update here.

Comment 11 Travis Nielsen 2023-05-30 15:20:31 UTC
Please reopen if you can repro