Bug 1929188

Summary: IBMZ: rook-ceph-mon and rook-ceph-mds-ocs-storagecluster-cephfilesystem pods restart several times during ocs-ci tier 4b test execution
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Sravika <sbalusu>
Component: rook
Assignee: Blaine Gardner <brgardne>
Status: CLOSED DUPLICATE
QA Contact: Elad <ebenahar>
Severity: high
Priority: unspecified
Version: 4.6
CC: aaaggarw, akandath, brgardne, ekuric, madam, muagarwa, ocs-bugs, ratamir, rcyriac, shan, svenkat, tnielsen, tunguyen
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2021-05-25 05:29:11 UTC
Type: Bug
Attachments: must-gather

Description Sravika 2021-02-16 11:57:01 UTC
Description of problem (please be as detailed as possible and provide log snippets):

During ocs-ci tier4b tests, the "rook-ceph-mon" and "rook-ceph-mds-ocs-storagecluster-cephfilesystem" pods restarted 12 and 6 times respectively during the tests listed below (based on the last restart time); a sketch for checking the restart reasons follows the test list. The OSDs also OOMed and restarted, but there is already an open issue for the OSDs (https://bugzilla.redhat.com/show_bug.cgi?id=1917815).

tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephBlockPool-mgr] 

tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephBlockPool-mon] 

tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephBlockPool-osd] 
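
For reference, a minimal sketch (not part of the original report) of how the restart counts and last termination reasons can be inspected; the label values are the standard Rook ones and are assumed here:

$ oc -n openshift-storage get pods -l 'app in (rook-ceph-mon,rook-ceph-mds)'
$ oc -n openshift-storage describe pod rook-ceph-mon-a-cb564ff4-cnvwq | grep -A 10 'Last State'

The first command lists restart counts for the mon and mds pods; the second shows why the previous container instance of a given pod terminated (substitute the actual restarting pod name).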




Version of all relevant components (if applicable):
OCS 4.6.2
ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCP 4.17 and OCS 4.6.2 (4.6.2-233.ci) with 4 workers
2. The pod resource limits are as follows (a sketch for confirming these follows the steps):
     "rook-ceph-mon"
      Limits:
      cpu:     1
      memory:  2Gi
     "rook-ceph-mds-ocs-storagecluster-cephfilesystem"
      Limits:
      cpu:     3
      memory:  8Gi
3. Run the ocs-ci tier4b tests as follows:

run-ci -m 'tier4b' --ocsci-conf config.yaml --cluster-path /root/ocp4-workdir 
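
As a side note, a hedged sketch of how the limits from step 2 can be confirmed on a running cluster; the deployment names are assumed from the standard Rook/OCS naming and from the pod listing below:

$ oc -n openshift-storage get deployment rook-ceph-mon-a -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
$ oc -n openshift-storage get deployment rook-ceph-mds-ocs-storagecluster-cephfilesystem-a -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'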

Actual results:
Pods restart during tier4b tests

# oc get po
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-k4fbs                                            3/3     Running     0          18h
csi-cephfsplugin-khg7g                                            3/3     Running     0          18h
csi-cephfsplugin-mfmqn                                            3/3     Running     0          5h57m
csi-cephfsplugin-provisioner-d8ccd695d-n9nl2                      6/6     Running     0          5h57m
csi-cephfsplugin-provisioner-d8ccd695d-sxvqv                      6/6     Running     0          5h57m
csi-cephfsplugin-qlmd9                                            3/3     Running     0          18h
csi-rbdplugin-89kf6                                               3/3     Running     0          18h
csi-rbdplugin-9kfs4                                               3/3     Running     0          5h57m
csi-rbdplugin-9vdxj                                               3/3     Running     0          18h
csi-rbdplugin-pm9gh                                               3/3     Running     0          5h57m
csi-rbdplugin-provisioner-76988fbc89-bbwbm                        6/6     Running     0          5h57m
csi-rbdplugin-provisioner-76988fbc89-z2vmf                        6/6     Running     0          5h57m
noobaa-core-0                                                     1/1     Running     0          18h
noobaa-db-0                                                       1/1     Running     0          18h
noobaa-endpoint-554fc74b95-4mvw2                                  1/1     Running     0          18h
noobaa-operator-55fc95dc4c-468gd                                  1/1     Running     0          18h
ocs-metrics-exporter-c5655b599-tk66m                              1/1     Running     0          18h
ocs-operator-c946699b4-d4jwh                                      1/1     Running     0          18h
rook-ceph-crashcollector-worker-0.m1312001ocs.lnxne.boe-7c9qkjg   1/1     Running     0          18h
rook-ceph-crashcollector-worker-1.m1312001ocs.lnxne.boe-69v8tbs   1/1     Running     0          18h
rook-ceph-crashcollector-worker-2.m1312001ocs.lnxne.boe-57vqmr2   1/1     Running     0          18h
rook-ceph-crashcollector-worker-3.m1312001ocs.lnxne.boe-f4p2x42   1/1     Running     0          18h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-b597c6-9j2h8    1/1     Running     6          18h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-65dbb657xbppv   1/1     Running     0          18h
rook-ceph-mgr-a-c84478cb7-hg9cx                                   1/1     Running     0          5h57m
rook-ceph-mon-a-cb564ff4-cnvwq                                    1/1     Running     12         18h
rook-ceph-mon-b-9b4b6965b-mgx4f                                   1/1     Running     0          18h
rook-ceph-mon-c-65b9ccc6bc-vb4gs                                  1/1     Running     0          18h
rook-ceph-operator-6c97bf77-k5kb6                                 1/1     Running     0          18h
rook-ceph-osd-0-6cbbcc64c4-685h5                                  1/1     Running     0          5h57m
rook-ceph-osd-1-679685dd65-nhtss                                  1/1     Running     1          18h
rook-ceph-osd-2-6fbbf49c44-6cgpv                                  1/1     Running     2          18h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0-8sxjw-9mz6n          0/1     Completed   0          18h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-svz48-8bhnl          0/1     Completed   0          18h
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-h7g2l-9qzzf          0/1     Completed   0          18h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7c755bc4qz7n   1/1     Running     0          18h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-57c7b86bmr77   1/1     Running     0          18h
rook-ceph-tools-6fdd868f75-fjssb                                  1/1     Running     0          17h
worker-0m1312001ocslnxneboe-debug                                 0/1     Completed   0          5h30m
worker-1m1312001ocslnxneboe-debug                                 0/1     Completed   0          5h30m
worker-2m1312001ocslnxneboe-debug                                 0/1     Completed   0          5h30m
worker-3m1312001ocslnxneboe-debug                                 0/1     Completed   0          5h30m

Expected results:
Pods should not restart during the tests

Additional info:

Uploading the must-gather logs and the tier4b test execution logs to the Google Drive link below:

https://drive.google.com/file/d/1fEKnYUtX00nh-aR9JsAKlkQJQ_r3k3tP/view?usp=sharing

Comment 2 Blaine Gardner 2021-03-01 16:53:24 UTC
Is this still reproducible with the latest builds that have the tcmalloc fixes for IBM-Z?

Comment 3 Tiffany Nguyen 2021-03-10 06:21:27 UTC
Running tier4b test cases on an AWS cluster, I'm seeing "rook-ceph-mgr" and "rook-ceph-mon" restarted 6 times each. Also, "rook-ceph-mds" restarted 1 time.

$ oc get csv -n openshift-storage  
NAME                         DISPLAY                       VERSION        REPLACES   PHASE 
ocs-operator.v4.7.0-284.ci   OpenShift Container Storage   4.7.0-284.ci              Succeeded 

$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-06-183610   True        False         11h     Cluster version is 4.7.0-0.nightly-2021-03-06-183610


$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-92tmw                                            3/3     Running     0          9h
csi-cephfsplugin-ll28j                                            3/3     Running     0          9h
csi-cephfsplugin-pfx8x                                            3/3     Running     0          9h
csi-cephfsplugin-provisioner-849d54494-lr6rc                      6/6     Running     0          9h
csi-cephfsplugin-provisioner-849d54494-smr5d                      6/6     Running     0          9h
csi-rbdplugin-28h99                                               3/3     Running     0          9h
csi-rbdplugin-b6k6t                                               3/3     Running     0          9h
csi-rbdplugin-nlx78                                               3/3     Running     0          9h
csi-rbdplugin-provisioner-86df955ff9-22rhd                        6/6     Running     0          9h
csi-rbdplugin-provisioner-86df955ff9-p87cp                        6/6     Running     0          9h
noobaa-core-0                                                     1/1     Running     0          9h
noobaa-db-pg-0                                                    1/1     Running     0          9h
noobaa-endpoint-549b9d76f8-mtlvg                                  1/1     Running     0          9h
noobaa-operator-694ffbfd7c-qdvnl                                  1/1     Running     0          9h
ocs-metrics-exporter-75464574c8-nprk7                             1/1     Running     0          9h
ocs-operator-9dcfb85fc-bgm4x                                      1/1     Running     0          9h
rook-ceph-crashcollector-ip-10-0-151-234-6b49486b8b-fqg4c         1/1     Running     0          9h
rook-ceph-crashcollector-ip-10-0-176-124-764cb795fc-rppxw         1/1     Running     0          9h
rook-ceph-crashcollector-ip-10-0-201-62-648bb6f64f-t9b6c          1/1     Running     0          9h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-65864c49hmw7j   2/2     Running     1          9h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6fb9655dwptql   2/2     Running     0          9h
rook-ceph-mgr-a-7b48546684-hdm8f                                  2/2     Running     6          9h
rook-ceph-mon-a-78779b7dcf-r2s7h                                  2/2     Running     6          9h
rook-ceph-mon-b-f45cd8b47-2dkc4                                   2/2     Running     0          9h
rook-ceph-mon-c-69c4c69685-j5dl8                                  2/2     Running     0          9h
rook-ceph-operator-56c845f4bb-ldk54                               1/1     Running     0          9h
rook-ceph-osd-0-769677ddf9-ktgn7                                  2/2     Running     6          9h
rook-ceph-osd-1-56655d86d7-c2gxn                                  2/2     Running     0          9h
rook-ceph-osd-2-7dc986c454-lv994                                  2/2     Running     0          9h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0twlw2-5qb6x           0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0j6fjh-fpr4w           0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-2-data-09vd99-6jx54           0/1     Completed   0          9h
rook-ceph-tools-69f66f5b4f-wxv89                                  1/1     Running     0          9h

Comment 4 Sravika 2021-04-07 09:34:24 UTC
@brgardne  Yes, this issue is still reproducible on latest 4.7 with the tcmalloc fix

Comment 5 Blaine Gardner 2021-04-07 17:48:47 UTC
Please attach OCS must-gather for the most recently failing tests. I cannot debug without that.

Comment 7 Sravika 2021-04-08 14:05:42 UTC
@brgardne: sure, @akandath is running the tier4b tests on IBM Z and will upload the OCS must-gather logs. Thank you.

Comment 8 Abdul Kandathil (IBM) 2021-04-08 15:45:54 UTC
Created attachment 1770300 [details]
must-gather

Attached the must-gather log after reproducing the issue using the tests mentioned in the description. Below is the current status of the OCS pods.

---
(.venv) [root@m1301015 ~]# oc -n openshift-storage get pod
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-2vftq                                            3/3     Running     0          62m
csi-cephfsplugin-ljhtn                                            3/3     Running     0          62m
csi-cephfsplugin-provisioner-6f5dd9fc87-fs6qf                     6/6     Running     0          62m
csi-cephfsplugin-provisioner-6f5dd9fc87-nckqs                     6/6     Running     0          62m
csi-cephfsplugin-vngfm                                            3/3     Running     0          62m
csi-rbdplugin-fjllb                                               3/3     Running     0          62m
csi-rbdplugin-nr6fp                                               3/3     Running     0          62m
csi-rbdplugin-nz2rn                                               3/3     Running     0          62m
csi-rbdplugin-provisioner-5555796984-58kj4                        6/6     Running     0          62m
csi-rbdplugin-provisioner-5555796984-q795c                        6/6     Running     0          62m
noobaa-core-0                                                     1/1     Running     0          61m
noobaa-db-pg-0                                                    1/1     Running     0          61m
noobaa-endpoint-865475b975-6gqp4                                  1/1     Running     0          59m
noobaa-operator-7d758949bc-8l5xf                                  1/1     Running     0          70m
ocs-metrics-exporter-79f7985f66-7s54h                             1/1     Running     0          70m
ocs-operator-7776799dbc-4pmcj                                     1/1     Running     0          70m
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-87ln5dk   1/1     Running     0          62m
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-76w26xk   1/1     Running     0          62m
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-c92bfjr   1/1     Running     0          61m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-74b678d9zs6gr   2/2     Running     0          60m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7cd89f6d765tt   2/2     Running     0          60m
rook-ceph-mgr-a-6f754b7646-5w9bg                                  2/2     Running     1          61m
rook-ceph-mon-a-5d74494c59-b5ldz                                  2/2     Running     1          62m
rook-ceph-mon-b-7569b969fc-2p6x7                                  2/2     Running     0          62m
rook-ceph-mon-c-5bd7b4d45f-hxqcr                                  2/2     Running     0          61m
rook-ceph-operator-7779c4f57b-t9297                               1/1     Running     0          70m
rook-ceph-osd-0-7f45df9c8-mjfz5                                   2/2     Running     1          61m
rook-ceph-osd-1-5554468bbf-jh5ts                                  2/2     Running     0          61m
rook-ceph-osd-2-7df6b7d5c7-8vx8w                                  2/2     Running     0          61m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0mdnxw-5qc6t           0/1     Completed   0          61m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0h7wqd-dlng7           0/1     Completed   0          61m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0hjv5j-g727j           0/1     Completed   0          61m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9f5f694lz9jl   2/2     Running     0          60m
rook-ceph-tools-599b8f4774-5lm9f                                  1/1     Running     0          66m
(.venv) [root@m1301015 ~]#

Comment 9 Sridhar Venkat (IBM) 2021-04-17 15:39:30 UTC
For system P, we are seeing this problem as well under this scenario:

1. When we perform independent FIO runs for CephFS (that is: create a CephFS PVC, attach it to a pod, and install and run fio inside the pod; see the sketch after this comment) simultaneously on all three worker/storage nodes.

Under these scenarios, we do not see this problem:
1. We are not seeing the ceph-mon pod restarting when the tier tests are run.
2. When the above-mentioned independent FIO runs are not started simultaneously. For example, run it on node 3 and, while it is running, after 10 minutes run it simultaneously on nodes 1 and 2.
3. Not able to reproduce it with RBD/Block storage independent FIO runs so far.
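
For illustration only, a rough sketch of the kind of independent fio run described in the scenario above, executed inside a pod with a CephFS-backed PVC; the mount path and job parameters are assumptions, not taken from the actual runs:

$ fio --name=cephfs-stress --directory=/mnt/cephfs --rw=randwrite --bs=4k --size=2G --numjobs=4 --runtime=600 --time_based --ioengine=libaio --direct=1 --group_reporting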

Comment 10 Sridhar Venkat (IBM) 2021-04-18 01:22:27 UTC
We made another attempt to run CephFS independent fio runs, and this time we could not reproduce the problem.

After I reported the problem in the System P environment above, we waited for the Ceph health to recover and did a block storage independent fio run, which completed successfully. Then we wanted to reproduce the problem with a CephFS independent fio run and collect must-gather logs to attach to this bug, but we could not reproduce the problem.

Comment 12 Mudit Agarwal 2021-04-19 13:33:39 UTC
Travis/Blaine, PTAL

Comment 13 Aaruni Aggarwal 2021-04-19 14:02:05 UTC
Ran CephFS independent fio run on IBM Power Systems and encountered the issue again. 

So attaching the must-gather logs:
https://drive.google.com/file/d/11BdjZrCtYJSV1ISr6XijE3dAWhnMlhJP/view?usp=sharing

Comment 14 Blaine Gardner 2021-04-19 16:11:08 UTC
I'll start looking through the must-gather with urgent priority. In the meantime, could I get SSH access to a test cluster showing this behavior so I can inspect things interactively?

Blaine

(Clearing needinfo from Travis but not myself)

Comment 15 Blaine Gardner 2021-04-19 17:19:16 UTC
The latest must-gather does not seem to show the issue being reproduced.

NAME                                                              READY   STATUS        RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
aaruni-demo-pod-rbd1                                              1/1     Running       0          3h1m    10.128.2.66     worker-0   <none>           <none>
aaruni-demo-pod-rbd2                                              1/1     Running       0          3h1m    10.131.0.140    worker-1   <none>           <none>
csi-cephfsplugin-466kg                                            3/3     Running       0          4h35m   192.168.0.189   worker-2   <none>           <none>
csi-cephfsplugin-6dd6t                                            3/3     Running       0          4h35m   192.168.0.230   worker-1   <none>           <none>
csi-cephfsplugin-bttx5                                            3/3     Running       0          4h35m   192.168.0.23    worker-0   <none>           <none>
csi-cephfsplugin-provisioner-f975d886c-cqj95                      6/6     Running       0          2m      10.131.0.158    worker-1   <none>           <none>
csi-cephfsplugin-provisioner-f975d886c-g2vx8                      6/6     Running       0          4h35m   10.128.2.23     worker-0   <none>           <none>
csi-rbdplugin-9jbpj                                               3/3     Running       0          4h35m   192.168.0.23    worker-0   <none>           <none>
csi-rbdplugin-fjvqp                                               3/3     Running       0          4h35m   192.168.0.230   worker-1   <none>           <none>
csi-rbdplugin-provisioner-6bbf798bfb-7hk85                        6/6     Running       0          4h35m   10.131.0.115    worker-1   <none>           <none>
csi-rbdplugin-provisioner-6bbf798bfb-cx5nc                        6/6     Running       0          4h35m   10.128.2.22     worker-0   <none>           <none>
csi-rbdplugin-r4qp2                                               3/3     Running       0          4h35m   192.168.0.189   worker-2   <none>           <none>
must-gather-xhdv5-helper                                          1/1     Running       0          104s    10.131.0.160    worker-1   <none>           <none>
noobaa-core-0                                                     1/1     Running       0          12s     10.129.3.89     worker-2   <none>           <none>
noobaa-db-pg-0                                                    0/1     Terminating   0          4h33m   10.129.3.80     worker-2   <none>           <none>
noobaa-endpoint-8f79bfbb5-g68h7                                   1/1     Running       0          2m      10.128.2.117    worker-0   <none>           <none>
noobaa-operator-56d4ffcbd8-xnpqn                                  1/1     Running       0          4h36m   10.131.0.113    worker-1   <none>           <none>
ocs-metrics-exporter-6c4d8ff5f-gtzq2                              1/1     Running       0          4h36m   10.128.2.21     worker-0   <none>           <none>
ocs-operator-69fd4cc975-pbbvh                                     1/1     Running       0          119s    10.128.2.118    worker-0   <none>           <none>
rook-ceph-crashcollector-worker-0-84849b9589-4c84j                1/1     Running       0          4h34m   10.128.2.27     worker-0   <none>           <none>
rook-ceph-crashcollector-worker-1-6d6b4fd6b8-9vvll                1/1     Running       0          4h35m   10.131.0.123    worker-1   <none>           <none>
rook-ceph-crashcollector-worker-2-7495d898b7-lnf68                1/1     Running       0          2m      10.129.3.86     worker-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-898984cctln75   2/2     Running       0          4h33m   10.128.2.29     worker-0   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7f6cf679tt96h   2/2     Running       0          4h33m   10.131.0.122    worker-1   <none>           <none>
rook-ceph-mgr-a-69f99584bb-mmssf                                  2/2     Running       0          4h33m   10.131.0.119    worker-1   <none>           <none>
rook-ceph-mon-a-787db7b988-nxlwp                                  2/2     Running       0          4h35m   10.131.0.117    worker-1   <none>           <none>
rook-ceph-mon-b-76887ccfd8-22zcm                                  2/2     Running       0          2m      10.129.3.88     worker-2   <none>           <none>
rook-ceph-mon-c-5c7d549f77-927hc                                  2/2     Running       0          4h34m   10.128.2.25     worker-0   <none>           <none>
rook-ceph-operator-64849fdfd6-kfb9j                               1/1     Running       0          2m      10.131.0.157    worker-1   <none>           <none>
rook-ceph-osd-0-974db7b55-lsmdh                                   2/2     Running       0          4h33m   10.131.0.121    worker-1   <none>           <none>
rook-ceph-osd-1-6c9649577f-svqvs                                  2/2     Running       0          4h33m   10.128.2.28     worker-0   <none>           <none>
rook-ceph-osd-2-66c57cc56d-gdrqh                                  2/2     Running       0          2m      10.129.3.87     worker-2   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-09lvrnthp   0/1     Completed     0          4h33m   10.128.2.26     worker-0   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-1-data-04svqgvw5   0/1     Completed     0          4h33m   10.131.0.120    worker-1   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9bb7fc9snd5g   2/2     Running       2          119s    10.128.2.120    worker-0   <none>           <none>
rook-ceph-tools-69c5449589-2kp85                                  1/1     Running       0          4h33m   192.168.0.23    worker-0   <none>           <none>
worker-0-debug                                                    1/1     Running       0          104s    192.168.0.23    worker-0   <none>           <none>
worker-1-debug                                                    1/1     Running       0          104s    192.168.0.230   worker-1   <none>           <none>


Is this the right must-gather @Aaruni?

Comment 16 Blaine Gardner 2021-04-19 17:54:08 UTC
*** Bug 1932478 has been marked as a duplicate of this bug. ***

Comment 17 Blaine Gardner 2021-04-19 18:02:21 UTC
After looking more closely at the most recent 2 must-gathers, I believe this and bug https://bugzilla.redhat.com/show_bug.cgi?id=1932478 are the same issue. This bug has much more detail, so I have closed the other as a duplicate of this one.

The cause of the pod restarts from the latest 2 must-gathers seems to be liveness probe failures. These liveness probe failures occur when a Ceph daemon does not bootstrap itself on startup before the liveness probe starts checking on its health.
 - by default, most Ceph daemon liveness probes start checking 10 seconds after the container is started
 - the exception is OSDs, whose probes start checking after 45 seconds by default

The commonality most striking to me between the two bugs is that they are both on IBM -- ROKS in 1932478 and IBM-Z here. Ultimately, I do not believe this is a Rook issue. I believe there may be an issue in Ceph where daemons are slow to bootstrap on IBM-Z platforms.

Rook v1.5 (OCS v4.7) introduced the ability to override the `livenessProbe.initialDelaySeconds`, which is a way to work around this issue in the short term. However, I do not believe the OCS GUI allows this to be configured. We may want to make some ocs-operator changes to work around this until the root cause can be determined.
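
As a rough illustration of that short-term workaround (a sketch only: the field path follows the upstream Rook v1.5 CephCluster CRD, the value is arbitrary, and the ocs-operator may reconcile a direct edit away):

$ oc -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster --type merge -p '{"spec":{"healthCheck":{"livenessProbe":{"mon":{"probe":{"initialDelaySeconds":60}}}}}}'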

Comment 18 Aaruni Aggarwal 2021-04-20 07:04:05 UTC
(In reply to brgardne from comment #15)
> The latest must-gather does not seem to show the issue being reproduced.
> 
> [pod listing quoted verbatim from comment 15 snipped]
> Is this the right must-gather @Aaruni?

Yes, @brgardne. These logs are the ones I collected when some of the pods restarted (the age of some pods is around 2m) because one of the worker nodes went into NotReady state while doing independent FIO runs for CephFS.

Comment 19 Abdul Kandathil (IBM) 2021-04-20 09:18:22 UTC
@brgardne, We cannot give access to our cluster, but we can have a call so that you can have a look. Will that work for you?

Comment 20 Blaine Gardner 2021-04-20 15:36:10 UTC
@aaaggarw I'm now more confused. Why is it a Rook bug that pods are restarting when a node goes into NotReady state?

Comment 21 Mudit Agarwal 2021-04-20 15:56:06 UTC
Is the worker node going down in all of the instances mentioned above?

We already have a BZ for that https://bugzilla.redhat.com/show_bug.cgi?id=1945016

Comment 22 Michael Adam 2021-04-20 16:49:16 UTC
(In reply to brgardne from comment #20)
> @aaaggarw I'm now more confused. Why is it a Rook bug that pods
> are restarting when a node goes into NotReady state?

Basically, you are saying, from rook's perspective, it seems to be working as designed.


So one would need to find out why the node goes into NotReady state.
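
A minimal sketch of how that investigation might start (commands assumed, not from this BZ; <worker-node> is a placeholder):

$ oc get nodes
$ oc describe node <worker-node> | grep -A 10 'Conditions:'
$ oc adm node-logs <worker-node> -u kubelet | tail -n 200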

Comment 23 Blaine Gardner 2021-04-20 17:38:02 UTC
Thank you Mudit. I think it is likely that the bug you linked (1945016) is the root cause behind this. I will also look through both BZs' logs to see if there are artifacts that correlate the two bugs.

Comment 24 Aaruni Aggarwal 2021-04-21 06:24:26 UTC
(In reply to Mudit Agarwal from comment #21)
> Is the worker node going down in all of the instances mentioned above?
> 
> We already have a BZ for that
> https://bugzilla.redhat.com/show_bug.cgi?id=1945016

For the Power platform, the worker node is going down in only one scenario, i.e. when we are running independent FIO runs for the Ceph filesystem.

Comment 25 Aaruni Aggarwal 2021-04-21 06:37:04 UTC
(In reply to Michael Adam from comment #22)
> (In reply to brgardne from comment #20)
> > @aaaggarw I'm now more confused. Why is it a Rook bug that pods
> > are restarting when a node goes into NotReady state?
> 
> Basically, you are saying, from rook's perspective, it seems to be working
> as designed.
> 
> 
> So one would need to find out why the node goes into NotReady state.

Michael, I am not sure what is happening. I created 3 pods (one for each worker node) and 3 CephFS PVCs. Then I ran fio commands simultaneously inside all 3 pods using oc rsh. It was working fine for the first 2 pods, but the 3rd one got stuck, and then I found that one of the worker nodes went into NotReady state.

Comment 26 Mudit Agarwal 2021-04-21 07:39:21 UTC
Hi Aaruni,

Thanks for the info, need some more help

Is this reproducible in 4.6 also, can you please try?

Also, if this is reproducible can we access the cluster

Comment 27 Aaruni Aggarwal 2021-04-21 07:57:47 UTC
(In reply to Mudit Agarwal from comment #26)
> Hi Aaruni,
> 
> Thanks for the info, need some more help
> 
> Is this reproducible in 4.6 also, can you please try?
> 
> Also, if this is reproducible can we access the cluster

Hi Mudit,

Will let you know once I create a 4.6 cluster and test it.

Comment 28 Aaruni Aggarwal 2021-04-21 13:36:45 UTC
(In reply to brgardne from comment #20)
> @aaaggarw I'm now more confused. Why is it a Rook bug that pods
> are restarting when a node goes into NotReady state?

Apologies brgardne for confusing you. My issue is related to the BZ that Mudit posted above - https://bugzilla.redhat.com/show_bug.cgi?id=1945016

Comment 29 Aaruni Aggarwal 2021-04-21 13:41:24 UTC
(In reply to Mudit Agarwal from comment #26)
> Hi Aaruni,
> 
> Thanks for the info, need some more help
> 
> Is this reproducible in 4.6 also, can you please try?
> 
> Also, if this is reproducible can we access the cluster

Mudit, I forgot this earlier. We can't run the same test on OCS 4.6 as we have the tcmalloc issue in OCS 4.6. If we do this (heavily loaded PVCs/pods), we may end up with crashed OSD pods. The tcmalloc issue was resolved in OCS 4.7.

Comment 32 Mudit Agarwal 2021-04-26 06:15:28 UTC
So, there are two things mentioned in this BZ:

1. worker node going down
2. rook pods getting restarted

If [2] is happening because of [1], then this can be a dup of BZ #1945016; else it has to be treated separately.

Also, I don't think the pods have restarted as many times as we saw in the tcmalloc issue, and Blaine can keep me honest here.
If that is the case, then this issue might not be that serious (or unexpected).

Comment 33 Aaruni Aggarwal 2021-04-26 06:29:21 UTC
For IBM Power, the issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1945016, and we are also not hitting it consistently on our platform. Not sure about IBM Z, as they opened this Bugzilla.

Comment 34 Blaine Gardner 2021-04-26 17:37:13 UTC
@svenkat I believe all signs point to this being the same issue on both Z and P systems, with the same symptoms as https://bugzilla.redhat.com/show_bug.cgi?id=1945016. IBM nodes in particular are reported to fall into NotReady state under load. In 1945016, they have asked the OCP team to take a look.

Are these "tier 4b" tests run on non-IBM systems? If yes, then we know it is an IBM-only issue. If no, then can we run the tests on an x86 cluster to see if it reproduces there as well and gather more data?

Comment 35 Mudit Agarwal 2021-05-09 12:28:17 UTC
Not being hit consistently, as mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c37 and https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c38

Moving it to 4.8

Comment 36 Travis Nielsen 2021-05-11 15:13:31 UTC
Based on recent discussion, should this be closed and instead opened as an issue in https://github.com/red-hat-storage/ocs-ci?

Comment 37 Sébastien Han 2021-05-24 16:00:51 UTC
Mudit, please see https://bugzilla.redhat.com/show_bug.cgi?id=1929188#c36

Comment 38 Mudit Agarwal 2021-05-25 05:29:11 UTC
This is a duplicate of BZ #1945016; we discussed opening a CI issue for the capacity BZ, not this one.

*** This bug has been marked as a duplicate of bug 1945016 ***

Comment 39 Red Hat Bugzilla 2023-09-15 01:01:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days