Description of problem (please be as detailed as possible and provide log snippets):

During ocs-ci tests, the pods "rook-ceph-mon" and "rook-ceph-mds-ocs-storagecluster-cephfilesystem" restarted 12 and 6 times respectively during the tier4b tests listed below (based on the last restart time). The OSDs also OOMed and restarted, but there is already an open issue for the OSDs (https://bugzilla.redhat.com/show_bug.cgi?id=1917815).

tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephBlockPool-mgr]
tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephBlockPool-mon]
tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephBlockPool-osd]

Version of all relevant components (if applicable):
OCS 4.6.2
ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install OCP 4.17 and OCS 4.6.2 (4.6.2-233.ci) with 4 workers
2. The pod resources are as follows:
   "rook-ceph-mon"
     Limits:
       cpu:     1
       memory:  2Gi
   "rook-ceph-mds-ocs-storagecluster-cephfilesystem"
     Limits:
       cpu:     3
       memory:  8Gi
3. Run the ocs-ci tier4b tests as follows:
   run-ci -m 'tier4b' --ocsci-conf config.yaml --cluster-path /root/ocp4-workdir

Actual results:
Pods restart during the tier4b tests.

# oc get po
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-k4fbs                                            3/3     Running     0          18h
csi-cephfsplugin-khg7g                                            3/3     Running     0          18h
csi-cephfsplugin-mfmqn                                            3/3     Running     0          5h57m
csi-cephfsplugin-provisioner-d8ccd695d-n9nl2                      6/6     Running     0          5h57m
csi-cephfsplugin-provisioner-d8ccd695d-sxvqv                      6/6     Running     0          5h57m
csi-cephfsplugin-qlmd9                                            3/3     Running     0          18h
csi-rbdplugin-89kf6                                               3/3     Running     0          18h
csi-rbdplugin-9kfs4                                               3/3     Running     0          5h57m
csi-rbdplugin-9vdxj                                               3/3     Running     0          18h
csi-rbdplugin-pm9gh                                               3/3     Running     0          5h57m
csi-rbdplugin-provisioner-76988fbc89-bbwbm                        6/6     Running     0          5h57m
csi-rbdplugin-provisioner-76988fbc89-z2vmf                        6/6     Running     0          5h57m
noobaa-core-0                                                     1/1     Running     0          18h
noobaa-db-0                                                       1/1     Running     0          18h
noobaa-endpoint-554fc74b95-4mvw2                                  1/1     Running     0          18h
noobaa-operator-55fc95dc4c-468gd                                  1/1     Running     0          18h
ocs-metrics-exporter-c5655b599-tk66m                              1/1     Running     0          18h
ocs-operator-c946699b4-d4jwh                                      1/1     Running     0          18h
rook-ceph-crashcollector-worker-0.m1312001ocs.lnxne.boe-7c9qkjg   1/1     Running     0          18h
rook-ceph-crashcollector-worker-1.m1312001ocs.lnxne.boe-69v8tbs   1/1     Running     0          18h
rook-ceph-crashcollector-worker-2.m1312001ocs.lnxne.boe-57vqmr2   1/1     Running     0          18h
rook-ceph-crashcollector-worker-3.m1312001ocs.lnxne.boe-f4p2x42   1/1     Running     0          18h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-b597c6-9j2h8    1/1     Running     6          18h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-65dbb657xbppv   1/1     Running     0          18h
rook-ceph-mgr-a-c84478cb7-hg9cx                                   1/1     Running     0          5h57m
rook-ceph-mon-a-cb564ff4-cnvwq                                    1/1     Running     12         18h
rook-ceph-mon-b-9b4b6965b-mgx4f                                   1/1     Running     0          18h
rook-ceph-mon-c-65b9ccc6bc-vb4gs                                  1/1     Running     0          18h
rook-ceph-operator-6c97bf77-k5kb6                                 1/1     Running     0          18h
rook-ceph-osd-0-6cbbcc64c4-685h5                                  1/1     Running     0          5h57m
rook-ceph-osd-1-679685dd65-nhtss                                  1/1     Running     1          18h
rook-ceph-osd-2-6fbbf49c44-6cgpv                                  1/1     Running     2          18h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0-8sxjw-9mz6n          0/1     Completed   0          18h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-svz48-8bhnl          0/1     Completed   0          18h
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-h7g2l-9qzzf          0/1     Completed   0          18h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7c755bc4qz7n   1/1     Running     0          18h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-57c7b86bmr77   1/1     Running     0          18h
rook-ceph-tools-6fdd868f75-fjssb                                  1/1     Running     0          17h
worker-0m1312001ocslnxneboe-debug                                 0/1     Completed   0          5h30m
worker-1m1312001ocslnxneboe-debug                                 0/1     Completed   0          5h30m
worker-2m1312001ocslnxneboe-debug                                 0/1     Completed   0          5h30m
worker-3m1312001ocslnxneboe-debug                                 0/1     Completed   0          5h30m

Expected results:
Pods should not restart during the tests.

Additional info:
Uploading the must-gather logs and the tier4b test execution logs to the Google Drive link below:
https://drive.google.com/file/d/1fEKnYUtX00nh-aR9JsAKlkQJQ_r3k3tP/view?usp=sharing
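For anyone triaging similar restarts, the following generic oc commands (not part of the original report; the pod name is taken from the listing above and the container name "mon" is assumed to be Rook's default) can help distinguish OOM kills from liveness-probe kills:

# Reason for the last termination of the mon container (e.g. OOMKilled vs Error):
oc -n openshift-storage get pod rook-ceph-mon-a-cb564ff4-cnvwq \
  -o jsonpath='{.status.containerStatuses[?(@.name=="mon")].lastState.terminated.reason}{"\n"}'

# Liveness probe failures appear as Unhealthy events on the pod:
oc -n openshift-storage get events \
  --field-selector involvedObject.name=rook-ceph-mon-a-cb564ff4-cnvwq | grep -i liveness

# Log of the previous container instance, in case the daemon crashed:
oc -n openshift-storage logs rook-ceph-mon-a-cb564ff4-cnvwq -c mon --previous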
Is this still reproducible with the latest builds that have the tcmalloc fixes for IBM-Z?
Running the tier4b test cases on an AWS cluster, I'm seeing "rook-ceph-mgr" and "rook-ceph-mon" restarted 6 times each. Also, "rook-ceph-mds" restarted 1 time.

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-284.ci   OpenShift Container Storage   4.7.0-284.ci              Succeeded

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-06-183610   True        False         11h     Cluster version is 4.7.0-0.nightly-2021-03-06-183610

$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-92tmw                                            3/3     Running     0          9h
csi-cephfsplugin-ll28j                                            3/3     Running     0          9h
csi-cephfsplugin-pfx8x                                            3/3     Running     0          9h
csi-cephfsplugin-provisioner-849d54494-lr6rc                      6/6     Running     0          9h
csi-cephfsplugin-provisioner-849d54494-smr5d                      6/6     Running     0          9h
csi-rbdplugin-28h99                                               3/3     Running     0          9h
csi-rbdplugin-b6k6t                                               3/3     Running     0          9h
csi-rbdplugin-nlx78                                               3/3     Running     0          9h
csi-rbdplugin-provisioner-86df955ff9-22rhd                        6/6     Running     0          9h
csi-rbdplugin-provisioner-86df955ff9-p87cp                        6/6     Running     0          9h
noobaa-core-0                                                     1/1     Running     0          9h
noobaa-db-pg-0                                                    1/1     Running     0          9h
noobaa-endpoint-549b9d76f8-mtlvg                                  1/1     Running     0          9h
noobaa-operator-694ffbfd7c-qdvnl                                  1/1     Running     0          9h
ocs-metrics-exporter-75464574c8-nprk7                             1/1     Running     0          9h
ocs-operator-9dcfb85fc-bgm4x                                      1/1     Running     0          9h
rook-ceph-crashcollector-ip-10-0-151-234-6b49486b8b-fqg4c         1/1     Running     0          9h
rook-ceph-crashcollector-ip-10-0-176-124-764cb795fc-rppxw         1/1     Running     0          9h
rook-ceph-crashcollector-ip-10-0-201-62-648bb6f64f-t9b6c          1/1     Running     0          9h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-65864c49hmw7j   2/2     Running     1          9h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6fb9655dwptql   2/2     Running     0          9h
rook-ceph-mgr-a-7b48546684-hdm8f                                  2/2     Running     6          9h
rook-ceph-mon-a-78779b7dcf-r2s7h                                  2/2     Running     6          9h
rook-ceph-mon-b-f45cd8b47-2dkc4                                   2/2     Running     0          9h
rook-ceph-mon-c-69c4c69685-j5dl8                                  2/2     Running     0          9h
rook-ceph-operator-56c845f4bb-ldk54                               1/1     Running     0          9h
rook-ceph-osd-0-769677ddf9-ktgn7                                  2/2     Running     6          9h
rook-ceph-osd-1-56655d86d7-c2gxn                                  2/2     Running     0          9h
rook-ceph-osd-2-7dc986c454-lv994                                  2/2     Running     0          9h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0twlw2-5qb6x           0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0j6fjh-fpr4w           0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-2-data-09vd99-6jx54           0/1     Completed   0          9h
rook-ceph-tools-69f66f5b4f-wxv89                                  1/1     Running     0          9h
@brgardne Yes, this issue is still reproducible on latest 4.7 with the tcmalloc fix
Please attach OCS must-gather for the most recently failing tests. I cannot debug without that.
@brgardne: sure, @akandath is running the tier4b tests on IBM Z and will upload the OCS must-gather logs. Thank you.
Created attachment 1770300 [details]
must-gather

Attached the must-gather log after reproducing it using the tests mentioned in the description. Below is the current status of the OCS pods.
---
(.venv) [root@m1301015 ~]# oc -n openshift-storage get pod
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-2vftq                                            3/3     Running     0          62m
csi-cephfsplugin-ljhtn                                            3/3     Running     0          62m
csi-cephfsplugin-provisioner-6f5dd9fc87-fs6qf                     6/6     Running     0          62m
csi-cephfsplugin-provisioner-6f5dd9fc87-nckqs                     6/6     Running     0          62m
csi-cephfsplugin-vngfm                                            3/3     Running     0          62m
csi-rbdplugin-fjllb                                               3/3     Running     0          62m
csi-rbdplugin-nr6fp                                               3/3     Running     0          62m
csi-rbdplugin-nz2rn                                               3/3     Running     0          62m
csi-rbdplugin-provisioner-5555796984-58kj4                        6/6     Running     0          62m
csi-rbdplugin-provisioner-5555796984-q795c                        6/6     Running     0          62m
noobaa-core-0                                                     1/1     Running     0          61m
noobaa-db-pg-0                                                    1/1     Running     0          61m
noobaa-endpoint-865475b975-6gqp4                                  1/1     Running     0          59m
noobaa-operator-7d758949bc-8l5xf                                  1/1     Running     0          70m
ocs-metrics-exporter-79f7985f66-7s54h                             1/1     Running     0          70m
ocs-operator-7776799dbc-4pmcj                                     1/1     Running     0          70m
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-87ln5dk   1/1     Running     0          62m
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-76w26xk   1/1     Running     0          62m
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-c92bfjr   1/1     Running     0          61m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-74b678d9zs6gr   2/2     Running     0          60m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7cd89f6d765tt   2/2     Running     0          60m
rook-ceph-mgr-a-6f754b7646-5w9bg                                  2/2     Running     1          61m
rook-ceph-mon-a-5d74494c59-b5ldz                                  2/2     Running     1          62m
rook-ceph-mon-b-7569b969fc-2p6x7                                  2/2     Running     0          62m
rook-ceph-mon-c-5bd7b4d45f-hxqcr                                  2/2     Running     0          61m
rook-ceph-operator-7779c4f57b-t9297                               1/1     Running     0          70m
rook-ceph-osd-0-7f45df9c8-mjfz5                                   2/2     Running     1          61m
rook-ceph-osd-1-5554468bbf-jh5ts                                  2/2     Running     0          61m
rook-ceph-osd-2-7df6b7d5c7-8vx8w                                  2/2     Running     0          61m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0mdnxw-5qc6t           0/1     Completed   0          61m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0h7wqd-dlng7           0/1     Completed   0          61m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0hjv5j-g727j           0/1     Completed   0          61m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9f5f694lz9jl   2/2     Running     0          60m
rook-ceph-tools-599b8f4774-5lm9f                                  1/1     Running     0          66m
(.venv) [root@m1301015 ~]#
For System P, we are seeing this problem as well under this scenario:
1. When we perform independent FIO runs for CephFS (i.e. create a CephFS PVC, attach it to a pod, and inside the pod install fio and run it) simultaneously on all three worker/storage nodes.

Under these scenarios, we do not see this problem:
1. We are not seeing the ceph-mon pod restarting when the tier tests are run.
2. When the above-mentioned independent FIO run is not executed simultaneously. For example, run it on node 3 and, while it is running, after 10 minutes run it simultaneously on nodes 1 and 2.
3. We have not been able to reproduce it with RBD/block storage independent FIO runs so far.
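For context, a minimal sketch of the kind of independent CephFS fio run described above; the PVC size, fio parameters, container image, and object names are illustrative assumptions, since the report does not specify the exact values used:

# Create a CephFS PVC and a pod that mounts it (one such pod per worker node;
# pin each pod to a different worker with spec.nodeName if needed).
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio-cephfs-pvc-1
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 50Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: fio-pod-1
spec:
  containers:
  - name: fio
    image: quay.io/example/fio:latest   # placeholder image; the report installs fio inside the pod
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: fio-cephfs-pvc-1
EOF

# Run fio against the mounted filesystem from inside the pod, simultaneously on all workers:
oc rsh fio-pod-1 \
  fio --name=cephfs-load --directory=/data --rw=randrw --bs=4k --size=10g \
      --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 --time_based --runtime=1800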
We made another attempt to run the CephFS independent fio runs, and this time we could not reproduce the problem. After I reported the problem in the System P environment above, we waited for the Ceph health to recover and did a block storage independent fio run, which completed successfully. Then we wanted to reproduce the problem with a CephFS independent fio run and collect must-gather logs to attach to this bug, but we could not reproduce the problem.
Travis/Blaine, PTAL
Ran a CephFS independent fio run on IBM Power Systems and encountered the issue again, so attaching the must-gather logs: https://drive.google.com/file/d/11BdjZrCtYJSV1ISr6XijE3dAWhnMlhJP/view?usp=sharing
I'll start looking through the must-gather with urgent priority. In the meantime, could I get SSH access to a test cluster showing this behavior so I can inspect things interactively? Blaine (Clearing needinfo from Travis but not myself)
The latest must-gather does not seem to show the issue being reproduced.

NAME                                                              READY   STATUS        RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
aaruni-demo-pod-rbd1                                              1/1     Running       0          3h1m    10.128.2.66     worker-0   <none>           <none>
aaruni-demo-pod-rbd2                                              1/1     Running       0          3h1m    10.131.0.140    worker-1   <none>           <none>
csi-cephfsplugin-466kg                                            3/3     Running       0          4h35m   192.168.0.189   worker-2   <none>           <none>
csi-cephfsplugin-6dd6t                                            3/3     Running       0          4h35m   192.168.0.230   worker-1   <none>           <none>
csi-cephfsplugin-bttx5                                            3/3     Running       0          4h35m   192.168.0.23    worker-0   <none>           <none>
csi-cephfsplugin-provisioner-f975d886c-cqj95                      6/6     Running       0          2m      10.131.0.158    worker-1   <none>           <none>
csi-cephfsplugin-provisioner-f975d886c-g2vx8                      6/6     Running       0          4h35m   10.128.2.23     worker-0   <none>           <none>
csi-rbdplugin-9jbpj                                               3/3     Running       0          4h35m   192.168.0.23    worker-0   <none>           <none>
csi-rbdplugin-fjvqp                                               3/3     Running       0          4h35m   192.168.0.230   worker-1   <none>           <none>
csi-rbdplugin-provisioner-6bbf798bfb-7hk85                        6/6     Running       0          4h35m   10.131.0.115    worker-1   <none>           <none>
csi-rbdplugin-provisioner-6bbf798bfb-cx5nc                        6/6     Running       0          4h35m   10.128.2.22     worker-0   <none>           <none>
csi-rbdplugin-r4qp2                                               3/3     Running       0          4h35m   192.168.0.189   worker-2   <none>           <none>
must-gather-xhdv5-helper                                          1/1     Running       0          104s    10.131.0.160    worker-1   <none>           <none>
noobaa-core-0                                                     1/1     Running       0          12s     10.129.3.89     worker-2   <none>           <none>
noobaa-db-pg-0                                                    0/1     Terminating   0          4h33m   10.129.3.80     worker-2   <none>           <none>
noobaa-endpoint-8f79bfbb5-g68h7                                   1/1     Running       0          2m      10.128.2.117    worker-0   <none>           <none>
noobaa-operator-56d4ffcbd8-xnpqn                                  1/1     Running       0          4h36m   10.131.0.113    worker-1   <none>           <none>
ocs-metrics-exporter-6c4d8ff5f-gtzq2                              1/1     Running       0          4h36m   10.128.2.21     worker-0   <none>           <none>
ocs-operator-69fd4cc975-pbbvh                                     1/1     Running       0          119s    10.128.2.118    worker-0   <none>           <none>
rook-ceph-crashcollector-worker-0-84849b9589-4c84j                1/1     Running       0          4h34m   10.128.2.27     worker-0   <none>           <none>
rook-ceph-crashcollector-worker-1-6d6b4fd6b8-9vvll                1/1     Running       0          4h35m   10.131.0.123    worker-1   <none>           <none>
rook-ceph-crashcollector-worker-2-7495d898b7-lnf68                1/1     Running       0          2m      10.129.3.86     worker-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-898984cctln75   2/2     Running       0          4h33m   10.128.2.29     worker-0   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7f6cf679tt96h   2/2     Running       0          4h33m   10.131.0.122    worker-1   <none>           <none>
rook-ceph-mgr-a-69f99584bb-mmssf                                  2/2     Running       0          4h33m   10.131.0.119    worker-1   <none>           <none>
rook-ceph-mon-a-787db7b988-nxlwp                                  2/2     Running       0          4h35m   10.131.0.117    worker-1   <none>           <none>
rook-ceph-mon-b-76887ccfd8-22zcm                                  2/2     Running       0          2m      10.129.3.88     worker-2   <none>           <none>
rook-ceph-mon-c-5c7d549f77-927hc                                  2/2     Running       0          4h34m   10.128.2.25     worker-0   <none>           <none>
rook-ceph-operator-64849fdfd6-kfb9j                               1/1     Running       0          2m      10.131.0.157    worker-1   <none>           <none>
rook-ceph-osd-0-974db7b55-lsmdh                                   2/2     Running       0          4h33m   10.131.0.121    worker-1   <none>           <none>
rook-ceph-osd-1-6c9649577f-svqvs                                  2/2     Running       0          4h33m   10.128.2.28     worker-0   <none>           <none>
rook-ceph-osd-2-66c57cc56d-gdrqh                                  2/2     Running       0          2m      10.129.3.87     worker-2   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-09lvrnthp   0/1     Completed     0          4h33m   10.128.2.26     worker-0   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-1-data-04svqgvw5   0/1     Completed     0          4h33m   10.131.0.120    worker-1   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9bb7fc9snd5g   2/2     Running       2          119s    10.128.2.120    worker-0   <none>           <none>
rook-ceph-tools-69c5449589-2kp85                                  1/1     Running       0          4h33m   192.168.0.23    worker-0   <none>           <none>
worker-0-debug                                                    1/1     Running       0          104s    192.168.0.23    worker-0   <none>           <none>
worker-1-debug                                                    1/1     Running       0          104s    192.168.0.230   worker-1   <none>           <none>

Is this the right must-gather @Aaruni?
*** Bug 1932478 has been marked as a duplicate of this bug. ***
After looking more closely at the most recent 2 must-gathers, I believe this is a duplicate of bug https://bugzilla.redhat.com/show_bug.cgi?id=1932478. This bug has much more detail, so I have closed the other as a duplicate of this bug.

The cause of the pod restarts from the latest 2 must-gathers seems to be liveness probe failures. These liveness probe failures occur when a Ceph daemon does not bootstrap itself on startup before the liveness probe starts checking on its health.
- by default, most Ceph daemon liveness probes start checking 10 seconds after the container is started
- the exception is OSDs, which start after 45 seconds by default

The commonality most striking to me between the two bugs is that they are both on IBM -- ROKS in 1932478 and IBM-Z here. Ultimately, I do not believe this is a Rook issue. I believe there may be an issue in Ceph where daemons are slow to bootstrap on IBM-Z platforms.

Rook v1.5 (OCS v4.7) introduced the ability to override the `livenessProbe.initialDelaySeconds`, which is a way to work around this issue in the short term. However, I do not believe the OCS GUI allows this to be configured. We may want to do some ocs-operator changes to work around this until the root cause can be determined.
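For reference, a minimal sketch of what that override can look like on the CephCluster CR; the values are illustrative, the field layout (healthCheck.livenessProbe.<daemon>.probe) should be verified against the deployed Rook CRD, and on OCS the ocs-operator owns this CR, so a manual patch may be reconciled away:

# Illustrative only: give mons and OSDs more time to bootstrap before the
# liveness probe starts checking.
oc -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster \
  --type merge -p '
spec:
  healthCheck:
    livenessProbe:
      mon:
        probe:
          initialDelaySeconds: 60
      osd:
        probe:
          initialDelaySeconds: 120
'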
(In reply to brgardne from comment #15)
> [pod listing from comment #15 trimmed]
> Is this the right must-gather @Aaruni?

Yes @brgardne, these logs are the ones that I collected when some of the pods restarted (the age of some pods is around 2m) because one of the worker nodes went into NotReady state while doing independent FIO runs for the filesystem.
@brgardne, we cannot give access to our cluster, but we can have a call so that you can have a look. Will that work for you?
@aaaggarw I'm now more confused. Why is it a Rook bug that pods are restarting when a node goes into NotReady state?
Is the worker node going down in all of the instances mentioned above? We already have a BZ for that https://bugzilla.redhat.com/show_bug.cgi?id=1945016
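As a generic way to check that (not from the original report; <worker-node> is a placeholder), node NotReady transitions can be correlated with the pod restart times:

# Recent NodeNotReady events across the cluster, newest last:
oc get events -A --field-selector reason=NodeNotReady --sort-by=.lastTimestamp

# Condition transition times on a specific node:
oc get node <worker-node> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastTransitionTime}{"\n"}{end}'

# Current restart counts for the rook-ceph pods:
oc -n openshift-storage get pods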
(In reply to brgardne from comment #20)
> @aaaggarw I'm now more confused. Why is it a Rook bug that pods
> are restarting when a node goes into NotReady state?

Basically, you are saying that, from Rook's perspective, it seems to be working as designed.

So one would need to find out why the node goes into NotReady state.
Thank you Mudit. I think it seems likely that the bug you linked (1945016) is the root cause behind this. I will also look through both BZs' logs to see if there are artifacts that correlate these two bugs.
(In reply to Mudit Agarwal from comment #21)
> Is the worker node going down in all of the instances mentioned above?
>
> We already have a BZ for that
> https://bugzilla.redhat.com/show_bug.cgi?id=1945016

For the Power platform, the worker node is going down in only one scenario, i.e. when we are running independent FIO runs for the Ceph filesystem.
(In reply to Michael Adam from comment #22)
> Basically, you are saying that, from Rook's perspective, it seems to be
> working as designed.
>
> So one would need to find out why the node goes into NotReady state.

Michael, I'm not sure what is happening. I created 3 pods (one for each worker node) and 3 PVCs for CephFS. Then I ran fio commands simultaneously inside all 3 pods using oc rsh. It was working fine for the first 2 pods, but the 3rd one got stuck, and then I found that one of the worker nodes went to NotReady.
Hi Aaruni,

Thanks for the info, I need some more help.

Is this reproducible in 4.6 also? Can you please try?

Also, if this is reproducible, can we access the cluster?
(In reply to Mudit Agarwal from comment #26)
> Is this reproducible in 4.6 also? Can you please try?
>
> Also, if this is reproducible, can we access the cluster?

Hi Mudit, I will let you know once I create a 4.6 cluster and test it.
(In reply to brgardne from comment #20)
> @aaaggarw I'm now more confused. Why is it a Rook bug that pods
> are restarting when a node goes into NotReady state?

Apologies brgardne for confusing you. My issue is related to the BZ that Mudit posted above - https://bugzilla.redhat.com/show_bug.cgi?id=1945016
(In reply to Mudit Agarwal from comment #26)
> Is this reproducible in 4.6 also? Can you please try?
>
> Also, if this is reproducible, can we access the cluster?

Mudit, I forgot this earlier. We can't run the same test on OCS 4.6, as we have the tcmalloc issue in OCS 4.6. If we do this (heavily loaded PVCs/pods), we may end up with crashed OSD pods. The tcmalloc issue was resolved in OCS 4.7.
So, there are two things mentioned in this BZ:
1. The worker node going down
2. Rook pods getting restarted

If [2] is happening because of [1], then this can be a dup of BZ #1945016; otherwise it has to be treated separately.

Also, I don't think the pods have restarted as many times as we saw in the tcmalloc issue (Blaine can keep me honest here). If that is the case, then this issue might not be that serious (or unexpected).
For IBM Power, the issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1945016, and we are also not hitting this issue consistently on our platform. Not sure about IBM Z, as they opened this Bugzilla.
@svenkat I believe all signs point to this being the same issue on both Z and P systems, with the same symptoms as https://bugzilla.redhat.com/show_bug.cgi?id=1945016. IBM nodes in particular are reported to fall into NotReady state under load. In 1945016, they have asked the OCP team to take a look.

Are these "tier 4b" tests run on non-IBM systems? If yes (and the issue does not show up there), then we know it is an IBM-only issue. If no, can we run the tests on an x86 cluster to see if it reproduces there also, to gather more data?
This is not being hit consistently, as mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c37 and https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c38.

Moving it to 4.8.
Based on recent discussion, should this be closed and instead opened as an issue in https://github.com/red-hat-storage/ocs-ci?
Mudit, please see https://bugzilla.redhat.com/show_bug.cgi?id=1929188#c36
This is a duplicate of BZ #1945016. We discussed opening a ci issue for the capacity BZ, not this one.

*** This bug has been marked as a duplicate of bug 1945016 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days