Bug 1929188
Summary: IBMZ: rook-ceph-mon and rook-ceph-mds-ocs-storagecluster-cephfilesystem pods restart several times during ocs-ci tier 4b test execution
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Component: rook
Version: 4.6
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Reporter: Sravika <sbalusu>
Assignee: Blaine Gardner <brgardne>
QA Contact: Elad <ebenahar>
CC: aaaggarw, akandath, brgardne, ekuric, madam, muagarwa, ocs-bugs, ratamir, rcyriac, shan, svenkat, tnielsen, tunguyen
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Last Closed: 2021-05-25 05:29:11 UTC
Description (Sravika, 2021-02-16 11:57:01 UTC)
Is this still reproducible with the latest builds that have the tcmalloc fixes for IBM-Z?

Running tier4b test cases on an AWS cluster, I'm seeing "rook-ceph-mgr" and "rook-ceph-mon" restarted 6 times. Also, "rook-ceph-mds" restarted 1 time.

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-284.ci   OpenShift Container Storage   4.7.0-284.ci              Succeeded

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-06-183610   True        False         11h     Cluster version is 4.7.0-0.nightly-2021-03-06-183610

$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-92tmw                                            3/3     Running     0          9h
csi-cephfsplugin-ll28j                                            3/3     Running     0          9h
csi-cephfsplugin-pfx8x                                            3/3     Running     0          9h
csi-cephfsplugin-provisioner-849d54494-lr6rc                      6/6     Running     0          9h
csi-cephfsplugin-provisioner-849d54494-smr5d                      6/6     Running     0          9h
csi-rbdplugin-28h99                                               3/3     Running     0          9h
csi-rbdplugin-b6k6t                                               3/3     Running     0          9h
csi-rbdplugin-nlx78                                               3/3     Running     0          9h
csi-rbdplugin-provisioner-86df955ff9-22rhd                        6/6     Running     0          9h
csi-rbdplugin-provisioner-86df955ff9-p87cp                        6/6     Running     0          9h
noobaa-core-0                                                     1/1     Running     0          9h
noobaa-db-pg-0                                                    1/1     Running     0          9h
noobaa-endpoint-549b9d76f8-mtlvg                                  1/1     Running     0          9h
noobaa-operator-694ffbfd7c-qdvnl                                  1/1     Running     0          9h
ocs-metrics-exporter-75464574c8-nprk7                             1/1     Running     0          9h
ocs-operator-9dcfb85fc-bgm4x                                      1/1     Running     0          9h
rook-ceph-crashcollector-ip-10-0-151-234-6b49486b8b-fqg4c         1/1     Running     0          9h
rook-ceph-crashcollector-ip-10-0-176-124-764cb795fc-rppxw         1/1     Running     0          9h
rook-ceph-crashcollector-ip-10-0-201-62-648bb6f64f-t9b6c          1/1     Running     0          9h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-65864c49hmw7j   2/2     Running     1          9h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6fb9655dwptql   2/2     Running     0          9h
rook-ceph-mgr-a-7b48546684-hdm8f                                  2/2     Running     6          9h
rook-ceph-mon-a-78779b7dcf-r2s7h                                  2/2     Running     6          9h
rook-ceph-mon-b-f45cd8b47-2dkc4                                   2/2     Running     0          9h
rook-ceph-mon-c-69c4c69685-j5dl8                                  2/2     Running     0          9h
rook-ceph-operator-56c845f4bb-ldk54                               1/1     Running     0          9h
rook-ceph-osd-0-769677ddf9-ktgn7                                  2/2     Running     6          9h
rook-ceph-osd-1-56655d86d7-c2gxn                                  2/2     Running     0          9h
rook-ceph-osd-2-7dc986c454-lv994                                  2/2     Running     0          9h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0twlw2-5qb6x           0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0j6fjh-fpr4w           0/1     Completed   0          9h
rook-ceph-osd-prepare-ocs-deviceset-2-data-09vd99-6jx54           0/1     Completed   0          9h
rook-ceph-tools-69f66f5b4f-wxv89                                  1/1     Running     0          9h

@brgardne Yes, this issue is still reproducible on latest 4.7 with the tcmalloc fix.

Please attach OCS must-gather for the most recently failing tests. I cannot debug without that.

@brgardne: sure, @akandath is running the tier4b tests on IBM Z and will upload the ocs must-gather logs. Thank you.

Created attachment 1770300 [details]
must-gather
Attached the must-gather log after reproducing the issue using the tests mentioned in the description. Below is the current status of the OCS pods.
---
(.venv) [root@m1301015 ~]# oc -n openshift-storage get pod
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-2vftq 3/3 Running 0 62m
csi-cephfsplugin-ljhtn 3/3 Running 0 62m
csi-cephfsplugin-provisioner-6f5dd9fc87-fs6qf 6/6 Running 0 62m
csi-cephfsplugin-provisioner-6f5dd9fc87-nckqs 6/6 Running 0 62m
csi-cephfsplugin-vngfm 3/3 Running 0 62m
csi-rbdplugin-fjllb 3/3 Running 0 62m
csi-rbdplugin-nr6fp 3/3 Running 0 62m
csi-rbdplugin-nz2rn 3/3 Running 0 62m
csi-rbdplugin-provisioner-5555796984-58kj4 6/6 Running 0 62m
csi-rbdplugin-provisioner-5555796984-q795c 6/6 Running 0 62m
noobaa-core-0 1/1 Running 0 61m
noobaa-db-pg-0 1/1 Running 0 61m
noobaa-endpoint-865475b975-6gqp4 1/1 Running 0 59m
noobaa-operator-7d758949bc-8l5xf 1/1 Running 0 70m
ocs-metrics-exporter-79f7985f66-7s54h 1/1 Running 0 70m
ocs-operator-7776799dbc-4pmcj 1/1 Running 0 70m
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-87ln5dk 1/1 Running 0 62m
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-76w26xk 1/1 Running 0 62m
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-c92bfjr 1/1 Running 0 61m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-74b678d9zs6gr 2/2 Running 0 60m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7cd89f6d765tt 2/2 Running 0 60m
rook-ceph-mgr-a-6f754b7646-5w9bg 2/2 Running 1 61m
rook-ceph-mon-a-5d74494c59-b5ldz 2/2 Running 1 62m
rook-ceph-mon-b-7569b969fc-2p6x7 2/2 Running 0 62m
rook-ceph-mon-c-5bd7b4d45f-hxqcr 2/2 Running 0 61m
rook-ceph-operator-7779c4f57b-t9297 1/1 Running 0 70m
rook-ceph-osd-0-7f45df9c8-mjfz5 2/2 Running 1 61m
rook-ceph-osd-1-5554468bbf-jh5ts 2/2 Running 0 61m
rook-ceph-osd-2-7df6b7d5c7-8vx8w 2/2 Running 0 61m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0mdnxw-5qc6t 0/1 Completed 0 61m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0h7wqd-dlng7 0/1 Completed 0 61m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0hjv5j-g727j 0/1 Completed 0 61m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9f5f694lz9jl 2/2 Running 0 60m
rook-ceph-tools-599b8f4774-5lm9f 1/1 Running 0 66m
(.venv) [root@m1301015 ~]#
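For restarts like the rook-ceph-mon-a and rook-ceph-mgr-a ones in the listing above, the container's last state and the namespace events usually show whether the restart came from a liveness probe failure, an OOM kill, or a daemon crash. A minimal way to check, using a pod name from the listing above (substitute the current names on your cluster):

# Exit code and reason of the previous container instance
$ oc -n openshift-storage describe pod rook-ceph-mon-a-5d74494c59-b5ldz | grep -A 7 'Last State'

# Recent events; liveness probe failures show up as "Unhealthy" / "Killing"
$ oc -n openshift-storage get events --sort-by=.lastTimestamp | grep -Ei 'unhealthy|liveness|killing|oom'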
For System P, we are seeing this problem as well under this scenario:

1. When we perform independent FIO runs for CephFS (that is: create a CephFS PVC, attach it to a pod, and inside the pod install fio and run it) simultaneously on all three worker/storage nodes.

Under these scenarios, we do not see this problem:

1. We are not seeing the ceph-mon pod restarting when the tier tests are run.
2. When the above-mentioned independent FIO run is not executed simultaneously. For example, run it on node 3 and, while it is running, after 10 minutes run it simultaneously on nodes 1 and 2.
3. We have not been able to reproduce it with RBD/block storage independent FIO runs so far.
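For reference, a minimal sketch of how such an independent CephFS FIO run can be set up. The PVC/pod names, sizes, image, and fio parameters below are illustrative assumptions (not taken from ocs-ci), and the storage class is assumed to be the default OCS CephFS class:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio-cephfs-pvc                    # hypothetical name
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 50Gi
  storageClassName: ocs-storagecluster-cephfs   # assumed default OCS CephFS storage class
---
apiVersion: v1
kind: Pod
metadata:
  name: fio-cephfs-pod                    # hypothetical name
spec:
  # To pin one pod per worker, add a nodeSelector on kubernetes.io/hostname.
  containers:
  - name: fio
    image: quay.io/centos/centos:stream8  # any image whose repos provide fio
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /mnt/cephfs
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: fio-cephfs-pvc
EOF

# Install fio in the pod and run it; repeat from three shells (or via oc rsh) to load all workers at once.
$ oc exec fio-cephfs-pod -- dnf install -y fio
$ oc exec fio-cephfs-pod -- fio --name=cephfs-test --directory=/mnt/cephfs \
    --rw=randwrite --bs=4k --size=2G --numjobs=4 --runtime=600 --time_based --group_reporting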
We made another attempt to run CephFS independent fio runs, and this time we could not reproduce the problem. After I reported the problem in the System P environment above, we waited for the Ceph health to recover and did a block storage independent fio run, which completed successfully. Then we wanted to reproduce the problem with a CephFS independent fio run and collect must-gather logs to attach to this bug, but we could not reproduce the problem.

Travis/Blaine, PTAL.

Ran a CephFS independent fio run on IBM Power Systems and encountered the issue again, so attaching the must-gather logs: https://drive.google.com/file/d/11BdjZrCtYJSV1ISr6XijE3dAWhnMlhJP/view?usp=sharing

I'll start looking through the must-gather with urgent priority. In the meantime, could I get SSH access to a test cluster showing this behavior so I can inspect things interactively? Blaine (Clearing needinfo from Travis but not myself)

The latest must-gather does not seem to show the issue being reproduced.

NAME                                                              READY   STATUS        RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
aaruni-demo-pod-rbd1                                              1/1     Running       0          3h1m    10.128.2.66     worker-0   <none>           <none>
aaruni-demo-pod-rbd2                                              1/1     Running       0          3h1m    10.131.0.140    worker-1   <none>           <none>
csi-cephfsplugin-466kg                                            3/3     Running       0          4h35m   192.168.0.189   worker-2   <none>           <none>
csi-cephfsplugin-6dd6t                                            3/3     Running       0          4h35m   192.168.0.230   worker-1   <none>           <none>
csi-cephfsplugin-bttx5                                            3/3     Running       0          4h35m   192.168.0.23    worker-0   <none>           <none>
csi-cephfsplugin-provisioner-f975d886c-cqj95                      6/6     Running       0          2m      10.131.0.158    worker-1   <none>           <none>
csi-cephfsplugin-provisioner-f975d886c-g2vx8                      6/6     Running       0          4h35m   10.128.2.23     worker-0   <none>           <none>
csi-rbdplugin-9jbpj                                               3/3     Running       0          4h35m   192.168.0.23    worker-0   <none>           <none>
csi-rbdplugin-fjvqp                                               3/3     Running       0          4h35m   192.168.0.230   worker-1   <none>           <none>
csi-rbdplugin-provisioner-6bbf798bfb-7hk85                        6/6     Running       0          4h35m   10.131.0.115    worker-1   <none>           <none>
csi-rbdplugin-provisioner-6bbf798bfb-cx5nc                        6/6     Running       0          4h35m   10.128.2.22     worker-0   <none>           <none>
csi-rbdplugin-r4qp2                                               3/3     Running       0          4h35m   192.168.0.189   worker-2   <none>           <none>
must-gather-xhdv5-helper                                          1/1     Running       0          104s    10.131.0.160    worker-1   <none>           <none>
noobaa-core-0                                                     1/1     Running       0          12s     10.129.3.89     worker-2   <none>           <none>
noobaa-db-pg-0                                                    0/1     Terminating   0          4h33m   10.129.3.80     worker-2   <none>           <none>
noobaa-endpoint-8f79bfbb5-g68h7                                   1/1     Running       0          2m      10.128.2.117    worker-0   <none>           <none>
noobaa-operator-56d4ffcbd8-xnpqn                                  1/1     Running       0          4h36m   10.131.0.113    worker-1   <none>           <none>
ocs-metrics-exporter-6c4d8ff5f-gtzq2                              1/1     Running       0          4h36m   10.128.2.21     worker-0   <none>           <none>
ocs-operator-69fd4cc975-pbbvh                                     1/1     Running       0          119s    10.128.2.118    worker-0   <none>           <none>
rook-ceph-crashcollector-worker-0-84849b9589-4c84j                1/1     Running       0          4h34m   10.128.2.27     worker-0   <none>           <none>
rook-ceph-crashcollector-worker-1-6d6b4fd6b8-9vvll                1/1     Running       0          4h35m   10.131.0.123    worker-1   <none>           <none>
rook-ceph-crashcollector-worker-2-7495d898b7-lnf68                1/1     Running       0          2m      10.129.3.86     worker-2   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-898984cctln75   2/2     Running       0          4h33m   10.128.2.29     worker-0   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7f6cf679tt96h   2/2     Running       0          4h33m   10.131.0.122    worker-1   <none>           <none>
rook-ceph-mgr-a-69f99584bb-mmssf                                  2/2     Running       0          4h33m   10.131.0.119    worker-1   <none>           <none>
rook-ceph-mon-a-787db7b988-nxlwp                                  2/2     Running       0          4h35m   10.131.0.117    worker-1   <none>           <none>
rook-ceph-mon-b-76887ccfd8-22zcm                                  2/2     Running       0          2m      10.129.3.88     worker-2   <none>           <none>
rook-ceph-mon-c-5c7d549f77-927hc                                  2/2     Running       0          4h34m   10.128.2.25     worker-0   <none>           <none>
rook-ceph-operator-64849fdfd6-kfb9j                               1/1     Running       0          2m      10.131.0.157    worker-1   <none>           <none>
rook-ceph-osd-0-974db7b55-lsmdh                                   2/2     Running       0          4h33m   10.131.0.121    worker-1   <none>           <none>
rook-ceph-osd-1-6c9649577f-svqvs                                  2/2     Running       0          4h33m   10.128.2.28     worker-0   <none>           <none>
rook-ceph-osd-2-66c57cc56d-gdrqh                                  2/2     Running       0          2m      10.129.3.87     worker-2   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-09lvrnthp   0/1     Completed     0          4h33m   10.128.2.26     worker-0   <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-localblock-1-data-04svqgvw5   0/1     Completed     0          4h33m   10.131.0.120    worker-1   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-9bb7fc9snd5g   2/2     Running       2          119s    10.128.2.120    worker-0   <none>           <none>
rook-ceph-tools-69c5449589-2kp85                                  1/1     Running       0          4h33m   192.168.0.23    worker-0   <none>           <none>
worker-0-debug                                                    1/1     Running       0          104s    192.168.0.23    worker-0   <none>           <none>
worker-1-debug                                                    1/1     Running       0          104s    192.168.0.230   worker-1   <none>           <none>

Is this the right must-gather @Aaruni?

*** Bug 1932478 has been marked as a duplicate of this bug. ***

After looking more closely at the most recent 2 must-gathers, I believe this is a duplicate of bug https://bugzilla.redhat.com/show_bug.cgi?id=1932478. This bug has much more detail, so I have closed the other as a duplicate of this bug.

The cause of the pod restarts in the latest 2 must-gathers seems to be liveness probe failures. These liveness probe failures occur when a Ceph daemon does not bootstrap itself on startup before the liveness probe starts checking on its health.
- by default, most Ceph daemon liveness probes start checking 10 seconds after the container is started
- the exception is OSDs, which start after 45 seconds by default

The commonality most striking to me between the two bugs is that they are both on IBM -- ROKS in 1932478 and IBM-Z here. Ultimately, I do not believe this is a Rook issue. I believe there may be an issue in Ceph where daemons are slow to bootstrap on IBM-Z platforms.

Rook v1.5 (OCS v4.7) introduced the ability to override the `livenessProbe.initialDelaySeconds`, which is a way to work around this issue in the short term. However, I do not believe the OCS GUI allows this to be configured. We may want to do some ocs-operator changes to work around this until the root cause can be determined.
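A sketch of what that override could look like, under two assumptions that should be verified before use: that the deployed Rook version's CephCluster CRD exposes the probe override at spec.healthCheck.livenessProbe.<daemon>.probe, and that the CephCluster created by ocs-operator carries the usual default name ocs-storagecluster-cephcluster. Since ocs-operator manages this CR, a manual patch may also be reconciled away:

# Hypothetical workaround patch; verify the field path against the installed CRD first.
$ oc -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster --type merge -p '
{
  "spec": {
    "healthCheck": {
      "livenessProbe": {
        "mon": {"probe": {"initialDelaySeconds": 60}},
        "mds": {"probe": {"initialDelaySeconds": 60}},
        "mgr": {"probe": {"initialDelaySeconds": 60}}
      }
    }
  }
}'

The 60-second value is only an example of giving slow-bootstrapping daemons more headroom than the 10-second default mentioned above.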
(In reply to brgardne from comment #15)
> The latest must-gather does not seem to show the issue being reproduced.
>
> [pod listing quoted above snipped]
>
> Is this the right must-gather @Aaruni?

Yes @brgardne, these logs are the ones that I collected when some of the pods restarted (the age of some pods is around 2m) because one of the worker nodes went into NotReady state while doing independent FIO runs for the filesystem.

@brgardne, we cannot give access to our cluster, but we can have a call so that you can have a look. Will that work for you?

@aaaggarw I'm now more confused. Why is it a Rook bug that pods are restarting when a node goes into NotReady state?

Is the worker node going down in all of the instances mentioned above?

We already have a BZ for that: https://bugzilla.redhat.com/show_bug.cgi?id=1945016

(In reply to brgardne from comment #20)
> @aaaggarw I'm now more confused. Why is it a Rook bug that pods
> are restarting when a node goes into NotReady state?

Basically, you are saying that, from Rook's perspective, it seems to be working as designed.

So one would need to find out why the node goes into NotReady state.
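A few commands that can help with that investigation on an OpenShift 4.x cluster; "worker-0" below is simply one of the worker names that appears in the listings in this bug, so substitute the node that actually went NotReady:

# Node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure) and their reasons
$ oc get nodes
$ oc describe node worker-0 | grep -A 12 'Conditions:'

# Kubelet logs from the affected node around the time it went NotReady
$ oc adm node-logs worker-0 -u kubelet | tail -n 200

# If the node is still reachable, check load and memory pressure directly
$ oc debug node/worker-0 -- chroot /host uptime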
Thank you Mudit. I think it seems likely the bug you linked (1945016) is the root cause behind this. I will also look through both BZs' logs to see if there are artifacts that correlate these two bugs from what I can see.

(In reply to Mudit Agarwal from comment #21)
> Is the worker node going down in all of the instances mentioned above?
>
> We already have a BZ for that
> https://bugzilla.redhat.com/show_bug.cgi?id=1945016

For the Power platform, the worker node is going down in only one scenario, i.e. when we are running independent FIO runs for the Ceph filesystem.

(In reply to Michael Adam from comment #22)
> (In reply to brgardne from comment #20)
> > @aaaggarw I'm now more confused. Why is it a Rook bug that pods
> > are restarting when a node goes into NotReady state?
>
> Basically, you are saying, from rook's perspective, it seems to be working
> as designed.
>
> So one would need to find out why the node goes into NotReady state.

Michael, not sure what is happening. I created 3 pods (one for each worker node) and 3 PVCs for CephFS. Then I ran fio commands simultaneously inside all 3 pods using oc rsh. It was working fine for the first 2 pods, but the 3rd one got stuck, and then I found that one of the worker nodes went to NotReady.

Hi Aaruni,

Thanks for the info, need some more help.

Is this reproducible in 4.6 also, can you please try?

Also, if this is reproducible, can we access the cluster?

(In reply to Mudit Agarwal from comment #26)
> Hi Aaruni,
>
> Thanks for the info, need some more help
>
> Is this reproducible in 4.6 also, can you please try?
>
> Also, if this is reproducible can we access the cluster

Hi Mudit, will let you know once I create a 4.6 cluster and test it.

(In reply to brgardne from comment #20)
> @aaaggarw I'm now more confused. Why is it a Rook bug that pods
> are restarting when a node goes into NotReady state?

Apologies brgardne for confusing you. My issue is related to the BZ that Mudit posted above - https://bugzilla.redhat.com/show_bug.cgi?id=1945016

(In reply to Mudit Agarwal from comment #26)
> Hi Aaruni,
>
> Thanks for the info, need some more help
>
> Is this reproducible in 4.6 also, can you please try?
>
> Also, if this is reproducible can we access the cluster

Mudit, I forgot this earlier. We can't run the same test on OCS 4.6 as we have the tcmalloc issue in OCS 4.6. If we do this (heavily loaded PVCs/pods), we may end up with crashed OSD pods. This tcmalloc issue got resolved in OCS 4.7.

So, there are two things mentioned in this BZ:
1. worker node going down
2. rook pods getting restarted

If [2] is happening because of [1], then this can be a dup of BZ #1945016; else it has to be treated separately. Also, I don't think that the pods have restarted as many times as we saw in the tcmalloc issue, and Blaine can keep me honest here. If that is the case, then this issue might not be that serious (or unexpected).

For IBM Power, the issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=1945016 and we are also not hitting it consistently on our platform. Not sure about IBM Z as they opened this Bugzilla.

@svenkat I believe all signs point to this being the same issue on both Z and P systems, with the same symptoms as https://bugzilla.redhat.com/show_bug.cgi?id=1945016. IBM nodes in particular are reported to fall into NotReady state under load. In 1945016, they have asked the OCP team to take a look. Are these "tier 4b" tests run on non-IBM systems? If yes, then we know it is an IBM-only issue. If no, then can we run the tests on an x86 cluster to see if it reproduces there also, to gather more data?

Not being hit consistently, as mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c37 and https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c38. Moving it to 4.8.

Based on recent discussion, should this be closed and instead opened as an issue in https://github.com/red-hat-storage/ocs-ci?

Mudit, please see https://bugzilla.redhat.com/show_bug.cgi?id=1929188#c36

This is a duplicate of BZ #1945016; we discussed opening a ci issue for the capacity BZ, not this one.

*** This bug has been marked as a duplicate of bug 1945016 ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.