Created attachment 1767995 [details]
logs

Description of problem (please be detailed as possible and provide log snippets):
The following OCS-CI test from tier4b breaks the cluster by bringing a worker node to NotReady state and the OCS cluster to an unhealthy state.

Test: tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]

Test description: Base function for ceph daemon kill disruptive tests. Deletion of 'resource_to_delete' daemon will be introduced while 'operation_to_disrupt' is progressing.

Version of all relevant components (if applicable):
OCP: 4.7.3
OCS: 4.7.0-327.ci
LSO: 4.7.0-202103202139.p0
OCS-CI: stable branch checkout, commit id = 49356e581131fd1aaa71c71eff7090bca130a07d

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, it brings a worker node to NotReady state and the OCS cluster to an unhealthy state.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy local storage with a local disk
2. Deploy the OCS cluster
3. Execute the OCS-CI test: "run-ci --ocsci-conf config.yaml --cluster-path /root/ocp4-workdir/ tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]"

Actual results:
The test fails and leaves a broken cluster.

Expected results:
The test executes successfully, leaving a healthy cluster.

Additional info:
Must-gather logs and OCS-CI logs (entire tier4b; this test fails at 13%) are attached.
State of cluster after the test execution:
```
[root@m1301015 ~]# oc -n openshift-storage get po
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-9gttv 3/3 Running 0 11h
csi-cephfsplugin-kdr24 3/3 Running 0 11h
csi-cephfsplugin-mm89d 3/3 Running 0 11h
csi-cephfsplugin-provisioner-76b7c894b9-7zfct 6/6 Running 0 11h
csi-cephfsplugin-provisioner-76b7c894b9-wvfld 6/6 Running 0 11h
csi-rbdplugin-8xd6j 3/3 Running 0 11h
csi-rbdplugin-provisioner-5866f86d44-dt7lj 6/6 Running 0 8h
csi-rbdplugin-provisioner-5866f86d44-kzwk4 6/6 Running 0 11h
csi-rbdplugin-skj2r 3/3 Running 0 11h
csi-rbdplugin-xvmdk 3/3 Running 0 11h
noobaa-core-0 1/1 Terminating 0 10h
noobaa-db-pg-0 1/1 Running 0 10h
noobaa-endpoint-94dc487d6-rfc86 1/1 Running 0 10h
noobaa-operator-8b6c658f-j9bq9 1/1 Running 0 8h
noobaa-operator-8b6c658f-z5nrz 1/1 Terminating 0 11h
ocs-metrics-exporter-5f5679bdb8-tcqcm 1/1 Running 0 11h
ocs-operator-8664f5945f-hvk6h 1/1 Running 0 11h
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-75z52tv 1/1 Running 0 10h
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-7dvfbwc 1/1 Running 0 10h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-554845d72cxwc 2/2 Running 0 8h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-58c55c94zr6jq 2/2 Running 0 10h
rook-ceph-mgr-a-78c79f5db4-5z8zf 2/2 Running 4 10h
rook-ceph-mon-a-94847fb95-q8bqx 2/2 Running 3 10h
rook-ceph-mon-b-59cc54575f-fxgvl 2/2 Running 0 10h
rook-ceph-mon-c-7d7d8c847-pd54p 0/2 Pending 0 41s
rook-ceph-mon-d-7dd8d46684-q9ncb 0/2 Pending 0 8h
rook-ceph-operator-74795b5c46-wt4s4 1/1 Running 0 11h
rook-ceph-osd-0-5f567749bd-t6l8r 2/2 Running 3 10h
rook-ceph-osd-1-6c59f4ff4c-ngbvj 2/2 Running 0 10h
rook-ceph-osd-2-6866798b97-n95qk 0/2 Pending 0 8h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0dlcbd-j8vh9 0/1 Completed 0 10h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0r4gpr-2bzjv 0/1 Completed 0 10h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-764c689wzdxd 2/2 Running 0 10h
rook-ceph-tools-76bc89666b-s22lk 1/1 Running 0 10h
[root@m1301015 ~]#
[root@m1301015 ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
master-0.m1301015ocs.lnxne.boe Ready master 11h v1.20.0+551f7b2
master-1.m1301015ocs.lnxne.boe Ready master 11h v1.20.0+551f7b2
master-2.m1301015ocs.lnxne.boe Ready master 11h v1.20.0+551f7b2
worker-0.m1301015ocs.lnxne.boe NotReady worker 11h v1.20.0+551f7b2
worker-1.m1301015ocs.lnxne.boe Ready worker 11h v1.20.0+551f7b2
worker-2.m1301015ocs.lnxne.boe Ready worker 11h v1.20.0+551f7b2
[root@m1301015 ~]#
```
I don't think OCS has anything to do with the whole node going bad. Is this reproducible? If yes, please keep the cluster intact once you hit it again. Jose, can you please take a look?
@muagarwa I have reproduced the issue again. Let me know if you want to have a look (it can be done during a call).
Hi, This test kills the MGR Ceph daemon on the node. However, in comment #2, there is one node reported as NotReady. I suspect that the problem did not start with the ceph MGR daemon kill test case, but in a test prior to it, which brought the node down.
I went over the test execution logs and saw that Ceph health was OK when the test started (tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]):
```
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]
-------------------------------- live log setup --------------------------------
02:10:35 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK
02:10:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc wait --for condition=ready pod -l app=rook-ceph-tools -n openshift-storage --timeout=120s
02:10:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get pod -l 'app=rook-ceph-tools' -o jsonpath='{.items[0].metadata.name}'
02:10:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-76bc89666b-s22lk -- ceph health
02:10:36 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.
02:10:36 - MainThread - tests.conftest - INFO - Ceph health check passed at setup
```
The test logs started showing errors here:
```
02:15:47 - MainThread - tests.manage.pv_services.test_ceph_daemon_kill_during_resource_creation - INFO - FIO is success on pod pod-test-cephfs-24ab233ee77f4ba28a2f3e35
02:15:47 - MainThread - ocs_ci.ocs.resources.pod - INFO - Waiting for FIO results from pod pod-test-cephfs-553b225fd1e14f949b1a72d6
02:23:16 - MainThread - ocs_ci.ocs.resources.pod - ERROR - Found Exception: Command '['oc', '-n', 'namespace-test-fa46cbc3dd344139becf318f3', 'rsh', 'pod-test-cephfs-553b225fd1e14f949b1a72d6', 'fio', '--name=fio-rand-readwrite', '--filename=/var/lib/www/html/pod-test-cephfs-553b225fd1e14f949b1a72d6_io_file1', '--readwrite=randrw', '--bs=4K', '--direct=1', '--numjobs=1', '--time_based=1', '--runtime=30', '--size=2G', '--iodepth=4', '--invalidate=1', '--fsync_on_close=1', '--rwmixread=75', '--ioengine=libaio', '--rate=1m', '--rate_process=poisson', '--output-format=json']' timed out after 600 seconds
Traceback (most recent call last):
  File "/usr/lib64/python3.6/subprocess.py", line 425, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib64/python3.6/subprocess.py", line 863, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib64/python3.6/subprocess.py", line 1535, in _communicate
    self._check_timeout(endtime, orig_timeout)
  File "/usr/lib64/python3.6/subprocess.py", line 891, in _check_timeout
    raise TimeoutExpired(self.args, orig_timeout)
```
So there is a possibility that worker-0.m1301015ocs.lnxne.boe became NotReady during this specific test execution. This could have happened due to one of the following:
- Something in the test caused the node to get to NotReady state, either due to a problem with the test or a product bug
- The node became NotReady due to an environment issue
Jilju, could you please take a look?
(In reply to Elad from comment #7)
> Jilju, could you please take a look?

Hi Elad,

I checked the attached logs. The test case killed the mgr daemon at some point and verified that it got restarted. After that, the test case ran for another 5 minutes creating PVCs and pods before the failure. I don't think there is anything in the test case that caused the error. We will need to see what caused the node to go down. An OCP must-gather can reveal more on this.
Hi Abdul, could you please also attach the OCP must-gather?
Hi @ebenahar, I have reproduced the error on a different cluster and uploaded all the logs to Google Drive (due to the size restriction in Bugzilla). Please find the Google Drive link below:
https://drive.google.com/file/d/1Z7jn7ppfCfvfGZOB8-jYGTlzG6h7fevF/view?usp=sharing
Hi Abdul, For some reason, I see that only the logs from the master nodes have been collected in the OCP must-gather. The problematic node is a worker, and I can't find the logs of the worker nodes.
The command I used is "oc adm must-gather". I thought it would gather all logs, including those from the worker nodes. Hi @muagarwa, may I know whether there are any additional flags to gather OCP logs including the worker nodes?
Someone from QE or pkundra should be able to help
The same behaviour is observed in tier4a tests as well when running tests related to "pv_services". The tests executed as part of tier4a belong to the Python class "tests/manage/pv_services/test_pvc_disruptive.py::TestPVCDisruption".

Description: Base function for PVC disruptive tests. Deletion of 'resource_to_delete' will be introduced while 'operation_to_disrupt' is progressing.

OCP Version: 4.7.3
OCS Version: latest-stable-4.7 (4.7.0-327.ci)
RHCOS Version: 4.7.0-s390x
OCS-CI: commit 0d371476e5949ecc118ab3fad142889ef4ccb860
Created attachment 1770175 [details] tier4a logs
Hi @ebenahar, As the worker node itself is unhealthy, I am not sure whether must-gather can collect any logs from that node. Is there an alternative, such as continuously collecting logs during the test run? If yes, can you please share instructions for it?
Hi Abdul, For continuously collecting logs during the execution, I think you can run something like this from another terminal while the tests are running:
```
i=0; while true; do mkdir node_logs_$i; for x in $(oc get nodes | grep worker | awk '{print $1}'); do oc adm node-logs $x >> node_logs_$i/$x.logs; done; i=$((i+1)); done
```
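If `oc adm node-logs` in this OCP version supports the `--role` and `-u` flags (an assumption worth verifying locally), a simpler variant that snapshots the kubelet journal from all workers in one pass could look like:
```
# Capture the kubelet journal from every worker node into a timestamped file;
# rerun periodically to keep successive snapshots.
oc adm node-logs --role=worker -u kubelet > worker-kubelet-$(date +%s).log
```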
Hi @ebenahar

Collected must-gather again, along with the node logs as mentioned in the previous comment. Please find the logs in Google Drive: https://drive.google.com/file/d/10syhXmVh3YPjzKoqRc7crVdYw7NpeAB1/view?usp=sharing

This was the status of the nodes after the run:
```
(.venv) [root@m13lp43 ocs-ci]# oc get nodes
NAME STATUS ROLES AGE VERSION
test1-dkblv-master-0 Ready master 179m v1.20.0+bafe72f
test1-dkblv-master-1 Ready master 179m v1.20.0+bafe72f
test1-dkblv-master-2 Ready master 179m v1.20.0+bafe72f
test1-dkblv-worker-0-5n42b Ready worker 174m v1.20.0+bafe72f
test1-dkblv-worker-0-bcxcg NotReady worker 174m v1.20.0+bafe72f
test1-dkblv-worker-0-hn5gg NotReady worker 172m v1.20.0+bafe72f
(.venv) [root@m13lp43 ocs-ci]#
```
Note: After rebooting these nodes, they returned to Ready state. The node logs collected after the reboot are kept in a separate directory in the zip file.
Thanks Abdul, Jilju, can you please take a look?
(In reply to Elad from comment #21)
> Thanks Abdul,
>
> Jilju, can you please take a look?

Hi Elad,

I couldn't gather much information that points to the root cause. In comment #15 the issue occurred while running a different test case than the one initially reported. In all three occurrences (comment #c0, comment #c15, comment #c20) the node became NotReady during the execution of fio on multiple app pods at the same time.
Thanks Jilju for examining this. So basically, it seems that the issue here is not with a specific test scenario but rather with the FIO load that runs during some of the tests.
Jilju - are we encountering anything similar on non-IBM platforms? I assume the nodes in the IBM execution have similar specs to the ones we use on other platforms and that the difference is the architecture.
Experienced a similar issue while running the scale tier. Please find the tier run logs in Google Drive: https://drive.google.com/file/d/10TLXCrkr4gUShQGG16fmPDbgqxfV5Qhw/view?usp=sharing

After running for a longer time I had to cancel the tier run, and below is the current status of the cluster:
```
[root@m1301015 ~]# oc -n openshift-storage get po
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-4zxcn 3/3 Running 0 22h
csi-cephfsplugin-provisioner-6f5dd9fc87-5dk8l 6/6 Terminating 0 22h
csi-cephfsplugin-provisioner-6f5dd9fc87-9ghww 6/6 Terminating 0 22h
csi-cephfsplugin-provisioner-6f5dd9fc87-lb6ng 0/6 Pending 0 19h
csi-cephfsplugin-provisioner-6f5dd9fc87-rktlj 0/6 Pending 0 19h
csi-cephfsplugin-sb5fs 3/3 Running 0 22h
csi-cephfsplugin-zpz86 3/3 Running 0 22h
csi-rbdplugin-2wsnl 3/3 Running 0 22h
csi-rbdplugin-cq7lp 3/3 Running 0 22h
csi-rbdplugin-provisioner-5555796984-dckjd 0/6 Pending 0 19h
csi-rbdplugin-provisioner-5555796984-gn6ps 6/6 Terminating 0 22h
csi-rbdplugin-provisioner-5555796984-l4lfz 0/6 Pending 0 19h
csi-rbdplugin-provisioner-5555796984-prhk5 6/6 Terminating 0 22h
csi-rbdplugin-rfz6n 3/3 Running 0 22h
noobaa-core-0 1/1 Terminating 0 22h
noobaa-db-pg-0 1/1 Terminating 0 22h
noobaa-endpoint-845ff84644-nd2mf 0/1 Pending 0 19h
noobaa-endpoint-845ff84644-t5lrb 1/1 Terminating 0 22h
noobaa-operator-558c448c59-cff9f 1/1 Terminating 0 23h
noobaa-operator-558c448c59-w8c5x 0/1 Pending 0 19h
ocs-metrics-exporter-7b686f76c4-6ql4v 1/1 Terminating 0 23h
ocs-metrics-exporter-7b686f76c4-shwff 0/1 Pending 0 19h
ocs-operator-6d887c8fbc-9v7qj 1/1 Terminating 0 23h
ocs-operator-6d887c8fbc-jcw5k 0/1 Pending 0 19h
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-877cjx5 0/1 Pending 0 19h
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-87szg2j 1/1 Terminating 0 22h
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-76lnhnq 0/1 Pending 0 19h
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-76zn4pw 1/1 Terminating 0 22h
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-c9tqv24 0/1 Pending 0 19h
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-c9tztgr 1/1 Terminating 0 22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78887c7dl85tj 0/2 Pending 0 19h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78887c7dwfb8w 2/2 Terminating 0 20h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-76dbb8fcfpthh 0/2 Pending 0 19h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-76dbb8fcj2fp9 2/2 Terminating 0 22h
rook-ceph-mgr-a-6cdd8f8fc4-84h45 0/2 Pending 0 7h17m
rook-ceph-mon-a-847c84b9b9-l7p5g 0/2 Pending 0 19h
rook-ceph-mon-a-847c84b9b9-qtvll 2/2 Terminating 0 20h
rook-ceph-mon-b-658c4d656-7j2rn 2/2 Terminating 0 22h
rook-ceph-mon-b-658c4d656-nv8r7 0/2 Pending 0 19h
rook-ceph-mon-c-57f5c9b84-gnx4n 0/2 Pending 0 19h
rook-ceph-mon-c-57f5c9b84-nchkr 2/2 Terminating 0 22h
rook-ceph-operator-5fcdd8fd6d-kp2cd 0/1 Pending 0 19h
rook-ceph-operator-5fcdd8fd6d-w2777 1/1 Terminating 0 23h
rook-ceph-osd-0-8fb66ddb8-2qhcc 2/2 Terminating 0 20h
rook-ceph-osd-0-8fb66ddb8-jpw7v 0/2 Pending 0 19h
rook-ceph-osd-1-6fb8445d6f-7dm9f 0/2 Pending 0 19h
rook-ceph-osd-1-6fb8445d6f-nl7pg 2/2 Terminating 0 22h
rook-ceph-osd-2-65c7f5949d-gqpvz 0/2 Pending 0 19h
rook-ceph-osd-2-65c7f5949d-s57rw 2/2 Terminating 0 22h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6695cdc6c2bc 0/2 Pending 0 19h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6695cdcfvzgn 2/2 Terminating 0 22h
rook-ceph-tools-599b8f4774-2s26r 1/1 Terminating 0 22h
rook-ceph-tools-599b8f4774-552zl 0/1 Pending 0 19h
worker-0m1301015ocslnxneboe-debug 1/1 Terminating 0 19h
[root@m1301015 ~]#
[root@m1301015 ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
master-0.m1301015ocs.lnxne.boe Ready master 23h v1.20.0+551f7b2
master-1.m1301015ocs.lnxne.boe Ready master 23h v1.20.0+551f7b2
master-2.m1301015ocs.lnxne.boe Ready master 23h v1.20.0+551f7b2
worker-0.m1301015ocs.lnxne.boe NotReady worker 23h v1.20.0+551f7b2
worker-1.m1301015ocs.lnxne.boe NotReady worker 23h v1.20.0+551f7b2
worker-2.m1301015ocs.lnxne.boe NotReady worker 23h v1.20.0+551f7b2
[root@m1301015 ~]#
```
Hi Mudit/Jose, The failed test cases are old and were stable. There are four failure instances of 3 different test cases recorded in this bug, so I think we cannot rule out the possibility of a regression. WDYT?
Moving back to 4.7 till we have an RCA.
FWIW, I did some searching upstream to see if there are similar reports. I found this: https://github.com/Azure/AKS/issues/102
- Could this be a disk performance issue? (Maybe there is a lot of logging traffic on the OS disk?)
- Or possibly the node's resources getting overloaded? (Do all pods have resource requests/limits, or is the OS taking up too many resources?)
@Abdul/Jilju How do the node resources (memory/disk) in this CI environment compare to the node resources in other CI environments where the CI passes? If the nodes are going down unexpectedly, I suspect there are not sufficient resources.
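To compare environments, a quick sketch of commands that could capture each cluster's worker capacity and current consumption (assumes the metrics API is available for `oc adm top`; `<node-name>` is a placeholder):
```
# Capacity per node as reported by the API
oc get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory

# Live usage (needs the cluster metrics API)
oc adm top nodes

# Requests/limits already committed on a suspect node
oc describe node <node-name> | grep -A 10 "Allocated resources"
```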
Also, can we check whether this is reproducible on 4.6 or not?
@tnielsen @muagarwa, We didn't have this issue with the same tests in OCS 4.6 on IBM Z. Each worker node has 64GB of memory and 16 cores.
This looks like a resource issue to me.

I have looked through the logs provided by Jilju.

This is one of the affected nodes: musoni2-mwkff-worker-0-vd548
```
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME LABELS
musoni2-mwkff-worker-0-vd548 NotReady worker 46h v1.20.0+5fbfd19 10.1.11.198 <none> Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=musoni2-mwkff-worker-0-vd548,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
```
And here is the output from "oc describe nodes" for the same node:
```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                11102m (71%)   10 (64%)
  memory             30946Mi (99%)  26Gi (85%)   ==================>>
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
Events:              <none>

Name:   musoni2-mwkff-worker-0-vd548
Roles:  worker
```
It looks like the node's CPU or RAM usage is reaching ~100%, and in such a case the services running on the node won't be able to run.

QE has two disruptive test cases, one during resource_creation and another during resource_deletion. As far as I have observed (QE can keep me honest), these failures are happening only in the resource_creation test case.

I was going through the official Kubernetes docs and noticed that pods can consume all the available capacity on a node by default. This is an issue because nodes typically run quite a few system daemons that power the OS and Kubernetes itself. Unless resources are set aside for these system daemons, pods and system daemons compete for resources and lead to resource starvation issues on the node.
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

So, this might be the case here, but I am not sure why we are seeing this now. In any case, this shouldn't be an OCS issue.
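If reserving resources for system daemons is the direction, a hedged sketch of how that is typically done on OpenShift via a KubeletConfig CR follows; the pool label, CR name, and the 500m/1Gi values are illustrative assumptions, not tuned recommendations for these clusters:
```
# Label the worker MachineConfigPool so the KubeletConfig below can select it
# (label key/value chosen here purely for illustration).
oc label machineconfigpool worker custom-kubelet=reserve-system-resources

cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: reserve-system-resources
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: reserve-system-resources
  kubeletConfig:
    # Hold back some CPU/memory for the OS and system daemons so pods
    # cannot starve the node itself.
    systemReserved:
      cpu: 500m
      memory: 1Gi
EOF
```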
@muagarwa, I can reproduce the same issue with the test mentioned in the description using OCS 4.6.2. Other components used: OCP 4.7.2, LSO 4.7.0. I will try again on OCP 4.6 and LSO 4.6, as our previous tests in February were passing.
(In reply to Mudit Agarwal from comment #35)
> This looks like a resource issue to me.
>
> I have looked through the logs provided by Jilju.
>
> This is one of the affected nodes: musoni2-mwkff-worker-0-vd548
>
> NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME LABELS
> musoni2-mwkff-worker-0-vd548 NotReady worker 46h v1.20.0+5fbfd19 10.1.11.198 <none> Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=musoni2-mwkff-worker-0-vd548,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
>
> And here is the output from "oc describe nodes" for the same node
>
> Allocated resources:
>   (Total limits may be over 100 percent, i.e., overcommitted.)
>   Resource           Requests       Limits
>   --------           --------       ------
>   cpu                11102m (71%)   10 (64%)
>   memory             30946Mi (99%)  26Gi (85%)   ==================>>
>   ephemeral-storage  0 (0%)         0 (0%)
>   hugepages-1Gi      0 (0%)         0 (0%)
>   hugepages-2Mi      0 (0%)         0 (0%)
> Events:              <none>
>
> Name: musoni2-mwkff-worker-0-vd548
> Roles: worker
>
> Looks like the node's CPU or RAM usage is reaching ~100% and in such a case
> the services running on the node won't be able to run.
>
> QE has two disruptive test cases one while resource_creation and another
> while resource_deletion, as far as I have observed and QE can keep me honest
> that these failures are happening only while resource_creation test case.

There is a recent run on RHV where the node down issue is seen while running the test case tests/manage/pv_services/test_resource_deletion_during_pvc_pod_deletion_and_io.py::TestResourceDeletionDuringMultipleDeleteOperations::test_disruptive_during_pod_pvc_deletion_and_io[CephFileSystem-mgr]

In this case the failure occurred while running fio on pods, even before the mgr pod deletion.

Steps executed before the failure:
1. Created 30 CephFS PVCs of size 3GiB.
2. Created pods to consume these PVCs. Each RWO PVC is attached to one pod and each RWX PVC to 2 pods, so 45 pods were created in total.
3. Started fio on 24 pods. The fio file size is 1G and the runtime is 30 seconds.

One worker node became NotReady during step 3.

must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rhv2/sgatfane-rhv2_20210322T180859/logs/failed_testcase_ocs_logs_1618553355/test_disruptive_during_pod_pvc_deletion_and_io%5bCephFileSystem-mgr%5d_ocs_logs/

Test run: ocs-ci results for sgatfane-OCS4-7-Downstream-OCP4-7-LSO-MON-HOSTPATH-OSD-HDD-RHV-IPI-1AZ-RHCOS-3M-3W-tier4c (BUILD ID: v4.7.0-353.ci RUN ID: 1618553355)
Build url: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/2006/

Worker node memory is 32GB and CPU is 16 cores. In all the failures we have seen, the issue occurred during I/O operations on pods.

> I was going through the kubernetes official doc and noticed that Pods can
> consume all the available capacity on a node by default.
> This is an issue because nodes typically run quite a few system daemons that
> power the OS and Kubernetes itself.
> Unless resources are set aside for these system daemons, pods and system
> daemons compete for resources and lead to resource starvation issues on the
> node.
> https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
>
> So, this might be the case here but I am not sure, why we are seeing this
> now but in any case this shouldn't be an OCS issue.
Created attachment 1774466 [details]
ocs/ocp/lso 4.6 log

The same test passes on a cluster with OCP 4.6.9, LSO 4.6.0 (4.6.0-202104091041.p0), and OCS 4.6.2-233.ci. Please find the logs in the attachment. The cluster is also healthy after the execution:
```
[root@m1301015 ~]# oc -n openshift-storage get po
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-7rr26 3/3 Running 0 50m
csi-cephfsplugin-hhlfw 3/3 Running 0 50m
csi-cephfsplugin-k7jg6 3/3 Running 0 50m
csi-cephfsplugin-provisioner-d8ccd695d-5dm8t 6/6 Running 0 50m
csi-cephfsplugin-provisioner-d8ccd695d-qrs4h 6/6 Running 0 50m
csi-rbdplugin-bz48b 3/3 Running 0 50m
csi-rbdplugin-provisioner-76988fbc89-crx5p 6/6 Running 0 50m
csi-rbdplugin-provisioner-76988fbc89-hdlhd 6/6 Running 0 50m
csi-rbdplugin-tt7ll 3/3 Running 0 50m
csi-rbdplugin-w89nr 3/3 Running 0 50m
noobaa-core-0 1/1 Running 0 48m
noobaa-db-0 1/1 Running 0 48m
noobaa-endpoint-f99cfb6cd-f7nd5 1/1 Running 0 45m
noobaa-operator-55fc95dc4c-ghgck 1/1 Running 0 52m
ocs-metrics-exporter-c5655b599-wcfw8 1/1 Running 0 52m
ocs-operator-c946699b4-7hj4g 1/1 Running 0 52m
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-57sdkhn 1/1 Running 0 49m
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-6cxrpxs 1/1 Running 0 49m
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-86mmlph 1/1 Running 0 48m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-98c94597sh595 1/1 Running 0 47m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5f67ff865n2k4 1/1 Running 0 47m
rook-ceph-mgr-a-597c6b4d96-lnzm2 1/1 Running 1 48m
rook-ceph-mon-a-569577d86c-2fc5b 1/1 Running 0 49m
rook-ceph-mon-b-864d95cf5b-gv9l6 1/1 Running 0 49m
rook-ceph-mon-c-5db6794758-whtb6 1/1 Running 0 48m
rook-ceph-operator-6c97bf77-jzxw2 1/1 Running 0 52m
rook-ceph-osd-0-f9ffd4dc8-zbn7z 1/1 Running 0 48m
rook-ceph-osd-1-57669f7d74-gzxts 1/1 Running 0 48m
rook-ceph-osd-2-7b9fb9cc68-9wgzf 1/1 Running 0 48m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0-ngpmb-47g76 0/1 Completed 0 48m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-g59wk-qcr89 0/1 Completed 0 48m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-8xtrz-wx2cl 0/1 Completed 0 48m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6544c75p7rt9 1/1 Running 0 47m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-846c57fbtp2n 1/1 Running 0 47m
rook-ceph-tools-6fdd868f75-v25pb 1/1 Running 0 47m
[root@m1301015 ~]#
```
@ratamir

The tests causing the issue (known as of now) are listed below:
- tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr] - tier4b
- tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephFileSystem-mgr] - tier4b
- tests/manage/pv_services/test_resource_deletion_during_pvc_pod_deletion_and_io.py::TestResourceDeletionDuringMultipleDeleteOperations::test_disruptive_during_pod_pvc_deletion_and_io[CephFileSystem-mgr] - tier4c
- tests/e2e/scale/test_pv_scale_and_respin_ceph_pods.py::TestPVSTOcsCreatePVCsAndRespinCephPods::test_pv_scale_out_create_pvcs_and_respin_ceph_pods[mgr] - scale (see comment #24)

The issue is reproducible with the following versions:
- OCS 4.7 (with tcmalloc fix), OCP 4.7, LSO 4.7
- OCS 4.6.2 (without tcmalloc fix), OCP 4.7, LSO 4.7

The issue is not reproducible with OCS 4.6.2, OCP 4.6.9, LSO 4.6 (see comment #39).

Please find the logs that it was possible to collect in comment #20, which include the OCP/OCS must-gather and the node logs collected as described in comment #18.
Thanks Abdul for trying out various combinations. So it looks like something has changed in OCP 4.7 as well that is causing these tests to fail. We need to make changes in the test scripts too. I guess we can also ask the OCP team to take a look.
*** Bug 1940860 has been marked as a duplicate of this bug. ***
Hi Abdul, Please check https://bugzilla.redhat.com/show_bug.cgi?id=1953430#c5
Can we try the test cases with OCP 4.7.8, keeping other things intact?
Hi Mudit, I have updated https://bugzilla.redhat.com/show_bug.cgi?id=1953430. I am able to reproduce the same issue on OCP 4.7.8.
@ratamir,

As we discussed in the last meeting, I tried manually reproducing the steps in one of the tests causing this BZ. I followed the ocs-ci logs of the test "tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]".

Steps performed:
- Create a test namespace
- Create 12 PVCs - 6 with RWX and 6 with RWO
- Kill the mgr daemon, by logging into the node it is scheduled on, during the PVC creation
- Wait for all PVCs to reach Bound state
- Create pods claiming the previously created PVCs - 1 each for the RWO PVCs and 2 each for the RWX PVCs
- Install fio on all the pods and run the following fio command on each pod:
  "fio --name=fio-rand-readwrite --filename=/var/lib/www/html/<pod-name>_io_file1 --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=30 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json"

The fio command executed successfully on all the pods, and I don't see any node going to NotReady state during/after performing these steps. May I know whether I missed anything in between?

Below are the versions of the components used:
```
[root@s83lp83 ~]# oc version
Client Version: 4.7.7
Server Version: 4.7.8
Kubernetes Version: v1.20.0+7d0a2b2
[root@s83lp83 ~]#
[root@s83lp83 ~]# oc -n openshift-storage get csv
NAME DISPLAY VERSION REPLACES PHASE
ocs-operator.v4.7.0-377.ci OpenShift Container Storage 4.7.0-377.ci Succeeded
[root@s83lp83 ~]#
[root@s83lp83 ~]# oc -n local-storage get csv
NAME DISPLAY VERSION REPLACES PHASE
local-storage-operator.4.7.0-202104250659.p0 Local Storage 4.7.0-202104250659.p0 Succeeded
[root@s83lp83 ~]#
```
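For reference, one way to approximate the daemon-kill step from the command line is sketched below. This is not the exact mechanism ocs-ci uses; the `app=rook-ceph-mgr` label and the `oc debug node` approach are assumptions about this environment:
```
# Find the node hosting the rook-ceph-mgr pod, then kill the ceph-mgr
# process from a debug shell on that node.
NODE=$(oc -n openshift-storage get pod -l app=rook-ceph-mgr \
  -o jsonpath='{.items[0].spec.nodeName}')
oc debug node/"$NODE" -- chroot /host pkill -9 ceph-mgr
```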
Not a blocker for 4.7, moving it out.
(In reply to Abdul Kandathil (IBM) from comment #47)
> @ratamir,
>
> As we discussed in the last meeting, I tried manually reproducing the steps
> in one of the tests causing this BZ. I followed the ocs-ci logs of the test
> "tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]".
>
> Steps performed:
> - Create a test namespace
> - Create 12 PVCs - 6 with RWX and 6 with RWO
> - Kill the mgr daemon, by logging into the node it is scheduled on, during the PVC creation
> - Wait for all PVCs to reach Bound state
> - Create pods claiming the previously created PVCs - 1 each for the RWO PVCs and 2 each for the RWX PVCs
> - Install fio on all the pods and run the following fio command on each pod:
>   "fio --name=fio-rand-readwrite --filename=/var/lib/www/html/<pod-name>_io_file1 --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=30 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json"
>
> The fio command executed successfully on all the pods, and I don't see any
> node going to NotReady state during/after performing these steps. May I know
> whether I missed anything in between?
>
> Below are the versions of the components used:
> [root@s83lp83 ~]# oc version
> Client Version: 4.7.7
> Server Version: 4.7.8
> Kubernetes Version: v1.20.0+7d0a2b2
> [root@s83lp83 ~]# oc -n openshift-storage get csv
> NAME DISPLAY VERSION REPLACES PHASE
> ocs-operator.v4.7.0-377.ci OpenShift Container Storage 4.7.0-377.ci Succeeded
> [root@s83lp83 ~]# oc -n local-storage get csv
> NAME DISPLAY VERSION REPLACES PHASE
> local-storage-operator.4.7.0-202104250659.p0 Local Storage 4.7.0-202104250659.p0 Succeeded

Abdul, as per this comment we were able to reproduce this issue with OCP 4.7.8 (https://bugzilla.redhat.com/show_bug.cgi?id=1945016#c45), but that is not the case here with the same OCP version. Compared to the setup of c#45, is there any difference in the versions of OCS, LSO, ocs-ci, or RHCOS? With this we could check whether it is fixed by the latest versions of any related component or in OCP itself.
Hi @hchiramm,

I don't find a difference in my setup compared to comment #45. I tried running the failing tests mentioned in comment #40 again today, and I see that the below tests are passing:
- tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]
- tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephFileSystem-mgr]

But with the below test, I can reproduce the same issue:
- tests/manage/pv_services/test_resource_deletion_during_pvc_pod_deletion_and_io.py::TestResourceDeletionDuringMultipleDeleteOperations::test_disruptive_during_pod_pvc_deletion_and_io[CephFileSystem-mgr]

Current cluster status:
```
[root@s83lp83 ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
test1-4fw8h-master-0 Ready master 4d17h v1.20.0+7d0a2b2
test1-4fw8h-master-1 Ready master 4d17h v1.20.0+7d0a2b2
test1-4fw8h-master-2 Ready master 4d17h v1.20.0+7d0a2b2
test1-4fw8h-worker-0-5cd2h NotReady worker 4d17h v1.20.0+7d0a2b2
test1-4fw8h-worker-0-tfd6c Ready worker 4d17h v1.20.0+7d0a2b2
test1-4fw8h-worker-0-xx7p5 Ready worker 4d17h v1.20.0+7d0a2b2
[root@s83lp83 ~]#
```
*** Bug 1929188 has been marked as a duplicate of this bug. ***
The tracker BZs are not getting enough traction, and there is nothing we can do in OCS. If this is still an issue, please keep updating the tracker BZ.
*** Bug 1969309 has been marked as a duplicate of this bug. ***
*** Bug 1964958 has been marked as a duplicate of this bug. ***
Bringing it back to 4.8 and proposing this as a blocker because we are hitting it frequently.
Are we actually seeing this frequently in the product, or only in the ocs-ci tests? It could easily be an environment issue, such as not enough memory in the CI.
*** Bug 1970483 has been marked as a duplicate of this bug. ***
The last I heard from the IBM team is that this issue is not reproducible now. Have we seen this recently in internal setups?
I haven't run into this myself, so I have no further data to provide, but maybe Avi did, as he reported BZ 1964958 (a duplicate of this bug) during arbiter latency testing while running IO workload.
Adding NI for Avi
Bug 1970483 - "Nodes go down while running performance suite of tests from ocs-ci" is marked as a duplicate of this bug. 1970483 is still reproducible and a concern, so if you are closing this one then please unlink them. Also, we were told that https://bugzilla.redhat.com/show_bug.cgi?id=1953430 and https://bugzilla.redhat.com/show_bug.cgi?id=1970483 are actually the same. I don't have access to 1953430, so can someone say whether any progress has been made or any solution is available?
This bug is not getting closed, as the IBM team is consistently hitting it. Also, I have made BZ #1953430 public now, so please check whether it is accessible to you.
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1953430#c52, moving to ON_QA for further testing. In case it is observed again, please escalate.
@muagarwa yes, I can access BZ #1953430 now.
Created attachment 1800162 [details] console log of kernel module ceph crash - image 1 The same crash occurred previously with BZ #1970483
Created attachment 1800163 [details] console log of kernel module ceph crash - image 2 Continuation of image 1.
Console logs show a CPU lockup problem in the ceph kernel module.
This is on PowerVS using RHCOS 4.8 RC3, running OCS 4.8 on OCP 4.8. All worker nodes are NotReady, showing the same ceph module issue.
We do not see this behaviour of workers going into NotReady state after the execution of the tier4a, tier4b, and tier4c tests (which include "tests/manage/pv_services/") with the latest release of OCS on IBM Z.
Moving to verified
*** Bug 1989046 has been marked as a duplicate of this bug. ***