Created attachment 1767995 [details]
logs

Description of problem (please be detailed as possible and provide log snippets):
The following OCS-CI test from tier4b breaks the cluster by bringing a worker node to NotReady state and the OCS cluster to an unhealthy state.

Test: tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]

Test description: Base function for ceph daemon kill disruptive tests. Deletion of 'resource_to_delete' daemon will be introduced while 'operation_to_disrupt' is progressing.

Version of all relevant components (if applicable):
OCP: 4.7.3
OCS: 4.7.0-327.ci
LSO: 4.7.0-202103202139.p0
OCS-CI: stable branch checkout, commit id = 49356e581131fd1aaa71c71eff7090bca130a07d

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, it brings a worker node to NotReady state and the OCS cluster to an unhealthy state.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
4

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy local storage with a local disk
2. Deploy the OCS cluster
3. Execute the OCS-CI test: "run-ci --ocsci-conf config.yaml --cluster-path /root/ocp4-workdir/ tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]"

Actual results:
The test fails and leaves a broken cluster.

Expected results:
The test executes successfully, leaving a healthy cluster.

Additional info:
Must-gather logs and OCS-CI logs (entire tier4b; this test fails at 13%) are attached.
State of cluster after the test execution:
```
[root@m1301015 ~]# oc -n openshift-storage get po
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-9gttv 3/3 Running 0 11h
csi-cephfsplugin-kdr24 3/3 Running 0 11h
csi-cephfsplugin-mm89d 3/3 Running 0 11h
csi-cephfsplugin-provisioner-76b7c894b9-7zfct 6/6 Running 0 11h
csi-cephfsplugin-provisioner-76b7c894b9-wvfld 6/6 Running 0 11h
csi-rbdplugin-8xd6j 3/3 Running 0 11h
csi-rbdplugin-provisioner-5866f86d44-dt7lj 6/6 Running 0 8h
csi-rbdplugin-provisioner-5866f86d44-kzwk4 6/6 Running 0 11h
csi-rbdplugin-skj2r 3/3 Running 0 11h
csi-rbdplugin-xvmdk 3/3 Running 0 11h
noobaa-core-0 1/1 Terminating 0 10h
noobaa-db-pg-0 1/1 Running 0 10h
noobaa-endpoint-94dc487d6-rfc86 1/1 Running 0 10h
noobaa-operator-8b6c658f-j9bq9 1/1 Running 0 8h
noobaa-operator-8b6c658f-z5nrz 1/1 Terminating 0 11h
ocs-metrics-exporter-5f5679bdb8-tcqcm 1/1 Running 0 11h
ocs-operator-8664f5945f-hvk6h 1/1 Running 0 11h
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-75z52tv 1/1 Running 0 10h
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-7dvfbwc 1/1 Running 0 10h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-554845d72cxwc 2/2 Running 0 8h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-58c55c94zr6jq 2/2 Running 0 10h
rook-ceph-mgr-a-78c79f5db4-5z8zf 2/2 Running 4 10h
rook-ceph-mon-a-94847fb95-q8bqx 2/2 Running 3 10h
rook-ceph-mon-b-59cc54575f-fxgvl 2/2 Running 0 10h
rook-ceph-mon-c-7d7d8c847-pd54p 0/2 Pending 0 41s
rook-ceph-mon-d-7dd8d46684-q9ncb 0/2 Pending 0 8h
rook-ceph-operator-74795b5c46-wt4s4 1/1 Running 0 11h
rook-ceph-osd-0-5f567749bd-t6l8r 2/2 Running 3 10h
rook-ceph-osd-1-6c59f4ff4c-ngbvj 2/2 Running 0 10h
rook-ceph-osd-2-6866798b97-n95qk 0/2 Pending 0 8h
rook-ceph-osd-prepare-ocs-deviceset-0-data-0dlcbd-j8vh9 0/1 Completed 0 10h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0r4gpr-2bzjv 0/1 Completed 0 10h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-764c689wzdxd 2/2 Running 0 10h
rook-ceph-tools-76bc89666b-s22lk 1/1 Running 0 10h
[root@m1301015 ~]#
[root@m1301015 ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
master-0.m1301015ocs.lnxne.boe Ready master 11h v1.20.0+551f7b2
master-1.m1301015ocs.lnxne.boe Ready master 11h v1.20.0+551f7b2
master-2.m1301015ocs.lnxne.boe Ready master 11h v1.20.0+551f7b2
worker-0.m1301015ocs.lnxne.boe NotReady worker 11h v1.20.0+551f7b2
worker-1.m1301015ocs.lnxne.boe Ready worker 11h v1.20.0+551f7b2
worker-2.m1301015ocs.lnxne.boe Ready worker 11h v1.20.0+551f7b2
[root@m1301015 ~]#
```
I don't think OCS has anything to do with the whole node going bad. Is this reproducible? If yes, please keep the cluster intact once you hit it again. Jose, can you please take a look?
@muagarwa I have reproduced the issue again. Let me know if you want to have a look (it can be done during a call).
Hi, This test kills the MGR Ceph daemon on the node. However, in comment #2, there is one node reported as NotReady. I suspect that the problem did not start with the ceph MGR daemon kill test case, but in a test prior to it, which brought the node down.
I went over the test execution logs and saw that Ceph health was OK when the test started (tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]):
```
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]
-------------------------------- live log setup --------------------------------
02:10:35 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK
02:10:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc wait --for condition=ready pod -l app=rook-ceph-tools -n openshift-storage --timeout=120s
02:10:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get pod -l 'app=rook-ceph-tools' -o jsonpath='{.items[0].metadata.name}'
02:10:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-76bc89666b-s22lk -- ceph health
02:10:36 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.
02:10:36 - MainThread - tests.conftest - INFO - Ceph health check passed at setup
```
The test logs started showing errors here:
```
02:15:47 - MainThread - tests.manage.pv_services.test_ceph_daemon_kill_during_resource_creation - INFO - FIO is success on pod pod-test-cephfs-24ab233ee77f4ba28a2f3e35
02:15:47 - MainThread - ocs_ci.ocs.resources.pod - INFO - Waiting for FIO results from pod pod-test-cephfs-553b225fd1e14f949b1a72d6
02:23:16 - MainThread - ocs_ci.ocs.resources.pod - ERROR - Found Exception: Command '['oc', '-n', 'namespace-test-fa46cbc3dd344139becf318f3', 'rsh', 'pod-test-cephfs-553b225fd1e14f949b1a72d6', 'fio', '--name=fio-rand-readwrite', '--filename=/var/lib/www/html/pod-test-cephfs-553b225fd1e14f949b1a72d6_io_file1', '--readwrite=randrw', '--bs=4K', '--direct=1', '--numjobs=1', '--time_based=1', '--runtime=30', '--size=2G', '--iodepth=4', '--invalidate=1', '--fsync_on_close=1', '--rwmixread=75', '--ioengine=libaio', '--rate=1m', '--rate_process=poisson', '--output-format=json']' timed out after 600 seconds
Traceback (most recent call last):
  File "/usr/lib64/python3.6/subprocess.py", line 425, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib64/python3.6/subprocess.py", line 863, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib64/python3.6/subprocess.py", line 1535, in _communicate
    self._check_timeout(endtime, orig_timeout)
  File "/usr/lib64/python3.6/subprocess.py", line 891, in _check_timeout
    raise TimeoutExpired(self.args, orig_timeout)
```
So there is a possibility that worker-0.m1301015ocs.lnxne.boe became NotReady during this specific test execution. This could have happened due to one of the following:
- Something in the test caused the node to get to NotReady state, either due to a problem with the test or a product bug
- The node became NotReady due to an environment issue
Jilju, could you please take a look?
(In reply to Elad from comment #7)
> Jilju, could you please take a look?

Hi Elad,

I checked the attached logs. The test case killed the mgr daemon at some point and verified that it got restarted. After that, the test case ran for another 5 minutes creating PVCs and pods before the failure. I don't think there is anything in the test case that caused the error. We will need to see what caused the node to go down. An OCP must-gather can reveal more on this.
Hi Abdul, could you please also attach the OCP must-gather?
Hi @ebenahar, I have reproduced the error on a different cluster and uploaded all the logs to Google Drive (due to the size restriction in Bugzilla). Please find the Google Drive link below:
https://drive.google.com/file/d/1Z7jn7ppfCfvfGZOB8-jYGTlzG6h7fevF/view?usp=sharing
Hi Abdul, For some reason, I see that only the logs from the master nodes have been collected in the OCP must-gather. The problematic node is a worker, and I can't find the logs of the worker nodes.
The command I used is "oc adm must-gather". I thought it would gather all logs, including those from the worker nodes. Hi @muagarwa, may I know whether there are any additional flags to gather OCP logs including the worker nodes?
Someone from QE or pkundra should be able to help
The same behaviour is observed in tier4a tests as well when running tests related to "pv_services". The tests executed as part of tier4a belong to the Python class "tests/manage/pv_services/test_pvc_disruptive.py::TestPVCDisruption".

Description: Base function for PVC disruptive tests. Deletion of 'resource_to_delete' will be introduced while 'operation_to_disrupt' is progressing.

OCP Version: 4.7.3
OCS Version: latest-stable-4.7 (4.7.0-327.ci)
RHCOS Version: 4.7.0-s390x
OCS-CI: commit 0d371476e5949ecc118ab3fad142889ef4ccb860
Created attachment 1770175 [details] tier4a logs
Hi @ebenahar, As the worker node itself is unhealthy, I am not sure whether must-gather can collect any logs from that node. Is there an alternative, such as continuously collecting logs during the test run? If yes, can you please share instructions for it?
Hi Abdul, For continuously collecting logs during the execution, I think you can run something like this from another terminal while the tests are running:
```
i=0; while true; do mkdir node_logs_$i; for x in $(oc get nodes | grep worker | awk '{print $1}'); do oc adm node-logs $x >> node_logs_$i/$x.logs; done; i=$((i+1)); done
```
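If `oc adm node-logs` in this OCP version supports the `--role` and `-u` flags (an assumption worth verifying locally), a simpler variant that snapshots the kubelet journal from all workers in one pass could look like:
```
# Capture the kubelet journal from every worker node into a timestamped file;
# rerun periodically to keep successive snapshots.
oc adm node-logs --role=worker -u kubelet > worker-kubelet-$(date +%s).log
```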
Hi @ebenahar

Collected must-gather again, along with the node logs as mentioned in the previous comment. Please find the logs in Google Drive: https://drive.google.com/file/d/10syhXmVh3YPjzKoqRc7crVdYw7NpeAB1/view?usp=sharing

This was the status of the nodes after the run:
```
(.venv) [root@m13lp43 ocs-ci]# oc get nodes
NAME STATUS ROLES AGE VERSION
test1-dkblv-master-0 Ready master 179m v1.20.0+bafe72f
test1-dkblv-master-1 Ready master 179m v1.20.0+bafe72f
test1-dkblv-master-2 Ready master 179m v1.20.0+bafe72f
test1-dkblv-worker-0-5n42b Ready worker 174m v1.20.0+bafe72f
test1-dkblv-worker-0-bcxcg NotReady worker 174m v1.20.0+bafe72f
test1-dkblv-worker-0-hn5gg NotReady worker 172m v1.20.0+bafe72f
(.venv) [root@m13lp43 ocs-ci]#
```
Note: After rebooting these nodes, they returned to Ready state. The node logs collected after the reboot are kept in a separate directory in the zip file.
Thanks Abdul, Jilju, can you please take a look?
(In reply to Elad from comment #21)
> Thanks Abdul,
>
> Jilju, can you please take a look?

Hi Elad,

I couldn't gather much information that points to the root cause. In comment #15 the issue occurred while running a different test case than the one initially reported. In all three occurrences (comment #c0, comment #c15, comment #c20) the node became NotReady during the execution of fio on multiple app pods at the same time.
Thanks Jilju for examining this. So basically, it seems that the issue here is not with a specific test scenario but rather with the FIO load that runs during some of the tests.
Jilju - are we encountering anything similar on non-IBM platforms? I assume the nodes in the IBM execution have similar specs to the ones we use on other platforms and that the difference is the architecture.
Experienced a similar issue while running the scale tier. Please find the tier run logs in Google Drive: https://drive.google.com/file/d/10TLXCrkr4gUShQGG16fmPDbgqxfV5Qhw/view?usp=sharing

After running for a longer time I had to cancel the tier run, and below is the current status of the cluster:
```
[root@m1301015 ~]# oc -n openshift-storage get po
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-4zxcn 3/3 Running 0 22h
csi-cephfsplugin-provisioner-6f5dd9fc87-5dk8l 6/6 Terminating 0 22h
csi-cephfsplugin-provisioner-6f5dd9fc87-9ghww 6/6 Terminating 0 22h
csi-cephfsplugin-provisioner-6f5dd9fc87-lb6ng 0/6 Pending 0 19h
csi-cephfsplugin-provisioner-6f5dd9fc87-rktlj 0/6 Pending 0 19h
csi-cephfsplugin-sb5fs 3/3 Running 0 22h
csi-cephfsplugin-zpz86 3/3 Running 0 22h
csi-rbdplugin-2wsnl 3/3 Running 0 22h
csi-rbdplugin-cq7lp 3/3 Running 0 22h
csi-rbdplugin-provisioner-5555796984-dckjd 0/6 Pending 0 19h
csi-rbdplugin-provisioner-5555796984-gn6ps 6/6 Terminating 0 22h
csi-rbdplugin-provisioner-5555796984-l4lfz 0/6 Pending 0 19h
csi-rbdplugin-provisioner-5555796984-prhk5 6/6 Terminating 0 22h
csi-rbdplugin-rfz6n 3/3 Running 0 22h
noobaa-core-0 1/1 Terminating 0 22h
noobaa-db-pg-0 1/1 Terminating 0 22h
noobaa-endpoint-845ff84644-nd2mf 0/1 Pending 0 19h
noobaa-endpoint-845ff84644-t5lrb 1/1 Terminating 0 22h
noobaa-operator-558c448c59-cff9f 1/1 Terminating 0 23h
noobaa-operator-558c448c59-w8c5x 0/1 Pending 0 19h
ocs-metrics-exporter-7b686f76c4-6ql4v 1/1 Terminating 0 23h
ocs-metrics-exporter-7b686f76c4-shwff 0/1 Pending 0 19h
ocs-operator-6d887c8fbc-9v7qj 1/1 Terminating 0 23h
ocs-operator-6d887c8fbc-jcw5k 0/1 Pending 0 19h
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-877cjx5 0/1 Pending 0 19h
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-87szg2j 1/1 Terminating 0 22h
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-76lnhnq 0/1 Pending 0 19h
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-76zn4pw 1/1 Terminating 0 22h
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-c9tqv24 0/1 Pending 0 19h
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-c9tztgr 1/1 Terminating 0 22h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78887c7dl85tj 0/2 Pending 0 19h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-78887c7dwfb8w 2/2 Terminating 0 20h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-76dbb8fcfpthh 0/2 Pending 0 19h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-76dbb8fcj2fp9 2/2 Terminating 0 22h
rook-ceph-mgr-a-6cdd8f8fc4-84h45 0/2 Pending 0 7h17m
rook-ceph-mon-a-847c84b9b9-l7p5g 0/2 Pending 0 19h
rook-ceph-mon-a-847c84b9b9-qtvll 2/2 Terminating 0 20h
rook-ceph-mon-b-658c4d656-7j2rn 2/2 Terminating 0 22h
rook-ceph-mon-b-658c4d656-nv8r7 0/2 Pending 0 19h
rook-ceph-mon-c-57f5c9b84-gnx4n 0/2 Pending 0 19h
rook-ceph-mon-c-57f5c9b84-nchkr 2/2 Terminating 0 22h
rook-ceph-operator-5fcdd8fd6d-kp2cd 0/1 Pending 0 19h
rook-ceph-operator-5fcdd8fd6d-w2777 1/1 Terminating 0 23h
rook-ceph-osd-0-8fb66ddb8-2qhcc 2/2 Terminating 0 20h
rook-ceph-osd-0-8fb66ddb8-jpw7v 0/2 Pending 0 19h
rook-ceph-osd-1-6fb8445d6f-7dm9f 0/2 Pending 0 19h
rook-ceph-osd-1-6fb8445d6f-nl7pg 2/2 Terminating 0 22h
rook-ceph-osd-2-65c7f5949d-gqpvz 0/2 Pending 0 19h
rook-ceph-osd-2-65c7f5949d-s57rw 2/2 Terminating 0 22h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6695cdc6c2bc 0/2 Pending 0 19h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6695cdcfvzgn 2/2 Terminating 0 22h
rook-ceph-tools-599b8f4774-2s26r 1/1 Terminating 0 22h
rook-ceph-tools-599b8f4774-552zl 0/1 Pending 0 19h
worker-0m1301015ocslnxneboe-debug 1/1 Terminating 0 19h
[root@m1301015 ~]#
[root@m1301015 ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
master-0.m1301015ocs.lnxne.boe Ready master 23h v1.20.0+551f7b2
master-1.m1301015ocs.lnxne.boe Ready master 23h v1.20.0+551f7b2
master-2.m1301015ocs.lnxne.boe Ready master 23h v1.20.0+551f7b2
worker-0.m1301015ocs.lnxne.boe NotReady worker 23h v1.20.0+551f7b2
worker-1.m1301015ocs.lnxne.boe NotReady worker 23h v1.20.0+551f7b2
worker-2.m1301015ocs.lnxne.boe NotReady worker 23h v1.20.0+551f7b2
[root@m1301015 ~]#
```
Hi Mudit/Jose, The failed test cases are old and were stable. There are four failure instances of 3 different test cases recorded in this bug, so I think we cannot rule out the possibility of a regression. WDYT?
Moving back to 4.7 till we have an RCA.
FWIW, I did some searching upstream to see if there are similar reports. I found this: https://github.com/Azure/AKS/issues/102
- Could this be a disk performance issue? (Maybe there is a lot of logging traffic on the OS disk?)
- Or possibly the node's resources getting overloaded? (Do all pods have resource requests/limits, or is the OS taking up too many resources?)
@Abdul/Jilju How do the node resources (memory/disk) in this CI environment compare to the node resources in other CI environments where the CI passes? If the nodes are going down unexpectedly, I suspect there are not sufficient resources.
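To compare environments, a quick sketch of commands that could capture each cluster's worker capacity and current consumption (assumes the metrics API is available for `oc adm top`; `<node-name>` is a placeholder):
```
# Capacity per node as reported by the API
oc get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory

# Live usage (needs the cluster metrics API)
oc adm top nodes

# Requests/limits already committed on a suspect node
oc describe node <node-name> | grep -A 10 "Allocated resources"
```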
Also, can we check whether this is reproducible on 4.6 or not?
@tnielsen @muagarwa, We didn't have this issue with the same tests in OCS 4.6 on IBM Z. Each worker node has 64GB of memory and 16 cores.
This looks like a resource issue to me.

I have looked through the logs provided by Jilju.

This is one of the affected nodes: musoni2-mwkff-worker-0-vd548
```
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME LABELS
musoni2-mwkff-worker-0-vd548 NotReady worker 46h v1.20.0+5fbfd19 10.1.11.198 <none> Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=musoni2-mwkff-worker-0-vd548,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
```
And here is the output from "oc describe nodes" for the same node:
```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                11102m (71%)   10 (64%)
  memory             30946Mi (99%)  26Gi (85%)   ==================>>
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
Events:              <none>

Name:   musoni2-mwkff-worker-0-vd548
Roles:  worker
```
It looks like the node's CPU or RAM usage is reaching ~100%, and in such a case the services running on the node won't be able to run.

QE has two disruptive test cases, one during resource_creation and another during resource_deletion. As far as I have observed (QE can keep me honest), these failures are happening only in the resource_creation test case.

I was going through the official Kubernetes docs and noticed that pods can consume all the available capacity on a node by default. This is an issue because nodes typically run quite a few system daemons that power the OS and Kubernetes itself. Unless resources are set aside for these system daemons, pods and system daemons compete for resources and lead to resource starvation issues on the node.
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

So, this might be the case here, but I am not sure why we are seeing this now. In any case, this shouldn't be an OCS issue.
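If reserving resources for system daemons is the direction, a hedged sketch of how that is typically done on OpenShift via a KubeletConfig CR follows; the pool label, CR name, and the 500m/1Gi values are illustrative assumptions, not tuned recommendations for these clusters:
```
# Label the worker MachineConfigPool so the KubeletConfig below can select it
# (label key/value chosen here purely for illustration).
oc label machineconfigpool worker custom-kubelet=reserve-system-resources

cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: reserve-system-resources
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: reserve-system-resources
  kubeletConfig:
    # Hold back some CPU/memory for the OS and system daemons so pods
    # cannot starve the node itself.
    systemReserved:
      cpu: 500m
      memory: 1Gi
EOF
```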
@muagarwa, I can reproduce the same issue with the test mentioned in the description using OCS 4.6.2. Other components used: OCP 4.7.2, LSO 4.7.0. I will try again on OCP 4.6 and LSO 4.6, as our previous tests in February were passing.
(In reply to Mudit Agarwal from comment #35)
> This looks like a resource issue to me.
>
> I have looked through the logs provided by Jilju.
>
> This is one of the affected nodes: musoni2-mwkff-worker-0-vd548
>
> NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME LABELS
> musoni2-mwkff-worker-0-vd548 NotReady worker 46h v1.20.0+5fbfd19 10.1.11.198 <none> Red Hat Enterprise Linux CoreOS 47.83.202103041352-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.20.1-2.rhaos4.7.git0c3216a.el8 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/arch=amd64,kubernetes.io/hostname=musoni2-mwkff-worker-0-vd548,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos
>
> And here is the output from "oc describe nodes" for the same node
>
> Allocated resources:
>   (Total limits may be over 100 percent, i.e., overcommitted.)
>   Resource           Requests       Limits
>   --------           --------       ------
>   cpu                11102m (71%)   10 (64%)
>   memory             30946Mi (99%)  26Gi (85%)   ==================>>
>   ephemeral-storage  0 (0%)         0 (0%)
>   hugepages-1Gi      0 (0%)         0 (0%)
>   hugepages-2Mi      0 (0%)         0 (0%)
> Events:              <none>
>
> Name: musoni2-mwkff-worker-0-vd548
> Roles: worker
>
> Looks like the node's CPU or RAM usage is reaching ~100% and in such a case
> the services running on the node won't be able to run.
>
> QE has two disruptive test cases one while resource_creation and another
> while resource_deletion, as far as I have observed and QE can keep me honest
> that these failures are happening only while resource_creation test case.

There is a recent run on RHV where the node down issue is seen while running the test case tests/manage/pv_services/test_resource_deletion_during_pvc_pod_deletion_and_io.py::TestResourceDeletionDuringMultipleDeleteOperations::test_disruptive_during_pod_pvc_deletion_and_io[CephFileSystem-mgr]

In this case the failure occurred while running fio on pods, even before the mgr pod deletion.

Steps executed before the failure:
1. Created 30 CephFS PVCs of size 3GiB.
2. Created pods to consume these PVCs. Each RWO PVC is attached to one pod and each RWX PVC to 2 pods, so 45 pods were created in total.
3. Started fio on 24 pods. The fio file size is 1G and the runtime is 30 seconds.

One worker node became NotReady during step 3.

must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rhv2/sgatfane-rhv2_20210322T180859/logs/failed_testcase_ocs_logs_1618553355/test_disruptive_during_pod_pvc_deletion_and_io%5bCephFileSystem-mgr%5d_ocs_logs/

Test run: ocs-ci results for sgatfane-OCS4-7-Downstream-OCP4-7-LSO-MON-HOSTPATH-OSD-HDD-RHV-IPI-1AZ-RHCOS-3M-3W-tier4c (BUILD ID: v4.7.0-353.ci RUN ID: 1618553355)
Build url: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/2006/

Worker node memory is 32GB and CPU is 16 cores. In all the failures we have seen, the issue occurred during I/O operations on pods.

> I was going through the kubernetes official doc and noticed that Pods can
> consume all the available capacity on a node by default.
> This is an issue because nodes typically run quite a few system daemons that
> power the OS and Kubernetes itself.
> Unless resources are set aside for these system daemons, pods and system
> daemons compete for resources and lead to resource starvation issues on the
> node.
> https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
>
> So, this might be the case here but I am not sure, why we are seeing this
> now but in any case this shouldn't be an OCS issue.
Created attachment 1774466 [details]
ocs/ocp/lso 4.6 log

The same test passes on a cluster with OCP 4.6.9, LSO 4.6.0 (4.6.0-202104091041.p0), and OCS 4.6.2-233.ci. Please find the logs in the attachment. The cluster is also healthy after the execution:
```
[root@m1301015 ~]# oc -n openshift-storage get po
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-7rr26 3/3 Running 0 50m
csi-cephfsplugin-hhlfw 3/3 Running 0 50m
csi-cephfsplugin-k7jg6 3/3 Running 0 50m
csi-cephfsplugin-provisioner-d8ccd695d-5dm8t 6/6 Running 0 50m
csi-cephfsplugin-provisioner-d8ccd695d-qrs4h 6/6 Running 0 50m
csi-rbdplugin-bz48b 3/3 Running 0 50m
csi-rbdplugin-provisioner-76988fbc89-crx5p 6/6 Running 0 50m
csi-rbdplugin-provisioner-76988fbc89-hdlhd 6/6 Running 0 50m
csi-rbdplugin-tt7ll 3/3 Running 0 50m
csi-rbdplugin-w89nr 3/3 Running 0 50m
noobaa-core-0 1/1 Running 0 48m
noobaa-db-0 1/1 Running 0 48m
noobaa-endpoint-f99cfb6cd-f7nd5 1/1 Running 0 45m
noobaa-operator-55fc95dc4c-ghgck 1/1 Running 0 52m
ocs-metrics-exporter-c5655b599-wcfw8 1/1 Running 0 52m
ocs-operator-c946699b4-7hj4g 1/1 Running 0 52m
rook-ceph-crashcollector-worker-0.m1301015ocs.lnxne.boe-57sdkhn 1/1 Running 0 49m
rook-ceph-crashcollector-worker-1.m1301015ocs.lnxne.boe-6cxrpxs 1/1 Running 0 49m
rook-ceph-crashcollector-worker-2.m1301015ocs.lnxne.boe-86mmlph 1/1 Running 0 48m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-98c94597sh595 1/1 Running 0 47m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5f67ff865n2k4 1/1 Running 0 47m
rook-ceph-mgr-a-597c6b4d96-lnzm2 1/1 Running 1 48m
rook-ceph-mon-a-569577d86c-2fc5b 1/1 Running 0 49m
rook-ceph-mon-b-864d95cf5b-gv9l6 1/1 Running 0 49m
rook-ceph-mon-c-5db6794758-whtb6 1/1 Running 0 48m
rook-ceph-operator-6c97bf77-jzxw2 1/1 Running 0 52m
rook-ceph-osd-0-f9ffd4dc8-zbn7z 1/1 Running 0 48m
rook-ceph-osd-1-57669f7d74-gzxts 1/1 Running 0 48m
rook-ceph-osd-2-7b9fb9cc68-9wgzf 1/1 Running 0 48m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0-ngpmb-47g76 0/1 Completed 0 48m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-g59wk-qcr89 0/1 Completed 0 48m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-8xtrz-wx2cl 0/1 Completed 0 48m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6544c75p7rt9 1/1 Running 0 47m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-846c57fbtp2n 1/1 Running 0 47m
rook-ceph-tools-6fdd868f75-v25pb 1/1 Running 0 47m
[root@m1301015 ~]#
```
@ratamir

The tests causing the issue (known as of now) are listed below:
- tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr] - tier4b
- tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephFileSystem-mgr] - tier4b
- tests/manage/pv_services/test_resource_deletion_during_pvc_pod_deletion_and_io.py::TestResourceDeletionDuringMultipleDeleteOperations::test_disruptive_during_pod_pvc_deletion_and_io[CephFileSystem-mgr] - tier4c
- tests/e2e/scale/test_pv_scale_and_respin_ceph_pods.py::TestPVSTOcsCreatePVCsAndRespinCephPods::test_pv_scale_out_create_pvcs_and_respin_ceph_pods[mgr] - scale (see comment #24)

The issue is reproducible with the following versions:
- OCS 4.7 (with tcmalloc fix), OCP 4.7, LSO 4.7
- OCS 4.6.2 (without tcmalloc fix), OCP 4.7, LSO 4.7

The issue is not reproducible with OCS 4.6.2, OCP 4.6.9, LSO 4.6 (see comment #39).

Please find the logs that it was possible to collect in comment #20, which include the OCP/OCS must-gather and the node logs collected as described in comment #18.
Thanks Abdul for trying out various combinations. So it looks like something has changed in OCP 4.7 as well that is causing these tests to fail. We need to make changes in the test scripts too. I guess we can also ask the OCP team to take a look.
*** Bug 1940860 has been marked as a duplicate of this bug. ***
Hi Abdul, Please check https://bugzilla.redhat.com/show_bug.cgi?id=1953430#c5
Can we try the test cases with OCP 4.7.8, keeping other things intact?
Hi Mudit, I have updated https://bugzilla.redhat.com/show_bug.cgi?id=1953430. I am able to reproduce the same issue on OCP 4.7.8.
@ratamir,

As we discussed in the last meeting, I tried manually reproducing the steps in one of the tests causing this BZ. I followed the ocs-ci logs of the test "tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]".

Steps performed:
- Create a test namespace
- Create 12 PVCs - 6 with RWX and 6 with RWO
- Kill the mgr daemon, by logging into the node it is scheduled on, during the PVC creation
- Wait for all PVCs to reach Bound state
- Create pods claiming the previously created PVCs - 1 each for the RWO PVCs and 2 each for the RWX PVCs
- Install fio on all the pods and run the following fio command on each pod:
  "fio --name=fio-rand-readwrite --filename=/var/lib/www/html/<pod-name>_io_file1 --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=30 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json"

The fio command executed successfully on all the pods, and I don't see any node going to NotReady state during/after performing these steps. May I know whether I missed anything in between?

Below are the versions of the components used:
```
[root@s83lp83 ~]# oc version
Client Version: 4.7.7
Server Version: 4.7.8
Kubernetes Version: v1.20.0+7d0a2b2
[root@s83lp83 ~]#
[root@s83lp83 ~]# oc -n openshift-storage get csv
NAME DISPLAY VERSION REPLACES PHASE
ocs-operator.v4.7.0-377.ci OpenShift Container Storage 4.7.0-377.ci Succeeded
[root@s83lp83 ~]#
[root@s83lp83 ~]# oc -n local-storage get csv
NAME DISPLAY VERSION REPLACES PHASE
local-storage-operator.4.7.0-202104250659.p0 Local Storage 4.7.0-202104250659.p0 Succeeded
[root@s83lp83 ~]#
```
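For reference, one way to approximate the daemon-kill step from the command line is sketched below. This is not the exact mechanism ocs-ci uses; the `app=rook-ceph-mgr` label and the `oc debug node` approach are assumptions about this environment:
```
# Find the node hosting the rook-ceph-mgr pod, then kill the ceph-mgr
# process from a debug shell on that node.
NODE=$(oc -n openshift-storage get pod -l app=rook-ceph-mgr \
  -o jsonpath='{.items[0].spec.nodeName}')
oc debug node/"$NODE" -- chroot /host pkill -9 ceph-mgr
```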
Not a blocker for 4.7, moving it out.
(In reply to Abdul Kandathil (IBM) from comment #47)
> @ratamir,
>
> As we discussed in the last meeting, I tried manually reproducing the steps
> in one of the tests causing this BZ. I followed the ocs-ci logs of the test
> "tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]".
>
> Steps performed:
> - Create a test namespace
> - Create 12 PVCs - 6 with RWX and 6 with RWO
> - Kill the mgr daemon, by logging into the node it is scheduled on, during the PVC creation
> - Wait for all PVCs to reach Bound state
> - Create pods claiming the previously created PVCs - 1 each for the RWO PVCs and 2 each for the RWX PVCs
> - Install fio on all the pods and run the following fio command on each pod:
>   "fio --name=fio-rand-readwrite --filename=/var/lib/www/html/<pod-name>_io_file1 --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=30 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json"
>
> The fio command executed successfully on all the pods, and I don't see any
> node going to NotReady state during/after performing these steps. May I know
> whether I missed anything in between?
>
> Below are the versions of the components used:
> [root@s83lp83 ~]# oc version
> Client Version: 4.7.7
> Server Version: 4.7.8
> Kubernetes Version: v1.20.0+7d0a2b2
> [root@s83lp83 ~]# oc -n openshift-storage get csv
> NAME DISPLAY VERSION REPLACES PHASE
> ocs-operator.v4.7.0-377.ci OpenShift Container Storage 4.7.0-377.ci Succeeded
> [root@s83lp83 ~]# oc -n local-storage get csv
> NAME DISPLAY VERSION REPLACES PHASE
> local-storage-operator.4.7.0-202104250659.p0 Local Storage 4.7.0-202104250659.p0 Succeeded

Abdul, as per this comment we were able to reproduce this issue with OCP 4.7.8 (https://bugzilla.redhat.com/show_bug.cgi?id=1945016#c45), but that is not the case here with the same OCP version. Compared to the setup of c#45, is there any difference in the versions of OCS, LSO, ocs-ci, or RHCOS? With this we could check whether it is fixed by the latest versions of any related component or in OCP itself.
Hi @hchiramm,

I don't find a difference in my setup compared to comment #45. I tried running the failing tests mentioned in comment #40 again today, and I see that the below tests are passing:
- tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]
- tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephFileSystem-mgr]

But with the below test, I can reproduce the same issue:
- tests/manage/pv_services/test_resource_deletion_during_pvc_pod_deletion_and_io.py::TestResourceDeletionDuringMultipleDeleteOperations::test_disruptive_during_pod_pvc_deletion_and_io[CephFileSystem-mgr]

Current cluster status:
```
[root@s83lp83 ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
test1-4fw8h-master-0 Ready master 4d17h v1.20.0+7d0a2b2
test1-4fw8h-master-1 Ready master 4d17h v1.20.0+7d0a2b2
test1-4fw8h-master-2 Ready master 4d17h v1.20.0+7d0a2b2
test1-4fw8h-worker-0-5cd2h NotReady worker 4d17h v1.20.0+7d0a2b2
test1-4fw8h-worker-0-tfd6c Ready worker 4d17h v1.20.0+7d0a2b2
test1-4fw8h-worker-0-xx7p5 Ready worker 4d17h v1.20.0+7d0a2b2
[root@s83lp83 ~]#
```
*** Bug 1929188 has been marked as a duplicate of this bug. ***
The tracker BZs are not getting enough traction, and there is nothing we can do in OCS. If this is still an issue, please keep updating the tracker BZ.
*** Bug 1969309 has been marked as a duplicate of this bug. ***
*** Bug 1964958 has been marked as a duplicate of this bug. ***
Bringing it back to 4.8 and proposing this as a blocker because we are hitting it frequently.
Are we actually seeing this frequently in the product, or only in the ocs-ci tests? It could easily be an environment issue, such as not enough memory in the CI.
*** Bug 1970483 has been marked as a duplicate of this bug. ***
The last I heard from the IBM team is that this issue is not reproducible now. Have we seen this recently in internal setups?
I haven't run into this myself, so I have no further data to provide, but maybe Avi did, as he reported BZ 1964958 (a duplicate of this bug) during arbiter latency testing while running IO workload.
Adding NI for Avi
Bug 1970483 - "Nodes go down while running performance suite of tests from ocs-ci" is marked as a duplicate of this bug. 1970483 is still reproducible and a concern, so if you are closing this one then please unlink them. Also, we were told that https://bugzilla.redhat.com/show_bug.cgi?id=1953430 and https://bugzilla.redhat.com/show_bug.cgi?id=1970483 are actually the same. I don't have access to 1953430, so can someone say whether any progress has been made or any solution is available?
This bug is not getting closed, as the IBM team is consistently hitting it. Also, I have made BZ #1953430 public now, so please check whether it is accessible to you.
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1953430#c52, moving to ON_QA for further testing. In case it is observed again, please escalate.
@muagarwa yes, I can access BZ #1953430 now.
Created attachment 1800162 [details] console log of kernel module ceph crash - image 1 The same crash occurred previously with BZ #1970483
Created attachment 1800163 [details] console log of kernel module ceph crash - image 2 Continuation of image 1.
Console logs show a CPU lockup problem in the ceph kernel module.
This is on PowerVS using RHCOS 4.8 RC3, running OCS 4.8 on OCP 4.8. All worker nodes are NotReady, showing the same ceph module issue.
We do not see this behaviour of workers going into NotReady state after the execution of the tier4a, tier4b, and tier4c tests (which include "tests/manage/pv_services/") with the latest release of OCS on IBM Z.
Moving to verified
*** Bug 1989046 has been marked as a duplicate of this bug. ***