Bug 1955042

Summary: [IBM Z] OCS-CI tier4b tests fail due to timeout during pod deletion
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Abdul Kandathil (IBM) <akandath>
Component: csi-driver
Assignee: Humble Chirammal <hchiramm>
Status: CLOSED CURRENTRELEASE
QA Contact: Elad <ebenahar>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.7
CC: madam, mschaefe, muagarwa, ocs-bugs, odf-bz-bot, sostapov
Target Milestone: ---
Target Release: ---
Hardware: s390x
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-09-21 13:26:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Abdul Kandathil (IBM) 2021-04-29 10:43:09 UTC
Description of problem (please be as detailed as possible and provide log
snippets): The following tier4b tests fail due to a timeout during pod deletion.

    - tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mon]
    - tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-osd]
    - tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mds]
    - tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mds]


Version of all relevant components (if applicable):
OCP 4.7.3, LSO 4.7.0-202104142050.p0, OCS 4.7.0-364.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCP + OCS + LSO
2. Execute the above tests using ocs-ci
3.
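
For reference, a minimal sketch of step 2 using the ocs-ci run-ci wrapper; the cluster name, cluster path, and ocs-ci config file are placeholders, and the exact flags depend on the ocs-ci version and environment:

$ run-ci --cluster-name <cluster-name> --cluster-path <cluster-dir> \
      --ocsci-conf <ocsci-config.yaml> \
      "tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mon]"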


Actual results:
Tests fail due to timeout during pod deletion.


Expected results:


Additional info:
 - The same tests were passing with OCS 4.6.2 on OCP 4.6.
 - Test logs and must-gather logs are available on Google Drive due to size restrictions: https://drive.google.com/file/d/1akBK9DdMgDhxjmDXFPGipUP7r7pg5Gep/view?usp=sharing

Comment 2 Mudit Agarwal 2021-06-10 14:20:02 UTC
Didn't get a chance to look into this, but it mostly looks like a sync issue.
Doesn't look like a 4.8 blocker at this point in time, so moving it out.
Will pull it back if required.

Comment 3 Michael Schaefer 2021-07-30 12:46:57 UTC
I have *not* seen these failures anymore when running tier4b on OCS 4.8.0-175.ci, OCP 4.8.2 - candidate-4.8, ocs-ci stable-ocs-4.8-202107251413.

Comment 4 Humble Chirammal 2021-08-02 06:25:52 UTC
(In reply to Michael Schaefer from comment #3)
> I have *not* seen these failures anymore when running tier4b on OCS 4.8.0-175.ci,
> OCP 4.8.2 - candidate-4.8, ocs-ci stable-ocs-4.8-202107251413.

Michael, thanks for the verification. In that case, can we close this Bugzilla?

Comment 5 Michael Schaefer 2021-08-03 07:44:28 UTC
No, please keep this open: the failure occurred again when running the tier4b suite from ocs-ci.

Environment:
OCS 4.8.0-175.ci
OCP 4.8.2 - candidate-4.8
ocs-ci stable-ocs-4.8-202107251413

All tests following
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]
are being skipped because of Ceph health warnings.
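
For reference, the reported health warnings can be inspected from the Rook toolbox, e.g. (a hedged example, assuming the rook-ceph-tools deployment is running in openshift-storage):

$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health detail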

Find must_gather logs for that testcase here: https://drive.google.com/file/d/1jfBqg4UaDvFPoroR7viqvEpdDJB6qxtJ/view?usp=sharing

The sequence of the executed tests is:
  1 PASS @1033.733s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephBlockPool-create_pvc-mgr]
  2 PASS @977.894s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephBlockPool-create_pod-mgr]
  3 PASS @1000.305s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephBlockPool-run_io-mgr]
  4 PASS @1010.164s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephBlockPool-create_pvc-mon]
  5 PASS @974.211s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephBlockPool-create_pod-mon]
  6 PASS @997.931s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephBlockPool-run_io-mon]
  7 PASS @978.874s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephBlockPool-create_pvc-osd]
  8 PASS @991.756s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephBlockPool-create_pod-osd]
  9 PASS @978.594s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephBlockPool-run_io-osd]
 10 FAIL @767.205s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]
 11 ERR @1509.976s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mgr]
 12 SKIP @1.439s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pod-mgr]
 13 SKIP @1.643s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-run_io-mgr]
 14 SKIP @1.447s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mon]
 15 SKIP @1.578s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pod-mon]
 16 SKIP @1.607s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-run_io-mon]
 17 SKIP @1.572s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-osd]
 18 SKIP @1.458s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pod-osd]
 19 SKIP @1.788s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-run_io-osd]
 20 SKIP @1.471s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pvc-mds]
 21 SKIP @1.439s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-create_pod-mds]
 22 SKIP @25.557s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_creation.py::TestDaemonKillDuringResourceCreation::test_ceph_daemon_kill_during_resource_creation[CephFileSystem-run_io-mds]
 23 SKIP @26.362s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephBlockPool-delete_pvcs-mgr]
 24 SKIP @1.667s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephBlockPool-delete_pods-mgr]
 25 SKIP @1.640s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephBlockPool-delete_pvcs-mon]
 26 SKIP @1.593s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephBlockPool-delete_pods-mon]
 27 SKIP @1.534s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephBlockPool-delete_pvcs-osd]
 28 SKIP @1.637s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephBlockPool-delete_pods-osd]
 29 SKIP @1.537s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mgr]
 30 SKIP @1.723s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]
 31 SKIP @1.639s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mon]
 32 SKIP @1.365s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mon]
 33 SKIP @1.404s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-osd]
 34 SKIP @1.746s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-osd]
 35 SKIP @1.335s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mds]
 36 SKIP @25.536s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mds]
 37 SKIP @25.219s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephBlockPool-mgr]
 38 SKIP @1.581s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephBlockPool-mon]
 39 SKIP @1.537s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephBlockPool-osd]
 40 SKIP @1.428s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephFileSystem-mgr]
 41 SKIP @1.499s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephFileSystem-mon]
 42 SKIP @1.460s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephFileSystem-osd]
 43 SKIP @24.250s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py::TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephFileSystem-mds]
 44 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_rwo_pvc_fencing_unfencing.py::TestRwoPVCFencingUnfencing::test_rwo_pvc_fencing_node_prolonged_network_failure[dedicated-2-1-False]
 45 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_rwo_pvc_fencing_unfencing.py::TestRwoPVCFencingUnfencing::test_rwo_pvc_fencing_node_prolonged_network_failure[dedicated-4-3-True]
 46 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_rwo_pvc_fencing_unfencing.py::TestRwoPVCFencingUnfencing::test_rwo_pvc_fencing_node_prolonged_network_failure[colocated-4-1-False]
 47 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/pv_services/test_rwo_pvc_fencing_unfencing.py::TestRwoPVCFencingUnfencing::test_rwo_pvc_fencing_node_prolonged_network_failure[colocated-6-3-True]
 48 SKIP @1.892s ../ocs-ci-4.8-stable/tests/manage/pv_services/pvc_clone/test_node_restart_during_pvc_clone.py::TestNodeRestartDuringPvcClone::test_worker_node_restart_during_pvc_clone
 49 SKIP @50.336s ../ocs-ci-4.8-stable/tests/manage/pv_services/pvc_clone/test_resource_deletion_during_pvc_clone.py::TestResourceDeletionDuringPvcClone::test_resource_deletion_during_pvc_clone
 50 SKIP @1.665s ../ocs-ci-4.8-stable/tests/manage/pv_services/pvc_resize/test_node_restart_during_pvc_expansion.py::TestNodeRestartDuringPvcExpansion::test_worker_node_restart_during_pvc_expansion
 51 SKIP @26.249s ../ocs-ci-4.8-stable/tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[mgr]
 52 SKIP @1.558s ../ocs-ci-4.8-stable/tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[osd]
 53 SKIP @1.365s ../ocs-ci-4.8-stable/tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin]
 54 SKIP @1.485s ../ocs-ci-4.8-stable/tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[cephfsplugin]
 55 SKIP @1.660s ../ocs-ci-4.8-stable/tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin_provisioner]
 56 SKIP @25.314s ../ocs-ci-4.8-stable/tests/manage/pv_services/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[cephfsplugin_provisioner]
 57 SKIP @53.131s ../ocs-ci-4.8-stable/tests/manage/pv_services/pvc_snapshot/test_resource_deletion_during_snapshot_restore.py::TestResourceDeletionDuringSnapshotRestore::test_resource_deletion_during_snapshot_restore
 58 SKIP @0.002s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_automated_recovery_from_failed_nodes_proactive_IPI.py::TestAutomatedRecoveryFromFailedNodes::test_automated_recovery_from_failed_nodes_IPI_proactive[rbd]
 59 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_automated_recovery_from_failed_nodes_proactive_IPI.py::TestAutomatedRecoveryFromFailedNodes::test_automated_recovery_from_failed_nodes_IPI_proactive[cephfs]
 60 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_automated_recovery_from_failed_nodes_reactive_IPI.py::TestAutomatedRecoveryFromFailedNodes::test_automated_recovery_from_failed_nodes_IPI_reactive[rbd-shutdown]
 61 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_automated_recovery_from_failed_nodes_reactive_IPI.py::TestAutomatedRecoveryFromFailedNodes::test_automated_recovery_from_failed_nodes_IPI_reactive[rbd-terminate]
 62 SKIP @0.000s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_automated_recovery_from_failed_nodes_reactive_IPI.py::TestAutomatedRecoveryFromFailedNodes::test_automated_recovery_from_failed_nodes_IPI_reactive[cephfs-shutdown]
 63 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_automated_recovery_from_failed_nodes_reactive_IPI.py::TestAutomatedRecoveryFromFailedNodes::test_automated_recovery_from_failed_nodes_IPI_reactive[cephfs-terminate]
 64 SKIP @0.000s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_az_failure.py::TestAvailabilityZones::test_availability_zone_failure
 65 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_disk_failures.py::TestDiskFailures::test_detach_attach_worker_volume
 66 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_disk_failures.py::TestDiskFailures::test_detach_attach_2_data_volumes
 67 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_disk_failures.py::TestDiskFailures::test_recovery_from_volume_deletion
 68 SKIP @1.832s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_node_replacement_proactive.py::TestNodeReplacementTwice::test_nodereplacement_twice
 69 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_node_replacement_reactive_aws_ipi.py::TestNodeReplacement::test_node_replacement_reactive_aws_ipi[rbd-power off]
 70 SKIP @0.000s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_node_replacement_reactive_aws_ipi.py::TestNodeReplacement::test_node_replacement_reactive_aws_ipi[rbd-network failure]
 71 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_node_replacement_reactive_aws_ipi.py::TestNodeReplacement::test_node_replacement_reactive_aws_ipi[cephfs-power off]
 72 SKIP @0.000s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_node_replacement_reactive_aws_ipi.py::TestNodeReplacement::test_node_replacement_reactive_aws_ipi[cephfs-network failure]
 73 SKIP @6.839s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_nodes_maintenance.py::TestNodesMaintenance::test_node_maintenance_restart_activate[worker]
 74 SKIP @1.510s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_nodes_maintenance.py::TestNodesMaintenance::test_node_maintenance_restart_activate[master]
 75 SKIP @0.001s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_nodes_maintenance.py::TestNodesMaintenance::test_simultaneous_drain_of_two_ocs_nodes[rbd]
 76 SKIP @20.104s ../ocs-ci-4.8-stable/tests/manage/z_cluster/nodes/test_nodes_maintenance.py::TestNodesMaintenance::test_simultaneous_drain_of_two_ocs_nodes[cephfs]


Cluster status after test:
$ oc get node
NAME                                STATUS     ROLES    AGE     VERSION
bootstrap-0.m1301015ocs.lnxne.boe   Ready      worker   16h     v1.21.1+051ac4f
master-0.m1301015ocs.lnxne.boe      Ready      master   4d17h   v1.21.1+051ac4f
master-1.m1301015ocs.lnxne.boe      Ready      master   4d17h   v1.21.1+051ac4f
master-2.m1301015ocs.lnxne.boe      Ready      master   4d17h   v1.21.1+051ac4f
worker-0.m1301015ocs.lnxne.boe      NotReady   worker   4d17h   v1.21.1+051ac4f
worker-1.m1301015ocs.lnxne.boe      Ready      worker   4d17h   v1.21.1+051ac4f
worker-2.m1301015ocs.lnxne.boe      Ready      worker   4d17h   v1.21.1+051ac4f

Ceph status:
$ ./ceph-tool.sh status
ceph health
  cluster:
    id:     79c11403-4fe0-4275-b0f9-1f53ba99fd9a
    health: HEALTH_WARN
            Long heartbeat ping times on back interface seen, longest is 5896.639 msec
            Long heartbeat ping times on front interface seen, longest is 5827.613 msec
 
  services:
    mon: 3 daemons, quorum a,b,c (age 10s)
    mgr: a(active, since 12h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 8h), 6 in (since 4d)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)
 
  data:
    pools:   10 pools, 368 pgs
    objects: 12.03k objects, 44 GiB
    usage:   101 GiB used, 5.9 TiB / 6 TiB avail
    pgs:     368 active+clean
 
  io:
    client:   170 B/s rd, 681 B/s wr, 0 op/s rd, 0 op/s wr
 
Status of the "Not Ready" worker: 
$ oc describe node/worker-0.m1301015ocs.lnxne.boe
Name:               worker-0.m1301015ocs.lnxne.boe
Roles:              worker
Labels:             beta.kubernetes.io/arch=s390x
                    beta.kubernetes.io/os=linux
                    cluster.ocs.openshift.io/openshift-storage=
                    kubernetes.io/arch=s390x
                    kubernetes.io/hostname=worker-0.m1301015ocs.lnxne.boe
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
                    topology.rook.io/rack=rack2
Annotations:        csi.volume.kubernetes.io/nodeid:
                      {"openshift-storage.cephfs.csi.ceph.com":"worker-0.m1301015ocs.lnxne.boe","openshift-storage.rbd.csi.ceph.com":"worker-0.m1301015ocs.lnxne...
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-408f748963da0fca1911c061a0fd93f6
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-408f748963da0fca1911c061a0fd93f6
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 29 Jul 2021 15:45:38 +0200
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  worker-0.m1301015ocs.lnxne.boe
  AcquireTime:     <unset>
  RenewTime:       Mon, 02 Aug 2021 19:48:43 +0200
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Mon, 02 Aug 2021 19:46:19 +0200   Mon, 02 Aug 2021 19:49:23 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Mon, 02 Aug 2021 19:46:19 +0200   Mon, 02 Aug 2021 19:49:23 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Mon, 02 Aug 2021 19:46:19 +0200   Mon, 02 Aug 2021 19:49:23 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Mon, 02 Aug 2021 19:46:19 +0200   Mon, 02 Aug 2021 19:49:23 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:  10.13.1.20
  Hostname:    worker-0.m1301015ocs.lnxne.boe
Capacity:
  cpu:                16
  ephemeral-storage:  125424620Ki
  hugepages-1Mi:      0
  memory:             66026156Ki
  pods:               250
Allocatable:
  cpu:                15500m
  ephemeral-storage:  115591329601
  hugepages-1Mi:      0
  memory:             64875180Ki
  pods:               250
System Info:
  Machine ID:                               8d32a0fb44ed45889f8bdb847dd6adee
  System UUID:                              8d32a0fb44ed45889f8bdb847dd6adee
  Boot ID:                                  5eb8580b-18f4-4a7a-b638-3a7806c12ccf
  Kernel Version:                           4.18.0-305.10.2.el8_4.s390x
  OS Image:                                 Red Hat Enterprise Linux CoreOS 48.84.202107242219-0 (Ootpa)
  Operating System:                         linux
  Architecture:                             s390x
  Container Runtime Version:                cri-o://1.21.2-6.rhaos4.8.git54a5889.el8
  Kubelet Version:                          v1.21.1+051ac4f
  Kube-Proxy Version:                       v1.21.1+051ac4f
Non-terminated Pods:                        (28 in total)
  Namespace                                 Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                 ----                                        ------------  ----------  ---------------  -------------  ---
  default                                   pod-test-rbd                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d22h
  namespace-test-9e8497ecc4f74fe0a51114a39  pod-test-cephfs-3b2052c93fbb47cfa072ab34    0 (0%)        0 (0%)      0 (0%)           0 (0%)         13h
  namespace-test-9e8497ecc4f74fe0a51114a39  pod-test-cephfs-715aa69f608d49878928082e    0 (0%)        0 (0%)      0 (0%)           0 (0%)         13h
  namespace-test-9e8497ecc4f74fe0a51114a39  pod-test-cephfs-bb7d4801cdb34e2f88ec629b    0 (0%)        0 (0%)      0 (0%)           0 (0%)         13h
  openshift-cluster-node-tuning-operator    tuned-qjpz7                                 10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         4d17h
  openshift-dns                             dns-default-qctj9                           60m (0%)      0 (0%)      110Mi (0%)       0 (0%)         4d17h
  openshift-dns                             node-resolver-p7vnh                         5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         4d17h
  openshift-image-registry                  node-ca-pggrx                               10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         4d17h
  openshift-ingress-canary                  ingress-canary-tzdx4                        10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         4d17h
  openshift-kube-storage-version-migrator   migrator-5c458875b5-zhrst                   10m (0%)      0 (0%)      200Mi (0%)       0 (0%)         4d6h
  openshift-local-storage                   diskmaker-manager-n77rp                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d17h
  openshift-machine-config-operator         machine-config-daemon-tlncf                 40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         4d17h
  openshift-marketplace                     certified-operators-qswdf                   10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         14h
  openshift-marketplace                     certified-operators-sltrb                   10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         13h
  openshift-marketplace                     community-operators-dltkw                   10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         3d16h
  openshift-marketplace                     community-operators-x524g                   10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         13h
  openshift-marketplace                     redhat-marketplace-tkp2s                    10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         13h
  openshift-marketplace                     redhat-marketplace-zwkqb                    10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         14h
  openshift-marketplace                     redhat-operators-tdlwx                      10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         18h
  openshift-monitoring                      node-exporter-fx7hv                         9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         4d17h
  openshift-multus                          multus-additional-cni-plugins-s5rxd         10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         4d17h
  openshift-multus                          multus-phwrr                                10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         4d17h
  openshift-multus                          network-metrics-daemon-khqrs                20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         4d17h
  openshift-network-diagnostics             network-check-target-fbgnn                  10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         4d17h
  openshift-sdn                             sdn-vm8qv                                   110m (0%)     0 (0%)      220Mi (0%)       0 (0%)         4d17h
  openshift-storage                         csi-cephfsplugin-ks2g7                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d17h
  openshift-storage                         csi-rbdplugin-v92v5                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d17h
  openshift-storage                         must-gather-p484h-helper                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         13h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                384m (2%)    0 (0%)
  memory             1338Mi (2%)  0 (0%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Mi      0 (0%)       0 (0%)
Events:              <none>
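
For reference, the kubelet on the NotReady worker could be inspected further with something like the following (a hedged suggestion; the node may not respond while it is unreachable):

$ oc adm node-logs worker-0.m1301015ocs.lnxne.boe -u kubelet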

Comment 6 Mudit Agarwal 2021-09-21 13:26:18 UTC
The issue mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1955042#c5 is different.
That happened because the Ceph health was in a WARN state, and the reason for that is clearly a resource/environment issue, as can be seen in the logs pasted in comment #5:
one of the worker nodes is also down, because of which this failure is expected.

For the original issue, we recently made a fix in this area (https://github.com/ceph/ceph-csi/pull/2136, available in 4.9) which should avoid this problem.
Closing the bug; please reopen if this is seen on the latest 4.9 builds.

If this is frequent on 4.8 builds, then we might have to backport https://github.com/ceph/ceph-csi/pull/2136.