Description of problem (please be as detailed as possible and provide log snippets):

OSD pod stuck in Init state (more than 20 min) after drain/undrain of a worker node.

Version of all relevant components (if applicable):
Provider: VMware
OCP Version: 4.8.0-0.nightly-2021-07-30-021048
OCS Version: 4.8.0-175.ci
LSO: 4.8.0-202106291913

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install the OCS operator with OSD encryption (no KMS) + LSO via the UI
   http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j018vue1cslv33-t4an/j018vue1cslv33-t4an_20210730T100503/logs/screenshots_ui_1627640337/test_deployment/
2. Drain worker node compute-0
   $ oc adm drain compute-0 --force=true --ignore-daemonsets --delete-local-data
3. Wait 1400 seconds
4. Respin the rook-ceph operator pod
   $ oc -n openshift-storage delete pod rook-ceph-operator-7d7cf8b6b4-sbfsx
5. Uncordon the node
   $ oc adm uncordon compute-0
6. Wait for all the pods in openshift-storage to be Running [Failed! osd-0 stuck in Init state]

The pod rook-ceph-osd-0-7d44749b88-9l98d is stuck in Init:0/8 state:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j018vue1cslv33-t4an/j018vue1cslv33-t4an_20210730T100503/logs/failed_testcase_ocs_logs_1627646050/test_rook_operator_restart_during_mon_failover_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-180ca4c2ca1f8bfd59251ef37dc6f0b0c6f6b651383dad7a34ef67c0374617f5/namespaces/openshift-storage/oc_output/pods_-owide
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j018vue1cslv33-t4an/j018vue1cslv33-t4an_20210730T100503/logs/failed_testcase_ocs_logs_1627646050/test_rook_operator_restart_during_mon_failover_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-180ca4c2ca1f8bfd59251ef37dc6f0b0c6f6b651383dad7a34ef67c0374617f5/ceph/must_gather_commands/ceph_osd_tree

Must Gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j018vue1cslv33-t4an/j018vue1cslv33-t4an_20210730T100503/logs/failed_testcase_ocs_logs_1627646050/test_rook_operator_restart_during_mon_failover_ocs_logs/ocs_must_gather/

Actual results:
osd-0 stuck in Init state after drain/undrain of the worker node

Expected results:
osd-0 in Running state after drain/undrain of the worker node

Additional info:
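For triage, a minimal sketch of how to narrow down which init container osd-0 is blocked on. The pod name is taken from the report above; the app=rook-ceph-osd label selector and the jsonpath expression are standard Rook/oc conventions, not taken from the must-gather, so treat them as assumptions:

$ # List the OSD pods and their states
$ oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide

$ # Show the name and state of every init container in the stuck pod,
$ # so Init:0/8 can be traced to a specific container
$ oc -n openshift-storage get pod rook-ceph-osd-0-7d44749b88-9l98d \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'

$ # Dump the logs of the blocked init container (substitute the name found above)
$ oc -n openshift-storage logs rook-ceph-osd-0-7d44749b88-9l98d -c <init-container-name>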
Not a 4.8 blocker
This issue could not be reproduced with the same setup:
[
Provider: VMware
OCP Version: 4.8
OCS Version: 4.8.0-175.ci
LSO Version: 4.8.0-202106291913
]
I manually ran this procedure 5 times, with and without IO running in the background. A sketch of the rerun loop follows below.
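For reference, a minimal sketch of such a rerun loop, assuming the node name compute-0 from the report and the app=rook-ceph-operator label (standard in Rook deployments); the timeout value is illustrative:

#!/usr/bin/env bash
# Drain/undrain reproduction loop for the steps in the report above.
# Assumptions: node compute-0, namespace openshift-storage; the operator
# pod is selected by label because its name changes on every respin.
set -euo pipefail

NODE=compute-0
NS=openshift-storage

for i in 1 2 3 4 5; do
  echo "=== iteration $i ==="
  oc adm drain "$NODE" --force=true --ignore-daemonsets --delete-local-data
  sleep 1400
  oc -n "$NS" delete pod -l app=rook-ceph-operator
  oc adm uncordon "$NODE"
  # Wait for the storage pods to come back; completed job pods may need
  # to be excluded if they are reported as never becoming Ready.
  oc -n "$NS" wait --for=condition=Ready pod --all --timeout=1200s
done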
(In reply to Oded from comment #3)
> This issue could not be reproduced with the same setup:
> [
> Provider: VMware
> OCP Version: 4.8
> OCS Version: 4.8.0-175.ci
> LSO Version: 4.8.0-202106291913
> ]
> I manually ran this procedure 5 times, with and without IO running in the background.

Can we close this then? Thanks.