Bug 1988845 - VMware LSO cluster with encryption, OSD pod stuck in Init state after drain/undrain of worker node [NEEDINFO]
Summary: VMware LSO cluster with encryption, OSD pod stuck in Init state after drain/undrain of worker node
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Sébastien Han
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-01 16:39 UTC by Oded
Modified: 2023-08-09 17:03 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-05 10:12:34 UTC
Embargoed:
shan: needinfo? (oviner)



Description Oded 2021-08-01 16:39:14 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
OSD pod stuck in Init state (for more than 20 minutes) after drain/undrain of a worker node

Version of all relevant components (if applicable):
Provider: VMware
OCP Version: 4.8.0-0.nightly-2021-07-30-021048
OCS Version: 4.8.0-175.ci
LSO Version: 4.8.0-202106291913

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install the OCS operator with OSD encryption (no KMS) + LSO via the UI
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j018vue1cslv33-t4an/j018vue1cslv33-t4an_20210730T100503/logs/screenshots_ui_1627640337/test_deployment/

2. Drain worker node compute-0:
$ oc adm drain compute-0 --force=true --ignore-daemonsets --delete-local-data

3. Wait 1400 seconds.

4. Respin the rook-ceph operator pod:
$ oc -n openshift-storage delete Pod rook-ceph-operator-7d7cf8b6b4-sbfsx

5. Uncordon the node:
$ oc adm uncordon compute-0

6. Wait for all the pods in openshift-storage to be running. [Failed!! osd-0 stuck in Init state]
The pod rook-ceph-osd-0-7d44749b88-9l98d is stuck in the Init:0/8 state (see the inspection commands after these steps).
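To see which init container osd-0 is blocked on, the standard oc inspection commands below should be enough (the pod name is the one from this run; the init container name in the last command is a placeholder to fill in from the describe output):

$ oc -n openshift-storage get pod rook-ceph-osd-0-7d44749b88-9l98d
$ oc -n openshift-storage get pod rook-ceph-osd-0-7d44749b88-9l98d \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'
$ oc -n openshift-storage describe pod rook-ceph-osd-0-7d44749b88-9l98d
$ oc -n openshift-storage logs rook-ceph-osd-0-7d44749b88-9l98d -c <stuck-init-container-name>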

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j018vue1cslv33-t4an/j018vue1cslv33-t4an_20210730T100503/logs/failed_testcase_ocs_logs_1627646050/test_rook_operator_restart_during_mon_failover_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-180ca4c2ca1f8bfd59251ef37dc6f0b0c6f6b651383dad7a34ef67c0374617f5/namespaces/openshift-storage/oc_output/pods_-owide

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j018vue1cslv33-t4an/j018vue1cslv33-t4an_20210730T100503/logs/failed_testcase_ocs_logs_1627646050/test_rook_operator_restart_during_mon_failover_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-180ca4c2ca1f8bfd59251ef37dc6f0b0c6f6b651383dad7a34ef67c0374617f5/ceph/must_gather_commands/ceph_osd_tree

Must Gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j018vue1cslv33-t4an/j018vue1cslv33-t4an_20210730T100503/logs/failed_testcase_ocs_logs_1627646050/test_rook_operator_restart_during_mon_failover_ocs_logs/ocs_must_gather/
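For reference, this kind of must-gather can be regenerated by pointing oc adm must-gather at the OCS must-gather image; the image reference below is inferred from the directory name above and may need adjusting for the exact build:

$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather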



Actual results:
osd-0 stuck in Init state after drain/undrain of the worker node

Expected results:
osd-0 in Running state after drain/undrain of the worker node

Additional info:

Comment 2 Mudit Agarwal 2021-08-03 06:38:26 UTC
Not a 4.8 blocker

Comment 3 Oded 2021-08-03 07:16:06 UTC
This issue could not be reproduced with the same setup:
[
Provider: VMware
OCP Version: 4.8
OCS Version: 4.8.0-175.ci
LSO Version: 4.8.0-202106291913
]
I ran this procedure manually 5 times, with and without IO in the background.

Comment 4 Sébastien Han 2021-08-03 07:38:43 UTC
(In reply to Oded from comment #3)
> This issue could not be reproduced with the same setup:
> [
> Provider: VMware
> OCP Version: 4.8
> OCS Version: 4.8.0-175.ci
> LSO Version: 4.8.0-202106291913
> ]
> I ran this procedure manually 5 times, with and without IO in the background.

Can we close this then? Thanks.

