Created attachment 1845459 [details]
rook-ceph-operator logs

Description of problem (please be as detailed as possible and provide log snippets):
I ran the device replacement procedure on a BM cluster; the new OSD deployment was not created until the rook-ceph-operator pod was restarted. I ran the same procedure on a VMware LSO cluster and the OSD pod moved to Running state after about 1 minute.

Version of all relevant components (if applicable):
OCP Version: 4.9.0-0.nightly-2021-12-08-041936
ODF Version: 4.9.0-251.ci
LSO Version: local-storage-operator.4.9.0-202111151318
Provider: BM

Ceph versions:
sh-4.4$ ceph versions
{
    "mon": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 2
    },
    "mds": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 9
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Get the OSD pods:
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                      NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-697fdb84bf-5blmj   2/2     Running   0          19h   10.129.2.32    argo007.ceph.redhat.com   <none>           <none>
rook-ceph-osd-1-7f976b8cdf-pq429   2/2     Running   0          13h   10.128.2.188   argo006.ceph.redhat.com   <none>           <none>
rook-ceph-osd-2-b86d77594-lsblx    2/2     Running   0          20h   10.131.0.29    argo005.ceph.redhat.com   <none>           <none>

2. Scale down OSD-2:
$ osd_id_to_remove=2
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-2 scaled
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
No resources found in openshift-storage namespace.
3. Remove the old OSD from the cluster:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} | oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                           READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job--1-fr5bc   0/1     Completed   0          18s
$ oc delete -n openshift-storage job ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted

4. Check PV status:
$ oc get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                       STORAGECLASS   REASON   AGE
local-pv-1ffca0cf   931Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-44t287   localblock              9m9s
local-pv-91f335ce   931Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-3g9kwh   localblock              19h
local-pv-b9f65aa6   931Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-54mcw6   localblock              2m10s

5. Check PVC status:
$ oc get pvc
NAME                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
db-noobaa-db-pg-0                        Bound    pvc-988608b7-dc12-4039-921e-4d9aeade2f20   50Gi       RWO            ocs-storagecluster-ceph-rbd   20h
ocs-deviceset-localblock-0-data-3g9kwh   Bound    local-pv-91f335ce                          931Gi      RWO            localblock                    19h
ocs-deviceset-localblock-0-data-44t287   Bound    local-pv-1ffca0cf                          931Gi      RWO            localblock                    9m41s
ocs-deviceset-localblock-0-data-54mcw6   Bound    local-pv-b9f65aa6                          931Gi      RWO            localblock                    3m50s

6. Check OSD pod status:
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                      NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-697fdb84bf-5blmj   2/2     Running   0          19h   10.129.2.32    argo007.ceph.redhat.com   <none>           <none>
rook-ceph-osd-1-dcc7bbfc6-gkwgj    2/2     Running   0          10m   10.128.3.114   argo006.ceph.redhat.com   <none>           <none>

7. Check the OSD deployments:
$ oc get deployment | grep osd
rook-ceph-osd-0   1/1     1            1           19h
rook-ceph-osd-1   1/1     1            1           12m

8. Wait 20 minutes.

9. Restart the rook-ceph-operator pod:
$ oc delete pods rook-ceph-operator-866bbcb854-rkf68
pod "rook-ceph-operator-866bbcb854-rkf68" deleted

10. Check the deployment of osd-2:
$ oc get deployment | grep osd
rook-ceph-osd-0   1/1     1            1           19h
rook-ceph-osd-1   1/1     1            1           27m
rook-ceph-osd-2   1/1     1            1           24s

11. Check the OSD pods:
$ oc get pods | grep osd
rook-ceph-osd-0-697fdb84bf-5blmj   2/2     Running   0          19h
rook-ceph-osd-1-dcc7bbfc6-gkwgj    2/2     Running   0          27m
rook-ceph-osd-2-64dfc48f85-4rghn   2/2     Running   0          51s

**Attached: rook-ceph-operator logs

Actual results:
OSD-2 stays down; the rook-ceph-osd-2 deployment is not created until the rook-ceph-operator pod is restarted.

Expected results:
OSD-2 is recreated and its pod reaches Running state without restarting the operator.

Additional info:
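A minimal sketch of how the OSD recreation could be monitored while waiting at step 8, instead of only checking again after 20 minutes. This is only an illustration, not part of the documented procedure; it assumes the default openshift-storage namespace, the standard app=rook-ceph-osd and app=rook-ceph-tools labels, and that the toolbox pod is deployed:

# Watch for the rook-ceph-osd-2 deployment to be recreated:
$ oc get deployment -n openshift-storage -l app=rook-ceph-osd -w

# Follow the operator log to capture its reconcile attempts for the removed OSD:
$ oc logs -n openshift-storage -f deploy/rook-ceph-operator | grep -i osd

# Confirm the OSD state from the Ceph side via the toolbox pod:
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
$ oc exec -n openshift-storage ${TOOLS_POD} -- ceph osd tree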
Not a 4.9 blocker, moving to 4.10 for investigation.

Oded, the rook operator log attached is only from after restarting the operator, right? Please provide the following:
1. On the BM LSO cluster, the rook operator log from before restarting the operator.
2. For comparison, it would also be helpful to have the operator log from the vmware cluster where the OSD was automatically created.
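For reference, a minimal sketch of how the requested logs could be captured on the next attempt (assuming the default openshift-storage namespace; the output file names are only examples):

# BM LSO cluster -- dump the operator log *before* restarting the operator:
$ oc logs -n openshift-storage deploy/rook-ceph-operator > rook-ceph-operator-bm-before-restart.log

# vmware cluster -- same command, for comparison:
$ oc logs -n openshift-storage deploy/rook-ceph-operator > rook-ceph-operator-vmware.log

Note that because the operator pod was deleted rather than restarted in place, its pre-restart log cannot be recovered after the fact; it has to be captured before the restart during the next reproduction.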
Hi @nberry, I don't know, because I did not replace the disk; I only ran the OSD removal job.
**There is no option to remove the disk in our BM infra.
Oded, can you collect the logs requested in comment 2? Thanks.
Hi @tnielsen, I can't reproduce this issue on my BM cluster (I tried 7 times). I will try to deploy a new BM cluster.
OK, let's revisit this when you get the repro.
Please reopen if you get the repro. Thanks.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days