Bug 2030654

Summary: Bare Metal, Device Replacement: OSD deployment did not get created until the rook-ceph operator pod was restarted.
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Oded <oviner>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Elad <ebenahar>
Severity: unspecified
Priority: unspecified
Version: 4.9
CC: madam, mmuench, muagarwa, nberry, ocs-bugs, odf-bz-bot, tnielsen
Flags: oviner: needinfo? (nberry)
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-01-17 16:10:11 UTC
Type: Bug
Attachments: rook-ceph-operator logs

Description Oded 2021-12-09 12:13:23 UTC
Created attachment 1845459 [details]
rook-ceph-operator logs

Description of problem (please be as detailed as possible and provide log snippets):
I ran the device replacement procedure on a bare metal (BM) cluster; the new OSD deployment was not created until the rook-ceph operator pod was restarted.
I ran the same procedure on a VMware LSO cluster and the OSD pod moved to the Running state after about 1 minute.
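(A possible way to check whether the operator even attempted to prepare the new OSD -- a sketch only, not run here, assuming the prepare pods carry the usual Rook label app=rook-ceph-osd-prepare:)
$ oc get pods -n openshift-storage -l app=rook-ceph-osd-prepare
$ oc logs -n openshift-storage -l app=rook-ceph-osd-prepare --tail=-1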

Version of all relevant components (if applicable):
OCP Version: 4.9.0-0.nightly-2021-12-08-041936
ODF Version: 4.9.0-251.ci
LSO Version: local-storage-operator.4.9.0-202111151318
Provider: BM
Ceph versions:
sh-4.4$ ceph versions
{
    "mon": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 2
    },
    "mds": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)": 9
    }
}

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.Get OSD Pods:
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                      NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-697fdb84bf-5blmj   2/2     Running   0          19h   10.129.2.32    argo007.ceph.redhat.com   <none>           <none>
rook-ceph-osd-1-7f976b8cdf-pq429   2/2     Running   0          13h   10.128.2.188   argo006.ceph.redhat.com   <none>           <none>
rook-ceph-osd-2-b86d77594-lsblx    2/2     Running   0          20h   10.131.0.29    argo005.ceph.redhat.com   <none>           <none>

2.Scale down OSD-2 (the OSD to be replaced):
$ osd_id_to_remove=2
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-2 scaled
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
No resources found in openshift-storage namespace.
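(Not run here; a possible sanity check that the OSD is reported down before removal, assuming the rook-ceph-tools toolbox pod is deployed with the usual app=rook-ceph-tools label:)
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage ${TOOLS_POD} ceph osd tree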

3.Remove the old OSD from the cluster:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                           READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job--1-fr5bc   0/1     Completed   0          18s

$ oc delete -n openshift-storage job ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted

4.Check PV status:
$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                            STORAGECLASS                  REASON   AGE
local-pv-1ffca0cf                          931Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-44t287         localblock                             9m9s
local-pv-91f335ce                          931Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-3g9kwh         localblock                             19h
local-pv-b9f65aa6                          931Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-localblock-0-data-54mcw6         localblock                             2m10s

5.Check PVC status:
$ oc get pvc
NAME                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
db-noobaa-db-pg-0                        Bound    pvc-988608b7-dc12-4039-921e-4d9aeade2f20   50Gi       RWO            ocs-storagecluster-ceph-rbd   20h
ocs-deviceset-localblock-0-data-3g9kwh   Bound    local-pv-91f335ce                          931Gi      RWO            localblock                    19h
ocs-deviceset-localblock-0-data-44t287   Bound    local-pv-1ffca0cf                          931Gi      RWO            localblock                    9m41s
ocs-deviceset-localblock-0-data-54mcw6   Bound    local-pv-b9f65aa6                          931Gi      RWO            localblock                    3m50s

6.Check OSD pods status:
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE                      NOMINATED NODE   READINESS GATES
rook-ceph-osd-0-697fdb84bf-5blmj   2/2     Running   0          19h   10.129.2.32    argo007.ceph.redhat.com   <none>           <none>
rook-ceph-osd-1-dcc7bbfc6-gkwgj    2/2     Running   0          10m   10.128.3.114   argo006.ceph.redhat.com   <none>           <none>

7.Check deployment:
$ oc get deployment | grep osd
rook-ceph-osd-0                                      1/1     1            1           19h
rook-ceph-osd-1                                      1/1     1            1           12m

8.Wait 20 min (the rook-ceph-osd-2 deployment was still not created; see the sketch below)
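(A possible way to monitor for the missing deployment during the wait, assuming watch is available on the workstation:)
$ watch -n 30 "oc get deployment -n openshift-storage | grep osd"
# rook-ceph-osd-2 should reappear here; in this run it did not until the operator was restarted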

9.Restart rook ceph operator pod:
$ oc delete pods rook-ceph-operator-866bbcb854-rkf68
pod "rook-ceph-operator-866bbcb854-rkf68" deleted

10.Check deployment of osd-2:
$ oc get deployment | grep osd
rook-ceph-osd-0                                      1/1     1            1           19h
rook-ceph-osd-1                                      1/1     1            1           27m
rook-ceph-osd-2                                      1/1     1            1           24s

11.Check osd pod:
$ oc get pods | grep osd
rook-ceph-osd-0-697fdb84bf-5blmj                                  2/2     Running     0          19h
rook-ceph-osd-1-dcc7bbfc6-gkwgj                                   2/2     Running     0          27m
rook-ceph-osd-2-64dfc48f85-4rghn                                  2/2     Running     0          51s


** Rook-ceph operator logs are attached.

Actual results:
The rook-ceph-osd-2 deployment was not re-created and OSD-2 stayed down until the rook-ceph operator pod was restarted.

Expected results:
The rook-ceph-osd-2 deployment is re-created automatically and OSD-2 returns to the Running state.

Additional info:

Comment 2 Travis Nielsen 2021-12-09 23:34:57 UTC
Not a 4.9 blocker, moving to 4.10 for investigation.

Oded, the rook operator log attached is only from after restarting the operator, right? 

Please provide the following:
1. On the BM LSO cluster, provide the rook operator log from before restarting the operator.
2. For comparison, it would also be helpful to have the operator log from the vmware cluster where the OSD was created automatically.
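(A possible way to capture the operator log, assuming the operator deployment is named rook-ceph-operator in openshift-storage; it has to be collected before the operator pod is deleted, since a deleted pod's log is lost:)
$ oc logs -n openshift-storage deployment/rook-ceph-operator > rook-ceph-operator.log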

Comment 4 Oded 2021-12-10 11:54:12 UTC
Hi @nberry ,

I don't know, because I did not replace the disk; I only ran the OSD removal job.

**There is no option to remove the disk in our BM infra.

Comment 5 Travis Nielsen 2021-12-10 17:28:20 UTC
Oded, can you collect the logs requested in comment 2? Thanks.

Comment 6 Oded 2021-12-12 14:20:51 UTC
Hi @tnielsen ,


I can't reproduce this issue on my BM cluster (I tried 7 times). I will try to deploy a new BM cluster.

Comment 7 Travis Nielsen 2021-12-13 16:43:29 UTC
OK, let's follow up when you get the repro.

Comment 9 Travis Nielsen 2022-01-17 16:10:11 UTC
Please reopen if you get the repro, thanks