Bug 2212510
| Summary: | AWS_UPI, Node replacement, OSD pod is not running on the replacement node [stuck in Pending state] | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Oded <oviner> |
| Component: | documentation | Assignee: | Erin Donnelly <edonnell> |
| Status: | VERIFIED --- | QA Contact: | Oded <oviner> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.13 | CC: | asriram, ebenahar, edonnell, etamir, hnallurv, odf-bz-bot, tnielsen |
| Target Milestone: | --- | Flags: | oviner: needinfo? (edonnell), oviner: needinfo? (etamir), oviner: needinfo? (hnallurv) |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
The osd-0 deployment spec shows that it has affinity to rack0:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cluster.ocs.openshift.io/openshift-storage
            operator: Exists
          - key: cluster.ocs.openshift.io/openshift-storage
            operator: Exists
          - key: topology.rook.io/rack
            operator: In
            values:
            - rack0
Does the new node have the rack0 label? I don't see the node descriptions in the must gather. Since the OSD remains pending, I suspect the new node has not been labeled for the expected rack.
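For reference, one quick way to check this directly on the cluster (a minimal sketch, using the label keys already shown in the affinity above):
$ oc get nodes -l cluster.ocs.openshift.io/openshift-storage= -L topology.rook.io/rack
This lists only the storage-labeled nodes and prints their rack label in an extra column.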
Hi Travis,
The new node is labeled with "topology.rook.io/rack: rack0". You can find it in the OCS must-gather under "/cluster-scoped-resources/core/nodes".
Do we need to label the new node "ip-10-0-92-187.us-east-2.compute.internal" with rack2 like "ip-10-0-52-221.us-east-2.compute.internal" and "ip-10-0-78-44.us-east-2.compute.internal"? If yes, we need to add a new step to the node replacement doc:
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html-single/replacing_nodes/index#replacing-an-operational-aws-node-upi_rhodf
IIUC, we don't need to add this step for AWS_IPI because when we delete a node, a new node is created automatically.

ip-10-0-92-187.us-east-2.compute.internal:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.4xlarge
    beta.kubernetes.io/os: linux
    cluster.ocs.openshift.io/openshift-storage: ""
    failure-domain.beta.kubernetes.io/region: us-east-2
    failure-domain.beta.kubernetes.io/zone: us-east-2c
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-0-92-187.us-east-2.compute.internal
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
    node.kubernetes.io/instance-type: m5.4xlarge
    node.openshift.io/os_id: rhcos
    topology.ebs.csi.aws.com/zone: us-east-2c
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2c
    topology.rook.io/rack: rack0

ip-10-0-52-221.us-east-2.compute.internal:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.4xlarge
    beta.kubernetes.io/os: linux
    cluster.ocs.openshift.io/openshift-storage: ""
    failure-domain.beta.kubernetes.io/region: us-east-2
    failure-domain.beta.kubernetes.io/zone: us-east-2b
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-0-78-44.us-east-2.compute.internal
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
    node.kubernetes.io/instance-type: m5.4xlarge
    node.openshift.io/os_id: rhcos
    topology.ebs.csi.aws.com/zone: us-east-2b
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2b
    topology.rook.io/rack: rack2

ip-10-0-78-44.us-east-2.compute.internal:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.4xlarge
    beta.kubernetes.io/os: linux
    cluster.ocs.openshift.io/openshift-storage: ""
    failure-domain.beta.kubernetes.io/region: us-east-2
    failure-domain.beta.kubernetes.io/zone: us-east-2b
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-0-78-44.us-east-2.compute.internal
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
    node.kubernetes.io/instance-type: m5.4xlarge
    node.openshift.io/os_id: rhcos
    topology.ebs.csi.aws.com/zone: us-east-2b
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2b
    topology.rook.io/rack: rack2

(In reply to Oded from comment #3)
> Hi Travis,
> The new node is labeled with "topology.rook.io/rack: rack0"
> You can find it in the OCS MG "/cluster-scoped-resources/core/nodes"
> Do we need to label the new node "ip-10-0-92-187.us-east-2.compute.internal"
> with rack2 like "ip-10-0-52-221.us-east-2.compute.internal" and
> "ip-10-0-78-44.us-east-2.compute.internal"?

It's expected that the nodes are balanced across the three racks, so it looks expected that the two existing nodes are in rack1 and rack2 and the new node is in rack0. Clarification added below for the node on rack1. So there must be some other reason that the OSD is not having its affinity satisfied.
> ip-10-0-52-221.us-east-2.compute.internal:
>   labels:
>     beta.kubernetes.io/arch: amd64
>     beta.kubernetes.io/instance-type: m5.4xlarge
>     beta.kubernetes.io/os: linux
>     cluster.ocs.openshift.io/openshift-storage: ""
>     failure-domain.beta.kubernetes.io/region: us-east-2
>     failure-domain.beta.kubernetes.io/zone: us-east-2b
>     kubernetes.io/arch: amd64
>     kubernetes.io/hostname: ip-10-0-78-44.us-east-2.compute.internal
>     kubernetes.io/os: linux
>     node-role.kubernetes.io/worker: ""
>     node.kubernetes.io/instance-type: m5.4xlarge
>     node.openshift.io/os_id: rhcos
>     topology.ebs.csi.aws.com/zone: us-east-2b
>     topology.kubernetes.io/region: us-east-2
>     topology.kubernetes.io/zone: us-east-2b
>     topology.rook.io/rack: rack2

In the must-gather I see these labels for this node:
    topology.ebs.csi.aws.com/zone: us-east-2a
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2a
    topology.rook.io/rack: rack1

Looking again at the error on the pod, the key issue seems to be:
  "1 node(s) had volume node affinity conflict"

The volume is:
  - name: ocs-deviceset-gp2-csi-2-data-0q6mfq
    persistentVolumeClaim:
      claimName: ocs-deviceset-gp2-csi-2-data-0q6mfq

Its PVC is bound to the PV:
  volumeName: pvc-153bbd34-5e3d-4386-ab0c-46fe1d630186

This PV has node affinity to us-east-2a:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.ebs.csi.aws.com/zone
          operator: In
          values:
          - us-east-2a

But the new node belongs to us-east-2c:
    topology.ebs.csi.aws.com/zone: us-east-2c
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2c
    topology.rook.io/rack: rack0

So when the nodes are replaced, they must be in the same AZ, or else the EBS volume can't be bound.

Another question is why racks are created. When running across AWS zones, the OCS operator should just be using the AZs instead of creating racks.

So there is no Rook issue that needs to be fixed; shall we close this issue?
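For completeness, a hedged sketch of tracing that chain manually with oc (assuming the PVC lives in the openshift-storage namespace; the PVC, PV, and node names are the ones quoted above):
$ oc -n openshift-storage get pvc ocs-deviceset-gp2-csi-2-data-0q6mfq -o jsonpath='{.spec.volumeName}'
$ oc get pv pvc-153bbd34-5e3d-4386-ab0c-46fe1d630186 -o yaml | grep -A 10 nodeAffinity
$ oc get node ip-10-0-92-187.us-east-2.compute.internal -L topology.ebs.csi.aws.com/zone
The first command confirms which PV the PVC is bound to, the second shows the PV's zone affinity, and the third shows the zone of the new node.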
Hi Travis,
In my new test I replaced the node in the same zone [us-east-2a], and it works as expected.
We need to add a note for AWS/vSphere:
"""
The new node should be in the same zone [AWS] / rack [VMware] as the replaced node.
"""
What do you think?
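A minimal sketch of how the zone of the node being replaced could be captured before provisioning the new EC2 instance (<node-name> is a placeholder):
$ oc get node <node-name> -L topology.kubernetes.io/zone
The replacement node would then be created in that same availability zone.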
SetUp:
OCP Version: 4.13.0-0.nightly-2023-06-09-152551
ODF Version: 4.13.0-218
PLATFORM: AWS_UPI
$ oc get machinesets.machine.openshift.io -A
No resources found
Test Process:
1. Check worker node labels:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-49-29.us-east-2.compute.internal Ready control-plane,master 37m v1.26.5+7d22122
ip-10-0-50-29.us-east-2.compute.internal Ready worker 24m v1.26.5+7d22122
ip-10-0-63-145.us-east-2.compute.internal Ready worker 23m v1.26.5+7d22122
ip-10-0-66-162.us-east-2.compute.internal Ready control-plane,master 37m v1.26.5+7d22122
ip-10-0-70-103.us-east-2.compute.internal Ready worker 24m v1.26.5+7d22122
ip-10-0-89-148.us-east-2.compute.internal Ready control-plane,master 38m v1.26.5+7d22122
ip-10-0-95-97.us-east-2.compute.internal Ready worker 24m v1.26.5+7d22122
$ oc get nodes --show-labels | grep worker | awk '{ print $1 }'
ip-10-0-50-29.us-east-2.compute.internal -> us-east-2a
ip-10-0-63-145.us-east-2.compute.internal -> us-east-2a
ip-10-0-70-103.us-east-2.compute.internal -> us-east-2b
ip-10-0-95-97.us-east-2.compute.internal -> us-east-2c
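The node-to-zone mapping above can also be read directly from the zone label; a small sketch:
$ oc get nodes -l node-role.kubernetes.io/worker -L topology.kubernetes.io/zone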
$ oc get nodes ip-10-0-50-29.us-east-2.compute.internal --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-10-0-50-29.us-east-2.compute.internal Ready worker 28m v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-50-29.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
$ oc get nodes ip-10-0-63-145.us-east-2.compute.internal --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-10-0-63-145.us-east-2.compute.internal Ready worker 28m v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-63-145.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
$ oc get nodes ip-10-0-70-103.us-east-2.compute.internal --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-10-0-70-103.us-east-2.compute.internal Ready worker 30m v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-70-103.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
$ oc get nodes ip-10-0-95-97.us-east-2.compute.internal --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-10-0-95-97.us-east-2.compute.internal Ready worker 38m v1.26.5+7d22122 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-95-97.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
2. Install ODF operator
3. Create storagecluster
Nodes labeled with the OCS label (the label command is sketched after this list):
ip-10-0-50-29.us-east-2.compute.internal -> us-east-2a
ip-10-0-70-103.us-east-2.compute.internal -> us-east-2b
ip-10-0-95-97.us-east-2.compute.internal -> us-east-2c
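A sketch of the labeling (assuming the labels were applied from the CLI rather than the console; it is the same command used in step 6 below, once per node):
$ oc label node ip-10-0-50-29.us-east-2.compute.internal cluster.ocs.openshift.io/openshift-storage=""
$ oc label node ip-10-0-70-103.us-east-2.compute.internal cluster.ocs.openshift.io/openshift-storage=""
$ oc label node ip-10-0-95-97.us-east-2.compute.internal cluster.ocs.openshift.io/openshift-storage=""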
4. Check Ceph status:
sh-5.1$ ceph health
HEALTH_OK
5. Delete ip-10-0-50-29.us-east-2.compute.internal node:
$ oc adm cordon ip-10-0-50-29.us-east-2.compute.internal
node/ip-10-0-50-29.us-east-2.compute.internal cordoned
oviner:auth$ oc adm drain ip-10-0-50-29.us-east-2.compute.internal --force --delete-emptydir-data=true --ignore-daemonsets
node/ip-10-0-50-29.us-east-2.compute.internal already cordoned
Warning: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-xx9cb, openshift-cluster-node-tuning-operator/tuned-jnwnd, openshift-dns/
node/ip-10-0-50-29.us-east-2.compute.internal drained
$ oc delete nodes ip-10-0-50-29.us-east-2.compute.internal
node "ip-10-0-50-29.us-east-2.compute.internal" deleted
6. Label the new node with the OCS label:
oc label node ip-10-0-63-145.us-east-2.compute.internal cluster.ocs.openshift.io/openshift-storage=""
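Optionally, confirm that the relabeled node carries both the storage label and the expected zone before checking pods (a sketch):
$ oc get nodes -l cluster.ocs.openshift.io/openshift-storage= -L topology.kubernetes.io/zone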
7. Verify that the new Object Storage Device (OSD) pods are running on the replacement node:
$ oc get pods -o wide | grep ip-10-0-63-145.us-east-2.compute.internal
csi-addons-controller-manager-b49dc6c8d-m5dj2 2/2 Running 0 4m33s 10.130.2.13 ip-10-0-63-145.us-east-2.compute.internal <none> <none>
csi-cephfsplugin-provisioner-76b98bccfb-4xxcd 5/5 Running 0 13m 10.130.2.10 ip-10-0-63-145.us-east-2.compute.internal <none> <none>
csi-cephfsplugin-ttzcn 2/2 Running 0 13m 10.0.63.145 ip-10-0-63-145.us-east-2.compute.internal <none> <none>
csi-rbdplugin-provisioner-5856654fdc-f8nnl 6/6 Running 0 4m32s 10.130.2.15 ip-10-0-63-145.us-east-2.compute.internal <none> <none>
csi-rbdplugin-rmcgp 3/3 Running 0 13m 10.0.63.145 ip-10-0-63-145.us-east-2.compute.internal <none> <none>
noobaa-operator-796bfb4c65-k4dq8 1/1 Running 0 18m 10.130.2.9 ip-10-0-63-145.us-east-2.compute.internal <none> <none>
odf-operator-controller-manager-56977d98b4-klhzs 2/2 Running 0 4m31s 10.130.2.19 ip-10-0-63-145.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-cfd2a7580c149360f20574a5df21a88c-zsp7p 1/1 Running 0 54s 10.130.2.22 ip-10-0-63-145.us-east-2.compute.internal <none> <none>
rook-ceph-exporter-ip-10-0-63-145.us-east-2.compute.intern74qz5 1/1 Running 0 54s 10.130.2.23 ip-10-0-63-145.us-east-2.compute.internal <none> <none>
rook-ceph-mon-c-55897b99df-zj5sg 2/2 Running 0 4m31s 10.130.2.25 ip-10-0-63-145.us-east-2.compute.internal <none> <none>
rook-ceph-osd-2-56db486c46-xl5cr 2/2 Running 0 4m31s 10.130.2.24 ip-10-0-63-145.us-east-2.compute.internal <none> <none>
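A narrower check for just the OSD pods, assuming the standard Rook label app=rook-ceph-osd:
$ oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide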
8. Check Ceph status:
sh-5.1$ ceph -s
cluster:
id: 2beb8f74-078e-4878-abb9-c363c4a9f014
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 118s)
mgr: a(active, since 10m)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 111s), 3 in (since 9m)
data:
volumes: 1/1 healthy
pools: 4 pools, 113 pgs
objects: 94 objects, 131 MiB
usage: 247 MiB used, 1.5 TiB / 1.5 TiB avail
pgs: 113 active+clean
io:
client: 853 B/s rd, 8.0 KiB/s wr, 1 op/s rd, 0 op/s wr
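From the same toolbox shell, the OSD tree can also be checked to confirm that all three OSDs are up and placed under the expected hosts (a sketch):
sh-5.1$ ceph osd tree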
For more info: https://docs.google.com/document/d/1SZevJ14RJzmizif1Po9UOucagHbAZjLtUb9yIeJDJLE/edit [second Test!!]
> The new node should be in the same zone [AWS] / rack [VMware] as the replaced node.
Agreed, we need a statement in the node replacement doc that indicates replaced nodes must be in the same zone/rack.
Hi Travis,
On Vsphere_UPI, the rack label is created automatically when labeling a new node with OCS.
https://bugzilla.redhat.com/show_bug.cgi?id=2102304#c19
On Vsphere_IPI, a new node is created automatically after deleting the node.
On AWS_IPI, a new node [EC2] is created automatically after deleting the node.
On AWS_UPI, we need to add a new node [EC2] in the same zone.
So I think we need to add a note only for AWS_UPI to create the node in the same zone as the replaced node.

(In reply to Oded from comment #7)
> So I think we need to add a note only for AWS_UPI to create the node in the same zone as the replaced node.

Hi Anjana,
Could the above note be added to the relevant section in the 4.13 docs?
Regards,
Harish

(In reply to Harish NV Rao from comment #8)
> (In reply to Oded from comment #7)
> > So I think we need to add a note only for AWS_UPI to create the node in the same zone as the replaced node.
> Hi Anjana,
> Could the above note be added to the relevant section in the 4.13 docs?
> Regards,
> Harish

Yes, Harish.

Can I move the BZ based on the GitLab commit?
https://gitlab.cee.redhat.com/red-hat-openshift-container-storage-documentation/openshift-data-foundation-documentation-4.13/-/commit/e4e54e3c6d3917fcfdd8ca9d01dbd38c127f1633#db85db15814d57597ed12d89dcc51a0009e0b008_13_16
I don't see the fix in the preview link.
https://dxp-docp-prod.apps.ext-waf.spoke.prod.us-west-2.aws.paas.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.13/html-single/replacing_nodes/index?lb_target=preview#replacing-an-operational-aws-node-upi_rhodf

(In reply to Oded from comment #17)
> Can I move the BZ based on the GitLab commit?
> https://gitlab.cee.redhat.com/red-hat-openshift-container-storage-documentation/openshift-data-foundation-documentation-4.13/-/commit/e4e54e3c6d3917fcfdd8ca9d01dbd38c127f1633#db85db15814d57597ed12d89dcc51a0009e0b008_13_16
> I don't see the fix in the preview link.
> https://dxp-docp-prod.apps.ext-waf.spoke.prod.us-west-2.aws.paas.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.13/html-single/replacing_nodes/index?lb_target=preview#replacing-an-operational-aws-node-upi_rhodf

After logging into the customer portal, I can see the note:
"When replacing an AWS node on user-provisioned infrastructure, the new node needs to be created in the same AWS zone as the original node."

@eran @etamir @hnallurv Can we backport it to ODF 4.10/4.11/4.12?
Description of problem (please be detailed as possible and provide log snippets):
OSD stuck in Pending state after node replacement

Version of all relevant components (if applicable):
OCP Version: 4.13.0-0.nightly-2023-06-03-192019
ODF Version: odf-operator.v4.13.0-207.stable
PLATFORM: AWS_UPI
[Verify UPI installation -> machinesets do not exist:
$ oc get machinesets.machine.openshift.io -A
No resources found]

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy OCP cluster with 4 worker nodes
2. Install storagecluster with 3 nodes
3. Check Ceph status [OK]
4. Check nodes:
$ oc get nodes --show-labels | grep ocs | awk '{ print $1 }'
ip-10-0-50-169.us-east-2.compute.internal
ip-10-0-52-221.us-east-2.compute.internal
ip-10-0-78-44.us-east-2.compute.internal
$ oc get nodes --show-labels | grep worker | awk '{ print $1 }'
ip-10-0-50-169.us-east-2.compute.internal
ip-10-0-52-221.us-east-2.compute.internal
ip-10-0-78-44.us-east-2.compute.internal
ip-10-0-92-187.us-east-2.compute.internal

Delete node "ip-10-0-50-169.us-east-2.compute.internal" and replace it with "ip-10-0-92-187.us-east-2.compute.internal".

5. Delete node ip-10-0-50-169.us-east-2.compute.internal:
$ oc adm cordon ip-10-0-50-169.us-east-2.compute.internal
$ oc adm drain ip-10-0-50-169.us-east-2.compute.internal --force --delete-emptydir-data=true --ignore-daemonsets
$ oc delete nodes ip-10-0-50-169.us-east-2.compute.internal

6. Apply the OpenShift Data Foundation label to the "ip-10-0-92-187.us-east-2.compute.internal" node:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
ip-10-0-52-221.us-east-2.compute.internal
ip-10-0-78-44.us-east-2.compute.internal
ip-10-0-92-187.us-east-2.compute.internal
7. Check pods on ip-10-0-92-187.us-east-2.compute.internal:
$ oc get pods -o wide | grep ip-10-0-92-187.us-east-2.compute.internal
csi-cephfsplugin-n5m8s 2/2 Running 0 62m 10.0.92.187 ip-10-0-92-187.us-east-2.compute.internal <none> <none>
csi-cephfsplugin-provisioner-5dfdc765b9-4p242 5/5 Running 0 62m 10.131.0.22 ip-10-0-92-187.us-east-2.compute.internal <none> <none>
csi-rbdplugin-2nt6l 3/3 Running 0 62m 10.0.92.187 ip-10-0-92-187.us-east-2.compute.internal <none> <none>
csi-rbdplugin-provisioner-8696d74786-rrgxp 6/6 Running 0 50m 10.131.0.27 ip-10-0-92-187.us-east-2.compute.internal <none> <none>
odf-operator-controller-manager-7fdcf5f87d-m5szm 2/2 Running 0 50m 10.131.0.26 ip-10-0-92-187.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-73fd770e97485e5723141463fbe1d7c7-2rxfj 1/1 Running 0 25m 10.131.0.33 ip-10-0-92-187.us-east-2.compute.internal <none> <none>
rook-ceph-exporter-ip-10-0-92-187.us-east-2.compute.intern9qhsr 1/1 Running 0 25m 10.131.0.34 ip-10-0-92-187.us-east-2.compute.internal <none> <none>
rook-ceph-mon-d-55fdb456f-vwjwz 2/2 Running 0 27m 10.131.0.35 ip-10-0-92-187.us-east-2.compute.internal <none> <none>

The OSD pod is not running on ip-10-0-92-187.us-east-2.compute.internal:
$ oc get pods rook-ceph-osd-0-fdffd864c-6llmm
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-fdffd864c-6llmm 0/2 Pending 0 36m

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 36m default-scheduler 0/7 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/7 nodes are available: 7 Preemption is not helpful for scheduling..
Warning FailedScheduling 32m default-scheduler 0/6 nodes are available: 1 node(s) had volume node affinity conflict, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
Warning FailedScheduling 6m18s (x7 over 27m) default-scheduler 0/6 nodes are available: 1 node(s) had volume node affinity conflict, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..

Doc: https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html-single/replacing_nodes/index#replacing-an-operational-aws-node-upi_rhodf

Actual results:
OSD pod stuck in Pending state

Expected results:
OSD is running on the replacement node

Additional info:
OCS MG: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2212510.tar.gz
https://docs.google.com/document/d/1SZevJ14RJzmizif1Po9UOucagHbAZjLtUb9yIeJDJLE/edit