Description of problem (please be as detailed as possible and provide log snippets):

OSD stuck in Pending state after node replacement.

Version of all relevant components (if applicable):
OCP Version: 4.13.0-0.nightly-2023-06-03-192019
ODF Version: odf-operator.v4.13.0-207.stable
PLATFORM: AWS_UPI

[Verify UPI installation -> machinesets do not exist:
$ oc get machinesets.machine.openshift.io -A
No resources found]

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy an OCP cluster with 4 worker nodes.
2. Install a storagecluster with 3 nodes.
3. Check Ceph status [OK].
4. Check nodes:

$ oc get nodes --show-labels | grep ocs | awk '{ print $1 }'
ip-10-0-50-169.us-east-2.compute.internal
ip-10-0-52-221.us-east-2.compute.internal
ip-10-0-78-44.us-east-2.compute.internal

$ oc get nodes --show-labels | grep worker | awk '{ print $1 }'
ip-10-0-50-169.us-east-2.compute.internal
ip-10-0-52-221.us-east-2.compute.internal
ip-10-0-78-44.us-east-2.compute.internal
ip-10-0-92-187.us-east-2.compute.internal

Delete node "ip-10-0-50-169.us-east-2.compute.internal" and replace it with "ip-10-0-92-187.us-east-2.compute.internal".

5. Delete node ip-10-0-50-169.us-east-2.compute.internal:
$ oc adm cordon ip-10-0-50-169.us-east-2.compute.internal
$ oc adm drain ip-10-0-50-169.us-east-2.compute.internal --force --delete-emptydir-data=true --ignore-daemonsets
$ oc delete nodes ip-10-0-50-169.us-east-2.compute.internal

6. Apply the OpenShift Data Foundation label to the "ip-10-0-92-187.us-east-2.compute.internal" node:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
ip-10-0-52-221.us-east-2.compute.internal
ip-10-0-78-44.us-east-2.compute.internal
ip-10-0-92-187.us-east-2.compute.internal
7. Check pods on ip-10-0-92-187.us-east-2.compute.internal:

$ oc get pods -o wide | grep ip-10-0-92-187.us-east-2.compute.internal
csi-cephfsplugin-n5m8s                                            2/2   Running   0   62m   10.0.92.187   ip-10-0-92-187.us-east-2.compute.internal   <none>   <none>
csi-cephfsplugin-provisioner-5dfdc765b9-4p242                     5/5   Running   0   62m   10.131.0.22   ip-10-0-92-187.us-east-2.compute.internal   <none>   <none>
csi-rbdplugin-2nt6l                                               3/3   Running   0   62m   10.0.92.187   ip-10-0-92-187.us-east-2.compute.internal   <none>   <none>
csi-rbdplugin-provisioner-8696d74786-rrgxp                        6/6   Running   0   50m   10.131.0.27   ip-10-0-92-187.us-east-2.compute.internal   <none>   <none>
odf-operator-controller-manager-7fdcf5f87d-m5szm                  2/2   Running   0   50m   10.131.0.26   ip-10-0-92-187.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-73fd770e97485e5723141463fbe1d7c7-2rxfj   1/1   Running   0   25m   10.131.0.33   ip-10-0-92-187.us-east-2.compute.internal   <none>   <none>
rook-ceph-exporter-ip-10-0-92-187.us-east-2.compute.intern9qhsr   1/1   Running   0   25m   10.131.0.34   ip-10-0-92-187.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-d-55fdb456f-vwjwz                                   2/2   Running   0   27m   10.131.0.35   ip-10-0-92-187.us-east-2.compute.internal   <none>   <none>

The OSD pod is not running on ip-10-0-92-187.us-east-2.compute.internal:

$ oc get pods rook-ceph-osd-0-fdffd864c-6llmm
NAME                              READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-fdffd864c-6llmm   0/2     Pending   0          36m

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  36m                  default-scheduler  0/7 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/7 nodes are available: 7 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  32m                  default-scheduler  0/6 nodes are available: 1 node(s) had volume node affinity conflict, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  6m18s (x7 over 27m)  default-scheduler  0/6 nodes are available: 1 node(s) had volume node affinity conflict, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..

Doc: https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html-single/replacing_nodes/index#replacing-an-operational-aws-node-upi_rhodf

Actual results:
OSD pod is stuck in Pending state.

Expected results:
OSD is running on the replacement node.

Additional info:
OCS MG: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2212510.tar.gz
https://docs.google.com/document/d/1SZevJ14RJzmizif1Po9UOucagHbAZjLtUb9yIeJDJLE/edit
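For anyone triaging the same Pending state, a minimal inspection sketch (assuming the default openshift-storage namespace; the pod and placeholder <pv-name> below refer to the objects from this report and will differ in other clusters):

$ oc -n openshift-storage describe pod rook-ceph-osd-0-fdffd864c-6llmm     # shows the FailedScheduling events above
$ oc -n openshift-storage get pvc | grep ocs-deviceset                     # deviceset PVCs backing the OSDs
$ oc get pv <pv-name> -o yaml | grep -A 8 nodeAffinity                     # zone the backing EBS volume is pinned to
$ oc get nodes -L topology.kubernetes.io/zone -L topology.rook.io/rack     # zone/rack labels on the current nodes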
The osd-0 deployment spec shows that it has affinity to rack0:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cluster.ocs.openshift.io/openshift-storage
            operator: Exists
          - key: cluster.ocs.openshift.io/openshift-storage
            operator: Exists
          - key: topology.rook.io/rack
            operator: In
            values:
            - rack0

Does the new node have the rack0 label? I don't see the node descriptions in the must-gather. Since the OSD remains pending, I suspect the new node has not been labeled for the expected rack.
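A quick way to check the rack labels on the storage nodes (a sketch; -L only adds the label value as an extra column, and the label keys are the ones shown in the deployment spec above):

$ oc get nodes -l cluster.ocs.openshift.io/openshift-storage= -L topology.rook.io/rack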
Hi Travis,

The new node is labeled with "topology.rook.io/rack: rack0".
You can find it in the OCS MG under "/cluster-scoped-resources/core/nodes".

Do we need to label the new node "ip-10-0-92-187.us-east-2.compute.internal" with rack2, like "ip-10-0-52-221.us-east-2.compute.internal" and "ip-10-0-78-44.us-east-2.compute.internal"?
If yes, we need to add a new step to the node replacement doc:
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html-single/replacing_nodes/index#replacing-an-operational-aws-node-upi_rhodf

IIUC, we don't need to add this step on AWS_IPI because when we delete a node, a new node is created automatically.

ip-10-0-92-187.us-east-2.compute.internal
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.4xlarge
    beta.kubernetes.io/os: linux
    cluster.ocs.openshift.io/openshift-storage: ""
    failure-domain.beta.kubernetes.io/region: us-east-2
    failure-domain.beta.kubernetes.io/zone: us-east-2c
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-0-92-187.us-east-2.compute.internal
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
    node.kubernetes.io/instance-type: m5.4xlarge
    node.openshift.io/os_id: rhcos
    topology.ebs.csi.aws.com/zone: us-east-2c
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2c
    topology.rook.io/rack: rack0

ip-10-0-52-221.us-east-2.compute.internal:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.4xlarge
    beta.kubernetes.io/os: linux
    cluster.ocs.openshift.io/openshift-storage: ""
    failure-domain.beta.kubernetes.io/region: us-east-2
    failure-domain.beta.kubernetes.io/zone: us-east-2b
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-0-78-44.us-east-2.compute.internal
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
    node.kubernetes.io/instance-type: m5.4xlarge
    node.openshift.io/os_id: rhcos
    topology.ebs.csi.aws.com/zone: us-east-2b
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2b
    topology.rook.io/rack: rack2

ip-10-0-78-44.us-east-2.compute.internal:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: m5.4xlarge
    beta.kubernetes.io/os: linux
    cluster.ocs.openshift.io/openshift-storage: ""
    failure-domain.beta.kubernetes.io/region: us-east-2
    failure-domain.beta.kubernetes.io/zone: us-east-2b
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-0-78-44.us-east-2.compute.internal
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
    node.kubernetes.io/instance-type: m5.4xlarge
    node.openshift.io/os_id: rhcos
    topology.ebs.csi.aws.com/zone: us-east-2b
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2b
    topology.rook.io/rack: rack2
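If relabeling did turn out to be the right step, it would presumably be a plain label overwrite like the sketch below. This is only to illustrate the question above, not a confirmed or recommended procedure:

$ oc label node ip-10-0-92-187.us-east-2.compute.internal topology.rook.io/rack=rack2 --overwrite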
(In reply to Oded from comment #3)
> Hi Travis,
> The new node is labeled with "topology.rook.io/rack: rack0".
> You can find it in the OCS MG under "/cluster-scoped-resources/core/nodes".
> Do we need to label the new node "ip-10-0-92-187.us-east-2.compute.internal"
> with rack2, like "ip-10-0-52-221.us-east-2.compute.internal" and
> "ip-10-0-78-44.us-east-2.compute.internal"?

It's expected that the nodes are balanced between the three racks, so it looks expected that the two existing nodes are in rack1 and rack2 and the new node is in rack0. Clarification added below for the node on rack1. So there must be some other reason that the OSD is not having its affinity satisfied.

> ip-10-0-52-221.us-east-2.compute.internal:
>   labels:
>     beta.kubernetes.io/arch: amd64
>     beta.kubernetes.io/instance-type: m5.4xlarge
>     beta.kubernetes.io/os: linux
>     cluster.ocs.openshift.io/openshift-storage: ""
>     failure-domain.beta.kubernetes.io/region: us-east-2
>     failure-domain.beta.kubernetes.io/zone: us-east-2b
>     kubernetes.io/arch: amd64
>     kubernetes.io/hostname: ip-10-0-78-44.us-east-2.compute.internal
>     kubernetes.io/os: linux
>     node-role.kubernetes.io/worker: ""
>     node.kubernetes.io/instance-type: m5.4xlarge
>     node.openshift.io/os_id: rhcos
>     topology.ebs.csi.aws.com/zone: us-east-2b
>     topology.kubernetes.io/region: us-east-2
>     topology.kubernetes.io/zone: us-east-2b
>     topology.rook.io/rack: rack2

In the must-gather I see these labels for this node:

    topology.ebs.csi.aws.com/zone: us-east-2a
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2a
    topology.rook.io/rack: rack1

Looking again at the error on the pod, the key issue seems to be:

    "1 node(s) had volume node affinity conflict"

The volume is:

    - name: ocs-deviceset-gp2-csi-2-data-0q6mfq
      persistentVolumeClaim:
        claimName: ocs-deviceset-gp2-csi-2-data-0q6mfq

Its PVC is bound to the PV:

    volumeName: pvc-153bbd34-5e3d-4386-ab0c-46fe1d630186

This PV has node affinity to us-east-2a:

    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.ebs.csi.aws.com/zone
            operator: In
            values:
            - us-east-2a

But the new node belongs to us-east-2c:

    topology.ebs.csi.aws.com/zone: us-east-2c
    topology.kubernetes.io/region: us-east-2
    topology.kubernetes.io/zone: us-east-2c
    topology.rook.io/rack: rack0

So when nodes are replaced, they must be in the same AZ, or else the EBS volume can't be bound. Another question is why racks are created at all: when running across AWS zones, the OCS operator should just be using the AZs instead of creating racks. So there is no Rook issue that needs to be fixed; shall we close this issue?
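For completeness, the zone mismatch described above can be confirmed directly from the objects quoted in this comment (a sketch, assuming the default openshift-storage namespace):

$ oc -n openshift-storage get pvc ocs-deviceset-gp2-csi-2-data-0q6mfq -o jsonpath='{.spec.volumeName}'
$ oc get pv pvc-153bbd34-5e3d-4386-ab0c-46fe1d630186 -o yaml | grep -B 2 -A 8 nodeAffinity
$ oc get node ip-10-0-92-187.us-east-2.compute.internal -L topology.ebs.csi.aws.com/zone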
Hi Travis,

In my new test I replaced the node in the same zone [us-east-2a] and it works as expected.
We need to add a note for AWS/vSphere:
"""
The new node should be in the same zone [AWS] / rack [VMware] as the replaced node.
"""
What do you think?

SetUp:
OCP Version: 4.13.0-0.nightly-2023-06-09-152551
ODF Version: 4.13.0-218
PLATFORM: AWS_UPI

$ oc get machinesets.machine.openshift.io -A
No resources found

Test Process:
1. Check worker node labels:

$ oc get nodes
NAME                                        STATUS   ROLES                  AGE   VERSION
ip-10-0-49-29.us-east-2.compute.internal    Ready    control-plane,master   37m   v1.26.5+7d22122
ip-10-0-50-29.us-east-2.compute.internal    Ready    worker                 24m   v1.26.5+7d22122
ip-10-0-63-145.us-east-2.compute.internal   Ready    worker                 23m   v1.26.5+7d22122
ip-10-0-66-162.us-east-2.compute.internal   Ready    control-plane,master   37m   v1.26.5+7d22122
ip-10-0-70-103.us-east-2.compute.internal   Ready    worker                 24m   v1.26.5+7d22122
ip-10-0-89-148.us-east-2.compute.internal   Ready    control-plane,master   38m   v1.26.5+7d22122
ip-10-0-95-97.us-east-2.compute.internal    Ready    worker                 24m   v1.26.5+7d22122

$ oc get nodes --show-labels | grep worker | awk '{ print $1 }'
ip-10-0-50-29.us-east-2.compute.internal  -> us-east-2a
ip-10-0-63-145.us-east-2.compute.internal -> us-east-2a
ip-10-0-70-103.us-east-2.compute.internal -> us-east-2b
ip-10-0-95-97.us-east-2.compute.internal  -> us-east-2c

$ oc get nodes ip-10-0-50-29.us-east-2.compute.internal --show-labels
NAME   STATUS   ROLES   AGE   VERSION   LABELS
ip-10-0-50-29.us-east-2.compute.internal   Ready   worker   28m   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-50-29.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a

$ oc get nodes ip-10-0-63-145.us-east-2.compute.internal --show-labels
NAME   STATUS   ROLES   AGE   VERSION   LABELS
ip-10-0-63-145.us-east-2.compute.internal   Ready   worker   28m   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-63-145.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a

$ oc get nodes ip-10-0-70-103.us-east-2.compute.internal --show-labels
NAME   STATUS   ROLES   AGE   VERSION   LABELS
ip-10-0-70-103.us-east-2.compute.internal   Ready   worker   30m   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-70-103.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
$ oc get nodes ip-10-0-95-97.us-east-2.compute.internal --show-labels
NAME   STATUS   ROLES   AGE   VERSION   LABELS
ip-10-0-95-97.us-east-2.compute.internal   Ready   worker   38m   v1.26.5+7d22122   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-95-97.us-east-2.compute.internal,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c

2. Install the ODF operator.
3. Create the storagecluster. Nodes labeled with OCS:
ip-10-0-50-29.us-east-2.compute.internal  -> us-east-2a
ip-10-0-70-103.us-east-2.compute.internal -> us-east-2b
ip-10-0-95-97.us-east-2.compute.internal  -> us-east-2c

4. Check Ceph status:
sh-5.1$ ceph health
HEALTH_OK

5. Delete the ip-10-0-50-29.us-east-2.compute.internal node:
$ oc adm cordon ip-10-0-50-29.us-east-2.compute.internal
node/ip-10-0-50-29.us-east-2.compute.internal cordoned
$ oc adm drain ip-10-0-50-29.us-east-2.compute.internal --force --delete-emptydir-data=true --ignore-daemonsets
node/ip-10-0-50-29.us-east-2.compute.internal already cordoned
Warning: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-xx9cb, openshift-cluster-node-tuning-operator/tuned-jnwnd, openshift-dns/
node/ip-10-0-50-29.us-east-2.compute.internal drained
$ oc delete nodes ip-10-0-50-29.us-east-2.compute.internal
node "ip-10-0-50-29.us-east-2.compute.internal" deleted

6. Label the new node with the OCS label:
$ oc label node ip-10-0-63-145.us-east-2.compute.internal cluster.ocs.openshift.io/openshift-storage=""
7. Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide | grep ip-10-0-63-145.us-east-2.compute.internal
csi-addons-controller-manager-b49dc6c8d-m5dj2                     2/2   Running   0   4m33s   10.130.2.13   ip-10-0-63-145.us-east-2.compute.internal   <none>   <none>
csi-cephfsplugin-provisioner-76b98bccfb-4xxcd                     5/5   Running   0   13m     10.130.2.10   ip-10-0-63-145.us-east-2.compute.internal   <none>   <none>
csi-cephfsplugin-ttzcn                                            2/2   Running   0   13m     10.0.63.145   ip-10-0-63-145.us-east-2.compute.internal   <none>   <none>
csi-rbdplugin-provisioner-5856654fdc-f8nnl                        6/6   Running   0   4m32s   10.130.2.15   ip-10-0-63-145.us-east-2.compute.internal   <none>   <none>
csi-rbdplugin-rmcgp                                               3/3   Running   0   13m     10.0.63.145   ip-10-0-63-145.us-east-2.compute.internal   <none>   <none>
noobaa-operator-796bfb4c65-k4dq8                                  1/1   Running   0   18m     10.130.2.9    ip-10-0-63-145.us-east-2.compute.internal   <none>   <none>
odf-operator-controller-manager-56977d98b4-klhzs                  2/2   Running   0   4m31s   10.130.2.19   ip-10-0-63-145.us-east-2.compute.internal   <none>   <none>
rook-ceph-crashcollector-cfd2a7580c149360f20574a5df21a88c-zsp7p   1/1   Running   0   54s     10.130.2.22   ip-10-0-63-145.us-east-2.compute.internal   <none>   <none>
rook-ceph-exporter-ip-10-0-63-145.us-east-2.compute.intern74qz5   1/1   Running   0   54s     10.130.2.23   ip-10-0-63-145.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-55897b99df-zj5sg                                  2/2   Running   0   4m31s   10.130.2.25   ip-10-0-63-145.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-2-56db486c46-xl5cr                                  2/2   Running   0   4m31s   10.130.2.24   ip-10-0-63-145.us-east-2.compute.internal   <none>   <none>

8. Check Ceph status:
sh-5.1$ ceph -s
  cluster:
    id:     2beb8f74-078e-4878-abb9-c363c4a9f014
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 118s)
    mgr: a(active, since 10m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 111s), 3 in (since 9m)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 113 pgs
    objects: 94 objects, 131 MiB
    usage:   247 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     113 active+clean

  io:
    client: 853 B/s rd, 8.0 KiB/s wr, 1 op/s rd, 0 op/s wr

For more info on the second test: https://docs.google.com/document/d/1SZevJ14RJzmizif1Po9UOucagHbAZjLtUb9yIeJDJLE/edit
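Based on this run, a pre-check sketch for AWS_UPI before labeling a replacement node (the node name is from this test and <pv-name> is a placeholder; substitute your own values):

# Zone the existing OSD volumes are pinned to
$ oc get pv | grep ocs-deviceset
$ oc get pv <pv-name> -o yaml | grep -A 8 nodeAffinity

# Zone of the candidate replacement node -- it should match before applying the OCS label
$ oc get node ip-10-0-63-145.us-east-2.compute.internal -L topology.kubernetes.io/zone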
> The new node should be in the same zone [AWS] / rack [VMware] as the replaced node.

Agreed, we need a statement in the node replacement doc that indicates replaced nodes must be in the same zone/rack.
Hi Travis,

On vSphere_UPI, the rack label is created automatically when labeling a new node with the OCS label.
https://bugzilla.redhat.com/show_bug.cgi?id=2102304#c19
On vSphere_IPI, a new node is created automatically after deleting the node.
On AWS_IPI, a new node [EC2] is created automatically after deleting the node.
On AWS_UPI, we need to add a new node [EC2] in the same zone.

So I think we need to add a note only for AWS_UPI to create the new node in the same zone as the replaced node.
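To illustrate the proposed AWS_UPI note, the flow would roughly be the following (a sketch only; <failed-node> and <new-node> are placeholders, and the zone has to be recorded before the old node is deleted):

$ oc get node <failed-node> -L topology.kubernetes.io/zone     # note the zone of the node being replaced
# ...create the replacement EC2 instance in that same zone and join it as a worker, then:
$ oc label node <new-node> cluster.ocs.openshift.io/openshift-storage=""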
(In reply to Oded from comment #7)
> So I think we need to add a note only for AWS_UPI to create the new node in the same zone as the replaced node.

Hi Anjana,

Could the above note be added to the relevant section in the 4.13 docs?

Regards,
Harish
(In reply to Harish NV Rao from comment #8)
> (In reply to Oded from comment #7)
>
> > So I think we need to add a note only for AWS_UPI to create the new node in the same zone as the replaced node.
>
> Hi Anjana,
>
> Could the above note be added to the relevant section in the 4.13 docs?
>
> Regards,
> Harish

Yes, Harish.
Can I move the BZ based on the GitLab commit?
https://gitlab.cee.redhat.com/red-hat-openshift-container-storage-documentation/openshift-data-foundation-documentation-4.13/-/commit/e4e54e3c6d3917fcfdd8ca9d01dbd38c127f1633#db85db15814d57597ed12d89dcc51a0009e0b008_13_16

I don't see the fix in the preview link:
https://dxp-docp-prod.apps.ext-waf.spoke.prod.us-west-2.aws.paas.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.13/html-single/replacing_nodes/index?lb_target=preview#replacing-an-operational-aws-node-upi_rhodf
(In reply to Oded from comment #17)
> Can I move the BZ based on the GitLab commit?
> https://gitlab.cee.redhat.com/red-hat-openshift-container-storage-documentation/openshift-data-foundation-documentation-4.13/-/commit/e4e54e3c6d3917fcfdd8ca9d01dbd38c127f1633#db85db15814d57597ed12d89dcc51a0009e0b008_13_16
>
> I don't see the fix in the preview link:
> https://dxp-docp-prod.apps.ext-waf.spoke.prod.us-west-2.aws.paas.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.13/html-single/replacing_nodes/index?lb_target=preview#replacing-an-operational-aws-node-upi_rhodf

After logging into the customer portal, I can see the note:

Note
When replacing an AWS node on user-provisioned infrastructure, the new node needs to be created in the same AWS zone as the original node.
@eran
@etamir @hnallurv Can we backport this to ODF 4.10/4.11/4.12?