Bug 2196628
| Summary: | [RDR] [Globalnet enabled] Rook ceph mon endpoints are not updated with new IPs when Submariner is re-installed | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aman Agrawal <amagrawa> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | kmanohar |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.13 | CC: | aclewett, asriram, assingh, asuryana, bkunal, kmanohar, kramdoss, muagarwa, nyechiel, odf-bz-bot, rtalur, sagrawal, sapillai, sgaddam, skitt, tnielsen, vthapar, vumrao |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-09-24 07:35:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Comment 4
Santosh Pillai
2023-05-11 02:51:02 UTC
Used the following steps to add the mons back to quorum. Here we edit only one mon and let the Rook operator fail over the other mons.

Obtain the following information from the cluster:
- fsid
- mon-e exported IP: this can be obtained from `oc get service | grep submariner`. Let's say the exported IP for mon-e is 242.0.255.251 in this case.

Steps:
- Scale down the OCS operator and Rook operator deployments:
      oc scale deployment ocs-operator --replicas=0 -n openshift-storage
      oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
- Update the mon deployment to use the correct exported IP in the `--public-addr` argument under `spec.containers[0].args`:
      --public-addr=242.0.255.251
- Copy the mon-e deployment:
      oc get deployment rook-ceph-mon-e -o yaml > rook-ceph-mon-e-deployment-c1.yaml
- Edit rook-ceph-mon-endpoints to use the correct exported IP for mon-e.
- Patch the rook-ceph-mon-e deployment to stop the mon process from running, without deleting the mon pod:
      kubectl patch deployment rook-ceph-mon-e --type='json' -p '[{"op":"remove", "path":"/spec/template/spec/containers/0/livenessProbe"}]'
      kubectl patch deployment rook-ceph-mon-e -p '{"spec": {"template": {"spec": {"containers": [{"name": "mon", "command": ["sleep", "infinity"], "args": []}]}}}}'
- Connect to the mon-e pod:
      oc exec -it <rook-ceph-mon-e-pod> -- sh
- Inside the mon-e pod:
  - Create a temporary monmap:
        monmaptool --create --add e 242.0.255.251 --set-min-mon-release --enable-all-features --clobber /tmp/monmap --fsid <ceph fsid>
  - Remove the mon-e entry:
        monmaptool --rm e /tmp/monmap
  - Add it back with the v2 protocol (add the v1 protocol as well if the cluster supports both):
        monmaptool --addv e [v2:242.0.255.251:3300] /tmp/monmap
  - Inject this monmap into mon-e:
        ceph-mon -i e --inject-monmap /tmp/monmap
- Exit the mon-e pod.
- Scale the OCS and Rook operator deployments back up:
      oc scale deployment ocs-operator --replicas=1 -n openshift-storage
      oc scale deployment rook-ceph-operator --replicas=1 -n openshift-storage
- Wait for the Rook operator to fail over the other mons.
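The sequence above can be collected into a small dry-run helper that prints the commands for review instead of running them. This is a hypothetical sketch (the `print_mon_recovery` function is not part of Rook or ODF), and it intentionally omits the interactive edits of the mon deployment and configmap:

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the mon recovery commands from the
# steps above for a given mon name, exported IP, and fsid, so the sequence
# can be reviewed before anything is run against a live cluster.
print_mon_recovery() {
  mon="$1"; ip="$2"; fsid="$3"
  cat <<EOF
oc scale deployment ocs-operator --replicas=0 -n openshift-storage
oc scale deployment rook-ceph-operator --replicas=0 -n openshift-storage
# ...edit the mon deployment and rook-ceph-mon-endpoints, then inside the mon pod:
monmaptool --create --add $mon $ip --enable-all-features --clobber /tmp/monmap --fsid $fsid
monmaptool --rm $mon /tmp/monmap
monmaptool --addv $mon [v2:$ip:3300] /tmp/monmap
ceph-mon -i $mon --inject-monmap /tmp/monmap
oc scale deployment ocs-operator --replicas=1 -n openshift-storage
oc scale deployment rook-ceph-operator --replicas=1 -n openshift-storage
EOF
}

print_mon_recovery e 242.0.255.251 6bee5946-d3e4-4999-8110-24ed4325fbe2
```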
The c1 cluster is healthy now after applying the above workaround:
```
sh-5.1$ ceph status
  cluster:
    id:     6bee5946-d3e4-4999-8110-24ed4325fbe2
    health: HEALTH_OK

  services:
    mon:        3 daemons, quorum e,g,h (age 21m)
    mgr:        a(active, since 24m)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 21m), 3 in (since 10d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 3.05k objects, 3.3 GiB
    usage:   5.9 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     169 active+clean

  io:
    client: 31 KiB/s rd, 1.5 MiB/s wr, 36 op/s rd, 322 op/s wr
```
The c2 cluster has daemons crashing, but the mons are up now:
```
sh-5.1$ ceph status
  cluster:
    id:     c2c61349-f7b5-47c5-8fd6-f687ea46b450
    health: HEALTH_WARN
            1599 daemons have recently crashed

  services:
    mon:        3 daemons, quorum e,h,i (age 32m)
    mgr:        a(active, since 34m)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 33m), 3 in (since 10d)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 2.89k objects, 4.2 GiB
    usage:   12 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     169 active+clean

  io:
    client: 17 KiB/s rd, 64 KiB/s wr, 21 op/s rd, 6 op/s wr
```
(In reply to Santosh Pillai from comment #5)

One last step is to restart the rbd-mirror pods on both clusters. Mirroring health is OK on both clusters now:

    oc get cephblockpool ocs-storagecluster-cephblockpool -n openshift-storage -o jsonpath='{.status.mirroringStatus.summary}{"\n"}'
    {"daemon_health":"OK","health":"OK","image_health":"OK","states":{"replaying":20}}

Santosh, great to see the workaround for getting the cluster back up in this scenario of reinstalling Submariner. This scenario is very disruptive: Ceph requires immutable IP addresses for the mons, so we cannot support this scenario automatically in Rook. The only way we can hope to support it is that, if/when it happens in production, the customer will need to contact the support team to step through these complicated recovery steps. Even better if we can get this recovery working with the krew plugin, which would just need an addition to the existing --restore-quorum command to support the changing IP. Then there is the separate question of the best way for the support team to use the krew plugin (or an alternative) that is fully tested by QE.

Based on previous comments, moving out of 4.13 since we can't support anything except the customer working with support on disaster recovery steps.

Hi Vikhyat, did you get a chance to check the last comment by Travis regarding doc support?

(In reply to Santosh Pillai from comment #29)

Hi Santosh, yes, updating the IP should be easy - this is documented at https://access.redhat.com/solutions/3093781 for standalone clusters. I think the basic steps should be the same for ODF. Adding @assingh, who can help from the ODF side.

(In reply to Vikhyat Umrao from comment #30)

Ah, I see in comment #5 you were already able to achieve it, and the question is whether we need to document it or not. I think yes, we should document it. @bkunal and Ashish - can you please check from the KCS point of view?

Thanks for the doc, Bipin. I'll take a look at it tomorrow.

Hi Bipin, this was tested only once (which didn't go well, if I remember correctly) and I suspect we might hit other issues (during hub recovery or node failure scenarios), so I suggest it be re-tested (this may need a tracker and an assignee). Seeking thoughts.

Santosh, please work on the KCS as suggested by Travis: https://access.redhat.com/node/add/kcs-solution

Observed the same behaviour when trying to uninstall and reinstall Submariner in a cluster replacement scenario. The MDS pods are also in CrashLoopBackOff state, all the ceph commands are stuck, and the Submariner connection status was also degraded.
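The mirroring summary returned by the jsonpath query above is plain JSON, so a script can gate on its health fields. A minimal sketch, using the sample output shown above as a hard-coded string (on a live cluster, `summary` would come from the `oc get cephblockpool ... -o jsonpath=...` command):

```shell
#!/bin/sh
# Check the mirroring health fields of a cephblockpool mirroringStatus summary.
# The sample value below is the output shown above; on a live cluster it would
# come from `oc get cephblockpool ... -o jsonpath='{.status.mirroringStatus.summary}'`.
summary='{"daemon_health":"OK","health":"OK","image_health":"OK","states":{"replaying":20}}'

for field in daemon_health health image_health; do
  if echo "$summary" | grep -q "\"$field\":\"OK\""; then
    echo "$field OK"
  else
    echo "$field NOT OK"
  fi
done
```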
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-74675cc7jhnd6 1/2 CrashLoopBackOff 141 (2m58s ago) 14h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6bf88858r9nbc 1/2 CrashLoopBackOff 141 (53s ago) 14h
___________________________________________________________________________________________________________________
```
oc get cm rook-ceph-mon-endpoints -o yaml
apiVersion: v1
data:
  csi-cluster-config-json: '[{"clusterID":"openshift-storage","monitors":["242.1.255.248:3300","242.1.255.250:3300","242.1.255.249:3300"],"cephFS":{"netNamespaceFilePath":"","subvolumeGroup":"","kernelMountOptions":"","fuseMountOptions":""},"rbd":{"netNamespaceFilePath":"","radosNamespace":""},"nfs":{"netNamespaceFilePath":""},"readAffinity":{"enabled":false,"crushLocationLabels":null},"namespace":""}]'
  data: f=242.1.255.248:3300,d=242.1.255.250:3300,e=242.1.255.249:3300
  mapping: '{"node":{"d":null,"e":null,"f":null}}'
  maxMonId: "5"
  outOfQuorum: ""
kind: ConfigMap
metadata:
  creationTimestamp: "2024-04-28T19:39:59Z"
  finalizers:
  - ceph.rook.io/disaster-protection
  name: rook-ceph-mon-endpoints
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: ceph.rook.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: CephCluster
    name: ocs-storagecluster-cephcluster
    uid: 1937512d-7903-4bfc-bfe7-90523c26662e
  resourceVersion: "102498"
  uid: e172ee65-8cbf-4ce5-b5ff-bb62e9e62db7
```
______________________________________________________________________________
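The `data` field of the rook-ceph-mon-endpoints ConfigMap is a comma-separated list of `mon=ip:port` pairs, which is the field the workaround edits. A small sketch splitting it into one mon per line, using the values shown above:

```shell
#!/bin/sh
# Split the rook-ceph-mon-endpoints "data" field into one mon per line.
# The sample value is the one shown above; on a live cluster it would come
# from `oc get cm rook-ceph-mon-endpoints -o jsonpath='{.data.data}'`.
data="f=242.1.255.248:3300,d=242.1.255.250:3300,e=242.1.255.249:3300"

echo "$data" | tr ',' '\n' | while IFS='=' read -r mon ep; do
  echo "mon $mon -> $ep"
done
```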
```
oc get service | grep submariner
submariner-3qslc37nybfedm2rae7lou4wkknd25iu   ClusterIP   172.30.21.74    242.0.255.252   3300/TCP   14h
submariner-alzs2tkhcukgmx55rmttuo6o22vrqubb   ClusterIP   172.30.222.93   242.0.255.253   3300/TCP   14h
submariner-cpqogstf25uynxvpgw4u34ak42mvdmdb   ClusterIP   172.30.192.247  242.0.255.249   6800/TCP   14h
submariner-vui6efiepwvd4jr4b7gjvmuicpviuwqg   ClusterIP   172.30.181.231  242.0.255.250   6800/TCP   14h
submariner-w4mfbcdvdi2oqcadnvzhxdwg75glri52   ClusterIP   172.30.9.17     242.0.255.254   3300/TCP   14h
submariner-zoh2rvm7zupu5kldmqt6mduidrg6wek2   ClusterIP   172.30.120.52   242.0.255.251   6800/TCP   14h
```
_________________________________________________________________________________________________________________________
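In the listing above, the fourth column is the globalnet external IP and the fifth is the exported port; these are the addresses the mon and OSD deployments must advertise. A sketch extracting `ip:port` pairs from the captured output (the two sample lines are taken from the listing above):

```shell
#!/bin/sh
# Extract the globalnet external IP and port from the captured
# `oc get service | grep submariner` output above. Columns:
#   $1 name, $2 type, $3 clusterIP, $4 externalIP, $5 port/proto, $6 age
services='submariner-3qslc37nybfedm2rae7lou4wkknd25iu ClusterIP 172.30.21.74 242.0.255.252 3300/TCP 14h
submariner-cpqogstf25uynxvpgw4u34ak42mvdmdb ClusterIP 172.30.192.247 242.0.255.249 6800/TCP 14h'

echo "$services" | awk '{ split($5, p, "/"); print $4 ":" p[1] }'
```

Port 3300 services front mons (msgr v2) and 6800 services front OSDs, which is how the listing distinguishes the two.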
```
oc get service submariner-cpqogstf25uynxvpgw4u34ak42mvdmdb -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2024-04-28T19:58:06Z"
  finalizers:
  - submariner.io/globalnet-internal-service
  labels:
    submariner.io/exportedServiceRef: rook-ceph-osd-2
  name: submariner-cpqogstf25uynxvpgw4u34ak42mvdmdb
  namespace: openshift-storage
  resourceVersion: "168660"
  uid: 7c930a7e-3d7f-4e11-8c15-bf5f163654a4
spec:
  clusterIP: 172.30.192.247
  clusterIPs:
  - 172.30.192.247
  externalIPs:
  - 242.0.255.249
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: osd-port-v2
    port: 6800
    protocol: TCP
    targetPort: 6800
  selector:
    app: rook-ceph-osd
    ceph-osd-id: "2"
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```
______________________________________________________________________________________________________________
Noticed that different IPs are assigned after re-installation of Submariner.
ODF - 4.16.0-77
ACM - 2.10.2
Submariner - 0.17.0
OCP - 4.16
subtcl gather logs - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/4.16/cluster_replacemnet/submariner/submariner-20240429084042/
subctl verify - https://privatebin.corp.redhat.com/?62e391efec155b9c#775F4X1YYPLeUNUemdFV6XBT4mja2yUnTB8TH8vfvBGL
*** Bug 2277936 has been marked as a duplicate of this bug. ***

Tried the manual workaround on a new cluster. The mons were failed over to the correct global IPs. Had to update the `--public-addr` argument in the OSD deployments as well.
The only problem I see now is that one of the OSDs is not coming up, due to:
```
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 143m default-scheduler Successfully assigned openshift-storage/rook-ceph-osd-1-7cf9d97d6-5ww9r to compute-2
Warning FailedMapVolume 2m57s (x70 over 143m) kubelet MapVolume.MapPodDevice failed for volume "pvc-5bf80e1b-f943-44cc-b447-6dfdd1080fd1" : rpc error: code = AlreadyExists desc = block volume already mounted in more than one place
```
which might be a different issue altogether!
Ceph status:
```
oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage|grep rook-ceph-operator|awk '{print$1}') ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config status
  cluster:
    id:     084efd57-e82f-4db6-ae39-f005f98c815b
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
            1 rack (1 osds) down
            Degraded data redundancy: 4165/12495 objects degraded (33.333%), 115 pgs degraded, 169 pgs undersized

  services:
    mon: 3 daemons, quorum d,g,h (age 3h)
    mgr: a(active, since 104m), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 2 up (since 2h), 3 in (since 3d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 4.17k objects, 8.3 GiB
    usage:   16 GiB used, 4.0 TiB / 4 TiB avail
    pgs:     4165/12495 objects degraded (33.333%)
             115 active+undersized+degraded
             54 active+undersized

  io:
    client: 938 B/s rd, 2.2 KiB/s wr, 1 op/s rd, 0 op/s wr
```
Still investigating.
(In reply to Santosh Pillai from comment #49)

This could be an issue with the environment, because I was not able to ssh into the node `compute-2` where the OSD is failing to start.
Cluster health is OK after restarting the `compute-2` node mentioned in the above comment:
```
❯ oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage|grep rook-ceph-operator|awk '{print$1}') ceph --conf=/var/lib/rook/openshift-storage/openshift-storage.config status
  cluster:
    id:     084efd57-e82f-4db6-ae39-f005f98c815b
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum d,g,h (age 53m)
    mgr: a(active, since 42m), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 53m), 3 in (since 3d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 4.17k objects, 8.3 GiB
    usage:   24 GiB used, 6.0 TiB / 6 TiB avail
    pgs:     169 active+clean

  io:
    client: 852 B/s rd, 2.0 KiB/s wr, 1 op/s rd, 0 op/s wr
```
Santosh, what are the next steps for this BZ?

(In reply to Mudit Agarwal from comment #52)

Testing is still in progress for the DR cluster with globalnet.

Moving this back to ON_QA based on comment #51.

Awesome. Thanks, Annette, for following up with the Submariner team. Good to know that cluster replacement can work without deleting submariner-globalnet on the surviving cluster. This will help bypass the issue, and hence the workaround mentioned in this BZ, for cluster-replacement scenarios.

Please update the RDT flag/text appropriately.

Since we now have a different approach to uninstall and reinstall Submariner, we won't need the workaround mentioned in comment #5, so we can safely close this BZ. We do need a doc BZ to elaborate on the uninstall and reinstall of Submariner.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.