Description of problem (please be as detailed as possible and provide log snippets):

On the RDR longevity cluster, which has had RBD and CephFS workloads running for more than 2 weeks, replication has stopped and rbd commands hang on one of the managed clusters (C2).

Version of all relevant components (if applicable):
ODF - 4.14.0-128
OCP - 4.14.0-0.nightly-2023-09-12-024050
Submariner - 0.16
ACM - 2.9.0-109

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Keep the RBD and CephFS workloads running for a week.
2. Perform failover of the RBD and CephFS based workloads from C1 to C2.
3. Keep the setup in the same state for a day.
4. Relocate the RBD based workloads back to C1.
5. Keep the setup in the same state; after 2 or 3 days replication had stopped and rbd commands hang on C2.

Namespace - appset-busybox-1-c1

Cluster C1 is creating the snapshots as per the schedule, but cluster C2 has not been syncing the snapshots for over 3 days.
For example, the image csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597 on C1 was last synced at 19:50:00 on Sep 23, 2023; the current date is Sep 28 06:42:14 UTC 2023.

OUTPUTS
=======

C1:

sh-5.1$ ceph osd blocklist ls
listed 0 entries

sh-5.1$ rbd snap ls ocs-storagecluster-cephblockpool/csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597 -a
SNAPID  NAME                                                                                           SIZE    PROTECTED  TIMESTAMP                 NAMESPACE
84538   .mirror.non_primary.061e9b75-6c1c-4035-8076-266386c1f714.01f286ba-eb50-40c1-988e-bc5ed9e2ca26  20 GiB             Fri Sep 22 14:25:28 2023  mirror (demoted peer_uuids:[] c87e9404-b96e-471c-96cd-ae4c32a6675a:84424 copied)
92978   .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.8f66b3ea-da32-4147-97c4-67c83806b2dc      20 GiB             Sat Sep 23 18:40:00 2023  mirror (primary peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
93027   .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.53d29199-c11d-4d28-8e80-fc64a72defc4      20 GiB             Sat Sep 23 19:30:01 2023  mirror (primary peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
93071   .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.657e8c6c-6f40-489b-8620-a21811045d74      20 GiB             Sat Sep 23 19:40:04 2023  mirror (primary peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
93087   .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.45cc27b1-d024-4816-8fab-f23b85f289f2      20 GiB             Sat Sep 23 19:50:00 2023  mirror (primary peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
118727  .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.32694de1-7dda-435b-b5ae-d3037a20b35b      20 GiB             Thu Sep 28 06:50:02 2023  mirror (primary peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])

sh-5.1$ rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597
csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597:
  global_id:   061e9b75-6c1c-4035-8076-266386c1f714
  state:       up+starting_replay
  description: starting replay
  service:     a on compute-0
  last_update: 2023-09-28 06:52:21
  peer_sites:
    name: 97ad2da9-5f6b-42f0-9803-5b70cbea1451
    state: up+replaying
    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"last_snapshot_bytes":0,"last_snapshot_sync_seconds":0,"local_snapshot_timestamp":1695494400,"remote_snapshot_timestamp":1695883802,"replay_state":"idle"}
    last_update: 2023-09-28 06:52:38
  snapshots:
    92978 .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.8f66b3ea-da32-4147-97c4-67c83806b2dc (peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
    93027 .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.53d29199-c11d-4d28-8e80-fc64a72defc4 (peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
    93071 .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.657e8c6c-6f40-489b-8620-a21811045d74 (peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
    93087 .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.45cc27b1-d024-4816-8fab-f23b85f289f2 (peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
    118727 .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.32694de1-7dda-435b-b5ae-d3037a20b35b (peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])

Ceph status:
  cluster:
    id:     d72a52c8-f24b-4e17-ac0f-0fcde1dc7a5a
    health: HEALTH_OK

  services:
    mon:        3 daemons, quorum d,e,g (age 13h)
    mgr:        a(active, since 4d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 4d), 3 in (since 2w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 792.63k objects, 213 GiB
    usage:   644 GiB used, 5.4 TiB / 6 TiB avail
    pgs:     169 active+clean

  io:
    client:   18 MiB/s rd, 129 KiB/s wr, 2.35k op/s rd, 20 op/s wr

sh-5.1$ rbd mirror pool status ocs-storagecluster-cephblockpool
health: WARNING
daemon health: OK
image health: WARNING
images: 40 total
    26 starting_replay
    14 replaying

---------------------------------------------

C2:

ceph status
  cluster:
    id:     97ad2da9-5f6b-42f0-9803-5b70cbea1451
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            5 daemons have recently crashed

  services:
    mon: 3 daemons, quorum d,f,g (age 13h)
    mgr: a(active, since 4d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 4d), 3 in (since 2w)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 400.82k objects, 119 GiB
    usage:   358 GiB used, 5.7 TiB / 6 TiB avail
    pgs:     169 active+clean

sh-5.1$ ceph osd blocklist ls
listed 0 entries

RBD commands such as "rbd mirror snapshot schedule list --recursive" and "rbd mirror image status ocs-storagecluster-cephblockpool/" hang on C2, whereas all rbd commands work on C1:

rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597  ==> hangs
rbd snap ls ocs-storagecluster-cephblockpool/csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597 -a           ==> hangs
rbd mirror pool status ocs-storagecluster-cephblockpool                                                ==> hangs

________________________________________________________________________

Actual results:
Replication stops after a few days of the workloads running.

Expected results:
Replication should not stop.

Additional info:
Raising this BZ based on the observations made by Shyam in https://bugzilla.redhat.com/show_bug.cgi?id=2240525#c3

Must-gather logs:
c1  - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/replication-issue/c1/
c2  - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/replication-issue/c2/
hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/replication-issue/hub/
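Note on reading the "rbd mirror image status" output above: the sync stall shows up in the peer's description JSON, where local_snapshot_timestamp lags remote_snapshot_timestamp by several days. A minimal sketch (assuming GNU date is available) to decode the two epoch values pasted above:

# Hedged sketch: decode the snapshot timestamps from the mirror status JSON above.
# 1695494400 = local_snapshot_timestamp  (last snapshot the peer actually synced)
# 1695883802 = remote_snapshot_timestamp (latest primary snapshot on C1)
date -u -d @1695494400   # -> Sat Sep 23 18:40:00 UTC 2023
date -u -d @1695883802   # -> Thu Sep 28 06:50:02 UTC 2023
# Roughly 4.5 days of primary snapshots have not been replayed by the peer.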
Live clusters are available for debugging:
c1  - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/29300/
c2  - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/29301/
hub - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/29299/
This BZ and the CephFS-based BZ https://bugzilla.redhat.com/show_bug.cgi?id=2241131 were hit around the same time.
Rook has incorrectly assigned the exported IP `242.1.255.254` as the `public_addr` for the rook-ceph-osd-0 pod in the `c2` cluster. The service created by globalnet for osd-0 has `242.1.255.248` as the external (exported) IP.

```
❯ oc get service submariner-zoh2rvm7zupu5kldmqt6mduidrg6wek2 -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2023-09-23T00:41:35Z"
  finalizers:
  - submariner.io/globalnet-internal-service
  labels:
    submariner.io/exportedServiceRef: rook-ceph-osd-0
  name: submariner-zoh2rvm7zupu5kldmqt6mduidrg6wek2
  namespace: openshift-storage
  resourceVersion: "13827635"
  uid: 8b85c64e-c60c-4db6-aa1c-e9f7417890ef
spec:
  clusterIP: 172.30.60.17
  clusterIPs:
  - 172.30.60.17
  externalIPs:
  - 242.1.255.248
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: osd-port-v2
    port: 6800
    protocol: TCP
    targetPort: 6800
  selector:
    app: rook-ceph-osd
    ceph-osd-id: "0"
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```

Running nslookup on `kmanohar-long-c2.rook-ceph-osd-0.openshift-storage.svc.clusterset.local` returned `242.1.255.254` as the IP for the `rook-ceph-osd-0` clusterset service, so Rook assigned `242.1.255.254` as the public-addr for the rook-ceph-osd-0 pod.

Rook logs:
```
2023-09-23 20:56:53.699249 I | op-osd: OSD will have its main bluestore block on "ocs-deviceset-thin-csi-odf-2-data-0vvp2c"
2023-09-23 20:56:53.846600 I | op-osd: osd.0 cluster IP is "172.30.241.158"
2023-09-23 20:56:53.889365 I | op-osd: osd.0 exported IP is "242.1.255.254"   ### Need to check why nslookup is returning `242.1.255.254` instead of `242.1.255.248`
2023-09-23 20:56:53.889588 I | op-osd: assigning osd 0 topology affinity to "topology.rook.io/rack=rack0"
2023-09-23 20:56:54.368601 I | op-mgr: successful modules: mgr module(s) from the spec
```

Rook ceph osd pod:
```
oc get pods rook-ceph-osd-0-5747994569-4n7f9 -o yaml | grep public-addr
    - --public-addr=242.1.255.254
```
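For reference, a hedged sketch of the comparison described above, runnable from the CLI. The service and DNS names are taken from the outputs in this comment, and the `quay.io/submariner/nettest` image is assumed to be usable for in-cluster DNS checks; adjust names for other clusters.

```
# 1. IP that clusterset DNS (Lighthouse) returns - this is what Rook picks up as public_addr
kubectl -n openshift-storage run dns-check --rm -i --restart=Never \
  --image=quay.io/submariner/nettest -- \
  nslookup kmanohar-long-c2.rook-ceph-osd-0.openshift-storage.svc.clusterset.local

# 2. IP that globalnet actually allocated to the exported service
oc -n openshift-storage get service submariner-zoh2rvm7zupu5kldmqt6mduidrg6wek2 \
  -o jsonpath='{.spec.externalIPs[0]}{"\n"}'

# If the two IPs differ (242.1.255.254 vs 242.1.255.248 in this bug),
# the OSD's public_addr points at a stale global IP.
```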
Sridhar, any idea why nslookup is not returning the expected external IP here?

```
❯ kubectl -n openshift-storage run tmp-shell --rm -i --tty --image quay.io/submariner/nettest -- /bin/bash
If you don't see a command prompt, try pressing enter.
bash-5.0# nslookup kmanohar-long-c2.rook-ceph-osd-0.openshift-storage.svc.clusterset.local
Server:    172.30.0.10
Address:   172.30.0.10#53

Name:      kmanohar-long-c2.rook-ceph-osd-0.openshift-storage.svc.clusterset.local
Address:   242.1.255.254
bash-5.0#
```

nslookup returns `242.1.255.254` for the exported service `rook-ceph-osd-0`, but the exported service has `242.1.255.248` as the external IP:

```
❯ oc get service submariner-zoh2rvm7zupu5kldmqt6mduidrg6wek2 -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2023-09-23T00:41:35Z"
  finalizers:
  - submariner.io/globalnet-internal-service
  labels:
    submariner.io/exportedServiceRef: rook-ceph-osd-0
  name: submariner-zoh2rvm7zupu5kldmqt6mduidrg6wek2
  namespace: openshift-storage
  resourceVersion: "13827635"
  uid: 8b85c64e-c60c-4db6-aa1c-e9f7417890ef
spec:
  clusterIP: 172.30.60.17
  clusterIPs:
  - 172.30.60.17
  externalIPs:
  - 242.1.255.248
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: osd-port-v2
    port: 6800
    protocol: TCP
    targetPort: 6800
  selector:
    app: rook-ceph-osd
    ceph-osd-id: "0"
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```

Globalnet logs:

```
2023-09-23T20:34:02.228Z INF ..essip_controller.go:129 Globalnet Processing created ClusterGlobalEgressIP "cluster-egress.submariner.io", Spec.NumberOfIPs: 8, Status: v1.GlobalEgressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 12, 13, 21, 46, 0, time.Local), Reason:"Success", Message:"Allocated 8 global IP(s)"}}, AllocatedIPs:[]string{"242.1.0.1", "242.1.0.2", "242.1.0.3", "242.1.0.4", "242.1.0.5", "242.1.0.6", "242.1.0.7", "242.1.0.8"}}
2023-09-23T20:34:02.228Z DBG ..essip_controller.go:194 Globalnet Update called for "cluster-egress.submariner.io", but numberOfIPs 8 are already allocated
2023-09-23T20:34:02.300Z INF ../node_controller.go:114 Globalnet Processing created Node "compute-2"
2023-09-23T20:34:02.300Z INF ../node_controller.go:144 Globalnet Allocated global IP 242.1.255.254 for node "compute-2"
2023-09-23T20:34:02.300Z INF ../node_controller.go:147 Globalnet Adding ingress rules for node "compute-2" with global IP 242.1.255.254, CNI IP 10.129.2.1
2023-09-23T20:34:02.300Z DBG ..s/iptables/iface.go:184 IPTables Installing iptable ingress rules for Node: -p icmp -d 242.1.255.254 -j DNAT --to 10.129.2.1
2023-09-23T20:34:02.528Z INF ..essip_controller.go:138 Globalnet Processing created openshift-storage/rook-ceph-mon-d, TargetRef: "ClusterIPService", "rook-ceph-mon-d", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 13, 4, 20, 5, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.251"}
2023-09-23T20:34:02.529Z INF ..essip_controller.go:138 Globalnet Processing created openshift-storage/rook-ceph-mon-f, TargetRef: "ClusterIPService", "rook-ceph-mon-f", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 13, 4, 25, 19, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.249"}
2023-09-23T20:34:02.529Z INF ..essip_controller.go:138 Globalnet Processing created openshift-storage/rook-ceph-mon-g, TargetRef: "ClusterIPService", "rook-ceph-mon-g", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 23, 13, 46, 30, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.254"}
2023-09-23T20:34:02.529Z INF ..essip_controller.go:138 Globalnet Processing created openshift-storage/rook-ceph-osd-0, TargetRef: "ClusterIPService", "rook-ceph-osd-0", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 23, 0, 41, 35, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.248"}
2023-09-23T20:34:02.529Z INF ..essip_controller.go:138 Globalnet Processing created openshift-storage/rook-ceph-osd-1, TargetRef: "ClusterIPService", "rook-ceph-osd-1", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 13, 4, 18, 3, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.253"}
2023-09-23T20:34:02.529Z INF ..essip_controller.go:138 Globalnet Processing created openshift-storage/rook-ceph-osd-2, TargetRef: "ClusterIPService", "rook-ceph-osd-2", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 13, 4, 18, 32, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.252"}
2023-09-23T20:34:02.933Z INF ../gateway_monitor.go:351 Globalnet Successfully started the controllers
```
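To see the same allocation data directly from the cluster (rather than from the globalnet logs), a hedged sketch; it assumes the GlobalIngressIP CRD name `globalingressips.submariner.io` and the `.status.allocatedIP` field suggested by the log entries above:

```
# List the global IPs globalnet currently has allocated to the exported
# openshift-storage services, for comparison against what clusterset DNS returns.
oc -n openshift-storage get globalingressips.submariner.io \
  -o custom-columns='NAME:.metadata.name,ALLOCATED_IP:.status.allocatedIP'
```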
(In reply to Santosh Pillai from comment #10)
> Sridhar, any idea why nslookup is not returning the expected external IP
> here.

It looks like the actual IP assigned to the exported service is 242.1.255.248, but Lighthouse seems to be resolving it to 242.1.255.254. We would need the logs of the Lighthouse component to understand if there is any error. Can you please attach the output of "subctl gather..." from the cluster?
(In reply to Sridhar Gaddam from comment #11)
> (In reply to Santosh Pillai from comment #10)
> > Sridhar, any idea why nslookup is not returning the expected external IP
> > here.
>
> It looks like the actual IP assigned to the exported service is
> 242.1.255.248, but Lighthouse seems to be resolving it to 242.1.255.254.
> We would need the logs of the Lighthouse component to understand if there is
> any error. Can you please attach the output of "subctl gather..." from the
> cluster?

Attached submariner logs from the c2 cluster in comment #12.
*** Bug 2241131 has been marked as a duplicate of this bug. ***
Okay, I had a look at the logs and identified the root cause of the issue.

The ODF clusters are using multiple Gateway nodes, and the clusters have been running for more than 20 days. During this period, the active Gateway migrated from one node to another a couple of times. From the logs, we noticed that during a GW migration, the submariner-globalnet controller on the passive GW node kept running for a few seconds before it was stopped; during this interval, the submariner-globalnet controller on the newly elected active Gateway node had already started. So there is a brief window in which two instances of the submariner-globalnet controller were active and both trying to process the globalIPs for various objects such as Nodes and exported services. Because of this race condition, some of the globalIPs allocated to the exported services were updated to new ones. More specifically, the globalIP assigned to rook-ceph-osd-0 changed from 242.1.255.254 to 242.1.255.248.

When an application tries to resolve the globalIP associated with the service rook-ceph-osd-0, the request is handled by the Submariner Lighthouse component. Lighthouse was assuming that a globalIP, once assigned to an exported service, would not change for the lifetime of the service. Because of this, Lighthouse kept returning the old globalIP, which is no longer valid, and hence mirroring was not happening (the clients were trying to connect to the wrong global IP).

I pushed the following fix, which partially addresses this issue - https://github.com/submariner-io/submariner/pull/2739. In parallel, we are also working on a proper fix to avoid this race condition altogether.

PS: Apart from the issue mentioned above, there is no issue with the connections between the Gateway nodes. If you run "subctl verify ...", all the e2e tests will pass. The problem is seen mainly because the cluster has existing exported services and a Gateway migration triggered the race condition that updated the globalIPs of the exported services.
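For anyone checking a live cluster for this condition, a hedged diagnostic sketch to spot exported services whose Lighthouse answer no longer matches the current globalnet allocation after a Gateway migration. It assumes the GlobalIngressIP objects are named after the exported services (as in the globalnet logs above), the `.status.allocatedIP` field from those logs, and that `kmanohar-long-c2` is the local Submariner cluster ID taken from the DNS name used earlier; adjust these for other environments.

```
CLUSTER_ID=kmanohar-long-c2   # assumption: local Submariner cluster ID
NS=openshift-storage

for gip in $(oc -n "$NS" get globalingressips.submariner.io -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $gip =="
  echo "globalnet allocated: $(oc -n "$NS" get globalingressips.submariner.io "$gip" -o jsonpath='{.status.allocatedIP}')"
  echo "lighthouse resolves:"
  kubectl -n "$NS" run "dns-check-$gip" --rm -i --restart=Never \
    --image=quay.io/submariner/nettest -- \
    nslookup "$CLUSTER_ID.$gip.$NS.svc.clusterset.local"
done
# Any service where the resolved address differs from the allocated IP is
# being served a stale entry by Lighthouse.
```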
Clearing the needinfo on me, as ACM bug https://issues.redhat.com/browse/ACM-7844 has been raised and a PR, https://github.com/submariner-io/submariner/pull/2743, is under review.
The ACM bug is already ON_QA.
Verification was done for the ACM bug. Moving this bug to VERIFIED.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days