Bug 2241124 - [RDR][Tracker ACM-7844] On RDR longevity cluster, RBD volume replication is not progressing, and rbd command hangs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.1
Assignee: Karolin Seeger
QA Contact: kmanohar
URL:
Whiteboard:
Duplicates: 2241131
Depends On:
Blocks:
 
Reported: 2023-09-28 07:05 UTC by kmanohar
Modified: 2024-11-20 04:25 UTC
CC List: 14 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-22 12:02:36 UTC
Embargoed:
kmanohar: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker ACM-7844 0 None None None 2023-10-10 13:59:40 UTC

Description kmanohar 2023-09-28 07:05:08 UTC
Description of problem (please be as detailed as possible and provide log snippets):
On an RDR longevity cluster, which has had RBD and CephFS workloads running for more than 2 weeks, replication has stopped and rbd commands hang on one of the managed clusters (C2).


Version of all relevant components (if applicable):
ODF - 4.14.0-128
OCP - 4.14.0-0.nightly-2023-09-12-024050
Submariner - 0.16
ACM - 2.9.0-109

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Keep the RBD and CephFS workloads running for a week.
2. Perform failover of the RBD- and CephFS-based workloads from C1 to C2.
3. Keep the setup in the same state for a day.
4. Perform relocate of the RBD-based workloads back to C1.
5. Keep the setup in the same state; after 2 or 3 days replication stops and rbd commands hang on C2.
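For reference, a minimal sketch of how the failover (step 2) and relocate (step 4) actions are typically driven from the hub via the DRPlacementControl resource; the DRPC name, namespace and managed-cluster names below are placeholders, not values taken from this setup:

```
# Placeholders: substitute the DRPC created for the appset-busybox-1-c1 workload
# and the managed-cluster names as registered in ACM.
# Step 2 - fail over the workload from C1 to C2:
oc patch drpc <busybox-drpc> -n <drpc-namespace> --type merge \
  -p '{"spec":{"action":"Failover","failoverCluster":"<c2-cluster-name>"}}'

# Step 4 - relocate the RBD-based workload back to the preferred cluster:
oc patch drpc <busybox-drpc> -n <drpc-namespace> --type merge \
  -p '{"spec":{"action":"Relocate","preferredCluster":"<c1-cluster-name>"}}'
```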

Namespace: appset-busybox-1-c1
Cluster C1 is creating the snapshots as per the schedule, but cluster C2 has not synced the snapshots for over 3 days.

For example:
the image csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597 on C1 last synced at 19:50:00 2023, while the current date is Sep 28 06:42:14 UTC 2023.

OUTPUTS
=======
C1:  

sh-5.1$ ceph osd blocklist ls
listed 0 entries

sh-5.1$ rbd snap ls ocs-storagecluster-cephblockpool/csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597 -a
SNAPID  NAME                                                                                           SIZE    PROTECTED  TIMESTAMP                 NAMESPACE                                                                       
 84538  .mirror.non_primary.061e9b75-6c1c-4035-8076-266386c1f714.01f286ba-eb50-40c1-988e-bc5ed9e2ca26  20 GiB             Fri Sep 22 14:25:28 2023  mirror (demoted peer_uuids:[] c87e9404-b96e-471c-96cd-ae4c32a6675a:84424 copied)
 92978  .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.8f66b3ea-da32-4147-97c4-67c83806b2dc      20 GiB             Sat Sep 23 18:40:00 2023  mirror (primary peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])              
 93027  .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.53d29199-c11d-4d28-8e80-fc64a72defc4      20 GiB             Sat Sep 23 19:30:01 2023  mirror (primary peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])              
 93071  .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.657e8c6c-6f40-489b-8620-a21811045d74      20 GiB             Sat Sep 23 19:40:04 2023  mirror (primary peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])              
 93087  .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.45cc27b1-d024-4816-8fab-f23b85f289f2      20 GiB             Sat Sep 23 19:50:00 2023  mirror (primary peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])              
118727  .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.32694de1-7dda-435b-b5ae-d3037a20b35b      20 GiB             Thu Sep 28 06:50:02 2023  mirror (primary peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])      

sh-5.1$ rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597
csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597:
  global_id:   061e9b75-6c1c-4035-8076-266386c1f714
  state:       up+starting_replay
  description: starting replay
  service:     a on compute-0
  last_update: 2023-09-28 06:52:21
  peer_sites:
    name: 97ad2da9-5f6b-42f0-9803-5b70cbea1451
    state: up+replaying
    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"last_snapshot_bytes":0,"last_snapshot_sync_seconds":0,"local_snapshot_timestamp":1695494400,"remote_snapshot_timestamp":1695883802,"replay_state":"idle"}
    last_update: 2023-09-28 06:52:38
  snapshots:
    92978 .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.8f66b3ea-da32-4147-97c4-67c83806b2dc (peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
    93027 .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.53d29199-c11d-4d28-8e80-fc64a72defc4 (peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
    93071 .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.657e8c6c-6f40-489b-8620-a21811045d74 (peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
    93087 .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.45cc27b1-d024-4816-8fab-f23b85f289f2 (peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
    118727 .mirror.primary.061e9b75-6c1c-4035-8076-266386c1f714.32694de1-7dda-435b-b5ae-d3037a20b35b (peer_uuids:[b4e7f16e-d1eb-49f0-8dab-0d55b00845ca])
        

Ceph status

cluster:
    id:     d72a52c8-f24b-4e17-ac0f-0fcde1dc7a5a
    health: HEALTH_OK
 
  services:
    mon:        3 daemons, quorum d,e,g (age 13h)
    mgr:        a(active, since 4d)
    mds:        1/1 daemons up, 1 hot standby
    osd:        3 osds: 3 up (since 4d), 3 in (since 2w)
    rbd-mirror: 1 daemon active (1 hosts)
    rgw:        1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 792.63k objects, 213 GiB
    usage:   644 GiB used, 5.4 TiB / 6 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   18 MiB/s rd, 129 KiB/s wr, 2.35k op/s rd, 20 op/s wr

sh-5.1$  rbd mirror pool status ocs-storagecluster-cephblockpool
health: WARNING
daemon health: OK
image health: WARNING
images: 40 total
    26 starting_replay
    14 replaying
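
To quantify the lag from the peer_sites description in the image status above, the epoch values in local_snapshot_timestamp (latest snapshot the C2 peer has synced) and remote_snapshot_timestamp (latest primary snapshot on C1) can be decoded directly in the toolbox; a minimal sketch using the values reported for this image:

```
sh-5.1$ date -u -d @1695494400    # local_snapshot_timestamp on the C2 peer
Sat Sep 23 18:40:00 UTC 2023
sh-5.1$ date -u -d @1695883802    # remote_snapshot_timestamp (latest primary snapshot on C1)
Thu Sep 28 06:50:02 UTC 2023
```

So the peer is roughly 4.5 days behind even though C1 keeps creating primary snapshots on schedule.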


---------------------------------------------

C2

ceph status
  cluster:
    id:     97ad2da9-5f6b-42f0-9803-5b70cbea1451
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            5 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum d,f,g (age 13h)
    mgr: a(active, since 4d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 4d), 3 in (since 2w)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 400.82k objects, 119 GiB
    usage:   358 GiB used, 5.7 TiB / 6 TiB avail
    pgs:     169 active+clean

sh-5.1$ ceph osd blocklist ls
listed 0 entries

RBD commands such as rbd mirror snapshot schedule list --recursive and rbd mirror image status ocs-storagecluster-cephblockpool/ hang on C2, whereas all rbd commands work on C1.

rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597

==> hangs

rbd snap ls ocs-storagecluster-cephblockpool/csi-vol-eaec7c83-0c7b-46f7-9827-d9a6191e6597 -a

==> hangs

rbd mirror pool status ocs-storagecluster-cephblockpool

==> hangs
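
A minimal sketch of additional checks that can be run from the rook-ceph toolbox on C2 to narrow down the hang; the probe address below is only an example (substitute whatever the OSDs actually advertise in ceph osd dump):

```
sh-5.1$ ceph health detail              # expands the slow metadata IO / crash warnings
sh-5.1$ ceph osd dump | grep '^osd\.'   # shows each OSD's advertised public address
# Probe the advertised OSD address/port from inside the cluster (example address):
sh-5.1$ timeout 5 bash -c '</dev/tcp/242.1.255.254/6800' && echo reachable || echo unreachable
```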
________________________________________________________________________


Actual results:
Replication stops after a few days of workloads running.

Expected results:
Replication should not stop


Additional info:
Raising this BZ based on the observations made by Shyam in https://bugzilla.redhat.com/show_bug.cgi?id=2240525#c3.


Must gather logs -
c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/replication-issue/c1/

c2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/replication-issue/c2/

hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/keerthana/Longevity-4.14/replication-issue/hub/

Comment 3 kmanohar 2023-09-28 07:57:31 UTC
This BZ and the CephFS-based BZ https://bugzilla.redhat.com/show_bug.cgi?id=2241131 were hit around the same time.

Comment 9 Santosh Pillai 2023-10-03 05:51:36 UTC
Rook has incorrectly assigned the exported IP `242.1.255.254` as the `public_addr` for the rook-ceph-osd-0 pod in the `c2` cluster.

The service created by Globalnet for osd-0 has `242.1.255.248` as its external (exported) IP.

```
❯ oc get service submariner-zoh2rvm7zupu5kldmqt6mduidrg6wek2 -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2023-09-23T00:41:35Z"
  finalizers:
  - submariner.io/globalnet-internal-service
  labels:
    submariner.io/exportedServiceRef: rook-ceph-osd-0
  name: submariner-zoh2rvm7zupu5kldmqt6mduidrg6wek2
  namespace: openshift-storage
  resourceVersion: "13827635"
  uid: 8b85c64e-c60c-4db6-aa1c-e9f7417890ef
spec:
  clusterIP: 172.30.60.17
  clusterIPs:
  - 172.30.60.17
  externalIPs:
  - 242.1.255.248
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: osd-port-v2
    port: 6800
    protocol: TCP
    targetPort: 6800
  selector:
    app: rook-ceph-osd
    ceph-osd-id: "0"
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```

Running nslookup on `kmanohar-long-c2.rook-ceph-osd-0.openshift-storage.svc.clusterset.local` returned `242.1.255.254` as the clusterset IP for `rook-ceph-osd-0`, so Rook assigned `242.1.255.254` as the public_addr for the rook-ceph-osd-0 pod.

Rook logs:

```
2023-09-23 20:56:53.699249 I | op-osd: OSD will have its main bluestore block on "ocs-deviceset-thin-csi-odf-2-data-0vvp2c"
2023-09-23 20:56:53.846600 I | op-osd: osd.0 cluster IP is "172.30.241.158"
2023-09-23 20:56:53.889365 I | op-osd: osd.0 exported IP is "242.1.255.254"    ### Need to check why nslookup is returning `242.1.255.254` instead of `242.1.255.248`
2023-09-23 20:56:53.889588 I | op-osd: assigning osd 0 topology affinity to "topology.rook.io/rack=rack0"
2023-09-23 20:56:54.368601 I | op-mgr: successful modules: mgr module(s) from the spec
```


Rook ceph osd pod:
```
oc get pods rook-ceph-osd-0-5747994569-4n7f9 -o yaml | grep public-addr
    - --public-addr=242.1.255.254
```
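
For completeness, a minimal sketch of how the three views of the osd-0 address can be compared on c2; the selectors, image and clusterset name follow what is shown in this bug, but treat them as assumptions if the environment differs:

```
# 1) The address Rook passed to the OSD pod:
oc -n openshift-storage get pod -l app=rook-ceph-osd,ceph-osd-id=0 -o yaml | grep public-addr

# 2) The address Globalnet actually allocated to the exported service:
oc -n openshift-storage get svc -l submariner.io/exportedServiceRef=rook-ceph-osd-0 \
    -o jsonpath='{.items[0].spec.externalIPs}{"\n"}'
oc -n openshift-storage get globalingressip | grep rook-ceph-osd-0

# 3) The address Lighthouse serves for the clusterset DNS name:
oc -n openshift-storage run dns-check --rm -i --restart=Never --image quay.io/submariner/nettest -- \
    nslookup kmanohar-long-c2.rook-ceph-osd-0.openshift-storage.svc.clusterset.local
```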

Comment 10 Santosh Pillai 2023-10-03 06:09:00 UTC
Sridhar, any idea why nslookup is not returning the expected external IP here?

```
❯ kubectl -n openshift-storage run tmp-shell --rm -i --tty --image quay.io/submariner/nettest -- /bin/bash
If you don't see a command prompt, try pressing enter.
bash-5.0# nslookup kmanohar-long-c2.rook-ceph-osd-0.openshift-storage.svc.clusterset.local
Server:		172.30.0.10
Address:	172.30.0.10#53

Name:	kmanohar-long-c2.rook-ceph-osd-0.openshift-storage.svc.clusterset.local
Address: 242.1.255.254

bash-5.0# 

```

nslookup returns `242.1.255.254` for the exported service `rook-ceph-osd-0`, but the exported service has `242.1.255.248` as its external IP:
```
❯ oc get service submariner-zoh2rvm7zupu5kldmqt6mduidrg6wek2 -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2023-09-23T00:41:35Z"
  finalizers:
  - submariner.io/globalnet-internal-service
  labels:
    submariner.io/exportedServiceRef: rook-ceph-osd-0
  name: submariner-zoh2rvm7zupu5kldmqt6mduidrg6wek2
  namespace: openshift-storage
  resourceVersion: "13827635"
  uid: 8b85c64e-c60c-4db6-aa1c-e9f7417890ef
spec:
  clusterIP: 172.30.60.17
  clusterIPs:
  - 172.30.60.17
  externalIPs:
  - 242.1.255.248
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: osd-port-v2
    port: 6800
    protocol: TCP
    targetPort: 6800
  selector:
    app: rook-ceph-osd
    ceph-osd-id: "0"
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```

Globalnet logs:

```
2023-09-23T20:34:02.228Z INF ..essip_controller.go:129 Globalnet            Processing created ClusterGlobalEgressIP "cluster-egress.submariner.io", Spec.NumberOfIPs: 8, Status: v1.GlobalEgressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 12, 13, 21, 46, 0, time.Local), Reason:"Success", Message:"Allocated 8 global IP(s)"}}, AllocatedIPs:[]string{"242.1.0.1", "242.1.0.2", "242.1.0.3", "242.1.0.4", "242.1.0.5", "242.1.0.6", "242.1.0.7", "242.1.0.8"}}
2023-09-23T20:34:02.228Z DBG ..essip_controller.go:194 Globalnet            Update called for "cluster-egress.submariner.io", but numberOfIPs 8 are already allocated
2023-09-23T20:34:02.300Z INF ../node_controller.go:114 Globalnet            Processing created Node "compute-2"
2023-09-23T20:34:02.300Z INF ../node_controller.go:144 Globalnet            Allocated global IP 242.1.255.254 for node "compute-2"
2023-09-23T20:34:02.300Z INF ../node_controller.go:147 Globalnet            Adding ingress rules for node "compute-2" with global IP 242.1.255.254, CNI IP 10.129.2.1
2023-09-23T20:34:02.300Z DBG ..s/iptables/iface.go:184 IPTables             Installing iptable ingress rules for Node: -p icmp -d 242.1.255.254 -j DNAT --to 10.129.2.1
2023-09-23T20:34:02.528Z INF ..essip_controller.go:138 Globalnet            Processing created openshift-storage/rook-ceph-mon-d, TargetRef: "ClusterIPService", "rook-ceph-mon-d", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 13, 4, 20, 5, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.251"}
2023-09-23T20:34:02.529Z INF ..essip_controller.go:138 Globalnet            Processing created openshift-storage/rook-ceph-mon-f, TargetRef: "ClusterIPService", "rook-ceph-mon-f", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 13, 4, 25, 19, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.249"}
2023-09-23T20:34:02.529Z INF ..essip_controller.go:138 Globalnet            Processing created openshift-storage/rook-ceph-mon-g, TargetRef: "ClusterIPService", "rook-ceph-mon-g", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 23, 13, 46, 30, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.254"}
2023-09-23T20:34:02.529Z INF ..essip_controller.go:138 Globalnet            Processing created openshift-storage/rook-ceph-osd-0, TargetRef: "ClusterIPService", "rook-ceph-osd-0", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 23, 0, 41, 35, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.248"}
2023-09-23T20:34:02.529Z INF ..essip_controller.go:138 Globalnet            Processing created openshift-storage/rook-ceph-osd-1, TargetRef: "ClusterIPService", "rook-ceph-osd-1", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 13, 4, 18, 3, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.253"}
2023-09-23T20:34:02.529Z INF ..essip_controller.go:138 Globalnet            Processing created openshift-storage/rook-ceph-osd-2, TargetRef: "ClusterIPService", "rook-ceph-osd-2", Status: v1.GlobalIngressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2023, time.September, 13, 4, 18, 32, 0, time.Local), Reason:"Success", Message:"Allocated global IP"}}, AllocatedIP:"242.1.255.252"}
2023-09-23T20:34:02.933Z INF ../gateway_monitor.go:351 Globalnet            Successfully started the controllers
```
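
To check whether other exported services on c2 are affected by the same mismatch, something like the following loop can be used; this is a sketch only, and it assumes the tmp-shell nettest pod started above is still running and that kmanohar-long-c2 is the Submariner cluster ID:

```
#!/bin/bash
NS=openshift-storage
CLUSTER_ID=kmanohar-long-c2
for svc in $(oc -n "$NS" get serviceexport -o jsonpath='{.items[*].metadata.name}'); do
  # externalIP that Globalnet put on the internal submariner-* Service
  ext=$(oc -n "$NS" get svc -l submariner.io/exportedServiceRef="$svc" \
          -o jsonpath='{.items[0].spec.externalIPs[0]}' 2>/dev/null)
  # address Lighthouse currently serves for the clusterset name
  dns=$(oc -n "$NS" exec tmp-shell -- \
          nslookup "$CLUSTER_ID.$svc.$NS.svc.clusterset.local" 2>/dev/null |
          awk '/^Address/{ip=$2} END{print ip}')
  [ "$ext" = "$dns" ] || echo "MISMATCH $svc: service externalIP=$ext lighthouse=$dns"
done
```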

Comment 11 Sridhar Gaddam 2023-10-03 08:30:47 UTC
(In reply to Santosh Pillai from comment #10)
> Sridhar, any idea why nslookup is not returning the expected external IP
> here. 

It looks like the actual IP assigned to the exported service is 242.1.255.248, but Lighthouse seems to be resolving it to 242.1.255.254.
We would need the logs of the Lighthouse component to understand if there is any error. Can you please attach the output of "subctl gather..." from the cluster?

Comment 13 Santosh Pillai 2023-10-03 09:10:28 UTC
(In reply to Sridhar Gaddam from comment #11)
> (In reply to Santosh Pillai from comment #10)
> > Sridhar, any idea why nslookup is not returning the expected external IP
> > here. 
> 
> It looks like the actual IP assigned to the exported service is
> 242.1.255.248, but Lighthouse seems to be resolving it to 242.1.255.254.
> We would need the logs of Lighthouse component to understand if there is any
> error. Can you please attach the output of "subctl gather..." from the
> cluster?

Attached the Submariner logs from the c2 cluster in comment #12.

Comment 14 Shyamsundar 2023-10-03 13:43:03 UTC
*** Bug 2241131 has been marked as a duplicate of this bug. ***

Comment 15 Sridhar Gaddam 2023-10-05 15:02:23 UTC
Okay, I had a look at the logs and identified the root cause of the issue.

The ODF clusters are using multiple Gateway nodes, and the clusters have been running for more than 20 days. During this period, there were a couple of times when the active Gateway migrated from one node to another. From the logs, we noticed that during a GW migration, the submariner-globalnet controller on the passive GW node kept running for a few seconds before it was stopped. During this interval, the submariner-globalnet controller on the newly elected active Gateway node had already started running. So there was a brief window in which two instances of the submariner-globalnet controller were active, both trying to process the globalIPs for various objects such as Nodes, exported Services, etc. Because of this race condition, some of the globalIPs allocated to the exported services were updated to new ones. More specifically, the globalIP assigned to rook-ceph-osd-0 changed from 242.1.255.254 to 242.1.255.248.

Now, when an application tries to resolve the globalIP associated with the service rook-ceph-osd-0, the query is processed by the Submariner Lighthouse component.
Lighthouse was assuming that a globalIP, once assigned to an exported service, would not change for the lifetime of the service.
Because of this, Lighthouse was returning the old globalIP, which is no longer valid, and hence mirroring was not happening (the clients were trying to connect to the wrong global IP).

I pushed the following fix, which partially addresses this issue: https://github.com/submariner-io/submariner/pull/2739
In parallel, we are also working on a proper fix to avoid this race condition altogether.

PS:
Apart from the issue mentioned above, there is no problem with the connections between the Gateway nodes; you will notice that if you run "subctl verify ...", all the e2e tests pass.
The issue is seen mainly because there were existing exported services in the cluster, and a Gateway migration triggered a race condition that updated the globalIPs of those exported services.
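
For anyone re-checking the gateway/data-path side of this, subctl verify itself is mentioned above; the show/diagnose commands below are a lighter-weight starting point (a minimal sketch, with the kubeconfig path as a placeholder):

```
# Which node is the active gateway, and the tunnel/connection status:
subctl show gateways --kubeconfig /path/to/c2-kubeconfig
subctl show connections --kubeconfig /path/to/c2-kubeconfig

# Broader diagnostics (firewall, globalnet, versions, etc.):
subctl diagnose all --kubeconfig /path/to/c2-kubeconfig
```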

Comment 16 kmanohar 2023-10-10 13:12:24 UTC
Clearing the needinfo on me, as an ACM bug (https://issues.redhat.com/browse/ACM-7844) has been raised and a PR (https://github.com/submariner-io/submariner/pull/2743) is under review.

Comment 17 Mudit Agarwal 2023-10-31 13:39:07 UTC
The ACM bug is already ON_QA.

Comment 25 krishnaram Karthick 2023-12-06 04:44:55 UTC
Verification was done for the ACM bug. Moving this bug to VERIFIED.

Comment 26 Red Hat Bugzilla 2024-11-20 04:25:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

