Bug 2165105 - ODF DR issues on IBM Red Hat OpenShift Kubernetes Service
Summary: ODF DR issues on IBM Red Hat OpenShift Kubernetes Service
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.11
Hardware: Unspecified
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Yossi Boaron
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-01-27 17:25 UTC by Shaikh I Ali
Modified: 2023-08-09 17:00 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-04-12 12:13:56 UTC
Embargoed:
yboaron: needinfo-



Description Shaikh I Ali 2023-01-27 17:25:52 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

ODF DR Application Failover was unsuccessful on IBM VPC Cloud

Version of all relevant components (if applicable): 
OCP 4.12 + ODF 4.11 + ACM 2.6


Does this issue impact your ability to continue working with the product
(please explain in detail what the user impact is)?
Yes


Is there any workaround available to the best of your knowledge? 
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 1


Is this issue reproducible? Yes


Can this issue be reproduced from the UI? Yes


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On IBM VPC Cloud, deploy three OCP 4.12 clusters (1 hub + 2 managed clusters).
Follow the steps below to create an ODF DR setup:
https://red-hat-storage.github.io/ocs-training/training/ocs4/odf411-multisite-ramen.html

2. Create 2 managed clusters with non-overlapping networks on IBM VPC Cloud:

$ ibmcloud ks cluster create vpc-gen2 --flavor bx2.16x64 --name ocp412-p1 --subnet-id 0736-6cb4fbf2-50d3-4335-b0f0-dbe1ca6fa5b9 --vpc-id r134-02280287-d7fa-4a8e-a8cf-7102aa9de58b --zone us-south-3 --service-subnet 172.30.6.0/23 --pod-subnet 172.30.8.0/23 --workers 3 --version 4.12.0_openshift

$ ibmcloud ks cluster create vpc-gen2 --flavor bx2.16x64 --name ocp412-s1 --subnet-id 0736-6cb4fbf2-50d3-4335-b0f0-dbe1ca6fa5b9 --vpc-id r134-02280287-d7fa-4a8e-a8cf-7102aa9de58b --zone us-south-3 --service-subnet 172.30.2.0/23 --pod-subnet 172.30.4.0/23 --workers 3 --version 4.12.0_openshift

3. Import the above clusters into the hub cluster, as per the instructions.

4. Connect the managed clusters with the Submariner add-on.

  4a. Follow the IBM Cloud Submariner installation instructions to disable Calico cross-cluster CIDR 'natting'.
  https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.6/html-single/add-ons/index#preparing-ibm

https://submariner.io/operations/deployment/calico/

4b. Follow the community instructions to install the submariner-operator:

------------- my installation steps------------------------
curl -Ls https://get.submariner.io | bash
export PATH=$PATH:~/.local/bin
echo export PATH=\$PATH:~/.local/bin >> ~/.profile

2:-
subctl deploy-broker --kubeconfig <any_cluster_kubeconfig>

$ subctl join --context ocp412-p1/cebid1v20fj7niadniu0/admin broker-info.subm --clusterid ocp412-p1 --natt=false --insecure-skip-tls-verify=true --globalnet=false --check-broker-certificate=false

subctl uninstall --context ocp412-s1/cebide520uj0anrcnnb0/admin
$ subctl join --context ocp412-s1/cebide520uj0anrcnnb0/admin broker-info.subm --clusterid ocp412-s1 --natt=false --insecure-skip-tls-verify=true --globalnet=false --check-broker-certificate=false

$ subctl verify --context ocp412-p1/cebid1v20fj7niadniu0/admin --tocontext ocp412-s1/cebide520uj0anrcnnb0/admin --only connectivity

3. Verify the Submariner components are running fine on the managed clusters.
 ----On managed cluster1--
oc get pods -n submariner-operator
NAME                                             READY   STATUS    RESTARTS        AGE
submariner-gateway-6m9f4                         1/1     Running   12 (5h9m ago)   2d22h
submariner-lighthouse-agent-b45c7dd9b-pgnc6      1/1     Running   2               2d22h
submariner-lighthouse-coredns-5c485dcf74-7kbbl   1/1     Running   0               21h
submariner-lighthouse-coredns-5c485dcf74-k5p72   1/1     Running   2               2d22h
submariner-metrics-proxy-nw89c                   1/1     Running   2               2d22h
submariner-operator-5bd64bfcbc-sz68s             1/1     Running   0               21h
submariner-routeagent-5qppb                      1/1     Running   2               2d22h
submariner-routeagent-hszpm                      1/1     Running   2               2d22h
submariner-routeagent-rbkkq                      1/1     Running   2               2d22h

------On Managed Cluster2 ---------
$ oc get pods -n submariner-operator
NAME                                             READY   STATUS    RESTARTS        AGE
submariner-gateway-vgkgt                         1/1     Running   13 (5h7m ago)   2d22h
submariner-lighthouse-agent-784b676ff8-8t9wr     1/1     Running   2               2d22h
submariner-lighthouse-coredns-79b6c6b5f7-2mpt5   1/1     Running   2               2d22h
submariner-lighthouse-coredns-79b6c6b5f7-2q8fl   1/1     Running   2               2d22h
submariner-metrics-proxy-b7lm9                   1/1     Running   2               2d22h
submariner-operator-5bd64bfcbc-g4h6z             1/1     Running   2               2d22h
submariner-routeagent-9wmqj                      1/1     Running   2               2d22h
submariner-routeagent-ml9h5                      1/1     Running   3 (21h ago)     2d22h
submariner-routeagent-qgrfz                      1/1     Running   2               2d22h

4. Perform the other DR steps:
  4a. OpenShift Data Foundation installation --- (SUCCESS)
  4b. Install the ODF Multicluster Orchestrator operator on the hub cluster --- (SUCCESS)
  4c. Configure SSL access between S3 endpoints --- (SUCCESS)
  4d. Enable the Multicluster Web Console --- (SUCCESS)
  4e. Create the Data Policy on the hub cluster --- (SUCCESS)
  4f. Create a sample application for DR testing --- (SUCCESS)
  4g. Create the sample application using the ACM console --- (SUCCESS)
  4h. Validate the sample application deployment --- (SUCCESS)
  4i. Apply the DRPolicy to the sample application --- (SUCCESS)
  4j. Modify the DRPlacementControl to fail over (a hedged sketch follows this list) >>>>> (FAILED)
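
For reference, a hedged sketch of the failover trigger in step 4j: the failover is driven by patching the sample application's DRPlacementControl on the hub cluster. The DRPC name and namespace below are placeholders; the failoverCluster value matches the managed cluster names used in this setup.

```
# Hypothetical sketch: set the DR action to Failover on the app's DRPlacementControl (hub cluster).
# <drpc-name> and <app-namespace> are placeholders for the sample application's DRPC.
oc patch drplacementcontrol <drpc-name> -n <app-namespace> --type=merge \
  -p '{"spec":{"action":"Failover","failoverCluster":"ocp412-s1"}}'
```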


Actual results:
-----On 2nd Cluster After Failover------
$ oc get pods -n busybox-sample
NAME                      READY   STATUS              RESTARTS   AGE
busybox-67bf494b9-slgzs   0/1     ContainerCreating   0          4d5h

$ oc describe pod busybox-67bf494b9-slgzs -n busybox-sample

  Warning  FailedMount  8m21s (x631 over 21h)  kubelet  MountVolume.MountDevice failed for volume "pvc-ee9a1af7-6d6b-4d07-a22f-a51266f418fd" : rpc error: code = Internal desc = image not found: RBD image not found

Expected results:
The application pod should be running successfully on the failover cluster.

Additional info:

Comment 2 Shyamsundar 2023-01-27 18:20:17 UTC
@shaali Based on the error reported by MountVolume.MountDevice, it seems the image is missing on the peer cluster where the failover was attempted. This could be due to a few causes, such as replication not having been established to begin with (among others).

Could you upload the must-gather for the 3 clusters as follows:
- ODF must-gather from the 2 managed clusters: https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.11/html/troubleshooting_openshift_data_foundation/downloading-log-files-and-diagnostic-information_rhodf

- ACM must-gather from the hub cluster: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.6/html/troubleshooting/troubleshooting#running-the-must-gather-command-to-troubleshoot

(please use 4.12 images to gather as appropriate)
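
For reference, a hedged sketch of the must-gather commands; the image names/tags below are assumptions based on the linked documentation, so adjust them to the versions in your deployment:

```
# ODF must-gather, run against each managed cluster (image tag is an assumption; match your ODF version)
oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.11 --dest-dir=odf-must-gather

# ACM must-gather, run against the hub cluster (image tag is an assumption; match your ACM version)
oc adm must-gather --image=registry.redhat.io/rhacm2/acm-must-gather-rhel8:v2.6 --dest-dir=acm-must-gather
```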

Comment 3 Shaikh I Ali 2023-01-30 18:22:51 UTC
Hi @shyamsundar, I shared the must-gather logs at the link below, as the log size exceeds the attachment size limit. Let me know if that's accessible to you.
https://drive.google.com/file/d/1Y5Ry1gEFJPYD4JUtVJmiUImkbxELO9EZ/view?usp=sharing

Attached zip contents:
1. managed cluster 1 must-gather
2. managed cluster 2 must-gather
3. hub cluster must-gather

(thanks)

Comment 4 Shaikh I Ali 2023-01-31 18:28:09 UTC
As per the analysis/discussion with Shyamsundar/Annette Clewett, it appears the Submariner component is not healthy and there are connectivity issues across the clusters.


-----managed cluster 1----
$ subctl show connections --context ocp412-p1/cebid1v20fj7niadniu0/admin
 ✓ Showing Connections 
GATEWAY                          CLUSTER   REMOTE IP      NAT   CABLE DRIVER   SUBNETS                        STATUS   RTT avg.   
test-cebide520uj0anrcnnb0-ocp4   c2        10.240.129.5   no    libreswan      172.30.2.0/23, 172.30.4.0/23   error    0s

$ oc get pods -n submariner-operator
NAME                                             READY   STATUS    RESTARTS        AGE
submariner-gateway-6m9f4                         1/1     Running   20 (2d5h ago)   6d23h
submariner-lighthouse-agent-b45c7dd9b-pgnc6      1/1     Running   2               6d23h
submariner-lighthouse-coredns-5c485dcf74-7kbbl   1/1     Running   0               4d22h
submariner-lighthouse-coredns-5c485dcf74-k5p72   1/1     Running   2               6d23h
submariner-metrics-proxy-nw89c                   1/1     Running   2               6d23h
submariner-operator-5bd64bfcbc-sz68s             1/1     Running   0               4d22h
submariner-routeagent-5qppb                      1/1     Running   2               6d23h
submariner-routeagent-hszpm                      1/1     Running   2               6d23h
submariner-routeagent-rbkkq                      1/1     Running   2               6d23h


$ oc get ippool
NAME                  AGE
default-ipv4-ippool   50d
podwestcluster        5d23h
svcwestcluster        5d23h

-----managed Cluster 2-----
$ subctl show connections --context ocp412-s1/cebide520uj0anrcnnb0/admin
 ✓ Showing Connections 
GATEWAY                          CLUSTER   REMOTE IP       NAT   CABLE DRIVER   SUBNETS                        STATUS   RTT avg.   
test-cebid1v20fj7niadniu0-ocp4   c1        10.240.129.16   no    libreswan      172.30.6.0/23, 172.30.8.0/23   error    0s 


$ oc get pods -n submariner-operator
NAME                                             READY   STATUS    RESTARTS        AGE
submariner-gateway-vgkgt                         1/1     Running   20 (2d5h ago)   6d23h
submariner-lighthouse-agent-784b676ff8-8t9wr     1/1     Running   2               6d23h
submariner-lighthouse-coredns-79b6c6b5f7-2mpt5   1/1     Running   2               6d23h
submariner-lighthouse-coredns-79b6c6b5f7-2q8fl   1/1     Running   2               6d23h
submariner-metrics-proxy-b7lm9                   1/1     Running   2               6d23h
submariner-operator-5bd64bfcbc-g4h6z             1/1     Running   2               6d23h
submariner-routeagent-9wmqj                      1/1     Running   2               6d23h
submariner-routeagent-ml9h5                      1/1     Running   3 (4d22h ago)   6d23h
submariner-routeagent-qgrfz                      1/1     Running   2               6d23h

$ oc get ippool
NAME                  AGE
default-ipv4-ippool   50d
podeastcluster        6d23h
svceasctluster        6d23h

The steps below were executed as per the Submariner configuration information:

https://red-hat-storage.github.io/ocs-training/training/ocs4/odf411-multisite-ramen.html#_connect_the_managed_clusters_using_submariner_add_ons

https://gist.github.com/prsurve/82fea750d7a3e7ef59adf183ebefca30

https://submariner.io/getting-started/quickstart/openshift/aws/#install-submariner-with-service-discovery

https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.5/html-single/add-ons/index#preparing-ibm

https://submariner.io/operations/deployment/calico/

Hence, please route this issue to the respective Network/Submariner team for further analysis.

Comment 9 Nir Yechiel 2023-02-02 08:10:53 UTC
Hi Shaikh,

We are tracking Submariner issues in JIRA now. I can migrate this BZ over to JIRA.

One question I have (and sorry, I am not familiar enough with IBM Cloud): is this about IBM Cloud Kubernetes Service (IKS) or the Red Hat OpenShift Kubernetes Service (ROKS)? My understanding is that they are different. We could use some help understanding the details of the target environment (OpenShift/k8s version? CNI, which I understand is Calico? Are there any docs available?), and getting access to your setup.

Is there a way we can discuss more easily (not via email/BZ)? Any chance you have access to Red Hat's Slack or are you part of the kubernetes Slack instance?


Thanks,
Nir

Comment 10 Shaikh I Ali 2023-02-03 06:07:20 UTC
Hi Nir, Thanks for tracking the issue. 
The environment is ROKS (Red Hat OpenShift Kubernetes Service) on IBM VPC Cloud infrastructure.

I am reachable on gchat/slack (partner-email: shaali, ibm: sikhlaqu.com).
I can share the environment with you. It would be good if you can assign this issue to someone with whom I can discuss directly.

thanks
Shaikh Ali

Comment 11 Sridhar Gaddam 2023-02-03 19:01:38 UTC
The setup is running with upstream Submariner version 0.14.1, and the following are the initial observations:
1. OpenShift is installed with Calico CNI and the necessary IPPools are configured in both the clusters.
2. The IPsec connections seem to be established, but the health check was failing, which is why the connection status is marked as an error.

Now, when I had a look at why the health-check failed, I noticed the following errors in the route-agent pod running on the Gateway nodes of both clusters.
```
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "127.0.0.1" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.1" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.2" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.3" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.4" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.5" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.6" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.7" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.9" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.10" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.11" address
..ver/cni/cni_iface.go:74 CNI                  Interface "eth0" has "10.240.129.5" address
..ver/cni/cni_iface.go:74 CNI                  Interface "vethlocal" has "127.0.0.10" address
..roxy/kp_iptables.go:123 KubeProxy            Error discovering the CNI interface error="unable to find CNI Interface on the host which has IP from \"172.30.4.0/23\""
..iptables/iptables.go:32 IPTables             Install/ensure SUBMARINER-POSTROUTING chain exists
```

As you can see above, when the route-agent pod initially came up, there was no CNI interface on the node.
So, I restarted the pod - this time I could see that the CNI interface is discovered and the appropriate rules are programmed on the node.

```
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "127.0.0.1" address 
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.1" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.2" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.3" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.4" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.5" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.6" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.7" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.9" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.10" address
..ver/cni/cni_iface.go:74 CNI                  Interface "lo" has "172.20.0.11" address
..ver/cni/cni_iface.go:74 CNI                  Interface "eth0" has "10.240.129.5" address
..ver/cni/cni_iface.go:74 CNI                  Interface "vethlocal" has "127.0.0.10" address
..ver/cni/cni_iface.go:74 CNI                  Interface "vx-submariner" has "240.240.129.5" address
..ver/cni/cni_iface.go:74 CNI                  Interface "tunl0" has "172.30.4.128" address
..ver/cni/cni_iface.go:79 CNI                  Found CNI Interface "tunl0" that has IP "172.30.4.128" from ClusterCIDR "172.30.4.0/23"
..er/cni/cni_iface.go:132 CNI                  Successfully annotated node "10.240.129.5" with cniIfaceIP "172.30.4.128"
```

It's odd that the "tunl0" interface (i.e., the CNI interface) was missing when the route-agent pod came up for the first time.

Anyway, after restarting the route-agent pods on both clusters (see the sketch below), the CNI interface was detected and the necessary rules were programmed.
However, the health check was still failing.
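
For reference, restarting the route-agent pods simply means deleting them so that the DaemonSet recreates them; a minimal sketch, assuming the default `app=submariner-routeagent` label:

```
# Run on each managed cluster; the route-agent DaemonSet recreates the deleted pods.
# The label selector is an assumption based on the default Submariner deployment.
oc delete pods -n submariner-operator -l app=submariner-routeagent
```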

On debugging the datapath, I'm seeing that traffic is sent from the Gateway node of one cluster but it does not seem to reach the other cluster.
We need to understand why the packets are not reaching the remote cluster gateway node.

Note: The IPsec tunnels seem to be fine, and UDP ports 4500 and 4490 are open.

Comment 12 Sridhar Gaddam 2023-02-03 20:09:17 UTC
On debugging further, the issue is that ESP packets are getting dropped between the Gateway nodes.
Basically, the clusters are directly reachable without any NAT in between, so IPsec was using plain ESP for tunnelling the traffic.

Since I do not have access to the underlay to enable ESP traffic between the Gateway nodes, I forced udp-encapsulation in the IPsec configuration.
With this, I can now see that connections are successfully established and health-check is passing fine.

```
 ✓ Showing Connections 
GATEWAY                          CLUSTER   REMOTE IP       NAT   CABLE DRIVER   SUBNETS                        STATUS      RTT avg.     
test-cebid1v20fj7niadniu0-ocp4   c1        10.240.129.16   no    libreswan      172.30.6.0/23, 172.30.8.0/23   connected   646.937µs    

 ✓ Showing Endpoints 
CLUSTER   ENDPOINT IP     PUBLIC IP        CABLE DRIVER   TYPE     
c2        10.240.129.5    150.238.65.210   libreswan      local    
c1        10.240.129.16   150.238.65.210   libreswan      remote   

 ✓ Showing Gateways 
NODE                             HA STATUS   SUMMARY                               
test-cebide520uj0anrcnnb0-ocp4   active      All connections (1) are established  
```

Latest status:
Gateway to Gateway connection is now working fine between the clusters. 
However, there is some issue with non-Gateway to non-Gateway communication and this needs to be debugged.

Comment 13 Shaikh I Ali 2023-02-04 12:59:03 UTC
Hi Sridhar, thanks for the fixes.


1. But I still see that the Submariner verification tests are failing.

$ KUBECONFIG=/Users/shaikh/.bluemix/plugins/container-service/clusters/ocp412-p1-cebid1v20fj7niadniu0-admin/kube-config.yaml:/Users/shaikh/.bluemix/plugins/container-service/clusters/ocp412-s1-cebide520uj0anrcnnb0-admin/kube-config.yaml subctl verify --context ocp412-p1/cebid1v20fj7niadniu0/admin --tocontext ocp412-s1/cebide520uj0anrcnnb0/admin --only connectivity

Performing the following verifications: connectivity
Running Suite: Submariner E2E suite
===================================
Random Seed: 1675510912
Will run 24 of 44 specs

...

...

Summarizing 9 Failures:

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote pod when the pod is not on a gateway and the remote pod is not on a gateway [It] should have sent the expected data from the pod to the other pod 
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote pod when the pod is not on a gateway and the remote pod is on a gateway [It] should have sent the expected data from the pod to the other pod 
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote pod when the pod is on a gateway and the remote pod is not on a gateway [It] should have sent the expected data from the pod to the other pod 
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote pod when the pod is on a gateway and the remote pod is on a gateway [It] should have sent the expected data from the pod to the other pod 
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:187

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote service when the pod is not on a gateway and the remote service is not on a gateway [It] should have sent the expected data from the pod to the other pod 
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote service when the pod is not on a gateway and the remote service is on a gateway [It] should have sent the expected data from the pod to the other pod 
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote service when the pod is on a gateway and the remote service is not on a gateway [It] should have sent the expected data from the pod to the other pod 
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod with HostNetworking connects via TCP to a remote pod when the pod is not on a gateway and the remote pod is not on a gateway [It] should have sent the expected data from the pod to the other pod 
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod with HostNetworking connects via TCP to a remote pod when the pod is on a gateway and the remote pod is not on a gateway [It] should have sent the expected data from the pod to the other pod 
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

Ran 12 of 31 Specs in 1507.923 seconds
FAIL! -- 3 Passed | 9 Failed | 0 Pending | 19 Skipped
-----------------------------------------

2. Can you also let us know what commands/steps were executed to bring Submariner back to a healthy state?

    2a. Restart the route-agent pods on both clusters? For that, do we just delete the route-agent pods?

    2b. 'Since I do not have access to the underlay to enable ESP traffic between the Gateway nodes, I forced udp-encapsulation in the IPsec configuration.' What access do you need for this? This is a non-prod cluster, so it's okay to make any changes to it.

    2c. What commands were executed to force udp-encapsulation?

4. Is it okay to try Disaster Recovery (DR) on this environment now, or are more fixes still needed? As you said:

```
Gateway to Gateway connection is now working fine between the clusters. 
However, there is some issue with non-Gateway to non-Gateway communication and this needs to be debugged.
```

5. My old failover pod is still in an error state. Maybe I need to redo the application deploy and failover and check again.
```
Warning  FailedMount  2m23s (x4308 over 8d)   kubelet  MountVolume.MountDevice failed for volume "pvc-ee9a1af7-6d6b-4d07-a22f-a51266f418fd" : rpc error: code = Internal desc = image not found: RBD image not found
```
6. I also have another test environment (3-cluster setup) that is stuck in the same state. I want to execute the same steps there and see if Submariner becomes healthy,
so please list all the steps/commands/verification steps to fix this issue.

Thanks
Shaikh Ali

Comment 14 Yossi Boaron 2023-02-05 16:56:57 UTC
A. On managed cluster s1, node 10.240.129.6 (non-gateway), there are IP routes to the remote (p1) cluster's CIDRs that were added by the bird protocol [1]. Any idea how these routes were added? Routes to remote cluster CIDRs should be handled by the Submariner route agent.


[1]
172.30.6.0/23 proto bird 
        nexthop via 10.240.129.5 dev eth0 weight 1 
        nexthop via 240.240.129.5 dev vx-submariner weight 1 
172.30.8.0/23 proto bird 
        nexthop via 10.240.129.5 dev eth0 weight 1 
        nexthop via 240.240.129.5 dev vx-submariner weight 1 

@shaali

Comment 15 Nir Yechiel 2023-02-05 17:01:59 UTC
FWIW, bird is the routing component used by Calico: https://projectcalico.docs.tigera.io/reference/architecture/overview

Comment 16 Yossi Boaron 2023-02-06 09:48:55 UTC
I can see that after setting 'disableBGPExport' to true in the pod and svc IPPools created for Submariner, these routes are deleted. A sketch of that change is below.
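
A minimal sketch of that change, using the pool names from the earlier `oc get ippool` output (shown for cluster 1; repeat with the corresponding pools on cluster 2; `disableBGPExport` assumes a Calico version that supports the field, v3.21+):

```
# Stop Calico/bird from exporting routes for the Submariner-managed IPPools.
kubectl patch ippool podwestcluster --type=merge -p '{"spec":{"disableBGPExport":true}}'
kubectl patch ippool svcwestcluster --type=merge -p '{"spec":{"disableBGPExport":true}}'
```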

Comment 17 Shaikh I Ali 2023-02-06 11:42:06 UTC
@Yossi /Nir
Thanks for the debugging updates.
In case you have more IBM IKS network-side queries or blockers, let us know if we need to schedule a discussion with our IBM IKS network team.

Comment 18 Sridhar Gaddam 2023-02-06 16:30:21 UTC
@Yossi and I had a look at the setup and we managed to get it working.

Basically, the current Calico CNI was configured to use IPIP tunnels with BGP and this combination seems to be creating some issues for the inter-cluster traffic handled by Submariner.
So, we changed the Calico default IPPool to use VxLAN tunnels.
After this change we could see that both Gateway to Gateway as well as non-Gateway to non-Gateway Submariner e2e tests are now passing fine.

Summary of the changes done so far:

1. The firewall configuration on the Gateway nodes was not allowing ESP traffic, so we forced udp-encapsulation in the IPSec configuration after which the tunnels were successfully established.

2. Even though the Calico IPPools were created as per the Submariner requirements [https://submariner.io/operations/deployment/calico/], there was an issue with the configuration.
  The following flags were not enabled, so we manually updated the IPPools with these changes on both clusters (a sketch follows this list):
  natOutgoing: false
  disabled: true

3. We changed the IPPool configuration in default-ipv4-ippool to use `vxlanMode: Always` and `ipipMode: Never` on both clusters and restarted the calico-node pods in the calico-system namespace.
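
For reference, a minimal sketch of the item 2 fix, using the pool names from the earlier `oc get ippool` output (shown for cluster 1; repeat with the corresponding pools on cluster 2):

```
# Item 2: the remote-CIDR IPPools created for Submariner must not be NATed on egress
# and must not be used for local IPAM.
kubectl patch ippool podwestcluster --type=merge -p '{"spec":{"natOutgoing":false,"disabled":true}}'
kubectl patch ippool svcwestcluster --type=merge -p '{"spec":{"natOutgoing":false,"disabled":true}}'
```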

Comment 19 Shaikh I Ali 2023-02-06 19:54:18 UTC
Hi, I tried DR after the above Submariner fixes and see some issues during the failover.
I see the error below in the DRPlacementControl details; it appears to be related to certificates.


```
-----------  busybox3-placement-1-drpc --------
----error----
Failed to restore PVs (failed to restore PVs for VolRep (failed to
          restorePVs using profile list ([s3profile-ocp412-p1-ocs-storagecluster
          s3profile-ocp412-s1-ocs-storagecluster]): unable to ListKeys of type
          v1.PersistentVolume keyPrefix
          busybox-sample3/busybox3-placement-1-drpc/v1.PersistentVolume/, failed
          to list objects in bucket
          odrbucket-fdfbb7912e2d:busybox-sample3/busybox3-placement-1-drpc/v1.PersistentVolume/,
          RequestError: send request failed

          caused by: Get
          "https://s3-openshift-storage.ocp412-s1-f8480cb6e62d97529990eca5f3f95767-0000.us-south.stg.containers.appdomain.cloud/odrbucket-fdfbb7912e2d?list-type=2&prefix=busybox-sample3%2Fbusybox3-placement-1-drpc%2Fv1.PersistentVolume%2F":
          x509: certificate signed by unknown authority))
--------end------------------

-------DR placement control under NS busybox-sample3 (Hub cluster)-----------
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  resourceVersion: '127586588'
  name: busybox3-placement-1-drpc
  uid: 913f7f1e-7bf5-435a-8b30-cb312b5a3a36
  creationTimestamp: '2023-02-06T17:28:18Z'
  generation: 3
  ..
  ...
      manager: manager
      operation: Update
      subresource: status
      time: '2023-02-06T19:28:04Z'
  namespace: busybox-sample3
  finalizers:
    - drpc.ramendr.openshift.io/finalizer
  labels:
    app: busybox3
    cluster.open-cluster-management.io/backup: resource
spec:
  action: Failover
  drPolicyRef:
    name: ocp412-p1-s1-5m
  failoverCluster: ocp412-s1
  placementRef:
    kind: PlacementRule
    name: busybox3-placement-1
    namespace: busybox-sample3
  preferredCluster: ocp412-p1
  pvcSelector:
    matchLabels:
      app: busybox3
status:
  actionStartTime: '2023-02-06T17:50:23Z'
  conditions:
    - lastTransitionTime: '2023-02-06T17:50:23Z'
      message: Waiting for PV restore to complete...)
      observedGeneration: 3
      reason: FailingOver
      status: 'False'
      type: Available
    - lastTransitionTime: '2023-02-06T17:50:23Z'
      message: Started failover to cluster "ocp412-s1"
      observedGeneration: 3
      reason: NotStarted
      status: 'False'
      type: PeerReady
  lastUpdateTime: '2023-02-06T19:28:04Z'
  phase: FailingOver
  preferredDecision:
    clusterName: ocp412-p1
    clusterNamespace: ocp412-p1
  progression: WaitingForPVRestore
  resourceConditions:
    conditions:
      - lastTransitionTime: '2023-02-06T17:28:20Z'
        message: >-
          Failed to restore PVs (failed to restore PVs for VolRep (failed to
          restorePVs using profile list ([s3profile-ocp412-p1-ocs-storagecluster
          s3profile-ocp412-s1-ocs-storagecluster]): unable to ListKeys of type
          v1.PersistentVolume keyPrefix
          busybox-sample3/busybox3-placement-1-drpc/v1.PersistentVolume/, failed
          to list objects in bucket
          odrbucket-fdfbb7912e2d:busybox-sample3/busybox3-placement-1-drpc/v1.PersistentVolume/,
          RequestError: send request failed

          caused by: Get
          "https://s3-openshift-storage.ocp412-s1-f8480cb6e62d97529990eca5f3f95767-0000.us-south.stg.containers.appdomain.cloud/odrbucket-fdfbb7912e2d?list-type=2&prefix=busybox-sample3%2Fbusybox3-placement-1-drpc%2Fv1.PersistentVolume%2F":
          x509: certificate signed by unknown authority))
        observedGeneration: 1
        reason: Error
        status: 'False'
        type: ClusterDataReady
      - lastTransitionTime: '2023-02-06T17:28:20Z'
        message: Initializing VolumeReplicationGroup
        observedGeneration: 1
        reason: Initializing
        status: Unknown
        type: DataReady
      - lastTransitionTime: '2023-02-06T17:28:20Z'
        message: Initializing VolumeReplicationGroup
        observedGeneration: 1
        reason: Initializing
        status: Unknown
        type: DataProtected
      - lastTransitionTime: '2023-02-06T17:28:20Z'
        message: Initializing VolumeReplicationGroup
        observedGeneration: 1
        reason: Initializing
        status: Unknown
        type: ClusterDataProtected
    resourceMeta:
      generation: 1
      kind: VolumeReplicationGroup
      name: busybox3-placement-1-drpc
      namespace: busybox-sample3

```

Comment 20 Annette Clewett 2023-02-06 23:58:31 UTC
@shaali I fixed the cert issue by doing these steps:

1) Created the ConfigMap again using this process (a sketch follows this list): https://red-hat-storage.github.io/ocs-training/training/ocs4/odf411-multisite-ramen.html#_configure_ssl_access_between_s3_endpoints

2) Because OCP 4.12 has no VolSync operator until ACM 2.7 is released, I applied this workaround on both managed clusters to populate the VolSync CRDs with the old version 0.5.0 (Ramen must find the CRDs):

$ oc apply -f https://raw.githubusercontent.com/backube/volsync/v0.5.0/config/crd/bases/volsync.backube_replicationdestinations.yaml
$ oc apply -f https://raw.githubusercontent.com/backube/volsync/v0.5.0/config/crd/bases/volsync.backube_replicationsources.yaml

3) Restarted the Ramen pods on the managed clusters (openshift-dr-system namespace) and the hub cluster (openshift-operators namespace).
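
For reference, a hedged sketch of step 1 following the linked doc's pattern; the bundle file name is a placeholder for the concatenated ingress CA certificates of both managed clusters:

```
# Recreate the trusted CA bundle ConfigMap and point the cluster-wide proxy at it.
# Run on the hub and on both managed clusters; cert-bundle.pem is a placeholder file.
oc create configmap user-ca-bundle -n openshift-config \
  --from-file=ca-bundle.crt=cert-bundle.pem --dry-run=client -o yaml | oc apply -f -
oc patch proxy cluster --type=merge -p '{"spec":{"trustedCA":{"name":"user-ca-bundle"}}}'
```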

The problem now is that the Ramen pod on ocp412-s1 (https://c100-e.containers.test.cloud.ibm.com:30912) has an ErrImagePull, so failover will not succeed.

$ oc describe pod ramen-dr-cluster-operator-75769b99bd-t56r5 -n openshift-dr-system
[...]
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       13m                   default-scheduler  Successfully assigned openshift-dr-system/ramen-dr-cluster-operator-75769b99bd-t56r5 to 10.240.129.5
  Normal   AddedInterface  13m                   multus             Add eth0 [172.30.4.143/32] from k8s-pod-network
  Normal   Pulled          13m                   kubelet            Container image "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:e3dad360d0351237a16593ca0862652809c41a2127c2f98b9e0a559568efbd10" already present on machine
  Normal   Created         13m                   kubelet            Created container kube-rbac-proxy
  Normal   Started         13m                   kubelet            Started container kube-rbac-proxy
  Warning  Failed          12m                   kubelet            Failed to pull image "registry.redhat.io/odf4/odr-rhel8-operator@sha256:5844f3cea10d926d204bbb37f592a82ed3f513495dbf909ff7755b43f10925a8": rpc error: code = Unknown desc = Get "https://registry.redhat.io/auth/realms/rhcc/protocol/redhat-docker-v2/auth?account=%7Cuhc-pool-c4d98868-1c71-4b5b-aba8-701650c9911c&scope=repository%3Aodf4%2Fodr-rhel8-operator%3Apull&service=docker-registry": dial tcp 23.220.124.237:443: connect: connection timed out
  Warning  Failed          11m (x2 over 13m)     kubelet            Failed to pull image "registry.redhat.io/odf4/odr-rhel8-operator@sha256:5844f3cea10d926d204bbb37f592a82ed3f513495dbf909ff7755b43f10925a8": rpc error: code = Unknown desc = initializing source docker://registry.redhat.io/odf4/odr-rhel8-operator@sha256:5844f3cea10d926d204bbb37f592a82ed3f513495dbf909ff7755b43f10925a8: can't talk to a V1 container registry
  Warning  Failed          11m (x3 over 13m)     kubelet            Error: ErrImagePull
  Warning  Failed          11m (x6 over 13m)     kubelet            Error: ImagePullBackOff
  Normal   Pulling         11m (x4 over 13m)     kubelet            Pulling image "registry.redhat.io/odf4/odr-rhel8-operator@sha256:5844f3cea10d926d204bbb37f592a82ed3f513495dbf909ff7755b43f10925a8"
  Normal   BackOff         3m20s (x35 over 13m)  kubelet            Back-off pulling image "registry.redhat.io/odf4/odr-rhel8-operator@sha256:5844f3cea10d926d204bbb37f592a82ed3f513495dbf909ff7755b43f10925a8"

Comment 21 Shaikh I Ali 2023-02-07 17:32:48 UTC
Hi 
@annette.clewett

1. Thanks for fixing that cert issue. What was the issue? I had already executed the steps: created the cert configurations in all 3 clusters and updated the proxy in all 3 clusters. Did you additionally restart the kube-proxy pods? Last time I did not hit this issue when I created the DR policy after doing the SSL steps, but somehow the SSL cert got updated. So, does the sequence matter ('SSL step <--> DR policy creation')?


2. I just moved the Ramen pod to another node, and it came up fine.
It could be some node issue; I cordoned the other node.

3. So I see the failover is successful now, and the app is up on the secondary cluster.
I also tried the relocate; it worked as expected and the app is back on the primary cluster.
I will do some more tests and let you know in case of issues.

4. A small issue I see is that the data written to the PVC has different ownership on the two clusters. Not sure if that will cause any issue for applications.

-----After Failover onto s1 cluster (Project: busybox-sample3 )---
$ ls -l /mnt/test
-rw-rw-rw-  1 10009600 10024400     0 Feb 6 17:26 fromprimary-06Feb2023
drwxrws---  2 root   10024400   16384 Feb 6 17:23 lost+found

-----After Relocate onto p1 cluster---
~ $ ls -l /mnt/test
total 1832
-rw-rw-rw-  1 10009600 10009600     0 Feb 6 17:26 fromprimary-06Feb2023
-rw-rw-rw-  1 10024400 10009600     0 Feb 7 12:43 fromsecondary-07Feb2023
drwxrws---  2 root   10009600   16384 Feb 6 17:23 lost+found

Comment 22 Shaikh I Ali 2023-02-07 18:54:57 UTC
@Sridhar I am setting up another environment to test DR end to end, and I am also documenting the entire procedure.

Can you provide the exact steps that were executed on the current cluster to fix Submariner?

--------------------Last updates on Submariner-------------
Summary of the changes done so far:

1. The firewall configuration on the Gateway nodes was not allowing ESP traffic, so we forced udp-encapsulation in the IPSec configuration after which the tunnels were successfully established.

***Can you provide the exact command for this, and the post-verification steps (if any)?

2. Even though the Calico IPPools were created as per the Submariner requirement [https://submariner.io/operations/deployment/calico/] there was an issue with the configuration. 
  The following flags were not enabled, so we manually updated the IPPools with this change on both the clusters.
  natOutgoing: false
  disabled: true

**** I think the steps in this link are enough? https://submariner.io/operations/deployment/calico/
 

3. We changed the IPPool configuration in default-ipv4-ippool to use `vxlanMode: Always` and `ipipMode: Never` on both the clusters and restarted the calico-node pods in calico-system namespace..

**** Update 'default-ipv4-ippool' using calicoctl, or kubectl edit?

Comment 23 Sridhar Gaddam 2023-02-08 07:35:17 UTC
(In reply to Shaikh I Ali from comment #22)
> 1. The firewall configuration on the Gateway nodes was not allowing ESP
> traffic, so we forced udp-encapsulation in the IPSec configuration after
> which the tunnels were successfully established.
> 
> ***Can you provide exact command for this ??  and the post verification
> steps (if any) ?

This can be fixed in two ways:
1. By opening the firewall ports on the Gateway nodes of the cluster to allow ESP traffic. This is the optimal solution, as we avoid the additional UDP overhead in the IPsec tunnels.
2. If the above option is not feasible (as it requires access to the underlay/infra), then while joining the cluster to the Broker using the "subctl join ..." command you can pass the flag "--force-udp-encaps" (a sketch follows below).

Post verification: after executing option 1 or 2 on both clusters, the IPsec tunnels will be successfully established, which can be seen using the command "subctl show connections".
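
For example, reusing the join command from the original setup, the sketch below simply adds the flag (shown for cluster s1; repeat with the p1 context and cluster ID):

```
# Re-join with UDP encapsulation forced (only needed if ESP cannot be allowed on the underlay).
subctl join --context ocp412-s1/cebide520uj0anrcnnb0/admin broker-info.subm \
  --clusterid ocp412-s1 --natt=false --force-udp-encaps \
  --insecure-skip-tls-verify=true --globalnet=false --check-broker-certificate=false

# Post-verification: the connection should report "connected".
subctl show connections --context ocp412-s1/cebide520uj0anrcnnb0/admin
```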


> 2. Even though the Calico IPPools were created as per the Submariner
> requirement [https://submariner.io/operations/deployment/calico/] there was
> an issue with the configuration. 
>   The following flags were not enabled, so we manually updated the IPPools
> with this change on both the clusters.
>   natOutgoing: false
>   disabled: true
> 
> **** I think these steps are enough as per the link ?
> https://submariner.io/operations/deployment/calico/

Yes, the document is up to date. To make it easier to detect the Calico config, I've pushed the following PR in Submariner to verify that the config meets Submariner requirements as part of the "subctl diagnose cni" command.
https://github.com/submariner-io/subctl/pull/550


> 3. We changed the IPPool configuration in default-ipv4-ippool to use
> `vxlanMode: Always` and `ipipMode: Never` on both the clusters and restarted
> the calico-node pods in calico-system namespace..
> 
> **** update 'default-ipv4-ippool' using calicoctl ??  or kubectl edit ?

You can use the kubectl CLI; a sketch is below.
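
For example, a sketch of the item 3 change using kubectl (run on both clusters; the calico-node DaemonSet name and namespace are assumed from the default Calico install described above):

```
# Switch the default Calico IPPool from IPIP to VXLAN, then restart the calico-node pods.
kubectl patch ippool default-ipv4-ippool --type=merge \
  -p '{"spec":{"vxlanMode":"Always","ipipMode":"Never"}}'
kubectl rollout restart daemonset/calico-node -n calico-system
```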

