Description of problem (please be as detailed as possible and provide log snippets):
ODF DR Application Failover was unsuccessful on IBM VPC Cloud.

Version of all relevant components (if applicable):
OCP 4.12 + ODF 4.11 + ACM 2.6

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. On IBM VPC Cloud, deploy three OCP 4.12 clusters (1 hub + 2 managed clusters) and follow the steps below to create an ODF DR setup:
   https://red-hat-storage.github.io/ocs-training/training/ocs4/odf411-multisite-ramen.html

2. Create the 2 managed clusters with non-overlapping networks on IBM VPC Cloud:
$ ibmcloud ks cluster create vpc-gen2 --flavor bx2.16x64 --name ocp412-p1 --subnet-id 0736-6cb4fbf2-50d3-4335-b0f0-dbe1ca6fa5b9 --vpc-id r134-02280287-d7fa-4a8e-a8cf-7102aa9de58b --zone us-south-3 --service-subnet 172.30.6.0/23 --pod-subnet 172.30.8.0/23 --workers 3 --version 4.12.0_openshift
$ ibmcloud ks cluster create vpc-gen2 --flavor bx2.16x64 --name ocp412-s1 --subnet-id 0736-6cb4fbf2-50d3-4335-b0f0-dbe1ca6fa5b9 --vpc-id r134-02280287-d7fa-4a8e-a8cf-7102aa9de58b --zone us-south-3 --service-subnet 172.30.2.0/23 --pod-subnet 172.30.4.0/23 --workers 3 --version 4.12.0_openshift

3. Import the above clusters onto the hub cluster, as per the instructions.

4. Connect the managed clusters with the Submariner add-on.
   4a. Follow the IBM Cloud Submariner installation instructions to disable Calico cross-cluster CIDR 'natting':
       https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.6/html-single/add-ons/index#preparing-ibm
       https://submariner.io/operations/deployment/calico/
   4b. Follow the community instructions to install the submariner-operator.

------------- my installation steps ------------------------
$ curl -Ls https://get.submariner.io | bash
$ export PATH=$PATH:~/.local/bin
$ echo export PATH=\$PATH:~/.local/bin >> ~/.profile
$ subctl deploy-broker --kubeconfig <any_cluster_kubeconfig>
$ subctl join --context ocp412-p1/cebid1v20fj7niadniu0/admin broker-info.subm --clusterid ocp412-p1 --natt=false --insecure-skip-tls-verify=true --globalnet=false --check-broker-certificate=false
$ subctl uninstall --context ocp412-s1/cebide520uj0anrcnnb0/admin
$ subctl join --context ocp412-s1/cebide520uj0anrcnnb0/admin broker-info.subm --clusterid ocp412-s1 --natt=false --insecure-skip-tls-verify=true --globalnet=false --check-broker-certificate=false
$ subctl verify --context ocp412-p1/cebid1v20fj7niadniu0/admin --tocontext ocp412-s1/cebide520uj0anrcnnb0/admin --only connectivity

5. Verify the Submariner components are running fine on the managed clusters:
----On managed cluster 1----
$ oc get pods -n submariner-operator
NAME                                             READY   STATUS    RESTARTS        AGE
submariner-gateway-6m9f4                         1/1     Running   12 (5h9m ago)   2d22h
submariner-lighthouse-agent-b45c7dd9b-pgnc6      1/1     Running   2               2d22h
submariner-lighthouse-coredns-5c485dcf74-7kbbl   1/1     Running   0               21h
submariner-lighthouse-coredns-5c485dcf74-k5p72   1/1     Running   2               2d22h
submariner-metrics-proxy-nw89c                   1/1     Running   2               2d22h
submariner-operator-5bd64bfcbc-sz68s             1/1     Running   0               21h
submariner-routeagent-5qppb                      1/1     Running   2               2d22h
submariner-routeagent-hszpm                      1/1     Running   2               2d22h
submariner-routeagent-rbkkq                      1/1     Running   2               2d22h

----On managed cluster 2----
$ oc get pods -n submariner-operator
NAME                                             READY   STATUS    RESTARTS        AGE
submariner-gateway-vgkgt                         1/1     Running   13 (5h7m ago)   2d22h
submariner-lighthouse-agent-784b676ff8-8t9wr     1/1     Running   2               2d22h
submariner-lighthouse-coredns-79b6c6b5f7-2mpt5   1/1     Running   2               2d22h
submariner-lighthouse-coredns-79b6c6b5f7-2q8fl   1/1     Running   2               2d22h
submariner-metrics-proxy-b7lm9                   1/1     Running   2               2d22h
submariner-operator-5bd64bfcbc-g4h6z             1/1     Running   2               2d22h
submariner-routeagent-9wmqj                      1/1     Running   2               2d22h
submariner-routeagent-ml9h5                      1/1     Running   3 (21h ago)     2d22h
submariner-routeagent-qgrfz                      1/1     Running   2               2d22h

6. Do the remaining DR steps:
   6a. OpenShift Data Foundation installation --- (SUCCESS)
   6b. Install the ODF Multicluster Orchestrator operator on the hub cluster --- (SUCCESS)
   6c. Configure SSL access between S3 endpoints --- (SUCCESS)
   6d. Enable the multicluster web console --- (SUCCESS)
   6e. Create the Data Policy on the hub cluster --- (SUCCESS)
   6f. Create a sample application for DR testing --- (SUCCESS)
   6g. Create the sample application using the ACM console --- (SUCCESS)
   6h. Validate the sample application deployment --- (SUCCESS)
   6i. Apply the DRPolicy to the sample application --- (SUCCESS)
   6j. Modify DRPlacementControl to failover >>>>> (FAILED)

Actual results:
-----On the 2nd cluster after failover------
$ oc get pods -n busybox-sample
NAME                      READY   STATUS              RESTARTS   AGE
busybox-67bf494b9-slgzs   0/1     ContainerCreating   0          4d5h

$ oc describe pod busybox-67bf494b9-slgzs -n busybox-sample
Warning  FailedMount  8m21s (x631 over 21h)  kubelet  MountVolume.MountDevice failed for volume "pvc-ee9a1af7-6d6b-4d07-a22f-a51266f418fd" : rpc error: code = Internal desc = image not found: RBD image not found

Expected results:
The application pod should be running successfully on the failover cluster.

Additional info:
@shaali Based on the error reported by MountVolume.MountDevice, it seems the RBD image is missing on the peer cluster where the failover was attempted. This could be due to a few causes, for example replication not being established to begin with (among others).

Could you upload the must-gather for the 3 clusters as follows:
- ODF must-gather from the 2 managed clusters: https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.11/html/troubleshooting_openshift_data_foundation/downloading-log-files-and-diagnostic-information_rhodf
- ACM must-gather from the hub cluster: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.6/html/troubleshooting/troubleshooting#running-the-must-gather-command-to-troubleshoot

(Please use 4.12 images to gather, as appropriate.)
Hi @shyamsundar, I shared the must-gather logs at the link below, as the log size exceeds the attachment size limit. Let me know if it is accessible to you.
https://drive.google.com/file/d/1Y5Ry1gEFJPYD4JUtVJmiUImkbxELO9EZ/view?usp=sharing

The attached zip contains:
1. managed cluster 1 must-gather
2. managed cluster 2 must-gather
3. hub cluster must-gather

(thanks)
As per the analysis/discussion with Shyamsundar/Annette Clewett, it appears the 'submariner' component is not healthy and there are connectivity issues across the clusters.

-----managed cluster 1-----
$ subctl show connections --context ocp412-p1/cebid1v20fj7niadniu0/admin
 ✓ Showing Connections
GATEWAY                          CLUSTER   REMOTE IP      NAT   CABLE DRIVER   SUBNETS                        STATUS   RTT avg.
test-cebide520uj0anrcnnb0-ocp4   c2        10.240.129.5   no    libreswan      172.30.2.0/23, 172.30.4.0/23   error    0s

$ oc get pods -n submariner-operator
NAME                                             READY   STATUS    RESTARTS        AGE
submariner-gateway-6m9f4                         1/1     Running   20 (2d5h ago)   6d23h
submariner-lighthouse-agent-b45c7dd9b-pgnc6      1/1     Running   2               6d23h
submariner-lighthouse-coredns-5c485dcf74-7kbbl   1/1     Running   0               4d22h
submariner-lighthouse-coredns-5c485dcf74-k5p72   1/1     Running   2               6d23h
submariner-metrics-proxy-nw89c                   1/1     Running   2               6d23h
submariner-operator-5bd64bfcbc-sz68s             1/1     Running   0               4d22h
submariner-routeagent-5qppb                      1/1     Running   2               6d23h
submariner-routeagent-hszpm                      1/1     Running   2               6d23h
submariner-routeagent-rbkkq                      1/1     Running   2               6d23h

$ oc get ippool
NAME                  AGE
default-ipv4-ippool   50d
podwestcluster        5d23h
svcwestcluster        5d23h

-----managed cluster 2-----
$ subctl show connections --context ocp412-s1/cebide520uj0anrcnnb0/admin
 ✓ Showing Connections
GATEWAY                          CLUSTER   REMOTE IP       NAT   CABLE DRIVER   SUBNETS                        STATUS   RTT avg.
test-cebid1v20fj7niadniu0-ocp4   c1        10.240.129.16   no    libreswan      172.30.6.0/23, 172.30.8.0/23   error    0s

$ oc get pods -n submariner-operator
NAME                                             READY   STATUS    RESTARTS        AGE
submariner-gateway-vgkgt                         1/1     Running   20 (2d5h ago)   6d23h
submariner-lighthouse-agent-784b676ff8-8t9wr     1/1     Running   2               6d23h
submariner-lighthouse-coredns-79b6c6b5f7-2mpt5   1/1     Running   2               6d23h
submariner-lighthouse-coredns-79b6c6b5f7-2q8fl   1/1     Running   2               6d23h
submariner-metrics-proxy-b7lm9                   1/1     Running   2               6d23h
submariner-operator-5bd64bfcbc-g4h6z             1/1     Running   2               6d23h
submariner-routeagent-9wmqj                      1/1     Running   2               6d23h
submariner-routeagent-ml9h5                      1/1     Running   3 (4d22h ago)   6d23h
submariner-routeagent-qgrfz                      1/1     Running   2               6d23h

$ oc get ippool
NAME                  AGE
default-ipv4-ippool   50d
podeastcluster        6d23h
svceasctluster        6d23h

The steps below were executed as per the Submariner configuration information:
https://red-hat-storage.github.io/ocs-training/training/ocs4/odf411-multisite-ramen.html#_connect_the_managed_clusters_using_submariner_add_ons
https://gist.github.com/prsurve/82fea750d7a3e7ef59adf183ebefca30
https://submariner.io/getting-started/quickstart/openshift/aws/#install-submariner-with-service-discovery
https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.5/html-single/add-ons/index#preparing-ibm
https://submariner.io/operations/deployment/calico/

Hence, please route this issue to the respective network/Submariner team for further analysis.
Hi Shaikh,

We are tracking Submariner issues in JIRA now; I can migrate this BZ over to JIRA.

One question I have (sorry, I am not familiar enough with IBM Cloud): is this about IBM Cloud Kubernetes Service (IKS) or Red Hat OpenShift Kubernetes Service (ROKS)? My understanding is that they are different.

We could use some help understanding the details of the target environment (OpenShift/k8s version? CNI - which I understand is Calico? are there any docs available?), and access to your setup. Is there a way we can discuss more easily (not via email/BZ)? Any chance you have access to Red Hat's Slack, or are you part of the Kubernetes Slack instance?

Thanks, Nir
Hi Nir,
Thanks for tracking the issue. The environment is ROKS (Red Hat OpenShift Kubernetes Service) on IBM VPC Cloud infrastructure.
I am reachable on gchat/Slack (partner-email: shaali, ibm: sikhlaqu.com). I can share the environment with you; it would be good if you can assign this issue to someone with whom I can discuss directly.
Thanks,
Shaikh Ali
The setup is running with upstream Submariner version 0.14.1 and the following are the initial observations:

1. OpenShift is installed with the Calico CNI and the necessary IPPools are configured in both clusters.
2. The IPsec connections seem to be established, but the health-check was failing, because of which the connection status is marked as failure.

When I had a look at why the health-check failed, I noticed the following errors in the route-agent pod running on the Gateway nodes of both clusters.

```
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "127.0.0.1" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.1" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.2" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.3" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.4" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.5" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.6" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.7" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.9" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.10" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.11" address
..ver/cni/cni_iface.go:74 CNI Interface "eth0" has "10.240.129.5" address
..ver/cni/cni_iface.go:74 CNI Interface "vethlocal" has "127.0.0.10" address
..roxy/kp_iptables.go:123 KubeProxy Error discovering the CNI interface error="unable to find CNI Interface on the host which has IP from \"172.30.4.0/23\""
..iptables/iptables.go:32 IPTables Install/ensure SUBMARINER-POSTROUTING chain exists
```

As you can see above, when the route-agent pod initially came up, there was no CNI interface on the node. So I restarted the pod - this time I could see that the CNI interface is discovered and the appropriate rules are programmed on the node.

```
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "127.0.0.1" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.1" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.2" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.3" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.4" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.5" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.6" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.7" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.9" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.10" address
..ver/cni/cni_iface.go:74 CNI Interface "lo" has "172.20.0.11" address
..ver/cni/cni_iface.go:74 CNI Interface "eth0" has "10.240.129.5" address
..ver/cni/cni_iface.go:74 CNI Interface "vethlocal" has "127.0.0.10" address
..ver/cni/cni_iface.go:74 CNI Interface "vx-submariner" has "240.240.129.5" address
..ver/cni/cni_iface.go:74 CNI Interface "tunl0" has "172.30.4.128" address
..ver/cni/cni_iface.go:79 CNI Found CNI Interface "tunl0" that has IP "172.30.4.128" from ClusterCIDR "172.30.4.0/23"
..er/cni/cni_iface.go:132 CNI Successfully annotated node "10.240.129.5" with cniIfaceIP "172.30.4.128"
```

It is odd that the "tunl0" (aka the CNI interface) was missing when the route-agent pod came up for the first time. Anyway, after restarting the route-agent pods on both clusters, the CNI interface is detected and the necessary rules were programmed. However, the health-check was still failing.
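For reference, a minimal sketch of the route-agent restart described above, assuming the usual `app=submariner-routeagent` pod label used by the Submariner deployment (verify the label on your clusters first):

```
# Hedged sketch: delete the route-agent pods so the DaemonSet recreates them
# and they re-run CNI interface discovery on startup.
# Confirm the label first with: oc get pods -n submariner-operator --show-labels
oc delete pod -n submariner-operator -l app=submariner-routeagent
```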
On debugging the datapath, I can see that traffic is sent from the Gateway node of one cluster but does not seem to reach the other cluster. We need to understand why the packets are not reaching the remote cluster's Gateway node.

Note: the IPsec tunnels seem to be proper, and UDP ports 4500 and 4490 are open.
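To illustrate the kind of datapath check described above, a hedged sketch (assumes shell access to the gateway nodes and that `eth0` is the node uplink, as in the route-agent logs; this is not necessarily the exact procedure used):

```
# On the sending gateway node: confirm tunnel traffic actually leaves the node
tcpdump -ni eth0 'esp or udp port 4500'

# On the remote gateway node: check whether the same traffic arrives
tcpdump -ni eth0 'esp or udp port 4500'
```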
On debugging further, the issue is that ESP packets are being dropped between the Gateway nodes. The clusters are directly reachable without any NAT in between, so IPsec was using plain ESP for tunnelling the traffic. Since I do not have access to the underlay to allow ESP traffic between the Gateway nodes, I forced udp-encapsulation in the IPsec configuration. With this, I can now see that the connections are successfully established and the health-check is passing fine.

```
 ✓ Showing Connections
GATEWAY                          CLUSTER   REMOTE IP       NAT   CABLE DRIVER   SUBNETS                        STATUS      RTT avg.
test-cebid1v20fj7niadniu0-ocp4   c1        10.240.129.16   no    libreswan      172.30.6.0/23, 172.30.8.0/23   connected   646.937µs

 ✓ Showing Endpoints
CLUSTER   ENDPOINT IP     PUBLIC IP        CABLE DRIVER   TYPE
c2        10.240.129.5    150.238.65.210   libreswan      local
c1        10.240.129.16   150.238.65.210   libreswan      remote

 ✓ Showing Gateways
NODE                             HA STATUS   SUMMARY
test-cebide520uj0anrcnnb0-ocp4   active      All connections (1) are established
```

Latest status: the Gateway-to-Gateway connection is now working fine between the clusters. However, there is still an issue with non-Gateway to non-Gateway communication, and this needs to be debugged.
Hi Sridhar, thanks for the fixes.

1. I still see the Submariner verification tests failing:

$ KUBECONFIG=/Users/shaikh/.bluemix/plugins/container-service/clusters/ocp412-p1-cebid1v20fj7niadniu0-admin/kube-config.yaml:/Users/shaikh/.bluemix/plugins/container-service/clusters/ocp412-s1-cebide520uj0anrcnnb0-admin/kube-config.yaml subctl verify --context ocp412-p1/cebid1v20fj7niadniu0/admin --tocontext ocp412-s1/cebide520uj0anrcnnb0/admin --only connectivity
Performing the following verifications: connectivity

Running Suite: Submariner E2E suite
===================================
Random Seed: 1675510912
Will run 24 of 44 specs
...
...
Summarizing 9 Failures:

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote pod when the pod is not on a gateway and the remote pod is not on a gateway [It] should have sent the expected data from the pod to the other pod
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote pod when the pod is not on a gateway and the remote pod is on a gateway [It] should have sent the expected data from the pod to the other pod
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote pod when the pod is on a gateway and the remote pod is not on a gateway [It] should have sent the expected data from the pod to the other pod
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote pod when the pod is on a gateway and the remote pod is on a gateway [It] should have sent the expected data from the pod to the other pod
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:187

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote service when the pod is not on a gateway and the remote service is not on a gateway [It] should have sent the expected data from the pod to the other pod
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote service when the pod is not on a gateway and the remote service is on a gateway [It] should have sent the expected data from the pod to the other pod
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod connects via TCP to a remote service when the pod is on a gateway and the remote service is not on a gateway [It] should have sent the expected data from the pod to the other pod
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod with HostNetworking connects via TCP to a remote pod when the pod is not on a gateway and the remote pod is not on a gateway [It] should have sent the expected data from the pod to the other pod
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

[Fail] [dataplane] Basic TCP connectivity tests across clusters without discovery when a pod with HostNetworking connects via TCP to a remote pod when the pod is on a gateway and the remote pod is not on a gateway [It] should have sent the expected data from the pod to the other pod
github.com/submariner-io/shipyard.1/test/e2e/framework/network_pods.go:188

Ran 12 of 31 Specs in 1507.923 seconds
FAIL! -- 3 Passed | 9 Failed | 0 Pending | 19 Skipped
-----------------------------------------

2. Can you also let us know what commands/steps were executed to bring Submariner back to a healthy state?
2a. Restart the route-agent pods on both clusters? For that, do we just delete those route-agent pods?
2b. "Since I do not have access to the underlay to enable ESP traffic between the Gateway nodes, I forced udp-encapsulation in the IPsec configuration." For this, what access do you need? This is a non-prod cluster, so it is okay to make any changes to it.
2c. What commands were executed to force udp-encapsulation?

4. Is it okay to try Disaster Recovery (DR) on this environment now, or are more fixes still needed? As you said:
```
Gateway to Gateway connection is now working fine between the clusters. However, there is some issue with non-Gateway to non-Gateway communication and this needs to be debugged.
```

5. My old failover pod is still in an error state. Maybe I need to redo the application deploy and failover and check again.
```
Warning FailedMount 2m23s (x4308 over 8d) kubelet MountVolume.MountDevice failed for volume "pvc-ee9a1af7-6d6b-4d07-a22f-a51266f418fd" : rpc error: code = Internal desc = image not found: RBD image not found
```

6. I also have another test environment (3-cluster setup), which is stuck in the same state. I want to execute the same steps there and see if Submariner becomes healthy, so please list down all the steps/commands/verification steps needed to fix this issue.

Thanks
Shaikh Ali
A. On managed cluster s1, node 10.240.129.6 (non-GW), there are IP routes to the remote cluster (p1) CIDRs that were added by the bird protocol [1]. Any idea how these routes were added? Routes to remote cluster CIDRs should be handled by the Submariner route agent.

[1]
172.30.6.0/23 proto bird
	nexthop via 10.240.129.5 dev eth0 weight 1
	nexthop via 240.240.129.5 dev vx-submariner weight 1
172.30.8.0/23 proto bird
	nexthop via 10.240.129.5 dev eth0 weight 1
	nexthop via 240.240.129.5 dev vx-submariner weight 1

@shaali
FWIW, bird is the routing component used by Calico: https://projectcalico.docs.tigera.io/reference/architecture/overview
I can see that after setting 'disableBGPExport' to true in the pod and service IPPools created for Submariner, these routes are deleted.
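For the record, a minimal sketch of that change, assuming the Submariner IPPool names shown in the earlier `oc get ippool` output and that the Calico IPPool resource is editable via `oc`/`kubectl` (adapt the pool names per cluster):

```
# Hedged sketch: stop Calico/bird from exporting the Submariner IPPools over BGP
oc patch ippool podwestcluster --type merge -p '{"spec":{"disableBGPExport":true}}'
oc patch ippool svcwestcluster --type merge -p '{"spec":{"disableBGPExport":true}}'
```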
@Yossi/Nir, thanks for the debugging updates. In case you have more IBM IKS network-side queries or blockers, let us know if we need to schedule a discussion with our IBM IKS network team.
@Yossi and I had a look at the setup and we managed to get it working. Basically, the current Calico CNI was configured to use IPIP tunnels with BGP, and this combination seems to be creating issues for the inter-cluster traffic handled by Submariner. So we changed the Calico default IPPool to use VXLAN tunnels. After this change, both the Gateway-to-Gateway and the non-Gateway to non-Gateway Submariner e2e tests are passing fine.

Summary of the changes done so far:

1. The firewall configuration on the Gateway nodes was not allowing ESP traffic, so we forced udp-encapsulation in the IPsec configuration, after which the tunnels were successfully established.

2. Even though the Calico IPPools were created as per the Submariner requirements [https://submariner.io/operations/deployment/calico/], there was an issue with the configuration. The following flags were not enabled, so we manually updated the IPPools with this change on both clusters:
   natOutgoing: false
   disabled: true

3. We changed the IPPool configuration in default-ipv4-ippool to use `vxlanMode: Always` and `ipipMode: Never` on both clusters and restarted the calico-node pods in the calico-system namespace.
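A minimal sketch of change 3, assuming the IPPool is editable through the cluster API (as the earlier `oc get ippool` output suggests) and that calico-node runs as the usual DaemonSet in calico-system; this is illustrative, not necessarily the exact commands used:

```
# Hedged sketch: switch the default pool from IPIP to VXLAN encapsulation
oc patch ippool default-ipv4-ippool --type merge \
  -p '{"spec":{"vxlanMode":"Always","ipipMode":"Never"}}'

# Restart calico-node so the dataplane picks up the new encapsulation mode
oc rollout restart daemonset/calico-node -n calico-system
```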
Hi, I tried DR again after the above Submariner fixes and I see some issues during the failover. The error below appears in the DRPlacementControl details; it seems to be related to certificates.

```
----------- busybox3-placement-1-drpc --------
----error----
Failed to restore PVs (failed to restore PVs for VolRep (failed to restorePVs using profile list ([s3profile-ocp412-p1-ocs-storagecluster s3profile-ocp412-s1-ocs-storagecluster]): unable to ListKeys of type v1.PersistentVolume keyPrefix busybox-sample3/busybox3-placement-1-drpc/v1.PersistentVolume/, failed to list objects in bucket odrbucket-fdfbb7912e2d:busybox-sample3/busybox3-placement-1-drpc/v1.PersistentVolume/, RequestError: send request failed caused by: Get "https://s3-openshift-storage.ocp412-s1-f8480cb6e62d97529990eca5f3f95767-0000.us-south.stg.containers.appdomain.cloud/odrbucket-fdfbb7912e2d?list-type=2&prefix=busybox-sample3%2Fbusybox3-placement-1-drpc%2Fv1.PersistentVolume%2F": x509: certificate signed by unknown authority))
--------end------------------

-------DR placement control under NS busybox-sample3 (Hub cluster)-----------
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  resourceVersion: '127586588'
  name: busybox3-placement-1-drpc
  uid: 913f7f1e-7bf5-435a-8b30-cb312b5a3a36
  creationTimestamp: '2023-02-06T17:28:18Z'
  generation: 3
  ..
  ...
      manager: manager
      operation: Update
      subresource: status
      time: '2023-02-06T19:28:04Z'
  namespace: busybox-sample3
  finalizers:
    - drpc.ramendr.openshift.io/finalizer
  labels:
    app: busybox3
    cluster.open-cluster-management.io/backup: resource
spec:
  action: Failover
  drPolicyRef:
    name: ocp412-p1-s1-5m
  failoverCluster: ocp412-s1
  placementRef:
    kind: PlacementRule
    name: busybox3-placement-1
    namespace: busybox-sample3
  preferredCluster: ocp412-p1
  pvcSelector:
    matchLabels:
      app: busybox3
status:
  actionStartTime: '2023-02-06T17:50:23Z'
  conditions:
    - lastTransitionTime: '2023-02-06T17:50:23Z'
      message: Waiting for PV restore to complete...)
      observedGeneration: 3
      reason: FailingOver
      status: 'False'
      type: Available
    - lastTransitionTime: '2023-02-06T17:50:23Z'
      message: Started failover to cluster "ocp412-s1"
      observedGeneration: 3
      reason: NotStarted
      status: 'False'
      type: PeerReady
  lastUpdateTime: '2023-02-06T19:28:04Z'
  phase: FailingOver
  preferredDecision:
    clusterName: ocp412-p1
    clusterNamespace: ocp412-p1
  progression: WaitingForPVRestore
  resourceConditions:
    conditions:
      - lastTransitionTime: '2023-02-06T17:28:20Z'
        message: >-
          Failed to restore PVs (failed to restore PVs for VolRep (failed to
          restorePVs using profile list ([s3profile-ocp412-p1-ocs-storagecluster
          s3profile-ocp412-s1-ocs-storagecluster]): unable to ListKeys of type
          v1.PersistentVolume keyPrefix
          busybox-sample3/busybox3-placement-1-drpc/v1.PersistentVolume/, failed
          to list objects in bucket
          odrbucket-fdfbb7912e2d:busybox-sample3/busybox3-placement-1-drpc/v1.PersistentVolume/,
          RequestError: send request failed caused by: Get
          "https://s3-openshift-storage.ocp412-s1-f8480cb6e62d97529990eca5f3f95767-0000.us-south.stg.containers.appdomain.cloud/odrbucket-fdfbb7912e2d?list-type=2&prefix=busybox-sample3%2Fbusybox3-placement-1-drpc%2Fv1.PersistentVolume%2F":
          x509: certificate signed by unknown authority))
        observedGeneration: 1
        reason: Error
        status: 'False'
        type: ClusterDataReady
      - lastTransitionTime: '2023-02-06T17:28:20Z'
        message: Initializing VolumeReplicationGroup
        observedGeneration: 1
        reason: Initializing
        status: Unknown
        type: DataReady
      - lastTransitionTime: '2023-02-06T17:28:20Z'
        message: Initializing VolumeReplicationGroup
        observedGeneration: 1
        reason: Initializing
        status: Unknown
        type: DataProtected
      - lastTransitionTime: '2023-02-06T17:28:20Z'
        message: Initializing VolumeReplicationGroup
        observedGeneration: 1
        reason: Initializing
        status: Unknown
        type: ClusterDataProtected
    resourceMeta:
      generation: 1
      kind: VolumeReplicationGroup
      name: busybox3-placement-1-drpc
      namespace: busybox-sample3
```
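As an aside, one hedged way to confirm the x509 failure independently of Ramen is to hit the S3 endpoint reported in the error above directly (illustrative only; the URL is taken from the error message, and the request is expected to fail TLS verification the same way if the CA is not trusted on the client):

```
# Hedged check: does TLS verification against the reported S3 route fail with the same CA error?
curl -sv "https://s3-openshift-storage.ocp412-s1-f8480cb6e62d97529990eca5f3f95767-0000.us-south.stg.containers.appdomain.cloud" \
  -o /dev/null 2>&1 | grep -i -E 'certificate|SSL'
```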
@shaali I fixed the cert issue by doing these steps:

1) Created the ConfigMap again using this process: https://red-hat-storage.github.io/ocs-training/training/ocs4/odf411-multisite-ramen.html#_configure_ssl_access_between_s3_endpoints

2) Because OCP 4.12 has no VolSync operator until ACM 2.7 is released, I applied this workaround on both managed clusters to populate the VolSync CRDs with the old version 0.5.0 (Ramen must find the CRDs):
$ oc apply -f https://raw.githubusercontent.com/backube/volsync/v0.5.0/config/crd/bases/volsync.backube_replicationdestinations.yaml
$ oc apply -f https://raw.githubusercontent.com/backube/volsync/v0.5.0/config/crd/bases/volsync.backube_replicationsources.yaml

3) Restarted the Ramen pods on the managed clusters (openshift-dr-system) and the hub cluster (openshift-operators).

The problem now is that the Ramen pod on ocp412-s1 (https://c100-e.containers.test.cloud.ibm.com:30912) has an ErrImagePull, so failover will not succeed.

$ oc describe pod ramen-dr-cluster-operator-75769b99bd-t56r5 -n openshift-dr-system
[...]
Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       13m                   default-scheduler  Successfully assigned openshift-dr-system/ramen-dr-cluster-operator-75769b99bd-t56r5 to 10.240.129.5
  Normal   AddedInterface  13m                   multus             Add eth0 [172.30.4.143/32] from k8s-pod-network
  Normal   Pulled          13m                   kubelet            Container image "registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:e3dad360d0351237a16593ca0862652809c41a2127c2f98b9e0a559568efbd10" already present on machine
  Normal   Created         13m                   kubelet            Created container kube-rbac-proxy
  Normal   Started         13m                   kubelet            Started container kube-rbac-proxy
  Warning  Failed          12m                   kubelet            Failed to pull image "registry.redhat.io/odf4/odr-rhel8-operator@sha256:5844f3cea10d926d204bbb37f592a82ed3f513495dbf909ff7755b43f10925a8": rpc error: code = Unknown desc = Get "https://registry.redhat.io/auth/realms/rhcc/protocol/redhat-docker-v2/auth?account=%7Cuhc-pool-c4d98868-1c71-4b5b-aba8-701650c9911c&scope=repository%3Aodf4%2Fodr-rhel8-operator%3Apull&service=docker-registry": dial tcp 23.220.124.237:443: connect: connection timed out
  Warning  Failed          11m (x2 over 13m)     kubelet            Failed to pull image "registry.redhat.io/odf4/odr-rhel8-operator@sha256:5844f3cea10d926d204bbb37f592a82ed3f513495dbf909ff7755b43f10925a8": rpc error: code = Unknown desc = initializing source docker://registry.redhat.io/odf4/odr-rhel8-operator@sha256:5844f3cea10d926d204bbb37f592a82ed3f513495dbf909ff7755b43f10925a8: can't talk to a V1 container registry
  Warning  Failed          11m (x3 over 13m)     kubelet            Error: ErrImagePull
  Warning  Failed          11m (x6 over 13m)     kubelet            Error: ImagePullBackOff
  Normal   Pulling         11m (x4 over 13m)     kubelet            Pulling image "registry.redhat.io/odf4/odr-rhel8-operator@sha256:5844f3cea10d926d204bbb37f592a82ed3f513495dbf909ff7755b43f10925a8"
  Normal   BackOff         3m20s (x35 over 13m)  kubelet            Back-off pulling image "registry.redhat.io/odf4/odr-rhel8-operator@sha256:5844f3cea10d926d204bbb37f592a82ed3f513495dbf909ff7755b43f10925a8"
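For anyone reproducing step 1: the linked guide concatenates the ingress CA certificates of both managed clusters into a ConfigMap and points the cluster-wide proxy at it. A heavily hedged sketch follows; the ConfigMap name `user-ca-bundle` and the proxy patch are standard OpenShift patterns, but verify the exact manifests against the linked document before running (the guide builds the ConfigMap from a YAML manifest; `--from-file` is used here only for brevity, and `combined-ca.crt` is a placeholder for the concatenated CA certificates):

```
# Hedged sketch of the "Configure SSL access between S3 endpoints" step
oc -n openshift-config create configmap user-ca-bundle \
  --from-file=ca-bundle.crt=combined-ca.crt          # combined-ca.crt: placeholder file
oc patch proxy cluster --type=merge \
  -p '{"spec":{"trustedCA":{"name":"user-ca-bundle"}}}'
```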
Hi @annette.clewett,

1. Thanks for fixing that cert issue. What was the issue? I had already executed those steps: I created the cert configuration in all 3 clusters and updated the proxy in all 3 clusters. Did you additionally restart the kube-proxy pods? Last time I did not hit this issue when I created the DR policy after doing the SSL steps, but somehow the SSL cert got updated. So, does the sequence matter ('SSL step <--> DR policy creation')?

2. I just moved that Ramen pod to another node, and it came up fine. It could be a node issue; I cordoned the other node.

3. So the failover is successful now: the app is up on the secondary cluster. I also tried the relocate; it worked as expected and the app is back on the primary cluster. I will do some more tests and let you know in case of issues.

4. One small issue I see is that the data written on the PVC has different ownership on the two clusters. I am not sure whether that will cause any issues for applications.

-----After failover onto the s1 cluster (Project: busybox-sample3)-----
$ ls -l /mnt/test
-rw-rw-rw-    1 10009600 10024400         0 Feb  6 17:26 fromprimary-06Feb2023
drwxrws---    2 root     10024400     16384 Feb  6 17:23 lost+found

-----After relocate onto the p1 cluster-----
~ $ ls -l /mnt/test
total 1832
-rw-rw-rw-    1 10009600 10009600         0 Feb  6 17:26 fromprimary-06Feb2023
-rw-rw-rw-    1 10024400 10009600         0 Feb  7 12:43 fromsecondary-07Feb2023
drwxrws---    2 root     10009600     16384 Feb  6 17:23 lost+found
@Sridhar, I am setting up another environment to test DR end-to-end, and I am also documenting the entire procedure. Can you provide the exact steps that were executed on the current clusters to fix Submariner?

-------------------- Last updates on Submariner -------------
Summary of the changes done so far:

1. The firewall configuration on the Gateway nodes was not allowing ESP traffic, so we forced udp-encapsulation in the IPsec configuration, after which the tunnels were successfully established.

*** Can you provide the exact command for this, and the post-verification steps (if any)?

2. Even though the Calico IPPools were created as per the Submariner requirement [https://submariner.io/operations/deployment/calico/], there was an issue with the configuration. The following flags were not enabled, so we manually updated the IPPools with this change on both the clusters:
   natOutgoing: false
   disabled: true

*** I think these steps are enough as per the link? https://submariner.io/operations/deployment/calico/

3. We changed the IPPool configuration in default-ipv4-ippool to use `vxlanMode: Always` and `ipipMode: Never` on both the clusters and restarted the calico-node pods in the calico-system namespace.

*** Update 'default-ipv4-ippool' using calicoctl, or kubectl edit?
(In reply to Shaikh I Ali from comment #22)

> 1. The firewall configuration on the Gateway nodes was not allowing ESP
> traffic, so we forced udp-encapsulation in the IPSec configuration after
> which the tunnels were successfully established.
>
> ***Can you provide exact command for this ?? and the post verification
> steps (if any) ?

This can be fixed in two ways:
1. By opening the firewall ports on the Gateway nodes of the cluster to allow ESP traffic. This is the optimal solution, as it avoids the additional UDP overhead in the IPsec tunnels.
2. If the above option is not feasible (as it requires access to the underlay/infra), then while joining the cluster to the Broker with the "subctl join ..." command, you can pass the flag "--force-udp-encaps".

Post verification: after executing option 1 or 2 on both clusters, you will notice that the IPsec tunnels are successfully established; this can be seen using the command "subctl show connections".

> 2. Even though the Calico IPPools were created as per the Submariner
> requirement [https://submariner.io/operations/deployment/calico/] there was
> an issue with the configuration.
> The following flags were not enabled, so we manually updated the IPPools
> with this change on both the clusters.
> natOutgoing: false
> disabled: true
>
> **** I think these steps are enough as per the link ?
> https://submariner.io/operations/deployment/calico/

Yes, the document is up to date. To make it easier to detect such Calico configuration issues, I've pushed the following PR in Submariner to verify that the config meets the Submariner requirements as part of the "subctl diagnose cni" command:
https://github.com/submariner-io/subctl/pull/550

> 3. We changed the IPPool configuration in default-ipv4-ippool to use
> `vxlanMode: Always` and `ipipMode: Never` on both the clusters and restarted
> the calico-node pods in calico-system namespace..
>
> **** update 'default-ipv4-ippool' using calicoctl ?? or kubectl edit ?

You can use the kubectl CLI.
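Putting answers 1 and 2 above into a concrete, hedged sketch: the cluster ID and IPPool names are reused from earlier output in this bug, the remaining join flags are copied from the original join command (with --context omitted for brevity), and the IPPool field names follow the linked Calico/Submariner guide.

```
# Re-join the cluster to the broker with UDP encapsulation forced (answer 1, option 2)
subctl join broker-info.subm --clusterid ocp412-s1 --natt=false --force-udp-encaps \
  --insecure-skip-tls-verify=true --globalnet=false --check-broker-certificate=false

# Post-verification: the tunnel should show up as connected
subctl show connections

# Calico IPPool flags from the Submariner Calico guide (answer 2), applied to the
# Submariner pod/service IPPools created in this setup (names from `oc get ippool` above)
kubectl patch ippool podeastcluster --type merge -p '{"spec":{"natOutgoing":false,"disabled":true}}'
kubectl patch ippool svceasctluster --type merge -p '{"spec":{"natOutgoing":false,"disabled":true}}'
```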