Bug 2076669 - [Submariner] - subctl diagnose firewall inter-cluster returns incorrect state for Azure platform
Summary: [Submariner] - subctl diagnose firewall inter-cluster returns incorrect state ...
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Submariner
Version: rhacm-2.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: rhacm-2.5
Assignee: Aswin Suryanarayanan
QA Contact: Maxim Babushkin
Docs Contact: Christopher Dawson
URL:
Whiteboard:
Duplicates: 2119719
Depends On: 1999325
Blocks:
 
Reported: 2022-04-19 15:37 UTC by Maxim Babushkin
Modified: 2023-09-18 04:35 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-02-09 13:20:50 UTC
Target Upstream Version:
Embargoed:
bot-tracker-sync: rhacm-2.5+
maafried: needinfo-




Links:
Github stolostron backlog issue 21785 (last updated 2022-04-19 22:21:53 UTC)

Description Maxim Babushkin 2022-04-19 15:37:19 UTC
**What happened**:
When running the "subctl diagnose firewall inter-cluster" test between a cluster on the Azure platform and a cluster on another platform, an incorrect state is returned.

ACM 2.5.0
Submariner 0.12.0 (Globalnet enabled)
OCP version 4.9.25

 ✗ Checking if tunnels can be setup on the gateway node of cluster "mbabushk-sub3" 
 ✗ The tcpdump output from the sniffer pod does not include the message sent from client pod. Please check that your firewall configuration allows UDP/4505 traffic on the "mbabushk-sub3-7ptvl-worker-centralus1-vsrvt" node.

But UDP port 4505 is open on both clusters.
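For reference, a minimal manual check of that port, independent of subctl; the node and IP below are placeholders, not taken from this bug:

# On the Azure gateway node, watch for the test traffic:
tcpdump -ni any udp and port 4505

# From a host or pod in the other cluster, send a test datagram towards the
# Azure gateway's public IP (placeholder address):
echo test | nc -u -w1 203.0.113.10 4505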

**What you expected to happen**:
The test of the firewall between clusters should pass.

**How to reproduce it (as minimally and precisely as possible)**:
Deploy two clusters, one on the Azure platform and one on another platform.
Run the firewall test.
It will return an error even though the configuration is correct.
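A sketch of the failing invocation, assuming the 0.12-era subctl syntax and placeholder kubeconfig paths:

# Run the inter-cluster firewall check from the non-Azure cluster towards
# the Azure cluster:
subctl diagnose firewall inter-cluster /path/to/local-kubeconfig /path/to/azure-kubeconfig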

**Anything else we need to know?**:
Debugging output by Sridhar (Slack thread):
==========================================================
Maxim Babushkin
Hi @multiclusternetwork-team
Have a question regarding the Azure platform.
I made a deployment of Submariner on the AWS, GCP and Azure platforms.
I did all the cloud prepare steps manually.
When I run the subctl diagnose firewall inter-cluster command and specify the Azure cluster, I get the following error:
 ✗ Checking if tunnels can be setup on the gateway node of cluster "mbabushk-sub3" 
 ✗ The tcpdump output from the sniffer pod does not include the message sent from client pod. Please check that your firewall configuration allows UDP/4505 traffic on the "mbabushk-sub3-7ptvl-worker-centralus1-vsrvt" node.
The port is open and all e2e tests are passing.
Only this check fails.
When I test the firewall between AWS and GCP, there is no issue.
It happens only when Azure is involved.
Does it happen because cloud prepare is not yet implemented and some firewall check relies on it?
Or is there another reason?
Thanks.

Sridhar Gaddam
Hello Maxim, if the tunnels are successfully created and e2e is passing, I don't think there is an issue with cloud-prepare.

Sridhar Gaddam
Please share the kubeconfigs of the three clusters (in private), I can take a look and provide an update.

Maxim Babushkin
@sridharg
Thanks.
Will share in a sec.

Sridhar Gaddam
@mbabushk there is a problem while deploying the diagnose pods on the Azure cluster.

Sridhar Gaddam
For some reason, K8s running on Azure is automatically adding some volumes to the pod and this seems to be causing issues

Sridhar Gaddam
Events:
  Type     Reason          Age                From                                   Message
  ----     ------          ----               ----                                   -------
  Normal   Scheduled       <unknown>                                                 Successfully assigned default/validate-clientnvkbb to mbabushk-sub3-7ptvl-master-2
  Normal   AddedInterface  24s                multus                                 Add eth0 [10.130.0.33/23] from openshift-sdn
  Normal   Pulled          24s                kubelet, mbabushk-sub3-7ptvl-master-2  Container image "quay.io/submariner/nettest:devel" already present on machine
  Normal   Created         24s                kubelet, mbabushk-sub3-7ptvl-master-2  Created container validate-client
  Normal   Started         24s                kubelet, mbabushk-sub3-7ptvl-master-2  Started container validate-client
  Warning  FailedMount     13s (x2 over 14s)  kubelet, mbabushk-sub3-7ptvl-master-2  MountVolume.SetUp failed for volume "kube-api-access-5dhqq" : [object "default"/"kube-root-ca.crt" not registered, object "default"/"openshift-service-ca.crt" not registered]
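For context, the FailedMount message refers to the two CA ConfigMaps consumed by the projected kube-api-access volume. A quick way to check whether they actually exist in the namespace where the diagnose pod runs, using the pod and namespace named in the events above:

# Both ConfigMaps must be present for the projected volume to mount:
oc get configmap kube-root-ca.crt openshift-service-ca.crt -n default

# Re-check the pod's mount events:
oc describe pod validate-clientnvkbb -n default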

Sridhar Gaddam
This is the corresponding Pod.Spec

Sridhar Gaddam
  volumes:
  - name: kube-api-access-b7jk6
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt

Sridhar Gaddam
We do not add this to the pod; it's getting auto-added by some component, and because of this the pod is not starting (hence diagnose is failing).
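The auto-added volume is the projected service-account token (kube-api-access-*). As a hypothetical workaround sketch only, not something applied in this bug, a client pod can opt out of the automatic mount so the projected volume is never created:

# Minimal pod spec that disables the auto-mounted service-account token,
# so no kube-api-access projected volume is injected (image taken from the
# events above; pod name and sleep command are illustrative):
oc apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: validate-client-test
  namespace: default
spec:
  automountServiceAccountToken: false
  containers:
  - name: validate-client
    image: quay.io/submariner/nettest:devel
    command: ["sleep", "3600"]
EOF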

Sridhar Gaddam
These volumes seem to be added even on the other OCP clusters, but on Azure it's failing with the FailedMount error. We need to debug this further to understand why these volumes are added and whether we can find a way to get past this issue.
==========================================================

Comment 1 Aswin Suryanarayanan 2022-05-02 12:49:16 UTC
The root cause seems to be related to this bug https://bugzilla.redhat.com/show_bug.cgi?id=1999325

Comment 2 Nelson Jean 2022-05-11 17:50:49 UTC
Hi @maafried, is there any fix required in Submariner now that bug https://bugzilla.redhat.com/show_bug.cgi?id=1999325 has been verified? If not, and an existing Submariner build can be used for verification of this bug, please move this to ON_QA.

Comment 4 Maxim Babushkin 2022-05-25 14:46:23 UTC
The issue reported in this bug is related to the following bug: https://bugzilla.redhat.com/show_bug.cgi?id=1999325.
That bug was fixed in OCP version 4.11, and we are waiting for it to be backported to earlier OCP versions.

Until it is backported, I have no way to verify this one.

Comment 5 bot-tracker-sync 2022-05-27 05:44:15 UTC
G2Bsync 1139250998 comment
nelsonjean Fri, 27 May 2022 03:46:40 UTC

Do you know when the backport is estimated to be available?

Comment 6 Maxim Babushkin 2022-05-29 07:03:10 UTC
It looks like the fix was backported to OCP versions 4.10 and 4.9:
https://bugzilla.redhat.com/show_bug.cgi?id=2067464
https://bugzilla.redhat.com/show_bug.cgi?id=2075704

I'll wait for the hub to get the OCP versions with the fix, then verify and close the BZ.

Eveline, do you know when we expect the new releases of OCP 4.9 and 4.10 to be available in the ACM hub?

Comment 8 Maxim Babushkin 2022-08-15 17:10:17 UTC
Although the fix seems to have been backported to OCP versions 4.10 and 4.9, when I tested on the latest 4.10 I was still facing the issue.
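For reference, the exact cluster build can be confirmed before re-testing; the specific z-stream that carries the backport is not stated in this bug:

# Check the OCP version on the Azure cluster and the subctl build in use:
oc get clusterversion
subctl version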

Comment 9 Daniel Farrell 2022-08-25 12:52:57 UTC
Looks like we need to re-verify this fix.

Comment 10 Nir Yechiel 2022-09-15 09:51:43 UTC
@Aswin, can you please take a look at this issue again? It looks like it failed QE, even after the OCP fix went in.

Comment 11 Aswin Suryanarayanan 2022-09-27 00:43:58 UTC
The diagnose check tries port 9898 from the source cluster to see if 4500 is reachable, and in Azure egress traffic is not allowed by default, so the packets are getting dropped.
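For reference, one way the default egress behaviour could be inspected on the Azure side, assuming placeholder resource-group and NSG names (not taken from this bug):

# List the NSGs in the cluster's resource group and dump their rules, to see
# whether UDP 9898/4500 egress is explicitly allowed or hits a default rule:
az network nsg list --resource-group <cluster-resource-group> --output table
az network nsg rule list --resource-group <cluster-resource-group> --nsg-name <worker-nsg-name> --output table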

Comment 12 Aswin Suryanarayanan 2022-09-27 23:57:55 UTC
(In reply to Aswin Suryanarayanan from comment #11)
> The diagnose check tries port 9898 from the source cluster to see if 4500
> is reachable, and in Azure egress traffic is not allowed by default, so
> the packets are getting dropped.

This needs further investigation; port 9898 does not seem to be the reason for the failure. I will update after further investigation.

Comment 13 Aswin Suryanarayanan 2022-11-29 02:26:10 UTC
*** Bug 2119719 has been marked as a duplicate of this bug. ***

Comment 15 Nir Yechiel 2023-02-09 05:39:40 UTC
@Aswin, can you remind me if this one is still relevant? If so, we should move it over to Jira. Thanks!

Comment 16 Maxim Babushkin 2023-02-09 08:18:55 UTC
@Nir, yes, this is still relevant.

Comment 17 Nir Yechiel 2023-02-09 13:20:50 UTC
Migrated to Jira: https://issues.redhat.com/browse/ACM-3316

Comment 18 Aswin Suryanarayanan 2023-02-21 14:55:07 UTC
Removing the needinfo as the discussion can continue in Jira.

Comment 19 Red Hat Bugzilla 2023-09-18 04:35:47 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

