Bug 1762580

Summary: Cannot access the service's externalIP with egressIP from some pods despite OCP update to 3.11.146-1
Product: OpenShift Container Platform Reporter: Min Woo Park <mpark>
Component: NetworkingAssignee: Juan Luis de Sousa-Valadas <jdesousa>
Networking sub component: openshift-sdn QA Contact: huirwang
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: anbhat, bbennett, cdc, dageoffr, danw, huirwang, jdesousa, jinjli, nstielau, openshift-bugs-escalate, palonsor, pamoedom, pweil, sponnaga, zzhao
Version: 3.11.0   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: SDN-CUST-IMPACT
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: conntrack was not enabled for openshift-sdn in multitenant mode. Consequence: Pods were unable to reach externalIP services. Fix: Enable conntrack in multitenant mode. Result: Pods can now reach externalIP services.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 15:54:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1901043    

Comment 20 Juan Luis de Sousa-Valadas 2019-12-23 14:16:31 UTC
Hello Min,
Can't they just connect to the service clusterIP instead of the externalIP?

Comment 24 zhaozhanqi 2020-01-09 00:47:54 UTC
Huiran, I remember we verified this kind of issue before. Could you also help check whether it can still be reproduced in 3.11.146-1?

Comment 37 Juan Luis de Sousa-Valadas 2020-01-15 01:51:57 UTC
Hi Min,
Actually, do this instead:

# iptables -N OPENSHIFT-PREROUTING -t nat
# iptables -I PREROUTING -t nat -j OPENSHIFT-PREROUTING
# iptables -A OPENSHIFT-PREROUTING -t nat -m mark --mark 0x1/0x1 -j RETURN
# iptables -A OPENSHIFT-PREROUTING -t nat -m mark '!' --mark 0x0 -j ACCEPT

If there are errors, remove the rules instead:
# iptables -D PREROUTING -t nat -j OPENSHIFT-PREROUTING
# iptables -D OPENSHIFT-PREROUTING -t nat -m mark --mark 0x1/0x1 -j RETURN
# iptables -D OPENSHIFT-PREROUTING -t nat -m mark '!' --mark 0x0 -j ACCEPT
# iptables -X OPENSHIFT-PREROUTING -t nat

This is only necessary on the nodes with an egress IP, but it may be done on every node.
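As a hedged sketch (assuming root access on the node), whether the workaround rules are in place can be checked by grepping `iptables-save -t nat` output; the `check_rules` helper name is ours for illustration, not part of the product:

```shell
# Sketch: verify the OPENSHIFT-PREROUTING workaround is in place.
# check_rules takes the text of `iptables-save -t nat` as its argument
# and prints "ok" when both the chain declaration (":CHAIN ..." lines in
# iptables-save format) and the PREROUTING jump rule are present.
check_rules() {
  local save_output="$1"
  echo "$save_output" | grep -q -- '^:OPENSHIFT-PREROUTING' \
    || { echo "missing chain"; return 1; }
  echo "$save_output" | grep -q -- '-A PREROUTING.*-j OPENSHIFT-PREROUTING' \
    || { echo "missing jump"; return 1; }
  echo "ok"
}
# On a node, as root:
#   check_rules "$(iptables-save -t nat)"
```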

Comment 64 Ben Bennett 2020-05-20 13:57:47 UTC
Updating the release to track the development branch.  Juan is actively working on the issue, and we will work out the backport as soon as we have a tested fix.

Comment 71 Juan Luis de Sousa-Valadas 2020-06-15 11:49:24 UTC
Hi Huiran,
These are the steps to reproduce.

Deploy a cluster with the *multitenant* plugin.

For both RHCOS and RHEL, have four nodes which look like:
NAME     HOST     HOST IP          SUBNET          EGRESS CIDRS   EGRESS IPS
node-0   node-0   136.144.52.250   10.130.2.0/23                  [136.144.52.242]
node-1   node-1   136.144.52.230   10.131.2.0/23                  
node-2   node-2   136.144.52.243   10.129.2.0/23                  
node-3   node-3   136.144.52.241   10.128.2.0/23                  

For this scenario node-0 has the egressIP, node-1 has the externalIP, node-2 has the client, and node-3 has the server.

$ oc get svc
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP      PORT(S)             AGE
hello-service1    ClusterIP   172.30.77.123   136.144.52.230   27018/TCP           8h

The externalIP is *the host IP in the hostsubnet*. You could use a different externalIP provided that said IP belongs to node-1 and that the rest of the nodes can reach it.

The client runs on node-2 in the test-client project, which uses the egressIP, and the server runs on node-3 in the test-server project, which may or may not have an egressIP. Client and server must *not* be joined (as in "oc adm pod-network join-projects").

The test *must* be run on both all-RHCOS and all-RHEL nodes. If we do all RHEL and all RHCOS we cover all possible scenarios. The test cannot be performed on a single host because then we may hit an edge case where it happens to work; if it works across 4 different hosts, that is the worst case and should cover all the different scenarios.
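The client-side check in the steps above can be sketched as follows; the pod name "client" and the `build_probe` helper are our assumptions for illustration, while the externalIP and port come from the `oc get svc` output above:

```shell
# Sketch of the client-to-externalIP probe from the reproduction steps.
# build_probe prints the curl command to run inside the client pod;
# the service externalIP (136.144.52.230) and port (27018) are taken
# from the scenario description.
build_probe() {
  local external_ip="$1" port="$2"
  printf 'curl -s --connect-timeout 5 http://%s:%s/' "$external_ip" "$port"
}
# Run from the client pod on node-2, e.g. (pod name is a placeholder):
#   oc -n test-client rsh client $(build_probe 136.144.52.230 27018)
```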

I have verified this manually, but it needs to be validated again for both RHCOS and RHEL by QA as soon as the PR merges.

Thanks again Huiran for the multiple clusters.

Comment 76 Dan Winship 2020-06-16 13:18:10 UTC
*** Bug 1717487 has been marked as a duplicate of this bug. ***

Comment 91 Juan Luis de Sousa-Valadas 2020-09-22 15:00:32 UTC
Hi Huiran,
I believe this is related to RHCOS vs RHEL 7 behavior.
Could you deploy a cluster with 4 RHEL nodes: one with the egressIP, one with the externalIP, one with the client pod and one with the server pod?
I think this may work on RHEL 7 because it works in my local 3.11 fork, so I suspect it is RHEL-specific; I can't think of any other difference in SDN between 3.11 and 4.6 that could cause this.

Comment 93 Juan Luis de Sousa-Valadas 2020-09-23 09:48:53 UTC
Hi Huiran,
Because this works on RHEL 7 but doesn't work on RHEL 8, and the customer with this use case is using OCP 3.11, which is RHEL 7 only, I think we should mark this bug as VERIFIED so that we can backport it all the way back to OCP 3.11, and file a new low-priority bug specifically for the RHCOS case.

Do you think QA could agree with that?

Comment 96 Ben Bennett 2020-09-23 13:04:25 UTC
*** Bug 1881882 has been marked as a duplicate of this bug. ***

Comment 99 errata-xmlrpc 2020-10-27 15:54:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196