Bug 1762580

Summary: Cannot access the service's externalIP with egressIP from some pods despite OCP update to 3.11.146-1
Product: OpenShift Container Platform Reporter: Min Woo Park <mpark>
Component: NetworkingAssignee: Juan Luis de Sousa-Valadas <jdesousa>
Networking sub component: openshift-sdn QA Contact: huirwang
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: anbhat, bbennett, cdc, dageoffr, danw, huirwang, jdesousa, jinjli, nstielau, openshift-bugs-escalate, palonsor, pamoedom, pweil, sponnaga, zzhao
Version: 3.11.0   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: SDN-CUST-IMPACT
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: conntrack was not enabled for openshift-sdn in multitenant mode. Consequence: Pods were unable to reach externalIP services. Fix: Enable conntrack in multitenant mode. Result: Pods can now reach externalIP services.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 15:54:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1901043    

Comment 20 Juan Luis de Sousa-Valadas 2019-12-23 14:16:31 UTC
Hello Min,
Can't they just connect to the service clusterIP instead of the externalIP?

Comment 24 zhaozhanqi 2020-01-09 00:47:54 UTC
Huiran, I remember we verified this kind of issue before. Could you also help check whether it can still be reproduced in 3.11.146-1?

Comment 37 Juan Luis de Sousa-Valadas 2020-01-15 01:51:57 UTC
Hi Min,
Actually, do this instead:

# iptables -N OPENSHIFT-PREROUTING -t nat
# iptables -I PREROUTING -t nat -j OPENSHIFT-PREROUTING
# iptables -A OPENSHIFT-PREROUTING -t nat -m mark --mark 0x1/0x1 -j RETURN
# iptables -A OPENSHIFT-PREROUTING -t nat -m mark '!' --mark 0x0 -j ACCEPT

If there are errors, remove the rules instead:
# iptables -D PREROUTING -t nat -j OPENSHIFT-PREROUTING
# iptables -D OPENSHIFT-PREROUTING -t nat -m mark --mark 0x1/0x1 -j RETURN
# iptables -D OPENSHIFT-PREROUTING -t nat -m mark '!' --mark 0x0 -j ACCEPT
# iptables -X OPENSHIFT-PREROUTING -t nat

This is only necessary on the nodes with an egress IP, but it may be done on every node.
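As a hedged sketch (assuming root access on the node), whether the workaround rules are in place can be checked by grepping `iptables-save -t nat` output; the `check_rules` helper name is ours for illustration, not part of the product:

```shell
# Sketch: verify the OPENSHIFT-PREROUTING workaround is in place.
# check_rules takes the text of `iptables-save -t nat` as its argument
# and prints "ok" when both the chain declaration (":CHAIN ..." lines in
# iptables-save format) and the PREROUTING jump rule are present.
check_rules() {
  local save_output="$1"
  echo "$save_output" | grep -q -- '^:OPENSHIFT-PREROUTING' \
    || { echo "missing chain"; return 1; }
  echo "$save_output" | grep -q -- '-A PREROUTING.*-j OPENSHIFT-PREROUTING' \
    || { echo "missing jump"; return 1; }
  echo "ok"
}
# On a node, as root:
#   check_rules "$(iptables-save -t nat)"
```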

Comment 64 Ben Bennett 2020-05-20 13:57:47 UTC
Updating the release to track the development branch.  Juan is actively working on the issue, and we will work out the backport as soon as we have a tested fix.

Comment 71 Juan Luis de Sousa-Valadas 2020-06-15 11:49:24 UTC
Hi Huiran,
These are the steps to reproduce.

Deploy a cluster with the *multitenant* plugin.

For both RHCOS and RHEL, have four nodes which look like:
NAME     HOST     HOST IP          SUBNET          EGRESS CIDRS   EGRESS IPS
node-0   node-0   136.144.52.250   10.130.2.0/23                  [136.144.52.242]
node-1   node-1   136.144.52.230   10.131.2.0/23                  
node-2   node-2   136.144.52.243   10.129.2.0/23                  
node-3   node-3   136.144.52.241   10.128.2.0/23                  

For this scenario node-0 has the egressIP, node-1 has the externalIP, node-2 has the client, and node-3 has the server.

$ oc get svc
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP      PORT(S)             AGE
hello-service1    ClusterIP   172.30.77.123   136.144.52.230   27018/TCP           8h

The externalIP is *the host IP in the hostsubnet*. You could use a different externalIP provided that said IP belongs to node-1 and that the rest of the nodes can reach it.

The client runs on node-2 in the test-client project, which uses the egressIP, and the server runs on node-3 in the test-server project, which may or may not have an egressIP. Client and server must *not* be joined (as in "oc adm pod-network join-projects").

The test *must* be run on both all-RHCOS and all-RHEL nodes. If we do all RHEL and all RHCOS we cover all possible scenarios. The test cannot be performed on a single host because then we may hit an edge case where it happens to work; if it works across 4 different hosts, that is the worst case and should cover all the different scenarios.
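The client-side check in the steps above can be sketched as follows; the pod name "client" and the `build_probe` helper are our assumptions for illustration, while the externalIP and port come from the `oc get svc` output above:

```shell
# Sketch of the client-to-externalIP probe from the reproduction steps.
# build_probe prints the curl command to run inside the client pod;
# the service externalIP (136.144.52.230) and port (27018) are taken
# from the scenario description.
build_probe() {
  local external_ip="$1" port="$2"
  printf 'curl -s --connect-timeout 5 http://%s:%s/' "$external_ip" "$port"
}
# Run from the client pod on node-2, e.g. (pod name is a placeholder):
#   oc -n test-client rsh client $(build_probe 136.144.52.230 27018)
```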

I have verified this manually, but it needs to be validated again for both RHCOS and RHEL by QA as soon as the PR merges.

Thanks again Huiran for the multiple clusters.

Comment 76 Dan Winship 2020-06-16 13:18:10 UTC
*** Bug 1717487 has been marked as a duplicate of this bug. ***

Comment 91 Juan Luis de Sousa-Valadas 2020-09-22 15:00:32 UTC
Hi Huiran,
I believe this is related to RHCOS vs RHEL 7 behavior.
Could you deploy a cluster with 4 RHEL nodes: one with the egressIP, one with the externalIP, one with the client pod and one with the server pod?
I think this may work on RHEL 7 because it works in my local 3.11 fork, so I suspect it is RHEL-specific; I can't think of any other difference in SDN between 3.11 and 4.6 that could cause this.

Comment 93 Juan Luis de Sousa-Valadas 2020-09-23 09:48:53 UTC
Hi Huiran,
Because this works on RHEL 7 but doesn't work on RHEL 8, and the customer with this use case is using OCP 3.11, which is RHEL 7 only, I think we should mark this bug as VERIFIED so that we can backport it all the way back to OCP 3.11, and file a new low-priority bug specifically for the RHCOS case.

Do you think QA could agree with that?

Comment 96 Ben Bennett 2020-09-23 13:04:25 UTC
*** Bug 1881882 has been marked as a duplicate of this bug. ***

Comment 99 errata-xmlrpc 2020-10-27 15:54:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196