Can't they just connect to the service clusterIP instead of the externalIP?
Huiran, I remember we verified this kind of issue before. Could you also help check whether it can still be reproduced in 3.11.146-1?
Actually, do this instead:
# iptables -N OPENSHIFT-PREROUTING -t nat
# iptables -I PREROUTING -t nat -j OPENSHIFT-PREROUTING
# iptables -A OPENSHIFT-PREROUTING -t nat -m mark --mark 0x1/0x1 -j RETURN
# iptables -A OPENSHIFT-PREROUTING -t nat -m mark '!' --mark 0x0 -j ACCEPT
If any of the above commands produce errors, undo the changes with:
# iptables -D PREROUTING -t nat -j OPENSHIFT-PREROUTING
# iptables -D OPENSHIFT-PREROUTING -t nat -m mark --mark 0x1/0x1 -j RETURN
# iptables -D OPENSHIFT-PREROUTING -t nat -m mark '!' --mark 0x0 -j ACCEPT
# iptables -X OPENSHIFT-PREROUTING -t nat
Only necessary on the nodes with an egress IP, but may be done on every node.
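To confirm the workaround is in place on a node (just a quick check, nothing beyond the rules added above), list the new chain and the PREROUTING jump:
# iptables -t nat -L OPENSHIFT-PREROUTING -n -v
# iptables -t nat -L PREROUTING -n | grep OPENSHIFT-PREROUTING
You should see the two mark rules in OPENSHIFT-PREROUTING and the jump to it at the top of PREROUTING.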
Updating the release to track the development branch. Juan is actively working on the issue, and we will work out the backport as soon as we have a tested fix.
These are the steps to reproduce.
Deploy a cluster with the *multitenant* plugin.
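On 4.x, one way to confirm the plugin (a sketch; adjust to however you normally check it) is to read the cluster network operator config:
$ oc get network.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.openshiftSDNConfig.mode}'
On a cluster deployed with the multitenant plugin this should print Multitenant.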
For both RHCOS and RHEL, use four nodes whose hostsubnets look like this:
NAME     HOST     HOST IP          SUBNET          EGRESS CIDRS   EGRESS IPS
node-0   node-0   22.214.171.124   10.130.2.0/23                  [126.96.36.199]
node-1   node-1   188.8.131.52     10.131.2.0/23
node-2   node-2   184.108.40.206   10.129.2.0/23
node-3   node-3   220.127.116.11   10.128.2.0/23
For this scenario node-0 has the egressIP, node-1 has the externalIP, node-2 hosts the client and node-3 hosts the server.
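As a sketch of how that layout can be set up (assuming manually assigned egress IPs; the IP is the one from the hostsubnet listing above, and test-client is the client project described below):
$ oc patch hostsubnet node-0 --type=merge -p '{"egressIPs": ["126.96.36.199"]}'
$ oc patch netnamespace test-client --type=merge -p '{"egressIPs": ["126.96.36.199"]}'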
$ oc get svc
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP     PORT(S)     AGE
hello-service1   ClusterIP   172.30.77.123   18.104.22.168   27018/TCP   8h
The externalIP is *the host IP from the hostsubnet*. You could use a different external IP, provided that said IP belongs to node-1 and that the rest of the nodes can reach it.
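For reference, a rough sketch of how such a service could be defined (the selector is a placeholder; the name, port and externalIP are taken from the output above, and test-server is the server project mentioned below; on 4.x the cluster's externalIP policy also has to allow that address):
$ oc -n test-server apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: hello-service1
spec:
  selector:
    app: hello-server   # placeholder; match the server pod's labels
  ports:
  - port: 27018
    protocol: TCP
  externalIPs:
  - 18.104.22.168
EOF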
The client runs on node-2 in the test-client project, which uses the egressIP, and the server runs on node-3 in the test-server project, which may or may not have an egressIP. Client and server must *not* be joined (as in "oc adm pod-network join-projects").
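The check itself is then just the client pod connecting to the externalIP and port of the service above; roughly (the pod name is a placeholder, and curl assumes the server speaks HTTP; any TCP client will do):
$ oc -n test-client rsh <client-pod> curl --connect-timeout 5 http://18.104.22.168:27018
Without the fix (or the iptables workaround above) this is expected to time out when the client project has an egressIP; with it, the connection should succeed.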
The test *must* be run on both all-RHCOS and all-RHEL nodes. If we do all RHEL and all RHCOS we cover all possible scenarios. The test cannot be performed on a single host, because then we may hit an edge case where it happens to work; if it works across 4 different hosts, that is the worst case and covers all the different scenarios.
I have verified this manually, but it needs to be validated again by QA for both RHCOS and RHEL as soon as the PR merges.
Thanks again Huiran for the multiple clusters.
*** Bug 1717487 has been marked as a duplicate of this bug. ***
I believe this is related to RHCOS vs RHEL 7 behavior.
Could you deploy a cluster with 4 RHEL nodes: one with the egressIP, one with the externalIP, one with the client pod and one with the server pod?
I think this may work on RHEL 7 because it works in my local 3.11 fork, so I think the difference is RHEL itself; I can't think of any other difference in the SDN between 3.11 and 4.6 that could cause this.
Because this works on RHEL 7, doesn't work on RHEL 8, and the customer with this use case is on OCP 3.11 (which is RHEL 7 only), I think we should mark this bug as VERIFIED so that we can backport it all the way back to OCP 3.11, and file a new low-priority bug specifically for the RHCOS case.
Do you think QA could agree with that?
*** Bug 1881882 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.