Description of problem:
After removing one node from the cluster, the existing hostnetwork pods cannot access the kubernetes service.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Set up the cluster with IPI on AWS with OVN
2. Check the node info
3. Check that a hostnetwork pod can access the kubernetes service
4. Remove one of the worker nodes
5. Recheck step 3

Actual results:

step 2:
oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-150-123.us-east-2.compute.internal   Ready    worker   91m    v1.19.0+e465e66
ip-10-0-151-131.us-east-2.compute.internal   Ready    master   100m   v1.19.0+e465e66
ip-10-0-161-214.us-east-2.compute.internal   Ready    master   101m   v1.19.0+e465e66
ip-10-0-179-144.us-east-2.compute.internal   Ready    worker   91m    v1.19.0+e465e66
ip-10-0-195-82.us-east-2.compute.internal    Ready    worker   91m    v1.19.0+e465e66
ip-10-0-220-58.us-east-2.compute.internal    Ready    master   101m   v1.19.0+e465e66

step 3: the hostnetwork pods can access the kubernetes service
for i in $(oc get pod --no-headers -n openshift-machine-config-operator -l k8s-app=machine-config-daemon | awk '{print $1 }') ; do oc exec -n openshift-machine-config-operator $i -- curl -I https://172.30.0.1:443 -k 2>/dev/null ; done
HTTP/2 403
audit-id: d44644d4-2cf7-48a2-b954-004588c3d3d9
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 38ebd12e-5177-4918-a858-ccd20be797ba
x-kubernetes-pf-prioritylevel-uid: 0824645e-3120-4d45-8461-12ee23461ec6
content-length: 234
date: Tue, 29 Sep 2020 11:13:01 GMT

HTTP/2 403
audit-id: 6e554806-bf5f-4839-817d-5edceb7acf2a
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 38ebd12e-5177-4918-a858-ccd20be797ba
x-kubernetes-pf-prioritylevel-uid: 0824645e-3120-4d45-8461-12ee23461ec6
content-length: 234
date: Tue, 29 Sep 2020 11:13:04 GMT

HTTP/2 403
audit-id: c2af1ed5-293d-48b3-ba69-35fed8608056
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 38ebd12e-5177-4918-a858-ccd20be797ba
x-kubernetes-pf-prioritylevel-uid: 0824645e-3120-4d45-8461-12ee23461ec6
content-length: 234
date: Tue, 29 Sep 2020 11:13:08 GMT

HTTP/2 403
audit-id: 22c15cf0-b682-49c2-81d0-414346a75992
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 38ebd12e-5177-4918-a858-ccd20be797ba
x-kubernetes-pf-prioritylevel-uid: 0824645e-3120-4d45-8461-12ee23461ec6
content-length: 234
date: Tue, 29 Sep 2020 11:13:12 GMT

HTTP/2 403
audit-id: e4e54d67-05c0-4e85-a27c-121a6622bf11
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 38ebd12e-5177-4918-a858-ccd20be797ba
x-kubernetes-pf-prioritylevel-uid: 0824645e-3120-4d45-8461-12ee23461ec6
content-length: 234
date: Tue, 29 Sep 2020 11:13:16 GMT

HTTP/2 403
audit-id: bd0b19b5-9579-4b5c-ac72-10b85451b15c
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 38ebd12e-5177-4918-a858-ccd20be797ba
x-kubernetes-pf-prioritylevel-uid: 0824645e-3120-4d45-8461-12ee23461ec6
content-length: 234
date: Tue, 29 Sep 2020 11:13:18 GMT

step 4:
oc delete node ip-10-0-150-123.us-east-2.compute.internal

step 5:
for i in $(oc get pod --no-headers -n openshift-machine-config-operator -l k8s-app=machine-config-daemon | awk '{print $1 }') ; do oc exec -n openshift-machine-config-operator $i -- curl -I --connect-timeout 10 https://172.30.0.1:443 -k ; done
Defaulting container name to machine-config-daemon.
Use 'oc describe pod/machine-config-daemon-2ct9w -n openshift-machine-config-operator' to see all of the containers in this pod.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
HTTP/2 403
audit-id: 972d9c82-9fe7-4074-9962-a9e0c9bd4dba
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 38ebd12e-5177-4918-a858-ccd20be797ba
x-kubernetes-pf-prioritylevel-uid: 0824645e-3120-4d45-8461-12ee23461ec6
content-length: 234
date: Tue, 29 Sep 2020 12:39:29 GMT
  0   234    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

Defaulting container name to machine-config-daemon.
Use 'oc describe pod/machine-config-daemon-dvsn4 -n openshift-machine-config-operator' to see all of the containers in this pod.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
curl: (28) Connection timed out after 10000 milliseconds
command terminated with exit code 28

Defaulting container name to machine-config-daemon.
Use 'oc describe pod/machine-config-daemon-f7sls -n openshift-machine-config-operator' to see all of the containers in this pod.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
curl: (28) Connection timed out after 10000 milliseconds
command terminated with exit code 28

Defaulting container name to machine-config-daemon.
Use 'oc describe pod/machine-config-daemon-g45zb -n openshift-machine-config-operator' to see all of the containers in this pod.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
curl: (28) Connection timed out after 10001 milliseconds
command terminated with exit code 28

Defaulting container name to machine-config-daemon.
Use 'oc describe pod/machine-config-daemon-qtwrr -n openshift-machine-config-operator' to see all of the containers in this pod.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
curl: (28) Connection timed out after 10000 milliseconds
command terminated with exit code 28

Expected results:
Hostnetwork pods can still access the kubernetes service after a node is removed.

Additional info:
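For convenience, the step 3 / step 5 probe can be wrapped into a small script (a sketch assembled from the commands in this report; the namespace, label, container name and service IP are the ones shown in the output above):

#!/bin/bash
# Probe the kubernetes service VIP (172.30.0.1:443) from every
# machine-config-daemon pod, i.e. from a host-network pod on each node.
for pod in $(oc get pod --no-headers -n openshift-machine-config-operator \
    -l k8s-app=machine-config-daemon | awk '{print $1}'); do
  echo "=== ${pod} ==="
  oc exec -n openshift-machine-config-operator -c machine-config-daemon "${pod}" -- \
    curl -sSI --connect-timeout 10 -k https://172.30.0.1:443 \
    || echo ">>> ${pod}: cannot reach the kubernetes service"
done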
Created attachment 1717532 [details] ovn logs
Seems to be sporadic:

aconstan@linux-3 ~ $ oc exec -ti -c ovnkube-node ovnkube-node-mkfx8 -n openshift-ovn-kubernetes -- bash
[root@ip-10-0-161-214 ~]# curl -k https://172.30.0.1:443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
}[root@ip-10-0-161-214 ~]# curl -k https://172.30.0.1:443
^C
[root@ip-10-0-161-214 ~]# curl -k https://172.30.0.1:443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
}[root@ip-10-0-161-214 ~]# curl -k https://172.30.0.1:443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
}[root@ip-10-0-161-214 ~]# curl -k https://172.30.0.1:443
^C
[root@ip-10-0-161-214 ~]# curl -k https://172.30.0.1:443
^C
[root@ip-10-0-161-214 ~]# curl -k https://172.30.0.1:443
^C

I will continue digging into it.
OVS logs are filled with:

2020-09-29T14:10:15.776Z|00738|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
2020-09-29T14:10:15.792Z|00739|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
2020-09-29T14:10:23.029Z|00740|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
2020-09-29T14:10:23.038Z|00741|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)

and

2020-09-29T14:11:12.479Z|00097|dpif(handler13)|WARN|system@ovs-system: execute ct(commit,zone=2,label=0/0x1),2 failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:81:00:01,dl_dst=a6:20:f7:9d:af:28,nw_src=10.129.0.2,nw_dst=10.0.220.58,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=47186,tp_dst=6443,tcp_flags=syn tcp_csum:c17e with metadata skb_priority(0),dp_hash(0x44012e95),skb_mark(0),ct_state(0x21),ct_zone(0x2),ct_tuple4(src=10.129.0.2,dst=10.0.220.58,proto=6,tp_src=47186,tp_dst=6443),in_port(2) mtu 0
2020-09-29T14:11:50.611Z|00098|dpif(handler13)|WARN|system@ovs-system: execute ct(commit,zone=2,label=0/0x1),2 failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:81:00:01,dl_dst=a6:20:f7:9d:af:28,nw_src=10.129.0.2,nw_dst=10.0.220.58,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=49270,tp_dst=6443,tcp_flags=syn tcp_csum:eab8 with metadata skb_priority(0),dp_hash(0xff707b1b),skb_mark(0),ct_state(0x21),ct_zone(0x2),ct_tuple4(src=10.129.0.2,dst=10.0.220.58,proto=6,tp_src=49270,tp_dst=6443),in_port(2) mtu 0
2020-09-29T14:12:26.049Z|00094|dpif(handler10)|WARN|system@ovs-system: execute ct(commit,zone=2,label=0/0x1),2 failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:81:00:01,dl_dst=a6:20:f7:9d:af:28,nw_src=10.129.0.2,nw_dst=10.0.151.131,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=51246,tp_dst=6443,tcp_flags=syn tcp_csum:fee6 with metadata skb_priority(0),dp_hash(0xf38debfc),skb_mark(0),ct_state(0x21),ct_zone(0x2),ct_tuple4(src=10.129.0.2,dst=10.0.151.131,proto=6,tp_src=51246,tp_dst=6443),in_port(2) mtu 0
2020-09-29T14:12:44.141Z|00099|dpif(handler13)|WARN|system@ovs-system: execute ct(commit,zone=2,label=0/0x1),2 failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:81:00:01,dl_dst=a6:20:f7:9d:af:28,nw_src=10.129.0.2,nw_dst=10.0.151.131,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=52228,tp_dst=6443,tcp_flags=syn tcp_csum:31e8 with metadata skb_priority(0),dp_hash(0x2771c8d9),skb_mark(0),ct_state(0x21),ct_zone(0x2),ct_tuple4(src=10.129.0.2,dst=10.0.151.131,proto=6,tp_src=52228,tp_dst=6443),in_port(2) mtu 0
2020-09-29T14:13:24.870Z|00095|dpif(handler10)|WARN|system@ovs-system: execute ct(commit,zone=2,label=0/0x1),2 failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:81:00:01,dl_dst=a6:20:f7:9d:af:28,nw_src=10.129.0.2,nw_dst=10.0.220.58,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=54470,tp_dst=6443,tcp_flags=syn tcp_csum:4d06
Also: when in one terminal I execute:

$ curl -k https://172.30.0.1:443

I see the following appear in the other:

2020-09-29T15:04:13.392Z|00128|dpif(handler13)|WARN|system@ovs-system: execute ct(commit,zone=2,label=0/0x1),2 failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:81:00:01,dl_dst=a6:20:f7:9d:af:28,nw_src=10.129.0.2,nw_dst=10.0.220.58,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=53416,tp_dst=6443,tcp_flags=syn tcp_csum:d27f with metadata skb_priority(0),dp_hash(0xe0751575),skb_mark(0),ct_state(0x21),ct_zone(0x2),ct_tuple4(src=10.129.0.2,dst=10.0.220.58,proto=6,tp_src=53416,tp_dst=6443),in_port(2) mtu 0
I.e. the curl command is executed in a host network pod:

$ oc exec -ti -c ovnkube-node ovnkube-node-mkfx8 -n openshift-ovn-kubernetes -- bash
$ curl -k https://172.30.0.1:443

and in the other terminal we tail the ovs instance running on the same node:

$ oc logs -f ovs-node-g9n8f -n openshift-ovn-kubernetes
The suspicion is that this might be linked to: https://bugzilla.redhat.com/show_bug.cgi?id=1877128 I won't close this as a dupe until I get confirmation from OVN/OVS people. But it does not seem to be linked to the deleted node, more like a coincidence.
The kubernetes API service has three end-points on this cluster:

subsets:
- addresses:
  - ip: 10.0.151.131
  - ip: 10.0.161.214
  - ip: 10.0.220.58
  ports:
  - name: https
    port: 6443
    protocol: TCP

#comment 6 seems to also apply to the 10.0.151.131 end-point, as I've seen this as well:

2020-09-29T15:30:49.579Z|00145|dpif(handler13)|WARN|system@ovs-system: execute ct(commit,zone=2,label=0/0x1),2 failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:81:00:01,dl_dst=a6:20:f7:9d:af:28,nw_src=10.129.0.2,nw_dst=10.0.151.131,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=56836,tp_dst=6443,tcp_flags=syn tcp_csum:e47e with metadata skb_priority(0),dp_hash(0xa6dc50af),skb_mark(0),ct_state(0x21),ct_zone(0x2),ct_tuple4(src=10.129.0.2,dst=10.0.151.131,proto=6,tp_src=56836,tp_dst=6443),in_port(2) mtu 0

I thus suspect that for the transactions that do go through (seen in #comment 4) the end-point hit is 10.0.161.214:

$ oc logs ovs-node-g9n8f -n openshift-ovn-kubernetes | grep "Invalid argument) on packet " | grep 10.0.151.131 | wc -l
142
aconstan@linux-3 ~ $ oc logs ovs-node-g9n8f -n openshift-ovn-kubernetes | grep "Invalid argument) on packet " | grep 10.0.220.58 | wc -l
125
aconstan@linux-3 ~ $ oc logs ovs-node-g9n8f -n openshift-ovn-kubernetes | grep "Invalid argument) on packet " | grep 10.0.161.214 | wc -l
3
aconstan@linux-3 ~ $ oc logs ovs-node-g9n8f -n openshift-ovn-kubernetes | grep "Invalid argument) on packet " | grep 10.0.161.214
2020-09-29T09:37:08.870Z|00001|dpif(handler11)|WARN|system@ovs-system: execute ct(commit,zone=48,label=0x2/0x2,nat(dst=10.0.161.214:6443)),recirc(0x155) failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:81:00:1a,dl_dst=0a:58:0a:81:00:01,nw_src=10.129.0.26,nw_dst=172.30.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=35802,tp_dst=443,tcp_flags=syn tcp_csum:66e9
2020-09-29T09:38:44.186Z|00002|dpif(handler10)|WARN|system@ovs-system: execute ct(commit,zone=48,label=0x2/0x2,nat(dst=10.0.161.214:6443)),recirc(0x155) failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:81:00:1a,dl_dst=0a:58:0a:81:00:01,nw_src=10.129.0.26,nw_dst=172.30.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=48616,tp_dst=443,tcp_flags=syn tcp_csum:d41f
2020-09-29T12:13:30.961Z|00026|dpif(handler10)|WARN|system@ovs-system: execute ct(commit,zone=35,label=0x2/0x2,nat(dst=10.0.161.214:6443)),recirc(0x67) failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:81:00:46,dl_dst=0a:58:0a:81:00:01,nw_src=10.129.0.70,nw_dst=172.30.92.161,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=55534,tp_dst=443,tcp_flags=syn tcp_csum:208a

$ date
Tue Sep 29 15:35:52 UTC 2020

I.e. the ct warnings did happen a couple of hours ago on that end-point as well, but have since stopped.
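For reference, the end-point list quoted above comes from the kubernetes Endpoints object in the default namespace; it can be dumped with standard commands, e.g.:

$ oc get endpoints kubernetes -n default -o yaml
$ oc get endpoints kubernetes -n default -o jsonpath='{.subsets[*].addresses[*].ip}'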
OK, I am updating here with the following from node ip-10-0-161-214.us-east-2.compute.internal:

- OVN NB DB
- OVN SB DB
- Open vSwitch conf.db
- dump of flows on br-int
- dump of flows on br-ex
- dump of flows on br-local
- output of `ip a`

Again, if you exec into a host network pod, in this case:

$ oc exec -ti -c ovnkube-node ovnkube-node-mkfx8 -n openshift-ovn-kubernetes -- bash

and perform a curl to the kube-api service:

$ curl -k https://172.30.0.1:443

it will not successfully connect; at the same time, on the same node, you can see the following being logged in the ovs pod as a result of that curl command:

==> /host/var/log/openvswitch/ovs-vswitchd.log <==
2020-09-30T08:41:03.453Z|01319|connmgr|INFO|br-int<->unix#0: 8 flow_mods in the 2 s starting 56 s ago (8 deletes)
2020-09-30T08:41:03.564Z|00616|dpif(handler10)|WARN|system@ovs-system: execute ct(commit,zone=2,label=0/0x1),2 failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:81:00:01,dl_dst=a6:20:f7:9d:af:28,nw_src=10.129.0.2,nw_dst=10.0.151.131,nw_tos=0,nw_ecn=0,nw_ttl=63,tp_src=41750,tp_dst=6443,tcp_flags=syn tcp_csum:a8a9 with metadata skb_priority(0),dp_hash(0x39ca402f),skb_mark(0),ct_state(0x21),ct_zone(0x2),ct_tuple4(src=10.129.0.2,dst=10.0.151.131,proto=6,tp_src=41750,tp_dst=6443),in_port(2) mtu 0

I think all networking information should be attached to this ticket now.
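For anyone re-collecting this data, the dumps listed above can be gathered with roughly the following commands (a sketch; the exact invocations used for the attachments may differ, and ovn-nbctl/ovn-sbctl must be run from a pod or node where the NB/SB databases are reachable):

$ ovn-nbctl show                              # OVN NB DB
$ ovn-sbctl show                              # OVN SB DB
$ ovs-vsctl show                              # Open vSwitch conf.db summary
$ ovs-ofctl -O OpenFlow13 dump-flows br-int   # flows on br-int
$ ovs-ofctl dump-flows br-ex                  # flows on br-ex
$ ovs-ofctl dump-flows br-local               # flows on br-local
$ ip a                                        # addresses/interfaces on the node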
Created attachment 1717772 [details] ip a
Created attachment 1717773 [details] nb db
Created attachment 1717774 [details] sb db
Created attachment 1717775 [details] conf db
Created attachment 1717776 [details] br-int flows
Created attachment 1717777 [details] br-ex flows
Created attachment 1717778 [details] br-local flows
Created attachment 1717783 [details] ofproto/trace for the sake of completion
So, update again: none of the following "work-arounds" resolve the issue:

- Deleting the VM from the AWS console
- Forcing re-compute of the openflows with ovn-controller
- Deleting the kubernetes API service (and thus having it re-created by the service controller)
- Restarting ovs-vswitchd.service on the node you're trying to connect from
- Deleting ovnkube-node on the node you're trying to connect from

fun times.....
Also adding to the list in #comment 19:

- Rebooting the cluster node you're trying to connect from does not help
I was able to reproduce on KIND. The issue is that when we delete a node, we are accidentally deleting the lr policies for other nodes that allow host-backed services to be routed correctly (1003 priority). In this example I deleted ovn-worker2:

2020-09-30T19:09:01.440Z|00158|nbctl|INFO|Running command run -- lr-policy-del ovn_cluster_router 1005 "ip4.src == 10.244.0.2 && ip4.dst != 10.244.0.0/16 /* inter-ovn-worker2 */"
2020-09-30T19:09:01.444Z|00159|nbctl|INFO|Running command run -- lr-policy-del ovn_cluster_router 1003 "ip4.src == 10.244.1.2 && ip4.dst != 10.244.0.0/16 /* inter-ovn-control-plane */"
2020-09-30T19:09:01.448Z|00160|nbctl|INFO|Running command run -- lr-policy-del ovn_cluster_router 1004 "inport == \"rtos-ovn-worker2\" && ip4.dst == 172.20.0.2 /* ovn-worker2 */"
2020-09-30T19:09:01.452Z|00161|nbctl|INFO|Running command run -- lr-policy-del ovn_cluster_router 1005 "ip4.src == 10.244.0.2 && ip4.dst == 172.20.0.2 /* ovn-worker2 */"
2020-09-30T19:09:01.456Z|00162|nbctl|INFO|Running command run -- lr-policy-del ovn_cluster_router 1003 "ip4.src == 10.244.2.2 && ip4.dst != 10.244.0.0/16 /* inter-ovn-worker */"
2020-09-30T19:09:01.461Z|00163|nbctl|INFO|Running command run -- lr-policy-del ovn_cluster_router 101 "ip4.src == 10.244.0.0/16 && ip4.dst == 172.20.0.2/32"

The problem is in ovn-kubernetes. I'll post a fix shortly.
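On an affected cluster this is easy to confirm against the NB DB: after deleting a node, only that node's policies should disappear, but the surviving nodes' 1003 entries are gone as well. A quick check (run wherever ovn-nbctl can reach the NB DB, e.g. in an ovnkube-master pod):

$ ovn-nbctl lr-policy-list ovn_cluster_router | grep -w 1003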
Ahh! Great find, indeed looking at those routing policies in the cluster in #comment 1:

Routing Policies
      1005 ip4.src == 10.128.0.2 && ip4.dst == 10.0.151.131 /* ip-10-0-151-131.us-east-2.compute.internal */ reroute 169.254.0.1
      1005 ip4.src == 10.128.2.2 && ip4.dst == 10.0.195.82 /* ip-10-0-195-82.us-east-2.compute.internal */ reroute 169.254.0.1
      1005 ip4.src == 10.129.0.2 && ip4.dst == 10.0.161.214 /* ip-10-0-161-214.us-east-2.compute.internal */ reroute 169.254.0.1
      1005 ip4.src == 10.129.2.2 && ip4.dst == 10.0.179.144 /* ip-10-0-179-144.us-east-2.compute.internal */ reroute 169.254.0.1
      1005 ip4.src == 10.130.0.2 && ip4.dst == 10.0.220.58 /* ip-10-0-220-58.us-east-2.compute.internal */ reroute 169.254.0.1
      1004 inport == "rtos-ip-10-0-151-131.us-east-2.compute.internal" && ip4.dst == 10.0.151.131 /* ip-10-0-151-131.us-east-2.compute.internal */ reroute 10.128.0.2
      1004 inport == "rtos-ip-10-0-161-214.us-east-2.compute.internal" && ip4.dst == 10.0.161.214 /* ip-10-0-161-214.us-east-2.compute.internal */ reroute 10.129.0.2
      1004 inport == "rtos-ip-10-0-179-144.us-east-2.compute.internal" && ip4.dst == 10.0.179.144 /* ip-10-0-179-144.us-east-2.compute.internal */ reroute 10.129.2.2
      1004 inport == "rtos-ip-10-0-195-82.us-east-2.compute.internal" && ip4.dst == 10.0.195.82 /* ip-10-0-195-82.us-east-2.compute.internal */ reroute 10.128.2.2
      1004 inport == "rtos-ip-10-0-220-58.us-east-2.compute.internal" && ip4.dst == 10.0.220.58 /* ip-10-0-220-58.us-east-2.compute.internal */ reroute 10.130.0.2
      1003 ip4.src == 10.131.0.2 && ip4.dst != 10.128.0.0/14 /* inter-ip-10-0-150-123.us-east-2.compute.internal */ reroute 169.254.0.1
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.0.151.131/32 allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.0.161.214/32 allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.0.179.144/32 allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.0.195.82/32 allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.0.220.58/32 allow
       101 ip4.src == 10.128.0.0/14 && ip4.dst == 10.128.0.0/14 allow

I only looked at the 1005 rules and overlooked the 1003 ones....
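For completeness, the 1003 entry that went missing for the node I've been testing from (ip-10-0-161-214, whose mp0 address is 10.129.0.2 per the 1005 rule above) would look like this if re-added by hand; this is only an illustration of what the fix needs to restore, not a validated workaround:

$ ovn-nbctl lr-policy-add ovn_cluster_router 1003 \
    'ip4.src == 10.129.0.2 && ip4.dst != 10.128.0.0/14' reroute 169.254.0.1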
By the way, couldn't the 1005 rules be skipped altogether? They are a subset of 1003, if my eyes and logic aren't deceiving me at this hour...
They do overlap, but 1005 is for shared gateway mode (which only needs to use the DGP for sending traffic back to its own host), while local gateway mode needs to send all external traffic from the mp0 port via the DGP. When we converge on shared gateway mode later, we will remove this.
Tested and verified in 4.6.0-0.nightly-2020-10-05-234751: the hostnetwork pod can still access the kubernetes service after removing one node.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196