Description of problem:
Create a pod, wait until it is running, and then create a service: the iptables rules are not updated with the correct endpoints. However, if the service is created first and then the pod, the service works well and the iptables rules are correct.

Version-Release number of selected component (if applicable): v3.11.291

How reproducible: always

Steps to Reproduce:
1. oc new-project z6
2. Create the test pod and wait until it is in the Running state:
   oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/caddy-docker.json -n z6
3. Create the test service:
   oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/unsecure/service_unsecure.json -n z6
4. Check the sdn logs. They show that it takes about 5 minutes before the iptables rules change and the service refers to the endpoint.

Actual results:

# oc logs sdn-88pd9 -n openshift-sdn | grep z6
I0922 07:30:40.939972 14407 networkpolicy.go:651] Watch ADDED event for Namespace "z6"
I0922 07:30:40.975492 14407 proxy.go:159] Watch ADDED event for NetNamespace "z6"
I0922 07:30:40.975537 14407 vnids.go:203] Watch ADDED event for NetNamespace "z6"
I0922 07:30:40.975561 14407 vnids.go:148] Associate netid 12435952 to namespace "z6" with mcEnabled false
I0922 07:30:40.975607 14407 egressip.go:286] Watch ADDED event for NetNamespace "z6"
I0922 07:30:40.983582 14407 networkpolicy.go:651] Watch MODIFIED event for Namespace "z6"
I0922 07:30:41.034631 14407 networkpolicy.go:651] Watch MODIFIED event for Namespace "z6"
I0922 07:30:47.744434 14407 networkpolicy.go:602] Watch ADDED event for Pod "z6/caddy-docker"
I0922 07:30:47.744466 14407 networkpolicy.go:609] PodIP is not set for pod "z6/caddy-docker"; ignoring
I0922 07:30:47.751988 14407 networkpolicy.go:602] Watch MODIFIED event for Pod "z6/caddy-docker"
I0922 07:30:47.752071 14407 networkpolicy.go:609] PodIP is not set for pod "z6/caddy-docker"; ignoring
I0922 07:30:47.767756 14407 networkpolicy.go:602] Watch MODIFIED event for Pod "z6/caddy-docker"
I0922 07:30:47.767797 14407 networkpolicy.go:609] PodIP is not set for pod "z6/caddy-docker"; ignoring
I0922 07:30:51.099913 14407 networkpolicy.go:602] Watch MODIFIED event for Pod "z6/caddy-docker"
I0922 07:30:57.108772 14407 proxy.go:353] sdn proxy: add svc z6/service-unsecure: &{{ } {service-unsecure z6 /api/v1/namespaces/z6/services/service-unsecure 8dac6879-fca5-11ea-a740-fa163ef5f512 273133 0 2020-09-22 07:30:56 +0000 UTC <nil> <nil> map[name:service-unsecure] map[] [] nil [] } {ClusterIP [{http TCP 27017 {0 8080 } 0}] map[name:caddy-docker] 172.30.140.77 [] None <nil> [] 0 false} {{[]}}}
I0922 07:30:57.109082 14407 service.go:319] Adding new service port "z6/service-unsecure:http" at 172.30.140.77:27017/TCP
I0922 07:30:57.120146 14407 roundrobin.go:276] LoadBalancerRR: Setting endpoints for z6/service-unsecure:http to [10.130.0.47:8080]
I0922 07:30:57.120218 14407 roundrobin.go:100] LoadBalancerRR service "z6/service-unsecure:http" did not exist, created
I0922 07:30:57.120256 14407 endpoints.go:234] Setting endpoints for "z6/service-unsecure:http" to [10.130.0.47:8080]
-A KUBE-SERVICES -m comment --comment "z6/service-unsecure:http has no endpoints" -m tcp -p tcp -d 172.30.140.77/32 --dport 27017 -j REJECT
-A KUBE-SERVICES -m comment --comment "z6/service-unsecure:http cluster IP" -m tcp -p tcp -d 172.30.140.77/32 --dport 27017 ! -s 10.128.0.0/14 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m comment --comment "z6/service-unsecure:http cluster IP" -m tcp -p tcp -d 172.30.140.77/32 --dport 27017 -j KUBE-SVC-DVQDYDX75OQJM2LY
-A KUBE-SVC-DVQDYDX75OQJM2LY -m comment --comment z6/service-unsecure:http -j KUBE-SEP-NV7TOEHGW6QAKSSA
-A KUBE-SEP-NV7TOEHGW6QAKSSA -m comment --comment z6/service-unsecure:http -s 10.130.0.47/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-NV7TOEHGW6QAKSSA -m comment --comment z6/service-unsecure:http -m tcp -p tcp -j DNAT --to-destination 10.130.0.47:8080
-A KUBE-SERVICES -m comment --comment "z6/service-unsecure:http cluster IP" -m tcp -p tcp -d 172.30.140.77/32 --dport 27017 ! -s 10.128.0.0/14 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m comment --comment "z6/service-unsecure:http cluster IP" -m tcp -p tcp -d 172.30.140.77/32 --dport 27017 -j KUBE-SVC-DVQDYDX75OQJM2LY
-A KUBE-SVC-DVQDYDX75OQJM2LY -m comment --comment z6/service-unsecure:http -j KUBE-SEP-NV7TOEHGW6QAKSSA
-A KUBE-SEP-NV7TOEHGW6QAKSSA -m comment --comment z6/service-unsecure:http -s 10.130.0.47/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-NV7TOEHGW6QAKSSA -m comment --comment z6/service-unsecure:http -m tcp -p tcp -j DNAT --to-destination 10.130.0.47:8080
I0922 07:35:50.878037 14407 endpoints.go:234] Setting endpoints for "z6/service-unsecure:http" to [10.130.0.47:8080]
I0922 07:35:50.878047 14407 endpoints.go:234] Setting endpoints for "z6/service-unsecure:http" to [10.130.0.47:8080]
I0922 07:35:50.885953 14407 networkpolicy.go:602] Watch MODIFIED event for Pod "z6/caddy-docker"
-A KUBE-SERVICES -m comment --comment "z6/service-unsecure:http cluster IP" -m tcp -p tcp -d 172.30.140.77/32 --dport 27017 ! -s 10.128.0.0/14 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m comment --comment "z6/service-unsecure:http cluster IP" -m tcp -p tcp -d 172.30.140.77/32 --dport 27017 -j KUBE-SVC-DVQDYDX75OQJM2LY
-A KUBE-SVC-DVQDYDX75OQJM2LY -m comment --comment z6/service-unsecure:http -j KUBE-SEP-NV7TOEHGW6QAKSSA
-A KUBE-SEP-NV7TOEHGW6QAKSSA -m comment --comment z6/service-unsecure:http -s 10.130.0.47/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-NV7TOEHGW6QAKSSA -m comment --comment z6/service-unsecure:http -m tcp -p tcp -j DNAT --to-destination 10.130.0.47:8080

*****************************************************

When I create the service first and then the pod (in project z7), it works well. See the logs again:

I0922 07:43:17.265658 14407 networkpolicy.go:651] Watch ADDED event for Namespace "z7"
I0922 07:43:17.292908 14407 proxy.go:159] Watch ADDED event for NetNamespace "z7"
I0922 07:43:17.292969 14407 egressip.go:286] Watch ADDED event for NetNamespace "z7"
I0922 07:43:17.292978 14407 vnids.go:203] Watch ADDED event for NetNamespace "z7"
I0922 07:43:17.293044 14407 vnids.go:148] Associate netid 12557285 to namespace "z7" with mcEnabled false
I0922 07:43:17.296962 14407 networkpolicy.go:651] Watch MODIFIED event for Namespace "z7"
I0922 07:43:17.351758 14407 networkpolicy.go:651] Watch MODIFIED event for Namespace "z7"
I0922 07:43:23.386293 14407 proxy.go:353] sdn proxy: add svc z7/service-unsecure: &{{ } {service-unsecure z7 /api/v1/namespaces/z7/services/service-unsecure 4a7e09b2-fca7-11ea-a740-fa163ef5f512 275290 0 2020-09-22 07:43:22 +0000 UTC <nil> <nil> map[name:service-unsecure] map[] [] nil [] } {ClusterIP [{http TCP 27017 {0 8080 } 0}] map[name:caddy-docker] 172.30.235.215 [] None <nil> [] 0 false} {{[]}}}
I0922 07:43:23.386600 14407 service.go:319] Adding new service port "z7/service-unsecure:http" at 172.30.235.215:27017/TCP
-A KUBE-SERVICES -m comment --comment "z7/service-unsecure:http has no endpoints" -m tcp -p tcp -d 172.30.235.215/32 --dport 27017 -j REJECT
I0922 07:43:31.409551 14407 networkpolicy.go:602] Watch ADDED event for Pod "z7/caddy-docker"
I0922 07:43:31.409583 14407 networkpolicy.go:609] PodIP is not set for pod "z7/caddy-docker"; ignoring
I0922 07:43:31.411989 14407 networkpolicy.go:602] Watch MODIFIED event for Pod "z7/caddy-docker"
I0922 07:43:31.412028 14407 networkpolicy.go:609] PodIP is not set for pod "z7/caddy-docker"; ignoring
I0922 07:43:31.430706 14407 networkpolicy.go:602] Watch MODIFIED event for Pod "z7/caddy-docker"
I0922 07:43:31.430744 14407 networkpolicy.go:609] PodIP is not set for pod "z7/caddy-docker"; ignoring
I0922 07:43:34.390752 14407 networkpolicy.go:602] Watch MODIFIED event for Pod "z7/caddy-docker"
I0922 07:43:34.398893 14407 roundrobin.go:310] LoadBalancerRR: Setting endpoints for z7/service-unsecure:http to [10.130.0.48:8080]
I0922 07:43:34.398932 14407 roundrobin.go:100] LoadBalancerRR service "z7/service-unsecure:http" did not exist, created
I0922 07:43:34.398975 14407 endpoints.go:234] Setting endpoints for "z7/service-unsecure:http" to [10.130.0.48:8080]
-A KUBE-SERVICES -m comment --comment "z7/service-unsecure:http cluster IP" -m tcp -p tcp -d 172.30.235.215/32 --dport 27017 ! -s 10.128.0.0/14 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -m comment --comment "z7/service-unsecure:http cluster IP" -m tcp -p tcp -d 172.30.235.215/32 --dport 27017 -j KUBE-SVC-URQ4JHSYI2EPWQZQ
-A KUBE-SVC-URQ4JHSYI2EPWQZQ -m comment --comment z7/service-unsecure:http -j KUBE-SEP-GK57SKWV6HVWD2OM
-A KUBE-SEP-GK57SKWV6HVWD2OM -m comment --comment z7/service-unsecure:http -s 10.130.0.48/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-GK57SKWV6HVWD2OM -m comment --comment z7/service-unsecure:http -m tcp -p tcp -j DNAT --to-destination 10.130.0.48:8080

Expected results:

Additional info:
OK, so I was able to reproduce this on a quicklab cluster:

[quicklab@master-1 ~]$ oc version
oc v3.11.286
kubernetes v1.11.0+d4cacc0
openshift v3.11.286
kubernetes v1.11.0+d4cacc0
[quicklab@master-1 ~]$ oc new-project z6
[quicklab@master-1 ~]$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/caddy-docker.json -n z6
pod/caddy-docker created
[quicklab@master-1 ~]$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/unsecure/service_unsecure.json -n z6
service/service-unsecure created
[quicklab@master-1 ~]$ curl 172.30.94.83:27017
curl: (7) Failed connect to 172.30.94.83:27017; Connection refused
[quicklab@master-1 ~]$ oc describe svc service-unsecure
Name:              service-unsecure
Namespace:         z6
Labels:            name=service-unsecure
Annotations:       <none>
Selector:          name=caddy-docker
Type:              ClusterIP
IP:                172.30.94.83
Port:              http 27017/TCP
TargetPort:        8080/TCP
Endpoints:         10.128.2.4:8080
Session Affinity:  None
Events:            <none>
[quicklab@master-1 ~]$ curl 10.128.2.4:8080
Hello-OpenShift-1 http-8080
[quicklab@master-1 ~]$ curl 172.30.94.83:27017
curl: (7) Failed connect to 172.30.94.83:27017; Connection refused

[root@infra-2 origin]# iptables-save | grep z6
-A KUBE-SERVICES -d 172.30.94.83/32 -p tcp -m comment --comment "z6/service-unsecure:http has no endpoints" -m tcp --dport 27017 -j REJECT --reject-with icmp-port-unreachable

SDN logs:

I1022 19:59:40.554923 20011 proxier.go:369] userspace proxy: processing 0 service events
I1022 19:59:40.554958 20011 proxier.go:348] userspace syncProxyRules took 32.998776ms
I1022 19:59:46.696021 20011 proxy.go:353] sdn proxy: add svc z6/service-unsecure: &{{ } {service-unsecure z6 /api/v1/namespaces/z6/services/service-unsecure 2288fc19-14a1-11eb-8303-fa163e3ff84c 259062 0 2020-10-22 19:59:46 +0000 UTC <nil> <nil> map[name:service-unsecure] map[] [] nil [] } {ClusterIP [{http TCP 27017 {0 8080 } 0}] map[name:caddy-docker] 172.30.94.83 [] None <nil> [] 0 false} {{[]}}}
I1022 19:59:46.696894 20011 service.go:319] Adding new service port "z6/service-unsecure:http" at 172.30.94.83:27017/TCP
I1022 19:59:46.784983 20011 roundrobin.go:276] LoadBalancerRR: Setting endpoints for z6/service-unsecure:http to [10.128.2.4:8080]

After about ~8 minutes the correct rules were created:

[root@infra-2 origin]# iptables-save | grep z6
-A KUBE-SEP-S3L2DEEHCSAW76OY -s 10.128.2.4/32 -m comment --comment "z6/service-unsecure:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-S3L2DEEHCSAW76OY -p tcp -m comment --comment "z6/service-unsecure:http" -m tcp -j DNAT --to-destination 10.128.2.4:8080
-A KUBE-SERVICES ! -s 10.128.0.0/14 -d 172.30.94.83/32 -p tcp -m comment --comment "z6/service-unsecure:http cluster IP" -m tcp --dport 27017 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 172.30.94.83/32 -p tcp -m comment --comment "z6/service-unsecure:http cluster IP" -m tcp --dport 27017 -j KUBE-SVC-DVQDYDX75OQJM2LY
-A KUBE-SVC-DVQDYDX75OQJM2LY -m comment --comment "z6/service-unsecure:http" -j KUBE-SEP-S3L2DEEHCSAW76OY

Okay, so clearly the REJECT rule is added because kube-proxy is handling the case of a service with no endpoint/backing pod. Somewhere the signal for when an endpoint exists is fuzzy. However, this bug appears to be more of a race condition.
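To make the REJECT-vs-DNAT decision concrete, here is a hypothetical simplification in Go (not the real kube-proxy source; `jumpTarget` and `kubeServicesRule` are illustrative names): for each service port, the iptables proxier installs a REJECT rule when its endpoints map is empty, and otherwise a jump to a per-service KUBE-SVC chain that eventually DNATs to an endpoint. The bug is that the endpoints map can stay empty long after the endpoint event arrived.

```go
package main

import "fmt"

// jumpTarget sketches the per-service-port decision: no known
// endpoints -> REJECT; otherwise jump to the service chain.
func jumpTarget(endpoints []string) string {
	if len(endpoints) == 0 {
		return "-j REJECT"
	}
	// Real chain names are hashes (e.g. KUBE-SVC-DVQDYDX75OQJM2LY);
	// a placeholder is used here.
	return "-j KUBE-SVC-XXXXXXXXXXXXXXXX"
}

// kubeServicesRule renders a simplified KUBE-SERVICES rule for the
// service's cluster IP and port.
func kubeServicesRule(svc, clusterIP string, port int, endpoints []string) string {
	return fmt.Sprintf("-A KUBE-SERVICES -d %s/32 -p tcp -m comment --comment %q -m tcp --dport %d %s",
		clusterIP, svc, port, jumpTarget(endpoints))
}

func main() {
	// With a stale/empty endpoints map the service gets the REJECT
	// rule even though a backing pod exists -- the symptom above.
	fmt.Println(kubeServicesRule("z6/service-unsecure:http", "172.30.94.83", 27017, nil))
	fmt.Println(kubeServicesRule("z6/service-unsecure:http", "172.30.94.83", 27017,
		[]string{"10.128.2.4:8080"}))
}
```

This matches what the transcripts show: `oc describe svc` already lists the endpoint (the apiserver knows it), but this node's proxier rendered its rules from an endpoints map that was still empty.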
Because the second time I tried it, the correct rules were installed:

[quicklab@master-1 ~]$ oc delete project z6
project.project.openshift.io "z6" deleted
[quicklab@master-1 ~]$ oc new-project z6
[quicklab@master-1 ~]$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/caddy-docker.json -n z6
pod/caddy-docker created
[quicklab@master-1 ~]$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/unsecure/service_unsecure.json -n z6
service/service-unsecure created
[quicklab@master-1 ~]$ oc get svc
NAME               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
service-unsecure   ClusterIP   172.30.146.113   <none>        27017/TCP   44s
[quicklab@master-1 ~]$ curl 172.30.146.113:27017
Hello-OpenShift-1 http-8080

SDN logs:

I1022 19:47:04.432998 20011 proxy.go:353] sdn proxy: add svc z5/service-unsecure: &{{ } {service-unsecure z5 /api/v1/namespaces/z5/services/service-unsecure 5c2d8221-149f-11eb-8303-fa163e3ff84c 256937 0 2020-10-22 19:47:04 +0000 UTC <nil> <nil> map[name:service-unsecure] map[] [] nil [] } {ClusterIP [{http TCP 27017 {0 8080 } 0}] map[name:caddy-docker] 172.30.177.231 [] None <nil> [] 0 false} {{[]}}}
I1022 19:47:04.433508 20011 service.go:319] Adding new service port "z5/service-unsecure:http" at 172.30.177.231:27017/TCP
I1022 19:47:04.529429 20011 proxier.go:369] userspace proxy: processing 0 service events
I1022 19:47:04.529467 20011 proxier.go:348] userspace syncProxyRules took 21.21411ms
I1022 19:47:04.537124 20011 roundrobin.go:276] LoadBalancerRR: Setting endpoints for z5/service-unsecure:http to [10.128.2.3:8080]
I1022 19:47:04.537299 20011 roundrobin.go:240] Delete endpoint 10.128.2.3:8080 for service "z5/service-unsecure:http"

[root@infra-2 origin]# iptables-save | grep z6
-A KUBE-SEP-33HRY4FFZEYPMEW2 -s 10.128.2.5/32 -m comment --comment "z6/service-unsecure:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-33HRY4FFZEYPMEW2 -p tcp -m comment --comment "z6/service-unsecure:http" -m tcp -j DNAT --to-destination 10.128.2.5:8080
-A KUBE-SERVICES ! -s 10.128.0.0/14 -d 172.30.146.113/32 -p tcp -m comment --comment "z6/service-unsecure:http cluster IP" -m tcp --dport 27017 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 172.30.146.113/32 -p tcp -m comment --comment "z6/service-unsecure:http cluster IP" -m tcp --dport 27017 -j KUBE-SVC-DVQDYDX75OQJM2LY
-A KUBE-SVC-DVQDYDX75OQJM2LY -m comment --comment "z6/service-unsecure:http" -j KUBE-SEP-33HRY4FFZEYPMEW2

Looking at the logs, the endpoint is first set and then deleted in the working case, whereas in the failing scenario where the REJECT rule is installed, the "Delete endpoint" log line is not visible. Need to look into the code more to find out why this is happening.
(In reply to Surya Seetharaman from comment #4)

> After about ~8mins the correct rules were created.
>
> okay, so clearly the REJECT rule is added because kube-proxy is considering
> the case for the svc without an ep/backing pod. Somewhere the signal for
> when an endpoint exists is fuzzy.
>
> However this bug appears to be more of a race-condition. Because the second
> time I tried it, the correct rules were installed.

Not true, I take back my words. This is not a race condition but actually a timing issue. So, for instance, in some cases it takes 8 minutes for the correct rules to be installed, while in other cases the REJECT rule is converted to the correct rules within seconds. The REJECT rule is always added first, even if a pod for the service already exists. Once the endpoint is set, the correct rules should get saved, but this often doesn't happen. /o\

> So when looking at the logs, it is visible that the endpoint is first set
> and then deleted for the correct case whereas in the wrong scenario where
> the REJECT rule is installed, the "Delete endpoint" log line isn't visible.
> Need to look into the code more to find out why this is happening.

This is just the updateAffinityMap stuff that happens. Not related.
OK, so this is interesting: on some sdn pods the correct rules appear immediately, while on others it is still the REJECT rule:

[quicklab@master-1 ~]$ oc exec -ti -n openshift-sdn sdn-2cd8n /bin/bash
[root@master-2 origin]# iptables-save | grep z5
-A KUBE-SERVICES -d 172.30.71.115/32 -p tcp -m comment --comment "z5/service-unsecure:http has no endpoints" -m tcp --dport 27017 -j REJECT --reject-with icmp-port-unreachable
[root@master-2 origin]# curl 172.30.71.115:27017
curl: (7) Failed connect to 172.30.71.115:27017; Connection refused

[quicklab@master-1 ~]$ oc exec -ti -n openshift-sdn sdn-7nwhr /bin/bash
[root@master-0 origin]# iptables-save | grep z5
-A KUBE-SEP-HWZGWL7RJUA5PWOL -s 10.128.2.7/32 -m comment --comment "z5/service-unsecure:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-HWZGWL7RJUA5PWOL -p tcp -m comment --comment "z5/service-unsecure:http" -m tcp -j DNAT --to-destination 10.128.2.7:8080
-A KUBE-SERVICES ! -s 10.128.0.0/14 -d 172.30.71.115/32 -p tcp -m comment --comment "z5/service-unsecure:http cluster IP" -m tcp --dport 27017 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 172.30.71.115/32 -p tcp -m comment --comment "z5/service-unsecure:http cluster IP" -m tcp --dport 27017 -j KUBE-SVC-EVP4CM6LNBXKGCJM
-A KUBE-SVC-EVP4CM6LNBXKGCJM -m comment --comment "z5/service-unsecure:http" -j KUBE-SEP-HWZGWL7RJUA5PWOL
New updates:

Status: on some sdn/ovs pods, the REJECT rule gets added because the endpointsMap is not populated correctly. There are three categories of behaviour; the sequences of events are:

1) Type I: it takes approximately 8 minutes for the correct rule to appear; until then it is the REJECT rule.

I1023 15:46:33.656981 106483 proxy.go:353] sdn proxy: add svc z5/service-unsecure: &{{ } {service-unsecure z5 /api/v1/namespaces/z5/services/service-unsecure ed287840-1546-11eb-af2e-fa163e3ff84c 450857 0 2020-10-23 15:46:33 +0000 UTC <nil> <nil> map[name:service-unsecure] map[] [] nil [] } {ClusterIP [{http TCP 27017 {0 8080 } 0}] map[name:caddy-docker] 172.30.85.244 [] None <nil> [] 0 false} {{[]}}}
I1023 15:46:33.657998 106483 service.go:319] Adding new service port "z5/service-unsecure:http" at 172.30.85.244:27017/TCP
I1023 15:46:33.658252 106483 proxier.go:644] tssurya Syncing iptables rules
I1023 15:46:33.792704 106483 roundrobin.go:252] tssurya is inside buildPortsToEndpointsMap
I1023 15:46:33.792768 106483 roundrobin.go:270] tssurya is inside OnEndpointsAdd map[http:[{10.128.2.15 8080}]]
I1023 15:46:33.792978 106483 roundrobin.go:279] tssurya state is &{[] 0 { map[] 10800}} and exists is true
I1023 15:46:33.793105 106483 roundrobin.go:282] LoadBalancerRR: Setting endpoints for z5/service-unsecure:http to [10.128.2.15:8080]
I1023 15:46:33.793145 106483 roundrobin.go:243] Delete endpoint 10.128.2.15:8080 for service "z5/service-unsecure:http"
I1023 15:46:33.793178 106483 roundrobin.go:97] tssurya ttlSeconds 10800
I1023 15:46:33.793186 106483 roundrobin.go:106] tssurya returning from newServiceInternal &{[] 0 { map[] 10800}}
I1023 15:46:33.828067 106483 proxier.go:763] tssurya print endpointsMap: []
I1023 15:46:33.828162 106483 proxier.go:810] tssurya service z5/service-unsecure:http has no endpoints :(!

On the next run of the proxy sync, the rules get updated:

I1023 15:58:17.244061 106483 proxier.go:763] tssurya print endpointsMap: [10.128.2.15:8080]
I1023 15:58:17.244078 106483 proxier.go:790] tssurya service z5/service-unsecure:http has endpoints

For approximately 8 minutes, connections are refused.

2) Type II: it takes a few milliseconds for the rules to appear correctly:

I1023 15:46:33.655399 100474 proxy.go:353] sdn proxy: add svc z5/service-unsecure: &{{ } {service-unsecure z5 /api/v1/namespaces/z5/services/service-unsecure ed287840-1546-11eb-af2e-fa163e3ff84c 450857 0 2020-10-23 15:46:33 +0000 UTC <nil> <nil> map[name:service-unsecure] map[] [] nil [] } {ClusterIP [{http TCP 27017 {0 8080 } 0}] map[name:caddy-docker] 172.30.85.244 [] None <nil> [] 0 false} {{[]}}}
I1023 15:46:33.656185 100474 service.go:319] Adding new service port "z5/service-unsecure:http" at 172.30.85.244:27017/TCP
I1023 15:46:33.706806 100474 proxier.go:763] tssurya print endpointsMap: []
I1023 15:46:33.706900 100474 proxier.go:810] tssurya service z5/service-unsecure:http has no endpoints :(!
I1023 15:46:33.797681 100474 roundrobin.go:252] tssurya is inside buildPortsToEndpointsMap
I1023 15:46:33.798271 100474 roundrobin.go:270] tssurya is inside OnEndpointsAdd map[http:[{10.128.2.15 8080}]]
I1023 15:46:33.798611 100474 roundrobin.go:279] tssurya state is &{[] 0 { map[] 10800}} and exists is true
I1023 15:46:33.799141 100474 roundrobin.go:282] LoadBalancerRR: Setting endpoints for z5/service-unsecure:http to [10.128.2.15:8080]
I1023 15:46:33.799348 100474 roundrobin.go:243] Delete endpoint 10.128.2.15:8080 for service "z5/service-unsecure:http"
I1023 15:46:33.799506 100474 roundrobin.go:97] tssurya ttlSeconds 10800
I1023 15:46:33.799623 100474 roundrobin.go:106] tssurya returning from newServiceInternal &{[] 0 { map[] 10800}}
I1023 15:46:33.861359 100474 proxier.go:369] userspace proxy: processing 0 service events
I1023 15:46:33.861403 100474 proxier.go:348] userspace syncProxyRules took 138.139971ms
I1023 15:46:33.861520 100474 proxier.go:644] tssurya Syncing iptables rules
I1023 15:46:33.990342 100474 proxier.go:763] tssurya print endpointsMap: [10.128.2.15:8080]
I1023 15:46:33.990472 100474 proxier.go:790] tssurya service z5/service-unsecure:http has endpoints

So even if the REJECT rule is added first, it is immediately replaced by the correct rules.

3) Type III: the correct rules get added directly:

I1023 15:46:33.894978 99723 roundrobin.go:252] tssurya is inside buildPortsToEndpointsMap
I1023 15:46:33.895036 99723 roundrobin.go:270] tssurya is inside OnEndpointsAdd map[http:[{10.128.2.15 8080}]]
I1023 15:46:33.895372 99723 roundrobin.go:279] tssurya state is &{[] 0 { map[] 10800}} and exists is true
I1023 15:46:33.895488 99723 roundrobin.go:282] LoadBalancerRR: Setting endpoints for z5/service-unsecure:http to [10.128.2.15:8080]
I1023 15:46:33.895532 99723 roundrobin.go:243] Delete endpoint 10.128.2.15:8080 for service "z5/service-unsecure:http"
I1023 15:46:33.895568 99723 roundrobin.go:97] tssurya ttlSeconds 10800
I1023 15:46:33.895609 99723 roundrobin.go:106] tssurya returning from newServiceInternal &{[] 0 { map[] 10800}}
I1023 15:46:33.895902 99723 proxier.go:644] tssurya Syncing iptables rules
I1023 15:46:33.896169 99723 proxy.go:353] sdn proxy: add svc z5/service-unsecure: &{{ } {service-unsecure z5 /api/v1/namespaces/z5/services/service-unsecure ed287840-1546-11eb-af2e-fa163e3ff84c 450857 0 2020-10-23 15:46:33 +0000 UTC <nil> <nil> map[name:service-unsecure] map[] [] nil [] } {ClusterIP [{http TCP 27017 {0 8080 } 0}] map[name:caddy-docker] 172.30.85.244 [] None <nil> [] 0 false} {{[]}}}
I1023 15:46:34.036885 99723 proxier.go:369] userspace proxy: processing 0 service events
I1023 15:46:34.036910 99723 proxier.go:348] userspace syncProxyRules took 17.657405ms
I1023 15:46:35.811948 99723 service.go:319] Adding new service port "z5/service-unsecure:http" at 172.30.85.244:27017/TCP
I1023 15:46:35.812025 99723 proxier.go:644] tssurya Syncing iptables rules
I1023 15:46:35.847119 99723 proxier.go:763] tssurya print endpointsMap: [10.128.2.15:8080]
I1023 15:46:35.847143 99723 proxier.go:790] tssurya service z5/service-unsecure:http has endpoints
All right, so I think I found the root cause of this race condition.

It was introduced in September through this PR (https://github.com/openshift/origin/commit/0c9e44a08d12c115c72900af6c7513a7d178ea2b) and exists only in the newer versions of 3.11, which is why I couldn't reproduce this bug in older 3.11 versions or in 4.x.

The variable proxier.changesPending (https://github.com/openshift/origin/blob/56691f4e623af6d9f7c659cf41d71a2a28fb0ed6/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L185) that was introduced in the patch is not locked properly, and hence it is not reliably set to true on this line when an EndpointsUpdate happens: https://github.com/openshift/origin/blob/56691f4e623af6d9f7c659cf41d71a2a28fb0ed6/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L505. Because of this, on entering the SyncProxyRules function we return without running the sync, since both the force and changesPending variables are false (https://github.com/openshift/origin/blob/56691f4e623af6d9f7c659cf41d71a2a28fb0ed6/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L618-L624). So until we get another OnEndpointsUpdate that sets changesPending to true, the sync won't run and we are blocked: the REJECT rule is not replaced, and we can be stuck for up to several minutes (~10 minutes).

The fix should be easy: take proxier.mu.Lock before changing the variable.
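The race described above can be sketched in a few lines of Go. This is a hypothetical reduction, not the actual proxier code: the `proxier`, `onEndpointsUpdate*`, and `syncProxyRules` names here only mirror the shape of the real code. The buggy variant writes the changesPending flag without holding the proxier mutex, so a concurrent sync can miss the write and take the early-return path, leaving the REJECT rule in place until the next update forces a sync; the fixed variant takes the lock first.

```go
package main

import (
	"fmt"
	"sync"
)

// proxier holds the shared flag that records whether endpoints or
// services changed since the last iptables sync.
type proxier struct {
	mu             sync.Mutex
	changesPending bool
}

// Buggy pattern: mutates shared state without taking p.mu, racing
// with syncProxyRules, which reads the flag under the lock.
func (p *proxier) onEndpointsUpdateBuggy() {
	p.changesPending = true // data race
}

// Fixed pattern: take the lock before changing the variable.
func (p *proxier) onEndpointsUpdateFixed() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.changesPending = true
}

// syncProxyRules returns early when nothing is pending and the sync
// is not forced -- the early return that leaves the REJECT rule
// installed. It reports whether the rules were (re)written.
func (p *proxier) syncProxyRules(force bool) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if !force && !p.changesPending {
		return false // sync skipped
	}
	p.changesPending = false
	return true // rules rewritten
}

func main() {
	p := &proxier{}
	p.onEndpointsUpdateFixed()
	fmt.Println(p.syncProxyRules(false)) // sync runs: change was pending
	fmt.Println(p.syncProxyRules(false)) // sync skipped: nothing pending
}
```

Running the buggy variant under `go run -race` with a concurrent `syncProxyRules` call would flag the unsynchronized write, which is the kind of defect the fix (locking proxier.mu before the write) removes.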
Verified this bug on v3.11.346
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 3.11.374 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0079