Description of problem:
A Service of type NodePort with protocol UDP is created and correctly binds to multiple endpoints, but only one pod receives traffic even though sessionAffinity: None is set. The first pod started receives all of the traffic. If that pod is removed or fails, the node starts sending ICMP port unreachable messages back to the sender even though the service still has reachable pods. This only appears to affect UDP traffic: a TCP service created the same way handles pod failures and load balancing as expected, so the problem seems limited to UDP-based services.

Version-Release number of selected component (if applicable):
3.3.0; also tested in Kubernetes v1.5.2

How reproducible:
Unconfirmed here, but the customer can reproduce it 100% of the time.
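For reference, a Service matching the description above can be created along the following lines; the name, selector label, and port numbers are illustrative placeholders, not values taken from the customer environment:

  # Hypothetical example; name, label, and ports are placeholders.
  oc create -f - <<EOF
  apiVersion: v1
  kind: Service
  metadata:
    name: udp-echo
  spec:
    type: NodePort
    sessionAffinity: None
    selector:
      app: udp-echo
    ports:
    - protocol: UDP
      port: 6343        # service port inside the cluster
      targetPort: 6343  # port the pods listen on
      nodePort: 30343   # port exposed on every node
  EOF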
From customer: To reproduce, create a service with type NodePort, protocol UDP, and no session affinity. Then start two pods on different application nodes that listen on a UDP port. I happen to be using sflowtool from InMon, but you could just as easily run a copy of netcat or even a simple Perl script that listens on a network port. At this point you can send a UDP stream into the cluster to any node, and the NodePort magic kicks in, doing the port translation and directing the traffic to a pod. For my test I happen to send traffic into the cluster by sending it to one of the infra nodes, but in reality I could send it to any app node or even the masters.

At this point you will notice that only one pod is receiving traffic, even though there are two pods running and traffic should be randomly distributed to both. Then do something that makes the pod that is taking the traffic stop (kill the pod, or scale the DC down to fewer replicas so that the 'working' pod is removed). As soon as the pod that was taking all the traffic stops, even though there are other working pods in the service, the node you are directing traffic to (one of the infra nodes in my test) starts sending ICMP unreachable messages back toward the device that is trying to send traffic into the cluster.
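A rough reproduction of the above using netcat instead of sflowtool might look like this (hostnames and ports are placeholders, and the netcat flags assume the OpenBSD variant):

  # In each of the two pods, run a UDP listener on the container port:
  nc -u -l 6343

  # From a host outside the cluster, send a steady UDP stream at the
  # NodePort on any node (an infra node here, as in the customer's test):
  while true; do echo "probe $(date +%s)"; sleep 1; done | nc -u infra-node.example.com 30343

  # Only one pod's listener ever prints the traffic. After that pod is
  # deleted (or the DC is scaled down past it), the sender starts getting
  # ICMP port unreachable instead of the traffic shifting to the other pod.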
In order to send related UDP traffic to the same backing pod, we use conntrack for UDP in iptables. What that means is that UDP packets from the same source IP:port to the service IP:port will go to the same backend until the UDP conntrack entry times out after 180 seconds of no traffic. http://www.iptables.info/en/connection-state.html#UDPCONNECTIONS If you want the sessions to be treated as "different", you need to vary the source port being used. I agree that this is not great, but what we can do for UDP services is limited by the constraints of the protocol.
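For illustration, the conntrack entries and the effect of varying the source port can be inspected roughly as follows on the node receiving the traffic (requires the conntrack CLI on the node; hostnames and ports are placeholders):

  # Show the UDP conntrack entries pinning flows to a backend pod:
  conntrack -L -p udp | grep 30343

  # Each distinct client source port is a new "session" and may be
  # balanced to a different backend; with OpenBSD netcat, -p sets the
  # local source port:
  echo probe | nc -u -p 40001 infra-node.example.com 30343
  echo probe | nc -u -p 40002 infra-node.example.com 30343

  # A stale entry for a deleted pod can also be cleared by hand rather
  # than waiting for the 180-second timeout:
  conntrack -D -p udp --orig-port-dst 30343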