Description of problem:

We've had a number of nodes start showing this issue: existing containers on the node can do DNS resolution without issue, but newly created containers will sometimes have working DNS and sometimes not.

Specifically, for an already running container:

[root@ops-node-compute-fcf59 ~]# docker ps | grep 3e92
3e92fb118928   172.30.157.196:5000/monitoring/oso-rhel7-zagg-web@sha256:b25423de57d2271c09a085f7c304070b2ac291af1d3bd60b5a4a39a2a0f3a6f2   "/bin/sh -c /usr/loca"   20 hours ago   Up 20 hours   k8s_oso-rhel7-zagg-web.d76ca846_oso-rhel7-zagg-web-18-9eequ_monitoring_2f22fe9a-a4db-11e5-b032-0ae36c123e51_0706e020

[root@ops-node-compute-fcf59 ~]# for x in {1..10} ; do docker exec -ti 3e92fb118928 getent hosts google.com ; echo "RETURNED: $?" ; done
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0

If we try a similar thing with a newly launched container:

[root@ops-node-compute-fcf59 ~]# for x in {1..10} ; do docker run -ti --rm docker-registry.ops.rhcloud.com/ops/oso-rhel7-zagg-client getent hosts google.com ; echo "RETURNED: $?" ; done
2607:f8b0:4004:806::1009 google.com
RETURNED: 0
RETURNED: 2
2607:f8b0:4004:806::1009 google.com
RETURNED: 0
2607:f8b0:4004:806::1009 google.com
RETURNED: 0
RETURNED: 2
2607:f8b0:4004:806::1007 google.com
RETURNED: 0
RETURNED: 2
RETURNED: 2
RETURNED: 2
2607:f8b0:4004:806::1004 google.com
RETURNED: 0

That is a failure rate of 50%.

Version-Release number of selected component (if applicable):

[root@ops-node-compute-fcf59 ~]# rpm -qa | grep openshift
atomic-openshift-node-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-clients-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
tuned-profiles-atomic-openshift-node-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

[root@ops-node-compute-fcf59 ~]# rpm -qa | grep vswitch
openvswitch-2.3.1-2.git20150113.el7.x86_64

How reproducible:
Intermittent.

Steps to Reproduce:
1. We are unsure how a node gets into the failing state, but once a node is in that state, start a new container that does nothing but attempt a DNS lookup.
2. docker run -ti --rm docker-registry.ops.rhcloud.com/ops/oso-rhel7-zagg-client getent hosts google.com
3. Run step #2 several times to see it pass/fail intermittently.

Actual results:
Some container runs resolve DNS fine; some container runs fail.

Expected results:
100% DNS resolution.

Additional info:
The following is not specific to the details reported above, but may or may not be related. We've also seen some nodes get into a state where new containers fail DNS resolution 100% of the time, while existing containers remain functional. We have also seen a case where new containers can do DNS resolution but existing containers cannot.
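For convenience when checking other nodes, here is the reproduction loop above wrapped into a small script that reports the observed failure rate. This is only a sketch: the image name is the one from the report, the run count is arbitrary, and -ti is dropped since no terminal is needed for a scripted run.

#!/bin/bash
# Run N throwaway containers that each do a single DNS lookup and
# count how many fail, to quantify the intermittent behaviour.
IMAGE=docker-registry.ops.rhcloud.com/ops/oso-rhel7-zagg-client
RUNS=10
fail=0
for x in $(seq 1 "$RUNS"); do
    if ! docker run --rm "$IMAGE" getent hosts google.com; then
        fail=$((fail + 1))
    fi
done
echo "failures: $fail/$RUNS"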
This does not seem to be limited to DNS resolution; it looks like a general network connectivity problem on the SDN.

# DNS failing (same as above)
bash-4.2$ getent hosts google.com
bash-4.2$ echo $?
2
bash-4.2$

# normal http traffic to github.com failing
bash-4.2$ curl 192.30.252.129
curl: (7) Failed connect to 192.30.252.129:80; No route to host
bash-4.2$
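To separate "DNS is broken" from "the container has no network at all", a sketch that runs both a DNS lookup and a direct-to-IP connection from fresh containers. Assumptions: the same image as in the report, that it provides curl (the session above suggests curl is available in these containers), and 192.30.252.129 is simply the github.com address used above.

# Name resolution path vs. raw routing path; if both fail,
# the problem is not DNS-specific.
IMAGE=docker-registry.ops.rhcloud.com/ops/oso-rhel7-zagg-client
docker run --rm "$IMAGE" getent hosts google.com
echo "dns lookup returned: $?"
docker run --rm "$IMAGE" curl -sS -o /dev/null --connect-timeout 5 http://192.30.252.129/
echo "direct tcp connect returned: $?"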
I want to point out that once a container has stopped working, it does not recover on its own. The pattern we see most often is that a container works fine for days, then suddenly it can't access anything and nothing can access it (it appears to have dropped off the SDN entirely). A simple container restart does NOT fix it. To recover, we stop atomic-openshift-node and docker, then restart openvswitch (see the sketch below).
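For reference, a minimal sketch of that recovery sequence as described above, assuming the standard systemd unit names on these nodes (atomic-openshift-node, docker, openvswitch). Bringing docker and the node service back up afterwards, and the order of doing so, is my assumption rather than part of the comment above; adjust for your environment.

# Run on the affected node as root.
systemctl stop atomic-openshift-node
systemctl stop docker
systemctl restart openvswitch
# Assumed: start the services again in reverse order once openvswitch is back.
systemctl start docker
systemctl start atomic-openshift-node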
Can I get access to a broken system please?
Email/provide me your ssh key, and I'll set you up with access.
Could be related to https://github.com/openshift/openshift-sdn/issues/231. However, the system this was reproduced on is no longer showing the symptoms, so a definitive diagnosis is still proving challenging. Trying to get it to happen again.
The issue is that the ovs rules look like:

cookie=0x17, table=0, priority=100,ip,nw_dst=10.1.15.15 actions=output:17
cookie=0x17, table=0, priority=100,arp,arp_tpa=10.1.15.15 actions=output:17

BUT there is nothing connected to port 17.

Because they are starting a bare docker container (not using openshift), they are connected to port 9:

cookie=0xac1f1ab9, priority=75,arp,arp_tpa=10.1.15.0/24 actions=output:9
cookie=0xac1f1ab9, priority=75,ip,nw_dst=10.1.15.0/24 actions=output:9

The ARP traffic looks like this (tun0 is the host-side bridge):

# tcpdump -i tun0 arp
12:03:57.762311 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-11.ec2.internal, length 28
12:03:57.762318 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28
12:03:58.082605 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:03:58.082615 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28
12:03:59.083963 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:03:59.083983 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28

It hits the higher-priority rule and gets sent to port 17, but there is nothing connected to ovs port 17. So on the docker bridge we see:

# tcpdump -i lbr0 arp
12:04:13.107067 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:04:14.109949 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:04:15.111941 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28

and you can see that the reply was never passed back through the bridge.

However, that area of code has been refactored and fixed since 3.1, and QA are not seeing stray rules (they have submitted bugs about other stray rules, so they are clearly looking for these issues).

I dug through the kubernetes code to make sure that we call TearDownPod() when the container within a pod terminates abnormally, and I couldn't see any code paths that fail to call it (or, more typically, KillPod(), which in turn calls TearDownPod()) whenever a container we started is no longer running.
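For anyone trying to confirm this on another node, a minimal sketch of how one might spot this kind of stale rule, assuming the openshift-sdn bridge is named br0 (the default for this plugin) and that ovs-ofctl and tcpdump are available on the host. The port number 17 is only the example from the flows above.

# Dump the per-pod flows and note the output:N targets.
ovs-ofctl dump-flows br0 | grep 'priority=100'

# List the ports actually attached to the bridge; if an output:N
# target from a flow above is missing here, that rule is stale.
ovs-ofctl show br0 | grep -E '^[[:space:]]*[0-9]+\('

# Watch ARP on the host-side bridge to see whether replies come back
# (as in the tcpdump output above).
tcpdump -i tun0 arp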
This has only been seen with 3.1.0.4; we have not seen it with anything later. We have been recommending an upgrade to 3.1.1.6 or higher.
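As a quick check on an affected node, a sketch of verifying the installed versions and pulling in the newer packages. This assumes yum-managed RPM installs with the 3.1.1.x repo already enabled; the supported upgrade path for a cluster may instead be the installer/upgrade playbooks.

# Confirm what is currently installed on the node.
rpm -qa 'atomic-openshift*' openvswitch

# Update the OpenShift node packages and restart the node service.
yum update 'atomic-openshift*'
systemctl restart atomic-openshift-node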
Tried with 3.1.1.6 and was unable to reproduce the issue:

[root@hackathon-node-compute-78650 root]# rpm -q atomic-openshift
atomic-openshift-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64

[root@hackathon-node-compute-78650 root]# for x in {1..10} ; do docker run -ti --rm docker-registry.ops.rhcloud.com/ops/oso-rhel7-host-monitoring getent hosts google.com ; echo "RETURNED: $?" ; done
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0

I'm okay with closing this.
This is fixed in the 3.1.1.6 release. (Joel, thanks for taking the time to test this in your environment)
*** Bug 1312945 has been marked as a duplicate of this bug. ***