Bug 1292971
| Summary: | DNS resolution intermittently failing on new containers | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Joel Diaz <jdiaz> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Meng Bo <bmeng> |
| Severity: | low | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.1.0 | CC: | agrimm, aos-bugs, ccoleman, eparis, haowang, jdiaz, jeder, jgoulding, jkrieger, misalunk, pep, tstclair, twiest |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-03-02 16:58:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1303130, 1267746 | | |
Description
Joel Diaz
2015-12-18 22:05:09 UTC
This seems to not be limited to DNS resolution. It looks like a general network connectivity problem on the SDN.

```
# DNS failing (same as above)
bash-4.2$ getent hosts google.com
bash-4.2$ echo $?
2
bash-4.2$ # normal http traffic to github.com failing
bash-4.2$ curl 192.30.252.129
curl: (7) Failed connect to 192.30.252.129:80; No route to host
bash-4.2$
```

I want to point out that once a container has stopped working, it doesn't go back to working. The pattern we see most is that a container works fine for days, then suddenly it can't access anything and nothing can access it (it seems like it's no longer on the SDN). A simple container restart does NOT fix it. To fix this problem, we stop atomic-openshift-node and docker, then we restart openvswitch.

Can I get access to a broken system, please?

Email/provide me your ssh key, and I'll set you up with access.

Could be related to https://github.com/openshift/openshift-sdn/issues/231

However, the system this was reproduced on is no longer showing the symptoms, so a definitive diagnosis is still proving challenging. Trying to get it to happen again.

The issue is that the ovs rules look like:

```
cookie=0x17, table=0, priority=100,ip,nw_dst=10.1.15.15 actions=output:17
cookie=0x17, table=0, priority=100,arp,arp_tpa=10.1.15.15 actions=output:17
```

BUT there is nothing connected to port 17. Because they are starting a bare docker container (not using openshift), they are actually connected via port 9:

```
cookie=0xac1f1ab9, priority=75,arp,arp_tpa=10.1.15.0/24 actions=output:9
cookie=0xac1f1ab9, priority=75,ip,nw_dst=10.1.15.0/24 actions=output:9
```

The arp traffic looks like this (tun0 is the host-side bridge):

```
# tcpdump -i tun0 arp
12:03:57.762311 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-11.ec2.internal, length 28
12:03:57.762318 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28
12:03:58.082605 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:03:58.082615 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28
12:03:59.083963 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:03:59.083983 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28
```

It hits the higher-priority rule and gets sent to port 17, but there is nothing connected to ovs port 17. So on the docker bridge we see only the requests:

```
# tcpdump -i lbr0 arp
12:04:13.107067 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:04:14.109949 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:04:15.111941 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
```

And you can see that the response was not passed back through the bridge.

However, that area of code has been refactored and fixed since 3.1, and QA are not seeing stray rules (they have submitted bugs about other stray rules, so they are clearly looking for these issues). I dug through the kubernetes code to make sure that we call TearDownPod() when the container within a pod terminates abnormally, and I couldn't find any code path that skips it (or, more typically, skips KillPod(), which in turn calls TearDownPod()) whenever a container we started is no longer running.

This has only been seen with 3.1.0.4; we have not seen it with anything later. We have been recommending an upgrade to 3.1.1.6 or higher.
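To make the stale-rule condition described above easier to spot, one rough check is to compare the output ports referenced by the flow table against the ports actually attached to the bridge. This is a minimal sketch only, assuming the OpenShift SDN OVS bridge is named br0 and that ovs-ofctl is available on the node; it is not part of any fix in this bug.

```bash
#!/bin/bash
# Sketch: flag flow rules on the SDN bridge whose output port no longer exists.
# Assumption: the OVS bridge is br0 (adjust if your SDN uses a different name).

# Port numbers currently attached to the bridge, e.g. " 17(veth1234abc): ..." -> "17"
ports=$(ovs-ofctl show br0 | sed -n 's/^ *\([0-9][0-9]*\)(.*/\1/p')

# Every distinct "output:N" action referenced by the flow table.
ovs-ofctl dump-flows br0 | grep -oE 'output:[0-9]+' | sort -u | while read -r action; do
    port=${action#output:}
    if ! printf '%s\n' "$ports" | grep -qx "$port"; then
        echo "possible stale flow: actions=output:$port but port $port is not on br0"
    fi
done
```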
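For completeness, the workaround mentioned earlier in this comment (stop the node service and docker, then bounce openvswitch) might look roughly like the following on a systemd host. The unit names are an assumption and may differ between installs, and the stopped services presumably need to be started again afterwards.

```bash
# Hedged sketch of the manual recovery sequence described above.
# Assumed unit names: atomic-openshift-node, docker, openvswitch.
systemctl stop atomic-openshift-node docker
systemctl restart openvswitch
systemctl start docker atomic-openshift-node
```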
Tried with 3.1.1.6 and was unable to reproduce the issue:

```
[root@hackathon-node-compute-78650 root]# rpm -q atomic-openshift
atomic-openshift-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
[root@hackathon-node-compute-78650 root]# for x in {1..10} ; do docker run -ti --rm docker-registry.ops.rhcloud.com/ops/oso-rhel7-host-monitoring getent hosts google.com ; echo "RETURNED: $?" ; done
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
```

I'm okay with closing this.

This is fixed in the 3.1.1.6 release. (Joel, thanks for taking the time to test this in your environment.)

*** Bug 1312945 has been marked as a duplicate of this bug. ***