Bug 1292971
| Summary: | DNS resolution intermittently failing on new containers | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Joel Diaz <jdiaz> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Meng Bo <bmeng> |
| Severity: | low | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.1.0 | CC: | agrimm, aos-bugs, ccoleman, eparis, haowang, jdiaz, jeder, jgoulding, jkrieger, misalunk, pep, tstclair, twiest |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-03-02 16:58:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1303130, 1267746 | | |
Description
Joel Diaz
2015-12-18 22:05:09 UTC
This seems to not be limited to DNS resolution. It looks like a general network connectivity problem on the SDN.

```
# DNS failing (same as above)
bash-4.2$ getent hosts google.com
bash-4.2$ echo $?
2
bash-4.2$ # normal http traffic to github.com failing
bash-4.2$ curl 192.30.252.129
curl: (7) Failed connect to 192.30.252.129:80; No route to host
bash-4.2$
```

I want to point out that once a container has stopped working, it doesn't go back to working. The pattern we see most is that a container works fine for days, then suddenly it can't access anything and nothing can access it (it seems like it's no longer on the SDN). A simple container restart does NOT fix it. To fix this problem, we stop atomic-openshift-node and docker, then we restart openvswitch.

Can I get access to a broken system, please?

Email/provide me your ssh key, and I'll set you up with access.

Could be related to https://github.com/openshift/openshift-sdn/issues/231

However, the system this was reproduced on is no longer showing the symptoms, so a definitive diagnosis is still proving challenging. Trying to get it to happen again.

The issue is that the ovs rules look like:

```
cookie=0x17, table=0, priority=100,ip,nw_dst=10.1.15.15 actions=output:17
cookie=0x17, table=0, priority=100,arp,arp_tpa=10.1.15.15 actions=output:17
```

BUT there is nothing connected to port 17. Because they are starting a bare docker container (not using openshift), they are actually connected via port 9:

```
cookie=0xac1f1ab9, priority=75,arp,arp_tpa=10.1.15.0/24 actions=output:9
cookie=0xac1f1ab9, priority=75,ip,nw_dst=10.1.15.0/24 actions=output:9
```

The arp traffic looks like this (tun0 is the host-side bridge):

```
# tcpdump -i tun0 arp
12:03:57.762311 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-11.ec2.internal, length 28
12:03:57.762318 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28
12:03:58.082605 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:03:58.082615 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28
12:03:59.083963 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:03:59.083983 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28
```

It hits the higher-priority rule and gets sent to port 17, but there is nothing connected to ovs port 17. So on the docker bridge we see only the requests:

```
# tcpdump -i lbr0 arp
12:04:13.107067 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:04:14.109949 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:04:15.111941 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
```

And you can see that the response was not passed back through the bridge.

However, that area of code has been refactored and fixed since 3.1, and QA are not seeing stray rules (they have submitted bugs about other stray rules, so they are clearly looking for these issues). I dug through the kubernetes code to make sure that we call TearDownPod() when the container within a pod terminates abnormally, and I couldn't find any code path that skips it (or, more typically, skips KillPod(), which in turn calls TearDownPod()) whenever a container we started is no longer running.

This has only been seen with 3.1.0.4; we have not seen it with anything later. We have been recommending an upgrade to 3.1.1.6 or higher.
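To make the stale-rule condition described above easier to spot, one rough check is to compare the output ports referenced by the flow table against the ports actually attached to the bridge. This is a minimal sketch only, assuming the OpenShift SDN OVS bridge is named br0 and that ovs-ofctl is available on the node; it is not part of any fix in this bug.

```bash
#!/bin/bash
# Sketch: flag flow rules on the SDN bridge whose output port no longer exists.
# Assumption: the OVS bridge is br0 (adjust if your SDN uses a different name).

# Port numbers currently attached to the bridge, e.g. " 17(veth1234abc): ..." -> "17"
ports=$(ovs-ofctl show br0 | sed -n 's/^ *\([0-9][0-9]*\)(.*/\1/p')

# Every distinct "output:N" action referenced by the flow table.
ovs-ofctl dump-flows br0 | grep -oE 'output:[0-9]+' | sort -u | while read -r action; do
    port=${action#output:}
    if ! printf '%s\n' "$ports" | grep -qx "$port"; then
        echo "possible stale flow: actions=output:$port but port $port is not on br0"
    fi
done
```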
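For completeness, the workaround mentioned earlier in this comment (stop the node service and docker, then bounce openvswitch) might look roughly like the following on a systemd host. The unit names are an assumption and may differ between installs, and the stopped services presumably need to be started again afterwards.

```bash
# Hedged sketch of the manual recovery sequence described above.
# Assumed unit names: atomic-openshift-node, docker, openvswitch.
systemctl stop atomic-openshift-node docker
systemctl restart openvswitch
systemctl start docker atomic-openshift-node
```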
Tried with 3.1.1.6 and was unable to reproduce the issue:

```
[root@hackathon-node-compute-78650 root]# rpm -q atomic-openshift
atomic-openshift-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
[root@hackathon-node-compute-78650 root]# for x in {1..10} ; do docker run -ti --rm docker-registry.ops.rhcloud.com/ops/oso-rhel7-host-monitoring getent hosts google.com ; echo "RETURNED: $?" ; done
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
```

I'm okay with closing this.

This is fixed in the 3.1.1.6 release. (Joel, thanks for taking the time to test this in your environment.)

*** Bug 1312945 has been marked as a duplicate of this bug. ***