Description of problem:

We've had a number of nodes start showing this issue: existing containers on the node can do DNS resolution without issue, but newly created containers will sometimes have working DNS and sometimes not.

Specifically, for an already running container:

[root@ops-node-compute-fcf59 ~]# docker ps | grep 3e92
3e92fb118928   172.30.157.196:5000/monitoring/oso-rhel7-zagg-web@sha256:b25423de57d2271c09a085f7c304070b2ac291af1d3bd60b5a4a39a2a0f3a6f2   "/bin/sh -c /usr/loca"   20 hours ago   Up 20 hours   k8s_oso-rhel7-zagg-web.d76ca846_oso-rhel7-zagg-web-18-9eequ_monitoring_2f22fe9a-a4db-11e5-b032-0ae36c123e51_0706e020

[root@ops-node-compute-fcf59 ~]# for x in {1..10} ; do docker exec -ti 3e92fb118928 getent hosts google.com ; echo "RETURNED: $?" ; done
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0
2607:f8b0:4004:80d::200e google.com
RETURNED: 0

If we try a similar thing with a newly launched container:

[root@ops-node-compute-fcf59 ~]# for x in {1..10} ; do docker run -ti --rm docker-registry.ops.rhcloud.com/ops/oso-rhel7-zagg-client getent hosts google.com ; echo "RETURNED: $?" ; done
2607:f8b0:4004:806::1009 google.com
RETURNED: 0
RETURNED: 2
2607:f8b0:4004:806::1009 google.com
RETURNED: 0
2607:f8b0:4004:806::1009 google.com
RETURNED: 0
RETURNED: 2
2607:f8b0:4004:806::1007 google.com
RETURNED: 0
RETURNED: 2
RETURNED: 2
RETURNED: 2
2607:f8b0:4004:806::1004 google.com
RETURNED: 0

That is a failure rate of 50%.

Version-Release number of selected component (if applicable):

[root@ops-node-compute-fcf59 ~]# rpm -qa | grep openshift
atomic-openshift-node-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-clients-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
tuned-profiles-atomic-openshift-node-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.0.4-1.git.15.5e061c3.el7aos.x86_64

[root@ops-node-compute-fcf59 ~]# rpm -qa | grep vswitch
openvswitch-2.3.1-2.git20150113.el7.x86_64

How reproducible:
Intermittent.

Steps to Reproduce:
1. We are unsure how a node gets into the failing state, but once a node is in that state, start a new container that does nothing but attempt a DNS lookup.
2. docker run -ti --rm docker-registry.ops.rhcloud.com/ops/oso-rhel7-zagg-client getent hosts google.com
3. Run step #2 several times to see it pass/fail intermittently.

Actual results:
Some container runs resolve DNS fine; some container runs fail.

Expected results:
100% DNS resolution.

Additional info:
The following is not specific to the details reported above, but may or may not be related. We've also seen some nodes get into a state where new containers fail DNS resolution 100% of the time, while existing containers remain functional. We have also seen a case where new containers can do DNS resolution but existing containers cannot.
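For convenience when checking other nodes, here is the reproduction loop above wrapped into a small script that reports the observed failure rate. This is only a sketch: the image name is the one from the report, the run count is arbitrary, and -ti is dropped since no terminal is needed for a scripted run.

#!/bin/bash
# Run N throwaway containers that each do a single DNS lookup and
# count how many fail, to quantify the intermittent behaviour.
IMAGE=docker-registry.ops.rhcloud.com/ops/oso-rhel7-zagg-client
RUNS=10
fail=0
for x in $(seq 1 "$RUNS"); do
    if ! docker run --rm "$IMAGE" getent hosts google.com; then
        fail=$((fail + 1))
    fi
done
echo "failures: $fail/$RUNS"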
This does not seem to be limited to DNS resolution; it looks like a general network connectivity problem on the SDN.

# DNS failing (same as above)
bash-4.2$ getent hosts google.com
bash-4.2$ echo $?
2
bash-4.2$

# normal http traffic to github.com failing
bash-4.2$ curl 192.30.252.129
curl: (7) Failed connect to 192.30.252.129:80; No route to host
bash-4.2$
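To separate "DNS is broken" from "the container has no network at all", a sketch that runs both a DNS lookup and a direct-to-IP connection from fresh containers. Assumptions: the same image as in the report, that it provides curl (the session above suggests curl is available in these containers), and 192.30.252.129 is simply the github.com address used above.

# Name resolution path vs. raw routing path; if both fail,
# the problem is not DNS-specific.
IMAGE=docker-registry.ops.rhcloud.com/ops/oso-rhel7-zagg-client
docker run --rm "$IMAGE" getent hosts google.com
echo "dns lookup returned: $?"
docker run --rm "$IMAGE" curl -sS -o /dev/null --connect-timeout 5 http://192.30.252.129/
echo "direct tcp connect returned: $?"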
I want to point out that once a container has stopped working, it does not recover on its own. The pattern we see most often is that a container works fine for days, then suddenly it can't access anything and nothing can access it (it appears to have dropped off the SDN entirely). A simple container restart does NOT fix it. To recover, we stop atomic-openshift-node and docker, then restart openvswitch (see the sketch below).
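For reference, a minimal sketch of that recovery sequence as described above, assuming the standard systemd unit names on these nodes (atomic-openshift-node, docker, openvswitch). Bringing docker and the node service back up afterwards, and the order of doing so, is my assumption rather than part of the comment above; adjust for your environment.

# Run on the affected node as root.
systemctl stop atomic-openshift-node
systemctl stop docker
systemctl restart openvswitch
# Assumed: start the services again in reverse order once openvswitch is back.
systemctl start docker
systemctl start atomic-openshift-node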
Can I get access to a broken system please?
Email/provide me your ssh key, and I'll set you up with access.
Could be related to https://github.com/openshift/openshift-sdn/issues/231. However, the system this was reproduced on is no longer showing the symptoms, so a definitive diagnosis is still proving challenging. Trying to get it to happen again.
The issue is that the ovs rules look like:

cookie=0x17, table=0, priority=100,ip,nw_dst=10.1.15.15 actions=output:17
cookie=0x17, table=0, priority=100,arp,arp_tpa=10.1.15.15 actions=output:17

BUT there is nothing connected to port 17.

Because they are starting a bare docker container (not using openshift), they are connected to port 9:

cookie=0xac1f1ab9, priority=75,arp,arp_tpa=10.1.15.0/24 actions=output:9
cookie=0xac1f1ab9, priority=75,ip,nw_dst=10.1.15.0/24 actions=output:9

The ARP traffic looks like this (tun0 is the host-side bridge):

# tcpdump -i tun0 arp
12:03:57.762311 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-11.ec2.internal, length 28
12:03:57.762318 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28
12:03:58.082605 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:03:58.082615 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28
12:03:59.083963 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:03:59.083983 ARP, Reply ip-10-1-15-1.ec2.internal is-at d2:05:73:b8:c8:58 (oui Unknown), length 28

It hits the higher-priority rule and gets sent to port 17, but there is nothing connected to ovs port 17. So on the docker bridge we see:

# tcpdump -i lbr0 arp
12:04:13.107067 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:04:14.109949 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28
12:04:15.111941 ARP, Request who-has ip-10-1-15-1.ec2.internal tell ip-10-1-15-15.ec2.internal, length 28

and you can see that the reply was never passed back through the bridge.

However, that area of code has been refactored and fixed since 3.1, and QA are not seeing stray rules (they have submitted bugs about other stray rules, so they are clearly looking for these issues).

I dug through the kubernetes code to make sure that we call TearDownPod() when the container within a pod terminates abnormally, and I couldn't see any code paths that fail to call it (or, more typically, KillPod(), which in turn calls TearDownPod()) whenever a container we started is no longer running.
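For anyone trying to confirm this on another node, a minimal sketch of how one might spot this kind of stale rule, assuming the openshift-sdn bridge is named br0 (the default for this plugin) and that ovs-ofctl and tcpdump are available on the host. The port number 17 is only the example from the flows above.

# Dump the per-pod flows and note the output:N targets.
ovs-ofctl dump-flows br0 | grep 'priority=100'

# List the ports actually attached to the bridge; if an output:N
# target from a flow above is missing here, that rule is stale.
ovs-ofctl show br0 | grep -E '^[[:space:]]*[0-9]+\('

# Watch ARP on the host-side bridge to see whether replies come back
# (as in the tcpdump output above).
tcpdump -i tun0 arp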
This has only been seen with 3.1.0.4; we have not seen it with anything later. We have been recommending an upgrade to 3.1.1.6 or higher.
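As a quick check on an affected node, a sketch of verifying the installed versions and pulling in the newer packages. This assumes yum-managed RPM installs with the 3.1.1.x repo already enabled; the supported upgrade path for a cluster may instead be the installer/upgrade playbooks.

# Confirm what is currently installed on the node.
rpm -qa 'atomic-openshift*' openvswitch

# Update the OpenShift node packages and restart the node service.
yum update 'atomic-openshift*'
systemctl restart atomic-openshift-node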
Tried with 3.1.1.6 and was unable to reproduce the issue:

[root@hackathon-node-compute-78650 root]# rpm -q atomic-openshift
atomic-openshift-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64

[root@hackathon-node-compute-78650 root]# for x in {1..10} ; do docker run -ti --rm docker-registry.ops.rhcloud.com/ops/oso-rhel7-host-monitoring getent hosts google.com ; echo "RETURNED: $?" ; done
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0
2607:f8b0:400d:c02::64 google.com
RETURNED: 0

I'm okay with closing this.
This is fixed in the 3.1.1.6 release. (Joel, thanks for taking the time to test this in your environment)
*** Bug 1312945 has been marked as a duplicate of this bug. ***