Bug 1390835 - DNS resolution fails in a container started by 'docker run'
Summary: DNS resolution fails in a container started by 'docker run'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Dan Winship
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-02 04:16 UTC by Dongbo Yan
Modified: 2017-03-08 18:43 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
If this bug requires documentation, please select an appropriate Doc Type value.
Clone Of:
Environment:
Last Closed: 2017-01-18 12:48:26 UTC
Target Upstream Version:
Embargoed:


Attachments
node-config (1.44 KB, text/plain) - 2016-11-03 02:40 UTC, Yan Du
config info in container and node (28.06 KB, text/plain) - 2016-11-04 02:54 UTC, Yan Du
some info inside container (19.07 KB, text/plain) - 2016-11-10 10:12 UTC, Dongbo Yan
iptable rules (3.18 KB, text/plain) - 2016-11-10 10:13 UTC, Dongbo Yan
3.4 node iptables (12.17 KB, text/plain) - 2016-11-15 15:08 UTC, Russell Teague


Links
Origin (Github) 11919 - last updated 2016-11-15 15:35:40 UTC
Red Hat Product Errata RHBA-2017:0066 (SHIPPED_LIVE) - Red Hat OpenShift Container Platform 3.4 RPM Release Advisory - last updated 2017-01-18 17:23:26 UTC

Description Dongbo Yan 2016-11-02 04:16:33 UTC
Description of problem:
DNS resolution fails in a container started by 'docker run'

Version-Release number of selected component (if applicable):
openshift v3.4.0.18+ada983f
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Docker version 1.12.1, build fd47464-redhat
docker-common-1.12.1-6.el7.x86_64

We updated openvswitch to openvswitch-2.5.0-14.git20160727.el7fdp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Setup OCP env and ssh into the node
2. Start a docker container manually
# docker pull bmeng/hello-openshift
# docker run -td bmeng/hello-openshift

3. Log in to the docker container and check the network
# docker exec -it <container_id> bash
# nslookup www.redhat.com

Actual results:
bash-4.3# nslookup www.redhat.com
Server:    10.240.0.45
Address 1: 10.240.0.45
nslookup: can't resolve 'www.redhat.com'

Expected results:
/ $ nslookup www.redhat.com
Server:    10.240.0.46
Address 1: 10.240.0.46 qe-dyan1-node-registry-router-2.c.openshift-gce-devel.internal

Name:      www.redhat.com
Address 1: 23.194.78.16 a23-194-78-16.deploy.static.akamaitechnologies.com
Address 2: 2600:1407:9:39a::d44
Address 3: 2600:1407:9:389::d44

Additional info:
It works when the container is created with 'oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/pod-for-ping.json'

Comment 1 Ben Bennett 2016-11-02 17:45:03 UTC
What does your resolv.conf look like in the container?

What does it look like outside?

Are you running a dnsmasq on the nodes?  What does the config look like?
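
For reference, one way to collect that information (a sketch; <container_id> comes from 'docker ps', and dnsmasq may or may not be present on the node):
# docker exec -it <container_id> cat /etc/resolv.conf     (inside the container)
# cat /etc/resolv.conf                                    (on the node)
# systemctl status dnsmasq                                (on the node, if dnsmasq is used)
# cat /etc/dnsmasq.conf /etc/dnsmasq.d/*.conf             (on the node, if dnsmasq is used)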

Comment 3 Yan Du 2016-11-03 02:40:00 UTC
Created attachment 1216833 [details]
node-config

Comment 4 Ben Bennett 2016-11-03 20:28:17 UTC
How did you install that cluster?

Comment 5 Dan Winship 2016-11-03 20:36:35 UTC
Works for me using a dind cluster.

Can you provide the output of "ip a", "ip r", and "traceroute 10.240.0.45" (or whatever the DNS ends up being this time) inside the container, and "ip a", "ip r", "brctl show docker0" and "iptables-save" outside the container?
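
For reference, if the image does not ship ip or traceroute, the in-container data can also be gathered from the node through the container's network namespace (a sketch; the container ID comes from 'docker ps'):
# PID=$(docker inspect -f '{{.State.Pid}}' <container_id>)
# nsenter -t $PID -n ip a
# nsenter -t $PID -n ip r
# nsenter -t $PID -n traceroute 10.240.0.45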

Comment 7 Yan Du 2016-11-04 02:54:34 UTC
Created attachment 1217212 [details]
config info in container and node

Comment 10 Dan Winship 2016-11-04 15:49:39 UTC
OK, the problem seems to be that docker's iptables rules have been deleted.

systemctl status shows that docker was started at "2016-11-03 00:02:51 EDT" and iptables.service was started at "2016-11-03 00:03:31 EDT". So that's the problem; iptables.service was started after docker, and deleted its rules.

In 3.3, OpenShift would always restart docker after it was started up at boot time (to change its configuration), but that no longer happens, so we depend on docker and iptables.service being started in the right order.

Since iptables.service is being enabled by ansible, it seems like the right fix here is for ansible to also install an appropriate systemd unit file that will ensure that docker doesn't get started until after iptables.service does.
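
One possible shape for that ordering fix, sketched as a systemd drop-in (the file name and exact directives here are an illustration, not necessarily what openshift-ansible ships):

/etc/systemd/system/docker.service.d/wait-for-iptables.conf:
[Unit]
Wants=iptables.service
After=iptables.service

followed by 'systemctl daemon-reload' so the drop-in takes effect at the next boot.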

Comment 11 Dongbo Yan 2016-11-09 07:01:28 UTC
Thanks for the workaround; it works well after docker and iptables.service are started in the right order.

Comment 13 Dongbo Yan 2016-11-10 10:12:02 UTC
Created attachment 1219297 [details]
some info inside container

Comment 14 Dongbo Yan 2016-11-10 10:13:19 UTC
Created attachment 1219299 [details]
iptable rules

Comment 15 Dongbo Yan 2016-11-10 10:19:07 UTC
Tested with
openshift v3.4.0.24+52fd77b
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

docker version
Version:         1.12.3
 API version:     1.24
 Package version: docker-common-1.12.3-4.el7.x86_64
 Go version:      go1.6.2
 Git commit:      f320458-redhat
 Built:           Mon Nov  7 10:15:24 2016
 OS/Arch:         linux/amd64

# uname -r
3.10.0-327.36.1.el7.x86_64

The workaround does not work; attaching some info from inside the container (attachment 1219297 [details]) and the iptables rules (attachment 1219299 [details]).

Comment 17 Russell Teague 2016-11-10 15:39:02 UTC
Please provide details on how the workaround was applied.  For example, the systemd files applied, the order of commands executed, such as `systemctl daemon-reload`, and the output of `systemctl status iptables` and `systemctl status docker`.

Comment 18 Johnny Liu 2016-11-11 04:15:42 UTC
According to comment 10, the workaround is "systemctl restart iptables" -> "systemctl restart docker", in that order.

Comment 11 confirms that the workaround works, while comment 15 says that it no longer does.

But I can confirm this workaround still works well in my env; you could give it a try in yours.

Comment 21 Dongbo Yan 2016-11-14 05:57:27 UTC
For comment 11, I confirmed the workaround with
openshift v3.4.0.23+24b1a58
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

docker version
 Version:         1.12.3
 API version:     1.24
 Package version: docker-common-1.12.3-2.el7.x86_64
 Go version:      go1.6.2
 Git commit:      81ac282-redhat
 Built:           Tue Nov  1 12:01:01 2016
 OS/Arch:         linux/amd64

Comment 22 Dongbo Yan 2016-11-14 07:52:35 UTC
The workaround steps I performed:
1. After the OpenShift cluster is ready, restart the iptables service
 $ systemctl restart iptables
2. Then restart the docker service
 $ systemctl restart docker

No other extra steps.

Comment 23 Johnny Liu 2016-11-14 10:53:19 UTC
This bug is blocked by 1394491.

Comment 24 Johnny Liu 2016-11-14 13:34:19 UTC
After manually applying PR#2770, the fix works well on an OpenStack install, while on an AWS install it still fails.

Digging more, I found that on an AWS install I have to open UDP port 53 on the node host to allow traffic from the container to the dnsmasq service, while on an OpenStack install that is not needed.

Comment 25 Johnny Liu 2016-11-14 14:15:39 UTC
(In reply to Johnny Liu from comment #24)
> After manually applying PR#2770, the fix works well on an OpenStack
> install, while on an AWS install it still fails.
> 
> Digging more, I found that on an AWS install I have to open UDP port 53 on
> the node host to allow traffic from the container to the dnsmasq service,
> while on an OpenStack install that is not needed.

Actually an OpenStack install also has the same problem (port 53 needs to be opened in iptables).

The reason my previous testing on OpenStack succeeded was that the installer had also added an office public DNS server to /etc/resolv.conf.

On AWS install:
# cat /etc/resolv.conf 
# Generated by NetworkManager
search ec2.internal
nameserver 172.18.11.184     ---> node's ip
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh

On openstack install:
# cat /etc/resolv.conf 
# Generated by NetworkManager
search openstacklocal lab.sjc.redhat.com
nameserver 192.168.2.9       ---> node's ip
nameserver 10.11.5.19        ---> office public ip, after remove this line, the same behavior is seen as aws install
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh

Comment 26 Johnny Liu 2016-11-15 03:25:41 UTC
Run the following command on the node to open port 53 as a workaround for comment 25:
# iptables -A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 53 -j ACCEPT
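
Note that a rule added this way lives only in the running ruleset; one way to keep it across an iptables.service restart on RHEL 7 (assuming the iptables-services package is in use) is:
# service iptables save        (writes the current rules to /etc/sysconfig/iptables)
or add the equivalent line to /etc/sysconfig/iptables by hand.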

Comment 27 Dan Winship 2016-11-15 14:39:48 UTC
(In reply to Dongbo Yan from comment #22)
> The work around steps I perform:
> 1.After openshift cluster is ready, restart iptables service
>  $ systemctl restart iptables
> 2.Then restart docker service
>  $ systemctl restart docker
> 
> no other extra steps

If you manually restart iptables, you have to restart both docker and openshift afterward. But you shouldn't be manually restarting iptables anyway; we know how things behave when the services are *manually* restarted. This bug was about making sure that they get started in the right order at boot time. So the test should be, boot the machine, and if things work, then the services were started in the right order and the bug is fixed.
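
A simple post-boot check of that ordering (unit names as used on an RPM-based node):
# systemctl show -p ActiveEnterTimestamp iptables.service docker.service
iptables.service should show the earlier timestamp; if docker came up first, its NAT and FORWARD rules will have been flushed when iptables.service loaded the saved ruleset.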

Comment 28 Russell Teague 2016-11-15 15:08:25 UTC
Created attachment 1220860 [details]
3.4 node iptables

default rules after also adding "iptables -A INPUT -i docker0 -j ACCEPT"

Comment 29 Dan Winship 2016-11-15 15:33:06 UTC
Ignore comment 27... the comment it was replying to was a red herring. The problem is that since docker traffic is no longer being routed through OVS, we need a rule to accept docker0 traffic.
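
For reference, the rule from comment 28 together with a quick check that it is in place (a sketch; the final fix may install it differently):
# iptables -A INPUT -i docker0 -j ACCEPT
# iptables -C INPUT -i docker0 -j ACCEPT && echo "docker0 ACCEPT rule present"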

Comment 30 Dongbo Yan 2016-11-24 10:08:43 UTC
openshift v3.4.0.29+ca980ba
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

Verified

Comment 32 errata-xmlrpc 2017-01-18 12:48:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

