Bug 1608571

Summary: OCP 3.10: unable to pull images on compute node due to dnsmasq failures after running scale tests
Product: OpenShift Container Platform
Reporter: Walid A. <wabouham>
Component: Networking
Assignee: Ivan Chavero <ichavero>
Networking sub component: router
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: aos-bugs, dmace, hongli, mifiedle, shzhou, tibrahim, tmanor, wabouham, weliang, wmeng
Version: 3.10.0
Keywords: NeedsTestCase
Target Milestone: ---   
Target Release: 3.11.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: aos-scalability-310
Last Closed: 2018-10-11 07:22:15 UTC
Type: Bug
Attachments:
dnsmasq configuration on compute node (attachment 1470566, see Comment 2)

Description Walid A. 2018-07-25 20:16:35 UTC
On OCP 3.10 environments on AWS (version v3.10.23), dnsmasq hangs and fails to resolve the DNS queries needed when deploying pods on a compute node and pulling images from registries such as docker.io. This happens after running the OpenShift extended conformance tests followed by successfully deploying 500 pause-pods on each of two compute nodes (Node Vertical test). The dnsmasq service still shows as running, but it cannot resolve any DNS queries on that compute node. The problem is reproducible on several environments and consistently appears after running the conformance tests followed by the Node Vertical scale test. The only way to recover is to restart the dnsmasq service or reboot the compute node (a sketch of the recovery steps follows the output below). The kubeletArguments max-pods value was set to 510 (default 250) in the compute node configmap in the openshift-node namespace to allow 500 pause-pods per compute node.

# host docker.io
Host docker.io.us-west-2.compute.internal not found: 5(REFUSED)

# dig docker.io

; <<>> DiG 9.9.4-RedHat-9.9.4-70.el7 <<>> docker.io
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 37291
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

# docker pull docker.io/ocpqe/hello-pod
Using default tag: latest
Trying to pull repository docker.io/ocpqe/hello-pod ... 
Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io on 172.31.34.104:53: server misbehaving

# docker pull gcr.io/google_containers/pause-amd64
Using default tag: latest
Trying to pull repository gcr.io/google_containers/pause-amd64 ... 
Get https://gcr.io/v1/_ping: dial tcp: lookup gcr.io on 172.31.34.104:53: server misbehaving

# systemctl status dnsmasq
● dnsmasq.service - DNS caching server.
   Loaded: loaded (/usr/lib/systemd/system/dnsmasq.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-07-23 19:48:57 UTC; 1 day 22h ago
 Main PID: 4343 (dnsmasq)
   Memory: 1.5M
   CGroup: /system.slice/dnsmasq.service
           └─4343 /usr/sbin/dnsmasq -k

Jul 25 18:06:31 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: using nameserver 127.0.0.1#53 for domain in-addr.arpa
Jul 25 18:06:31 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: using nameserver 127.0.0.1#53 for domain cluster.local
Jul 25 18:07:01 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: setting upstream servers from DBus
Jul 25 18:07:01 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: using nameserver 172.31.0.2#53
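
As noted in the description, the only immediate recovery is to restart dnsmasq on the affected compute node. A minimal sketch of the restart and re-check, with output omitted:

# systemctl restart dnsmasq
# systemctl status dnsmasq
# dig docker.io
# host docker.io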



Version-Release number of selected component (if applicable):
openshift v3.10.23
kubernetes v1.10.0+b81c8f8

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OCP 3.10 environment (1 master/etcd, 1 infra, and 2 compute nodes), m4.xlarge instance type on AWS EC2, using the openshift-ansible/playbooks/deploy_cluster.yml playbook

2. On the master node, edit the compute node configmap: oc edit configmap node-config-compute -n openshift-node
Add under kubeletArguments (a sketch of the resulting fragment follows these steps):
      max-pods:
      - '510'
3. On each compute node: systemctl restart atomic-openshift-node

4. Run the SVT conformance suite of tests on the master node from Jenkins https://github.com/openshift/svt/blob/master/conformance/svt_conformance.sh

5. Run the SVT Node Vertical test with the nodeVertical.yaml golang cluster-loader config file, which deploys 500 pause-pods (gcr.io/google_containers/pause-amd64) per compute node:
https://github.com/openshift/svt/blob/master/openshift_scalability/ci/scripts/pbench-controller-start-node-vertical.sh

6. Run host or dig commands on the first compute node, or try to docker pull any image; the DNS lookups fail:
- docker pull docker.io/ocpqe/hello-pod
- docker pull gcr.io/google_containers/pause-amd64
- host docker.io; dig docker.io; dig gcr.io
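
For step 2, a minimal sketch of the relevant fragment of node-config-compute after the edit, assuming the configmap carries its settings under the node-config.yaml data key as in a default 3.10 install (only max-pods is from this bug; surrounding keys are omitted):

kubeletArguments:
  max-pods:
  - '510'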


Actual results:
(see the dig and host command failures in the description above)
# docker pull docker.io/ocpqe/hello-pod
Using default tag: latest
Trying to pull repository docker.io/ocpqe/hello-pod ... 
Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io on 172.31.34.104:53: server misbehaving


Expected results:
# docker pull docker.io/ocpqe/hello-pod
Using default tag: latest
Trying to pull repository docker.io/ocpqe/hello-pod ... 
latest: Pulling from docker.io/ocpqe/hello-pod
Digest: sha256:04b6af86b03c1836211be2589db870dba09b7811c197c47c07fbbe33c7f80ef7
Status: Image is up to date for docker.io/ocpqe/hello-pod:latest

# host docker.io
docker.io has address 52.73.11.219
docker.io has address 52.204.111.1
docker.io has address 34.203.15.50
docker.io mail is handled by 30 aspmx5.googlemail.com.
docker.io mail is handled by 10 aspmx.l.google.com.
docker.io mail is handled by 20 alt1.aspmx.l.google.com.
docker.io mail is handled by 30 aspmx2.googlemail.com.
docker.io mail is handled by 30 aspmx3.googlemail.com.
docker.io mail is handled by 30 aspmx4.googlemail.com.

# dig docker.io

; <<>> DiG 9.9.4-RedHat-9.9.4-70.el7 <<>> docker.io
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44443
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;docker.io.			IN	A

;; ANSWER SECTION:
docker.io.		52	IN	A	34.203.15.50
docker.io.		52	IN	A	52.73.11.219
docker.io.		52	IN	A	52.204.111.1

;; Query time: 0 msec
;; SERVER: 172.31.53.59#53(172.31.53.59)
;; WHEN: Wed Jul 25 19:55:26 UTC 2018
;; MSG SIZE  rcvd: 86

# 

Additional info:

Links to the journal logs from the compute node with the dnsmasq failures will be in the next private comment.

Comment 2 Walid A. 2018-07-26 02:11:31 UTC
Created attachment 1470566 [details]
dnsmasq configuration on compute node

Comment 3 Ivan Chavero 2018-08-01 22:55:35 UTC
I'm trying to replicate the issue. Did you use the Ansible installer for this cluster?

Comment 4 Walid A. 2018-08-02 00:41:40 UTC
@Ivan, yes, I used the openshift-ansible deploy_cluster.yml playbook to build this cluster.

Comment 13 Weibin Liang 2018-08-13 13:32:26 UTC
@Walid, can you still reproduce this bug after you install the fixed PR and run your testing scripts?

Comment 14 Walid A. 2018-08-13 16:08:22 UTC
@Weibin, the PR fix in Comment 11 appears to resolve this issue. I ran the same automated scripts (SVT conformance followed by the Node Vertical test with 500 pods per node) on AWS clusters installed with the openshift-ansible PR fix. So far, I am not hitting the dnsmasq failures anymore. Also, after the test cases that used to leave 1052+ files open by dnsmasq, I am now seeing only 40-50 open files while executing the next test case, so dnsmasq appears to be closing its files correctly:

# lsof 2>/dev/null | grep dnsmasq | wc -l
50
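
For comparison, the file-descriptor pressure can also be checked directly against the dnsmasq process rather than through lsof. A minimal sketch, assuming a single dnsmasq process whose PID pidof can find:

# grep 'open files' /proc/$(pidof dnsmasq)/limits
# ls /proc/$(pidof dnsmasq)/fd | wc -l

The first command shows the per-process limit dnsmasq was running into before the fix; the second gives a descriptor count without the duplicate per-task entries lsof can report.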

Comment 15 Weibin Liang 2018-08-13 17:30:05 UTC
@Walid, thanks for your confirmation.

Comment 17 Walid A. 2018-08-23 16:36:02 UTC
Verified on OCP v3.11.0-0.19.0:

# cd /etc/systemd/system/dnsmasq.service.d
# cat override.conf
[Service]
LimitNOFILE=65535

Node Vertical test with 500 pods per node was successfully executed.
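
For nodes that do not yet have the override in place, a minimal sketch of applying and confirming the same limit manually (the path and value match the ones verified above; on an installer-built cluster the openshift-ansible fix referenced earlier lays this file down itself):

# mkdir -p /etc/systemd/system/dnsmasq.service.d
# cat > /etc/systemd/system/dnsmasq.service.d/override.conf <<'EOF'
[Service]
LimitNOFILE=65535
EOF
# systemctl daemon-reload
# systemctl restart dnsmasq
# systemctl show dnsmasq -p LimitNOFILE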

Comment 20 errata-xmlrpc 2018-10-11 07:22:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652