Bug 1608571 - OCP 3.10: unable to pull images on compute node due to dnsmasq failures after running scale tests
Summary: OCP 3.10: unable to pull images on compute node due to dnsmasq failures after running scale tests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.10.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Ivan Chavero
QA Contact: zhaozhanqi
URL:
Whiteboard: aos-scalability-310
Depends On:
Blocks:
 
Reported: 2018-07-25 20:16 UTC by Walid A.
Modified: 2022-08-04 22:20 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-11 07:22:15 UTC
Target Upstream Version:
Embargoed:


Attachments
dnsmasq configuration on compute node (26.63 KB, text/plain)
2018-07-26 02:11 UTC, Walid A.


Links
Github: https://github.com/openshift/openshift-ansible/pull/9514 (last updated 2020-07-06 06:39:14 UTC)
Red Hat Product Errata: RHBA-2018:2652 (last updated 2018-10-11 07:22:40 UTC)

Description Walid A. 2018-07-25 20:16:35 UTC
On OCP 3.10 environments on AWS (v3.10.23), dnsmasq hangs and fails to resolve the DNS queries needed when deploying pods on a compute node and pulling images from registries such as docker.io.  This appears to happen after running the OpenShift extended conformance tests followed by successfully deploying 500 pause-pods on each of two compute nodes (Node Vertical test).  The dnsmasq service shows as running but cannot resolve any DNS queries on that one compute node.  This is reproducible on several environments.  The only way to recover is to restart the dnsmasq service or reboot the compute node.  The kubeletArguments max-pods value was set to 510 (default was 250) in the compute node configmap in the openshift-node namespace, to allow 500 pause-pods per compute node.
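A minimal sketch of the restart workaround mentioned above, run on the affected compute node:

# systemctl restart dnsmasq
# systemctl status dnsmasq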

# host docker.io
Host docker.io.us-west-2.compute.internal not found: 5(REFUSED)

# dig docker.io

; <<>> DiG 9.9.4-RedHat-9.9.4-70.el7 <<>> docker.io
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 37291
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

# docker pull docker.io/ocpqe/hello-pod
Using default tag: latest
Trying to pull repository docker.io/ocpqe/hello-pod ... 
Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io on 172.31.34.104:53: server misbehaving

# docker pull gcr.io/google_containers/pause-amd64
Using default tag: latest
Trying to pull repository gcr.io/google_containers/pause-amd64 ... 
Get https://gcr.io/v1/_ping: dial tcp: lookup gcr.io on 172.31.34.104:53: server misbehaving

# systemctl status dnsmasq
● dnsmasq.service - DNS caching server.
   Loaded: loaded (/usr/lib/systemd/system/dnsmasq.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-07-23 19:48:57 UTC; 1 day 22h ago
 Main PID: 4343 (dnsmasq)
   Memory: 1.5M
   CGroup: /system.slice/dnsmasq.service
           └─4343 /usr/sbin/dnsmasq -k

Jul 25 18:06:31 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: using nameserver 127.0.0.1#53 for domain in-addr.arpa
Jul 25 18:06:31 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: using nameserver 127.0.0.1#53 for domain cluster.local
Jul 25 18:07:01 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: setting upstream servers from DBus
Jul 25 18:07:01 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: using nameserver 172.31.0.2#53
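
The status output above shows dnsmasq forwarding to the AWS VPC resolver (172.31.0.2), while the docker pull failures show clients hitting dnsmasq on the node address (172.31.34.104).  One way to narrow down whether dnsmasq itself or the upstream resolver is refusing queries is to query each directly; a diagnostic sketch, assuming dig from bind-utils as used above:

# dig @172.31.34.104 docker.io +short    (dnsmasq on the node)
# dig @172.31.0.2 docker.io +short       (upstream VPC resolver)

If the upstream query answers while the node query is REFUSED, the failure is local to dnsmasq.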



Version-Release number of selected component (if applicable):
openshift v3.10.23
kubernetes v1.10.0+b81c8f8

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OCP 3.10 environment (1 master/etcd, 1 infra, and 2 compute nodes), m4.xlarge instance type on AWS EC2, using the openshift-ansible/playbooks/deploy_cluster.yml playbook

2. On the master node, edit the compute node configmap: oc edit cm node-config-compute -n openshift-node
Add under kubeletArguments (a verification sketch follows the steps below):
      max-pods:
      - '510'
3. On each compute node: systemctl restart atomic-openshift-node

4. Run the SVT conformance suite of tests on the master node from Jenkins https://github.com/openshift/svt/blob/master/conformance/svt_conformance.sh

5. Run the SVT Node Vertical test with the nodeVertical.yaml cluster-loader golang config file which deploys 500 pause-pods (gcr.io/google_containers/pause-amd64) per compute node:
https://github.com/openshift/svt/blob/master/openshift_scalability/ci/scripts/pbench-controller-start-node-vertical.sh

6. Run host or dig commands on the first compute node, or try to docker pull any image; each fails with a DNS lookup error:
- docker pull docker.io/ocpqe/hello-pod
- docker pull gcr.io/google_containers/pause-amd64
- host docker.io; dig docker.io; dig gcr.io
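
To confirm the change from steps 2-3 actually reached the compute nodes, the setting can be checked both in the configmap and on disk.  This is a verification sketch; /etc/origin/node/node-config.yaml is the usual location for the rendered node config in 3.10, so adjust the path if your layout differs:

# oc get cm node-config-compute -n openshift-node -o yaml | grep -A 2 'max-pods'

and on a compute node itself:

# grep -A 2 'max-pods' /etc/origin/node/node-config.yaml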


Actual results:
(see the dig and host command failures in the description above)
# docker pull docker.io/ocpqe/hello-pod
Using default tag: latest
Trying to pull repository docker.io/ocpqe/hello-pod ... 
Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io on 172.31.34.104:53: server misbehaving


Expected results:
# docker pull docker.io/ocpqe/hello-pod
Using default tag: latest
Trying to pull repository docker.io/ocpqe/hello-pod ... 
latest: Pulling from docker.io/ocpqe/hello-pod
Digest: sha256:04b6af86b03c1836211be2589db870dba09b7811c197c47c07fbbe33c7f80ef7
Status: Image is up to date for docker.io/ocpqe/hello-pod:latest

# host docker.io
docker.io has address 52.73.11.219
docker.io has address 52.204.111.1
docker.io has address 34.203.15.50
docker.io mail is handled by 30 aspmx5.googlemail.com.
docker.io mail is handled by 10 aspmx.l.google.com.
docker.io mail is handled by 20 alt1.aspmx.l.google.com.
docker.io mail is handled by 30 aspmx2.googlemail.com.
docker.io mail is handled by 30 aspmx3.googlemail.com.
docker.io mail is handled by 30 aspmx4.googlemail.com.

# dig docker.io

; <<>> DiG 9.9.4-RedHat-9.9.4-70.el7 <<>> docker.io
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44443
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;docker.io.			IN	A

;; ANSWER SECTION:
docker.io.		52	IN	A	34.203.15.50
docker.io.		52	IN	A	52.73.11.219
docker.io.		52	IN	A	52.204.111.1

;; Query time: 0 msec
;; SERVER: 172.31.53.59#53(172.31.53.59)
;; WHEN: Wed Jul 25 19:55:26 UTC 2018
;; MSG SIZE  rcvd: 86

# 

Additional info:

Links to journal logs from the compute node with the dnsmasq failures will be in the next (private) comment.

Comment 2 Walid A. 2018-07-26 02:11:31 UTC
Created attachment 1470566 [details]
dnsmasq configuration on compute node

Comment 3 Ivan Chavero 2018-08-01 22:55:35 UTC
I'm trying to replicate the issue. Did you use the Ansible installer for this cluster?

Comment 4 Walid A. 2018-08-02 00:41:40 UTC
@Ivan, yes, I used the openshift-ansible deploy_cluster.yml playbook to build this cluster.

Comment 13 Weibin Liang 2018-08-13 13:32:26 UTC
@Walid, can you still reproduce this bug after installing with the PR fix and running your testing scripts?

Comment 14 Walid A. 2018-08-13 16:08:22 UTC
@Weibin, the PR fix in Comment 11 appears to resolve this issue.  I ran the same automated scripts (SVT Conformance followed by the Node Vertical test with 500 pods per node) on AWS clusters installed with the openshift-ansible PR fix.  So far I am no longer hitting the dnsmasq failures.  Also, after the test cases that previously left 1052+ files open by dnsmasq, I now see only 40-50 open files while the next test case executes, so dnsmasq appears to be closing its files properly:

# lsof 2>/dev/null | grep dnsmasq | wc -l
50
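
For a per-process view rather than grepping all of lsof, the descriptor count and the effective limit can also be read from /proc; a quick sketch, assuming a single dnsmasq process:

# ls /proc/$(pidof dnsmasq)/fd | wc -l
# grep 'Max open files' /proc/$(pidof dnsmasq)/limits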

Comment 15 Weibin Liang 2018-08-13 17:30:05 UTC
@Walid, thanks for your confirmation.

Comment 17 Walid A. 2018-08-23 16:36:02 UTC
Verified on OCP v3.11.0-0.19.0:

# cd /etc/systemd/system/dnsmasq.service.d
# cat override.conf
[Service]
LimitNOFILE=65535

Node Vertical test with 500 pods per node was successfully executed.
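
For anyone applying the same limit by hand on an existing node, rather than through the openshift-ansible change, a sketch of the equivalent steps (the override.conf content matches what is shown above):

# mkdir -p /etc/systemd/system/dnsmasq.service.d
# cat > /etc/systemd/system/dnsmasq.service.d/override.conf <<'EOF'
[Service]
LimitNOFILE=65535
EOF
# systemctl daemon-reload
# systemctl restart dnsmasq
# systemctl show dnsmasq --property=LimitNOFILE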

Comment 20 errata-xmlrpc 2018-10-11 07:22:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

