On OCP 3.10 environments on AWS (openshift v3.10.23), dnsmasq hangs and stops resolving the DNS queries needed when deploying pods on a compute node and pulling images from registries such as docker.io. This happens after running the OpenShift extended conformance tests followed by successfully deploying 500 pause-pods on each of two compute nodes (Node Vertical test), and is reproducible on several environments. The dnsmasq service shows as running but cannot resolve any DNS queries on that one compute node. The only way to recover is to restart the dnsmasq service or reboot the compute node.

The kubeletArguments max-pods value was raised from the default of 250 to 510 in the compute-node configmap in the openshift-node namespace, to allow 500 pause-pods per compute node.

# host docker.io
Host docker.io.us-west-2.compute.internal not found: 5(REFUSED)

# dig docker.io

; <<>> DiG 9.9.4-RedHat-9.9.4-70.el7 <<>> docker.io
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 37291
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

# docker pull docker.io/ocpqe/hello-pod
Using default tag: latest
Trying to pull repository docker.io/ocpqe/hello-pod ...
Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io on 172.31.34.104:53: server misbehaving

# docker pull gcr.io/google_containers/pause-amd64
Using default tag: latest
Trying to pull repository gcr.io/google_containers/pause-amd64 ...
Get https://gcr.io/v1/_ping: dial tcp: lookup gcr.io on 172.31.34.104:53: server misbehaving

# systemctl status dnsmasq
● dnsmasq.service - DNS caching server.
   Loaded: loaded (/usr/lib/systemd/system/dnsmasq.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-07-23 19:48:57 UTC; 1 day 22h ago
 Main PID: 4343 (dnsmasq)
   Memory: 1.5M
   CGroup: /system.slice/dnsmasq.service
           └─4343 /usr/sbin/dnsmasq -k

Jul 25 18:06:31 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: using nameserver 127.0.0.1#53 for domain in-addr.arpa
Jul 25 18:06:31 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: using nameserver 127.0.0.1#53 for domain cluster.local
Jul 25 18:07:01 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: setting upstream servers from DBus
Jul 25 18:07:01 ip-172-31-34-104.us-west-2.compute.internal dnsmasq[4343]: using nameserver 172.31.0.2#53

Version-Release number of selected component (if applicable):
openshift v3.10.23
kubernetes v1.10.0+b81c8f8

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OCP 3.10 environment (1 master/etcd, 1 infra, and 2 compute nodes, m4.xlarge instance type) on AWS EC2, using the openshift-ansible/playbooks/deploy_cluster.yml playbook.
2. On the master node, run "oc edit node-config-compute -n openshift-node" and add under kubeletArguments:
     max-pods:
     - '510'
3. On each compute node: systemctl restart atomic-openshift-node
4. Run the SVT conformance suite of tests on the master node from Jenkins: https://github.com/openshift/svt/blob/master/conformance/svt_conformance.sh
5. Run the SVT Node Vertical test with the nodeVertical.yaml cluster-loader golang config file, which deploys 500 pause-pods (gcr.io/google_containers/pause-amd64) per compute node: https://github.com/openshift/svt/blob/master/openshift_scalability/ci/scripts/pbench-controller-start-node-vertical.sh
6. Run host or dig commands on the first compute node, or try to docker pull any image; the DNS lookup fails:
   - docker pull docker.io/ocpqe/hello-pod
   - docker pull gcr.io/google_containers/pause-amd64
   - host docker.io; dig docker.io; dig gcr.io

Actual results:
(see the dig and host command failures in the description above)

# docker pull docker.io/ocpqe/hello-pod
Using default tag: latest
Trying to pull repository docker.io/ocpqe/hello-pod ...
Get https://registry-1.docker.io/v2/: dial tcp: lookup registry-1.docker.io on 172.31.34.104:53: server misbehaving

Expected results:

# docker pull docker.io/ocpqe/hello-pod
Using default tag: latest
Trying to pull repository docker.io/ocpqe/hello-pod ...
latest: Pulling from docker.io/ocpqe/hello-pod
Digest: sha256:04b6af86b03c1836211be2589db870dba09b7811c197c47c07fbbe33c7f80ef7
Status: Image is up to date for docker.io/ocpqe/hello-pod:latest

# host docker.io
docker.io has address 52.73.11.219
docker.io has address 52.204.111.1
docker.io has address 34.203.15.50
docker.io mail is handled by 30 aspmx5.googlemail.com.
docker.io mail is handled by 10 aspmx.l.google.com.
docker.io mail is handled by 20 alt1.aspmx.l.google.com.
docker.io mail is handled by 30 aspmx2.googlemail.com.
docker.io mail is handled by 30 aspmx3.googlemail.com.
docker.io mail is handled by 30 aspmx4.googlemail.com.

# dig docker.io

; <<>> DiG 9.9.4-RedHat-9.9.4-70.el7 <<>> docker.io
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44443
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;docker.io.			IN	A

;; ANSWER SECTION:
docker.io.		52	IN	A	34.203.15.50
docker.io.		52	IN	A	52.73.11.219
docker.io.		52	IN	A	52.204.111.1

;; Query time: 0 msec
;; SERVER: 172.31.53.59#53(172.31.53.59)
;; WHEN: Wed Jul 25 19:55:26 UTC 2018
;; MSG SIZE  rcvd: 86

Additional info:
Links to journal logs from the compute node with the dnsmasq failures will be in the next private comment.
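The failing and expected outputs above can be checked in one pass. A minimal triage sketch (the registry list and the parsing of dig's HEADER line are examples, not part of the original report; the resolver defaults to the node's local dnsmasq at 127.0.0.1):

```shell
#!/bin/sh
# Query each registry hostname against the local resolver and report the
# DNS status code; an affected node returns REFUSED for every name.
command -v dig >/dev/null 2>&1 || { echo "dig (bind-utils) not installed"; exit 0; }

resolver="${RESOLVER:-127.0.0.1}"
for host in docker.io registry-1.docker.io gcr.io; do
    # Extract the "status: XXX" field from dig's HEADER comment line.
    status=$(dig @"$resolver" "$host" +noall +comments 2>/dev/null \
             | awk -F'status: ' '/status:/ {print $2}' | cut -d, -f1)
    echo "$host -> ${status:-NO-RESPONSE}"
done
```

On a healthy node every line ends in NOERROR; on the hung node the loop prints REFUSED (or NO-RESPONSE if dnsmasq stops answering entirely).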
Created attachment 1470566 [details] dnsmasq configuration on compute node
I'm trying to replicate the issue. Did you use the Ansible installer for this cluster?
@Ivan, yes, I used the openshift-ansible deploy_cluster.yml playbook to build this cluster.
@walid, can you still reproduce this bug after installing with the fixed PR and running your testing scripts?
@Weibin, the PR fix in Comment 11 appears to resolve this issue. I ran the same automated scripts (SVT Conformance followed by the Node Vertical test with 500 pods per node) on AWS clusters installed with the openshift-ansible PR fix. So far, I am no longer hitting the dnsmasq failures. Also, after the test cases that used to leave 1052+ files open by dnsmasq, I am now seeing only 40-50 open files while the next test case executes, so dnsmasq appears to be closing its files properly:

# lsof 2>/dev/null | grep dnsmasq | wc -l
50
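The lsof count above can also be compared directly against the limit the process is actually running under. A minimal sketch (not from the original report; the PID defaults to the current shell so the script runs anywhere — on an affected node you would pass dnsmasq's PID, e.g. from "systemctl show dnsmasq -p MainPID"):

```shell
#!/bin/sh
# Compare a process's open file descriptors against its soft "Max open
# files" limit, the resource dnsmasq was exhausting in this bug.
pid="${1:-$$}"

# Count entries in the process's fd table.
open_fds=$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)
# Column 4 of the "Max open files" row is the soft limit.
soft_limit=$(awk '/Max open files/ {print $4}' /proc/"$pid"/limits)

echo "pid=$pid open_fds=$open_fds soft_limit=$soft_limit"
if [ "$open_fds" -ge "$soft_limit" ]; then
    echo "WARNING: fd limit reached; new sockets (and DNS replies) will fail"
fi
```

When open_fds approaches soft_limit (1024 by default for dnsmasq before the fix), the symptoms in this report start to appear.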
@Walid, thanks for your confirmation.
Verified on OCP v3.11.0-0.19.0:

# cd /etc/systemd/system/dnsmasq.service.d
# cat override.conf
[Service]
LimitNOFILE=65535

The Node Vertical test with 500 pods per node was executed successfully.
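For a node not yet carrying the openshift-ansible fix, the same drop-in verified above can be applied by hand. A hedged sketch (the ROOT prefix is my addition for dry runs; set ROOT=/ on a real node, and the systemctl steps are left as comments since they only apply there):

```shell
#!/bin/sh
# Install a systemd drop-in raising dnsmasq's file-descriptor limit.
# ROOT defaults to a scratch directory so the sketch is safe to run
# anywhere; use ROOT=/ on an actual compute node.
ROOT="${ROOT:-$(mktemp -d)}"
dropin_dir="$ROOT/etc/systemd/system/dnsmasq.service.d"

mkdir -p "$dropin_dir"
cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
LimitNOFILE=65535
EOF
echo "wrote $dropin_dir/override.conf"

# On the real node, reload systemd and restart the service:
#   systemctl daemon-reload && systemctl restart dnsmasq
#   systemctl show dnsmasq -p LimitNOFILE   # confirm the new limit
```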
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652