Bug 1560489

Summary: DNS issues needing restart of dnsmasq service - 'could not resolve host' error
Product: OpenShift Container Platform
Reporter: Sudarshan Chaudhari <suchaudh>
Component: Networking
Assignee: Phil Cameron <pcameron>
Networking sub component: openshift-sdn
QA Contact: Meng Bo <bmeng>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: agk, aos-bugs, bbennett, dmace, faltahe, hongli, jbautist, jkaur, jlee, pcameron, stwalter, suchaudh, vwalek
Version: 3.7.0
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-30 19:11:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sudarshan Chaudhari 2018-03-26 09:57:35 UTC
Description of problem:

The dnsmasq service freezes at random and has to be restarted manually before name resolution works again.

No logs are being captured for the dnsmasq service on the OpenShift nodes.

Actual results:
Hostname resolution fails on the nodes, so pods are unable to resolve hosts either.

Expected results:
Even if the NetworkManager or dnsmasq service restarts, the config should remain correct and resolution should keep working.

Comment 3 Ben Bennett 2018-03-26 13:50:41 UTC
The recent changes that made dnsmasq handle all traffic for a node may have overloaded the current limits set in our config. We should see whether it makes sense to increase the limits generally, or whether we need to make them configurable.
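
To illustrate the "configurable" option, here is a minimal Python sketch that renders the two dnsmasq limits from overridable settings. The function name and its defaults are hypothetical and are not taken from openshift-ansible; the option names and the value 10000 are the ones reported in the verification later in this bug.

```python
# Hypothetical helper for illustration only; not the openshift-ansible code.
def render_dns_limits(dns_forward_max=10000, cache_size=10000):
    """Return dnsmasq config lines for the two limits discussed in this bug."""
    settings = (("dns-forward-max", dns_forward_max), ("cache-size", cache_size))
    for name, value in settings:
        # Reject anything that is not a non-negative integer.
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    # One "option=value" line per setting, each terminated by a newline.
    return "".join(f"{name}={value}\n" for name, value in settings)
```

Calling render_dns_limits() with no arguments yields the raised limits, while a deployment that needs different values could pass its own.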

Comment 7 Phil Cameron 2018-04-19 12:17:53 UTC
https://github.com/openshift/openshift-ansible/pull/8042
WIP until this change is verified to work for the customer.

Comment 9 openshift-github-bot 2018-04-30 12:52:47 UTC
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/1ca76dd879cb6187fa7137a0e36c36461cea3776
dnsmasq - increase dns-forward-max, cache-size

bug 1560489
https://bugzilla.redhat.com/show_bug.cgi?id=1560489

Signed-off-by: Phil Cameron <pcameron>

Comment 12 Hongan Li 2018-05-17 02:41:57 UTC
Verified in atomic-openshift-3.10.0-0.47.0.git.0.2fffa04.el7. The options "dns-forward-max" and "cache-size" have been increased to 10000, as shown below:

[root@ip-172-18-5-139 ~]# cat /etc/dnsmasq.d/origin-dns.conf 
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
dns-forward-max=10000
cache-size=10000
bind-dynamic
except-interface=lo
# End of config

OS: Red Hat Enterprise Linux Server release 7.5 (Maipo)
kernel: Linux ip-172-18-5-139.ec2.internal 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
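
The same check can be scripted rather than read off by eye. The following is a rough Python sketch, assuming Python is available wherever the check runs; the function names and the SAMPLE constant are hypothetical, and the parser only handles the simple "option" and "option=value" lines seen in the config above.

```python
# Sample input copied from the origin-dns.conf output in this comment.
SAMPLE = """\
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
dns-forward-max=10000
cache-size=10000
bind-dynamic
except-interface=lo
# End of config
"""


def parse_dnsmasq_options(text):
    """Parse dnsmasq config text into a dict; bare flags map to None.

    If an option is repeated, the last occurrence wins, which is
    sufficient for the two single-valued limits checked here.
    """
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        options[key.strip()] = value.strip() if sep else None
    return options


def limits_ok(text, minimum=10000):
    """Return True if both limits are present and at least `minimum`."""
    opts = parse_dnsmasq_options(text)
    try:
        return all(int(opts[name]) >= minimum
                   for name in ("dns-forward-max", "cache-size"))
    except (KeyError, TypeError, ValueError):
        # Option missing, given without a value, or not an integer.
        return False
```

On a node, the contents of /etc/dnsmasq.d/origin-dns.conf could be read and passed to limits_ok() in place of SAMPLE.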

Comment 17 errata-xmlrpc 2018-07-30 19:11:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

Comment 19 Alasdair Kergon 2019-11-12 13:29:57 UTC
Clearing all the ignored old needinfo requests, picking a random required subcomponent.