Bug 1731454

Summary: internal service DNS resolution stopped working after node reboot (dnsmasq spamming dbus errors)
Product: OpenShift Container Platform
Component: Networking
Networking sub component: DNS
Version: 3.11.0
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Linux
Type: Bug
Reporter: Christian Koep <ckoep>
Assignee: Dan Mace <dmace>
QA Contact: Hongan Li <hongli>
CC: aos-bugs, dmace
Last Closed: 2019-07-23 11:55:17 UTC

Description Christian Koep 2019-07-19 12:49:07 UTC
######## Description of problem:

* After restarting an OpenShift Container Platform 3.11 worker node (which worked fine before), the resolution of OCP services (e.g. docker-registry.default.svc, glusterfs-dynamic-mongodb.foo.svc.cluster.local) stopped working. Names outside the cluster, both on the internal network (non-OCP services) and on the internet (e.g. google.com), still resolve fine.

* The dnsmasq service spams the following error messages (see private attachments as well):

~~~
# systemctl status dnsmasq
● dnsmasq.service - DNS caching server.
   Loaded: loaded (/usr/lib/systemd/system/dnsmasq.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/dnsmasq.service.d
           └─override.conf
   Active: active (running) since Fri 2019-07-19 10:56:11 CEST; 1h 55min ago
 Main PID: 9603 (dnsmasq)
   Memory: 8.3M
   CGroup: /system.slice/dnsmasq.service
           └─9603 /usr/sbin/dnsmasq -k

Jul 19 12:51:15 czcholspc004095.prg-dc.dhl.com dnsmasq[9603]: DBus error: Connection ":1.0" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policie...ation file
Jul 19 12:51:15 czcholspc004095.prg-dc.dhl.com dnsmasq[9603]: DBus error: Connection ":1.0" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policie...ation file
Jul 19 12:51:16 czcholspc004095.prg-dc.dhl.com dnsmasq[9603]: DBus error: Connection ":1.0" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policie...ation file
Jul 19 12:51:16 czcholspc004095.prg-dc.dhl.com dnsmasq[9603]: DBus error: Connection ":1.0" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policie...ation file
Jul 19 12:51:16 czcholspc004095.prg-dc.dhl.com dnsmasq[9603]: DBus error: Connection ":1.0" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policie...ation file
Jul 19 12:51:16 czcholspc004095.prg-dc.dhl.com dnsmasq[9603]: DBus error: Connection ":1.0" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policie...ation file
Jul 19 12:51:17 czcholspc004095.prg-dc.dhl.com dnsmasq[9603]: DBus error: Connection ":1.0" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policie...ation file
Jul 19 12:51:17 czcholspc004095.prg-dc.dhl.com dnsmasq[9603]: DBus error: Connection ":1.0" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policie...ation file
Jul 19 12:51:17 czcholspc004095.prg-dc.dhl.com dnsmasq[9603]: DBus error: Connection ":1.0" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policie...ation file
Jul 19 12:51:17 czcholspc004095.prg-dc.dhl.com dnsmasq[9603]: DBus error: Connection ":1.0" is not allowed to own the service "uk.org.thekelleys.dnsmasq" due to security policie...ation file
Hint: Some lines were ellipsized, use -l to show in full.
~~~
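
To quantify how often the error above is logged, the journal output can be piped through a small filter. This is only a sketch: the function name `count_dbus_errors` is made up for illustration, and the pattern assumes the message wording shown in the status output above.

```shell
# Count dnsmasq DBus name-ownership errors in log text read from stdin.
# (count_dbus_errors is a hypothetical helper name, not part of any tool.)
count_dbus_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so "|| true" keeps the function's status at 0.
    grep -c 'DBus error: .*is not allowed to own the service "uk.org.thekelleys.dnsmasq"' || true
}

# Possible usage on the affected node (assumes access to the systemd journal):
#   journalctl -u dnsmasq -b --no-pager | count_dbus_errors
# The full, non-ellipsized messages can be shown with:
#   systemctl status -l dnsmasq
```

A count that keeps growing between two runs would indicate that dnsmasq is still retrying the bus name registration.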

######## Version-Release number of selected component (if applicable):
* Red Hat OpenShift Container Platform 3.11.88
* Red Hat Enterprise Linux Server release 7.6 (booted kernel: 3.10.0-957.5.1.el7.x86_64)
* dnsmasq 2.76-7.el7
* dbus 1.10.24-12.el7

######## How reproducible:
* Currently only reproducible on a single node

######## Actual results:
~~~
    # nslookup glusterfs-dynamic-mongodb.foo.svc.cluster.local
    Server:         1.2.3.4 (obfuscated)
    Address:        1.2.3.4#53 (obfuscated)
    ** server can't find glusterfs-dynamic-mongodb.foo.svc.cluster.local: NXDOMAIN
~~~
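
The split between failing cluster names and working external names can be checked in one pass. The following is a minimal sketch, not the procedure used for this report: `lookup` and `check_names` are hypothetical helper names, and `getent hosts` goes through the system resolver (nsswitch) rather than querying the DNS server directly as `nslookup` does, so results may differ slightly.

```shell
# Return success if the name resolves through the system resolver.
# (lookup and check_names are hypothetical helper names.)
lookup() {
    getent hosts "$1" >/dev/null 2>&1
}

# Print one "OK" or "FAIL" line per name given as an argument.
check_names() {
    for name in "$@"; do
        if lookup "$name"; then
            echo "OK   $name"
        else
            echo "FAIL $name"
        fi
    done
}

# Possible usage on the affected node, with the names from this report:
#   check_names google.com docker-registry.default.svc \
#       glusterfs-dynamic-mongodb.foo.svc.cluster.local
```

On a node showing the problem described here, the external name would be expected to report OK while the `*.svc` names report FAIL.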

######## Expected results:
* DNS resolution works as expected.

######## Additional info:
* See sosreports etc. attached privately.

Comment 8 Christian Koep 2019-07-23 11:55:17 UTC
Hi,

It turns out that rebooting the node once more resolved the issue completely, as described in the following knowledge base solution:

    - [ dnsmasq and other services failed to start with dbus error in RHEL 7.4 or later ]
      https://access.redhat.com/solutions/3412631

I'm closing this Bugzilla for the time being.

Kind regards,
Christian

Comment 9 Red Hat Bugzilla 2023-09-15 01:28:28 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days