Which dnsmasq process are you referring to exactly? Also, if you manage to reproduce the issue and can let me get my hands on it before you kill dnsmasq, it could help with troubleshooting.
(In reply to Omer Tuchfeld from comment #2)
> Which dnsmasq process are you referring to exactly?
> Also if you manage to reproduce the issue and can let me get my hands on it
> before you kill dnsmasq it could help troubleshoot

It's this one:

[core@cnfdf15 ~]$ sudo systemctl status dnsmasq
● dnsmasq.service - Run dnsmasq to provide local dns for Single Node OpenShift
   Loaded: loaded (/etc/systemd/system/dnsmasq.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2022-08-09 09:31:13 UTC; 11min ago
 Main PID: 2149 (dnsmasq)
    Tasks: 1 (limit: 605092)
   Memory: 1.2M
      CPU: 10min 36.081s
   CGroup: /system.slice/dnsmasq.service
           └─2149 /usr/sbin/dnsmasq -k

It comes from mc/50-master-dnsmasq-configuration

I sent the details on how to connect on slack
Maybe the same as this issue: Bug 2106361 - dnsmasq high CPU usage in 4.11 spoke deployment or after 4.10.21 to 4.11.0-rc.1 upgrade on an SNO node
After a closer look at the log, I think this could be our problem:

-- Reboot --
Aug 09 14:52:23 cnfdf15 systemd[1]: Started Run dnsmasq to provide local dns for Single Node OpenShift.
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: started, version 2.79 cachesize 150
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: compile time options: IPv6 GNU-getopt DBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth DNSSEC loop-detect inotify
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: reading /etc/resolv.conf
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: using nameserver 10.8.34.79#53
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: ignoring nameserver 127.0.0.1 - local interface
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: using nameserver 10.8.34.211#53
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: read /etc/hosts - 2 addresses
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: failed to read /etc/resolv.conf: Permission denied
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: no servers found in /etc/resolv.conf, will retry
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: query[A] 0.rhel.pool.ntp.org from 127.0.0.1
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: forwarded 0.rhel.pool.ntp.org to 10.8.34.79

The line

Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: using nameserver 10.8.34.79#53

means dnsmasq may forward requests to itself. Indeed, that is what happens here:

Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: forwarded 0.rhel.pool.ntp.org to 10.8.34.79

After restarting dnsmasq, it never considers its own IP as a nameserver. Could it be that dnsmasq starts too early after boot and somehow doesn't recognize that 10.8.34.79 is the local interface?
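To check whether a node is in the state described above, one could compare the nameserver entries in resolv.conf against the node's own addresses. This is only a rough sketch: find_self_nameservers is a hypothetical helper name, and it assumes the local addresses are passed in as arguments (for example from hostname -I).

```shell
# Hypothetical helper: print every nameserver listed in a resolv.conf-style
# file ($1) that is also one of the local addresses given as remaining args.
find_self_nameservers() {
    resolv_file=$1
    shift
    for ns in $(awk '$1 == "nameserver" { print $2 }' "$resolv_file"); do
        for addr in "$@"; do
            if [ "$ns" = "$addr" ]; then
                echo "$ns"
            fi
        done
    done
}

# Example invocation on the node (assumption: hostname -I lists the node IPs):
#   find_self_nameservers /etc/resolv.conf $(hostname -I)
```

Any address printed is one that dnsmasq could end up forwarding to itself if it does not recognize it as local at startup.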
In the experiment I used this MC:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-master-dnsmasq-logging-configuration
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,bm8tcmVzb2x2CmxvZy1xdWVyaWVz
        mode: 420
        overwrite: true
        path: /etc/dnsmasq.d/log-single-node.conf

This adds:

no-resolv
log-queries

to the configuration
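For anyone who wants to verify what a MachineConfig file entry actually writes, the base64 payload of the data URL can be decoded locally. A minimal sketch, where decode_data_url is a hypothetical helper name and base64 -d is assumed to be the GNU coreutils decoder:

```shell
# Hypothetical helper: decode the base64 payload of a data: URL as used in
# the "source" field of a MachineConfig file entry.
decode_data_url() {
    # Strip everything up to and including "base64," and decode the rest.
    printf '%s' "${1#*base64,}" | base64 -d
}

# Payload taken from the MC in this comment; prints the two dnsmasq options.
decode_data_url 'data:text/plain;charset=utf-8;base64,bm8tcmVzb2x2CmxvZy1xdWVyaWVz'
```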
Yep, and no-resolv made the problem disappear
Correction - with no-resolv the spoke fails to register on the hub due to this error:

$ oc -n open-cluster-management-agent logs deployment.apps/klusterlet-registration-agent
...
E0810 05:36:41.156655 1 base_controller.go:272] ManagedClusterCreatingController reconciliation failed: Get "https://api.cnfdf13.telco5gran.eng.rdu2.redhat.com:6443/apis/cluster.open-cluster-management.io/v1/managedclusters/cnfdf15": dial tcp: lookup api.cnfdf13.telco5gran.eng.rdu2.redhat.com on 172.30.0.10:53: server misbehaving
...

The workaround that fully worked for me was to define a new file that dnsmasq would import instead of resolv.conf. This file is the exact copy of resolv.conf, but without the SNO host IP:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-master-dnsmasq-configuration-overrides
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,bG9nLXF1ZXJpZXMK
        mode: 420
        overwrite: true
        path: /etc/dnsmasq.d/log-single-node.conf
      - contents:
          source: data:text/plain;charset=utf-8;base64,cmVzb2x2LWZpbGU9L2V0Yy9yZXNvbHYub3ZlcnJpZGU=
        mode: 420
        overwrite: true
        path: /etc/dnsmasq.d/resolv-override.conf
      - contents:
          source: data:text/plain;charset=utf-8;base64,c2VhcmNoICBjbmZkZjE1LnRlbGNvNWdyYW4uZW5nLnJkdTIucmVkaGF0LmNvbSB0ZWxjbzVncmFuLmVuZy5yZHUyLnJlZGhhdC5jb20KbmFtZXNlcnZlciAxMjcuMC4wLjEKbmFtZXNlcnZlciAxMC44LjM0LjIxMQo=
        mode: 420
        overwrite: true
        path: /etc/resolv.override

E.g.,

[core@cnfdf15 ~]$ diff /etc/resolv.conf /etc/resolv.override
1d0
< # Generated by NetworkManager
3d1
< nameserver 10.8.34.79
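The override file does not have to be written by hand; it could be derived from the node's resolv.conf. A rough sketch, assuming the node IP is known and passed in explicitly; make_resolv_override is a hypothetical helper name, and it also drops comment lines to match the diff shown here:

```shell
# Hypothetical helper: print a resolv.conf-style file ($1) without comment
# lines and without the nameserver entry for the node's own IP ($2).
make_resolv_override() {
    awk -v ip="$2" '!/^#/ && !($1 == "nameserver" && $2 == ip)' "$1"
}

# Example invocation (assumption: 10.8.34.79 is the SNO host IP, as in this
# bug). base64 -w0 (GNU coreutils) yields the single-line payload that goes
# after "base64," in the MC source field:
#   make_resolv_override /etc/resolv.conf 10.8.34.79 | base64 -w0
```

Note that the resulting file is static, so it would need to be regenerated if the upstream nameservers change.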
Any update on this issue? Are we still targeting the 2.6 release? Thanks!
*** This bug has been marked as a duplicate of bug 2106361 ***