Bug 2116549
| Summary: | 4.11 SNO failed to install, dnsmasq 100% CPU | ||
|---|---|---|---|
| Product: | Red Hat Advanced Cluster Management for Kubernetes | Reporter: | Vitaly Grinberg <vgrinber> |
| Component: | Infrastructure Operator | Assignee: | Omer Tuchfeld <otuchfel> |
| Status: | CLOSED DUPLICATE | QA Contact: | Chad Crum <ccrum> |
| Severity: | high | Docs Contact: | Derek <dcadzow> |
| Priority: | urgent | ||
| Version: | rhacm-2.6 | CC: | ccrum, itsoiref, mcornea, mfilanov, otuchfel, pemensik, trwest, yfirst, yliu1, yuhe |
| Target Milestone: | --- | ||
| Target Release: | rhacm-2.6 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-22 13:16:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Comment 2
Omer Tuchfeld
2022-08-09 08:51:39 UTC
(In reply to Omer Tuchfeld from comment #2)
> Which dnsmasq process are you referring to exactly?
> Also if you manage to reproduce the issue and can let me get my hands on it
> before you kill dnsmasq it could help troubleshoot

It's this one:

[core@cnfdf15 ~]$ sudo systemctl status dnsmasq
● dnsmasq.service - Run dnsmasq to provide local dns for Single Node OpenShift
   Loaded: loaded (/etc/systemd/system/dnsmasq.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2022-08-09 09:31:13 UTC; 11min ago
 Main PID: 2149 (dnsmasq)
    Tasks: 1 (limit: 605092)
   Memory: 1.2M
      CPU: 10min 36.081s
   CGroup: /system.slice/dnsmasq.service
           └─2149 /usr/sbin/dnsmasq -k

It comes from mc/50-master-dnsmasq-configuration.
I sent the details on how to connect on Slack.

Maybe the same as this issue:
Bug 2106361 - dnsmasq high CPU usage in 4.11 spoke deployment or after 4.10.21 to 4.11.0-rc.1 upgrade on an SNO node

After a closer look at the log, I think this could be our problem:
-- Reboot --
Aug 09 14:52:23 cnfdf15 systemd[1]: Started Run dnsmasq to provide local dns for Single Node OpenShift.
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: started, version 2.79 cachesize 150
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: compile time options: IPv6 GNU-getopt DBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth DNSSEC loop-detect inotify
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: reading /etc/resolv.conf
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: using nameserver 10.8.34.79#53
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: ignoring nameserver 127.0.0.1 - local interface
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: using nameserver 10.8.34.211#53
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: read /etc/hosts - 2 addresses
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: failed to read /etc/resolv.conf: Permission denied
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: no servers found in /etc/resolv.conf, will retry
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: query[A] 0.rhel.pool.ntp.org from 127.0.0.1
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: forwarded 0.rhel.pool.ntp.org to 10.8.34.79
The line
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: using nameserver 10.8.34.79#53
means it may forward the requests to itself.
Indeed it happens here:
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: forwarded 0.rhel.pool.ntp.org to 10.8.34.79
After restarting dnsmasq, it never considers its own IP as a nameserver.
Could it be that dnsmasq starts too early after boot and somehow doesn't recognize that 10.8.34.79 is the local interface?
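The check described above (an upstream nameserver that is actually one of the host's own addresses) can be sketched as a small log scan. This is only an illustration, not part of dnsmasq or of the reporter's tooling: the function name `find_self_forwarding` is hypothetical, and the list of local addresses is assumed to be supplied by the caller (for example, collected from `ip addr`).

```python
import re

# Matches dnsmasq startup lines such as "using nameserver 10.8.34.79#53".
USING_RE = re.compile(r"using nameserver (\S+?)#\d+")
# Matches forwarding lines such as "forwarded example.org to 10.8.34.79".
FORWARDED_RE = re.compile(r"forwarded (\S+) to (\S+)")


def find_self_forwarding(log_lines, local_addresses):
    """Return (self_upstreams, self_forwards) found in dnsmasq log lines.

    local_addresses is an assumption of this sketch: the caller passes the
    host's own IPs; nothing here queries the network configuration.
    """
    local = set(local_addresses)
    self_upstreams = []
    self_forwards = []
    for line in log_lines:
        match = USING_RE.search(line)
        if match and match.group(1) in local:
            # dnsmasq accepted one of the host's own addresses as upstream.
            self_upstreams.append(match.group(1))
            continue
        match = FORWARDED_RE.search(line)
        if match and match.group(2) in local:
            # A query was forwarded back to the host itself.
            self_forwards.append((match.group(1), match.group(2)))
    return self_upstreams, self_forwards
```

Fed the journal lines quoted in this comment with `10.8.34.79` as the local address, the scan would flag both the `using nameserver` line and the `forwarded` line.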
In the experiment I used this MC:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-master-dnsmasq-logging-configuration
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,bm8tcmVzb2x2CmxvZy1xdWVyaWVz
        mode: 420
        overwrite: true
        path: /etc/dnsmasq.d/log-single-node.conf
This adds:
no-resolv
log-queries
to the configuration
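To confirm what a `source: data:...` entry in the MachineConfig writes to disk, the payload can be decoded locally. The following is a minimal sketch using only the Python standard library; the helper name `decode_data_url` is hypothetical and it handles only the simple `data:<media type>[;base64],<payload>` shape used in this bug.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_url(source):
    """Decode an Ignition-style `data:` URL into text (UTF-8 assumed)."""
    if not source.startswith("data:"):
        raise ValueError("not a data URL")
    # Split "text/plain;charset=utf-8;base64" from the payload at the first comma.
    header, _, payload = source[len("data:"):].partition(",")
    if header.endswith(";base64"):
        raw = base64.b64decode(payload)
    else:
        # Non-base64 data URLs carry percent-encoded bytes.
        raw = unquote_to_bytes(payload)
    return raw.decode("utf-8")


if __name__ == "__main__":
    url = "data:text/plain;charset=utf-8;base64,bm8tcmVzb2x2CmxvZy1xdWVyaWVz"
    print(decode_data_url(url))
```

Run against the `source` value from the MachineConfig above, this prints the two dnsmasq options `no-resolv` and `log-queries`, one per line.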
Yep, and no-resolv made the problem disappear.

Correction - with no-resolv the spoke fails to register on the hub due to this error:

$ oc -n open-cluster-management-agent logs deployment.apps/klusterlet-registration-agent
...
E0810 05:36:41.156655 1 base_controller.go:272] ManagedClusterCreatingController reconciliation failed: Get "https://api.cnfdf13.telco5gran.eng.rdu2.redhat.com:6443/apis/cluster.open-cluster-management.io/v1/managedclusters/cnfdf15": dial tcp: lookup api.cnfdf13.telco5gran.eng.rdu2.redhat.com on 172.30.0.10:53: server misbehaving
...

The workaround that fully worked for me was to define a new file that dnsmasq would import instead of resolv.conf. This file is an exact copy of resolv.conf, but without the SNO host IP:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-master-dnsmasq-configuration-overrides
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,bG9nLXF1ZXJpZXMK
        mode: 420
        overwrite: true
        path: /etc/dnsmasq.d/log-single-node.conf
      - contents:
          source: data:text/plain;charset=utf-8;base64,cmVzb2x2LWZpbGU9L2V0Yy9yZXNvbHYub3ZlcnJpZGU=
        mode: 420
        overwrite: true
        path: /etc/dnsmasq.d/resolv-override.conf
      - contents:
          source: data:text/plain;charset=utf-8;base64,c2VhcmNoICBjbmZkZjE1LnRlbGNvNWdyYW4uZW5nLnJkdTIucmVkaGF0LmNvbSB0ZWxjbzVncmFuLmVuZy5yZHUyLnJlZGhhdC5jb20KbmFtZXNlcnZlciAxMjcuMC4wLjEKbmFtZXNlcnZlciAxMC44LjM0LjIxMQo=
        mode: 420
        overwrite: true
        path: /etc/resolv.override

E.g.,

[core@cnfdf15 ~]$ diff /etc/resolv.conf /etc/resolv.override
1d0
< # Generated by NetworkManager
3d1
< nameserver 10.8.34.79

Any update on this issue? Do we still target the 2.6 release? Thanks!

*** This bug has been marked as a duplicate of bug 2106361 ***
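For reference, the workaround described in this comment (a copy of resolv.conf without the host's own IP, embedded as a data URL) could be generated rather than hand-edited. This is a rough sketch, not the procedure the reporter used: the function names are hypothetical, and dropping comment lines is an assumption based on the `diff` output shown in the comment.

```python
import base64


def build_resolv_override(resolv_conf_text, host_ip):
    """Copy resolv.conf, dropping comments and the host's own nameserver line."""
    kept = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # e.g. "# Generated by NetworkManager"
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver" and fields[1] == host_ip:
            # Skip the entry that would make dnsmasq forward to itself.
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


def to_data_url(text):
    """Encode text as the base64 data URL form used in the MachineConfig."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return "data:text/plain;charset=utf-8;base64," + encoded
```

The output of `to_data_url(build_resolv_override(text, host_ip))` would go into the `source:` field of the `/etc/resolv.override` file entry. Note that such a file is static, so it would not follow later changes NetworkManager makes to resolv.conf.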