Bug 2116549

Summary: 4.11 SNO failed to install, dnsmasq 100% CPU
Product: Red Hat Advanced Cluster Management for Kubernetes Reporter: Vitaly Grinberg <vgrinber>
Component: Infrastructure Operator Assignee: Omer Tuchfeld <otuchfel>
Status: CLOSED DUPLICATE QA Contact: Chad Crum <ccrum>
Severity: high Docs Contact: Derek <dcadzow>
Priority: urgent    
Version: rhacm-2.6 CC: ccrum, itsoiref, mcornea, mfilanov, otuchfel, pemensik, trwest, yfirst, yliu1, yuhe
Target Milestone: ---   
Target Release: rhacm-2.6   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-22 13:16:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Comment 2 Omer Tuchfeld 2022-08-09 08:51:39 UTC
Which dnsmasq process are you referring to exactly? 
Also, if you manage to reproduce the issue and can give me access to the host before you kill dnsmasq, it could help with troubleshooting.

Comment 3 Vitaly Grinberg 2022-08-09 09:47:22 UTC
(In reply to Omer Tuchfeld from comment #2)
> Which dnsmasq process are you referring to exactly? 
> Also if you manage to reproduce the issue and can let me get my hands on it
> before you kill dnsmasq it could help troubleshoot

It's this one:
[core@cnfdf15 ~]$ sudo systemctl status dnsmasq
● dnsmasq.service - Run dnsmasq to provide local dns for Single Node OpenShift
   Loaded: loaded (/etc/systemd/system/dnsmasq.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2022-08-09 09:31:13 UTC; 11min ago
 Main PID: 2149 (dnsmasq)
    Tasks: 1 (limit: 605092)
   Memory: 1.2M
      CPU: 10min 36.081s
   CGroup: /system.slice/dnsmasq.service
           └─2149 /usr/sbin/dnsmasq -k

It comes from mc/50-master-dnsmasq-configuration

I sent the connection details on Slack.

Comment 4 yliu1 2022-08-09 13:36:43 UTC
Maybe the same as this issue: Bug 2106361 - dnsmasq high CPU usage in 4.11 spoke deployment or after 4.10.21 to 4.11.0-rc.1 upgrade on an SNO node

Comment 6 Vitaly Grinberg 2022-08-10 04:49:52 UTC
After a closer look at the log, I think this could be our problem:
-- Reboot --
Aug 09 14:52:23 cnfdf15 systemd[1]: Started Run dnsmasq to provide local dns for Single Node OpenShift.
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: started, version 2.79 cachesize 150
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: compile time options: IPv6 GNU-getopt DBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth DNSSEC loop-detect inotify
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: reading /etc/resolv.conf
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: using nameserver 10.8.34.79#53
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: ignoring nameserver 127.0.0.1 - local interface
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: using nameserver 10.8.34.211#53
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: read /etc/hosts - 2 addresses
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: failed to read /etc/resolv.conf: Permission denied
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: no servers found in /etc/resolv.conf, will retry
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: query[A] 0.rhel.pool.ntp.org from 127.0.0.1
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: forwarded 0.rhel.pool.ntp.org to 10.8.34.79

The line
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: using nameserver 10.8.34.79#53
means it may forward the requests to itself.

Indeed it happens here:
Aug 09 14:52:23 cnfdf15 dnsmasq[2149]: forwarded 0.rhel.pool.ntp.org to 10.8.34.79
After restarting dnsmasq, it never considers its own IP as a nameserver.
Could it be that dnsmasq starts too early after boot and somehow doesn't recognize that 10.8.34.79 is the local interface?
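
To make the hypothesis concrete, here is a minimal Python sketch of the filtering step it describes. It is an illustration only, not dnsmasq's actual implementation: the function name usable_upstreams and the two "local address" sets are assumptions, and only the three nameserver addresses come from the log above.

```python
import ipaddress

def usable_upstreams(nameservers, local_addresses):
    """Return the nameservers that are NOT known to be local addresses.

    Hypothetical model of the "ignoring nameserver ... - local interface"
    step: an upstream is dropped only if it matches an address that is
    already known to be local at the time of the check.
    """
    local = {ipaddress.ip_address(a) for a in local_addresses}
    return [ns for ns in nameservers if ipaddress.ip_address(ns) not in local]

# Nameservers reported by dnsmasq in the log above.
NAMESERVERS = ["10.8.34.79", "127.0.0.1", "10.8.34.211"]

# Assumption: early in boot only loopback is configured on the host.
early = usable_upstreams(NAMESERVERS, ["127.0.0.1"])

# Assumption: later, the host address 10.8.34.79 is assigned as well.
late = usable_upstreams(NAMESERVERS, ["127.0.0.1", "10.8.34.79"])

print("early boot:", early)
print("after restart:", late)
```

Under these assumptions the early check keeps 10.8.34.79 as an upstream, which would match the "forwarded ... to 10.8.34.79" line, while the later check drops it, which would match the behavior seen after restarting dnsmasq.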

In the experiment I used this MC:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-master-dnsmasq-logging-configuration
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,bm8tcmVzb2x2CmxvZy1xdWVyaWVz
        mode: 420
        overwrite: true
        path: /etc/dnsmasq.d/log-single-node.conf

This adds:

no-resolv
log-queries

to the configuration
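
For anyone who wants to check what the base64 payload in the MC above contains, a short Python sketch follows. The helper name decode_data_url is hypothetical; the data URL is copied from the MC as-is.

```python
import base64

def decode_data_url(source: str) -> str:
    """Decode an Ignition-style `data:` URL with a base64 payload into text."""
    header, _, payload = source.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data: URL")
    return base64.b64decode(payload).decode("utf-8")

# Source string taken from the MachineConfig above.
source = "data:text/plain;charset=utf-8;base64,bm8tcmVzb2x2CmxvZy1xdWVyaWVz"
decoded = decode_data_url(source)
print(decoded)

# The MachineConfig file mode is written in decimal; print it in octal.
print(oct(420))
```

The mode value 420 is decimal, which corresponds to the usual 0644 file permissions.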

Comment 7 Vitaly Grinberg 2022-08-10 04:51:43 UTC
Yep, and no-resolv made the problem disappear

Comment 8 Vitaly Grinberg 2022-08-10 07:11:59 UTC
Correction - with no-resolv the spoke fails to register on the hub due to this error:

$ oc -n open-cluster-management-agent logs deployment.apps/klusterlet-registration-agent
...
E0810 05:36:41.156655       1 base_controller.go:272] ManagedClusterCreatingController reconciliation failed: Get "https://api.cnfdf13.telco5gran.eng.rdu2.redhat.com:6443/apis/cluster.open-cluster-management.io/v1/managedclusters/cnfdf15": dial tcp: lookup api.cnfdf13.telco5gran.eng.rdu2.redhat.com on 172.30.0.10:53: server misbehaving
...

The workaround that fully worked for me was to define a new file that dnsmasq would import instead of resolv.conf.
This file is the exact copy of resolv.conf, but without the SNO host IP:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 50-master-dnsmasq-configuration-overrides
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,bG9nLXF1ZXJpZXMK
        mode: 420
        overwrite: true
        path: /etc/dnsmasq.d/log-single-node.conf
      - contents:
          source: data:text/plain;charset=utf-8;base64,cmVzb2x2LWZpbGU9L2V0Yy9yZXNvbHYub3ZlcnJpZGU=
        mode: 420
        overwrite: true
        path: /etc/dnsmasq.d/resolv-override.conf

      - contents:
          source: data:text/plain;charset=utf-8;base64,c2VhcmNoICBjbmZkZjE1LnRlbGNvNWdyYW4uZW5nLnJkdTIucmVkaGF0LmNvbSB0ZWxjbzVncmFuLmVuZy5yZHUyLnJlZGhhdC5jb20KbmFtZXNlcnZlciAxMjcuMC4wLjEKbmFtZXNlcnZlciAxMC44LjM0LjIxMQo=
        mode: 420
        overwrite: true
        path: /etc/resolv.override


E.g.,
[core@cnfdf15 ~]$ diff /etc/resolv.conf /etc/resolv.override 
1d0
< # Generated by NetworkManager
3d1
< nameserver 10.8.34.79
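
The override file can also be derived mechanically instead of being edited by hand. The Python sketch below is one possible way to do that, under stated assumptions: the helper names build_override and to_data_url are hypothetical, and the RESOLV_CONF text is reconstructed from the diff above together with the /etc/resolv.override payload, not copied from the host.

```python
import base64

def build_override(resolv_conf_text: str, host_ip: str) -> str:
    """Copy resolv.conf, dropping comment lines and the host's own nameserver entry."""
    kept = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if line.startswith("#"):
            continue  # e.g. "# Generated by NetworkManager"
        if len(fields) >= 2 and fields[0] == "nameserver" and fields[1] == host_ip:
            continue  # the SNO host IP that dnsmasq must not forward to
        kept.append(line)
    return "\n".join(kept) + "\n"

def to_data_url(text: str) -> str:
    """Encode text as a base64 data URL in the form used by the MachineConfig."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return "data:text/plain;charset=utf-8;base64," + payload

# Assumption: resolv.conf contents reconstructed from the diff output above.
RESOLV_CONF = (
    "# Generated by NetworkManager\n"
    "search  cnfdf15.telco5gran.eng.rdu2.redhat.com telco5gran.eng.rdu2.redhat.com\n"
    "nameserver 10.8.34.79\n"
    "nameserver 127.0.0.1\n"
    "nameserver 10.8.34.211\n"
)

override = build_override(RESOLV_CONF, "10.8.34.79")
print(override, end="")
print(to_data_url(override))
```

Note that a file generated this way is static: if NetworkManager later rewrites /etc/resolv.conf with different upstream servers, the override would need to be regenerated.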

Comment 9 Yuanyuan He 2022-08-17 12:40:06 UTC
Any update on this issue? Are we still targeting the 2.6 release? Thanks!

Comment 11 Igal Tsoiref 2022-08-22 13:16:57 UTC

*** This bug has been marked as a duplicate of bug 2106361 ***