Hide Forgot
Description of problem: random nodes are assigned as secondary IP address the ingressVIP IP address, which is linked to the ovs-if-br-ex interface, which triggers resolv-prepender, which then detects the VIP address and substitutes it in as a nameserver instead of the node-IP. This persists until the node is rebooted, at which point the issue dissipates. (Will return later at variable times, not entirely sure what's triggering it). Version-Release number of selected component (if applicable): OCP 4.6.44 How reproducible: Always Steps to Reproduce: 1. Observe in SOS reports/debug node pods that /host/etc/resolv.conf includes nameserver entry for ingress VIP. 2. Reboot node - ingressVIP is cleared 3. Check node several hours later, ingressVIP is back. Actual results: IngressVIP is assigned as nameserver/secondary interface IP in ovs-if-br-ex Expected results: ingressVIP is filtered from resolv.conf/br-ex config Additional info: This does not appear to only be impacting the infra nodes - this is impacting random nodes across the cluster, And have demonstrated that different machines in the same machineset will have different results. NetworkManager logs: [hostnames/IPs redacted] ~~~ [wrussell@supportshell logs]$ grep NetworkManager journalctl_--no-pager_--catalog_--boot Nov 01 13:07:47 NetworkManager[1752]: <info> [1635772067.9147] manager: (vethe77b8b09): new Veth device (/org/freedesktop/NetworkManager/Devices/418) Nov 01 13:07:47 NetworkManager[1752]: <info> [1635772067.9161] device (vethe77b8b09): carrier: link connected Nov 01 13:07:47 NetworkManager[1752]: <info> [1635772067.9311] device (vethe77b8b09): interface index 180 renamed iface from 'vethe77b8b09' to 'e966135f5a6246d' Nov 01 13:07:47 NetworkManager[1752]: <info> [1635772067.9536] manager: (e966135f5a6246d): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/419) Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7994] dhcp4 (br-ex): option dhcp_lease_time => '23564' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7994] dhcp4 (br-ex): option domain_name => 'redacted' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option domain_name_servers => '4 redacted dns servers' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option expiry => '1635795711' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option ip_address => '10.xxx.xxx.xxx - localhost (correct IP)' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option requested_broadcast_address => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option requested_domain_name => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option requested_domain_name_servers => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option requested_domain_search => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option requested_host_name => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option requested_interface_mtu => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option requested_ms_classless_static_routes => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option requested_nis_domain => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7995] dhcp4 (br-ex): option requested_nis_servers => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7996] dhcp4 (br-ex): option requested_ntp_servers => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7996] dhcp4 (br-ex): option requested_rfc3442_classless_static_routes => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7996] dhcp4 (br-ex): option requested_root_path => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7996] dhcp4 (br-ex): option requested_routers => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7996] dhcp4 (br-ex): option requested_static_routes => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7996] dhcp4 (br-ex): option requested_subnet_mask => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7996] dhcp4 (br-ex): option requested_time_offset => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7996] dhcp4 (br-ex): option requested_wpad => '1' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7996] dhcp4 (br-ex): option routers => '10.xxx.xxx.yyy (correct router)' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7996] dhcp4 (br-ex): option subnet_mask => '255.255.255.224' Nov 01 13:09:07 NetworkManager[1752]: <info> [1635772147.7996] dhcp4 (br-ex): state changed extended -> extended Nov 01 13:09:07 dbus-daemon[1421]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.6' (uid=0 pid=1752 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0") Nov 01 13:09:07 hyperkube[2471]: I1101 13:09:07.804535 2471 factory.go:216] Factory "crio" was unable to handle container "/system.slice/NetworkManager-dispatcher.service" Nov 01 13:09:07 hyperkube[2471]: I1101 13:09:07.804582 2471 factory.go:216] Factory "systemd" was unable to handle container "/system.slice/NetworkManager-dispatcher.service" Nov 01 13:09:07 hyperkube[2471]: I1101 13:09:07.804595 2471 factory.go:212] Using factory "raw" for container "/system.slice/NetworkManager-dispatcher.service" Nov 01 13:09:07 hyperkube[2471]: I1101 13:09:07.805217 2471 manager.go:987] Added container: "/system.slice/NetworkManager-dispatcher.service" (aliases: [], namespace: "") Nov 01 13:09:07 hyperkube[2471]: I1101 13:09:07.805576 2471 handler.go:325] Added event &{/system.slice/NetworkManager-dispatcher.service 2021-11-01 13:09:07.803652759 +0000 UTC containerCreation {<nil>}} Nov 01 13:09:07 hyperkube[2471]: I1101 13:09:07.805652 2471 container.go:490] Start housekeeping for container "/system.slice/NetworkManager-dispatcher.service" -- Subject: Unit NetworkManager-dispatcher.service has begun start-up -- Unit NetworkManager-dispatcher.service is starting up. -- Subject: Unit NetworkManager-dispatcher.service has finished start-up -- Unit NetworkManager-dispatcher.service has completed starting up. Nov 01 13:09:10 root[2460385]: NM resolv-prepender: Prepending 'nameserver [INGRESS VIP!!]' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf) Nov 01 13:09:10 nm-dispatcher[2460046]: <13>Nov 1 13:09:10 root: NM resolv-prepender: Prepending 'nameserver [INGRESS VIP!!]' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf) Nov 01 13:09:20 systemd[1]: NetworkManager-dispatcher.service: Consumed 411ms CPU time -- The unit Netwnd consumed the indicated resources. Nov 01 13:09:20 hyperkube[2471]: I1101 13:09:20.940404 2471 manager.go:1044] Destroyed container: "/system.slice/NetworkManager-dispatcher.service" (aliases: [], namespace: "") Nov 01 13:09:20 hyperkube[2471]: I1101 13:09:20.940443 2471 handler.go:325] Added event &{/system.slice/NetworkManager-dispatcher.service 2021-11-01 13:09:20.940434058 +0000 UTC m=+477556.928899764 containerDeletion {<nil>}} ~~~ Please see linked primary case: 03070762 for additional logs/uploads: MG + sos report from affected node available. Most notable - ran local query on both an affected node and a non-imacted node based on snippet from the resolv-prepender script to observe outputs: ~~~ $ /usr/bin/podman run --rm \ --authfile /var/lib/kubelet/config.json \ --net=host \ quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f60b01e589e0f655bf2641abb3220fdd55b1c44e80282f268f81e3c08024b7e5 \ node-ip \ show \ "$(API VIP IP)" \ "$(Ingress VIP IP)" ~~~ Outputs on each node were different --> exit results [cleaned/trimmed] - full output in case: Impacted: ~~~ ...output omitted... time="2021-11-04T20:52:18Z" level=debug msg="Address 10.xxx.xxx.xxx/32 br-ex contains VIP 10.xxx.xxx.xxx" time="2021-11-04T20:52:18Z" level=info msg="Chosen Node IPs: [INGRESS VIP]" 10.xxx.xxx.xxx (Ingress VIP as last output provided) ~~~ Non-impacted ~~~ time="2021-11-04T20:56:43Z" level=debug msg="Address 10.xxx.xxx.xxx/27 br-ex is on interface br-ex with default route" #(correct node IP) time="2021-11-04T20:56:43Z" level=info msg="Chosen Node IPs: [10.xxx.xxx.xxx #correct node IP]" 10.xxx.xxx.xxx (#correct node-IP as last output provided) ~~~ Looking forward to your analysis/assistance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056