Bug 2021544 - OCP 4.6.44 - Ingress VIP assigned as secondary IP in ovs-if-br-ex and added to resolv.conf as nameserver
Summary: OCP 4.6.44 - Ingress VIP assigned as secondary IP in ovs-if-br-ex and added t...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6.z
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Ben Nemec
QA Contact: Rei
URL:
Whiteboard:
Depends On:
Blocks: 2026012
 
Reported: 2021-11-09 15:14 UTC by Will Russell
Modified: 2022-06-21 10:57 UTC (History)
3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:26:15 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift baremetal-runtimecfg pull 157 0 None Merged Bug 2021544: Ignore VIPs in node-ip show 2022-02-28 21:00:19 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:26:37 UTC

Description Will Russell 2021-11-09 15:14:05 UTC
Description of problem:

Random nodes are assigned the ingressVIP address as a secondary IP on the ovs-if-br-ex interface. This triggers resolv-prepender, which then detects the VIP address and substitutes it in as a nameserver instead of the node IP. The condition persists until the node is rebooted, at which point it clears, only to return later at variable times; it is not clear what triggers it.
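To illustrate the mechanism, here is a minimal Go sketch of the prepend step: resolv-prepender places a "nameserver" line for whatever IP was chosen as the node IP ahead of the nameservers NetworkManager discovered. If that chosen IP is actually the ingress VIP, the VIP lands as the first nameserver in /etc/resolv.conf. The function name and inputs are illustrative, not the actual script (which is shell).

```go
package main

import (
	"fmt"
	"strings"
)

// prependNameserver mimics the resolv-prepender step: it writes a
// "nameserver" line for the chosen node IP ahead of the nameservers
// NetworkManager wrote to /var/run/NetworkManager/resolv.conf.
// If chosenIP is the ingress VIP (the bug here), the VIP becomes the
// first nameserver in /etc/resolv.conf.
func prependNameserver(nmResolvConf, chosenIP string) string {
	var out strings.Builder
	fmt.Fprintf(&out, "nameserver %s\n", chosenIP)
	out.WriteString(nmResolvConf)
	return out.String()
}

func main() {
	nm := "nameserver 10.0.0.53\nnameserver 10.0.0.54\n"
	// Correct behavior: chosenIP is the node IP.
	fmt.Print(prependNameserver(nm, "10.0.0.10"))
	// Buggy behavior: chosenIP is the ingress VIP.
	fmt.Print(prependNameserver(nm, "10.0.0.100"))
}
```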


Version-Release number of selected component (if applicable):

OCP 4.6.44


How reproducible:
Always

Steps to Reproduce:
1. Observe in SOS reports/debug node pods that /host/etc/resolv.conf includes nameserver entry for ingress VIP. 
2. Reboot node - ingressVIP is cleared
3. Check node several hours later, ingressVIP is back.

Actual results:
IngressVIP is assigned as nameserver/secondary interface IP in ovs-if-br-ex

Expected results:
ingressVIP is filtered from resolv.conf/br-ex config

Additional info:
This does not appear to impact only the infra nodes; it impacts random nodes across the cluster, and we have demonstrated that different machines in the same machineset will show different results.

NetworkManager logs: [hostnames/IPs redacted]
~~~
[wrussell@supportshell logs]$ grep NetworkManager journalctl_--no-pager_--catalog_--boot
Nov 01 13:07:47  NetworkManager[1752]: <info>  [1635772067.9147] manager: (vethe77b8b09): new Veth device (/org/freedesktop/NetworkManager/Devices/418)
Nov 01 13:07:47  NetworkManager[1752]: <info>  [1635772067.9161] device (vethe77b8b09): carrier: link connected
Nov 01 13:07:47  NetworkManager[1752]: <info>  [1635772067.9311] device (vethe77b8b09): interface index 180 renamed iface from 'vethe77b8b09' to 'e966135f5a6246d'
Nov 01 13:07:47  NetworkManager[1752]: <info>  [1635772067.9536] manager: (e966135f5a6246d): new Open vSwitch Port device (/org/freedesktop/NetworkManager/Devices/419)
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7994] dhcp4 (br-ex): option dhcp_lease_time      => '23564'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7994] dhcp4 (br-ex): option domain_name          => 'redacted'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option domain_name_servers  => '4 redacted dns servers'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option expiry               => '1635795711'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option ip_address           => '10.xxx.xxx.xxx - localhost (correct IP)'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option requested_broadcast_address => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option requested_domain_name => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option requested_domain_name_servers => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option requested_domain_search => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option requested_host_name  => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option requested_interface_mtu => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option requested_ms_classless_static_routes => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option requested_nis_domain => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7995] dhcp4 (br-ex): option requested_nis_servers => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7996] dhcp4 (br-ex): option requested_ntp_servers => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7996] dhcp4 (br-ex): option requested_rfc3442_classless_static_routes => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7996] dhcp4 (br-ex): option requested_root_path  => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7996] dhcp4 (br-ex): option requested_routers    => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7996] dhcp4 (br-ex): option requested_static_routes => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7996] dhcp4 (br-ex): option requested_subnet_mask => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7996] dhcp4 (br-ex): option requested_time_offset => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7996] dhcp4 (br-ex): option requested_wpad       => '1'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7996] dhcp4 (br-ex): option routers              => '10.xxx.xxx.yyy (correct router)'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7996] dhcp4 (br-ex): option subnet_mask          => '255.255.255.224'
Nov 01 13:09:07  NetworkManager[1752]: <info>  [1635772147.7996] dhcp4 (br-ex): state changed extended -> extended
Nov 01 13:09:07  dbus-daemon[1421]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.6' (uid=0 pid=1752 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Nov 01 13:09:07  hyperkube[2471]: I1101 13:09:07.804535    2471 factory.go:216] Factory "crio" was unable to handle container "/system.slice/NetworkManager-dispatcher.service"
Nov 01 13:09:07  hyperkube[2471]: I1101 13:09:07.804582    2471 factory.go:216] Factory "systemd" was unable to handle container "/system.slice/NetworkManager-dispatcher.service"
Nov 01 13:09:07  hyperkube[2471]: I1101 13:09:07.804595    2471 factory.go:212] Using factory "raw" for container "/system.slice/NetworkManager-dispatcher.service"
Nov 01 13:09:07  hyperkube[2471]: I1101 13:09:07.805217    2471 manager.go:987] Added container: "/system.slice/NetworkManager-dispatcher.service" (aliases: [], namespace: "")
Nov 01 13:09:07  hyperkube[2471]: I1101 13:09:07.805576    2471 handler.go:325] Added event &{/system.slice/NetworkManager-dispatcher.service 2021-11-01 13:09:07.803652759 +0000 UTC containerCreation {<nil>}}
Nov 01 13:09:07  hyperkube[2471]: I1101 13:09:07.805652    2471 container.go:490] Start housekeeping for container "/system.slice/NetworkManager-dispatcher.service"
-- Subject: Unit NetworkManager-dispatcher.service has begun start-up
-- Unit NetworkManager-dispatcher.service is starting up.
-- Subject: Unit NetworkManager-dispatcher.service has finished start-up
-- Unit NetworkManager-dispatcher.service has completed starting up.
Nov 01 13:09:10  root[2460385]: NM resolv-prepender: Prepending 'nameserver [INGRESS VIP!!]' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)
Nov 01 13:09:10  nm-dispatcher[2460046]: <13>Nov  1 13:09:10 root: NM resolv-prepender: Prepending 'nameserver [INGRESS VIP!!]' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)
Nov 01 13:09:20  systemd[1]: NetworkManager-dispatcher.service: Consumed 411ms CPU time
-- The unit NetworkManager-dispatcher.service completed and consumed the indicated resources.
Nov 01 13:09:20  hyperkube[2471]: I1101 13:09:20.940404    2471 manager.go:1044] Destroyed container: "/system.slice/NetworkManager-dispatcher.service" (aliases: [], namespace: "")
Nov 01 13:09:20  hyperkube[2471]: I1101 13:09:20.940443    2471 handler.go:325] Added event &{/system.slice/NetworkManager-dispatcher.service 2021-11-01 13:09:20.940434058 +0000 UTC m=+477556.928899764 containerDeletion {<nil>}}
~~~


Please see the linked primary case 03070762 for additional logs/uploads: a must-gather and an sos report from an affected node are available.

Most notably, we ran a local query on both an affected node and a non-impacted node, based on a snippet from the resolv-prepender script, to observe the outputs:

~~~
$ /usr/bin/podman run --rm \
        --authfile /var/lib/kubelet/config.json \
        --net=host \
        quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f60b01e589e0f655bf2641abb3220fdd55b1c44e80282f268f81e3c08024b7e5 \
        node-ip \
        show \
        "$(API VIP IP)" \
        "$(Ingress VIP IP)"
~~~

The outputs on each node were different. Exit results [cleaned/trimmed]; the full output is in the case:

Impacted:
~~~
...output omitted...
time="2021-11-04T20:52:18Z" level=debug msg="Address 10.xxx.xxx.xxx/32 br-ex contains VIP 10.xxx.xxx.xxx"
time="2021-11-04T20:52:18Z" level=info msg="Chosen Node IPs: [INGRESS VIP]"
10.xxx.xxx.xxx (Ingress VIP as last output provided)
~~~

Non-impacted:
~~~
time="2021-11-04T20:56:43Z" level=debug msg="Address 10.xxx.xxx.xxx/27 br-ex is on interface br-ex with default route" #(correct node IP)
time="2021-11-04T20:56:43Z" level=info msg="Chosen Node IPs: [10.xxx.xxx.xxx #correct node IP]"
10.xxx.xxx.xxx (#correct node-IP as last output provided)
~~~

Looking forward to your analysis/assistance.

Comment 10 errata-xmlrpc 2022-03-10 16:26:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

