1995021 – resolv.conf and corefile sync slows down/stops after keepalived container restart

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	All
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Yossi Boaron
QA Contact:	Eldar Weiss
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2033966

Reported:	2021-08-18 09:53 UTC by Eldar Weiss
Modified:	2022-03-10 16:05 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The baremetal-runtimecfg project used an old version of the Kubernetes client library. Consequence: When a VIP failed over, client connections were sometimes not closed in a timely fashion. This could result in long delays for monitor containers that rely on talking to the API. Fix: Updated the client library. Result: Connections are now closed as expected on VIP failovers, so the monitor does not hang for an excessively long time.
Clone Of:
Clones:	2033966
Environment:
Last Closed:	2022-03-10 16:05:18 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift baremetal-runtimecfg pull 164	0	None	open	Bug 1995021: upgrade k8s.io/client-go	2021-12-08 15:15:38 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:05:37 UTC

Description Eldar Weiss 2021-08-18 09:53:08 UTC

Description of problem:
Adding a nameserver to a node's NM resolv.conf does not add the nameserver to the Corefile if the keepalived container was restarted very recently.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-14-065522

Steps to Reproduce:
On any master node:
1. Get the keepalived container's ID:

sudo crictl ps --name keepalived

2. Stop the container:

sudo crictl stop *CONTAINER ID*

3. Modify /var/run/NetworkManager/resolv.conf by adding a nameserver (example: 'nameserver 8.8.8.8').

4. Check whether the nameserver you added appears in /etc/coredns/Corefile.
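
For repeated runs, steps 1-3 can be wrapped in two small shell helpers. This is only a sketch: the function names stop_keepalived and add_nameserver are hypothetical and not part of any OpenShift tooling, and it assumes crictl's -q flag, which prints only container IDs.

```shell
#!/usr/bin/env bash

# Hypothetical helper for steps 1-2: look up the running keepalived
# container and stop it. Assumes `crictl ps -q` prints only the ID.
stop_keepalived() {
    local id
    id="$(sudo crictl ps --name keepalived -q)"
    if [ -z "$id" ]; then
        echo "no running keepalived container found" >&2
        return 1
    fi
    sudo crictl stop "$id"
}

# Hypothetical helper for step 3: append "nameserver <ip>" to a
# resolv.conf-style file unless that exact line is already present.
add_nameserver() {
    local file="$1" ip="$2"
    if ! grep -qxF "nameserver ${ip}" "$file"; then
        printf 'nameserver %s\n' "$ip" >> "$file"
    fi
}
```

On a node this would be run from a root shell, since the resolv.conf file is not writable by the core user, for example: stop_keepalived && add_nameserver /var/run/NetworkManager/resolv.conf 8.8.8.8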

Actual results:

[core@master-0-1 ~]$ cat /var/run/NetworkManager/resolv.conf
# Generated by NetworkManager
search ocp-edge-cluster-0.qe.lab.redhat.com
nameserver fe80::5054:ff:fe08:ccbe%br-ex
nameserver fd2e:6f44:5dd8::1
nameserver 8.8.8.8

[core@master-0-1 ~]$ cat /etc/coredns/Corefile 
. {
    errors
    health :18080
    forward . fe80::5054:ff:fe08:ccbe%br-ex fd2e:6f44:5dd8::1 {
        policy sequential
    }

Expected results:
[core@master-0-1 ~]$ cat /etc/coredns/Corefile 
. {
    errors
    health :18080
    forward . fe80::5054:ff:fe08:ccbe%br-ex fd2e:6f44:5dd8::1 8.8.8.8 {
        policy sequential
    }

Additional info:
1) This can also happen when removing a nameserver and waiting for it to be removed from the Corefile.
2) At times, the sync does happen under the above conditions, but takes several minutes.
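
To tell "never synced" apart from "synced after several minutes", the check can be polled and timed. The sketch below is an assumption-laden helper, not part of the product: corefile_has_forwarder and wait_for_forwarder are hypothetical names, the 300-second default timeout is arbitrary, and it assumes the upstreams sit on a single "forward . ..." line as in the Corefile output above.

```shell
#!/usr/bin/env bash

# Succeed if <ip> is listed as an upstream on a "forward" line of the
# Corefile. Fields 1 and 2 are "forward" and the zone, so upstreams
# start at field 3; the trailing "{" never equals an address.
corefile_has_forwarder() {
    local corefile="$1" ip="$2"
    awk -v ip="$ip" '
        $1 == "forward" {
            for (i = 3; i <= NF; i++) if ($i == ip) found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$corefile"
}

# Poll once per second until <ip> shows up in the Corefile or the
# timeout (seconds, default 300) expires. Prints the elapsed time.
wait_for_forwarder() {
    local corefile="$1" ip="$2" timeout="${3:-300}"
    local start elapsed
    start="$(date +%s)"
    while :; do
        elapsed=$(( $(date +%s) - start ))
        if corefile_has_forwarder "$corefile" "$ip"; then
            echo "synced after ${elapsed}s"
            return 0
        fi
        [ "$elapsed" -ge "$timeout" ] && break
        sleep 1
    done
    echo "not synced within ${timeout}s" >&2
    return 1
}
```

Example: wait_for_forwarder /etc/coredns/Corefile 8.8.8.8 600. For the removal case in 1), the same check can be negated in a loop (until ! corefile_has_forwarder ...).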

Comment 1 Kirsten Garrison 2021-08-18 17:14:24 UTC

Please provide a must-gather, if possible, along with information about the deployment.

Comment 5 Sinny Kumari 2021-11-22 16:46:59 UTC

Hi Ben,

Are you or your team still looking at this bug?

Comment 6 Ben Nemec 2021-11-23 22:54:06 UTC

Sorry, I think we missed this one because it wasn't in the baremetal subcomponent. I'm going to move it so we catch it in our triage meeting tomorrow.

Comment 8 Eldar Weiss 2021-12-16 15:24:31 UTC

Issue is resolved.


Expected results:
The Corefile is synced with resolv.conf: the nameserver added to resolv.conf appears in the Corefile's "forward" section, and the sync takes only a few seconds.

Version-Release number of selected component (if applicable), verified on:
4.10.0-0.ci-2021-12-15-195801

Actual results:
I added "8.8.8.6" as a nameserver:
[core@master-0-0 ~]$ date
Thu Dec 16 15:20:46 UTC 2021
[core@master-0-0 ~]$ sudo vi /var/run/NetworkManager/resolv.conf
[core@master-0-0 ~]$ cat vi /var/run/NetworkManager/resolv.conf
cat: vi: No such file or directory
# Generated by NetworkManager
search ocp-edge-cluster-0.qe.lab.redhat.com
nameserver fe80::5054:ff:fe62:929f%br-ex
nameserver fd2e:6f44:5dd8::1
nameserver 8.8.8.6

I then checked the Corefile to confirm that the nameserver was added:
[core@master-0-0 ~]$ cat /etc/coredns/Corefile | grep forward
    forward . fe80::5054:ff:fe62:929f%br-ex fd2e:6f44:5dd8::1 8.8.8.6 {
[core@master-0-0 ~]$ date
Thu Dec 16 15:21:18 UTC 2021

Took less than a minute.


This should be backported ASAP.

Comment 12 errata-xmlrpc 2022-03-10 16:05:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056