1751978 – [IPI baremetal] restart keepalived when default route is changed

Summary: [IPI baremetal] restart keepalived when default route is changed

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Antoni Segura Puimedon
QA Contact:	Victor Voronkov
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1744560
Depends On:
Blocks:	1741265 1817988

Reported:	2019-09-13 10:51 UTC by Petr Horáček
Modified:	2020-07-13 17:11 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Keepalived 1.x binds to specific interfaces, and users can dynamically place those interfaces in bonds or bridges. Consequence: When the interface keepalived uses is bridged, keepalived is unable to resume operation, which disrupts Virtual IP management. Fix: Monitor interface changes and reload keepalived so that it reads the new configuration. Result: Virtual IP management operates with minimal disruption.
Clone Of:
Clones:	1817988
Environment:
Last Closed:	2020-07-13 17:11:28 UTC
Target Upstream Version:
Embargoed:
Flags:	dkholodo: needinfo- vvoronko: needinfo- vvoronko: needinfo-

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift baremetal-runtimecfg pull 20	'None'	closed	Add dynkeepalived to solve interface changes	2021-02-19 14:34:02 UTC
Github	openshift machine-config-operator pull 1124	'None'	closed	Bug 1751978: templates/baremetal: Fix keepalived dysfunction on vrrp iface change	2021-02-19 14:34:02 UTC
Red Hat Product Errata	RHBA-2020:2409	None	None	None	2020-07-13 17:11:52 UTC

Description Petr Horáček 2019-09-13 10:51:19 UTC

Description of problem:
When the default route is changed (moved from a NIC to a Linux bridge), keepalived doesn't update its configuration, so its IPs are not moved onto the bridge. As a result, we lose connectivity to the API server.


Version-Release number of selected component (if applicable):


How reproducible:
Always


Steps to Reproduce:
1. Move IP from NIC to bridge

Actual results:
The bridge doesn't get the keepalived IPs. Connectivity to the API server is lost.

Expected results:
The API server is still reachable. The keepalived IPs are moved to the bridge.

Additional notes:
Antoni Segura Puimedon suggested adding a liveness probe to the keepalived pod, so that it is restarted and reconfigured if the linked interface no longer has an IP.
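
The liveness-probe idea above could be sketched roughly as follows. This is a minimal Python sketch, not the change that eventually shipped: it assumes the probe can obtain the JSON output of `ip -j addr show` (iproute2), and the function name `interface_has_ipv4` and the sample data are hypothetical.

```python
import json


def interface_has_ipv4(ip_addr_json: str, ifname: str) -> bool:
    """Return True if `ifname` still carries at least one IPv4 address.

    `ip_addr_json` is expected to be the output of `ip -j addr show`,
    i.e. a JSON list with one object per interface.
    """
    for link in json.loads(ip_addr_json):
        if link.get("ifname") != ifname:
            continue
        return any(a.get("family") == "inet" for a in link.get("addr_info", []))
    # Interface not present at all: treat it as unhealthy.
    return False


# Hypothetical sample: ens4 was enslaved to br10, so its address moved to the bridge.
SAMPLE = json.dumps([
    {"ifname": "ens4", "addr_info": []},
    {"ifname": "br10", "addr_info": [
        {"family": "inet", "local": "192.168.111.23", "prefixlen": 24},
    ]},
])

if __name__ == "__main__":
    # A probe would fail (exit non-zero) when the configured interface lost its IP.
    print(interface_has_ipv4(SAMPLE, "ens4"))
    print(interface_has_ipv4(SAMPLE, "br10"))
```

A probe built on this would report failure for the interface named in keepalived.conf once its address has moved elsewhere, which is the condition described in this report.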

Comment 9 Steven Hardy 2019-12-13 11:18:54 UTC

*** Bug 1744560 has been marked as a duplicate of this bug. ***

Comment 15 Victor Voronkov 2020-01-30 05:17:02 UTC

Bug fix verified by setting up a Linux bridge over the external interface:

nmcli con add type bridge ifname br10
nmcli con add type bridge-slave ifname ens4 master br10
nmcli con up bridge-slave-ens4

As expected, the keepalived monitor observed the network change and rendered a valid config file.
Meanwhile, DNS_VIP and INGRESS_VIP migrated to another host in the cluster.
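
The behaviour verified here (notice that the VIP-carrying interface changed, then render a fresh config) can be illustrated with a small Python sketch. It is not the baremetal-runtimecfg code: the helper names `interface_for_vip` and `needs_reload` are hypothetical, and it assumes the monitor selects the interface whose subnet contains the VIP.

```python
import ipaddress
from typing import Dict, List, Optional


def interface_for_vip(addresses: Dict[str, List[str]], vip: str) -> Optional[str]:
    """Return the interface whose subnet contains `vip`, or None.

    `addresses` maps interface name -> list of CIDR strings,
    e.g. {"br10": ["192.168.111.23/24"]}.
    """
    vip_addr = ipaddress.ip_address(vip)
    for ifname, cidrs in addresses.items():
        for cidr in cidrs:
            if vip_addr in ipaddress.ip_interface(cidr).network:
                return ifname
    return None


def needs_reload(current_iface: Optional[str],
                 addresses: Dict[str, List[str]], vip: str) -> bool:
    """True when the VIP subnet is now reachable through a different interface."""
    new_iface = interface_for_vip(addresses, vip)
    return new_iface is not None and new_iface != current_iface


# Hypothetical state before and after the nmcli commands above.
BEFORE = {"ens4": ["192.168.111.23/24"]}
AFTER = {"ens4": [], "br10": ["192.168.111.23/24"]}

if __name__ == "__main__":
    print(interface_for_vip(BEFORE, "192.168.111.4"))
    print(needs_reload("ens4", AFTER, "192.168.111.4"))
```

Returning False when no interface matches is a deliberate choice in this sketch: it avoids rewriting the config during the short window in which the address is on neither the NIC nor the bridge.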

Comment 16 Petr Horáček 2020-03-27 11:13:37 UTC

I observed this issue again on OpenShift 4.4.

I reconfigured the host so that the interface that originally carried the VIPs is attached to an OVS bridge.

While the keepalived-monitor observed the change and rendered a new config, the keepalived container failed with 'permanent error CONFIG'.

I have a cluster available in case you want to debug it there.

keepalived-monitor logs:
time="2020-03-27T10:54:03Z" level=info msg="Config change detected" new config="{{ostest test.metalkube.org 192.168.111.5 14 192.168.111.2 10 192.168.111.4 93 24 0 } {0 0 0 [] } 192.168.111.23 worker-0  brcnv [192.168.111.1]}"
time="2020-03-27T10:54:03Z" level=info msg="Runtimecfg rendering template" path=/etc/keepalived/keepalived.conf

keepalived.conf:
vrrp_script chk_ingress {
    script "/usr/bin/curl -o /dev/null -kLs http://0:1936/healthz"
    interval 1
    weight 50
}

vrrp_instance ostest_INGRESS {
    state BACKUP
    interface brcnv
    virtual_router_id 93
    priority 40
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass cluster_uuid_ingress_vip
    }
    virtual_ipaddress {
        192.168.111.4/24
    }
    track_script {
        chk_ingress
    }
}
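
The `interface brcnv` line shows that the rendered config points at the bridge. As a rough illustration of that rendering step, here is a Python sketch using `string.Template`. The template text and the function name `render_ingress_instance` are hypothetical and only mirror the stanza above; the authentication and track_script blocks are omitted for brevity, and this is not the template the monitor actually uses.

```python
from string import Template

# Hypothetical, trimmed-down template modelled on the vrrp_instance stanza above.
VRRP_TEMPLATE = Template("""\
vrrp_instance ${cluster}_INGRESS {
    state BACKUP
    interface ${iface}
    virtual_router_id ${vrid}
    priority 40
    advert_int 1
    virtual_ipaddress {
        ${vip}/${prefix}
    }
}
""")


def render_ingress_instance(cluster: str, iface: str, vrid: int,
                            vip: str, prefix: int) -> str:
    """Fill in the interface and VIP details for the ingress VRRP instance."""
    return VRRP_TEMPLATE.substitute(
        cluster=cluster, iface=iface, vrid=vrid, vip=vip, prefix=prefix
    )


if __name__ == "__main__":
    # Values taken from the config shown in this comment.
    print(render_ingress_instance("ostest", "brcnv", 93, "192.168.111.4", 24))
```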

keepalived logs:
The client sent: reload
Opening file '/etc/keepalived/keepalived.conf'.
Stopped
Keepalived_vrrp exited with permanent error CONFIG. Terminating
Stopping
Stopped Keepalived v1.3.5 (03/19,2017), git commit v1.3.5-6-g6fa32f2

Comment 18 Petr Horáček 2020-03-27 11:43:46 UTC

Sorry for the noise. I will clone this instead.

Comment 21 errata-xmlrpc 2020-07-13 17:11:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
