Bug 1842706

Summary:	keepalived vrrp address lost after nmcli modification
Product:	Red Hat Enterprise Linux 7	Reporter:	Justin <jherron>
Component:	keepalived	Assignee:	Ryan O'Hara <rohara>
Status:	CLOSED WONTFIX	QA Contact:	Brandon Perkins <bperkins>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	7.7	CC:	cluster-maint, cutaylor
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-11-11 21:42:37 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Justin 2020-06-01 21:51:18 UTC

Description of problem:
Case 02644340: the customer reports that when any modification is made to the NetworkManager-managed profile of the interface keepalived is listening on, the VRRP address assigned to that interface is lost, failover does not occur, and the VRRP address becomes unreachable. I was able to fully reproduce this problem with two VMs on a single subnet.

        +----->--------------------+
        |      |XXXXXXXXXXXX|      |
        |  +-----------------<-+   |
        |  |                   |   |
        |  v                   |   v
+-------+--+---+            +--+---+-------+
|node01        |            |node02        |
|              +<---+VIP+-->+              |
|              | 10.170.1.50|              |
|              |            |              |
|              |            |              |
+--------------+            +--------------+


VIP address: 
-----------------------------------------
Version-Release number of selected component (if applicable):

keepalived-1.3.5-16.el7.x86_64 

How reproducible:
----------------------------------------------------
Steps to Reproduce:
1. yum install keepalived -y 
2. Set up an instance of keepalived with MASTER|BACKUP
3. Edit the interface via nmcli then reset or reapply the interface.

Actual results:
/*Address is assigned to ens192*/
----------------------------------------------------
~]# ip addr list | grep 131
    inet 131.232.67.212/24 brd 131.232.67.255 scope global noprefixroute ens192
    inet 131.232.67.214/24 scope global secondary ens192
----------------------------------------------------
/*Modify the interface that is named in the keepalived.conf instance, i.e. the directive `interface ens192`*/
----------------------------------------------------
~]# nmcli con mod ens192 ipv4.dns "131.232.3.99"
~]# ip addr list | grep 131 ; date
    inet 131.232.67.212/24 brd 131.232.67.255 scope global noprefixroute ens192
    inet 131.232.67.214/24 scope global secondary ens192
----------------------------------------------------
Now run device reapply to apply the settings. Notice below that the VIP address is no longer assigned;
however, keepalived does not appear to be aware of this, because the link is still up.
----------------------------------------------------
~]# nmcli device reapply ens192 
Connection successfully reapplied to device 'ens192'.

~]# ip addr list | grep 131 
    inet 131.232.67.212/24 brd 131.232.67.255 scope global noprefixroute ens192

----------------------------------------------------
Now restart keepalived and the VIP is assigned again. Before the restart, however, neither failover nor IP address reassignment occurred, so the VIP was unreachable.
----------------------------------------------------

~]# systemctl restart keepalived
~]# ip addr list | grep 131 ; date
    inet 131.232.67.212/24 brd 131.232.67.255 scope global noprefixroute ens192

~]# ip addr list | grep 131 ; date
    inet 131.232.67.212/24 brd 131.232.67.255 scope global noprefixroute ens192
    inet 131.232.67.214/24 scope global secondary ens192

Expected results:
The expected result is either that keepalived initiates failover or that NetworkManager does not *remove* the VRRP address from the interface.

Additional info:
From what I have gathered, keepalived only monitors the state of the device assigned to an instance in keepalived.conf, whereas NetworkManager only knows about the IP addresses defined in its interface profiles. Once the commands nmcli con up <int>; nmcli dev reapply <int> are run, NetworkManager deletes the VIP address and keepalived is completely unaware.

ip addr
    inet 10.170.1.10/24 brd 10.170.1.255 scope global noprefixroute eth0
    inet 10.170.1.50/32 scope global eth0
    inet 10.170.1.40/24 scope global secondary eth0

systemctl status keepalived
● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-06-01 17:22:03 EDT; 10min ago
  Process: 1358 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)

Jun 01 17:27:17 node01.example.com Keepalived_vrrp[1361]: Sending gratuitous ARP on eth0 for 10.170.1.40
Jun 01 17:27:17 node01.example.com Keepalived_vrrp[1361]: Sending gratuitous ARP on eth0 for 10.170.1.40

ip addr
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0c:1a:4e:44:ba:00 brd ff:ff:ff:ff:ff:ff
    inet 10.170.1.10/24 brd 10.170.1.255 scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::9e80:3cbf:b409:d551/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
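
Since keepalived 1.3.5 does not notice the missing address by itself, an external check can at least surface the condition. The following is a minimal sketch only: the function name `vip_present` is hypothetical, and it assumes the one-line output format of `ip -o -4 addr show`. It was not part of the testing described in this report. Because the VIP is only expected on the node currently acting as MASTER, a bare presence check like this would also report "missing" on a healthy BACKUP node.

```bash
#!/bin/bash
# Hypothetical helper: succeed if address $1 appears in the output of
# `ip -o -4 addr show` supplied on stdin.
vip_present() {
    local vip=$1
    # In one-line (-o) output, field 3 is "inet" and field 4 is ADDRESS/PREFIX.
    awk -v vip="$vip" '
        $3 == "inet" { split($4, a, "/"); if (a[1] == vip) found = 1 }
        END { exit !found }'
}
```

Possible usage with the addresses from the output above: `ip -o -4 addr show dev eth0 | vip_present 10.170.1.40 || logger "VIP missing on eth0"`.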

/*Workaround*/

NetworkManager doesn't mark the interface down when it loses an IP address. So the question is how to remedy this and actually cause the failover to occur. I used the environment variables that NetworkManager passes to dispatcher scripts to put the link down and then bring it back up.

/etc/NetworkManager/dispatcher.d/pre-down.d/
```
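#!/bin/bash
# NetworkManager dispatcher script: bounce the link so keepalived sees it go down.

IFACE=$DEVICE_IP_IFACE
ACTION=$NM_DISPATCHER_ACTION
ADVRT=1   # seconds; should match advert_int in keepalived.conf

case "$ACTION" in
        down)
           ip link set dev "$IFACE" down
        ;;
        up)
           # Hold the link down for one advertisement interval, then bring it back up.
           ip link set dev "$IFACE" down; sleep "$ADVRT"; ip link set dev "$IFACE" up
           logger -p user.info "Interface $IFACE has been reset!"
        ;;
esac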

```

I set the ADVRT variable above to 1, as I observed that it needs to equal the integer value of advert_int defined in keepalived.conf. That gives the keepalived daemon one advertisement interval to detect that the link went down and that it cannot send its VRRP Hello packets. This is a crude way of doing it, but it works in my testing and implementation. It does not work with nmcli device reapply, which, from what I have found, is a different kind of operation that does not run the scripts in dispatcher.d.
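
Rather than hard-coding ADVRT, the value could be read from the configuration. This is a sketch under assumptions: the helper name `advert_int_from_conf` is hypothetical, the default path /etc/keepalived/keepalived.conf is assumed, and only the first advert_int in the file is used, so it suits a single vrrp_instance. It falls back to 1, which is keepalived's default advertisement interval.

```bash
#!/bin/bash
# Hypothetical helper: print the first advert_int value found in a keepalived
# configuration file, or 1 if the directive is not present.
advert_int_from_conf() {
    local conf=${1:-/etc/keepalived/keepalived.conf}
    awk '
        $1 == "advert_int" { print $2; found = 1; exit }
        END { if (!found) print 1 }' "$conf"
}
```

In the dispatcher script this would replace the fixed assignment, e.g. `ADVRT=$(advert_int_from_conf)`.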


/*RESEARCH*/
After digging a bit, I found the upstream commit and changelog entry where this was implemented, but it was implemented in version 2.0.0. It might be something to backport into the Red Hat Software Collections keepalived 1.5; from 1.3 it is a lot of commits and is probably not worth the effort.

Keepalived ChangeLog
Release 2.0.0 - https://www.keepalived.org/changelog.html

* Monitor VIP/eVIP deletion and transition to backup if a VIP/eVIP
    is removed unless it is configured with the no-track option.
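
On 1.3.5, the detection half of that behaviour can be roughly approximated from outside keepalived by watching netlink address events. This is a sketch only, not the upstream implementation: the function name `watch_vip_deleted` is hypothetical, and it assumes the `ip monitor address` output format in which deletion events start with the word "Deleted".

```bash
#!/bin/bash
# Hypothetical helper: read `ip monitor address` style lines on stdin and print
# the interface name each time the address given as $1 is reported as deleted.
watch_vip_deleted() {
    # Event fields: Deleted IDX: IFACE inet ADDRESS/PREFIX ...
    awk -v vip="$1" '
        $1 == "Deleted" && $4 == "inet" {
            split($5, a, "/")
            if (a[1] == vip) { print $3; fflush() }
        }'
}
```

Possible usage: `ip monitor address | watch_vip_deleted 10.170.1.40 | while read -r ifc; do logger "VIP deleted from $ifc"; done`. What to do on such an event is left open here. The report above only shows that restarting keepalived re-adds the address.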

https://github.com/acassen/keepalived/commits/v2.0.0

> Add no-track option for VIPs/eVIPs 
https://github.com/acassen/keepalived/issues/836

> Add tracking of VIPs/eVIPs on interfaces other the vrrp instance i/f
https://github.com/acassen/keepalived/commit/979727e5db1f0307149b2932267ed214ecd0850d

Comment 9 Chris Williams 2020-11-11 21:42:37 UTC

Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
From initial triage it does not appear the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2 and will now be closed.

From the RHEL life cycle page:
https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7,Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook  

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.  

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7