Bug 1202584

Summary: Keepalived instances flapping to MASTER then back to STANDBY on failover with nopreempt
Product: [Fedora] Fedora
Reporter: Assaf Muller <amuller>
Component: keepalived
Assignee: Matthias Saou <matthias>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 20
CC: bperkins, matthias, pasik, rohara
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: keepalived-1.2.15-3.fc20
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-03-29 04:24:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Assaf Muller 2015-03-16 23:43:38 UTC
Description of problem:
Using OpenStack Neutron with highly available routers, configure three L3 agents and create an HA router. Enter the router namespace on the master node and set the HA device down. Then watch syslog on the other two nodes. On one of them you will see, over and over again:

Transition to MASTER STATE
Entering MASTER STATE
Received lower prio advert in nopreempt mode
Entering BACKUP STATE

On the other node:
Transition to MASTER STATE
Entering MASTER STATE
Received higher prio advert
Entering BACKUP STATE
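
The repeating pattern above can be spotted mechanically in a syslog capture. The following is a minimal sketch that relies only on the message substrings quoted in this report. count_flaps is a hypothetical helper name, not part of keepalived or Neutron, and real syslog lines carry extra prefixes that this substring match simply ignores.

```python
# Substrings copied from the syslog excerpts in this report.
MASTER = "Entering MASTER STATE"
BACKUP = "Entering BACKUP STATE"


def count_flaps(lines):
    """Count MASTER -> BACKUP round trips in an iterable of log lines."""
    flaps = 0
    in_master = False
    for line in lines:
        if MASTER in line:
            in_master = True
        elif BACKUP in line and in_master:
            # The instance was master and has just dropped back to backup.
            flaps += 1
            in_master = False
    return flaps


if __name__ == "__main__":
    # One cycle as seen on the first node, repeated three times.
    cycle = [
        "Transition to MASTER STATE",
        "Entering MASTER STATE",
        "Received lower prio advert in nopreempt mode",
        "Entering BACKUP STATE",
    ]
    print(count_flaps(cycle * 3))
```

A healthy failover should produce a count of zero on the node that takes over, so any count that keeps growing points at the behavior described here.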

Version-Release number of selected component (if applicable):
Does not reproduce on 1.2.9-1.fc20, does reproduce on 1.2.15-2.fc20.

How reproducible:
100%

Steps to Reproduce:
Detailed in problem description.

Actual results:
Nodes flapping from standby to master and back.

Expected results:
One node should become master; the other should remain in standby.
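
For comparison, here is a rough sketch of how a node already in the master state is expected to react to an incoming advertisement, based on my reading of the VRRP state machine in RFC 5798. This is an assumption about the standard, not a description of keepalived's code or of the upstream fix. The function name and the addresses in the example are made up for illustration.

```python
import ipaddress


def master_on_advert(local_prio, local_ip, adv_prio, adv_ip):
    """Return the state a MASTER ends up in after receiving one advert."""
    if adv_prio == 0:
        # Priority 0 means the sender is giving up mastership.
        return "MASTER"
    higher = adv_prio > local_prio
    # Equal priorities are broken by comparing the senders' addresses.
    tie_lost = (
        adv_prio == local_prio
        and ipaddress.ip_address(adv_ip) > ipaddress.ip_address(local_ip)
    )
    return "BACKUP" if (higher or tie_lost) else "MASTER"


if __name__ == "__main__":
    # A lower-priority advert should not demote the current master.
    print(master_on_advert(50, "169.254.192.2", 40, "169.254.192.1"))
```

Under this model, the "Received lower prio advert" followed by "Entering BACKUP STATE" sequence in the first log excerpt is the unexpected step.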

Additional information:
keepalived.conf on each node:
Node 1 - http://www.fpaste.org/198754/
Node 2 - http://www.fpaste.org/198756/
Node 3 - http://www.fpaste.org/198757/

'ip a' output in namespace of the router:
Node 1 - http://www.fpaste.org/198758/
Node 2 - http://www.fpaste.org/198759/
Node 3 - http://www.fpaste.org/198761/

The syslog summary is in the problem description above.

To work around the issue I tried specifying unique priorities, specifying the source advertisement address (a unique address per router instance, the one you can see in the 'ip a' output above), and setting the initial state to EQUAL. I tried these in pretty much every permutation, and nothing had any effect. Turning preemption on eliminated the issue entirely, but nopreempt should work and is preferred for the Neutron use case: with thousands of routers, we don't want a new election when the faulty node comes back, since that would be another needless disruption in the data plane.
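
To make the knobs mentioned above concrete, here is a small sketch that renders one vrrp_instance block with the options in question. It is not Neutron's generator, and the actual configurations are the ones in the fpaste links above. The instance name, interface name, router ID, and address below are placeholders chosen for illustration. The keywords themselves (state, interface, virtual_router_id, priority, nopreempt, mcast_src_ip) are standard keepalived.conf options.

```python
def render_vrrp_instance(name, interface, router_id, priority,
                         src_ip=None, nopreempt=True):
    """Render one vrrp_instance block of keepalived.conf as text."""
    lines = [
        f"vrrp_instance {name} {{",
        # keepalived documents that nopreempt needs an initial state of BACKUP.
        "    state BACKUP",
        f"    interface {interface}",
        f"    virtual_router_id {router_id}",
        f"    priority {priority}",
    ]
    if nopreempt:
        lines.append("    nopreempt")
    if src_ip is not None:
        # Source address used for the multicast advertisements.
        lines.append(f"    mcast_src_ip {src_ip}")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values; substitute the real ones from each node.
    print(render_vrrp_instance("VR_1", "ha-example", 1, 50,
                               src_ip="169.254.192.1"))
```

Calling it with nopreempt=False gives the preemption-enabled variant that made the flapping go away in my tests.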

Comment 1 Assaf Muller 2015-03-17 01:28:07 UTC
More information: the following two patches were introduced in 1.2.14 and changed the behavior:
e18370cb165d21db954c08ddbce1b39d97858012
13693a2d1b834c749394ef0bdee6afe9eb1fad2d

Comment 2 Ryan O'Hara 2015-03-18 13:55:44 UTC
Fixed upstream with this commit:

https://github.com/acassen/keepalived/commit/2bab517b2b50c1e784e79a082d971f4855e9e0ab

Should land in Fedora packages today.

Comment 3 Fedora Update System 2015-03-18 14:56:43 UTC
keepalived-1.2.15-3.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/keepalived-1.2.15-3.fc22

Comment 4 Fedora Update System 2015-03-18 14:56:49 UTC
keepalived-1.2.15-3.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/keepalived-1.2.15-3.fc21

Comment 5 Fedora Update System 2015-03-18 14:56:54 UTC
keepalived-1.2.15-3.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/keepalived-1.2.15-3.fc20

Comment 6 Fedora Update System 2015-03-19 18:41:47 UTC
Package keepalived-1.2.15-3.fc22:
* should fix your issue,
* was pushed to the Fedora 22 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing keepalived-1.2.15-3.fc22'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-4177/keepalived-1.2.15-3.fc22
then log in and leave karma (feedback).

Comment 7 Assaf Muller 2015-03-20 17:58:09 UTC
New RPM verified to fix the bug.

Comment 8 Fedora Update System 2015-03-29 04:24:31 UTC
keepalived-1.2.15-3.fc22 has been pushed to the Fedora 22 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 9 Fedora Update System 2015-03-29 04:34:37 UTC
keepalived-1.2.15-3.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 10 Fedora Update System 2015-03-29 04:34:44 UTC
keepalived-1.2.15-3.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.