Bug 1286729

Summary: Can't restart keepalived after it's killed
Product: Red Hat Enterprise Linux 7 Reporter: Ihar Hrachyshka <ihrachys>
Component: keepalived Assignee: Ryan O'Hara <rohara>
Status: CLOSED ERRATA QA Contact: Brandon Perkins <bperkins>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.2 CC: cluster-maint, fdinitto, ihrachys, ushkalim
Target Milestone: rc   
Target Release: 7.4   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-01 19:36:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1449769    
Bug Blocks:    

Description Ihar Hrachyshka 2015-11-30 15:40:03 UTC
Description of problem: keepalived fails to respawn after crash when running OSP8 (Liberty) neutron functional tests.

Version-Release number of selected component (if applicable): keepalived 1.2.13-6.el7

How reproducible: always.

First, OSP8 based steps to reproduce:
1. set up OSP8 system.
2. run test_keepalived_respawns functional test for OSP8 neutron.
3. experience the following failure.

==============================
Failed 1 tests - output below:
==============================

neutron.tests.functional.agent.linux.test_keepalived.KeepalivedManagerTestCase.test_keepalived_respawns
-------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "neutron/tests/functional/agent/linux/test_keepalived.py", line 73, in test_keepalived_respawns
        exception=RuntimeError(_("Keepalived didn't respawn")))
      File "neutron/agent/linux/utils.py", line 339, in wait_until_true
        eventlet.sleep(sleep)
      File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 34, in sleep
        hub.switch()
      File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
        return self.greenlet.switch()
    RuntimeError: Keepalived didn't respawn
    

Captured pythonlogging:
~~~~~~~~~~~~~~~~~~~~~~~
    2015-11-30 11:04:40,067  WARNING [oslo_config.cfg] Option "verbose" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored in the future.
    2015-11-30 11:04:41,103    ERROR [neutron.agent.linux.external_process] default-service for router with uuid router1 not found. The process should not have died
    2015-11-30 11:04:41,103    ERROR [neutron.agent.linux.external_process] Respawning keepalived for uuid router1
    2015-11-30 11:04:42,103    ERROR [neutron.agent.linux.external_process] default-service for router with uuid router1 not found. The process should not have died
    2015-11-30 11:04:42,103    ERROR [neutron.agent.linux.external_process] Respawning keepalived for uuid router1
    2015-11-30 11:04:43,104    ERROR [neutron.agent.linux.external_process] default-service for router with uuid router1 not found. The process should not have died
    2015-11-30 11:04:43,104    ERROR [neutron.agent.linux.external_process] Respawning keepalived for uuid router1
    2015-11-30 11:04:44,104    ERROR [neutron.agent.linux.external_process] default-service for router with uuid router1 not found. The process should not have died
    2015-11-30 11:04:44,104    ERROR [neutron.agent.linux.external_process] Respawning keepalived for uuid router1
    2015-11-30 11:04:45,104    ERROR [neutron.agent.linux.external_process] default-service for router with uuid router1 not found. The process should not have died
    2015-11-30 11:04:45,104    ERROR [neutron.agent.linux.external_process] Respawning keepalived for uuid router1

The test starts a neutron HA router, kills keepalived with SIGKILL (kill -9), and verifies that it comes back up.
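The polling pattern behind this check can be sketched in plain Python. This is a simplified stand-in, not neutron's code: the real wait_until_true in neutron/agent/linux/utils.py sleeps via eventlet (as the traceback shows), and the signature and defaults below are assumptions.

```python
import time


def wait_until_true(predicate, timeout=5.0, sleep=0.1, exception=None):
    """Poll predicate() until it returns True or `timeout` seconds elapse.

    Simplified illustration only; names and defaults are assumed.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            # Raise the caller's exception if given, else a generic one.
            raise exception or RuntimeError("Timed out waiting for condition")
        time.sleep(sleep)
```

In the failing test the predicate would be "the keepalived process is active again", and the exception is RuntimeError("Keepalived didn't respawn"), which is the error seen in the traceback.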

As the output shows, the Neutron L3 agent tries to respawn the killed keepalived process, but it never comes up. This is because of the following scenario:

- when keepalived starts, it spawns a vrrp thread to monitor the forked vrrp process; it also creates a vrrp pid file;
- when the process is killed and then restarted, the following happens: the new keepalived process runs with -P, so bit 1 is set in daemon_mode [1]; when keepalived then checks whether it is already running [2], it checks the vrrp pid file [3]. Since nothing cleans up the file before the new process starts, and the killed process has no chance to clean it up in its signal handler, respawn never works.

[1] https://github.com/acassen/keepalived/blob/03da0d2d0393808bbb2feac7abc07aaf8d647855/keepalived/core/main.c#L236
[2] https://github.com/acassen/keepalived/blob/03da0d2d0393808bbb2feac7abc07aaf8d647855/keepalived/core/main.c#L291
[3] https://github.com/acassen/keepalived/blob/03da0d2d0393808bbb2feac7abc07aaf8d647855/keepalived/core/pidfile.c#L92
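The startup check described in the scenario above can be modelled roughly as follows. This is a Python sketch, not keepalived's C source: the bit values, helper names, and the use of signal 0 to probe the recorded pid are assumptions made for illustration.

```python
import os

VRRP_BIT = 1      # assumed: the bit that -P sets in daemon_mode
CHECKERS_BIT = 2  # assumed: the bit for the checkers-only mode


def process_running(pidfile):
    """Return True if pidfile names a process that still exists (hypothetical helper)."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False          # no file, or no usable pid in it
    try:
        os.kill(pid, 0)       # signal 0 only probes for existence
    except ProcessLookupError:
        return False          # stale pid file: the process is gone
    except PermissionError:
        return True           # process exists but belongs to another user
    return True


def already_running(mode, main_pidfile, vrrp_pidfile, checkers_pidfile):
    """Model of the 'is keepalived already running?' check at startup."""
    if process_running(main_pidfile):
        return True
    if (mode & VRRP_BIT or not mode) and process_running(vrrp_pidfile):
        return True
    if (mode & CHECKERS_BIT or not mode) and process_running(checkers_pidfile):
        return True
    return False
```

Under this model, a vrrp pid file that still points at a live process is enough to make a -P start conclude that keepalived is already running, even though the main process is gone.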

Additional info:

I noticed that we experience the issue in OSP8 only, while the same test passes in the upstream gate. I compared keepalived versions and realized that upstream runs against keepalived 1.2.7, while OSP8 uses 1.2.13. After git bisecting the keepalived tree, I found the following patch to be the first one that triggers the test failure:

https://github.com/acassen/keepalived/commit/6d88c3ea7fab764ef8b106fb150857379e32304a

The problem with the change seems to be that the vrrp pid file cleanup now never runs:

https://github.com/acassen/keepalived/commit/6d88c3ea7fab764ef8b106fb150857379e32304a#diff-0d41c8909ffc59514233517d4e78cbfeR143

These, I believe, are the steps to reproduce without OSP8:

$ (in the first console) keepalived -P -l -n -p keepalived.pid -r vrrp.pid -c checkers.pid
$ (in another console) kill -9 the main keepalived process
$ (back to the first console) keepalived -P -l -n -p keepalived.pid -r vrrp.pid -c checkers.pid

The main process does not start because the vrrp pid file is present. I believe the main process is supposed to monitor and respawn the forked one, so I would expect it to still run after the fork if the previous one is dead for some reason.
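The steps above could be automated along these lines. This is an untested sketch: it reuses only the flags from the commands above, while the function names, the settle delay, and the use of a timeout to decide whether the second start "stayed up" are assumptions.

```python
import os
import subprocess
import time


def build_cmd(workdir):
    """keepalived command line from the steps above, with pid files under workdir."""
    return [
        "keepalived", "-P", "-l", "-n",
        "-p", os.path.join(workdir, "keepalived.pid"),
        "-r", os.path.join(workdir, "vrrp.pid"),
        "-c", os.path.join(workdir, "checkers.pid"),
    ]


def second_start_survives(workdir, settle=2.0):
    """Start keepalived, SIGKILL it, start it again.

    Returns True if the second run is still up after `settle` seconds,
    False if it exited on its own.
    """
    cmd = build_cmd(workdir)
    first = subprocess.Popen(cmd)
    time.sleep(settle)        # assumed long enough for the pid files to appear
    first.kill()              # Popen.kill() sends SIGKILL on POSIX (kill -9)
    first.wait()
    try:
        # With -n a healthy daemon stays in the foreground, so the timeout fires.
        subprocess.run(cmd, timeout=settle)
    except subprocess.TimeoutExpired:
        return True
    return False
```

Calling second_start_survives(os.getcwd()) on an affected system would be expected to return False. Any vrrp child left behind by the first run is not cleaned up by this sketch and has to be killed by hand.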

Comment 1 Ryan O'Hara 2015-12-01 02:35:00 UTC
(In reply to Ihar Hrachyshka from comment #0)
> These, I believe, are the steps to reproduce without OSP8:
> 
> $ (in the first console) keepalived -P -l -n -p keepalived.pid -r vrrp.pid
> -c checkers.pid
> $ (in another console) kill -9 the main keepalived process
> $ (back to the first console) keepalived -P -l -n -p keepalived.pid -r
> vrrp.pid -c checkers.pid
> 
> The main process does not start because the vrrp pid file is present. I
> believe the main process is supposed to monitor and respawn the forked one,
> so I would expect it to still run after the fork if the previous one is dead
> for some reason.

You're killing the main keepalived process and expecting it to respawn and/or clean up the PID file? Keepalived will only respawn if you kill the child process.

Comment 2 Ryan O'Hara 2015-12-01 02:40:43 UTC
Also, is there a reason that you're using the -c option with the -P option?

Comment 3 Ihar Hrachyshka 2015-12-02 13:56:44 UTC
Ryan, the expectation is that either one of the following occurs when we start another keepalived process after the previous one died:

- it starts managing the forked vrrp process;
- it kills the vrrp process and starts the new one, then continues running.

Currently, it just exits, leaving the vrrp thread running and unmonitored. This also results in the neutron L3 agent continuously trying to respawn the main keepalived process, because it checks that the main process is running, and it's not. Since neutron does not know anything about the child process, it never stops its respawn attempts (and considering that the main process should monitor the child, I think it's right to expect it to start rather than just exit).
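The resulting loop can be illustrated with a toy model. Everything here is hypothetical (the FakeKeepalived class and monitor function are not neutron or keepalived code); it only mirrors the behavior described above: the monitor watches the main process alone, and every restart is refused while the old vrrp child is still around.

```python
class FakeKeepalived:
    """Hypothetical stand-in for the main keepalived process."""

    def __init__(self):
        self.vrrp_child_alive = False  # child left over from a previous run
        self.active = False            # is the main process running?

    def start(self):
        # Modeled behavior: refuse to start while an old vrrp child exists.
        if self.vrrp_child_alive:
            self.active = False
            return
        self.vrrp_child_alive = True
        self.active = True

    def sigkill(self):
        # SIGKILL ends the main process only; the child keeps running.
        self.active = False


def monitor(process, checks):
    """Check the main process `checks` times; respawn whenever it is not active."""
    respawns = 0
    for _ in range(checks):
        if not process.active:
            respawns += 1
            process.start()
    return respawns
```

After start() followed by sigkill(), monitor(process, 5) performs five respawn attempts and the process is still inactive, which matches the repeated "Respawning keepalived" lines in the captured log.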

As for -c + -P options:

- I use -P to reflect what neutron l3 agent currently does: https://github.com/openstack/neutron/blob/81a4aac8d429a5d8b2874c4fce0ffd6498f5b1c6/neutron/agent/linux/keepalived.py#L396

- I added -c to make it run without root, locally. Maybe it's not really needed to reproduce the issue, not sure.

Comment 4 Ryan O'Hara 2015-12-02 15:01:03 UTC
After talking with Ihar on IRC I've learned that we're not talking about keepalived's respawn feature. The confusion was due to the use of the term "respawn": what he is referring to is the neutron process that monitors and "respawns" keepalived, not the parent keepalived process that can also respawn a failed thread.

Also, it seems that the VRRP (child) process is orphaned when the parent is killed with SIGKILL, which would explain why the VRRP process's pid file is still there. What I find most interesting is that older versions of keepalived supposedly do not have this problem: comment #0 states that version 1.2.7 works as expected. More tests are needed with actual PID numbers, pid files, pstree output, etc.
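The orphaning itself is ordinary POSIX behavior and can be shown without keepalived. In this sketch (all names hypothetical, POSIX only) a parent forks a child, records the child's pid in a file, and installs a SIGTERM handler that would remove that file. SIGKILL cannot be handled, so neither the cleanup nor any shutdown of the child happens.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

# Parent script: forks a child, records the child's pid, then idles.
PARENT_SRC = """
import os, signal, sys, time
pid = os.fork()
if pid == 0:
    time.sleep(60)            # child: stand-in for the vrrp process
    os._exit(0)

def cleanup(signum, frame):   # runs on SIGTERM, never on SIGKILL
    os.remove(sys.argv[1])
    sys.exit(0)

signal.signal(signal.SIGTERM, cleanup)
with open(sys.argv[1], "w") as f:
    f.write(str(pid))
time.sleep(60)                # parent: stand-in for the main process
"""


def orphan_demo():
    """Return (child_alive, pidfile_left) after SIGKILLing the parent."""
    with tempfile.TemporaryDirectory() as tmp:
        pidfile = os.path.join(tmp, "child.pid")
        parent = subprocess.Popen([sys.executable, "-c", PARENT_SRC, pidfile])
        deadline = time.monotonic() + 10
        while not (os.path.exists(pidfile) and os.path.getsize(pidfile) > 0):
            if time.monotonic() > deadline:
                parent.kill()
                raise RuntimeError("parent never wrote its pid file")
            time.sleep(0.05)
        parent.kill()         # SIGKILL: no handler runs, so no cleanup happens
        parent.wait()
        pidfile_left = os.path.exists(pidfile)
        with open(pidfile) as f:
            child_pid = int(f.read())
        try:
            os.kill(child_pid, 0)   # signal 0 only probes for existence
            child_alive = True
        except ProcessLookupError:
            child_alive = False
        if child_alive:
            os.kill(child_pid, signal.SIGKILL)  # tidy up the orphan
        return child_alive, pidfile_left
```

Both values come back true on a typical Linux system: the child outlives its parent and the pid file is left in place, which is the state a restarted keepalived would then find.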

Comment 5 Ryan O'Hara 2015-12-02 15:26:27 UTC
Please provide keepalived logs along with the steps used to generate said logs.

Comment 6 Ryan O'Hara 2015-12-02 19:03:21 UTC
(In reply to Ihar Hrachyshka from comment #0)
> I noticed that we experience the issue in OSP8 only, while the same test
> passes in the upstream gate. I compared keepalived versions and realized that
> upstream runs against keepalived 1.2.7, while OSP8 uses 1.2.13. After git
> bisecting the keepalived tree, I found the following patch to be the first
> one that triggers the test failure:

I see no difference in behavior for version 1.2.7.

Comment 7 Ihar Hrachyshka 2015-12-17 13:58:23 UTC
Note: we have found a way to handle it on the neutron side, so the urgency of this bug could be reduced. PS: I know I still owe better instructions to reproduce.

Comment 8 Ryan O'Hara 2015-12-17 16:42:27 UTC
(In reply to Ihar Hrachyshka from comment #7)
> Note: we have found a way to handle it on the neutron side, so the urgency
> of this bug could be reduced. PS: I know I still owe better instructions to
> reproduce.

At the moment I am more curious about why it worked with version 1.2.7 in the upstream gate but failed with our version. In my attempts to reproduce this, both 1.2.7 and the latest version behave the same. I'd like to see keepalived logs from the upstream gate test, if possible, and compare them with the keepalived logs from the failed test you're seeing.

Comment 9 Ryan O'Hara 2016-12-05 17:07:29 UTC
I've submitted a pull request here:

https://github.com/acassen/keepalived/pull/475

Comment 12 Ryan O'Hara 2017-03-23 02:25:49 UTC
Include with rebase to 1.3.5.

Comment 15 errata-xmlrpc 2017-08-01 19:36:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2169