Bug 1286729
Summary: | Can't restart keepalived after it's killed | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Ihar Hrachyshka <ihrachys> |
Component: | keepalived | Assignee: | Ryan O'Hara <rohara> |
Status: | CLOSED ERRATA | QA Contact: | Brandon Perkins <bperkins> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | | |
Version: | 7.2 | CC: | cluster-maint, fdinitto, ihrachys, ushkalim |
Target Milestone: | rc | | |
Target Release: | 7.4 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2017-08-01 19:36:38 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | 1449769 | | |
Bug Blocks: | | | |
Description
Ihar Hrachyshka
2015-11-30 15:40:03 UTC
(In reply to Ihar Hrachyshka from comment #0)
> That's, I believe, steps to reproduce without OSP8:
>
> $ (in the first console) keepalived -P -l -n -p keepalived.pid -r vrrp.pid -c checkers.pid
> $ (in another console) kill -9 the main keepalived process
> $ (back to the first console) keepalived -P -l -n -p keepalived.pid -r vrrp.pid -c checkers.pid
>
> Main process not started because there is vrrp pid present. I believe the
> main process is supposed to monitor and respawn the forked one, so I would
> expect it should still run after fork if dead for some reason.

You're killing the main keepalived process and expecting it to respawn and/or clean up the PID file? Keepalived will only respawn if you kill the child process. Also, is there a reason that you're using the -c option with the -P option?

Ryan, the expectation is that one of the following occurs when we start another keepalived process after the previous one died:
- it starts managing the forked vrrp process;
- it kills the vrrp process and starts a new one, then continues running.

Currently, it just exits and leaves the vrrp process running and unmonitored. This also causes the neutron l3 agent to continuously try to respawn the main keepalived process, because it checks that the main process is running, and it's not. Since neutron does not know anything about the child process, it never stops the respawn attempts (and considering that the main process should monitor the child, I guess it's right to expect it to start and keep running rather than just exit).

As for the -c and -P options:
- I use -P to reflect what the neutron l3 agent currently does: https://github.com/openstack/neutron/blob/81a4aac8d429a5d8b2874c4fce0ffd6498f5b1c6/neutron/agent/linux/keepalived.py#L396
- I added -c to make it run without root, locally. Maybe it's not really needed to reproduce the issue, not sure.

After talking with Ihar on irc I've learned that we're not talking about keepalived's respawn feature.
This confusion was due to the use of the term "respawn": what he is referring to is the neutron process that will monitor and "respawn" keepalived, not the parent keepalived process that can also respawn a failed thread. Also, it seems that the VRRP (child) process is orphaned when the parent is killed with SIGKILL (which cannot be caught), so that would explain why the VRRP process's pid file is still there. What I find most interesting is that older versions of keepalived supposedly do not have this problem. Comment #1 states that version 1.2.7 works as expected. More tests are needed with actual pid numbers, files, pstree output, etc. Please provide keepalived logs along with the steps used to generate said logs.

(In reply to Ihar Hrachyshka from comment #0)
> I noticed that we experience the issue in OSP8 only, while the same test
> passes in upstream gate. I compared keepalived versions and realized that in
> upstream, we run against keepalived 1.2.7, while in OSP8 we use 1.2.13.
> After git bisecting the keepalived tree, I found out the following patch to
> be the first one that triggers the failure for the test:

I see no difference in behavior for version 1.2.7.

Note: we have found a way to handle it on the neutron side, so the urgency of this bug could be reduced. PS: I know I still owe better instructions to reproduce.

(In reply to Ihar Hrachyshka from comment #7)
> Note: we have found out a way to handle it on neutron side, so urgency could
> be reduced for the bug. PS: I know I still owe better instructions to
> reproduce.

At the moment I am more curious about why it worked with version 1.2.7 in the upstream gate but failed with our version. In my attempts to reproduce this, both 1.2.7 and the latest version behave the same. I'd like to see keepalived logs from the upstream gate test, if possible, and compare them with the keepalived logs from the failed test you're seeing.

I've submitted a pull request here: https://github.com/acassen/keepalived/pull/475

Included with the rebase to 1.3.5.
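
For illustration only, the second expectation discussed in this thread (kill the orphaned vrrp process named in the leftover pid file, then start keepalived again) could be sketched on the supervisor side roughly as below. This is a minimal sketch under stated assumptions, not the neutron workaround and not the change from the upstream pull request: the helper names `read_pid`, `pid_alive`, `send_signal` and `cleanup_stale_child` are hypothetical, and the pid file is assumed to contain a single decimal PID.

```python
import os
import signal
import time


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if the file is missing or invalid."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return None
    # Reject 0 and negative values: os.kill() treats them as process groups.
    return pid if pid > 0 else None


def pid_alive(pid):
    """Return True if a process with this PID exists (signal 0 only probes).

    Note: an unreaped zombie still counts as existing.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def send_signal(pid, sig):
    """Send sig to pid, ignoring the case where the process is already gone."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        pass


def cleanup_stale_child(pid_file, timeout=5.0):
    """Terminate the process named in a leftover pid file, then remove the file.

    Returns True if a live process was found and signalled, False otherwise.
    Assumption: the PID still belongs to the orphaned child. PIDs can be
    reused, so a real implementation should verify the command line first.
    """
    pid = read_pid(pid_file)
    signalled = False
    if pid is not None and pid_alive(pid):
        send_signal(pid, signal.SIGTERM)  # ask politely first
        deadline = time.monotonic() + timeout
        while pid_alive(pid) and time.monotonic() < deadline:
            time.sleep(0.1)
        if pid_alive(pid):
            send_signal(pid, signal.SIGKILL)  # escalate if still present
        signalled = True
    try:
        os.remove(pid_file)
    except FileNotFoundError:
        pass
    return signalled
```

A supervisor could call `cleanup_stale_child("vrrp.pid")` before re-running the `keepalived -P -l -n -p keepalived.pid -r vrrp.pid -c checkers.pid` command from the reproduction steps, so that the new main process does not find a pid file for a child it never started.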
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2169