Bug 1773531
| Summary: | Haproxy process failure is not triggering amphora recreation | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alexander Stafeyev <astafeye> |
| Component: | openstack-octavia | Assignee: | Carlos Goncalves <cgoncalves> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Bruna Bonguardo <bbonguar> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 13.0 (Queens) | CC: | bperkins, cgoncalves, gthiemon, ihrachys, lpeer, majopela, michjohn, njohnston, scohen |
| Target Milestone: | z12 | Keywords: | Triaged, ZStream |
| Target Release: | 13.0 (Queens) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-14 14:31:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1775291 | | |
| Bug Blocks: | | | |
Description
Alexander Stafeyev
2019-11-18 11:23:46 UTC
IMPORTANT! If, instead of stopping haproxy-LISTENERID.service, we kill the haproxy-LISTENERID process with "kill -9", the process is NOT recovered. Reproduced every time. The VIP moves to the backup, but the haproxy-LISTENERID process is not restarted on the MASTER. (Maybe this is on purpose, since recreation of the amphora is expected after a failover.)

Michael Johnson (comment #3):

When you use "systemctl stop" you are disabling the automatic process restarting capability. This means it is expected that the HAProxy process will not automatically respawn when it is manually stopped.

When using "kill -9" the HAProxy process(es) will recover, but this will not trigger a controller failover; it is handled automatically inside the amphora instance.

A "kill -9" to one of the HAProxy child processes will terminate the child process, but the parent will respawn children as necessary. In this case HAProxy does not stop processing requests and will recover. As HAProxy is still alive and servicing requests, this will not trigger a controller failover.

A "kill -9" to the HAProxy parent process (the systemd wrapper) will cause HAProxy to exit. In this scenario systemd will respawn the parent haproxy process and processing will resume. This respawn is fast enough that it will not trigger a controller failover of the amphora.

With the above in mind, I think we should fix the first scenario: when HAProxy is manually stopped via "systemctl stop", the controller should take corrective action by failing over the amphora. Since this requires someone to log into the amphora and trigger it manually, I am going to drop the severity of this bug to medium.

Alexander Stafeyev:

(In reply to Michael Johnson from comment #3)

Hi Michael, thank you for your response. From what I experienced, after "kill -9" the haproxy did not recover. I think that if the process dies, a failover should occur, and maybe a "systemctl restart" of the service should happen behind the scenes.

Michael Johnson:

I think this issue of haproxy not recovering from "kill -9" is separate from the "systemctl stop" and HA/active-standby functionality, and should be split out into a separate bug. Capturing notes here, however.

This was tested on recent images and works correctly: HAProxy is restarted in under a second when its parent process is killed with -9. However, I pulled down the image Alex is using, based on RHEL 7.7, and found it is not restarting as expected.
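The restart semantics described in comment #3 follow directly from the systemd unit options involved. The following is a minimal sketch of what such a wrapper unit could look like; the unit name, paths, and option values here are illustrative assumptions, not the exact file shipped in the amphora image:

```ini
# Sketch of a haproxy wrapper unit with the behavior described above.
# Unit name, paths, and exact option values are illustrative assumptions.
[Unit]
Description=HAProxy for listener LISTENERID
Requires=amphora-netns.service
After=amphora-netns.service

[Service]
ExecStart=/usr/sbin/haproxy -f /var/lib/octavia/LISTENERID/haproxy.cfg -db
# Restart=always makes systemd respawn the service whenever the process
# exits on its own (including after "kill -9" of the parent), but not after
# an explicit "systemctl stop", which marks the unit as intentionally stopped.
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
```

With a unit like this, "systemctl stop" leaves the process down until it is explicitly started again, while killing the parent PID causes systemd to respawn it within about a second, matching the behavior described above.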
It has systemd version systemd-219-67.el7_7.2.x86_64 in it, which has a known defect that has since been fixed upstream: https://github.com/systemd/systemd/commit/a3c1168ac293f16d9343d248795bb4c246aaff4a

The amphora haproxy service definition has "Requires=amphora-netns" along with "Restart=always". The amphora-netns.service in turn has "StopWhenUnneeded=true" set. This triggers the above bug in older/unpatched versions of systemd: it creates a race condition in which, when the haproxy service fails due to the "kill -9", systemd starts stopping amphora-netns because it is no longer needed. This in turn makes the haproxy restart believe that a dependency service, amphora-netns, has failed, and the haproxy restart fails out.

Ideally we would want a version of systemd that does not have this race condition defect. Alternatively, we could evaluate whether the "StopWhenUnneeded=true" setting is really necessary for the amphora-netns service and remove it, which would work around the systemd bug.

Carlos Goncalves:

(In reply to Michael Johnson from comment #5)
> Ideally we would want a version of systemd that does not have this race
> condition defect. We could consider evaluating if the
> "StopWhenUnneeded=true" setting is really necessary for the amphora-netns
> service and remove it. This would work around the systemd bug.

Great troubleshooting, thank you. The systemd RPMs for RHEL 7.7, 8.0 and 8.1 do not include the patch you pointed to. The patch is also not in the systemd distgit (the Git repository that contains the .spec file used for building the RPM package), and I could not find an open systemd RHBZ related to this bug. Could you please file one and set it as blocking this BZ? Let us know if you need help.

Test-only BZ. Depends on https://bugzilla.redhat.com/show_bug.cgi?id=1775291.
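The race described above comes from the combination of the two unit options. The following sketch shows the relevant fragment under the assumption that the netns unit looks roughly like this; it is illustrative, not the exact amphora file:

```ini
# amphora-netns.service (sketch; contents are an illustrative assumption).
# StopWhenUnneeded=true tells systemd to stop this unit as soon as no active
# unit requires it. When haproxy dies from "kill -9", systemd briefly sees
# amphora-netns as unneeded and begins stopping it; on unpatched systemd-219
# the pending haproxy restart then observes its Requires= dependency going
# down and fails out instead of restarting.
[Unit]
Description=Configure the amphora network namespace
StopWhenUnneeded=true

# Proposed workaround: drop the StopWhenUnneeded=true line so the netns unit
# stays active across the haproxy restart and the race cannot occur.
```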