Bug 1773531

Summary: Haproxy process failure is not triggering amphora recreation
Product: Red Hat OpenStack
Component: openstack-octavia
Version: 13.0 (Queens)
Target Release: 13.0 (Queens)
Target Milestone: z12
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Hardware: Unspecified
OS: Unspecified
Keywords: Triaged, ZStream
Reporter: Alexander Stafeyev <astafeye>
Assignee: Carlos Goncalves <cgoncalves>
QA Contact: Bruna Bonguardo <bbonguar>
CC: bperkins, cgoncalves, gthiemon, ihrachys, lpeer, majopela, michjohn, njohnston, scohen
Type: Bug
Last Closed: 2021-07-14 14:31:32 UTC
Bug Depends On: 1775291

Description Alexander Stafeyev 2019-11-18 11:23:46 UTC
Description of problem:
Killing the haproxy process in the amphora does not trigger amphora recreation.


Version-Release number of selected component (if applicable):
13 
2019-11-04.1

How reproducible:
100%

Steps to Reproduce:
1. Deploy OpenStack with Octavia in ACTIVE_STANDBY topology (it can likely be reproduced in SINGLE topology as well).
2. Create a load balancer and a listener (a pool with members is optional).
3. SSH into the MASTER amphora.
4. Run systemctl status haproxy*. You will see a haproxy-<listenerID>.service unit.
5. Execute systemctl stop haproxy-<listenerID>.service
6. Alternatively:
[root@amphora-bf564257-110b-4e51-a79a-4291872481a4 ~]# ps -ef | grep ha
root      6657     1  0 06:17 ?        00:00:00 /usr/sbin/haproxy-systemd-wrapper -f /var/lib/octavia/46dce846-42f2-4f97-9f38-b0199d682c80/haproxy.cfg -f /var/lib/octavia/haproxy-default-user-group.conf -p /var/lib/octavia/46dce846-42f2-4f97-9f38-b0199d682c80/46dce846-42f2-4f97-9f38-b0199d682c80.pid -L V3EwQ8DC7bFLdG_CJGN1bGNCOy4
nobody    6660  6657  0 06:17 ?        00:00:00 /usr/sbin/haproxy -f /var/lib/octavia/46dce846-42f2-4f97-9f38-b0199d682c80/haproxy.cfg -f /var/lib/octavia/haproxy-default-user-group.conf -p /var/lib/octavia/46dce846-42f2-4f97-9f38-b0199d682c80/46dce846-42f2-4f97-9f38-b0199d682c80.pid -L V3EwQ8DC7bFLdG_CJGN1bGNCOy4 -Ds
nobody    6661  6660  0 06:17 ?        00:00:00 /usr/sbin/haproxy -f /var/lib/octavia/46dce846-42f2-4f97-9f38-b0199d682c80/haproxy.cfg -f /var/lib/octavia/haproxy-default-user-group.conf -p /var/lib/octavia/46dce846-42f2-4f97-9f38-b0199d682c80

Kill those processes. 
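
For reference, a minimal shell sketch of the kill-based path. The PIDs and listener UUID come from the ps output above and will differ per environment; the last command assumes an openstack client recent enough to support amphora listing:

# on the MASTER amphora: kill the systemd wrapper and the haproxy children
kill -9 6657 6660 6661

# from a client node: check whether the controller triggered a failover
openstack loadbalancer amphora list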


Actual results:
The VIP moves to the BACKUP node and traffic is OK.
The MASTER amphora is NOT recreated.

Expected results:
The VIP moves to the BACKUP node and traffic is OK.
The MASTER amphora IS RECREATED.

Additional info:
If both amphorae hit a haproxy failure and amphora recreation is not triggered, we lose LB functionality and customers will experience "No Service".

Comment 2 Alexander Stafeyev 2019-11-19 11:40:25 UTC
IMPORTANT! 

If we do not stop haproxy-<listenerID>.service but instead kill the haproxy-<listenerID> process with kill -9, the process is NOT recovered.

Reproduced every time. 
The VIP moves to the BACKUP node, but the haproxy-<listenerID> process is not restarted on the MASTER. (Maybe this is intentional, since after a failover the amphora is expected to be recreated.)
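
A quick sanity check for this (sketch; substitute the real listener UUID for <listenerID>):

# on the MASTER amphora, after the kill -9:
systemctl status haproxy-<listenerID>.service   # stays failed/inactive
ps -ef | grep [h]aproxy                         # no output: process never came back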

Comment 3 Michael Johnson 2019-11-20 16:16:26 UTC
When you use "systemctl stop" you are disabling the automatic process restarting capabilities. This means it is expected that the HAProxy process will not automatically respawn when it is manually stopped.

When using "kill -9" the HAProxy process(es) will recover, but this will not trigger a controller failover. It is handled automatically inside the amphora instance.
A "kill -9" to one of the HAProxy child processes will terminate the child process, but the parent will respawn children as necessary. In this case HAProxy does not stop processing requests and will recover. As the HAProxy is still alive and servicing requests, this will not trigger a controller failover.
A "kill -9" to the HAProxy parent process (the systemd wrapper) will cause HAProxy to exit. In this scenario systemd will respawn the parent haproxy process and processing will resume. This respawn is fast enough that it will not trigger a controller failover of the amphora.

With the above in mind, I think we should fix the first scenario: when HAProxy is manually stopped via "systemctl stop", we should have the controller take corrective action by failing over the amphora. Since this requires someone to log into the amphora and trigger it manually, I am going to drop the severity of this bug to medium.

Comment 4 Alexander Stafeyev 2019-11-20 19:13:06 UTC
(In reply to Michael Johnson from comment #3)
> When you use "systemctl stop" you are disabling the automatic process
> restarting capabilities. This means it is expected that the HAProxy process
> will not automatically respawn when it is manually stopped.
> 
> When using "kill -9" the HAProxy process(es) will recover, but this will not
> trigger a controller failover. It is handled automatically inside the
> amphora instance.
> A "kill -9" to one of the HAProxy child processes will terminate the child
> process, but the parent will respawn children as necessary. In this case
> HAProxy does not stop processing requests and will recover. As the HAProxy
> is still alive and servicing requests, this will not trigger a controller
> failover.
> A "kill -9" to the HAProxy parent process (the systemd wrapper) will cause
> HAProxy to exit. In this scenario systemd will respawn the parent haproxy
> process and processing will resume. This respawn is fast enough that it will
> not trigger a controller failover of the amphora.
> 
> With the above in mind, I think we should fix the first scenario, when
> HAProxy is manually stopped via "systemctl stop", we should have the
> controller take corrective action by failing over the amphora. Since this
> requires someone to log into the amphora and manually trigger, I am going to
> drop the severity of this bug to medium.

Hi Michael, 
Thank you for your response. 
From what I experienced, after kill -9 the haproxy did not recover. 
I think that if the process dies, a failover should occur, and maybe a systemctl restart of the service should happen behind the scenes.

Comment 5 Michael Johnson 2019-11-20 22:51:58 UTC
I think this issue with haproxy not recovering from kill -9 is separate from the systemctl stop and HA/active-standby functionality issue. It should be split out into a separate bug.

Capturing notes here, however.

This was tested on recent images and works correctly: HAProxy is restarted in under a second when its parent process is killed with -9.

However, I pulled down the image Alex is using, based on RHEL 7.7, and found it is not restarting as expected.

It has systemd version systemd-219-67.el7_7.2.x86_64, which has a known defect that has since been fixed upstream: https://github.com/systemd/systemd/commit/a3c1168ac293f16d9343d248795bb4c246aaff4a
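
To check which systemd build an amphora image carries (sketch; run inside a booted amphora or against a mounted image):

rpm -q systemd    # e.g. systemd-219-67.el7_7.2.x86_64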

The amphora haproxy service definition has "Requires=amphora-netns" along with "Restart=always". The amphora-netns.service in turn has "StopWhenUnneeded=true" set. This triggers the above bug in older/unpatched versions of systemd.
It creates a race condition in systemd: when the haproxy service fails due to the kill -9, systemd starts stopping amphora-netns because it is no longer needed. This in turn makes the haproxy restart think that a dependency service, amphora-netns, has failed, and the haproxy restart is aborted.
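
A simplified view of the unit relationships involved (sketch based on the description above; exact unit files vary by amphora image):

haproxy-<listenerID>.service:
    [Unit]
    Requires=amphora-netns.service
    [Service]
    Restart=always

amphora-netns.service:
    [Unit]
    StopWhenUnneeded=true    # once haproxy dies, nothing "needs" this unit,
                             # so systemd races to stop it while haproxy restarts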

Ideally we would want a version of systemd that does not have this race condition defect. Alternatively, we could evaluate whether the "StopWhenUnneeded=true" setting is really necessary for the amphora-netns service and, if not, remove it. That would work around the systemd bug.
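
One possible shape for that workaround, as a hypothetical systemd drop-in baked into the amphora image (sketch only; the actual fix tracked downstream may differ):

mkdir -p /etc/systemd/system/amphora-netns.service.d
cat > /etc/systemd/system/amphora-netns.service.d/override.conf <<'EOF'
[Unit]
# scalar settings in drop-ins override the packaged unit, disabling the
# stop-when-unneeded behavior that triggers the systemd race
StopWhenUnneeded=false
EOF
systemctl daemon-reload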

Comment 6 Carlos Goncalves 2019-11-21 08:44:38 UTC
(In reply to Michael Johnson from comment #5)
> Ideally we would want a version of systemd that does not have this race
> condition defect. We could consider evaluating if the
> "StopWhenUnneeded=true" setting is really necessary for the amphora-netns
> service and remove it. This would work around the systemd bug.

Great troubleshooting. Thank you.

Systemd RPMs for RHEL 7.7, 8.0 and 8.1 do not include the systemd patch you pointed to. The patch is also not in systemd dist-git (the Git repository that contains the .spec file used for building the RPM package), and I could not find an open systemd RHBZ related to this bug either. Could you please file one and set it as blocking this BZ? Let us know if you need help.

Comment 8 Carlos Goncalves 2019-11-21 17:18:10 UTC
Test-only BZ. Depends on https://bugzilla.redhat.com/show_bug.cgi?id=1775291.