1573881 – Active-standby, Killing haproxy services in amphoras- Fail over is not seen

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-octavia
Sub Component:
Version:	13.0 (Queens)
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	z4
Target Release:	14.0 (Rocky)
Assignee:	Carlos Goncalves
QA Contact:	Bruna Bonguardo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1698576

Reported:	2018-05-02 12:37 UTC by Alexander Stafeyev
Modified:	2023-03-24 14:04 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-10-01 13:27:56 UTC
Target Upstream Version:
Embargoed:

Description Alexander Stafeyev 2018-05-02 12:37:35 UTC

Description of problem:

We are running Octavia in an active-standby topology.
While sending curl requests to the LB, we killed all haproxy services from within the master amphora:
systemctl stop system.slice
ps -ef | grep hapr

kill -9 PID1 PID2 PID3

Traffic failed for a few seconds and then recovered.
Failover occurred, BUT "openstack loadbalancer amphora list" did not show that there was a failover:


(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list 
+--------------------------------------+--------------------------------------+-----------+--------+----------------+-------------+
| id                                   | loadbalancer_id                      | status    | role   | lb_network_ip  | ha_ip       |
+--------------------------------------+--------------------------------------+-----------+--------+----------------+-------------+
| 29db41ff-7fcc-4f75-90f0-93dc7df62bcc | 9c43f7ed-a929-4327-8766-9bad1d94f958 | ALLOCATED | MASTER | 192.168.199.58 | 192.168.1.8 |
| f86fe48e-eec5-47d3-ad95-28095714de65 | 9c43f7ed-a929-4327-8766-9bad1d94f958 | ALLOCATED | BACKUP | 192.168.199.54 | 192.168.1.8 |
+--------------------------------------+--------------------------------------+-----------+--------+----------------+-------------+

In addition, none of the Octavia logs contained any information about the haproxy failures.
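One way to check whether the amphora table really is unchanged is to capture the "openstack loadbalancer amphora list" output before and after the test and compare the two snapshots. The following is a minimal sketch, assuming the default ASCII table output shown above; the function names (parse_cli_table, roles_by_amphora, describe_change) are hypothetical helpers and not part of any OpenStack tool.

```python
def parse_cli_table(text):
    """Parse an ASCII table, as printed by the openstack CLI, into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines, prompts and blank lines
        # Drop the empty strings produced by the leading and trailing "|".
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells  # the first "|" line is the column header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def roles_by_amphora(rows):
    """Map amphora id -> role for one snapshot."""
    return {row["id"]: row["role"] for row in rows}


def describe_change(before, after):
    """Report which amphora ids disappeared, appeared, or changed role."""
    b, a = roles_by_amphora(before), roles_by_amphora(after)
    return {
        "removed": sorted(set(b) - set(a)),
        "added": sorted(set(a) - set(b)),
        "role_changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }
```

If all three lists come back empty, the table did not change between the snapshots, which is the behavior described in this report.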


Version-Release number of selected component (if applicable):
13

How reproducible:
100%

Steps to Reproduce:
1. From the master amphora, run: systemctl stop system.slice
2. From the master amphora, run: ps -ef | grep haprox, then kill -9 all the listed PIDs
3. Execute "openstack loadbalancer amphora list"
4. Check the Octavia logs for information about the failure.

Actual results:
The failover (which did occur) is not shown by the amphora list command (the "openstack loadbalancer failover LB" command does show the state changes and the new master).

No info in logs

Expected results:

We should see the new master.
We should see all information about the failure in the logs.

Additional info:

Comment 1 Alexander Stafeyev 2018-05-02 12:39:28 UTC

Restarting system.slice does not recover the other haproxy services.

Comment 3 Alexander Stafeyev 2018-05-08 14:31:30 UTC

After debugging with NirM, we saw the following: 

Killing the haproxy service for a listener DOES initiate failover for traffic, BUT the amphora list table does not change.


The heartbeats continued to be sent to the Octavia controller, which may be why the table did not change.

The next step was to kill the amphora agent (to prevent heartbeats from being sent). The table changed after some time and the amphora was deleted by housekeeping, BUT a new one failed to be created and the LB moved to ERROR state.
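The difference between the two experiments can be illustrated with a simplified model of heartbeat-based staleness detection. This is only a sketch of the general idea, not Octavia's implementation; the function name stale_amphorae, the timeout_seconds parameter and the amphora labels are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def stale_amphorae(last_heartbeat, now, timeout_seconds):
    """Return the ids whose most recent heartbeat is older than the timeout.

    last_heartbeat maps an amphora id to the datetime of its last heartbeat.
    An amphora whose agent keeps sending heartbeats is never reported here,
    even if haproxy inside it has been killed.
    """
    limit = timedelta(seconds=timeout_seconds)
    return sorted(amp for amp, seen in last_heartbeat.items() if now - seen > limit)
```

In this model, an amphora with haproxy killed but the agent still running looks healthy, while an amphora whose agent was killed becomes stale once the timeout elapses. That would be consistent with the table changing only after the agent was stopped.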

Additional incorrect behavior seen: when we killed the haproxy service, we saw "listener state DOWN" in the logs.

Should the listener change states when there is a backup amphora ready to take control? 

FYI

Comment 4 Alexander Stafeyev 2018-05-09 12:43:04 UTC

Additional info - 

New amphora allocation is not executed at all.

Comment 9 Carlos Goncalves 2019-10-01 13:27:56 UTC

I cannot reproduce this. It might have been fixed in a release newer than the one originally reported against.

1. Create load balancer and listener

$ openstack loadbalancer create --vip-subnet-id private-subnet --name lb-1
$ openstack loadbalancer listener create --protocol HTTP --protocol-port 80 --name listener-1 lb-1

2. Check listener is ACTIVE/ONLINE

$ openstack loadbalancer listener show listener-1
+-----------------------------+--------------------------------------+
| Field                       | Value                                |
+-----------------------------+--------------------------------------+
| admin_state_up              | True                                 |
| connection_limit            | -1                                   |
| created_at                  | 2019-10-01T12:45:48                  |
| default_pool_id             | None                                 |
| default_tls_container_ref   | None                                 |
| description                 |                                      |
| id                          | 28de0348-aaa7-4032-9604-0df9def243e2 |
| insert_headers              | None                                 |
| l7policies                  |                                      |
| loadbalancers               | 8a4f3795-6656-4c45-b80e-aee33fd8cf0f |
| name                        | listener-1                           |
| operating_status            | ONLINE                               |
| project_id                  | 6029ff484d3b42afaf7d3fcf9d4c1392     |
| protocol                    | HTTP                                 |
| protocol_port               | 80                                   |
| provisioning_status         | ACTIVE                               |
| sni_container_refs          | []                                   |
| timeout_client_data         | 50000                                |
| timeout_member_connect      | 5000                                 |
| timeout_member_data         | 50000                                |
| timeout_tcp_inspect         | 0                                    |
| updated_at                  | 2019-10-01T13:06:12                  |
| client_ca_tls_container_ref | None                                 |
| client_authentication       | NONE                                 |
| client_crl_container_ref    | None                                 |
| allowed_cidrs               | None                                 |
+-----------------------------+--------------------------------------+

3. Stop haproxy systemd service in the amphora
[centos@amphora-4869a150-15a2-4da7-a8c0-ed3c9c6ae9ee ~]$ sudo systemctl stop haproxy-8a4f3795-6656-4c45-b80e-aee33fd8cf0f.service

4. Octavia Health Manager detected that the number of listeners reported by the amphora was inconsistent with its database, triggered an amphora failover, and set the listener operating_status to OFFLINE.

$ openstack loadbalancer listener show listener-1
+-----------------------------+--------------------------------------+
| Field                       | Value                                |
+-----------------------------+--------------------------------------+
| admin_state_up              | True                                 |
| connection_limit            | -1                                   |
| created_at                  | 2019-10-01T12:45:48                  |
| default_pool_id             | None                                 |
| default_tls_container_ref   | None                                 |
| description                 |                                      |
| id                          | 28de0348-aaa7-4032-9604-0df9def243e2 |
| insert_headers              | None                                 |
| l7policies                  |                                      |
| loadbalancers               | 8a4f3795-6656-4c45-b80e-aee33fd8cf0f |
| name                        | listener-1                           |
| operating_status            | OFFLINE                              |
| project_id                  | 6029ff484d3b42afaf7d3fcf9d4c1392     |
| protocol                    | HTTP                                 |
| protocol_port               | 80                                   |
| provisioning_status         | ACTIVE                               |
| sni_container_refs          | []                                   |
| timeout_client_data         | 50000                                |
| timeout_member_connect      | 5000                                 |
| timeout_member_data         | 50000                                |
| timeout_tcp_inspect         | 0                                    |
| updated_at                  | 2019-10-01T13:06:28                  |
| client_ca_tls_container_ref | None                                 |
| client_authentication       | NONE                                 |
| client_crl_container_ref    | None                                 |
| allowed_cidrs               | None                                 |
+-----------------------------+--------------------------------------+

WARNING octavia.controller.healthmanager.health_drivers.update_db [-] Amphora 4869a150-15a2-4da7-a8c0-ed3c9c6ae9ee health message reports 0 listeners when 1 expected
INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: 4869a150-15a2-4da7-a8c0-ed3c9c6ae9ee
INFO octavia.controller.worker.v1.controller_worker [-] Perform failover for an amphora: {'compute_id': u'659dd232-b94c-46b2-927a-fe162317f34d', 'role': 'master_or_backup', 'id': u'4869a150-15a2-4da7-a8c0-ed3c9c6ae9ee', 'lb_network_ip': u'192.168.0.79', 'load_balancer_id': u'8a4f3795-6656-4c45-b80e-aee33fd8cf0f'}
[...]
INFO octavia.controller.worker.v1.controller_worker [-] Successfully completed the failover for an amphora: {'compute_id': u'659dd232-b94c-46b2-927a-fe162317f34d', 'role': 'master_or_backup', 'id': u'4869a150-15a2-4da7-a8c0-ed3c9c6ae9ee', 'lb_network_ip': u'192.168.0.79', 'load_balancer_id': u'8a4f3795-6656-4c45-b80e-aee33fd8cf0f'}
INFO octavia.controller.worker.v1.controller_worker [-] Mark ACTIVE in DB for load balancer id: 8a4f3795-6656-4c45-b80e-aee33fd8cf0f
INFO octavia.controller.healthmanager.health_manager [-] Attempted 1 failovers of amphora
INFO octavia.controller.healthmanager.health_manager [-] Failed at 0 failovers of amphora
INFO octavia.controller.healthmanager.health_manager [-] Cancelled 0 failovers of amphora
INFO octavia.controller.healthmanager.health_manager [-] Successfully completed 1 failovers of amphora
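When checking a health manager log for this behavior, the two kinds of lines that matter in the excerpt above are the listener-count warning and the failover summary counters. A minimal sketch for pulling them out of a log follows. The regular expressions are written against the exact wording of the lines quoted in this comment and may not match other releases; the function name summarize is hypothetical.

```python
import re

# Matches: "Amphora <uuid> health message reports 0 listeners when 1 expected"
LISTENER_MISMATCH = re.compile(
    r"Amphora (?P<amphora>[0-9a-f-]{36}) health message reports "
    r"(?P<reported>\d+) listeners when (?P<expected>\d+) expected"
)

# Matches the per-cycle counters, e.g. "Attempted 1 failovers of amphora"
FAILOVER_SUMMARY = re.compile(
    r"(?P<outcome>Attempted|Failed at|Cancelled|Successfully completed) "
    r"(?P<count>\d+) failovers of amphora"
)


def summarize(log_text):
    """Return (listener mismatches, failover counters) found in log_text."""
    mismatches = []
    counts = {}
    for line in log_text.splitlines():
        match = LISTENER_MISMATCH.search(line)
        if match:
            mismatches.append(
                (match["amphora"], int(match["reported"]), int(match["expected"]))
            )
            continue
        match = FAILOVER_SUMMARY.search(line)
        if match:
            # Later lines overwrite earlier ones, so this keeps the last cycle seen.
            counts[match["outcome"]] = int(match["count"])
    return mismatches, counts
```

An empty result for both values on the originally reported release would match the "no info in logs" observation in the description.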

5. Listener came back ONLINE and a new amphora was created.

$ openstack loadbalancer listener show listener-1
+-----------------------------+--------------------------------------+
| Field                       | Value                                |
+-----------------------------+--------------------------------------+
| admin_state_up              | True                                 |
| connection_limit            | -1                                   |
| created_at                  | 2019-10-01T12:45:48                  |
| default_pool_id             | None                                 |
| default_tls_container_ref   | None                                 |
| description                 |                                      |
| id                          | 28de0348-aaa7-4032-9604-0df9def243e2 |
| insert_headers              | None                                 |
| l7policies                  |                                      |
| loadbalancers               | 8a4f3795-6656-4c45-b80e-aee33fd8cf0f |
| name                        | listener-1                           |
| operating_status            | ONLINE                               |
| project_id                  | 6029ff484d3b42afaf7d3fcf9d4c1392     |
| protocol                    | HTTP                                 |
| protocol_port               | 80                                   |
| provisioning_status         | ACTIVE                               |
| sni_container_refs          | []                                   |
| timeout_client_data         | 50000                                |
| timeout_member_connect      | 5000                                 |
| timeout_member_data         | 50000                                |
| timeout_tcp_inspect         | 0                                    |
| updated_at                  | 2019-10-01T13:06:32                  |
| client_ca_tls_container_ref | None                                 |
| client_authentication       | NONE                                 |
| client_crl_container_ref    | None                                 |
| allowed_cidrs               | None                                 |
+-----------------------------+--------------------------------------+

$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+--------+---------------+----------+
| id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip    |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+----------+
| 702f2042-d670-498d-8958-268ab1f71bab | 8a4f3795-6656-4c45-b80e-aee33fd8cf0f | ALLOCATED | MASTER | 192.168.0.10  | 10.0.0.3 |
| c21d5a35-d2ba-498a-bbc1-193d7fee41fc | 8a4f3795-6656-4c45-b80e-aee33fd8cf0f | ALLOCATED | BACKUP | 192.168.0.91  | 10.0.0.3 |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+----------+
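The three "listener show" outputs in this comment can be read as a rough timeline of the listener's operating_status. The sketch below does the arithmetic on the operating_status and updated_at values copied from those outputs. Note that updated_at records when the listener record was last written, so the intervals are approximate bounds and not measured downtime; the function name transitions is hypothetical.

```python
from datetime import datetime

# (operating_status, updated_at) taken from the three "listener show" outputs
snapshots = [
    ("ONLINE", "2019-10-01T13:06:12"),
    ("OFFLINE", "2019-10-01T13:06:28"),
    ("ONLINE", "2019-10-01T13:06:32"),
]


def transitions(snaps):
    """Return (old_status, new_status, seconds_between_updates) per status change."""
    result = []
    for (old, t_old), (new, t_new) in zip(snaps, snaps[1:]):
        if old != new:
            delta = datetime.fromisoformat(t_new) - datetime.fromisoformat(t_old)
            result.append((old, new, int(delta.total_seconds())))
    return result


print(transitions(snapshots))
# [('ONLINE', 'OFFLINE', 16), ('OFFLINE', 'ONLINE', 4)]
```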
