Bug 1964361 - find health manager tuning to avoid amphora failover at scale
Summary: find health manager tuning to avoid amphora failover at scale
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Nate Johnston
QA Contact: Bruna Bonguardo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-25 10:18 UTC by anil venkata
Modified: 2022-08-17 15:23 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-23 15:00:24 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker OSP-4131 (last updated 2022-08-17 15:23 UTC)

Description anil venkata 2021-05-25 10:18:54 UTC
Description of problem:
With 8,000 load balancers (in a 3-controller, 50-compute-node environment) and the default settings of health_update_threads=12, stats_update_threads=12, and heartbeat_interval=10, we see amphorae failing over frequently because the health manager takes more than 10 seconds to process their heartbeats and update the DB.


2021-05-25 07:02:45.325 76 DEBUG octavia.amphorae.drivers.health.heartbeat_udp [-] Received packet from ('172.24.15.202', 12648) dorecv /usr/lib/python3.6/site-packages/octavia/amphorae/drivers/health/heartbeat_udp.py:189
2021-05-25 07:02:55.369 83 WARNING octavia.controller.healthmanager.health_drivers.update_db [-] Amphora 66a63aab-869a-411c-b4ee-d925eca8c26a health message was processed too slowly: 10.043575048446655s! The system may be overloaded or otherwise malfunctioning. This heartbeat has been ignored and no update was made to the amphora health entry. THIS IS NOT GOOD.
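
For reference, the settings mentioned above all live under [health_manager] in octavia.conf; the defaults described look roughly like this (illustrative excerpt only):

[health_manager]
health_update_threads = 12
stats_update_threads = 12
heartbeat_interval = 10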

After increasing the thread counts (health_update_threads=24, stats_update_threads=24) while keeping the default heartbeat_interval=10, we still see the same issue.

Roughly 800 DB updates per second (8,000 load balancers each sending a heartbeat every 10 seconds is about 800 heartbeats per second cluster-wide; we are not sure exactly how Octavia performs these DB updates) means some of the updates will take longer, because a database at this scale also carries a large number of resources and transactions from other components. So increasing the thread count (health_update_threads=24) or the controller node count does not solve this. The 800 LB updates per second is an approximate figure; the actual rate can be higher or lower depending on when each LB started sending heartbeats.

Possible solutions:
1. Have the health manager workers club heartbeats received around the same time from different amphora VMs into a batch and perform one batched DB update, instead of processing each heartbeat message and updating the DB separately for every incoming packet (a rough sketch of this idea follows below).
2. Increase the heartbeat interval to a larger value to avoid stressing the DB, and the data plane as well.
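
A minimal, illustrative sketch of idea 1 in Python, assuming a queue fed by the UDP receiver and an amphora health table keyed by amphora id. This is not Octavia's actual code; the table and column names (amphora_health, amphora_id, last_update) and the max_batch/max_wait values are assumptions for illustration only.

# Illustrative only -- NOT Octavia code. Buffer heartbeats for a short window,
# then flush them in a single executemany-style UPDATE instead of running one
# transaction per packet.
import queue
import time
from datetime import datetime

import sqlalchemy as sa

engine = sa.create_engine("sqlite://")   # stand-in for the real MySQL database
meta = sa.MetaData()
amphora_health = sa.Table(
    "amphora_health", meta,
    sa.Column("amphora_id", sa.String(36), primary_key=True),
    sa.Column("last_update", sa.DateTime),
)
meta.create_all(engine)

heartbeats = queue.Queue()               # filled by the UDP receiver thread(s)

def flush(batch):
    # Update many amphora rows in one transaction (executemany).
    stmt = (
        sa.update(amphora_health)
        .where(amphora_health.c.amphora_id == sa.bindparam("amp_id"))
        .values(last_update=sa.bindparam("ts"))
    )
    with engine.begin() as conn:
        conn.execute(stmt, batch)        # batch = list of {"amp_id": ..., "ts": ...}

def batch_updater(max_batch=200, max_wait=0.5):
    # Collect heartbeats until the batch is full or max_wait elapses, then flush.
    batch, deadline = [], time.monotonic() + max_wait
    while True:
        try:
            amp_id = heartbeats.get(timeout=max(deadline - time.monotonic(), 0))
            batch.append({"amp_id": amp_id, "ts": datetime.utcnow()})
        except queue.Empty:
            pass
        if len(batch) >= max_batch or time.monotonic() >= deadline:
            if batch:
                flush(batch)
                batch = []
            deadline = time.monotonic() + max_wait

The max_batch/max_wait values only bound the extra latency and the transaction size; whether batching actually helps depends on how much of the slow processing time is spent in DB round trips versus elsewhere.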

We can try out different heartbeat intervals in our scale environment to find the correct tuning.

Comment 1 Michael Johnson 2021-05-25 19:14:58 UTC
I can see that the increase in the health_update_threads has greatly reduced the failover rates, so that tuning was successful.

However, we are still seeing a fair number of warnings about the processing taking longer than ten seconds.

The vast majority of the heartbeats are successfully being processed at the expected rate:

2021-05-25 17:58:31.381 87 DEBUG octavia.controller.healthmanager.health_drivers.update_db [-] Health Update finished in: 0.053336770040914416 seconds update_health /usr/lib/python3.6/site-packages/octavia/controller/healthmanager/health_drivers/update_db.py:72

The first thing I will note here is that the actual database update calls come after this processing interval, so they are not the cause of the processing delay seen for some heartbeat packets. This interval is mostly python pool scheduling and a database select call that is highly tuned.

Currently the select call is performing as expected, taking 0.00420631 seconds in the database itself. This is going to be a very fast query, as all of the needed information should be in the mysql cache and answered from RAM. This is even though the mysql instance deployed here is using the default buffer pool sizing of 128MB (innodb_buffer_pool_chunk_size 134217728, innodb_buffer_pool_instances 1).

I will also note that we do not have the "thundering herd" problem as there is built in jitter and randomization of the controller order each amphora will use to send a heartbeat packet.

Of course changing the update interval would reduce the load, but that also increases the potential downtime for non-HA load balancers.

I don't think we are at a load that would indicate we need to change the update interval yet. I think we have other tuning knobs to adjust that should allow you to continue scaling before needing to extend the update interval or increase the number of Octavia controllers.

Comment 4 anil venkata 2021-05-26 13:06:13 UTC
Thanks Michael.

File http://perf1.perf.lab.eng.bos.redhat.com/pub/asyedham/BZ-1964361/lb_not_del_id.txt captured active load balancers (load_balancers table) from DB.
File http://perf1.perf.lab.eng.bos.redhat.com/pub/asyedham/BZ-1964361/amphora.txt captured amphorae table content from DB.
From both files, we want to identify how many times each LB has failed over, i.e. how many amphora entries each LB has. This file http://perf1.perf.lab.eng.bos.redhat.com/pub/asyedham/BZ-1964361/lb_amphora_count.txt has that info. We can see that each load balancer has failed over multiple times.

Comment 5 Michael Johnson 2021-05-26 20:01:21 UTC
In an hour-long snapshot today, I saw fourteen load balancers fail over. This is an improvement over having the lower thread counts.

Obviously I want to update the settings to eliminate the slow processing warnings and reduce/eliminate the failover counts. Some failovers may be related to underlying cloud issues and not be related to the slow processing issue.

Comment 6 Michael Johnson 2021-05-26 23:45:38 UTC
I spent some time on controller 2 today.

I was correct that the health worker count is the issue you are hitting. We just missed the magic number by four...

Adjusting [health_manager] health_update_threads to 28 reduced the "processing too slowly" warnings for controller 2. This also seemed to stop the few failovers you were still seeing.

I also tried adjusting the database pool settings, but as I expected it had no effect as the database is not the limit here.

Given the hardware you have for the controllers and your goal to maximize the number of load balancers on the 45 hosts, I would set this to 50 for all controllers.

I have gone ahead and made that change on all three controllers directly in the configuration files (I know you are on a time crunch to wrap this up). If you redeploy, please use the same procedure you used above to set it to 50.
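
(For reference, the change amounts to the following in each controller's octavia.conf; the exact file location and restart procedure depend on how your deployment manages configuration, so treat this excerpt as illustrative and apply it via your usual procedure:)

[health_manager]
health_update_threads = 50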

To summarize the current state of the lab:
45 compute hosts with varying hardware (64GB of RAM and up)
3 controllers (controller 2 is 83% idle CPU, 83GB RAM used, zero swap used, 28GB disk used with 49 service containers on it (including the Octavia service containers))
7,000 healthy load balancers

So adjusting your health workers up to 50 (just short of your core count) should let you continue creating load balancers assuming you have compute capacity for them (i.e. nova is not tuned here, so it is leaving idle capacity on the compute hosts).

Once 50 workers is no longer enough on this hardware, I would recommend raising the heartbeat interval from ten seconds to a higher value. This will allow you to continue to scale the number of load balancers this deployment can support, trading off recovery time from nova issues.
The alternative would be to add additional controller hosts or potentially additional health manager containers on these controllers.
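
(For reference, the interval lives under [health_manager] in octavia.conf alongside the failover timeout; the values below are purely illustrative assumptions, not a recommendation from this bug. If you do raise the interval, keep heartbeat_timeout a comfortable multiple of it:)

[health_manager]
heartbeat_interval = 20
heartbeat_timeout = 120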

Since the containers are running with read only filesystems, I wasn't able to add the debugging I had hoped to narrow this down more. I just didn't have time to work around that issue.

Comment 7 Michael Johnson 2021-05-26 23:55:52 UTC
The deployment here is configured for standalone load balancers, so the number of active amphora is equal to the number of load balancers.
Should a load balancer need to failover, the existing amphora instance will be replaced with a new amphora.

Currently no load balancers are failing over in your environment, at least over the last hour.

The best way to track failovers is to watch for them in the log files. The health managers will also report a summary on shutdown.

Comment 10 anil venkata 2021-06-08 04:09:33 UTC
2021-05-27 12:24:25.121 88 WARNING octavia.controller.healthmanager.health_drivers.update_db [-] Amphora b8f76bb4-07a9-429c-8a99-b469fe73c3aa health message was processed too slowly: 10.00056791305542s! The system may be overloaded or otherwise malfunctioning. This heartbeat has been ignored and no update was made to the amphora health entry. THIS IS NOT GOOD.

We still see many of the above messages in health-manager.log (from http://perf1.perf.lab.eng.bos.redhat.com/pub/asyedham/BZ-1964361/20210527-8k-lb2-scale-run/ which Asma shared in comment 8).

Comment 11 Michael Johnson 2021-06-08 14:05:01 UTC
Those warning-level messages are not an indicator of a failover. They are a warning that some heartbeat messages are getting dropped due to slow processing inside the database or to the controller thread configuration.

Comment 12 anil venkata 2021-06-10 10:10:39 UTC
Thanks Michael.
We didn't see failovers with health_update_threads=50 in our latest logs. Each controller node here has 56 CPUs.
So your suggestion is to increase it in multiples of 2 from the existing count (i.e. 12, 24, 56), with the maximum being the number of CPUs (56 in this case), until we no longer see failover messages like:

2021-05-10 08:08:44.727 78 INFO octavia.controller.worker.v1.flows.amphora_flows [-] Performing failover for amphora: {'id': 'c9b2cd78-4d58-4370-90a9-d05e9a4fb7a4', 'load_balancer_id': '1d232822-910c-485e-b542-339b498333e9', 'lb_network_ip': '172.24.10.133', 'compute_id': '83aa48e5-84f4-48f3-b741-341796ca2a17', 'role': 'standalone'}

Comment 13 Michael Johnson 2021-06-11 15:53:30 UTC
I wrote up a summary in the test summary doc, but the TLDR is: if you are running a deployment with a large number of load balancers, you should increase the health workers to a number close to the number of available cores on the controller hosts. In the case of your 56-core hosts, 50 is a reasonable number to select, leaving some cores available for other host services. Tripleo, by default, will max out at a setting of 12. This may need to be adjusted up if you are seeing "The system may be overloaded or otherwise malfunctioning." WARNING messages in the logs.

Comment 14 Michael Johnson 2021-06-23 15:00:24 UTC
Closing this issue as we have provided the required configuration tuning guidance to resolve this issue at the scale of the test run.

Comment 15 anil venkata 2021-08-19 08:39:23 UTC
Tuning recommendation for health_update_threads to avoid amphora failover (below is Michael's comment on our internal doc):

As you scale the number of load balancers running in the deployment, you may need to increase the number of health update workers (health_update_threads) each controller has available.

In this test run, with 7,000 load balancers and three controllers, we needed to increase the number of health workers to 28. Tripleo will default to a maximum of 12 health workers if the controller has twelve or more CPU cores available.

The proper number of health workers for a deployment depends on the performance of the controller hosts, the database performance, and the number of amphora running in the cloud.
The health workers will start logging a warning message, “health message was processed too slowly” when it is time to increase the number of health workers, address performance issues within the database, or increase the number of controller nodes.

A rough minimum formula would look like:

health workers = ((amphorae / (heartbeat timeout / heartbeat interval)) / number of controllers) * processing seconds

For this test run we saw:

28 = ((7000 / (60 / 10)) / 3) * 0.072

The processing time per heartbeat can be seen by enabling debug logging in the health manager and observing the “Health Update finished in:” messages.
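
For convenience, the same rough formula expressed as a small Python helper (a sketch only, not an official Octavia tool; the function name and rounding are illustrative):

def recommended_health_workers(amphorae, heartbeat_timeout, heartbeat_interval,
                               controllers, processing_seconds):
    # Rough minimum health_update_threads needed per controller.
    return (amphorae / (heartbeat_timeout / heartbeat_interval)
            / controllers) * processing_seconds

# Values from this test run: 7,000 amphorae, 60 s heartbeat timeout, 10 s
# heartbeat interval, 3 controllers, ~0.072 s per heartbeat update -> ~28.
print(round(recommended_health_workers(7000, 60, 10, 3, 0.072)))  # 28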

For deployments with a large number of load balancers, this can be simplified by setting the health workers to a value close to the number of CPU cores available on the controller, as available controller memory allows. In this test environment, we set the health workers to 50 because the controller hosts had 56 cores available.

