Bug 1780380 - Single instance of nova-scheduler under-performing and failing completely when spawning 90+ instances
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Priority: urgent
Severity: urgent
Assignee: melanie witt
QA Contact: OSP DFG:Compute
 
Reported: 2019-12-05 20:14 UTC by David Vallee Delisle
Modified: 2023-09-07 21:13 UTC
CC: 11 users

Last Closed: 2020-02-28 18:21:22 UTC




Links:
Red Hat Issue Tracker OSP-28300 (last updated 2023-09-07)
Red Hat Knowledge Base (Solution) 3912531 (last updated 2020-03-04)

Description David Vallee Delisle 2019-12-05 20:14:03 UTC
Description of problem:
Customer is running a singleton nova-scheduler because of a known race condition [1]. One of their VNFms is trying to create ~95 instances all at once, and it fails every time. This environment has more than 210 compute nodes, which doesn't seem to help. The VNFm is unable to throttle or batch the requests; they have to be sent all at once.

We tried a few tweaks (consolidated into a nova.conf sketch below), including:
- To slow down the spawning and building process:
  - rabbit_qos_prefetch_count = 5
  - rpc_conn_pool_size = 15
- To reduce the load on the singleton, single-threaded nova-scheduler:
  - scheduler_driver_task_period = 600
  - scheduler_instance_sync_interval = 300
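
For reference, a minimal sketch of those tweaks as nova.conf entries. The section placement is my assumption for the Newton-era option names and should be verified against the deployed oslo.messaging version:

~~~
[DEFAULT]
# Assumed sections for Newton (OSP 10); verify before applying.
rpc_conn_pool_size = 15
scheduler_driver_task_period = 600
scheduler_instance_sync_interval = 300

[oslo_messaging_rabbit]
rabbit_qos_prefetch_count = 5
~~~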

It was also recommended to increase rpc_response_timeout, but the customer is refusing for multiple valid reasons, which will be added in a private comment after this BZ is created.

We thought of turning off scheduler_tracks_instance_changes, but since this VNFm makes heavy use of anti-affinity, we didn't deem it reasonable to do so.

We reproduced the issue while gathering metrics every 20 seconds on all 3 controllers, including MySQL "show global status" and "show full processlist" output. We don't see any long-running queries.

Here's some data we were able to pull out:
- We see that the time elapsed to complete a query is snowballing [2].
- When scheduling takes more than 60 seconds, nova-conductor re-schedules the instance even though the first scheduling attempt hasn't completed, so we're scheduling uselessly and it's not helping.
- At some point during the deployment, we start seeing nova-scheduler losing connections to both mysql and rabbit [3]. The client IP is that of the controller running nova-scheduler, and we see this on all 3 controllers' rabbit servers (even its own).


I did a write-up of the timeline of events seen in nova-scheduler during one of our first tries, before tweaking any settings:
- 18:26:32.059 conductor sends request to sched
- 18:26:32.286 sched receives it (line 168975)
- 18:26:54.681 sched is starting to update various hosts (line 179803, more than 10 000 lines later)
- 18:26:55.300 sched is pausing host update for this request, another request is treated (line 180174)
- 18:27:03.569 host update is resumed for one host only (line 184208)
- 18:27:09.222 host update is resumed for one host only (line 186551)
- 18:27:19.646 host update is resumed for the last host and filtering starts here (line 191570)
- 18:27:20.005 filtering is completed and host is selected (line 191764)
- 18:27:32.266 sched receives a new message from conductor to schedule the same instance with the same request id (line 197461)
- 18:27:58.398 sched replies to initial request after 86 seconds.

I wrote a little script [4] to calculate the timings of all the requests. We see [5] that some requests took more than 20 minutes to schedule, and there are also AMQP errors associated with those schedulings.
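
For illustration, here is a minimal, hypothetical sketch of the same idea: pairing per-request log lines by request ID and printing the stage-to-stage deltas. The actual script is at [4]; the log phrases matched below are assumptions about the scheduler's debug output, not the real script's patterns:

~~~
#!/usr/bin/env python
# Hypothetical sketch of a nova-scheduler log timing parser, in the spirit
# of parseNovaSched.py [4]. The log phrases matched below are assumptions.
import re
import sys
from collections import defaultdict
from datetime import datetime

TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
REQ = r".*?(req-[0-9a-f-]+)"
# One (stage, pattern) pair per scheduling milestone we want to time.
STAGES = [
    ("received", re.compile(TS + REQ + r".*Starting with \d+ host")),
    ("selected", re.compile(TS + REQ + r".*Filtered")),
    ("replied", re.compile(TS + REQ + r".*time elapsed")),
]

def parse(path):
    """Map each request ID to the first timestamp seen for each stage."""
    events = defaultdict(dict)
    with open(path) as fh:
        for line in fh:
            for stage, pat in STAGES:
                m = pat.search(line)
                if m and stage not in events[m.group(2)]:
                    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                    events[m.group(2)][stage] = ts
    return events

def main():
    for path in sys.argv[1:]:
        for req, ev in sorted(parse(path).items()):
            if {"received", "selected", "replied"} <= set(ev):
                print("%s  start2sel=%s  sel2rep=%s  full=%s" % (
                    req,
                    ev["selected"] - ev["received"],
                    ev["replied"] - ev["selected"],
                    ev["replied"] - ev["received"]))

if __name__ == "__main__":
    main()
~~~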

Here's some stats (mm:ss):
Average time spent between message received and starting schedule: 01:10
Average time spent on full host update, filtering and selecting a host: 00:24
Average time spent between host selected and reply to nova-conductor: 01:30
Average full scheduling time: 03:04

We know the first schedule completes in 3 seconds. With 95 instances to create at that rate, even fully serialized scheduling should complete in less than 5 minutes (95 x 3 s = 285 s).

Nova Version: 14.1.0-26

How reproducible:
All the time

Steps to Reproduce:
1. Spawn 100 VMs with VNFm, each one with a volume attached and some networks

Comment 1 David Vallee Delisle 2019-12-05 20:14:45 UTC
[1] https://access.redhat.com/solutions/3912531
[2]
~~~
$ grep -oP "2019-12-04 21:.*time elapsed: \K[0-9\.]+" nova-scheduler.log
3.44689894095
3.16558479331
6.40297500417
9.82629629597
9.82664851658
10.0239099953
10.0282932576
16.4430216867
10.0669002105
20.0252350476
30.1180116236
20.1111399643
33.0991760902
43.1894487571
49.5891294423
63.7777555604
54.5302599091
54.533567043
54.6139408257
60.738026157
64.1181275602
26.565380061
78.0030478965
27.0078757908
102.923218425
103.131700858
95.7414350398
60.9927330278
60.9992507454
118.50942478
126.101711994
91.3457637243
132.461802224
119.398469888
105.35335133
88.3574065138
162.792949449
165.562504878
152.929885861
156.497918397
169.727027906
202.863166472
178.996908274
165.037972461
154.78269426
209.464632859
151.949924923
165.485986901
189.836764175
178.933592726
164.986896656
158.44151637
154.260849101
[...]
113.828529857
165.378699306
120.477167031
110.340138154
101.146832367
24.0570967384
27.0404482279
1372.0221854
1384.6043166
1385.06421934
3.44446522743
1376.46540126
1346.34468364
1389.50615874
1372.76815067
1389.50195823
6.62023449875
3.35733870603
1373.75957376
1332.45880223
1332.45942
1393.59287239
1362.02189677
2.92857376486
3.20036164671
3.03208646923
~~~

[3]
~~~
2019-12-04 21:21:57 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26032.213> (192.168.0.7:42382 -> 192.168.0.9:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:22:07 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7490.229> (192.168.0.7:42096 -> 192.168.0.7:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:22:35 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.32600.212> (192.168.0.7:46708 -> 192.168.0.9:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:22:35 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26083.213> (192.168.0.7:46718 -> 192.168.0.9:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:22:35 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4217.222> (192.168.0.7:34806 -> 192.168.0.8:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:22:35 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7538.229> (192.168.0.7:45334 -> 192.168.0.7:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:22:35 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7548.229> (192.168.0.7:45348 -> 192.168.0.7:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:22:44 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7577.229> (192.168.0.7:46482 -> 192.168.0.7:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:22:45 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4242.222> (192.168.0.7:35956 -> 192.168.0.8:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:22:45 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7585.229> (192.168.0.7:46502 -> 192.168.0.7:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:22:51 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26116.213> (192.168.0.7:48672 -> 192.168.0.9:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:22:51 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4246.222> (192.168.0.7:36752 -> 192.168.0.8:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:22:51 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4244.222> (192.168.0.7:36755 -> 192.168.0.8:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:22:51 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4245.222> (192.168.0.7:36760 -> 192.168.0.8:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:22:51 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7589.229> (192.168.0.7:47294 -> 192.168.0.7:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:23:02 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26137.213> (192.168.0.7:50044 -> 192.168.0.9:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:02 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26139.213> (192.168.0.7:50052 -> 192.168.0.9:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:02 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4277.222> (192.168.0.7:38130 -> 192.168.0.8:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:08 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4286.222> (192.168.0.7:38908 -> 192.168.0.8:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:23:09 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26142.213> (192.168.0.7:50963 -> 192.168.0.9:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:23:09 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4274.222> (192.168.0.7:39045 -> 192.168.0.8:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:23:09 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4288.222> (192.168.0.7:39048 -> 192.168.0.8:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:23:09 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4287.222> (192.168.0.7:39050 -> 192.168.0.8:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:23:09 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.3163.229> (192.168.0.7:49599 -> 192.168.0.7:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:23:09 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.3150.229> (192.168.0.7:49600 -> 192.168.0.7:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:23:09 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.3126.229> (192.168.0.7:49602 -> 192.168.0.7:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:23:33 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7662.229> (192.168.0.7:52712 -> 192.168.0.7:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:23:49 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26210.213> (192.168.0.7:55842 -> 192.168.0.9:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:49 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26201.213> (192.168.0.7:55848 -> 192.168.0.9:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:49 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26213.213> (192.168.0.7:55852 -> 192.168.0.9:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:49 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26208.213> (192.168.0.7:55857 -> 192.168.0.9:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:49 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26228.213> (192.168.0.7:55861 -> 192.168.0.9:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:49 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4350.222> (192.168.0.7:43928 -> 192.168.0.8:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:49 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4357.222> (192.168.0.7:43930 -> 192.168.0.8:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:49 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4354.222> (192.168.0.7:43934 -> 192.168.0.8:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:49 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7664.229> (192.168.0.7:54486 -> 192.168.0.7:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:23:56 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26227.213> (192.168.0.7:56634 -> 192.168.0.9:5672):||{handshake_timeout,frame_header}||||
2019-12-04 21:24:06 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26251.213> (192.168.0.7:57722 -> 192.168.0.9:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:24:06 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7728.229> (192.168.0.7:56352 -> 192.168.0.7:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:24:16 alprgagqnce-h-pe4dloc-003 ERROR closing AMQP connection <0.26284.213> (192.168.0.7:58956 -> 192.168.0.9:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:24:16 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4389.222> (192.168.0.7:47028 -> 192.168.0.8:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:24:16 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4405.222> (192.168.0.7:47036 -> 192.168.0.8:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:24:16 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7748.229> (192.168.0.7:57576 -> 192.168.0.7:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:24:16 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7733.229> (192.168.0.7:57578 -> 192.168.0.7:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:24:16 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7746.229> (192.168.0.7:57582 -> 192.168.0.7:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:24:16 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7745.229> (192.168.0.7:57586 -> 192.168.0.7:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:24:43 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4442.222> (192.168.0.7:50296 -> 192.168.0.8:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:24:43 alprgagqnce-h-pe4dloc-002 ERROR closing AMQP connection <0.4448.222> (192.168.0.7:50298 -> 192.168.0.8:5672):||{handshake_timeout,handshake}||||
2019-12-04 21:24:43 alprgagqnce-h-pe4dloc-001 ERROR closing AMQP connection <0.7780.229> (192.168.0.7:60806 -> 192.168.0.7:5672):||{handshake_timeout,frame_header}||||
~~~

[4] https://gitlab.cee.redhat.com/dvalleed/randoms/blob/master/bin/parseNovaSched.py
[5]
~~~
$ parseNovaSched.py *.log
request id                               Time      Rec2Start Start2Sel Sel2Rep   Full      # Sel     Errors   
req-67a059ab-5ca3-4605-9696-ceb67e700ac8 21:21:17  0:01:04   0:00:03   0:22:02   0:23:09   1         26       
req-8efcb648-04a3-488e-9ae9-47d722030d4f 21:21:17  0:00:54   0:02:26   0:19:49   0:23:09   2         25       
req-ba4b2b22-d541-4c01-9124-7a5bf9b5dd3f 21:21:30  0:01:12   0:00:03   0:21:52   0:23:07   1         25       
req-e01645bc-20f0-4e38-91a8-bb0e8073aecc 21:21:17  0:00:27   0:03:36   0:19:02   0:23:05   2         25       
req-a6e3c897-f8f3-45f6-a2d5-a30b20fd1aea 21:21:24  0:00:27   0:03:15   0:19:16   0:22:58   2         25       
req-f7554a98-7381-4f26-85ee-c01df2561ff4 21:21:30  0:00:34   0:00:03   0:22:19   0:22:56   1         25       
req-602d63ba-9617-48aa-9dec-6207ac561a8d 21:21:44  0:00:54   0:00:03   0:21:56   0:22:53   1         25       
req-37622e48-3ee8-4b96-aa2b-1362207af7be 21:21:30  0:00:27   0:00:03   0:22:22   0:22:52   1         25       
req-812e378c-d8ad-4f9f-bf12-9cbe29a88584 21:21:40  0:00:48   0:00:03   0:21:55   0:22:46   1         25       
req-160708ce-ee89-404c-a753-99dce95c9da0 21:22:24  0:02:06   0:00:03   0:20:09   0:22:18   1         25       
req-4cc352aa-b5a6-4944-85fc-7af339f6b488 21:22:41  0:01:39   0:02:27   0:02:03   0:06:09   2         0        
req-eec699a2-55fd-4456-93c2-5871f172af4a 21:33:53  0:00:42   0:03:05   0:02:17   0:06:04   3         0        
req-61da2352-e86e-41f1-b997-f92678bd43a0 21:36:33  0:01:11   0:02:40   0:01:58   0:05:49   3         0        
req-0ccedcce-0a9c-468c-b111-5a753ed876fa 21:34:03  0:00:32   0:02:38   0:02:21   0:05:31   2         1        
request id                               Time      Rec2Start Start2Sel Sel2Rep   Full      # Sel     Errors   
req-d552085a-7fe5-432a-a44b-80ff5d2e0a81 21:22:41  0:02:33   0:01:24   0:01:34   0:05:31   2         0        
req-2304a951-7ac3-411b-96aa-f28d9d0bce7f 21:35:43  0:00:50   0:02:24   0:02:16   0:05:30   3         0        
req-adc46ede-f25b-40bc-9930-b8ea98920bf5 21:30:57  0:01:01   0:01:35   0:02:47   0:05:23   2         0        
req-f3354823-efcc-4513-b9fc-dd3b5155f1dc 21:22:25  0:02:15   0:00:03   0:03:04   0:05:22   1         0        
req-6a0c0bda-003f-4b84-b152-b6d098206648 21:29:24  0:01:13   0:02:07   0:02:01   0:05:21   3         0        
req-df4b1699-0dca-4696-b5c4-9915e6d49441 21:36:24  0:01:04   0:02:25   0:01:51   0:05:20   3         0        
req-37102531-b023-4fde-87d3-2e8f5fe3b771 21:21:43  0:01:44   0:03:04   0:00:26   0:05:14   2         0        
req-e515ed5d-d403-4a31-b6eb-6cbf2f30536a 21:22:59  0:02:07   0:00:03   0:03:03   0:05:13   1         0        
req-b193dc8d-79b6-400c-af3b-86be3178acfa 21:22:35  0:02:25   0:00:03   0:02:44   0:05:12   1         0        
req-a7ed83c1-bf04-4399-a309-7ea893e57e31 21:35:53  0:00:51   0:01:36   0:02:43   0:05:10   2         0        
req-14848b59-0d38-443e-9e79-6eb554e2aaaf 21:27:00  0:00:58   0:01:33   0:02:27   0:04:58   2         0        
req-2be7ac0b-cc21-4c1e-bbc7-eedba4a73b60 21:37:17  0:01:03   0:02:17   0:01:38   0:04:58   2         0        
req-514e4999-4d30-4175-ba45-e405efa7010c 21:22:21  0:01:42   0:00:03   0:03:11   0:04:56   1         0        
req-b10701a0-68f1-4b2f-9c70-8ef2f39f1906 21:22:52  0:01:45   0:00:03   0:03:07   0:04:55   1         0        
req-59e342ce-529d-4409-b73b-a4952d79a67f 21:26:57  0:01:08   0:01:35   0:02:05   0:04:48   2         0        
request id                               Time      Rec2Start Start2Sel Sel2Rep   Full      # Sel     Errors   
req-c910fbfc-b930-42d0-8553-caac83480cfa 21:26:57  0:01:01   0:01:19   0:02:21   0:04:41   2         0        
req-7ace459d-cd65-41d8-9dd0-7bd914c31744 21:22:38  0:01:19   0:02:28   0:00:52   0:04:39   2         0        
req-e189c4f3-d8f2-4642-86ad-d25f7d63c4c2 21:26:08  0:00:53   0:01:14   0:02:22   0:04:29   2         0        
req-32320b87-e594-4175-ac99-9319faf3493d 21:22:59  0:01:58   0:00:03   0:02:27   0:04:28   1         0        
req-127a086e-18a4-4020-953d-dc485eb6bcc7 21:22:35  0:01:11   0:00:03   0:03:12   0:04:26   1         0        
req-7e98d5da-bb96-4a65-9fd5-f7942d8ac2d2 21:30:31  0:00:40   0:01:09   0:02:35   0:04:24   2         0        
req-fbbd3c67-57fa-4edb-9e19-14d6d3575e63 21:36:53  0:01:55   0:00:03   0:02:26   0:04:24   1         0        
req-1df84f66-da9e-4083-81a5-e1e4c16fb9c5 21:21:17  0:00:37   0:01:59   0:01:47   0:04:23   2         0        
req-52b6e117-7dc1-4bc0-b3a8-e143b5d5d59a 21:31:58  0:01:49   0:00:03   0:02:30   0:04:22   1         0        
req-7a496203-57c3-4074-90ae-5c0d8ebe447f 21:36:23  0:01:31   0:00:03   0:02:47   0:04:21   1         0        
req-1dc75d8d-4ea7-40a9-b175-9c3d8394c48c 21:37:17  0:00:46   0:01:14   0:02:20   0:04:20   2         0        
req-6ba12e34-1a0e-4328-83bc-3da0a5bbd119 21:37:17  0:01:03   0:01:20   0:01:57   0:04:20   2         0        
req-a4fd9ea4-5c88-44a5-9f8c-5c8b79d7c4f4 21:36:20  0:00:57   0:03:04   0:00:19   0:04:20   3         0        
req-896a6d12-cd8b-4dca-835f-fa3390d876c9 21:22:41  0:01:26   0:02:34   0:00:19   0:04:19   2         0        
req-f1783b75-1996-4b8f-abdc-410eab1e962c 21:38:03  0:01:07   0:01:21   0:01:51   0:04:19   2         0        
request id                               Time      Rec2Start Start2Sel Sel2Rep   Full      # Sel     Errors   
req-22862bee-229d-460b-9f57-e4032ed814ee 21:21:30  0:01:18   0:02:39   0:00:20   0:04:17   2         0        
req-44ce3df9-33cd-4c59-a070-ae77dba360f3 21:24:33  0:02:08   0:00:03   0:02:06   0:04:17   1         0        
req-e0ce1a23-ee04-440e-a93d-75142c0fb1a0 21:21:44  0:01:49   0:02:03   0:00:25   0:04:17   2         0        
req-edcc669a-6091-4737-a176-0edc6e81dbc3 21:22:24  0:01:53   0:00:03   0:02:21   0:04:17   1         0        
req-248a222f-29c2-49c8-b651-d4212d39c1df 21:40:01  0:01:02   0:01:54   0:01:19   0:04:15   3         0        
req-c0bd5314-f9d9-4573-9100-60da4974f05b 21:40:01  0:01:27   0:01:40   0:01:08   0:04:15   2         0        
req-38256f32-d712-4460-b5f1-5c72b3c91508 21:22:21  0:00:45   0:03:16   0:00:13   0:04:14   2         0        
req-388853ed-ebed-4fca-a896-f5dc4970275c 21:37:17  0:01:03   0:01:17   0:01:54   0:04:14   2         0        
req-91f7b3fa-70a9-486d-bd57-79a212fc4525 21:24:33  0:00:37   0:01:08   0:02:28   0:04:13   2         0        
req-c568cd89-e447-473d-9737-4d58f6192288 21:23:46  0:01:34   0:00:03   0:02:35   0:04:12   1         0        
req-f960cba2-ec8a-497a-9ff2-5eee2f70ae9c 21:28:22  0:01:53   0:00:03   0:02:16   0:04:12   1         0        
req-7af9e4c0-eff9-4232-98ff-5c57762c3c04 21:29:00  0:00:54   0:01:17   0:01:59   0:04:10   2         0        
req-7fb3c38a-5951-4bde-94f5-fb1f99decd4a 21:36:53  0:01:20   0:00:03   0:02:47   0:04:10   1         0        
req-c5069cd8-fd81-43db-b647-1170dddb682d 21:33:53  0:00:39   0:01:08   0:02:23   0:04:10   2         0        
req-a38f2af2-67e4-409b-b624-b68e03081e9a 21:35:19  0:01:05   0:00:03   0:02:56   0:04:04   1         0        
request id                               Time      Rec2Start Start2Sel Sel2Rep   Full      # Sel     Errors   
req-35667b74-3025-45f5-a533-644c2edf84fa 21:21:30  0:01:15   0:02:45   0:00:03   0:04:03   2         0        
req-8711dde4-801c-47fb-9fd3-ec6347ce4bdf 21:21:44  0:01:52   0:00:03   0:02:08   0:04:03   1         0        
req-97d9d08c-dae0-4ead-ba7d-e889bb92680b 21:30:57  0:01:38   0:00:02   0:02:22   0:04:02   1         0        
req-a6b70973-9370-423a-a87e-aeffb00c42d4 21:35:23  0:01:07   0:02:11   0:00:42   0:04:00   2         0        
req-b181ed42-75bf-4084-900d-5a8255512fa8 21:36:44  0:01:26   0:01:54   0:00:40   0:04:00   2         0        
req-3c55bedc-32ea-43cf-839e-1c92ca32ce11 21:29:00  0:01:08   0:00:03   0:02:46   0:03:57   1         0        
req-24b8a710-a077-41da-bbbd-cb67029cb264 21:40:21  0:01:13   0:01:41   0:01:01   0:03:55   2         0        
req-e41d92d2-e3ca-4669-8c9a-01c3d239ace1 21:36:06  0:01:45   0:00:02   0:02:08   0:03:55   1         0        
req-572fcd54-2a34-4149-8545-c63aaa5eae40 21:27:10  0:01:47   0:00:03   0:02:04   0:03:54   1         0        
req-67c56707-4e9c-44de-abf1-b5133c1076b0 21:30:31  0:00:40   0:01:06   0:02:08   0:03:54   2         0        
req-c7717a75-0a85-47bc-9848-8973167d151b 21:31:59  0:01:21   0:00:03   0:02:30   0:03:54   1         0        
req-44ae7eb0-46be-4188-a6bb-93175bf0a01a 21:24:46  0:01:08   0:00:03   0:02:42   0:03:53   1         0        
req-3c2511ce-f8b1-4b7e-b43f-74e9484f0a2e 21:37:21  0:01:00   0:00:03   0:02:49   0:03:52   1         0        
req-0055d11d-9f12-47f1-995a-40ef0c5e39fc 21:32:34  0:01:03   0:00:03   0:02:44   0:03:50   1         0        
req-10a5c474-8381-43e6-9fc3-b33494a37c6f 21:26:41  0:01:07   0:00:03   0:02:40   0:03:50   1         0        
request id                               Time      Rec2Start Start2Sel Sel2Rep   Full      # Sel     Errors   
req-a5f9bf20-6993-4e30-83c4-f438d0f13685 21:21:43  0:01:29   0:00:04   0:02:17   0:03:50   1         0        
req-eb855964-d691-4eb5-ad35-c7fe50abd782 21:32:14  0:01:13   0:00:03   0:02:33   0:03:49   1         0        
req-47357bc2-829e-43db-a3eb-d82de51ef37e 21:36:16  0:01:25   0:00:03   0:02:17   0:03:45   1         0        
req-6fccc456-57a4-4dd0-a1e0-d262bb065bac 21:31:59  0:01:18   0:00:03   0:02:23   0:03:44   1         0        
req-6e668163-cbb6-48f5-bac9-040ec95bbb84 21:29:14  0:01:04   0:00:03   0:02:36   0:03:43   1         0        
req-d136a131-19ca-440b-8eb9-e1e43939c552 21:22:25  0:01:14   0:00:03   0:02:26   0:03:43   1         0        
req-7de12f9d-7d1c-4b32-9ee8-1ac2075276f5 21:26:08  0:01:13   0:00:02   0:02:27   0:03:42   1         0        
req-d065b076-fc1d-45e1-a0ec-4b7441c8f22b 21:34:55  0:01:01   0:00:03   0:02:38   0:03:42   1         0        
req-7064d7f9-847a-4800-812b-229d7f0e4918 21:27:17  0:01:50   0:00:03   0:01:47   0:03:40   1         0        
req-f9eec2bf-7cdf-4322-87c8-0065e8ae243f 21:33:24  0:01:15   0:00:03   0:02:21   0:03:39   1         0        
req-3092e07a-501c-466c-ab35-accafe0bce16 21:33:46  0:00:59   0:00:03   0:02:36   0:03:38   1         0        
req-32c4564a-c70d-4a98-8071-5a275152e931 21:31:45  0:01:22   0:00:03   0:02:13   0:03:38   1         0        
req-11212a27-e73c-4b95-a11e-a809f5a6da40 21:27:34  0:01:36   0:00:04   0:01:57   0:03:37   1         0        
req-0b33eee1-acb4-4c85-9bfd-1d0d35f9ac6b 21:26:57  0:01:01   0:00:03   0:02:30   0:03:34   1         0        
req-58bcbb41-4207-4deb-916a-6f8e49435c7a 21:29:17  0:01:17   0:00:03   0:02:13   0:03:33   1         0        
request id                               Time      Rec2Start Start2Sel Sel2Rep   Full      # Sel     Errors   
req-f0990e11-0fee-4eee-b2fb-7a4d118995cd 21:25:51  0:01:50   0:00:03   0:01:40   0:03:33   1         0        
req-bf32b190-d745-4ac2-9033-71451ac4f2fd 21:40:44  0:01:03   0:01:53   0:00:36   0:03:32   2         0        
req-f9fffe25-65fa-4ebb-ac08-1e6adeb9b453 21:26:18  0:01:13   0:00:03   0:02:16   0:03:32   1         0        
req-8d68c0c2-1704-4fa6-b518-bb97eaec8f70 21:25:47  0:01:10   0:00:03   0:02:17   0:03:30   1         0        
req-bc0c6993-a05c-44ab-83e1-bd8a493d0e13 21:29:40  0:00:57   0:01:04   0:01:29   0:03:30   2         0        
req-7b8ce591-826c-4b43-b527-a7f5315c3f04 21:25:10  0:00:55   0:00:03   0:02:31   0:03:29   1         0        
req-6ba7031d-2999-4d00-a73a-1efd09f262e5 21:31:28  0:01:19   0:00:03   0:02:05   0:03:27   1         0        
req-996bae41-14fc-4809-a2c4-56197d54dde7 21:32:57  0:01:07   0:00:03   0:02:17   0:03:27   1         0        
req-b611f7ec-60e4-4d76-bc1f-6af25349247c 21:29:40  0:01:07   0:00:03   0:02:17   0:03:27   1         0        
req-149a525a-1df6-4e8a-85dc-6735dded4516 21:25:51  0:01:13   0:00:03   0:02:10   0:03:26   1         0        
req-6bbf3509-7d81-4a1d-aa41-e1e873742f05 21:40:44  0:01:00   0:02:11   0:00:15   0:03:26   2         0        
req-7876f94f-55e4-44c5-8c57-777445cfef5f 21:21:44  0:01:39   0:00:03   0:01:44   0:03:26   1         0        
req-1ac9699d-ba88-4c05-a692-c19d6e39ba1e 21:35:19  0:01:01   0:00:03   0:02:21   0:03:25   1         0        
req-f38ab135-42f9-4348-b90b-04f1bdac1ac1 21:38:20  0:01:20   0:00:03   0:02:01   0:03:24   1         0        
req-e8557457-9280-4236-9634-fdb43b1fe2e0 21:36:30  0:00:55   0:02:24   0:00:04   0:03:23   3         0        
request id                               Time      Rec2Start Start2Sel Sel2Rep   Full      # Sel     Errors   
req-f19b2e97-eab5-46c5-af92-eb18fef515b5 21:25:23  0:01:09   0:00:03   0:02:11   0:03:23   1         0        
req-011c4f41-7e96-4e62-ac4c-b156705ae6fb 21:33:11  0:01:17   0:00:04   0:02:01   0:03:22   1         0        
req-22a4e66f-8c11-4a88-8ec7-bcbba123b453 21:31:38  0:01:22   0:00:03   0:01:56   0:03:21   1         0        
req-2ac2cb10-5c5a-4eaa-8a65-6852d01808d8 21:27:17  0:01:19   0:00:03   0:01:58   0:03:20   1         0        
req-ad6b7e1c-c29a-4576-a7e0-faf9c5187b90 21:36:23  0:00:58   0:00:03   0:02:19   0:03:20   1         0        
req-3b6989d6-0242-41bb-a9ff-23040f09f47c 21:38:44  0:01:24   0:00:03   0:01:52   0:03:19   1         0        
req-4aae38b2-5888-453e-ac95-148dbde8a044 21:26:35  0:00:59   0:00:03   0:02:17   0:03:19   1         0        
req-4ab2d5c7-0cb1-45b2-9e0b-6fe059f7567c 21:32:34  0:01:00   0:00:03   0:02:16   0:03:19   1         0        
req-6fe7045e-15be-447c-b883-e64fb6b8c5a4 21:33:07  0:01:11   0:00:03   0:02:03   0:03:17   1         0        
req-1968ba70-1ba2-4f33-99a7-cb6e25e75ccf 21:39:53  0:01:11   0:00:03   0:02:02   0:03:16   1         0        
req-4b27906a-8274-4c09-8a19-746656105b09 21:39:10  0:01:14   0:00:03   0:01:59   0:03:16   1         0        
req-5e640725-2911-426f-a43b-975bec8e8c41 21:28:29  0:00:56   0:00:02   0:02:18   0:03:16   1         0        
req-c06ba87e-98eb-475c-9caf-8b91d9d204e1 21:39:34  0:01:07   0:00:19   0:01:50   0:03:16   1         1        
req-d52abd90-9bbe-41b8-aaaf-c6f86704994d 21:39:10  0:01:21   0:00:03   0:01:52   0:03:16   1         0        
req-fa8e4402-0f8d-479d-ac5b-e1ecbbe41f47 21:39:53  0:01:14   0:00:03   0:01:59   0:03:16   1         0        
request id                               Time      Rec2Start Start2Sel Sel2Rep   Full      # Sel     Errors   
req-fb8ef572-bad7-4562-80f2-fcb35ff09986 21:26:08  0:01:16   0:00:03   0:01:57   0:03:16   1         0        
req-0a4ce880-2f65-43b5-a497-347600fb14df 21:40:01  0:01:23   0:00:03   0:01:48   0:03:14   1         0        
req-3f942a0c-b2eb-4ebe-9013-b50f8ee708ef 21:23:46  0:01:44   0:01:27   0:00:03   0:03:14   2         0        
req-610e4b6b-8861-444a-aa5d-2dea2048f545 21:29:00  0:00:58   0:00:03   0:02:13   0:03:14   1         0        
req-d501d132-9f68-4e34-bac9-0c5975e3b08d 21:31:11  0:01:13   0:00:04   0:01:57   0:03:14   1         0        
req-036b58f3-0fe6-43f6-952b-ca7c4239cc09 21:36:44  0:01:23   0:01:49   0:00:01   0:03:13   2         0        
req-2de38750-cdcc-4425-9f8c-166d756bd9fc 21:41:03  0:01:00   0:02:04   0:00:09   0:03:13   3         0        
req-e2cbb17a-6b10-4e4d-8ee3-6089ca3df7b4 21:41:03  0:00:57   0:01:06   0:01:10   0:03:13   2         0        
req-2732c426-6d6b-410a-a85f-f899d07c268b 21:29:57  0:01:18   0:00:03   0:01:49   0:03:10   1         0        
req-5b43e9d7-a5b7-45d7-959c-511a2e15bab6 21:33:24  0:01:18   0:00:03   0:01:48   0:03:09   1         0        
req-3772105f-4705-4ca9-b664-8ae1c2961298 21:40:41  0:00:32   0:00:59   0:01:37   0:03:08   2         0        
req-47ffa802-040b-491c-9a50-46d517875e1c 21:36:53  0:01:01   0:00:06   0:02:01   0:03:08   1         1        
req-e1eeee7d-19ca-4c0a-919f-d0ba17ac0196 21:27:58  0:01:02   0:00:03   0:02:01   0:03:06   1         0        
req-4a2c3c61-df70-43a5-bae1-9487f4a2b03b 21:28:29  0:01:05   0:00:03   0:01:51   0:02:59   1         0        
req-6b367d22-c548-44e8-994d-6e33a3578b33 21:21:07  0:00:20   0:02:19   0:00:20   0:02:59   2         0        
~~~

Comment 3 David Vallee Delisle 2019-12-05 20:31:10 UTC
When we look at the requests sorted by time they were received [1], we see that:
- the first one was scheduled in 3 seconds
- the second one was scheduled twice because of the anti-affinity filter and took 26s to complete
- the third one took 7s to start and 10s to fully complete, but only 3s to filter and select a host
- the fourth one took 20 seconds to start and almost 3 minutes to complete because of anti-affinity
- We also have a bunch of requests that weren't able to reply to conductor because their sockets to rabbit were torn down with handshake timeouts.
- req-9ebf17a6-7aed-4f96-9015-6de729547116 is pretty interesting because it took 43s to start (from the time it got the conductor message) and 1m35s to send the reply after the host was selected. Apparently, during this request (21:23:23.369), we hit a "(pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')", as well as AMQP disconnections.

[1]
~~~
request id                               Time      Rec2Start Start2Sel Sel2Rep   Full      # Sel     Errors   
req-a56d26f1-bda4-4af5-9f07-2d84123fb1d2 21:21:01  0:00:00   0:00:03   0:00:00   0:00:03   1         0        
req-c5eea2bb-0a26-4004-9a3e-f96a45aa3c29 21:21:04  0:00:00   0:00:23   0:00:03   0:00:26   2         0        
req-3610fbfb-2737-4fef-8e44-950e042228a7 21:21:07  0:00:07   0:00:03   0:00:00   0:00:10   1         0        
req-6b367d22-c548-44e8-994d-6e33a3578b33 21:21:07  0:00:20   0:02:19   0:00:20   0:02:59   2         0        
req-cf8a3945-774c-4b36-bb8a-cc64c5e6dea5 21:21:07  0:00:04   0:00:03   0:00:03   0:00:10   1         0        
req-f82fd0ff-a9a7-4077-a2da-65524cb9a6a6 21:21:07  0:00:01   0:00:03   0:00:03   0:00:07   1         0        
req-0381092b-5e90-45ec-ab52-b19a49f63dab 21:21:17  0:00:24   0:00:02   0:00:01   0:00:27   1         0        
req-19f51606-a084-41a3-a90e-7ca1c209829d 21:21:17  0:00:04   0:00:03   0:00:00   0:00:07   1         0        
req-1df84f66-da9e-4083-81a5-e1e4c16fb9c5 21:21:17  0:00:37   0:01:59   0:01:47   0:04:23   2         0        
req-249bea05-b81a-41bd-abda-6e9fb5328151 21:21:17  0:00:01   0:00:03   0:00:03   0:00:07   1         0        
req-67a059ab-5ca3-4605-9696-ceb67e700ac8 21:21:17  0:01:04   0:00:03   0:22:02   0:23:09   1         26       
req-8efcb648-04a3-488e-9ae9-47d722030d4f 21:21:17  0:00:54   0:02:26   0:19:49   0:23:09   2         25       
req-9ebf17a6-7aed-4f96-9015-6de729547116 21:21:17  0:00:43   0:00:04   0:01:35   0:02:22   1         1        
req-e01645bc-20f0-4e38-91a8-bb0e8073aecc 21:21:17  0:00:27   0:03:36   0:19:02   0:23:05   2         25 
~~~

Comment 4 melanie witt 2019-12-06 00:10:21 UTC
Thanks for providing all of the detailed information about the situation.

Based on the info I read here, the short answer as to why the scheduler is taking so long to schedule the instances is the high number of collisions occurring during scheduling of the parallel requests, causing re-schedules and, as you observed, a "snowball effect". I think it's likely that the rabbitmq connection is lost because of the lengthy time elapsed as requests take longer and longer to serve, and rabbitmq thinks these connections are "dead" and cleans them up. This is just a guess; I am not a rabbitmq expert.

To approach this problem, I think we need to focus on reducing the collisions that are happening. I also agree that increasing rpc_response_timeout is best avoided, for the reasons explained in comment 2.

What is the configured value for the following nova-scheduler nova.conf option [1]?

  [DEFAULT]/scheduler_host_subset_size

This setting controls the number of best-matching hosts from which one is randomly chosen for instance placement. It defaults to 1, which results in an instance placement pattern of "packing" or "stacking" on the compute hosts, with the goal of utilizing each compute host as completely as possible. This pattern is bad, however, in an environment where a large number of parallel requests are made that have anti-affinity policies with one another. The independent parallel requests don't know anything about each other while in-flight, and thus don't know how to avoid each other to fulfill the anti-affinity policy. Instead, there's a high chance they will all pick the same "first" host and try to build there, then collide once they arrive. The "losing" request then has to be re-scheduled because the anti-affinity policy was violated, and each subsequent request takes longer and continues to collide in a snowballing cascade.

Parallel anti-affinity requests are a use case where a pattern of "spreading" instances across the compute hosts should work better and avoid collisions. This can be achieved by increasing the value of the [DEFAULT]/scheduler_host_subset_size setting. I might suggest starting with a value of 50 given the batch is around 100, but if all 100 requests need to go to different hosts I would try higher, at 100.
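
To make the collision dynamic concrete, here is a toy simulation (my own sketch, not nova code, and a gross simplification of the real scheduler). It assumes every in-flight request independently picks one host at random from the top subset_size candidates, and all but one request per host must re-schedule:

~~~
#!/usr/bin/env python
# Toy model (not nova code) of N parallel anti-affinity requests.
import random

def simulate(requests, hosts, subset_size, seed=1):
    rng = random.Random(seed)
    used = set()      # hosts already holding a member of the group
    pending = requests
    attempts = rounds = 0
    while pending:
        rounds += 1
        candidates = [h for h in range(hosts) if h not in used]
        picks = {}    # host -> number of requests that picked it this round
        for _ in range(pending):
            attempts += 1
            host = rng.choice(candidates[:subset_size])
            picks[host] = picks.get(host, 0) + 1
        used.update(picks)          # one winner lands on each picked host
        pending = sum(n - 1 for n in picks.values())  # losers retry
    return rounds, attempts

for k in (1, 10, 50, 100):
    rounds, attempts = simulate(requests=95, hosts=210, subset_size=k)
    print("subset_size=%3d -> %3d rounds, %4d schedule attempts"
          % (k, rounds, attempts))
~~~

In this model, subset_size=1 forces all 95 requests to chase the same host, needing 95 rounds and thousands of attempts, while a larger subset spreads them out and settles in a handful of rounds; the real scheduler is more complicated, but the qualitative effect is the same.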

Another nova-scheduler nova.conf option that could help is:

  [DEFAULT]/scheduler_max_attempts

This defaults to 3 and can be increased if a large number of collisions are being encountered, to avoid instances falling into the ERROR state with MaxRetriesExceeded. scheduler_host_subset_size is the more important of the two, but if there are residual issues with requests not finding a compute host within the allowed number of retries, it would be a good idea to increase the maximum number of retries allowed.

I would try adjusting scheduler_host_subset_size and possibly scheduler_max_attempts first and see if it helps alleviate the scheduling issues.
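
As a concrete sketch, the two options might be set together in the scheduler's nova.conf like this (50 follows the suggestion above; 10 retries is an illustrative value, not a tested recommendation):

~~~
[DEFAULT]
# Pick randomly among the 50 best-matching hosts instead of always the top
# one, spreading the parallel anti-affinity requests.
scheduler_host_subset_size = 50
# Allow more re-schedules before instances ERROR out with MaxRetriesExceeded
# (10 is an illustrative value).
scheduler_max_attempts = 10
~~~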

[1] https://docs.openstack.org/newton/config-reference/compute/schedulers.html#configuration-options

Comment 11 melanie witt 2020-02-28 18:21:22 UTC
Closing this per our understanding that there is not a defect in nova related to this bug report. Please re-open if you find there are still problems related to this that need investigation.

