Bug 1667257 - [rhos-prio] When creating ~600 security group rules using heat, with more than 1 heat-engine thread, neutron server is getting hammered and it eventually timeouts
Summary: [rhos-prio] When creating ~600 security group rules using heat, with more tha...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: async
: 10.0 (Newton)
Assignee: Zane Bitter
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On: 1665239 1776950
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-01-17 21:04 UTC by David Vallee Delisle
Modified: 2023-10-06 18:08 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1665239
Environment:
Last Closed: 2019-05-30 16:01:00 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Script to fix up template (323 bytes, text/plain)
2019-01-18 14:24 UTC, Thomas Hervé
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-29344 0 None None None 2023-10-06 18:08:14 UTC

Description David Vallee Delisle 2019-01-17 21:04:43 UTC
+++ This bug was initially created as a clone of Bug #1665239 +++

Customer is trying to create a lot of security group rules and at some point, neutron fails with "StaleDataError: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched."

We also see a buildup in the time it takes neutron server to reply to POST /v2.0/security-group-rules.json, the first ones take less than a second to complete, up to more than 200s and this is when we start seeing 500s instead of 201s.

In my lab, with 3 workers running (one on each controllers), I saw up to 12k requests on /v2.0/extensions.json per minute. On my customer's lab running with 144 workers (across 3 controllers), they see up to 166k requests on that endpoint, every minute [2]. We suspect that this is either a symptom, or cause, but that it's related to the issue.

When disable heat-engine on 2 out of 3 controllers, setting the num_engine_worker to 1 on the remaining controller, incrementing the stack_action_timeout and token expiration timeout, we are able to complete the creation of a 600 security group rule stack. We see that the POSTs to the neutron api are taking an average of 0-2s to complete, and never more than 10s. 

Side issue, when creating a security group rule stack that failed, and when we try to update it later, heat is failing when trying to create "duplicate rules". We would expect that heat would take an inventory of what's currently created and don't try to create them again, or would simply not fail when a dupe is found.

[1]
~~~
2019-01-15 19:47:19.580 988415 INFO neutron.wsgi [req-a08cf5a3-f78c-4db7-a695-12fa6c633178 <SNIP> "POST /v2.0/security-groups.json HTTP/1.1" 201 1585 0.177781
2019-01-15 19:47:19.815 988415 INFO neutron.wsgi [req-ffcf0c33-92f7-433f-956f-83447192e780 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 692 0.209263
2019-01-15 19:47:20.317 988415 INFO neutron.wsgi [req-9ba6afe2-da48-4383-83d7-bfb234c319e5 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 691 0.204312
2019-01-15 19:51:32.691 988415 INFO neutron.wsgi [req-7803947e-50ad-44c9-b11b-117bf7c6f85f <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 16.791251
2019-01-15 19:51:36.343 988415 INFO neutron.wsgi [req-b9d4092a-26c6-4480-a00d-d3f0ec4ac9c8 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 14.068940
2019-01-15 19:51:36.698 988415 INFO neutron.wsgi [req-7afc5146-c460-40dc-a087-d1f276271427 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 1.517953
2019-01-15 19:51:49.328 988415 INFO neutron.wsgi [req-d4de311b-71d4-43c6-bd6c-0361aba957ae <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 706 9.610927
2019-01-15 19:51:50.892 988415 INFO neutron.wsgi [req-25eb98b2-f7aa-45f0-bf9c-bcb754dac184 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 702 14.876030
2019-01-15 19:51:58.667 988415 INFO neutron.wsgi [req-d538e36f-db02-4f2c-b92c-2d522804aba5 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 14.241387
2019-01-15 19:52:01.714 988415 INFO neutron.wsgi [req-0db87c71-4d02-42d1-9921-e31c5ff1a05f <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 702 14.407799
2019-01-15 19:52:10.841 988415 INFO neutron.wsgi [req-da9f14f9-9c45-42a4-8ac1-3330440648f2 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 26.159587
2019-01-15 19:52:16.303 988415 INFO neutron.wsgi [req-96e571bf-6b34-425b-8f7e-5e036eb1f820 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 703 15.903058
2019-01-15 19:52:16.654 988415 INFO neutron.wsgi [req-e6a94f99-52cc-4209-9407-4c79f6ded4f4 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 706 13.617583
2019-01-15 19:52:29.603 988415 INFO neutron.wsgi [req-767ff22e-1501-429f-8456-f79136741563 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 701 16.726857
2019-01-15 19:52:36.672 988415 INFO neutron.wsgi [req-277c428a-c2b4-4306-9f8a-93ec6c330f2d <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 706 26.011657
2019-01-15 19:52:39.039 988415 INFO neutron.wsgi [req-d046d305-a858-4197-9aff-d1aa92cc5013 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 705 18.620860
2019-01-15 19:52:39.129 988415 INFO neutron.wsgi [req-1387dd6d-4b7d-4ea1-a2d3-a8025691339d <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 703 21.609913
2019-01-15 19:52:45.177 988415 INFO neutron.wsgi [req-e423bac0-d3a3-4bd5-80a7-26ecc5acf992 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 20.025870
2019-01-15 19:52:46.028 988415 INFO neutron.wsgi [req-1627776b-69cc-450b-af0f-6a1cd46b3156 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 703 22.658502
2019-01-15 19:52:51.165 988415 INFO neutron.wsgi [req-2302153e-7e74-453a-92a5-977d5fe8241c <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 705 22.737595
2019-01-15 19:52:51.180 988415 INFO neutron.wsgi [req-39c624e5-7561-4991-9b52-b1158c35e36d <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 18.844666
2019-01-15 19:52:54.547 988415 INFO neutron.wsgi [req-cfac5a7e-2f77-41af-9b4e-2a9eaf5834ca <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 703 24.258442
2019-01-15 19:52:54.662 988415 INFO neutron.wsgi [req-5f3307a8-c245-4072-8d37-8515602819c9 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 706 15.109443
2019-01-15 19:52:59.831 988415 INFO neutron.wsgi [req-b0a3583f-87bc-4fd8-8b50-2b5f85733b82 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 705 15.903552
2019-01-15 19:53:06.112 988415 INFO neutron.wsgi [req-434d319b-d064-4867-b649-7ce8476472d7 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 703 35.665496
2019-01-15 19:53:06.530 988415 INFO neutron.wsgi [req-b5698a80-8bd1-40d0-aa65-9aa7b036d60d <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 17.874464
2019-01-15 19:53:06.886 988415 INFO neutron.wsgi [req-d1b19d79-3162-4c22-879b-e9f6a0d0be48 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 14.455133
2019-01-15 19:53:15.236 988415 INFO neutron.wsgi [req-b58bd74a-6b79-4509-b248-c0bfd4495851 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 16.102678
2019-01-15 19:53:18.277 988415 INFO neutron.wsgi [req-e4cc92f1-1a9d-48e2-885c-635a6b229150 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 701 45.369257
2019-01-15 19:53:20.133 988415 INFO neutron.wsgi [req-0050eec3-436e-402f-9329-7c6e17646f05 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 706 14.844351
2019-01-15 19:53:20.989 988415 INFO neutron.wsgi [req-27b1329d-be50-4d50-9388-1a3e0625ba40 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 705 44.393658
2019-01-15 19:53:27.791 988415 INFO neutron.wsgi [req-8e27c892-1cff-4612-b049-5bd7a7f89d28 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 705 21.656269
2019-01-15 19:53:32.453 988415 INFO neutron.wsgi [req-650d4517-e9d1-41a2-b6ad-5608f9c700c0 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 20.101678
2019-01-15 19:53:34.202 988415 INFO neutron.wsgi [req-6c6ba3fb-ce0c-4b97-999c-5c830821d6ec <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 15.787255
2019-01-15 19:53:42.130 988415 INFO neutron.wsgi [req-b671cc75-6ea5-445b-bbd6-5ada6fea7be3 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 707 54.972306
2019-01-15 19:53:45.845 988415 INFO neutron.wsgi [req-e08068d7-db37-4400-b306-74f0f1172125 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 703 23.388320
2019-01-15 19:53:58.979 988415 INFO neutron.wsgi [req-086f5e3d-1b75-4fd5-a06c-fc14471fcb2c <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 30.182090
2019-01-15 19:54:01.465 988415 INFO neutron.wsgi [req-d38479ce-841d-48be-afe2-d5a2ffa3c606 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 705 54.739885
2019-01-15 19:54:10.876 988415 INFO neutron.wsgi [req-30837d47-30ec-48d7-af7a-a395406c6445 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 705 39.601803
2019-01-15 19:54:25.617 988415 INFO neutron.wsgi [req-ecf947f2-398b-490f-a5d0-cc57215cb78e <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 702 44.332826
2019-01-15 19:54:40.077 988415 INFO neutron.wsgi [req-3921fd53-45a1-441d-ade4-dd608dc151d3 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 703 47.553737
2019-01-15 19:54:52.544 988415 INFO neutron.wsgi [req-7c719f4f-72dd-40ec-bc77-ddc839183408 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 500 344 154.134486
2019-01-15 19:54:53.516 988415 INFO neutron.wsgi [req-e9707c8b-d868-4c88-89ed-6930b62958cc <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 706 28.462749
2019-01-15 19:55:14.696 988415 INFO neutron.wsgi [req-a5c3b87b-bec0-411c-a32b-d91f63daa7ae <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 118.773054
2019-01-15 19:55:30.893 988415 INFO neutron.wsgi [req-d3a9210a-510a-4d5b-b795-182d4800ab32 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 703 102.303439
2019-01-15 19:55:47.641 988415 INFO neutron.wsgi [req-b5bdf2f7-2521-40af-8618-f4b11c892d6a <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 705 106.655262
2019-01-15 19:55:56.359 988415 INFO neutron.wsgi [req-2530ec41-7ebb-465e-9f29-5d33728c1095 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 500 344 188.876560
2019-01-15 19:56:02.326 988415 INFO neutron.wsgi [req-7c456e85-9c7d-4098-aab3-f41196a6b136 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 704 107.206438
2019-01-15 19:56:16.989 988415 INFO neutron.wsgi [req-80a1b4e1-8f5e-4a43-8a8f-d8f323169fc1 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 702 143.618252
2019-01-15 19:56:18.348 988415 INFO neutron.wsgi [req-bb2b5c17-7e00-4a2f-8319-59054a91cb65 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 500 344 199.180581
2019-01-15 19:56:28.786 988415 INFO neutron.wsgi [req-bb911917-cde4-4dbc-b84c-dc7f0d15d990 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 703 183.594454
2019-01-15 19:56:34.992 988415 INFO neutron.wsgi [req-700ea889-3c41-427e-894e-c5ead319dda3 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 500 344 207.171425
2019-01-15 19:56:41.027 988415 INFO neutron.wsgi [req-ddb79694-a729-4c5e-872f-6d41fc69bfa4 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 703 193.279332
2019-01-15 19:56:47.757 988415 INFO neutron.wsgi [req-330def83-1180-4a03-af6e-61c6666e314b <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 500 344 215.920990
2019-01-15 19:56:49.348 988415 INFO neutron.wsgi [req-472aa58f-d6af-44c5-bc9f-c0e870b252c8 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 500 344 213.843158
2019-01-15 19:56:50.166 988415 INFO neutron.wsgi [req-330597d3-1e03-4718-acc7-7816edab26a5 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 201 703 137.999861
2019-01-15 19:56:50.194 988415 INFO neutron.wsgi [req-76d9497e-64be-46f8-9817-e372be743f9b <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 500 344 211.362064
2019-01-15 19:56:54.332 988415 INFO neutron.wsgi [req-a65f78ed-0043-458e-b914-1cf809416190 <SNIP> "POST /v2.0/security-group-rules.json HTTP/1.1" 500 344 215.126256
~~~

[2]
~~~
  37167 2019-01-07 20:22
 166220 2019-01-07 20:23
 107487 2019-01-07 20:24
    518 2019-01-07 20:25
   7577 2019-01-07 20:26
   3884 2019-01-07 20:27
    372 2019-01-07 20:28
    471 2019-01-07 20:32
   1067 2019-01-07 20:33
 147207 2019-01-07 20:34
 158324 2019-01-07 20:35
   1747 2019-01-07 20:36
    817 2019-01-07 20:37
     68 2019-01-07 20:38
   1283 2019-01-07 20:39
~~~

Comment 2 David Vallee Delisle 2019-01-17 21:10:33 UTC
Setting the number of worker to 1 is not an appropriate workaround for the customer. It took 1h41m to create ~600 rules with 1 worker, customer has 4000 rules to create. It would take more than 7h. They need to create this stack on multiple deployments. Blocking other tenants from working with heat for 7h, every time they need to create or update that stack is not acceptable.

Comment 4 Zane Bitter 2019-01-17 21:53:34 UTC
There is a patch in Ocata that would serve to slow down the creation a bit by adding a 1s delay before starting each resource that is a root of the dependency graph (i.e. doesn't depend on anything): https://review.openstack.org/445355
However, with 4000 rules to create and even the initial ones taking 1-2s, this would probably be insufficient anyway.

The best way to solve this, especially in OSP 10, would be in the template itself. Hopefully with 4000 rules the template is being autogenerated and not constructed by hand? Start by picking a batch size, n. For the first n resources, you don't need to do anything. For the next n, add a depends_on in each to make it require (a different) one of the first n. For the third n, make each depend on one from the second n, and so on. This will ensure that only n resources are being created in parallel. e.g. for n=10:

  a  b  c  d  e  f  g  h  i  j

  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^
  |  |  |  |  |  |  |  |  |  |  (depends_on)

  k  l  m  n  o  p  q  r  s  t

  ^  ^  ^  ^  ^  ^
  |  |  |  |  |  |              (depends_on)

  u  v  w  x  y  z

You can then tune the batch size to whatever works without overwhelming neutron.

Each worker can process more than one resource (it switches between greenthreads while waiting for I/O), so with 144 workers it's likely that it's trying to do substantially all of the resources at roughly the same time. I would start by generating a template with a batch sixe of say 20 and seeing how it goes from there.

Comment 10 Thomas Hervé 2019-01-18 14:24:50 UTC
Created attachment 1521555 [details]
Script to fix up template

Comment 22 Brian Haley 2019-02-11 14:56:50 UTC
Oops, I should have been updating the neutron bug, 1665239, will do that now.

Comment 23 Nate Johnston 2019-03-19 14:13:37 UTC
The neutron bug is waiting for QA now.  https://bugzilla.redhat.com/show_bug.cgi?id=1665239

Comment 26 Zane Bitter 2019-05-30 16:01:00 UTC
Fixed in neutron.


Note You need to log in before you can comment on or make changes to this bug.