Bug 1395689 - [ocp-on-osp] Stack deletion or scale down fails due to "concurrent db transaction"
Summary: [ocp-on-osp] Stack deletion or scale down fails due to "concurrent db transaction"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ga
Target Release: 12.0 (Pike)
Assignee: Zane Bitter
QA Contact: Ronnie Rasouli
URL:
Whiteboard: aos-scalability-34
Depends On:
Blocks:
 
Reported: 2016-11-16 13:00 UTC by Jan Provaznik
Modified: 2021-01-31 12:37 UTC (History)
15 users

Fixed In Version: openstack-heat-9.0.0-0.20170719132024.923d018.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 20:52:28 UTC
Target Upstream Version:




Links
- Launchpad bug 1642166 (last updated 2017-03-02 15:31:53 UTC)
- OpenStack gerrit change 321783, MERGED: "Corrected max secs for concurrent trans retries" (last updated 2021-02-09 21:18:09 UTC)
- Red Hat Product Errata RHEA-2017:3462, SHIPPED_LIVE: Red Hat OpenStack Platform 12.0 Enhancement Advisory (last updated 2018-02-16 01:43:25 UTC)

Description Jan Provaznik 2016-11-16 13:00:15 UTC
Description of problem:
After deploying a larger OpenShift deployment (e.g. 50 compute nodes),
deletion or scale-down of the deployment fails and the stack events contain:
2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource ConcurrentTransaction: Concurrent transaction for deployments of server 39bb0dea-d89a-4aac-ac46-3dca0ca7c2e3

This is a downstream tracker of an upstream heat bug:
https://bugs.launchpad.net/heat/+bug/1642166

Version-Release number of selected component (if applicable):
python-heatclient-1.4.0-0.20160831084943.fb7802e.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cloudwatch-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7ost.noarch
openstack-heat-common-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-7.0.0-0.20160907124808.21e49dc.el7ost.noarch


Steps to Reproduce:
1. deploy openshift-on-openstack with node_count=50
2. delete the stack

Actual results:

stack deletion fails

Expected results:
stack is deleted

Comment 2 Zane Bitter 2016-11-16 17:18:35 UTC
This is basically due to a thundering herd problem - deletes are generally pretty fast and so everything hits the database at once.

A patch I proposed a while back to help limit this was https://review.openstack.org/#/c/301323/. However, it's unlikely to go ahead, since it only applies to legacy non-convergence stacks (no longer the default configuration).

According to the upstream bug you already tested another patch I proposed, https://review.openstack.org/#/c/321783/ which adds a bit of jitter to DB retries, but it didn't help. Unfortunately it also fails in the upstream gate when convergence is enabled, so it can't be merged until that has been investigated and fixed anyway.

If the issue is contention between multiple software deployments on a single server, then one option is to serialise them by adding depends_on relationships between them. Note that this also applies to creation, so it may slow down creation time considerably.
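
For illustration, a minimal sketch of that serialisation (the names deploy_config, deploy_first, deploy_second and server_id are hypothetical, not taken from the openshift-on-openstack templates):

  deploy_first:
    type: OS::Heat::SoftwareDeployment
    properties:
      config: {get_resource: deploy_config}
      server: {get_param: server_id}

  deploy_second:
    type: OS::Heat::SoftwareDeployment
    properties:
      config: {get_resource: deploy_config}
      server: {get_param: server_id}
    # Serialises the two deployments on the same server; on delete the
    # dependency is walked in reverse, so deploy_second is removed before
    # deploy_first instead of both hitting the database at once.
    depends_on: deploy_first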

Comment 3 Zane Bitter 2016-11-16 19:54:32 UTC
OK, after reviewing the upstream IRC logs I see that those last two workarounds probably won't help. It sounds like there are 300 software deployments all being triggered on a single server, all at once. Because they probably have nothing in the nested stack depending on them they'll be first in line to be deleted, so the triggering of the deployments will be more or less synchronised. And because there are 168 heat-engine worker processes running across the 3 controllers, they'll all be able to cause massive database contention in parallel.

I'd imagine the amount of retry jitter and/or the number of retries would need to be *much* higher to accommodate this.

Comment 4 Jan Provaznik 2016-11-16 20:57:21 UTC
A potential workaround we will try:

Adding a resource to the node template that waits a random amount of time before triggering node_cleanup:

  delete_delay_time:
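    # Generates a random two-digit number used as the delete delay in seconds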
    type: OS::Heat::RandomString
    properties:
      character_classes:
        - min: 2
          class: digits
      length: 2
        
  delete_delay:
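    # The delete action of this TestResource sleeps for the random number of
    # seconds; because of depends_on it is deleted (and so waits) before
    # deployment_bastion_node_cleanup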
    type: OS::Heat::TestResource
    properties:
      action_wait_secs:
        delete: {get_resource: delete_delay_time}
    depends_on: deployment_bastion_node_cleanup

Comment 5 Jan Provaznik 2016-11-17 11:10:00 UTC
With the above resources there is a problem converting the string returned by RandomString into a float:
| jprovazn                  | 8e9d3892-3f4c-483c-b67e-b8ec8941dda6 | Resource CREATE failed: resources.uuyidznnssv7: resources.openshift_nodes.Property error: resources.delete_delay.properties.action_wait_secs: could not convert string to float: delete_delay_time | CREATE_FAILED      | 2016-11-17T10:45:19 |

Comment 6 Rabi Mishra 2016-11-17 12:54:33 UTC
I think you can use the below instead.
...
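  # get_attr: [delete_delay_time, value] resolves to the generated random
  # string; get_resource resolved to a non-numeric reference, hence the
  # float conversion error above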
  delete_delay:
    type: OS::Heat::TestResource
    properties:
      action_wait_secs:
        delete: {get_attr: [delete_delay_time, value]}
    depends_on: deployment_bastion_node_cleanup

Comment 7 Jan Provaznik 2016-11-17 13:49:07 UTC
I tried a scale-down from 40 to 5 nodes; using the above delay significantly helps, but it is not 100%. With the delay resource I hit two Concurrent DB transaction errors (compared to tens without the delay). I also hit 2 GreenletExit errors; I'm not sure if they're related or what exactly they mean:


2016-11-17 13:33:30.206 93477 INFO heat.engine.resources.openstack.heat.test_resource [req-aea02c62-d66d-4ba6-8e74-78229217af9d - - - - -] Resource delete_delay waited 0:00:26.581435/35 seconds
2016-11-17 13:33:30.469 93494 INFO heat.engine.stack [req-3be88e22-39eb-40d7-87bb-9610cfbb664d - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:30.687 93489 INFO heat.engine.resources.openstack.heat.test_resource [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Resource delete_delay waited 0:00:26.990793/27 seconds
2016-11-17 13:33:30.754 93494 INFO heat.engine.stack [req-3be88e22-39eb-40d7-87bb-9610cfbb664d - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-xjkkmqeidfhk-dqcn4xhy6brg): GreenletExit()
2016-11-17 13:33:30.775 93507 INFO heat.engine.stack [req-c2406ea8-934d-48ba-a612-b82c01dd641c - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:30.838 93507 INFO heat.engine.stack [req-c2406ea8-934d-48ba-a612-b82c01dd641c - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-6n3rr5e7bb4r-twcxgn6d56hl): GreenletExit()
2016-11-17 13:33:30.957 93479 INFO heat.engine.stack [req-e6b16fdc-e3e0-4f33-bb44-320225c149ae - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.007 93479 INFO heat.engine.stack [req-e6b16fdc-e3e0-4f33-bb44-320225c149ae - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-kksr7lcmnlx5-jmya6lhmi6ao): GreenletExit()
2016-11-17 13:33:31.038 93489 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.076 93489 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-hsduldryyhfi-p5hb4xfwquh6): GreenletExit()
2016-11-17 13:33:31.315 93512 INFO heat.engine.stack [req-332524ca-c0b7-4faa-8da5-7611cb24cb33 - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.353 93477 INFO heat.engine.stack [req-aea02c62-d66d-4ba6-8e74-78229217af9d - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.393 93512 INFO heat.engine.stack [req-332524ca-c0b7-4faa-8da5-7611cb24cb33 - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-76cx5vggssis-a4mledfuqdjr): GreenletExit()
2016-11-17 13:33:31.416 93499 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.421 93477 INFO heat.engine.stack [req-aea02c62-d66d-4ba6-8e74-78229217af9d - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-poqpw6phc3o4-t47soymjew2z): GreenletExit()
2016-11-17 13:33:31.455 93499 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-gryzxcui5zxt-ldoxkt4sbtt3): GreenletExit()
2016-11-17 13:33:31.514 93492 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.559 93492 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-yilebysoqvq5-vv57hagbgdqy): GreenletExit()
2016-11-17 13:33:31.613 93498 INFO heat.engine.stack [req-5b7f4c29-9dac-4e29-9256-9b503b7e6a9f - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.648 93498 INFO heat.engine.stack [req-5b7f4c29-9dac-4e29-9256-9b503b7e6a9f - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-dvbowcn5oqp5-jas7foyf3lqo): GreenletExit()
2016-11-17 13:33:39.751 93494 INFO heat.engine.service [req-e033b2e3-506b-464e-b334-1e78564071de f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Deleting stack jprovazn-openshift_nodes-u55ic4exosf4-7r4ptpkp4426-jh5nsnlahfhs-internal_port-s4ed3zgn2evp

Comment 8 Zane Bitter 2016-11-17 15:00:35 UTC
GreenletExit happens when a thread is killed. The only reason I know of for that is if you issue a delete while another action (in this case it looks like a previous delete) was still in progress. There's a grace period after a 

Did you test with https://review.openstack.org/#/c/321783/ or without? I'd expect that to make a bigger difference now that we have spread stuff out with the random delay in the template.

Comment 9 Jan Provaznik 2016-11-17 17:19:32 UTC
We are testing without any additional patches ATM, so https://review.openstack.org/#/c/321783/ is not applied. We can try it.

Comment 10 Zane Bitter 2016-11-17 18:18:15 UTC
OK, would be interesting.

The good news is that you've got it down to a level where you'd probably need only 1 retry to get everything deleted. That's getting to a level where you could seriously consider just adding a release note for now.

It only happens when you delete a whole large cluster in one go, or scale down by a very large step, so I'd guess that this won't be common in practice, and the workaround should now be simple.

Comment 11 Zane Bitter 2017-06-08 18:29:10 UTC
Crag fixed https://review.openstack.org/#/c/321783/ to retry properly and it merged in Pike, so we expect this problem to be *much* less pronounced now, although we don't know at exactly what scale the retry delays might start proving too short.

Comment 18 errata-xmlrpc 2017-12-13 20:52:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

