Description of problem:
After deploying a larger OpenShift deployment (e.g. 50 compute nodes), deletion or scale-down of the deployment fails, and the stack events contain:

2016-11-16 08:14:30.428 52982 ERROR heat.engine.resource ConcurrentTransaction: Concurrent transaction for deployments of server 39bb0dea-d89a-4aac-ac46-3dca0ca7c2e3

This is a downstream tracker of an upstream heat bug:
https://bugs.launchpad.net/heat/+bug/1642166

Version-Release number of selected component (if applicable):
python-heatclient-1.4.0-0.20160831084943.fb7802e.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cloudwatch-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7ost.noarch
openstack-heat-common-7.0.0-0.20160907124808.21e49dc.el7ost.noarch
openstack-heat-api-7.0.0-0.20160907124808.21e49dc.el7ost.noarch

Steps to Reproduce:
1. Deploy openshift-on-openstack with node_count=50
2. Delete the stack

Actual results:
Stack deletion fails

Expected results:
Stack is deleted
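For step 1, a minimal environment-file sketch (only the node_count parameter comes from the steps above; the openshift-on-openstack template and its remaining required parameters are assumed and omitted):

parameters:
  node_count: 50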
This is basically due to a thundering herd problem: deletes are generally pretty fast, so everything hits the database at once.

A patch I proposed a while back to help limit this was https://review.openstack.org/#/c/301323/ but it's unlikely to go ahead, since it only applies to legacy non-convergence stacks (no longer the default configuration).

According to the upstream bug you already tested another patch I proposed, https://review.openstack.org/#/c/321783/ which adds a bit of jitter to DB retries, but it didn't help. Unfortunately it also fails in the upstream gate when convergence is enabled, so it can't be merged until that has been investigated and fixed anyway.

If the issue is contention between multiple software deployments on a single server, then one option is to serialise them by adding depends_on relationships between them, as in the sketch below. Note that this also applies to creation, so it may slow down creation time considerably.
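A minimal sketch of that serialisation, assuming a trivial script config and a server_id parameter (the resource names here are hypothetical, not taken from the openshift-on-openstack templates):

heat_template_version: 2016-10-14

parameters:
  server_id:
    type: string

resources:
  some_config:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      config: "#!/bin/sh\ntrue"

  deployment_one:
    type: OS::Heat::SoftwareDeployment
    properties:
      config: {get_resource: some_config}
      server: {get_param: server_id}

  deployment_two:
    type: OS::Heat::SoftwareDeployment
    # depends_on forces Heat to create (and delete, in reverse order) these
    # deployments one after another instead of concurrently
    depends_on: deployment_one
    properties:
      config: {get_resource: some_config}
      server: {get_param: server_id}

Serialising the deployments trades concurrency for predictability: the database sees one deployment update per server at a time, at the cost of longer create and delete times.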
OK, after reviewing the upstream IRC logs I see that those last two workarounds probably won't help. It sounds like there are 300 software deployments all being triggered on a single server, all at once. Because they probably have nothing in the nested stack depending on them, they'll be first in line to be deleted, so the triggering of the deployments will be more or less synchronised. And because there are 168 heat-engine worker processes running across the 3 controllers, they'll all be able to cause massive database contention in parallel. I'd imagine the amount of retry jitter and/or the number of retries would need to be *much* higher to accommodate this.
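To make the shape of the problem concrete, here is a hypothetical sketch of the pattern described above (the property values are assumptions, loosely modelled on the deployment_bastion_node_cleanup resource mentioned in the next comment, not copied from the actual templates). Each node's nested stack contains a deployment whose server is the single shared bastion, so N nodes produce N deployments that all contend for the same server's deployment data on delete:

# inside each node's nested stack
deployment_bastion_node_cleanup:
  type: OS::Heat::SoftwareDeployment
  properties:
    # assumed: the config and the bastion's server ID are passed in from the parent stack
    config: {get_param: bastion_node_cleanup_config}
    server: {get_param: bastion_server_id}
    # assumed: the cleanup only needs to run when the node is deleted
    actions: ['DELETE']
    input_values:
      node_name: {get_param: node_name}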
A potential workaround we will try: adding a resource into the node template which waits a random time before triggering node_cleanup.

delete_delay_time:
  type: OS::Heat::RandomString
  properties:
    character_classes:
      - min: 2
        class: digits
    length: 2

delete_delay:
  type: OS::Heat::TestResource
  properties:
    action_wait_secs:
      delete: {get_resource: delete_delay_time}
  depends_on: deployment_bastion_node_cleanup
With the above resources there is a problem converting the string returned by RandomString into a float:

| jprovazn | 8e9d3892-3f4c-483c-b67e-b8ec8941dda6 | Resource CREATE failed: resources.uuyidznnssv7: resources.openshift_nodes.Property error: resources.delete_delay.properties.action_wait_secs: could not convert string to float: delete_delay_time | CREATE_FAILED | 2016-11-17T10:45:19 |
I think you can use the below instead.

...
delete_delay:
  type: OS::Heat::TestResource
  properties:
    action_wait_secs:
      delete: {get_attr: [delete_delay_time, value]}
  depends_on: deployment_bastion_node_cleanup
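Putting the two snippets together, the corrected pair of resources would look like this (same resource names as above): get_attr: [delete_delay_time, value] yields the generated digits, whereas get_resource resolves to the resource's ID rather than a number, which is what caused the float conversion error.

delete_delay_time:
  type: OS::Heat::RandomString
  properties:
    character_classes:
      - min: 2
        class: digits
    length: 2

delete_delay:
  type: OS::Heat::TestResource
  properties:
    action_wait_secs:
      # random 2-digit delay (0-99 seconds) before the delete proceeds
      delete: {get_attr: [delete_delay_time, value]}
  depends_on: deployment_bastion_node_cleanup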
I tried a scale-down from 40 to 5 nodes; it seems that using the above delay helps significantly, but it is not 100%. With the delay resource I hit two Concurrent DB transaction errors (compared to tens without the delay). I also hit 2 GreenletExits; I'm not sure if they're related or what exactly they mean:

2016-11-17 13:33:30.206 93477 INFO heat.engine.resources.openstack.heat.test_resource [req-aea02c62-d66d-4ba6-8e74-78229217af9d - - - - -] Resource delete_delay waited 0:00:26.581435/35 seconds
2016-11-17 13:33:30.469 93494 INFO heat.engine.stack [req-3be88e22-39eb-40d7-87bb-9610cfbb664d - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:30.687 93489 INFO heat.engine.resources.openstack.heat.test_resource [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Resource delete_delay waited 0:00:26.990793/27 seconds
2016-11-17 13:33:30.754 93494 INFO heat.engine.stack [req-3be88e22-39eb-40d7-87bb-9610cfbb664d - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-xjkkmqeidfhk-dqcn4xhy6brg): GreenletExit()
2016-11-17 13:33:30.775 93507 INFO heat.engine.stack [req-c2406ea8-934d-48ba-a612-b82c01dd641c - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:30.838 93507 INFO heat.engine.stack [req-c2406ea8-934d-48ba-a612-b82c01dd641c - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-6n3rr5e7bb4r-twcxgn6d56hl): GreenletExit()
2016-11-17 13:33:30.957 93479 INFO heat.engine.stack [req-e6b16fdc-e3e0-4f33-bb44-320225c149ae - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.007 93479 INFO heat.engine.stack [req-e6b16fdc-e3e0-4f33-bb44-320225c149ae - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-kksr7lcmnlx5-jmya6lhmi6ao): GreenletExit()
2016-11-17 13:33:31.038 93489 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.076 93489 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-hsduldryyhfi-p5hb4xfwquh6): GreenletExit()
2016-11-17 13:33:31.315 93512 INFO heat.engine.stack [req-332524ca-c0b7-4faa-8da5-7611cb24cb33 - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.353 93477 INFO heat.engine.stack [req-aea02c62-d66d-4ba6-8e74-78229217af9d - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.393 93512 INFO heat.engine.stack [req-332524ca-c0b7-4faa-8da5-7611cb24cb33 - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-76cx5vggssis-a4mledfuqdjr): GreenletExit()
2016-11-17 13:33:31.416 93499 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.421 93477 INFO heat.engine.stack [req-aea02c62-d66d-4ba6-8e74-78229217af9d - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-poqpw6phc3o4-t47soymjew2z): GreenletExit()
2016-11-17 13:33:31.455 93499 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-gryzxcui5zxt-ldoxkt4sbtt3): GreenletExit()
2016-11-17 13:33:31.514 93492 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.559 93492 INFO heat.engine.stack [req-414fdf3b-ebd2-4892-bee8-3e323e6d9a2d f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-yilebysoqvq5-vv57hagbgdqy): GreenletExit()
2016-11-17 13:33:31.613 93498 INFO heat.engine.stack [req-5b7f4c29-9dac-4e29-9256-9b503b7e6a9f - - - - -] Stopped due to GreenletExit() in delete
2016-11-17 13:33:31.648 93498 INFO heat.engine.stack [req-5b7f4c29-9dac-4e29-9256-9b503b7e6a9f - - - - -] Stack DELETE FAILED (jprovazn-openshift_nodes-u55ic4exosf4-dvbowcn5oqp5-jas7foyf3lqo): GreenletExit()
2016-11-17 13:33:39.751 93494 INFO heat.engine.service [req-e033b2e3-506b-464e-b334-1e78564071de f83e3800d8e045979b7f043ed39d6d3a 8593e1a57e40462bb3c78722e00487d3 - - -] Deleting stack jprovazn-openshift_nodes-u55ic4exosf4-7r4ptpkp4426-jh5nsnlahfhs-internal_port-s4ed3zgn2evp
GreenletExit happens when a thread is killed. The only reason I know of for that is if you issue a delete while another action (in this case it looks like a previous delete) was still in progress. There's a grace period after a

Did you test with https://review.openstack.org/#/c/321783/ or without? I'd expect that to make a bigger difference now that we have spread stuff out with the random delay in the template.
We are testing without any additional patches ATM, so https://review.openstack.org/#/c/321783/ is not applied. We can try it.
OK, that would be interesting. The good news is that you've got it down to a level where you'd probably need only 1 retry to get everything deleted. That's getting to a level where you could seriously consider just adding a release note for now. It only happens when you delete a whole large cluster in one go, or scale down by a very large step, so I'd guess this won't be common in practice, and the workaround should now be simple.
Crag fixed https://review.openstack.org/#/c/321783/ to retry properly and it merged in Pike, so we expect this problem to be *much* less pronounced now, although we don't know at exactly what scale the retry delays might start proving too short.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3462