Description of problem:

This error can be seen in an OSP12 deployment:

2018-05-14 04:38:02Z [overcloud-Controller-irffjxb5zlgh-2-u56vlld5py3g]: UPDATE_COMPLETE  Stack UPDATE completed successfully
2018-05-14 04:38:03Z [overcloud-Controller-irffjxb5zlgh.2]: UPDATE_COMPLETE  state changed
2018-05-14 04:38:03Z [overcloud-Controller-irffjxb5zlgh]: UPDATE_FAILED  resources[0]: (pymysql.err.IntegrityError) (1451, u'Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`event`, CONSTRAINT `ev_rsrc_prop_data_ref` FOREIGN KEY (`rsrc_prop_data_id`) REFERENCES `resource_properties_data` (`id`))')
2018-05-14 04:38:04Z [Controller]: UPDATE_FAILED  resources.Controller: resources[0]: (pymysql.err.IntegrityError) (1451, u'Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`event`, CONSTRAINT `ev_rsrc_prop_data_ref` FOREIGN KEY (`rsrc_prop_data_id`) REFERENCES `resource_prope
2018-05-14 04:42:05Z [Compute]: UPDATE_FAILED  UPDATE aborted
2018-05-14 04:42:05Z [overcloud]: UPDATE_FAILED  resources.Controller: resources[0]: (pymysql.err.IntegrityError) (1451, u'Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`event`, CONSTRAINT `ev_rsrc_prop_data_ref` FOREIGN KEY (`rsrc_prop_data_id`) REFERENCES `resource_prope

Stack overcloud UPDATE_FAILED

overcloud.Controller.0:
  resource_type: OS::TripleO::Controller
  physical_resource_id: 7173f444-fe34-417d-bfeb-c853ee85bc06
  status: UPDATE_FAILED
  status_reason: |
    resources[0]: (pymysql.err.IntegrityError) (1451, u'Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`event`, CONSTRAINT `ev_rsrc_prop_data_ref` FOREIGN KEY (`rsrc_prop_data_id`) REFERENCES `resource_properties_data` (`id`))') [SQL: u'DELETE FROM resource_properties_data WHERE resource_properties_data.id IN (%(id_1)s, %(id_2)s, %(id_3)s, %(id_4)s, %(id_5)s, %(id_6)s, %(id_7)s, %(id_8)s, %(id_9)s, %(id_10)s)'] [parameters: {u'id_10': 6781, u'id_2': 6785, u'id_3': 7014, u'id_1': 7008, u'id_6': 6962, u'id_7': 6964, u'id_4': 6731, u'id_5': 6736, u'id_8': 6871, u'id_9': 6868}]
overcloud.Compute:
  resource_type: OS::Heat::ResourceGroup
  physical_resource_id: cb7ed374-46dd-4121-a9a5-e982100e3123
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted

It seems like the following upstream bug [1] was solved in openstack/heat 9.0.0.0b2, but the error is still seen in openstack-heat-engine-9.0.1-3.

[1] https://bugs.launchpad.net/heat/+bug/1681772

Version-Release number of selected component (if applicable):

How reproducible:
Unsure

Steps to Reproduce:
1. Scale up the overcloud
2.
3.

Actual results:
Database error

Expected results:
No database error

Additional info:
I had a look at the logs and the database. There is definitely a bug in Heat: the foreign key error is hit for a bunch of rows in the resource_properties_data table. Config settings seem to be default, and the undercloud is in a good state.

There are 3 stacks that reached the 1000-event threshold that triggers the purge. Increasing that threshold will only push the issue down the road. We can then query the database to purge events manually.

Regarding the bug itself, the only thing suspicious is the use of synchronize_session=False when deleting resource_properties_data. It's always unclear to me what that does.
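For reference, the manual purge mentioned above can be sketched against a toy schema (sqlite3 standing in for MySQL; table and column names are simplified stand-ins, not Heat's real schema): delete the oldest events first, then remove only the resource_properties_data rows that are no longer referenced by any event or resource. Deleting properties rows that are still referenced is exactly what trips the foreign key constraint in the report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE resource_properties_data (id INTEGER PRIMARY KEY);
CREATE TABLE event (
    id INTEGER PRIMARY KEY,
    created_at INTEGER,
    rsrc_prop_data_id INTEGER REFERENCES resource_properties_data (id)
);
CREATE TABLE resource (
    id INTEGER PRIMARY KEY,
    rsrc_prop_data_id INTEGER REFERENCES resource_properties_data (id)
);
""")
conn.executemany("INSERT INTO resource_properties_data (id) VALUES (?)",
                 [(n,) for n in range(1, 6)])
conn.executemany("INSERT INTO event VALUES (?, ?, ?)",
                 [(1, 100, 1), (2, 200, 2), (3, 300, 3)])
conn.execute("INSERT INTO resource VALUES (1, 3)")  # resource still uses row 3

def purge_oldest_events(conn, keep):
    # Delete the oldest events first...
    total = conn.execute("SELECT count(*) FROM event").fetchone()[0]
    conn.execute(
        "DELETE FROM event WHERE id IN "
        "(SELECT id FROM event ORDER BY created_at LIMIT ?)",
        (max(total - keep, 0),))
    # ...then drop only the properties rows that nothing references any more.
    conn.execute("""
        DELETE FROM resource_properties_data
        WHERE id NOT IN (SELECT rsrc_prop_data_id FROM event
                         WHERE rsrc_prop_data_id IS NOT NULL)
          AND id NOT IN (SELECT rsrc_prop_data_id FROM resource
                         WHERE rsrc_prop_data_id IS NOT NULL)
    """)
    conn.commit()

purge_oldest_events(conn, keep=1)
```

After the purge, only the newest event survives, and properties row 3 is kept because the resource still references it.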
https://bugzilla.redhat.com/show_bug.cgi?id=1464533 was the OSP11 version of that one, FWIW.
(In reply to Thomas Hervé from comment #3)
> Regarding the bug itself, the only thing suspicious is the use of
> synchronize_session=False when deleting resource_properties_data. It's
> always unclear to me what that does.

I think Rabi's comments on the patch about that are correct - it prevents SQLAlchemy from updating its object cache after running the command. So I don't think it's actually relevant to the issue, or that the patch actually fixed anything.

The only explanation I can think of is that another resource/event is adding a reference to one of the IDs at a point between when we check for which ones are still referenced and when we try to delete the ones that aren't. I don't see how that could happen in practice though. (We could create a new Event with the same properties ID, but only if the Resource already has a reference to that properties ID.)

I do find it a little odd that we're not using a transaction. I don't see how that would help, but at least we could retry it after it failed.
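The suspected race can be sketched sequentially (no threads needed - the interleaving is simulated by hand, with sqlite3 and a simplified two-table schema in place of Heat's real one): the purge first computes which resource_properties_data rows are unreferenced, a new event referencing one of those ids lands in the window, and the now-stale delete hits the same foreign key error as in the traceback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE resource_properties_data (id INTEGER PRIMARY KEY);
CREATE TABLE event (
    id INTEGER PRIMARY KEY,
    rsrc_prop_data_id INTEGER REFERENCES resource_properties_data (id)
);
""")
conn.execute("INSERT INTO resource_properties_data VALUES (42)")

# Step 1: the purge decides id 42 is safe to delete (no event references it).
unreferenced = [r[0] for r in conn.execute(
    "SELECT id FROM resource_properties_data WHERE id NOT IN "
    "(SELECT rsrc_prop_data_id FROM event)")]

# Step 2: meanwhile, another request records an event referencing id 42.
conn.execute("INSERT INTO event VALUES (1, 42)")

# Step 3: the stale delete now violates the foreign key constraint.
failed = False
try:
    conn.execute("DELETE FROM resource_properties_data WHERE id IN (%s)"
                 % ",".join("?" * len(unreferenced)), unreferenced)
except sqlite3.IntegrityError:
    failed = True
```

A transaction alone would not close the window unless the check and the delete also held a lock (or the delete were retried after re-checking), which matches the "retry it after it failed" suggestion above.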
You're right, the problem is probably the lack of transactions. We chatted about it with Rabi, and the conclusion we came to is that cleaning events up while inserting them was always a weird design. We have a purge cron; it feels like a better fit for this kind of cleanup.
The purge cron runs, what, once a day? (And is specific to the deployment tool.) TripleO was filling up the database in less time than that because of the number of events involved in updating a large scaling group. IIRC the way this is tuned is something like when you have >1000 events it deletes the oldest 200. That's not a terrible way of doing it. (Maybe we could spawn a separate thread to do it though?) At the end of the day, if there is in fact a race condition (as seems to be the case, though I can't spot the mechanism) then there's still going to be a race no matter what the trigger is.
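The trigger described above can be sketched as a pure function (the function name and shape are made up for illustration; the 1000/200 numbers are the ones quoted in this thread, not read from Heat's config):

```python
def events_to_purge(event_ids, max_events=1000, batch_size=200):
    """Given a stack's event ids ordered oldest-first, return the ids
    that the purge would delete when a new event is about to be created:
    nothing until the threshold is exceeded, then the oldest batch."""
    if len(event_ids) <= max_events:
        return []
    return event_ids[:batch_size]
```

So a stack sitting at exactly 1000 events purges nothing, while the 1001st event triggers deletion of the oldest 200 - which is why large scaling-group updates, which generate events rapidly, keep re-entering this code path.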
This same problem can result in undeletable stacks. Trying to run "openstack stack delete -y overcloud" right now, I'm getting:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 831, in _action_recorder
    yield
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1918, in delete
    *action_args)
  File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 351, in wrapper
    step = next(subtask)
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 890, in action_handler_task
    done = check(handler_data)
  File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 553, in check_delete_complete
    return self._check_status_complete(self.DELETE)
  File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 420, in _check_status_complete
    action=action)
ResourceFailure: DBReferenceError: resources.ControllerDeployedServerServiceChain.resources.ServiceChain.resources[103]: (pymysql.err.IntegrityError) (1451, u'Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`event`, CONSTRAINT `ev_rsrc_prop_data_ref` FOREIGN KEY (`rsrc_prop_data_id`) REFERENCES `resource_properties_data` (`id`))') [SQL: u'DELETE FROM resource_properties_data WHERE resource_properties_data.id IN (%(id_1)s, %(id_2)s, %(id_3)s, %(id_4)s, %(id_5)s, %(id_6)s, %(id_7)s, %(id_8)s, %(id_9)s, %(id_10)s, %(id_11)s, %(id_12)s, %(id_13)s, %(id_14)s, %(id_15)s, %(id_16)s, %(id_17)s, %(id_18)s, %(id_19)s, %(id_20)s, %(id_21)s, %(id_22)s, %(id_23)s, %(id_24)s, %(id_25)s)'] [parameters: {u'id_10': 1084, u'id_11': 3016, u'id_12': 3018, u'id_13': 3019, u'id_14': 849, u'id_15': 2644, u'id_16': 3088, u'id_17': 3067, u'id_18': 3044, u'id_19': 3045, u'id_21': 3051, u'id_20': 3047, u'id_23': 3064, u'id_22': 3055, u'id_25': 3070, u'id_24': 2299, u'id_2': 3080, u'id_3': 3084, u'id_1': 3079, u'id_6': 1031, u'id_7': 2991, u'id_4': 1040, u'id_5': 2329, u'id_8': 2993, u'id_9': 2998}]

Running that `heat-manage migrate_properties_data` command doesn't seem to be helping.
(In reply to Lars Kellogg-Stedman from comment #14)
> This same problem can result in undeletable stacks.

That looks like the same error. Basically once the event table overflows 1000 events for the stack, on 1% of new events we'll try to delete the oldest 200, which is failing for unknown reasons. Both updates and deletes create events, so this can happen equally during an update or a delete.
I found the root cause: https://storyboard.openstack.org/#!/story/2002643#comment-90440 This affects all non-convergence stacks in OSP11 and up (although in practice that only means TripleO).
We're in the process of backporting a patch to ignore (but log) the error upstream. The root cause is that entries in the resource properties data can be shared between events/resources in the current stack and those in the backup stack (for non-convergence stacks, including TripleO prior to OSP13). This will be fully fixed on master.

If anybody encounters this in the meantime, a good workaround to try would be to purge deleted stacks using 'heat-manage purge_deleted', since the unwanted references may be in now-deleted backup stacks. There's no guarantee, however, since the references may in fact be from a current backup stack.
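A sketch of the "ignore but log" approach described above (this is not the actual upstream patch; the helper name and the sqlite3 toy schema are made up for illustration): if the bulk delete of resource_properties_data rows trips the foreign key constraint, log a warning and carry on instead of failing the whole stack operation.

```python
import logging
import sqlite3

LOG = logging.getLogger("purge")

def delete_props_rows(conn, ids):
    """Try to bulk-delete properties rows; on a foreign key violation
    (a row is still referenced, e.g. from a backup stack), log and skip
    rather than aborting the update/delete. Returns True on success."""
    try:
        conn.execute(
            "DELETE FROM resource_properties_data WHERE id IN (%s)"
            % ",".join("?" * len(ids)), ids)
    except sqlite3.IntegrityError as exc:
        LOG.warning("could not purge resource_properties_data rows %s: %s",
                    ids, exc)
        return False
    return True

# Toy schema to exercise both paths.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE resource_properties_data (id INTEGER PRIMARY KEY);
CREATE TABLE event (
    id INTEGER PRIMARY KEY,
    rsrc_prop_data_id INTEGER REFERENCES resource_properties_data (id)
);
""")
conn.executemany("INSERT INTO resource_properties_data VALUES (?)",
                 [(1,), (2,)])
conn.execute("INSERT INTO event VALUES (1, 2)")  # row 2 is still referenced

ok_unreferenced = delete_props_rows(conn, [1])  # deleted cleanly
ok_referenced = delete_props_rows(conn, [2])    # logged and skipped
```

The trade-off is that skipped rows linger in the database until a later purge can remove them, but the stack update or delete itself no longer fails.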
2018-11-18 16:09:19Z [AllNodesDeploySteps]: UPDATE_COMPLETE  state changed
2018-11-18 16:09:35Z [overcloud]: UPDATE_COMPLETE  Stack UPDATE completed successfully

Stack overcloud UPDATE_COMPLETE

Overcloud Endpoint: https://10.0.0.101:13000/v2.0
Overcloud Deployed

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 61df175f-d107-43d9-b5d7-6a0c6d6b1a5a | compute-2    | ACTIVE | ctlplane=192.168.24.18 | overcloud-full | compute    |
| 2b283998-fd3a-47bb-bd6d-7a0ffb5d17b9 | controller-0 | ACTIVE | ctlplane=192.168.24.7  | overcloud-full | controller |
| b011c464-249d-4df9-90f2-2bd48cae17e0 | controller-2 | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | controller |
| 43db5a1a-dbb5-45d4-b621-48bda166bff0 | compute-0    | ACTIVE | ctlplane=192.168.24.17 | overcloud-full | compute    |
| a5de29bb-9751-425f-b92f-8dc2f09300f5 | controller-1 | ACTIVE | ctlplane=192.168.24.15 | overcloud-full | controller |
| 35f5073d-4072-40da-9afd-b852bdf8e64f | compute-1    | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
2018-11-14.1
(undercloud) [stack@undercloud-0 ~]$
openstack-heat-common-9.0.5-1.el7ost.noarch

========
Summary: successfully scaled from 2 compute to 3 compute nodes
Copied previously-approved doc_text from bug 1596866.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3787