Bug 1577874
Summary: Heat database error: 'Cannot delete or update a parent row: a foreign key constraint fails ...'

Product: Red Hat OpenStack
Reporter: Eduard Barrera <ebarrera>
Component: openstack-heat
Assignee: Zane Bitter <zbitter>
Status: CLOSED ERRATA
QA Contact: Victor Voronkov <vvoronko>
Severity: high
Docs Contact:
Priority: high
Version: 12.0 (Pike)
CC: amcleod, astupnik, lars, mburns, sbaker, shardy, srevivo, therve, vvoronko, zbitter
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-heat-9.0.5-1.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, when a stack had more than 1,000 past events, Heat purged a portion of the existing events from the database. However, if the stack had previous updates with convergence disabled, some events might reference resource properties data shared with the backup stack, and purging events might fail with a foreign key constraint error:

`Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`event`, CONSTRAINT `ev_rsrc_prop_data_ref` FOREIGN KEY (`rsrc_prop_data_id`) REFERENCES `resource_properties_data` (`id`))`

or

`Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`resource`, CONSTRAINT `rsrc_rsrc_prop_data_ref` FOREIGN KEY (`rsrc_prop_data_id`) REFERENCES `resource_properties_data` (`id`))`

This prevented the new event from being stored, and the stack update failed.

With this update, Heat ignores foreign key constraint errors when attempting to purge events. Events are not purged until any backup stacks that hold common references have themselves been purged. New events are stored, and the operation of the stack continues. (The foreign key relationship between these tables is sketched just after the metadata fields below.)
Story Points: ---
Clone Of:
: 1596866 1845859 (view as bug list)
Environment:
Last Closed: 2018-12-05 18:53:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1596866
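For readers unfamiliar with the tables involved, the following is a minimal SQLAlchemy sketch of the relationship the Doc Text describes, not Heat's actual models. The table, column, and constraint names (`event`, `resource`, `resource_properties_data`, `rsrc_prop_data_id`, `ev_rsrc_prop_data_ref`, `rsrc_rsrc_prop_data_ref`) are taken from the error messages above; the remaining column is an illustrative assumption. Both `event` and `resource` point at `resource_properties_data`, so a shared row cannot be deleted while either table still references it.

```python
# Minimal schema sketch (assumptions noted above); not Heat's real model definitions.
import sqlalchemy as sa

metadata = sa.MetaData()

# Shared properties rows referenced by both events and resources.
resource_properties_data = sa.Table(
    'resource_properties_data', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('data', sa.Text),   # illustrative column, not from the bug report
)

event = sa.Table(
    'event', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('rsrc_prop_data_id', sa.Integer,
              sa.ForeignKey('resource_properties_data.id',
                            name='ev_rsrc_prop_data_ref')),
)

resource = sa.Table(
    'resource', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('rsrc_prop_data_id', sa.Integer,
              sa.ForeignKey('resource_properties_data.id',
                            name='rsrc_rsrc_prop_data_ref')),
)

# Deleting a resource_properties_data row that either table still references
# raises the MySQL error 1451 quoted in this bug.
```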
Description
Eduard Barrera
2018-05-14 10:31:57 UTC
I had a look at the logs and the database. There is definitely a bug in Heat: it happens for a bunch of rows in the resource_properties_data table. Config settings seem to be default, and the undercloud is in a good state. There are 3 stacks that reached the 1000 events threshold requiring the purge. Increasing that threshold will push the issue down the road. We can then query the database to purge events manually.

Regarding the bug itself, the only thing suspicious is the use of synchronize_session=False when deleting resource_properties_data. It's always unclear to me what that does. https://bugzilla.redhat.com/show_bug.cgi?id=1464533 was the OSP11 version of that one, FWIW.

(In reply to Thomas Hervé from comment #3)
> Regarding the bug itself, the only thing suspicious is the use of
> synchronize_session=False when deleting resource_properties_data. It's
> always unclear to me what that does.

I think Rabi's comments on the patch about that are correct - it prevents SQLAlchemy from updating its object cache after running the command. So I don't think it's actually relevant to the issue, or that the patch actually fixed anything.

The only explanation I can think of is that another resource/event is adding a reference to one of the IDs at a point between when we check which ones are still referenced and when we try to delete the ones that aren't. I don't see how that could happen in practice, though. (We could create a new Event with the same properties ID, but only if the Resource already has a reference to that properties ID.)

I do find it a little odd that we're not using a transaction. I don't see how that would help, but at least we could retry after it failed.

You're right, the problem is probably the lack of transactions. We chatted about it with Rabi, and the conclusion we came to is that cleaning them up during insert was always a bit odd. We have a purge cron; it feels like a better fit for this sort of thing.

The purge cron runs, what, once a day? (And it is specific to the deployment tool.) TripleO was filling up the database in less time than that because of the number of events involved in updating a large scaling group. IIRC the way this is tuned is something like: when you have more than 1000 events, it deletes the oldest 200. That's not a terrible way of doing it. (Maybe we could spawn a separate thread to do it, though?) At the end of the day, if there is in fact a race condition (as seems to be the case, though I can't spot the mechanism), there is still going to be a race no matter what the trigger is.

This same problem can result in undeletable stacks.
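To make the discussion above concrete, here is a simplified sketch of the purge pattern being described; it is not Heat's actual implementation. The threshold (1000 events) and batch size (200) come from the comments above; the column names `stack_id` and `created_at` are assumptions, and `event`, `resource`, and `rsrc_prop_data` are SQLAlchemy Core tables along the lines of the schema sketch earlier. The race under discussion sits between the "is this row still referenced?" check and the final DELETE.

```python
# Simplified purge sketch (assumptions noted above), SQLAlchemy 1.4+ Core style.
import sqlalchemy as sa

MAX_EVENTS_PER_STACK = 1000   # purge is triggered once a stack exceeds this
PURGE_BATCH = 200             # ...and then roughly the oldest 200 events are removed


def purge_oldest_events(conn, event, resource, rsrc_prop_data, stack_id):
    """Delete the oldest events for one stack, then any property rows they orphaned."""
    oldest = conn.execute(
        sa.select(event.c.id, event.c.rsrc_prop_data_id)
        .where(event.c.stack_id == stack_id)
        .order_by(event.c.created_at)
        .limit(PURGE_BATCH)
    ).fetchall()
    event_ids = [row.id for row in oldest]
    prop_ids = {row.rsrc_prop_data_id for row in oldest
                if row.rsrc_prop_data_id is not None}

    conn.execute(sa.delete(event).where(event.c.id.in_(event_ids)))

    # Drop candidates that a remaining event or resource still references.
    for table in (event, resource):
        rows = conn.execute(
            sa.select(table.c.rsrc_prop_data_id)
            .where(table.c.rsrc_prop_data_id.in_(prop_ids))
        ).fetchall()
        prop_ids -= {row.rsrc_prop_data_id for row in rows}

    # In Heat this final delete is issued through the ORM with
    # synchronize_session=False, which only skips updating the in-memory
    # session; it does not change the SQL.  The race window is between the
    # reference check above and this DELETE: if a backup stack's event or
    # resource still points at one of these rows, MySQL rejects the DELETE
    # with the foreign key errors quoted in this bug.
    if prop_ids:
        conn.execute(
            sa.delete(rsrc_prop_data).where(rsrc_prop_data.c.id.in_(prop_ids)))
```

Note that the sketch, like the behaviour described in the comments above, checks and deletes in separate statements rather than inside one locking transaction, which is where a concurrent reference could in principle slip in.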
Trying to run "openstack stack delete -y overcloud" right now, I'm getting:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 831, in _action_recorder
    yield
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1918, in delete
    *action_args)
  File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 351, in wrapper
    step = next(subtask)
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 890, in action_handler_task
    done = check(handler_data)
  File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 553, in check_delete_complete
    return self._check_status_complete(self.DELETE)
  File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 420, in _check_status_complete
    action=action)
ResourceFailure: DBReferenceError: resources.ControllerDeployedServerServiceChain.resources.ServiceChain.resources[103]: (pymysql.err.IntegrityError) (1451, u'Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`event`, CONSTRAINT `ev_rsrc_prop_data_ref` FOREIGN KEY (`rsrc_prop_data_id`) REFERENCES `resource_properties_data` (`id`))')
[SQL: u'DELETE FROM resource_properties_data WHERE resource_properties_data.id IN (%(id_1)s, %(id_2)s, %(id_3)s, %(id_4)s, %(id_5)s, %(id_6)s, %(id_7)s, %(id_8)s, %(id_9)s, %(id_10)s, %(id_11)s, %(id_12)s, %(id_13)s, %(id_14)s, %(id_15)s, %(id_16)s, %(id_17)s, %(id_18)s, %(id_19)s, %(id_20)s, %(id_21)s, %(id_22)s, %(id_23)s, %(id_24)s, %(id_25)s)']
[parameters: {u'id_10': 1084, u'id_11': 3016, u'id_12': 3018, u'id_13': 3019, u'id_14': 849, u'id_15': 2644, u'id_16': 3088, u'id_17': 3067, u'id_18': 3044, u'id_19': 3045, u'id_21': 3051, u'id_20': 3047, u'id_23': 3064, u'id_22': 3055, u'id_25': 3070, u'id_24': 2299, u'id_2': 3080, u'id_3': 3084, u'id_1': 3079, u'id_6': 1031, u'id_7': 2991, u'id_4': 1040, u'id_5': 2329, u'id_8': 2993, u'id_9': 2998}]

Running that `heat-manage migrate_properties_data` command doesn't seem to be helping.

(In reply to Lars Kellogg-Stedman from comment #14)
> This same problem can result in undeletable stacks.

That looks like the same error. Basically, once the event table overflows 1000 events for the stack, on 1% of new events we'll try to delete the oldest 200, which is failing for unknown reasons. Both updates and deletes create events, so this can happen equally during an update or a delete.

I found the root cause: https://storyboard.openstack.org/#!/story/2002643#comment-90440

This affects all non-convergence stacks in OSP11 and up (although in practice that only means TripleO). We're in the process of backporting a patch upstream to ignore (but log) the error.

The root cause is that entries in resource_properties_data can be shared between events/resources in the current stack and those in the backup stack (for non-convergence stacks, including TripleO prior to OSP13). This will be fully fixed on master. If anybody encounters this in the meantime, a good workaround to try would be to purge deleted stacks using `heat-manage purge_deleted`, since the unwanted references may be in now-deleted backup stacks. There's no guarantee, however, since the references may in fact be from a current backup stack.
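As a rough illustration of the "ignore (but log) the error" behaviour the backported patch is described as introducing, the snippet below is a hedged sketch rather than the actual upstream change: the wrapper name `purge_events_best_effort` and the `purge` callable are hypothetical, while `DBReferenceError` is the oslo.db exception that appears in the traceback above.

```python
# Hedged sketch of "ignore (but log) the error" during event purge; not the real patch.
import logging

from oslo_db import exception as db_exc

LOG = logging.getLogger(__name__)


def purge_events_best_effort(purge, stack_id):
    """Run purge(stack_id), but treat foreign-key failures as non-fatal."""
    try:
        purge(stack_id)
    except db_exc.DBReferenceError:
        # The rows are still referenced, typically by a backup stack left over
        # from a legacy (non-convergence) update.  Skip the purge for now; the
        # rows can be removed once the backup stack is gone, e.g. after
        # `heat-manage purge_deleted` removes soft-deleted stacks.
        LOG.warning('Could not purge events for stack %s: '
                    'resource_properties_data rows are still referenced',
                    stack_id)
```

With a wrapper along these lines, the new event can still be recorded and the stack operation proceeds; the still-referenced rows are simply left for a later purge once no backup stack points at them.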
2018-11-18 16:09:19Z [AllNodesDeploySteps]: UPDATE_COMPLETE state changed
2018-11-18 16:09:35Z [overcloud]: UPDATE_COMPLETE Stack UPDATE completed successfully

Stack overcloud UPDATE_COMPLETE

Overcloud Endpoint: https://10.0.0.101:13000/v2.0
Overcloud Deployed

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 61df175f-d107-43d9-b5d7-6a0c6d6b1a5a | compute-2    | ACTIVE | ctlplane=192.168.24.18 | overcloud-full | compute    |
| 2b283998-fd3a-47bb-bd6d-7a0ffb5d17b9 | controller-0 | ACTIVE | ctlplane=192.168.24.7  | overcloud-full | controller |
| b011c464-249d-4df9-90f2-2bd48cae17e0 | controller-2 | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | controller |
| 43db5a1a-dbb5-45d4-b621-48bda166bff0 | compute-0    | ACTIVE | ctlplane=192.168.24.17 | overcloud-full | compute    |
| a5de29bb-9751-425f-b92f-8dc2f09300f5 | controller-1 | ACTIVE | ctlplane=192.168.24.15 | overcloud-full | controller |
| 35f5073d-4072-40da-9afd-b852bdf8e64f | compute-1    | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
2018-11-14.1

(undercloud) [stack@undercloud-0 ~]$ openstack-heat-common-9.0.5-1.el7ost.noarch

========
Summary: successfully scaled from 2 compute to 3 compute nodes

Copied previously-approved doc_text from bug 1596866.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3787