Bug 1577874
Summary: Heat database error: 'Cannot delete or update a parent row: a foreign key constraint fails ...'

Product: Red Hat OpenStack
Reporter: Eduard Barrera <ebarrera>
Component: openstack-heat
Assignee: Zane Bitter <zbitter>
Status: CLOSED ERRATA
QA Contact: Victor Voronkov <vvoronko>
Severity: high
Docs Contact:
Priority: high
Version: 12.0 (Pike)
CC: amcleod, astupnik, lars, mburns, sbaker, shardy, srevivo, therve, vvoronko, zbitter
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-heat-9.0.5-1.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, when a stack had more than 1,000 past events, Heat purged a portion of the existing events from the database. However, if the stack had previous updates with convergence disabled, some events might reference resource properties data shared with the backup stack, and purging events might fail with a foreign key constraint error:

`Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`event`, CONSTRAINT `ev_rsrc_prop_data_ref` FOREIGN KEY (`rsrc_prop_data_id`) REFERENCES `resource_properties_data` (`id`))`

or

`Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`resource`, CONSTRAINT `rsrc_rsrc_prop_data_ref` FOREIGN KEY (`rsrc_prop_data_id`) REFERENCES `resource_properties_data` (`id`))`

This prevented the new event from being stored, and the stack update failed.

With this update, Heat ignores foreign key constraint errors when attempting to purge events. Events are not purged until any backup stacks that hold common references have themselves been purged. New events are stored, and the operation of the stack continues. (The foreign key relationship between these tables is sketched just after the metadata fields below.)
Story Points: ---
Clone Of:
: 1596866 1845859 (view as bug list)
Environment:
Last Closed: 2018-12-05 18:53:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1596866
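For readers unfamiliar with the tables involved, the following is a minimal SQLAlchemy sketch of the relationship the Doc Text describes, not Heat's actual models. The table, column, and constraint names (`event`, `resource`, `resource_properties_data`, `rsrc_prop_data_id`, `ev_rsrc_prop_data_ref`, `rsrc_rsrc_prop_data_ref`) are taken from the error messages above; the remaining column is an illustrative assumption. Both `event` and `resource` point at `resource_properties_data`, so a shared row cannot be deleted while either table still references it.

```python
# Minimal schema sketch (assumptions noted above); not Heat's real model definitions.
import sqlalchemy as sa

metadata = sa.MetaData()

# Shared properties rows referenced by both events and resources.
resource_properties_data = sa.Table(
    'resource_properties_data', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('data', sa.Text),   # illustrative column, not from the bug report
)

event = sa.Table(
    'event', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('rsrc_prop_data_id', sa.Integer,
              sa.ForeignKey('resource_properties_data.id',
                            name='ev_rsrc_prop_data_ref')),
)

resource = sa.Table(
    'resource', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('rsrc_prop_data_id', sa.Integer,
              sa.ForeignKey('resource_properties_data.id',
                            name='rsrc_rsrc_prop_data_ref')),
)

# Deleting a resource_properties_data row that either table still references
# raises the MySQL error 1451 quoted in this bug.
```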
Description
Eduard Barrera
2018-05-14 10:31:57 UTC
I had a look at the logs and the database. There is definitely a bug in Heat: it happens for a bunch of rows in the resource_properties_data table. Config settings seem to be default, and the undercloud is in a good state. There are 3 stacks that reached the 1000 events threshold requiring the purge. Increasing that threshold will push the issue down the road. We can then query the database to purge events manually.

Regarding the bug itself, the only thing suspicious is the use of synchronize_session=False when deleting resource_properties_data. It's always unclear to me what that does. https://bugzilla.redhat.com/show_bug.cgi?id=1464533 was the OSP11 version of that one, FWIW.

(In reply to Thomas Hervé from comment #3)
> Regarding the bug itself, the only thing suspicious is the use of
> synchronize_session=False when deleting resource_properties_data. It's
> always unclear to me what that does.

I think Rabi's comments on the patch about that are correct - it prevents SQLAlchemy from updating its object cache after running the command. So I don't think it's actually relevant to the issue, or that the patch actually fixed anything.

The only explanation I can think of is that another resource/event is adding a reference to one of the IDs at a point between when we check which ones are still referenced and when we try to delete the ones that aren't. I don't see how that could happen in practice, though. (We could create a new Event with the same properties ID, but only if the Resource already has a reference to that properties ID.)

I do find it a little odd that we're not using a transaction. I don't see how that would help, but at least we could retry after it failed.

You're right, the problem is probably the lack of transactions. We chatted about it with Rabi, and the conclusion we came to is that cleaning them up during insert was always a bit odd. We have a purge cron; it feels like a better fit for this sort of thing.

The purge cron runs, what, once a day? (And it is specific to the deployment tool.) TripleO was filling up the database in less time than that because of the number of events involved in updating a large scaling group. IIRC the way this is tuned is something like: when you have more than 1000 events, it deletes the oldest 200. That's not a terrible way of doing it. (Maybe we could spawn a separate thread to do it, though?) At the end of the day, if there is in fact a race condition (as seems to be the case, though I can't spot the mechanism), there is still going to be a race no matter what the trigger is.

This same problem can result in undeletable stacks.
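To make the discussion above concrete, here is a simplified sketch of the purge pattern being described; it is not Heat's actual implementation. The threshold (1000 events) and batch size (200) come from the comments above; the column names `stack_id` and `created_at` are assumptions, and `event`, `resource`, and `rsrc_prop_data` are SQLAlchemy Core tables along the lines of the schema sketch earlier. The race under discussion sits between the "is this row still referenced?" check and the final DELETE.

```python
# Simplified purge sketch (assumptions noted above), SQLAlchemy 1.4+ Core style.
import sqlalchemy as sa

MAX_EVENTS_PER_STACK = 1000   # purge is triggered once a stack exceeds this
PURGE_BATCH = 200             # ...and then roughly the oldest 200 events are removed


def purge_oldest_events(conn, event, resource, rsrc_prop_data, stack_id):
    """Delete the oldest events for one stack, then any property rows they orphaned."""
    oldest = conn.execute(
        sa.select(event.c.id, event.c.rsrc_prop_data_id)
        .where(event.c.stack_id == stack_id)
        .order_by(event.c.created_at)
        .limit(PURGE_BATCH)
    ).fetchall()
    event_ids = [row.id for row in oldest]
    prop_ids = {row.rsrc_prop_data_id for row in oldest
                if row.rsrc_prop_data_id is not None}

    conn.execute(sa.delete(event).where(event.c.id.in_(event_ids)))

    # Drop candidates that a remaining event or resource still references.
    for table in (event, resource):
        rows = conn.execute(
            sa.select(table.c.rsrc_prop_data_id)
            .where(table.c.rsrc_prop_data_id.in_(prop_ids))
        ).fetchall()
        prop_ids -= {row.rsrc_prop_data_id for row in rows}

    # In Heat this final delete is issued through the ORM with
    # synchronize_session=False, which only skips updating the in-memory
    # session; it does not change the SQL.  The race window is between the
    # reference check above and this DELETE: if a backup stack's event or
    # resource still points at one of these rows, MySQL rejects the DELETE
    # with the foreign key errors quoted in this bug.
    if prop_ids:
        conn.execute(
            sa.delete(rsrc_prop_data).where(rsrc_prop_data.c.id.in_(prop_ids)))
```

Note that the sketch, like the behaviour described in the comments above, checks and deletes in separate statements rather than inside one locking transaction, which is where a concurrent reference could in principle slip in.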
Trying to run "openstack stack delete -y overcloud" right now, I'm getting:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 831, in _action_recorder
    yield
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1918, in delete
    *action_args)
  File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 351, in wrapper
    step = next(subtask)
  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 890, in action_handler_task
    done = check(handler_data)
  File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 553, in check_delete_complete
    return self._check_status_complete(self.DELETE)
  File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 420, in _check_status_complete
    action=action)
ResourceFailure: DBReferenceError: resources.ControllerDeployedServerServiceChain.resources.ServiceChain.resources[103]: (pymysql.err.IntegrityError) (1451, u'Cannot delete or update a parent row: a foreign key constraint fails (`heat`.`event`, CONSTRAINT `ev_rsrc_prop_data_ref` FOREIGN KEY (`rsrc_prop_data_id`) REFERENCES `resource_properties_data` (`id`))')
[SQL: u'DELETE FROM resource_properties_data WHERE resource_properties_data.id IN (%(id_1)s, %(id_2)s, %(id_3)s, %(id_4)s, %(id_5)s, %(id_6)s, %(id_7)s, %(id_8)s, %(id_9)s, %(id_10)s, %(id_11)s, %(id_12)s, %(id_13)s, %(id_14)s, %(id_15)s, %(id_16)s, %(id_17)s, %(id_18)s, %(id_19)s, %(id_20)s, %(id_21)s, %(id_22)s, %(id_23)s, %(id_24)s, %(id_25)s)']
[parameters: {u'id_10': 1084, u'id_11': 3016, u'id_12': 3018, u'id_13': 3019, u'id_14': 849, u'id_15': 2644, u'id_16': 3088, u'id_17': 3067, u'id_18': 3044, u'id_19': 3045, u'id_21': 3051, u'id_20': 3047, u'id_23': 3064, u'id_22': 3055, u'id_25': 3070, u'id_24': 2299, u'id_2': 3080, u'id_3': 3084, u'id_1': 3079, u'id_6': 1031, u'id_7': 2991, u'id_4': 1040, u'id_5': 2329, u'id_8': 2993, u'id_9': 2998}]

Running that `heat-manage migrate_properties_data` command doesn't seem to be helping.

(In reply to Lars Kellogg-Stedman from comment #14)
> This same problem can result in undeletable stacks.

That looks like the same error. Basically, once the event table overflows 1000 events for the stack, on 1% of new events we'll try to delete the oldest 200, which is failing for unknown reasons. Both updates and deletes create events, so this can happen equally during an update or a delete.

I found the root cause: https://storyboard.openstack.org/#!/story/2002643#comment-90440

This affects all non-convergence stacks in OSP11 and up (although in practice that only means TripleO). We're in the process of backporting a patch upstream to ignore (but log) the error.

The root cause is that entries in resource_properties_data can be shared between events/resources in the current stack and those in the backup stack (for non-convergence stacks, including TripleO prior to OSP13). This will be fully fixed on master. If anybody encounters this in the meantime, a good workaround to try would be to purge deleted stacks using `heat-manage purge_deleted`, since the unwanted references may be in now-deleted backup stacks. There's no guarantee, however, since the references may in fact be from a current backup stack.
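As a rough illustration of the "ignore (but log) the error" behaviour the backported patch is described as introducing, the snippet below is a hedged sketch rather than the actual upstream change: the wrapper name `purge_events_best_effort` and the `purge` callable are hypothetical, while `DBReferenceError` is the oslo.db exception that appears in the traceback above.

```python
# Hedged sketch of "ignore (but log) the error" during event purge; not the real patch.
import logging

from oslo_db import exception as db_exc

LOG = logging.getLogger(__name__)


def purge_events_best_effort(purge, stack_id):
    """Run purge(stack_id), but treat foreign-key failures as non-fatal."""
    try:
        purge(stack_id)
    except db_exc.DBReferenceError:
        # The rows are still referenced, typically by a backup stack left over
        # from a legacy (non-convergence) update.  Skip the purge for now; the
        # rows can be removed once the backup stack is gone, e.g. after
        # `heat-manage purge_deleted` removes soft-deleted stacks.
        LOG.warning('Could not purge events for stack %s: '
                    'resource_properties_data rows are still referenced',
                    stack_id)
```

With a wrapper along these lines, the new event can still be recorded and the stack operation proceeds; the still-referenced rows are simply left for a later purge once no backup stack points at them.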
2018-11-18 16:09:19Z [AllNodesDeploySteps]: UPDATE_COMPLETE state changed
2018-11-18 16:09:35Z [overcloud]: UPDATE_COMPLETE Stack UPDATE completed successfully

Stack overcloud UPDATE_COMPLETE

Overcloud Endpoint: https://10.0.0.101:13000/v2.0
Overcloud Deployed

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 61df175f-d107-43d9-b5d7-6a0c6d6b1a5a | compute-2    | ACTIVE | ctlplane=192.168.24.18 | overcloud-full | compute    |
| 2b283998-fd3a-47bb-bd6d-7a0ffb5d17b9 | controller-0 | ACTIVE | ctlplane=192.168.24.7  | overcloud-full | controller |
| b011c464-249d-4df9-90f2-2bd48cae17e0 | controller-2 | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | controller |
| 43db5a1a-dbb5-45d4-b621-48bda166bff0 | compute-0    | ACTIVE | ctlplane=192.168.24.17 | overcloud-full | compute    |
| a5de29bb-9751-425f-b92f-8dc2f09300f5 | controller-1 | ACTIVE | ctlplane=192.168.24.15 | overcloud-full | controller |
| 35f5073d-4072-40da-9afd-b852bdf8e64f | compute-1    | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
2018-11-14.1

(undercloud) [stack@undercloud-0 ~]$ openstack-heat-common-9.0.5-1.el7ost.noarch

========
Summary: successfully scaled from 2 compute to 3 compute nodes

Copied previously-approved doc_text from bug 1596866.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3787