1846281 – deadlock errors updating and reindexing miq_workers table on 5.11.5

Bug 1846281 - deadlock errors updating and reindexing miq_workers table on 5.11.5

Summary: deadlock errors updating and reindexing miq_workers table on 5.11.5

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Appliance
Sub Component:
Version:	5.11.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.11.7
Assignee:	Nick Carboni
QA Contact:	Tasos Papaioannou
Docs Contact:	Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2020-06-11 09:37 UTC by Felix Dewaleyne
Modified:	2023-10-06 20:34 UTC (History)
CC List:	10 users (show)
Fixed In Version:	5.11.7.0
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-08-06 14:32:55 UTC
Category:	---
Cloudforms Team:	CFME Core
Target Upstream Version:
Embargoed:
Flags:	simaishi: cfme-5.11.z+

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	ManageIQ manageiq issues 20281	0	None	closed	Worker reindex can conflict with heartbeat causing a database deadlock	2020-08-03 17:07:28 UTC
Red Hat Product Errata	RHSA-2020:3358	0	None	None	None	2020-08-06 14:33:17 UTC

Description Felix Dewaleyne 2020-06-11 09:37:41 UTC

Description of problem:
deadlock errors updating and reindexing miq_workers table on 5.11.5

Version-Release number of selected component (if applicable):
5.11.5

How reproducible:
customer environemnt

Steps to Reproduce:
1.normal usage with several appliances set to manage the database (but worker only active on one appliance as expected)
2.
3.

Actual results:
deadlock messages related to hearbeat and the reindexing of the miq_workers table

Expected results:
no deadlock situation happens with this table

Additional info:

Comment 10 Nick Carboni 2020-06-15 19:41:13 UTC

Also, the reindex and vacuum schedule and tables are configurable in advanced settings under the keys :database => :maintenance.

It looks like most (possibly all) of the deadlocks are being caused by reindexes of the miq_workers table.
You can either make the schedule run less frequently or remove that table for now to see if other issues are resolved by fixing the deadlock issue.

Based on the database logs, it looks like other queries that are getting locked are ones to update the miq_workers table. Often it's something rather insignificant, like setting the worker's cpu_time, but it could, theoretically cause an issue if the heartbeat time is not updated quickly enough. It also seems like the other update is frequently the process that is allowed to proceed, so I would expect the impact to be rather minimal (not to say it shouldn't be addressed, though).

Comment 12 Felix Dewaleyne 2020-06-16 11:31:01 UTC

I wonder if it makes sense to remove this table from the list of the reindexing tables .the largerger the environments the more likely at least one deadlock error will appear.

Comment 13 Nick Carboni 2020-06-16 15:04:03 UTC

I don't think that removing the table from the reindex list is a real solution, maybe we should be reindexing less frequently as the index only really changes if workers are added or deleted.
I only suggested removing the table from the list as a temporary solution so that you could determine if the other issues you were encountering were caused by the deadlock or not.

That said I don't think access to just these two relations should be causing a deadlock. It's possible that we need to use transactions differently to ensure that we either get both locks or neither.
That way a process might be blocked for some time, but not actually deadlocked. I'll take this bug and look into that as a better long-term solution.

Comment 14 Nick Carboni 2020-06-16 17:55:10 UTC

So based on https://www.postgresql.org/docs/10/sql-reindex.html ...

> To build the index without interfering with production you should drop the index and reissue the CREATE INDEX CONCURRENTLY command.

But it also states that the index should only need to be rebuilt in pretty rare situations. I think there are a few options:
1. Stop reindexing entirely
  - This is something that is an easy change and can also be done by users by just editing settings
2. Change the reindex method to do something more clever like dropping and recreating the index
  - This is described in https://www.postgresql.org/docs/10/routine-reindex.html but it doesn't sound trivial
3. Instead of running this in a schedule, maybe we reindex at server start as workers won't be updating their rows at this time
  - I'm not sure how much this will help if the server is up for a long time

For now I would say stop running the reindex on that table and address the other issues you're having.

Comment 15 Nick Carboni 2020-06-16 19:58:55 UTC

Opened an issue upstream to sort out what the best way to resolve this would be.

Also I think option 3 from comment 14 is not viable as that would cause a reindex from every server on the region and we don't have a good way to coordinate servers starting.

Comment 16 CFME Bot 2020-06-23 19:05:22 UTC

https://github.com/ManageIQ/manageiq/pull/20305

Comment 17 CFME Bot 2020-06-24 17:05:20 UTC

New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/1a005f497f4a7ef8103fa2ecf6ff43426c793974
commit 1a005f497f4a7ef8103fa2ecf6ff43426c793974
Author:     Nick Carboni <ncarboni>
AuthorDate: Tue Jun 23 18:54:47 2020 +0000
Commit:     Nick Carboni <ncarboni>
CommitDate: Tue Jun 23 18:54:47 2020 +0000

    Remove the miq_workers table from the reindex list

    This was causing nearly constant deadlocks in an environment with
    lots of updates on the workers table. Based on
    https://www.postgresql.org/docs/10/sql-reindex.html it sounds like
    reindexes will definitely interfere with updates and that the situations
    where we actually need a reindex are rather rare.

    https://bugzilla.redhat.com/show_bug.cgi?id=1846281
    Fixes #20281

 config/settings.yml | 1 -
 1 file changed, 1 deletion(-)

Comment 18 CFME Bot 2020-07-01 20:00:25 UTC

New commit detected on ManageIQ/manageiq/ivanchuk:

https://github.com/ManageIQ/manageiq/commit/615a8a05e2e8d7744ff7eb0524087d7d58e626c4
commit 615a8a05e2e8d7744ff7eb0524087d7d58e626c4
Author:     Joe Rafaniello <jrafanie.github.com>
AuthorDate: Wed Jun 24 17:02:56 2020 +0000
Commit:     Satoe Imaishi <simaishi>
CommitDate: Wed Jul  1 19:55:11 2020 +0000

    Merge pull request #20305 from carbonin/remove_workers_from_reindex

    Remove the miq_workers table from the reindex list

    (cherry picked from commit 153932c931f11d110da88a57a13685d875bc1d83)

    https://bugzilla.redhat.com/show_bug.cgi?id=1846281

 config/settings.yml | 1 -
 1 file changed, 1 deletion(-)

Comment 19 Tasos Papaioannou 2020-07-16 14:29:31 UTC

Verified on 5.11.7.0.

Comment 22 errata-xmlrpc 2020-08-06 14:32:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: CloudForms 5.0.7 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3358

Note You need to log in before you can comment on or make changes to this bug.