Description of problem: deadlock errors updating and reindexing miq_workers table on 5.11.5 Version-Release number of selected component (if applicable): 5.11.5 How reproducible: customer environemnt Steps to Reproduce: 1.normal usage with several appliances set to manage the database (but worker only active on one appliance as expected) 2. 3. Actual results: deadlock messages related to hearbeat and the reindexing of the miq_workers table Expected results: no deadlock situation happens with this table Additional info:
Also, the reindex and vacuum schedule and tables are configurable in advanced settings under the keys :database => :maintenance. It looks like most (possibly all) of the deadlocks are being caused by reindexes of the miq_workers table. You can either make the schedule run less frequently or remove that table for now to see if other issues are resolved by fixing the deadlock issue. Based on the database logs, it looks like other queries that are getting locked are ones to update the miq_workers table. Often it's something rather insignificant, like setting the worker's cpu_time, but it could, theoretically cause an issue if the heartbeat time is not updated quickly enough. It also seems like the other update is frequently the process that is allowed to proceed, so I would expect the impact to be rather minimal (not to say it shouldn't be addressed, though).
I wonder if it makes sense to remove this table from the list of the reindexing tables .the largerger the environments the more likely at least one deadlock error will appear.
I don't think that removing the table from the reindex list is a real solution, maybe we should be reindexing less frequently as the index only really changes if workers are added or deleted. I only suggested removing the table from the list as a temporary solution so that you could determine if the other issues you were encountering were caused by the deadlock or not. That said I don't think access to just these two relations should be causing a deadlock. It's possible that we need to use transactions differently to ensure that we either get both locks or neither. That way a process might be blocked for some time, but not actually deadlocked. I'll take this bug and look into that as a better long-term solution.
So based on https://www.postgresql.org/docs/10/sql-reindex.html ... > To build the index without interfering with production you should drop the index and reissue the CREATE INDEX CONCURRENTLY command. But it also states that the index should only need to be rebuilt in pretty rare situations. I think there are a few options: 1. Stop reindexing entirely - This is something that is an easy change and can also be done by users by just editing settings 2. Change the reindex method to do something more clever like dropping and recreating the index - This is described in https://www.postgresql.org/docs/10/routine-reindex.html but it doesn't sound trivial 3. Instead of running this in a schedule, maybe we reindex at server start as workers won't be updating their rows at this time - I'm not sure how much this will help if the server is up for a long time For now I would say stop running the reindex on that table and address the other issues you're having.
Opened an issue upstream to sort out what the best way to resolve this would be. Also I think option 3 from comment 14 is not viable as that would cause a reindex from every server on the region and we don't have a good way to coordinate servers starting.
https://github.com/ManageIQ/manageiq/pull/20305
New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/1a005f497f4a7ef8103fa2ecf6ff43426c793974 commit 1a005f497f4a7ef8103fa2ecf6ff43426c793974 Author: Nick Carboni <ncarboni> AuthorDate: Tue Jun 23 18:54:47 2020 +0000 Commit: Nick Carboni <ncarboni> CommitDate: Tue Jun 23 18:54:47 2020 +0000 Remove the miq_workers table from the reindex list This was causing nearly constant deadlocks in an environment with lots of updates on the workers table. Based on https://www.postgresql.org/docs/10/sql-reindex.html it sounds like reindexes will definitely interfere with updates and that the situations where we actually need a reindex are rather rare. https://bugzilla.redhat.com/show_bug.cgi?id=1846281 Fixes #20281 config/settings.yml | 1 - 1 file changed, 1 deletion(-)
New commit detected on ManageIQ/manageiq/ivanchuk: https://github.com/ManageIQ/manageiq/commit/615a8a05e2e8d7744ff7eb0524087d7d58e626c4 commit 615a8a05e2e8d7744ff7eb0524087d7d58e626c4 Author: Joe Rafaniello <jrafanie.github.com> AuthorDate: Wed Jun 24 17:02:56 2020 +0000 Commit: Satoe Imaishi <simaishi> CommitDate: Wed Jul 1 19:55:11 2020 +0000 Merge pull request #20305 from carbonin/remove_workers_from_reindex Remove the miq_workers table from the reindex list (cherry picked from commit 153932c931f11d110da88a57a13685d875bc1d83) https://bugzilla.redhat.com/show_bug.cgi?id=1846281 config/settings.yml | 1 - 1 file changed, 1 deletion(-)
Verified on 5.11.7.0.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Critical: CloudForms 5.0.7 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3358