Description of problem:
This issue has occurred on the last two Saturday mornings (07-14 02:00 IDT and 07-07 02:00 IDT). The whole region locks up because the database crashes during a vacuum (believed to be caused by the vim_performance_states table during the weekly maintenance schedule).

From the most recent occurrence, there are 3 instances of a task to vacuum the vim_performance_states table that time out after 10 minutes:

[----] I, [2018-07-14T02:00:08.384951 #2020:96710c]  INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [3000312697088], MiqWorker id: [3000007633750], Zone: [], Role: [database_operations], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [VimPerformanceState.vacuum], Timeout: [600], Priority: [90], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [5.136815917] seconds
[----] E, [2018-07-14T02:10:08.877728 #2020:96710c] ERROR -- : MIQ(MiqQueue#deliver) Message id: [3000312697088], timed out after 600.452985311 seconds. Timeout threshold [600]

The bloat line from evm.log shows the table has 1.9 GiB of wasted bytes and is 43 million rows in size:

table_name             | index_name | otta    | pages   | pagesize | percent_bloat | rows     | wasted_bytes | wasted_pages | wasted_size
-----------------------+------------+---------+---------+----------+---------------+----------+--------------+--------------+------------
vim_performance_states |            | 3398824 | 3648675 | 0        | 1.1           | 43129164 | 2046779392   | 249851       | 2046779392

From the postgresql logs, these messages start at 07-13 16:30 EDT (07-14 02:00 IDT):

2018-07-13 16:30:14 EDT:152.144.198.41(56322):5b4308e5.9cf:root@vmdb_production:[2511]:LOG:  process 2511 still waiting for AccessExclusiveLock on relation 21559 of database 21433 after 5045.047 ms

Leading up to this last one before the system locks up:

2018-07-13 16:53:07 EDT:152.144.198.43(59020):5b491129.1afe:root@vmdb_production:[6910]:LOG:  process 6910 still waiting for AccessShareLock on relation 23958 of database 21433 after 5022.036 ms

Also in postgresql.log, there are 130 occurrences of a hint to increase max_wal_size:

2018-07-13 16:30:17 EDT::5b430766.522:@:[1314]:LOG:  checkpoints are occurring too frequently (9 seconds apart)
2018-07-13 16:30:17 EDT::5b430766.522:@:[1314]:HINT:  Consider increasing the configuration parameter "max_wal_size".

From top_output.log we see the wal sender process consume high memory and CPU until the system is restarted:

top - 02:31:57 up 4 days, 14:04,  0 users,  load average: 1.05, 1.58, 1.93
 PID  PPID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
7016  1281 postgres 20  0 38.8g 28.8g 1800 D 98.3 92.2 7:57.48 postgres: wal sender process root 152.144.25.111(58598) idle

Version-Release number of selected component (if applicable):
CFME 5.9.2.4

How reproducible:
Custom environment

Steps to Reproduce:
1.
2.
3.
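The checkpoint HINT above points at max_wal_size. A minimal sketch of checking and raising it on the database appliance follows; the connection details are taken from the vmdb_production entries in the logs, and the 4GB value is an illustrative assumption, not a tested recommendation for this environment:

```shell
# Check the current setting on the database appliance:
psql -U root -d vmdb_production -c "SHOW max_wal_size;"

# postgresql.conf fragment -- a larger max_wal_size lets checkpoints be
# driven by checkpoint_timeout instead of firing every ~9 seconds while
# the vacuum churns out WAL (4GB is an example value only):
#   max_wal_size = 4GB

# Reload to apply; max_wal_size does not require a restart:
psql -U root -d vmdb_production -c "SELECT pg_reload_conf();"
```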
Actual results:

Expected results:

Additional info:
Logs for the relevant postgresql.log and top_output.log from the database appliance here: http://file.rdu.redhat.com/myoder/02141304/
Similar bug opened for CFME 5.7.0: https://bugzilla.redhat.com/show_bug.cgi?id=1425905
RFE to break up the vim_performance_states table into monthly tables: https://bugzilla.redhat.com/show_bug.cgi?id=1427610
Possibly affecting this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1434918
https://github.com/ManageIQ/manageiq/pull/17713
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/8da858fdef849a079a97233f45667c314406bcb4

commit 8da858fdef849a079a97233f45667c314406bcb4
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Mon Jul 16 14:04:34 2018 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Mon Jul 16 14:04:34 2018 -0400

    Maintenance must run VACUUM to avoid long held locks

    https://bugzilla.redhat.com/show_bug.cgi?id=1601264

    It's hard to say which tables may or may not grow so large that we
    cannot easily run VACUUM FULL on them as part of maintenance.
    Therefore we should do the safest thing possible and run VACUUM for
    routine maintenance, leaving VACUUM FULL for when table bloat becomes
    a problem.

    Some tables, such as vim_performance_states, can grow so large that a
    VACUUM FULL could lock up the table for many minutes and could fail
    to complete during the normal 10 minute queue timeout.

 lib/extensions/ar_base.rb           |  2 +-
 spec/lib/extensions/ar_base_spec.rb | 14 +
 2 files changed, 15 insertions(+), 1 deletion(-)
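For reference, the lock-level difference the commit relies on can be seen from psql. This is a hedged sketch (table name and credentials taken from the logs above), not a recommended procedure for the affected appliance:

```shell
# Plain VACUUM takes only a ShareUpdateExclusiveLock, so normal reads and
# writes against the table continue while dead tuples are reclaimed:
psql -U root -d vmdb_production -c "VACUUM (VERBOSE) vim_performance_states;"

# VACUUM FULL rewrites the whole table under an AccessExclusiveLock -- the
# same lock the blocked processes in postgresql.log were queued behind --
# so it should be reserved for a maintenance window when bloat is severe:
psql -U root -d vmdb_production -c "VACUUM FULL vim_performance_states;"
```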
Hi Yoder, the postgresql.log doesn't contain any reference to vim_performance_states. Can they provide the rotated logs? Yeah, exporting those tables might help us understand where the problem lies.
Verified on 5.10.0.16.