Description of problem:
This issue has occurred on the last two Saturday mornings (07-14 02:00 IDT and 07-07 02:00 IDT). The whole region locks up because the database crashes during a vacuum (believed to be caused by the vim_performance_states table during the weekly maintenance schedule).

From the most recent occurrence, there are 3 instances of a task to vacuum the vim_performance_states table that time out after 10 minutes:

[----] I, [2018-07-14T02:00:08.384951 #2020:96710c]  INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [3000312697088], MiqWorker id: [3000007633750], Zone: [], Role: [database_operations], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [VimPerformanceState.vacuum], Timeout: [600], Priority: [90], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [5.136815917] seconds
[----] E, [2018-07-14T02:10:08.877728 #2020:96710c] ERROR -- : MIQ(MiqQueue#deliver) Message id: [3000312697088], timed out after 600.452985311 seconds. Timeout threshold [600]

The bloat line from evm.log shows the table has 1.9 GiB of wasted bytes and is 43 million rows in size:

table_name             | index_name | otta    | pages   | pagesize | percent_bloat | rows     | wasted_bytes | wasted_pages | wasted_size
-----------------------+------------+---------+---------+----------+---------------+----------+--------------+--------------+------------
vim_performance_states |            | 3398824 | 3648675 | 0        | 1.1           | 43129164 | 2046779392   | 249851       | 2046779392

From the postgresql logs, these messages start at 07-13 16:30 EDT (07-14 02:00 IDT):

2018-07-13 16:30:14 EDT:152.144.198.41(56322):5b4308e5.9cf:root@vmdb_production:[2511]:LOG:  process 2511 still waiting for AccessExclusiveLock on relation 21559 of database 21433 after 5045.047 ms

Leading up to this last one before the system locks up:

2018-07-13 16:53:07 EDT:152.144.198.43(59020):5b491129.1afe:root@vmdb_production:[6910]:LOG:  process 6910 still waiting for AccessShareLock on relation 23958 of database 21433 after 5022.036 ms

Also in postgresql.log, there are 130 occurrences of a hint to increase max_wal_size:

2018-07-13 16:30:17 EDT::5b430766.522:@:[1314]:LOG:  checkpoints are occurring too frequently (9 seconds apart)
2018-07-13 16:30:17 EDT::5b430766.522:@:[1314]:HINT:  Consider increasing the configuration parameter "max_wal_size".

From top_output.log we see the wal sender process consume high memory and CPU until the system is restarted:

top - 02:31:57 up 4 days, 14:04,  0 users,  load average: 1.05, 1.58, 1.93
 PID  PPID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
7016  1281 postgres 20  0 38.8g 28.8g 1800 D 98.3 92.2 7:57.48 postgres: wal sender process root 152.144.25.111(58598) idle

Version-Release number of selected component (if applicable):
CFME 5.9.2.4

How reproducible:
Custom environment

Steps to Reproduce:
1.
2.
3.
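The checkpoint HINT above points at max_wal_size. A minimal sketch of checking and raising it on the database appliance follows; the connection details are taken from the vmdb_production entries in the logs, and the 4GB value is an illustrative assumption, not a tested recommendation for this environment:

```shell
# Check the current setting on the database appliance:
psql -U root -d vmdb_production -c "SHOW max_wal_size;"

# postgresql.conf fragment -- a larger max_wal_size lets checkpoints be
# driven by checkpoint_timeout instead of firing every ~9 seconds while
# the vacuum churns out WAL (4GB is an example value only):
#   max_wal_size = 4GB

# Reload to apply; max_wal_size does not require a restart:
psql -U root -d vmdb_production -c "SELECT pg_reload_conf();"
```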
Actual results:

Expected results:

Additional info:
Logs for the relevant postgresql.log and top_output.log from the database appliance here: http://file.rdu.redhat.com/myoder/02141304/
Similar bug opened for CFME 5.7.0: https://bugzilla.redhat.com/show_bug.cgi?id=1425905
RFE to break up the vim_performance_states table into monthly tables: https://bugzilla.redhat.com/show_bug.cgi?id=1427610
Possibly affecting this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1434918
https://github.com/ManageIQ/manageiq/pull/17713
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/8da858fdef849a079a97233f45667c314406bcb4

commit 8da858fdef849a079a97233f45667c314406bcb4
Author:     Joe Rafaniello <jrafanie>
AuthorDate: Mon Jul 16 14:04:34 2018 -0400
Commit:     Joe Rafaniello <jrafanie>
CommitDate: Mon Jul 16 14:04:34 2018 -0400

    Maintenance must run VACUUM to avoid long held locks

    https://bugzilla.redhat.com/show_bug.cgi?id=1601264

    It's hard to say which tables may or may not grow so large that we
    cannot easily run VACUUM FULL on them as part of maintenance.
    Therefore we should do the safest thing possible and run VACUUM for
    routine maintenance, leaving VACUUM FULL for when table bloat becomes
    a problem.

    Some tables, such as vim_performance_states, can grow so large that a
    VACUUM FULL could lock up the table for many minutes and could fail
    to complete during the normal 10 minute queue timeout.

 lib/extensions/ar_base.rb           |  2 +-
 spec/lib/extensions/ar_base_spec.rb | 14 +
 2 files changed, 15 insertions(+), 1 deletion(-)
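For reference, the lock-level difference the commit relies on can be seen from psql. This is a hedged sketch (table name and credentials taken from the logs above), not a recommended procedure for the affected appliance:

```shell
# Plain VACUUM takes only a ShareUpdateExclusiveLock, so normal reads and
# writes against the table continue while dead tuples are reclaimed:
psql -U root -d vmdb_production -c "VACUUM (VERBOSE) vim_performance_states;"

# VACUUM FULL rewrites the whole table under an AccessExclusiveLock -- the
# same lock the blocked processes in postgresql.log were queued behind --
# so it should be reserved for a maintenance window when bloat is severe:
psql -U root -d vmdb_production -c "VACUUM FULL vim_performance_states;"
```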
Hi Yoder, the postgresql.log doesn't contain any reference to vim_performance_states. Can they provide the rotated logs? Yeah, exporting those tables might help us understand where the problem lies.
Verified on 5.10.0.16.