Bug 1226429 - [Scale] - Realtime Metrics Purge Spiral - Metric::Purging.purge times out in a sufficiently large environment with C&U on
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Performance
Version: 5.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: GA
Target Release: 5.6.0
Assignee: Keenan Brock
QA Contact: Alex Krzos
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-05-29 18:16 UTC by Alex Krzos
Modified: 2019-10-10 09:50 UTC
CC: 12 users

Fixed In Version: 5.6.0.0
Doc Type: Bug Fix
Doc Text:
Previously, the queued work to purge old realtime metrics could take longer than the default message timeout. When that happened, the purge was rolled back and no old data was purged, so the database appliance spent a large amount of CPU on work that was discarded when the message timed out. Once an appliance hit this condition, it could not recover from the spiral without manual intervention. This update fixes metric purging by using a scope to simplify the purge conditions, and the issue is now resolved.
Clone Of:
Environment:
Last Closed: 2016-06-29 14:56:21 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments
Graph of CPU usage on the database appliance, annotated to show when purge message timeouts occurred (88.33 KB, image/png)
2015-05-29 18:21 UTC, Alex Krzos
Grepped evm.log showing the Perf Processor worker attempting to purge records but timing out (14.37 KB, text/plain)
2015-05-29 18:21 UTC, Alex Krzos


Links
Red Hat Product Errata RHBA-2016:1348 (SHIPPED_LIVE): CFME 5.6.0 bug fixes and enhancement update, last updated 2016-06-29 18:50:04 UTC

Description Alex Krzos 2015-05-29 18:16:34 UTC
Description of problem:
The queued work to purge old realtime metrics can consume more time than the default timeout for that work. When this occurs, the purge is rolled back and no old data is purged. The purge consumes a large amount of CPU on the database appliance while it runs, and that work is wasted once the message times out and the transaction rolls back. Once the appliance hits this condition, it cannot recover from the spiral without manual intervention.

Environment tested:
9 appliances total
1 database appliance
4 worker appliances in zone VC0
4 worker appliances in zone VC1

Attached provider environments:
2 x VMware simulators with 10,000 VMs each (5,000 VMs online per simulator)
Total of 20,000 VMs, 10,000 VMs online for C&U collection
200 hosts per simulator (400 hosts total)

Although the current environment was not completing C&U collections without queueing some collections beyond 10 minutes, collecting on 10,000 VMs generated enough realtime data that a Metric::Purging.purge message would time out:

[----] E, [2015-05-28T18:13:38.037839 #2215:c12884] ERROR -- : MIQ(MiqQueue.deliver)    Message id: [5081222], timed out after 607.065224043 seconds.  Timeout threshold [600]


Version-Release number of selected component (if applicable):
5.3.4.2 was tested, but conceivably any version with a 600-second message timeout for the realtime metrics purge is affected.

How reproducible:
The environment must be large enough to generate realtime data faster than the purge can remove it.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

To get the environment to stop wasting CPU cycles on purge messages, there were several attempts to tune how purge messages complete.  First, the purge window size was increased to 10000 on the appliance to which purging was isolated:

performance:
  history:
    :purge_window_size: 10000

This still failed.  The timeout was then bumped to 3600 seconds per purge message, without success. Finally, the timeout was raised to 10800 seconds, and the purge operation succeeded. Since all 9 appliances can put the purge message on the queue, each appliance's /var/www/miq/vmdb/app/models/metric/purging.rb file must be modified with the new timeout.
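
For reference, the timeout being raised here is the msg_timeout that the purge message is queued with. A minimal sketch of what that queuing might look like from the appliance's Rails console or purging.rb (the method name, arguments, and exact location are assumptions, not the shipped 5.3.x code):

# Illustrative sketch only; the real method name and arguments in
# app/models/metric/purging.rb may differ.
module Metric::Purging
  def self.queue_purge_realtime(older_than)
    MiqQueue.put(
      :class_name  => name,                     # "Metric::Purging"
      :method_name => "purge",
      :args        => [older_than, "realtime"],
      :msg_timeout => 10_800                    # raised from the 600 second default
    )
  end
end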

The purge operation was then able to complete:
[----] I, [2015-05-29T02:51:36.755800 #58415:105c084]  INFO -- : MIQ(Metric::Purging.purge) Purging all realtime metrics older than [2015-05-28 21:51:14 UTC]...Complete - Deleted 1787279 records and 0 associated tag values - Timings: {:query_oldest=>0.6977813243865967, :query_batch=>3531.920001268387, :purge_metrics=>85.55350804328918, :total_time=>3621.1672916412354}


RFE BZ 1207826 also addresses this, requesting that individual deletes be split into separate queued messages rather than run as a single transaction that can time out.
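
For illustration, the batching pattern that the RFE (and the eventual fix) moves toward looks roughly like the following hypothetical Ruby, not the actual ManageIQ code: pluck a bounded set of ids, delete them in a short transaction, and repeat, so no single transaction can outlive the message timeout.

# Hypothetical batching pattern, not the actual ManageIQ implementation.
def purge_realtime_in_batches(older_than, window = 1000)
  scope = Metric.where("timestamp <= ?", older_than)
  loop do
    ids = scope.limit(window).pluck(:id)   # bring back only this batch's ids
    break if ids.empty?
    Metric.where(:id => ids).delete_all    # short transaction per batch
  end
end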

Comment 1 Alex Krzos 2015-05-29 18:21:12 UTC
Created attachment 1032254 [details]
Graph of CPU usage on the database appliance, annotated to show when purge message timeouts occurred

Comment 2 Alex Krzos 2015-05-29 18:21:56 UTC
Created attachment 1032255 [details]
Grepped evm.log showing Perf Processor worker attempting to purge records but timing out

Comment 4 Alex Krzos 2015-06-15 18:31:14 UTC
To clarify, the database appliance and purge messages stabilized after the following:

1. Turned off C&U on all zones.
2. Raised message timeout from 600s to 10800s.
3. Installed database maintenance scripts for hourly reindexing of the miq_queue, miq_workers, miq_servers, and metrics tables, and a weekly full vacuum of the vms, hosts, storages, miq_schedules, event_logs, policy_events, snapshots, jobs, networks, miq_queue, and vim_performance_states tables.


Once database CPU usage had stabilized and purge messages were completing, C&U was turned back on; purge messages now complete in a far more reasonable time (anywhere from 240s down to 30s).

Comment 5 CFME Bot 2015-12-02 19:46:09 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/e5b0cc710c7859a6afadf22c989348f3c85af59c

commit e5b0cc710c7859a6afadf22c989348f3c85af59c
Author:     Keenan Brock <kbrock>
AuthorDate: Tue Nov 17 17:29:31 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:00:42 2015 -0500

    extract common klass/condition for interval
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

Comment 6 CFME Bot 2015-12-02 19:46:15 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/908ceeb466f5ae71054e149a58532fba3e0e1d57

commit 908ceeb466f5ae71054e149a58532fba3e0e1d57
Author:     Keenan Brock <kbrock>
AuthorDate: Thu Nov 19 17:03:45 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:00:47 2015 -0500

    metric purging - Use scope to simplify conditions
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

Comment 7 CFME Bot 2015-12-02 19:46:21 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/65758412d3aaefeb3a0e3d5de88de2ba5988a747

commit 65758412d3aaefeb3a0e3d5de88de2ba5988a747
Author:     Keenan Brock <kbrock>
AuthorDate: Tue Nov 17 19:56:26 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:00:51 2015 -0500

    use minimum to fetch timestamp
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

Comment 8 CFME Bot 2015-12-02 19:46:28 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/0620bf0ae5353275591f829e4e9c9d114135e469

commit 0620bf0ae5353275591f829e4e9c9d114135e469
Author:     Keenan Brock <kbrock>
AuthorDate: Tue Nov 17 19:58:14 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:00:55 2015 -0500

    pluck batch ids
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comment 9 CFME Bot 2015-12-02 19:46:34 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/7e239b0a47031ed93a72c253b45f17795bcee7e0

commit 7e239b0a47031ed93a72c253b45f17795bcee7e0
Author:     Keenan Brock <kbrock>
AuthorDate: Tue Nov 17 21:11:28 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:00:59 2015 -0500

    decrease limits on ids to bring back
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

Comment 10 CFME Bot 2015-12-02 19:46:40 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/f8cbb951258d844952fdb21a8c40d7db20c2e706

commit f8cbb951258d844952fdb21a8c40d7db20c2e706
Author:     Keenan Brock <kbrock>
AuthorDate: Tue Nov 17 22:04:36 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:01:06 2015 -0500

    no need for oldest
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

Comment 11 Keenan Brock 2016-01-27 21:29:58 UTC
This should be resolved by:

https://github.com/ManageIQ/manageiq/pull/5925

Please push back if you do not see this fixed.

Comment 16 Thomas Hennessy 2016-06-17 13:56:00 UTC
Testing suggestion:
Scenario to be tested: an environment with 10k VMs capturing C&U generates about 180 realtime records per hour for each VM, so if we have four hours of data to delete for 10k VMs, let's create 180*10000*4 (= 7,200,000) realtime instances spread evenly across four sequential metrics_## tables, all with a timestamp more than four hours old.

We can save the generated table so that it can be used for reproduction (or, better, save the program that generates it), then enable C&U in the environment so that purge processing is engaged. Let it run for a few hours to make sure purging has happened, and monitor the logs (or the database) for results.

This should be fairly easy to ramp up to an arbitrarily large number of instances and pretty straightforward to set up (see the rough generator sketch below).

just an idea.
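
A rough sketch of such a generator, run from a Rails console on an appliance; the attribute values beyond :timestamp and :capture_interval_name are assumptions, and it ignores the fact that realtime data is spread across the partitioned metrics_## tables mentioned above:

# Rough data generator sketch; attribute values are assumptions and the
# partitioned metrics_## tables are not handled explicitly here.
start_ts = 8.hours.ago.utc
(1..10_000).each do |vm_id|
  (180 * 4).times do |i|                   # 180 realtime rows/hour for 4 hours per VM
    Metric.create!(
      :resource_type         => "VmOrTemplate",
      :resource_id           => vm_id,
      :capture_interval_name => "realtime",
      :timestamp             => start_ts + (i * 20).seconds   # 20 s capture intervals
    )
  end
end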

Comment 18 errata-xmlrpc 2016-06-29 14:56:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1348

