Description of problem:
The queued work to purge old realtime metrics can take longer than the default timeout for the work. When this happens, the purge is rolled back and no old data is purged, yet it still consumes a large amount of CPU on the database appliance. The work is wasted because it is rolled back when the message times out. Once an appliance enters this spiral, it cannot recover without manual intervention.

Environment tested:
9 appliances total
1 database appliance
4 worker appliances in zone VC0
4 worker appliances in zone VC1

Attached provider environments:
2 x VMware simulators with 10,000 VMs each (5,000 VMs online)
Total of 20,000 VMs, 10,000 VMs online for C&U collection
200 hosts per simulator (400 hosts total)

Although the environment was not completing C&U collections without queueing some collections beyond 10 minutes, collecting on 10,000 VMs generated enough realtime data that a Metric::Purging.purge message would time out:

[----] E, [2015-05-28T18:13:38.037839 #2215:c12884] ERROR -- : MIQ(MiqQueue.deliver) Message id: [5081222], timed out after 607.065224043 seconds. Timeout threshold [600]

Version-Release number of selected component (if applicable):
5.3.4.2 was tested, but conceivably any version with a 600-second message timeout for the realtime metrics purge is affected.

How reproducible:
The environment must be large enough to produce realtime data into the database faster than the purge can keep up.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
To stop the environment from wasting CPU cycles on purge messages, there were several attempts to tune how purge messages complete. First the purge window size was increased to 10000 on the appliance to which purging was isolated:

performance:
  history:
    :purge_window_size: 10000

This still failed. The timeout was then bumped to 3600s per purge message, without success. Finally, the timeout was bumped to 10800 seconds and the purging operation succeeded. Since all 9 appliances can put the purge message on the queue, each appliance's /var/www/miq/vmdb/app/models/metric/purging.rb file must be modified with the new timeout. The purge operation was then able to complete:

[----] I, [2015-05-29T02:51:36.755800 #58415:105c084] INFO -- : MIQ(Metric::Purging.purge) Purging all realtime metrics older than [2015-05-28 21:51:14 UTC]...Complete - Deleted 1787279 records and 0 associated tag values - Timings: {:query_oldest=>0.6977813243865967, :query_batch=>3531.920001268387, :purge_metrics=>85.55350804328918, :total_time=>3621.1672916412354}

RFE BZ 1207826 also addresses this by requesting that the deletes be split into separate queued messages rather than a single transaction that can time out.
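For illustration, a minimal sketch (not the shipped ManageIQ code) of the batched-delete idea the RFE asks for: pull back only a small window of ids at a time and delete each window in its own short statement, so a timeout or restart loses at most one window of work instead of the entire purge. The Metric scope, column names, and window size below are assumptions for illustration only.

# Sketch only: batch-delete realtime metrics older than a cutoff, assuming an
# ActiveRecord Metric model with capture_interval_name and timestamp columns.
def purge_realtime_in_batches(older_than, window = 1000)
  total = 0
  loop do
    # Pluck just one window's worth of ids (cheap compared to loading full rows).
    ids = Metric.where(:capture_interval_name => "realtime")
                .where("timestamp < ?", older_than)
                .limit(window)
                .pluck(:id)
    break if ids.empty?

    # Each delete is its own short statement instead of one multi-hour DELETE,
    # so the work already done survives a message timeout.
    total += Metric.where(:id => ids).delete_all
  end
  total
end

Each iteration could equally be queued as its own message, which is essentially what RFE BZ 1207826 requests.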
Created attachment 1032254 [details]
Graph of CPU usage on the database appliance, annotated to show when purge message timeouts occurred
Created attachment 1032255 [details]
Grepped evm.log showing the Perf Processor worker attempting to purge records but timing out
To clarify, the database appliance and purge messages stabilized after the following:

1. Turned off C&U on all zones.
2. Raised the message timeout from 600s to 10800s.
3. Installed database maintenance scripts: hourly reindexing of the miq_queue, miq_workers, miq_servers, and metrics tables, and a weekly full vacuum of vms, hosts, storages, miq_schedules, event_logs, policy_events, snapshots, jobs, networks, miq_queue, and vim_performance_states.

Once database CPU usage had stabilized and purge messages completed, C&U was turned back on, and purge messages are now completing in far more reasonable times (anywhere from 30s to 240s).
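To make step 3 concrete, the maintenance boils down to periodic REINDEX and VACUUM FULL statements against the tables listed above. A hedged sketch, issued through ActiveRecord for consistency with the rest of the codebase (the actual appliance scripts and their scheduling are not shown in this BZ, so treat the details as assumptions):

# Sketch only: hourly reindex and weekly full vacuum of the tables named above.
HOURLY_REINDEX_TABLES = %w(miq_queue miq_workers miq_servers metrics)
WEEKLY_VACUUM_TABLES  = %w(vms hosts storages miq_schedules event_logs policy_events
                           snapshots jobs networks miq_queue vim_performance_states)

def hourly_reindex(conn = ActiveRecord::Base.connection)
  HOURLY_REINDEX_TABLES.each do |t|
    # Rebuild indexes bloated by heavy insert/delete churn on the queue and metrics tables.
    conn.execute("REINDEX TABLE #{conn.quote_table_name(t)}")
  end
end

def weekly_vacuum_full(conn = ActiveRecord::Base.connection)
  WEEKLY_VACUUM_TABLES.each do |t|
    # VACUUM FULL reclaims dead-row space but takes an exclusive lock, so run it off-hours.
    conn.execute("VACUUM FULL #{conn.quote_table_name(t)}")
  end
end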
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/e5b0cc710c7859a6afadf22c989348f3c85af59c

commit e5b0cc710c7859a6afadf22c989348f3c85af59c
Author:     Keenan Brock <kbrock>
AuthorDate: Tue Nov 17 17:29:31 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:00:42 2015 -0500

    extract common klass/condition for interval

    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/908ceeb466f5ae71054e149a58532fba3e0e1d57

commit 908ceeb466f5ae71054e149a58532fba3e0e1d57
Author:     Keenan Brock <kbrock>
AuthorDate: Thu Nov 19 17:03:45 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:00:47 2015 -0500

    metric purging - Use scope to simplify conditions

    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/65758412d3aaefeb3a0e3d5de88de2ba5988a747

commit 65758412d3aaefeb3a0e3d5de88de2ba5988a747
Author:     Keenan Brock <kbrock>
AuthorDate: Tue Nov 17 19:56:26 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:00:51 2015 -0500

    use minimum to fetch timestamp

    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/0620bf0ae5353275591f829e4e9c9d114135e469

commit 0620bf0ae5353275591f829e4e9c9d114135e469
Author:     Keenan Brock <kbrock>
AuthorDate: Tue Nov 17 19:58:14 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:00:55 2015 -0500

    pluck batch ids

    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/7e239b0a47031ed93a72c253b45f17795bcee7e0

commit 7e239b0a47031ed93a72c253b45f17795bcee7e0
Author:     Keenan Brock <kbrock>
AuthorDate: Tue Nov 17 21:11:28 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:00:59 2015 -0500

    decrease limits on ids to bring back

    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/f8cbb951258d844952fdb21a8c40d7db20c2e706

commit f8cbb951258d844952fdb21a8c40d7db20c2e706
Author:     Keenan Brock <kbrock>
AuthorDate: Tue Nov 17 22:04:36 2015 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Dec 1 19:01:06 2015 -0500

    no need for oldest

    https://bugzilla.redhat.com/show_bug.cgi?id=1226429

 app/models/metric/purging.rb | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)
This should be resolved by:
https://github.com/ManageIQ/manageiq/pull/5925

Please push back if you do not see this fixed.
Testing suggestion:

Scenario to be tested: an environment with 10k VMs capturing C&U generates about 180 realtime records per hour for each VM, so if we have four hours of data to delete for 10k VMs, let's create 180*10000*4 (= 7,200,000) realtime instances spread evenly across four sequential metrics_## tables, all with timestamps more than four hours old. We can save the generated table so that it can be reused for reproduction (or, better, the program that generates it), then enable C&U in the environment so that purge processing is engaged. Let it run for a few hours to make sure purging has happened, and monitor the logs (or the database) for results.

This should be fairly easy to ramp up to an arbitrarily large number of instances and is pretty straightforward to set up. Just an idea.
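A rough sketch of the kind of seeding program suggested above, assuming an ActiveRecord Metric model with timestamp, capture_interval_name, and resource_* columns (how rows land in the individual metrics_## sub-tables depends on the partitioning setup, which this glosses over):

# Sketch only: seed ~7.2 million synthetic realtime rows, spread evenly across a
# four-hour window that is already older than the purge cutoff. Real records
# would also need valid counter values and resource references.
VMS             = 10_000
HOURS           = 4
PER_VM_PER_HOUR = 180 # one realtime sample every 20 seconds

start_ts = 8.hours.ago.utc
batch    = []
(1..VMS).each do |vm_id|
  (0...(HOURS * PER_VM_PER_HOUR)).each do |n|
    batch << {
      :resource_type         => "VmOrTemplate",
      :resource_id           => vm_id,
      :capture_interval_name => "realtime",
      :timestamp             => start_ts + (n * 20).seconds
    }
    if batch.size >= 10_000
      Metric.insert_all(batch) # bulk insert (Rails 6+)
      batch.clear
    end
  end
end
Metric.insert_all(batch) unless batch.empty?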
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1348