Description of problem:
After importing a large database, the rake db:migrate task took 6 hours and 45 minutes, and the Ruby process conducting the migration consumed 11 GiB of memory. Running this on a default appliance configuration results in running out of memory and a slower process due to swapping.

Version-Release number of selected component (if applicable):
5.4 migration to 5.5

How reproducible:
Reproducible with this large-scale database

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
The database also bloats in size after the migration.
A comparison of database bloat with db:migrate:

Before:
# SELECT pg_size_pretty(pg_database_size('vmdb_production'));
 pg_size_pretty
----------------
 25 GB

After db:migrate:
# SELECT pg_size_pretty(pg_database_size('vmdb_production'));
 pg_size_pretty
----------------
 37 GB

The bloat appears mostly to be introduced by events:

Before:            ems_events    - 8854 MB
After db:migrate:  event_streams - 17 GB
Lucy, can you try to change that migration to use find_each and add an includes for the binary_blob_parts? Something like below? The down would need similar treatment.

diff --git a/db/migrate/20150625220141_fix_serialized_reports_for_rails_four.rb b/db/migrate/20150625220141_fix_serialized_reports_for_rails_four.rb
index e48dda3..4a608ca 100644
--- a/db/migrate/20150625220141_fix_serialized_reports_for_rails_four.rb
+++ b/db/migrate/20150625220141_fix_serialized_reports_for_rails_four.rb
@@ -126,14 +126,14 @@ class FixSerializedReportsForRailsFour < ActiveRecord::Migration
   def up
     say_with_time("Converting MiqReportResult#report to a serialized hash") do
-      MiqReportResult.where('report IS NOT NULL').each do |rr|
+      MiqReportResult.where('report IS NOT NULL').find_each do |rr|
         val = rr.serialize_report_to_hash(rr.read_attribute(:report), self)
         rr.update_attribute(:report, val) if val
       end
     end

     say_with_time("Converting BinaryBlob report results to a serialized hash") do
-      BinaryBlob.includes(:resource).where(:resource_type => 'MiqReportResult').each do |bb|
+      BinaryBlob.includes([:resource, :binary_blob_parts]).where(:resource_type => 'MiqReportResult').find_each do |bb|
         if bb.resource
           val = bb.serialize_report_to_hash(bb.binary, self)
           bb.binary = val if val
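For context on why this helps: find_each loads records in fixed-size batches (1,000 by default) instead of materializing the whole result set at once, so peak memory tracks the batch size rather than the table size. A minimal stand-alone sketch of that difference (plain Ruby simulating the batching, not ManageIQ code):

```ruby
# Illustrative sketch only: simulates how .each vs. .find_each affect
# peak in-memory record counts. ActiveRecord's find_each pulls rows in
# batches (default 1000); plain .each materializes every row up front.

records = (1..66_146).to_a  # stand-in for the BinaryBlob rows

# .each-style: the whole set is held in memory at once
peak_each = records.size

# find_each-style: only one batch is held at a time
peak_find_each = 0
records.each_slice(1_000) do |batch|
  peak_find_each = [peak_find_each, batch.size].max
  # ... process batch; it becomes garbage before the next batch loads ...
end

puts peak_each       # 66146
puts peak_find_each  # 1000
```

The row count of 66,146 matches the sample BinaryBlob table discussed later in this BZ.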
I looked at the supplied DB and saw that there are almost 28k miq_report_results that are over 1 year old, but only 3k that are less than 1 year old. Purging the old results may be an option.
Joe, I ran the migration locally (with the customer DB) with no changes so I could get a baseline in my env, and then with the changes you suggested. Here are the results:

Without any changes to the migration:

(git:master) ~/work/rh/manageiq$ rake db:migrate:up VERSION=20150625220141
** Using session_store: ActionDispatch::Session::MemCacheStore
== 20150625220141 FixSerializedReportsForRailsFour: migrating =================
-- Converting MiqReportResult#report to a serialized hash
   -> 195.3643s
-- Converting BinaryBlob report results to a serialized hash
   -> 13494.3909s
== 20150625220141 FixSerializedReportsForRailsFour: migrated (13689.7554s) ====

After changing "each" to "find_each":

(git:master) ~/work/rh/manageiq$ rake db:migrate:up VERSION=20150625220141
** Using session_store: ActionDispatch::Session::MemCacheStore
== 20150625220141 FixSerializedReportsForRailsFour: migrating =================
-- Converting MiqReportResult#report to a serialized hash
   -> 190.9291s
-- Converting BinaryBlob report results to a serialized hash
   -> 11907.3533s
== 20150625220141 FixSerializedReportsForRailsFour: migrated (12098.2825s) ====

228 min vs. 201 min = 27 min improvement, or 12%.

Seems like there's still a lot of room for improvement. The processing of the binary blobs is the killer. We still may need to resort to purging old data to get to where we need to be.
Gregg...I am already working on this (didn't know Lucy's ticket got split)
The MiqReportResult conversion is decently fast: 181s on my machine with batching, and it might not even need batching at all. However, converting the binary blobs is crazy expensive. Batching by 1000 still cranked my memory to 1.5 GB; batching by 100, I was at 700 MB. Note that I also had to enable disable_ddl_transaction!

However, that's not the slow part... the YAML dump and load takes something like 99% of the time. There are only around 66,000 records in the sample table, and the big ones were taking upwards of 5 seconds each. So I'm thinking we might be able to do some parallel/threaded processing.

Some stats, btw:
-> BinaryBlob total rows: 66146
-> BinaryBlob sizes: max: 19598184, min: 9, average: 161890.02913252503
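To make the cost concrete, here is a hedged, stand-alone sketch of the kind of YAML round-trip the migration performs per blob (a generic illustration; the real serialize_report_to_hash in the migration does more than a plain round-trip):

```ruby
require 'yaml'

# Generic illustration of the YAML work that dominates the migration:
# each blob's YAML payload is parsed into a Ruby object, converted, and
# re-dumped. With ~66k rows and payloads up to ~19 MB, this is CPU-bound.

report = {
  "title" => "Sample Report",
  "cols"  => ["name", "size"],
  "rows"  => Array.new(1_000) { |i| { "name" => "vm#{i}", "size" => i } }
}

payload   = YAML.dump(report)   # the serialized form stored in the blob
roundtrip = YAML.load(payload)  # the expensive parse step

puts roundtrip == report  # => true
```

Scaling this up to multi-megabyte payloads is what produces the 5-second-per-record times above, which is why parallelizing the dump/load work looks attractive.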
https://github.com/ManageIQ/manageiq/pull/9112
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/184d2a978bcc0e579a9af1e2d30e509b002b8870

commit 184d2a978bcc0e579a9af1e2d30e509b002b8870
Author:     Jason Frey <jfrey>
AuthorDate: Thu Jun 2 13:56:46 2016 -0400
Commit:     Jason Frey <jfrey>
CommitDate: Thu Jun 2 13:56:52 2016 -0400

    Limit the amount of records pulled into memory at once

    https://bugzilla.redhat.com/show_bug.cgi?id=1337159

 .../20150625220141_fix_serialized_reports_for_rails_four.rb | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
Not done yet. However, Alex, if you can test with the change I made, we can see how much progress I made (in particular with the memory).
From the PR:

This is a simple tactical change to the migration to alleviate the memory pressure in large databases. In our candidate database, which has ~65000 BinaryBlob records, the find ends up consuming all of the memory on the appliance, moving it into swap, and the migration takes something like 7 hours. This should alleviate the memory pressure, allow GC to do its job, and not hit swap, which should allow the migration to complete in less time.

This is not a final fix, as it will still take hours, just hopefully not as many (hard to test locally because I'd have to run for 7 hours 😉). After this change, the majority of the time is spent in YAML.dump/load land, but I am having a very hard time getting that to parallelize in a migration (due to transactions and postgres connections not playing nicely in parallel).
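On the "allow GC to do its job" point, a rough stand-alone illustration (not ManageIQ code, and with a hypothetical FakeBlob class standing in for BinaryBlob) of why batch-scoped references matter: once a batch variable goes out of scope, its objects become collectible before the next batch is loaded.

```ruby
# Rough illustration (not ManageIQ code): batch-scoped references let
# the GC reclaim each batch before the next one is loaded, so peak live
# objects track the batch size rather than the table size.

class FakeBlob
  def initialize
    @payload = "x" * 1_000  # stand-in for a serialized report payload
  end
end

total = 0
10.times do
  batch = Array.new(1_000) { FakeBlob.new }  # "load" one batch
  total += batch.size                        # "process" it
  # batch goes out of scope here; its FakeBlobs are now collectible
end
GC.start

puts total  # 10000 records processed, in 1000-record batches
```

With a single up-front find, every record stays referenced by the result array for the whole loop, so nothing can be collected until the migration step finishes.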
https://github.com/ManageIQ/manageiq/pull/9119
New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=a1cd891125108bfb96e10b7a73fb80d09ed70647

commit a1cd891125108bfb96e10b7a73fb80d09ed70647
Merge:      01b5ef6 7af03ba
Author:     Gregg Tanzillo <gtanzill>
AuthorDate: Fri Jun 3 14:10:24 2016 -0400
Commit:     Gregg Tanzillo <gtanzill>
CommitDate: Fri Jun 3 14:10:24 2016 -0400

    Merge branch '55z_binary_blob_migration_tactical' into '5.5.z'

    Limit the amount of records pulled into memory at once

    Clean cherry-pick of https://github.com/ManageIQ/manageiq/pull/9112

    https://bugzilla.redhat.com/show_bug.cgi?id=1337159

    @gtanzill Please review.

    See merge request !958

 .../20150625220141_fix_serialized_reports_for_rails_four.rb | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=7af03ba2303de581633966c55c55c835d034317b

commit 7af03ba2303de581633966c55c55c835d034317b
Author:     Jason Frey <jfrey>
AuthorDate: Thu Jun 2 13:56:46 2016 -0400
Commit:     Jason Frey <jfrey>
CommitDate: Fri Jun 3 13:16:37 2016 -0400

    Limit the amount of records pulled into memory at once

    https://bugzilla.redhat.com/show_bug.cgi?id=1337159

 .../20150625220141_fix_serialized_reports_for_rails_four.rb | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
https://github.com/ManageIQ/manageiq/pull/9133
https://github.com/ManageIQ/manageiq/pull/7620
Testing 5.5.5.2, I see some reduction in timing, but the most noticeable savings is in memory utilization.

Peak appliance used memory during rake db:migrate:
5.5.4.1 - 11.3 GiB
5.5.5.2 - 2.7 GiB

db:migrate timing:
5.5.4.1 - 6hr 45m
5.5.5.2 - 5hr 43m

The database still bloats on event_streams (event_streams - 17 GB):

vmdb_production=# SELECT pg_size_pretty(pg_database_size('vmdb_production'));
 pg_size_pretty
----------------
 37 GB
(1 row)

vmdb_production=# VACUUM FULL event_streams;
VACUUM
vmdb_production=# SELECT pg_size_pretty(pg_database_size('vmdb_production'));
 pg_size_pretty
----------------
 26 GB
(1 row)

FixSerializedReportsForRailsFour decreases in timing:
5.5.4.1 - 21999.6597s
5.5.5.2 - 19057.7571s

While RenameEmsEventTableToEventStream increases in timing:
5.5.4.1 - 876.8407s
5.5.5.2 - 1194.6601s

Though the net effect on db:migrate is lower by an hour in this testbed.
Created attachment 1181609 [details] 5552 db importing and migration output
Looks like I updated the wrong bug; this bug is tracking 5.7. Moving this back to ON_DEV, and I will update the correct bug for 5.5.
https://github.com/ManageIQ/manageiq/pull/13947
This BZ has been closed since it was opened on a version of CloudForms that has reached the end of its lifecycle. If you find that this bug is still relevant, please open a new BZ against the relevant supported version of CloudForms.