Bug 1337159 - db:migrate on large database takes long time and large amount of memory
Summary: db:migrate on large database takes long time and large amount of memory
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Performance
Version: 5.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.7.0
Assignee: Jason Frey
QA Contact: Pradeep Kumar Surisetty
URL:
Whiteboard: perf
Depends On:
Blocks: 1343221 1350962
 
Reported: 2016-05-18 12:07 UTC by Alex Krzos
Modified: 2018-01-05 23:50 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned By: 1343221 1350962
Environment:
Last Closed: 2017-06-07 18:46:19 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments
5552 db importing and migration output (86.20 KB, application/zip)
2016-07-19 11:53 UTC, Alex Krzos

Description Alex Krzos 2016-05-18 12:07:56 UTC
Description of problem:
After importing a large database, the rake db:migrate task took 6 hours and 45 minutes, and the ruby process conducting the migration consumed 11 GiB of memory.  Running the migration on a default appliance configuration results in running out of memory and an even slower process due to swapping.


Version-Release number of selected component (if applicable):
5.4 migration to 5.5

How reproducible:
Consistently reproducible with this large-scale database.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
The database also bloats in size after the migration.

Comment 2 Alex Krzos 2016-05-18 13:40:06 UTC
A comparison of database bloat with db:migrate

Before:
# SELECT pg_size_pretty(pg_database_size('vmdb_production'));
 pg_size_pretty
----------------
 25 GB

After db:migrate:
# SELECT pg_size_pretty(pg_database_size('vmdb_production'));
 pg_size_pretty
----------------
 37 GB

The bloat appears mostly to be introduced from events:
Before:
ems_events - 8854 MB

After db:migrate:
event_streams - 17 GB

Comment 13 Joe Rafaniello 2016-05-26 17:49:00 UTC
Lucy, can you try changing that migration to use find_each and adding an includes for the binary_blob_parts? Something like below? The down migration would need similar treatment.


diff --git a/db/migrate/20150625220141_fix_serialized_reports_for_rails_four.rb b/db/migrate/20150625220141_fix_serialized_reports_for_rails_four.rb
index e48dda3..4a608ca 100644
--- a/db/migrate/20150625220141_fix_serialized_reports_for_rails_four.rb
+++ b/db/migrate/20150625220141_fix_serialized_reports_for_rails_four.rb
@@ -126,14 +126,14 @@ class FixSerializedReportsForRailsFour < ActiveRecord::Migration

   def up
     say_with_time("Converting MiqReportResult#report to a serialized hash") do
-      MiqReportResult.where('report IS NOT NULL').each do |rr|
+      MiqReportResult.where('report IS NOT NULL').find_each do |rr|
         val = rr.serialize_report_to_hash(rr.read_attribute(:report), self)
         rr.update_attribute(:report, val) if val
       end
     end

     say_with_time("Converting BinaryBlob report results to a serialized hash") do
-      BinaryBlob.includes(:resource).where(:resource_type => 'MiqReportResult').each do |bb|
+      BinaryBlob.includes([:resource, :binary_blob_parts]).where(:resource_type => 'MiqReportResult').find_each do |bb|
         if bb.resource
           val = bb.serialize_report_to_hash(bb.binary, self)
           bb.binary = val if val

Comment 14 Gregg Tanzillo 2016-05-26 21:03:03 UTC
I looked at the supplied DB and saw that there are almost 28k miq_report_results that are over 1 year old, but only 3k that are less than 1 year old. Purging the old results may be an option.

Comment 15 Gregg Tanzillo 2016-05-27 14:26:21 UTC
Joe, I ran the migration locally (with the customer DB) with no changes so I can get a baseline in my env and then with the changes you suggested. Here's the result:

Without any changes to the migration:

(git:master) ~/work/rh/manageiq$ rake db:migrate:up VERSION=20150625220141
** Using session_store: ActionDispatch::Session::MemCacheStore
== 20150625220141 FixSerializedReportsForRailsFour: migrating =================
-- Converting MiqReportResult#report to a serialized hash
   -> 195.3643s
-- Converting BinaryBlob report results to a serialized hash
   -> 13494.3909s
== 20150625220141 FixSerializedReportsForRailsFour: migrated (13689.7554s) ====

After changing "each" to "find_each":

(git:master) ~/work/rh/manageiq$ rake db:migrate:up VERSION=20150625220141
** Using session_store: ActionDispatch::Session::MemCacheStore
== 20150625220141 FixSerializedReportsForRailsFour: migrating =================
-- Converting MiqReportResult#report to a serialized hash
   -> 190.9291s
-- Converting BinaryBlob report results to a serialized hash
   -> 11907.3533s
== 20150625220141 FixSerializedReportsForRailsFour: migrated (12098.2825s) ====

228 min vs. 201 min = 27 min improvement, or ~12%

Seems like there's still a lot of room for improvement. The processing of the binary blobs is the killer. We still may need to resort to purging old data to get to where we need to be.

Comment 16 Jason Frey 2016-05-27 14:37:32 UTC
Gregg...I am already working on this (didn't know Lucy's ticket got split)

Comment 17 Jason Frey 2016-05-27 14:43:33 UTC
The MiqReportResult conversion is decently fast...181s on my machine with batching but it might not even need it.

However, converting the binary blobs is crazy expensive.  Batching by 1000 still cranked my memory to 1.5 GB; batching by 100 brought it down to 700 MB.  Note that I also had to enable disable_ddl_transaction!

However that's not the slow part... YAML dump and load takes like 99% of the time.  There are only like 60000 records in the sample table and the big ones were taking upwards of 5 seconds each.

So I'm thinking we might be able to do some parallel/threaded processing

Some stats, btw
-> BinaryBlob total rows: 66146
-> BinaryBlob sizes: max: 19598184, min: 9, average: 161890.02913252503
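To illustrate why the YAML round-trip dominates (a sketch with a synthetic payload, not the actual report data): dump and load are pure CPU work per record, so they stay expensive no matter how the records are batched.

```ruby
require "yaml"
require "benchmark"

# Synthetic payload standing in for one serialized report result;
# the real BinaryBlob rows here averaged ~160 KB (max ~19 MB).
payload = { "rows" => Array.new(5_000) { |i| { "id" => i, "name" => "vm#{i}" } } }

dumped = nil
loaded = nil
dump_time = Benchmark.realtime { dumped = YAML.dump(payload) }
load_time = Benchmark.realtime { loaded = YAML.safe_load(dumped) }
# Both steps are CPU-bound; multiplied across ~66k rows, even tens of
# milliseconds per record adds up to hours of migration time.
```

This is why batching fixed the memory side but left most of the wall-clock time untouched.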

Comment 20 CFME Bot 2016-06-02 20:30:55 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/184d2a978bcc0e579a9af1e2d30e509b002b8870

commit 184d2a978bcc0e579a9af1e2d30e509b002b8870
Author:     Jason Frey <jfrey>
AuthorDate: Thu Jun 2 13:56:46 2016 -0400
Commit:     Jason Frey <jfrey>
CommitDate: Thu Jun 2 13:56:52 2016 -0400

    Limit the amount of records pulled into memory at once
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1337159

 .../20150625220141_fix_serialized_reports_for_rails_four.rb       | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comment 21 Jason Frey 2016-06-02 21:32:20 UTC
Not done yet.  However, Alex, if you can test with the change I made we can see how much progress I made (in particular with the memory)

Comment 22 Jason Frey 2016-06-02 21:34:34 UTC
From the PR:

This is a simple tactical change to the migration to alleviate the memory pressure in large databases. In our candidate database that has ~65000 BinaryBlob records, the find ends up consuming all of the memory on the appliance, moving it into swap, and the migration takes something like 7 hours. This should alleviate the memory pressure, allow GC to do its job, and not hit swap, which should allow the migration to complete in less time.

This is not a final fix, as it will still take hours, just hopefully not as many (hard to test locally because I'd have to run for 7 hours 😉). After this change, the majority of the time is spent in YAML.dump/load land, but I am having a very hard time getting that to parallelize in a migration (due to transactions and postgres connections not playing nice in parallel).
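One hypothetical shape for that parallel processing (not the shipped fix; all names here are illustrative): partition the id space into disjoint contiguous ranges so each worker could own a slice with its own database connection. A plain-Ruby sketch of just the partitioning step:

```ruby
# Hypothetical id-range partitioner for parallel workers: splits
# [min_id, max_id] into up to `workers` contiguous, non-overlapping
# ranges so each worker handles a disjoint slice of the table.
def partition(min_id, max_id, workers)
  span = max_id - min_id + 1
  per  = (span.to_f / workers).ceil
  (0...workers).map do |w|
    lo = min_id + w * per
    hi = [lo + per - 1, max_id].min
    (lo..hi) if lo <= max_id
  end.compact
end

partition(1, 10, 3)  # => [1..4, 5..8, 9..10]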

Comment 24 CFME Bot 2016-06-03 18:13:29 UTC
New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=a1cd891125108bfb96e10b7a73fb80d09ed70647

commit a1cd891125108bfb96e10b7a73fb80d09ed70647
Merge: 01b5ef6 7af03ba
Author:     Gregg Tanzillo <gtanzill>
AuthorDate: Fri Jun 3 14:10:24 2016 -0400
Commit:     Gregg Tanzillo <gtanzill>
CommitDate: Fri Jun 3 14:10:24 2016 -0400

    Merge branch '55z_binary_blob_migration_tactical' into '5.5.z'
    
    Limit the amount of records pulled into memory at once
    
    Clean cherry-pick of https://github.com/ManageIQ/manageiq/pull/9112
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1337159
    
    @gtanzill Please review.
    
    
    See merge request !958

 .../20150625220141_fix_serialized_reports_for_rails_four.rb       | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comment 25 CFME Bot 2016-06-03 18:13:34 UTC
New commit detected on cfme/5.5.z:
https://code.engineering.redhat.com/gerrit/gitweb?p=cfme.git;a=commitdiff;h=7af03ba2303de581633966c55c55c835d034317b

commit 7af03ba2303de581633966c55c55c835d034317b
Author:     Jason Frey <jfrey>
AuthorDate: Thu Jun 2 13:56:46 2016 -0400
Commit:     Jason Frey <jfrey>
CommitDate: Fri Jun 3 13:16:37 2016 -0400

    Limit the amount of records pulled into memory at once
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1337159

 .../20150625220141_fix_serialized_reports_for_rails_four.rb       | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comment 30 Alex Krzos 2016-07-19 11:48:31 UTC
Testing 5.5.5.2 I see some reduction in timing but the most noticeable savings is in memory utilization:

Peak appliance used memory during rake db:migrate
5.5.4.1 - 11.3GiB
5.5.5.2 - 2.7GiB

db:migrate Timing:
5.5.4.1 - 6hr 45m
5.5.5.2 - 5hr 43m

Database still bloats on event_streams:

event_streams - 17GB


vmdb_production=# SELECT pg_size_pretty(pg_database_size('vmdb_production'));
 pg_size_pretty
----------------
 37 GB
(1 row)

vmdb_production=# VACUUM FULL event_streams;
VACUUM
vmdb_production=# SELECT pg_size_pretty(pg_database_size('vmdb_production'));
 pg_size_pretty
----------------
 26 GB
(1 row)


FixSerializedReportsForRailsFour decreases in timing:
5.5.4.1 - 21999.6597s
5.5.5.2 - 19057.7571s

RenameEmsEventTableToEventStream increases in timing:
5.5.4.1 - 876.8407s
5.5.5.2 - 1194.6601s

Still, the net effect on db:migrate in this testbed is about an hour faster.

Comment 31 Alex Krzos 2016-07-19 11:53:49 UTC
Created attachment 1181609 [details]
5552 db importing and migration output

Comment 33 Alex Krzos 2016-07-19 12:07:32 UTC
Looks like I updated the wrong bug; this bug is tracking 5.7, so I'm moving this back to ON_DEV and will update the correct bug for 5.5.

Comment 37 Chris Pelland 2017-06-07 18:46:19 UTC
This BZ has been closed since it was opened on a version of CloudForms that has reached the end of its lifecycle.  If you find that this bug is still relevant, please open a new BZ against the relevant supported version of CloudForms.

