Bug 1331053 - Replication fails to start after 5.5 -> 5.6 migration
Summary: Replication fails to start after 5.5 -> 5.6 migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.6.0
Assignee: Gregg Tanzillo
QA Contact: luke couzens
URL:
Whiteboard: replication:migration:upgrade
Depends On:
Blocks:
 
Reported: 2016-04-27 14:49 UTC by luke couzens
Modified: 2016-06-29 15:55 UTC
CC List: 7 users

Fixed In Version: 5.6.0.8
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-29 15:55:29 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2016:1348 (normal, SHIPPED_LIVE): CFME 5.6.0 bug fixes and enhancement update (last updated 2016-06-29 18:50:04 UTC)

Description luke couzens 2016-04-27 14:49:58 UTC
Description of problem: Replication keeps restarting after migrating a 5.5.3.4 appliance to 5.6.0.4-beta2.3


Version-Release number of selected component (if applicable): 5.6.0.4-beta2.3


How reproducible: 100%


Steps to Reproduce:
1.provision 2x 5.5.3.4 appliances
2.configure 1st db with region 99 (r99)
3.configure 2nd db with region 1 (r1)
4.login to webui of r1 appliance
5.setup replication worker (configure-configuration-workers)
6.point it at r99 appliance
7.enable db synchronization (configure-configuration-server)
8.test replication by adding provider and checking it shows up in r99 also
9.disable db synchronization (configure-configuration-server)
10.backup r99 and r1 db's
11.provision 2x 5.6.0.4-beta2.3 appliances
12.configure 1st db with region 99 (r99) also fetching v2_key from 5.5 appliance
13.configure 2nd db with region 1 (r1) also fetching v2_key from 5.5 appliance
14.copy 5.5 r99/r1 backups to the respective 5.6 appliances
15.copy 5.5 r99/r1 /var/www/miq/vmdb/GUID to /var/www/miq/vmdb/ of the respective 5.6 appliance.
16.stop evm (steps 16-21 are consolidated as a shell sketch after the step list)
17.restore backups (pg_restore --dbname=vmdb_production <location/backup> --verbose)
18.rake db:migrate (run from /var/www/miq/vmdb/)
19.start evm
20.rake evm:automate:reset
21.restart evm
22.restart both appliances and check you can log in to the ui
23.on 5.6 r99 appliance run psql -d vmdb_production -c "delete from host_storages"
24.on 5.6 r1 run the following:

vmdb
bin/rake evm:dbsync:uninstall host_storages
psql -d vmdb_production -c "drop trigger if exists rr1_hosts_storages on host_storages"
psql -d vmdb_production -c "drop function if exists rr1_hosts_storages()"
psql -d vmdb_production -c "delete from rr<region>_pending_changes where change_table = 'hosts_storages'"
psql -d vmdb_production -c "delete from rr<region>_sync_state where table_name = 'hosts_storages'"
bin/rake evm:dbsync:prepare_replication_without_sync

25.login to webui of r1 appliance
26.point replication worker at new 5.6 r99 ip (settings-configuration-workers)
27.turn db synchronization back on (settings-configuration-server)
28.test adding a provider and check it appears in r99
29.also check replication active/inactive + backlog (settings-diagnostics-region-replication)
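
For reference, steps 16-21 on a single 5.6 appliance can be strung together roughly as follows. This only consolidates the commands already listed above; the backup path is a placeholder, and managing evm through systemctl with the evmserverd service name is my assumption about the appliance, not something stated in the report.

# Illustrative consolidation of steps 16-21 on one 5.6 appliance.
BACKUP=/root/region_backup.dump            # placeholder path for the copied 5.5 backup (assumption)
cd /var/www/miq/vmdb

systemctl stop evmserverd                  # step 16: stop evm (assumed service name)
pg_restore --dbname=vmdb_production "$BACKUP" --verbose    # step 17: restore the 5.5 backup
bin/rake db:migrate                        # step 18: migrate the restored schema to 5.6
systemctl start evmserverd                 # step 19: start evm
bin/rake evm:automate:reset                # step 20: reset the automate datastore
systemctl restart evmserverd               # step 21: restart evm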


Actual results: The replication worker keeps restarting.


Expected results: The replication worker stays up and replicates data to r99.


Additional info:

Some lines from evm.log: http://pastebin.test.redhat.com/369335

Migration doc: https://access.redhat.com/articles/2076193

If using unconfigured appliances, you may need to run loosen_pgssl_connections.py: https://github.com/lcouzens/cfme_tests/blob/master/scripts/loosen_pgssl_connections.py (it basically enables remote access to the database for replication)
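
The script itself is not reproduced here; as a rough sketch, "enabling access to the database for replication" on an appliance usually amounts to changes along these lines. The PGDATA path, service name, and pg_hba entry below are my assumptions for a 5.6 appliance, not taken from the script.

# Rough sketch only; paths, service name, and addressing are assumptions.
PGDATA=/var/opt/rh/rh-postgresql94/lib/pgsql/data      # assumed data directory on a 5.6 appliance

# Listen on all interfaces rather than localhost only.
echo "listen_addresses = '*'" >> "$PGDATA/postgresql.conf"

# Allow password-authenticated connections to vmdb_production from the peer appliance.
echo "host vmdb_production root 0.0.0.0/0 md5" >> "$PGDATA/pg_hba.conf"

systemctl restart rh-postgresql94-postgresql           # assumed service name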

Comment 6 luke couzens 2016-05-18 15:05:39 UTC
Hey Gregg, unfortunately I don't have the appliances anymore; however, I am going to provision some new ones for us to take a look at. I will send you the IPs once they're ready (unless it works this time). ;)

I just ran through beta1 to beta2, and running bin/rails r tools/purge_duplicate_rubyrep_triggers.rb seemed to do the trick for replication there, so I will let you know.

Cheers.

Comment 8 luke couzens 2016-05-19 16:40:59 UTC
The same or similar issue is also present in an in-place upgrade from 5.5 to 5.6.0.7-beta2.6

Comment 11 CFME Bot 2016-05-24 16:30:38 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/b2838bd65a278470c8b7bc8315aac6b375df5d62

commit b2838bd65a278470c8b7bc8315aac6b375df5d62
Author:     Gregg Tanzillo <gtanzill>
AuthorDate: Fri May 20 13:02:01 2016 -0400
Commit:     Gregg Tanzillo <gtanzill>
CommitDate: Fri May 20 14:03:02 2016 -0400

    Explicitly add internal Rails tables to replication excluded tables list
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1331053

 lib/miq_pglogical.rb                        | 2 +-
 lib/miq_rubyrep.rb                          | 1 +
 spec/replication/util/miq_pglogical_spec.rb | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)
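
The internal Rails tables in question are most likely schema_migrations and ar_internal_metadata (my assumption; the commit message does not name them). A catalog query along these lines can show whether rubyrep triggers were created on them on an affected appliance (illustrative, not part of the documented procedure):

psql -d vmdb_production -c "select c.relname, t.tgname
                            from pg_trigger t
                            join pg_class c on c.oid = t.tgrelid
                            where not t.tgisinternal
                              and c.relname in ('schema_migrations', 'ar_internal_metadata')"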

Comment 16 Gregg Tanzillo 2016-06-03 14:13:48 UTC
Luke, I took a look at your region 0 appliance and saw that the replication worker is stuck in a loop of stopping due to exceeding the memory threshold and then starting again. Here's one of the log messages:

[----] W, [2016-06-03T09:27:10.677516 #11463:1323998]  WARN -- : MIQ(MiqServer#validate_worker) Worker [MiqReplicationWorker] with ID: [794], PID: [29987], GUID: [d7009c50-298e-11e6-8075-fa163e3af26a] process memory usage [274493000] exceeded limit [209715200], requesting worker to exit

This has been fixed by Nick in this PR - https://github.com/ManageIQ/manageiq/pull/9087

I went into the advanced settings on your appliance, added the new threshold, and reset replication; the worker is now replicating successfully.
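
For context, the values in that log line come out to about 261 MiB of usage against a 200 MiB limit. Below is a minimal sketch of that arithmetic plus the kind of advanced-settings override being described; the exact settings key path is my assumption and may differ between builds.

# Convert the byte counts from the validate_worker message for readability.
echo "usage: $((274493000 / 1024 / 1024)) MiB   limit: $((209715200 / 1024 / 1024)) MiB"
# -> usage: 261 MiB   limit: 200 MiB
#
# Illustrative advanced-settings override (key path is an assumption):
#   :workers:
#     :worker_base:
#       :replication_worker:
#         :memory_threshold: 400.megabytes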

Comment 17 luke couzens 2016-06-03 22:46:41 UTC
Thanks for your help Gregg.

Verified in 5.6.0.8

Comment 19 errata-xmlrpc 2016-06-29 15:55:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1348

