Description of problem:
Replication keeps restarting after migrating a 5.5.3.4 appliance to 5.6.0.4-beta2.3.

Version-Release number of selected component (if applicable):
5.6.0.4-beta2.3

How reproducible:
100%

Steps to Reproduce:
1. Provision 2x 5.5.3.4 appliances.
2. Configure the 1st db with region 99 (r99).
3. Configure the 2nd db with region 1 (r1).
4. Log in to the web UI of the r1 appliance.
5. Set up the replication worker (Configure > Configuration > Workers).
6. Point it at the r99 appliance.
7. Enable db synchronization (Configure > Configuration > Server).
8. Test replication by adding a provider and checking that it also shows up in r99.
9. Disable db synchronization (Configure > Configuration > Server).
10. Back up the r99 and r1 dbs.
11. Provision 2x 5.6.0.4-beta2.3 appliances.
12. Configure the 1st db with region 99 (r99), also fetching the v2_key from the 5.5 appliance.
13. Configure the 2nd db with region 1 (r1), also fetching the v2_key from the 5.5 appliance.
14. Copy the 5.5 r99/r1 backups to the respective 5.6 appliances.
15. Copy the 5.5 r99/r1 /var/www/miq/vmdb/GUID to /var/www/miq/vmdb/ of the respective 5.6 appliance.
16. Stop evm.
17. Restore the backups (pg_restore --dbname=vmdb_production <location/backup> --verbose).
18. Run rake db:migrate (from /var/www/miq/vmdb/).
19. Start evm.
20. Run rake evm:automate:reset.
21. Restart evm.
22. Restart both appliances and check that you can log in to the UI.
23. On the 5.6 r99 appliance, run:
    psql -d vmdb_production -c "delete from host_storages"
24. On the 5.6 r1 appliance, run the following:
    vmdb
    bin/rake evm:dbsync:uninstall host_storages
    psql -d vmdb_production -c "drop trigger if exists rr1_hosts_storages on host_storages"
    psql -d vmdb_production -c "drop function if exists rr1_hosts_storages()"
    psql -d vmdb_production -c "delete from rr<region>_pending_changes where change_table = 'hosts_storages'"
    psql -d vmdb_production -c "delete from rr<region>_sync_state where table_name = 'hosts_storages'"
    bin/rake evm:dbsync:prepare_replication_without_sync
25. Log in to the web UI of the r1 appliance.
26. Point the replication worker at the new 5.6 r99 IP (Settings > Configuration > Workers).
27. Turn db synchronization back on (Settings > Configuration > Server).
28. Test adding a provider and check whether it appears in r99.
29. Also check replication active/inactive + backlog (Settings > Diagnostics > Region > Replication).

Actual results:
The replication worker keeps restarting.

Expected results:
The replication worker stays up and replicates data to r99.

Additional info:
Some lines from evm.log: http://pastebin.test.redhat.com/369335
Migration doc: https://access.redhat.com/articles/2076193
If using unconfigured appliances, they may require running loosen_pgssl_connections.py: https://github.com/lcouzens/cfme_tests/blob/master/scripts/loosen_pgssl_connections.py (this enables access to the database for replication).
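The per-table rubyrep cleanup run on the r1 appliance (the drop trigger / drop function / delete statements above) follows a repeatable pattern keyed on the region number and table name. A minimal sketch below generates those statements for review before piping them into psql; the function name is illustrative and not part of CFME, and the trigger/table naming is assumed to follow the rr<region>_<table> convention shown in the original commands.

```shell
# Sketch: emit the rubyrep cleanup SQL for one replicated table so it can be
# reviewed (or piped into psql) instead of retyping each statement.
# rr_cleanup_sql is a hypothetical helper name; region/table are parameters.
rr_cleanup_sql() {
  region="$1"
  table="$2"
  cat <<SQL
DROP TRIGGER IF EXISTS rr${region}_${table} ON ${table};
DROP FUNCTION IF EXISTS rr${region}_${table}();
DELETE FROM rr${region}_pending_changes WHERE change_table = '${table}';
DELETE FROM rr${region}_sync_state WHERE table_name = '${table}';
SQL
}

# Example: print the statements for region 1, table host_storages
rr_cleanup_sql 1 host_storages
```

The output could then be applied with something like `rr_cleanup_sql 1 host_storages | psql -d vmdb_production`, after confirming the statements match what evm:dbsync:uninstall expects.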
Hey Gregg, unfortunately I don't have the appliances anymore, but I am going to provision some new ones for us to take a look at. I will send you the IPs once they're ready (unless it works this time). ;) I just ran through beta1 to beta2, and running bin/rails r tools/purge_duplicate_rubyrep_triggers.rb seemed to do the trick for that replication, so I will let you know. Cheers.
The same or a similar issue is also present in an in-place upgrade from 5.5 to 5.6.0.7-beta2.6.
https://github.com/ManageIQ/manageiq/pull/8859
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/b2838bd65a278470c8b7bc8315aac6b375df5d62

commit b2838bd65a278470c8b7bc8315aac6b375df5d62
Author:     Gregg Tanzillo <gtanzill>
AuthorDate: Fri May 20 13:02:01 2016 -0400
Commit:     Gregg Tanzillo <gtanzill>
CommitDate: Fri May 20 14:03:02 2016 -0400

    Explicitly add internal Rails tables to replication excluded tables list

    https://bugzilla.redhat.com/show_bug.cgi?id=1331053

 lib/miq_pglogical.rb                        | 2 +-
 lib/miq_rubyrep.rb                          | 1 +
 spec/replication/util/miq_pglogical_spec.rb | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)
Luke, I took a look at your region 0 appliance and saw that the replication worker is stuck in a loop: it stops due to exceeding the memory threshold and then starts again. Here's one of the log messages:

[----] W, [2016-06-03T09:27:10.677516 #11463:1323998] WARN -- : MIQ(MiqServer#validate_worker) Worker [MiqReplicationWorker] with ID: [794], PID: [29987], GUID: [d7009c50-298e-11e6-8075-fa163e3af26a] process memory usage [274493000] exceeded limit [209715200], requesting worker to exit

This has been fixed by Nick in this PR: https://github.com/ManageIQ/manageiq/pull/9087

I went into the advanced settings on your appliance, added the new threshold, and reset replication; the worker is now replicating successfully.
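The log line boils down to a simple comparison: the worker's resident set size (274493000 bytes) exceeded the configured limit (209715200 bytes = 200 MiB), so the server asked the worker to exit. A minimal shell sketch of that kind of check is below; the function name and the ps-based RSS lookup are illustrative, not the actual MiqServer#validate_worker implementation.

```shell
# Sketch: compare a process's resident set size against a byte limit, the
# same comparison reported in the evm.log warning above. check_worker_memory
# is a hypothetical helper name, not CFME code.
check_worker_memory() {
  pid="$1"
  limit_bytes="$2"
  # ps reports RSS in kilobytes on Linux; convert to bytes before comparing
  rss_kb=$(ps -o rss= -p "$pid" | tr -d ' ')
  rss_bytes=$((rss_kb * 1024))
  if [ "$rss_bytes" -gt "$limit_bytes" ]; then
    echo "process memory usage [$rss_bytes] exceeded limit [$limit_bytes]"
    return 1
  fi
  return 0
}
```

For example, `check_worker_memory 29987 209715200` against the worker PID from the log would have reported the limit as exceeded. Raising the threshold in the advanced settings, as done on the appliance, widens that limit until the fixed default lands.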
Thanks for your help, Gregg. Verified in 5.6.0.8.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1348