Bug 1381604

Summary: Unique constraint failure when configuring pglogical replication
Product: Red Hat CloudForms Management Engine Reporter: Nick Carboni <ncarboni>
Component: ReplicationAssignee: Nick Carboni <ncarboni>
Status: CLOSED WORKSFORME QA Contact: Alex Newman <anewman>
Severity: high Docs Contact:
Priority: high    
Version: 5.6.0CC: cpelland, jhardy, obarenbo
Target Milestone: GA   
Target Release: 5.7.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-16 20:03:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: CFME Core Target Upstream Version:

Description Nick Carboni 2016-10-04 14:23:03 UTC
Description of problem:

After an upgrade from 5.5 to 5.6, while moving from rubyrep to pglogical the following error was encountered (in the postgresql.log):
root@vmdb_production:[14630]:ERROR:  duplicate key value violates unique constraint "index_storages_vms_on_vm_id_and_storage_id"
root@vmdb_production:[14630]:DETAIL:  Key (vm_or_template_id, storage_id)=(10000000003768, 10000000000394) already exists.
root@vmdb_production:[14630]:CONTEXT:  COPY storages_vms_and_templates, line 2604
root@vmdb_production:[14630]:STATEMENT:  COPY "public"."storages_vms_and_templates" FROM stdin
root@vmdb_production:[14630]:ERROR:  current transaction is aborted, commands ignored until end of transaction block
root@vmdb_production:[14630]:STATEMENT:  COPY "public"."container_groups_container_services" FROM stdin
@:[14271]:ERROR:  table copy failed

This indicates that a row was present in the global database from the region that should not have been there.

Upon closer investigation, the following was found in the storages_vms_and_templates table:

vmdb_production=# select * from storages_vms_and_templates where vm_or_template_id = 10000000003768 and storage_id = 10000000000394;
   storage_id   | vm_or_template_id |       id
----------------+-------------------+----------------
 10000000000394 |    10000000003768 | 99000000000015
(1 row)

This shows that this join table was assigned an id out of its proper region.
Because of this, this row was not seen by `bin/rake evm:dbsync:destroy_local_region 10` and disrupted the sync.

Comment 2 Nick Carboni 2017-01-16 20:03:18 UTC
I could not reproduce this. When I ran through this upgrade, I saw the rows get removed from the global database then get re-added after the upgrade.

This case should be handled by this part of the migration to add primary keys to the join tables (https://github.com/ManageIQ/manageiq/blob/master/db/migrate/20160406195810_add_id_primary_key_to_join_tables.rb#L50-L55)

When the rows are removed, they are then either re-added by rubyrep when the worker is turned back on after the upgrade or by pglogical during the initial sync.

Going to close this as WORKSFORME.