Bug 1381604

Summary:	Unique constraint failure when configuring pglogical replication
Product:	Red Hat CloudForms Management Engine	Reporter:	Nick Carboni <ncarboni>
Component:	Replication	Assignee:	Nick Carboni <ncarboni>
Status:	CLOSED WORKSFORME	QA Contact:	Alex Newman <anewman>
Severity:	high	Docs Contact:
Priority:	high
Version:	5.6.0	CC:	cpelland, jhardy, obarenbo
Target Milestone:	GA
Target Release:	5.7.2
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-01-16 20:03:18 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	CFME Core	Target Upstream Version:
Embargoed:

Description Nick Carboni 2016-10-04 14:23:03 UTC

Description of problem:

After an upgrade from 5.5 to 5.6, while moving from rubyrep to pglogical the following error was encountered (in the postgresql.log):
root@vmdb_production:[14630]:ERROR:  duplicate key value violates unique constraint "index_storages_vms_on_vm_id_and_storage_id"
root@vmdb_production:[14630]:DETAIL:  Key (vm_or_template_id, storage_id)=(10000000003768, 10000000000394) already exists.
root@vmdb_production:[14630]:CONTEXT:  COPY storages_vms_and_templates, line 2604
root@vmdb_production:[14630]:STATEMENT:  COPY "public"."storages_vms_and_templates" FROM stdin
root@vmdb_production:[14630]:ERROR:  current transaction is aborted, commands ignored until end of transaction block
root@vmdb_production:[14630]:STATEMENT:  COPY "public"."container_groups_container_services" FROM stdin
@:[14271]:ERROR:  table copy failed

This indicates that a row was present in the global database from the region that should not have been there.

Upon closer investigation, the following was found in the storages_vms_and_templates table:

vmdb_production=# select * from storages_vms_and_templates where vm_or_template_id = 10000000003768 and storage_id = 10000000000394;
   storage_id   | vm_or_template_id |       id
----------------+-------------------+----------------
 10000000000394 |    10000000003768 | 99000000000015
(1 row)

This shows that this join table was assigned an id out of its proper region.
Because of this, this row was not seen by `bin/rake evm:dbsync:destroy_local_region 10` and disrupted the sync.

Comment 2 Nick Carboni 2017-01-16 20:03:18 UTC

I could not reproduce this. When I ran through this upgrade, I saw the rows get removed from the global database then get re-added after the upgrade.

This case should be handled by this part of the migration to add primary keys to the join tables (https://github.com/ManageIQ/manageiq/blob/master/db/migrate/20160406195810_add_id_primary_key_to_join_tables.rb#L50-L55)

When the rows are removed, they are then either re-added by rubyrep when the worker is turned back on after the upgrade or by pglogical during the initial sync.

Going to close this as WORKSFORME.