Bug 1381604 - Unique constraint failure when configuring pglogical replication
Summary: Unique constraint failure when configuring pglogical replication
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Replication
Version: 5.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.7.2
Assignee: Nick Carboni
QA Contact: Alex Newman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-04 14:23 UTC by Nick Carboni
Modified: 2017-12-05 15:01 UTC (History)
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-16 20:03:18 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:



Description Nick Carboni 2016-10-04 14:23:03 UTC
Description of problem:

After an upgrade from 5.5 to 5.6, while moving from rubyrep to pglogical, the following error was encountered in postgresql.log:
root@vmdb_production:[14630]:ERROR:  duplicate key value violates unique constraint "index_storages_vms_on_vm_id_and_storage_id"
root@vmdb_production:[14630]:DETAIL:  Key (vm_or_template_id, storage_id)=(10000000003768, 10000000000394) already exists.
root@vmdb_production:[14630]:CONTEXT:  COPY storages_vms_and_templates, line 2604
root@vmdb_production:[14630]:STATEMENT:  COPY "public"."storages_vms_and_templates" FROM stdin
root@vmdb_production:[14630]:ERROR:  current transaction is aborted, commands ignored until end of transaction block
root@vmdb_production:[14630]:STATEMENT:  COPY "public"."container_groups_container_services" FROM stdin
@:[14271]:ERROR:  table copy failed

This indicates that a row belonging to the region being replicated was still present in the global database when it should not have been.

Upon closer investigation, the following was found in the storages_vms_and_templates table:

vmdb_production=# select * from storages_vms_and_templates where vm_or_template_id = 10000000003768 and storage_id = 10000000000394;
   storage_id   | vm_or_template_id |       id
----------------+-------------------+----------------
 10000000000394 |    10000000003768 | 99000000000015
(1 row)

This shows that this join-table row was assigned an id outside its proper region range.
Because of that, the row was not matched by `bin/rake evm:dbsync:destroy_local_region 10` and so disrupted the sync.
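
For reference, these ids encode the region in the high-order digits (a 10^12 factor is consistent with the values above: 10000000003768 maps to region 10, while 99000000000015 maps to region 99). A query along these lines -- a sketch only, assuming that factor -- flags join-table rows whose own id disagrees with the region of the VM they reference:

-- Sketch, assuming the 10^12 region factor inferred from the ids above:
-- flag join-table rows whose id region does not match the region of their VM.
SELECT id, vm_or_template_id, storage_id
  FROM storages_vms_and_templates
 WHERE id / 1000000000000 <> vm_or_template_id / 1000000000000;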

Comment 2 Nick Carboni 2017-01-16 20:03:18 UTC
I could not reproduce this. When I ran through this upgrade, I saw the rows get removed from the global database and then re-added after the upgrade.

This case should be handled by the part of the migration that adds primary keys to the join tables (https://github.com/ManageIQ/manageiq/blob/master/db/migrate/20160406195810_add_id_primary_key_to_join_tables.rb#L50-L55).

When the rows are removed, they are then re-added either by rubyrep when the worker is turned back on after the upgrade, or by pglogical during the initial sync.
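
As a rough post-sync sanity check (a sketch only, not part of the migration, and assuming the same 10^12 region factor as above), the offending pair can be re-queried to confirm it now exists exactly once with an id in the owning region's range:

-- Suggested post-sync check (not part of the migration): the pair should exist
-- exactly once, with id_region matching the owning region (10^12 factor assumed).
SELECT id, id / 1000000000000 AS id_region
  FROM storages_vms_and_templates
 WHERE vm_or_template_id = 10000000003768
   AND storage_id = 10000000000394;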

Going to close this as WORKSFORME.

