Bug 1391997

Summary: heartbeats are failing intermittently attributing problems to pglogical duplicate keys conditions and causing workers to restart
Product: Red Hat CloudForms Management Engine
Reporter: Thomas Hennessy <thenness>
Component: Appliance
Assignee: Nick Carboni <ncarboni>
Status: CLOSED ERRATA
QA Contact: Alex Newman <anewman>
Severity: high
Docs Contact:
Priority: high
Version: 5.6.0
CC: abellott, cpelland, gekis, jdeubel, jhardy, jocarter, myoder, ncarboni, obarenbo, saali, simaishi, thenness
Target Milestone: GA
Keywords: ZStream
Target Release: 5.6.4
Hardware: x86_64
OS: Linux
Whiteboard: replication
Fixed In Version: 5.6.4.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-09 17:03:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1380475
Bug Blocks:
Attachments:
Description | Flags
grep of 18428 containing representative error | none
grep of pid 10404 with representative error | none
grep of pid 10922 with representative error | none
hotfix | none
hotfix contains modified files which correct replication_set_table_pkey violation v3 | none

Description Thomas Hennessy 2016-11-04 14:52:28 UTC
Description of problem: Multiple different workers raise errors when trying to perform a heartbeat. A representative error follows:
=====
/var/www/miq/vmdb/lib/extensions/ar_adapter/ar_pglogical/pglogical_raw.rb:301:in `async_exec': ERROR:  duplicate key value violates unique constraint "replication_set_table_pkey" (PG::UniqueViolation)
DETAIL:  Key (set_id, set_reloid)=(-299507980, 16492) already exists. 
=========
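
The constraint name and key tuple in the error above can be used to gauge how often the violation occurs and whether it always involves the same key. The following is a minimal sketch, not part of the original report: the function names are hypothetical, the default log path is an assumption, and `grep -o` assumes a GNU or BSD grep.

```shell
#!/bin/sh
# Hypothetical helpers for scanning a log for the error quoted above.
# The default log path is an assumption; pass the real path as the first argument.

count_pkey_violations() {
    log_file="${1:-/var/www/miq/vmdb/log/evm.log}"
    # grep -c prints the number of matching lines; it exits 1 when there are
    # none, so "|| true" keeps the function from reporting a failure then.
    grep -c 'violates unique constraint "replication_set_table_pkey"' "$log_file" || true
}

list_duplicate_keys() {
    log_file="${1:-/var/www/miq/vmdb/log/evm.log}"
    # Extract each "Key (set_id, set_reloid)=(...)" tuple from the DETAIL lines
    # and count how many times each distinct tuple appears.
    grep -o 'Key (set_id, set_reloid)=([^)]*)' "$log_file" | sort | uniq -c
}
```

For example, `list_duplicate_keys /path/to/evm.log` prints each distinct key tuple prefixed by the number of times it was rejected.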


Version-Release number of selected component (if applicable): 5.6.1.2


How reproducible: not yet known


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Gregg Tanzillo 2016-11-04 14:59:35 UTC
This may be a duplicate of BZ https://bugzilla.redhat.com/show_bug.cgi?id=1380475

Comment 3 Thomas Hennessy 2016-11-04 15:51:30 UTC
Created attachment 1217429
grep of 18428 containing representative error

Comment 4 Thomas Hennessy 2016-11-04 15:55:20 UTC
Created attachment 1217442
grep of pid 10404 with representative error

Comment 5 Thomas Hennessy 2016-11-04 15:57:20 UTC
Created attachment 1217443
grep of pid 10922 with representative error

Comment 7 Nick Carboni 2016-11-08 20:30:00 UTC
Created attachment 1218701
hotfix

This hotfix contains modified files which correct this behavior.

To apply the hotfix, on each server:
  -Put the .tgz file in the /var/www/miq/vmdb directory
  -Untar the file using `tar -xzvf <hotfix_file_name>`
  -Restart the evmserverd service using `systemctl restart evmserverd`
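
As a minimal sketch, the three steps above could be scripted as follows. The directory, the tar flags, and the service name come from the instructions; the names `apply_hotfix`, `run`, `VMDB_DIR`, and `DRY_RUN` are hypothetical and introduced only for illustration. Using `tar -C` is assumed to be equivalent to running tar from inside the vmdb directory.

```shell
#!/bin/sh
# Sketch of the hotfix steps; run on each server.
# VMDB_DIR defaults to the directory named in the instructions.
VMDB_DIR="${VMDB_DIR:-/var/www/miq/vmdb}"

run() {
    # With DRY_RUN=1 only print the command; otherwise execute it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

apply_hotfix() {
    hotfix="$1"   # path to the downloaded .tgz file
    [ -n "$hotfix" ] || { echo "usage: apply_hotfix <hotfix.tgz>" >&2; return 2; }
    name=$(basename "$hotfix")
    # Step 1: put the .tgz file in the vmdb directory.
    run cp "$hotfix" "$VMDB_DIR/" || return 1
    # Step 2: untar it there (-C extracts inside the vmdb directory).
    run tar -xzvf "$VMDB_DIR/$name" -C "$VMDB_DIR" || return 1
    # Step 3: restart the evmserverd service.
    run systemctl restart evmserverd
}
```

Setting `DRY_RUN=1` before calling `apply_hotfix <hotfix_file_name>` prints the commands without changing anything, which allows a review before the service is restarted.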

Comment 8 Nick Carboni 2016-11-08 20:36:55 UTC
PR to bring the required changes back to the upstream darga release: https://github.com/ManageIQ/manageiq/pull/12513

Comment 9 Thomas Hennessy 2016-11-08 20:44:20 UTC
hotfix has been provided to the customer for SF 01733267

Comment 10 Nick Carboni 2016-11-10 14:22:08 UTC
Can this be closed as it is actually a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1380475 ?

Comment 11 Thomas Hennessy 2016-11-10 16:11:41 UTC
Nick,
it seems inappropriate to close this case, which contains the hotfix that other customers will need, in favor of the referenced BZ. That BZ contains no reference to the heartbeat issue or to the resulting fallout that is apparent in the evm.log, to which other customers moving to pglogical are exposed.

At the moment, this hotfix is being tested at the large US customer who has been most visibly impacted by this problem and for whom this BZ has been opened.

There is at least one other customer who has been impacted by the error as reported in this BZ, and the other BZ has no apparent relationship to that error.

I recommend that this BZ not be changed to closed.

Comment 12 Nick Carboni 2016-11-11 14:51:11 UTC
Okay, in that case the code change that this BZ actually yielded was this PR, which backports to 5.6 (darga) the fixes from the previous BZs that describe this problem: https://github.com/ManageIQ/manageiq/pull/12513

Satoe, is there any way we can get this BZ changed to "look like" a 5.6.z clone of the duplicate mentioned, since that is how this BZ is functioning?

I don't want people to view this as an issue for the 5.7 release, especially as the blocker flag was just added.

Comment 13 Satoe Imaishi 2016-11-11 15:06:35 UTC
Changed to 5.6.4 BZ.

Comment 14 Nick Carboni 2016-11-11 15:22:25 UTC
Thanks Satoe!

Comment 15 Thomas Hennessy 2016-11-17 13:16:35 UTC
from customer testing hotfix:
=====
Most recent comment: On 2016-11-17 04:01:54, Trieu, Daniel commented:
"Hello RedHat Team,

Updating the ticket:

The hotfix to address reported heartbeat failure issues associated with pglogical has been pushed to 3 (out of 8) regions. I confirmed within 30 minutes that the heartbeat issue was resolved.

Out of caution, the plan is to push the hotfix to another 2 regions today and another 3 regions the day after.

To be clear, there is another hotfix on this ticket for marshal errors, which is in test/dev/uat right now and has not been pushed to any production region.


Daniel"
=====

Comment 16 Gellert Kis 2017-01-05 15:09:37 UTC
Created attachment 1237702
hotfix contains modified files which correct replication_set_table_pkey violation v3

We received this updated version of the hotfix.

Comment 19 errata-xmlrpc 2017-03-09 17:03:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0474.html