Bug 1391997 - heartbeats are failing intermittently attributing problems to pglogical duplicate keys conditions and causing workers to restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.6.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.6.4
Assignee: Nick Carboni
QA Contact: Alex Newman
URL:
Whiteboard: replication
Depends On: 1380475
Blocks:
Reported: 2016-11-04 14:52 UTC by Thomas Hennessy
Modified: 2020-08-13 08:40 UTC (History)

Fixed In Version: 5.6.4.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-09 17:03:56 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:


Attachments
grep of 18428 containing representative error (12.64 MB, application/x-gzip)
2016-11-04 15:51 UTC, Thomas Hennessy
grep of pid 10404 with representative error (62.51 KB, application/x-gzip)
2016-11-04 15:55 UTC, Thomas Hennessy
grep of pid 10922 with representative error (70.47 KB, application/x-gzip)
2016-11-04 15:57 UTC, Thomas Hennessy
hotfix (5.43 KB, application/x-gzip)
2016-11-08 20:30 UTC, Nick Carboni
hotfix contains modified files which correct replication_set_table_pkey violation v3 (7.59 KB, application/zip)
2017-01-05 15:09 UTC, Gellert Kis


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2017:0474 | 0 | normal | SHIPPED_LIVE | CFME 5.6.4 bug fixes and enhancement update | 2017-03-09 22:03:14 UTC

Description Thomas Hennessy 2016-11-04 14:52:28 UTC
Description of problem: Several different workers raise errors when trying to perform their heartbeat. A representative error follows:
=====
/var/www/miq/vmdb/lib/extensions/ar_adapter/ar_pglogical/pglogical_raw.rb:301:in `async_exec': ERROR:  duplicate key value violates unique constraint "replication_set_table_pkey" (PG::UniqueViolation)
DETAIL:  Key (set_id, set_reloid)=(-299507980, 16492) already exists. 
=========
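
To check whether a log excerpt contains this kind of failure, the two lines above can be matched mechanically. The following is a minimal sketch, not part of the product: the function name `find_unique_violations` and the `SAMPLE` text are illustrative assumptions, and the patterns only cover the message wording shown in this report.

```python
import re

# Matches the ERROR line of a unique-constraint violation as worded above.
ERROR_RE = re.compile(
    r'duplicate key value violates unique constraint "(?P<constraint>[^"]+)"'
)
# Matches the DETAIL line naming the conflicting key columns and values.
DETAIL_RE = re.compile(
    r"Key \((?P<columns>[^)]*)\)=\((?P<values>[^)]*)\) already exists"
)


def find_unique_violations(lines):
    """Return one dict per violation, pairing each ERROR line with the
    first DETAIL line that follows it (hypothetical helper)."""
    violations = []
    current = None
    for line in lines:
        error = ERROR_RE.search(line)
        if error:
            current = {"constraint": error.group("constraint"), "key": None}
            violations.append(current)
            continue
        detail = DETAIL_RE.search(line)
        if detail and current is not None and current["key"] is None:
            columns = [c.strip() for c in detail.group("columns").split(",")]
            values = [v.strip() for v in detail.group("values").split(",")]
            current["key"] = dict(zip(columns, values))
    return violations


# Sample text shortened from the representative error in this report.
SAMPLE = """\
pglogical_raw.rb:301:in `async_exec': ERROR:  duplicate key value violates unique constraint "replication_set_table_pkey" (PG::UniqueViolation)
DETAIL:  Key (set_id, set_reloid)=(-299507980, 16492) already exists.
"""

if __name__ == "__main__":
    for violation in find_unique_violations(SAMPLE.splitlines()):
        print(violation)
```

Feeding the lines of a real log file to the same function, for example `find_unique_violations(open(path))`, would list every constraint and key pair that appears in it.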


Version-Release number of selected component (if applicable): 5.6.1.2


How reproducible: not yet known


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Gregg Tanzillo 2016-11-04 14:59:35 UTC
This may be a duplicate of BZ https://bugzilla.redhat.com/show_bug.cgi?id=1380475

Comment 3 Thomas Hennessy 2016-11-04 15:51:30 UTC
Created attachment 1217429 [details]
grep of 18428 containing representative error

Comment 4 Thomas Hennessy 2016-11-04 15:55:20 UTC
Created attachment 1217442 [details]
grep of pid 10404 with representative error

Comment 5 Thomas Hennessy 2016-11-04 15:57:20 UTC
Created attachment 1217443 [details]
grep of pid 10922 with representative error

Comment 7 Nick Carboni 2016-11-08 20:30:00 UTC
Created attachment 1218701 [details]
hotfix

This hotfix contains modified files which correct this behavior.

To apply the hotfix, on each server:
  - Put the .tgz file in the /var/www/miq/vmdb directory
  - Untar the file using `tar -xzvf <hotfix_file_name>`
  - Restart the evmserverd service using `systemctl restart evmserverd`
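
For anyone applying this to several servers, the steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not a supported tool: the function names and the `dry_run` flag are hypothetical, and it assumes the archive is extracted from inside /var/www/miq/vmdb after it has already been copied there.

```python
import subprocess
from pathlib import Path

# Directory named in the steps above.
VMDB_DIR = Path("/var/www/miq/vmdb")


def hotfix_commands(hotfix_file_name):
    """Build the untar and restart commands from the steps above, in order."""
    return [
        ["tar", "-xzvf", hotfix_file_name],
        ["systemctl", "restart", "evmserverd"],
    ]


def apply_hotfix(hotfix_file_name, dry_run=True):
    """Run the commands from VMDB_DIR (assumed working directory).

    With dry_run=True (the default) nothing is executed and the command
    list is only returned for review.
    """
    commands = hotfix_commands(hotfix_file_name)
    if dry_run:
        return commands
    archive = VMDB_DIR / hotfix_file_name
    if not archive.is_file():
        raise FileNotFoundError(f"hotfix archive not found: {archive}")
    for command in commands:
        # check=True stops at the first failing step instead of continuing.
        subprocess.run(command, cwd=VMDB_DIR, check=True)
    return commands


if __name__ == "__main__":
    # Placeholder file name; substitute the actual hotfix archive name.
    for command in apply_hotfix("<hotfix_file_name>"):
        print(" ".join(command))
```

The dry run is the default so that the commands can be reviewed before evmserverd is restarted on a production appliance.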

Comment 8 Nick Carboni 2016-11-08 20:36:55 UTC
PR to bring the required changes back to the upstream darga release: https://github.com/ManageIQ/manageiq/pull/12513

Comment 9 Thomas Hennessy 2016-11-08 20:44:20 UTC
hotfix has been provided to the customer for SF 01733267

Comment 10 Nick Carboni 2016-11-10 14:22:08 UTC
Can this be closed as it is actually a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1380475 ?

Comment 11 Thomas Hennessy 2016-11-10 16:11:41 UTC
Nick,
It seems inappropriate to close this case. It contains the hotfix that other customers will need, whereas the referenced BZ makes no mention of the heartbeat issue or of the resulting fallout that is apparent in the evm.log, to which other customers moving to pglogical are exposed.

At the moment, this hotfix is being tested at the large US customer who has been most visibly impacted by this problem and for whom this BZ was opened.

There is at least one other customer who has been impacted by the error as reported in this BZ, and to which the other BZ has no apparent relationship.

I recommend that this BZ not be changed to closed.

Comment 12 Nick Carboni 2016-11-11 14:51:11 UTC
Okay, in that case the code change that this BZ actually yielded is PR https://github.com/ManageIQ/manageiq/pull/12513, which backports to 5.6 (darga) the fixes from the previous BZs that describe this problem.

Satoe, is there any way we can get this BZ changed to "look like" a 5.6.z clone of the duplicate mentioned, since that is what this BZ is functioning as?

I don't want people to view this as an issue for the 5.7 release, especially as the blocker flag was just added.

Comment 13 Satoe Imaishi 2016-11-11 15:06:35 UTC
Changed to 5.6.4 BZ.

Comment 14 Nick Carboni 2016-11-11 15:22:25 UTC
Thanks Satoe!

Comment 15 Thomas Hennessy 2016-11-17 13:16:35 UTC
from customer testing hotfix:
=====
Most recent comment: On 2016-11-17 04:01:54, Trieu, Daniel commented:
"Hello RedHat Team,

Updating the ticket:

The hotfix to address reported heartbeat failure issues associated with pglogical has been pushed to 3 (out of 8) regions. I confirmed within 30 minutes that the heartbeat issue was resolved.

Out of caution, the plan is to push the hotfix to another 2 regions today and another 3 regions the day after.

To be clear, there is another hotfix on this ticket for marshal errors, which is in test/dev/uat right now and has not been pushed to any production region.


Daniel"
=====

Comment 16 Gellert Kis 2017-01-05 15:09:37 UTC
Created attachment 1237702 [details]
hotfix contains modified files which correct replication_set_table_pkey violation v3

We received this updated version of the hotfix.

Comment 19 errata-xmlrpc 2017-03-09 17:03:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0474.html

