Bug 1391997 - heartbeats are failing intermittently attributing problems to pglogical duplicate keys conditions and causing workers to restart
Status: CLOSED ERRATA
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.6.0
Hardware: x86_64 Linux
Priority: high Severity: high
Target Milestone: GA
Target Release: 5.6.4
Assigned To: Nick Carboni
QA Contact: Alex Newman
Whiteboard: replication
Keywords: ZStream
Depends On: 1380475
Blocks:
Reported: 2016-11-04 10:52 EDT by Thomas Hennessy
Modified: 2017-03-09 12:03 EST
CC List: 12 users

See Also:
Fixed In Version: 5.6.4.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-09 12:03:56 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
grep of 18428 containing representative error (12.64 MB, application/x-gzip)
2016-11-04 11:51 EDT, Thomas Hennessy
grep of pid 10404 with representative error (62.51 KB, application/x-gzip)
2016-11-04 11:55 EDT, Thomas Hennessy
grep of pid 10922 with representative error (70.47 KB, application/x-gzip)
2016-11-04 11:57 EDT, Thomas Hennessy
hotfix (5.43 KB, application/x-gzip)
2016-11-08 15:30 EST, Nick Carboni
hotfix contains modified files which correct replication_set_table_pkey violation v3 (7.59 KB, application/zip)
2017-01-05 10:09 EST, Gellert Kis


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0474 normal SHIPPED_LIVE CFME 5.6.4 bug fixes and enhancement update 2017-03-09 17:03:14 EST

Description Thomas Hennessy 2016-11-04 10:52:28 EDT
Description of problem: Multiple different workers raise errors when trying to perform a heartbeat. A representative error follows:
=====
/var/www/miq/vmdb/lib/extensions/ar_adapter/ar_pglogical/pglogical_raw.rb:301:in `async_exec': ERROR:  duplicate key value violates unique constraint "replication_set_table_pkey" (PG::UniqueViolation)
DETAIL:  Key (set_id, set_reloid)=(-299507980, 16492) already exists. 
=========
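
To check whether a given log shows the same failure, the constraint name from the error above can be searched for directly. The following is only a minimal sketch and not part of the original report: the function name `count_pkey_violations` is hypothetical, and the log path in the usage comment is an assumption that may differ on a given appliance.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the log lines that report the duplicate key
# error on the replication_set_table_pkey constraint quoted in this report.
count_pkey_violations() {
  local log_file="$1"
  # -c prints the number of matching lines; -F matches the pattern literally.
  # grep exits with status 1 when the count is 0, but still prints "0".
  grep -cF 'violates unique constraint "replication_set_table_pkey"' "$log_file"
}

# Usage (the path is an assumption, adjust as needed):
#   count_pkey_violations /var/www/miq/vmdb/log/evm.log
```

A non-zero count only indicates that the same constraint violation was logged; it does not by itself confirm that heartbeats failed or that workers restarted.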


Version-Release number of selected component (if applicable): 5.6.1.2


How reproducible: not yet known


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Comment 2 Gregg Tanzillo 2016-11-04 10:59:35 EDT
This may be a duplicate of BZ https://bugzilla.redhat.com/show_bug.cgi?id=1380475
Comment 3 Thomas Hennessy 2016-11-04 11:51 EDT
Created attachment 1217429
grep of 18428 containing representative error
Comment 4 Thomas Hennessy 2016-11-04 11:55 EDT
Created attachment 1217442
grep of pid 10404 with representative error
Comment 5 Thomas Hennessy 2016-11-04 11:57 EDT
Created attachment 1217443
grep of pid 10922 with representative error
Comment 7 Nick Carboni 2016-11-08 15:30 EST
Created attachment 1218701
hotfix

This hotfix contains modified files which correct this behavior.

To apply the hotfix, on each server:
  - Put the .tgz file in the /var/www/miq/vmdb directory
  - Untar the file using `tar -xzvf <hotfix_file_name>`
  - Restart the evmserverd service using `systemctl restart evmserverd`
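
The three steps above can be sketched as a small script. This is an illustration under stated assumptions, not a supported tool: `apply_hotfix`, `run`, and the `DRY_RUN` switch are hypothetical names introduced here, and `tar -C` is used in place of changing into the directory first. Setting `DRY_RUN=1` prints the commands instead of executing them, which may be useful for review before touching a server.

```shell
#!/usr/bin/env bash
# Sketch of the hotfix steps. apply_hotfix, run, and DRY_RUN are
# hypothetical names; only the directory and commands come from the steps above.
VMDB_DIR="/var/www/miq/vmdb"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

apply_hotfix() {
  local hotfix="$1"   # path to the downloaded .tgz file
  local name
  name="$(basename "$hotfix")"
  # Step 1: put the .tgz file in the vmdb directory.
  run cp "$hotfix" "$VMDB_DIR/" || return 1
  # Step 2: untar it there (-C extracts into the given directory).
  run tar -xzvf "$VMDB_DIR/$name" -C "$VMDB_DIR" || return 1
  # Step 3: restart the evmserverd service.
  run systemctl restart evmserverd
}

# Usage (run on each server, typically as root):
#   DRY_RUN=1 apply_hotfix ./<hotfix_file_name>   # preview only
#   apply_hotfix ./<hotfix_file_name>             # apply and restart
```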
Comment 8 Nick Carboni 2016-11-08 15:36:55 EST
PR to bring the required changes back to the upstream darga release: https://github.com/ManageIQ/manageiq/pull/12513
Comment 9 Thomas Hennessy 2016-11-08 15:44:20 EST
The hotfix has been provided to the customer for SF 01733267.
Comment 10 Nick Carboni 2016-11-10 09:22:08 EST
Can this be closed, since it is actually a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1380475 ?
Comment 11 Thomas Hennessy 2016-11-10 11:11:41 EST
Nick,
It seems inappropriate to close this case, which contains the hotfix that other customers will need, in favor of the referenced BZ. That BZ contains no reference to the heartbeat issue or to the fallout from it, which is apparent in the evm.log and to which other customers who are moving to pglogical are exposed.

At the moment, this hotfix is being tested at the large US customer who has been most visibly impacted by this problem and for whom this BZ was opened.

At least one other customer has been impacted by the error as reported in this BZ, and the other BZ has no apparent relationship to that error.

I recommend that this BZ not be changed to closed.
Comment 12 Nick Carboni 2016-11-11 09:51:11 EST
Okay. In that case, the code change that this BZ actually yielded is the following PR, which backports the fixes from the previous BZs describing this problem to 5.6 (darga):
https://github.com/ManageIQ/manageiq/pull/12513

Satoe, is there any way we can get this BZ changed to "look like" a 5.6.z clone of the duplicate mentioned, since that is what this BZ is functioning as?

I don't want people to view this as an issue for the 5.7 release, especially as the blocker flag was just added.
Comment 13 Satoe Imaishi 2016-11-11 10:06:35 EST
Changed to 5.6.4 BZ.
Comment 14 Nick Carboni 2016-11-11 10:22:25 EST
Thanks Satoe!
Comment 15 Thomas Hennessy 2016-11-17 08:16:35 EST
From the customer testing the hotfix:
=====
Most recent comment: On 2016-11-17 04:01:54, Trieu, Daniel commented:
"Hello RedHat Team,

Updating the ticket:

The hotfix to address reported heartbeat failure issues associated with pglogical has been pushed to 3 (out of 8) regions. I confirmed within 30 minutes that the heartbeat issue was resolved.

Out of caution, the plan is to push the hotfix to another 2 regions today and another 3 regions the day after.

To be clear, there is another hotfix on this ticket for marshal errors, which is in test/dev/uat right now and has not been pushed to any production region.


Daniel"
=====
Comment 16 Gellert Kis 2017-01-05 10:09 EST
Created attachment 1237702
hotfix contains modified files which correct replication_set_table_pkey violation v3

We got this updated version of the hotfix.
Comment 19 errata-xmlrpc 2017-03-09 12:03:56 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0474.html
