Red Hat Bugzilla – Bug 1391997
heartbeats are failing intermittently attributing problems to pglogical duplicate keys conditions and causing workers to restart
Last modified: 2017-03-09 12:03:56 EST
Description of problem:
errors in multiple different workers when trying to perform heartbeat. Representative error follows:
=====
/var/www/miq/vmdb/lib/extensions/ar_adapter/ar_pglogical/pglogical_raw.rb:301:in `async_exec': ERROR: duplicate key value violates unique constraint "replication_set_table_pkey" (PG::UniqueViolation)
DETAIL: Key (set_id, set_reloid)=(-299507980, 16492) already exists.
=========
Version-Release number of selected component (if applicable):
5.6.1.2
How reproducible:
not yet known
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
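When triaging evm.log for this failure, it can help to pull the constraint name and the conflicting key out of the error text. The following is only a minimal sketch: the function name `parse_unique_violation` and both regular expressions are hypothetical helpers written against the message format quoted above, not part of ManageIQ or pglogical.

```python
import re

# Matches the PostgreSQL unique-violation text quoted in this report.
# These patterns are assumptions based on that single sample message.
VIOLATION_RE = re.compile(
    r'duplicate key value violates unique constraint "(?P<constraint>[^"]+)"'
)
DETAIL_RE = re.compile(
    r'Key \((?P<cols>[^)]*)\)=\((?P<vals>[^)]*)\) already exists'
)


def parse_unique_violation(text):
    """Return the constraint name and key columns/values, or None if absent."""
    violation = VIOLATION_RE.search(text)
    if violation is None:
        return None
    result = {"constraint": violation.group("constraint"), "key": {}}
    detail = DETAIL_RE.search(text)
    if detail is not None:
        cols = [c.strip() for c in detail.group("cols").split(",")]
        vals = [v.strip() for v in detail.group("vals").split(",")]
        # Values are kept as strings, exactly as they appear in the log.
        result["key"] = dict(zip(cols, vals))
    return result


if __name__ == "__main__":
    sample = (
        'ERROR: duplicate key value violates unique constraint '
        '"replication_set_table_pkey" (PG::UniqueViolation) '
        'DETAIL: Key (set_id, set_reloid)=(-299507980, 16492) already exists.'
    )
    print(parse_unique_violation(sample))
```

Grouping the parsed results by constraint and key would show whether every worker is failing on the same replication set row, as in the representative error.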
This may be a duplicate of BZ https://bugzilla.redhat.com/show_bug.cgi?id=1380475
Created attachment 1217429 [details] grep of 18428 containing representative error
Created attachment 1217442 [details] grep of pid 10404 with representative error
Created attachment 1217443 [details] grep of pid 10922 with representative error
Created attachment 1218701 [details]
hotfix
This hotfix contains modified files which correct this behavior. To apply the hotfix, on each server:
- Put the .tgz file in the /var/www/miq/vmdb directory
- Untar the file using `tar -xzvf <hotfix_file_name>`
- Restart the evmserverd service using `systemctl restart evmserverd`
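For appliances where the hotfix has to be applied repeatedly, the steps above could be scripted roughly as follows. This is a sketch under stated assumptions: `hotfix_commands` and `apply_hotfix` are hypothetical helper names, the .tgz file is assumed to have already been copied into /var/www/miq/vmdb, and the script is assumed to run with sufficient privileges to restart the service. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Directory named in the hotfix instructions.
VMDB_DIR = "/var/www/miq/vmdb"


def hotfix_commands(hotfix_file_name):
    """Return the two commands from the hotfix instructions, in order."""
    return [
        ["tar", "-xzvf", hotfix_file_name],
        ["systemctl", "restart", "evmserverd"],
    ]


def apply_hotfix(hotfix_file_name, dry_run=True):
    """Print the commands (dry run) or execute them from the vmdb directory."""
    for cmd in hotfix_commands(hotfix_file_name):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing step, so the service is
            # not restarted if the archive could not be unpacked.
            subprocess.run(cmd, cwd=VMDB_DIR, check=True)


if __name__ == "__main__":
    # "hotfix.tgz" is a placeholder; substitute the attachment's file name.
    apply_hotfix("hotfix.tgz", dry_run=True)
```

Keeping the dry run as the default makes it easy to review exactly what would run on each server before passing `dry_run=False`.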
PR to bring the required changes back to the upstream darga release: https://github.com/ManageIQ/manageiq/pull/12513
The hotfix has been provided to the customer for SF 01733267.
Can this be closed as it is actually a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1380475 ?
Nick, it seems inappropriate to close this case, which contains the hotfix that other customers will need, in favor of the referenced BZ. That BZ contains no reference to the heartbeat issue or to the fallout from it, which is apparent in the evm.log and to which other customers who are moving to pglogical are exposed. At the moment, this hotfix is being tested at the large US customer who has been most visibly impacted by this problem and for whom this BZ was opened. At least one other customer has been impacted by the error as reported in this BZ, and the other BZ has no apparent relationship to it. I recommend that this BZ not be closed.
Okay, in that case the code change that this BZ actually yielded was this PR https://github.com/ManageIQ/manageiq/pull/12513 which backports the fixes from the previous BZs that describe this problem to 5.6 (darga).
Satoe, is there any way we can get this BZ changed to "look like" a 5.6.z clone of the duplicate mentioned, since that is what this BZ is functioning as? I don't want people to view this as an issue for the 5.7 release, especially as the blocker flag was just added.
Changed to 5.6.4 BZ.
Thanks Satoe!
From customer testing hotfix:
=====
Most recent comment: On 2016-11-17 04:01:54, Trieu, Daniel commented:
"Hello RedHat Team,
Updating the ticket: The hotfix to address reported heartbeat failure issues associated with pglogical has been pushed to 3 (out of 8) regions. I confirmed within 30 minutes that the heartbeat issue was resolved. Out of caution, the plan is to push the hotfix to another 2 regions today and another 3 regions the day after.
To be clear, there is another hotfix on this ticket for marshal errors, which is in test/dev/uat right now and has not been pushed to any production region.
Daniel"
=====
Created attachment 1237702 [details]
hotfix contains modified files which correct replication_set_table_pkey violation v3
We received this updated version of the hotfix.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0474.html