Previously, the DB2 resource agent failed to promote a Slave resource to Master. As a consequence, automatic failover of the Master resource was not possible. With this update, additional keywords for disconnected peer have been added to the agent. As a result, the DB2 agent is able to detect the resource state correctly, and the described problem no longer occurs.
Created attachment 1357294
proposed fix
=== Description of problem:
The DB2 resource agent fails with exit code 1 during promotion when the PRIMARY (Master)
was killed or when we cannot communicate with it. This prevents automatic
failover of the Master resource when the node running the Master was fenced. The Slave
stays a Slave indefinitely and manual intervention is required.
=== Version-Release number of selected component (if applicable):
resource-agents-3.9.5-105.el7_4.2
=== How reproducible:
Always
=== Steps to Reproduce:
1. Configure DB2 HADR in pacemaker cluster - https://access.redhat.com/articles/3172731
2. Let the DB2_HADR Master/Slave resource properly start
3. Fence node running Master DB2_HADR resource
=== Actual results:
The Slave fails the promote operation and the DB2 resource is restarted.
Nov 21 18:16:54 [22674] fastvm-rhel-7-4-161 lrmd: info: log_execute: executing - rsc:DB2_HADR action:promote call_id:93
Nov 21 18:16:54 db2(DB2_HADR)[11786]: INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
Nov 21 18:16:54 db2(DB2_HADR)[11786]: INFO: running db2 takeover hadr on db sample
Nov 21 18:16:59 [22672] fastvm-rhel-7-4-161 cib: info: cib_process_ping: Reporting our current digest to fastvm-rhel-7-4-161: 9f793423678d5df46c9e3c76dec92687 for 0.10.124 (0x55adf6758eb0 0)
Nov 21 18:17:07 db2(DB2_HADR)[11786]: INFO: output: SQL1770N Takeover HADR cannot complete. Reason code = "7".
Nov 21 18:17:08 db2(DB2_HADR)[11786]: INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
Nov 21 18:17:08 [22674] fastvm-rhel-7-4-161 lrmd: info: log_finished: finished - rsc:DB2_HADR action:promote call_id:93 pid:11786 exit-code:1 exec-time:13565ms queue-time:0ms
After the DB2 resource restarts, it cannot be promoted without manual intervention, and the following warning is logged:
db2(DB2_HADR)[xxx]: WARNING: DB2 database db2inst1(0)/sample in status STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted
=== Expected results:
The Slave is promoted to Master if the issue is detected within HADR_PEER_WINDOW and the promotion timeout. Below is an example of the expected behaviour, taken from an edited resource agent.
Note: additional debugging messages were added to show what is happening.
Nov 21 18:25:58 [22674] fastvm-rhel-7-4-161 lrmd: info: log_execute: executing - rsc:DB2_HADR action:promote call_id:177
Nov 21 18:25:58 db2(DB2_HADR)[28023]: INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
Nov 21 18:25:58 db2(DB2_HADR)[28023]: INFO: running db2 takeover hadr on db sample
Nov 21 18:26:03 [22672] fastvm-rhel-7-4-161 cib: info: cib_process_ping: Reporting our current digest to fastvm-rhel-7-4-161: 92352f43210d7ef1a10d25678d628cbd for 0.10.256 (0x55adf6863980 0)
Nov 21 18:26:10 db2(DB2_HADR)[28023]: INFO: output: SQL1770N Takeover HADR cannot complete. Reason code = "7".
Nov 21 18:26:11 db2(DB2_HADR)[28023]: INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
Nov 21 18:26:11 db2(DB2_HADR)[28023]: INFO: running db2 takeover hadr on db sample by force peer window only
Nov 21 18:26:14 db2(DB2_HADR)[28023]: INFO: SUCCESS db2 takeover hadr on db sample by force peer window only
Nov 21 18:26:16 db2(DB2_HADR)[28023]: INFO: DB20000I The ARCHIVE LOG command completed successfully.
Nov 21 18:26:16 [22674] fastvm-rhel-7-4-161 lrmd: info: log_finished: finished - rsc:DB2_HADR action:promote call_id:177 pid:28023 exit-code:0 exec-time:18310ms queue-time:0ms
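The difference between the failing and the expected promote can be read as a decision on the HADR status string that the agent logs. The Python sketch below only illustrates that decision. The real agent is a shell script, the function name `takeover_command` is hypothetical, and the mapping is inferred from the log messages in this report rather than taken from the attached patch.

```python
from typing import Optional


def takeover_command(hadr_status: str, db: str) -> Optional[str]:
    """Return the takeover command suggested by the logs, or None.

    hadr_status is the three-part string the agent prints,
    e.g. "STANDBY/PEER/CONNECTED" (assumed: role/state/connection status).
    """
    parts = hadr_status.split("/")
    if len(parts) != 3:
        raise ValueError(f"unexpected HADR status: {hadr_status!r}")
    role, state, _connection = parts

    if role != "STANDBY":
        # Only a standby is a promotion candidate in this sketch.
        return None
    if state == "PEER":
        # Peer still reachable: plain takeover, as in the first log line.
        return f"db2 takeover hadr on db {db}"
    if state == "DISCONNECTED_PEER":
        # Peer lost but still inside the peer window: forced takeover.
        return f"db2 takeover hadr on db {db} by force peer window only"
    # e.g. REMOTE_CATCHUP_PENDING: logged as "can never be promoted".
    return None
```

For the statuses seen above, `takeover_command("STANDBY/DISCONNECTED_PEER/DISCONNECTED", "sample")` yields the forced takeover that the edited agent runs at 18:26:11, while `STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED` yields no command.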
=== Additional info:
Attached to this BZ is the proposed fix, together with the changes made during testing to add debugging.
So far this was tested only with DB2 10.5 on a 2-node pacemaker cluster running the latest RHEL 7.4.
Graceful promotion works (that is, when the primary is up and running and we issue 'pcs resource move DB2_HADR --master'). Manual takeover using only DB2 utilities also works in the described scenario.
== Additional information:
DB2 11.1 on the same RHEL 7.4 experiences the same issue.
Note that the original package now at least shows the first error message that it failed with.
The code still relies on the second for loop to perform the takeover with peer window, and that works
with the patched package.
=== original package - promote fails
Nov 22 13:56:41 [1310] fastvm-rhel-7-4-164 lrmd: info: log_execute: executing - rsc:DB2_HADR action:promote call_id:26
Nov 22 13:56:45 db2(DB2_HADR)[4776]: INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
Nov 22 13:56:46 [1308] fastvm-rhel-7-4-164 cib: info: cib_process_ping: Reporting our current digest to fastvm-rhel-7-4-164: b94a9db433150f5a8109e2fe6bc33eba for 0.10.90 (0x5574388c19f0 0)
Nov 22 13:56:49 db2(DB2_HADR)[4776]: ERROR: DB2 database db2inst1(0)/sample promote failed: SQL1770N Takeover HADR cannot complete. Reason code = "1".
Nov 22 13:56:49 [1310] fastvm-rhel-7-4-164 lrmd: info: log_finished: finished - rsc:DB2_HADR action:promote call_id:26 pid:4776 exit-code:1 exec-time:7622ms queue-time:0ms
Nov 22 13:56:49 [1314] fastvm-rhel-7-4-164 crmd: notice: process_lrm_event: Result of promote operation for DB2_HADR on fastvm-rhel-7-4-164: 1 (unknown error) | call=26 key=DB2_HADR_promote_0 confirmed=true cib-update=47
=== updated package with patch proposed to upstream - promote works
Nov 22 14:03:11 [1310] fastvm-rhel-7-4-164 lrmd: info: log_execute: executing - rsc:DB2_HADR action:promote call_id:68
Nov 22 14:03:11 db2(DB2_HADR)[14174]: INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
Nov 22 14:03:15 [1308] fastvm-rhel-7-4-164 cib: info: cib_process_ping: Reporting our current digest to fastvm-rhel-7-4-164: 193e41f6eded9a5c61cdcc23d4c8f925 for 0.10.151 (0x55743851a0d0 0)
Nov 22 14:03:28 db2(DB2_HADR)[14174]: INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
Nov 22 14:03:40 db2(DB2_HADR)[14174]: INFO: DB20000I The ARCHIVE LOG command completed successfully.
Nov 22 14:03:40 [1310] fastvm-rhel-7-4-164 lrmd: info: log_finished: finished - rsc:DB2_HADR action:promote call_id:68 pid:14174 exit-code:0 exec-time:29627ms queue-time:0ms
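When comparing the original and patched packages, the relevant signal is the `exit-code` on the lrmd `log_finished` line. A minimal sketch for pulling that out of log lines like the ones quoted in this report follows. The regular expression is written against the format shown here only, and the helper name `parse_finished` is hypothetical.

```python
import re
from typing import Dict, Optional

# Matches the lrmd "log_finished" lines as they appear in this report.
FINISHED_RE = re.compile(
    r"log_finished:\s+finished\s+-\s+rsc:(?P<rsc>\S+)\s+action:(?P<action>\S+)"
    r"\s+call_id:(?P<call_id>\d+)\s+pid:(?P<pid>\d+)"
    r"\s+exit-code:(?P<exit_code>\d+)\s+exec-time:(?P<exec_ms>\d+)ms"
)


def parse_finished(line: str) -> Optional[Dict[str, object]]:
    """Return resource, action, exit code and exec time, or None if no match."""
    m = FINISHED_RE.search(line)
    if m is None:
        return None
    return {
        "rsc": m.group("rsc"),
        "action": m.group("action"),
        "call_id": int(m.group("call_id")),
        "pid": int(m.group("pid")),
        "exit_code": int(m.group("exit_code")),
        "exec_ms": int(m.group("exec_ms")),
    }
```

Applied to the two `log_finished` lines above, this reports `exit_code` 1 for call_id 26 (original package) and 0 for call_id 68 (patched package).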
Comment 10 Monika Muzikovska
2018-01-12 13:04:58 UTC
Marking Verified as SanityOnly in version resource-agents-3.9.5-119.el7.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:0757