Bug 1516180

Summary: db2 resource agent fails to promote Slave when Master has crashed/failed abruptly
Product: Red Hat Enterprise Linux 7 Reporter: Ondrej Faměra <ofamera>
Component: resource-agentsAssignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: high Docs Contact:
Priority: high    
Version: 7.4CC: agk, aherr, cfeist, cluster-maint, fdinitto, jruemker, mmuzikov, mnovacek, qguo, sbradley
Target Milestone: rcKeywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: resource-agents-3.9.5-115.el7 Doc Type: If docs needed, set a value
Doc Text:
Previously, the DB2 resource agent failed to promote a Slave resource to Master. As a consequence, automatic failover of the Master resource was not possible. With this update, additional keywords for disconnected peer have been added to the agent. As a result, the DB2 agent is able to detect the resource state correctly, and the described problem no longer occurs.
Story Points: ---
Clone Of:
: 1521019 (view as bug list) Environment:
Last Closed: 2018-04-10 12:09:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1521019    
Attachments:
Description Flags
proposed fix
none
extra debugging seen in outputs from description none

Description Ondrej Faměra 2017-11-22 08:36:21 UTC
Created attachment 1357294 [details]
proposed fix

=== Description of problem:
DB2 resource agent fails with exit code 1 during promotion when the PRIMARY (Master)
was killed or when we cannot communicate with it. This is basically preventing automatic
failover of Master resource when the node running Master was fenced. Slave will
indefinatelly stay as Slave and manual intervention is required.

=== Version-Release number of selected component (if applicable):
resource-agents-3.9.5-105.el7_4.2

=== How reproducible:
Always

=== Steps to Reproduce:
1. Configure DB2 HADR in pacemaker cluster - https://access.redhat.com/articles/3172731
2. Let the DB2_HADR Master/Slave resource properly start
3. Fence node running Master DB2_HADR resource

=== Actual results:
Slave will fail the promote and restart the DB2 resource.

  Nov 21 18:16:54 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_execute:  executing - rsc:DB2_HADR action:promote call_id:93
  Nov 21 18:16:54  db2(DB2_HADR)[11786]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 21 18:16:54  db2(DB2_HADR)[11786]:    INFO: running    db2 takeover hadr on db sample
  Nov 21 18:16:59 [22672] fastvm-rhel-7-4-161        cib:     info: cib_process_ping:     Reporting our current digest to fastvm-rhel-7-4-161: 9f793423678d5df46c9e3c76dec92687 for 0.10.124 (0x55adf6758eb0 0)
  Nov 21 18:17:07  db2(DB2_HADR)[11786]:    INFO: output:  SQL1770N  Takeover HADR cannot complete. Reason code = "7".
  Nov 21 18:17:08  db2(DB2_HADR)[11786]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
  Nov 21 18:17:08 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_finished: finished - rsc:DB2_HADR action:promote call_id:93 pid:11786 exit-code:1 exec-time:13565ms queue-time:0ms

After restart of DB2 resource we won't be able to promote it until manual intervention and we will be receiving following warnings

  db2(DB2_HADR)[xxx]:    WARNING: DB2 database db2inst1(0)/sample in status STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted

=== Expected results:
The Slave will be promoted to Master if the issue is detected within HADR_PEER_WINDOW and promotion timeout. Below is example from edited resource agent that is expected.
Note: additional debugging messages were added so we see what is happening.

  Nov 21 18:25:58 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_execute:  executing - rsc:DB2_HADR action:promote call_id:177
  Nov 21 18:25:58  db2(DB2_HADR)[28023]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 21 18:25:58  db2(DB2_HADR)[28023]:    INFO: running    db2 takeover hadr on db sample
  Nov 21 18:26:03 [22672] fastvm-rhel-7-4-161        cib:     info: cib_process_ping:     Reporting our current digest to fastvm-rhel-7-4-161: 92352f43210d7ef1a10d25678d628cbd for 0.10.256 (0x55adf6863980 0)
  Nov 21 18:26:10  db2(DB2_HADR)[28023]:    INFO: output:  SQL1770N  Takeover HADR cannot complete. Reason code = "7".
  Nov 21 18:26:11  db2(DB2_HADR)[28023]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
  Nov 21 18:26:11  db2(DB2_HADR)[28023]:    INFO: running    db2 takeover hadr on db sample by force peer window only
  Nov 21 18:26:14  db2(DB2_HADR)[28023]:    INFO: SUCCESS    db2 takeover hadr on db sample by force peer window only
  Nov 21 18:26:16  db2(DB2_HADR)[28023]:    INFO: DB20000I  The ARCHIVE LOG command completed successfully.
  Nov 21 18:26:16 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_finished: finished - rsc:DB2_HADR action:promote call_id:177 pid:28023 exit-code:0 exec-time:18310ms queue-time:0ms


=== Additional info:
Attached to this BZ is proposed fix and changes that were done during testing with additional debugging.
This was tested so far only on DB 10.5 and 2-node pacemaker cluster latest RHEL 7.4.
The graceful promotion works (=when primary is up and running and we issue 'pcs resource move DB2_HADR --master'). Manual takeover with DB2 only utilities also works in the described scenario.

Comment 2 Ondrej Faměra 2017-11-22 08:42:24 UTC
Created attachment 1357296 [details]
extra debugging seen in outputs from description

Comment 4 Oyvind Albrigtsen 2017-11-22 11:30:33 UTC
https://github.com/ClusterLabs/resource-agents/pull/1060

Comment 6 Ondrej Faměra 2017-11-22 13:20:43 UTC
== Additional information:
DB2 11.1 on same RHEL 7.4 experience the same issue.
Note that original package at least shows now the first error message that it failed with.
Still the code relies on the second for loop to get the takeover with peer window and that works
with patched package.

=== original package - promote fails
  Nov 22 13:56:41 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_execute:   executing - rsc:DB2_HADR action:promote call_id:26
  Nov 22 13:56:45  db2(DB2_HADR)[4776]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 22 13:56:46 [1308] fastvm-rhel-7-4-164        cib:     info: cib_process_ping:      Reporting our current digest to fastvm-rhel-7-4-164: b94a9db433150f5a8109e2fe6bc33eba for 0.10.90 (0x5574388c19f0 0)
  Nov 22 13:56:49  db2(DB2_HADR)[4776]:    ERROR: DB2 database db2inst1(0)/sample promote failed: SQL1770N  Takeover HADR cannot complete. Reason code = "1".
  Nov 22 13:56:49 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_finished:  finished - rsc:DB2_HADR action:promote call_id:26 pid:4776 exit-code:1 exec-time:7622ms queue-time:0ms
  Nov 22 13:56:49 [1314] fastvm-rhel-7-4-164       crmd:   notice: process_lrm_event:     Result of promote operation for DB2_HADR on fastvm-rhel-7-4-164: 1 (unknown error) | call=26 key=DB2_HADR_promote_0 confirmed=true cib-update=47

=== updated package with patch proposed to upstream - promote works
  Nov 22 14:03:11 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_execute:   executing - rsc:DB2_HADR action:promote call_id:68
  Nov 22 14:03:11  db2(DB2_HADR)[14174]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 22 14:03:15 [1308] fastvm-rhel-7-4-164        cib:     info: cib_process_ping:      Reporting our current digest to fastvm-rhel-7-4-164: 193e41f6eded9a5c61cdcc23d4c8f925 for 0.10.151 (0x55743851a0d0 0)
  Nov 22 14:03:28  db2(DB2_HADR)[14174]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
  Nov 22 14:03:40  db2(DB2_HADR)[14174]:    INFO: DB20000I  The ARCHIVE LOG command completed successfully.
  Nov 22 14:03:40 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_finished:  finished - rsc:DB2_HADR action:promote call_id:68 pid:14174 exit-code:0 exec-time:29627ms queue-time:0ms

Comment 10 Monika Muzikovska 2018-01-12 13:04:58 UTC
Marking Verified as SanityOnly in version resource-agents-3.9.5-119.el7.

Comment 13 errata-xmlrpc 2018-04-10 12:09:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0757