1516180 – db2 resource agent fails to promote Slave when Master has crashed/failed abruptly

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1516180 - db2 resource agent fails to promote Slave when Master has crashed/failed abruptly

Summary: db2 resource agent fails to promote Slave when Master has crashed/failed abru...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	resource-agents
Sub Component:
Version:	7.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Oyvind Albrigtsen
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1521019
TreeView+	depends on / blocked

Reported:	2017-11-22 08:36 UTC by Ondrej Faměra
Modified:	2021-03-11 16:21 UTC (History)
CC List:	10 users (show)
Fixed In Version:	resource-agents-3.9.5-115.el7
Doc Type:	If docs needed, set a value
Doc Text:	Previously, the DB2 resource agent failed to promote a Slave resource to Master. As a consequence, automatic failover of the Master resource was not possible. With this update, additional keywords for disconnected peer have been added to the agent. As a result, the DB2 agent is able to detect the resource state correctly, and the described problem no longer occurs.
Clone Of:
Clones:	1521019 (view as bug list)
Environment:
Last Closed:	2018-04-10 12:09:28 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
proposed fix (423 bytes, patch) 2017-11-22 08:36 UTC, Ondrej Faměra	no flags	Details \| Diff
extra debugging seen in outputs from description (697 bytes, patch) 2017-11-22 08:42 UTC, Ondrej Faměra	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3250061	0	None	None	None	2017-11-27 14:46:31 UTC
Red Hat Product Errata	RHBA-2018:0757	0	None	None	None	2018-04-10 12:10:40 UTC

Description Ondrej Faměra 2017-11-22 08:36:21 UTC

Created attachment 1357294 [details]
proposed fix

=== Description of problem:
DB2 resource agent fails with exit code 1 during promotion when the PRIMARY (Master)
was killed or when we cannot communicate with it. This is basically preventing automatic
failover of Master resource when the node running Master was fenced. Slave will
indefinatelly stay as Slave and manual intervention is required.

=== Version-Release number of selected component (if applicable):
resource-agents-3.9.5-105.el7_4.2

=== How reproducible:
Always

=== Steps to Reproduce:
1. Configure DB2 HADR in pacemaker cluster - https://access.redhat.com/articles/3172731
2. Let the DB2_HADR Master/Slave resource properly start
3. Fence node running Master DB2_HADR resource

=== Actual results:
Slave will fail the promote and restart the DB2 resource.

  Nov 21 18:16:54 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_execute:  executing - rsc:DB2_HADR action:promote call_id:93
  Nov 21 18:16:54  db2(DB2_HADR)[11786]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 21 18:16:54  db2(DB2_HADR)[11786]:    INFO: running    db2 takeover hadr on db sample
  Nov 21 18:16:59 [22672] fastvm-rhel-7-4-161        cib:     info: cib_process_ping:     Reporting our current digest to fastvm-rhel-7-4-161: 9f793423678d5df46c9e3c76dec92687 for 0.10.124 (0x55adf6758eb0 0)
  Nov 21 18:17:07  db2(DB2_HADR)[11786]:    INFO: output:  SQL1770N  Takeover HADR cannot complete. Reason code = "7".
  Nov 21 18:17:08  db2(DB2_HADR)[11786]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
  Nov 21 18:17:08 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_finished: finished - rsc:DB2_HADR action:promote call_id:93 pid:11786 exit-code:1 exec-time:13565ms queue-time:0ms

After restart of DB2 resource we won't be able to promote it until manual intervention and we will be receiving following warnings

  db2(DB2_HADR)[xxx]:    WARNING: DB2 database db2inst1(0)/sample in status STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted

=== Expected results:
The Slave will be promoted to Master if the issue is detected within HADR_PEER_WINDOW and promotion timeout. Below is example from edited resource agent that is expected.
Note: additional debugging messages were added so we see what is happening.

  Nov 21 18:25:58 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_execute:  executing - rsc:DB2_HADR action:promote call_id:177
  Nov 21 18:25:58  db2(DB2_HADR)[28023]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 21 18:25:58  db2(DB2_HADR)[28023]:    INFO: running    db2 takeover hadr on db sample
  Nov 21 18:26:03 [22672] fastvm-rhel-7-4-161        cib:     info: cib_process_ping:     Reporting our current digest to fastvm-rhel-7-4-161: 92352f43210d7ef1a10d25678d628cbd for 0.10.256 (0x55adf6863980 0)
  Nov 21 18:26:10  db2(DB2_HADR)[28023]:    INFO: output:  SQL1770N  Takeover HADR cannot complete. Reason code = "7".
  Nov 21 18:26:11  db2(DB2_HADR)[28023]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
  Nov 21 18:26:11  db2(DB2_HADR)[28023]:    INFO: running    db2 takeover hadr on db sample by force peer window only
  Nov 21 18:26:14  db2(DB2_HADR)[28023]:    INFO: SUCCESS    db2 takeover hadr on db sample by force peer window only
  Nov 21 18:26:16  db2(DB2_HADR)[28023]:    INFO: DB20000I  The ARCHIVE LOG command completed successfully.
  Nov 21 18:26:16 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_finished: finished - rsc:DB2_HADR action:promote call_id:177 pid:28023 exit-code:0 exec-time:18310ms queue-time:0ms


=== Additional info:
Attached to this BZ is proposed fix and changes that were done during testing with additional debugging.
This was tested so far only on DB 10.5 and 2-node pacemaker cluster latest RHEL 7.4.
The graceful promotion works (=when primary is up and running and we issue 'pcs resource move DB2_HADR --master'). Manual takeover with DB2 only utilities also works in the described scenario.

Comment 2 Ondrej Faměra 2017-11-22 08:42:24 UTC

Created attachment 1357296 [details]
extra debugging seen in outputs from description

Comment 4 Oyvind Albrigtsen 2017-11-22 11:30:33 UTC

https://github.com/ClusterLabs/resource-agents/pull/1060

Comment 6 Ondrej Faměra 2017-11-22 13:20:43 UTC

== Additional information:
DB2 11.1 on same RHEL 7.4 experience the same issue.
Note that original package at least shows now the first error message that it failed with.
Still the code relies on the second for loop to get the takeover with peer window and that works
with patched package.

=== original package - promote fails
  Nov 22 13:56:41 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_execute:   executing - rsc:DB2_HADR action:promote call_id:26
  Nov 22 13:56:45  db2(DB2_HADR)[4776]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 22 13:56:46 [1308] fastvm-rhel-7-4-164        cib:     info: cib_process_ping:      Reporting our current digest to fastvm-rhel-7-4-164: b94a9db433150f5a8109e2fe6bc33eba for 0.10.90 (0x5574388c19f0 0)
  Nov 22 13:56:49  db2(DB2_HADR)[4776]:    ERROR: DB2 database db2inst1(0)/sample promote failed: SQL1770N  Takeover HADR cannot complete. Reason code = "1".
  Nov 22 13:56:49 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_finished:  finished - rsc:DB2_HADR action:promote call_id:26 pid:4776 exit-code:1 exec-time:7622ms queue-time:0ms
  Nov 22 13:56:49 [1314] fastvm-rhel-7-4-164       crmd:   notice: process_lrm_event:     Result of promote operation for DB2_HADR on fastvm-rhel-7-4-164: 1 (unknown error) | call=26 key=DB2_HADR_promote_0 confirmed=true cib-update=47

=== updated package with patch proposed to upstream - promote works
  Nov 22 14:03:11 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_execute:   executing - rsc:DB2_HADR action:promote call_id:68
  Nov 22 14:03:11  db2(DB2_HADR)[14174]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 22 14:03:15 [1308] fastvm-rhel-7-4-164        cib:     info: cib_process_ping:      Reporting our current digest to fastvm-rhel-7-4-164: 193e41f6eded9a5c61cdcc23d4c8f925 for 0.10.151 (0x55743851a0d0 0)
  Nov 22 14:03:28  db2(DB2_HADR)[14174]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
  Nov 22 14:03:40  db2(DB2_HADR)[14174]:    INFO: DB20000I  The ARCHIVE LOG command completed successfully.
  Nov 22 14:03:40 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_finished:  finished - rsc:DB2_HADR action:promote call_id:68 pid:14174 exit-code:0 exec-time:29627ms queue-time:0ms

Comment 10 Monika Muzikovska 2018-01-12 13:04:58 UTC

Marking Verified as SanityOnly in version resource-agents-3.9.5-119.el7.

Comment 13 errata-xmlrpc 2018-04-10 12:09:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0757

Note You need to log in before you can comment on or make changes to this bug.