Bug 2168675 - Can't move MS SQL Server cluster resources
Summary: Can't move MS SQL Server cluster resources
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Chris Lumens
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 2169829
 
Reported: 2023-02-09 17:58 UTC by Daniel Yeisley
Modified: 2023-05-16 09:52 UTC
CC List: 8 users

Fixed In Version: pacemaker-2.1.5-8.el8
Doc Type: Bug Fix
Doc Text:
Cause: An internal attribute querying API used by attrd_updater does not check if it was given NULL for its node name parameter.
Consequence: attrd_updater ignores the command line option to display the value of the attribute on all cluster nodes.
Fix: The API call should check for a NULL node name before doing anything else.
Result: attrd_updater again displays the value of the attribute on all cluster nodes.
Clone Of:
: 2169829 (view as bug list)
Environment:
Last Closed: 2023-05-16 08:35:22 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CLUSTERQE-6422 0 None None None 2023-02-21 10:01:58 UTC
Red Hat Issue Tracker RHELPLAN-148211 0 None None None 2023-02-09 18:02:06 UTC
Red Hat Product Errata RHBA-2023:2818 0 None None None 2023-05-16 08:35:31 UTC

Description Daniel Yeisley 2023-02-09 17:58:14 UTC
Description of problem:
I've been running into a problem with multi-node SQL Server on RHEL 8.8. I can create the 2-node cluster and execute my test harness against it. Then I attempt to move the resources to the second node and it fails with "Not enough replicas are online." 

I installed two systems with RHEL-8.7.0 and updated the pacemaker packages to the 8.8.0 versions to reproduce the problem. 


Version-Release number of selected component (if applicable):
[root@isvqe-01 ~]# rpm -qa | grep pacemaker
pacemaker-2.1.5-5.el8.x86_64
pacemaker-libs-2.1.5-5.el8.x86_64
pacemaker-schemas-2.1.5-5.el8.noarch
pacemaker-cluster-libs-2.1.5-5.el8.x86_64
pacemaker-cli-2.1.5-5.el8.x86_64


How reproducible:
Always.

Steps to Reproduce:
1. Install an MS SQL Server 2-node cluster using the Ansible role
2. Create the database and load the tables. 
3. Attempt to move the resources to node 2. 
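
(The exact commands used for step 3 aren't recorded in this report. With the resource and node names from the config below, a typical pcs invocation would look roughly like the following; flag syntax may differ slightly between pcs versions.)

# Move the floating IP to the second node:
[root@isvqe-01 ~]# pcs resource move virtualip isvqe-02
# Move the promoted (master) role of the availability group clone:
[root@isvqe-01 ~]# pcs resource move ag_cluster-clone isvqe-02 --master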

The Ansible role creates the following resource config.
[root@isvqe-01 Certification]# pcs resource config
Resource: mydummy (class=ocf provider=pacemaker type=Dummy)
  Operations:
    migrate_from: mydummy-migrate_from-interval-0s
      interval=0s
      timeout=20s
    migrate_to: mydummy-migrate_to-interval-0s
      interval=0s
      timeout=20s
    monitor: mydummy-monitor-interval-10s
      interval=10s
      timeout=20s
    reload: mydummy-reload-interval-0s
      interval=0s
      timeout=20s
    reload-agent: mydummy-reload-agent-interval-0s
      interval=0s
      timeout=20s
    start: mydummy-start-interval-0s
      interval=0s
      timeout=20s
    stop: mydummy-stop-interval-0s
      interval=0s
      timeout=20s
Resource: virtualip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: virtualip-instance_attributes
    ip=192.168.100.201
  Operations:
    monitor: virtualip-monitor-interval-30s
      interval=30s
    start: virtualip-start-interval-0s
      interval=0s
      timeout=20s
    stop: virtualip-stop-interval-0s
      interval=0s
      timeout=20s
Clone: ag_cluster-clone
  Meta Attributes: ag_cluster-clone-meta_attributes
    notify=True
    promotable=true
  Resource: ag_cluster (class=ocf provider=mssql type=ag)
    Attributes: ag_cluster-instance_attributes
      ag_name=ag1
    Meta Attributes: ag_cluster-meta_attributes
      failure-timeout=80s
    Operations:
      demote: ag_cluster-demote-interval-0s
        interval=0s
        timeout=10
      monitor: ag_cluster-monitor-interval-10
        interval=10
        timeout=60
      monitor: ag_cluster-monitor-interval-11
        interval=11
        timeout=60
        role=Master
      monitor: ag_cluster-monitor-interval-12
        interval=12
        timeout=60
        role=Slave
      notify: ag_cluster-notify-interval-0s
        interval=0s
        timeout=60
      promote: ag_cluster-promote-interval-0s
        interval=0s
        timeout=60
      start: ag_cluster-start-interval-0s
        interval=0s
        timeout=60
      stop: ag_cluster-stop-interval-0s
        interval=0s
        timeout=10

After initial creation.
[root@isvqe-01 ~]# pcs status
Cluster name: isvqe-cluster
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-02-09 11:32:08 -05:00)
Cluster Summary:
  * Stack: corosync
  * Current DC: isvqe-02 (version 2.1.5-5.el8-a3f44794f94) - partition with quorum
  * Last updated: Thu Feb  9 11:32:09 2023
  * Last change:  Thu Feb  9 11:17:56 2023 by root via cibadmin on isvqe-01
  * 2 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ isvqe-01 isvqe-02 ]

Full List of Resources:
  * mydummy	(ocf::pacemaker:Dummy):	 Started isvqe-01
  * virtualip	(ocf::heartbeat:IPaddr2):	 Started isvqe-01
  * Clone Set: ag_cluster-clone [ag_cluster] (promotable):
    * Masters: [ isvqe-01 ]
    * Slaves: [ isvqe-02 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


Actual results:
After the attempted move.
[root@isvqe-01 Certification]# pcs status
Cluster name: isvqe-cluster

WARNINGS:
Following resources have been moved and their move constraints are still in place: 'virtualip'
Run 'pcs constraint location' or 'pcs resource clear <resource id>' to view or remove the constraints, respectively

Status of pacemakerd: 'Pacemaker is running' (last updated 2023-02-09 12:46:52 -05:00)
Cluster Summary:
  * Stack: corosync
  * Current DC: isvqe-02 (version 2.1.5-5.el8-a3f44794f94) - partition with quorum
  * Last updated: Thu Feb  9 12:46:53 2023
  * Last change:  Thu Feb  9 12:34:38 2023 by hacluster via crmd on isvqe-02
  * 2 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ isvqe-01 isvqe-02 ]

Full List of Resources:
  * mydummy	(ocf::pacemaker:Dummy):	 Started isvqe-02
  * virtualip	(ocf::heartbeat:IPaddr2):	 Stopped
  * Clone Set: ag_cluster-clone [ag_cluster] (promotable):
    * Slaves: [ isvqe-01 isvqe-02 ]

Failed Resource Actions:
  * ag_cluster_promote_0 on isvqe-02 'error' (1): call=956, status='complete', exitreason='2023/02/09 12:46:41 Not enough replicas are online to safely promote this replica: need 2 but have 1', last-rc-change='Thu Feb  9 12:46:36 2023', queued=0ms, exec=5202ms
  * ag_cluster_promote_0 on isvqe-01 'error' (1): call=240, status='complete', exitreason='2023/02/09 12:46:14 Not enough replicas are online to safely promote this replica: need 2 but have 1', last-rc-change='Thu Feb  9 12:46:09 2023', queued=0ms, exec=5206ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


Expected results:
The resources move to the second node without errors, and ag_cluster is promoted on isvqe-02.

Additional info:
The SQL Server packages are located here: https://packages.microsoft.com/rhel/8/mssql-server-2019/Packages/m/

Comment 2 Reid Wahl 2023-02-09 19:32:46 UTC
Does it work as expected if you use an older Pacemaker version (and if so, which version?) and leave the resource agent the same?

Comment 3 Reid Wahl 2023-02-09 19:36:13 UTC
Also, which versions of the mssql packages are installed?

The error is coming from the resource agent, not from Pacemaker. It's definitely possible that some change in Pacemaker affected the resource agent, and then it would remain to be determined which one needs to be patched or how the cluster configuration needs to change. It'll be easier to find the problem with some known good vs. current version info, including for mssql.

Comment 4 Reid Wahl 2023-02-09 19:41:10 UTC
Also can you please elaborate on these steps?

> 1. Install an MS SQL Server 2-node cluster using the Ansible role
> 2. Create the database and load the tables.

I don't know where to find or how to use the Ansible role for MS SQL Server, or how to create an MS SQL database and load tables. (I can find out, but it takes time.)

Comment 5 Daniel Yeisley 2023-02-09 19:43:53 UTC
(In reply to Reid Wahl from comment #3)
> Also, which versions of the mssql packages are installed?
> 
> The error is coming from the resource agent, not from Pacemaker. It's
> definitely possible that some change in Pacemaker affected the resource
> agent, and then it would remain to be determined which one needs to be
> patched or how the cluster configuration needs to change. It'll be easier to
> find the problem with some known good vs. current version info, including
> for mssql.

I have the following packages installed.

[root@isvqe-01 Certification]# rpm -qa | grep ^mssql 
mssql-server-15.0.4261.1-2.x86_64
mssql-tools-17.10.1.1-1.x86_64
mssql-server-fts-15.0.4261.1-2.x86_64
mssql-server-ha-15.0.4261.1-2.x86_64
[root@isvqe-01 Certification]# rpm -qa | grep ^resource
resource-agents-4.9.0-29.el8.x86_64

I can reinstall with RHEL 8.8 and use pacemaker from 8.7 to see what happens.

Comment 6 Daniel Yeisley 2023-02-09 22:27:27 UTC
I installed two systems with RHEL 8.8 and the pacemaker packages from 8.7 and had no issues.

[root@isvqe-01 Certification]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.8 Beta (Ootpa)

[root@isvqe-01 ~]# rpm -qa | grep pacemaker
pacemaker-2.1.4-5.el8.x86_64
pacemaker-cluster-libs-2.1.4-5.el8.x86_64
pacemaker-libs-2.1.4-5.el8.x86_64
pacemaker-cli-2.1.4-5.el8.x86_64
pacemaker-schemas-2.1.4-5.el8.noarch

[root@isvqe-01 ~]# rpm -qa | grep ^mssql
mssql-server-15.0.4261.1-2.x86_64
mssql-tools-17.10.1.1-1.x86_64
mssql-server-ha-15.0.4261.1-2.x86_64
mssql-server-fts-15.0.4261.1-2.x86_64

[root@isvqe-01 ~]# rpm -qa | grep ^resource
resource-agents-4.9.0-40.el8.x86_64

[root@isvqe-01 Certification]# pcs status
Cluster name: isvqe-cluster

WARNINGS:
Following resources have been moved and their move constraints are still in place: 'virtualip'
Run 'pcs constraint location' or 'pcs resource clear <resource id>' to view or remove the constraints, respectively

Cluster Summary:
  * Stack: corosync
  * Current DC: isvqe-02 (version 2.1.4-5.el8-dc6eb4362e) - partition with quorum
  * Last updated: Thu Feb  9 17:26:23 2023
  * Last change:  Thu Feb  9 16:55:17 2023 by root via crm_resource on isvqe-01
  * 2 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ isvqe-01 isvqe-02 ]

Full List of Resources:
  * mydummy	(ocf::pacemaker:Dummy):	 Started isvqe-01
  * virtualip	(ocf::heartbeat:IPaddr2):	 Started isvqe-02
  * Clone Set: ag_cluster-clone [ag_cluster] (promotable):
    * Masters: [ isvqe-02 ]
    * Slaves: [ isvqe-01 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 8 Reid Wahl 2023-02-09 22:32:35 UTC
(In reply to Daniel Yeisley from comment #6)
> I installed two systems with RHEL 8.8 and the pacemaker packages from 8.7
> and had no issues.

Great to know. Do you know if there's source code anywhere for the mssql stuff? It'd be nice to see exactly what it's doing and where it's breaking. Their GitHub was last updated 5 years ago. The scripts are in the RPM package but ag-helper (the source of the error message) is compiled.

Comment 11 Ken Gaillot 2023-02-13 16:57:35 UTC
(In reply to Daniel Yeisley from comment #6)
> I installed two systems with RHEL 8.8 and the pacemaker packages from 8.7
> and had no issues.

So, RHEL 8.7 with pacemaker 8.8 packages = bad, and RHEL 8.8 with pacemaker 8.7 packages = good?

What happens when the 8.7 packages are used on 8.7, and the 8.8 packages on 8.8?

I don't know of any changes in the 8.8 packages that could affect this agent, but it could be an unknown regression. Unless maybe the agent parses the text (instead of XML) output of a command-line tool like crm_mon?
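
(As an aside, a hypothetical illustration of that distinction, not taken from this report: the plain-text output of the command-line tools can change between releases, while the XML output is intended to be machine-parsable.)

# Text output; formatting may change between pacemaker releases:
crm_mon -1
# XML output; the stable form an agent should parse instead:
crm_mon -1 --output-as=xml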

Comment 13 Daniel Yeisley 2023-02-13 19:15:48 UTC
(In reply to Ken Gaillot from comment #11)
> (In reply to Daniel Yeisley from comment #6)
> > I installed two systems with RHEL 8.8 and the pacemaker packages from 8.7
> > and had no issues.
> 
> So, RHEL 8.7 with pacemaker 8.8 packages = bad, and RHEL 8.8 with pacemaker
> 8.7 packages = good?
> 
> What happens when the 8.7 packages are used on 8.7, and the 8.8 packages on
> 8.8?
> 
> I don't know of any changes in the 8.8 packages that could affect this
> agent, but it could be an unknown regression. Unless maybe the agent parses
> the text (instead of XML) output of a command-line tool like crm_mon?

I initially ran into the problem during my testing on 8.8 (with the 8.8 pacemaker packages). Then I re-ran my tests on RHEL 8.7 to verify that it wasn't something that changed in the SQL Server packages. I had no issues on 8.7. Then I started with 8.7 and upgraded it with packages from 8.8 until I could recreate the issue.

Comment 14 Reid Wahl 2023-02-13 21:06:53 UTC
For now, in the absence of any recent source code for the ag-helper program, I'm having to rely on the version from 2017 and hope it's still basically the same: https://github.com/microsoft/mssql-server-ha/blob/sql2017/go/src/ag-helper/main.go

So, here's the issue. In the RHEL 8.8 version, attrd_updater is only showing the attribute value from the local node. Previously (RHEL 8.7), it showed the values from all nodes. The ag-helper program is expecting multi-line output, containing the sequence number from each node. This block is where the work is done: https://github.com/microsoft/mssql-server-ha/blob/sql2017/go/src/ag-helper/main.go#L538-L566


Here's the difference in output from attrd_updater in the ocf:mssql:ag resource agent. This is where we query the ag_cluster-sequence-number attribute from Pacemaker and pass it as an argument to ag-helper.

pacemaker-2.1.5-5 (RHEL 8.8):
++ 15:27:18: mssql_promote:375: attrd_updater -n ag_cluster-sequence-number -QA
+ 15:27:19: mssql_promote:375: local 'sequence_numbers=name="ag_cluster-sequence-number" host="isvqe-01" value="4294967310"'
++ 15:27:19: mssql_promote:399: /usr/lib/ocf/lib/mssql/ag-helper --hostname '' --port 1433 --credentials-file /var/opt/mssql/secrets/passwd --ag-name ag1 --application-name monitor-ag_cluster-promote --connection-timeout 30 --health-threshold 3 --action promote --sequence-numbers 'name="ag_cluster-sequence-number" host="isvqe-01" value="4294967310"' --new-master isvqe-01 --required-synchronized-secondaries-to-commit -1 --disable-primary-on-quorum-timeout-after 60 --primary-write-lease-duration 62


pacemaker-2.1.4-5 (RHEL 8.7):
++ 15:45:55: mssql_promote:375: attrd_updater -n ag_cluster-sequence-number -QA
+ 15:45:55: mssql_promote:375: local 'sequence_numbers=name="ag_cluster-sequence-number" host="isvqe-01" value="4294967310"
name="ag_cluster-sequence-number" host="isvqe-02" value="4294967310"'
++ 15:45:55: mssql_promote:399: /usr/lib/ocf/lib/mssql/ag-helper --hostname '' --port 1433 --credentials-file /var/opt/mssql/secrets/passwd --ag-name ag1 --application-name monitor-ag_cluster-promote --connection-timeout 30 --health-threshold 3 --action promote --sequence-numbers 'name="ag_cluster-sequence-number" host="isvqe-01" value="4294967310"
name="ag_cluster-sequence-number" host="isvqe-02" value="4294967310"' --new-master isvqe-01 --required-synchronized-secondaries-to-commit -1 --disable-primary-on-quorum-timeout-after 60 --primary-write-lease-duration 62


I tested with a dummy attribute using the RHEL 8.8 version, and the -A option does work correctly: it returns the attribute's value for all nodes, as expected.

[root@isvqe-01 ~]# attrd_updater -n test_attr -QA
name="test_attr" host="isvqe-02" value="test_val"
name="test_attr" host="isvqe-01" value="test_val"


In contrast, the failed resource promote log farther above shows that `attrd_updater -n ag_cluster-sequence-number -QA` only returned the attribute's value for one node.

So there does **not** seem to be a regression in the output **format**, but we are nonetheless getting incomplete output **content** when the resource agent is running the command. I wonder if it's a timing issue related to when the query is happening versus when the attribute is set and synced. There have been some changes in attribute handling recently.

@clumens, what do you think? You're more familiar with the attrd code.

Comment 15 Chris Lumens 2023-02-14 14:38:32 UTC
We talked about this a bit in chat yesterday, and it looks like a bug in attrd_updater, caused by pcmk__attrd_api_query -> pcmk__node_attr_target. This was introduced by commit 9528756621.

To make a long story very short, pcmk__attrd_api_query should not call pcmk__node_attr_target when node == NULL.  Prior to that commit, this was the case.

When pcmk__node_attr_target is called with a node name of NULL, it treats that the same as if it was passed "auto" or "localhost" and looks at the running system to figure out the node name.  Among other places, it looks at the OCF_RESKEY_CRM_meta_on_node environment variable.  When that variable is set, its contents will be used as the node name.  That gets passed back to pcmk__attrd_api_query, which uses that to override the NULL it was given.  That's put into the XML IPC call, and that's the node name the server uses when looking up the attribute's value.  Basically, the environment variable's presence overrides the request for all nodes.  It looks like that environment variable is getting set in this case.
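
(A hypothetical way to see this from the shell, reusing the test_attr attribute from comment 14. This illustrates the mechanism described above; it is not output captured during the investigation.)

# Without the meta variable, -A returns the attribute for every node:
[root@isvqe-01 ~]# attrd_updater -n test_attr -QA
name="test_attr" host="isvqe-02" value="test_val"
name="test_attr" host="isvqe-01" value="test_val"

# With OCF_RESKEY_CRM_meta_on_node set, as it is when a resource agent runs under
# the controller, the affected pacemaker-2.1.5-5 build silently narrows the query to that node:
[root@isvqe-01 ~]# OCF_RESKEY_CRM_meta_on_node=isvqe-01 attrd_updater -n test_attr -QA
name="test_attr" host="isvqe-01" value="test_val"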

The fix looks pretty straightforward - skip that function call when node is NULL, and call it otherwise.  We're not sure whether the latter is really the right thing to do, but it's what was happening in 2.1.4 so it seems that we should restore that behavior just in case someone was relying on it.

Comment 26 errata-xmlrpc 2023-05-16 08:35:22 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2818

