Bug 2168675
| Summary: | Can't move MS SQL Server cluster resources | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Daniel Yeisley <dyeisley> | |
| Component: | pacemaker | Assignee: | Chris Lumens <clumens> | |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | |
| Severity: | unspecified | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 8.8 | CC: | cfeist, clumens, cluster-maint, jrehova, kgaillot, msmazova, nwahl, rmeggins | |
| Target Milestone: | rc | Keywords: | Regression, Triaged | |
| Target Release: | --- | Flags: | pm-rhel: mirror+ | |
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | pacemaker-2.1.5-8.el8 | Doc Type: | Bug Fix | |
| Doc Text: | Cause: An internal attribute querying API used by attrd_updater does not check if it was given NULL for its node name parameter. Consequence: attrd_updater ignores the command line option to display the value of the attribute on all cluster nodes. Fix: The API call should check for a NULL node name before doing anything else. Result: attrd_updater again displays the value of the attribute on all cluster nodes. | Story Points: | --- | |
| Clone Of: | ||||
| : | 2169829 (view as bug list) | Environment: | ||
| Last Closed: | 2023-05-16 08:35:22 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2169829 | |||
Does it work as expected if you use a lower Pacemaker version (which version if so?) and leave the resource agent the same? Also, which versions of the mssql packages are installed? The error is coming from the resource agent, not from Pacemaker. It's definitely possible that some change in Pacemaker affected the resource agent, and then it would remain to be determined which one needs to be patched or how the cluster configuration needs to change. It'll be easier to find the problem with some known good vs. current version info, including for mssql. Also can you please elaborate on these steps?
> 1. Install a MS SQL Server 2-node cluster using the ansible role
> 2. Create the database and load the tables.
I don't know where to find or how to use the Ansible role for MS SQL Server, or how to create an MS SQL database and load tables. (I can find out, but it takes time.)

(In reply to Reid Wahl from comment #3)
> Also, which versions of the mssql packages are installed?
>
> The error is coming from the resource agent, not from Pacemaker. It's
> definitely possible that some change in Pacemaker affected the resource
> agent, and then it would remain to be determined which one needs to be
> patched or how the cluster configuration needs to change. It'll be easier to
> find the problem with some known good vs. current version info, including
> for mssql.
I have the following packages installed.
[root@isvqe-01 Certification]# rpm -qa | grep ^mssql
mssql-server-15.0.4261.1-2.x86_64
mssql-tools-17.10.1.1-1.x86_64
mssql-server-fts-15.0.4261.1-2.x86_64
mssql-server-ha-15.0.4261.1-2.x86_64
[root@isvqe-01 Certification]# rpm -qa | grep ^resource
resource-agents-4.9.0-29.el8.x86_64
I can reinstall with RHEL 8.8 and use pacemaker from 8.7 to see what happens.

I installed two systems with RHEL 8.8 and the pacemaker packages from 8.7 and had no issues.
[root@isvqe-01 Certification]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.8 Beta (Ootpa)
[root@isvqe-01 ~]# rpm -qa | grep pacemaker
pacemaker-2.1.4-5.el8.x86_64
pacemaker-cluster-libs-2.1.4-5.el8.x86_64
pacemaker-libs-2.1.4-5.el8.x86_64
pacemaker-cli-2.1.4-5.el8.x86_64
pacemaker-schemas-2.1.4-5.el8.noarch
[root@isvqe-01 ~]# rpm -qa | grep ^mssql
mssql-server-15.0.4261.1-2.x86_64
mssql-tools-17.10.1.1-1.x86_64
mssql-server-ha-15.0.4261.1-2.x86_64
mssql-server-fts-15.0.4261.1-2.x86_64
[root@isvqe-01 ~]# rpm -qa | grep ^resource
resource-agents-4.9.0-40.el8.x86_64
[root@isvqe-01 Certification]# pcs status
Cluster name: isvqe-cluster
WARNINGS:
Following resources have been moved and their move constraints are still in place: 'virtualip'
Run 'pcs constraint location' or 'pcs resource clear <resource id>' to view or remove the constraints, respectively
Cluster Summary:
* Stack: corosync
* Current DC: isvqe-02 (version 2.1.4-5.el8-dc6eb4362e) - partition with quorum
* Last updated: Thu Feb 9 17:26:23 2023
* Last change: Thu Feb 9 16:55:17 2023 by root via crm_resource on isvqe-01
* 2 nodes configured
* 4 resource instances configured
Node List:
* Online: [ isvqe-01 isvqe-02 ]
Full List of Resources:
* mydummy (ocf::pacemaker:Dummy): Started isvqe-01
* virtualip (ocf::heartbeat:IPaddr2): Started isvqe-02
* Clone Set: ag_cluster-clone [ag_cluster] (promotable):
* Masters: [ isvqe-02 ]
* Slaves: [ isvqe-01 ]
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

(In reply to Daniel Yeisley from comment #6)
> I installed two systems with RHEL 8.8 and the pacemaker packages from 8.7
> and had no issues.
Great to know. Do you know if there's source code anywhere for the mssql stuff? It'd be nice to see exactly what it's doing and where it's breaking.

Their GitHub was last updated 5 years ago. The scripts are in the RPM package but ag-helper (the source of the error message) is compiled.

(In reply to Daniel Yeisley from comment #6)
> I installed two systems with RHEL 8.8 and the pacemaker packages from 8.7
> and had no issues.
So, RHEL 8.7 with pacemaker 8.8 packages = bad, and RHEL 8.8 with pacemaker 8.7 packages = good?
What happens when the 8.7 packages are used on 8.7, and the 8.8 packages on 8.8?
I don't know of any changes in the 8.8 packages that could affect this agent, but it could be an unknown regression. Unless maybe the agent parses the text (instead of XML) output of a command-line tool like crm_mon?

(In reply to Ken Gaillot from comment #11)
> (In reply to Daniel Yeisley from comment #6)
> > I installed two systems with RHEL 8.8 and the pacemaker packages from 8.7
> > and had no issues.
>
> So, RHEL 8.7 with pacemaker 8.8 packages = bad, and RHEL 8.8 with pacemaker
> 8.7 packages = good?
>
> What happens when the 8.7 packages are used on 8.7, and the 8.8 packages on
> 8.8?
>
> I don't know of any changes in the 8.8 packages that could affect this
> agent, but it could be an unknown regression. Unless maybe the agent parses
> the text (instead of XML) output of a command-line tool like crm_mon?
I initially ran into the problem during my testing on 8.8 (with the 8.8 pacemaker packages). Then I re-ran my tests on RHEL 8.7 to verify that it wasn't something that changed in the SQL Server packages. I had no issues on 8.7. Then I started with 8.7 and upgraded it with packages from 8.8 until I could recreate the issue.

For now, in the absence of any recent source code for the ag-helper program, I'm having to rely on the version from 2017 and hope it's still basically the same:
https://github.com/microsoft/mssql-server-ha/blob/sql2017/go/src/ag-helper/main.go
So, here's the issue. In the RHEL 8.8 version, attrd_updater is only showing the attribute value from the local node. Previously (RHEL 8.7), it showed the values from all nodes. The ag-helper program is expecting multi-line output, containing the sequence number from each node. This block is where the work is done:
https://github.com/microsoft/mssql-server-ha/blob/sql2017/go/src/ag-helper/main.go#L538-L566
Here's the difference in output from attrd_updater in the ocf:mssql:ag resource agent. This is where we query the ag_cluster-sequence-number attribute from Pacemaker and pass it as an argument to ag-helper.
pacemaker-2.1.5-5 (RHEL 8.8):
++ 15:27:18: mssql_promote:375: attrd_updater -n ag_cluster-sequence-number -QA
+ 15:27:19: mssql_promote:375: local 'sequence_numbers=name="ag_cluster-sequence-number" host="isvqe-01" value="4294967310"'
++ 15:27:19: mssql_promote:399: /usr/lib/ocf/lib/mssql/ag-helper --hostname '' --port 1433 --credentials-file /var/opt/mssql/secrets/passwd --ag-name ag1 --application-name monitor-ag_cluster-promote --connection-timeout 30 --health-threshold 3 --action promote --sequence-numbers 'name="ag_cluster-sequence-number" host="isvqe-01" value="4294967310"' --new-master isvqe-01 --required-synchronized-secondaries-to-commit -1 --disable-primary-on-quorum-timeout-after 60 --primary-write-lease-duration 62
pacemaker-2.1.4-5 (RHEL 8.7):
++ 15:45:55: mssql_promote:375: attrd_updater -n ag_cluster-sequence-number -QA
+ 15:45:55: mssql_promote:375: local 'sequence_numbers=name="ag_cluster-sequence-number" host="isvqe-01" value="4294967310"
name="ag_cluster-sequence-number" host="isvqe-02" value="4294967310"'
++ 15:45:55: mssql_promote:399: /usr/lib/ocf/lib/mssql/ag-helper --hostname '' --port 1433 --credentials-file /var/opt/mssql/secrets/passwd --ag-name ag1 --application-name monitor-ag_cluster-promote --connection-timeout 30 --health-threshold 3 --action promote --sequence-numbers 'name="ag_cluster-sequence-number" host="isvqe-01" value="4294967310"
name="ag_cluster-sequence-number" host="isvqe-02" value="4294967310"' --new-master isvqe-01 --required-synchronized-secondaries-to-commit -1 --disable-primary-on-quorum-timeout-after 60 --primary-write-lease-duration 62

I tested with a dummy attribute using the RHEL 8.8 version, and the -A option does work correctly: it returns the attribute's value for all nodes, as expected.
[root@isvqe-01 ~]# attrd_updater -n test_attr -QA
name="test_attr" host="isvqe-02" value="test_val"
name="test_attr" host="isvqe-01" value="test_val"
In contrast, the failed resource promote log farther above shows that `attrd_updater -n ag_cluster-sequence-number -QA` only returned the attribute's value for one node. So there does **not** seem to be a regression in the output **format**, but we are nonetheless getting incomplete output **content** when the resource agent is running the command.
I wonder if it's a timing issue related to when the query is happening versus when the attribute is set and synced. There have been some changes in attribute handling recently.
@clumens, what do you think? You're more familiar with the attrd code.

We talked about this a bit in chat yesterday and it looks like a bug in attrd_updater, caused by pcmk__attrd_api_query -> pcmk__node_attr_target. This was introduced by 9528756621.
To make a long story very short, pcmk__attrd_api_query should not call pcmk__node_attr_target when node == NULL. Prior to that commit, this was the case. When pcmk__node_attr_target is called with a node name of NULL, it treats that the same as if it was passed "auto" or "localhost" and looks at the running system to figure out the node name.
Among other places, it looks at the OCF_RESKEY_CRM_meta_on_node environment variable. When that variable is set, its contents will be used as the node name. That gets passed back to pcmk__attrd_api_query, which uses that to override the NULL it was given. That's put into the XML IPC call, and that's the node name the server uses when looking up the attribute's value. Basically, the environment variable's presence overrides the request for all nodes. It looks like that environment variable is getting set in this case.
The fix looks pretty straightforward: skip that function call when node is NULL, and call it otherwise. We're not sure whether the latter is really the right thing to do, but it's what was happening in 2.1.4 so it seems that we should restore that behavior just in case someone was relying on it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2023:2818
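
For readers who want the root cause in one place, here is a minimal Python sketch of the node-name resolution described in the analysis above. It is a simplified model and not the Pacemaker C code: the function names `node_attr_target`, `query_target_buggy`, and `query_target_fixed` are hypothetical, the `env` mapping stands in for the process environment, and the fallback to `local_node` is an assumption covering the "other places" the real function consults. Only the `OCF_RESKEY_CRM_meta_on_node` variable and the "auto"/"localhost" aliases come from this report.

```python
from typing import Mapping, Optional

# Names that are treated as "the local node" (from the analysis in this bug).
LOCAL_ALIASES = {"auto", "localhost"}


def node_attr_target(name: Optional[str], env: Mapping[str, str],
                     local_node: str) -> str:
    """Simplified model of the lookup attributed to pcmk__node_attr_target.

    None, "auto" and "localhost" are all resolved to a concrete node name,
    preferring OCF_RESKEY_CRM_meta_on_node when it is set.
    """
    if name is not None and name not in LOCAL_ALIASES:
        return name  # an explicit node name is used unchanged
    # Assumption: fall back to the local node name when the variable is unset.
    return env.get("OCF_RESKEY_CRM_meta_on_node", local_node)


def query_target_buggy(node: Optional[str], env: Mapping[str, str],
                       local_node: str) -> Optional[str]:
    """Models the regression: None ("all nodes") is resolved like "auto"."""
    return node_attr_target(node, env, local_node)


def query_target_fixed(node: Optional[str], env: Mapping[str, str],
                       local_node: str) -> Optional[str]:
    """Models the fix: None is passed through, so the query covers all nodes."""
    if node is None:
        return None
    return node_attr_target(node, env, local_node)
```

In this model, a query for all nodes made from inside a resource agent (where `OCF_RESKEY_CRM_meta_on_node` is set) collapses to a single node in the buggy variant and stays `None` in the fixed one, which matches the single-host output seen in the promote trace.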

Description of problem:
I've been running into a problem with multi-node SQL Server on RHEL 8.8. I can create the 2-node cluster and execute my test harness against it. Then I attempt to move the resources to the second node and it fails with "Not enough replicas are online." I installed two systems with RHEL-8.7.0 and updated the pacemaker packages to the 8.8.0 versions to reproduce the problem.

Version-Release number of selected component (if applicable):
[root@isvqe-01 ~]# rpm -qa | grep pacemaker
pacemaker-2.1.5-5.el8.x86_64
pacemaker-libs-2.1.5-5.el8.x86_64
pacemaker-schemas-2.1.5-5.el8.noarch
pacemaker-cluster-libs-2.1.5-5.el8.x86_64
pacemaker-cli-2.1.5-5.el8.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Install a MS SQL Server 2-node cluster using the ansible role
2. Create the database and load the tables.
3. Attempt to move the resources to node 2.

The ansible role creates the following resource config.
[root@isvqe-01 Certification]# pcs resource config
Resource: mydummy (class=ocf provider=pacemaker type=Dummy)
  Operations:
    migrate_from: mydummy-migrate_from-interval-0s interval=0s timeout=20s
    migrate_to: mydummy-migrate_to-interval-0s interval=0s timeout=20s
    monitor: mydummy-monitor-interval-10s interval=10s timeout=20s
    reload: mydummy-reload-interval-0s interval=0s timeout=20s
    reload-agent: mydummy-reload-agent-interval-0s interval=0s timeout=20s
    start: mydummy-start-interval-0s interval=0s timeout=20s
    stop: mydummy-stop-interval-0s interval=0s timeout=20s
Resource: virtualip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: virtualip-instance_attributes ip=192.168.100.201
  Operations:
    monitor: virtualip-monitor-interval-30s interval=30s
    start: virtualip-start-interval-0s interval=0s timeout=20s
    stop: virtualip-stop-interval-0s interval=0s timeout=20s
Clone: ag_cluster-clone
  Meta Attributes: ag_cluster-clone-meta_attributes notify=True promotable=true
  Resource: ag_cluster (class=ocf provider=mssql type=ag)
    Attributes: ag_cluster-instance_attributes ag_name=ag1
    Meta Attributes: ag_cluster-meta_attributes failure-timeout=80s
    Operations:
      demote: ag_cluster-demote-interval-0s interval=0s timeout=10
      monitor: ag_cluster-monitor-interval-10 interval=10 timeout=60
      monitor: ag_cluster-monitor-interval-11 interval=11 timeout=60 role=Master
      monitor: ag_cluster-monitor-interval-12 interval=12 timeout=60 role=Slave
      notify: ag_cluster-notify-interval-0s interval=0s timeout=60
      promote: ag_cluster-promote-interval-0s interval=0s timeout=60
      start: ag_cluster-start-interval-0s interval=0s timeout=60
      stop: ag_cluster-stop-interval-0s interval=0s timeout=10

After initial creation.
[root@isvqe-01 ~]# pcs status
Cluster name: isvqe-cluster
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-02-09 11:32:08 -05:00)
Cluster Summary:
* Stack: corosync
* Current DC: isvqe-02 (version 2.1.5-5.el8-a3f44794f94) - partition with quorum
* Last updated: Thu Feb 9 11:32:09 2023
* Last change: Thu Feb 9 11:17:56 2023 by root via cibadmin on isvqe-01
* 2 nodes configured
* 4 resource instances configured
Node List:
* Online: [ isvqe-01 isvqe-02 ]
Full List of Resources:
* mydummy (ocf::pacemaker:Dummy): Started isvqe-01
* virtualip (ocf::heartbeat:IPaddr2): Started isvqe-01
* Clone Set: ag_cluster-clone [ag_cluster] (promotable):
* Masters: [ isvqe-01 ]
* Slaves: [ isvqe-02 ]
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

Actual results:
After the attempted move.
[root@isvqe-01 Certification]# pcs status
Cluster name: isvqe-cluster
WARNINGS:
Following resources have been moved and their move constraints are still in place: 'virtualip'
Run 'pcs constraint location' or 'pcs resource clear <resource id>' to view or remove the constraints, respectively
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-02-09 12:46:52 -05:00)
Cluster Summary:
* Stack: corosync
* Current DC: isvqe-02 (version 2.1.5-5.el8-a3f44794f94) - partition with quorum
* Last updated: Thu Feb 9 12:46:53 2023
* Last change: Thu Feb 9 12:34:38 2023 by hacluster via crmd on isvqe-02
* 2 nodes configured
* 4 resource instances configured
Node List:
* Online: [ isvqe-01 isvqe-02 ]
Full List of Resources:
* mydummy (ocf::pacemaker:Dummy): Started isvqe-02
* virtualip (ocf::heartbeat:IPaddr2): Stopped
* Clone Set: ag_cluster-clone [ag_cluster] (promotable):
* Slaves: [ isvqe-01 isvqe-02 ]
Failed Resource Actions:
* ag_cluster_promote_0 on isvqe-02 'error' (1): call=956, status='complete', exitreason='2023/02/09 12:46:41 Not enough replicas are online to safely promote this replica: need 2 but have 1', last-rc-change='Thu Feb 9 12:46:36 2023', queued=0ms, exec=5202ms
* ag_cluster_promote_0 on isvqe-01 'error' (1): call=240, status='complete', exitreason='2023/02/09 12:46:14 Not enough replicas are online to safely promote this replica: need 2 but have 1', last-rc-change='Thu Feb 9 12:46:09 2023', queued=0ms, exec=5206ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

Expected results:

Additional info:
The SQL Server packages are located here: https://packages.microsoft.com/rhel/8/mssql-server-2019/Packages/m/
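
As a diagnostic aid for the "need 2 but have 1" failure: the comments in this bug trace it to `attrd_updater -n ag_cluster-sequence-number -QA` returning an entry for only one node when run by the resource agent. The following is a minimal Python sketch for checking captured query output, assuming the `name="..." host="..." value="..."` line format shown elsewhere in this report. The function names `parse_attr_query` and `missing_hosts` are hypothetical, and this is not how ag-helper itself parses the value.

```python
import re
from typing import Dict, Iterable, List

# One entry per node, in the format printed by the attribute query in this report.
ENTRY = re.compile(
    r'name="(?P<name>[^"]*)" host="(?P<host>[^"]*)" value="(?P<value>[^"]*)"'
)


def parse_attr_query(output: str) -> Dict[str, str]:
    """Map each host found in the query output to its attribute value."""
    return {m.group("host"): m.group("value") for m in ENTRY.finditer(output)}


def missing_hosts(output: str, expected_hosts: Iterable[str]) -> List[str]:
    """Return the expected hosts that have no entry in the query output."""
    found = parse_attr_query(output)
    return sorted(set(expected_hosts) - set(found))


if __name__ == "__main__":
    # Output as captured in the RHEL 8.8 promote trace: one node only.
    sample = 'name="ag_cluster-sequence-number" host="isvqe-01" value="4294967310"'
    print(missing_hosts(sample, ["isvqe-01", "isvqe-02"]))
```

An empty result means every expected node reported a sequence number; any host listed is one the query did not cover.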