Bug 1925681
| Summary: | connection problems with multi-node MS SQL Server | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Daniel Yeisley <dyeisley> |
| Component: | pacemaker | Assignee: | Chris Lumens <clumens> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 8.4 | CC: | cfeist, clumens, cluster-maint, kgaillot, msmazova, nwahl |
| Target Milestone: | rc | Keywords: | Regression, Triaged |
| Target Release: | 8.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-2.0.5-7.el8 | Doc Type: | No Doc Update |
| Doc Text: | This issue was in an unreleased build. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-18 15:26:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description by Daniel Yeisley, 2021-02-05 20:42:51 UTC
Hi, Daniel. First let's get some basic questions out of the way.

Did you update any other packages (e.g., resource-agents)? Did you stop Pacemaker before updating? Did you run anything besides the sqlcmd after updating (e.g., did you rebuild the cluster and run the whole test from scratch)? What were the resource states and locations after the update?

Aside from that, can you attach a crm_report from each node? It would be good to take a peek at the logs.

(In reply to Reid Wahl from comment #1)
> Hi, Daniel. First let's get some basic questions out of the way.
>
> Did you update any other packages (e.g., resource-agents)?

No, I only updated the pacemaker packages. I started seeing problems on nightly composes in January. I put the 2.0.5-5 packages on an older compose from December and recreated the problem. Then I got a newer compose, downgraded the packages to 2.0.5-4, and had no issues.

> Did you stop Pacemaker before updating?

No, I did not. I just installed the rpms.

> Did you run anything besides the sqlcmd after updating (e.g., did you
> rebuild the cluster and run the whole test from scratch)?

Nope. I just installed the rpms and attempted to run sqlcmd.

> What were the resource states and locations after the update?

My test had finished, so all of the resources were on node 4. I don't recall the states.

> Aside from that, can you attach a crm_report from each node? It would be
> good to take a peek at the logs.

I'll set up the test again to get this. I should add that I don't see this problem when using just 2 nodes.

Thanks for the crm_reports :)

> Neither 2.0.5-5 nor 2.0.5-6 has anything I know of that would affect SQL resources

Yeah, colocation is still working fine, so the new "influence/critical" feature should be unrelated. I would be surprised if the addition of `crm_resource --digests` or the attrd fix for remote_proxy_cb could be related here either. Those are listed as the only three changes for 2.0.5-5.

-----

> It's definitely a good idea to stop pacemaker before updating.

Agreed. I'm not sure if Ken does any testing of updating pacemaker while it's still running, but for us on the support team it's a nack. We always advise users to stop pacemaker before an update, to reduce the risk of unexpected behavior and to keep the cluster in a clean and coherent state.

- Recommended Practices for Applying Software Updates to a RHEL High Availability or Resilient Storage Cluster (https://access.redhat.com/articles/2059253)
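For readers following along, a minimal sketch of one way to apply that advice on a pcs-managed RHEL 8 cluster; the node name and package glob below are illustrative and not taken from this report:

~~~
# Drain and stop the cluster stack on one node before updating it
# (repeat per node for a rolling update, or use "pcs cluster stop --all"
# for a full-cluster outage):
pcs node standby node1
pcs cluster stop node1

# Update the Pacemaker packages on that node
dnf update 'pacemaker*'

# Rejoin the node and let resources move back
pcs cluster start node1
pcs node unstandby node1
~~~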
-----

The move began at 13:36:34.

Feb 8 13:36:34 veritas7 pacemaker-schedulerd[64529]: notice: Calculated transition 57, saving inputs in /var/lib/pacemaker/pengine/pe-input-13.bz2
Feb 8 13:36:34 veritas7 pacemaker-schedulerd[64529]: notice: * Move virtualip ( veritas7 -> isvqe-01 )
Feb 8 13:36:41 veritas7 pacemaker-controld[64530]: notice: Initiating demote operation ag_cluster_demote_0 locally on veritas7
Feb 8 13:36:41 veritas7 pacemaker-controld[64530]: notice: Result of demote operation for ag_cluster on veritas7: ok

Promote completed at 13:36:57. I'm not knowledgeable in MSSQL but will share selected messages below.

Feb 8 13:36:52 isvqe-01 pacemaker-controld[65827]: notice: Requesting local execution of promote operation for ag_cluster on isvqe-01
Feb 8 13:36:52 isvqe-01 ag(ag_cluster)[131161]: INFO: promote: 2021/02/08 13:36:52 ag-helper invoked with hostname [localhost]; port [1433]; ag-name [ag1]; credentials-file [/var/opt/mssql/secrets/passwd]; application-name [monitor-ag_cluster-promote]; connection-timeout [30]; health-threshold [3]; action [promote]
Feb 8 13:36:52 isvqe-01 ag(ag_cluster)[131161]: INFO: promote: 2021/02/08 13:36:52 ag-helper invoked with skip-precheck [false]; sequence-numbers [...]; new-master [isvqe-01]; required-synchronized-secondaries-to-commit [-1]; disable-primary-on-quorum-timeout-after [60]; primary-write-lease-duration [70]
...
Feb 8 13:36:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:57.19 spid52s The availability group database "ds2" is changing roles from "SECONDARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
Feb 8 13:36:57 isvqe-01 ag(ag_cluster)[131161]: INFO: promote: 2021/02/08 13:36:57 Replica is PRIMARY (1)
Feb 8 13:36:57 isvqe-01 ag(ag_cluster)[131161]: INFO: promote: 2021/02/08 13:36:57 Replica is now PRIMARY
Feb 8 13:36:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:57.22 spid52s The availability group database "ds2" is changing roles from "RESOLVING" to "PRIMARY" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
...
Feb 8 13:36:57 isvqe-01 pacemaker-controld[65827]: notice: Result of promote operation for ag_cluster on isvqe-01: ok
Feb 8 13:36:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:57.66 spid14s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[77483240-4F63-43C1-A5C6-5006C73160D7], database [ds2], remote endpoint [tcp://isvqe-02:5022], source operation [641A3BEB-3CDB-46C9-8545-4971B551B8A5]: Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED].'
Feb 8 13:36:58 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:57.99 spid14s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[791B937E-741A-4B48-AB12-88CBCBE5C8C9], database [ds2], remote endpoint [tcp://veritas6:5022], source operation [F6A7412B-57A5-4DA5-B763-6FB6DF70D007]: Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED].'
...
Feb 8 13:36:59 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:59.38 spid107s 0 transactions rolled back in database 'ds2' (5:0). This is an informational message only. No user action is required.
Feb 8 13:36:59 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:59.38 spid107s Recovery completed for database ds2 (database ID 5) in 3 second(s) (analysis 0 ms, redo 0 ms, undo 0 ms [system undo 0 ms, regular undo 0 ms].) This is an informational message only. No user action is required.
Feb 8 13:36:59 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:59.39 spid107s Parallel redo is shutdown for database 'ds2' with worker pool size [16].
Feb 8 13:36:59 isvqe-01 sqlservr[64677]: #015AGID [BE9BE8C9-6E17-1132-BFBA-8B7D2C28AFDB] : [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA] local replica received build replica response from [77483240-4F63-43C1-A5C6-5006C73160D7], Response code [0x80071c27]. Message handle status [1].2021-02-08 13:36:59.43 spid129s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[77483240-4F63-43C1-A5C6-5006C73160D7], database [ds2], remote endpoint [tcp://isvqe-02:5022], source operation [641A3BEB-3CDB-46C9-8545-4971B551B8A5]: Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP].'
Feb 8 13:36:59 isvqe-01 sqlservr[64677]: #015AGID [BE9BE8C9-6E17-1132-BFBA-8B7D2C28AFDB] : [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA] local replica received build replica response from [791B937E-741A-4B48-AB12-88CBCBE5C8C9], Response code [0x80071c27]. Message handle status [1].2021-02-08 13:36:59.57 spid132s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[791B937E-741A-4B48-AB12-88CBCBE5C8C9], database [ds2], remote endpoint [tcp://veritas6:5022], source operation [F6A7412B-57A5-4DA5-B763-6FB6DF70D007]: Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP].'

Here is the error that MSSQL reports for the failed login attempts from veritas6 (with IP ending in 103).

Feb 8 13:38:25 isvqe-01 sqlservr[64677]: #0152021-02-08 13:38:25.94 Logon Error: 18456, Severity: 14, State: 38.
Feb 8 13:38:25 isvqe-01 sqlservr[64677]: #0152021-02-08 13:38:25.94 Logon Login failed for user 'sa'. Reason: Failed to open the explicitly specified database 'DS2'. [CLIENT: 192.168.100.103]
Feb 8 13:38:35 isvqe-01 sqlservr[64677]: #0152021-02-08 13:38:35.98 Logon Error: 18456, Severity: 14, State: 38.
Feb 8 13:38:35 isvqe-01 sqlservr[64677]: #0152021-02-08 13:38:35.98 Logon Login failed for user 'sa'. Reason: Failed to open the explicitly specified database 'DS2'. [CLIENT: 192.168.100.103]

The sqlcmd output said pretty much the same thing. Do we have any further info on what at the MSSQL level prevented us from opening the database?

And a bit later:

Feb 8 13:39:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:39:57.34 spid62s Always On: DebugTraceVarArgs AR 'Seeding is canceled with cancelReason = 21'
Feb 8 13:39:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:39:57.34 spid62s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[AB8EE3AA-7D63-4E88-A1C0-C08F0F564C63], database [ds2], remote endpoint [tcp://veritas7:5022], source operation [2C0D9A0F-8D92-4B4F-99BE-44C8EE0BF755]: Transitioning from [CHECK_IF_SEEDING_NEEDED] to [FAILED].'
Feb 8 13:39:57 isvqe-01 sqlservr[64677]: #015CHadrTransportReplica::SendBuildMessageAndWaitForResponse failed due to transport disconnect
Feb 8 13:39:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:39:57.35 spid56s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[AB8EE3AA-7D63-4E88-A1C0-C08F0F564C63], database [ds2], remote endpoint [tcp://veritas7:5022], source operation [2C0D9A0F-8D92-4B4F-99BE-44C8EE0BF755]: Seeding task failed with result 0x00000002.'
Feb 8 13:39:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:39:57.36 spid56s Always On: DebugTraceVarArgs AR 'Seeding is canceled with cancelReason = 20'

-----

The below might be a red herring (because it seems to have been this way for a while) but it caught my eye. We can keep looking for other clues. (Note: I can't find current source code for ag-helper... I'm thinking Microsoft keeps it locked down.)

I noticed on both node 1 (veritas6) and node 3 (isvqe-01) that the ag-helper process was launched with the `--current-master` option but no optarg.
root 135219 0.0 0.0 461148 8944 ? Sl 13:43 0:00 /usr/lib/ocf/lib/mssql/ag-helper --port 1433 --credentials-file /var/opt/mssql/secrets/passwd --ag-name ag1 --application-name monitor-ag_cluster-monitor --connection-timeout 30 --health-threshold 3 --action monitor --required-synchronized-secondaries-to-commit -1 --current-master --disable-primary-on-quorum-timeout-after 60 --primary-write-lease-duration 81 --monitor-interval-timeout 60
root 196640 0.3 0.0 387416 12480 ? Sl 13:42 0:00 /usr/lib/ocf/lib/mssql/ag-helper --port 1433 --credentials-file /var/opt/mssql/secrets/passwd --ag-name ag1 --application-name monitor-ag_cluster-monitor --connection-timeout 30 --health-threshold 3 --action monitor --required-synchronized-secondaries-to-commit -1 --current-master --disable-primary-on-quorum-timeout-after 60 --primary-write-lease-duration 80 --monitor-interval-timeout 60

The ag resource agent gets current_master as follows:

local current_master="$(crm_resource -r "$OCF_RESOURCE_INSTANCE" --locate | grep -Po 'resource [^ ]+ is running on: \K(.+)(?= Master$)')"

But that regex doesn't match anything.

# pcs resource status | grep -A2 stateful
  * Clone Set: stateful-clone [stateful] (promotable):
    * Masters: [ node2 ]
    * Slaves: [ node1 ]
# /usr/sbin/crm_resource -r stateful --locate
resource stateful is running on: node1
resource stateful is running on: node2
# /usr/sbin/crm_resource -r stateful --locate | grep -Po 'resource [^ ]+ is running on: \K(.+)(?= Master$)'
#

I'm testing on the master branch of pacemaker, but the logic is the same in 2.0.5. I don't see why we exclude clones when we're checking whether a resource is a promotable clone. However, as noted above, it looks like it's been this way for a while, so I don't expect this to be the cause of the reported issue.

    if (!pe_rsc_is_clone(the_rsc)
        && the_rsc->fns->state(the_rsc, TRUE) == RSC_ROLE_MASTER) {
        state = "Master";
    }
    fprintf(stdout, "resource %s is running on: %s %s\n",
            rsc, node->details->uname, state);

https://github.com/ClusterLabs/pacemaker/blame/Pacemaker-2.0.5/tools/crm_resource_runtime.c#L26-L33
https://github.com/ClusterLabs/pacemaker/blame/master/tools/crm_resource_print.c#L306-L313

Oddly, there was a period during which EVERY node was detected as the current master, as well as periods when NO nodes were detected as current master. For example:

No node is current-master:

Feb 8 13:12:51 isvqe-02 ag(ag_cluster)[111620]: INFO: monitor: 2021/02/08 13:12:51 ag-helper invoked with required-synchronized-secondaries-to-commit [-1]; current-master []; disable-primary-on-quorum-timeout-after [60]; primary-write-lease-duration [82]; monitor-interval-timeout [60]

All nodes are current-master:

Feb 8 13:13:01 veritas6 ag(ag_cluster)[172179]: INFO: monitor: 2021/02/08 13:13:01 ag-helper invoked with required-synchronized-secondaries-to-commit [-1]; current-master [veritas7
Feb 8 13:13:01 veritas6 ag(ag_cluster)[172179]: INFO: monitor: isvqe-01
Feb 8 13:13:01 veritas6 ag(ag_cluster)[172179]: INFO: monitor: isvqe-02
Feb 8 13:13:01 veritas6 ag(ag_cluster)[172179]: INFO: monitor: veritas6]; disable-primary-on-quorum-timeout-after [60]; primary-write-lease-duration [82]; monitor-interval-timeout [60]

It seems fair to say that there's a problem with either `crm_resource --locate`, the PCRE in ocf:mssql:ag, or both. But I don't see how this could have been introduced with pacemaker-2.0.5-5.
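To make the parsing failure concrete, here is an illustrative shell check (not taken from the report) of the agent's PCRE against the two output formats:

~~~
# 2.0.5-4 style output: the promoted instance carries a " Master" suffix, so the capture works
echo "resource stateful is running on: node2 Master" \
    | grep -Po 'resource [^ ]+ is running on: \K(.+)(?= Master$)'
# -> node2

# 2.0.5-5 style output: no " Master" suffix, so nothing matches and current_master ends up empty
echo "resource stateful is running on: node2" \
    | grep -Po 'resource [^ ]+ is running on: \K(.+)(?= Master$)'
# -> (no output)
~~~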
(In reply to Reid Wahl from comment #5)
> I'm testing on the master branch of pacemaker, but the logic is the same in 2.0.5. I don't see why we exclude clones when we're checking whether a resource is a promotable clone. However, as noted above, it looks like it's been this way for a while, so I don't expect this to be the cause of the reported issue.

It occurred to me later that this is almost certainly intentional -- we want to work with the primitive and not the clone wrapper itself. With this in mind, my statement that "it seems to have been this way for a while" is no longer relevant.

I think there's a reasonable chance that the `crm_resource --locate` output is the issue.

pacemaker-2.0.5-4:

# rpm -q pacemaker pacemaker-cli pacemaker-cluster-libs pacemaker-libs
pacemaker-2.0.5-4.el8.x86_64
pacemaker-cli-2.0.5-4.el8.x86_64
pacemaker-cluster-libs-2.0.5-4.el8.x86_64
pacemaker-libs-2.0.5-4.el8.x86_64
# pcs status | grep -A2 stateful
  * Clone Set: stateful-clone [stateful] (promotable):
    * Masters: [ node2 ]
    * Slaves: [ node1 ]
# /usr/sbin/crm_resource -r stateful --locate
resource stateful is running on: node2 Master
resource stateful is running on: node1

pacemaker-2.0.5-5:

# rpm -q pacemaker pacemaker-cli pacemaker-cluster-libs pacemaker-libs
pacemaker-2.0.5-5.el8.x86_64
pacemaker-cli-2.0.5-5.el8.x86_64
pacemaker-cluster-libs-2.0.5-5.el8.x86_64
pacemaker-libs-2.0.5-5.el8.x86_64
# pcs status | grep -A2 stateful
  * Clone Set: stateful-clone [stateful] (promotable):
    * Masters: [ node2 ]
    * Slaves: [ node1 ]
# /usr/sbin/crm_resource -r stateful --locate
resource stateful is running on: node1
resource stateful is running on: node2

Since the ocf:mssql:ag resource agent gets the node name that has " Master" after it, and then uses that node name as the --current-master option for ag-helper commands, it seems worthwhile to look more closely at this and see if it's related.

Hmm. The issue in 2.0.5-5 is that resource_search_list_default() checks the same clone **instance** twice. It's not that the role is wrong for a particular instance -- rather, it uses instance 0 for both/all checks, as if instance 0 were running on all nodes. This also might help explain why the buggy behavior only happened on one/some nodes.

2.0.5-4:
(native_resource_state) trace: stateful:0 state: Started
(native_resource_state) trace: stateful:1 state: Master

2.0.5-5:
(native_resource_state) trace: stateful:0 state: Started
(native_resource_state) trace: stateful:0 state: Started

This behavior was introduced in upstream commit b5ce7803. The `crm_resource --locate` command prints the expected output in the previous commit (9ab0635a).

I think I see what went wrong.

The previous version of cli_resource_search() retrieved rsc's parent and then iterated over all of the parent's children, passing each one to do_find_resource(). do_find_resource() then printed the "resource %s is running on" message for each child. In the case of a cloned resource (including promotable clones), this meant iterating over and printing the location of each clone instance.

~~~
int found = 0;
...
    /* The anonymous clone children's common ID is supplied */
    } else if (pe_rsc_is_clone(parent)
               && !pcmk_is_set(rsc->flags, pe_rsc_unique)
               && rsc->clone_name
               && pcmk__str_eq(requested_name, rsc->clone_name, pcmk__str_casei)
               && !pcmk__str_eq(requested_name, rsc->id, pcmk__str_casei)) {

        for (GListPtr iter = parent->children; iter; iter = iter->next) {
            found += do_find_resource(out, requested_name, iter->data, data_set);
        }
...
    return found;
~~~

The current version of cli_resource_search() simply returns the list of nodes on which rsc is running.
If rsc's parent is a clone, then cli_resource_search() gets all children as before -- but this time it simply adds each child's running_on nodes to the `found` list. It does not directly call a printing function.

~~~
GListPtr found = NULL;
...
    /* The anonymous clone children's common ID is supplied */
    } else if (pe_rsc_is_clone(parent)
               && !pcmk_is_set(rsc->flags, pe_rsc_unique)
               && rsc->clone_name
               && pcmk__str_eq(requested_name, rsc->clone_name, pcmk__str_casei)
               && !pcmk__str_eq(requested_name, rsc->id, pcmk__str_casei)) {

        for (GListPtr iter = parent->children; iter; iter = iter->next) {
            GListPtr extra = ((pe_resource_t *) iter->data)->running_on;

            if (extra != NULL) {
                found = g_list_concat(found, extra);
            }
        }
...
    return found;
~~~

Then the caller passes the list of nodes as the `resources` argument to the "resource-search" message (even though it's actually a list of nodes and not a list of resources).

~~~
        case cmd_locate: {
            GListPtr resources = cli_resource_search(out, rsc, options.rsc_id, data_set);
            rc = out->message(out, "resource-search", resources, rsc, options.rsc_id);
            break;
        }
~~~

Meanwhile, the `rsc` argument to "resource-search" is whatever pe_find_resource_with_flags() returned earlier in the main function. pe_find_resource_with_flags() apparently returns the first clone instance (according to a gdb session run against 2.0.5-5 earlier).

~~~
1740        // If command requires that resource exist if specified, find it
1741        if (options.find_flags && options.rsc_id) {
1742            rsc = pe_find_resource_with_flags(data_set->resources, options.rsc_id,
1743                                              options.find_flags);

Breakpoint 3, main (argc=<optimized out>, argv=<optimized out>) at crm_resource.c:1744
1744            if (rsc == NULL) {
(gdb) p rsc->id
$4 = 0x5555558afd10 "stateful:0"
(gdb) p options.rsc_id
$5 = (gchar *) 0x555555786560 "stateful"
~~~

The effect here is that the "stateful:0" pe_resource_t object gets passed to the "resource-search" (resource_search_default()) function. For every node passed via the `resources` argument (as found by cli_resource_search()), resource_search_default() queries the state of the stateful:0 instance. This means it queries the state of the same clone instance every time, so the reported role will be the same every time -- there may be no masters in the output, or multiple masters in the output.

~~~
static int
resource_search_default(pcmk__output_t *out, va_list args)
{
...
    GListPtr nodes = va_arg(args, GListPtr);
    pe_resource_t *rsc = va_arg(args, pe_resource_t *);
...
    for (GListPtr lpc = nodes; lpc != NULL; lpc = lpc->next) {
        pe_node_t *node = (pe_node_t *) lpc->data;
...
        const char *state = "";

        if (!pe_rsc_is_clone(rsc) && rsc->fns->state(rsc, TRUE) == RSC_ROLE_MASTER) {
            state = " Master";
        }
        out->list_item(out, "node", "resource %s is running on: %s%s",
                       requested_name, node->details->uname, state);
~~~

Tying this back to ocf:mssql:ag: the resource agent depends on correctly parsing, from the `crm_resource --locate` output, the name of the node currently running the master instance. Whether this is an ideal way to query that info, I don't know off the top of my head, but that's how it's done nonetheless.

QA:

Reproducer (a concrete setup sketch follows below):
1. Configure a cluster with a promotable clone (e.g. ocf:pacemaker:Stateful).
2. Run "crm_resource --locate -r $RSC_ID" on each node.

With 2.0.5-5, the Master instance(s) will not be indicated. Before 2.0.5-5 and after the fix, Master instances should be indicated, like:

resource stateful is running on: node2 Master
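As a convenience for QA, a hypothetical way to set up step 1 with pcs; the resource name "stateful" and the choice of ocf:pacemaker:Stateful are illustrative, not taken from the reporter's configuration:

~~~
# Step 1: create a promotable clone of the Stateful example agent
pcs resource create stateful ocf:pacemaker:Stateful promotable

# Step 2: ask crm_resource where the clone is running
crm_resource --locate -r stateful
# Broken (2.0.5-5):   resource stateful is running on: node1
#                     resource stateful is running on: node2
# Expected (fixed):   resource stateful is running on: node2 Master
~~~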
FYI, the commit that introduced this regression is included in the 8.4 build not for its own sake, but because it was part of a Pacemaker feature set that is lower than the feature set needed for another bz in 8.4. Backporting a particular feature set from the upstream master branch requires backporting all earlier feature sets not already present, in order to support rolling upgrades.

Chris, comments 6-9 describe the source of the problem.

Candidate fix is up at https://github.com/ClusterLabs/pacemaker/pull/2293. Please feel free to check it out and comment.

Fix merged upstream as of commit d1bb0758.

I installed pacemaker-2.0.5-7.el8 on my four systems and my SQL Server test passed. No issues.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782