Bug 1925681
| Summary: | connection problems with multi-node MS SQL Server | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Daniel Yeisley <dyeisley> |
| Component: | pacemaker | Assignee: | Chris Lumens <clumens> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 8.4 | CC: | cfeist, clumens, cluster-maint, kgaillot, msmazova, nwahl |
| Target Milestone: | rc | Keywords: | Regression, Triaged |
| Target Release: | 8.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-2.0.5-7.el8 | Doc Type: | No Doc Update |
| Doc Text: | This issue was in an unreleased build. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-18 15:26:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description by Daniel Yeisley, 2021-02-05 20:42:51 UTC
Hi, Daniel. First let's get some basic questions out of the way.

Did you update any other packages (e.g., resource-agents)? Did you stop Pacemaker before updating? Did you run anything besides the sqlcmd after updating (e.g., did you rebuild the cluster and run the whole test from scratch)? What were the resource states and locations after the update?

Aside from that, can you attach a crm_report from each node? It would be good to take a peek at the logs.

(In reply to Reid Wahl from comment #1)
> Hi, Daniel. First let's get some basic questions out of the way.
>
> Did you update any other packages (e.g., resource-agents)?

No, I only updated the pacemaker packages. I started seeing problems on nightly composes in January. I put the 2.0.5-5 packages on an older compose from December and recreated the problem. Then I got a newer compose, downgraded the packages to 2.0.5-4, and had no issues.

> Did you stop Pacemaker before updating?

No, I did not. I just installed the rpms.

> Did you run anything besides the sqlcmd after updating (e.g., did you
> rebuild the cluster and run the whole test from scratch)?

Nope. I just installed the rpms and attempted to run sqlcmd.

> What were the resource states and locations after the update?

My test had finished, so all of the resources were on node 4. I don't recall the states.

> Aside from that, can you attach a crm_report from each node? It would be
> good to take a peek at the logs.

I'll set up the test again to get this. I should add that I don't see this problem when using just 2 nodes.

Thanks for the crm_reports :)

> Neither 2.0.5-5 nor 2.0.5-6 has anything I know of that would affect SQL resources

Yeah, colocation is still working fine, so the new "influence/critical" feature should be unrelated. I would be surprised if the addition of `crm_resource --digests` or the attrd fix for remote_proxy_cb could be related here either. Those are listed as the only three changes for 2.0.5-5.

-----

> It's definitely a good idea to stop pacemaker before updating.

Agreed. I'm not sure if Ken does any testing of updating pacemaker while it's still running, but for us on the support team it's a nack. We always advise users to stop pacemaker before an update, to reduce the risk of unexpected behavior and to keep the cluster in a clean and coherent state.

- Recommended Practices for Applying Software Updates to a RHEL High Availability or Resilient Storage Cluster (https://access.redhat.com/articles/2059253)
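For readers following along, a minimal sketch of one way to apply that advice on a pcs-managed RHEL 8 cluster; the node name and package glob below are illustrative and not taken from this report:

~~~
# Drain and stop the cluster stack on one node before updating it
# (repeat per node for a rolling update, or use "pcs cluster stop --all"
# for a full-cluster outage):
pcs node standby node1
pcs cluster stop node1

# Update the Pacemaker packages on that node
dnf update 'pacemaker*'

# Rejoin the node and let resources move back
pcs cluster start node1
pcs node unstandby node1
~~~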
-----

The move began at 13:36:34.

Feb 8 13:36:34 veritas7 pacemaker-schedulerd[64529]: notice: Calculated transition 57, saving inputs in /var/lib/pacemaker/pengine/pe-input-13.bz2
Feb 8 13:36:34 veritas7 pacemaker-schedulerd[64529]: notice: * Move virtualip ( veritas7 -> isvqe-01 )
Feb 8 13:36:41 veritas7 pacemaker-controld[64530]: notice: Initiating demote operation ag_cluster_demote_0 locally on veritas7
Feb 8 13:36:41 veritas7 pacemaker-controld[64530]: notice: Result of demote operation for ag_cluster on veritas7: ok

Promote completed at 13:36:57. I'm not knowledgeable in MSSQL but will share selected messages below.

Feb 8 13:36:52 isvqe-01 pacemaker-controld[65827]: notice: Requesting local execution of promote operation for ag_cluster on isvqe-01
Feb 8 13:36:52 isvqe-01 ag(ag_cluster)[131161]: INFO: promote: 2021/02/08 13:36:52 ag-helper invoked with hostname [localhost]; port [1433]; ag-name [ag1]; credentials-file [/var/opt/mssql/secrets/passwd]; application-name [monitor-ag_cluster-promote]; connection-timeout [30]; health-threshold [3]; action [promote]
Feb 8 13:36:52 isvqe-01 ag(ag_cluster)[131161]: INFO: promote: 2021/02/08 13:36:52 ag-helper invoked with skip-precheck [false]; sequence-numbers [...]; new-master [isvqe-01]; required-synchronized-secondaries-to-commit [-1]; disable-primary-on-quorum-timeout-after [60]; primary-write-lease-duration [70]
...
Feb 8 13:36:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:57.19 spid52s The availability group database "ds2" is changing roles from "SECONDARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
Feb 8 13:36:57 isvqe-01 ag(ag_cluster)[131161]: INFO: promote: 2021/02/08 13:36:57 Replica is PRIMARY (1)
Feb 8 13:36:57 isvqe-01 ag(ag_cluster)[131161]: INFO: promote: 2021/02/08 13:36:57 Replica is now PRIMARY
Feb 8 13:36:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:57.22 spid52s The availability group database "ds2" is changing roles from "RESOLVING" to "PRIMARY" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
...
Feb 8 13:36:57 isvqe-01 pacemaker-controld[65827]: notice: Result of promote operation for ag_cluster on isvqe-01: ok
Feb 8 13:36:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:57.66 spid14s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[77483240-4F63-43C1-A5C6-5006C73160D7], database [ds2], remote endpoint [tcp://isvqe-02:5022], source operation [641A3BEB-3CDB-46C9-8545-4971B551B8A5]: Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED].'
Feb 8 13:36:58 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:57.99 spid14s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[791B937E-741A-4B48-AB12-88CBCBE5C8C9], database [ds2], remote endpoint [tcp://veritas6:5022], source operation [F6A7412B-57A5-4DA5-B763-6FB6DF70D007]: Transitioning from [PENDING] to [CHECK_IF_SEEDING_NEEDED].'
...
Feb 8 13:36:59 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:59.38 spid107s 0 transactions rolled back in database 'ds2' (5:0). This is an informational message only. No user action is required.
Feb 8 13:36:59 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:59.38 spid107s Recovery completed for database ds2 (database ID 5) in 3 second(s) (analysis 0 ms, redo 0 ms, undo 0 ms [system undo 0 ms, regular undo 0 ms].) This is an informational message only. No user action is required.
Feb 8 13:36:59 isvqe-01 sqlservr[64677]: #0152021-02-08 13:36:59.39 spid107s Parallel redo is shutdown for database 'ds2' with worker pool size [16].
Feb 8 13:36:59 isvqe-01 sqlservr[64677]: #015AGID [BE9BE8C9-6E17-1132-BFBA-8B7D2C28AFDB] : [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA] local replica received build replica response from [77483240-4F63-43C1-A5C6-5006C73160D7], Response code [0x80071c27]. Message handle status [1].2021-02-08 13:36:59.43 spid129s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[77483240-4F63-43C1-A5C6-5006C73160D7], database [ds2], remote endpoint [tcp://isvqe-02:5022], source operation [641A3BEB-3CDB-46C9-8545-4971B551B8A5]: Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP].'
Feb 8 13:36:59 isvqe-01 sqlservr[64677]: #015AGID [BE9BE8C9-6E17-1132-BFBA-8B7D2C28AFDB] : [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA] local replica received build replica response from [791B937E-741A-4B48-AB12-88CBCBE5C8C9], Response code [0x80071c27]. Message handle status [1].2021-02-08 13:36:59.57 spid132s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[791B937E-741A-4B48-AB12-88CBCBE5C8C9], database [ds2], remote endpoint [tcp://veritas6:5022], source operation [F6A7412B-57A5-4DA5-B763-6FB6DF70D007]: Transitioning from [CHECK_IF_SEEDING_NEEDED] to [CATCHUP].'

Here is the error that MSSQL reports for the failed login attempts from veritas6 (with IP ending in 103).

Feb 8 13:38:25 isvqe-01 sqlservr[64677]: #0152021-02-08 13:38:25.94 Logon Error: 18456, Severity: 14, State: 38.
Feb 8 13:38:25 isvqe-01 sqlservr[64677]: #0152021-02-08 13:38:25.94 Logon Login failed for user 'sa'. Reason: Failed to open the explicitly specified database 'DS2'. [CLIENT: 192.168.100.103]
Feb 8 13:38:35 isvqe-01 sqlservr[64677]: #0152021-02-08 13:38:35.98 Logon Error: 18456, Severity: 14, State: 38.
Feb 8 13:38:35 isvqe-01 sqlservr[64677]: #0152021-02-08 13:38:35.98 Logon Login failed for user 'sa'. Reason: Failed to open the explicitly specified database 'DS2'. [CLIENT: 192.168.100.103]

The sqlcmd output said pretty much the same thing. Do we have any further info on what at the MSSQL level prevented us from opening the database?

And a bit later:

Feb 8 13:39:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:39:57.34 spid62s Always On: DebugTraceVarArgs AR 'Seeding is canceled with cancelReason = 21'
Feb 8 13:39:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:39:57.34 spid62s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[AB8EE3AA-7D63-4E88-A1C0-C08F0F564C63], database [ds2], remote endpoint [tcp://veritas7:5022], source operation [2C0D9A0F-8D92-4B4F-99BE-44C8EE0BF755]: Transitioning from [CHECK_IF_SEEDING_NEEDED] to [FAILED].'
Feb 8 13:39:57 isvqe-01 sqlservr[64677]: #015CHadrTransportReplica::SendBuildMessageAndWaitForResponse failed due to transport disconnect
Feb 8 13:39:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:39:57.35 spid56s Always On: DebugTraceVarArgs AR '[HADR] [Primary] operation on replicas [C13A92B6-7CFC-4FBB-841E-2AFF0C4791BA]->[AB8EE3AA-7D63-4E88-A1C0-C08F0F564C63], database [ds2], remote endpoint [tcp://veritas7:5022], source operation [2C0D9A0F-8D92-4B4F-99BE-44C8EE0BF755]: Seeding task failed with result 0x00000002.'
Feb 8 13:39:57 isvqe-01 sqlservr[64677]: #0152021-02-08 13:39:57.36 spid56s Always On: DebugTraceVarArgs AR 'Seeding is canceled with cancelReason = 20'

-----

The below might be a red herring (because it seems to have been this way for a while) but it caught my eye. We can keep looking for other clues. (Note: I can't find current source code for ag-helper... I'm thinking Microsoft keeps it locked down.)

I noticed on both node 1 (veritas6) and node 3 (isvqe-01) that the ag-helper process was launched with the `--current-master` option but no optarg.
root 135219 0.0 0.0 461148 8944 ? Sl 13:43 0:00 /usr/lib/ocf/lib/mssql/ag-helper --port 1433 --credentials-file /var/opt/mssql/secrets/passwd --ag-name ag1 --application-name monitor-ag_cluster-monitor --connection-timeout 30 --health-threshold 3 --action monitor --required-synchronized-secondaries-to-commit -1 --current-master --disable-primary-on-quorum-timeout-after 60 --primary-write-lease-duration 81 --monitor-interval-timeout 60
root 196640 0.3 0.0 387416 12480 ? Sl 13:42 0:00 /usr/lib/ocf/lib/mssql/ag-helper --port 1433 --credentials-file /var/opt/mssql/secrets/passwd --ag-name ag1 --application-name monitor-ag_cluster-monitor --connection-timeout 30 --health-threshold 3 --action monitor --required-synchronized-secondaries-to-commit -1 --current-master --disable-primary-on-quorum-timeout-after 60 --primary-write-lease-duration 80 --monitor-interval-timeout 60

The ag resource agent gets current_master as follows:

local current_master="$(crm_resource -r "$OCF_RESOURCE_INSTANCE" --locate | grep -Po 'resource [^ ]+ is running on: \K(.+)(?= Master$)')"

But that regex doesn't match anything.

# pcs resource status | grep -A2 stateful
  * Clone Set: stateful-clone [stateful] (promotable):
    * Masters: [ node2 ]
    * Slaves: [ node1 ]
# /usr/sbin/crm_resource -r stateful --locate
resource stateful is running on: node1
resource stateful is running on: node2
# /usr/sbin/crm_resource -r stateful --locate | grep -Po 'resource [^ ]+ is running on: \K(.+)(?= Master$)'
#

I'm testing on the master branch of pacemaker, but the logic is the same in 2.0.5. I don't see why we exclude clones when we're checking whether a resource is a promotable clone. However, as noted above, it looks like it's been this way for a while, so I don't expect this to be the cause of the reported issue.

    if (!pe_rsc_is_clone(the_rsc)
        && the_rsc->fns->state(the_rsc, TRUE) == RSC_ROLE_MASTER) {
        state = "Master";
    }
    fprintf(stdout, "resource %s is running on: %s %s\n",
            rsc, node->details->uname, state);

https://github.com/ClusterLabs/pacemaker/blame/Pacemaker-2.0.5/tools/crm_resource_runtime.c#L26-L33
https://github.com/ClusterLabs/pacemaker/blame/master/tools/crm_resource_print.c#L306-L313

Oddly, there was a period during which EVERY node was detected as the current master, as well as periods when NO nodes were detected as current master. For example:

No node is current-master:

Feb 8 13:12:51 isvqe-02 ag(ag_cluster)[111620]: INFO: monitor: 2021/02/08 13:12:51 ag-helper invoked with required-synchronized-secondaries-to-commit [-1]; current-master []; disable-primary-on-quorum-timeout-after [60]; primary-write-lease-duration [82]; monitor-interval-timeout [60]

All nodes are current-master:

Feb 8 13:13:01 veritas6 ag(ag_cluster)[172179]: INFO: monitor: 2021/02/08 13:13:01 ag-helper invoked with required-synchronized-secondaries-to-commit [-1]; current-master [veritas7
Feb 8 13:13:01 veritas6 ag(ag_cluster)[172179]: INFO: monitor: isvqe-01
Feb 8 13:13:01 veritas6 ag(ag_cluster)[172179]: INFO: monitor: isvqe-02
Feb 8 13:13:01 veritas6 ag(ag_cluster)[172179]: INFO: monitor: veritas6]; disable-primary-on-quorum-timeout-after [60]; primary-write-lease-duration [82]; monitor-interval-timeout [60]

It seems fair to say that there's a problem with either `crm_resource --locate`, the PCRE in ocf:mssql:ag, or both. But I don't see how this could have been introduced with pacemaker-2.0.5-5.
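To make the parsing failure concrete, here is an illustrative shell check (not taken from the report) of the agent's PCRE against the two output formats:

~~~
# 2.0.5-4 style output: the promoted instance carries a " Master" suffix, so the capture works
echo "resource stateful is running on: node2 Master" \
    | grep -Po 'resource [^ ]+ is running on: \K(.+)(?= Master$)'
# -> node2

# 2.0.5-5 style output: no " Master" suffix, so nothing matches and current_master ends up empty
echo "resource stateful is running on: node2" \
    | grep -Po 'resource [^ ]+ is running on: \K(.+)(?= Master$)'
# -> (no output)
~~~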
(In reply to Reid Wahl from comment #5)
> I'm testing on the master branch of pacemaker, but the logic is the same in 2.0.5. I don't see why we exclude clones when we're checking whether a resource is a promotable clone. However, as noted above, it looks like it's been this way for a while, so I don't expect this to be the cause of the reported issue.

It occurred to me later that this is almost certainly intentional -- we want to work with the primitive and not the clone wrapper itself. With this in mind, my statement that "it seems to have been this way for a while" is no longer relevant.

I think there's a reasonable chance that the `crm_resource --locate` output is the issue.

pacemaker-2.0.5-4:

# rpm -q pacemaker pacemaker-cli pacemaker-cluster-libs pacemaker-libs
pacemaker-2.0.5-4.el8.x86_64
pacemaker-cli-2.0.5-4.el8.x86_64
pacemaker-cluster-libs-2.0.5-4.el8.x86_64
pacemaker-libs-2.0.5-4.el8.x86_64
# pcs status | grep -A2 stateful
  * Clone Set: stateful-clone [stateful] (promotable):
    * Masters: [ node2 ]
    * Slaves: [ node1 ]
# /usr/sbin/crm_resource -r stateful --locate
resource stateful is running on: node2 Master
resource stateful is running on: node1

pacemaker-2.0.5-5:

# rpm -q pacemaker pacemaker-cli pacemaker-cluster-libs pacemaker-libs
pacemaker-2.0.5-5.el8.x86_64
pacemaker-cli-2.0.5-5.el8.x86_64
pacemaker-cluster-libs-2.0.5-5.el8.x86_64
pacemaker-libs-2.0.5-5.el8.x86_64
# pcs status | grep -A2 stateful
  * Clone Set: stateful-clone [stateful] (promotable):
    * Masters: [ node2 ]
    * Slaves: [ node1 ]
# /usr/sbin/crm_resource -r stateful --locate
resource stateful is running on: node1
resource stateful is running on: node2

Since the ocf:mssql:ag resource agent gets the node name that has " Master" after it, and then uses that node name as the --current-master option for ag-helper commands, it seems worthwhile to look more closely at this and see if it's related.

Hmm. The issue in 2.0.5-5 is that resource_search_list_default() checks the same clone **instance** twice. It's not that the role is wrong for a particular instance -- rather, it uses instance 0 for both/all checks, as if instance 0 were running on all nodes. This also might help explain why the buggy behavior only happened on one/some nodes.

2.0.5-4:
(native_resource_state) trace: stateful:0 state: Started
(native_resource_state) trace: stateful:1 state: Master

2.0.5-5:
(native_resource_state) trace: stateful:0 state: Started
(native_resource_state) trace: stateful:0 state: Started

This behavior was introduced in upstream commit b5ce7803. The `crm_resource --locate` command prints the expected output in the previous commit (9ab0635a).

I think I see what went wrong.

The previous version of cli_resource_search() retrieved rsc's parent and then iterated over all of the parent's children, passing each one to do_find_resource(). do_find_resource() then printed the "resource %s is running on" message for each child. In the case of a cloned resource (including promotable clones), this meant iterating over and printing the location of each clone instance.

~~~
int found = 0;
...
    /* The anonymous clone children's common ID is supplied */
    } else if (pe_rsc_is_clone(parent)
               && !pcmk_is_set(rsc->flags, pe_rsc_unique)
               && rsc->clone_name
               && pcmk__str_eq(requested_name, rsc->clone_name, pcmk__str_casei)
               && !pcmk__str_eq(requested_name, rsc->id, pcmk__str_casei)) {

        for (GListPtr iter = parent->children; iter; iter = iter->next) {
            found += do_find_resource(out, requested_name, iter->data, data_set);
        }
...
    return found;
~~~

The current version of cli_resource_search() simply returns the list of nodes on which rsc is running.
If rsc's parent is a clone, then cli_resource_search() gets all children as before -- but this time it simply adds each child's running_on nodes to the `found` list. It does not directly call a printing function.

~~~
GListPtr found = NULL;
...
    /* The anonymous clone children's common ID is supplied */
    } else if (pe_rsc_is_clone(parent)
               && !pcmk_is_set(rsc->flags, pe_rsc_unique)
               && rsc->clone_name
               && pcmk__str_eq(requested_name, rsc->clone_name, pcmk__str_casei)
               && !pcmk__str_eq(requested_name, rsc->id, pcmk__str_casei)) {

        for (GListPtr iter = parent->children; iter; iter = iter->next) {
            GListPtr extra = ((pe_resource_t *) iter->data)->running_on;

            if (extra != NULL) {
                found = g_list_concat(found, extra);
            }
        }
...
    return found;
~~~

Then the caller passes the list of nodes as the `resources` argument to the "resource-search" message (even though it's actually a list of nodes and not a list of resources).

~~~
        case cmd_locate: {
            GListPtr resources = cli_resource_search(out, rsc, options.rsc_id, data_set);
            rc = out->message(out, "resource-search", resources, rsc, options.rsc_id);
            break;
        }
~~~

Meanwhile, the `rsc` argument to "resource-search" is whatever pe_find_resource_with_flags() returned earlier in the main function. pe_find_resource_with_flags() apparently returns the first clone instance (according to a gdb session run against 2.0.5-5 earlier).

~~~
1740        // If command requires that resource exist if specified, find it
1741        if (options.find_flags && options.rsc_id) {
1742            rsc = pe_find_resource_with_flags(data_set->resources, options.rsc_id,
1743                                              options.find_flags);

Breakpoint 3, main (argc=<optimized out>, argv=<optimized out>) at crm_resource.c:1744
1744            if (rsc == NULL) {
(gdb) p rsc->id
$4 = 0x5555558afd10 "stateful:0"
(gdb) p options.rsc_id
$5 = (gchar *) 0x555555786560 "stateful"
~~~

The effect here is that the "stateful:0" pe_resource_t object gets passed to the "resource-search" (resource_search_default()) function. For every node passed via the `resources` argument (as found by cli_resource_search()), resource_search_default() queries the state of the stateful:0 instance. This means it queries the state of the same clone instance every time, so the reported role will be the same every time -- there may be no masters in the output, or multiple masters in the output.

~~~
static int
resource_search_default(pcmk__output_t *out, va_list args)
{
...
    GListPtr nodes = va_arg(args, GListPtr);
    pe_resource_t *rsc = va_arg(args, pe_resource_t *);
...
    for (GListPtr lpc = nodes; lpc != NULL; lpc = lpc->next) {
        pe_node_t *node = (pe_node_t *) lpc->data;
...
        const char *state = "";

        if (!pe_rsc_is_clone(rsc) && rsc->fns->state(rsc, TRUE) == RSC_ROLE_MASTER) {
            state = " Master";
        }
        out->list_item(out, "node", "resource %s is running on: %s%s",
                       requested_name, node->details->uname, state);
~~~

Tying this back to ocf:mssql:ag: the resource agent depends on correctly parsing, from the `crm_resource --locate` output, the name of the node currently running the master instance. Whether this is an ideal way to query that info, I don't know off the top of my head, but that's how it's done nonetheless.

QA:

Reproducer (a concrete setup sketch follows below):
1. Configure a cluster with a promotable clone (e.g. ocf:pacemaker:Stateful).
2. Run "crm_resource --locate -r $RSC_ID" on each node.

With 2.0.5-5, the Master instance(s) will not be indicated. Before 2.0.5-5 and after the fix, Master instances should be indicated, like:

resource stateful is running on: node2 Master
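As a convenience for QA, a hypothetical way to set up step 1 with pcs; the resource name "stateful" and the choice of ocf:pacemaker:Stateful are illustrative, not taken from the reporter's configuration:

~~~
# Step 1: create a promotable clone of the Stateful example agent
pcs resource create stateful ocf:pacemaker:Stateful promotable

# Step 2: ask crm_resource where the clone is running
crm_resource --locate -r stateful
# Broken (2.0.5-5):   resource stateful is running on: node1
#                     resource stateful is running on: node2
# Expected (fixed):   resource stateful is running on: node2 Master
~~~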
FYI, the commit that introduced this regression is included in the 8.4 build not for its own sake, but because it was part of a Pacemaker feature set that is lower than the feature set needed for another bz in 8.4. Backporting a particular feature set from the upstream master branch requires backporting all earlier feature sets not already present, in order to support rolling upgrades.

Chris, comments 6-9 describe the source of the problem.

Candidate fix is up at https://github.com/ClusterLabs/pacemaker/pull/2293. Please feel free to check it out and comment.

Fix merged upstream as of commit d1bb0758.

I installed pacemaker-2.0.5-7.el8 on my four systems and my SQL Server test passed. No issues.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782