Bug 1907726 - Fail counts are not removed when pcs resource cleanup is run from a remote node
Summary: Fail counts are not removed when pcs resource cleanup is run from a remote node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 8.4
Assignee: Reid Wahl
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-15 04:15 UTC by Takashi Kajinami
Modified: 2021-05-18 15:46 UTC
CC: 11 users

Fixed In Version: pacemaker-2.0.5-5.el8
Doc Type: Bug Fix
Doc Text:
Cause: Node attribute requests run on Pacemaker Remote nodes always had the node's name added to the request by the cluster. Consequence: The "pcs resource cleanup" command, which uses a node attribute clear request and should clean up all nodes' failures if no node is specified, would instead clean up only the Pacemaker Remote node's failures when run there with no node specified. Fix: The cluster adds the Pacemaker Remote node's name only to node attribute update requests, not node attribute clearing requests. Result: "pcs resource cleanup" without a node specified now works as intended when run from a Pacemaker Remote node.
Clone Of:
Environment:
Last Closed: 2021-05-18 15:26:40 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links:
  * Github ClusterLabs pacemaker pull 2260 (closed): Fix: liblrmd: Limit node name addition to proxied attrd update commands (last updated 2021-02-15 10:34:05 UTC)
  * Red Hat Knowledge Base (Solution) 5656091 (last updated 2020-12-17 07:14:27 UTC)
  * Red Hat Product Errata RHEA-2021:1782 (last updated 2021-05-18 15:46:50 UTC)

Description Takashi Kajinami 2020-12-15 04:15:42 UTC
Description of problem:

When we run "pcs resource cleanup" on a remote node where pacemaker_remoted is running, the command succeeds without any errors, but fail counts for pacemaker resources are not actually cleaned up.
To clean up the fail counts, we always have to run the cleanup command on a node where the full pacemaker cluster stack is running.


Version-Release number of selected component (if applicable):
RHEL 8.2, with the following packages installed:

~~~
corosync-3.0.3-2.el8.x86_64
corosynclib-3.0.3-2.el8.x86_64
pacemaker-2.0.3-5.el8_2.1.x86_64
pacemaker-cli-2.0.3-5.el8_2.1.x86_64
pacemaker-cluster-libs-2.0.3-5.el8_2.1.x86_64
pacemaker-libs-2.0.3-5.el8_2.1.x86_64
pacemaker-remote-2.0.3-5.el8_2.1.x86_64
pacemaker-schemas-2.0.3-5.el8_2.1.noarch
pcs-0.10.4-6.el8_2.1.x86_64
~~~


How reproducible:
Always

Steps to Reproduce:
1. Set up a cluster and create a remote resource with pacemaker_remote
2. Cause some failures in the cluster
3. Run "pcs resource cleanup" on a remote node

Actual results:
Fail counts are not deleted

Expected results:
Fail counts are deleted

Additional info:
This issue was initially observed in an RHOSP 16.1 deployment with the instance HA feature enabled.

Comment 8 Reid Wahl 2020-12-16 09:59:45 UTC
Well, good news first: This is easily reproducible. `crm_resource --cleanup` from a remote node clears a failure from "Failed Resource Actions" (failures) but **not** from "Migration Summary" (failcounts).

Environment:
------------
node1: fastvm-rhel-8-0-23
node2: fastvm-rhel-8-0-24
node3-rem: fastvm-rhel-8-0-52


[root@fastvm-rhel-8-0-23 pacemaker]# crm_resource --fail --node node2 --resource dummy
Waiting for 1 reply from the controller. OK

[root@fastvm-rhel-8-0-23 pacemaker]# crm_mon -1 --exclude=all --include=failcounts,failures
Migration Summary:
  * Node: node2:
    * dummy: migration-threshold=1000000 fail-count=3 last-failure='Tue Dec 15 23:40:00 2020'

Failed Resource Actions:
  * dummy_asyncmon_0 on node2 'error' (1): call=35, status='complete', exitreason='Simulated failure', last-rc-change='2020-12-15 23:40:00 -08:00', queued=0ms, exec=0ms

[root@fastvm-rhel-8-0-52 pacemaker]# crm_resource --cleanup
Cleaned up all resources on all nodes
Waiting for 1 reply from the controller. OK

[root@fastvm-rhel-8-0-23 pacemaker]# crm_mon -1 --exclude=all --include=failcounts,failures
Migration Summary:
  * Node: node2:
    * dummy: migration-threshold=1000000 fail-count=3 last-failure='Tue Dec 15 23:40:00 2020'


As Takashi said, when this happens, `crm_mon ... --include=failcounts` on the remote node may very briefly show an empty "Migration Summary:" section, but the failcounts quickly return (presumably re-synced from the non-remote nodes).

-----

Looking at the logs on the node where the remote resource is running, it appears that when the `cleanup` command is run from a remote node, it only matches the fail-count and last-failure attributes set for the remote node. If the `cleanup` command is run from a full node, it matches the fail-count and last-failure attributes for **all** nodes.

Cleanup from remote node:

    Dec 16 01:54:44 fastvm-rhel-8-0-24 pacemaker-attrd     [11792] (attrd_client_update:249)       debug: Setting ^(fail-count|last-failure)- to (null)
    Dec 16 01:54:44 fastvm-rhel-8-0-24 pacemaker-attrd     [11792] (attrd_client_update:259)       trace: Matched fail-count-dummy#asyncmon_0 with ^(fail-count|last-failure)-
    Dec 16 01:54:44 fastvm-rhel-8-0-24 pacemaker-attrd     [11792] (attrd_client_update:259)       trace: Matched last-failure-dummy#asyncmon_0 with ^(fail-count|last-failure)-
    Dec 16 01:54:44 fastvm-rhel-8-0-24 pacemaker-attrd     [11792] (attrd_peer_update:903)         trace: Unchanged fail-count-dummy#asyncmon_0[node3-rem] from node2 is (null)
    Dec 16 01:54:44 fastvm-rhel-8-0-24 pacemaker-attrd     [11792] (attrd_peer_update:903)         trace: Unchanged last-failure-dummy#asyncmon_0[node3-rem] from node2 is (null)


Cleanup from a different full node (for consistency, this was not run from the local node):

    Dec 16 01:55:28 fastvm-rhel-8-0-24 pacemaker-attrd     [11792] (attrd_peer_update:849)         debug: Setting fail-count-dummy#asyncmon_0 for all hosts to (null)
    Dec 16 01:55:28 fastvm-rhel-8-0-24 pacemaker-attrd     [11792] (attrd_peer_update:883)         notice: Setting fail-count-dummy#asyncmon_0[node2]: 2 -> (unset) | from node1
    Dec 16 01:55:28 fastvm-rhel-8-0-24 pacemaker-attrd     [11792] (attrd_peer_update:903)         trace: Unchanged fail-count-dummy#asyncmon_0[node3-rem] from node1 is (null)
    Dec 16 01:55:28 fastvm-rhel-8-0-24 pacemaker-attrd     [11792] (attrd_peer_update:849)         debug: Setting last-failure-dummy#asyncmon_0 for all hosts to (null)
    Dec 16 01:55:28 fastvm-rhel-8-0-24 pacemaker-attrd     [11792] (attrd_peer_update:883)         notice: Setting last-failure-dummy#asyncmon_0[node2]: 1608112464 -> (unset) | from node1
    Dec 16 01:55:28 fastvm-rhel-8-0-24 pacemaker-attrd     [11792] (attrd_peer_update:903)         trace: Unchanged last-failure-dummy#asyncmon_0[node3-rem] from node1 is (null)


Note that in the latter case, attrd_client_update() is not called, and attributes for all nodes are matched and unset (rather than just the node that ran the cleanup command).

I'll have to take a closer look at the code tomorrow(-ish) unless Ken beats me to it. I think this gets us pointed in the right general direction.
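
For reference, the `^(fail-count|last-failure)-` pattern in those trace messages is an extended regular expression over node attribute names. A minimal standalone illustration of what it selects (this is not pacemaker code; the attribute names are taken from the logs above, and the real bug is about which **nodes** the clearing applies to, not about the pattern):

~~~
#include <regex.h>
#include <stdio.h>

int main(void)
{
    /* Pattern used by the cleanup request in the pacemaker-attrd logs */
    const char *pattern = "^(fail-count|last-failure)-";
    const char *attrs[] = {
        "fail-count-dummy#asyncmon_0",   /* cleared by cleanup */
        "last-failure-dummy#asyncmon_0", /* cleared by cleanup */
        "some-other-attribute",          /* left alone */
    };
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return 1;
    }
    for (int i = 0; i < 3; i++) {
        printf("%-32s %s\n", attrs[i],
               (regexec(&re, attrs[i], 0, NULL, 0) == 0) ? "matches" : "no match");
    }
    regfree(&re);
    return 0;
}
~~~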

Comment 9 Reid Wahl 2020-12-16 10:07:54 UTC
(In reply to Reid Wahl from comment #8)
> If the `cleanup` command is run from a full node, it matches the fail-count
> and last-failure attributes for **all** nodes.

This is not totally accurate phrasing. The cleanup command was run from node1. node2's logs show the attributes for all the nodes **except** node1.


> Note that in the latter case, attrd_client_update() is not called ...

Actually it is. But it's called from the node where the cleanup command is run (node1 in our example).

Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (attrd_client_update:249)       debug: Setting ^(fail-count|last-failure)- to (null)
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (attrd_client_update:259)       trace: Matched fail-count-dummy#asyncmon_0 with ^(fail-count|last-failure)-
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (attrd_client_update:259)       trace: Matched last-failure-dummy#asyncmon_0 with ^(fail-count|last-failure)-
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (attrd_peer_update:849)         debug: Setting fail-count-dummy#asyncmon_0 for all hosts to (null)
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (attrd_peer_update:883)         notice: Setting fail-count-dummy#asyncmon_0[node2]: 1 -> (unset) | from node1
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (write_attribute:1267)  debug: Updating fail-count-dummy#asyncmon_0[node2]=(null) (peer known as node2, UUID 2, ID 2/2)
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (write_attribute:1267)  debug: Updating fail-count-dummy#asyncmon_0[node3-rem]=(null) (peer known as node3-rem, UUID node3-rem, ID 0/0)
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (write_attribute:1299)  info: Sent CIB request 20 with 2 changes for fail-count-dummy#asyncmon_0 (id n/a, set n/a)
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (attrd_peer_update:849)         debug: Setting last-failure-dummy#asyncmon_0 for all hosts to (null)
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (attrd_peer_update:883)         notice: Setting last-failure-dummy#asyncmon_0[node2]: 1608112596 -> (unset) | from node1
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (write_attribute:1267)  debug: Updating last-failure-dummy#asyncmon_0[node2]=(null) (peer known as node2, UUID 2, ID 2/2)
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (write_attribute:1267)  debug: Updating last-failure-dummy#asyncmon_0[node3-rem]=(null) (peer known as node3-rem, UUID node3-rem, ID 0/0)
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (write_attribute:1299)  info: Sent CIB request 21 with 2 changes for last-failure-dummy#asyncmon_0 (id n/a, set n/a)
...
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (attrd_cib_callback:1029)       info: CIB update 20 result for fail-count-dummy#asyncmon_0: OK | rc=0
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (attrd_cib_callback:1033)       info: * fail-count-dummy#asyncmon_0[node2]=(null)
Dec 16 01:56:40 fastvm-rhel-8-0-23 pacemaker-attrd     [11586] (attrd_cib_callback:1033)       info: * fail-count-dummy#asyncmon_0[node3-rem]=(null)



In the remote node case, attrd_client_update() is called from the node where the remote resource is running.

I have a guess about what's going on here (involving the host argument) but haven't looked closely yet.

Comment 10 Ken Gaillot 2020-12-16 14:54:51 UTC
(In reply to Reid Wahl from comment #9)
> I have a guess about what's going on here (involving the host argument) but
> haven't looked closely yet.

This triggered a memory: the issue most likely has to do with remote_proxy_cb() adding PCMK__XA_ATTR_NODE_NAME to proxied attrd requests.

Comment 11 Reid Wahl 2020-12-17 01:19:16 UTC
Yep, the immediate cause is that PCMK__XA_ATTR_NODE_NAME is set for the attrd_client_update() call from the remote node. remote_proxy_cb() in proxy_common.c is doing it.

When I commented out this crm_xml_add() line (https://github.com/ClusterLabs/pacemaker/blob/master/lib/lrmd/proxy_common.c#L263), the `crm_resource --cleanup` from the remote node cleaned up all failures as expected.

            if (pcmk__str_eq(type, T_ATTRD, pcmk__str_casei)
                && crm_element_value(request,
                                     PCMK__XA_ATTR_NODE_NAME) == NULL) {
                //crm_xml_add(request, PCMK__XA_ATTR_NODE_NAME, proxy->node_name);
            }

With a failure for the dummy resource on node2, the cleanup command run from the remote node, and the crm_xml_add() line commented out as shown above, the fail-count attribute **does** get unset.

Dec 16 17:08:17 fastvm-rhel-8-0-24 pacemaker-controld  [30912] (remote_proxy_cb:284)     trace: Relayed request request 4 from node3-rem to attrd for proxy-attrd-206625-3ae4066f
Dec 16 17:08:17 fastvm-rhel-8-0-24 pacemaker-attrd     [30910] (attrd_ipc_dispatch:249)       trace: nwahl   <create_attrd_op t="attrd" src="crm_resource" task="clear-failure" attr_is_remote="0" acl_role="pacemaker-remote" acl_target="node3-rem" lrmd_ipc_user="node3-rem" attr_user="node3-rem"/>
Dec 16 17:08:17 fastvm-rhel-8-0-24 pacemaker-attrd     [30910] (attrd_client_update:245)       trace: nwahl: host: <null>
Dec 16 17:08:17 fastvm-rhel-8-0-24 pacemaker-attrd     [30910] (attrd_client_update:251)       debug: Setting ^(fail-count|last-failure)- to (null)
Dec 16 17:08:17 fastvm-rhel-8-0-24 pacemaker-attrd     [30910] (attrd_client_update:261)       trace: Matched fail-count-dummy#asyncmon_0 with ^(fail-count|last-failure)-
Dec 16 17:08:17 fastvm-rhel-8-0-24 pacemaker-attrd     [30910] (attrd_client_update:261)       trace: Matched last-failure-dummy#asyncmon_0 with ^(fail-count|last-failure)-
Dec 16 17:08:17 fastvm-rhel-8-0-24 pacemaker-controld  [30912] (remote_proxy_dispatch:134)       trace: Passing response back to 3ae4066f on node3-rem: <ack function="attrd_ipc_dispatch" line="261" status="112"/> - request id: 4
Dec 16 17:08:17 fastvm-rhel-8-0-24 pacemaker-attrd     [30910] (attrd_peer_update:851)         debug: Setting fail-count-dummy#asyncmon_0 for all hosts to (null)
Dec 16 17:08:17 fastvm-rhel-8-0-24 pacemaker-attrd     [30910] (attrd_peer_update:885)         notice: Setting fail-count-dummy#asyncmon_0[node2]: 1 -> (unset) | from node2
Dec 16 17:08:17 fastvm-rhel-8-0-24 pacemaker-attrd     [30910] (attrd_peer_update:851)         debug: Setting last-failure-dummy#asyncmon_0 for all hosts to (null)
Dec 16 17:08:17 fastvm-rhel-8-0-24 pacemaker-attrd     [30910] (attrd_peer_update:885)         notice: Setting last-failure-dummy#asyncmon_0[node2]: 1608167274 -> (unset) | from node2


This was a crude change for testing purposes, to isolate the issue. I presume there's a reason that this block exists, and the fix would be a bit more complex?

Comment 12 Reid Wahl 2020-12-17 01:20:07 UTC
(In reply to Ken Gaillot from comment #10)
> This triggered a memory: the issue most likely has to do with
> remote_proxy_cb() adding PCMK__XA_ATTR_NODE_NAME to proxied attrd requests.

Yep. I had not refreshed the BZ or noticed an email come in before I hopped back on today and posted an update.

Comment 13 Reid Wahl 2020-12-17 01:41:21 UTC
(In reply to Reid Wahl from comment #11)
> This was a crude change for testing purposes, to isolate the issue. I
> presume there's a reason that this block exists, and the fix would be a bit
> more complex?

I guess we could just check the task, huh?

            if (pcmk__str_eq(type, T_ATTRD, pcmk__str_casei)
                && !pcmk__str_eq(crm_element_value(request, PCMK__XA_TASK),
                                 PCMK__ATTRD_CMD_CLEAR_FAILURE, pcmk__str_casei)
                && crm_element_value(request,
                                     PCMK__XA_ATTR_NODE_NAME) == NULL) {
                crm_xml_add(request, PCMK__XA_ATTR_NODE_NAME, proxy->node_name);
            }

Comment 14 Ken Gaillot 2020-12-21 15:57:19 UTC
(In reply to Reid Wahl from comment #13)
> (In reply to Reid Wahl from comment #11)
> > This was a crude change for testing purposes, to isolate the issue. I
> > presume there's a reason that this block exists, and the fix would be a bit
> > more complex?
> 
> I guess we could just check the task, huh?
> 
>             if (pcmk__str_eq(type, T_ATTRD, pcmk__str_casei)
>                 && !pcmk__str_eq(crm_element_value(request, PCMK__XA_TASK),
>                                  PCMK__ATTRD_CMD_CLEAR_FAILURE, pcmk__str_casei)
>                 && crm_element_value(request,
>                                      PCMK__XA_ATTR_NODE_NAME) == NULL) {
>                 crm_xml_add(request, PCMK__XA_ATTR_NODE_NAME, proxy->node_name);
>             }

You're right, we should be able to add the name only for certain requests (not including PCMK__ATTRD_CMD_CLEAR_FAILURE).

The node name code was added as of 446a100 (PR#767 / 1.1.14), which got attrd_updater without an explicit -N working when run from Pacemaker Remote nodes. Normally, if the node name isn't set in a client request, pacemaker-attrd will consider the request to be for the local node (attrd_client_update()). This meant that attribute requests on remote nodes were actually setting attributes for the cluster node they were attached to. Remote nodes don't know their own node name as known to the cluster, so the fix was to have the proxy (i.e. the controller node attached to the remote connection) set the host for all attrd requests.
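
A minimal sketch of that defaulting behavior, with hypothetical helper names (this is not the pacemaker-attrd implementation; the node names follow the example from comment 8): with no node name in an update request, the attribute lands on whichever node handles the request, which for a proxied request is the cluster node hosting the remote connection rather than the remote node itself.

~~~
#include <stdio.h>

/* Hypothetical helper: which node does an update with this host apply to? */
static const char *
effective_update_host(const char *requested_host, const char *handling_node)
{
    return (requested_host != NULL) ? requested_host : handling_node;
}

int main(void)
{
    /* Update proxied from remote node "node3-rem", handled on "node2" */
    printf("without the proxy adding a name: %s\n",
           effective_update_host(NULL, "node2"));        /* wrong node */
    printf("with the proxy adding the name:  %s\n",
           effective_update_host("node3-rem", "node2")); /* intended node */
    return 0;
}
~~~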
 
PCMK__ATTRD_CMD_CLEAR_FAILURE (ATTRD_OP_CLEAR_FAILURE at the time) was added in 9889d7f + 6191ea52 (1.1.17) as part of the switch to operation-specific fail counts. So this may be a regression since then.

We'll have to look at each of the PCMK__ATTRD_CMD_* requests to see whether there are any others that might be affected. Some are only sent between pacemaker-attrd instances, others probably ignore the node name, so CLEAR_FAILURE may be the only one. In any case, I think we can at least skip adding the node name in that case, since not passing a node name for that request is supposed to mean all nodes and not the local node.

I'm currently targeting 8.5 for the fix due to developer and QA capacity, but if we're lucky we may be able to squeeze it into 8.4.

Comment 15 Reid Wahl 2020-12-22 06:53:49 UTC
Ah, thanks for the explanation :) That makes more sense.

I think we should have been more selective when first adding 446a100. These are our tasks.

# grep -Rh 'define.*PCMK__ATTRD_CMD'
#define PCMK__ATTRD_CMD_PEER_REMOVE     "peer-remove"
#define PCMK__ATTRD_CMD_UPDATE          "update"
#define PCMK__ATTRD_CMD_UPDATE_BOTH     "update-both"
#define PCMK__ATTRD_CMD_UPDATE_DELAY    "update-delay"
#define PCMK__ATTRD_CMD_QUERY           "query"
#define PCMK__ATTRD_CMD_REFRESH         "refresh"
#define PCMK__ATTRD_CMD_FLUSH           "flush"
#define PCMK__ATTRD_CMD_SYNC            "sync"
#define PCMK__ATTRD_CMD_SYNC_RESPONSE   "sync-response"
#define PCMK__ATTRD_CMD_CLEAR_FAILURE   "clear-failure"


PCMK__ATTRD_CMD_PEER_REMOVE: It looks like this can be called with **either** PCMK__XA_ATTR_NODE_NAME **or** PCMK__XA_ATTR_NODE_ID. The node ID gets used if the node name is NULL. So even if this isn't causing an issue now, it seems like bad practice to add a node name automatically if it's NULL.
PCMK__ATTRD_CMD_UPDATE: Without a node name, means **local** node
PCMK__ATTRD_CMD_UPDATE_BOTH: Without a node name, means **local** node
PCMK__ATTRD_CMD_UPDATE_DELAY: Without a node name, means **local** node
PCMK__ATTRD_CMD_QUERY: Without a node name, means **all** nodes
PCMK__ATTRD_CMD_REFRESH: Ignores node name, so it doesn't matter whether we add it or not
PCMK__ATTRD_CMD_FLUSH: Appears to have been removed entirely except for its definition and a mention in a comment block in attrd_commands.c
PCMK__ATTRD_CMD_SYNC: Appears to ignore node name... in fact, appears to ignore its `xml` argument altogether.
PCMK__ATTRD_CMD_SYNC_RESPONSE: Not sure... it does use PCMK__XA_ATTR_NODE_NAME but seems to work a little differently. My initial inclination is that it doesn't need the node name added for remote nodes, but I haven't looked too closely.
PCMK__ATTRD_CMD_CLEAR_FAILURE: Without a node name, means **all** nodes


So I think we need to add PCMK__XA_ATTR_NODE_NAME for PCMK__ATTRD_CMD_{UPDATE,UPDATE_BOTH,UPDATE_DELAY}.
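
In other words, something along the lines of the sketch below in remote_proxy_cb(), a whitelist instead of the blacklist from comment 13. This is only an illustration of the approach, not necessarily the change that was eventually merged upstream.

~~~
            /* Sketch only: add the remote node's name for the update-style
             * commands, where a missing node name means "the local node",
             * and leave clear-failure (and everything else) alone so that a
             * missing node name keeps meaning "all nodes". */
            if (pcmk__str_eq(type, T_ATTRD, pcmk__str_casei)
                && (crm_element_value(request, PCMK__XA_ATTR_NODE_NAME) == NULL)
                && (pcmk__str_eq(crm_element_value(request, PCMK__XA_TASK),
                                 PCMK__ATTRD_CMD_UPDATE, pcmk__str_casei)
                    || pcmk__str_eq(crm_element_value(request, PCMK__XA_TASK),
                                    PCMK__ATTRD_CMD_UPDATE_BOTH, pcmk__str_casei)
                    || pcmk__str_eq(crm_element_value(request, PCMK__XA_TASK),
                                    PCMK__ATTRD_CMD_UPDATE_DELAY, pcmk__str_casei))) {
                crm_xml_add(request, PCMK__XA_ATTR_NODE_NAME, proxy->node_name);
            }
~~~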

Comment 16 Reid Wahl 2020-12-22 08:42:24 UTC
(In reply to Reid Wahl from comment #15)
> PCMK__ATTRD_CMD_SYNC_RESPONSE: Not sure... it does use
> PCMK__XA_ATTR_NODE_NAME but seems to work a little differently. My initial
> inclination is that it doesn't need the node name added for remote nodes,
> but I haven't looked too closely.
> ...
> So I think we need to add PCMK__XA_ATTR_NODE_NAME for
> PCMK__ATTRD_CMD_{UPDATE,UPDATE_BOTH,UPDATE_DELAY}.

I did find this extra piece for SYNC_RESPONSE (below), so maybe it does need the node name added. Still not sure, but it probably doesn't hurt to keep the node name addition for that task.

~~~
void
attrd_peer_update(crm_node_t *peer, xmlNode *xml, const char *host, bool filter)
{
...
    // NULL because PCMK__ATTRD_CMD_SYNC_RESPONSE has no PCMK__XA_TASK
    update_both = pcmk__str_eq(op, PCMK__ATTRD_CMD_UPDATE_BOTH,
                               pcmk__str_null_matches | pcmk__str_casei);
~~~

--------------------

Other minor notes:

It may be worth tweaking this line of the --help output. "instead of the local one" isn't entirely accurate. With that said, I haven't put any thought yet into what it should say.

 -N, --node=value	Set the attribute for the named node (instead of the local one)

-----

The `-Q` and `-A` options appear to do nothing.

[root@fastvm-rhel-8-0-23 pacemaker]# attrd_updater -n real_attr
name="real_attr" host="node2" value="value2"
name="real_attr" host="node1" value="value"
[root@fastvm-rhel-8-0-23 pacemaker]# attrd_updater -n real_attr -Q
name="real_attr" host="node2" value="value2"
name="real_attr" host="node1" value="value"
[root@fastvm-rhel-8-0-23 pacemaker]# attrd_updater -n real_attr -Q -A
name="real_attr" host="node2" value="value2"
name="real_attr" host="node1" value="value"

Comment 17 Ken Gaillot 2020-12-22 16:21:54 UTC
(In reply to Reid Wahl from comment #15)
> PCMK__ATTRD_CMD_PEER_REMOVE: It looks like this can be called with
> **either** PCMK__XA_ATTR_NODE_NAME **or** PCMK__XA_ATTR_NODE_ID. The node ID
> gets used if the node name is NULL. So even if this isn't causing an issue
> now, it seems like bad practice to add a node name automatically if it's
> NULL.

Correct. That one is sent by crm_node rather than attrd_updater, and requires the node name to be specified (no default), so we can safely stop adding node name in the proxy for it.

> PCMK__ATTRD_CMD_SYNC_RESPONSE: Not sure... it does use
> PCMK__XA_ATTR_NODE_NAME but seems to work a little differently. My initial
> inclination is that it doesn't need the node name added for remote nodes,
> but I haven't looked too closely.

We don't need to worry about this one because it can only be sent between pacemaker-attrd instances (meaning it will never be proxied).

> So I think we need to add PCMK__XA_ATTR_NODE_NAME for
> PCMK__ATTRD_CMD_{UPDATE,UPDATE_BOTH,UPDATE_DELAY}.

That looks correct. (As an irrelevant aside, I made a mistake by making UPDATE_BOTH/UPDATE_DELAY separate tasks -- they should have been XML attributes inside UPDATE requests. That would have made this even simpler ...)

> The `-Q` and `-A` options appear to do nothing.

'Q' is the default command, so it can be specified or not. Interestingly, all '-A' does is ignore any '-N' option given, so it's the same as just not specifying -N.

Comment 18 Ken Gaillot 2021-01-05 15:41:37 UTC
Fixed upstream as of commit c3e2edb7

Comment 23 Markéta Smazová 2021-02-05 18:50:32 UTC
before fix
------------

>   [root@virt-064 ~]# rpm -q pacemaker
>   pacemaker-2.0.4-6.el8.x86_64


Configure a cluster with a pacemaker remote node and some resources:

>   [root@virt-064 ~]# pcs status --full
>   Cluster name: STSRHTS24792
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-064 (3) (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
>     * Last updated: Fri Feb  5 18:50:28 2021
>     * Last change:  Fri Feb  5 18:42:05 2021 by root via cibadmin on virt-064
>     * 3 nodes configured
>     * 6 resource instances configured

>   Node List:
>     * Online: [ virt-063 (2) virt-064 (3) ]
>     * RemoteOnline: [ virt-062 ]

>   Full List of Resources:
>     * fence-virt-062	(stonith:fence_xvm):	 Started virt-064
>     * fence-virt-063	(stonith:fence_xvm):	 Started virt-064
>     * fence-virt-064	(stonith:fence_xvm):	 Started virt-063
>     * virt-062	(ocf::pacemaker:remote):	 Started virt-063
>     * dummy1	(ocf::pacemaker:Dummy):	 Started virt-063
>     * dummy2	(ocf::pacemaker:Dummy):	 Started virt-064

>   Migration Summary:

>   Tickets:

>   PCSD Status:
>     virt-063: Online
>     virt-064: Online

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled

Fail resources:

>   [root@virt-064 ~]# crm_resource --fail --node virt-063 --resource dummy1
>   Waiting for 1 reply from the controller. OK

>   [root@virt-064 ~]# crm_resource --fail --node virt-064 --resource dummy2
>   Waiting for 1 reply from the controller. OK


>   [root@virt-064 ~]# pcs status --full
>   Cluster name: STSRHTS24792
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-064 (3) (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
>     * Last updated: Fri Feb  5 18:50:50 2021
>     * Last change:  Fri Feb  5 18:42:05 2021 by root via cibadmin on virt-064
>     * 3 nodes configured
>     * 6 resource instances configured

>   Node List:
>     * Online: [ virt-063 (2) virt-064 (3) ]
>     * RemoteOnline: [ virt-062 ]

>   Full List of Resources:
>     * fence-virt-062	(stonith:fence_xvm):	 Started virt-064
>     * fence-virt-063	(stonith:fence_xvm):	 Started virt-064
>     * fence-virt-064	(stonith:fence_xvm):	 Started virt-063
>     * virt-062	(ocf::pacemaker:remote):	 Started virt-063
>     * dummy1	(ocf::pacemaker:Dummy):	 Started virt-063
>     * dummy2	(ocf::pacemaker:Dummy):	 Started virt-064

>   Migration Summary:
>     * Node: virt-064 (3):
>       * dummy2: migration-threshold=1000000 fail-count=2 last-failure='Fri Feb  5 18:50:39 2021'
>     * Node: virt-063 (2):
>       * dummy1: migration-threshold=1000000 fail-count=2 last-failure='Fri Feb  5 18:50:29 2021'

>   Failed Resource Actions:
>     * dummy2_asyncmon_0 on virt-064 'error' (1): call=72, status='complete', exitreason='Simulated failure', last-rc-change='2021-02-05 18:50:39 +01:00', queued=0ms, exec=0ms
>     * dummy1_asyncmon_0 on virt-063 'error' (1): call=375, status='complete', exitreason='Simulated failure', last-rc-change='2021-02-05 18:50:29 +01:00', queued=0ms, exec=0ms

>   Tickets:

>   PCSD Status:
>     virt-063: Online
>     virt-064: Online

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled


Run `pcs resource cleanup` from pacemaker remote node "virt-062":

>   [root@virt-062 ~]# pcs resource cleanup
>   Cleaned up all resources on all nodes
>   Waiting for 2 replies from the controller.. OK

Check that failures were cleaned up on all nodes:

>   [root@virt-062 ~]# pcs status --full
>   Cluster name: STSRHTS24792
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-064 (3) (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
>     * Last updated: Fri Feb  5 18:51:04 2021
>     * Last change:  Fri Feb  5 18:50:57 2021 by hacluster via crmd on virt-064
>     * 3 nodes configured
>     * 6 resource instances configured

>   Node List:
>     * Online: [ virt-063 (2) virt-064 (3) ]
>     * RemoteOnline: [ virt-062 ]

>   Full List of Resources:
>     * fence-virt-062	(stonith:fence_xvm):	 Started virt-064
>     * fence-virt-063	(stonith:fence_xvm):	 Started virt-064
>     * fence-virt-064	(stonith:fence_xvm):	 Started virt-063
>     * virt-062	(ocf::pacemaker:remote):	 Started virt-063
>     * dummy1	(ocf::pacemaker:Dummy):	 Started virt-063
>     * dummy2	(ocf::pacemaker:Dummy):	 Started virt-064

>   Migration Summary:
>     * Node: virt-064 (3):
>       * dummy2: migration-threshold=1000000 fail-count=2 last-failure='Fri Feb  5 18:50:39 2021'
>     * Node: virt-063 (2):
>       * dummy1: migration-threshold=1000000 fail-count=2 last-failure='Fri Feb  5 18:50:29 2021'

>   Tickets:

>   Daemon Status:
>     corosync: inactive/disabled
>     pacemaker: inactive/disabled
>     pacemaker_remote: active/enabled
>     pcsd: active/enabled

Failures were cleaned up from "Failed Resource Actions", but not from "Migration Summary" (fail counts).



after fix
-----------

>   [root@virt-038 ~]# rpm -q pacemaker
>   pacemaker-2.0.5-6.el8.x86_64

Configure a cluster with a pacemaker remote node and some resources:

>   [root@virt-038 ~]# pcs status --full
>   Cluster name: STSRHTS27463
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-038 (2) (version 2.0.5-6.el8-ba59be7122) - partition with quorum
>     * Last updated: Fri Feb  5 18:11:26 2021
>     * Last change:  Fri Feb  5 18:11:01 2021 by hacluster via crmd on virt-038
>     * 3 nodes configured
>     * 6 resource instances configured

>   Node List:
>     * Online: [ virt-038 (2) virt-065 (3) ]
>     * RemoteOnline: [ virt-034 ]

>   Full List of Resources:
>     * fence-virt-034	(stonith:fence_xvm):	 Started virt-065
>     * fence-virt-038	(stonith:fence_xvm):	 Started virt-065
>     * fence-virt-065	(stonith:fence_xvm):	 Started virt-038
>     * virt-034	(ocf::pacemaker:remote):	 Started virt-038
>     * dummy1	(ocf::pacemaker:Dummy):	 Started virt-038
>     * dummy2	(ocf::pacemaker:Dummy):	 Started virt-065

>   Migration Summary:

>   Tickets:

>   PCSD Status:
>     virt-038: Online
>     virt-065: Online

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled

Fail resources:

>   [root@virt-038 ~]# crm_resource --fail --node virt-038 --resource dummy1
>   Waiting for 1 reply from the controller
>   ... got reply (done)

>   [root@virt-038 ~]# crm_resource --fail --node virt-065 --resource dummy2
>   Waiting for 1 reply from the controller
>   ... got reply (done)


>   [root@virt-038 ~]# pcs status --full
>   Cluster name: STSRHTS27463
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-038 (2) (version 2.0.5-6.el8-ba59be7122) - partition with quorum
>     * Last updated: Fri Feb  5 18:11:48 2021
>     * Last change:  Fri Feb  5 18:11:01 2021 by hacluster via crmd on virt-038
>     * 3 nodes configured
>     * 6 resource instances configured

>   Node List:
>     * Online: [ virt-038 (2) virt-065 (3) ]
>     * RemoteOnline: [ virt-034 ]

>   Full List of Resources:
>     * fence-virt-034	(stonith:fence_xvm):	 Started virt-065
>     * fence-virt-038	(stonith:fence_xvm):	 Started virt-065
>     * fence-virt-065	(stonith:fence_xvm):	 Started virt-038
>     * virt-034	(ocf::pacemaker:remote):	 Started virt-038
>     * dummy1	(ocf::pacemaker:Dummy):	 Started virt-038
>     * dummy2	(ocf::pacemaker:Dummy):	 Started virt-065

>   Migration Summary:
>     * Node: virt-038 (2):
>       * dummy1: migration-threshold=1000000 fail-count=1 last-failure='Fri Feb  5 18:11:27 2021'
>     * Node: virt-065 (3):
>       * dummy2: migration-threshold=1000000 fail-count=2 last-failure='Fri Feb  5 18:11:37 2021'

>   Failed Resource Actions:
>     * dummy1_asyncmon_0 on virt-038 'error' (1): call=385, status='complete', exitreason='Simulated failure', last-rc-change='2021-02-05 18:11:27 +01:00', queued=0ms, exec=0ms
>     * dummy2_asyncmon_0 on virt-065 'error' (1): call=72, status='complete', exitreason='Simulated failure', last-rc-change='2021-02-05 18:11:37 +01:00', queued=0ms, exec=0ms

>   Tickets:

>   PCSD Status:
>     virt-038: Online
>     virt-065: Online

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled


Run `pcs resource cleanup` from pacemaker remote node "virt-034":

>   [root@virt-034 ~]# pcs resource cleanup
>   Cleaned up all resources on all nodes
>   Waiting for 2 replies from the controller
>   ... got reply
>   ... got reply (done)

Check that failures were cleaned up on all nodes:

>   [root@virt-034 ~]# pcs status --full
>   Cluster name: STSRHTS27463
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-038 (2) (version 2.0.5-6.el8-ba59be7122) - partition with quorum
>     * Last updated: Fri Feb  5 18:12:09 2021
>     * Last change:  Fri Feb  5 18:12:02 2021 by hacluster via crmd on virt-038
>     * 3 nodes configured
>     * 6 resource instances configured

>   Node List:
>     * Online: [ virt-038 (2) virt-065 (3) ]
>     * RemoteOnline: [ virt-034 ]

>   Full List of Resources:
>     * fence-virt-034	(stonith:fence_xvm):	 Started virt-065
>     * fence-virt-038	(stonith:fence_xvm):	 Started virt-065
>     * fence-virt-065	(stonith:fence_xvm):	 Started virt-038
>     * virt-034	(ocf::pacemaker:remote):	 Started virt-038
>     * dummy1	(ocf::pacemaker:Dummy):	 Started virt-038
>     * dummy2	(ocf::pacemaker:Dummy):	 Started virt-065

>   Migration Summary:

>   Tickets:

>   Daemon Status:
>     corosync: inactive/disabled
>     pacemaker: inactive/disabled
>     pacemaker_remote: active/enabled
>     pcsd: active/enabled


Failures were cleaned up from "Failed Resource Actions" and from "Migration Summary" (fail counts).


marking verified in pacemaker-2.0.5-6.el8

Comment 25 errata-xmlrpc 2021-05-18 15:26:40 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782


