Bug 1619265 - As a single entrypoint towards pacemaker, pcs should smartly coalesce all "--wait" synchronous hooking
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-20 13:18 UTC by Jan Pokorný [poki]
Modified: 2021-02-15 07:41 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-15 07:41:45 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1620081 0 medium CLOSED Allow for explicit "sequence points" in pcs based scripting, e.g. by the means of "pcs cluster settle cib" 2024-02-16 16:33:27 UTC

Internal Links: 1620081

Description Jan Pokorný [poki] 2018-08-20 13:18:11 UTC
If one for whatever reason issues command like this:

  for i in {1..60}; do pcs resource enable clusterfs$i --wait=100 & done

it will generate 60 concurrent connections to pacemaker, which appears
to be harmful because of the resulting utilization spike.  While this
may be dealt with directly on the pacemaker side, it makes sense for
pcs to leverage the fact that it is the singleton interface present on
the system, so it can isolate pacemaker from redundant synchronous
"settlement" waiting by coalescing any parallel ones.

One of the ideas to achieve that would be to mark any such initiated
waiting in volatile per-node local state storage (for instance
in-memory tracking with a central daemon based architecture, plain
text /var/run/pcs/* files otherwise).  When a new synchronous, waiting
request is started, it is then hooked either directly into pacemaker
or onto the hook that already exists, depending on whether there is
any active settlement waiting already (preferably accompanied with a
brief sleep and a re-check that the existing wait is still pending, to
guard against race conditions).  False negatives are acceptable, i.e.
it is OK when, in doubtful situations, two hooks are mounted to
pacemaker in parallel, but it is bad when any synchronous waiting gets
lost.
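
For illustration only, a minimal sketch of that coalescing idea in
shell, assuming a lock file under /var/run/pcs (not an existing pcs
interface) and using "crm_resource --wait" purely as a stand-in for
the real synchronous wait towards pacemaker:

  #!/bin/sh
  # hypothetical sketch: coalesce parallel "--wait" requests via a lock
  # file so that only the first waiter actually hooks into pacemaker,
  # while the others merely wait for that hook to finish
  mkdir -p /var/run/pcs
  exec 9>/var/run/pcs/settle-wait.lock
  if flock -n 9; then
      # no active settlement waiting yet: become the single hook
      crm_resource --wait
  else
      # someone is already waiting; just block (up to 100 seconds)
      # until their wait finishes, without opening another pacemaker
      # connection
      flock -w 100 9
  fi
  exec 9>&-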

Comment 2 Tomas Jelinek 2018-08-21 09:56:55 UTC
I agree the described behavior exists and may cause problems. Have commands such as the one described been confirmed to be used by anyone? Moreover, for the command used in the example there is a workaround, 'pcs resource enable clusterfs{1..60} --wait=100', which, apart from reducing the number of pacemaker connections, also greatly reduces the load on pcs and pacemaker (the CIB would be read and written once instead of 60 times). Assigning low priority for now.
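
For reference, the two invocations compared above, side by side (the
resource names are just the ones from the example):

  # 60 pcs processes in parallel: 60 pacemaker connections, the CIB
  # read and written 60 times
  for i in {1..60}; do pcs resource enable clusterfs$i --wait=100 & done

  # workaround: a single pcs process and connection, the CIB read
  # and written once
  pcs resource enable clusterfs{1..60} --wait=100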

Comment 3 Jan Pokorný [poki] 2018-08-21 16:09:24 UTC
re [comment 2]:

> Have commands such as the one described been confirmed to be used by
> anyone?

Doubting use cases that the documentation does not otherwise restrict
is not helpful.

Yes, this was an invented speed booster for a rather massive set of
operations otherwise indeed preferred as an atomic change, allegedly
(much) faster than:

> Moreover for the command used in the example there is
> a workaround 'pcs resource enable clusterfs{1..60} --wait=100' which
> apart from reducing number of pacemaker connections also greatly
> reduces load on pcs and pacemaker (cib would be read and written once
> instead of 60 times).

If there are multiple ways to achieve the same thing with various
trade-offs (e.g. speed vs. load, like here), people will always be
creative, and a robust program should compensate for the cons whenever
possible, which is what is asked for here.

Comment 5 Jan Pokorný [poki] 2018-08-22 08:25:30 UTC
Also note that it may be in the web UI's interest to asynchronously
-- based on the synchronous pcs CLI equivalent wait under the hood --
inform the user that some operation has finished, successfully or not,
and it is not hard to see that multiple operations can be triggered
concurrently, just as with the provided example, yielding the same
negative symptoms.

Comment 6 Jan Pokorný [poki] 2018-08-22 12:34:15 UTC
See also a possible syntax meant to decrease the number of parallel
"waiting for pacemaker to settle" requests and hence also proactively
prevent people from inventing their own wait-involved tricks:
[bug 1620081]

Comment 7 Robert Peterson 2019-10-22 12:30:29 UTC
I've been using this format:
pcs resource enable bobsrhel8{0..59} --wait=100
and it works beautifully (well, except for a myriad of unrelated
problems, especially with the Filesystem resource agent).

I haven't tried the original format for a long time.
I think this bz can probably be closed, as the original command is
similar to fork-bombing.

Comment 10 RHEL Program Management 2021-02-15 07:41:45 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

