1628701 – [RFE] Allow resource/operation defaults to be defined for particular resource/operation types

Bug 1628701 - [RFE] Allow resource/operation defaults to be defined for particular resource/operation types

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	pre-dev-freeze
Target Release:	8.3
Assignee:	Chris Lumens
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1579213
Depends On:
Blocks:	1579221 1817547

Reported:	2018-09-13 18:23 UTC by Shane Bradley
Modified:	2023-09-07 19:23 UTC
CC List:	13 users
Fixed In Version:	pacemaker-2.0.4-3.el8
Doc Type:	No Doc Update
Doc Text:	Any corresponding pcs functionality should be documented instead.
Clone Of:
Environment:
Last Closed:	2020-11-04 04:00:53 UTC
Type:	Feature Request
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3560591	0	None	None	None	2020-03-26 15:05:42 UTC
Red Hat Knowledge Base (Solution)	3613101	0	None	None	None	2018-09-13 18:30:43 UTC

Description Shane Bradley 2018-09-13 18:23:54 UTC

Description of problem:
We need a way to configure attributes (such as monitor/stop/start timeouts) for a resource within a bundle.

Currently there is no way to change the monitor/stop/start timeouts on a <docker> contained in a bundle:
      <bundle id="galera-bundle">
        <docker image="192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest" masters="3"
                network="host" 
                options="--user=root --log-driver=journald -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS" 
                replicas="3" run-command="/bin/bash /usr/local/bin/kolla_start"/>
        <network control-port="3123"/>
        <storage>

A customer was having issues because the <docker> resource takes longer than the default timeout to stop, and they need a way to increase that timeout. There are no command options to do this.

We need a way for pcs to allow the manipulation of instance attributes, meta attributes, and op attributes for the implicit docker resources as part of a bundle.

Version-Release number of selected component (if applicable):
pcs-0.9.162-5.el7_5.1.x86_64

How reproducible:
Every time

Steps to Reproduce:
Set up one of the bundles described in our documentation.

Actual results:
The <docker> resource continues to hit timeouts when it is stopped.

Expected results:
The ability to modify the <docker> resource's timeout attribute for operations. 

Additional info:

- When no timeout is explicitly configured, the default operation timeout for an OCF resource is 20000 ms, so this is the timeout applied to the docker containers' stop and monitor operations, as we observed in the logs. The operation timeouts configured for the primitive do not apply to the <docker> resource within the bundle.
- The only way to override the default 20000 ms operation timeout for <docker> elements is to set a resource operation default timeout, which applies to every resource operation whose timeout is not explicitly configured. That is not an ideal way to modify timeouts on the <docker> resource within the bundle, and it only applies to resources created after the timeout value is changed, not to existing ones.
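
The lookup order described in the two points above can be sketched as follows. This is a simplified model for illustration only, not Pacemaker code; the function name and its parameters are hypothetical, and only the 20000 ms built-in default comes from this report.

```python
# Hypothetical model of the timeout lookup order described in this report.
# Not Pacemaker code: the names below are made up for illustration.
BUILTIN_DEFAULT_TIMEOUT_MS = 20000  # default used when nothing is configured


def effective_timeout_ms(op_timeout_ms=None, op_defaults_timeout_ms=None):
    """Return the timeout an operation runs with under this simplified model."""
    if op_timeout_ms is not None:
        # A timeout configured on the operation itself always wins.
        return op_timeout_ms
    if op_defaults_timeout_ms is not None:
        # Otherwise the global operation default applies.
        return op_defaults_timeout_ms
    # With neither configured, the built-in default is used.
    return BUILTIN_DEFAULT_TIMEOUT_MS


# The implicit <docker> resource has no configurable operations, so only the
# global default or the built-in default can ever apply to it.
implicit_docker_stop = effective_timeout_ms()
implicit_with_global_default = effective_timeout_ms(op_defaults_timeout_ms=90000)
explicit_primitive_stop = effective_timeout_ms(
    op_timeout_ms=120000, op_defaults_timeout_ms=90000
)
```

In this model the implicit resource can only be influenced through the global default, which is the limitation this request is about.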

Comment 1 Tomas Jelinek 2018-09-14 07:55:30 UTC

As you pointed out, the docker resource is created by pacemaker behind the scenes. There is no docker resource in the CIB and therefore, as far as I know, it is impossible for pcs to set its instance, meta and utilization attributes as well as operations.

It is possible to set bundle meta attributes with 'pcs resource bundle create ... meta ...' and 'pcs resource bundle update ... meta ...'. These are inherited by the implicit resources that pacemaker creates for the bundle.

As I understand bundles, not having the docker (or other container), pacemaker remote, ip and other resources in the CIB is the whole point of bundles: to make the configuration easier for the users by not requiring them to set, configure and view a bunch of resources and instead creating them in pacemaker behind the scenes.

If they really want to configure all of these in detail, then they can create and manage all the resources themselves instead of using bundles.

Moving to pacemaker for further discussion.

Comment 2 Ken Gaillot 2018-09-14 23:14:28 UTC

I wouldn't want to give direct access to the implicit resources' XML, but I could see having new options for the values that could reasonably be changed.

One option would be to use any timeouts specified on the bundled resource for the implicit resources. I'm leaning against this, because the time a bundled resource takes to do something is not related to the time it takes to launch a container or connect to Pacemaker Remote, and because bundles can be specified without a resource (to simply manage a container as a "black box").

The other option would be to provide new syntax to set the container operation timeouts. I could see one of these two approaches:

<docker ... container-start-timeout="50s" container-monitor-timeout="30s">

or

<docker ...>
   <operations>
      <op id="docker-monitor" interval="60s" name="monitor" timeout="30s"/>
      <op id="docker-start" interval="0" name="start" timeout="50s"/>
   </operations>
</docker>

Comment 3 Andrew Beekhof 2019-12-05 00:26:24 UTC

I can't see this being a per-bundle kind of change.
Can we have a top-level cluster property that sets the timeout for all bundle operations?

Comment 5 Ken Gaillot 2019-12-05 18:25:57 UTC

Are we concerned only with a single timeout to be used for all implicit container resource ops? If so, then a new container-timeout option makes sense, whether as a cluster-wide property, or a per-bundle meta-attribute that could be set once in rsc_defaults. Pro: easy to understand and configure.

Or do we want control over all operation properties (most importantly timeout and on-fail), potentially per implicit resource type (container, IPaddr2, remote connection) and action type (start, stop, monitor)? Pros: at least in my experience, container stops take much longer than start/monitor; the user could configure a shorter or longer monitor interval, or set on-fail to fence or standby, if they want; it's possible (if unlikely) for a user to configure both docker and podman bundles, which have different timings. Possible syntax:

* <operations> blocks under the <docker>, <podman>, and <network> elements (leaving out remote connections at first, though we could add a separate <remote> section just to hold remote ops if needed later). Pro: consistent with how explicit resources are configured. Con: has to be done for every bundle.

* new global <implicit_ops> section that takes <docker>, <podman>, <network>, and maybe <remote>. Pros: configured once for all bundles; could potentially apply to guest nodes' implicit remote connections as well. Con: less intuitive. Example:

    <implicit_ops>
      <podman>
        <meta_attributes id="implicit-podman-ops">
          <op id="implicit-podman-stop" name="stop" interval="0s" timeout="90s"/>
        </meta_attributes>
      </podman>
    </implicit_ops>

Comment 6 Andrew Beekhof 2019-12-06 05:07:30 UTC

This isn't something we need to control on a bundle-by-bundle basis.
All podman resources are horribly slow because of podman; it's nothing to do with the container being managed or the fact it's a part of a bundle.

Let's not overcomplicate this.

Comment 7 Ken Gaillot 2019-12-06 16:09:57 UTC

(In reply to Andrew Beekhof from comment #6)
> This isn't something we need to control on a bundle-by-bundle basis.
> All podman resources are horribly slow because of podman; it's nothing to do
> with the container being managed or the fact it's a part of a bundle.
> 
> Let's not overcomplicate this.

Right, but at least start vs stop vs monitor can have very different timings. I also want to avoid the situation where we add timeout today, then tomorrow someone wants on-fail and it has to be hacked into the design. What do you think of the <implicit_ops> option in Comment 5?

Comment 8 Andrew Beekhof 2019-12-10 03:56:42 UTC

(In reply to Ken Gaillot from comment #7)
> Right, but at least start vs stop vs monitor can have very different
> timings. 

They're all going to have to be set so high there is little benefit in setting them separately.

> I also want to avoid the situation where we add timeout today, then
> tomorrow someone wants on-fail and it has to be hacked into the design. 

Good point

> What do you think of the <implicit_ops> option in Comment 5?

The very fact that they're implicit means that the admin may not even know whether to set them for podman vs. docker vs. whatever.
Perhaps s/podman/bundle/

Or allow op_defaults to be scoped to a specific agent.

Comment 9 Ken Gaillot 2019-12-11 19:18:25 UTC

(In reply to Andrew Beekhof from comment #8)
> (In reply to Ken Gaillot from comment #7)
> > Right, but at least start vs stop vs monitor can have very different
> > timings. 
> 
> They're all going to have to be set so high there is little benefit in
> setting them separately.
>
> > I also want to avoid the situation where we add timeout today, then
> > tomorrow someone wants on-fail and it has to be hacked into the design. 
> 
> Good point
> 
> > What do you think of the <implicit_ops> option in Comment 5?
> 
> The very fact that they're implicit means that the admin may not even know
> whether to set them for podman vs. docker vs. whatever.
> Perhaps s/podman/bundle/
> 
> Or allow op_defaults to be scoped to a specific agent.

Genius :)

We've also had requests to scope op_defaults to a specific action. I'm thinking we could add a new rule expression type relevant only within rsc_defaults and op_defaults, something like:

    <op_defaults>
      <meta_attributes id="op_defaults-meta">
        <rule id="op_defaults-rule" score="INFINITY">
         <rsc_expression id="op_defaults-rule-Dummy" class="ocf" provider="pacemaker" type="Dummy"/>
         <op_expression id="op_defaults-rule-start" name="start" interval="0" />
        </rule>
        <nvpair id="op_defaults-Dummy-start-timeout" name="timeout" value="90s"/>
      </meta_attributes>
    </op_defaults>

That would require adding resource/operation info to the rule APIs, but I believe we do have the necessary info everywhere we'd need it.
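
To make the proposed matching concrete, here is a minimal sketch of how such a rule could be evaluated for a given resource and operation. It is not Pacemaker's rule evaluator: the helper names are hypothetical, it assumes every expression in a rule must match, it ignores the rule score, and it only understands intervals written as plain seconds ("0", "0s", "90s").

```python
import xml.etree.ElementTree as ET

# The op_defaults example from this comment, reproduced as input data.
OP_DEFAULTS_XML = """
<op_defaults>
  <meta_attributes id="op_defaults-meta">
    <rule id="op_defaults-rule" score="INFINITY">
      <rsc_expression id="op_defaults-rule-Dummy" class="ocf" provider="pacemaker" type="Dummy"/>
      <op_expression id="op_defaults-rule-start" name="start" interval="0"/>
    </rule>
    <nvpair id="op_defaults-Dummy-start-timeout" name="timeout" value="90s"/>
  </meta_attributes>
</op_defaults>
"""


def seconds(text):
    """Parse '0', '0s' or '90s' into whole seconds (other units are out of scope)."""
    return int(text[:-1] if text.endswith("s") else text)


def expression_matches(expr, rsc, op):
    """Check one hypothetical rsc_expression or op_expression against a resource/op."""
    if expr.tag == "rsc_expression":
        # Attributes missing from the expression are treated as wildcards.
        return all(
            expr.get(key) in (None, rsc.get(key))
            for key in ("class", "provider", "type")
        )
    if expr.tag == "op_expression":
        if expr.get("name") != op["name"]:
            return False
        interval = expr.get("interval")
        return interval is None or seconds(interval) == seconds(op["interval"])
    return False  # unknown expression types never match in this sketch


def matching_defaults(xml_text, rsc, op):
    """Collect nvpairs from every meta_attributes block whose rule matches."""
    root = ET.fromstring(xml_text)
    result = {}
    for block in root.findall("meta_attributes"):
        rule = block.find("rule")
        # Assumption: all expressions in the rule must match (logical AND).
        if rule is None or all(expression_matches(e, rsc, op) for e in rule):
            for nvpair in block.findall("nvpair"):
                result.setdefault(nvpair.get("name"), nvpair.get("value"))
    return result


dummy = {"class": "ocf", "provider": "pacemaker", "type": "Dummy"}
other = {"class": "ocf", "provider": "heartbeat", "type": "IPaddr2"}

start_defaults = matching_defaults(OP_DEFAULTS_XML, dummy, {"name": "start", "interval": "0s"})
stop_defaults = matching_defaults(OP_DEFAULTS_XML, dummy, {"name": "stop", "interval": "0s"})
other_defaults = matching_defaults(OP_DEFAULTS_XML, other, {"name": "start", "interval": "0s"})
```

With the example above, only the start operation of an ocf:pacemaker:Dummy resource picks up the 90s timeout; other operations and other agents fall through to whatever defaults would otherwise apply.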

Comment 10 Ken Gaillot 2020-03-26 15:05:42 UTC

*** Bug 1579213 has been marked as a duplicate of this bug. ***

Comment 11 Ken Gaillot 2020-03-26 15:16:28 UTC

The corresponding pcs functionality is tracked as Bug 1817547.

Comment 12 Ken Gaillot 2020-05-20 17:26:27 UTC

This has been fixed in the upstream master branch via https://github.com/ClusterLabs/pacemaker/pull/2045

Comment 24 Patrik Hagara 2020-09-22 10:47:15 UTC

Moving to VERIFIED based on comment #22 and https://bugzilla.redhat.com/show_bug.cgi?id=1817547#c32

Comment 27 errata-xmlrpc 2020-11-04 04:00:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4804
