Bug 1628701
| Summary: | [RFE] Allow resource/operation defaults to be defined for particular resource/operation types | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Shane Bradley <sbradley> |
| Component: | pacemaker | Assignee: | Chris Lumens <clumens> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 8.0 | CC: | cfeist, clumens, cluster-maint, dciabrin, idevat, kgaillot, michele, nwahl, omular, phagara, pkomarov, sbradley, tojeline |
| Target Milestone: | pre-dev-freeze | Keywords: | FutureFeature |
| Target Release: | 8.3 | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | pacemaker-2.0.4-3.el8 | Doc Type: | No Doc Update |
| Doc Text: | Any corresponding pcs functionality should be documented instead. | Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-11-04 04:00:53 UTC | Type: | Feature Request |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1579221, 1817547 | ||
As you pointed out, the docker resource is created by pacemaker behind the scenes. There is no docker resource in the CIB and therefore, as far as I know, it is impossible for pcs to set its instance, meta and utilization attributes as well as operations. It is possible to set a bundle's meta attributes with 'pcs resource bundle create ... meta ...' and 'pcs resource bundle update ... meta ...'. These are inherited by the implicit resources that pacemaker creates for the bundle.

As I understand bundles, not having the docker (or other container), pacemaker remote, ip and other resources in the CIB is the whole point of bundles: to make configuration easier for users by not requiring them to set, configure and view a bunch of resources, and instead creating them in pacemaker behind the scenes. If users really want to configure all of these in detail, they can create and manage all the resources themselves instead of using bundles.

Moving to pacemaker for further discussion.

I wouldn't want to give direct access to the implicit resources' XML, but I could see having new options for the values that could reasonably be changed.
One option would be to use any timeouts specified on the bundled resource for the implicit resources. I'm leaning against this, because the time a bundled resource takes to do something is not related to the time it takes to launch a container or connect to Pacemaker Remote, and because bundles can be specified without a resource (to simply manage a container as a "black box").
The other option would be to provide new syntax to set the container operation timeouts. I could see one of these two approaches:
<docker ... container-start-timeout="50s" container-monitor-timeout="30s">
or
<docker ...>
  <operations>
    <op id="docker-monitor" interval="60s" name="monitor" timeout="30s"/>
    <op id="docker-start" interval="0" name="start" timeout="50s"/>
  </operations>
</docker>
I can't see this being a per-bundle kind of change. Can we have a top-level cluster property that sets the timeout for all bundle operations?

Are we concerned only with a single timeout to be used for all implicit container resource ops? If so, then a new container-timeout option makes sense, whether as a cluster-wide property, or a per-bundle meta-attribute that could be set once in rsc_defaults. Pro: easy to understand and configure.
Or do we want control over all operation properties (most importantly timeout and on-fail), potentially per implicit resource type (container, IPaddr2, remote connection) and action type (start, stop, monitor)? Pros: at least in my experience, container stops take much longer than start/monitor; the user could configure a shorter or longer monitor interval, or set on-fail to fence or standby, if they want; it's possible (if unlikely) for a user to configure both docker and podman bundles, which have different timings. Possible syntax:
* <operations> blocks under the <docker>, <podman>, and <network> elements (leaving out remote connections at first, though we could add a separate <remote> section just to hold remote ops if needed later). Pro: consistent with how explicit resources are configured. Con: has to be done for every bundle.
* new global <implicit_ops> section that takes <docker>, <podman>, <network>, and maybe <remote>. Pros: configured once for all bundles; could potentially apply to guest nodes' implicit remote connections as well. Con: less intuitive. Example:
<implicit_ops>
  <podman>
    <meta_attributes id="implicit-podman-ops">
      <op id="implicit-podman-stop" name="stop" interval="0s" timeout="90s"/>
    </meta_attributes>
  </podman>
</implicit_ops>
This isn't something we need to control on a bundle-by-bundle basis. All podman resources are horribly slow because of podman; it's nothing to do with the container being managed or the fact it's part of a bundle. Let's not overcomplicate this.

(In reply to Andrew Beekhof from comment #6)
> This isn't something we need to control on a bundle-by-bundle basis.
> All podman resources are horribly slow because of podman; it's nothing to do
> with the container being managed or the fact it's part of a bundle.
>
> Let's not overcomplicate this.

Right, but at least start vs stop vs monitor can have very different timings. I also want to avoid the situation where we add timeout today, then tomorrow someone wants on-fail and it has to be hacked into the design. What do you think of the <implicit_ops> option in Comment 5?

(In reply to Ken Gaillot from comment #7)
> Right, but at least start vs stop vs monitor can have very different
> timings.

They're all going to have to be set so high there is little benefit in setting them separately.

> I also want to avoid the situation where we add timeout today, then
> tomorrow someone wants on-fail and it has to be hacked into the design.

Good point.

> What do you think of the <implicit_ops> option in Comment 5?

The very fact that they're implicit means that the admin may not even know whether to set them for podman vs. docker vs. whatever.
Perhaps s/podman/bundle/
Or allow op_defaults to be scoped to a specific agent.

(In reply to Andrew Beekhof from comment #8)
> The very fact that they're implicit means that the admin may not even know
> whether to set them for podman vs. docker vs. whatever.
> Perhaps s/podman/bundle/
>
> Or allow op_defaults to be scoped to a specific agent.

Genius :)

We've also had requests to scope op_defaults to a specific action. I'm thinking we could add a new rule expression type relevant only within rsc_defaults and op_defaults, something like:

<op_defaults>
  <meta_attributes id="op_defaults-meta">
    <rule id="op_defaults-rule" score="INFINITY">
      <rsc_expression id="op_defaults-rule-Dummy" class="ocf" provider="pacemaker" type="Dummy"/>
      <op_expression id="op_defaults-rule-start" name="start" interval="0"/>
    </rule>
    <nvpair id="op_defaults-Dummy-start-timeout" name="timeout" value="90s"/>
  </meta_attributes>
</op_defaults>

That would require adding resource/operation info to the rule APIs, but I believe we do have the necessary info everywhere we'd need it.

*** Bug 1579213 has been marked as a duplicate of this bug. ***

The corresponding pcs functionality is tracked as Bug 1817547.

This has been fixed in the upstream master branch via https://github.com/ClusterLabs/pacemaker/pull/2045

Moving to verified based on comment #22 and https://bugzilla.redhat.com/show_bug.cgi?id=1817547#c32

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2020:4804
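Applied to the stop-timeout problem that prompted this report, the scoped op_defaults idea discussed in this thread might look roughly like the fragment below. This is a sketch, not a verified configuration: it assumes the shipped syntax matches the rsc_expression/op_expression proposal in the thread, that the bundle's implicit container resource is matched as the ocf:heartbeat:podman agent, and that defaults scoped this way reach implicit resources. All ids and the 90s value are made up for illustration.

```xml
<!-- Hypothetical: raise the default stop timeout only for podman resources. -->
<op_defaults>
  <meta_attributes id="op_defaults-podman-stop">
    <rule id="op_defaults-podman-stop-rule" score="INFINITY">
      <!-- Assumed agent identity of the implicit container resource -->
      <rsc_expression id="op_defaults-podman-stop-rsc" class="ocf" provider="heartbeat" type="podman"/>
      <!-- Restrict the default to stop operations -->
      <op_expression id="op_defaults-podman-stop-op" name="stop"/>
    </rule>
    <!-- Illustrative value; choose a timeout that suits the environment -->
    <nvpair id="op_defaults-podman-stop-timeout" name="timeout" value="90s"/>
  </meta_attributes>
</op_defaults>
```

Because the rule names the agent rather than a bundle id, a single block like this would cover every bundle using that container technology, which is the behaviour argued for earlier in the thread.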
Description of problem:
A way is needed to configure attributes (like monitor/stop/start timeouts) for a resource within a bundle. Currently there is no way to change the monitor/stop/start timeouts on a <docker> contained in a bundle:

<bundle id="galera-bundle">
  <docker image="192.168.24.1:8787/rhosp13/openstack-mariadb:pcmklatest" masters="3" network="host" options="--user=root --log-driver=journald -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS" replicas="3" run-command="/bin/bash /usr/local/bin/kolla_start"/>
  <network control-port="3123"/>
  <storage>

A customer was having issues where the <docker> resource takes longer than the default time to stop, and they need a way to increase that timeout. There were no command options to do this. We need a way for pcs to allow the manipulation of instance attributes, meta attributes, and op attributes for the implicit docker resources that are part of a bundle.

Version-Release number of selected component (if applicable):
pcs-0.9.162-5.el7_5.1.x86_64

How reproducible:
Every time

Steps to Reproduce:
Set up one of the bundles described in our doc.

Actual results:
The <docker> resource continues to hit timeouts when stopping.

Expected results:
The ability to modify the <docker> resource's timeout attribute for operations.

Additional info:
- The default resource operation timeout for an OCF resource is 20000 ms when none is explicitly configured, so this is the timeout for the docker containers' stop and monitor operations, as we observed in the logs. The operation timeouts configured for the primitive do not apply to the <docker> resource within the bundle.
- The only way to override the default 20000 ms op timeout for <docker> elements is to set a resource op default timeout, which applies to all resource operations whose timeout is not explicitly configured. That is not an ideal way to modify timeouts on the <docker> resource within the bundle, and it only applies to resources created after the timeout value is changed, not to existing ones.
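For reference, the cluster-wide workaround described in the last bullet corresponds to an unscoped operation default in the CIB, along the lines of the sketch below. The ids and the 90s value are illustrative assumptions, not taken from the customer's configuration.

```xml
<!-- Illustrative: a global default timeout for every operation that does
     not set its own timeout explicitly, on all resources in the cluster. -->
<op_defaults>
  <meta_attributes id="op_defaults-options">
    <nvpair id="op_defaults-options-timeout" name="timeout" value="90s"/>
  </meta_attributes>
</op_defaults>
```

The drawback is visible in the fragment itself: nothing restricts the default to the container resource or to the stop action, so it changes timing for unrelated resources as well.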