Bug 2112270 - [RFE] 'pcs (resource|stonith) (create|update)' commands fail to call the respective agents' 'validate-all' action
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: pcs
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 9.2
Assignee: Ondrej Mular
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Duplicates: 2112271
Depends On: 1553712 1636036 1644628 1816852 1955792 2102292 2112271 2123727 2127113 2157872
Blocks: 1377970 2159454
 
Reported: 2022-07-29 08:24 UTC by Tomas Jelinek
Modified: 2023-05-14 13:24 UTC
CC List: 15 users

Fixed In Version: pcs-0.11.3-5.el9
Doc Type: Enhancement
Doc Text:
.`pcs` can now run the `validate-all` action of resource and stonith agents
When creating or updating a resource or a STONITH device, you can now specify the `--agent-validation` option. With this option, `pcs` uses an agent's `validate-all` action, when it is available, in addition to the validation done by `pcs` based on the agent's metadata.
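For illustration, a minimal usage sketch of the `--agent-validation` option described above (resource name, agent, and IP address are hypothetical):

  pcs resource create test_ip ocf:heartbeat:IPaddr2 ip=192.168.1.5 --agent-validation
  pcs resource update test_ip ip=192.168.1.6 --agent-validation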
Clone Of: 1816852
Environment:
Last Closed: 2023-05-09 07:18:23 UTC
Type: ---
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CLUSTERQE-6219 0 None None None 2022-12-01 17:23:54 UTC
Red Hat Issue Tracker RHELPLAN-129620 0 None None None 2022-07-29 08:34:03 UTC
Red Hat Knowledge Base (Solution) 6110181 0 None None None 2022-08-09 15:08:17 UTC
Red Hat Product Errata RHBA-2023:2151 0 None None None 2023-05-09 07:18:47 UTC

Description Tomas Jelinek 2022-07-29 08:24:55 UTC
+++ This bug was initially created as a clone of Bug #1816852 +++

Description of problem:

The above pcs commands should call the used resource agents' 'validate-all'
action so that the OCF-intended environment checks can be performed and, more
importantly, so that consistency checks on the resource can run before it is
started, e.g. to determine whether the resource is active-active (and thus may
be cloned), or whether an active-passive resource being created is already
active and has to be deactivated first to avoid data corruption.

--- Additional comment from Ken Gaillot on 2020-03-24 22:59:27 CET ---

A question that should be resolved before having pcs call validate-all is what that action should do. (The OCF standard is not completely clear.)

validate-all should definitely check for legal syntax of proposed parameter values, as well as self-consistency (e.g. if parameter X is specified then parameter Y can only take such-and-such values).

However, whether validate-all should check whether the parameter values are suitable for the local host is another question. A resource can legitimately be created that can't run on the local host (for example, because the resource is constrained to run only on certain nodes, or because the local host is a GUI host that's not part of the cluster). Also, validating that the resource could run on the local host doesn't guarantee it can run on any other node.
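As a rough sketch of the syntax and self-consistency checks described above, a shell-based agent's validate-all might look like this (the agent and parameter names are hypothetical; ocf_exit_reason, ocf_is_true, and the OCF_* return codes come from the resource-agents ocf-shellfuncs library):

  myagent_validate_all() {
      # Syntax check: 'port' must be an integer when given.
      if [ -n "$OCF_RESKEY_port" ]; then
          case "$OCF_RESKEY_port" in
              *[!0-9]*) ocf_exit_reason "port must be numeric, got '$OCF_RESKEY_port'"
                        return $OCF_ERR_CONFIGURED ;;
          esac
      fi
      # Self-consistency check: 'tls_cert' only makes sense when 'use_tls' is enabled.
      if [ -n "$OCF_RESKEY_tls_cert" ] && ! ocf_is_true "$OCF_RESKEY_use_tls"; then
          ocf_exit_reason "tls_cert is set but use_tls is not enabled"
          return $OCF_ERR_CONFIGURED
      fi
      return $OCF_SUCCESS
  }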

--- Additional comment from Heinz Mauelshagen on 2020-03-24 23:28:35 CET ---

A misconfigured, unstartable and, in the worst case, data-corrupting resource should be rejected at create, update or clone time.

This is what we want to achieve using 'validate-all' in the MD resource agent Nigel is developing: reject creating a resource for a non-clustered RAID array when the array is already active on a node (e.g. as a result of an "mdadm -C ..." command), and reject cloning such a resource. Mind that the MD array may pre-exist and only now become a pacemaker cluster resource, potentially causing data corruption by being active, open and updated on one node when pacemaker resource management decides to bring it up on another.

--- Additional comment from Tomas Jelinek on 2020-04-06 14:19:39 CEST ---

I agree with Ken on this one. Validate-all should definitely check for syntax and consistency.

I believe that doing "live" checks, i.e. interacting with the underlying cluster / OS, should be avoided for several reasons. Adding to the reasons presented by Ken: the CIB can be edited in a file (pcs -f) even outside the cluster. In that case, interacting with the underlying OS or the (in this case non-existent) cluster is simply wrong; it will produce false results and possibly cause more trouble.

If those live checks are really needed, they should be implemented outside of the validate-all and validate-all-xml actions. Also, it looks like it may be desirable to run those checks in cases other than just modifying the CIB (i.e. when a resource is being started by pacemaker), which is another argument for implementing them somewhere else.
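For context, the offline editing workflow referred to above ('pcs -f') looks roughly like this (the file name and resource are illustrative):

  pcs cluster cib original.xml                   # save the current CIB to a file
  pcs -f original.xml resource create test_ip ocf:heartbeat:IPaddr2 ip=192.168.1.5
  pcs cluster cib-push original.xml              # push the edited CIB to the cluster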

--- Additional comment from Heinz Mauelshagen on 2020-04-08 14:06:32 CEST ---

(In reply to Tomas Jelinek from comment #4)
> I agree with Ken on this one. Validate-all should definitely check for
> syntax and consistency.
> 
> I believe, that doing "live" checks, i.e. interacting with an underlying
> cluster / OS, should be avoided for several reasons. Adding to reasons
> presented by Ken: CIB can be edited in-file (pcs -f) even outside the
> cluster. In that case, interacting with underlying OS or (in this case)
> non-existing cluster is just wrong and it will present false responses and
> possibly cause more troubles.
> 
> If those live checks are really needed, they should be implemented outside
> of validate-all and validate-all-xml actions. Also, it looks like it may be
> wanted to run those checks in other cases than just modifying CIB (i.e. when
> a resource is being started by pacemaker), which is another argument for
> implementing them somewhere else.

How does this argue against my comment #2, which aims to protect against data corruption as early
as possible in the CIB update process, apart from the fact that "pcs -f ..." followed by
"pcs cluster cib-push ..." would run validate-all at CIB push time? (Any resource relying on
live checks, e.g. on devices only available on the live cluster, will have to carry them out
when deployed there, which I presume to be covered by this bz's request to call validate-all
from pacemaker.)

FWIW: I'm not arguing against the validate-all-xml action being valuable to catch problems early
(though I did not grep any existing (upstream) resource agent implementing it, nor any pacemaker
executable providing an implementation of it; this bz therefore also applies to the latter).
As a result of those deficiencies, neither "pcs -f ..." nor "pcs cluster cib-push ..."
causes the validate-all-xml action to be called, as of my testing.


IOW: if a user misconfigures a resource on create/update, either by direct means (e.g. via pcs)
or delayed ones using "pcs -f ...; pcs cluster cib-push ...", or by cloning a resource that is
not suitable for active-active deployment, this should be caught ASAP rather than at start time,
when it can be too late to avoid data corruption; I presume validate-all and validate-all-xml
to be the means to ensure this.


In general terms, the rationale for this bz is that validate-all and validate-all-xml
(btw: it is hard to find a spec for the latter in the OCF docs?) are actions defined as part of
the resource agent API which help to avoid worst-case scenarios such as data corruption.
As they are part of the resource agent API, they should be called by pacemaker directly and
not just as internal workaround calls by agents which then fail at e.g. start time, causing
error messages that in turn require cleanup, all of which can be avoided by fully
implementing the API.

Also, supporting those missing API action calls does not rule out any of the semantics argued to
be needed at start time (i.e. do at validate time what the respective resource agent requires to
validate, and check at start time what's mandatory for the resource agent in that context),
although it will need a patch series updating the current workaround processing of validate-all
in a lot of existing (upstream) resource agents, conditionally discontinuing those internal calls
based on version.

--- Additional comment from Tomas Jelinek on 2020-04-08 17:47:04 CEST ---

(In reply to Heinz Mauelshagen from comment #5)
I agree with you that protecting from data corruption caused by badly created / updated / cloned resources is a good thing and should be implemented. I just pointed out another corner-case which needs to be dealt with. Running validate-all-xml on 'pcs resource create' (and update, clone) has been one of our goals for some time. However, support from pacemaker and agents is needed for that, so it has not been implemented yet.

It is important to figure out what validations can and should be done in what cases and which cluster component should be running them. As Ken and I pointed out, not all validations can be run in all cases. That is what the discussion is about.
You originally started with validation in 'pcs resource create | clone | update'. Then in comment 5 you also added validation in 'pcs cluster cib-push', which is a bit of a different feature. Yes, it aims for the same goal, but the implementation and circumstances are completely different. In resource create | update, we have one resource to deal with and we know its settings because they were just provided by the pcs user. In cib-push, we have either a whole CIB or an original CIB and a diff, so it must work differently. Later on in comment 5, you are actually saying it is pacemaker, not pcs, that should be running the validation:
> do at validate time what the respective resource agent requires to validate and check at start time what's mandatory for the resource agent in that context
Pcs can handle the "validate time" part but hardly the "start time" part, as it is not involved in any way in starting resources; that's pacemaker's job.

You cannot find any traces of validate-all-xml anywhere, because it has not been implemented yet.

--- Additional comment from Heinz Mauelshagen on 2020-04-16 15:51:34 CEST ---

(In reply to Tomas Jelinek from comment #6)
> (In reply to Heinz Mauelshagen from comment #5)
> I agree with you that protecting from data corruption caused by badly
> created / updated / cloned resources is a good thing and should be
> implemented. I just pointed out another corner-case which needs to be dealt
> with. Running validate-all-xml on 'pcs resource create' (and update, clone)
> has been one of our goals for some time. However, support from pacemaker and
> agents is needed for that, so it has not been implemented yet.

Ok, is there any product plan for it as yet?

> 
> It is important to figure out what validations can and should be done in
> what cases and which cluster component should be running them. As Ken and I
> pointed out, not all validations can be run in all cases. That is what the
> discussion is about.

Agreed, we need conditionals covering in which context the respective validate-all* action is processed.

> You originally started with validation in 'pcs resource create | clone |
> update'.

Yes, I did that because of the UI used, not because of the internal technical design and implementation.
I.e., the user doesn't care which component in the architecture processes the validation; that is up
to the technical implementer. The latter has to find the adequate component to be enhanced or added,
ideally a central one which manages the validation checks, rather than running them in multiple places
and suffering the related potential consistency issues in return.

>  Then in comment 5 you also added validating in 'pcs cluster
> cib-push', which is a bit of a different feature.

Indeed, this came up in the context of applying validation checks in any CIB create/update/... scenario,
no matter which UI causes them.

> Yes, it aims for the same
> goal, but the implementation and circumstances are completely different. In
> resource create | update, we have one resource to deal with and we know its
> settings because they were just provided by the pcs user. In cib-push, we
> have either a whole CIB or an original CIB and a diff, so it must work
> differently.

Yes, that's a 1-change scenario (simplifying here), i.e. create/update/restart/foobar a single resource,
versus an N-change scenario. The latter is a list of the former with more interrelations
because of potential bulk creations/updates on cib-push.

> Later on in comment 5, you are actually saying it is pacemaker,
> not pcs, who should be running the validation:
> do at validate time what the respective resource agent requires to validate and check at start time what's mandatory for the resource agent in that context
> Pcs can handle the "validate time" part but hardly the "start time" part as
> it is not involved in any way in starting resources, that's pacemaker's job.

Sure, I'm not arguing against having start checks. We need them along the same line of thought,
e.g. to prevent data corruption at start if only a start-time check is suitable to avoid it.

Pacemaker core seems to be the architecturally common place, whatever the proper component therein,
to carry out checks on either 'immediate' creates/updates/... via pcs/crm_resource/... or 'delayed'
ones via e.g. 'pcs -f ...; pcs cluster cib-push ...' (potentially running on different machines, the
former command even outside the cluster that the future cib-push is aimed at).

At the architectural level: what would prevent such a central component in pacemaker (an existing one
to be enhanced, or a new one) from carrying out validation checks?

> 
> You cannot find any traces of validate-all-xml anywhere, because it has not
> been implemented yet.

Thanks, I gathered as much from testing it and mentioned it above, saying "though I did not grep...neither any pacemaker executable...".

--- Additional comment from Ken Gaillot on 2021-04-30 21:45:56 CEST ---

The OCF Resource Agent API 1.1 standard [1] was recently released, and addresses these issues under "Global OCF Attributes" and "Check Levels".

The standard specifies OCF_OUTPUT_FORMAT and OCF_CHECK_LEVEL environment variables that an agent can check in its validate-all action. The agent may optionally support outputting XML when OCF_OUTPUT_FORMAT is "xml". If OCF_CHECK_LEVEL is 0 or unset, the agent should do only an internal consistency check, and if OCF_CHECK_LEVEL is 10, it may additionally validate the suitability of the local host.
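To make that concrete, a hand-run validate-all invocation might look roughly like this (the paths and parameter value are illustrative; per OCF convention, resource parameters are passed as OCF_RESKEY_* environment variables):

  OCF_ROOT=/usr/lib/ocf \
  OCF_RESKEY_ip=192.168.1.5 \
  OCF_CHECK_LEVEL=10 \
  /usr/lib/ocf/resource.d/heartbeat/IPaddr2 validate-all
  echo $?   # 0 on success, an OCF error code otherwise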

I believe this work remains:

* Pacemaker's crm_resource must pass OCF_OUTPUT_FORMAT="xml" to agents when called with --output-as="xml" --validate (Bug 1644628), and support an option for an OCF_CHECK_LEVEL to pass to agents (Bug 1955792).

* OCF 1.1 leaves the specifics of validate-all XML output undefined, so we need to come up with a good schema design that we can use and can be added to the next standard version.

* Any resource agents desired to have this behavior must be updated to support OCF 1.1, including OCF_OUTPUT_FORMAT, OCF_CHECK_LEVEL, and the new schema with the validate-all action. Bug 1936696 (pacemaker, including the ocf:pacemaker agents), Bug 1937025 (resource-agents), Bug 1937026 (resource-agents-sap), Bug 1937027 (resource-agents-sap-hana), and Bug 1937028 (resource-agents-sap-hana-scaleout) cover most agents shipped with RHEL as far as supporting OCF 1.1 goes, though not specifically the new validate-all behaviors since they are optional. We could comment on those bzs that the new validate-all behavior is desired, or open separate bzs if particular agents are of interest.

* pcs has to call crm_resource --validate with --output-as="xml" and the desired check level, and parse the new output.
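For reference, the kind of crm_resource call pcs would then make might look roughly like this (the agent and parameter are illustrative, the check-level option was still to be added at the time, and the exact options depend on the crm_resource version):

  crm_resource --validate --output-as=xml \
               --class ocf --provider heartbeat --agent IPaddr2 \
               --option ip=192.168.1.5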

Pacemaker itself doesn't call validate-all before starting resources, because it's at a lower level where maximum flexibility is prioritized, and because the agent itself can perform validation as part of its start action (which many do). The start should be sufficient -- doing a separate validation would just add overhead compared to letting the start fail (the end result is the same, a failed resource). If the agent returns success for start despite the parameters being invalid, that's an agent issue from Pacemaker's point of view.

[1] https://github.com/ClusterLabs/OCF-spec/blob/master/ra/1.1/resource-agent-api.md

--- Additional comment from Heinz Mauelshagen on 2022-05-19 14:06:52 CEST ---

Still needed, as e.g. start-action-time checks are too late to impose create/update-time constraints!

Comment 1 Ondrej Mular 2022-09-27 12:23:25 UTC
*** Bug 2112271 has been marked as a duplicate of this bug. ***

Comment 2 Ondrej Mular 2022-10-06 07:16:26 UTC
Upstream commit: https://github.com/ClusterLabs/pcs/commit/4dbc165db6d12a13360618a74d827cd9a86e7354

Test:
[root@rhel91-devel1 pcs]# pcs resource create test_ip ocf:heartbeat:IPaddr2 ip=192.168.1.5
Error: Validation result from agent (use --force to override):
  Oct 06 08:45:58 ERROR: Unable to find nic or netmask.
  ocf-exit-reason:[findif] failed
Error: Errors have occurred, therefore pcs is unable to continue
[root@rhel91-devel1 pcs]# echo $?
1

Comment 3 Miroslav Lisik 2022-10-26 13:07:52 UTC
DevTestResults:

[root@r92-1 ~]# rpm -q pcs
pcs-0.11.3-5.el9.x86_64

[root@r92-1 ~]# pcs resource create test_ip ocf:heartbeat:IPaddr2 ip=192.168.1.5
Error: Validation result from agent (use --force to override):
  Oct 25 09:53:17 ERROR: Unable to find nic or netmask.
  ocf-exit-reason:[findif] failed
Error: Errors have occurred, therefore pcs is unable to continue
[root@r92-1 ~]# echo $?
1

Comment 8 Michal Mazourek 2022-12-05 17:10:02 UTC
The same test as in bz1816852 comment 30 was used on a RHEL 9 system with pcs-0.11.4-1.el9 to verify this bz.
Marking as VERIFIED.

Comment 16 errata-xmlrpc 2023-05-09 07:18:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2151

