733468 – RFE: better error messaging for resource invoke() failures

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 733468 - RFE: better error messaging for resource invoke() failures

Summary: RFE: better error messaging for resource invoke() failures

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	matahari
Sub Component:
Version:	6.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Russell Bryant
QA Contact:	Dave Johnson
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	743047
TreeView+	depends on / blocked

Reported:	2011-08-25 18:49 UTC by Dave Johnson
Modified:	2016-04-26 14:47 UTC (History)
CC List:	2 users (show)
Fixed In Version:	matahari-0.4.4-1.el6
Doc Type:	Enhancement
Doc Text:	No description required.
Clone Of:
Environment:
Last Closed:	2011-12-06 11:40:30 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:1569	0	normal	SHIPPED_LIVE	matahari bug fix and enhancement update	2011-12-06 00:39:06 UTC

Description Dave Johnson 2011-08-25 18:49:34 UTC

Description of problem:
==================================
Whether calling invoke() from within python or using the qmf-tool, it is hard to troubleshoot calls that fail.  It wasn't until I started getting in the habit of running the agent in 'debug' mode with the -vvv flag that I figured out that I was missing either flags or parameters for the agent being called.

# For example, here i made a call and got back a rc2 but no reason.
[root@ibm-x3650-05 sbin]# ./matahari-qmf-service-cli -b 127.0.0.1 -A Dummy -J heartbeat -S ocf -T Resources -a start -i 5000 -t 10000
matahariproject.org:service:eebb184c-3af8-4bb7-99a3-32c91d3e9ca3
Call options: {action:start, agent:Dummy, interval:5000000, parameters:{}, provider:heartbeat, standard:ocf, timeout:10000000}
Call returned: {rc:2}
2011-08-25 14:30:38 warning Connection [127.0.0.1:48004-127.0.0.1:49000] closed
# But if I look in the console,
child_death_dispatch: Managed process 6500 exited: 0x1ff3b30
child_death_dispatch: Managed process 6500 (_start_5000000) exited with rc=2
child_death_dispatch: Removing timer 3
operation_finished: _start_5000000:6500 - exited with rc=2
operation_finished: _start_5000000:6500 [ Dummy[6500]: ERROR: Need to tell us our resource instance name. ]
# There is the reason, that needs to make it back into the output of invoke()


Version-Release number of selected component (if applicable):
====================================================================
v0.4.2-9

How reproducible:
==================================
100%

Steps to Reproduce:
==================================
1. install/start matahari broker and service agent
2. install resource-agents pkg
3. invoke a DUMMY resource without a instance name
4. invoke a IPaddr resource and only pass it the nic parameter

Comment 2 Dave Johnson 2011-08-30 16:30:39 UTC

Not sure how serious this is but I noticed that when these type of errors do occur, the action continues to re-occur.  I would think we would want to catch that somehow and prevent further action?

mainloop_signal_dispatch: Invoking handler for signal 17: Child exited
child_death_dispatch: Managed process 11277 exited: 0x1078cb0
child_death_dispatch: Managed process 11277 (dave_start_10000) exited with rc=6
child_death_dispatch: Removing timer 15
operation_finished: dave_start_10000:11277 - exited with rc=6
operation_finished: dave_start_10000:11277 [ IPaddr[11277]: ERROR: Required parameter OCF_RESKEY_ip is missing ]
mh_async_callback: Completed: dave_start_10000 = 6
child_death_dispatch: Removed process entry for 11277

vvv
recurring_action_timer: Scheduling another invokation of dave_start_10000
^^^

services_os_action_execute: Async waiting for 11302 - /usr/lib/ocf/resource.d/heartbeat/IPaddr

Comment 3 Russell Bryant 2011-09-02 19:21:07 UTC

Thanks for the report.  After our discussion, I wanted to summarize the things that we discussed that should be done to help make this situation better.

1) First, the particular failure that you have describe shouldn't happen.  It shouldn't be possible to let a service/resource API request get into a bad state like that.  You already opened a separate bug for that which has been addressed.  For reference:

https://bugzilla.redhat.com/show_bug.cgi?id=734522

2) Second, even if you do provide valid arguments to a resources invoke action, but for some reason, the resource agent reports a failure, we should provide a bit more help with debugging that.  We need to provide some documentation as to what the OCF return codes mean.  We also need to document how to use logging in Matahari so that you don't have to run it in debug mode in a console to get useful output.

Point #1 has been addressed already.  I will work on addressing point #2 and will leave an update here when that is complete.

Comment 4 Russell Bryant 2011-09-02 20:32:19 UTC

OCF return codes added to docs:

https://github.com/matahari/matahari/commit/b5be3b3c85f1d9ad656d7cb47df1e15f1d533d15

Comment 5 Russell Bryant 2011-09-06 19:22:41 UTC

Logging information added to the manual here:

https://github.com/matahari/matahari-manual/commit/b310bed7cf001dd702d4588153d341e08bd26ccb

Comment 7 Dave Johnson 2011-09-12 15:51:55 UTC

So I see if no resource name is given, the resource exits with invalid parameters.  I also see that the documentation is updated with return codes and logging instructions.  

With that said, I still have an issue with starting resources where an error is reported in the resource agent itself yet continues to be scheduled at the specified interval.  If an return code of 6 is thrown, shouldn't that prevent further actions from being rescheduled? 

For instance, if I run

    qmf: call 1 invoke dave ocf heartbeat IPaddr start 20000 "{'nic': 'eth6'}" 
      10000 0 dave
    Output Parameters:
       Name      Value
       =================
       userdata  dave
       rc        6
 
logging shows this:

    operation_finished: dave_start_20000:26149 [ IPaddr[26149]: ERROR: Required 
      parameter OCF_RESKEY_ip is missing ]

return code 6 means:
    <row>
      <entry><literal>6</literal></entry>
      <entry><literal>OCF_ERR_CONFIGURED</literal></entry>
      <entry>The resource's configuration is invalid. Eg. A required parameters 
             are missing.</entry>
    </row>

Comment 8 Russell Bryant 2011-09-14 19:58:34 UTC

So the agent throws an exception if the input is invalid such that the action can't even be attempted once.  However, if the error is generated from the resource itself, then the behavior continues (it gets executed periodically as requested).

My first instinct after reading this update was to go and make it so if we get an LSB or OCF error back, we don't keep trying something that was set to be periodic, but I think I have changed my mind.  I think I would rather leave the interpretation of the result and whether or not it should be aborted up to the application using the API.

For example, if you think about the majority use case for periodic actions, it's for monitoring of status.  Even if it's returning failure for whatever reason, you would almost always want that action to keep executing at the configured interval so that you find out when it changes.

So, right now I'm leaning toward calling what you have described the expected and desired behavior of the agent.

Comment 9 Russell Bryant 2011-11-16 21:58:30 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
No description required.

Comment 10 errata-xmlrpc 2011-12-06 11:40:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1569.html

Note You need to log in before you can comment on or make changes to this bug.