Bug 733468

Summary: RFE: better error messaging for resource invoke() failures
Product: Red Hat Enterprise Linux 6 Reporter: Dave Johnson <dajohnso>
Component: matahariAssignee: Russell Bryant <rbryant>
Status: CLOSED ERRATA QA Contact: Dave Johnson <dajohnso>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.2CC: matahari-maint, rbryant
Target Milestone: rcKeywords: FutureFeature
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: matahari-0.4.4-1.el6 Doc Type: Enhancement
Doc Text:
No description required.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-12-06 11:40:30 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 743047    

Description Dave Johnson 2011-08-25 18:49:34 UTC
Description of problem:
==================================
Whether calling invoke() from within python or using the qmf-tool, it is hard to troubleshoot calls that fail.  It wasn't until I started getting in the habit of running the agent in 'debug' mode with the -vvv flag that I figured out that I was missing either flags or parameters for the agent being called.

# For example, here i made a call and got back a rc2 but no reason.
[root@ibm-x3650-05 sbin]# ./matahari-qmf-service-cli -b 127.0.0.1 -A Dummy -J heartbeat -S ocf -T Resources -a start -i 5000 -t 10000
matahariproject.org:service:eebb184c-3af8-4bb7-99a3-32c91d3e9ca3
Call options: {action:start, agent:Dummy, interval:5000000, parameters:{}, provider:heartbeat, standard:ocf, timeout:10000000}
Call returned: {rc:2}
2011-08-25 14:30:38 warning Connection [127.0.0.1:48004-127.0.0.1:49000] closed
# But if I look in the console,
child_death_dispatch: Managed process 6500 exited: 0x1ff3b30
child_death_dispatch: Managed process 6500 (_start_5000000) exited with rc=2
child_death_dispatch: Removing timer 3
operation_finished: _start_5000000:6500 - exited with rc=2
operation_finished: _start_5000000:6500 [ Dummy[6500]: ERROR: Need to tell us our resource instance name. ]
# There is the reason, that needs to make it back into the output of invoke()


Version-Release number of selected component (if applicable):
====================================================================
v0.4.2-9

How reproducible:
==================================
100%

Steps to Reproduce:
==================================
1. install/start matahari broker and service agent
2. install resource-agents pkg
3. invoke a DUMMY resource without a instance name
4. invoke a IPaddr resource and only pass it the nic parameter

Comment 2 Dave Johnson 2011-08-30 16:30:39 UTC
Not sure how serious this is but I noticed that when these type of errors do occur, the action continues to re-occur.  I would think we would want to catch that somehow and prevent further action?

mainloop_signal_dispatch: Invoking handler for signal 17: Child exited
child_death_dispatch: Managed process 11277 exited: 0x1078cb0
child_death_dispatch: Managed process 11277 (dave_start_10000) exited with rc=6
child_death_dispatch: Removing timer 15
operation_finished: dave_start_10000:11277 - exited with rc=6
operation_finished: dave_start_10000:11277 [ IPaddr[11277]: ERROR: Required parameter OCF_RESKEY_ip is missing ]
mh_async_callback: Completed: dave_start_10000 = 6
child_death_dispatch: Removed process entry for 11277

vvv
recurring_action_timer: Scheduling another invokation of dave_start_10000
^^^

services_os_action_execute: Async waiting for 11302 - /usr/lib/ocf/resource.d/heartbeat/IPaddr

Comment 3 Russell Bryant 2011-09-02 19:21:07 UTC
Thanks for the report.  After our discussion, I wanted to summarize the things that we discussed that should be done to help make this situation better.

1) First, the particular failure that you have describe shouldn't happen.  It shouldn't be possible to let a service/resource API request get into a bad state like that.  You already opened a separate bug for that which has been addressed.  For reference:

https://bugzilla.redhat.com/show_bug.cgi?id=734522

2) Second, even if you do provide valid arguments to a resources invoke action, but for some reason, the resource agent reports a failure, we should provide a bit more help with debugging that.  We need to provide some documentation as to what the OCF return codes mean.  We also need to document how to use logging in Matahari so that you don't have to run it in debug mode in a console to get useful output.

Point #1 has been addressed already.  I will work on addressing point #2 and will leave an update here when that is complete.

Comment 4 Russell Bryant 2011-09-02 20:32:19 UTC
OCF return codes added to docs:

https://github.com/matahari/matahari/commit/b5be3b3c85f1d9ad656d7cb47df1e15f1d533d15

Comment 5 Russell Bryant 2011-09-06 19:22:41 UTC
Logging information added to the manual here:

https://github.com/matahari/matahari-manual/commit/b310bed7cf001dd702d4588153d341e08bd26ccb

Comment 7 Dave Johnson 2011-09-12 15:51:55 UTC
So I see if no resource name is given, the resource exits with invalid parameters.  I also see that the documentation is updated with return codes and logging instructions.  

With that said, I still have an issue with starting resources where an error is reported in the resource agent itself yet continues to be scheduled at the specified interval.  If an return code of 6 is thrown, shouldn't that prevent further actions from being rescheduled? 

For instance, if I run

    qmf: call 1 invoke dave ocf heartbeat IPaddr start 20000 "{'nic': 'eth6'}" 
      10000 0 dave
    Output Parameters:
       Name      Value
       =================
       userdata  dave
       rc        6
 
logging shows this:

    operation_finished: dave_start_20000:26149 [ IPaddr[26149]: ERROR: Required 
      parameter OCF_RESKEY_ip is missing ]

return code 6 means:
    <row>
      <entry><literal>6</literal></entry>
      <entry><literal>OCF_ERR_CONFIGURED</literal></entry>
      <entry>The resource's configuration is invalid. Eg. A required parameters 
             are missing.</entry>
    </row>

Comment 8 Russell Bryant 2011-09-14 19:58:34 UTC
So the agent throws an exception if the input is invalid such that the action can't even be attempted once.  However, if the error is generated from the resource itself, then the behavior continues (it gets executed periodically as requested).

My first instinct after reading this update was to go and make it so if we get an LSB or OCF error back, we don't keep trying something that was set to be periodic, but I think I have changed my mind.  I think I would rather leave the interpretation of the result and whether or not it should be aborted up to the application using the API.

For example, if you think about the majority use case for periodic actions, it's for monitoring of status.  Even if it's returning failure for whatever reason, you would almost always want that action to keep executing at the configured interval so that you find out when it changes.

So, right now I'm leaning toward calling what you have described the expected and desired behavior of the agent.

Comment 9 Russell Bryant 2011-11-16 21:58:30 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
No description required.

Comment 10 errata-xmlrpc 2011-12-06 11:40:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1569.html