Bug 733468
Summary: | RFE: better error messaging for resource invoke() failures | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Dave Johnson <dajohnso> |
Component: | matahari | Assignee: | Russell Bryant <rbryant> |
Status: | CLOSED ERRATA | QA Contact: | Dave Johnson <dajohnso> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 6.2 | CC: | matahari-maint, rbryant |
Target Milestone: | rc | Keywords: | FutureFeature |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | matahari-0.4.4-1.el6 | Doc Type: | Enhancement |
Doc Text: |
No description required.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2011-12-06 11:40:30 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 743047 |
Description
Dave Johnson
2011-08-25 18:49:34 UTC
Not sure how serious this is but I noticed that when these type of errors do occur, the action continues to re-occur. I would think we would want to catch that somehow and prevent further action? mainloop_signal_dispatch: Invoking handler for signal 17: Child exited child_death_dispatch: Managed process 11277 exited: 0x1078cb0 child_death_dispatch: Managed process 11277 (dave_start_10000) exited with rc=6 child_death_dispatch: Removing timer 15 operation_finished: dave_start_10000:11277 - exited with rc=6 operation_finished: dave_start_10000:11277 [ IPaddr[11277]: ERROR: Required parameter OCF_RESKEY_ip is missing ] mh_async_callback: Completed: dave_start_10000 = 6 child_death_dispatch: Removed process entry for 11277 vvv recurring_action_timer: Scheduling another invokation of dave_start_10000 ^^^ services_os_action_execute: Async waiting for 11302 - /usr/lib/ocf/resource.d/heartbeat/IPaddr Thanks for the report. After our discussion, I wanted to summarize the things that we discussed that should be done to help make this situation better. 1) First, the particular failure that you have describe shouldn't happen. It shouldn't be possible to let a service/resource API request get into a bad state like that. You already opened a separate bug for that which has been addressed. For reference: https://bugzilla.redhat.com/show_bug.cgi?id=734522 2) Second, even if you do provide valid arguments to a resources invoke action, but for some reason, the resource agent reports a failure, we should provide a bit more help with debugging that. We need to provide some documentation as to what the OCF return codes mean. We also need to document how to use logging in Matahari so that you don't have to run it in debug mode in a console to get useful output. Point #1 has been addressed already. I will work on addressing point #2 and will leave an update here when that is complete. OCF return codes added to docs: https://github.com/matahari/matahari/commit/b5be3b3c85f1d9ad656d7cb47df1e15f1d533d15 Logging information added to the manual here: https://github.com/matahari/matahari-manual/commit/b310bed7cf001dd702d4588153d341e08bd26ccb So I see if no resource name is given, the resource exits with invalid parameters. I also see that the documentation is updated with return codes and logging instructions. With that said, I still have an issue with starting resources where an error is reported in the resource agent itself yet continues to be scheduled at the specified interval. If an return code of 6 is thrown, shouldn't that prevent further actions from being rescheduled? For instance, if I run qmf: call 1 invoke dave ocf heartbeat IPaddr start 20000 "{'nic': 'eth6'}" 10000 0 dave Output Parameters: Name Value ================= userdata dave rc 6 logging shows this: operation_finished: dave_start_20000:26149 [ IPaddr[26149]: ERROR: Required parameter OCF_RESKEY_ip is missing ] return code 6 means: <row> <entry><literal>6</literal></entry> <entry><literal>OCF_ERR_CONFIGURED</literal></entry> <entry>The resource's configuration is invalid. Eg. A required parameters are missing.</entry> </row> So the agent throws an exception if the input is invalid such that the action can't even be attempted once. However, if the error is generated from the resource itself, then the behavior continues (it gets executed periodically as requested). My first instinct after reading this update was to go and make it so if we get an LSB or OCF error back, we don't keep trying something that was set to be periodic, but I think I have changed my mind. I think I would rather leave the interpretation of the result and whether or not it should be aborted up to the application using the API. For example, if you think about the majority use case for periodic actions, it's for monitoring of status. Even if it's returning failure for whatever reason, you would almost always want that action to keep executing at the configured interval so that you find out when it changes. So, right now I'm leaning toward calling what you have described the expected and desired behavior of the agent. Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: No description required. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1569.html |