Bug 1832289

Summary: fence_aws: Logging enhancements (RHEL7)
Product: Red Hat Enterprise Linux 7 Reporter: Oyvind Albrigtsen <oalbrigt>
Component: fence-agents Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA QA Contact: Brandon Perkins <bperkins>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.8 CC: bperkins, cfeist, cluster-maint, fguilher, nwahl, oalbrigt
Target Milestone: rc   
Target Release: 7.9   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: fence-agents-4.2.1-41.el7 Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: 1781357 Environment:
Last Closed: 2020-09-29 19:15:46 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1781357    
Bug Blocks:    

Description Oyvind Albrigtsen 2020-05-06 13:30:23 UTC
+++ This bug was initially created as a clone of Bug #1781357 +++

Description of problem:

From customer:
~~~
I would like to be able to track down on which phase/stage of a fencing request it failed. For example, if it was during set_power_status(), get_power_status() or any other place in the code.

For example, SUSE's agent lets you easily figure out when the StopInstance was issued, and what the status and time taken were for the instance to move from "Stopping" to "Stopped", or from "Pending" to "Running", by printing the status and progress of the checks.

Honestly, anything we can think of that would help our customers and us (as support engineers) to troubleshoot and have more clarity on what the fence agent is doing is of great help, so I'm open to suggestions.
~~~

I think this is doable, and perhaps it's worth doing for all the fence agents rather than only having a special case for fence_aws. There may be some agent-specific data that can be logged, but the basic suggestion to print the output of get_power_status, and possibly to log when each call to [gs]et_power_status begins and/or ends, could apply to many agents.
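As a rough illustration of the suggestion above (not the code from the upstream patch), a generic wrapper could log when each [gs]et_power_status call begins and ends, what it returned, and how long it took. The decorator name log_phase, the logger name, the instance ID, and the in-memory stand-ins for get_power_status/set_power_status are all hypothetical; a real agent would call the cloud API instead.

```python
import functools
import logging
import time

logger = logging.getLogger("fence_sketch")  # hypothetical logger name


def log_phase(func):
    """Log the start, end, result, and duration of one fencing phase."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("%s: begin", func.__name__)
        start = time.monotonic()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record which phase failed, then let the caller handle the error.
            logger.exception("%s: failed after %.2fs",
                             func.__name__, time.monotonic() - start)
            raise
        logger.info("%s: end, result=%r, took %.2fs",
                    func.__name__, result, time.monotonic() - start)
        return result
    return wrapper


# Hypothetical stand-ins: a dict plays the role of the instance state.
_STATE = {"power": "on"}


@log_phase
def get_power_status(instance_id):
    return _STATE["power"]


@log_phase
def set_power_status(instance_id, action):
    _STATE["power"] = action
    return action


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    set_power_status("i-0123456789abcdef0", "off")
    get_power_status("i-0123456789abcdef0")
```

Because the wrapper is agent-agnostic, the same approach would apply to any agent that exposes these two functions, which is the point made above about not special-casing fence_aws.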

The existing verbose option prints a huge amount of output that's not feasible to direct to syslog for any extended period. A "middle ground" logging option would be nice. I'm not sure whether we would want to make that the default (for easier failure troubleshooting) or keep the current "quiet" logging pattern as the default.
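One way to picture the "middle ground" is to map three modes onto the standard Python logging levels, so that phase and status messages appear without the full verbose dump. This is only a sketch under assumptions: the mode names ("quiet", "phases", "verbose"), the logger names, and the sample messages are invented for illustration and are not the option names used by the agent.

```python
import logging

# Hypothetical mode names mapped to standard logging levels.
LEVELS = {
    "quiet": logging.WARNING,   # current default: problems only
    "phases": logging.INFO,     # middle ground: phase/status messages
    "verbose": logging.DEBUG,   # everything, including raw responses
}


def make_logger(mode="quiet"):
    """Return a logger whose threshold matches the requested mode."""
    logger = logging.getLogger("fence_sketch.%s" % mode)
    logger.setLevel(LEVELS[mode])
    return logger


def fence(logger):
    """Emit sample messages at each level (illustrative text only)."""
    logger.debug("raw API response: <large payload>")
    logger.info("stop request issued")
    logger.info("instance state: stopping -> stopped")
    logger.warning("instance did not reach the expected state in time")


if __name__ == "__main__":
    logging.basicConfig(format="%(levelname)s %(message)s")
    fence(make_logger("phases"))  # INFO and WARNING lines, no DEBUG
```

Whether "phases" or "quiet" should be the default is the open question raised above; in this sketch it is just the default argument of make_logger.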

This is the SUSE agent that the customer referred to:
  - https://github.com/ClusterLabs/cluster-glue/blob/master/lib/plugins/stonith/external/ec2

-----

Version-Release number of selected component (if applicable):

fence-agents-aws-4.2.1-30.el8_1.1.noarch

-----

Steps to Test:

1. Set the newly added debug option and observe the logs during stonith actions.

-----

Additional info:

Customer is willing to assist in testing and provide any relevant info/suggestions from the AWS side.

--- Additional comment from Oyvind Albrigtsen on 2020-01-31 16:12:10 CET ---

https://github.com/ClusterLabs/fence-agents/pull/318

Comment 7 Oyvind Albrigtsen 2020-05-25 12:16:24 UTC
Additional patch to catch ConnectionError (only reported when using role and blocking HTTPS): https://github.com/ClusterLabs/fence-agents/pull/338

Comment 11 errata-xmlrpc 2020-09-29 19:15:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (fence-agents bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3850