Bug 1600640 - Improve fence agent meta-data failure logs
Summary: Improve fence agent meta-data failure logs
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-12 16:28 UTC by Andrew Price
Modified: 2021-03-05 00:22 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-08 07:31:48 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
crm report (295.50 KB, application/x-bzip), 2018-07-12 16:28 UTC, Andrew Price
The corosync.log containing the failure (294.73 KB, application/x-gzip), 2018-07-12 16:44 UTC, Andrew Price

Description Andrew Price 2018-07-12 16:28:54 UTC
Created attachment 1458468 [details]
crm report

Description of problem:

QE hit a revolver test failure on s390x where a node was fenced off and wasn't turned on again as expected.

Version-Release number of selected component (if applicable):

pacemaker-1.1.19-1.el7.s390x
fence-agents-zvm-4.2.1-1.el7.s390x

How reproducible:
Only hit once so far

Steps to Reproduce:
1. Get QE to run revolver on an s390x cluster

Actual results:

Waiting up to 600 seconds for cluster to recover.
Killing network interfaces on qe-c01-m03...
Jul 11 17:38:40 qe-c01-m01: Saw victim qe-c01-m03 leave corosync
Jul 11 17:38:40 qe-c01-m02: Saw victim qe-c01-m03 leave corosync
Jul 11 17:38:41 qe-c01-m01.s390.bos.redhat.com pengine[1551]:  warning: Scheduling Node qe-c01-m03 for STONITH
Jul 11 17:38:43 qe-c01-m01: Recovered s390-cluster1:s390-cluster11.1 jid 2
Jul 11 17:38:43 qe-c01-m01: Recovered s390-cluster1:s390-cluster12.1 jid 2
Jul 11 17:38:43 qe-c01-m01: Recovered s390-cluster1:s390-cluster10.0 jid 2
Didn't receive heartbeat from qe-c01-m03 for 2 seconds
Victims left to recover: qe-c01-m03
[...]
Victims left to recover: qe-c01-m03 
qe-c01-m03 has not rebooted in 600 seconds, down
qe-c01-m03 returned 1 

Expected results:

Test passes

Additional info:

crm_report attached. Significant log entry:

Jul 11 17:38:40 [1548] qe-c01-m01.s390.bos.redhat.com stonith-ng: warning: stonith_device_execute:     Agent 'fence_zvmip' does not advertise support for 'reboot', performing 'off' action instead

I don't see this in the logs anywhere else.

[root@qe-c01-m01 qe-c01-m01]# crm_resource --show-metadata stonith:fence_zvmip | grep 'action.*reboot'
    <action name="reboot"/>

Comment 2 Andrew Price 2018-07-12 16:44:06 UTC
Created attachment 1458473 [details]
The corosync.log containing the failure

Comment 3 Ken Gaillot 2018-07-12 17:07:39 UTC
As discussed elsewhere, the logs contain this earlier:

Jul 11 12:12:48 [1514] qe-c01-m01.s390.bos.redhat.com stonith-ng: warning: stonith__rhcs_metadata:     Could not execute metadata action for fence_zvmip: Software caused connection abort | rc=-103

Jul 11 12:12:48 [1514] qe-c01-m01.s390.bos.redhat.com stonith-ng: error: get_agent_metadata: Could not retrieve metadata for fencing agent fence_zvmip

which indicates that this is not a bug but a local failure of the agent for a meta-data call.

The rather unhelpful "connection abort" message indicates that the agent call either timed out (which might not be obvious in the logs; compare the timestamp of when the meta-data call was initiated with when the failure was reported), was interrupted by a signal (in which case there should also be an "... exited due to signal" log message), or failed in an unexpected way ("call ... returned unexpected status").
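For reference, a quick way to check which of those cases applies is to grep the stonith-ng log for the corresponding messages (the log path here is an assumption; adjust it to wherever corosync.log lives on the affected node):

# illustrative check; the three patterns correspond to the failure modes described above
grep -E 'Could not execute metadata action|exited due to signal|returned unexpected status' /var/log/cluster/corosync.log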

To solve the immediate problem, you'll have to figure out why the agent call failed and fix that.

If the underlying issue is transient, it might be sufficient to set the pcmk_meta-data_retries parameter to 2 or 3. This parameter is not advertised, so it may require a --force, but I believe it is supported. If the first meta-data call fails, it would immediately try again.

If the underlying issue is a timeout, it might be sufficient to set pcmk_meta-data_timeout appropriately. Again this is not an advertised parameter but should work.
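Putting the two together, a hedged example of how the workaround could be applied (the device name is a placeholder; --force is needed because the parameters are not advertised, and the exact syntax is confirmed for this cluster in comments 15 and 16 below):

pcs stonith update --force <stonith-device> pcmk_meta-data_retries=3 pcmk_meta-data_timeout=20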

I'll leave this BZ open as a feature request to improve the error messages and to advertise the pcmk_meta-data_* parameters if they are supported. Specifically, if meta-data fails, we should give a more descriptive failure message, and we should warn about the specific assumptions that will be made without the benefit of meta-data.

Also, pacemaker currently tries to get the meta-data when parsing the device configuration from the CIB; we could consider whether it might be useful to retry a failed meta-data command at some later point, but that might not be practical.

Comment 4 Andrew Price 2018-07-12 17:23:47 UTC
Thanks, Ken. There isn't a timestamp for the meta-data call initiation, so unfortunately I can't tell how long it took. I don't see the other forms of log messages, though, so a timeout seems most likely. I'll take a look at the system configs for any reason it might be slow and talk to Chris M about trying those workarounds. What are the defaults for those two undocumented settings?

Comment 5 Ken Gaillot 2018-07-12 18:54:20 UTC
The default meta-data timeout for fence agents is 5 seconds. (As an aside, we use 30 seconds as the default for resource agents, so it might be a good idea to make this the same.)

Looking at the log, it does look like a timeout:

Jul 11 12:12:42 [1514] qe-c01-m01.s390.bos.redhat.com stonith-ng:     info: init_cib_cache_cb:  Updating device list from the cib: init
...
Jul 11 12:12:48 [1514] qe-c01-m01.s390.bos.redhat.com stonith-ng:  warning: stonith__rhcs_metadata:     Could not execute metadata action for fence_zvmip: Software caused connection abort | rc=-103


I mis-remembered the default as only one attempt, but the default number of retries for all actions is actually 2. The timeout applies to the entire action, so the cluster doesn't retry after a timeout. In other words, there's currently no way to put a shorter timeout on the individual attempts and a larger timeout on the entire operation, which might help if the first attempt were hung and would never complete.
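Purely as an illustration of that distinction (this is not how the cluster behaves today), a per-attempt timeout inside an overall budget would look something like:

# NOT current behaviour; sketch of a 5s per-attempt timeout, two attempts, 12s overall budget
timeout 12s bash -c 'for attempt in 1 2; do timeout 5s fence_zvmip -o metadata > /dev/null && break; done'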

Comment 6 Andrew Price 2018-07-13 09:51:12 UTC
6 seconds does seem like a long time just to output some static(?) data, so either the node was pretty heavily loaded at that point or the agent is doing something sub-optimal. I might open a new bz against fence-agents to get that figured out. Thanks again.

Comment 7 Jan Pokorný [poki] 2018-07-13 10:47:48 UTC
Especially if this was indeed caused by an elapsed timeout, and
considering fence_zvmip boils down to a Python implementation
(note there's also a C implementation alongside!), I'd note several
points:

0. Python is a poor choice of tool for the task to start with
   (migrating whatever is eligible to a subset of MicroPython
   could be a reasonable tradeoff)

1. Python-based fence agents appear to be launched in a poor
   way, meaning it is fairly easy to run into all sorts of issues,
   such as unintended delays caused by the magic that runs on
   every default Python interpreter start-up (e.g. if the root
   user manages a custom site-packages directory, its *.pth files
   get interpreted before anything else runs): I filed a bug to
   that effect for pcs a while back, though I only mentioned it
   internally in the context of cluster Python-based components
   as a whole, so I'm rectifying that now with [bug 1600893]

Next, perhaps it would make sense, as some kind of optional
OCF extension, to define a mechanism of explicit meta-data
caching by compliant resource managers:

  a. dedicate a directory, presumably on a volatile partition and
     presumably OCF-namespaced, for this meta-data cache

  b. whenever meta-data (of actively used agents?) are required:
     - if a cached version of the meta-data doesn't exist, or is
       older than the agent itself, rerun the caching process;
       when the same content is obtained, just refresh the
       timestamp, otherwise store it anew, then ...
     - otherwise, or in addition, return the cached content and
       do not invoke the agent again needlessly

There are plenty of advantages to this scheme: besides not draining
resources by needlessly spawning new processes, it'd be useful for
crm_report to grab these cached files, especially if only actively
used resources (at some point, that is) were subject to this
externalized caching. A rough sketch of the idea follows.
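To make (a) and (b) concrete, here is a minimal sketch of the cache-and-refresh logic, assuming a hypothetical cache directory and using a throwaway shell wrapper purely for illustration (a real OCF extension would live in the resource manager, and the "refresh the timestamp when content is unchanged" refinement is omitted):

# illustrative sketch only; the directory and wrapper are assumptions, not an existing interface
CACHE_DIR=/run/ocf-metadata-cache          # hypothetical volatile, OCF-namespaced location
AGENT=/usr/sbin/fence_zvmip
CACHE_FILE="$CACHE_DIR/$(basename "$AGENT").xml"

mkdir -p "$CACHE_DIR"
# rerun the agent only if there is no cached copy or the agent is newer than the cache
if [ ! -e "$CACHE_FILE" ] || [ "$AGENT" -nt "$CACHE_FILE" ]; then
    "$AGENT" -o metadata > "$CACHE_FILE.new" && mv "$CACHE_FILE.new" "$CACHE_FILE"
fi
# otherwise just return the cached content without invoking the agent again
cat "$CACHE_FILE"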

Comment 8 Jan Pokorný [poki] 2018-07-13 11:00:30 UTC
re [comment 6]:

> 6 seconds does seem like a long time just to output some static(?)
> data

Yes, those data are expected to be invariant (not dependent on the
temperature of the CPU); that's why the caching per [comment 7]
would be feasible at all.

Unrelated to fence_zvmip, sadly, there is very little discipline in
the implementation of the agents, meaning people do all kinds of crazy
things, including talking back to the resource manager even for
the purpose of getting meta-data (!):

https://bugs.clusterlabs.org/show_bug.cgi?id=5357#c16

Comment 9 Andrew Price 2018-07-13 11:12:46 UTC
Well, in gfs2 workloads in particular, we need the cluster stack to be robust in cases where a lot of memory is being used for dirty pages. In that case anything requiring memory allocation can be slow, as pages might need to be flushed back to disk to free up memory, so memory allocations can briefly become dependent on disk speed. Perhaps that's what we're seeing here when the Python fence agent is executed.
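One quick way to check whether that kind of dirty-page pressure coincides with a slow agent call would be to sample /proc/meminfo around the time of the call (illustrative only, not something the test currently records):

# high Dirty/Writeback values suggest allocations may be stalling on writeback to disk
grep -E '^(Dirty|Writeback):' /proc/meminfo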

Comment 10 Andrew Price 2018-07-17 11:02:33 UTC
(In reply to Andrew Price from comment #6)
> 6 seconds does seem like a long time just to output some static(?) data so
> either the node was pretty heavily loaded at that point or the agent is
> doing something sub-optimal. I might open a new bz against the fence-agents
> to get that figured out. Thanks again.

https://bugzilla.redhat.com/show_bug.cgi?id=1601820

Comment 12 Chris Mackowski 2018-07-18 21:46:22 UTC
I have been recording the 'time' output of the agent's '-o metadata' call at 60-second intervals (a sketch of that kind of sampling loop follows the tables). Below are a few of the slowest times; these were recorded during GFS2 testing.

qe-c01-m01:
Jul 18 13:45:00	0.03user 0.02system 0:00.59elapsed 9%CPU (0avgtext+0avgdata 11232maxresident)k
Jul 18 14:45:07 0.03user 0.02system 0:00.71elapsed 7%CPU (0avgtext+0avgdata 11232maxresident)k
Jul 18 14:51:40 0.03user 0.02system 0:00.78elapsed 7%CPU (0avgtext+0avgdata 11236maxresident)k

qe-c01-m02:
Jul 18 13:38:22 0.04user 0.02system 0:02.19elapsed 3%CPU (0avgtext+0avgdata 11232maxresident)k
Jul 18 14:38:32 0.03user 0.02system 0:00.67elapsed 9%CPU (0avgtext+0avgdata 11236maxresident)k

qe-c01-m03:
Jul 18 13:51:48 0.03user 0.02system 0:01.08elapsed 5%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 13:58:23 0.03user 0.02system 0:00.96elapsed 6%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 15:31:56 0.03user 0.02system 0:00.54elapsed 10%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 15:38:15 0.03user 0.02system 0:00.89elapsed 7%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 15:44:32 0.04user 0.02system 0:00.85elapsed 7%CPU (0avgtext+0avgdata 11236maxresident)k
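For reference, a loop of roughly this shape would produce measurements in the format shown above (a sketch under assumptions; the exact command used for the recording is not captured in this bug):

# illustrative reconstruction of the sampling loop; /usr/bin/time's default output
# matches the "...user ...system ...elapsed ..." lines above
while true; do
    printf '%s ' "$(date '+%b %d %H:%M:%S')"
    /usr/bin/time fence_zvmip -o metadata > /dev/null
    sleep 60
done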

I also observed sluggish performance from 'rpm -qa'. Unfortunately I did not grab a timestamp; it was approximately July 18 16:15 CDT. There was no GFS2 testing taking place at the time, and the systems were essentially idle.

[root@qe-c01-m02 ~]# time rpm -qa | grep pcp
snip.....
real	0m41.565s
user	0m0.697s
sys	0m0.298s

[root@qe-c01-m03 ~]# time rpm -qa | grep pcp
snip....
real	0m40.453s
user	0m0.645s
sys	0m0.264s

-Chris Mackowski

Comment 13 Andrew Price 2018-07-25 10:27:27 UTC
I've been adding better monitoring to the cluster and noticed a low level of entropy available for the rng, which really should be expected as this is virt. Theoretically that could cause processes to block, and not only when using /dev/random. To give it a boost I've enabled rngd to consume from /dev/prandom (a z-specific randomness source provided by the prng kernel module).
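For reference, the change amounts to pointing rngd at that device (shown here in its simplest command-line form; making it persistent would go through the rngd service configuration, and the exact wiring depends on the rng-tools packaging):

# illustrative: feed rngd from the z-specific /dev/prandom source
rngd -r /dev/prandom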

Chris, could you set up the revolver test to run for a long time with the new tweaks?

Comment 15 Andrew Price 2018-08-02 15:45:54 UTC
Just to be clear on the workaround, since it's undocumented, is this the right way to set those parameters?

pcs stonith update --force zfence1 pcmk_meta-data_timeout=20 pcmk_monitor_timeout=60s

Comment 16 Ken Gaillot 2018-08-02 16:13:48 UTC
(In reply to Andrew Price from comment #15)
> Just to be clear on the workaround, since it's undocumented, is this the
> right way to set those parameters?
> 
> pcs stonith update --force zfence1 pcmk_meta-data_timeout=20
> pcmk_monitor_timeout=60s

Yes, that should work.
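If it helps, a follow-up check to confirm the values were recorded on the device (the command form differs between pcs versions; I believe newer pcs uses 'pcs stonith config' instead):

pcs stonith show zfence1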

Comment 17 Ken Gaillot 2020-06-03 21:30:21 UTC
Moving to RHEL 8 only, since this will not be done in the RHEL 7.9 time frame.

Comment 21 RHEL Program Management 2021-01-08 07:31:48 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 22 Ken Gaillot 2021-03-05 00:22:57 UTC
FYI, failed fence agent meta-data commands will be periodically retried as of RHEL 8.5 (see Bug 1797579). We may reopen this bz for the log message improvements mentioned in Comment 3 if developer time becomes available.

