Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1047446

Summary: NFS timeout in RH Cluster through rgmanager monitor via netfs
Product: Red Hat Enterprise Linux 6
Component: rgmanager
Version: 6.2
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: high
Reporter: Maiti <kamal.maiti>
Assignee: Ryan McCabe <rmccabe>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, fdinitto, jdeffenb, jruemker, kamal.maiti, michele
Target Milestone: rc
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-01-21 20:50:40 UTC
Attachments: sosreport 1

Description Maiti 2013-12-31 07:23:21 UTC
Description of problem:

There is no corresponding NFS timeout logged in /var/log/messages, and no indication of a network hiccup. Each time rgmanager restarts a service, it impacts our production environment, and this is happening on several nodes. We would like rgmanager to either retry the status check several times before restarting the service group, or at least log why it is killing the check when the timeout expires. Please review the code and fix the problem.

Version-Release number of selected component (if applicable):

rgmanager-3.0.12.1-17.el6.x86_64
rhel 6.2
kernel : 2.6.32-358.11.1.el6.x86_64

How reproducible:

The issue happens intermittently.

Actual results:

rgmanager is restarting HA services.

rgmanager Log :

From /var/log/cluster/rgmanager.log: 
Nov 21 23:30:52 rgmanager status on netfs:cploc1_cplocm2 timed out after 10 seconds 
Nov 21 23:30:53 rgmanager Task status PID 31764 did not exit after SIGKILL 
Nov 21 23:30:53 rgmanager Stopping service service:svc_cploc1 
Nov 21 23:31:17 rgmanager Service service:svc_cploc1 is recovering 
Nov 21 23:31:17 rgmanager Recovering failed service service:svc_cploc1 
Nov 21 23:31:29 rgmanager Service service:svc_cploc1 started 

Sometimes the timeout is 30 seconds.

Service svc_cploc1 uses one script resource and one NFS mount point (netfs) in cluster.conf.
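For reference, a minimal sketch of what such a service definition might look like in cluster.conf (all names, paths, and the NFS export below are hypothetical placeholders, not the reporter's actual configuration):

```xml
<service name="svc_cploc1" recovery="restart">
  <!-- NFS mount monitored by the netfs resource agent -->
  <netfs name="cploc1_cplocm2" host="nfs-server.example.com"
         export="/export/data" mountpoint="/mnt/data" fstype="nfs"/>
  <!-- Application init script, started after the mount is up -->
  <script name="app_script" file="/etc/init.d/app"/>
</service>
```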

Expected results:

rgmanager should not restart the service if all resources are healthy.

Additional info:

I reviewed the code; here is my analysis:

Based on the "timed out" messages printed in the system log, it appears that waitpid() is not returning the same PID as the forked child; once the child exits, it should. sleeptime is the timeout passed in from the wrapper script (netfs.sh). So the "if" condition evaluates to true, the "timed out" message is printed, and the child process is killed. How this should be handled for NFS needs to be taken up with Red Hat SEG and the package maintainer, or the issue should be fixed.

File [rgmanager source file] : ./rgmanager/src/daemons/restree.c, function “res_exec”
Location of source file => rgmanager-src/rgmanager-3.0.12

==============codes=============
[…]
childpid = fork();
[…]
if (sleeptime > 0) {

        /* There's a better way to do this, but this is easy and
           doesn't introduce signal woes */
        while (sleeptime) {
                pid = waitpid(childpid, &ret, WNOHANG);

                if (pid == childpid)
                        break;
                sleep(1);
                --sleeptime;
        }

        if (pid != childpid && sleeptime == 0) {

                logt_print(LOG_ERR,
                           "%s on %s:%s timed out after %d seconds\n",
                           op_str, res->r_rule->rr_type,
                           res->r_attrs->ra_value,
                           (int)node->rn_actions[act_index].ra_timeout);

                /* This can't be guaranteed to kill even the child
                   process if the child is in disk-wait :( */
                kill(childpid, SIGKILL);
                sleep(1);
[…]
==============codes=============

Comment 1 Fabio Massimo Di Nitto 2013-12-31 15:44:24 UTC
Thanks for your report. Please escalate your issue via GSS so that a customer case can be associated with this bugzilla.

Also make sure you have upgraded to the latest resource-agents from rhel6.5.z, as it contains several fixes to the netfs resource agent.

Nov 21 23:30:52 rgmanager status on netfs:cploc1_cplocm2 timed out after 10 seconds 

A network hiccup is considered a failure, and rgmanager will treat it as a fault.

If you are having network hiccups, you should fix those rather than work around them.

As you mention, there does appear to be a problem with the PID handling that should be investigated.

Comment 2 Maiti 2014-01-01 05:34:30 UTC
Similar incident appeared today and it's in another cluster. We didn't have any network hiccup.

Messages :

[root@prod-X ~]# cat /var/log/cluster/rgmanager.log
Jan 01 03:45:57 rgmanager status on netfs:cpms1c2_mbox1 timed out after 10 seconds
Jan 01 03:45:58 rgmanager Task status PID 13960 did not exit after SIGKILL
Jan 01 03:45:58 rgmanager Stopping service service:svc_cpms1c2
Jan 01 03:46:45 rgmanager Service service:svc_cpms1c2 is recovering
Jan 01 03:46:45 rgmanager Recovering failed service service:svc_cpms1c2
Jan 01 03:46:52 rgmanager Service service:svc_cpms1c2 started
[root@prod-X~]#

I'll push it via GSS and try to check which bugs are fixed in the latest version of resource-agents.

Comment 3 Fabio Massimo Di Nitto 2014-01-01 10:22:56 UTC
(In reply to Maiti from comment #2)
> Similar incident appeared today and it's in another cluster. We didn't have
> any network hiccup.
> 
> Messages :
> 
> [root@prod-X ~]# cat /var/log/cluster/rgmanager.log
> Jan 01 03:45:57 rgmanager status on netfs:cpms1c2_mbox1 timed out after 10
> seconds
> Jan 01 03:45:58 rgmanager Task status PID 13960 did not exit after SIGKILL
> Jan 01 03:45:58 rgmanager Stopping service service:svc_cpms1c2
> Jan 01 03:46:45 rgmanager Service service:svc_cpms1c2 is recovering
> Jan 01 03:46:45 rgmanager Recovering failed service service:svc_cpms1c2
> Jan 01 03:46:52 rgmanager Service service:svc_cpms1c2 started
> [root@prod-X~]#
> 
> I'll push it via GSS and try to check which bugs are fixed in latest version
> of resource-agents.

It would also be useful to gather sosreports from all nodes in both clusters. Those timeouts might require tuning, depending on the cluster setup and the NFS server.

Comment 4 Maiti 2014-01-02 10:06:33 UTC
Fabio, I'm following up via GSS case #00986321. A few sosreports are already attached; you can look into them.

Comment 5 jdeffenb 2014-01-06 20:40:51 UTC
Created attachment 846297 [details]
sosreport 1

Comment 7 Maiti 2014-01-15 06:16:25 UTC
Hello,

Is there any update of this bug? Are you working/investigating ?

Comment 8 Fabio Massimo Di Nitto 2014-01-15 07:56:14 UTC
(In reply to Maiti from comment #7)
> Hello,
> 
> Is there any update of this bug? Are you working/investigating ?

Yes, we are. Please request updates via the ticket that has been filed with GSS; otherwise, information will be spread across different systems and become inconsistent.

Comment 13 John Ruemker 2014-01-21 20:50:40 UTC
Hello,
As I noted in the support case you have opened with us, there are some steps we can take to try to:

a) Diagnose/Tune NFS so that it is not prone to these issues, 
b) Configure the netfs resource to not be so sensitive to intermittent stalls, and
c) Configure your resources such that they will not all have to restart immediately after a short-term failure
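As an illustration of (b), rgmanager allows the status action of an individual resource to be overridden in cluster.conf. A hedged sketch follows; the resource names, NFS details, and values are placeholders, and the exact attributes should be verified against the rgmanager documentation for your release:

```xml
<netfs name="cploc1_cplocm2" host="nfs-server.example.com"
       export="/export/data" mountpoint="/mnt/data" fstype="nfs">
  <!-- Check status less often and allow a longer stall before the
       check is declared failed and the service is recovered. -->
  <action name="status" interval="60" timeout="60"/>
</netfs>
```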

For now, I am going to close this bug report out, and we will continue the process of diagnosing and addressing these issues in your case. Please let us know there if you have any questions or concerns.

Thanks,
John Ruemker, RHCA
Senior Software Maintenance Engineer
Global Support Services
Red Hat, Inc.