Bug 1047446
| Field | Value |
|---|---|
| Summary: | NFS timeout in RH Cluster through rgmanager monitor via netfs |
| Product: | Red Hat Enterprise Linux 6 |
| Component: | rgmanager |
| Version: | 6.2 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED NOTABUG |
| Severity: | high |
| Priority: | high |
| Reporter: | Maiti <kamal.maiti> |
| Assignee: | Ryan McCabe <rmccabe> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | cluster-maint, fdinitto, jdeffenb, jruemker, kamal.maiti, michele |
| Target Milestone: | rc |
| Target Release: | --- |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2014-01-21 20:50:40 UTC |
---

Thanks for your report. Please escalate your issue via GSS so that a customer case can be associated to this bugzilla. Also make sure you have upgraded to the latest resource-agents from rhel6.5.z, as it contains several fixes for the netfs resource agent.

```
Nov 21 23:30:52 rgmanager status on netfs:cploc1_cplocm2 timed out after 10 seconds
```

A network hiccup is considered a failure, and rgmanager will treat it as a fault. If you have network hiccups, you need to fix those rather than work around them. As you mention, there appears to be a problem with PID handling that should be investigated.

---

Similar incident appeared today, and it's in another cluster. We didn't have any network hiccup.

Messages:

```
[root@prod-X ~]# cat /var/log/cluster/rgmanager.log
Jan 01 03:45:57 rgmanager status on netfs:cpms1c2_mbox1 timed out after 10 seconds
Jan 01 03:45:58 rgmanager Task status PID 13960 did not exit after SIGKILL
Jan 01 03:45:58 rgmanager Stopping service service:svc_cpms1c2
Jan 01 03:46:45 rgmanager Service service:svc_cpms1c2 is recovering
Jan 01 03:46:45 rgmanager Recovering failed service service:svc_cpms1c2
Jan 01 03:46:52 rgmanager Service service:svc_cpms1c2 started
[root@prod-X ~]#
```

I'll push it via GSS and try to check which bugs are fixed in the latest version of resource-agents.

---

(In reply to Maiti from comment #2)
It would also be useful to gather sosreports from all nodes in both clusters. Those timeouts might require tuning, depending on how the cluster is set up relative to the NFS server.

---

Fabio, I'm following up with GSS case #00986321. There are a few sosreports already attached; you can look into them.

---

Created attachment 846297 [details]
sosreport 1
---

Hello, is there any update on this bug? Are you working on it / investigating?

---

(In reply to Maiti from comment #7)
> Is there any update on this bug? Are you working on it / investigating?

Yes, we are. Please request updates via the ticket that has been filed with GSS; otherwise, information will be spread across different systems and become inconsistent.

---

Hello,

As I noted in the support case you have opened with us, there are some steps we can take to try to:

a) Diagnose/tune NFS so that it is not prone to these issues,
b) Configure the netfs resource to not be so sensitive to intermittent stalls, and
c) Configure your resources such that they will not all have to restart immediately after a short-term failure.

For now, I am going to close this bug report out, and we will continue the process of diagnosing and addressing these issues in your case. Please let us know there if you have any questions or concerns.

Thanks,

John Ruemker, RHCA
Senior Software Maintenance Engineer
Global Support Services
Red Hat, Inc.
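Point (b) above, making the netfs status check less sensitive, can in principle be done by overriding the action timing in cluster.conf. The fragment below is only an illustrative sketch: the host, export, and mountpoint values are invented, and the `<action>` override and `__enforce_timeouts` attribute should be verified against the rgmanager documentation for your release before use.

```xml
<!-- Hypothetical cluster.conf fragment; names/paths are invented. -->
<service name="svc_cploc1" recovery="restart">
  <!-- __enforce_timeouts="1" makes rgmanager enforce the action
       timeout; with a larger timeout and interval, a short NFS
       stall is less likely to be treated as a resource failure. -->
  <netfs name="cploc1_cplocm2" host="nfs-server" export="/export/data"
         mountpoint="/data" fstype="nfs" options="rw,sync"
         __enforce_timeouts="1">
    <action name="status" depth="*" interval="60" timeout="120"/>
  </netfs>
</service>
```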
Description of problem:

There is no corresponding NFS timeout in /var/log/messages and no indication that there was a network hiccup. Each time rgmanager restarts a service, it impacts our production environment, and this is happening on several nodes. We would like rgmanager either to repeat the status check several times before restarting the service group, or at least to log why it is killing the task when the timeout expires. Please revisit the code and fix the problem.

Version-Release number of selected component (if applicable):

rgmanager-3.0.12.1-17.el6.x86_64
RHEL 6.2, kernel 2.6.32-358.11.1.el6.x86_64

How reproducible:

The issue happens intermittently.

Actual results:

rgmanager is restarting HA services. From /var/log/cluster/rgmanager.log:

```
Nov 21 23:30:52 rgmanager status on netfs:cploc1_cplocm2 timed out after 10 seconds
Nov 21 23:30:53 rgmanager Task status PID 31764 did not exit after SIGKILL
Nov 21 23:30:53 rgmanager Stopping service service:svc_cploc1
Nov 21 23:31:17 rgmanager Service service:svc_cploc1 is recovering
Nov 21 23:31:17 rgmanager Recovering failed service service:svc_cploc1
Nov 21 23:31:29 rgmanager Service service:svc_cploc1 started
```

Sometimes the timeout is 30 seconds. Service svc_cploc1 uses one script and one NFS mount point in cluster.conf.

Expected results:

rgmanager should not restart the service if all resources are fine.

Additional info:

I went through the code; here is my analysis. Based on the "timed out" messages printed in the log, it looks like waitpid() is not returning the PID of the forked child within the allowed time. The sleeptime is the timeout passed down from the wrapper script (netfs.sh). So the "if" condition evaluates to true, rgmanager prints the "timed out" message, and it kills the child process. Whether this behavior is justified for NFS needs to be taken up with Red Hat SEG and the package maintainer, or the issue should be fixed.
File: rgmanager/src/daemons/restree.c, function res_exec() (source tree rgmanager-src/rgmanager-3.0.12):

```c
[...]
childpid = fork();
[...]
if (sleeptime > 0) {
	/* There's a better way to do this, but this is easy
	   and doesn't introduce signal woes */
	while (sleeptime) {
		pid = waitpid(childpid, &ret, WNOHANG);
		if (pid == childpid)
			break;
		sleep(1);
		--sleeptime;
	}

	if (pid != childpid && sleeptime == 0) {
		logt_print(LOG_ERR,
			   "%s on %s:%s timed out after %d seconds\n",
			   op_str, res->r_rule->rr_type,
			   res->r_attrs->ra_value,
			   (int)node->rn_actions[act_index].ra_timeout);

		/* This can't be guaranteed to kill even the
		   child process if the child is in disk-wait :( */
		kill(childpid, SIGKILL);
		sleep(1);
[...]
```