Bug 219633
| Summary: | fenced fails to execute fence secondary fence methods if its ccs connection timesout | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Josef Bacik <jbacik> | ||||||
| Component: | fence | Assignee: | Chris Feist <cfeist> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||||
| Severity: | medium | Docs Contact: | |||||||
| Priority: | medium | ||||||||
| Version: | 4 | CC: | cluster-maint, hlawatschek, jbrassow, marcos.david, rkenna, tao | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | All | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | RHBA-2007-0138 | Doc Type: | Bug Fix | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2007-05-10 21:25:45 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 354421 | ||||||||
| Attachments: |
Description
Josef Bacik
2006-12-14 15:39:33 UTC
You know what, looking at this now, there is a _large_ delay between the fencing action and the fence action returning a failure:

Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect after 30 seconds Failed

I bet the connection to ccsd is timing out, and that's why ccsd freaks out when we go to do the next ccs_get(). I will try to reproduce this on my test setup.

Well, it tries to fence the device twice before it fails. Between tries there is a 30-second (or maybe a one-minute) delay until ipmilan exits. Maybe this is the reason for the timeout. Should it try to fence only once?

Yeah, any delay that long is going to cause this problem. I just reproduced it on my test cluster. I will try to get a solution together for you to test as soon as I fix it.

OK, the reason this is happening is that each connection has a default expire time of 30 seconds. One solution would be to bump that timeout, but I think that's a bad idea. I think the best solution would be to check the error from ccs_get() in fenced and, if it's -EBADR, reconnect and try again. I will write this up, test it, and see how it goes.

Created attachment 143668 [details]
patch that resolves the problem
OK, here is the patch that I used to solve the problem. Basically, this moves
the ccs_connect/disconnect into dispatch_fence_agent(), so the caller doesn't
have to handle any of that. If we get an -EBADR when doing a ccs_get() we try
to reinstantiate the connection and call the ccs_get() again. I have tested
this on my cluster and it works perfectly.
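
The reconnect-and-retry behaviour described in this comment can be sketched roughly as follows. This is only an illustration, not the attached patch: `stub_ccs_connect()`, `stub_ccs_get()`, `get_with_reconnect()` and the two `stub_*` state variables are hypothetical stand-ins invented for the example, and the real libccs calls take different arguments and talk to ccsd. It also assumes a Linux build, where `EBADR` is defined in `<errno.h>`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical state that simulates ccsd expiring an idle connection. */
static int stub_expired = 1;   /* 1 = current descriptor has timed out */
static int stub_connects = 0;  /* number of (re)connections made so far */

/* Hypothetical stand-in for a ccs connect call: returns a fresh descriptor. */
static int stub_ccs_connect(void)
{
    stub_connects++;
    stub_expired = 0;          /* a new connection is valid again */
    return stub_connects;
}

/* Hypothetical stand-in for a ccs lookup: fails with -EBADR on a stale descriptor. */
static int stub_ccs_get(int cd, const char *query, const char **out)
{
    (void)query;
    if (cd < 0 || stub_expired)
        return -EBADR;
    *out = "fence_ipmilan";
    return 0;
}

/* Run the lookup; if the descriptor is stale (-EBADR), reconnect once and retry. */
static int get_with_reconnect(int *cd, const char *query, const char **out)
{
    int error;

    assert(cd != NULL && out != NULL);

    error = stub_ccs_get(*cd, query, out);
    if (error == -EBADR) {
        *cd = stub_ccs_connect();
        if (*cd < 0)
            return *cd;        /* reconnect failed, give up */
        error = stub_ccs_get(*cd, query, out);
    }
    return error;
}
```

A caller would hold the descriptor in a local variable and pass its address, so that a lookup made after a slow fence agent transparently picks up a new connection instead of failing on the expired one.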
Fixing summary and component.

Created attachment 143669 [details]
test src.rpm with the fix

Please test this patch to make sure it works. Just download this src.rpm, run

rpmbuild --rebuild fence-1.32.25-1.bz219633.test.src.rpm

and then install the RPM that it creates and test it to make sure it works all right.

I'm probably wrong, but won't this block if it can't connect to ccs without a
ccs_force_connect? It will sleep forever. In this case the if (cd < 0)
will never be executed because of the sleep in while (cd < 0)?
+    if (force)
+        cd = ccs_force_connect(NULL, 0);
+    else {
+        while ((cd = ccs_connect()) < 0)
+            sleep(1);
+    }
+
+    if (cd < 0) {
+        syslog(LOG_ERR, "cannot connect to ccs %d\n", cd);
+        return -1;
+    }
rpmbuild --rebuild fence-1.32.25-1.bz219633.test.src.rpm doesn't work:

error: Architecture is not included: i386

Yeah, if you cannot connect to ccs then you will sleep forever. That is left over from the original ccs_connect() that was done in fence_victims(); I just copied it over. Why fix what's not broke? During normal operations force will be 0, so you will do that while loop, and if you cannot connect to ccs during normal operations, you have much larger problems than not being able to fence :). The force option is provided simply for the fence_node facility, which will force a connection to ccs even if the cluster is without quorum.

If you are on i686 you need to use

rpmbuild --rebuild fence-1.32.25-1.bz219633.test.src.rpm --target i686

in order to make it build.

OK, I've tested this and it seems to work. It no longer has the message "process_get: Invalid connection descriptor received." repeating over and over; instead it just tries the next fence level. Thanks for the help.

Patch looks good and I've committed it to CVS. It will be included in the next fence build (>=1.32.27).

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0138.html

Hello,

As per your previous comments, I'll go ahead and close this issue.

Best Regards,
Jon

Internal Status set to 'Resolved'
Status set to: Closed by Tech
Resolution set to: 'NotABug'

This event sent from IssueTracker by jfautley
issue 107933