Description of problem:
Customer has multiple fence levels set up, one being something that can be turned off, like ipmi or ilo, and the second being something else. When the ipmi interface fails, fenced goes to do a ccs_get() on the next fence level and fails with these messages:

Dec 11 17:52:47 node_b ccsd[9390]: process_get: Invalid connection descriptor received.
Dec 11 17:52:47 node_b ccsd[9390]: Error while processing get: Invalid request descriptor

I have yet to be able to reproduce this myself. It has also been reported in the wild:
https://www.redhat.com/archives/linux-cluster/2006-December/msg00096.html

Version-Release number of selected component (if applicable):
ccs-1.0.7-0-x86_64

How reproducible:
every time

Steps to Reproduce:
1. Set up multiple fence levels
2. Disable the first fence level so it will fail
3. echo c > /proc/sysrq-trigger or some other method of forcing the box to be fenced

Actual results:
The second fence level is not executed

Expected results:
The second fence level should be executed

Additional info:
I'm building a debug ccs package now to try and narrow down what is happening. I will have that shortly.
You know what, looking at this now, there is a _large_ delay between the fencing action and the fence action returning a failure:

Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect after 30 seconds Failed

I bet the connection to ccsd is timing out and that's why ccsd freaks out when we go to do the next ccs_get(). I will try and reproduce this on my test setup.
Well, it tries to fence the device twice before it fails. Between tries there is a 30 second (or maybe a minute) delay until ipmilan exits. Maybe this is the reason for the timeout. Should it try to fence only once?
Yeah, any delay that long is going to cause this problem. I just reproduced it on my test cluster. I will try and get a solution together for you to test out as soon as I fix it.
OK, the reason this is happening is that each connection has a default expire time of 30 seconds. One solution would be to bump that timeout, but I think that's a bad idea. I think the best solution would be to check the error from ccs_get() in fenced and, if it's -EBADR, reconnect and then try again. I will write this up and test it out and see how it goes.
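To put numbers on it, here is a rough sketch that compares the gap between the two fenced log lines in the earlier comment with the 30 second expiry. It assumes the expiry clock was already running when fenced started the agent, which is my reading of the logs rather than something confirmed from the ccsd source. hms_to_secs() and connection_expired() are made-up helper names for illustration only.

```c
#include <assert.h>

/* Default connection expiry mentioned in this comment. */
#define CCS_EXPIRE_SECS 30

/* Convert a same-day HH:MM:SS log timestamp to seconds since midnight. */
static int hms_to_secs(int h, int m, int s)
{
    assert(h >= 0 && h < 24 && m >= 0 && m < 60 && s >= 0 && s < 60);
    return h * 3600 + m * 60 + s;
}

/* Returns 1 if the gap between two timestamps exceeds the expiry window. */
static int connection_expired(int start_secs, int end_secs)
{
    return (end_secs - start_secs) > CCS_EXPIRE_SECS;
}
```

With the timestamps from the log (17:50:28 and 17:52:47) the gap is 139 seconds, well past the 30 second window.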
Created attachment 143668 [details]
patch that resolves the problem

OK, here is the patch that I used to solve the problem. Basically this moves the ccs_connect/disconnect into dispatch_fence_agent(), so the caller doesn't have to handle any of that. If we get an -EBADR when doing a ccs_get(), we try to reinstantiate the connection and call ccs_get() again. I have tested this on my cluster and it works perfectly.
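For anyone reading along without the attachment, the reconnect-and-retry idea can be sketched roughly like this. This is not the attached patch. fake_ccs_connect(), fake_ccs_expire(), fake_ccs_get() and get_with_retry() are hypothetical stand-ins that simulate a descriptor going stale, so the example builds without libccs; the real code calls ccs_connect() and ccs_get(). EBADR comes from the Linux <errno.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the ccs library; NOT the real libccs API. */
static int next_desc = 1;    /* next descriptor the fake daemon hands out */
static int valid_desc = -1;  /* the only descriptor currently considered live */

static int fake_ccs_connect(void)
{
    valid_desc = next_desc++;
    return valid_desc;
}

/* Simulate the daemon expiring the connection, e.g. during a slow agent. */
static void fake_ccs_expire(void)
{
    valid_desc = -1;
}

static int fake_ccs_get(int desc, const char *query, char *out, size_t len)
{
    if (desc < 0 || desc != valid_desc)
        return -EBADR;       /* "Invalid request descriptor" */
    snprintf(out, len, "value-for:%s", query);
    return 0;
}

/* On -EBADR, open a fresh connection and retry the get exactly once. */
static int get_with_retry(int *desc, const char *query, char *out, size_t len)
{
    int rc;

    assert(desc != NULL && query != NULL && out != NULL);
    rc = fake_ccs_get(*desc, query, out, len);
    if (rc == -EBADR) {
        int fresh = fake_ccs_connect();
        if (fresh < 0)
            return fresh;    /* could not reconnect; give up */
        *desc = fresh;
        rc = fake_ccs_get(*desc, query, out, len);
    }
    return rc;
}
```

The retry is limited to one attempt so that a persistent failure still surfaces to the caller instead of looping.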
fixing summary and component.
Created attachment 143669 [details]
test src.rpm with the fix.

Please test this patch to make sure it works. Just download this src.rpm and run

rpmbuild --rebuild fence-1.32.25-1.bz219633.test.src.rpm

and then install the rpm that it creates and test it out to make sure it works alright.
I'm probably wrong, but won't this block if it can't connect to ccs without a ccs_force_connect? It will sleep forever... In this case the if (cd < 0) will never be executed because of the sleep in while (cd < 0)?

+	if (force)
+		cd = ccs_force_connect(NULL, 0);
+	else {
+		while ((cd = ccs_connect()) < 0)
+			sleep(1);
+	}
+
+	if (cd < 0) {
+		syslog(LOG_ERR, "cannot connect to ccs %d\n", cd);
+		return -1;
+	}
rpmbuild --rebuild fence-1.32.25-1.bz219633.test.src.rpm doesn't work:

error: Architecture is not included: i386
Yeah, if you cannot connect to ccs then you will sleep forever. That is left over from the original ccs_connect() that was done in fence_victims(); I just copied it over. Why fix what's not broken? During normal operations force will be 0, so you will do that while loop, and if you cannot connect to ccs during normal operations, you have much larger problems than not being able to fence :). The force option is provided simply for the fence_node facility, which will force a connection to ccs even if the cluster is without quorum.

If you are on i686 you need to use

rpmbuild --rebuild fence-1.32.25-1.bz219633.test.src.rpm --target i686

in order to make it build.
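To make the control flow of that hunk concrete, here is a small sketch with the ccs calls swapped for fakes. fake_connect(), fake_force_connect() and open_ccs() are hypothetical names, and the sleep(1) is dropped so it runs instantly. It shows that the non-force path only leaves the loop with a valid descriptor, so the cd < 0 check can only fire on the force path.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical fakes standing in for ccs_connect()/ccs_force_connect(). */
static int failures_left = 0;  /* how many more times fake_connect() fails */
static int attempts = 0;       /* total connect calls made so far */

static int fake_connect(void)
{
    attempts++;
    if (failures_left > 0) {
        failures_left--;
        return -1;
    }
    return 3;                  /* arbitrary valid descriptor */
}

static int fake_force_connect(void)
{
    attempts++;
    return -1;                 /* always fails, to exercise the error path */
}

static int open_ccs(int force)
{
    int cd;

    if (force)
        cd = fake_force_connect();
    else {
        /* The hunk calls sleep(1) in this loop; omitted in this sketch. */
        while ((cd = fake_connect()) < 0)
            ;
        assert(cd >= 0);       /* the loop can only exit on success */
    }

    if (cd < 0) {              /* reachable only when force is set */
        fprintf(stderr, "cannot connect to ccs %d\n", cd);
        return -1;
    }
    return cd;
}
```

If fake_connect() never succeeded, open_ccs(0) would spin forever, which is the behaviour described above.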
OK, I've tested this and it seems to work. It no longer repeats the message "process_get: Invalid connection descriptor received." over and over; instead it just tries the next fence level. Thanks for the help.
Patch looks good and I've committed it to CVS. It will be included in the next fence build (>=1.32.27).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0138.html
Hello,

As per your previous comments, I'll go ahead and close this issue.

Best Regards,
Jon

Internal Status set to 'Resolved'
Status set to: Closed by Tech
Resolution set to: 'NotABug'

This event sent from IssueTracker by jfautley
issue 107933