Bug 219633 - fenced fails to execute secondary fence methods if its ccs connection times out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: fence
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Chris Feist
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 354421
 
Reported: 2006-12-14 15:39 UTC by Josef Bacik
Modified: 2018-10-19 23:38 UTC (History)

Fixed In Version: RHBA-2007-0138
Clone Of:
Environment:
Last Closed: 2007-05-10 21:25:45 UTC
Embargoed:


Attachments
patch that resolves the problem (4.19 KB, patch), 2006-12-14 18:23 UTC, Josef Bacik
test src.rpm with the fix (315.92 KB, application/octet-stream), 2006-12-14 18:29 UTC, Josef Bacik


Links
System: Red Hat Product Errata
ID: RHBA-2007:0138
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: fence bug fix update
Last Updated: 2007-05-10 21:23:14 UTC

Description Josef Bacik 2006-12-14 15:39:33 UTC
Description of problem:
The customer has multiple fence levels set up, the first being something that can be
turned off, like ipmi or ilo, and the second being something else.  When the
ipmi interface fails, fenced goes to do a ccs_get() on the next fence level
and fails with these messages:

Dec 11 17:52:47 node_b ccsd[9390]: process_get: Invalid connection descriptor 
received. 
Dec 11 17:52:47 node_b ccsd[9390]: Error while processing get: Invalid request 
descriptor

I have not yet been able to reproduce this myself.  It has also been reported in
the wild:

https://www.redhat.com/archives/linux-cluster/2006-December/msg00096.html

Version-Release number of selected component (if applicable):
ccs-1.0.7-0-x86_64

How reproducible:
every time

Steps to Reproduce:
1. Set up multiple fence levels.
2. Disable the first fence level so it will fail.
3. Run echo c > /proc/sysrq-trigger, or use some other method of forcing the box
   to be fenced.
  
Actual results:
The second fence level is not executed

Expected results:
The second fence level should be executed

Additional info:
I'm building a debug ccs package now to try and narrow down what is happening.  
I will have that shortly.

Comment 1 Josef Bacik 2006-12-14 15:57:24 UTC
You know what, looking at this now, there is a _large_ delay between the
fencing action and the fence action returning a failure:

Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports: Rebooting 
machine @ IPMI:172.18.56.17...ipmilan: Failed to connect after 30 seconds 
Failed 

I bet the connection to ccsd is timing out, and that's why ccsd freaks out when
we go to do the next ccs_get().  I will try to reproduce this on my test
setup.

Comment 2 Marcos David 2006-12-14 16:10:57 UTC
Well, it tries to fence the device twice before it fails. Between tries there is
a 30 second (or maybe a minute) delay until ipmilan exits. Maybe this is the
reason for the timeout. Should it try to fence only once?

Comment 3 Josef Bacik 2006-12-14 16:14:00 UTC
Yeah, any delay that long is going to cause this problem.  I just reproduced it on
my test cluster.  I will try to get a solution together for you to test out
as soon as I fix it.

Comment 4 Josef Bacik 2006-12-14 16:49:56 UTC
OK, the reason this is happening is that each connection has a default
expire time of 30 seconds.  One solution would be to bump that timeout, but I
think that's a bad idea.  I think the best solution would be to check the error
from ccs_get() in fenced and, if it's -EBADR, to reconnect and then try again.
I will write this up and test it out and see how it goes.
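
A minimal sketch of that idea, not the attached patch itself. The ccs_*
functions below are local stand-ins with the same shape as the libccs calls, so
the example compiles without a cluster; fake_expire_connection() simulates ccsd
dropping an idle connection after its 30 second expiry, and
get_with_reconnect() is a hypothetical helper name. For scale, in the log from
comment 1 the agent ran from 17:50:28 to 17:52:47, about 139 seconds, well past
that expiry.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#ifndef EBADR
#define EBADR 53            /* Linux value, "Invalid request descriptor" */
#endif

/* ---- Stand-ins for libccs (assumptions, not the real library) ---- */
int valid_desc = -1;        /* the only descriptor the fake ccsd accepts */
int next_desc = 1;
int connect_calls = 0;      /* counts how often we (re)connected */

int ccs_connect(void)
{
    connect_calls++;
    valid_desc = next_desc++;
    return valid_desc;
}

int ccs_disconnect(int desc)
{
    if (desc == valid_desc)
        valid_desc = -1;
    return 0;
}

int ccs_get(int desc, const char *query, char **rtn)
{
    static const char value[] = "second-level-device";  /* made-up answer */

    (void)query;
    if (desc < 0 || desc != valid_desc)
        return -EBADR;      /* expired or unknown descriptor */
    *rtn = malloc(sizeof value);
    if (*rtn == NULL)
        return -ENOMEM;
    memcpy(*rtn, value, sizeof value);
    return 0;
}

/* Simulates ccsd expiring the connection while a fence agent is slow. */
void fake_expire_connection(void)
{
    valid_desc = -1;
}

/* ---- The idea from this comment: retry once after -EBADR ---- */
int get_with_reconnect(int *cd, const char *query, char **rtn)
{
    int error;

    assert(cd != NULL && rtn != NULL);

    error = ccs_get(*cd, query, rtn);
    if (error != -EBADR)
        return error;       /* success, or an error a reconnect won't fix */

    ccs_disconnect(*cd);    /* drop the stale descriptor */
    *cd = ccs_connect();
    if (*cd < 0)
        return *cd;
    return ccs_get(*cd, query, rtn);
}
```

A caller would keep cd across the slow fence agent call and use
get_with_reconnect() in place of ccs_get() when looking up the next fence
level. Retrying only once keeps a genuinely dead ccsd from turning this into a
loop.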

Comment 5 Josef Bacik 2006-12-14 18:23:20 UTC
Created attachment 143668 [details]
patch that resolves the problem

OK, here is the patch that I used to solve the problem.  Basically this moves
the ccs_connect/disconnect into dispatch_fence_agent(), so the caller doesn't
have to handle any of that.  If we get an -EBADR when doing a ccs_get(), we try
to reinstantiate the connection and call ccs_get() again.  I have tested
this on my cluster and it works perfectly.

Comment 6 Josef Bacik 2006-12-14 18:25:27 UTC
fixing summary and component.

Comment 7 Josef Bacik 2006-12-14 18:29:23 UTC
Created attachment 143669 [details]
test src.rpm with the fix.

Please test this patch to make sure it works.  Just download this src.rpm and
run

rpmbuild --rebuild fence-1.32.25-1.bz219633.test.src.rpm

and then install the rpm that it creates and test it out to make sure it works
alright.

Comment 8 Marcos David 2006-12-14 18:38:05 UTC
I'm probably wrong, but won't this block if it can't connect to ccs without a
ccs_force_connect? It will sleep forever...
In that case, won't the if (cd < 0) check never be reached, because the
while loop keeps sleeping as long as cd < 0?


+	if (force)
+		cd = ccs_force_connect(NULL, 0);
+	else {
+		while ((cd = ccs_connect()) < 0)
+			sleep(1);
+	}
+
+	if (cd < 0) {
+		syslog(LOG_ERR, "cannot connect to ccs %d\n", cd);
+		return -1;
+	}

Comment 9 Marcos David 2006-12-14 18:48:17 UTC
rpmbuild --rebuild fence-1.32.25-1.bz219633.test.src.rpm
doesn't work:
error: Architecture is not included: i386


Comment 10 Josef Bacik 2006-12-14 20:13:20 UTC
Yeah, if you cannot connect to ccs then you will sleep forever.  That is left
over from the original ccs_connect() that was done in fence_victims(); I just
copied it over, why fix what's not broken?  During normal operations force will
be 0, so you will do that while loop, and if you cannot connect to ccs during
normal operations, you have much larger problems than not being able to
fence :).  The force option is provided simply for the fence_node facility,
which will force a connection to ccs even if the cluster is without quorum.

If you are on i686 you need to use

rpmbuild --rebuild fence-1.32.25-1.bz219633.test.src.rpm --target i686

in order to make it build.
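
The control flow discussed in comments 8 and 10 can be sketched as follows.
This is an illustration only: the ccs_* functions are local stubs rather than
libccs, nap() stands in for sleep(1) so the example runs instantly, fprintf
stands in for syslog, and connect_to_ccs() is a hypothetical name for the
connect step. The forced connect is modeled as a single attempt, which is an
assumption about how the non-blocking call behaves.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* ---- Stubs (assumptions, not the real library) ---- */
int failures_left = 3;      /* connect attempts the fake ccsd still refuses */
int naps = 0;               /* how many times we "slept" */

int ccs_connect(void)
{
    if (failures_left > 0) {
        failures_left--;
        return -ECONNREFUSED;
    }
    return 1;               /* a valid descriptor */
}

int ccs_force_connect(const char *cluster_name, int blocking)
{
    (void)cluster_name;
    (void)blocking;
    return ccs_connect();   /* modeled as one attempt, no retry */
}

void nap(void)              /* stands in for sleep(1) */
{
    naps++;
}

/* ---- Same shape as the hunk quoted in comment 8 ---- */
int connect_to_ccs(int force)
{
    int cd;

    if (force) {
        /* fence_node path: try once, even without quorum */
        cd = ccs_force_connect(NULL, 0);
    } else {
        /* normal fenced path: keep trying until ccsd answers */
        while ((cd = ccs_connect()) < 0)
            nap();
        assert(cd >= 0);    /* the loop only exits once connected */
    }

    /* Reachable with cd < 0 only on the force path. */
    if (cd < 0) {
        fprintf(stderr, "cannot connect to ccs %d\n", cd);
        return -1;
    }
    return cd;
}
```

With force set to 0 the function returns only once a connection exists, so the
error branch matters just for the forced, single-attempt case.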

Comment 11 Marcos David 2006-12-15 10:45:44 UTC
Ok, I've tested this and it seems to work, it no longer has the message:
"process_get: Invalid connection descriptor received."
repeating over and over, instead it just tries the next fence level.

Thanks for the help.

Comment 12 Chris Feist 2006-12-20 18:15:46 UTC
Patch looks good and I've committed it to CVS.  It will be included in the next
fence build (>=1.32.27).

Comment 19 Red Hat Bugzilla 2007-05-10 21:25:45 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0138.html


Comment 20 Issue Tracker 2007-06-15 14:05:03 UTC
Hello,

As per your previous comments, I'll go ahead and close this issue.

Best Regards,

Jon

Internal Status set to 'Resolved'
Status set to: Closed by Tech
Resolution set to: 'NotABug'

This event sent from IssueTracker by jfautley 
 issue 107933

