Description of problem:
I have a Sun Fire x2100 (x86_64, RHEL4 U2) that keeps crashing while running IPVS (see bz #176939). Anyway, I have netdump set up, but it fails to work. The netdump server (x86, RHEL4 U2) reports this error:

Jan 26 09:05:02 xxxx netdump[2504]: Got too many timeouts in handshaking, ignoring client xxx.xxx.xxx.xxx
Jan 26 09:05:05 xxx netdump[2504]: Got too many timeouts waiting for SHOW_STATUS for client xxx.xxx.xxx.xxx, rebooting it

The client never reboots. Note that the client is running the U3 beta kernel (2.6.9-27.EL). Under the RHEL4 U2 kernel, netdump wouldn't even get this far; the machine would just hard lock while netdump was making a connection to the server.

Version-Release number of selected component (if applicable):
netdump-0.7.7-3 is running on the netdump server.

How reproducible:
Always

Steps to Reproduce:
1. Wait for the Sun Fire to hard lock
2. Watch the netdump server emit those errors without end (1+ days, if permitted to go that far)

Actual results:
The client machine never reboots after crashing.

Expected results:
netdump should finish and the client machine should reboot! (Actually, the machine shouldn't lock up in the first place! :) )
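For reference, here is roughly how the two boxes are set up. This is only a sketch assuming the stock RHEL4 netdump and netdump-server packages; the exact file names and variables may differ between versions:

  # --- netdump client (the Sun Fire x2100) ---
  # Point the client at the netdump server; NETDUMPADDR is the server's IP.
  echo 'NETDUMPADDR=xxx.xxx.xxx.xxx' >> /etc/sysconfig/netdump
  chkconfig netdump on
  service netdump start          # brings up the netdump/netconsole client side

  # --- netdump server (x86 RHEL4 U2) ---
  chkconfig netdump-server on
  service netdump-server start   # listens for client handshakes and writes
                                 # vmcores (under /var/crash by default)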
Please answer all questions from: http://people.redhat.com/jmoyer/netdump
Created attachment 123724 [details] answers to questions
Just to follow up: this machine crashed on 13 February and netdump worked. It crashed again about an hour ago, and I am getting the same handshake errors.
Machine crashed this evening and netdump failed to work. Strangely, I did not get errors on the server either.
Same thing happened on 7 May. Any ideas? thanks!
A "me too": we saw the same behavior today on a Sun v40z server running kernel 2.6.9-22.0.1 (x86_64). The netdump server logged the "Got too many timeouts in handshaking, ignoring client / Got too many timeouts waiting for SHOW_STATUS for client, rebooting it" messages to our syslog server, and the Sun v40z netdump client just hung indefinitely until we power-cycled it.

This server has crashed once (without netdump enabled) and hung twice (with netdump enabled) in the past few weeks, and has never produced netdump output at those times. I'm assuming that the hangs just represent the system trying to send netdump output and failing, so possibly all three events had the same underlying cause and the system crashed in the first case because it didn't have a netdump server configured yet.

The server in question *did* successfully log one kernel bug to its netdump server a few days ago, as I reported (and you can see) in bug 193275. Although that kernel bug is supposed to be fatal, the server continued running for the next 5 days. We've installed the RHEL4 update 3 kernel (2.6.9-34) on this system and may try a RHEL4 update 4 beta kernel (2.6.9-37) as well, so we'll see if either of those fixes the netdump problem.
Created attachment 130320 [details] Responses to the netdump question list
Created attachment 133413 [details] My own answers to the same questions.
I encountered pretty much the same issue: a Sun v20z running x86_64 RHEL4, netdumping to another machine running i386 RHEL3. The details are in the above attachment.
Created attachment 133720 [details] Customer's answers.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
No, there are two separate problems going on here. I think Daryl has either a configuration issue or a problem with the forcedeth driver, given his answers to the questions. I would suggest that he try the latest 4.6 kernel, and start by verifying that it can capture a sysrq-c driven crash. My guess is that forcedeth has a printk in it that causes a recursive message-send operation which winds up deadlocking the system. For the other 4 crashes, they all have mvfs installed and they all appear to be the same crash. I can't tell for certain what's causing it, but I would bet that both the crash and the lockup will go away if you remove the mvfs modules. If possible, please hook serial consoles up to the systems in question and, when a hung netdump occurs, try to get a sysrq-t out of the system. That will help us confirm whether mvfs is the culprit here.
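For the sysrq-c verification, something along these lines on the client should do it (just a sketch; assumes the magic SysRq interface is available in the kernel):

  # enable the magic SysRq interface for this boot
  echo 1 > /proc/sys/kernel/sysrq

  # make sure the netdump client service is up and talking to the server
  service netdump status

  # deliberately crash the kernel; a working netdump setup should stream
  # the vmcore to the server and then reboot this machine
  echo c > /proc/sysrq-trigger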
Hi,

Sorry, I have downgraded to RHEL3 to keep IPVS from crashing every other week, so I don't have a reproducer for you now. I did last year :)

daryl
ok, then setting to needinfo on the next reporter in the list
I was able to force a crash (via echo c > /proc/sysrq-trigger) on a Sun v40z running kernel 2.6.9-55.0.2.ELsmp with netdump-0.7.16-10.x86_64. Not sure if that helps.
yes, thank you, that suggests to me that your netdump hang is related to the crash you are getting. What version of mvfs are you running (you may need to update)? Also, can you hook up a serial console and extract a sysrq-t from the crash when/if it hangs during netdump?
I've never heard of mvfs. Looks like it has something to do with ClearCase? In any case, we don't use it (and never have). I only saw netdump hang once with the "Got too many timeouts" error, as I reported 15 months ago. If this happens again I'll try to get the info you've requested, but I wouldn't hold my breath.
my bad, I had you mixed up with one of the other attachments. From you, then, it would be good to capture the following:

1) The serial console contents during a crash that hangs
2) If possible, a tcpdump taken during the hung crash
3) During the hang, the results of a sysrq-t and a sysrq-m issued through the serial console

thanks!
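For item 2, a capture on the netdump server along these lines should be enough (a sketch; I'm assuming the default netdump UDP port of 6666, so adjust the interface, port, and client address to match your setup):

  tcpdump -n -i eth0 -s 0 -w netdump-hang.pcap udp port 6666 and host <client-ip>

For item 3, with the serial console attached, the sysrq can usually be issued by sending a serial BREAK followed by the key (for example Ctrl-A F and then 't' or 'm' in minicom), since the keyboard is useless once the box has hung.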
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.
closing due to inactivity