Red Hat Bugzilla – Bug 179016
Got too many timeouts in handshaking
Last modified: 2010-10-21 23:59:22 EDT
Description of problem:
I have a Sun Fire x2100 (x86_64 rhel4u2) that keeps crashing while running IPVS
(see bz #176939). Anyway, I have netdump set up, but it fails to work. The
netdump server (x86 rhel4u2) will report this error:
Jan 26 09:05:02 xxxx netdump: Got too many timeouts in handshaking,
ignoring client xxx.xxx.xxx.xxx
Jan 26 09:05:05 xxx netdump: Got too many timeouts waiting for SHOW_STATUS
for client xxx.xxx.xxx.xxx, rebooting it
The client will never reboot. Note that the client is running the u3 beta
kernel (2.6.9-27.EL). Under the rhel4u2 kernel, netdump wouldn't even get this
far. The machine would just hard lock while netdump was making a connection to
the server.
Version-Release number of selected component (if applicable):
netdump-0.7.7-3 is running on the netdump server.
Steps to Reproduce:
1. Wait for the Sun Fire to hard lock
2. Watch the netdump server emit those errors without end (1+ days, if
permitted to go that far)
Actual results:
client machine never reboots after crashing.
Expected results:
netdump should finish and the client machine should be rebooted!
(Actually, the machine shouldn't lock up in the first place! :) )
Please answer all questions from:
Created attachment 123724 [details]
answers to questions
Just to follow up, this machine crashed on 13 February and netdump worked.
It crashed an hour ago just now, and I am getting the same handshake errors.
Machine crashed this evening and netdump failed to work. Strangely, I did not
get errors on the server either.
Same thing happened on 7 May. Any ideas? thanks!
A "me too": we saw the same behavior today on a Sun v40z server running kernel
2.6.9-22.0.1 (x86_64). The netdump server logged the "Got too many timeouts in
handshaking, ignoring client / Got too many timeouts waiting for SHOW_STATUS for
client, rebooting it" messages to our syslog server, and the Sun v40z netdump
client just hung indefinitely until we power-cycled it.
This server has crashed once (without netdump enabled) and hung twice (with
netdump enabled) in the past few weeks, and has never produced netdump output at
those times. I'm assuming that the hangs just represent the system trying to
send netdump output and failing, so possibly all three events had the same
underlying cause and the system crashed in the first case because it didn't have
a netdump server configured yet.
The server in question *did* successfully log one kernel bug to its netdump
server a few days ago, as I reported (and you can see) in bug 193275. Although
that kernel bug is supposed to be fatal, the server continued running for the
next 5 days.
We've installed the RHEL4 update 3 kernel (2.6.9-34) on this system and may be
trying a RHEL4 update 4 beta kernel (2.6.9-37) as well, so we'll see if either
of those fixes the netdump problem.
Created attachment 130320 [details]
Responses to the netdump question list
Created attachment 133413 [details]
My own answers to the same questions.
I encountered pretty much the same issues: a Sun v20z running x86_64 RHEL4
doing netdump to another machine running i386 RHEL3. The details are in
the above attachment.
Created attachment 133720 [details]
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
No, there are two separate problems going on here. Daryl, I think, has either
a configuration issue or a problem with the forcedeth driver, given his answers
to the questions. I would suggest that he try the latest 4.6
kernel, and start by verifying that it can capture a sysrq-c driven crash. My
guess is that forcedeth likely has a printk in it that causes a recursive
message send operation that winds up deadlocking the system.
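That verification step can be sketched as a small pre-check. This is a hypothetical helper, not part of netdump itself: it only inspects whether the magic sysrq key is enabled and prints the destructive trigger command, which must be issued by hand as root on a client you can afford to crash.

```shell
#!/bin/sh
# Hypothetical pre-check before a sysrq-c test crash on a netdump client.
# Safe to run: it only reads state and PRINTS the destructive step.

SYSRQ_FILE=/proc/sys/kernel/sysrq
TRIGGER="echo c > /proc/sysrq-trigger"

if [ -r "$SYSRQ_FILE" ] && [ "$(cat "$SYSRQ_FILE")" != "0" ]; then
    STATUS="sysrq already enabled"
else
    STATUS="sysrq disabled; enable with: sysctl -w kernel.sysrq=1"
fi

echo "$STATUS"
# Once the netdump service on the client is confirmed up, run this by
# hand as root; a working setup should stream a vmcore to the server
# and then reboot the client:
echo "to force a test crash: $TRIGGER"
```

If the vmcore never arrives for a sysrq-c crash, the problem is in the netdump path (driver or configuration) rather than in the original IPVS crash.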
For the other 4 crashes, they all have mvfs installed and they all seem to be
the same crash. I can't tell for certain what's causing that, but I would bet
that both the crash and the lockup will go away if you remove the mvfs modules.
If possible I would ask that you hook serial consoles to the systems in
question and, when a hung netdump occurs, try to get a sysrq-t out of the system.
That will help us confirm if mvfs is the culprit here.
Sorry, I have downgraded to RHEL3 to keep IPVS from crashing every other week.
I don't have a reproducer for you now. I did last year :)
OK, then setting NEEDINFO on the next reporter in the list.
I was able to force a crash (via echo c > /proc/sysrq-trigger) on a Sun v40z
running kernel 2.6.9-55.0.2.ELsmp with netdump-0.7.16-10.x86_64. Not sure if
Yes, thank you; that suggests to me that your netdump hang is related to the
crash you are getting. What version of mvfs are you on? (You may need to
update.) Also, can you hook up a serial console and extract a sysrq-t from the
crash when/if it hangs during netdump?
I've never heard of mvfs. Looks like it has something to do with ClearCase? In
any case, we don't use it (and never have).
I only saw netdump hang once with the "Got too many timeouts" error, as I
reported 15 months ago. If this happens again I'll try to get the info you've
requested, but I wouldn't hold your breath.
My bad, I had you mixed up with one of the other attachments. From you, then,
it would be good to capture the following:
1) Serial console contents during a crash that hangs
2) If possible, a tcpdump during the hung crash
3) During the hang, the results of a sysrq-t and a sysrq-m issued through the
serial console
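A sketch of what that capture plan might look like from the netdump server side. The client address here is a placeholder, and UDP port 6666 is an assumption (the usual netdump default) worth checking against the client's netdump configuration; the script only prints the suggested commands.

```shell
#!/bin/sh
# Hypothetical capture plan for a hung netdump session, run on the
# netdump SERVER. Client address and port are placeholders.

CLIENT=192.0.2.10   # hypothetical address of the hung client
PORT=6666           # assumed default netdump UDP port; verify locally
CAPTURE="tcpdump -n -w netdump-hang.pcap udp port $PORT and host $CLIENT"

echo "1) leave a logger attached to the client's serial console"
echo "2) on the server, run: $CAPTURE"
echo "3) on the client's serial console, send Alt-SysRq-t and Alt-SysRq-m"
```

The pcap shows whether the client is still answering the server's handshake packets at all, which distinguishes a dead network path from a deadlocked netdump client.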
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time. This request will be
reviewed for a future Red Hat Enterprise Linux release.
Closing due to inactivity.