Description of problem:
I have a Sun Fire x2100 (x86_64, RHEL4 U2) that keeps crashing while running IPVS (see bz #176939). Anyway, I have netdump set up, but it fails to work. The netdump server (x86, RHEL4 U2) reports this error:

Jan 26 09:05:02 xxxx netdump[2504]: Got too many timeouts in handshaking, ignoring client xxx.xxx.xxx.xxx
Jan 26 09:05:05 xxx netdump[2504]: Got too many timeouts waiting for SHOW_STATUS for client xxx.xxx.xxx.xxx, rebooting it

The client never reboots. Note that the client is running the U3 beta kernel (2.6.9-27.EL). Under the RHEL4 U2 kernel, netdump wouldn't even get this far; the machine would just hard lock while netdump was making a connection to the server.

Version-Release number of selected component (if applicable):
netdump-0.7.7-3 is running on the netdump server.

How reproducible:
Always

Steps to Reproduce:
1. Wait for the Sun Fire to hard lock
2. Watch the netdump server emit those errors without end (1+ days, if permitted to go that far)

Actual results:
The client machine never reboots after crashing.

Expected results:
netdump should finish and the client machine should reboot! (Actually, the machine shouldn't lock up in the first place! :) )
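For reference, here is roughly how the two boxes are set up. This is only a sketch assuming the stock RHEL4 netdump and netdump-server packages; the exact file names and variables may differ between versions:

  # --- netdump client (the Sun Fire x2100) ---
  # Point the client at the netdump server; NETDUMPADDR is the server's IP.
  echo 'NETDUMPADDR=xxx.xxx.xxx.xxx' >> /etc/sysconfig/netdump
  chkconfig netdump on
  service netdump start          # brings up the netdump/netconsole client side

  # --- netdump server (x86 RHEL4 U2) ---
  chkconfig netdump-server on
  service netdump-server start   # listens for client handshakes and writes
                                 # vmcores (under /var/crash by default)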
Please answer all questions from: http://people.redhat.com/jmoyer/netdump
Created attachment 123724 [details] answers to questions
Just to follow up: this machine crashed on 13 February and netdump worked. It crashed again about an hour ago, and I am getting the same handshake errors.
Machine crashed this evening and netdump failed to work. Strangely, I did not get errors on the server either.
Same thing happened on 7 May. Any ideas? thanks!
A "me too": we saw the same behavior today on a Sun v40z server running kernel 2.6.9-22.0.1 (x86_64). The netdump server logged the "Got too many timeouts in handshaking, ignoring client / Got too many timeouts waiting for SHOW_STATUS for client, rebooting it" messages to our syslog server, and the Sun v40z netdump client just hung indefinitely until we power-cycled it.

This server has crashed once (without netdump enabled) and hung twice (with netdump enabled) in the past few weeks, and has never produced netdump output at those times. I'm assuming that the hangs just represent the system trying to send netdump output and failing, so possibly all three events had the same underlying cause and the system crashed in the first case because it didn't have a netdump server configured yet.

The server in question *did* successfully log one kernel bug to its netdump server a few days ago, as I reported (and you can see) in bug 193275. Although that kernel bug is supposed to be fatal, the server continued running for the next 5 days. We've installed the RHEL4 update 3 kernel (2.6.9-34) on this system and may try a RHEL4 update 4 beta kernel (2.6.9-37) as well, so we'll see if either of those fixes the netdump problem.
Created attachment 130320 [details] Responses to the netdump question list
Created attachment 133413 [details] My own answers to the same questions.
I encountered pretty much the same issue: a Sun v20z running x86_64 RHEL4, netdumping to another machine running i386 RHEL3. The details are in the above attachment.
Created attachment 133720 [details] Customer's answers.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
No, there are two separate problems going on here. I think Daryl has either a configuration issue or a problem with the forcedeth driver, given his answers to the questions. I would suggest that he try the latest 4.6 kernel, and start by verifying that it can capture a sysrq-c driven crash. My guess is that forcedeth has a printk in it that causes a recursive message-send operation which winds up deadlocking the system. For the other 4 crashes, they all have mvfs installed and they all appear to be the same crash. I can't tell for certain what's causing it, but I would bet that both the crash and the lockup will go away if you remove the mvfs modules. If possible, please hook serial consoles up to the systems in question and, when a hung netdump occurs, try to get a sysrq-t out of the system. That will help us confirm whether mvfs is the culprit here.
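For the sysrq-c verification, something along these lines on the client should do it (just a sketch; assumes the magic SysRq interface is available in the kernel):

  # enable the magic SysRq interface for this boot
  echo 1 > /proc/sys/kernel/sysrq

  # make sure the netdump client service is up and talking to the server
  service netdump status

  # deliberately crash the kernel; a working netdump setup should stream
  # the vmcore to the server and then reboot this machine
  echo c > /proc/sysrq-trigger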
Hi,

Sorry, I have downgraded to RHEL3 to keep IPVS from crashing every other week, so I don't have a reproducer for you now. I did last year :)

daryl
ok, then setting to needinfo on the next reporter in the list
I was able to force a crash (via echo c > /proc/sysrq-trigger) on a Sun v40z running kernel 2.6.9-55.0.2.ELsmp with netdump-0.7.16-10.x86_64. Not sure if that helps.
yes, thank you, that suggests to me that your netdump hang is related to the crash you are getting. What version of mvfs are you running (you may need to update)? Also, can you hook up a serial console and extract a sysrq-t from the crash when/if it hangs during netdump?
I've never heard of mvfs. Looks like it has something to do with ClearCase? In any case, we don't use it (and never have). I only saw netdump hang once with the "Got too many timeouts" error, as I reported 15 months ago. If this happens again I'll try to get the info you've requested, but I wouldn't hold my breath.
my bad, I had you mixed up with one of the other attachments. From you, then, it would be good to capture the following:

1) The serial console contents during a crash that hangs
2) If possible, a tcpdump taken during the hung crash
3) During the hang, the results of a sysrq-t and a sysrq-m issued through the serial console

thanks!
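For item 2, a capture on the netdump server along these lines should be enough (a sketch; I'm assuming the default netdump UDP port of 6666, so adjust the interface, port, and client address to match your setup):

  tcpdump -n -i eth0 -s 0 -w netdump-hang.pcap udp port 6666 and host <client-ip>

For item 3, with the serial console attached, the sysrq can usually be issued by sending a serial BREAK followed by the key (for example Ctrl-A F and then 't' or 'm' in minicom), since the keyboard is useless once the box has hung.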
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.
closing due to inactivity