From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
netdump will stop sending logs to the server in about 50% of the cases. The attachment 'netdump' shows such a case. When it fails, the failure will always be at the same spot in the netdump. Customer did reproduce this under 2.4.9-e.16 as well. If it fails, the machine will also not reboot automagically.

Configuration of netdump was done step-by-step according to our whitepaper (http://www.redhat.com/support/wpapers/redhat/netdump/). Final testing was done on a fully updated system with crash.c from the netdump docs directory and a cross-over cable.

Version-Release number of selected component (if applicable):
netdump-0.6.6-1

How reproducible:
Sometimes

Steps to Reproduce:
1. boot
2. wait 120 secs
3. load crash.o

Actual Results:
They do not manage to get more than 2 successful consecutive netdump runs.

Expected Results:
Successful netdump in every case.

Additional info:
This is in Issue Tracker as Issue 21337.
Created attachment 91552
an aborted netdump

same behaviour under 2.4.9-e.16smp
Created attachment 91553
sysreport of the netdump server
Created attachment 91554
sysreport of the netdump client
Do we have error messages from the failed netdumps? Can we get a network trace of one that fails, just to see the last bit of communication? What does the netdump server look like from a VM standpoint at the time of failure?
In addition to Norm's questions... Create a /var/crash/scripts/netdump-reboot script (as well as the other three scripts if you like), taking the example script from the /usr/share/doc/netdump-server*/example_scripts directory on the server side. It will send a mail message to whomever you fill in when the server sends a reboot request. This will at least confirm that the server daemon requested that the client reboot itself. Also, on the server side, are there any messages in /var/log/messages that read: "Got too many timeouts waiting for SHOW_STATUS for client 0x########, rebooting it"
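One way to look for those messages on the server side is sketched below. The function name netdump_timeouts is made up for illustration and is not part of netdump; it only filters a syslog stream for the "Got too many timeouts" text quoted above.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with netdump): read a syslog stream on
# stdin and print only the lines where the netdump server gave up waiting.
netdump_timeouts() {
    grep -F 'Got too many timeouts waiting for'
}

# Example, run on the netdump server:
#   netdump_timeouts < /var/log/messages
```

The fixed-string match (-F) avoids having to escape anything in the message, at the cost of not distinguishing the SHOW_STATUS case from other timeout types.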
hi,

yes there are timeout messages:

---snip---
May 12 11:25:52 lnxc-323 netdump[1324]: Got too many timeouts waiting for memory page for client 0xc0a800cc, ignoring it
May 12 11:25:56 lnxc-323 netdump[1324]: Got too many timeouts waiting for SHOW_STATUS for client 0xc0a800cc, rebooting it
May 12 11:25:56 lnxc-323 netdump[1324]: Got unexpected packet type 3 from ip 0xc0a800cc
---snip---

and the server sends the reboot mails.

interesting: during and after a netdump turn, the computer feels very slowly. but i don`t know why because there is no network traffic.

tobias
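For reference, the client ID in these messages looks like the client's IPv4 address printed as a 32-bit hex number. A small sketch to decode it follows, assuming the most significant byte is the first octet; hex_to_ip is a hypothetical name, not a netdump tool.

```shell
#!/bin/bash
# Hypothetical helper: convert a 32-bit hex client ID, as printed in the
# netdump server log, to dotted-quad form.
# Assumption: the most significant byte is the first octet of the address.
hex_to_ip() {
    local n=$(( $1 ))
    printf '%d.%d.%d.%d\n' \
        $(( (n >> 24) & 255 )) $(( (n >> 16) & 255 )) \
        $(( (n >> 8) & 255 ))  $(( n & 255 ))
}

hex_to_ip 0xc0a800cc   # prints 192.168.0.204
```

Under that assumption the client in the log above is 192.168.0.204, a private address consistent with the cross-over cable setup.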
> interesting: during and after a netdump turn, the computer feels very slowly.
> but i don`t know why because there is no network traffic.

That is interesting -- it's possible that the slow-down is due to the "kswapd" issue, which shows up as soon as the pagecache consumes all available memory. I see that the netdump client is running e.12 or e.16, but what version of the kernel is running on the netdump server machine? The server should be running at least the QU1 errata kernel 2.4.9-e.12. With that kernel (or any later version), the third value in /proc/sys/vm/pagecache should be lowered from 90% to at most 75%, and lower than that if necessary:

  # cat /proc/sys/vm/pagecache
  2 50 90
  # echo 2 50 75 > /proc/sys/vm/pagecache
  # cat /proc/sys/vm/pagecache
  2 50 75
  #

That percentage limits the amount of memory used by the pagecache; when the limit is reached, the kernel starts stealing memory back from the pagecache. When writing a huge sequential file such as a crash dump, it doesn't make sense to cache all of its pages in the pagecache.
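The adjustment can also be scripted so that the first two values are preserved and only the third is lowered. This is a rough sketch: cap_pagecache_max is a hypothetical helper name, and it assumes the file always holds exactly three integers as in the output shown above.

```shell
#!/bin/bash
# Hypothetical helper: take the three pagecache values plus a cap and print
# the values back with the third one lowered to the cap if it exceeds it.
# Assumption: /proc/sys/vm/pagecache holds exactly three integers.
cap_pagecache_max() {
    local first=$1 second=$2 third=$3 cap=$4
    if [ "$third" -gt "$cap" ]; then
        third=$cap
    fi
    echo "$first $second $third"
}

# Example, as root on the netdump server:
#   new=$(cap_pagecache_max $(cat /proc/sys/vm/pagecache) 75)
#   echo "$new" > /proc/sys/vm/pagecache
```

A value that is already at or below the cap is left untouched, so the helper is safe to run repeatedly.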
hi,

kernel version is:

---snip---
uname -a
Linux lnxc-323 2.4.9-e.16smp #1 SMP Mon Mar 17 16:55:45 EST 2003 i686 unknown
echo 2 50 75 > /proc/sys/vm/pagecache
---snip---

now, the server feels faster. but the netdump client doesn`t reboot. here are the last log messages:

---snip---
CPU#1 is frozen.
< netdump activated - performing handshake with the client. >
Process: 1260, { insmod}
Kernel 2.4.9-e.16smp
EIP: 0010:[<f8ab7076>] CPU: 0
EIP is at init_module [crash] 0x16
EFLAGS: 00010282 Tainted: PF
EAX: 00000013 EBX: f8ab7000 ECX: 00000000 EDX: f541e000
ESI: 00000000 EDI: 00000000 EBP: f4e7df28 DS: 0018 ES: 0018
CR0: 8005003b CR2: 00000000 CR3: 34e69000 CR4: 000006d0
Call Trace: [<c011d735>] sys_init_module [kernel] 0x555
[<f8ab7060>] init_module [crash] 0x0
[<c01072e3>] system_call [kernel] 0x33

                         free                        sibling
  task             PC    stack   pid father child younger older
init S 00000000 2836 1 0 1243 4 (NOTLB)
Call Trace: [<c0125214>] schedule_timeout [kernel] 0x84
[<c0125180>] process_timeout [kernel] 0x0
[<c015636e>] do_select [kernel] 0x20e
[<c0156719>] sys_select [kernel] 0x339
[<c01072e3>] system_call [kernel] 0x33
keventd S 00000000 5992 2 1 3 (L-TLB)
Call Trace: [<c01296ec>] context_thread [kernel] 0x13c
[<c0105000>] stext [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c0105836>] arch_kernel_thread [kernel] 0x26
[<c01295b0>] context_thread [kernel] 0x0
keventd S C323A000 6268 3 1 11 2 (L-TLB)
Call Trace: [<c01296ec>] context_thread [kernel] 0x13c
[<c0107296>] ret_from_fork [kernel] 0x6
[<c0105000>] stext [kernel] 0x0
---snip---

regards,
tobias
Unfortunately, I was hoping that the server was actually timing out on itself due to the pagecache issue; we've had customer reports where dropping the percentage also solved the netdump timeout issue.

The trace showing the exception in the init_module() function is exactly what is expected when using that crash.o module. The crash.o init function writes to address 0, which causes the fault, which dumps the trace leading up to that point, and then starts the netdump operation. So nothing is unusual about the tracebacks.

One thing you might also try is to set the IDLETIMEOUT parameter in /etc/sysconfig/netdump on the client. By default it is not set. If set to a value (in seconds), the client will reboot itself if it receives no requests from the netdump server for that many seconds. Try setting it to 10, for example. This will at least indicate whether the netdump client is still running and waiting for commands, or whether it is hung somewhere.
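As an illustration of the IDLETIMEOUT setting, the client-side fragment would look roughly like the following. It assumes the file uses the usual shell-style VAR=value syntax of other /etc/sysconfig files; whether the netdump client service has to be restarted for the change to take effect is not covered here.

```shell
# /etc/sysconfig/netdump on the client (fragment)
# Assumption: shell-style VAR=value syntax, as in other sysconfig files.
# Reboot the client if no request arrives from the netdump server
# for this many seconds.
IDLETIMEOUT=10
```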
hi, the same problem with IDLETIMEOUT parameter :-( tobias
Given that the client doesn't reboot itself w/IDLETIMEOUT set, and that the server times out and sends a reboot request which is ignored, the client netdump code is wedged someplace. Unfortunately, that being the case, I'm at a loss to come up with any more suggestions as to how to determine the problem remotely without being able to reproduce it in-house. It will require hacking the netconsole module code (and possibly kernel code) to determine where it's blocking.
hi, sorry but we are working on other issues at the moment. we start the next tests on friday. tobias
from Issue Tracker #22148 [start] hi, here are some vmstat, readprofile and top outputs. the kernel version is 2.4.9-e.24. tobias File uploaded: debug.tar [end] attachment coming as 90435-debug.tar
Created attachment 92113
some vmstat, readprofile and top outputs under 2.4.9-e.24

https://enterprise.redhat.com/portal/?module=download&fid=4028 in Issue Tracker
netdump is known to have issues with watchdogs such as nmi_watchdog (which needs to be enabled for profiling). This seems to be at least contributing to the inability to capture a complete netdump.
2.4.9-e.29 does have the patch that allows netdump and nmi_watchdog to be used together. It was applied sometime *after* 2.4.9-e.25, but I don't know exactly when. In any case, if the netdump client hangs, the nmi_watchdog should fire, dump a stack trace on the console, and shut down the system.