From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030922

Description of problem:

Context

The system is configured to use both netdump and netconsole.
The command "service netdump start" returns OK.
syslog on the netdump server shows the appropriate message, like "[ ... network console startup ... ]".
When crash.o is loaded, the netconsole service successfully logs the Oops message to the netdump server.
The netdump-server service is started on the netdump server.
The system is being investigated for occasional system hangs.
The system is a node of an eight-node 9i RAC cluster.
The kernel is NOT tainted.
OCFS is not being used.

Problem

The memory dump does not occur.
The system console does not give the expected indication of a netdump starting.

This behaviour was consistent with the following configurations:

NIC     Kernel
e1000   e.27
e1000   e.35
tg3     e.27
e100    e.27

The system in question is a Dell 6650 with quad Xeon and 8GB RAM.
The system has multiple NICs as follows: Broadcom Gigabit onboard x2, e100 x1, e1000 x2.
A total of three or four interfaces were active at the time of the crash.

Version-Release number of selected component (if applicable):
netdump-0.6.11-2, kernel-2.4.9-e.27

How reproducible:
Always

Steps to Reproduce:
1. insmod crash.o

Actual Results: no dump

Expected Results: there should be a dump

Additional info:
Please post any messages posted by the netdump-server in the server's /var/log/messages file after the oops message was written there. Error messages from the netdump-server are the key element in debugging this issue. When you say "system configured to use both netdump and netconsole", do you mean "netdump and syslog"? Can you post your /etc/sysconfig/netdump file? Also, when you say "The system console does not give the expected indication of a netdump starting.", exactly what is indicated?
/etc/sysconfig/netdump
---------------------------------------------------------------
# You can also set both. # # LOCALPORT=6666 # DEV=## # NET DUMP SERVER ( lxrptrapp001) # NETDUMPADDR=141.168.133.132 # # # NETDUMPPORT=6666 # NETDUMPMACADDR= # IDLETIMEOUT= # # If you want the console log (not crash dumps) sent via the # syslog service, set SYSLOGADDR to the IP of the syslog server. # The other two values normally remain unchanged. # # # NET DUMP SERVER ( lxrptrapp001 ) # SYSLOGADDR=141.168.133.132 # # # SYSLOGPORT=514 # SYSLOGMACADDR=
==============================================================

In response to DA's questions:

* I will check again on the log messages on the netdump-server host, but from memory it showed only the Oops message.
* Yes, by "netconsole" I mean sending the oops message to syslog.
* The console of the client shows the oops message and nothing else. From other tests of netdump I expect a netdump message to say the dump is starting, and then another when the dump is finished, before the reboot.
Created attachment 97549: oops message shown in netdump server syslog
I had suggested on tech-list that you set up the scripts on the server to send email notification upon receipt of a crash request. If you haven't done that, *please* do that. If you don't get the email message generated from the /var/crash/scripts/netdump-crash script, and there are no error messages from the netdump-server daemon in the /var/log/messages file on the server, then the netdump procedure wasn't even initiated.

When the system crashed, on the client console you should have seen something like this after the oops message (which occurred on cpu3):

CPU0 frozen
CPU1 frozen
CPU2 frozen
< netdump activated - performing handshake with the client. >
< handshake completed - listening for dump requests. >
...

From your description, it sounds like you're not seeing any of the above messages on the client console?

Also, you mention above that this all came about after the system was experiencing occasional system hangs. Was the system hung when you did the insmod? The reason I ask is that there is a known deficiency with netdump: if any of the non-panicking cpus are blocked somewhere with interrupts disabled, then you will see zero or more (but not all) "frozen" messages, and nothing after that. This is because the netdump module uses smp_call_function() to send IPIs to the other CPUs asking them to freeze themselves and print their "frozen" messages -- and smp_call_function() will not return if it does not receive an indication that each of the CPUs has received the IPI. It will hang waiting for a response forever, no further messages will be seen on the console, and the module therefore never initiates the handshake protocol with the server (which, BTW, is what causes the netdump-crash script to send the email message -- that's why that's also a critical piece of debug info).
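To make the hang described above concrete, here is a minimal user-space model of a "wait until every other CPU acknowledges" step. It is not kernel code and is not taken from the netdump module: `wait_for_acks`, `enum wait_result` and the `max_spins` bound are hypothetical names introduced only for illustration. The real wait, as described above, has no bound, which is why a single CPU stuck with interrupts disabled stalls everything after the "frozen" messages.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
enum wait_result { ALL_ACKED, TIMED_OUT };

/*
 * responsive[i] is true if simulated CPU i can take the IPI, i.e. it is
 * not blocked with interrupts disabled. A responsive CPU acknowledges
 * (and, in the real system, would print its "frozen" message).
 *
 * max_spins is an illustrative bound on the wait. The behaviour described
 * in the comment above corresponds to waiting without any bound.
 */
enum wait_result wait_for_acks(const bool *responsive, size_t ncpus,
                               unsigned long max_spins, size_t *acked_out)
{
    size_t acked = 0;

    /* Count the CPUs that are able to acknowledge the request. */
    for (size_t i = 0; i < ncpus; i++) {
        if (responsive[i])
            acked++;
    }
    if (acked_out != NULL)
        *acked_out = acked;

    /* Spin until every CPU has acknowledged or the bound is exhausted. */
    for (unsigned long spin = 0; spin < max_spins; spin++) {
        if (acked == ncpus)
            return ALL_ACKED;
    }
    return (acked == ncpus) ? ALL_ACKED : TIMED_OUT;
}
```

With three simulated CPUs of which one is unresponsive, the function reports two acknowledgements and `TIMED_OUT`. This matches the symptom of seeing some but not all "frozen" lines and then nothing further.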
I have a similar case with what appears to be identical hardware and configuration:

Hardware: Dell 6650
kernel: 2.4.9-e.38
NIC: tg3

The netdump server is back to back with the netdump client via a crossover cable. A console is also attached.

Here is the information gathered from the netdump server:

CPU#6 is frozen.
CPU#1 is frozen.
CPU#0 is frozen.
CPU#2 is frozen.
CPU#3 is frozen.
< netdump activated - performing handshake with the client. >
NETDUMP START!
< handshake completed - listening for dump requests. >
Uhhuh. NMI received for unknown reason 21.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
<0>CPU 7: Machine Check Exception: 0000000000000004
CPU 6: Machine Check Exception: 0000000000000004
CPU 0: Machine Check Exception: 0000000000000004
CPU 2: Machine Check Exception: 0000000000000004
Kernel panic: Unable to continue
------------[ cut here ]------------
kernel BUG at panic.c:58!
CPU 1: Machine Check Exception: 0000000000000004
CPU 3: Machine Check Exception: 0000000000000004
Bank 0: 8c00020020140146<0>Kernel panic: Unable to continue
Bank 0: 8c00020020140146<0>Kernel panic: Unable to continue
------------[ cut here ]------------
kernel BUG at panic.c:58!
[0036c0000050c000]<1>------------[ cut here ]------------
kernel BUG at panic.c:58!
 at 0c00020020140146
[0036c0000050c000]<0>Kernel panic: Unable to continue
 at 0c00020020140146
------------[ cut here ]------------
kernel BUG at panic.c:58!
Kernel panic: Unable to continue
------------[ cut here ]------------
kernel BUG at panic.c:58!
Kernel panic: Unable to continue
------------[ cut here ]------------
kernel BUG at panic.c:58!

The panic netdump has now panicked again.......

messages file....
Mar 25 00:49:20 beantown netdump[1517]: Got too many timeouts waiting for memory page for client 0x02020202, ignoring it
Mar 25 00:49:23 beantown netdump[1517]: Got too many timeouts waiting for SHOW_STATUS for client 0x02020202, rebooting it

[root@beantown 2.2.2.2-2004-03-25-00:41]# pwd
/var/crash/2.2.2.2-2004-03-25-00:41
[root@beantown 2.2.2.2-2004-03-25-00:41]# ls -al
total 4146172
drwx------    2 netdump  netdump        4096 Mar 25 00:41 .
drwxr-xr-x   18 netdump  netdump        4096 Mar 25 00:41 ..
-rw-------    1 netdump  netdump        1099 Mar 25 00:41 log
-rw-------    1 netdump  netdump  4241518592 Mar 25 00:48 vmcore-incomplete
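As a rough gauge of how far the dump got before the timeouts, the size of vmcore-incomplete in the listing above can be converted to pages. This is only a sketch under two assumptions that the thread does not confirm: a 4 KiB page size (usual for i686) and no allowance for any header in the file. The helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, as is usual on i686. */
#define ASSUMED_PAGE_SIZE 4096ULL

/* Number of whole pages that fit in a file of the given size. */
uint64_t bytes_to_whole_pages(uint64_t bytes)
{
    return bytes / ASSUMED_PAGE_SIZE;
}

/* File size expressed in GiB (2^30 bytes). */
double bytes_to_gib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}
```

For the 4241518592-byte file this gives 1035527 whole pages, a little under 4 GiB. That figure only bounds how much memory had been transferred when the server gave up on the client; it says nothing about why the page requests timed out.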
Roger, Do you have the nmi watchdog enabled? If so, please disable it and run your test again. Older versions of the netdump module did not interoperate with the nmi_watchdog. If you don't have the nmi watchdog enabled, then this points to other hardware problems. Most common are memory errors, so you may consider testing your memory.
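One way to check the question above is to look for an `nmi_watchdog=` boot parameter on the client's kernel command line (for example, the contents of /proc/cmdline). The sketch below assumes the watchdog was enabled that way; the function name is hypothetical, and it returns -1 when the parameter is absent rather than guessing the kernel's default.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the integer value of "nmi_watchdog=" found
 * in a kernel command line string, or -1 if the parameter is not present.
 * Only matches at the start of the string or after a space are accepted,
 * so a longer parameter name ending in "nmi_watchdog=" is ignored.
 */
int nmi_watchdog_from_cmdline(const char *cmdline)
{
    static const char key[] = "nmi_watchdog=";
    const size_t key_len = sizeof(key) - 1;
    const char *p = cmdline;

    while ((p = strstr(p, key)) != NULL) {
        if (p == cmdline || p[-1] == ' ')
            return atoi(p + key_len);
        p += key_len; /* skip this partial match and keep searching */
    }
    return -1;
}
```

A non-zero result suggests the watchdog was requested at boot; removing the parameter (or setting it to 0) and rebooting would be the way to rerun the test as requested.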
Actually I believe this problem was most likely fixed in AS2.1 U5 with this patch to send_netdump_mem():

> This patch is a backport from RHEL3, which simply verifies all
> requested pages with page_is_ram() before kmap'ing them.
> Without it, a non-existent page or other sensitive non-RAM
> memory location could be mapped, accessed, possibly causing
> a machine check. (IT #36437)

--- linux/drivers/net/netconsole.c.orig Tue Jun 1 13:45:22 2004
+++ linux/drivers/net/netconsole.c Tue Jun 1 13:47:09 2004
@@ -742,9 +742,11 @@
 		send_netdump_skb(dev, str, strlen(str), &reply);
 		return;
 	}
-	page = mem_map + nr;
-//	if (PageReserved(page))
-//		page = ZERO_PAGE(0);
+
+	if (page_is_ram(nr))
+		page = mem_map + nr;
+	else
+		page = ZERO_PAGE(0);
 
 	kaddr = (char *)kmap_atomic(page, KM_NETDUMP);
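The effect of the patch can be illustrated outside the kernel with a small model: a requested page frame number is only used as-is when it falls inside a known RAM range, and is otherwise replaced by a stand-in for the zero page. Everything here is hypothetical (`struct ram_range`, `pfn_is_ram`, `select_dump_pfn`, `ZERO_PAGE_SENTINEL`); the real page_is_ram() consults the kernel's own memory map, which this sketch replaces with a caller-supplied table.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one RAM region: [start_pfn, end_pfn). */
struct ram_range {
    unsigned long start_pfn;
    unsigned long end_pfn; /* exclusive */
};

/* Stand-in for ZERO_PAGE(0) in this user-space model. */
#define ZERO_PAGE_SENTINEL ((unsigned long)-1)

/* True if pfn lies inside any of the n ranges in map. */
bool pfn_is_ram(const struct ram_range *map, size_t n, unsigned long pfn)
{
    for (size_t i = 0; i < n; i++) {
        if (pfn >= map[i].start_pfn && pfn < map[i].end_pfn)
            return true;
    }
    return false;
}

/*
 * Mirror of the patched logic: hand back the requested frame only when it
 * is RAM, otherwise substitute the zero-page stand-in so that no non-RAM
 * location is ever touched.
 */
unsigned long select_dump_pfn(const struct ram_range *map, size_t n,
                              unsigned long nr)
{
    return pfn_is_ram(map, n, nr) ? nr : ZERO_PAGE_SENTINEL;
}
```

With a table containing a hole (for example frames 160 to 255 missing), a request inside the hole yields the sentinel instead of the raw frame number. That is the substitution the patch performs before kmap_atomic() is called.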
This bug is filed against RHEL2.1, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.