115211 – netdump fails, netconsole succeeds

Summary: netdump fails, netconsole succeeds

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	netdump
Sub Component:
Version:	2.1
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Thomas Graf
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-02-09 04:32 UTC by Richard Keech
Modified:	2014-06-18 08:28 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-10-19 19:23:11 UTC
Target Upstream Version:
Embargoed:

Attachments
oops message shown in netdump server syslog (2.12 KB, text/plain) 2004-02-10 04:20 UTC, Richard Keech

Description Richard Keech 2004-02-09 04:32:07 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030922

Description of problem:
Context

The system is configured to use both netdump and netconsole.

The command "service netdump start" returns OK.

The syslog on the netdump server shows the appropriate message, like
  "[ ... network console startup ... ]"

When crash.o is loaded, the netconsole service successfully logs
the Oops message to the netdump server.

The netdump-server service is started on the netdump server.

The system is being investigated for occasional system hangs.
The system is a node of an eight-node 9i RAC cluster.
The kernel is NOT tainted.  OCFS is not being used.


Problem

The memory dump does not occur.  The system console does not
give the expected indication of a netdump starting.


This behaviour was consistent across the following configurations:

NIC   Kernel

e1000 e.27
e1000 e.35
tg3   e.27
e100  e.27



The system in question is a Dell 6650 with quad Xeon and 8GB
RAM.  The system has multiple NICs as follows:  Broadcom Gigabit
onboard x2, e100 x1, e1000 x2.  Total of three or four interfaces
active at the time of the crash.

Version-Release number of selected component (if applicable):
netdump-0.6.11-2, kernel-2.4.9-e.27

How reproducible:
Always

Steps to Reproduce:
1. insmod crash.o

Actual Results:  no dump

Expected Results:  there should be a dump

Additional info:

Comment 1 Dave Anderson 2004-02-09 13:36:25 UTC

Please post any messages posted by the netdump-server in the
server's /var/log/messages file after the oops message was
written there.  Error messages from the netdump-server are
the key element in debugging this issue.

When you say "system configured to use both netdump and netconsole",
do you mean "netdump and syslog"?  Can you post your
/etc/sysconfig/netdump file?

Also, when you say "The system console does not give the expected
indication of a netdump starting.", exactly what is indicated?

Comment 2 Richard Keech 2004-02-10 02:41:22 UTC

/etc/sysconfig/netdump 
---------------------------------------------------------------
# You can also set both.
#
# LOCALPORT=6666
# DEV=##
#       NET DUMP SERVER ( lxrptrapp001)
#
NETDUMPADDR=141.168.133.132
#
#
# NETDUMPPORT=6666
# NETDUMPMACADDR=
# IDLETIMEOUT=
#
# If you want the console log (not crash dumps) sent via the
# syslog service, set SYSLOGADDR to the IP of the syslog server.
# The other two values normally remain unchanged.
#
#
#       NET DUMP SERVER ( lxrptrapp001 )
#
SYSLOGADDR=141.168.133.132
#
#
# SYSLOGPORT=514
# SYSLOGMACADDR=
==============================================================

In response to DA's questions:

* I will check again on the log messages on the netdump-server
host, but from memory it showed only the Oops message.

* Yes, by "netconsole" I mean sending the oops message to syslog.

* The console of the client shows the oops message and nothing
else.  From other tests of netdump I expect a netdump message
to say the dump is starting, and then another when the dump is
finished before the reboot.

Comment 3 Richard Keech 2004-02-10 04:20:03 UTC

Created attachment 97549
oops message shown in netdump server syslog

Comment 4 Dave Anderson 2004-02-10 14:23:13 UTC

I had suggested on tech-list that you set up the scripts on the server
to send email notification upon receipt of a crash request.  If you
haven't done that, *please* do that.  If you don't get the email
message generated from the /var/crash/scripts/netdump-crash script,
and there are no error messages from the netdump-server daemon in the
/var/log/messages file on the server, then the netdump procedure
wasn't even initiated.

When the system crashed, on the client console you should have seen
something like this after the oops message (which occurred on cpu3):

CPU0 frozen
CPU1 frozen
CPU2 frozen
< netdump activated - performing handshake with the client. >
< handshake completed - listening for dump requests. >
...
 
From your description, it sounds like you're not seeing any of the
above messages on the client console?  Also you mention above that
this all came about after the system was experiencing occasional
system hangs.  Was the system hung when you did the insmod?  The
reason I ask is that there is a known deficiency with netdump:
if any of the non-panicking CPUs are blocked somewhere with
interrupts disabled, then you will see zero or more (but not all)
"frozen" messages, and nothing after that.  This is because the
netdump module uses smp_call_function() to send IPIs to the other
CPUs asking them to freeze themselves and print their "frozen"
messages, and smp_call_function() will not return until it has
received an indication that each of the CPUs has received the IPI.
It will hang waiting for a response forever, no further messages will
be seen on the console, and the client therefore won't initiate the
handshake protocol with the server.  (BTW, it is the handshake that
causes the netdump-crash script to send the email message; that's why
the email is also a critical piece of debug info.)
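To make the failure mode above concrete, here is a minimal userspace sketch of the "wait for every CPU to acknowledge" step. It is not the netdump or kernel code: struct cpu_model, freeze_other_cpus() and handshake_can_start() are hypothetical names invented for this illustration, and the max_polls bound exists only so that the sketch terminates. As described above, the real wait has no such bound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one CPU, used only for this illustration. */
struct cpu_model {
    bool irqs_disabled; /* blocked with interrupts off: never sees the IPI */
    bool frozen;        /* set once the CPU has handled the freeze request */
};

/*
 * Model of the freeze step run by the panicking CPU (index `self`).
 * Returns how many of the other CPUs acknowledged the request.
 * max_polls is an assumption of this sketch so that it terminates;
 * the wait described in the comment above never gives up.
 */
static size_t freeze_other_cpus(struct cpu_model *cpus, size_t ncpus,
                                size_t self, unsigned max_polls)
{
    size_t acked = 0;

    assert(cpus != NULL && self < ncpus);

    for (unsigned poll = 0; poll < max_polls && acked < ncpus - 1; poll++) {
        acked = 0;
        for (size_t i = 0; i < ncpus; i++) {
            if (i == self)
                continue;
            /* A CPU with interrupts enabled takes the IPI and freezes. */
            if (!cpus[i].irqs_disabled)
                cpus[i].frozen = true;
            if (cpus[i].frozen)
                acked++;
        }
    }
    return acked;
}

/* The handshake with the server only starts once every other CPU acked. */
static bool handshake_can_start(size_t acked, size_t ncpus)
{
    return acked == ncpus - 1;
}
```

With four CPUs and the oops on CPU 3, a model in which CPU 1 has irqs_disabled set yields two acknowledgements instead of three, so handshake_can_start() stays false. That corresponds to seeing some but not all "frozen" messages and nothing afterwards.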

Comment 5 Roger Nunn 2004-03-25 12:59:41 UTC

I have a similar case with what appears to be identical hardware and
configuration:
Hardware: Dell 6650
Kernel: 2.4.9-e.38  NIC: tg3
The netdump server is connected back to back with the netdump client
via a crossover cable.
A console is also attached.
Here is the information gathered from the netdump server:

CPU#6 is frozen.
CPU#1 is frozen.
CPU#0 is frozen.
CPU#2 is frozen.
CPU#3 is frozen.
< netdump activated - performing handshake with the client. >
NETDUMP START!
< handshake completed - listening for dump requests. >
Uhhuh. NMI received for unknown reason 21.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
<0>CPU 7: Machine Check Exception: 0000000000000004
CPU 6: Machine Check Exception: 0000000000000004
CPU 0: Machine Check Exception: 0000000000000004
CPU 2: Machine Check Exception: 0000000000000004
Kernel panic: Unable to continue
------------[ cut here ]------------
kernel BUG at panic.c:58!
CPU 1: Machine Check Exception: 0000000000000004
CPU 3: Machine Check Exception: 0000000000000004
Bank 0: 8c00020020140146<0>Kernel panic: Unable to continue
Bank 0: 8c00020020140146<0>Kernel panic: Unable to continue
------------[ cut here ]------------
kernel BUG at panic.c:58!
[0036c0000050c000]<1>------------[ cut here ]------------
kernel BUG at panic.c:58!
at 0c00020020140146
[0036c0000050c000]<0>Kernel panic: Unable to continue
at 0c00020020140146
------------[ cut here ]------------
kernel BUG at panic.c:58!
Kernel panic: Unable to continue
------------[ cut here ]------------
kernel BUG at panic.c:58!
Kernel panic: Unable to continue
------------[ cut here ]------------
kernel BUG at panic.c:58!



The netdump started by the panic has now panicked again.
messages file:
Mar 25 00:49:20 beantown netdump[1517]: Got too many timeouts waiting for memory page for client 0x02020202, ignoring it
Mar 25 00:49:23 beantown netdump[1517]: Got too many timeouts waiting for SHOW_STATUS for client 0x02020202, rebooting it


[root@beantown 2.2.2.2-2004-03-25-00:41]# pwd
/var/crash/2.2.2.2-2004-03-25-00:41
[root@beantown 2.2.2.2-2004-03-25-00:41]# ls -al
total 4146172
drwx------    2 netdump  netdump      4096 Mar 25 00:41 .
drwxr-xr-x   18 netdump  netdump      4096 Mar 25 00:41 ..
-rw-------    1 netdump  netdump      1099 Mar 25 00:41 log
-rw-------    1 netdump  netdump  4241518592 Mar 25 00:48 vmcore-incomplete
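
The client ID 0x02020202 in the messages file presumably refers to the same client as the 2.2.2.2 in the crash directory name. A small sketch of that conversion, assuming the hex value is the IPv4 address with the most significant byte first (client_id_to_dotted_quad() is a hypothetical helper, not part of netdump):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 32-bit client ID as a dotted quad.
 * Assumption of this sketch: the most significant byte is the first octet.
 * Returns the snprintf() result (length of the formatted string).
 */
static int client_id_to_dotted_quad(uint32_t id, char *buf, size_t len)
{
    assert(buf != NULL);
    return snprintf(buf, len, "%u.%u.%u.%u",
                    (unsigned)((id >> 24) & 0xffu),
                    (unsigned)((id >> 16) & 0xffu),
                    (unsigned)((id >> 8) & 0xffu),
                    (unsigned)(id & 0xffu));
}
```

Passing 0x02020202 with a 16-byte buffer gives "2.2.2.2", which matches the /var/crash/2.2.2.2-2004-03-25-00:41 directory listed above.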

Comment 6 Jeff Moyer 2004-09-08 20:39:53 UTC

Roger,

Do you have the nmi watchdog enabled?  If so, please disable it and
run your test again.  Older versions of the netdump module did not
interoperate with the nmi_watchdog.

If you don't have the nmi watchdog enabled, then this points to other
hardware problems.  Most common are memory errors, so you may consider
testing your memory.
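
When judging whether the Machine Check Exception lines in comment 5 point at hardware, it can help to decode the register values. The sketch below rests on two assumptions: that the value after "Machine Check Exception:" is the global status register (IA32_MCG_STATUS) and that the value after "Bank 0:" is that bank's status register (IA32_MC0_STATUS). The bit positions follow the Intel architecture manuals. The struct and function names are hypothetical, and because the console output in comment 5 is interleaved across CPUs, the logged values themselves may not be intact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return bit n of v. */
static bool bit(uint64_t v, unsigned n)
{
    assert(n < 64);
    return ((v >> n) & 1u) != 0;
}

/* Flags of the global machine check status register. */
struct mcg_status {
    bool ripv; /* bit 0: restart instruction pointer valid */
    bool eipv; /* bit 1: error instruction pointer valid */
    bool mcip; /* bit 2: machine check in progress */
};

/* Flags of a per-bank machine check status register. */
struct mci_status {
    bool val;          /* bit 63: register contents valid */
    bool over;         /* bit 62: an earlier error was overwritten */
    bool uc;           /* bit 61: uncorrected error */
    bool en;           /* bit 60: error reporting enabled */
    bool miscv;        /* bit 59: MISC register valid */
    bool addrv;        /* bit 58: ADDR register valid */
    bool pcc;          /* bit 57: processor context corrupt */
    uint16_t mca_code; /* bits 15:0: architectural error code */
};

static struct mcg_status decode_mcg_status(uint64_t v)
{
    struct mcg_status s = { bit(v, 0), bit(v, 1), bit(v, 2) };
    return s;
}

static struct mci_status decode_mci_status(uint64_t v)
{
    struct mci_status s = {
        bit(v, 63), bit(v, 62), bit(v, 61), bit(v, 60),
        bit(v, 59), bit(v, 58), bit(v, 57),
        (uint16_t)(v & 0xffffu)
    };
    return s;
}
```

Under those assumptions, 0000000000000004 decodes as MCIP set with RIPV and EIPV clear, and 8c00020020140146 decodes as VAL, MISCV and ADDRV set with an error code of 0x0146.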

Comment 7 Dave Anderson 2004-09-08 21:07:40 UTC


Actually I believe this problem was most likely fixed in AS2.1 U5
with this patch to send_netdump_mem():

> This patch is a backport from RHEL3, which simply verifies all
> requested pages with page_is_ram() before kmap'ing them.
> Without it, a non-existent page or other sensitive non-RAM
> memory location could be mapped, accessed, possibly causing
> a machine check.  (IT #36437)


--- linux/drivers/net/netconsole.c.orig Tue Jun  1 13:45:22 2004
+++ linux/drivers/net/netconsole.c      Tue Jun  1 13:47:09 2004
@@ -742,9 +742,11 @@
                send_netdump_skb(dev, str, strlen(str), &reply);
                return;
        }
-       page = mem_map + nr;
-//     if (PageReserved(page))
-//             page = ZERO_PAGE(0);
+
+       if (page_is_ram(nr))
+               page = mem_map + nr;
+       else
+               page = ZERO_PAGE(0);
 
        kaddr = (char *)kmap_atomic(page, KM_NETDUMP);
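
The effect of the patch can be sketched in userspace as follows. This is not the kernel code: struct ram_range, sketch_page_is_ram() and fill_dump_page() are hypothetical stand-ins, with a flat byte array playing the role of physical memory and a small table of page frame ranges playing the role of the RAM map that page_is_ram() consults. The point is only that a page outside RAM is never read; the dump gets a zero-filled page instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical RAM map entry: page frames in [start_pfn, end_pfn). */
struct ram_range {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Stand-in for page_is_ram(): is this page frame inside any RAM range? */
static bool sketch_page_is_ram(const struct ram_range *map, size_t n,
                               unsigned long pfn)
{
    for (size_t i = 0; i < n; i++) {
        if (pfn >= map[i].start_pfn && pfn < map[i].end_pfn)
            return true;
    }
    return false;
}

/*
 * Fill `out` with the page to send for `pfn`.
 * RAM pages are copied from `mem`; anything else becomes a page of
 * zeros, mirroring the ZERO_PAGE(0) substitution in the patch.
 * Returns true if real memory was copied.
 */
static bool fill_dump_page(const unsigned char *mem,
                           const struct ram_range *map, size_t n,
                           unsigned long pfn, unsigned char *out)
{
    assert(mem != NULL && out != NULL);

    if (sketch_page_is_ram(map, n, pfn)) {
        memcpy(out, mem + (size_t)pfn * SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
        return true;
    }
    memset(out, 0, SKETCH_PAGE_SIZE);
    return false;
}
```

A request for a page frame in a hole between two RAM ranges, or beyond the end of memory, returns false and leaves `out` zeroed without touching `mem` at that offset.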

Comment 8 RHEL Program Management 2007-10-19 19:23:11 UTC

This bug is filed against RHEL2.1, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products.  Since
this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/

If you feel this bug is indeed mission critical, please contact your
support representative.  You may be asked to provide detailed
information on how this bug is affecting you.
