90435 – netdump does not dump completely

Summary: netdump does not dump completely

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	netdump
Sub Component:
Version:	2.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Neil Horman
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-05-08 08:31 UTC by Patrick C. F. Ernzer
Modified:	2007-11-30 22:06 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-08-08 19:50:22 UTC
Target Upstream Version:
Embargoed:

Attachments
an aborted netdump (1.11 KB, text/plain) 2003-05-08 08:33 UTC, Patrick C. F. Ernzer
sysreport of the netdump server (125.69 KB, application/octet-stream) 2003-05-08 08:34 UTC, Patrick C. F. Ernzer
sysreport of the netdump client (150.90 KB, application/octet-stream) 2003-05-08 08:34 UTC, Patrick C. F. Ernzer
some vmstat, readprofile and top outputs under 2.4.9-e.24 (220.00 KB, application/octet-stream) 2003-06-03 14:58 UTC, Patrick C. F. Ernzer

Description Patrick C. F. Ernzer 2003-05-08 08:31:30 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
netdump will stop sending logs to the server in about 50% of the cases. The
attachment 'netdump' shows such a case. When it fails, the failure will always
be at the same spot in the netdump. Customer did reproduce this under 2.4.9-e.16
as well.

If it fails, the machine will also not reboot automagically.

Configuration of netdump was done step-by-step according to our whitepaper
(http://www.redhat.com/support/wpapers/redhat/netdump/)

Final testing was done on a fully updated system with crash.c from the netdump
docs directory and a cross-over cable.

Version-Release number of selected component (if applicable):
netdump-0.6.6-1

How reproducible:
Sometimes

Steps to Reproduce:
1. boot
2. wait 120 secs
3. load crash.o
    

Actual Results:  they do not manage to get more than 2 successful consecutive
netdump runs.


Expected Results:  successful netdump in every case

Additional info:

this is in Issue Tracker as Issue 21337

Comment 1 Patrick C. F. Ernzer 2003-05-08 08:33:10 UTC

Created attachment 91552
an aborted netdump

same behaviour under 2.4.9-e.16smp

Comment 2 Patrick C. F. Ernzer 2003-05-08 08:34:23 UTC

Created attachment 91553
sysreport of the netdump server

Comment 3 Patrick C. F. Ernzer 2003-05-08 08:34:51 UTC

Created attachment 91554
sysreport of the netdump client

Comment 4 Norm Murray 2003-05-08 14:45:34 UTC

Do we have error messages of the failed netdumps? Can we get a network trace of
one that fails, just to see the last bit of communication? What does the
netdump server look like from a vm standpoint at the time of failure?

Comment 5 Dave Anderson 2003-05-09 20:57:24 UTC

In addition to Norm's questions...

Create a /var/crash/scripts/netdump-reboot script (as well
as the other three scripts if you like), taking the example
script from the /usr/share/doc/netdump-server*/example_scripts
directory on the server side.  It will send a mail message
to whomever you fill in when the server sends a reboot request.
This will at least confirm that the server daemon requested
that the client reboot itself.

Also, on the server side, are there any messages in /var/log/messages
that read:

"Got too many timeouts waiting for SHOW_STATUS for client 0x########, rebooting
it"
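A quick way to check for those messages on the server is sketched below. The function name netdump_timeouts is made up for illustration, and the pattern assumes the daemon logs through syslog with a "netdump[PID]:" tag.

```shell
#!/bin/sh
# Hypothetical helper: print the netdump server's timeout messages from a
# syslog file. The default path is the one named in the comment above.
netdump_timeouts() {
    logfile=${1:-/var/log/messages}
    # Match lines such as:
    #   netdump[1324]: Got too many timeouts waiting for SHOW_STATUS ...
    grep 'netdump\[[0-9]*\]: Got too many timeouts' "$logfile"
}
```

Running `netdump_timeouts` with no argument reads /var/log/messages; pass another path to inspect a rotated log.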

Comment 6 Tobias Meier 2003-05-12 09:52:00 UTC

hi,
yes there are timeout messages:

---snip---
May 12 11:25:52 lnxc-323 netdump[1324]: Got too many timeouts waiting for memory
page for client 0xc0a800cc, ignoring it
May 12 11:25:56 lnxc-323 netdump[1324]: Got too many timeouts waiting for
SHOW_STATUS for client 0xc0a800cc, rebooting it
May 12 11:25:56 lnxc-323 netdump[1324]: Got unexpected packet type 3 from ip
0xc0a800cc
---snip---

and the server sends the reboot mails.

interesting: during and after a netdump run, the computer feels very slow,
but i don't know why, because there is no network traffic.

tobias
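The client identifier in the messages above looks like an IPv4 address printed as a 32-bit hexadecimal number. Assuming that reading is right (it is not confirmed anywhere in this report), a small sketch to turn it back into dotted-quad form follows. The function name hex_to_ipv4 is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a 32-bit hex value such as 0xc0a800cc into
# dotted-quad notation. Assumes the most significant byte is the first octet.
hex_to_ipv4() {
    n=$(( $1 ))
    echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
}

hex_to_ipv4 0xc0a800cc   # prints 192.168.0.204
```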

Comment 7 Dave Anderson 2003-05-12 13:41:57 UTC

> interesting: during and after a netdump run, the computer feels very slow,
> but i don't know why, because there is no network traffic.

That is interesting -- it's possible that the slow-down is due to the "kswapd"
issue, as soon as the pagecache consumes all available memory.  I see that
the netdump client is running e.12 or e.16, but what version of the kernel is
running on the netdump server machine?  

The server should be running at least the QU1 errata kernel 2.4.9-e.12.
With that kernel (or any later versions), the third argument to
/proc/sys/vm/pagecache should be lowered from 90% to at most 75%,
and lower than that if necessary:

  # cat /proc/sys/vm/pagecache
  2       50      90
  # echo 2 50 75 > /proc/sys/vm/pagecache
  # cat /proc/sys/vm/pagecache
  2       50      75
  #

The percentage value there will limit the amount of memory used by the
pagecache, and when it is reached, will start stealing memory back from the
pagecache.  When writing a huge sequential file such as a crash dump, it
doesn't make sense to cache all of its pages in the pagecache.
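The adjustment above can be sketched as a small function that keeps the first two values and only lowers the third when it exceeds a chosen cap. The function name cap_pagecache_max is made up for illustration; it only computes the new triple and does not write to /proc itself.

```shell
#!/bin/sh
# Hypothetical helper: given the current pagecache triple and a cap (percent),
# print the triple with the third value lowered to the cap if it is higher.
cap_pagecache_max() {
    triple=$1
    cap=$2
    # Split the triple on whitespace (the kernel prints it tab-separated).
    set -- $triple
    v1=$1
    v2=$2
    max=$3
    if [ "$max" -gt "$cap" ]; then
        max=$cap
    fi
    echo "$v1 $v2 $max"
}

# Possible usage, as root on the netdump server:
#   new=$(cap_pagecache_max "$(cat /proc/sys/vm/pagecache)" 75)
#   echo "$new" > /proc/sys/vm/pagecache
```

A value written to /proc this way is lost at the next reboot, so it has to be reapplied after the server restarts.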

Comment 8 Tobias Meier 2003-05-13 09:41:46 UTC

hi,
kernel version is:

---snip---
uname -a
Linux lnxc-323 2.4.9-e.16smp #1 SMP Mon Mar 17 16:55:45 EST 2003 i686 unknown
echo 2 50 75 > /proc/sys/vm/pagecache
---snip---

now, the server feels faster. but the netdump client doesn't reboot.

here are the last log messages:
---snip---
CPU#1 is frozen.
< netdump activated - performing handshake with the client. >

Process: 1260, {              insmod}
Kernel 2.4.9-e.16smp
EIP: 0010:[<f8ab7076>] CPU: 0
EIP is at init_module [crash] 0x16
 EFLAGS: 00010282    Tainted: PF
EAX: 00000013 EBX: f8ab7000 ECX: 00000000 EDX: f541e000
ESI: 00000000 EDI: 00000000 EBP: f4e7df28 DS: 0018 ES: 0018
CR0: 8005003b CR2: 00000000 CR3: 34e69000 CR4: 000006d0
Call Trace: [<c011d735>] sys_init_module [kernel] 0x555
[<f8ab7060>] init_module [crash] 0x0
[<c01072e3>] system_call [kernel] 0x33


                         free                        sibling
  task             PC    stack   pid father child younger older
init          S 00000000  2836     1      0  1243       4       (NOTLB)
Call Trace: [<c0125214>] schedule_timeout [kernel] 0x84
[<c0125180>] process_timeout [kernel] 0x0
[<c015636e>] do_select [kernel] 0x20e
[<c0156719>] sys_select [kernel] 0x339
[<c01072e3>] system_call [kernel] 0x33

keventd       S 00000000  5992     2      1             3       (L-TLB)
Call Trace: [<c01296ec>] context_thread [kernel] 0x13c
[<c0105000>] stext [kernel] 0x0
[<c0105000>] stext [kernel] 0x0
[<c0105836>] arch_kernel_thread [kernel] 0x26
[<c01295b0>] context_thread [kernel] 0x0

keventd       S C323A000  6268     3      1            11     2 (L-TLB)
Call Trace: [<c01296ec>] context_thread [kernel] 0x13c
[<c0107296>] ret_from_fork [kernel] 0x6
[<c0105000>] stext [kernel] 0x0
---snip---

mfg tobias

Comment 9 Dave Anderson 2003-05-13 13:24:36 UTC

Unfortunately I was hoping that the server was actually timing
out on itself due to the pagecache issue; we've had customer
reports where dropping the percentage down would also solve the
netdump timeout issue.

The trace showing the exception in the init_module() function
is exactly what is expected when using that crash.o module.
The crash.o init function writes to address 0, which causes the
fault, which dumps the trace leading up to that point, and then
starts the netdump operation.  So nothing is unusual about the tracebacks.

One thing you might also try is to set the IDLETIMEOUT parameter
in /etc/sysconfig/netdump on the client.  By default it is not set.
If set to a value (in seconds), the client will reboot itself if it
receives no requests from the netdump server for that many seconds.
Try setting it to 10, for example.  This will at least indicate that
the netdump client is still running waiting for commands, or if it
is hung somewhere.
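As a fragment, the suggested setting would look like this in the client's configuration file. The netdump client presumably has to be restarted to pick up the change, which is an assumption and not something stated in this report.

```shell
# /etc/sysconfig/netdump on the client (fragment)
# Reboot the client if no request arrives from the netdump server
# for this many seconds. 10 is only the example value suggested above.
IDLETIMEOUT=10
```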

Comment 10 Tobias Meier 2003-05-14 09:43:24 UTC

hi,
the same problem with IDLETIMEOUT parameter :-(

tobias

Comment 11 Dave Anderson 2003-05-14 13:00:44 UTC

Given that the client doesn't reboot itself w/IDLETIMEOUT set,
and that the server times out and sends a reboot request which
is ignored, the client netdump code is wedged someplace.

Unfortunately, that being the case, I'm at a loss to come up with
any more suggestions as to how to determine the problem remotely
without being able to reproduce it in-house.  It will require hacking
the netconsole module code (and possibly kernel code) to determine
where it's blocking.

Comment 12 Tobias Meier 2003-05-20 15:02:12 UTC

hi,
sorry but we are working on other issues at the moment. 
we start the next tests on friday.

tobias

Comment 13 Patrick C. F. Ernzer 2003-06-03 14:57:15 UTC

from Issue Tracker #22148

[start]
hi,
here are some vmstat, readprofile and top outputs.
the kernel version is 2.4.9-e.24.

tobias

File uploaded: debug.tar
[end]

attachment coming as 90435-debug.tar

Comment 14 Patrick C. F. Ernzer 2003-06-03 14:58:36 UTC

Created attachment 92113
some vmstat, readprofile and top outputs under 2.4.9-e.24

https://enterprise.redhat.com/portal/?module=download&fid=4028 in Issue Tracker

Comment 15 Norm Murray 2003-06-05 17:23:42 UTC

netdump is known to have issues with watchdogs such as nmi_watchdog (which needs
to be enabled for profiling). This seems to be at least contributing to the
inability to catch a complete netdump.

Comment 16 Dave Anderson 2003-12-02 20:46:26 UTC

2.4.9-e.29 does have the patch that allows netdump
and nmi_watchdog to be used together.  It was applied
sometime *after* 2.4.9-e.25, but I don't know when.

In any case, if the netdump client hangs, the nmi_watchdog
should fire, dump a stack trace on the console, and shut down
the system.
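To see whether a client was booted with the watchdog requested, one option is to look for the nmi_watchdog= parameter on the kernel command line. The sketch below only parses a command-line string; the function name nmi_watchdog_setting is made up for illustration, and a parameter being present does not prove the watchdog is actually ticking.

```shell
#!/bin/sh
# Hypothetical helper: print the value of nmi_watchdog= from a kernel
# command line string, or "unset" if the parameter is absent.
nmi_watchdog_setting() {
    for arg in $1; do
        case $arg in
            nmi_watchdog=*)
                echo "${arg#nmi_watchdog=}"
                return 0
                ;;
        esac
    done
    echo "unset"
}

# Possible usage on the netdump client:
#   nmi_watchdog_setting "$(cat /proc/cmdline)"
```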
