Bug 127815 (IT#36426) - netconsole freezes during printk() when output link not up
Summary: netconsole freezes during printk() when output link not up
Keywords:
Status: CLOSED ERRATA
Alias: IT#36426
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jeff Moyer
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 137214
 
Reported: 2004-07-14 09:06 UTC by Alexandre Oliva
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-12-20 20:55:40 UTC
Target Upstream Version:
Embargoed:


Attachments
retry only n times on failed transmit due to queue_stopped (910 bytes, patch)
2004-07-28 15:41 UTC, Jeff Moyer
fixed version of the patch (950 bytes, patch)
2004-07-28 18:16 UTC, Jeff Moyer


Links
System ID: Red Hat Product Errata RHBA-2004:550
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 4
Last Updated: 2004-12-20 05:00:00 UTC

Description Alexandre Oliva 2004-07-14 09:06:12 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040510

Description of problem:
If you set up netconsole on a box with an e100 network card, unplug
the network cable for 20 seconds or so, then plug it back in, the
machine hangs right after printing the `e100: eth3 NIC Link is Up 100
Mbps Full duplex' message.

Version-Release number of selected component (if applicable):
kernel-2.4.21-15.0.3.EL netdump-0.6.11-3

How reproducible:
Always

Steps to Reproduce:
1. Configure a machine with an e100 network card as a netconsole client
2. Unplug the network cable
3. Put it back in after 20 seconds

Actual Results: Complete system freeze

Expected Results: No freeze

Additional info:

Comment 2 Alexandre Oliva 2004-07-14 09:12:41 UTC
I have a strong suspicion that the problem has to do with the printk()
call in e100_watchdog() before e100_config_fc() and e100_config() are
called.  At that point, netif_running(dev) is already true, but the
interface is not configured so the interface polling from
write_netconsole_msg() may be unable to make any progress, so the
interface will never be configured.  Since the e100_watchdog() won't
be set up before the current execution is done, and interrupts are
disabled, we get a hard freeze.  I suspect moving the printk past the
e100_config() call might fix the problem.

It would still be possible to run into an error should any other
printk() be issued between the point when netif_running is set and
the point when the watchdog kicks in and reconfigures the interface.
We'd get into the same deadlock, at least on uniprocessor machines.
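
The ordering problem described above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the e100 or netconsole code: every `model_*` name is hypothetical, and the spin budget `max_spins` exists only so the model terminates, whereas the loop described in this report has no such bound.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names come from the real drivers. */
struct model_nic {
    bool running;     /* models netif_running(dev) returning true */
    bool configured;  /* set only when the watchdog configures the interface */
};

/* Models the driver poll: a frame can only leave once the NIC is configured. */
static bool model_poll_transmit(const struct model_nic *nic)
{
    return nic->running && nic->configured;
}

/* Models the console write path: poll until the frame is sent.
 * Returns the number of polls used, or -1 if the budget ran out
 * (which stands in for "spins forever" in the unbounded case). */
static int model_console_write(const struct model_nic *nic, int max_spins)
{
    assert(max_spins > 0);
    for (int spins = 1; spins <= max_spins; spins++) {
        if (model_poll_transmit(nic))
            return spins;
    }
    return -1;
}

/* Models the watchdog on link-up. print_first selects the ordering described
 * in this comment (message before configuration); false selects the
 * suggested ordering (configuration before message). */
static int model_watchdog_link_up(struct model_nic *nic, bool print_first,
                                  int max_spins)
{
    int result;

    nic->running = true;
    if (print_first) {
        result = model_console_write(nic, max_spins);
        nic->configured = true;  /* only reached because the model is bounded */
    } else {
        nic->configured = true;
        result = model_console_write(nic, max_spins);
    }
    return result;
}
```

With the message printed first, the model exhausts its budget without sending anything; with the configuration done first, the first poll succeeds.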

Comment 4 Jeff Moyer 2004-07-20 20:24:19 UTC
I would like to see precisely where we are hung.  It would be helpful
to run this test on a system with the nmi_watchdog enabled.

Comment 6 Alexandre Oliva 2004-07-21 05:21:05 UTC
Jeffrey, we're stuck in the poll loop in write_netconsole_msg(),
attempting to transmit the printk()ed string over the interface that
is not (re)configured yet, because it would only be configured when
printk returned.

Comment 7 John W. Linville 2004-07-21 15:15:41 UTC
The problem stems from netconsole handling output from printk()
synchronously.  While that may normally work, it does allow for this
sort of deadlock to occur.

netconsole either needs a mechanism to handle printk() output
asynchronously or it needs checks (e.g. netif_queue_stopped()) that
allow it to drop printk() output when the output device is unable to
transmit.
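
A minimal sketch of the second option (check first, drop if the device cannot transmit), written as a userspace model rather than kernel code. The `model_dev` structure and `model_write_or_drop()` are hypothetical; the `queue_stopped` flag merely stands in for what netif_queue_stopped() would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the output device. */
struct model_dev {
    bool queue_stopped;  /* models netif_queue_stopped(dev) */
    size_t sent;         /* messages handed to the device */
    size_t dropped;      /* messages discarded instead of waiting */
};

/* Returns true if the message was handed to the device,
 * false if it was dropped because the queue is stopped. */
static bool model_write_or_drop(struct model_dev *dev, const char *msg)
{
    assert(dev != NULL && msg != NULL);
    if (dev->queue_stopped) {
        dev->dropped++;  /* never wait: losing output beats hanging */
        return false;
    }
    dev->sent++;
    return true;
}
```

The write path never waits in this variant, so it cannot hang, but it also discards messages that a short wait might have delivered.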

Comment 11 Jeff Moyer 2004-07-26 23:04:19 UTC
If we change netconsole.c to discard the printk when
netif_queue_stopped returns true, then it is kind of a sledgehammer
approach.  So far, this thread has only concentrated on this one
reason that the queue is stopped.  Other reasons include running out
of TX descriptors, which can be fixed in the polling loop where we are
currently looping forever.  IOW, in other cases, calling the poll
function for the driver will free up these resources, and we will be
able to send out the printk.

I spoke with Matt Mackall at OLS, and he mentioned that deadlocks such
as this were considered, and the decision was made to handle the
issues on a case by case basis.  (well, he was actually talking about
a deadlock involving taking a spin lock in the interrupt routine, and
doing a printk with it held).

What this really comes down to is how reliable is the network console
expected to be?  There are certainly cases where we will have to drop
messages, and since the messages are packaged up using UDP, I guess we
aren't exactly advertising any guarantees.

If we agree on that, then I'll put together a patch which addresses
this in netconsole.  It may be enough to simply try to poll a few
times, and if we don't make any progress, return failure.
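
The "poll a few times, then give up" idea can be sketched as follows. This is a userspace model under assumptions, not the attached patch: the `model_*` names are hypothetical, and the retry limit of 5 is an arbitrary illustrative value, since this report only says "a few times".

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_RETRIES 5  /* hypothetical limit, chosen for illustration */

/* Hypothetical transmit queue model. */
struct model_txq {
    int free_descriptors;  /* 0 means the queue is stopped */
    int reclaim_per_poll;  /* descriptors one poll frees; 0 models a dead link */
};

static bool model_queue_stopped(const struct model_txq *q)
{
    return q->free_descriptors == 0;
}

/* Models calling the driver's poll function to reclaim TX descriptors. */
static void model_poll(struct model_txq *q)
{
    q->free_descriptors += q->reclaim_per_poll;
}

/* Returns true if the message was sent, false if the queue was still
 * stopped after MODEL_MAX_RETRIES polls and the message was dropped. */
static bool model_send_bounded(struct model_txq *q)
{
    assert(q->free_descriptors >= 0 && q->reclaim_per_poll >= 0);
    for (int tries = 0; model_queue_stopped(q); tries++) {
        if (tries == MODEL_MAX_RETRIES)
            return false;  /* no progress: give up instead of spinning */
        model_poll(q);
    }
    q->free_descriptors--;  /* consume one descriptor for this message */
    return true;
}
```

Unlike dropping as soon as the queue is stopped, this still delivers the message when polling can free descriptors, and only gives up when repeated polls make no progress.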

Comment 12 John W. Linville 2004-07-27 13:19:06 UTC
FWIW, I would agree that netconsole's reliability is at most "best
effort".  Dropping text output while the link goes up and down, while
undesirable, seems reasonable -- much more so than hanging the box... :-)

Comment 13 Alexandre Oliva 2004-07-28 08:18:31 UTC
Agreed.  After the argument that packets are sent with UDP, dropping
them in cases we can't transmit doesn't feel inappropriate at all.

Comment 14 Jeff Moyer 2004-07-28 15:41:36 UTC
Created attachment 102255
retry only n times on failed transmit due to queue_stopped

Here is a patch which implements my last suggestion.  It has been tested with
numerous cable pulls using the e100 driver.  This also makes things happy in
the case where you generate a lot of console output while the cable is
unplugged (which was also capable of hanging the machine).

Comment 15 Jeff Moyer 2004-07-28 18:16:51 UTC
Created attachment 102259
fixed version of the patch

I generated this patch before I saved my emacs buffer.  Good thing I tested it
well!

Comment 16 Ernie Petrides 2004-09-04 00:49:14 UTC
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.3.EL).


Comment 22 John Flanagan 2004-12-20 20:55:40 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html


