Red Hat Bugzilla – Bug 127815
netconsole freezes during printk() when output link not up
Last modified: 2007-11-30 17:07:02 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040510
Description of problem:
If you set up netconsole on a box with an e100 network card, unplug
the network cable for 20 seconds or so, then plug it back in, the
machine hangs right after printing the `e100: eth3 NIC Link is Up 100
Mbps Full duplex' message.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Configure a machine with an e100 network card as a netconsole client
2. Unplug the network cable
3. Put it back in after 20 seconds
Actual Results: Complete system freeze
Expected Results: No freeze
I have a strong suspicion that the problem has to do with the printk()
call in e100_watchdog() before e100_config_fc() and e100_config() are
called. At that point, netif_running(dev) is already true, but the
interface is not configured so the interface polling from
write_netconsole_msg() may be unable to make any progress, so the
interface will never be configured. Since the e100_watchdog() won't
be set up before the current execution is done, and interrupts are
disabled, we get a hard freeze. I suspect moving the printk past the
e100_config() call might fix the problem.
It would still be possible to run into an error should any other
printk() be issued between the point when netif_running is set and
when the watchdog would kick in and reconfigure the interface. We'd
get into the same deadlock, at least on uniprocessor machines.
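To make the ordering argument above concrete, here is a small userspace sketch. It is a toy model, not kernel or e100 code: every name (toy_nic, toy_poll_tx, toy_console_write, the two toy_watchdog_* functions) is made up for illustration, and the poll cap exists only so the sketch terminates, whereas the loop described above has no such bound.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the NIC state; not the real e100 structures. */
struct toy_nic {
    bool running;     /* models netif_running(dev) being true */
    bool configured;  /* models e100_config() having completed */
};

/* Models one driver poll: a frame can leave only once the NIC is configured. */
static bool toy_poll_tx(const struct toy_nic *nic)
{
    return nic->running && nic->configured;
}

/*
 * Models a synchronous console write that polls until the frame is sent.
 * The cap is an assumption of the sketch so that a stall can be reported
 * instead of spinning forever.
 */
static bool toy_console_write(const struct toy_nic *nic, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (toy_poll_tx(nic))
            return true;
    }
    return false; /* no progress: this is where the real loop would hang */
}

/* Suspected bad order: log "link up" first, configure afterwards. */
static bool toy_watchdog_log_first(struct toy_nic *nic)
{
    nic->running = true;
    bool sent = toy_console_write(nic, 1000);
    nic->configured = true; /* never reached if the write above spins forever */
    return sent;
}

/* Proposed order: configure first, then log. */
static bool toy_watchdog_config_first(struct toy_nic *nic)
{
    nic->running = true;
    nic->configured = true;
    return toy_console_write(nic, 1000);
}
```

In this model the log-first variant can never send its message, because the only code that would configure the interface runs after the write returns. The config-first variant sends on the first poll.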
I would like to see precisely where we are hung. It would be helpful
to run this test on a system with the nmi_watchdog enabled.
Jeffrey, we're stuck in the poll loop in write_netconsole_msg(),
attempting to transmit the printk()ed string over the interface that
is not (re)configured yet, because it would only be configured when
The problem stems from netconsole handling output from printk()
synchronously. While that normally works, it allows this sort of
deadlock to occur.
netconsole either needs a mechanism to handle printk() output
asynchronously or it needs checks (e.g. netif_queue_stopped()) to
allow for dropping printk() output when the output device is unable to
transmit.
If we change netconsole.c to discard the printk when
netif_queue_stopped returns true, then it is kind of a sledgehammer
approach. So far, this thread has only concentrated on this one
reason that the queue is stopped. Other reasons include running out
of TX descriptors, which can be fixed in the polling loop where we are
currently looping forever. IOW, in other cases, calling the poll
function for the driver will free up these resources, and we will be
able to send out the printk.
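A rough userspace sketch of the distinction drawn above, assuming a toy TX ring. The names (toy_ring, ring_poll, send_drop_if_stopped, send_poll_then_try) and the ring size are hypothetical and do not come from netconsole.c or any driver. It compares dropping as soon as the queue is stopped with polling once first so that completed descriptors can be reclaimed.

```c
#include <stdbool.h>

#define TOY_RING_SIZE 4 /* arbitrary size for the sketch */

struct toy_ring {
    int in_flight;  /* descriptors handed to the hardware, not yet reclaimed */
    int completed;  /* how many of those the hardware has finished sending */
};

/* Models netif_queue_stopped(): true when no descriptor is free. */
static bool ring_queue_stopped(const struct toy_ring *r)
{
    return r->in_flight >= TOY_RING_SIZE;
}

/* Models the driver poll: reclaim finished descriptors, return how many. */
static int ring_poll(struct toy_ring *r)
{
    int freed = r->completed;

    r->in_flight -= freed;
    r->completed = 0;
    return freed;
}

/* Sledgehammer policy: drop the message as soon as the queue is stopped. */
static bool send_drop_if_stopped(struct toy_ring *r)
{
    if (ring_queue_stopped(r))
        return false;
    r->in_flight++;
    return true;
}

/* Gentler policy: give the driver one chance to free descriptors first. */
static bool send_poll_then_try(struct toy_ring *r)
{
    if (ring_queue_stopped(r))
        ring_poll(r);
    if (ring_queue_stopped(r))
        return false; /* still stuck, e.g. the link is down */
    r->in_flight++;
    return true;
}
```

With a full ring where the hardware has already finished two frames, the first policy drops the message and the second sends it. With a full ring and nothing completed (the cable-pull case), both give up.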
I spoke with Matt Mackall at OLS, and he mentioned that deadlocks such
as this were considered, and the decision was made to handle the
issues on a case by case basis. (well, he was actually talking about
a deadlock involving taking a spin lock in the interrupt routine, and
doing a printk with it held).
What this really comes down to is how reliable is the network console
expected to be? There are certainly cases where we will have to drop
messages, and since the messages are packaged up using UDP, I guess we
aren't exactly advertising any guarantees.
If we agree on that, then I'll put together a patch which addresses
this in netconsole. It may be enough to simply try to poll a few
times, and if we don't make any progress, return failure.
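The idea of polling a few times and then giving up can be sketched in userspace as follows. This is only an illustration of the approach, not the attached patch: toy_dev, dev_poll, dev_send and the retry limit of 3 are all assumptions made for the example.

```c
#include <stdbool.h>

#define TOY_MAX_RETRIES 3 /* arbitrary limit chosen for the sketch */

struct toy_dev {
    bool queue_stopped;     /* models netif_queue_stopped(dev) */
    int polls_until_ready;  /* polls needed to restart the queue; < 0 = never */
    int polls_done;         /* how many times the device has been polled */
};

/* Models one driver poll; the queue restarts after enough polls, if ever. */
static void dev_poll(struct toy_dev *d)
{
    d->polls_done++;
    if (d->polls_until_ready >= 0 && d->polls_done >= d->polls_until_ready)
        d->queue_stopped = false;
}

/*
 * Bounded replacement for an unbounded "poll until the queue restarts" loop.
 * Returns 0 if the message could be sent, -1 if it was dropped.
 */
static int dev_send(struct toy_dev *d)
{
    int tries = 0;

    while (d->queue_stopped) {
        if (tries++ >= TOY_MAX_RETRIES)
            return -1; /* no progress: drop the message instead of hanging */
        dev_poll(d);
    }
    return 0; /* queue is running, the message would be transmitted here */
}
```

A device that recovers within the limit still gets its message out, while one that never recovers costs at most TOY_MAX_RETRIES polls per message before the caller regains control.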
FWIW, I would agree that netconsole's reliability is at most "best
effort". Dropping text output while the link goes up and down, while
undesirable, seems reasonable -- much more so than hanging the box... :-)
Agreed. After the argument that packets are sent with UDP, dropping
them in cases we can't transmit doesn't feel inappropriate at all.
Created attachment 102255
retry only n times on failed transmit due to queue_stopped
Here is a patch which implements my last suggestion. It has been tested with
numerous cable pulls using the e100 driver. This also makes things happy in
the case where you generate a lot of console output while the cable is
unplugged (which was also capable of hanging the machine).
Created attachment 102259
fixed version of the patch
I generated this patch before I saved my emacs buffer. Good thing I tested it
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.3.EL).
An errata has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.