From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040510

Description of problem:
If you set up netconsole on a box with an e100 network card, unplug the network cable for 20 seconds or so, then plug it back in, the machine hangs right after printing the `e100: eth3 NIC Link is Up 100 Mbps Full duplex' message.

Version-Release number of selected component (if applicable):
kernel-2.4.21-15.0.3.EL
netdump-0.6.11-3

How reproducible:
Always

Steps to Reproduce:
1. Configure a machine with an e100 network card as a netconsole client
2. Unplug the network cable
3. Put it back in after 20 seconds

Actual Results:
Complete system freeze

Expected Results:
No freeze

Additional info:
I have a strong suspicion that the problem has to do with the printk() call in e100_watchdog() before e100_config_fc() and e100_config() are called. At that point, netif_running(dev) is already true, but the interface is not configured, so the interface polling from write_netconsole_msg() may be unable to make any progress, and the interface will never be configured. Since e100_watchdog() won't be set up again before the current execution is done, and interrupts are disabled, we get a hard freeze. I suspect moving the printk past the e100_config() call might fix the problem. It would still be possible to run into an error should any other printk() be issued between the point when netif_running is set and the point when the watchdog kicks in and reconfigures the interface. We'd get into the same deadlock, at least on uniprocessor machines.
I would like to see precisely where we are hung. It would be helpful to run this test on a system with the nmi_watchdog enabled.
Jeffrey, we're stuck in the poll loop in write_netconsole_msg(), attempting to transmit the printk()ed string over the interface that is not (re)configured yet, because it would only be configured when printk returned.
The problem stems from netconsole handling output from printk() synchronously. While that may normally work, it does allow this sort of deadlock to occur. netconsole needs either a mechanism to handle printk() output asynchronously or checks (e.g. netif_queue_stopped()) that allow it to drop printk() output when the output device is unable to transmit.
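For illustration only, here is a minimal userspace sketch of the asynchronous option: the printk-like path copies the message into a small ring and returns immediately, and a later flush sends whatever is queued once the device can transmit. Everything here (struct log_ring, log_enqueue(), log_flush(), the sizes) is hypothetical and not taken from netconsole.c; the can_transmit flag merely stands in for a check such as !netif_queue_stopped(dev).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LOG_SLOTS   4   /* illustrative capacity, an assumption */
#define LOG_MSG_MAX 64  /* illustrative maximum message length, an assumption */

/* Hypothetical deferred-output queue; not a structure from netconsole.c. */
struct log_ring {
    char msgs[LOG_SLOTS][LOG_MSG_MAX];
    unsigned head;     /* index of the oldest queued message */
    unsigned count;    /* number of queued messages */
    unsigned dropped;  /* messages discarded because the ring was full */
};

/* Called from the printk-like path: copies the message and never waits
 * on the device. Returns false if the message had to be dropped. */
static bool log_enqueue(struct log_ring *r, const char *msg)
{
    assert(r != NULL && msg != NULL);
    if (r->count == LOG_SLOTS) {
        r->dropped++;
        return false;
    }
    unsigned tail = (r->head + r->count) % LOG_SLOTS;
    strncpy(r->msgs[tail], msg, LOG_MSG_MAX - 1);
    r->msgs[tail][LOG_MSG_MAX - 1] = '\0';
    r->count++;
    return true;
}

/* Called later, from a context where the device is allowed to make
 * progress. Drains the ring only while can_transmit is true and
 * returns the number of messages sent. */
static unsigned log_flush(struct log_ring *r, bool can_transmit)
{
    unsigned sent = 0;
    assert(r != NULL);
    while (can_transmit && r->count > 0) {
        /* A real implementation would hand r->msgs[r->head] to the driver here. */
        r->head = (r->head + 1) % LOG_SLOTS;
        r->count--;
        sent++;
    }
    return sent;
}
```

The trade-off in this sketch is that messages still sitting in the ring are lost if the machine dies before the next flush, which is the main argument for keeping console output synchronous where possible.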
If we change netconsole.c to discard the printk when netif_queue_stopped returns true, then it is kind of a sledgehammer approach. So far, this thread has only concentrated on this one reason that the queue is stopped. Other reasons include running out of TX descriptors, which can be fixed in the polling loop where we are currently looping forever. IOW, in other cases, calling the poll function for the driver will free up these resources, and we will be able to send out the printk. I spoke with Matt Mackall at OLS, and he mentioned that deadlocks such as this were considered, and the decision was made to handle the issues on a case-by-case basis. (Well, he was actually talking about a deadlock involving taking a spin lock in the interrupt routine and doing a printk with it held.) What this really comes down to is how reliable the network console is expected to be. There are certainly cases where we will have to drop messages, and since the messages are packaged up using UDP, I guess we aren't exactly advertising any guarantees. If we agree on that, then I'll put together a patch which addresses this in netconsole. It may be enough to simply try to poll a few times, and if we don't make any progress, return failure.
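As a rough sketch of the "poll a few times, then give up" idea, the userspace model below bounds the retry loop. It is not the attached patch: struct fake_dev, fake_poll() and send_bounded() are hypothetical names, and the queue_stopped flag only models what netif_queue_stopped() would report. A device that runs out of TX descriptors recovers after a few polls, while a link that is not yet reconfigured never does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a network device; not the kernel's struct net_device. */
struct fake_dev {
    bool queue_stopped;     /* models netif_queue_stopped() returning true */
    int  polls_until_ready; /* polls needed to restart the queue; 0 or less = never */
    int  sent;              /* messages handed to the device */
    int  dropped;           /* messages given up on */
};

/* Models the driver poll routine: it may free TX resources and restart the queue. */
static void fake_poll(struct fake_dev *dev)
{
    if (dev->polls_until_ready > 0 && --dev->polls_until_ready == 0)
        dev->queue_stopped = false;
}

/* Poll at most max_polls times while the queue is stopped. If polling makes
 * no progress, drop the message instead of spinning forever.
 * Returns 0 if the message was sent, -1 if it was dropped. */
static int send_bounded(struct fake_dev *dev, const char *msg, int max_polls)
{
    assert(dev != NULL && msg != NULL && max_polls >= 0);
    (void)msg; /* payload is not inspected in this model */

    for (int i = 0; dev->queue_stopped && i < max_polls; i++)
        fake_poll(dev);

    if (dev->queue_stopped) {
        dev->dropped++;
        return -1;
    }
    dev->sent++;
    return 0;
}
```

With a budget of five polls, a device that needs two polls to free its descriptors still gets the message out, whereas the unconfigured-link case returns -1 and the caller carries on instead of hanging.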
FWIW, I would agree that netconsole's reliability is at most "best effort". Dropping text output while the link goes up and down, while undesirable, seems reasonable -- much more so than hanging the box... :-)
Agreed. Given the argument that packets are sent with UDP, dropping them in cases where we can't transmit doesn't feel inappropriate at all.
Created attachment 102255
retry only n times on failed transmit due to queue_stopped

Here is a patch which implements my last suggestion. It has been tested with numerous cable pulls using the e100 driver. This also makes things happy in the case where you generate a lot of console output while the cable is unplugged (which was also capable of hanging the machine).
Created attachment 102259
fixed version of the patch

I generated this patch before I saved my emacs buffer. Good thing I tested it well!
A fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.3.EL).
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-550.html