Description of problem:
"BUG: soft lockup detected on CPU#1!" occurs, with a partial display of registers and traceback on the console. In one instance I did catch the entire message in the log, where it was not lost by filesystem recovery upon boot. The server then becomes totally unresponsive to network and console, forcing a hard power off and reboot.

Version-Release number of selected component (if applicable):
kernel-2.6.16-1.2111_FC5

How reproducible:
This machine is a new iptables based firewall between a 5Mbps capable Internet service and a private network. The lockup happened twice within a relatively short time while running speed tests of the Internet connection through this firewall. The tests used speakeasy.net/speedtest/ and were conducted by a client on the private network side of the firewall.

Steps to Reproduce:
1. Define iptables firewall rules and tcpv4 forwarding.
2. Bring up firewall with Internet and private network interfaces up.
3. Run the speed test.

Actual results:
System locks up.

Expected results:
Stable continued execution.

Additional info:
This is an Opteron 280 based HP DL385 machine running a kernel released through Fedora updates. No customization of the kernel, and no third party modules. Log excerpt attached.
Created attachment 129320 [details] A few iptables kernel log lines followed by the BUG entry in /var/log/messages
Created attachment 129321 [details] dmidecode output from this machine
I removed ip_conntrack_netbios_ns from the iptables-config loaded modules and restarted iptables this morning. The server has not yet crashed since then in spite of repeated Internet speed tests. That might be coincidence, but since that is a relatively new module and the traceback indicated iptables connection tracking involvement, I thought it worth testing. I will post again tomorrow if the system remains stable, or sooner if it locks up.
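For anyone repeating this test, the module removal can be sketched in Python. This is only an illustration, not the reporter's procedure: it assumes the modules are listed in a quoted `IPTABLES_MODULES="..."` line of the iptables-config file, and the helper name `drop_module` is hypothetical.

```python
import re

def drop_module(config_text, module):
    """Return config_text with `module` removed from the IPTABLES_MODULES line.

    Assumption: the file holds a line of the form IPTABLES_MODULES="mod1 mod2".
    All other lines are returned unchanged.
    """
    def rewrite(match):
        # Keep every listed module except the one being dropped.
        kept = [name for name in match.group(2).split() if name != module]
        joined = " ".join(kept)
        return match.group(1) + '"' + joined + '"'

    pattern = r'^(\s*IPTABLES_MODULES=)"([^"]*)"'
    return re.sub(pattern, rewrite, config_text, flags=re.MULTILINE)
```

The function only transforms text; writing the result back to the config file and restarting iptables remain separate manual steps.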
Uptime 38 hours so far. If it is still up tomorrow morning I will re-include ip_conntrack_netbios_ns in loaded modules, restart iptables, and see if it soon breaks.
On 2006-05-19 I restarted iptables with ip_conntrack_netbios_ns included in the loaded modules. It ran without lockup until I rebooted into the 2.6.16-1.2122_FC5 kernel on 2006-05-26 at 17:31 (without ip_conntrack_netbios_ns). So what I thought would reproduce the lockup did not. But the 2122 kernel, under production network traffic, promptly locked up 3 times in less than 2 hours: sometimes on CPU#0, sometimes on CPU#1, and once during boot, but with little to no output on the console and nothing in the log. The last time I cycled power and booted, I switched back to the 2111 kernel to see if it is more stable.
2111 is going into soft lockup in less than an hour under consistent network load. It has locked up 3 or 4 times in 4 hours, including the time it sat in lockup while I grabbed a late dinner.
Back on the 2122 kernel. I believe I have found the condition that exposes this lockup bug on my firewall: the vga=792 boot option. Here is the collection of observations that led me to investigate it:

1) The first few lines of bug message output that often do show on the console referenced, several times, something about a console semaphore.
2) The console typeout is strangely slow on this hardware with my usual vga=792 boot option (I like to see more on the screen). As a firewall, this machine logs (with rate limiting) various iptables rejects. So the slow console output could contribute to a race condition in relation to SMP and iptables logging.
3) I found some online info about someone using an NV driver getting a "soft lockup detected on CPU#n" that was fixed with newer NV code.
4) It is my understanding that the vga=792 boot option causes use of a framebuffer driver.

Booting without vga=792, the firewall has not locked up. It had about 1.5 hours of the same network load before I lost my test window, and with no traffic it has remained up ever since.

I then went back through the previous week's logs and verified that the one time I booted 2111 and it stayed up for 9 days, I had on that occasion edited the boot line and booted without vga=792. The Bootdata log entry verified that. Examining the Bootdata entries for all the times the system did crash (usually in less than an hour), they all had vga=792.

I will do some more testing of the DL385 under production load later in the week for higher confidence, but this is the only factor I have found that is consistent with when the lockup bug presents and when it does not. I'm not a kernel developer, but it seems likely that this merely influences the likelihood of recreating the conditions that trigger the bug, and that the root cause has nothing to do with the video driver.
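The log review described above could be automated roughly as follows. This is a sketch under stated assumptions, not the method actually used: it assumes each boot's command line appears in the log on a line containing "Kernel command line:" (the "Bootdata" entries mentioned above may be formatted differently), and the function name `summarize_boots` is hypothetical.

```python
import re

# Message text taken from this report; the CPU number varies.
LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)")
# Assumed marker for the line that records the boot command line.
CMDLINE_MARK = "Kernel command line:"

def summarize_boots(lines, option="vga=792"):
    """Return one dict per boot found in `lines`.

    Each dict records whether `option` was on the boot command line and
    which CPUs logged a soft lockup before the next boot.
    """
    boots = []
    for line in lines:
        if CMDLINE_MARK in line:
            # A new boot starts here; split the command line into arguments.
            args = line.split(CMDLINE_MARK, 1)[1].split()
            boots.append({"has_option": option in args, "lockup_cpus": []})
        else:
            match = LOCKUP_RE.search(line)
            if match and boots:
                boots[-1]["lockup_cpus"].append(int(match.group(1)))
    return boots
```

Because the lockup often left nothing in the log on this machine, an empty `lockup_cpus` list does not prove a boot was stable; the interval between consecutive boot entries would still have to be checked by hand.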
Believing the bug to be uniquely an SMP kernel issue, I have temporarily replaced the DL385 exhibiting the bug with a DL360 running the identical software versions and configuration (without vga=792), and it does not have the lockup problem. The DL360 is a uniprocessor machine with a 1GHz PIII Coppermine CPU, which of course is running the i686 2.6.16-1.2122_FC5 kernel. The uniprocessor system has been running with production network load for 33 hours now.
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem.

This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks' time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug.

In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem, please check that you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details.

If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem. Thank you.
(this is a mass-close to kernel bugs in NEEDINFO state) As indicated previously there has been no update on the progress of this bug therefore I am closing it as INSUFFICIENT_DATA. Please re-open if the issue still occurs for you and I will try to assist in its resolution. Thank you for taking the time to report the initial bug. If you believe that this bug was closed in error, please feel free to reopen this bug.