Bug 152929
Summary: Badness in local_bh_enable at kernel/softirq.c:140

Product: Fedora
Component: kernel
Version: 4
Hardware: i686
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
Reporter: Craig Bernstein <cbernstein>
Assignee: David Miller <davem>
QA Contact: Brian Brock <bbrock>
CC: davej, williama_lovaton, wtogami
Doc Type: Bug Fix
Last Closed: 2006-05-05 13:09:03 UTC
Description
Craig Bernstein
2005-03-31 01:04:22 UTC
Hi there guys, I have a problem with a very similar backtrace. I am using Fedora Core 3 with the latest kernel (2.6.10-1.770_FC3smp) on an IBM xSeries 445 with 8 HT processors and 18 GB of RAM. I use this server in a production environment with an Oracle 9.2.0.6 database. After 14 days of excellent performance it crashed today and we had to reboot. I then checked the syslog, and from April 2 to April 5 there were very similar backtraces, one each day but with very small variations (I'll post the four backtraces later). Today, April 5, it crashed, but it is not clear to me whether it happened because of this particular problem. It had been happening since April 2, after all.

Created attachment 112740 [details]
Info about my hardware
I think this problem has something to do with a Fibre Channel interface to a
storage area network.
Created attachment 112741 [details]
Backtrace from April 2
Created attachment 112742 [details]
Backtrace from April 3
Created attachment 112743 [details]
Backtrace from April 4
Created attachment 112744 [details]
Backtrace from April 5
This is the day the system crashed, but I don't know if it is related to the
backtrace.
Note that all of the backtraces look very similar, but I diffed them and they have small variations from each other.

As additional info, I had some problems with the storage system during installation, since it confused the local SCSI devices with the "external" SCSI devices in the SAN. The installer swapped them, so sda through sdd corresponded to the SAN and sde was the local SCSI. It didn't boot; in fact it didn't even reach grub. So I had to reinstall the system with the external devices unplugged, and only after installation (on the local sda) did we plug the SAN back in. It worked, with one small detail: sda is now sde and the SAN is back to sda-sdd, but everything works OK. This is my df -h:

    [root@dnccor50 log]# df -h
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sde2              50G  8.1G   39G  18% /
    /dev/sde1             213M   18M  185M   9% /boot
    none                  9.0G     0  9.0G   0% /dev/shm
    /dev/sdb1              79G   46G   30G  61% /ubas
    /dev/sdc1             345G  101M  327G   1% /cooeps

Specifically, we are using sdb and sdc. According to this there is no sda or sdd; hence, during boot and constantly after boot I get these messages in the log:

    dnccor50 kernel: Device sda not ready.
    dnccor50 kernel: end_request: I/O error, dev sda, sector 734003072
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750384
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750385
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750386
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750387
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750388
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750389
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750390
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750391
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750392
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750393
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750394
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750395
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750396
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750397
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750398
    dnccor50 kernel: Buffer I/O error on device sda, logical block 91750399

Besides this, the server has two 1 Gb Ethernet NICs (the tg3 module, which does appear in the attached logs) connected to 100 Mbps switches, but the second one always connects at half duplex and I have to force it to full duplex with:

    mii-tool --force=100baseTx-FD eth1

After that, this is the output:

    [root@dnccor50 log]# mii-tool
    eth0: negotiated 100baseTx-FD, link ok
    eth1: 100 Mbit, full duplex, link ok

I hope this influx of logs and info helps to solve this problem.

Today I got another similar backtrace in the log this morning, but the system hasn't crashed yet. The interesting thing I see is that they are happening at almost the very same hour: around 5 in the morning (+/- 5 minutes). Could something be happening on my network at that time?
Another thing I forgot to mention is that my web server is using an almost identical installation. The server is the exact same model as the database server, with the same hardware that has this problem, and runs the very same Fedora Core 3 with the same updates, but it doesn't have any problem at all. It doesn't show anything abnormal in the logs and has had uptimes of 75+ days. So this problem might be triggered either by the Oracle database software or by the SAN/Fibre Channel interface configuration where the Oracle datafiles reside.

Hi again, digging a bit deeper I think this bug has something to do with the fact that I have more than 4 GB of RAM (18 GB actually) and some problem with the tg3 NIC driver, because the web server I mentioned before doesn't have this problem. It is working right now with 2 GB of RAM and 4 HT processors, because its secondary motherboard (which has the rest of the processors and the RAM) is deactivated, and that server doesn't show any problem; it works flawlessly. Remember, both servers are identical. Just take a look at the backtraces; all of them have this function call:

    [<f89402ea>] tigon3_4gb_hwbug_workaround+0x1a/0xdb [tg3]

which is almost the same as the one reported by the original creator of this bug:

    [<f8e112a6>] tigon3_4gb_hwbug_workaround+0x1a/0xdf [tg3]

Doing even more research, I just found this patch from David S. Miller:

    http://www.mail-archive.com/bk-commits-24@vger.kernel.org/msg00327.html

It is dated March 23, so I guess this is something for the next FC kernel update. The idea for this patch came from this thread on March 22:

    http://thread.gmane.org/gmane.linux.kernel/289615

Any clue if this has something to do with this bug report?

With HIGHMEM enabled, kmap_skb_frag() BUG()'s when in an interrupt handler. Furthermore, since kmap_skb_frag() disables and kunmap_skb_frag() enables BHs, this means we can't invoke this stuff with interrupts disabled either.
Unfortunately, this means that skb_copy(), even with GFP_ATOMIC, cannot be invoked with interrupts disabled. It can only be invoked with BHs disabled. There is no easy way to fix this in the tg3 driver; we really have to hold the TX spinlock with interrupts disabled while we perform the workaround to fix up a DMA address we know will confuse the chip. I'll think some more about this.

Thanx Dave. The doubt remains: can this bug make the system crash? I mean, there were 4 reports in the syslog before it crashed. What's more, the backtraces were reported in the morning and the crash was in the afternoon. Why, after 14 days of heavy work, would it start to log this kind of backtrace in the last 4 days? Another question: is this the same situation for any other network card/driver? What if I plug the NIC into a 1 Gb switch? And finally, what is BH?

Odd, the last time the server syslogged this problem was April 6, one day after the reboot. Since then, it has been working fine without any problem; the syslog doesn't report anything special. Any idea about what is triggering this problem? What kind of workload, etc.?

Fedora Core 2 has now reached end of life, and no further updates will be provided by Red Hat. The Fedora Legacy project will be producing further kernel updates for security problems only. If this bug has not been fixed in the latest Fedora Core 2 update kernel, please try to reproduce it under Fedora Core 3, and reopen if necessary, changing the product version accordingly. Thank you.

davej, please reopen this bug report and change the version to 'fc3'. I'm not the original reporter, and my problem is showing up with an 'almost' current Fedora Core 3 system, as I mentioned in comment #1. Even so, the bug seems to be happening at the very same line in softirq.c (line 140) as on FC2 systems. David Miller has identified the problem and is working on it. I'll send him some new info.

Some extra info about my system:
I'm using this server for an Oracle 9.2.0.6 database with many transactions through a web system using Linux/Apache/PHP. The kernel configuration changes I added to /etc/sysctl.conf are the following:

    kernel.shmmax=2147483648
    kernel.sem=820 104960 820 128
    fs.file-max=131072
    net.ipv4.ip_local_port_range=1024 65000

Is there a problem with this configuration? Most of these were taken from the Oracle configuration guide, plus some common sense... I hope. :-)

Besides, after the last reboot, the backtrace showed up again one day later (April 6), but not again after that, and ifconfig is giving me this:

    [root@dnccor50 nalwalovaton]# /sbin/ifconfig
    eth0  Link encap:Ethernet  HWaddr 00:09:6B:E6:42:1D
          inet addr:192.1.2.7  Bcast:192.1.2.255  Mask:255.255.255.0
          inet6 addr: fe80::209:6bff:fee6:421d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3938839 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1288902 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:989626748 (943.7 MiB)  TX bytes:438869885 (418.5 MiB)
          Interrupt:217

    eth1  Link encap:Ethernet  HWaddr 00:09:6B:16:42:1D
          inet addr:192.1.26.16  Bcast:192.1.26.255  Mask:255.255.255.0
          inet6 addr: fe80::209:6bff:fe16:421d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:414640151 errors:0 dropped:0 overruns:0 frame:0
          TX packets:415983358 errors:99 dropped:0 overruns:0 carrier:0
          collisions:92371 txqueuelen:1000
          RX bytes:4148049439 (3.8 GiB)  TX bytes:728909046 (695.1 MiB)
          Interrupt:11

Note the 'errors:99' on eth1; it has been there since April 6 and stable after that. Both NICs are exactly the same, and eth0 is not showing any errors. I should add that eth1 is used to connect the web server and the database server, while eth0 is left for occasional users using sqlplus or toad and for admin purposes (ssh, ftp, etc.), so its traffic is a lot less.

    Web server:      eth0 -> webapp user traffic
                     eth1 -> database traffic
    Database server: eth0 -> random users and system administration
                     eth1 -> web server traffic

Note (again) that both servers are exactly the same brand and model, but the web server is using 2 GB of RAM as of now, hence the superb stability (both are FC3 systems).

thanks for the update

Created attachment 113556 [details]
Backtrace from April 22
There is another backtrace in the syslog, but this time it appears twice, and the system hasn't crashed yet. Right now the machine has almost 17 days of uptime. Note that it happened at almost the same time of day as the previous logs.

Is there any clue whether this problem can make the system crash? What can I do to avoid this problem?
Hi, I can confirm that this backtrace is not the one that makes the whole system crash. It seems to be something "normal" when it is transmitting a huge file through the network, but it still doesn't happen very often. Remember that those backtraces were happening at 5 AM? Well, it was because at that time a cronjob was copying an Oracle database export to another machine where the "official" backup is stored. This is done daily. Now we have changed the cronjob to 3 AM, and the few backtraces since then are happening at 3 AM. Coincidence? I don't think so. So far it seems harmless. What do you think?

PS. The server crashed today after 75 days of work. It seems related to the Fibre Channel interface.

The only way to reliably avoid the problem is not to use HIGHMEM, and I realize that likely isn't an option. I've been working on a bug fix for this for 2 weeks, and I'll try to finish it up soon.

An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which may contain a fix for your problem. Please update to this new kernel, and report whether or not it fixes your problem. If you have updated to Fedora Core 4 since this bug was opened, and the problem still occurs with the latest updates for that release, please change the version field of this bug to 'fc4'. Thank you.

dave[jm], since this is a production machine we haven't updated anything after deployment, for fear of "fixing what is not broken". But since we have been experiencing one remaining problem with Oracle processes blocking each other, we plan to do a big 'yum update' some time later and see what happens. Right now we are using 2.6.10-1.770_FC3smp. Once we do that I'll post the results here, so please be patient. Thanks for the effort.

dave[jm], I will be reporting back soon about this bug. Since this is a very critical server for our company, it is very, very hard to get permission to apply all the updates available in FC3.
Last Saturday we managed to update our web server, and this Saturday we'll be updating the database server (the one mentioned in this bug). After the updates are applied I'll be monitoring the system to see what happens.

This is a mass update to all currently open Fedora Core 3 kernel bugs. Fedora Core 3 support has transitioned to the Fedora Legacy project. Due to the limited resources of this project, typically only updates for new security issues are released. As this bug isn't security related, it has been migrated to a Fedora Core 4 bug. Please upgrade to this newer release, and test whether this bug is still present there. This bug has been placed in NEEDINFO_REPORTER state. Due to the large volume of inactive bugs in Bugzilla, if this bug is still in this state in two weeks' time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. Thank you.

This is a mass update to all currently open kernel bugs. A new kernel update has been released (version 2.6.15-1.1830_FC4) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO_REPORTER state. Due to the large volume of inactive bugs in Bugzilla, if this bug is still in this state in two weeks' time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. Thank you.

Closing per previous comment.