Bug 152929

Summary: Badness in local_bh_enable at kernel/softirq.c:140
Product: [Fedora] Fedora Reporter: Craig Bernstein <cbernstein>
Component: kernelAssignee: David Miller <davem>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 4CC: davej, williama_lovaton, wtogami
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-05-05 13:09:03 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Info about my hardware
none
Backtrace from april 2
none
Backtrace from april 3
none
Backtrace from april 4
none
Backtrace from april 5
none
Backtrace from april 22 none

Description Craig Bernstein 2005-03-31 01:04:22 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050328 Firefox/1.0 (Ubuntu package 1.0.2)

Description of problem:
Lots of kernel error messages in dmesg on an NFS server.

Version-Release number of selected component (if applicable):
kernel-2.6.10-1.770_FC2

How reproducible:
Always

Steps to Reproduce:
1. Start system, wait for errors.

Actual Results:  Badness in local_bh_enable at kernel/softirq.c:140
 [<c012236b>] local_bh_enable+0x2a/0x65
 [<c026ac5f>] skb_copy_bits+0x167/0x23e
 [<c026a348>] skb_copy+0x91/0xad
 [<f8e112a6>] tigon3_4gb_hwbug_workaround+0x1a/0xdf [tg3]
 [<f8e117c4>] tg3_start_xmit+0x3ec/0x46b [tg3]
 [<c027b41a>] qdisc_restart+0xdb/0x1a5
 [<c026ea9b>] dev_queue_xmit+0xe5/0x1ff
 [<c0287067>] ip_finish_output+0x140/0x1b0
 [<c02879d2>] ip_fragment+0x298/0x655
 [<c0276919>] nf_iterate+0x40/0x89
 [<c0288ff2>] dst_output+0x0/0x1c
 [<c0286f27>] ip_finish_output+0x0/0x1b0
 [<c0288ffd>] dst_output+0xb/0x1c
 [<c0276c6e>] nf_hook_slow+0x81/0xb2
 [<c0288c10>] ip_push_pending_frames+0x2f5/0x3b6
 [<c0288ff2>] dst_output+0x0/0x1c
 [<c029ffa2>] udp_push_pending_frames+0x1d0/0x1ec
 [<c02a06ce>] udp_sendpage+0xf5/0x10f
 [<c02a63ec>] inet_sendpage+0x52/0x81
 [<f8e9d906>] svc_sendto+0x150/0x20e [sunrpc]
 [<c02bcf7a>] __down+0xca/0xd9
 [<f8e9dda4>] svc_udp_sendto+0xe/0x1f [sunrpc]
 [<f8e9ec8c>] svc_send+0xaf/0xda [sunrpc]
 [<f8ef5d31>] nfs3svc_release_fhandle+0x0/0xe [nfsd]
 [<f8e9d201>] svc_process+0x394/0x573 [sunrpc]
 [<c01046b2>] common_interrupt+0x1a/0x20
 [<f8ee9361>] nfsd+0x191/0x2dd [nfsd]
 [<f8ee91d0>] nfsd+0x0/0x2dd [nfsd]
 [<c01021f5>] kernel_thread_helper+0x5/0xb


Additional info:

Comment 1 William Lovaton 2005-04-05 23:51:42 UTC
Hi there guys,

I have a problem with a very similar back trace.  I am using Fedora Core 3 with
latest kernel (2.6.10-1.770_FC3smp) on an IBM xSeries 445, 8 HT proccessors and
18 GB of RAM.

I use this server on a production environment with an Oracle database 9.2.0.6. 
After 14 days of excellent performance it crashed today and we had to reboot. 
Then I checked the syslog and from april 2 to april 5 there were very similar
back traces, once in each day but with very small variations (I'll post the 4
bt's later).

Today, on april 5 it crashed but it is not clear to me if it happened for this
particular problem.  It was happening from april 2 after all.


Comment 2 William Lovaton 2005-04-05 23:59:26 UTC
Created attachment 112740 [details]
Info about my hardware

I think this problem has something to do with a Fibre Channel interface to a
storage area network.

Comment 3 William Lovaton 2005-04-06 00:02:34 UTC
Created attachment 112741 [details]
Backtrace from april 2

Comment 4 William Lovaton 2005-04-06 00:03:33 UTC
Created attachment 112742 [details]
Backtrace from april 3

Comment 5 William Lovaton 2005-04-06 00:04:21 UTC
Created attachment 112743 [details]
Backtrace from april 4

Comment 6 William Lovaton 2005-04-06 00:06:07 UTC
Created attachment 112744 [details]
Backtrace from april 5

This is the day when the system crashed but I don't if it is related to the
backtrace.

Comment 7 William Lovaton 2005-04-06 00:28:57 UTC
Note that all of the backtraces looks very similar but I diffed them and they
have small variations with each other.

As aditional info, I have some problems with the storage system during
installation since it confused the local SCSI devices with the "external" SCSI
devices in the SAN.  The installer swaped them and from sda to sdd corresponds
to the SAN and sde is the local SCSI.  It didn't boot, in fact it doesn't reach
grub.

So, I had to reinstall the system with the external devices unplugged and only
after installation (on local sda) we plugged the SAN back and it worked with a
little detail:  sda is now sde and the SAN is back to sda-sdd but everything
works ok.  This is my df -h:
[root@dnccor50 log]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sde2              50G  8.1G   39G  18% /
/dev/sde1             213M   18M  185M   9% /boot
none                  9.0G     0  9.0G   0% /dev/shm
/dev/sdb1              79G   46G   30G  61% /ubas
/dev/sdc1             345G  101M  327G   1% /cooeps

Specifically we are using sdb and sdc.  According to it there is no sda nor sdd,
hence, during boot and constantly after boot I get this messages in the log:

dnccor50 kernel: Device sda not ready.
dnccor50 kernel: end_request: I/O error, dev sda, sector 734003072
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750384
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750385
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750386
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750387
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750388
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750389
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750390
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750391
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750392
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750393
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750394
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750395
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750396
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750397
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750398
dnccor50 kernel: Buffer I/O error on device sda, logical block 91750399

Besides this, the server has 2 1GB ethernet NICs (tg3 module and it does appear
in the attached logs) connected to 100Mbps switches but the second one always
connect half duplex and I have to force it to full duplex with:
mii-tool --force=100baseTx-FD eth1

After that, this is the output:
[root@dnccor50 log]# mii-tool
eth0: negotiated 100baseTx-FD, link ok
eth1: 100 Mbit, full duplex, link ok

I hope this influx of logs and info helps to solve this problem.

Comment 8 William Lovaton 2005-04-06 13:06:00 UTC
Today I got another similar backtrace in the log this morning but it hasn't
crashed yet.  The interesting thing I see is that they are happening almost at
the very same hour: around 5 in the morning (+/- 5 minutes).  Should be
something happening on my network at that time?

Another thing I forgot to mention is that my web server is using almost an
identical installation.  The server is the exact same model than the database
server with the same hardware that have this problem and is the very same Fedora
Core 3 with the same updates but it doesn't have any problem at all.  It doesn't
show anything abnormal in the logs and have had uptimes of 75+ days.

So this prblem might be triggered either by the Oracle database software or the
SAN/Fiber Channel interface configuration where the Oracle datafiles resides.


Comment 9 William Lovaton 2005-04-06 16:17:42 UTC
Hi again,  digging a bit deeper I think this bug has something to do with the
fact that I have more than 4gb of RAM (18GB actually) and some problem with the
tg3 NIC driver because the web server I mentioned before doesn't have this
problem and is working right now with 2GB of RAM and 4 HT proccessors because
the secondary motherboard is deactivated (which have the rest of the proccessors
and the RAM) and that server doesn't show up any problem, it works flawlessly. 
Remember both servers are identical.

Just take a look to the backtraces, all of them have this function call:
[<f89402ea>] tigon3_4gb_hwbug_workaround+0x1a/0xdb [tg3]

which is almost the same as the reported by the original creator of this bug:
[<f8e112a6>] tigon3_4gb_hwbug_workaround+0x1a/0xdf [tg3]


Comment 10 William Lovaton 2005-04-06 17:19:22 UTC
Doing even more research, I just found this patch from David S. Miller:
[] http://www.mail-archive.com/bk-commits-24@vger.kernel.org/msg00327.html

It is dated march 23 so I guess this is something for the next FC kernel update.

The idea for this patch came from this thread on march 22:
[] http://thread.gmane.org/gmane.linux.kernel/289615

Any clue if this has something to do with this bug report?

Comment 11 David Miller 2005-04-06 18:02:00 UTC
With HIGHMEM enabled, kmap_skb_frag() BUG()'s when in an interrupt
handler.  Furthermore, since kmap_skb_frag() disables and kunmap_skb_frag()
enables BH's, this means we can't invoke this stuff with interrupts
disabled either.

Unfortunately, this means that skb_copy(), even with GFP_ATOMIC,
cannot be invoked with interrupts disabled.  It can only be
invoked with BHs disabled.

There is no easy way to fix this in the tg3 driver, we really have
to hold the TX spinlock with interrupts disabled while we perform
the workaround to fixup a DMA address we know will confuse the chip.

I'll think some more about this.


Comment 12 William Lovaton 2005-04-06 18:47:39 UTC
Thanx Dave,

The doubt remains, is this bug able to make the system crash?  I mean, there
were 4 reports in the syslog before it crashed.  Even more the backtraces were
reported in the morning and the crash was in the afternoon.

Why, after 14 days of heavy work, it would start to log this kind of backtraces
in the last 4 days?

Another question, is this the same situation for any other network card/driver?
what if I plug the NIC to a 1GB switch?

And finally, what is BH?

Comment 13 William Lovaton 2005-04-12 13:03:31 UTC
Odd, the last time the server syslogged this problem was april 6, one day later
after the reboot.  Since then, it have been working fine without any problem,
the syslog doesn't report anything special.

Any idea about what is triggering this problem? what kind of workload, etc?

Comment 14 Dave Jones 2005-04-16 06:09:48 UTC
Fedora Core 2 has now reached end of life, and no further updates will be
provided by Red Hat.  The Fedora legacy project will be producing further kernel
updates for security problems only.

If this bug has not been fixed in the latest Fedora Core 2 update kernel, please
try to reproduce it under Fedora Core 3, and reopen if necessary, changing the
product version accordingly.

Thank you.


Comment 15 William Lovaton 2005-04-18 13:38:34 UTC
davej,

Please reopen this bug report and change the version to 'fc3', I'm not the
original reporter and my problem is showing up with an 'almost' current Fedora
Core 3 system as I mentioned in comment #1.  Even with that the bug seems to be
happening in the very same line in softirq.c (line 140) than FC2 systems.

David Miller has identified the problem and is working on it.  I'll send him
some new info.

Comment 16 William Lovaton 2005-04-18 14:04:57 UTC
As an extra info about my system.  I'm using this server for an Oracle 9.2.0.6
database with several transactions through a web system using Linux/Apache/PHP.
 The kernel configuration changes I added to /etc/sysctl.conf are the following:

kernel.shmmax=2147483648
kernel.sem=820 104960 820 128
fs.file-max=131072
net.ipv4.ip_local_port_range=1024 65000

Is there a problem with this configuration?  Most of these was taken from the
Oracle configuration guide and some common sense... I hope. :-)

Besides, after the last reboot, the backtrace showed up again one day later
(April 6), but no more after that and ifconfig is giving me this:

[root@dnccor50 nalwalovaton]# /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:09:6B:E6:42:1D
          inet addr:192.1.2.7  Bcast:192.1.2.255  Mask:255.255.255.0
          inet6 addr: fe80::209:6bff:fee6:421d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3938839 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1288902 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:989626748 (943.7 MiB)  TX bytes:438869885 (418.5 MiB)
          Interrupt:217

eth1      Link encap:Ethernet  HWaddr 00:09:6B:16:42:1D
          inet addr:192.1.26.16  Bcast:192.1.26.255  Mask:255.255.255.0
          inet6 addr: fe80::209:6bff:fe16:421d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:414640151 errors:0 dropped:0 overruns:0 frame:0
          TX packets:415983358 errors:99 dropped:0 overruns:0 carrier:0
          collisions:92371 txqueuelen:1000
          RX bytes:4148049439 (3.8 GiB)  TX bytes:728909046 (695.1 MiB)
          Interrupt:11


Note the 'errors:99' in eth1 since April 6 and stable after that.  Both NICs are
exactly the same and eth0 is no showing any errors.  I must add that eth1 is
used to connect the web server and the database server and eth0 is left for
occasional users using sqlplus or toad and for admin purposes (ssh, ftp, etc),
so the traffic is a lot less.

Web server:
eth0 -> for webapp users traffic.
eth1 -> database traffic.

Database server:
eth0 -> Random users and system administration
eth1 -> web server traffic

Note (again), that both server are exactly the same brand and model but the web
server is using 2GB of RAM as of now hence the superb stability (Both are FC3
systems).

Comment 17 Dave Jones 2005-04-18 18:11:43 UTC
thanks for the update

Comment 18 William Lovaton 2005-04-22 13:25:38 UTC
Created attachment 113556 [details]
Backtrace from april 22

There is another backtrace in the syslog, but this time it is double and the
system hasn't crashed yet.  Right now the machine has almost 17 days of uptime.
 Note that it happens almost at the same time of the day that the previous
logs.

Is there any clue if this problem can make the system crash? What can I do to
avoid this problem?

Comment 19 William Lovaton 2005-06-21 01:14:42 UTC
Hi,

I can confirm that this backtrace is not the one who makes the whole system
crash.  It seems to be something "normal" when it is transmitting a huge file
through the network but still it doesn't happen so often.

Do you remember that those backtraces were happening at 5 AM?  Well, it was
because at that time there was a cronjob copying an Oracle database export to
another machine where the "official" backup is stored.  This is done daily.

Now, we changed the cronjob to 3AM and now the few backtraces so far are
happening at 3 AM.  Coincidence? I don't think so.

So far it seems harmless.  What do you think?

PS. The server crashed today after 75 days of work.  Seems related to the fibre
channel interface.

Comment 20 David Miller 2005-06-21 01:26:20 UTC
The only way to reliably avoid the problem is not to use HIGHMEM,
and I realize that likely isn't an option.

I've been working on a bug fix for this, for 2 weeks, and I'll try
to finish it up soon.


Comment 21 Dave Jones 2005-07-15 20:41:37 UTC
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.

Comment 22 William Lovaton 2005-07-29 13:46:05 UTC
dave[jm],

Since this is a production machine we haven't updated anything after deployment
because of the fear to "fix what is not borken". But since we have been
experimenting one last problem with the Oracle proccesses blocking each other,
we will plan to make a huge 'yum update' some time later and see what happens. 
Right now we are using 2.6.10-1.770_FC3smp.

Once we do that I'll post the results here, so please be patient.

Thanks for the effort.


Comment 23 William Lovaton 2005-09-29 12:19:24 UTC
dave[jm],

I will be reporting back soon about this bug.  Since this is a very critical
server for our company, it is very very hard to get permission to apply all the
updates available in FC3.  Last saturday we managed to update our web server and
this saturday we'll be updating the database server (the one mentioned in this bug).

After the updates are applied I'll be monitoring the system to see what happens.

Comment 24 Dave Jones 2006-01-16 22:32:15 UTC
This is a mass-update to all currently open Fedora Core 3 kernel bugs.

Fedora Core 3 support has transitioned to the Fedora Legacy project.
Due to the limited resources of this project, typically only
updates for new security issues are released.

As this bug isn't security related, it has been migrated to a
Fedora Core 4 bug.  Please upgrade to this newer release, and
test if this bug is still present there.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

Thank you.


Comment 25 Dave Jones 2006-02-03 07:26:27 UTC
This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.


Comment 26 John Thacker 2006-05-05 13:09:03 UTC
Closing per previous comment.