Bug 149867 - (IT_73459) Deadlock with NIC teaming/bonding.
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386 Linux
Priority: medium, Severity: high
Assigned To: David Miller
QA Contact: Brian Brock
Reported: 2005-02-28 09:50 EST by Robert Hentosh
Modified: 2007-11-30 17:07 EST (History)
Doc Type: Bug Fix
Last Closed: 2007-07-12 13:31:39 EDT


Attachments
Crash occurring with tg3 (78.77 KB, text/plain)
2005-02-28 09:54 EST, Robert Hentosh
SysRQ of a system with most state information (65.21 KB, text/plain)
2005-02-28 09:59 EST, Robert Hentosh
Objdump of the system with most info. (8.43 MB, application/x-gzip)
2005-02-28 10:09 EST, Robert Hentosh
/proc/ksyms sorted of system with most info (32.91 KB, application/x-gzip)
2005-02-28 10:25 EST, Robert Hentosh
Output from the HW ITP for system with most info (6.98 KB, application/x-gzip)
2005-02-28 10:42 EST, Robert Hentosh
Hack to R/W locks to track netproto (35.22 KB, text/plain)
2005-02-28 14:23 EST, Robert Hentosh
Fix for brlock deadlocks. (5.42 KB, patch)
2005-03-02 13:06 EST, David Miller

Description Robert Hentosh 2005-02-28 09:50:52 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
This appears to be a long-standing issue that was originally reported in Bugzilla #89885.  In that bug we also had stack abuse by the proprietary ESM driver. The stack usage of the driver was reduced and we were then unable to reproduce the problem.

However, on newer, faster machines we have been able to reproduce the original issue and, after much effort, have found the 3-way deadlock condition in the network layer.

The issue has only been seen when some kind of NIC teaming is in use.  We have reproduced it with Intel's iANS, Broadcom's BASP and the native bonding.  We have also used the bcm5700 driver, Intel's e1000 driver, and the tg3 driver.  The rest of the software stack has been Dell's OMSA stack and Samba. The Samba share is exported either from a hard disk or from a RAM disk.  The RAM disk produces the hang a little faster.  We have not been able to reproduce the issue without OMSA installed, but after reviewing the stack traces and ITP dumps we see nothing that suggests OMSA is required. Also, an Internet search turned up one other individual who appears to have hit the same lockup on non-Dell equipment.

We have also been able to reproduce the issue on several kernels, including ones from AS 2.1.

Specifically, the problem is a deadlock between the BR_NETPROTO_LOCK and the dev->xmit_queue lock.  I will follow up with trace files of the lockups.


Version-Release number of selected component (if applicable):
kernel-2.4.21-27.0.2.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
1. Install RHEL3 U4.
2. Configure network teaming of some kind.
3. Stress a Samba share of a local RAM disk.
4. Wait 8 hours or longer (sometimes as long as 10 days).


Actual Results:  System lockup.  Sysrq does work.

Expected Results:  No lockup.

Additional info:
Comment 1 Robert Hentosh 2005-02-28 09:54:28 EST
Created attachment 111472 [details]
Crash occurring with tg3

This is a sysrq output during the hang using a tg3 driver.
Comment 2 Robert Hentosh 2005-02-28 09:59:39 EST
Created attachment 111473 [details]
SysRQ of a system with most state information

This is a sysrq of the system that we have the most information on. In fact, at
this time it is still hooked to an ITP so we can dump any memory location or
register you might still need.  We have collected a lot of that information and
I will also post it if required.  It is based on a modified 2.4.21-27.0.2.ELsmp
kernel that adds "tracing" for the BR_NETPROTO_LOCK. This lock was getting held,
but before the tracing code was added we couldn't see who was holding it.
Comment 3 Robert Hentosh 2005-02-28 10:09:53 EST
Created attachment 111475 [details]
Objdump of the system with most info.

This is a gzip compressed objdump of the modified kernel. Again, the problem
also occurs on unmodified kernels, but we changed this one to trace the
BR_NETPROTO_LOCK in order to root-cause the issue.  It is the one we have the
most information on and it shows the problem most clearly.
Comment 4 Robert Hentosh 2005-02-28 10:25:35 EST
Created attachment 111476 [details]
/proc/ksyms sorted of system with most info
Comment 5 Robert Hentosh 2005-02-28 10:42:40 EST
Created attachment 111480 [details]
Output from the HW ITP for system with most info

This is a dump of some of the registers and stacks of the code that is running
on the system when it is hung.  Using the ITP we were able to determine who had
the locks.  The ITP doesn't enumerate the CPUs the same way as the sysrq output,
so please note that with the ITP:

  P0 = CPU2 in linux
  P1 = CPU3 in linux
  P2 = CPU0 in linux
  P3 = CPU1 in linux

if you are trying to match the ITP dumps with the sysrq output.
Comment 6 Robert Hentosh 2005-02-28 10:50:05 EST
Looking at the files from the system "with the most info", this is what we determined:

   CPU2 has a read lock on BR_NETPROTO
   CPU2 wants the dev->xmit_queue lock

   CPU1 owns the dev->xmit_queue lock (we get that from the lock structure dump)
   CPU1 wants a read lock on BR_NETPROTO

   CPU0 is attempting to acquire a **write** lock on BR_NETPROTO and has locked 0
   and 1 but is spinning on 2.

   CPU3 wants a socket lock that appears to be held by CPU1 via tcp_recvmsg().


The main problem I see is that CPU1 owns the dev lock but is trying to get the
NETPROTO lock, while CPU2 did the exact opposite: it has the NETPROTO lock but
now wants the dev lock. These locks should be obtained in the same order. This
would not normally be a problem since both are read locks on NETPROTO, but then
we hit the rare case of another CPU (here CPU0) that is "half way" through
taking its write lock. The write lock goes through the per-CPU locks from 0 to
3; CPU0 has obtained only the first 2 (those of CPU0 and CPU1) and is now
waiting to get the 3rd and 4th to complete the write lock.
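
For readers following the analysis, the lock state described in this comment can be written down as a small wait-for graph. The C sketch below is only an illustration: the array contents are transcribed from the CPU list above, and the names waits_on and in_wait_cycle are hypothetical, not kernel symbols. It assumes that in the atomic brlock variant a reader takes the lock slot of its own CPU, which is why CPU1 ends up waiting on CPU0.

```c
#include <assert.h>

#define NCPUS  4
#define NOBODY (-1)

/*
 * waits_on[c] is the CPU that holds the lock CPU c is spinning on,
 * or NOBODY if CPU c is not blocked. Values follow the analysis above.
 */
static const int waits_on[NCPUS] = {
    2, /* CPU0: writer, holds slots 0 and 1, spins on slot 2 (read-held by CPU2) */
    0, /* CPU1: holds the xmit lock, wants read slot 1 (write-held by CPU0)      */
    1, /* CPU2: holds read slot 2, wants the xmit lock (held by CPU1)            */
    1, /* CPU3: wants a socket lock that appears to be held by CPU1              */
};

/* Return 1 if following the wait edges from 'start' leads back to 'start'. */
static int in_wait_cycle(const int *graph, int ncpus, int start)
{
    int cur = graph[start];

    for (int steps = 0; steps < ncpus && cur != NOBODY; steps++) {
        if (cur == start)
            return 1;
        cur = graph[cur];
    }
    return 0;
}
```

In this model CPU0, CPU1 and CPU2 each wait on the next one in a closed loop, which is the 3-way deadlock. CPU3 is blocked behind CPU1 but is not itself part of the cycle.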
Comment 7 Robert Hentosh 2005-02-28 14:23:26 EST
Created attachment 111493 [details]
Hack to R/W locks to track netproto

This is just a patch to the base 2.4.21-27.EL kernel that was done to track the
NETPROTO locks to see what was occurring.  It is probably not useful to you now,
but it is here in case you are curious about what changes were made.
Comment 9 David Miller 2005-03-02 13:04:52 EST
It is a known deficiency in the atomic version of the brlock
implementation.  This is what is used on x86 and its cousins.

The non-atomic variant of brlocks does not have this problem.

Essentially, brlocks must give exactly the same semantics
as rwlocks.  This means, in particular, that when writers
try to enter, they must back off if readers are present so
that readers can make forward progress.  The atomic variant
of brlocks does not do this.

This fix is to eliminate the atomic variant of brlocks and always
use the non-atomic variant.  But, we can never include this
fix since it's an incredible kABI breaker.

I'll attach the patch, it's in 2.4.30-preX already.
There are other ways to easily reproduce this bug, mostly
involving adding or removing netfilter rules while input
packet processing is running.  It requires 3 or more cpus.
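
The back-off rule described in this comment can be sketched in user-space C. This is a simplified model written with C11 atomics, not the kernel's brlock code: the struct and the sketch_* function names are hypothetical, and each function makes a single attempt where real code would spin and retry. It assumes per-CPU reader counts plus one writer gate, loosely following the description of the non-atomic variant.

```c
#include <assert.h>
#include <stdatomic.h>

#define NCPUS 4

/* Hypothetical model: per-CPU reader counts plus a single writer gate. */
struct brlock_sketch {
    atomic_int readers[NCPUS];
    atomic_int writer;          /* 1 while a writer holds the gate */
};

/* One reader attempt on 'cpu'; returns 0 if a writer holds the gate. */
static int sketch_read_trylock(struct brlock_sketch *l, int cpu)
{
    atomic_fetch_add(&l->readers[cpu], 1);      /* announce the reader first */
    if (atomic_load(&l->writer)) {
        atomic_fetch_sub(&l->readers[cpu], 1);  /* writer is in: back out */
        return 0;
    }
    return 1;
}

static void sketch_read_unlock(struct brlock_sketch *l, int cpu)
{
    atomic_fetch_sub(&l->readers[cpu], 1);
}

/* One writer attempt; gives the gate back if any reader is present. */
static int sketch_write_trylock(struct brlock_sketch *l)
{
    int expected = 0;

    if (!atomic_compare_exchange_strong(&l->writer, &expected, 1))
        return 0;                               /* another writer is in */
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (atomic_load(&l->readers[cpu]) != 0) {
            atomic_store(&l->writer, 0);        /* reader present: back off */
            return 0;
        }
    }
    return 1;
}

static void sketch_write_unlock(struct brlock_sketch *l)
{
    atomic_store(&l->writer, 0);
}
```

The point of the sketch is that a writer which finds a reader releases the gate instead of keeping a partial hold on it, so a reader on another CPU (or a recursive reader on the same CPU) is not left stuck behind a stalled writer.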
Comment 10 David Miller 2005-03-02 13:06:17 EST
Created attachment 111585 [details]
Fix for brlock deadlocks.

Fix for brlock deadlocks on x86/x86_64/ia64
Comment 12 Robert Hentosh 2005-03-04 11:14:37 EST
Visual inspection of the patch certainly confirms that it will fix this
particular hang.

Which update will include this patch, RHEL3 U5?  The issue seems to
occur in some of our stress test environments. Since it is a hang without a
panic, it may be difficult for phone tech support to diagnose.

Our test cases have been running without failure since the patch was
applied on the 2nd.  One of the test cases would consistently fail in about
8 hours.
Comment 13 David Miller 2005-03-04 13:01:04 EST
If you read my comment, it states that since this patch is a kABI
breaker, it is unlikely we can place it into any RHEL kernel, ever.
And I quote from comment #9:

-----
This fix is to eliminate the atomic variant of brlocks and always
use the non-atomic variant.  But, we can never include this
fix since it's an incredible kABI breaker.
-----
Comment 15 Robert Hentosh 2005-03-04 15:50:47 EST
Sorry, that is odd. I never noticed comment #9 before my post. I only saw my 
comment #7 and your comment #10. I assume comment #8 and #11 are internal 
since I cannot view those either right now.

This (comment #9) is unsettling news.  Is there any way to re-order the locks
so that both threads take the device xmit queue lock consistently before or
after the NETPROTO lock? Then we would not have threads obtaining the locks in
different orders.
Comment 16 David Miller 2005-03-04 15:59:12 EST
It isn't a private comment, I bet they just get renumbered when
the private ones are not displayed, sorry.  I was referring
to the comment I made right before I attached the patch.

There is no way to reorder the locks.  Netfilter needs to grab the
BR_NETPROTO_LOCK recursively as a reader, and there is no way around
that. As long as that is the case, the deadlock can be triggered.
Comment 17 David Miller 2005-03-08 13:06:24 EST
To explain the kABI situation, since we are changing entirely the layout
and behavior of the BR_NETPROTO_LOCK, any kernel module taking these locks
will stop working and need to be recompiled.

The lock is defined in include/linux/brlock.h and lock acquisition occurs
in inline functions callable from any kernel module, so kernel modules know
the layout of these locks.

The suggested patch would break such modules on x86, ia64, and x86_64 platforms.

As one can see, the lock routines and the locks themselves are exported
to modules in kernel/ksyms.c

Any implementation of nontrivial networking (such as a protocol stack)
would need to grab these locks.  It is therefore very likely that some
3rd party kernel module out there makes reference to them and would
break.
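
To make the kABI concern concrete, here is a rough C illustration of what goes wrong when lock code inlined into an old module meets a lock word laid out by a new kernel. The encodings and names below (OLD_IDLE, NEW_IDLE, old_module_read_trylock) are assumptions chosen for the example and are not copied from the kernel headers. The old inline reader is assumed to treat the word as a biased rwlock value, while the new layout is assumed to use a plain per-CPU reader count that is zero when idle.

```c
#include <assert.h>

/* ASSUMPTION: illustrative encodings only, not taken from kernel headers. */
#define OLD_IDLE 0x01000000u  /* old layout: idle word holds a bias value  */
#define NEW_IDLE 0x00000000u  /* new layout: idle word is a zero reader count */

/*
 * Reader as it might be inlined into a module built against the old
 * header: subtract one, and treat a result with the top bit set as
 * "a writer holds the lock".
 */
static int old_module_read_trylock(unsigned int *slot)
{
    unsigned int v = *slot - 1u;

    if (v & 0x80000000u)
        return 0;             /* looks write-locked to the old code */
    *slot = v;
    return 1;
}
```

Under these assumptions an unrebuilt module succeeds on an idle word in the old layout but sees an idle word in the new layout as permanently write-locked, which is one way such a module would "stop working" until it is recompiled.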
Comment 18 Charles Rose 2005-04-12 08:26:54 EDT
4/11/05: Per Sue Denham, RHEL3 U5 Release Notes will have an entry for this issue.
Comment 19 Susan Denham 2005-04-27 23:34:45 EDT
I'm sorry to report that the release notes had closed down by the time I got
this one to the doc team.    We will, however, immediately create a
KnowledgeBase entry that customers and our support folks can view.  I'll have
this info added to the U2 notes.  

Again, my apologies.
Comment 20 Jason Hibbets 2005-06-15 16:56:53 EDT
Update KBASE entry when bug fixed:

FAQ Question:
Why would my system dead lock under network pressure with NIC bonding?

Topic(s):

	* Red Hat Enterprise Linux : AS/ES/WS v. 3 -
http://kbase.redhat.com/faq/FAQ_79_5697.shtm

	* Red Hat Enterprise Linux : Hardware -
http://kbase.redhat.com/faq/FAQ_46_5697.shtm
Comment 22 John Feeney 2007-07-12 13:31:39 EDT
With RHEL3 now in maintenance mode, where only critical customer issues
can be fixed, this bugzilla has been closed as WONTFIX due to its severity
level and a lack of recent activity.
