Bug 447942 - openib broken in 2.6.24.7-55.el5rt
Summary: openib broken in 2.6.24.7-55.el5rt
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: beta
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.0.3
Target Release: ---
Assignee: Jon Masters
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-05-22 15:28 UTC by Gurhan Ozen
Modified: 2008-10-07 19:20 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-10-07 19:20:26 UTC
Target Upstream Version:
Embargoed:


Attachments
patch to correct locking order of ib driver (1.34 KB, patch)
2008-10-06 19:14 UTC, Steven Rostedt


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2008:0857 0 normal SHIPPED_LIVE Important: kernel security and bug fix update 2008-10-07 19:18:59 UTC

Description Gurhan Ozen 2008-05-22 15:28:10 UTC
Description of problem:
I don't know what has happened since the last time I tested openib with the RT
kernel, but the openib modules aren't happy in the 2.6.24.7-55.el5rt kernel:

modprobe: page allocation failure. order:0, mode:0x20
Pid: 5759, comm: modprobe Not tainted 2.6.24.7-55.el5rt #1
 [<c046f51e>] __alloc_pages+0x2bb/0x2cb
 [<f8c6ba25>] ipoib_cm_alloc_rx_skb+0xd5/0x224 [ib_ipoib]
 [<f8c6d249>] ipoib_cm_dev_init+0x594/0x5e0 [ib_ipoib]
 [<f8c6ade1>] ipoib_transport_dev_init+0xd9/0x3b4 [ib_ipoib]
 [<f8c67feb>] ipoib_ib_dev_init+0x2f/0x72 [ib_ipoib]
 [<f8c64844>] ipoib_dev_init+0xac/0xd0 [ib_ipoib]
 [<f8c64a8d>] ipoib_add_one+0x225/0x3d6 [ib_ipoib]
 [<f890724b>] ib_register_client+0x4b/0x70 [ib_core]
 [<f89e20d5>] ipoib_init_module+0xd5/0xfc [ib_ipoib]
 [<c044ebfd>] sys_init_module+0x1494/0x15cb
 [<c045f662>] ? audit_syscall_entry+0x113/0x13d
 [<c0406e64>] ? do_syscall_trace+0x14c/0x198
 [<c0404102>] syscall_call+0x7/0xb
 =======================


WARNING: at kernel/rtmutex.c:1723 rt_read_fastunlock()
Pid: 7566, comm: ibv_rc_pingpong Not tainted 2.6.24.7-55.el5rt #1
 [<c044b181>] rt_mutex_up_read+0x19e/0x1fa
 [<c044b93a>] rt_up_read+0x8/0xa
 [<f8c05037>] put_uobj_read+0xe/0x18 [ib_uverbs]
 [<f8c05073>] put_pd_read+0xb/0xd [ib_uverbs]
 [<f8c07599>] ib_uverbs_create_qp+0x34e/0x477 [ib_uverbs]
 [<f8c04960>] ? ib_uverbs_qp_event_handler+0x0/0x3d [ib_uverbs]
 [<f8c0724b>] ? ib_uverbs_create_qp+0x0/0x477 [ib_uverbs]
 [<f8c046f5>] ib_uverbs_write+0x99/0xac [ib_uverbs]
 [<f8c0465c>] ? ib_uverbs_write+0x0/0xac [ib_uverbs]
 [<c048af26>] vfs_write+0xa8/0x15c
 [<c048b58b>] sys_write+0x3d/0x61
 [<c0404102>] syscall_call+0x7/0xb
 =======================


opensm[5395]: segfault at 38 rip 433c12 rsp 43805f50 error 4


Version-Release number of selected component (if applicable):
kernel: 2.6.24.7-55.el5rt

#  rpm -qa | egrep "libib|openib|opensm"
openib-1.3-3.el5
libibumad-1.1.7-1.el5
libibumad-devel-1.1.7-1.el5
libibcm-static-1.0.2-1.el5
opensm-3.1.8-1.el5
libibcommon-debuginfo-1.0.8-1.el5
opensm-libs-3.1.8-1.el5
libibverbs-devel-1.1.1-9.el5
libibmad-devel-1.1.6-1.el5
libibmad-static-1.1.6-1.el5
libibverbs-debuginfo-1.1.1-9.el5
libibcm-debuginfo-1.0.2-1.el5
libibverbs-1.1.1-9.el5
libibcm-1.0.2-1.el5
opensm-static-3.1.8-1.el5
libibumad-static-1.1.7-1.el5
libibverbs-static-1.1.1-9.el5
libibverbs-utils-1.1.1-9.el5
opensm-debuginfo-3.1.8-1.el5
libibumad-debuginfo-1.1.7-1.el5
libibcommon-1.0.8-1.el5
libibcommon-devel-1.0.8-1.el5
libibcm-devel-1.0.2-1.el5
libibcommon-static-1.0.8-1.el5
libibmad-debuginfo-1.1.6-1.el5
libibmad-1.1.6-1.el5
opensm-devel-3.1.8-1.el5


How reproducible:
Very

Steps to Reproduce:
1. Install RHEL5.2
2. Install openib packages 
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Clark Williams 2008-06-02 19:03:51 UTC
Possibly a module-load problem? Assigning to jcm for now.

Comment 2 Gurhan Ozen 2008-06-09 03:39:40 UTC
I was told to try a newer (-62+) kernel, but still no luck:

WARNING: at kernel/rtmutex.c:1852 rt_read_fastunlock()
Pid: 16811, comm: ibv_rc_pingpong Not tainted 2.6.24.7-65.el5rt #1

Call Trace:
 [<ffffffff811357b2>] ? free_layer+0x37/0x3f
 [<ffffffff8105f31f>] rt_mutex_up_read+0x1a4/0x232
 [<ffffffff8105fcbc>] rt_up_read+0x9/0xb
 [<ffffffff881922b1>] :ib_uverbs:put_uobj_read+0x15/0x21
 [<ffffffff881922f7>] :ib_uverbs:put_pd_read+0xd/0xf
 [<ffffffff88194f8f>] :ib_uverbs:ib_uverbs_create_qp+0x39c/0x4cf
 [<ffffffff88191ae0>] ? :ib_uverbs:ib_uverbs_qp_event_handler+0x0/0x2d
 [<ffffffff88191843>] :ib_uverbs:ib_uverbs_write+0x96/0xb0
 [<ffffffff810b00d5>] vfs_write+0xc7/0x170
 [<ffffffff810b06e5>] sys_write+0x4a/0x76
 [<ffffffff8100c35e>] traceret+0x0/0x5

Comment 3 Jon Masters 2008-06-09 17:15:56 UTC
Got a system I can log in to and poke at?

Comment 4 Gurhan Ozen 2008-06-09 20:22:36 UTC
Not something with a remote console. You can use
dell-pe1950-0{2,3}.rhts.bos.redhat.com. Right now they have 5.1, though; I can
install 5.2 tonight. They are in the lab, so you can have physical access to them.

Comment 5 Doug Ledford 2008-06-09 21:09:12 UTC
Back when the kernel-rt was 2.6.21 or whatever, I had an openib patch followed
by an openib fixup patch that was the rt specific changes.  When the kernel was
forward ported to 2.6.24, did those IB fixups get preserved?

Comment 6 Clark Williams 2008-06-09 21:14:16 UTC
I'm currently carrying two patches for IB:

ofed-1.3.patch
ofed-1.3-rt-compat.patch


Is the rt-compat patch your fixups patch, or am I missing a patch?


Comment 7 Doug Ledford 2008-06-09 22:06:58 UTC
rt-compat was the fixup.

/me needs more time in the day...

Let me look over these patches and see if I can see the problem.

Comment 8 Doug Ledford 2008-06-09 23:29:32 UTC
I've confirmed the issue on my test network here.  I've also confirmed that the
rhel4 and rhel5 kernels can talk to each other fine, while neither can talk to
the rhel5-rt kernel via IB.  I'll see what I can find out.

Comment 9 Doug Ledford 2008-06-09 23:30:28 UTC
Oh, I should mention that this is with 2.6.24.7-65.el5rt, which is the latest
kernel I saw in brewroot.

Comment 10 Doug Ledford 2008-06-09 23:38:18 UTC
OK, now it is working (still got the warning though).  I didn't do anything
different other than reboot the machine, and this is a fresh load of the IB
modules, whereas previously I had unloaded and reloaded the modules without a
reboot.  I'll keep digging.  Just wanted to let people know it wasn't as DOA as
I first thought.

Comment 11 Doug Ledford 2008-06-11 17:01:17 UTC
Can one of the rt patch experts tell me what the warning in comment #2 is
warning about?  It just says warning and a stack trace.

Comment 12 Clark Williams 2008-06-11 17:10:11 UTC
I believe that's part of the rwlock-multi patch series that restores rwlock
semantics to rt_mutexes. 

Heavy mojo. Adding peterz, since rostedt is out on PTO this week.



Comment 13 Doug Ledford 2008-06-24 16:10:17 UTC
Well, if I can get a clear picture of what the warning is actually warning
about, or even better how to fix up code that generates the warning, I'll see if
I can make it go away.  But as it stands, I have no clue what changes need to be
made to make the warning be gone.

Comment 14 Steven Rostedt 2008-06-25 13:24:18 UTC
What kernel is comment #2 from? In the -62 kernel, line 1852 is not a
WARN_ON, and in -55 it is in rt_write_fastunlock.

There have been a lot of fixes in the rwlock code that could have caused these
issues. I believe the latest build will include the fixes (if it includes -rt14).


Comment 15 Clark Williams 2008-06-25 13:31:58 UTC
Luis has a -68 kernel running through RHTS right now, with the rebase to -rt14,
all the pertinent security CVEs and a bunch of patches cherry-picked from RHEL5.
Let's see how that does in smoke-testing and then get gozen/dledford to try it out.

Clark


Comment 16 Gurhan Ozen 2008-07-14 19:44:36 UTC
I ran this on the -72 kernel and things are working fine!

Comment 18 David Sommerseth 2008-10-02 12:49:26 UTC
Tried to run ibv_rc_pingpong on root.bos.redhat.com / dell-pe1950-02.rhts.bos.redhat.com  ... and it seems to fail.  Got this in /var/log/messages:

Oct  2 04:44:35 dhcp71-141 kernel: WARNING: at kernel/rtmutex.c:1896 rt_read_fastunlock()
Oct  2 04:44:35 dhcp71-141 kernel: Pid: 4589, comm: ibv_rc_pingpong Not tainted 2.6.24.7-81.el5rt #1
Oct  2 04:44:35 dhcp71-141 kernel: 
Oct  2 04:44:35 dhcp71-141 kernel: Call Trace:
Oct  2 04:44:35 dhcp71-141 kernel:  [<ffffffff81135ca6>] ? free_layer+0x37/0x3f
Oct  2 04:44:35 dhcp71-141 kernel:  [<ffffffff8105f521>] rt_mutex_up_read+0x1d2/0x260
Oct  2 04:44:35 dhcp71-141 kernel:  [<ffffffff8105fe55>] rt_up_read+0x9/0xb
Oct  2 04:44:35 dhcp71-141 kernel:  [<ffffffff883732b9>] :ib_uverbs:put_uobj_read+0x15/0x21
Oct  2 04:44:35 dhcp71-141 kernel:  [<ffffffff883732ff>] :ib_uverbs:put_pd_read+0xd/0xf
Oct  2 04:44:35 dhcp71-141 kernel:  [<ffffffff88375fd2>] :ib_uverbs:ib_uverbs_create_qp+0x39c/0x4cf
Oct  2 04:44:35 dhcp71-141 kernel:  [<ffffffff88372ae0>] ? :ib_uverbs:ib_uverbs_qp_event_handler+0x0/0x2d
Oct  2 04:44:35 dhcp71-141 kernel:  [<ffffffff88372843>] :ib_uverbs:ib_uverbs_write+0x96/0xb0
Oct  2 04:44:35 dhcp71-141 kernel:  [<ffffffff810b0479>] vfs_write+0xc7/0x170
Oct  2 04:44:35 dhcp71-141 kernel:  [<ffffffff810b0a89>] sys_write+0x4a/0x76
Oct  2 04:44:35 dhcp71-141 kernel:  [<ffffffff8100c37e>] traceret+0x0/0x5
Oct  2 04:44:35 dhcp71-141 kernel: 
=======================

Comment 19 Steven Rostedt 2008-10-06 19:14:06 UTC
Created attachment 319592
patch to correct locking order of ib driver

The ib driver does not release its locks in the reverse of the order in which it takes them. The RW locks in RT are very sensitive to this.

Hopefully the attached patch will fix the issue.

Comment 20 Clark Williams 2008-10-06 19:40:55 UTC
The patch from comment #19 moved to BZ 465862 as the rt_mutex issue is a different bug from the page allocation issue this bz addresses.

Clark

Comment 23 errata-xmlrpc 2008-10-07 19:20:26 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0857.html

