Description of problem:
I don't know what has happened since the last time I tested openib with the RT kernel, but the openib modules aren't happy in the 2.6.24.7-55.el5rt kernel:

modprobe: page allocation failure. order:0, mode:0x20
Pid: 5759, comm: modprobe Not tainted 2.6.24.7-55.el5rt #1
[<c046f51e>] __alloc_pages+0x2bb/0x2cb
[<f8c6ba25>] ipoib_cm_alloc_rx_skb+0xd5/0x224 [ib_ipoib]
[<f8c6d249>] ipoib_cm_dev_init+0x594/0x5e0 [ib_ipoib]
[<f8c6ade1>] ipoib_transport_dev_init+0xd9/0x3b4 [ib_ipoib]
[<f8c67feb>] ipoib_ib_dev_init+0x2f/0x72 [ib_ipoib]
[<f8c64844>] ipoib_dev_init+0xac/0xd0 [ib_ipoib]
[<f8c64a8d>] ipoib_add_one+0x225/0x3d6 [ib_ipoib]
[<f890724b>] ib_register_client+0x4b/0x70 [ib_core]
[<f89e20d5>] ipoib_init_module+0xd5/0xfc [ib_ipoib]
[<c044ebfd>] sys_init_module+0x1494/0x15cb
[<c045f662>] ? audit_syscall_entry+0x113/0x13d
[<c0406e64>] ? do_syscall_trace+0x14c/0x198
[<c0404102>] syscall_call+0x7/0xb
=======================
WARNING: at kernel/rtmutex.c:1723 rt_read_fastunlock()
Pid: 7566, comm: ibv_rc_pingpong Not tainted 2.6.24.7-55.el5rt #1
[<c044b181>] rt_mutex_up_read+0x19e/0x1fa
[<c044b93a>] rt_up_read+0x8/0xa
[<f8c05037>] put_uobj_read+0xe/0x18 [ib_uverbs]
[<f8c05073>] put_pd_read+0xb/0xd [ib_uverbs]
[<f8c07599>] ib_uverbs_create_qp+0x34e/0x477 [ib_uverbs]
[<f8c04960>] ? ib_uverbs_qp_event_handler+0x0/0x3d [ib_uverbs]
[<f8c0724b>] ? ib_uverbs_create_qp+0x0/0x477 [ib_uverbs]
[<f8c046f5>] ib_uverbs_write+0x99/0xac [ib_uverbs]
[<f8c0465c>] ? ib_uverbs_write+0x0/0xac [ib_uverbs]
[<c048af26>] vfs_write+0xa8/0x15c
[<c048b58b>] sys_write+0x3d/0x61
[<c0404102>] syscall_call+0x7/0xb
=======================
opensm[5395]: segfault at 38 rip 433c12 rsp 43805f50 error 4

Version-Release number of selected component (if applicable):
kernel: 2.6.24.7-55.el5rt
# rpm -qa | egrep "libib|openib|opensm"
openib-1.3-3.el5
libibumad-1.1.7-1.el5
libibumad-devel-1.1.7-1.el5
libibcm-static-1.0.2-1.el5
opensm-3.1.8-1.el5
libibcommon-debuginfo-1.0.8-1.el5
opensm-libs-3.1.8-1.el5
libibverbs-devel-1.1.1-9.el5
libibmad-devel-1.1.6-1.el5
libibmad-static-1.1.6-1.el5
libibverbs-debuginfo-1.1.1-9.el5
libibcm-debuginfo-1.0.2-1.el5
libibverbs-1.1.1-9.el5
libibcm-1.0.2-1.el5
opensm-static-3.1.8-1.el5
libibumad-static-1.1.7-1.el5
libibverbs-static-1.1.1-9.el5
libibverbs-utils-1.1.1-9.el5
opensm-debuginfo-3.1.8-1.el5
libibumad-debuginfo-1.1.7-1.el5
libibcommon-1.0.8-1.el5
libibcommon-devel-1.0.8-1.el5
libibcm-devel-1.0.2-1.el5
libibcommon-static-1.0.8-1.el5
libibmad-debuginfo-1.1.6-1.el5
libibmad-1.1.6-1.el5
opensm-devel-3.1.8-1.el5

How reproducible:
Very

Steps to Reproduce:
1. Install RHEL5.2
2. Install openib packages
3.

Actual results:

Expected results:

Additional info:
possibly a module-load problem? Assigning to jcm for now
I was told to try a newer (-62+) kernel, but still no luck:

WARNING: at kernel/rtmutex.c:1852 rt_read_fastunlock()
Pid: 16811, comm: ibv_rc_pingpong Not tainted 2.6.24.7-65.el5rt #1

Call Trace:
[<ffffffff811357b2>] ? free_layer+0x37/0x3f
[<ffffffff8105f31f>] rt_mutex_up_read+0x1a4/0x232
[<ffffffff8105fcbc>] rt_up_read+0x9/0xb
[<ffffffff881922b1>] :ib_uverbs:put_uobj_read+0x15/0x21
[<ffffffff881922f7>] :ib_uverbs:put_pd_read+0xd/0xf
[<ffffffff88194f8f>] :ib_uverbs:ib_uverbs_create_qp+0x39c/0x4cf
[<ffffffff88191ae0>] ? :ib_uverbs:ib_uverbs_qp_event_handler+0x0/0x2d
[<ffffffff88191843>] :ib_uverbs:ib_uverbs_write+0x96/0xb0
[<ffffffff810b00d5>] vfs_write+0xc7/0x170
[<ffffffff810b06e5>] sys_write+0x4a/0x76
[<ffffffff8100c35e>] traceret+0x0/0x5
Got a system I can login to to poke?
Not something with remote console. You can use dell-pe1950-0{2,3}.rhts.bos.redhat.com. Right now they have 5.1, though I can install 5.2 tonight. They are in the lab, so you can have physical access to them.
Back when the kernel-rt was 2.6.21 or whatever, I had an openib patch followed by an openib fixup patch that was the rt specific changes. When the kernel was forward ported to 2.6.24 did those ib fixups get preserved?
I'm currently carrying two patches for IB: ofed-1.3.patch ofed-1.3-rt-compat.patch Is the rt-compat patch your fixups patch, or am I missing a patch?
rt-compat was the fixup. /me needs more time in the day... Let me look over these patches and see if I can see the problem.
I've confirmed the issue on my test network here. I've also confirmed that the rhel4 and rhel5 kernels can talk to each other fine, while neither can talk to the rhel5-rt kernel via IB. I'll see what I can find out.
Oh, I should mention that this is with 2.6.24.7-65.el5rt, which is the latest kernel I saw in brewroot.
OK, now it is working (still got the warning though). I didn't do anything different other than rebooting the machine, and this is a fresh load of the IB modules, whereas previously I had unloaded and reloaded the modules without a reboot. I'll keep digging. Just wanted to let people know it wasn't as DOA as I first thought.
Can one of the rt patch experts tell me what the warning in comment #2 is warning about? It just says warning and a stack trace.
I believe that's part of the rwlock-multi patch series that restores rwlock semantics to rt_mutexes. Heavy mojo. Adding peterz, since rostedt is out on PTO this week.
Well, if I can get a clear picture of what the warning is actually warning about, or even better how to fix up code that generates the warning, I'll see if I can make it go away. But as it stands, I have no clue what changes need to be made to make the warning be gone.
What kernel is comment #2 from? In the -62 kernel, line 1852 is not a WARN_ON, and in -55 that line is in rt_write_fastunlock. There have been a lot of fixes in the rwlock code, and the issues they address could have caused these problems. I believe the latest build will include the fixes (if it includes -rt14).
Luis has a -68 kernel running through RHTS right now, with the rebase to -rt14, all the pertinent security CVEs and a bunch of patches cherry-picked from RHEL5. Let's see how that does in smoke-testing and then get gozen/dledford to try it out. Clark
I ran this through -72 kernel and things are working fine!
Tried to run ibv_rc_pingpong on root.bos.redhat.com / dell-pe1950-02.rhts.bos.redhat.com ... and it seems to fail. Got this in /var/log/messages:

Oct 2 04:44:35 dhcp71-141 kernel: WARNING: at kernel/rtmutex.c:1896 rt_read_fastunlock()
Oct 2 04:44:35 dhcp71-141 kernel: Pid: 4589, comm: ibv_rc_pingpong Not tainted 2.6.24.7-81.el5rt #1
Oct 2 04:44:35 dhcp71-141 kernel:
Oct 2 04:44:35 dhcp71-141 kernel: Call Trace:
Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff81135ca6>] ? free_layer+0x37/0x3f
Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff8105f521>] rt_mutex_up_read+0x1d2/0x260
Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff8105fe55>] rt_up_read+0x9/0xb
Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff883732b9>] :ib_uverbs:put_uobj_read+0x15/0x21
Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff883732ff>] :ib_uverbs:put_pd_read+0xd/0xf
Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff88375fd2>] :ib_uverbs:ib_uverbs_create_qp+0x39c/0x4cf
Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff88372ae0>] ? :ib_uverbs:ib_uverbs_qp_event_handler+0x0/0x2d
Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff88372843>] :ib_uverbs:ib_uverbs_write+0x96/0xb0
Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff810b0479>] vfs_write+0xc7/0x170
Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff810b0a89>] sys_write+0x4a/0x76
Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff8100c37e>] traceret+0x0/0x5
Oct 2 04:44:35 dhcp71-141 kernel: =======================
Created attachment 319592
patch to correct locking order of ib driver

The ib driver does not release its locks in the reverse of the order in which it takes them. The RW locks in RT are very sensitive to this. Hopefully the attached patch will fix the issue.
The patch from comment #19 moved to BZ 465862 as the rt_mutex issue is a different bug from the page allocation issue this bz addresses. Clark
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0857.html