Bug 447942
Summary: | openib broken in 2.6.24.7-55.el5rt
---|---
Product: | Red Hat Enterprise MRG
Component: | realtime-kernel
Status: | CLOSED ERRATA
Severity: | high
Priority: | high
Version: | beta
Target Milestone: | 1.0.3
Hardware: | All
OS: | Linux
Reporter: | Gurhan Ozen <gozen>
Assignee: | Jon Masters <jcm>
CC: | bhu, davids, dledford, jburke, lgoncalv, pzijlstr, srostedt, williams
Doc Type: | Bug Fix
Last Closed: | 2008-10-07 19:20:26 UTC
Description
Gurhan Ozen
2008-05-22 15:28:10 UTC
possibly a module-load problem? Assigning to jcm for now

I was told to try a newer (-62+) kernel, but still no luck..:

    WARNING: at kernel/rtmutex.c:1852 rt_read_fastunlock()
    Pid: 16811, comm: ibv_rc_pingpong Not tainted 2.6.24.7-65.el5rt #1
    Call Trace:
    [<ffffffff811357b2>] ? free_layer+0x37/0x3f
    [<ffffffff8105f31f>] rt_mutex_up_read+0x1a4/0x232
    [<ffffffff8105fcbc>] rt_up_read+0x9/0xb
    [<ffffffff881922b1>] :ib_uverbs:put_uobj_read+0x15/0x21
    [<ffffffff881922f7>] :ib_uverbs:put_pd_read+0xd/0xf
    [<ffffffff88194f8f>] :ib_uverbs:ib_uverbs_create_qp+0x39c/0x4cf
    [<ffffffff88191ae0>] ? :ib_uverbs:ib_uverbs_qp_event_handler+0x0/0x2d
    [<ffffffff88191843>] :ib_uverbs:ib_uverbs_write+0x96/0xb0
    [<ffffffff810b00d5>] vfs_write+0xc7/0x170
    [<ffffffff810b06e5>] sys_write+0x4a/0x76
    [<ffffffff8100c35e>] traceret+0x0/0x5

Got a system I can login to to poke? Not something with remote console.

You can use dell-pe1950-0{2,3}.rhts.bos.redhat.com. Right now they have 5.1 though, i can install 5.2 tonight. They are in the lab so you can have physical access to them.

Back when the kernel-rt was 2.6.21 or whatever, I had an openib patch followed by an openib fixup patch that was the rt specific changes. When the kernel was forward ported to 2.6.24 did those ib fixups get preserved?

I'm currently carrying two patches for IB:

    ofed-1.3.patch
    ofed-1.3-rt-compat.patch

Is the rt-compat patch your fixups patch, or am I missing a patch?

rt-compat was the fixup. /me needs more time in the day... Let me look over these patches and see if I can see the problem.

I've confirmed the issue on my test network here. I've also confirmed that the rhel4 and rhel5 kernels can talk to each other fine, while neither can talk to the rhel5-rt kernel via IB. I'll see what I can find out.

Oh, I should mention that this is with 2.6.24.7-65.el5rt, which is the latest kernel I saw in brewroot.

OK, now it is working (still got the warning though).
I didn't do anything different other than rebooting the machine and this is a fresh load of the IB modules where as previously I had unloaded and reloaded the modules without a reboot. I'll keep digging. Just wanted to let people know it wasn't as DOA as I first thought.

Can one of the rt patch experts tell me what the warning in comment #2 is warning about? It just says warning and a stack trace.

I believe that's part of the rwlock-multi patch series that restores rwlock semantics to rt_mutexes. Heavy mojo. Adding peterz, since rostedt is out on PTO this week.

Well, if I can get a clear picture of what the warning is actually warning about, or even better how to fix up code that generates the warning, I'll see if I can make it go away. But as it stands, I have no clue what changes need to be made to make the warning be gone.

What kernel is comment #2 from? the -62 kernel has the line number 1852 not on a WARN_ON, and -55 has it in rt_write_fastunlock. There's been a lot of fixes in the rwlock code that could have caused these issues. I believe the latest build will include the fixes (if it includes -rt14).

Luis has a -68 kernel running through RHTS right now, with the rebase to -rt14, all the pertinent security CVEs and a bunch of patches cherry-picked from RHEL5. Let's see how that does in smoke-testing and then get gozen/dledford to try it out.

Clark

I ran this through -72 kernel and things are working fine!

Tried to run ibv_rc_pingpong on root.bos.redhat.com / dell-pe1950-02.rhts.bos.redhat.com ... and it seems to fail. Got this in /var/log/messages:

    Oct 2 04:44:35 dhcp71-141 kernel: WARNING: at kernel/rtmutex.c:1896 rt_read_fastunlock()
    Oct 2 04:44:35 dhcp71-141 kernel: Pid: 4589, comm: ibv_rc_pingpong Not tainted 2.6.24.7-81.el5rt #1
    Oct 2 04:44:35 dhcp71-141 kernel:
    Oct 2 04:44:35 dhcp71-141 kernel: Call Trace:
    Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff81135ca6>] ? free_layer+0x37/0x3f
    Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff8105f521>] rt_mutex_up_read+0x1d2/0x260
    Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff8105fe55>] rt_up_read+0x9/0xb
    Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff883732b9>] :ib_uverbs:put_uobj_read+0x15/0x21
    Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff883732ff>] :ib_uverbs:put_pd_read+0xd/0xf
    Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff88375fd2>] :ib_uverbs:ib_uverbs_create_qp+0x39c/0x4cf
    Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff88372ae0>] ? :ib_uverbs:ib_uverbs_qp_event_handler+0x0/0x2d
    Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff88372843>] :ib_uverbs:ib_uverbs_write+0x96/0xb0
    Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff810b0479>] vfs_write+0xc7/0x170
    Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff810b0a89>] sys_write+0x4a/0x76
    Oct 2 04:44:35 dhcp71-141 kernel: [<ffffffff8100c37e>] traceret+0x0/0x5
    Oct 2 04:44:35 dhcp71-141 kernel: =======================

Created attachment 319592
patch to correct locking order of ib driver
The ib driver does not release its locks in the reverse of the order in which it takes them. The RW locks in RT are very sensitive to this.
Hopefully the attached patch will fix the issue.
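
For readers unfamiliar with the ordering rule mentioned above, here is a minimal userspace sketch of what "release in the reverse order of acquisition" means. It is a toy model only: `held_stack`, `track_acquire`, `track_release`, and the two demo functions are hypothetical names made up for illustration, and this is not the logic in kernel/rtmutex.c or the attached patch. It simply keeps a per-task stack of held locks and flags any release that is not of the most recently taken lock.

```c
#include <stddef.h>

/* Toy model only: a per-task stack of held locks. All names are
 * hypothetical; this is NOT the rtmutex.c implementation. */
#define MAX_HELD 8

struct held_stack {
    const void *locks[MAX_HELD];
    size_t depth;
};

/* Record an acquisition. Returns 0 on success, -1 if the stack is full. */
static int track_acquire(struct held_stack *s, const void *lock)
{
    if (s->depth == MAX_HELD)
        return -1;
    s->locks[s->depth++] = lock;
    return 0;
}

/* Record a release. Returns 0 if the lock was the most recently acquired
 * one (LIFO order), 1 if it was held but released out of order, and -1 if
 * it was not held at all. The entry is removed in the first two cases. */
static int track_release(struct held_stack *s, const void *lock)
{
    size_t i;

    for (i = s->depth; i > 0; i--) {
        if (s->locks[i - 1] == lock) {
            int out_of_order = (i != s->depth);

            /* Close the gap left by the removed entry. */
            for (; i < s->depth; i++)
                s->locks[i - 1] = s->locks[i];
            s->depth--;
            return out_of_order;
        }
    }
    return -1;
}

/* Take pd then qp, release qp then pd. Returns the number of
 * out-of-order releases seen. */
int demo_reverse_order(void)
{
    struct held_stack s = { { 0 }, 0 };
    int pd = 0, qp = 0; /* stand-ins for two lock objects */
    int bad = 0;

    track_acquire(&s, &pd);
    track_acquire(&s, &qp);
    bad += track_release(&s, &qp);
    bad += track_release(&s, &pd);
    return bad;
}

/* Take pd then qp, release pd then qp (same order as acquisition). */
int demo_same_order(void)
{
    struct held_stack s = { { 0 }, 0 };
    int pd = 0, qp = 0; /* stand-ins for two lock objects */
    int bad = 0;

    track_acquire(&s, &pd);
    track_acquire(&s, &qp);
    bad += track_release(&s, &pd);
    bad += track_release(&s, &qp);
    return bad;
}
```

In this model `demo_reverse_order()` reports no out-of-order releases, while `demo_same_order()` reports one. A patch of the kind described would reorder the unlock calls so that every path looks like the first demo.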
The patch from comment #19 moved to BZ 465862 as the rt_mutex issue is a different bug from the page allocation issue this bz addresses.

Clark

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0857.html