Bug 434822

Summary: OpenIB broken in 2.6.24.1-24.el5rt
Product: Red Hat Enterprise MRG
Reporter: Gurhan Ozen <gozen>
Component: realtime-kernel
Assignee: Clark Williams <williams>
Status: CLOSED CANTFIX
QA Contact:
Severity: low
Docs Contact:
Priority: low
Version: 1.0
CC: bhu, dledford, jburke
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2008-07-02 14:34:29 UTC

Description Gurhan Ozen 2008-02-25 17:38:56 UTC
Description of problem:
The OpenIB stack seems to be broken in kernel 2.6.24.1-24.el5rt. In many cases,
a local LID can't even be detected:

# ib_send_lat 
------------------------------------------------------------------
                    Send Latency Test
Inline data is used up to 400 bytes message
Connection type : RC
Local lid 0x0 detected. Is an SM running?

Even though:
# ibstat
CA 'mthca0'
        CA type: MT25208 (MT23108 compat mode)
        Number of ports: 2
        Firmware version: 4.6.2
        Hardware version: a0
        Node GUID: 0x0002c90200200fcc
        System image GUID: 0x0002c90200200fcf
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x00510a6a
                Port GUID: 0x0002c90200200fcd
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00510a68
                Port GUID: 0x0002c90200200fce

No programs can be run over the fabric. 
No obvious error/debug messages are in dmesg, or /var/log/{messages,osm}.log
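
The mismatch above (ibstat reports port 1 as Active with base LID 1, while
ib_send_lat sees LID 0x0) can be checked mechanically. The following is a
minimal sketch, not part of the original report: it parses ibstat-style text
and lists the ports that are Active with a nonzero base LID. The function names
parse_ibstat and usable_ports are hypothetical, and the sample text is an
abbreviated copy of the ibstat output quoted above.

```python
import re

# Abbreviated copy of the ibstat output from this report.
IBSTAT_SAMPLE = """\
CA 'mthca0'
        Number of ports: 2
        Port 1:
                State: Active
                Physical state: LinkUp
                Base lid: 1
                SM lid: 1
        Port 2:
                State: Down
                Physical state: Polling
                Base lid: 0
                SM lid: 0
"""


def parse_ibstat(text):
    """Return a list of (ca_name, port_number, fields) in output order."""
    ports = []
    ca = None
    fields = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            # New adapter section; per-port fields start at the next "Port N:".
            ca, fields = m.group(1), None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            fields = {}
            ports.append((ca, int(m.group(1)), fields))
            continue
        if fields is not None and ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return ports


def usable_ports(ports):
    """Keep only ports that are Active and have a nonzero base LID."""
    return [
        (ca, num)
        for ca, num, f in ports
        if f.get("State") == "Active" and int(f.get("Base lid", "0"), 0) != 0
    ]


if __name__ == "__main__":
    print(usable_ports(parse_ibstat(IBSTAT_SAMPLE)))
```

If this lists a usable port while a test tool still reports LID 0x0, the tool
is probably looking at a different device or port than the one ibstat shows as
Active.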

Version-Release number of selected component (if applicable):
# uname -a
Linux dell-pe1950-02.rhts.boston.redhat.com 2.6.24.1-24.el5rt #1 SMP PREEMPT RT
Mon Feb 11 17:19:56 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
# rpm -qa | egrep "openib|libib|rdma|sdp" | sort |uniq
libibcm-1.0.1-1.el5
libibcm-debuginfo-1.0.1-1.el5
libibcm-devel-1.0.1-1.el5
libibcm-static-1.0.1-1.el5
libibcommon-1.0.7-1.el5
libibcommon-debuginfo-1.0.7-1.el5
libibcommon-devel-1.0.7-1.el5
libibcommon-static-1.0.7-1.el5
libibmad-1.1.5-1.el5
libibmad-debuginfo-1.1.5-1.el5
libibmad-devel-1.1.5-1.el5
libibmad-static-1.1.5-1.el5
libibumad-1.1.6-1.el5
libibumad-debuginfo-1.1.6-1.el5
libibumad-devel-1.1.6-1.el5
libibumad-static-1.1.6-1.el5
libibverbs-1.1.1-8.el5
libibverbs-debuginfo-1.1.1-8.el5
libibverbs-devel-1.1.1-8.el5
libibverbs-static-1.1.1-8.el5
libibverbs-utils-1.1.1-8.el5
librdmacm-1.0.5-1.el5
librdmacm-debuginfo-1.0.5-1.el5
librdmacm-devel-1.0.5-1.el5
librdmacm-static-1.0.5-1.el5
librdmacm-utils-1.0.5-1.el5
libsdp-1.1.99-8.el5
libsdp-debuginfo-1.1.99-8.el5
openib-1.3-1.el5
sdpnetstat-1.50-6.el5_1.1



How reproducible:
Every time


Comment 1 Clark Williams 2008-03-11 14:24:49 UTC
I just put the latest ofed-1.3 into our 2.6.24.3-rt3 based kernel. Please try
that kernel from brew.

Clark


Comment 2 Clark Williams 2008-03-11 14:28:05 UTC
Ugh, I could have added the version you need, eh?

kernel-rt-2.6.24.3-29.el5rt.



Comment 3 Gurhan Ozen 2008-03-25 07:37:27 UTC
As a follow-up to my email to the rhel-rt-internal list, I am changing the state
to fails-qa because, for some reason, the order of IB devices returned in
kernel-rt-2.6.24.3-29.el5rt is reversed.

Comment 4 Gurhan Ozen 2008-03-27 02:53:53 UTC
I was comparing kernel-rt-2.6.24.3-29.el5rt with the RHEL5.2 kernel. When I ran
the tests on kernel-rt-2.6.24.3-29.el5rtvanilla, the order of the IB devices was
reversed as well. I have run all IB/openmpi tests, and they all pass with
kernel-rt-2.6.24.3-29.el5rt installed on the RHEL5.2-Server-20030313.1 tree.
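
A reversed device order matters mostly to tests that implicitly use the first
device returned. The sketch below is an illustration only, not something taken
from this thread: it classifies how the order differs between two kernels and
builds a command line that names the device and port explicitly. The device
names mthca0 and mthca1 are hypothetical examples, and the -d (device) and -i
(port) options are assumed to be the perftest selectors; confirm them with
`ib_send_lat --help` before relying on this.

```python
def order_changed(old, new):
    """Classify how the device order differs between two kernels."""
    if old == new:
        return "same"
    if sorted(old) != sorted(new):
        return "different devices"
    if new == old[::-1]:
        return "reversed"
    return "reordered"


def pinned_command(tool, device, port):
    """Build an argument list that names the device and port explicitly.

    Assumes the tool accepts -d <device> and -i <port>.
    """
    return [tool, "-d", device, "-i", str(port)]


if __name__ == "__main__":
    # Hypothetical device lists captured under each kernel.
    rhel_order = ["mthca0", "mthca1"]
    rt_order = ["mthca1", "mthca0"]
    print(order_changed(rhel_order, rt_order))
    print(" ".join(pinned_command("ib_send_lat", "mthca0", 1)))
```

Naming the device explicitly makes a test independent of enumeration order, so
the same invocation works under either kernel.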

Comment 5 Clark Williams 2008-07-01 21:14:33 UTC
Are we still broken in -65 (the GA kernel)?

Comment 6 Jeff Burke 2008-07-02 12:32:38 UTC
Clark,
  Looking at the changelog from the kernel:

Changelog:
* Fri Jun 06 2008 Clark Williams <williams> - 2.6.24-65
- replaced peterz's slab fix with v2 patch
- replaced rostedt's ftrace hotplug fix wth v2 patch

What would have changed to fix the issues Gurhan was seeing?

Comment 7 Clark Williams 2008-07-02 13:28:37 UTC
Ah, I didn't read closely enough to see that it's a device ordering issue. So
yeah, we're still borken.

Comment 8 Doug Ledford 2008-07-02 14:33:46 UTC
Actually, we aren't broken, and this bug should be closed.  Upstream has
obviously changed the sort order on Gurhan's hardware (so pci=breadth or one of
the other sort-modifying options may be in order). If that had happened in the
middle of a single product lifecycle, it would be a bug we have to fix.
However, this went out GA with the sort reversed, and that sorting order now
has to be maintained in order to preserve existing systems when updates to MRG
go out.  In short, it's too late to do anything about this issue, and we
patently *can't* allow anything to be done about it.