Description of problem:
When shutting down the opensm service on the RT kernel, I get the following backtrace:

------------[ cut here ]------------
kernel BUG at kernel/rt.c:344!
invalid opcode: 0000 [1] PREEMPT SMP
CPU 1
Modules linked in: autofs4 hidp l2cap bluetooth nfs lockd nfs_acl sunrpc iscsi_tcp ib_iser libiscsi scsi_transport_iscsi ib_ucm rdma_ucm ib_srp ib_sdp rdma_cm iw_cm ib_addr ib_local_sa ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad loop dm_multipath video sbs i2c_ec i2c_core dock button battery asus_acpi backlight ac parport_pc lp parport sg pcspkr ib_ipath ata_generic ib_mthca ib_mad ib_core shpchp bnx2 serio_raw ide_cd cdrom dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 4198, comm: opensm Not tainted 2.6.20-19.el5rt #1
RIP: 0010:[<ffffffff810b8e0a>] [<ffffffff810b8e0a>] rt_downgrade_write+0x4/0x8
RSP: 0000:ffff81005d999c18 EFLAGS: 00010282
RAX: ffff81007cece828 RBX: ffff810076c907f8 RCX: ffff810076c90828
RDX: ffff81007cece828 RSI: 0000000000000000 RDI: ffff81007cece780
RBP: ffff81005d999c18 R08: 0000000000000000 R09: 0000000000000001
R10: ffff81005d96f6c0 R11: 0000000000000000 R12: ffff810076c90800
R13: ffff810076c907f8 R14: 0000000000000000 R15: ffff81007cece6c0
FS: 0000000000000000(0000) GS:ffff81000d510540(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003c40946ae8 CR3: 0000000001001000 CR4: 00000000000006e0
Process opensm (pid: 4198, threadinfo ffff81005d998000, task ffff81005d996700)
Stack: ffff81005d999c58 ffffffff883614e3 ffff8100786b5d30 0000000000000008
 ffff8100786b57a0 ffff81005d96f6c0 ffff8100786b57a0 ffff8100019b1180
 ffff81005d999c98 ffffffff81012db7 ffff81007805a378 ffff81005d96f6c0
Call Trace:
 [<ffffffff883614e3>] :ib_umad:ib_umad_close+0xb7/0x10f
 [<ffffffff81012db7>] __fput+0xdd/0x1af
 [<ffffffff8102f10c>] fput+0x17/0x19
 [<ffffffff81025d81>] filp_close+0x6c/0x77
 [<ffffffff8103b01e>] put_files_struct+0x6d/0xc1
 [<ffffffff81015dbb>] do_exit+0x27f/0x8c5
 [<ffffffff8104d3b7>] cpuset_exit+0x0/0x6e
 [<ffffffff8102d772>] get_signal_to_deliver+0x432/0x483
 [<ffffffff8105fa88>] do_notify_resume+0xc2/0x7d3
 [<ffffffff81062667>] ptregscall_common+0x67/0xb0
 [<ffffffff810622d6>] sysret_signal+0x21/0x31
 [<0000003c406c48c6>]
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<ffffffff81069e97>] .... __spin_trylock+0x16/0x71
.....[<ffffffff8106b0ea>] .. ( <= oops_begin+0x28/0x77)

Code: 0f 0b eb fe 55 48 89 e5 53 48 8d 5f 08 48 83 ec 08 85 f6 89
RIP [<ffffffff810b8e0a>] rt_downgrade_write+0x4/0x8
 RSP <ffff81005d999c18>
<1>Fixing recursive fault but reboot is needed!

Version-Release number of selected component (if applicable):
# uname -a
Linux dell-pe1950-02.rhts.boston.redhat.com 2.6.20-19.el5rt #1 SMP PREEMPT Mon Apr 16 12:14:21 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Every time.

Steps to Reproduce:
1. On a system with IB hardware, run: service opensmd start ; service opensmd stop
The same behavior can be observed with the ibping program as well: just run ibping.
This was resolved with the OFED 1.2 final code and the updated rt port patch used to build the kernel-rt-2.6.21-32.ofed.3.el5rt kernel. That was a scratch build, but the updated patches were submitted to Clark Williams for inclusion in his rt kernel.
applied to -35; testing
Verified with -35:

[root@dell-pe1950-02 ~]# service opensmd start
Starting IB Subnet Manager [ OK ]
[root@dell-pe1950-02 ~]# service opensmd stop ; service opensmd start
Stopping IB Subnet Manager....... [ OK ]
Starting IB Subnet Manager [ OK ]
[root@dell-pe1950-02 ~]# uname -a
Linux dell-pe1950-02.rhts.boston.redhat.com 2.6.21-35.el5rt #1 SMP PREEMPT RT Thu Jul 26 11:59:02 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux