Description of problem:

I get the following error on MT25204 cards when running the openib test suite:

Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[00]: 000d0000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[01]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[02]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[03]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[04]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[05]: 00127ee4
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[06]: ffffffff
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[07]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[08]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[09]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0a]: 00000000
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0b]: 00000000
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Interface ib0.IPv6 no longer relevant for mDNS.
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0c]: 00000000
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Leaving mDNS multicast group on interface ib0.IPv6 with address fe80::202:c902:20:f2ad.
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0d]: 00000000
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Interface ib0.IPv4 no longer relevant for mDNS.
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0e]: 00000000
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Leaving mDNS multicast group on interface ib0.IPv4 with address 192.168.1.10.
Aug 4 20:58:08 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: buf[0f]: 00000000
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Withdrawing address record for fe80::202:c902:20:f2ad on ib0.
Aug 4 20:58:08 dell-pe1950-02 avahi-daemon[3262]: Withdrawing address record for 192.168.1.10 on ib0.
Aug 4 20:58:09 dell-pe1950-02 kernel: ACPI: PCI interrupt for device 0000:0c:00.0 disabled
Aug 4 20:58:09 dell-pe1950-02 kernel: ib_mthca: Initializing 0000:0c:00.0
Aug 4 20:58:09 dell-pe1950-02 kernel: PCI: Enabling device 0000:0c:00.0 (0140 -> 0142)
Aug 4 20:58:09 dell-pe1950-02 kernel: ACPI: PCI Interrupt 0000:0c:00.0[A] -> GSI 16 (level, low) -> IRQ 169
Aug 4 20:58:11 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: HCA FW version 1.0.700 is old (1.1.0 is current).
Aug 4 20:58:11 dell-pe1950-02 kernel: ib_mthca 0000:0c:00.0: If you have problems, try updating your HCA FW.

This only happens on MT25204 cards. The card recovers itself nicely unless the box happens to be running the opensmd daemon, in which case it cannot recover. The error occurs every time my test script runs ibv_rc_pingpong, yet running ibv_rc_pingpong manually does not reproduce it. Maybe things need to happen in a specific sequence?

I'll attach the script to this BZ. Change the line "set pass password" to contain the password of the remote machine you'll be running the script against, and invoke it with:

./openib.exp <remote_host_name>

Make sure that both machines have IB adapters and are in the same IB subnet. The machine you run the script on must have an MT25204 card. If you don't have one, I can let you use the machines in the Westford lab that do.
Version-Release number of selected component (if applicable):

[root@dell-pe1950-02 ~]# rpm -qa | egrep "openib|mthca"
libmthca-1.0.4-1.el5
libmthca-devel-1.0.4-1.el5
openib-perftest-1.2-1.el5
libmthca-devel-1.0.4-1.el5
openib-tvflash-0.9.2-1.el5
openib-mstflint-1.2-1.el5
openib-diags-1.2.7-1.el5
openib-1.2-1.el5
openib-srptools-0.0.6-1.el5

# modinfo ib_mthca
filename:       /lib/modules/2.6.18-38.el5/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko
version:        0.08
license:        Dual BSD/GPL
description:    Mellanox InfiniBand HCA low-level driver
author:         Roland Dreier
srcversion:     D945E6E42EECF8618BF3854
alias:          pci:v00001867d00005E8Csv*sd*bc*sc*i*
alias:          pci:v000015B3d00005E8Csv*sd*bc*sc*i*
alias:          pci:v00001867d00006274sv*sd*bc*sc*i*
alias:          pci:v000015B3d00006274sv*sd*bc*sc*i*
alias:          pci:v00001867d00006282sv*sd*bc*sc*i*
alias:          pci:v000015B3d00006282sv*sd*bc*sc*i*
alias:          pci:v00001867d00006278sv*sd*bc*sc*i*
alias:          pci:v000015B3d00006278sv*sd*bc*sc*i*
alias:          pci:v00001867d00005A44sv*sd*bc*sc*i*
alias:          pci:v000015B3d00005A44sv*sd*bc*sc*i*
depends:        ib_mad,ib_core
vermagic:       2.6.18-38.el5 SMP mod_unload gcc-4.1
parm:           catas_reset_disable:disable reset on catastrophic event if nonzero (int)
parm:           qos_support:Enable QoS support if > 0 (int)
parm:           fw_cmd_doorbell:post FW commands through doorbell page if nonzero (and supported by FW) (int)
parm:           debug_level:Enable debug tracing if > 0 (int)
parm:           msi_x:attempt to use MSI-X if nonzero (int)
parm:           msi:attempt to use MSI if nonzero (int)
parm:           tune_pci:increase PCI burst from the default set by BIOS if nonzero (int)
parm:           num_qp:maximum number of QPs per HCA (int)
parm:           rdb_per_qp:number of RDB buffers per QP (int)
parm:           num_cq:maximum number of CQs per HCA (int)
parm:           num_mcg:maximum number of multicast groups per HCA (int)
parm:           num_mpt:maximum number of memory protection table entries per HCA (int)
parm:           num_mtt:maximum number of memory translation table segments per HCA (int)
parm:           num_udav:maximum number of UD address vectors per HCA (int)
parm:           fmr_reserved_mtts:number of memory translation table segments reserved for FMR (int)
module_sig:     883f35046b38fd66346bc3bfc3c5df311211d009f5de2c99c70471b88f091278de56ebb575e1c904309f6f94576d4d6efff6c12589cf7422fc58cdce2d

How reproducible:
Every time.

Steps to Reproduce:
1. Grab openib.exp from this bug report.
2. Run it with ./openib.exp <remote_machine>
3. Watch /var/log/messages

Actual results:

Expected results:

Additional info:
Created attachment 161181 [details] test script
By the way, this is only happening with MT25204 cards. I also have MT25208 and MT23108 cards, neither of which shows this issue.
So I downloaded the firmware fw-25204-1_2_000-MHES14-XT.bin from the Mellanox website and burned it with the command "mstflint -d mthca0 -i fw-25204-1_2_000-MHES14-XT.bin bb", but the same issue can still be seen.
This appears to be a known issue in the firmware on this card, specifically in how the firmware handles a programming bug in the user's software. When the user space software posts more requests to the work queues than there are open slots in the completion queue, and enough requests complete to overflow the completion queue, this particular firmware raises a catastrophic error, while the firmware on later cards handles the situation gracefully.

Gurhan, can you identify which test you are running that is causing this problem? If so, does changing the depth of the tx/rx queues in the test solve the problem?
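For illustration, here is a minimal libibverbs sketch of the failure pattern described above. This is hypothetical code with made-up queue depths, not code from the actual test suite: a single CQ serves both the send and receive queues of a QP but is sized smaller than the total number of work requests the QP can have outstanding, so draining the work queues can overflow the CQ.

#include <infiniband/verbs.h>

/* Hypothetical example: the CQ has room for only 64 completions... */
struct ibv_qp *create_overcommitted_qp(struct ibv_context *ctx,
                                       struct ibv_pd *pd)
{
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);
    if (!cq)
        return NULL;

    /* ...but it serves send and receive queues that together allow up
     * to 128 + 128 = 256 outstanding work requests.  If enough of them
     * complete before the application polls the CQ, the CQ overflows,
     * which on this firmware reportedly ends in a catastrophic error. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr  = 128,
            .max_recv_wr  = 128,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,
    };
    return ibv_create_qp(pd, &attr);
}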
(In reply to comment #4)
> This appears to be a known issue in the firmware on this card,
> specifically in how the firmware handles a programming bug in the
> user's software. When the user space software posts more requests to
> the work queues than there are open slots in the completion queue,
> and enough requests complete to overflow the completion queue, this
> particular firmware raises a catastrophic error, while the firmware
> on later cards handles the situation gracefully.
>
> Gurhan, can you identify which test you are running that is causing
> this problem? If so, does changing the depth of the tx/rx queues in
> the test solve the problem?

Ahh, identifying it will be tough, if not impossible. I actually tried before opening this bug report but wasn't successful. I am just running openib.exp, which executes the different IB utilities one by one, and the firmware error happens at some random point while the script is running. It is not consistent about which program it happens on. :( I will try it again anyway.
We've opted not to treat this as a blocker bug because the card does not become unusable when this happens: the driver resets the card, reinitializes it, and continues working. The current transfer is interrupted, but subsequent transfers work fine. In addition, this is likely a symptom of a programming error in the user space software. The condition is most likely triggered whenever the number of completion queue entries available is smaller than the total number of outstanding work requests, so the firmware completes more work requests than it has completion queue slots and overflows the completion queue, producing this error. Later versions of the mthca hardware handle this more gracefully, but it is still a user programming bug: there are always supposed to be sufficient completion queue slots to cover all outstanding work requests. I'm not sure if this is worth a release note, or if we should just file this away on kbase in case someone hits this error.
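One way to honor that rule without enlarging the CQ is to bound the number of outstanding work requests in the application itself. The following is a rough sketch under assumed names and depths (none of this comes from the test suite): completions are reaped with ibv_poll_cq() before more work is posted, so the in-flight count never exceeds the CQ capacity.

#include <infiniband/verbs.h>

/* Post 'total' copies of a prepared send work request while keeping at
 * most 'cq_depth' of them outstanding; 'cq' must have been created
 * with at least 'cq_depth' entries.  Illustrative sketch only. */
static int send_bounded(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_send_wr *wr, int total, int cq_depth)
{
    int posted = 0, completed = 0;
    struct ibv_wc wc[16];

    while (completed < total) {
        /* Post while there is work left and the CQ has headroom. */
        while (posted < total && posted - completed < cq_depth) {
            struct ibv_send_wr *bad;
            if (ibv_post_send(qp, wr, &bad))
                return -1;
            posted++;
        }
        /* Drain completions to free CQ slots before posting more. */
        int n = ibv_poll_cq(cq, 16, wc);
        if (n < 0)
            return -1;
        completed += n;
    }
    return 0;
}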
When it happens, if opensmd is running on the host, then opensmd stalls and has to be restarted too. We might want to add that to the kbase article (or to the release note, if we opt for one) as well.
The opensm issue is one that's unlikely to be hit in the field. Most HPC clusters avoid running opensm on a machine that's heavily loaded with other traffic, and the problem only shows up under heavy load. So, if for no other reason than that people tend to separate management nodes from compute nodes, this shouldn't be a problem.
Not sure if I got this correctly, so please advise if the RHEL5.1 release note update quoted below is correct:

<quote>
Running an openib application on a system equipped with an MT25204 will result in an error if the opensmd daemon is also running. When this occurs, simply restart opensmd.
</quote>

Thanks!
Actually, I think that's more confusing than anything ;-)

The real problem is that openmpi applications will see random segfaults and programs terminating unexpectedly if they are run on a system with the MT25204 hardware and the hardware generates a catastrophic error event. If users see their application crash with a segfault, they should check the output of dmesg to see if the hardware generated a catastrophic error. If it did, this is an indication that the user level application tried to queue more commands than fit into the work completion queue, and a minor modification to the program to increase the size of the completion queue should solve the problem.

In the event that the application segfaults and the card generates this error, then if opensm was running on that machine, it will need to be restarted after the error event in order to continue managing the InfiniBand fabric.
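The "minor modification" mentioned above would look something like this in a libibverbs program; the function name and depths are illustrative assumptions, not the application's actual code. The point is simply that the CQ is created with at least one entry per work request that can be outstanding at once.

#include <infiniband/verbs.h>

#define SEND_DEPTH 128
#define RECV_DEPTH 128

/* Size the CQ to cover every work request that can be outstanding at
 * once; the verbs layer may round the count up to the next size the
 * device supports.  Illustrative sketch only. */
struct ibv_cq *create_adequate_cq(struct ibv_context *ctx)
{
    return ibv_create_cq(ctx, SEND_DEPTH + RECV_DEPTH, NULL, NULL, 0);
}

The same SEND_DEPTH and RECV_DEPTH values would then be passed as max_send_wr and max_recv_wr when creating the QP that feeds this CQ.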
Something like this might work: Mellanox MT25204 hardware has been found to experience an internal error under certain high load conditions. If the ib_mthca driver reports a catastrophic error on this hardware, it is usually related to an insufficient completion queue depth relative to the number of outstanding work requests generated by the user application. Although the driver will reset the card and recover from the error, all existing connections are lost at the time of the error, generally resulting in a segfault in the user application. Additionally, if opensm is running at the time the error occurs, then it will have to be restarted in order to resume proper operation.
thanks Doug, revising as follows: <quote> Hardware testing for the Mellanox MT25204 has revealed that an internal error occurs under certain high-load conditions. When the ib_mthca driver reports a catastrophic error on this hardware, it is usually related to an insufficient completion queue depth relative to the number of outstanding work requests generated by the user application. Although the driver will reset the hardware and recover from such an event, all existing connections are lost at the time of the error. This generally results in a segmentation fault in the user application. Further, if opensm is running at the time the error occurs, then it will have to be manually restarted in order to resume proper operation. </quote>
*** Bug 276111 has been marked as a duplicate of this bug. ***
This request was evaluated by Red Hat Product Management for inclusion, but this component is not scheduled to be updated in the current Red Hat Enterprise Linux release. This request will be reviewed for a future Red Hat Enterprise Linux release.
Adding the same release note to the "Known Issues" section of RHEL5.2. Please advise once this is resolved so we can document it as such. Thanks!
Hi,

The RHEL5.2 release notes will be dropped to translation on April 15, 2008, at which point no further additions or revisions will be entertained.

A mockup of the RHEL5.2 release notes can be viewed at the following link:

http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html

Please use the aforementioned link to verify whether your bugzilla is already in the release notes (if it needs to be). Each item in the release notes contains a link to its original bug; as such, you can search through the release notes by bug number.

Cheers,
Don
Tracking this bug for the Red Hat Enterprise Linux 5.3 Release Notes. This Release Note is currently located in the Known Issues section.
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
The current update for RHEL5.3 may help with this, but since I can't reproduce the problem, I'll need Gurhan to let me know.