Bug 732956 - vpd r/w failed with Supermicro X8DTH-iF
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2011-08-24 09:50 UTC by Albert Strasheim
Modified: 2012-02-29 14:51 UTC (History)

Doc Type: Bug Fix
Last Closed: 2012-02-28 20:55:08 UTC



Description Albert Strasheim 2011-08-24 09:50:00 UTC
Description of problem:

We have multiple Supermicro X8DTH-iF machines booting Fedora 15/16 kernels with a full system in an initramfs.

Sometimes LSI SAS controllers (mpt2sas module) or Mellanox InfiniBand controllers (mlx4_core, mlx4_en, mlx4_ib modules) fail to initialise with

mpt2sas 0000:04:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

or

mlx4_core 0000:03:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

This happens with both R2.0a and R2.0c of the BIOS for this motherboard.
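To see which devices are affected on a given boot, the messages above can be pulled out of a saved kernel log. This is a minimal sketch in Python; the function name `find_vpd_failures` and the regular expression are illustrative assumptions, matched only against the message format quoted above.

```python
import re

# Matches the kernel message quoted above; the driver name and PCI address vary.
# The pattern is an assumption based on the two sample lines in this report.
VPD_FAILURE = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<address>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): vpd r/w failed"
)


def find_vpd_failures(log_text):
    """Return (driver, pci_address) pairs for every 'vpd r/w failed' line."""
    return [
        (match.group("driver"), match.group("address"))
        for match in VPD_FAILURE.finditer(log_text)
    ]
```

Feeding it the output of dmesg from a failing boot should list each driver and PCI address that reported the error, which makes it easier to compare boots.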

Other times, instead of vpd r/w errors, the boot process hangs around the time that mpt2sas prints:

mpt2sas0: sending diag reset !!

Because udev loads modules by modalias, the igb, i7core_edac and dca modules are also being loaded at this point.

I don't know whether this is a kernel bug, a bug in the BIOS firmware, or a bug in both the LSI and Mellanox firmware.

Version-Release number of selected component (if applicable):

kernel-2.6.40.3-0.fc15.x86_64

Also with 2.6.38 kernels from F15 and 3.0 kernels from F16.

How reproducible:

50% of boots

Steps to Reproduce:
1. Boot recent kernel using a complete system in an initramfs
  
Additional info:

I think few people have seen this problem, for two reasons:

1. With many setups, mpt2sas is loaded before the root file system is mounted, after which the other modules are loaded. This provides enough of a gap between module initialisations to avoid these problems.

2. With kernels older than 2.6.37 (I tested CentOS 6's 2.6.32), the big kernel lock prevented multiple modules from initialising at the same time.

The machine has these expansion cards plugged into PCI-express slots connected to two different NUMA nodes. I still have to test whether this exacerbates the problem.

Workaround:

Add the following to /etc/modprobe.d/slowdown:

install mpt2sas /bin/sleep 5 && /sbin/modprobe --ignore-install mpt2sas
install igb /bin/sleep 15 && /sbin/modprobe --ignore-install igb
install mlx4_en /bin/sleep 20 && /sbin/modprobe --ignore-install mlx4_en
install mlx4_core /bin/sleep 30 && /sbin/modprobe --ignore-install mlx4_core
install mlx4_ib /bin/sleep 40 && /sbin/modprobe --ignore-install mlx4_ib

This slows down the process enough to avoid the vpd errors and hangs.
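When experimenting with different delays, the lines above can be generated rather than edited by hand. A small sketch, assuming Python is available on the build host; `slowdown_lines` and `DELAYS` are illustrative names, and the module names and delays are the ones from the workaround above.

```python
def slowdown_lines(delays):
    """Build modprobe 'install' lines that sleep before loading each module."""
    return [
        f"install {module} /bin/sleep {seconds} "
        f"&& /sbin/modprobe --ignore-install {module}"
        for module, seconds in delays
    ]


# Module names and delays (in seconds) taken from the workaround above.
DELAYS = [
    ("mpt2sas", 5),
    ("igb", 15),
    ("mlx4_en", 20),
    ("mlx4_core", 30),
    ("mlx4_ib", 40),
]

print("\n".join(slowdown_lines(DELAYS)))
```

The printed lines can be redirected into the modprobe.d file used for the workaround.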

Comment 1 Albert Strasheim 2011-08-24 12:36:59 UTC
The problem goes away if all expansion cards are connected to the same I/O hub/NUMA node.
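For anyone trying to confirm the same layout, the node each card is attached to can be read from the `numa_node` attribute that sysfs exposes for PCI devices. A minimal sketch; the function name `numa_nodes` and the `sysfs_root` parameter are assumptions for illustration, and the addresses in the usage note are the ones from the log messages above.

```python
import os


def numa_nodes(addresses, sysfs_root="/sys/bus/pci/devices"):
    """Map each PCI address to the NUMA node reported by sysfs.

    A value of -1 means the kernel has no node information for the device.
    None means the device directory or attribute could not be read.
    """
    nodes = {}
    for address in addresses:
        path = os.path.join(sysfs_root, address, "numa_node")
        try:
            with open(path) as attribute:
                nodes[address] = int(attribute.read().strip())
        except (OSError, ValueError):
            nodes[address] = None
    return nodes
```

Calling `numa_nodes(["0000:03:00.0", "0000:04:00.0"])` on the affected machine would show whether the two controllers sit behind different I/O hubs.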

Comment 2 Chuck Ebbert 2011-08-26 20:39:03 UTC
(In reply to comment #1)
> mpt2sas 0000:04:00.0: vpd r/w failed.  This is likely a firmware bug on this
> device.  Contact the card vendor for a firmware update.

> The problem goes away if all expansion cards are connected to the same I/O
> hub/NUMA node.

The message comes from drivers/pci/access.c:pci_vpd_pci22_wait().

Each vpd has its own lock, which is held by the read/write functions while waiting for completion. I don't see how concurrent reads of vpd data in different nodes could cause a problem, so maybe it really is a firmware bug.
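To illustrate the locking and waiting described above, here is a simplified Python model of a VPD read. It is not the kernel code: `FakeVpdDevice`, `VpdReader` and the poll limit are hypothetical, and a poll counter stands in for the kernel's timeout. The only detail taken from the PCI VPD capability is that bit 15 of the address register is the completion flag, which the device sets once a read has finished.

```python
import threading

VPD_ADDR_F = 0x8000     # completion flag, bit 15 of the VPD address register
VPD_ADDR_MASK = 0x7FFF  # remaining bits hold the VPD byte offset


class FakeVpdDevice:
    """Hypothetical card that completes a read after `polls_needed` polls."""

    def __init__(self, data, polls_needed):
        self.data = data
        self.polls_needed = polls_needed
        self.addr_reg = 0
        self._polls = 0

    def write_addr(self, offset):
        # Writing the offset with the flag cleared starts a read.
        self.addr_reg = offset & VPD_ADDR_MASK
        self._polls = 0

    def read_addr(self):
        self._polls += 1
        if self._polls >= self.polls_needed:
            self.addr_reg |= VPD_ADDR_F  # device signals that data is ready
        return self.addr_reg

    def read_data(self):
        offset = self.addr_reg & VPD_ADDR_MASK
        return self.data[offset:offset + 4]


class VpdReader:
    """One lock per device, held for the whole request including the wait."""

    def __init__(self, device, max_polls=50):
        self.device = device
        self.max_polls = max_polls
        self.lock = threading.Lock()

    def read_dword(self, offset):
        with self.lock:
            self.device.write_addr(offset)
            for _ in range(self.max_polls):
                if self.device.read_addr() & VPD_ADDR_F:
                    return self.device.read_data()
            # The flag never flipped: this is the case that gets logged.
            raise TimeoutError("vpd r/w failed")
```

In this model two readers for two different devices never share a lock, which matches the reasoning above: a timeout can only come from the device failing to set the flag in time.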

Comment 3 Albert Strasheim 2011-08-27 21:07:11 UTC
Thanks for the feedback. We are taking up the issue with Supermicro. I'll report back once we know more. I have seen one or two hangs during boot even with all expansion cards connected to one I/O hub. It just happens much less often.

Comment 4 Josh Boyer 2011-10-24 19:21:21 UTC
Have you heard back from Supermicro?

Comment 5 Igor Neves 2011-12-27 11:20:42 UTC
Hi,

I also get this error:

arcmsr 0000:0b:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.

We use a Supermicro X8DTI-F motherboard. I think these problems are related, although I see this one on CentOS 6.x.

We are also having problems with AER on this motherboard; something is wrong. We contacted Supermicro too, but are still waiting for feedback.

Comment 6 Albert Strasheim 2012-02-29 04:37:18 UTC
As far as I understand, there is a bug in the BIOS related to AER. Supermicro has a BIOS that disables AER and is working on a BIOS update that offers a menu option to disable AER. I haven't had time to test this yet, but they said it would fix the problem.

Comment 7 Igor Neves 2012-02-29 14:51:52 UTC
I have that AER-disabled BIOS and it does not solve everything.

I have seen this problem with two cards:
- Quad NIC Intel PCI-E Card
- Areca RAID controller 1882

I have a long email thread with Supermicro complaining about this, and they attribute the problem to "Intel 5520/5500 chipset compatibility".

They say they have spoken with Intel about this, and Intel asks us to disable MSI on Linux with "pci=nomsi". This is not a solution for me, as I need SR-IOV; these are virtualization servers.

I have no time to try this out; if someone has time, it would be nice.

Supermicro should be more interested in solving this problem. All hardware manufacturers should work closely with the Linux communities; Linux is widespread.

Manufacturers like Intel, which help a lot and invest time and money in open source projects, gain points in my view.

