Description of problem:
We have multiple Supermicro X8DTH-iF machines booting Fedora 15/16 kernels with a full system in an initramfs.
Sometimes LSI SAS controllers (mpt2sas module) or Mellanox InfiniBand controllers (mlx4_core, mlx4_en, mlx4_ib modules) fail to initialise with:
mpt2sas 0000:04:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
mlx4_core 0000:03:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
This happens with both R2.0a and R2.0c of the BIOS for this motherboard.
Other times, instead of vpd r/w errors, the boot process hangs around the time that mpt2sas prints:
mpt2sas0: sending diag reset !!
Due to udev's modprobing with modaliases, the igb, i7core_edac and dca modules are also being loaded at this point.
I don't know if this could be a kernel bug or whether this is really a bug in the BIOS firmware or a bug in both the LSI and Mellanox firmwares.
Version-Release number of selected component (if applicable):
Also with 2.6.38 kernels from F15 and 3.0 kernels from F16.
How reproducible:
50% of boots
Steps to Reproduce:
1. Boot recent kernel using a complete system in an initramfs
I think few people have seen this problem, for two reasons:
1. With many setups, mpt2sas is loaded before the root file system is mounted, after which the other modules are loaded. This provides enough of a gap between module initialisations to avoid these problems.
2. With older kernels before 2.6.37 (I tested CentOS 6's 2.6.32), the big kernel lock prevented multiple modules from initialising at the same time.
The machine has these expansion cards plugged into PCI-express slots connected to two different NUMA nodes. I still have to test whether this exacerbates the problem.
Add the following to /etc/modprobe.d/slowdown:
install mpt2sas /bin/sleep 5 && /sbin/modprobe --ignore-install mpt2sas
install igb /bin/sleep 15 && /sbin/modprobe --ignore-install igb
install mlx4_en /bin/sleep 20 && /sbin/modprobe --ignore-install mlx4_en
install mlx4_core /bin/sleep 30 && /sbin/modprobe --ignore-install mlx4_core
install mlx4_ib /bin/sleep 40 && /sbin/modprobe --ignore-install mlx4_ib
This slows down the process enough to avoid the vpd errors and hangs.
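To confirm that the delays actually remove the errors, the kernel log can be checked after each boot. A minimal sketch, assuming the messages have the format quoted above (driver name followed by the PCI address); the function name vpd_failures is made up for illustration:

```shell
#!/bin/sh
# Read kernel log text on stdin and print "<driver> <count>" for every
# driver that logged a "vpd r/w failed" message.
# vpd_failures is a hypothetical helper name, not an existing tool.
vpd_failures() {
    awk '/vpd r\/w failed/ {
        # The driver name is the field just before the PCI address,
        # which also works when dmesg prefixes a "[ timestamp]".
        for (i = 1; i < NF; i++)
            if ($(i+1) ~ /^[0-9a-f]+:[0-9a-f]+:[0-9a-f]+\.[0-9a-f]+:$/) {
                n[$i]++
                break
            }
    }
    END { for (d in n) print d, n[d] }' | sort
}
```

Usage would be something like `dmesg | vpd_failures`; empty output means no vpd r/w failure was logged during that boot. It does not detect the hang case, since a hung boot never reaches the point where the log can be inspected.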
The problem goes away if all expansion cards are connected to the same I/O hub/NUMA node.
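To see which NUMA node each card is attached to, the numa_node attribute that the kernel exposes in sysfs for every PCI device can be read. A minimal sketch, assuming sysfs is mounted at /sys; the function name pci_numa_nodes and its optional root argument are made up for illustration:

```shell
#!/bin/sh
# Print "<pci address> <numa node>" for every PCI device.
# The optional first argument (hypothetical, mainly for testing)
# overrides the sysfs root, which defaults to /sys.
pci_numa_nodes() {
    root="${1:-/sys}"
    for f in "$root"/bus/pci/devices/*/numa_node; do
        [ -r "$f" ] || continue    # glob did not match, or file unreadable
        dev=$(basename "$(dirname "$f")")
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

For the devices in this report, `pci_numa_nodes | grep -e 0000:03:00.0 -e 0000:04:00.0` would show whether the Mellanox and LSI cards report the same node. A value of -1 means the kernel has no NUMA information for that device.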
(In reply to comment #1)
> mpt2sas 0000:04:00.0: vpd r/w failed. This is likely a firmware bug on this
> device. Contact the card vendor for a firmware update.
> The problem goes away if all expansion cards are connected to the same I/O
> hub/NUMA node.
The message comes from drivers/pci/access.c:pci_vpd_pci22_wait() .
Each vpd has its own lock, which is held by the read/write functions while waiting for completion. I don't see how concurrent reads of vpd data in different nodes could cause a problem, so maybe it really is a firmware bug.
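If someone wants to exercise concurrent VPD reads outside of driver initialisation, the vpd attribute in sysfs can be read for several devices at once. This is only a sketch under assumptions: it is not known whether it reproduces the failure, reading VPD on affected hardware may itself trigger the error or a hang, and the function name read_vpd_concurrently and its optional root argument are made up for illustration. Reading the vpd files normally requires root.

```shell
#!/bin/sh
# Read the VPD of every PCI device that exposes one, all at the same time,
# and print "<pci address> ok" or "<pci address> failed" per device.
# The optional first argument (hypothetical, mainly for testing)
# overrides the sysfs root, which defaults to /sys.
read_vpd_concurrently() {
    root="${1:-/sys}"
    for f in "$root"/bus/pci/devices/*/vpd; do
        [ -r "$f" ] || continue    # device has no VPD, or file unreadable
        (
            dev=$(basename "$(dirname "$f")")
            if cat "$f" > /dev/null 2>&1; then
                echo "$dev ok"
            else
                echo "$dev failed"
            fi
        ) &                         # one background reader per device
    done
    wait                            # block until all readers have finished
}
```

The output order depends on which read finishes first, so pipe it through `sort` when comparing runs.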
Thanks for the feedback. We are taking up the issue with Supermicro. I'll report back once we know more. I have seen one or two hangs during boot even with all expansion cards connected to one I/O hub. It just happens much less often.
Have you heard back from Supermicro?
I also get this error:
arcmsr 0000:0b:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
We use a Supermicro X8DTI-F motherboard, and I think these problems are related, although I see the problem on CentOS 6.x.
We are having problems with AER on this motherboard too; something is wrong. We contacted Supermicro as well, but are still waiting for feedback.
As far as I understand, there is a bug in the BIOS related to AER. Supermicro has a BIOS that disables AER and is working on a BIOS update that offers a menu option to disable AER. I haven't had time to test this yet, but they said it would fix the problem.
I have that AER-disabled BIOS and it does not solve everything.
I have seen this problem happening with two cards:
- Quad NIC Intel PCI-E Card
- Areca RAID controller 1882
I have a long email thread with Supermicro complaining about this, and they attribute the problem to "Intel 5520/5500 chipset compatibility".
They say they have spoken with Intel about this, and Intel asks us to disable MSI on Linux with "pci=nomsi". This is no solution as I need SR-IOV; these are virtualization servers.
I have no time to try this out; if someone has time, it would be nice.
Supermicro should be more interested in solving this problem. All hardware manufacturers should work closely with the Linux communities; Linux is widespread.
Manufacturers like Intel, which help a lot and invest time and money in open source projects, gain points in my view.