| Summary: | NetXen NX3031 Multifunction 1-Gigabit Adapter doesn't work in rhel 6.1 | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Adam Okuliar <aokuliar> | ||||
| Component: | kernel | Assignee: | bob picco <bpicco> | ||||
| Status: | CLOSED DUPLICATE | QA Contact: | Adam Okuliar <aokuliar> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 6.1 | CC: | amit.salecha, arozansk, borgan, bpicco, cdupuis, GR-Linux-NIC-Dev, he.wu, jiayin.shao, prarit, sucheta.chakraborty | ||||
| Target Milestone: | rc | Flags: | GR-Linux-NIC-Dev: needinfo+ | ||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2011-03-23 12:49:30 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: |
|
||||||
|
Description
Adam Okuliar
2011-02-15 09:49:08 UTC
Created attachment 478829
compressed content of /var/log directory
This is rather strange. Looks like the driver is having trouble communicating with the hardware. Is this a stand-up card or is this LAN on Motherboard (LOM)? Would it be possible to seat the card in another slot in the server?

It is a standalone QLE3044-RJ-CK card in a PCI-E slot. I'll try to put it into another slot and let you know the result. But please note that this card works fine in RHEL 6.0.

What is the flashed firmware version on the card? Please paste the output of "ethtool -i <eth>". Also, can I get the output of "lspci -v"? Thanks, Sucheta.

Hi, we had some problems with this adapter in the past. Please check: https://bugzilla.redhat.com/show_bug.cgi?id=640228 These are the same cards in the same machines. You can also find lspci -vv output and the firmware revision there. Thanks, Adam

Thanks Adam for the info. I tried the same driver (4.0.75) and firmware (4.0.534) version on a dl380g7 machine. I tried with the following two kernels:
- 2.6.32-81.el6.bz562940.x86_64 (older RHEL6.1 kernel): I don't see any problem with this kernel.
- 2.6.32-118.el6.x86_64 (current RHEL6.1 kernel): here I see the problem described in the bug report.
So, can you tell me whether the AER related code path has changed between these two kernels? If yes, what are the changes?

I have done some investigation and it looks like it happens only on systems with a Xeon CPU. AMD systems are unaffected. So this issue is probably related to the BIOS or chipset somehow. I can give you access to the affected machine, so you will be able to investigate this. Thanks, Adam.

Thanks Adam for the additional info. However, I can reproduce the bug here. But if I go back to the previous kernel version on the same machine - 2.6.32-81 - I don't see any problem. So it has something to do with the newer kernel. Can you tell me what the differences between these two kernels are?

Hi. Complete kernel change-log is available in brew underneath rpm links.
I'm not familiar with the AER code, but I'm guessing that our problem might be here:
- [pci] Fix KABI breakage (Prarit Bhargava) [661301]
- [pci] PCIe/AER: Disable native AER service if BIOS has precedence (Prarit Bhargava) [661301]
- [pci] aerdrv: fix uninitialized variable warning (Prarit Bhargava) [661301]
- [pci] hotplug: Fix build with CONFIG_ACPI unset (Prarit Bhargava) [661301]
- [pci] PCIe: Ask BIOS for control of all native services at once (Prarit Bhargava) [661301]
- [pci] PCIe: Introduce commad line switch for disabling port services (Prarit Bhargava) [661301]
- [pci] ACPI/PCI: Negotiate _OSC control bits before requesting them (Prarit Bhargava) [661301]
- [pci] ACPI/PCI: Make acpi_pci_query_osc() return control bits (Prarit Bhargava) [661301]
or here:
- [pci] PCIe AER: use pci_is_pcie() (Prarit Bhargava) [661301]
- [pci] introduce pci_is_pcie() (Prarit Bhargava) [661301]
- [pci] PCIe AER: use pci_pcie_cap() (Prarit Bhargava) [661301]
- [pci] fix memory leak in aer_inject (Prarit Bhargava) [661301]
- [pci] use better error return values in aer_inject (Prarit Bhargava) [661301]
- [pci] add support for PCI domains to aer_inject (Prarit Bhargava) [661301]
Please see the full change-log at: https://brewweb.devel.redhat.com/buildinfo?buildID=157683
I'm cc-ing Prarit Bhargava; maybe he will provide further detail. Thanks, Adam

Thanks Adam. But I need to understand exactly what these changes are and why they were made - in order to root cause this issue. Prarit please respond. Thanks, Sucheta.

(In reply to comment #11)
> Thanks Adam.
>
> But I need to understand exactly what these changes are and why they were made
> - in order to root cause this issue.
>
> Prarit please respond.
>
> Thanks,
> Sucheta.
These are updates to the existing AER code in RHEL6. Can you boot with pci=noaer to see if you still have a problem? P.

pci=noaer did not help. I can give you access to this system. Can you please investigate it? Adam

(In reply to comment #13)
> pci=noaer did not help. I can give you access to this system. Can you please
> investigate it?
>
Sure ... but if pci=noaer is a boot option then AER is completely disabled -- which means the problem is likely not in the AER code. But I can help to bisect. Can you please loan the system to me? Thanks, P.

...loaned. To test functionality please assign IP addresses to the cards with MACs 00:0e:1e:02:05:a6 and 00:0e:1e:02:05:82 and test connectivity. Sometimes a small ping passes without problems, but a larger data transfer causes problems. You can use netcat or the netperf benchmark: https://brewweb.devel.redhat.com/buildinfo?buildID=135954
netperf -L local.ip -H remote.ip
Only hp-dl380g7 is affected. On the 385 the same card works fine. If any problems occur please let me know. Big thanks. Adam

There seem to be other problems with the netxen driver on this system. If I turn off PCIE AER (pci=noaer) and boot, an idle system eventually starts streaming:
NMI: IOCK error (debug interrupt?)
CPU 0
Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss autofs4 sunrpc
cpufreq_ondemand acpi_cpufreq freq_table ipv6 dm_mirror dm_region_hash dm_log
power_meter microcode serio_raw iTCO_wdt iTCO_vendor_support hpilo bnx2
i7core_edac edac_core ixgbe mdio igb dca sg netxen_nic ext4 mbcache jbd2 sr_mod
cdrom sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa radeon ttm
drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded:
scsi_wait_scan]
<snip -- sorry couldn't capture the whole thing>
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process events/0 (pid: 75, threadinfo ffff8800bbe66000, task ffff8800bbe65500)
Stack:
0000000000000046 00000b65778379d7 00000b657787f040 00000b65778379d7
<0> 00000b657787f040 0000000000000080 00000b65778379d7 ffff880028203f58
<0> ffff880028203f78 ffff88002820db40 0000000000000000 ffffc900068b2140
Call Trace:
<IRQ>
[<ffffffff814e117b>] smp_apic_timer_interrupt+0x6b/0x9b
[<ffffffff8100bc93>] apic_timer_interrupt+0x13/0x20
<EOI>
[<ffffffffa025ef2e>] ? netxen_nic_hw_read_wx_2M+0x8e/0x150 [netxen_nic]
[<ffffffffa0262330>] ? netxen_fw_poll_work+0x0/0x2e0 [netxen_nic]
[<ffffffffa02623e5>] netxen_fw_poll_work+0xb5/0x2e0 [netxen_nic]
[<ffffffff8108dfce>] ? prepare_to_wait+0x4e/0x80
[<ffffffffa0262330>] ? netxen_fw_poll_work+0x0/0x2e0 [netxen_nic]
[<ffffffff810883b0>] worker_thread+0x170/0x2a0
[<ffffffff8108dce0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81088240>] ? worker_thread+0x0/0x2a0
[<ffffffff8108d976>] kthread+0x96/0xa0
[<ffffffff8100c1ca>] child_rip+0xa/0x20
[<ffffffff8108d8e0>] ? kthread+0x0/0xa0
[<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Code: 45 b0 0f 8d 15 ff ff ff 48 89 45 b0 48 89 45 a0 e9 08 ff ff ff 49 89 9d
b0 00 00 00 eb 8d 48 8b 45 b0 48 89 45 a0 e9 f2 fe ff ff <41> c7 85 94 00 00 00
00 00 00 00 48 83 c4 48 5b 41 5c 41 5d 41
ie) The netxen driver is broken.
Adam -- I cannot find 00:0e:1e:02:05:82 on the system ... are you sure that's correct? P.

00:0e:1e:02:05:82 is in hp-dl385g7-01.lab.eng.brq.redhat.com.
[root@hp-dl385g7-01 ~]# ip l | grep 00:0e:1e:02:05:82
link/ether 00:0e:1e:02:05:82 brd ff:ff:ff:ff:ff:ff
Do you have access to this system?
Adam
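Adam's observation that a small ping sometimes passes while a larger transfer fails suggests testing with a sustained stream rather than a single packet. A minimal Python sketch of such a bulk-transfer check is given here; it is a hypothetical stand-in for the netcat/netperf runs mentioned in the thread, not something that was used in this bug. The function names (`sink_server`, `bulk_transfer`, `loopback_demo`), the 8 MiB default payload and the loopback address are all assumptions for illustration. On real hardware the sink would run on the remote host and the client would target the address assigned to the netxen interface.

```python
import socket
import threading


def sink_server(listener, received):
    """Accept one connection and count the bytes it delivers (hypothetical helper)."""
    conn, _ = listener.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:  # peer closed the connection
                break
            total += len(chunk)
    received.append(total)


def bulk_transfer(host, port, nbytes, chunk_size=65536):
    """Send nbytes of zeros to host:port and return the number of bytes sent."""
    payload = b"\0" * chunk_size
    sent = 0
    with socket.create_connection((host, port), timeout=30) as sock:
        while sent < nbytes:
            n = min(chunk_size, nbytes - sent)
            sock.sendall(payload[:n])
            sent += n
    return sent


def loopback_demo(nbytes=8 * 1024 * 1024):
    """Run sink and sender on 127.0.0.1 so the sketch can be tried anywhere."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    received = []
    worker = threading.Thread(target=sink_server, args=(listener, received))
    worker.start()
    sent = bulk_transfer("127.0.0.1", port, nbytes)
    worker.join()
    listener.close()
    return sent, received[0]


if __name__ == "__main__":
    sent, got = loopback_demo()
    print("sent", sent, "received", got)
```

The loopback run only shows that the sketch itself works. Against the affected interface, a socket timeout or a received count lower than the sent count would correspond to the failure described above.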
Ameen, do you have any other idea? I would be glad to solve this before 6.1GA. Thanks, Adam

Adam, we are looking into PCI traces collected for this issue. We are seeing some weird things going on on the PCI bus. I have asked the ASIC team to take a further look at the traces and will provide an update once I hear back from them. As we have indicated in Comment #7, when we tested with the 2.6.32-81.el6.bz562940.x86_64 kernel everything was fine. The 2.6.32-118.el6.x86_64 kernel seems to bring out these issues. Something must have changed (need not be in the AER code) which is triggering these issues. Do you have a team in Red Hat that can help us to understand the change/patch in the kernel which is triggering these issues? Thanks, -Ameen

Adding Bob. P.

(In reply to comment #16)
> There seem to be other problems with the netxen driver on this system. If I
> turn off PCIE AER (pci=noaer) and boot, an idle system eventually starts
> streaming:
> NMI: IOCK error (debug interrupt?)
> [...]
> ie) The netxen driver is broken.
Prarit, we are seeing weird things going on in the PCI bus with the 2.6.32-118.el6.x86_64 kernel. When we tested with the 2.6.32-81.el6.bz562940.x86_64 kernel, everything was working fine. What kernel version did you use in this experiment? All we have to do to reproduce this issue is to boot the machine with pci=noaer and leave the system idle, right? Please let us know if there were any other variables involved (which we are not aware of). We will try to reproduce this failure and collect a PCI trace. This way we can verify if the footprints here are the same as the other issues or something different.

Adam, can I have this machine for a day or possibly two? The netxen driver changed very little since 2.6.32-81.el6.bz562940.x86_64. 2.6.32-81.el6.bz562940.x86_64 is a brew/cvs build of bz562940. The patches entered RHEL6.1 at kernel-2.6.32-100.el6. Chad's bz667194 changes arrived at kernel-2.6.32-101. I have kernel-2.6.32-122.el6 on hp-nehalem-02 without any issues.
So I'd like to try kernel-2.6.32-100.el6 and kernel-2.6.32-101. From bz640228 it also appears this netxen card has a bad history? thanx, bob

From PCI traces we don't see any downstream TLPs from the host bridge when the issue occurs. It would be good to understand from which kernel version onwards we see this issue and what changes went into that kernel version.

(In reply to comment #22)
> (In reply to comment #16)
> > [...]
> > ie) The netxen driver is broken.
> [...]
> We will try to reproduce this failure and collect a PCI trace. This way we can
> verify if the footprints here are the same as the other issues or something
> different
The footprints in the PCI trace for this issue are the same as what we have seen for other issues.

(In reply to comment #23)
Hi Bob, I can give you access to those machines. Can you please give me an estimate of when you will have time to investigate this? I need it for scheduling my work. Big Thanks, Adam

(In reply to comment #26)
> (In reply to comment #23)
>
> Hi Bob,
>
> I can give you access to those machines. Can you please give me some estimation
> about when you will have time to investigate this? I need it for scheduling my
> work.
>
> Big Thanks
> Adam
Hi Adam, I assume you mean hp-dl380g7-01.lab.eng.brq.redhat.com? How about one or two days next week? you're welcome, bob

OK, so what about Monday and Tuesday? Are you OK with it? Adam

HP has reported the same issue on https://bugzilla.redhat.com/show_bug.cgi?id=688489

(In reply to comment #28)
> OK, so what about Monday and Tuesday? Are you OK with it?
> Adam
Monday and Tuesday will work. I saw the HP issue you mentioned in comment 29. thanx, bob

Hi Bob, I loaned two machines to you in beaker. Their hostnames are:
hp-dl380g7-01.lab.eng.brq.redhat.com
hp-dl385g7-01.lab.eng.brq.redhat.com
Both of them have a QLogic card inside. Interfaces are marked as netxen[0-4]. This configuration disappears after reboot. Please run /root/prepare_sys.py to restore the configuration after reboot. On the 380 there are problems communicating with the card; on the 385 the card is working fine. You can test throughput between the 380 and 385 via the QLogic interfaces with netperf: netperf -H 172.16.25.20. Please let me know if you need any assistance. Thanks, Adam

Hi Adam, thanks for your machines. I've returned them. I'm fairly confident, because of the success of the git tag 101 kernel, that this bug is a duplicate of bug 681870. I can't confirm this without access to our git tree. Also I don't want to tie your machines up longer. Adam, there is a patch attached to bug 681870. Prarit would like you to test it and report back to him, please. Oh, the git tagged 101 kernel is the last netxen commit thus far in r6.1. Ameen, thanks for your help too. bob

*** This bug has been marked as a duplicate of bug 681870 ***

Bob, thanks for the update. I don't have permission to view bug 681870. Can you please add me to this bug?

(In reply to comment #33)
> Bob, Thanks for the update. I don't have permission to view bug 681870. Can you
> please add me to this bug?
Done. |
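The narrowing discussed in the thread (a known-good 2.6.32-81 build, a known-bad 2.6.32-118 build, then trying builds in between) amounts to a bisection over kernel build numbers. A minimal sketch in Python follows, assuming that every build from the first bad one onward is also bad. `first_bad_build` and the `is_bad` callback are hypothetical names, and the callback stands in for the manual step of booting a build and running the netperf test. The value 110 in the demo is a placeholder chosen only to exercise the search; it is not a finding from this bug.

```python
from typing import Callable, Sequence


def first_bad_build(builds: Sequence[int], is_bad: Callable[[int], bool]) -> int:
    """Return the first bad build by binary search.

    Assumes builds is sorted, builds[0] is known good, builds[-1] is known
    bad, and that every build after the first bad one is bad as well.
    """
    lo, hi = 0, len(builds) - 1  # lo indexes a good build, hi a bad one
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return builds[hi]


if __name__ == "__main__":
    # Build numbers between the two kernels named in the thread.
    builds = list(range(81, 119))

    # Placeholder only: pretend the regression first appears in build 110.
    placeholder_first_bad = 110
    tested = []

    def is_bad(build: int) -> bool:
        # Stand-in for "install this build, boot it, run the transfer test".
        tested.append(build)
        return build >= placeholder_first_bad

    print(first_bad_build(builds, is_bad))  # prints 110
    print("builds tested:", tested)
```

With 38 candidate builds this needs at most six boots, which matters when each test ties up a loaned machine.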