Description of problem:
While investigating an IPoIB problem on one of the QA boxes (gqaib-02.sbu.lab.eng.bos.redhat.com), noticed a kernel stack trace for the cxgb3 driver. That driver is for a Chelsio 10GbE card that isn't cabled to the switch, so nothing critical. But, it might be of interest to whoever works with this part of the kernel.

[ 27.547324] iw_cxgb3: Chelsio T3 RDMA Driver - version 1.1
[ 27.557570] iw_cxgb3: Initialized device 0000:0a:00.0
[ 27.557580] cxgb3: p1p1, iscsi set MaxRxData to 16224 (0x3f603000)
[ 27.557585] ------------[ cut here ]------------
[ 27.557595] WARNING: at mm/page_alloc.c:2387 __alloc_pages_nodemask+0x8f8/0xae0()
[ 27.557597] Hardware name: PowerEdge 1950
[ 27.557599] Modules linked in: bnep bluetooth rfkill be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ib_iser iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr mlx4_ib ib_sa ib_mad mlx4_en iw_cxgb3 ib_core bnx2 mperf coretemp iTCO_wdt iTCO_vendor_support dcdbas microcode lpc_ich serio_raw mfd_core i5000_edac mlx4_core edac_core i5k_amb cxgb3 mdio shpchp vhost_net tun macvtap macvlan kvm_intel kvm uinput radeon i2c_algo_bit drm_kms_helper ttm drm mptsas i2c_core mptscsih mptbase scsi_transport_sas
[ 27.557654] Pid: 655, comm: NetworkManager Not tainted 3.9.6-200.fc18.x86_64 #1
[ 27.557656] Call Trace:
[ 27.557665] [<ffffffff8105ef85>] warn_slowpath_common+0x75/0xa0
[ 27.557669] [<ffffffff8105efca>] warn_slowpath_null+0x1a/0x20
[ 27.557672] [<ffffffff8113d7f8>] __alloc_pages_nodemask+0x8f8/0xae0
[ 27.557677] [<ffffffff8101b923>] ? native_sched_clock+0x13/0x80
[ 27.557681] [<ffffffff810880d2>] ? up+0x32/0x50
[ 27.557686] [<ffffffff8117c0a8>] alloc_pages_current+0xb8/0x190
[ 27.557691] [<ffffffff8113834a>] __get_free_pages+0x2a/0x80
[ 27.557696] [<ffffffff81187ba9>] kmalloc_order_trace+0x39/0xb0
[ 27.557699] [<ffffffff81187de0>] __kmalloc+0x1c0/0x250
[ 27.557708] [<ffffffffa04227c1>] cxgbi_ddp_init+0x71/0x260 [libcxgbi]
[ 27.557715] [<ffffffffa04390e7>] cxgb3i_dev_open+0xf7/0x384 [cxgb3i]
[ 27.557719] [<ffffffff81657bd5>] ? printk+0x61/0x63
[ 27.557736] [<ffffffffa026349e>] cxgb3_add_clients+0x3e/0x60 [cxgb3]
[ 27.557743] [<ffffffffa024f22c>] cxgb_open+0x32c/0x370 [cxgb3]
[ 27.557749] [<ffffffff81555bfe>] __dev_open+0xce/0x150
[ 27.557752] [<ffffffff81555ee1>] __dev_change_flags+0xa1/0x180
[ 27.557756] [<ffffffff81556078>] dev_change_flags+0x28/0x70
[ 27.557760] [<ffffffff81561ab1>] do_setlink+0x351/0x980
[ 27.557766] [<ffffffff81321751>] ? nla_parse+0x31/0xe0
[ 27.557769] [<ffffffff815647ae>] rtnl_newlink+0x36e/0x580
[ 27.557774] [<ffffffff8118a373>] ? __kmalloc_node_track_caller+0x63/0x2a0
[ 27.557777] [<ffffffff81564253>] rtnetlink_rcv_msg+0x113/0x300
[ 27.557781] [<ffffffff8154742c>] ? __alloc_skb+0x7c/0x290
[ 27.557784] [<ffffffff81564140>] ? __rtnl_unlock+0x20/0x20
[ 27.557789] [<ffffffff8157f571>] netlink_rcv_skb+0xb1/0xc0
[ 27.557792] [<ffffffff81560975>] rtnetlink_rcv+0x25/0x40
[ 27.557795] [<ffffffff8157ee91>] netlink_unicast+0x1a1/0x220
[ 27.557798] [<ffffffff8157f211>] netlink_sendmsg+0x301/0x3c0
[ 27.557804] [<ffffffff8153a450>] sock_sendmsg+0xb0/0xe0
[ 27.557807] [<ffffffff8153bf31>] ? sock_recvmsg+0xc1/0xf0
[ 27.557811] [<ffffffff8153be5c>] __sys_sendmsg+0x3ac/0x3c0
[ 27.557815] [<ffffffff8153de29>] sys_sendmsg+0x49/0x90
[ 27.557820] [<ffffffff8166a5d9>] system_call_fastpath+0x16/0x1b
[ 27.557822] ---[ end trace 7813defcd04c69c4 ]---

Version-Release number of selected component (if applicable):

# uname -a
Linux gqaib-02.sbu.lab.eng.bos.redhat.com 3.9.6-200.fc18.x86_64 #1 SMP Thu Jun 13 18:56:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

# modinfo cxgb3
filename: /lib/modules/3.9.6-200.fc18.x86_64/kernel/drivers/net/ethernet/chelsio/cxgb3/cxgb3.ko
firmware: cxgb3/ael2020_twx_edc.bin
firmware: cxgb3/ael2005_twx_edc.bin
firmware: cxgb3/ael2005_opt_edc.bin
firmware: cxgb3/t3c_psram-1.1.0.bin
firmware: cxgb3/t3b_psram-1.1.0.bin
firmware: cxgb3/t3fw-7.12.0.bin
version: 1.1.5-ko
license: Dual BSD/GPL
author: Chelsio Communications
description: Chelsio T3 Network Driver
srcversion: 72CE3DC4D9C62460CBB4FF6
alias: pci:v00001425d00000037sv*sd*bc*sc*i*
alias: pci:v00001425d00000036sv*sd*bc*sc*i*
alias: pci:v00001425d00000035sv*sd*bc*sc*i*
alias: pci:v00001425d00000032sv*sd*bc*sc*i*
alias: pci:v00001425d00000031sv*sd*bc*sc*i*
alias: pci:v00001425d00000030sv*sd*bc*sc*i*
alias: pci:v00001425d00000026sv*sd*bc*sc*i*
alias: pci:v00001425d00000025sv*sd*bc*sc*i*
alias: pci:v00001425d00000024sv*sd*bc*sc*i*
alias: pci:v00001425d00000023sv*sd*bc*sc*i*
alias: pci:v00001425d00000022sv*sd*bc*sc*i*
alias: pci:v00001425d00000021sv*sd*bc*sc*i*
alias: pci:v00001425d00000020sv*sd*bc*sc*i*
depends: mdio
intree: Y
vermagic: 3.9.6-200.fc18.x86_64 SMP mod_unload
parm: dflt_msg_enable:Chelsio T3 default message enable bitmap (int)
parm: msi:whether to use MSI or MSI-X (int)
parm: ofld_disable:whether to enable offload at init time or not (int)

# lspci -Qvvs 0a:00.0
0a:00.0 Ethernet controller: Chelsio Communications Inc T320 10GbE Dual Port Adapter
    Subsystem: Chelsio Communications Inc Device 0001
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 16
    Region 0: Memory at fcf7f000 (64-bit, non-prefetchable) [size=4K]
    Region 2: Memory at fc000000 (64-bit, non-prefetchable) [size=8M]
    Region 4: Memory at fcf7e000 (64-bit, non-prefetchable) [size=4K]
    Expansion ROM at fce00000 [disabled] [size=512K]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [48] MSI: Enable- Count=1/32 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [58] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal+ Unsupported-
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x8, ASPM L0s L1, Latency L0 unlimited, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABC, TimeoutDis-, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [94] Vital Product Data
        Unknown small resource type 00, will not decode more.
    Capabilities: [9c] MSI-X: Enable+ Count=32 Masked-
        Vector table: BAR=4 offset=00000000
        PBA: BAR=4 offset=00000800
    Capabilities: [100 v1] Device Serial Number 00-00-00-01-00-00-00-01
    Capabilities: [300 v1] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr-
        AERCap: First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
    Kernel driver in use: cxgb3

How reproducible:
Every time I've rebooted (3 out of 3 so far).

Steps to Reproduce:
1. Reboot the box
2. dmesg

Stack trace is near the end here

Additional info:
Box is on the Red Hat internal VPN, so can be accessed remotely to investigate if needed.
Looks like cxgb3 is hitting this WARN_ON_ONCE:

    /*
     * In the slowpath, we sanity check order to avoid ever trying to
     * reclaim >= MAX_ORDER areas which will never succeed. Callers may
     * be using allocators in order of preference for an area that is
     * too large.
     */
    if (order >= MAX_ORDER) {
        WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
        return NULL;
    }
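For anyone wanting to see when that check trips, here is a rough userspace sketch of the arithmetic. It assumes 4 KiB pages and MAX_ORDER = 11, which I believe are the x86_64 defaults for this kernel but which are assumptions here, not values taken from the trace. The sketch_* names are made up for illustration, and the size that cxgbi_ddp_init actually requests is not shown in the report.

```c
#include <stddef.h>

/* Assumed values, not read from the running kernel. */
#define SKETCH_PAGE_SHIFT 12   /* 4 KiB pages */
#define SKETCH_MAX_ORDER  11   /* largest usable order is MAX_ORDER - 1 */

/*
 * Smallest order such that (1 << order) pages cover `size` bytes.
 * This mirrors the idea behind the kernel's get_order(); it is not
 * the kernel's implementation.
 */
unsigned int sketch_get_order(size_t size)
{
    size_t page_size = (size_t)1 << SKETCH_PAGE_SHIFT;
    size_t pages = (size + page_size - 1) >> SKETCH_PAGE_SHIFT;
    unsigned int order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

/*
 * Non-zero if a page allocation of `size` bytes would reach the
 * "order >= MAX_ORDER" check quoted above. The warning itself only
 * fires when the caller did not pass __GFP_NOWARN.
 */
int sketch_hits_max_order(size_t size)
{
    return sketch_get_order(size) >= SKETCH_MAX_ORDER;
}
```

Under those assumptions, a request of exactly 4 MiB is order 10 and passes the check, while anything larger rounds up to order 11 and returns NULL from the slowpath.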
As a data point, that box has since been re-imaged with RHEL 6.4. The stack trace doesn't show up there. (guess it's in newer code) :D
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 18 kernel bugs. Fedora 18 has now been rebased to 3.11.4-101.fc18. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 19, and are still experiencing this issue, please change the version to Fedora 19. If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. It has been over a month since we asked you to test the 3.11 kernel updates and let us know if your issue has been resolved or is still a problem. When this happened, the bug was set to needinfo. Because the needinfo is still set, we assume either this is no longer a problem, or you cannot provide additional information to help us resolve the issue. As a result we are closing with insufficient data. If this is still a problem, we apologize; feel free to reopen the bug and provide more information so that we can work towards a resolution. If you experience different issues, please open a new bug report for those.