Bug 670907

Summary: [RHEL6.1][Kernel] BUG: unable to handle kernel NULL pointer dereference, IP: [<ffffffff814115f0>] get_rps_cpu+0x290/0x340
Product: Red Hat Enterprise Linux 6 Reporter: Jeff Burke <jburke>
Component: kernel Assignee: Neil Horman <nhorman>
Status: CLOSED ERRATA QA Contact: Liang Zheng <lzheng>
Severity: high Docs Contact:
Priority: medium    
Version: 6.1 CC: arozansk, fhrbata, hjia, kzhang, lzheng, pbunyan
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-2.6.32-112.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-05-23 20:38:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 676037    
Attachments:
Description Flags
fix for rps oops none

Description Jeff Burke 2011-01-19 16:19:50 UTC
Description of problem:
 While running tests on a scratch kernel carrying a "Receive Packet Steering" (RPS) patch, the system oopses.

Version-Release number of selected component (if applicable):
2.6.32-100.el6Scratch (test kernel)

How reproducible:
Always

Actual results:
igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX 
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready 
br0: port 1(eth0) entering learning state 
br0: port 1(eth0) entering forwarding state 
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 
IP: [<ffffffff814115f0>] get_rps_cpu+0x290/0x340 
PGD 46bc3b067 PUD 46c7b8067 PMD 0  
Oops: 0000 [#1] SMP  
last sysfs file: /sys/devices/system/cpu/cpu23/cache/index2/shared_cpu_map 
CPU 1  
Modules linked in: sit tunnel4 ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT bridge stp llc autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 xt_physdev iptable_filter ip_tables dm_mirror dm_region_hash dm_log kvm_intel kvm i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core sg igb dca ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif ahci megaraid_sas dm_mod [last unloaded: microcode] 
 
Modules linked in: sit tunnel4 ip6table_filter ip6_tables ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT bridge stp llc autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 xt_physdev iptable_filter ip_tables dm_mirror dm_region_hash dm_log kvm_intel kvm i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma i7core_edac edac_core sg igb dca ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif ahci megaraid_sas dm_mod [last unloaded: microcode] 
Pid: 0, comm: swapper Not tainted 2.6.32-100.el6scratch.x86_64 #1 QSSC-S4R 
RIP: 0010:[<ffffffff814115f0>]  [<ffffffff814115f0>] get_rps_cpu+0x290/0x340 
RSP: 0018:ffff88048e403b80  EFLAGS: 00010246 
RAX: 0000000000000001 RBX: 0000000000000001 RCX: ffff88086bc80000 
RDX: ffffffff81f02368 RSI: 0000000000000206 RDI: 0000000000000000 
RBP: ffff88048e403bc0 R08: ffffffff81411a60 R09: ffff88048e403ba8 
R10: 0000000000000000 R11: ffff88086a8c55f8 R12: ffff88086a8c55c0 
R13: 0000000000000006 R14: ffff88086b7c2022 R15: ffff88086a8c55f8 
FS:  0000000000000000(0000) GS:ffff88048e400000(0000) knlGS:0000000000000000 
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b 
CR2: 0000000000000010 CR3: 000000046bd86000 CR4: 00000000000026e0 
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
Process swapper (pid: 0, threadinfo ffff880c6dde8000, task ffff88086de9e040) 
Stack: 
 ffffffff81411a60 ffff88086bc80000 000000078e416f00 ffff88086a8c55c0 
<0> ffff88086a8c55c0 ffff88086bc806c0 ffff88086b7c2022 ffff88086a8c55f8 
<0> ffff88048e403be0 ffffffff81411a7d ffff88086b7c2022 ffff88086a8c55c0 
Call Trace: 
 <IRQ>  
 [<ffffffff81411a60>] ? netif_receive_skb+0x0/0x50 
 [<ffffffff81411a7d>] netif_receive_skb+0x1d/0x50 
 [<ffffffffa02fe3c8>] br_handle_frame_finish+0x198/0x260 [bridge] 
 [<ffffffffa02fe63a>] br_handle_frame+0x1aa/0x250 [bridge] 
 [<ffffffff8140caa3>] __netif_receive_skb+0x1c3/0x670 
 [<ffffffff81411aa0>] netif_receive_skb+0x40/0x50 
 [<ffffffff81411ba8>] napi_skb_finish+0x58/0x70 
 [<ffffffff814120a9>] napi_gro_receive+0x39/0x50 
 [<ffffffffa01128c6>] igb_poll+0x8d6/0xf30 [igb] 
 [<ffffffff8104f0ba>] ? enqueue_entity+0x13a/0x340 
 [<ffffffff810561ec>] ? try_to_wake_up+0xcc/0x400 
 [<ffffffff81412273>] net_rx_action+0x103/0x2a0 
 [<ffffffff8108d373>] ? hrtimer_get_next_event+0xc3/0x100 
 [<ffffffff8106bcb7>] __do_softirq+0xb7/0x1e0 
 [<ffffffff810d0dc0>] ? handle_IRQ_event+0x60/0x170 
 [<ffffffff8100c2cc>] call_softirq+0x1c/0x30 
 [<ffffffff8100df05>] do_softirq+0x65/0xa0 
 [<ffffffff8106bab5>] irq_exit+0x85/0x90 
 [<ffffffff814ce455>] do_IRQ+0x75/0xf0 
 [<ffffffff8100bad3>] ret_from_intr+0x0/0x11 
 <EOI>  
 [<ffffffff812ad4ba>] ? intel_idle+0xda/0x160 
 [<ffffffff812ad49d>] ? intel_idle+0xbd/0x160 
 [<ffffffff813dc2a7>] cpuidle_idle_call+0xa7/0x140 
 [<ffffffff81009e96>] cpu_idle+0xb6/0x110 
 [<ffffffff814c01fc>] start_secondary+0x1fc/0x23f 
Code: 85 c0 48 8b 7d c8 0f 84 11 fe ff ff 83 bf f0 03 00 00 01 48 c7 c1 24 f5 7e 81 8b 9f a0 06 00 00 48 0f 44 cf 48 8b bf 30 04 00 00 <48> 8b 57 10 48 89 4d c0 48 89 55 c8 e8 9f 5c f1 ff 48 8b 4d c0  
RIP  [<ffffffff814115f0>] get_rps_cpu+0x290/0x340 
 RSP <ffff88048e403b80> 
CR2: 0000000000000010 

Expected results:
System should not Oops

Additional info:
This needs br0 interface to reproduce

Comment 2 Neil Horman 2011-01-21 21:23:22 UTC
So this is a weird one. For some reason, when I add a bridge into the mix, we oops in get_rps_cpu. The bridge sends us through a path in get_rps_cpu that we otherwise wouldn't take, which triggers a printk, and that is where we seem to oops.
But the pointers that the printk interrogates are all valid, so I'm not at all certain what's going on here yet.

Comment 3 Neil Horman 2011-01-21 21:46:04 UTC
OK, it appears the netdev_warn macro is trying to dereference something in the netdev structure of the bridge interface that isn't set properly. If I remove the call, everything works fine. I'll figure out what it is and get a patch together shortly.

Comment 4 Neil Horman 2011-01-22 02:08:29 UTC
Note to self: it's net_device->dev.parent that's NULL, and netdev_warn presumes that it's always non-NULL. I'll need to patch that.
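
For illustration only, here is a minimal userspace sketch of the failure mode described in this comment: a warning helper that reads through dev.parent must cope with virtual interfaces such as a bridge, which have no parent device. The struct layouts and the helper names netdev_parent_driver and format_netdev_warn are hypothetical stand-ins, not the kernel's definitions and not the attached patch.

```c
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-ins for the kernel structures; the field layout is
 * illustrative only and does not match the real kernel definitions. */
struct device {
    struct device *parent;
    const char *driver_name;
};

struct net_device {
    char name[16];
    struct device dev;
};

/* Hypothetical helper: return the parent's driver name, or "" when the
 * interface has no parent device (as with br0 in this report). */
static const char *netdev_parent_driver(const struct net_device *ndev)
{
    const struct device *parent = ndev ? ndev->dev.parent : NULL;

    if (parent == NULL || parent->driver_name == NULL)
        return "";
    return parent->driver_name;
}

/* Hypothetical helper: build a warning line without ever dereferencing
 * a NULL parent pointer. */
static int format_netdev_warn(char *buf, size_t len,
                              const struct net_device *ndev, const char *msg)
{
    const char *drv = netdev_parent_driver(ndev);

    if (drv[0] != '\0')
        return snprintf(buf, len, "%s %s: %s", drv, ndev->name, msg);
    return snprintf(buf, len, "%s: %s", ndev ? ndev->name : "(null)", msg);
}
```

With a physical interface whose parent has a driver name, the driver name is included in the prefix. With a parentless interface, the helper falls back to the interface name alone instead of faulting.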

Comment 5 Neil Horman 2011-01-24 15:57:06 UTC
Looks like upstream commit 08c801f8d45387a1b46066aad1789a9bb9c4b645 inadvertently fixed this when RPS was originally introduced. I'll have a patch shortly.

Comment 6 Neil Horman 2011-01-24 18:07:09 UTC
Created attachment 475004 [details]
fix for rps oops

Comment 7 RHEL Program Management 2011-01-24 22:50:22 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 8 Aristeu Rozanski 2011-02-03 16:38:34 UTC
Patch(es) available on kernel-2.6.32-112.el6

Comment 13 Aristeu Rozanski 2011-03-02 15:23:25 UTC
This patch went into 104 and was released in 112. A follow-up patch was submitted and got
into 115.
(625487)

Comment 15 errata-xmlrpc 2011-05-23 20:38:10 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html