Bug 677786 - Panic in get_rps_cpu+0x1ad/0x320 on kvm guest when attempting to run LTP containers test.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Neil Horman
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Duplicates: 680047
Depends On:
Blocks:
 
Reported: 2011-02-15 20:39 UTC by Mike Gahagan
Modified: 2011-05-23 20:39 UTC (History)

Fixed In Version: kernel-2.6.32-121.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-23 20:39:34 UTC
Target Upstream Version:
Embargoed:


Attachments
reproducer (8.33 KB, application/x-compressed-tar)
2011-02-17 11:20 UTC, Zhang Kexin
backport of commit 4315d834c1496ddca977e9e22002b77c85bfec2c (612 bytes, patch)
2011-02-24 15:28 UTC, Neil Horman


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0542 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 6.1 kernel security, bug fix and enhancement update 2011-05-19 11:58:07 UTC

Description Mike Gahagan 2011-02-15 20:39:56 UTC
-114.0.1 x86_64 kvm

test1239.test.redhat.com login: lo: Disabled Privacy Extensions
lo: Disabled Privacy Extensions
ADDRCONF(NETDEV_UP): veth0: link is not ready
ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
BUG: unable to handle kernel paging request at 000000004e3bf418
IP: [<ffffffff8141f10d>] get_rps_cpu+0x1ad/0x320
PGD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/net/veth1/flags
CPU 0 
Modules linked in: veth autofs4 sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log microcode virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc virtio_net i2c_piix4 i2c_core sg ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio ata_generic pata_acpi ata_piix dm_mod [last unloaded: speedstep_lib]

Modules linked in: veth autofs4 sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log microcode virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc virtio_net i2c_piix4 i2c_core sg ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio ata_generic pata_acpi ata_piix dm_mod [last unloaded: speedstep_lib]
Pid: 0, comm: swapper Not tainted 2.6.32-114.0.1.el6.x86_64 #1 KVM
RIP: 0010:[<ffffffff8141f10d>]  [<ffffffff8141f10d>] get_rps_cpu+0x1ad/0x320
RSP: 0018:ffff8800020039f0  EFLAGS: 00010206
RAX: 000000004e3bf418 RBX: ffff88004b985a40 RCX: 00000000dac2accd
RDX: 0000000000002ba1 RSI: 00000000244d77db RDI: ffff88004b496000
RBP: ffff880002003a20 R08: 0000000000000304 R09: 0000000000000054
R10: 0000000000000008 R11: 0000000000000000 R12: ffff880037fc4a80
R13: 00000000b600a8c0 R14: 00000000a0d4ede0 R15: 0000000000000005
FS:  0000000000000000(0000) GS:ffff880002000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000004e3bf418 CR3: 000000004b9b6000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a2d020)
Stack:
 ffff88004b496000 ffff880037fc4a80 ffff88004b496000 0000000000000000
<0> 0000000000000062 ffff88004b9ed000 ffff880002003a60 ffffffff8141f30c
<0> ffff880002003a60 ffffffff81435290 ffff880002003a60 ffff880037fc4a80
Call Trace:
 <IRQ> 
 [<ffffffff8141f30c>] netif_rx+0x8c/0x150
 [<ffffffff81435290>] ? eth_type_trans+0x40/0x140
 [<ffffffff8141f632>] dev_forward_skb+0x122/0x180
 [<ffffffffa03b06d6>] veth_xmit+0x86/0xe0 [veth]
 [<ffffffff8141ae98>] dev_hard_start_xmit+0x2c8/0x3f0
 [<ffffffff8141f31a>] ? netif_rx+0x9a/0x150
 [<ffffffff8143636a>] sch_direct_xmit+0x15a/0x1c0
 [<ffffffff8141e888>] dev_queue_xmit+0x388/0x4d0
 [<ffffffff8143544a>] ? eth_header+0x3a/0xe0
 [<ffffffff81424275>] neigh_resolve_output+0x105/0x370
 [<ffffffff814265b5>] neigh_update+0x305/0x540
 [<ffffffff8147a303>] arp_process+0x323/0x730
 [<ffffffff8141e654>] ? dev_queue_xmit+0x154/0x4d0
 [<ffffffff8147a831>] arp_rcv+0x111/0x140
 [<ffffffff81426b34>] ? neigh_timer_handler+0xf4/0x340
 [<ffffffff8141a4bb>] __netif_receive_skb+0x39b/0x6b0
 [<ffffffff81410982>] ? kfree_skb+0x42/0x90
 [<ffffffff8141a85c>] process_backlog+0x8c/0xf0
 [<ffffffff8141ff63>] net_rx_action+0x103/0x2f0
 [<ffffffff8106f2a7>] __do_softirq+0xb7/0x1e0
 [<ffffffff81092320>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c2cc>] call_softirq+0x1c/0x30
 [<ffffffff8100df05>] do_softirq+0x65/0xa0
 [<ffffffff8106f095>] irq_exit+0x85/0x90
 [<ffffffff814df420>] smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bc93>] apic_timer_interrupt+0x13/0x20
 <EOI> 
 [<ffffffff8103626b>] ? native_safe_halt+0xb/0x10
 [<ffffffff81013e7d>] default_idle+0x4d/0xb0
 [<ffffffff81009e96>] cpu_idle+0xb6/0x110
 [<ffffffff814bf5ca>] rest_init+0x7a/0x80
 [<ffffffff81bbcf23>] start_kernel+0x418/0x424
 [<ffffffff81bbc33a>] x86_64_start_reservations+0x125/0x129
 [<ffffffff81bbc438>] x86_64_start_kernel+0xfa/0x109
Code: 10 85 c0 0f 45 d0 66 41 89 94 24 ba 00 00 00 0f 1f 80 00 00 00 00 48 8b 03 48 85 c0 0f 84 e1 fe ff ff 41 0f b7 94 24 ba 00 00 00 <0f> af 10 c1 ea 10 89 d2 0f b7 44 50 18 48 8b 15 e7 46 1f 00 0f 
RIP  [<ffffffff8141f10d>] get_rps_cpu+0x1ad/0x320
 RSP <ffff8800020039f0>
CR2: 000000004e3bf418
---[ end trace c6ddc13e73ba2349 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G      D    ----------------  2.6.32-114.0.1.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff814d6a5b>] ? panic+0x78/0x141
 [<ffffffff814daab2>] ? oops_end+0xf2/0x100
 [<ffffffff81040a8b>] ? no_context+0xfb/0x260
 [<ffffffff81040d15>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff81012509>] ? sched_clock+0x9/0x10
 [<ffffffff810370d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff81040de3>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff8104149d>] ? __do_page_fault+0x31d/0x480
 [<ffffffff8103620c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff810370d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8105fb80>] ? tg_shares_up+0x0/0x2a0
 [<ffffffff8103620c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff81012509>] ? sched_clock+0x9/0x10
 [<ffffffff810946e5>] ? sched_clock_local+0x25/0x90
 [<ffffffff814dca6e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff814d9e15>] ? page_fault+0x25/0x30
 [<ffffffff8141f10d>] ? get_rps_cpu+0x1ad/0x320
 [<ffffffff8141f30c>] ? netif_rx+0x8c/0x150
 [<ffffffff81435290>] ? eth_type_trans+0x40/0x140
 [<ffffffff8141f632>] ? dev_forward_skb+0x122/0x180
 [<ffffffffa03b06d6>] ? veth_xmit+0x86/0xe0 [veth]
 [<ffffffff8141ae98>] ? dev_hard_start_xmit+0x2c8/0x3f0
 [<ffffffff8141f31a>] ? netif_rx+0x9a/0x150
 [<ffffffff8143636a>] ? sch_direct_xmit+0x15a/0x1c0
 [<ffffffff8141e888>] ? dev_queue_xmit+0x388/0x4d0
 [<ffffffff8143544a>] ? eth_header+0x3a/0xe0
 [<ffffffff81424275>] ? neigh_resolve_output+0x105/0x370
 [<ffffffff814265b5>] ? neigh_update+0x305/0x540
 [<ffffffff8147a303>] ? arp_process+0x323/0x730
 [<ffffffff8141e654>] ? dev_queue_xmit+0x154/0x4d0
 [<ffffffff8147a831>] ? arp_rcv+0x111/0x140
 [<ffffffff81426b34>] ? neigh_timer_handler+0xf4/0x340
 [<ffffffff8141a4bb>] ? __netif_receive_skb+0x39b/0x6b0
 [<ffffffff81410982>] ? kfree_skb+0x42/0x90
 [<ffffffff8141a85c>] ? process_backlog+0x8c/0xf0
 [<ffffffff8141ff63>] ? net_rx_action+0x103/0x2f0
 [<ffffffff8106f2a7>] ? __do_softirq+0xb7/0x1e0
 [<ffffffff81092320>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c2cc>] ? call_softirq+0x1c/0x30
 [<ffffffff8100df05>] ? do_softirq+0x65/0xa0
 [<ffffffff8106f095>] ? irq_exit+0x85/0x90
 [<ffffffff814df420>] ? smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bc93>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff8103626b>] ? native_safe_halt+0xb/0x10
 [<ffffffff81013e7d>] ? default_idle+0x4d/0xb0
 [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
 [<ffffffff814bf5ca>] ? rest_init+0x7a/0x80
 [<ffffffff81bbcf23>] ? start_kernel+0x418/0x424
 [<ffffffff81bbc33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81bbc438>] ? x86_64_start_kernel+0xfa/0x109

Version-Release number of selected component (if applicable):
RHEL 6.1 0210.1 tree
kernels:
114.0.1
115
2.6.32-115.el6.bz676099.x86_64.debug

How reproducible:
Appears to happen during the network setup portion of the LTP containers test (still trying to narrow down exactly where).
Build/install LTP 20100831 and run the containers test (runtest/containers in the LTP build tree).
100% reproducible within about 15 sec. on a KVM guest; have not yet tried on bare metal or other architectures.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
more call traces with different kernels at:
http://pastebin.test.redhat.com/41345

Comment 1 Zhang Kexin 2011-02-16 09:18:59 UTC
I tried the LTP containers test today too, on a bare-metal kernel, and hit 3 kinds of panic or oops.

First kind: I ran the LTP containers test twice:
cd /opt/ltp
./runltp -f containers -l container.log -p
./runltp -f containers -l container.log -p

BUG: scheduling while atomic: swapper/0/0x10000200
Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw k10temp edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]
CPU 1:
Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw k10temp edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-114.0.1.el6.x86_64 #1 Dinar
RIP: 0010:[<ffffffff8103626b>]  [<ffffffff8103626b>] native_safe_halt+0xb/0x10
RSP: 0018:ffff88007db5fed8  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88007db5fed8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff81d0e208
RBP: ffffffff8100bc8e R08: 0000000000000000 R09: 0000000000000000
R10: 00002556d2a4343f R11: 000000010001bba2 R12: ffff880028250f48
R13: ffff88007db5fe68 R14: ffffffff81079473 R15: 000000017db5fe98
FS:  00007f0e0d007700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003d664aaee0 CR3: 0000000001a25000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff81013e7d>] ? default_idle+0x4d/0xb0
 [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
 [<ffffffff814d072e>] ? start_secondary+0x1fc/0x23f
general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/net/veth1/flags
CPU 1 
Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcoInitializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.32-114.0.1.el6.x86_64 (mockbuild.redhat.com) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Thu Feb 10 16:04:24 EST 2011
Command line: ro root=/dev/mapper/vg_amddinar03-lv_root rd_LVM_LV=vg_amddinar03/lv_root rd_LVM_LV=vg_amddinar03/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us console=ttyS0,115200n81 irqpoll maxcpus=1 reset_devices cgroup_disable=memory  memmap=exactmap memmap=640K@0K memmap=131436K@33408K elfcorehdr=164844K memmap=200K$824K memmap=44K#3275264K memmap=8K#3275308K memmap=1484K$3275316K memmap=262144K$3670016K memmap=64K$4173824K memmap=4K$4175872K memmap=1024K$4193280K
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  Centaur CentaurHauls

Second kind:
I ran the LTP containers test 3 times:
cd /opt/ltp
./runltp -f containers -l container.log -p
./runltp -f containers -l container.log -p
./runltp -f containers -l container.log -p

ADDRCONF(NETDEV_UP): veth0: link is not ready
ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
lo: Disabled Privacy Extensions
lo: Disabled Privacy Extensions
ADDRCONF(NETDEV_UP): veth0: link is not ready
ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
lo: Disabled Privacy Extensions
ADDRCONF(NETDEV_UP): veth0: link is not ready
lo: Disabled Privacy Extensions
ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/net/veth1/address
CPU 8 
Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw k10temp edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]

Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw k10temp edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-114.0.1.el6.x86_64 #1 Dinar
RIP: 0010:[<ffffffff8141f10d>]  [<ffffffff8141f10d>] get_rps_cpu+0x1ad/0x320
RSP: 0018:ffff8800822837e0  EFLAGS: 00010282
RAX: b21dc8cfab664864 RBX: ffff880037b81240 RCX: 0000000077e458c4
RDX: 0000000000001230 RSI: 000000008916295d RDI: ffff88007c633000
RBP: ffff880082283810 R08: ffff88013437f49c R09: 0000000000000054
R10: 0000000000000008 R11: 0000000000000000 R12: ffff880134e32080
R13: 00000000b600a8c0 R14: 00000000265b630a R15: 0000000000000005
FS:  00007f17b7d59700(0000) GS:ffff880082280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fffbc476f34 CR3: 0000000134dcf000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
ProceInitializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.32-114.0.1.el6.x86_64 (mockbuild.redhat.com) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Thu Feb 10 16:04:24 EST 2011
Command line: ro root=/dev/mapper/vg_amddinar03-lv_root rd_LVM_LV=vg_amddinar03/lv_root rd_LVM_LV=vg_amddinar03/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us console=ttyS0,115200n81 irqpoll maxcpus=1 reset_devices cgroup_disable=memory  memmap=exactmap memmap=640K@0K memmap=131436K@33408K elfcorehdr=164844K memmap=200K$824K memmap=44K#3275264K memmap=8K#3275308K memmap=1484K$3275316K memmap=262144K$3670016K memmap=64K$4173824K memmap=4K$4175872K memmap=1024K$4193280K
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  Centaur CentaurHauls

Third kind:
I ran the LTP containers test 3 times:

ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
BUG: scheduling while atomic: swapper/0/0x10000200
Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw sg k10temp edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]
CPU 19:
Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw sg k10temp edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-114.0.1.el6.x86_64 #1 Dinar
RIP: 0010:[<ffffffff8103626b>]  [<ffffffff8103626b>] native_safe_halt+0xb/0x10
RSP: 0018:ffff88007d099ed8  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88007d099ed8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff81d0e208
RBP: ffffffff8100bc8e R08: 0000000000000000 R09: 0000000000000000
R10: 000000e9077283d4 R11: 00000001000a8cd2 R12: ffff88013a050f48
R13: ffff88007d099e68 R14: ffffffff81079473 R15: 000000017d099e98
FS:  00007f8532daf700(0000) GS:ffff88013a040000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003d664aaee0 CR3: 0000000001a25000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff81013e7d>] ? default_idle+0x4d/0xb0
 [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
 [<ffffffff814d072e>] ? start_secondary+0x1fc/0x23f
general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:04.0/0000:01:00.1/irq
CPU 19 
Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw sg k10temp edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]

Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw sg k10temp edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-114.0.1.el6.x86_64 #1 Dinar
RIP: 0010:[<ffffffff8141f10d>]  [<ffffffff8141f10d>] get_rps_cpu+0x1ad/0x320
RSP: 0018:ffff88013a0439b0  EFLAGS: 00010282
RAX: a168b58dbbfeab2c RBX: ffff880235510240 RCX: 00000000c016ae26
RDX: 0000000000005672 RSI: 00000000cbd7dc0d RDI: ffff880233556000
RBP: ffff88013a0439e0 R08: ffff8801346ad69c R09: 0000000000000040
R10: 000000000000dd86 R11: 0000000000000000 R12: ffff8801b535a1c0
R13: 000000002548e0ff R14: 00000000e33c3c8b R15: 000000000000000a
FS:  00007f8532daf700(0000) GS:ffff88013a040000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003d664aaee0 CR3: 0000000001a25000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff88007d098000, task ffff8802355beb40)
Stack:
 ffff880233556000 ffff8801b535a1c0 ffff880233556000 0000000000000000
<0> 000000000000004e ffff880234e47000 ffff88013a043a20 ffffffff8141f30c
<0> ffff88013a043a20 ffffffff81435290 ffff88013a043a20 ffff8801b535a1c0
Call Trace:
 <IRQ> 
 [<ffffffff8141f30c>] netif_rx+0x8c/0x150
 [<ffffffff81435290>] ? eth_type_trans+0x40/0x140
 [<ffffffff8141f632>] dev_forward_skb+0x122/0x180
 [<ffffffffa01e16d6>] veth_xmit+0x86/0xe0 [veth]
 [<ffffffff8141ae98>] dev_hard_start_xmit+0x2c8/0x3f0
 [<ffffffff814d9aa9>] ? _write_unlock_bh+0x19/0x20
 [<ffffffff8143636a>] sch_direct_xmit+0x15a/0x1c0
 [<ffffffff8141e888>] dev_queue_xmit+0x388/0x4d0
 [<ffffffff8143544a>] ? eth_header+0x3a/0xe0
 [<ffffffff81424275>] neigh_resolve_output+0x105/0x370
 [<ffffffffa031557c>] ip6_output_finish+0x9c/0x120 [ipv6]
 [<ffffffffa03179eb>] ip6_output2+0x2bb/0x2d0 [ipv6]
 [<ffffffffa0319165>] ip6_output+0x85/0x140 [ipv6]
 [<ffffffffa032b2e2>] ndisc_send_skb+0x312/0x330 [ipv6]
 [<ffffffffa032b361>] __ndisc_send+0x61/0x80 [ipv6]
 [<ffffffffa031cbd0>] ? addrconf_dad_timer+0x0/0x1a0 [ipv6]
 [<ffffffffa032bc12>] ndisc_send_ns+0x72/0xc0 [ipv6]
 [<ffffffffa031cbd0>] ? addrconf_dad_timer+0x0/0x1a0 [ipv6]
 [<ffffffff8107a1c8>] ? add_timer+0x18/0x30
 [<ffffffffa031cce3>] addrconf_dad_timer+0x113/0x1a0 [ipv6]
 [<ffffffff810798d7>] run_timer_softirq+0x197/0x340
 [<ffffffff8109d5d0>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8102982d>] ? lapic_next_event+0x1d/0x30
 [<ffffffff8106f2a7>] __do_softirq+0xb7/0x1e0
 [<ffffffff81092320>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c2cc>] call_softirq+0x1c/0x30
 [<ffffffff8100df05>] do_softirq+0x65/0xa0
 [<ffffffff8106f095>] irq_exit+0x85/0x90
 [<ffffffff814df420>] smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bc93>] apic_timer_interrupt+0x13/0x20
 <EOI> 
 [<ffffffff8103626b>] ? native_safe_halt+0xb/0x10
 [<ffffffff81013e7d>] default_idle+0x4d/0xb0
 [<ffffffff81009e96>] cpu_idle+0xb6/0x110
 [<ffffffff814d072e>] start_secondary+0x1fc/0x23f
Code: 10 85 c0 0f 45 d0 66 41 89 94 24 ba 00 00 00 0f 1f 80 00 00 00 00 48 8b 03 48 85 c0 0f 84 e1 fe ff ff 41 0f b7 94 24 ba 00 00 00 <0f> af 10 c1 ea 10 89 d2 0f b7 44 50 18 48 8b 15 e7 46 1f 00 0f 
RIP  [<ffffffff8141f10d>] get_rps_cpu+0x1ad/0x320
 RSP <ffff88013a0439b0>

Comment 3 Neil Horman 2011-02-16 11:50:37 UTC
Zhang, where can I get a copy of the runltp test so that I can reproduce this myself?

Comment 7 Zhang Kexin 2011-02-17 11:20:03 UTC
Created attachment 479307 [details]
reproducer

Comment 9 Neil Horman 2011-02-17 16:09:24 UTC
Trying to reproduce on a guest, and it's working fine.  Mike has offered to see if he can get it to fail.  Thanks!

Comment 10 Neil Horman 2011-02-21 19:53:28 UTC
Note to self: Looks like we're falling down because the map pointer in get_rps_cpu is getting corrupted.  Need to figure out why.
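
A quick sanity check on the faulting register values supports this. The sketch below is an illustration only, not part of the original analysis: it assumes x86_64 with 48-bit virtual addresses (4-level paging) and takes the RAX values from the three get_rps_cpu oopses above. A canonical but unmapped address produces a page fault, while a non-canonical address produces a general protection fault, which would explain why the same crash site shows up under both headings. The helper names are made up for this example.

```python
# Illustration only: classify the bad pointers seen in RAX at get_rps_cpu+0x1ad.
# Assumes x86_64 with 48-bit virtual addresses (4-level paging).

def is_canonical(addr: int) -> bool:
    """An address is canonical when bits 63..47 are all 0 or all 1."""
    top = addr >> 47            # the upper 17 bits
    return top == 0 or top == 0x1FFFF


def expected_fault(addr: int) -> str:
    """Fault raised when a bad pointer is dereferenced.

    Canonical addresses are assumed to be unmapped here, so they page-fault;
    non-canonical addresses raise a general protection fault instead.
    """
    return "page fault" if is_canonical(addr) else "general protection fault"


# RAX values copied from the oopses in the description and comment 1.
rax_values = {
    "description (KVM guest)": 0x000000004E3BF418,
    "comment 1, second kind": 0xB21DC8CFAB664864,
    "comment 1, third kind": 0xA168B58DBBFEAB2C,
}

for where, rax in rax_values.items():
    print(f"{where}: {rax:#018x} -> {expected_fault(rax)}")
```

The classification lines up with the logs: the guest oops reports "unable to handle kernel paging request", and both bare-metal oopses report "general protection fault".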

Comment 11 Neil Horman 2011-02-22 15:26:49 UTC
Interesting data here:

I instrumented the kernel to print the value of the net device's rps_map in get_rps_cpu, and to indicate when the rps_map value was changed by the sysfs code (which is the only place rps_map should be updated from).  I saw a periodically changing rps_map value, but never saw the printks indicating that sysfs was being called to update it.  Given that I was getting messages like this from get_rps_cpu:

MAP = 2fd78a901c35ed7e
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000


None of which were valid pointer values (causing the reported crash when we dereferenced a member of the rps_map structure), I think we can assume that rps_map is somehow getting corrupted.  

So I modified the netdev_rx_queue structure.  I expanded the rps_map pointer to be an array of pointers 1 page in size, and marked that page as read only.  I validated that, with this change, writing to /sys/class/net/ethX/queues/rx-N/rps_cpus results in this expected crash:
BUG: unable to handle kernel paging request at ffff88003bae0000
IP: [<ffffffff8142f656>] store_rps_map+0x146/0x190
PGD 1a26063 PUD 1a2a063 PMD 3a885063 PTE 3bae0161
Oops: 0003 [#1] SMP 
last sysfs file: /sys/devices/virtual/net/virbr0/queues/rx-0/rps_cpus
CPU 1 
Modules linked in: ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp llc autofs4 sunrpc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log uinput sg virtio_balloon 8139too 8139cp mii snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 i2c_core ext4 mbcache jbd2 sd_mod crc_t10dif virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib]

Modules linked in: ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp llc autofs4 sunrpc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log uinput sg virtio_balloon 8139too 8139cp mii snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 i2c_core ext4 mbcache jbd2 sd_mod crc_t10dif virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib]
Pid: 1839, comm: bash Not tainted 2.6.32 #8 KVM
RIP: 0010:[<ffffffff8142f656>]  [<ffffffff8142f656>] store_rps_map+0x146/0x190
RSP: 0018:ffff88003d52be48  EFLAGS: 00010292
RAX: ffffffff81f416d4 RBX: 0000000000000002 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246
RBP: ffff88003d52be88 R08: 0000000000000000 R09: ffffffff81639ce0
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88003cbd4c80
R13: ffff88003bae0000 R14: 0000000000000000 R15: ffff88003bae1008
FS:  00007fb027c75700(0000) GS:ffff880002100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88003bae0000 CR3: 000000003b91a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 1839, threadinfo ffff88003d52a000, task ffff88003747f540)
Stack:
 ffff88003d52be78 ffffffff8115394a ffff880037a747e0 ffff88003b06e000
<0> ffff880037a747e0 ffff88003d52bf48 ffff88003d56e820 ffffffff81ab0780
<0> ffff88003d52be98 ffffffff8142dc93 ffff88003d52bee8 ffffffff811e3315
Call Trace:
 [<ffffffff8115394a>] ? alloc_pages_current+0x9a/0x100
 [<ffffffff8142dc93>] rx_queue_attr_store+0x23/0x30
 [<ffffffff811e3315>] sysfs_write_file+0xe5/0x170
 [<ffffffff81170b78>] vfs_write+0xb8/0x1a0
 [<ffffffff810d1572>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff811715b1>] sys_write+0x51/0x90
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b


Good. Rebooting and running the LTP containers test then resulted in this crash:


BUG: unable to handle kernel paging request at ffff880039d0a000
IP: [<ffffffff8126b467>] clear_page_c+0x7/0x10
PGD 1a26063 PUD 1a2a063 PMD 3a233063 PTE 39d0a161
Oops: 0003 [#1] SMP 
last sysfs file: /sys/devices/virtual/net/veth0/address
CPU 1 
Modules linked in: veth ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp llc autofs4 sunrpc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log uinput sg virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc 8139too 8139cp mii i2c_piix4 i2c_core ext4 mbcache jbd2 sd_mod crc_t10dif virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib]

Modules linked in: veth ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp llc autofs4 sunrpc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log uinput sg virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc 8139too 8139cp mii i2c_piix4 i2c_core ext4 mbcache jbd2 sd_mod crc_t10dif virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib]
Pid: 2185, comm: net.hotplug Not tainted 2.6.32 #8 KVM
RIP: 0010:[<ffffffff8126b467>]  [<ffffffff8126b467>] clear_page_c+0x7/0x10
RSP: 0018:ffff88003b5b59f8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000200
RDX: ffff880000000000 RSI: 000000000000001b RDI: ffff880039d0a000
RBP: ffff88003b5b5b20 R08: ffffea0000ca5a58 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b5b4000
R13: 0000000000000000 R14: 0000000000ca5a30 R15: ffffea0000ca5a30
FS:  00007f0228dc9700(0000) GS:ffff880002100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff880039d0a000 CR3: 000000003bffc000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process net.hotplug (pid: 2185, threadinfo ffff88003b5b4000, task ffff880037529580)
Stack:
 ffffffff8111e2a1 0000000000000000 ffff880000019710 0000003700000001
<0> ffff88000002ab00 0000000000000000 00000040ffffffff 0000000000000000
<0> ffff88000002ab08 000000020002ab08 0000000000000000 ffff880000019710
Call Trace:
 [<ffffffff8111e2a1>] ? get_page_from_freelist+0x3d1/0x820
 [<ffffffff8111f4f1>] __alloc_pages_nodemask+0x111/0x850
 [<ffffffff8115394a>] alloc_pages_current+0x9a/0x100
 [<ffffffff8111dc7e>] __get_free_pages+0xe/0x50
 [<ffffffff8111dcd6>] get_zeroed_page+0x16/0x20
 [<ffffffff811321b0>] __pmd_alloc+0x30/0xe0
 [<ffffffff81134260>] copy_page_range+0x410/0x480
 [<ffffffff81064946>] dup_mm+0x316/0x520
 [<ffffffff8106591a>] copy_process+0xd5a/0x1300
 [<ffffffff81065f54>] do_fork+0x94/0x480
 [<ffffffff8118d4c2>] ? alloc_fd+0x92/0x160
 [<ffffffff8116da67>] ? fd_install+0x47/0x90
 [<ffffffff810d1572>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81009588>] sys_clone+0x28/0x30
 [<ffffffff8100b493>] stub_clone+0x13/0x20
 [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b


So it seems that the net.hotplug process is somehow causing our corruption.  I'm not sure for the life of me how that's happening, but I've got some more experimenting to do.

Comment 12 Neil Horman 2011-02-22 21:28:05 UTC
Tried running the debug kernel and noted this:
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [<ffffffff8145a63a>](rx_queue_release+0x6a/0x80

So we're definitely getting corruption in that netdev_rx_queue structure, though I still can't quite see why.
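
For readers unfamiliar with slab debugging output, the redzone words can be decoded as in the sketch below. It assumes the RED_INACTIVE and RED_ACTIVE constants from include/linux/poison.h in kernels of this era; the function name is invented for the example. Both words in the report above match the inactive marker, which suggests the allocator considered the object free at that point, with rx_queue_release recorded as the last user.

```python
# Illustration only. Constants assumed from include/linux/poison.h (2.6.32 era).
RED_INACTIVE = 0x09F911029D74E35B   # redzone value while the object is free
RED_ACTIVE = 0xD84156C5635688C0     # redzone value while the object is allocated


def describe_redzone(word: int) -> str:
    """Map a slab redzone word to the object state it encodes."""
    if word == RED_INACTIVE:
        return "inactive (object is free)"
    if word == RED_ACTIVE:
        return "active (object is allocated)"
    return "overwritten (neither marker)"


# The two words from "Redzone: 0x9f911029d74e35b/0x9f911029d74e35b."
for word in (0x9F911029D74E35B, 0x9F911029D74E35B):
    print(f"{word:#018x}: {describe_redzone(word)}")
```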

Comment 13 Neil Horman 2011-02-23 01:06:23 UTC
Hmm, so following up on the debug kernel noting the redzone violation in rx_queue_release, I modified the code to leak the rx queues instead of freeing them.  When I do that (just #if 0-ing out the kfree), this test passes.  I think we may have an RCU race here, but I'm not 100 percent sure.  I do note that the _rx array gets freed in free_netdev upstream, so there may be an upstream change that inadvertently fixed this.

Comment 14 Neil Horman 2011-02-23 21:30:04 UTC
Pretty sure I found the problem.  There's an unbalanced use of the refcounter in the rx queue array, allowing it to be freed while still in use.  I'll have a fix for this shortly.
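
The general failure mode can be sketched with a toy model. This is not the kernel code and not the patch itself; the class, function, and parameter names are hypothetical, and the sketch only shows how a counter that does not match the number of live users lets a shared allocation be freed early.

```python
# Toy model (hypothetical names): a shared array guarded by a reference counter.

class QueueArray:
    def __init__(self, preset_refs: int = 0):
        self.refs = preset_refs   # counter value chosen up front
        self.freed = False

    def get(self) -> None:
        """Take a reference for one user of the array."""
        self.refs += 1

    def put(self) -> None:
        """Drop a reference; the array is freed when the counter hits zero."""
        self.refs -= 1
        if self.refs == 0:
            self.freed = True


def freed_while_in_use(preset_refs: int, objects: int, take_ref_on_create: bool) -> bool:
    """Create `objects` users, release all but one, report an early free."""
    arr = QueueArray(preset_refs)
    for _ in range(objects):
        if take_ref_on_create:
            arr.get()
    live = objects
    for _ in range(objects - 1):
        arr.put()
        live -= 1
    # True means the array was freed although one user is still alive.
    return arr.freed and live > 0


# Unbalanced: counter preset to 1, but two users exist and each will call put().
print(freed_while_in_use(preset_refs=1, objects=2, take_ref_on_create=False))
# Balanced: every user takes its own reference before it can drop one.
print(freed_while_in_use(preset_refs=0, objects=2, take_ref_on_create=True))
```

In the balanced variant every put() is paired with an earlier get(), so the array can only be freed by the last remaining user.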

Comment 15 Neil Horman 2011-02-24 15:28:48 UTC
Created attachment 480787 [details]
backport of commit 4315d834c1496ddca977e9e22002b77c85bfec2c

I've confirmed this fixes the problem.

Comment 16 Brock Organ 2011-03-01 14:43:08 UTC
Reporter,

Could I please ask you to provide a priority assessment (set the priority field to one of urgent/high/medium/low) for the impact of this issue?  This will help us prioritize this issue with our other outstanding bugs for the current release cycle ...

Regards,

Brock

Comment 18 RHEL Program Management 2011-03-01 23:19:54 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 19 Thomas Graf 2011-03-08 20:38:22 UTC
*** Bug 680047 has been marked as a duplicate of this bug. ***

Comment 20 Aristeu Rozanski 2011-03-10 17:58:02 UTC
Patch(es) available on kernel-2.6.32-121.el6

Comment 24 errata-xmlrpc 2011-05-23 20:39:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html

