Bug 677786
| Field | Value |
|---|---|
| Summary | Panic in get_rps_cpu+0x1ad/0x320 on kvm guest when attempting to run LTP containers test. |
| Product | Red Hat Enterprise Linux 6 |
| Component | kernel |
| Version | 6.1 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | Mike Gahagan <mgahagan> |
| Assignee | Neil Horman <nhorman> |
| QA Contact | Red Hat Kernel QE team <kernel-qe> |
| CC | kzhang, yimwang |
| Target Milestone | rc |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | kernel-2.6.32-121.el6 |
| Doc Type | Bug Fix |
| Last Closed | 2011-05-23 20:39:34 UTC |

Description
Mike Gahagan
2011-02-15 20:39:56 UTC
I also tried the LTP containers test today, on a bare-metal kernel, and hit three kinds of panic or oops.

First kind: I ran the LTP containers test twice:

cd /opt/ltp
./runltp -f containers -l container.log -p
./runltp -f containers -l container.log -p

BUG: scheduling while atomic: swapper/0/0x10000200
Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw k10temp edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]
CPU 1:
Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw k10temp edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-114.0.1.el6.x86_64 #1 Dinar
RIP: 0010:[<ffffffff8103626b>] [<ffffffff8103626b>] native_safe_halt+0xb/0x10
RSP: 0018:ffff88007db5fed8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88007db5fed8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff81d0e208
RBP: ffffffff8100bc8e R08: 0000000000000000 R09: 0000000000000000
R10: 00002556d2a4343f R11: 000000010001bba2 R12: ffff880028250f48
R13: ffff88007db5fe68 R14: ffffffff81079473 R15: 000000017db5fe98
FS: 00007f0e0d007700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003d664aaee0 CR3: 0000000001a25000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
[<ffffffff81013e7d>] ? default_idle+0x4d/0xb0
[<ffffffff81009e96>] ?
cpu_idle+0xb6/0x110 [<ffffffff814d072e>] ? start_secondary+0x1fc/0x23f general protection fault: 0000 [#1] SMP last sysfs file: /sys/devices/virtual/net/veth1/flags CPU 1 Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcoInitializing cgroup subsys cpuset Initializing cgroup subsys cpu Linux version 2.6.32-114.0.1.el6.x86_64 (mockbuild.redhat.com) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Thu Feb 10 16:04:24 EST 2011 Command line: ro root=/dev/mapper/vg_amddinar03-lv_root rd_LVM_LV=vg_amddinar03/lv_root rd_LVM_LV=vg_amddinar03/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us console=ttyS0,115200n81 irqpoll maxcpus=1 reset_devices cgroup_disable=memory memmap=exactmap memmap=640K@0K memmap=131436K@33408K elfcorehdr=164844K memmap=200K$824K memmap=44K#3275264K memmap=8K#3275308K memmap=1484K$3275316K memmap=262144K$3670016K memmap=64K$4173824K memmap=4K$4175872K memmap=1024K$4193280K KERNEL supported cpus: Intel GenuineIntel AMD AuthenticAMD Centaur CentaurHauls second kind: run 3 times of ltp container test: cd /opt/ltp ./runltp -f containers -l container.log -p ./runltp -f containers -l container.log -p ./runltp -f containers -l container.log -p ADDRCONF(NETDEV_UP): veth0: link is not ready ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready lo: Disabled Privacy Extensions lo: Disabled Privacy Extensions ADDRCONF(NETDEV_UP): veth0: link is not ready ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready lo: Disabled Privacy Extensions ADDRCONF(NETDEV_UP): veth0: link is not ready lo: Disabled Privacy Extensions ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready general protection fault: 0000 [#1] SMP last sysfs file: /sys/devices/virtual/net/veth1/address CPU 8 Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw k10temp edac_core edac_mce_amd sg i2c_piix4 
shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan] Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw k10temp edac_core edac_mce_amd sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan] Pid: 0, comm: swapper Not tainted 2.6.32-114.0.1.el6.x86_64 #1 Dinar RIP: 0010:[<ffffffff8141f10d>] [<ffffffff8141f10d>] get_rps_cpu+0x1ad/0x320 RSP: 0018:ffff8800822837e0 EFLAGS: 00010282 RAX: b21dc8cfab664864 RBX: ffff880037b81240 RCX: 0000000077e458c4 RDX: 0000000000001230 RSI: 000000008916295d RDI: ffff88007c633000 RBP: ffff880082283810 R08: ffff88013437f49c R09: 0000000000000054 R10: 0000000000000008 R11: 0000000000000000 R12: ffff880134e32080 R13: 00000000b600a8c0 R14: 00000000265b630a R15: 0000000000000005 FS: 00007f17b7d59700(0000) GS:ffff880082280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007fffbc476f34 CR3: 0000000134dcf000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 ProceInitializing cgroup subsys cpuset Initializing cgroup subsys cpu Linux version 2.6.32-114.0.1.el6.x86_64 (mockbuild.redhat.com) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Thu Feb 10 16:04:24 EST 2011 Command line: ro root=/dev/mapper/vg_amddinar03-lv_root rd_LVM_LV=vg_amddinar03/lv_root rd_LVM_LV=vg_amddinar03/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us console=ttyS0,115200n81 irqpoll maxcpus=1 reset_devices cgroup_disable=memory memmap=exactmap memmap=640K@0K memmap=131436K@33408K elfcorehdr=164844K memmap=200K$824K 
memmap=44K#3275264K memmap=8K#3275308K memmap=1484K$3275316K memmap=262144K$3670016K memmap=64K$4173824K memmap=4K$4175872K memmap=1024K$4193280K KERNEL supported cpus: Intel GenuineIntel AMD AuthenticAMD Centaur CentaurHauls Third time: run ltp container test for 3 times: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready BUG: scheduling while atomic: swapper/0/0x10000200 Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw sg k10temp edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan] CPU 19: Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw sg k10temp edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan] Pid: 0, comm: swapper Not tainted 2.6.32-114.0.1.el6.x86_64 #1 Dinar RIP: 0010:[<ffffffff8103626b>] [<ffffffff8103626b>] native_safe_halt+0xb/0x10 RSP: 0018:ffff88007d099ed8 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff88007d099ed8 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff81d0e208 RBP: ffffffff8100bc8e R08: 0000000000000000 R09: 0000000000000000 R10: 000000e9077283d4 R11: 00000001000a8cd2 R12: ffff88013a050f48 R13: ffff88007d099e68 R14: ffffffff81079473 R15: 000000017d099e98 FS: 00007f8532daf700(0000) GS:ffff88013a040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000003d664aaee0 CR3: 0000000001a25000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: [<ffffffff81013e7d>] 
? default_idle+0x4d/0xb0 [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110 [<ffffffff814d072e>] ? start_secondary+0x1fc/0x23f general protection fault: 0000 [#1] SMP last sysfs file: /sys/devices/pci0000:00/0000:00:04.0/0000:01:00.1/irq CPU 19 Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw sg k10temp edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan] Modules linked in: veth sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log bnx2 microcode serio_raw sg k10temp edac_core edac_mce_amd i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi pata_atiixp ahci radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mod [last unloaded: scsi_wait_scan] Pid: 0, comm: swapper Not tainted 2.6.32-114.0.1.el6.x86_64 #1 Dinar RIP: 0010:[<ffffffff8141f10d>] [<ffffffff8141f10d>] get_rps_cpu+0x1ad/0x320 RSP: 0018:ffff88013a0439b0 EFLAGS: 00010282 RAX: a168b58dbbfeab2c RBX: ffff880235510240 RCX: 00000000c016ae26 RDX: 0000000000005672 RSI: 00000000cbd7dc0d RDI: ffff880233556000 RBP: ffff88013a0439e0 R08: ffff8801346ad69c R09: 0000000000000040 R10: 000000000000dd86 R11: 0000000000000000 R12: ffff8801b535a1c0 R13: 000000002548e0ff R14: 00000000e33c3c8b R15: 000000000000000a FS: 00007f8532daf700(0000) GS:ffff88013a040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000003d664aaee0 CR3: 0000000001a25000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffff88007d098000, task ffff8802355beb40) Stack: ffff880233556000 ffff8801b535a1c0 ffff880233556000 0000000000000000 <0> 000000000000004e 
ffff880234e47000 ffff88013a043a20 ffffffff8141f30c <0> ffff88013a043a20 ffffffff81435290 ffff88013a043a20 ffff8801b535a1c0 Call Trace: <IRQ> [<ffffffff8141f30c>] netif_rx+0x8c/0x150 [<ffffffff81435290>] ? eth_type_trans+0x40/0x140 [<ffffffff8141f632>] dev_forward_skb+0x122/0x180 [<ffffffffa01e16d6>] veth_xmit+0x86/0xe0 [veth] [<ffffffff8141ae98>] dev_hard_start_xmit+0x2c8/0x3f0 [<ffffffff814d9aa9>] ? _write_unlock_bh+0x19/0x20 [<ffffffff8143636a>] sch_direct_xmit+0x15a/0x1c0 [<ffffffff8141e888>] dev_queue_xmit+0x388/0x4d0 [<ffffffff8143544a>] ? eth_header+0x3a/0xe0 [<ffffffff81424275>] neigh_resolve_output+0x105/0x370 [<ffffffffa031557c>] ip6_output_finish+0x9c/0x120 [ipv6] [<ffffffffa03179eb>] ip6_output2+0x2bb/0x2d0 [ipv6] [<ffffffffa0319165>] ip6_output+0x85/0x140 [ipv6] [<ffffffffa032b2e2>] ndisc_send_skb+0x312/0x330 [ipv6] [<ffffffffa032b361>] __ndisc_send+0x61/0x80 [ipv6] [<ffffffffa031cbd0>] ? addrconf_dad_timer+0x0/0x1a0 [ipv6] [<ffffffffa032bc12>] ndisc_send_ns+0x72/0xc0 [ipv6] [<ffffffffa031cbd0>] ? addrconf_dad_timer+0x0/0x1a0 [ipv6] [<ffffffff8107a1c8>] ? add_timer+0x18/0x30 [<ffffffffa031cce3>] addrconf_dad_timer+0x113/0x1a0 [ipv6] [<ffffffff810798d7>] run_timer_softirq+0x197/0x340 [<ffffffff8109d5d0>] ? tick_sched_timer+0x0/0xc0 [<ffffffff8102982d>] ? lapic_next_event+0x1d/0x30 [<ffffffff8106f2a7>] __do_softirq+0xb7/0x1e0 [<ffffffff81092320>] ? hrtimer_interrupt+0x140/0x250 [<ffffffff8100c2cc>] call_softirq+0x1c/0x30 [<ffffffff8100df05>] do_softirq+0x65/0xa0 [<ffffffff8106f095>] irq_exit+0x85/0x90 [<ffffffff814df420>] smp_apic_timer_interrupt+0x70/0x9b [<ffffffff8100bc93>] apic_timer_interrupt+0x13/0x20 <EOI> [<ffffffff8103626b>] ? 
native_safe_halt+0xb/0x10
[<ffffffff81013e7d>] default_idle+0x4d/0xb0
[<ffffffff81009e96>] cpu_idle+0xb6/0x110
[<ffffffff814d072e>] start_secondary+0x1fc/0x23f
Code: 10 85 c0 0f 45 d0 66 41 89 94 24 ba 00 00 00 0f 1f 80 00 00 00 00 48 8b 03 48 85 c0 0f 84 e1 fe ff ff 41 0f b7 94 24 ba 00 00 00 <0f> af 10 c1 ea 10 89 d2 0f b7 44 50 18 48 8b 15 e7 46 1f 00 0f
RIP [<ffffffff8141f10d>] get_rps_cpu+0x1ad/0x320
RSP <ffff88013a0439b0>

zhang, where can I get a copy of the runltp test so that I can reproduce this myself?

Created attachment 479307: reproducer
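
For readers following the trace: the faulting bytes in the Code line (`<0f> af 10`) appear to decode as `imul (%rax),%edx`, and RAX holds a168b58dbbfeab2c in that dump. On x86_64 with 48-bit virtual addresses, an address is canonical only if bits 63 through 47 are all equal, and dereferencing a non-canonical address raises a general protection fault rather than a page fault. A minimal userspace sketch of that check follows; the name `is_canonical48` is made up for illustration, and 4-level paging is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a kernel API): true if addr is canonical under
 * 48-bit virtual addressing, i.e. bits 63..47 are all zero or all one. */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;   /* the 17 bits 63..47 */

    return top == 0 || top == 0x1ffff;
}
```

With the values from this report, `is_canonical48(0xa168b58dbbfeab2cULL)` is false, while `is_canonical48(0xffff880233556000ULL)` (RDI in the same dump) is true. Canonical does not mean valid: a stale pointer to freed memory can still pass this check.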
Trying to reproduce on a guest, and it's working fine. Mike has offered to see if he can get it to fail. Thanks!

Note to self: Looks like we're falling down because the map pointer in get_rps_cpu is getting corrupted. Need to figure out why.

Interesting data here: I instrumented the kernel to print the value of the net device's rps_map in get_rps_cpu, and to indicate when the rps_map value was being changed by the sysfs code (which is the only place that rps_map should be updated from). I saw a periodically changing rps_map value, but never saw the printks indicating that sysfs was being called to update it. Given that I was getting messages like this from get_rps_cpu:

MAP = 2fd78a901c35ed7e
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000
MAP = ffff88003b419000

none of which were valid pointer values (causing the reported crash when we dereferenced a member of the rps_map structure), I think we can assume that rps_map is somehow getting corrupted.

So I modified the netdev_rx_queue structure. I expanded the rps_map pointer to be an array of pointers 1 page in size, and marked that page as read only.
I validated that, with this change, writing to /sys/class/net/ethX/queues/rx-N/rps_cpus results in this expected crash: BUG: unable to handle kernel paging request at ffff88003bae0000 IP: [<ffffffff8142f656>] store_rps_map+0x146/0x190 PGD 1a26063 PUD 1a2a063 PMD 3a885063 PTE 3bae0161 Oops: 0003 [#1] SMP last sysfs file: /sys/devices/virtual/net/virbr0/queues/rx-0/rps_cpus CPU 1 Modules linked in: ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp llc autofs4 sunrpc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log uinput sg virtio_balloon 8139too 8139cp mii snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 i2c_core ext4 mbcache jbd2 sd_mod crc_t10dif virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib] Modules linked in: ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp llc autofs4 sunrpc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log uinput sg virtio_balloon 8139too 8139cp mii snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc i2c_piix4 i2c_core ext4 mbcache jbd2 sd_mod crc_t10dif virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib] Pid: 1839, comm: bash Not tainted 2.6.32 #8 KVM RIP: 0010:[<ffffffff8142f656>] [<ffffffff8142f656>] store_rps_map+0x146/0x190 RSP: 0018:ffff88003d52be48 EFLAGS: 00010292 RAX: ffffffff81f416d4 RBX: 0000000000000002 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246 RBP: ffff88003d52be88 
R08: 0000000000000000 R09: ffffffff81639ce0 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88003cbd4c80 R13: ffff88003bae0000 R14: 0000000000000000 R15: ffff88003bae1008 FS: 00007fb027c75700(0000) GS:ffff880002100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88003bae0000 CR3: 000000003b91a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process bash (pid: 1839, threadinfo ffff88003d52a000, task ffff88003747f540) Stack: ffff88003d52be78 ffffffff8115394a ffff880037a747e0 ffff88003b06e000 <0> ffff880037a747e0 ffff88003d52bf48 ffff88003d56e820 ffffffff81ab0780 <0> ffff88003d52be98 ffffffff8142dc93 ffff88003d52bee8 ffffffff811e3315 Call Trace: [<ffffffff8115394a>] ? alloc_pages_current+0x9a/0x100 [<ffffffff8142dc93>] rx_queue_attr_store+0x23/0x30 [<ffffffff811e3315>] sysfs_write_file+0xe5/0x170 [<ffffffff81170b78>] vfs_write+0xb8/0x1a0 [<ffffffff810d1572>] ? 
audit_syscall_entry+0x272/0x2a0 [<ffffffff811715b1>] sys_write+0x51/0x90 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b Good, so, rebooting and running the LTC containers test then resulted in this crash: BUG: unable to handle kernel paging request at ffff880039d0a000 IP: [<ffffffff8126b467>] clear_page_c+0x7/0x10 PGD 1a26063 PUD 1a2a063 PMD 3a233063 PTE 39d0a161 Oops: 0003 [#1] SMP last sysfs file: /sys/devices/virtual/net/veth0/address CPU 1 Modules linked in: veth ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp llc autofs4 sunrpc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log uinput sg virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc 8139too 8139cp mii i2c_piix4 i2c_core ext4 mbcache jbd2 sd_mod crc_t10dif virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib] Modules linked in: veth ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp llc autofs4 sunrpc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log uinput sg virtio_balloon snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc 8139too 8139cp mii i2c_piix4 i2c_core ext4 mbcache jbd2 sd_mod crc_t10dif virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib] Pid: 2185, comm: net.hotplug Not tainted 2.6.32 #8 KVM RIP: 0010:[<ffffffff8126b467>] [<ffffffff8126b467>] clear_page_c+0x7/0x10 RSP: 0018:ffff88003b5b59f8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000200 
RDX: ffff880000000000 RSI: 000000000000001b RDI: ffff880039d0a000 RBP: ffff88003b5b5b20 R08: ffffea0000ca5a58 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b5b4000 R13: 0000000000000000 R14: 0000000000ca5a30 R15: ffffea0000ca5a30 FS: 00007f0228dc9700(0000) GS:ffff880002100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff880039d0a000 CR3: 000000003bffc000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process net.hotplug (pid: 2185, threadinfo ffff88003b5b4000, task ffff880037529580) Stack: ffffffff8111e2a1 0000000000000000 ffff880000019710 0000003700000001 <0> ffff88000002ab00 0000000000000000 00000040ffffffff 0000000000000000 <0> ffff88000002ab08 000000020002ab08 0000000000000000 ffff880000019710 Call Trace: [<ffffffff8111e2a1>] ? get_page_from_freelist+0x3d1/0x820 [<ffffffff8111f4f1>] __alloc_pages_nodemask+0x111/0x850 [<ffffffff8115394a>] alloc_pages_current+0x9a/0x100 [<ffffffff8111dc7e>] __get_free_pages+0xe/0x50 [<ffffffff8111dcd6>] get_zeroed_page+0x16/0x20 [<ffffffff811321b0>] __pmd_alloc+0x30/0xe0 [<ffffffff81134260>] copy_page_range+0x410/0x480 [<ffffffff81064946>] dup_mm+0x316/0x520 [<ffffffff8106591a>] copy_process+0xd5a/0x1300 [<ffffffff81065f54>] do_fork+0x94/0x480 [<ffffffff8118d4c2>] ? alloc_fd+0x92/0x160 [<ffffffff8116da67>] ? fd_install+0x47/0x90 [<ffffffff810d1572>] ? audit_syscall_entry+0x272/0x2a0 [<ffffffff81009588>] sys_clone+0x28/0x30 [<ffffffff8100b493>] stub_clone+0x13/0x20 [<ffffffff8100b172>] ? system_call_fastpath+0x16/0x1b so it seems to me that the net.hotplug process seems to somehow be causing our corruption. I'm not sure for the life of me how thats happening, but I've got some more experimenting to do. Tried running the debug kernel and noted this: Redzone: 0x9f911029d74e35b/0x9f911029d74e35b. 
Last user: [<ffffffff8145a63a>](rx_queue_release+0x6a/0x80)

So we're definitely getting corruption in that netdev_rx_queue structure, though I still can't quite see why.

Hmm, so following up on the debug kernel noting the redzone violation in rx_queue_release, I modified the code to let the rx queues leak instead of freeing them. When I do that (just #if 0-ing out the kfree), this test passes. I think we may have an RCU race here, but I'm not 100 percent sure. I do note that the _rx array gets freed in free_netdev upstream, so there may be an upstream change that inadvertently fixed this.

Pretty sure I found the problem. There's an unbalanced use of the refcounter in the rx queue array, allowing it to be freed while still in use. I'll have a fix for this shortly.

Created attachment 480787: backport of commit 4315d834c1496ddca977e9e22002b77c85bfec2c

I've confirmed this fixes the problem.
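
To illustrate the general failure mode described above (a shared block freed while a holder still uses it), here is a minimal userspace sketch of balanced reference counting. It is not the backported patch: `struct queue_block` and its helpers are hypothetical names, and the plain `int` counter stands in for the atomic counter a kernel would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a block of rx queues shared by several users. */
struct queue_block {
    int refs;      /* number of holders; a kernel would use an atomic type */
    int nqueues;
};

static struct queue_block *queue_block_alloc(int nqueues)
{
    struct queue_block *qb = calloc(1, sizeof(*qb));

    if (!qb)
        return NULL;
    qb->refs = 1;          /* the allocator holds the first reference */
    qb->nqueues = nqueues;
    return qb;
}

/* Every additional holder must take its own reference. */
static void queue_block_get(struct queue_block *qb)
{
    qb->refs++;
}

/* Drops one reference; frees the block and returns true on the last put. */
static bool queue_block_put(struct queue_block *qb)
{
    assert(qb->refs > 0);  /* catches a put with no matching get */
    if (--qb->refs == 0) {
        free(qb);
        return true;
    }
    return false;
}
```

If a holder skips `queue_block_get` but still calls `queue_block_put`, the count reaches zero early and the block is freed while another holder is using it, which would match the symptoms in this report: a pointer inside the freed structure reading back as garbage.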
Reporter, could I please ask you to provide a priority assessment (set the priority field to one of urgent/high/medium/low) for the impact of this issue? This will help us prioritize this issue with our other outstanding bugs for the current release cycle ... Regards, Brock

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

*** Bug 680047 has been marked as a duplicate of this bug. ***

Patch(es) available on kernel-2.6.32-121.el6

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html