Bug 669795
Summary: | RHEL Divide-by-zero in find_busiest_group() | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | George Beshers <gbeshers> | ||||||
Component: | kernel | Assignee: | George Beshers <gbeshers> | ||||||
Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> | ||||||
Severity: | urgent | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | 6.2 | CC: | bugzilla, ctatman, gbeshers, jeast, jwest, loriann, lwoodman, martinez, mmilgram, syeghiay, tee, xwzh2008 | ||||||
Target Milestone: | rc | Keywords: | Reopened | ||||||
Target Release: | 6.4 | ||||||||
Hardware: | x86_64 | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2012-07-19 02:13:24 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 782183, 830305, 840683 | ||||||||
Attachments: |
|
Requesting this be an exception for RHEL6.1. The patch eliminates the divide-by-zero conditions that have caused the kernel panic.
Created attachment 481395 [details]
Patch tested with 2.6.32-117
Posted patch.
I will reopen if we see this again --- testing on a 4-rack machine at the end of the week. George

*** This bug has been marked as a duplicate of bug 644903 ***

Not seeing this on snap4.

Reopening. We are seeing this on a system that is about to ship to a customer with Rhel6.1-ga installed. I am investigating what might be special about the machine in manufacturing relative to the test machines.
======================================================
Linux version 2.6.32-131.0.15.el6.x86_64 (mockbuild.bos.redhat.com) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Tue May 10 15:42:40 EDT 2011
ACPI: Preparing to enter system sleep state S5
Disabling non-boot CPUs ...
Broke affinity for irq 112
Broke affinity for irq 4
Broke affinity for irq 105
Broke affinity for irq 18
Broke affinity for irq 106
Broke affinity for irq 21
Broke affinity for irq 107
Broke affinity for irq 24
Broke affinity for irq 108
Broke affinity for irq 109
Broke affinity for irq 110
Broke affinity for irq 97
Broke affinity for irq 98
Broke affinity for irq 99
Broke affinity for irq 100
Broke affinity for irq 101
Broke affinity for irq 102
Broke affinity for irq 103
Broke affinity for irq 104
divide error: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1a.7/usb1/1-4/1-4:1.0/host1/target1:0:0/1:0:0:0/block/sr0/dev
CPU 644
Modules linked in: autofs4 sunrpc ip6t_REJECT ipv6 vfat fat dm_mirror dm_region_hash dm_log numatools(U) xvma(U) uv_mmtimer(U) hwperf(U) uinput microcode ghes hed ixgbe mdio i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma sg igb dca ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif usb_storage mptsas mptscsih mptbase scsi_transport_sas dm_mod [last unloaded: ip_tables]

Modules linked in: autofs4 sunrpc ip6t_REJECT ipv6 vfat fat dm_mirror dm_region_hash dm_log numatools(U) xvma(U) uv_mmtimer(U) hwperf(U) uinput microcode ghes hed ixgbe mdio i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma sg igb dca ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif usb_storage mptsas mptscsih mptbase scsi_transport_sas dm_mod [last unloaded: ip_tables]
Pid: 1936, comm: migration/644 Tainted: G W ---------------- 2.6.32-131.0.15.el6.x86_64 #1 Stoutland Platform
RIP: 0010:[<ffffffff81053b65>] [<ffffffff81053b65>] find_busiest_group+0x5c5/0xb20
RSP: 0018:ffff88087bf0fba0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff88087bf0fdbc RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000300 RDI: 0000000000000000
RBP: ffff88087bf0fd30 R08: ffff8a800e450be0 R09: 0000000000000300
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000015f80 R14: ffffffffffffffff R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8a800e500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f2b233c7098 CR3: 0000000001a25000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process migration/644 (pid: 1936, threadinfo ffff88087bf0e000, task ffff88087bf080c0)
Stack:
 ffff88087bf0fcd0 ffff88087bf0fc40 0000000000000000 0000000000000000
<0> 0000000000000000 ffff88087bf0fda8 0000000000000000 0000028400000002
<0> ffffffff00000000 ffff8a800e5108c0 0000000000000000 0000000000000008
Call Trace:
 [<ffffffff814db693>] thread_return+0x3aa/0x777
 [<ffffffff8100bc8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff810c3690>] ? stop_machine_cpu_stop+0x0/0xe0
 [<ffffffff810c3605>] cpu_stopper_thread+0x125/0x1b0
 [<ffffffff814db337>] ? thread_return+0x4e/0x777
 [<ffffffff8105dc72>] ? default_wake_function+0x12/0x20
 [<ffffffff810c34e0>] ? cpu_stopper_thread+0x0/0x1b0
 [<ffffffff8108ddf6>] kthread+0x96/0xa0
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffff8108dd60>] ? kthread+0x0/0xa0
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

In the case I saw, the customer was running linux-2.6.32-71.7.1.el6.x86_64. As is indicated, that might be fixed by 644903.

Yes, that caught another problem, but the one I reopened for is in 6.1. George

We're also seeing this bug with KVM on RHEL 6.1 and a RHEL 6.1 guest on shutdown.

Halting system...
md: stopping all md devices.
ACPI: Preparing to enter system sleep state S5
Disabling non-boot CPUs ...
BUG: soft lockup - CPU#0 stuck for 67s! [migration/0:5]
Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib]
CPU 0:
Modules linked in: nfs lockd fscache(T) nfs_acl auth_rpcgss autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: speedstep_lib]
Pid: 5, comm: migration/0 Tainted: G ---------------- T 2.6.32-131.0.15.el6.x86_64 #1 KVM
RIP: 0010:[<ffffffff810c36ff>] [<ffffffff810c36ff>] stop_machine_cpu_stop+0x6f/0xe0
RSP: 0018:ffff880198dabdd0 EFLAGS: 00000293
RAX: 0000000000000001 RBX: ffff880198dabdf0 RCX: ffff8800282111e8
RDX: 0000000000000000 RSI: ffff88019752c040 RDI: ffff880197161d28
RBP: ffffffff8100bc8e R08: ffff880198daa000 R09: 0000000000000001
R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff814db337 R14: ffff880198dabdf0 R15: ffff8801936e8f00
FS: 0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fff295baeb0 CR3: 000000019362b000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff810c3690>] ? stop_machine_cpu_stop+0x0/0xe0
 [<ffffffff810c35ba>] ? cpu_stopper_thread+0xda/0x1b0
 [<ffffffff814db337>] ? thread_return+0x4e/0x777
 [<ffffffff8105dc72>] ? default_wake_function+0x12/0x20
 [<ffffffff810c34e0>] ? cpu_stopper_thread+0x0/0x1b0
 [<ffffffff8108ddf6>] ? kthread+0x96/0xa0
 [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
 [<ffffffff8108dd60>] ? kthread+0x0/0xa0
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

I have not seen this on rhel6.3 in all of our testing. Is anyone else still seeing this? Jason? George

George, I found the actual cause of this and fixed it in RHEL6.3. Sorry I didn't find and close this BZ as a DUP of BZ785959, I think you should do that now...

>>>email sent to rhkernel-list by me on 02/23/2012 05:51 PM

[RHEL6.3 Patch] Fix Kernel divide by zero panic in find_busiest_group()
--------------------------------------------------------------------------------
RHEL6 is missing the attached upstream patch that at first glance appears to be just a cleanup. However, this patch changes init_sched_groups_power() to call update_group_power(), which in turn calls update_cpu_power(). The update_cpu_power() function ensures that sched_group->cpu_power is never set to zero, a check that init_sched_groups_power() did not perform. Since find_busiest_group() uses sched_group->cpu_power in the denominator of a fraction, it must not be zero!
--------------------------------------------------------------------------------
static void update_cpu_power(struct sched_domain *sd, int cpu)
{
...
	if (sched_feat(ARCH_POWER))
		power *= arch_scale_freq_power(sd, cpu);
	else
		power *= default_scale_freq_power(sd, cpu);

	power >>= SCHED_LOAD_SHIFT;

	power *= scale_rt_power(cpu);
	power >>= SCHED_LOAD_SHIFT;

>>>	if (!power)
>>>		power = 1;

	sdg->cpu_power = power;
}
--------------------------------------------------------------------------------
The attached patch fixes this panic and BZ785959.
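A minimal userspace sketch of the reasoning in the email above, assuming the division in find_busiest_group() has the form load * SCHED_LOAD_SCALE / cpu_power (the exact expression is not quoted in this report). The names scaled_cpu_power() and group_avg_load() are hypothetical stand-ins, not kernel functions, and the two scale arguments stand in for the results of the frequency and RT scaling calls:

```c
#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/*
 * Hypothetical model of the scaling steps in the excerpt above.
 * freq_scale and rt_scale stand in for the values returned by the
 * frequency and RT scaling helpers (fixed point, 1024 == 1.0).
 */
static unsigned long scaled_cpu_power(unsigned long freq_scale,
                                      unsigned long rt_scale)
{
    unsigned long power = SCHED_LOAD_SCALE;

    power *= freq_scale;
    power >>= SCHED_LOAD_SHIFT;

    power *= rt_scale;
    power >>= SCHED_LOAD_SHIFT;

    /* The clamp highlighted in the email: never hand back zero. */
    if (!power)
        power = 1;

    return power;
}

/*
 * Assumed shape of the consumer: cpu_power is the denominator, so a
 * zero value here would raise a divide error.
 */
static unsigned long group_avg_load(unsigned long group_load,
                                    unsigned long cpu_power)
{
    return (group_load * SCHED_LOAD_SCALE) / cpu_power;
}
```

With an RT scale of 0 the product collapses to 0 before the clamp, so scaled_cpu_power(1024, 0) returns 1 and the later division stays defined.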
Since I can not reproduce this problem I'm building a kernel in brew for the customers to test and verify the fix. I'll respond with the results once they verify it.

rhel6-cpu_power_init.patch

commit d274cb30f4a08045492d3f0c47cdf1a25668b1f5
Author: Peter Zijlstra <a.p.zijlstra>
Date: Thu Apr 7 14:09:43 2011 +0200

    sched: Simplify ->cpu_power initialization

    The code in update_group_power() does what init_sched_groups_power()
    does and more, so remove the special init_ code and call the generic
    code instead.

    Also move the sd->span_weight initialization because
    update_group_power() needs it.

    Signed-off-by: Peter Zijlstra <a.p.zijlstra>
    Cc: Mike Galbraith <efault>
    Cc: Nick Piggin <npiggin>
    Cc: Linus Torvalds <torvalds>
    Cc: Andrew Morton <akpm>
    Link: http://lkml.kernel.org/r/20110407122941.875856012@chello.nl
    Signed-off-by: Ingo Molnar <mingo>

diff --git a/kernel/sched.c b/kernel/sched.c
index 3ce2ab6..071cf49 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8618,9 +8618,6 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 	struct rq *rq = cpu_rq(cpu);
 	struct sched_domain *tmp;
 
-	for (tmp = sd; tmp; tmp = tmp->parent)
-		tmp->span_weight = cpumask_weight(sched_domain_span(tmp));
-
 	/* Remove the sched domains which do not contribute to scheduling. */
 	for (tmp = sd; tmp; ) {
 		struct sched_domain *parent = tmp->parent;
@@ -9098,46 +9095,12 @@ static void free_sched_groups(const struct cpumask *cpu_map,
  */
 static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 {
-	struct sched_domain *child;
-	struct sched_group *group;
-	long power;
-	int weight;
-
 	WARN_ON(!sd || !sd->groups);
 
 	if (cpu != group_first_cpu(sd->groups))
 		return;
 
-	child = sd->child;
-
-	sd->groups->cpu_power = 0;
-
-	if (!child) {
-		power = SCHED_LOAD_SCALE;
-		weight = cpumask_weight(sched_domain_span(sd));
-		/*
-		 * SMT siblings share the power of a single core.
-		 * Usually multiple threads get a better yield out of
-		 * that one core than a single thread would have,
-		 * reflect that in sd->smt_gain.
-		 */
-		if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
-			power *= sd->smt_gain;
-			power /= weight;
-			power >>= SCHED_LOAD_SHIFT;
-		}
-		sd->groups->cpu_power += power;
-		return;
-	}
-
-	/*
-	 * Add cpu_power of each child group to this groups cpu_power.
-	 */
-	group = child->groups;
-	do {
-		sd->groups->cpu_power += group->cpu_power;
-		group = group->next;
-	} while (group != child->groups);
+	update_group_power(sd, cpu);
 }
 
 /*
@@ -9444,7 +9407,7 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 {
 	enum s_alloc alloc_state = sa_none;
 	struct s_data d;
-	struct sched_domain *sd;
+	struct sched_domain *sd, *tmp;
 	int i;
 #ifdef CONFIG_NUMA
 	d.sd_allnodes = 0;
@@ -9467,6 +9430,9 @@ static int __build_sched_domains(const struct cpumask *cpu_map,
 		sd = __build_book_sched_domain(&d, cpu_map, attr, sd, i);
 		sd = __build_mc_sched_domain(&d, cpu_map, attr, sd, i);
 		sd = __build_smt_sched_domain(&d, cpu_map, attr, sd, i);
+
+		for (tmp = sd; tmp; tmp = tmp->parent)
+			tmp->span_weight = cpumask_weight(sched_domain_span(tmp));
 	}
 
 	for_each_cpu(i, cpu_map) {

John or Larry, I can't access 784959 so I can't close this as a duplicate. I agree it should be closed.

*** This bug has been marked as a duplicate of bug 784959 *** |
Created attachment 473574 [details]
Patch -- offsets when applied to 2.6.32-99

Description of problem:

Case 00402323

When we did an init 0 of a UV1000 system to shut it down, we hit a divide-by-zero.

[root@UV00000140-P000 ~]# init 0
[root@UV00000140-P000 ~]# initctl: Event failed
Shutting down Avahi daemon: [ OK ]
Stopping atd: [ OK ]
Stopping availmon: [ OK ]
...
ACPI: Preparing to enter system sleep state S5
Disabling non-boot CPUs ...
Broke affinity for irq 4
Broke affinity for irq 24
Broke affinity for irq 70
Broke affinity for irq 71
Broke affinity for irq 72
divide error: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0/host0/target0:1:0/0:1:0:0/block/sda/dev
CPU 194
Modules linked in: sit tunnel4 autofs4 sunrpc ipt_REJECT ip6t_REJECT ipv6 vfat fat dm_mirror dm_region_hash dm_log numatools(U) xpmem(U) xp gru(U) xvma(U) hwperf(U) uv_mmtimer(U) uinput sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma igb dca ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mod [last unloaded: nf_conntrack]

Modules linked in: sit tunnel4 autofs4 sunrpc ipt_REJECT ip6t_REJECT ipv6 vfat fat dm_mirror dm_region_hash dm_log numatools(U) xpmem(U) xp gru(U) xvma(U) hwperf(U) uv_mmtimer(U) uinput sg i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma igb dca ext4 mbcache jbd2 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas dm_mod [last unloaded: nf_conntrack]
Pid: 32704, comm: kstop/194 Not tainted 2.6.32-71.el6.x86_64 #1 Stoutland Platform
RIP: 0010:[<ffffffff81062e9c>] [<ffffffff81062e9c>] find_busiest_group+0x56c/0xb40
RSP: 0018:ffff881076c2dbf0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff881076c2ddfc RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00000000000001c0 RDI: 0000000000000000
RBP: ffff881076c2dd70 R08: ffff89801c431b68 R09: 00000000000001c0
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000016980 R14: ffffffffffffffff R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff89801c440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f4f1b2f2d80 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kstop/194 (pid: 32704, threadinfo ffff881076c2c000, task ffff8810761aaaf0)
Stack:
 ffff881076c2dd10 ffff881076c2dc80 ffff89801c451908 ffff881076c2dde8
<0> 00ff881076c2dda0 00ffffff00000002 000000c200000000 00000000ffffffff
<0> ffff89801c451800 0000000000000008 0000000000016980 0000000000016980
Call Trace:
 [<ffffffff814c864a>] thread_return+0x412/0x778
 [<ffffffff81013c8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff810c5fe0>] ? stop_cpu+0x0/0xf0
 [<ffffffff8108c69c>] worker_thread+0x1fc/0x2a0
 [<ffffffff81091ca0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8108c4a0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81091936>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
Code: ff c7 85 bc fe ff ff 01 00 00 00 e9 8d fc ff ff 0f 1f 80 00 00 00 00 48 8b 95 e0 fe ff ff 48 8b 45 a8 8b 4a 08 48 c1 e0 0a 31 d2 <48> f7 f1 48 8b 4d b0 48 89 45 a0 31 c0 48 85 c9 74 0c 48 8b 45
RIP [<ffffffff81062e9c>] find_busiest_group+0x56c/0xb40
 RSP <ffff881076c2dbf0>
---[ end trace 7ba40d9e8ca26a63 ]---

Lots of threads took the same failure.

We looked into the problem and found that CPUs are disabled as part of shutdown. This exposes a race condition in the load balancing code: if the OS tries to load balance a domain that has no CPUs, it divides by zero.

We have a patch to fix this bug in RHEL6. It guards against the divide by 0 and the page faults that we've seen.

The current community sched.c code is very different from RHEL6's. Our fix changes a function and a structure which no longer exist in the community code. There are no divides by cpu_power in 2.6.37.
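The failure mode described above can be sketched in userspace C as follows. This is only an illustration of guarding the divide at the point of use, not the attached patch. struct group_model, LOAD_SCALE and group_avg_load_checked() are hypothetical names, and the formula load * scale / cpu_power is an assumption, chosen because it matches the shift-left-by-10 followed by div at the faulting instruction in the Code: line, with RCX (the divisor) equal to zero:

```c
#include <stdbool.h>

#define LOAD_SCALE 1024UL /* assumed fixed-point scale, 1 << 10 */

/* Hypothetical stand-in for the per-group scheduler state. */
struct group_model {
    unsigned long load;      /* summed load of the group's CPUs */
    unsigned long cpu_power; /* 0 models a group left with no CPUs */
};

/*
 * Compute the scaled average load only when the divisor is usable.
 * Returns false and leaves *avg untouched when cpu_power is zero, so
 * the caller can skip the group instead of trapping.
 */
static bool group_avg_load_checked(const struct group_model *g,
                                   unsigned long *avg)
{
    if (g->cpu_power == 0)
        return false;

    *avg = (g->load * LOAD_SCALE) / g->cpu_power;
    return true;
}
```

A caller walking the groups of a domain during CPU offlining would treat a false return as "nothing to balance here" and move on to the next group.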