Red Hat Bugzilla – Bug 667340
kexec: Make sure to stop all CPUs before exiting the kernel
Last modified: 2011-08-30 03:26:00 EDT
Before starting the new kernel, kexec calls machine_shutdown to stop all the CPUs, which internally calls native_smp_send_stop. kexec expects all the CPUs to be halted once that call returns. However, native_smp_send_stop assumes that all the processors have processed the REBOOT IPI within 1 second of it being sent. In the kexec case, the BSP can therefore be starting the new kernel while the APs are still processing the REBOOT IPI. In a virtualized environment with a heavily overcommitted host, VCPUs can fail to process the IPI within the allotted 1 second. As a result, the APs end up accessing uninitialized state (the BSP has already gone ahead with setting up the new state) and cause GPFs. kexec expects machine_shutdown to return only after all CPUs are stopped. Patch 76fac077 waits for the CPUs to stop in all cases except panic/kdump, where we expect things to be broken and are doing our best to make things work anyway. Patch 31e323cc is also needed, to avoid breaking Xen. The two patches are in upstream 2.6.32.26 and 2.6.32.27, respectively.
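The race can be illustrated with a small user-space model. This is only a sketch, not kernel code: threads stand in for APs, a sleep stands in for a descheduled VCPU reacting late to the stop IPI, and the name simulate_stop and its parameters are hypothetical. A bounded wait (the 1 second, scaled down here) can return while APs are still running; an unbounded wait cannot.

```python
import threading
import time


def simulate_stop(ap_delays, timeout=None):
    """Return how many simulated APs are still running when the BSP stops waiting.

    ap_delays: seconds each simulated AP needs before it handles the stop IPI.
    timeout:   seconds the BSP is willing to wait; None means wait for all APs.
    """
    stopped = [threading.Event() for _ in ap_delays]

    def ap(delay, done):
        time.sleep(delay)  # models a VCPU that is slow to process the IPI
        done.set()         # the AP has halted

    threads = [threading.Thread(target=ap, args=(d, e))
               for d, e in zip(ap_delays, stopped)]
    for t in threads:
        t.start()

    deadline = None if timeout is None else time.monotonic() + timeout
    while not all(e.is_set() for e in stopped):
        if deadline is not None and time.monotonic() >= deadline:
            break          # bounded wait gives up; APs may still be running
        time.sleep(0.001)

    still_running = sum(not e.is_set() for e in stopped)
    for t in threads:
        t.join()           # let the simulation finish cleanly
    return still_running
```

With slow APs, simulate_stop([0.2, 0.2, 0.2], timeout=0.05) reports APs still running when the wait ends, which corresponds to the window in which the BSP would already be setting up the new kernel. Passing timeout=None models the patched behaviour for the non-panic case.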
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
You can try doing kexec in a RHEL6 guest (either Xen HVM on a RHEL5 host or KVM) while the host is under heavy load. Separately, you can try rebooting a RHEL6 Xen PV guest on a RHEL5 host to ensure that there is no regression.
I have tried several times. I used usex to generate some load on the host, then loaded the new kernel with kexec -l and jumped to it with reboot on the RHEL6 KVM guest. I could not cause the panic; I only saw, one time, that the guest did not bypass the BIOS and performed a normal boot instead. Is this a reproduction?
I think so, yes. You can try the brew build at https://brewweb.devel.redhat.com/taskinfo?taskID=3022266 if you want to "pre-verify" now.
I have managed to make the test kernel also fail to bypass the BIOS on a KVM guest, with a load average of about 42~43 on the host. That load can be achieved on my workstation using 'usex -e 34'.
Patch(es) available on kernel-2.6.32-117.el6
Perhaps you can try the reproducer in bug 690419? It's for a different bug, but it could work here too.
Start the CPU add-remove loop and, a little after that, run kexec. After some time, without the patch you may see a failure to boot the new kernel. With the patch, you will have to stop the loop for kexec to proceed.
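The reproducer from bug 690419 is not quoted here, so the following is only a guess at what a CPU add-remove loop could look like inside the guest: it toggles CPUs offline and online through the sysfs `online` files. The function names are hypothetical, the script needs root on a real system, and the sysfs_root parameter exists only so the loop can be pointed at a scratch directory for a dry run.

```python
import os
import time


def set_cpu_online(cpu, online, sysfs_root="/sys/devices/system/cpu"):
    """Write 1 or 0 to cpuN/online to bring a CPU up or take it down."""
    path = os.path.join(sysfs_root, f"cpu{cpu}", "online")
    with open(path, "w") as f:
        f.write("1" if online else "0")


def hotplug_loop(cpus, iterations, pause=0.1,
                 sysfs_root="/sys/devices/system/cpu"):
    """Repeatedly offline and re-online each listed CPU.

    CPU 0 usually cannot be taken offline, so callers would normally
    pass only secondary CPUs, for example [1, 2, 3].
    """
    for _ in range(iterations):
        for cpu in cpus:
            set_cpu_online(cpu, False, sysfs_root)
            time.sleep(pause)
            set_cpu_online(cpu, True, sysfs_root)
            time.sleep(pause)
```

A run would start something like hotplug_loop([1, 2, 3], iterations=1000) in one shell and then trigger kexec from another shortly afterwards.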
I saw this kind of warning twice with a -71.el6 x86_64 KVM guest, on host tyan-gt24-01.rhts.eng.bos.redhat.com.

The first:

...
Unmounting file systems: [ OK ]
init: Re-executing /sbin/init
Please stand by while rebooting the system...
md: stopping all md devices.
Starting new kernel
------------[ cut here ]------------
WARNING: at arch/x86/kernel/smp.c:117 native_smp_send_reschedule+0x5c/0x60() (Not tainted)
Hardware name: KVM
Modules linked in: sit tunnel4 sunrpc ipv6 dm_mirror dm_region_hash dm_log virtio_balloon virtio_net i2c_piix4 i2c_core sg ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: freq_table]
Pid: 2190, comm: kexec Not tainted 2.6.32-71.el6.x86_64 #1
Call Trace:
 <IRQ> [<ffffffff8106b857>] warn_slowpath_common+0x87/0xc0
 [<ffffffff8106b8aa>] warn_slowpath_null+0x1a/0x20
 [<ffffffff8102ea7c>] native_smp_send_reschedule+0x5c/0x60
 [<ffffffff810507f8>] resched_task+0x68/0x80
 [<ffffffff81057a6d>] resched_cpu+0x8d/0xa0
 [<ffffffff8106636b>] scheduler_tick+0x26b/0x280
 [<ffffffff810a0c90>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8107d7e2>] update_process_times+0x52/0x70
 [<ffffffff810a0cf6>] tick_sched_timer+0x66/0xc0
 [<ffffffff8109564e>] __run_hrtimer+0x8e/0x1a0
 [<ffffffff8103be39>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff810959f6>] hrtimer_interrupt+0xe6/0x250
 [<ffffffff814cf9fc>] smp_apic_timer_interrupt+0x6c/0x9c
 [<ffffffff81013c93>] apic_timer_interrupt+0x13/0x20
 <EOI> [<ffffffff8101af06>] ? native_read_tsc+0x6/0x20
 [<ffffffff8126644c>] ? __bitmap_weight+0x8c/0xb0
 [<ffffffff8126333a>] delay_tsc+0x4a/0x80
 [<ffffffff812632e6>] __const_udelay+0x46/0x50
 [<ffffffff8102ebab>] native_smp_send_stop+0x6b/0xb0
 [<ffffffff8102e38f>] native_machine_shutdown+0x5f/0x80
 [<ffffffff8103bd75>] kvm_shutdown+0x15/0x20
 [<ffffffff8102df5f>] machine_shutdown+0xf/0x20
 [<ffffffff810b8928>] kernel_kexec+0x158/0x160
 [<ffffffff8108a5a4>] sys_reboot+0x144/0x220
 [<ffffffff8105c394>] ? try_to_wake_up+0x284/0x380
 [<ffffffff8106789a>] ? __cond_resched+0x2a/0x40
 [<ffffffff810d40a2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
---[ end trace 0f4c73c2233b3107 ]---

Second:

...
Starting new kernel
------------[ cut here ]------------
WARNING: at arch/x86/kernel/smp.c:117 native_smp_send_reschedule+0x5c/0x60() (Not tainted)
Hardware name: KVM
Modules linked in: sit tunnel4 sunrpc ipv6 dm_mirror dm_region_hash dm_log virtio_balloon virtio_net i2c_piix4 i2c_core sg ext4 mbcache jbd2 virtio_blk sr_mod cdrom virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: freq_table]
Pid: 22142, comm: kexec Not tainted 2.6.32-71.el6.x86_64 #1
Call Trace:
 <IRQ> [<ffffffff8106b857>] warn_slowpath_common+0x87/0xc0
 [<ffffffff8106b8aa>] warn_slowpath_null+0x1a/0x20
 [<ffffffff8102ea7c>] native_smp_send_reschedule+0x5c/0x60
 [<ffffffff810507f8>] resched_task+0x68/0x80
 [<ffffffff81057a6d>] resched_cpu+0x8d/0xa0
 [<ffffffff8106636b>] scheduler_tick+0x26b/0x280
 [<ffffffff810a0c90>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8107d7e2>] update_process_times+0x52/0x70
 [<ffffffff810a0cf6>] tick_sched_timer+0x66/0xc0
 [<ffffffff8109564e>] __run_hrtimer+0x8e/0x1a0
 [<ffffffff8103be39>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff810959f6>] hrtimer_interrupt+0xe6/0x250
 [<ffffffff814cf9fc>] smp_apic_timer_interrupt+0x6c/0x9c
 [<ffffffff81013c93>] apic_timer_interrupt+0x13/0x20
 <EOI> [<ffffffff81034721>] ? native_apic_mem_write+0x11/0x20
 [<ffffffff8102f9ad>] disconnect_bsp_APIC+0x3d/0xc0
 [<ffffffff810329b2>] disable_IO_APIC+0xa2/0x110
 [<ffffffff8102f906>] ? disable_local_APIC+0x46/0x50
 [<ffffffff8102e399>] native_machine_shutdown+0x69/0x80
 [<ffffffff8103bd75>] kvm_shutdown+0x15/0x20
 [<ffffffff8102df5f>] machine_shutdown+0xf/0x20
 [<ffffffff810b8928>] kernel_kexec+0x158/0x160
 [<ffffffff8108a5a4>] sys_reboot+0x144/0x220
 [<ffffffff810507f8>] ? resched_task+0x68/0x80
 [<ffffffff8105c394>] ? try_to_wake_up+0x284/0x380
 [<ffffffff8105c4e5>] ? wake_up_process+0x15/0x20
 [<ffffffff8119522d>] ? bdi_queue_work+0x7d/0x110
 [<ffffffff810d40a2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
---[ end trace c3b8a463f323d0c9 ]---
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.32-71.el6.x86_64 (mockbuild@x86-007.build.bos.redhat.com) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Wed Sep 1 01:33:01 EDT 2010
Command line: ro root=/dev/mapper/vg_dhcp71107-lv_root rd_LVM_LV=vg_dhcp71107/lv_root rd_LVM_LV=vg_dhcp71107/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us console=ttyS0,115200
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  Centaur CentaurHauls
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 000000000009cc00 (usable)
 BIOS-e820: 000000000009cc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000001fffb000 (usable)
 BIOS-e820: 000000001fffb000 - 0000000020000000 (reserved)
 BIOS-e820: 00000000fffbc000 - 0000000100000000 (reserved)
DMI 2.4 present.
last_pfn = 0x1fffb max_arch_pfn = 0x400000000 x86 PAT enabled: cpu 0, old 0x70106, new 0x7010600070106 init_memory_mapping: 0000000000000000-000000001fffb000 RAMDISK: 1f31f000 - 1ffefd30 ACPI: RSDP 00000000000f80a0 00014 (v00 BOCHS ) ACPI: RSDT 000000001fffdda0 00030 (v01 BOCHS BXPCRSDT 00000001 BXPC 00000001) ACPI: FACP 000000001ffffdc0 00074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001) ACPI: DSDT 000000001fffdf30 01E4B (v01 BXPC BXDSDT 00000001 INTL 20090123) ACPI: FACS 000000001ffffd80 00040 ACPI: SSDT 000000001fffded0 0005E (v01 BOCHS BXPCSSDT 00000001 BXPC 00000001) ACPI: APIC 000000001fffddd0 0008A (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001) No NUMA configuration found Faking a node at 0000000000000000-000000001fffb000 Bootmem setup node 0 0000000000000000-000000001fffb000 NODE_DATA [0000000000009000 - 000000000003cfff] bootmap [000000000003d000 - 0000000000040fff] pages 4 (7 early reservations) ==> bootmem [0000000000 - 001fffb000] #0 [0000000000 - 0000001000] BIOS data page ==> [0000000000 - 0000001000] #1 [0000006000 - 0000008000] TRAMPOLINE ==> [0000006000 - 0000008000] #2 [0001000000 - 0001c9eff8] TEXT DATA BSS ==> [0001000000 - 0001c9eff8] #3 [001f31f000 - 001ffefd30] RAMDISK ==> [001f31f000 - 001ffefd30] #4 [000009cc00 - 0000100000] BIOS reserved ==> [000009cc00 - 0000100000] #5 [0001c9f000 - 0001c9f079] BRK ==> [0001c9f000 - 0001c9f079] #6 [0000008000 - 0000009000] PGTABLE ==> [0000008000 - 0000009000] found SMP MP-table at [ffff8800000f80f0] f80f0 kvm-clock: Using msrs 12 and 11 kvm-clock: cpu 0, msr 0:18bf901, boot clock Zone PFN ranges: DMA 0x00000001 -> 0x00001000 DMA32 0x00001000 -> 0x00100000 Normal 0x00100000 -> 0x00100000 Movable zone start PFN for each node early_node_map[2] active PFN ranges 0: 0x00000001 -> 0x0000009c 0: 0x00000100 -> 0x0001fffb ACPI: PM-Timer IO Port: 0xb008 ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled) ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled) ACPI: LAPIC 
(acpi_id[0x03] lapic_id[0x03] enabled) ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0]) IOAPIC[0]: apic_id 4, version 17, address 0xfec00000, GSI 0-23 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level) ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level) Using ACPI (MADT) for SMP configuration information SMP: Allowing 4 CPUs, 0 hotplug CPUs PM: Registered nosave memory: 000000000009c000 - 000000000009d000 PM: Registered nosave memory: 000000000009d000 - 00000000000a0000 PM: Registered nosave memory: 00000000000a0000 - 00000000000f0000 PM: Registered nosave memory: 00000000000f0000 - 0000000000100000 Allocating PCI resources starting at 20000000 (gap: 20000000:dffbc000) Booting paravirtualized kernel on KVM NR_CPUS:4096 nr_cpumask_bits:4 nr_cpu_ids:4 nr_node_ids:1 PERCPU: Embedded 31 pages/cpu @ffff880001e00000 s95064 r8192 d23720 u524288 pcpu-alloc: s95064 r8192 d23720 u524288 alloc=1*2097152 pcpu-alloc: [0] 0 1 2 3 kvm-clock: cpu 0, msr 0:1e16901, primary cpu clock Built 1 zonelists in Node order, mobility grouping on. Total pages: 129071 Policy zone: DMA32 Kernel command line: ro root=/dev/mapper/vg_dhcp71107-lv_root rd_LVM_LV=vg_dhcp71107/lv_root rd_LVM_LV=vg_dhcp71107/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us console=ttyS0,115200 PID hash table entries: 2048 (order: 2, 16384 bytes) Checking aperture... No AGP bridge found AMD-Vi disabled by default: pass amd_iommu=on to enable Memory: 488640k/524268k available (4935k kernel code, 404k absent, 35224k reserved, 3927k data, 1220k init) Hierarchical RCU implementation. 
NR_IRQS:33024 nr_irqs:440 Console: colour VGA+ 80x25 console [ttyS0] enabled allocated 5242880 bytes of page_cgroup please try 'cgroup_disable=memory' option if you don't want memory cgroups Detected 2394.012 MHz processor. Calibrating delay loop (skipped) preset value.. 4788.02 BogoMIPS (lpj=2394012) pid_max: default: 32768 minimum: 301 Security Framework initialized SELinux: Initializing. Dentry cache hash table entries: 65536 (order: 7, 524288 bytes) Inode-cache hash table entries: 32768 (order: 6, 262144 bytes) Mount-cache hash table entries: 256 Initializing cgroup subsys ns Initializing cgroup subsys cpuacct Initializing cgroup subsys memory Initializing cgroup subsys devices Initializing cgroup subsys freezer Initializing cgroup subsys net_cls Initializing cgroup subsys blkio mce: CPU supports 10 MCE banks Performance Events: AMD PMU driver. ... version: 0 ... bit width: 48 ... generic registers: 4 ... value mask: 0000ffffffffffff ... max period: 00007fffffffffff ... fixed-purpose events: 0 ... event mask: 000000000000000f alternatives: switching to unfair spinlock ACPI: Core revision 20090903 ftrace: converting mcount calls to 0f 1f 44 00 00 ftrace: allocating 20276 entries in 80 pages Setting APIC routing to flat ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1 CPU0: AMD QEMU Virtual CPU version (cpu64-rhel6) stepping 03 Booting Node 0, Processors #1 kvm-clock: cpu 1, msr 0:1e96901, secondary cpu clock #2 kvm-clock: cpu 2, msr 0:1f16901, secondary cpu clock #3 Ok. kvm-clock: cpu 3, msr 0:1f96901, secondary cpu clock Brought up 4 CPUs Total of 4 processors activated (19152.09 BogoMIPS). Testing NMI watchdog ... WARNING: CPU#0: NMI appears to be stuck (0->0)! Please report this to bugzilla.kernel.org, and attach the output of the 'dmesg' command. WARNING: CPU#1: NMI appears to be stuck (0->0)! Please report this to bugzilla.kernel.org, and attach the output of the 'dmesg' command. WARNING: CPU#2: NMI appears to be stuck (0->0)! 
Please report this to bugzilla.kernel.org, and attach the output of the 'dmesg' command. WARNING: CPU#3: NMI appears to be stuck (0->0)! Please report this to bugzilla.kernel.org, and attach the output of the 'dmesg' command. devtmpfs: initialized regulator: core version 0.5 NET: Registered protocol family 16 ACPI: bus type pci registered PCI: Using configuration type 1 for base access bio: create slab <bio-0> at 0 ACPI: Interpreter enabled ACPI: (supports S0 S3 S4 S5) ACPI: Using IOAPIC for interrupt routing ACPI: No dock devices found. ACPI: PCI Root Bridge [PCI0] (0000:00) pci 0000:00:01.3: quirk: region b000-b03f claimed by PIIX4 ACPI pci 0000:00:01.3: quirk: region b100-b10f claimed by PIIX4 SMB Unable to assume PCIe control: Disabling ASPM ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11) ACPI: PCI Interrupt Link [LNKB] (IRQs 5 10 11) *0, disabled. ACPI: PCI Interrupt Link [LNKC] (IRQs 5 *10 11) ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11) vgaarb: device added: PCI:0000:00:02.0,decodes=io+mem,owns=io+mem,locks=none vgaarb: loaded SCSI subsystem initialized usbcore: registered new interface driver usbfs usbcore: registered new interface driver hub usbcore: registered new device driver usb PCI: Using ACPI for IRQ routing NetLabel: Initializing NetLabel: domain hash size = 128 NetLabel: protocols = UNLABELED CIPSOv4 NetLabel: unlabeled traffic allowed by default Switching to clocksource kvm-clock pnp: PnP ACPI init ACPI: bus type pnp registered pnp: PnP ACPI: found 6 devices ACPI: ACPI bus type pnp unregistered NET: Registered protocol family 2 IP route cache hash table entries: 4096 (order: 3, 32768 bytes) TCP established hash table entries: 16384 (order: 6, 262144 bytes) TCP bind hash table entries: 16384 (order: 6, 262144 bytes) TCP: Hash tables configured (established 16384 bind 16384) TCP reno registered NET: Registered protocol family 1 pci 0000:00:00.0: Limiting direct PCI/PCI transfers pci 0000:00:01.0: Activating ISA DMA hang workarounds Trying to 
unpack rootfs image as initramfs... rootfs image is not initramfs (broken padding); looks like an initrd Freeing initrd memory: 13123k freed audit: initializing netlink socket (disabled) type=2000 audit(1301369751.257:1): initialized HugeTLB registered 2 MB page size, pre-allocated 0 pages VFS: Disk quotas dquot_6.5.2 Dquot-cache hash table entries: 512 (order 0, 4096 bytes) msgmni has been set to 980 alg: No test for stdrng (krng) ksign: Installing public key data Loading keyring - Added public key 9D480283FFC05098 - User ID: Red Hat, Inc. (Kernel Module GPG key) - Added public key D4A26C9CCD09BEDA - User ID: Red Hat Enterprise Linux Driver Update Program <secalert@redhat.com> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 252) io scheduler noop registered io scheduler anticipatory registered io scheduler deadline registered io scheduler cfq registered (default) pci_hotplug: PCI Hot Plug PCI Core version: 0.5 pciehp: PCI Express Hot Plug Controller Driver version: 0.4 acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5 acpiphp: Slot [1] registered acpiphp: Slot [2] registered acpiphp: Slot [3] registered acpiphp: Slot [4] registered acpiphp: Slot [5] registered acpiphp: Slot [6] registered acpiphp: Slot [7] registered acpiphp: Slot [8] registered acpiphp: Slot [9] registered acpiphp: Slot [10] registered acpiphp: Slot [11] registered acpiphp: Slot [12] registered acpiphp: Slot [13] registered acpiphp: Slot [14] registered acpiphp: Slot [15] registered acpiphp: Slot [16] registered acpiphp: Slot [17] registered acpiphp: Slot [18] registered acpiphp: Slot [19] registered acpiphp: Slot [20] registered acpiphp: Slot [21] registered acpiphp: Slot [22] registered acpiphp: Slot [23] registered acpiphp: Slot [24] registered acpiphp: Slot [25] registered acpiphp: Slot [26] registered acpiphp: Slot [27] registered acpiphp: Slot [28] registered acpiphp: Slot [29] registered acpiphp: Slot [30] registered acpiphp: Slot [31] registered pci-stub: invalid 
id string "" input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0 ACPI: Power Button [PWRF] processor LNXCPU:00: registered as cooling_device0 processor LNXCPU:01: registered as cooling_device1 processor LNXCPU:02: registered as cooling_device2 processor LNXCPU:03: registered as cooling_device3 hpet_acpi_add: no address or irqs in _CRS Non-volatile memory driver v1.3 Linux agpgart interface v0.103 crash memory driver: version 1.0 Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
On the other hand, I cannot reproduce this warning with a -127.el6 x86_64 KVM guest. Is this enough to verify this bug?
I think so, yes.
Based on comments 28-30, setting this to verified. Thanks!
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0542.html