Bug 968147 - enable online multiple hot-added CPUs cause RHEL7.0 guest hang(soft lockup)
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
7.0
x86_64 Linux
medium Severity high
: rc
: ---
Assigned To: Igor Mammedov
Virtualization Bugs
: Regression, ZStream
Depends On:
Blocks: 833649 968811 RHEL7.0Virt-PostBeta(z-stream) 1144295
Reported: 2013-05-29 01:31 EDT by FuXiangChun
Modified: 2015-03-05 06:30 EST (History)
13 users

See Also:
Fixed In Version: kernel-3.10.0-171.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1144295 (view as bug list)
Environment:
Last Closed: 2015-03-05 06:30:45 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
guest serial console log (25.36 KB, text/plain)
2013-05-29 01:36 EDT, FuXiangChun
no flags Details
[RHEL 7.0 PATCH] abort secondary CPU bring-up gracefully if do_boot_cpu timed out on cpu_callin_mask (7.16 KB, patch)
2014-03-05 07:34 EST, Igor Mammedov
no flags Details | Diff
[RHEL7.1 PATCH] x86: initialize secondary CPU only if master CPU will wait for it (8.48 KB, patch)
2014-09-11 08:33 EDT, Igor Mammedov
no flags Details | Diff
[RHEL7.1 PATCH 1/4] ACPI / processor: do not mark present at boot but not onlined CPU as onlined (1.72 KB, patch)
2014-09-17 08:55 EDT, Igor Mammedov
no flags Details | Diff
[RHEL7.1 PATCH 2/4] x86: Fix list/memory corruption on CPU hotplug (4.26 KB, patch)
2014-09-17 08:55 EDT, Igor Mammedov
no flags Details | Diff
[RHEL7.1 PATCH 3/4] x86/smpboot: Log error on secondary CPU wakeup failure at ERR level (1.45 KB, patch)
2014-09-17 08:56 EDT, Igor Mammedov
no flags Details | Diff
[RHEL7.1 PATCH 4/4] x86: initialize secondary CPU only if master CPU will wait for it (8.64 KB, patch)
2014-09-17 08:56 EDT, Igor Mammedov
no flags Details | Diff

Description FuXiangChun 2013-05-29 01:31:41 EDT
Description of problem:
First, hot-plug 254 CPUs via QMP commands, then online them in the guest with:
echo 1 > /sys/devices/system/cpu/cpuX/online

The guest hung when onlining the 102nd hot-added CPU.

Version-Release number of selected component (if applicable):
host and guest kernel:
# uname -r
3.10.0-0.rc2.57.el7.x86_64

# rpm -qa|grep qemu-kvm
qemu-kvm-1.5.0-2.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot a RHEL7.0 guest:
/usr/libexec/qemu-kvm -cpu Opteron_G2 -M pc-i440fx-1.5 -enable-kvm -m 4G -smp 1,sockets=127,cores=2,threads=1,maxcpus=254 -no-kvm-pit-reinjection -usb -device usb-tablet,id=input0 -name sluo-test -uuid `uuidgen` -rtc base=localtime,clock=host,driftfix=slew -drive file=/home/RHEL-7.0-20130403.0_x86_64.qcow3bk1,if=none,id=drive-system-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-system-disk -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=virtio-net-pci0,mac=08:2e:5f:0a:0d:b3,bus=pci.0,addr=0x5,bootindex=2 -device virtio-balloon-pci,id=ballooning,bus=pci.0,addr=0x6 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -k en-us -vnc :2 -spice port=5931,disable-ticketing -boot menu=on -vga qxl -global qxl-vga.vram_size=67108864 -serial unix:/tmp/ttyS0,server,nowait -qmp tcp:0:4444,server,nowait -monitor stdio

2. Hotplug 254 CPUs via a QMP script.
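The QMP script itself isn't attached. A minimal sketch of what it might look like, assuming qemu-kvm 1.5's `cpu-add` QMP command and the `-qmp tcp:0:4444` monitor socket from the command line above (the helper names here are illustrative, not from the original script):

```python
import json

def qmp_cpu_add(cpu_id):
    # 'cpu-add' (introduced in QEMU 1.5) takes the id of the CPU to hot-add
    return {"execute": "cpu-add", "arguments": {"id": cpu_id}}

def hotplug_messages(max_cpus, first_id=1):
    # Every QMP session must begin with capabilities negotiation;
    # CPU 0 is already present, so hot-add ids first_id..max_cpus-1
    msgs = [{"execute": "qmp_capabilities"}]
    msgs.extend(qmp_cpu_add(i) for i in range(first_id, max_cpus))
    return [json.dumps(m) for m in msgs]
```

Sending the messages is then a matter of connecting to the monitor (e.g. `socket.create_connection(("localhost", 4444))`), writing one JSON message per line, and reading qemu's reply after each one.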

3. Online them in the guest via a small shell script:
# cat online-cpu
i=1
while [ $i -lt 254 ]
do 
echo 1 >/sys/devices/system/cpu/cpu$i/online 
echo $i
sleep 3
i=$(($i+1))
done

Actual results:
The guest hangs.

Expected results:
All hot-added CPUs are onlined successfully.

Additional info:
I attached the full guest serial console log.
Comment 1 FuXiangChun 2013-05-29 01:36:01 EDT
Created attachment 754159 [details]
guest serial console log
Comment 2 FuXiangChun 2013-05-29 01:52:17 EDT
Hit this issue a second time; this time the guest hung when onlining only the 17th CPU.
...........
[ 1152.716910] kvm: no hardware support
[ 1160.649299] smpboot: CPU17: Not responding
[ 1188.006098] BUG: soft lockup - CPU#12 stuck for 22s! [migration/12:1573]
[ 1188.062113] BUG: soft lockup - CPU#3 stuck for 22s! [migration/3:1149]
[ 1188.066159] BUG: soft lockup - CPU#0 stuck for 22s! [migration/0:7]
[ 1188.346190] BUG: soft lockup - CPU#7 stuck for 22s! [migration/7:1337]
[ 1188.598239] BUG: soft lockup - CPU#16 stuck for 22s! [migration/16:1762]
[ 1188.816315] BUG: soft lockup - CPU#11 stuck for 23s! [migration/11:1524]
[ 1189.043371] BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:1102]
[ 1189.242437] BUG: soft lockup - CPU#6 stuck for 22s! [migration/6:1289]
[ 1189.474529] BUG: soft lockup - CPU#15 stuck for 22s! [migration/15:1716]
[ 1189.985080] BUG: soft lockup - CPU#1 stuck for 23s! [migration/1:1049]
[ 1190.176677] BUG: soft lockup - CPU#5 stuck for 22s! [migration/5:1243]
[ 1190.344089] BUG: soft lockup - CPU#14 stuck for 22s! [migration/14:1668]
[ 1190.578778] BUG: soft lockup - CPU#9 stuck for 22s! [migration/9:1431]
[ 1191.101185] BUG: soft lockup - CPU#4 stuck for 23s! [migration/4:1195]
[ 1191.224958] BUG: soft lockup - CPU#13 stuck for 22s! [migration/13:1621]
[ 1191.459125] BUG: soft lockup - CPU#8 stuck for 22s! [migration/8:1383]
[ 1216.000631] BUG: soft lockup - CPU#12 stuck for 22s! [migration/12:1573]
[ 1216.075646] BUG: soft lockup - CPU#0 stuck for 22s! [migration/0:7]
[ 1216.075824] BUG: soft lockup - CPU#3 stuck for 22s! [migration/3:1149]
[ 1216.327181] BUG: soft lockup - CPU#7 stuck for 22s! [migration/7:1337]
[ 1216.598010] BUG: soft lockup - CPU#16 stuck for 23s! [migration/16:1762]
[ 1216.803652] BUG: soft lockup - CPU#11 stuck for 23s! [migration/11:1524]
[ 1217.026701] BUG: soft lockup - CPU#2 stuck for 22s! [migration/2:1102]
[ 1217.242437] BUG: soft lockup - CPU#6 stuck for 22s! [migration/6:1289]
[ 1217.464061] BUG: soft lockup - CPU#15 stuck for 22s! [migration/15:1716]
[ 1217.987793] BUG: soft lockup - CPU#1 stuck for 23s! [migration/1:1049]
[ 1218.178257] BUG: soft lockup - CPU#5 stuck for 22s! [migration/5:1243]
[ 1218.348694] BUG: soft lockup - CPU#14 stuck for 22s! [migration/14:1668]
[ 1218.589359] BUG: soft lockup - CPU#9 stuck for 22s! [migration/9:1431]
[ 1219.110465] BUG: soft lockup - CPU#4 stuck for 22s! [migration/4:1195]
[ 1219.232498] BUG: soft lockup - CPU#13 stuck for 22s! [migration/13:1621]
[ 1219.471560] BUG: soft lockup - CPU#8 stuck for 22s! [migration/8:1383]
[ 1223.746006] INFO: rcu_sched detected stalls on CPUs/tasks: { 0 1 2} (detected by 5, t=60012 jiffies, g=10564, c=10563, q=23)
............
Comment 4 Igor Mammedov 2014-02-27 08:29:24 EST
RHEL6 downstream patch that needs to be forward-ported: ccda2747245997d79a3e110cfc72de388fb9361c
[x86] abort secondary CPU bring-up gracefully if do_boot_cpu timed out on cpu_callin_mask
Comment 5 Igor Mammedov 2014-02-27 08:31:19 EST
Corresponding RHEL6 BZ 816899
Comment 8 Igor Mammedov 2014-02-28 09:23:23 EST
FuXiangChun,

Could you retest with the following kernel: https://brewweb.devel.redhat.com/taskinfo?taskID=7122768
Comment 9 FuXiangChun 2014-03-04 22:33:10 EST
Sorry for replying to this comment so late.

Re-tested this issue per the steps in the description with kernel https://brewweb.devel.redhat.com/taskinfo?taskID=7122768.

Result:
The guest works well and all CPUs are online inside the guest; no error messages were found in the guest's dmesg log.
Comment 10 Igor Mammedov 2014-03-05 07:34:09 EST
Created attachment 870935 [details]
[RHEL 7.0 PATCH] abort secondary CPU bring-up gracefully if do_boot_cpu timed out on cpu_callin_mask
Comment 11 Igor Mammedov 2014-03-06 04:52:58 EST
Posted the upstream version as well: http://www.spinics.net/lists/kernel/msg1700480.tml
Comment 12 Karen Noel 2014-03-07 12:07:36 EST
These links work better for me:
http://www.spinics.net/lists/kernel/msg1701331.html
http://marc.info/?l=linux-kernel&m=139411515911715&w=2

[PATCH v2] abort secondary CPU bring-up gracefully if do_boot_cpu timed out on cpu_callin_mask
Comment 18 Igor Mammedov 2014-03-13 10:32:51 EDT
Posted a small upstream series with an alternative approach:
 https://lkml.org/lkml/2014/3/13/270
Comment 23 Igor Mammedov 2014-09-11 08:33:50 EDT
Created attachment 936529 [details]
[RHEL7.1 PATCH] x86: initialize secondary CPU only if master CPU will wait for it
Comment 24 Igor Mammedov 2014-09-12 08:13:25 EDT
Forgot to add several additional fixes that should fix related CPU hotplug issues.
Comment 25 Igor Mammedov 2014-09-17 08:55:15 EDT
Created attachment 938465 [details]
[RHEL7.1 PATCH 1/4] ACPI / processor: do not mark present at boot but not onlined CPU as onlined
Comment 26 Igor Mammedov 2014-09-17 08:55:44 EDT
Created attachment 938466 [details]
[RHEL7.1 PATCH 2/4] x86: Fix list/memory corruption on CPU hotplug
Comment 27 Igor Mammedov 2014-09-17 08:56:13 EDT
Created attachment 938468 [details]
[RHEL7.1 PATCH 3/4] x86/smpboot: Log error on secondary CPU wakeup failure at ERR level
Comment 28 Igor Mammedov 2014-09-17 08:56:55 EDT
Created attachment 938469 [details]
[RHEL7.1 PATCH 4/4] x86: initialize secondary CPU only if master CPU will wait for it
Comment 31 Jarod Wilson 2014-09-29 08:21:54 EDT
Patch(es) available on kernel-3.10.0-171.el7
Comment 34 mazhang 2014-11-09 22:05:03 EST
Reproduced this bug with kernel-3.10.0-78.el7.x86_64.

Host:
qemu-kvm-common-rhev-2.1.2-6.el7.x86_64
qemu-kvm-rhev-debuginfo-2.1.2-6.el7.x86_64
qemu-img-rhev-2.1.2-6.el7.x86_64
qemu-kvm-tools-rhev-2.1.2-6.el7.x86_64
qemu-kvm-rhev-2.1.2-6.el7.x86_64
kernel-3.10.0-187.el7.x86_64

Guest:
kernel-3.10.0-78.el7.x86_64

Steps:
1. Boot vm:
/usr/libexec/qemu-kvm \
-M pc \
-cpu Opteron_G2 \
-m 4G \
-smp 3,sockets=120,cores=2,threads=1,maxcpus=240 \
-enable-kvm \
-name rhel7 \
-uuid 990ea161-6b67-47b2-b803-19fb01d30d12 \
-smbios type=1,manufacturer='Red Hat',product='RHEV Hypervisor',version=el6,serial=koTUXQrb,uuid=feebc8fd-f8b0-4e75-abc3-e63fcdb67170 \
-k en-us \
-rtc base=localtime,clock=host,driftfix=slew \
-nodefaults \
-monitor stdio \
-qmp tcp:0:5555,server,nowait \
-boot menu=on \
-bios /usr/share/seabios/bios.bin \
-monitor unix:/tmp/monitor2,server,nowait \
-vga qxl \
-spice port=5900,disable-ticketing \
-usb \
-device usb-tablet,id=tablet0 \
-drive file=/home/rhel7-64-1.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop,aio=threads \
-device virtio-blk-pci,scsi=off,bus=pci.0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 \
-serial unix:/tmp/ttyS0,server,nowait \
-monitor unix:/tmp/monitor-unix,server,nowait \

2. Hotplug 237 cpus.

3. Online cpus in guest, then reboot guest.

Result:
After the guest rebooted, CPUs got stuck:
[  390.869009] BUG: soft lockup - CPU#119 stuck for 23s! [migration/119:1082]
[  390.823009]  ext4
[  389.852341]  [<ffffffff815c3785>] ? __schedule+0x2c5/0x790
[  390.660461]  i2c_piix4
[  389.824865]  mbcache
[  416.952465]  [<ffffffff81087e80>] ? lg_global_unlock+0xc0/0xc0
[  416.952465]  [<ffffffff8107fb90>] kthread+0xc0/0xd0
[  416.952465]  [<ffffffff8107fad0>] ? kthread_create_on_node+0x110/0x110
[  416.952465]  [<ffffffff815ce46c>] ret_from_fork+0x7c/0xb0
[  416.952465]  [<ffffffff8107fad0>] ? kthread_create_on_node+0x110/0x110
[  416.952465] Code: 84 ed 75 5d f0 ff 4b 24 0f 94 c1 84 c9 44 89 e2 74 12 8b 43 20 8b 4b 10 8d 50 01 89 4b 24 89 53 20 44 89 e2 83 fa 04 74 20 f3 90 <44> 8b 63 20 41 39 d4 74 f0 41 83 fc 02 75 bf fa 66 66 90 66 66
[  416.975377] BUG: soft lockup - CPU#209 stuck for 23s! [migration/209:1532]
[  416.975377] Modules linked in: serio_raw(+) powernow_k8(+) i2c_piix4 pcspkr mperf nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 virtio_blk ata_generic qxl pata_acpi drm_kms_helper ttm virtio_pci ata_piix virtio_ring drm virtio libata floppy i2c_core dm_mirror dm_region_hash dm_log dm_mod
[  416.975377] CPU: 209 PID: 1532 Comm: migration/209 Not tainted 3.10.0-78.el7.x86_64 #1
[  416.975377] Hardware name: Red Hat RHEV Hypervisor, BIOS seabios-1.7.5-5.el7 04/01/2014
[  416.975377] task: ffff88013520ec20 ti: ffff880135258000 task.ti: ffff880135258000
[  416.975377] RIP: 0010:[<ffffffff810d5d92>]  [<ffffffff810d5d92>] multi_cpu_stop+0x82/0xf0
[  416.975377] RSP: 0018:ffff880135259d90  EFLAGS: 00000293
[  416.975377] RAX: 0000000000000000 RBX: 0000000000014640 RCX: ffffffffffffff00
[  444.729013]  [<ffffffff810d5d10>] ? cpu_stop_should_run+0x50/0x50
[  444.729013]  [<ffffffff810d5f98>] cpu_stopper_thread+0x88/0x160
[  444.729013]  [<ffffffff815c3785>] ? __schedule+0x2c5/0x790
[  444.729013]  [<ffffffff81087f7f>] smpboot_thread_fn+0xff/0x1a0
[  444.729013]  [<ffffffff81087e80>] ? lg_global_unlock+0xc0/0xc0
[  444.729013]  [<ffffffff8107fb90>] kthread+0xc0/0xd0
[  444.729013]  [<ffffffff8107fad0>] ? kthread_create_on_node+0x110/0x110
[  444.729013]  [<ffffffff815ce46c>] ret_from_fork+0x7c/0xb0
[  446.572882]  [<ffffffff815c3785>] ? __schedule+0x2c5/0x790
Comment 35 mazhang 2014-11-09 22:14:33 EST
Update guest kernel to kernel-3.10.0-197.el7.x86_64.

Got a call trace while onlining CPUs in the guest:

[  371.883273] smpboot: Booting Node 0 Processor 18 APIC 0x12
[  371.877566] kvm-clock: cpu 18, msr 1:3ff86481, secondary cpu clock
[  371.998764] KVM setup async PF for cpu 18
[  371.999742] kvm-stealtime: cpu 18, msr 139a4e000
[  372.007181] Will online and init hotplugged CPU: 18
[  372.012729] kobject '(null)' (ffff88013987bed8): tried to add an uninitialized object, something is seriously wrong.
[  372.022148] CPU: 9 PID: 11446 Comm: sh Not tainted 3.10.0-197.el7.x86_64 #1
[  372.028366] Hardware name: Red Hat RHEV Hypervisor, BIOS seabios-1.7.5-5.el7 04/01/2014
[  372.036695]  ffff88013987bed8 000000003cffcc16 ffff8800b2fbbc70 ffffffff816037a7
[  372.044697]  ffff8800b2fbbce8 ffffffff812d67a6 ffff880000000020 ffff8800b2fbbcf8
[  372.052384]  ffff8800b2fbbca0 000000003cffcc16 0000000000000000 0000000000000013
[  372.060310] Call Trace:
[  372.062711]  [<ffffffff816037a7>] dump_stack+0x19/0x1b
[  372.067427]  [<ffffffff812d67a6>] kobject_add+0xc6/0xd0
[  372.072353]  [<ffffffff812b8659>] blk_mq_register_hctx+0x99/0xb0
[  372.078068]  [<ffffffff812b8fa1>] blk_mq_sysfs_register+0x41/0x60
[  372.083576]  [<ffffffff812b69f2>] blk_mq_queue_reinit_notify+0x72/0xb0
[  372.089658]  [<ffffffff8160f16c>] notifier_call_chain+0x4c/0x70
[  372.094853]  [<ffffffff8109d39e>] __raw_notifier_call_chain+0xe/0x10
[  372.100618]  [<ffffffff81072083>] cpu_notify+0x23/0x50
[  372.105423]  [<ffffffff8107226d>] _cpu_up+0x17d/0x1a0
[  372.110181]  [<ffffffff81072341>] cpu_up+0xb1/0x120
[  372.114711]  [<ffffffff815f6f1c>] cpu_subsys_online+0x3c/0x90
[  372.120157]  [<ffffffff813cf985>] device_online+0x65/0x90
[  372.125236]  [<ffffffff813cfa45>] store_online+0x95/0xa0
[  372.130227]  [<ffffffff813cc978>] dev_attr_store+0x18/0x30
[  372.135179]  [<ffffffff8123be06>] sysfs_write_file+0xc6/0x140
[  372.140314]  [<ffffffff811c5ebd>] vfs_write+0xbd/0x1e0
[  372.145303]  [<ffffffff8160f063>] ? trace_do_page_fault+0x43/0x100
[  372.150965]  [<ffffffff811c6908>] SyS_write+0x58/0xb0
[  372.155600]  [<ffffffff816137a9>] system_call_fastpath+0x16/0x1b


After rebooting the guest, also got a call trace in dmesg:

BUG: soft lockup - CPU#183 stuck for 22s! [systemd-udevd:2720]
[   72.690012] Modules linked in: powernow_k8(+) xfs libcrc32c ppdev parport_pc serio_raw parport virtio_net i2c_piix4 pcspkr nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 ata_generic virtio_blk qxl pata_acpi drm_kms_helper ttm ata_piix virtio_pci drm virtio_ring i2c_core virtio libata floppy dm_mirror dm_region_hash dm_log dm_mod
[   72.690012] CPU: 183 PID: 2720 Comm: systemd-udevd Not tainted 3.10.0-197.el7.x86_64 #1
[   72.690012] Hardware name: Red Hat RHEV Hypervisor, BIOS seabios-1.7.5-5.el7 04/01/2014
[   72.690012] task: ffff8800a1562220 ti: ffff8800a15fc000 task.ti: ffff8800a15fc000
[   72.690012] RIP: 0010:[<ffffffff810d6c42>]  [<ffffffff810d6c42>] smp_call_function_many+0x202/0x260
[   72.690012] RSP: 0018:ffff8800a15ffc38  EFLAGS: 00000202
[   72.690012] RAX: 0000000000000047 RBX: 00000000000000fc RCX: ffff88013a0f9f68
[   72.690012] RDX: 0000000000000047 RSI: 00000000000000f0 RDI: 0000000000000040
[   72.690012] RBP: ffff8800a15ffc70 R08: ffff880134883c08 R09: ffff88013aef7420
[   72.690012] R10: ffffea0004d2da00 R11: ffffffff812d3599 R12: 0000000000000292
[   72.690012] R13: 0000000000000018 R14: ffff8800a15ffba8 R15: 0000000000000292
[   72.690012] FS:  00007f77c26aa880(0000) GS:ffff88013aee0000(0000) knlGS:0000000000000000
[   72.690012] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   72.690012] CR2: 00007f827e5d11e0 CR3: 00000000a15e7000 CR4: 00000000000006e0
[   72.690012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   72.690012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   72.690012] Stack:
[   72.690012]  000000018118f2e7 ffffffff81a21100 ffff8800a15ffd58 ffffffff8105a300
[   72.690012]  0000000000000000 0000000000000004 0000000000000001 ffff8800a15ffc98
[   72.690012]  ffffffff810d6cfd ffff8800a15ffd58 8000000000000000 ffffffffa0bc9000
[   72.690012] Call Trace:
[   72.690012]  [<ffffffff8105a300>] ? try_to_free_pmd_page+0x50/0x50
[   72.690012]  [<ffffffff810d6cfd>] on_each_cpu+0x2d/0x60
[   72.690012]  [<ffffffff8105c053>] change_page_attr_set_clr+0x2a3/0x500
[   72.690012]  [<ffffffffa0bc6000>] ? 0xffffffffa0bc5fff
[   72.690012]  [<ffffffff8105c983>] set_memory_nx+0x43/0x50
[   72.690012]  [<ffffffff815fde19>] set_section_ro_nx+0x6b/0x71
[   72.690012]  [<ffffffff810dcf58>] load_module+0x12e8/0x1b20
[   72.690012]  [<ffffffff812f6430>] ? ddebug_proc_write+0xf0/0xf0
[   72.690012]  [<ffffffff810981a0>] ? wake_up_bit+0x30/0x30
[   72.690012]  [<ffffffff810dd946>] SyS_finit_module+0xa6/0xd0
[   72.690012]  [<ffffffff816137a9>] system_call_fastpath+0x16/0x1b
[   72.690012] Code: 48 63 35 b6 ba 94 00 89 c2 39 f0 0f 8d 86 fe ff ff 48 98 49 8b 0f 48 03 0c c5 80 47 a1 81 f6 41 20 01 74 cd 0f 1f 44 00 00 f3 90 <f6> 41 20 01 75 f8 48 63 35 85 ba 94 00 eb b7 0f b6 4d cc 4c 89
Comment 36 mazhang 2014-11-10 00:41:19 EST
Updated systemd to systemd-208-15.el7.x86_64 and re-tested this bug; still hit a problem.

After hotplugging 237 CPUs, the guest showed a call trace:

[  710.804938] BUG: soft lockup - CPU#190 stuck for 22s! [systemd-udevd:5614]
[  710.804938] Modules linked in: powernow_k8(+) ipt_MASQUERADE xt_CHECKSUM ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg xfs libcrc32c ppdev virtio_net parport_pc serio_raw parport i2c_piix4 pcspkr nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 virtio_blk ata_generic pata_acpi qxl drm_kms_helper ttm ata_piix drm virtio_pci i2c_core virtio_ring virtio libata floppy dm_mirror dm_region_hash dm_log dm_mod
[  710.804938] CPU: 190 PID: 5614 Comm: systemd-udevd Not tainted 3.10.0-197.el7.x86_64 #1
[  710.804938] Hardware name: Red Hat RHEV Hypervisor, BIOS seabios-1.7.5-5.el7 04/01/2014
[  710.804938] task: ffff8800a3a04fa0 ti: ffff8800a0c98000 task.ti: ffff8800a0c98000
[  710.804938] RIP: 0010:[<ffffffff810d65c8>]  [<ffffffff810d65c8>] generic_exec_single+0xf8/0x1a0
[  710.804938] RSP: 0018:ffff8800a0c9bca0  EFLAGS: 00000202
[  710.804938] RAX: 00000000000000f0 RBX: ffff8800a0c9bc70 RCX: 0000000000000010
[  710.804938] RDX: 0000ffffffffffff RSI: 00000000000000f0 RDI: 0000000000000286
[  710.804938] RBP: ffff8800a0c9bcf0 R08: ffffffff81636440 R09: 000000000000f1aa
[  710.804938] R10: 0000000000000000 R11: ffff8800a0c9b95e R12: ffff8800a0c9bc54
[  710.804938] R13: ffffffff8104a266 R14: 00000000a7d38952 R15: ffff8800a0c9bc08
[  710.804938] FS:  00007f0d203c3880(0000) GS:ffff88013afc0000(0000) knlGS:0000000000000000
[  710.804938] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  710.804938] CR2: 00007f7d41709000 CR3: 00000000a0c88000 CR4: 00000000000006e0
[  710.804938] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  710.804938] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  710.804938] Stack:
[  710.804938]  0000000000000000 0000000000000000 ffffffffa059b130 ffff8800a0c9bd34
[  710.804938]  0000000000000003 00000000a7d38952 00000000a7d38952 00000000000000ea
[  710.804938]  ffffffffa059b130 ffffffff81a21100 ffff8800a0c9bd20 ffffffff810d66cf
[  710.804938] Call Trace:
[  710.804938]  [<ffffffffa059b130>] ? powernowk8_target_fn+0x960/0x960 [powernow_k8]
[  710.804938]  [<ffffffffa059b130>] ? powernowk8_target_fn+0x960/0x960 [powernow_k8]
[  710.804938]  [<ffffffff810d66cf>] smp_call_function_single+0x5f/0xa0
[  710.804938]  [<ffffffff81606fe2>] ? mutex_lock+0x12/0x2f
[  710.804938]  [<ffffffffa059c0c7>] powernowk8_init+0xb7/0x1a4 [powernow_k8]
[  710.804938]  [<ffffffffa059c010>] ? powernowk8_cpu_init+0xd90/0xd90 [powernow_k8]
[  710.804938]  [<ffffffff810020b8>] do_one_initcall+0xb8/0x230
[  710.804938]  [<ffffffff810dcf8e>] load_module+0x131e/0x1b20
[  710.804938]  [<ffffffff812f6430>] ? ddebug_proc_write+0xf0/0xf0
[  710.804938]  [<ffffffff810981a0>] ? wake_up_bit+0x30/0x30
[  710.804938]  [<ffffffff810dd946>] SyS_finit_module+0xa6/0xd0
[  710.804938]  [<ffffffff816137a9>] system_call_fastpath+0x16/0x1b

Hi Igor,

Could you please have a look at comment #35 and comment #36? Is it a new problem?

Thanks,
Mazhang.
Comment 37 Igor Mammedov 2014-11-10 04:55:07 EST
This bug usually manifests as a silent hang during massive CPU hotplug in rapid succession, not on reboot. To reproduce, boot with all CPUs and run a CPU online/offline cycle inside the guest while the host is under heavy CPU load.

The traces in comments 35 and 36 look unrelated to this bug; most likely something new.
Comment 38 Igor Mammedov 2014-11-10 05:16:33 EST
(In reply to Igor Mammedov from comment #37)

Another sign that the bug was reproduced is the message:
smpboot: CPU17: Not responding
in an unfixed kernel, followed by a system hang.

In a fixed kernel you might get:
do_boot_cpu failed
but the system shouldn't hang.
Comment 39 mazhang 2014-11-10 22:00:24 EST
(In reply to Igor Mammedov from comment #37)
> This bug usually silently hangs during massive cpu hotplug in rapid
> succession
> not on reboot. To reproduce boot with all cpus and run CPUs online/offline
> cycle
> inside guest under CPU heavy load on host.
> 
I tried to reproduce this bug as in bz816899#c0.

1. Boot with 240 CPUs and run an online/offline cycle inside the guest:
cat off.sh
#!/bin/bash

SYS_CPU_DIR=/sys/devices/system/cpu/
CPU_CNT=240

while true; do
  for ((i=1; i < $CPU_CNT; i++)); do
    echo 0 > ${SYS_CPU_DIR}cpu${i}/online
  done
  for ((i=1; i < $CPU_CNT; i++)); do
    echo 1 > ${SYS_CPU_DIR}cpu${i}/online
  done
done

2. Put heavy load on the host:
[root@hp-dl385g7-08 ~]# cat cpuhog.sh 
#!/bin/bash

while  true
do
  dmesg > /dev/null
done

[root@hp-dl385g7-08 ~]# for ((i=0; i<31; i++)) do ./cpuhog.sh & done

3. After about 10 minutes the guest hit a call trace and kept printing the following messages, but nothing like "smpboot: CPU17: Not responding" was found. Is it reproduced?

[  889.752817] Code: 08 e8 1a c5 a9 ff 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 53 48 89 f3 e9 16 00 00 00 66 83 07 02 48 89 df 57 9d <66> 66 90 66 90 5b 5d c3 0f 1f 44 00 00 8b 37 f0 66 83 07 02 f6 
[  889.931666] BUG: soft lockup - CPU#36 stuck for 22s! [migration/36:667]
[  889.931668] Modules linked in: powernow_k8(+) ipt_MASQUERADE xt_CHECKSUM ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables sg xfs libcrc32c virtio_net serio_raw i2c_piix4 pcspkr mperf nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 ata_generic virtio_blk pata_acpi qxl drm_kms_helper ttm virtio_pci ata_piix drm virtio_ring libata i2c_core virtio floppy dm_mirror dm_region_hash dm_log dm_mod
[  889.931819] CPU: 36 PID: 667 Comm: migration/36 Not tainted 3.10.0-78.el7.x86_64 #1
[  889.931821] Hardware name: Red Hat RHEV Hypervisor, BIOS seabios-1.7.5-5.el7 04/01/2014
[  889.931826] task: ffff880136872070 ti: ffff8801368a8000 task.ti: ffff8801368a8000
[  889.931832] RIP: 0010:[<ffffffff815c589b>]  [<ffffffff815c589b>] _raw_spin_unlock_irqrestore+0x1b/0x40
[  889.931841] RSP: 0000:ffff880139c83e70  EFLAGS: 00000246
[  889.931847] RAX: 0000000000000000 RBX: ffff880139c83e20 RCX: 0000000000000040
[  889.931851] RDX: 0000000000000040 RSI: ffff88013a000000 RDI: 0000000000000246
[  889.931855] RBP: ffff880139c83e78 R08: ffffffff81bb86a8 R09: 0000000000000004
[  889.931858] R10: 0000000000000000 R11: 0000000000000024 R12: ffff880139c83de8
[  889.931862] R13: ffffffff815cf19d R14: ffff880139c83e78 R15: 0000000000000246
[  889.931870] FS:  00007f93cb8d8700(0000) GS:ffff880139c80000(0000) knlGS:0000000000000000
[  889.931876] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  889.931880] CR2: 00007f93ca62f000 CR3: 00000000018aa000 CR4: 00000000000006e0
[  889.931889] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  889.931893] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  889.931897] Stack:
[  889.931899]  ffffffff81900480 ffff880139c83ed8 ffffffff810f624b ffff880139c83ed8
[  889.932148]  ffffffff8109d070 ffff880139c8ec08 ffff880136872070 ffff8801368a9fd8
[  889.932414]  0000000000000009 0000000000000009 ffff8801155a5d00 ffffffff818a30c8
[  889.932650] Call Trace:
[  889.932650]  <IRQ>  [<ffffffff810f624b>] rcu_process_callbacks+0x35b/0x550
[  889.932650]  [<ffffffff8109d070>] ? run_rebalance_domains+0x50/0x1c0
[  889.932650]  [<ffffffff81062867>] __do_softirq+0xf7/0x290
[  889.932650]  [<ffffffff815cfe5c>] call_softirq+0x1c/0x30
[  889.932650]  [<ffffffff81014ba5>] do_softirq+0x55/0x90
[  889.932650]  [<ffffffff81062c05>] irq_exit+0x115/0x120
[  889.932650]  [<ffffffff815d0835>] smp_apic_timer_interrupt+0x45/0x60
[  889.932650]  [<ffffffff815cf19d>] apic_timer_interrupt+0x6d/0x80
[  889.932650]  <EOI>  [<ffffffff810d5d92>] ? multi_cpu_stop+0x82/0xf0
[  889.932650]  [<ffffffff810d5d10>] ? cpu_stop_should_run+0x50/0x50
[  889.932650]  [<ffffffff810d5f98>] cpu_stopper_thread+0x88/0x160
[  889.932650]  [<ffffffff815c3785>] ? __schedule+0x2c5/0x790
[  889.932650]  [<ffffffff81087f7f>] smpboot_thread_fn+0xff/0x1a0
[  889.932650]  [<ffffffff81087e80>] ? lg_global_unlock+0xc0/0xc0
[  889.932650]  [<ffffffff8107fb90>] kthread+0xc0/0xd0
[  889.932650]  [<ffffffff8107fad0>] ? kthread_create_on_node+0x110/0x110
[  889.932650]  [<ffffffff815ce46c>] ret_from_fork+0x7c/0xb0
[  889.932650]  [<ffffffff8107fad0>] ? kthread_create_on_node+0x110/0x110

kernel-3.10.0-78.el7.x86_64

> Traces in 35 and 36 look like a new not related to this bug, most likely
> something new.

I'll test with the latest kernel and will file a new bug if the problem persists.
Comment 40 Igor Mammedov 2014-11-20 02:58:03 EST
It might cause such a trace; however, without a core dump it's hard to say for sure.

In this case it's possible to take a VM dump with libvirt and then use the 'crash' utility to get stack traces for each CPU.
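A sketch of that workflow (the domain name, dump path, and debuginfo path below are placeholders, not taken from this bug):

```shell
# Take a memory-only dump of the running guest without stopping it
virsh dump --memory-only --live rhel7-guest /var/tmp/rhel7-guest.core

# Open the dump with the crash utility against the guest kernel's
# debuginfo vmlinux (kernel-debuginfo must match the guest kernel)
crash /usr/lib/debug/lib/modules/3.10.0-78.el7.x86_64/vmlinux \
      /var/tmp/rhel7-guest.core

# At the crash> prompt, 'bt -a' prints the stack trace of the active
# task on every CPU, showing where each soft-locked CPU is stuck
```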
Comment 42 mazhang 2014-11-25 21:46:11 EST
Tested this bug on the latest kernels; the "smpboot: CPU17: Not responding" problem did not happen, but hit "WARNING: at block/blk-mq.c:701 __blk_mq_run_hw_queue+0x31d/0x330()" with the following kernels. Both AMD and Intel platforms hit this problem.
kernel-3.10.0-200.el7.x86_64
kernel-3.10.0-205.el7.x86_64
kernel-3.10.0-210.el7.x86_64

Here is the call trace:
[  836.883747] ------------[ cut here ]------------
[  836.883753] WARNING: at block/blk-mq.c:701 __blk_mq_run_hw_queue+0x31d/0x330()
[  836.883786] Modules linked in: intel_powerclamp(-) fuse btrfs zlib_deflate raid6_pq xor vfat msdos fat ext4 mbcache jbd2 binfmt_misc cfg80211 rfkill ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev parport_pc aesni_intel lrw gf128mul glue_helper ablk_helper cryptd parport i2c_piix4 serio_raw pcspkr uinput xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper virtio_net ttm virtio_blk ata_piix drm virtio_pci floppy virtio_ring libata i2c_core virtio dm_mirror dm_region_hash dm_log dm_mod
[  836.883791] CPU: 5 PID: 515 Comm: kworker/5:0H Not tainted 3.10.0-210.el7.x86_64 #1
[  836.883792] Hardware name: Red Hat RHEV Hypervisor, BIOS 0.5.1 01/01/2011
[  836.883794] Workqueue: kblockd blk_mq_run_work_fn
[  836.883830]  0000000000000000 000000001b0d5a21 ffff880136a8bd60 ffffffff816038f7
[  836.883866]  ffff880136a8bd98 ffffffff8106e27b ffff88013670bc00 ffff880136a8bdc8
[  836.883902]  ffff8801398b4280 ffff8801398bb000 0000000000000160 ffff880136a8bda8
[  836.883903] Call Trace:
[  836.883922]  [<ffffffff816038f7>] dump_stack+0x19/0x1b
[  836.883934]  [<ffffffff8106e27b>] warn_slowpath_common+0x6b/0xb0
[  836.883945]  [<ffffffff8106e3ca>] warn_slowpath_null+0x1a/0x20
[  836.883956]  [<ffffffff812b5a7d>] __blk_mq_run_hw_queue+0x31d/0x330
[  836.883966]  [<ffffffff812b5dd5>] blk_mq_run_work_fn+0x15/0x20
[  836.883978]  [<ffffffff8108f06b>] process_one_work+0x17b/0x470
[  836.883988]  [<ffffffff8108fe4b>] worker_thread+0x11b/0x400
[  836.883999]  [<ffffffff8108fd30>] ? rescuer_thread+0x400/0x400
[  836.884018]  [<ffffffff8109722f>] kthread+0xcf/0xe0
[  836.884028]  [<ffffffff81097160>] ? kthread_create_on_node+0x140/0x140
[  836.884040]  [<ffffffff8161387c>] ret_from_fork+0x7c/0xb0
[  836.884050]  [<ffffffff81097160>] ? kthread_create_on_node+0x140/0x140
[  836.884051] ---[ end trace a57be53284c1125a ]---
Comment 43 mazhang 2014-11-25 22:23:05 EST
I've tested this bug more than 10 times and none of my tests hit the original problem, so this bug has been fixed.
The comment #42 issue looks like a block multiqueue problem, so I filed bug 1168074 to track it.
Please let me know of any problems.

Thanks,
Mazhang.
Comment 44 Igor Mammedov 2014-11-26 04:11:56 EST
(In reply to mazhang from comment #43)
> I've been test this bug more then 10 times, all my test did not hit the
> original problem, so this bug has been fixed.
> For comment#42 issue, it should be problem of block multiqueue, so I just
> file a file 1168074 trace it.
That's what I'd do too.

> Any problem please let me know.
> 
> Thanks,
> Mazhang.
Comment 45 juzhang 2014-11-26 04:30:43 EST
According to comment 43 and comment 44, setting this issue as verified.
Comment 47 errata-xmlrpc 2015-03-05 06:30:45 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0290.html
