Bug 1485208
| Summary: | Guest hangs after save, restore guest with 280 vcpus | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | chhu |
| Component: | qemu-kvm | Assignee: | Vitaly Kuznetsov <vkuznets> |
| Status: | CLOSED DUPLICATE | QA Contact: | FuXiangChun <xfu> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.0 | CC: | chayang, dyuan, hhuang, jinzhao, juzhang, knoel, lhuang, rbalakri, ribarry, virt-maint, vkuznets, xfu, xuzhang, yafu, yalzhang, yanqzhan, zhguo, zpeng |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-02 12:38:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
It seems that another underlying issue is causing this, so I'm moving the bug to QEMU for further triage.
Reproduced this bug on packages:
libvirt-3.9.0-14.el7_5.2.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64
kernel: 3.10.0-861.el7.x86_64
guest kernel: 3.10.0-862.el7.x86_64
How reproducible:
100%
Steps to Reproduce:
1. Start guest with 280 vcpus with xml below:
<vcpu placement='static'>280</vcpu>
......
<os>
<type arch='x86_64' machine='pc-q35-rhel7.5.0'>hvm</type>
</os>
......
<cpu mode='host-model'>
<model fallback='allow'/>
</cpu>
......
<iommu model='intel'>
<driver intremap='on' eim='on'/>
</iommu>
</devices>
......
2. Log in to the guest and check that there are 280 vcpus
# cat /proc/cpuinfo |grep processor|wc -l
280
3. Save and restore the guest
# virsh save r7 r7.save
Domain r7 saved to r7.save
# virsh restore r7.save
Domain restored from r7.save
4. Try to log in to the guest; the virsh console hangs.
# virsh console r7
Connected to domain r7
Escape character is ^]
[ 70.972775] NMI watchdog: BUG: soft lockup - CPU#35 stuck for 23s! [kworker/u560:4:1843]
[ 98.972780] NMI watchdog: BUG: soft lockup - CPU#35 stuck for 23s! [kworker/u560:4:1843]
[ 106.333777] INFO: rcu_sched self-detected stall on CPU[ 106.336867] INFO: rcu_sched detected stalls on CPUs/tasks:
5. Try to log in to the guest through virt-manager; the login user name cannot be entered.
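The symptoms seen on the console in step 4 can be checked for automatically when the console output is captured to a file. The following is only a minimal sketch, not part of the original report: the `SYMPTOMS` table and the `scan_console` helper are hypothetical names, and the patterns are derived solely from the messages quoted in this bug.

```python
import re

# Hypothetical helper: patterns are taken from the console messages quoted
# in this report, not from any kernel documentation.
SYMPTOMS = {
    "soft_lockup": re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!"),
    "rcu_stall": re.compile(r"INFO: rcu_sched (?:self-)?detected stalls? on CPU"),
    "hung_task": re.compile(r"INFO: task (\S+) blocked for more than (\d+) seconds"),
}


def scan_console(text):
    """Count how often each symptom pattern occurs in captured console text."""
    return {name: len(pattern.findall(text)) for name, pattern in SYMPTOMS.items()}


# Console lines copied from step 4 above.
sample = """\
[ 70.972775] NMI watchdog: BUG: soft lockup - CPU#35 stuck for 23s! [kworker/u560:4:1843]
[ 98.972780] NMI watchdog: BUG: soft lockup - CPU#35 stuck for 23s! [kworker/u560:4:1843]
[ 106.333777] INFO: rcu_sched self-detected stall on CPU[ 106.336867] INFO: rcu_sched detected stalls on CPUs/tasks:
"""

print(scan_console(sample))
```

A non-zero count for any key after a restore would suggest the guest hit the hang described here; an all-zero result does not prove the guest is healthy.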
Large numbers of vCPUs require the q35 machine type, which is a tech preview in RHEL7; moving to RHEL8.
Save-restore and migration are the same thing, and the symptoms look very much like https://bugzilla.redhat.com/show_bug.cgi?id=1529231, so I'm going to close this as a duplicate for now.
*** This bug has been marked as a duplicate of bug 1529231 ***
Description of problem:
Guest hangs after save, restore guest with 280 vcpus
Version-Release number of selected component (if applicable):
libvirt-3.2.0-14.el7_4.3.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.5.x86_64
kernel: 3.10.0-693.2.1.el7.x86_64
rhel7.4 guest kernel: 3.10.0-693.el7.x86_64
How reproducible:
100%
Steps to Reproduce:
1. Start guest with 280 vcpus with xml below:
<vcpu placement='static'>280</vcpu>
......
<os>
<type arch='x86_64' machine='pc-q35-rhel7.4.0'>hvm</type>
</os>
......
<cpu mode='host-model'>
<model fallback='allow'/>
</cpu>
......
<iommu model='intel'>
<driver intremap='on' eim='on'/>
</iommu>
</devices>
......
2. Log in to the guest and check that there are 280 vcpus
# cat /proc/cpuinfo |grep processor|wc -l
280
3. Save and restore the guest
# virsh save r7-4t r7-4t.save
Domain r7-4t saved to r7-4t.save
# virsh restore r7-4t.save
Domain restored from r7-4t.save
4. Try to log in to the guest; the virsh console hangs.
# virsh console r7-4t
Connected to domain r7-4t
Escape character is ^]
[ 367.342608] INFO: task xfsaild/dm-0:2076 blocked for more than 120 seconds.
[ 367.343583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
1) Sometimes these error messages appear in the console after save and restore of a guest with 261 cpus:
[ 69.647147] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kworker/u538:5:1944]
[ 97.647211] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kworker/u538:5:1944]
[ 104.278141] INFO: rcu_sched self-detected stall on CPU[ 104.280038] INFO: rcu_sched detected stalls on CPUs/tasks:
2) Try to log in to the guest through virt-manager; the login user name cannot be entered.
5. Change the vcpu count to 248, then redo steps 1-4. In step 4, login to the guest succeeds, and the guest shows 248 cpus.
Actual result:
In step 4, after save and restore of a guest with 280 cpus, login to the guest fails, or operations in the guest hang (260<vcpu<263).
Expected result:
In step 4, after save and restore of a guest with cpu<=384, login to the guest succeeds and operations complete successfully.
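For repeated testing across different vcpu counts, the save and restore commands from step 3 could be wrapped in a small script. This is a sketch under assumptions: `save_restore_commands` and `save_restore` are hypothetical helper names, the domain and file names are the ones used in this report, and the only external commands invoked are the `virsh save` and `virsh restore` calls shown in step 3.

```python
import subprocess


def save_restore_commands(domain, save_file):
    """Build the two virsh invocations used in step 3 of the report."""
    return [
        ["virsh", "save", domain, save_file],
        ["virsh", "restore", save_file],
    ]


def save_restore(domain, save_file, runner=subprocess.run):
    """Run save, then restore. With the default runner, a non-zero exit
    status from either command raises subprocess.CalledProcessError."""
    for cmd in save_restore_commands(domain, save_file):
        runner(cmd, check=True)
```

Calling `save_restore("r7-4t", "r7-4t.save")` on the host would repeat step 3; the `runner` parameter exists only so the command sequence can be inspected without a libvirt host.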