Bug 1485208 - Guest hangs after save, restore guest with 280 vcpus
Summary: Guest hangs after save, restore guest with 280 vcpus
Keywords:
Status: CLOSED DUPLICATE of bug 1529231
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: qemu-kvm
Version: 8.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Vitaly Kuznetsov
QA Contact: FuXiangChun
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-25 05:59 UTC by chhu
Modified: 2019-10-02 12:38 UTC (History)
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-02 12:38:41 UTC
Type: Bug
Target Upstream Version:


Description chhu 2017-08-25 05:59:40 UTC
Description of problem:
The guest hangs after saving and restoring a guest with 280 vCPUs.

Version-Release number of selected component (if applicable):
libvirt-3.2.0-14.el7_4.3.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.5.x86_64
kernel: 3.10.0-693.2.1.el7.x86_64
rhel7.4 Guest kernel: 3.10.0-693.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a guest with 280 vCPUs using the XML below:
  <vcpu placement='static'>280</vcpu>
......
  <os>
    <type arch='x86_64' machine='pc-q35-rhel7.4.0'>hvm</type>
  </os>
......
  <cpu mode='host-model'>
   <model fallback='allow'/>
  </cpu>
......
    <iommu model='intel'>
      <driver intremap='on' eim='on'/>
    </iommu>
  </devices>
......

2. Log in to the guest and check that there are 280 vCPUs
# cat /proc/cpuinfo |grep processor|wc -l
280

3. Save and restore the guest
# virsh save r7-4t r7-4t.save
Domain r7-4t saved to r7-4t.save
# virsh restore r7-4t.save
Domain restored from r7-4t.save

4. Try to log in to the guest; the virsh console hangs.
# virsh console r7-4t
Connected to domain r7-4t
Escape character is ^]
[  367.342608] INFO: task xfsaild/dm-0:2076 blocked for more than 120 seconds.
[  367.343583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

1) Sometimes these error messages appear in the console when saving and restoring a guest with 261 vCPUs:
[   69.647147] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kworker/u538:5:1944]
[   97.647211] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kworker/u538:5:1944]
[  104.278141] INFO: rcu_sched self-detected stall on CPU[  104.280038] INFO: rcu_sched detected stalls on CPUs/tasks:

2) When trying to log in to the guest through virt-manager, the login user name cannot be entered.

5. Change the vCPU count to 248, then redo steps 1-4.
In step 4, login to the guest succeeds, and the guest shows 248 vCPUs.
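
The save/restore cycle in the steps above can be sketched as a small shell helper. This is only an illustration: the function names and the default save-file name are assumptions, not part of the original report. It uses only `virsh save`, `virsh restore` and `virsh domstate`.

```shell
#!/bin/sh
# Sketch of the save/restore cycle from the reproduction steps.
# Function names and the default save-file name are illustrative assumptions.

# count_vcpus [FILE]: count "processor" entries in a cpuinfo-style file.
# Run inside the guest; defaults to /proc/cpuinfo.
count_vcpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# save_restore DOMAIN [SAVEFILE]: save the domain to a file, then restore it.
# Returns non-zero if either virsh call fails.
save_restore() {
    domain=$1
    savefile=${2:-"$domain.save"}
    virsh save "$domain" "$savefile" || return 1
    virsh restore "$savefile" || return 1
    # Print the domain state after restore. Note that a hung guest may
    # still be reported as "running", so this is not a hang detector.
    virsh domstate "$domain"
}
```

For example, `save_restore r7-4t r7-4t.save` on the host, followed by `count_vcpus` inside the guest, mirrors steps 2-3; whether the guest is actually responsive still has to be checked through the console.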

Actual result:
In step 4, after saving and restoring a guest with 280 vCPUs, login to the guest fails, or operations in the guest hang (260 < vCPUs < 263).

Expected result:
In step 4, after saving and restoring a guest with vCPUs <= 384, login to the guest and operations in the guest succeed.

Comment 2 Martin Kletzander 2017-10-10 12:17:11 UTC
It seems another underlying issue is causing this, so I'm moving this to QEMU for further triage.

Comment 3 chhu 2018-04-13 08:31:49 UTC
Reproduced this bug on packages:

libvirt-3.9.0-14.el7_5.2.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64
kernel: 3.10.0-861.el7.x86_64
guest kernel: 3.10.0-862.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a guest with 280 vCPUs using the XML below:
  <vcpu placement='static'>280</vcpu>
......
  <os>
    <type arch='x86_64' machine='pc-q35-rhel7.5.0'>hvm</type>
  </os>
......
  <cpu mode='host-model'>
   <model fallback='allow'/>
  </cpu>
......
    <iommu model='intel'>
      <driver intremap='on' eim='on'/>
    </iommu>
  </devices>
......

2. Log in to the guest and check that there are 280 vCPUs
# cat /proc/cpuinfo |grep processor|wc -l
280

3. Save and restore the guest
# virsh save r7 r7.save
Domain r7 saved to r7.save

# virsh restore r7.save
Domain restored from r7.save

4. Try to log in to the guest; the virsh console hangs.

# virsh console r7
Connected to domain r7
Escape character is ^]
[   70.972775] NMI watchdog: BUG: soft lockup - CPU#35 stuck for 23s! [kworker/u560:4:1843]
[   98.972780] NMI watchdog: BUG: soft lockup - CPU#35 stuck for 23s! [kworker/u560:4:1843]
[  106.333777] INFO: rcu_sched self-detected stall on CPU[  106.336867] INFO: rcu_sched detected stalls on CPUs/tasks:

5. When trying to log in to the guest through virt-manager, the login user name cannot be entered.

Comment 4 Radim Krčmář 2018-12-21 18:01:56 UTC
Large numbers of vCPUs require the q35 machine type, which is a tech preview in RHEL7; moving to RHEL8.

Comment 7 Vitaly Kuznetsov 2019-10-02 12:38:41 UTC
Save-restore and migration are the same thing, and the symptoms look very much like
https://bugzilla.redhat.com/show_bug.cgi?id=1529231

I'm going to close this as duplicate for now.

*** This bug has been marked as a duplicate of bug 1529231 ***

