Bug 1485208

Summary:	Guest hangs after save, restore guest with 280 vcpus
Product:	Red Hat Enterprise Linux 8	Reporter:	chhu
Component:	qemu-kvm	Assignee:	Vitaly Kuznetsov <vkuznets>
Status:	CLOSED DUPLICATE	QA Contact:	FuXiangChun <xfu>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	8.0	CC:	chayang, dyuan, hhuang, jinzhao, juzhang, knoel, lhuang, rbalakri, ribarry, virt-maint, vkuznets, xfu, xuzhang, yafu, yalzhang, yanqzhan, zhguo, zpeng
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Last Closed:	2019-10-02 12:38:41 UTC	Type:	Bug

Description chhu 2017-08-25 05:59:40 UTC

Description of problem:
Guest with 280 vCPUs hangs after save and restore.

Version-Release number of selected component (if applicable):
libvirt-3.2.0-14.el7_4.3.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.5.x86_64
kernel: 3.10.0-693.2.1.el7.x86_64
rhel7.4 Guest kernel: 3.10.0-693.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a guest with 280 vCPUs using the XML below:
  <vcpu placement='static'>280</vcpu>
......
  <os>
    <type arch='x86_64' machine='pc-q35-rhel7.4.0'>hvm</type>
  </os>
......
  <cpu mode='host-model'>
   <model fallback='allow'/>
  </cpu>
......
    <iommu model='intel'>
      <driver intremap='on' eim='on'/>
    </iommu>
  </devices>
......

2. Log in to the guest and check that there are 280 vCPUs:
# cat /proc/cpuinfo |grep processor|wc -l
280

3. Save and restore the guest
# virsh save r7-4t r7-4t.save
Domain r7-4t saved to r7-4t.save
# virsh restore r7-4t.save
Domain restored from r7-4t.save

4. Try to log in to the guest; the virsh console hangs.
# virsh console r7-4t
Connected to domain r7-4t
Escape character is ^]
[  367.342608] INFO: task xfsaild/dm-0:2076 blocked for more than 120 seconds.
[  367.343583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

1) Sometimes these error messages appear on the console after saving and restoring a guest with 261 vCPUs:
[   69.647147] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kworker/u538:5:1944]
[   97.647211] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kworker/u538:5:1944]
[  104.278141] INFO: rcu_sched self-detected stall on CPU[  104.280038] INFO: rcu_sched detected stalls on CPUs/tasks:

2) When trying to log in to the guest through virt-manager, the login user name cannot be entered.

5. Change the vCPU count to 248, then repeat steps 1-4.
In step 4, login to the guest succeeds, and the guest reports 248 CPUs.

Actual result:
In step 4, after saving and restoring a guest with 280 vCPUs, login to the guest fails, or operations in the guest hang (260 < vCPUs < 263).

Expected result:
In step 4, after saving and restoring a guest with vCPUs <= 384, login to the guest succeeds and operations in the guest complete successfully.
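
The save and restore cycle from step 3 can be scripted when retesting with different vCPU counts. The following is a minimal Python sketch and not part of the original report: the helper names `save_restore_commands` and `run_cycle` are hypothetical, and it assumes `virsh` is on the PATH and the domain is already running. Only the `virsh save` and `virsh restore` invocations are taken from the steps above.

```python
import subprocess


def save_restore_commands(domain, save_path):
    """Return the two virsh command lines from step 3, in order.

    Hypothetical helper: `domain` and `save_path` are chosen by the caller.
    """
    return [
        ["virsh", "save", domain, save_path],
        ["virsh", "restore", save_path],
    ]


def run_cycle(domain, save_path, runner=subprocess.run):
    """Run save, then restore, stopping at the first failing command.

    `runner` defaults to subprocess.run; it is a parameter (an assumption of
    this sketch) so the sequence can be exercised without a libvirt host.
    """
    for cmd in save_restore_commands(domain, save_path):
        # check=True makes subprocess.run raise CalledProcessError on failure.
        runner(cmd, check=True)
```

Calling `run_cycle("r7-4t", "r7-4t.save")` would issue the same two commands as in step 3. Whether the guest is still responsive afterwards has to be checked separately, for example on the console as in step 4.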

Comment 2 Martin Kletzander 2017-10-10 12:17:11 UTC

It seems that another underlying issue is causing this, so I'm moving this to QEMU for further triage.

Comment 3 chhu 2018-04-13 08:31:49 UTC

Reproduced this bug with the following packages:

libvirt-3.9.0-14.el7_5.2.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64
kernel: 3.10.0-861.el7.x86_64
guest kernel: 3.10.0-862.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a guest with 280 vCPUs using the XML below:
  <vcpu placement='static'>280</vcpu>
......
  <os>
    <type arch='x86_64' machine='pc-q35-rhel7.5.0'>hvm</type>
  </os>
......
  <cpu mode='host-model'>
   <model fallback='allow'/>
  </cpu>
......
    <iommu model='intel'>
      <driver intremap='on' eim='on'/>
    </iommu>
  </devices>
......

2. Log in to the guest and check that there are 280 vCPUs:
# cat /proc/cpuinfo |grep processor|wc -l
280

3. Save and restore the guest
# virsh save r7 r7.save
Domain r7 saved to r7.save

# virsh restore r7.save
Domain restored from r7.save

4. Try to log in to the guest; the virsh console hangs.

# virsh console r7
Connected to domain r7
Escape character is ^]
[   70.972775] NMI watchdog: BUG: soft lockup - CPU#35 stuck for 23s! [kworker/u560:4:1843]
[   98.972780] NMI watchdog: BUG: soft lockup - CPU#35 stuck for 23s! [kworker/u560:4:1843]
[  106.333777] INFO: rcu_sched self-detected stall on CPU[  106.336867] INFO: rcu_sched detected stalls on CPUs/tasks:

5. When trying to log in to the guest through virt-manager, the login user name cannot be entered.
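
The three kinds of console message quoted in this report (hung task, soft lockup, RCU stall) can be counted in captured console output to compare runs at different vCPU counts. This is a rough Python sketch, not something used in the original report: the names `PATTERNS` and `count_symptoms` are hypothetical, and the regular expressions are derived only from the message text shown above, so other kernel versions may word these messages differently.

```python
import re

# Patterns derived from the console messages quoted in this report.
# Assumption: the task name in the hung-task message contains no spaces.
PATTERNS = {
    "hung_task": re.compile(r"INFO: task \S+ blocked for more than \d+ seconds"),
    "soft_lockup": re.compile(r"BUG: soft lockup - CPU#\d+ stuck for \d+s"),
    "rcu_stall": re.compile(
        r"INFO: rcu_sched (?:self-detected stall|detected stalls)"
    ),
}


def count_symptoms(console_text):
    """Count how often each symptom appears in captured console output.

    findall is used rather than a per-line match because the console can
    interleave two messages on one line, as in the rcu_sched output above.
    """
    return {name: len(rx.findall(console_text)) for name, rx in PATTERNS.items()}
```

Applied to the three console lines shown in step 4 of this comment, the function would report two soft lockups, two RCU stall messages, and no hung-task message.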

Comment 4 Radim Krčmář 2018-12-21 18:01:56 UTC

Large numbers of VCPUs require the q35 machine type, which is a tech preview in RHEL7; moving to RHEL8.

Comment 7 Vitaly Kuznetsov 2019-10-02 12:38:41 UTC

Save-restore and migration are the same thing, and the symptoms look very much like
https://bugzilla.redhat.com/show_bug.cgi?id=1529231

I'm going to close this as a duplicate for now.

*** This bug has been marked as a duplicate of bug 1529231 ***