Bug 2084442
Summary: | watchdog: BUG: soft lockup - CPU# stuck for ... - no watchdog action
---|---
Product: | Red Hat Enterprise Linux 9
Component: | qemu-kvm
qemu-kvm sub component: | PCI
Version: | CentOS Stream
Hardware: | x86_64
OS: | Linux
Status: | CLOSED NOTABUG
Severity: | high
Priority: | high
Keywords: | Triaged
Target Milestone: | rc
Target Release: | 9.3
Reporter: | lejeczek <peljasz>
Assignee: | Michael S. Tsirkin <mst>
QA Contact: | Yiqian Wei <yiwei>
CC: | ailan, berrange, bstinson, chayang, coli, jinzhao, juzhang, jwboyer, mkletzan, mrezanin, mst, nilal, virt-maint, xiaohli, yiwei, ymankad
Bug Depends On: | 2180898
Type: | Bug
Last Closed: | 2023-09-05 13:29:50 UTC
Description
lejeczek
2022-05-12 07:21:06 UTC
I did not reproduce this bug on a RHEL 9.1.0 host.

src and dst host version:
qemu-kvm-7.0.0-1.el9.x86_64
kernel-5.14.0-92.el9.x86_64
edk2-ovmf-20220221gitb24306f15d-1.el9.noarch

guest: rhel9.1.0

Test steps:
1. On the src end, boot a rhel9.1.0 guest with "-device i6300esb", qemu cli [1]
2. On the dst end, boot a rhel9.1.0 guest with qemu cli [1] and append '-incoming defer'
3. Migrate the vm from src to dst
   dst qmp: {"execute": "migrate-incoming", "arguments": {"uri": "tcp:[::]:4000"}}
   src qmp: {"execute": "migrate", "arguments": {"uri": "tcp:$dst_host_ip:4000"}}

Additional info:
1. Did not reproduce this bug with q35 + seabios
2. qemu cli [1]:
/usr/libexec/qemu-kvm \
 -name 'avocado-vt-vm1' \
 -sandbox on \
 -blockdev node-name=file_ovmf_code,driver=file,filename=/usr/share/edk2/ovmf/OVMF_CODE.secboot.fd,auto-read-only=on,discard=unmap \
 -blockdev node-name=drive_ovmf_code,driver=raw,read-only=on,file=file_ovmf_code \
 -blockdev node-name=file_ovmf_vars,driver=file,filename=/mnt/yiwei/OVMF_VARS.fd,auto-read-only=on,discard=unmap \
 -blockdev node-name=drive_ovmf_vars,driver=raw,read-only=off,file=file_ovmf_vars \
 -machine q35,memory-backend=mem-machine_mem,pflash0=drive_ovmf_code,pflash1=drive_ovmf_vars \
 -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
 -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
 -device i6300esb,id=wdt0,bus=pcie-pci-bridge-0 \
 -watchdog-action reset \
 -nodefaults \
 -device VGA,bus=pcie.0,addr=0x2 \
 -m 12G \
 -object memory-backend-ram,size=12G,id=mem-machine_mem \
 -smp 10,maxcpus=10,cores=5,threads=1,sockets=2 \
 -cpu IvyBridge,enforce \
 -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
 -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
 -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 \
 -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/mnt/yiwei/rhel9.1-ovmf.qcow2,cache.direct=on,cache.no-flush=off \
 -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
 -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
 -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
 -device virtio-net-pci,mac=9a:df:ca:53:c2:69,id=idz43iXV,netdev=idPOEPyA,bus=pcie-root-port-3,addr=0x0 \
 -netdev tap,id=idPOEPyA,vhost=on \
 -vnc :0 \
 -rtc base=utc,clock=host,driftfix=slew \
 -boot menu=off,order=cdn,once=c,strict=off \
 -enable-kvm \
 -monitor stdio \
 -qmp tcp:0:4444,server=on,wait=off \

I realize I might have made it a bit vague, I'll try again:
...live migration is not the problem I'm reporting here.
I think the problem here is a faulty/malfunctioning "watchdog": if somehow, in whatever manner, you can cause "watchdog: BUG: soft lockup..." so that the VM becomes non-responsive (I've also noticed that the host reports a high CPU load for such a "broken" VM at that time), you will see no action from the watchdog - and I believe action should happen.

Hi lejeczek,

I didn't reproduce this bug when running stress in the guest. How did you reproduce it? Could you provide the steps and command line to reproduce this bug?

Thanks,
yiqian
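A quick way to sanity-check that the emulated i6300esb watchdog fires at all, independent of any soft lockup, is to arm it inside the guest and deliberately stop petting it. This is a minimal sketch, assuming the guest ships the i6300esb driver and util-linux wdctl, that the device shows up as /dev/watchdog0, and that no watchdog daemon is already petting it:

# inside the guest; module name and /dev/watchdog0 path are assumptions
modprobe i6300esb
wdctl /dev/watchdog0        # show the emulated device and its timeout
exec 3> /dev/watchdog0      # opening the device arms the timer
sleep 90                    # nothing pets it, so with "-watchdog-action reset" the VM should reset

If the guest resets in this test but not during a soft lockup, that points at the lockup situation rather than the watchdog emulation itself.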
To shed a bit more light on my systems: the VMs' qcow2s are stored on GlusterFS vols (say, a 3 node/peer cluster should do) - those gluster vols are autofs-mounted (since RHEL, inexplicably to me, removed libgfapi from qemu), so qemu/libvirt access those qcow2s via FUSE.

To stress such a setup out, I'd imagine live-migration of a single VM will not do; instead, a "mass" live-migration is when the issue happens, say:
- node1 already has a few VMs up & running and you migrate a few more VMs in fast succession to node1 from a nodeX
- if you were to add HA/pacemaker to the equation and let such an ha-cluster manage your VMs, then that will also allow GlusterFS to be involved (though it can be done without, manually)
- reboot one such ha-cluster node, so that then:
  a) live-migration will take place
  b) the gluster vol will also be healing

It might not happen every time, but when it (something) does happen, you should get quite a few! (migrated) VMs being "soft lockup-ed" and then... the watchdog "issue".

Hardware resources will most likely matter very much, so if you have big CPUs and lots of resources then that will probably not be good, not helpful. The smaller the systems the better - easier to stress out. I test all this with mid-shelf Ryzens.

thanks, L

With my limited understanding I looked at this issue and it seems the message you are getting is not from a watchdog device, but from the Linux kernel lockup detector. If a CPU does not get enough execution time, for example when overcommitting the host, then a soft lockup is detected and, based on the configuration and settings, it can trigger a kernel panic. I think the only two options here are to either make sure that the host is not overcommitted cpu-wise, or to disable the lockup detection via cmdline or sysctl.

Oh sorry, I misread the description. You are expecting the watchdog device to reset the VM once such a lockup happens. How long have you tried waiting after the soft lockup message? Is it possible that QEMU does not get enough cpu time so that it can emulate the watchdog? If yes, then I suspect QEMU might actually be at fault here.

Since there has been no resolution planned for this issue and we are at the point in the release where we need to limit risk and change, I'm removing this from the current release in order to have the work properly planned for some future release.

(In reply to Martin Kletzander from comment #7)
> Oh sorry, I misread the description. You are expecting the watchdog device
> to reset the VM once such lockup happens. How long have you tried waiting
> after the soft lockup message? Is it possible that QEMU does not get enough
> cpu time so that it can emulate the watchdog? If yes, then I suspect QEMU
> might actually be at fault here.

This is a 10 CPU guest. Any one of those CPUs could potentially pet the watchdog and keep it from firing, so I don't think it is indicative of a watchdog bug unless we can demonstrate that all 10 CPUs are fully non-responsive.

A hardware watchdog alone is not sufficient to detect all potential problems which lead to non-responsive application services. You need, in addition, external application-level liveness probes, and to be willing to fence the VM if they fail to respond.

In the Machine & PCI team meeting today, Michael commented that the fix for this bug is already upstream, and it should come as part of the QEMU rebase for 9.3. However, it is a high-risk change, since it could expose qemu or guest bugs leading to unexpected resets.
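For reference, the "cmdline or sysctl" route mentioned above refers to the kernel's soft-lockup detector knobs; a minimal sketch (the values are examples only):

# inside the affected guest
sysctl -w kernel.watchdog_thresh=30   # raise the detection threshold (soft lockup reports at 2x this value)
sysctl -w kernel.soft_watchdog=0      # or disable the soft-lockup detector entirely
# equivalent kernel command line options: "watchdog_thresh=30" or "nosoftlockup"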
Marking this as TestOnly and adding a depends on the rebase BZ.

PS: Michael is having some Bugzilla access issues, so once that is sorted, he should be able to answer any questions.

QE bot (pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Hi lejeczek,

Could you provide the following information:
1. What is the number of host cpus for this bug?
2. How many VMs are booted and how many cpus are used by each vm?

Reproduce steps:

host version:
kernel-5.14.0-312.el9.x86_64
qemu-kvm-7.2.0-14.el9_2.x86_64
edk2-ovmf-20230301gitf80f052277c8-3.el9.noarch

guest: rhel9.3.0

reproduce steps:
1. Boot 8 guests with "-m 1G" and "-smp 4,sockets=1,cores=4,threads=1" on the host
# sh ovmf.sh ovmf 8
2. Check dmesg in each guest

test results:
please see the attachment: guest_dmesg.txt

host information:
1) memory
# free -h
              total        used        free      shared  buff/cache   available
Mem:          7.5Gi       3.1Gi       4.4Gi       6.0Mi       214Mi       4.3Gi
Swap:         7.8Gi        44Mi       7.8Gi
2) cpu
CPU(s):                4
On-line CPU(s) list:   0-3
Vendor ID:             GenuineIntel
BIOS Vendor ID:        Intel
Model name:            Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
BIOS Model name:       Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz
CPU family:            6
Model:                 60
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1

Hi Michael,

Please help check whether the above reproduction steps are right. Thanks a lot.

Created attachment 1965004 [details]
guest dmesg information
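Step 2 above ("check dmesg in each guest") amounts to grepping the guest kernel log for lockup and watchdog messages; a minimal sketch, run inside a guest:

dmesg -T | grep -Ei 'soft lockup|i6300esb|watchdog'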
Hi guys - the original reporter of the BZ here.

Since I reported the bug, the whole lot in my test-lab has changed - not to mention the obvious: the software stack got updated. Hardware-wise, perhaps the number of cores and RAM capacity are not different, but I replaced their "families" with younger ones, as well as the network's key parts.

Now... a year later, with all those changes and with the following versions:
libvirt-daemon-driver-qemu-9.0.0-7.el9.x86_64
ipxe-roms-qemu-20200823-9.git4bd064de.el9.noarch
qemu-img-8.0.0-1.el9.x86_64
qemu-kvm-common-8.0.0-1.el9.x86_64
qemu-kvm-core-8.0.0-1.el9.x86_64
glusterfs-server-11.0-1.el9s.x86_64

Hardware currently participating in the test-lab:
3 x Ryzen 3800 + 32GB ram + 10Gbps lan

Also with a bit of, back then, tweaking & testing of HA/pacemaker - in order to lower/control simultaneous resource (VirtualDomain) migration when a node is rebooted/shut down/stood by - now... I do not see the original: soft lockup - CPU#

And to try to clarify my original message - I might have done it better - yes, I reckoned that the issue was in the bare-metal <=> Qemu interaction: the watchdog in the VM did act as expected when the VM was "healthy"; it failed to act only (!) when the VM was tainted by a "soft lockup".

So, I'm afraid I'll not be able to provide you guys with any more concrete - concrete as in debug/trace - info, unless I get to see this very issue again.

many thanks, L.

Hi lejeczek,

Could you help to check Comment 20? I hit the "watchdog: BUG: soft lockup - CPU#2 stuck for 70s! [kexec:1895]" information in the guest.

Thanks,
Yiqian

Hi. Those dmesgs look familiar, similar. One certain thing I can say is that I too - like you do, I see - was (and still am) over-committing resources: the guests together were set to ask for more than the hosts themselves had the physical capacity for. (Which, I'd reckon, is commonplace.)

Can reproduce this bug with the fixed "qemu-kvm-8.0.0-3.el9.x86_64" version.

host version:
kernel-5.14.0-312.el9.x86_64
qemu-kvm-8.0.0-3.el9.x86_64
edk2-ovmf-20230301gitf80f052277c8-3.el9.noarch

guest: rhel9.3.0

The same test steps and results as Comment 20; it seems that the watchdog at least works as expected.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
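As an aside on the "lower/control simultaneous resource (VirtualDomain) migration" tuning mentioned earlier: with Pacemaker this can be done through the migration-limit cluster property; a minimal sketch using pcs (the value and node name are placeholders):

# allow at most one live migration to run in parallel per node
pcs property set migration-limit=1
# draining a node then migrates its VirtualDomain resources one at a time
pcs node standby node1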