Bug 2053584
| Summary: | watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [cat:2843] | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Li Xiaohui <xiaohli> |
| Component: | qemu-kvm | Assignee: | Igor Mammedov <imammedo> |
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | ailan, chayang, dgilbert, fjin, imammedo, jinzhao, leobras, mdean, meili, nanliu, peterx, pvlasin, quintela, virt-maint, ymankad |
| Version: | 9.0 | Keywords: | Regression, Triaged |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-6.2.0-11.el9_0.2 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2065398 (view as bug list) | Environment: | |
| Last Closed: | 2022-05-17 12:25:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | qemu-7.0 |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2065398 | | |
Description
Li Xiaohui
2022-02-11 14:40:55 UTC
I wonder if this is the same problem as 2053526 - they're both failures with hotplug and migration, just different devices.

Again, can we have 'info pci' from both the source (after hotplug) and the destination.

(In reply to Dr. David Alan Gilbert from comment #1)
> I wonder if this is the same problem as 2053526 - they're both failures with
> hotplug and migration, just different devices.

I'm not sure. Note that I tried this bug again; the guest seems to have crashed, because I failed to reboot it after hitting the error during data transfer.

> Again, can we have 'info pci' from both the source (after hotplug) and the
> destination.

I have added them as attachments.

The diff of the pci info on the source (after hotplugging) and the destination (after migration):

    < BAR1: 32 bit memory at 0xfd600000 [0xfd600fff].
    < BAR4: 64 bit prefetchable memory at 0xfb200000 [0xfb203fff].
    ---
    > BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
    > BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].

Yes, I think it is probably the same bug; the BARs on the destination are unprogrammed, but all the bus addressing matches correctly, so I don't think there's an obvious commandline error.

(In reply to Dr. David Alan Gilbert from comment #5)
> Yes I think it is probably the same bug; the BARs on the destination are
> unprogrammed; but all the bus addressing matches correctly,
> so I don't think there's an obvious commandline error.

It looks like the regression was introduced by a mix of fixes to PCIe and ACPI hotplug support. Bisection points to

    d5daff7d3126 pcie: implement slot power control for pcie root ports

Simplified steps to reproduce:

1. Start the VM with the following QEMU CLI and let the guest OS boot completely (use a q35 6.2 based machine type with ACPI PCI hotplug enabled by default):

    -M q35 -device pcie-root-port,port=0x20,chassis=21,id=extra_root0,bus=pcie.0,addr=0x3 -monitor stdio ...

2. Hotplug a PCI device at the monitor prompt (other means should also work):

    (qemu) device_add virtio-serial-pci,id=virtio-serial0,max_ports=31,bus=extra_root0

3. Check that the hotplugged device is initialized:

    (qemu) info pci
    ...
    Bus 1, device 0, function 0:
      Class 1920: PCI device 1af4:1043
      PCI subsystem 1af4:1100
      IRQ 0, pin A
      BAR1: 32 bit memory at 0xfe800000 [0xfe800fff].
      BAR4: 64 bit prefetchable memory at 0xfe000000 [0xfe003fff].
      id "virtio-serial0"
    ...

4. Migrate the VM to a file (live migration should also work):

    (qemu) migrate "exec:gzip -c > STATEFILE.gz"
    (qemu) quit

5. Restore the VM on the target QEMU with the following CLI:

    -M q35 -device pcie-root-port,port=0x20,chassis=21,id=extra_root0,bus=pcie.0,addr=0x3 -monitor stdio \
    -device virtio-serial-pci,id=virtio-serial0,max_ports=31,bus=extra_root0 -incoming "exec: gzip -c -d STATEFILE.gz" \
    ...

6. Check that the hotplugged device BARs are the same as in step 3:

    (qemu) info pci
    ...
    Bus 1, device 0, function 0:
      Class 1920: PCI device 1af4:1043
      PCI subsystem 1af4:1100
      IRQ 0, pin A
      BAR1: 32 bit memory at 0xfe800000 [0xfe800fff].
      BAR4: 64 bit prefetchable memory at 0xfe000000 [0xfe003fff].
      id "virtio-serial0"
    ...

*** Bug 2053526 has been marked as a duplicate of this bug. ***

According to Comment 6, I would add the 'Regression' keyword. Thanks.

Hi Igor,

Shall we add exception+ for this bug and fix it on rhel9.0.0, since it blocks hotplug + migration scenarios?

Thanks for the reminder, I've just requested an exception for it.

Li Xiaohui,

can you verify it, please?

scratch build:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43641393
http://download.eng.bos.redhat.com/brewroot/work/tasks/1393/43641393/qemu-kvm-core-6.2.0-11.el9_0.imammedo202203080850.x86_64.rpm

(In reply to Igor Mammedov from comment #13)
> Li Xiaohui,
>
> can you verify it, please?

I have downloaded the scratch build and will verify it later. Thanks.

> scratch build:
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=43641393
> http://download.eng.bos.redhat.com/brewroot/work/tasks/1393/43641393/qemu-kvm-core-6.2.0-11.el9_0.imammedo202203080850.x86_64.rpm

Hi Igor,

I have tried the scratch build on hosts (kernel-5.14.0-70.el9.x86_64 & qemu-img-6.2.0-11.el9_0.imammedo202203080850.x86_64). The tests pass, so the build should fix this bug.

I tested the following cases, and all pass:

--> Running case(1/7): RHEL7-96931-[migration] Migration after hot-plug virtio-serial (3 min 48 sec)--- PASS.
--> Running case(2/7): RHEL7-10039-[migration] Do migration after hot plug vdisk (2 min 48 sec)--- PASS.
--> Running case(3/7): RHEL7-10040-[migration] Do migration after hot remove vdisk (4 min 36 sec)--- PASS.
--> Running case(4/7): RHEL7-10078-[migration] Migrate guest after hot plug/unplug memory balloon device (6 min 20 sec)--- PASS.
--> Running case(5/7): RHEL7-10079-[migration] Migrate guest after cpu hotplug/hotunplug in guest (RHEL only) (3 min 0 sec)--- PASS.
--> Running case(6/7): RHEL7-10047-[migration] Ping-pong live migration with large vcpu and memory values of guest (6 min 4 sec)--- PASS.
--> Running case(7/7): RHEL-178709-[migration] Basic migration test (4 min 24 sec)--- PASS.

BTW, I also repeated RHEL7-96931 & RHEL7-10039 above 5 times, checking the pci info on the source (after hotplugging) and the destination host (after migration). They all work well, with no difference in the pci info.

QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Verified this bug on kernel-5.14.0-70.3.1.el9_0.x86_64 & qemu-kvm-6.2.0-11.el9_0.2.x86_64. Ran the same test scenarios as in Comment 15, and the cases pass.

Marking this bug as verified per the test results.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (new packages: qemu-kvm), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2307
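
The check used throughout this report (capture 'info pci' on the source after hotplug and on the destination after migration, then compare the BARs) could be scripted roughly as follows. This is only a minimal sketch, not part of the fix or of any QE tooling: the helper names (parse_info_pci, is_unprogrammed, compare_bars) are hypothetical, the parser assumes the 'info pci' text layout quoted in the comments above, and treating an all-ones base address as "unprogrammed" is a heuristic taken from the diff in this bug.

```python
import re

# Patterns assume the 'info pci' layout quoted in this report.
BUS_RE = re.compile(r"Bus\s+(\d+),\s+device\s+(\d+),\s+function\s+(\d+):")
BAR_RE = re.compile(r"BAR(\d+):.*?\bat\s+(0x[0-9a-fA-F]+)\s+\[(0x[0-9a-fA-F]+)\]")


def parse_info_pci(text):
    """Return {(bus, device, function): {bar_index: (base, limit)}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        m = BUS_RE.search(line)
        if m:
            current = tuple(int(g) for g in m.groups())
            devices[current] = {}
            continue
        m = BAR_RE.search(line)
        if m and current is not None:
            devices[current][int(m.group(1))] = (
                int(m.group(2), 16),
                int(m.group(3), 16),
            )
    return devices


def is_unprogrammed(base):
    """Heuristic (assumption): an all-ones base means the BAR was never set."""
    return base in (0xFFFFFFFF, 0xFFFFFFFFFFFFFFFF)


def compare_bars(src_text, dst_text):
    """List (bdf, bar_index, src_range, dst_range) for every BAR that differs."""
    src = parse_info_pci(src_text)
    dst = parse_info_pci(dst_text)
    problems = []
    for bdf, bars in sorted(src.items()):
        for idx, src_range in sorted(bars.items()):
            dst_range = dst.get(bdf, {}).get(idx)
            if dst_range != src_range:
                problems.append((bdf, idx, src_range, dst_range))
    return problems


# Sample input built from the BAR values in the diff quoted in this report.
SOURCE_SAMPLE = """\
Bus 1, device 0, function 0:
  BAR1: 32 bit memory at 0xfd600000 [0xfd600fff].
  BAR4: 64 bit prefetchable memory at 0xfb200000 [0xfb203fff].
"""
DEST_SAMPLE = """\
Bus 1, device 0, function 0:
  BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
  BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
"""

for bdf, idx, src_range, dst_range in compare_bars(SOURCE_SAMPLE, DEST_SAMPLE):
    note = ""
    if dst_range is not None and is_unprogrammed(dst_range[0]):
        note = " (destination BAR looks unprogrammed)"
    print(f"{bdf} BAR{idx}: source={src_range} destination={dst_range}{note}")
```

In practice the two text inputs would come from the monitor output saved on each host; on a fixed build, compare_bars would be expected to return an empty list for the hotplugged device.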