Description of problem:
Hotunplug of a vcpu fails even when it is no longer the last online one.

Version-Release number of selected component (if applicable):
host:
kernel-4.18.0-266.el8.ppc64le
qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f.ppc64le
guest:
kernel-4.18.0-259.el8.ppc64le

How reproducible:
100%

Steps to Reproduce:
1. Boot up the guest with the command:
/usr/libexec/qemu-kvm \
-smp 1,maxcpus=2,cores=2,threads=1,sockets=1 \
-m 4096 \
-nodefaults \
-device virtio-scsi-pci,bus=pci.0 \
-device scsi-hd,id=scsi-hd0,drive=scsi-hd0-dr0,bootindex=0 \
-drive file=rhel840-ppc64le-virtio-scsi.qcow2,if=none,id=scsi-hd0-dr0,format=qcow2,cache=none \
-device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c4:e7:84 \
-netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on \
-chardev stdio,mux=on,id=serial_id_serial0,server,nowait,signal=off \
-device spapr-vty,id=serial111,chardev=serial_id_serial0 \
-mon chardev=serial_id_serial0,mode=readline \

2. Hotplug one vcpu:
(qemu) device_add host-spapr-cpu-core,core-id=1,id=core1

3. Disable vcpu0 in the guest:
chcpu -d 0

4. Hotunplug vcpu1:
(qemu) device_del core1

5. Re-enable vcpu0 in the guest:
chcpu -e 0

6. Hotunplug vcpu1 again:
(qemu) device_del core1

Actual results:
Hotunplug of vcpu1 fails.

Expected results:
vcpu1 can be hotunplugged, because it is no longer the last online vcpu.

Additional info:
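A suggested verification aid (not part of the original report, just standard Linux CPU hotplug tooling) is to check the guest's view of the online vcpus around steps 3 and 5:

# in the guest, after step 3 (chcpu -d 0): only the hotplugged vcpu should be listed
cat /sys/devices/system/cpu/online
# in the guest, after step 5 (chcpu -e 0): vcpu0 should be listed again
cat /sys/devices/system/cpu/online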
Xujun, can you confirm the “Hardware” field?
Xujun, can we get the actual error message that qemu produces when you try to unplug the core?
(In reply to Qunfang Zhang from comment #1)
> Xujun, can you confirm the “Hardware” field?

ppc only. x86 does not have this problem.
(In reply to David Gibson from comment #2)
> Xujun, can we get the actual error message that qemu produces when you try
> to unplug the core?

Step 4:
(qemu) device_del core1
(qemu) [ 169.162520] pseries-hotplug-cpu: Failed to offline CPU <NULL>, rc: -16

Step 6:
No message at all.
Daniel, can you take a look at this one please.
I am able to reproduce the bug using Power 8 and Power 9 servers. I'll investigate.
Xujun, is this a regression?
(In reply to David Gibson from comment #7)
> Xujun, is this a regression?

I'm not sure, I need to try. I will give feedback after testing.
(In reply to David Gibson from comment #7)
> Xujun, is this a regression?

Not a regression; I hit this problem with both the slow 8.4 train and the 8.3 fast train.
(In reply to Xujun Ma from comment #9)
> (In reply to David Gibson from comment #7)
> > Xujun, is this a regression?
>
> Not a regression; I hit this problem with both the slow 8.4 train and the
> 8.3 fast train.

In fact this also happens upstream. This case was never handled.

The reason it happens is that we are attempting to hotunplug the last online vcpu of the guest, and during the process we do things such as detaching the core DRC and so on. The guest refuses because it is the last online vcpu, so the 'unplug success' callback on the QEMU side is never called.

My solution is to check whether the CPU core is the last one online in the guest before attempting the hotunplug. The patches were posted upstream for review:

https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg03349.html
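As a side note, the condition those patches check can also be verified by hand from the guest before issuing the unplug. This is only an illustration of the idea using standard tooling, not the actual patch:

# in the guest: see which vcpus are currently online
cat /sys/devices/system/cpu/online
# if the core being removed holds the only online vcpu, online another one
# first (vcpu0 in the reproducer), otherwise the guest will refuse the unplug
chcpu -e 0
# then issue the unplug from the QEMU monitor
(qemu) device_del core1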
Hi Daniel,

Could you help estimate and set the right ITM for this bug?
(In reply to Xujun Ma from comment #11)
> Hi Daniel,
>
> Could you help estimate and set the right ITM for this bug?

What is ITM? If it's a Bugzilla flag, I don't appear to have access to it.

Regarding the bug, the fix is unfortunately not as trivial as I first suggested in comment #10. It turns out that the solution proposed in comment #10 is also flawed, because we have no guarantee that the guest will not offline a CPU in the middle of the unplug process, making our pre-unplug assumptions obsolete and prone to the same error.

Mailing list discussions led me to try another approach, in which I opened up the hotplug IRQ queue so that it fires on every 'device_del' attempt to remove the CPU core, regardless of whether a previous unplug request is pending. This has been disputed because it opens the possibility of an IRQ event flood in the guest kernel (although I wasn't able to make the guest misbehave by flooding it), but none of the alternatives for fixing this problem at the QEMU level are clear winners.

The discussions are still ongoing. Let's wait a bit to see where we're going with this.
Sorry Daniel, ITM is an internal scheduling thing; I'll look after it.

Honestly, we're getting a bit late to get a bugfix into AV-8.4, and given that this is not a regression, I'm not sure there's a compelling reason to push for it. So I'm going to punt this one back to 8.5.
There has been a lot of action on this bug, and not a lot of updates in this Bugzilla from my end.

Picking up from what I mentioned in comment #12, we went all the way to implementing a 'CPU hotunplug timeout' mechanism. That logic almost made it into 6.0.0. Further discussions on the mailing list, while evaluating a new QAPI event to report the timeout, led us to believe that the timeout mechanism isn't a good idea after all. Telling Libvirt that "a timeout happened, and perhaps something went wrong in the guest" doesn't help much: Libvirt would need to inspect the guest anyway to see whether the hotunplug succeeded or not. This code got reverted and we went with the logic I mentioned in comment #12, where we allow multiple CPU hotunplug requests for the same CPU.

All that said, up to that point we were operating under the assumption that the kernel does not provide a callback mechanism for hotunplug errors. This is about to change in kernel v5.13. I've proposed a way to use one of the existing RTAS calls to signal device removal errors from the kernel, starting with CPUs. I'm using a hypercall that is used in device configuration (RTAS set-indicator) to signal the platform/hypervisor that the kernel found an error while doing the device removal. This use of the hcall is a no-op in QEMU, and I checked with the partition firmware folks at IBM that it is also a no-op for phyp (PowerVM), so it's a viable way of doing it without breaking existing hypervisors.

The kernel patches were queued to powerpc-next [1]. I've also patched QEMU to handle this new kernel behavior, and David accepted it into his ppc-6.1 tree. This means we'll finally have some form of reliable hotunplug error callback mechanism in pSeries.

Going back to this bug, we can either go with the approach that will be available in QEMU 6.0.0 (allowing multiple hotunplug requests) or with the new mechanism I've implemented that requires code from kernel v5.13. The former will get the bug fixed faster via rebase and will not require kernel-side changes, so perhaps that approach is preferred here.
(In reply to Daniel Henrique Barboza from comment #14)
> The kernel patches were queued to powerpc-next [1]. I've also patched QEMU
> to handle this new kernel behavior, and David accepted it into his ppc-6.1
> tree. This means we'll finally have some form of reliable hotunplug error
> callback mechanism in pSeries.

I forgot the link:

[1] https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=239586&state=*
Xujun,

While Daniel and I have some more work to polish this in general, I think the fixes already committed to qemu-6.0 will already fix the original problem reported for this BZ. Specifically, the first device_del will still do nothing (without an explicit error), but the second one should now retry and succeed.

Can you please retest with the rebased qemu-6.0 based package?
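A retest sequence along the lines of the original reproducer should be enough; this is only a suggested sketch reusing the steps from the description, with the second device_del being the point to verify:

(qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
# in the guest: take vcpu0 offline so core1 holds the last online vcpu
chcpu -d 0
# first unplug attempt: still expected to be silently refused by the guest
(qemu) device_del core1
# in the guest: bring vcpu0 back online
chcpu -e 0
# second unplug attempt: expected to retry and succeed with qemu-6.0
(qemu) device_del core1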
(In reply to David Gibson from comment #16)
> Xujun,
>
> While Daniel and I have some more work to polish this in general, I think
> the fixes already committed to qemu-6.0 will already fix the original
> problem reported for this BZ. Specifically, the first device_del will still
> do nothing (without an explicit error), but the second one should now retry
> and succeed.
>
> Can you please retest with the rebased qemu-6.0 based package?

The problem no longer occurs with qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.ppc64le; the bug has been fixed.
Thanks, Xujun, for the confirmation.

David, can we close this bug as CURRENTRELEASE? Thanks.
Yes, closing as CURRENTRELEASE. Thanks for verifying this.