Created attachment 1768253 [details]
Test scripts and logs from run 3; reproduced the issue 3 times in 10 runs

Description of problem:

When testing virDomainBlockCommit[1] under high load, completing the block job with the VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT flag randomly fails with this error:

    libvirt.libvirtError: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete'

In the vdsm log we see:

1. Vdsm calls virDomainBlockCommit:

2021-03-21 01:07:10,679+0200 INFO (mailbox-hsm/1) [virt.livemerge] Starting merge with job_id='7963caa8-20d0-4975-8afc-f17f463dc9da', original chain=f0004ba9-dcb8-4614-aba8-2e11114ba9eb < 0dad6300-a10a-4a90-ba3c-aefcc537932e (top), disk='sda', base='sda[13]', top=None, bandwidth=0, flags=12 (livemerge:396)

2. Vdsm receives an event that the block job is ready:

2021-03-21 01:07:22,580+0200 INFO (libvirt/events) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Block job 7963caa8-20d0-4975-8afc-f17f463dc9da type ACTIVE_COMMIT for drive sda is ready (vm:5650)

3. Vdsm finds that the mirror job is ready:

2021-03-21 01:07:37,267+0200 INFO (periodic/1) [virt.livemerge] Job 7963caa8-20d0-4975-8afc-f17f463dc9da is ready for pivot (livemerge:626)

(Vdsm looks for the mirror ready="yes" attribute, once job cur == end.)

4. Vdsm tries to complete the job:

2021-03-21 01:07:53,511+0200 INFO (merge/7963caa8) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Requesting pivot to complete active layer commit (job 7963caa8-20d0-4975-8afc-f17f463dc9da) (livemerge:771)

Vdsm calls:

    flags = libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
    self.vm._dom.blockJobAbort(self.drive.name, flags=flags)

5. Vdsm receives a second event that the block job is ready?!

2021-03-21 01:07:56,120+0200 INFO (libvirt/events) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Block job 7963caa8-20d0-4975-8afc-f17f463dc9da type ACTIVE_COMMIT for drive sda is ready (vm:5650)

6. The call fails:

2021-03-21 01:07:56,122+0200 ERROR (merge/7963caa8) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Pivot failed (job: 7963caa8-20d0-4975-8afc-f17f463dc9da): Job 7963caa8-20d0-4975-8afc-f17f463dc9da failed with libvirt error: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete', aborting due to an unrecoverable error (livemerge:840)

In the libvirt log we see:

(Note: libvirt logs timestamps in UTC, while vdsm and all other logs on the host use local time, so timestamps must be converted when correlating them.)

1. Block job created:

2021-03-20 23:07:10.695+0000: 1740864: debug : qemuMonitorBlockCommit:3167 : device=libvirt-15-format, jobname=commit-sda-libvirt-15-format, persistjob=1, top=<null>, topNode=libvirt-15-format, base=<null>, baseNode=libvirt-13-format, backingName=<null>, bandwidth=0
2021-03-20 23:07:10.905+0000: 1798926: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f8e6c043730 event={"timestamp": {"seconds": 1616281630, "microseconds": 904561}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "commit-sda-libvirt-15-format"}}

2. Job changes state to "running":

2021-03-20 23:07:10.906+0000: 1798926: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f8e6c043730 event={"timestamp": {"seconds": 1616281630, "microseconds": 904654}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "commit-sda-libvirt-15-format"}}

3. Job changes state to "ready":

2021-03-20 23:07:22.579+0000: 1798926: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f8e6c043730 event={"timestamp": {"seconds": 1616281642, "microseconds": 579401}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "commit-sda-libvirt-15-format"}}

4. Job changes state to "standby":

2021-03-20 23:07:53.588+0000: 1798926: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f8e6c043730 event={"timestamp": {"seconds": 1616281673, "microseconds": 587046}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "commit-sda-libvirt-15-format"}}

(It seems that this happens exactly 30 seconds after the job becomes ready.)

5. Libvirt tries to complete the job:

2021-03-20 23:07:56.109+0000: 1740861: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f8e6c043730 msg={"execute":"job-complete","arguments":{"id":"commit-sda-libvirt-15-format"},"id":"libvirt-1300"} fd=-1

6. QEMU fails the request:

2021-03-20 23:07:56.115+0000: 1798926: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"id": "libvirt-1300", "error": {"class": "GenericError", "desc": "Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete'"}}]

This error happens randomly and only under very high load - it was hit using a stress test for the delete snapshot flow in oVirt. During the test we delete a snapshot with multiple disks (e.g. 16 disks), with up to 2G of data per disk. The test runs multiple vms creating and deleting snapshots. I did multiple runs with different setups:

1. Real server with 40 cores and fast FC storage, running 8 vms, 16 disks per vm with 2g of data to commit (up to 128 concurrent commits):
   1 failed, 540 passed, 19 errored in 62428 seconds

2. Virtualized environment (vms on my laptop) with 3 vms, 9 disks per vm, 3 disks with 2G data to commit (up to 27 concurrent commits):
   5 failed, 23 passed, 2 errored in 5805 seconds

3. Virtualized environment (vms on my laptop) with 1 vm, 16 disks, 16 disks with 2G data to commit (16 concurrent commits):
   3 failed, 7 passed, 0 errored in 7117 seconds

I think what happens is this:

1. When the system is overloaded, the time from receiving the event that the job is ready until vdsm tries to complete the job may be more than 30 seconds (31 seconds in this case).
2. QEMU moves the job from the "ready" state to the "standby" state after 30 seconds.
3. When vdsm tries to abort the job, libvirt tries to complete the job without handling the "standby" state.

This may be a libvirt bug (not handling the standby state), or a qemu bug (introducing a new state which is not compatible with existing users). Filing this for libvirt since the standby state seems to be pretty old based on the qemu source.

Version-Release number of selected component (if applicable):
libvirt-daemon-6.6.0-13.module+el8.3.1+9548+0a8fede5.x86_64
qemu-kvm-5.1.0-20.module+el8.3.1+9918+230f5c26.x86_64

How reproducible:
Random:
- easy to reproduce on a weak system with slow storage
- hard to reproduce on a real server with fast storage

Steps to Reproduce:
1. Perform concurrent block commit operations with a lot of data to commit, generating enough load that vdsm takes more than 30 seconds to respond to the block job ready event.
2. Repeat many times.

Actual results:
Random failures to complete the job.

Expected results:
Jobs complete successfully.

Additional info:
A pivot failure is very problematic in RHV and is likely to end in an expensive support case.
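A minimal sketch of a client-side workaround on affected versions (this is not vdsm's actual code, and the helper names are made up for illustration): retry the pivot when it fails because the job was transiently paused. Note that, as discussed later in this report, qemu reports a plain GenericError, so matching on the error is fragile; the exception type below stands in for that detection.

```python
import time

class TransientStandbyError(Exception):
    """Stand-in for the libvirtError raised when the job is in 'standby'."""

def pivot_with_retry(abort_job, retries=3, delay=1.0):
    """Call abort_job() (e.g. blockJobAbort with the PIVOT flag),
    retrying if the job was momentarily in 'standby'."""
    for attempt in range(retries):
        try:
            return abort_job()
        except TransientStandbyError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# Simulated run: the first call hits the race, the second succeeds.
calls = {"n": 0}

def fake_abort():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientStandbyError(
            "state 'standby' cannot accept command verb 'complete'")
    return "pivoted"

print(pivot_with_retry(fake_abort, delay=0.01))
```

This only papers over the race; the real fix belongs in qemu, as the following comments work out.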
Created attachment 1768255 [details]
Test scripts and logs from run 2; reproduced the issue 5 times in 30 runs
Created attachment 1768258 [details]
Test scripts and logs from run 1; reproduced the issue 1 time in 560 commits
Moving to qemu for now. Until now, once the "ready" state was emitted we were free to complete the job. Since the 'ready' state is announced to client applications via an event, it creates an expectation that completion will succeed as it used to. We have no provision for taking back the 'ready' state if qemu emits the standby state.
"standby" is the state that a job enters when it is paused while it's in the "ready" state. You can't complete a paused job without resuming it first.

In the log, it looks like immediately before the failing "job-complete" we have a "transaction" for block-dirty-bitmap-add. This calls bdrv_drain_all(), which in turn causes all jobs to pause for a moment, but they are immediately resumed again.

The problem could be that resuming the job might not be instantaneous, but just schedules the job coroutine (especially if the job is running in an iothread, as apparently it is in this case), while bdrv_drain_all_end() can return before the jobs have actually unpaused. So we run into a race condition where "job-complete" can be processed a little too early. If you check in the logs what happens immediately after the error, all the jobs return to "running"/"ready".

Max, a while ago you introduced the drained_end_counter, which I think is supposed to address exactly this kind of scenario. Do you think the right solution would be incrementing it in child_job_drained_end() and decrementing it again in job_pause_point()? If so, do we need to keep a list of drained_end_counters to be decremented later in the Job, or can we be sure that there will only ever be a single one?

On another note, bdrv_drain_all() before executing the transaction actions is probably too weak (it should be a bdrv_drain_all_begin/end section around the actions).

On yet another note, the overhead of a transaction is also completely unnecessary in this case because it consists of only a single action, so libvirt calling "block-dirty-bitmap-add" instead would make more sense (and sidestep this issue).
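To make the race concrete, here is a toy Python model (an assumption-laden sketch, not QEMU code): ending a drain only *schedules* the job's resume, so a "complete" arriving right after the drain can still see the job paused. The counter idea is modeled as: bump a counter when the resume is scheduled, drop it once the job has actually unpaused, and have the drain-ending caller poll until it reaches zero.

```python
import threading
import time

class Job:
    def __init__(self):
        self.paused = False
        self.drained_end_counter = 0
        self.lock = threading.Lock()

    def drain_begin(self):
        # Draining pauses the job (ready -> standby).
        with self.lock:
            self.paused = True

    def drain_end(self):
        # Resume is asynchronous: it is only scheduled here, like entering
        # the job coroutine when the job runs in an iothread.
        with self.lock:
            self.drained_end_counter += 1

        def resume():
            time.sleep(0.05)          # simulated scheduling delay
            with self.lock:
                self.paused = False
                self.drained_end_counter -= 1

        threading.Thread(target=resume).start()
        # The fix being discussed: poll until the job has really unpaused.
        # Without this loop, complete() below could race and fail.
        while True:
            with self.lock:
                if self.drained_end_counter == 0:
                    return
            time.sleep(0.001)

    def complete(self):
        with self.lock:
            if self.paused:
                raise RuntimeError("Job in state 'standby' cannot accept "
                                   "command verb 'complete'")
            return "done"

job = Job()
job.drain_begin()
job.drain_end()        # returns only after the job actually resumed
print(job.complete())
```

Removing the polling loop in drain_end() reintroduces the race: complete() would then run while paused is still True.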
(In reply to Kevin Wolf from comment #5)
> "standby" is the state that a job enters when it is paused while it's in the
> "ready" state. You can't complete a paused job without resuming it first.

The problem is that it's not an explicit, user-induced 'pause' of the job, and at the same time it's also automatically resumed. That potentially leads to a game of whack-a-mole where the job can blip back and forth and create race conditions between checking the state and executing the command (TOCTOU). Additionally, the reported error is a 'GenericError', so we can't really detect that it happened because of the 'standby' state and retry later.

[...]

> On yet another note, the overhead of a transaction is also completely
> unnecessary in this case because it consists only of a single action, so
> libvirt calling "block-dirty-bitmap-add" instead would make more sense (and
> sidestep this issue).

Well, we can detect whether only one action is to be executed and use the non-transaction command, but there can also be more than one member here if there are actually bitmaps to merge after the commit. In this instance I opted to use a transaction in all cases so that the code doesn't have to do complicated checking. That said, we can switch the deletion of the temporary bitmap to the non-transaction command in all cases, so unless there are bitmaps to merge it would indeed sidestep the issue.
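For illustration, the two QMP shapes being weighed above, built as plain Python dicts (node and bitmap names here are made up; they are not taken from the logs). The transactional form is what triggers the drain, the bare form avoids it:

```python
import json

# A single dirty-bitmap-add action (hypothetical names).
action = {"type": "block-dirty-bitmap-add",
          "data": {"node": "libvirt-15-format", "name": "tmp-bitmap"}}

# Transactional form: needed when several actions must be applied
# atomically, but it drains all block devices.
txn = {"execute": "transaction",
       "arguments": {"actions": [action]}}

# Bare form: sufficient for a single action, and it sidesteps the
# drain-induced standby race discussed in this bug.
bare = {"execute": "block-dirty-bitmap-add",
        "arguments": action["data"]}

print(json.dumps(bare))
```

The trade-off debated in the comments is exactly this: one code path that always builds `txn`, versus special-casing the single-action case to send `bare`.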
Sorry if my explanation came across as if libvirt should be responsible for resuming the job. It is not, and even if it tried, it couldn't resume a job that is internally paused by QEMU. The race condition needs to be fixed in QEMU so that the state changes back in time before processing the next QMP command. If the same code is used for cases where transactions are actually necessary, then it makes sense to just use the same code even in cases where it isn't strictly necessary. There is some overhead in transactions because they drain the queue for all block devices (they have to because we want a consistent state for all transaction actions), but probably not enough to justify adding a special case.
> Max, a while ago you introduced the drained_end_counter, which I think
> is supposed to address exactly this kind of scenario. Do you think the
> right solution would be incrementing it in child_job_drained_end() and
> decrementing it again in job_pause_point()?

I think you’re right. drained_end_counter’s documentation states that it is to be incremented for every background operation that’s launched during drained_end, which blockjob.c currently does not do. (It seems clear that job_enter_cond()’s aio_co_enter() is such a background operation.)

> If so, do we need to keep a list of drained_end_counters to be
> decremented later in the Job or can we be sure that there will only
> ever be a single one?

I think there can only be a single one, for two reasons. First, drained_end_counter’s documentation further states that “The caller must poll until it reaches 0.” From this it follows to me that only one party can end a drain at a time, and so only one counter should be active at a time. Second, job_enter_cond() is only invoked once pause_count reaches 0. So nested drains will not result in job_enter_cond(), and so there is no need to increment the drained_end_counter.

The only situation where I could imagine two drained_end_counters in play would be if you end a drain, then begin a new drain, and end it. But even if you don’t poll on ending the drain (which I think one must do), beginning the drain must result in polling, so the job must have been entered before a subsequent drain can be ended.

Max
This is fixed as of upstream commit 53ddb9c892f048bd031568178da52e4964d7d30a ('job: Allow complete for jobs on standby'). Moving to POST and ITR 8.5 for now, where we would get the fix automatically via the rebase.
Tested on qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d; the bug has been fixed.

Test Env:
kernel version: 4.18.0-305.1.el8.x86_64
qemu-kvm version: qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d

Test steps:

1. Start the guest with these qemu cmds:

/usr/libexec/qemu-kvm \
 -name 'avocado-vt-vm1' \
 -sandbox on \
 -machine q35 \
 -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
 -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
 -nodefaults \
 -device VGA,bus=pcie.0,addr=0x2 \
 -m 30720 \
 -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2 \
 -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
 -chardev socket,server=on,path=/tmp/monitor-qmpmonitor1-20210512-234257-mOeaMK07,id=qmp_id_qmpmonitor1,wait=off \
 -mon chardev=qmp_id_qmpmonitor1,mode=control \
 -chardev socket,server=on,path=/tmp/monitor-catch_monitor-20210512-234257-mOeaMK07,id=qmp_id_catch_monitor,wait=off \
 -mon chardev=qmp_id_catch_monitor,mode=control \
 -device pvpanic,ioport=0x505,id=idXTMc9z \
 -chardev socket,server=on,path=/tmp/serial-serial0-20210512-234257-mOeaMK07,id=chardev_serial0,wait=off \
 -device isa-serial,id=serial0,chardev=chardev_serial0 \
 -chardev socket,id=seabioslog_id_20210512-234257-mOeaMK07,path=/tmp/seabios-20210512-234257-mOeaMK07,server=on,wait=off \
 -device isa-debugcon,chardev=seabioslog_id_20210512-234257-mOeaMK07,iobase=0x402 \
 -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
 -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
 -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 \
 -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel850-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
 -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
 -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
 -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
 -device virtio-net-pci,mac=9a:bb:ed:35:8d:44,id=idSFYXRM,netdev=id47qtZ5,bus=pcie-root-port-3,addr=0x0 \
 -netdev tap,id=id47qtZ5,vhost=on \
 -vnc :0 \
 -rtc base=utc,clock=host,driftfix=slew \
 -boot menu=off,order=cdn,once=c,strict=off \
 -enable-kvm \
 -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 \
 -monitor stdio \
 -qmp tcp:0:3000,server,nowait \

2. Create the snapshot node and take a snapshot.

#create snapshot node
{'execute':'blockdev-create','arguments':{'options': {'driver':'file','filename':'/root/sn1','size':21474836480},'job-id':'job1'}}
{'execute':'blockdev-add','arguments':{'driver':'file','node-name':'drive_sn1','filename':'/root/sn1'}}
{'execute':'blockdev-create','arguments':{'options': {'driver': 'qcow2','file':'drive_sn1','size':21474836480,'backing-file':'/home/kvm_autotest_root/images/rhel76-64-virtio.qcow2','backing-fmt':'qcow2'},'job-id':'job2'}}
{'execute':'blockdev-add','arguments':{'driver':'qcow2','node-name':'sn1','file':'drive_sn1','backing':null}}
{'execute':'job-dismiss','arguments':{'id':'job1'}}
{'execute':'job-dismiss','arguments':{'id':'job2'}}

#do snapshot
{'execute':'blockdev-snapshot','arguments':{'node':'drive_image1','overlay':'sn1'}}

3. DD a file on sn1, then commit from sn1 to base.

(guest)# dd if=/dev/urandom of=/home/test bs=1M count=100

{'execute': 'block-commit', 'arguments': { 'device':'sn1','job-id':'j1'}}
{"timestamp": {"seconds": 1620888756, "microseconds": 929077}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1620888756, "microseconds": 929130}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1620888756, "microseconds": 934822}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1620888756, "microseconds": 934863}, "event": "BLOCK_JOB_READY", "data": {"device": "j1", "len": 4194304, "offset": 4194304, "speed": 0, "type": "commit"}}

4. Pause the block job.

{'execute': 'job-pause', 'arguments': {'id':'j1'}}
{"timestamp": {"seconds": 1620888790, "microseconds": 979143}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"return": {}}

5. Check the block job status.

{"execute":"query-block-jobs"}
{"return": [{"auto-finalize": true, "io-status": "ok", "device": "j1", "auto-dismiss": true, "busy": false, "len": 4325376, "offset": 4325376, "status": "standby", "paused": true, "speed": 0, "ready": true, "type": "commit"}]}

6. Complete the commit job.

{"execute":"job-complete","arguments":{"id":"j1"}}
{"return": {}}

7. Resume the block job.

{"execute":"job-resume","arguments":{"id":"j1"}}
{"timestamp": {"seconds": 1620888859, "microseconds": 819548}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1620888859, "microseconds": 820190}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "j1"}}
{"timestamp": {"seconds": 1620888859, "microseconds": 820227}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "j1"}}
{"timestamp": {"seconds": 1620888859, "microseconds": 820282}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j1", "len": 4587520, "offset": 4587520, "speed": 0, "type": "commit"}}
{"timestamp": {"seconds": 1620888859, "microseconds": 820314}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1620888859, "microseconds": 820339}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}

After step 7, the commit job executed successfully.
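The fixed behavior in steps 4-7 above can be summarized by a toy state machine (a deliberately simplified model, not qemu's actual job code): with the fix, 'complete' is accepted while a ready job sits in 'standby', and takes effect once the job is resumed.

```python
class CommitJob:
    """Simplified model of a block-commit job's status transitions."""

    def __init__(self):
        self.status = "ready"
        self.complete_requested = False
        self.events = []            # JOB_STATUS_CHANGE events we'd emit

    def _set(self, status):
        self.status = status
        self.events.append(status)

    def pause(self):
        # job-pause on a ready job puts it on standby.
        if self.status == "ready":
            self._set("standby")

    def complete(self):
        # Pre-fix, qemu rejected 'complete' in 'standby'; post-fix the
        # request is recorded and honored on resume.
        if self.status not in ("ready", "standby"):
            raise RuntimeError(
                "cannot complete in state '%s'" % self.status)
        self.complete_requested = True

    def resume(self):
        if self.status == "standby":
            self._set("ready")
            if self.complete_requested:
                # Mirrors the event sequence observed in step 7.
                for s in ("waiting", "pending", "concluded", "null"):
                    self._set(s)

job = CommitJob()
job.pause()       # -> standby
job.complete()    # accepted even though the job is paused (the fix)
job.resume()      # job runs to completion
print(job.events)
```

The event list printed at the end matches the order of the JOB_STATUS_CHANGE events seen in the verification log.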
Per comment 17, setting the bug's status to "Verified".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4684