Bug 1945635 - Pivot failed: Job ... in state 'standby' cannot accept command verb 'complete'
Summary: Pivot failed: Job ... in state 'standby' cannot accept command verb 'complete'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.3
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 8.4
Assignee: Kevin Wolf
QA Contact: aihua liang
URL:
Whiteboard:
Depends On:
Blocks: 1976256
 
Reported: 2021-04-01 13:36 UTC by Nir Soffer
Modified: 2024-10-01 17:50 UTC (History)
12 users

Fixed In Version: qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1976256 (view as bug list)
Environment:
Last Closed: 2021-11-16 07:52:31 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Tests scripts and logs from run 3, reproduced issue 3 times of 10 runs (6.58 MB, application/gzip)
2021-04-01 13:36 UTC, Nir Soffer
no flags Details
Tests scripts and logs from run 2, reproduced issue 5 times of 30 runs (3.45 MB, application/gzip)
2021-04-01 13:41 UTC, Nir Soffer
no flags Details
Tests scripts and logs from run 1, reproduced issue 1 time of 560 commits (15.34 MB, application/gzip)
2021-04-01 13:42 UTC, Nir Soffer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 6129321 0 None None None 2021-06-20 01:36:00 UTC
Red Hat Product Errata RHBA-2021:4684 0 None None None 2021-11-16 07:53:15 UTC

Description Nir Soffer 2021-04-01 13:36:31 UTC
Created attachment 1768253 [details]
Tests scripts and logs from run 3, reproduced issue 3 times of 10 runs

Description of problem:

When testing virDomainBlockCommit[1] under high load, completing the
block jobs with the VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT flag randomly fails
with this error:

    libvirt.libvirtError: internal error: unable to execute QEMU command
    'job-complete': Job 'commit-sda-libvirt-15-format' in state
    'standby' cannot accept command verb 'complete'

In vdsm logs we see:

1. Vdsm calls virDomainBlockCommit:

2021-03-21 01:07:10,679+0200 INFO  (mailbox-hsm/1) [virt.livemerge] Starting merge with job_id='7963caa8-20d0-4975-8afc-f17f463dc9da', original chain=f0004ba9-dcb8-4614-aba8-2e11114ba9eb < 0dad6300-a10a-4a90-ba3c-aefcc537932e (top), disk='sda', base='sda[13]', top=None, bandwidth=0, flags=12 (livemerge:396)

2. Vdsm receives an event that the block job is ready:

2021-03-21 01:07:22,580+0200 INFO  (libvirt/events) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Block job 7963caa8-20d0-4975-8afc-f17f463dc9da type ACTIVE_COMMIT for drive sda is ready (vm:5650)

3. Vdsm finds that the mirror job is ready:

2021-03-21 01:07:37,267+0200 INFO  (periodic/1) [virt.livemerge] Job 7963caa8-20d0-4975-8afc-f17f463dc9da is ready for pivot (livemerge:626)

(Vdsm looks for the mirror ready="yes" attribute, once the job reports cur==end)

4. Vdsm tries to complete the job (see the sketch after this walkthrough):

2021-03-21 01:07:53,511+0200 INFO  (merge/7963caa8) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Requesting pivot to complete active layer commit (job 7963caa8-20d0-4975-8afc-f17f463dc9da) (livemerge:771)

Vdsm calls:

    flags = libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
    self.vm._dom.blockJobAbort(self.drive.name, flags=flags)


5. Vdsm receives a second event that the block job is ready?!

2021-03-21 01:07:56,120+0200 INFO  (libvirt/events) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Block job 7963caa8-20d0-4975-8afc-f17f463dc9da type ACTIVE_COMMIT for drive sda is ready (vm:5650)

6. The call fails:

2021-03-21 01:07:56,122+0200 ERROR (merge/7963caa8) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Pivot failed (job: 7963caa8-20d0-4975-8afc-f17f463dc9da): Job 7963caa8-20d0-4975-8afc-f17f463dc9da failed with libvirt error: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete', aborting due to an unrecoverable error (livemerge:840)
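
For reference, this is roughly the libvirt-python sequence behind steps 1-4
above (a minimal sketch, not the actual vdsm code; VM_NAME is a placeholder,
and vdsm additionally waits for the mirror ready="yes" attribute before
pivoting):

    import time
    import libvirt

    VM_NAME = "vm1"   # placeholder
    DISK = "sda"

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName(VM_NAME)

    # 1. Start an active-layer commit (virDomainBlockCommit).
    dom.blockCommit(DISK, None, None, 0,
                    libvirt.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)

    # 2./3. Wait until the job reports cur == end (vdsm also checks the
    #       mirror ready="yes" attribute in the domain XML).
    while True:
        info = dom.blockJobInfo(DISK, 0)
        if info and info["cur"] == info["end"]:
            break
        time.sleep(1)

    # 4. Pivot: ask libvirt to complete the job so the base becomes the
    #    active layer.  Under load this is the call that fails with
    #    "in state 'standby'".
    dom.blockJobAbort(DISK, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)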


In libvirt log we see:

(Note: libvirt logs in UTC rather than local time, while the vdsm log and
 all other logs on the host use local time.)

1. Block job created:

2021-03-20 23:07:10.695+0000: 1740864: debug : qemuMonitorBlockCommit:3167 : device=libvirt-15-format, jobname=commit-sda-libvirt-15-format, persistjob=1, top=<null>, topNode=libvirt-15-format, base=<null>, baseNode=libvirt-13-format, backingName=<null>, bandwidth=0

2021-03-20 23:07:10.905+0000: 1798926: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f8e6c043730 event={"timestamp": {"seconds": 1616281630, "microseconds": 904561}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "commit-sda-libvirt-15-format"}}

2. Job changes state to "running":

2021-03-20 23:07:10.906+0000: 1798926: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f8e6c043730 event={"timestamp": {"seconds": 1616281630, "microseconds": 904654}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "commit-sda-libvirt-15-format"}}

3. Job changes state to "ready":

2021-03-20 23:07:22.579+0000: 1798926: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f8e6c043730 event={"timestamp": {"seconds": 1616281642, "microseconds": 579401}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "commit-sda-libvirt-15-format"}}

4. Job changes state to "stand"by?

2021-03-20 23:07:53.588+0000: 1798926: info : qemuMonitorJSONIOProcessLine:235 : QEMU_MONITOR_RECV_EVENT: mon=0x7f8e6c043730 event={"timestamp": {"seconds": 1616281673, "microseconds": 587046}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "commit-sda-libvirt-15-format"}}

(It seems that this happens exactly 30 seconds after the job is ready.)

5. Libvirt tries to complete the job:

2021-03-20 23:07:56.109+0000: 1740861: info : qemuMonitorSend:946 : QEMU_MONITOR_SEND_MSG: mon=0x7f8e6c043730 msg={"execute":"job-complete","arguments":{"id":"commit-sda-libvirt-15-format"},"id":"libvirt-1300"} fd=-1

6. QEMU fails the request:

2021-03-20 23:07:56.115+0000: 1798926: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"id": "libvirt-1300", "error": {"class": "GenericError", "desc": "Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete'"}}]


This error happens randomly and only under very high load, generated by a
stress test for the delete-snapshot flow in oVirt. During the test we
delete snapshots with multiple disks (e.g. 16 disks), with up to
2G of data. The test runs multiple VMs, creating and deleting
snapshots.


I did multiple runs with different setups:

1. Real server with 40 cores and fast FC storage, running 8 VMs, 16 disks per
   VM with 2G of data to commit (up to 128 concurrent commits):
   1 failed, 540 passed, 19 errored in 62428 seconds

2. Virtualized environment (VMs on my laptop) with 3 VMs, 9 disks per VM,
   3 disks with 2G of data to commit (up to 27 concurrent commits):
   5 failed, 23 passed, 2 errored in 5805 seconds

3. Virtualized environment (VMs on my laptop) with 1 VM, 16 disks, all 16
   disks with 2G of data to commit (16 concurrent commits):
   3 failed, 7 passed, 0 errored in 7117 seconds


I think what happens is this:

1. When the system is overloaded, the time from receiving the event that the
   job is ready until vdsm tries to complete the job may be more than 30
   seconds (31 seconds in this case).

2. QEMU moves the job from the "ready" state to the "standby" state after 30
   seconds.

3. When vdsm tries to abort the job, libvirt tries to complete the job
   without handling the "standby" state.


This may be a libvirt bug (not handling the standby state) or a qemu bug
(introducing a new state which is not compatible with existing users). Filing
this for libvirt since the standby state seems to be pretty old based on the
qemu source.


Version-Release number of selected component (if applicable):
libvirt-daemon-6.6.0-13.module+el8.3.1+9548+0a8fede5.x86_64
qemu-kvm-5.1.0-20.module+el8.3.1+9918+230f5c26.x86_64

How reproducible:
Random:
- easy to reproduce on a weak system with slow storage
- hard to reproduce on a real server with fast storage

Steps to Reproduce:
1. Perform concurrent block commit operations with a lot of data to commit,
   generating enough load that vdsm takes more than 30 seconds to respond
   to the block job ready event (see the sketch after this list).
2. Repeat many times
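
A hedged sketch of one way to generate this kind of load, running the
commit/pivot sequence from the description concurrently for many disks
(VM_NAME and DISKS are placeholders; actually triggering the race also
depends on storage speed and overall host load):

    import time
    from concurrent.futures import ThreadPoolExecutor

    import libvirt

    VM_NAME = "vm1"                                   # placeholder
    DISKS = ["sd" + chr(ord("a") + i) for i in range(16)]

    def commit_and_pivot(disk):
        conn = libvirt.open("qemu:///system")         # one connection per worker
        dom = conn.lookupByName(VM_NAME)
        dom.blockCommit(disk, None, None, 0,
                        libvirt.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)
        while True:
            info = dom.blockJobInfo(disk, 0)
            if info and info["cur"] == info["end"]:
                break
            time.sleep(1)
        dom.blockJobAbort(disk, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

    # Up to 16 concurrent commits; repeat the whole run many times.
    with ThreadPoolExecutor(max_workers=len(DISKS)) as pool:
        list(pool.map(commit_and_pivot, DISKS))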

Actual results:
Random failures to complete the job.

Expected results:
Jobs complete successfully.

Additional info:

Pivot failure is very problematic in RHV and is likely to end in an expensive
support case.

Comment 1 Nir Soffer 2021-04-01 13:41:00 UTC
Created attachment 1768255 [details]
Tests scripts and logs from run 2, reproduced issue 5 times of 30 runs

Comment 2 Nir Soffer 2021-04-01 13:42:51 UTC
Created attachment 1768258 [details]
Tests scripts and logs from run 1, reproduced issue 1 time of 560 commits

Comment 4 Peter Krempa 2021-04-01 14:01:30 UTC
Moving to qemu for now. Until now, once the "ready" state was emitted we were free to complete the job. Since the 'ready' state is announced to client applications via an event, it creates an expectation that the completion will be successful, as it used to be. We don't have any provision for taking back the 'ready' state if a standby state is emitted by qemu.

Comment 5 Kevin Wolf 2021-04-06 11:02:03 UTC
"standby" is the state that a job enters when it is paused while it's in the "ready" state. You can't complete a paused job without resuming it first.

In the log, it looks like immediately before the failing "job-complete", we have a "transaction" for block-dirty-bitmap-add. This calls bdrv_drain_all(), which in turn causes all jobs to pause for a moment, but they are immediately resumed again. The problem could be that resuming the job might not be instantaneous, but might just schedule the job coroutine (especially if the job is running in an iothread, like in this case apparently), so bdrv_drain_all_end() can return before the jobs have actually unpaused. So we run into a race condition where "job-complete" can be processed a little too early. If you check in the logs what happens immediately after the error, all the jobs return to "running"/"ready".

Max, a while ago you introduced the drained_end_counter, which I think is supposed to address exactly this kind of scenario. Do you think the right solution would be incrementing it in child_job_drained_end() and decrementing it again in job_pause_point()? If so, do we need to keep a list of drained_end_counters to be decremented later in the Job or can we be sure that there will only ever be a single one?

On another note, bdrv_drain_all() before executing the transaction actions is probably too weak (it should be a bdrv_drain_all_begin/end section around the actions).

On yet another note, the overhead of a transaction is also completely unnecessary in this case because it consists only of a single action, so libvirt calling "block-dirty-bitmap-add" instead would make more sense (and sidestep this issue).
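
For illustration only, a minimal QMP-level sketch of that window, using plain
sockets against a QMP monitor (e.g. one started with
-qmp tcp:0:3000,server,nowait). It assumes an active-commit job 'j1' that has
already reached the 'ready' state; 'drive_image1' and 'bitmap0' are
placeholder names, and whether the error actually fires depends on timing:

    import json
    import socket

    HOST, PORT = "127.0.0.1", 3000        # placeholder QMP endpoint

    def qmp(f, cmd, args=None):
        """Send one QMP command, skip async events, return the reply."""
        msg = {"execute": cmd}
        if args:
            msg["arguments"] = args
        f.write(json.dumps(msg) + "\n")
        f.flush()
        while True:
            reply = json.loads(f.readline())
            if "event" not in reply:
                return reply

    with socket.create_connection((HOST, PORT)) as sock:
        f = sock.makefile("rw")
        f.readline()                       # QMP greeting
        qmp(f, "qmp_capabilities")

        # The transaction drains all block nodes; the ready commit job 'j1'
        # is parked in 'standby' and only resumed asynchronously afterwards.
        qmp(f, "transaction", {"actions": [
            {"type": "block-dirty-bitmap-add",
             "data": {"node": "drive_image1", "name": "bitmap0"}}]})

        # If this is processed before the job has actually resumed, a
        # pre-fix QEMU answers: "Job 'j1' in state 'standby' cannot accept
        # command verb 'complete'".
        print(qmp(f, "job-complete", {"id": "j1"}))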

Comment 6 Peter Krempa 2021-04-06 12:06:16 UTC
(In reply to Kevin Wolf from comment #5)
> "standby" is the state that a job enters when it is paused while it's in the
> "ready" state. You can't complete a paused job without resuming it first.

The problem is that it's not an explicit, user-induced 'pause' of the job, and at the same time it's also automatically resumed. That potentially leads to a game of whack-a-mole where the job can blip back and forth and create race conditions between checking the state and executing the command (TOCTOU). Additionally, the reported error is a 'GenericError', so we can't really detect that it happened because of the 'standby' state and retry later.

[...]

> On yet another note, the overhead of a transaction is also completely
> unnecessary in this case because it consists only of a single action, so
> libvirt calling "block-dirty-bitmap-add" instead would make more sense (and
> sidestep this issue).

Well, we can detect whether only one action is to be executed and use the non-transaction command, but similarly there can also be more than one member here if there are actually bitmaps to merge after the commit. In this instance I've opted to use a transaction in any case so that the code doesn't have to do complicated checking.

That said, we can switch the deletion of the temporary bitmap to the non-transaction command in all cases, so unless there are bitmaps to merge it would indeed sidestep the issue.
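
For illustration, the difference being discussed, using block-dirty-bitmap-add
as the example action (node and bitmap names are placeholders): the same
action as a plain command versus wrapped in a transaction, the latter being
the form that drains all block nodes and briefly parks a ready job in
'standby':

    # Plain command, no transaction (no drain of all nodes):
    {"execute": "block-dirty-bitmap-add",
     "arguments": {"node": "drive_image1", "name": "bitmap0"}}

    # The same single action wrapped in a transaction:
    {"execute": "transaction",
     "arguments": {"actions": [
         {"type": "block-dirty-bitmap-add",
          "data": {"node": "drive_image1", "name": "bitmap0"}}]}}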

Comment 7 Kevin Wolf 2021-04-06 12:40:17 UTC
Sorry if my explanation came across as if libvirt should be responsible for resuming the job. It is not, and even if it tried, it couldn't resume a job that is internally paused by QEMU. The race condition needs to be fixed in QEMU so that the state changes back in time before processing the next QMP command.

If the same code is used for cases where transactions are actually necessary, then it makes sense to just use the same code even in cases where it isn't strictly necessary. There is some overhead in transactions because they drain the queue for all block devices (they have to because we want a consistent state for all transaction actions), but probably not enough to justify adding a special case.

Comment 8 Hanna Czenczek 2021-04-06 13:19:41 UTC
> Max, a while ago you introduced the drained_end_counter, which I think
> is supposed to address exactly this kind of scenario. Do you think the
> right solution would be incrementing it in child_job_drained_end() and
> decrementing it again in job_pause_point()?
I think you’re right.  drained_end_counter’s documentation states that it is to be incremented for every background operation that’s launched during drained_end, which blockjob.c currently does not do.  (It seems clear that job_enter_cond()’s aio_co_enter() is such a background operation.)

> If so, do we need to keep a list of drained_end_counters to be
> decremented later in the Job or can we be sure that there will only
> ever be a single one?
I think there can only be a single one, for two reasons.

First, drained_end_counter’s documentation further states that “The caller must poll until it reaches 0.”  From this it follows to me that only one party can end a drain at a time, and so only one counter should be active at a time.

Second, job_enter_cond() is only invoked once pause_count reaches 0.  So nested drains will not result in job_enter_cond(), and so no need to increment the drained_end_counter.  The only situation where I could imagine two drained_end_counters in play would be if you end a drain, then begin a new drain, and end it.  But even if you don’t poll on ending the drain (which I think one must do), then beginning the drain must result in polling, so the job must have been entered before a subsequent drain can be ended.

Max

Comment 10 Kevin Wolf 2021-04-13 16:37:07 UTC
This is fixed as of upstream commit 53ddb9c892f048bd031568178da52e4964d7d30a ('job: Allow complete for jobs on standby').

Moving to POST and ITR 8.5 for now where we would get it automatically by rebase.

Comment 17 aihua liang 2021-05-13 07:56:32 UTC
Tested on qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d; the bug has been fixed.

Test Env:
 kernel version:4.18.0-305.1.el8.x86_64
 qemu-kvm version:qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d

Test steps:
 1.Start guest with qemu cmds:
    /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine q35 \
    -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
    -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
    -nodefaults \
    -device VGA,bus=pcie.0,addr=0x2 \
    -m 30720  \
    -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2  \
    -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
    -chardev socket,server=on,path=/tmp/monitor-qmpmonitor1-20210512-234257-mOeaMK07,id=qmp_id_qmpmonitor1,wait=off  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,server=on,path=/tmp/monitor-catch_monitor-20210512-234257-mOeaMK07,id=qmp_id_catch_monitor,wait=off  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idXTMc9z \
    -chardev socket,server=on,path=/tmp/serial-serial0-20210512-234257-mOeaMK07,id=chardev_serial0,wait=off \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20210512-234257-mOeaMK07,path=/tmp/seabios-20210512-234257-mOeaMK07,server=on,wait=off \
    -device isa-debugcon,chardev=seabioslog_id_20210512-234257-mOeaMK07,iobase=0x402 \
    -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
    -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel850-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
    -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
    -device virtio-net-pci,mac=9a:bb:ed:35:8d:44,id=idSFYXRM,netdev=id47qtZ5,bus=pcie-root-port-3,addr=0x0  \
    -netdev tap,id=id47qtZ5,vhost=on  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 \
    -monitor stdio \
    -qmp tcp:0:3000,server,nowait \

 2.Create snapshot node and do snapshot.
    #create snapshot node
    {'execute':'blockdev-create','arguments':{'options': {'driver':'file','filename':'/root/sn1','size':21474836480},'job-id':'job1'}}
    {'execute':'blockdev-add','arguments':{'driver':'file','node-name':'drive_sn1','filename':'/root/sn1'}}
    {'execute':'blockdev-create','arguments':{'options': {'driver': 'qcow2','file':'drive_sn1','size':21474836480,'backing-file':'/home/kvm_autotest_root/images/rhel76-64-virtio.qcow2','backing-fmt':'qcow2'},'job-id':'job2'}}
    {'execute':'blockdev-add','arguments':{'driver':'qcow2','node-name':'sn1','file':'drive_sn1','backing':null}}
    {'execute':'job-dismiss','arguments':{'id':'job1'}}
    {'execute':'job-dismiss','arguments':{'id':'job2'}}

   #do snapshot
    {'execute':'blockdev-snapshot','arguments':{'node':'drive_image1','overlay':'sn1'}}

 3.DD file on sn1, then do commit from sn1 to base
   (guest)# dd if=/dev/urandom of=/home/test bs=1M count=100
   {'execute': 'block-commit', 'arguments': { 'device':'sn1','job-id':'j1'}}
{"timestamp": {"seconds": 1620888756, "microseconds": 929077}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1620888756, "microseconds": 929130}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1620888756, "microseconds": 934822}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1620888756, "microseconds": 934863}, "event": "BLOCK_JOB_READY", "data": {"device": "j1", "len": 4194304, "offset": 4194304, "speed": 0, "type": "commit"}}

 4.Pause block job.
   {'execute': 'job-pause', 'arguments': {'id':'j1'}}
{"timestamp": {"seconds": 1620888790, "microseconds": 979143}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "j1"}}
{"return": {}}

 5.Check block job status.
   {"execute":"query-block-jobs"}
{"return": [{"auto-finalize": true, "io-status": "ok", "device": "j1", "auto-dismiss": true, "busy": false, "len": 4325376, "offset": 4325376, "status": "standby", "paused": true, "speed": 0, "ready": true, "type": "commit"}]}

 6.Complete commit job
   {"execute":"job-complete","arguments":{"id":"j1"}}
{"return": {}}

 7.Resume block job.
   {"execute":"job-resume","arguments":{"id":"j1"}}
{"timestamp": {"seconds": 1620888859, "microseconds": 819548}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1620888859, "microseconds": 820190}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "j1"}}
{"timestamp": {"seconds": 1620888859, "microseconds": 820227}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "j1"}}
{"timestamp": {"seconds": 1620888859, "microseconds": 820282}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j1", "len": 4587520, "offset": 4587520, "speed": 0, "type": "commit"}}
{"timestamp": {"seconds": 1620888859, "microseconds": 820314}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1620888859, "microseconds": 820339}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}

After step 7, the commit job executed successfully.

Comment 18 aihua liang 2021-05-13 07:58:10 UTC
As per comment 17, setting the bug's status to "Verified".

Comment 26 errata-xmlrpc 2021-11-16 07:52:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

