Bug 1847867 - RFE: pass through block job error message to clients using a new block job completed event
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Virtualization Maintenance
QA Contact: yisun
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-17 09:06 UTC by yisun
Modified: 2021-12-17 07:26 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-17 07:26:59 UTC
Type: Feature Request
Target Upstream Version:
Embargoed:



Description yisun 2020-06-17 09:06:16 UTC
Description:
The QMP error message is not passed to the user when an error occurs during a verbose blockcopy job.

Versions:
libvirt-6.0.0-24.module+el8.2.1+6997+c666f621.x86_64
qemu-kvm-4.2.0-25.module+el8.2.1+6985+9fd9d514.x86_64

How reproducible:
100%

Steps:
1. Have a transient VM with a 10G vda
(.libvirt-ci-venv-ci-runtest-6LDjrw) [root@libvirt-rhel-8 ~]# virsh domblklist avocado-vt-vm1
 Target   Source
------------------------------------------------------------------------
 vda      /var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2

(.libvirt-ci-venv-ci-runtest-6LDjrw) [root@libvirt-rhel-8 ~]# qemu-img info -U /var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2
image: /var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2
file format: qcow2
virtual size: 10 GiB (10737418240 bytes)


(.libvirt-ci-venv-ci-runtest-6LDjrw) [root@libvirt-rhel-8 ~]# virsh undefine avocado-vt-vm1
Domain avocado-vt-vm1 has been undefined

2. Prepare an 11G image as the blockcopy target
(.libvirt-ci-venv-ci-runtest-6LDjrw) [root@libvirt-rhel-8 ~]# qemu-img create -f qcow2 /tmp/new_image 11.000000G
Formatting '/tmp/new_image', fmt=qcow2 size=11811160064 cluster_size=65536 lazy_refcounts=off refcount_bits=16

3. Start the blockcopy with --reuse-external flag
(.libvirt-ci-venv-ci-runtest-6LDjrw) [root@libvirt-rhel-8 ~]# echo "" > /var/log/libvirtd-debug.log

(.libvirt-ci-venv-ci-runtest-6LDjrw) [root@libvirt-rhel-8 ~]# /usr/bin/virsh blockcopy avocado-vt-vm1 vda /tmp/new_image --verbose --wait --reuse-external
error:
Copy failed
<===== no clear error message provided.

4. Check the libvirtd debug log
2020-06-17 08:23:47.857+0000: 58735: debug : qemuDomainObjBeginJobInternal:9806 : Starting job: job=query agentJob=none asyncJob=none (vm=0x7f4c94196110 name=avocado-vt-vm1, current job=modify agentJob=none async=none)
2020-06-17 08:23:47.857+0000: 58735: debug : qemuDomainObjBeginJobInternal:9836 : Waiting for job (vm=0x7f4c94196110 name=avocado-vt-vm1)
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollRunOnce:651 : Poll got 1 event(s)
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchTimeouts:427 : Dispatch 4
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchHandles:473 : Dispatch 12
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchHandles:487 : i=0 w=1
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchHandles:487 : i=1 w=2
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchHandles:487 : i=2 w=3
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchHandles:487 : i=3 w=4
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchHandles:487 : i=4 w=5
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchHandles:487 : i=5 w=6
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchHandles:487 : i=6 w=7
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchHandles:487 : i=8 w=9
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchHandles:487 : i=9 w=10
2020-06-17 08:23:47.857+0000: 58733: debug : virEventPollDispatchHandles:487 : i=10 w=14
2020-06-17 08:23:47.857+0000: 58733: info : virEventPollDispatchHandles:501 : EVENT_POLL_DISPATCH_HANDLE: watch=14 events=1
2020-06-17 08:23:47.857+0000: 58733: info : virObjectRef:386 : OBJECT_REF: obj=0x7f4cd002f990
2020-06-17 08:23:47.857+0000: 58733: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": [{"current-progress": 0, "status": "concluded", "total-progress": 0, "type": "mirror", "id": "copy-vda-libvirt-1-format", "error": "Source and target image have different sizes"}], "id": "libvirt-372"}]
2020-06-17 08:23:47.857+0000: 58733: debug : virJSONValueFromString:1844 : string={"return": [{"current-progress": 0, "status": "concluded", "total-progress": 0, "type": "mirror", "id": "copy-vda-libvirt-1-format", "error": "Source and target image have different sizes"}], "id": "libvirt-372"}
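
For reference, the error text is already present in the query-jobs reply that libvirt receives and parses; a minimal Python sketch (standard library only, reply text copied from the log lines above) shows where it sits:

    import json

    # query-jobs reply captured in the libvirtd debug log above
    reply = json.loads(
        '{"return": [{"current-progress": 0, "status": "concluded",'
        ' "total-progress": 0, "type": "mirror", "id": "copy-vda-libvirt-1-format",'
        ' "error": "Source and target image have different sizes"}],'
        ' "id": "libvirt-372"}')

    for job in reply["return"]:
        if job["status"] == "concluded" and "error" in job:
            # this is the text a user would want instead of a bare "Copy failed"
            print("block job %s failed: %s" % (job["id"], job["error"]))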

Expected result:
If the copy fails, a clear error message should be provided to the user.

Actual result:
The error message is empty; detailed information is only available in the debug log.

Additional info:
This is an automation regression failure, not a libvirt product regression:
PASS round:
https://libvirt-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/libvirt-RHEL-8.2-runtest-x86_64-function-block_copy_no_shallow/43/testReport/rhel/backingchain/blockcopy_positive_test_reuse_external/
FAIL round:
https://libvirt-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/libvirt-RHEL-8.2-runtest-x86_64-function-block_copy_no_shallow/46/testReport/rhel/backingchain/blockcopy_positive_test_reuse_external/
Reason:
qemu recently started checking whether the blockcopy (--reuse-external) job's target file and source file have the same size; if they do not, a QMP error is returned. The test script uses target and source images of different sizes, so we will update our test code accordingly (a sketch of the test-side adjustment follows below).
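
A hedged sketch of that test-side fix (Python, qemu-img only; the paths are the ones used in the reproduction steps above) that sizes the copy target to match the source before running blockcopy with --reuse-external:

    import json
    import subprocess

    src = "/var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2"
    dst = "/tmp/new_image"

    # read the source's virtual size so the reused target matches it exactly
    info = json.loads(subprocess.check_output(
        ["qemu-img", "info", "--output=json", "-U", src]))
    size = info["virtual-size"]  # bytes, e.g. 10737418240 for the 10G vda

    # create the target with the same virtual size before running
    # "virsh blockcopy ... --reuse-external"
    subprocess.check_call(["qemu-img", "create", "-f", "qcow2", dst, str(size)])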

Comment 1 Peter Krempa 2020-06-17 10:29:54 UTC
There are two problems:

1) The empty 'error:' line is printed by virsh, which hints that an error message is missing. This is a bug in the virsh code, which uses the wrong function to print the "Copy failed" message; that message should not be prefixed by 'error:'. I'll post a patch dealing with this.

2) There is no way to get the actual error message using the existing infrastructure. virsh can't report the error because the existing libvirt events for block job completion don't have any field to transport it, and block jobs that have completed (in any way) also can't be queried via the existing APIs (see the sketch below). This means we need either a new libvirt event or a completely new way of handling block jobs that allows reporting the error or querying a job after it has finished. This is far more work. I'm thus re-purposing the summary to track this second part.
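
For context, a minimal sketch of what a client sees today through the Python bindings with the existing VIR_DOMAIN_EVENT_ID_BLOCK_JOB_2 event (the qemu:///system URI is assumed); the callback carries only the job type and completion status, with no field for the QMP error text:

    import libvirt

    def block_job_cb(conn, dom, disk, job_type, status, opaque):
        if status == libvirt.VIR_DOMAIN_BLOCK_JOB_FAILED:
            # all the callback carries is the job type and completion status;
            # the "error" string seen in the QMP traffic is not available here
            print("block job on %s:%s failed" % (dom.name(), disk))

    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.openReadOnly("qemu:///system")
    conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_BLOCK_JOB_2,
                                block_job_cb, None)
    # a real client would now loop over libvirt.virEventRunDefaultImpl()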

Both of the problems described above have been present for a very long time.

Comment 3 Nir Soffer 2021-03-08 17:17:49 UTC
Returning a detailed error to clients would be awesome. It would make it easy
to debug issues in clients.

For example, in RHV, we sometimes get an event like:

    2021-03-07 16:01:30,859+0200 ERROR (libvirt/events) [virt.vm] (vmId='5652e2ca-9006-4657-8895-0551deda5b9a') 
    Block job 36ca3e55-8902-4508-8937-fc991cdb316b type ACTIVE_COMMIT for drive sdb has failed (vm:5597)

To understand why the job failed, you need to go to the libvirt log, but the
log does not contain any info about this error, because the error is only
logged at info level, in this message:

    2021-03-07 14:01:30.842+0000: 53531: info : qemuMonitorJSONIOProcessLine:235 :
    QEMU_MONITOR_RECV_EVENT: mon=0x7f3b9805f030 event={"timestamp": {"seconds": 
    1615125690, "microseconds": 842235}, "event": "BLOCK_JOB_COMPLETED", "data":
    {"device": "commit-sdb-libvirt-7-format", "len": 2147483648, "offset": 1072693248,
    "speed": 0, "type": "commit", "error": "No space left on device"}}

Info-level messages are usually disabled because the libvirt log is too verbose.
For example, 80% of the log consists of messages about object lifetime, like:

    2021-03-07 14:01:30.842+0000: 53531: info : virObjectRef:402 : OBJECT_REF: obj=0x7f3b9805f030

We can filter out such messages (see the example below), but this looks like a
debugging message that should be logged at debug level.
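
As a hedged example of that filtering (syntax and filter names follow the commented examples shipped in libvirtd.conf; they may need adjusting for a given libvirt version), this keeps qemu monitor traffic at info level while dropping object/event lifetime noise below error:

    # /etc/libvirt/libvirtd.conf
    # keep qemu monitor traffic at info level, demote object/event lifetime noise
    log_filters="2:qemu 4:object 4:event 3:util"
    log_outputs="2:file:/var/log/libvirt/libvirtd.log"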

I think the simplest way to fix this issue quickly is to check the qemu reply
and log block job failures at error or warning level, so they are always
logged.

Even if we have a way to pass the error info to the client using events,
logging the error on the libvirt side is a good idea, since the client may
lose the event.

Comment 4 John Ferlan 2021-09-08 13:30:27 UTC
Bulk update: Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Comment 6 RHEL Program Management 2021-12-17 07:26:59 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

