Bug 1448268 - "virsh list" hangs when executing domain async operation
Summary: "virsh list" hangs when executing domain async operation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Michal Privoznik
QA Contact: jiyan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-05-05 03:53 UTC by Fangge Jin
Modified: 2018-04-10 10:44 UTC (History)

Fixed In Version: libvirt-3.8.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-10 10:43:32 UTC
Target Upstream Version:
Embargoed:


Attachments
libvirtd log, gstack output and guest xml (118.97 KB, application/x-gzip)
2017-05-05 03:53 UTC, Fangge Jin


Links
System ID                              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata RHEA-2018:0704  0        None      None    None     2018-04-10 10:44:08 UTC

Description Fangge Jin 2017-05-05 03:53:10 UTC
Created attachment 1276483
libvirtd log, gstack output and guest xml

Description of problem:
"virsh list" hangs while a domain async operation, e.g. "virsh dump" or "virsh save", is executing

Version-Release number of selected component:
libvirt-3.2.0-4.virtcov.el7.x86_64
qemu-kvm-rhev-2.9.0-2.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a guest with complex XML (so that "virsh dump/save" does not finish too quickly)

2. # virsh start rhel7-3

3. Do "virsh list" in a loop
# while true;do virsh list;done

4. Open another terminal and dump the guest
# virsh dump rhel7-3 /tmp/dump

5. Check the output of step 3. While the guest is being dumped, it is in the paused state, and at some point "virsh list" does not return until the dump has finished:
# virsh list
 Id    Name                           State
----------------------------------------------------
 30    rhel7-3                        paused


Actual results:
As described in step 5

Expected results:
"virsh list" should always return immediately while a domain async operation is executing.

Comment 2 Michal Privoznik 2017-09-14 12:44:22 UTC
The problem is caused here:

static int
qemuDomainSaveMemory(virQEMUDriverPtr driver,
                     virDomainObjPtr vm,
                     const char *path,
                     virQEMUSaveDataPtr data,
                     const char *compressedpath,
                     unsigned int flags,
                     qemuDomainAsyncJob asyncJob)
{
    ...
    if (virFileWrapperFdClose(wrapperFd) < 0)
        goto cleanup;
    ...
}


When virFileWrapperFdClose() is called, @vm is locked, so any other thread that wants to lock @vm has to wait. Ideally, virFileWrapperFdClose() would be a short call. However, it waits for iohelper (a small binary that reads the qemu migration stream and writes it to disk) to finish, and one of the last calls iohelper makes before exiting is fdatasync(). So effectively, we are waiting for all the data to hit the disk while holding the domain object lock.
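
The blocking pattern described above can be reproduced in miniature outside libvirt. The following is only a sketch under stated assumptions: struct demo_domain, demo_saver() and demo_trylock_during_save() are hypothetical names, a plain pthread mutex stands in for the domain object lock, and a blocking read() on a pipe stands in for waiting on iohelper. None of this is libvirt code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical stand-in for a domain object; not libvirt's real type. */
struct demo_domain {
    pthread_mutex_t lock;
    int holding_pipe[2];  /* saver -> observer: "I hold the lock now" */
    int io_done_pipe[2];  /* observer -> saver: simulated helper exit */
};

/* Saver thread: locks the object, then blocks on the simulated I/O
 * helper without dropping the lock. */
static void *demo_saver(void *opaque)
{
    struct demo_domain *dom = opaque;
    char byte = 'x';
    ssize_t n;

    assert(dom != NULL);
    pthread_mutex_lock(&dom->lock);
    n = write(dom->holding_pipe[1], &byte, 1);
    if (n == 1) {
        /* Blocks until the "helper" is done; the lock stays held. */
        n = read(dom->io_done_pipe[0], &byte, 1);
    }
    (void)n;
    pthread_mutex_unlock(&dom->lock);
    return NULL;
}

/* Plays the role of a second API call (think "virsh list") arriving
 * while the saver waits. Returns the pthread_mutex_trylock() result,
 * or -1 if the demo could not be set up. */
int demo_trylock_during_save(void)
{
    struct demo_domain dom;
    pthread_t saver;
    char byte = 'x';
    int rc;

    pthread_mutex_init(&dom.lock, NULL);
    if (pipe(dom.holding_pipe) < 0 || pipe(dom.io_done_pipe) < 0)
        return -1;
    if (pthread_create(&saver, NULL, demo_saver, &dom) != 0)
        return -1;

    /* Wait until the saver holds the lock and is about to block. */
    if (read(dom.holding_pipe[0], &byte, 1) != 1)
        return -1;

    /* A blocking lock here would stall until the I/O is done. */
    rc = pthread_mutex_trylock(&dom.lock);
    if (rc == 0)
        pthread_mutex_unlock(&dom.lock);

    /* Let the simulated helper "finish" so the saver can unlock. */
    if (write(dom.io_done_pipe[1], &byte, 1) != 1)
        return -1;
    pthread_join(saver, NULL);

    close(dom.holding_pipe[0]);
    close(dom.holding_pipe[1]);
    close(dom.io_done_pipe[0]);
    close(dom.io_done_pipe[1]);
    pthread_mutex_destroy(&dom.lock);
    return rc;
}
```

Called from a main() and built with -pthread, demo_trylock_during_save() returns EBUSY: the second caller cannot take the lock for as long as the simulated I/O is outstanding, which is the same shape as "virsh list" waiting behind the save.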

Comment 3 Michal Privoznik 2017-09-18 09:08:49 UTC
Patch proposed on the list:

https://www.redhat.com/archives/libvir-list/2017-September/msg00490.html

Comment 4 Michal Privoznik 2017-09-21 15:23:10 UTC
I've pushed the patch upstream:

commit 92524d3e6ee7b4cc7d70218c2586b7713fe6b00b
Author:     Michal Privoznik <mprivozn>
AuthorDate: Thu Sep 14 16:28:34 2017 +0200
Commit:     Michal Privoznik <mprivozn>
CommitDate: Thu Sep 21 17:21:39 2017 +0200

    qemu: Introduce a wrapper over virFileWrapperFdClose
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1448268
    
    When migrating to a file (e.g. when doing 'virsh save file'),
    couple of things are happening in the thread that is executing
    the API:
    
    1) the domain obj is locked
    2) iohelper is spawned as a separate process to handle all I/O
    3) the thread waits for iohelper to finish
    4) the domain obj is unlocked
    
    Now, the problem is that while the thread waits in step 3 for
    iohelper to finish this may take ages because iohelper calls
    fdatasync(). And unfortunately, we are waiting the whole time
    with the domain locked. So if another thread wants to jump in and
    say copy the domain name ('virsh list' for instance), they are
    stuck.
    
    The solution is to unlock the domain whenever waiting for I/O and
    lock it back again when it finished.
    
    Signed-off-by: Michal Privoznik <mprivozn>
    Reviewed-by: John Ferlan <jferlan>

v3.7.0-154-g92524d3e6
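
The unlock/wait/relock idea from the commit message can be sketched in isolation as follows. This is an illustration under assumptions, not the actual patch: struct demo_domain, demo_wait_io_unlocked() and demo_save_with_observer() are hypothetical names, a pthread mutex stands in for the domain object lock, and a pipe read stands in for waiting on iohelper. The re-check of the "active" flag after relocking is an assumption of this sketch about what a caller would want once the object has been unlocked for a while.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical stand-in for a domain object; not libvirt's real type. */
struct demo_domain {
    pthread_mutex_t lock;
    int active;           /* state that may change while unlocked */
    int holding_pipe[2];  /* saver -> observer: "I hold the lock now" */
    int io_done_pipe[2];  /* observer -> saver: simulated helper exit */
    int save_result;      /* filled in by the saver thread */
};

/* Drop the lock, wait for the simulated I/O helper, take the lock
 * back, then re-validate. The caller holds dom->lock on entry and
 * holds it again on return. */
static int demo_wait_io_unlocked(struct demo_domain *dom)
{
    char byte;
    int ret;

    pthread_mutex_unlock(&dom->lock);
    ret = (read(dom->io_done_pipe[0], &byte, 1) == 1) ? 0 : -1;
    pthread_mutex_lock(&dom->lock);

    if (!dom->active)     /* somebody changed the object meanwhile */
        ret = -1;
    return ret;
}

static void *demo_saver(void *opaque)
{
    struct demo_domain *dom = opaque;
    char byte = 'x';

    assert(dom != NULL);
    pthread_mutex_lock(&dom->lock);
    if (write(dom->holding_pipe[1], &byte, 1) == 1)
        dom->save_result = demo_wait_io_unlocked(dom);
    pthread_mutex_unlock(&dom->lock);
    return NULL;
}

/* Runs one simulated save with a concurrent observer. Returns the
 * saver's result (0 or -1), or -2 if the demo could not be set up. */
int demo_save_with_observer(int deactivate)
{
    struct demo_domain dom = { .active = 1, .save_result = -2 };
    pthread_t saver;
    char byte = 'x';

    pthread_mutex_init(&dom.lock, NULL);
    if (pipe(dom.holding_pipe) < 0 || pipe(dom.io_done_pipe) < 0)
        return -2;
    if (pthread_create(&saver, NULL, demo_saver, &dom) != 0)
        return -2;
    if (read(dom.holding_pipe[0], &byte, 1) != 1)
        return -2;

    /* The observer gets the lock while the I/O is still pending,
     * because the saver releases it before blocking. */
    pthread_mutex_lock(&dom.lock);
    if (deactivate)
        dom.active = 0;
    pthread_mutex_unlock(&dom.lock);

    /* Only now does the simulated helper "finish". */
    if (write(dom.io_done_pipe[1], &byte, 1) != 1)
        return -2;
    pthread_join(saver, NULL);

    close(dom.holding_pipe[0]);
    close(dom.holding_pipe[1]);
    close(dom.io_done_pipe[0]);
    close(dom.io_done_pipe[1]);
    pthread_mutex_destroy(&dom.lock);
    return dom.save_result;
}
```

In this sketch the observer always acquires the lock before the simulated I/O completes, so demo_save_with_observer(0) returns 0. With demo_save_with_observer(1) the observer flips the flag while the lock is dropped and the saver reports -1, which illustrates the cost of the approach: whatever was checked before unlocking has to be checked again after relocking.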

Comment 6 jiyan 2017-10-11 09:09:35 UTC
Test env components:
kernel-3.10.0-731.el7.x86_64
libvirt-3.8.0-1.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.9.x86_64

Test scenario:
Scenario-1: 'save' operation
1. Prepare a guest named 'pc' whose XML includes complex elements:
# virsh list --all |grep pc
 -     pc                             shut off

2. Start the guest
# virsh start pc
Domain pc started

2. Prepare the following shell script and run it in one terminal:
# cat testscript 
while true
do 
virsh list --all |grep pc
done

# sh testscript 

3. Open another terminal and save the state of guest to a file
# virsh save pc pc.save

4. Check the output of Step-2; the while loop prints the following:
 15    pc                             running
 15    pc                             running
...
 15    pc                             paused
 15    pc                             paused
 15    pc                             paused
...
 -     pc                             shut off
 -     pc                             shut off
As the output shows, 'virsh list' keeps returning during the 'save' operation and reports the 'paused' state throughout.

5. Check the output of Step-3; the command completes normally:
# virsh save pc pc.save
Domain pc saved to pc.save


Scenario-2: 'dump' operation
1. Prepare a guest named 'pc' whose XML includes complex elements:
# virsh list --all |grep pc
 -     pc                             shut off

2. Start the guest
# virsh start pc
Domain pc started

2. Prepare the following shell script and run it in one terminal:
# cat testscript 
while true
do 
virsh list --all |grep pc
done

# sh testscript 

3. Open another terminal and dump the core of the guest to a file
# virsh dump pc pc.dump

4. Check the output of Step-2; the while loop prints the following:
 16    pc                             running
 16    pc                             running
...
 16    pc                             paused
 16    pc                             paused
 16    pc                             paused
...
 16    pc                             running
 16    pc                             running
As the output shows, 'virsh list' keeps returning during the 'dump' operation and reports the 'paused' state throughout.

5. Check the output of Step-3; the command completes normally:
# virsh dump pc pc.dump
Domain pc dumped to pc.dump

All the results are as expected; moving this bug to VERIFIED.

Comment 10 errata-xmlrpc 2018-04-10 10:43:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0704

