Bug 1448268

Summary: "virsh list" hangs when executing domain async operation
Product: Red Hat Enterprise Linux 7 Reporter: Fangge Jin <fjin>
Component: libvirt Assignee: Michal Privoznik <mprivozn>
Status: CLOSED ERRATA QA Contact: jiyan <jiyan>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.4 CC: dyuan, lmen, rbalakri, xuzhang, yafu, yanqzhan, zpeng
Target Milestone: rc Keywords: Upstream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: libvirt-3.8.0-1.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-10 10:43:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
libvirtd log, gstack output and guest xml none

Description Fangge Jin 2017-05-05 03:53:10 UTC
Created attachment 1276483 [details]
libvirtd log, gstack output and guest xml

Description of problem:
"virsh list" hangs while a domain async operation such as "virsh dump" or "virsh save" is running.

Version-Release number of selected component:
libvirt-3.2.0-4.virtcov.el7.x86_64
qemu-kvm-rhev-2.9.0-2.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a guest with complex XML (so that "virsh dump/save" does not finish too quickly)

2. # virsh start rhel7-3

3. Do "virsh list" in a loop
# while true;do virsh list;done

4. Open another terminal: # virsh dump rhel7-3 /tmp/dump

5. Check the output of step 3. While the guest is being dumped, it is in the paused state, and at some point "virsh list" does not return until the dump has finished:
# virsh list
 Id    Name                           State
----------------------------------------------------
 30    rhel7-3                        paused


Actual results:
As step 5

Expected results:
"virsh list" should always return immediately while a domain async operation is running.

Comment 2 Michal Privoznik 2017-09-14 12:44:22 UTC
The problem is caused here:

static int
qemuDomainSaveMemory(virQEMUDriverPtr driver,
                     virDomainObjPtr vm,
                     const char *path,
                     virQEMUSaveDataPtr data,
                     const char *compressedpath,
                     unsigned int flags,
                     qemuDomainAsyncJob asyncJob)
{
    ...
    if (virFileWrapperFdClose(wrapperFd) < 0)
        goto cleanup;
    ...
}


When virFileWrapperFdClose() is called, @vm is locked, so any other thread that wants to lock @vm has to wait. Ideally, virFileWrapperFdClose() would be a short call. However, it waits for iohelper (a small binary that reads the qemu migration stream and saves it to disk) to finish, and one of the last calls iohelper makes before exiting is fdatasync(). So effectively, we wait for all data to hit the disk while holding the domain object lock.
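
To make the cost of that final fdatasync() concrete, here is a minimal sketch of an iohelper-style copy loop. It is not the actual iohelper source: the function names copy_to_file_and_sync() and write_all(), the buffer size, and the file mode are all assumptions made for illustration. The point is only that the function cannot return until fdatasync() has pushed the data to disk, so anything waiting for it waits just as long.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t put = write(fd, buf, len);
        if (put < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += put;
        len -= (size_t)put;
    }
    return 0;
}

/* Hypothetical stand-in for the helper's job: drain in_fd into the file
 * at path, then force the data to stable storage.
 * Returns the number of bytes copied, or -1 on error. */
static long copy_to_file_and_sync(int in_fd, const char *path)
{
    char buf[65536];
    long total = 0;
    int out_fd;

    assert(path != NULL);
    out_fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out_fd < 0)
        return -1;

    for (;;) {
        ssize_t got = read(in_fd, buf, sizeof(buf));
        if (got < 0 && errno == EINTR)
            continue;
        if (got < 0 || (got > 0 && write_all(out_fd, buf, (size_t)got) < 0)) {
            close(out_fd);
            return -1;
        }
        if (got == 0)
            break;              /* EOF: the producer closed its end */
        total += got;
    }

    /* The potentially slow part: returns only once the data is on disk. */
    if (fdatasync(out_fd) < 0) {
        close(out_fd);
        return -1;
    }
    if (close(out_fd) < 0)
        return -1;
    return total;
}
```

For a multi-gigabyte memory image, the time spent in fdatasync() can dominate, which matches the observation that "virsh list" stalls towards the end of the dump.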

Comment 3 Michal Privoznik 2017-09-18 09:08:49 UTC
Patch proposed on the list:

https://www.redhat.com/archives/libvir-list/2017-September/msg00490.html

Comment 4 Michal Privoznik 2017-09-21 15:23:10 UTC
I've pushed the patch upstream:

commit 92524d3e6ee7b4cc7d70218c2586b7713fe6b00b
Author:     Michal Privoznik <mprivozn>
AuthorDate: Thu Sep 14 16:28:34 2017 +0200
Commit:     Michal Privoznik <mprivozn>
CommitDate: Thu Sep 21 17:21:39 2017 +0200

    qemu: Introduce a wrapper over virFileWrapperFdClose
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1448268
    
    When migrating to a file (e.g. when doing 'virsh save file'),
    couple of things are happening in the thread that is executing
    the API:
    
    1) the domain obj is locked
    2) iohelper is spawned as a separate process to handle all I/O
    3) the thread waits for iohelper to finish
    4) the domain obj is unlocked
    
    Now, the problem is that while the thread waits in step 3 for
    iohelper to finish this may take ages because iohelper calls
    fdatasync(). And unfortunately, we are waiting the whole time
    with the domain locked. So if another thread wants to jump in and
    say copy the domain name ('virsh list' for instance), they are
    stuck.
    
    The solution is to unlock the domain whenever waiting for I/O and
    lock it back again when it finished.
    
    Signed-off-by: Michal Privoznik <mprivozn>
    Reviewed-by: John Ferlan <jferlan>

v3.7.0-154-g92524d3e6
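
The unlock-while-waiting idea from the commit message can be sketched with plain pthreads. This is an illustration under assumptions, not the code from the patch: struct domain, wait_for_io_unlocked(), save_job(), try_read_domain() and the job_active flag are hypothetical names, and the slow wait for the helper is simulated with nanosleep().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a domain object guarded by a mutex. */
struct domain {
    pthread_mutex_t lock;
    bool job_active;   /* set while the async job still owns the domain */
};

/* Stand-in for the slow wait on the helper process. */
static void slow_io_wait(long millis)
{
    struct timespec ts = { millis / 1000, (millis % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Caller must hold dom->lock. The lock is dropped for the slow wait and
 * retaken before returning, so the caller's locking contract is unchanged. */
static void wait_for_io_unlocked(struct domain *dom, long millis)
{
    pthread_mutex_unlock(&dom->lock);
    slow_io_wait(millis);
    pthread_mutex_lock(&dom->lock);
}

/* Thread body playing the role of the 'save' API call. */
static void *save_job(void *opaque)
{
    struct domain *dom = opaque;

    pthread_mutex_lock(&dom->lock);
    dom->job_active = true;
    wait_for_io_unlocked(dom, 300);   /* other threads may lock dom here */
    dom->job_active = false;
    pthread_mutex_unlock(&dom->lock);
    return NULL;
}

/* Reader playing the role of 'virsh list': returns true if it could take
 * the lock without blocking, and reports whether a job was in progress. */
static bool try_read_domain(struct domain *dom, bool *job_active)
{
    if (pthread_mutex_trylock(&dom->lock) != 0)
        return false;
    *job_active = dom->job_active;
    pthread_mutex_unlock(&dom->lock);
    return true;
}
```

Because the domain is unlocked during the wait, some other marker (here the assumed job_active flag) has to tell concurrent callers that an operation is still in flight. Built with -pthread, a reader thread calling try_read_domain() while save_job() is waiting gets the lock immediately and sees job_active set.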

Comment 6 jiyan 2017-10-11 09:09:35 UTC
Test env components:
kernel-3.10.0-731.el7.x86_64
libvirt-3.8.0-1.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.9.x86_64

Test scenario:
Scenario-1: As for 'save' operation
1. Prepare a guest named 'pc' with a complex XML definition:
# virsh list --all |grep pc
 -     pc                             shut off

2. Start the guest
# virsh start pc
Domain pc started

2. Prepare a shell script as follows, and run it in one terminal:
# cat testscript 
while true
do 
virsh list --all |grep pc
done

# sh testscript 

3. Open another terminal and save the state of the guest to a file
# virsh save pc pc.save

4. Check the output from Step-2. The while loop prints the following:
 15    pc                             running
 15    pc                             running
...
 15    pc                             paused
 15    pc                             paused
 15    pc                             paused
...
 -     pc                             shut off
 -     pc                             shut off
During the 'save' operation, 'virsh list' keeps returning and reports the 'paused' state without hanging.

5. Check the output of Step-3. The command completes normally:
# virsh save pc pc.save
Domain pc saved to pc.save


Scenario-2: As for 'dump' operation
1. Prepare a guest named 'pc' with a complex XML definition:
# virsh list --all |grep pc
 -     pc                             shut off

2. Start the guest
# virsh start pc
Domain pc started

2. Prepare a shell script as follows, and run it in one terminal:
# cat testscript 
while true
do 
virsh list --all |grep pc
done

# sh testscript 

3. Open another terminal and dump the core of the guest to a file
# virsh dump pc pc.dump

4. Check the output from Step-2. The while loop prints the following:
 16    pc                             running
 16    pc                             running
...
 16    pc                             paused
 16    pc                             paused
 16    pc                             paused
...
 16    pc                             running
 16    pc                             running
During the 'dump' operation, 'virsh list' keeps returning and reports the 'paused' state without hanging.

5. Check the output of Step-3. The command completes normally:
# virsh dump pc pc.dump
Domain pc dumped to pc.dump

All results are as expected; moving this bug to VERIFIED.

Comment 10 errata-xmlrpc 2018-04-10 10:43:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0704