Bug 1448268
| Summary: | "virsh list" hangs when executing domain async operation | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Fangge Jin <fjin> | ||||
| Component: | libvirt | Assignee: | Michal Privoznik <mprivozn> | ||||
| Status: | CLOSED ERRATA | QA Contact: | jiyan <jiyan> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 7.4 | CC: | dyuan, lmen, rbalakri, xuzhang, yafu, yanqzhan, zpeng | ||||
| Target Milestone: | rc | Keywords: | Upstream | ||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | libvirt-3.8.0-1.el7 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-04-10 10:43:32 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
The problem is caused here:

    static int
    qemuDomainSaveMemory(virQEMUDriverPtr driver,
                         virDomainObjPtr vm,
                         const char *path,
                         virQEMUSaveDataPtr data,
                         const char *compressedpath,
                         unsigned int flags,
                         qemuDomainAsyncJob asyncJob)
    {
        ...
        if (virFileWrapperFdClose(wrapperFd) < 0)
            goto cleanup;
        ...
    }

When virFileWrapperFdClose() is called, @vm is locked, so any other thread that wants to lock @vm has to wait. Ideally, virFileWrapperFdClose() would be a short call. However, it waits for iohelper (a small binary that reads the qemu migration stream and writes it to disk) to finish, and one of the last calls iohelper makes before exiting is fdatasync(). So effectively, we are waiting for all data to hit the disk while holding the domain object lock.
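
To make the cost of that wait concrete, here is a minimal sketch of a parent process waiting for a helper whose last step is fdatasync(). It is not libvirt code: spawn_helper() and wait_for_helper() are hypothetical names that only stand in for iohelper and for the wait inside virFileWrapperFdClose(), and the sketch assumes a Linux/POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for iohelper: the child writes `len` bytes to
 * `path`, forces them to disk with fdatasync(), and only then exits.
 * Returns the child's pid in the parent, or -1 if fork() failed. */
static pid_t spawn_helper(const char *path, size_t len)
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;

    /* Child process from here on. */
    char buf[4096];
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        _exit(1);
    memset(buf, 'x', sizeof(buf));
    while (len > 0) {
        size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t n = write(fd, buf, chunk);
        if (n <= 0)
            _exit(1);
        len -= (size_t)n;
    }
    /* The slow part: does not return until the data is on stable storage. */
    if (fdatasync(fd) < 0)
        _exit(1);
    close(fd);
    _exit(0);
}

/* Hypothetical stand-in for the wait: blocks until the helper has
 * exited, which includes the time spent in its fdatasync() call.
 * Returns 0 if the helper exited cleanly, -1 otherwise. */
static int wait_for_helper(pid_t pid)
{
    int status;

    assert(pid > 0);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Any lock the caller holds across wait_for_helper() stays held for the whole flush, so the larger the guest memory image and the slower the disk, the longer other threads are kept out.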
Patch proposed on the list: https://www.redhat.com/archives/libvir-list/2017-September/msg00490.html

I've pushed the patch upstream:

commit 92524d3e6ee7b4cc7d70218c2586b7713fe6b00b
Author: Michal Privoznik <mprivozn>
AuthorDate: Thu Sep 14 16:28:34 2017 +0200
Commit: Michal Privoznik <mprivozn>
CommitDate: Thu Sep 21 17:21:39 2017 +0200

    qemu: Introduce a wrapper over virFileWrapperFdClose

    https://bugzilla.redhat.com/show_bug.cgi?id=1448268

    When migrating to a file (e.g. when doing 'virsh save file'),
    couple of things are happening in the thread that is executing
    the API:

    1) the domain obj is locked
    2) iohelper is spawned as a separate process to handle all I/O
    3) the thread waits for iohelper to finish
    4) the domain obj is unlocked

    Now, the problem is that while the thread waits in step 3 for
    iohelper to finish this may take ages because iohelper calls
    fdatasync(). And unfortunately, we are waiting the whole time
    with the domain locked. So if another thread wants to jump in and
    say copy the domain name ('virsh list' for instance), they are
    stuck.

    The solution is to unlock the domain whenever waiting for I/O and
    lock it back again when it finished.

    Signed-off-by: Michal Privoznik <mprivozn>
    Reviewed-by: John Ferlan <jferlan>

v3.7.0-154-g92524d3e6
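
The unlock, wait, re-lock idea from the commit message can be sketched with a plain pthread mutex. This is an illustration under assumptions, not the libvirt patch itself: struct domain, wait_unlocked(), list_domain(), slow_wait() and demo() are hypothetical names, and the "slow wait" is modelled by running a lister thread to completion instead of waiting for a real iohelper process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, much simplified domain object. */
struct domain {
    pthread_mutex_t lock;
    const char *name;
    int times_listed;
};

typedef int (*slow_wait_fn)(void *opaque);

/* Must be called with dom->lock held and returns with it held again.
 * The lock is dropped only for the duration of the slow wait. */
static int wait_unlocked(struct domain *dom, slow_wait_fn wait_fn, void *opaque)
{
    int ret;

    pthread_mutex_unlock(&dom->lock);
    ret = wait_fn(opaque);          /* e.g. waiting for the I/O helper */
    pthread_mutex_lock(&dom->lock);
    return ret;
}

/* Models another API thread (think 'virsh list') that needs the
 * domain lock briefly to read the domain name. */
static void *list_domain(void *opaque)
{
    struct domain *dom = opaque;

    pthread_mutex_lock(&dom->lock);
    assert(dom->name != NULL);
    dom->times_listed++;
    pthread_mutex_unlock(&dom->lock);
    return NULL;
}

/* Stand-in for the slow wait: it finishes only once a lister thread
 * has taken and released the domain lock. If the caller still held
 * the lock here, this function would never return. */
static int slow_wait(void *opaque)
{
    pthread_t lister;

    if (pthread_create(&lister, NULL, list_domain, opaque) != 0)
        return -1;
    return pthread_join(lister, NULL) == 0 ? 0 : -1;
}

/* Returns how many times the domain was listed during the wait,
 * or -1 if the wait itself failed. */
static int demo(void)
{
    struct domain dom = { PTHREAD_MUTEX_INITIALIZER, "pc", 0 };
    int ret, listed;

    pthread_mutex_lock(&dom.lock);                 /* step 1: lock   */
    ret = wait_unlocked(&dom, slow_wait, &dom);    /* steps 2 and 3  */
    listed = dom.times_listed;                     /* lock held here */
    pthread_mutex_unlock(&dom.lock);               /* step 4: unlock */
    return ret == 0 ? listed : -1;
}
```

In this sketch demo() completes only because the lock is released around the wait. One general consequence of the pattern: anything read from the object before the wait may be stale once the lock is re-acquired, so the caller should re-check the state it depends on.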
Test env components:
kernel-3.10.0-731.el7.x86_64
libvirt-3.8.0-1.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.9.x86_64

Test scenario:

Scenario-1: 'save' operation
1. Prepare a guest named 'pc' with a complex XML definition and start it:
# virsh list --all |grep pc
 -     pc     shut off
# virsh start pc
Domain pc started
2. Prepare the following shell script and run it in one terminal:
# cat testscript
while true
do
virsh list --all |grep pc
done
# sh testscript
3. Open another terminal and save the state of the guest to a file:
# virsh save pc pc.save
4. Check the output of Step-2. The while loop prints:
 15    pc     running
 15    pc     running
...
 15    pc     paused
 15    pc     paused
 15    pc     paused
...
 -     pc     shut off
 -     pc     shut off
During the 'save' operation, 'virsh list' keeps returning and reports the 'paused' state continuously.
5. Check the output of Step-3. The command returns normally:
# virsh save pc pc.save
Domain pc saved to pc.save

Scenario-2: 'dump' operation
1. Prepare a guest named 'pc' with a complex XML definition and start it:
# virsh list --all |grep pc
 -     pc     shut off
# virsh start pc
Domain pc started
2. Prepare the following shell script and run it in one terminal:
# cat testscript
while true
do
virsh list --all |grep pc
done
# sh testscript
3. Open another terminal and dump the guest to a file:
# virsh dump pc pc.dump
4. Check the output of Step-2. The while loop prints:
 16    pc     running
 16    pc     running
...
 16    pc     paused
 16    pc     paused
 16    pc     paused
...
 16    pc     running
 16    pc     running
During the 'dump' operation, 'virsh list' keeps returning and reports the 'paused' state continuously.
5. Check the output of Step-3. The command returns normally:
# virsh dump pc pc.dump
Domain pc dumped to pc.dump

All the results are as expected; moving this bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0704

Created attachment 1276483
libvirtd log, gstack output and guest xml

Description of problem:
"virsh list" hangs when executing a domain async operation, e.g. "virsh dump" or "virsh save".

Version-Release number of selected component:
libvirt-3.2.0-4.virtcov.el7.x86_64
qemu-kvm-rhev-2.9.0-2.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a guest with complex xml (so that "virsh dump/save" will not finish too quickly)
2. # virsh start rhel7-3
3. Do "virsh list" in a loop:
# while true;do virsh list;done
4. Open another terminal:
# virsh dump rhel7-3 /tmp/dump
5. Check the output of step 3. While the guest is being dumped, it is in paused state, and at some point virsh does not return until the dump has finished:
# virsh list
 Id    Name                           State
----------------------------------------------------
 30    rhel7-3                        paused

Actual results:
As in step 5

Expected results:
"virsh list" should always return immediately when executing a domain async operation.