Bug 977706
Summary: virsh pool-refresh removes the pool if a volume is deleted during the refresh

Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.4
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Reporter: Wei Zhou <ustcweizhou>
Assignee: Ján Tomko <jtomko>
QA Contact: Virtualization Bugs <virt-bugs>
CC: berrange, cwei, dyuan, lsoft-mso-pj, mzhan, nux, rbalakri, shyu, srinivas.avasarala, ustcweizhou, wido
Target Milestone: rc
Keywords: Upstream
Fixed In Version: libvirt-0.10.2-32.el6
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-10-14 04:16:10 UTC
Bug Blocks: 1011600
Description (Wei Zhou, 2013-06-25 07:52:02 UTC)
Consider this scenario: we use an NFS server as the source of a storage pool. When we refresh the pool and delete a volume at the same time on multiple nodes (CloudStack does exactly this), the error occurs with high probability.

> (4) in storage_driver.c, storagePoolRefresh runs stopPool and
> virStoragePoolObjRemove if refreshPool fails.
>
>     if (backend->refreshPool(obj->conn, pool) < 0) {
>         if (backend->stopPool)
>             backend->stopPool(obj->conn, pool);
>
>         pool->active = 0;
>
>         if (pool->configFile == NULL) {
>             virStoragePoolObjRemove(&driver->pools, pool);
>             pool = NULL;
>         }
>         goto cleanup;
>     }
Hmm, refreshPool should only return -1 if something truly serious went wrong. If a volume disappeared in the middle of a refresh, that should not cause it to return -1; it is supposed to simply skip volumes which disappear. Given your description, I guess some part of the code is not correctly skipping disappearing volumes.
Exactly. In storage_backend_fs.c, virStorageBackendFileSystemRefresh will goto cleanup if virStorageBackendProbeTarget returns -1. This should be changed:

    if ((ret = virStorageBackendProbeTarget(&vol->target,
                                            &backingStore,
                                            &backingStoreFormat,
                                            &vol->allocation,
                                            &vol->capacity,
                                            &vol->target.encryption)) < 0) {
        if (ret == -2) {
            /* Silently ignore non-regular files,
             * e.g. '.', '..', 'lost+found', dangling symbolic link */
            virStorageVolDefFree(vol);
            vol = NULL;
            continue;
        } else if (ret == -3) {
            /* The backing file is currently unavailable, its format is not
             * explicitly specified, and the probe to auto-detect the format
             * failed: continue with faked RAW format, since AUTO will
             * break virStorageVolTargetDefFormat() generating the line
             * <format type='...'/>. */
            backingStoreFormat = VIR_STORAGE_FILE_RAW;
        } else {
            goto cleanup;
        }
    }

(In reply to Wei Zhou from comment #4)
> In storage_backend_fs.c, virStorageBackendFileSystemRefresh will goto
> cleanup if virStorageBackendProbeTarget returns -1. This should be changed.

Would you mind submitting such a patch to the upstream libvirt mailing list? (I'm assuming you don't have a RH support contract since you're filing BZs directly rather than going through the support organization. If you do have a contract, please contact support so that your request can be properly prioritized.)

Created attachment 766220 [details]
Bug 977706: virStorageBackendVolOpenCheckMode returns -2 instead of -1 if a volume file is missing

virStorageBackendVolOpenCheckMode (in storage_backend.c) returns -2 instead of -1 if a volume file is missing, so that virStorageBackendProbeTarget (in storage_backend_fs.c) returns -2 as well. virStorageBackendFileSystemRefresh (in storage_backend_fs.c) then skips the missing files.
Jan Tomko (comment 7):
I've posted the patch from comment 6 to the upstream list:
https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html

(In reply to Jan Tomko from comment #7)
> I've posted the patch from comment 6 to the upstream list:
> https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html

I see that you posted the patch (and a v2) on the mailing list, but it hasn't been accepted yet. Is there any ETA for this patch to make it into upstream?

Hello,

This issue causes the NFS storage pool to disappear under load, and re-adding it requires stopping all VMs. The CloudStack project seems to be bitten by this problem, and the direction they propose is to simply bypass libvirt. The better solution for everyone would be to have this fixed. The contributed patches have not made it upstream and it's impacting deployments. Can anyone give it a kick? More in this thread:
http://www.mail-archive.com/dev@cloudstack.apache.org/msg25436.html

I have sent a v4 of the patch upstream:
https://www.redhat.com/archives/libvir-list/2014-March/msg01286.html

Thanks Jan, any sign of it being accepted?

It is now pushed upstream:

commit ee640f444bbdc976bdaed305f0d64d241d275376
Author: Ján Tomko <jtomko>
CommitDate: 2014-03-20 18:13:58 +0100

    Ignore missing files on pool refresh

    If we cannot stat/open a file on pool refresh, returning -1 aborts
    the refresh and the pool is undefined.

    Only treat missing files as fatal if VolOpenCheckMode is called
    with the VIR_STORAGE_VOL_OPEN_ERROR flag. If this flag is missing
    (when it's called from virStorageBackendProbeTarget in
    virStorageBackendFileSystemRefresh), only emit a warning and
    return -2 to let the caller skip over the file.

    https://bugzilla.redhat.com/show_bug.cgi?id=977706

git describe: v1.2.2-281-gee640f4

Jan, fantastic! What do we need to do to have this backported to EL6?

Downstream patch posted:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2014-April/msg00204.html

Simplified reproducer (with just one host):

1. Have a pool with a few volumes (at least 5).
2. Run virsh pool-refresh in a loop.
3. Keep creating and deleting a volume without libvirt:

    while true; do qemu-img create -f qcow2 img 5M; rm -f img; done

Without the fix, pool-refresh fails after a few seconds with:

    error: Requested operation is not valid: storage pool is not active

Reproduced with libvirt-0.10.2-31.el6.x86_64.

Verified with packages:
libvirt-0.10.2-33.el6.x86_64
qemu-kvm-0.12.1.2-2.423.el6.x86_64

Test steps:
1. Have a default pool with more than 5 volumes.
2. On terminal A, run:

    # while true; do virsh pool-refresh default; sleep 1; done

3. On terminal B, run:

    # while true; do qemu-img create -f qcow2 test.img 5M; rm -f test.img; done

4. After 30 minutes there were no errors on either terminal, and the default pool is still active:

    # virsh pool-list --all | grep default
    default              active     yes

Test result: the commands work as expected.

*** Bug 1115740 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html