Bug 977706
| Summary: | virsh pool-refresh will remove the pool if a volume is removed during processing | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Wei Zhou <ustcweizhou> |
| Component: | libvirt | Assignee: | Ján Tomko <jtomko> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.4 | CC: | berrange, cwei, dyuan, lsoft-mso-pj, mzhan, nux, rbalakri, shyu, srinivas.avasarala, ustcweizhou, wido |
| Target Milestone: | rc | Keywords: | Upstream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-0.10.2-32.el6 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-10-14 04:16:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1011600 | | |
| Attachments: | | | |
Considering this scenario: we use an NFS server as the source of a storage pool. When we refresh the pool and delete a volume at the same time on multiple nodes (CloudStack does exactly this), the error occurs with high probability.

> (4) in storage_driver.c, storagePoolRefresh runs stopPool and
> virStoragePoolObjRemove if refreshPool fails.
>
>     if (backend->refreshPool(obj->conn, pool) < 0) {
>         if (backend->stopPool)
>             backend->stopPool(obj->conn, pool);
>
>         pool->active = 0;
>
>         if (pool->configFile == NULL) {
>             virStoragePoolObjRemove(&driver->pools, pool);
>             pool = NULL;
>         }
>         goto cleanup;
>     }
Hmm, refreshPool should only return -1 if something truly serious went wrong. If a volume disappeared while in the middle of refreshing, this should not have caused it to return -1. It is supposed to simply skip volumes which disappear. Given your description, I guess there is some part of the code which is not correctly skipping disappearing volumes.
Exactly.
In storage_backend_fs.c, virStorageBackendFileSystemRefresh will goto cleanup if virStorageBackendProbeTarget returns -1. This should be changed (a sketch of one possible adjustment follows the snippet below).
    if ((ret = virStorageBackendProbeTarget(&vol->target,
                                            &backingStore,
                                            &backingStoreFormat,
                                            &vol->allocation,
                                            &vol->capacity,
                                            &vol->target.encryption)) < 0) {
        if (ret == -2) {
            /* Silently ignore non-regular files,
             * eg '.' '..', 'lost+found', dangling symbolic link */
            virStorageVolDefFree(vol);
            vol = NULL;
            continue;
        } else if (ret == -3) {
            /* The backing file is currently unavailable, its format is not
             * explicitly specified, the probe to auto detect the format
             * failed: continue with faked RAW format, since AUTO will
             * break virStorageVolTargetDefFormat() generating the line
             * <format type='...'/>. */
            backingStoreFormat = VIR_STORAGE_FILE_RAW;
        } else
            goto cleanup;
    }
(In reply to Wei Zhou from comment #4)
> In storage_backend_fs.c, virStorageBackendFileSystemRefresh will goto
> cleanup if virStorageBackendProbeTarget returns -1. This should be changed.

Would you mind submitting such a patch to the upstream libvirt mailing list? (I'm assuming you don't have a RH support contract since you're filing BZs directly rather than going through the support organization. If you do have a contract, please contact support so that your request can be properly prioritized.)

Created attachment 766220 [details]
Bug 977706: virStorageBackendVolOpenCheckMode return -2 instead of -1 if volume file is missing

virStorageBackendVolOpenCheckMode (in storage_backend.c) returns -2 instead of -1 if the volume file is missing, so that virStorageBackendProbeTarget (in storage_backend_fs.c) returns -2 as well. virStorageBackendFileSystemRefresh (in storage_backend_fs.c) then skips the missing files.

I've posted the patch from comment 6 to the upstream list:
https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html

(In reply to Jan Tomko from comment #7)
> I've posted the patch from comment 6 to the upstream list:
> https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html

I see that you posted the patch (and a v2) on the mailing list, but it hasn't been accepted yet. Is there any ETA for this patch to make it into upstream?

Hello,

This issue is causing the NFS storage pool to disappear under load, and re-adding it requires stopping all VMs. The CloudStack project seems to be bitten by this problem, and the direction they propose is to just bypass libvirt. The _better_ solution for everyone would be to have this fixed. The contributed patches have not made it upstream and it's impacting deployments. Can anyone give it a kick?

More in this thread: http://www.mail-archive.com/dev@cloudstack.apache.org/msg25436.html

I have sent a v4 of the patch upstream:
https://www.redhat.com/archives/libvir-list/2014-March/msg01286.html

Thanks Jan, any sign of having it accepted?

It is now pushed upstream:
commit ee640f444bbdc976bdaed305f0d64d241d275376
Author: Ján Tomko <jtomko>
CommitDate: 2014-03-20 18:13:58 +0100
Ignore missing files on pool refresh
If we cannot stat/open a file on pool refresh, returning -1 aborts
the refresh and the pool is undefined.
Only treat missing files as fatal if VolOpenCheckMode is called
with the VIR_STORAGE_VOL_OPEN_ERROR flag. If this flag is missing
(when it's called from virStorageBackendProbeTarget in
virStorageBackendFileSystemRefresh), only emit a warning and return
-2 to let the caller skip over the file.
https://bugzilla.redhat.com/show_bug.cgi?id=977706
git describe: v1.2.2-281-gee640f4
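Compared with the simpler sketch earlier in this report, the upstream commit gates the decision on a flag, so callers that do want hard errors keep the old fatal behaviour. Roughly, and again as a simplified stand-in rather than the real libvirt implementation (SKETCH_VOL_OPEN_ERROR here is a hypothetical placeholder for VIR_STORAGE_VOL_OPEN_ERROR):

    /* Sketch of the flag-gated decision: a missing file is fatal only when
     * the caller asked for errors; the refresh path omits that bit, gets
     * -2 back, and simply skips the volume. */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/stat.h>

    #define SKETCH_VOL_OPEN_ERROR (1 << 0)  /* stand-in for VIR_STORAGE_VOL_OPEN_ERROR */

    static int
    vol_open_check_mode_sketch(const char *path, unsigned int flags)
    {
        struct stat sb;

        if (stat(path, &sb) < 0) {
            if (errno == ENOENT && !(flags & SKETCH_VOL_OPEN_ERROR)) {
                /* Refresh path: warn and let the caller skip this volume. */
                fprintf(stderr, "warning: ignoring missing file '%s'\n", path);
                return -2;
            }
            perror(path);                   /* anything else stays fatal */
            return -1;
        }
        /* ... check the file mode and open it as before ... */
        return 0;
    }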
Jan, fantastic! What do we need to do to have this backported in EL6?

Downstream patch posted:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2014-April/msg00204.html

Simplified reproducer (with just one host):
1. Have a pool with a few volumes (at least 5)
2. Run virsh pool-refresh in a loop
3. Keep creating and deleting a volume without libvirt:
   while true; do qemu-img create -f qcow2 img 5M; rm -f img; done

Without the fix, the pool-refresh fails after a few seconds with:
error: Requested operation is not valid: storage pool is not active

Reproduced with libvirt-0.10.2-31.el6.x86_64.

Verified with packages:
libvirt-0.10.2-33.el6.x86_64
qemu-kvm-0.12.1.2-2.423.el6.x86_64

Test steps:
1. Have a default pool with more than 5 volumes
2. On terminal A, run:
   # while true; do virsh pool-refresh default; sleep 1; done
3. On terminal B, run:
   # while true; do qemu-img create -f qcow2 test.img 5M; rm -f test.img; done
4. Let both run for 30 minutes: no errors on either terminal, and the default pool is still active.
   # virsh pool-list --all | grep default
   default              active     yes

Test results: the commands work as expected.

*** Bug 1115740 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html
Description of problem:
virsh pool-refresh will remove the pool if a volume is removed during processing.

Version-Release number of selected component (if applicable):
[root@cs-kvm003 ~]# rpm -qa libvirt qemu-kvm kernel
kernel-2.6.32-358.2.1.el6.x86_64
libvirt-0.10.2-18.el6.x86_64
qemu-kvm-0.12.1.2-3.209.el6.4.x86_64

How reproducible:
100%

Steps to Reproduce:
I use two nodes for testing: one for the storage pool refresh, the other for removing the volume.

Before testing:
root@cs-kvm002:~# virsh pool-list
Name                                 State      Autostart
-----------------------------------------
56bf9deb-5f00-363d-a2fe-c194eddb3576 active     no
dcadf71a-6a73-4788-863e-5f8728efdb5b active     no

1. Create a storage pool with volumes (the more, the better)
2. Create a volume template XML:
root@cs-kvm003:~# cat >vol-disk-template.xml
<volume>
  <name>disk1.img</name>
  <capacity unit='M'>10</capacity>
  <allocation unit='M'>0</allocation>
  <target>
    <path>/mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img</path>
    <format type='raw'/>
  </target>
</volume>
3. Create a script for testing:
root@cs-kvm003:~# cat >test_storagepool.sh
#!/bin/bash
echo "-------------------------------------------------------------------"
date
virsh vol-delete /mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img
virsh vol-create 56bf9deb-5f00-363d-a2fe-c194eddb3576 vol-disk-template.xml
virsh pool-refresh 56bf9deb-5f00-363d-a2fe-c194eddb3576 &
rm -rf /mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img
4. Add it to crontab:
root@cs-kvm003:~# crontab -l
*/1 * * * * /root/test_storagepool.sh >>/tmp/test_storagepool.log 2>&1

Actual results:
After several minutes (depending on the number of volumes in the storage pool; more volumes make it faster), the errors appear.

--------test_storagepool.log------
error: failed to get vol '/mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img', specifying --pool might help
error: Storage volume not found: no storage vol with matching path
error: failed to get pool '56bf9deb-5f00-363d-a2fe-c194eddb3576'
error: Storage pool not found: no pool with matching name '56bf9deb-5f00-363d-a2fe-c194eddb3576'

--------/var/log/libvirt/libvirtd.log---------
2013-06-24 14:25:00.157+0000: 1712: error : virStorageBackendVolOpenCheckMode:1018 : cannot stat file '/mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img': No such file or directory
2013-06-24 14:25:00.162+0000: 1712: error : virCommandWait:2314 : internal error Child process (/bin/umount /mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576) status unexpected: exit status 16

-------The storage pool is missing------------
root@cs-kvm003:~# virsh pool-list
Name                                 State      Autostart
-----------------------------------------
ab3f8e2a-92c8-4dd8-ab0d-83e6f056918e active     no

-------The mountpoint for the pool still exists------------
root@cs-kvm004:~# df
Filesystem                           1K-blocks  Used       Available  Use% Mounted on
......
172.16.2.30:/storage/primary-vr-test 7811748864 498458624  7313290240 7%   /mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576
-----------------------------------------------------------

Expected results:
Vol /mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img deleted
Vol disk1.img created from vol-disk-template.xml
Pool 56bf9deb-5f00-363d-a2fe-c194eddb3576 refreshed

Additional info:
If a volume is removed while virsh pool-refresh is processing:
(1) in storage_backend.c, virStorageBackendVolOpenCheckMode returns -1;
(2) in storage_backend_fs.c, virStorageBackendProbeTarget returns -1;
(3) in storage_backend_fs.c, virStorageBackendFileSystemRefresh returns -1;
(4) in storage_driver.c, storagePoolRefresh runs stopPool and virStoragePoolObjRemove if refreshPool fails.
    if (backend->refreshPool(obj->conn, pool) < 0) {
        if (backend->stopPool)
            backend->stopPool(obj->conn, pool);

        pool->active = 0;

        if (pool->configFile == NULL) {
            virStoragePoolObjRemove(&driver->pools, pool);
            pool = NULL;
        }
        goto cleanup;
    }

There are some related issues in the Apache CloudStack community:
https://issues.apache.org/jira/browse/CLOUDSTACK-2780
https://issues.apache.org/jira/browse/CLOUDSTACK-2729
https://issues.apache.org/jira/browse/CLOUDSTACK-2893