Bug 977706

Summary:

virsh pool-refresh will remove the pool if remove volume in the processing

Product:

Red Hat Enterprise Linux 6

Reporter:

Wei Zhou <ustcweizhou>

Component:

libvirt

Assignee:

Ján Tomko <jtomko>

Status:

CLOSED ERRATA

QA Contact:

Virtualization Bugs <virt-bugs>

Severity:

urgent

Docs Contact:

Priority:

unspecified

Version:

6.4

CC:

berrange, cwei, dyuan, lsoft-mso-pj, mzhan, nux, rbalakri, shyu, srinivas.avasarala, ustcweizhou, wido

Target Milestone:

Keywords:

Upstream

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

libvirt-0.10.2-32.el6

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2014-10-14 04:16:10 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1011600

Attachments:

Description	Flags
Bug 977706: virStorageBackendVolOpenCheckMode return -2 instead of -1 if volume file is missing	ustcweizhou: review+

Description Wei Zhou 2013-06-25 07:52:02 UTC

Description of problem:
virsh pool-refresh will remove the pool if remove volume in the processing

Version-Release number of selected component (if applicable):
[root@cs-kvm003 ~]# rpm -qa libvirt qemu-kvm kernel
kernel-2.6.32-358.2.1.el6.x86_64
libvirt-0.10.2-18.el6.x86_64
qemu-kvm-0.12.1.2-3.209.el6.4.x86_64


How reproducible:
100%

Steps to Reproduce:
I use two nodes for testing. One is for storage pool refresh, another one is removing volume.

Before testing
root@cs-kvm002:~# virsh pool-list
Name                 State      Autostart
-----------------------------------------
56bf9deb-5f00-363d-a2fe-c194eddb3576 active     no
dcadf71a-6a73-4788-863e-5f8728efdb5b active     no


1. create a storage pool with volumes (more is better)

2. create a volume template xml
root@cs-kvm003:~# cat >vol-disk-template.xml
<volume>
  <name>disk1.img</name>
  <capacity unit='M'>10</capacity>
  <allocation unit='M'>0</allocation>
  <target>
    <path>/mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img</path>
    <format type='raw'/>
  </target>
</volume>

3. create a script for testing
root@cs-kvm003:~# cat >test_storagepool.sh
#/bin/bash
echo "-------------------------------------------------------------------"
date
virsh vol-delete /mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img
virsh vol-create 56bf9deb-5f00-363d-a2fe-c194eddb3576 vol-disk-template.xml
virsh pool-refresh 56bf9deb-5f00-363d-a2fe-c194eddb3576 &
rm -rf /mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img

4. add to crontab
root@cs-kvm003:~# crontab -l
*/1 * * * * /root/test_storagepool.sh >>/tmp/test_storagepool.log 2>&1


Actual results:

After several minutes (depends on volumes in the storage pool, more is better), the erros will appear.

--------test_storagepool.log------
error: failed to get vol '/mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img', specifying --pool might help
error: Storage volume not found: no storage vol with matching path

error: failed to get pool '56bf9deb-5f00-363d-a2fe-c194eddb3576'
error: Storage pool not found: no pool with matching name '56bf9deb-5f00-363d-a2fe-c194eddb3576'
--------/var/log/libvirt/libvirtd.log---------
2013-06-24 14:25:00.157+0000: 1712: error : virStorageBackendVolOpenCheckMode:1018 : cannot stat file '/mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img': No such file or directory
2013-06-24 14:25:00.162+0000: 1712: error : virCommandWait:2314 : internal error Child process (/bin/umount /mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576) status unexpected: exit status 16

-------The storage pool is missing------------
root@cs-kvm003:~# virsh pool-list
Name                 State      Autostart
-----------------------------------------
ab3f8e2a-92c8-4dd8-ab0d-83e6f056918e active     no

-------The mountpoint for the pool still exists------------
root@cs-kvm004:~# df
Filesystem                            1K-blocks      Used  Available Use% Mounted on
......
172.16.2.30:/storage/primary-vr-test 7811748864 498458624 7313290240   7% /mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576


-----------------------------------------------------------

Expected results:
Vol /mnt/56bf9deb-5f00-363d-a2fe-c194eddb3576/disk1.img deleted
Vol disk1.img created from vol-disk-template.xml
Pool 56bf9deb-5f00-363d-a2fe-c194eddb3576 refreshed


Additional info:
If remove volume in the virsh pool-refresh processing, 
(1) in storage_backend.c, virStorageBackendVolOpenCheckMode return -1;
(2) in storage_backend_fs.c, virStorageBackendProbeTarget return -1;
(3) in storage_backend_fs.c, virStorageBackendFileSystemRefresh return -1;
(4) in storage_driver.c, storagePoolRefresh run stopPool and virStoragePoolObjRemove if refreshPool fails.

    if (backend->refreshPool(obj->conn, pool) < 0) {
        if (backend->stopPool)
            backend->stopPool(obj->conn, pool);

        pool->active = 0;

        if (pool->configFile == NULL) {
            virStoragePoolObjRemove(&driver->pools, pool);
            pool = NULL;
        }
        goto cleanup;
    }


There are some related issues in Apache Cloudstack community.
https://issues.apache.org/jira/browse/CLOUDSTACK-2780
https://issues.apache.org/jira/browse/CLOUDSTACK-2729
https://issues.apache.org/jira/browse/CLOUDSTACK-2893

Comment 2 Wei Zhou 2013-06-27 12:42:12 UTC

Considering this scenario:

We use NFS server as the source of storage pool. When we refresh pool and delete volume at the same time on multiple nodes (CloudStack do have this), the error will occur with great probability.

Comment 3 Daniel Berrangé 2013-06-27 13:52:16 UTC

> (4) in storage_driver.c, storagePoolRefresh run stopPool and
> virStoragePoolObjRemove if refreshPool fails.
> 
>     if (backend->refreshPool(obj->conn, pool) < 0) {
>         if (backend->stopPool)
>             backend->stopPool(obj->conn, pool);
> 
>         pool->active = 0;
> 
>         if (pool->configFile == NULL) {
>             virStoragePoolObjRemove(&driver->pools, pool);
>             pool = NULL;
>         }
>         goto cleanup;
>     }

Hmm, refreshPool should only return -1, if something truely serious went wrong. If a volume disappeared while in the middle of refreshing, this should not have caused it to return -1. It is supposed to simply skip volumes which disappear. Given your description I guess there is some part of the code which is not correctly skipping disappearing volumes.

Comment 4 Wei Zhou 2013-06-27 14:10:39 UTC

Exactly.

In storage_backend_fs.c, virStorageBackendFileSystemRefresh will goto cleanup if virStorageBackendProbeTarget return -1. This should be changed.

        if ((ret = virStorageBackendProbeTarget(&vol->target,
                                                &backingStore,
                                                &backingStoreFormat,
                                                &vol->allocation,
                                                &vol->capacity,
                                                &vol->target.encryption)) < 0) {
            if (ret == -2) {
                /* Silently ignore non-regular files,
                 * eg '.' '..', 'lost+found', dangling symbolic link */
                virStorageVolDefFree(vol);
                vol = NULL;
                continue;
            } else if (ret == -3) {
                /* The backing file is currently unavailable, its format is not
                 * explicitly specified, the probe to auto detect the format
                 * failed: continue with faked RAW format, since AUTO will
                 * break virStorageVolTargetDefFormat() generating the line
                 * <format type='...'/>. */
                backingStoreFormat = VIR_STORAGE_FILE_RAW;
            } else
                goto cleanup;
        }

Comment 5 Dave Allan 2013-06-27 14:47:07 UTC

(In reply to Wei Zhou from comment #4)
> In storage_backend_fs.c, virStorageBackendFileSystemRefresh will goto
> cleanup if virStorageBackendProbeTarget return -1. This should be changed.

Would you mind submitting such a patch to the upstream libvirt mailing list?  (I'm assuming you don't have a RH support contract since you're filing BZs directly rather than going through the support organization.  If you do have a contract, please contact support so that your request can be properly prioritized.)

Comment 6 Wei Zhou 2013-06-27 15:44:19 UTC

Created attachment 766220 [details]
Bug 977706: virStorageBackendVolOpenCheckMode return -2 instead of -1 if volume file is missing

virStorageBackendVolOpenCheckMode (in storage_backend.c) return -2 instead of -1 if volume file is missing, so that virStorageBackendProbeTarget (in storage_backend_fs.c) return -2 as well. virStorageBackendFileSystemRefresh (in storage_backend_fs.c) then skip the missing files.

Comment 7 Ján Tomko 2013-07-10 14:58:39 UTC

I've posted the patch from comment 6 to the upstream list:
https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html

Comment 9 Wido den Hollander 2013-07-23 11:35:42 UTC

(In reply to Jan Tomko from comment #7)
> I've posted the patch from comment 6 to the upstream list:
> https://www.redhat.com/archives/libvir-list/2013-July/msg00635.html

I see that you posted the patch (and a V2) on the mailinglist, but it hasn't been accepted yet.

Is there any ETA for this patch to make it into upstream?

Comment 10 Nux 2014-03-20 12:37:20 UTC

Hello,

This issue is causing the NFS storage pool to disappear under load, re-adding it requires stopping all VMs. 
The Cloudstack project seems to be bit by this problem and the direction thay propose is to just bypass libvirt. The _better_ solution for everyone would be to hace this fixed.

The contributed patches have not made it upstream and it's impacting deployments. Can anyone give it a kick?

More in this thread
http://www.mail-archive.com/dev@cloudstack.apache.org/msg25436.html

Comment 11 Ján Tomko 2014-03-20 16:06:36 UTC

I have sent a v4 of the patch upstream:
https://www.redhat.com/archives/libvir-list/2014-March/msg01286.html

Comment 12 Nux 2014-03-20 16:19:05 UTC

Thanks Jan, any sign of having it accepted?

Comment 13 Ján Tomko 2014-03-20 17:24:54 UTC

It is now pushed upstream:
commit ee640f444bbdc976bdaed305f0d64d241d275376
Author:     Ján Tomko <jtomko>
CommitDate: 2014-03-20 18:13:58 +0100

    Ignore missing files on pool refresh
    
    If we cannot stat/open a file on pool refresh, returning -1 aborts
    the refresh and the pool is undefined.
    
    Only treat missing files as fatal unless VolOpenCheckMode is called
    with the VIR_STORAGE_VOL_OPEN_ERROR flag. If this flag is missing
    (when it's called from virStorageBackendProbeTarget in
    virStorageBackendFileSystemRefresh), only emit a warning and return
    -2 to let the caller skip over the file.
                                                                                                                                 
    https://bugzilla.redhat.com/show_bug.cgi?id=977706                                
                                                                                                                                 
git describe: v1.2.2-281-gee640f4

Comment 14 Nux 2014-03-20 18:26:09 UTC

Jan,

Fantastic!
What do we need to do to have this backported in EL6?

Comment 16 Ján Tomko 2014-04-09 09:16:12 UTC

Downstream patch posted:
http://post-office.corp.redhat.com/archives/rhvirt-patches/2014-April/msg00204.html

Comment 17 Ján Tomko 2014-04-09 09:24:14 UTC

Simplified reproducer (with just one host):
1. Have a pool with a few volumes (at least 5)
2. Run virsh pool-refresh in a loop
3. Keep creating and deleting a volume without libvirt:
while true; do qemu-img create -f qcow2 img 5M; rm -f img; done

Without the fix, the pool-refresh fails after a few seconds with:
error: Requested operation is not valid: storage pool is not active

Comment 19 chhu 2014-04-25 05:56:15 UTC

Reproduced with libvirt-0.10.2-31.el6.x86_64,
Verified with packages:
libvirt-0.10.2-33.el6.x86_64
qemu-kvm-0.12.1.2-2.423.el6.x86_64

Test steps:
1. Have a default pool with more then 5 volumes

2. On terminal: A, run command:  
   #while true; do virsh pool-refresh default; sleep 1; done

3. On terminal: B, run command:
   # while true; do qemu-img create -f qcow2 test.img 5M; rm -f test.img; done

4. last for 30 minutes, no error on both terminals. The default pool is still active.
   # virsh pool-list --all| grep default
   default              active     yes

Test results:
current command work well.

Comment 20 Ján Tomko 2014-07-17 13:16:48 UTC

*** Bug 1115740 has been marked as a duplicate of this bug. ***

Comment 22 errata-xmlrpc 2014-10-14 04:16:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html