Bug 1056037

Summary:

GlusterFS Snapshot delete of attached volume fails if it runs > 10 minutes

Product:

Red Hat OpenStack

Reporter:

Yogev Rabl <yrabl>

Component:

openstack-cinder

Assignee:

Eric Harney <eharney>

Status:

CLOSED UPSTREAM

QA Contact:

Dafna Ron <dron>

Severity:

urgent

Docs Contact:

Priority:

urgent

Version:

4.0

CC:

bkopilov, eharney, scohen, yeylon

Target Milestone:

---

Keywords:

TestBlocker, ZStream

Target Release:

6.0 (Juno)

Hardware:

All

OS:

All

Whiteboard:

storage

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Cause: Cinder has a fixed timeout for GlusterFS driver snapshot create and delete operations.
Consequence: If a snapshot create/delete operation takes longer than 10 minutes to complete, Cinder will fail it even if it is still working correctly.
Fix: Have Nova send Cinder updates during the process so it knows that the job is still active.
Result: Snapshot operations can take as long as required without timing out as long as activity is still reported.

Story Points:

---

Clone Of:

Clones:

1066167, 1078975

Environment:

Last Closed:

2014-10-09 13:27:29 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1066167

Bug Blocks:

1033652, 1040711, 1045196

Attachments:

Description	Flags
the cinder & compute logs	none

Description Yogev Rabl 2014-01-21 12:53:27 UTC

Created attachment 853217
the cinder & compute logs

Description of problem:
While reproducing Bug 1033652, Cinder was unable to delete the snapshots of the volume attached to the instance.

The system was installed with the GlusterFS back end configured in the Packstack answer file.
Both the Cinder and the Nova Compute servers had the following FUSE packages installed:
fuse-libs-2.8.3-4.el6.x86_64
glusterfs-fuse-3.4.0.57rhs-1.el6_5.x86_64
fuse-2.8.3-4.el6.x86_64

SELinux was configured as follows:
# getsebool virt_use_fusefs
virt_use_fusefs --> on

Steps performed:
1. Created a volume from an image: 
# cinder create --image-id 52572739-a5e7-4232-a184-e267934cdd15 30
+---------------------+--------------------------------------+
|       Property      |                Value                 |
+---------------------+--------------------------------------+
|     attachments     |                  []                  |
|  availability_zone  |                 nova                 |
|       bootable      |                false                 |
|      created_at     |      2014-01-21T12:29:52.175178      |
| display_description |                 None                 |
|     display_name    |                 None                 |
|          id         | 83fc7617-7a95-4c6b-b631-28bbf991c120 |
|       image_id      | 52572739-a5e7-4232-a184-e267934cdd15 |
|       metadata      |                  {}                  |
|         size        |                  30                  |
|     snapshot_id     |                 None                 |
|     source_volid    |                 None                 |
|        status       |               creating               |
|     volume_type     |                 None                 |
+---------------------+--------------------------------------+

2. Launched an instance named 'verify_bug' from the volume.
3. Created a snapshot of the instance, named 'verify_bug_snap':
# cinder snapshot-list
+--------------------------------------+--------------------------------------+----------------+------------------------------+------+
|                  ID                  |              Volume ID               |     Status     |         Display Name         | Size |
+--------------------------------------+--------------------------------------+----------------+------------------------------+------+
| 84c59525-63a9-4ebb-9125-e26e97bc1f51 | 83fc7617-7a95-4c6b-b631-28bbf991c120 |   available    | snapshot for verify_bug_snap |  30  |
+--------------------------------------+--------------------------------------+----------------+------------------------------+------+
From the nova compute server:
# ll /var/lib/nova/mnt/600bd85f165b39eac20b9779f0281317
-rw-rw-rw-. 1 qemu qemu 32212254720 Jan 21 14:35 volume-83fc7617-7a95-4c6b-b631-28bbf991c120
-rw-r--r--. 1 qemu qemu     7602176 Jan 21  2014 volume-83fc7617-7a95-4c6b-b631-28bbf991c120.84c59525-63a9-4ebb-9125-e26e97bc1f51
-rw-r--r--. 1  165  165         223 Jan 21 14:36 volume-83fc7617-7a95-4c6b-b631-28bbf991c120.info

The content of the info file is: 
# cat /var/lib/nova/mnt/600bd85f165b39eac20b9779f0281317/volume-83fc7617-7a95-4c6b-b631-28bbf991c120.info
{
 "84c59525-63a9-4ebb-9125-e26e97bc1f51": "volume-83fc7617-7a95-4c6b-b631-28bbf991c120.84c59525-63a9-4ebb-9125-e26e97bc1f51",
 "active": "volume-83fc7617-7a95-4c6b-b631-28bbf991c120.84c59525-63a9-4ebb-9125-e26e97bc1f51"
}

4. Deleted the snapshot:
# cinder snapshot-delete 84c59525-63a9-4ebb-9125-e26e97bc1f51
# cinder snapshot-list
+--------------------------------------+--------------------------------------+----------------+------------------------------+------+
|                  ID                  |              Volume ID               |     Status     |         Display Name         | Size |
+--------------------------------------+--------------------------------------+----------------+------------------------------+------+
| 84c59525-63a9-4ebb-9125-e26e97bc1f51 | 83fc7617-7a95-4c6b-b631-28bbf991c120 |    deleting    | snapshot for verify_bug_snap |  30  |
+--------------------------------------+--------------------------------------+----------------+------------------------------+------+
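
For anyone inspecting a similar setup, the .info file shown in step 3 can be read with a few lines of Python. This is only a sketch based on the file content pasted above: it assumes the file is a JSON object that maps each snapshot ID to a file name and has an "active" key naming the file currently in use. The function names are made up for illustration and are not part of Cinder.

```python
import json


def active_image(info_text):
    """Return the file name recorded under the "active" key."""
    return json.loads(info_text)["active"]


def snapshot_files(info_text):
    """Return a {snapshot_id: file_name} dict, leaving out the "active" entry."""
    info = json.loads(info_text)
    return {key: name for key, name in info.items() if key != "active"}
```

With the content shown in step 3, active_image() returns the same file name that is recorded for snapshot 84c59525-63a9-4ebb-9125-e26e97bc1f51, meaning the instance is writing to the snapshot file rather than to the base volume file.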


Version-Release number of selected component (if applicable):
python-novaclient-2.15.0-2.el6ost.noarch
python-nova-2013.2.1-2.el6ost.noarch
openstack-nova-compute-2013.2.1-2.el6ost.noarch
openstack-nova-common-2013.2.1-2.el6ost.noarch
libvirt-client-0.10.2-29.el6_5.2.x86_64
libvirt-0.10.2-29.el6_5.2.x86_64
libvirt-python-0.10.2-29.el6_5.2.x86_64
python-cinderclient-1.0.7-2.el6ost.noarch
openstack-cinder-2013.2.1-5.el6ost.noarch
python-cinder-2013.2.1-5.el6ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Create a volume from an image
2. Boot an instance from the volume
3. Create a snapshot from the instance.
4. Delete the snapshot

Actual results:
The snapshot deletion gets stuck, and if it is interrupted the snapshot moves to the error state, so the user cannot delete the volume either.

Expected results:
The user can delete the snapshot.

Additional info:

The Cinder and compute logs are attached.

Comment 2 Yogev Rabl 2014-01-21 12:55:03 UTC

This bug blocks the following bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1033652
https://bugzilla.redhat.com/show_bug.cgi?id=1040711

Comment 3 Eric Harney 2014-01-21 18:57:22 UTC

The basic problem here is that Cinder has a fixed timeout when waiting for snapshot_delete operations on the Nova side to complete.  If they take too long (even when things are functioning correctly), Cinder will prematurely fail the operation.

To fix this, we need to have Nova send back updates of the job's percent complete while the block job is in progress.  Cinder can then reset its timeout window based on these updates.  (This should be doable without changing how the APIs work between Cinder and Nova today.)
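
As a rough illustration of the idea (not the actual Cinder code), a wait loop that restarts its timeout window whenever progress is reported could look like the sketch below. Everything named here is hypothetical: poll_status stands in for however Cinder would learn the job state, and the (status, percent) return shape, the status strings, and the 600-second default are assumptions made for the example.

```python
import time


def wait_for_snapshot_job(poll_status, timeout=600, interval=1.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Wait for a job to finish, restarting the timeout on each progress update.

    poll_status is a hypothetical callable returning (status, percent), where
    status is "done", "error", or anything else while the job is still running.
    """
    deadline = clock() + timeout
    last_percent = None
    while True:
        status, percent = poll_status()
        if status == "done":
            return True
        if status == "error":
            raise RuntimeError("snapshot job reported an error")
        # Progress advanced: the job is alive, so open a fresh timeout window.
        if percent is not None and (last_percent is None or percent > last_percent):
            last_percent = percent
            deadline = clock() + timeout
        # Fail only when a whole window passes without any reported progress.
        if clock() >= deadline:
            raise TimeoutError("no progress reported within %s seconds" % timeout)
        sleep(interval)
```

With this shape, a job that runs well past 10 minutes still succeeds as long as its percent complete keeps advancing, while a job that stops reporting progress fails after one timeout window.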


For testing in the meantime:
The longest operations are when deleting the only snapshot that exists, because in that case the whole base disk image has to be copied into the snapshot file.  Deletions of snapshots when other snapshots exist should be much quicker, which will let you avoid this bug while testing other pieces of this feature.

Comment 5 Dafna Ron 2014-05-27 11:24:13 UTC

*** Bug 1101504 has been marked as a duplicate of this bug. ***

Comment 8 Sean Cohen 2014-10-09 13:27:29 UTC

This likely indicates the use of a libvirt version that had known bugs in this area. Closing pending further information on reproduction.