Bug 1594454
| Summary: | Glance does not protect qcow2 base images stored on Ceph | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Cody Swanson <cswanson> |
| Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute> |
| Status: | CLOSED NOTABUG | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 10.0 (Newton) | CC: | abishop, berrange, dasmith, eglynn, fpercoco, jhakimra, kchamart, mwitt, sbauza, sferdjao, sgordon, srevivo, tshefi, vromanso |
| Target Milestone: | --- | Flags: | tshefi: automate_bug- |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-13 14:16:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 1453896 [details]
Nova compute debug logs
Nova compute debug logs showing the problems that a missing Glance base image causes when snapshotting instances.
I do not know whether Glance can be aware that a given image is used by Nova. Maybe Nova could protect an image when it spawns an instance from it, and unprotect it once all instances using that image have been deleted? In any case, this looks like an upstream RFE, so I'll try to grab some core devs, ask them whether this behaviour is intentional, and see what can be done. Thanks for your report!

Cyril, thanks. This caused a snapshot performance issue for the customer because they had mistakenly removed the qcow2 base image the instances were spawned from while the instances were still running. Had they been using raw-format images, Glance would have complained that the image was in use and denied the removal. It took a while to figure out that this is why Nova was pulling the whole disk down to the compute node before re-uploading it to Glance, instead of just performing the snapshot via the RBD driver on Ceph as it normally does. I submitted this against Glance as that seemed like the most logical component, but maybe this is a Nova or KVM/RBD driver issue? Let me know if you need any further information.

So I talked to Abhishek, and it seems that Glance indeed has no way of knowing whether images are used by Nova. If this feature were to be implemented, it would have to be done on the Nova side. I'm therefore changing the component to openstack-nova; feel free to change it back if you think Abhishek and I are mistaken.

Updating the whiteboard based on Cyril's comment #4.

I don't think that use of qcow2 images is supported for Ceph with OpenStack [1]:

"Important: Ceph doesn't support QCOW2 for hosting a virtual machine disk. Thus if you want to boot virtual machines in Ceph (ephemeral backend or boot from volume), the Glance image format must be RAW."

[1] http://docs.ceph.com/docs/mimic/rbd/rbd-openstack

Because use of the QCOW2 image format is not supported by Ceph as a backend for OpenStack, I'm closing this as NOTABUG.
Here is additional product documentation explaining image format support for Ceph: https://access.redhat.com/solutions/2434691

Closed as NOTABUG, nothing for QE to test/automate.
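The behaviour described in the comments above can be summarized with a small sketch. This is not Nova's actual code; the function name and return values are illustrative. The point is that only a raw image stored in Ceph lets the instance disk be a copy-on-write clone of the Glance image, which is what enables the fast in-cluster RBD snapshot; a qcow2 base image (or a deleted one) forces the slow qemu-img path.

```python
def pick_snapshot_path(glance_disk_format: str, backend: str) -> str:
    """Illustrative decision, loosely modelling the behaviour reported in
    this bug. This is NOT Nova's actual implementation, just a sketch."""
    if backend == "rbd" and glance_disk_format == "raw":
        # The parent image lives in Ceph and the instance disk is a COW
        # clone of it, so the snapshot can stay inside the cluster.
        return "direct-rbd-snapshot"
    # qcow2 (or a missing base image) forces the slow path: qemu-img copies
    # the whole disk to the compute node, then re-uploads it to Glance.
    return "qemu-img-download-and-upload"

print(pick_snapshot_path("raw", "rbd"))    # direct-rbd-snapshot
print(pick_snapshot_path("qcow2", "rbd"))  # qemu-img-download-and-upload
```

This also explains the protection asymmetry the reporter saw: with raw images, RBD itself refuses to delete a parent that still has COW clones, while a qcow2 image has no clones in Ceph and so nothing blocks its removal.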
Created attachment 1453895 [details]
transcript of my reproducer lab

Description of problem:
When you spawn instances from qcow2 base images on OpenStack, it is possible to delete the image from Glance while instances instantiated from that image are still running on the overcloud. This causes nova snapshot to fall back to the qemu-img utility instead of the Ceph RBD driver to handle the snapshot. With raw images this is prevented because the image is locked by its linked clones. I'm trying to understand whether this is expected behaviour: should we be denying image removal for qcow2 base images while instances built on that base image still exist?

Version-Release number of selected component (if applicable):
I've replicated this behaviour on OpenStack 10z7 and 10z8. I have not had the chance to test it on RHOP 11+.

How reproducible:
Every time.

Steps to Reproduce:
1. Upload a qcow2 server image to Glance with a Ceph backend.
2. Spawn instances from the uploaded image.
3. Snapshot an instance and observe the nova-compute logs in debug.
4. Delete the image you just uploaded to Glance while the spawned instances are still running.
5. Snapshot an instance again and observe the nova-compute logs in debug.

Actual results:
We are able to remove the qcow2-formatted base image from Glance while instances are deployed. This causes snapshot operations to take around 10x longer, since libvirt uses qemu-img to copy the entire disk to the local compute node and then re-uploads it to Glance once complete. This can also cause snapshots to fail if the compute node does not have enough local disk space to accommodate a large image. I've seen this take a snapshot operation from 6 minutes to 60 minutes at a client's site with 100GB images. And of course, allowing the base image to be removed from Glance while instances built on it are still running will cause re-instantiation failures as well.
Expected results:
Shouldn't we be protecting images as long as there are instances deployed that use that image as a source?

Additional info:
I've replicated this in my RHOP10z8 lab environment and attached the details in a file.
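The requested protection could be sketched as pure logic like the following. The function name and data shapes are hypothetical, not an existing Nova or Glance API: before honouring an image-delete request, check whether any server still references the image, and refuse if so.

```python
def image_delete_allowed(image_id, servers):
    """Hypothetical guard for the behaviour requested in this bug: return
    (allowed, blocking_server_ids). `servers` is a list of dicts with 'id'
    and 'image_id' keys, as a server-listing call might return. This is a
    sketch, not an existing OpenStack API."""
    blockers = [s["id"] for s in servers if s.get("image_id") == image_id]
    return (not blockers, blockers)

# Usage with illustrative data: vm-1 still runs from img-qcow2, so
# deleting img-qcow2 would be denied, while an unused image is deletable.
servers = [
    {"id": "vm-1", "image_id": "img-qcow2"},
    {"id": "vm-2", "image_id": "img-raw"},
]
print(image_delete_allowed("img-qcow2", servers))  # (False, ['vm-1'])
print(image_delete_allowed("img-other", servers))  # (True, [])
```

As the comments above note, Glance alone cannot implement this check because it does not know which images Nova is using, which is why the component was moved to openstack-nova.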