Bug 1594454
| Summary: | Glance does not protect qcow2 base images stored on Ceph | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Cody Swanson <cswanson> |
| Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute> |
| Status: | CLOSED NOTABUG | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 10.0 (Newton) | CC: | abishop, berrange, dasmith, eglynn, fpercoco, jhakimra, kchamart, mwitt, sbauza, sferdjao, sgordon, srevivo, tshefi, vromanso |
| Target Milestone: | --- | Flags: | tshefi: automate_bug- |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-13 14:16:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 1453896 [details]
Nova compute debug logs
Nova compute debug logs showing the problems that a missing Glance base image causes when snapshotting instances.
I do not know whether Glance can be aware that a given image is used by Nova. Maybe Nova could protect an image when it spawns an instance from it, and unprotect it once all instances using that image have been deleted? In any case, this looks like an upstream RFE, so I'll try to grab some core devs, ask them whether this behaviour is intentional, and see what can be done. Thanks for your report!

Cyril, thanks. This caused a snapshot performance issue for the customer because they had mistakenly removed the qcow2 base image the instances were spawned from while the instances were still running. Had they been using raw-format images, Glance would have complained that the image was in use and denied the removal. It took a while to figure out that this is why Nova was pulling the whole disk down to the compute node before re-uploading it to Glance, instead of just performing the snapshot via the RBD driver on Ceph as it normally does. I submitted this against Glance as that seemed like the most logical component, but maybe this is a Nova or KVM/RBD driver issue? Let me know if you need any further information.

So I talked to Abhishek, and it seems that Glance indeed has no way of knowing whether images are used by Nova. If this feature were to be implemented, it would have to be done on the Nova side. I'm therefore changing the component to openstack-nova; feel free to change it back if you think Abhishek and I are mistaken.

Updating the whiteboard based on Cyril's comment #4.

I don't think that use of qcow2 images is supported for Ceph with OpenStack [1]:

"Important: Ceph doesn't support QCOW2 for hosting a virtual machine disk. Thus if you want to boot virtual machines in Ceph (ephemeral backend or boot from volume), the Glance image format must be RAW."

[1] http://docs.ceph.com/docs/mimic/rbd/rbd-openstack

Because use of the QCOW2 image format is not supported by Ceph as a backend for OpenStack, I'm closing this as NOTABUG.
Here is additional product documentation explaining image format support for Ceph: https://access.redhat.com/solutions/2434691

Closed as NOTABUG, nothing for QE to test/automate.
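The behaviour described in the comments above can be summarized with a small sketch. This is not Nova's actual code; the function name and return values are illustrative. The point is that only a raw image stored in Ceph lets the instance disk be a copy-on-write clone of the Glance image, which is what enables the fast in-cluster RBD snapshot; a qcow2 base image (or a deleted one) forces the slow qemu-img path.

```python
def pick_snapshot_path(glance_disk_format: str, backend: str) -> str:
    """Illustrative decision, loosely modelling the behaviour reported in
    this bug. This is NOT Nova's actual implementation, just a sketch."""
    if backend == "rbd" and glance_disk_format == "raw":
        # The parent image lives in Ceph and the instance disk is a COW
        # clone of it, so the snapshot can stay inside the cluster.
        return "direct-rbd-snapshot"
    # qcow2 (or a missing base image) forces the slow path: qemu-img copies
    # the whole disk to the compute node, then re-uploads it to Glance.
    return "qemu-img-download-and-upload"

print(pick_snapshot_path("raw", "rbd"))    # direct-rbd-snapshot
print(pick_snapshot_path("qcow2", "rbd"))  # qemu-img-download-and-upload
```

This also explains the protection asymmetry the reporter saw: with raw images, RBD itself refuses to delete a parent that still has COW clones, while a qcow2 image has no clones in Ceph and so nothing blocks its removal.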
Created attachment 1453895 [details]
transcript of my reproducer lab

Description of problem:
When you spawn instances from qcow2 base images on OpenStack, it is possible to delete the image from Glance while instances instantiated from that image are still running on the overcloud. This causes nova snapshot to fall back to the qemu-img utility instead of the Ceph RBD driver to handle the snapshot. With raw images this is prevented because the image is locked by its linked clones. I'm trying to understand whether this is expected behaviour: should we be denying image removal for qcow2 base images while instances built on that base image still exist?

Version-Release number of selected component (if applicable):
I've replicated this behaviour on OpenStack 10z7 and 10z8. I have not had the chance to test it on RHOP 11+.

How reproducible:
Every time.

Steps to Reproduce:
1. Upload a qcow2 server image to Glance with a Ceph backend.
2. Spawn instances from the uploaded image.
3. Snapshot an instance and observe the nova-compute logs in debug.
4. Delete the image you just uploaded to Glance while the spawned instances are still running.
5. Snapshot an instance again and observe the nova-compute logs in debug.

Actual results:
We are able to remove the qcow2-formatted base image from Glance while instances are deployed. This causes snapshot operations to take around 10x longer, since libvirt uses qemu-img to copy the entire disk to the local compute node and then re-uploads it to Glance once complete. This can also cause snapshots to fail if the compute node does not have enough local disk space to accommodate a large image. I've seen this take a snapshot operation from 6 minutes to 60 minutes at a client's site with 100GB images. And of course, allowing the base image to be removed from Glance while instances built on it are still running will cause re-instantiation failures as well.
Expected results:
Shouldn't we be protecting images as long as there are instances deployed that use that image as a source?

Additional info:
I've replicated this in my RHOP10z8 lab environment and attached the details in a file.
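The requested protection could be sketched as pure logic like the following. The function name and data shapes are hypothetical, not an existing Nova or Glance API: before honouring an image-delete request, check whether any server still references the image, and refuse if so.

```python
def image_delete_allowed(image_id, servers):
    """Hypothetical guard for the behaviour requested in this bug: return
    (allowed, blocking_server_ids). `servers` is a list of dicts with 'id'
    and 'image_id' keys, as a server-listing call might return. This is a
    sketch, not an existing OpenStack API."""
    blockers = [s["id"] for s in servers if s.get("image_id") == image_id]
    return (not blockers, blockers)

# Usage with illustrative data: vm-1 still runs from img-qcow2, so
# deleting img-qcow2 would be denied, while an unused image is deletable.
servers = [
    {"id": "vm-1", "image_id": "img-qcow2"},
    {"id": "vm-2", "image_id": "img-raw"},
]
print(image_delete_allowed("img-qcow2", servers))  # (False, ['vm-1'])
print(image_delete_allowed("img-other", servers))  # (True, [])
```

As the comments above note, Glance alone cannot implement this check because it does not know which images Nova is using, which is why the component was moved to openstack-nova.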