Description of problem:
The original glance image used to create an instance was deleted. Normally this is not a problem: the instance can still be booted because nova's ImageCache mechanism keeps a cached copy of the image on the compute node in /var/lib/nova/instances/_base/*. However, if that directory is lost (for example, due to a storage system failure), the cached image is gone and the original image has already been deleted from glance, so the instance can no longer boot.

Error seen when trying to boot the instance (/nova/nova-compute.log):

2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/image/api.py", line 182, in download
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher     dst_path=dest_path)
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/image/glance.py", line 351, in download
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher     _reraise_translated_image_exception(image_id)
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/image/glance.py", line 349, in download
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher     image_chunks = self._client.call(context, 1, 'data', image_id)
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/nova/image/glance.py", line 218, in call
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher     return getattr(client.images, method)(*args, **kwargs)
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/glanceclient/v1/images.py", line 143, in data
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher     % urlparse.quote(str(image_id)))
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/glanceclient/common/http.py", line 262, in get
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher     return self._request('GET', url, **kwargs)
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/glanceclient/common/http.py", line 230, in _request
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher     raise exc.from_response(resp, resp.text)
2016-02-18 11:21:56.770 3275 TRACE oslo_messaging.rpc.dispatcher ImageNotFound: Image d7066af1-2e0c-4028-8a47-4e69f8a7a9b6 could not be found

Version-Release number of selected component (if applicable):
openstack-nova-api-2015.1.2-7.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Boot an instance using a glance image.
2. Delete the glance image after the instance has booted.
3. Simulate a storage failure by deleting the cached image /var/lib/nova/instances/_base/<cache-image-uuid> (see the note below on how the cache filename is derived).
4. Reboot the instance and note that it fails to boot, with ImageNotFound errors in nova-compute.log.

Actual results:
The instance fails to boot.

Expected results:
The instance boots.

Additional info:
The following KCS article describes how to recover if you have a backup of the original glance image used to create the instance:
https://access.redhat.com/solutions/2172601
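Note on locating the cached base file: in the libvirt driver of this era, the file under _base is named after the SHA-1 hex digest of the glance image UUID (resized copies get a "_<size>" suffix), not the UUID itself. The snippet below is only a sketch of that naming convention for finding the file to delete in step 3; it is not a stable interface.

import hashlib
import os

def cached_base_path(image_id, instances_path='/var/lib/nova/instances'):
    # Assumes the cache filename is sha1(image UUID); resized copies
    # would carry an additional "_<size>" suffix.
    return os.path.join(instances_path, '_base',
                        hashlib.sha1(image_id.encode()).hexdigest())

# e.g. for the image in the traceback above:
print(cached_base_path('d7066af1-2e0c-4028-8a47-4e69f8a7a9b6'))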
// From engineering:
Reviewing the code, this looks like a race between the storage coming back online, each compute registering as a user of the shared storage, and finally an image cache update being run by each compute.

https://github.com/openstack/nova/blob/master/nova/virt/storage_users.py#L76

At present, if a given compute has not registered as a user of the instance store for 24 hours, it and any instances previously running on it are not considered by the next cache update. As a result, any images cached for instances on these hosts will be removed, since they no longer appear to be in use by any undeleted instances. Checking whether the storage was lost for 24 hours.
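For context, a simplified sketch of the register/expire behaviour described above. This is not nova's actual implementation (the real code in nova/virt/storage_users.py serialises access to the tracking file with a lock); the file name and format here are illustrative assumptions.

import json
import os
import time

TWENTY_FOUR_HOURS = 24 * 60 * 60

def register_storage_use(storage_path, hostname):
    # Each compute periodically records a heartbeat timestamp for itself
    # in a tracking file at the root of the shared instance store.
    tracking_file = os.path.join(storage_path, 'compute_nodes')
    nodes = {}
    if os.path.exists(tracking_file):
        with open(tracking_file) as f:
            nodes = json.load(f)
    nodes[hostname] = time.time()
    with open(tracking_file, 'w') as f:
        json.dump(nodes, f)

def get_storage_users(storage_path):
    # Only hosts seen within the last 24 hours count as users of the
    # store. A host (and its instances) that could not check in for
    # longer, e.g. during a >24h storage outage, drops off this list,
    # so the next cache update sees its base images as unused and
    # removes them.
    tracking_file = os.path.join(storage_path, 'compute_nodes')
    if not os.path.exists(tracking_file):
        return []
    with open(tracking_file) as f:
        nodes = json.load(f)
    cutoff = time.time() - TWENTY_FOUR_HOURS
    return [host for host, last_seen in nodes.items() if last_seen > cutoff]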
Update from customer: The storage was down for 24hrs.
(In reply to Jeremy from comment #2)
> Update from customer:
>
> The storage was down for 24hrs.

Thanks, that shows the working theory documented in c#1 is possibly valid, but I'd still like to reproduce it or confirm with logs from the customer.
This bug was accidentally moved from POST to MODIFIED via an error in automation; please see mmccune with any questions.
Closing with INSUFFICIENT_DATA: I've been unable to reproduce this, and the imagebackend is now being heavily refactored upstream. Happy to reopen if we see this again and have logs.
Closed without a fix, therefore QE won't automate.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days