Bug 1024003

Summary: nova: boot of instance is stuck in scheduling forever if we cannot copy image to swift backend during boot of instance
Product: Red Hat OpenStack Reporter: Dafna Ron <dron>
Component: openstack-nova Assignee: Nikola Dipanov <ndipanov>
Status: CLOSED WORKSFORME QA Contact: Ami Jeain <ajeain>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.0 CC: dallan, dron, eglynn, hateya, ndipanov, yeylon
Target Milestone: ---   
Target Release: 4.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-11-20 18:25:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs none

Description Dafna Ron 2013-10-28 14:46:51 UTC
Created attachment 816824
logs

Description of problem:

I am working with a swift backend.
My swift data server does not have enough disk space, but when I create an image with --location, the image is created locally and is only copied to swift when we boot an instance.

When I boot an instance, it gets stuck in scheduling forever.

Version-Release number of selected component (if applicable):

[root@opens-vdsb ~(keystone_admin)]# rpm -qa |grep swift 
openstack-swift-plugin-swift3-1.0.0-0.20120711git.1.el6ost.noarch
openstack-swift-1.9.1-2.el6ost.noarch
openstack-swift-proxy-1.9.1-2.el6ost.noarch
python-swiftclient-1.6.0-1.el6ost.noarch
[root@opens-vdsb ~(keystone_admin)]# rpm -qa |grep glance 
python-glance-2013.2-0.11.b3.el6ost.noarch
openstack-glance-2013.2-0.11.b3.el6ost.noarch
python-glanceclient-0.10.0-1.el6ost.noarch
[root@opens-vdsb ~(keystone_admin)]# rpm -qa |grep nova
openstack-nova-common-2013.2-0.24.rc1.el6ost.noarch
openstack-nova-network-2013.2-0.24.rc1.el6ost.noarch
python-novaclient-2.15.0-1.el6ost.noarch
python-nova-2013.2-0.24.rc1.el6ost.noarch
openstack-nova-api-2013.2-0.24.rc1.el6ost.noarch
openstack-nova-console-2013.2-0.24.rc1.el6ost.noarch
openstack-nova-compute-2013.2-0.24.rc1.el6ost.noarch
openstack-nova-conductor-2013.2-0.24.rc1.el6ost.noarch
openstack-nova-novncproxy-2013.2-0.24.rc1.el6ost.noarch
openstack-nova-scheduler-2013.2-0.24.rc1.el6ost.noarch
openstack-nova-cert-2013.2-0.24.rc1.el6ost.noarch

How reproducible:

100%

Steps to Reproduce:
1. Install openstack with swift and make sure your data server does not have much free space
2. Create an image using --location
3. Boot an instance

Actual results:

instance is stuck in scheduling forever 

Expected results:

1. If we have a problem copying the image to swift, the boot of the instance should fail.
2. The instance should time out if it is stuck in scheduling.
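
The second expectation can be illustrated with a minimal client-side sketch. It assumes a hypothetical `get_state` callable that returns the instance state as a string; the function name, the exception class, and the default timeout values are illustrative and are not taken from nova.

```python
import time


class SchedulingTimeout(Exception):
    """Raised when an instance stays in a transitional state too long."""


def wait_for_boot(get_state, timeout=300.0, interval=5.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until the instance is active, fails, or times out.

    get_state is a hypothetical callable returning a state string such as
    "scheduling", "building", "active", or "error".
    """
    deadline = clock() + timeout
    while True:
        state = get_state().lower()
        if state == "active":
            return state
        if state == "error":
            # The boot failed outright, which is expectation 1 above.
            raise RuntimeError("instance boot failed")
        if clock() >= deadline:
            # Still in a transitional state: give up instead of waiting forever.
            raise SchedulingTimeout(
                "instance still in %r after %.0f seconds" % (state, timeout))
        sleep(interval)
```

A caller would pass a function that looks up the current state of the instance. The clock and sleep parameters exist only so the loop can be exercised without real waiting.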

Additional info: logs

Comment 1 Nikola Dipanov 2013-10-29 17:50:47 UTC
This seems like a glance issue to me; however, the bug report is incomplete to the point that I cannot tell what is going on here.

No exact command line is provided, and it is very difficult to deduce it from the attached httpd logs, so I can only guess what the reporter actually attempted. Only partial logs are provided: Nova scheduler logs are not helpful on their own, as the boot process includes several services (api, scheduler, compute, conductor), and we need complete logs to get the whole picture.

Here are a few possible scenarios based on the above:

* 'glance image-create' call with --location returned without an error even though it should have errored out.

* Image creation does not report issues back to Horizon correctly.

* Nova does not handle certain glance errors properly and does not error out the instance (no way to know since we don't have compute or API logs).

* Horizon fails to report some nova API errors back to the user properly.

To name only a few. It would be very helpful if the reporter could provide full logs and also a more detailed description of how Horizon reacted both when attempting to create the image and when booting the instance.
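
Once logs from all of the services are available, they could be lined up per instance with something like the sketch below. It assumes the log lines have already been read into a dictionary keyed by service name, and that every line starts with a timestamp in one consistent, sortable format; the function name and the data layout are hypothetical.

```python
def correlate_logs(logs_by_service, instance_id):
    """Return (service, line) pairs that mention instance_id.

    logs_by_service maps a service name (for example "api" or "scheduler")
    to an iterable of log lines. The result is sorted by line text, which
    is chronological only under the assumption that every line starts with
    a timestamp in the same sortable format.
    """
    matches = []
    for service, lines in logs_by_service.items():
        for line in lines:
            if instance_id in line:
                matches.append((service, line.rstrip("\n")))
    # Sort on the line text so entries from different services interleave.
    return sorted(matches, key=lambda pair: pair[1])
```

Reading the merged output from top to bottom shows the last service that handled the request, which narrows down where the boot stalled.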

Comment 5 Nikola Dipanov 2013-10-31 15:19:04 UTC
I've tried to reproduce it with the help of @Dafna and we were not able to do it.

There are some suspicions that this might be related to the lack of space on the hypervisor node. If that is the case - I'd say that this is not really a bug since we do not guarantee any kind of graceful failures when nova runs out of basic resources.

I will leave a needinfo on @Dafna until this can be confirmed.

Comment 6 Dafna Ron 2013-11-01 09:25:14 UTC
We have a problem: if I create the image with --location rather than --copy-from, the image is created locally and we do not try to upload it to the store at all (which I think is a bug, but not in nova).
The original issue was trying to boot an instance when the image cannot be uploaded to swift because of space issues on the swift data servers, not on the local host.

Comment 7 Nikola Dipanov 2013-11-13 16:54:27 UTC
After a conversation with Dafna it seems that the crux of the issue is that there are cases when booting an instance can fail and leave it in the SCHEDULING state. This is indeed an issue we need to address (if it is in fact real).

Based on the conversation it seems we need to change the title of this bug as it does not seem to be related to Swift at all.

Since it is not clear to me how this issue can be reproduced - I will leave a needinfo flag on this, and once we get a reproducer - adjust the title to reflect what the bug is about.

Comment 8 Dafna Ron 2013-11-20 18:25:31 UTC
I tried to reproduce this in the latest build, but the instances no longer get stuck in scheduling.
Closing this bug. If we encounter this issue again, we will reopen it.