Bug 1024003 - nova: boot of instance is stuck in scheduling forever if we cannot copy image to swift backend during boot of instance
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
x86_64 Linux
Priority: unspecified, Severity: high
: ---
: 4.0
Assigned To: Nikola Dipanov
Ami Jeain
Depends On:
Reported: 2013-10-28 10:46 EDT by Dafna Ron
Modified: 2016-04-26 15:24 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2013-11-20 13:25:31 EST
Type: Bug

Attachments
logs (35.62 KB, application/x-gzip)
2013-10-28 10:46 EDT, Dafna Ron

External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1245519 None None None Never

Description Dafna Ron 2013-10-28 10:46:51 EDT
Created attachment 816824

Description of problem:

I am working with a Swift backend.
My Swift data server does not have enough disk space, but when I create an image with --location the image is created locally and is only copied to Swift when an instance is booted.

When I boot an instance, it gets stuck in scheduling forever.

Version-Release number of selected component (if applicable):

[root@opens-vdsb ~(keystone_admin)]# rpm -qa |grep swift 
[root@opens-vdsb ~(keystone_admin)]# rpm -qa |grep glance 
[root@opens-vdsb ~(keystone_admin)]# rpm -qa |grep nova

How reproducible:


Steps to Reproduce:
1. Install OpenStack with Swift and make sure the Swift data server does not have a lot of free space.
2. Create an image using --location.
3. Boot an instance.

Actual results:

The instance is stuck in scheduling forever.

Expected results:

1. If there is a problem copying the image to Swift, the boot of the instance should fail.
2. An instance that is stuck in scheduling should time out.
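
Expected result 2 amounts to a bounded wait on the instance state. The following is a minimal client-side sketch of that idea, not nova's actual code. `get_state` is a hypothetical callable supplied by the caller (for example, a wrapper around whatever API call reports the instance's task state), and the string "scheduling" and the default timeout are assumptions made for illustration.

```python
import time


class SchedulingTimeout(Exception):
    """Raised when an instance does not leave the scheduling state in time."""


def wait_until_scheduled(get_state, timeout=300.0, interval=5.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns something other than "scheduling".

    get_state is a hypothetical zero-argument callable provided by the
    caller; it is not part of any nova API. Returns the first state that
    is not "scheduling", or raises SchedulingTimeout once the deadline
    has passed.
    """
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state != "scheduling":
            return state
        if clock() >= deadline:
            raise SchedulingTimeout(
                "instance still in scheduling state after %.0f s" % timeout)
        sleep(interval)
```

A caller would treat `SchedulingTimeout` as a failed boot instead of waiting indefinitely.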

Additional info: logs
Comment 1 Nikola Dipanov 2013-10-29 13:50:47 EDT
This seems like a glance issue to me; however, the bug report is incomplete to the point that I cannot tell what is going on here.

No exact command line is provided, and it is very difficult to deduce it from the attached httpd logs, so I can only guess what the reporter actually attempted. Only partial logs are provided: the Nova scheduler logs are not helpful on their own, because the boot process involves several services (api, scheduler, compute, conductor), and we need complete logs to get the whole picture.

Here are a few possible scenarios based on the above:

* 'glance image-create' call with --location returned without an error even though it should have errored out.

* Image creation does not report errors back to Horizon correctly.

* Nova does not handle certain glance errors properly and does not error out the instance (no way to know since we don't have compute or API logs).

* Horizon fails to report some nova API errors back to the user properly.

To name only a few. It would be very helpful if the reporter could provide full logs and a more detailed description of how Horizon reacted both when creating the image and when booting the instance.
Comment 5 Nikola Dipanov 2013-10-31 11:19:04 EDT
I've tried to reproduce it with the help of @Dafna and we were not able to do it.

There are some suspicions that this might be related to the lack of space on the hypervisor node. If that is the case - I'd say that this is not really a bug since we do not guarantee any kind of graceful failures when nova runs out of basic resources.

I will leave a needinfo on @Dafna until this can be confirmed.
Comment 6 Dafna Ron 2013-11-01 05:25:14 EDT
The problem is that if I create the image with --location rather than --copy-from, the image is created locally and no upload to the store is attempted at all (which I think is a bug, but not in nova).
The original issue was booting an instance when the image cannot be uploaded to Swift because of space issues on the Swift data servers, not on the local host.
Comment 7 Nikola Dipanov 2013-11-13 11:54:27 EST
After a conversation with Dafna it seems that the crux of the issue is that there are cases when booting an instance can fail and leave it in the SCHEDULING state. This is indeed an issue we need to address (if it is in fact real).

Based on the conversation it seems we need to change the title of this bug as it does not seem to be related to Swift at all.

Since it is not clear to me how this issue can be reproduced - I will leave a needinfo flag on this, and once we get a reproducer - adjust the title to reflect what the bug is about.
Comment 8 Dafna Ron 2013-11-20 13:25:31 EST
I tried to reproduce this in the latest build, but the instances no longer get stuck in scheduling.
Closing this bug - if we encounter this issue again we will reopen it.
