Bug 1318803

Summary: some instances stuck in scheduling state
Product: Red Hat OpenStack Reporter: Jack Waterworth <jwaterwo>
Component: openstack-nova    Assignee: Eoghan Glynn <eglynn>
Status: CLOSED NOTABUG QA Contact: nlevinki <nlevinki>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 7.0 (Kilo)    CC: berrange, dasmith, eglynn, jeckersb, jwaterwo, kchamart, ndipanov, sbauza, sferdjao, sgordon, vromanso, yeylon
Target Milestone: ---    Keywords: Unconfirmed
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:    Environment:
Last Closed: 2016-03-31 13:03:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Jack Waterworth 2016-03-17 21:06:53 UTC
Description of problem:
Some instances are stuck in scheduling state, others spawn correctly.
nova show 2e39ab5a-d29f-4261-be33-aabe53e42e7f

+--------------------------------------+-----------------------------------------------------+
| Property                             | Value                                               |
+--------------------------------------+-----------------------------------------------------+
| OS-DCF:diskConfig                    | AUTO                                                |
| OS-EXT-AZ:availability_zone          | dmz                                                 |
| OS-EXT-SRV-ATTR:host                 | -                                                   |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | -                                                   |
| OS-EXT-SRV-ATTR:instance_name        | instance-0000044f                                   |
| OS-EXT-STS:power_state               | 0                                                   |
| OS-EXT-STS:task_state                | scheduling                                          |
| OS-EXT-STS:vm_state                  | building                                            |
| OS-SRV-USG:launched_at               | -                                                   |
| OS-SRV-USG:terminated_at             | -                                                   |
| accessIPv4                           |                                                     |
| accessIPv6                           |                                                     |
| config_drive                         |                                                     |
| created                              | 2016-03-17T14:30:00Z                                |
| flavor                               | m1.small (2)                                        |
| hostId                               |                                                     |
| id                                   | 2e39ab5a-d29f-4261-be33-aabe53e42e7f                |
| image                                | unbuntu14.04 (9c00e619-1582-4a0e-a2bd-fea70d9cab41) |
| key_name                             | bnp                                                 |
| metadata                             | {}                                                  |
| name                                 | instance-test                                       |
| os-extended-volumes:volumes_attached | []                                                  |
| progress                             | 0                                                   |
| status                               | BUILD                                               |
| tenant_id                            | 0dfba72e1bfe405b8a885290eb42cb72                    |
| updated                              | 2016-03-17T16:58:34Z                                |
| user_id                              | bcf9cdcfb1774e7587253ae9a62195ee                    |
+--------------------------------------+-----------------------------------------------------+

Version-Release number of selected component (if applicable):
openstack-nova-api-2015.1.1-1.el7ost.noarch
openstack-nova-cert-2015.1.1-1.el7ost.noarch
openstack-nova-common-2015.1.1-1.el7ost.noarch
openstack-nova-compute-2015.1.1-1.el7ost.noarch
openstack-nova-conductor-2015.1.1-1.el7ost.noarch
openstack-nova-console-2015.1.1-1.el7ost.noarch
openstack-nova-novncproxy-2015.1.1-1.el7ost.noarch
openstack-nova-scheduler-2015.1.1-1.el7ost.noarch
python-nova-2015.1.1-1.el7ost.noarch
python-novaclient-2.23.0-1.el7ost.noarch


How reproducible:
every time

Steps to Reproduce:
1. Attempt to spawn instances via Horizon.

Actual results:
An instance sometimes gets stuck in the scheduling state.


Expected results:
Instances should spawn correctly.


Additional info:
I attempted the following:

spawning 1 instance: sometimes worked, sometimes didn't
spawning 2 instances: first one stuck, the other worked
spawning 4 instances: first one stuck, the other 3 worked
spawning 10 instances: first one stuck, the other 9 worked

I'm not sure if that's a coincidence. The customer has multiple availability zones, but it doesn't seem to matter. I'm doing a simple spawn of an instance booting from an image with a network attached.

Comment 2 Jack Waterworth 2016-03-17 21:08:47 UTC
It looks like the nova scheduler picked compute-18 to spawn the instance. I'm getting an sosreport from that node now.

[jwaterwo@collab-shell 2016-03-17-11xxxx]$ find -name nova-api.log | xargs grep "Action: 'create'"
./va-controller-1.localdomain/var/log/nova/nova-api.log:2016-03-17 10:25:52.948 27780 DEBUG nova.api.openstack.wsgi [req-0c3f7640-102b-4cbe-9e92-c6ff7cbb01d0 e64fe36f9c614f798a8711b24ae66162 ef9184f8a90544c587036d90e9f7c362 - - -] Action: 'create', calling method: <bound method Controller.create of <nova.api.openstack.compute.servers.Controller object at 0x4a76cd0>>, body: {"server":{"flavorRef":"3","imageRef":"25b55299-575b-40f4-9aa6-29128be5cde1","name":"va-te-03-tr-prod","availability_zone":"dmz","key_name":"divanov","security_groups":[{"name":"default"}],"networks":[{"uuid":"b8f18c19-42ff-4f7e-864c-44d7f9572377"}]}} _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:780
./va-controller-2.localdomain/var/log/nova/nova-api.log:2016-03-17 10:29:59.656 31818 DEBUG nova.api.openstack.wsgi [req-984e4494-df21-4c9f-9b97-13b2a63a3014 bcf9cdcfb1774e7587253ae9a62195ee 0dfba72e1bfe405b8a885290eb42cb72 - - -] Action: 'create', calling method: <bound method Controller.create of <nova.api.openstack.compute.servers.Controller object at 0x4dcacd0>>, body: {"server": {"name": "instance-test", "imageRef": "9c00e619-1582-4a0e-a2bd-fea70d9cab41", "availability_zone": "dmz", "key_name": "bnp", "flavorRef": "2", "OS-DCF:diskConfig": "AUTO", "max_count": 1, "min_count": 1, "networks": [{"uuid": "d238e33c-ef74-4210-800f-e9b81b6b20b1"}], "security_groups": [{"name": "default"}]}} _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:780

[jwaterwo@collab-shell 2016-03-17-11xxxx]$ grep req-984e4494-df21-4c9f-9b97-13b2a63a3014 */var/log/ -rl
va-controller-1.localdomain/var/log/nova/nova-scheduler.log
va-controller-1.localdomain/var/log/nova/nova-conductor.log
va-controller-2.localdomain/var/log/nova/nova-api.log

2016-03-17 10:30:00.309 27961 DEBUG nova.scheduler.filter_scheduler [req-984e4494-df21-4c9f-9b97-13b2a63a3014 bcf9cdcfb1774e7587253ae9a62195ee 0dfba72e1bfe405b8a885290eb42cb72 - - -] Selected host: WeighedHost [host: (va-compute-18.localdomain, va-compute-18.localdomain) ram:183940 disk:4552704 io_ops:0 instances:5, weight: 1.0] _schedule /usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py:158
2016-03-17 10:30:00.312 27961 ERROR oslo_messaging._drivers.impl_rabbit [req-984e4494-df21-4c9f-9b97-13b2a63a3014 bcf9cdcfb1774e7587253ae9a62195ee 0dfba72e1bfe405b8a885290eb42cb72 - - -] AMQP server on 10.133.162.9:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-17 10:30:01.320 27961 INFO oslo_messaging._drivers.impl_rabbit [req-984e4494-df21-4c9f-9b97-13b2a63a3014 bcf9cdcfb1774e7587253ae9a62195ee 0dfba72e1bfe405b8a885290eb42cb72 - - -] Reconnected to AMQP server on 10.133.162.9:5672
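The trace above can be scripted. This is a minimal sketch, assuming the nova-scheduler and oslo.messaging line formats shown in the excerpt; the `trace_request` helper name and the truncated sample lines are illustrative, not part of the bug's tooling.

```python
import re

def trace_request(lines, request_id):
    """Report the scheduler-selected host and any AMQP errors for one request ID."""
    report = {"selected_host": None, "amqp_errors": []}
    for line in lines:
        if request_id not in line:
            continue
        # Matches: Selected host: WeighedHost [host: (va-compute-18.localdomain, ...)
        m = re.search(r"Selected host: WeighedHost \[host: \(([^,]+),", line)
        if m:
            report["selected_host"] = m.group(1)
        # Matches: AMQP server on 10.133.162.9:5672 is unreachable
        m = re.search(r"AMQP server on (\S+) is unreachable", line)
        if m:
            report["amqp_errors"].append(m.group(1))
    return report

# Sample lines abridged from the log excerpt above.
sample = [
    "2016-03-17 10:30:00.309 27961 DEBUG nova.scheduler.filter_scheduler "
    "[req-984e4494-df21-4c9f-9b97-13b2a63a3014 ...] Selected host: WeighedHost "
    "[host: (va-compute-18.localdomain, va-compute-18.localdomain) ram:183940]",
    "2016-03-17 10:30:00.312 27961 ERROR oslo_messaging._drivers.impl_rabbit "
    "[req-984e4494-df21-4c9f-9b97-13b2a63a3014 ...] AMQP server on "
    "10.133.162.9:5672 is unreachable: [Errno 32] Broken pipe.",
]
print(trace_request(sample, "req-984e4494-df21-4c9f-9b97-13b2a63a3014"))
# → {'selected_host': 'va-compute-18.localdomain', 'amqp_errors': ['10.133.162.9:5672']}
```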

Comment 3 John Eckersberg 2016-03-18 18:00:33 UTC
I'm curious whether the chosen compute host shows as DOWN after the instance gets stuck in scheduling. This might be the same as bug 1302387, given that elsewhere in the logs we saw connectivity temporarily lost to at least one of the rabbitmq nodes.
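One way to run that check is to look at `nova service-list` and flag compute services whose State column is "down". A minimal sketch follows; the sample table below is illustrative only, not output from this bug's environment.

```python
def down_computes(table_text):
    """Return hosts of nova-compute services reported as 'down'."""
    down = []
    for line in table_text.strip().splitlines():
        if not line.startswith("|"):
            continue  # skip the +---+ border rows
        cols = [c.strip() for c in line.strip("|").split("|")]
        # Expected columns: Id, Binary, Host, Zone, Status, State, Updated_at
        if len(cols) >= 6 and cols[1] == "nova-compute" and cols[5] == "down":
            down.append(cols[2])
    return down

# Illustrative sample in the usual novaclient table layout.
sample = """
+----+--------------+---------------------------+------+---------+-------+----------------------------+
| Id | Binary       | Host                      | Zone | Status  | State | Updated_at                 |
+----+--------------+---------------------------+------+---------+-------+----------------------------+
| 18 | nova-compute | va-compute-18.localdomain | dmz  | enabled | down  | 2016-03-17T14:30:00.000000 |
| 19 | nova-compute | va-compute-19.localdomain | dmz  | enabled | up    | 2016-03-17T14:30:05.000000 |
+----+--------------+---------------------------+------+---------+-------+----------------------------+
"""
print(down_computes(sample))  # → ['va-compute-18.localdomain']
```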

Comment 4 Sylvain Bauza 2016-03-24 16:41:18 UTC
Is the problem still occurring? That sounds like a transient problem with the message queue.