Description of problem:
Some instances get stuck in the scheduling state; others spawn correctly.

nova show 2e39ab5a-d29f-4261-be33-aabe53e42e7f
+--------------------------------------+-----------------------------------------------------+
| Property                             | Value                                               |
+--------------------------------------+-----------------------------------------------------+
| OS-DCF:diskConfig                    | AUTO                                                |
| OS-EXT-AZ:availability_zone          | dmz                                                 |
| OS-EXT-SRV-ATTR:host                 | -                                                   |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | -                                                   |
| OS-EXT-SRV-ATTR:instance_name        | instance-0000044f                                   |
| OS-EXT-STS:power_state               | 0                                                   |
| OS-EXT-STS:task_state                | scheduling                                          |
| OS-EXT-STS:vm_state                  | building                                            |
| OS-SRV-USG:launched_at               | -                                                   |
| OS-SRV-USG:terminated_at             | -                                                   |
| accessIPv4                           |                                                     |
| accessIPv6                           |                                                     |
| config_drive                         |                                                     |
| created                              | 2016-03-17T14:30:00Z                                |
| flavor                               | m1.small (2)                                        |
| hostId                               |                                                     |
| id                                   | 2e39ab5a-d29f-4261-be33-aabe53e42e7f                |
| image                                | unbuntu14.04 (9c00e619-1582-4a0e-a2bd-fea70d9cab41) |
| key_name                             | bnp                                                 |
| metadata                             | {}                                                  |
| name                                 | instance-test                                       |
| os-extended-volumes:volumes_attached | []                                                  |
| progress                             | 0                                                   |
| status                               | BUILD                                               |
| tenant_id                            | 0dfba72e1bfe405b8a885290eb42cb72                    |
| updated                              | 2016-03-17T16:58:34Z                                |
| user_id                              | bcf9cdcfb1774e7587253ae9a62195ee                    |
+--------------------------------------+-----------------------------------------------------+

Version-Release number of selected component (if applicable):
openstack-nova-api-2015.1.1-1.el7ost.noarch
openstack-nova-cert-2015.1.1-1.el7ost.noarch
openstack-nova-common-2015.1.1-1.el7ost.noarch
openstack-nova-compute-2015.1.1-1.el7ost.noarch
openstack-nova-conductor-2015.1.1-1.el7ost.noarch
openstack-nova-console-2015.1.1-1.el7ost.noarch
openstack-nova-novncproxy-2015.1.1-1.el7ost.noarch
openstack-nova-scheduler-2015.1.1-1.el7ost.noarch
python-nova-2015.1.1-1.el7ost.noarch
python-novaclient-2.23.0-1.el7ost.noarch

How reproducible:
Every time.

Steps to Reproduce:
1. Attempt to spawn instances via Horizon.
Actual results:
The instance sometimes gets stuck in the scheduling state.

Expected results:
The instance should spawn correctly.

Additional info:
I attempted the following:
- spawning 1 instance: worked/didn't work
- spawning 2 instances: first one stuck, the other worked
- spawning 4 instances: first one stuck, the other 3 worked
- spawning 10 instances: first one stuck, the other 9 worked

I'm not sure if that's a coincidence. The customer has multiple availability zones, but it doesn't seem to matter. I'm doing a simple spawn of an instance booting from an image with a network attached.
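To gauge how widespread this is, the server list can be filtered for instances whose task_state never leaves "scheduling". A minimal sketch: the inline sample table stands in for real `nova list --fields name,task_state` output, and the column positions are an assumption about that output format (they vary by client version):

```shell
# Sample data standing in for:  nova list --all-tenants --fields name,task_state
sample_output='+--------------------------------------+---------------+-------------+
| ID                                   | Name          | Task State  |
+--------------------------------------+---------------+-------------+
| 2e39ab5a-d29f-4261-be33-aabe53e42e7f | instance-test | scheduling  |
| 11111111-2222-3333-4444-555555555555 | instance-ok   | -           |
+--------------------------------------+---------------+-------------+'

# Print the ID of every row whose Task State column is "scheduling".
printf '%s\n' "$sample_output" \
  | awk -F'|' '$4 ~ /scheduling/ {gsub(/ /, "", $2); print $2}'
# → 2e39ab5a-d29f-4261-be33-aabe53e42e7f
```

Running this periodically would show whether it is always exactly the first instance of a batch that sticks, or whether the pattern above is coincidental.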
It looks like the nova scheduler picked compute-18 to spawn the instance. I'm getting an sosreport from that node now.

[jwaterwo@collab-shell 2016-03-17-11xxxx]$ find -name nova-api.log | xargs grep "Action: 'create'"
./va-controller-1.localdomain/var/log/nova/nova-api.log:2016-03-17 10:25:52.948 27780 DEBUG nova.api.openstack.wsgi [req-0c3f7640-102b-4cbe-9e92-c6ff7cbb01d0 e64fe36f9c614f798a8711b24ae66162 ef9184f8a90544c587036d90e9f7c362 - - -] Action: 'create', calling method: <bound method Controller.create of <nova.api.openstack.compute.servers.Controller object at 0x4a76cd0>>, body: {"server":{"flavorRef":"3","imageRef":"25b55299-575b-40f4-9aa6-29128be5cde1","name":"va-te-03-tr-prod","availability_zone":"dmz","key_name":"divanov","security_groups":[{"name":"default"}],"networks":[{"uuid":"b8f18c19-42ff-4f7e-864c-44d7f9572377"}]}} _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:780
./va-controller-2.localdomain/var/log/nova/nova-api.log:2016-03-17 10:29:59.656 31818 DEBUG nova.api.openstack.wsgi [req-984e4494-df21-4c9f-9b97-13b2a63a3014 bcf9cdcfb1774e7587253ae9a62195ee 0dfba72e1bfe405b8a885290eb42cb72 - - -] Action: 'create', calling method: <bound method Controller.create of <nova.api.openstack.compute.servers.Controller object at 0x4dcacd0>>, body: {"server": {"name": "instance-test", "imageRef": "9c00e619-1582-4a0e-a2bd-fea70d9cab41", "availability_zone": "dmz", "key_name": "bnp", "flavorRef": "2", "OS-DCF:diskConfig": "AUTO", "max_count": 1, "min_count": 1, "networks": [{"uuid": "d238e33c-ef74-4210-800f-e9b81b6b20b1"}], "security_groups": [{"name": "default"}]}} _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:780

[jwaterwo@collab-shell 2016-03-17-11xxxx]$ grep req-984e4494-df21-4c9f-9b97-13b2a63a3014 */var/log/ -rl
va-controller-1.localdomain/var/log/nova/nova-scheduler.log
va-controller-1.localdomain/var/log/nova/nova-conductor.log
va-controller-2.localdomain/var/log/nova/nova-api.log

2016-03-17 10:30:00.309 27961 DEBUG nova.scheduler.filter_scheduler [req-984e4494-df21-4c9f-9b97-13b2a63a3014 bcf9cdcfb1774e7587253ae9a62195ee 0dfba72e1bfe405b8a885290eb42cb72 - - -] Selected host: WeighedHost [host: (va-compute-18.localdomain, va-compute-18.localdomain) ram:183940 disk:4552704 io_ops:0 instances:5, weight: 1.0] _schedule /usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py:158
2016-03-17 10:30:00.312 27961 ERROR oslo_messaging._drivers.impl_rabbit [req-984e4494-df21-4c9f-9b97-13b2a63a3014 bcf9cdcfb1774e7587253ae9a62195ee 0dfba72e1bfe405b8a885290eb42cb72 - - -] AMQP server on 10.133.162.9:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
2016-03-17 10:30:01.320 27961 INFO oslo_messaging._drivers.impl_rabbit [req-984e4494-df21-4c9f-9b97-13b2a63a3014 bcf9cdcfb1774e7587253ae9a62195ee 0dfba72e1bfe405b8a885290eb42cb72 - - -] Reconnected to AMQP server on 10.133.162.9:5672
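The suspicious pattern above (scheduler selects a host, then the RPC cast to compute is lost on a broken AMQP socket) can be checked for other request IDs mechanically. A hedged sketch that flags any request ID appearing both in a "Selected host" line and in an AMQP-unreachable error; the inline sample stands in for `cat */var/log/nova/nova-scheduler.log` from the sosreports:

```shell
# Sample scheduler log lines (abridged from the excerpt above) standing in
# for the real nova-scheduler.log files collected in the sosreports.
sample_log='2016-03-17 10:30:00.309 27961 DEBUG nova.scheduler.filter_scheduler [req-984e4494-df21-4c9f-9b97-13b2a63a3014 bcf9 0dfb - - -] Selected host: WeighedHost [host: (va-compute-18.localdomain, va-compute-18.localdomain)]
2016-03-17 10:30:00.312 27961 ERROR oslo_messaging._drivers.impl_rabbit [req-984e4494-df21-4c9f-9b97-13b2a63a3014 bcf9 0dfb - - -] AMQP server on 10.133.162.9:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.'

# A request id seen in both a "Selected host" line and an AMQP error line
# is a candidate for an instance stuck in scheduling.
printf '%s\n' "$sample_log" \
  | grep -E 'Selected host|AMQP server .* unreachable' \
  | grep -Eo 'req-[0-9a-f-]+' \
  | sort | uniq -c | awk '$1 >= 2 {print $2}'
# → req-984e4494-df21-4c9f-9b97-13b2a63a3014
```

Any IDs this prints could then be traced into nova-conductor.log to confirm whether the cast to the compute host was ever retried after the reconnect.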
Curious whether the chosen compute host shows as DOWN after the instance gets stuck in scheduling? This might be the same as bug 1302387, given that elsewhere in the logs we saw connectivity temporarily lost to at least one of the RabbitMQ nodes.
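If the root cause really is transient RabbitMQ connectivity, AMQP heartbeats can help oslo.messaging notice dead connections and reconnect instead of blocking on a broken socket. A sketch of the relevant nova.conf options, assuming the Kilo-era (2015.1) oslo.messaging rabbit driver; the values are illustrative, not a tuning recommendation, and the option names should be verified against the shipped release:

```ini
[oslo_messaging_rabbit]
# Detect dead TCP connections via AMQP heartbeats (0 disables).
heartbeat_timeout_threshold = 60
# How often to check heartbeats, as a divisor of the threshold.
heartbeat_rate = 2
# Retry interval/backoff when reconnecting to a RabbitMQ node.
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
```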
Is the problem still occurring? That sounds like a transient problem with the message queue.