Description of problem:
Installed using TripleO with puddle 2015-02-16.1 and deployed an overcloud of 1 controller and 2 computes (on a virt setup). When trying to launch a cirros instance I get "No valid host was found". Nothing is logged to nova-compute.log on the compute nodes, and nothing to nova-scheduler.log on the controller. I checked that the nodes can ping each other and I tried disabling SELinux, but it didn't help. Only by restarting the compute services was I able to launch 1 instance, but when I tried to launch another instance the issue came back.

Version-Release number of selected component (if applicable):
python-nova-2014.2.2-2.el7ost.noarch
openstack-nova-cert-2014.2.2-2.el7ost.noarch
openstack-nova-novncproxy-2014.2.2-2.el7ost.noarch
python-novaclient-2.20.0-1.el7ost.noarch
openstack-nova-console-2014.2.2-2.el7ost.noarch
openstack-nova-common-2014.2.2-2.el7ost.noarch
openstack-nova-conductor-2014.2.2-2.el7ost.noarch
openstack-nova-api-2014.2.2-2.el7ost.noarch
openstack-nova-scheduler-2014.2.2-2.el7ost.noarch
openstack-nova-compute-2014.2.2-2.el7ost.noarch

Steps to Reproduce:
1. Install according to https://wiki.test.redhat.com/RhevmQe/OpenStackTeam/TripleO/RhosWithTripleO
2. Include 1 controller and 2 computes in your deployment
3. Upload the cirros image and try to launch a cirros instance

Actual results:
No valid host was found

Additional info:
"nova show" gives a bit of detail on the error:

{"message": "No valid host was found. ", "code": 500, "details": "
  File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 616, in build_instances
    request_spec, filter_properties)
  File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py\", line 49, in select_destinations
    context, request_spec, filter_properties)
  File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py\", line 35, in __run_method
    return getattr(self.instance, __name)(*args, **kwargs)
  File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py\", line 34, in select_destinations
    context, request_spec, filter_properties)
  File \"/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py\", line 108, in select_destinations
    request_spec=request_spec, filter_properties=filter_properties)
  File \"/usr/lib/python2.7/site-packages/oslo/messaging/rpc/client.py\", line 152, in call
    retry=self.retry)
  File \"/usr/lib/python2.7/site-packages/oslo/messaging/transport.py\", line 90, in _send
    timeout=timeout, retry=retry)
  File \"/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py\", line 408, in send
    retry=retry)
  File \"/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py\", line 399, in _send
    raise result
", "created": "2015-02-17T12:24:15Z"}
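For step 3, a minimal sketch with the Juno-era clients (the image file, net-id and names here are placeholders; m1.tiny/m1.small are the stock flavors):

  glance image-create --name cirros --disk-format qcow2 --container-format bare --is-public True --file cirros-0.3.3-x86_64-disk.img
  nova boot --flavor m1.small --image cirros --nic net-id=<net-uuid> test-vm
  nova show test-vm    # the "fault" field carries the traceback quoted above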
Reassigning to Ben to take a look at this when he can.
The first thing we'll probably need is to set debug=True in /etc/nova/nova.conf on the overcloud controller and compute nodes. Restart all Nova services (compute, api, conductor, scheduler) and then try to reproduce.
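A minimal sketch of that, assuming crudini from openstack-utils is available (editing the [DEFAULT] section of nova.conf by hand works just as well):

  crudini --set /etc/nova/nova.conf DEFAULT debug True
  # controller:
  systemctl restart openstack-nova-api openstack-nova-scheduler openstack-nova-conductor
  # each compute node:
  systemctl restart openstack-nova-compute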
Recreated the problem with debug=true.

This is from the controller node:

2015-02-17 09:40:11.296 15987 DEBUG nova.openstack.common.periodic_task [-] Running periodic task SchedulerManager._run_periodic_tasks run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:11.297 15987 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x2864b90>> sleeping for 55.52 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
2015-02-17 09:40:31.648 15987 DEBUG nova.filters [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] Starting with 2 host(s) get_filtered_objects /usr/lib/python2.7/site-packages/nova/filters.py:70
2015-02-17 09:40:31.649 15987 DEBUG nova.filters [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] Filter RetryFilter returned 2 host(s) get_filtered_objects /usr/lib/python2.7/site-packages/nova/filters.py:88
2015-02-17 09:40:31.662 15987 DEBUG nova.filters [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] Filter AvailabilityZoneFilter returned 2 host(s) get_filtered_objects /usr/lib/python2.7/site-packages/nova/filters.py:88
2015-02-17 09:40:31.662 15987 DEBUG nova.scheduler.filters.ram_filter [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] (ov-r7lfixkpnw-0-qoqatvny4ocg-novacompute-opcsuxmohruc.novalocal, ov-r7lfixkpnw-0-qoqatvny4ocg-novacompute-opcsuxmohruc.novalocal) ram:-656 disk:-1024 io_ops:2 instances:2 does not have 2048 MB usable ram, it only has 1320.0 MB usable ram. host_passes /usr/lib/python2.7/site-packages/nova/scheduler/filters/ram_filter.py:60
2015-02-17 09:40:31.663 15987 DEBUG nova.scheduler.filters.ram_filter [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] (ov-r7lfixkpnw-1-4unvgmgdttys-novacompute-d42n6ozclst5.novalocal, ov-r7lfixkpnw-1-4unvgmgdttys-novacompute-d42n6ozclst5.novalocal) ram:-656 disk:-1024 io_ops:2 instances:2 does not have 2048 MB usable ram, it only has 1320.0 MB usable ram. host_passes /usr/lib/python2.7/site-packages/nova/scheduler/filters/ram_filter.py:60
2015-02-17 09:40:31.663 15987 INFO nova.filters [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] Filter RamFilter returned 0 hosts

This is from the compute node:

2015-02-17 09:40:16.520 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._poll_bandwidth_usage run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:16.521 10071 INFO nova.compute.manager [-] Updating bandwidth usage cache
2015-02-17 09:40:16.551 10071 WARNING nova.compute.manager [-] Bandwidth usage not supported by hypervisor.
2015-02-17 09:40:16.551 10071 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x35aa750>> sleeping for 5.96 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
2015-02-17 09:40:22.516 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._run_pending_deletes run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:22.516 10071 DEBUG nova.compute.manager [-] Cleaning up deleted instances _run_pending_deletes /usr/lib/python2.7/site-packages/nova/compute/manager.py:6225
2015-02-17 09:40:22.543 10071 DEBUG nova.compute.manager [-] There are 0 instances to clean _run_pending_deletes /usr/lib/python2.7/site-packages/nova/compute/manager.py:6234
2015-02-17 09:40:22.543 10071 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x35aa750>> sleeping for 1.99 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
2015-02-17 09:40:24.539 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._instance_usage_audit run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:24.540 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._poll_volume_usage run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:24.540 10071 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x35aa750>> sleeping for 0.97 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
2015-02-17 09:40:25.508 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._check_instance_build_time run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:25.508 10071 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x35aa750>> sleeping for 0.00 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
2015-02-17 09:40:25.511 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager.update_available_resource run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:25.511 10071 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2015-02-17 09:40:25.511 10071 DEBUG nova.virt.libvirt.driver [-] Updating host stats update_status /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:6511
2015-02-17 09:40:25.530 10071 DEBUG nova.compute.resource_tracker [-] Hypervisor: free ram (MB): 3587 _report_hypervisor_resource_view /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:453
2015-02-17 09:40:25.531 10071 DEBUG nova.compute.resource_tracker [-] Hypervisor: free disk (GB): 39 _report_hypervisor_resource_view /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:454
2015-02-17 09:40:25.531 10071 DEBUG nova.compute.resource_tracker [-] Hypervisor: free VCPUs: 1 _report_hypervisor_resource_view /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:459
2015-02-17 09:40:25.531 10071 DEBUG nova.compute.resource_tracker [-] Hypervisor: assignable PCI devices: [] _report_hypervisor_resource_view /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:466
2015-02-17 09:40:25.531 10071 DEBUG nova.openstack.common.lockutils [-] Created new semaphore "compute_resources" internal_lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:206
2015-02-17 09:40:25.531 10071 DEBUG nova.openstack.common.lockutils [-] Acquired semaphore "compute_resources" lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:229
2015-02-17 09:40:25.531 10071 DEBUG nova.openstack.common.lockutils [-] Got semaphore / lock "_update_available_resource" inner /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:271
2015-02-17 09:40:25.582 10071 AUDIT nova.compute.resource_tracker [-] Total physical ram (MB): 3952, total allocated virtual ram (MB): 512
2015-02-17 09:40:25.582 10071 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 40
2015-02-17 09:40:25.582 10071 AUDIT nova.compute.resource_tracker [-] Total usable vcpus: 1, total allocated vcpus: 0
2015-02-17 09:40:25.582 10071 AUDIT nova.compute.resource_tracker [-] PCI stats: []
2015-02-17 09:40:25.583 10071 INFO nova.compute.resource_tracker [-] Compute_service record updated for ov-r7lfixkpnw-1-4unvgmgdttys-novacompute-d42n6ozclst5.novalocal:ov-r7lfixkpnw-1-4unvgmgdttys-novacompute-d42n6ozclst5.novalocal
2015-02-17 09:40:25.583 10071 DEBUG nova.openstack.common.lockutils [-] Releasing semaphore "compute_resources" lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:238
2015-02-17 09:40:25.583 10071 DEBUG nova.openstack.common.lockutils [-] Semaphore / lock released "_update_available_resource" inner /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:275
2015-02-17 09:40:25.642 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._poll_rebooting_instances run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:25.643 10071 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x35aa750>> sleeping for 0.87 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
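A shortcut for pulling the interesting scheduler decisions out of the debug noise (standard log locations assumed):

  # on the controller: each filter's pass count plus the RamFilter verdicts
  grep -E 'Filter .* returned|usable ram' /var/log/nova/nova-scheduler.log | tail
  # on the computes: what the resource tracker last reported
  grep 'Total physical ram' /var/log/nova/nova-compute.log | tail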
It seems to be easier to create instances of the "tiny" flavor. With "small" I get very inconsistent results: sometimes I manage to launch an instance and sometimes not, even if I delete all existing instances first.
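For context, the stock flavor memory sizes (assuming they weren't modified on this overcloud) would explain the difference; they can be confirmed with:

  nova flavor-list
  # stock defaults: m1.tiny = 512 MB, m1.small = 2048 MB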
Okay, the problem is this log message:

  ...does not have 2048 MB usable ram, it only has 1320.0 MB usable ram.

This tells me one of two things happened:

1) There are currently too many active instances on the overcloud and it is legitimately out of memory to launch another "small" vm. Note that even the small flavor requires 2 GB of ram and we only have 4 total on our compute nodes. With overhead that means only 1 small vm per compute node.

2) There were _very recently_ too many active instances on the overcloud and it only thinks it's out of memory. Nova follows an eventual consistency model, which means it doesn't immediately return resources from deleted instances to the available pool. If, for example, you boot two small instances, delete them, then immediately attempt to boot another small instance, you will likely run into this error because Nova's resource tracker won't have recognized that the previous instances are gone.

That's my best guess for what is happening here, since I assume you are booting and deleting a number of servers as part of the testing.

Note that by default Nova allows a 50% memory overcommit, which should prevent this specific circumstance from breaking new instance booting, but it's possible we override that in TripleO (although we probably shouldn't on the overcloud). You can check that by looking in nova.conf on the overcloud for the "ram_allocation_ratio" option. I'm going to bet it's set to 1 (I don't have an overcloud up right now to check myself).

If any of what I've said above doesn't fit your situation, please let me know and I'll try to figure out what else might be going on.
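A quick way to check that on an overcloud node (crudini as above; a plain grep of /etc/nova/nova.conf works just as well):

  crudini --get /etc/nova/nova.conf DEFAULT ram_allocation_ratio || echo "unset (code default: 1.5)"

For what it's worth, the numbers in the RamFilter line above do work out under the stock defaults (ram_allocation_ratio=1.5, reserved_host_memory_mb=512):

  capacity = 3952 MB * 1.5                        = 5928 MB
  used     = 2 * 2048 MB (instances) + 512 MB (reserved) = 4608 MB
  usable   = 5928 - 4608                          = 1320 MB  (matches "only has 1320.0 MB usable ram")
  free     = 3952 - 4608                          = -656 MB  (matches "ram:-656")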
This is only reproducible if you launch multiple instances. For example, I had 4 computes, each capable of hosting 2 cirros instances of the "small" flavor, and I launched 8 instances. Only 1 instance failed to launch, even though there was a valid host for it: when I later launched it separately, it succeeded. I will open a separate bug for that.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2015:1549