Bug 1193476 - Nova is unstable, can't launch instances in the overcloud: "no valid host was found"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ga
: Director
Assignee: Ben Nemec
QA Contact: yeylon@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-02-17 12:39 UTC by Udi
Modified: 2016-04-18 06:49 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-05 13:50:23 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2015:1549 normal SHIPPED_LIVE Red Hat Enterprise Linux OpenStack Platform director Release 2015-08-05 17:49:10 UTC

Description Udi 2015-02-17 12:39:19 UTC
Description of problem:
Installed using TripleO with puddle 2015-02-16.1 and deployed an overcloud of 1 controller and 2 computes (on a virt setup). When trying to launch a cirros instance I get "no valid host was found". Nothing is logged to nova-compute.log on the compute nodes, nor to nova-scheduler.log on the controller. I checked that the nodes can ping each other and tried disabling SELinux, but it didn't help. Only by restarting the compute services was I able to launch 1 instance, and when I tried to launch another instance the issue came back.


Version-Release number of selected component (if applicable):
python-nova-2014.2.2-2.el7ost.noarch
openstack-nova-cert-2014.2.2-2.el7ost.noarch
openstack-nova-novncproxy-2014.2.2-2.el7ost.noarch
python-novaclient-2.20.0-1.el7ost.noarch
openstack-nova-console-2014.2.2-2.el7ost.noarch
openstack-nova-common-2014.2.2-2.el7ost.noarch
openstack-nova-conductor-2014.2.2-2.el7ost.noarch
openstack-nova-api-2014.2.2-2.el7ost.noarch
openstack-nova-scheduler-2014.2.2-2.el7ost.noarch
openstack-nova-compute-2014.2.2-2.el7ost.noarch


Steps to Reproduce:
1. Install according to https://wiki.test.redhat.com/RhevmQe/OpenStackTeam/TripleO/RhosWithTripleO
2. Include 1 controller and 2 computes in your deployment
3. Upload the cirros image and try to launch a cirros instance


Actual results:
No valid host was found


Additional info:
In "nova show" you can see a bit of details on the error:

{"message": "No valid host was found. ", "code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 616, in build_instances
    request_spec, filter_properties)                                                                                                                                 
  File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py\", line 49, in select_destinations                                                       
    context, request_spec, filter_properties)                                                                                                                        
  File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py\", line 35, in __run_method                                                              
    return getattr(self.instance, __name)(*args, **kwargs)                                                                                                           
  File \"/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py\", line 34, in select_destinations                                                          
    context, request_spec, filter_properties)                                                                                                                        
  File \"/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py\", line 108, in select_destinations                                                               
    request_spec=request_spec, filter_properties=filter_properties)                                                                                                  
  File \"/usr/lib/python2.7/site-packages/oslo/messaging/rpc/client.py\", line 152, in call                                                                          
    retry=self.retry)                                                                                                                                                
  File \"/usr/lib/python2.7/site-packages/oslo/messaging/transport.py\", line 90, in _send                                                                           
    timeout=timeout, retry=retry)                                                                                                                                    
  File \"/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py\", line 408, in send                                                                 
    retry=retry)                                                                                                                                                     
  File \"/usr/lib/python2.7/site-packages/oslo/messaging/_drivers/amqpdriver.py\", line 399, in _send                                                                
    raise result                                                                                                                                                     
", "created": "2015-02-17T12:24:15Z"}

Comment 4 James Slagle 2015-02-17 14:13:00 UTC
Reassigning to Ben to take a look at when he can.

Comment 5 James Slagle 2015-02-17 14:15:54 UTC
The first thing we'll probably need is to set debug=True in /etc/nova/nova.conf on the overcloud controller and compute nodes. Restart all Nova services (compute, api, conductor, scheduler) and then try to reproduce.
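For anyone reproducing this, a minimal sketch of the config change is below. It uses Python's stdlib configparser purely for illustration; real nova.conf files use the oslo.config format, which is INI-like but not guaranteed to round-trip through configparser (comments and duplicate keys are lost), and the services still have to be restarted by hand afterwards.

```python
import configparser

def set_nova_debug(conf_path, enabled=True):
    """Set debug in the [DEFAULT] section of an INI-style nova.conf.

    Illustrative only: configparser drops comments on rewrite, so on a
    real system you would more likely edit the file directly or use a
    tool such as openstack-config/crudini.
    """
    cfg = configparser.ConfigParser()
    cfg.read(conf_path)
    cfg["DEFAULT"]["debug"] = "True" if enabled else "False"
    with open(conf_path, "w") as f:
        cfg.write(f)

# Usage (on a node, as root): set_nova_debug("/etc/nova/nova.conf")
```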

Comment 6 Udi 2015-02-17 15:10:02 UTC
Recreated the problem with debug=true. This is from the controller node:


2015-02-17 09:40:11.296 15987 DEBUG nova.openstack.common.periodic_task [-] Running periodic task SchedulerManager._run_periodic_tasks run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:11.297 15987 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x2864b90>> sleeping for 55.52 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
2015-02-17 09:40:31.648 15987 DEBUG nova.filters [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] Starting with 2 host(s) get_filtered_objects /usr/lib/python2.7/site-packages/nova/filters.py:70
2015-02-17 09:40:31.649 15987 DEBUG nova.filters [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] Filter RetryFilter returned 2 host(s) get_filtered_objects /usr/lib/python2.7/site-packages/nova/filters.py:88
2015-02-17 09:40:31.662 15987 DEBUG nova.filters [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] Filter AvailabilityZoneFilter returned 2 host(s) get_filtered_objects /usr/lib/python2.7/site-packages/nova/filters.py:88
2015-02-17 09:40:31.662 15987 DEBUG nova.scheduler.filters.ram_filter [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] (ov-r7lfixkpnw-0-qoqatvny4ocg-novacompute-opcsuxmohruc.novalocal, ov-r7lfixkpnw-0-qoqatvny4ocg-novacompute-opcsuxmohruc.novalocal) ram:-656 disk:-1024 io_ops:2 instances:2 does not have 2048 MB usable ram, it only has 1320.0 MB usable ram. host_passes /usr/lib/python2.7/site-packages/nova/scheduler/filters/ram_filter.py:60
2015-02-17 09:40:31.663 15987 DEBUG nova.scheduler.filters.ram_filter [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] (ov-r7lfixkpnw-1-4unvgmgdttys-novacompute-d42n6ozclst5.novalocal, ov-r7lfixkpnw-1-4unvgmgdttys-novacompute-d42n6ozclst5.novalocal) ram:-656 disk:-1024 io_ops:2 instances:2 does not have 2048 MB usable ram, it only has 1320.0 MB usable ram. host_passes /usr/lib/python2.7/site-packages/nova/scheduler/filters/ram_filter.py:60
2015-02-17 09:40:31.663 15987 INFO nova.filters [req-b14de8ab-a801-4887-b4ba-f86ab0a885cc None] Filter RamFilter returned 0 hosts



This is from the compute node:

2015-02-17 09:40:16.520 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._poll_bandwidth_usage run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:16.521 10071 INFO nova.compute.manager [-] Updating bandwidth usage cache
2015-02-17 09:40:16.551 10071 WARNING nova.compute.manager [-] Bandwidth usage not supported by hypervisor.
2015-02-17 09:40:16.551 10071 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x35aa750>> sleeping for 5.96 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
2015-02-17 09:40:22.516 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._run_pending_deletes run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:22.516 10071 DEBUG nova.compute.manager [-] Cleaning up deleted instances _run_pending_deletes /usr/lib/python2.7/site-packages/nova/compute/manager.py:6225
2015-02-17 09:40:22.543 10071 DEBUG nova.compute.manager [-] There are 0 instances to clean _run_pending_deletes /usr/lib/python2.7/site-packages/nova/compute/manager.py:6234
2015-02-17 09:40:22.543 10071 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x35aa750>> sleeping for 1.99 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
2015-02-17 09:40:24.539 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._instance_usage_audit run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:24.540 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._poll_volume_usage run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:24.540 10071 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x35aa750>> sleeping for 0.97 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
2015-02-17 09:40:25.508 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._check_instance_build_time run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:25.508 10071 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x35aa750>> sleeping for 0.00 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
2015-02-17 09:40:25.511 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager.update_available_resource run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:25.511 10071 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2015-02-17 09:40:25.511 10071 DEBUG nova.virt.libvirt.driver [-] Updating host stats update_status /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:6511
2015-02-17 09:40:25.530 10071 DEBUG nova.compute.resource_tracker [-] Hypervisor: free ram (MB): 3587 _report_hypervisor_resource_view /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:453
2015-02-17 09:40:25.531 10071 DEBUG nova.compute.resource_tracker [-] Hypervisor: free disk (GB): 39 _report_hypervisor_resource_view /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:454
2015-02-17 09:40:25.531 10071 DEBUG nova.compute.resource_tracker [-] Hypervisor: free VCPUs: 1 _report_hypervisor_resource_view /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:459
2015-02-17 09:40:25.531 10071 DEBUG nova.compute.resource_tracker [-] Hypervisor: assignable PCI devices: [] _report_hypervisor_resource_view /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:466
2015-02-17 09:40:25.531 10071 DEBUG nova.openstack.common.lockutils [-] Created new semaphore "compute_resources" internal_lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:206
2015-02-17 09:40:25.531 10071 DEBUG nova.openstack.common.lockutils [-] Acquired semaphore "compute_resources" lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:229
2015-02-17 09:40:25.531 10071 DEBUG nova.openstack.common.lockutils [-] Got semaphore / lock "_update_available_resource" inner /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:271
2015-02-17 09:40:25.582 10071 AUDIT nova.compute.resource_tracker [-] Total physical ram (MB): 3952, total allocated virtual ram (MB): 512
2015-02-17 09:40:25.582 10071 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 40
2015-02-17 09:40:25.582 10071 AUDIT nova.compute.resource_tracker [-] Total usable vcpus: 1, total allocated vcpus: 0
2015-02-17 09:40:25.582 10071 AUDIT nova.compute.resource_tracker [-] PCI stats: []
2015-02-17 09:40:25.583 10071 INFO nova.compute.resource_tracker [-] Compute_service record updated for ov-r7lfixkpnw-1-4unvgmgdttys-novacompute-d42n6ozclst5.novalocal:ov-r7lfixkpnw-1-4unvgmgdttys-novacompute-d42n6ozclst5.novalocal
2015-02-17 09:40:25.583 10071 DEBUG nova.openstack.common.lockutils [-] Releasing semaphore "compute_resources" lock /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:238
2015-02-17 09:40:25.583 10071 DEBUG nova.openstack.common.lockutils [-] Semaphore / lock released "_update_available_resource" inner /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py:275
2015-02-17 09:40:25.642 10071 DEBUG nova.openstack.common.periodic_task [-] Running periodic task ComputeManager._poll_rebooting_instances run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:193
2015-02-17 09:40:25.643 10071 DEBUG nova.openstack.common.loopingcall [-] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x35aa750>> sleeping for 0.87 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132

Comment 7 Udi 2015-02-17 15:11:23 UTC
It seems that it is easier to create instances of the "tiny" flavor. With "small" the results are unpredictable: sometimes I manage to launch an instance and sometimes not, even if I delete all existing instances first.

Comment 8 Ben Nemec 2015-02-17 20:46:21 UTC
Okay, the problem is this log message:

...does not have 2048 MB usable ram, it only has 1320.0 MB usable ram.

This tells me one of two things happened: 1) there are currently too many active instances on the overcloud and it is legitimately out of memory to launch another "small" VM. Note that even the small flavor requires 2 GB of RAM and we only have 4 GB total on our compute nodes. With overhead, that means only one small VM per compute node. Or 2) there were _very recently_ too many active instances on the overcloud and Nova only thinks it's out of memory. Nova follows an eventual consistency model, which means it doesn't immediately return resources from deleted instances to the available pool.

If, for example, you boot two small instances, delete them, then immediately attempt to boot another small instance, you will likely run into this error because Nova's resource tracker won't have recognized that the previous instances are gone.  That's my best guess for what is happening here since I assume you are booting and deleting a number of servers as part of the testing.

Note that by default Nova allows a 50% memory overcommit which should prevent this specific circumstance from breaking new instance booting, but it's possible we override that in TripleO (although we probably shouldn't on the overcloud).  You can check that by looking in nova.conf on the overcloud for the "ram_allocation_ratio" option.  I'm going to bet it's set to 1 (I don't have an overcloud up right now to check myself).
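The check the log message comes from can be sketched roughly as below. This mirrors the logic in nova/scheduler/filters/ram_filter.py (usable = total * ratio - used); the helper name is ours and the numbers are illustrative, loosely based on the 3952 MB total reported by the resource tracker above (the real filter also accounts for reserved host memory, which is why the logged 1320.0 MB doesn't match this simple arithmetic exactly).

```python
def ram_filter_passes(total_mb, used_mb, requested_mb, ram_allocation_ratio):
    """Approximate RamFilter: a host passes only if the requested RAM
    fits under total * ram_allocation_ratio - used."""
    usable_mb = total_mb * ram_allocation_ratio - used_mb
    return requested_mb <= usable_mb, usable_mb

# With ram_allocation_ratio forced to 1.0 (no overcommit), one stale
# 2 GB record is enough to make a ~4 GB node reject a 2 GB request:
ram_filter_passes(3952, 2048, 2048, 1.0)  # → (False, 1904.0)

# With Nova's default 1.5 ratio the same request would still pass,
# which is why overcommit masks the stale-resource-tracker case:
ram_filter_passes(3952, 2048, 2048, 1.5)  # → (True, 3880.0)
```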

If any of what I've said above doesn't fit your situation please let me know and I'll try to figure out what else might be going on.

Comment 9 Udi 2015-05-27 14:55:09 UTC
This is only reproducible when launching multiple instances. For example, I had 4 computes, each capable of hosting 2 cirros instances of the "small" flavor, and I launched 8 instances. Only 1 instance failed to launch, even though I later launched it separately and there was a valid host for it. I will open a separate bug for that.

Comment 15 errata-xmlrpc 2015-08-05 13:50:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549

