Bug 1255733

Summary: Nova scheduler reports wrong available memory for a baremetal host
Product: Red Hat OpenStack Reporter: AlbertoG <agarciam>
Component: openstack-ironic-discoverd Assignee: RHOS Maint <rhos-maint>
Status: CLOSED DUPLICATE QA Contact: yeylon <yeylon>
Severity: medium Docs Contact:
Priority: medium    
Version: Director CC: agarciam, apevec, berrange, dasmith, dhill, eglynn, jthomas, kchamart, lhh, mburns, ndipanov, pbrady, rhel-osp-director-maint, sbauza, sferdjao, sgordon, srevivo, vromanso, yeylon
Target Milestone: --- Keywords: ZStream
Target Release: 7.0 (Kilo) Flags: sbauza: needinfo? (agarciam)
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-10-05 12:51:49 UTC Type: Bug

Description AlbertoG 2015-08-21 12:10:52 UTC
Description of problem:

After deploying and removing several overclouds using the OpenStack director, one of the baremetal nodes is not cleared correctly: it still reports 1 instance and less available memory than it really has. As a result, the node no longer passes the Nova RAM filter when new instances are deployed. The stale data appears to be held in memory, since the nova and ironic databases looked correct and the problem is resolved simply by restarting the nova-scheduler service.

The error in Heat:
2015-08-21 06:25:08.838 3509 TRACE heat.engine.resource ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "ResourceInError: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"""

The error in nova scheduler:

2015-08-21 06:25:03.728 24503 DEBUG nova.scheduler.filters.ram_filter [req-2eec83a2-e3c2-4511-b7c9-00fb7c673953 4dfe22da9c46409bbb157996c42eb24b f32db20ae258449c867012c3d4d07702 - - -] (undercloud.redhat.local, e7653536-4e9b-4435-92d0-2e4cada5168c) ram:0 disk:-1024 io_ops:1 instances:1 does not have 6144 MB usable ram, it only has 3072.0 MB usable ram. host_passes /usr/lib/python2.7/site-packages/nova/scheduler/filters/ram_filter.py:60
2015-08-21 06:25:03.729 24503 INFO nova.filters [req-2eec83a2-e3c2-4511-b7c9-00fb7c673953 4dfe22da9c46409bbb157996c42eb24b f32db20ae258449c867012c3d4d07702 - - -] Filter RamFilter returned 0 hosts
2015-08-21 06:25:03.729 24503 DEBUG nova.scheduler.filter_scheduler [req-2eec83a2-e3c2-4511-b7c9-00fb7c673953 4dfe22da9c46409bbb157996c42eb24b f32db20ae258449c867012c3d4d07702 - - -] There are 0 hosts available but 1 instances requested to build. select_destinations /usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py:77

The host status in the database at the time of the problem:

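MariaDB [nova]> select hypervisor_hostname,memory_mb_used,free_ram_mb from compute_nodes where hypervisor_hostname='e7653536-4e9b-4435-92d0-2e4cada5168c';
+--------------------------------------+----------------+-------------+
| hypervisor_hostname                  | memory_mb_used | free_ram_mb |
+--------------------------------------+----------------+-------------+
| e7653536-4e9b-4435-92d0-2e4cada5168c |              0 |        6144 |
+--------------------------------------+----------------+-------------+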
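
For reference, the numbers in the scheduler log and in the database row are consistent with the usual RAM filter arithmetic. The sketch below is a simplified reconstruction, not the actual Nova code. It assumes a ram_allocation_ratio of 1.5 (not shown anywhere in this report) and a total of 6144 MB for the node (taken from the database row and the post-restart log). The function names are made up for illustration.

```python
# Simplified sketch of a RAM-filter style check (not the real Nova code).
# Assumption: ram_allocation_ratio = 1.5; the report does not show nova.conf.

def usable_ram_mb(total_mb, free_mb, ram_allocation_ratio=1.5):
    """RAM still usable on a host once oversubscription is applied."""
    limit_mb = total_mb * ram_allocation_ratio  # oversubscribed ceiling
    used_mb = total_mb - free_mb                # what the scheduler thinks is used
    return limit_mb - used_mb


def host_passes(total_mb, free_mb, requested_mb, ram_allocation_ratio=1.5):
    """True if the host has enough usable RAM for the request."""
    return usable_ram_mb(total_mb, free_mb, ram_allocation_ratio) >= requested_mb


if __name__ == "__main__":
    # Stale in-memory view from the failing log line: ram:0 on a 6144 MB node.
    print(usable_ram_mb(6144, 0), host_passes(6144, 0, 6144))
    # View matching the database row and the post-restart log: ram:6144.
    print(usable_ram_mb(6144, 6144), host_passes(6144, 6144, 6144))
```

Under these assumptions the stale view (free RAM 0) gives 3072.0 MB usable, which matches the "it only has 3072.0 MB usable ram" message, while the database values (free RAM 6144) would let the 6144 MB request pass. This supports the idea that the scheduler was filtering on cached host state rather than on what the database held.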

And the successful deployment after restarting the nova scheduler service:

2015-08-21 07:20:29.139 17794 DEBUG nova.service [req-a9ad4e70-0141-416a-9c3a-17e808e46856 - - - - -] Join ServiceGroup membership for this service scheduler start /usr/lib/python2.7/site-packages/nova/service.py:206
2015-08-21 07:22:38.003 17794 DEBUG nova.scheduler.filter_scheduler [req-7c47c0d5-aaac-4ecb-b742-d56d94ff07c8 4dfe22da9c46409bbb157996c42eb24b f32db20ae258449c867012c3d4d07702 - - -] Filtered [(undercloud.redhat.local, e7653536-4e9b-4435-92d0-2e4cada5168c) ram:6144 disk:19456 io_ops:0 instances:0] _schedule /usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py:143
2015-08-21 07:22:38.003 17794 DEBUG nova.scheduler.filter_scheduler [req-7c47c0d5-aaac-4ecb-b742-d56d94ff07c8 4dfe22da9c46409bbb157996c42eb24b f32db20ae258449c867012c3d4d07702 - - -] Weighed [WeighedHost [host: (undercloud.redhat.local, e7653536-4e9b-4435-92d0-2e4cada5168c) ram:6144 disk:19456 io_ops:0 instances:0, weight: 1.0]] _schedule /usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py:148
2015-08-21 07:22:38.004 17794 DEBUG nova.scheduler.filter_scheduler [req-7c47c0d5-aaac-4ecb-b742-d56d94ff07c8 4dfe22da9c46409bbb157996c42eb24b f32db20ae258449c867012c3d4d07702 - - -] Selected host: WeighedHost [host: (undercloud.redhat.local, e7653536-4e9b-4435-92d0-2e4cada5168c) ram:6144 disk:19456 io_ops:0 instances:0, weight: 1.0] _schedule /usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py:158

[stack@undercloud ~]$ ironic node-list | grep e7653536-4e9b-4435-92d0-2e4cada5168c
| e7653536-4e9b-4435-92d0-2e4cada5168c | None | cd177b6b-5f8b-4c3f-8df8-54c5bb1dfb63 | power on    | active          | False       |

Version-Release number of selected component (if applicable):

How reproducible:
The error appears randomly.

Steps to Reproduce:
1. Deploy and delete overclouds using the OpenStack director

Actual results:
Even though the baremetal nodes are available, some of them are still reported as in use, so new instances can't be scheduled on them.

Expected results:
Being able to deploy instances on the available baremetal nodes

Additional info:

Comment 3 Sylvain Bauza 2015-10-02 16:16:04 UTC
So, I'm trying to get a few more details.

Which version of Nova and Ironic are you running? Per the bug, it says OSP8. Can you confirm?

The error above comes from the fact that it *seems* that the Ironic node already holds an instance (hence the instances:1 and the memory/disk consumed). I just wonder why a scheduler restart fixes the problem.

To better understand, could I please have the untruncated nova-compute.log and nova-scheduler.log? Ideally a sos-report would be great; if that is not possible, could you please also give me the nova.conf file?


Comment 4 David Hill 2015-10-02 21:47:36 UTC
Hello Sylvain,

   I think this is a known issue and there are already many BZs for it. I didn't witness this nova-scheduler issue within the case, but it seems that systems with more than 256GB of RAM report their memory_mb wrongly, as if there were some kind of /1024 or /1000 division for sizes bigger than X. In this case, for instance, the systems are reporting 384MB of RAM instead of 384000MB of RAM or so. We can easily bypass this issue by ironic node-updating the affected node with the appropriate memory size.
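
A rough way to spot nodes affected by the scaling problem described above is to compare the reported memory_mb against the size the hardware is expected to have. This is only a heuristic sketch: the function name, the 10% tolerance, and the idea of having an expected value at hand are all assumptions, not anything taken from Ironic or discoverd.

```python
# Heuristic sketch: flag a memory_mb value that looks like the expected
# size divided by 1000 or 1024. Name and tolerance are hypothetical.

def looks_scaled_down(reported_mb, expected_mb, tolerance=0.1):
    """True if reported_mb is roughly expected_mb / 1000 or / 1024."""
    for divisor in (1000, 1024):
        scaled = expected_mb / divisor
        if abs(reported_mb - scaled) <= tolerance * scaled:
            return True
    return False


if __name__ == "__main__":
    # Example from the comment: 384 MB reported instead of about 384000 MB.
    print(looks_scaled_down(384, 384000))
    # A correctly reported 6144 MB node is not flagged.
    print(looks_scaled_down(6144, 6144))
```

Nodes flagged this way would still need the workaround mentioned above, that is, updating the node's memory size through ironic node-update.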

Thank you very much,

David Hill

Comment 5 Mike Burns 2015-10-05 12:51:49 UTC

*** This bug has been marked as a duplicate of bug 1256421 ***