Bug 1375958 - RamFilter filters out Ceph nodes, resulting in "no valid host found"
Summary: RamFilter filters out Ceph nodes, resulting in "no valid host found"
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Lucas Alvares Gomes
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-14 10:44 UTC by Jun Hu
Modified: 2020-06-11 12:59 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-12 15:34:19 UTC
Target Upstream Version:



Description Jun Hu 2016-09-14 10:44:27 UTC
Description of problem:

When deploying OSP 9, the Ceph node is shut down after the image is pushed to its disk, and "nova show overcloud-ceph-storage-0" then reports a "no valid host" error.

I found the following errors in nova-scheduler.log:
2016-09-14 11:29:20.735 29957 DEBUG nova.scheduler.filters.ram_filter [req-983df749-f1e1-4c42-ae2b-1a27b0d15022 e219fe3068104eeca98d940801e28a6c 7f10576f2a1f4531985c5a1a1bc07db6 - - -] (director.redhat.com, 6d3362b1-4477-4b7b-b2dd-6c83fb523483) ram: 0MB disk: 0MB io_ops: 0 instances: 0 does not have 1024 MB usable ram before overcommit, it only has 0 MB. host_passes /usr/lib/python2.7/site-packages/nova/scheduler/filters/ram_filter.py:45
2016-09-14 11:29:20.735 29957 INFO nova.filters [req-983df749-f1e1-4c42-ae2b-1a27b0d15022 e219fe3068104eeca98d940801e28a6c 7f10576f2a1f4531985c5a1a1bc07db6 - - -] Filter RamFilter returned 0 hosts
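For reference, the check that produces this message lives in nova's RamFilter (`nova/scheduler/filters/ram_filter.py`). A minimal, self-contained sketch of that logic (simplified and with approximate names, not the exact upstream code):

```python
# Simplified sketch of nova's RamFilter.host_passes() logic.
# The real filter is in nova/scheduler/filters/ram_filter.py; names and
# structure here are approximate, for illustration only.

def ram_host_passes(total_usable_ram_mb, free_ram_mb,
                    requested_ram_mb, ram_allocation_ratio=1.0):
    """Return True if the host has enough usable RAM for the request."""
    memory_mb_limit = total_usable_ram_mb * ram_allocation_ratio
    used_ram_mb = total_usable_ram_mb - free_ram_mb
    usable_ram_mb = memory_mb_limit - used_ram_mb
    if usable_ram_mb < requested_ram_mb:
        # This is the situation in the log above: the hypervisor record for
        # the Ceph node reports 0 MB total and free, so usable RAM is 0 MB
        # and the 1024 MB request is rejected.
        return False
    return True

# A node whose resources are reported as 0 MB (e.g. powered off):
print(ram_host_passes(0, 0, 1024))        # False -> "RamFilter returned 0 hosts"
# The same node once its real 2048 MB is reported:
print(ram_host_passes(2048, 2048, 1024))  # True
```

So the filter itself is behaving correctly given its input; the question is why the scheduler sees 0 MB for this node in the first place.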


But my Ceph VM has 2048 MB of memory, so why does 0 MB appear in the log? I suspect the value comes from the Ceph node, which is in the shutdown state.

Why is the Ceph node shut down after the image is pushed to disk? The controller and compute nodes stay running the whole time.
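One plausible explanation (an assumption here, not confirmed in this bug): with the Ironic virt driver, each bare-metal node is exposed to nova as a hypervisor, and the driver reports zeroed resources for nodes it treats as unavailable (for example in maintenance, mid-cleaning, or otherwise not schedulable). A rough illustration of that behaviour, with hypothetical names:

```python
# Hypothetical sketch of how an Ironic-backed "hypervisor" can report 0 MB.
# The real logic lives in nova's ironic driver; this is an illustration of
# the idea, not upstream code.

def node_resource(properties, available):
    """Return the resource view nova's scheduler sees for one node.

    properties: dict with the node's real 'memory_mb', 'local_gb', 'cpus'
    available:  False when the driver treats the node as unavailable
                (maintenance, cleaning, unexpected power state, ...)
    """
    if not available:
        # Unavailable nodes are reported with zeroed resources, which is
        # exactly what makes RamFilter reject them ("only has 0 MB").
        return {'memory_mb': 0, 'local_gb': 0, 'vcpus': 0}
    return {'memory_mb': properties['memory_mb'],
            'local_gb': properties['local_gb'],
            'vcpus': properties['cpus']}

ceph_node = {'memory_mb': 2048, 'local_gb': 40, 'cpus': 2}
print(node_resource(ceph_node, available=False))  # zeroed resource view
print(node_resource(ceph_node, available=True))   # real 2048 MB view
```

If that is what is happening here, the interesting question becomes why the node is considered unavailable at scheduling time.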

Version-Release number of selected component (if applicable):
OSP 9

Host: RHEL 7.2, with three VMs serving as the controller, compute, and Ceph nodes.

How reproducible:
 


Steps to Reproduce:

Follow the steps from the official documentation:

time openstack overcloud deploy --templates \
 -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
 -e /usr/share/openstack-tripleo-heat-templates/environments/network-management.yaml \
 -e ~/templates/network-environment.yaml \
 -e ~/templates/storage-environment.yaml \
 -e  ~/templates/firstboot.yaml \
 --control-scale 1 --compute-scale 1 --ceph-storage-scale 1 --control-flavor control \
 --compute-flavor compute --ceph-storage-flavor ceph-storage \
 --ntp-server 192.168.106.254 --neutron-network-type vxlan  --neutron-tunnel-types vxlan 

Actual results:
deploy failed

Expected results:
successful.

Additional info:

Comment 2 Jun Hu 2016-09-14 10:47:45 UTC
How reproducible:
always.

related software version on host rhel7.2 :

ipxe-roms-20160127-1.git6366fa7a.el7.noarch
ipxe.git-1.0.0-2071.b403.g7cc7e0e.x86_64
ipxe-roms-qemu-20160127-1.git6366fa7a.el7.noarch
ipxe.git-efi-x64-1.0.0-2071.b403.g7cc7e0e.x86_64
ipxe-bootimgs-20160127-1.git6366fa7a.el7.noarch
qemu-kvm-1.5.3-86.el7_1.6.x86_64
qemu-system-arm-2.0.0-1.el7.6.x86_64
qemu-common-2.0.0-1.el7.6.x86_64
qemu-system-x86-2.0.0-1.el7.6.x86_64
ipxe-roms-qemu-20160127-1.git6366fa7a.el7.noarch
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.4.x86_64
qemu-kvm-tools-rhev-2.1.2-23.el7_1.4.x86_64
qemu-kvm-common-1.5.3-86.el7_1.6.x86_64
qemu-img-1.5.3-86.el7_1.6.x86_64

Linux host0.redhat.com 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

Comment 3 Lucas Alvares Gomes 2016-09-14 16:15:27 UTC
(In reply to Jun Hu from comment #0)
> [description and nova-scheduler.log excerpt quoted in full above]
From the description this sounds like a problem with scheduling and the enabled filters. https://bugzilla.redhat.com/show_bug.cgi?id=1370651 also comes to mind.

I'm changing this bug's component to rhel-osp-director because it looks more like a broader issue/misconfiguration spanning multiple components than an isolated issue in Ironic (which does not do scheduling).

Comment 4 James Slagle 2016-10-18 20:14:00 UTC
Lucas, what does that have to do with RamFilter?

Comment 5 Dmitry Tantsur 2016-10-20 11:25:15 UTC
James, please don't randomly assign things to Ironic; Lucas already explained why that's wrong. RamFilter is mentioned in the report, so I suspect some mismatch between the nodes and the flavors.

Jun Hu, please provide "nova flavor-show" output for your Ceph flavor. Please also provide output of "ironic node-show" for the nodes tagged to be used for Ceph. Finally, please provide output of "nova hypervisor-stats" before and after the failed deployment.
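For example, the requested information can be gathered on the undercloud along these lines (flavor and node names below follow the deploy command in the description; substitute your own, and `<node-uuid>` is a placeholder):

```shell
# On the undercloud, as the stack user:
source ~/stackrc

# 1. The flavor the scheduler is trying to match (comment 5)
nova flavor-show ceph-storage

# 2. The Ironic view of the node(s) tagged for Ceph (comment 5)
ironic node-list
ironic node-show <node-uuid>

# 3. Aggregate resources nova's scheduler sees; run before and after the
#    failed deploy (a drop to 0 MB here matches the RamFilter log above)
nova hypervisor-stats
```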

Comment 6 Dmitry Tantsur 2016-10-20 11:26:26 UTC
Please also provide nova-scheduler and ironic-conductor logs.

Comment 7 Chen 2016-12-02 06:25:59 UTC
Hi Dmitry,

Sorry for reopening the bug.

We are doing a PoC with Huawei and Orange now. We can reproduce this issue intermittently; re-deploying the environment sometimes gets past it, but Orange is keen to understand the root cause.

We tried deploying one controller plus one compute node, and sometimes we see RamFilter return 0 hosts.
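One configuration angle that may be worth ruling out while collecting data (an assumption on my side, not a confirmed diagnosis for this bug): for Ironic-backed scheduling the undercloud's nova.conf is normally tuned so that bare-metal nodes are matched without overcommit or reserved memory. Roughly, as an illustrative fragment (these are real Nova option names of this era, but the values must match your environment):

```ini
# Undercloud /etc/nova/nova.conf (illustrative fragment only)
[DEFAULT]
scheduler_host_manager = ironic_host_manager
reserved_host_memory_mb = 0
ram_allocation_ratio = 1.0
```

If `reserved_host_memory_mb` or the allocation ratio were off, RamFilter could reject nodes even when their reported RAM is correct; that would not explain the 0 MB readings in the log, though.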

If we reproduce the issue again, my understanding is that the information requested in comment 5 and comment 6 should be enough. Please let me know if anything else is needed.

Best Regards,
Chen

Comment 8 Dmitry Tantsur 2016-12-02 11:12:50 UTC
Yes, that information will be fine to start with.

