| Summary: | RamFilter filters out Ceph nodes, resulting in "no valid host found" | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jun Hu <juhu> |
| Component: | rhosp-director | Assignee: | Lucas Alvares Gomes <lmartins> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Omri Hochman <ohochman> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 9.0 (Mitaka) | CC: | cchen, dbecker, dtantsur, jslagle, juhu, lmartins, mburns, mlammon, morazi, rcernin, rhel-osp-director-maint, srevivo, tvignaud |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 10.0 (Newton) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-12 15:34:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Jun Hu
2016-09-14 10:44:27 UTC
How reproducible: always.

Related software versions on the RHEL 7.2 host:
ipxe-roms-20160127-1.git6366fa7a.el7.noarch
ipxe.git-1.0.0-2071.b403.g7cc7e0e.x86_64
ipxe-roms-qemu-20160127-1.git6366fa7a.el7.noarch
ipxe.git-efi-x64-1.0.0-2071.b403.g7cc7e0e.x86_64
ipxe-bootimgs-20160127-1.git6366fa7a.el7.noarch
qemu-kvm-1.5.3-86.el7_1.6.x86_64
qemu-system-arm-2.0.0-1.el7.6.x86_64
qemu-common-2.0.0-1.el7.6.x86_64
qemu-system-x86-2.0.0-1.el7.6.x86_64
ipxe-roms-qemu-20160127-1.git6366fa7a.el7.noarch
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.4.x86_64
qemu-kvm-tools-rhev-2.1.2-23.el7_1.4.x86_64
qemu-kvm-common-1.5.3-86.el7_1.6.x86_64
qemu-img-1.5.3-86.el7_1.6.x86_64
Linux host0.redhat.com 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

(In reply to Jun Hu from comment #0)
> Description of problem:
>
> When deploying OSP 9, the Ceph node is shut down after the image is pushed
> to disk, and "nova show overcloud-ceph-storage-0" then reports a "no valid
> host" error.
>
> I found the following errors in nova-scheduler.log:
>
> 2016-09-14 11:29:20.735 29957 DEBUG nova.scheduler.filters.ram_filter
> [req-983df749-f1e1-4c42-ae2b-1a27b0d15022 e219fe3068104eeca98d940801e28a6c
> 7f10576f2a1f4531985c5a1a1bc07db6 - - -] (director.redhat.com,
> 6d3362b1-4477-4b7b-b2dd-6c83fb523483) ram: 0MB disk: 0MB io_ops: 0
> instances: 0 does not have 1024 MB usable ram before overcommit, it only has
> 0 MB. host_passes
> /usr/lib/python2.7/site-packages/nova/scheduler/filters/ram_filter.py:45
> 2016-09-14 11:29:20.735 29957 INFO nova.filters
> [req-983df749-f1e1-4c42-ae2b-1a27b0d15022 e219fe3068104eeca98d940801e28a6c
> 7f10576f2a1f4531985c5a1a1bc07db6 - - -] Filter RamFilter returned 0 hosts
>
> My Ceph VM has 2048 MB of memory, so why does 0 MB appear in the log?
> I suspect the value comes from the Ceph node, which is in shutdown state.
>
> Why is the Ceph node shut down after the image is pushed to disk? The
> controller and compute nodes are always running.
>
> Version-Release number of selected component (if applicable):
> OSP 9
>
> Host: RHEL 7.2, with 3 VMs as controller, compute, and Ceph nodes.

By the description this sounds like a problem with scheduling and the enabled filters. https://bugzilla.redhat.com/show_bug.cgi?id=1370651 also comes to mind. I'm changing this bug's component to rhel-osp-director because it looks more like a broader issue or misconfiguration involving multiple components than an isolated issue in Ironic (which does not do scheduling).

Lucas, what does that have to do with RamFilter?

James, please don't randomly assign things to Ironic; Lucas already explained why that is wrong. RamFilter is mentioned in the report, so I suspect some mismatch between nodes and flavors.

Jun Hu, please provide the "nova flavor-show" output for your Ceph flavor. Please also provide the "ironic node-show" output for the nodes tagged to be used for Ceph. Finally, please provide the output of "nova hypervisor-stats" before and after the failed deployment, as well as the nova-scheduler and ironic-conductor logs.
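For context on the "it only has 0 MB" line: RamFilter compares the flavor's requested RAM against the RAM reported for the (ironic-backed) hypervisor, and a likely reason for seeing 0 MB is that nova's ironic driver reports zero resources for a node it considers unusable (for example, one in maintenance or not yet back in an available state), so no flavor can pass. A minimal sketch of the filter logic, approximating the Mitaka-era nova code rather than quoting it:

```python
# Simplified sketch (not a verbatim copy) of the check behind
# nova.scheduler.filters.ram_filter.RamFilter.host_passes.
from types import SimpleNamespace


def host_passes(host_state, requested_ram_mb):
    """Return True if the host has enough usable RAM for the request."""
    free_ram_mb = host_state.free_ram_mb
    total_usable_ram_mb = host_state.total_usable_ram_mb

    # Overcommit is applied to the total before subtracting what is used.
    memory_mb_limit = total_usable_ram_mb * host_state.ram_allocation_ratio
    used_ram_mb = total_usable_ram_mb - free_ram_mb
    usable_ram_mb = memory_mb_limit - used_ram_mb

    # This is the condition behind the "... does not have 1024 MB usable ram
    # before overcommit, it only has 0 MB" debug line in nova-scheduler.log.
    return usable_ram_mb >= requested_ram_mb


# A node reported with zero resources fails for every flavor, regardless of
# how much memory the underlying VM really has.
unusable_node = SimpleNamespace(free_ram_mb=0, total_usable_ram_mb=0,
                                ram_allocation_ratio=1.0)
print(host_passes(unusable_node, 1024))  # False -> "Filter RamFilter returned 0 hosts"
```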
Hi Dmitry,

Sorry for reopening the bug. We are doing a PoC with Huawei and Orange now, and we can reproduce this issue now and then. Re-deploying the environment sometimes gets past it, but Orange is keen on finding the root cause. We tried to deploy one controller plus one compute node and sometimes saw "RamFilter returned 0 hosts". If we can reproduce the issue again, my understanding is that comment 5 and comment 6 describe the information to collect. If anything else is still needed, please let me know.

Best Regards,
Chen

Yes, this information will be fine for the beginning.
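For anyone collecting the data requested above, the "nova flavor-show" and "ironic node-show" output lends itself to a quick consistency check. A rough sketch, assuming the values are copied by hand from that output; the diagnose helper and the example values are hypothetical, not taken from this bug:

```python
# Hypothetical helper (not part of any OpenStack client): cross-check a flavor
# against an Ironic node using values copied from "nova flavor-show" and
# "ironic node-show" output.

def diagnose(flavor, node):
    """List likely reasons the scheduler would refuse to place flavor on node."""
    problems = []
    props = node["properties"]

    # A node only exposes its resources to nova when it is schedulable at all;
    # otherwise it tends to show up with zero RAM/CPU/disk in the scheduler.
    if node.get("maintenance"):
        problems.append("node is in maintenance mode")
    if node.get("provision_state") != "available":
        problems.append("provision_state is %r, not 'available'"
                        % node.get("provision_state"))
    if node.get("instance_uuid"):
        problems.append("node is already bound to instance %s" % node["instance_uuid"])

    # Flavor vs. node properties (what RamFilter/DiskFilter-style checks compare).
    if int(props["memory_mb"]) < int(flavor["ram"]):
        problems.append("memory_mb %s < flavor ram %s" % (props["memory_mb"], flavor["ram"]))
    if int(props["local_gb"]) < int(flavor["disk"]):
        problems.append("local_gb %s < flavor disk %s" % (props["local_gb"], flavor["disk"]))
    if int(props["cpus"]) < int(flavor["vcpus"]):
        problems.append("cpus %s < flavor vcpus %s" % (props["cpus"], flavor["vcpus"]))

    return problems or ["no obvious flavor/node mismatch found"]


# Example values in the shape of the CLI output (made up, not from this bug):
for line in diagnose(
        flavor={"ram": 1024, "disk": 40, "vcpus": 1},
        node={"maintenance": False, "provision_state": "available",
              "instance_uuid": None,
              "properties": {"memory_mb": 2048, "local_gb": 40, "cpus": 2}}):
    print(line)
```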