Bug 1554460
Summary: Potential "not mapped to any cell" error when number of ironic nodes is less than number of ironic nova-computes

Product:           Red Hat OpenStack
Component:         openstack-nova
Version:           13.0 (Queens)
Target Release:    14.0 (Rocky)
Target Milestone:  Upstream M2
Hardware:          x86_64
OS:                Linux
Status:            CLOSED ERRATA
Severity:          medium
Priority:          medium
Keywords:          Triaged
Type:              Bug
Fixed In Version:  openstack-nova-18.0.0-0.20180710150340.8469fa7
Reporter:          Alexander Chuzhoy <sasha>
Assignee:          Dan Smith <dasmith>
QA Contact:        OSP DFG:Compute <osp-dfg-compute>
CC:                berrange, bfournie, dasmith, derekh, dtantsur, eglynn, jhakimra, kchamart, lyarwood, mbooth, mburns, mlammon, owalsh, racedoro, rhel-osp-director-maint, sasha, sbauza, sclewis, sgordon, srevivo, vromanso
Clones:            1557068, 1561625 (view as bug list)
Bug Blocks:        1557068, 1561625
Last Closed:       2019-01-11 11:49:11 UTC
Description (Alexander Chuzhoy, 2018-03-12 17:41:06 UTC)
Added an instruction to wait 60 seconds after enrolling the nodes. On one setup the boot was flawless; on another, one node ended up with status error:

(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                |
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+
| 051a548f-d981-4e87-ad4b-d46019498b70 | instance1 | ACTIVE | -          | Running     | baremetal=192.168.24.48 |
| 18dabd88-56a4-407f-89fe-b56be95b90b8 | instance2 | ERROR  | -          | NOSTATE     |                         |
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+

(overcloud) [stack@undercloud-0 ~]$ nova show instance2
| Property                             | Value |
| OS-DCF:diskConfig                    | MANUAL |
| OS-EXT-AZ:availability_zone          |  |
| OS-EXT-SRV-ATTR:host                 | - |
| OS-EXT-SRV-ATTR:hostname             | instance2 |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | - |
| OS-EXT-SRV-ATTR:instance_name        | instance-00000002 |
| OS-EXT-SRV-ATTR:kernel_id            | d6b823de-4307-44bf-99b8-d5f97891595c |
| OS-EXT-SRV-ATTR:launch_index         | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id           | a3f208ed-9a0e-49fd-ac3d-316be46cd6a7 |
| OS-EXT-SRV-ATTR:reservation_id       | r-hcnn51mv |
| OS-EXT-SRV-ATTR:root_device_name     | - |
| OS-EXT-SRV-ATTR:user_data            | - |
| OS-EXT-STS:power_state               | 0 |
| OS-EXT-STS:task_state                | - |
| OS-EXT-STS:vm_state                  | error |
| OS-SRV-USG:launched_at               | - |
| OS-SRV-USG:terminated_at             | - |
| accessIPv4                           |  |
| accessIPv6                           |  |
| config_drive                         |  |
| created                              | 2018-03-12T19:58:26Z |
| description                          | instance2 |
| fault                                | {"message": "Host 'controller-0.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-12T19:58:29Z"} |
| flavor:disk                          | 20 |
| flavor:ephemeral                     | 0 |
| flavor:extra_specs                   | {"baremetal": "true"} |
| flavor:original_name                 | baremetal |
| flavor:ram                           | 1024 |
| flavor:swap                          | 0 |
| flavor:vcpus                         | 1 |
| hostId                               |  |
| host_status                          |  |
| id                                   | 18dabd88-56a4-407f-89fe-b56be95b90b8 |
| image                                | overcloud-full (ccd9a4de-f493-43bf-820a-bc96145c55e9) |
| key_name                             | stack-key |
| locked                               | False |
| metadata                             | {} |
| name                                 | instance2 |
| os-extended-volumes:volumes_attached | [] |
| status                               | ERROR |
| tags                                 | [] |
| tenant_id                            | b1d127bba4594de2b2f7134adf216848 |
| updated                              | 2018-03-12T19:58:29Z |
| user_id                              | 280ef7c29bb54d9d9e4e2aaa325a1f15 |

Could you attach logs please, at least for nova and ironic? Wondering if something is delaying the ironic nodes registering in nova and increasing the chances of hitting the race.

Reproduced the issue with 1 node being active and another in error. Deleted the node with error status and tried to re-create it. Ended up with an error again:

{"message": "Host 'controller-2.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-13T15:52:38Z"}

Created attachment 1407634 [details]
nova logs.
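For context, the "not mapped to any cell" fault above comes from the cells v2 host-mapping lookup: the API and scheduler can only target a host that has a HostMapping record linking it to a cell. A minimal sketch of that bookkeeping (plain Python, not actual nova code; the function name and data here are illustrative):

```python
class HostMappingNotFound(Exception):
    pass

# host -> cell, as `nova-manage cell_v2 discover_hosts` (or the
# scheduler's periodic discovery task) would have recorded them
host_mappings = {"controller-1.localdomain": "default"}

def target_cell_for(host):
    """Look up the cell a host is mapped to; fail like nova does if absent."""
    try:
        return host_mappings[host]
    except KeyError:
        raise HostMappingNotFound(
            "Host '%s' is not mapped to any cell" % host)

print(target_cell_for("controller-1.localdomain"))  # discovered host: ok
try:
    # registered as a compute service, but discovery never mapped it
    target_cell_for("controller-0.localdomain")
except HostMappingNotFound as exc:
    print(exc)
```

In practice the mapping is created either by running `nova-manage cell_v2 discover_hosts` explicitly or by the scheduler's host-discovery periodic task, so a compute service that has registered but not yet been discovered is invisible to placement of new instances.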
Need logs from all controllers, ideally.

The default in nova.conf is periodic_fuzzy_delay = 60, so a 60s delay is still potentially racy, I expect.

Looking at the attached logs, the periodic task appears to be picking up the controller-0 nodes very soon after they are reported.

nova-compute log:

2018-03-12 22:56:43.700 1 INFO nova.compute.resource_tracker [req-ef9d132a-1b3c-43fb-8311-f0b3ba12a58a - - - - -] Compute node record created for controller-0.localdomain:d137e261-bbf2-45c6-8f3e-87b60374925f with uuid: 1d3cdfd8-c165-4877-a50a-230f127ef308

nova-scheduler log:

2018-03-12 22:56:46.436 1 DEBUG oslo_service.periodic_task [req-0497666e-7073-4b9d-8d7d-4ead3d47016d - - - - -] Running periodic task SchedulerManager._discover_hosts_in_cells run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2018-03-12 22:56:46.442 1 DEBUG oslo_concurrency.lockutils [req-5b61513d-3c81-4722-81fb-27f0b2baca82 - - - - -] Lock "8674ced6-e574-4f1a-abf3-4d4d1ef27003" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-03-12 22:56:46.442 1 DEBUG oslo_concurrency.lockutils [req-5b61513d-3c81-4722-81fb-27f0b2baca82 - - - - -] Lock "8674ced6-e574-4f1a-abf3-4d4d1ef27003" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-03-12 22:56:46.552 1 DEBUG oslo_concurrency.lockutils [req-332767ca-2bb7-4292-b786-fe8153370636 - - - - -] Lock "host_instance" acquired by "nova.scheduler.host_manager.sync_instance_info" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-03-12 22:56:46.552 1 INFO nova.scheduler.host_manager [req-332767ca-2bb7-4292-b786-fe8153370636 - - - - -] Successfully synced instances from host 'controller-0.localdomain'.

Created attachment 1407638 [details]
nova logs from controller1
Created attachment 1407639 [details]
nova logs from controller2
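To illustrate the periodic_fuzzy_delay point above: oslo.service staggers a periodic task's first run by a random delay of up to periodic_fuzzy_delay seconds, with subsequent runs one interval apart, so host discovery can land well after any fixed post-enrollment wait. A rough timing model (an assumption-laden simplification, not nova's actual task scheduling; it pretends both the resource tracker's record creation and the scheduler's discovery run on 60s periodics):

```python
import random

FUZZY = 60     # periodic_fuzzy_delay default in nova.conf
INTERVAL = 60  # assumed period of both tasks in this model

def runs(rng):
    """Periodic task run times: a fuzzy initial delay, then fixed steps."""
    t = rng.uniform(0, FUZZY)
    while True:
        yield t
        t += INTERVAL

def time_until_mapped(seed):
    rng = random.Random(seed)
    # the compute node record appears on the compute service's first
    # resource-tracker run after the node is enrolled at t=0 ...
    record_time = next(runs(rng))
    # ... and the host mapping appears on the scheduler's first
    # discovery run at or after that point
    return next(t for t in runs(rng) if t >= record_time)

worst = max(time_until_mapped(s) for s in range(10000))
print(worst)  # approaches 2 * FUZZY in this model, so a fixed 60s
              # (or longer) sleep can still lose the race
```

Under this model, sleeping a fixed time after enrollment is never fully safe; running `nova-manage cell_v2 discover_hosts` explicitly after enrolling, or polling until the host mapping exists, avoids the race.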
Even with a 120 second wait after the enroll, out of 2 nodes the second ends up with status error:

{"message": "Host 'controller-1.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-13T19:13:00Z"}

Initially there is one ironic node on each of controller-0 and controller-1. Host mappings are successfully created for both:

# grep "Discovered" controller-*/nova-scheduler.log*
controller-1/nova-scheduler.log:2018-03-12 22:45:59.234 1 INFO nova.scheduler.manager [req-5061cc52-7728-479a-913c-7cc3c4c0427c - - - - -] Discovered 1 new hosts: default:compute-1.localdomain
controller-1/nova-scheduler.log:2018-03-12 22:46:14.299 1 INFO nova.scheduler.manager [req-87669c18-66a8-4bbd-9f8d-c738f2be9476 - - - - -] Discovered 1 new hosts: default:compute-0.localdomain
controller-1/nova-scheduler.log:2018-03-12 22:56:44.197 1 INFO nova.scheduler.manager [req-9815289d-1f97-4ca7-a58b-d451b669bc28 - - - - -] Discovered 1 new hosts: default:controller-0.localdomain
controller-2/nova-scheduler.log:2018-03-12 22:56:53.817 1 INFO nova.scheduler.manager [req-7692ce5b-b6c0-48c0-b8d9-252f07228304 - - - - -] Discovered 1 new hosts: default:controller-1.localdomain

Afterwards the controller-0 ironic node is moved to controller-2:

# grep 'moving from' controller-1/nova-compute.log
2018-03-12 22:56:47.714 1 INFO nova.compute.resource_tracker [req-98279f70-38c7-41ca-ac2d-90596eab3c45 - - - - -] ComputeNode de6ab2a0-b24e-42ee-a44c-d518bef356d6 moving from controller-0.localdomain to controller-1.localdomain

# grep 'moving from' controller-2/nova-compute.log
2018-03-12 22:56:54.318 1 INFO nova.compute.resource_tracker [req-8b147e1f-f9a8-4dd7-abba-8a6d7e6c228f - - - - -] ComputeNode de6ab2a0-b24e-42ee-a44c-d518bef356d6 moving from controller-1.localdomain to controller-2.localdomain

But a host mapping is not created for controller-2, AFAICT.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045
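The rebalance-without-mapping behaviour in the analysis above can be sketched as follows. This is one reading of the Queens-era cells v2 bookkeeping, heavily simplified and not actual nova code: host discovery considers only compute node records that are not yet flagged as mapped, and creates a HostMapping for whichever host owns the record at that moment. Once the flag is set, the ironic hash ring moving the node to another host creates no new HostMapping, leaving the new host unschedulable:

```python
host_mappings = set()  # hosts that have a HostMapping row

def discover_hosts(compute_nodes):
    """Map the current host of every not-yet-mapped compute node record."""
    for node in compute_nodes:
        if not node["mapped"]:
            host_mappings.add(node["host"])
            node["mapped"] = True

node = {"uuid": "de6ab2a0-b24e-42ee-a44c-d518bef356d6",
        "host": "controller-1", "mapped": False}

discover_hosts([node])         # maps controller-1
node["host"] = "controller-2"  # the hash ring rebalances the node
discover_hosts([node])         # record already flagged: nothing happens

print(host_mappings)                    # {'controller-1'}
print("controller-2" in host_mappings)  # False -> "not mapped to any cell"
```

This would explain why fewer ironic nodes than nova-compute services makes the error likely: every rebalance can hand the only node record to a host that discovery will never map. If memory serves, the upstream fix in Rocky moved discovery to map compute services rather than individual compute node records, which sidesteps the rebalance.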