Occasional race condition between the periodic task that does the host-to-cell mapping and nova boot.

Environment:
puppet-ironic-12.3.1-0.20180221115553.12ab03d.el7ost.noarch
openstack-ironic-conductor-10.1.2-0.20180223164019.4c3a611.el7ost.noarch
python-nova-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
puppet-nova-12.3.1-0.20180221110444.c11423e.el7ost.noarch
openstack-tripleo-heat-templates-8.0.0-0.20180227121938.e0f59ee.el7ost.noarch
openstack-nova-scheduler-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-0.20180220235748.de59d74.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch
python2-ironicclient-2.2.0-0.20180211230646.683b7c6.el7ost.noarch
python-ironic-inspector-client-3.1.0-0.20180213173236.c82b59f.el7ost.noarch
openstack-ironic-api-10.1.2-0.20180223164019.4c3a611.el7ost.noarch
openstack-ironic-inspector-7.2.0-0.20180217115157.a8d621f.el7ost.noarch
openstack-nova-placement-api-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-nova-compute-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
instack-undercloud-8.2.1-0.20180228034814.945b1d0.el7ost.noarch
openstack-nova-api-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-nova-conductor-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
python-ironic-lib-2.12.0-0.20180213172054.831c55b.el7ost.noarch
openstack-nova-common-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-ironic-common-10.1.2-0.20180223164019.4c3a611.el7ost.noarch

Steps to reproduce:
Add nodes to ironic in the overcloud and try to boot them from a script.

Result:
(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+-----------+--------+------------+-------------+----------+
| ID                                   | Name      | Status | Task State | Power State | Networks |
+--------------------------------------+-----------+--------+------------+-------------+----------+
| 033a380c-9d88-4118-a934-0230f4d7ab17 | instance1 | ERROR  | -          | NOSTATE     |          |
| 4d104877-6dfa-4ec9-8997-67b8a6ec50eb | instance2 | ERROR  | -          | NOSTATE     |          |
+--------------------------------------+-----------+--------+------------+-------------+----------+

(overcloud) [stack@undercloud-0 ~]$ nova show instance1
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hostname | instance1 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | instance-00000002 |
| OS-EXT-SRV-ATTR:kernel_id | 23a765ba-c32d-411c-8ff5-ad086f81932f |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | 995790c9-7561-4cb3-bfa5-d04fcd50764d |
| OS-EXT-SRV-ATTR:reservation_id | r-cow2f89t |
| OS-EXT-SRV-ATTR:root_device_name | - |
| OS-EXT-SRV-ATTR:user_data | - |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2018-03-12T16:43:11Z |
| description | instance1 |
| fault | {"message": "Host 'controller-0.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-12T16:43:13Z"} |
| flavor:disk | 20 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {"baremetal": "true"} |
| flavor:original_name | baremetal |
| flavor:ram | 1024 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | |
| host_status | |
| id | 033a380c-9d88-4118-a934-0230f4d7ab17 |
| image | overcloud-full (9f5f84ff-07c4-46b6-abd5-8407ecb99adb) |
| key_name | stack-key |
| locked | False |
| metadata | {} |
| name | instance1 |
| os-extended-volumes:volumes_attached | [] |
| status | ERROR |
| tags | [] |
| tenant_id | cb324992e1c34ec2bcd0617e94c4e353 |
| updated | 2018-03-12T16:43:13Z |
| user_id | 7f6fb91cbb0149e280e606b368f0bba7 |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+

Note: Waiting a minute after the enrolment seems to work around it. I was able to delete an instance in error state and, without making any changes to the environment, successfully boot a new one using the same boot command.
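For reference, the failing flow boils down to something like the sketch below. This is a minimal reconstruction, not the exact script used here: the node name, driver and network ID are hypothetical placeholders, while the image, flavor and key names are taken from the nova show output above.

    # enrol a node in the overcloud ironic (driver info and port setup elided)
    openstack baremetal node create --driver ipmi --name bm-node-1

    # booting immediately after enrolment is what races with the
    # scheduler's _discover_hosts_in_cells periodic task
    nova boot --image overcloud-full --flavor baremetal --key-name stack-key \
        --nic net-id=$BAREMETAL_NET_ID instance1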
Added an instruction to wait 60 seconds after enrolling the nodes. On one setup the boot was flawless; on another, one node ended up in error status:

(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                |
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+
| 051a548f-d981-4e87-ad4b-d46019498b70 | instance1 | ACTIVE | -          | Running     | baremetal=192.168.24.48 |
| 18dabd88-56a4-407f-89fe-b56be95b90b8 | instance2 | ERROR  | -          | NOSTATE     |                         |
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+

(overcloud) [stack@undercloud-0 ~]$ nova show instance2
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hostname | instance2 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | instance-00000002 |
| OS-EXT-SRV-ATTR:kernel_id | d6b823de-4307-44bf-99b8-d5f97891595c |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | a3f208ed-9a0e-49fd-ac3d-316be46cd6a7 |
| OS-EXT-SRV-ATTR:reservation_id | r-hcnn51mv |
| OS-EXT-SRV-ATTR:root_device_name | - |
| OS-EXT-SRV-ATTR:user_data | - |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2018-03-12T19:58:26Z |
| description | instance2 |
| fault | {"message": "Host 'controller-0.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-12T19:58:29Z"} |
| flavor:disk | 20 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {"baremetal": "true"} |
| flavor:original_name | baremetal |
| flavor:ram | 1024 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | |
| host_status | |
| id | 18dabd88-56a4-407f-89fe-b56be95b90b8 |
| image | overcloud-full (ccd9a4de-f493-43bf-820a-bc96145c55e9) |
| key_name | stack-key |
| locked | False |
| metadata | {} |
| name | instance2 |
| os-extended-volumes:volumes_attached | [] |
| status | ERROR |
| tags | [] |
| tenant_id | b1d127bba4594de2b2f7134adf216848 |
| updated | 2018-03-12T19:58:29Z |
| user_id | 280ef7c29bb54d9d9e4e2aaa325a1f15 |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
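Rather than sleeping for a fixed interval, it might be more reliable to force the host-to-cell mapping before booting by running discover_hosts by hand on one of the controllers. A sketch, assuming a containerized deployment where nova-api runs in a container named nova_api (an assumption; run nova-manage directly on the host otherwise):

    # create any missing host mappings immediately instead of waiting
    # for the scheduler periodic task to run
    docker exec nova_api nova-manage cell_v2 discover_hosts --verbose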
Could you attach logs please, at least for nova and ironic? I'm wondering if something is delaying the ironic nodes' registration in nova, increasing the chances of hitting the race.
Reproduced the issue with 1 node being active and another in error. Deleted the node with error status and tried to re-create it. Ended up with an error again:

{"message": "Host 'controller-2.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-13T15:52:38Z"}
Created attachment 1407634 [details] nova logs.
Ideally we need logs from all controllers.
The default in nova.conf is periodic_fuzzy_delay = 60, so even a 60s delay is still potentially racy, I expect.
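For reference, both knobs can be checked on the controllers; the config path below assumes a containerized TripleO deployment and may differ on other setups:

    # periodic_fuzzy_delay ([DEFAULT]) adds a random startup delay to
    # periodic tasks; discover_hosts_in_cells_interval ([scheduler])
    # controls how often _discover_hosts_in_cells runs
    grep -E 'periodic_fuzzy_delay|discover_hosts_in_cells_interval' \
        /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf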
Looking at the attached logs, the periodic task appears to be picking up the controller-0 nodes very soon after they are reported:

nova-compute log:
2018-03-12 22:56:43.700 1 INFO nova.compute.resource_tracker [req-ef9d132a-1b3c-43fb-8311-f0b3ba12a58a - - - - -] Compute node record created for controller-0.localdomain:d137e261-bbf2-45c6-8f3e-87b60374925f with uuid: 1d3cdfd8-c165-4877-a50a-230f127ef308

nova-scheduler log:
2018-03-12 22:56:46.436 1 DEBUG oslo_service.periodic_task [req-0497666e-7073-4b9d-8d7d-4ead3d47016d - - - - -] Running periodic task SchedulerManager._discover_hosts_in_cells run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2018-03-12 22:56:46.442 1 DEBUG oslo_concurrency.lockutils [req-5b61513d-3c81-4722-81fb-27f0b2baca82 - - - - -] Lock "8674ced6-e574-4f1a-abf3-4d4d1ef27003" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-03-12 22:56:46.442 1 DEBUG oslo_concurrency.lockutils [req-5b61513d-3c81-4722-81fb-27f0b2baca82 - - - - -] Lock "8674ced6-e574-4f1a-abf3-4d4d1ef27003" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-03-12 22:56:46.552 1 DEBUG oslo_concurrency.lockutils [req-332767ca-2bb7-4292-b786-fe8153370636 - - - - -] Lock "host_instance" acquired by "nova.scheduler.host_manager.sync_instance_info" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-03-12 22:56:46.552 1 INFO nova.scheduler.host_manager [req-332767ca-2bb7-4292-b786-fe8153370636 - - - - -] Successfully synced instances from host 'controller-0.localdomain'.
Created attachment 1407638 [details] nova logs from controller1
Created attachment 1407639 [details] nova logs from controller2
Even with a 120-second wait after enrolment, out of 2 nodes the second still ends up in error status:

{"message": "Host 'controller-1.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-13T19:13:00Z"}
Initially there is one ironic node on controller-0 and controller-1. Host mappings are successfully created for both:

# grep "Discovered" controller-*/nova-scheduler.log*
controller-1/nova-scheduler.log:2018-03-12 22:45:59.234 1 INFO nova.scheduler.manager [req-5061cc52-7728-479a-913c-7cc3c4c0427c - - - - -] Discovered 1 new hosts: default:compute-1.localdomain
controller-1/nova-scheduler.log:2018-03-12 22:46:14.299 1 INFO nova.scheduler.manager [req-87669c18-66a8-4bbd-9f8d-c738f2be9476 - - - - -] Discovered 1 new hosts: default:compute-0.localdomain
controller-1/nova-scheduler.log:2018-03-12 22:56:44.197 1 INFO nova.scheduler.manager [req-9815289d-1f97-4ca7-a58b-d451b669bc28 - - - - -] Discovered 1 new hosts: default:controller-0.localdomain
controller-2/nova-scheduler.log:2018-03-12 22:56:53.817 1 INFO nova.scheduler.manager [req-7692ce5b-b6c0-48c0-b8d9-252f07228304 - - - - -] Discovered 1 new hosts: default:controller-1.localdomain

Afterwards the controller-0 ironic node is moved to controller-2:

# grep 'moving from' controller-1/nova-compute.log
2018-03-12 22:56:47.714 1 INFO nova.compute.resource_tracker [req-98279f70-38c7-41ca-ac2d-90596eab3c45 - - - - -] ComputeNode de6ab2a0-b24e-42ee-a44c-d518bef356d6 moving from controller-0.localdomain to controller-1.localdomain

# grep 'moving from' controller-2/nova-compute.log
2018-03-12 22:56:54.318 1 INFO nova.compute.resource_tracker [req-8b147e1f-f9a8-4dd7-abba-8a6d7e6c228f - - - - -] ComputeNode de6ab2a0-b24e-42ee-a44c-d518bef356d6 moving from controller-1.localdomain to controller-2.localdomain

But a host mapping is not created for controller-2, AFAICT.
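To confirm the missing mapping, the host_mappings table in the nova_api database can be inspected directly on a controller. A sketch, assuming local MySQL access and the default nova_api schema name:

    # list which compute hosts currently have a cell mapping
    mysql nova_api -e "SELECT cm.name AS cell, hm.host
        FROM host_mappings hm
        JOIN cell_mappings cm ON hm.cell_id = cm.id;"

If controller-2 is absent from the result while its compute node record exists, that matches the failure mode seen above.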
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045