Bug 1554460 - Potential "not mapped to any cell" error when number of ironic nodes is less than number of ironic nova-computes
Summary: Potential "not mapped to any cell" error when number of ironic nodes is less ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
medium
medium
Target Milestone: Upstream M2
: 14.0 (Rocky)
Assignee: Dan Smith
QA Contact: nova-maint
URL:
Whiteboard:
Depends On:
Blocks: 1557068 1561625
TreeView+ depends on / blocked
 
Reported: 2018-03-12 17:41 UTC by Alexander Chuzhoy
Modified: 2019-09-09 14:10 UTC (History)
21 users (show)

Fixed In Version: openstack-nova-18.0.0-0.20180710150340.8469fa7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1557068 1561625 (view as bug list)
Environment:
Last Closed: 2019-01-11 11:49:11 UTC
Target Upstream Version:


Attachments (Terms of Use)
nova logs. (4.74 MB, application/x-gzip)
2018-03-13 16:01 UTC, Alexander Chuzhoy
no flags Details
nova logs from controller1 (4.48 MB, application/x-gzip)
2018-03-13 16:53 UTC, Alexander Chuzhoy
no flags Details
nova logs from controller2 (4.68 MB, application/x-gzip)
2018-03-13 16:53 UTC, Alexander Chuzhoy
no flags Details


Links
System ID Priority Status Summary Last Updated
Launchpad 1755602 None None None 2018-03-13 21:29:02 UTC
OpenStack gerrit 552691 None MERGED Add --by-service to discover_hosts 2020-10-13 19:25:29 UTC
Red Hat Product Errata RHEA-2019:0045 None None None 2019-01-11 11:49:30 UTC

Description Alexander Chuzhoy 2018-03-12 17:41:06 UTC
Occasional race condition between the periodic task that does the mapping and nova boot.

Environment:
puppet-ironic-12.3.1-0.20180221115553.12ab03d.el7ost.noarch
openstack-ironic-conductor-10.1.2-0.20180223164019.4c3a611.el7ost.noarch
python-nova-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
puppet-nova-12.3.1-0.20180221110444.c11423e.el7ost.noarch
openstack-tripleo-heat-templates-8.0.0-0.20180227121938.e0f59ee.el7ost.noarch
openstack-nova-scheduler-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-0.20180220235748.de59d74.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch
python2-ironicclient-2.2.0-0.20180211230646.683b7c6.el7ost.noarch
python-ironic-inspector-client-3.1.0-0.20180213173236.c82b59f.el7ost.noarch
openstack-ironic-api-10.1.2-0.20180223164019.4c3a611.el7ost.noarch
openstack-ironic-inspector-7.2.0-0.20180217115157.a8d621f.el7ost.noarch
openstack-nova-placement-api-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-nova-compute-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
instack-undercloud-8.2.1-0.20180228034814.945b1d0.el7ost.noarch
openstack-nova-api-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-nova-conductor-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
python-ironic-lib-2.12.0-0.20180213172054.831c55b.el7ost.noarch
openstack-nova-common-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-ironic-common-10.1.2-0.20180223164019.4c3a611.el7ost.noarch



Steps to reproduce:

Add nodes to ironic in OC and try to boot them in a script.

Result:
(overcloud) [stack@undercloud-0 ~]$ nova list
nova delete in+--------------------------------------+-----------+--------+------------+-------------+----------+
| ID                                   | Name      | Status | Task State | Power State | Networks |
+--------------------------------------+-----------+--------+------------+-------------+----------+
| 033a380c-9d88-4118-a934-0230f4d7ab17 | instance1 | ERROR  | -          | NOSTATE     |          |
| 4d104877-6dfa-4ec9-8997-67b8a6ec50eb | instance2 | ERROR  | -          | NOSTATE     |          |
+--------------------------------------+-----------+--------+------------+-------------+----------+







(overcloud) [stack@undercloud-0 ~]$ nova show instance1
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| Property                             | Value                                                                                                                    |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                                                                                   |
| OS-EXT-AZ:availability_zone          |                                                                                                                          |
| OS-EXT-SRV-ATTR:host                 | -                                                                                                                        |
| OS-EXT-SRV-ATTR:hostname             | instance1                                                                                                                |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | -                                                                                                                        |
| OS-EXT-SRV-ATTR:instance_name        | instance-00000002                                                                                                        |
| OS-EXT-SRV-ATTR:kernel_id            | 23a765ba-c32d-411c-8ff5-ad086f81932f                                                                                     |
| OS-EXT-SRV-ATTR:launch_index         | 0                                                                                                                        |
| OS-EXT-SRV-ATTR:ramdisk_id           | 995790c9-7561-4cb3-bfa5-d04fcd50764d                                                                                     |
| OS-EXT-SRV-ATTR:reservation_id       | r-cow2f89t                                                                                                               |
| OS-EXT-SRV-ATTR:root_device_name     | -                                                                                                                        |
| OS-EXT-SRV-ATTR:user_data            | -                                                                                                                        |
| OS-EXT-STS:power_state               | 0                                                                                                                        |
| OS-EXT-STS:task_state                | -                                                                                                                        |
| OS-EXT-STS:vm_state                  | error                                                                                                                    |
| OS-SRV-USG:launched_at               | -                                                                                                                        |
| OS-SRV-USG:terminated_at             | -                                                                                                                        |
| accessIPv4                           |                                                                                                                          |
| accessIPv6                           |                                                                                                                          |
| config_drive                         |                                                                                                                          |
| created                              | 2018-03-12T16:43:11Z                                                                                                     |
| description                          | instance1                                                                                                                |
| fault                                | {"message": "Host 'controller-0.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-12T16:43:13Z"} |
| flavor:disk                          | 20                                                                                                                       |
| flavor:ephemeral                     | 0                                                                                                                        |
| flavor:extra_specs                   | {"baremetal": "true"}                                                                                                    |
| flavor:original_name                 | baremetal                                                                                                                |
| flavor:ram                           | 1024                                                                                                                     |
| flavor:swap                          | 0                                                                                                                        |
| flavor:vcpus                         | 1                                                                                                                        |
| hostId                               |                                                                                                                          |
| host_status                          |                                                                                                                          |
| id                                   | 033a380c-9d88-4118-a934-0230f4d7ab17                                                                                     |
| image                                | overcloud-full (9f5f84ff-07c4-46b6-abd5-8407ecb99adb)                                                                    |
| key_name                             | stack-key                                                                                                                |
| locked                               | False                                                                                                                    |
| metadata                             | {}                                                                                                                       |
| name                                 | instance1                                                                                                                |
| os-extended-volumes:volumes_attached | []                                                                                                                       |
| status                               | ERROR                                                                                                                    |
| tags                                 | []                                                                                                                       |
| tenant_id                            | cb324992e1c34ec2bcd0617e94c4e353                                                                                         |
| updated                              | 2018-03-12T16:43:13Z                                                                                                     |
| user_id                              | 7f6fb91cbb0149e280e606b368f0bba7                                                                                         |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+




Note:
Waiting a minute after the enrol seems to work.
I was able to delete an instance with error and without performing any changes on the environment, was able to successfully boot one using the same boot command.

Comment 4 Alexander Chuzhoy 2018-03-12 20:33:56 UTC
Added instruction to wait 60 seconds after enrolling the nodes.

One one setup the boot was flowless, on another one node was with status error:

(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                |
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+
| 051a548f-d981-4e87-ad4b-d46019498b70 | instance1 | ACTIVE | -          | Running     | baremetal=192.168.24.48 |
| 18dabd88-56a4-407f-89fe-b56be95b90b8 | instance2 | ERROR  | -          | NOSTATE     |                         |
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+



(overcloud) [stack@undercloud-0 ~]$ nova show instance2
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| Property                             | Value                                                                                                                    |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                                                                                   |
| OS-EXT-AZ:availability_zone          |                                                                                                                          |
| OS-EXT-SRV-ATTR:host                 | -                                                                                                                        |
| OS-EXT-SRV-ATTR:hostname             | instance2                                                                                                                |                           
| OS-EXT-SRV-ATTR:hypervisor_hostname  | -                                                                                                                        |                           
| OS-EXT-SRV-ATTR:instance_name        | instance-00000002                                                                                                        |
| OS-EXT-SRV-ATTR:kernel_id            | d6b823de-4307-44bf-99b8-d5f97891595c                                                                                     |
| OS-EXT-SRV-ATTR:launch_index         | 0                                                                                                                        |
| OS-EXT-SRV-ATTR:ramdisk_id           | a3f208ed-9a0e-49fd-ac3d-316be46cd6a7                                                                                     |
| OS-EXT-SRV-ATTR:reservation_id       | r-hcnn51mv                                                                                                               |
| OS-EXT-SRV-ATTR:root_device_name     | -                                                                                                                        |
| OS-EXT-SRV-ATTR:user_data            | -                                                                                                                        |
| OS-EXT-STS:power_state               | 0                                                                                                                        |
| OS-EXT-STS:task_state                | -                                                                                                                        |
| OS-EXT-STS:vm_state                  | error                                                                                                                    |
| OS-SRV-USG:launched_at               | -                                                                                                                        |
| OS-SRV-USG:terminated_at             | -                                                                                                                        |
| accessIPv4                           |                                                                                                                          |
| accessIPv6                           |                                                                                                                          |
| config_drive                         |                                                                                                                          |
| created                              | 2018-03-12T19:58:26Z                                                                                                     |
| description                          | instance2                                                                                                                |
| fault                                | {"message": "Host 'controller-0.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-12T19:58:29Z"} |
| flavor:disk                          | 20                                                                                                                       |
| flavor:ephemeral                     | 0                                                                                                                        |
| flavor:extra_specs                   | {"baremetal": "true"}                                                                                                    |
| flavor:original_name                 | baremetal                                                                                                                |
| flavor:ram                           | 1024                                                                                                                     |
| flavor:swap                          | 0                                                                                                                        |
| flavor:vcpus                         | 1                                                                                                                        |
| hostId                               |                                                                                                                          |
| host_status                          |                                                                                                                          |
| id                                   | 18dabd88-56a4-407f-89fe-b56be95b90b8                                                                                     |
| image                                | overcloud-full (ccd9a4de-f493-43bf-820a-bc96145c55e9)                                                                    |
| key_name                             | stack-key                                                                                                                |
| locked                               | False                                                                                                                    |
| metadata                             | {}                                                                                                                       |
| name                                 | instance2                                                                                                                |
| os-extended-volumes:volumes_attached | []                                                                                                                       |
| status                               | ERROR                                                                                                                    |
| tags                                 | []                                                                                                                       |
| tenant_id                            | b1d127bba4594de2b2f7134adf216848                                                                                         |
| updated                              | 2018-03-12T19:58:29Z                                                                                                     |
| user_id                              | 280ef7c29bb54d9d9e4e2aaa325a1f15                                                                                         |

Comment 5 Ollie Walsh 2018-03-13 14:39:41 UTC
Could you attach logs please, at least for nova and ironic? Wondering if something is delaying the ironic nodes registering in nova and increasing the chances of hitting the race.

Comment 7 Alexander Chuzhoy 2018-03-13 15:54:29 UTC
Reproduced the issue with 1 node being active and another with error.

Delete the node with error status and tried to re-created. Ended up with error again:

{"message": "Host 'controller-2.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-13T15:52:38Z"} |

Comment 8 Alexander Chuzhoy 2018-03-13 16:01:16 UTC
Created attachment 1407634 [details]
nova logs.

Comment 9 Ollie Walsh 2018-03-13 16:10:18 UTC
Need logs from all controllers ideally

Comment 10 Ollie Walsh 2018-03-13 16:11:55 UTC
Default in nova.conf is periodic_fuzzy_delay = 60, so a 60s delay is still potentially racy I expect.

Comment 11 Ollie Walsh 2018-03-13 16:30:03 UTC
Looking at the attached logs, the periodic task appears to be picking up the controller-0 nodes very soon after they are reported:

nova-compute log:
2018-03-12 22:56:43.700 1 INFO nova.compute.resource_tracker [req-ef9d132a-1b3c-43fb-8311-f0b3ba12a58a - - - - -] Compute node record created for controller-0.localdomain:d137e261-bbf2-45c6-8f3e-87b60374925f with uuid: 1d3cdfd8-c165-4877-a50a-230f127ef308

nova-scheduler log:
2018-03-12 22:56:46.436 1 DEBUG oslo_service.periodic_task [req-0497666e-7073-4b9d-8d7d-4ead3d47016d - - - - -] Running periodic task SchedulerManager._discover_hosts_in_cells run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2018-03-12 22:56:46.442 1 DEBUG oslo_concurrency.lockutils [req-5b61513d-3c81-4722-81fb-27f0b2baca82 - - - - -] Lock "8674ced6-e574-4f1a-abf3-4d4d1ef27003" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-03-12 22:56:46.442 1 DEBUG oslo_concurrency.lockutils [req-5b61513d-3c81-4722-81fb-27f0b2baca82 - - - - -] Lock "8674ced6-e574-4f1a-abf3-4d4d1ef27003" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-03-12 22:56:46.552 1 DEBUG oslo_concurrency.lockutils [req-332767ca-2bb7-4292-b786-fe8153370636 - - - - -] Lock "host_instance" acquired by "nova.scheduler.host_manager.sync_instance_info" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-03-12 22:56:46.552 1 INFO nova.scheduler.host_manager [req-332767ca-2bb7-4292-b786-fe8153370636 - - - - -] Successfully synced instances from host 'controller-0.localdomain'.

Comment 12 Alexander Chuzhoy 2018-03-13 16:53:02 UTC
Created attachment 1407638 [details]
nova logs  from controller1

Comment 13 Alexander Chuzhoy 2018-03-13 16:53:28 UTC
Created attachment 1407639 [details]
nova logs  from controller2

Comment 14 Alexander Chuzhoy 2018-03-13 19:28:10 UTC
Even with 120 second wait after the enrol. Out of 2 nodes, the second is with status error:

{"message": "Host 'controller-1.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-13T19:13:00Z"}

Comment 15 Ollie Walsh 2018-03-13 20:39:19 UTC
Initially have one ironic node on controller-0 and controller-1. Host mappings are are successfully created for both:

# grep "Discovered" controller-*/nova-scheduler.log* 
controller-1/nova-scheduler.log:2018-03-12 22:45:59.234 1 INFO nova.scheduler.manager [req-5061cc52-7728-479a-913c-7cc3c4c0427c - - - - -] Discovered 1 new hosts: default:compute-1.localdomain
controller-1/nova-scheduler.log:2018-03-12 22:46:14.299 1 INFO nova.scheduler.manager [req-87669c18-66a8-4bbd-9f8d-c738f2be9476 - - - - -] Discovered 1 new hosts: default:compute-0.localdomain
controller-1/nova-scheduler.log:2018-03-12 22:56:44.197 1 INFO nova.scheduler.manager [req-9815289d-1f97-4ca7-a58b-d451b669bc28 - - - - -] Discovered 1 new hosts: default:controller-0.localdomain
controller-2/nova-scheduler.log:2018-03-12 22:56:53.817 1 INFO nova.scheduler.manager [req-7692ce5b-b6c0-48c0-b8d9-252f07228304 - - - - -] Discovered 1 new hosts: default:controller-1.localdomain


Afterwards the controller-0 ironic node is moved to controller-2:
# grep 'moving from' controller-1/nova-compute.log 
2018-03-12 22:56:47.714 1 INFO nova.compute.resource_tracker [req-98279f70-38c7-41ca-ac2d-90596eab3c45 - - - - -] ComputeNode de6ab2a0-b24e-42ee-a44c-d518bef356d6 moving from controller-0.localdomain to controller-1.localdomain
# grep 'moving from' controller-2/nova-compute.log 
2018-03-12 22:56:54.318 1 INFO nova.compute.resource_tracker [req-8b147e1f-f9a8-4dd7-abba-8a6d7e6c228f - - - - -] ComputeNode de6ab2a0-b24e-42ee-a44c-d518bef356d6 moving from controller-1.localdomain to controller-2.localdomain

But a host mapping is not created for controller-2 AFAICT.

Comment 32 errata-xmlrpc 2019-01-11 11:49:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045


Note You need to log in before you can comment on or make changes to this bug.