Bug 1554460

Summary: Potential "not mapped to any cell" error when number of ironic nodes is less than number of ironic nova-computes
Product: Red Hat OpenStack Reporter: Alexander Chuzhoy <sasha>
Component: openstack-novaAssignee: Dan Smith <dasmith>
Status: CLOSED ERRATA QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: medium Docs Contact:
Priority: medium    
Version: 13.0 (Queens)CC: berrange, bfournie, dasmith, derekh, dtantsur, eglynn, jhakimra, kchamart, lyarwood, mbooth, mburns, mlammon, owalsh, racedoro, rhel-osp-director-maint, sasha, sbauza, sclewis, sgordon, srevivo, vromanso
Target Milestone: Upstream M2Keywords: Triaged
Target Release: 14.0 (Rocky)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-nova-18.0.0-0.20180710150340.8469fa7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1557068 1561625 (view as bug list) Environment:
Last Closed: 2019-01-11 11:49:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1557068, 1561625    
Attachments:
Description Flags
nova logs.
none
nova logs from controller1
none
nova logs from controller2 none

Description Alexander Chuzhoy 2018-03-12 17:41:06 UTC
Occasional race condition between the periodic task that does the mapping and nova boot.

Environment:
puppet-ironic-12.3.1-0.20180221115553.12ab03d.el7ost.noarch
openstack-ironic-conductor-10.1.2-0.20180223164019.4c3a611.el7ost.noarch
python-nova-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
puppet-nova-12.3.1-0.20180221110444.c11423e.el7ost.noarch
openstack-tripleo-heat-templates-8.0.0-0.20180227121938.e0f59ee.el7ost.noarch
openstack-nova-scheduler-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-0.20180220235748.de59d74.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch
python2-ironicclient-2.2.0-0.20180211230646.683b7c6.el7ost.noarch
python-ironic-inspector-client-3.1.0-0.20180213173236.c82b59f.el7ost.noarch
openstack-ironic-api-10.1.2-0.20180223164019.4c3a611.el7ost.noarch
openstack-ironic-inspector-7.2.0-0.20180217115157.a8d621f.el7ost.noarch
openstack-nova-placement-api-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-nova-compute-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
instack-undercloud-8.2.1-0.20180228034814.945b1d0.el7ost.noarch
openstack-nova-api-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-nova-conductor-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
python-ironic-lib-2.12.0-0.20180213172054.831c55b.el7ost.noarch
openstack-nova-common-17.0.0-0.20180223162252.a4a53bf.el7ost.noarch
openstack-ironic-common-10.1.2-0.20180223164019.4c3a611.el7ost.noarch



Steps to reproduce:

Add nodes to ironic in OC and try to boot them in a script.

Result:
(overcloud) [stack@undercloud-0 ~]$ nova list
nova delete in+--------------------------------------+-----------+--------+------------+-------------+----------+
| ID                                   | Name      | Status | Task State | Power State | Networks |
+--------------------------------------+-----------+--------+------------+-------------+----------+
| 033a380c-9d88-4118-a934-0230f4d7ab17 | instance1 | ERROR  | -          | NOSTATE     |          |
| 4d104877-6dfa-4ec9-8997-67b8a6ec50eb | instance2 | ERROR  | -          | NOSTATE     |          |
+--------------------------------------+-----------+--------+------------+-------------+----------+







(overcloud) [stack@undercloud-0 ~]$ nova show instance1
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| Property                             | Value                                                                                                                    |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                                                                                   |
| OS-EXT-AZ:availability_zone          |                                                                                                                          |
| OS-EXT-SRV-ATTR:host                 | -                                                                                                                        |
| OS-EXT-SRV-ATTR:hostname             | instance1                                                                                                                |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | -                                                                                                                        |
| OS-EXT-SRV-ATTR:instance_name        | instance-00000002                                                                                                        |
| OS-EXT-SRV-ATTR:kernel_id            | 23a765ba-c32d-411c-8ff5-ad086f81932f                                                                                     |
| OS-EXT-SRV-ATTR:launch_index         | 0                                                                                                                        |
| OS-EXT-SRV-ATTR:ramdisk_id           | 995790c9-7561-4cb3-bfa5-d04fcd50764d                                                                                     |
| OS-EXT-SRV-ATTR:reservation_id       | r-cow2f89t                                                                                                               |
| OS-EXT-SRV-ATTR:root_device_name     | -                                                                                                                        |
| OS-EXT-SRV-ATTR:user_data            | -                                                                                                                        |
| OS-EXT-STS:power_state               | 0                                                                                                                        |
| OS-EXT-STS:task_state                | -                                                                                                                        |
| OS-EXT-STS:vm_state                  | error                                                                                                                    |
| OS-SRV-USG:launched_at               | -                                                                                                                        |
| OS-SRV-USG:terminated_at             | -                                                                                                                        |
| accessIPv4                           |                                                                                                                          |
| accessIPv6                           |                                                                                                                          |
| config_drive                         |                                                                                                                          |
| created                              | 2018-03-12T16:43:11Z                                                                                                     |
| description                          | instance1                                                                                                                |
| fault                                | {"message": "Host 'controller-0.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-12T16:43:13Z"} |
| flavor:disk                          | 20                                                                                                                       |
| flavor:ephemeral                     | 0                                                                                                                        |
| flavor:extra_specs                   | {"baremetal": "true"}                                                                                                    |
| flavor:original_name                 | baremetal                                                                                                                |
| flavor:ram                           | 1024                                                                                                                     |
| flavor:swap                          | 0                                                                                                                        |
| flavor:vcpus                         | 1                                                                                                                        |
| hostId                               |                                                                                                                          |
| host_status                          |                                                                                                                          |
| id                                   | 033a380c-9d88-4118-a934-0230f4d7ab17                                                                                     |
| image                                | overcloud-full (9f5f84ff-07c4-46b6-abd5-8407ecb99adb)                                                                    |
| key_name                             | stack-key                                                                                                                |
| locked                               | False                                                                                                                    |
| metadata                             | {}                                                                                                                       |
| name                                 | instance1                                                                                                                |
| os-extended-volumes:volumes_attached | []                                                                                                                       |
| status                               | ERROR                                                                                                                    |
| tags                                 | []                                                                                                                       |
| tenant_id                            | cb324992e1c34ec2bcd0617e94c4e353                                                                                         |
| updated                              | 2018-03-12T16:43:13Z                                                                                                     |
| user_id                              | 7f6fb91cbb0149e280e606b368f0bba7                                                                                         |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+




Note:
Waiting a minute after the enrol seems to work.
I was able to delete an instance with error and without performing any changes on the environment, was able to successfully boot one using the same boot command.

Comment 4 Alexander Chuzhoy 2018-03-12 20:33:56 UTC
Added instruction to wait 60 seconds after enrolling the nodes.

One one setup the boot was flowless, on another one node was with status error:

(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+
| ID                                   | Name      | Status | Task State | Power State | Networks                |
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+
| 051a548f-d981-4e87-ad4b-d46019498b70 | instance1 | ACTIVE | -          | Running     | baremetal=192.168.24.48 |
| 18dabd88-56a4-407f-89fe-b56be95b90b8 | instance2 | ERROR  | -          | NOSTATE     |                         |
+--------------------------------------+-----------+--------+------------+-------------+-------------------------+



(overcloud) [stack@undercloud-0 ~]$ nova show instance2
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| Property                             | Value                                                                                                                    |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                                                                                   |
| OS-EXT-AZ:availability_zone          |                                                                                                                          |
| OS-EXT-SRV-ATTR:host                 | -                                                                                                                        |
| OS-EXT-SRV-ATTR:hostname             | instance2                                                                                                                |                           
| OS-EXT-SRV-ATTR:hypervisor_hostname  | -                                                                                                                        |                           
| OS-EXT-SRV-ATTR:instance_name        | instance-00000002                                                                                                        |
| OS-EXT-SRV-ATTR:kernel_id            | d6b823de-4307-44bf-99b8-d5f97891595c                                                                                     |
| OS-EXT-SRV-ATTR:launch_index         | 0                                                                                                                        |
| OS-EXT-SRV-ATTR:ramdisk_id           | a3f208ed-9a0e-49fd-ac3d-316be46cd6a7                                                                                     |
| OS-EXT-SRV-ATTR:reservation_id       | r-hcnn51mv                                                                                                               |
| OS-EXT-SRV-ATTR:root_device_name     | -                                                                                                                        |
| OS-EXT-SRV-ATTR:user_data            | -                                                                                                                        |
| OS-EXT-STS:power_state               | 0                                                                                                                        |
| OS-EXT-STS:task_state                | -                                                                                                                        |
| OS-EXT-STS:vm_state                  | error                                                                                                                    |
| OS-SRV-USG:launched_at               | -                                                                                                                        |
| OS-SRV-USG:terminated_at             | -                                                                                                                        |
| accessIPv4                           |                                                                                                                          |
| accessIPv6                           |                                                                                                                          |
| config_drive                         |                                                                                                                          |
| created                              | 2018-03-12T19:58:26Z                                                                                                     |
| description                          | instance2                                                                                                                |
| fault                                | {"message": "Host 'controller-0.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-12T19:58:29Z"} |
| flavor:disk                          | 20                                                                                                                       |
| flavor:ephemeral                     | 0                                                                                                                        |
| flavor:extra_specs                   | {"baremetal": "true"}                                                                                                    |
| flavor:original_name                 | baremetal                                                                                                                |
| flavor:ram                           | 1024                                                                                                                     |
| flavor:swap                          | 0                                                                                                                        |
| flavor:vcpus                         | 1                                                                                                                        |
| hostId                               |                                                                                                                          |
| host_status                          |                                                                                                                          |
| id                                   | 18dabd88-56a4-407f-89fe-b56be95b90b8                                                                                     |
| image                                | overcloud-full (ccd9a4de-f493-43bf-820a-bc96145c55e9)                                                                    |
| key_name                             | stack-key                                                                                                                |
| locked                               | False                                                                                                                    |
| metadata                             | {}                                                                                                                       |
| name                                 | instance2                                                                                                                |
| os-extended-volumes:volumes_attached | []                                                                                                                       |
| status                               | ERROR                                                                                                                    |
| tags                                 | []                                                                                                                       |
| tenant_id                            | b1d127bba4594de2b2f7134adf216848                                                                                         |
| updated                              | 2018-03-12T19:58:29Z                                                                                                     |
| user_id                              | 280ef7c29bb54d9d9e4e2aaa325a1f15                                                                                         |

Comment 5 Ollie Walsh 2018-03-13 14:39:41 UTC
Could you attach logs please, at least for nova and ironic? Wondering if something is delaying the ironic nodes registering in nova and increasing the chances of hitting the race.

Comment 7 Alexander Chuzhoy 2018-03-13 15:54:29 UTC
Reproduced the issue with 1 node being active and another with error.

Delete the node with error status and tried to re-created. Ended up with error again:

{"message": "Host 'controller-2.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-13T15:52:38Z"} |

Comment 8 Alexander Chuzhoy 2018-03-13 16:01:16 UTC
Created attachment 1407634 [details]
nova logs.

Comment 9 Ollie Walsh 2018-03-13 16:10:18 UTC
Need logs from all controllers ideally

Comment 10 Ollie Walsh 2018-03-13 16:11:55 UTC
Default in nova.conf is periodic_fuzzy_delay = 60, so a 60s delay is still potentially racy I expect.

Comment 11 Ollie Walsh 2018-03-13 16:30:03 UTC
Looking at the attached logs, the periodic task appears to be picking up the controller-0 nodes very soon after they are reported:

nova-compute log:
2018-03-12 22:56:43.700 1 INFO nova.compute.resource_tracker [req-ef9d132a-1b3c-43fb-8311-f0b3ba12a58a - - - - -] Compute node record created for controller-0.localdomain:d137e261-bbf2-45c6-8f3e-87b60374925f with uuid: 1d3cdfd8-c165-4877-a50a-230f127ef308

nova-scheduler log:
2018-03-12 22:56:46.436 1 DEBUG oslo_service.periodic_task [req-0497666e-7073-4b9d-8d7d-4ead3d47016d - - - - -] Running periodic task SchedulerManager._discover_hosts_in_cells run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2018-03-12 22:56:46.442 1 DEBUG oslo_concurrency.lockutils [req-5b61513d-3c81-4722-81fb-27f0b2baca82 - - - - -] Lock "8674ced6-e574-4f1a-abf3-4d4d1ef27003" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-03-12 22:56:46.442 1 DEBUG oslo_concurrency.lockutils [req-5b61513d-3c81-4722-81fb-27f0b2baca82 - - - - -] Lock "8674ced6-e574-4f1a-abf3-4d4d1ef27003" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-03-12 22:56:46.552 1 DEBUG oslo_concurrency.lockutils [req-332767ca-2bb7-4292-b786-fe8153370636 - - - - -] Lock "host_instance" acquired by "nova.scheduler.host_manager.sync_instance_info" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-03-12 22:56:46.552 1 INFO nova.scheduler.host_manager [req-332767ca-2bb7-4292-b786-fe8153370636 - - - - -] Successfully synced instances from host 'controller-0.localdomain'.

Comment 12 Alexander Chuzhoy 2018-03-13 16:53:02 UTC
Created attachment 1407638 [details]
nova logs  from controller1

Comment 13 Alexander Chuzhoy 2018-03-13 16:53:28 UTC
Created attachment 1407639 [details]
nova logs  from controller2

Comment 14 Alexander Chuzhoy 2018-03-13 19:28:10 UTC
Even with 120 second wait after the enrol. Out of 2 nodes, the second is with status error:

{"message": "Host 'controller-1.localdomain' is not mapped to any cell", "code": 400, "created": "2018-03-13T19:13:00Z"}

Comment 15 Ollie Walsh 2018-03-13 20:39:19 UTC
Initially have one ironic node on controller-0 and controller-1. Host mappings are are successfully created for both:

# grep "Discovered" controller-*/nova-scheduler.log* 
controller-1/nova-scheduler.log:2018-03-12 22:45:59.234 1 INFO nova.scheduler.manager [req-5061cc52-7728-479a-913c-7cc3c4c0427c - - - - -] Discovered 1 new hosts: default:compute-1.localdomain
controller-1/nova-scheduler.log:2018-03-12 22:46:14.299 1 INFO nova.scheduler.manager [req-87669c18-66a8-4bbd-9f8d-c738f2be9476 - - - - -] Discovered 1 new hosts: default:compute-0.localdomain
controller-1/nova-scheduler.log:2018-03-12 22:56:44.197 1 INFO nova.scheduler.manager [req-9815289d-1f97-4ca7-a58b-d451b669bc28 - - - - -] Discovered 1 new hosts: default:controller-0.localdomain
controller-2/nova-scheduler.log:2018-03-12 22:56:53.817 1 INFO nova.scheduler.manager [req-7692ce5b-b6c0-48c0-b8d9-252f07228304 - - - - -] Discovered 1 new hosts: default:controller-1.localdomain


Afterwards the controller-0 ironic node is moved to controller-2:
# grep 'moving from' controller-1/nova-compute.log 
2018-03-12 22:56:47.714 1 INFO nova.compute.resource_tracker [req-98279f70-38c7-41ca-ac2d-90596eab3c45 - - - - -] ComputeNode de6ab2a0-b24e-42ee-a44c-d518bef356d6 moving from controller-0.localdomain to controller-1.localdomain
# grep 'moving from' controller-2/nova-compute.log 
2018-03-12 22:56:54.318 1 INFO nova.compute.resource_tracker [req-8b147e1f-f9a8-4dd7-abba-8a6d7e6c228f - - - - -] ComputeNode de6ab2a0-b24e-42ee-a44c-d518bef356d6 moving from controller-1.localdomain to controller-2.localdomain

But a host mapping is not created for controller-2 AFAICT.

Comment 32 errata-xmlrpc 2019-01-11 11:49:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045