Bug 1579785
| Field | Value |
|---|---|
| Summary | On split-stack setups, leftover node information prevents a node from rejoining the cloud |
| Product | Red Hat OpenStack |
| Component | openstack-nova |
| Version | 12.0 (Pike) |
| Target Release | 12.0 (Pike) |
| Target Milestone | z3 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Keywords | Triaged, ZStream |
| Reporter | Sven Michels <svmichel> |
| Assignee | Martin Schuppert <mschuppe> |
| QA Contact | OSP DFG:Compute <osp-dfg-compute> |
| CC | awaugama, berrange, dasmith, eglynn, gferrazs, ipetrova, jamsmith, jhakimra, kchamart, lmarsh, lyarwood, madgupta, mschuppe, sbauza, sferdjao, sgordon, srevivo, svmichel, tmicheli, vromanso |
| Fixed In Version | openstack-nova-16.1.3-1.el7ost |
| Doc Type | Bug Fix |
| Cloned As | 1591788 (view as bug list) |
| Bug Blocks | 1591788, 1731150, 1781142 |
| Type | Bug |
| Last Closed | 2018-08-20 12:55:30 UTC |

Doc Text:

Prior to this update, to re-discover a compute node record after deleting a host mapping from the API database, the compute node record had to be manually marked as unmapped. Otherwise, a compute node with the same hostname could not be mapped back to the cell from which it was removed.

With this update, the compute node record is automatically marked as unmapped when you delete a host from a cell, enabling a compute node with the same hostname to be added to the cell during host discovery.
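To illustrate the Doc Text: before this fix, an operator who had deleted a host mapping had to mark the stale compute node record as unmapped by hand before host discovery would consider the host again. A minimal sketch of that manual step, assuming direct database access from a controller; the database name (nova), the hostname, and the exact schema details are illustrative and should be verified against your deployment before running anything like this:

~~~
# Hypothetical pre-fix workaround: flip the 'mapped' flag on the stale
# compute node record so discover_hosts will pick the host up again.
# 'compute-b.example.com' is an illustrative hostname.
$ mysql nova -e "UPDATE compute_nodes SET mapped = 0 \
    WHERE hypervisor_hostname = 'compute-b.example.com' AND deleted = 0;"

# Then re-run host discovery so the returning node is mapped into the cell.
$ nova-manage cell_v2 discover_hosts --verbose
~~~

With the fix, delete_host performs the unmapping itself, so this database surgery is no longer needed.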
Description (Sven Michels, 2018-05-18 10:35:14 UTC)
Comment (Dan Smith):

This was fixed in upstream pike: https://review.openstack.org/#/c/553829/

Comment (Sven Michels):

(In reply to Dan Smith from comment #1)
> This was fixed in upstream pike:
> https://review.openstack.org/#/c/553829/

Hey Dan, but this was only for cell deletion; the issue we see is when a node is deleted, right? We would need to add a node delete from the cell to the "service delete" of compute, or to the templates when we scale down a node. The first would probably be the best solution, as this can also happen without director.

Cheers, Sven

Comment (Dan Smith):

Sorry, this is the upstream fix for delete_host that I was thinking of, which unmaps the node when deleting the host mapping: https://review.openstack.org/#/c/527560/

Maybe you could elaborate on what you mean by "remove compute resource" so I know what you're deleting and re-adding? The thing is, if the compute node and service come back in the same configuration (i.e. the same hostname), the old host mapping should still be sufficient to find it again (i.e. you shouldn't need to discover again).

Comment (Sven Michels):

Hey Dan, sorry for the delay, missed that one :(

To clarify: the issue first surfaced when we started to do some scale-in and scale-out tests in our test environment. What we basically did:

- install RHOSP12 with 2 compute nodes
- scale to 3 nodes (adding node c)
- scale to 2 nodes (removing a, b or c)
- reinstall the removed node
- scale to 3 nodes (adding the removed node back)

This is in a split-stack environment, so the nodes are not installed by ironic but externally by the customer. For that reason the node name won't change: if you remove compute-b and reinstall it, it will be compute-b again.

In this scenario, the node is only removed from the environment within heat. Since no delete_host is executed, the whole mapping stays as is. So if you try to bring a node "back" (or, to simplify: if you add a node which has the exact same hostname as a node had before), it doesn't work. The existing, orphaned entry prevents the node from being re-added to the cell.

So the commit you're referring to might remove the need to fiddle around in the database manually, as a delete_host would then be enough. But we would need to add delete_host as a required step to our documentation, or add a task to our templates that executes exactly that delete_host command. Since we already ask the customer to disable and remove the service manually, the first option might be easier (except that this needs to be executed inside a container, as there is no external command for it, right?).

If you still miss something, please let me know.

Cheers and thanks, Sven
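The orphaned entry described above can be confirmed from a controller: the removed host still appears in the cell's host list even after its compute service was deleted. A quick diagnostic sketch, reusing the containerized nova-manage invocation from the procedure below; the hostname is illustrative:

~~~
# List the hosts mapped to each cell. If compute-b still shows up here
# after its service was deleted, its stale host mapping is what blocks
# the re-added node from being discovered.
$ nova-manage --config-dir /var/lib/config-data/puppet-generated/nova/etc/nova \
    cell_v2 list_hosts
~~~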
Comment:

This is not an issue only in the deployed-server scenario. It can also happen when we use hostname mappings to keep the same hostname even if the internal index goes up. What we'd need is:

1) https://review.openstack.org/#/c/527560/ as part of the next OSP12 maintenance release (openstack-nova 16.1.x).
2) a doc bug to enhance our scale-down procedure [1] so that the compute is also removed from the cell.

The existing instructions read:

~~~
...
Finally, remove the node's Compute service:

(undercloud) $ source ~/stack/overcloudrc
(overcloud) $ openstack compute service list
(overcloud) $ openstack compute service delete [service-id]
~~~

Here we need to add:

~~~
Log in to one of the overcloud controllers and delete the removed host:

$ ssh heat-admin@overcloud-controller-X
$ nova-manage --config-dir /var/lib/config-data/puppet-generated/nova/etc/nova cell_v2 list_hosts
$ nova-manage --config-dir /var/lib/config-data/puppet-generated/nova/etc/nova cell_v2 delete_host --cell_uuid <Cell UUID> --host <Hostname>
~~~

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html-single/director_installation_and_usage/#sect-Removing_Compute_Nodes

Comment:

This bugzilla has been removed from the release since it has not been triaged, and needs to be reviewed for targeting another release.

Comment:

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2332
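To complete the scale-down-and-re-add cycle from the proposed procedure above: once delete_host has removed the stale mapping, a node returning with the same hostname should be picked up by normal host discovery. A hedged sketch, reusing the containerized config path from that procedure; run it from a controller after nova-compute is running on the re-added node:

~~~
# Map the re-added compute node into the cell.
$ nova-manage --config-dir /var/lib/config-data/puppet-generated/nova/etc/nova \
    cell_v2 discover_hosts --verbose

# Confirm the host mapping was recreated.
$ nova-manage --config-dir /var/lib/config-data/puppet-generated/nova/etc/nova \
    cell_v2 list_hosts
~~~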