Created attachment 1262863 [details]
undercloud sosreport

Description of problem:
I am unable to add a compute node back to the overcloud deployment after it was previously removed. I hit this issue during OSP10->11 post-upgrade verification.

Version-Release number of selected component (if applicable):

How reproducible:
1/1

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers, 3 computes, 1 ceph node
2. Upgrade the environment to OSP11
3. Upload the OSP11 images to the undercloud Glance
4. Remove one compute node from the deployment:
   openstack overcloud node delete --stack overcloud $UUID
5. Realized I had missed 'openstack baremetal configure boot' and ran it
6. Rerun the overcloud deploy command, which contains --compute-scale 3 and should add the compute node removed in step 4 back to the deployment (the full sequence is sketched at the end of this comment).

Actual results:
The compute node doesn't get provisioned because it is not able to PXE boot (the console shows that it times out).

Expected results:
The compute node gets provisioned and added back to the deployment.

Additional info:
Attaching the undercloud sosreport. This is a virtual environment; the VM is set to boot first from NICs.
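For illustration, the scale-back sequence was roughly the following (a sketch only; the environment files and exact deploy arguments are elided, and $UUID is the Ironic node UUID of the removed compute):

  # remove the compute from the overcloud stack
  openstack overcloud node delete --stack overcloud $UUID
  # reset the boot device configuration in Ironic (initially missed)
  openstack baremetal configure boot
  # rerun the original deploy command to scale computes back to 3
  openstack overcloud deploy --templates --compute-scale 3 ...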
Hi Marius,

Can you confirm that the compute you removed was compute-0?

instackenv.json:

  "name": "compute-0",
  "pm_addr": "172.16.0.1",
  "pm_password": *********,
  "pm_type": "pxe_ssh",
  "mac": ["52:54:00:e1:fd:a9"],
  "cpu": "1",
  "memory": "5806",
  "disk": "20",
  "arch": "x86_64",
  "pm_user": "stack"
},

I'm seeing an entry for that MAC in the neutron dhcp hosts file:

...
52:54:00:e1:fd:a9,host-192-168-24-16,192.168.24.16,set:4897e5bd-b4b6-432e-80c2-72f1e857fc20

and the opts file is set up correctly to PXE boot:

tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,option:server-ip-address,192.168.24.1
tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,tag:!ipxe,option:bootfile-name,undionly.kpxe
tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,tag:ipxe,option:bootfile-name,http://192.168.24.1:8088/boot.ipxe

but I don't see any entry for this MAC in the dhcp leases file, which indicates it never got assigned an IP:

cat ../../lib/neutron/dhcp/b2ceeb59-20d4-4bb4-8954-75fa364e9d68/leases
1489523792 fa:16:3e:99:63:d8 192.168.24.14 host-192-168-24-14 *
1489523792 fa:16:3e:99:0e:d4 192.168.24.5 host-192-168-24-5 *
1489523792 52:54:00:b8:91:69 192.168.24.8 host-192-168-24-8 *
1489523792 52:54:00:95:41:f9 192.168.24.12 host-192-168-24-12 *
1489523792 52:54:00:7f:2a:83 192.168.24.7 host-192-168-24-7 *
1489523792 52:54:00:78:53:fe 192.168.24.13 host-192-168-24-13 *
1489523792 52:54:00:37:ed:c0 192.168.24.15 host-192-168-24-15 *
1489523792 52:54:00:12:d9:ac 192.168.24.18 host-192-168-24-18 *

Also, can you provide the actual overcloud deploy command that you ran?

Thanks.
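For anyone retracing this, the dnsmasq files can be checked directly on the undercloud like so (the dhcp network UUID here is from this reproduction and will differ on other deployments):

  DHCP_DIR=/var/lib/neutron/dhcp/b2ceeb59-20d4-4bb4-8954-75fa364e9d68
  # host reservations, PXE boot options, and issued leases
  sudo grep 52:54:00:e1:fd:a9 $DHCP_DIR/host
  sudo grep 4897e5bd-b4b6-432e-80c2-72f1e857fc20 $DHCP_DIR/opts
  sudo grep -i 52:54:00:e1:fd:a9 $DHCP_DIR/leases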
In the neutron server logs, for the port associated with this MAC, I am seeing port binding failures; it looks like the neutron agent is down.

2017-03-14 07:09:19.807 28502 WARNING neutron.plugins.ml2.drivers.mech_agent [req-ed0305bd-6559-4e54-aca9-cce7b11c65df - - - - -] Refusing to bind port 4897e5bd-b4b6-432e-80c2-72f1e857fc20 to dead agent:<snip>
2017-03-14 07:09:19.822 28502 ERROR neutron.plugins.ml2.managers [req-ed0305bd-6559-4e54-aca9-cce7b11c65df - - - - -] Failed to bind port 4897e5bd-b4b6-432e-80c2-72f1e857fc20 on host undercloud-0.redhat.local for vnic_type normal using segments [{'segmentation_id': None, 'physical_network': u'ctlplane', 'id': u'e786eb76-dff4-4d3d-a293-14c10d7970f8', 'network_type': u'flat'}]
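The binding state of that port can also be checked from the CLI; a port that failed to bind shows binding_vif_type=binding_failed (a sketch, assuming the usual stackrc credentials on the undercloud):

  source ~/stackrc
  openstack port show 4897e5bd-b4b6-432e-80c2-72f1e857fc20 -c status -c binding_vif_type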
Also note that the dead neutron agent seems to have persisted for quite a while, most likely since before this problem occurred; see the timestamps below:

server.log:2017-03-13 16:58:30.825 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-13 16:59:07.842 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
<snip>
server.log:2017-03-14 00:17:52.056 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 00:18:29.082 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
<snip>
server.log:2017-03-14 02:23:46.870 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 02:24:23.903 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
<snip>
server.log:2017-03-14 07:10:48.014 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 07:11:25.047 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 07:12:02.080 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:

Would it be possible to restart this agent and retest?
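Something like the following on the undercloud should do it (assuming the dead agent is the local neutron-openvswitch-agent; check the agent list first to confirm which one it is):

  neutron agent-list                                 # the dead agent shows 'xxx' in the alive column
  sudo systemctl restart neutron-openvswitch-agent
  neutron agent-list                                 # confirm it reports ':-)' again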
I didn't have the environment anymore, so I tried reproducing the issue but was unable to. I'm closing this ticket for now and will reopen it if I hit it again. The agent being down sounds like a plausible cause, so I'll keep it in mind for my further tests. Thanks!