Created attachment 1262863 [details]
undercloud sosreport

Description of problem:
I am unable to add a compute node back to the overcloud deployment after it was previously removed. I hit this issue during OSP10->11 post-upgrade verification.

Version-Release number of selected component (if applicable):

How reproducible:
1/1

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers, 3 computes, 1 ceph node
2. Upgrade the environment to OSP11
3. Upload the OSP11 images to the undercloud Glance
4. Remove one compute node from the deployment:
   openstack overcloud node delete --stack overcloud $UUID
5. Realized I had missed 'openstack baremetal configure boot' and ran it
6. Rerun the overcloud deploy command, which contains --compute-scale 3 and should add the compute node removed in step 4 back to the deployment (the full sequence is sketched at the end of this comment).

Actual results:
The compute node doesn't get provisioned because it is not able to PXE boot (the console shows that it times out).

Expected results:
The compute node gets provisioned and added back to the deployment.

Additional info:
Attaching the undercloud sosreport. This is a virtual environment; the VM is set to boot first from NICs.
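For illustration, the scale-back sequence was roughly the following (a sketch only; the environment files and exact deploy arguments are elided, and $UUID is the Ironic node UUID of the removed compute):

  # remove the compute from the overcloud stack
  openstack overcloud node delete --stack overcloud $UUID
  # reset the boot device configuration in Ironic (initially missed)
  openstack baremetal configure boot
  # rerun the original deploy command to scale computes back to 3
  openstack overcloud deploy --templates --compute-scale 3 ...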
Hi Marius,

Can you confirm that the compute you removed was compute-0?

instackenv.json:

  "name": "compute-0",
  "pm_addr": "172.16.0.1",
  "pm_password": *********,
  "pm_type": "pxe_ssh",
  "mac": ["52:54:00:e1:fd:a9"],
  "cpu": "1",
  "memory": "5806",
  "disk": "20",
  "arch": "x86_64",
  "pm_user": "stack"
},

I'm seeing an entry for that MAC in the neutron dhcp hosts file:

...
52:54:00:e1:fd:a9,host-192-168-24-16,192.168.24.16,set:4897e5bd-b4b6-432e-80c2-72f1e857fc20

and the opts file is set up correctly to PXE boot:

tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,option:server-ip-address,192.168.24.1
tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,tag:!ipxe,option:bootfile-name,undionly.kpxe
tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,tag:ipxe,option:bootfile-name,http://192.168.24.1:8088/boot.ipxe

but I don't see any entry for this MAC in the dhcp leases file, which indicates it never got assigned an IP:

cat ../../lib/neutron/dhcp/b2ceeb59-20d4-4bb4-8954-75fa364e9d68/leases
1489523792 fa:16:3e:99:63:d8 192.168.24.14 host-192-168-24-14 *
1489523792 fa:16:3e:99:0e:d4 192.168.24.5 host-192-168-24-5 *
1489523792 52:54:00:b8:91:69 192.168.24.8 host-192-168-24-8 *
1489523792 52:54:00:95:41:f9 192.168.24.12 host-192-168-24-12 *
1489523792 52:54:00:7f:2a:83 192.168.24.7 host-192-168-24-7 *
1489523792 52:54:00:78:53:fe 192.168.24.13 host-192-168-24-13 *
1489523792 52:54:00:37:ed:c0 192.168.24.15 host-192-168-24-15 *
1489523792 52:54:00:12:d9:ac 192.168.24.18 host-192-168-24-18 *

Also, can you provide the actual overcloud deploy command that you ran?

Thanks.
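For anyone retracing this, the dnsmasq files can be checked directly on the undercloud like so (the dhcp network UUID here is from this reproduction and will differ on other deployments):

  DHCP_DIR=/var/lib/neutron/dhcp/b2ceeb59-20d4-4bb4-8954-75fa364e9d68
  # host reservations, PXE boot options, and issued leases
  sudo grep 52:54:00:e1:fd:a9 $DHCP_DIR/host
  sudo grep 4897e5bd-b4b6-432e-80c2-72f1e857fc20 $DHCP_DIR/opts
  sudo grep -i 52:54:00:e1:fd:a9 $DHCP_DIR/leases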
In the neutron server logs, for the port associated with this MAC, I am seeing port binding failures; it looks like the neutron agent is down.

2017-03-14 07:09:19.807 28502 WARNING neutron.plugins.ml2.drivers.mech_agent [req-ed0305bd-6559-4e54-aca9-cce7b11c65df - - - - -] Refusing to bind port 4897e5bd-b4b6-432e-80c2-72f1e857fc20 to dead agent:<snip>
2017-03-14 07:09:19.822 28502 ERROR neutron.plugins.ml2.managers [req-ed0305bd-6559-4e54-aca9-cce7b11c65df - - - - -] Failed to bind port 4897e5bd-b4b6-432e-80c2-72f1e857fc20 on host undercloud-0.redhat.local for vnic_type normal using segments [{'segmentation_id': None, 'physical_network': u'ctlplane', 'id': u'e786eb76-dff4-4d3d-a293-14c10d7970f8', 'network_type': u'flat'}]
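The binding state of that port can also be checked from the CLI; a port that failed to bind shows binding_vif_type=binding_failed (a sketch, assuming the usual stackrc credentials on the undercloud):

  source ~/stackrc
  openstack port show 4897e5bd-b4b6-432e-80c2-72f1e857fc20 -c status -c binding_vif_type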
Also note that the dead neutron agent seems to have persisted for quite a while, most likely since before this problem occurred; see the timestamps below:

server.log:2017-03-13 16:58:30.825 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-13 16:59:07.842 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
<snip>
server.log:2017-03-14 00:17:52.056 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 00:18:29.082 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
<snip>
server.log:2017-03-14 02:23:46.870 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 02:24:23.903 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
<snip>
server.log:2017-03-14 07:10:48.014 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 07:11:25.047 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 07:12:02.080 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:

Would it be possible to restart this agent and retest?
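Something like the following on the undercloud should do it (assuming the dead agent is the local neutron-openvswitch-agent; check the agent list first to confirm which one it is):

  neutron agent-list                                 # the dead agent shows 'xxx' in the alive column
  sudo systemctl restart neutron-openvswitch-agent
  neutron agent-list                                 # confirm it reports ':-)' again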
I didn't have the environment anymore, so I tried reproducing the issue but was unable to. I'm closing this ticket for now and will reopen it if I hit it again. The agent being down sounds like a plausible cause, so I'll keep it in mind for my further tests. Thanks!