Bug 1432028
Summary: | Unable to add compute node back to deployment after it was previously deleted | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea>
Component: | rhosp-director | Assignee: | Angus Thomas <athomas>
Status: | CLOSED NOTABUG | QA Contact: | Amit Ugol <augol>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 11.0 (Ocata) | CC: | aschultz, bfournie, dbecker, mburns, mcornea, morazi, rhel-osp-director-maint
Target Milestone: | ga | |
Target Release: | 11.0 (Ocata) | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-03-22 21:34:58 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |

Hi Marius,

Can you confirm that the compute you removed was compute-0?

instackenv.json:

    "name": "compute-0",
    "pm_addr": "172.16.0.1",
    "pm_password": *********,
    "pm_type": "pxe_ssh",
    "mac": ["52:54:00:e1:fd:a9"],
    "cpu": "1",
    "memory": "5806",
    "disk": "20",
    "arch": "x86_64",
    "pm_user": "stack"
    },

I'm seeing an entry for that mac in the neutron dhcp hosts file:

    ...
    52:54:00:e1:fd:a9,host-192-168-24-16,192.168.24.16,set:4897e5bd-b4b6-432e-80c2-72f1e857fc20

and the opts file is set up correctly to pxe boot:

    tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,option:server-ip-address,192.168.24.1
    tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,tag:!ipxe,option:bootfile-name,undionly.kpxe
    tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,tag:ipxe,option:bootfile-name,http://192.168.24.1:8088/boot.ipxe

but I don't see any entry for this mac in the dhcp leases file, indicating it didn't get assigned an IP:

    cat ../../lib/neutron/dhcp/b2ceeb59-20d4-4bb4-8954-75fa364e9d68/leases
    1489523792 fa:16:3e:99:63:d8 192.168.24.14 host-192-168-24-14 *
    1489523792 fa:16:3e:99:0e:d4 192.168.24.5 host-192-168-24-5 *
    1489523792 52:54:00:b8:91:69 192.168.24.8 host-192-168-24-8 *
    1489523792 52:54:00:95:41:f9 192.168.24.12 host-192-168-24-12 *
    1489523792 52:54:00:7f:2a:83 192.168.24.7 host-192-168-24-7 *
    1489523792 52:54:00:78:53:fe 192.168.24.13 host-192-168-24-13 *
    1489523792 52:54:00:37:ed:c0 192.168.24.15 host-192-168-24-15 *
    1489523792 52:54:00:12:d9:ac 192.168.24.18 host-192-168-24-18 *

Also, can you provide the actual overcloud deploy command that you ran?

Thanks.

In the neutron server logs, for the port associated with this mac, I am seeing port binding failures. It looks like the neutron agent is down.

    2017-03-14 07:09:19.807 28502 WARNING neutron.plugins.ml2.drivers.mech_agent [req-ed0305bd-6559-4e54-aca9-cce7b11c65df - - - - -] Refusing to bind port 4897e5bd-b4b6-432e-80c2-72f1e857fc20 to dead agent:<snip>
    2017-03-14 07:09:19.822 28502 ERROR neutron.plugins.ml2.managers [req-ed0305bd-6559-4e54-aca9-cce7b11c65df - - - - -] Failed to bind port 4897e5bd-b4b6-432e-80c2-72f1e857fc20 on host undercloud-0.redhat.local for vnic_type normal using segments [{'segmentation_id': None, 'physical_network': u'ctlplane', 'id': u'e786eb76-dff4-4d3d-a293-14c10d7970f8', 'network_type': u'flat'}]

Also note that the dead neutron agent seems to have persisted for quite a while, most likely since before this problem occurred. See the timestamps below:

    server.log:2017-03-13 16:58:30.825 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
    server.log:2017-03-13 16:59:07.842 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
    <snip>
    server.log:2017-03-14 00:17:52.056 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
    server.log:2017-03-14 00:18:29.082 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
    <snip>
    server.log:2017-03-14 02:23:46.870 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
    server.log:2017-03-14 02:24:23.903 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
    <snip>
    server.log:2017-03-14 07:10:48.014 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
    server.log:2017-03-14 07:11:25.047 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
    server.log:2017-03-14 07:12:02.080 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:

Would it be possible to restart this agent and retest?

I no longer had the environment, so I tried to reproduce the issue but was unable to. I'm closing this ticket for now and I'll reopen it if I hit it again. The agent being down sounds like a plausible cause, so I'll keep it in mind for my further tests.

Thanks!
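
The timeline argument in the comments (the agent was reported dead well before the failed port binding at 2017-03-14 07:09:19) can be checked mechanically. The following is a minimal sketch, not part of the original analysis: the function name `dead_agent_window` is hypothetical, and it assumes log lines in the format quoted in the comments, with or without the `server.log:` prefix that grep adds.

```python
import re
from datetime import datetime

# Matches the timestamp and the dead/total counts of a neutron
# "Agent healthcheck" warning, as quoted in the comments.
HEALTHCHECK = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ .*"
    r"Agent healthcheck: found (\d+) dead agents out of (\d+)"
)


def dead_agent_window(lines):
    """Return (first_seen, last_seen, count) for warnings that report
    at least one dead agent, or None if there are no such warnings."""
    stamps = []
    for line in lines:
        match = HEALTHCHECK.search(line)
        if match and int(match.group(2)) > 0:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    if not stamps:
        return None
    return min(stamps), max(stamps), len(stamps)
```

If `first_seen` predates the node delete and redeploy, the dead agent is unlikely to be a side effect of those operations, which is the point made in the comments. The sketch only reports when warnings were logged; it does not identify which agent was dead.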
Created attachment 1262863: undercloud sosreport

Description of problem:
I am unable to add a compute node back to the overcloud deployment after it was previously removed. I hit this issue during OSP10->11 post-upgrade verification.

Version-Release number of selected component (if applicable):

How reproducible:
1/1

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers, 3 computes, 1 ceph node
2. Upgrade environment to OSP11
3. Upload OSP11 images to undercloud Glance
4. Remove one compute node from deployment:
   openstack overcloud node delete --stack overcloud $UUID
5. I realized that I missed 'openstack baremetal configure boot' and ran it
6. Rerun the overcloud deploy command that contains --compute-scale 3 and should add the compute node removed in step 4 back to the deployment.

Actual results:
The compute node doesn't get provisioned because it's not able to PXE boot (the console shows that it times out).

Expected results:
The compute node gets provisioned and added back to the deployment.

Additional info:
Attaching the undercloud sosreport. This is a virtual environment; the VM is set to boot first from NICs.
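
When a node times out during PXE boot as described here, one quick check is the one performed in the comments: compare the MAC reservations in the neutron dhcp hosts file against the leases file. The sketch below is illustrative only. The function name `macs_without_lease` is hypothetical, and it assumes the two line formats shown in the comments (`mac,hostname,ip,set:<tag>` for reservations and `expiry mac ip hostname client-id` for leases). It takes file contents as strings so that no file paths need to be assumed.

```python
def macs_without_lease(host_text, leases_text):
    """Return {mac: reserved_ip} for MACs that have a reservation in the
    hosts file but no entry in the leases file."""
    reserved = {}
    for line in host_text.splitlines():
        fields = line.strip().split(",")
        # Expected reservation format: mac,hostname,ip,set:<tag>
        if len(fields) >= 3:
            reserved[fields[0].lower()] = fields[2]

    leased = set()
    for line in leases_text.splitlines():
        fields = line.split()
        # Expected lease format: expiry mac ip hostname client-id
        if len(fields) >= 3:
            leased.add(fields[1].lower())

    return {mac: ip for mac, ip in reserved.items() if mac not in leased}
```

A MAC returned by this function has a reservation but never completed a DHCP exchange. This is a hint rather than a diagnosis: in this report the underlying suspect was a dead neutron agent that prevented the port from being bound.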