Bug 1432028

Summary: Unable to add compute node back to deployment after it was previously deleted
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: rhosp-director Assignee: Angus Thomas <athomas>
Status: CLOSED NOTABUG QA Contact: Amit Ugol <augol>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 11.0 (Ocata) CC: aschultz, bfournie, dbecker, mburns, mcornea, morazi, rhel-osp-director-maint
Target Milestone: ga   
Target Release: 11.0 (Ocata)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-03-22 21:34:58 UTC Type: Bug
Attachments: undercloud sosreport

Description Marius Cornea 2017-03-14 11:17:56 UTC
Created attachment 1262863 [details]
undercloud sosreport

Description of problem:

I am unable to add a compute node back to the overcloud deployment after it was previously removed. I hit this issue during OSP10->11 post-upgrade verification.

Version-Release number of selected component (if applicable):


How reproducible:
1/1

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers, 3 computes, 1 ceph node
2. Upgrade environment to OSP11
3. Upload OSP11 images to undercloud Glance
4. Remove one compute node from deployment:
openstack overcloud node delete --stack overcloud $UUID
5. I realized that I had missed 'openstack baremetal configure boot', so I ran it
6. Rerun the overcloud deploy command containing --compute-scale 3, which should add the compute node removed in step 4 back to the deployment.

Actual results:
The compute node doesn't get provisioned because it's not able to PXE boot (the console shows that it times out).

Expected results:
The compute node gets provisioned and added back to the deployment.

Additional info:
Attaching the undercloud sosreport.
This is a virtual environment, the VM is set to boot first from NICs.

Comment 1 Bob Fournier 2017-03-22 14:06:30 UTC
Hi Marius,

Can you confirm that the compute you removed was compute-0?

instackenv.json:
      "name": "compute-0",
      "pm_addr": "172.16.0.1",
      "pm_password": *********,
      "pm_type": "pxe_ssh",
      "mac": ["52:54:00:e1:fd:a9"],
      "cpu": "1",
      "memory": "5806",
      "disk": "20",
      "arch": "x86_64",
      "pm_user": "stack"
    },

I'm seeing an entry for that mac in the neutron dhcp hosts file:
...
52:54:00:e1:fd:a9,host-192-168-24-16,192.168.24.16,set:4897e5bd-b4b6-432e-80c2-72f1e857fc20

and the opts file is set up correctly to pxe boot:
tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,option:server-ip-address,192.168.24.1
tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,tag:!ipxe,option:bootfile-name,undionly.kpxe
tag:4897e5bd-b4b6-432e-80c2-72f1e857fc20,tag:ipxe,option:bootfile-name,http://192.168.24.1:8088/boot.ipxe

but I don't see any entry for this mac in the dhcp leases file, indicating it didn't get assigned an IP:
cat ../../lib/neutron/dhcp/b2ceeb59-20d4-4bb4-8954-75fa364e9d68/leases 
1489523792 fa:16:3e:99:63:d8 192.168.24.14 host-192-168-24-14 *
1489523792 fa:16:3e:99:0e:d4 192.168.24.5 host-192-168-24-5 *
1489523792 52:54:00:b8:91:69 192.168.24.8 host-192-168-24-8 *
1489523792 52:54:00:95:41:f9 192.168.24.12 host-192-168-24-12 *
1489523792 52:54:00:7f:2a:83 192.168.24.7 host-192-168-24-7 *
1489523792 52:54:00:78:53:fe 192.168.24.13 host-192-168-24-13 *
1489523792 52:54:00:37:ed:c0 192.168.24.15 host-192-168-24-15 *
1489523792 52:54:00:12:d9:ac 192.168.24.18 host-192-168-24-18 *
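For reference, the absence of that MAC can be checked mechanically. The sketch below is a hypothetical triage helper (not part of any OpenStack tooling) that parses dnsmasq-style lease lines of the form `<expiry> <mac> <ip> <hostname> <client-id>`:

```python
def macs_with_lease(leases_text):
    """Collect MAC addresses from dnsmasq lease lines:
    '<expiry> <mac> <ip> <hostname> <client-id>'."""
    macs = set()
    for line in leases_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            macs.add(fields[1].lower())
    return macs

# Two lines taken verbatim from the leases file above.
leases = (
    "1489523792 fa:16:3e:99:63:d8 192.168.24.14 host-192-168-24-14 *\n"
    "1489523792 52:54:00:b8:91:69 192.168.24.8 host-192-168-24-8 *\n"
)
# compute-0's MAC has no lease, so this prints False.
print("52:54:00:e1:fd:a9" in macs_with_lease(leases))
```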

Also, can you provide the actual overcloud deploy command that you ran?

Thanks.

Comment 2 Bob Fournier 2017-03-22 15:43:24 UTC
In the neutron server logs, for the port associated with this mac I am seeing port binding failures; it looks like the neutron agent is down.

2017-03-14 07:09:19.807 28502 WARNING neutron.plugins.ml2.drivers.mech_agent [req-ed0305bd-6559-4e54-aca9-cce7b11c65df - - - - -] Refusing to bind port 4897e5bd-b4b6-432e-80c2-72f1e857fc20 to dead agent:<snip>

2017-03-14 07:09:19.822 28502 ERROR neutron.plugins.ml2.managers [req-ed0305bd-6559-4e54-aca9-cce7b11c65df - - - - -] Failed to bind port 4897e5bd-b4b6-432e-80c2-72f1e857fc20 on host undercloud-0.redhat.local for vnic_type normal using segments [{'segmentation_id': None, 'physical_network': u'ctlplane', 'id': u'e786eb76-dff4-4d3d-a293-14c10d7970f8', 'network_type': u'flat'}]
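The affected port UUID can be pulled out of such error lines with a simple pattern match. This is a hypothetical log-grepping snippet, assuming only the "Failed to bind port <uuid>" message format shown above:

```python
import re

def failed_bind_ports(log_text):
    """Extract port UUIDs from neutron ml2 'Failed to bind port' errors."""
    return re.findall(r"Failed to bind port ([0-9a-f-]{36})", log_text)

# Abbreviated from the error line above.
sample = ("2017-03-14 07:09:19.822 28502 ERROR neutron.plugins.ml2.managers "
          "Failed to bind port 4897e5bd-b4b6-432e-80c2-72f1e857fc20 "
          "on host undercloud-0.redhat.local for vnic_type normal")
print(failed_bind_ports(sample))  # ['4897e5bd-b4b6-432e-80c2-72f1e857fc20']
```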

Comment 3 Bob Fournier 2017-03-22 17:59:09 UTC
Also note that the dead neutron agent seems to have persisted for quite a while, most likely since before this problem occurred; see the timestamps below:

server.log:2017-03-13 16:58:30.825 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-13 16:59:07.842 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
<snip>

server.log:2017-03-14 00:17:52.056 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 00:18:29.082 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
<snip>

server.log:2017-03-14 02:23:46.870 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 02:24:23.903 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
<snip>

server.log:2017-03-14 07:10:48.014 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 07:11:25.047 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:
server.log:2017-03-14 07:12:02.080 28506 WARNING neutron.db.agents_db [req-32e6866e-9107-4c10-8ad8-a31c78d6c305 - - - - -] Agent healthcheck: found 1 dead agents out of 2:

Would it be possible to restart this agent and retest?
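To quantify how long the agent was down, the healthcheck warnings above can be scanned for their first and last timestamps. A minimal sketch, assuming the grep-style `server.log:` prefix and log timestamp format shown above:

```python
from datetime import datetime

def dead_agent_span(log_lines):
    """Return (first, last) timestamps of 'dead agents' healthcheck warnings."""
    stamps = []
    for line in log_lines:
        if "dead agents" in line:
            # Strip the 'server.log:' prefix left by grep, keep 'YYYY-mm-dd HH:MM:SS'.
            text = line.split("server.log:", 1)[-1]
            stamps.append(datetime.strptime(text[:19], "%Y-%m-%d %H:%M:%S"))
    return (min(stamps), max(stamps)) if stamps else None

# First and last warnings from the excerpt above.
lines = [
    "server.log:2017-03-13 16:58:30.825 28506 WARNING neutron.db.agents_db "
    "Agent healthcheck: found 1 dead agents out of 2:",
    "server.log:2017-03-14 07:12:02.080 28506 WARNING neutron.db.agents_db "
    "Agent healthcheck: found 1 dead agents out of 2:",
]
first, last = dead_agent_span(lines)
print(last - first)  # prints 14:13:32, i.e. the agent was flagged dead for over 14 hours
```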

Comment 4 Marius Cornea 2017-03-22 21:34:58 UTC
I didn't have the environment anymore, so I tried reproducing the issue but was unable to. I'm closing this ticket for now and will reopen it if I hit it again.

The agent being down sounds like a plausible cause, so I'll keep it in mind for my further tests. Thanks!