Bug 1463850

Summary: Director attempts to delete controller ports in OpenStack Platform on stack update - need to revert heat to a good state
Product: Red Hat OpenStack
Reporter: Andreas Karis <akaris>
Component: openstack-tripleo
Assignee: Zane Bitter <zbitter>
Status: CLOSED NOTABUG
QA Contact: Arik Chernetsky <achernet>
Severity: high
Priority: high
Version: 10.0 (Newton)
CC: akaris, aschultz, atelang, emacchi, jslagle, mandreou, mburns, mcornea, rhel-osp-director-maint, sathlang, therve, zbitter
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-07-28 17:47:22 UTC
Type: Bug

Description Andreas Karis 2017-06-21 23:08:30 UTC
Description of problem:
Another attempt at Director node replacement in OpenStack Platform.


Hi,

We updated the overcloud images, ramdisk, and kernel.

I then ran `openstack baremetal configure boot`
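For reference, a quick way to confirm that the nodes picked up the new deploy kernel/ramdisk afterwards is to compare each node's driver_info with the bm-deploy-* image IDs (a sketch using the stock Ironic CLI; the loop itself is an assumption, not a step that was run here):
~~~
# Compare deploy_kernel/deploy_ramdisk on each node against the
# bm-deploy-kernel / bm-deploy-ramdisk IDs from `openstack image list`.
for node in $(openstack baremetal node list -f value -c UUID); do
    openstack baremetal node show $node -f value -c driver_info
done
~~~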

Also, NetworkDeploymentActions: [ 'CREATE', 'UPDATE' ] was set. I removed this setting, but the failure below still shows up:

The failed SRIOV nodes (role SRIOV) were removed with `nova delete <UUID>`. As the next update failed, I then set the Count of the ComputeSRIOV role to 0.
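For reference, the per-role Count values that Heat currently has recorded can be double-checked from the undercloud before the next update (a sketch; `openstack stack environment show` is a standard heatclient command):
~~~
# Show the *Count parameters stored in the overcloud stack's environment.
openstack stack environment show overcloud | grep -i count
~~~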

On the next update, it is now trying to remove the Controller ports: 
~~~
2017-06-21 23:39:51Z [overcloud-Controller-oviemczckrbn.1]: UPDATE_FAILED  resources[1]: InterfaceDetachFailed: resources.Controller: Failed to detach interface (cc82a399-098d-4b53-ae51-09610b4adf00) from server (36504793-379e-46c5-a813-0a1e86766eb4)
2017-06-21 23:39:51Z [overcloud-Controller-oviemczckrbn.0]: UPDATE_FAILED  UPDATE aborted
2017-06-21 23:39:51Z [overcloud-Compute-jvx6cuogq7ux-8-nggymbisun4g.ExternalPort]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:51Z [overcloud-Compute-jvx6cuogq7ux-9-u227mqmdw66a.ExternalPort]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:51Z [overcloud-Compute-jvx6cuogq7ux-6-bl6he54mpubv.NodeUserData]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:51Z [overcloud-Compute-jvx6cuogq7ux-7-rhso426thgnd.ManagementPort]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:51Z [overcloud-Compute-jvx6cuogq7ux-14-x5pdopbwtig3]: UPDATE_IN_PROGRESS  Stack UPDATE started
2017-06-21 23:39:51Z [overcloud-Compute-jvx6cuogq7ux-21-3pb7qxmwepcq.StorageMgmtPort]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:51Z [overcloud-Controller-oviemczckrbn.2]: UPDATE_FAILED  UPDATE aborted
2017-06-21 23:39:51Z [overcloud-Controller-oviemczckrbn]: UPDATE_FAILED  resources[1]: InterfaceDetachFailed: resources.Controller: Failed to detach interface (cc82a399-098d-4b53-ae51-09610b4adf00) from server (36504793-379e-46c5-a813-0a1e86766eb4)
2017-06-21 23:39:51Z [overcloud-Compute-jvx6cuogq7ux.5]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:51Z [overcloud-Compute-jvx6cuogq7ux-3-bv7hy2z57t4v.NodeAdminUserData]: UPDATE_COMPLETE  state changed
2017-06-21 23:39:51Z [overcloud-Compute-jvx6cuogq7ux-3-bv7hy2z57t4v.NodeUserData]: UPDATE_COMPLETE  state changed
2017-06-21 23:39:51Z [overcloud-Compute-jvx6cuogq7ux-6-bl6he54mpubv.NodeAdminUserData]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:51Z [overcloud-Compute-jvx6cuogq7ux-3-bv7hy2z57t4v.UpdateConfig]: UPDATE_COMPLETE  state changed
2017-06-21 23:39:52Z [Controller]: UPDATE_FAILED  resources.Controller: resources[1]: InterfaceDetachFailed: resources.Controller: Failed to detach interface (cc82a399-098d-4b53-ae51-09610b4adf00) from server (36504793-379e-46c5-a813-0a1e86766eb4)
2017-06-21 23:39:52Z [overcloud-Controller-oviemczckrbn-0-ijxetktf4zo3.NodeTLSCAData]: UPDATE_FAILED  UPDATE aborted
2017-06-21 23:39:52Z [overcloud-Controller-oviemczckrbn-0-ijxetktf4zo3]: UPDATE_FAILED  Operation cancelled
2017-06-21 23:39:52Z [Compute]: UPDATE_FAILED  UPDATE aborted
2017-06-21 23:39:52Z [overcloud-Controller-oviemczckrbn-2-3xn6iqvzukaj.NetworkConfig]: UPDATE_FAILED  UPDATE aborted
2017-06-21 23:39:52Z [overcloud-Compute-jvx6cuogq7ux-7-rhso426thgnd.InternalApiPort]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:52Z [overcloud-Compute-jvx6cuogq7ux-9-u227mqmdw66a.StoragePort]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:52Z [overcloud-Compute-jvx6cuogq7ux-10-x7a2geopop32.UpdateConfig]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:52Z [overcloud-Compute-jvx6cuogq7ux-12-bomemxsrmofn.ExternalPort]: UPDATE_COMPLETE  state changed
2017-06-21 23:39:52Z [overcloud-Compute-jvx6cuogq7ux-10-x7a2geopop32.NodeUserData]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:52Z [overcloud-Compute-jvx6cuogq7ux-10-x7a2geopop32.NodeAdminUserData]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:52Z [overcloud-Compute-jvx6cuogq7ux-6-bl6he54mpubv.NodeUserData]: UPDATE_COMPLETE  state changed
2017-06-21 23:39:53Z [overcloud-Controller-oviemczckrbn-2-3xn6iqvzukaj.NetIpMap]: UPDATE_FAILED  UPDATE aborted
2017-06-21 23:39:53Z [overcloud-Controller-oviemczckrbn-2-3xn6iqvzukaj]: UPDATE_FAILED  Operation cancelled
2017-06-21 23:39:53Z [overcloud-Compute-jvx6cuogq7ux-6-bl6he54mpubv.UpdateConfig]: UPDATE_COMPLETE  state changed
2017-06-21 23:39:53Z [overcloud-Compute-jvx6cuogq7ux-3-bv7hy2z57t4v.StoragePort]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:53Z [overcloud-Compute-jvx6cuogq7ux-5-feu4bv7qh2wj]: UPDATE_IN_PROGRESS  Stack UPDATE started
2017-06-21 23:39:53Z [overcloud-Compute-jvx6cuogq7ux-6-bl6he54mpubv.NodeAdminUserData]: UPDATE_COMPLETE  state changed
2017-06-21 23:39:53Z [overcloud-Compute-jvx6cuogq7ux.0]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:53Z [overcloud-Compute-jvx6cuogq7ux-12-bomemxsrmofn.InternalApiPort]: UPDATE_COMPLETE  state changed
2017-06-21 23:39:53Z [overcloud-Compute-jvx6cuogq7ux-7-rhso426thgnd.TenantPort]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:53Z [overcloud-Compute-jvx6cuogq7ux-11-p4nkmuwznsrv.TenantPort]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:53Z [overcloud]: UPDATE_FAILED  resources.Controller: resources[1]: InterfaceDetachFailed: resources.Controller: Failed to detach interface (cc82a399-098d-4b53-ae51-09610b4adf00) from server (36504793-379e-46c5-a813-0a1e86766eb4)
2017-06-21 23:39:53Z [overcloud-Compute-jvx6cuogq7ux-21-3pb7qxmwepcq.ManagementPort]: UPDATE_COMPLETE  state changed
2017-06-21 23:39:53Z [overcloud-Compute-jvx6cuogq7ux-3-bv7hy2z57t4v.InternalApiPort]: UPDATE_IN_PROGRESS  state changed
2017-06-21 23:39:53Z [overcloud-Compute-jvx6cuogq7ux-12-bomemxsrmofn.ManagementPort]: UPDATE_COMPLETE  state changed

 Stack overcloud UPDATE_FAILED

Heat Stack update failed.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 387, in run_subcommand
~~~

This means OSP 10 is trying to detach controller interfaces, i.e. it is actually trying to replace the controller nodes, AFAICT.
~~~
[stack@director ~]$ openstack image list
+--------------------------------------+----------------------------------------+--------+
| ID                                   | Name                                   | Status |
+--------------------------------------+----------------------------------------+--------+
| 41b1cb02-364e-4fc8-a711-7fd332ed2a64 | bm-deploy-ramdisk                      | active |
| 59f274b1-2078-4df5-8067-92bc0a56a2c3 | bm-deploy-kernel                       | active |
| b0c5905b-a2bd-4a79-8e84-266c345dbb9f | overcloud-full                         | active |
| 54aee3e7-d233-4c80-998c-fe785a5b79e5 | overcloud-full-initrd                  | active |
| 34f30d11-0467-417a-8c3b-c59db52ee2dc | overcloud-full-vmlinuz                 | active |
| 44a167ef-4012-445d-957b-5979ea5b5403 | overcloud-full_20170620T181347         | active |
| b7a98fe8-e9a2-411a-a71d-f8b84e7b7371 | bm-deploy-ramdisk_20161219T212128      | active |
| 17edc735-7cdb-4741-b750-a4511a0fea6b | bm-deploy-kernel_20161219T212127       | active |
| 41031902-673a-42bf-9e77-335bb499b7b0 | overcloud-full_20161219T212113         | active |
| d577bed9-8bef-42b5-a12a-8cfefb9c3b68 | overcloud-full-initrd_20161219T212112  | active |
| e75e5a97-768c-461c-a651-2d99713a7b7a | overcloud-full-vmlinuz_20161219T212110 | active |
+--------------------------------------+----------------------------------------+--------+
~~~

The above looks a lot like https://bugzilla.redhat.com/show_bug.cgi?id=1385190
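One way to see ahead of time which resources Heat would replace on the next update is a dry-run preview against the existing template and parameters (a sketch; --existing and --dry-run are standard heatclient options and nothing is applied):
~~~
# Preview the pending update without applying it; resources reported as
# replaced are the ones Heat would rebuild.
openstack stack update overcloud --existing --dry-run
~~~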


Tried to update all packages on the director, followed by a restart of all OpenStack and Neutron services:
~~~
yum update -y
systemctl list-units | egrep 'openstack|neutron' | awk '{print $1}' | xargs -I {} systemctl restart {}
~~~

Followed by another stack update.
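In TripleO, "another stack update" means re-running the original deploy command from the undercloud; a sketch with placeholder environment files, since the exact -e files used in this deployment are not shown here:
~~~
# Placeholders only: substitute the exact templates and -e environment files
# from the original deployment; updating with a different set can itself
# cause unwanted resource replacement.
openstack overcloud deploy --templates \
  -e /home/stack/templates/<environment-files-used-originally>.yaml
~~~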


Comment 3 Sofer Athlan-Guyot 2017-06-22 10:51:34 UTC
Hi,

Adding some errors discovered in the logs:

So from the log we can see a failure on controller-2 in Step 4 of the deployment:

    ./sosreport-20170620-225759/overcloud-controller-2.localdomain/sos_commands/pacemaker/crm_report/overcloud-controller-2.localdomain/journal.log:Jun 20 19:38:00
     Error: Duplicate declaration: Package[python-memcache] is already declared; cannot redeclare at /etc/puppet/modules/oslo/manifests/cache.pp:159 on node overcloud-controller-2.localdomain

This looks like https://bugzilla.redhat.com/show_bug.cgi?id=1392583, so it may be that puppet-oslo and puppet-horizon are not up to date there (from that bz it looks like you need puppet-oslo-9.4.0-2.el7ost and puppet-horizon-9.4.1-2.el7ost).

It doesn't happen on controller-0 or controller-1.
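A quick way to confirm the installed module versions on the affected controller (a sketch; the package names come from the bug referenced above, and the IP is a placeholder):

    # Run from the undercloud against overcloud-controller-2's ctlplane IP.
    ssh heat-admin@<controller-2-ip> 'rpm -q puppet-oslo puppet-horizon'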

It looks like something is also happening in Step 4 on one of the compute nodes.  Looking from the undercloud at:

   # COMPUTE_NODES is a placeholder for the list of compute node ctlplane IPs.
   for i in $COMPUTE_NODES; do
       ssh heat-admin@$i 'journalctl -u os-collect-config | egrep "deploy_status_code[^0-9]+[1-9]"'
   done

might help identify the problem.
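For completeness, COMPUTE_NODES can be populated from the undercloud with something like the following (a sketch; assumes the default compute hostname pattern and ctlplane addressing):

   source ~/stackrc
   # Collect the ctlplane IPs of all compute nodes.
   COMPUTE_NODES=$(openstack server list -f value -c Name -c Networks \
                   | grep -i compute | sed 's/.*ctlplane=//')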

Comment 15 Zane Bitter 2017-06-27 20:31:29 UTC
The only failed server resource I see is:

| Controller | 36504793-379e-46c5-a813-0a1e86766eb4 | OS::TripleO::Server | 2017-06-21T23:18:32Z | overcloud-Controller-oviemczckrbn-1-hcyy7ivmh4mo


It would be interesting to see the output of:

  openstack stack event list overcloud-Controller-oviemczckrbn-1-hcyy7ivmh4mo

to see the history of that resource, and then do

  openstack stack event show overcloud-Controller-oviemczckrbn-1-hcyy7ivmh4mo Controller <event_id>

on the first event showing a failure, so we can see what caused it.
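To spot the first failure quickly, the same event list can simply be filtered for failed events (a sketch):

  openstack stack event list overcloud-Controller-oviemczckrbn-1-hcyy7ivmh4mo | grep -i failed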

If the port detach thing wasn't the initial failure, and the conditions causing the initial failure have gone away, then it's likely that we can get things up and running again by marking the controller server COMPLETE. If the initial failure was the port detach then it's a mystery why we're trying to replace the server (although the property values in the events could give us a clue), and there's every reason to think it will happen again.

Comment 16 Andreas Karis 2017-06-27 21:35:56 UTC
Hi,

Sorry, I'm on another call right now. I forwarded all your requests to the customer. I'll forward the output as soon as I have it, and then we could go for setting the resource to COMPLETE and see if that resolves it, I guess.

Can you tell me how?

- Andreas
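For reference, marking the failed server resource COMPLETE is typically done by resetting its state directly in the Heat database on the undercloud (a sketch only: the table and column names assume the Newton heat schema, the WHERE clause must be narrowed to the actual failed resource identified by the event commands above, and a database backup should be taken first):

  # Back up the heat database, then reset the failed Controller server
  # resource so the next stack update does not try to replace it.
  sudo mysqldump heat > /root/heat-backup.sql
  sudo mysql heat -e "UPDATE resource SET status='COMPLETE' WHERE name='Controller' AND status='FAILED';"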

Comment 35 Red Hat Bugzilla 2023-09-14 03:59:37 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days