Bug 1549571
| Summary: | openstack stack delete overcloud fails | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Mike Abrams <mabrams> |
| Component: | rhosp-director | Assignee: | Bob Fournier <bfournie> |
| Status: | CLOSED DUPLICATE | QA Contact: | Gurenko Alex <agurenko> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | ||
| Version: | 13.0 (Queens) | CC: | agurenko, aschultz, athomas, bfournie, dbecker, dtantsur, mabrams, mburns, morazi, ohochman, racedoro, rhel-osp-director-maint |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-05-11 20:07:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Mike Abrams
2018-02-27 12:38:28 UTC
Looks like a failure with power management. Is IPMI working correctly to this node? Are you able to run ipmitool against it? Is the BMC updated with the latest F/W?

Any update on this?

Closing this for now, please update with the requested info if you are able to duplicate.

Hi. We still see this issue so I am reopening, hopefully we can add more meaningful info to the bug this time.

I still experience the issue with puddle 2018-05-01.6. I would assume it all comes down to vbmc, since it is also causing other issues and keeps being updated and re-written?
(undercloud) [stack@undercloud-0 ~]$ openstack stack delete overcloud --wait -y
2018-05-06 06:40:03Z [overcloud]: DELETE_IN_PROGRESS Stack DELETE started
2018-05-06 06:40:03Z [overcloud.CephStorage]: DELETE_IN_PROGRESS state changed
2018-05-06 06:40:03Z [overcloud.CephStorage]: DELETE_FAILED ResourceInError: resources.CephStorage.resources[0].resources.CephStorage: Went to status ERROR due to "Server ceph-0 delete failed: (500) Node 9f36aecb-c678-42e3-9c7c-f808ad9d8e10 can not be updated while a state transition is in progress. (HTTP 409)"
2018-05-06 06:40:03Z [overcloud]: DELETE_FAILED Resource DELETE failed: ResourceInError: resources.CephStorage.resources[0].resources.CephStorage: Went to status ERROR due to "Server ceph-0 delete failed: (500) Node 9f36aecb-c678-42e3-9c7c-f808ad9d8e10 can not be updated while a state transition is in
Stack overcloud DELETE_FAILED
Unable to delete 1 of the 1 stacks.
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node show 9f36aecb-c678-42e3-9c7c-f808ad9d8e10 --fit-width
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| boot_interface | None |
| chassis_uuid | None |
| clean_step | {} |
| console_enabled | False |
| console_interface | None |
| created_at | 2018-05-02T10:00:55+00:00 |
| deploy_interface | None |
| driver | pxe_ipmitool |
| driver_info | {u'ipmi_port': u'6232', u'ipmi_username': u'admin', u'deploy_kernel': u'ed6c7b24-a059-4de4-b8ee-669040b021ba', u'ipmi_address': u'172.16.0.1', u'deploy_ramdisk': u'4cbc442f-575c-4d54 |
| | -a75a-b6f5d52e0eee', u'ipmi_password': u'******'} |
| driver_internal_info | {u'agent_url': u'http://192.168.24.8:9999', u'root_uuid_or_disk_id': u'c7e46e23-2898-4fa9-bfc2-7de1d6c5cf49', u'is_whole_disk_image': False, u'agent_version': u'3.2.1.dev2'} |
| extra | {u'hardware_swift_object': u'extra_hardware-9f36aecb-c678-42e3-9c7c-f808ad9d8e10'} |
| inspect_interface | None |
| inspection_finished_at | None |
| inspection_started_at | None |
| instance_info | {u'root_gb': u'17', u'display_name': u'ceph-0', u'image_source': u'496124e5-40e1-4ed8-8035-30ac3e82e30a', u'capabilities': u'{"profile": "ceph", "boot_option": "local"}', u'memory_mb': |
| | u'4096', u'vcpus': u'1', u'local_gb': u'19', u'configdrive': u'******', u'swap_mb': u'0', u'nova_host_id': u'undercloud-0.redhat.local'} |
| instance_uuid | 730462e6-82f1-4a8d-99ff-ac398d9cca62 |
| last_error | None |
| maintenance | True |
| maintenance_reason | During sync_power_state, max retries exceeded for node 9f36aecb-c678-42e3-9c7c-f808ad9d8e10, node state None does not match expected state 'power on'. Updating DB state to 'None' |
| | Switching node to maintenance mode. Error: IPMI call failed: power status. |
| management_interface | None |
| name | ceph-0 |
| network_interface | flat |
| power_interface | None |
| power_state | None |
| properties | {u'memory_mb': u'4096', u'cpu_arch': u'x86_64', u'local_gb': u'19', u'cpus': u'2', u'capabilities': u'profile:ceph,boot_option:local'} |
| provision_state | deleting |
| provision_updated_at | 2018-05-06T05:49:14+00:00 |
| raid_config | {} |
| raid_interface | None |
| reservation | None |
| resource_class | baremetal |
| storage_interface | noop |
| target_power_state | None |
| target_provision_state | available |
| target_raid_config | {} |
| updated_at | 2018-05-06T05:54:58+00:00 |
| uuid | 9f36aecb-c678-42e3-9c7c-f808ad9d8e10 |
| vendor_interface | None |
+------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I'm using the following cherry-pick on this system to work around different issues:
https://review.openstack.org/#/c/564878/
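
On the earlier question of whether ipmitool can reach the node by hand: below is a minimal sketch that builds the same kind of `power status` command Ironic logs further down in this report, using the `driver_info` fields from the node output above. The helper names and the password file path are hypothetical, the `-L ADMINISTRATOR`, `-R 12` and `-N 5` values are simply copied from the logged command, and it assumes Python 3 with ipmitool installed on the undercloud.

```python
import subprocess


def build_power_status_cmd(driver_info, password_file, retries=12, min_timeout=5):
    """Build an ipmitool argv that mirrors the command seen in the Ironic log.

    driver_info is a dict shaped like the node's driver_info field;
    password_file is a (hypothetical) file holding the IPMI password.
    """
    return [
        "ipmitool",
        "-I", "lanplus",
        "-H", driver_info["ipmi_address"],
        "-L", "ADMINISTRATOR",
        # 623 is the standard IPMI port; vbmc nodes here use per-node ports.
        "-p", str(driver_info.get("ipmi_port", 623)),
        "-U", driver_info["ipmi_username"],
        "-R", str(retries),
        "-N", str(min_timeout),
        "-f", password_file,
        "power", "status",
    ]


def check_power_status(driver_info, password_file):
    """Run the command once and return (exit code, stdout, stderr)."""
    proc = subprocess.run(
        build_power_status_cmd(driver_info, password_file),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout.strip(), proc.stderr.strip()
```

For ceph-0 this would be called with `{"ipmi_address": "172.16.0.1", "ipmi_port": "6232", "ipmi_username": "admin"}`. A non-zero exit code with "Unable to establish IPMI v2 / RMCP+ session" on stderr would point at the BMC (vbmc) side rather than at Ironic.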
Alex - I can't seem to access this system, is there an intermediate host I need to be on?

[bfournie@ibm-p8-kvm-03-guest-02 ~]$ ping seal06.qa.lab.tlv.redhat.com
PING seal06.qa.lab.tlv.redhat.com (10.35.64.6) 56(84) bytes of data.
^C
--- seal06.qa.lab.tlv.redhat.com ping statistics ---
93 packets transmitted, 0 received, 100% packet loss, time 91999ms

Note that we are currently fixing a vbmc timeout tracked here - https://bugzilla.redhat.com/show_bug.cgi?id=1571384 that causes power-on/power-off issues to nodes when using vbmc. It's possible that it's the same problem but we'd have to look at logs to confirm.

I was finally able to get onto seal06.qa.lab.tlv.redhat.com and ssh'ed to undercloud-0. Currently I see baremetal nodes OK. It looks like a deployment is in progress.

(undercloud) [stack@undercloud-0 log]$ openstack baremetal node list
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| dd3f42de-4d5f-428c-b06e-7d74c0a36d98 | ceph-0 | 8cc02e61-d429-44cf-875e-c830131faefe | power on | active | False |
| f61ab9a0-1373-4709-b9f6-775bf8c92683 | ceph-1 | d6ec7e2d-eda4-4cf7-abb9-06cb948fbe38 | power on | active | False |
| dd842269-b332-4b05-9c68-a8a07b9377f1 | ceph-2 | 5fa2fe93-9ae6-43e0-b9d1-e3cd4c5b7619 | power on | active | False |
| 9b22fe1b-22e4-47fd-bd0e-00042b9a2956 | compute-0 | 33915557-efe3-4aaf-87a3-361d0c4aa569 | power on | active | False |
| 97643e11-117c-4fc8-915a-adef0cbd3e90 | compute-1 | 524c77d2-b901-438c-aa66-a92f878275f0 | power on | active | False |
| 4450b6f8-008f-45f3-9560-0e2fe36742eb | controller-0 | 88143f27-3128-40a4-b422-c40fb5fe10c5 | power on | active | False |
| 72c9ab9d-c1ea-43d8-bb44-653b81bf0824 | controller-1 | 48ea75db-74f7-4299-9915-826c1890a8ae | power on | active | False |
| d578fd5f-4120-4d95-9388-314e57e9f9bc | controller-2 | ad52d55a-ecfc-4395-86c7-cb58052900ce | power on | active | False |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

Please capture sosreport if problem occurs again.

Alex - can you attach the sosreports to this BZ please? I keep getting failures when trying to download from drive.google.com. Thanks.

Thanks Alex, yes I was able to retrieve the logs. It's clear from the Ironic logs that we're getting these "Error in tear_down of node" due to IPMI failure issues (see below). As you're using virtualbmc for IPMI, most likely the issue you are seeing is virtualbmc power failures because of libvirt. This is being tracked here - https://bugzilla.redhat.com/show_bug.cgi?id=1571384. There is a libvirt patch described in https://bugzilla.redhat.com/show_bug.cgi?id=1576464 that has proven to resolve these virtualbmc issues. Would it be possible to install this libvirt patch and retest? I will leave this open for now, eventually it should be marked a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1571384.

2018-05-08 02:12:13.173 17127 WARNING ironic.drivers.modules.ipmitool [req-6a0a9638-d21d-4fc6-8ca8-8f284d3bb17d b9b6ae49e2b249f692319e410f25d2d7 c8b1b2624f57453496d61febc7ad0c09 - default default] IPMI power status failed for node 4450b6f8-008f-45f3-9560-0e2fe36742eb with error: Unexpected error while running command.
Command: ipmitool -I lanplus -H 172.16.0.1 -L ADMINISTRATOR -p 6236 -U admin -R 12 -N 5 -f /tmp/tmpYzwU_m power status
Exit code: 1
Stdout: u''
Stderr: u'Error: Unable to establish IPMI v2 / RMCP+ session\n'.: ProcessExecutionError: Unexpected error while running command.
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager [req-6a0a9638-d21d-4fc6-8ca8-8f284d3bb17d b9b6ae49e2b249f692319e410f25d2d7 c8b1b2624f57453496d61febc7ad0c09 - default default] Error in tear_down of node 4450b6f8-008f-45f3-9560-0e2fe36742eb: IPMI call failed: power status.: IPMIFailure: IPMI call failed: power status.
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager Traceback (most recent call last):
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 908, in _do_node_tear_down
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     task.driver.deploy.tear_down(task)
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 60, in wrapped
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     result = f(*args, **kwargs)
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     return f(*args, **kwargs)
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 498, in tear_down
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     manager_utils.node_power_action(task, states.POWER_OFF)
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     return f(*args, **kwargs)
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 209, in node_power_action
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     if _can_skip_state_change(task, new_state):
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 168, in _can_skip_state_change
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     fields.NotificationStatus.ERROR, new_state)
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     self.force_reraise()
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     six.reraise(self.type_, self.value, self.tb)
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 158, in _can_skip_state_change
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     curr_state = task.driver.power.get_power_state(task)
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 60, in wrapped
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     result = f(*args, **kwargs)
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/ipmitool.py", line 781, in get_power_state
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     return _power_status(driver_info)
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/ipmitool.py", line 564, in _power_status
2018-05-08 02:12:13.193 17127 ERROR ironic.conductor.manager     raise exception.IPMIFailure(cmd=cmd)

*** This bug has been marked as a duplicate of bug 1571384 ***
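
For anyone triaging similar logs, here is a rough sketch of how the two Ironic messages quoted in this report ("IPMI power status failed for node" and "Error in tear_down of node") could be tallied per node. The function name is hypothetical, the patterns are based only on the log lines shown in this bug, and the log path in the usage note is an assumption about the undercloud layout.

```python
import re
from collections import Counter

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

# Patterns taken from the two messages quoted in this bug report.
IPMI_FAIL_RE = re.compile(r"IPMI power status failed for node (" + UUID + ")")
TEAR_DOWN_RE = re.compile(r"Error in tear_down of node (" + UUID + ")")


def count_power_failures(lines):
    """Return per-node counts of IPMI power-status and tear_down failures."""
    ipmi_failures = Counter()
    tear_down_failures = Counter()
    for line in lines:
        match = IPMI_FAIL_RE.search(line)
        if match:
            ipmi_failures[match.group(1)] += 1
        match = TEAR_DOWN_RE.search(line)
        if match:
            tear_down_failures[match.group(1)] += 1
    return ipmi_failures, tear_down_failures
```

It can be fed an open file object, for example the conductor log (assumed here to live at /var/log/ironic/ironic-conductor.log). Nodes that appear in both counters are the ones where a teardown failed because the power status call did not get through.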