Created attachment 1242406 [details]
ironic logs

Description of problem:
In a virt setup of RHOS 10 that has been running for 22 days (1 controller, 1 Ceph, 2 compute nodes, CFME integration with RHOS), ironic node-list shows all of the deployed hosts with Maintenance set to True:

+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| e808e543-3ce1-49ee-a826-779364f690a0 | ceph-0       | 3800d04a-4853-4a63-9481-114737077913 | None        | active             | True        |
| 709d673e-f70a-4be3-9e33-74f632f55891 | controller-0 | 6f7d45e3-5144-4deb-82ca-09f10f6640bf | None        | active             | True        |
| f1c01856-44c1-4ad3-9c57-e541f3ed7aa2 | compute-0    | feeb62e4-3852-4175-885e-e243146ace1c | None        | active             | True        |
| 6fbd0dee-0cc8-4710-a9a0-ea72b039a52d | compute-1    | ed65627d-1680-48ad-b9fc-03bbe1e603d0 | None        | active             | True        |
| c4f7a679-b895-44dd-9436-d6f469a07513 | my_node      | None                                 | None        | enroll             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

Node details for controller-0:

+------------------------+--------------------------------------------------------------------------+
| Property               | Value                                                                    |
+------------------------+--------------------------------------------------------------------------+
| chassis_uuid           |                                                                          |
| clean_step             | {}                                                                       |
| console_enabled        | False                                                                    |
| created_at             | 2016-12-26T08:42:52+00:00                                                |
| driver                 | pxe_ssh                                                                  |
| driver_info            | {u'ssh_username': u'stack', u'deploy_kernel':                            |
|                        | u'279e1b18-5152-445e-a467-2f1a4db42318', u'deploy_ramdisk':              |
|                        | u'08967399-1f71-43cf-9272-ca68badae093', u'ssh_key_contents': u'******', |
|                        | u'ssh_virt_type': u'virsh', u'ssh_address': u'172.16.0.1'}               |
| driver_internal_info   | {u'agent_url': u'http://192.0.2.13:9999', u'root_uuid_or_disk_id':       |
|                        | u'a69bf0c7-8d41-42c5-b1f0-e64719aa7ffb', u'is_whole_disk_image': False,  |
|                        | u'agent_last_heartbeat': 1482742171}                                     |
| extra                  | {u'hardware_swift_object': u'extra_hardware-709d673e-f70a-               |
|                        | 4be3-9e33-74f632f55891'}                                                 |
| inspection_finished_at | None                                                                     |
| inspection_started_at  | None                                                                     |
| instance_info          | {u'root_gb': u'29', u'display_name': u'controller-0', u'image_source':   |
|                        | u'2eb71347-8edf-434d-bb4a-96d73b540576', u'capabilities': u'{"profile":  |
|                        | "controller-d75f3dec-c770-5f88-9d4c-3fea1bf9c484", "boot_option":        |
|                        | "local"}', u'memory_mb': u'15886', u'vcpus': u'3', u'local_gb': u'29',   |
|                        | u'configdrive': u'******', u'swap_mb': u'0'}                             |
| instance_uuid          | 6f7d45e3-5144-4deb-82ca-09f10f6640bf                                     |
| last_error             | During sync_power_state, max retries exceeded for node 709d673e-f70a-    |
|                        | 4be3-9e33-74f632f55891, node state None does not match expected state    |
|                        | 'None'. Updating DB state to 'None' Switching node to maintenance mode.  |
|                        | Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh        |
|                        | --connect qemu:///system list --all --name.                              |
| maintenance            | True                                                                     |
| maintenance_reason     | During sync_power_state, max retries exceeded for node 709d673e-f70a-    |
|                        | 4be3-9e33-74f632f55891, node state None does not match expected state    |
|                        | 'None'. Updating DB state to 'None' Switching node to maintenance mode.  |
|                        | Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh        |
|                        | --connect qemu:///system list --all --name.                              |
| name                   | controller-0                                                             |
| network_interface      |                                                                          |
| power_state            | None                                                                     |
| properties             | {u'memory_mb': u'16384', u'cpu_arch': u'x86_64', u'local_gb': u'29',     |
|                        | u'cpus': u'4', u'capabilities': u'profile:controller-d75f3dec-c770-5f88  |
|                        | -9d4c-3fea1bf9c484,boot_option:local'}                                   |
| provision_state        | active                                                                   |
| provision_updated_at   | 2016-12-26T08:50:19+00:00                                                |
| raid_config            |                                                                          |
| reservation            | None                                                                     |
| resource_class         |                                                                          |
| target_power_state     | None                                                                     |
| target_provision_state | None                                                                     |
| target_raid_config     |                                                                          |
| updated_at             | 2017-01-18T11:28:29+00:00                                                |
| uuid                   | 709d673e-f70a-4be3-9e33-74f632f55891                                     |
+------------------------+--------------------------------------------------------------------------+

The nodes themselves are running and functioning properly.

Version-Release number of selected component (if applicable):
puppet-ironic-9.4.1-1.el7ost.noarch
python-ironicclient-1.7.0-1.el7ost.noarch
openstack-ironic-inspector-4.2.1-1.el7ost.noarch
openstack-ironic-common-6.2.2-2.el7ost.noarch
python-ironic-inspector-client-1.9.0-2.el7ost.noarch
openstack-ironic-conductor-6.2.2-2.el7ost.noarch
python-ironic-lib-2.1.1-2.el7ost.noarch
openstack-ironic-api-6.2.2-2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install RHOS 10.
2. Let the setup run for more than 20 days.

Actual results:
The Ironic nodes switch to maintenance mode (Maintenance = True).

Expected results:
The nodes stay out of maintenance mode (Maintenance = False) and their power state can still be controlled through the Ironic API commands.

Additional info:
2017-01-18 06:27:58.306 17348 ERROR ironic.drivers.modules.ssh [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] Cannot execute SSH cmd LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name. Reason: Unexpected error while running command.
Command: LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name
Exit code: 1
2017-01-18 06:27:58.317 17348 DEBUG ironic.conductor.task_manager [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] Node e808e543-3ce1-49ee-a826-779364f690a0 successfully reserved for power state sync (took 0.01 seconds) reserve_node /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:252
2017-01-18 06:27:58.324 17348 ERROR ironic.conductor.manager [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] During sync_power_state, max retries exceeded for node e808e543-3ce1-49ee-a826-779364f690a0, node state None does not match expected state 'None'. Updating DB state to 'None' Switching node to maintenance mode. Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name.
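For reference, the failing command can be re-run by hand from the undercloud using the ssh_username and ssh_address values shown in driver_info above. This is only a diagnostic sketch; Ironic runs the command over its own SSH session, so the behaviour may not be strictly identical:

$ ssh stack@172.16.0.1 'LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name'
$ echo $?    # a non-zero status here matches the "Exit code: 1" seen in the log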
Hi,

Thanks for reporting. Ironic will automatically put a node into maintenance mode when it is no longer able to manage or power-control it.

Can you check whether the VMs Ironic is trying to manage are defined under the system URI "qemu:///system" or the session URI "qemu:///session"? You can check that by logging in to the host machine running the VMs and issuing:

$ virsh -c qemu:///system list

Can you also upload the logs from the ironic-conductor service? I wonder if there's some other error hidden there which might give a better explanation than "Unexpected error while running command".

Thanks
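For anyone following along, a sketch of both checks. The service name and log path below assume the default RHOS 10 undercloud packaging; adjust them if your layout differs.

On the host machine that runs the overcloud VMs:
$ virsh -c qemu:///system list --all
$ virsh -c qemu:///session list --all

On the undercloud node, to collect the ironic-conductor logs for attaching here:
$ sudo journalctl -u openstack-ironic-conductor --no-pager > ironic-conductor-journal.log
$ sudo cp /var/log/ironic/ironic-conductor.log .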
Restarting libvirtd has resolved the issue.
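For anyone hitting the same symptom, a recovery sketch based on this comment. The node-set-maintenance step is an assumption on my side (as far as I know, Ironic does not clear the flag on its own once it has switched a node into maintenance), and the syntax shown is the python-ironicclient form; substitute your own node UUIDs.

On the host machine running the overcloud VMs:
$ sudo systemctl restart libvirtd
$ virsh -c qemu:///system list --all --name

On the undercloud, once virsh responds again, clear the flag per node:
$ ironic node-set-maintenance <node-uuid> false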
As Lucas mentioned, we can't do much if we can't control the machines, so I guess this can be closed. Please let me know if we can help somehow.