Created attachment 1242406 [details]
ironic logs

Description of problem:
In a virt setup of RHOS 10 that has been running for 22 days (1 controller, 1 Ceph, 2 compute nodes, CFME integration with RHOS), ironic node-list shows all of the deployed hosts with Maintenance set to True:

+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| e808e543-3ce1-49ee-a826-779364f690a0 | ceph-0       | 3800d04a-4853-4a63-9481-114737077913 | None        | active             | True        |
| 709d673e-f70a-4be3-9e33-74f632f55891 | controller-0 | 6f7d45e3-5144-4deb-82ca-09f10f6640bf | None        | active             | True        |
| f1c01856-44c1-4ad3-9c57-e541f3ed7aa2 | compute-0    | feeb62e4-3852-4175-885e-e243146ace1c | None        | active             | True        |
| 6fbd0dee-0cc8-4710-a9a0-ea72b039a52d | compute-1    | ed65627d-1680-48ad-b9fc-03bbe1e603d0 | None        | active             | True        |
| c4f7a679-b895-44dd-9436-d6f469a07513 | my_node      | None                                 | None        | enroll             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+

Node details for controller-0:

+------------------------+--------------------------------------------------------------------------+
| Property               | Value                                                                    |
+------------------------+--------------------------------------------------------------------------+
| chassis_uuid           |                                                                          |
| clean_step             | {}                                                                       |
| console_enabled        | False                                                                    |
| created_at             | 2016-12-26T08:42:52+00:00                                                |
| driver                 | pxe_ssh                                                                  |
| driver_info            | {u'ssh_username': u'stack', u'deploy_kernel':                            |
|                        | u'279e1b18-5152-445e-a467-2f1a4db42318', u'deploy_ramdisk':              |
|                        | u'08967399-1f71-43cf-9272-ca68badae093', u'ssh_key_contents': u'******', |
|                        | u'ssh_virt_type': u'virsh', u'ssh_address': u'172.16.0.1'}               |
| driver_internal_info   | {u'agent_url': u'http://192.0.2.13:9999', u'root_uuid_or_disk_id':       |
|                        | u'a69bf0c7-8d41-42c5-b1f0-e64719aa7ffb', u'is_whole_disk_image': False,  |
|                        | u'agent_last_heartbeat': 1482742171}                                     |
| extra                  | {u'hardware_swift_object': u'extra_hardware-709d673e-f70a-               |
|                        | 4be3-9e33-74f632f55891'}                                                 |
| inspection_finished_at | None                                                                     |
| inspection_started_at  | None                                                                     |
| instance_info          | {u'root_gb': u'29', u'display_name': u'controller-0', u'image_source':   |
|                        | u'2eb71347-8edf-434d-bb4a-96d73b540576', u'capabilities': u'{"profile":  |
|                        | "controller-d75f3dec-c770-5f88-9d4c-3fea1bf9c484", "boot_option":        |
|                        | "local"}', u'memory_mb': u'15886', u'vcpus': u'3', u'local_gb': u'29',   |
|                        | u'configdrive': u'******', u'swap_mb': u'0'}                             |
| instance_uuid          | 6f7d45e3-5144-4deb-82ca-09f10f6640bf                                     |
| last_error             | During sync_power_state, max retries exceeded for node 709d673e-f70a-    |
|                        | 4be3-9e33-74f632f55891, node state None does not match expected state    |
|                        | 'None'. Updating DB state to 'None' Switching node to maintenance mode.  |
|                        | Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh        |
|                        | --connect qemu:///system list --all --name.                              |
| maintenance            | True                                                                     |
| maintenance_reason     | During sync_power_state, max retries exceeded for node 709d673e-f70a-    |
|                        | 4be3-9e33-74f632f55891, node state None does not match expected state    |
|                        | 'None'. Updating DB state to 'None' Switching node to maintenance mode.  |
|                        | Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh        |
|                        | --connect qemu:///system list --all --name.                              |
| name                   | controller-0                                                             |
| network_interface      |                                                                          |
| power_state            | None                                                                     |
| properties             | {u'memory_mb': u'16384', u'cpu_arch': u'x86_64', u'local_gb': u'29',     |
|                        | u'cpus': u'4', u'capabilities': u'profile:controller-d75f3dec-c770-5f88  |
|                        | -9d4c-3fea1bf9c484,boot_option:local'}                                   |
| provision_state        | active                                                                   |
| provision_updated_at   | 2016-12-26T08:50:19+00:00                                                |
| raid_config            |                                                                          |
| reservation            | None                                                                     |
| resource_class         |                                                                          |
| target_power_state     | None                                                                     |
| target_provision_state | None                                                                     |
| target_raid_config     |                                                                          |
| updated_at             | 2017-01-18T11:28:29+00:00                                                |
| uuid                   | 709d673e-f70a-4be3-9e33-74f632f55891                                     |
+------------------------+--------------------------------------------------------------------------+

The nodes themselves are running and functioning properly.

Version-Release number of selected component (if applicable):
puppet-ironic-9.4.1-1.el7ost.noarch
python-ironicclient-1.7.0-1.el7ost.noarch
openstack-ironic-inspector-4.2.1-1.el7ost.noarch
openstack-ironic-common-6.2.2-2.el7ost.noarch
python-ironic-inspector-client-1.9.0-2.el7ost.noarch
openstack-ironic-conductor-6.2.2-2.el7ost.noarch
python-ironic-lib-2.1.1-2.el7ost.noarch
openstack-ironic-api-6.2.2-2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install RHOS 10.
2. Let the setup run for more than 20 days.

Actual results:
The Ironic nodes switch to maintenance mode (Maintenance = True).

Expected results:
The nodes stay out of maintenance mode (Maintenance = False) and their power state can still be controlled through the Ironic API commands.

Additional info:
2017-01-18 06:27:58.306 17348 ERROR ironic.drivers.modules.ssh [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] Cannot execute SSH cmd LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name. Reason: Unexpected error while running command.
Command: LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name
Exit code: 1
2017-01-18 06:27:58.317 17348 DEBUG ironic.conductor.task_manager [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] Node e808e543-3ce1-49ee-a826-779364f690a0 successfully reserved for power state sync (took 0.01 seconds) reserve_node /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:252
2017-01-18 06:27:58.324 17348 ERROR ironic.conductor.manager [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] During sync_power_state, max retries exceeded for node e808e543-3ce1-49ee-a826-779364f690a0, node state None does not match expected state 'None'. Updating DB state to 'None' Switching node to maintenance mode. Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name.
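For reference, the failing command can be re-run by hand from the undercloud using the ssh_username and ssh_address values shown in driver_info above. This is only a diagnostic sketch; Ironic runs the command over its own SSH session, so the behaviour may not be strictly identical:

$ ssh stack@172.16.0.1 'LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name'
$ echo $?    # a non-zero status here matches the "Exit code: 1" seen in the log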
Hi,

Thanks for reporting. Ironic will automatically put a node into maintenance mode when it is no longer able to manage or power-control it.

Can you check whether the VMs Ironic is trying to manage are defined under the system URI "qemu:///system" or the session URI "qemu:///session"? You can check that by logging in to the host machine running the VMs and issuing:

$ virsh -c qemu:///system list

Can you also upload the logs from the ironic-conductor service? I wonder if there's some other error hidden there which might give a better explanation than "Unexpected error while running command".

Thanks
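For anyone following along, a sketch of both checks. The service name and log path below assume the default RHOS 10 undercloud packaging; adjust them if your layout differs.

On the host machine that runs the overcloud VMs:
$ virsh -c qemu:///system list --all
$ virsh -c qemu:///session list --all

On the undercloud node, to collect the ironic-conductor logs for attaching here:
$ sudo journalctl -u openstack-ironic-conductor --no-pager > ironic-conductor-journal.log
$ sudo cp /var/log/ironic/ironic-conductor.log .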
Restarting libvirtd has resolved the issue.
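For anyone hitting the same symptom, a recovery sketch based on this comment. The node-set-maintenance step is an assumption on my side (as far as I know, Ironic does not clear the flag on its own once it has switched a node into maintenance), and the syntax shown is the python-ironicclient form; substitute your own node UUIDs.

On the host machine running the overcloud VMs:
$ sudo systemctl restart libvirtd
$ virsh -c qemu:///system list --all --name

On the undercloud, once virsh responds again, clear the flag per node:
$ ironic node-set-maintenance <node-uuid> false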
As Lucas mentioned, we can't do much if we can't control the machines, so I guess this can be closed. Please let me know if we can help somehow.