Bug 1414671

Summary: Ironic nodes turn to maintenance mode RHOS10, longevity test

Product: Red Hat OpenStack
Component: openstack-ironic
Version: 10.0 (Newton)
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Reporter: Ronnie Rasouli <rrasouli>
Assignee: Lucas Alvares Gomes <lmartins>
QA Contact: Raviv Bar-Tal <rbartal>
CC: mburns, mcornea, ojanas, rhel-osp-director-maint, srevivo
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-04-26 16:20:12 UTC

Attachments: ironic logs

Description Ronnie Rasouli 2017-01-19 08:23:23 UTC
Created attachment 1242406 [details]
ironic logs

Description of problem:

In a virt setup of RHOS 10 that has been running for 22 days.

The setup consists of 1 controller, 1 Ceph node, and 2 computes, with CFME integration with RHOS.

ironic node-list shows all of the hosts with Maintenance set to True.

+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| e808e543-3ce1-49ee-a826-779364f690a0 | ceph-0       | 3800d04a-4853-4a63-9481-114737077913 | None        | active             | True        |
| 709d673e-f70a-4be3-9e33-74f632f55891 | controller-0 | 6f7d45e3-5144-4deb-82ca-09f10f6640bf | None        | active             | True        |
| f1c01856-44c1-4ad3-9c57-e541f3ed7aa2 | compute-0    | feeb62e4-3852-4175-885e-e243146ace1c | None        | active             | True        |
| 6fbd0dee-0cc8-4710-a9a0-ea72b039a52d | compute-1    | ed65627d-1680-48ad-b9fc-03bbe1e603d0 | None        | active             | True        |
| c4f7a679-b895-44dd-9436-d6f469a07513 | my_node      | None                                 | None        | enroll             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
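The per-node details below appear to have been collected with something like the following command, using a node name from the list above (the exact invocation is illustrative):

$ ironic node-show controller-0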

+------------------------+--------------------------------------------------------------------------+
| Property               | Value                                                                    |
+------------------------+--------------------------------------------------------------------------+
| chassis_uuid           |                                                                          |
| clean_step             | {}                                                                       |
| console_enabled        | False                                                                    |
| created_at             | 2016-12-26T08:42:52+00:00                                                |
| driver                 | pxe_ssh                                                                  |
| driver_info            | {u'ssh_username': u'stack', u'deploy_kernel':                            |
|                        | u'279e1b18-5152-445e-a467-2f1a4db42318', u'deploy_ramdisk':              |
|                        | u'08967399-1f71-43cf-9272-ca68badae093', u'ssh_key_contents': u'******', |
|                        | u'ssh_virt_type': u'virsh', u'ssh_address': u'172.16.0.1'}               |
| driver_internal_info   | {u'agent_url': u'http://192.0.2.13:9999', u'root_uuid_or_disk_id':       |
|                        | u'a69bf0c7-8d41-42c5-b1f0-e64719aa7ffb', u'is_whole_disk_image': False,  |
|                        | u'agent_last_heartbeat': 1482742171}                                     |
| extra                  | {u'hardware_swift_object': u'extra_hardware-709d673e-f70a-               |
|                        | 4be3-9e33-74f632f55891'}                                                 |
| inspection_finished_at | None                                                                     |
| inspection_started_at  | None                                                                     |
| instance_info          | {u'root_gb': u'29', u'display_name': u'controller-0', u'image_source':   |
|                        | u'2eb71347-8edf-434d-bb4a-96d73b540576', u'capabilities': u'{"profile":  |
|                        | "controller-d75f3dec-c770-5f88-9d4c-3fea1bf9c484", "boot_option":        |
|                        | "local"}', u'memory_mb': u'15886', u'vcpus': u'3', u'local_gb': u'29',   |
|                        | u'configdrive': u'******', u'swap_mb': u'0'}                             |
| instance_uuid          | 6f7d45e3-5144-4deb-82ca-09f10f6640bf                                     |
| last_error             | During sync_power_state, max retries exceeded for node 709d673e-f70a-    |
|                        | 4be3-9e33-74f632f55891, node state None does not match expected state    |
|                        | 'None'. Updating DB state to 'None' Switching node to maintenance mode.  |
|                        | Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh        |
|                        | --connect qemu:///system list --all --name.                              |
| maintenance            | True                                                                     |
| maintenance_reason     | During sync_power_state, max retries exceeded for node 709d673e-f70a-    |
|                        | 4be3-9e33-74f632f55891, node state None does not match expected state    |
|                        | 'None'. Updating DB state to 'None' Switching node to maintenance mode.  |
|                        | Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh        |
|                        | --connect qemu:///system list --all --name.                              |
| name                   | controller-0                                                             |
| network_interface      |                                                                          |
| power_state            | None                                                                     |
| properties             | {u'memory_mb': u'16384', u'cpu_arch': u'x86_64', u'local_gb': u'29',     |
|                        | u'cpus': u'4', u'capabilities': u'profile:controller-d75f3dec-c770-5f88  |
|                        | -9d4c-3fea1bf9c484,boot_option:local'}                                   |
| provision_state        | active                                                                   |
| provision_updated_at   | 2016-12-26T08:50:19+00:00                                                |
| raid_config            |                                                                          |
| reservation            | None                                                                     |
| resource_class         |                                                                          |
| target_power_state     | None                                                                     |
| target_provision_state | None                                                                     |
| target_raid_config     |                                                                          |
| updated_at             | 2017-01-18T11:28:29+00:00                                                |
| uuid                   | 709d673e-f70a-4be3-9e33-74f632f55891                                     |
+------------------------+--------------------------------------------------------------------------+



The nodes themselves are still running and functioning properly.

Version-Release number of selected component (if applicable):
puppet-ironic-9.4.1-1.el7ost.noarch
python-ironicclient-1.7.0-1.el7ost.noarch
openstack-ironic-inspector-4.2.1-1.el7ost.noarch
openstack-ironic-common-6.2.2-2.el7ost.noarch
python-ironic-inspector-client-1.9.0-2.el7ost.noarch
openstack-ironic-conductor-6.2.2-2.el7ost.noarch
python-ironic-lib-2.1.1-2.el7ost.noarch
openstack-ironic-api-6.2.2-2.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Install RHOS 10.
2. Let the setup run for more than 20 days.

Actual results:

The Ironic nodes turn to maintenance mode True.

Expected results:

The nodes stay with maintenance False, and their power state remains controllable through Ironic API commands.
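
For reference, controlling a node's power state through the Ironic CLI would look roughly like the sketch below (the node name is taken from the node list above; the exact invocation is illustrative, not a step from the report):

$ ironic node-set-power-state compute-0 on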

Additional info:
2017-01-18 06:27:58.306 17348 ERROR ironic.drivers.modules.ssh [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] Cannot execute SSH cmd LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name. Reason: Unexpected error while running command.
Command: LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name
Exit code: 1

2017-01-18 06:27:58.317 17348 DEBUG ironic.conductor.task_manager [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] Node e808e543-3ce1-49ee-a826-779364f690a0 successfully reserved for power state sync (took 0.01 seconds) reserve_node /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:252
2017-01-18 06:27:58.324 17348 ERROR ironic.conductor.manager [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] During sync_power_state, max retries exceeded for node e808e543-3ce1-49ee-a826-779364f690a0, node state None does not match expected state 'None'. Updating DB state to 'None' Switching node to maintenance mode. Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name.
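
The failing command can be checked by hand from the undercloud, using the ssh_username and ssh_address from the driver_info above (a diagnostic sketch, not a step taken from the report):

$ ssh stack@172.16.0.1 "LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name"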

Comment 1 Lucas Alvares Gomes 2017-01-19 10:16:06 UTC
Hi,

Thanks for reporting. Ironic automatically puts a node into maintenance mode when it is no longer able to manage or power-control it.

Can you check whether the VMs Ironic is trying to manage are on the system URI "qemu:///system" or the session URI "qemu:///session"? You can check that by logging into the host machine running the VMs and issuing:

$ virsh -c qemu:///system list

Can you also upload the logs from the ironic-conductor service? I wonder if there's some other error hidden there which might give a better explanation than "Unexpected error while running command".
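
On a non-containerized RHOSP 10 undercloud the conductor log is usually /var/log/ironic/ironic-conductor.log, so something along these lines should capture the relevant part (the exact path is an assumption about this setup):

$ sudo tail -n 500 /var/log/ironic/ironic-conductor.log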

Thanks

Comment 2 Ronnie Rasouli 2017-01-22 11:22:47 UTC
Restarting libvirtd has resolved the issue.
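
For the record, the recovery on a setup like this would look roughly as follows; the virt host address comes from the driver_info above, and maintenance has to be cleared by hand because Ironic does not leave maintenance mode on its own (the exact commands are a sketch, not taken from the report).

On the virt host:
$ sudo systemctl restart libvirtd

Then, for each node, clear maintenance so the periodic sync_power_state task manages it again:
$ ironic node-set-maintenance controller-0 false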

Comment 3 Dmitry Tantsur 2017-04-26 16:20:12 UTC
As Lucas mentioned, we can't do much if we can't control the machines, so I guess this can be closed. Please let me know if we can help somehow.