Bug 1414671 - Ironic nodes turn to maintenance mode RHOS10, longevity test
Summary: Ironic nodes turn to maintenance mode RHOS10, longevity test
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Lucas Alvares Gomes
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-19 08:23 UTC by Ronnie Rasouli
Modified: 2018-10-09 08:52 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-26 16:20:12 UTC
Target Upstream Version:


Attachments
ironic logs (799.38 KB, application/x-gzip)
2017-01-19 08:23 UTC, Ronnie Rasouli

Description Ronnie Rasouli 2017-01-19 08:23:23 UTC
Created attachment 1242406 [details]
ironic logs

Description of problem:

In a virt setup of RHOS10 that has been running for 22 days.

Setup is with 1 controller, 1 Ceph node, 2 compute nodes, and CFME integration with RHOS.

The ironic node-list output shows all of the hosts with Maintenance set to True.

+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| e808e543-3ce1-49ee-a826-779364f690a0 | ceph-0       | 3800d04a-4853-4a63-9481-114737077913 | None        | active             | True        |
| 709d673e-f70a-4be3-9e33-74f632f55891 | controller-0 | 6f7d45e3-5144-4deb-82ca-09f10f6640bf | None        | active             | True        |
| f1c01856-44c1-4ad3-9c57-e541f3ed7aa2 | compute-0    | feeb62e4-3852-4175-885e-e243146ace1c | None        | active             | True        |
| 6fbd0dee-0cc8-4710-a9a0-ea72b039a52d | compute-1    | ed65627d-1680-48ad-b9fc-03bbe1e603d0 | None        | active             | True        |
| c4f7a679-b895-44dd-9436-d6f469a07513 | my_node      | None                                 | None        | enroll             | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
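
The node details below appear to be the ironic node-show output for controller-0 (the UUID matches the node-list entry above); presumably gathered with something like:

$ ironic node-show controller-0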

+------------------------+--------------------------------------------------------------------------+
| Property               | Value                                                                    |
+------------------------+--------------------------------------------------------------------------+
| chassis_uuid           |                                                                          |
| clean_step             | {}                                                                       |
| console_enabled        | False                                                                    |
| created_at             | 2016-12-26T08:42:52+00:00                                                |
| driver                 | pxe_ssh                                                                  |
| driver_info            | {u'ssh_username': u'stack', u'deploy_kernel':                            |
|                        | u'279e1b18-5152-445e-a467-2f1a4db42318', u'deploy_ramdisk':              |
|                        | u'08967399-1f71-43cf-9272-ca68badae093', u'ssh_key_contents': u'******', |
|                        | u'ssh_virt_type': u'virsh', u'ssh_address': u'172.16.0.1'}               |
| driver_internal_info   | {u'agent_url': u'http://192.0.2.13:9999', u'root_uuid_or_disk_id':       |
|                        | u'a69bf0c7-8d41-42c5-b1f0-e64719aa7ffb', u'is_whole_disk_image': False,  |
|                        | u'agent_last_heartbeat': 1482742171}                                     |
| extra                  | {u'hardware_swift_object': u'extra_hardware-709d673e-f70a-               |
|                        | 4be3-9e33-74f632f55891'}                                                 |
| inspection_finished_at | None                                                                     |
| inspection_started_at  | None                                                                     |
| instance_info          | {u'root_gb': u'29', u'display_name': u'controller-0', u'image_source':   |
|                        | u'2eb71347-8edf-434d-bb4a-96d73b540576', u'capabilities': u'{"profile":  |
|                        | "controller-d75f3dec-c770-5f88-9d4c-3fea1bf9c484", "boot_option":        |
|                        | "local"}', u'memory_mb': u'15886', u'vcpus': u'3', u'local_gb': u'29',   |
|                        | u'configdrive': u'******', u'swap_mb': u'0'}                             |
| instance_uuid          | 6f7d45e3-5144-4deb-82ca-09f10f6640bf                                     |
| last_error             | During sync_power_state, max retries exceeded for node 709d673e-f70a-    |
|                        | 4be3-9e33-74f632f55891, node state None does not match expected state    |
|                        | 'None'. Updating DB state to 'None' Switching node to maintenance mode.  |
|                        | Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh        |
|                        | --connect qemu:///system list --all --name.                              |
| maintenance            | True                                                                     |
| maintenance_reason     | During sync_power_state, max retries exceeded for node 709d673e-f70a-    |
|                        | 4be3-9e33-74f632f55891, node state None does not match expected state    |
|                        | 'None'. Updating DB state to 'None' Switching node to maintenance mode.  |
|                        | Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh        |
|                        | --connect qemu:///system list --all --name.                              |
| name                   | controller-0                                                             |
| network_interface      |                                                                          |
| power_state            | None                                                                     |
| properties             | {u'memory_mb': u'16384', u'cpu_arch': u'x86_64', u'local_gb': u'29',     |
|                        | u'cpus': u'4', u'capabilities': u'profile:controller-d75f3dec-c770-5f88  |
|                        | -9d4c-3fea1bf9c484,boot_option:local'}                                   |
| provision_state        | active                                                                   |
| provision_updated_at   | 2016-12-26T08:50:19+00:00                                                |
| raid_config            |                                                                          |
| reservation            | None                                                                     |
| resource_class         |                                                                          |
| target_power_state     | None                                                                     |
| target_provision_state | None                                                                     |
| target_raid_config     |                                                                          |
| updated_at             | 2017-01-18T11:28:29+00:00                                                |
| uuid                   | 709d673e-f70a-4be3-9e33-74f632f55891                                     |
+------------------------+--------------------------------------------------------------------------+



The overcloud nodes themselves are still running and functioning properly.

Version-Release number of selected component (if applicable):
puppet-ironic-9.4.1-1.el7ost.noarch
python-ironicclient-1.7.0-1.el7ost.noarch
openstack-ironic-inspector-4.2.1-1.el7ost.noarch
openstack-ironic-common-6.2.2-2.el7ost.noarch
python-ironic-inspector-client-1.9.0-2.el7ost.noarch
openstack-ironic-conductor-6.2.2-2.el7ost.noarch
python-ironic-lib-2.1.1-2.el7ost.noarch
openstack-ironic-api-6.2.2-2.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Install RHOS
2. Let the setup run for more than 20 days

Actual results:

Ironic nodes switch to maintenance mode (Maintenance = True)

Expected results:

The nodes keep running with maintenance False, and the power state remains controllable via Ironic API commands

Additional info:
2017-01-18 06:27:58.306 17348 ERROR ironic.drivers.modules.ssh [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] Cannot execute SSH cmd LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name. Reason: Unexpected error while running command.
Command: LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name
Exit code: 1

2017-01-18 06:27:58.317 17348 DEBUG ironic.conductor.task_manager [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] Node e808e543-3ce1-49ee-a826-779364f690a0 successfully reserved for power state sync (took 0.01 seconds) reserve_node /usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py:252
2017-01-18 06:27:58.324 17348 ERROR ironic.conductor.manager [req-ffddb9c3-12b0-48fe-b62c-b3d7cb644cd6 - - - - -] During sync_power_state, max retries exceeded for node e808e543-3ce1-49ee-a826-779364f690a0, node state None does not match expected state 'None'. Updating DB state to 'None' Switching node to maintenance mode. Error: Failed to execute command via SSH: LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name.
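
The failing call can presumably be reproduced by hand from the undercloud using the ssh_address and ssh_username from the driver_info above (a hedged sketch; stack@172.16.0.1 is taken from this node's driver_info and will differ on other setups):

$ ssh stack@172.16.0.1 'LC_ALL=C /usr/bin/virsh --connect qemu:///system list --all --name'
# A non-zero exit code here as well would point at the virt host (libvirtd) rather than at Ironic itself.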

Comment 1 Lucas Alvares Gomes 2017-01-19 10:16:06 UTC
Hi,

Thanks for reporting. Ironic will automatically put a node into maintenance mode when it is no longer able to manage or power-control it.

Can you check whether the VMs Ironic is trying to manage are on the system URI "qemu:///system" or the session URI "qemu:///session"? You can check that by logging in to the host machine with the VMs and issuing:

$ virsh -c qemu:///system list

Can you also upload the logs from the ironic-conductor service? I wonder if there's some other error hidden there which might give a better explanation than "Unexpected error while running command".
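
On an OSP 10 undercloud the conductor log should normally live under /var/log/ironic/ (path assumed from the default packaging; adjust if your deployment logs elsewhere), e.g.:

$ sudo tar czf ironic-conductor-logs.tar.gz /var/log/ironic/ironic-conductor.log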

Thanks

Comment 2 Ronnie Rasouli 2017-01-22 11:22:47 UTC
Restarting libvirtd has resolved the issue.
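
For anyone hitting the same thing: the maintenance flag that Ironic sets on a power sync failure is presumably not cleared automatically once libvirtd is back, so each affected node likely has to be flipped back by hand (a hedged sketch using the ironic CLI shipped in this release; substitute your own node names or UUIDs):

$ sudo systemctl restart libvirtd                  # on the virt host running the overcloud VMs
$ ironic node-set-maintenance controller-0 false   # repeat for ceph-0, compute-0, compute-1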

Comment 3 Dmitry Tantsur 2017-04-26 16:20:12 UTC
As Lucas mentioned, we can't do much if we can't control the machines, so I guess this can be closed. Please let me know if we can help somehow.

