Bug 1373367 - Undercloud Ironic sync_power_state failed & lost instance_uuid/info (Switching node to maintenance)
Summary: Undercloud Ironic sync_power_state failed & lost instance_uuid/info (Switching node to maintenance)
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: async
Target Release: 10.0 (Newton)
Assignee: Lucas Alvares Gomes
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-06 06:25 UTC by Anil Dhingra
Modified: 2019-12-16 06:38 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-14 16:18:29 UTC
Target Upstream Version:


Attachments:

Description Anil Dhingra 2016-09-06 06:25:51 UTC
Description of problem:
Because the filesystem on the undercloud node was full, the IPMI power status check failed to create its password file with the error: Failed to create the password file. [Errno 28] No space left on device: '/tmp/tmp6DSQwt.
Since the filesystem was full there are no further logs for that period, but it looks like this led to: "maintenance_reason | During sync_power_state, max retries exceeded for node xx, node state None does not match expected state 'power on'. Updating DB state to 'None' Switching node to maintenance mode."
However, ironic node-show no longer shows any instance_uuid/instance info, so just setting the maintenance flag and power status to on/true will not help.

- What is the best way to recover from this situation?
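
The kind of recovery I had in mind (Kilo ironic CLI; the node/instance UUIDs are placeholders, and re-adding instance_uuid via node-update may be rejected depending on the node's state, so this is only a sketch, not a verified procedure):

$ df -h /                                   # confirm the undercloud filesystem has free space again
$ ironic node-show <NODE_UUID>              # check maintenance, maintenance_reason, power_state, instance_uuid
$ ironic node-set-maintenance <NODE_UUID> false
$ ironic node-set-power-state <NODE_UUID> on
$ nova list                                 # find the instance UUID nova still holds for this node
$ ironic node-update <NODE_UUID> add instance_uuid=<INSTANCE_UUID>
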
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 11 Anil Dhingra 2016-09-07 02:01:50 UTC
uploaded complete nova / ironic logs to collab-shell.usersys.redhat.com

Comment 12 Lucas Alvares Gomes 2016-09-07 09:29:11 UTC
(In reply to Anil Dhingra from comment #11)
> uploaded complete nova / ironic logs to collab-shell.usersys.redhat.com

Thanks for that. I looked at the logs and it's all very strange. For example:

$ grep -Ri deployed ironic-logs-060916.tar.gz/var/log/ironic/ironic-conductor.log | wc -l
0

$ grep -R provision ironic-logs-060916.tar.gz/var/log/ironic/ironic-api.log 
2016-09-06 13:17:18.023 16162 INFO ironic.cmd.api [-] conductor.check_provision_state_interval = 60

I can't find a single line showing the request to deploy an instance in Ironic; maybe it happened before the logs begin (the earliest entry is at 2016-05-25 03:58:32.179).

Also, the nova-compute logs from 2016-07-24 to 2016-07-29 are filled with errors, more specifically:

2016-07-28 03:31:41.174 19392 ERROR nova.compute.manager [-] [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803] Periodic sync_power_state task had an error while processing an instance.
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803] Traceback (most recent call last):
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6100, in _sync
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     query_driver_power_state_and_sync()
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 445, in inner
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     return f(*args, **kwargs)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6097, in query_driver_power_state_and_sync
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     self._query_driver_power_state_and_sync(context, db_instance)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6127, in _query_driver_power_state_and_sync
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     vm_instance = self.driver.get_info(db_instance)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 557, in get_info
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     node = _validate_instance_and_node(self.ironicclient, instance)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 127, in _validate_instance_and_node
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     return ironicclient.call("node.get_by_instance_uuid", instance.uuid)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/client_wrapper.py", line 126, in call
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     return self._multi_getattr(client, method)(*args, **kwargs)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 147, in get_by_instance_uuid
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     nodes = self._list(self._path(path), 'nodes')
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 121, in _list
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     resp, body = self.api.json_request('GET', url)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 353, in json_request
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     resp, body_iter = self._http_request(url, method, **kwargs)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 162, in wrapper
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     return func(self, url, method, **kwargs)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 298, in _http_request
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     resp = conn.getresponse()
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib64/python2.7/httplib.py", line 1051, in getresponse
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     response.begin()
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib64/python2.7/httplib.py", line 415, in begin
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     version, status, reason = self._read_status()
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib64/python2.7/httplib.py", line 379, in _read_status
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     raise BadStatusLine(line)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803] BadStatusLine: ''
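
The BadStatusLine: '' at the end means httplib received an empty HTTP status line, i.e. the ironic-api service closed the connection without answering. A quick way to check whether ironic-api is responding now (assuming the default OSP 7 service names) would be something like:

$ sudo systemctl status openstack-ironic-api openstack-ironic-conductor
$ ironic node-list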

$ grep -R "error while processing an instance" nova-compute.log-20160726 nova-compute.log-20160728 nova-compute.log-20160730  | wc -l
7000

...

So again, this environment does not seem reliable at all.

I can only guess at this point. I think something external caused the failures, and I still suspect something happened to the database (so please check the logs as asked in comment #8).
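
For example, on the undercloud the database is MariaDB, so something along these lines (the log path is the RHEL 7 default and the search terms are only examples) should show whether it hit problems around the same time:

$ sudo grep -iE "error|crash|aborted" /var/log/mariadb/mariadb.log
$ sudo grep -i "DBConnectionError" /var/log/ironic/ironic-conductor.log /var/log/nova/nova-compute.log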

That said, I really don't think there's much we can do here; the environment is simply not in a healthy state.

Comment 16 Dmitry Tantsur 2016-10-14 16:18:29 UTC
Sorry, we can't really do anything if the underlying database is not reliable.

