Bug 1373367

Summary: Undercloud Ironic sync_power_state failed & lost instance_uuid/info (Switching node to maintenance)

Product: Red Hat OpenStack
Component: openstack-ironic
Version: 7.0 (Kilo)
Target Milestone: async
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: high
Status: CLOSED WONTFIX
Type: Bug
Last Closed: 2016-10-14 16:18:29 UTC
Reporter: Anil Dhingra <adhingra>
Assignee: Lucas Alvares Gomes <lmartins>
QA Contact: Raviv Bar-Tal <rbartal>
CC: adhingra, dtantsur, lhh, mburns, rhel-osp-director-maint, srevivo

Description Anil Dhingra 2016-09-06 06:25:51 UTC
Description of problem:
Because the filesystem on the undercloud node was full, the IPMI power status check failed to create its password file, with the error: Failed to create the password file. [Errno 28] No space left on device: '/tmp/tmp6DSQwt.
As the filesystem was full, no further logs were written during that period, and this appears to have led to: maintenance_reason | During sync_power_state, max retries exceeded for node xx, node state None does not match expected state 'power on'. Updating DB state to 'None' Switching node to maintenance mode.
However, ironic node-show no longer shows any instance_uuid/info for the node, so just clearing maintenance mode and setting the power status back to on will not help.

- What is the best way to recover from this situation?
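
A minimal inspection sketch, assuming the node name is "xx" as in the maintenance_reason above and the Kilo-era ironic CLI; space on /tmp has to be freed first, and none of this restores a lost instance_uuid:

$ df -h /tmp                            # confirm space was freed so ipmitool can write its password file again
$ ironic node-show xx                   # check maintenance, maintenance_reason, power_state and instance_uuid
$ ironic node-set-maintenance xx false
$ ironic node-set-power-state xx on     # re-syncs power state only; does not bring back instance_uuid/info
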
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 11 Anil Dhingra 2016-09-07 02:01:50 UTC
uploaded complete nova / ironic logs to collab-shell.usersys.redhat.com

Comment 12 Lucas Alvares Gomes 2016-09-07 09:29:11 UTC
(In reply to Anil Dhingra from comment #11)
> uploaded complete nova / ironic logs to collab-shell.usersys.redhat.com

Thanks for that. I looked at the logs and it all looks very strange. E.g.:

$ grep -Ri deployed ironic-logs-060916.tar.gz/var/log/ironic/ironic-conductor.log | wc -l
0

$ grep -R provision ironic-logs-060916.tar.gz/var/log/ironic/ironic-api.log 
2016-09-06 13:17:18.023 16162 INFO ironic.cmd.api [-] conductor.check_provision_state_interval = 60

I can't find a single line showing a request to deploy an instance in Ironic; maybe it happened before the start of the logs (they begin at 2016-05-25 03:58:32.179).
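
For reference, a trivial way to confirm when the conductor log actually starts (same path as the grep above):

$ head -n 1 ironic-logs-060916.tar.gz/var/log/ironic/ironic-conductor.log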

Also, the nova-compute logs from 2016-07-24 to 2016-07-29 are filled with errors, more specifically:

2016-07-28 03:31:41.174 19392 ERROR nova.compute.manager [-] [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803] Periodic sync_power_state task had an error while processing an instance.
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803] Traceback (most recent call last):
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6100, in _sync
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     query_driver_power_state_and_sync()
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 445, in inner
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     return f(*args, **kwargs)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6097, in query_driver_power_state_and_sync
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     self._query_driver_power_state_and_sync(context, db_instance)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6127, in _query_driver_power_state_and_sync
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     vm_instance = self.driver.get_info(db_instance)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 557, in get_info
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     node = _validate_instance_and_node(self.ironicclient, instance)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 127, in _validate_instance_and_node
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     return ironicclient.call("node.get_by_instance_uuid", instance.uuid)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/nova/virt/ironic/client_wrapper.py", line 126, in call
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     return self._multi_getattr(client, method)(*args, **kwargs)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 147, in get_by_instance_uuid
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     nodes = self._list(self._path(path), 'nodes')
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 121, in _list
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     resp, body = self.api.json_request('GET', url)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 353, in json_request
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     resp, body_iter = self._http_request(url, method, **kwargs)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 162, in wrapper
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     return func(self, url, method, **kwargs)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 298, in _http_request
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     resp = conn.getresponse()
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib64/python2.7/httplib.py", line 1051, in getresponse
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     response.begin()
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib64/python2.7/httplib.py", line 415, in begin
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     version, status, reason = self._read_status()
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]   File "/usr/lib64/python2.7/httplib.py", line 379, in _read_status
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803]     raise BadStatusLine(line)
2016-07-28 03:31:41.174 19392 TRACE nova.compute.manager [instance: 23b34ae9-81a3-4405-859f-105ce9e6c803] BadStatusLine: ''
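
The BadStatusLine: '' at the bottom means httplib read an empty HTTP status line from ironic-api, i.e. the API closed the connection without sending a response. A quick way to check whether ironic-api is up and answering (a sketch; openstack-ironic-api and port 6385 are the defaults, adjust the address if the API is not bound locally):

$ systemctl status openstack-ironic-api
$ curl -sf http://127.0.0.1:6385/v1/ >/dev/null && echo "ironic-api answering" || echo "ironic-api not answering"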

$ grep -R "error while processing an instance" nova-compute.log-20160726 nova-compute.log-20160728 nova-compute.log-20160730  | wc -l
7000

...

So again, this environment does not seem reliable at all.

I can only guess at this point. I think that something external caused the failures, and I still suspect that something happened to the database (so please check the logs as asked in comment #8).
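
For example, something along these lines on the undercloud (the MariaDB log path and the "ironic" schema name are assumptions for a default RHEL 7 setup; adjust credentials as needed):

$ grep -iE 'error|corrupt|crash' /var/log/mariadb/mariadb.log
$ mysql -u root ironic -e "SELECT uuid, provision_state, power_state, maintenance, instance_uuid FROM nodes\G"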

But that said, I really don't think there's much we can do here; the environment is just not in a healthy state.

Comment 16 Dmitry Tantsur 2016-10-14 16:18:29 UTC
Sorry, we can't really do anything if the underlying database is not reliable.