Bug 975014 - nova: when attempted 'nova resize' on setup with two compute nodes the instance switched to ERROR state.
Status: CLOSED EOL
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhel-osp-installer
Version: 5.0 (RHEL 7)
Hardware: x86_64 Linux
Priority: medium Severity: high
Target Milestone: z6
Target Release: 5.0 (RHEL 7)
Assigned To: Mike Burns
QA Contact: Omri Hochman
Keywords: Reopened, Triaged, ZStream
Depends On:
Blocks: 1292532 1028186 1267598
Reported: 2013-06-17 08:15 EDT by Omri Hochman
Modified: 2016-09-29 09:35 EDT

See Also:
Fixed In Version:
Doc Type: Release Note
Doc Text:
For the Compute service's resize command to work when using the libvirt driver and resizing between nodes (the default resize method), the Compute user on each compute node must be able to perform passwordless SSH to the other compute nodes. To set this up, generate SSH keys for the Compute user on each compute node, and then add the generated public keys from the other compute nodes to the ~/.ssh/authorized_keys file for the Compute user on each compute node.
Story Points: ---
Clone Of:
Clones: 1028186 1267598
Environment:
Last Closed: 2016-09-29 09:35:45 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
compute.log (390.15 KB, text/x-log)
2013-06-17 08:20 EDT, Omri Hochman
Compute log from compute node (43.13 KB, text/plain)
2014-12-15 06:34 EST, Tzach Shefi

Description Omri Hochman 2013-06-17 08:15:23 EDT
nova: attempting 'nova resize' on a setup with two compute nodes switched the instance to ERROR state.

Environment:
-------------
[root@puma01 /(keystone_admin)]# rpm -qa | grep openstack
openstack-glance-2013.1.2-1.el6ost.noarch
openstack-cinder-2013.1.2-3.el6ost.noarch
openstack-nova-network-2013.1.2-2.el6ost.noarch
openstack-nova-novncproxy-0.4-4.el6ost.noarch
openstack-nova-console-2013.1.2-2.el6ost.noarch
openstack-dashboard-2013.1.2-1.el6ost.noarch
openstack-packstack-2013.1.1-0.17.dev631.el6ost.noarch
openstack-selinux-0.1.2-10.el6ost.noarch
openstack-nova-common-2013.1.2-2.el6ost.noarch
openstack-nova-compute-2013.1.2-2.el6ost.noarch
selinux-policy-targeted-3.7.19-195.el6_4.10.noarch
openstack-selinux-0.1.2-10.el6ost.noarch
libvirt-0.10.2-18.el6_4.5.x86_64
kernel-2.6.32-358.el6.x86_64

Note : 
-------
On an All-In-One setup, 'nova resize' worked fine (after setting allow_resize_to_same_host=true in nova.conf).
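
Whether that option is enabled on a given node can be checked from the configuration text. The following is a minimal sketch, assuming the option is set in the [DEFAULT] section of nova.conf; the helper name `resize_to_same_host_enabled` is hypothetical and not part of nova.

```python
import configparser


def resize_to_same_host_enabled(conf_text):
    """Return True if allow_resize_to_same_host is truthy in [DEFAULT].

    Assumption: the option lives in the [DEFAULT] section of nova.conf.
    A missing option is treated as False.
    """
    # interpolation=None: values may contain literal '%' characters.
    # strict=False: tolerate options repeated within a section.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getboolean(
        "DEFAULT", "allow_resize_to_same_host", fallback=False
    )
```

To use it, pass the contents of the node's nova.conf (commonly /etc/nova/nova.conf, though the path depends on the deployment).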

Description: 
------------
The setup contained two nodes: one controller (which is also a compute node) and one compute node. I use a shared NFS storage domain served by a local NFS server running on the controller machine, and I verified that live migration works properly.

But for some reason, when attempting to perform 'nova resize' on an instance, the instance switches to ERROR state. From compute.log it seems that the SSH 'mkdir -p' command failed:

Command: ssh 10.35.160.13 mkdir -p /export/instances/c55489eb-3673-4180-baf8-a7ce93b24272 


When I ran this 'mkdir -p' command manually, it worked (after entering the root password).


compute.log:
-------------
2013-06-17 15:10:26.324 26593 INFO nova.compute.manager [-] Lifecycle event 1 on VM e06c8443-7ce9-4e17-81f2-c54f54da726e
2013-06-17 15:10:26.338 26593 INFO nova.virt.libvirt.driver [-] [instance: e06c8443-7ce9-4e17-81f2-c54f54da726e] Instance destroyed successfully.
2013-06-17 15:10:26.581 26593 INFO nova.compute.manager [-] [instance: e06c8443-7ce9-4e17-81f2-c54f54da726e] During sync_power_state the instance has a pending task. Skip.
2013-06-17 15:10:26.587 ERROR nova.compute.manager [req-0b2759bc-bfa2-4e6e-bc84-6ccb87f714b9 604ba876befc49fc98e012f752ef28ca 9a8845240dbb4a818b2e166a72a4c15c] [instance: e06c8443-7ce9-4e17-81f2-c54f54da726e] Unexpected error while running command.
Command: ssh 10.35.160.13 mkdir -p /export/instances/e06c8443-7ce9-4e17-81f2-c54f54da726e
Exit code: 255
Stdout: ''
Stderr: 'Host key verification failed.\r\n'. Setting instance vm_state to ERROR
2013-06-17 15:10:27.038 ERROR nova.openstack.common.rpc.amqp [req-0b2759bc-bfa2-4e6e-bc84-6ccb87f714b9 604ba876befc49fc98e012f752ef28ca 9a8845240dbb4a818b2e166a72a4c15c] Exception during message handling
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp Traceback (most recent call last):
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     rval = self.proxy.dispatch(ctxt, version, method, **args)
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/dispatcher.py", line 133, in dispatch
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     return getattr(proxyobj, method)(ctxt, **kwargs)
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/exception.py", line 117, in wrapped
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     temp_level, payload)
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     self.gen.next()
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/exception.py", line 94, in wrapped
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     return f(self, context, *args, **kw)
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 209, in decorated_function
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     pass
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     self.gen.next()
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 195, in decorated_function
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     return function(self, context, *args, **kwargs)
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 260, in decorated_function
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     function(self, context, *args, **kwargs)
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 237, in decorated_function
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     e, sys.exc_info())
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     self.gen.next()
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 224, in decorated_function
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     return function(self, context, *args, **kwargs)
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 2373, in resize_instance
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     block_device_info)
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py", line 3480, in migrate_disk_and_power_off
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     inst_base_resize)
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     self.gen.next()
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py", line 3457, in migrate_disk_and_power_off
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     utils.execute('ssh', dest, 'mkdir', '-p', inst_base)
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/nova/utils.py", line 239, in execute
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp     cmd=' '.join(cmd))
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp ProcessExecutionError: Unexpected error while running command.
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp Command: ssh 10.35.160.13 mkdir -p /export/instances/e06c8443-7ce9-4e17-81f2-c54f54da726e
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp Exit code: 255
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp Stdout: ''
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp Stderr: 'Host key verification failed.\r\n'
2013-06-17 15:10:27.038 26593 TRACE nova.openstack.common.rpc.amqp
Comment 1 Omri Hochman 2013-06-17 08:20:18 EDT
Created attachment 762015
compute.log
Comment 3 Solly Ross 2013-09-23 19:09:33 EDT
Ok, found the issue (I should have caught it when entering the root password was mentioned). nova@compute is trying to do a passwordless ssh into nova@controller. However, nova@compute doesn't have a key registered with nova@controller, so the passwordless login fails. Because nova runs as a daemon without a tty present, ssh doesn't fall back to password-based login and simply fails. Authorizing nova via the standard procedure (run ssh-keygen on compute and add the key to authorized_keys on controller, both as the nova user, then vice versa) seems to solve the issue.
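
A manual ssh from a shell can prompt for input, which hides the failure the daemon sees. The following is a minimal sketch of a probe that mimics the non-interactive case, meant to be run as the nova user. The function names are hypothetical; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options. The stderr captured in the log above reads "Host key verification failed", which ssh reports when it does not trust the destination's host key, so the sketch distinguishes that case from a rejected user key.

```python
import subprocess


def build_probe_command(dest):
    """Build an ssh command that never prompts, like a daemon without a tty."""
    return [
        "ssh",
        "-o", "BatchMode=yes",     # no password prompts or host key questions
        "-o", "ConnectTimeout=5",  # do not hang on unreachable hosts
        dest,
        "true",                    # harmless remote command
    ]


def classify_failure(returncode, stderr):
    """Map an ssh exit code and stderr text to a short diagnosis string."""
    if returncode == 0:
        return "ok"
    if "Host key verification failed" in stderr:
        return "destination host key not trusted (check known_hosts)"
    if "Permission denied" in stderr:
        return "key not authorized on destination (check authorized_keys)"
    return "other failure"


def probe(dest):
    """Run the probe against dest and return the diagnosis string."""
    result = subprocess.run(
        build_probe_command(dest),
        capture_output=True,
        text=True,
    )
    return classify_failure(result.returncode, result.stderr)
```

For example, `probe("10.35.160.13")` run as the nova user on the source node would be expected to return "ok" only once the resize's own `ssh ... mkdir -p` call can succeed without interaction.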

The offending lines in the source are nova/virt/libvirt/driver.py:4349-4350

```
        if not shared_storage:
            utils.execute('ssh', dest, 'mkdir', '-p', inst_base)
```

(shared_storage is a variable determined by checking whether the source and dest IPs are the same, or whether the source can see a file touched at a given path by the dest)
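
The check described in the parenthetical can be sketched as follows. This is an illustration under stated assumptions, not nova's actual implementation: the name `is_storage_shared` and the two callbacks `touch_on_dest` and `remove_on_dest` are hypothetical stand-ins for whatever mechanism creates and removes a file on the destination.

```python
import os
import uuid


def is_storage_shared(source_ip, dest_ip, inst_base,
                      touch_on_dest, remove_on_dest):
    """Return True if source and dest appear to share inst_base.

    touch_on_dest(path) and remove_on_dest(path) are hypothetical callbacks
    that create and delete a file on the destination host.
    """
    # Same host: storage is trivially shared.
    if source_ip == dest_ip:
        return True

    # Ask the destination to create a uniquely named marker file ...
    marker = os.path.join(inst_base, "shared-check-" + uuid.uuid4().hex)
    touch_on_dest(marker)
    try:
        # ... and see whether it is visible from the source's filesystem.
        return os.path.exists(marker)
    finally:
        # Always clean up the marker on the destination.
        remove_on_dest(marker)
```

If the callbacks themselves run commands over ssh, the check depends on the same passwordless access that the `mkdir -p` call needs.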

So, here are what I perceive to be the questions:

1. Should packstack be setting the ssh stuff up for us?
2. Should nova really be directly sshing into other machines (instead of using an RPC)?
3. Is this NOTABUG?

Also, Omri, does this solve your issue?
Comment 5 Solly Ross 2013-10-17 14:38:58 EDT
What is the process for
Comment 6 Solly Ross 2013-10-17 16:23:36 EDT
Ignore comment text
Comment 7 Solly Ross 2013-11-07 15:52:45 EST
Cloned for RHOS 5.0 for work upstream, just use DocText for RHOS 4.0
Comment 8 Tzach Shefi 2014-12-15 06:32:43 EST
Ran into this bug again. Why was it closed as WONTFIX?

Versions:
RHEL7
RHOS5 -> openstack-nova-compute-2014.1.3-9.el7ost.noarch
Staypuft HA deployment, two compute nodes. 

While running an instance resize (resize instanceID 2 --poll), the resize would fail, leaving the instance in ERROR state.

Looking at the logs (attached), the error is identical to the one reported above.

"ProcessExecutionError: Unexpected error while running command.\nCommand: ssh 10.35.188.253 mkdir -p /nova/instances/7e7de570-0386-4ac0-8884-6755da057b15\nExit code: 255\nStdout: u''\nStderr: u'Host key verification failed.\\r\\n'\n"]

In my case the compute hosts' /var/lib/nova resides on shared NFS storage, so there is no point in "moving" nova files and thus no need for ssh access. I'll just enable nova's allow_resize_to_same_host=true option.

Still, this ssh problem should be fixed so that resize works out of the box even without shared NFS storage and/or without enabling allow_resize_to_same_host=true.
Comment 9 Tzach Shefi 2014-12-15 06:33:35 EST
BTW not sure but it's probably not fixed for Juno, I'll check and update.
Comment 10 Tzach Shefi 2014-12-15 06:34:12 EST
Created attachment 968905
Compute log from compute node
Comment 11 Nikola Dipanov 2015-01-28 11:13:30 EST
Moving this to openstack-puppet-modules.

Basically, if we want to support resizing out of the box, passwordless ssh is needed between compute nodes so our deployment tools should be setting it up.
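
What a deployment tool would need to distribute can be sketched as a small mapping step: each compute node authorizes the nova-user public key of every other compute node. This is a hypothetical illustration, not code from packstack, staypuft, or the puppet modules; the function name and the key strings are placeholders, and distributing host keys for known_hosts is not covered.

```python
def authorized_keys_by_node(public_keys):
    """Map each node to the authorized_keys content it should receive.

    public_keys: dict of node name -> that node's nova-user public key line.
    Each node authorizes the keys of every *other* node, in name order.
    """
    result = {}
    for node in public_keys:
        others = [
            key for name, key in sorted(public_keys.items()) if name != node
        ]
        # One key per line, each terminated by a newline.
        result[node] = "".join(key.strip() + "\n" for key in others)
    return result
```

With two nodes, each node's output contains exactly the other node's key; with N nodes, each output contains N-1 keys.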
Comment 12 Ivan Chavero 2015-02-20 19:44:52 EST
This is not something that the modules do; it is something that packstack or staypuft should do. In which component do you need this feature?
Comment 17 David Paterson 2015-09-30 10:43:57 EDT
I have an additional question on resize:  If ephemeral storage is configured to use ceph in nova (see settings below) why is resize attempting to ssh to another compute node at all?  

[libvirt]
images_type = rbd
images_rbd_pool = [pool name defined in ceph.conf]
images_rbd_ceph_conf = /etc/ceph/ceph.conf
...

See: http://docs.ceph.com/docs/master/rbd/rbd-openstack/

Will this feature be fully tested for OSP7?
Comment 18 Jaromir Coufal 2016-09-29 09:35:45 EDT
Closing list of bugs for RHEL OSP Installer since its support cycle has already ended [0]. If there is some bug closed by mistake, feel free to re-open.

For new deployments, please, use RHOSP director (starting with version 7).

-- Jaromir Coufal
-- Sr. Product Manager
-- Red Hat OpenStack Platform

[0] https://access.redhat.com/support/policy/updates/openstack/platform
